
LSMR: AN ITERATIVE ALGORITHM FOR SPARSE LEAST-SQUARES PROBLEMS

DAVID CHIN-LUNG FONG AND MICHAEL SAUNDERS

Abstract. An iterative method LSMR is presented for solving linear systems $Ax = b$ and least-squares problems $\min \|Ax - b\|_2$, with $A$ being sparse or a fast linear operator. LSMR is based on the Golub-Kahan bidiagonalization process. It is analytically equivalent to the MINRES method applied to the normal equation $A^TAx = A^Tb$, so that the quantities $\|A^Tr_k\|$ are monotonically decreasing (where $r_k = b - Ax_k$ is the residual for the current iterate $x_k$). In practice we observe that $\|r_k\|$ also decreases monotonically. Compared to LSQR, for which only $\|r_k\|$ is monotonic, it is safer to terminate LSMR early. Improvements for the new iterative method in the presence of extra available memory are also explored.

Key words. least-squares problem, sparse matrix, LSQR, MINRES, Krylov subspace method, Golub-Kahan process, conjugate-gradient method, minimum-residual method, iterative method

AMS subject classifications. 15A06, 65F10, 65F20, 65F22, 65F25, 65F35, 65F50, 93E24

DOI. xxx/xxxxxxxxx

1. Introduction. We present a numerical method called LSMR for computing a solution $x$ to the following problems:

Unsymmetric equations:      solve $Ax = b$
Linear least squares (LS):  minimize $\|Ax - b\|_2$
Regularized least squares:  minimize $\left\| \begin{pmatrix} A \\ \lambda I \end{pmatrix} x - \begin{pmatrix} b \\ 0 \end{pmatrix} \right\|_2$

where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $\lambda \ge 0$. The matrix $A$ is used as an operator for which products of the form $Av$ and $A^Tu$ can be computed for various $v$ and $u$. Thus $A$ is normally large and sparse and need not be explicitly stored.

LSMR is similar in style to the well-known method LSQR [15, 16] in being based on the Golub-Kahan bidiagonalization of $A$ [5]. LSQR is equivalent to the conjugate-gradient (CG) method applied to the normal equation $(A^TA + \lambda^2 I)x = A^Tb$. It has the property of reducing $\|r_k\|$ monotonically, where $r_k = b - Ax_k$ is the residual for the approximate solution $x_k$. (For simplicity, we are letting $\lambda = 0$.) In contrast, LSMR is equivalent to MINRES [14] applied to the normal equation, so that the quantities $\|A^Tr_k\|$ are monotonically decreasing. In practice we observe that $\|r_k\|$ also decreases monotonically, and is never very far behind the corresponding value for LSQR. Hence, although LSQR and LSMR ultimately converge to similar points, it is safer to use LSMR in situations where the solver must be terminated early.

Stopping conditions are typically based on backward error: the norm of some perturbation to $A$ for which the current iterate $x$ solves the perturbed problem exactly. Experiments on many sparse LS test problems show that for LSMR, a certain cheaply computable backward error for each $x$ is close to the optimal (smallest possible) backward error. This is an unexpected but highly desirable advantage.

Version of August 16, 2011. Technical Report SOL 2010-2, revised March 14, 2011, for the Copper Mountain Special Issue 2010. iCME, Stanford University (clfong@stanford.edu). Partially supported by a Stanford Graduate Fellowship. Systems Optimization Laboratory, Department of Management Science and Engineering, Stanford University, CA (saunders@stanford.edu). Partially supported by an Office of Naval Research grant and by the U.S. Army Research Laboratory, through the Army High Performance Computing Research Center (Cooperative Agreement W911NF).

1.1. Overview. Section 2 introduces the Golub-Kahan process and derives the basic LSMR algorithm with $\lambda = 0$. Section 3 derives various norms and stopping criteria. Section 4 discusses singular systems and complexity. Section 5 derives the algorithm with $\lambda > 0$. Section 6 describes backward error estimates. Section 7 gives numerical results on a range of overdetermined and square systems. Section 8 summarizes our findings, and Appendix A proves one of the main lemmas.

1.2. Notation. Matrices are denoted by $A, B, \dots$, vectors by $v, w, \dots$, and scalars by $\alpha, \beta, \dots$. Two exceptions are $c$ and $s$, which denote the significant components of a plane rotation matrix, with $c^2 + s^2 = 1$. For a vector $v$, $\|v\|$ always denotes the 2-norm of $v$. For a matrix $A$, $\|A\|$ usually denotes the Frobenius norm, and the condition number of a matrix $A$ is defined by $\mathrm{cond}(A) = \|A\|\,\|A^+\|$, where $A^+$ denotes the pseudoinverse of $A$. Vectors $e_1$ and $e_k$ denote columns of an identity matrix. Accented items like $\bar\beta_k$ and $\hat\beta_k$ denote intermediate quantities that are about to change to something similar (like $\tilde\beta_k$).

2. Derivation of LSMR. We begin with the Golub-Kahan process [5], an iterative procedure for transforming $\begin{pmatrix} b & A \end{pmatrix}$ to upper-bidiagonal form $\begin{pmatrix} \beta_1 e_1 & B_k \end{pmatrix}$.

2.1. The Golub-Kahan process.
1. Set $\beta_1 u_1 = b$ (shorthand for $\beta_1 = \|b\|$, $u_1 = b/\beta_1$) and $\alpha_1 v_1 = A^T u_1$.
2. For $k = 1, 2, \dots$, set
$$\beta_{k+1} u_{k+1} = A v_k - \alpha_k u_k \quad\text{and}\quad \alpha_{k+1} v_{k+1} = A^T u_{k+1} - \beta_{k+1} v_k. \tag{2.1}$$
After $k$ steps, we have
$$A V_k = U_{k+1} B_k \quad\text{and}\quad A^T U_{k+1} = V_{k+1} L_{k+1}^T,$$
where we define $V_k = (v_1 \ v_2 \ \cdots \ v_k)$, $U_k = (u_1 \ u_2 \ \cdots \ u_k)$, and
$$B_k = \begin{pmatrix} \alpha_1 & & & \\ \beta_2 & \alpha_2 & & \\ & \ddots & \ddots & \\ & & \beta_k & \alpha_k \\ & & & \beta_{k+1} \end{pmatrix}, \qquad L_{k+1} = \begin{pmatrix} B_k & \alpha_{k+1} e_{k+1} \end{pmatrix}.$$
Now consider
$$A^T A V_k = A^T U_{k+1} B_k = V_{k+1} L_{k+1}^T B_k = V_{k+1} \begin{pmatrix} B_k^T B_k \\ \alpha_{k+1}\beta_{k+1} e_k^T \end{pmatrix}.$$
This is equivalent to what would be generated by the symmetric Lanczos process with matrix $A^TA$ and starting vector $A^Tb$. (For this reason we define $\bar\beta_k \equiv \alpha_k\beta_k$ below.)
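The process above is simple to state in code. The following is a minimal MATLAB sketch of the recurrence (2.1), written for clarity rather than efficiency; the function name and the explicit storage of $U$ and $V$ are our own choices for illustration (the solvers discussed in this paper keep only the latest $u$ and $v$). Exact breakdown ($\alpha$ or $\beta$ becoming zero) is not guarded against here; section 3.5 shows that it means the problem has been solved.

    % Minimal sketch of the Golub-Kahan recurrence (2.1).  A may be a sparse
    % matrix (or any object supporting A*v and A'*u).  U and V are stored only
    % so the relation A*V_k = U_{k+1}*B_k can be checked afterwards.
    % The name gk_bidiag is illustrative, not from the released LSMR codes.
    function [U, V, alpha, beta] = gk_bidiag(A, b, kmax)
      [m, n] = size(A);
      U = zeros(m, kmax+1);  V = zeros(n, kmax);
      alpha = zeros(kmax, 1);  beta = zeros(kmax+1, 1);
      beta(1) = norm(b);       U(:,1) = b / beta(1);        % beta_1 u_1 = b
      v = A' * U(:,1);  alpha(1) = norm(v);  V(:,1) = v / alpha(1);
      for k = 1:kmax
        u = A * V(:,k) - alpha(k) * U(:,k);                 % beta_{k+1} u_{k+1}
        beta(k+1) = norm(u);  U(:,k+1) = u / beta(k+1);
        if k < kmax
          v = A' * U(:,k+1) - beta(k+1) * V(:,k);           % alpha_{k+1} v_{k+1}
          alpha(k+1) = norm(v);  V(:,k+1) = v / alpha(k+1);
        end
      end
    end

In exact arithmetic, after $k$ steps A*V(:,1:k) agrees with U(:,1:k+1)*$B_k$, where $B_k$ is the lower-bidiagonal matrix assembled from the returned alpha and beta.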

2.2. Using Golub-Kahan to solve the normal equation. Krylov subspace methods for solving linear equations form solution estimates $x_k = V_k y_k$ for some $y_k$, where the columns of $V_k$ are an expanding set of theoretically independent vectors. (In this case, $V_k$ and also $U_k$ are theoretically orthonormal.)

For the equation $A^TAx = A^Tb$, any solution $x$ has the property of minimizing $\|r\|$, where $r = b - Ax$ is the corresponding residual vector. Thus, in the development of LSQR it was natural to choose $y_k$ to minimize $\|r_k\|$ at each stage. Since
$$r_k = b - AV_ky_k = \beta_1 u_1 - U_{k+1}B_ky_k = U_{k+1}(\beta_1 e_1 - B_ky_k),$$
where $U_{k+1}$ is theoretically orthonormal, the subproblem $\min_{y_k}\|\beta_1 e_1 - B_ky_k\|$ easily arose.

In contrast, for LSMR we wish to minimize $\|A^Tr_k\|$. Let $\bar\beta_k \equiv \alpha_k\beta_k$ for all $k$. Since $A^Tr_k = A^Tb - A^TAx_k = \bar\beta_1 v_1 - A^TAV_ky_k$, we have
$$A^Tr_k = \bar\beta_1 v_1 - V_{k+1}\begin{pmatrix} B_k^TB_k \\ \bar\beta_{k+1}e_k^T \end{pmatrix}y_k = V_{k+1}\left(\bar\beta_1 e_1 - \begin{pmatrix} B_k^TB_k \\ \bar\beta_{k+1}e_k^T \end{pmatrix}y_k\right),$$
and we are led to the subproblem
$$\min_{y_k}\|A^Tr_k\| = \min_{y_k}\left\|\bar\beta_1 e_1 - \begin{pmatrix} B_k^TB_k \\ \bar\beta_{k+1}e_k^T \end{pmatrix}y_k\right\|. \tag{2.2}$$
Efficient solution of this LS subproblem is the heart of algorithm LSMR.

2.3. Two QR factorizations. As in LSQR, we form the QR factorization
$$Q_{k+1}B_k = \begin{pmatrix} R_k \\ 0 \end{pmatrix}, \qquad R_k = \begin{pmatrix} \rho_1 & \theta_2 & & \\ & \rho_2 & \ddots & \\ & & \ddots & \theta_k \\ & & & \rho_k \end{pmatrix}. \tag{2.3}$$
If we define $t_k = R_ky_k$ and solve $R_k^Tq_k = \bar\beta_{k+1}e_k$, we have $q_k = (\bar\beta_{k+1}/\rho_k)e_k = \bar\varphi_k e_k$ with $\rho_k = (R_k)_{kk}$ and $\bar\varphi_k \equiv \bar\beta_{k+1}/\rho_k$. Then we perform a second QR factorization
$$\bar Q_{k+1}\begin{pmatrix} R_k^T & \bar\beta_1 e_1 \\ \bar\varphi_k e_k^T & 0 \end{pmatrix} = \begin{pmatrix} \bar R_k & z_k \\ 0 & \bar\zeta_{k+1} \end{pmatrix}, \qquad \bar R_k = \begin{pmatrix} \bar\rho_1 & \bar\theta_2 & & \\ & \bar\rho_2 & \ddots & \\ & & \ddots & \bar\theta_k \\ & & & \bar\rho_k \end{pmatrix}. \tag{2.4}$$
Combining what we have with (2.2) gives
$$\min_{y_k}\|A^Tr_k\| = \min_{y_k}\left\|\bar\beta_1 e_1 - \begin{pmatrix} R_k^T \\ \bar\varphi_k e_k^T \end{pmatrix}R_ky_k\right\| = \min_{t_k}\left\|\bar\beta_1 e_1 - \begin{pmatrix} R_k^T \\ \bar\varphi_k e_k^T \end{pmatrix}t_k\right\| = \min_{t_k}\left\|\begin{pmatrix} z_k \\ \bar\zeta_{k+1} \end{pmatrix} - \begin{pmatrix} \bar R_k \\ 0 \end{pmatrix}t_k\right\|. \tag{2.5}$$
The subproblem is solved by choosing $t_k$ from $\bar R_kt_k = z_k$.

2.4. Recurrence for $x_k$. Let $W_k$ and $\bar W_k$ be computed by forward substitution from $R_k^TW_k^T = V_k^T$ and $\bar R_k^T\bar W_k^T = W_k^T$. Then from $x_k = V_ky_k$, $R_ky_k = t_k$, and $\bar R_kt_k = z_k$, we have
$$x_k = W_kR_ky_k = W_kt_k = \bar W_k\bar R_kt_k = \bar W_kz_k = x_{k-1} + \zeta_k\bar w_k.$$

2.5. Recurrence for $W_k$ and $\bar W_k$. If we write
$$V_k = (v_1 \ v_2 \ \cdots \ v_k), \quad W_k = (w_1 \ w_2 \ \cdots \ w_k), \quad \bar W_k = (\bar w_1 \ \bar w_2 \ \cdots \ \bar w_k), \quad z_k = (\zeta_1 \ \zeta_2 \ \cdots \ \zeta_k)^T,$$
an important fact is that when $k$ increases to $k+1$, all quantities remain the same except for one additional term.
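Both QR factorizations in (2.3)-(2.4) are assembled from 2-by-2 plane rotations, one new rotation per iteration, as detailed in the next subsections. A small MATLAB helper of the following kind constructs such a rotation from two scalars; the name symortho is ours and is not taken from the paper's released software.

    % Given scalars a and b, return c, s, r with c = a/r, s = b/r, r = norm([a b]),
    % so that the plane rotation [c s; -s c] maps (a, b) to (r, 0).
    function [c, s, r] = symortho(a, b)
      r = norm([a, b]);
      if r == 0
        c = 1;  s = 0;          % degenerate case: use the identity rotation
      else
        c = a / r;  s = b / r;
      end
    end

In the notation of (2.7) below, for example, the first rotation of iteration k is [c_k, s_k, rho_k] = symortho(alphabar_k, beta_{k+1}).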

The first QR factorization proceeds as follows. At iteration $k$ we construct a plane rotation operating on rows $l$ and $l+1$:
$$P_l = \begin{pmatrix} I_{l-1} & & & \\ & c_l & s_l & \\ & -s_l & c_l & \\ & & & I_{k-l} \end{pmatrix}.$$
Now if $Q_{k+1} = P_k \cdots P_2P_1$, we have
$$Q_{k+1}B_{k+1} = Q_{k+1}\begin{pmatrix} B_k & \alpha_{k+1}e_{k+1} \\ & \beta_{k+2} \end{pmatrix} = \begin{pmatrix} R_k & \theta_{k+1}e_k \\ & \bar\alpha_{k+1} \\ & \beta_{k+2} \end{pmatrix}, \qquad Q_{k+2}B_{k+1} = P_{k+1}\begin{pmatrix} R_k & \theta_{k+1}e_k \\ & \bar\alpha_{k+1} \\ & \beta_{k+2} \end{pmatrix} = \begin{pmatrix} R_k & \theta_{k+1}e_k \\ & \rho_{k+1} \\ & 0 \end{pmatrix},$$
and we see that $\theta_{k+1} = s_k\alpha_{k+1} = (\beta_{k+1}/\rho_k)\alpha_{k+1} = \bar\beta_{k+1}/\rho_k = \bar\varphi_k$. Therefore we can write $\theta_{k+1}$ instead of $\bar\varphi_k$.

For the second QR factorization, if $\bar Q_{k+1} = \bar P_k \cdots \bar P_2\bar P_1$ we know that
$$\bar Q_{k+1}\begin{pmatrix} R_k^T \\ \theta_{k+1}e_k^T \end{pmatrix} = \begin{pmatrix} \bar R_k \\ 0 \end{pmatrix},$$
and so
$$\bar Q_{k+2}\begin{pmatrix} R_{k+1}^T \\ \theta_{k+2}e_{k+1}^T \end{pmatrix} = \bar P_{k+1}\begin{pmatrix} \bar R_k & \bar\theta_{k+1}e_k \\ & \bar c_k\rho_{k+1} \\ & \theta_{k+2} \end{pmatrix} = \begin{pmatrix} \bar R_k & \bar\theta_{k+1}e_k \\ & \bar\rho_{k+1} \\ & 0 \end{pmatrix}. \tag{2.6}$$

By considering the last row of the matrix equation $R_{k+1}^TW_{k+1}^T = V_{k+1}^T$ and the last row of $\bar R_{k+1}^T\bar W_{k+1}^T = W_{k+1}^T$ we obtain equations that define $w_{k+1}$ and $\bar w_{k+1}$:
$$\theta_{k+1}w_k + \rho_{k+1}w_{k+1} = v_{k+1}, \qquad \bar\theta_{k+1}\bar w_k + \bar\rho_{k+1}\bar w_{k+1} = w_{k+1}.$$

2.6. The two rotations. To summarize, the rotations $P_k$ and $\bar P_k$ have the following effects on our computation:
$$\begin{pmatrix} c_k & s_k \\ -s_k & c_k \end{pmatrix}\begin{pmatrix} \bar\alpha_k & \\ \beta_{k+1} & \alpha_{k+1} \end{pmatrix} = \begin{pmatrix} \rho_k & \theta_{k+1} \\ & \bar\alpha_{k+1} \end{pmatrix},$$
$$\begin{pmatrix} \bar c_k & \bar s_k \\ -\bar s_k & \bar c_k \end{pmatrix}\begin{pmatrix} \bar c_{k-1}\rho_k & & \bar\zeta_k \\ \theta_{k+1} & \rho_{k+1} & \end{pmatrix} = \begin{pmatrix} \bar\rho_k & \bar\theta_{k+1} & \zeta_k \\ & \bar c_k\rho_{k+1} & \bar\zeta_{k+1} \end{pmatrix}.$$

2.7. Speeding up forward substitution. The forward substitutions for computing $w_k$ and $\bar w_k$ can be made more efficient if we define $h_k = \rho_kw_k$ and $\bar h_k = \rho_k\bar\rho_k\bar w_k$. We then obtain the updates described in part 6 of the pseudo-code below.

2.8. Algorithm LSMR. The following summarizes the main steps of algorithm LSMR for solving $Ax \approx b$, excluding the norms and stopping rules developed later.
1. (Initialize)
$$\beta_1u_1 = b \quad \alpha_1v_1 = A^Tu_1 \quad \bar\alpha_1 = \alpha_1 \quad \bar\zeta_1 = \alpha_1\beta_1 \quad \rho_0 = 1 \quad \bar\rho_0 = 1 \quad \bar c_0 = 1 \quad \bar s_0 = 0 \quad h_1 = v_1 \quad \bar h_0 = 0 \quad x_0 = 0$$
2. For $k = 1, 2, 3, \dots$, repeat steps 3-6.
3. (Continue the bidiagonalization)
$$\beta_{k+1}u_{k+1} = Av_k - \alpha_ku_k, \qquad \alpha_{k+1}v_{k+1} = A^Tu_{k+1} - \beta_{k+1}v_k$$
4. (Construct and apply rotation $P_k$)
$$\rho_k = (\bar\alpha_k^2 + \beta_{k+1}^2)^{1/2} \qquad c_k = \bar\alpha_k/\rho_k \qquad s_k = \beta_{k+1}/\rho_k \tag{2.7}$$
$$\theta_{k+1} = s_k\alpha_{k+1} \qquad \bar\alpha_{k+1} = c_k\alpha_{k+1} \tag{2.8}$$
5. (Construct and apply rotation $\bar P_k$)
$$\bar\theta_k = \bar s_{k-1}\rho_k \qquad \bar\rho_k = \big((\bar c_{k-1}\rho_k)^2 + \theta_{k+1}^2\big)^{1/2} \qquad \bar c_k = \bar c_{k-1}\rho_k/\bar\rho_k \qquad \bar s_k = \theta_{k+1}/\bar\rho_k \tag{2.9}$$
$$\zeta_k = \bar c_k\bar\zeta_k \qquad \bar\zeta_{k+1} = -\bar s_k\bar\zeta_k \tag{2.10}$$
6. (Update $h$, $\bar h$, $x$)
$$\bar h_k = h_k - (\bar\theta_k\rho_k/(\rho_{k-1}\bar\rho_{k-1}))\bar h_{k-1}$$
$$x_k = x_{k-1} + (\zeta_k/(\rho_k\bar\rho_k))\bar h_k$$
$$h_{k+1} = v_{k+1} - (\theta_{k+1}/\rho_k)h_k$$
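The steps above translate directly into code. The following MATLAB sketch follows steps 1-6 for the case $\lambda = 0$, omitting the norm estimates and stopping rules of section 3 (the released implementations [10] include them); the variable names are ours. Exact breakdown (a zero $\alpha$ or $\beta$) is not guarded against, since by section 3.5 it means the problem has been solved.

    % Sketch of basic LSMR (lambda = 0), following steps 1-6 of section 2.8.
    function x = lsmr_basic(A, b, kmax)
      beta = norm(b);           u = b / beta;                 % beta_1 u_1 = b
      v = A' * u;  alpha = norm(v);  v = v / alpha;           % alpha_1 v_1 = A'u_1
      alphabar = alpha;  zetabar = alpha * beta;
      rho = 1;  rhobar = 1;  cbar = 1;  sbar = 0;
      h = v;  hbar = zeros(size(v));  x = zeros(size(v));
      for k = 1:kmax
        % Step 3: continue the bidiagonalization.
        u = A * v - alpha * u;     beta  = norm(u);  u = u / beta;
        v = A' * u - beta * v;     alpha = norm(v);  v = v / alpha;
        % Step 4: rotation P_k, (2.7)-(2.8).
        rhoold = rho;
        rho    = norm([alphabar, beta]);
        c = alphabar / rho;   s = beta / rho;
        theta = s * alpha;    alphabar = c * alpha;
        % Step 5: rotation Pbar_k, (2.9)-(2.10).
        rhobarold = rhobar;
        thetabar  = sbar * rho;
        rhobar    = norm([cbar * rho, theta]);
        cbar = cbar * rho / rhobar;   sbar = theta / rhobar;
        zeta = cbar * zetabar;        zetabar = -sbar * zetabar;
        % Step 6: update hbar, x, h.
        hbar = h - (thetabar * rho / (rhoold * rhobarold)) * hbar;
        x    = x + (zeta / (rho * rhobar)) * hbar;
        h    = v - (theta / rho) * h;
      end
    end

Each iteration costs one product with $A$, one with $A^T$, and a handful of vector operations, in line with Table 4.1 below; the final |zetabar| estimates $\|A^Tr_k\|$ (section 3.2).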

3. Norms and stopping rules. Here we derive $\|r_k\|$, $\|A^Tr_k\|$, $\|x_k\|$ and estimates of $\|A\|$ and $\mathrm{cond}(A)$ for use within stopping rules. All quantities require $O(1)$ computation at each iteration.

3.1. Computing $\|r_k\|$. We transform $\bar R_k^T$ to upper-bidiagonal form using a third QR factorization: $\tilde R_k = \tilde Q_k\bar R_k^T$ with $\tilde Q_k = \tilde P_{k-1}\cdots\tilde P_1$. This amounts to one additional rotation per iteration. Now let
$$\tilde t_k = \tilde Q_kt_k, \qquad \tilde b_k = \begin{pmatrix} \tilde Q_k & \\ & 1 \end{pmatrix}Q_{k+1}e_1\beta_1. \tag{3.1}$$
Then $r_k = b - Ax_k = \beta_1u_1 - AV_ky_k = U_{k+1}(e_1\beta_1 - B_ky_k)$ gives
$$r_k = U_{k+1}\left(e_1\beta_1 - Q_{k+1}^T\begin{pmatrix} R_k \\ 0 \end{pmatrix}y_k\right) = U_{k+1}\left(e_1\beta_1 - Q_{k+1}^T\begin{pmatrix} t_k \\ 0 \end{pmatrix}\right) = U_{k+1}Q_{k+1}^T\begin{pmatrix} \tilde Q_k^T & \\ & 1 \end{pmatrix}\left(\tilde b_k - \begin{pmatrix} \tilde t_k \\ 0 \end{pmatrix}\right).$$
Therefore, assuming orthogonality of $U_{k+1}$, we have
$$\|r_k\| = \left\|\tilde b_k - \begin{pmatrix} \tilde t_k \\ 0 \end{pmatrix}\right\|. \tag{3.2}$$
The vectors $\tilde b_k$ and $\tilde t_k$ can be written in the form
$$\tilde b_k = \begin{pmatrix} \tilde\beta_1 & \cdots & \tilde\beta_{k-1} & \dot\beta_k & \ddot\beta_{k+1} \end{pmatrix}^T, \qquad \tilde t_k = \begin{pmatrix} \tilde\tau_1 & \cdots & \tilde\tau_{k-1} & \dot\tau_k \end{pmatrix}^T. \tag{3.3}$$
The vector $\tilde t_k$ can be computed by forward substitution from $\tilde R_k^T\tilde t_k = z_k$.

Lemma 3.1. In (3.2)-(3.3), $\tilde\beta_i = \tilde\tau_i$ for $i = 1, \dots, k-1$.
Proof. Appendix A proves the lemma by induction.
Using this lemma we can estimate $\|r_k\|$ from just the last two elements of $\tilde b_k$ and the last element of $\tilde t_k$, as shown in (3.6) below.

Pseudo-code for computing $\|r_k\|$. The following summarizes how $\|r_k\|$ may be obtained from quantities arising from the first and third QR factorizations.
1. (Initialize)
$$\ddot\beta_1 = \beta_1 \qquad \dot\beta_0 = 0 \qquad \dot\rho_0 = 1 \qquad \dot\tau_0 = 0 \qquad \tilde\theta_0 = 0 \qquad \zeta_0 = 0$$
2. For the $k$th iteration, repeat steps 3-6.
3. (Apply rotation $P_k$)
$$\hat\beta_k = c_k\ddot\beta_k \qquad \ddot\beta_{k+1} = -s_k\ddot\beta_k \tag{3.4}$$
4. (If $k \ge 2$, construct and apply rotation $\tilde P_{k-1}$)
$$\tilde\rho_{k-1} = \big(\dot\rho_{k-1}^2 + \bar\theta_k^2\big)^{1/2} \qquad \tilde c_{k-1} = \dot\rho_{k-1}/\tilde\rho_{k-1} \qquad \tilde s_{k-1} = \bar\theta_k/\tilde\rho_{k-1} \tag{3.5}$$
$$\tilde\theta_k = \tilde s_{k-1}\bar\rho_k \qquad \dot\rho_k = \tilde c_{k-1}\bar\rho_k \qquad \tilde\beta_{k-1} = \tilde c_{k-1}\dot\beta_{k-1} + \tilde s_{k-1}\hat\beta_k \qquad \dot\beta_k = -\tilde s_{k-1}\dot\beta_{k-1} + \tilde c_{k-1}\hat\beta_k$$
5. (Update $\tilde t_k$ by forward substitution)
$$\tilde\tau_{k-1} = (\zeta_{k-1} - \tilde\theta_{k-1}\tilde\tau_{k-2})/\tilde\rho_{k-1} \qquad \dot\tau_k = (\zeta_k - \tilde\theta_k\tilde\tau_{k-1})/\dot\rho_k$$
6. (Form $\|r_k\|$)
$$\gamma = (\dot\beta_k - \dot\tau_k)^2 + \ddot\beta_{k+1}^2, \qquad \|r_k\| = \sqrt\gamma. \tag{3.6}$$

3.2. Computing $\|A^Tr_k\|$. From (2.5) we have $\|A^Tr_k\| = |\bar\zeta_{k+1}|$, which by (2.10) is monotonically decreasing.

3.3. Computing $\|x_k\|$. From section 2.4 we have $x_k = V_kR_k^{-1}\bar R_k^{-1}z_k$. From the third QR factorization $\tilde R_k = \tilde Q_k\bar R_k^T$ in section 3.1 and a fourth QR factorization $\hat Q_k(\tilde Q_kR_k)^T = \hat R_k$ we can write
$$x_k = V_kR_k^{-1}\bar R_k^{-1}z_k = V_kR_k^{-1}\tilde Q_k^T\tilde R_k^{-T}z_k = V_k(\tilde Q_kR_k)^{-1}\tilde z_k = V_k\hat Q_k^T\hat z_k,$$
where $\tilde z_k$ and $\hat z_k$ are defined by forward substitutions $\tilde R_k^T\tilde z_k = z_k$ and $\hat R_k^T\hat z_k = \tilde z_k$. Assuming orthogonality of $V_k$ we arrive at the estimate $\|x_k\| = \|\hat z_k\|$. Since only the last diagonal of $\tilde R_k$ and the bottom $2\times 2$ part of $\hat R_k$ change each iteration, this estimate of $\|x_k\|$ can again be updated cheaply. The pseudo-code, omitted here, can be derived as in section 3.1. Experimentally we have observed that for every iteration, $\|x_{k+1}\| > \|x_k\|$ is either true or very nearly true.

3.4. Estimates of $\|A\|$ and $\mathrm{cond}(A)$. It is known that the singular values of $B_k$ are interlaced by those of $A$ and are bounded above and below by the largest and smallest nonzero singular values of $A$ [15]. Therefore we can estimate $\|A\|$ and $\mathrm{cond}(A)$ by $\|B_k\|$ and $\mathrm{cond}(B_k)$ respectively. Considering the Frobenius norm of $B_k$, we have the recurrence relation
$$\|B_k\|_F^2 = \|B_{k-1}\|_F^2 + \alpha_k^2 + \beta_{k+1}^2.$$
From (2.3), (2.4), and (2.6), we can show that the following QLP factorization [22] holds:
$$Q_{k+1}B_k\bar Q_k^T = \begin{pmatrix} \bar\rho_1 & & & \\ \bar\theta_2 & \bar\rho_2 & & \\ & \ddots & \ddots & \\ & & \bar\theta_k & \bar c_k\rho_k \\ & & & 0 \end{pmatrix}$$
(the same as $\bar R_k^T$ except for the last diagonal). Since the singular values of $B_k$ are approximated by the diagonal elements of that lower-bidiagonal matrix [22], and since the diagonals are all positive, we can estimate $\mathrm{cond}(A)$ by the ratio of the largest and smallest values in $\{\bar\rho_1, \dots, \bar\rho_{k-1}, \bar c_k\rho_k\}$. Those values can be updated cheaply.
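Inside the loop of the sketch given after section 2.8, these estimates amount to a few extra scalar updates per iteration. The fragment below (variable names ours) is one way to slot them in: initialize normA2 = alpha^2, maxrbar = 0, minrbar = Inf before the loop, and save alphak = alpha just before step 3 overwrites it, so that the accumulation uses $\alpha_k$ and $\beta_{k+1}$ as in the recurrence above.

    % Running estimates for section 3.4, placed after step 5 of the LSMR loop.
    normA2 = normA2 + alphak^2 + beta^2;    % ||B_k||_F^2, one new column per iteration
    normA  = sqrt(normA2);                  % estimate of ||A||
    if k > 1                                % collect rhobar_1, ..., rhobar_{k-1}
      maxrbar = max(maxrbar, rhobarold);
      minrbar = min(minrbar, rhobarold);
    end
    condA = max(maxrbar, cbar*rho) / min(minrbar, cbar*rho);   % estimate of cond(A)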

3.5. Stopping criteria. With exact arithmetic, the Golub-Kahan process terminates when either $\alpha_{k+1} = 0$ or $\beta_{k+1} = 0$. For certain data $b$, this could happen in practice when $k$ is small (but is unlikely later). We show that LSMR will have solved the problem at that point and should therefore terminate.

When $\alpha_{k+1} = 0$, with the expression of $\|A^Tr_k\|$ from section 3.2, we have
$$\|A^Tr_k\| = |\bar\zeta_{k+1}| = |\bar s_k\bar\zeta_k| = \left|\frac{\theta_{k+1}}{\bar\rho_k}\bar\zeta_k\right| = \left|\frac{s_k\alpha_{k+1}}{\bar\rho_k}\bar\zeta_k\right| = 0,$$
where (2.10), (2.9), and (2.8) are used. Thus, a least-squares solution has been obtained.

When $\beta_{k+1} = 0$, we have
$$s_k = \beta_{k+1}/\rho_k = 0 \quad \text{(from (2.7))}, \tag{3.7}$$
$$\ddot\beta_{k+1} = -s_k\ddot\beta_k = 0 \quad \text{(from (3.4), (3.7))}, \tag{3.8}$$
and a short calculation using (A.6), (3.7), (3.5), Lemma 3.1, and (A.2)-(A.3) shows that
$$\dot\beta_k = \dot\tau_k. \tag{3.9}$$
By (3.9), (3.8), and (3.6) we conclude that $\|r_k\| = 0$. It follows that $Ax_k = b$.

3.6. Practical stopping criteria. For LSMR we use the same stopping rules as LSQR [15], involving dimensionless quantities ATOL, BTOL, CONLIM:
S1: Stop if $\|r_k\| \le \mathrm{BTOL}\,\|b\| + \mathrm{ATOL}\,\|A\|\,\|x_k\|$
S2: Stop if $\|A^Tr_k\| \le \mathrm{ATOL}\,\|A\|\,\|r_k\|$
S3: Stop if $\mathrm{cond}(A) \ge \mathrm{CONLIM}$
S1 applies to consistent systems, allowing for uncertainty in $A$ and $b$ [9, Theorem 7.1]. S2 applies to inconsistent systems and comes from Stewart's backward error estimate $\|E_2\|$ assuming uncertainty in $A$; see section 6.1. S3 applies to any system.
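In code, rules S1-S3 reduce to three scalar comparisons against the running estimates of section 3. A fragment of the following kind (ours) is tested once per iteration; normr, normAr, normA, condA, and normx denote the estimates of $\|r_k\|$, $\|A^Tr_k\|$, $\|A\|$, $\mathrm{cond}(A)$, $\|x_k\|$, and atol, btol, conlim the user tolerances.

    % Practical stopping rules S1-S3 (section 3.6).
    stop = false;
    if normr  <= btol*normb + atol*normA*normx, stop = true; end   % S1: consistent Ax = b
    if normAr <= atol*normA*normr,              stop = true; end   % S2: LS problems
    if condA  >= conlim,                        stop = true; end   % S3: ill-conditioning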

4. Characteristics of the solution on singular systems. If $A$ does not have full column rank, the normal equation $A^TAx = A^Tb$ is singular but consistent. We show that LSQR and LSMR both give the minimum-norm LS solution. That is, they both solve the optimization problem min $\|x\|_2$ such that $A^TAx = A^Tb$. Let $N(A)$ and $R(A)$ denote the nullspace and range of a matrix $A$.

Lemma 4.1. If $A \in \mathbb{R}^{m\times n}$ and $p \in \mathbb{R}^n$ satisfy $A^TAp = 0$, then $p \in N(A)$.
Proof. $A^TAp = 0 \Rightarrow p^TA^TAp = 0 \Rightarrow (Ap)^TAp = 0 \Rightarrow Ap = 0$.

Theorem 4.2. LSMR returns the minimum-norm solution.
Proof. The final LSMR solution $x_k$ satisfies $A^TAx_k = A^Tb$, and any other solution $\hat x$ satisfies $A^TA\hat x = A^Tb$. With $p = \hat x - x_k$, the difference between the two normal equations gives $A^TAp = 0$, so that $Ap = 0$ by Lemma 4.1. From $\alpha_1v_1 = A^Tu_1$ and $\alpha_{k+1}v_{k+1} = A^Tu_{k+1} - \beta_{k+1}v_k$ (2.1), we have $v_1, \dots, v_k \in R(A^T)$. With $Ap = 0$, this implies $p^TV_k = 0$, so that
$$\|\hat x\|_2^2 - \|x_k\|_2^2 = \|x_k + p\|_2^2 - \|x_k\|_2^2 = p^Tp + 2p^Tx_k = p^Tp + 2p^TV_ky_k = p^Tp \ge 0.$$

Corollary 4.3. LSQR returns the minimum-norm solution.
Proof. At convergence, $\alpha_{k+1} = 0$ or $\beta_{k+1} = 0$. Thus $\bar\beta_{k+1} = \alpha_{k+1}\beta_{k+1} = 0$, which means equation (2.2) becomes $\min\|\bar\beta_1e_1 - B_k^TB_ky_k\|$ and hence $B_k^TB_ky_k = \bar\beta_1e_1$, since $B_k$ has full rank. This is the normal equation for $\min\|B_ky_k - \beta_1e_1\|$, the same LS subproblem solved by LSQR. We conclude that at convergence $y_k^{\mathrm{LSMR}} = y_k^{\mathrm{LSQR}}$, and thus $x_k^{\mathrm{LSMR}} = V_ky_k^{\mathrm{LSMR}} = V_ky_k^{\mathrm{LSQR}} = x_k^{\mathrm{LSQR}}$, and Theorem 4.2 applies.

4.1. Complexity. We compare the storage requirement and computational complexity for LSMR and LSQR on $Ax \approx b$ and MINRES on the normal equation $A^TAx = A^Tb$. In Table 4.1, we list the vector storage needed (excluding storage for $A$ and $b$). Recall that $A$ is $m\times n$ and for LS systems $m$ may be considerably larger than $n$. Av denotes the working storage for matrix-vector products. Work represents the number of floating-point multiplications (in multiples of $m$ and $n$) required at each iteration.

Table 4.1
Storage and computational requirements for various least-squares methods

                                Storage                          Work
                                m        n                       m    n
  LSMR                          Av, u    x, v, h, $\bar h$       3    6
  LSQR                          Av, u    x, v, w                 3    5
  MINRES on $A^TAx = A^Tb$      Av       x, v1, v2, w1, w2, w    4    8

5. Regularized least squares. In this section, we extend LSMR to the regularized LS problem
$$\min_x\left\|\begin{pmatrix} A \\ \lambda I \end{pmatrix}x - \begin{pmatrix} b \\ 0 \end{pmatrix}\right\|_2. \tag{5.1}$$
If $\bar A = \begin{pmatrix} A \\ \lambda I \end{pmatrix}$ and $\bar r_k = \begin{pmatrix} b \\ 0 \end{pmatrix} - \bar Ax_k$, then
$$\bar A^T\bar r_k = A^Tr_k - \lambda^2x_k = V_{k+1}\left(\bar\beta_1e_1 - \begin{pmatrix} B_k^TB_k + \lambda^2I \\ \bar\beta_{k+1}e_k^T \end{pmatrix}y_k\right) = V_{k+1}\left(\bar\beta_1e_1 - \begin{pmatrix} R_k^TR_k \\ \bar\beta_{k+1}e_k^T \end{pmatrix}y_k\right),$$
and the rest of the main algorithm follows the same as in the unregularized case. In the last equality, $R_k$ is defined by the QR factorization
$$Q_{2k+1}\begin{pmatrix} B_k \\ \lambda I \end{pmatrix} = \begin{pmatrix} R_k \\ 0 \end{pmatrix}, \qquad Q_{2k+1} \equiv P_k\hat P_k\cdots P_2\hat P_2P_1\hat P_1,$$
where $\hat P_l$ is a rotation operating on rows $l$ and $k+1+l$. The effects of $\hat P_1$ and $P_1$ are illustrated here:
$$\hat P_1\begin{pmatrix} \alpha_1 & \\ \beta_2 & \alpha_2 \\ & \beta_3 \\ \lambda & \\ & \lambda \end{pmatrix} = \begin{pmatrix} \hat\alpha_1 & \\ \beta_2 & \alpha_2 \\ & \beta_3 \\ 0 & \\ & \lambda \end{pmatrix}, \qquad P_1\begin{pmatrix} \hat\alpha_1 & \\ \beta_2 & \alpha_2 \\ & \beta_3 \\ & \\ & \lambda \end{pmatrix} = \begin{pmatrix} \rho_1 & \theta_2 \\ 0 & \bar\alpha_2 \\ & \beta_3 \\ & \\ & \lambda \end{pmatrix}.$$
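Before specializing the recurrences, it is worth noting that (5.1) is itself an ordinary least-squares problem for the stacked matrix, so any unregularized LS solver could in principle be applied to it directly. The MATLAB fragment below (an illustration only, using the built-in lsqr with arbitrary tolerances; it is not the approach taken in this section) makes that equivalence concrete. The derivation here instead folds $\lambda$ into the rotations so that the stacked system is never formed.

    % Illustration: the damped problem (5.1) as an ordinary LS problem for
    % the stacked matrix [A; lambda*I] and right-hand side [b; 0].
    [m, n] = size(A);
    Abar = [A; lambda * speye(n)];
    bbar = [b; zeros(n, 1)];
    xreg = lsqr(Abar, bbar, 1e-8, 4*n);    % tolerance and iteration limit arbitrary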

5.1. Effects on $\|\bar r_k\|$. The introduction of regularization changes the residual norm as follows:
$$\bar r_k = \begin{pmatrix} b \\ 0 \end{pmatrix} - \begin{pmatrix} A \\ \lambda I \end{pmatrix}x_k = \begin{pmatrix} u_1 \\ 0 \end{pmatrix}\beta_1 - \begin{pmatrix} AV_k \\ \lambda V_k \end{pmatrix}y_k = \begin{pmatrix} u_1 \\ 0 \end{pmatrix}\beta_1 - \begin{pmatrix} U_{k+1}B_k \\ \lambda V_k \end{pmatrix}y_k$$
$$= \begin{pmatrix} U_{k+1} & \\ & V_k \end{pmatrix}\left(e_1\beta_1 - \begin{pmatrix} B_k \\ \lambda I \end{pmatrix}y_k\right) = \begin{pmatrix} U_{k+1} & \\ & V_k \end{pmatrix}\left(e_1\beta_1 - Q_{2k+1}^T\begin{pmatrix} R_k \\ 0 \end{pmatrix}y_k\right)$$
$$= \begin{pmatrix} U_{k+1} & \\ & V_k \end{pmatrix}\left(e_1\beta_1 - Q_{2k+1}^T\begin{pmatrix} t_k \\ 0 \end{pmatrix}\right) = \begin{pmatrix} U_{k+1} & \\ & V_k \end{pmatrix}Q_{2k+1}^T\left(\tilde b_k - \begin{pmatrix} \tilde t_k \\ 0 \end{pmatrix}\right)$$
with $\tilde b_k = \begin{pmatrix} \tilde Q_k & \\ & 1 \end{pmatrix}Q_{2k+1}e_1\beta_1$, where we adopt the notation
$$\tilde b_k = \begin{pmatrix} \tilde\beta_1 & \cdots & \tilde\beta_{k-1} & \dot\beta_k & \ddot\beta_{k+1} & \check\beta_1 & \cdots & \check\beta_k \end{pmatrix}^T.$$
We conclude that
$$\|\bar r_k\|^2 = \check\beta_1^2 + \cdots + \check\beta_k^2 + (\dot\beta_k - \dot\tau_k)^2 + \ddot\beta_{k+1}^2.$$
The effect of regularization on the rotations is summarized as
$$\begin{pmatrix} \hat c_k & \hat s_k \\ -\hat s_k & \hat c_k \end{pmatrix}\begin{pmatrix} \bar\alpha_k & \ddot\beta_k \\ \lambda & 0 \end{pmatrix} = \begin{pmatrix} \hat\alpha_k & \hat c_k\ddot\beta_k \\ 0 & \check\beta_k \end{pmatrix}, \qquad \begin{pmatrix} c_k & s_k \\ -s_k & c_k \end{pmatrix}\begin{pmatrix} \hat\alpha_k & 0 & \hat c_k\ddot\beta_k \\ \beta_{k+1} & \alpha_{k+1} & 0 \end{pmatrix} = \begin{pmatrix} \rho_k & \theta_{k+1} & \hat\beta_k \\ 0 & \bar\alpha_{k+1} & \ddot\beta_{k+1} \end{pmatrix}.$$

5.2. Pseudo-code for regularized LSMR. The following summarizes algorithm LSMR for solving the regularized problem (5.1) with given $\lambda$. Our Matlab implementation is based on these steps.
1. (Initialize)
$$\beta_1u_1 = b \quad \alpha_1v_1 = A^Tu_1 \quad \bar\alpha_1 = \alpha_1 \quad \bar\zeta_1 = \alpha_1\beta_1 \quad \rho_0 = 1 \quad \bar\rho_0 = 1 \quad \bar c_0 = 1 \quad \bar s_0 = 0$$
$$\ddot\beta_1 = \beta_1 \quad \dot\beta_0 = 0 \quad \dot\rho_0 = 1 \quad \dot\tau_0 = 0 \quad \tilde\theta_0 = 0 \quad \zeta_0 = 0 \quad d_0 = 0 \quad h_1 = v_1 \quad \bar h_0 = 0 \quad x_0 = 0$$
2. For $k = 1, 2, 3, \dots$, repeat steps 3-12.
3. (Continue the bidiagonalization)
$$\beta_{k+1}u_{k+1} = Av_k - \alpha_ku_k \qquad \alpha_{k+1}v_{k+1} = A^Tu_{k+1} - \beta_{k+1}v_k$$
4. (Construct rotation $\hat P_k$)
$$\hat\alpha_k = (\bar\alpha_k^2 + \lambda^2)^{1/2} \qquad \hat c_k = \bar\alpha_k/\hat\alpha_k \qquad \hat s_k = \lambda/\hat\alpha_k$$
5. (Construct and apply rotation $P_k$)
$$\rho_k = (\hat\alpha_k^2 + \beta_{k+1}^2)^{1/2} \qquad c_k = \hat\alpha_k/\rho_k \qquad s_k = \beta_{k+1}/\rho_k \qquad \theta_{k+1} = s_k\alpha_{k+1} \qquad \bar\alpha_{k+1} = c_k\alpha_{k+1}$$

6. (Construct and apply rotation $\bar P_k$)
$$\bar\theta_k = \bar s_{k-1}\rho_k \qquad \bar\rho_k = \big((\bar c_{k-1}\rho_k)^2 + \theta_{k+1}^2\big)^{1/2} \qquad \bar c_k = \bar c_{k-1}\rho_k/\bar\rho_k \qquad \bar s_k = \theta_{k+1}/\bar\rho_k$$
$$\zeta_k = \bar c_k\bar\zeta_k \qquad \bar\zeta_{k+1} = -\bar s_k\bar\zeta_k$$
7. (Update $h$, $x$, $\bar h$)
$$\bar h_k = h_k - (\bar\theta_k\rho_k/(\rho_{k-1}\bar\rho_{k-1}))\bar h_{k-1} \qquad x_k = x_{k-1} + (\zeta_k/(\rho_k\bar\rho_k))\bar h_k \qquad h_{k+1} = v_{k+1} - (\theta_{k+1}/\rho_k)h_k$$
8. (Apply rotations $\hat P_k$, $P_k$)
$$\check\beta_k = -\hat s_k\ddot\beta_k \qquad \ddot\beta_k \leftarrow \hat c_k\ddot\beta_k \qquad \hat\beta_k = c_k\ddot\beta_k \qquad \ddot\beta_{k+1} = -s_k\ddot\beta_k$$
9. (If $k \ge 2$, construct and apply rotation $\tilde P_{k-1}$)
$$\tilde\rho_{k-1} = (\dot\rho_{k-1}^2 + \bar\theta_k^2)^{1/2} \qquad \tilde c_{k-1} = \dot\rho_{k-1}/\tilde\rho_{k-1} \qquad \tilde s_{k-1} = \bar\theta_k/\tilde\rho_{k-1}$$
$$\tilde\theta_k = \tilde s_{k-1}\bar\rho_k \qquad \dot\rho_k = \tilde c_{k-1}\bar\rho_k \qquad \tilde\beta_{k-1} = \tilde c_{k-1}\dot\beta_{k-1} + \tilde s_{k-1}\hat\beta_k \qquad \dot\beta_k = -\tilde s_{k-1}\dot\beta_{k-1} + \tilde c_{k-1}\hat\beta_k$$
10. (Update $\tilde t_k$ by forward substitution)
$$\tilde\tau_{k-1} = (\zeta_{k-1} - \tilde\theta_{k-1}\tilde\tau_{k-2})/\tilde\rho_{k-1} \qquad \dot\tau_k = (\zeta_k - \tilde\theta_k\tilde\tau_{k-1})/\dot\rho_k$$
11. (Compute $\|\bar r_k\|$)
$$d_k = d_{k-1} + \check\beta_k^2 \qquad \gamma = d_k + (\dot\beta_k - \dot\tau_k)^2 + \ddot\beta_{k+1}^2 \qquad \|\bar r_k\| = \sqrt\gamma$$
12. (Compute $\|\bar A^T\bar r_k\|$, $\|x_k\|$, estimate $\|\bar A\|$, $\mathrm{cond}(\bar A)$, and test for termination)
Compute $\|\bar A^T\bar r_k\| = |\bar\zeta_{k+1}|$ (section 3.2).
Compute $\|x_k\|$ (section 3.3).
Estimate $\sigma_{\max}(B_k)$, $\sigma_{\min}(B_k)$ and hence $\|\bar A\|$, $\mathrm{cond}(\bar A)$ (section 3.4).
Terminate if any of the stopping criteria are satisfied (section 3.5).

6. Backward errors. For inconsistent problems with uncertainty in $A$ (but not $b$), let $x$ be any approximate solution. The normwise backward error for $x$ measures the perturbation to $A$ that would make $x$ an exact LS solution:
$$\mu(x) \equiv \min_E \|E\| \quad\text{s.t.}\quad (A+E)^T(A+E)x = (A+E)^Tb. \tag{6.1}$$
It is known to be the smallest singular value of a certain $m\times(n+m)$ matrix $C$; see Waldén et al. [25] and Higham [9]:
$$\mu(x) = \sigma_{\min}(C), \qquad C \equiv \begin{bmatrix} A & \dfrac{\|r\|}{\|x\|}\Big(I - \dfrac{rr^T}{\|r\|^2}\Big) \end{bmatrix}.$$
Since it is generally too expensive to evaluate $\mu(x)$, we need to find approximations.

6.1. Approximate backward errors $E_1$ and $E_2$. In 1975, Stewart [20] discussed a particular backward error estimate that we will call $E_1$. Let $x^*$ and $r^* = b - Ax^*$ be the exact LS solution and residual. Stewart showed that an approximate solution $x$ with residual $r = b - Ax$ is the exact LS solution of the perturbed problem $\min\|b - (A+E_1)x\|$, where $E_1$ is the rank-one matrix
$$E_1 = \frac{ex^T}{\|x\|^2}, \qquad \|E_1\| = \frac{\|e\|}{\|x\|}, \qquad e \equiv r - r^*, \tag{6.2}$$
with $\|r\|^2 = \|r^*\|^2 + \|e\|^2$.

Soon after, Stewart [21] gave a further important result that can be used within any LS solver. The approximate $x$ and a certain vector $\tilde r = b - (A+E_2)x$ are the exact solution and residual of the perturbed LS problem $\min\|b - (A+E_2)x\|$, where
$$E_2 = -\frac{rr^TA}{\|r\|^2}, \qquad \|E_2\| = \frac{\|A^Tr\|}{\|r\|}, \qquad r = b - Ax. \tag{6.3}$$
LSQR and LSMR both compute $\|E_2\|$ for each iterate $x_k$ because the current $\|r_k\|$ and $\|A^Tr_k\|$ can be accurately estimated at almost no cost. An added feature is that for both solvers, $\tilde r = b - (A+E_2)x_k = r_k$ because $E_2x_k = 0$ (assuming orthogonality of $V_k$). That is, $x_k$ and $r_k$ are theoretically exact for the perturbed LS problem $\min\|(A+E_2)x - b\|$.

Stopping rule S2 (section 3.6) requires $\|E_2\| \le \mathrm{ATOL}\,\|A\|$. Hence the following property gives LSMR an advantage over LSQR for stopping early.
Theorem 6.1. $\|E_2^{\mathrm{LSMR}}\| \le \|E_2^{\mathrm{LSQR}}\|$.
Proof. This follows from $\|A^Tr_k^{\mathrm{LSMR}}\| \le \|A^Tr_k^{\mathrm{LSQR}}\|$ and $\|r_k^{\mathrm{LSMR}}\| \ge \|r_k^{\mathrm{LSQR}}\|$.

6.2. Approximate optimal backward error $\tilde\mu(x)$. Various authors have derived expressions for a quantity $\tilde\mu(x)$ that has proved to be a very accurate approximation to $\mu(x)$ in (6.1) when $x$ is at least moderately close to the exact solution $x^*$. Grcar, Saunders, and Su [23, 7] show that $\tilde\mu(x)$ can be obtained from a full-rank LS problem as follows:
$$K = \begin{bmatrix} A \\ \dfrac{\|r\|}{\|x\|}I \end{bmatrix}, \qquad v = \begin{bmatrix} r \\ 0 \end{bmatrix}, \qquad \min_y\|Ky - v\|, \qquad \tilde\mu(x) = \|Ky\|/\|x\|, \tag{6.4}$$
and give the following Matlab script for computing the economy-size sparse QR factorization $K = QR$ and $c \equiv Q^Tv$ (for which $c = Ky$) and thence $\tilde\mu(x)$:

    [m,n] = size(A);  r = b - A*x;  normx = norm(x);  eta = norm(r)/normx;
    p = colamd(A);  K = [A(:,p); eta*speye(n)];  v = [r; zeros(n,1)];
    [c,R] = qr(K,v,0);  mutilde = norm(c)/normx;

In our experiments we use this script to compute $\tilde\mu(x_k)$ for each LSQR and LSMR iterate $x_k$. We refer to this as the optimal backward error for $x_k$ because it is provably very close to the true $\mu(x_k)$ [6].
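For a given approximate solution $x$, both Stewart estimates reduce to a few norms. The fragment below (ours) evaluates them directly from (6.2)-(6.3). The exact residual $r^* = b - Ax^*$ is of course not available inside a solver; as in the experiments of section 7 it serves only as a benchmark, whereas $\|E_2\|$ needs nothing beyond quantities the solver already tracks.

    % Backward error estimates (6.2)-(6.3) for an approximate solution x.
    r      = b - A*x;
    normE2 = norm(A'*r) / norm(r);          % ||E_2||, computable by the solver
    e      = r - rstar;                     % rstar = b - A*xstar (benchmark only)
    normE1 = norm(e) / norm(x);             % ||E_1||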

6.3. Related work. More precise stopping rules have been derived recently by Arioli and Gratton [1] and Titley-Péloquin et al. [3, 12, 24]. The rules allow for uncertainty in both $A$ and $b$, and may prove to be useful for LSQR, LSMR, and least-squares methods in general. However, we would like to emphasize that rule S2 already terminates LSMR significantly sooner than LSQR on most of our inconsistent test cases; see Theorem 6.1, Fig. 7.2 (left), and Fig. 7.3 (top left).

7. Numerical results. For test examples, we have drawn from the University of Florida Sparse Matrix Collection (Davis [4]). We discuss overdetermined systems first, and then some square examples.

7.1. Least-squares problems. The LPnetlib group provides data for 138 linear programming problems of widely varying origin, structure, and size. The constraint matrix and objective function may be used to define a sparse LS problem min $\|Ax - b\|$. Each example was downloaded in Matlab format, and a sparse matrix $A$ and dense vector $b$ were extracted from the data structure via A = (Problem.A)' and b = Problem.c (where ' denotes transpose). Five examples had $b = 0$, and a further six gave $A^Tb = 0$. The remaining 127 problems had up to 243000 rows, 10000 columns, and 1.4M nonzeros in $A$. Diagonal scaling was applied to the columns of $\begin{pmatrix} A & b \end{pmatrix}$ to give a scaled problem min $\|Ax - b\|$ in which the columns of $A$ (and also $b$) have unit 2-norm.

LSQR and LSMR were run on each of the 127 scaled problems with stopping tolerance ATOL = $10^{-8}$, generating sequences of approximate solutions $\{x_k^{\mathrm{LSQR}}\}$ and $\{x_k^{\mathrm{LSMR}}\}$. The iteration indices $k$ are omitted below. The associated residual vectors are denoted by $r$ without ambiguity, and $x^*$ is the solution to the LS problem, or the minimum-norm solution if the system is singular. As expected, the optimal residual is nonzero in all cases. We record some general observations.

1. $\|r^{\mathrm{LSQR}}\|$ is monotonic by design. $\|r^{\mathrm{LSMR}}\|$ seems to be monotonic (no counterexamples were found) and nearly as small as $\|r^{\mathrm{LSQR}}\|$ for all iterations on almost all problems. Figure 7.1 shows a typical example and a rare case.

[Fig. 7.1. For most iterations, $\|r^{\mathrm{LSMR}}\|$ appears to be monotonic and nearly as small as $\|r^{\mathrm{LSQR}}\|$. Left: a typical case (problem lp_greenbeb). Right: a rare case (problem lp_woodw), where LSMR's residual norm is significantly larger than LSQR's during early iterations.]

[Fig. 7.2. For most iterations, $\|E_2^{\mathrm{LSMR}}\|$ appears to be monotonic (but $\|E_2^{\mathrm{LSQR}}\|$ is not). Left: a typical case (problem lp_pilot_ja); LSMR is likely to terminate much sooner than LSQR (see Theorem 6.1). Right: sole exception (problem lp_sc205); the exception remains even if $U_k$ and/or $V_k$ are reorthogonalized.]

2. $\|x_k\|$ is nearly monotonic for LSQR and even more closely monotonic for LSMR. With $\|r_k\|$ monotonic for LSQR and essentially so for LSMR, $\|E_1\|$ in (6.2) is likely to appear monotonic for both solvers. Although $\|E_1\|$ is not normally available for each iteration, it provides a benchmark for $\|E_2\|$.

3. $\|E_2^{\mathrm{LSQR}}\|$ is not monotonic, but $\|E_2^{\mathrm{LSMR}}\|$ appears monotonic almost always. Figure 7.2 shows a typical case. The sole exception for this observation is also shown.

4. Note that Benbow [2] has given numerical results comparing a generalized form of LSQR with application of MINRES to the corresponding normal equation. The curves in [2, Figure 3] show the irregular and smooth behavior of LSQR and MINRES respectively in terms of $\|A^Tr_k\|$. Those curves are effectively a preview of the left-hand plots in Figure 7.2 (where LSMR serves as our more reliable implementation of MINRES).

5. $\|E_1\| \approx \|E_2\|$ often for LSMR, but not so for LSQR. Some examples are shown in Figure 7.3 along with $\tilde\mu(x)$, the accurate estimate (6.4) of the optimal backward error for each point $x$.

6. $\|E_2^{\mathrm{LSMR}}\| \approx \tilde\mu(x^{\mathrm{LSMR}})$ almost always. Figure 7.4 shows a typical example and a rare case. In all such rare cases, $\|E_1^{\mathrm{LSMR}}\| \approx \tilde\mu(x^{\mathrm{LSMR}})$ instead!

7. $\tilde\mu(x^{\mathrm{LSQR}})$ is not always monotonic. $\tilde\mu(x^{\mathrm{LSMR}})$ does seem to be monotonic. Figure 7.5 gives examples.

8. $\tilde\mu(x^{\mathrm{LSMR}}) \le \tilde\mu(x^{\mathrm{LSQR}})$ almost always. Figure 7.6 gives examples.

9. The errors $\|x^* - x_k^{\mathrm{LSQR}}\|$ and $\|x^* - x_k^{\mathrm{LSMR}}\|$ seem to decrease monotonically, with the LSQR error typically smaller than the LSMR error. Figure 7.7 gives examples. This is one property for which LSQR seems more desirable (and it has been suggested [17] that for LS problems, LSQR could be terminated when LSMR's rule S2 would terminate).

[Fig. 7.3. $\|E_1\|$, $\|E_2\|$, and $\tilde\mu(x)$ for LSQR (top figures) and LSMR (bottom figures). Top left: a typical case; $\|E_1\|$ is close to the optimal backward error, but the computable $\|E_2\|$ is not. Top right: a rare case in which $\|E_2\|$ is close to optimal. Bottom left: $\|E_1\|$ and $\|E_2\|$ are often both close to the optimal backward error. Bottom right: $\|E_1\|$ is far from optimal, but the computable $\|E_2\|$ is almost always close (too close to distinguish in the plot!). Problems lp_cre_a (left) and lp_pilot (right).]

[Fig. 7.4. Again, $\|E_2^{\mathrm{LSMR}}\| \approx \tilde\mu(x^{\mathrm{LSMR}})$ almost always (the computable backward error estimate is essentially optimal). Left: a typical case (problem lp_ken_11). Right: a rare case (problem lp_ship12l); here, $\|E_1^{\mathrm{LSMR}}\| \approx \tilde\mu(x^{\mathrm{LSMR}})$!]

[Fig. 7.5. $\tilde\mu(x^{\mathrm{LSMR}})$ seems to be always monotonic, but $\tilde\mu(x^{\mathrm{LSQR}})$ is usually not. Left: a typical case for both LSQR and LSMR (problem lp_maros). Right: a rare case for LSMR, typical for LSQR (problem lp_cre_c).]

[Fig. 7.6. $\tilde\mu(x^{\mathrm{LSMR}}) \le \tilde\mu(x^{\mathrm{LSQR}})$ almost always. Left: a typical case (problem lp_pilot). Right: a rare case (problem lp_standgub).]

[Fig. 7.7. The errors $\|x^* - x^{\mathrm{LSQR}}\|$ and $\|x^* - x^{\mathrm{LSMR}}\|$ seem to decrease monotonically, with LSQR's errors smaller than LSMR's. Left: a nonsingular LS system (problem lp_ship12l). Right: a singular system (problem lp_pds_02); LSQR and LSMR both converge to the minimum-norm LS solution.]

[Fig. 7.8. LSQR and LSMR solving two square nonsingular systems $Ax = b$: problems Hamm/hcircuit (top) and IBM_EDA/trans5 (bottom). Left: $\log_{10}\|r_k\|$ for both solvers, with prolonged plateaus for LSMR. Right: $\log_{10}\|A^Tr_k\|$ (preferable for LSMR).]

7.2. Square systems. Since LSQR and LSMR are applicable to consistent systems, it is of interest to compare them on an unbiased test set. We used the search facility of Davis [4] to select a set of square real linear systems $Ax = b$. With index = UFget, the criteria

    ids = find(index.nrows > 100000 & index.nrows < 200000 & ...
               index.nrows == index.ncols & index.isreal == 1 & ...
               index.posdef == 0 & index.numerical_symmetry < 1);

returned a list of 42 examples. Testing isfield(UFget(id),'b') left 26 cases for which $b$ was supplied. For each, diagonal scaling was first applied to the rows of $\begin{pmatrix} A & b \end{pmatrix}$ and then to its columns to give a scaled problem $Ax = b$ in which the columns of $\begin{pmatrix} A & b \end{pmatrix}$ have unit 2-norm.

In spite of the scaling, most examples required more than $n$ iterations of LSQR or LSMR to reduce $\|r_k\|$ satisfactorily (rule S1 in section 3.6 with ATOL = BTOL = $10^{-8}$). To simulate better preconditioning, we chose two cases that required about $n/5$ and $n/10$ iterations. Figure 7.8 (left) shows both solvers reducing $\|r_k\|$ monotonically but with plateaus that are prolonged for LSMR. With loose stopping tolerances, LSQR could terminate somewhat sooner. Figure 7.8 (right) shows $\|A^Tr_k\|$ for each solver. The plateaus for LSMR correspond to LSQR gaining ground with $\|r_k\|$, but falling significantly backward by the $\|A^Tr_k\|$ measure.

[Fig. 7.9. LSMR with and without reorthogonalization of $V_k$ and/or $U_k$. Left: an easy case (problem lp_ship12l). Right: a helpful case (problem lpi_gran).]

7.3. Reorthogonalization. It is well known that Krylov-subspace methods can take arbitrarily many iterations because of loss of orthogonality. For the Golub-Kahan bidiagonalization, we have two sets of vectors $U_k$ and $V_k$. As an experiment, we implemented the following options in LSMR (option 2 is sketched at the end of this subsection):
1. No reorthogonalization.
2. Reorthogonalize $V_k$ (that is, reorthogonalize $v_k$ with respect to $V_{k-1}$).
3. Reorthogonalize $U_k$ (that is, reorthogonalize $u_k$ with respect to $U_{k-1}$).
4. Both 2 and 3.
Each option was tested on all of the overdetermined test problems with fewer than 16K nonzeros. Figure 7.9 shows an easy case in which all options converge equally well (convergence before significant loss of orthogonality), and an extreme case in which reorthogonalization makes a large difference.

Unexpectedly, options 2, 3, and 4 proved to be indistinguishable in all cases. To look closer, we forced LSMR to take $n$ iterations. Option 2 (with $V_k$ orthonormal to machine precision $\epsilon$) was found to be keeping $U_k$ orthonormal to at least $O(\sqrt{\epsilon})$. Option 3 (with $U_k$ orthonormal) was not quite as effective, but it kept $V_k$ orthonormal to at least $O(\sqrt{\epsilon})$ up to the point where LSMR would terminate when ATOL = $\sqrt{\epsilon}$.

Note that for square or rectangular $A$ with exact arithmetic, LSMR is equivalent to MINRES on the normal equation (and hence to the conjugate-residual method [11] and GMRES [19] on the same equation). Reorthogonalization makes the equivalence essentially true in practice.

We now focus on reorthogonalizing $V_k$ but not $U_k$. Other authors have presented numerical results involving reorthogonalization. For example, on some randomly generated LS problems of increasing condition number, Hayami et al. [8] compare their BA-GMRES method with an implementation of CGLS (equivalent to LSQR [15]) in which $V_k$ is reorthogonalized, and find that the methods require essentially the same number of iterations. The preconditioner chosen for BA-GMRES made that method equivalent to GMRES on $A^TAx = A^Tb$. Thus, GMRES without reorthogonalization was seen to converge essentially as well as CGLS or LSQR with reorthogonalization of $V_k$ (option 2 above). This coincides with the analysis by Paige et al. [13], who conclude that MGS-GMRES does not need reorthogonalization of the Arnoldi vectors $V_k$.
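Option 2 above amounts to one pass of modified Gram-Schmidt against the stored columns of $V_k$ before the new vector is normalized. A MATLAB sketch of the corresponding modification to the bidiagonalization step (2.1) follows (ours); the matrix V of previous vectors must now be kept, which is the storage cost mentioned earlier.

    % Option 2: reorthogonalize each new v_{k+1} against V(:,1:k) (one MGS pass).
    vnew = A' * u - beta * v;                    % unnormalized v_{k+1}, as in (2.1)
    for j = 1:k
      vnew = vnew - (V(:,j)' * vnew) * V(:,j);   % remove the component along v_j
    end
    alpha = norm(vnew);  v = vnew / alpha;  V(:,k+1) = v;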

[Fig. 7.10. LSMR with reorthogonalized $V_k$ and restarting. Restart(l) with l = 5, 10, 50 is slower than standard LSMR with or without reorthogonalization. Problems lp_maros and lp_cre_c.]

[Fig. 7.11. LSMR with local reorthogonalization of $V_k$. Local(l) with l = 5, 10, 50 illustrates reduced iterations as l increases. Problems lp_fit1p and lp_bnl2.]

7.4. Restarting. To conserve storage, a simple approach is to restart the algorithm every l steps, as with GMRES(l) [19]. Figure 7.10 shows that restarting LSMR even with full reorthogonalization (of $V_k$) may lead to stagnation. In general, convergence with restarting is much slower than LSMR without reorthogonalization.

7.5. Local reorthogonalization. Here we reorthogonalize each new $v_k$ with respect to the previous l vectors, where l is a specified parameter (a sketch follows below). Figure 7.11 shows that l = 5 has little effect, but partial speedup was achieved with l = 10 and 50 in the two chosen cases. There is evidence of a useful storage-time tradeoff. The potential speedup depends strongly on the computational cost of $Av$ and $A^Tu$.
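The local variant differs from the sketch at the end of section 7.3 only in that a sliding window of at most l previous vectors is kept; a fragment (ours):

    % Local(l): reorthogonalize the new v against a window of the last l vectors.
    vnew = A' * u - beta * v;
    for j = 1:size(Vloc, 2)
      vnew = vnew - (Vloc(:,j)' * vnew) * Vloc(:,j);
    end
    alpha = norm(vnew);  v = vnew / alpha;
    Vloc = [Vloc, v];                                % append, then trim the window
    if size(Vloc, 2) > l, Vloc = Vloc(:, end-l+1:end); end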

7.6. Partial reorthogonalization. Larsen [18] uses partial reorthogonalization of both $V_k$ and $U_k$ within his PROPACK software for computing a set of singular values and vectors for a sparse rectangular matrix $A$. Similar techniques might prove helpful within LSMR. We leave this for future research.

8. Summary. We have presented LSMR, an iterative algorithm for square or rectangular systems, along with details of its implementation and experimental results to suggest that it has advantages over the widely adopted LSQR algorithm.

As in LSQR, theoretical and practical stopping criteria are provided for solving $Ax = b$ and min $\|Ax - b\|$ with optional Tikhonov regularization, using estimates of $\|r_k\|$, $\|A^Tr_k\|$, $\|x_k\|$, $\|A\|$, and $\mathrm{cond}(A)$ that are cheaply computable.

For LS problems, the Stewart backward error estimate $\|E_2\|$ (6.3) seems experimentally to be very close to the optimal backward error $\tilde\mu(x_k)$ at each LSMR iterate $x_k$ (section 6.2). This often allows LSMR to terminate significantly sooner than LSQR.

Experiments with full reorthogonalization have shown that the Golub-Kahan process retains high accuracy if the columns of either $V_k$ or $U_k$ are reorthogonalized. There is no need to reorthogonalize both. This discovery could be helpful for other uses of the Golub-Kahan process.

Matlab, Python, and Fortran 90 implementations of LSMR are available from [10]. They all allow local reorthogonalization of $V_k$.

Acknowledgements. We are grateful to Chris Paige for his helpful comments on reorthogonalization and other aspects of this work. We are also grateful to two referees for their extremely helpful and perceptive reviews. Further thanks go to Martin van Gijzen and Mike Botchev for their help with testing on square systems arising from convection-diffusion problems, and to Victor Pereyra for proposing that LSMR's stopping rule be used to terminate LSQR if a smaller final error $\|x^* - x_k\|$ is important.

Appendix A. Proof of Lemma 3.1. The effects of the rotations $P_k$ and $\tilde P_{k-1}$ can be summarized as
$$\begin{pmatrix} c_k & s_k \\ -s_k & c_k \end{pmatrix}\begin{pmatrix} \ddot\beta_k \\ 0 \end{pmatrix} = \begin{pmatrix} \hat\beta_k \\ \ddot\beta_{k+1} \end{pmatrix}, \qquad \bar R_k = \begin{pmatrix} \bar\rho_1 & \bar\theta_2 & & \\ & \bar\rho_2 & \ddots & \\ & & \ddots & \bar\theta_k \\ & & & \bar\rho_k \end{pmatrix},$$
$$\begin{pmatrix} \tilde c_{k-1} & \tilde s_{k-1} \\ -\tilde s_{k-1} & \tilde c_{k-1} \end{pmatrix}\begin{pmatrix} \dot\rho_{k-1} & & \dot\beta_{k-1} \\ \bar\theta_k & \bar\rho_k & \hat\beta_k \end{pmatrix} = \begin{pmatrix} \tilde\rho_{k-1} & \tilde\theta_k & \tilde\beta_{k-1} \\ & \dot\rho_k & \dot\beta_k \end{pmatrix},$$
where $\ddot\beta_1 = \beta_1$, $\dot\rho_1 = \bar\rho_1$, $\dot\beta_1 = \hat\beta_1$, and where $c_k$, $s_k$ are defined in section 2.6.

We define $s^{(k)} \equiv s_1s_2\cdots s_k$ and $\bar s^{(k)} \equiv \bar s_1\bar s_2\cdots\bar s_k$. Then from (3.3) and (2.4) we have
$$\bar R_kt_k = z_k = \begin{pmatrix} I_k & 0 \end{pmatrix}\bar Q_{k+1}e_1\bar\beta_1.$$
Expanding this and (3.1) elementwise, using the rotations summarized above, gives
$$\tau_1 = \bar c_1\bar\beta_1/\bar\rho_1, \tag{A.1}$$
$$\tau_k = \big((-1)^{k-1}\bar s^{(k-1)}\bar c_k\bar\beta_1 - \bar\theta_k\tau_{k-1}\big)/\bar\rho_k, \tag{A.2}$$
$$\dot\tau_k = \big((-1)^{k-1}\bar s^{(k-1)}\bar c_k\bar\beta_1 - \tilde\theta_k\tilde\tau_{k-1}\big)/\dot\rho_k, \tag{A.3}$$
$$\dot\beta_1 = \hat\beta_1 = c_1\beta_1, \tag{A.4}$$
$$\tilde\beta_{k-1} = \tilde c_{k-1}\dot\beta_{k-1} + (-1)^{k-1}\tilde s_{k-1}c_ks^{(k-1)}\beta_1, \tag{A.5}$$
$$\dot\beta_k = -\tilde s_{k-1}\dot\beta_{k-1} + (-1)^{k-1}\tilde c_{k-1}c_ks^{(k-1)}\beta_1. \tag{A.6}$$

We want to show by induction that $\tilde\tau_i = \tilde\beta_i$ for all $i$. For the base case $i = 1$, a direct calculation that combines (A.1), (A.4), the rotation definitions (2.7)-(2.9), and the identity $\tilde\rho_1^2 = \dot\rho_1^2 + \bar\theta_2^2$ from (3.5) shows that $\tilde\beta_1 = \tilde\tau_1$.

Suppose $\tilde\tau_{k-1} = \tilde\beta_{k-1}$. Applying the induction hypothesis to (A.2)-(A.3) and simplifying with (A.5)-(A.6), the rotation identities (2.9) and (3.5), and the relation $\bar\theta_{k+1} = \bar s_k\rho_{k+1}$, we obtain $\tilde\tau_k = \tilde\beta_k$.

Therefore, by induction, $\tilde\tau_i = \tilde\beta_i$ for $i = 1, 2, \dots$. From (3.3), we see that at iteration $k$, the first $k-1$ elements of $\tilde b_k$ and $\tilde t_k$ are equal.

REFERENCES

[1] M. Arioli and S. Gratton, Least-squares problems, normal equations, and stopping criteria for the conjugate gradient method, Technical Report RAL-TR-2008-008, Rutherford Appleton Laboratory, Oxfordshire, UK, 2008.
[2] S. J. Benbow, Solving generalized least-squares problems with LSQR, SIAM J. Matrix Anal. Appl., 21 (1999).
[3] X.-W. Chang, C. C. Paige, and D. Titley-Péloquin, Stopping criteria for the iterative solution of linear least squares problems, SIAM J. Matrix Anal. Appl., 31 (2009).
[4] T. A. Davis, University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices.
[5] G. H. Golub and W. Kahan, Calculating the singular values and pseudo-inverse of a matrix, J. Soc. Indust. Appl. Math. Ser. B: Numer. Anal., 2 (1965).
[6] S. Gratton, P. Jiránek, and D. Titley-Péloquin, On the accuracy of the Karlson-Waldén estimate of the backward error for linear least squares problems, manuscript, Feb 22, 2011.
[7] J. F. Grcar, M. A. Saunders, and Z. Su, Estimates of optimal backward perturbations for linear least squares problems, Report SOL 2007-1, Department of Management Science and Engineering, Stanford University, Stanford, CA, 2007.
[8] K. Hayami, J.-F. Yin, and T. Ito, GMRES methods for least squares problems, SIAM J. Matrix Anal. Appl., 31 (2010).
[9] N. J. Higham, Accuracy and Stability of Numerical Algorithms, second ed., SIAM, Philadelphia, 2002.
[10] LSMR software for linear systems and least squares, http://www.stanford.edu/group/SOL/software.html.
[11] D. G. Luenberger, The conjugate residual method for constrained minimization problems, SIAM J. Numer. Anal., 7 (1970).
[12] P. Jiránek and D. Titley-Péloquin, Estimating the backward error in LSQR, SIAM J. Matrix Anal. Appl., 31 (2010).
[13] C. C. Paige, M. Rozložník, and Z. Strakoš, Modified Gram-Schmidt (MGS), least squares, and backward stability of MGS-GMRES, SIAM J. Matrix Anal. Appl., 28 (2006).
[14] C. C. Paige and M. A. Saunders, Solution of sparse indefinite systems of linear equations, SIAM J. Numer. Anal., 12 (1975).
[15] C. C. Paige and M. A. Saunders, LSQR: An algorithm for sparse linear equations and sparse least squares, ACM Trans. Math. Softw., 8 (1982).
[16] C. C. Paige and M. A. Saunders, Algorithm 583; LSQR: Sparse linear equations and least-squares problems, ACM Trans. Math. Softw., 8 (1982).
[17] V. Pereyra, private communication, 2010.
[18] PROPACK software for SVD of sparse matrices.
[19] Y. Saad and M. H. Schultz, GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Statist. Comput., 7 (1986).
[20] G. W. Stewart, An inverse perturbation theorem for the linear least squares problem, SIGNUM Newsletter, 10 (1975).
[21] G. W. Stewart, Research, development and LINPACK, in Mathematical Software III, J. R. Rice, ed., Academic Press, New York, 1977.
[22] G. W. Stewart, The QLP approximation to the singular value decomposition, SIAM J. Sci. Comput., 20 (1999).
[23] Z. Su, Computational Methods for Least Squares Problems and Clinical Trials, PhD thesis, SCCM, Stanford University, 2005.
[24] D. Titley-Péloquin, Backward Perturbation Analysis of Least Squares Problems, PhD thesis, School of Computer Science, McGill University, 2010.
[25] B. Waldén, R. Karlson, and J.-G. Sun, Optimal backward perturbation bounds for the linear least squares problem, Numer. Linear Algebra Appl., 2 (1995).


AMS Mathematics Subject Classification : 65F10,65F50. Key words and phrases: ILUS factorization, preconditioning, Schur complement, 1. J. Appl. Math. & Computing Vol. 15(2004), No. 1, pp. 299-312 BILUS: A BLOCK VERSION OF ILUS FACTORIZATION DAVOD KHOJASTEH SALKUYEH AND FAEZEH TOUTOUNIAN Abstract. ILUS factorization has many desirable

More information

Total least squares. Gérard MEURANT. October, 2008

Total least squares. Gérard MEURANT. October, 2008 Total least squares Gérard MEURANT October, 2008 1 Introduction to total least squares 2 Approximation of the TLS secular equation 3 Numerical experiments Introduction to total least squares In least squares

More information

A communication-avoiding thick-restart Lanczos method on a distributed-memory system

A communication-avoiding thick-restart Lanczos method on a distributed-memory system A communication-avoiding thick-restart Lanczos method on a distributed-memory system Ichitaro Yamazaki and Kesheng Wu Lawrence Berkeley National Laboratory, Berkeley, CA, USA Abstract. The Thick-Restart

More information

Sparse BLAS-3 Reduction

Sparse BLAS-3 Reduction Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc

More information

Introduction. Chapter One

Introduction. Chapter One Chapter One Introduction The aim of this book is to describe and explain the beautiful mathematical relationships between matrices, moments, orthogonal polynomials, quadrature rules and the Lanczos and

More information

We first repeat some well known facts about condition numbers for normwise and componentwise perturbations. Consider the matrix

We first repeat some well known facts about condition numbers for normwise and componentwise perturbations. Consider the matrix BIT 39(1), pp. 143 151, 1999 ILL-CONDITIONEDNESS NEEDS NOT BE COMPONENTWISE NEAR TO ILL-POSEDNESS FOR LEAST SQUARES PROBLEMS SIEGFRIED M. RUMP Abstract. The condition number of a problem measures the sensitivity

More information

Dedicated to Michael Saunders s 70th birthday

Dedicated to Michael Saunders s 70th birthday MINIMAL RESIDUAL METHODS FOR COMPLEX SYMMETRIC, SKEW SYMMETRIC, AND SKEW HERMITIAN SYSTEMS SOU-CHENG T. CHOI Dedicated to Michael Saunders s 70th birthday Abstract. While there is no lac of efficient Krylov

More information

S-Step and Communication-Avoiding Iterative Methods

S-Step and Communication-Avoiding Iterative Methods S-Step and Communication-Avoiding Iterative Methods Maxim Naumov NVIDIA, 270 San Tomas Expressway, Santa Clara, CA 95050 Abstract In this paper we make an overview of s-step Conjugate Gradient and develop

More information

Incomplete LU Preconditioning and Error Compensation Strategies for Sparse Matrices

Incomplete LU Preconditioning and Error Compensation Strategies for Sparse Matrices Incomplete LU Preconditioning and Error Compensation Strategies for Sparse Matrices Eun-Joo Lee Department of Computer Science, East Stroudsburg University of Pennsylvania, 327 Science and Technology Center,

More information

A Backward Stable Hyperbolic QR Factorization Method for Solving Indefinite Least Squares Problem

A Backward Stable Hyperbolic QR Factorization Method for Solving Indefinite Least Squares Problem A Backward Stable Hyperbolic QR Factorization Method for Solving Indefinite Least Suares Problem Hongguo Xu Dedicated to Professor Erxiong Jiang on the occasion of his 7th birthday. Abstract We present

More information

Error Bounds for Iterative Refinement in Three Precisions

Error Bounds for Iterative Refinement in Three Precisions Error Bounds for Iterative Refinement in Three Precisions Erin C. Carson, New York University Nicholas J. Higham, University of Manchester SIAM Annual Meeting Portland, Oregon July 13, 018 Hardware Support

More information

Gram-Schmidt Orthogonalization: 100 Years and More

Gram-Schmidt Orthogonalization: 100 Years and More Gram-Schmidt Orthogonalization: 100 Years and More September 12, 2008 Outline of Talk Early History (1795 1907) Middle History 1. The work of Åke Björck Least squares, Stability, Loss of orthogonality

More information

EECS 275 Matrix Computation

EECS 275 Matrix Computation EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 20 1 / 20 Overview

More information

Lecture 9: Krylov Subspace Methods. 2 Derivation of the Conjugate Gradient Algorithm

Lecture 9: Krylov Subspace Methods. 2 Derivation of the Conjugate Gradient Algorithm CS 622 Data-Sparse Matrix Computations September 19, 217 Lecture 9: Krylov Subspace Methods Lecturer: Anil Damle Scribes: David Eriksson, Marc Aurele Gilles, Ariah Klages-Mundt, Sophia Novitzky 1 Introduction

More information

Course Notes: Week 1

Course Notes: Week 1 Course Notes: Week 1 Math 270C: Applied Numerical Linear Algebra 1 Lecture 1: Introduction (3/28/11) We will focus on iterative methods for solving linear systems of equations (and some discussion of eigenvalues

More information

M.A. Botchev. September 5, 2014

M.A. Botchev. September 5, 2014 Rome-Moscow school of Matrix Methods and Applied Linear Algebra 2014 A short introduction to Krylov subspaces for linear systems, matrix functions and inexact Newton methods. Plan and exercises. M.A. Botchev

More information

GMRES ON (NEARLY) SINGULAR SYSTEMS

GMRES ON (NEARLY) SINGULAR SYSTEMS SIAM J. MATRIX ANAL. APPL. c 1997 Society for Industrial and Applied Mathematics Vol. 18, No. 1, pp. 37 51, January 1997 004 GMRES ON (NEARLY) SINGULAR SYSTEMS PETER N. BROWN AND HOMER F. WALKER Abstract.

More information

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic Applied Mathematics 205 Unit V: Eigenvalue Problems Lecturer: Dr. David Knezevic Unit V: Eigenvalue Problems Chapter V.4: Krylov Subspace Methods 2 / 51 Krylov Subspace Methods In this chapter we give

More information

Chapter 7. Iterative methods for large sparse linear systems. 7.1 Sparse matrix algebra. Large sparse matrices

Chapter 7. Iterative methods for large sparse linear systems. 7.1 Sparse matrix algebra. Large sparse matrices Chapter 7 Iterative methods for large sparse linear systems In this chapter we revisit the problem of solving linear systems of equations, but now in the context of large sparse systems. The price to pay

More information

ANONSINGULAR tridiagonal linear system of the form

ANONSINGULAR tridiagonal linear system of the form Generalized Diagonal Pivoting Methods for Tridiagonal Systems without Interchanges Jennifer B. Erway, Roummel F. Marcia, and Joseph A. Tyson Abstract It has been shown that a nonsingular symmetric tridiagonal

More information

Krylov Space Methods. Nonstationary sounds good. Radu Trîmbiţaş ( Babeş-Bolyai University) Krylov Space Methods 1 / 17

Krylov Space Methods. Nonstationary sounds good. Radu Trîmbiţaş ( Babeş-Bolyai University) Krylov Space Methods 1 / 17 Krylov Space Methods Nonstationary sounds good Radu Trîmbiţaş Babeş-Bolyai University Radu Trîmbiţaş ( Babeş-Bolyai University) Krylov Space Methods 1 / 17 Introduction These methods are used both to solve

More information

PERTURBED ARNOLDI FOR COMPUTING MULTIPLE EIGENVALUES

PERTURBED ARNOLDI FOR COMPUTING MULTIPLE EIGENVALUES 1 PERTURBED ARNOLDI FOR COMPUTING MULTIPLE EIGENVALUES MARK EMBREE, THOMAS H. GIBSON, KEVIN MENDOZA, AND RONALD B. MORGAN Abstract. fill in abstract Key words. eigenvalues, multiple eigenvalues, Arnoldi,

More information

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A.

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A. AMSC/CMSC 661 Scientific Computing II Spring 2005 Solution of Sparse Linear Systems Part 2: Iterative methods Dianne P. O Leary c 2005 Solving Sparse Linear Systems: Iterative methods The plan: Iterative

More information

Using Godunov s Two-Sided Sturm Sequences to Accurately Compute Singular Vectors of Bidiagonal Matrices.

Using Godunov s Two-Sided Sturm Sequences to Accurately Compute Singular Vectors of Bidiagonal Matrices. Using Godunov s Two-Sided Sturm Sequences to Accurately Compute Singular Vectors of Bidiagonal Matrices. A.M. Matsekh E.P. Shurina 1 Introduction We present a hybrid scheme for computing singular vectors

More information

Chapter 8 Cholesky-based Methods for Sparse Least Squares: The Benefits of Regularization

Chapter 8 Cholesky-based Methods for Sparse Least Squares: The Benefits of Regularization In L. Adams and J. L. Nazareth eds., Linear and Nonlinear Conjugate Gradient-Related Methods, SIAM, Philadelphia, 92 100 1996. Chapter 8 Cholesky-based Methods for Sparse Least Squares: The Benefits of

More information

WHEN MODIFIED GRAM-SCHMIDT GENERATES A WELL-CONDITIONED SET OF VECTORS

WHEN MODIFIED GRAM-SCHMIDT GENERATES A WELL-CONDITIONED SET OF VECTORS IMA Journal of Numerical Analysis (2002) 22, 1-8 WHEN MODIFIED GRAM-SCHMIDT GENERATES A WELL-CONDITIONED SET OF VECTORS L. Giraud and J. Langou Cerfacs, 42 Avenue Gaspard Coriolis, 31057 Toulouse Cedex

More information

Lecture 9 Least Square Problems

Lecture 9 Least Square Problems March 26, 2018 Lecture 9 Least Square Problems Consider the least square problem Ax b (β 1,,β n ) T, where A is an n m matrix The situation where b R(A) is of particular interest: often there is a vectors

More information

A short course on: Preconditioned Krylov subspace methods. Yousef Saad University of Minnesota Dept. of Computer Science and Engineering

A short course on: Preconditioned Krylov subspace methods. Yousef Saad University of Minnesota Dept. of Computer Science and Engineering A short course on: Preconditioned Krylov subspace methods Yousef Saad University of Minnesota Dept. of Computer Science and Engineering Universite du Littoral, Jan 19-3, 25 Outline Part 1 Introd., discretization

More information

ANY FINITE CONVERGENCE CURVE IS POSSIBLE IN THE INITIAL ITERATIONS OF RESTARTED FOM

ANY FINITE CONVERGENCE CURVE IS POSSIBLE IN THE INITIAL ITERATIONS OF RESTARTED FOM Electronic Transactions on Numerical Analysis. Volume 45, pp. 133 145, 2016. Copyright c 2016,. ISSN 1068 9613. ETNA ANY FINITE CONVERGENCE CURVE IS POSSIBLE IN THE INITIAL ITERATIONS OF RESTARTED FOM

More information

Numerical Methods. Elena loli Piccolomini. Civil Engeneering. piccolom. Metodi Numerici M p. 1/??

Numerical Methods. Elena loli Piccolomini. Civil Engeneering.  piccolom. Metodi Numerici M p. 1/?? Metodi Numerici M p. 1/?? Numerical Methods Elena loli Piccolomini Civil Engeneering http://www.dm.unibo.it/ piccolom elena.loli@unibo.it Metodi Numerici M p. 2/?? Least Squares Data Fitting Measurement

More information

ECS231 Handout Subspace projection methods for Solving Large-Scale Eigenvalue Problems. Part I: Review of basic theory of eigenvalue problems

ECS231 Handout Subspace projection methods for Solving Large-Scale Eigenvalue Problems. Part I: Review of basic theory of eigenvalue problems ECS231 Handout Subspace projection methods for Solving Large-Scale Eigenvalue Problems Part I: Review of basic theory of eigenvalue problems 1. Let A C n n. (a) A scalar λ is an eigenvalue of an n n A

More information

Contents. Preface... xi. Introduction...

Contents. Preface... xi. Introduction... Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism

More information

Arnoldi Methods in SLEPc

Arnoldi Methods in SLEPc Scalable Library for Eigenvalue Problem Computations SLEPc Technical Report STR-4 Available at http://slepc.upv.es Arnoldi Methods in SLEPc V. Hernández J. E. Román A. Tomás V. Vidal Last update: October,

More information

Last Time. Social Network Graphs Betweenness. Graph Laplacian. Girvan-Newman Algorithm. Spectral Bisection

Last Time. Social Network Graphs Betweenness. Graph Laplacian. Girvan-Newman Algorithm. Spectral Bisection Eigenvalue Problems Last Time Social Network Graphs Betweenness Girvan-Newman Algorithm Graph Laplacian Spectral Bisection λ 2, w 2 Today Small deviation into eigenvalue problems Formulation Standard eigenvalue

More information

Contribution of Wo¹niakowski, Strako²,... The conjugate gradient method in nite precision computa

Contribution of Wo¹niakowski, Strako²,... The conjugate gradient method in nite precision computa Contribution of Wo¹niakowski, Strako²,... The conjugate gradient method in nite precision computations ªaw University of Technology Institute of Mathematics and Computer Science Warsaw, October 7, 2006

More information

Notes on Some Methods for Solving Linear Systems

Notes on Some Methods for Solving Linear Systems Notes on Some Methods for Solving Linear Systems Dianne P. O Leary, 1983 and 1999 and 2007 September 25, 2007 When the matrix A is symmetric and positive definite, we have a whole new class of algorithms

More information

Inexactness and flexibility in linear Krylov solvers

Inexactness and flexibility in linear Krylov solvers Inexactness and flexibility in linear Krylov solvers Luc Giraud ENSEEIHT (N7) - IRIT, Toulouse Matrix Analysis and Applications CIRM Luminy - October 15-19, 2007 in honor of Gérard Meurant for his 60 th

More information

Linear Solvers. Andrew Hazel

Linear Solvers. Andrew Hazel Linear Solvers Andrew Hazel Introduction Thus far we have talked about the formulation and discretisation of physical problems...... and stopped when we got to a discrete linear system of equations. Introduction

More information

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES ITERATIVE METHODS BASED ON KRYLOV SUBSPACES LONG CHEN We shall present iterative methods for solving linear algebraic equation Au = b based on Krylov subspaces We derive conjugate gradient (CG) method

More information

Algorithms that use the Arnoldi Basis

Algorithms that use the Arnoldi Basis AMSC 600 /CMSC 760 Advanced Linear Numerical Analysis Fall 2007 Arnoldi Methods Dianne P. O Leary c 2006, 2007 Algorithms that use the Arnoldi Basis Reference: Chapter 6 of Saad The Arnoldi Basis How to

More information

Augmented GMRES-type methods

Augmented GMRES-type methods Augmented GMRES-type methods James Baglama 1 and Lothar Reichel 2, 1 Department of Mathematics, University of Rhode Island, Kingston, RI 02881. E-mail: jbaglama@math.uri.edu. Home page: http://hypatia.math.uri.edu/

More information

Notes on Eigenvalues, Singular Values and QR

Notes on Eigenvalues, Singular Values and QR Notes on Eigenvalues, Singular Values and QR Michael Overton, Numerical Computing, Spring 2017 March 30, 2017 1 Eigenvalues Everyone who has studied linear algebra knows the definition: given a square

More information

Iterative methods for Linear System

Iterative methods for Linear System Iterative methods for Linear System JASS 2009 Student: Rishi Patil Advisor: Prof. Thomas Huckle Outline Basics: Matrices and their properties Eigenvalues, Condition Number Iterative Methods Direct and

More information

EIGIFP: A MATLAB Program for Solving Large Symmetric Generalized Eigenvalue Problems

EIGIFP: A MATLAB Program for Solving Large Symmetric Generalized Eigenvalue Problems EIGIFP: A MATLAB Program for Solving Large Symmetric Generalized Eigenvalue Problems JAMES H. MONEY and QIANG YE UNIVERSITY OF KENTUCKY eigifp is a MATLAB program for computing a few extreme eigenvalues

More information