A Parallel Multisplitting Solution of the Least Squares Problem


NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS
Numer. Linear Algebra Appl., 5 (1998)

R. A. Renaut
Department of Mathematics, Arizona State University, Tempe, AZ, USA

The linear least squares problem, $\min_x \|Ax - b\|_2$, is solved by applying a multisplitting (MS) strategy in which the system matrix is decomposed by columns into $p$ blocks. The $b$ and $x$ vectors are partitioned consistently with the matrix decomposition. The global least squares problem is then replaced by a sequence of local least squares problems which can be solved in parallel by MS. In MS the solutions to the local problems are recombined using weighting matrices to pick out the appropriate components of each subproblem solution. A new two-stage algorithm which optimizes the global update each iteration is also given. For this algorithm the updates are obtained by finding the optimal update with respect to the weights of the recombination. For the least squares problem presented, the global update optimization can also be formulated as a least squares problem of dimension $p$. Theoretical results are presented which prove the convergence of the iterations. Numerical results which detail the iteration behavior relative to subproblem size, convergence criteria and recombination techniques are given. The two-stage MS strategy is shown to be effective for near-separable problems. © 1998 John Wiley & Sons, Ltd.

KEY WORDS: least squares; QR factorization; iterative solvers; parallel algorithms; multisplitting

Correspondence to R. A. Renaut, Department of Mathematics, Arizona State University, Tempe, AZ, USA.
Received 25 June; Revised 1 May 1997

1. Introduction

We consider the solution of the overdetermined system of linear equations

$$Ax = b \qquad (1.1)$$

where $A$ is an $m \times n$ ($m \geq n$) real matrix of rank $n$, $x$ is a vector of length $n$ and $b$ is a vector of length $m$. Direct solutions of this system can be obtained by a least squares algorithm for

$$\min_{x \in \mathbb{R}^n} \|Ax - b\|_2 \qquad (1.2)$$

via the QR factorization of $A$ [9]. Typical methods for computing the QR decomposition use Householder transformations, Givens transformations, or the Gram–Schmidt process. One possible approach to the parallelization of these least squares algorithms therefore involves the determination of parallel algorithms for orthogonal transformations [4]. Another straightforward method symmetrizes (1.1) by forming the normal equations

$$A^TAx = A^Tb \qquad (1.3)$$

This system can be solved directly using Gaussian elimination, or iteratively using any standard method such as conjugate gradients or a Krylov-subspace algorithm [1]. Parallelization of these algorithms can proceed at the matrix operation level, in which appropriate data mapping allows for efficient realizations of matrix–vector update operations [15].

Each of the parallelization strategies mentioned above has the advantage that all convergence characteristics of the serial algorithm are maintained because the serial algorithm itself is not modified. On the other hand, considered separately, each detail of the algorithm may pose conflicting demands for efficient parallelism. Also, the approach is potentially very time intensive because of lack of portability across architectures. The alternative approach considered here is to develop new algorithms which have great potential for parallelism, are essentially architecture-independent and use as much serial expertise as possible. The multisplitting (MS) philosophy introduced by O'Leary and White [14] for the solution of regular systems of equations meets both goals and could be applied at the system level for the solution of the normal equations (1.3). A direct implementation, however, requires the formation of the operator $A^TA$, which is not desirable. Instead we propose least squares MS algorithms for the solution of (1.2). Specifically, MS is an iterative technique which uses domain partitioning to replace a large-scale problem by a set of smaller subproblems, each of which can be solved independently in parallel. The success of MS relies on an appropriate recombination strategy of the subproblem solutions to give the global solution.

This paper presents three least squares (LS) algorithms based on the MS approach: (i) an LS algorithm with a standard MS approach to solution recombination, (ii) an iterative refinement implementation of the LS algorithm, (iii) a two-stage MSLS algorithm which solves a second LS problem to determine the optimal weights at the recombination phase. Theoretical properties of these algorithms are determined and an estimate of their relative parallel computational costs, ignoring communication, is presented. A performance evaluation via numerical implementation is also provided.

The format of the paper is as follows. In Section 2 we review the linear MS algorithm introduced in [14]. The new algorithms designed for the least squares problem are also presented and estimates of their computational costs are provided. A theoretical analysis of the convergence properties of the algorithms is detailed in Section 3. Results of some numerical tests are reported in Section 4. Finally, conclusions and suggestions for future directions of the research are discussed in Section 5.
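As a point of reference for the two solution routes just described, the following sketch (hypothetical data; NumPy assumed) solves a small instance of (1.2) both via the QR factorization of $A$ and via the normal equations (1.3), and confirms that the two solutions agree for a well-conditioned problem:

```python
import numpy as np

# Hypothetical small overdetermined system: A is m x n with full column rank.
rng = np.random.default_rng(0)
m, n = 20, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Route 1: QR factorization of A, then x = R^{-1} Q^T b.
Q, R = np.linalg.qr(A)              # reduced QR: Q is m x n, R is n x n
x_qr = np.linalg.solve(R, Q.T @ b)

# Route 2: the normal equations (1.3), which form A^T A explicitly.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

print(np.allclose(x_qr, x_ne))      # True for this well-conditioned example
```

Forming $A^TA$ squares the condition number of the problem, which is one reason the MS algorithms below work with the least squares formulation (1.2) directly.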

2. Description of multisplitting

2.1. Multisplitting for Ax = b

Iterative methods based on a single splitting, $A = M - N$, are well known [16]. Multisplittings generalize the splitting to take advantage of the computational capabilities of parallel computers. A multisplitting of $A$ is defined as follows:

Definition 2.1. Linear multisplitting (LMS). Given a matrix $A \in \mathbb{R}^{n\times n}$ and a collection of matrices $M^{(j)}, N^{(j)}, E^{(j)} \in \mathbb{R}^{n\times n}$, $j = 1{:}p$, satisfying

(i) $A = M^{(j)} - N^{(j)}$ for each $j$, $j = 1{:}p$,
(ii) $M^{(j)}$ is regular, $j = 1{:}p$,
(iii) $E^{(j)}$ is a non-negative diagonal matrix, $j = 1{:}p$, and $\sum_{j=1}^{p} E^{(j)} = I$.

Then the collection of triples $(M^{(j)}, N^{(j)}, E^{(j)})$, $j = 1{:}p$, is called a multisplitting of $A$ and the LMS method is defined by the iteration:

$$x^{k+1} = \sum_{j=1}^{p} E^{(j)}(M^{(j)})^{-1}(N^{(j)}x^k + b), \quad k = 1, \ldots \qquad (2.1)$$

The advantage of this method is that at each iteration there are $p$ independent problems of the kind

$$M^{(j)}y_j^k = N^{(j)}x^k + b, \quad j = 1{:}p \qquad (2.2)$$

where $y_j^k$ represents the solution to the local problem. The work for each equation in (2.2) is assigned to one (or a set of) processor(s) and communication is required only to produce the update given in (2.1). In general, some (most) of the diagonal elements in $E^{(j)}$ are zero and therefore the corresponding components of $y_j^k$ need not be calculated. If the $y_j^k$ are disjoint, $j = 1{:}p$, the method corresponds to block Jacobi and is called non-overlapping. Then the diagonal matrices $E^{(j)}$ have only zero and one entries. For overlapped subdomains the elements in $E^{(j)}$ need not be just zeros and ones, but Frommer and Pohl [7] showed that the benefit of overlap is the inclusion of extra variables in the minimization for the local variables, and that the updated values on the overlapped portion of the domain should not be utilized, i.e., the weights are still zeros or ones.

Before we continue to develop the MS approach for the solution of the least squares problem it is useful to realize that MS can be seen as a domain decomposition algorithm, in which the update given by (2.2) only provides an updated solution for a portion of the domain. Specifically, the variable domain $x \in \mathbb{R}^n$ is partitioned according to $x = (x_1, x_2, \ldots, x_p)^T$, where each subdomain has $x_i \in \mathbb{R}^{n_i}$ and $\sum_{i=1}^{p} n_i = n$, without overlap of subdomains. The solution $y_j^{k+1}$ is then the solution with updated values only on the $j$th portion of the partition. The same idea is applied to define a least squares multisplitting (LSMS) approach.
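The LMS iteration (2.1) is easy to simulate serially. The sketch below runs under illustrative assumptions (a strongly diagonally dominant $A$, non-overlapping blocks, $M^{(j)}$ equal to $A$ on the $j$th diagonal block and the identity elsewhere, and 0/1 weighting matrices $E^{(j)}$), which reproduce the block Jacobi special case described above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 12, 3
A = rng.standard_normal((n, n)) + 20.0 * np.eye(n)  # strongly diagonally dominant, so regular
b = rng.standard_normal(n)
blocks = np.array_split(np.arange(n), p)             # index sets of the p subdomains

x = np.zeros(n)
for k in range(100):
    x_new = np.zeros(n)
    for idx in blocks:
        M = np.eye(n)                                # M^(j) is the identity off the block ...
        M[np.ix_(idx, idx)] = A[np.ix_(idx, idx)]    # ... and the diagonal block of A on it
        N = M - A
        y = np.linalg.solve(M, N @ x + b)            # one of the p independent solves (2.2)
        x_new[idx] = y[idx]                          # E^(j) keeps only the jth block in (2.1)
    x = x_new

print(np.linalg.norm(A @ x - b))                     # near zero: the LMS iterates converge
```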

2.2. Linear least squares

For (1.2) the matrix $A$ is partitioned into blocks of columns consistently with the decomposition of $x$ into blocks as $A = (A_1, A_2, \ldots, A_p)$, where each $A_i \in \mathbb{R}^{m\times n_i}$. This is not unlike the row projection methods as described, for example, in [3], except that there the decomposition of $A$ is into blocks of rows. The two approaches are not equivalent. With the column decomposition $Ax = \sum_{i=1}^{p} A_ix_i$, (1.2) can be replaced by the subproblems

$$\min_{y_i\in\mathbb{R}^{n_i}} \|A_iy_i - b_i(x)\|_2, \quad 1 \le i \le p$$

where $b_i(x) = b - \sum_{j\neq i} A_jx_j = b - Ax + A_ix_i$. Clearly each of these subproblems is also a linear least squares problem, amenable to solution by QR factorization of the submatrix $A_i$. Equivalently, denote the solution at iteration $k$ by $x^k = (x_1^k, x_2^k, \ldots, x_p^k)$; then the solution at iteration $k+1$ is found from the solution of the local subproblems according to

$$y_i^{k+1} = \arg\min_{y_i\in\mathbb{R}^{n_i}} \|A_iy_i - b_i(x^k)\|_2, \quad 1 \le i \le p \qquad (2.3)$$

The updated local solution to the global problem is given by

$$x^{k+1} = \sum_{i=1}^{p} \alpha_i^{k+1}\tilde x_i^{k+1} \qquad (2.4)$$

where

$$\tilde x_i^{k+1} = (x_1^k, x_2^k, \ldots, x_{i-1}^k, y_i^{k+1}, x_{i+1}^k, \ldots, x_p^k) \qquad (2.5)$$

the non-negative weights satisfy $\sum_{i=1}^{p} \alpha_i^{k+1} = 1$ and the solutions of the subproblems (2.3) are denoted by $y_i^{k+1}$. This update equation is still valid for overlapped domains with the one–zero weighting scheme, except that the notation has to be modified in (2.3) to indicate that (2.3) is solved with respect to a larger block and that $y_i^{k+1}$ in (2.5) is the update restricted to the local domain. For block $i$ of (2.4)

$$x_i^{k+1} = \alpha_i^{k+1}y_i^{k+1} + \sum_{j=1,\,j\neq i}^{p} \alpha_j^{k+1}x_i^k = \alpha_i^{k+1}y_i^{k+1} + (1 - \alpha_i^{k+1})x_i^k \qquad (2.6)$$

This is completely local and can be rewritten as

$$x_i^{k+1} = x_i^k + \alpha_i^{k+1}(y_i^{k+1} - x_i^k) = x_i^k + \alpha_i^{k+1}\delta_i^{k+1} \qquad (2.7)$$

where now $\delta_i^{k+1} = y_i^{k+1} - x_i^k$ is the step taken on partition $i$.
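To make (2.3) and (2.7) concrete, the following sketch (hypothetical data; a serial stand-in for one parallel sweep) forms each local right-hand side $b_i(x) = b - Ax + A_ix_i$, solves the local problem through precomputed QR factors of the $A_i$, and steps each block by (2.7) with $\alpha_i = 1/p$:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, p = 30, 12, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
blocks = np.array_split(np.arange(n), p)
QRs = [np.linalg.qr(A[:, idx]) for idx in blocks]  # factor each column block A_i once

x = np.zeros(n)
alpha = 1.0 / p
r = b - A @ x                                      # fixed during the sweep: all blocks see x^k
for i, idx in enumerate(blocks):
    b_i = r + A[:, idx] @ x[idx]                   # b_i(x) = b - Ax + A_i x_i
    Q_i, R_i = QRs[i]
    y_i = np.linalg.solve(R_i, Q_i.T @ b_i)        # local least squares solve (2.3)
    x[idx] += alpha * (y_i - x[idx])               # update (2.7) with alpha_i = 1/p

# Starting from x = 0 the new residual norm is no larger than ||b||.
print(np.linalg.norm(b), np.linalg.norm(b - A @ x))
```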

The update of $b_i(x^k)$ in (2.3) can then be expressed as

$$b_i(x^{k+1}) = b_i(x^k) - \sum_{j=1,\,j\neq i}^{p} \alpha_j^{k+1}A_j\delta_j^{k+1} = b_i(x^k) - \sum_{j=1,\,j\neq i}^{p} \alpha_j^{k+1}B_j^{k+1} \qquad (2.8)$$

where $B_j^{k+1} = A_j\delta_j^{k+1}$. The overlapped update of $b_i(x^{k+1})$ follows similarly but does require communication. The basic LSMS algorithm using $p$ processes follows:

Algorithm 2.1. LSMS
For all processes $i$, $1 \le i \le p$,
  calculate $Q_iR_i = A_i$,
  initialize $y_i^0 = x_i^0$, $k = 0$, $\alpha_i^0 = 0$, $\alpha_i^k = \alpha = 1/p$, $b_i(x^0) = b$.
  While not converged, $k = k + 1$
    calculate $A_i\delta_i = B_i$            ! matrix–vector update
    communicate $B_i$ to all processors      ! global communication
    update $b_i$ via (2.8)
    find $y_i$ to solve (2.3)                ! solve local least squares
    $\delta_i = y_i - x_i$
    $x_i = x_i + \alpha\delta_i$             ! update $x_i$
    test for convergence locally
    communicate convergence result to all processors
  end while.
End

This algorithm is highly parallel and completely load-balanced when the problem size is the same for each process. For the overlapped case, modifications of the algorithm are necessary, but because of the zero–one weightings used, these modifications are minor. Testing for convergence can be carried out either locally or globally. In the former case it is only necessary to share a logical variable with all the other processes. Otherwise the vector $x$ must be accumulated and a global check performed. Observe that this algorithm is presented as a slave-only model. A master–slave model requires only minor modification.

From (2.8) and defining the residual $r = b - Ax$,

$$r(x^{k+1}) = r(x^k) - \sum_{j=1}^{p} \alpha_j^{k+1}A_j\delta_j^{k+1} \qquad (2.9)$$

it is easy to see that iterative refinement requires only a minor modification of Algorithm 2.1. Specifically, after the first iteration, the update of $b_i$ by (2.8) can be replaced by the update of $r$ from (2.9), and in the update (2.7) we use $\delta_i^{k+1} = y_i^{k+1}$. The communication and computation costs are unchanged, but the local least squares problem (2.3) is replaced by

$$y_i^{k+1} = \arg\min_{y_i\in\mathbb{R}^{n_i}} \|A_iy_i - r(x^k)\|_2$$

for which the right-hand side is now the same for all subproblems. This algorithm is the MS analog of the iterative refinement procedure for least squares introduced by Golub [8]. In light of the investigation by Higham [13], and to give a fair comparison between methods, we have chosen to implement the iterative refinement (LSMSIR) using single precision residuals. Golub and Wilkinson [10] revealed, however, that the procedure is satisfactory only when the true residual vector is sufficiently small. We might expect, therefore, that LSMSIR will not offer improvement compared with LSMS. This is confirmed by the numerical experiments presented in Section 4.
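Iterating the sweep shown earlier gives a serial simulation of Algorithm 2.1. In this sketch (hypothetical, nearly column-orthogonal data, i.e., a near-separable problem in the paper's sense; the communicated updates of the $b_i$ are replaced by recomputing the global residual):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, p = 40, 12, 4
Q0, _ = np.linalg.qr(rng.standard_normal((m, n)))
A = Q0 + 0.1 * rng.standard_normal((m, n)) / np.sqrt(m)  # nearly orthogonal columns
b = rng.standard_normal(m)
blocks = np.array_split(np.arange(n), p)
QRs = [np.linalg.qr(A[:, idx]) for idx in blocks]
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)     # reference solution for the error check

x = np.zeros(n)
alpha = 1.0 / p                                    # fixed weights summing to one
for k in range(500):
    r = b - A @ x                                  # stands in for the communicated b_i updates
    delta = np.zeros(n)
    for i, idx in enumerate(blocks):
        Q_i, R_i = QRs[i]
        y_i = np.linalg.solve(R_i, Q_i.T @ (r + A[:, idx] @ x[idx]))
        delta[idx] = y_i - x[idx]
    x += alpha * delta                             # simultaneous update (2.7) on every block
    if np.linalg.norm(delta) <= 1e-10 * max(1.0, np.linalg.norm(x)):
        break

print(k, np.linalg.norm(x - x_star))               # converges to the least squares solution
```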

Equation (2.7) suggests that convergence might be improved by use of a global update using a line search procedure. In particular, in [18] it was suggested that one choice would be to set $\alpha_i^{k+1} = \alpha = 1$, which amounts to the update $x_i^{k+1} = y_i^{k+1}$. Another suggested improvement employs a one-dimensional line search dependent on $\alpha = \alpha^{k+1}$, or a $p$-dimensional minimization of the residual over the parameters $\{\alpha_i^{k+1},\ 1 \le i \le p\}$. In the latter case this can be formulated as a least squares minimization

$$\min_{\alpha\in\mathbb{R}^p} \|D\alpha - r(x^k)\|_2 \qquad (2.10)$$

where $D \in \mathbb{R}^{m\times p}$ has columns $d_j^{k+1} = A_j\delta_j^{k+1}$. For $p \ll n$ this represents the non-parallel overhead of a least squares solve, requiring the formation of the QR factorization of $D$, but it has the potential to improve the speed of convergence of the iteration.

Algorithm 2.2. Optimal recombination ORLSMS
For all processors $i$, $1 \le i \le p$,
  calculate $Q_iR_i = A_i$,
  initialize $x_i$, $k = 1$,
  calculate $A_ix_i = B_i$,
  form $b_i$ via (2.8) and $r = b - B$, where $B = \sum_{j=1}^{p} B_j$,
  find $y_i$ to solve (2.3), $\delta_i = y_i - x_i$.
  While not converged
    calculate $A_i\delta_i = d_i$
    communicate $d_i$ to all processors
    calculate $Q_DR_D = D$ and solve (2.10) for $\alpha$
    update $x_i = x_i + \alpha_i\delta_i$
    test convergence
    update $B_i = B_i + \alpha_id_i$
    communicate $B_i$ to all processors
    form $b_i$ via (2.8) and $r = b - B$
    find $y_i$ to solve (2.3)
    $\delta_i = y_i - x_i$,
  end while.
End

Note that in this version of the algorithm we have employed a redundant update in which every process solves the outer least squares problem for $\alpha$. This does have the advantage that each process can keep a record of the global update, provided that the initial guess is known to each process.
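A serial sketch of one combined parallel/synchronization step of Algorithm 2.2 (hypothetical data): the local steps $\delta_i$ are computed as before, the columns $d_j = A_j\delta_j$ are gathered into $D$, and the $p$-dimensional problem (2.10) is solved for the weights:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, p = 40, 12, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
blocks = np.array_split(np.arange(n), p)

x = np.zeros(n)
r = b - A @ x

# Parallel stage: local solves (2.3) give the block steps delta_i.
deltas = []
for idx in blocks:
    y_i, *_ = np.linalg.lstsq(A[:, idx], r + A[:, idx] @ x[idx], rcond=None)
    deltas.append(y_i - x[idx])

# Synchronization stage: gather d_j = A_j delta_j and solve (2.10) for the weights.
D = np.column_stack([A[:, idx] @ d for idx, d in zip(blocks, deltas)])
alpha, *_ = np.linalg.lstsq(D, r, rcond=None)      # a p-dimensional least squares problem

for idx, d, a in zip(blocks, deltas, alpha):
    x[idx] += a * d                                # optimally weighted update of each block

print(np.linalg.norm(b - A @ x) <= np.linalg.norm(r))  # True: the residual cannot grow
```

Since $\alpha = 0$ is feasible in (2.10), the recombined residual $r - D\alpha$ can never exceed the previous one; this monotonicity reappears in the convergence analysis of Section 3.2.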

The per iteration communication cost is now two global vector exchanges. Hence, the per iteration communication costs are twice those without the optimal recombination. Observe, also, that the OR algorithm can be modified to act on the IR algorithm. Again a master–slave mode of the algorithm follows in a straightforward manner.

2.3. Computational performance analysis

We have already remarked on the communication costs of the algorithms presented in the previous section. Our intent here is to evaluate the potential parallelism in these MS algorithms, without consideration of communication bandwidths, cache size or other architecture-dependent factors. Therefore we have to define a measure of parallel efficiency. To do this we estimate, to the highest order, the computation costs associated with each algorithm as compared with a direct solve by the least squares solution of the whole system. We assume that the QR factorization is calculated by Householder transformations, which are a little cheaper than Givens rotations. The serial cost of the QR solution of (1.2) is given by

$$C_S = 2n^2(m - n/3) + mn + n^2$$

where the first term is for the determination of $R$, the second for the update of $b$ and the final term for the back substitution to give $x$. The per process cost for Algorithm 2.1 is

$$C_P = 2n_i^2(m - n_i/3) + K(2mn_i + n_i^2)$$

where $K$ is the total number of iterations required to achieve a specified convergence criterion. The first term represents the formation of the QR factorization of $A_i$. Costs to first order in $m$ or $n$ are ignored. When IR is incorporated, the costs are unchanged, provided the residual is calculated to the same precision as the remainder of the operations. Algorithm 2.2 does have a greater basic iteration cost because of the QR factorization of $D$. Hence in this case

$$C_{POR} = 2n_i^2(m - n_i/3) + K(2mn_i + n_i^2 + 2p^2(m - p/3) + mp + p^2)$$

The percentage parallel efficiency achieved is given by

$$E = 100\,\frac{C_S}{pC_P} \qquad (2.11)$$

where $p$ is the number of processes used in the calculation. For the OR algorithms $C_P$ is replaced by $C_{POR}$. Measurements of these efficiencies are given in our presentation of the numerical results. Note that when overlap is introduced into the systems, the formulae are still valid but with $n_i$ replaced by $n_i = n/p + o$, where $o$ determines the amount of the overlap.
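The cost model is straightforward to evaluate. The helper below is a sketch (the factor of 100 reflects that (2.11) is described as a percentage; the example numbers are purely illustrative) for estimating the efficiency of Algorithm 2.1 from the problem size, block size, overlap and iteration count:

```python
def efficiency(m, n, p, K, o=0):
    """Percentage parallel efficiency E of LSMS against a direct QR solve, per (2.11)."""
    n_i = n // p + o                                     # local block size, n_i = n/p + o
    C_S = 2 * n**2 * (m - n / 3) + m * n + n**2          # serial QR solve of (1.2)
    C_P = 2 * n_i**2 * (m - n_i / 3) + K * (2 * m * n_i + n_i**2)  # per-process LSMS cost
    return 100.0 * C_S / (p * C_P)

# E.g. a 1000 x 400 problem on 8 processes converging in 50 iterations, no overlap.
print(efficiency(1000, 400, 8, 50))
```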

3. Convergence

3.1. Convergence of the linear LSMS algorithm

In order to investigate the convergence of Algorithm 2.1 we need to determine the linear iterative scheme satisfied by the global update $x^{k+1}$. We shall assume that the system matrix $A$ in (1.1) has full rank, so that the solution to (1.2) exists and is unique. Because the matrix $A$ is of full column rank, so are the submatrices $A_i$. Therefore the solution of the least squares problem (2.3) exists, is unique and is given by

$$y_i = (A_i^TA_i)^{-1}A_i^Tb - (A_i^TA_i)^{-1}A_i^T\sum_{j=1,\,j\neq i}^{p} A_jx_j^k$$

Hence the $i$th component of the global update (2.4) can be written

$$x_i^{k+1} = \sum_{j=1}^{p} C_{ij}x_j^k + \alpha_i(A_i^TA_i)^{-1}A_i^Tb$$

where

$$C_{ij} = \begin{cases} (1 - \alpha_i)I_{n_i}, & i = j \\ -\alpha_i(A_i^TA_i)^{-1}A_i^TA_j, & i \neq j \end{cases} \qquad (3.1)$$

Here $I_{n_i}$ is the identity matrix of order $n_i$, the matrices $C_{ij}$ are the $ij$ blocks of a matrix $C$, with block size $n_i \times n_j$, and $C \in \mathbb{R}^{n\times n}$, consistent with $x \in \mathbb{R}^n$. Equivalently, (2.4) becomes

$$x^{k+1} = Cx^k + \tilde b \qquad (3.2)$$

where $\tilde b_i = \alpha_i(A_i^TA_i)^{-1}A_i^Tb$. Moreover, $C = C(\alpha)$, so that convergence depends on the parameter $\alpha$. It is easily seen that iterative refinement for Algorithm 2.1 leads exactly to (3.2). The convergence behavior of both algorithms is thus the same. They differ only in implementation and, consequently, in the effects of finite precision arithmetic.

Theorem 3.1. The iterative scheme defined by (3.2) with $\alpha_i = \alpha = 1$ is a block Jacobi iterative scheme for the solution of the normal equations (1.3).

Proof. Set $\alpha = 1$ in (3.1). Then it is clear that the equivalent form of (3.2) is

$$Mx^{k+1} = Nx^k + \hat b \qquad (3.3)$$

where $M$ is a block diagonal matrix with diagonal blocks $A_i^TA_i$, $N$ is given by

$$N_{ij} = \begin{cases} 0, & i = j \\ -A_i^TA_j, & i \neq j \end{cases}$$

and $\hat b = A^Tb$. Therefore (3.2) solves the equation

$$(M - N)x = \hat b$$

and from (3.1), $M - N = A^TA$ and $\hat b = A^Tb$.
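Theorem 3.1 is easy to check numerically. The sketch below (hypothetical data) assembles the block diagonal $M$ and the blocks $-A_i^TA_j$ of $N$ from a column partitioning and confirms that $M - N = A^TA$:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, p = 25, 9, 3
A = rng.standard_normal((m, n))
G = A.T @ A                                      # the normal equations matrix
blocks = np.array_split(np.arange(n), p)

M = np.zeros((n, n))
N = np.zeros((n, n))
for i, idx in enumerate(blocks):
    M[np.ix_(idx, idx)] = G[np.ix_(idx, idx)]    # diagonal blocks A_i^T A_i
    for j, jdx in enumerate(blocks):
        if i != j:
            N[np.ix_(idx, jdx)] = -G[np.ix_(idx, jdx)]  # off-diagonal blocks -A_i^T A_j

print(np.allclose(M - N, G))                     # True: M - N = A^T A, as in Theorem 3.1
```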

Corollary 3.1. The iterative scheme defined by (3.2) for fixed $\alpha$, $0 < \alpha < 1$, is a relaxed block Jacobi scheme for the solution of the normal equations (1.3).

The condition for the convergence of (3.2) when $\alpha = 1$ is now immediate, because the system matrix $A^TA$ is symmetric and positive definite (SPD) (see Corollary 5.47 in Chapter 7 of [2]).

Theorem 3.2. The iteration defined by (3.2) with $\alpha_i = \alpha = 1$ converges for any initial vector $x^0$ if and only if $M + N$ is positive definite.

Moreover, the Gauss–Seidel implementation of Algorithm 2.1 necessarily converges, because convergence for successive-over-relaxation (SOR) is given by Corollary 5.48 in Chapter 7 of [2].

Theorem 3.3. The block SOR method converges for all $0 < \alpha < 2$.

It is now helpful to introduce the notation $\bar\mu = \rho(M^{-1}N)$, the spectral radius of $M^{-1}N$; $\mu \in \sigma(M^{-1}N)$, an element in the spectrum of $M^{-1}N$; and $\mu_{\min}$, the smallest eigenvalue of $M^{-1}N$.

Lemma 3.1. All the eigenvalues $\mu$ of $M^{-1}N$ satisfy $|\mu| < 1$.

Proof. The system matrix defined by the normal equations is SPD. $M$ is also SPD and the iteration with $\alpha = 1$ is symmetrizable, i.e., there exists a matrix $W$, $\det W \neq 0$, such that $W(I - M^{-1}N)W^{-1}$ is SPD [11]. In this case a choice for $W$ is $W = M^{1/2}$. Therefore, by the corresponding theorem in [11], the eigenvalues of $M^{-1}N$ are real and satisfy $|\mu| < 1$.

Theorem 3.4. The relaxed iteration converges for any positive $\alpha$ satisfying

$$0 < \alpha < \frac{2}{1 - \mu_{\min}}$$

Proof. The result follows from the observation that the iteration matrix of the relaxed block Jacobi iteration is given by

$$H = (1 - \alpha)I + \alpha M^{-1}N$$

Therefore if $\lambda \in \sigma(H)$, we have $\lambda = 1 - \alpha(1 - \mu)$, and $\rho(H) < 1$ if and only if $0 < \alpha < 2/(1 - \mu_{\min})$.

By Theorem 3.2, $\rho(M^{-1}N) < 1$ when $M + N$ is positive definite, and therefore we can conclude $\mu_{\min} > -1$ in the above to give:

Corollary 3.2. The relaxed block Jacobi iteration converges if $M + N$ is positive definite and $0 < \alpha < 1$.
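The bound of Theorem 3.4 can be explored with the same construction. This sketch (hypothetical data, reusing the $M$, $N$ assembly above) computes $\mu_{\min}$ and the spectral radius of the relaxed iteration matrix $H = (1-\alpha)I + \alpha M^{-1}N$ for one $\alpha$ inside and one outside $2/(1-\mu_{\min})$:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, p = 25, 9, 3
A = rng.standard_normal((m, n))
G = A.T @ A
blocks = np.array_split(np.arange(n), p)

M = np.zeros((n, n))
for idx in blocks:
    M[np.ix_(idx, idx)] = G[np.ix_(idx, idx)]
J = np.linalg.solve(M, M - G)                    # M^{-1} N, since N = M - A^T A

mu_min = np.linalg.eigvals(J).real.min()         # the eigenvalues are real by Lemma 3.1
bound = 2.0 / (1.0 - mu_min)
for alpha in (0.5 * bound, 1.1 * bound):
    H = (1.0 - alpha) * np.eye(n) + alpha * J    # relaxed block Jacobi iteration matrix
    rho = np.abs(np.linalg.eigvals(H)).max()
    print(alpha, rho)                            # rho < 1 inside the bound, rho > 1 outside
```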

Remark 1. The row projection methods [3] use block algorithms to solve the set of equations

$$AA^Ty = b, \qquad x = A^Ty$$

with and without preconditioning. On the other hand, the column decomposition introduced here solves the normal equations.

3.2. Convergence of the ORLSMS algorithm

Unlike Algorithm 2.1, the convergence of Algorithm 2.2 cannot be investigated by the determination of a linear iteration for $x$. Rather, Algorithm 2.2 needs to be seen as a procedure for the minimization of the non-linear function $f(x) = \|Ax - b\|_2^2$. As such, Algorithm 2.2 can be interpreted as a modification of the parallel variable distribution (PVD) algorithm for non-linear functions, $f(x)$, introduced by Ferris and Mangasarian [6]. In the PVD, minimization of $f$ occurs in two stages, a parallel stage and a synchronization stage. The former corresponds to the determination of the parallel solution of the local problem (2.3), but with the additional local update of the non-local variables by a scaling of the search direction for those variables. These search directions are a set of vectors $d^k$ for iterates $x^k$, usually given by $d^k = \nabla f(x^k)/\|\nabla f(x^k)\|$. A version for which these search directions are taken to be zero is denoted by PVD0. In the synchronization stage, $x^{k+1}$ is updated via the minimization of $f$, but now with respect to the weightings for the linear combination of $x^k$ with the local solutions $\tilde x_i^{k+1}$. The minimization is constrained by the requirement that the update $x^{k+1}$ is a strictly convex combination of $x^k$ and the $\tilde x_i^{k+1}$. This ensures $f(x^{k+1}) < f(x^k)$ and hence a decrease in the objective function each iteration. Convergence of the PVD algorithm for $f \in LC_K^1(\mathbb{R}^n)$ is given by Theorem 2.1 in [6]. Here the set $LC_K^1(\mathbb{R}^n)$ is the set of functions with Lipschitz continuous first partial derivatives on $\mathbb{R}^n$ with Lipschitz constant $K$.

Theorem 3.5. For a bounded sequence $\{d^k\}$, either the sequence $\{x^{k+1}\}$ terminates at a stationary point $x^k$, i.e., a point at which $\nabla f(x) = 0$, or each of its accumulation points is stationary and $\lim_{k\to\infty}\nabla f(x^k) = 0$.

The proof of this theorem employs not only the requirement that $f$ has a Lipschitz continuous gradient and that the sequence $\{d^k\}$ is bounded, but also that in the synchronization step $f(x^{k+1}) \le \frac{1}{p}\sum_{l=1}^{p} f(\tilde x_l^{k+1})$, because of the convex update of $x^k$. In Algorithm 2.2 the search directions are zero, and thus ORLSMS is actually a version of PVD0. Furthermore, for $f(x) = \|Ax - b\|_2^2$ we have $\|\nabla f(y) - \nabla f(x)\|_2 = \|2A^TA(y - x)\|_2$ and $f \in LC_K^1(\mathbb{R}^n)$ with $K = 2\rho(A^TA)$. To prove convergence for ORLSMS it therefore only remains to check that the modification of the synchronization stage employed in Algorithm 2.2 satisfies the assumption $f(x^{k+1}) \le \frac{1}{p}\sum_{l=1}^{p} f(\tilde x_l^{k+1})$ used in the proof of Theorem 3.5. But at synchronization in Algorithm 2.2, $x$ is updated by (2.4), for which $\|r(\tilde x_l^{k+1})\|_2^2 = f(\tilde x_l^{k+1}) < f(x^k)$. Therefore, this condition is necessarily satisfied; otherwise $\frac{1}{p}\sum_{l=1}^{p} f(\tilde x_l^{k+1}) < \sum_{l=1}^{p}\alpha_l^{k+1}f(\tilde x_l^{k+1})$ and the minimum is not found, contradicting the minimization of the outer stage. Thus, Theorem 3.5 applies for the ORLSMS algorithm.

Furthermore, the function $f(x) = \|Ax - b\|_2^2$ is strongly convex,

$$f(y) - f(x) - \nabla f(x)^T(y - x) \ge \frac{k}{2}\|y - x\|_2^2, \quad \forall x, y \in \mathbb{R}^n$$

where $k = 2\lambda_{\min}(A^TA) > 0$, since $A$ has full column rank. Therefore, the linear convergence result of Ferris and Mangasarian [6] also applies.

Theorem 3.6. The sequence $\{x^k\}$ defined by the algorithm ORLSMS converges linearly to the unique solution $x_{LS}$ of (1.2), at the linear root rate

$$\|x^k - x_{LS}\| \le \left(\frac{f(x^0) - f(x_{LS})}{\rho(A^TA)}\right)^{1/2}\left(1 - \frac{1}{p}\left(\frac{k}{K_1}\right)^2\right)^{k/2}$$

where $K_1$ is the Lipschitz constant for $\nabla_l f(x_l)$.

On the contrary, however, when we seek to apply Theorem 3.5 to the LSMS algorithm, we do not have an update at the synchronization stage for which it is necessary that $f(x^{k+1}) \le f(x^k)$. In particular, this reduction in the objective function is just the reduction in the residual function, and we see, not unexpectedly, that when we determine the requirement for this decrease we obtain exactly the restriction on $\alpha$ given by Theorem 3.4. In order to force convergence, the implementation used actually updates $x$ either as given by (2.7) or, when this update does not lead to a decrease in the objective function, the update to $x$ is taken as the local solution $\tilde x_i$ which leads to the minimum of $f$ for that iteration. Therefore, the convergence theory for the PVD is useful in this case for determining the weakness of the relaxed splitting, and it immediately suggests the modification required to force convergence.

4. Numerical results

Here, numerical results of tests of the algorithms in Section 2 are presented for three examples. Further results can be found in [12] and [17]. The first example comes from a table-flatness problem and generates a structured matrix. This is one of the structured matrices used by Duff and Reid in [5], and for the case we chose it generates a matrix of size … with … non-zero entries, which is reasonably conditioned, with condition number 105. Results presented for this test case are referred to as results for measured data. To indicate how the algorithms perform for sparse matrices with arbitrary structure, no attempt was made to fit the splitting to the structure of the problem by reordering the unknowns. For dense matrices, we used two matrices of size … with random entries generated according to a normal probability distribution and a uniform probability distribution, respectively. The condition numbers of these matrices were approximately 20 and 398, respectively. The results presented for these matrices are referred to as results for normal and uniform data, respectively. Note that all matrices used in the evaluation were reasonably well conditioned.

Table 1. Comparison of four methods, for tolerance $10^{-TE}$, $TE = 3$ and $TE = 5$. Measured data. (Columns $P$, $O$, $TE$, and $K_R$, $K_E$ for each of the algorithms LSMS, LSMSIR, ORLSMS and ORLSMSIR.)

Table 2. Comparison of four methods, for tolerance $10^{-TE}$, $TE = 3$ and $TE = 5$. Normal data. (Columns as in Table 1.)

A representative selection of the results is given in Tables 1–7. The notation is as follows:

P     number of processors
O     overlap between domains
TE    $10^{-TE}$ is the tolerance
K_R   number of iterations to convergence $10^{-TE}$ in the $l_2$ norm of the relative error
K_E   number of iterations to convergence $10^{-TE}$ in the $l_2$ norm of the relative residual
N     convergence was not achieved to this tolerance after … iterations

Tables 1–3 present a comparison of the four algorithms without overlap at tolerances $10^{-3}$ and $10^{-5}$. Tables 4–7 show how the algorithms perform when overlap is incorporated. All calculations are in single precision.

Table 3. Comparison of four methods, for tolerance $10^{-TE}$, $TE = 3$ and $TE = 5$. Uniform data. (Columns as in Table 1.)

Table 4. Effect of overlap on convergence for ORLSMS and ORLSMSIR, error convergence. Measured data. (Rows ORLSMS and ORLSMSIR against overlap $O$ for each splitting.)

Table 5. Effect of overlap on convergence for ORLSMS and ORLSMSIR, residual convergence. Measured data. (Layout as in Table 4.)

Table 6. Effect of overlap on convergence for ORLSMS and ORLSMSIR, error convergence. Uniform data. (Layout as in Table 4.)

Table 7. Effect of overlap on convergence for ORLSMS and ORLSMSIR, residual convergence. Uniform data. (Layout as in Table 4.)

In summary, the numerical results show:

(i) Convergence for a minimum residual solution is achieved more quickly than for a minimal error solution.
(ii) Iterative refinement does not improve the convergence rates, for either Algorithm 2.1 or 2.2, confirming the observation of Golub and Wilkinson [10].
(iii) Algorithm 2.2 has generally much better convergence properties for a minimum residual solution than Algorithm 2.1, because of the optimal recombination of the local solutions at each iteration.
(iv) Overlap can improve the rate of convergence. The amount of overlap to use is problem dependent. For a dense matrix the cost of the subproblem solution increases with overlap, so that at some point more overlap is no longer beneficial. For separable or near-separable problems the ideal overlap is often immediately clear. In other cases graph-theoretic techniques may be needed to determine optimal groupings of variables.
(v) Overlap does not always reduce the number of outer iterations to convergence of the OR algorithms. A decrease in the objective function is guaranteed each iteration, but the minimization with respect to the weights, $\alpha$, may lead to different subspaces being weighted differently than in the non-overlapped case. Hence, faster convergence is not guaranteed, i.e., the rate of convergence is dependent on the vectors $\alpha^k$.

Figures 1–5 illustrate the results of Tables 1–7, using the estimate of percentage parallel efficiency given by (2.11). In Figures 1–3 the line types are O, +, X and … for the algorithms LSMS, LSMSIR, ORLSMS and ORLSMSIR, respectively. In Figures 4 and 5 the line types …, O, X and + indicate overlap 0, 10, 20 and 30, respectively. Efficiency for the random matrices is less than for the structured cases. Also, because overlap is more costly, the gain in rate of convergence is not recognized in terms of parallel efficiency when overlap is large. Efficiencies greater than 100 indicate speed-up of the split algorithm as compared with a straightforward direct QR solve. The introduction of OR is effective at improving parallel efficiency.

5. Conclusions

The algorithms presented in this paper provide a viable parallel strategy for the solution of the linear least squares problem. In particular, a two-level approach to minimization in which subproblem solutions are obtained independently, but then combined to give an optimal global update, is very successful. The method has been demonstrated to work not only for a sparse test example but also for dense random matrices. We conclude that the approach is of particular value for:

(i) Problems that are inherently near separable. In these cases the optimal local solutions quickly converge to the global optimum and the cost of each subproblem solve is relatively cheap. The method is then viable.
(ii) Large dense problems which are too memory intensive to be solved on a single-processor machine. Although, in this case, parallel efficiency is very low, the algorithm provides an effective solution technique.

Furthermore, these algorithms have the additional advantages, compared with direct parallelization of the serial algorithm, of simplicity, portability and flexibility.

Figure 1. Comparison of parallel efficiency of algorithms for data from Table 1. (Four panels: error and residual convergence at tolerances $10^{-3}$ and $10^{-5}$.)

Figure 2. Comparison of parallel efficiency of algorithms for data from Table 2. (Four panels: error and residual convergence at tolerances $10^{-3}$ and $10^{-5}$.)

Figure 3. Comparison of parallel efficiency of algorithms for data from Table 3. (Four panels: error and residual convergence at tolerances $10^{-3}$ and $10^{-5}$.)

Figure 4. Comparison of parallel efficiency for overlap 0, 10, 20, 30 for data from Tables 4 and 5. (Four panels: Table 4 error with and without IR; Table 5 residual with and without IR.)

Figure 5. Comparison of parallel efficiency for overlap 0, 10, 20, 30 for data from Tables 6 and 7. (Four panels: Table 6 error with and without IR; Table 7 residual with and without IR.)

REFERENCES

1. R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine and H. van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia, 1994.
2. A. Berman and R. J. Plemmons. Nonnegative Matrices in the Mathematical Sciences. Classics in Applied Mathematics, SIAM, Philadelphia, 1994.
3. R. Bramley and A. Sameh. Row projection methods for large nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 13, 1992.
4. E. Chu and A. George. QR factorization of a dense matrix on a hypercube multiprocessor. SIAM J. Sci. Stat. Comput., 11(5), 1990.
5. I. S. Duff and J. K. Reid. A comparison of some methods for the solution of sparse overdetermined systems of linear equations. J. Inst. Maths. Applics., 17, 1976.
6. M. C. Ferris and O. L. Mangasarian. Parallel variable distribution. SIAM J. Optimization, 4(4), 1994.
7. A. Frommer and B. Pohl. A comparison result for multisplittings and waveform relaxation methods. Numer. Linear Algebra Appl., 2, 1995.
8. G. H. Golub. Numerical methods for solving least squares problems. Numer. Math., 7, 1965.
9. G. H. Golub and C. van Loan. Matrix Computations, second edition. Johns Hopkins Press, Baltimore, 1989.
10. G. H. Golub and J. H. Wilkinson. Note on the iterative refinement of least squares solution. Numer. Math., 9, 1966.
11. L. A. Hageman and D. M. Young. Applied Iterative Methods. Academic Press, New York, 1981.
12. Q. He. Parallel multisplittings for nonlinear minimization. Ph.D. thesis, Arizona State University. In preparation.
13. N. J. Higham. Iterative refinement enhances the stability of QR decomposition methods for solving linear equations. BIT, 31, 1991.
14. D. P. O'Leary and R. E. White. Multi-splitting of matrices and parallel solution of linear systems. SIAM J. Alg. Disc. Meth., 6, 1985.
15. J. M. Ortega. Introduction to Parallel and Vector Solution of Linear Systems. Plenum Press, New York and London, 1988.
16. J. M. Ortega and W. C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York, 1970.
17. R. A. Renaut, Q. He and F.-S. Horng. Parallel multisplitting for minimization. In Grand Challenges in Computer Simulation, A. Tentner, editor. High Performance Computing, Society for Computer Simulation.
18. R. A. Renaut and H. D. Mittelmann. Parallel multisplittings for optimization. J. Parallel Alg. and Appl., 7, 17–27, 1995.


More information

A new Approach for Solving Linear Ordinary Differential Equations

A new Approach for Solving Linear Ordinary Differential Equations , ISSN 974-57X (Onlne), ISSN 974-5718 (Prnt), Vol. ; Issue No. 1; Year 14, Copyrght 13-14 by CESER PUBLICATIONS A new Approach for Solvng Lnear Ordnary Dfferental Equatons Fawz Abdelwahd Department of

More information

The Exact Formulation of the Inverse of the Tridiagonal Matrix for Solving the 1D Poisson Equation with the Finite Difference Method

The Exact Formulation of the Inverse of the Tridiagonal Matrix for Solving the 1D Poisson Equation with the Finite Difference Method Journal of Electromagnetc Analyss and Applcatons, 04, 6, 0-08 Publshed Onlne September 04 n ScRes. http://www.scrp.org/journal/jemaa http://dx.do.org/0.46/jemaa.04.6000 The Exact Formulaton of the Inverse

More information

Numerical Heat and Mass Transfer

Numerical Heat and Mass Transfer Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law:

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law: CE304, Sprng 2004 Lecture 4 Introducton to Vapor/Lqud Equlbrum, part 2 Raoult s Law: The smplest model that allows us do VLE calculatons s obtaned when we assume that the vapor phase s an deal gas, and

More information

Outline and Reading. Dynamic Programming. Dynamic Programming revealed. Computing Fibonacci. The General Dynamic Programming Technique

Outline and Reading. Dynamic Programming. Dynamic Programming revealed. Computing Fibonacci. The General Dynamic Programming Technique Outlne and Readng Dynamc Programmng The General Technque ( 5.3.2) -1 Knapsac Problem ( 5.3.3) Matrx Chan-Product ( 5.3.1) Dynamc Programmng verson 1.4 1 Dynamc Programmng verson 1.4 2 Dynamc Programmng

More information

Annexes. EC.1. Cycle-base move illustration. EC.2. Problem Instances

Annexes. EC.1. Cycle-base move illustration. EC.2. Problem Instances ec Annexes Ths Annex frst llustrates a cycle-based move n the dynamc-block generaton tabu search. It then dsplays the characterstcs of the nstance sets, followed by detaled results of the parametercalbraton

More information

Asymptotics of the Solution of a Boundary Value. Problem for One-Characteristic Differential. Equation Degenerating into a Parabolic Equation

Asymptotics of the Solution of a Boundary Value. Problem for One-Characteristic Differential. Equation Degenerating into a Parabolic Equation Nonl. Analyss and Dfferental Equatons, ol., 4, no., 5 - HIKARI Ltd, www.m-har.com http://dx.do.org/.988/nade.4.456 Asymptotcs of the Soluton of a Boundary alue Problem for One-Characterstc Dfferental Equaton

More information

Least squares cubic splines without B-splines S.K. Lucas

Least squares cubic splines without B-splines S.K. Lucas Least squares cubc splnes wthout B-splnes S.K. Lucas School of Mathematcs and Statstcs, Unversty of South Australa, Mawson Lakes SA 595 e-mal: stephen.lucas@unsa.edu.au Submtted to the Gazette of the Australan

More information

ISSN: ISO 9001:2008 Certified International Journal of Engineering and Innovative Technology (IJEIT) Volume 3, Issue 1, July 2013

ISSN: ISO 9001:2008 Certified International Journal of Engineering and Innovative Technology (IJEIT) Volume 3, Issue 1, July 2013 ISSN: 2277-375 Constructon of Trend Free Run Orders for Orthogonal rrays Usng Codes bstract: Sometmes when the expermental runs are carred out n a tme order sequence, the response can depend on the run

More information

VARIATION OF CONSTANT SUM CONSTRAINT FOR INTEGER MODEL WITH NON UNIFORM VARIABLES

VARIATION OF CONSTANT SUM CONSTRAINT FOR INTEGER MODEL WITH NON UNIFORM VARIABLES VARIATION OF CONSTANT SUM CONSTRAINT FOR INTEGER MODEL WITH NON UNIFORM VARIABLES BÂRZĂ, Slvu Faculty of Mathematcs-Informatcs Spru Haret Unversty barza_slvu@yahoo.com Abstract Ths paper wants to contnue

More information

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification E395 - Pattern Recognton Solutons to Introducton to Pattern Recognton, Chapter : Bayesan pattern classfcaton Preface Ths document s a soluton manual for selected exercses from Introducton to Pattern Recognton

More information

Matrix Approximation via Sampling, Subspace Embedding. 1 Solving Linear Systems Using SVD

Matrix Approximation via Sampling, Subspace Embedding. 1 Solving Linear Systems Using SVD Matrx Approxmaton va Samplng, Subspace Embeddng Lecturer: Anup Rao Scrbe: Rashth Sharma, Peng Zhang 0/01/016 1 Solvng Lnear Systems Usng SVD Two applcatons of SVD have been covered so far. Today we loo

More information

Assortment Optimization under MNL

Assortment Optimization under MNL Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.

More information

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS Avalable onlne at http://sck.org J. Math. Comput. Sc. 3 (3), No., 6-3 ISSN: 97-537 COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

IV. Performance Optimization

IV. Performance Optimization IV. Performance Optmzaton A. Steepest descent algorthm defnton how to set up bounds on learnng rate mnmzaton n a lne (varyng learnng rate) momentum learnng examples B. Newton s method defnton Gauss-Newton

More information

5 The Rational Canonical Form

5 The Rational Canonical Form 5 The Ratonal Canoncal Form Here p s a monc rreducble factor of the mnmum polynomal m T and s not necessarly of degree one Let F p denote the feld constructed earler n the course, consstng of all matrces

More information

ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EQUATION

ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EQUATION Advanced Mathematcal Models & Applcatons Vol.3, No.3, 2018, pp.215-222 ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EUATION

More information

FREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced,

FREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced, FREQUENCY DISTRIBUTIONS Page 1 of 6 I. Introducton 1. The dea of a frequency dstrbuton for sets of observatons wll be ntroduced, together wth some of the mechancs for constructng dstrbutons of data. Then

More information