An Efficient Tridiagonal Eigenvalue Solver on CM 5 with Laguerre's Iteration

Ren-Cang Li and Huan Ren
Department of Mathematics
University of California at Berkeley
Berkeley, California 94720
li@math.berkeley.edu, ren@math.berkeley.edu

Report No. UCB//CSD, December 1994

Computer Science Division (EECS)
University of California, Berkeley, California 94720

An Efficient Tridiagonal Eigenvalue Solver on CM 5 with Laguerre's Iteration*

Ren-Cang Li and Huan Ren
Department of Mathematics
University of California at Berkeley
Berkeley, California 94720

Abstract

In this paper we propose an algorithm, based on Laguerre's iteration, for finding the eigenvalues of symmetric tridiagonal matrices. The algorithm is fully parallelizable and has been parallelized on the CM 5 at the University of California at Berkeley. We achieve the best possible speedup when the matrix dimension is large enough. In addition, our serial code is much more efficient on matrices with pathologically close eigenvalues than an existing serial code of the same kind due to Li and Zeng.

1 Introduction

Laguerre's iteration turns out to be one of the best iterative methods for finding the zeros of polynomials. If the polynomial has only real zeros, Laguerre's

*Computer Science Division Technical Report UCB//CSD, University of California, Berkeley, CA 94720, December 1994. This material is based in part upon work supported by Argonne National Laboratory under grant No. and the University of Tennessee through the Advanced Research Projects Agency under contract No. DAAL03-91-C-0047, by the National Science Foundation under grant No. ASC, and by the National Science Infrastructure grants No. CDA and CDA.

iteration is guaranteed to produce a sequence of approximations that converges monotonically to the desired zero. Moreover, the convergence is ultimately cubic provided the algebraic multiplicity of the desired zero is estimated correctly. Therefore convergence to simple zeros is always cubic, and convergence to multiple zeros is at least linear.

This report addresses some implementation issues in programming Laguerre's iteration, both serially and in parallel, to compute the eigenvalues of a symmetric tridiagonal matrix. Our contributions are:

• A different divide-and-conquer strategy, which we call Cross-and-Recover in contrast to Li and Zeng's Split-and-Merge [5], and which offers better inclusion intervals.

• More efficient zero-finder code than Li and Zeng's [5]. We make our zero-finder go as fast as possible in pathological cases.

• An efficient parallel implementation on CM 5, with almost linear speedup.

We will make these contributions more concrete later.

2 The Basic Structure of the Algorithm

Let T be an n × n tridiagonal matrix

$$T = \begin{pmatrix} d_1 & e_1 & & & \\ e_1 & d_2 & e_2 & & \\ & \ddots & \ddots & \ddots & \\ & & e_{n-2} & d_{n-1} & e_{n-1} \\ & & & e_{n-1} & d_n \end{pmatrix}, \qquad (1)$$

whose eigenvalues

$$\lambda_1 < \lambda_2 < \cdots < \lambda_n$$

are to be found. Assume, without loss of generality, that all e_j ≠ 0, i.e., that T is unreduced. Let k = ⌊n/2⌋ and write

$$T = \begin{pmatrix} T_0 & e_{k-1} & \\ e_{k-1} & d_k & e_k \\ & e_k & T_1 \end{pmatrix},$$

where the scalar e_{k−1} couples d_k to the last row of T_0 and the scalar e_k couples d_k to the first row of T_1. If, now, the eigenvalues of

$$\widehat T = \begin{pmatrix} T_0 & \\ & T_1 \end{pmatrix},$$

the submatrix of T with its kth row and column crossed out, are known as

$$\hat\lambda_1 \le \hat\lambda_2 \le \cdots \le \hat\lambda_{n-1},$$

Cauchy's interlacing theorem tells us precisely where each λ_j is located, i.e.,

$$\lambda_1 \le \hat\lambda_1 \le \lambda_2 \le \hat\lambda_2 \le \cdots \le \hat\lambda_{n-1} \le \lambda_n. \qquad (2)$$

Also we have

$$\hat\lambda_1 - (|e_{k-1}| + |d_k| + |e_k|) \le \lambda_1, \qquad \lambda_n \le \hat\lambda_{n-1} + (|e_{k-1}| + |d_k| + |e_k|) \qquad (3)$$

by the Weyl-Lidskii theorem. With (2) and (3), each λ_j can be computed by applying Laguerre's iteration to the characteristic polynomial of T. How do we get the λ̂_j? The above procedure can be applied recursively until the matrices become small enough to be solved easily. Pictorially, our algorithm has the structure shown in Figure 1, in which we assume the submatrices at the very bottom have the same size.

3 Laguerre's Iteration

Let

$$f(x) = \det(T - xI). \qquad (4)$$

Figure 1: The Divide-and-Conquer Process (crossing out rows and columns on the way down, recovering eigenvalues on the way up; submatrix sizes grow from m at the bottom through 2m+1 and 4m+3 to 8m+7 at the top).
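For concreteness, the initial inclusion interval for each λ_j follows directly from (2) and (3). Here is a minimal C sketch of that step; the function name, the 0-based array convention, and the parameter names are ours, not from the report's code:

```c
#include <math.h>

/* Sketch: the initial search interval [a, b] for lambda_j of T,
   derived from the interlacing (2) and the Weyl-Lidskii bound (3).
   lamhat[0..n-2] holds the sorted eigenvalues of T-hat; ek1, dk, ek
   are e_{k-1}, d_k, e_k of the crossed-out row. j is 1-based. */
void search_interval(int n, int j, const double *lamhat,
                     double ek1, double dk, double ek,
                     double *a, double *b)
{
    double s = fabs(ek1) + fabs(dk) + fabs(ek);   /* Weyl-Lidskii spread */
    *a = (j == 1) ? lamhat[0] - s     : lamhat[j - 2]; /* lambda-hat_{j-1} */
    *b = (j == n) ? lamhat[n - 2] + s : lamhat[j - 1]; /* lambda-hat_j    */
}
```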

Laguerre's iteration (L_± ≡ L_{1±}) is defined as

$$L_{r\pm}(x) \;\overset{\text{def}}{=}\; x + \frac{n}{\;-\dfrac{f'(x)}{f(x)} \pm \sqrt{\dfrac{n-r}{r}\left[(n-1)\left(\dfrac{f'(x)}{f(x)}\right)^{2} - n\,\dfrac{f''(x)}{f(x)}\right]}\;}, \qquad (5)$$

where r is a parameter to be chosen, ideally, to accelerate convergence. For our purpose, r is always set to 1. It is proved that

if x ∈ (λ_j, λ_{j+1}), then λ_j < L_−(x) < x < L_+(x) < λ_{j+1}.

To apply Laguerre's iteration, one needs to know −f′(x)/f(x) and f″(x)/f(x). It turns out that they can be computed in a recursive way due to [5] (see Algorithm DetEvl). As a by-product, the neg count ν(x) at x, which is the number of eigenvalues of T less than x, is computed as well. In a few cases, however, one may need only the neg count at x for decision making, as we will see later. In those cases, calling DetEvl to get the neg count is more expensive than necessary, so a call to Algorithm NegCount is made instead.

3.1 Costs of Calling DetEvl and NegCount

The following tables present operation counts for one call to DetEvl or NegCount.

Table 1: Costs

             + or −     ×       ÷
  DetEvl     6n − 5   6(n−1)  3n − 2
  NegCount   2n − 1    n − 1    n

In Table 1 it is assumed that d_i − x and e²_{i−1}/ζ_{i−1} are computed once for each i, and that e²_i is computed whenever needed. By looking at DetEvl, one can see that there is room for rearranging the computation to minimize the total number of arithmetic operations.

Table 2: Costs (optimized)

             + or −     ×       ÷
  DetEvl     6n − 5   9(n−1)    n
  NegCount   2n − 1     0       n
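For concreteness, a single Laguerre step (5) with r = 1 can be formed from the two ratios that DetEvl returns. The following is a minimal C sketch under our naming (eta for −f′(x)/f(x), sigma for f″(x)/f(x)); it is an illustration, not the report's code:

```c
#include <math.h>

/* One Laguerre step L_{1,+-}(x) of eq. (5) with r = 1, from the ratios
   eta = -f'(x)/f(x) and sigma = f''(x)/f(x) returned by DetEvl.
   For a polynomial with only real zeros the discriminant is
   nonnegative in exact arithmetic; we clip it at 0 to guard against
   roundoff. plus != 0 selects L_+, otherwise L_-. */
double laguerre_step(int n, double x, double eta, double sigma, int plus)
{
    double disc = (n - 1.0) * ((n - 1.0) * eta * eta - n * sigma);
    double root = sqrt(disc > 0.0 ? disc : 0.0);
    return x + n / (eta + (plus ? root : -root));
}
```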

Algorithm DetEvl:
  input: T, x;
  output: −f′(x)/f(x) = η_n, f″(x)/f(x) = σ_n, and neg_count;
  begin DetEvl
    ζ_1 = d_1 − x;
    if ζ_1 = 0 then ζ_1 = e_1² ε_m²;   (* ε_m is the machine epsilon *)
    η_0 = 0, η_1 = 1/ζ_1, σ_0 = 0, σ_1 = 0, neg_count = 0;
    if ζ_1 < 0 then neg_count = 1;
    for i = 2 : n
      ζ_i = (d_i − x) − (e_{i−1}²/ζ_{i−1});
      if ζ_i = 0 then ζ_i = (e_{i−1}²/ζ_{i−1}) ε_m²;
      if ζ_i < 0 then neg_count = neg_count + 1;
      η_i = ((d_i − x) η_{i−1} + 1 − (e_{i−1}²/ζ_{i−1}) η_{i−2}) / ζ_i;
      σ_i = ((d_i − x) σ_{i−1} + 2 η_{i−1} − (e_{i−1}²/ζ_{i−1}) σ_{i−2}) / ζ_i;
    end for
  end DetEvl

Figure 2: Algorithm DetEvl

Algorithm NegCount:
  input: T, x;
  output: neg_count;
  begin NegCount
    ζ_1 = d_1 − x;
    if ζ_1 = 0 then ζ_1 = e_1² ε_m²;
    neg_count = 0;
    if ζ_1 < 0 then neg_count = 1;
    for i = 2 : n
      ζ_i = (d_i − x) − (e_{i−1}²/ζ_{i−1});
      if ζ_i = 0 then ζ_i = (e_{i−1}²/ζ_{i−1}) ε_m²;
      if ζ_i < 0 then neg_count = neg_count + 1;
    end for
  end NegCount

Figure 3: Algorithm NegCount
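For readers who prefer working code, here is a minimal C translation of Figure 2, a sketch under our assumptions (0-based arrays d[0..n-1] and e[0..n-2], ε_m taken as DBL_EPSILON; function and variable names are ours):

```c
#include <float.h>

/* C sketch of Algorithm DetEvl (Figure 2). On return, *eta holds
   -f'(x)/f(x), *sigma holds f''(x)/f(x), and the return value is
   neg_count, the number of eigenvalues of T less than x. */
int detevl(int n, const double *d, const double *e, double x,
           double *eta, double *sigma)
{
    double zeta = d[0] - x;
    if (zeta == 0.0) zeta = e[0] * e[0] * DBL_EPSILON * DBL_EPSILON;
    double eta0 = 0.0, eta1 = 1.0 / zeta;   /* eta_{i-2}, eta_{i-1} */
    double sigma0 = 0.0, sigma1 = 0.0;      /* sigma_{i-2}, sigma_{i-1} */
    int neg_count = (zeta < 0.0) ? 1 : 0;

    for (int i = 1; i < n; i++) {
        double t = e[i - 1] * e[i - 1] / zeta;  /* e_{i-1}^2 / zeta_{i-1} */
        zeta = (d[i] - x) - t;
        if (zeta == 0.0) zeta = t * DBL_EPSILON * DBL_EPSILON;
        if (zeta < 0.0) neg_count++;
        double eta2   = ((d[i] - x) * eta1   + 1.0        - t * eta0)   / zeta;
        double sigma2 = ((d[i] - x) * sigma1 + 2.0 * eta1 - t * sigma0) / zeta;
        eta0 = eta1;     eta1 = eta2;
        sigma0 = sigma1; sigma1 = sigma2;
    }
    *eta = eta1;
    *sigma = sigma1;
    return neg_count;
}
```

Dropping the eta and sigma recurrences gives NegCount (Figure 3), at the reduced cost shown in Table 1.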

In Table 2 it is assumed that e²_i is pre-computed and that, for DetEvl, each reciprocal 1/ζ_i is computed once, the divisions in forming ζ_{i+1}, η_i and σ_i then being replaced by multiplications.

3.2 A Very Minor Error in Li and Zeng's Error Analysis

Li and Zeng [5] gave a detailed error analysis for DetEvl. It is correct generically, but it is not in one case. Assume we are working with the following arithmetic model:

$$\mathrm{fl}(x \circ y) = (x \circ y)(1 + \delta),$$

where ∘ is one of the operations +, −, × and ÷, fl(·) denotes the floating point result of the corresponding operation, and |δ| ≤ ε_m, the machine epsilon. It was proved by Li and Zeng that

$$\mathrm{fl}[f(x)] = \mathrm{fl}[\det(T - xI)] = \det((T + E) - xI)(1 + \delta), \qquad (6)$$

where −(3n−2)ε_m + O(ε_m²) ≤ δ ≤ (3n−2)ε_m + O(ε_m²) and

$$E = E_{LZ} = \begin{pmatrix} 0 & e_1\delta_1 & & \\ e_1\delta_1 & 0 & \ddots & \\ & \ddots & \ddots & e_{n-1}\delta_{n-1} \\ & & e_{n-1}\delta_{n-1} & 0 \end{pmatrix}$$

with −2.5ε_m + O(ε_m²) ≤ δ_j ≤ 2.5ε_m + O(ε_m²). We claim this error estimate is incorrect when d_1 − x = 0. To see why, consider a T with all d_j = 0, n odd, and x = 0, so that det(T) = 0. Were their error analysis correct, the fl[det(T)] produced by DetEvl would be zero. However, DetEvl never yields zero determinants: every zero pivot ζ_i is replaced by a tiny nonzero quantity, so the computed product of the ζ_i is never zero. The error was made by ignoring the case d_1 − x = 0. The correct E for the case d_1 − x = 0 is

$$E = \begin{pmatrix} e_1^2\varepsilon_m^2 & e_1\delta_1 & & \\ e_1\delta_1 & 0 & \ddots & \\ & \ddots & \ddots & e_{n-1}\delta_{n-1} \\ & & e_{n-1}\delta_{n-1} & 0 \end{pmatrix}.$$

4 The Basic Building Block

The kernel of our algorithm computes the eigenvalue λ_j of T inside a prescribed interval [a, b] whose endpoints a and b are, initially, eigenvalues of its two principal submatrices, as we saw before, and are updated as the iteration goes on. Thus λ_j is the only eigenvalue of T inside [a, b]. To avoid some unpleasant situations, our code does the following at the beginning.

4.1 Preprocessing

  max_off := max_{1≤i≤n−2} (|e_i| + |e_{i+1}|);
  tol := (2.5 · max_off + |(a+b)/2|) · ε_m;
  if b − a ≤ 2·tol, return λ_j := (a+b)/2;
  call NegCount at (a+b)/2;
  if ν((a+b)/2) = j then
    b := (a+b)/2;  left_half := true;
  else (* ν((a+b)/2) < j *)
    a := (a+b)/2;  left_half := false;
  if left_half = true then
    tol := (2.5 · max_off + |a|) · ε_m;
    x_0 := a + 2·tol;
    call DetEvl at x_0;
    if ν(x_0) = j, then return λ_j := L_−(x_0);
    else x_1 := L_+(x_0);
  else (* left_half = false *)
    tol := (2.5 · max_off + |b|) · ε_m;
    x_0 := b − 2·tol;
    call DetEvl at x_0;
    if ν(x_0) < j, then return λ_j := L_+(x_0);
    else x_1 := L_−(x_0);
  tol := (2.5 · max_off + |x_1|) · ε_m;
  if |x_1 − x_0| ≤ tol, then
    if left_half = true then
      call NegCount at x_1 + tol;
      if ν(x_1 + tol) = j, return λ_j := x_1 + tol;
    else (* left_half = false *)
      call NegCount at x_1 − tol;
      if ν(x_1 − tol) < j, return λ_j := x_1 − tol;
  δ_1 := (b − a)·ε_m^{1/4};  δ_2 := (b − a)·ε_m^{1/2};  δ_3 := (b − a)·ε_m^{3/4};
  if |x_1 − x_0| > δ_1, go to (*);

  if |x_1 − x_0| ≤ δ_3, then
    if left_half = true then
      call NegCount at x_1 + δ_3;
      if ν(x_1 + δ_3) = j then, b := x_1 + δ_3; x_2 := x_1 + δ_3/10; go to (*);
    else (* left_half = false *)
      call NegCount at x_1 − δ_3;
      if ν(x_1 − δ_3) < j then, a := x_1 − δ_3; x_2 := x_1 − δ_3/10; go to (*);
  if |x_1 − x_0| ≤ δ_2, then
    if left_half = true then
      call NegCount at x_1 + δ_2;
      if ν(x_1 + δ_2) = j then, b := x_1 + δ_2; x_2 := x_1 + δ_2/10; go to (*);
    else (* left_half = false *)
      call NegCount at x_1 − δ_2;
      if ν(x_1 − δ_2) < j then, a := x_1 − δ_2; x_2 := x_1 − δ_2/10; go to (*);
  if |x_1 − x_0| ≤ δ_1, then
    if left_half = true then
      call NegCount at x_1 + δ_1;
      if ν(x_1 + δ_1) = j then, b := x_1 + δ_1; x_2 := x_1 + δ_1/10; go to (*);
      else x_2 := x_1 + δ_1; a := x_2;
    else (* left_half = false *)
      call NegCount at x_1 − δ_1;
      if ν(x_1 − δ_1) < j then, a := x_1 − δ_1; x_2 := x_1 − δ_1/10; go to (*);
      else x_2 := x_1 − δ_1; b := x_2;
  (*) continue ...  (* loop body for Laguerre's iteration *)
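The three thresholds δ_1 > δ_2 > δ_3 above are simple to form; here is a minimal C sketch (names ours, with ε_m taken as DBL_EPSILON):

```c
#include <float.h>
#include <math.h>

/* The gap thresholds delta_1 > delta_2 > delta_3 of the preprocessing:
   (b - a) times eps_m^{1/4}, eps_m^{1/2}, eps_m^{3/4} respectively. */
void gap_thresholds(double a, double b,
                    double *d1, double *d2, double *d3)
{
    double w = b - a;
    *d1 = w * pow(DBL_EPSILON, 0.25);
    *d2 = w * sqrt(DBL_EPSILON);
    *d3 = w * pow(DBL_EPSILON, 0.75);
}
```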

Although Laguerre's iteration has very nice convergence properties when applied to an unreduced symmetric tridiagonal matrix T, there are two difficult situations that must be handled carefully for the iteration to go as fast as it should.

4.2 Two Difficult Situations: Detection and Solution

Suppose we are at x on the way to λ. For the sake of convenience, let us assume x < λ. The following two situations are the difficult ones.

Case (i): On the left of x, T has an eigenvalue which is terribly close to x, while x and λ are reasonably far apart.

Case (ii): T has a few other eigenvalues on the right of λ, and those eigenvalues look almost the same as λ from the view at x; or x is outside the region of cubic convergence of Laguerre's iteration for finding λ.

It is possible for the two difficult situations to occur at the same time. Case (i) was pointed out by Kahan [4]. Both [4] and [5] mention Case (ii). A popular treatment of Case (ii) is to estimate the "numerical" algebraic multiplicity of λ. However, the multiplicity cannot be estimated correctly until x is sufficiently close to λ. Proposed here are effective numerical detections of the two cases and their possible solutions.

Let x_i = L_+(x_{i−1}) (x_0 = x). Case (i) can be detected easily by looking at (x_2 − x_1)/(x_1 − x_0): the situation falls into Case (i) if this ratio is bigger than 1. One solution to Case (i) is to restart at a new x = (x_2 + b)/2. This new x might turn out to be farther from λ than the old x; fortunately, our preprocessing described above prevents that from happening.

To detect Case (ii), again one only needs to look at (x_2 − x_1)/(x_1 − x_0). The situation falls into Case (ii) if the ratio is less than 1, but not so small that the iteration can be expected to converge at least superlinearly towards λ over the next few iterations. For instance, use the following

criterion:

$$\frac{1}{5} \le q \overset{\text{def}}{=} \frac{x_2 - x_1}{x_1 - x_0} < 1. \qquad (7)$$

As we mentioned above, one could then use L_{r±} with r other than 1 to accelerate the iteration. However, the right r is not so easy to get, at least at the beginning. The following method works quite well. Once we detect that (7) holds, we expect the iterates x_i = L_+(x_{i−1}) to continue for another few iterations at a similar rate, i.e., for the next few iterations, roughly,

$$\frac{x_{i+1} - x_i}{x_i - x_{i-1}} \approx q.$$

Note that

$$\lambda = x_1 + (x_2 - x_1) + (x_3 - x_2) + \cdots + (x_{i+1} - x_i) + \cdots.$$

Since Laguerre's iteration goes monotonically upwards in the present case, each term x_{i+1} − x_i in the summation is positive. Suppose the iterates x_i = L_+(x_{i−1}) go linearly at a rate of about q, say until x_{k+1}, and then begin superlinear (cubic) convergence. (k could be ∞.) Therefore

$$\frac{x_{i+1} - x_i}{x_2 - x_1} \approx q^{i-1},$$

which yields

$$\lambda \approx x_1 + (x_2 - x_1)(1 + q + \cdots + q^{k-1}) + (\text{negligible} > 0) = x_1 + (x_2 - x_1)\frac{1 - q^k}{1 - q} + (\text{negligible} > 0).$$

This means that if we knew k, we could take x_1 + (x_2 − x_1)(1 + q + ··· + q^{k−1}) as the next approximation, which should be much more accurate than x_3. A very simple way to find k is to try k = 2^ℓ, for ℓ = 1, 2, ..., sequentially, by calling NegCount: ℓ should be made as big as possible while keeping x_1 + (x_2 − x_1)(1 + q + ··· + q^{k−1}) on the left of λ. One restriction on ℓ is that q^k ≥ ε_m, which gives, roughly,

$$\ell \le \frac{3.6 - \ln(-\ln q)}{0.69}$$

for double precision. According to our experience, an ℓ bigger than 6 can be treated as ∞. Roughly speaking, 2^ℓ calls to DetEvl are saved at the cost of ℓ + 1 calls to NegCount.
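The doubling search for k can be packaged compactly. Below is a minimal C sketch under our assumptions (names are ours; negcount is any routine returning ν(·), e.g. a wrapper over Algorithm NegCount; ν(y) < j certifies that y is still to the left of λ_j; criterion (7) is assumed to hold, so 0 < q < 1):

```c
#include <math.h>

/* Sketch of the Case (ii) acceleration: given three consecutive
   Laguerre iterates x0 < x1 < x2 satisfying criterion (7), try
   k = 2, 4, 8, ... and return the largest geometric extrapolation
   x1 + (x2 - x1)(1 - q^k)/(1 - q) that stays left of lambda_j. */
double accelerate(double x0, double x1, double x2, int j,
                  int (*negcount)(double))
{
    double q = (x2 - x1) / (x1 - x0);
    /* l is capped by q^k >= eps_m, i.e. l <= (3.6 - ln(-ln q))/0.69 */
    int lmax = (int)((3.6 - log(-log(q))) / 0.69);
    if (lmax > 6) {                     /* per the text: treat as k = infinity */
        double y = x1 + (x2 - x1) / (1.0 - q);
        if (negcount(y) < j) return y;  /* full geometric sum is safe */
        lmax = 6;
    }
    double best = x2;
    for (int l = 1, k = 2; l <= lmax; l++, k *= 2) {
        double y = x1 + (x2 - x1) * (1.0 - pow(q, k)) / (1.0 - q);
        if (negcount(y) < j) best = y;  /* still left of lambda_j */
        else break;                     /* overshot: keep previous k */
    }
    return best;
}
```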

5 Numerical Tests for the Serial Implementation

We implemented our algorithm, both serially and in parallel, based on what we have discussed. In this section we present some numerical results on how efficient our serial implementation is in comparison with Li and Zeng's and with DSTEBZ, the bisection routine in LAPACK for computing eigenvalues only. For reference, we also list the times taken by DSTEIN, the inverse iteration routine in LAPACK for computing eigenvectors after calling DSTEBZ. Our code was written in C; Li and Zeng's code, as well as the routines from LAPACK, were written in Fortran.

Our first example is glued Wilkinson matrices, which have many pathologically close eigenvalues. To be more specific, the tridiagonal matrices are of the form

$$T = \begin{pmatrix} W_{21}^{+} & \delta & & \\ \delta & W_{21}^{+} & \ddots & \\ & \ddots & \ddots & \delta \\ & & \delta & W_{21}^{+} \end{pmatrix},$$

where W⁺₂₁ is the well-known Wilkinson matrix whose diagonal entries are 10, 9, ..., 1, 0, 1, ..., 9, 10 and whose off-diagonal entries are all 1, and δ is a parameter (each δ glues the last row of one copy of W⁺₂₁ to the first row of the next). Bigger test matrices can be obtained by gluing more copies of W⁺₂₁.
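A minimal C sketch of this construction, useful for reproducing the tests (the function name and 0-based array convention are ours):

```c
/* Build a glued Wilkinson matrix from g copies of W21+, glued by
   delta. d receives the 21*g diagonal entries, e the 21*g - 1
   off-diagonal entries (e[i] couples rows i and i+1, 0-based). */
void glued_wilkinson(int g, double delta, double *d, double *e)
{
    int n = 21 * g;
    for (int i = 0; i < n; i++) {
        int k = i % 21;                  /* position within a W21+ block */
        d[i] = (k <= 10) ? (double)(10 - k) : (double)(k - 10);
    }
    for (int i = 0; i + 1 < n; i++)
        e[i] = ((i + 1) % 21 == 0) ? delta : 1.0;  /* glue at block ends */
}
```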

Table 3: Timing for Glued Wilkinson Matrices with δ = 10⁻⁵ (columns: n, Ours, Li & Zeng, DSTEBZ, DSTEIN)

Table 4: Timing for Glued Wilkinson Matrices with δ = 10⁻¹⁰ (columns: n, Ours, Li & Zeng, DSTEBZ, DSTEIN)

Tables 3 and 4 indicate that our code handles pathologically close eigenvalues much better than Li and Zeng's. For δ = 10⁻⁵, our code runs a little bit faster than DSTEBZ. When δ = 10⁻¹⁰, however, DSTEBZ gains ground as n grows larger and larger, because the eigenvalues are extremely clustered.

For random test matrices, our code takes times comparable to Li and Zeng's code, as the following table shows. DSTEBZ, on the other hand, is very slow.

Table 5: Timing for Random Matrices (columns: n, Ours, Li & Zeng, DSTEBZ, DSTEIN)

To make Tables 3 to 5 easier to take in, we plot Tables 3 and 4 in Figure 4, and Table 5 in Figure 5.

Figure 4: Glued Wilkinson Matrices: (a) δ = 10⁻⁵, (b) δ = 10⁻¹⁰ (timing in seconds versus matrix dimension, for Ours, Li & Zeng, and DSTEBZ).

Figure 5: Random Matrices (timing in seconds versus matrix dimension, for Ours, Li & Zeng, and DSTEBZ).

6 Parallel Implementation on CM 5

We coded the parallel version of our algorithm in Split-C. For the convenience of our presentation, the notation T(i:j) denotes the principal submatrix consisting of the intersection of rows i to j and columns i to j, i.e.,

$$T(i:j) \;\overset{\text{def}}{=}\; \begin{pmatrix} d_i & e_i & & \\ e_i & d_{i+1} & \ddots & \\ & \ddots & \ddots & e_{j-1} \\ & & e_{j-1} & d_j \end{pmatrix}.$$

6.1 Outline of our Implementation

Let PROCS be the number of processors; it is a power of 2 (PROCS = 2⁵ = 32 for rodin and PROCS = 2⁶ = 64 for rodin3). Let us begin with an example: n = 50, PROCS = 8. First of all, T is decoupled into

T(1:6), T(8:13), T(15:20), T(22:26), T(28:32), T(34:38), T(40:44), T(46:50).

Each of these small decoupled problems can then be solved serially by the 8 processors, each of which takes care of one of them, in order. We call this Step 0. Step 1 is for T(1:13), which activates communication between Processor 0 and Processor 1, for T(15:26), which activates communication between Processor 2 and Processor 3, and so on. After Step 1, Step 2 is reached. All eigenvalues of T are computed after the completion of Step 3. Figure 6 gives a bird's-eye view of how our algorithm proceeds on the example, where i:j indicates a submatrix T(i:j) on which one or more processors are working.

Assume PROCS = 2^p. In general, our implementation flow is as described in Figure 7.

Step 3:   1:50 (the whole matrix)
Step 2:   1:26                        28:50
Step 1:   1:13         15:26          28:38          40:50
Step 0:   1:6   8:13   15:20   22:26  28:32   34:38  40:44   46:50

Figure 6: Algorithm Flow: An Example

  q = ⌊(n − (PROCS − 1)) / PROCS⌋;   r = [n − (PROCS − 1)] − q · PROCS;

  Step 0: Processor i solves the eigenvalues of T(i_start : i_end) serially, where
    if r = 0 then
      i_start := i(q+1) + 1,  i_end := i(q+1) + q;
    else (* r ≥ 1 *)
      for 0 ≤ i ≤ r − 1,  i_start := i(q+2) + 1  and  i_end := i(q+2) + q + 1;  end for
      for r ≤ i ≤ PROCS − 1,  i_start := r + i(q+1) + 1  and  i_end := r + i(q+1) + q;  end for
  end of Step 0
  ------------------
  Step j: for i = 0 : (2^{p−j} − 1),
      Processors i·2^j to (i+1)·2^j − 1 collaborate to solve the eigenvalues of
      T((i·2^j)_start : ((i+1)·2^j − 1)_end);
    end for
  end of Step j
  ------------------
  Step p: All processors collaborate to solve the eigenvalues of T;
  end of Step p

Figure 7: Algorithm Flow
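As a concrete check of the Step 0 index arithmetic, here is a minimal C sketch (names ours; indices 1-based as in the text). For n = 50 and PROCS = 8 it reproduces the eight blocks of the example above.

```c
/* Step 0 partition: the rows i_start..i_end (1-based) owned by
   processor i. The first r blocks get q+1 rows, the rest get q,
   with one crossed-out row between consecutive blocks. */
void step0_range(int n, int procs, int i, int *i_start, int *i_end)
{
    int q = (n - (procs - 1)) / procs;
    int r = (n - (procs - 1)) - q * procs;
    if (i < r) {
        *i_start = i * (q + 2) + 1;
        *i_end   = *i_start + q;         /* q + 1 rows */
    } else {
        *i_start = r + i * (q + 1) + 1;
        *i_end   = *i_start + q - 1;     /* q rows */
    }
}
```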

6.2 Implementation Details

6.2.1 Subtle Issues to Be Handled with Care

Implementing our algorithm efficiently on CM 5 turns out not to be an easy task. There are two major issues, among others, that we have to deal with in order to achieve the best performance CM 5 can give.

• Minimizing communication. The first things that come to mind are to minimize the number of messages communicated between processors, to keep communication local whenever possible, and always to use the fast communication tool (bulk-get). We do let every node processor perform some redundant operations at the same time, as a tradeoff to reduce the number of messages; this turns out to be a good thing to do and affects the final speedups negligibly.

• Load balancing. Load balancing is another important issue, and it has a heavy influence on the final speedups if it is handled poorly. We will be more concrete later about how we distribute work among the processors so that each processor has roughly the same amount of work to finish.

6.2.2 Memory Requirements

Since we are solving a symmetric tridiagonal matrix, the memory we need to store the matrix is 2n − 1 numbers. To reduce communication, we let every node processor have a copy of the matrix. A global spread array glambda is created across the processors for communication between them; it has size n in each node. Besides all this, a work space of size n in each node is needed. For work distribution, an array of at most n structures, each of which takes two double floating point numbers and one integer, is also needed. In total, about 6.5n double precision floating point numbers are used in each node processor.

6.2.3 Communication Details

Communications come in after the completion of each step and before entering the next step (refer to Figures 6 and 7). We illustrate our points with the same example as in Figure 6.

Figure 8: Communications: After Step 0 and before Step 1 (each adjacent pair of processors exchanges its eigenvalues, then merges and sorts).

After Step 0 and before Step 1: There is only one level of communication to deal with. At this stage, Processor 0 owns (usually in ascending order) the eigenvalues of T(1:6), and Processor 1 owns those of T(8:13). For the eigenvalues of T(1:13) to be computed in Step 1 by the two processors together, both processors have to know the eigenvalues of T(1:6) and of T(8:13), which serve as separators for the eigenvalues of T(1:13). So what we do is: Processor 0 gets the eigenvalues of T(8:13) from Processor 1 while, at the same time, Processor 1 gets those of T(1:6); then both processors put all the eigenvalues together and sort them by merge sort. (Hence, one of the two is doing redundant work!) After sorting, Processor 0 takes responsibility for finding the first few (roughly half, generally) eigenvalues of T(1:13), while Processor 1 does the rest. The work distribution strategy will be discussed later. Similarly, Processors 2 to 7 do their own parts. Figure 8 displays what we just said; a double-headed arrow indicates communication between the corresponding two processors.
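The merge of two already-sorted spectra is the standard two-pointer merge; here is a minimal C sketch (names ours):

```c
/* Merge two sorted eigenvalue lists a[0..na-1] and b[0..nb-1] into
   out[0..na+nb-1]. Each partner processor runs the same merge on
   the same data, trading redundant work for fewer messages. */
void merge_sorted(const double *a, int na, const double *b, int nb,
                  double *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}
```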

After Step 1 and before Step 2: Two levels of communication are involved. At this stage, Processor 0 owns the first part of the spectrum of T(1:13) while Processor 1 owns the rest, and Processor 2 owns the first part of the spectrum of T(15:26) while Processor 3 owns the rest. For the eigenvalues of T(1:26) to be computed in Step 2 by the four processors together, they have to know the eigenvalues of T(1:13) and of T(15:26), which, once sorted, serve as separators for the eigenvalues of T(1:26). To this end, we do the communications shown in Figure 9.

Figure 9: Communications: After Step 1 and before Step 2 (Level 1: merge; Level 2: merge and sort).

After Level 1, Processor 0 and Processor 1 both own the entire set of eigenvalues of T(1:13) in proper order, while Processor 2 and Processor 3 both own the entire set of eigenvalues of T(15:26); after Level 2, each of the four processors has all the eigenvalues (sorted) of T(1:13) and of T(15:26).

After Step 2 and before Step 3: Three levels of communication are involved. At this stage, Processors 0 to 3 together own the eigenvalues of T(1:26), each of the four owning a consecutive part. Likewise, Processors 4 to 7 together own the eigenvalues of T(28:50), and again each of them owns a consecutive part. For the eigenvalues of T to be computed in Step 3 by all the processors together, they have to know the eigenvalues of T(1:26) and of T(28:50), which, once sorted, serve as separators for the eigenvalues of T. To this end, we do the communications

shown in Figure 10.

Figure 10: Communications: After Step 2 and before Step 3 (Level 1: merge; Level 2: merge; Level 3: merge and sort).

After Level 2, each of Processors 0 to 3 owns the entire (ordered) set of eigenvalues of T(1:26), while each of Processors 4 to 7 owns the entire (ordered) set of eigenvalues of T(28:50); after Level 3, each of the eight processors has all the eigenvalues of T(1:26) and of T(28:50).

In general, there are j levels of communication after Step j − 1 and before Step j. At each level, all messages can be passed simultaneously, since all processors are connected by the fat-tree network. Each message, hopefully, is of equal length; thus no processor finishes its message passing earlier than the others and becomes idle.

6.2.4 Task Distribution

Right before each step, after the last level of the communication preceding it, each processor owns two sets of eigenvalues, each of which is in proper order; together they form the separators for the eigenvalues of the current (sub)matrix, to be found by the relevant processors in the coming step. Before getting into the step, two things remain to be done:

1. Sort the two sets of eigenvalues into a single set in proper order. We use merge sort to fulfill this task, since the eigenvalues in each set are already in order. Each relevant processor actually performs the same sort; these redundant sorts do not affect the overall speedups, as their cost is marginal.

2. Determine cheaply which part of the spectrum of the current (sub)matrix each relevant processor is going to compute, while maintaining load balance.

To deal with the second item, we again let every relevant processor do the same thing, in exchange for less communication. After each processor finishes sorting, we let it scan the sorted eigenvalues, which are, as a matter of fact, separators for the eigenvalues to be found, and do the following.

If two consecutive separators are close enough, a new eigenvalue is found instantly; put it into the corresponding position in glambda. Otherwise, an entry of an array of structures is created, where the structure has the form

struct New {
    double left;
    double right;
    int    index;
};

Here left is the left endpoint of the interval containing the eigenvalue we want to compute, right is the right endpoint, and index indicates which eigenvalue is to be computed. At the end, every relevant processor knows how many structures have been created, which in turn gives the number of eigenvalues that need to be computed by Laguerre iterations. At this point it is easy to distribute the work evenly among the relevant processors without introducing further communication. Doing so makes every processor compute roughly the same number of eigenvalues and, hopefully, keeps the load balanced.
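Here is a minimal C sketch of this separator scan under our assumptions (names are ours; the mapping from interval position to eigenvalue index is simplified here, and tol stands for whatever closeness test the code applies):

```c
/* Scan m sorted separators; emit an eigenvalue immediately when two
   consecutive separators nearly coincide, otherwise queue a work
   item (struct New above) for Laguerre iteration. Returns the
   number of work items created. */
int scan_separators(int m, const double *sep, double tol,
                    double *glambda, struct New *work)
{
    int nwork = 0;
    for (int j = 0; j + 1 < m; j++) {
        if (sep[j + 1] - sep[j] <= tol)
            glambda[j] = 0.5 * (sep[j] + sep[j + 1]); /* found instantly */
        else {
            work[nwork].left  = sep[j];
            work[nwork].right = sep[j + 1];
            work[nwork].index = j;
            nwork++;
        }
    }
    return nwork;
}
```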

7 Numerical Tests on CM 5

Our parallel implementation on CM 5 is a great success in the sense that the best speedup is achieved on all the kinds of matrices we have tested. Table 6 lists the times consumed on matrices whose eigenvalues are well separated, where p denotes the actual number of processors participating in the computation. To be more specific, the diagonal entries of these matrices are 1, 2, ..., n and the off-diagonal entries are all 1.

Table 6: Timing for Matrices with Well-Separated Eigenvalues (columns: n and p = 1, 2, 4, 8, 16, 32, 64)

Table 7 displays the times taken on tridiagonal matrices whose entries are uniformly distributed on [−1, 1], subject to symmetry.

Table 7: Timing for Random Matrices (columns: n and p = 1, 2, 4, 8, 16, 32, 64)

To see how much speedup we obtained, we plot two figures showing the speedups hiding in Tables 6 and 7. Glued Wilkinson matrices are always an interesting class of test matrices; Table 8 and Figure 13 display how efficient our implementation is on this kind of matrix.

Table 8: Timing for Glued Wilkinson Matrices (columns: n and p = 1, 2, 4, 8, 16, 32, 64)

References

[1] J. J. M. Cuppen, A divide and conquer method for the symmetric tridiagonal eigenproblem, Numer. Math., 36 (1981), 177–195.

Figure 11: Speedups for Matrices with Well-Separated Eigenvalues (speedup versus number of processors, for n = 256, 512, 1024, 2048, 4096, 8192).

Figure 12: Speedups for Random Matrices (speedup versus number of processors, for n = 256, 512, 1024, 2048, 4096, 8192).

Figure 13: Speedups for Glued Wilkinson Matrices (speedup versus number of processors; n = 2100, 4200, 8400, ...).

[2] J. Demmel, M. T. Heath, and H. A. van der Vorst, Parallel numerical linear algebra, Acta Numerica, 1993.

[3] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, 2nd edition, 1989.

[4] W. Kahan, Notes on Laguerre's iteration, 1992.

[5] T. Y. Li and Zhonggang Zeng, Laguerre's iteration in solving the symmetric tridiagonal eigenproblem: a revisit, preprint.

[6] B. Parlett, Laguerre's method applied to the matrix eigenvalue problem, Math. Comp., 18 (1964), 464–485.


More information

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11 Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would

More information

6.842 Randomness and Computation March 3, Lecture 8

6.842 Randomness and Computation March 3, Lecture 8 6.84 Randomness and Computation March 3, 04 Lecture 8 Lecturer: Ronitt Rubinfeld Scribe: Daniel Grier Useful Linear Algebra Let v = (v, v,..., v n ) be a non-zero n-dimensional row vector and P an n n

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Fall 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate it

More information

Singular and Invariant Matrices Under the QR Transformation*

Singular and Invariant Matrices Under the QR Transformation* SINGULAR AND INVARIANT MATRICES 611 (2m - 1)!! = 1-3-5 (2m - 1), m > 0, = 1 m = 0. These expansions agree with those of Thompson [1] except that the factor of a-1' has been inadvertently omitted in his

More information

Computational Complexity

Computational Complexity Computational Complexity S. V. N. Vishwanathan, Pinar Yanardag January 8, 016 1 Computational Complexity: What, Why, and How? Intuitively an algorithm is a well defined computational procedure that takes

More information

I-v k e k. (I-e k h kt ) = Stability of Gauss-Huard Elimination for Solving Linear Systems. 1 x 1 x x x x

I-v k e k. (I-e k h kt ) = Stability of Gauss-Huard Elimination for Solving Linear Systems. 1 x 1 x x x x Technical Report CS-93-08 Department of Computer Systems Faculty of Mathematics and Computer Science University of Amsterdam Stability of Gauss-Huard Elimination for Solving Linear Systems T. J. Dekker

More information

Matrices, Moments and Quadrature, cont d

Matrices, Moments and Quadrature, cont d Jim Lambers CME 335 Spring Quarter 2010-11 Lecture 4 Notes Matrices, Moments and Quadrature, cont d Estimation of the Regularization Parameter Consider the least squares problem of finding x such that

More information

MATRIX MULTIPLICATION AND INVERSION

MATRIX MULTIPLICATION AND INVERSION MATRIX MULTIPLICATION AND INVERSION MATH 196, SECTION 57 (VIPUL NAIK) Corresponding material in the book: Sections 2.3 and 2.4. Executive summary Note: The summary does not include some material from the

More information

11.2 Reduction of a Symmetric Matrix to Tridiagonal Form: Givens and Householder Reductions

11.2 Reduction of a Symmetric Matrix to Tridiagonal Form: Givens and Householder Reductions 11.2 Reduction of a Symmetric Matrix to Tridiagonal Form 469 for (j=i+1;j= p) p=d[k=j]; if (k!= i) { d[k]=d[i]; d[i]=p; for (j=1;j

More information

Recall, we solved the system below in a previous section. Here, we learn another method. x + 4y = 14 5x + 3y = 2

Recall, we solved the system below in a previous section. Here, we learn another method. x + 4y = 14 5x + 3y = 2 We will learn how to use a matrix to solve a system of equations. College algebra Class notes Matrices and Systems of Equations (section 6.) Recall, we solved the system below in a previous section. Here,

More information

The Growth of Functions. A Practical Introduction with as Little Theory as possible

The Growth of Functions. A Practical Introduction with as Little Theory as possible The Growth of Functions A Practical Introduction with as Little Theory as possible Complexity of Algorithms (1) Before we talk about the growth of functions and the concept of order, let s discuss why

More information

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2019

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2019 CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis Ruth Anderson Winter 2019 Today Algorithm Analysis What do we care about? How to compare two algorithms Analyzing Code Asymptotic Analysis

More information