A SIMPLE AD EFFICIET PARALLEL FFT ALGORITHM USIG THE BSP MODEL MARCIA A. IDA AD ROB H. BISSELIG Abstract. In this aer, we resent a new arallel radix-4

Size: px

Start display at page:

Download "A SIMPLE AD EFFICIET PARALLEL FFT ALGORITHM USIG THE BSP MODEL MARCIA A. IDA AD ROB H. BISSELIG Abstract. In this aer, we resent a new arallel radix-4"

Hope Morton
6 years ago
Views:

1 Universiteit-Utrecht * Deartment of Mathematics A simle and ecient arallel FFT algorithm using the BSP model by Marcia A. Inda and Rob H. Bisseling Prerint nr. 3 March 2000

2 A SIMPLE AD EFFICIET PARALLEL FFT ALGORITHM USIG THE BSP MODEL MARCIA A. IDA AD ROB H. BISSELIG Abstract. In this aer, we resent a new arallel radix-4 FFT algorithm based on the BSP model. Our arallel algorithm uses the grou-cyclic distribution family, which makes it simle to understand and easy to imlement. We show how to reduce the communication cost of the algorithm by a factor of three, in the case that the inut/outut vector is in the cyclic distribution. We also show howtoreduce comutation time on comuters with acache-based architecture. We resent erformance results on a Cray T3E with u to 64 rocessors, obtaining reasonable eciency levels for local roblem sizes as small as 256 and very good eciency levels for sizes larger than Introduction The discrete Fourier transform (DFT) lays an imortant role in comutational science. DFT alications ranges from solving numerical dierential equations to signal rocessing. (For an introduction to DFT alications see e.g. [7].) The widesread use of DFTs in comutational science is mainly due to the existence of fast algorithms, known by the general name of fast Fourier transform (FFT), which comute the DFT of an inut vector of size in O( log ) oerations instead of the O( 2 ) oerations needed by a direct aroach, i.e., by a matrix-vector multilication. In 965, Cooley and Tukey [9] ublished a aer describing the FFT idea (giving secial attention to the so called radix-2 FFT ). Since then, many variants of the algorithm have aeared. For an extensive discussion of the family of FFT algorithms, see Van Loan [27]. In recent years, after the dawn of arallel comuting, the originally sequential FFT algorithms have been modied and adated to the needs of arallel comutation (see e.g. [, 2, 3,,, 2, 4, 5, 2, 22, 25, 27]). The lack of a unied arallel comuting model and the existence of many dierent arallel architectures have made it rather dicult to develo ecient and ortable arallel FFTs. Recently, however, as the arallel rogramming environments have become less machine deendent, examles of such algorithms have aeared. Tyical examles are the Key words and hrases. fast Fourier transform, bulk synchronous arallel, arallel comutation. Inda suorted by a doctoral fellowshi from CAPES, Brazil. Comuter time on the Cray T3E of HPaC, Delft, funded by CF, The etherlands.

3 6-ass (or 6-ste) aroach and the related transose aroach (see e.g. [2, 3, 4, 5, 2]). Those algorithms regard the inut vector of size = 0 as an 0 matrix, and carry out the comutations in a similar way as done for two-dimensional FFTs. Since those algorithms require the number of rocessors to be a divisor of 0 and, they can only be used if. As the number of available rocessors grows and the communication seed increases, it is imortant to develo arallel algorithms that can handle more than rocessors. Though generalized algorithms have already been roosed, they only work for very secic combinations of and such as = (d;)=d =k, where d is the dimensionality and k [2, Cha. 0.3] or k = [22], and both and are owers of two. Furthermore, to our knowledge none of those algorithms were imlemented. Our main aim in this aer is to resent a new arallel FFT algorithm and its imlementation. Our arallel algorithm works for any < as long as both are owers of two, which is required because of the radix-2 framework. (A mixed-radix framework will be discussed elsewhere [].) We dedicate the remainder of this section to giving a brief introduction to the basic framework of radix-2 and radix-4 FFTs and to the bulk synchronous arallel (BSP) model. In Section 2, we derive our arallel FFT algorithm by inserting suitable ermutation matrices into the basic radix-2 decomosition of the Fourier matrix. This aroach leads to a simle and easy to imlement distributed memory arallel FFT algorithm. In Section 3, we resent a set of temlates that are used in the imlementation of the algorithm. In Section 4, we resent variants of our FFT algorithm. We show how to modify the algorithm to accet vectors that are not in the block distribution. We alsoshowhow to obtain a cache-friendly version of our algorithm, that is, an algorithm that takes advantage of the cache memory of a comuter by breaking u the comutations into small sections in such away that the data stored in the cache is comletely used before new data is brought in. In Section 5, we resent results regarding the erformance of our imlementation and discuss asects such as the cache eect. In Section 6, we draw our conclusions and discuss future work. 2

4 .. The fast Fourier transform. The DFT of a comlex vector z of size is dened as the comlex vector Z, also of size, with comonents Z k = X; j=0 z j e 2ijk 0 k<: (.) The inverse DFT, which transforms the comlex vector Z back into the vector z, is then dened by z j = X; k=0 Z k e ; 2ijk 0 k<: (.2) Alternatively, the DFT can be seen as a matrix-vector multilication: Z = F z: (.3) The comlex matrix F is known as the Fourier matrix it has elements (F ) jk = w jk, where w = e 2i : (.4) Though it is ossible to develo FFT algorithms that comute the DFT of a vector of arbitrary size, the radix-2 FFT algorithm only works for 's that are owers of two. For simlicity, we will restrict our discussion to such values of. The sequential iterative radix-2 FFT algorithm starts with the so-called bit reversal ermutation of the inut vector (see Section 3.2), and roceeds in log 2 buttery stages, numbered K =2 4 :::, which modify the vector. Each buttery stage consists of =K times a buttery comutation: z t+j z t+j+k=2 A z t+j + w j K z t+j+k=2 z t+j ; w j K z t+j+k=2 A for j =0 ::: K=2 ;, (.5) where t = 0 K ::: ; K indicates the beginning of a buttery block in the vector. These oerations cost one comlex multilication and two additions, or 0 real oating oint oerations (os), er air. The total o count of the radix-2 FFT is therefore C FFT-2 () =0 K 2 K log 2 =5 log 2 : Algorithm. is an in-lace version of this algorithm. 3

5 Algorithm. Sequential radix-2 FFT algorithm. CALL Seq FFT( y). ARGUMETS : Vector size is a owerof2with 2. y =(y in 0 ::: yin ;): Comlex vector of size. P OUTPUT y (y out 0 ::: y out ; ;), where yout k = DESCRIPTIO. Perform a bit reversal on y. 2. Perform log 2 buttery stages A K on y. K 2 while K do for t =0to ; K ste K do for j =0to K=2 ; do a w j K y t+j+k=2 y t+j+k=2 y t+j ; a y t+j y t+j + a K 2 K j=0 yin j ex(2ijk=). Following van Loan's matrix aroach [27], Algorithm. can be described as a sequence of sarse matrix-vector multilications which corresond to the following decomosition of the Fourier matrix F = A A A 4 A 2 P (.6) where P is the ermutation matrix corresonding to the bit reversal ermutation (ste of Algorithm.), and the matrices A K corresond to the buttery stages (ste 2 of Algorithm.). The block structure of the buttery stages leads to block-diagonal matrices of the form A K = I =K B K (.7) which is shorthand for a block-diagonal matrix diag(b K ::: B K )with=k coies of the K K matrix B K on the diagonal. The symbol reresents the direct (or Kronecker) roduct of two matrices, which is formally dened at the end of this subsection. The matrix B K is known as the K K 2-buttery matrix which corresonds to the buttery Actually, the matrix decomosition corresonding to the algorithm of Cooley and Tukey [9] is F = P ~ A ~ A ~ A4 ~ A2 where ~ AK = P ; A K P. 4

6 comutation (.5). This matrix can be written as B K = 2 4 I K=2 K=2 I K=2 ; K=2 3 5 : (.) Here, the matrix I K=2 is the K=2 K=2 identity matrix and K=2 is the K=2 K=2 diagonal matrix K=2 = diag(w 0 K w K ::: w K=2; K ): (.9) Later on we will also need generalized versions of A K : A K = I =K B K (.0) where B K is the generalized K K 2-buttery matrix [3, 5, 0] B K = 2 4 I K=2 K=2 I K=2 ; K=2 3 5 (.) which has the same form as the original B K, but the weights w j K in (.9) are relaced by w j+ K, where can be any real number. In ractice, often a radix-4 FFT is used. A radix-4 algorithm can be derived comletely analogously to the radix-2 algorithm, yielding a similar matrix decomosition. The algorithm starts with a reversal of airs of bits instead of a reversal of single bits, and roceeds in log 4 4-buttery stages which involve quadrules of vector comonents instead of airs. Since 34 os are erformed er quadrule, this brings the o count down to C FFT-4 () =34 K 4 K log 4 =4:25 log 2 : The resulting algorithm has the disadvantage that either it must be assumed that is a ower of four, or secial recautions must be taken which comlicate the algorithm. Wetake a slightly dierent aroach: wherever ossible wetake airs of stages A K A K=2 together and erform them as one oeration. Our K K 4-buttery matrix has the form D K = B K (I 2 B K=2 )= I K=4 2 K=4 K=4 3 K=4 I K=4 ; 2 K=4 i K=4 ;i 3 K=4 I K=4 2 ; K=4 K=4 ; 3 K=4 5 (.2) I K=4 ; 2 K=4 ;i K=4 i 3 K=

7 where K=4 is the K=4 K=4 diagonal matrix K=4 = diag(w 0 K w K ::: w K=4; K ): (.3) This matrix is a symmetrically ermuted version of the radix-4 buttery matrix [27]. 2 This aroach gives the eciency of a radix-4 FFT algorithm, and the exibility of treating a arallel FFT within the radix-2 framework. For examle, if we wish to ermute the data sometime during the comutation, for reasons of data locality, this can haen after any stage, and not only after an even number of stages. An algorithm for the inverse FFT is obtained using the following roerty: F ; = F = A A 4 A 2 P : (.4) The backward algorithm is basically the same as the forward one, the only dierence being that the owers of w K are relaced by their conjugates and that the nal result is rescaled. ow, we dene the direct roduct of two matrices and give some roerties that will be used in the course of the aer. Denition. (Direct roduct). Let A be a q r matrix and B be an m n matrix. Then the direct roduct (or Kronecker roduct) of A and B is the qm rn matrix dened by A B = a 0 0 B a 0 r; B : a q; 0 B a q; r; B As one would exect, the direct roduct is associative, but it is not commutative. Lemma.2 summarizes some direct roduct roerties that follow directly from the definition. (See [24, 27] for other useful roerties). Lemma.2 (Proerties of the direct roduct). The following holds.. (A B)(C D) =(AC) (BD), as long as the roducts are dened. 2. (I m I n )=I mn. 2 In verifying this, note that van Loan denes the weights to be w K = ex(; 2i K ). 6

8 3. If A and B are square matrices of order m and n, resectively, then (A I n )(I m B) =(A B) =(I m B)(A I n ). 4. If A and B are square matrices of order n such that AB = BA, then (I m A)(I m B) =(I m B)(I m A)..2. The bulk synchronous arallel model. The BSP model [26] is a arallel rogramming model which gives a simle and eective way to roduce ortable arallel algorithms. It does not deend on a secic comuter architecture, and it rovides a simle cost function that enables us to choose between algorithms without actually having to imlementthem. In the BSP model, a comuter consists of a set of rocessors, each with its own memory, connected by a communication network that allows rocessors to access the rivate memories of other rocessors. Accessing local memory (the rocessor's own memory) is faster than accessing remote memory (memory owned by other rocessors), but access time is considered to be indeendent from the comuter architecture. In this model, algorithms consist of a sequence of suerstes and synchronization barriers. The use of suerstes and synchronization barriers imoses a sequential structure on arallel algorithms, and this greatly simlies the design rocess. The variant of the BSP model that we use is a single rogram multile data (SPMD) model, i.e., each one of the rocessors executes a coy of the same rogram, though each has its own data. The rogram distinguishes between the rocessors through a arameter s (the rocessor identication number). Secial cases are treated using \if" statements. In our model, a suerste is either a comutation suerste, or a communication suerste. A comutation suerste is a sequence of local comutations carried out on data already available locally before the start of the suerste. A communication suerste consists of communication of data between rocessors. To ensure the correct execution of the algorithm, global synchronization barriers (i.e., laces of the algorithm where all rocessors must synchronize with each other) recede and/or follow a communication suerste. A BSP comuter can be characterized by four global arameters:, the number of rocessors v, thecomuting velocity in o/s 7

9 g, the communication time er data element sent or received, measured in o time units l, the synchronization time, also measured in o time units. Algorithms can be analyzed by using the arameters g, and l. The arameter v is used to estimate the total execution time after the cost function has been comuted. The o count of a comutation suerste is simly the maximum amount of work (in os) of any rocessor. The o count of a communication suerste is hg + l or hg +2l (deending on the number of global synchronizations). Here h is the maximum number of data elements sent or received by any rocessor. The cost function of an algorithm can be obtained by adding the os of the searate suerstes. This yields an exression of the form a + bg + cl. For further details and some basic techniques, see [4, 6]. The second reference describes BSPlib, a standard library dened in May 997 which enables arallel rogramming in BSP style. The Paderborn University BSP (PUB) library [6] is another library that ermits rogramming in BSP style it rovides the extra feature of subset synchronization..3. Parallel radix-2 FFTs. Since the introduction of arallel comuters, and even before that, methods for arallelizing FFT algorithms have been roosed [23]. The earliest methods roduced arallel algorithms that, using rocessors, carry out an FFT of size in O(log ) comutation suerstes, which are interleaved by O(log ) communication suerstes that need to communicate O( ) data elements er rocessor (see e.g. [,, 25]) and hence have cost O( g +l). Such methods aeared as a direct conse- quence of the divide-and-conquer structure of the radix-2 FFT algorithm. The aer [] by Chu and George discusses several existing arallel algorithms of this tye and three variants of their own. Restricting the discussion to vector sizes that are owers of two, they resent a common framework in which all the algorithms they discuss are reorderings of one another in the following sense. Each buttery stage K of an FFT of size, erforms airwise oerations that combine elements j and j + K=2 from the vector being transformed using the weight w j modk K. Writing j in its binary reresentation j =(j m; ::: j 0 ) 2, where m =log 2, we observe that elements j and j+k=2 dier only in bit log 2 K; and that w j modk K = w (j log 2 K; ::: j 0 ) 2 K. If the ordering of the vector is changed, so that original element j is stored as element

10 l, the buttery stages must be modied to carry out the same oerations. If we can reresent the new ordering using a ermutation of the original bits, it is easy to know which elements to combine and which weights to use. For examle, if =6a ossible reordering of the inut vector could be l =(j 0 j 2 j j 3 ) 2, where j =(j 3 j 2 j j 0 ) 2. The buttery stage corresonding to K =6should then combine elements l =(j 0 j 2 j 0) 2 with l +=(j 0 j 2 j ) 2 using weights w (j 2 j j 0 ) 2 6. In the arallel scenario, any grou of log 2 bits can be used to reresent the rocessor number, while the remaining log 2 (=) bits are used to reresent the local index. If the bit corresonding to the next buttery stage is one of the log 2 (=) bits that reresent the local index, then that stage is local, otherwise communication is needed. Swarztrauber [25] carries out a similar discussion. He starts with a more general formulation of the roblem, where is not restricted to owers of two, but when discussing the distributed memory framework, he only considers FFTs on a hyercube, restricting both and to owers of two. The roblem of the algorithms discussed in [,, 25] is that reorderings are carried out by means of exchanging one bit at a time. Since there are log 2 bits in the rocessor art, log 2 communication suerstes of size O( g + l) are needed. A less exensive solution to the roblem is to exchange all the rocessor bits with a grou of local bits corresonding to buttery stages that have already been erformed. Since the communication cost of the ermutation that exchanges many bits is of the same order O( g + l) as for exchanging one bit, the reduction in the communication cost is huge. The basic idea for such algorithms already aears in the original Cooley and Tukey aer [9]. In their derivation of the FFT algorithm, they start by considering the case where can be decomosed as = 0, and rewrite (.) as Z k k 0 = X ; j =0 X 0 ; j 0 =0 z j0 j w j 0k 0! w j (k 0 +k 0 ) 0 k < 0 k 0 < 0 (.5) where Z k k 0 = Z k 0 +k 0 = Z k, and z j0 j = z j0 +j = z j. Since w j 0k 0 = w j 0k 0, the inner sum of (.5) corresonds to a DFT of size 0, which can be comuted by the same rocess as before if 0 is not rime. They remark that this rocess can be alied to any ossible factorization of, = 0 ::: H; 0 and that, if is comosite enough, real gains (over the O( 2 ) direct aroach) can be achieved. Afterwards they derive the 9

11 radix-2 algorithm by choosing to be a ower of two. If instead of decomosing into its rime factors, we sto at a higher level, we obtain a decomosition of the FFT into a sequence of shorter FFTs that, in the arallel case, can be sread out over the rocessors. This is what haens in our FFT algorithm resented in the next section and in algorithms based on the 6-ass aroach and the related transose aroach (see e.g. [2, 3, 4, 5, 2]). 2. The arallel algorithm 2.. Grou-cyclic distribution family and the arallel FFT. Since our arallel FFT algorithm is based on the radix-2 decomosition (.6) of the Fourier matrix, must be a ower of two. For ractical reasons = must be integer and therefore must also be a ower of two. Our arallel FFT algorithm makes use of the data distributions dened below. Denition 2. (Cyclic distribution in r grous, C r ( )). Let r,, and be integers with r, such that r divides and. Let f be a vector of size to be distributed over rocessors organized in r grous. Dene M = =r to be the size of the subvector of a grou and u = =r to be the number of rocessors in a grou. We say that f is cyclically distributed in r grous (or r-cyclically distributed) over rocessors if, for all j, the element f j has local index j 0 =(j mod M) div u, and is stored in rocessor s 0 + s, where s 0 = (j div M) u is the number of the rst rocessor in the grou (i.e., the rocessor oset) and s =(j mod M) modu is the rocessor identication within the grou. We use the name grou-cyclic distribution family to designate all the r-cyclic distributions generated by the same and. This family includes the well-known cyclic distribution (C ( )) and block distribution (C ( )) as extreme cases. The arallel FFT algorithm works as follows. A total of H = dlog 2 = log 2 (=)e = dlog e hases is erformed. (The number of hases is the largest integer H for which (=) H;.) In hase 0, the algorithm uses the block distribution to erform the rst log 2 (=) buttery stages, i.e., those involving butteries with K = (which we call short distance butteries). Afterwards, in each intermediate hase J < H; it uses the r-cyclic distribution, with r = =(=) J, to erform a grou of log 2 (=) buttery 0

12 roc. 0 roc. roc. 2 roc. 3 roc. 4 roc. 5 roc. 6 roc. 7 (A) Phase 0: Short distance butterflies Phase : Medium distance butterflies Phase 2: Long distance butterflies (B) Phase 0: Short distance butterflies Phase : Medium distance butterflies Phase 2: Long distance butterflies Figure. Buttery oerations using the grou-cyclic distribution family C r ( ) = C r ( 32). (A) Logical view, (B) storage view. The short distance butteries (A 2 32 and A 4 32 ) are erformed using the C ( 32) distribution (block distribution). The medium distance butteries (A 32 and A 6 32 ) are erformed using the C 2 ( 32) distribution. The long distance butteries (A ) are erformed using the C ( 32) distribution (cyclic distribution). For clarity, not all buttery airs are shown. stages with (=) J <K (=) J+ (the medium distance butteries). Finally, in hase H ;, it uses the cyclic distribution to erform the remaining buttery stages, i.e., those involving butteries with K > (=) H; (the long distance butteries). Figure illustrates the use of the grou-cyclic distribution family in our arallel FFT. The same oerations are illustrated in two ways: (A) using the logical view, and (B) using the storage view. The logical view emhasizes the logical sequence of the elements in the vector while the storage view emhasizes the way the elements are actually stored. For the block distribution, both views are the same. Our algorithm has only O(log = log(=)) communication suerstes of size O( g + l), which is a signicant imrovement over the O(log ) communication suerstes also of

13 size O( g + l) of the algorithms discussed in the revious section. McColl [22] outlined a arallel FFT algorithm which is a secial case of the FFT algorithm we resent here. His algorithm only works for =(=) H. Furthermore, his algorithm sends the indices corresonding to the weights together with the data vector, increasing the communication costs unnecessarily. In the original BSP aer [26], Valiant already discussed the FFT and mentioned the order of the communication cost that can be achieved Permutations and ermutation matrices. Let u, v, and be ositive integers such that u divides v and v divides. We dene the following ermutation: u v : f0 ::: ; g!f0 ::: ; g j = j 0 M + j u + j 2 7! l = j 0 M + j 2 v + j (2.) where M = v u, j 0 = j div M, j = (j mod M) div u, and j 2 = j mod u. ote that ; u v = v u and that v = u is the identity ermutation. Permuting a vector of size by u v can be achieved by dividing the vector into v=u subvectors of size M = u=v and then erforming a (erfect) shue ermutation u M : u M : f0 ::: M ; g!f0 ::: M ; g j 7! k =(j mod u) M u + j div u (2.2) on each of the subvectors. 3 This relation can be exressed by which imlies that u u = u. u v (j) =j div M M + u M (j mod M) (2.3) In the case that divides, redistributing a vector of size from block distribution to r-cyclic distribution over rocessors is equivalent to ermuting it by u, where u = =r. ermutation matrix: Using matrix notation, this ermutation can be exressed by the (; u ) lj = >< >: 0 if l = u (j) otherwise: (2.4) 3 The shue ermutations (j) and (j) can be used to ermuteavector of size distributed over rocessors from block to cyclic distribution and vice-versa. 2

14 Multilying a vector y by ; u results in a vector with comonents (; u y) l = y ; u (l), for all l in other words, this multilication corresonds to redistributing the vector from block distribution to cyclic distribution in r = =u grous. ote that ; u = I r S u u, cf. (2.3). The matrix corresonding to the inverse ermutation ; u is ;; u =;T u = ; u. From now on, we use the abbreviations ; u and u to denote ; u and u, resectively. We restrict the use of subscrits and to cases where they are not obvious from the context. We also use S u M to denote ; u u M Decomosition of the Fourier matrix. To obtain the arallel FFT algorithm, we modify the original radix-2 decomosition of the Fourier matrix (.6) by inserting identity ermutation matrices I =; ; u ; u corresonding to the changes of distribution, and regrouing the matrices in the resulting decomosition. In the case that this is done as follows: F =; ; ; A ; ; ; ; ; ; A 2 ;; ; A :::A 2 P =; ; (; A ; ; ) (; A 2 ;; ); A :::A 2 P : (2.5) By dening ^A k u =; u A ku ; ; u (2.6) and using the fact that ; is the identity matrix, we can rewrite (2.5) as F =; ; ^A ^A 2 2 {z } hase ; ; ; ^A ^A 2 {z } hase 0 ; P : (2.7) As we did with the ermutation matrices ;, from now on we denote ^A k u by ^A k u, reserving the indices and for when they are not obvious from the context and for stand-alone denitions. In the general case, not restricted to, we arrive at the following decomosition of the Fourier matrix: F =; ; ^A ::: ^A 2 (=)H; {z } hase H; ; ; ; ( )H;2 ^A ( )H;2 ::: ^A 2 ( ) H;2 ::: ; ; {z } hase H;2 ^A ::: ^A 2 {z } hase 3 ; ; ; ; ( ) H;2 ::: ^A ::: ^A 2 ; P : (2.) {z } hase 0

15 (A) A 0 A 4 (B) B 0 B 0 A A 4 s =0 B 4 B 4 A 0 A 4 2 A 4 A 3 4 s = B 2 4 B 2 4 s =2 B 3 4 B 3 4 s =3 (C) w 0 w w 2 B 0 w 3 = ;w 0 ;w ;w 2 ;w 3 B 4 = w 4 w 4 w 2 4 w 3 4 ;w 4 ;w 4 ;w 2 4 ;w 3 4 B 2 4 = 2 w 4 w 2 4 w 22 4 w ;w 4 ;w 2 4 ;w 22 4 ;w 32 4 B 3 4 = 3 w 4 w 3 4 w 23 4 w ;w 4 ;w 3 4 ;w 23 4 ;w 33 4 Figure 2. (A) Structure of the matrix ^A k u = I r diag(a 0=u k n A=u k n ::: A(u;)=u k n ). Examle with = 2, =, r = 2, k =, and hence u = 4 and n = 6. (For clarity, the A k n are deicted as A k ) (B) Matrix diag(a0=u k n A=u k n ::: A(u;)=u k n ). (C) Matrices B s =u k, with s =0 ::: u;. The matrices ^A k u are block diagonal matrices with block size : ^A k u = I r diag(a 0=u k n A=u k n ::: A(u;)=u k n ) (2.9) where r = =u, n = =, anda k n was dened reviously, cf. (.0). Figure 2 examlies the structure of the matrix ^A k u. follows from Theorem 2.2. We shall formally state this as Corollary 2.3 which Theorem 2.2. Let u, M, and k be owers of two such that 2 k M=u. Dene n = M=u. Then S u M A ku M S ; u M = diag(a0=u k n A=u k n ::: A(u;)=u k n ): 4

16 Proof. See Aendix A. Corollary 2.3. Let r,, and be owers of two with r <. Dene u = =r and n = =. Let k be a ower of two with 2 k n. Then Proof. Dene M = =r. Then ^A k u = I r diag(a 0=u k n A=u k n ::: A(u;)=u k n ): ^A k u =; u A ku ; ; u =(I r S u M )(I r A ku M )(I r S ; u M ) = I r (S u M A ku M S ; u M )=I r diag(a 0=u k n A=u k n ::: A(u;)=u k n ): Starting from the Fourier matrix decomosition (2.), it is easy to develo a arallel (BSP) FFT algorithm. Since all the matrices ^A k u are block diagonal matrices with block size equal to =, any multilication ^A k u y can be handled locally, rovided that the vector y is in the block distribution. This roerty guarantees that matrix decomosition (2.) detaches communication and comutation comletely. On the one hand, each generalized buttery hase ( ^A u ::: ^A 2 u )y, is a strict comutation suerste. On the other hand, each ermutation ; u is a strict communication suerste. In the next section we give a comlete descrition of the resulting arallel algorithm. 3. Imlementation of the arallel algorithm We resent our algorithm in the form of temlates, which are high level of detail algorithms, ready to be imlemented, though not necessarily comletely otimized. In the list that follows we introduce the terminology used in our arallel algorithms. Suerstes. Each suerste is numbered textually and labeled according to its tye: (Com) comutation suerste, (Comm) communication suerste, (CCm) subroutine containing both comutation and communication suerstes. Global synchronizations are exlicitly indicated by the keyword Synchronize. Suerstes inside loos are executed reeatedly, though they are numbered only once. Indexing. All the indices of vectors or array structures are global. This means that array elements have a unique index which is indeendent of the rocessor that owns it. This roerty enables us to describe variables and gain access to arrays in an unambiguous manner, even though the array is distributed and each rocessor has 5

17 only art of it. (In an actual imlementation, it is more convenient to convert the indexing scheme to a local one.) Communication. Communication between rocessors is indicated using g j Put(id n f i ) This oeration uts n elements of array f, starting from element i, into rocessor id and stores them there in array g starting from element j. Subscrits are not needed when the rst element of the array is 0 or when communicating scalars. When communicating more than one element, we use boldface to emhasize that we are dealing with a vector and not a scalar. All uts are assumed to be buered, so that they are safely carried out, even if f and g haen to be the same. Algorithm 3. is a direct imlementation of the matrix decomosition (2.). The inut vector y is transformed in lace. Suerste ermutes the inut vector by a bit reversal: y P y: Suerste 2 carries out the short distance butteries: y (A :::A 2 )y. (Since is the identity ermutation, ^Ak = A k, for k = 2 ::: =.) Suerste 3 ermutes y to the r-cyclic distribution, with r = max( =(=)): y ; =r y: Each time suerste 4 is executed, it comutes a grou of medium distance butter- ies: y ( ^A ( ::: ^A )J 2 ( ) J ) y J H ; 2: Suerste 5 reares the vector for the next buttery hase by ermuting it to the r-cyclic distribution, with r = max( =(=) J+ ): y (; r ; ; ) y: Suerste 6 carries out the long distance butter- ( )J ies: y ( ^A ::: ^A (=)H; 2 ) y: Finally, suerste 7 ermutes the vector back to the y: ote that, to obtain the normalized inverse transform, block distribution: y ; ; the outut vector must be divided by. The subroutines used in the FFT temlate are described in the following subsections. 3.. Generalized butteries. The subroutine BTFLY (Algorithm 3.2) is a sequential subroutine that multilies the inut vector by A n n :::A k 0 n A k 0 =2 n. airs of generalized buttery oerations. Ste erforms The k-th iteration of the outermost whileloo erforms the air of generalized butteries stages A k n A k=2 n, the t-th iteration of the intermediate for-loo corresonds to the t-th reetition of the generalized 4-buttery Dk = B k (I 2 B k=2 ), cf. (.2), which is comuted by the innermost for-loo. Ste 2 6

18 Algorithm 3. Temlate for the arallel fast Fourier transform, using the grou-cyclic distribution family. CALL BSP FFT(s sign y). ARGUMETS s: Processor identication 0 s<. : umber of rocessors is a ower of 2 with <. sign: Transform direction + for forward, ; for backward. : Transform size is a ower of 2 with 2. y =(y in 0 ::: yin ;): Comlex vector of size (block distributed). P OUTPUT y (y out 0 ::: y out ; ;), where yout k = DESCRIPTIO CCm Parallel bit reversal ermutation. BSP BitRev(s y) 2 Com Short distance butteries. BTFLY(0 sign 4 y s ) j=0 yin j ex(sign 2ijk=). 3 CCm Permutation to C max( =(=)) ( ) distribution. BSP BlockToCyclic(s ; s mod smod min( ) y (s;smod ) ) H dlog e for J =to H ; 2 do 4 Com Medium distance butteries. BTFLY( smod(=)j (=) J sign 4 y s ) 5 Comm Permutation to C max( =(=)J+) ( ) distribution. BSP CyclicToCyclic(s (=) J min( (=) J+ ) y) 6 Com Long distance butteries. BTFLY( s sign 4 (=)H; y s ) 7 CCm Permutation to block distribution. BSP CyclicToBlock(0 s y) is only executed if the number of buttery stages is odd. It comutes the last generalized 2-buttery A n n. The FFT algorithm comutes the desired (short, medium, or long distance) buttery stages corresonding to hase J, 0 J < H, by dening the inut arameter = (s mod u)=u, where u = min( (=) J ), and erforming the generalized buttery stages on the local art of the vector y (i.e., the subvector y s starts at element s=). of size = that If the needed weights are stored in a looku table, the cost of an A k na k=2 n buttery oeration is 34 k 4 n k = 7 2 n. Summing over all airs A k n A k=2 n 7 and adding 5n os for

19 Algorithm 3.2 Temlate for the sequential generalized buttery oerations. CALL BTFLY( sign n k 0 y). ARGUMETS : Buttery arameter, used to comute the correct weights 0 <. sign: Transform direction + for forward, ; for backward. n: Vector size n is a ower of 2 with n 2. k 0 : Smaller 4-buttery size k 0 is a ower of 2 with 4 k 0 2n. y =(y 0 ::: y n; ): Comlex vector of size n. OUTPUT y A n n :::A k 0 n A k 0 =2 ny for forward, and y A n n ::: A k 0 na k 0 =2 ny for backward. DESCRIPTIO. Perform airs of buttery stages A k n A k=2 n. k k 0 while k n do for t =0to n ; k ste k do for j =0to k=4 ; do yw y t+j+k=2 yw2 y t+j+k=4 yw3 y t+j+3k=4 a y t+j + yw2 b y t+j ; yw2 c yw +yw3 d yw ; yw3 y t+j a + c y t+j+k=4 b + sign id y t+j+k=2 a ; c b ; sign id y t+j+3k=4 w sign(j+) k w sign2(j+) k w sign3(j+) k k 4 k 2. Perform the last buttery stage A n n. if k =2n then for j =0to n=2 ; do a y j+n=2 y j w sign(j+) n y j+n=2 y j ; a y j + a the last buttery (if necessary) gives a cost of C BTFLY (n k 0 )= = 7 2 n [(log 2 n ; log 2 k 0 +2)div2]+5n [(log 2 n ; log 2 k 0 +2)mod2] = 7 4 n [log 2 n ; log 2 k 0 +2]+ 3 4 n [(log 2 n ; log 2 k 0 +2)mod2] (3.) for the generalized buttery algorithm (Algorithm 3.2).

20 The total comutation cost of our arallel FFT (Algorithm 3.) is obtained by adding the costs of the buttery hases: C BTFLY ( 4) = 7 4 log (log 2 mod 2) for hases J = 0 to H ; 2, where H = dlog e, and for the last hase H ; if (=) H; =, i.e., H =log. Otherwise, the cost for the last hase is This gives C BTFLY ( 4(=)H; )= 7 4 (log 2 mod log 2 ) [(log 2 mod log 2 )mod2]: C FFT ar Com ( ) = 7 4 log [(log 2 mod 2)(log 2 div log 2 ) +(log 2 mod log 2 )mod2] (3.2) where the second term corresonds to the extra cost we have to ay for erforming 2-butteries. The communication and synchronization costs of our arallel FFT are discussed in Section 3.5 after we discuss the arallel ermutation subroutines Parallel bit reversal. The bit reversal matrix P is dened by (P ) jk = >< >: 0 if j = rev (k) otherwise: (3.3) Here, rev is the bit reversal ermutation rev : f0 ::: ; g!f0 ::: ; g j = m; X l=0 b l 2 l 7! k = m; X l=0 b m;l; 2 l (3.4) where m = log 2 and (b m; :::b 0 ) 2 is the binary reresentation of j. ote that rev ; = rev, which means that P ; = P. The bit reversal ermutation has the following very useful roerty. Lemma 3.. Let u =2 q, =2 m, with q m. Then rev (j) = rev u (j div u)+ u rev u(j mod u) 0 j <: 9

21 Proof. Let d = m ; q, so that u =2d. If the binary reresentation of j is (b m; :::b 0 ) 2, then ow, rev (j) = j = j mod u + u (j div u) = m; X l=0 b m;l; 2 l = Xd; l=0 q; X l=0 b m;l; 2 l +2 d X d; b l 2 l +2 q b l+q 2 l : q; X l=0 l=0 b q;l; 2 l = a + u b: If we show that a = rev u (j div u) and b = rev u (j mod u), we are done. In fact, a = = Xd; l=0 d; X l=0 Xd; b m;l; 2 l = b m;q;l;+q 2 l = (substituting c l = b l+q ) l=0!! d; d; X X c d;l; 2 l =rev c l 2 l = rev b l+q 2 l =rev(j div u) u u u l=0 l=0 and b = q; X X! q; b q;l; 2 l = rev u b l 2 l = rev u (j mod u): l=0 l=0 Corollary 3.2. Let u be owers of two. Then P =(I u P )(P u I )S u u u Proof. The matrix (I u P )(P u I )S u corresonds to a sequence of three ermutations: u u. j! l = u (j) =j mod u + j div u u 2. l! t = rev u (l div ) + l mod =rev u u u u(j mod u) + j div u u 3. t! k = t div u u +rev(t mod ) = rev u u u(j mod u) u +rev(j div u) = rev (j) u Let y be a vector of size =2 m block distributed over =2 q rocessors. Suose that we want to ermute it by a bit reversal ermutation, i.e., erform y P y. Alying Corollary 3.2 with u =, it is ossible to slit the arallel bit reversal ermutation in two arts as follows. 20

22 . y (P I )S y, which is a global ermutation that sends the elements to the nal destination rocessors, but with local indices still in the original order: j! t = rev (j mod ) {z } Proc(t) + j div {z : } t 0 Having as basis the block distribution, from now on, we use Proc(k) = k div denote the rocessor in which element k is stored, and k 0 = k mod local index of the element. to to denote the 2. y (I P ) y, which is a local bit reversal ermutation in the local index t 0 : t 0! k 0 = rev (t 0 ) Algorithm 3.3 carries out a arallel bit reversal using this idea. If we combine the local bit reversal (suerste 2 of Algorithm 3.3) with the short distance buttery hase (suerste 2 of Algorithm 3.), we obtain a comlete local sequential FFT. This means that we can easily relace the two suerstes by any otimized FFT subroutine we can lay our hands on. Algorithm 3.3 Temlate for the arallel bit reversal. CALL BSP BitRev(s n y). ARGUMETS s: Processor identication 0 s<. : umber of rocessors is a ower of 2 with <. : Vector size is a owerof2with 2. y =(y 0 ::: y ; ): Comlex vector of size (block distributed). OUTPUT y P y. DESCRIPTIO Comm Global ermutation. for j = s to s + ; do dest rev (j mod ) x dest +j div Put(dest y j) Synchronize 2 Com Local bit reversal. for t 0 =0to ; do y s +rev (t 0 ) x s +t0 If < =, it is ossible to otimize communication suerste of Algorithm 3.3 by sending ackets of data. This is done in a similar way as when ermuting from block to 2

23 cyclic distribution (see Section 3.4) the only dierence is in the destination rocessor, which is rev (j mod ) instead of j mod Permutations within the grou-cyclic distribution family. Permuting a vector from the C r ( ) distribution to the Cr 2 ( ) distribution, where r = =u and r 2 = =u 2 may beany ossible grou sizes, not necessarily owers of two, can be done as follows: rst, use ; u to ermute the vector to the block distribution, and then use u2 to ermute it to the C r 2 ( ) distribution. This oeration is exensive if erformed in arallel, because all the data have to be moved twice around the rocessors. The best aroach is to combine both ermutations into one: u 2 u : f0 ::: ; g!f0 ::: ; g j 7! l = u2 ( ; u (j)): (3.5) (ote that ( u 2 u ) ; = u u 2, and that u 2 u is an abbreviation for u 2 u.) In the general case, there is no simle formula for comuting the destination index l. Algorithm 3.4 is a temlate for this case. Some combinations of the arameters r r 2 and, however, lead to simler exressions for the destination index. The simlest case is when r or r 2 is equal to, i.e., one of the distributions involved is the block distribution. This situation occurs in suerstes 3 and 7 of the FFT algorithm (Algorithm 3.) and is discussed in Section Permutation from block to cyclic distribution. The ermutations and ; are the ermutations that convert a vector from block to cyclic distribution and vice versa. In the case that < b = =, both and ; ackets of size b= (here we assume that divides b, but nothing more). can be otimized by sending For, this is done as follows. First erform a local ermutation b on the local index j 0, j 0! t 0 = j 0 mod b + j0 div : Then erform a global cyclic ermutation of ackets on the global index t = j div bb+t 0 = t 0 b + t b + t 2, t! k = t {z} b + t 0 b + t 2 Proc(k) {z } k 0 22 : (3.6)

24 Algorithm 3.4 Temlate for the arallel ermutation from r -cyclic to r 2 -cyclic distribution. CALL BSP CyclicToCyclic(s u u 2 y). ARGUMETS s: Processor identication 0 s<. : umber of rocessors <. : Vector size. u : umber of rocessors in the old grou u = =r. u 2 : umber of rocessors in the new grou u 2 = =r 2. y =(y 0 ::: y ; ): Comlex vector of size (block distributed). OUTPUT y ; u2 ; ; u y. DESCRIPTIO Comm Global ermutation u 2 u. for j = s to s + ; do l u 2 u (j) roc l div y l Put(roc y j ) Synchronize The resulting global index k is then k = j 0 mod b + j div b b + j0 div = j mod b + j div b b +(j mod b) div = j mod b +(jdiv b b + j mod b) div = j mod b + j div which is indeed (j), see Figure 3. Subroutine BSP BlockToCyclic (Algorithm 3.5) with s = s and s 0 = 0 carries out the ermutation. The subroutine erforms the ermutation on a vector y of size = b which is block distributed over rocessors. This is done in suerste 3 ot the FFT algorithm if =. Subroutine BSP BlockToCyclic can also be used to carry out the ermutation r r, which is needed in suerste 3 of the FFT algorithm if >=. This is ossible because ermuting a vector of size r by r r is the same as dividing it into r subvectors of size and then erforming a shue ermutation on each of the subvectors, cf. (2.3). In this case s = s mod, s 0 = s ; s, and y is a subvector starting at element s 0 b of the original vector. 23

25 roc. 0 roc. roc. 2 roc (A) (B) Figure 3. Schematic reresentation of a two stage ermutation from block to cyclic distribution (storage view). Examle with = 32 and = 4. (A) Local 4 ermutation. (B) Global cyclic ermutation of ackets of size 2. The temlate for subroutine BSP CyclicToBlock, which carries out ; is obtained in a similar way as the temlate for subroutine BSP BlockToCyclic. and ; r r temlate starts by erforming a global cyclic ermutation of ackets, and then it carries out a local ermutation ; b. details.) The (The temlate is not resented here see [7] for more 3.5. BSP cost. To comute the total cost of our arallel FFT algorithm (Algorithm 3.) we need to sum the comutation, communication, and synchronization costs. The comutation costs were already obtained in Section 3.. To simlify the nal result we only include the higher order term of the total comutation cost (3.2), C FFT ar Com ( ) = 7 4 log 2 (3.7) which is exact when only 4-butteries are erformed. The communication and synchronization costs are the costs involved in erforming the bit reversal and the ermutations related to the grou-cyclic distribution family. The maximum amount of data sent or received during a ermutation involving comlex numbers is equal to = comlex values (or 2= real values). If the ermutation is erformed with uts, the number of synchronizations is, giving a total cost of C ermut ( ) =2 g + l (3.) 24

26 Algorithm 3.5 Temlate for the arallel ermutation from block to cyclic distribution. CALL BSP BlockToCyclic(s 0 s b y). ARGUMETS s 0 s : Processor oset and rocessor identication within grou 0 s <. : umber of rocessors in grou. b: Block size divides b, if<b. y =(y 0 ::: y b; ): Comlex vector of size b (block distributed within grou). OUTPUT y S b y. DESCRIPTIO if b then Comm Global b ermutation. for j = s b to (s +)b ; do dest j mod y destb+j div Put(s 0 + dest y j ) Synchronize else 2 Com Local b ermutation. for j 0 =0to b ; do k 0 j 0 mod b + j0 div x s b+k 0 y s b+j 0 3 Comm Global cyclic ermutation of ackets. for roc =0to ; do y rocb+s b Put(s 0 + roc b x s b+roc b ) Synchronize for each of the dlog e +ermutations erformed in the FFT algorithm. The total cost of the FFT algorithm is C FFT, ar ( ) = 7 4 log 2 +2 (dlog e +) g +(dlog e +) l: (3.9) In Section 5, we discuss the validity of cost function (3.9) as an accurate estimator of the true cost of the FFT algorithm. 4. Variants of the algorithm 4.. Parallel FFT using other data distributions. U to now, we discussed an FFT algorithm where the inut and outut (I/O) vector must be block distributed. There exist alications of the FFT where it would be better if the I/O vector would be distributed by other distributions or where the distribution can be freely chosen (see e.g. [4, 9, 2]). Here, we discuss how to modify our arallel FFT algorithm to accet I/O vectors that are not distributed by the block distribution. 25

27 The rst and the last suerstes of Algorithm 3. are ermutations. Because of this, the algorithm can be modied to accet any I/O data distribution without any extra communication cost, or even at a smaller communication cost deending on the desired distributions. If the inut vector is not in the block distribution, the algorithm is modied by combining the redistribution to block distribution with the bit reversal ermutation. If the outut vector is exected to be in a distribution other than the block distribution, this is done by relacing the ermutation from cyclic to block distribution by a ermutation from the cyclic to the desired distribution. If the desired distribution for the outut vector is the cyclic distribution, the last communication suerste can be comletely skied. The rst ermutation can also be skied if the inut vector is already stored by the distribution associated with the bit reversal ermutation. Alications where the inut vector is bit reversed and the outut vector is cyclically distributed are advantageous, because, in such cases, two comlete ermutations can be skied. 4 This saves two thirds of the total communication cost in the common case that =, leaving only one ermutation in the middle of the comutation. While the cyclic distribution is simle and widely used, the distribution associated with the bit reversal ermutation is awkward. Fortunately, it is ossible to modify Algorithm 3. so that the cyclic distribution is a natural inut distribution, i.e., a distribution that does not involve any communication as the rst suerste. This is done as follows. The rst three suerstes of Algorithm 3. are described by the matrix decomosition ; u A :::A 2 P (4.) where u = min( =). Knowing that A K = I A K, and that the bit reversal matrix can be decomosed as P = (I P ) (P I ) S (cf. Corollary 3.2), we rewrite 4 The idea of skiing ermutations to save communication time or to reduce the overhead caused by local ermutations is known. Cooley and Tukey [9] already suggested this in order to save local bit reversals. Other authors (e.g., [4, 9, 2]) give examles where skiing ermutations saves communication time. 26

28 matrix (4.) as ; u (I A ) :::(I A 2 ) (I P ) (P I ) S =; u (I F ) (P I ) S =; u (P I ) (I F ) S : (4.2) Here we used Lemma.2. The rst three suerstes of the arallel FFT algorithm derived from this new decomosition are: ( Comm ) ermutation from block to cyclic distribution, (2 Com ) local FFT, (3 Comm ) ermutation dened by ; min( =) (P I ). In the case that the inut vector is already cyclically distributed, the rst suerste can be skied Generalized buttery hase with adjustable size. In our original algorithm, we chose to insert the ermutation matrices ; u in the leftmost ossible osition. This rocedure corresonds to factoring as = (=) H;, and gives an algorithm (=) H; with a minimum number of ermutations. However, if 6= (=) H;, it is ossible to insert the ermutation matrices at an earlier osition without increasing the number of ermutations. The resulting algorithm would corresond to a dierent factorization of. We can use this exibility to reduce the comutation cost of some combinations of and by inserting the ermutations so that a maximal number of generalized buttery stages are aired o. Another reason to ermute the vector at an earlier stage is that the sizes of the buttery hases can be better balanced (so that all factors of have aroximately the same size). This would enhance the erformance on a cache-sensitive comuter (see the discussion in Section 5). An even a more eective way of enhancing the erformance on a cache-sensitive comuter is to reduce the buttery sizes so that they always t comletely into the cache. We suggest a method in the following subsection Cache-friendly arallel FFT. Each comutation suerste of our arallel FFT algorithm erforms a buttery hase which consists of a sequence of generalized buttery stages reresented by the oeration y R l ny, where l and n are owers of two with 2 l n, and R l n = A n n A 2l n A l n (4.3) is an n n matrix. Suose that the cache memory of a comuter is such that the data needed by a buttery hase of size n=v, where v < n is a ower of two, ts totally in 27

29 the comuter cache. We can view v as the number of virtual rocessors available in each rocessor. If we decomose (4.3) into a sequence of smaller buttery hases of size less than or equal to n=v which can be carried out indeendently from each other, we can get the most out of the cache of the comuter. Dene h = dlog n v ne and j = dlog n v le;, so that ( n v )j <l ( n v )j+. Similarly to (2.), if we denote ; u v n by ; u, we can write R l n =; ; v ^A n v v ::: ^A 2 (n=v) h; v v {z } hase h;j; ::: ; ; ( n v )j+ ^A n v ( n v )j+ ::: ^A 2 ( n v )j+ {z } hase ; v ; ; ( n v )h;2 ^A n v ( n v )h;2 ::: ^A 2 ( n v )h;2 {z } hase h;j;2 ; ( n v ) h;2 ::: ; ( n v ) j+ ;; ( n ^A nv v )j ( n ::: ^A v )j l (n=v) j ( n v )j {z } hase 0 ; ( n v ) j (4.4) where the nn matrix ^A k u is an abbreviation for ^A k u v n =; u v n A ku n; ; u v n. Generalized versions of Theorem 2.2 and Corollary 2.3 can be used to rove that ^A k u v n = I v u diag(a =u k n A(+)=u v k n v ::: A (+u;)=u k n ): (4.5) v The matrix decomosition (4.4) can be used to construct an alternative (cache-friendly) algorithm for the comutation of the generalized buttery hases. ote that if = 0 then the resulting algorithm can be used to construct a cache-friendly sequential FFT algorithm. 5. Exerimental results and discussion In this section, we resent results on the erformance of our imlementation of the FFT. We imlemented the FFT algorithm for the block distribution in ASI C using the BSPlib communications library [6]. Our rograms are comletely self-contained, and we did not rely on any system-rovided numerical software such as BLAS, FFTs, etc. We tested our imlementation on a Cray T3E with u to 64 rocessors, each having a theoretical eak seed of 600 Mo/s. The accuracy of double recision (64-bit) arithmetic is :0 0 ;5. We also give accuracy results from calculations on a Sun Workstation using IEEE 754 oating oint arithmetic, which has a double recision accuracy of 2:2 0 ;6, and which is the standard used in many comuters such as workstations. To make a consistent comarison of the results, we comiled all test rograms using the bsfront driver with otions -O3 -flibrary-level 2 -fcombine-uts and measured 2

30 the elased execution times on exclusively dedicated CPUs using the system clock. The times given corresond to an average of the execution times of a forward FFT and a normalized backward FFT. 5.. Accuracy. We tested the overall accuracy of our imlementation by measuring the error obtained when transforming a random comlex vector f with values Re(f j ) and Im(f j ) uniformly distributed between 0 and. The relative error is dened as jjf ; Fjj 2 =jjfjj 2 where F is the vector obtained by transforming the original vector f by a forward (or backward) FFT, and F is the exact transform, which we comuted using the same algorithm but using quadrule recision. Here, jj jj 2 indicates the L 2 -norm. Table shows the relative errors of the sequential algorithm for various roblem sizes. Since the error for the forward and backward FFT are aroximately the same, we resent only the results for the forward transform. The errors of the arallel imlementation are of the same order as in the sequential case. In fact, the error of the arallel imlementation only diers from the error of the sequential one if the buttery stages are not aired in the same way. This validates the arallel algorithm. The results also indicate that IEEE arithmetic is suerior to the CRAY-secic arithmetic. Table. Relative errors for the sequential FFT algorithm. CRAY T3E IEEE :4 0 ;6 :9 0 ; :2 0 ;6 :6 0 ;6 204 :4 0 ;6 : 0 ; : 0 ;5 :9 0 ;6 92 3:2 0 ;5 2:0 0 ; :5 0 ;5 2:2 0 ; :3 0 ;4 2:3 0 ; :4 0 ;4 2:3 0 ; Performance of the sequential imlementation. Our sequential FFT algorithm was imlemented using Algorithm 3.2 with = 0.Its eciency can be analyzed by looking at its execution times or its FFT o rates: FFT rate (seq ) = 29 5 log 2 Time(seq ) (5.)

Computer arithmetic. Intensive Computation. Annalisa Massini 2017/2018

Computer arithmetic. Intensive Computation. Annalisa Massini 2017/2018 Comuter arithmetic Intensive Comutation Annalisa Massini 7/8 Intensive Comutation - 7/8 References Comuter Architecture - A Quantitative Aroach Hennessy Patterson Aendix J Intensive Comutation - 7/8 3