A SIMPLE AD EFFICIET PARALLEL FFT ALGORITHM USIG THE BSP MODEL MARCIA A. IDA AD ROB H. BISSELIG Abstract. In this aer, we resent a new arallel radix-4

Size: px
Start display at page:

Download "A SIMPLE AD EFFICIET PARALLEL FFT ALGORITHM USIG THE BSP MODEL MARCIA A. IDA AD ROB H. BISSELIG Abstract. In this aer, we resent a new arallel radix-4"

Transcription

1 Universiteit-Utrecht * Deartment of Mathematics A simle and ecient arallel FFT algorithm using the BSP model by Marcia A. Inda and Rob H. Bisseling Prerint nr. 3 March 2000

2 A SIMPLE AD EFFICIET PARALLEL FFT ALGORITHM USIG THE BSP MODEL MARCIA A. IDA AD ROB H. BISSELIG Abstract. In this aer, we resent a new arallel radix-4 FFT algorithm based on the BSP model. Our arallel algorithm uses the grou-cyclic distribution family, which makes it simle to understand and easy to imlement. We show how to reduce the communication cost of the algorithm by a factor of three, in the case that the inut/outut vector is in the cyclic distribution. We also show howtoreduce comutation time on comuters with acache-based architecture. We resent erformance results on a Cray T3E with u to 64 rocessors, obtaining reasonable eciency levels for local roblem sizes as small as 256 and very good eciency levels for sizes larger than Introduction The discrete Fourier transform (DFT) lays an imortant role in comutational science. DFT alications ranges from solving numerical dierential equations to signal rocessing. (For an introduction to DFT alications see e.g. [7].) The widesread use of DFTs in comutational science is mainly due to the existence of fast algorithms, known by the general name of fast Fourier transform (FFT), which comute the DFT of an inut vector of size in O( log ) oerations instead of the O( 2 ) oerations needed by a direct aroach, i.e., by a matrix-vector multilication. In 965, Cooley and Tukey [9] ublished a aer describing the FFT idea (giving secial attention to the so called radix-2 FFT ). Since then, many variants of the algorithm have aeared. For an extensive discussion of the family of FFT algorithms, see Van Loan [27]. In recent years, after the dawn of arallel comuting, the originally sequential FFT algorithms have been modied and adated to the needs of arallel comutation (see e.g. [, 2, 3,,, 2, 4, 5, 2, 22, 25, 27]). The lack of a unied arallel comuting model and the existence of many dierent arallel architectures have made it rather dicult to develo ecient and ortable arallel FFTs. Recently, however, as the arallel rogramming environments have become less machine deendent, examles of such algorithms have aeared. Tyical examles are the Key words and hrases. fast Fourier transform, bulk synchronous arallel, arallel comutation. Inda suorted by a doctoral fellowshi from CAPES, Brazil. Comuter time on the Cray T3E of HPaC, Delft, funded by CF, The etherlands.

3 6-ass (or 6-ste) aroach and the related transose aroach (see e.g. [2, 3, 4, 5, 2]). Those algorithms regard the inut vector of size = 0 as an 0 matrix, and carry out the comutations in a similar way as done for two-dimensional FFTs. Since those algorithms require the number of rocessors to be a divisor of 0 and, they can only be used if. As the number of available rocessors grows and the communication seed increases, it is imortant to develo arallel algorithms that can handle more than rocessors. Though generalized algorithms have already been roosed, they only work for very secic combinations of and such as = (d;)=d =k, where d is the dimensionality and k [2, Cha. 0.3] or k = [22], and both and are owers of two. Furthermore, to our knowledge none of those algorithms were imlemented. Our main aim in this aer is to resent a new arallel FFT algorithm and its imlementation. Our arallel algorithm works for any < as long as both are owers of two, which is required because of the radix-2 framework. (A mixed-radix framework will be discussed elsewhere [].) We dedicate the remainder of this section to giving a brief introduction to the basic framework of radix-2 and radix-4 FFTs and to the bulk synchronous arallel (BSP) model. In Section 2, we derive our arallel FFT algorithm by inserting suitable ermutation matrices into the basic radix-2 decomosition of the Fourier matrix. This aroach leads to a simle and easy to imlement distributed memory arallel FFT algorithm. In Section 3, we resent a set of temlates that are used in the imlementation of the algorithm. In Section 4, we resent variants of our FFT algorithm. We show how to modify the algorithm to accet vectors that are not in the block distribution. We alsoshowhow to obtain a cache-friendly version of our algorithm, that is, an algorithm that takes advantage of the cache memory of a comuter by breaking u the comutations into small sections in such away that the data stored in the cache is comletely used before new data is brought in. In Section 5, we resent results regarding the erformance of our imlementation and discuss asects such as the cache eect. In Section 6, we draw our conclusions and discuss future work. 2

4 .. The fast Fourier transform. The DFT of a comlex vector z of size is dened as the comlex vector Z, also of size, with comonents Z k = X; j=0 z j e 2ijk 0 k<: (.) The inverse DFT, which transforms the comlex vector Z back into the vector z, is then dened by z j = X; k=0 Z k e ; 2ijk 0 k<: (.2) Alternatively, the DFT can be seen as a matrix-vector multilication: Z = F z: (.3) The comlex matrix F is known as the Fourier matrix it has elements (F ) jk = w jk, where w = e 2i : (.4) Though it is ossible to develo FFT algorithms that comute the DFT of a vector of arbitrary size, the radix-2 FFT algorithm only works for 's that are owers of two. For simlicity, we will restrict our discussion to such values of. The sequential iterative radix-2 FFT algorithm starts with the so-called bit reversal ermutation of the inut vector (see Section 3.2), and roceeds in log 2 buttery stages, numbered K =2 4 :::, which modify the vector. Each buttery stage consists of =K times a buttery comutation: z t+j z t+j+k=2 A z t+j + w j K z t+j+k=2 z t+j ; w j K z t+j+k=2 A for j =0 ::: K=2 ;, (.5) where t = 0 K ::: ; K indicates the beginning of a buttery block in the vector. These oerations cost one comlex multilication and two additions, or 0 real oating oint oerations (os), er air. The total o count of the radix-2 FFT is therefore C FFT-2 () =0 K 2 K log 2 =5 log 2 : Algorithm. is an in-lace version of this algorithm. 3

5 Algorithm. Sequential radix-2 FFT algorithm. CALL Seq FFT( y). ARGUMETS : Vector size is a owerof2with 2. y =(y in 0 ::: yin ;): Comlex vector of size. P OUTPUT y (y out 0 ::: y out ; ;), where yout k = DESCRIPTIO. Perform a bit reversal on y. 2. Perform log 2 buttery stages A K on y. K 2 while K do for t =0to ; K ste K do for j =0to K=2 ; do a w j K y t+j+k=2 y t+j+k=2 y t+j ; a y t+j y t+j + a K 2 K j=0 yin j ex(2ijk=). Following van Loan's matrix aroach [27], Algorithm. can be described as a sequence of sarse matrix-vector multilications which corresond to the following decomosition of the Fourier matrix F = A A A 4 A 2 P (.6) where P is the ermutation matrix corresonding to the bit reversal ermutation (ste of Algorithm.), and the matrices A K corresond to the buttery stages (ste 2 of Algorithm.). The block structure of the buttery stages leads to block-diagonal matrices of the form A K = I =K B K (.7) which is shorthand for a block-diagonal matrix diag(b K ::: B K )with=k coies of the K K matrix B K on the diagonal. The symbol reresents the direct (or Kronecker) roduct of two matrices, which is formally dened at the end of this subsection. The matrix B K is known as the K K 2-buttery matrix which corresonds to the buttery Actually, the matrix decomosition corresonding to the algorithm of Cooley and Tukey [9] is F = P ~ A ~ A ~ A4 ~ A2 where ~ AK = P ; A K P. 4

6 comutation (.5). This matrix can be written as B K = 2 4 I K=2 K=2 I K=2 ; K=2 3 5 : (.) Here, the matrix I K=2 is the K=2 K=2 identity matrix and K=2 is the K=2 K=2 diagonal matrix K=2 = diag(w 0 K w K ::: w K=2; K ): (.9) Later on we will also need generalized versions of A K : A K = I =K B K (.0) where B K is the generalized K K 2-buttery matrix [3, 5, 0] B K = 2 4 I K=2 K=2 I K=2 ; K=2 3 5 (.) which has the same form as the original B K, but the weights w j K in (.9) are relaced by w j+ K, where can be any real number. In ractice, often a radix-4 FFT is used. A radix-4 algorithm can be derived comletely analogously to the radix-2 algorithm, yielding a similar matrix decomosition. The algorithm starts with a reversal of airs of bits instead of a reversal of single bits, and roceeds in log 4 4-buttery stages which involve quadrules of vector comonents instead of airs. Since 34 os are erformed er quadrule, this brings the o count down to C FFT-4 () =34 K 4 K log 4 =4:25 log 2 : The resulting algorithm has the disadvantage that either it must be assumed that is a ower of four, or secial recautions must be taken which comlicate the algorithm. Wetake a slightly dierent aroach: wherever ossible wetake airs of stages A K A K=2 together and erform them as one oeration. Our K K 4-buttery matrix has the form D K = B K (I 2 B K=2 )= I K=4 2 K=4 K=4 3 K=4 I K=4 ; 2 K=4 i K=4 ;i 3 K=4 I K=4 2 ; K=4 K=4 ; 3 K=4 5 (.2) I K=4 ; 2 K=4 ;i K=4 i 3 K=

7 where K=4 is the K=4 K=4 diagonal matrix K=4 = diag(w 0 K w K ::: w K=4; K ): (.3) This matrix is a symmetrically ermuted version of the radix-4 buttery matrix [27]. 2 This aroach gives the eciency of a radix-4 FFT algorithm, and the exibility of treating a arallel FFT within the radix-2 framework. For examle, if we wish to ermute the data sometime during the comutation, for reasons of data locality, this can haen after any stage, and not only after an even number of stages. An algorithm for the inverse FFT is obtained using the following roerty: F ; = F = A A 4 A 2 P : (.4) The backward algorithm is basically the same as the forward one, the only dierence being that the owers of w K are relaced by their conjugates and that the nal result is rescaled. ow, we dene the direct roduct of two matrices and give some roerties that will be used in the course of the aer. Denition. (Direct roduct). Let A be a q r matrix and B be an m n matrix. Then the direct roduct (or Kronecker roduct) of A and B is the qm rn matrix dened by A B = a 0 0 B a 0 r; B : a q; 0 B a q; r; B As one would exect, the direct roduct is associative, but it is not commutative. Lemma.2 summarizes some direct roduct roerties that follow directly from the definition. (See [24, 27] for other useful roerties). Lemma.2 (Proerties of the direct roduct). The following holds.. (A B)(C D) =(AC) (BD), as long as the roducts are dened. 2. (I m I n )=I mn. 2 In verifying this, note that van Loan denes the weights to be w K = ex(; 2i K ). 6

8 3. If A and B are square matrices of order m and n, resectively, then (A I n )(I m B) =(A B) =(I m B)(A I n ). 4. If A and B are square matrices of order n such that AB = BA, then (I m A)(I m B) =(I m B)(I m A)..2. The bulk synchronous arallel model. The BSP model [26] is a arallel rogramming model which gives a simle and eective way to roduce ortable arallel algorithms. It does not deend on a secic comuter architecture, and it rovides a simle cost function that enables us to choose between algorithms without actually having to imlementthem. In the BSP model, a comuter consists of a set of rocessors, each with its own memory, connected by a communication network that allows rocessors to access the rivate memories of other rocessors. Accessing local memory (the rocessor's own memory) is faster than accessing remote memory (memory owned by other rocessors), but access time is considered to be indeendent from the comuter architecture. In this model, algorithms consist of a sequence of suerstes and synchronization barriers. The use of suerstes and synchronization barriers imoses a sequential structure on arallel algorithms, and this greatly simlies the design rocess. The variant of the BSP model that we use is a single rogram multile data (SPMD) model, i.e., each one of the rocessors executes a coy of the same rogram, though each has its own data. The rogram distinguishes between the rocessors through a arameter s (the rocessor identication number). Secial cases are treated using \if" statements. In our model, a suerste is either a comutation suerste, or a communication suerste. A comutation suerste is a sequence of local comutations carried out on data already available locally before the start of the suerste. A communication suerste consists of communication of data between rocessors. To ensure the correct execution of the algorithm, global synchronization barriers (i.e., laces of the algorithm where all rocessors must synchronize with each other) recede and/or follow a communication suerste. A BSP comuter can be characterized by four global arameters:, the number of rocessors v, thecomuting velocity in o/s 7

9 g, the communication time er data element sent or received, measured in o time units l, the synchronization time, also measured in o time units. Algorithms can be analyzed by using the arameters g, and l. The arameter v is used to estimate the total execution time after the cost function has been comuted. The o count of a comutation suerste is simly the maximum amount of work (in os) of any rocessor. The o count of a communication suerste is hg + l or hg +2l (deending on the number of global synchronizations). Here h is the maximum number of data elements sent or received by any rocessor. The cost function of an algorithm can be obtained by adding the os of the searate suerstes. This yields an exression of the form a + bg + cl. For further details and some basic techniques, see [4, 6]. The second reference describes BSPlib, a standard library dened in May 997 which enables arallel rogramming in BSP style. The Paderborn University BSP (PUB) library [6] is another library that ermits rogramming in BSP style it rovides the extra feature of subset synchronization..3. Parallel radix-2 FFTs. Since the introduction of arallel comuters, and even before that, methods for arallelizing FFT algorithms have been roosed [23]. The earliest methods roduced arallel algorithms that, using rocessors, carry out an FFT of size in O(log ) comutation suerstes, which are interleaved by O(log ) communication suerstes that need to communicate O( ) data elements er rocessor (see e.g. [,, 25]) and hence have cost O( g +l). Such methods aeared as a direct conse- quence of the divide-and-conquer structure of the radix-2 FFT algorithm. The aer [] by Chu and George discusses several existing arallel algorithms of this tye and three variants of their own. Restricting the discussion to vector sizes that are owers of two, they resent a common framework in which all the algorithms they discuss are reorderings of one another in the following sense. Each buttery stage K of an FFT of size, erforms airwise oerations that combine elements j and j + K=2 from the vector being transformed using the weight w j modk K. Writing j in its binary reresentation j =(j m; ::: j 0 ) 2, where m =log 2, we observe that elements j and j+k=2 dier only in bit log 2 K; and that w j modk K = w (j log 2 K; ::: j 0 ) 2 K. If the ordering of the vector is changed, so that original element j is stored as element

10 l, the buttery stages must be modied to carry out the same oerations. If we can reresent the new ordering using a ermutation of the original bits, it is easy to know which elements to combine and which weights to use. For examle, if =6a ossible reordering of the inut vector could be l =(j 0 j 2 j j 3 ) 2, where j =(j 3 j 2 j j 0 ) 2. The buttery stage corresonding to K =6should then combine elements l =(j 0 j 2 j 0) 2 with l +=(j 0 j 2 j ) 2 using weights w (j 2 j j 0 ) 2 6. In the arallel scenario, any grou of log 2 bits can be used to reresent the rocessor number, while the remaining log 2 (=) bits are used to reresent the local index. If the bit corresonding to the next buttery stage is one of the log 2 (=) bits that reresent the local index, then that stage is local, otherwise communication is needed. Swarztrauber [25] carries out a similar discussion. He starts with a more general formulation of the roblem, where is not restricted to owers of two, but when discussing the distributed memory framework, he only considers FFTs on a hyercube, restricting both and to owers of two. The roblem of the algorithms discussed in [,, 25] is that reorderings are carried out by means of exchanging one bit at a time. Since there are log 2 bits in the rocessor art, log 2 communication suerstes of size O( g + l) are needed. A less exensive solution to the roblem is to exchange all the rocessor bits with a grou of local bits corresonding to buttery stages that have already been erformed. Since the communication cost of the ermutation that exchanges many bits is of the same order O( g + l) as for exchanging one bit, the reduction in the communication cost is huge. The basic idea for such algorithms already aears in the original Cooley and Tukey aer [9]. In their derivation of the FFT algorithm, they start by considering the case where can be decomosed as = 0, and rewrite (.) as Z k k 0 = X ; j =0 X 0 ; j 0 =0 z j0 j w j 0k 0! w j (k 0 +k 0 ) 0 k < 0 k 0 < 0 (.5) where Z k k 0 = Z k 0 +k 0 = Z k, and z j0 j = z j0 +j = z j. Since w j 0k 0 = w j 0k 0, the inner sum of (.5) corresonds to a DFT of size 0, which can be comuted by the same rocess as before if 0 is not rime. They remark that this rocess can be alied to any ossible factorization of, = 0 ::: H; 0 and that, if is comosite enough, real gains (over the O( 2 ) direct aroach) can be achieved. Afterwards they derive the 9

11 radix-2 algorithm by choosing to be a ower of two. If instead of decomosing into its rime factors, we sto at a higher level, we obtain a decomosition of the FFT into a sequence of shorter FFTs that, in the arallel case, can be sread out over the rocessors. This is what haens in our FFT algorithm resented in the next section and in algorithms based on the 6-ass aroach and the related transose aroach (see e.g. [2, 3, 4, 5, 2]). 2. The arallel algorithm 2.. Grou-cyclic distribution family and the arallel FFT. Since our arallel FFT algorithm is based on the radix-2 decomosition (.6) of the Fourier matrix, must be a ower of two. For ractical reasons = must be integer and therefore must also be a ower of two. Our arallel FFT algorithm makes use of the data distributions dened below. Denition 2. (Cyclic distribution in r grous, C r ( )). Let r,, and be integers with r, such that r divides and. Let f be a vector of size to be distributed over rocessors organized in r grous. Dene M = =r to be the size of the subvector of a grou and u = =r to be the number of rocessors in a grou. We say that f is cyclically distributed in r grous (or r-cyclically distributed) over rocessors if, for all j, the element f j has local index j 0 =(j mod M) div u, and is stored in rocessor s 0 + s, where s 0 = (j div M) u is the number of the rst rocessor in the grou (i.e., the rocessor oset) and s =(j mod M) modu is the rocessor identication within the grou. We use the name grou-cyclic distribution family to designate all the r-cyclic distributions generated by the same and. This family includes the well-known cyclic distribution (C ( )) and block distribution (C ( )) as extreme cases. The arallel FFT algorithm works as follows. A total of H = dlog 2 = log 2 (=)e = dlog e hases is erformed. (The number of hases is the largest integer H for which (=) H;.) In hase 0, the algorithm uses the block distribution to erform the rst log 2 (=) buttery stages, i.e., those involving butteries with K = (which we call short distance butteries). Afterwards, in each intermediate hase J < H; it uses the r-cyclic distribution, with r = =(=) J, to erform a grou of log 2 (=) buttery 0

12 roc. 0 roc. roc. 2 roc. 3 roc. 4 roc. 5 roc. 6 roc. 7 (A) Phase 0: Short distance butterflies Phase : Medium distance butterflies Phase 2: Long distance butterflies (B) Phase 0: Short distance butterflies Phase : Medium distance butterflies Phase 2: Long distance butterflies Figure. Buttery oerations using the grou-cyclic distribution family C r ( ) = C r ( 32). (A) Logical view, (B) storage view. The short distance butteries (A 2 32 and A 4 32 ) are erformed using the C ( 32) distribution (block distribution). The medium distance butteries (A 32 and A 6 32 ) are erformed using the C 2 ( 32) distribution. The long distance butteries (A ) are erformed using the C ( 32) distribution (cyclic distribution). For clarity, not all buttery airs are shown. stages with (=) J <K (=) J+ (the medium distance butteries). Finally, in hase H ;, it uses the cyclic distribution to erform the remaining buttery stages, i.e., those involving butteries with K > (=) H; (the long distance butteries). Figure illustrates the use of the grou-cyclic distribution family in our arallel FFT. The same oerations are illustrated in two ways: (A) using the logical view, and (B) using the storage view. The logical view emhasizes the logical sequence of the elements in the vector while the storage view emhasizes the way the elements are actually stored. For the block distribution, both views are the same. Our algorithm has only O(log = log(=)) communication suerstes of size O( g + l), which is a signicant imrovement over the O(log ) communication suerstes also of

13 size O( g + l) of the algorithms discussed in the revious section. McColl [22] outlined a arallel FFT algorithm which is a secial case of the FFT algorithm we resent here. His algorithm only works for =(=) H. Furthermore, his algorithm sends the indices corresonding to the weights together with the data vector, increasing the communication costs unnecessarily. In the original BSP aer [26], Valiant already discussed the FFT and mentioned the order of the communication cost that can be achieved Permutations and ermutation matrices. Let u, v, and be ositive integers such that u divides v and v divides. We dene the following ermutation: u v : f0 ::: ; g!f0 ::: ; g j = j 0 M + j u + j 2 7! l = j 0 M + j 2 v + j (2.) where M = v u, j 0 = j div M, j = (j mod M) div u, and j 2 = j mod u. ote that ; u v = v u and that v = u is the identity ermutation. Permuting a vector of size by u v can be achieved by dividing the vector into v=u subvectors of size M = u=v and then erforming a (erfect) shue ermutation u M : u M : f0 ::: M ; g!f0 ::: M ; g j 7! k =(j mod u) M u + j div u (2.2) on each of the subvectors. 3 This relation can be exressed by which imlies that u u = u. u v (j) =j div M M + u M (j mod M) (2.3) In the case that divides, redistributing a vector of size from block distribution to r-cyclic distribution over rocessors is equivalent to ermuting it by u, where u = =r. ermutation matrix: Using matrix notation, this ermutation can be exressed by the (; u ) lj = >< >: 0 if l = u (j) otherwise: (2.4) 3 The shue ermutations (j) and (j) can be used to ermuteavector of size distributed over rocessors from block to cyclic distribution and vice-versa. 2

14 Multilying a vector y by ; u results in a vector with comonents (; u y) l = y ; u (l), for all l in other words, this multilication corresonds to redistributing the vector from block distribution to cyclic distribution in r = =u grous. ote that ; u = I r S u u, cf. (2.3). The matrix corresonding to the inverse ermutation ; u is ;; u =;T u = ; u. From now on, we use the abbreviations ; u and u to denote ; u and u, resectively. We restrict the use of subscrits and to cases where they are not obvious from the context. We also use S u M to denote ; u u M Decomosition of the Fourier matrix. To obtain the arallel FFT algorithm, we modify the original radix-2 decomosition of the Fourier matrix (.6) by inserting identity ermutation matrices I =; ; u ; u corresonding to the changes of distribution, and regrouing the matrices in the resulting decomosition. In the case that this is done as follows: F =; ; ; A ; ; ; ; ; ; A 2 ;; ; A :::A 2 P =; ; (; A ; ; ) (; A 2 ;; ); A :::A 2 P : (2.5) By dening ^A k u =; u A ku ; ; u (2.6) and using the fact that ; is the identity matrix, we can rewrite (2.5) as F =; ; ^A ^A 2 2 {z } hase ; ; ; ^A ^A 2 {z } hase 0 ; P : (2.7) As we did with the ermutation matrices ;, from now on we denote ^A k u by ^A k u, reserving the indices and for when they are not obvious from the context and for stand-alone denitions. In the general case, not restricted to, we arrive at the following decomosition of the Fourier matrix: F =; ; ^A ::: ^A 2 (=)H; {z } hase H; ; ; ; ( )H;2 ^A ( )H;2 ::: ^A 2 ( ) H;2 ::: ; ; {z } hase H;2 ^A ::: ^A 2 {z } hase 3 ; ; ; ; ( ) H;2 ::: ^A ::: ^A 2 ; P : (2.) {z } hase 0

15 (A) A 0 A 4 (B) B 0 B 0 A A 4 s =0 B 4 B 4 A 0 A 4 2 A 4 A 3 4 s = B 2 4 B 2 4 s =2 B 3 4 B 3 4 s =3 (C) w 0 w w 2 B 0 w 3 = ;w 0 ;w ;w 2 ;w 3 B 4 = w 4 w 4 w 2 4 w 3 4 ;w 4 ;w 4 ;w 2 4 ;w 3 4 B 2 4 = 2 w 4 w 2 4 w 22 4 w ;w 4 ;w 2 4 ;w 22 4 ;w 32 4 B 3 4 = 3 w 4 w 3 4 w 23 4 w ;w 4 ;w 3 4 ;w 23 4 ;w 33 4 Figure 2. (A) Structure of the matrix ^A k u = I r diag(a 0=u k n A=u k n ::: A(u;)=u k n ). Examle with = 2, =, r = 2, k =, and hence u = 4 and n = 6. (For clarity, the A k n are deicted as A k ) (B) Matrix diag(a0=u k n A=u k n ::: A(u;)=u k n ). (C) Matrices B s =u k, with s =0 ::: u;. The matrices ^A k u are block diagonal matrices with block size : ^A k u = I r diag(a 0=u k n A=u k n ::: A(u;)=u k n ) (2.9) where r = =u, n = =, anda k n was dened reviously, cf. (.0). Figure 2 examlies the structure of the matrix ^A k u. follows from Theorem 2.2. We shall formally state this as Corollary 2.3 which Theorem 2.2. Let u, M, and k be owers of two such that 2 k M=u. Dene n = M=u. Then S u M A ku M S ; u M = diag(a0=u k n A=u k n ::: A(u;)=u k n ): 4

16 Proof. See Aendix A. Corollary 2.3. Let r,, and be owers of two with r <. Dene u = =r and n = =. Let k be a ower of two with 2 k n. Then Proof. Dene M = =r. Then ^A k u = I r diag(a 0=u k n A=u k n ::: A(u;)=u k n ): ^A k u =; u A ku ; ; u =(I r S u M )(I r A ku M )(I r S ; u M ) = I r (S u M A ku M S ; u M )=I r diag(a 0=u k n A=u k n ::: A(u;)=u k n ): Starting from the Fourier matrix decomosition (2.), it is easy to develo a arallel (BSP) FFT algorithm. Since all the matrices ^A k u are block diagonal matrices with block size equal to =, any multilication ^A k u y can be handled locally, rovided that the vector y is in the block distribution. This roerty guarantees that matrix decomosition (2.) detaches communication and comutation comletely. On the one hand, each generalized buttery hase ( ^A u ::: ^A 2 u )y, is a strict comutation suerste. On the other hand, each ermutation ; u is a strict communication suerste. In the next section we give a comlete descrition of the resulting arallel algorithm. 3. Imlementation of the arallel algorithm We resent our algorithm in the form of temlates, which are high level of detail algorithms, ready to be imlemented, though not necessarily comletely otimized. In the list that follows we introduce the terminology used in our arallel algorithms. Suerstes. Each suerste is numbered textually and labeled according to its tye: (Com) comutation suerste, (Comm) communication suerste, (CCm) subroutine containing both comutation and communication suerstes. Global synchronizations are exlicitly indicated by the keyword Synchronize. Suerstes inside loos are executed reeatedly, though they are numbered only once. Indexing. All the indices of vectors or array structures are global. This means that array elements have a unique index which is indeendent of the rocessor that owns it. This roerty enables us to describe variables and gain access to arrays in an unambiguous manner, even though the array is distributed and each rocessor has 5

17 only art of it. (In an actual imlementation, it is more convenient to convert the indexing scheme to a local one.) Communication. Communication between rocessors is indicated using g j Put(id n f i ) This oeration uts n elements of array f, starting from element i, into rocessor id and stores them there in array g starting from element j. Subscrits are not needed when the rst element of the array is 0 or when communicating scalars. When communicating more than one element, we use boldface to emhasize that we are dealing with a vector and not a scalar. All uts are assumed to be buered, so that they are safely carried out, even if f and g haen to be the same. Algorithm 3. is a direct imlementation of the matrix decomosition (2.). The inut vector y is transformed in lace. Suerste ermutes the inut vector by a bit reversal: y P y: Suerste 2 carries out the short distance butteries: y (A :::A 2 )y. (Since is the identity ermutation, ^Ak = A k, for k = 2 ::: =.) Suerste 3 ermutes y to the r-cyclic distribution, with r = max( =(=)): y ; =r y: Each time suerste 4 is executed, it comutes a grou of medium distance butter- ies: y ( ^A ( ::: ^A )J 2 ( ) J ) y J H ; 2: Suerste 5 reares the vector for the next buttery hase by ermuting it to the r-cyclic distribution, with r = max( =(=) J+ ): y (; r ; ; ) y: Suerste 6 carries out the long distance butter- ( )J ies: y ( ^A ::: ^A (=)H; 2 ) y: Finally, suerste 7 ermutes the vector back to the y: ote that, to obtain the normalized inverse transform, block distribution: y ; ; the outut vector must be divided by. The subroutines used in the FFT temlate are described in the following subsections. 3.. Generalized butteries. The subroutine BTFLY (Algorithm 3.2) is a sequential subroutine that multilies the inut vector by A n n :::A k 0 n A k 0 =2 n. airs of generalized buttery oerations. Ste erforms The k-th iteration of the outermost whileloo erforms the air of generalized butteries stages A k n A k=2 n, the t-th iteration of the intermediate for-loo corresonds to the t-th reetition of the generalized 4-buttery Dk = B k (I 2 B k=2 ), cf. (.2), which is comuted by the innermost for-loo. Ste 2 6

18 Algorithm 3. Temlate for the arallel fast Fourier transform, using the grou-cyclic distribution family. CALL BSP FFT(s sign y). ARGUMETS s: Processor identication 0 s<. : umber of rocessors is a ower of 2 with <. sign: Transform direction + for forward, ; for backward. : Transform size is a ower of 2 with 2. y =(y in 0 ::: yin ;): Comlex vector of size (block distributed). P OUTPUT y (y out 0 ::: y out ; ;), where yout k = DESCRIPTIO CCm Parallel bit reversal ermutation. BSP BitRev(s y) 2 Com Short distance butteries. BTFLY(0 sign 4 y s ) j=0 yin j ex(sign 2ijk=). 3 CCm Permutation to C max( =(=)) ( ) distribution. BSP BlockToCyclic(s ; s mod smod min( ) y (s;smod ) ) H dlog e for J =to H ; 2 do 4 Com Medium distance butteries. BTFLY( smod(=)j (=) J sign 4 y s ) 5 Comm Permutation to C max( =(=)J+) ( ) distribution. BSP CyclicToCyclic(s (=) J min( (=) J+ ) y) 6 Com Long distance butteries. BTFLY( s sign 4 (=)H; y s ) 7 CCm Permutation to block distribution. BSP CyclicToBlock(0 s y) is only executed if the number of buttery stages is odd. It comutes the last generalized 2-buttery A n n. The FFT algorithm comutes the desired (short, medium, or long distance) buttery stages corresonding to hase J, 0 J < H, by dening the inut arameter = (s mod u)=u, where u = min( (=) J ), and erforming the generalized buttery stages on the local art of the vector y (i.e., the subvector y s starts at element s=). of size = that If the needed weights are stored in a looku table, the cost of an A k na k=2 n buttery oeration is 34 k 4 n k = 7 2 n. Summing over all airs A k n A k=2 n 7 and adding 5n os for

19 Algorithm 3.2 Temlate for the sequential generalized buttery oerations. CALL BTFLY( sign n k 0 y). ARGUMETS : Buttery arameter, used to comute the correct weights 0 <. sign: Transform direction + for forward, ; for backward. n: Vector size n is a ower of 2 with n 2. k 0 : Smaller 4-buttery size k 0 is a ower of 2 with 4 k 0 2n. y =(y 0 ::: y n; ): Comlex vector of size n. OUTPUT y A n n :::A k 0 n A k 0 =2 ny for forward, and y A n n ::: A k 0 na k 0 =2 ny for backward. DESCRIPTIO. Perform airs of buttery stages A k n A k=2 n. k k 0 while k n do for t =0to n ; k ste k do for j =0to k=4 ; do yw y t+j+k=2 yw2 y t+j+k=4 yw3 y t+j+3k=4 a y t+j + yw2 b y t+j ; yw2 c yw +yw3 d yw ; yw3 y t+j a + c y t+j+k=4 b + sign id y t+j+k=2 a ; c b ; sign id y t+j+3k=4 w sign(j+) k w sign2(j+) k w sign3(j+) k k 4 k 2. Perform the last buttery stage A n n. if k =2n then for j =0to n=2 ; do a y j+n=2 y j w sign(j+) n y j+n=2 y j ; a y j + a the last buttery (if necessary) gives a cost of C BTFLY (n k 0 )= = 7 2 n [(log 2 n ; log 2 k 0 +2)div2]+5n [(log 2 n ; log 2 k 0 +2)mod2] = 7 4 n [log 2 n ; log 2 k 0 +2]+ 3 4 n [(log 2 n ; log 2 k 0 +2)mod2] (3.) for the generalized buttery algorithm (Algorithm 3.2).

20 The total comutation cost of our arallel FFT (Algorithm 3.) is obtained by adding the costs of the buttery hases: C BTFLY ( 4) = 7 4 log (log 2 mod 2) for hases J = 0 to H ; 2, where H = dlog e, and for the last hase H ; if (=) H; =, i.e., H =log. Otherwise, the cost for the last hase is This gives C BTFLY ( 4(=)H; )= 7 4 (log 2 mod log 2 ) [(log 2 mod log 2 )mod2]: C FFT ar Com ( ) = 7 4 log [(log 2 mod 2)(log 2 div log 2 ) +(log 2 mod log 2 )mod2] (3.2) where the second term corresonds to the extra cost we have to ay for erforming 2-butteries. The communication and synchronization costs of our arallel FFT are discussed in Section 3.5 after we discuss the arallel ermutation subroutines Parallel bit reversal. The bit reversal matrix P is dened by (P ) jk = >< >: 0 if j = rev (k) otherwise: (3.3) Here, rev is the bit reversal ermutation rev : f0 ::: ; g!f0 ::: ; g j = m; X l=0 b l 2 l 7! k = m; X l=0 b m;l; 2 l (3.4) where m = log 2 and (b m; :::b 0 ) 2 is the binary reresentation of j. ote that rev ; = rev, which means that P ; = P. The bit reversal ermutation has the following very useful roerty. Lemma 3.. Let u =2 q, =2 m, with q m. Then rev (j) = rev u (j div u)+ u rev u(j mod u) 0 j <: 9

21 Proof. Let d = m ; q, so that u =2d. If the binary reresentation of j is (b m; :::b 0 ) 2, then ow, rev (j) = j = j mod u + u (j div u) = m; X l=0 b m;l; 2 l = Xd; l=0 q; X l=0 b m;l; 2 l +2 d X d; b l 2 l +2 q b l+q 2 l : q; X l=0 l=0 b q;l; 2 l = a + u b: If we show that a = rev u (j div u) and b = rev u (j mod u), we are done. In fact, a = = Xd; l=0 d; X l=0 Xd; b m;l; 2 l = b m;q;l;+q 2 l = (substituting c l = b l+q ) l=0!! d; d; X X c d;l; 2 l =rev c l 2 l = rev b l+q 2 l =rev(j div u) u u u l=0 l=0 and b = q; X X! q; b q;l; 2 l = rev u b l 2 l = rev u (j mod u): l=0 l=0 Corollary 3.2. Let u be owers of two. Then P =(I u P )(P u I )S u u u Proof. The matrix (I u P )(P u I )S u corresonds to a sequence of three ermutations: u u. j! l = u (j) =j mod u + j div u u 2. l! t = rev u (l div ) + l mod =rev u u u u(j mod u) + j div u u 3. t! k = t div u u +rev(t mod ) = rev u u u(j mod u) u +rev(j div u) = rev (j) u Let y be a vector of size =2 m block distributed over =2 q rocessors. Suose that we want to ermute it by a bit reversal ermutation, i.e., erform y P y. Alying Corollary 3.2 with u =, it is ossible to slit the arallel bit reversal ermutation in two arts as follows. 20

22 . y (P I )S y, which is a global ermutation that sends the elements to the nal destination rocessors, but with local indices still in the original order: j! t = rev (j mod ) {z } Proc(t) + j div {z : } t 0 Having as basis the block distribution, from now on, we use Proc(k) = k div denote the rocessor in which element k is stored, and k 0 = k mod local index of the element. to to denote the 2. y (I P ) y, which is a local bit reversal ermutation in the local index t 0 : t 0! k 0 = rev (t 0 ) Algorithm 3.3 carries out a arallel bit reversal using this idea. If we combine the local bit reversal (suerste 2 of Algorithm 3.3) with the short distance buttery hase (suerste 2 of Algorithm 3.), we obtain a comlete local sequential FFT. This means that we can easily relace the two suerstes by any otimized FFT subroutine we can lay our hands on. Algorithm 3.3 Temlate for the arallel bit reversal. CALL BSP BitRev(s n y). ARGUMETS s: Processor identication 0 s<. : umber of rocessors is a ower of 2 with <. : Vector size is a owerof2with 2. y =(y 0 ::: y ; ): Comlex vector of size (block distributed). OUTPUT y P y. DESCRIPTIO Comm Global ermutation. for j = s to s + ; do dest rev (j mod ) x dest +j div Put(dest y j) Synchronize 2 Com Local bit reversal. for t 0 =0to ; do y s +rev (t 0 ) x s +t0 If < =, it is ossible to otimize communication suerste of Algorithm 3.3 by sending ackets of data. This is done in a similar way as when ermuting from block to 2

23 cyclic distribution (see Section 3.4) the only dierence is in the destination rocessor, which is rev (j mod ) instead of j mod Permutations within the grou-cyclic distribution family. Permuting a vector from the C r ( ) distribution to the Cr 2 ( ) distribution, where r = =u and r 2 = =u 2 may beany ossible grou sizes, not necessarily owers of two, can be done as follows: rst, use ; u to ermute the vector to the block distribution, and then use u2 to ermute it to the C r 2 ( ) distribution. This oeration is exensive if erformed in arallel, because all the data have to be moved twice around the rocessors. The best aroach is to combine both ermutations into one: u 2 u : f0 ::: ; g!f0 ::: ; g j 7! l = u2 ( ; u (j)): (3.5) (ote that ( u 2 u ) ; = u u 2, and that u 2 u is an abbreviation for u 2 u.) In the general case, there is no simle formula for comuting the destination index l. Algorithm 3.4 is a temlate for this case. Some combinations of the arameters r r 2 and, however, lead to simler exressions for the destination index. The simlest case is when r or r 2 is equal to, i.e., one of the distributions involved is the block distribution. This situation occurs in suerstes 3 and 7 of the FFT algorithm (Algorithm 3.) and is discussed in Section Permutation from block to cyclic distribution. The ermutations and ; are the ermutations that convert a vector from block to cyclic distribution and vice versa. In the case that < b = =, both and ; ackets of size b= (here we assume that divides b, but nothing more). can be otimized by sending For, this is done as follows. First erform a local ermutation b on the local index j 0, j 0! t 0 = j 0 mod b + j0 div : Then erform a global cyclic ermutation of ackets on the global index t = j div bb+t 0 = t 0 b + t b + t 2, t! k = t {z} b + t 0 b + t 2 Proc(k) {z } k 0 22 : (3.6)

24 Algorithm 3.4 Temlate for the arallel ermutation from r -cyclic to r 2 -cyclic distribution. CALL BSP CyclicToCyclic(s u u 2 y). ARGUMETS s: Processor identication 0 s<. : umber of rocessors <. : Vector size. u : umber of rocessors in the old grou u = =r. u 2 : umber of rocessors in the new grou u 2 = =r 2. y =(y 0 ::: y ; ): Comlex vector of size (block distributed). OUTPUT y ; u2 ; ; u y. DESCRIPTIO Comm Global ermutation u 2 u. for j = s to s + ; do l u 2 u (j) roc l div y l Put(roc y j ) Synchronize The resulting global index k is then k = j 0 mod b + j div b b + j0 div = j mod b + j div b b +(j mod b) div = j mod b +(jdiv b b + j mod b) div = j mod b + j div which is indeed (j), see Figure 3. Subroutine BSP BlockToCyclic (Algorithm 3.5) with s = s and s 0 = 0 carries out the ermutation. The subroutine erforms the ermutation on a vector y of size = b which is block distributed over rocessors. This is done in suerste 3 ot the FFT algorithm if =. Subroutine BSP BlockToCyclic can also be used to carry out the ermutation r r, which is needed in suerste 3 of the FFT algorithm if >=. This is ossible because ermuting a vector of size r by r r is the same as dividing it into r subvectors of size and then erforming a shue ermutation on each of the subvectors, cf. (2.3). In this case s = s mod, s 0 = s ; s, and y is a subvector starting at element s 0 b of the original vector. 23

25 roc. 0 roc. roc. 2 roc (A) (B) Figure 3. Schematic reresentation of a two stage ermutation from block to cyclic distribution (storage view). Examle with = 32 and = 4. (A) Local 4 ermutation. (B) Global cyclic ermutation of ackets of size 2. The temlate for subroutine BSP CyclicToBlock, which carries out ; is obtained in a similar way as the temlate for subroutine BSP BlockToCyclic. and ; r r temlate starts by erforming a global cyclic ermutation of ackets, and then it carries out a local ermutation ; b. details.) The (The temlate is not resented here see [7] for more 3.5. BSP cost. To comute the total cost of our arallel FFT algorithm (Algorithm 3.) we need to sum the comutation, communication, and synchronization costs. The comutation costs were already obtained in Section 3.. To simlify the nal result we only include the higher order term of the total comutation cost (3.2), C FFT ar Com ( ) = 7 4 log 2 (3.7) which is exact when only 4-butteries are erformed. The communication and synchronization costs are the costs involved in erforming the bit reversal and the ermutations related to the grou-cyclic distribution family. The maximum amount of data sent or received during a ermutation involving comlex numbers is equal to = comlex values (or 2= real values). If the ermutation is erformed with uts, the number of synchronizations is, giving a total cost of C ermut ( ) =2 g + l (3.) 24

26 Algorithm 3.5 Temlate for the arallel ermutation from block to cyclic distribution. CALL BSP BlockToCyclic(s 0 s b y). ARGUMETS s 0 s : Processor oset and rocessor identication within grou 0 s <. : umber of rocessors in grou. b: Block size divides b, if<b. y =(y 0 ::: y b; ): Comlex vector of size b (block distributed within grou). OUTPUT y S b y. DESCRIPTIO if b then Comm Global b ermutation. for j = s b to (s +)b ; do dest j mod y destb+j div Put(s 0 + dest y j ) Synchronize else 2 Com Local b ermutation. for j 0 =0to b ; do k 0 j 0 mod b + j0 div x s b+k 0 y s b+j 0 3 Comm Global cyclic ermutation of ackets. for roc =0to ; do y rocb+s b Put(s 0 + roc b x s b+roc b ) Synchronize for each of the dlog e +ermutations erformed in the FFT algorithm. The total cost of the FFT algorithm is C FFT, ar ( ) = 7 4 log 2 +2 (dlog e +) g +(dlog e +) l: (3.9) In Section 5, we discuss the validity of cost function (3.9) as an accurate estimator of the true cost of the FFT algorithm. 4. Variants of the algorithm 4.. Parallel FFT using other data distributions. U to now, we discussed an FFT algorithm where the inut and outut (I/O) vector must be block distributed. There exist alications of the FFT where it would be better if the I/O vector would be distributed by other distributions or where the distribution can be freely chosen (see e.g. [4, 9, 2]). Here, we discuss how to modify our arallel FFT algorithm to accet I/O vectors that are not distributed by the block distribution. 25

27 The rst and the last suerstes of Algorithm 3. are ermutations. Because of this, the algorithm can be modied to accet any I/O data distribution without any extra communication cost, or even at a smaller communication cost deending on the desired distributions. If the inut vector is not in the block distribution, the algorithm is modied by combining the redistribution to block distribution with the bit reversal ermutation. If the outut vector is exected to be in a distribution other than the block distribution, this is done by relacing the ermutation from cyclic to block distribution by a ermutation from the cyclic to the desired distribution. If the desired distribution for the outut vector is the cyclic distribution, the last communication suerste can be comletely skied. The rst ermutation can also be skied if the inut vector is already stored by the distribution associated with the bit reversal ermutation. Alications where the inut vector is bit reversed and the outut vector is cyclically distributed are advantageous, because, in such cases, two comlete ermutations can be skied. 4 This saves two thirds of the total communication cost in the common case that =, leaving only one ermutation in the middle of the comutation. While the cyclic distribution is simle and widely used, the distribution associated with the bit reversal ermutation is awkward. Fortunately, it is ossible to modify Algorithm 3. so that the cyclic distribution is a natural inut distribution, i.e., a distribution that does not involve any communication as the rst suerste. This is done as follows. The rst three suerstes of Algorithm 3. are described by the matrix decomosition ; u A :::A 2 P (4.) where u = min( =). Knowing that A K = I A K, and that the bit reversal matrix can be decomosed as P = (I P ) (P I ) S (cf. Corollary 3.2), we rewrite 4 The idea of skiing ermutations to save communication time or to reduce the overhead caused by local ermutations is known. Cooley and Tukey [9] already suggested this in order to save local bit reversals. Other authors (e.g., [4, 9, 2]) give examles where skiing ermutations saves communication time. 26

28 matrix (4.) as ; u (I A ) :::(I A 2 ) (I P ) (P I ) S =; u (I F ) (P I ) S =; u (P I ) (I F ) S : (4.2) Here we used Lemma.2. The rst three suerstes of the arallel FFT algorithm derived from this new decomosition are: ( Comm ) ermutation from block to cyclic distribution, (2 Com ) local FFT, (3 Comm ) ermutation dened by ; min( =) (P I ). In the case that the inut vector is already cyclically distributed, the rst suerste can be skied Generalized buttery hase with adjustable size. In our original algorithm, we chose to insert the ermutation matrices ; u in the leftmost ossible osition. This rocedure corresonds to factoring as = (=) H;, and gives an algorithm (=) H; with a minimum number of ermutations. However, if 6= (=) H;, it is ossible to insert the ermutation matrices at an earlier osition without increasing the number of ermutations. The resulting algorithm would corresond to a dierent factorization of. We can use this exibility to reduce the comutation cost of some combinations of and by inserting the ermutations so that a maximal number of generalized buttery stages are aired o. Another reason to ermute the vector at an earlier stage is that the sizes of the buttery hases can be better balanced (so that all factors of have aroximately the same size). This would enhance the erformance on a cache-sensitive comuter (see the discussion in Section 5). An even a more eective way of enhancing the erformance on a cache-sensitive comuter is to reduce the buttery sizes so that they always t comletely into the cache. We suggest a method in the following subsection Cache-friendly arallel FFT. Each comutation suerste of our arallel FFT algorithm erforms a buttery hase which consists of a sequence of generalized buttery stages reresented by the oeration y R l ny, where l and n are owers of two with 2 l n, and R l n = A n n A 2l n A l n (4.3) is an n n matrix. Suose that the cache memory of a comuter is such that the data needed by a buttery hase of size n=v, where v < n is a ower of two, ts totally in 27

29 the comuter cache. We can view v as the number of virtual rocessors available in each rocessor. If we decomose (4.3) into a sequence of smaller buttery hases of size less than or equal to n=v which can be carried out indeendently from each other, we can get the most out of the cache of the comuter. Dene h = dlog n v ne and j = dlog n v le;, so that ( n v )j <l ( n v )j+. Similarly to (2.), if we denote ; u v n by ; u, we can write R l n =; ; v ^A n v v ::: ^A 2 (n=v) h; v v {z } hase h;j; ::: ; ; ( n v )j+ ^A n v ( n v )j+ ::: ^A 2 ( n v )j+ {z } hase ; v ; ; ( n v )h;2 ^A n v ( n v )h;2 ::: ^A 2 ( n v )h;2 {z } hase h;j;2 ; ( n v ) h;2 ::: ; ( n v ) j+ ;; ( n ^A nv v )j ( n ::: ^A v )j l (n=v) j ( n v )j {z } hase 0 ; ( n v ) j (4.4) where the nn matrix ^A k u is an abbreviation for ^A k u v n =; u v n A ku n; ; u v n. Generalized versions of Theorem 2.2 and Corollary 2.3 can be used to rove that ^A k u v n = I v u diag(a =u k n A(+)=u v k n v ::: A (+u;)=u k n ): (4.5) v The matrix decomosition (4.4) can be used to construct an alternative (cache-friendly) algorithm for the comutation of the generalized buttery hases. ote that if = 0 then the resulting algorithm can be used to construct a cache-friendly sequential FFT algorithm. 5. Exerimental results and discussion In this section, we resent results on the erformance of our imlementation of the FFT. We imlemented the FFT algorithm for the block distribution in ASI C using the BSPlib communications library [6]. Our rograms are comletely self-contained, and we did not rely on any system-rovided numerical software such as BLAS, FFTs, etc. We tested our imlementation on a Cray T3E with u to 64 rocessors, each having a theoretical eak seed of 600 Mo/s. The accuracy of double recision (64-bit) arithmetic is :0 0 ;5. We also give accuracy results from calculations on a Sun Workstation using IEEE 754 oating oint arithmetic, which has a double recision accuracy of 2:2 0 ;6, and which is the standard used in many comuters such as workstations. To make a consistent comarison of the results, we comiled all test rograms using the bsfront driver with otions -O3 -flibrary-level 2 -fcombine-uts and measured 2

30 the elased execution times on exclusively dedicated CPUs using the system clock. The times given corresond to an average of the execution times of a forward FFT and a normalized backward FFT. 5.. Accuracy. We tested the overall accuracy of our imlementation by measuring the error obtained when transforming a random comlex vector f with values Re(f j ) and Im(f j ) uniformly distributed between 0 and. The relative error is dened as jjf ; Fjj 2 =jjfjj 2 where F is the vector obtained by transforming the original vector f by a forward (or backward) FFT, and F is the exact transform, which we comuted using the same algorithm but using quadrule recision. Here, jj jj 2 indicates the L 2 -norm. Table shows the relative errors of the sequential algorithm for various roblem sizes. Since the error for the forward and backward FFT are aroximately the same, we resent only the results for the forward transform. The errors of the arallel imlementation are of the same order as in the sequential case. In fact, the error of the arallel imlementation only diers from the error of the sequential one if the buttery stages are not aired in the same way. This validates the arallel algorithm. The results also indicate that IEEE arithmetic is suerior to the CRAY-secic arithmetic. Table. Relative errors for the sequential FFT algorithm. CRAY T3E IEEE :4 0 ;6 :9 0 ; :2 0 ;6 :6 0 ;6 204 :4 0 ;6 : 0 ; : 0 ;5 :9 0 ;6 92 3:2 0 ;5 2:0 0 ; :5 0 ;5 2:2 0 ; :3 0 ;4 2:3 0 ; :4 0 ;4 2:3 0 ; Performance of the sequential imlementation. Our sequential FFT algorithm was imlemented using Algorithm 3.2 with = 0.Its eciency can be analyzed by looking at its execution times or its FFT o rates: FFT rate (seq ) = 29 5 log 2 Time(seq ) (5.)

Computer arithmetic. Intensive Computation. Annalisa Massini 2017/2018

Computer arithmetic. Intensive Computation. Annalisa Massini 2017/2018 Comuter arithmetic Intensive Comutation Annalisa Massini 7/8 Intensive Comutation - 7/8 References Comuter Architecture - A Quantitative Aroach Hennessy Patterson Aendix J Intensive Comutation - 7/8 3

More information

For q 0; 1; : : : ; `? 1, we have m 0; 1; : : : ; q? 1. The set fh j(x) : j 0; 1; ; : : : ; `? 1g forms a basis for the tness functions dened on the i

For q 0; 1; : : : ; `? 1, we have m 0; 1; : : : ; q? 1. The set fh j(x) : j 0; 1; ; : : : ; `? 1g forms a basis for the tness functions dened on the i Comuting with Haar Functions Sami Khuri Deartment of Mathematics and Comuter Science San Jose State University One Washington Square San Jose, CA 9519-0103, USA khuri@juiter.sjsu.edu Fax: (40)94-500 Keywords:

More information

ON POLYNOMIAL SELECTION FOR THE GENERAL NUMBER FIELD SIEVE

ON POLYNOMIAL SELECTION FOR THE GENERAL NUMBER FIELD SIEVE MATHEMATICS OF COMPUTATIO Volume 75, umber 256, October 26, Pages 237 247 S 25-5718(6)187-9 Article electronically ublished on June 28, 26 O POLYOMIAL SELECTIO FOR THE GEERAL UMBER FIELD SIEVE THORSTE

More information

Parallelism and Locality in Priority Queues. A. Ranade S. Cheng E. Deprit J. Jones S. Shih. University of California. Berkeley, CA 94720

Parallelism and Locality in Priority Queues. A. Ranade S. Cheng E. Deprit J. Jones S. Shih. University of California. Berkeley, CA 94720 Parallelism and Locality in Priority Queues A. Ranade S. Cheng E. Derit J. Jones S. Shih Comuter Science Division University of California Berkeley, CA 94720 Abstract We exlore two ways of incororating

More information

A generalization of Amdahl's law and relative conditions of parallelism

A generalization of Amdahl's law and relative conditions of parallelism A generalization of Amdahl's law and relative conditions of arallelism Author: Gianluca Argentini, New Technologies and Models, Riello Grou, Legnago (VR), Italy. E-mail: gianluca.argentini@riellogrou.com

More information

MATH 2710: NOTES FOR ANALYSIS

MATH 2710: NOTES FOR ANALYSIS MATH 270: NOTES FOR ANALYSIS The main ideas we will learn from analysis center around the idea of a limit. Limits occurs in several settings. We will start with finite limits of sequences, then cover infinite

More information

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Numerous alications in statistics, articularly in the fitting of linear models. Notation and conventions: Elements of a matrix A are denoted by a ij, where i indexes the rows and

More information

Outline. EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Simple Error Detection Coding

Outline. EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Simple Error Detection Coding Outline EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Error detection using arity Hamming code for error detection/correction Linear Feedback Shift

More information

Introduction to Group Theory Note 1

Introduction to Group Theory Note 1 Introduction to Grou Theory Note July 7, 009 Contents INTRODUCTION. Examles OF Symmetry Grous in Physics................................. ELEMENT OF GROUP THEORY. De nition of Grou................................................

More information

Linear diophantine equations for discrete tomography

Linear diophantine equations for discrete tomography Journal of X-Ray Science and Technology 10 001 59 66 59 IOS Press Linear diohantine euations for discrete tomograhy Yangbo Ye a,gewang b and Jiehua Zhu a a Deartment of Mathematics, The University of Iowa,

More information

A Parallel Algorithm for Minimization of Finite Automata

A Parallel Algorithm for Minimization of Finite Automata A Parallel Algorithm for Minimization of Finite Automata B. Ravikumar X. Xiong Deartment of Comuter Science University of Rhode Island Kingston, RI 02881 E-mail: fravi,xiongg@cs.uri.edu Abstract In this

More information

Uncorrelated Multilinear Principal Component Analysis for Unsupervised Multilinear Subspace Learning

Uncorrelated Multilinear Principal Component Analysis for Unsupervised Multilinear Subspace Learning TNN-2009-P-1186.R2 1 Uncorrelated Multilinear Princial Comonent Analysis for Unsuervised Multilinear Subsace Learning Haiing Lu, K. N. Plataniotis and A. N. Venetsanooulos The Edward S. Rogers Sr. Deartment

More information

John Weatherwax. Analysis of Parallel Depth First Search Algorithms

John Weatherwax. Analysis of Parallel Depth First Search Algorithms Sulementary Discussions and Solutions to Selected Problems in: Introduction to Parallel Comuting by Viin Kumar, Ananth Grama, Anshul Guta, & George Karyis John Weatherwax Chater 8 Analysis of Parallel

More information

GOOD MODELS FOR CUBIC SURFACES. 1. Introduction

GOOD MODELS FOR CUBIC SURFACES. 1. Introduction GOOD MODELS FOR CUBIC SURFACES ANDREAS-STEPHAN ELSENHANS Abstract. This article describes an algorithm for finding a model of a hyersurface with small coefficients. It is shown that the aroach works in

More information

4. Score normalization technical details We now discuss the technical details of the score normalization method.

4. Score normalization technical details We now discuss the technical details of the score normalization method. SMT SCORING SYSTEM This document describes the scoring system for the Stanford Math Tournament We begin by giving an overview of the changes to scoring and a non-technical descrition of the scoring rules

More information

Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models

Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models Ketan N. Patel, Igor L. Markov and John P. Hayes University of Michigan, Ann Arbor 48109-2122 {knatel,imarkov,jhayes}@eecs.umich.edu

More information

State Estimation with ARMarkov Models

State Estimation with ARMarkov Models Deartment of Mechanical and Aerosace Engineering Technical Reort No. 3046, October 1998. Princeton University, Princeton, NJ. State Estimation with ARMarkov Models Ryoung K. Lim 1 Columbia University,

More information

Estimation of the large covariance matrix with two-step monotone missing data

Estimation of the large covariance matrix with two-step monotone missing data Estimation of the large covariance matrix with two-ste monotone missing data Masashi Hyodo, Nobumichi Shutoh 2, Takashi Seo, and Tatjana Pavlenko 3 Deartment of Mathematical Information Science, Tokyo

More information

Universal Finite Memory Coding of Binary Sequences

Universal Finite Memory Coding of Binary Sequences Deartment of Electrical Engineering Systems Universal Finite Memory Coding of Binary Sequences Thesis submitted towards the degree of Master of Science in Electrical and Electronic Engineering in Tel-Aviv

More information

Approximating min-max k-clustering

Approximating min-max k-clustering Aroximating min-max k-clustering Asaf Levin July 24, 2007 Abstract We consider the roblems of set artitioning into k clusters with minimum total cost and minimum of the maximum cost of a cluster. The cost

More information

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM JOHN BINDER Abstract. In this aer, we rove Dirichlet s theorem that, given any air h, k with h, k) =, there are infinitely many rime numbers congruent to

More information

Sets of Real Numbers

Sets of Real Numbers Chater 4 Sets of Real Numbers 4. The Integers Z and their Proerties In our revious discussions about sets and functions the set of integers Z served as a key examle. Its ubiquitousness comes from the fact

More information

Chater Matrix Norms and Singular Value Decomosition Introduction In this lecture, we introduce the notion of a norm for matrices The singular value de

Chater Matrix Norms and Singular Value Decomosition Introduction In this lecture, we introduce the notion of a norm for matrices The singular value de Lectures on Dynamic Systems and Control Mohammed Dahleh Munther A Dahleh George Verghese Deartment of Electrical Engineering and Comuter Science Massachuasetts Institute of Technology c Chater Matrix Norms

More information

Feedback-error control

Feedback-error control Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller

More information

Analysis of execution time for parallel algorithm to dertmine if it is worth the effort to code and debug in parallel

Analysis of execution time for parallel algorithm to dertmine if it is worth the effort to code and debug in parallel Performance Analysis Introduction Analysis of execution time for arallel algorithm to dertmine if it is worth the effort to code and debug in arallel Understanding barriers to high erformance and redict

More information

GIVEN an input sequence x 0,..., x n 1 and the

GIVEN an input sequence x 0,..., x n 1 and the 1 Running Max/Min Filters using 1 + o(1) Comarisons er Samle Hao Yuan, Member, IEEE, and Mikhail J. Atallah, Fellow, IEEE Abstract A running max (or min) filter asks for the maximum or (minimum) elements

More information

New weighing matrices and orthogonal designs constructed using two sequences with zero autocorrelation function - a review

New weighing matrices and orthogonal designs constructed using two sequences with zero autocorrelation function - a review University of Wollongong Research Online Faculty of Informatics - Paers (Archive) Faculty of Engineering and Information Sciences 1999 New weighing matrices and orthogonal designs constructed using two

More information

Correspondence Between Fractal-Wavelet. Transforms and Iterated Function Systems. With Grey Level Maps. F. Mendivil and E.R.

Correspondence Between Fractal-Wavelet. Transforms and Iterated Function Systems. With Grey Level Maps. F. Mendivil and E.R. 1 Corresondence Between Fractal-Wavelet Transforms and Iterated Function Systems With Grey Level Mas F. Mendivil and E.R. Vrscay Deartment of Alied Mathematics Faculty of Mathematics University of Waterloo

More information

Radial Basis Function Networks: Algorithms

Radial Basis Function Networks: Algorithms Radial Basis Function Networks: Algorithms Introduction to Neural Networks : Lecture 13 John A. Bullinaria, 2004 1. The RBF Maing 2. The RBF Network Architecture 3. Comutational Power of RBF Networks 4.

More information

Statics and dynamics: some elementary concepts

Statics and dynamics: some elementary concepts 1 Statics and dynamics: some elementary concets Dynamics is the study of the movement through time of variables such as heartbeat, temerature, secies oulation, voltage, roduction, emloyment, rices and

More information

1-way quantum finite automata: strengths, weaknesses and generalizations

1-way quantum finite automata: strengths, weaknesses and generalizations 1-way quantum finite automata: strengths, weaknesses and generalizations arxiv:quant-h/9802062v3 30 Se 1998 Andris Ambainis UC Berkeley Abstract Rūsiņš Freivalds University of Latvia We study 1-way quantum

More information

Matching Partition a Linked List and Its Optimization

Matching Partition a Linked List and Its Optimization Matching Partition a Linked List and Its Otimization Yijie Han Deartment of Comuter Science University of Kentucky Lexington, KY 40506 ABSTRACT We show the curve O( n log i + log (i) n + log i) for the

More information

arxiv: v1 [physics.data-an] 26 Oct 2012

arxiv: v1 [physics.data-an] 26 Oct 2012 Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch

More information

On Wald-Type Optimal Stopping for Brownian Motion

On Wald-Type Optimal Stopping for Brownian Motion J Al Probab Vol 34, No 1, 1997, (66-73) Prerint Ser No 1, 1994, Math Inst Aarhus On Wald-Tye Otimal Stoing for Brownian Motion S RAVRSN and PSKIR The solution is resented to all otimal stoing roblems of

More information

ECE 534 Information Theory - Midterm 2

ECE 534 Information Theory - Midterm 2 ECE 534 Information Theory - Midterm Nov.4, 009. 3:30-4:45 in LH03. You will be given the full class time: 75 minutes. Use it wisely! Many of the roblems have short answers; try to find shortcuts. You

More information

The Graph Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule

The Graph Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule The Grah Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule STEFAN D. BRUDA Deartment of Comuter Science Bisho s University Lennoxville, Quebec J1M 1Z7 CANADA bruda@cs.ubishos.ca

More information

c Copyright by Helen J. Elwood December, 2011

c Copyright by Helen J. Elwood December, 2011 c Coyright by Helen J. Elwood December, 2011 CONSTRUCTING COMPLEX EQUIANGULAR PARSEVAL FRAMES A Dissertation Presented to the Faculty of the Deartment of Mathematics University of Houston In Partial Fulfillment

More information

A construction of bent functions from plateaued functions

A construction of bent functions from plateaued functions A construction of bent functions from lateaued functions Ayça Çeşmelioğlu, Wilfried Meidl Sabancı University, MDBF, Orhanlı, 34956 Tuzla, İstanbul, Turkey. Abstract In this resentation, a technique for

More information

CSC165H, Mathematical expression and reasoning for computer science week 12

CSC165H, Mathematical expression and reasoning for computer science week 12 CSC165H, Mathematical exression and reasoning for comuter science week 1 nd December 005 Gary Baumgartner and Danny Hea hea@cs.toronto.edu SF4306A 416-978-5899 htt//www.cs.toronto.edu/~hea/165/s005/index.shtml

More information

On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm

On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm Gabriel Noriega, José Restreo, Víctor Guzmán, Maribel Giménez and José Aller Universidad Simón Bolívar Valle de Sartenejas,

More information

An Analysis of Reliable Classifiers through ROC Isometrics

An Analysis of Reliable Classifiers through ROC Isometrics An Analysis of Reliable Classifiers through ROC Isometrics Stijn Vanderlooy s.vanderlooy@cs.unimaas.nl Ida G. Srinkhuizen-Kuyer kuyer@cs.unimaas.nl Evgueni N. Smirnov smirnov@cs.unimaas.nl MICC-IKAT, Universiteit

More information

Mobius Functions, Legendre Symbols, and Discriminants

Mobius Functions, Legendre Symbols, and Discriminants Mobius Functions, Legendre Symbols, and Discriminants 1 Introduction Zev Chonoles, Erick Knight, Tim Kunisky Over the integers, there are two key number-theoretic functions that take on values of 1, 1,

More information

Elliptic Curves and Cryptography

Elliptic Curves and Cryptography Ellitic Curves and Crytograhy Background in Ellitic Curves We'll now turn to the fascinating theory of ellitic curves. For simlicity, we'll restrict our discussion to ellitic curves over Z, where is a

More information

The Arm Prime Factors Decomposition

The Arm Prime Factors Decomposition The Arm Prime Factors Decomosition Arm Boris Nima arm.boris@gmail.com Abstract We introduce the Arm rime factors decomosition which is the equivalent of the Taylor formula for decomosition of integers

More information

A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS. 1. Abstract

A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS. 1. Abstract A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS CASEY BRUCK 1. Abstract The goal of this aer is to rovide a concise way for undergraduate mathematics students to learn about how rime numbers behave

More information

Shadow Computing: An Energy-Aware Fault Tolerant Computing Model

Shadow Computing: An Energy-Aware Fault Tolerant Computing Model Shadow Comuting: An Energy-Aware Fault Tolerant Comuting Model Bryan Mills, Taieb Znati, Rami Melhem Deartment of Comuter Science University of Pittsburgh (bmills, znati, melhem)@cs.itt.edu Index Terms

More information

Fault Tolerant Quantum Computing Robert Rogers, Thomas Sylwester, Abe Pauls

Fault Tolerant Quantum Computing Robert Rogers, Thomas Sylwester, Abe Pauls CIS 410/510, Introduction to Quantum Information Theory Due: June 8th, 2016 Sring 2016, University of Oregon Date: June 7, 2016 Fault Tolerant Quantum Comuting Robert Rogers, Thomas Sylwester, Abe Pauls

More information

Generalized Coiflets: A New Family of Orthonormal Wavelets

Generalized Coiflets: A New Family of Orthonormal Wavelets Generalized Coiflets A New Family of Orthonormal Wavelets Dong Wei, Alan C Bovik, and Brian L Evans Laboratory for Image and Video Engineering Deartment of Electrical and Comuter Engineering The University

More information

arxiv: v2 [quant-ph] 2 Aug 2012

arxiv: v2 [quant-ph] 2 Aug 2012 Qcomiler: quantum comilation with CSD method Y. G. Chen a, J. B. Wang a, a School of Physics, The University of Western Australia, Crawley WA 6009 arxiv:208.094v2 [quant-h] 2 Aug 202 Abstract In this aer,

More information

RESOLUTIONS OF THREE-ROWED SKEW- AND ALMOST SKEW-SHAPES IN CHARACTERISTIC ZERO

RESOLUTIONS OF THREE-ROWED SKEW- AND ALMOST SKEW-SHAPES IN CHARACTERISTIC ZERO RESOLUTIONS OF THREE-ROWED SKEW- AND ALMOST SKEW-SHAPES IN CHARACTERISTIC ZERO MARIA ARTALE AND DAVID A. BUCHSBAUM Abstract. We find an exlicit descrition of the terms and boundary mas for the three-rowed

More information

PARALLEL MATRIX MULTIPLICATION: A SYSTEMATIC JOURNEY

PARALLEL MATRIX MULTIPLICATION: A SYSTEMATIC JOURNEY PARALLEL MATRIX MULTIPLICATION: A SYSTEMATIC JOURNEY MARTIN D SCHATZ, ROBERT A VAN DE GEIJN, AND JACK POULSON Abstract We exose a systematic aroach for develoing distributed memory arallel matrix matrix

More information

Cryptanalysis of Pseudorandom Generators

Cryptanalysis of Pseudorandom Generators CSE 206A: Lattice Algorithms and Alications Fall 2017 Crytanalysis of Pseudorandom Generators Instructor: Daniele Micciancio UCSD CSE As a motivating alication for the study of lattice in crytograhy we

More information

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS #A13 INTEGERS 14 (014) ON THE LEAST SIGNIFICANT ADIC DIGITS OF CERTAIN LUCAS NUMBERS Tamás Lengyel Deartment of Mathematics, Occidental College, Los Angeles, California lengyel@oxy.edu Received: 6/13/13,

More information

Round-off Errors and Computer Arithmetic - (1.2)

Round-off Errors and Computer Arithmetic - (1.2) Round-off Errors and Comuter Arithmetic - (.). Round-off Errors: Round-off errors is roduced when a calculator or comuter is used to erform real number calculations. That is because the arithmetic erformed

More information

The non-stochastic multi-armed bandit problem

The non-stochastic multi-armed bandit problem Submitted for journal ublication. The non-stochastic multi-armed bandit roblem Peter Auer Institute for Theoretical Comuter Science Graz University of Technology A-8010 Graz (Austria) auer@igi.tu-graz.ac.at

More information

Topic 7: Using identity types

Topic 7: Using identity types Toic 7: Using identity tyes June 10, 2014 Now we would like to learn how to use identity tyes and how to do some actual mathematics with them. By now we have essentially introduced all inference rules

More information

Unit 1 - Computer Arithmetic

Unit 1 - Computer Arithmetic FIXD-POINT (FX) ARITHMTIC Unit 1 - Comuter Arithmetic INTGR NUMBRS n bit number: b n 1 b n 2 b 0 Decimal Value Range of values UNSIGND n 1 SIGND D = b i 2 i D = 2 n 1 b n 1 + b i 2 i n 2 i=0 i=0 [0, 2

More information

DETC2003/DAC AN EFFICIENT ALGORITHM FOR CONSTRUCTING OPTIMAL DESIGN OF COMPUTER EXPERIMENTS

DETC2003/DAC AN EFFICIENT ALGORITHM FOR CONSTRUCTING OPTIMAL DESIGN OF COMPUTER EXPERIMENTS Proceedings of DETC 03 ASME 003 Design Engineering Technical Conferences and Comuters and Information in Engineering Conference Chicago, Illinois USA, Setember -6, 003 DETC003/DAC-48760 AN EFFICIENT ALGORITHM

More information

The Euler Phi Function

The Euler Phi Function The Euler Phi Function 7-3-2006 An arithmetic function takes ositive integers as inuts and roduces real or comlex numbers as oututs. If f is an arithmetic function, the divisor sum Dfn) is the sum of the

More information

p-adic Measures and Bernoulli Numbers

p-adic Measures and Bernoulli Numbers -Adic Measures and Bernoulli Numbers Adam Bowers Introduction The constants B k in the Taylor series exansion t e t = t k B k k! k=0 are known as the Bernoulli numbers. The first few are,, 6, 0, 30, 0,

More information

Modeling and Estimation of Full-Chip Leakage Current Considering Within-Die Correlation

Modeling and Estimation of Full-Chip Leakage Current Considering Within-Die Correlation 6.3 Modeling and Estimation of Full-Chi Leaage Current Considering Within-Die Correlation Khaled R. eloue, Navid Azizi, Farid N. Najm Deartment of ECE, University of Toronto,Toronto, Ontario, Canada {haled,nazizi,najm}@eecg.utoronto.ca

More information

The Fekete Szegő theorem with splitting conditions: Part I

The Fekete Szegő theorem with splitting conditions: Part I ACTA ARITHMETICA XCIII.2 (2000) The Fekete Szegő theorem with slitting conditions: Part I by Robert Rumely (Athens, GA) A classical theorem of Fekete and Szegő [4] says that if E is a comact set in the

More information

Model checking, verification of CTL. One must verify or expel... doubts, and convert them into the certainty of YES [Thomas Carlyle]

Model checking, verification of CTL. One must verify or expel... doubts, and convert them into the certainty of YES [Thomas Carlyle] Chater 5 Model checking, verification of CTL One must verify or exel... doubts, and convert them into the certainty of YES or NO. [Thomas Carlyle] 5. The verification setting Page 66 We introduce linear

More information

substantial literature on emirical likelihood indicating that it is widely viewed as a desirable and natural aroach to statistical inference in a vari

substantial literature on emirical likelihood indicating that it is widely viewed as a desirable and natural aroach to statistical inference in a vari Condence tubes for multile quantile lots via emirical likelihood John H.J. Einmahl Eindhoven University of Technology Ian W. McKeague Florida State University May 7, 998 Abstract The nonarametric emirical

More information

Cryptography. Lecture 8. Arpita Patra

Cryptography. Lecture 8. Arpita Patra Crytograhy Lecture 8 Arita Patra Quick Recall and Today s Roadma >> Hash Functions- stands in between ublic and rivate key world >> Key Agreement >> Assumtions in Finite Cyclic grous - DL, CDH, DDH Grous

More information

Multi-Operation Multi-Machine Scheduling

Multi-Operation Multi-Machine Scheduling Multi-Oeration Multi-Machine Scheduling Weizhen Mao he College of William and Mary, Williamsburg VA 3185, USA Abstract. In the multi-oeration scheduling that arises in industrial engineering, each job

More information

Galois Fields, Linear Feedback Shift Registers and their Applications

Galois Fields, Linear Feedback Shift Registers and their Applications Galois Fields, Linear Feedback Shift Registers and their Alications With 85 illustrations as well as numerous tables, diagrams and examles by Ulrich Jetzek ISBN (Book): 978-3-446-45140-7 ISBN (E-Book):

More information

Brownian Motion and Random Prime Factorization

Brownian Motion and Random Prime Factorization Brownian Motion and Random Prime Factorization Kendrick Tang June 4, 202 Contents Introduction 2 2 Brownian Motion 2 2. Develoing Brownian Motion.................... 2 2.. Measure Saces and Borel Sigma-Algebras.........

More information

Finding Shortest Hamiltonian Path is in P. Abstract

Finding Shortest Hamiltonian Path is in P. Abstract Finding Shortest Hamiltonian Path is in P Dhananay P. Mehendale Sir Parashurambhau College, Tilak Road, Pune, India bstract The roblem of finding shortest Hamiltonian ath in a eighted comlete grah belongs

More information

PARTITIONS AND (2k + 1) CORES. 1. Introduction

PARTITIONS AND (2k + 1) CORES. 1. Introduction PARITY RESULTS FOR BROKEN k DIAMOND PARTITIONS AND 2k + CORES SILVIU RADU AND JAMES A. SELLERS Abstract. In this aer we rove several new arity results for broken k-diamond artitions introduced in 2007

More information

PARALLEL MATRIX MULTIPLICATION: A SYSTEMATIC JOURNEY

PARALLEL MATRIX MULTIPLICATION: A SYSTEMATIC JOURNEY PARALLEL MATRIX MULTIPLICATION: A SYSTEMATIC JOURNEY MARTIN D SCHATZ, ROBERT A VAN DE GEIJN, AND JACK POULSON Abstract We exose a systematic aroach for develoing distributed memory arallel matrixmatrix

More information

#A47 INTEGERS 15 (2015) QUADRATIC DIOPHANTINE EQUATIONS WITH INFINITELY MANY SOLUTIONS IN POSITIVE INTEGERS

#A47 INTEGERS 15 (2015) QUADRATIC DIOPHANTINE EQUATIONS WITH INFINITELY MANY SOLUTIONS IN POSITIVE INTEGERS #A47 INTEGERS 15 (015) QUADRATIC DIOPHANTINE EQUATIONS WITH INFINITELY MANY SOLUTIONS IN POSITIVE INTEGERS Mihai Ciu Simion Stoilow Institute of Mathematics of the Romanian Academy, Research Unit No. 5,

More information

CSE 599d - Quantum Computing When Quantum Computers Fall Apart

CSE 599d - Quantum Computing When Quantum Computers Fall Apart CSE 599d - Quantum Comuting When Quantum Comuters Fall Aart Dave Bacon Deartment of Comuter Science & Engineering, University of Washington In this lecture we are going to begin discussing what haens to

More information

AR PROCESSES AND SOURCES CAN BE RECONSTRUCTED FROM. Radu Balan, Alexander Jourjine, Justinian Rosca. Siemens Corporation Research

AR PROCESSES AND SOURCES CAN BE RECONSTRUCTED FROM. Radu Balan, Alexander Jourjine, Justinian Rosca. Siemens Corporation Research AR PROCESSES AND SOURCES CAN BE RECONSTRUCTED FROM DEGENERATE MIXTURES Radu Balan, Alexander Jourjine, Justinian Rosca Siemens Cororation Research 7 College Road East Princeton, NJ 8 fradu,jourjine,roscag@scr.siemens.com

More information

Convex Optimization methods for Computing Channel Capacity

Convex Optimization methods for Computing Channel Capacity Convex Otimization methods for Comuting Channel Caacity Abhishek Sinha Laboratory for Information and Decision Systems (LIDS), MIT sinhaa@mit.edu May 15, 2014 We consider a classical comutational roblem

More information

q-ary Symmetric Channel for Large q

q-ary Symmetric Channel for Large q List-Message Passing Achieves Caacity on the q-ary Symmetric Channel for Large q Fan Zhang and Henry D Pfister Deartment of Electrical and Comuter Engineering, Texas A&M University {fanzhang,hfister}@tamuedu

More information

Paper C Exact Volume Balance Versus Exact Mass Balance in Compositional Reservoir Simulation

Paper C Exact Volume Balance Versus Exact Mass Balance in Compositional Reservoir Simulation Paer C Exact Volume Balance Versus Exact Mass Balance in Comositional Reservoir Simulation Submitted to Comutational Geosciences, December 2005. Exact Volume Balance Versus Exact Mass Balance in Comositional

More information

A randomized sorting algorithm on the BSP model

A randomized sorting algorithm on the BSP model A randomized sorting algorithm on the BSP model Alexandros V. Gerbessiotis a, Constantinos J. Siniolakis b a CS Deartment, New Jersey Institute of Technology, Newark, NJ 07102, USA b The American College

More information

216 S. Chandrasearan and I.C.F. Isen Our results dier from those of Sun [14] in two asects: we assume that comuted eigenvalues or singular values are

216 S. Chandrasearan and I.C.F. Isen Our results dier from those of Sun [14] in two asects: we assume that comuted eigenvalues or singular values are Numer. Math. 68: 215{223 (1994) Numerische Mathemati c Sringer-Verlag 1994 Electronic Edition Bacward errors for eigenvalue and singular value decomositions? S. Chandrasearan??, I.C.F. Isen??? Deartment

More information

An Investigation on the Numerical Ill-conditioning of Hybrid State Estimators

An Investigation on the Numerical Ill-conditioning of Hybrid State Estimators An Investigation on the Numerical Ill-conditioning of Hybrid State Estimators S. K. Mallik, Student Member, IEEE, S. Chakrabarti, Senior Member, IEEE, S. N. Singh, Senior Member, IEEE Deartment of Electrical

More information

3 Properties of Dedekind domains

3 Properties of Dedekind domains 18.785 Number theory I Fall 2016 Lecture #3 09/15/2016 3 Proerties of Dedekind domains In the revious lecture we defined a Dedekind domain as a noetherian domain A that satisfies either of the following

More information

Notes on Instrumental Variables Methods

Notes on Instrumental Variables Methods Notes on Instrumental Variables Methods Michele Pellizzari IGIER-Bocconi, IZA and frdb 1 The Instrumental Variable Estimator Instrumental variable estimation is the classical solution to the roblem of

More information

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests 009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract

More information

Math 104B: Number Theory II (Winter 2012)

Math 104B: Number Theory II (Winter 2012) Math 104B: Number Theory II (Winter 01) Alina Bucur Contents 1 Review 11 Prime numbers 1 Euclidean algorithm 13 Multilicative functions 14 Linear diohantine equations 3 15 Congruences 3 Primes as sums

More information

Multiplicative group law on the folium of Descartes

Multiplicative group law on the folium of Descartes Multilicative grou law on the folium of Descartes Steluţa Pricoie and Constantin Udrişte Abstract. The folium of Descartes is still studied and understood today. Not only did it rovide for the roof of

More information

Positive decomposition of transfer functions with multiple poles

Positive decomposition of transfer functions with multiple poles Positive decomosition of transfer functions with multile oles Béla Nagy 1, Máté Matolcsi 2, and Márta Szilvási 1 Deartment of Analysis, Technical University of Budaest (BME), H-1111, Budaest, Egry J. u.

More information

Solving Sparse Integer Linear Systems. Pascal Giorgi. Dali team - University of Perpignan (France)

Solving Sparse Integer Linear Systems. Pascal Giorgi. Dali team - University of Perpignan (France) Solving Sarse Integer Linear Systems Pascal Giorgi Dali team - University of Perignan (France) in collaboration with A. Storjohann, M. Giesbrecht (University of Waterloo), W. Eberly (University of Calgary),

More information

Catalan s Equation Has No New Solution with Either Exponent Less Than 10651

Catalan s Equation Has No New Solution with Either Exponent Less Than 10651 Catalan s Euation Has No New Solution with Either Exonent Less Than 065 Maurice Mignotte and Yves Roy CONTENTS. Introduction and Overview. Bounding One Exonent as a Function of the Other 3. An Alication

More information

MAS 4203 Number Theory. M. Yotov

MAS 4203 Number Theory. M. Yotov MAS 4203 Number Theory M. Yotov June 15, 2017 These Notes were comiled by the author with the intent to be used by his students as a main text for the course MAS 4203 Number Theory taught at the Deartment

More information

Participation Factors. However, it does not give the influence of each state on the mode.

Participation Factors. However, it does not give the influence of each state on the mode. Particiation Factors he mode shae, as indicated by the right eigenvector, gives the relative hase of each state in a articular mode. However, it does not give the influence of each state on the mode. We

More information

Improved Capacity Bounds for the Binary Energy Harvesting Channel

Improved Capacity Bounds for the Binary Energy Harvesting Channel Imroved Caacity Bounds for the Binary Energy Harvesting Channel Kaya Tutuncuoglu 1, Omur Ozel 2, Aylin Yener 1, and Sennur Ulukus 2 1 Deartment of Electrical Engineering, The Pennsylvania State University,

More information

MULTIVARIATE STATISTICAL PROCESS OF HOTELLING S T CONTROL CHARTS PROCEDURES WITH INDUSTRIAL APPLICATION

MULTIVARIATE STATISTICAL PROCESS OF HOTELLING S T CONTROL CHARTS PROCEDURES WITH INDUSTRIAL APPLICATION Journal of Statistics: Advances in heory and Alications Volume 8, Number, 07, Pages -44 Available at htt://scientificadvances.co.in DOI: htt://dx.doi.org/0.864/jsata_700868 MULIVARIAE SAISICAL PROCESS

More information

On split sample and randomized confidence intervals for binomial proportions

On split sample and randomized confidence intervals for binomial proportions On slit samle and randomized confidence intervals for binomial roortions Måns Thulin Deartment of Mathematics, Usala University arxiv:1402.6536v1 [stat.me] 26 Feb 2014 Abstract Slit samle methods have

More information

Hotelling s Two- Sample T 2

Hotelling s Two- Sample T 2 Chater 600 Hotelling s Two- Samle T Introduction This module calculates ower for the Hotelling s two-grou, T-squared (T) test statistic. Hotelling s T is an extension of the univariate two-samle t-test

More information

Section 0.10: Complex Numbers from Precalculus Prerequisites a.k.a. Chapter 0 by Carl Stitz, PhD, and Jeff Zeager, PhD, is available under a Creative

Section 0.10: Complex Numbers from Precalculus Prerequisites a.k.a. Chapter 0 by Carl Stitz, PhD, and Jeff Zeager, PhD, is available under a Creative Section 0.0: Comlex Numbers from Precalculus Prerequisites a.k.a. Chater 0 by Carl Stitz, PhD, and Jeff Zeager, PhD, is available under a Creative Commons Attribution-NonCommercial-ShareAlike.0 license.

More information

CERIAS Tech Report The period of the Bell numbers modulo a prime by Peter Montgomery, Sangil Nahm, Samuel Wagstaff Jr Center for Education

CERIAS Tech Report The period of the Bell numbers modulo a prime by Peter Montgomery, Sangil Nahm, Samuel Wagstaff Jr Center for Education CERIAS Tech Reort 2010-01 The eriod of the Bell numbers modulo a rime by Peter Montgomery, Sangil Nahm, Samuel Wagstaff Jr Center for Education and Research Information Assurance and Security Purdue University,

More information

0.6 Factoring 73. As always, the reader is encouraged to multiply out (3

0.6 Factoring 73. As always, the reader is encouraged to multiply out (3 0.6 Factoring 7 5. The G.C.F. of the terms in 81 16t is just 1 so there is nothing of substance to factor out from both terms. With just a difference of two terms, we are limited to fitting this olynomial

More information

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK Comuter Modelling and ew Technologies, 5, Vol.9, o., 3-39 Transort and Telecommunication Institute, Lomonosov, LV-9, Riga, Latvia MATHEMATICAL MODELLIG OF THE WIRELESS COMMUICATIO ETWORK M. KOPEETSK Deartment

More information

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO)

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO) Combining Logistic Regression with Kriging for Maing the Risk of Occurrence of Unexloded Ordnance (UXO) H. Saito (), P. Goovaerts (), S. A. McKenna (2) Environmental and Water Resources Engineering, Deartment

More information