Introduction to HPC. Lecture 20


Aubrey Carpenter
6 years ago

COSC Introduction to HPC, Lecture 20
Dept of Computer Science

Fast Fourier Transform

Applications
Image processing in electron microscopy and medical imaging
Signal processing in radio astronomy
Modeling and solution of PDEs

FFT algorithms
Complexity O(P log P)
Cooley-Tukey, fixed radix
Mixed-radix Cooley-Tukey (MR)
Prime Factor Algorithm (PFA)
Split-radix (SR)
Rader's algorithm
Transforms on real data
Sine and cosine transforms: transforms on real data with symmetry (odd and even symmetry, respectively). Cosine transforms are used for JPEG, MPEG, ...

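The Fast Fourier Transform

The Discrete Fourier Transform (DFT)
$$X(m) = \sum_{j=0}^{P-1} \omega_P^{mj}\, x(j), \qquad \forall m \in [0, P-1], \qquad \omega_P = e^{-2\pi i/P}$$

The Inverse Discrete Fourier Transform (IDFT)
$$x(j) = \frac{1}{P} \sum_{m=0}^{P-1} \omega_P^{-mj}\, X(m), \qquad \forall j \in [0, P-1], \qquad \omega_P = e^{-2\pi i/P}$$

The DFT in matrix form
$$\begin{pmatrix} X(0) \\ X(1) \\ X(2) \\ \vdots \\ X(P-1) \end{pmatrix} =
\begin{pmatrix}
\omega_P^{0} & \omega_P^{0} & \omega_P^{0} & \cdots & \omega_P^{0} \\
\omega_P^{0} & \omega_P^{1} & \omega_P^{2} & \cdots & \omega_P^{(P-1)} \\
\omega_P^{0} & \omega_P^{2} & \omega_P^{4} & \cdots & \omega_P^{2(P-1)} \\
\vdots & \vdots & \vdots & & \vdots \\
\omega_P^{0} & \omega_P^{(P-1)} & \omega_P^{2(P-1)} & \cdots & \omega_P^{(P-1)(P-1)}
\end{pmatrix}
\begin{pmatrix} x(0) \\ x(1) \\ x(2) \\ \vdots \\ x(P-1) \end{pmatrix}$$
$$X = Wx, \qquad W_{mj} = \omega_P^{mj}$$
The DFT is indeed a matrix-vector multiplication.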
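The matrix-vector view of the DFT can be sketched directly in Python. This is a minimal O(P^2) illustration, not an FFT; the function names `dft_matrix`, `dft` and `idft` are assumptions for this sketch, and the sign convention omega_P = exp(-2*pi*i/P) with a 1/P factor on the inverse is assumed.

```python
import cmath

def dft_matrix(P):
    """Return W with W[m][j] = omega_P**(m*j), omega_P = exp(-2*pi*i/P) (assumed sign)."""
    omega = cmath.exp(-2j * cmath.pi / P)
    return [[omega ** (m * j) for j in range(P)] for m in range(P)]

def dft(x):
    """X = W x, computed as a plain matrix-vector product."""
    W = dft_matrix(len(x))
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def idft(X):
    """x = (1/P) conj(W) X: the inverse uses the inverted (conjugated) elements of W."""
    P = len(X)
    W = dft_matrix(P)
    return [sum(w.conjugate() * Xm for w, Xm in zip(row, X)) / P for row in W]
```

Applying `idft(dft(x))` should reproduce `x` up to rounding error, which is a quick check of the inverse relation.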

The Inverse DFT
$$x = \frac{1}{P}\, \overline{W} X, \qquad \overline{W}_{jm} = \omega_P^{-jm}$$
Thus, up to the factor 1/P, the elements of the inverse of W are the inverses of the elements of W.
Proof:
$$\sum_{k=0}^{P-1} \omega_P^{ik}\, \omega_P^{-kj} = \sum_{k=0}^{P-1} \omega_P^{(i-j)k} =
\begin{cases}
P & \text{if } i = j \\
\dfrac{\omega_P^{(i-j)P} - 1}{\omega_P^{(i-j)} - 1} = 0 & \text{if } i \neq j
\end{cases}$$

FFT variants
Decimation-in-time (DIT)
Decimation-in-frequency (DIF)
Ordered vs scrambled (bit-reversed)
Self-sorting
In-place

DIF FFT
[Equations: the DFT sum split into its first and second halves, with a factor of ω_P]
Now consider even and odd l: two half-sized DFTs!

DIF FFT
First computation step: the butterfly [figure]
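The slide equations did not survive, so the following is a standard reconstruction of the DIF splitting rather than the lecture's own notation. It assumes the convention $\omega_P = e^{-2\pi i/P}$, even P, and output index l in $[0, P/2 - 1]$.

```latex
\begin{aligned}
X(m) &= \sum_{j=0}^{P/2-1} \left[ x(j) + \omega_P^{mP/2}\, x(j + P/2) \right] \omega_P^{mj},
\qquad \omega_P^{mP/2} = (-1)^m \\
X(2l) &= \sum_{j=0}^{P/2-1} \left[ x(j) + x(j + P/2) \right] \omega_{P/2}^{lj} \\
X(2l+1) &= \sum_{j=0}^{P/2-1} \left[ x(j) - x(j + P/2) \right] \omega_P^{j}\, \omega_{P/2}^{lj}
\end{aligned}
```

The even and odd outputs are each a DFT of size P/2 on the butterfly results, which is the "two half-sized DFTs" observation.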

DIF FFT
First computation step [figure]

DIF FFT
Result after recursive application of the DIF splitting formula [figure]
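Recursive application of the DIF splitting can be written as an iterative loop over stages. The sketch below is one possible radix-2 formulation, assuming P is a power of two and omega_P = exp(-2*pi*i/P); the name `fft_dif` is hypothetical.

```python
import cmath

def fft_dif(x):
    """Radix-2 DIF FFT sketch: normal-order input, bit-reversed-order output."""
    a = list(x)
    P = len(a)                     # assumed to be a power of two
    half = P // 2
    while half >= 1:
        # Primitive root for the current sub-transform size 2*half.
        w = cmath.exp(-2j * cmath.pi / (2 * half))
        for start in range(0, P, 2 * half):
            for j in range(half):
                u = a[start + j]
                v = a[start + j + half]
                a[start + j] = u + v                  # feeds the even outputs
                a[start + j + half] = (u - v) * w ** j  # feeds the odd outputs
        half //= 2
    return a
```

Element i of the result holds X at the bit-reversed index of i, so no reordering is done inside the transform.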

DIF FFT
[Figure: flow graph from normal order (input) to bit-reversed, "scrambled", order (output)]

DIF FFT: twiddle factors
Only half as many twiddle factors as data points are needed (half of the unit circle).
First stage: uses all P/2 rotations of ω_P.
2nd stage: uses every other twiddle factor.
3rd stage: uses every fourth.
First stage, twiddle exponent: if msb = 1, then the remaining bits define the exponent.
2nd stage: if msb = 1, then the remaining lower-order bits define the exponent of ω_{P/2}.
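The twiddle usage per stage can be tabulated with a few lines of Python. This is a sketch under the assumption of a radix-2 DIF FFT on P = 2**p points with stages numbered from 0; both function names are hypothetical. Exponents are expressed as powers of omega_P, so a power j of omega_{P/2} appears as 2*j.

```python
def dif_twiddle_exponents(P):
    """Exponents of omega_P used in each DIF stage, first stage first."""
    stages = []
    stride, half = 1, P // 2
    while half >= 1:
        stages.append([j * stride for j in range(half)])
        stride *= 2      # each later stage uses every other remaining twiddle
        half //= 2
    return stages

def dif_twiddle_exponent(i, stage, p):
    """Exponent of omega_P applied at source index i, or None if the stage bit is 0."""
    bit_pos = p - 1 - stage                  # stage 0 looks at the msb
    if (i >> bit_pos) & 1 == 0:
        return None
    low = i & ((1 << bit_pos) - 1)           # the remaining lower-order bits
    return low << stage

print(dif_twiddle_exponents(8))  # [[0, 1, 2, 3], [0, 2], [0]]
```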

Normal and bit-reversed orders
[Table with columns: Index | Binary code | Reversed binary code | Bit-reversed index]

DIT FFT
Now consider l and l + P/2, since [equation]
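The DIT equations are missing from the transcript; a standard reconstruction of the step that pairs outputs l and l + P/2 is given below. It assumes $\omega_P = e^{-2\pi i/P}$, even P, and l in $[0, P/2 - 1]$; the symbols E and O for the half-sized transforms are introduced here for illustration.

```latex
\begin{aligned}
E(l) &= \sum_{j=0}^{P/2-1} x(2j)\, \omega_{P/2}^{lj}, \qquad
O(l) = \sum_{j=0}^{P/2-1} x(2j+1)\, \omega_{P/2}^{lj} \\
X(l) &= E(l) + \omega_P^{l}\, O(l) \\
X(l + P/2) &= E(l) - \omega_P^{l}\, O(l)
\qquad \text{since } \omega_P^{P/2} = -1 \text{ and } E, O \text{ have period } P/2
\end{aligned}
```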

DIT FFT
The DIT butterfly: the last computation step [figure]

DIT FFT
Result after recursive application of the DIT splitting formula [figure]
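The recursive DIT splitting can likewise be unrolled into stages that start with the smallest transforms. A minimal radix-2 sketch follows, assuming P is a power of two, omega_P = exp(-2*pi*i/P), and input already in bit-reversed order; the name `fft_dit` is hypothetical.

```python
import cmath

def fft_dit(x_bitrev):
    """Radix-2 DIT FFT sketch: bit-reversed-order input, normal-order output."""
    a = list(x_bitrev)
    P = len(a)                     # assumed to be a power of two
    half = 1
    while half < P:
        # Primitive root for the current sub-transform size 2*half.
        w = cmath.exp(-2j * cmath.pi / (2 * half))
        for start in range(0, P, 2 * half):
            for j in range(half):
                u = a[start + j]
                v = a[start + j + half] * w ** j   # twiddle applied before the butterfly
                a[start + j] = u + v
                a[start + j + half] = u - v
        half *= 2
    return a
```

Here the twiddle multiplies the input of the butterfly, whereas in DIF it multiplies the output; the last stage is the one that uses all P/2 twiddles.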

DIT FFT
[Figure: flow graph from bit-reversed, "scrambled", order (input) to normal order (output)]

DIT FFT: twiddle factors
Only half as many twiddle factors as data points are needed (half of the unit circle).
Last stage: uses all P/2 rotations of ω_P.
2nd-to-last stage: uses every other twiddle factor.
3rd-to-last stage: uses every fourth.
Last stage, twiddle exponent: if msb = 1, then the remaining bits define the exponent.
2nd-to-last stage: if msb = 1 of the result index, then the remaining lower-order bits define the exponent of ω_{P/2}.

DIF vs DIT FFT
DIF
Normal input order, bit-reversed output.
All P/2 twiddles are used in the first stage, every 2nd twiddle in the second stage, every fourth in the 3rd stage, etc.
Twiddle exponent computed from the source index.
DIT
Bit-reversed input, normal output.
All P/2 twiddles are used in the last stage, every 2nd twiddle in the second-to-last stage, every fourth in the third-to-last stage, etc.
Twiddle index computed from the result index.
Both compute butterflies on successively lower-order bits of the source index!

DIF FFT
[Figure: DIF flow graphs, normal to bit-reversed order and bit-reversed to normal order]
DIF can be used for either normal or bit-reversed input order!
The output order is always the bit reverse of the input order!
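The mapping between normal and bit-reversed order is easy to tabulate. The helper below is a small sketch (the name `bit_reverse` is an assumption) that prints index, binary code, reversed binary code and bit-reversed index for an 8-point transform.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # move the current lsb of i to the low end of r
        i >>= 1
    return r

for i in range(8):
    r = bit_reverse(i, 3)
    print(i, format(i, "03b"), format(r, "03b"), r)
```

Applying `bit_reverse` twice returns the original index, which is why the output order of either algorithm is the bit reverse of its input order regardless of which order is fed in.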

DIT FFT
[Figure: DIT flow graphs, bit-reversed to normal order and normal to bit-reversed order]
DIT can be used for either normal or bit-reversed input order!
The output order is always the bit reverse of the input order!

FFT followed by inverse FFT: DIF then DIT
Use inverse twiddles for the inverse FFT.
No bit reversal necessary!
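A forward DIF transform followed by an inverse DIT transform can be sketched as below. The sketch assumes radix 2, P a power of two, forward twiddles exp(-2*pi*i/P) and a 1/P scaling on the inverse; the function names and the `sign` parameter are hypothetical.

```python
import cmath

def dif(x, sign):
    """Radix-2 DIF: normal-order input, bit-reversed-order output."""
    a, P = list(x), len(x)
    half = P // 2
    while half >= 1:
        w = cmath.exp(sign * 2j * cmath.pi / (2 * half))
        for s in range(0, P, 2 * half):
            for j in range(half):
                u, v = a[s + j], a[s + j + half]
                a[s + j], a[s + j + half] = u + v, (u - v) * w ** j
        half //= 2
    return a

def dit(x, sign):
    """Radix-2 DIT: bit-reversed-order input, normal-order output."""
    a, P = list(x), len(x)
    half = 1
    while half < P:
        w = cmath.exp(sign * 2j * cmath.pi / (2 * half))
        for s in range(0, P, 2 * half):
            for j in range(half):
                u, v = a[s + j], a[s + j + half] * w ** j
                a[s + j], a[s + j + half] = u + v, u - v
        half *= 2
    return a

def forward_then_inverse(x):
    """Forward DIF, then inverse DIT with inverse twiddles; no reordering step."""
    scrambled = dif(x, -1)          # spectrum in bit-reversed order
    y = dit(scrambled, +1)          # consumes bit-reversed order directly
    return [v / len(x) for v in y]
```

The scrambled spectrum is consumed as is by the inverse transform, so the round trip restores the input without an explicit bit-reversal pass.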

FFT followed by inverse FFT: DIF then DIF
Use inverse twiddles for the inverse FFT.
No bit reversal necessary!
But the twiddles: the allocation for forward and inverse is different!

FFT followed by inverse FFT: DIT then DIT
Use inverse twiddles for the inverse FFT.
No bit reversal necessary!
But the twiddles: the allocation for forward and inverse is different!

FFT followed by inverse FFT: DIT then DIF
Use inverse twiddles for the inverse FFT.
No bit reversal necessary!
Twiddle allocation for forward and inverse is the same!

Radix DIF FFT
[Equations: index substitution and the rewritten DFT sum]

FFT followed by inverse FFT
DIF followed by DIT, or DIT followed by DIF, have the same twiddle allocation for the forward and inverse FFT, which is important in parallel computation.
DIT followed by DIT, or DIF followed by DIF, have different twiddle allocations for the forward and inverse FFT. This is a problem in parallel computation (more twiddle storage than necessary).
We illustrated this for normal input order. The same is true for bit-reversed input order.
Since bit reversal is its own inverse, no explicit bit reversal is necessary to restore the input order for a forward followed by an inverse FFT.

Radix DIF FFT

Radix DIT FFT
[Equations: index substitution and the rewritten DFT sum]

Radix DIT FFT

Radix DIF FFT [figure]

Radix DIT FFT [figure]

FFT: arithmetic and memory operations
[Two tables, one for butterflies and one for the full FFT, each with one row per radix (three radices) and columns: Arithmetic Operations (Add/Sub, Mult, Total) and Storage References (Data, Twiddles, Total). The full-FFT entries are constant multiples of Pp; the numeric values are not preserved.]

Parallel FFT: data allocation
Example: power-of-two data set, power-of-two number of processors.
[Figure: consecutive data allocation and cyclic data allocation of the data to the processors]

Stages that require communication:
Input order     Consecutive       Cyclic
Normal          First n stages    Last n stages
Bit-reversed    Last n stages     First n stages
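The communication table can be reproduced with a short sketch. It assumes 2**p data points on 2**n processors, a radix-2 FFT whose stages work on address bits from msb to lsb for normal input order (and lsb to msb for bit-reversed input order), and the function names `owner` and `communicating_stages`, all of which are illustrative choices rather than the lecture's notation.

```python
def owner(index, p, n, allocation):
    """Processor holding element `index` of a 2**p-point data set on 2**n processors."""
    if allocation == "consecutive":
        return index >> (p - n)            # processor id = n high-order bits
    if allocation == "cyclic":
        return index & ((1 << n) - 1)      # processor id = n low-order bits
    raise ValueError(allocation)

def communicating_stages(p, n, allocation, input_order="normal"):
    """1-based stages whose butterfly partners live on different processors."""
    stages = []
    for s in range(p):
        # Address bit in which the two butterfly operands of stage s differ.
        bit = p - 1 - s if input_order == "normal" else s
        if owner(0, p, n, allocation) != owner(1 << bit, p, n, allocation):
            stages.append(s + 1)
    return stages

print(communicating_stages(5, 2, "consecutive"))  # [1, 2]
print(communicating_stages(5, 2, "cyclic"))       # [4, 5]
```

Comparing element 0 with its partner is enough here because ownership depends only on which address bits form the processor id.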

Parallel radix DIT FFT
[Figure: flow graph on four processors, with stages marked as communication or local]

Parallel radix FFT + inverse FFT: DIT then DIF
[Figure: flow graph on four processors]
The per-processor twiddle factor subset is the same in the forward and inverse FFT.

Parallel radix FFT + inverse FFT: DIF then DIT
Use inverse twiddles for the inverse FFT.
No bit reversal necessary!

Permutation-based parallel FFT: block allocation
[Table: initial allocation of data indices to eight processors]
The first radix stage cannot be performed locally!
Data for the first stage is half the processor address range apart.

Permutation-based parallel FFT: block allocation
[Table: allocation to eight processors, initially and after the 1st exchange]
The first radix stage can be performed concurrently without communication after the exchange!
No further stages can be performed locally.

Permutation-based parallel FFT: block allocation
[Table: allocation to eight processors, initially and after the 1st and 2nd exchanges]
The 2nd radix stage can be performed concurrently without communication after the exchange!
No further stages can be performed locally.

Permutation-based parallel FFT: block allocation
[Table: allocation to eight processors, initially and after the 1st, 2nd and 3rd exchanges]
The 3rd radix stage can be performed concurrently without communication after the exchange!
The last stage cannot be performed locally.

Permutation-based parallel FFT: block allocation
[Table: allocation to eight processors, initially and after the 1st, 2nd, 3rd and 4th exchanges]
The last (4th) radix stage can be performed concurrently without communication after the exchange!
Note that the last exchange stage is the same as the first!

Permutation-based parallel FFT: block allocation
[Table: allocation to eight processors, initially and after each of the four exchanges]
Four exchanges even though there are only three bits used for processor addresses!
All butterflies local!
Data permuted: unshuffle on the nonlocal index (left cyclic shift)!
(All indices shown are source indices; the output index is the bit reverse of the indices above.)

The unshuffle: how did it happen?
[Figure: index bit layout for the initial allocation and after each of the four exchange steps]

Permutation-based FFT
[Table: allocation to eight processors, initially and after each of the four exchanges]
Exchanges based on blocks defined by the local msb.
[Figure: address bit layouts after each exchange, written as (paddr | maddr)]

Permutation-based FFT
[Table: allocation to eight processors, initially and after each of the four exchanges]
Exchanges based on blocks defined by the local lsb.
[Figure: address bit layouts after each exchange, written as (paddr | maddr)]

Minimizing the number of permutations
[Figure: alternative sequences of (paddr | maddr) bit layouts]
Three permutations instead of four. This can be done in many ways.

Permutation-based FFT
[Table: allocation to sixteen processors, initially and after the 1st, 2nd and 3rd exchanges, with the exchange sequence]

Permutation-based parallel FFT: cyclic data allocation
[Table: allocation to eight processors, initially and after the 1st, 2nd and 3rd exchanges]
All butterflies local!
Three exchanges!
Data permuted: consecutive order after the exchange sequence.

Permutation-based parallel FFT: cyclic data allocation
[Table: allocation to eight processors, initially and after the 1st, 2nd and 3rd exchanges]
Consecutive order after the exchange sequence!
[Figure: exchange sequence as address bit layouts, written as (maddr | paddr)]

Permutation-based FFT
FFT computations are always carried out from msb to lsb in the data index.
The permutation needed to make all computations local depends on the data allocation (and on whether the FFT is self-sorting).
Exchanges affect memory access strides both in carrying out the permutations and in carrying out the butterfly computations.
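The exchange sequences for block and cyclic allocation can be traced with a small bit-layout model. The sketch assumes 16 data points on 8 processors (three processor address bits, one local bit), radix-2 stages from msb to lsb, and exchanges that swap one processor address bit with the single local bit; the function name and the layout encoding are hypothetical.

```python
def exchange_sequence(layout):
    """Trace the exchanges needed to make every radix-2 stage local.

    layout[k] is the original data-index bit stored at address position k.
    Positions 0..len-2 are processor address bits (msb first); the last
    position is the single local memory bit.
    Returns (processor bit positions exchanged, final layout).
    """
    layout = list(layout)
    local = len(layout) - 1
    exchanges = []
    for bit in range(len(layout) - 1, -1, -1):   # stages run from msb to lsb
        pos = layout.index(bit)
        if pos != local:                         # operand bit is on another processor
            layout[pos], layout[local] = layout[local], layout[pos]
            exchanges.append(pos)
        # the butterfly stage on `bit` is now local
    return exchanges, layout

# Block allocation: processor bits hold index bits 3,2,1; local bit holds bit 0.
print(exchange_sequence([3, 2, 1, 0]))   # ([0, 1, 2, 0], [1, 3, 2, 0])
# Cyclic allocation: processor bits hold index bits 2,1,0; local bit holds bit 3.
print(exchange_sequence([2, 1, 0, 3]))   # ([0, 1, 2], [3, 2, 1, 0])
```

In this model block allocation needs four exchanges, the last on the same processor bit as the first, and leaves the processor address bits cyclically shifted, while cyclic allocation needs three exchanges and ends in consecutive order.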

Parallel unordered FFT: communication requirements [figure]

Permutation-based parallel FFT: cyclic data allocation, DIF without permutations, twiddles
[Tables for eight processors: input index, and twiddle exponents for the 1st, 2nd and 3rd stages]

Permutation-based parallel FFT: cyclic data allocation, DIF without permutations
[Tables for eight processors: twiddle exponents for the remaining two stages]

Twiddle factor allocation [figure]

Permutation-based FFT: DIT permutation-based FFT twiddles
[Tables for sixteen processors: initial allocation, and twiddle exponents for the 1st, 2nd, 3rd and a further stage]

Permutation-based FFT: DIT permutation-based FFT twiddles
[Tables for sixteen processors: twiddle exponents for the remaining two stages]

Four-step FFT
Constructing the four-step FFT:
Factorize N into two equal (palindrome) factors, N = √N × √N.
1. Compute the first rank: √N FFTs of size √N.
2. Multiply with twiddle factors.
3. Transpose the √N × √N matrix.
4. Compute the last rank: √N FFTs of size √N.
[Figure: two ranks of FFT blocks]
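The four steps can be sketched in Python as follows. This is an illustration under stated assumptions: the data is viewed as a row-major n1-by-n2 matrix, a naive `dft` helper stands in for the small FFTs of each rank, the twiddles use exp(-2*pi*i/N), and the factors are allowed to differ (the slide uses equal factors). All function names are hypothetical.

```python
import cmath

def dft(x):
    """Naive DFT, used here only as a stand-in for the small FFTs."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * m * j / n) for j in range(n))
            for m in range(n)]

def column_dfts(mat):
    """Transform every column of a row-major matrix."""
    rows, cols = len(mat), len(mat[0])
    out = [[0j] * cols for _ in range(rows)]
    for c in range(cols):
        col = dft([mat[r][c] for r in range(rows)])
        for r in range(rows):
            out[r][c] = col[r]
    return out

def four_step_fft(x, n1, n2):
    """DFT of length n1*n2 via two ranks of small transforms."""
    n = n1 * n2
    a = [list(x[r * n2:(r + 1) * n2]) for r in range(n1)]   # a[j1][j2] = x[j1*n2 + j2]
    b = column_dfts(a)                          # step 1: n2 transforms of size n1
    for k1 in range(n1):                        # step 2: twiddle multiplication
        for j2 in range(n2):
            b[k1][j2] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)
    t = [[b[k1][j2] for k1 in range(n1)] for j2 in range(n2)]   # step 3: transpose
    d = column_dfts(t)                          # step 4: n1 transforms of size n2
    return [d[k2][k1] for k2 in range(n2) for k1 in range(n1)]  # X[k1 + n1*k2]
```

Because of the transpose, both ranks access the matrix column by column, and flattening the final matrix row by row yields the output in natural order.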

Square transpose
[Table: platform parameters for Xeon Clovertown and Opteron: memory bandwidth (GB/s) and latency (cycles); L1 cache size per core, line size, associativity and latency; L2 cache size (per dual core on Clovertown, per core on Opteron), line size, associativity and latency]

Parallel FFT on binary n-cube [figure]

Pipelined FFT on n-cube
The first four steps of a pipelined, in-place FFT on a cube.
[Table: memory location versus processor for each of four time steps. Each table entry is the network dimension used, starting with the dimension corresponding to the msb of the paddr.]

Pipelined bisection FFT on n-cube
The first four steps of a pipelined, in-place FFT on a cube.
[Table: memory location versus processor for each of four time steps]

Pipelined, 1D, in-place FFT on cube
[Figure: performance of a pipelined, one-dimensional, in-place, radix FFT on a cube as a function of data allocation (mesh shape of the cube)]

Pipelined, 2D, in-place FFT on cube
[Figure: performance of a pipelined, two-dimensional, in-place, radix FFT on a cube as a function of data allocation (mesh shape of the cube)]
