How to Write Fast Numerical Code

Size: px

Start display at page:

Download "How to Write Fast Numerical Code"

Egbert Carpenter
5 years ago
Views:

1 How to Write Fast Numerical Code Lecture: Discrete Fourier transform, fast Fourier transforms Instructor: Markus Püschel TA: Georg Ofenbeck & Daniele Spampinato Rest of Semester Today Lecture Project meetings Project presentations 1 minutes each random order random speaker 2

2 Linear Transforms Overview: Transforms and algorithms Discrete Fourier transform Fast Fourier transforms After that: Optimized implementation and autotuning (FFTW) Automatic program synthesis (Spiral) 3 FFT References Complexity: Bürgisser, Clausen, Shokrollahi, Algebraic Complexity Theory, Springer, 1997 History: Heideman, Johnson, Burrus: Gauss and the History of the Fast Fourier Transform, Arch. Hist. Sc. 34(3) 1985 FFTs: Cooley and Tukey, An algorithm for the machine calculation of complex Fourier series, Math. of Computation, vol. 19, pp , 1965 Nussbaumer, Fast Fourier Transform and Convolution Algorithms, 2nd ed., Springer, 1982 van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM, 1992 Tolimieri, An, Lu, Algorithms for Discrete Fourier Transforms and Convolution, Springer, 2nd edition, 1997 Franchetti, Püschel, Voronenko, Chellappa and Moura, Discrete Fourier Transform on Multicore, IEEE Signal Processing Magazine, special issue on ``Signal Processing on Platforms with Multiple Cores'', Vol. 26, No. 6, pp. 9-12, 29 4

Linear Transforms Very important class of functions: signal processing, scientific computing, Mathematically: Change of basis = Multiplication by a fixed matrix T 1 y y B 1.

3 Linear Transforms Very important class of functions: signal processing, scientific computing, Mathematically: Change of basis = Multiplication by a fixed matrix T 1 y y B 1. A = y = T x y n 1 Output T T = [t k;`] k;`<n x x x = B x n 1 Input 1 C A Equivalent definition: Summation form n 1 X y k = t k;`x`; `= k < n 5 Smallest Relevant Example: DFT, Size 2 Transform (matrix): T = " # Computation: y = " # 1 1 x 1 1 or y = x + x 1 y 1 = x x 1 As graph (direct acyclic graph or DAG): y x y 1 x 1 called a butterfly 211_2_1_archive.html 6

4 Transforms: Examples A few dozen transforms are relevant Some examples DFT n = [e 2k`¼i=n ] k;`<n 8 < cos 2¼k` RDFT n = [r k`] k;`<n ; r k` = n ; k bn 2 c : sin 2¼k` n ; k > bn 2 c DHT = h cos(2k`¼=n) + sin(2k`¼=n) i k;`<n " # WHTn=2 WHT WHT n = n=2 ; WHT WHT n=2 WHT 2 = DFT 2 n=2 IMDCT n = h cos((2k + 1)(2` n)¼=4n) i k<2n; `<n h i DCT-2 n = cos(k(2` + 1)¼=2n) k;`<n DCT-3 n = DCT-2 T n (transpose) universal tool MPEG JPEG DCT-4 n = h cos((2k + 1)(2` + 1)¼=4n) i k;`<n 7 Blackboard Discrete Fourier transform (DFT) Transform algorithms Fast Fourier transform, size 4 8

5 Linear Transforms: DFT 1 y y B 1. A = y = T x y n 1 Output T x x x = B x n 1 Input 1 C A Example: T = DFT n = [e 2k`¼i=n ] k;`<n = [! n k` ] k;`<n ;! n = e 2¼i=n 9 Algorithms: Example FFT, n = 4 Fast Fourier transform (FFT) i 1 i i 1 i 5 x = x i Representation using matrix algebra Data flow graph stride 2 stride 1 1

6 Cooley-Tukey FFT (Recursive, General-Radix) Blackboard Kronecker products Stride permutations 11 Example FFT, n = 16 (Recursive, Radix 4) stride 4 stride 1 12

7 Recursive Cooley-Tukey FFT radix DFT km = (DFT k I m )Tm km (I k DFT m )L km k DFT km = L km m (I k DFT m )T km m (DFT k I m ) decimation-in-time decimation-in-frequency For powers of two n = 2 t sufficient together with base case " # 1 1 DFT 2 = 1 1 Cost: (complex adds, complex mults) = (n log 2 (n), n log 2 (n)/2) independent of recursion (real adds, real mults) (2n log 2 (n), 3n log 2 (n)) = 5n log 2 (n) flops depends on recursion: best is at least radix-8 13 Recursive vs. Iterative FFT Recursive, radix-k Cooley-Tukey FFT DFT km = (DFT k I m )Tm km (I k DFT m )L km k DFT km = L km m (I k DFT m )T km m (DFT k I m ) Iterative, radix 2, decimation-in-time/decimation-in-frequency 1 ty DFT 2 t (I 2 j 1 DFT 2 I 2 t j) (I 2 j 1 T 2t j+1 2 t j ) A R 2 t j=1 1 ty DFT 2 t = R 2 (I 2 t j T 2 2j j 1 ) (I 2 t j DFT 2 I 2 j 1) A j=1 14

8 Radix 2, recursive Radix 2, iterative Recursive vs. Iterative Iterative FFT computes in stages of butterflies = log 2 (n) passes through the data Recursive FFT reduces passes through data = better locality Same computation graph but different topological sorting Rough analogy: MMM Triple loop Blocked DFT Iterative FFT Recursive FFT 16

9 The FFT Is Very Malleable 17 Iterative FFT, Radix 2 18

10 Pease FFT, Radix 2 19 Stockham FFT, Radix 2 2

11 Six-Step FFT 21 Multi-Core FFT 22

12 Transform Algorithms DFT n! P k=2;2m > ³ ³ ³ DFT2m I k=2 1 i C 2m rdft 2m (i=k) RDFT k I m ; k even; 11 RDFT n RDFT RDFT 2m n! (P DHT > n k=2;m I RDFT 2) rdft 2m (i=k) RDFT 1 k B 2m DHT DHT I k=2 1 rdft i D 2m 2m (i=k) RDFT CC B k rdht n DHT 2m (i=k) DHT I k m C A ; 2m rdht 2m (i=k) DHT Ã!! k rdft 2n (u) rdht 2n (u)! L2n rdft m I k 2m ((i + u)=k) rdft i Ã 2k (u) rdht 2m ((i + u)=k) rdht 2k (u) Im ; DCT-4 n! S ndct-2 n diag k<n (1=(2 cos((2k + 1)¼=4n))) ÃÃ" #! Ã" #!! 1 1 IMDCT 2m! (J m I m I m J m) I 1 m I 1 m J 2m DCT-4 2m ty WHT 2 k! (I 2 k 1 + +k i 1 WHT 2 k i I k 2 i+1 + +k t ); k = k k t i=1 DFT 2! F 2 k even; RDFT-3 n! (Q > k=2;m I 2) (I k i rdft 2m )(i + 1=2)=k)) (RDFT-3 k I m) ; k even; DCT-2 n! P k=2;2m > ³ DCT-22m K2 2m ³ I k=2 1 N 2m RDFT-3 > 2m Bn(L n=2 k=2 I 2)(I m RDFT k )Q m=2;k ; DCT-3 n! DCT-2 > n ; DCT-4 n! Q > ³ k=2;2m Ik=2 N 2m RDFT-3 > 2m B n (L n=2 k=2 I 2)(I m RDFT-3 k )Q m=2;k : DFT n! (DFT k I m) T n m (I k DFT m) L n k ; n = km Cooley-Tukey FFT DFT n! P n(dft k DFT m)q n; n = km; gcd(k; m) = 1 Prime-factor FFT DFT p! Rp T (I 1 DFT p 1 )D p(i 1 DFT p 1 )R p; p prime Rader FFT DCT-3 n! (I m J m) L n m (DCT-3m(1=4) DCT-3m(3=4)) " # Im J m 1 (F 2 I m) p1 (I I ; n = 2m m) DCT-2 2! diag(1; 1= p 2) F 2 DCT-4 2! J 2 R 13¼=8 23 Complexity of the DFT Measure: L c, 2 c Complex adds count 1 Complex mult by a constant a with a < c counts 1 L 2 is strictest, L the loosest (and most natural) Upper bounds: n = 2 k : L 2 (DFT n ) 3/2 n log 2 (n) (using Cooley-Tukey FFT) General n: L 2 (DFT n ) 8 n log 2 (n) (needs Bluestein FFT) Lower bound: Theorem by Morgenstern: If c <, then L c (DFT n ) ½ n log c (n) Implies: in the measure L c, the DFT is Θ(n log(n)) 24

History of FFTs The advent of digital signal processing is often attributed to the FFT (Cooley-Tukey 1965) History: Around 185: FFT discovered by Gauss [1] (Fourier publishes the concept of Fourier

13 History of FFTs The advent of digital signal processing is often attributed to the FFT (Cooley-Tukey 1965) History: Around 185: FFT discovered by Gauss [1] (Fourier publishes the concept of Fourier analysis in 187!) 1965: Rediscovered by Cooley-Tukey 25 [1]: Heideman, Johnson, Burrus: Gauss and the History of the Fast Fourier Transform Arch. Hist. Sc. 34(3) 1985 Carl-Friedrich Gauss Contender for the greatest mathematician of all times Some contributions: Modular arithmetic, least square analysis, normal distribution, fundamental theorem of algebra, Gauss elimination, Gauss quadrature, Gauss-Seidel, non-euclidean geometry, 26

How to Write Fast Numerical Code Spring 2012 Lecture 19. Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato

How to Write Fast Numerical Code Spring 2012 Lecture 19 Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato Miscellaneous Roofline tool Project report etc. online Midterm (solutions online)