EE216B: VLSI Signal Processing. FFT Processors. Prof. Dejan Marković FFT: Background


4/30/0
EE216B: VLSI Signal Processing
FFT Processors
Prof. Dejan Marković

FFT: Background

A bit of history
- Algorithm first described by Gauss
- Algorithm rediscovered (not for the first time) by Cooley and Tukey

Applications
- FFT is a key building block in wireless communication receivers
- Also used for frequency analysis of EEG signals
- And many other applications

Implementations
- Custom design with a fixed number of points
- Flexible FFT kernels for many configurations


Fourier Transform: Concept

- A complex function can be approximated with a weighted sum of basis functions
- Fourier used sinusoids with varying frequencies as the basis functions

$$X(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt$$

$$x(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} X(\omega)\, e^{j\omega t}\, d\omega$$

- This representation provides the frequency content of the original function
- Fourier transform assumes that all spectral components are present at all times

The Fast Fourier Transform (FFT)

- Efficient method for calculating the discrete Fourier transform (DFT)
- N = length of transform, must be composite: $N = N_1 \cdot N_2 \cdots N_m$

Transform length    DFT ops          FFT ops      DFT ops / FFT ops
64                  4,096            384          11
256                 65,536           2,048        32
1,024               1,048,576        10,240       102
65,536              4,294,967,296    1,048,576    4,096
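The operation counts in the table follow from the two cost models, N^2 for the direct DFT and N log2(N) for the FFT. A minimal sketch that regenerates them, assuming power-of-two lengths; `op_counts` is a hypothetical helper name used only for illustration:

```python
def op_counts(n):
    """Return (DFT ops, FFT ops, ratio) for a power-of-two length n."""
    log2_n = n.bit_length() - 1      # exact log2 for powers of two
    dft_ops = n * n                  # direct-sum evaluation
    fft_ops = n * log2_n             # N log2(N) model
    return dft_ops, fft_ops, dft_ops / fft_ops

for n in (64, 256, 1024, 65536):
    dft_ops, fft_ops, ratio = op_counts(n)
    print(f"{n:>6} {dft_ops:>14,} {fft_ops:>10,} {ratio:>8.1f}")
```

The ratio grows as N / log2(N), which is why the savings become dramatic for long transforms.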

The DFT Algorithm

- Converts time-domain samples into frequency-domain samples

$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-j\frac{2\pi}{N}nk}, \qquad k = 0, \ldots, N-1$$

  ($X_k$ is a complex number)

- Implementation options
  - Direct-sum evaluation: $O(N^2)$ operations
  - FFT algorithm: $O(N \log N)$ operations
- Inverse DFT: frequency-to-time conversion
  - DFT with opposite sign in the exponent
  - A $1/N$ factor

$$x_k = \frac{1}{N} \sum_{n=0}^{N-1} X_n\, e^{+j\frac{2\pi}{N}nk}, \qquad k = 0, \ldots, N-1$$

The Divide-and-Conquer Approach

- Map the original problem into several sub-problems in such a way that the following inequality holds:
  Cost(sub-problems) + Cost(mapping) < Cost(original problem)
- DFT, with $W_N = e^{-j 2\pi / N}$:

$$X_k = \sum_{n=0}^{N-1} x_n W_N^{nk}, \qquad k = 0, \ldots, N-1$$

$$X(z) = \sum_{n=0}^{N-1} x_n z^{-n}$$

- $X_k$ = evaluation of $X(z)$ at $z = W_N^{-k}$
- $\{x_n\}$ and $\{X_k\}$ are periodic sequences
- So, how shall we divide the problem?
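The direct-sum evaluation and its inverse can be sketched in a few lines. This is a minimal O(N^2) reference, not an efficient implementation; the function name `dft` and its `inverse` flag are illustrative choices rather than anything defined in the lecture:

```python
import cmath

def dft(x, inverse=False):
    """Direct-sum DFT (or inverse DFT) of a sequence x, O(N^2) operations."""
    n = len(x)
    sign = 1 if inverse else -1          # inverse uses the opposite exponent sign
    out = []
    for k in range(n):
        acc = sum(x[i] * cmath.exp(sign * 2j * cmath.pi * i * k / n)
                  for i in range(n))
        out.append(acc / n if inverse else acc)   # inverse carries the 1/N factor
    return out

samples = [1, 2, 3, 4]
spectrum = dft(samples)
recovered = dft(spectrum, inverse=True)
```

Applying the forward and inverse transforms in sequence returns the original samples up to rounding error.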

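The Divide-and-Conquer Approach (Cont.)

Procedure:
- Consider sub-sets of the initial sequence
- Take the DFT of these sub-sequences
- Reconstruct the DFT from the intermediate results

#1: Define $I_t$, $t = 0, \ldots, r-1$, a partition of $\{0, \ldots, N-1\}$ that defines $r$ different subsets of the input sequence

$$X(z) = \sum_{i=0}^{N-1} x_i z^{-i} = \sum_{t=0}^{r-1} \sum_{i \in I_t} x_i z^{-i}$$

#2: Normalize the powers of $z$ w.r.t. $x_{0t}$ in each subset $I_t$

$$X(z) = \sum_{t=0}^{r-1} z^{-i_{0t}} \sum_{i \in I_t} x_i z^{-i + i_{0t}}$$

- Replace $z$ by $W_N^{-k}$ in the inner sum

Cooley-Tukey Mapping

- Consider decimated versions of the initial sequence, $N = N_1 \cdot N_2$

$$I_{n_1} = \{ n_2 N_1 + n_1 \}, \qquad n_1 = 0, \ldots, N_1 - 1; \quad n_2 = 0, \ldots, N_2 - 1$$

- Equivalent description:

$$X_k = \sum_{n_1=0}^{N_1-1} W_N^{n_1 k} \sum_{n_2=0}^{N_2-1} x_{n_2 N_1 + n_1} W_N^{n_2 N_1 k} \qquad (8.1)$$

$$W_N^{i N_1} = e^{-j 2\pi \frac{i N_1}{N}} = e^{-j 2\pi \frac{i}{N_2}} = W_{N_2}^{i} \qquad (8.2)$$

- Substitute (8.2) into (8.1):

$$X_k = \sum_{n_1=0}^{N_1-1} W_N^{n_1 k} \sum_{n_2=0}^{N_2-1} x_{n_2 N_1 + n_1} W_{N_2}^{n_2 k}$$

  The inner sum is a DFT of length $N_2$.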

Cooley-Tukey Mapping (Cont.)

- Define: $Y_{n_1,k}$ = $k$-th output of the $n_1$-th length-$N_2$ DFT

$$X_k = \sum_{n_1=0}^{N_1-1} Y_{n_1,k} W_N^{n_1 k}$$

$$k = k_1 N_2 + k_2, \qquad k_1 = 0, \ldots, N_1 - 1; \quad k_2 = 0, \ldots, N_2 - 1$$

- $Y_{n_1,k}$ can be taken modulo $N_2$ in $k$

$$W_N^{n_1 k} = W_N^{n_1 (k_1 N_2 + k_2)} = W_N^{n_1 k_1 N_2} W_N^{n_1 k_2} = W_{N_1}^{n_1 k_1} W_N^{n_1 k_2}$$

- All the $X_k$ for $k$ being congruent modulo $N_2$ are from the same group of $N_1$ outputs of $Y_{n_1,k}$
- $Y_{n_1,k} = Y_{n_1,k_2}$ since $k$ can be taken modulo $N_2$
- Equivalent description:

$$X_{k_1 N_2 + k_2} = \sum_{n_1=0}^{N_1-1} Y_{n_1,k_2} W_N^{n_1 (k_1 N_2 + k_2)} = \sum_{n_1=0}^{N_1-1} \left( Y_{n_1,k_2} W_N^{n_1 k_2} \right) W_{N_1}^{n_1 k_1}$$

- From $N_2$ DFTs of length $N_1$ applied to $Y'_{n_1,k_2} = Y_{n_1,k_2} W_N^{n_1 k_2}$

Cooley-Tukey Mapping, Example: N = 15, N_1 = 3, N_2 = 5

[Figure, after [1]: the 1-D input x_0 ... x_14 is mapped to 2-D. $N_1 = 3$ DFTs of length $N_2 = 5$ (DFT-5 blocks), then a twiddle-factor multiply, then $N_2 = 5$ DFTs of length $N_1 = 3$ (DFT-3 blocks).]

[1] P. Duhamel and M. Vetterli, "Fast Fourier Transforms - A Tutorial Review and a State-of-the-art," Elsevier Signal Processing, vol. 19, no. 4, Apr. 1990.
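The three steps of the mapping (short DFTs on decimated subsequences, twiddle-factor multiply, short DFTs across the results) can be checked numerically against a direct DFT. The sketch below is a minimal illustration under the index conventions above; `dft` and `cooley_tukey` are hypothetical helper names, and the short DFTs are evaluated by direct summation for clarity:

```python
import cmath

def dft(x):
    """Direct-sum DFT used for the short transforms and as a reference."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]

def cooley_tukey(x, n1, n2):
    """One level of the Cooley-Tukey mapping for len(x) == n1 * n2."""
    n = n1 * n2
    # Step 1: N1 DFTs of length N2 on the subsequences x[n2*N1 + n1]
    y = [dft([x[j * n1 + i] for j in range(n2)]) for i in range(n1)]
    # Step 2: twiddle-factor multiply, Y'[n1][k2] = Y[n1][k2] * W_N^(n1*k2)
    yp = [[y[i][k2] * cmath.exp(-2j * cmath.pi * i * k2 / n)
           for k2 in range(n2)] for i in range(n1)]
    # Step 3: N2 DFTs of length N1, giving X[k1*N2 + k2]
    out = [0j] * n
    for k2 in range(n2):
        col = dft([yp[i][k2] for i in range(n1)])
        for k1 in range(n1):
            out[k1 * n2 + k2] = col[k1]
    return out

x = [complex(i, (2 * i) % 5) for i in range(15)]
ref = dft(x)
err_3x5 = max(abs(a - b) for a, b in zip(cooley_tukey(x, 3, 5), ref))
err_5x3 = max(abs(a - b) for a, b in zip(cooley_tukey(x, 5, 3), ref))
```

Both factor orders reproduce the length-15 DFT; only the intermediate data layout differs.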

N_1 = 3, N_2 = 5 versus N_1 = 5, N_2 = 3

Original 1-D sequence:
x_0 x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 x_11 x_12 x_13 x_14

1-D to 2-D mapping, N_1 = 3, N_2 = 5:
x_0  x_3  x_6  x_9   x_12
x_1  x_4  x_7  x_10  x_13
x_2  x_5  x_8  x_11  x_14

1-D to 2-D mapping, N_1 = 5, N_2 = 3:
x_0  x_5  x_10
x_1  x_6  x_11
x_2  x_7  x_12
x_3  x_8  x_13
x_4  x_9  x_14

- The 1D-2D mapping can't be obtained by simple matrix transposition

Radix-2 Decimation in Time (DIT)

- $N_1 = 2$, $N_2 = N/2$ divides the input sequence into the sequences of even- and odd-numbered samples ("decimation in time" (DIT))

$$X_{k_2} = \sum_{n_2=0}^{N/2-1} x_{2 n_2} W_{N/2}^{n_2 k_2} + W_N^{k_2} \sum_{n_2=0}^{N/2-1} x_{2 n_2 + 1} W_{N/2}^{n_2 k_2}$$

$$X_{N/2 + k_2} = \sum_{n_2=0}^{N/2-1} x_{2 n_2} W_{N/2}^{n_2 k_2} - W_N^{k_2} \sum_{n_2=0}^{N/2-1} x_{2 n_2 + 1} W_{N/2}^{n_2 k_2}$$

- $X_{k_2}$ and $X_{k_2 + N/2}$ are obtained by 2-pt DFTs on the outputs of length-$N/2$ DFTs of the even- and odd-numbered sequences, one of which is weighted by twiddle factors
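Applying the radix-2 DIT split recursively gives the familiar FFT. The following is a minimal recursive sketch for power-of-two lengths, written for readability rather than speed; the function name `fft_radix2_dit` is an illustrative choice:

```python
import cmath

def fft_radix2_dit(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2_dit(x[0::2])       # length-N/2 DFT of even-numbered samples
    odd = fft_radix2_dit(x[1::2])        # length-N/2 DFT of odd-numbered samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle-weighted term
        out[k] = even[k] + t             # X_k
        out[k + n // 2] = even[k] - t    # X_{k + N/2}
    return out

impulse_spectrum = fft_radix2_dit([1, 0, 0, 0, 0, 0, 0, 0])
```

An impulse at n = 0 transforms to a flat spectrum of ones, which is a quick sanity check for any FFT routine.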

Decimation in Time and Decimation in Frequency

[Figure: an 8-point FFT drawn two ways. Decimation in time (DIT): the even samples {x_2i} and odd samples {x_2i+1} each pass through a DFT-4, followed by the twiddle factors W_8^k and a column of DFT-2 blocks. Decimation in frequency (DIF): a column of DFT-2 blocks first, then the twiddle factors, then two DFT-4 blocks.]

- DIT: $N_1 = 2$ DFTs of length $N_2 = 4$, twiddle-factor multiply, $N_2 = 4$ DFTs of length $N_1 = 2$
- DIF: $N_1 = 4$ DFTs of length $N_2 = 2$, twiddle-factor multiply, $N_2 = 2$ DFTs of length $N_1 = 4$
- Reverse the role of $N_1$, $N_2$ (duality between DIT and DIF)

Signal-Flow Graph (SFG) Notation

- In generalizing this formulation, it is most convenient to adopt a graphic approach
- Signal-flow-graph notation describes the three basic DSP operations:
  - Addition: x[n], y[n] -> x[n] + y[n]
  - Multiplication by a constant: x[n] -> a · x[n]
  - Delay: x[n] -> z^{-k} -> x[n - k]

Radix-2 Butterfly

Decimation in Time (DIT):
  X = A + B · W
  Y = A - B · W

Decimation in Frequency (DIF):
  X = A + B
  Y = (A - B) · W

- DIT: it does not make sense to compute B · W twice, so define Z = B · W
  Z = B · W
  X = A + Z
  Y = A - Z
  Cost: 1 complex mult, 2 complex adds
- DIF:
  X = A + B
  Y = (A - B) · W
  Cost: 1 complex mult, 2 complex adds
- Abbreviation: complex mult = c-mult

Radix-4 DIT Butterfly

Inputs A, B, C, D with twiddle factors W^b, W^c, W^d; outputs V, W, X, Y:
  V = A + B · W^b + C · W^c + D · W^d
  W = A - j B · W^b - C · W^c + j D · W^d
  X = A - B · W^b + C · W^c - D · W^d
  Y = A + j B · W^b - C · W^c - j D · W^d

With B' = B · W^b, C' = C · W^c, D' = D · W^d (3 c-mults):
  V = A + B' + C' + D'
  W = A - j B' - C' + j D'
  X = A - B' + C' - D'
  Y = A + j B' - C' - j D'
  Cost: 3 c-mults, 12 c-adds

- Multiply by j: swap Re/Im, possibly a negation
- 1 radix-4 BF is equivalent to 4 radix-2 BFs
  - c-mults: 3 (radix-4) versus 4 (radix-2)
  - c-adds: 12 (radix-4) versus 8 (radix-2)
- Radix-4 reduces to 8 c-adds with intermediate values: A + C', A - C', B' + D', j B' - j D'
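The 8-add form of the radix-4 butterfly can be written out directly from the four intermediate values. This is a small sketch, assuming the inputs `b`, `c`, `d` have already been multiplied by their twiddle factors (B', C', D' above); the helper names are hypothetical:

```python
def mul_j(z):
    """Multiply by j using only a swap and a negation: j(re + j im) = -im + j re."""
    return complex(-z.imag, z.real)

def radix4_butterfly(a, b, c, d):
    """Radix-4 DIT butterfly core on pre-twiddled inputs, using 8 complex adds."""
    s0 = a + c               # A + C'
    s1 = a - c               # A - C'
    s2 = b + d               # B' + D'
    s3 = mul_j(b - d)        # jB' - jD'
    v = s0 + s2              # A + B' + C' + D'
    w = s1 - s3              # A - jB' - C' + jD'
    x = s0 - s2              # A - B' + C' - D'
    y = s1 + s3              # A + jB' - C' - jD'
    return v, w, x, y

outputs = radix4_butterfly(1 + 2j, 3 - 1j, -2 + 0j, 0 + 4j)
```

The four outputs are exactly a 4-point DFT of the pre-twiddled inputs, so the only general multiplications in the butterfly are the three twiddle multiplies.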

Comparison of Radix-2 and Radix-4

- Radix-4 has about the same number of adds and 75% the number of multiplies compared to radix-2
- Higher radices
  - Possible, but rarely used (complicated control)
  - Additional frequency gains diminish for r > 4
- Computational complexity:
  - Number of mults = reasonable 1st estimate of algorithmic complexity
  - $M = \log_r(N)$, $N_{mults} = O(M \cdot N)$
  - Add $N_{adds}$ for a more accurate estimate
- 1 complex mult = 4 real mults + 2 real adds

FFT Complexity

[Table: N, FFT radix, # Re mults, # Re adds. The numeric entries are not recoverable from this copy.]

- For $M = \log_r(N)$, $N_{mult} = O(M \cdot N)$
- # Re mults: decreases monotonically with radix increase
- # Re adds: decreases, reaches a minimum, then increases
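The 75% figure for radix-4 can be reproduced with a first-order count. The sketch below assumes a simple model that is not taken from the slides: N/r butterflies per stage, r - 1 twiddle multiplies per butterfly, M = log_r(N) stages, and no credit for trivial twiddle factors, so it is an upper bound. The function names are illustrative:

```python
def num_stages(n, r):
    """Return M = log_r(n); n must be an exact power of r."""
    m = 0
    while n > 1:
        if n % r:
            raise ValueError("n must be a power of r")
        n //= r
        m += 1
    return m

def complex_mults(n, r):
    """Twiddle multiplies: (n/r butterflies) * (r-1 per butterfly) * M stages."""
    return (n // r) * (r - 1) * num_stages(n, r)

def real_ops_in_mults(n, r):
    """Real mults and real adds inside the multipliers (4 and 2 per c-mult)."""
    c = complex_mults(n, r)
    return 4 * c, 2 * c

ratio = complex_mults(1024, 4) / complex_mults(1024, 2)
```

For N = 1024 this gives 5,120 complex multiplies for radix-2 and 3,840 for radix-4, a ratio of 0.75. The model is only meant for radix 2 and 4; higher radices also need constant multiplies inside the butterfly, which it ignores.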

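Further Simplifications

[Figure: 2-point butterfly with inputs x_0, x_1, outputs X_0, X_1, and weight W on the lower branch.]

$$W = e^{-j \frac{2\pi nk}{N}}, \qquad n = 1,\ k = 1 \text{ in this example}$$

Example: N = 8

[Figure: the eight twiddle factors W_8^0 through W_8^7 on the unit circle in the Re/Im plane.]

- Considerably simpler: sign inversion, swap Re/Im (or both)

The Complete 8-Point Decimation-in-Time FFT

[Figure: signal-flow graph with inputs in bit-reversed order x_0, x_4, x_2, x_6, x_1, x_5, x_3, x_7 and outputs X_0 through X_7 in natural order.]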
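Twiddle factors that land on the real or imaginary axis need no multiplier at all. A minimal sketch of that idea, assuming the convention W_N^k = exp(-j 2 pi k / N); `is_trivial` and `mul_trivial` are hypothetical helper names:

```python
def is_trivial(k, n):
    """True when W_n^k is one of 1, -j, -1, j (k is a multiple of n/4)."""
    return (4 * k) % n == 0

def mul_trivial(z, q):
    """Multiply z by (-j)**q using only Re/Im swaps and sign inversions."""
    re, im = z.real, z.imag
    q %= 4
    if q == 0:
        return complex(re, im)      # times  1
    if q == 1:
        return complex(im, -re)     # times -j: swap, negate new imaginary part
    if q == 2:
        return complex(-re, -im)    # times -1: sign inversion
    return complex(-im, re)         # times  j: swap, negate new real part

trivial_indices = [k for k in range(8) if is_trivial(k, 8)]
```

For N = 8 the indices 0, 2, 4 and 6 are trivial, so only the odd powers of W_8 require real multiplications.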

FFT Algorithm

- $N \log_2(N)$ complexity

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j \frac{2\pi n}{N} k}$$

  $N^2$ complex multiplications and additions

$$X(k) = \sum_{m=0}^{N/2-1} x(2m)\, e^{-j \frac{2\pi (2m)}{N} k} + \sum_{m=0}^{N/2-1} x(2m+1)\, e^{-j \frac{2\pi (2m+1)}{N} k}$$

  $N^2/4$ complex multiplications and additions for each of the two half-length sums
- Splitting again: $N^2/16$ complex multiplications and additions for each quarter-length sum

Higher Radix FFT Algorithms

- Radix 2: $N \log_2(N)$ complexity
- Radix 4: $N \log_4(N)$ complexity

[Table: FFT size N versus real multiplications and real additions, each listed for radix 2, 4, 8 and split-radix (srfft). The numeric entries are not recoverable from this copy.]

- Radix-2 is the most symmetric structure and best suited for folding


512-pt FFT: Synthesis-Time Bottleneck

- 4360 real multiplications, 3566 real additions
- Direct-mapped architecture took about 8 hours to synthesize, retiming was unfinished after .5 days
- It is difficult to explore the design space if synthesis is done for every implementation

Exploiting Hierarchy

- Each stage can be treated as a combination of butterfly-2 structures, with varying twiddle factors

[Figure: 8-point FFT signal-flow graph in three stages (Stage 1, Stage 2, Stage 3), inputs x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7), outputs X(0) through X(7), with W_8 twiddle factors on the butterfly branches.]

Modeling the Radix-2 Butterfly: Area

[Figure: radix-2 butterfly datapath with inputs Re In / Im In, twiddle inputs Re tw / Im tw, and outputs Re Out / Im Out. Each multiplication is built with carry-save adders (CSA) operating on shifted versions of the input, followed by a carry-propagate adder.]

- Multiplication with CSA: sums shifted versions of the input
- Area_total = Area_base + Area_mult

Twiddle-Factor Area Estimation

- Non-zero bits range from 0-8 for the sine and cosine term

[Figures: number of non-zero bits in the cosine term and in the sine term versus twiddle-factor index, and histograms of the number of twiddle factors versus the number of non-zero bits in each term.]
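Since the multiplier is a sum of shifted inputs, its area tracks the number of non-zero bits in the quantized sine and cosine terms. The sketch below counts those bits for a given twiddle factor. It assumes plain binary magnitude coding and an 8-bit fractional wordlength; both are illustrative assumptions, and the lecture's exact number format is not stated in this copy:

```python
import math

def nonzero_bits(value, frac_bits=8):
    """Count non-zero bits of |value| quantized to frac_bits fractional bits."""
    q = round(abs(value) * (1 << frac_bits))
    return bin(q).count("1")

def twiddle_nonzero_bits(k, n, frac_bits=8):
    """Return (cosine bits, sine bits) for the twiddle factor W_n^k."""
    angle = 2 * math.pi * k / n
    return (nonzero_bits(math.cos(angle), frac_bits),
            nonzero_bits(math.sin(angle), frac_bits))

profile = [twiddle_nonzero_bits(k, 8) for k in range(4)]
```

A per-butterfly area estimate could then be formed as a base area plus a term that grows with these bit counts, in the spirit of Area_total = Area_base + Area_mult.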

Area Estimates from Synthesis

[Figure: four panels, one for each of cosine nz bits = 1, 2, 3, 4. Each panel plots estimated area and synthesized area versus the number of non-zero bits in the sine term.]

- Average error in area estimate: ~5%

FFT Timing Estimation

[Figure: 8-point FFT signal-flow graph in three stages, as in the hierarchy slide.]

- Critical path in each stage is equal to the delay of the most complex butterfly
- For 3 stages, delay varied between .6 ns to ns per stage (90-nm CMOS)
- Addition of pipeline registers between stages reduces the delay

FFT Energy Estimation

- Challenge: difficult to estimate hierarchically. Carrying out switch-level simulations in MATLAB or gate-level simulation in RC is time consuming.
- Solution: derive analytical expressions for energy in each butterfly as a function of input switching activity and non-zero bits in the twiddle factor.

[Figure: 8-point FFT signal-flow graph in three stages. Each input x(i) carries a known switching probability p(i); the switching probabilities at the outputs X(0) through X(7) are marked as unknown.]

FFT Architectures

[Figure: execution time (~1/throughput) versus area. Memory-based (time-multiplexed) architectures sit at low area and long execution time, SDF and MDC pipelines in between, and direct-mapped architectures at high area and short execution time; parallelism moves a design toward the direct-mapped end.]

Single-Path Delay Feedback (SDF) Architecture

[Figure: 8-point SDF pipeline. Inputs x[0] through x[7] enter in natural order; outputs leave in bit-reversed order X[0], X[4], X[2], X[6], X[1], X[5], X[3], X[7]. Stage 1 is a radix-2 butterfly with a Z^{-4} feedback delay, Stage 2 a radix-2 butterfly with Z^{-2}, and Stage 3 a radix-2 butterfly with Z^{-1}.]

FFT Optimized Implementation

FFT Optimization

- For a given performance, minimize energy and/or area

Techniques

- Radix factorization
- Parallelism
- DL realization

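FFT Factorization: N = N_1 · N_2

[Figure: N_1-pt FFT, followed by a twiddle-factor (TF) stage built with a full multiplier, followed by an N_2-pt FFT.]

With $n = n_1 N_2 + n_2$ and $k = k_1 + N_1 k_2$:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j \frac{2\pi nk}{N}} = \sum_{n_2=0}^{N_2-1} \left[ \sum_{n_1=0}^{N_1-1} x(n_1 N_2 + n_2)\, e^{-j \frac{2\pi n_1 k_1}{N_1}} \right] e^{-j \frac{2\pi n_2 k_1}{N_1 N_2}}\, e^{-j \frac{2\pi n_2 k_2}{N_2}}$$

- The bracketed inner sum is an $N_1$-point DFT
- The middle exponential is the twiddle factor (TF)
- The outer sum with the last exponential is an $N_2$-point DFT

Radix Factorization

[Figure: constant multipliers required as the radix grows. Radix 4 needs only -j; higher radices add the constants C_1 through C_6.]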

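Flexible Processing Element (PE)

[Figure: PE-4 blocks around a $W_N^{nk}$ twiddle-factor multiplier.]

[Table: radix (2, 4, 8, 16) versus implementation as a cascade of PEs. PE3 appears from radix 8 and PE4 at radix 16; the multipliers used inside the PEs are -j and the constants C_1 through C_6.]

Only 3 Constant Factors to Build PUs

- Multiplication by -j: swap the real and imaginary parts of x, N_add = 0
- Multiplication by the remaining constants is realized with shifts and adds on Re{x} and Im{x}

[Table: factor, CSD realization, adders. The entries are not recoverable from this copy.]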
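A canonical-signed-digit (CSD) realization replaces a constant multiplier with a few shifts and adds or subtracts. The sketch below converts a non-negative integer constant to CSD and estimates the adder count as the number of non-zero digits minus one. The function names are hypothetical, and the constants used in the lecture's PUs are not reproduced here:

```python
def csd_digits(q):
    """Return the CSD digits of a non-negative integer q, least significant first."""
    digits = []
    while q:
        if q & 1:
            d = 2 - (q & 3)      # +1 if q % 4 == 1, -1 if q % 4 == 3
            q -= d
        else:
            d = 0
        digits.append(d)
        q >>= 1
    return digits

def csd_adders(q):
    """Adders/subtractors needed to form q*x from shifted copies of x."""
    nonzero = sum(1 for d in csd_digits(q) if d)
    return max(nonzero - 1, 0)

example = csd_digits(7)          # 7 = 8 - 1
```

For example, multiplying by 7 costs one subtractor in CSD (8x - x) instead of two adders in plain binary (4x + 2x + x).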

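Hardware Cost of PUs

[Table: radix (2, 4, 8, 16), implementation as a cascade of PUs, butterfly adders, constant multipliers (equivalent adders). The numeric entries are not recoverable from this copy.]

Impact of TW Wordlength

[Table: FFT size versus twiddle-factor (TW) bits for each radix. The numeric entries are not recoverable from this copy.]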

Radix Structure of 8-pt FFT

[Table and figure: candidate architectures listed by the number of R2, R4, R8 and R16 stages they use, plotted as normalized energy versus normalized area.]

Parallelism

[Figure: (a) reference: an M-pt FFT followed by an L-pt FFT. (b) parallel: a serial-to-parallel (SP) converter feeding P parallel M-pt FFTs and P parallel L-pt FFTs, followed by a parallel-to-serial (PS) converter. (c) parallel with P = L.]

Example: 16-point FFT

[Figure: (a) reference architecture, M = 2, L = 8: a serial pipeline of PUs with feedback delays Z^{-8}, Z^{-4}, Z^{-2}, Z^{-1}. (b) L-path (L = 8) architecture: an SP converter, an M = 2 stage with a Z^{-1} delay, an 8-way parallel 8-point FFT, and a PS converter.]

[Table: hardware cost of (a) and (b) in delay buffers, full multipliers, and adders. The numeric entries are not recoverable from this copy.]

Radix-Optimized 1024-pt FFT

- Final architecture: N_1 = 128, N_2 = 8 (out of 68 arch.)

[Figure: 128-pt (4/4/8) pipeline FFT built from a chain of PUs with feedback delays Z^{-64}, Z^{-32}, Z^{-16}, Z^{-8}, Z^{-4}, Z^{-2}, Z^{-1}, followed by an 8-pt parallel FFT.]

Optimization Techniques

[Figures: two plots of normalized energy versus normalized area showing the effect of the optimization techniques.]

[Figure: normalized energy versus normalized area, with the Min EAP design point marked.]

Mix-Radix vs. Sub-V_T Circuits

[Figure: Energy/FFT [nJ] versus Area [mm^2] for 1024-point FFTs, comparing the mixed-radix design obtained through logic synthesis with sub-V_T circuits.]

C.-H. Yang, et al., JSSC 3/2012.
M. Seok, et al., ISSCC.

Flexible FFT: LTE Example

3GPP-LTE Specifications

[Table: parameter versus configuration, listing FFT size and BW (MHz).]


Flexible FFT (128-2048 pts) for 3GPP-LTE

[Figure: chip architecture. A 16-256-pt SDF FFT (N_1 = 16 to 256) built from PUs with multiplexed feedback delays feeds a parallel FFT with N_2 = 8 or 6. The SDF section is configurable as a 16/32/64/128/256-pt FFT.]

FFT Architecture

BW (MHz)    FFT size    Optimal factorization
20          2048        (16x16)x8
15          1536        (16x16)x6
10          1024        (8x16)x8
5           512         (8x8)x8
2.5         256         (4x8)x8
1.25        128         (16)x8

Flexible FFT Chip Layout

[Figure: chip layout showing the memory banks (MEM), register file bank, pre-processor, hard-output and soft-output sphere decoder, level shifters, the PUs and TF multipliers of the SDF FFT, and the parallel FFT.]

C.-H. Yang, et al., VLSI

Energy of Flexible FFTs

[Figure: Energy/FFT (normalized to baseline) versus FFT size, with the baseline energy in nJ/FFT, in 65 nm CMOS.]

Y. Chen, et al., TCAS-II, 2/2008.
A. Wang and A. Chandrakasan, JSSC, 1/2005.
Baseline: C.-H. Yang and D. Marković, VLSI

FFT Summary

- FFT: a technique for frequency analysis of stationary signals
- Key building block in radio systems and many other apps
- FFT is an economical implementation of DFT that leverages shorter sub-sequences to implement DFT on N discrete samples
- Key building block of an FFT is the butterfly operator, which can be realized with various radices (2 and 4 being the most common)
- Radix factorization is essential for low energy and area
- FFT does not work well with non-stationary signals

References: Algorithmic Techniques

- J.W. Cooley and J.W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comp., vol. 19, pp. 297-301, 1965.
- P. Duhamel and H. Hollmann, "Split-radix FFT algorithm," Electron. Lett., vol. 20, pp. 14-16, 1984.
- I.J. Good, "The Interaction Algorithm and Practical Fourier Analysis: An Addendum," J. Royal Statistical Society, no. 2, 1960.
- S. Winograd, "On computing the discrete Fourier transform," Math. Comp., vol. 32, 1978.
- P. Duhamel, "Algorithms Meeting the Lower Bounds on the Multiplicative Complexity of Length-2^n DFTs and Their Connection with Practical Algorithms," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 38, no. 9, 1990.
- P. Duhamel and M. Vetterli, "Fast Fourier Transforms - A Tutorial Review and a State-of-the-art," Elsevier Signal Processing, vol. 19, no. 4, Apr. 1990.

References: Traditional Architectures

- S. Magar, et al., "An application specific DSP chip set for 100 MHz data rates," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Apr. 1988, vol. 4.
- S. He and M. Torkelson, "Design and implementation of a 1024-point pipeline FFT processor," in Proc. IEEE Custom Integrated Circuits Conf., May 1998.
- B.M. Baas, "A low-power, high-performance, 1024-point FFT processor," IEEE J. Solid-State Circuits, vol. 34, no. 3, Mar. 1999.
- J. O'Brien, J. Mather, and B. Holland, "A 200 MIPS single-chip 1k FFT processor," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1989.

References: Circuit Techniques

- A. Wang and A. Chandrakasan, "A 180-mV Subthreshold FFT Processor Using a Minimum Energy Design Methodology," IEEE J. Solid-State Circuits, vol. 40, no. 1, Jan. 2005.
- M. Seok, D. Jeon, C. Chakrabarti, D. Blaauw, and D. Sylvester, "A 0.27V 30MHz 17.7nJ/transform 1024-pt Complex FFT Core with Super-pipelining," IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2011.
- Y. Chen, Y.-W. Lin, Y.-C. Tsao, and C.-Y. Lee, "A 2.4-Gsample/s DVFS FFT Processor for MIMO OFDM Communication Systems," IEEE J. Solid-State Circuits, vol. 43, no. 5, May 2008.
- K.-S. Chong, B.-H. Gwee, and J.S. Chang, "Energy-Efficient Synchronous-Logic and Asynchronous-Logic FFT/IFFT Processors," IEEE J. Solid-State Circuits, vol. 42, no. 9, Sep. 2007.

References: Mix-radix Architectures (1/2)

- G. Zhong, F. Xu, and A.N. Willson, "A Power-Scalable Reconfigurable FFT/IFFT IC Based on Multi-Processor Ring," IEEE J. Solid-State Circuits, vol. 41, no. 2, Feb. 2006.
- Y.-W. Lin, H.-Y. Liu, and C.-Y. Lee, "A 1-GS/s FFT/IFFT Processor for UWB Applications," IEEE J. Solid-State Circuits, vol. 40, no. 8, Aug. 2005.
- Y. Chen, Y.-C. Tsao, and C.-Y. Lee, "An indexed-scaling pipelined FFT processor for OFDM-based WPAN applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 2, Feb. 2008.
- Y.-T. Lin, P.-Y. Tsai, and T.-D. Chiueh, "Low-power variable-length fast Fourier transform processor," IEE Proc.-Comput. Digit. Tech., vol. 152, no. 4, Jul. 2005.
- W.-C. Yeh and C.-W. Jen, "High-Speed and Low-Power Split-Radix FFT," IEEE Trans. Signal Processing, vol. 51, no. 3, 2003.

References: Mix-radix Architectures (2/2)

- C.-H. Yang, T.-H. Yu, and D. Marković, "Power and Area Minimization of Reconfigurable FFT Processors: A 3GPP-LTE Example," IEEE J. Solid-State Circuits, vol. 47, no. 3, Mar. 2012.
- T.J. Ding, J.V. McCanny, and Y. Hu, "Rapid Design of Application Specific FFT Cores," IEEE Trans. Signal Processing, vol. 47, no. 5, 2003.
- T.-D. Chiueh and P.-Y. Tsai, OFDM Baseband Receiver Design for Wireless Communications, Wiley, 2007.
- A. Wenzler and E. Luder, "New structures for complex multipliers and their noise analysis," in Proc. IEEE Int. Symp. Circuits Syst., ISCAS '95, May 1995.
