Parallelism in Computer Arithmetic: A Historical Perspective


1 Parallelism in Computer Arithmetic: A Historical Perspective
[Timeline graphic: 2010s, 2000s, 1990s, 1980s, 1970s, 1960s, 1950s]
Behrooz Parhami, University of California, Santa Barbara
Aug. 2018 Parallelism in Computer Arithmetic Slide 1

2 About This Presentation
This slide show was first developed for an invited talk at a special session on computer arithmetic in honor of Drs. Graham Jullien and William Miller, held on Monday 8/6 at the 61st Midwest Symposium on Circuits and Systems, Windsor, Ontario, Canada, August 5-8, 2018. All rights reserved for the author. © 2018 Behrooz Parhami
Edition: First. Released: Aug. 2018.

3 Parallelism in Computer Arithmetic: A Historical Perspective
Many early parallel processing breakthroughs emerged from the quest for faster and higher-throughput arithmetic operations. Additionally, the influence of arithmetic techniques on parallel computer performance can be seen in diverse areas such as the bit-serial arithmetic units of early massively parallel SIMD computers, pipelining and pipeline chaining in vector machines, design of floating-point standards to ensure the accuracy and portability of numerically intensive programs, and prominence of GPUs in today's top-of-the-line supercomputers. This paper contains a few representative samples of the many interactions and cross-fertilizations between the computer-arithmetic and parallel-computation communities by presenting historical perspectives, case studies of the state of the art and practice, and directions for further collaboration.

4 My Personal Journey and Career
[Timeline figure: years at UCSB, years since graduation, my children; marker: "We are here"]

5 I. Introduction: What Is Parallelism?
The two extreme views:
- Any circuit that manipulates multiple bits at once is parallel
- Must have concurrency at the level of large functional blocks
My view: Parallel processing is possible at the three levels of circuits, function units, and compute nodes
I will provide an example at each of the three levels:
- Circuit level: Parallel-prefix adders
- Function level: Recursive/divide-and-conquer multiplication
- System level: Discrete Fourier transform, DFT/FFT
The three levels of parallelism are not mutually exclusive and can be readily combined

6 II. Circuit-Level Parallelism
Adders and multipliers are our two main workhorses
- In this section, I cover parallel-prefix adders
- Recursive multiplication is covered in Section III, although it has circuit-level embodiments as well
Parallel-prefix computation
- Given the inputs $x_0, x_1, x_2, x_3, \ldots, x_{k-1}$
- And an associative binary operator $\otimes$
- Compute all the prefixes of the expression $x_0 \otimes x_1 \otimes x_2 \otimes x_3 \otimes \cdots \otimes x_{k-1}$
Example: Indexing via prefix sums
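The prefix computation defined on this slide can be sketched in a few lines of Python. This is an illustration only: the function names `prefix_scan` and `parallel_prefix_scan` are hypothetical, and the log-depth version merely simulates, round by round, what parallel hardware would do in one step per round.

```python
from operator import add

def prefix_scan(xs, op=add):
    """Sequential reference: out[i] = xs[0] op xs[1] op ... op xs[i]."""
    out = []
    for x in xs:
        out.append(x if not out else op(out[-1], x))
    return out

def parallel_prefix_scan(xs, op=add):
    """Log-depth scan. Within one round, all combinations are independent
    of one another, so a parallel machine could perform them at once."""
    vals = list(xs)
    d = 1
    while d < len(vals):
        # Element i (for i >= d) is combined with the element d places to
        # its left; the left operand stays on the left, so op only needs
        # to be associative, not commutative.
        vals = [vals[i] if i < d else op(vals[i - d], vals[i])
                for i in range(len(vals))]
        d *= 2
    return vals

# Indexing via prefix sums: flags mark the items to keep, and the exclusive
# prefix sum of the flags gives each kept item its slot in the output.
flags = [1, 0, 1, 1, 0, 1]
inclusive = parallel_prefix_scan(flags)
positions = [s - f for s, f in zip(inclusive, flags)]
```

Here the items flagged at indices 0, 2, 3, and 5 receive output slots 0, 1, 2, and 3, which is the kind of indexing the slide's example refers to.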

7 Share-Nothing vs. Share-Everything Carry Networks
Challenge: Find circuit sharing schemes that come close to A in speed and to B in cost
Carry status for bit position $i$, by $(g_i, p_i)$: (0, 0) annihilated or killed; (0, 1) propagated; (1, 0) generated; (1, 1) impossible
A: Full lookahead. Each carry, and thus each sum bit, is computed independently and in parallel
B: Ripple-carry. Each carry circuit shares the entire circuit of the previous carry
[Figure: adder built from the signals $g_i$ and $p_i$, a carry network producing the carries $c_i$, and the sum bits $s_i$; panel A: full lookahead, panel B: ripple-carry]

8 The Carry Operator and Block-Propagate/Generate
Block B, with signals $(g, p)$, is the merger of blocks B' and B", with signals $(g', p')$ and $(g'', p'')$:
$g = g'' + g' p''$
$p = p' p''$
Parallel-prefix carries: Denote $(g_i, p_i)$ by $x_i$. The prefixes of $x_0, x_1, x_2, x_3, \ldots, x_{k-1}$ under the carry operator yield the carries; for example, $(g_{[0,2]}, p_{[0,2]})$ gives $c_3$
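As a small sketch of how the carry operator turns addition into a prefix problem, the Python below computes the $(g_i, p_i)$ pairs, forms their prefixes, and reads the carries off the $g$ components. The names `carry_op` and `add_via_prefix` are hypothetical, the input carry is absorbed into position 0 as an assumption of this sketch, and the prefixes are formed sequentially for clarity; any parallel-prefix network could be substituted because the operator is associative.

```python
def carry_op(low, high):
    """Merge the less significant block (g', p') with the more significant
    block (g'', p''): g = g'' + g'p'', p = p'p''."""
    g1, p1 = low
    g2, p2 = high
    return (g2 | (g1 & p2), p1 & p2)

def add_via_prefix(x, y, k, c_in=0):
    """Add two k-bit numbers (k >= 1); returns (k-bit sum, carry-out)."""
    xb = [(x >> i) & 1 for i in range(k)]
    yb = [(y >> i) & 1 for i in range(k)]
    g = [a & b for a, b in zip(xb, yb)]    # generate signals
    p = [a ^ b for a, b in zip(xb, yb)]    # propagate signals
    sig = list(zip(g, p))
    sig[0] = (g[0] | (p[0] & c_in), p[0])  # absorb c_in into position 0
    prefixes = []
    for s in sig:
        prefixes.append(s if not prefixes else carry_op(prefixes[-1], s))
    # The g component of prefix [0, i] is the carry into position i + 1.
    carries = [c_in] + [gp[0] for gp in prefixes]
    total = sum((p[i] ^ carries[i]) << i for i in range(k))
    return total, carries[k]
```

For instance, `add_via_prefix(11, 6, 4)` yields the 4-bit sum 1 with a carry-out of 1, matching 11 + 6 = 17.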

9 The Brent-Kung Carry Network
[Figure: 8-input Brent-Kung carry network, where [i, j] stands for the pair $(g_{[i,j]}, p_{[i,j]})$. Inputs: [7, 7], [6, 6], [5, 5], [4, 4], [3, 3], [2, 2], [1, 1], [0, 0]. Intermediate signals: [6, 7], [4, 5], [2, 3], [0, 1], then [4, 7], [0, 3]. Outputs: [0, 7], [0, 6], [0, 5], [0, 4], [0, 3], [0, 2], [0, 1], [0, 0]]

10 Design Alternatives and Tradeoffs
[Figure: three 16-input parallel-prefix networks with inputs $x_0$ to $x_{15}$ and outputs $s_0$ to $s_{15}$]
- Brent-Kung: 6 levels, 26 cells
- Kogge-Stone: 4 levels, 49 cells
- Hybrid (Brent-Kung, Kogge-Stone, Brent-Kung): 5 levels, 32 cells
Nearly an infinite number of hybrid designs are possible
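The level and cell counts quoted for the two pure designs can be checked with a short Python sketch. The network constructions below are my own encoding (the helper names `kogge_stone`, `brent_kung`, and `evaluate` are hypothetical), and the hybrid design is not modeled. Depth is measured as the longest chain of dependent cells, which is why Brent-Kung comes out at 6 levels: its last up-sweep cell and first down-sweep cell do not depend on each other and can share a level.

```python
def kogge_stone(k):
    """Cells as (i, j): position i is combined with position j < i."""
    levels, d = [], 1
    while d < k:
        levels.append([(i, i - d) for i in range(d, k)])
        d *= 2
    return levels

def brent_kung(k):
    """Up-sweep then down-sweep; k is assumed to be a power of 2."""
    levels, d = [], 1
    while d < k:                 # up-sweep: build power-of-2 blocks
        levels.append([(i, i - d) for i in range(2 * d - 1, k, 2 * d)])
        d *= 2
    d //= 4
    while d >= 1:                # down-sweep: fill in remaining prefixes
        levels.append([(i, i - d) for i in range(3 * d - 1, k, 2 * d)])
        d //= 2
    return levels

def evaluate(levels, k):
    """Return (cell count, logic depth) and check all prefixes are formed."""
    span = [(i, i) for i in range(k)]   # position i holds x_lo..x_hi combined
    depth = [0] * k
    cells = 0
    for level in levels:
        new_span, new_depth = span[:], depth[:]
        for i, j in level:
            # A cell may only merge two adjacent index ranges.
            assert span[j][1] + 1 == span[i][0]
            new_span[i] = (span[j][0], span[i][1])
            new_depth[i] = max(depth[i], depth[j]) + 1
            cells += 1
        span, depth = new_span, new_depth
    assert all(span[i] == (0, i) for i in range(k))
    return cells, max(depth)

bk_cells, bk_levels = evaluate(brent_kung(16), 16)
ks_cells, ks_levels = evaluate(kogge_stone(16), 16)
```

Under these assumptions the sketch reproduces the slide's figures for 16 inputs: 26 cells in 6 levels for Brent-Kung and 49 cells in 4 levels for Kogge-Stone.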

11 A Taxonomy of Parallel-Prefix Adders
[Figure: three-axis design space, from Harris, David, 2003]
- Logic levels = $\log_2 k + l$
- Fanout = $2^f + 1$
- Wire tracks = $2^t$

12 III. Function-Level Parallelism
Multiplication is now just as essential as addition
- In this section, I cover divide-and-conquer multiplication
- Many other multiplication schemes exist; several of them have parallel-processing connections
Recursive multiplication
$xy = (2^{k/2} x_H + x_L)(2^{k/2} y_H + y_L) = 2^k x_H y_H + 2^{k/2}(x_H y_L + x_L y_H) + x_L y_L = 2^k p_4 + 2^{k/2}(p_3 + p_2) + p_1$
Complexity analysis:
$T(k) = 4T(k/2) + \Theta(\log k) = \Theta(k^2)$
$A(k) = A(k/2) + \Theta(k) = \Theta(k)$
[Figure: partial products $p_1$, $p_2$, $p_3$, $p_4$ aligned for summation into the product $xy$]

13 Analysis of Recursive Multiplication
Recursive multiplication
$xy = (2^{k/2} x_H + x_L)(2^{k/2} y_H + y_L) = 2^k x_H y_H + 2^{k/2}(x_H y_L + x_L y_H) + x_L y_L = 2^k p_4 + 2^{k/2}(p_3 + p_2) + p_1$
Complexity analysis (serial):
$T(k) = 4T(k/2) + \Theta(\log k) = \Theta(k^2)$
$A(k) = A(k/2) + \Theta(k) = \Theta(k)$
Complexity analysis (parallel):
$T(k) = T(k/2) + \Theta(\log k) = \Theta(\log^2 k)$
$A(k) = 4A(k/2) + \Theta(k) = \Theta(k^2)$
Theoretical lower bounds:
$AT = \Omega(k^{3/2})$
$AT^2 = \Omega(k^2)$
[Figure: partial products $p_1$, $p_2$, $p_3$, $p_4$ aligned for summation into the product $xy$]

14 The Trick Proposed by Karatsuba and Ofman
Recursive multiplication: $xy = 2^k p_4 + 2^{k/2}(p_3 + p_2) + p_1$
Compute the auxiliary term $p_5 = (x_H - x_L)(y_H - y_L) = p_4 + p_1 - p_3 - p_2$, so that $p_3 + p_2 = p_4 + p_1 - p_5$
Complexity analysis: $A(k) = 3A(k/2) + \Theta(k) = \Theta(k^{1.585})$, $1.585 = \log_2 3$
The benefit is significant for extremely wide operands:
$(4/3)^5 = 4.2$
$(4/3)^{10} = 17.8$
$(4/3)^{20} = 315.3$
$(4/3)^{50} = 1{,}765{,}781$
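The three-product recursion can be sketched in Python as follows. This is a minimal illustration rather than the presenter's implementation: the function name `karatsuba` and the `threshold_bits` cutoff are assumptions, operands are taken to be nonnegative integers, and the split point is half the width of the wider operand, so the shifts are by `half` and `2 * half` bits where the slide writes $2^{k/2}$ and $2^k$.

```python
def karatsuba(x, y, threshold_bits=32):
    """Multiply nonnegative integers using three half-width products."""
    if x < (1 << threshold_bits) or y < (1 << threshold_bits):
        return x * y                          # base case: native multiply
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_h, x_l = x >> half, x & mask            # x = 2^half * x_h + x_l
    y_h, y_l = y >> half, y & mask            # y = 2^half * y_h + y_l
    p4 = karatsuba(x_h, y_h, threshold_bits)  # high * high
    p1 = karatsuba(x_l, y_l, threshold_bits)  # low * low
    # Auxiliary term p5 = (x_h - x_l)(y_h - y_l). The differences may be
    # negative, so the sign is tracked and the recursion sees magnitudes.
    dx, dy = x_h - x_l, y_h - y_l
    sign = -1 if (dx < 0) != (dy < 0) else 1
    p5 = sign * karatsuba(abs(dx), abs(dy), threshold_bits)
    middle = p4 + p1 - p5                     # equals p3 + p2
    return (p4 << (2 * half)) + (middle << half) + p1
```

A call such as `karatsuba(3**200, 7**150)` returns the same value as Python's built-in `3**200 * 7**150`; a smaller `threshold_bits` forces more levels of recursion, which is useful for exercising the identity.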

15 Improvements to Karatsuba-Ofman Algorithm
- Original / Naive: $\Theta(k^2)$
- Karatsuba-Ofman: $\Theta(k^{1.585})$
- Toom / Cook: $\Theta(k^{1.465})$, $\Theta(k^{1.404})$, ..., $\Theta(k^{1+\varepsilon})$
- Schönhage-Strassen: $\Theta(k \log k \log\log k)$
- Fürer: Still faster
Is $\Theta(k \log k)$ feasible?

16 Similar Trick Used for Matrix Multiplication
Strassen's trick: Eight matrix multiplications reduced to 7
- Original / Naive: $\Theta(n^3)$
- Strassen: $\Theta(n^{2.807})$, $2.807 = \log_2 7$
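The seven-product scheme can be sketched in pure Python for square matrices whose size is a power of 2. This is an illustration under stated assumptions, not a tuned implementation: the helper names are hypothetical, matrices are lists of lists, the recursion runs all the way down to 1 x 1 blocks, and the products M1 to M7 follow the commonly published form of Strassen's formulas.

```python
def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def split(M):
    """Return the four quadrants M11, M12, M21, M22."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def naive_mul(A, B):
    """Reference method using n^3 scalar multiplications."""
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def strassen(A, B):
    """Multiply n x n matrices, n a power of 2, with 7 recursive products."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22))
    M2 = strassen(mat_add(A21, A22), B11)
    M3 = strassen(A11, mat_sub(B12, B22))
    M4 = strassen(A22, mat_sub(B21, B11))
    M5 = strassen(mat_add(A11, A12), B22)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12))
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22))
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom
```

Practical versions usually switch to the naive method below some block size, since the extra matrix additions dominate for small blocks.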

17 Strassen Matrix Multiplication in Practice
Practical implementations in C++ (your results may vary)
[Plot: running time in seconds versus matrix size n, comparing Strassen's method with the naive O(n^3) method]
Advantages of Strassen's algorithm show up for n ~ 3
Strassen's method does not show as much improvement as Karatsuba-Ofman's because:
- Its branching reduction factor is 7/8 instead of 3/4
- Matrix addition is relatively more complex than integer addition

18 IV. System-Level Parallelism
Multiple independent or interacting arithmetic streams:
- Early examples included using one or more co-processors
- Modern embodiments entail the use of GPUs and the like
Streamlined arithmetic blocks: No extra features
Discrete Fourier Transform (DFT / FFT)
- Inputs: $x_0, x_1, \ldots, x_{n-1}$
- Outputs: $y_0, y_1, \ldots, y_{n-1}$
$y_i = \sum_{j=0}^{n-1} \omega_n^{ij} x_j$, where $\omega_n$ is a primitive $n$th root of unity
- Naive method: $\Theta(n^2)$
- FFT: $\Theta(n \log n)$
[Figure: the DFT maps $x_0, \ldots, x_{n-1}$ to $y_0, \ldots, y_{n-1}$, and the inverse DFT maps them back]
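To make the gap between the naive method and the FFT concrete, here is a short Python sketch of both. The function names are hypothetical, the input length is assumed to be a power of 2 for the recursive version, and the choice $\omega_n = e^{-2\pi i / n}$ is one common convention for the primitive root; the slide does not say which root it uses.

```python
import cmath

def dft_naive(x):
    """Direct evaluation of y_i = sum_j w^(i*j) x_j: n^2 multiplications."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)   # assumed primitive nth root of unity
    return [sum(w ** (i * j) * x[j] for j in range(n)) for i in range(n)]

def fft_recursive(x):
    """Radix-2 divide and conquer; len(x) is assumed to be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_recursive(x[0::2])       # DFT of even-indexed inputs
    odd = fft_recursive(x[1::2])        # DFT of odd-indexed inputs
    out = [0] * n
    for i in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * i / n) * odd[i]
        out[i] = even[i] + t
        out[i + n // 2] = even[i] - t
    return out
```

The two half-size transforms are independent of each other, which is where the system-level parallelism comes from: they can be assigned to separate processors before the final combining step.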

19 FFT Can Be Performed in Many Different Ways
Quote from The Principles of Computer Hardware:
"At least one good reason for studying multiplication and division is that there is an infinite number of ways of performing these operations and hence there is an infinite number of PhDs (or expenses-paid visits to conferences in the USA) to be won from inventing new forms of multiplier." ~ Alan Clements, 1985
The statement above is even more true for DFT / FFT!
- Google search for FFT yields 28M+ hits
- The 1965 paper by Cooley and Tukey has 14K+ citations
- Many books on FFT have 100s to 1000s of citations
- New ways of performing FFT are still being discovered

20 Computation Scheme for 16-Point FFT
[Figure: 16-point FFT flow graph with a bit-reversal permutation and columns of butterfly operations]
Butterfly operation: inputs $a$ and $b$ produce outputs $a + b\,\omega^j$ and $a - b\,\omega^j$
n log n butterfly processors, each performing one operation
Pipelining improves hardware utilization
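The flow-graph scheme of this slide, a bit-reversal permutation plus columns of butterflies, can be sketched as an iterative Python routine. The names `reverse_bits` and `fft_iterative` are hypothetical, the root of unity $e^{-2\pi i / n}$ is an assumed convention, and the input length is assumed to be a power of 2. The routine also counts butterflies: this particular formulation performs $(n/2)\log_2 n$ of them, which is of the same $n \log n$ order as the processor count on the slide.

```python
import cmath

def reverse_bits(i, bits):
    """Reverse the low `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft_iterative(x):
    """Return (transform, butterfly count); len(x) must be a power of 2."""
    n = len(x)
    bits = n.bit_length() - 1
    a = [x[reverse_bits(i, bits)] for i in range(n)]  # bit-reversal permutation
    butterflies = 0
    size = 2
    while size <= n:                       # one column of butterflies per pass
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1
            for j in range(size // 2):
                u = a[start + j]
                t = w * a[start + j + size // 2]
                a[start + j] = u + t                  # a + b * w^j
                a[start + j + size // 2] = u - t      # a - b * w^j
                w *= w_step
                butterflies += 1
        size *= 2
    return a, butterflies
```

Within one column every butterfly touches a distinct pair of elements, so all butterflies of a column could run concurrently on separate processors; only the columns themselves must be executed in order.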

21 Butterfly and Shuffle-Exchange Networks
[Figure: three equivalent 8-input networks with inputs $x_0$ to $x_7$, intermediate values $u_0$ to $u_3$ and $v_0$ to $v_3$, and outputs $y_0$ to $y_7$]
- Rearrangement of nodes makes inter-column connections identical
- Shuffle and shuffle-exchange link pairs replaced by separate shuffle and exchange links

22 Projections to Reduce Hardware Complexity
- Full network: n log n cost, log n time
- Horizontal projection: n cost, log n time. Reduces hardware complexity by a factor log n, without increasing the asymptotic time complexity
- Vertical projection: log n cost, n time. Reduces hardware complexity by a factor n, while increasing the asymptotic time complexity by n / log n


23 Timing of Computations in Low-Cost FFT Circuit
[Figure: butterfly processors and the feedback connections between them]
Scheduling of computations to perform an n-point FFT in O(n) time with O(log n) processors

24 V. Conclusion: Where Are We, Where Next?
I reviewed only 3 examples, but there are more
- Parallel-prefix adders
- Recursive/divide-and-conquer multipliers
- Discrete Fourier Transform, DFT / FFT
- Key role of GPUs in building exascale computers
- High-precision, error-free, and wide-range arithmetic
Path to further connections / interactions
- Study cross-citation patterns between the two fields
- Redundancy for data preservation and fault tolerance
- New/emerging technologies: QCA, SET, nanomagnets, ...
- Program portability via standardization
- Speculative execution of multiple program paths

25 Questions or Comments?
