Computer Arithmetic Design

Erik Snow
6 years ago

1 Computer Arithmetic Design Instructor: Kuan Jen Lin Web: Dept. of EE, FJU, Taiwan Room: SF 727B Computer Arithmetic 1, Dept. of EE, Fu Jen Catholic University, Taiwan 1

2 SW & HW
SW = Algorithm + Data Structure + Programming techniques
HW = Algorithm + Architecture + Design Method
Topics shown on the slide diagram: computing, communication, pipeline, systolic array, low power, interface, full custom, cell based, FPGA, system level

3 Course Objectives
Learn computer algorithms for arithmetic operations.
Learn hardware designs for computer arithmetic.
After completing the course:
  Students are able to implement computer arithmetic hardware designs using an HDL.
  Students are able to read research papers about computer arithmetic.

4 Textbook
Textbook: Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press
Reference books:
  Ercegovac and Lang, Digital Arithmetic, MKP
  Stine, Digital Computer Arithmetic Datapath Design Using Verilog HDL, CAP

5 Syllabus
Number representation
Two-operand addition
Multi-operand addition
Multiplication
Division
Square root
Paper reading and presentation

6 Grading
Mid exam (30%)
Paper reading and presentation (30%)
Homework (some problems need HDL programming) (30%)
Attendance and others (10%)

7 Number Representation
Instructor: Kuan Jen Lin, Dept. of EE, FJU, Taiwan. Room: SF 727B
Most slides are revisions of PowerPoint files obtained from the textbook website.

8 Numbers and Arithmetic
Chapter Goals
  Define scope and provide motivation
  Set the framework for the rest of the book
  Review positional fixed-point numbers
Chapter Highlights
  What goes on inside your calculator?
  Ways of encoding numbers in k bits
  Radices and digit sets: conventional, exotic
  Conversion from one system to another

9 What is Computer Arithmetic?
Pentium Division Bug (1994-95): the Pentium's radix-4 SRT algorithm occasionally gave an incorrect quotient.
First noted in 1994 by T. Nicely, who computed sums of reciprocals of twin primes:
  $1/5 + 1/7 + 1/11 + 1/13 + \cdots + 1/p + 1/(p + 2) + \cdots$
Worst-case example of division error in Pentium:
  $c = 4\,195\,835 / 3\,145\,727$
  Correct quotient: 1.333 820 44...
  Circa 1994 Pentium double FLP value: 1.333 739 06..., accurate to only 14 bits (worse than single!)

10 The Scope of Computer Arithmetic
Hardware (our focus in this book)
  Design of efficient digital circuits for primitive and other arithmetic operations such as +, −, ×, ÷, √, log, sin, cos
  Issues: algorithms; error analysis; speed/cost trade-offs; hardware implementation; testing, verification
  General-purpose: flexible data paths; fast primitive operations like +, −, ×, ÷, √; benchmarking
  Special-purpose: tailored to applications like digital filtering, image processing, radar tracking
Software
  Numerical methods for solving systems of linear equations, partial differential equations, etc.
  Issues: algorithms; error analysis; computational complexity; programming; testing, verification

11 A Motivating Example
Using a calculator with √, x², and x^y functions, compute:
  u = √√···√2 (ten successive square roots) = the 1024th root of 2
  v = $2^{1/1024}$
Save u and v; if you can't save, recompute the values when needed.
  x = $(((u^2)^2)\ldots)^2$ (ten squarings) and x' = $u^{1024}$
  y = $(((v^2)^2)\ldots)^2$ (ten squarings) and y' = $v^{1024}$
The displayed results do not agree exactly. Perhaps v and u are not really the same value:
  w = v − u is nonzero due to hidden digits
  (u − 1) × 1000 reveals hidden digits ... (0) 68
  (v − 1) × 1000 reveals hidden digits ... (0) 69

12 Finite Precision Can Lead to Disaster
Example: Failure of Patriot Missile (1991 Feb. 25)
An American Patriot Missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile. The Scud struck an American Army barracks, killing 28.
Cause, per GAO/IMTEC report: software problem (inaccurate calculation of the time since boot)
Problem specifics:
  Time in tenths of a second, as measured by the system's internal clock, was multiplied by 1/10 to get the time in seconds.
  Internal registers were 24 bits wide.
  1/10 = $0.000\,1100\,1100\,1100\,1100\,1100_{two}$ (chopped to 24 b)
  Error ≈ $0.1100\,1100_{two} \times 2^{-23} \approx 9.5 \times 10^{-8}$
  Error in 100-hr operation period ≈ $9.5 \times 10^{-8} \times 100 \times 60 \times 60 \times 10 = 0.34$ s
  Distance traveled by Scud = (0.34 s) × (1676 m/s) ≈ 570 m
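
The arithmetic on this slide can be checked with a short sketch. It assumes that "chopped to 24 b" leaves 23 bits after the binary point (the names `FRACTION_BITS` and `chop` are illustrative, not from the GAO report) and uses exact rational arithmetic so that only the modeled chopping error appears.

```python
from fractions import Fraction

FRACTION_BITS = 23  # assumption: 24-bit register with 23 bits after the binary point


def chop(value: Fraction, bits: int) -> Fraction:
    """Truncate (chop) a nonnegative value to `bits` fractional bits."""
    scale = 2 ** bits
    return Fraction(int(value * scale), scale)


tenth = Fraction(1, 10)
error_per_tick = tenth - chop(tenth, FRACTION_BITS)   # error in each 0.1 s tick
ticks = 100 * 60 * 60 * 10                            # tenths of a second in 100 hours
drift_seconds = float(error_per_tick * ticks)         # accumulated clock error
miss_distance_m = drift_seconds * 1676                # Scud speed from the slide, m/s

print(f"error per tick = {float(error_per_tick):.3e}")
print(f"drift = {drift_seconds:.4f} s, distance = {miss_distance_m:.0f} m")
```

The slide rounds the drift to 0.34 s before multiplying, which is why it quotes about 570 m.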

13 Numbers and Their Encodings
Some 4-bit number representation formats:
  Unsigned integer
  Signed integer (± sign bit)
  Fixed point, 3+1 (radix point before the last bit)
  Signed fraction (± sign bit)
  2's-complement fraction
  Floating point, $s \times 2^e$: exponent e in {−2, −1, 0, 1}, significand s in {0, 1, 2, 3}
  Logarithmic: sign bit plus base-2 logarithm

14 Encoding Numbers in 4 Bits
[Figure: values representable by each 4-bit number format.]
Number formats shown: unsigned integers; signed-magnitude (±); 3+1 fixed-point, xxx.x; signed fraction, ±.xxx; 2's-complement fraction, x.xxx; 2+2 floating-point, $s \times 2^e$ with e in [−2, 1] and s in [0, 3]; 2+2 logarithmic (log = xx.xx)

15 Fixed-Radix Positional Number Systems
$(x_{k-1} x_{k-2} \cdots x_1 x_0 . x_{-1} x_{-2} \cdots x_{-l})_r = \sum_{i=-l}^{k-1} x_i r^i$
One can generalize to:
  Arbitrary radix (not necessarily integer, positive, constant)
  Arbitrary digit set, usually $\{-\alpha, -\alpha+1, \ldots, \beta-1, \beta\} = [-\alpha, \beta]$
Example 1.1. Balanced ternary number system: radix r = 3, digit set = [−1, 1]
Example 1.2. Negative-radix number systems: radix −r, r ≥ 2, digit set = [0, r − 1]
  The special case with radix −2 and digit set [0, 1] is known as the negabinary number system.
  Can it represent all integers?
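
The positional formula above can be evaluated directly for any integer radix and digit set. The following is a minimal sketch; the function name `positional_value` and the sample digit vectors are illustrative assumptions, not taken from the textbook.

```python
from fractions import Fraction


def positional_value(whole, frac=(), radix=10):
    """Value of (whole . frac) in the given radix.

    `whole` lists the digits x_{k-1} ... x_0 (most significant first) and
    `frac` lists x_{-1} ... x_{-l}. Digits and radix may be negative.
    """
    value = Fraction(0)
    for digit in whole:               # Horner evaluation of the whole part
        value = value * radix + digit
    weight = Fraction(1)
    for digit in frac:                # weights r^-1, r^-2, ...
        weight /= radix
        value += digit * weight
    return value


print(positional_value([1, -1, 0], radix=3))     # balanced ternary: 9 - 3 + 0
print(positional_value([1, 1, 0, 1], radix=-2))  # negabinary: -8 + 4 + 0 + 1
```

Because the digits are passed as plain integers, the same routine covers signed digit sets such as [−1, 1] and negative radices such as −2.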

16 More Examples of Number Systems
Example 1.3. Digit set [−4, 5] for r = 10: (3 −1 5)ten represents 295 = 300 − 10 + 5
Example 1.4. Digit set [−7, 7] for r = 10: (3 −1 5)ten = (3 0 −5)ten = (1 −7 0 −5)ten
Example 1.7. Quater-imaginary number system: radix r = 2j, digit set [0, 3]

17 Number Radix Conversion
$u = w.v = (x_{k-1} x_{k-2} \cdots x_1 x_0 . x_{-1} x_{-2} \cdots x_{-l})_r$ (old radix r)
$= (X_{K-1} X_{K-2} \cdots X_1 X_0 . X_{-1} X_{-2} \cdots X_{-L})_R$ (new radix R)
where w is the whole part and v is the fractional part.
Example: (31)eight = (25)ten, that is, 31 Oct. = 25 Dec. (Halloween = Xmas)
Radix conversion using arithmetic in the old radix r: convenient when converting from r = 10
Radix conversion using arithmetic in the new radix R: convenient when converting to R = 10

18 Radix Conversion: Old-Radix Arithmetic
Converting whole part w: (105)ten = (?)five
Repeatedly divide by five:
  Quotient  Remainder
  105
  21        0
  4         1
  0         4
Therefore, (105)ten = (410)five
Converting fractional part v: (105.486)ten = (410.?)five
Repeatedly multiply by five:
  Whole part  Fraction
              .486
  2           .430
  2           .150
  0           .750
  3           .750
  3           .750
Therefore, (105.486)ten ≈ (410.22033)five
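
The two procedures on this slide, repeated division for the whole part and repeated multiplication for the fractional part, can be sketched as follows. The function names are illustrative, and the fraction is held as an exact `Fraction` so that the digits are not disturbed by binary floating-point error.

```python
from fractions import Fraction


def whole_to_radix(w: int, new_radix: int) -> list:
    """Digits of a nonnegative integer, most significant first, by repeated division."""
    digits = []
    while w > 0:
        w, remainder = divmod(w, new_radix)
        digits.append(remainder)      # remainders emerge least significant first
    return digits[::-1] or [0]


def fraction_to_radix(v: Fraction, new_radix: int, count: int) -> list:
    """First `count` fractional digits of 0 <= v < 1 by repeated multiplication."""
    digits = []
    for _ in range(count):
        v *= new_radix
        digit = int(v)                # the whole part is the next digit
        digits.append(digit)
        v -= digit
    return digits


print(whole_to_radix(105, 5))
print(fraction_to_radix(Fraction(486, 1000), 5, 5))
```

The fractional expansion is generally not finite, so `count` fixes how many digits are produced and the result is a chopped approximation.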

19 Radix Conversion: New-Radix Arithmetic
Converting whole part w: (22033)five = (?)ten
Horner's rule or formula: ((((2 × 5) + 2) × 5 + 0) × 5 + 3) × 5 + 3
  Intermediate values: 10, 12, 60, 60, 300, 303, 1515, 1518
Converting fractional part v: (410.22033)five = (105.?)ten
  (0.22033)five × $5^5$ = (22033)five = (1518)ten
  1518 / $5^5$ = 1518 / 3125 = 0.48576
Therefore, (410.22033)five = (105.48576)ten

20 Horner's Rule for Fractions
Converting fractional part v: (0.22033)five = (?)ten
Horner's rule or formula: (((((3 / 5) + 3) / 5 + 0) / 5 + 2) / 5 + 2) / 5
  Intermediate values: 0.6, 3.6, 0.72, 0.72, 0.144, 2.144, 0.4288, 2.4288, 0.48576
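
Horner's rule for the whole part and for the fractional part can be written as two short loops. This is a sketch with illustrative function names; it evaluates the base-5 digit string 22033 used on the slides, working in exact rational arithmetic.

```python
from fractions import Fraction


def horner_whole(digits, radix):
    """Evaluate whole-part digits (most significant first): multiply, then add."""
    value = 0
    for digit in digits:
        value = value * radix + digit
    return value


def horner_fraction(digits, radix):
    """Evaluate fractional digits x_{-1} ... x_{-l}: start at the last digit, add, then divide."""
    value = Fraction(0)
    for digit in reversed(digits):
        value = (value + digit) / radix
    return value


print(horner_whole([2, 2, 0, 3, 3], 5))
print(float(horner_fraction([2, 2, 0, 3, 3], 5)))
```

Both loops perform one multiplication or division per digit, which is what makes Horner's rule convenient when the arithmetic is done in the new radix.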

21 Classes of Number Representations
Signed number
Redundant number system
Residue number system
Real number

22 2 Representing Signed Numbers
Chapter Goals
  Learn different encodings of the sign info
  Discuss implications for arithmetic design
Chapter Highlights
  Using sign bit, biasing, complementation
  Properties of 2's-complement numbers
  Signed vs unsigned arithmetic
  Signed numbers, positions, or digits

23 Four-bit signed-magnitude number representation system for integers
[Figure: bit patterns (representation) versus signed values (signed magnitude), with increment and decrement directions.]

24 Four-bit biased integer number representation system with a bias of 8
[Figure: bit patterns (representation) versus signed values (biased by 8), with increment directions.]

25 Arithmetic with Biased Numbers
Addition/subtraction of biased numbers:
  $x + y + \text{bias} = (x + \text{bias}) + (y + \text{bias}) - \text{bias}$
  $x - y + \text{bias} = (x + \text{bias}) - (y + \text{bias}) + \text{bias}$
A power-of-2 (or $2^a - 1$) bias simplifies addition/subtraction.
Comparison of biased numbers:
  Compare like ordinary unsigned numbers
  Find the true difference by ordinary subtraction
We seldom perform arbitrary arithmetic on biased numbers.
Main application: exponent field of floating-point numbers

26 Example and Two Special Cases
Example: complement system for fixed-point numbers
  Complementation constant M = 12.000
  Fixed-point number range [−6.000, +5.999]
  Represent −3.258 as 12.000 − 3.258 = 8.742
Auxiliary operations for complement representations:
  Complementation or change of sign (computing M − x)
  Computations of residues mod M
Thus, M must be selected to simplify these operations.
Two choices allow just this for fixed-point radix-r arithmetic with k whole digits and l fractional digits:
  Radix complement: $M = r^k$
  Digit complement: $M = r^k - ulp$ (aka diminished radix complement)
ulp (unit in least position) stands for $r^{-l}$. It allows us to forget about l, even for nonintegers.

27 Two's-Complement Numbers
[Figure: unsigned representations versus signed values (2's complement).]
Two's complement = radix complement system for r = 2
  $M = 2^k$
  $2^k - x = [(2^k - ulp) - x] + ulp = x^{compl} + ulp$
Range of representable numbers with k whole bits: from $-2^{k-1}$ to $2^{k-1} - ulp$
ulp (unit in least position) stands for $2^{-l}$. It allows us to forget about l, even for nonintegers.

28 One's-Complement Number Representation
[Figure: unsigned representations versus signed values (1's complement).]
One's complement = digit complement (diminished radix complement) system for r = 2
  $M = 2^k - ulp$
  $(2^k - ulp) - x = x^{compl}$
Range of representable numbers with k whole bits: from $-2^{k-1} + ulp$ to $2^{k-1} - ulp$

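29 Range/Precision Extension for 2's- and 1's-Complement Numbers
Range/precision extension for 2's-complement numbers:
  $\cdots x_{k-1} x_{k-1} x_{k-1}\; x_{k-1} x_{k-2} \cdots x_1 x_0 . x_{-1} x_{-2} \cdots x_{-l}\; 0\,0\,0 \cdots$
  (sign extension on the left of the sign bit; zeros appended after the LSD)
Range/precision extension for 1's-complement numbers:
  $\cdots x_{k-1} x_{k-1} x_{k-1}\; x_{k-1} x_{k-2} \cdots x_1 x_0 . x_{-1} x_{-2} \cdots x_{-l}\; x_{k-1} x_{k-1} x_{k-1} \cdots$
  (sign extension on the left of the sign bit; copies of the sign bit appended after the LSD)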

30 Mod $2^k$ vs Mod $2^k - 1$
The mod-$2^k$ operation needed in 2's-complement arithmetic is trivial: simply drop the carry-out (subtract $2^k$ if the result is $2^k$ or greater).
The mod-$(2^k - ulp)$ operation needed in 1's-complement arithmetic is done via end-around carry: compute $(x + y) - (2^k - ulp)$ by connecting $c_{out}$ to $c_{in}$.
Since the dropped carry is worth $2^k$ units and the inserted carry is worth ulp, the combined effect is to reduce the magnitude by $2^k - ulp$.
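
The difference between dropping the carry-out and feeding it back can be seen in a small integer-only sketch (so ulp = 1). The helper names and the choice of k = 4 in the usage lines are illustrative assumptions; bit patterns are held as unsigned Python integers.

```python
def add_twos_complement(x: int, y: int, k: int) -> int:
    """Add two k-bit patterns mod 2**k: the carry-out is simply dropped."""
    return (x + y) & ((1 << k) - 1)


def add_ones_complement(x: int, y: int, k: int) -> int:
    """Add two k-bit patterns mod 2**k - 1 using end-around carry."""
    mask = (1 << k) - 1
    total = x + y
    # The carry-out (worth 2**k) is removed and re-inserted as a carry-in (worth 1).
    return ((total & mask) + (total >> k)) & mask


def encode_twos(value: int, k: int) -> int:
    return value & ((1 << k) - 1)


def encode_ones(value: int, k: int) -> int:
    mask = (1 << k) - 1
    return value if value >= 0 else mask + value   # complement every bit of |value|


def decode_ones(pattern: int, k: int) -> int:
    mask = (1 << k) - 1
    return pattern - mask if pattern >> (k - 1) else pattern


k = 4
print(bin(add_twos_complement(encode_twos(-3, k), encode_twos(5, k), k)))
print(bin(add_ones_complement(encode_ones(-3, k), encode_ones(5, k), k)))
```

In the 1's-complement case the all-ones pattern is a second representation of zero, which `decode_ones` maps to 0.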

31 Why 2's-Complement Is the Universal Choice
[Figure: adder/subtractor architecture for 2's-complement numbers. A mux under control of the add/sub signal performs controlled complementation, passing y or its complement to the adder; the add/sub signal also drives $c_{in}$; the adder produces $s = x \pm y$ and $c_{out}$.]
The mux can be replaced with k XOR gates.
add/sub control: 0 for addition, 1 for subtraction

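32 Interpreting a 2's-complement number as having a negatively weighted most-significant digit
$x = (1\,0\,1\,0\,0\,1\,1\,0)_{two's\text{-}compl}$, with weights $-2^7, 2^6, 2^5, 2^4, 2^3, 2^2, 2^1, 2^0$
  $= -128 + 32 + 4 + 2 = -90$
Check: $x = (1\,0\,1\,0\,0\,1\,1\,0)_{two's\text{-}compl}$, so $-x = (0\,1\,0\,1\,1\,0\,1\,0)_{two} = 64 + 16 + 8 + 2 = 90$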
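
The negative-weight interpretation translates into a one-line rule: the leading bit has weight $-2^{k-1}$ and every other bit has its usual positive weight. The sketch below assumes integer operands given as bit strings; the function name and the sample pattern 10100110 (chosen to match the value −90 on this slide) are illustrative.

```python
def twos_complement_value(bits: str) -> int:
    """Value of a k-bit 2's-complement pattern, most significant bit first."""
    k = len(bits)
    value = -int(bits[0]) * 2 ** (k - 1)            # negatively weighted sign position
    for position, bit in enumerate(bits[1:], start=1):
        value += int(bit) * 2 ** (k - 1 - position)  # ordinary positive weights
    return value


print(twos_complement_value("10100110"))
print(twos_complement_value("01011010"))
```

No special case is needed for negative numbers, which is one reason this view is convenient for multiplier and adder design.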

33 Redundant Number Systems
Chapter Goals
  Explore the advantages and drawbacks of using more than r digit values in radix r
Chapter Highlights
  Redundancy eliminates long carry chains
  Redundancy takes many forms: trade-offs
  Conversions between redundant and nonredundant representations
  Redundancy used for end values too?

34 Coping with the Carry Problem
Ways of dealing with the carry propagation problem:
1. Limit propagation to within a small number of bits (Chapters 3-4)
2. Detect end of propagation; don't wait for worst case (Chapter 5)
3. Speed up propagation via lookahead etc. (Chapters 6-7)
4. Ideal: Eliminate carry propagation altogether! (Chapter 3)

35 Use Redundant Number System (1/2)
Operand digits in [0, 9]; position sums in [0, 18]
But how can we extend this beyond a single addition? Subsequent additions will cause problems.
The digit values 10 through 18 are redundant.
A position would produce a carry when its sum is 10 or more, and a position sum never exceeds 18.

36 Use Redundant Number System (2/2)
Is there still a carry propagation problem?
The sum of digits for each position is in [0, 36]; each can be decomposed into an interim sum in [0, 16] and a transfer digit in [0, 2], i.e., a carry.

37 Example: Addition of Redundant Numbers
Operand digits in [0, 18]
Position sums in [0, 36]
Interim sums in [0, 16]
Transfer digits in [0, 2]
Sum digits in [0, 18]
Position sum decomposition: [0, 36] = 10 × [0, 2] + [0, 16]
Absorption of transfer digit: [0, 16] + [0, 2] = [0, 18]
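
A sketch of this two-stage scheme for radix 10 with digit set [0, 18] follows. The rule used to pick the transfer digit (`min(p // 10, 2)`) is one possible choice that keeps the interim sum in [0, 16]; it and the sample operands are assumptions made for illustration, not values from the slide.

```python
def carry_free_add(x, y):
    """Add two radix-10 numbers with digits in [0, 18], least-significant digit first.

    Every sum digit depends only on its own position and the one to its right,
    so no carry ripples through the word.
    """
    p = [a + b for a, b in zip(x, y)]              # position sums in [0, 36]
    t = [min(pi // 10, 2) for pi in p]             # transfer digits in [0, 2]
    w = [pi - 10 * ti for pi, ti in zip(p, t)]     # interim sums in [0, 16]
    # Absorb the transfer coming from the position to the right.
    s = [w[0]] + [w[i] + t[i - 1] for i in range(1, len(p))]
    return s + [t[-1]]                             # the final transfer is the top digit


def value(digits):
    """Numeric value of a least-significant-first radix-10 digit list."""
    return sum(d * 10 ** i for i, d in enumerate(digits))


x = [18, 12, 10, 17, 9, 11]   # hypothetical operands
y = [18, 8, 10, 9, 12, 6]
s = carry_free_add(x, y)
print(s, value(s) == value(x) + value(y))
```

The result is again a digit vector over [0, 18], so it can be fed into another carry-free addition without first converting to conventional decimal.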

38 Carry-Free Addition Schemes
[Figure: three addition schemes. Operand digits $x_i, y_i$ enter at position i and sum digits $s_{i+1}, s_i, s_{i-1}$ leave.]
(a) Ideal single-stage carry-free (impossible for a positional system with a fixed digit set).
(b) Two-stage carry-free: an interim sum is formed at position i, and the transfer digit $t_i$ into position i is then absorbed.
(c) Single-stage with lookahead.

39 Redundancy Index
So, redundancy helps us achieve carry-free addition. But how much redundancy is actually needed? Is [0, 11] enough for r = 10?
Redundancy index: $\rho = \alpha + \beta + 1 - r$. For example, for the digit set [0, 11] in radix 10, $\rho = 0 + 11 + 1 - 10 = 2$.
Operand digits in [0, 11]
Position sums in [0, 22]
Interim sums in [0, 9]
Transfer digits in [0, 2]
Sum digits in [0, 11]

40 Digit Sets and Digit-Set Conversions
Example 3.1: Convert from digit set [0, 18] to [0, 9] in radix 10.
Working from the least-significant digit, each digit d ≥ 10 is rewritten as 10 (carry 1) + (d − 10), and the carry is added to the next digit to the left. In the slide's example every one of the six positions produces a carry of 1, and the answer has all digits in [0, 9].
Note: Conversion from redundant to nonredundant representation always involves carry propagation. Thus, the process is sequential and slow.

41 Generalized Signed-Digit Numbers
Radix r; digit set $[-\alpha, \beta]$; requirement $\alpha + \beta + 1 \ge r$; redundancy index $\rho = \alpha + \beta + 1 - r$
[Figure: taxonomy of radix-r positional number systems.]
ρ = 0: nonredundant
  α = 0: conventional
  α ≥ 1: nonredundant signed-digit
ρ ≥ 1: generalized signed-digit (GSD)
  ρ = 1: minimal GSD
    α = β (even r): symmetric minimal GSD; BSD or BSB for r = 2
    α ≠ β: asymmetric minimal GSD; α = 0 gives stored-carry (SC), BSC for r = 2; α = 1 (r ≠ 2) gives nonbinary SB
  ρ ≥ 2: nonminimal GSD
    α = β: symmetric nonminimal GSD; α < r gives ordinary signed-digit (OSD), minimally redundant for α = ⌊r/2⌋ + 1 and maximally redundant for α = r − 1
    α ≠ β: asymmetric nonminimal GSD; α = 0 gives unsigned-digit redundant (UDR); α = 1, β = r gives SCB, BSCB for r = 2

42 Binary Signed Digit (BSD)
[Table: a BSD number $x_i$ (a BSD representation of +6) written under four encodings.]
(s, v): sign and value encoding
2-bit 2's-complement encoding
(n, p): negative and positive flags
(n, z, p): 1-out-of-3 encoding

43 Carry-Free Addition Algorithms
Carry-free addition of GSD numbers:
  Compute the position sums $p_i = x_i + y_i$
  Divide $p_i$ into a transfer $t_{i+1}$ and an interim sum $w_i = p_i - r t_{i+1}$
  Add incoming transfers to get the sum digits $s_i = w_i + t_i$
If the transfer digits $t_i$ are in $[-\lambda, \mu]$, we must have:
  $-\alpha + \lambda \le p_i - r t_{i+1} \le \beta - \mu$
  The lower bound is the smallest interim sum if a transfer of $-\lambda$ is to be absorbable; the upper bound is the largest interim sum if a transfer of $\mu$ is to be absorbable.
These constraints lead to:
  $\lambda \ge \alpha / (r - 1)$ and $\mu \ge \beta / (r - 1)$

44 Is Carry-Free Addition Always Applicable?
No: it requires one of the following two conditions [Parh 90]:
  a. r > 2, ρ ≥ 3
  b. r > 2, ρ = 2, α ≠ 1, β ≠ 1 (e.g., not [−1, 10] in radix 10)
In other words, it is inapplicable for:
  r = 2 (perhaps the most useful case)
  ρ = 1 (e.g., carry-save)
  ρ = 2 with α = 1 or β = 1 (e.g., carry/borrow-save)
BSD is not two-stage carry-free.

45 Use Carry-Estimate in [−1, 1]
$x_i, y_i$ in [−1, 1]
$p_i$ in [−2, 2]
$e_i$ in {low: [−1, 0], high: [0, 1]}
$w_i$ in [−1, 1]
$t_{i+1}$ in [−1, 1]
$s_i$ in [−1, 1]
A position sum −1 is kept intact when the incoming transfer is in [0, 1], whereas it is rewritten as 1 with a carry of −1 for incoming transfer in [−1, 0]. This guarantees that $t_i \ne w_i$ and thus $-1 \le s_i \le 1$.
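
The estimate-based scheme can be sketched for BSD operands as below. The slide states the rule only for a position sum of −1; the mirrored rule for +1, the handling of ±2 and 0, and the use of the sign of $p_{i-1}$ as the estimate $e_i$ are assumptions filled in here so that the example is complete.

```python
def bsd_add(x, y):
    """Add two BSD numbers (digits in {-1, 0, 1}, least-significant digit first).

    Returns k + 1 sum digits, each again in {-1, 0, 1}.
    """
    k = len(x)
    p = [a + b for a, b in zip(x, y)]      # position sums in [-2, 2]
    w = [0] * k                            # interim sums
    t = [0] * (k + 1)                      # t[i] is the transfer into position i
    for i in range(k):
        # Estimate from position i - 1 only: "high" means t[i] is in [0, 1],
        # otherwise ("low") t[i] is in [-1, 0].
        high = i == 0 or p[i - 1] >= 0
        if p[i] == 2:
            t[i + 1], w[i] = 1, 0
        elif p[i] == 1:
            t[i + 1], w[i] = (1, -1) if high else (0, 1)
        elif p[i] == -1:
            t[i + 1], w[i] = (0, -1) if high else (-1, 1)
        elif p[i] == -2:
            t[i + 1], w[i] = -1, 0
    return [w[i] + t[i] for i in range(k)] + [t[k]]


def bsd_value(digits):
    return sum(d * 2 ** i for i, d in enumerate(digits))


a = [1, -1, 0, 1]    # hypothetical operands, least-significant digit first
b = [1, 1, -1, 1]
print(bsd_add(a, b), bsd_value(bsd_add(a, b)) == bsd_value(a) + bsd_value(b))
```

Each sum digit depends on positions i, i − 1, and i − 2 only, so the delay does not grow with the word width.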

46 Residue Number Systems
Chapter Goals
  Study a way of encoding large numbers as a collection of smaller numbers to simplify and speed up some operations
Chapter Highlights
  Moduli, range, arithmetic operations
  Many sets of moduli possible: tradeoffs
  Conversions between RNS and binary
  The Chinese remainder theorem
  Why are RNS applications limited?

47 RNS Representations and Arithmetic
Chinese puzzle, 1500 years ago: What number has the remainders of 2, 3, and 2 when divided by 7, 5, and 3, respectively?
Residues uniquely identify the number, hence they constitute a representation.
Pairwise relatively prime moduli: $m_{k-1} > \cdots > m_1 > m_0$
The residue $x_i$ of x with respect to the ith modulus $m_i$ (similar to a digit): $x_i = x \bmod m_i = \langle x \rangle_{m_i}$
An RNS representation contains a list of k residues or digits: $x = (2 \mid 3 \mid 2)_{RNS(7 \mid 5 \mid 3)}$
Default RNS for this chapter: RNS(8 | 7 | 5 | 3)

48 RNS Dynamic Range
The product M of the k pairwise relatively prime moduli is the dynamic range: $M = m_{k-1} \times \cdots \times m_1 \times m_0$
For RNS(8 | 7 | 5 | 3), M = 8 × 7 × 5 × 3 = 840
We can take the range of RNS(8 | 7 | 5 | 3) to be [−420, 419] or any other set of 840 consecutive integers.
Negative numbers: complement relative to M, $\langle -x \rangle_{m_i} = \langle M - x \rangle_{m_i}$
  21 = (5 | 0 | 1 | 0)RNS
  −21 = (8 − 5 | 0 | 5 − 1 | 0)RNS = (3 | 0 | 4 | 0)RNS
Here are some example numbers in our default RNS(8 | 7 | 5 | 3):
  (0 | 0 | 0 | 0)RNS represents 0 or 840 or ...
  (1 | 1 | 1 | 1)RNS represents 1 or 841 or ...
  (2 | 2 | 2 | 2)RNS represents 2 or 842 or ...
  ...
  (0 | 1 | 4 | 1)RNS represents 64 or 904 or ...
  (2 | 0 | 0 | 2)RNS represents −70 or 770 or ...
  (7 | 6 | 4 | 2)RNS represents −1 or 839 or ...

49 RNS as Weighted Representation
For RNS(8 | 7 | 5 | 3), the weights of the 4 positions are: 105, 120, 336, 280
Example: (1 | 2 | 4 | 0)RNS represents the number
  $\langle 105 \times 1 + 120 \times 2 + 336 \times 4 + 280 \times 0 \rangle_{840} = \langle 1689 \rangle_{840} = 9$
For RNS(7 | 5 | 3), the weights of the 3 positions are: 15, 21, 70
Example (Chinese puzzle): (2 | 3 | 2)RNS(7 | 5 | 3) represents the number
  $\langle 15 \times 2 + 21 \times 3 + 70 \times 2 \rangle_{105} = \langle 233 \rangle_{105} = 23$
We will see later how the weights can be determined for a given RNS.

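50 RNS Encoding and Arithmetic Operations
Binary-coded format for RNS(8 | 7 | 5 | 3): one field each for mod 8, mod 7, mod 5, and mod 3.
[Figure: Operand 1 and Operand 2 feed independent mod-8, mod-7, mod-5, and mod-3 units, which together produce the result.]
Arithmetic in RNS(8 | 7 | 5 | 3):
  (5 | 5 | 0 | 2)RNS represents x = +5
  (7 | 6 | 4 | 2)RNS represents y = −1
  (4 | 4 | 4 | 1)RNS is x + y: $\langle 5 + 7 \rangle_8 = 4$, $\langle 5 + 6 \rangle_7 = 4$, etc.
  (6 | 6 | 1 | 0)RNS is x − y: $\langle 5 - 7 \rangle_8 = 6$, $\langle 5 - 6 \rangle_7 = 6$, etc. (alternatively, find −y and add to x)
  (3 | 2 | 0 | 1)RNS is x × y: $\langle 5 \times 7 \rangle_8 = 3$, $\langle 5 \times 6 \rangle_7 = 2$, etc.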
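
Digit-wise RNS arithmetic is easy to mimic in software. The sketch below uses the chapter's default moduli (8, 7, 5, 3) and the operands x = +5 and y = −1 from this slide; the function names are illustrative.

```python
import operator

MODULI = (8, 7, 5, 3)


def to_rns(x: int) -> tuple:
    """Residues of x for each modulus; Python's % already returns a nonnegative residue."""
    return tuple(x % m for m in MODULI)


def rns_apply(op, a: tuple, b: tuple) -> tuple:
    """Apply +, -, or * independently in every residue channel."""
    return tuple(op(ai, bi) % m for ai, bi, m in zip(a, b, MODULI))


x, y = to_rns(5), to_rns(-1)
print(x, y)
print(rns_apply(operator.add, x, y))   # residues of 5 + (-1) = 4
print(rns_apply(operator.sub, x, y))   # residues of 5 - (-1) = 6
print(rns_apply(operator.mul, x, y))   # residues of 5 * (-1) = -5
```

No information passes between channels, which is the software analogue of the independent mod-m units in the hardware figure.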

51 Choosing the RNS Moduli
Target range for our RNS: decimal values [0, 100 000]
Strategy 1: To minimize the largest modulus, and thus ensure high-speed arithmetic, pick prime numbers in sequence.
  Pick $m_0 = 2$, $m_1 = 3$, $m_2 = 5$, etc. After adding $m_5 = 13$:
  RNS(13 | 11 | 7 | 5 | 3 | 2), M = 30 030: inadequate
  RNS(17 | 13 | 11 | 7 | 5 | 3 | 2), M = 510 510: too large
  RNS(17 | 13 | 11 | 7 | 3 | 2), M = 102 102: just right! 5 + 4 + 4 + 3 + 2 + 1 = 19 bits
Fine tuning: combine pairs of moduli 2 & 13 (26) and 3 & 7 (21)
  RNS(26 | 21 | 17 | 11), M = 102 102

52 An Improved Strategy
Target range for our RNS: decimal values [0, 100 000]
Strategy 2: Improve strategy 1 by including powers of smaller primes before proceeding to the next larger prime.
  RNS($2^2$ | 3), M = 12
  RNS($3^2$ | $2^3$ | 7 | 5), M = 2520
  RNS(11 | $3^2$ | $2^3$ | 7 | 5), M = 27 720
  RNS(13 | 11 | $3^2$ | $2^3$ | 7 | 5), M = 360 360 (remove one 3, combine 3 & 5)
  RNS(15 | 13 | 11 | $2^3$ | 7), M = 120 120; 4 + 4 + 4 + 3 + 3 = 18 bits
Fine tuning: maximize the size of the even modulus within the 4-bit limit
  RNS($2^4$ | 13 | 11 | $3^2$ | 7 | 5), M = 720 720: too large
  We can now remove 5 or 7; not an improvement in this example.

53 Low-Cost RNS Moduli
Target range for our RNS: decimal values [0, 100 000]
Strategy 3: To simplify the modular reduction (mod $m_i$) operations, choose only moduli of the forms $2^a$ or $2^a - 1$, aka low-cost moduli.
  RNS($2^{a_{k-1}}$ | $2^{a_{k-2}} - 1$ | ... | $2^{a_1} - 1$ | $2^{a_0} - 1$)
  We can have only one even modulus.
  $2^{a_i} - 1$ and $2^{a_j} - 1$ are relatively prime iff $a_i$ and $a_j$ are relatively prime.
  RNS($2^3$ | $2^3 - 1$ | $2^2 - 1$), basis: 3, 2; M = 168
  RNS($2^4$ | $2^4 - 1$ | $2^3 - 1$), basis: 4, 3; M = 1680
  RNS($2^5$ | $2^5 - 1$ | $2^3 - 1$ | $2^2 - 1$), basis: 5, 3, 2; M = 20 832
  RNS($2^5$ | $2^5 - 1$ | $2^4 - 1$ | $2^3 - 1$), basis: 5, 4, 3; M = 104 160
Comparison (it is easy to compute mod $2^k$ and mod $2^k - 1$):
  RNS(15 | 13 | 11 | $2^3$ | 7): 18 bits, M = 120 120
  RNS($2^5$ | $2^5 - 1$ | $2^4 - 1$ | $2^3 - 1$): 17 bits, M = 104 160

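54 Encoding and Decoding of Numbers
Conversion from binary/decimal to RNS
Example 4.1: Represent the number y = (1010 0100)two = (164)ten in RNS(8 | 7 | 5 | 3).
The mod-8 residue is easy to find: $x_3 = \langle y \rangle_8 = (100)_{two} = 4$
We have $y = 2^7 + 2^5 + 2^2$; thus
  $x_2 = \langle y \rangle_7 = \langle 2 + 4 + 4 \rangle_7 = 3$
  $x_1 = \langle y \rangle_5 = \langle 3 + 2 + 4 \rangle_5 = 4$
  $x_0 = \langle y \rangle_3 = \langle 2 + 2 + 1 \rangle_3 = 2$
Table 4.1 Residues of the first 10 powers of 2
  i   2^i   mod 7   mod 5   mod 3
  0   1     1       1       1
  1   2     2       2       2
  2   4     4       4       1
  3   8     1       3       2
  4   16    2       1       1
  5   32    4       2       2
  6   64    1       4       1
  7   128   2       3       2
  8   256   4       1       1
  9   512   1       2       2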
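
The table-lookup conversion of Example 4.1 can be sketched as follows: the residues of the powers of 2 are precomputed once, and the residue of y is the modular sum of the table entries selected by the 1 bits of y. The function names are illustrative, and the sketch assumes y has no more bits than the table has entries.

```python
def power_of_two_residues(modulus: int, count: int) -> list:
    """Residues of 2**i mod `modulus` for i = 0 .. count - 1 (one column of Table 4.1)."""
    table, r = [], 1 % modulus
    for _ in range(count):
        table.append(r)
        r = (2 * r) % modulus        # next power of 2, reduced as we go
    return table


def residue_from_bits(y: int, modulus: int, table: list) -> int:
    """Sum the table entries for the 1 bits of y, then reduce once more."""
    total = 0
    for i, r in enumerate(table):
        if (y >> i) & 1:
            total += r
    return total % modulus


y = 0b10100100                        # 164, as in Example 4.1
residues = [y & 0b111]                # the mod-8 residue is just the low 3 bits
for m in (7, 5, 3):
    residues.append(residue_from_bits(y, m, power_of_two_residues(m, 10)))
print(residues)
```

In hardware the same idea becomes a small lookup table per modulus followed by a multioperand modular adder.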

55 Conversion from RNS to Binary/Decimal
Theorem 4.1 (The Chinese remainder theorem)
  $x = (x_{k-1} \mid \cdots \mid x_1 \mid x_0)_{RNS} = \left\langle \sum_i M_i \langle \alpha_i x_i \rangle_{m_i} \right\rangle_M$
  where $M_i = M / m_i$ and $\alpha_i = \langle M_i^{-1} \rangle_{m_i}$ (multiplicative inverse of $M_i$ with respect to $m_i$)
Implementing CRT-based RNS-to-binary conversion:
  $x = \left\langle \sum_i M_i \langle \alpha_i x_i \rangle_{m_i} \right\rangle_M = \left\langle \sum_i f_i(x_i) \right\rangle_M$
  We can use a table to store the $f_i$ values: $\sum_i m_i$ entries
Table 4.2 Values needed in applying the Chinese remainder theorem to RNS(8 | 7 | 5 | 3), with columns i, $m_i$, $x_i$, and $\langle M_i \langle \alpha_i x_i \rangle_{m_i} \rangle_M$

56 Intuitive Justification for CRT
Puzzle: What number has the remainders of 2, 3, and 2 when divided by the numbers 7, 5, and 3, respectively?
  x = (2 | 3 | 2)RNS(7 | 5 | 3) = (?)ten
  (1 | 0 | 0)RNS(7 | 5 | 3) = multiple of 15 that is 1 mod 7 = 15
  (0 | 1 | 0)RNS(7 | 5 | 3) = multiple of 21 that is 1 mod 5 = 21
  (0 | 0 | 1)RNS(7 | 5 | 3) = multiple of 35 that is 1 mod 3 = 70
  (2 | 3 | 2)RNS(7 | 5 | 3) = (2 | 0 | 0) + (0 | 3 | 0) + (0 | 0 | 2)
    = 2 × (1 | 0 | 0) + 3 × (0 | 1 | 0) + 2 × (0 | 0 | 1)
    = 2 × 15 + 3 × 21 + 2 × 70 = 30 + 63 + 140 = 233 = 23 mod 105
Therefore, x = (23)ten
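
The Chinese remainder theorem formula can be turned into a short decoding routine. This sketch assumes Python 3.8 or later, where `pow(a, -1, m)` returns a modular inverse and `math.prod` is available; the function names are illustrative.

```python
from math import prod


def crt_weights(moduli):
    """Weights M_i * alpha_i of the positions, where alpha_i is the inverse of M_i mod m_i."""
    M = prod(moduli)
    return [(M // m) * pow(M // m, -1, m) for m in moduli]


def crt_decode(residues, moduli):
    """Recover x in [0, M) from its residues via the Chinese remainder theorem."""
    M = prod(moduli)
    total = 0
    for x_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        alpha_i = pow(M_i, -1, m_i)              # multiplicative inverse of M_i mod m_i
        total += M_i * ((alpha_i * x_i) % m_i)
    return total % M


print(crt_weights((7, 5, 3)))
print(crt_decode((2, 3, 2), (7, 5, 3)))
```

For a signed range such as [−420, 419] in RNS(8 | 7 | 5 | 3), one further step would subtract M from any decoded value of M/2 or more.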

57 Difficult RNS Arithmetic Operations
Sign test
Magnitude comparison
Division
One approach: convert back and forth to/from binary.
Another approach: convert to a mixed-radix system, since numbers in a mixed-radix system can be compared directly.

58 Difficult RNS Arithmetic Operations
Example: Of the following RNS(8 | 7 | 5 | 3) numbers, which, if any, are negative? Which is the largest? Which is the smallest? Assume a range of [−420, 419].
  a = (0 | 1 | 3 | 2)RNS
  b = (0 | 1 | 4 | 1)RNS
  c = (0 | 6 | 2 | 1)RNS
  d = (2 | 0 | 0 | 2)RNS
  e = (5 | 0 | 1 | 0)RNS
  f = (7 | 6 | 4 | 2)RNS
Answers: d < c < f < a < e < b, that is, −70 < −8 < −1 < 8 < 21 < 64

59 General RNS Division
General RNS division, as opposed to division by one of the moduli (aka scaling), is difficult; hence, use of RNS is unlikely to be effective when an application requires many divisions.
Scheme proposed in the 1994 PhD thesis of Ching-Yu Hung (UCSB): use an algorithm that has built-in tolerance to imprecision, and apply approximate CRT decoding to choose quotient digits.
Example: SRT algorithm (s is the partial remainder)
  s < 0: quotient digit = −1
  s ≈ 0: quotient digit = 0
  s > 0: quotient digit = 1
The BSD quotient can be converted to RNS on the fly.

60 Limits of Fast Arithmetic in RNS
Known results from number theory:
  Theorem 4.2: The ith prime $p_i$ is asymptotically $i \ln i$
  Theorem 4.3: The number of primes in [1, n] is asymptotically $n / \ln n$
  Theorem 4.4: The product of all primes in [1, n] is asymptotically $e^n$
Implications for the speed of arithmetic in RNS:
  Theorem 4.5: It is possible to represent all k-bit binary numbers in RNS with O(k / log k) moduli such that the largest modulus has O(log k) bits
  That is, with fast log-time adders, addition needs O(log log k) time.

61 Hardware Implementation for RNS Representations
[Figure: Operand 1 and Operand 2, each with mod 8, mod 7, mod 5, and mod 3 fields, feed the mod-8, mod-7, mod-5, and mod-3 units that produce the result.]

62 Addition/Subtraction
Instructor: Kuan Jen Lin, Dept. of EE, FJU, Taiwan. Room: SF 727B
Most slides originate from the textbook author's PowerPoint presentation files.
Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan

63 II Addition / Subtraction
Review addition schemes and various speedup methods
  Addition is a key op (in itself, and as a building block)
  Subtraction = negation + addition
  Carry propagation speedup: lookahead, skip, select, ...
  Two-operand versus multioperand addition
Topics in This Part
  Chapter 5 Basic Addition and Counting
  Chapter 6 Carry-Lookahead Adders
  Chapter 7 Variations in Fast Adder
  Chapter 8 Multioperand Addition

64 Basic Addition and Counting
Chapter Goals
  Study the design of ripple-carry adders, discuss why their latency is unacceptable, and set the foundation for faster adders
Chapter Highlights
  Full adders are versatile building blocks
  Longest carry chain on average: $\log_2 k$ bits
  Fast asynchronous adders are simple
  Counting is relatively easy to speed up

65 HA and FA Adders
Half-adder (HA): truth table and block diagram
  Inputs   Outputs
  x y      c s
  0 0      0 0
  0 1      0 1
  1 0      0 1
  1 1      1 0
Full-adder (FA): truth table and block diagram
  Inputs       Outputs
  x y c_in     c_out s
  0 0 0        0 0
  0 0 1        0 1
  0 1 0        0 1
  0 1 1        1 0
  1 0 0        0 1
  1 0 1        1 0
  1 1 0        1 0
  1 1 1        1 1

66 Half-Adder Implementations
[Figure: three half-adder circuits with inputs x, y and outputs c, s.]
(a) AND/XOR half-adder.
(b) NOR-gate half-adder.
(c) NAND-gate half-adder with complemented carry.

67 Some Full-Adder Details
Logic equations for a full-adder:
  $s = x \oplus y \oplus c_{in}$ (odd parity function)
  $\phantom{s} = x y c_{in} \lor \bar{x}\,\bar{y}\,c_{in} \lor \bar{x}\,y\,\bar{c}_{in} \lor x\,\bar{y}\,\bar{c}_{in}$
  $c_{out} = x y \lor x c_{in} \lor y c_{in}$ (majority function)

68 Full-Adder Implementations
[Figure: three full-adder circuits with inputs x, y, $c_{in}$ and outputs s, $c_{out}$.]
(a) Built of half-adders.
(b) Built as an AND-OR circuit.
(c) Suitable for CMOS realization (uses a mux).

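69 Bit Serial Adder and Ripple Adder
[Figure: (a) Bit-serial adder: operand shift registers supply $x_i$ and $y_i$ to a single FA, a carry flip-flop holds $c_i$ and receives $c_{i+1}$, and the sum bit $s_i$ is shifted into the result register on each clock.]
[Figure: (b) Ripple-carry adder: a chain of full adders for bits 0 through 31, with $c_0 = c_{in}$, $c_{32} = c_{out}$, and sum bits $s_0$ through $s_{31}$ (plus $s_{32}$).]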

70 Critical Path Through a Ripple-Carry Adder
$T_{ripple\text{-}add} = T_{FA}(x, y \to c_{out}) + (k - 2) \times T_{FA}(c_{in} \to c_{out}) + T_{FA}(c_{in} \to s)$
[Figure: critical path in a k-bit ripple-carry adder, running from $x_0, y_0$ through the carries $c_1, c_2, \ldots, c_{k-1}$ to $s_{k-1}$.]

71 Conditions and Exceptions
[Figure: k-bit ripple-carry adder with Overflow, Negative, and Zero outputs derived from the carries and sum bits.]
Overflow occurs when two numbers of like sign are added and a result of the opposite sign is produced.
  $overflow_{2's\text{-}compl} = x_{k-1}\, y_{k-1}\, \bar{s}_{k-1} \lor \bar{x}_{k-1}\, \bar{y}_{k-1}\, s_{k-1}$
  $overflow_{2's\text{-}compl} = c_k \oplus c_{k-1} = c_k \bar{c}_{k-1} \lor \bar{c}_k c_{k-1}$
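
A bit-level model of the ripple-carry adder makes the overflow rule $c_k \oplus c_{k-1}$ easy to try out. The sketch below is a behavioral model, not a hardware description; the function names and the 4-bit operands are illustrative.

```python
def full_adder(x: int, y: int, c_in: int):
    """One-bit full adder: sum is the odd-parity function, carry is the majority function."""
    s = x ^ y ^ c_in
    c_out = (x & y) | (x & c_in) | (y & c_in)
    return s, c_out


def ripple_carry_add(x_bits, y_bits, c_in=0):
    """Add two equal-length bit lists (least-significant bit first).

    Returns (sum_bits, c_out, overflow), where overflow is the 2's-complement
    overflow indicator c_k XOR c_{k-1}.
    """
    carries = [c_in]
    sum_bits = []
    for x, y in zip(x_bits, y_bits):
        s, c = full_adder(x, y, carries[-1])   # each stage waits for the previous carry
        sum_bits.append(s)
        carries.append(c)
    overflow = carries[-1] ^ carries[-2]
    return sum_bits, carries[-1], overflow


five = [1, 0, 1, 0]        # 0101, least-significant bit first
four = [0, 0, 1, 0]        # 0100
minus_three = [1, 0, 1, 1] # 1101
print(ripple_carry_add(five, four))          # 5 + 4 does not fit in 4 signed bits
print(ripple_carry_add(five, minus_three))   # 5 + (-3) = 2, carry-out but no overflow
```

The second call shows that a carry-out alone does not signal overflow in 2's-complement addition.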

72 Binary Adders as Versatile Building Blocks (1/2)
A full adder with constant inputs realizes basic gates:
  Set one input to 0: $c_{out}$ = AND of other inputs
  Set one input to 1: $c_{out}$ = OR of other inputs
  Set one input to 0 and another to 1: s = NOT of third input
Fig. 5.6 Four-bit binary adder used to realize the logic function f = w + xyz and its complement.

73 Binary Adders as Versatile Building Blocks (2/2)
[Table and figure: full-adder truth table (inputs x, y, $c_{in}$; outputs $c_{out}$, s) and FA block diagram.]

74 Example of Carry Propagation
[Figure: two operands laid out by bit position between $c_{out}$ and $c_{in}$, with the carry chains and their lengths marked.]

75 Using Probability to Analyze Carry Propagation
Given binary numbers with random bits, for each position i we have:
  Probability of carry generation = ¼ (both 1s)
  Probability of carry annihilation = ¼ (both 0s)
  Probability of carry propagation = ½ (different)
Probability that a carry generated at position i propagates through position j − 1 and stops at position j (j > i):
  $2^{-(j-1-i)} \times 1/2 = 2^{-(j-i)}$
Expected length of the carry chain that starts at position i:
  $\sum_{j=i+1}^{k-1} (j-i)\,2^{-(j-i)} + (k-i)\,2^{-(k-1-i)} = \sum_{l=1}^{k-1-i} l\,2^{-l} + (k-i)\,2^{-(k-1-i)}$
  $= 2 - (k-i+1)\,2^{-(k-1-i)} + (k-i)\,2^{-(k-1-i)} = 2 - 2^{-(k-i-1)}$
Because the carry definitely stops at position k, the term for k is not multiplied by ½.
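
The closed form for the expected chain length, $2 - 2^{-(k-i-1)}$, can be checked numerically by summing the series term by term. This is a verification sketch in exact rational arithmetic; the function name and the word width k = 32 are illustrative choices.

```python
from fractions import Fraction


def expected_chain_length(i: int, k: int) -> Fraction:
    """Expected length of a carry chain that starts at position i of a k-bit adder."""
    total = Fraction(0)
    for j in range(i + 1, k):
        # The chain stops at position j < k with probability 2^-(j - i).
        total += (j - i) * Fraction(1, 2 ** (j - i))
    # The chain is forced to stop at position k, so this term lacks the factor 1/2.
    total += (k - i) * Fraction(1, 2 ** (k - 1 - i))
    return total


k = 32
for i in (0, 16, 30, 31):
    print(i, float(expected_chain_length(i, k)))
```

For positions far from the most-significant end the expected length is very close to 2, so typical carry chains are short even in wide adders.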

76 Carry Completion Detection
Dual-rail coding of the carry at each position ($b_i = 1$: no carry; $c_i = 1$: carry):
  b_i  c_i
  0    0    Carry not yet known
  0    1    Carry known to be 1
  1    0    Carry known to be 0
[Figure: per-position logic producing $b_{i+1}$ and $c_{i+1}$ from $x_i$, $y_i$, $b_i$, and $c_i$, with $b_0 = \bar{c}_{in}$, $c_0 = c_{in}$, and $c_k = c_{out}$; the signals $d_{i+1}$, together with those from other bit positions, are combined into alldone.]

77 Self-Timed Adder

78 Self-Timed Adder with Parallel Carry Completion Sensing

79 Addition of a Constant: Counters
[Figure: an up (down) counter. A mux controlled by Count / Initialize selects between "Data in" and the output x + 1 (x − 1) of the incrementer (decrementer); the count register has Clock, Reset, Clear, Load, and Enable inputs; the register contents x drive "Data out", and the incrementer's $c_{out}$ indicates counter overflow.]

80 Implementing a Simple Up Counter
[Figure: ripple-carry incrementer for use in an up counter, with inputs $x_{k-1} \ldots x_0$, carries $c_k \ldots c_1$, and outputs $s_{k-1} \ldots s_0$.]
[Figure: four-bit asynchronous up counter built only of negative-edge-triggered T flip-flops, with an Increment input and count outputs $Q_3 Q_2 Q_1 Q_0$.]

81 Manchester Carry Chains and Adders
Sum digit in radix r: $s_i = (x_i + y_i + c_i) \bmod r$
Special case of radix 2: $s_i = x_i \oplus y_i \oplus c_i$
Computing the carries $c_i$ is thus our central problem. For this, the actual operand digits are not important. What matters is whether in a given position a carry is generated, propagated, or annihilated (absorbed).
For binary addition:
  $g_i = x_i y_i$
  $p_i = x_i \oplus y_i$
  $a_i = \bar{x}_i \bar{y}_i = \overline{x_i \lor y_i}$
It is also helpful to define a transfer signal: $t_i = g_i \lor p_i = \bar{a}_i = x_i \lor y_i$
Using these signals, the carry recurrence is written as
  $c_{i+1} = g_i \lor c_i p_i = g_i \lor c_i g_i \lor c_i p_i = g_i \lor c_i t_i$

82 Manchester Carry Network The worst-case delay of a Manchester carry chain has three components: 1. Latency of forming the switch control signals 2. Set-up time for switches 3. Signal propagation delay through k switches. Signals: $g_i = x_i y_i$, $p_i = x_i \oplus y_i$, $c_{i+1} = g_i \lor c_i p_i$. [Figure: (a) conceptual representation, in which switches controlled by a_i, p_i and g_i connect c_{i+1} to logic 0, to c_i, or to logic 1; (b) possible CMOS realization, clocked between V_DD and V_SS, producing c'_{i+1} from c'_i.]

83 Carry Network is the Essence of a Fast Adder
With $g_i = x_i y_i$ and $p_i = x_i \oplus y_i$, the pair (g_i, p_i) tells what happens to the carry: 0 0 = annihilated or killed; 0 1 = propagated; 1 0 = generated; 1 1 = (impossible).
[Figure: the carry network takes (g_{k−1}, p_{k−1}) ... (g_0, p_0) and c_0 and produces c_k ... c_1; it can be ripple, skip, lookahead, or parallel-prefix.]
The main part of an adder is the carry network. The rest is just a set of gates to produce the g and p signals and the sum bits s_i.

84 Carry Propagation Network of a Ripple-Carry Adder The carry recurrence: $c_{i+1} = g_i \lor p_i c_i$. Latency of a k-bit adder is roughly 2k gate delays: 1 gate delay for production of the p and g signals, plus 2(k − 1) gate delays for carry propagation, plus 1 XOR gate delay for generation of the sum bits. [Figure: chain of carry stages with inputs (g_{k−1}, p_{k−1}) ... (g_0, p_0) and carries c_k, c_{k−1}, ..., c_1, c_0.]
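
As a software illustration of the recurrence $c_{i+1} = g_i \lor p_i c_i$, the following sketch models a k-bit ripple-carry adder on Python integers. It is a behavioural model under the assumption that operands are unsigned and fit in k bits; the name `ripple_carry_add` is made up for this example.

```python
def ripple_carry_add(x: int, y: int, k: int, c_in: int = 0):
    """Return (sum, c_out) of two k-bit unsigned operands using g/p signals."""
    carry = c_in
    total = 0
    for i in range(k):
        xi = (x >> i) & 1
        yi = (y >> i) & 1
        g = xi & yi              # generate
        p = xi ^ yi              # propagate
        total |= (p ^ carry) << i    # s_i = p_i XOR c_i
        carry = g | (p & carry)      # c_{i+1} = g_i OR p_i c_i
    return total, carry
```

The loop visits the positions one after another, which mirrors why the hardware latency grows linearly with k.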

85 Carry-Lookahead Adders Chapter Goals: Understand the carry-lookahead method and its many variations used in the design of fast adders. Chapter Highlights: Single- and multilevel carry lookahead. Various designs for log-time adders. Relating the carry determination problem to parallel prefix computation. Implementing fast adders in VLSI.

86 Unrolling the Carry Recurrence Recall the generate, propagate, annihilate (absorb), and transfer signals:
Signal | Radix r | Binary
$g_i$ | is 1 iff $x_i + y_i \ge r$ | $x_i y_i$
$p_i$ | is 1 iff $x_i + y_i = r - 1$ | $x_i \oplus y_i$
$a_i$ | is 1 iff $x_i + y_i < r - 1$ | $\bar{x}_i \bar{y}_i = \overline{x_i \lor y_i}$
$t_i$ | is 1 iff $x_i + y_i \ge r - 1$ | $x_i \lor y_i$
$s_i$ | $(x_i + y_i + c_i) \bmod r$ | $x_i \oplus y_i \oplus c_i$
The carry recurrence can be unrolled to obtain each carry signal directly from inputs, rather than through propagation (where $p_j$ can be replaced with $t_j$):
$c_i = g_{i-1} \lor c_{i-1} p_{i-1}$
$= g_{i-1} \lor (g_{i-2} \lor c_{i-2} p_{i-2})\, p_{i-1}$
$= g_{i-1} \lor g_{i-2} p_{i-1} \lor c_{i-2} p_{i-2} p_{i-1}$
$= g_{i-1} \lor g_{i-2} p_{i-1} \lor g_{i-3} p_{i-2} p_{i-1} \lor c_{i-3} p_{i-3} p_{i-2} p_{i-1}$
$= g_{i-1} \lor g_{i-2} p_{i-1} \lor g_{i-3} p_{i-2} p_{i-1} \lor g_{i-4} p_{i-3} p_{i-2} p_{i-1} \lor c_{i-4} p_{i-4} p_{i-3} p_{i-2} p_{i-1} = \dots$

87 Four-Bit Carry-Lookahead Adder (1/2) Full carry lookahead is quite practical for a 4-bit adder:
$c_1 = g_0 \lor c_0 p_0$
$c_2 = g_1 \lor g_0 p_1 \lor c_0 p_0 p_1$
$c_3 = g_2 \lor g_1 p_2 \lor g_0 p_1 p_2 \lor c_0 p_0 p_1 p_2$
$c_4 = g_3 \lor g_2 p_3 \lor g_1 p_2 p_3 \lor g_0 p_1 p_2 p_3 \lor c_0 p_0 p_1 p_2 p_3$
Complexity is reduced by deriving the carry-out indirectly: $c_4 = g_3 \lor c_3 p_3$. [Figure: gate network producing c_1 ... c_4 from (g_3, p_3) ... (g_0, p_0) and c_0.]
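
The four unrolled equations can be transcribed directly and compared against ordinary addition. This is a hedged sketch rather than a hardware description: `cla4_carries` and `add4` are illustrative names, and Python's bitwise operators stand in for the AND/OR gates.

```python
def cla4_carries(g, p, c0):
    """Carries c0..c4 of a 4-bit adder from the unrolled lookahead equations.

    g and p are lists of four 0/1 values, least significant position first.
    """
    c1 = g[0] | (c0 & p[0])
    c2 = g[1] | (g[0] & p[1]) | (c0 & p[0] & p[1])
    c3 = g[2] | (g[1] & p[2]) | (g[0] & p[1] & p[2]) | (c0 & p[0] & p[1] & p[2])
    c4 = (g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3])
          | (g[0] & p[1] & p[2] & p[3]) | (c0 & p[0] & p[1] & p[2] & p[3]))
    return [c0, c1, c2, c3, c4]

def add4(x, y, c0=0):
    """4-bit addition: every carry depends only on g, p and c0, not on other carries."""
    g = [(x >> i) & (y >> i) & 1 for i in range(4)]
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(4)]
    c = cla4_carries(g, p, c0)
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, c[4]
```

Each carry expression is two gate levels deep regardless of position, at the price of growing fan-in.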

88 Four-Bit Carry-Lookahead Adder (2/2) Source: Ercegovac and Lang, Digital Arithmetic, MKP Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 27

89 Carry Lookahead Beyond 4 Bits Consider a 32-bit adder:
$c_1 = g_0 \lor c_0 p_0$
$c_2 = g_1 \lor g_0 p_1 \lor c_0 p_0 p_1$
$c_3 = g_2 \lor g_1 p_2 \lor g_0 p_1 p_2 \lor c_0 p_0 p_1 p_2$
...
$c_{31} = g_{30} \lor g_{29} p_{30} \lor g_{28} p_{29} p_{30} \lor g_{27} p_{28} p_{29} p_{30} \lor \dots \lor c_0 p_0 p_1 p_2 p_3 \cdots p_{29} p_{30}$
The last term needs a 32-input AND and the whole expression a 32-input OR. High fan-ins necessitate tree-structured circuits. For wide words, full carry lookahead is impractical.

90 Two Schemes to Manage the Complexity High-radix addition (i.e., radix $2^h$): increases the latency for generating g and p signals and sum digits, but simplifies the carry network (optimal radix?). Multilevel lookahead. Example: 16-bit addition, done either as radix-16 (four digits) or as two-level carry lookahead (four 4-bit blocks). Either way, the carries c_4, c_8, and c_12 are determined first. [Figure: carries c_16 (c_out) down to c_0 (c_in), with c_12, c_8 and c_4 marked.]

91 One-Level carry Lookahead Adder Source: Ercegovac and Lang, Digital Arithmetic, pp.72. Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 30

92 Block Generate and Propagate Signals
$g_{[i,i+3]} = g_{i+3} \lor g_{i+2} p_{i+3} \lor g_{i+1} p_{i+2} p_{i+3} \lor g_i p_{i+1} p_{i+2} p_{i+3}$
$p_{[i,i+3]} = p_i p_{i+1} p_{i+2} p_{i+3}$
Note: the block signals are unrelated to $c_i$. The carries follow from them as $c_k = g_{[0,k-1]} \lor c_0\, p_{[0,k-1]}$ and $c_{i+4} = g_{[i,i+3]} \lor c_i\, p_{[i,i+3]}$. [Figure: 4-bit lookahead carry generator with inputs (g_{i+3}, p_{i+3}) ... (g_i, p_i) and c_i, and outputs c_{i+3}, c_{i+2}, c_{i+1}, g_[i,i+3], p_[i,i+3].]
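
To make the role of the block signals concrete, the sketch below derives the block carries c_4, c_8, c_12, c_16 of a 16-bit addition from per-block (g, p) pairs. The names `block_gp` and `block_carries_16` are assumptions for illustration. The loop evaluates $c_{i+4} = g_{[i,i+3]} \lor c_i p_{[i,i+3]}$ block by block for brevity; a second-level lookahead generator in hardware would unroll these expressions and evaluate them in parallel.

```python
def block_gp(g, p):
    """Block generate/propagate of a 4-bit block (lists are LSB first)."""
    bg = g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3]) | (g[0] & p[1] & p[2] & p[3])
    bp = p[0] & p[1] & p[2] & p[3]
    return bg, bp

def block_carries_16(x, y, c0=0):
    """Return [c0, c4, c8, c12, c16] for 16-bit unsigned operands x and y."""
    g = [(x >> i) & (y >> i) & 1 for i in range(16)]
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(16)]
    carries = [c0]
    c = c0
    for b in range(4):
        bg, bp = block_gp(g[4 * b:4 * b + 4], p[4 * b:4 * b + 4])
        c = bg | (c & bp)        # c_{i+4} = g_[i,i+3] OR c_i p_[i,i+3]
        carries.append(c)
    return carries
```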

93 4-bit Lookahead Carry Generator p [i,i+3] g [i,i+3] p i+3 Block Signal Generation Intermediate Carries g i+3 c i+3 p i+2 g i+2 c i+2 p i+1 c i+1 c i g i+1 p i g i Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 32

94 A Two-Level Carry-Lookahead Adder (64 bits) [Figure: each 16-bit carry-lookahead adder combines (g_[0,3], p_[0,3]), (g_[4,7], p_[4,7]), (g_[8,11], p_[8,11]), (g_[12,15], p_[12,15]) in a 4-bit lookahead carry generator; a further 4-bit lookahead carry generator combines (g_[0,15], p_[0,15]), (g_[16,31], p_[16,31]), (g_[32,47], p_[32,47]), (g_[48,63], p_[48,63]) into g_[0,63], p_[0,63].] In a 16-bit CLA, c_4, c_8 and c_12 are the c_{i+1}, c_{i+2} and c_{i+3}, respectively, of the previous slide. $c_k = g_{[0,k-1]} \lor c_0\, p_{[0,k-1]}$.

95 Latency of a 16-bit 2-Level Carry-Lookahead Adder (1/2)
(Level 1) g and p for individual bit positions: 1 gate level
(Level 1) g and p signals for 4-bit blocks, i.e. g[0,3], p[0,3] ... g[12,15], p[12,15]: 2 gate levels
(Level 2) Block carry-in signals c_4, c_8, and c_12, plus g[0,15], p[0,15] (and C 15 if required): 2 gate levels
(Level 1) Internal carries within 4-bit blocks c_1, c_2, c_3, c_5, ...: 2 gate levels
(Level 1) Sum bits (XOR): 2 gate levels

96 Latency of a 16-bit 2-Level Carry-Lookahead Adder (2/2) Total latency for the 16-bit adder is 9 gate levels. Each additional lookahead level adds 4 gate levels of latency (yellow block in the previous slide). Latency for a k-bit CLA adder: $4 \log_4 k + 1$ gate levels.

97 Combining of g and p signals [Figure: block B" spans positions j_0 to j_1 with signals (g", p"), block B' spans i_0 to i_1 with signals (g', p'); a combining cell produces (g, p) for block B.] $g = g'' \lor g' p''$, $p = p' p''$. Combining of g and p signals of two (contiguous or overlapping) blocks B' and B" of arbitrary widths into the g and p signals for block B.

98 Formulating the Prefix Computation Problem The problem of carry determination can be formulated as: given $(g_0, p_0), (g_1, p_1), \dots, (g_{k-2}, p_{k-2}), (g_{k-1}, p_{k-1})$, find $(g_{[0,0]}, p_{[0,0]}), (g_{[0,1]}, p_{[0,1]}), \dots, (g_{[0,k-2]}, p_{[0,k-2]}), (g_{[0,k-1]}, p_{[0,k-1]})$, which correspond to $c_1, c_2, \dots, c_{k-1}, c_k$. Carry-in can be viewed as an extra (−1) position: $(g_{-1}, p_{-1}) = (c_{in}, 0)$. The desired pairs are found by evaluating all prefixes of $(g_0, p_0)$ ¢ $(g_1, p_1)$ ¢ ... ¢ $(g_{k-2}, p_{k-2})$ ¢ $(g_{k-1}, p_{k-1})$, where ¢ denotes the carry operator. The carry operator is associative, but not commutative: $[(g_1, p_1)$ ¢ $(g_2, p_2)]$ ¢ $(g_3, p_3) = (g_1, p_1)$ ¢ $[(g_2, p_2)$ ¢ $(g_3, p_3)]$. Prefix sums analogy: given $x_0, x_1, x_2, \dots, x_{k-1}$, find $x_0,\ x_0 + x_1,\ x_0 + x_1 + x_2,\ \dots,\ x_0 + x_1 + \dots + x_{k-1}$.
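
The carry operator and its prefix evaluation can be sketched in a few lines. This example assumes the combining rule $g = g'' \lor g' p''$, $p = p' p''$ with the less significant pair as the first argument; `carry_op` and `prefix_carries` are illustrative names. The scan here is sequential, so it shows what is computed rather than how a parallel network schedules it.

```python
def carry_op(lo, hi):
    """Combine (g', p') of the less significant block with (g'', p'') of the more significant one."""
    g_lo, p_lo = lo
    g_hi, p_hi = hi
    return (g_hi | (g_lo & p_hi), p_lo & p_hi)

def prefix_carries(gp, c_in=0):
    """Given [(g_0, p_0), ..., (g_{k-1}, p_{k-1})], return [c_1, ..., c_k]."""
    acc = (c_in, 0)              # extra position -1: (g_-1, p_-1) = (c_in, 0)
    carries = []
    for pair in gp:
        acc = carry_op(acc, pair)
        carries.append(acc[0])   # the g component of the prefix is the carry
    return carries
```

Associativity is what lets a hardware network regroup the operations into a tree; the lack of commutativity means the operand order must be preserved.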

99 Prefix-Based Carry Network [Figure: a four-input prefix sums network, with inputs x_3 ... x_0 combined by adders in scan order, shown beside a four-bit carry lookahead network, with inputs (g_3, p_3) ... (g_0, p_0) combined by the carry operator. Outputs of the latter: (g_[0,3], p_[0,3]) = (c_4, --), (g_[0,2], p_[0,2]) = (c_3, --), (g_[0,1], p_[0,1]) = (c_2, --), (g_[0,0], p_[0,0]) = (c_1, --).]

100 Parallel Prefix Sums Network Built of Two k/2-Input Networks and k/2 Adders (Ladner-Fischer) [Figure: inputs x_{k−1} ... x_{k/2} and x_{k/2−1} ... x_0 each feed a k/2-input prefix sums network; the top output s_{k/2−1} of the lower half is added to every output of the upper half, giving s_{k−1} ... s_{k/2}.] Recursive dividing; incurs large fanout. Delay recurrence: $D(k) = D(k/2) + 1 = \log_2 k$. Cost recurrence: $C(k) = 2C(k/2) + k/2 = (k/2) \log_2 k$.

101 Note: a here is t in the textbook. Source: Ercegovac and Lang, Digital Arithmetic, p. 81.

102 Eliminate Large Fanout Increase the number of levels Increase the number of cells Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 41

103 The Brent-Kung Recursive Construction [Figure: parallel prefix sums network built of one k/2-input network and k − 1 adders; inputs x_{k−1} ... x_0 are first added in pairs, the pair sums feed the k/2-input prefix sums network, and a final row of adders fills in the remaining outputs s_{k−1}, s_{k−2}, ..., s_1, s_0.] Delay recurrence: $D(k) = D(k/2) + 2 = 2 \log_2 k - 1$ (−2 really). Cost recurrence: $C(k) = C(k/2) + k - 1 = 2k - 2 - \log_2 k$.
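
The recursive construction can be modelled in software by counting adder cells and tracking the depth of every signal. The sketch below assumes k is a power of 2 and uses integer addition as the associative operator; `brent_kung` and the `stats` dictionary are hypothetical names chosen for this illustration.

```python
def brent_kung(values, stats):
    """Prefix sums of `values`, a list of (value, depth) pairs.

    stats["cells"] accumulates the number of two-input adders used.
    Assumes len(values) is a power of 2.
    """
    k = len(values)
    if k == 1:
        return list(values)

    def add(a, b):
        stats["cells"] += 1
        return (a[0] + b[0], max(a[1], b[1]) + 1)

    # First row: add neighbouring pairs (k/2 adders).
    pairs = [add(values[2 * i], values[2 * i + 1]) for i in range(k // 2)]
    # One k/2-input prefix network on the pair sums.
    inner = brent_kung(pairs, stats)
    out = [None] * k
    out[0] = values[0]
    for i in range(k // 2):
        out[2 * i + 1] = inner[i]                       # odd outputs come for free
        if i > 0:
            out[2 * i] = add(inner[i - 1], values[2 * i])  # k/2 - 1 adders
    return out
```

For k = 16 this model uses 26 cells and its deepest output sits at depth 6, in line with $2k - 2 - \log_2 k$ and the "−2 really" remark on the delay.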

104 Brent-Kung Carry Network (8-Bit Adder) [Figure: inputs (g, p) for [7,7], [6,6], ..., [1,1], [0,0]; the first row of cells forms [6,7], [4,5], [2,3], [0,1]; the next forms [4,7] and [0,3]; the remaining cells produce the outputs [0,7], [0,6], [0,5], [0,4], [0,3], [0,2], [0,1], [0,0].]

105 Source: Ercegovac and Lang, Digital Arithmetic, pp.83 Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 44

106 Brent-Kung Carry Network (16-Bit Adder) [Figure: 16-input Brent-Kung network producing s_15 ... s_0, with its levels marked to show the reason for the latency being about 2 log_2 k levels.]

107 Kogge-Stone Carry Network (16-Bit Adder) Cost formula: $C(k) = (k-1) + (k-2) + (k-4) + \dots + (k - k/2) = k \log_2 k - k + 1$. Delay: $\log_2 k$ levels (minimum possible). [Figure: 16-input Kogge-Stone network producing s_15 ... s_0.]
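
A Kogge-Stone style scan is easy to simulate: at the level with distance d, every position i ≥ d combines with position i − d. The sketch below is a software model with integer addition standing in for the carry operator; the function name `kogge_stone` is an assumption, and k is taken to be a power of 2 for the cost formula to apply.

```python
def kogge_stone(values):
    """Return (prefix_sums, cell_count, level_count) for a Kogge-Stone style scan."""
    k = len(values)
    cur = list(values)
    cells = 0
    levels = 0
    d = 1
    while d < k:
        nxt = list(cur)
        for i in range(d, k):
            nxt[i] = cur[i - d] + cur[i]   # one cell per position i >= d
            cells += 1
        cur = nxt
        levels += 1
        d *= 2
    return cur, cells, levels
```

With 16 inputs the model uses 15 + 14 + 12 + 8 = 49 cells in 4 levels, matching $k \log_2 k - k + 1$.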

108 Source: Ercegovac and Lang, Digital Arithmetic, pp.84 Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 47

109 Speed-Cost Tradeoffs in Carry Networks
Method | Delay | Cost
Ladner-Fischer | $\log_2 k$ | $(k/2) \log_2 k$
Kogge-Stone | $\log_2 k$ | $k \log_2 k - k + 1$
Brent-Kung | $2 \log_2 k - 2$ | $2k - 2 - \log_2 k$
Improving the Ladner-Fischer design: [Figure: the two k/2-input prefix sums networks of the Ladner-Fischer construction, with some outputs marked.] These outputs can be produced one time unit later without increasing the overall latency. This strategy saves enough to make the overall cost linear (best possible).

110 Hybrid B-K/K-S Carry Network (16-Bit Adder) Brent-Kung: 6 levels, 26 cells. Kogge-Stone: 4 levels, 49 cells. Hybrid: 5 levels, 32 cells. [Figure: three 16-input networks producing s_15 ... s_0: pure Brent-Kung, pure Kogge-Stone, and a hybrid with Brent-Kung levels at the top and bottom and Kogge-Stone levels in the middle.]

111 Four-Bit Manchester Carry Chains (Transistor Level) g 3 PH2 PH2 g 3 PH2 PH2 g [0,3] p 3 g 2 PH2 PH2 p 3 g 2 PH2 PH2 p [0,3] g [0,2] p 2 g 1 PH2 p 2 g 1 PH2 PH2 p [0,2] g [0,1] p 1 g 0 PH2 p 1 g 0 PH2 PH2 p [0,1] g [0,3] p 0 p 0 PH2 p [0,3] PH2 PH2 (a) (b) Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 50

112 Variations in Fast Adders Chapter Goals Study alternatives to the carry-lookahead method for designing fast adders Chapter Highlights Many methods besides CLA are available (both competing and complementary) Best design is technology-dependent (often hybrid rather than pure) Knowledge of timing allows optimizations Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 51

113 Simple Carry-Skip Adders [Figure: (a) ripple-carry adder made of four 4-bit blocks with carries c_16, c_12, c_8, c_4, c_0; (b) simple carry-skip adder, in which each 4-bit block of ripple-carry stages also has skip logic (2 gates) controlled by the block propagate signals p_[12,15], p_[8,11], p_[4,7], p_[0,3].]

114 Carry-Skip Adder Using MUX Source: Ercegovac and Lang, Digital Arithmetic, pp.66. Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 53

115 Another View of Carry-Skip Addition g 4j+3 p 4j+3 g 4j+2 p 4j+2 g 4j+1 p 4j+1 g 4j p 4j c 4j+4 c 4j+3 c 4j+2 c 4j+1 c 4j One-way street Freeway Street/freeway analogy for carry-skip adder. Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 54

116 Carry-Skip Adder with Fixed Block Size Block width b; k/b blocks to form a k-bit adder (assume b divides k). $T_{\text{fixed-skip-add}} = (b-1) + 0.5 + (k/b - 2) + (b-1)$, where the four terms are the ripple in block 0, the OR gate, the skips, and the ripple in the last block. Hence $T \approx 2b + k/b - 3.5$ stages (1 stage = 2 gate levels). Setting $dT/db = 2 - k/b^2 = 0$ gives $b^{\text{opt}} = \sqrt{k/2}$ and $T^{\text{opt}} = 2\sqrt{2k} - 3.5$. Example: k = 32, $b^{\text{opt}} = 4$, $T^{\text{opt}} = 12.5$ stages (contrast with 32 stages for a ripple-carry adder).
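
The delay model can be evaluated for a few block widths to see where the minimum falls. This sketch assumes the four-term model (ripple in block 0, 0.5 stage for the OR gate, k/b − 2 skips, ripple in the last block); the names `fixed_skip_delay` and `optimal_block` are made up for the example.

```python
import math

def fixed_skip_delay(k: int, b: int) -> float:
    """Worst-case delay, in stages, of a k-bit carry-skip adder with b-bit blocks."""
    return (b - 1) + 0.5 + (k / b - 2) + (b - 1)

def optimal_block(k: int):
    """Continuous optimum: b_opt = sqrt(k/2), T_opt = 2 sqrt(2k) - 3.5."""
    return math.sqrt(k / 2), 2 * math.sqrt(2 * k) - 3.5
```

For k = 32 the divisors 1, 2, 4, 8, 16, 32 give 30.5, 16.5, 12.5, 16.5, 30.5 and 60.5 stages respectively, so b = 4 is the best of them, as the closed form predicts.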

117 Worst Case Delay Source: Ercegovac and Lang, Digital Arithmetic, pp Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 56

118 Worst case in block C 0 =0 Worst case in last block C 12 =1 Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 57

119 Carry-Skip Adder with Variable-Width Blocks (1/2) [Figure: blocks of widths b_{t−1}, b_{t−2}, ..., b_1, b_0 with ripple and skip segments and three marked carry paths (1), (2), (3).] Carry path (2) goes through one fewer skip than (1), so block t − 2 can be one bit wider than block t − 1 without increasing the total delay. Carry path (3) goes through one fewer skip than (1), so block 1 can be one bit wider than block 0 without increasing the total delay.

120 Carry-Skip Adder with Variable-Width Blocks (2/2) With block widths b, b + 1, ..., b + t/2 − 1, b + t/2 − 1, ..., b + 1, b, the total number of bits in the t blocks is k: $2[b + (b+1) + \dots + (b + t/2 - 1)] = t(b + t/4 - 1/2) = k$, so $b = k/t - t/4 + 1/2$. Then $T_{\text{var-skip-add}} = 2(b-1) + 0.5 + t - 2 = 2k/t + t/2 - 2.5$. Setting $dT/dt = -2k/t^2 + 1/2 = 0$ gives $t^{\text{opt}} = 2\sqrt{k}$ and $T^{\text{opt}} = 2\sqrt{k} - 2.5$ (roughly a factor of $\sqrt{2}$ smaller than for fixed-width blocks). Note: let b = 1.

121 Multilevel Carry-Skip Adders c out c in S 1 S 1 S 1 S 1 S 1 c out c in S 1 S 1 S 1 S 1 S 1 S 2 c out c in S 1 S 1 S 1 S 2 Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 60

122 Single-Level Carry-Skip Adder (Example 7.1) Assumptions: each of the following takes one unit of time: generation of g_i and p_i, generation of a level-i skip signal from level-(i − 1) skip signals, ripple, skip, and formation of a sum bit once the incoming carry is known. Build the widest possible one-level carry-skip adder with a total delay of 8. [Figure: blocks b_6 ... b_0 with skip logic S_1, c_in at time 0 and c_out at time 8.] At the right end, block width is limited by the output timing requirement. Stage b_0 takes 2 time units: one for generating g and p and the other for generating the carry. Stage b_1 cannot be more than 3 bits, because its output is available at time 3, so it can take one time unit for generating g and p and two for propagation across 2 bits.

123 At the left end, block width is limited by input timing. Stage b_4 cannot be more than 3 bits, because its input becomes available at time 5 and the total adder delay is to be 8 units. Max adder width = 18. Generalization of Example 7.1 for total time T (even or odd): the block widths rise to T/2 for even T, or to (T + 1)/2 for odd T, and fall again. Thus, for any T, the total width is $\lfloor (T+1)^2/4 \rfloor - 2$.

124 c Two-Level Carry-Skip Adder (1/2) Eample 7.2 Given the delay pair {β, α} for a level-2 block in Fig. 7.7a, the number of level-1 blocks that can be accommodated is γ = min(β 1, α) α out b α 1 b α 2 b2 b1 α 1 α S 1 S 1 S 1 S 1 S 1 S 1 1 b0 S 1 cin 0 c out Single-level carry-skip adder with T assimilate = α β b β 2 b β 3 β 1 β 2 4 b 2 3 b 1 2 b 0 1 cin S 1 S 1 S 1 S 1 S 1 S 1 S 1 Single-level carry-skip adder with T produce = β Width of the ith level-1 block in the level-2 block characterized by {β, α} is b i = min(β γ + i + 1, α i); the total block width is then i=0 to γ 1 b i Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 63

125 Two-Level Carry-Skip Adder (2/2) c out 8 Tproduce T assimilate {8, 1} {7, 2} {6, 3} {5, 4} {4, 5} {3, 8} c bf be bd bc bb ba S2 S2 S2 S2 S2 (a) F Block E Block D Block C Block B Block A in c out t=8 2 c in 2 t= Ma adder width = 30 ( ) Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 64

126 Carry-Skip Adder Optimization Scheme Block of b full-adder units I(b) G(b) A(b) Inputs S (b) h E (b) h Level-h skip Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 65

127 Carry-Select Adders [Figure: carry-select adder for k-bit numbers built from three k/2-bit adders. One adder handles the low k/2 bits with c_in; two adders handle the high k/2 bits, one with carry-in 0 and one with carry-in 1; a mux controlled by c_{k/2} selects the (k/2 + 1)-bit result that supplies the high k/2 bits and c_out.] $C_{\text{select-add}}(k) = 3C_{\text{add}}(k/2) + k/2 + 1$. $T_{\text{select-add}}(k) = T_{\text{add}}(k/2) + 1$.
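
The selection idea can be mimicked in software: compute the upper half twice, once for each possible carry-in, and pick one result when the lower half's carry is known. This is a behavioural sketch assuming an even k and unsigned k-bit operands; `carry_select_add` is an illustrative name, and Python's `+` stands in for the three k/2-bit adders.

```python
def carry_select_add(x: int, y: int, k: int, c_in: int = 0):
    """Return (sum, c_out) of two k-bit operands using the carry-select idea (k even)."""
    h = k // 2
    mask = (1 << h) - 1
    # Low half: one k/2-bit adder with the real carry-in.
    lo = (x & mask) + (y & mask) + c_in
    c_mid = lo >> h
    # High half: two k/2-bit adders working in parallel on both hypotheses.
    hi_if_0 = (x >> h) + (y >> h)
    hi_if_1 = (x >> h) + (y >> h) + 1
    # Mux: c_{k/2} selects the (k/2 + 1)-bit result.
    hi = hi_if_1 if c_mid else hi_if_0
    total = ((hi & mask) << h) | (lo & mask)
    return total, hi >> h
```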

128 Two-level Carry-Select Adder Built of k/4-bit adders k - 1 3k/4 0 k/4-bit adder 1 3k/4-1 k/2 0 k/4-bit adder 1 k/2-1 k/4 k/ k/4-bit adder 1 k/4-bit adder c in k/4+1 k/4+1 k/4 k/4 k/4+1 k/4+1 k/4 1 0 Mu 1 0 Mu c k/4 1 0 Mu k/2+1 c k/2 k/4 c out, High k/2 bits Middle k/4 bits Low k/4 bits k/2-bit conditional-sum Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 67

129 Conditional Adder Source: Ercegovac and Lang, Digital Arithmetic, pp.86 Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 68

130 Carry Select Adder Source: Ercegovac and Lang, Digital Arithmetic, pp.87 Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 69

131 Conditional Sum Adder Source: Ercegovac and Lang, Digital Arithmetic, pp.87 Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 70

132 16-Bit Conditional Sum Adder The same as the corresponding figure in the textbook. Source: Ercegovac and Lang, Digital Arithmetic, p. 89.

133 Conditional-Sum Adder Multilevel carry-select idea carried out to the extreme (to 1-bit blocks). $C(k) \approx 2C(k/2) + k + 2 \approx k(\log_2 k + 2) + k\,C(1)$. $T(k) = T(k/2) + 1 = \log_2 k + T(1)$, where C(1) and T(1) are the cost and delay of the following circuit for deriving the sum and carry bits with a carry-in of 0 and 1. [Figure: one-bit cell taking x_i and y_i and producing s_i and c_{i+1} for c_i = 1, and s_i and c_{i+1} for c_i = 0.] k + 2 is an upper bound on the number of single-bit 2-to-1 multiplexers needed for combining two k/2-bit adders into a k-bit adder.

134 A Hybrid Carry-Lookahead/Carry-Select Adder The most popular hybrid addition scheme: [Figure: a lookahead carry generator receives c_in and the g, p signals from the carry-select blocks and drives the muxes that select each block's result; the last mux also delivers c_out.]

135 Summary Source: Ercegovac and Lang, Digital Arithmetic, pp.114. Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 74

136 A Hybrid Ripple-Carry/Carry-Lookahead Design c 48 c 32 c 16 c c c c g [12,15] g [8,11] g [4,7] g [0,3] p [12,15] p [8,11] p [4,7] p [0,3] 4-Bit Lookahead Carry Generator (with carry-out) 16-bit Carry-Lookahead Adder Any Two Addition Schemes Can Be Combined Other possibilities: hybrid carry-select/ripple-carry hybrid ripple-carry/carry-select... Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 75

137 Optimizations in Fast Adders What looks best at the block diagram or gate level may not be best when a circuit-level design is generated (effects of wire length, signal loading, ...). Modern practice: optimization at the transistor level. Variable-block carry-lookahead adder. Optimizations for average or peak power consumption. Timing-based optimizations (next slide).

138 Multioperand Addition Chapter Goals: Learn methods for speeding up the addition of several numbers (needed for multiplication or inner-product). Chapter Highlights: Running total kept in redundant form. Current total + next number → new total. Deferred carry assimilation. Wallace/Dadda trees and parallel counters.

139 Some Applications of Multioperand Addition a a 1 a a a p p p p p p p p s (0) (1) (2) (3) (4) (5) (6) Multioperand addition problems for multiplication or inner-product computation in dot notation. Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 78

140 Serial Implementation with One Adder [Figure: the k-bit operand $x^{(i)}$ and the partial sum register of k + log_2 n bits, holding $\sum_{j=0}^{i-1} x^{(j)}$, feed one adder whose output returns to the register.] $T_{\text{serial-multi-add}} = O(n \log(k + \log n)) = O(n \log k + n \log\log n)$. Therefore, addition time grows superlinearly with n when k is fixed and logarithmically with k for a given n.

141 Pipelined Adder Source: Ercegovac and Lang, Digital Arithmetic, pp.166. Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 80

142 Parallel Implementation as Tree of Adders [Figure: adding 7 numbers in a binary tree of adders; n − 1 adders in ⌈log_2 n⌉ adder levels, with widths growing from k to k + 1, k + 2 and k + 3 bits.] $T_{\text{tree-fast-multi-add}} = O(\log k + \log(k+1) + \dots + \log(k + \lceil \log_2 n \rceil - 1)) = O(\log n \log k + \log n \log\log n)$. $T_{\text{tree-ripple-multi-add}} = O(k + \log n)$ [justified on the next slide].

143 Elaboration on Tree of Ripple-Carry Adders [Fig. 8.5: ripple-carry adders at levels i and i + 1 in the tree of adders used for multi-operand addition; the FA and HA cells at level i deliver their bits at times t + 1, t + 2, ..., and those at level i + 1 at times t + 2, t + 3, ...] $T_{\text{tree-ripple-multi-add}} = O(k + \log n)$. The absolute best latency that we can hope for is O(log k + log n): there are kn data bits to process and, using any set of computation elements with constant fan-in, this requires O(log(kn)) time. We will see shortly that carry-save adders achieve this optimum time.

144 Carry-Save Adders [Figure: a ripple-carry adder is a chain of FAs linked by carries; cutting the carry links turns the same row of FAs into a carry-save adder.] [Figure: a carry-propagate adder (with c_in and c_out) and a carry-save adder (CSA), also called a (3; 2)-counter or 3-to-2 reduction circuit, in dot notation.] [Figure: specifying full-adder and half-adder blocks, with their inputs and outputs, in dot notation.]

145 Example of CSA Reduction by row: (3:2) counter, 3 rows into 2. This can also be considered as reduction by column: [3:2]. A [p:q] counter takes p bits of the same weight and produces q bits of adjacent weights.

146 Use Dot Notation c in Carry-propagate adder c out Carry-save adder (CSA) or (3; 2)-counter or 3-to-2 reduction circuit Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 85

147 Multioperand Addition Using Carry-Save Adders $T_{\text{carry-save-multi-add}} = O(\text{tree height} + T_{\text{CPA}}) = O(\log n + \log k)$. $C_{\text{carry-save-multi-add}} = (n-2)\,C_{\text{CSA}} + C_{\text{CPA}}$. [Figure: serial carry-save addition using a single CSA, with input, sum register, carry register, and a carry-propagate adder producing the output.] [Figure: tree of carry-save adders reducing seven numbers to two, followed by a CPA.]
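
A carry-save adder is just a row of independent full adders, so it can be modelled with bitwise operations on whole words. The sketch below reduces a list of operands to two with CSAs and then performs one carry-propagate addition. The names `csa` and `multi_add` are assumptions, the grouping order is one arbitrary choice among many, and Python's unbounded integers sidestep the word-width bookkeeping a real design needs.

```python
def csa(a: int, b: int, c: int):
    """3-to-2 reduction: returns (sum_word, carry_word) with a + b + c == sum + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def multi_add(operands):
    """Return (total, csa_count, levels) for adding the operands through a CSA tree."""
    ops = list(operands)
    csa_count = 0
    levels = 0
    while len(ops) > 2:
        nxt = []
        # Group as many triples as possible at this level; leftovers pass through.
        while len(ops) >= 3:
            nxt.extend(csa(ops.pop(), ops.pop(), ops.pop()))
            csa_count += 1
        nxt.extend(ops)
        ops = nxt
        levels += 1
    return sum(ops), csa_count, levels   # final sum() plays the role of the CPA
```

For seven operands the model uses 5 = n − 2 CSAs in 4 levels before the final carry-propagate addition.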

148 Reduction by a CSA Tree [Figure: addition of seven 6-bit numbers in dot notation, reduced with 12 FAs, then 6 FAs, then 6 FAs, then 4 FAs + 1 HA, followed by a 7-bit carry-propagate adder.] [Table: the same seven-operand addition in tabular form, giving the number of dots in each bit position after every level.] Total cost = 7-bit adder + 28 FAs + 1 HA. A full-adder compacts 3 dots into 2 (compression ratio of 1.5). A half-adder rearranges 2 dots (no compression, but still useful).

149 Width of Adders in a CSA Tree [Figure: adding seven k-bit numbers and the CSA/CPA widths required; five k-bit CSAs followed by a k-bit CPA, with signals labeled by index pairs such as [0, k − 1], [1, k], [2, k + 1], and a (k + 2)-bit result.] The index pair [i, j] means that bit positions from i up to j are involved. Bit k + 1 does not involve addition. Due to the gradual retirement (dropping out) of some of the result bits, CSA widths do not vary much as we go down the tree levels.

150 Wallace and Dadda Trees [Figure: an h-level CSA tree reducing n inputs to 2 outputs.] $h(n) = 1 + h(\lceil 2n/3 \rceil)$. $n(h) = \lfloor 3n(h-1)/2 \rfloor$. $2 \times 1.5^{h-1} < n(h) \le 2 \times 1.5^{h}$. Table 8.1: the maximum number n(h) of inputs for an h-level CSA tree.
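
The two recurrences are simple enough to tabulate directly. The sketch below assumes the base cases n(0) = 2 and h(n) = 0 for n ≤ 2 (two operands need no CSA level); the names `max_inputs` and `min_levels` are illustrative.

```python
def max_inputs(h: int) -> int:
    """n(h): largest number of operands an h-level CSA tree can reduce to two."""
    n = 2                      # assumed base case: zero levels handle two operands
    for _ in range(h):
        n = (3 * n) // 2       # n(h) = floor(3 n(h-1) / 2)
    return n

def min_levels(n: int) -> int:
    """h(n): number of CSA levels needed to reduce n operands to two."""
    h = 0
    while n > 2:
        n = (2 * n + 2) // 3   # ceil(2n/3) operands remain after one level
        h += 1
    return h
```

Evaluating `max_inputs` for h = 0 to 7 gives 2, 3, 4, 6, 9, 13, 19, 28, so seven operands need four levels.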

151 Wallace and Dadda Reduction Trees Wallace tree: reduce the number of operands at the earliest possible opportunity. [Figure: addition of seven 6-bit numbers using Wallace's strategy: 12 FAs, 6 FAs, 6 FAs, 4 FAs + 1 HA, then a 7-bit adder. Total cost = 7-bit adder + 28 FAs + 1 HA.] Dadda tree: postpone the reduction to the extent possible without causing added delay, guided by the n(h) table. [Figure: adding seven 6-bit numbers using Dadda's strategy: 6 FAs, 11 FAs, 7 FAs, 4 FAs + 1 HA, then a 7-bit adder. Total cost = 7-bit adder + 28 FAs + 1 HA.]

152 A Small Optimization in Reduction Trees [Figure: adding seven 6-bit numbers using Dadda's strategy: 6 FAs, 11 FAs, 7 FAs, 4 FAs + 1 HA, then a 7-bit adder. Total cost = 7-bit adder + 28 FAs + 1 HA.] [Figure: the same addition taking advantage of the final adder's carry-in: 6 FAs, 11 FAs, 6 FAs + 1 HA, 3 FAs + 2 HA, then a 7-bit adder. Total cost = 7-bit adder + 26 FAs + 3 HA.]

153 Parallel Counters 1-bit full-adder = (3; 2)-counter. Circuit reducing 7 bits to their 3-bit sum = (7; 3)-counter. Circuit reducing n bits to their $\lceil \log_2(n+1) \rceil$-bit sum = $(n; \lceil \log_2(n+1) \rceil)$-counter. [Figure: a 10-input parallel counter, also known as a (10; 4)-counter, built of FAs, HAs and a 3-bit ripple-carry adder.]

154 Implementation of [4:2] Counter Source: Ercegovac and Lang, Digital Arithmetic, pp.145. Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 93

155 Implementation of [5:2] Counter Source: Ercegovac and Lang, Digital Arithmetic, pp.146. Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 94

156 Implementation of [7:2] Counter Source: Ercegovac and Lang, Digital Arithmetic, pp.146. Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 95

157 Generalized Parallel Counters Multicolumn reduction... (5, 5; 4)-counter Dot notation for a (5, 5; 4)-counter and the use of such counters for reducing five numbers to two numbers. Unequal columns (2, 3; 3)-counter Gen. parallel counter = Parallel compressor Computer Arithmetic 2, Dept. of EE, Fu Jen Catholic University, Taiwan 96

158 A General Strategy for Column Compression (n; 2)-counters. [Figure: one circuit slice i receives n inputs plus ψ_1, ψ_2, ψ_3, ... transfer bits from slices i − 1, i − 2, i − 3, ..., and sends ψ_1, ψ_2, ψ_3, ... transfer bits to slices i + 1, i + 2, i + 3, ...] Constraint: $n + \psi_1 + \psi_2 + \psi_3 + \dots \le 3 + 2\psi_1 + 4\psi_2 + 8\psi_3 + \dots$, that is, $n - 3 \le \psi_1 + 3\psi_2 + 7\psi_3 + \dots$. Example: design a bit-slice of an (11; 2)-counter. Solution: let's limit transfers to two stages. Then $8 \le \psi_1 + 3\psi_2$. Possible choices include ψ_1 = 5, ψ_2 = 1 or ψ_1 = ψ_2 = 2.

159 Multiplication Instructor: Kuan Jen Lin Dept. of EE, FJU, Taiwan Room: SF 727B Most slides originate from the textbook author's PowerPoint presentation files. Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan 1

160 III Multiplication Review multiplication schemes and various speedup methods. Multiplication is heavily used (in arithmetic and array indexing). Division = reciprocation + multiplication. Multiplication speedup: high-radix, tree, ... Bit-serial, modular, and array multipliers. Topics in This Part: Chapter 9 Basic Multiplication Schemes; Chapter 10 High-Radix Multipliers; Chapter 11 Tree and Array Multipliers; Chapter 12 Variations in Multipliers.

161 9 Basic Multiplication Schemes Chapter Goals: Study shift/add or bit-at-a-time multipliers and set the stage for faster methods and variations to be covered in Chapters 10-12. Chapter Highlights: Multiplication = multioperand addition. Hardware, firmware, software algorithms. Multiplying 2's-complement numbers. The special case of one constant operand.

162 Shift/Add Multiplication Algorithms Notation for our discussion of multiplication algorithms: a = multiplicand, $a_{k-1} a_{k-2} \dots a_1 a_0$; x = multiplier, $x_{k-1} x_{k-2} \dots x_1 x_0$; p = product (a × x), $p_{2k-1} p_{2k-2} \dots p_3 p_2 p_1 p_0$. Initially, we assume unsigned operands. [Figure: multiplication of two 4-bit unsigned binary numbers in dot notation; the multiplicand a and the multiplier x give the partial products bit-matrix with rows $x_0 a 2^0$, $x_1 a 2^1$, $x_2 a 2^2$, $x_3 a 2^3$, which are summed into the product p.]

163 Multiplication Recurrence [Figure: dot-notation multiplication with partial products $x_0 a 2^0$, $x_1 a 2^1$, $x_2 a 2^2$, $x_3 a 2^3$.] Multiplication with right shifts (preferred), top-to-bottom accumulation: $p^{(j+1)} = (p^{(j)} + x_j\, a\, 2^k)\, 2^{-1}$ (add, then shift right), with $p^{(0)} = 0$ and $p^{(k)} = p = ax + p^{(0)} 2^{-k}$. Multiplication with left shifts, bottom-to-top accumulation: $p^{(j+1)} = 2p^{(j)} + x_{k-j-1}\, a$ (shift, then add), with $p^{(0)} = 0$ and $p^{(k)} = p = ax + p^{(0)} 2^{k}$.
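
Both recurrences can be run directly on integers. The sketch below assumes unsigned k-bit operands; the names `multiply_right_shift` and `multiply_left_shift` are hypothetical. In the right-shift version the integer shift is exact, because after j steps the partial product is a multiple of $2^{k-j}$.

```python
def multiply_right_shift(a: int, x: int, k: int) -> int:
    """p(j+1) = (p(j) + x_j * a * 2^k) / 2, consuming multiplier bits LSB first."""
    p = 0
    for j in range(k):
        xj = (x >> j) & 1
        p = (p + xj * (a << k)) >> 1   # add to the upper half, then shift right
    return p

def multiply_left_shift(a: int, x: int, k: int) -> int:
    """p(j+1) = 2 * p(j) + x_{k-j-1} * a, consuming multiplier bits MSB first."""
    p = 0
    for j in range(k):
        bit = (x >> (k - j - 1)) & 1
        p = 2 * p + bit * a            # shift left, then add
    return p
```

The right-shift form only ever adds into the upper k bits of the partial product, which is why a k-bit adder suffices there, whereas the left-shift form needs a 2k-bit adder.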

164 Examples of Basic Multiplication [Tables: step-by-step traces of the right-shift algorithm and of the left-shift algorithm for the same operands a and x, each listing the partial products p^(0) through p^(4) and the addends x_j a.]

165 Programmed Using Right-Shift Algorithm
{Using right shifts, multiply unsigned m_cand and m_ier, storing the resultant 2k-bit product in p_high and p_low.
 Registers: R0 holds 0, Rc for counter, Ra for m_cand (multiplicand), Rx for m_ier (multiplier), Rp for p_high (product, high), Rq for p_low (product, low)}
{Load operands into registers Ra and Rx}
  mult:    load    Ra with m_cand
           load    Rx with m_ier
{Initialize partial product and counter}
           copy    R0 into Rp
           copy    R0 into Rq
           load    k into Rc
{Begin multiplication loop}
  m_loop:  shift   Rx right 1        {LSB moves to carry flag}
           branch  no_add if carry = 0
           add     Ra to Rp          {carry flag is set to c_out}
  no_add:  rotate  Rp right 1        {carry to MSB, LSB to carry}
           rotate  Rq right 1        {carry to MSB, LSB to carry}
           decr    Rc                {decrement counter by 1}
           branch  m_loop if Rc ≠ 0
{Store the product}
           store   Rp into p_high
           store   Rq into p_low
  m_done:  ...

166 Time Complexity of Programmed Multiplication Assume k-bit words. There are k iterations of the main loop, with 6-7 instructions per iteration, depending on the multiplier bit. Thus, 6k + 3 to 7k + 3 machine instructions, ignoring operand loads and result store. k = 32 implies between 195 and 227 instructions. This is too slow for many modern applications! Microprogrammed multiply would be somewhat better.

167 Sequential Multiplication with Right Shifts [Figure: hardware realization. The multiplier x and the double-width partial product p^(j) are held in shift registers; bit x_j controls a mux that passes either 0 or the multiplicand a as the addend x_j a; a k-bit adder adds it to the upper half of p^(j), and c_out enters the partial product as it is shifted right.]

168 Sequential Multiplication with Left Shifts
[Figure: hardware realization. The multiplier $x$ and the double-width partial product $p^{(j)}$ shift left; a mux controlled by $x_{k-j-1}$ selects 0 or the multiplicand $a$; a 2k-bit adder adds the selected value to the 2k-bit partial product.]

169 Multiplication of Signed Numbers
Negative multiplicand, positive multiplier: no change, other than looking out for proper sign extension
[Figure: right-shift trace with a negative 2's-complement multiplicand $a$ and a positive multiplier $x$, from $p^{(0)}$ to $p^{(5)}$, with sign-extended partial products.]

170 Multiplication with a Negative Multiplier
Negative multiplicand, negative multiplier: in the last step (the sign bit), subtract rather than add
[Figure: right-shift trace with 5-bit 2's-complement operands, from $p^{(0)}$ to $p^{(5)}$; the final step adds $-x_4 a$.]
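The rule "subtract in the last step" follows from the sign bit of a 2's-complement multiplier having weight $-2^{k-1}$. A minimal sketch under that reading is given below; the names `to_signed` and `multiply_signed` are hypothetical, and Python's unbounded integers stand in for properly sign-extended partial products.

```python
def to_signed(value: int, k: int) -> int:
    """Interpret the low k bits of value as a 2's-complement number."""
    value &= (1 << k) - 1
    return value - (1 << k) if value >> (k - 1) else value

def multiply_signed(a_bits: int, x_bits: int, k: int) -> int:
    """Multiply two k-bit 2's-complement patterns, subtracting at the sign bit."""
    a = to_signed(a_bits, k)            # sign-extended multiplicand
    p = 0
    for j in range(k):
        x_j = (x_bits >> j) & 1
        term = x_j * (a << j)           # x_j * a * 2^j
        if j == k - 1:
            p -= term                   # sign bit has negative weight
        else:
            p += term
    return p
```

For a positive multiplier the sign bit is 0, so the final subtraction is a no-op and the loop reduces to the unsigned algorithm with sign extension.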

171 Booth's Recoding
  $x_i$  $x_{i-1}$  $y_i$   Explanation
   0      0       0    No string of 1s in sight
   0      1       1    End of string of 1s in $x$
   1      0      -1    Beginning of string of 1s in $x$
   1      1       0    Continuation of string of 1s in $x$
[Example: an operand $x$ (with its sign extension shown in parentheses) and its recoded version $y$.]
Justification: $2^j + 2^{j-1} + \cdots + 2^{i+1} + 2^i = 2^{j+1} - 2^i$

172 Example Multiplication with Booth's Recoding
[Figure: right-shift trace using the Booth-recoded multiplier $y$ in place of $x$; each step adds $y_j a$, from $p^{(0)}$ to $p^{(5)}$. The 2's complement of $a$ is noted for use when $y_j = -1$.]
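Booth's recoding rule can be written compactly as $y_i = x_{i-1} - x_i$ with $x_{-1} = 0$. The sketch below applies that rule to a k-bit pattern; the function names are illustrative and not taken from the slides.

```python
def to_signed(value: int, k: int) -> int:
    """Interpret the low k bits of value as a 2's-complement number."""
    value &= (1 << k) - 1
    return value - (1 << k) if value >> (k - 1) else value

def booth_recode(x_bits: int, k: int) -> list[int]:
    """Return digits y_0 .. y_(k-1) in {-1, 0, 1}, least significant first."""
    digits, prev = [], 0                # prev plays the role of x_(i-1)
    for i in range(k):
        cur = (x_bits >> i) & 1
        digits.append(prev - cur)       # y_i = x_(i-1) - x_i
        prev = cur
    return digits
```

The recoded digits represent the 2's-complement value of the multiplier, which is why no special treatment of the sign bit is needed after recoding.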

173 Multiplication by Constants
Explicit: the constant appears directly in an expression assigned to y
Implicit, e.g. A[i, j] := A[i, j] + B[i, j], where for an m × n array stored row by row, address of A[i, j] = base + n × i + j
Software aspects:
  Optimizing compilers replace multiplications by shifts/adds/subs
  Produce efficient code using as few registers as possible
  Find the best code by a time/space-efficient algorithm
Hardware aspects:
  Synthesize special-purpose units such as filters
  $y[t] = a_0 x[t] + a_1 x[t-1] + a_2 x[t-2] + b_1 y[t-1] + b_2 y[t-2]$

174 Multiplication Using Binary Expansion
Example: Multiply R1 by the constant 113 = (1110001)two
  R2   ← R1 shift-left 1
  R3   ← R2 + R1
  R6   ← R3 shift-left 1
  R7   ← R6 + R1
  R112 ← R7 shift-left 4
  R113 ← R112 + R1
Ri: register that contains i times (R1)
This notation is for clarity; only one register other than R1 is needed
Shorter sequence using shift-and-add instructions
  R3   ← R1 shift-left 1 + R1
  R7   ← R3 shift-left 1 + R1
  R113 ← R7 shift-left 4 + R1
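The binary-expansion method generalizes to any positive constant by scanning its bits from the most significant end. The following sketch assumes a single accumulator besides R1; the function name is hypothetical.

```python
def times_constant_binary(r1: int, c: int) -> int:
    """Multiply r1 by a positive constant c using only shifts and adds."""
    acc = r1                            # accounts for the leading 1 of c
    for bit in bin(c)[3:]:              # remaining bits of c, MSB first
        acc <<= 1                       # shift
        if bit == "1":
            acc += r1                   # add
    return acc
```

For c = 113 the accumulator passes through 3, 7, 14, 28, 56, and 113 times R1, which matches the register sequence above once the consecutive shifts are merged into one shift-left 4.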

175 Multiplication via Recoding
Example: Multiply R1 by 113 = (1110001)two = (1 0 0 -1 0 0 0 1)two
  R8   ← R1 shift-left 3
  R7   ← R8 − R1
  R112 ← R7 shift-left 4
  R113 ← R112 + R1
Shorter sequence using shift-and-add/subtract instructions
  R7   ← R1 shift-left 3 − R1
  R113 ← R7 shift-left 4 + R1
6 shift or add (3 shift-and-add) instructions needed without recoding

176 Multiplication via Factorization
Example: Multiply R1 by 119 = 7 × 17 = (8 − 1) × (16 + 1)
  R8   ← R1 shift-left 3
  R7   ← R8 − R1
  R112 ← R7 shift-left 4
  R119 ← R112 + R7
Shorter sequence using shift-and-add/subtract instructions
  R7   ← R1 shift-left 3 − R1
  R119 ← R7 shift-left 4 + R7
Requires a scratch register for holding the 7 multiple
119 = (1110111)two = (1 0 0 0 -1 0 0 -1)two
More instructions may be needed without factorization
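The factorization and the recoded form of 119 can be compared side by side. This is an illustrative sketch with hypothetical function names; each Python statement corresponds to one shift-and-add/subtract instruction.

```python
def times_119_factored(r1: int) -> int:
    """119 = (8 - 1) * (16 + 1): two instructions plus a scratch register."""
    r7 = (r1 << 3) - r1                 # scratch register holding 7 * r1
    return (r7 << 4) + r7

def times_119_recoded(r1: int) -> int:
    """119 = 128 - 8 - 1: no scratch multiple, but one more instruction."""
    t = (r1 << 4) - r1                  # 15 * r1
    t = (t << 3) - r1                   # 119 * r1
    return t
```

The trade-off is one extra register for the factored version against one extra instruction for the recoded version.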

177 High-Radix Multipliers
Chapter Goals
  Study techniques that allow us to handle more than one multiplier bit in each cycle (two bits in radix 4, three in radix 8, ...)
Chapter Highlights
  High radix gives rise to difficult multiples
  Recoding (change of digit-set) as remedy
  Carry-save addition reduces cycle time
  Implementation and optimization methods

178 Radix-4 Multiplication in Dot Notation
[Figure: dot-notation diagrams of multiplicand $a$, multiplier $x$, partial-products bit-matrix, and product $p$. Radix 2: partial products $x_0 a 2^0$, $x_1 a 2^1$, $x_2 a 2^2$, $x_3 a 2^3$. Radix 4: partial products $(x_1 x_0)_{two}\, a\, 4^0$ and $(x_3 x_2)_{two}\, a\, 4^1$.]
Radix-4, or two-bit-at-a-time, multiplication in dot notation
Number of cycles is halved, but now the difficult multiple 3a must be dealt with

179 A Possible Design for a Radix-4 Multiplier
[Figure: the multiple 3a is precomputed via shift-and-add (3a = 2a + a); a 4-way mux controlled by multiplier bits $x_{i+1} x_i$ selects 0, a, 2a, or 3a to send to the adder; the multiplier is shifted 2 bits per cycle.]

180 Example Radix-4 Multiplication Using 3a
[Figure: two-step trace from $p^{(0)}$ to $p^{(2)}$; step 1 adds $(x_1 x_0)_{two}\, a$ and step 2 adds $(x_3 x_2)_{two}\, a$, each followed by a 2-bit right shift.]

181 A Second Design for a Radix-4 Multiplier
The multiple generation part of a radix-4 multiplier based on replacing 3a with 4a (carry into next higher radix-4 multiplier digit) and −a.
[Figure: a 4-way mux selects 0, a, 2a, or −a for the adder, controlled by $((x_{i+1} x_i)_{two} + c) \bmod 4$; a carry flip-flop holds $c$ and is set if $x_{i+1} = x_i = 1$ or if $x_{i+1} = c = 1$; the multiplier is shifted 2 bits per cycle.]

182 Radix-4 Booth's Recoding
  $x_{i+1}$ $x_i$ $x_{i-1}$   $y_{i+1}$ $y_i$   $z_{i/2}$   Explanation
    0   0   0       0   0      0    No string of 1s in sight
    0   0   1       0   1      1    End of string of 1s
    0   1   0       1  -1      1    Isolated 1
    0   1   1       1   0      2    End of string of 1s
    1   0   0      -1   0     -2    Beginning of string of 1s
    1   0   1      -1   1     -1    End a string, begin new one
    1   1   0       0  -1     -1    Beginning of string of 1s
    1   1   1       0   0      0    Continuation of string of 1s
  (Columns: context, recoded radix-2 digits, radix-4 digit)
Only shifting and complementation are required to form the multiples
[Example: an operand $x$, its recoded radix-2 version $y$, and its radix-4 version $z$.]
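Each radix-4 digit in the table equals $z_{i/2} = -2x_{i+1} + x_i + x_{i-1}$. The sketch below computes the digits directly from that expression; the function names are illustrative, and an even word width k is assumed.

```python
def to_signed(value: int, k: int) -> int:
    """Interpret the low k bits of value as a 2's-complement number."""
    value &= (1 << k) - 1
    return value - (1 << k) if value >> (k - 1) else value

def booth_radix4(x_bits: int, k: int) -> list[int]:
    """Radix-4 digits in [-2, 2], least significant first (k must be even)."""
    digits, prev = [], 0                # prev plays the role of x_(i-1)
    for i in range(0, k, 2):
        x_i = (x_bits >> i) & 1
        x_i1 = (x_bits >> (i + 1)) & 1
        digits.append(-2 * x_i1 + x_i + prev)
        prev = x_i1
    return digits
```

Because every digit lies in [-2, 2], the only multiples needed are 0, ±a, and ±2a, all obtainable by shifting and complementing.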

183 Example Multiplication via Modified Booth's Recoding
[Figure: two-step trace from $p^{(0)}$ to $p^{(2)}$ using the radix-4 recoded multiplier $z$; step 1 adds $z_0 a$ and step 2 adds $z_1 a$, each followed by a 2-bit right shift.]

184 Multiple Generation with Radix-4 Booth's Recoding
[Figure: recoding logic takes $x_{i+1}$, $x_i$, and $x_{i-1}$ from the multiplier (shifted 2 bits per cycle, with the extra bit initialized to 0) and produces the signals neg, two, and non0. The signal non0 enables a mux that selects a or 2a under control of two (which could have been named one/two), giving 0, a, or 2a on k + 1 bits with sign extension; neg is the add/subtract control, so that $z_{i/2}\, a$ reaches the adder input.]

185 Using Carry-Save Adders
[Figure: two muxes select 0 or a (controlled by $x_i$) and 0 or 2a (controlled by $x_{i+1}$); a CSA combines them with the old cumulative partial product, and an adder produces the new cumulative partial product.]

186 Keeping the Partial Product in Carry-Save Form
[Figure: a mux selects 0 or the multiplicand under control of the current multiplier bit; a k-bit CSA combines the selected multiple with the sum and carry parts of the upper partial product; a k-bit adder is used for the final conversion. Dot-notation legend: old PP, next multiple, CS sum, new PP.]

187 Carry-Save Multiplier with Radix-4 Booth's Recoding (1/2)
[Figure: a Booth recoder and selector takes $x_{i+1}$, $x_i$, $x_{i-1}$ and the multiplicand $a$ and produces $z_{i/2}\, a$ plus an extra dot; a CSA combines it with the old cumulative partial product to form the new one; a 2-bit adder with a carry flip-flop sends two bits per cycle to the lower half of the partial product.]

188 Carry-Save Multiplier with Radix-4 Booth's Recoding (2/2)
[Figure: Booth selector detail. Recoding logic produces neg, two, and non0 from $x_{i+1}$, $x_i$, $x_{i-1}$; a mux enabled by non0 selects a or 2a, giving 0, a, or 2a on k + 1 bits; a selective-complement stage controlled by neg yields 0, a, −a, 2a, or −2a on k + 2 bits, together with an extra "dot" for column i, forming $z_{i/2}\, a$.]

189 Another Design for Radix-4 Multiplication
[Figure: two muxes select 0 or 2a and 0 or a under control of $x_{i+1}$ and $x_i$; two CSA levels combine them with the old cumulative partial product in carry-save form; a 2-bit adder with a carry flip-flop sends two bits per cycle to the lower half of the partial product.]

190 Radix-8 and Radix-16 Multipliers
[Figure: radix-16 multiplier. Four muxes select 0 or 8a, 0 or 4a, 0 or 2a, and 0 or a under control of $x_{i+3}$, $x_{i+2}$, $x_{i+1}$, $x_i$; four CSA levels keep the upper half of the partial product as sum and carry; a 4-bit adder with a carry flip-flop sends four bits per cycle to the lower half of the partial product; multiplier and partial product are shifted right 4 bits per cycle.]

191 A Spectrum of Multiplier Design Choices
[Figure: three designs side by side. Basic binary: next multiple, adder, partial product. High-radix or partial tree: several multiples, small CSA tree, adder, partial product. Full tree: all multiples, full CSA tree, adder. Moving from basic binary toward the full tree speeds up the design; moving from the full tree toward basic binary economizes.]

192 VLSI Complexity Issues
A radix-$2^b$ multiplier requires:
  $bk$ two-input AND gates to form the partial-products bit-matrix
  $O(bk)$ area for the CSA tree
  At least $\Theta(k)$ area for the final carry-propagate adder
Total area: $A = O(bk)$
Latency: $T = O((k/b) \log b + \log k)$
Any VLSI circuit computing the product of two k-bit integers must satisfy the following constraints:
  $AT$ grows at least as fast as $k^{3/2}$
  $AT^2$ is at least proportional to $k^2$
The preceding radix-$2^b$ implementations are suboptimal, because:
  $AT = O(k^2 \log b + bk \log k)$
  $AT^2 = O((k^3/b) \log^2 b)$

193 Comparing High- and Low-Radix Multipliers
$AT = O(k^2 \log b + bk \log k)$    $AT^2 = O((k^3/b) \log^2 b)$
            Low-cost         High speed            $AT$- or $AT^2$-
            $b = O(1)$       $b = O(k)$            optimal
  $AT$      $O(k^2)$         $O(k^2 \log k)$       $O(k^{3/2})$
  $AT^2$    $O(k^3)$         $O(k^2 \log^2 k)$     $O(k^2)$
Intermediate designs do not yield better $AT$ or $AT^2$ values; the multipliers remain asymptotically suboptimal for any b
By the $AT$ measure (indicator of cost-effectiveness), slower radix-2 multipliers are better than high-radix or tree multipliers
Thus, when an application requires many independent multiplications, it is more cost-effective to use a large number of slower multipliers
High-radix multiplier latency can be reduced from $O((k/b) \log b + \log k)$ to $O(k/b + \log k)$ through more effective pipelining (Chapter 11)

194 Tree and Array Multipliers
Chapter Goals
  Study the design of multipliers for highest possible performance (speed, throughput)
Chapter Highlights
  Tree multiplier = reduction tree + redundant-to-binary converter
  Avoiding full sign extension in multiplying signed numbers
  Array multiplier = one-sided reduction tree + ripple-carry adder

195 Full-Tree Multipliers
[Figure: the multiplier and multiplicand $a$ feed multiple-forming circuits; all multiples enter a partial-products reduction tree (multi-operand addition tree); its redundant result goes to a redundant-to-binary converter that yields the higher-order product bits. Some lower-order product bits are generated directly.]

196 Full-Tree versus Partial-Tree Multiplier
[Figure: left, all partial products enter a large log-depth tree of carry-save adders followed by an adder that yields the product; right, several partial products at a time enter a small log-depth tree of carry-save adders whose output is fed back and finally passed to an adder that yields the product.]

197 Variations in Full-Tree Multiplier Design
Designs are distinguished by variations in three elements:
  1. Multiple-forming circuits
  2. Partial-products reduction tree (multi-operand addition tree)
  3. Redundant-to-binary converter
[Figure: the full-tree multiplier block diagram with these three elements marked. Some lower-order product bits are generated directly.]

198 Example of Variations in CSA Tree Design
Two different binary 4 × 4 tree multipliers:
  Wallace tree (5 FAs + 3 HAs + 4-bit adder)
  Dadda tree (4 FAs + 2 HAs + 6-bit adder)
[Figure: dot-notation reduction for both trees; compare their latencies.]

199 A 7 × 7 Tree Multiplier
[Figure: seven partial products occupying bit positions [0, 6], [1, 7], [2, 8], [3, 9], [4, 10], [5, 11], and [6, 12] are reduced by levels of 7-bit CSAs and a 10-bit CSA, followed by a 10-bit CPA.]
The index pair [i, j] means that bit positions from i up to j are involved.

200 Balanced-Delay Tree for 11 Inputs
$11 + \psi_1 = 2\psi_1 + 3$
Therefore, $\psi_1 = 8$ carries are needed
[Figure: a column of full adders taking the 11 inputs together with level-1, level-2, and level-3 carries and a level-4 carry, arranged so that signal delays are balanced, and producing the outputs.]

201 Binary Tree of 4-to-2 Reduction Modules
4-to-2 reduction module implemented with two levels of (3; 2)-counters
[Figure: a binary tree built of 4-to-2 modules, each consisting of two CSA levels.]
Due to its recursive structure, a binary tree is more regular than a 3-to-2 reduction tree when laid out in VLSI

202 Tree Multipliers for Signed Numbers
[Figure: sign extension in multioperand addition. Each operand $x$, $y$, $z$ has its sign bit ($x_{k-1}$, $y_{k-1}$, $z_{k-1}$) replicated across the extended positions to the left of the magnitude positions.]
The difference in multiplication is the shifting sign positions
[Figure: sharing of full adders to reduce the CSA width in a signed tree multiplier. The sign bits α, β, γ of successive partial products are extended; five redundant copies of the full adder computing on α, β, γ are removed.]

203 Using the Negative-Weight Property of the Sign Bit
Sign extension is a way of converting negatively weighted bits (negabits) to positively weighted bits (posibits) to facilitate reduction, but there are other methods of accomplishing the same without introducing a lot of extra bits
Baugh and Wooley have contributed two such methods
[Figure: the 5 × 5 partial-products bit-matrix (terms $a_i x_j$, product bits $p_9 \ldots p_0$) in four versions: a. Unsigned; b. 2's-complement, where the terms $a_4 x_j$ and $a_i x_4$ for $i, j < 4$ carry negative weight; c. Baugh-Wooley; d. Modified B-W.]

204 The Baugh-Wooley Method and Its Modified Form
Baugh-Wooley: $-a_4 x_0 = a_4(1 - x_0) - a_4 = a_4 \bar{x}_0 - a_4$; the term $-a_4$ is entered as $a_4$ in the same column and $-a_4$ in the next column
Modified form: $-a_4 x_0 = (1 - a_4 x_0) - 1 = \overline{a_4 x_0} - 1$; the term $-1$ is entered as $1$ in the same column and $-1$ in the next column
[Figure: panels c (Baugh-Wooley) and d (Modified B-W) of the 5 × 5 bit-matrix.]
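The modified form can be checked numerically. In the sketch below, every negatively weighted term is replaced by its complement, and all the resulting $-1$ terms are collected into two constant 1s in columns $k$ and $2k-1$ (valid modulo $2^{2k}$). That constant placement is derived here from the identity above rather than read off the figure, and the function name is hypothetical.

```python
def modified_baugh_wooley(a_bits: int, x_bits: int, k: int = 5) -> int:
    """Sum an all-positive bit-matrix; the result is the product mod 2^(2k)."""
    a = [(a_bits >> i) & 1 for i in range(k)]
    x = [(x_bits >> j) & 1 for j in range(k)]
    total = 0
    for i in range(k):
        for j in range(k):
            bit = a[i] & x[j]
            if (i == k - 1) != (j == k - 1):    # negatively weighted term
                bit ^= 1                         # -t = (1 - t) - 1: keep 1 - t
            total += bit << (i + j)
    # The collected -1 terms sum to -2^(2k-1) + 2^k, i.e. 2^(2k-1) + 2^k mod 2^(2k)
    total += (1 << k) + (1 << (2 * k - 1))
    return total % (1 << (2 * k))
```

The returned value is the 2k-bit 2's-complement pattern of the signed product.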

205 Alternate Views of the Baugh-Wooley Methods
[Figure: the negatively weighted terms $a_4 x_3, a_4 x_2, a_4 x_1, a_4 x_0$ and $a_3 x_4, a_2 x_4, a_1 x_4, a_0 x_4$ shown separately from the rest of the 5 × 5 bit-matrix, alongside the four panels a. Unsigned, b. 2's-complement, c. Baugh-Wooley, d. Modified B-W.]

206 Partial-Tree Multipliers
High-radix versus partial-tree multipliers: the difference is quantitative, not qualitative
For small h, say 8 bits, we view the multiplier as high-radix
When h is a significant fraction of k, say k/2 or k/4, then we tend to view it as a partial-tree multiplier
Better design through pipelining to be covered in Section 11.6
[Figure: general structure of a partial-tree multiplier. h inputs enter a CSA tree together with the upper part of the cumulative partial product (stored-carry); the sum and carry outputs feed an h-bit adder with a carry flip-flop that produces the lower part of the cumulative partial product.]

207 Truncated Multipliers
[Figure: dot-notation bit-matrix of a k-by-k fractional multiplication (k = 8), with the ulp position marked; the dots to the right of the ulp column are removed.]
Max error = 8/2 + 7/4 + 6/8 + 5/16 + 4/32 + 3/64 + 2/128 + 1/256 = 7.004 ulp
Mean error = 1.751 ulp
Removing the dots at the right does not lead to much loss of precision.
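The two error figures can be reproduced with exact fractions. This sketch assumes that the column of weight $2^{-m}$ ulp holds $k + 1 - m$ removed dots and that each dot, being the AND of two independent random bits, equals 1 with probability 1/4; the function name is illustrative.

```python
from fractions import Fraction

def truncation_errors(k: int = 8) -> tuple[Fraction, Fraction]:
    """Return (max, mean) error in ulp from dropping all dots right of ulp."""
    max_err = sum(Fraction(k + 1 - m, 2 ** m) for m in range(1, k + 1))
    mean_err = max_err / 4              # each removed dot is 1 with probability 1/4
    return max_err, mean_err
```

For k = 8 the maximum is 1793/256 ulp, which rounds to 7.004, and the mean is one quarter of that.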

208 Truncated Multipliers with Error Compensation
We can introduce additional dots on the left-hand side to compensate for the removal of dots from the right-hand side
[Figure: constant and variable error compensation for truncated multipliers. Constant compensation inserts a constant 1 in the bit-matrix; variable compensation inserts the dots $x_{-1}$ and $y_{-1}$.]
Constant compensation: Max error = +4 ulp, Max error ≈ −3 ulp, Mean error = ? ulp
Variable compensation: Max error = +? ulp, Max error ≈ −? ulp, Mean error = ? ulp

209 Array Multipliers
A basic array multiplier uses a one-sided CSA tree and a ripple-carry adder.
[Figure: the multiples $x_0 a$ through $x_4 a$ enter a chain of CSAs followed by a ripple-carry adder.]
[Figure: details of a 5 × 5 array multiplier using FA blocks, i.e. [3:2] adders, with inputs $a_i x_j$ and outputs $p_9 \ldots p_0$.]

210 Signed (2's-complement) Array Multiplier
[Figure: modifications in a 5 × 5 array multiplier to deal with 2's-complement inputs using the Baugh-Wooley method or to shorten the critical path; complemented terms appear in the last row and column, and outputs are $p_9 \ldots p_0$.]

211 Array Multiplier Built of Modified Full-Adder Cells
[Figure: design of a 5 × 5 array multiplier with two additive inputs and full-adder blocks that include AND gates; outputs are $p_9 \ldots p_0$.]

212 Array Multiplier without a Final Carry-Propagate Adder
[Figure: levels i and i + 1 of the array, each with mux blocks and cells $B_i$ and $B_{i+1}$ (detailed on the next slide); the last level delivers product bits [k, 2k − 1].]
All remaining bits of the final product are produced only 2 gate levels after $p_{k-1}$

213 Extending Bits in the Less-Significant Part in a Conditional Adder
The circuit in the right part can be regarded as a conditional adder equivalent to the circuit in the left part.
Source: Ercegovac and Lang, Digital Arithmetic

214 Pipelined Tree and Array Multipliers
[Figure, left: general structure of a partial-tree multiplier. h inputs enter a CSA tree together with the upper part of the cumulative partial product (stored-carry); an h-bit adder with a carry flip-flop produces the lower part of the cumulative partial product.]
[Figure, right: efficiently pipelined partial-tree multiplier. h inputs enter a pipelined CSA tree with latches between levels; the feedback path through the final CSA levels and latch is kept short; an h-bit adder with a carry flip-flop produces the lower part of the cumulative partial product.]

215 Pipelined Array Multipliers
With latches after every FA level, the maximum throughput is achieved
Latches may be inserted after every h FA levels for an intermediate design
Example: 3-stage pipeline
[Figure: pipelined 5 × 5 array multiplier using latched FA blocks (latched FA with AND gate). The small shaded boxes are latches.]

216 Variations in Multipliers Chapter Goals Learn additional methods for synthesizing fast multipliers as well as other types of multipliers (bit-serial, modular, etc.) Chapter Highlights Building a multiplier from smaller units Performing multiply-add as one operation Bit-serial and (semi)systolic multipliers Using a multiplier for squaring is wasteful Computer Arithmetic 3, Dept. of EE, Fu Jen Catholic University, Taiwan 58

217 Divide-and-Conquer Designs
Building a wide multiplier from narrower ones
[Figure: divide-and-conquer (recursive) strategy for synthesizing a 2b × 2b multiplier from b × b multipliers. The operands are split into halves $a_H, a_L$ and $x_H, x_L$; the four partial products $a_H x_H$, $a_H x_L$, $a_L x_H$, and $a_L x_L$ are rearranged so that the middle 2b bits need a three-operand addition, while the lowest b bits come directly from $a_L x_L$.]
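The divide-and-conquer decomposition amounts to $a x = a_H x_H 2^{2b} + (a_H x_L + a_L x_H) 2^b + a_L x_L$. The sketch below builds a 2b × 2b product from four b × b products; the function name and the `mul_b` parameter (standing in for a b × b hardware multiplier) are assumptions for illustration.

```python
def multiply_2b(a: int, x: int, b: int, mul_b=lambda u, v: u * v) -> int:
    """Form a 2b x 2b unsigned product from four b x b products."""
    mask = (1 << b) - 1
    a_h, a_l = a >> b, a & mask         # split multiplicand into halves
    x_h, x_l = x >> b, x & mask         # split multiplier into halves
    return ((mul_b(a_h, x_h) << (2 * b))
            + ((mul_b(a_h, x_l) + mul_b(a_l, x_h)) << b)
            + mul_b(a_l, x_l))
```

Passing `multiply_2b` itself (with a smaller b) as `mul_b` gives the recursive version of the same strategy.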

218 General Structure of a Recursive Multiplier
Using b × b multipliers to synthesize 2b × 2b, 3b × 3b, and 4b × 4b multipliers:
  2b × 2b: use (3; 2)-counters
  3b × 3b: use (5; 2)-counters
  4b × 4b: use (7; 2)-counters

219 An 8 × 8 Multiplier Using 4 × 4 Multipliers
[Figure: four 4 × 4 multipliers form $a_H x_H$, $a_H x_L$, $a_L x_H$, and $a_L x_L$ from operand segments [4, 7] and [0, 3]; their outputs, occupying product positions [0, 3] through [12, 15], are combined by adders for positions [4, 7], [8, 11], and [12, 15] to give $p_{[12,15]}$, $p_{[8,11]}$, $p_{[4,7]}$, and $p_{[0,3]}$.]

220 Additive Multiply Modules
$p = ax + y + z$
[Figure: (a) block diagram and (b) dot notation of an additive multiply module with a 2 × 4 multiplier ($ax$) plus 4-bit and 2-bit additive inputs ($y$ and $z$), built around a 4-bit adder with carry-in.]
b × c AMM:
  b-bit and c-bit multiplicative inputs
  b-bit and c-bit additive inputs
  (b + c)-bit output
  $(2^b - 1)(2^c - 1) + (2^b - 1) + (2^c - 1) = 2^{b+c} - 1$

221 Multiplier Built of AMMs
[Figure: an 8 × 8 multiplier built of 4 × 2 AMMs. Each module takes a 4-bit segment of $a$ and a 2-bit segment of $x$ plus additive inputs, and the outputs form $p_{[0,1]}$ through $p_{[12,15]}$. Inputs marked with an asterisk carry 0s.]
[Figure: understanding an 8 × 8 multiplier built of 4 × 2 AMMs using dot notation.]

222 Bit-Serial Multipliers
[Figure: bit-serial adder (LSB first): an FA with a carry flip-flop takes $x_i$ and $y_i$ one bit per cycle and produces $s_0, s_1, s_2, \ldots$; bit-serial multiplier: serial inputs $a_i$ and $x_i$, serial outputs $p_0, p_1, p_2, \ldots$]
Systolic arrays: synchronous arrays of processing elements that are interconnected by only short, local wires, thus allowing very high clock rates.

223 Semisystolic Serial-Parallel Multiplier
[Figure: semisystolic circuit for 4 × 4 multiplication in 8 clock cycles. The multiplicand bits $a_3 a_2 a_1 a_0$ enter in parallel; the multiplier enters serially, LSB first, and is broadcast to four FA cells with carry and sum latches; the product leaves serially.]
This is called semisystolic because it has a large signal fan-out of k (k-way broadcasting) and a long wire spanning all k positions

224 Systolic Retiming as a Design Tool
A semisystolic circuit can be converted to a systolic circuit via retiming, which involves advancing and retarding signals by means of delay removal and delay insertion in such a way that the relative timings of various parts are unaffected
[Figure: example of retiming by delaying the inputs to $C_L$ and advancing the outputs from $C_L$ by d units. A cut separates $C_L$ from $C_R$; original delays e, f, g, h on the crossing edges become e + d and f + d in one direction and g − d and h − d in the other.]

225 A First Attempt at Retiming
[Figure: the semisystolic serial-parallel multiplier (multiplicand $a_3 a_2 a_1 a_0$ in parallel, multiplier serial in LSB-first, product serial out) shown before and after retiming along Cut 1, Cut 2, and Cut 3.]
A retimed version of our semisystolic multiplier.

226 Deriving a Fully Systolic Multiplier
[Figure: the serial-parallel multiplier (multiplicand $a_3 a_2 a_1 a_0$ in parallel, multiplier serial in LSB-first, product serial out) before and after a further retiming step.]
A retimed version of our semisystolic multiplier.

227 A Direct Design for a Bit-Serial Multiplier
[Figure: building block for a latency-free bit-serial multiplier, built around a (5; 3)-counter and a mux, with inputs $a_i$, $x_i$, $t_{in}$, $c_{in}$, $s_{in}$ and outputs $t_{out}$, $c_{out}$, $s_{out}$.]
[Figure: the cellular structure of the bit-serial multiplier based on that cell, with the LSB $p_i$ output each cycle.]
[Figure: bit-serial multiplier design in dot notation. (a) Structure of the bit-matrix: the terms $a_i x_i$, $a_i x^{(i-1)}$, and $x_i a^{(i-1)}$ are added to the part already accumulated into three numbers, next to the part already output. (b) Reduction after each input bit: $2p^{(i)}$ is formed from $p^{(i-1)}$ and the new terms, then shifted right to obtain $p^{(i)}$.]

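228 Modular Multipliers
[Figure: modulo-$(2^b - 1)$ carry-save adder: a row of FAs whose most significant carry-out wraps around to the least significant position.]
[Figure: design of a 4 × 4 modulo-15 multiplier using mod-15 CSAs, in which bits of weight 16 and above are divided by 16 and wrapped around, followed by a mod-15 CPA with a 4-bit result.]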
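Modulo-$(2^b - 1)$ arithmetic relies on $2^b \equiv 1$, so any bit that overflows position $b - 1$ wraps around to position 0. The sketch below applies that end-around idea in software; it is not a gate-level model of the figure, and the function names are hypothetical.

```python
def fold_mod(value: int, b: int) -> int:
    """Reduce value modulo 2^b - 1 using end-around carries."""
    m = (1 << b) - 1
    while value > m:
        value = (value & m) + (value >> b)   # high part re-enters at weight 1
    return 0 if value == m else value        # all-ones pattern also represents 0

def multiply_mod(a: int, x: int, b: int = 4) -> int:
    """Multiply modulo 2^b - 1 by accumulating wrapped partial products."""
    total = 0
    for j in range(b):
        if (x >> j) & 1:
            total += fold_mod(a << j, b)     # cyclically shifted multiplicand
    return fold_mod(total, b)
```

With b = 4 this gives modulo-15 multiplication; each wrapped partial product is simply a cyclic shift of the multiplicand.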

229 Other Examples of Modular Multiplication
[Figure: one way to design a 4 × 4 modulo-13 multiplier.]
16 mod 13 = 3

230 Squaring
[Figure: design of a 5-bit squarer. Multiplying $x$ by $x$ gives a symmetric 5 × 5 bit-matrix; simplifying it yields a smaller bit-matrix for $p_9 \ldots p_0$, with $p_1 = 0$.]

231 Constant Multiplier
Source: Ercegovac and Lang, Digital Arithmetic, p. 224

232 Multiple Constant Multiplier
Source: Ercegovac and Lang, Digital Arithmetic, p. 225

233 Division Instructor: Kuan Jen Lin Dept. of EE, FJU, Taiwan Room: SF 727B Most slides are revisions of PowerPoint files obtained from the textbook website. Computer Arithmetic 4, Dept. of EE, Fu Jen Catholic University, Taiwan 1

234 Division
Review division schemes and various speedup methods
  Hardest basic operation (fortunately, also the rarest)
  Division speedup methods: high-radix, array, ...
  Combined multiplication/division hardware
  Digit-recurrence vs convergence division schemes
Topics in This Part
  Chapter 13 Basic Division Schemes
  Chapter 14 High-Radix Dividers
  Chapter 15 Variations in Dividers
  Chapter 16 Division by Convergence

235 13 Basic Division Schemes
Chapter Goals
  Study shift/subtract or bit-at-a-time dividers and set the stage for faster methods and variations to be covered in the following chapters
Chapter Highlights
  Shift/subtract divide vs shift/add multiply
  Hardware, firmware, software algorithms
  Dividing 2's-complement numbers
  The special case of a constant divisor

236 Shift/Subtract Division Algorithms
Notation for our discussion of division algorithms:
  z   Dividend                 $z_{2k-1} z_{2k-2} \ldots z_3 z_2 z_1 z_0$
  d   Divisor                  $d_{k-1} d_{k-2} \ldots d_1 d_0$
  q   Quotient                 $q_{k-1} q_{k-2} \ldots q_1 q_0$
  s   Remainder, z − (d × q)   $s_{k-1} s_{k-2} \ldots s_1 s_0$
Initially, we assume unsigned operands
[Figure: division of an 8-bit number by a 4-bit number in dot notation, showing divisor $d$, quotient $q$, dividend $z$, the subtracted bit-matrix $-q_3 d\,2^3$, $-q_2 d\,2^2$, $-q_1 d\,2^1$, $-q_0 d\,2^0$, and remainder $s$.]

237 Division versus Multiplication (1/2)
Division is more complex than multiplication:
  Need for quotient digit selection or estimation
  Overflow possibility: the high-order k bits of z must be strictly less than d; otherwise the quotient of a 2k-bit number divided by a k-bit number may be more than k bits wide
[Figure: the dot-notation division diagram with divisor, quotient, dividend, subtracted bit-matrix, and remainder.]

238 Division versus Multiplication (2/2)
Pentium III latencies
  Instruction                  Latency   Cycles/Issue
  Load / Store                    3          1
  Integer Multiply                4          1
  Integer Divide                 ...        ...
  Double/Single FP Multiply       5          2
  Double/Single FP Add            3          1
  Double/Single FP Divide        ...        ...

239 Division Recurrence
Division with left shifts (there is no corresponding right-shift algorithm):
  $s^{(j)} = 2s^{(j-1)} - q_{k-j}(2^k d)$ with $s^{(0)} = z$ and $s^{(k)} = 2^k s$
  (the factor 2 is the shift; the term $q_{k-j}(2^k d)$ is the subtraction)
Integer division is characterized by $z = d \times q + s$
  $2^{-2k} z = (2^{-k} d) \times (2^{-k} q) + 2^{-2k} s$
  $z_{frac} = d_{frac} \times q_{frac} + 2^{-k} s_{frac}$
Divide fractions like integers; adjust the remainder
No-overflow condition for fractions is: $z_{frac} < d_{frac}$
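The recurrence translates directly into a shift/subtract loop in which each quotient bit is chosen by testing whether the trial subtraction stays nonnegative. The sketch below is illustrative; the function name and the operands in the usage note (z = 117, d = 10, k = 4) are assumptions, not values from the slides.

```python
def divide_shift_subtract(z: int, d: int, k: int) -> tuple[int, int]:
    """Divide unsigned 2k-bit z by k-bit d; returns (quotient, remainder)."""
    if d == 0 or (z >> k) >= d:
        raise OverflowError("divide by zero or quotient wider than k bits")
    s, q = z, 0                         # s(0) = z
    for _ in range(k):
        s <<= 1                         # 2 s(j-1)
        trial = s - (d << k)            # subtract 2^k d
        bit = 1 if trial >= 0 else 0    # quotient digit q_(k-j)
        if bit:
            s = trial
        q = (q << 1) | bit
    return q, s >> k                    # s(k) = 2^k s
```

For example, `divide_shift_subtract(117, 10, 4)` returns `(11, 7)`, consistent with 117 = 10 × 11 + 7.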

240 Division Recurrence Steps
Initialization
Iterations:
  One-digit arithmetic left shift of $s^{(j)}$ to produce $r s^{(j)}$
  Determination of the quotient digit $q_{j+1}$ by the quotient-digit selection function (the index of q could be different)
  Generation of the divisor multiple $d \times q_{j+1}$
  Subtraction of $d\,q_{j+1}$ from $r s^{(j)}$
  On-the-fly conversion of the quotient (or done in the termination step)
Termination: make sign(s) = sign(d); conversion

241 Examples of Basic Division
[Figure: two worked traces from $s^{(0)}$ to $s^{(4)}$. Integer division of $z$ by $d$ yields quotient bits $q_3 = 1$, $q_2 = 0$, $q_1 = 1$, $q_0 = 1$, remainder $s$, and quotient $q$. Fractional division of $z_{frac}$ by $d_{frac}$ yields $q_{-1} = 1$, $q_{-2} = 0$, $q_{-3} = 1$, $q_{-4} = 1$, remainder $s_{frac}$, and quotient $q_{frac}$.]
Notice the index of q
What is the residual of ... / 0.1?

242 Main Factors Affecting the Overall Execution Time and Cost
  Radix r
  Quotient-digit set: redundant signed digit?
  Representation of the residual: CSA?
  Quotient-digit selection

243 Programmed Division
[Figure: register usage for programmed division. Rs holds the shifted partial remainder (2k − j bits, continuing into Rq) with the carry flag to its left; Rq holds the shifted partial quotient (j bits), with the next quotient digit inserted at its right end; Rd holds the $2^k d$ divisor.]

244 Assembly Language Program for Division
{Using left shifts, divide unsigned 2k-bit dividend,
 z_high|z_low, storing the k-bit quotient and remainder.
 Registers: R0 holds 0       Rc for counter
            Rd for divisor   Rs for z_high & remainder
            Rq for z_low & quotient}
{Load operands into registers Rd, Rs, and Rq}
   div:     load   Rd with divisor
            load   Rs with z_high
            load   Rq with z_low
{Check for exceptions}
            branch d_by_0 if Rd = R0
            branch d_ovfl if Rs ≥ Rd
{Initialize counter}
            load   k into Rc
{Begin division loop}
   d_loop:  shift  Rq left 1        {zero to LSB, MSB to carry}
            rotate Rs left 1        {carry to LSB, MSB to carry}
            skip   if carry = 1
            branch no_sub if Rs < Rd
            sub    Rd from Rs
            incr   Rq               {set quotient digit to 1}
   no_sub:  decr   Rc               {decrement counter by 1}
            branch d_loop if Rc ≠ 0
{Store the quotient and remainder}
            store  Rq into quotient
            store  Rs into remainder
   d_by_0:  ...
   d_ovfl:  ...
   d_done:  ...
Programmed division using left shifts.

245 Time Complexity of Programmed Division
Assume k-bit words
k iterations of the main loop
6 or 8 instructions per iteration, depending on the quotient bit
Thus, 6k + 3 to 8k + 3 machine instructions, ignoring operand loads and result store
k = 32 implies 195 to 259 instructions, about 227 on average
This is too slow for many modern applications!
Microprogrammed division would be somewhat better
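The instruction-count bounds can be evaluated for any word width. This small Python sketch uses a hypothetical helper `instruction_bounds`; the midpoint it returns is an average only under the assumption that quotient bits 0 and 1 are equally likely and cost 6 and 8 instructions per iteration, as stated on the slide.

```python
def instruction_bounds(k):
    """Return (best, worst, midpoint) instruction counts for k-bit words."""
    best = 6 * k + 3    # every quotient bit is 0
    worst = 8 * k + 3   # every quotient bit is 1
    return best, worst, (best + worst) / 2


print(instruction_bounds(32))  # (195, 259, 227.0)
```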

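246 Restoring Hardware Dividers
Figure: Shift/subtract sequential restoring divider. A quotient register q (shifted, receiving $q_{k-j}$) and a partial remainder register $s^{(j)}$ (initial value z, with load and shift controls) feed a k-bit adder that subtracts the divisor d (complemented input, $c_{in} = 1$) to form the trial difference. The quotient digit selector uses $c_{out}$ and the MSB of $2s^{(j-1)}$ to choose $q_{k-j}$, which controls a 0/1 multiplexer and the loading of the trial difference.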

247 Indirect Signed Division
In division with signed operands, q and s are defined by
  $z = d \times q + s$,  $\mathrm{sign}(s) = \mathrm{sign}(z)$,  $|s| < |d|$
Examples of division with signed operands
  z = 5,   d = 3    =>  q = 1,   s = 2
  z = 5,   d = -3   =>  q = -1,  s = 2
  z = -5,  d = 3    =>  q = -1,  s = -2
  z = -5,  d = -3   =>  q = 1,   s = -2   (not q = 2, s = 1)
Magnitudes of q and s are unaffected by input signs
Signs of q and s are derivable from signs of z and d
Will discuss direct signed division later
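Indirect signed division can be sketched as "divide the magnitudes, then attach the signs". The Python function below is an illustrative assumption (the name `indirect_signed_divide` is not from the slides); it uses the built-in `divmod` on the magnitudes in place of an unsigned hardware divider.

```python
def indirect_signed_divide(z, d):
    """Return (q, s) with z = d*q + s, sign(s) = sign(z), and |s| < |d|."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    q_mag, s_mag = divmod(abs(z), abs(d))        # unsigned division
    q = q_mag if (z < 0) == (d < 0) else -q_mag  # quotient sign from z and d
    s = s_mag if z >= 0 else -s_mag              # remainder takes the sign of z
    return q, s


for z, d in [(5, 3), (5, -3), (-5, 3), (-5, -3)]:
    print(z, d, indirect_signed_divide(z, d))
```

Note that Python's own `//` and `%` operators round toward negative infinity, so they would give q = -2, s = 1 for z = -5, d = 3; the sketch follows the slide's convention instead.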

248 Example of Restoring Unsigned Division
z = 0111 0101 (117), d = 1010 (10), $2^4 d$ = 1010 0000 (160)
$s^{(0)} = z = 117$
$2s^{(0)} - 2^4 d = 234 - 160 = 74$: positive, so set $q_3 = 1$ and $s^{(1)} = 74$
$2s^{(1)} - 2^4 d = 148 - 160 = -12$: negative, so set $q_2 = 0$ and restore, $s^{(2)} = 2s^{(1)} = 148$
$2s^{(2)} - 2^4 d = 296 - 160 = 136$: positive, so set $q_1 = 1$ and $s^{(3)} = 136$
$2s^{(3)} - 2^4 d = 272 - 160 = 112$: positive, so set $q_0 = 1$ and $s^{(4)} = 112$
s = $2^{-4} s^{(4)}$ = 0111 (7), q = 1011 (11)
No overflow, because $(0111)_{two} < (1010)_{two}$
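The restoring recurrence can be traced in a few lines of Python. This is a sketch under stated assumptions: `restoring_divide` is a hypothetical name, partial remainders are kept as unbounded Python integers rather than fixed-width registers, and the example operands 117 and 10 are the ones named in the graphical depiction of this example.

```python
def restoring_divide(z, d, k):
    """Restoring division of a 2k-bit z by a k-bit d.

    Returns (q, s, trace), where trace lists s(0) .. s(k).
    """
    if d == 0:
        raise ZeroDivisionError("division by zero")
    if (z >> k) >= d:
        raise OverflowError("quotient does not fit in k bits")
    s, q, trace = z, 0, [z]
    for _ in range(k):
        trial = 2 * s - (d << k)   # trial difference 2s - 2^k d
        if trial >= 0:
            s, q = trial, (q << 1) | 1   # keep the difference, digit 1
        else:
            s, q = 2 * s, q << 1         # restore (keep 2s), digit 0
        trace.append(s)
    return q, s >> k, trace


print(restoring_divide(117, 10, 4))  # (11, 7, [117, 74, 148, 136, 112])
```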

249 Nonrestoring and Signed Division
The cycle time in restoring division must be long enough to allow:
  Shifting the registers
  Allowing signals to propagate through the adder
  Determining and storing the next quotient digit
  Storing the trial difference, if required
Nonrestoring division to the rescue!
  Assume $q_{k-j} = 1$ and subtract
  Store the result as the new PR (the partial remainder can become incorrect, hence the name "nonrestoring")
Figure: shift/subtract sequential divider datapath, with quotient register q, partial remainder register $s^{(j)}$ (initial value z), divisor d, k-bit adder ($c_{in}$, $c_{out}$), 0/1 multiplexer, trial difference, and a quotient digit selector driven by the MSB of $2s^{(j-1)}$.

250 Justification for Nonrestoring Division
Why is it acceptable to store an incorrect value in the partial-remainder register?
Shifted partial remainder at start of the cycle is $u$
Suppose subtraction yields the negative result $u - 2^k d$
Option 1: Restore the partial remainder to the correct value $u$, shift left, and subtract to get $2u - 2^k d$
Option 2: Keep the incorrect partial remainder $u - 2^k d$, shift left, and add to get $2(u - 2^k d) + 2^k d = 2u - 2^k d$

251 Example of Nonrestoring Unsigned Division
z = 0111 0101 (117), d = 1010 (10), $2^4 d$ = 1010 0000 (160)
$s^{(0)} = z = 117$: positive, so subtract
$s^{(1)} = 2s^{(0)} - 2^4 d = 234 - 160 = 74$: positive, so set $q_3 = 1$ and subtract
$s^{(2)} = 2s^{(1)} - 2^4 d = 148 - 160 = -12$: negative, so set $q_2 = 0$ and add
$s^{(3)} = 2s^{(2)} + 2^4 d = -24 + 160 = 136$: positive, so set $q_1 = 1$ and subtract
$s^{(4)} = 2s^{(3)} - 2^4 d = 272 - 160 = 112$: positive, so set $q_0 = 1$
s = $2^{-4} s^{(4)}$ = 0111 (7), q = 1011 (11)
No overflow: $(0111)_{two} < (1010)_{two}$
Applying the rule "if sign(s) = sign(d) then $q_{k-j} = 1$ else $q_{k-j} = -1$" instead, we get the digits 1 1 -1 1, which equals 8 + 4 - 2 + 1 = 11 = $(1011)_{two}$
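The nonrestoring variant differs from restoring division only in what happens after a negative result. The Python sketch below is illustrative: `nonrestoring_divide` is a hypothetical name, partial remainders are signed Python integers, and the final corrective addition for a negative last remainder is included so that the function works for all operands, not only this example.

```python
def nonrestoring_divide(z, d, k):
    """Nonrestoring unsigned division of a 2k-bit z by a k-bit d.

    Returns (q, s, trace), where trace lists s(0) .. s(k) before the
    final correction.
    """
    if d == 0:
        raise ZeroDivisionError("division by zero")
    if (z >> k) >= d:
        raise OverflowError("quotient does not fit in k bits")
    s, q, trace = z, 0, [z]
    for _ in range(k):
        if s >= 0:
            s = 2 * s - (d << k)   # previous result positive: subtract
        else:
            s = 2 * s + (d << k)   # previous result negative: add back
        q = (q << 1) | (1 if s >= 0 else 0)
        trace.append(s)
    if s < 0:                      # final correction of the remainder
        s += d << k
    return q, s >> k, trace


print(nonrestoring_divide(117, 10, 4))  # (11, 7, [117, 74, -12, 136, 112])
```

The trace reaches the same $s^{(3)} = 136$ as the restoring version, which is the identity $2(u - 2^k d) + 2^k d = 2u - 2^k d$ at work.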

252 Graphical Depiction of Nonrestoring Division Example
Figure: partial remainder variations $s^{(0)}, s^{(1)}, s^{(2)}, s^{(3)}, s^{(4)} = 16s$ for $(0111\,0101)_{two} / (1010)_{two}$, i.e., $(117)_{ten} / (10)_{ten}$. Panel (a): restoring. Panel (b): nonrestoring. In both panels $s^{(1)} = 74$.

253 Nonrestoring Division with Signed Operands
Restoring division
  $q_{k-j} = 0$ means no subtraction (or subtraction of 0)
  $q_{k-j} = 1$ means subtraction of d
Nonrestoring division
  We always subtract or add
  It is as if quotient digits are selected from the set {1, -1}:
    1 corresponds to subtraction
    -1 corresponds to addition
Example: q = ...
Our goal is to end up with a remainder that matches the sign of the dividend
This idea of trying to match the sign of s with the sign of z leads to a direct signed division algorithm:
  if sign(s) = sign(d) then $q_{k-j} = 1$ else $q_{k-j} = -1$

254 Quotient Conversion and Final Correction
Figure: partial remainder variation and selected quotient digits during nonrestoring division with d > 0. The partial remainder starts at z and moves between -d and +d as d is added or subtracted after each doubling.
Converting a quotient with digits -1 and 1:
  Replace -1s with 0s
  Shift left, complement MSB, and set LSB to 1 to get the 2's-complement quotient
  Check: ... = 25 = ...
Final correction step if sign(s) ≠ sign(z):
  Add d to, or subtract d from, s; subtract 1 from, or add 1 to, q
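The conversion rule can be checked mechanically. In this Python sketch the function name `convert_quotient` is hypothetical, and one interpretation is assumed: the shifted result is kept as a (k+1)-bit word, so "complement MSB" acts on bit k. Under that assumption the rule computes $2B + 1 - 2^k$, where B is the digit string with -1s replaced by 0s.

```python
def convert_quotient(digits):
    """Convert k digits in {-1, 1} (most significant first) to 2's complement.

    Returns (bits, value): the (k+1)-bit pattern and its signed value.
    """
    k = len(digits)
    p = 0
    for digit in digits:
        p = (p << 1) | (1 if digit == 1 else 0)  # replace -1s with 0s
    bits = (p << 1) | 1                          # shift left, set LSB to 1
    bits ^= 1 << k                               # complement the MSB
    value = bits - (1 << (k + 1)) if (bits >> k) & 1 else bits
    return bits, value


bits, value = convert_quotient([-1, 1, -1, 1])
print(format(bits, "05b"), value)  # 11011 -5
```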

255 Example of Nonrestoring Signed Division
z = 0010 0001 (33), d = 1001 (-7), $2^4 d = -112$
$s^{(0)} = z = 33$: sign($s^{(0)}$) ≠ sign(d), so set $q_3 = -1$ and add
$s^{(1)} = 2s^{(0)} + 2^4 d = 66 - 112 = -46$: sign($s^{(1)}$) = sign(d), so set $q_2 = 1$ and subtract
$s^{(2)} = 2s^{(1)} - 2^4 d = -92 + 112 = 20$: sign($s^{(2)}$) ≠ sign(d), so set $q_1 = -1$ and add
$s^{(3)} = 2s^{(2)} + 2^4 d = 40 - 112 = -72$: sign($s^{(3)}$) = sign(d), so set $q_0 = 1$ and subtract
$s^{(4)} = 2s^{(3)} - 2^4 d = -144 + 112 = -32$: sign($s^{(4)}$) ≠ sign(z), so perform corrective subtraction
$s^{(4)} - 2^4 d = -32 + 112 = 80$, so s = $2^{-4} \times 80$ = 0101 (5)
p = 0 1 0 1
Shift, compl MSB: 1 1 0 1 1 (-5)
Add 1 to correct: 1 1 1 0 0 (-4)
Check: 33/(-7) = -4
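Direct signed nonrestoring division, including the final correction, can be sketched as follows. The function name `nonrestoring_signed_divide` is hypothetical and zero is treated as positive when comparing signs. One case not shown on the slides is handled as an explicit assumption: a negative dividend can leave a final remainder equal to $-2^k |d|$, which is corrected to zero.

```python
def nonrestoring_signed_divide(z, d, k):
    """Return (q, s) with z = d*q + s, sign(s) = sign(z), |s| < |d|.

    Assumes -2^k |d| < z < 2^k |d| so that the quotient is representable.
    """
    if d == 0:
        raise ZeroDivisionError("division by zero")
    s, q = z, 0
    for _ in range(k):
        if (s >= 0) == (d >= 0):   # sign(s) = sign(d): digit 1, subtract
            s, q = 2 * s - (d << k), 2 * q + 1
        else:                      # signs differ: digit -1, add
            s, q = 2 * s + (d << k), 2 * q - 1
    big_d = abs(d) << k
    sign_d = 1 if d > 0 else -1
    if (z >= 0 and s < 0) or (z < 0 and s == -big_d):
        s, q = s + big_d, q - sign_d   # raise s by 2^k |d|
    elif z < 0 and s > 0:
        s, q = s - big_d, q + sign_d   # lower s by 2^k |d|
    return q, s >> k               # s is an exact multiple of 2^k


print(nonrestoring_signed_divide(33, -7, 4))  # (-4, 5)
```

Here the quotient is accumulated directly as a signed integer, so the separate digit conversion step (replace -1s, shift, complement MSB) is not needed in the sketch.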

256 On-The-Fly Conversion
Source: Ercegovac and Lang, Digital Arithmetic, p. 257


Part II Addition / Subtraction
Parts and chapters:
I. Number Representation
  1. Numbers and Arithmetic
  2. Representing Signed Numbers
  3. Redundant Number Systems
  4. Residue Number Systems
Elementary Operations