Computer Architecture 10
Fast Adders
Made with OpenOffice.org

Carry Problem
- Addition is the primary mechanism for implementing arithmetic operations.
- Slow addition directly degrades the overall performance of the computer.
- Complex (and fast) addition schemes increase the cost of the final implementation.
- The adder structure must be tailored to the intended applications.

RCA (Computer Architecture I)
- RCA: Ripple Carry Adder
- Simplest, Smallest and Slowest (SSS), but...
- Worst-case carry propagation is Θ(k) for k digits

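The Θ(k) worst case follows from the structure: the carry has to visit the k stages one after another. A minimal behavioural sketch is shown here; the function names `full_adder` and `ripple_carry_add` are illustrative and do not come from the slides.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum bit, carry-out bit)."""
    s = a ^ b ^ c
    carry = (a & b) | (c & (a ^ b))
    return s, carry


def ripple_carry_add(x, y, k, c0=0):
    """Add two k-bit unsigned integers with a chain of k full adders."""
    carry = c0
    result = 0
    for i in range(k):  # the carry passes through the stages one by one
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry  # k-bit sum and carry-out c_k


print(ripple_carry_add(11, 6, 4))  # (1, 1): 11 + 6 = 17 = 16 + 1
```

The loop body is one FA delay, so the number of iterations is the hardware delay in FA units in the worst case.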
Bit-Serial RCA
- VLSI implementation advantages: small pin count, reduced wire length, high clock rate, small area, low power consumption
- An alternative to pipelined units for parallel processing
[Figure: bit-serial RCA implementation. Shift registers X and Y feed a single FA whose carry c is fed back; the sum X+Y is shifted out.]

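A bit-serial adder can be sketched as a single full adder plus one stored carry bit, processing one bit position per clock cycle. The function name and the LSB-first list representation are assumptions made for this illustration.

```python
def bit_serial_add(x_bits, y_bits):
    """Add two equal-length bit lists (LSB first) with one FA and one carry bit."""
    carry = 0  # models the carry flip-flop, cleared before the first cycle
    out = []
    for a, b in zip(x_bits, y_bits):  # one iteration = one clock cycle
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out, carry


# 6 + 3 = 9, bits listed LSB first
print(bit_serial_add([0, 1, 1, 0], [1, 1, 0, 0]))  # ([1, 0, 0, 1], 0)
```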
Coping with Carry
- Detect the end of carry propagation: asynchronous adders
- Speed up the propagation (CPA, Carry Propagate Adders): CLA / CSLA / CSKA (Carry Lookahead / Select / Skip Adders)
- Limit the carry propagation: CSA (Carry Save Adders), Estimate and Parallel Carry Adders
- Eliminate the carry propagation: carry-free RNS operations

Adders Overview (not complete)
- 1-bit adders: Half Adder (HA), Full Adder (FA), Bit Counter (m,k)
- CPA: RCA, CSKA, CSLA, CLA
- Multi-operand adders: 3-operand, Array, CSA, Tree

Asynchronous Adder
- Based on RCA with detection of carry completion
- The average longest propagation chain is about $\log_2 k$
- Not suitable for synchronous processing
[Figure: one stage of the carry-completion logic. Inputs a_i*b_i, a_i+b_i and the carry pair (c_i, d_i) from the previous stage; outputs (c_{i+1}, d_{i+1}). The Complete signal also takes the carry pairs from the other bit positions.]
Extended carry (c, d):
(c, d)   Meaning
(0, 0)   Carry not yet known
(0, 1)   Carry 0
(1, 0)   Carry 1

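The extended-carry encoding can be explored with a small simulation. This is a simplified timing model, not the circuit on the slide: it assumes every stage that generates or absorbs a carry knows its carry-out at once, and that a known carry advances through one propagating stage per step. The function names are illustrative.

```python
import random


def completion_time(x, y, k, c0=0):
    """Steps until all carries of a k-bit addition are known (dual-rail model)."""
    # carries[i] is the (c, d) pair entering bit i; (0, 0) means "not yet known"
    carries = [(0, 0)] * (k + 1)
    carries[0] = (1, 0) if c0 else (0, 1)
    for i in range(k):
        a, b = (x >> i) & 1, (y >> i) & 1
        if a & b:                # generate: carry-out is 1
            carries[i + 1] = (1, 0)
        elif not (a | b):        # absorb: carry-out is 0
            carries[i + 1] = (0, 1)
    steps = 0
    while (0, 0) in carries:     # the Complete signal is not yet raised
        nxt = list(carries)
        for i in range(k):
            # unknown carry-outs belong to propagating stages (a != b)
            if carries[i + 1] == (0, 0) and carries[i] != (0, 0):
                nxt[i + 1] = carries[i]
        carries = nxt
        steps += 1
    return steps


def average_completion_time(k, trials=2000, seed=1):
    """Average completion time over random k-bit operand pairs."""
    rng = random.Random(seed)
    total = sum(completion_time(rng.getrandbits(k), rng.getrandbits(k), k)
                for _ in range(trials))
    return total / trials


print(completion_time(0b1111, 0b0000, 4))  # 4: every stage propagates c0
print(average_completion_time(64))         # compare with log2(64) = 6
```

Under this model the completion time equals the length of the longest propagation chain, which is the quantity the slide estimates.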
CLA (Computer Architecture I)
- CLA: Carry Lookahead Adder
- Fast but complex
- Worst-case carry propagation is Θ(log k)
[Figure: k-digit CLA built from 4-bit blocks with first-level lookahead (LA) logic, combined by second-level LA logic.]

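As a reminder of what the lookahead logic inside one 4-bit block computes, the sketch below evaluates every carry directly from the generate and propagate signals instead of rippling. It assumes the usual definitions g_i = a_i*b_i and p_i = a_i + b_i; the function name is illustrative.

```python
def lookahead_carries(a, b, c0=0, width=4):
    """Return [c_0, ..., c_width], each computed directly from g, p and c_0."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]  # generate
    p = [((a >> i) | (b >> i)) & 1 for i in range(width)]  # propagate
    carries = [c0]
    for i in range(width):
        # c_{i+1} = g_i + p_i g_{i-1} + ... + p_i ... p_1 g_0 + p_i ... p_0 c_0
        c = all(p[:i + 1]) and c0
        for j in range(i + 1):
            c = c or (g[j] and all(p[j + 1:i + 1]))
        carries.append(int(bool(c)))
    return carries


print(lookahead_carries(0b0111, 0b0001))  # [0, 1, 1, 1, 0]
```

Each carry is a two-level AND-OR expression, which is why the logic is fast but grows quickly with the block width.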
Carry Select Adder (CSLA)
- The oldest logarithmic-time adder
- Additions are performed in parallel for both alternative scenarios: carry-in = 0 and carry-in = 1
- The final result is selected once the actual carry value has been computed
- CSLAs have Θ(log k) addition time, but at a high hardware cost
[Figure: chain of multiplexers selecting between the 0 and 1 alternatives, starting from c_0.]

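One-Level CSLA
[Figure: one-level CSLA. The low half (bits k/2-1...0) is added by one k/2-bit RCA with carry-in c_0. The high half (bits k-1...k/2) is added twice, by two k/2-bit RCAs with carry-in 0 and carry-in 1. The carry c_{k/2} from the low half controls a 2-to-1 multiplexer (k/2-bit buses) that delivers the high k/2 bits of the result and c_out.]
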
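The one-level structure can be sketched behaviourally as follows. The helper `rca` stands in for a k/2-bit ripple-carry adder and uses Python integer addition; k is assumed to be even. Both function names are illustrative.

```python
def rca(x, y, width, c_in):
    """Behavioural stand-in for a width-bit RCA: returns (sum, carry-out)."""
    total = x + y + c_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(x, y, k, c0=0):
    """One-level carry-select addition of two k-bit numbers (k even)."""
    h = k // 2
    mask = (1 << h) - 1
    # low half: a single RCA with the real carry-in
    lo, c_half = rca(x & mask, y & mask, h, c0)
    # high half: both scenarios are computed in parallel in hardware
    hi0, cout0 = rca(x >> h, y >> h, h, 0)
    hi1, cout1 = rca(x >> h, y >> h, h, 1)
    # multiplexer controlled by c_{k/2}
    hi, c_out = (hi1, cout1) if c_half else (hi0, cout0)
    return (hi << h) | lo, c_out


print(carry_select_add(0x0F, 0x01, 8))  # (16, 0)
```

The delay is one k/2-bit addition plus one multiplexer, at the price of three k/2-bit adders instead of two.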
Block Propagation in CSLA
- Not optimal due to the block-carry propagation delay
[Figure: CSLA with four k/4-bit blocks (bits k-1...¾k, ¾k-1...½k, ½k-1...¼k, ¼k-1...0). The lowest block is a single k/4-bit RCA with carry-in c_0; each higher block has two k/4-bit RCAs with carry-in 0 and 1. The block carries c_{k/4}, c_{k/2}, c_{3k/4} and c_k select the results one after another.]

Two-Level CSLA
- Block-carry propagation is fast, but the design is complex
[Figure: two-level CSLA built from k/4-bit RCAs precomputed for carry-in 0 and 1, with block carries c_{k/4}, c_{k/2}, c_{3k/4} and c_k; result fields k-1...½k, ½k-1...¼k and ¼k-1...0.]

Propagation Chains
- Carries can be generated, propagated or absorbed
- Propagation chains are evaluated in parallel
- How can the worst-case propagation be sped up?
[Figure: worst-case carry propagation.]

Carry Skip Adder (CSKA)
- The carry-in is propagated through n stages if p = 1
- The propagate condition p is easily computable
- Block propagation condition: $p = p_i \cdot p_{i+1} \cdot p_{i+2} \cdot p_{i+3}$
- Bit propagation condition: $p_i = a_i + b_i$
[Figure: 4-bit RCA of four FAs with carry-in c_i and carry-skip logic controlled by p.]

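One skip block can be sketched as below, assuming an AND-OR skip (carry-out = ripple carry OR p AND carry-in), which is what makes the OR-based propagate condition p_i = a_i + b_i safe to use. The function name and the LSB-first bit lists are assumptions for this illustration.

```python
def skip_block(a_bits, b_bits, c_in):
    """One carry-skip block: returns (sum bits LSB first, carry-out, p)."""
    carry = c_in
    sums = []
    p = 1
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a | b))  # ripple path inside the block
        p &= a | b                           # p = p_i * p_{i+1} * ...
    # The skip path yields the same logical value as the ripple carry;
    # in hardware it is simply available much earlier when p = 1.
    c_out = carry | (p & c_in)
    return sums, c_out, p


print(skip_block([1, 1, 1, 1], [0, 0, 0, 0], 1))  # ([0, 0, 0, 0], 1, 1)
```

The sketch only models the logic values. The speed advantage comes from timing: when p = 1 the carry-in bypasses the four full adders through a single AND-OR stage.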
CSKA
[Figure: 16-bit CSKA built from four 4-bit RCA blocks, each with a propagate signal p and carry-skip logic; block carries c_0, c_4, c_8, c_12, c_16. The worst-case path is carry-propagate, carry-skip, carry-propagate.]
The longest delay due to carry propagation:
- propagation through bits 1-3 and the OR gate
- skip over bits 4-11
- propagation through bits 12-14
(The skip paths of block 0 and of the last block do not contribute to the carry-propagation delay.)

CSKA Delay Analysis
Assume:
- 1 block skip delay = 1 bit carry-propagation delay
- k bits in total, b bits per skip block (fixed size)
- $T_{\text{fixcska}}$: carry-propagation delay in a fixed-block-size CSKA (the number of stages the carry must be propagated through)
[Figure: 16-bit CSKA with four 4-bit RCA blocks, propagate signals p, carry-skip logic and block carries c_0, c_4, c_8, c_12, c_16.]
$T_{\text{fixcska}} = (b - 1) + 0.5 + (k/b - 2) + (b - 1) = 2b + k/b - 3.5$
The four terms are: propagation in block 0, the OR gate, all skips, and propagation in the last block.
E.g. 32-bit adder: $T_{\text{fixcska}} = 12.5$ ($b = 4$, $k = 32$)

Fixed-Size Blocks
Optimal size of fixed-size skip blocks:
$dT_{\text{fixcska}}/db = 0$
$d(2b + k/b - 3.5)/db = 2 - k/b^2 = 0$
$b_{\text{opt}} = (k/2)^{1/2}$
$T_{\text{fixcska-opt}} = 2(2k)^{1/2} - 3.5$
- e.g. 16-bit adder: $b_{\text{opt}} \approx 3$, $T_{\text{fixcska}} \approx 8$
- e.g. 32-bit adder: $b_{\text{opt}} = 4$, $T_{\text{fixcska}} = 12.5$
- e.g. 64-bit adder: $b_{\text{opt}} \approx 6$, $T_{\text{fixcska}} \approx 19$

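The closed-form optimum can be cross-checked numerically against the delay formula. The sketch below is an idealised check: it treats b as any integer from 1 to k, even when b does not divide k. The function names are illustrative.

```python
import math


def t_fixcska(k, b):
    """Delay in stages: block 0 + OR gate + all skips + last block."""
    return (b - 1) + 0.5 + (k / b - 2) + (b - 1)


def closed_form_optimum(k):
    """Real-valued optimum: b_opt = sqrt(k/2), T_opt = 2*sqrt(2k) - 3.5."""
    return math.sqrt(k / 2), 2 * math.sqrt(2 * k) - 3.5


def best_integer_block(k):
    """Integer block size with the smallest delay, found by exhaustive search."""
    return min(range(1, k + 1), key=lambda b: t_fixcska(k, b))


for k in (16, 32, 64):
    b_opt, t_opt = closed_form_optimum(k)
    b_int = best_integer_block(k)
    print(k, round(b_opt, 2), round(t_opt, 2), b_int, round(t_fixcska(k, b_int), 2))
```

The exhaustive search returns block sizes 3, 4 and 6 for 16, 32 and 64 bits, matching the rounded values on the slide.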
Variable-Size Blocks
- Variable-size skip blocks shorten the propagation
- The optimal configuration has the longest blocks in the middle and the shortest at both ends
- t: number of blocks (an even number)
- b: number of bits in the smallest skip block
Block sizes: $b,\ b+1,\ \ldots,\ b+t/2-1,\ b+t/2-1,\ \ldots,\ b+1,\ b$
$k = t\,(b + t/4 - 1/2)$
$b = k/t - t/4 + 1/2$

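The block-size profile and the relation between k, t and b can be checked with a few lines. The function names are illustrative, and t is assumed to be even as on the slide.

```python
def variable_blocks(t, b):
    """Block sizes b, b+1, ..., b+t/2-1, b+t/2-1, ..., b+1, b (t even)."""
    half = [b + i for i in range(t // 2)]
    return half + half[::-1]


def total_bits(t, b):
    """k = t * (b + t/4 - 1/2)."""
    return t * (b + t / 4 - 0.5)


blocks = variable_blocks(8, 1)
print(blocks)                            # [1, 2, 3, 4, 4, 3, 2, 1]
print(sum(blocks), total_bits(8, 1))     # 20 20.0
```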
Variable-Size Blocks
- $T_{\text{varcska}}$: carry-propagation delay in a variable-block-size CSKA (the number of stages the carry must be propagated through)
[Figure: worst-case path with b-1 stages in the first block, 0.5 for the OR gate, t-2 skips, and b-1 stages in the last block.]
$T_{\text{varcska}} = 2(b - 1) + 0.5 + (t - 2) = 2k/t + t/2 - 2.5$
Optimal number of variable-size skip blocks:
$dT_{\text{varcska}}/dt = 0$
$-2k/t^2 + 1/2 = 0$
$t_{\text{opt}} = 2k^{1/2}$
$T_{\text{varcska-opt}} = 2k^{1/2} - 2.5$

Variable-Size Blocks
$t_{\text{opt}} = 2k^{1/2}$
$T_{\text{varcska-opt}} = 2k^{1/2} - 2.5$
$b_{\text{opt}} \approx 1$
- e.g. 16-bit adder: $t_{\text{opt}} = 8$, $T_{\text{varcska}} \approx 5.5$
- e.g. 32-bit adder: $t_{\text{opt}} \approx 12$, $T_{\text{varcska}} \approx 9$
- e.g. 64-bit adder: $t_{\text{opt}} = 16$, $T_{\text{varcska}} \approx 13.5$

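The figures above can be reproduced, and compared with the fixed-block optimum, with the short sketch below. Rounding the block count to an even integer via `2 * round(sqrt(k))` is an assumption chosen to match the slide's examples; the function names are illustrative.

```python
import math


def t_varcska(k, t):
    """Variable-block delay: 2k/t + t/2 - 2.5."""
    return 2 * k / t + t / 2 - 2.5


def t_fixcska_opt(k):
    """Fixed-block optimum for comparison: 2*sqrt(2k) - 3.5."""
    return 2 * math.sqrt(2 * k) - 3.5


def even_t_opt(k):
    """Even block count near the real optimum t_opt = 2*sqrt(k) (assumed rounding)."""
    return 2 * round(math.sqrt(k))


for k in (16, 32, 64):
    t = even_t_opt(k)
    print(k, t, round(t_varcska(k, t), 2), round(t_fixcska_opt(k), 2))
```

For these three widths the variable-block delay is lower than the fixed-block optimum, roughly by a factor of $\sqrt{2}$ for large k.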
Multilevel CSKA
- First-level skip blocks get the propagate-condition signal from the block adders
- Second-level skip blocks get the propagate-condition signal from the first-level skip blocks
- A carry can be propagated over a group of skip blocks

Multilevel CSKA
- Multilevel structures are the result of complex optimizations and are usually not regular
[Figure: adders with first-level skip blocks and second-level skip blocks.]

Hybrid Fast Adders
- Combining RCA, CLA, CSKA and other adders makes it possible to satisfy various criteria:
- high performance
- cost-effectiveness
- low power consumption

Example CSLA+CSKA+CLA
[Figure: hybrid adder with CL logic on top of CSKA blocks; three pairs of CSKA blocks are precomputed for carry-in 0 and 1 (carry select), plus one single CSKA block.]

Multioperand Adders
- Applications in multiplication, vector and matrix arithmetic, and others
[Figure: two examples of forming a product x*y from operands x and y.]
- Adding n numbers of k bits each gives a total sum of $k + \log_2 n$ bits

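The width of the total sum can be checked directly. The sketch assumes unsigned operands and rounds $\log_2 n$ up to an integer, so the formula is treated as an upper bound; the function names are illustrative.

```python
def sum_width(n, k):
    """Bits needed for the largest possible sum of n unsigned k-bit numbers."""
    return (n * ((1 << k) - 1)).bit_length()


def width_bound(n, k):
    """k + ceil(log2 n), computed with integer arithmetic (n >= 1)."""
    return k + (n - 1).bit_length()


print(sum_width(4, 8), width_bound(4, 8))  # 10 10
print(sum_width(3, 1), width_bound(3, 1))  # 2 3: the bound is not always tight
```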
Serial Implementation
- Latency Θ(n log k)
- faster than linear dependence on the number of operands
- logarithmic dependence on the operand size
[Figure: n operands of k bits, a fast adder with Θ(log k) delay, a partial-sum register of k + log n bits, and a shift register.]

Tree Implementation with CPA
- Tree of 2-operand adders (RCAs are best here!)
- For n operands, n-1 adders are needed (costly)
- Latency Θ(k + log n), which scales well with n
[Figure: n k-bit operands enter a binary tree of RCAs; intermediate widths k+1 and k+2, with a (k+3)-bit result.]

Look into Tree of RCAs
- Adders at higher levels need not wait for full carry propagation in the lower-level adders
- Each level can start just one FA delay after the previous level
[Figure: a single RCA at level i (FA, FA, HA) feeding a single RCA at level i+1 (FA, HA).]

Tree Implementation with CSA
- Tree of 3-operand carry-save adders
- CSAs reduce n operands to 2 operands in Θ(log n)
- A final fast CPA is needed, Θ(log k)
- Latency Θ(log n + log k), which scales well with n and k
[Figure: n k-bit operands reduced by a tree of CSAs, followed by a CPA.]

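The reduction can be sketched with word-level carry-save adders: each CSA turns three operands into a sum word and a carry word without any carry propagation, and only the final two words go through a carry-propagate addition. The grouping of operands into triples and the function names are assumptions for this illustration; operands are taken to be non-negative integers.

```python
def carry_save_add(x, y, z):
    """3-to-2 reduction: returns (sum word, carry word) with x + y + z == s + c."""
    s = x ^ y ^ z                                # bitwise sum, no propagation
    c = ((x & y) | (x & z) | (y & z)) << 1       # carries move one position left
    return s, c


def csa_tree_sum(operands):
    """Reduce operands with CSA levels, then finish with one CPA addition."""
    ops = list(operands)
    levels = 0
    while len(ops) > 2:
        full = len(ops) - len(ops) % 3           # operands that form whole triples
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(carry_save_add(ops[i], ops[i + 1], ops[i + 2]))
        nxt.extend(ops[full:])                   # leftovers pass to the next level
        ops = nxt
        levels += 1
    return sum(ops), levels                      # sum() models the final CPA


print(csa_tree_sum([1, 2, 3, 4, 5, 6, 7]))  # (28, 4)
```

Each level shrinks the operand count by roughly a factor of 3/2, which gives the Θ(log n) number of CSA levels.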