Computer Architecture 10. Fast Adders

Made with OpenOffice.org

Carry Problem
- Addition is the primary mechanism for implementing arithmetic operations.
- Slow addition directly affects the total performance of the computer.
- Complex (and fast) addition schemes increase the cost of the final implementation.
- The choice of adder structure must be tailored to the potential applications.

RCA (Computer Architecture I)
- RCA: Ripple-Carry Adder
- Simplest, Smallest and Slowest (SSS), but...
- Worst-case carry propagation is Θ(k), where k is the number of digits
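The ripple behaviour can be sketched in a few lines of Python. This is a minimal functional model, not a gate-level one; the names `full_adder` and `ripple_carry_add` are illustrative and do not come from the slides.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    s = a ^ b ^ c
    carry = (a & b) | (c & (a ^ b))
    return s, carry


def ripple_carry_add(x, y, k, c0=0):
    """Add two k-bit non-negative integers one bit position at a time.

    Returns (sum modulo 2**k, carry_out). The loop mirrors the Θ(k)
    worst case: each stage needs the carry of the stage below it.
    """
    carry = c0
    result = 0
    for i in range(k):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry
```

For example, `ripple_carry_add(0b1111, 0b0001, 4)` exercises the worst case, in which the carry generated at bit 0 travels through every stage.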

Bit-Serial RCA
- VLSI implementation advantages:
  - small pin count
  - reduced wire length
  - high clock rate
  - small space
  - low power consumption
- Alternative to pipeline units for parallel processing
[Figure: bit-serial RCA implementation; shift registers X and Y feed a single FA with carry c, and the sum X+Y goes to a shift register]
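A bit-serial adder can be modelled as a single full adder plus a one-bit carry register, processing one bit pair per clock cycle. The sketch below assumes the operands arrive least-significant bit first as Python lists; `bit_serial_add` is a hypothetical name used only for illustration.

```python
def bit_serial_add(x_bits, y_bits):
    """Add two equal-length, LSB-first bit lists with one full adder.

    Returns (sum_bits, final_carry); sum_bits is also LSB first.
    """
    carry = 0  # models the carry flip-flop, cleared before the first cycle
    out = []
    for a, b in zip(x_bits, y_bits):  # one loop iteration per clock cycle
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out, carry
```

With `x_bits = [1, 0, 1, 0]` (the value 5) and `y_bits = [1, 1, 0, 0]` (the value 3), the sum bits come out LSB first, one per cycle.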

Coping with Carry
- Detect the end of carry propagation: asynchronous adders
- Speed up the propagation (CPA, Carry-Propagate Adders): CLA / CSLA / CSKA (Carry-Lookahead / Carry-Select / Carry-Skip Adders)
- Limit the carry propagation: CSA (Carry-Save Adders), Estimate & Parallel Carry Adders
- Eliminate the carry propagation: carry-free RNS operations

Adders Overview (not complete)
- 1-bit adders: Half Adder (HA), Full Adder (FA), Bit Counter (m,k)
- CPA: RCA, CSKA, CSLA, CLA
- Multi-operand adders: 3-operand, Array, CSA, Tree

Asynchronous Adder
- Based on RCA with detection of carry completion
- Average longest chain of propagation is about $\log_2 k$
- Not suitable for synchronous processing
[Figure: carry-completion detection stage with inputs $a_i + b_i$, $a_i \cdot b_i$ and the carry pair $(c_i, d_i)$ from the previous stage, outputs $(c_{i+1}, d_{i+1})$, and a Complete signal that also collects the pairs from the other bit positions]

(c, d)   extended carry
(0, 0)   carry not yet known
(0, 1)   carry 0
(1, 0)   carry 1
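The quantity that an asynchronous adder waits for is the longest carry chain of the particular operand pair. A rough way to measure it in Python is sketched below, assuming a chain starts at a position that generates a carry (both bits 1) and grows while positions propagate it (exactly one bit 1). The function name and this exact counting convention are assumptions, not taken from the slides.

```python
def longest_carry_chain(x, y, k):
    """Length (in stages) of the longest carry chain when adding k-bit x and y."""
    longest = current = 0
    carrying = False
    for i in range(k):
        a, b = (x >> i) & 1, (y >> i) & 1
        if a & b:                    # generate: a new chain starts here
            carrying, current = True, 1
        elif (a ^ b) and carrying:   # propagate: the chain grows by one stage
            current += 1
        else:                        # absorb, or nothing to propagate
            carrying, current = False, 0
        longest = max(longest, current)
    return longest
```

Averaging this value over many random operand pairs is one way to check the $\log_2 k$ estimate empirically.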

CLA (Computer Architecture I)
- CLA: Carry-Lookahead Adder
- Fast but complex
- Worst-case carry propagation is Θ(log k)
[Figure: k-digit CLA built from 4-bit blocks with first-level and second-level lookahead (LA) logic]
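As a reminder of what the lookahead logic computes, here is a sketch of a single 4-bit lookahead block in Python. It uses the standard generate and propagate signals ($g_i = a_i b_i$, $p_i = a_i \oplus b_i$) and unrolls the recurrence $c_{i+1} = g_i + p_i c_i$ so that every carry depends only on the inputs and $c_0$. The name `cla4` is illustrative.

```python
def cla4(a, b, c0=0):
    """4-bit carry-lookahead addition; returns (4-bit sum, carry_out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # generate signals
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # propagate signals
    c = [c0, 0, 0, 0, 0]
    # Every carry is a two-level function of g, p and c0 (no rippling).
    c[1] = g[0] | (p[0] & c0)
    c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c[3] = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
            | (p[2] & p[1] & p[0] & c0))
    c[4] = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
            | (p[3] & p[2] & p[1] & g[0])
            | (p[3] & p[2] & p[1] & p[0] & c0))
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, c[4]
```

In hardware the four carry expressions are evaluated in parallel; the Python version only shows that they are functionally correct.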

Carry-Select Adder (CSLA)
- The oldest logarithmic-time adder
- The additions are performed in parallel for both alternative scenarios: carry = 0 or carry = 1
- The final selection of results is made with the computed value of the carry
- CSLAs have Θ(log k) addition time, but with high hardware complexity
[Figure: carry-select blocks, each evaluated for carry-in 1 and carry-in 0, with $c_0$ entering the lowest block]

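One-Level CSLA
[Figure: one-level carry-select adder. A k/2-bit RCA adds bits k/2-1...0 with carry-in $c_0$ and produces the low k/2 bits of the result and $c_{k/2}$. Two k/2-bit RCAs add bits k-1...k/2, one with carry-in 0 and one with carry-in 1. A 2-to-1 multiplexer (k/2 buses) controlled by $c_{k/2}$ selects $c_{out}$ and the high k/2 bits of the result.]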
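The one-level structure can be sketched functionally as follows. The sketch assumes an even k and uses Python's `+` as a stand-in for each k/2-bit RCA block; `carry_select_add` is a hypothetical name.

```python
def carry_select_add(x, y, k, c0=0):
    """One-level carry-select addition of two k-bit integers (k even).

    Returns (sum modulo 2**k, carry_out).
    """
    half = k // 2
    mask = (1 << half) - 1
    # Low half: a single adder with the real carry-in.
    low = (x & mask) + (y & mask) + c0
    c_half = low >> half
    # High half: both scenarios are computed before c_half is known.
    xh, yh = (x >> half) & mask, (y >> half) & mask
    high_if_0 = xh + yh
    high_if_1 = xh + yh + 1
    high = high_if_1 if c_half else high_if_0  # the 2-to-1 multiplexer
    return ((high & mask) << half) | (low & mask), high >> half
```

The two high-half additions do not depend on `c_half`, which is what lets the hardware run them in parallel with the low half.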

Block Propagation in CSLA
- Not optimal due to block-carry propagation delay
[Figure: k-bit CSLA split into four k/4-bit blocks (bits k-1...¾k, ¾k-1...½k, ½k-1...¼k, ¼k-1...0). The lowest block is a single k/4-bit RCA with carry-in $c_0$; each higher block has two k/4-bit RCAs, one for carry-in 0 and one for carry-in 1. The block carries $c_{k/4}$, $c_{k/2}$, $c_{3k/4}$ and $c_k$ are resolved one after another.]

Two-Level CSLA
- Block-carry propagation is fast, but the design is complex
[Figure: two-level CSLA built from seven k/4-bit RCAs over bits k-1...¾k, ¾k-1...½k, ½k-1...¼k and ¼k-1...0, with carries $c_0$, $c_{k/4}$, $c_{k/2}$, $c_{3k/4}$, $c_k$ and result fields k-1...½k, ½k-1...¼k, ¼k-1...0]

Propagation Chains
- Carries can be generated, propagated or absorbed
- Propagation chains are evaluated in parallel
- How to speed up the worst-case propagation?
[Figure: worst-case carry propagation]

Carry-Skip Adder (CSKA)
- Carry-in is propagated through n stages if p = 1
- Propagate condition p is easily computable
- Block propagation condition: $p = p_i \cdot p_{i+1} \cdot p_{i+2} \cdot p_{i+3}$
- Bit propagation condition: $p_i = a_i + b_i$
[Figure: 4-bit RCA (four FAs) with carry-in $c_i$, block propagate signal p and carry-skip logic on the carry-out]

CSKA
- 16-bit CSKA
[Figure: four 4-bit RCA blocks, each with a propagate signal p and carry-skip logic, chained through $c_0$, $c_4$, $c_8$, $c_{12}$, $c_{16}$; the worst-case path is carry-propagate, carry-skip, carry-propagate]
- The longest delay due to carry propagation:
  - propagation through bits 1-3 and the OR gate
  - skip over bits 4-11
  - propagation through bits 12-14
  - (bit 0 and bit 15 do not contribute to the carry-propagation delay)
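A functional sketch of a fixed-block carry-skip adder is given below. It assumes that k is a multiple of the block size b and uses the OR-based bit propagate condition $p_i = a_i + b_i$ stated for the CSKA. The skip path cannot change the numerical result; in hardware it only lets the block carry-in reach the block output earlier. The name `carry_skip_add` is illustrative.

```python
def carry_skip_add(x, y, k=16, b=4, c0=0):
    """Carry-skip addition of two k-bit integers using b-bit RCA blocks.

    Returns (sum modulo 2**k, carry_out). Assumes k % b == 0.
    """
    carry, result = c0, 0
    for lo in range(0, k, b):
        block_in = carry
        block_p = 1                      # AND of all p_i in this block
        for i in range(lo, lo + b):      # ripple inside the block
            ai, bi = (x >> i) & 1, (y >> i) & 1
            block_p &= ai | bi
            result |= (ai ^ bi ^ carry) << i
            carry = (ai & bi) | (carry & (ai | bi))
        # Skip logic: OR the rippled carry with (block propagate AND carry-in).
        carry |= block_p & block_in
    return result, carry
```

Calling `carry_skip_add(0x0FFF, 0x0001)` reproduces the kind of pattern described above: the carry is generated at bit 0, ripples through the rest of block 0, and then skips the fully propagating blocks.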

CSKA Delay Analysis
- Assume: 1 block skip delay = 1 bit carry-propagation delay
- k bits, b bits in each skip block (fixed size)
- $T_{fixcska}$: carry-propagation delay in a fixed-block-size CSKA (number of stages the carry must be propagated through)
[Figure: 16-bit CSKA of four 4-bit RCA blocks with propagate signals p and carry-skip logic between $c_0$, $c_4$, $c_8$, $c_{12}$, $c_{16}$]

$$T_{fixcska} = (b - 1) + 0.5 + (k/b - 2) + (b - 1) = 2b + k/b - 3.5$$

- The four terms are, in order: propagation in block 0, the OR gate, all skips, and propagation in the last block
- e.g. 32-bit: $T_{fixcska} = 12.5$ ($b = 4$, $k = 32$)
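The four contributions to the fixed-block delay can be tabulated with a small helper. This is only a restatement of the formula above in Python; the function name and the dictionary keys are illustrative.

```python
def t_fix_cska_terms(k, b):
    """Delay terms of a fixed-block CSKA, in units of one bit-carry delay."""
    return {
        "block 0": b - 1,      # ripple through the first block
        "OR gate": 0.5,        # skip logic at the output of block 0
        "skips": k / b - 2,    # one skip per intermediate block
        "last block": b - 1,   # ripple through the last block
    }


def t_fix_cska(k, b):
    """Total delay: 2b + k/b - 3.5."""
    return sum(t_fix_cska_terms(k, b).values())
```

For `k = 32` and `b = 4` the terms are 3, 0.5, 6 and 3, which add up to the 12.5 quoted on the slide.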

Fixed-Size Blocks
- Optimal size of fixed-size skip blocks:

$$\frac{dT_{fixcska}}{db} = \frac{d(2b + k/b - 3.5)}{db} = 2 - \frac{k}{b^2} = 0$$

$$b_{opt} = (k/2)^{1/2}$$

$$T_{fixcska\text{-}opt} = 2(2k)^{1/2} - 3.5$$

- e.g. 16-bit adder: $b_{opt} \approx 3$, $T_{fixcska} \approx 8$
- e.g. 32-bit adder: $b_{opt} = 4$, $T_{fixcska} = 12.5$
- e.g. 64-bit adder: $b_{opt} \approx 6$, $T_{fixcska} \approx 19$
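The closed-form optimum can be cross-checked by brute force over integer block sizes. The sketch below follows the slide's approximation of treating k/b as a real number even when b does not divide k; the helper names are assumptions made for this example.

```python
import math


def t_fix_cska(k, b):
    """Fixed-block CSKA delay, 2b + k/b - 3.5, in bit-carry delays."""
    return 2 * b + k / b - 3.5


def best_fixed_block(k):
    """Integer block size in 1..k that minimises t_fix_cska(k, b)."""
    return min(range(1, k + 1), key=lambda b: t_fix_cska(k, b))


def b_opt_closed_form(k):
    """Real-valued optimum from dT/db = 0: sqrt(k / 2)."""
    return math.sqrt(k / 2)
```

For k = 16, 32 and 64 the search returns 3, 4 and 6, matching the rounded values of $(k/2)^{1/2}$ listed in the examples.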

Variable-Size Blocks
- Variable-size skip blocks shorten the propagation
- The optimal configuration has the longest blocks in the middle and the shortest at both ends
- t: number of blocks (even number)
- b: bits in the smallest skip block
- Block sizes: $b,\ b+1,\ \ldots,\ b+t/2-1,\ b+t/2-1,\ \ldots,\ b+1,\ b$

$$k = t\,(b + t/4 - 1/2)$$

$$b = k/t - t/4 + 1/2$$
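The block-size pattern and the resulting total width can be checked with a short sketch. `variable_block_sizes` is a hypothetical helper that builds the symmetric list of sizes for a given even t and smallest size b.

```python
def variable_block_sizes(t, b):
    """Sizes b, b+1, ..., b+t/2-1, b+t/2-1, ..., b+1, b for an even t."""
    if t % 2:
        raise ValueError("t must be even")
    rising = [b + i for i in range(t // 2)]
    return rising + rising[::-1]
```

For `t = 8` and `b = 1` this gives `[1, 2, 3, 4, 4, 3, 2, 1]`, whose sum of 20 agrees with $k = t\,(b + t/4 - 1/2)$.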

Variable-Size Blocks
- $T_{varcska}$: carry-propagation delay in a variable-block-size CSKA (number of stages the carry must be propagated through)
- Worst-case path: $(b - 1)$ in the first block, 0.5 for the OR gate, $(t - 2)$ skips, $(b - 1)$ in the last block

$$T_{varcska} = 2(b - 1) + 0.5 + (t - 2) = 2k/t + t/2 - 2.5$$

- Optimal number of variable-size skip blocks:

$$\frac{dT_{varcska}}{dt} = -\frac{2k}{t^2} + \frac{1}{2} = 0$$

$$t_{opt} = 2k^{1/2}$$

$$T_{varcska\text{-}opt} = 2k^{1/2} - 2.5$$

Variable-Size Blocks

$$t_{opt} = 2k^{1/2}, \qquad T_{varcska\text{-}opt} = 2k^{1/2} - 2.5, \qquad b_{opt} \approx 1$$

- e.g. 16-bit adder: $t_{opt} = 8$, $T_{varcska} \approx 5.5$
- e.g. 32-bit adder: $t_{opt} \approx 12$, $T_{varcska} \approx 9$
- e.g. 64-bit adder: $t_{opt} = 16$, $T_{varcska} \approx 13.5$
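The example figures can be reproduced numerically. The sketch below evaluates the variable-block delay $2k/t + t/2 - 2.5$ and the optimum $t_{opt} = 2k^{1/2}$; the function names are illustrative, and rounding $t_{opt}$ to an even integer (12 for k = 32) is left to the caller.

```python
import math


def t_var_cska(k, t):
    """Variable-block CSKA delay for k bits and t blocks."""
    return 2 * k / t + t / 2 - 2.5


def t_opt(k):
    """Real-valued optimal number of blocks, 2 * sqrt(k)."""
    return 2 * math.sqrt(k)
```

For k = 16 and k = 64 the optimum is an even integer (8 and 16), giving delays of 5.5 and 13.5; for k = 32 the choice t = 12 gives about 8.8, which the slide rounds to 9.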

Multilevel CSKA
- First-level skip blocks get the propagate-condition signal from block adders
- Second-level skip blocks get the propagate-condition signal from first-level skip blocks
- Carry can be propagated over a group of skip blocks

Multilevel CSKA
- Multilevel structures are the results of complex optimizations and are usually not regular
[Figure: adders, first-level skip blocks, second-level skip blocks]

Hybrid Fast Adders
- Combining RCA, CLA, CSKA and other adders makes it possible to satisfy various criteria:
  - high performance
  - cost-effectiveness
  - low power consumption

Example
- CSLA + CSKA + CLA
[Figure: hybrid adder with CL logic above three carry-select pairs of CSKA blocks (carry-in 0 and carry-in 1) and a single CSKA block]

Multioperand Adders
- Applications in multiplication, vector and matrix arithmetic, and others
[Figure: two diagrams of a product x*y formed as a multi-operand addition]
- Adding n numbers of k bits each, the total sum has $k + \log_2 n$ bits
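The width of a multi-operand sum can be checked directly: the largest possible sum of n k-bit numbers is $n(2^k - 1)$. The sketch below computes how many bits that value needs and compares it with $k + \lceil \log_2 n \rceil$, which acts as an upper bound; `max_sum_width` is an illustrative name.

```python
import math


def max_sum_width(n, k):
    """Bits needed to hold the largest sum of n unsigned k-bit operands."""
    return (n * ((1 << k) - 1)).bit_length()


def width_bound(n, k):
    """The slide's estimate with log2(n) rounded up: k + ceil(log2 n)."""
    return k + math.ceil(math.log2(n))
```

For example, four 8-bit operands need 10 bits and eight 4-bit operands need 7 bits, both equal to the bound.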

Serial Implementation
- Latency Θ(n log k):
  - faster-than-linear dependence on the number of operands
  - logarithmic dependence on the operand size
[Figure: n k-bit operands, a fast adder with Θ(log k) delay, a partial-sum register of k + log n bits, and a shift register]

Tree Implementation with CPA
- Tree of 2-operand adders (RCAs are best here!)
- For n operands, n-1 adders are needed (costly)
- Latency Θ(k + log n) scales well with n
[Figure: n k-bit operands enter a tree of RCAs; the intermediate sums are k+1 and k+2 bits wide, and the final RCA produces a (k+3)-bit result]
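The adder count and tree depth can be illustrated with a small pairwise reduction. In this sketch Python's `+` stands in for each 2-operand adder, and an unpaired operand is simply passed to the next level; `adder_tree_sum` is a hypothetical name and the operand list is assumed to be non-empty.

```python
def adder_tree_sum(operands):
    """Sum operands with a binary tree of 2-operand adders.

    Returns (total, number_of_adders, number_of_levels).
    """
    level = list(operands)
    adders = depth = 0
    while len(level) > 1:
        # Pair up neighbours; each pair costs one 2-operand adder.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        adders += len(nxt)
        if len(level) % 2:           # odd operand out passes through
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], adders, depth
```

Seven operands use six adders in three levels, consistent with the n-1 adders and the logarithmic number of levels stated above.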

Look into Tree of RCAs
- Adders at higher levels need not wait for full carry propagation from lower-level adders
- All adders start just one FA delay after the previous level
[Figure: a single RCA at level i (FA, FA, HA) feeding a single RCA at level i+1 (FA, HA)]

Tree Implementation with CSA
- Tree of 3-operand carry-save adders
- CSAs reduce n operands to 2 operands, Θ(log n)
- A final fast CPA is needed, Θ(log k)
- Latency Θ(log n + log k) scales well with n and k
[Figure: n k-bit operands reduced by a tree of CSAs, followed by a CPA]
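A carry-save adder turns three operands into a sum word and a carry word without propagating any carry, so a tree of them can be sketched as repeated 3-to-2 reduction. The grouping below (consecutive triples, leftovers passed through) is just one simple choice and is not claimed to be the arrangement in the figure; Python's built-in addition stands in for the final CPA, and the function names are illustrative.

```python
def carry_save(a, b, c):
    """3-to-2 reduction: returns (sum_word, carry_word) with a+b+c == s+cy."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy


def csa_tree_sum(operands):
    """Reduce non-negative operands to two words with CSAs, then add them.

    Returns (total, number_of_csa_levels).
    """
    ops = list(operands)
    levels = 0
    while len(ops) > 2:
        nxt = []
        for i in range(0, len(ops) - 2, 3):      # each full triple -> 2 words
            nxt.extend(carry_save(ops[i], ops[i + 1], ops[i + 2]))
        nxt.extend(ops[len(ops) - len(ops) % 3:])  # leftovers pass through
        ops = nxt
        levels += 1
    return sum(ops), levels  # the final sum models the carry-propagate adder
```

Only the last step performs a carry-propagating addition; every earlier level has a delay of one full adder regardless of the operand width.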