Lecture 4. Adders. Computer Systems Laboratory Stanford University


Lecture 4: Adders
Mark Horowitz, Computer Systems Laboratory, Stanford University (horowitz@stanford.edu)
Copyright 2006 Mark Horowitz. Some figures from High-Performance Microprocessor Design, IEEE.

Overview
- Readings
- Today's topics
  - Fast adders generally use a tree structure for parallelism
  - We will cover basic tree terminology and structures
  - Look at a few example adder architectures
  - Examples will spill into the next lecture as well

Adders
- The task of an adder is conceptually simple: $Sum[n:0] = A[n:0] + B[n:0] + C_0$
  - Subtractors are also very simple: $-B = \overline{B} + 1$, so invert B and set $C_0 = 1$
- Per-bit formulas
  - $Sum_i = A_i \oplus B_i \oplus C_i$
  - $Cout_i = C_{i+1} = \mathrm{majority}(A_i, B_i, C_i)$
- The fundamental problem is calculating the carry into the n-th bit
  - All carry terms depend on all previous terms
  - So the LSB input has a fanout of n
  - And an absolute minimum of $\log_4 n$ FO4 delays, without any logic
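The per-bit formulas can be checked with a small bit-level model. This is only a sketch: the function names and the default 8-bit width are assumptions made for illustration, not part of the lecture.

```python
def full_adder(a, b, c):
    """One-bit full adder: sum is the 3-way XOR, carry-out is the majority."""
    s = a ^ b ^ c
    cout = (a & b) | (a & c) | (b & c)
    return s, cout


def ripple_add(a, b, c0=0, width=8):
    """Add two width-bit integers one bit at a time.

    Returns (sum modulo 2**width, carry out of the top bit).
    """
    carry, total = c0, 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


def ripple_sub(a, b, width=8):
    """a - b computed as a + ~b + 1: invert B and set the carry-in to 1."""
    mask = (1 << width) - 1
    return ripple_add(a, ~b & mask, c0=1, width=width)[0]
```

The loop makes the serial dependence visible: each carry needs the previous one, which is the linear delay the rest of the lecture tries to avoid.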

Single-Bit Adders
- Adders are chock-full of XORs, which makes them interesting
  - One of the few circuits where pass-gate logic is attractive
- [Figure: a complicated differential pass-gate logic (DPL) block from the text. Annotation: lousy way to draw a pair of inverters.]

G and P and K, Oh My!
- Most fast adders Generate, Propagate, or Kill the carry
  - Usually only G and P are used; K only appears in some carry chains
- When does a bit Generate a carry out? $G_i = A_i B_i$
  - If $G_i$ is true, then $Cout_i = C_{i+1}$ is forced to be true
- When does a bit Propagate a carry in to the carry out? $P_i = A_i \oplus B_i$
  - If $P_i$ is true, then $Cout_i$ ($= C_{i+1}$) follows $C_i$
- Usually implemented as $P_i = A_i + B_i$ (OR)
  - OR is cheaper/faster than an XOR
  - The carry logic is unaffected: Cout is still equal to $G_i + P_i C_i$
  - Just beware that with the OR form, $Sum_i \neq P_i \oplus C_i$
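A quick way to convince yourself that the OR form of P is safe for carries but not for sums is to compute both. The helper names below (`carries`, `xor_p`, `or_p`) are hypothetical, chosen only for this sketch.

```python
def carries(a, b, cin, width, propagate):
    """Return [C_0, ..., C_width] using C_{i+1} = G_i + P_i C_i."""
    c = [cin]
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g = ai & bi                 # generate
        p = propagate(ai, bi)       # propagate, XOR or OR form
        c.append(g | (p & c[i]))
    return c


def xor_p(x, y):
    """Propagate as A XOR B (also usable for the sum)."""
    return x ^ y


def or_p(x, y):
    """Propagate as A OR B (cheaper, carries only)."""
    return x | y
```

The two forms differ only when A_i = B_i = 1, and in that case G_i already forces the carry, so the carry chain cannot tell them apart. The sum bit can: for A_i = B_i = 1 and C_i = 0 the true sum is 0, but `or_p(1, 1) ^ 0` is 1.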

Using G and P
- We can combine $G_i$ and $P_i$ into larger blocks
  - Call these group generate and group propagate terms
- When does a group Generate a carry out? (e.g., 4 bits)
  - $G_{3:0} = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0$
- When does a group Propagate a carry in to the carry out?
  - $P_{3:0} = P_3 P_2 P_1 P_0$
- We can also combine groups of groups
  - $G_{i:j} = G_{i:k} + P_{i:k} G_{k-1:j}$
  - $P_{i:j} = P_{i:k} P_{k-1:j}$
  - [Figure: the range i:j split into i:k on the MSB side and k-1:j on the LSB side]
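The group-of-groups rule is a binary operator on (G, P) pairs, and the 4-bit formulas are what you get by applying it repeatedly. A minimal sketch, with `combine` and `group_gp_flat` as illustrative names:

```python
def combine(hi, lo):
    """(G, P) of a group from its MSB-side part `hi` and LSB-side part `lo`.

    Implements G_{i:j} = G_{i:k} + P_{i:k} G_{k-1:j} and P_{i:j} = P_{i:k} P_{k-1:j}.
    """
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def group_gp_flat(g, p):
    """G_{3:0} and P_{3:0} from the expanded 4-bit formulas (index 0 is the LSB)."""
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    P = p[3] & p[2] & p[1] & p[0]
    return (G, P)
```

Because the operator is associative, pairing the bits as ((3,2),(1,0)) or as (3,(2,(1,0))) gives the same result. That freedom in grouping is what lets the later tree adders rearrange the computation.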

Linear Adders Using P,G
- Simple adders ripple the carry; faster ones bypass it
  - Better to try to work out the carry several bits at a time
- Best designs are around 11 FO4 for 64b
  - Useful for small adders (16b) and moderate-performance long adders
- [Figure: example of carry bypass from the text]

Figure Is Not Quite Right
- How does the drawn critical path go down from the OR gate?!
  - Fix: the OR gate output is the true input to the next group
- Subtle point: each block spits out a G term for the OR
  - Not simply a $Cout_i$ term
  - Avoids a nasty critical path (11...1 + 00...0 + $C_{in}$, when $C_{in}$ goes from 1 to 0)

Faster Carry Bypass (or Carry Skip) Adders
- We see the basic idea is to form multi-level carry chains
  - Break the bits into groups
  - Ripple the carry in each group, in parallel
  - Ripple the global carry across the groups
- How big should each group be? (N bits total, k bits per group)
  - If ripple time equals block skip time, then delay = $2(k-1) + (N/k - 2)$: ripple through the first and last groups, and skip the $N/k - 2$ groups between them
- Would groups of different sizes be faster? (yes)
  - Middle groups have longer to generate carry outs; should be larger
  - Early and late groups have ripples in critical path; should be shorter
  - Called Variable Block Adders
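To get a feel for the group-size question, the sketch below evaluates a unit-delay model for fixed-size groups. It assumes one time unit per ripple stage and per block skip, so the critical path is k-1 ripples in the first group, N/k-2 skips, and k-1 ripples in the last group. The function names are hypothetical.

```python
def skip_delay(n, k):
    """Unit-delay estimate for an n-bit carry-skip adder with k-bit groups."""
    return 2 * (k - 1) + (n // k - 2)


def best_group_size(n):
    """Group size (dividing n evenly, at least two groups) with the smallest delay."""
    sizes = [k for k in range(2, n) if n % k == 0 and n // k >= 2]
    return min(sizes, key=lambda k: skip_delay(n, k))


if __name__ == "__main__":
    for k in (2, 4, 8, 16, 32):
        print(k, skip_delay(64, k))
```

Under this model the delay for 64 bits is the same for k = 4 and k = 8 and grows on either side, consistent with a continuous optimum near the square root of N/2. It is a rough guide only; real ripple and skip delays are not equal.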

Carry Select Adders
- Why wait for the carry in? (If you can't find parallelism, invent it!)
  - Calculate answers for a group assuming $C_i = 1$ AND $C_i = 0$
  - Use two adders, and rely on the fact that transistors are cheap
  - Don't do this on the full adder (too expensive), just the MSBs
- [Figure: carry-select block with setup (P, G), "0" carry propagation, "1" carry propagation, a multiplexer driven by $C_{o,k-1}$, the carry vector, sum generation, and output $C_{o,k+3}$]
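A behavioural sketch of the idea, assuming a 16-bit adder whose upper 8 bits are duplicated (the split point and names are illustrative, not from the slide): both upper sums are computed without waiting, and the low half's carry only drives a mux.

```python
def carry_select_add(a, b, width=16, low=8):
    """Add two width-bit integers, carry-selecting the upper (width - low) bits.

    Returns (sum modulo 2**width, carry out).
    """
    hi_bits = width - low
    mask_lo = (1 << low) - 1
    mask_hi = (1 << hi_bits) - 1

    # Low half: an ordinary adder that produces the select signal.
    lo_sum = (a & mask_lo) + (b & mask_lo)
    c_lo = lo_sum >> low

    # High half: two adders in parallel, one per assumed carry-in.
    a_hi, b_hi = (a >> low) & mask_hi, (b >> low) & mask_hi
    hi_if_0 = a_hi + b_hi
    hi_if_1 = a_hi + b_hi + 1

    # Mux: the real carry picks the precomputed answer.
    hi = hi_if_1 if c_lo else hi_if_0
    total = ((hi & mask_hi) << low) | (lo_sum & mask_lo)
    return total, hi >> hi_bits
```

In hardware the two upper adders run concurrently with the lower one, so the added delay after the low carry arrives is just the mux.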

Many Papers on These Adders
- But no one builds them anymore
  - Or rather, nobody publishes papers on them (or gets PhDs on them)
- These are all clever improvements on adders
  - That tend to optimize transistors along with performance
  - Or are best for narrow-width operands (n=64 is slow)
- But scaling is pushing these adders to the wayside
  - We have very wide-word machines (media applications)
  - We have more transistors than we know what to do with
- Question: as power density concerns increase, will these simpler adders make a comeback?

Logarithmic, or Tree, Adders
- Fundamental problem: to know $C_i$, we need $C_{i-1}$
  - So delay is linear in n, and this dominates for wide adders (n > 16)
- Can we look ahead across multiple levels to figure out the carry?
  - Yes. This is called prefix computation, and it makes the delay logarithmic in n
- Notation is always an issue; everybody does it differently
  - Here, $A_{i:j}$ means the signal A for the group spanning the i-th to j-th bit positions
  - P = propagate (A+B)
  - G = generate (AB)
  - C = carry in to this bit/group position

Logic Stages For Logarithmic/Tree Adders
1. Compute single-bit values, 0 <= i < n
   - $G_i = A_i B_i$
   - $P_i = A_i + B_i$
2. Compute two-bit groups, 0 <= i < n/2
   - $G_{2i+1:2i} = G_{2i+1} + G_{2i} P_{2i+1}$
   - $P_{2i+1:2i} = P_{2i+1} P_{2i}$
3. Compute four-bit groups, 0 <= i < n/4
   - $G_{4i+3:4i} = G_{4i+3:4i+2} + G_{4i+1:4i} P_{4i+3:4i+2}$
   - $P_{4i+3:4i} = P_{4i+3:4i+2} P_{4i+1:4i}$
4. Go down tree for G&P, then go back up for Cin
5. Compute four-bit carries, 0 <= i < n/8
   - $C_{8i+7:8i+4} = G_{8i+3:8i} + C_{8i+7:8i} P_{8i+3:8i}$
   - $C_{8i+3:8i} = C_{8i+7:8i}$
6. Compute two-bit carries, 0 <= i < n/4
   - $C_{4i+3:4i+2} = G_{4i+1:4i} + C_{4i+3:4i} P_{4i+1:4i}$
   - $C_{4i+1:4i} = C_{4i+3:4i}$
7. Compute single-bit carries, 0 <= i < n/2
   - $C_{2i+1} = G_{2i} + C_{2i+1:2i} P_{2i}$
   - $C_{2i} = C_{2i+1:2i}$
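The two passes in the stage list (group G and P terms first, then carries handed back toward the single bits) can be sketched in a few lines. This is a behavioural model under stated assumptions: the width is a power of two, bits are stored LSB first, and the names `bits` and `tree_carries` are invented for the example.

```python
def bits(x, n):
    """The n low bits of x as a list, LSB first."""
    return [(x >> i) & 1 for i in range(n)]


def tree_carries(a_bits, b_bits, cin=0):
    """Carry INTO each bit position, computed with a radix-2 G/P tree."""
    # Single-bit terms: G = AB, P = A + B (OR form).
    levels = [[(x & y, x | y) for x, y in zip(a_bits, b_bits)]]

    # First pass: merge adjacent groups until one group covers all bits.
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([
            (prev[2 * i + 1][0] | (prev[2 * i][0] & prev[2 * i + 1][1]),
             prev[2 * i + 1][1] & prev[2 * i][1])
            for i in range(len(prev) // 2)
        ])

    # Second pass: hand each group's carry-in to its two halves.
    carries = [cin]
    for level in reversed(levels[:-1]):
        nxt = []
        for i, c in enumerate(carries):
            g_lo, p_lo = level[2 * i]
            nxt.append(c)                  # lower half sees the group's carry-in
            nxt.append(g_lo | (c & p_lo))  # upper half: G_lo + C * P_lo
        carries = nxt
    return carries
```

For 8 bits the first pass takes 3 merge levels and the second pass 3 more, so the number of stages grows with log2(n) rather than n.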

An Eight-bit Example
- Lines and dots notation shows the tree structure clearly
- [Figure: prefix tree over bits 7 to 0. Inputs: $P_i$, $G_i$ for each bit. First level: $P_{7:6}, G_{7:6}$; $P_{5:4}, G_{5:4}$; $P_{3:2}, G_{3:2}$; $G_{1:0}$. Second level: $P_{7:4}, G_{7:4}$; $G_{3:0}$. Third level: $G_{7:0}$.]
- Takes $\log_2 n$ time to get the final carry-out ($Cout_7 = G_{7:0}$)

An Eight-Bit Example, Redrawn
- More common to line up the PG terms with their appropriate bits
- [Figure: the same tree over bits 7 to 0, with each group term ($P_{7:6}, G_{7:6}$; $P_{5:4}, G_{5:4}$; $P_{3:2}, G_{3:2}$; $G_{1:0}$; $P_{7:4}, G_{7:4}$; $G_{3:0}$; $G_{7:0}$) drawn in the column of its most significant bit]

Layout Of Our Example Tree Adder
- Logarithmic structures have somewhat ugly layout
- [Figure: layout of the tree over bits 7 to 0]
- Worst wire length grows as n increases (n=64? 128?)

That Was Half The Algorithm
1. Compute single-bit values, 0 <= i < n
   - $G_i = A_i B_i$
   - $P_i = A_i + B_i$
2. Compute two-bit groups, 0 <= i < n/2
   - $G_{2i+1:2i} = G_{2i+1} + G_{2i} P_{2i+1}$
   - $P_{2i+1:2i} = P_{2i+1} P_{2i}$
3. Compute four-bit groups, 0 <= i < n/4
   - $G_{4i+3:4i} = G_{4i+3:4i+2} + G_{4i+1:4i} P_{4i+3:4i+2}$
   - $P_{4i+3:4i} = P_{4i+3:4i+2} P_{4i+1:4i}$
4. Go down tree for G&P, then go back up for Cin
5. Compute four-bit carries, 0 <= i < n/8
   - $C_{8i+7:8i+4} = G_{8i+3:8i} + C_{8i+7:8i} P_{8i+3:8i}$
   - $C_{8i+3:8i} = C_{8i+7:8i}$
6. Compute two-bit carries, 0 <= i < n/4
   - $C_{4i+3:4i+2} = G_{4i+1:4i} + C_{4i+3:4i} P_{4i+1:4i}$
   - $C_{4i+1:4i} = C_{4i+3:4i}$
7. Compute single-bit carries, 0 <= i < n/2
   - $C_{2i+1} = G_{2i} + C_{2i+1:2i} P_{2i}$
   - $C_{2i} = C_{2i+1:2i}$

An Eight-Bit Example, Finished
- A Brent-Kung adder (1982): what's the critical path?
- [Figure: full tree over bits 7 to 0, producing carries $C_7$ through $C_0$]

A 16b Brent-Kung Adder
- Limit fanout to 2 (can collapse some nodes with higher FO)
- [Figure: tree over bits 15 to 0, with a carry out for each bit position]

Many Kinds of Tree Adders
- We can vary some basic parameters
  - Radix, tree depth, wiring density, and fanout
- Radix: how many bits are combined in each $P_g$, $G_g$ (group) term?
  - Radix is generally ≤ 4 (why not more?); prior example was 2
- Radix-n can just compute the $P_g$ and $G_g$ terms directly
  - [Figure: radix-2, radix-3, and radix-4 blocks]
- Or it can compute the intermediate P# and G# terms as well
  - [Figure: radix-2, radix-3, and radix-4 blocks with intermediate outputs]

Building Multiple-Bit PG blocks
- Radix-4 $P_g$, $G_g$ block that generates intermediate terms
  - Spits out $P_{3:0}$ and $G_{3:0}$ terms
  - Also spits out P3, G3, P2, G2 and passes along P1, G1
  - [Figure: block outputs labelled $P_0, G_0$; $P_{1:0}, G_{1:0}$; $P_{2:0}, G_{2:0}$; $P_{3:0}, G_{3:0}$]
- Allows for quick computation of the various carries, once we know $C_{in}$
- How do we build this block?

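Building Multiple-Bit PG blocks, cont.
- Can we use dynamic logic to build fast blocks?
  - A Manchester carry chain can gather the multiple-bit G terms
  - [Figure: Manchester chain with pass devices $P_1$, $P_2$, ..., $P_n$ and pull-downs $G_0$, $G_1$, $G_2$, ..., $G_n$, output $G_n$]
- For C, since we already have the $P_g$s and $G_g$s, we can do better
  - [Figure: carry chain driven by $C_{in}$ with $P_0$, $P_{1:0}$, ..., $P_{n:0}$ and $G_0$, $G_{1:0}$, ..., $G_{n:0}$]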

Dynamic Logic for 4-Bit PG Block
- [Figure: Motorola design, a clocked (Clk) dynamic 4-bit PG block]

32b Mixed-Radix Brent-Kung Adder
- Vary radices at different tree levels because n may not be $(\text{radix})^k$

32b Mixed-Radix, Redrawn As Folded Tree
- PG goes down and Carry goes back up

64b Radix-4 Brent-Kung Adder
- Takes longer to draw in PowerPoint than it does to design!

Radix 4 PG and Carry Trees
- (Argh! Why do people put the LSB on the left side?)

Tree Depth, Wiring Density, and Fanout
- Previous slides have all examined changing the adder radix
- We can also change the tree depth, wire density, and/or fanout
  - These usually get changed together; one affects the others
  - Density: how many wires criss-cross the tree?
  - Depth: how many stages of logic?
  - Fanout: how far is the reach of each stage?
- Reduce depth by chopping trees
  - Many adders use carry select at the final stage
  - Compute two results, and use a carry to select the right result
- Eliminate the carry tree altogether
  - By increasing wiring density or increasing fanout

A Very Dense 16b Tree
- Eliminate the carry out tree by computing a group for each bit
  - Kogge-Stone architecture (1973)
- Lots of wires, but minimizes the number of logic levels
  - We can use this for a quick swag at the minimum delay
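A behavioural sketch of the Kogge-Stone recurrence, assuming a carry-in of 0 and a power-of-two width (the function name and return shape are illustrative): every bit position keeps its own running group, and the span doubles at each level.

```python
def kogge_stone_carries(a, b, width=16):
    """Return (carry_out_per_bit, logic_levels) for a radix-2 Kogge-Stone tree.

    After the loop, g[i] holds G_{i:0}, the carry out of bit i for carry-in 0.
    """
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(width)]

    dist, levels = 1, 0
    while dist < width:
        g_new, p_new = g[:], p[:]
        for i in range(dist, width):
            # Merge this bit's group with the group ending `dist` bits below.
            g_new[i] = g[i] | (p[i] & g[i - dist])
            p_new[i] = p[i] & p[i - dist]
        g, p = g_new, p_new
        dist *= 2
        levels += 1
    return g, levels
```

For 16 bits this takes 4 levels and each node feeds only two nodes in the next level, but every level needs a merge at nearly every bit position, which is where the wiring cost comes from.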

Minimum 64b Adder Delay
- Make a few assumptions
  - Output load is equal to the load on each input
  - Use static gates; very aggressive domino logic may change results
- Simple approximation
  - Need to compute $Sum_i = A_i \oplus B_i \oplus C_i$
  - $C_{in}$ (LSB) must fan out to all bits, for a fanout of 64
  - Extra logic in chain raises effective fanout to about 128: $\log_4 128 = 3.5$ FO4
- More complicated approximation
  - At each stage, P drives 3 gates, G drives 2; effective fanout ≈ 3.5
  - Total fanout = 1.5 (first NAND/NOR) × $3.5^6$ × 1-ish (final mux)
  - $\log_4(1.5 \times 3.5^6) \approx 5.7$ FO4, not really accounting for parasitic delay correctly
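Both estimates are just a base-4 logarithm of the total effective fanout, since one FO4 delay corresponds to a fanout of 4. The snippet below redoes the arithmetic; `fo4_delay` is a hypothetical helper, and the six stages are assumed to be the log2(64) levels of a radix-2 tree.

```python
import math


def fo4_delay(total_fanout):
    """Delay in FO4 units for a path with the given total effective fanout."""
    return math.log(total_fanout, 4)


# Simple approximation: carry-in fans out to 64 bits, doubled for extra logic.
simple = fo4_delay(128)

# More detailed: first NAND/NOR (1.5), six tree stages at about 3.5 each,
# and a final mux counted as roughly 1.
detailed = fo4_delay(1.5 * 3.5 ** 6 * 1.0)

if __name__ == "__main__":
    print(round(simple, 2), round(detailed, 2))
```

As the slide notes, neither number accounts for parasitic delay properly, so both are lower bounds in spirit rather than predictions.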

Higher Radix Still Possible
- Radix-4 Kogge-Stone tree
  - Trades off two layers of logic for lots and lots of wires
- Not a good idea in CMOS: it tends to increase stage efforts >> 4
  - Not bad, though, for domino: much lower logical effort

Reduce Depth With Fanout
- Can also reduce tree depth by increasing the stage fanouts
  - Sklansky (1960) called this a divide-and-conquer tree
- Fanouts increase the further from the start you go

A Taxonomy
- Following Harris's 2003 paper
  - Assume 16b radix-2 adder families for this discussion
  - We can modify the tree's depth, fanout, and wiring density
- What we've seen already
  - Brent-Kung: 7 logic levels, fanout of 2, one wiring track enough
  - Kogge-Stone: 4 logic levels, fanout of 2, eight wiring tracks
  - Sklansky: 4 logic levels, fanout of up to 9, one wiring track enough
- Formalism: use a triplet (l, f, t) to represent the adders
  - Logic levels = $\log_2 N + l$
  - Fanout = $2^f + 1$
  - Wiring tracks = $2^t$
  - Brent-Kung: (3,0,0); Kogge-Stone: (0,0,3); Sklansky: (0,3,0)

Points on a plane?
- All major adder architectures fall onto the same plane
  - Defined by $l + f + t = \log_2 N - 1$
- Using this, we may expect a Han-Carlson adder to trade off logic layers for some increased wiring
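The triplet formalism is easy to tabulate. The sketch below converts (l, f, t) into logic levels, fanout, and wiring tracks, and lists every non-negative integer triplet on the plane for a given N. Function names and the dictionary keys are assumptions made for the example.

```python
import math


def adder_params(n, l, f, t):
    """Logic levels, maximum fanout, and wiring tracks for an (l, f, t) adder."""
    return {
        "logic_levels": int(math.log2(n)) + l,
        "max_fanout": 2 ** f + 1,
        "wiring_tracks": 2 ** t,
    }


def plane_points(n):
    """All non-negative integer triplets with l + f + t = log2(n) - 1."""
    total = int(math.log2(n)) - 1
    return [(l, f, total - l - f)
            for l in range(total + 1)
            for f in range(total + 1 - l)]


if __name__ == "__main__":
    for name, lft in [("Brent-Kung", (3, 0, 0)),
                      ("Kogge-Stone", (0, 0, 3)),
                      ("Sklansky", (0, 3, 0))]:
        print(name, adder_params(16, *lft))
```

For N = 16 the three named adders are the corners of the plane, and the remaining points are the hybrid designs that trade one resource for another.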

Han-Carlson Adder (1987)
- Think of this as a sparse Kogge-Stone
- Called a sparse tree with sparseness of 2
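One way to read "sparse Kogge-Stone" is: run the Kogge-Stone recurrence on the odd bit positions only, then fill in the even positions with one extra level. The sketch below models that reading, assuming carry-in 0 and a power-of-two width; it is an illustration of the structure, not a transcription of the slide's figure.

```python
def han_carlson_carries(a, b, width=16):
    """Return (carry_out_per_bit, logic_levels) for a sparseness-2 tree."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(width)]
    levels = 0

    # Level 1: each odd bit absorbs its even neighbour below.
    for i in range(1, width, 2):
        g[i], p[i] = g[i] | (p[i] & g[i - 1]), p[i] & p[i - 1]
    levels += 1

    # Kogge-Stone recurrence restricted to odd positions.
    dist = 2
    while dist < width:
        g_old, p_old = g[:], p[:]
        for i in range(dist + 1, width, 2):
            g[i] = g_old[i] | (p_old[i] & g_old[i - dist])
            p[i] = p_old[i] & p_old[i - dist]
        dist *= 2
        levels += 1

    # Final level: even bits pick up the finished carry from the odd bit below.
    for i in range(2, width, 2):
        g[i] = g[i] | (p[i] & g[i - 1])
    levels += 1
    return g, levels
```

For 16 bits this uses 5 levels, one more than Kogge-Stone, while the middle levels touch only half the bit positions, which is the logic-for-wiring trade the taxonomy predicts.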

Adder Tradeoffs

Other Sparse Trees
- [Figure from Mathew, VLSI '02]