NOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or

Similar documents
Unbounded hardware is equivalent to deterministic Turing machines

ISSN (PRINT): , (ONLINE): , VOLUME-4, ISSUE-10,

Area-Time Optimal Adder with Relative Placement Generator

Homework 4 due today Quiz #4 today In class (80min) final exam on April 29 Project reports due on May 4. Project presentations May 5, 1-4pm

A COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER. Jesus Garcia and Michael J. Schulte

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

The equivalence of twos-complement addition and the conversion of redundant-binary to twos-complement numbers

GALOP : A Generalized VLSI Architecture for Ultrafast Carry Originate-Propagate adders

A High-Speed Realization of Chinese Remainder Theorem

ARITHMETIC COMBINATIONAL MODULES AND NETWORKS

Comparative analysis of QCA adders

Midterm Exam Two is scheduled on April 8 in class. On March 27 I will help you prepare Midterm Exam Two.

EFFICIENT MULTIOUTPUT CARRY LOOK-AHEAD ADDERS

Optimum Prefix Adders in a Comprehensive Area, Timing and Power Design Space

NOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or

Overview. Arithmetic circuits. Binary half adder. Binary full adder. Last lecture PLDs ROMs Tristates Design examples

Chapter 4. Combinational: Circuits with logic gates whose outputs depend on the present combination of the inputs. elements. Dr.

Part II Addition / Subtraction

CSE140: Components and Design Techniques for Digital Systems. Logic minimization algorithm summary. Instructor: Mohsen Imani UC San Diego

Optimal Circuits for Parallel Multipliers

Lecture 12: Adders, Sequential Circuits

Logic and Computer Design Fundamentals. Chapter 5 Arithmetic Functions and Circuits

Revisiting Finite Field Multiplication Using Dickson Bases

Design and Implementation of Efficient Modulo 2 n +1 Adder

Number representation

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

Cost/Performance Tradeoffs:

Design of Arithmetic Logic Unit (ALU) using Modified QCA Adder

Lecture 11: Adders. Slides courtesy of Deming Chen. Slides based on the initial set from David Harris. 4th Ed.

A Digit-Serial Systolic Multiplier for Finite Fields GF(2 m )

Lecture 4. Adders. Computer Systems Laboratory Stanford University

Carry Look Ahead Adders

Adders, subtractors comparators, multipliers and other ALU elements

THEORETICAL ESTIMATION OF POWER CONSUMPTION IN BINARY ADDERS. Robert A. Freking and Keshab K. Parhi

Lecture 12: Adders, Sequential Circuits

Design of A Efficient Hybrid Adder Using Qca

Lecture 8: Sequential Multipliers

Binary addition by hand. Adding two bits

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Novel Bit Adder Using Arithmetic Logic Unit of QCA Technology

DIGITAL TECHNICS. Dr. Bálint Pődör. Óbuda University, Microelectronics and Technology Institute

Very Large Scale Integration (VLSI)

Design and Implementation of Carry Tree Adders using Low Power FPGAs

EECS150. Arithmetic Circuits

Arithmetic in Integer Rings and Prime Fields

Tree and Array Multipliers Ivor Page 1

Residue Number Systems Ivor Page 1

CMPUT 329. Circuits for binary addition

VLSI Design. [Adapted from Rabaey s Digital Integrated Circuits, 2002, J. Rabaey et al.] ECE 4121 VLSI DEsign.1

CMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design

Chapter 5 Arithmetic Circuits

Part II Addition / Subtraction

Hw 6 due Thursday, Nov 3, 5pm No lab this week

L8/9: Arithmetic Structures

A General Method for Energy-Error Tradeoffs in Approximate Adders

Hardware Design I Chap. 4 Representative combinational logic

Adders, subtractors comparators, multipliers and other ALU elements

Computer Architecture 10. Fast Adders

Combinational Logic. Lan-Da Van ( 范倫達 ), Ph. D. Department of Computer Science National Chiao Tung University Taiwan, R.O.C.

EECS 427 Lecture 8: Adders Readings: EECS 427 F09 Lecture 8 1. Reminders. HW3 project initial proposal: due Wednesday 10/7

CARNEGIE MELLON UNIVERSITY DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING DIGITAL INTEGRATED CIRCUITS FALL 2002

Subquadratic space complexity multiplier for a class of binary fields using Toeplitz matrix approach

EE141- Spring 2004 Digital Integrated Circuits

HIGH SPEED AND INDEPENDENT CARRY CHAIN CARRY LOOK AHEAD ADDER (CLA) IMPLEMENTATION USING CADENCE-EDA K.Krishna Kumar 1, A.

CSE140: Components and Design Techniques for Digital Systems. Decoders, adders, comparators, multipliers and other ALU elements. Tajana Simunic Rosing

Sums of the Thue Morse sequence over arithmetic progressions

DESIGN OF LOW POWER-DELAY PRODUCT CARRY LOOK AHEAD ADDER USING MANCHESTER CARRY CHAIN

Combinational Logic. Lan-Da Van ( 范倫達 ), Ph. D. Department of Computer Science National Chiao Tung University Taiwan, R.O.C.

Arithmetic Building Blocks

Novel Modulo 2 n +1Multipliers

Arithmetic Circuits Didn t I learn how to do addition in the second grade? UNC courses aren t what they used to be...

Cost/Performance Tradeoff of n-select Square Root Implementations

CSEE 3827: Fundamentals of Computer Systems. Combinational Circuits

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017

Lecture 13: Sequential Circuits

Analysis and Synthesis of Weighted-Sum Functions

Combinational Logic. By : Ali Mustafa

ECE 645: Lecture 3. Conditional-Sum Adders and Parallel Prefix Network Adders. FPGA Optimized Adders

CprE 281: Digital Logic

CS 140 Lecture 14 Standard Combinational Modules

A Hardware-Oriented Method for Evaluating Complex Polynomials

Looking at a two binary digit sum shows what we need to extend addition to multiple binary digits.

GF(2 m ) arithmetic: summary

Bit-Sliced Design. EECS 141 F01 Arithmetic Circuits. A Generic Digital Processor. Full-Adder. The Binary Adder

Parallelism in Computer Arithmetic: A Historical Perspective

Implementation of Carry Look-Ahead in Domino Logic

Notes on Computer Theory Last updated: November, Circuits

Problems in VLSI design

TAU 2014 Contest Pessimism Removal of Timing Analysis v1.6 December 11 th,

3. Combinational Circuit Design

Logic BIST. Sungho Kang Yonsei University

Proposal to Improve Data Format Conversions for a Hybrid Number System Processor

Review. EECS Components and Design Techniques for Digital Systems. Lec 18 Arithmetic II (Multiplication) Computer Number Systems

Sample Test Paper - I

Optimum Prefix Adders in a Comprehensive Area, Timing and Power Design Space

Binary Multipliers. Reading: Study Chapter 3. The key trick of multiplication is memorizing a digit-to-digit table Everything else was just adding

A New Bit-Serial Architecture for Field Multiplication Using Polynomial Bases

EE141-Fall 2011 Digital Integrated Circuits

Algorithmic Approach to Counting of Certain Types m-ary Partitions

Binary addition example worked out

Transcription:

NOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or other reproductions of copyrighted material. Any copying of this document without permission ofits author may be prohibited by law.

CMU-CS-79-131' A REGULAR LAYOUT FOR PARALLEL ADDERS R.P. Brent Department of Computer Science Australian National University Canberra, A.C.T. 2600 Australia. H.T. Kung Department of Computer Science Carnegie-Mellon University Pittsburgh, Pennsylvania 15213 U.S.A. June 1979 Copyright 1979 by R.P. Brent and H.T. Kung This research was supported in part by the National Science Foundation under Grant MCS 78-236-76 and the Office of Naval Research under Contract N00014-76-C-Q370, NR 044-422.

ABSTRACT With VLSI architecture the chip area is a better measure cost than the conventional gate count. We show that addition of n binary numbers can be performed on a chip in time proportional to log n and with area proportional to n log n. Key Words and Phrases Addition, area-time complexity, carry lookahead, circuit design, combinational logic, models of computation, parallel addition, parallel polynomial evaluation, prefix computation, VLSI. PITTSBUKUH. hl.%ns'flvania i5213

1. 1. Introduction We are interested in the design of parallel "carry lookahead" adders suitable for implementation in VLSI architecture. The addition problem has been considered by many other authors. See, for example, Winograd [65], Brent [70], Tung [72], Savage [76], and Kuck [78]. Much attention has been paid to the tradeoff between time and the number of gates, but little attention has been paid to the problem of connecting the gates in an economical and regular way to minimize chip area and design costs. In this paper we show that a simple and regular design for a parallel adder is possible. In Section 2 we briefly describe our computational model. Section 3 contains a description of the addition problem, and shows how it reduces to a carry computation problem. the reduction of carry computation to a "prefix 11 The basis of our method, -computation, is described in Section 4. Although the same idea was used by Ladner and Fischer [77], their results are not directly applicable because they ignored fanout restrictions, and used the gate count rather than area as a complexity measure. In Section 5 we use the results of Section 4 to give a simple and regular layout for carry computation. Our construction demonstrates that the addition of n-bit numbers can be performed in time 0(log n), using area 0(n log n). The implied constants are sufficiently small that the method is quite practical, and it is especially suitable for a pipelined adder. In Section 6 we generalize the result of Section 5, and show that n-bit numbers can be added in time 0(n/w + log w), using area 0(w log w + 1), if the input bits from each operand are available w at a time (for 1 < w < n).

2. 2. The computational model Our model is intended to be general, but at the same time realistic enough to apply (at least approximately) to current VLSI technology. We assume the existence of circuit elements or "gates 11 which compute a logical function of two inputs in constant time. An output signal can be divided ("fanned out") into two signals in constant time. Gates have constant area, and the wires connecting them have constant minimum width (or, equivalently, must be separated by at least some minimal spacing). At most two wires can cross at any point. We assume that a signal travels along a wire of any length in constant time. This is realistic as propagation delays are limited by line capacitances rather than the velocity of-light. A longer wire will generally have a larger capacitance, and thus require a larger driver, but we can neglect the driver area as it need not exceed a fixed percentage of the wire area: see Mead and Conway [79]. The computation is assumed to be performed in a convex planar region, with inputs and outputs available on the boundary of the region. Our measure of the cost of a design is the area*rather than the number of gates required. This is an important difference between our model and earlier models of Winograd [65], Brent [70] and others. For further details of our model, see Brent and Kung [79].

3. 3. Outline of the General Approach Let a a,...a, and b b -b, be n-bit binary numbers with n n-1 1 n n-1 1 sum s,s...s-. The usual method for addition computes the s. f s by 1 n+1 n 1 c Q = 0, c ± = (a ± A b i ) v ( A± A C ± - B 1) V (b ± A c ± - 1 ) f S i = a i b i c i-l ' i S5 l»---»n, n+1 n* where (+) means the sum mod 2 and c± is the carry from bit position i. following scheme: It is well-known that the c f s can be determined using the c 0 = 0, (3.1) C ± - g ± V ( P. A c ^ ), where g 4 = ai A b i' and for i»l,2 f,n. One can view the g and p ± as the "carry generate" and "carry propagate" conditions at bit position i. The relation (3.1) corresponds to the fact that the carry c^ is either generated by a^ and h A or propagated from the previous carry c^^. This is illustrated in Figure 3.1

H- c i c i-l c 0 (=0) g i'pi g l' P l Figure 3.1: The carry chain In Section 5 we present a layout design for computing all the carries in parallel assuming that the g^f s and P^fs are given. The design of a parallel adder is then very straightforward, and is illustrated in Figure 3.2. Note that in Figure 3.2(b), the bottom rectangle represents a buffer that transforms the a^fs and? b i s into the g i ts and P^s. For computing the s^s we use the fact that s±» Pi 0^! f o r i=l J #* -»n. Figure 3.2: (a) Abstraction of a parallel carry chain computation, and (b) abstraction of a parallel adder based on the design for the carry chain computation.

5. 4. Reformulation of the Carry Chain Computation We define an operator "o" as follows: (g,p) o (g,p) d = f (g v (p A g),p A p), for any Boolean variables g, p, g and p. The following two lemmas show why the operator "o" is useful for carry computation. Lemma 4.1; Let f(g,p ) if i=l, (G?) = / 1 1 Ug ±,P i ) o (G^.P^) if 2<i<n. Then c i = G^ for i=l,2,,n. Proof: We prove the Lemma by induction on i. Since CQ - 0, (3.1) gives c± = S ± v ( PL A 0) = g;l = G x, so the result holds for 1*1. If i > 1 and c i ]_ 558 then <G.,P.) = (g., Pi ) o (G..,,?..,) = ( gi, Pi ) o (c^.p.^) = (.g. V (p. A C. ^ ), P ± A P.^). Thus G. = g. v (p. A c.^) and, from (3.1), we have Q? G. = c.. l l Pin:..... The result now follows by induction.

6. Lemma 4.2: The operator "o" is associative. Proof: For any (g 3,p 3 ), (g 2,p 2 ), (g^p^, we have [(g 3»P 3 ) o (g 2,P 2 )] o (g 1,p 1 ) - [g 3 v - [g 3 V (p 3 Ag 2 ),p 3 Ap 2 ] o (g 1,p 1 ) (P 3 A g 2 ) V A (P 3 P 2 A g l ) f P 3 Ap 2 A P l ], and (g 3»P 3 ) o [(g 2,p 2 ) o (g 1,p 1 )] - (g 3,p 3 ) o [g 2 v (p 2 A g 1 ),p 2 A P ; L ] - tg 3 v (P 3 A (g 2 v A A (P 2 g 1 )))»P 3 P 2 APi 1- One can check that the right hand sides of the above two expressions are equal, using.the distributivity-of J'A" over "v fi. law is not required.) (The dual distributive Lemmas 4.1 and 4.2, To compute it suffices to compute ^3"P ) 9 but, by (G i,p i ) - (g i,p ± ) o (S^fP^) o o (g-^p^ can be evaluated in any order from the given g^'s and p^fs. This is the motivation for the introduction of the operator "o". (Intuitively, may be regarded as a "block carry generate" condition, and as a "block carry propagate" condition.)

5. A Layout for the Carry Chain Computation Recall that for computing the carries it suffices to compute the ( G i» p i) f r ^ i"l,,##»n. Consider first the simpler problem of computing only ( G n>p n )* Since the operator "o" is associative, ( G n»^n) can be computed in the order defined by a binary tree. This is illustrated in Figure 5.1 for the case n=16. In the figure, each black processor performs the function defined ty the operator "o" and each white processor simply transmits data. The white and black processors are depicted in Figure 5.2. Note that for Figure 5.1 each processor is required to produce only one of its two identical outputs, and the units of time are such that one computation by a black processor and propagation of the results takes unit time.

Figure 5.1: The computation of (G-.,P-.) Lb Lo using a tree structure.

9. (g out* P out ) A (8out' P out ) A 8 out 8 in' P out " P in (gout' P out ) (8 out' P out ) A 8 out g in V (p in A *±J» p out = p m A p i n (g in' p in> ( 8 ^ } (a) (b) Figure 5.2: (a) The white processor, and (b) the black processor. Consider now the general problem of computing all the (G.,P.) for i=l,»**,n. This computation can be performed by using the tree struciure of Figure 5.1. once more, this time in the reverse order. We illustrate the computation,for the case n*16,in Figure 5.3. It is easy to check that, at time T=7, all the (G^*^) are computed along the top boundary of the network. which are the carries c^. As the final outputs, we only keep the From the layout shown in Figure 5.3, we have the following theorem. Theorem 5.1: All the carries in an n-bit addition can be computed in time proportional to log n and in area proportional to n log n, n ^ 2. Corollary 5.1: Addition of two n-bit binary numbers can be performed in time proportional to log n and in area proportional to n log n, n ^ 2.

10. C 16 C 15 C 14 C 13 C 12 C ll C 10 C 9 C 8 C 7 C 6 C 5 C 4 c 3 A A A A A A A A A A A A A A A A I I I» I I ' I I I I I I I I I C 2 C l Figure 5.3: The computation of all the carries for n=16.

11. 6. A Pipeline Scheme for Addition of Long Integers We define the width w of a parallel adder to be the number of bits it accepts at one time from each operand. For the parallel adder corresponding to the network in Figure 5.3, v * 16. We have hitherto assumed that the width of a network is equal to the number n of bits in each operand. In this section we consider the case w < n. We show that this case can be handled efficiently using a pipeline scheme on a network which is a modification of the one depicted in Figure 5.3. For simplicity, assume that n is divisible by w. One can partition an n-bit integer into n/w segments, each consisting of w consecutive bits. To illustrate the idea, suppose that w = 16. Then the carry chain computation corresponding to each segment can be done on the network in Figure 5.3, and the computations for all the segments can be pipelined, starting from.the least significant segment. The results coming out from the top of the network are not the final solutions, though. Results corresponding to the i-th least significant segment (i > 1) have to be modified by applying (G,. - *, P / # ) on the right using the operator "o". To facilitate this modification, we superimpose another tree structure on the top half of the network, as shown in Figure 6.1. Using this additional tree, the contents of the "square 11 processor (denoted by " ") are broadcast to all the leaves, which are black processors. The square processor, shown in Figure 6.2, is an accumulator which initially has value (g, p) = (0, 1), and successively has values (g, P) - ( G 9 P (i-i) w (i-l)w^ f o r 1 = 2» 3 > * A t t h e t i m e

12. when a particular (G,. 1.,P f 1 V ) reaches the leaves, it is combined with the results just coming out from the old network there. By this pipeline scheme and Theorem 5.1, we have the following result: Theorem 6.1: Let l^w^n. Then all the carries in an n-bit addition can be computed in time proportional to n/w + log w and in area proportional to w log w + 1. 2 From Theorem 6.1, the area-time product is 0(n log w + w log w + n), 2 which is 0(n log n) when w=n,and 0(n) when w is a constant. When w=l the method outlined in this section is essentially the usual serial carry-chain computation.

13. A A A A T=7 T=6 T-5 T=4 "This is the same left-most processor at level T=4 of the network as in Figure 5.3. Figure 6.1: The additional tree structure to be superimposed on the top half of the network in Figure 5.3. (gout'pout ) ^out'pout) [dej.ay] (g.»p- ) (g,p) = P "out in A P ( >!>) = (g -p ) out *out [delayed] Figure 6.2: The "square" processor that accumulates (G P > (i-l)w' (i-l)w ;

14 7 Summary and Conclusions The preliminary and final stages of binary addition with our scheme (generation of {g.,p.} and computation of {s i - Pi ci-i } respectively) are straightforward. Figures 5.2 and 5.3 illustrate that the intermediate phase (fast carry computation) is conceptually simple, and the layout illustrated in Figure 5.3 is extremely regular. The design of the white processor is trivial, and the black processor is about as complex as a one-bit adder. After these two basic processors are designed, we can simply replicate them and connect their copies in the regular way illustrated in Figure 5.3. We conclude that, using the approach of this paper, parallel adders with carry lookahead are well-suited for VLSI implementation. Mead, and Conway [79, Chapter 5] considered several lookahead schemes, but concluded that "they added a great deal of complexity to the system without much gain in performance". To show that this comment does not apply to our scheme, suppose that the operations "A", "V", and " " take unit time. Table 7.1 gives the computation time for our scheme and for a straightforward serial scheme where the c_^ are computed from (3.1) for various n. (n is the number of bits in each operand.) For n=2^the general formulae are 4k and 2n^l^respectively. Table 7.1 Comparison of parallel and serial addition times n Time (parallel) Time (serial) 8 12 15 16 16 31 32 20 63 64 24 127

15 In this paper we assumeda binary number system and restricted our attention to 2 f s complement arithmetic. Only minor modifications of our results are required to deal with l? s complement arithmetic or sign and magnitude representations of signed integers. Brent and Kung [79] consider the problem of multiplying n-bit binary integers, and show that the area A and time T for any method satisfy AT >K X n 3 /2 and AT 2 K 2 n 2 for certain constants > 0 (assuming the model of Section 2 with some mild additional restrictions). For binary addition we can achieve AT = 0(n) by a trivial serial method, and AT 2-0(n log 3 n) by the method of Section 5. Thus, binary multiplication is harder 2 than binary addition if either AT or AT is used as the complexity measure. In the proof of Lemmas 4.1 and 4.2 we used only one distributive law. Thus, the layout of Figure 5.1 could be used to evaluate arithmetic expressions of the form (7.1) g n + p n {g n _ 1 + P n _ 1 t...p 3 (g 2 +p 2 g 1 )...]} where g_^, p_^ are numbers and the black processor in Figure 5.2(b) now computes g Q u t = g i n + p^g^ and p ^ = p^p^. Note that the case = * * * = P n = x of (7.1) is the polynomial g n + S n-1 x + ' + g^ 1 1 " 1

16 REFERENCES Brent, R.P. [70], "On the Addition of Binary Numbers", IEEE Transactions on Computers, C-19 (1970), pp. 758-759. Brent, R.P. and Kung, H.T. [79], "The Area-Time Complexity of Binary Multiplication", to appear as a Technical Report, Dept. of Computer Science, Carnegie-Mellon Univ., July 1979. Kuck, D.J. [78], The structure of computers and computations, Vol. 1, John Wiley & Sons, New York, 1978. Ladner, R.E. and Fischer, M.J. [77], "Parallel Prefix Computation", Proc. 1977 International Conference on Parallel Processing, 1977, pp. 213-223. Mead, C.A. and Conway, L.A. [79], Introduction to VLSI Systems, Addison Wesley, 1979. Savage, J.E. [76], The Complexity of Computing, John Wiley & Sons, New York, 1976. Tung, C. [72], "Arithmetic", in Computer Science, ed. by A.F. Cardenas, L. Press, M.A. Marin, Wiley-Interscience, New York, 1972. Winograd, S. [65], "On the Time Required to Perform Addition", J.ACM 12 (1965), pp. 277-285.

UNCLASSIFIED SECURITY CLAOJPI^^Jj REPORT DOCUMENTATION PAGE [2. GOVT ACCESSION NO. REPORT KUMBtS CMU-CS-79-131 4. TITLE (and Submit) A REGULAR IAYOUT FOR- PARALLEL ADDERS READ INSTKITTIONS REFORE COV.PLETING FOR- RECIPIENT'S CATALOG NUMBER 5. TYPE OF REPORT & PERIOD COVERED Interim R # P. BRENT & H.T. KDNG N00014-76-C-0370 10. PROGRAM ELEMENT. PROJECT, TASK AREA 4 WORK UNIT NUMBERS Approved for public release; distribution unlimited. U. 0,STR,BUT,ON STATEMENT,ol». to Bloc. 20. Il diligi A - K. Po, Iß. SUPPLEMENTARY NOTtiS 19. KEY WORDS (Cont 20. ABSTRACT (Conflnu. on 1 UNIVERSA UBRARtfS CARH 6»t-MUL0NUNIVFRSlïi PITTSBURGH. PENNSYLVANIA 15213 P«FORM 1472 EomOM Or 1 NOV es IS OBSOLETE LU 1 JAN 73 s / N O l 0 2. o l 4. 6 60 1 i UNCLASSIFIED SECURITY CLASS.FiCAT.ON OF THIS PACE (Wkmn O.i- n..r.d;