GALOP : A Generalized VLSI Architecture for Ultrafast Carry Originate-Propagate adders



Dhananjay S. Phatak
Electrical Engineering Department, State University of New York, Binghamton, NY

Israel Koren
Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA

ABSTRACT

In this paper, we present a novel algorithm and its VLSI implementation for performing fixed-point two's complement addition. The proposed adder has a hybrid architecture where the generation of the sum bits is separated from the generation and propagation of the carry bits. The sum bits for every group of (consecutive) bit positions are obtained through a carry-select scheme. The incoming carries into these groups are generated by a binary carry-look-ahead tree. We achieve a substantial reduction in the delay of the carry-look-ahead tree and in the number of transistors required for its implementation by introducing a new pair of variables, namely, carry-originate and carry-propagate variables. The new carry-originate variable includes the conventional carry-generate variable as a special case. The theoretical framework developed here leads to the definition of a fundamental carry-selection operator which can be implemented using a fast and compact multiplexor-based circuit. By employing this multiplexor-based selection circuit, the delay of each level of the look-ahead tree is reduced to that of a single gate. The block (group) size for the generation of sum bits through the carry-select scheme is determined in such a way that the two alternative subsets of sum bits are ready prior to the generation of the block's incoming carry. As a result, the delay of the proposed adder is equal to $4 + \log_2 n$ gate delays when adding two $n$-bit operands.
A comparison with other recently proposed designs, such as an adder utilizing intermediate signed digits and the adder in Digital's ALPHA processor, demonstrates that our design is significantly faster for all wordlengths and, equally important, requires fewer transistors for wordlengths equal to and larger than 32. The architecture presented in this paper is being patented.

I Introduction

Despite the fact that VLSI technology is now mature and complex arithmetic operations can be directly implemented in hardware, something as basic as addition still continues to fascinate researchers. This is demonstrated by the continual efforts to synthesize ever faster and/or more compact VLSI adders [1], [2], [3], [7], [8], [12], [16], [17], [18], [19]. These efforts are well directed and worthwhile because addition is the most fundamental arithmetic operation: many other arithmetic operations are implemented with additions and shifts as basic steps.

Conventional high-speed adders include carry-look-ahead [19] and its variants [1, 5, 13]; conditional-sum [17]; carry-select; and carry-skip [12] along with its derivatives [2, 3, 7]. To obtain further speed enhancement, it is possible to combine two or more of these techniques (such as carry-look-ahead and carry-select), as illustrated by the designs in [4, 8, 18]. Most of these techniques are independent of the technology used to implement the actual hardware and represent modifications at the algorithm level.

Given an algorithm, elaborate optimizations are often employed at the implementation level in order to extract the fastest circuit at the lowest possible hardware cost. Various implementations of Manchester carry chains, odd-even ripple cells [6], etc., are examples of enhancements achieved via circuit optimizations.

If the underlying algorithm can be modified so as to best match a given technology, the potential speed enhancement and hardware savings can exceed those obtainable by either algorithm modification alone or circuit optimization alone. For example, Ling [13] rewrites the basic look-ahead equations in order to reduce the hardware cost (fan-in). The design in [18], on the other hand, employs an intermediate signed-digit representation and then utilizes a tree of multiplexors in order to achieve speed enhancement.
It is well known that a multiplexor (MUX) is cheap to implement in CMOS technology (in both static and dynamic CMOS circuits). In fact, the (dynamic CMOS) adder in Digital Equipment Corporation's ALPHA AXP microprocessor utilizes the conditional-sum technique because their clever implementation of dual MUXes makes it possible to select both possible outcomes with minimal delay and hardware cost [4, 9, 10]. The multiplier design in [14] also exploits the fact that MUXes are faster and require fewer transistors in order to realize fast addition. Like the design in [18], it also utilizes an intermediate signed-digit representation.

In this paper we present a CMOS architecture that is extremely fast and very economical in terms of the number of transistors. We employ a combination of carry-look-ahead and carry-select techniques. In particular, we express the basic carry-look-ahead equations in a different form so that they can be implemented by MUXes. We demonstrate that the conventional and the MUX-based implementations are interchangeable, thereby allowing the designer to freely mix and match and select the most suitable implementation anywhere in the look-ahead tree that employs conventional generate and propagate variables. Furthermore, we introduce a generalized form of the fundamental carry operator as well as a new pair consisting of carry originate and propagate variables. The new bit-wise originate variable introduced in this paper includes the conventional generate variable as a special case, and leads to a significant saving in the number of transistors required, without affecting the critical path delay.

The rest of the paper is organized as follows. The next section briefly summarizes background material on carry-look-ahead techniques and the recursion unrolling therein. Section III builds the theoretical framework by defining new originate and propagate variables and establishing the recursion relations those variables satisfy. Section IV describes the delay model and presents the implementation of the basic building blocks, the structure of our look-ahead tree and the resulting adder architecture. Section V compares and contrasts our architecture with some others published in the literature. The comparison clearly demonstrates that our architecture is significantly faster and requires fewer transistors than those in [4] and [18]. The last section presents discussion and conclusions.

II Background

Carry-look-ahead schemes for addition are well known [9]. Let $A = (a_{n-1} a_{n-2} \cdots a_0)$ and $B = (b_{n-1} b_{n-2} \cdots b_0)$ be the two operands to be added, with $a_i$ and $b_i$ being the operand bits in the $i$th bit position. Here, $a_{n-1}$ and $b_{n-1}$ denote the most significant bits (MSBs), while $a_0$ and $b_0$ denote the least significant bits (LSBs). Let $c_i$ and $c_{i+1}$ denote the carries into and out of the $i$th bit position, respectively. The Boolean carry propagation equation is

$$ c_{i+1} = a_i b_i + c_i (a_i + b_i) = G_i + P_i c_i \tag{1} $$

where

$$ G_i = a_i b_i \tag{2} $$

and

$$ P_i = a_i + b_i \quad \text{or} \quad P_i = a_i \oplus b_i \tag{3} $$

Here, $G_i$ and $P_i$ are the well-known generate and propagate terms. In the above equations (and in all other Boolean equations in the rest of the manuscript) a $\cdot$ or a product term without any symbol between the literals represents a logical AND operation and a $+$ indicates the logical OR.
Following the literature, the symbol $\oplus$ is used to represent the Exclusive-OR (XOR) operation and a bar over a literal is used to indicate its complement. Henceforth, all the equations in the manuscript are Boolean, unless stated otherwise. Also, the words variable, term and signal are used synonymously, since a Boolean variable or term can be directly associated with a physical (binary) signal.

The propagate and generate signals for a group of bit positions $i, i-1, \ldots, j$ (where $i \ge j$) are denoted by $P_{i:j}$ and $G_{i:j}$, respectively, and can be expressed in terms of the bit-wise signals as follows:

$$ P_{i:j} = P_i P_{i-1} \cdots P_j \tag{4} $$

$$ G_{i:j} = G_i + P_i G_{i-1} + P_i P_{i-1} G_{i-2} + \cdots + P_i P_{i-1} \cdots P_{j+1} G_j \tag{5} $$

Any fast addition scheme must unroll the recursion inherent in the basic carry propagation equation (1) above. This is done with the help of the following properties satisfied by the generate and propagate variables:

$$ P_{i:j} = P_{i:m} P_{m-1:j} \tag{6} $$

$$ G_{i:j} = G_{i:m} + P_{i:m} G_{m-1:j} \quad \text{where } i \ge m \ge j+1 \tag{7} $$

$$ c_{i+1} = G_{i:j} + P_{i:j} c_j \tag{8} $$

Instead of dealing with each of the group carry functions separately, the pair $(G_{i:j}, P_{i:j})$ can be used along with a Boolean operator often called the fundamental carry operator [1, 10], henceforth denoted by $\circ$ and defined by

$$ (P, G) \circ (\tilde{P}, \tilde{G}) = (P \tilde{P},\; G + P \tilde{G}) \tag{9} $$

In terms of this operator, equations (6) to (8) can be generalized as follows [10]:

$$ (P_{i:j}, G_{i:j}) = (P_{i:m}, G_{i:m}) \circ (P_{v:j}, G_{v:j}) \quad \text{where } i \ge m \ge j+1 \text{ and } i \ge v \ge m-1 \tag{10} $$

$$ (P_{i:j}, c_{i+1}) = (P_{i:j}, G_{i:j}) \circ (1, c_j) \tag{11} $$

Unrolling this type of affine recursion leads to a tree structure as illustrated by the implementation in [1]. At each level of the tree, the generate and propagate signals for smaller groups are combined into larger groups, as per equations (??) and (??).

As is commonly done, we assume that the delay of an inverter, a two-input NAND/NOR gate and a transmission gate is approximately the same and designate it to be 1 unit of delay (the delay model is explained in detail in Section IV). Under this assumption, the delay required for the fundamental carry operation is two units when implemented directly in the form shown in equations (??) and (??). Note that in CMOS technology, an AND must be implemented as a NAND followed by an inverter. Hence the generation of the group propagate signal from its subgroups also requires two units even though it is a simple AND operation. Similarly, the synthesis of the group generate signal requires more than 1 unit of delay (it can be implemented with 2 units of delay). Thus, the delay associated with the look-ahead tree is $O(2 \log_2 n)$, where $n$ is the word length of the input operands [1].

III Theoretical Framework

In this section we demonstrate that expressing the fundamental carry operation in a different form makes it possible to implement it with a MUX, which requires a single unit of delay.
We define the originate and propagate signals for such a scheme and show their relationship with their conventional counterparts. We show that the originate and propagate signals for a group can be synthesized from those of two subgroups with a single unit of delay. Hence the delay of our look-ahead tree is $O(\log_2 n)$. As illustrated in Sections IV and V, this translates into a significantly smaller critical path delay for all word lengths.

We begin with the carry propagation equation (1), which can be expressed in the following form:

$$ c_{i+1} = \overline{(a_i \oplus b_i)}\, Q_i + (a_i \oplus b_i)\, c_i = \bar{P}_i Q_i + P_i c_i \tag{12} $$

where

$$ P_i = a_i \oplus b_i \tag{13} $$

and

$$ Q_i = \begin{cases} a_i b_i & \text{or} \\ a_i & \text{or} \\ b_i \end{cases} \tag{14} $$

Equation (12) can be implemented as a MUX, where the signal $P_i$ selects between the incoming carry $c_i$ and the originate signal $Q_i$. Figure 1 illustrates a transmission gate (TG) based static CMOS MUX implementation of the above equation. Figure 2 illustrates a direct implementation of $\bar{c}_{i+1}$. It is a straightforward fully complementary static CMOS circuit that actively restores the output signal to the proper logic level. This is different from the circuit in Figure 1, since the transmission gate is a passive device.

Assuming that the select signals $P_i$ and $\bar{P}_i$ are available simultaneously with or before the inputs $Q_i$ and $c_i$, the delay of the TG-based implementation shown in Figure 1 is one unit. The inversion of the output caused by the circuit in Figure 2 does not cost any extra delay, as shown later in Section IV: it is possible to have inversion at every level and operate on complementary signals. In other words, it is not necessary to retrieve $c_{i+1}$ explicitly in an uncomplemented form by adding an inverter to the circuit of Figure 2. Hence, the delay of the circuit in Figure 2 is equivalent to the delay for charging or discharging through at most 2 transistors in series, or that of a two-input NAND/NOR gate, i.e., 1 unit. However, the TG-based implementation needs only 4 transistors, while the actively restoring circuit in Figure 2 needs 8 transistors, which is a substantial increase in view of the fact that MUXes are used at every bit position and at all levels of the look-ahead tree.

Note that $P_i$ as defined by equation (13) is the digit sum (modulo 2).
If the digit sum is 1 then the carry-out equals the carry-in, or in other words, the incoming carry propagates. If the digit sum is 0 then the operands $a_i$ and $b_i$ are both 0 or both 1. A carry is generated if both are one, which implies that $Q_i$ should be set to 1. The interesting point is that one need not restrict $Q_i$ to the bit-wise AND. The originate signal $Q_i$ is selected only when $P_i$ is zero. In this case, $a_i = b_i$ and hence $a_i b_i = a_i = b_i$. Thus, one can use either of the bits $a_i$ or $b_i$ by itself in place of $(a_i b_i)$. This saves the NAND gate and inverter required at every bit position in order to generate the bit-wise AND. In fact, the pair $(P_i, Q_i)$ contains all the information that the pair $(a_i, b_i)$ has, when $Q_i = a_i$ (or $b_i$): $P_i$ indicates whether or not the bits are identical, and this together with the value of one bit is sufficient to retrieve the value of the other bit.

Note that in order to allow $Q_i = a_i$ (or $b_i$), we must restrict $P_i$ to $a_i \oplus b_i$, while the conventional carry-look-ahead scheme allows $P_i$ to equal $a_i + b_i$ as well (see equation (3)). Throughout the rest of the manuscript, $P_i$ is therefore restricted to the XOR operation defined by equation (13).

We now derive the originate and propagate signals for a group of two bits. To this end, first observe that

$$ \bar{P}_i Q_i = (\bar{a}_i \bar{b}_i + a_i b_i)\, Q_i = a_i b_i = G_i \tag{15} $$

when $Q_i$ assumes any of the three values defined in equation (14). For bit position 0 (i.e., $i = 0$), we obtain

$$ c_1 = \bar{P}_0 Q_0 + P_0 c_0 \tag{16} $$

and for bit position 1,

$$ \begin{aligned} c_2 &= \bar{P}_1 Q_1 + P_1 c_1 = \bar{P}_1 Q_1 + P_1 (\bar{P}_0 Q_0 + P_0 c_0) \\ &= \bar{P}_1 Q_1 + P_1 \bar{P}_0 Q_0 + P_1 P_0 c_0 \end{aligned} \tag{17} $$

Equation (17) can be re-written as

$$ c_2 = \overline{(P_1 P_0)}\, (\bar{P}_1 Q_1 + P_1 Q_0) + (P_1 P_0)\, c_0 \tag{18} $$

or, in terms of the group propagate and originate signals, as

$$ c_2 = \bar{P}_{1:0} Q_{1:0} + P_{1:0} c_0 \tag{19} $$

where $P_{1:0} = P_1 P_0$ is the propagate term for the group of two bit positions and $Q_{1:0} = \bar{P}_1 Q_1 + P_1 Q_0$ is the originate term for the group. Equation (19) has the same form as the right hand side of equation (12). Similarly, $c_3$ can be derived as

$$ \begin{aligned} c_3 &= \bar{P}_2 Q_2 + P_2 \bar{P}_1 Q_1 + P_2 P_1 \bar{P}_0 Q_0 + P_2 P_1 P_0 c_0 \\ &= \overline{(P_2 P_1 P_0)}\, (\bar{P}_2 Q_2 + P_2 \bar{P}_1 Q_1 + P_2 P_1 Q_0) + (P_2 P_1 P_0)\, c_0 \\ &= \bar{P}_{2:0} Q_{2:0} + P_{2:0} c_0 \end{aligned} \tag{20} $$

where

$$ P_{2:0} = P_2 P_1 P_0 \tag{21} $$

and

$$ Q_{2:0} = \bar{P}_2 Q_2 + P_2 \bar{P}_1 Q_1 + P_2 P_1 Q_0 \tag{22} $$
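The bit-level and two-bit relations above lend themselves to an exhaustive check. The following Python sketch is an illustration only and is not part of the proposed design: the helper names `mux`, `originate_choices`, `check_bit_level` and `check_two_bit_group` are hypothetical, and signals are modeled as the integers 0 and 1. It checks that equations (12) and (19) reproduce the true carries for every admissible choice of the originate signal in equation (14).

```python
from itertools import product


def mux(sel, when_one, when_zero):
    """2:1 multiplexor on 0/1 values: pass when_one if sel is 1, else when_zero."""
    return when_one if sel else when_zero


def originate_choices(a, b):
    """The three admissible bit-wise originate values of equation (14)."""
    return (a & b, a, b)


def check_bit_level():
    """Equation (12): c_{i+1} = mux(P_i, c_i, Q_i) for every choice of Q_i."""
    for a, b, c in product((0, 1), repeat=3):
        expected = (a & b) | (c & (a | b))   # equation (1)
        p = a ^ b                            # equation (13)
        for q in originate_choices(a, b):
            if mux(p, c, q) != expected:
                return False
    return True


def check_two_bit_group():
    """Equation (19): c_2 = mux(P_{1:0}, c_0, Q_{1:0}), Q_{1:0} = mux(P_1, Q_0, Q_1)."""
    for a1, b1, a0, b0, c0 in product((0, 1), repeat=5):
        # Reference carry out of a 2-bit addition done with ordinary integers.
        expected = ((2 * a1 + a0) + (2 * b1 + b0) + c0) >> 2
        p1, p0 = a1 ^ b1, a0 ^ b0
        for q1 in originate_choices(a1, b1):
            for q0 in originate_choices(a0, b0):
                p_group = p1 & p0
                q_group = mux(p1, q0, q1)
                if mux(p_group, c0, q_group) != expected:
                    return False
    return True


print(check_bit_level(), check_two_bit_group())
```

Both checks enumerate all input combinations, so they exercise the case $Q_i = a_i$ and $Q_i = b_i$ as well as the conventional $Q_i = a_i b_i$.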

From the above definitions, one can derive the following general expressions:

$$ P_{i:j} = \begin{cases} P_i P_{i-1} \cdots P_j & \text{when } i > j \\ P_i & \text{when } i = j \end{cases} \tag{23} $$

and

$$ Q_{i:j} = \begin{cases} \bar{P}_i Q_i + P_i \bar{P}_{i-1} Q_{i-1} + P_i P_{i-1} \bar{P}_{i-2} Q_{i-2} + \cdots + P_i P_{i-1} \cdots \bar{P}_{j+1} Q_{j+1} + P_i P_{i-1} \cdots P_{j+1} Q_j & \text{when } i > j \\ Q_i & \text{when } i = j \end{cases} \tag{24} $$

Using the relation $G_k = \bar{P}_k Q_k$, equation (5) for the conventional group generate term $G_{i:j}$ can be re-written as

$$ G_{i:j} = \bar{P}_i Q_i + P_i \bar{P}_{i-1} Q_{i-1} + \cdots + P_i P_{i-1} \cdots \bar{P}_{j+1} Q_{j+1} + P_i \cdots P_{j+1} \bar{P}_j Q_j \tag{25} $$

Note that the only difference between $Q_{i:j}$ and $G_{i:j}$ as defined by equations (24) and (25), respectively, is that the last term in the latter contains the literal $\bar{P}_j$, which is not present in the last term in (24). The group originate and propagate signals defined above satisfy the following property:

Lemma 1: $\bar{P}_{i:j} Q_{i:j} = G_{i:j}$ for $i \ge j$.

Proof: Equation (15) states that $\bar{P}_{i:i} Q_{i:i} = G_{i:i}$ and takes care of the case where $i = j$. Next, consider the case when $i > j$. In this case,

$$ \bar{P}_{i:j} Q_{i:j} = (\bar{P}_i + \cdots + \bar{P}_j)\, (\bar{P}_i Q_i + P_i \bar{P}_{i-1} Q_{i-1} + \cdots + P_i \cdots \bar{P}_{j+1} Q_{j+1} + P_i \cdots P_{j+1} Q_j) \tag{26} $$

We have

$$ \bar{P}_i Q_{i:j} = \bar{P}_i Q_i + 0 = \bar{P}_i Q_i \tag{27} $$

$$ \bar{P}_{i-1} Q_{i:j} = \bar{P}_i Q_i \bar{P}_{i-1} + P_i \bar{P}_{i-1} Q_{i-1} \tag{28} $$

From equations (27) and (28) above, we obtain

$$ (\bar{P}_i + \bar{P}_{i-1})\, Q_{i:j} = \bar{P}_i Q_i + P_i \bar{P}_{i-1} Q_{i-1} \tag{29} $$

In an identical manner, it follows that

$$ \begin{aligned} (\bar{P}_i + \bar{P}_{i-1} + \bar{P}_{i-2})\, Q_{i:j} &= \bar{P}_i Q_i + P_i \bar{P}_{i-1} Q_{i-1} + P_i P_{i-1} \bar{P}_{i-2} Q_{i-2} \\ &\;\;\vdots \\ (\bar{P}_i + \bar{P}_{i-1} + \cdots + \bar{P}_{j+1})\, Q_{i:j} &= \bar{P}_i Q_i + P_i \bar{P}_{i-1} Q_{i-1} + \cdots + P_i P_{i-1} \cdots \bar{P}_{j+1} Q_{j+1} \\ (\bar{P}_i + \bar{P}_{i-1} + \cdots + \bar{P}_j)\, Q_{i:j} &= \bar{P}_i Q_i + P_i \bar{P}_{i-1} Q_{i-1} + \cdots + P_i P_{i-1} \cdots \bar{P}_{j+1} Q_{j+1} + P_i P_{i-1} \cdots \bar{P}_j Q_j \\ &= G_{i:j} \end{aligned} \tag{30} $$

which proves the desired result.

From equation (8) and Lemma 1, the group carry-out can be expressed as

$$ c_{i+1} = G_{i:j} + P_{i:j} c_j = \bar{P}_{i:j} Q_{i:j} + P_{i:j} c_j \tag{31} $$

which can be implemented by a MUX.

Thus far, we have expressed the group originate and propagate signals $Q_{i:j}$ and $P_{i:j}$ in terms of the bit-wise originate and propagate signals $Q_m$ and $P_m$, where $i \ge m \ge j$. While this is necessary, it still does not tell us how to combine the originate and propagate signals of two smaller groups into those of a larger group. This operation is essential for a carry-look-ahead tree where smaller groups are combined into successively larger groups at each level of the tree. The following theorem establishes the required relations.

Theorem 1: The group propagate and originate variables satisfy

$$ P_{i:j} = P_{i:m} P_{v:j} \tag{32} $$

and

$$ Q_{i:j} = \bar{P}_{i:m} Q_{i:m} + P_{i:m} Q_{v:j} \tag{33} $$

where $i \ge m \ge j$, $i \ge v \ge j$ and $v \ge m-1$.

Proof: The proof of equation (32) is straightforward and is omitted for the sake of brevity. To prove equation (33), first re-write it using Lemma 1 as follows:

$$ Q_{i:j} = G_{i:m} + P_{i:m} Q_{v:j} \tag{34} $$

Expanding the right hand side of the above equation using relations (??), (??) and (??) that define $G_{i:j}$, $P_{i:j}$ and $Q_{i:j}$, respectively, we get

$$ \begin{aligned} G_{i:m} + P_{i:m} Q_{v:j} ={}& \bar{P}_i Q_i + P_i \bar{P}_{i-1} Q_{i-1} + \cdots + P_i P_{i-1} \cdots \bar{P}_v Q_v + \cdots + P_i P_{i-1} \cdots \bar{P}_m Q_m \\ &+ P_i \cdots P_v \cdots P_m \left( \bar{P}_v Q_v + P_v \bar{P}_{v-1} Q_{v-1} + \cdots + P_v P_{v-1} \cdots \bar{P}_m Q_m \right) \\ &+ P_i \cdots P_v \cdots P_m \left( P_v \cdots \bar{P}_{m-1} Q_{m-1} + \cdots + P_v \cdots \bar{P}_{j+1} Q_{j+1} + P_v \cdots P_{j+1} Q_j \right) \end{aligned} \tag{35} $$

Note that the expression on the second line of the above equation reduces to zero. Using this fact and re-writing the expression on the third line of the above equation, we obtain

$$ \begin{aligned} G_{i:m} + P_{i:m} Q_{v:j} ={}& \bar{P}_i Q_i + P_i \bar{P}_{i-1} Q_{i-1} + \cdots + P_i P_{i-1} \cdots \bar{P}_v Q_v + \cdots + P_i P_{i-1} \cdots \bar{P}_m Q_m \\ &+ P_i P_{i-1} \cdots P_m \bar{P}_{m-1} Q_{m-1} + \cdots + P_i \cdots \bar{P}_{j+1} Q_{j+1} + P_i \cdots P_{j+1} Q_j \\ ={}& Q_{i:j} \end{aligned} \tag{36} $$

which completes the proof.

This theorem enables us to define a fundamental carry-select operator, henceforth denoted by the symbol $\odot$, analogous to the conventional fundamental carry operator defined above by equation (9), as follows:

$$ (P, Q) \odot (\tilde{P}, \tilde{Q}) = (P \tilde{P},\; \bar{P} Q + P \tilde{Q}) \tag{37} $$

The relations stated in Theorem 1 that are satisfied by the group variables $P$ and $Q$ can be rewritten using the fundamental carry-select operator in the following manner:

$$ (P_{i:j}, Q_{i:j}) = (P_{i:m}, Q_{i:m}) \odot (P_{v:j}, Q_{v:j}) \quad \text{where } i \ge m \ge j \text{ and } i \ge v \ge m-1 \tag{38} $$

$$ (P_{i:j}, c_{i+1}) = (P_{i:j}, Q_{i:j}) \odot (1, c_j) \tag{39} $$

The above equations indicate that one can employ multiplexors to implement the look-ahead tree.

Next, we elaborate on the relation between the conventional $(P, G)$ variables and the $(P, Q)$ variables introduced here. Note that the variable $Q_i$ includes $G_i$ as a special case. Let

$$ Q^s_i = a_i b_i = G_i \tag{40} $$

denote $Q$ in the special case when it is selected to be the bit-wise AND (the superscript $s$ indicates that this is a special case). The variable $G_{i:j}$ (which is the same as $Q^s_{i:j}$) satisfies the following property:

Lemma 2: $\bar{P}_{i:j} Q^s_{i:j} = \bar{P}_{i:j} G_{i:j} = G_{i:j} = Q^s_{i:j}$ for $i \ge j$.

The proof is identical to that of Lemma 1 and is therefore omitted for the sake of brevity. In view of the above lemma, it is clear that the fundamental carry-select operator can also serve in place of the fundamental carry operator, i.e., the $P$ and $G$ variables also satisfy

$$ (P_{i:j}, G_{i:j}) = (P_{i:m}, G_{i:m}) \odot (P_{v:j}, G_{v:j}) \quad \text{where } i \ge m \ge j \text{ and } i \ge v \ge m-1 \tag{41} $$

$$ (P_{i:j}, c_{i+1}) = (P_{i:j}, G_{i:j}) \odot (1, c_j) \tag{42} $$
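To make the operator of equation (37) concrete, the following Python sketch builds group $(P, Q)$ pairs with a binary tree of carry-select nodes and then applies equation (39) to obtain every carry. It is a behavioral illustration under stated assumptions, not the authors' circuit: the function names (`carry_select`, `bitwise_pq`, `group_pq`, `carries`, `reference_carries`) are hypothetical, operands are ordinary Python integers, the bit-wise originate is taken as $Q_i = a_i$, and the groups are split into adjacent halves (i.e., $v = m - 1$).

```python
def carry_select(hi, lo):
    """Fundamental carry-select operator of equation (37) on (P, Q) pairs of 0/1 values."""
    p_hi, q_hi = hi
    p_lo, q_lo = lo
    # The second component is a 2:1 MUX driven by the propagate signal of the upper group.
    return (p_hi & p_lo, q_lo if p_hi else q_hi)


def bitwise_pq(a, b, n):
    """Bit-wise pairs (P_i, Q_i) with P_i = a_i XOR b_i and Q_i = a_i, for i = 0..n-1."""
    return [(((a >> i) ^ (b >> i)) & 1, (a >> i) & 1) for i in range(n)]


def group_pq(pairs, i, j):
    """(P_{i:j}, Q_{i:j}) obtained by recursively halving the group (i >= j)."""
    if i == j:
        return pairs[i]
    m = (i + j + 1) // 2          # upper group is [i:m], lower group is [m-1:j]
    return carry_select(group_pq(pairs, i, m), group_pq(pairs, m - 1, j))


def carries(a, b, n, c0=0):
    """Carries c_1..c_n via equation (39) applied to the groups [i:0]."""
    pairs = bitwise_pq(a, b, n)
    return [carry_select(group_pq(pairs, i, 0), (1, c0))[1] for i in range(n)]


def reference_carries(a, b, n, c0=0):
    """The same carries taken from ordinary integer addition of the low-order bits."""
    result = []
    for i in range(n):
        mask = (1 << (i + 1)) - 1
        result.append(((a & mask) + (b & mask) + c0) >> (i + 1))
    return result


print(carries(0b0111, 0b0001, 4))   # [1, 1, 1, 0]
```

The recursion recomputes shared subgroups for clarity; a hardware tree would of course share those nodes. Comparing `carries` against `reference_carries` over all operand pairs of a small width is a quick way to exercise Theorem 1.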

The conventional fundamental carry operator can therefore be generalized to include both operators defined above, viz., the fundamental carry operator and the fundamental carry-select operator. This flexibility of the generalized fundamental carry operator has an important implication with regard to the actual implementation: one can freely mix and match the conventional implementation (i.e., the one that directly implements equations of the form of (??) with AND and OR gates) and the MUX-based implementation anywhere in the look-ahead tree. Thus, the $(P, G)$ pair is more flexible and better suited than the $(P, Q)$ pair. However, $Q_i$ can be set to one of the operand bits $a_i$ or $b_i$, thereby saving a significant number of transistors. Furthermore, one can switch from $Q$ variables to $G$ variables with a simple AND operation. We would like to point out that using $Q_i \neq G_i$ has significance only at the first level of the look-ahead tree. From the second level onwards, the $(P, G)$ pair is more desirable since it lends itself to both the conventional and the MUX implementation. However, the conversion from $Q$ to $G$ leads to extra delay in the total addition time. To avoid this delay, we deal only with the $Q$ variables in our design.
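The mix-and-match property can be checked exhaustively for a group of four bits. The Python sketch below is illustrative only: the names `carry_op`, `carry_select_op`, `xor_pg`, `group_of_four` and `implementations_agree` are hypothetical, and the two-level tree shape is an assumption chosen for the example. Every node of the tree is implemented with either the AND-OR form of equation (9) or the MUX form of equation (37) applied to $(P, G)$ pairs, and all eight combinations are compared.

```python
from itertools import product


def carry_op(hi, lo):
    """Conventional fundamental carry operator, equation (9), in AND-OR form."""
    (p_hi, g_hi), (p_lo, g_lo) = hi, lo
    return (p_hi & p_lo, g_hi | (p_hi & g_lo))


def carry_select_op(hi, lo):
    """MUX form of equation (37), applied here to (P, G) pairs as in equation (41)."""
    (p_hi, g_hi), (p_lo, g_lo) = hi, lo
    return (p_hi & p_lo, g_lo if p_hi else g_hi)


def xor_pg(a, b):
    """Bit-wise (P, G) with the propagate restricted to XOR, as the text requires."""
    return (a ^ b, a & b)


def group_of_four(bits, ops):
    """(P_{3:0}, G_{3:0}) from a two-level tree; ops = (upper node, lower node, root)."""
    b3, b2, b1, b0 = bits
    return ops[2](ops[0](b3, b2), ops[1](b1, b0))


def implementations_agree():
    """True if every mix of the two node types yields the same group (P, G)."""
    both = (carry_op, carry_select_op)
    for operands in product((0, 1), repeat=8):
        bits = [xor_pg(operands[2 * k], operands[2 * k + 1]) for k in range(4)]
        results = {group_of_four(bits, ops) for ops in product(both, repeat=3)}
        if len(results) != 1:
            return False
    return True


print(implementations_agree())
# With an OR-type propagate (P = 1, G = 1 when a = b = 1) the two forms differ:
print(carry_op((1, 1), (0, 0)), carry_select_op((1, 1), (0, 0)))   # (0, 1) (0, 0)
```

The last line illustrates why the propagate signal has to be the XOR: with $P_i = a_i + b_i$ a bit position can have both $P_i = 1$ and $G_i = 1$, and the MUX form then discards the generated carry.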

IV Implementation

(A) Delay Model: While a time delay measured from an actual chip or a layout is desirable, raw nanoseconds do not always give an idea of the complexity of the underlying circuit. Also, the absolute delays change with technology, materials and processes. On the other hand, if the delay is specified in terms of equivalent gate delays, then it indicates how many levels of logic the signal has to traverse, and hence the complexity of the underlying circuit. If the gate delay estimation is done realistically and correctly, it becomes somewhat independent of technology. Everything is relative to the delay of a basic unit such as a two-input NAND/NOR gate, which can be estimated fairly accurately for a given technology. Thus the gate delay model can be used to estimate delays when comparing different algorithms and has been used very extensively in the literature. We have therefore adopted it in this manuscript. The majority of the references cited here have also used the gate delay model, allowing us to make a meaningful comparison.

As is commonly done, we assume that the delay of an inverter, a 2-input NAND/NOR gate, and a transmission gate is approximately the same and designate it to be 1 unit of delay. If the number of transmission gates cascaded in series is not too large, then the delay per transmission gate can be assumed to be 1 unit. A chain of 4 transmission gates has a delay of 4 units under this assumption. If more gates are cascaded in series, the delay could increase in a nonlinear (superlinear) manner. For instance, the number of series devices is limited to 8 in the adder in Digital's ALPHA AXP processor. In any case, a TG-based multiplexor can be replaced with the inverting MUX structure shown in Figure 2, which actively restores the logic levels.
This leads to an inversion of the output, but it turns out that the critical path delay is not affected by this inversion, since it is possible to use the true and complemented forms of the relevant signals at alternate levels of the look-ahead tree. Note that the inverted outputs generated by the MUX shown in Figure 2 can be directly used at the next level. This is briefly explained as follows. Suppose that a normal MUX selects between signals $A$ and $B$:

$$ \text{normal MUX output} = S A + \bar{S} B \tag{43} $$

If only the complemented signals $\bar{A}$ and $\bar{B}$ are available and one uses an inverting MUX, the resulting output is

$$ \text{inverting MUX output} = \overline{(S \bar{A} + \bar{S} \bar{B})} = S A + \bar{S} B + A B = S A + \bar{S} B = \text{normal MUX output} \tag{44} $$

where the last simplification drops the redundant consensus term $AB$. This obviates the need for extra inverters, thereby reducing the critical path delay. In the circuit shown in Figure 2, the critical path delay is associated with a pull-up or pull-down through at most two series transistors and is therefore equivalent to the delay of a two-input NAND/NOR gate, or 1 unit. If the delay of a cascaded chain of transmission gates becomes superlinear, one can always use the inverting implementation, whose delay grows in proportion to the length of the series chain. Thus, there is no delay penalty if one uses the inverting MUX instead of a transmission gate MUX, but the number of transistors required goes up from 4 to 8.

Our design utilizes 2-input XOR/XNOR gates that can be implemented in fully complementary logic, which requires 12 transistors and 2 units of delay, or with a transmission structure [20]. The latter implementation needs 6 transistors when both inputs are available only in the true (uncomplemented) form and 4 transistors when at least one of the inputs is available in both true and complemented form. A TG-based implementation of XOR/XNOR [20] has an inverter followed by a transmission structure, which makes the total delay closer to 2 units than 1. Hence, the delay associated with an XOR/XNOR gate is assumed to be 2 units. The extra inverter in the TG XOR circuit causes the delays associated with the two inputs to be unequal: the delay of the input signal that has to propagate through the inverter and the transmission structure is 2 units, while the delay of the signal that has to propagate only through the transmission structure is (approximately) 1 unit.

Finally, we assume that the delay associated with complex CMOS gates with up to 4 independent inputs is 1.5 units. This abstraction is commonly used in the literature [18, 11, 15]. The AND-OR-INVERT (A-O-I) and OR-AND-INVERT (O-A-I) gates and majority gates are examples of complex gates whose delay is roughly 1.5 units.

(B) Architecture: We use a look-ahead tree to generate the carries with the shortest possible delay. The blocking factor or fan-in of our tree is two at each level. In other words, we always combine the $P$ and $Q$ signals of two smaller groups to generate the $P$ and $Q$ signals of the larger group at each node of the look-ahead tree.
This differs from a conventional implementation where the group $P$ and $G$ signals for four smaller groups are combined into a larger group at each node of the look-ahead tree. It turns out that for static CMOS technology, such a binary tree is the fastest possible way of generating the carries for word lengths of practical significance (less than 1024).

Generation of a carry for each individual bit position can be avoided by using the carry-select technique. Here, the input bits are grouped into blocks and a ripple-carry addition is performed for each block assuming both possibilities, viz., that the incoming carry is 0 and that it is 1. When the actual carry input to a block becomes available (it is generated by the look-ahead tree), it is used to drive a block multiplexor that selects the correct output. This way, the need to generate a carry input for each individual bit position is obviated, thereby reducing the total critical path delay. The larger the block size, the fewer the carries to be generated; but the time delay required for the ripple-carry addition through a block increases with its size. The optimal size of the block should therefore be selected in such a way that the delay of the ripple-carry addition through the most significant block and the delay associated with the generation of the carry input to that block are equal or as balanced as possible, so that all the inputs to the block multiplexor arrive at approximately the same time. Later in this section, we tabulate the optimal block size as a function of operand word length.

For the time being, consider a block size $b = 4$, purely for the purpose of illustration. The ripple-carry adder for a group of 4 bit positions is illustrated in Figure 3. In this figure, the XOR/XNOR gates that generate the final sum output bits are implemented with fully complementary logic (as opposed to transmission gate logic) and require a delay of 2 units, leading to a total delay of 5 units for the ripple-carry addition.

We would like to point out that the selection of the fan-in factor of our look-ahead tree is totally independent of the choice of the block size $b$ for the ripple-carry addition. For instance, the optimal block size changes along with the word length. The fan-in of the look-ahead tree, however, remains two for all these word lengths.

As mentioned above, the generation of the group $P$ and $Q$ signals is accomplished with a binary look-ahead tree. Figure 4 depicts the generation of the group $P$ and $Q$ signals for a group of 4 bits. The generation of the group $Q$ signal is achieved by a two-level binary tree of MUXes. At each node of the tree, a MUX is used to select between the incoming carry and the generated carry, based on whether or not the digit sums, i.e., the $P$'s in that group, are all ones. Since both $P$ and its complement $\bar{P}$ are required to drive the MUXes, both these signals are generated at all levels of the tree:

$$ P_{i:j} = \overline{(\bar{P}_{i:m} + \bar{P}_{v:j})} \tag{45} $$

and

$$ \bar{P}_{i:j} = \overline{(P_{i:m} P_{v:j})} \quad \text{where } i \ge m \ge j \text{ and } i \ge v \ge m-1 \tag{46} $$

Note that equation (45) can be implemented by a single NOR gate while equation (46) can be implemented by a single NAND gate. If both the $P$'s and their complements were not available at each level of the tree, then one could not synthesize the required signals in a single unit of delay, since a direct implementation of an AND requires a NAND followed by an inverter. This way, the group $P$, $\bar{P}$ and $Q$ signals are generated simultaneously from those of two subgroups within one unit of delay, thereby allowing us to double the group size at the cost of a single unit of delay.

The overall architecture of our 32-bit adder is shown in Figure 5.
For this wordlength, the optimum block size turns out to be 4. The boxes labeled CSC (Carry Select Circuit) at the top employ the circuit shown in Figure 4 to implement the first two levels of the carry-look-ahead tree. The outputs of the CSC blocks are the $P$, $\bar{P}$ and $Q$ signals for groups of 4 bits. Note that the group of the 4 least significant bits employs a different circuit labeled CG (Carry Generator) in place of the CSC circuit. The need for this asymmetry is explained later in this section. The multiplexors labeled M are of the fully complementary inverting type shown in Figure 2. The modules near the bottom that are labeled BM are the block multiplexors that select one of the two sum outputs, depending on the incoming carry signal (the block carry and its complement drive the MUXes to select the proper sum output). The block MUXes BM can be of the transmission gate type, since all their inputs are actively restored (i.e., there are no transmission gates in series with the block MUXes).

The delays associated with various signals shown in the figure are indicated with numbers enclosed in curly braces {}. For instance, the signal $\bar{Q}_{23:16}\,\{3\}$ indicates that it is the complement of the originate variable of the group of bits 16 to 23, and the postfix {3} indicates that this signal is generated 3 units after the bit-wise $P_i$ and $Q_i$ signals become available to the CSC modules. Similarly, $\bar{c}_{32}\,\{5\}$ indicates that the complement of the carry out of the MSB is available 5 units after the bit-wise $P_i$ and $Q_i$ have been generated. The outputs of the block multiplexors are the final sum bits.

Note that the circuit in Figure 4 (which implements the first two levels of the look-ahead tree) has two transmission gates in series. Since the $P$ inputs to this circuit are themselves generated using transmission structures (i.e., TG-based XOR/XNOR gates), the multiplexors at the third level of the tree (that select signals out of two CSC blocks) should not be implemented using transmission gates; otherwise, too many of them get cascaded in series. Hence, we use the fully complementary inverting multiplexor shown in Figure 2. From the third level onwards, the fan-out load on the MUX outputs also increases. Hence, we choose all the MUXes in the rest of the look-ahead tree to be of the inverting type, since it actively restores the signals and also makes it possible to drive larger loads by properly sizing the transistors.

The critical paths in the carry generation tree traverse the inputs to the CSC circuits and the multiplexors that select the block carry-outs for all blocks in the most significant half. It is seen that $c_{16}$ is available after 4 units (since $\log_2 16 = 4$ and we require unit delay per level of the look-ahead tree). While the computation of $c_{16}$ is being performed in the least significant half, the group propagate and originate signals for all the bit groups in the most significant half, viz., $(P_{19:16}, Q_{19:16})$, $(P_{23:16}, Q_{23:16})$, $(P_{27:16}, Q_{27:16})$ and $(P_{31:16}, Q_{31:16})$, are also computed in parallel (this is clearly illustrated in Figure 5). Consequently, all that remains is the computation of the carries $c_{20}$, $c_{24}$, $c_{28}$ and $c_{32}$ from the corresponding group $P$, $\bar{P}$ and $Q$ signals and the carry $c_{16}$, which can also be done in parallel.
Thus there is only one extra level of MUXes after the signal $c_{16}$, leading to a total of 5 levels in the look-ahead circuit. Generation of the complements of these carry signals requires one extra unit of delay (that of an inverter). Finally, the block multiplexors take one more unit of delay to give a total critical path delay of 7 units (from the time the inputs to the CSC blocks on top become available, to the generation of the sum outputs). Note that the delay associated with a 4 bit ripple-carry adder is 5 units, while the carry signals (and their complements) are available after 6 gate delays. So the ripple-carry adders are not on the critical paths.

The least significant CG module (corresponding to bit positions 0 through 3) is slightly different from all others. This is done in order to shorten the critical path, as explained next. The carry input to the least significant bit, viz., $c_0$, is typically 0 for a normal two operand addition. This is a fixed constant known beforehand and hence there is no need to select between this constant and the originate signal $Q_{3:0}$ for the group. Such a selection would put one more MUX in the critical path and increase the delay by one unit. In particular, note that

$$c_4 = G_{3:0} + P_{3:0}\, c_0 = G_{3:0} \quad \text{when } c_0 = 0 \qquad (47)$$

Hence, instead of generating $Q_{3:0}$ it suffices to generate $G_{3:0}$. This can be accomplished by setting the originate signal $Q_0$ to the logical AND of the inputs $a_0$ and $b_0$, i.e., $Q_0 = a_0 b_0$. From equation (??) and Lemma 1, note that

$$G_{i:j} = \bar{P}_{i:m}\, Q_{i:m} + P_{i:m}\, G_{v:j} \qquad (48)$$

In other words, if one replaces the lower order originate variable $Q_{v:j}$ in equation (??) by the generate variable $G_{v:j}$, then the output is $G_{i:j}$ instead of $Q_{i:j}$. We can exploit this fact to generate $G_{3:0}$. Note that setting $Q_0 = a_0 b_0 = G_0$ in Figure 4 causes multiplexor M1 in the CSC block to generate $G_{1:0}$ instead of $Q_{1:0}$ as per the above equation. As a result, multiplexor M3 in Figure 4 generates $G_{3:0}$ instead of $Q_{3:0}$, because the lower order generate term $G_{1:0}$ is used instead of $Q_{1:0}$.

This scheme reduces the critical path delay at the expense of flexibility, since the carry-in $c_0$ is not allowed to be an independent external input. As shown next, the block CG can be further modified to allow a variable external carry-in $c_0$ without affecting the critical path delay. A carry-in of 1 might be required, for instance, when performing a subtraction or a multi-operand addition. Subtraction $A - B$ is achieved by taking the bit-wise complement of $B$ and adding a 1 to the least significant bit (LSB) position. The addition of 1 in the LSB position can be accomplished by setting the carry $c_0$ to 1.

To see that a variable carry-in can be handled without affecting the total delay, note that in Figure 4, if the signal $Q_0$ is replaced by the actual carry $c_1$, then multiplexor M1 generates

$$\bar{P}_1\, Q_1 + P_1\, c_1 = c_2 \qquad (49)$$

i.e., the true carry $c_2$. This, in turn, causes multiplexor M3 to generate

$$\bar{P}_{3:2}\, Q_{3:2} + P_{3:2}\, c_2 = c_4 \qquad (50)$$

i.e., the true carry $c_4$ as required. Thus, the problem now reduces to whether the carry $c_1$ (or its complement) can be synthesized without increasing the critical path delay, so that it can be used in place of $Q_0$. Note that the bit-wise propagate signals $P_i$ and $\bar{P}_i$ are the XOR/XNOR functions whose realization requires 2 units of delay, as explained earlier. The propagation of signals in the look-ahead tree begins after the $P$ and $\bar{P}$ signals become available.
Therefore, if the carry $c_1$ (or its complement) can be generated within a delay smaller than two units, the total critical path delay remains unaffected. Fortunately, it turns out that $c_1$ can be implemented with a delay smaller than 2 units by employing the inverting majority gate (a complex gate) as illustrated in Figure 6. This gate realizes the complement of the majority function $(a_0 b_0 + b_0 c_0 + c_0 a_0)$. Note that the worst case delay for this circuit is equivalent to that of a pull-up or pull-down through no more than 3 series transistors, or 1.5 units. Retrieving the uncomplemented $c_1$ by adding an inverter causes additional delay, which would defeat the purpose. Hence we employ the complement $\bar{Q}_1$ along with an inverting MUX M to obtain the carry $c_2$ in the true (uncomplemented) form. The resultant architecture of the CG block is shown in Figure 7, and this is the circuit that is used in the overall block diagram shown in Figure 5.
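The substitution argument for the CG block can be verified exhaustively for a 4-bit group. The sketch below is a logical model only, under the assumptions that $Q_i = a_i$ and that the inverting MUX simply complements the value it selects; the helper names (`inv_majority`, `mux`, `cg_carry_out`) are hypothetical and timing is not modeled.

```python
from itertools import product

def inv_majority(a0, b0, c0):
    """Inverting majority gate: complement of a0*b0 + b0*c0 + c0*a0, i.e. the complement of c_1."""
    return 1 - ((a0 & b0) | (b0 & c0) | (c0 & a0))

def mux(p, q, lower):
    """Selection: pass the lower-order value when p = 1, otherwise originate q."""
    return lower if p else q

def cg_carry_out(a, b, c0):
    """Carry c_4 out of bits 3..0 with the carry c_1 substituted for Q_0."""
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    P = [x ^ y for x, y in zip(A, B)]
    Q = A                                   # assumed choice Q_i = a_i
    c1_bar = inv_majority(A[0], B[0], c0)
    # Inverting MUX fed with complemented inputs yields the true c_2 (equation 49).
    c2 = 1 - mux(P[1], 1 - Q[1], c1_bar)
    q32 = mux(P[3], Q[3], Q[2])             # Q_{3:2}
    p32 = P[3] & P[2]                       # P_{3:2}
    return mux(p32, q32, c2)                # equation 50

# Exhaustive check over all 4-bit operands and both carry-in values.
for a, b, c0 in product(range(16), range(16), (0, 1)):
    assert cg_carry_out(a, b, c0) == (a + b + c0) >> 4
```

With $c_0 = 0$ the majority function reduces to $a_0 b_0 = G_0$, so the same model also covers the fixed carry-in case, in which the block output equals $G_{3:0}$.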

Next, we describe the selection of the optimal value of the block size $b$ for a given word length $W$. Recall that the carry look-ahead structure is a binary tree with a total critical path delay of $(1 + \lceil \log_2 W \rceil)$ units, where $\lceil x \rceil$ denotes the smallest integer greater than or equal to $x$ and the constant 1 arises because the block multiplexors need both the incoming group carry signal and its complement to select the final sum bits. Some of these inverters (those driving the block MUXes in the most significant half) fall on the critical path, leading to one additional unit of delay. The block size is chosen in such a way that the ripple addition through a block completes before the incoming carry becomes available (ideally these inputs to the block MUXes should arrive at the same time, but that is not always feasible). The optimal values of $b$ for different wordlengths are summarized in Table 1 below.

Table 1: Optimal ripple-carry block size $b$ as a function of word length of the input operands. The delays refer to the time required to generate the inputs to the block multiplexors that select the final sum bits, from the time the digit-wise $P$ and $Q$ signals become available. Columns: word length; ripple-carry block size $b$; delay of ripple-carry addition through $b$ bits; delay of carry look-ahead tree.

Our investigation revealed that going to unequal block sizes cannot shorten the overall critical path delay beyond the values indicated in Table 1 (under the delay model discussed earlier). For instance, increasing $b$ to 5, 6 or 7 in the most significant half for $W = 32$ or $W = 64$ does not lead to any further speed up, because such unequal block sizes increase the delay through the look-ahead tree by 1 unit (as compared to that when the block size is uniform and equals 4). The extra delay is incurred due to the asymmetry introduced in the bit positions for which a group carry-in signal is to be generated.
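The block-size rule can be illustrated numerically. The sketch below rests on two assumptions that are not taken from Table 1: the ripple delay of a $b$-bit block is modeled as $b + 1$ units, a linear fit to the two figures quoted in the text (5 units for 4 bits, 9 units for 8 bits), and only power-of-two block sizes are considered, in line with the remark that intermediate sizes lengthen the look-ahead tree. The function names are hypothetical.

```python
def ripple_delay(b):
    """Assumed model: a b-bit ripple block needs b + 1 units (5 for b = 4, 9 for b = 8)."""
    return b + 1

def carry_ready(width):
    """Units until a block carry and its complement are available: 1 + ceil(log2 W)."""
    return 1 + (width - 1).bit_length()

def optimal_block(width, candidates=(1, 2, 4, 8, 16)):
    """Largest candidate block whose ripple sums are ready no later than the block carry."""
    return max(b for b in candidates if ripple_delay(b) <= carry_ready(width))

for w in (32, 64, 128, 256):
    print(w, optimal_block(w), ripple_delay(optimal_block(w)), carry_ready(w))
```

Under these assumptions the model selects $b = 4$ for 32, 64 and 128 bits and switches to $b = 8$ at 256 bits, where the 9-unit ripple through 8 bits first fits within the carry delay.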
Only when the delay of the look-ahead tree becomes long enough to be comparable to that of a ripple-carry addition through 8 bits does the block size $b = 8$ become optimal. The delay of ripple-carry addition through 8 bits is 9 units. Hence, a block size of 8 becomes optimal only for word lengths equal to and beyond 256 bits, which are of limited practical significance.

We would like to point out that the delay model ignores fan-out loads. If fan-out loading is considered, then transistor sizing for higher drive capability and/or extra buffering might be necessary, thereby increasing the delay of our 64 bit adder tree by an extra unit or two. In that case, a larger block size in the most significant half might turn out to be better. However, this kind of optimization depends on the actual circuit parameters such as resistance and capacitance, which in turn depend on the layout, fabrication technology, etc. Analysis at this detailed level is impossible without a full layout and is beyond the scope of this manuscript. The main point is that the circuit level and layout level optimizations can be carried out on top of the algorithm improvements presented in this paper. Thus, under the delay model discussed above, it is better to have equal sized blocks with the $b$ values shown in Table 1.

V Comparison

The addition times for different designs are listed in Table 2 as a function of the wordlength of the input operands ($W$).

Table 2: Comparison of addition times (gate delays) for different designs as a function of operand wordlength $W$. The delay for the adder in the ALPHA-AXP processor is estimated based on the block and circuit diagrams presented in [4]. It was assumed that a ripple propagation through 8 bits in that design takes 8 units of delay, and that the cascode dual MUX switches have a delay of 1 unit. All buffering and latching delays have been ignored. Columns: (1) word length $W$; (2) our block size $b$; (3) our proposed design; (4) our alternate design (carry select + binary look-ahead tree without MUXes); (5) adder in [18]; (6) adder in Digital's DEC ALPHA-AXP processor [4] (conditional sum).

It is seen that the use of MUXes enables doubling of the word length for 1 unit of additional delay. One of the main reasons that makes the ALPHA-AXP design slower is the fact that the ripple-carry addition through 8 bits takes a long time.

Column 5 in Table 2 lists delays for the adder in [18]. As pointed out at the beginning of this section, it is more realistic to assume the delay of an XOR/XNOR gate to be 2 units. This assumption causes the delay values listed in the table above to be 2 units higher than the corresponding delay values originally presented in [18]. It is seen that our adder is faster than that of [18] for all wordlengths. For 32 and 64 bit operands, the speed improvements are 33% and 31%, respectively. The adder in [18] is slower than ours because

(i) The ripple through 8 bit positions is slow.

(ii) The design in [18] inserts inverters in the tree of MUXes to retrieve the signals in true (uncomplemented) form. This is not required and can be avoided as explained in Section IV. Thus, our look-ahead tree architecture is faster than that of [18] for all word lengths.

(iii) The generation of the signals that feed the look-ahead tree in [18] takes longer (4 units as opposed to 2).

Column 4 of Table 2 lists the delay of an alternate architecture which is the same as the proposed (optimal) one in all respects (i.e., the block size for ripple-carry addition and the overall architecture of the look-ahead tree) except that it implements equations (??) and (??) directly, without using MUXes. In the alternate design, the complement of $G_{i:j}$ is synthesized in 1.5 units of delay by using an AND-OR-INVERT (AOI) complex gate with 3 inputs, viz., $G_{i:m}$, $P_{i:m}$ and $G_{m-1:j}$. The inversion inherent in CMOS causes no extra delay and nodes at all levels of the tree can be inverting, thereby yielding a delay of 1.5 units per level. We decided to present this alternate architecture for comparison, instead of the binary look-ahead design presented in [1], for two reasons:

(1) It clearly pinpoints the advantage gained by the use of MUXes even though everything else remains the same: only 1 unit of delay is required per level of the look-ahead tree, resulting in a saving of 0.5 unit per level (as compared with a design that uses NAND/NOR gates instead of MUXes to implement equations (??) and (??)).

(2) The original design in [1] does not employ carry selection at all. It calculates carries for each individual bit position. This causes the number of levels in the look-ahead circuit to grow considerably. For a $2^w$-bit adder, the number of levels in the carry generation circuit in their design is $(2w - 1)$, because the carries of successively larger groups are computed in $w$ levels and then an inverted tree is used to calculate the carries for all the intermediate individual bit positions. This makes the design in [1] very slow, and hence it was decided that a direct comparison with this design is not particularly interesting.

An inverted look-ahead tree is unnecessary when the carry-selection technique is employed.
Consequently, our architecture of a $2^w$-bit adder has only $w$ levels in the look-ahead circuit. While the carry out of the least significant half, i.e., $c_{2^{w-1}}$, is being computed, the $P$ and $Q$ signals for all the bit blocks in the most significant half are also computed in parallel. In fact, the generation of $c_{2^{w-1}}$ and all the $P$ and $Q$ signals in the most significant half finishes at exactly the same time, i.e., after $w - 1$ levels of the tree. Consequently, once $c_{2^{w-1}}$ is available, the carry-in signals for all the bit blocks in the most significant half are calculated in parallel at the last level of the look-ahead circuit. A special case of this general property of our architecture is illustrated by the 32-bit adder shown in Figure 5. For this case, $w = 5$ and $c_{16}$ is available after $w - 1 = 4$ time units. All the carries in the most significant half are then generated in the last level of the look-ahead circuit, leading to a delay of $w = 5$ units (for signals to propagate through the tree).

The alternate design thus demonstrates that one need not be restricted to $P$ and $Q$ signals only: it is possible to have the same ($w$) levels in the look-ahead tree and use the conventional $P$ and $G$ signals along with the fully complementary, actively restoring inverting implementation of equation (??).
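That the conventional $P$, $G$ formulation and the MUX-based $P$, $Q$ formulation yield the same carries with the same number of tree levels can be checked with a short logical model. This is a sketch under assumptions: $Q_i = a_i$, the word length is a power of two, and the helper names (`merge_pg`, `merge_pq`, `tree`, `carry_out`) are hypothetical. It counts levels only and does not model the 1 versus 1.5 unit per-level delays.

```python
from itertools import product

def merge_pg(hi, lo):
    """Conventional operator: P = P_hi * P_lo, G = G_hi + P_hi * G_lo."""
    (p_hi, g_hi), (p_lo, g_lo) = hi, lo
    return (p_hi & p_lo, g_hi | (p_hi & g_lo))

def merge_pq(hi, lo):
    """MUX-based operator: P = P_hi * P_lo, Q = Q_lo if P_hi else Q_hi."""
    (p_hi, q_hi), (p_lo, q_lo) = hi, lo
    return (p_hi & p_lo, q_lo if p_hi else q_hi)

def tree(leaves, merge):
    """Reduce leaves (index 0 = LSB) pairwise; each pass is one tree level."""
    levels = 0
    while len(leaves) > 1:
        leaves = [merge(leaves[i + 1], leaves[i]) for i in range(0, len(leaves), 2)]
        levels += 1
    return leaves[0], levels

def carry_out(a, b, cin, width, use_q):
    """Carry out of a width-bit group and the number of tree levels used."""
    bit_pairs = [((a >> i) & 1, (b >> i) & 1) for i in range(width)]
    if use_q:
        (p, q), levels = tree([(x ^ y, x) for x, y in bit_pairs], merge_pq)
        return (cin if p else q), levels
    (p, g), levels = tree([(x ^ y, x & y) for x, y in bit_pairs], merge_pg)
    return (g | (p & cin)), levels

# Exhaustive comparison for 4-bit groups (w = 2 levels).
for a, b, cin in product(range(16), range(16), (0, 1)):
    expected = ((a + b + cin) >> 4, 2)
    assert carry_out(a, b, cin, 4, True) == expected
    assert carry_out(a, b, cin, 4, False) == expected
```

Both reductions use $w$ levels for a $2^w$-bit group; in this model the only difference between them is the per-node function.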

It is seen that our alternate design with the conventional $P$ and $G$ signals is faster than the design in [18]. This happens despite the fact that the conventional design needs 1.5 units of delay per level of the look-ahead tree, while the design in [18] employs MUXes requiring only 1 unit of delay per level of the tree. The main reason is that the ripple-carry addition through 8 bits in that design is very slow and offsets the speed advantage gained by using MUXes. Thus, the use of MUXes alone does not fully optimize the delay; the architecture of the look-ahead tree must also be optimized in order to achieve the best possible result.

Table 3 compares the transistor count estimates for different designs. The transistor counts for the designs in [18] are reproduced from Table III therein, for the sake of comparison.

Table 3: Transistor counts for different adders for different wordlengths. Columns: word length ($W$); our adder (proposed design) allowing up to 4 series transmission gates; our adder (proposed design) allowing up to 6 series transmission gates; adder in [18]; our adder (alternate design).

Table 3 shows that our adder (proposed design) needs a smaller number of transistors than that required by the adder in [18] for wordlengths equal to and larger than 32 bits. It should be pointed out that the design in [18] has up to 11 transmission gates cascaded in series. When transmission gates are replaced by their fully complementary, actively restoring circuits, the transistor count goes up significantly. For example, the XOR gates that generate the sum output bits in the ripple circuit in Figure 3 can be implemented very economically with merely 4 transistors each, if one utilizes transmission structures. The delay is likely to be small (about 1 unit) as well, since both the $P$ and $\bar{P}$ signals are already available and hence can be put on the slow input of the TG XOR circuit, leaving the late arriving ripple carry on the fast input.
A fully complementary implementation, on the other hand, needs 10 transistors and has a delay of 2 units. Even when we limit the number of series transmission gates to 4, our 32 and 64 bit adders require 7% and 4% fewer transistors than the corresponding design in [18]. If we limit the number of series transmission gates to 6 (which is reasonable, especially as the feature size continues to shrink), the savings in the number of transistors are substantial: 28% and 25% for 32 and 64 bit adders, respectively. The savings come about mainly because the design in [18] needs too much processing (transistors) at the top to generate the inputs required for the look-ahead tree. The cells that provide the inputs to the look-ahead tree are essentially at the leaf nodes of the tree. This, along with the fact that the number of leaf nodes in a tree is (1 + the number of internal nodes), implies that any saving in transistors at the leaf nodes will lead to a significant saving in the total number of transistors.

The last column of Table 3 shows that our alternate architecture (combining carry-select and binary look-ahead with conventional $P$ and $G$ signals) requires a larger number of transistors than our MUX-based (optimal and preferred) design. Actually, the transistor counts for the adders excluding the cells that generate the bit-wise $P$, $Q$ and $P$, $G$ signals (i.e., the transistor counts of all the blocks shown in Figure 5) turn out to be about the same for both the proposed and the alternate design. The main difference is that the proposed design utilizes the $Q_i$ variables, which require no processing whatsoever since $Q_i$ can be set to either $a_i$ or $b_i$ by itself. The alternate design, on the other hand, must use $G_i = a_i b_i$, i.e., a bit-wise AND (in reality, a NAND is sufficient). This causes the transistor count of the $P_i$, $G_i$ generator cell in the alternate design to be 14, instead of the 10 transistors that are required for the $P_i$, $Q_i$ generation in the proposed design. This leads to a sizable saving in the total number of transistors and clearly brings out the advantage of using the $Q$ variables instead of the $G$ variables.

The conditional-sum architecture is expensive in terms of the number of transistors [9]. Since the adder in the ALPHA chip is basically a conditional sum design, its transistor count is likely to be larger than all those presented in Table 3. Comparing its transistor count does not lead to any additional insight, and it was therefore excluded from Table 3.

Tables 2 and 3 clearly demonstrate that our design (i) is significantly faster than those in [18] and [4], and (ii) requires fewer transistors. A substantial speedup of 31% or higher is achieved despite the fact that our design needs less hardware.

Finally, we compare our design with a recently published dynamic CMOS design [8] which adds two 56 bit operands (56 bits is the length of the significand prescribed in the IEEE standard for double-precision floating-point numbers). It employs variable length Manchester chains and several other circuit optimizations to achieve a propagation delay of 1.85 ns through the look-ahead tree (up to the generation of sum bits).
However, it is a dynamic CMOS design and the 1.85 ns excludes the delay required to generate the bit-wise $G$ and $P$ signals, as well as the delay required to latch the final outputs. In [8], the delay of a basic two input gate was measured to be about 0.25 ns for a 1 micron technology. Assuming a delay of 2 units to generate the $G$ and $P$ signals and 1 unit to latch the final output, the total delay for that design would be $1.85 + 3 \times 0.25 = 2.6$ ns. The delay of our 64 bit design, on the other hand, would be $10 \times 0.25 = 2.5$ ns, which is comparable to that of [8]. Note, however, that our design can handle 64 bits. The design in [8], on the other hand, was fine tuned and optimized for a 56 bit addition. Consequently, it is not clear whether that design can be extended easily to a 64 bit addition. Furthermore, if it is required to add 128 bits (perhaps in the final stage of a double precision multiply operation), our design scales up easily, while the design in [8] might have to be completely redone.

We would like to stress that a complete comparison is not possible until a detailed layout is available. However, we expect our design to perform better because the speedup and transistor savings come about as a result of improvements at the algorithm and architecture levels. All the circuit optimizations employed in other contemporary designs can also be utilized along with our algorithm to yield even further performance enhancements.

VI Conclusion
