Forward and Reverse Converters and Moduli Set Selection in Signed-Digit Residue Number Systems


J Sign Process Syst

Andreas Persson · Lars Bengtsson

Received: 8 March 2007 / Revised: 21 May 2008 / Accepted: 12 June
Springer Science + Business Media, LLC. Manufactured in The United States

Abstract This paper presents an investigation into using a combination of two alternative digital number representations, the residue number system (RNS) and the signed-digit (SD) number representation, in digital arithmetic circuits. The combined number system is called RNS/SD for short. Since the performance of RNS/SD arithmetic circuits depends on the choice of the moduli set (a set of pairwise prime numbers), the purpose of this work is to compare RNS/SD number systems based on different sets. Five specific moduli sets of different lengths are selected. Moduli-set-specific forward and reverse RNS/SD converters are introduced for each of these sets. A generic conversion technique for moduli sets consisting of any number of elements is also presented. Finite impulse response (FIR) filters are used as reference designs in order to evaluate the performance of RNS/SD processing. The designs are evaluated with respect to delay and circuit area in a commercial 0.13 μm CMOS process. For the case of FIR filters it is shown that generic moduli sets with five or six moduli result in designs with the best area-delay products.

A. Persson, Centre for Research on Embedded Systems (CERES), Halmstad University, Sweden, andreas.persson@hh.se
L. Bengtsson (corresponding author), Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden, labe@chalmers.se

Keywords Residue number system · Signed-digit · Moduli-selection · Converters · FIR filters

1 Introduction

This paper presents an investigation into a combination of two number representations: the residue number system (RNS) and the signed-digit (SD) number system.
In RNS, an integer is decomposed into a set of residues with shorter binary representations, which can be processed in parallel. Carry propagation within RNS arithmetic circuits can be eliminated by using the SD number system to represent the residues. The SD number system provides a redundant number representation that facilitates carry-free addition. The use of SD numbers also implies efficient modulo arithmetic, which helps to simplify crucial RNS operations.

The basis of a residue number system is a set of pairwise prime integers, called the moduli set. The performance of RNS processing depends on the choice of this set and on the implementation of forward and reverse RNS conversion. Many moduli sets and conversion techniques have been suggested for RNS systems with residues in 2's complement form. This work investigates moduli sets and converters for use with signed-digit residue number systems (RNS/SD). The aim is to investigate how the choice of the moduli set affects the performance of RNS/SD arithmetic operations. A number of moduli sets were selected for evaluation, and forward and reverse RNS/SD converters were implemented for each of these sets. In order to compare the performance of RNS/SD processing, RNS/SD finite impulse response (FIR) filters were implemented using Synopsys Design Compiler and a 0.13 μm CMOS cell library from UMC. The synthesized designs are compared with respect to delay (speed) and circuit area.

The paper is organized as follows. Section 2 gives an introduction to the signed-digit residue number system and outlines the principles of forward and reverse RNS/SD conversion. The moduli sets selected for evaluation in this work are presented in Section 3 together with guidelines for selecting efficient moduli sets. Section 4 presents a technique for RNS/SD encoding and Section 5 presents moduli-set-specific decoding techniques for each of the sets introduced in Section 3. A reverse conversion technique for RNS/SD number systems using general moduli sets is presented in Section 6. The technique detailed in Section 6 is applicable to all coprime sets with one element of the form $2^n$. The RNS/SD finite impulse response (FIR) filters which have been used as reference designs are described in Section 7. ASIC synthesis results are presented in Section 8, where performance evaluations and comparisons are made with respect to delay and circuit area. Section 9 gives the conclusions.

2 Background

2.1 Residue Number System

The residue number system (RNS) is an integer system capable of supporting high-speed concurrent arithmetic. In RNS, an integer is decomposed into a set of smaller integers (i.e. with shorter binary representations), which can be processed independently and in parallel. The basis of an RNS is a set of pairwise prime integers $S = \{m_1, m_2, \ldots, m_L\}$, where $\gcd(m_i, m_j) = 1$ for $i \neq j$. The set $S$ is called the moduli set and the dynamic range of the number system is $[0, M)$, where $M$ is the product of all moduli $m_i$ in $S$. Any integer $X$ within the dynamic range has a unique RNS representation given by an ordered set of residues

$$X \rightarrow \{x_1, x_2, \ldots, x_L\}, \qquad x_i = |X|_{m_i},$$

where $|X|_{m_i}$ denotes $X \bmod m_i$.
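As a minimal illustration of the residue representation defined above, the following Python sketch maps integers to their residues and checks that every integer in the dynamic range receives a distinct residue tuple. The moduli set {7, 8, 9} and the helper name `to_rns` are chosen here purely for illustration and do not come from the paper.

```python
from math import prod

def to_rns(x, moduli):
    """Return the residues x_i = X mod m_i, one per modulus."""
    return tuple(x % m for m in moduli)

# Illustrative pairwise coprime moduli set (an assumption, not from the paper).
moduli = (7, 8, 9)
M = prod(moduli)  # the dynamic range is [0, M)

# Collect the residue tuples of every integer in the dynamic range.
all_tuples = {to_rns(x, moduli) for x in range(M)}

print(to_rns(100, moduli))      # residues of 100
print(len(all_tuples) == M)     # True when every representation is unique
```

Uniqueness holds only because the moduli are pairwise coprime; with a set such as {6, 8, 9} several integers below the product would share a residue tuple.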
The most important characteristic of the RNS representation is that it is a non-weighted number system, which facilitates parallel computing. If integers $A$ and $B$ have RNS representations $\{a_1, a_2, \ldots, a_L\}$ and $\{b_1, b_2, \ldots, b_L\}$ respectively, then the RNS representation of $C = A \circ B$ is

$$C \rightarrow \{c_1, c_2, \ldots, c_L\}, \qquad c_i = |a_i \circ b_i|_{m_i},$$

where $\circ$ denotes addition, subtraction, multiplication or any combination of the three. The computation of $c_i$ depends upon $a_i$, $b_i$ and $m_i$ only. Hence, each $c_i$ can be computed using a separate arithmetic unit, often called a channel. The reconstruction of $X$ from $\{x_1, x_2, \ldots, x_L\}$ is based on the Chinese Remainder Theorem (CRT)

$$X = \left| \sum_{i=1}^{L} \hat{M}_i \left| \hat{M}_i^{-1} \right|_{m_i} x_i \right|_M, \tag{1}$$

where

$$M = \prod_{i=1}^{L} m_i, \qquad \hat{M}_i = \frac{M}{m_i},$$

and $|\hat{M}_i^{-1}|_{m_i}$ is the multiplicative inverse of $\hat{M}_i$ modulo $m_i$, such that $|\hat{M}_i \hat{M}_i^{-1}|_{m_i} = 1$.

2.2 Signed-Digit Number System

The radix-2 signed-digit (SD) number system has the digit set $\{\bar{1}, 0, 1\}$, where $\bar{1}$ denotes $-1$. An $N$-digit SD number $Y = [y_{N-1} \ldots y_0]_{SD}$, $y_i \in \{\bar{1}, 0, 1\}$, has the value

$$Y = \sum_{i=0}^{N-1} 2^i y_i, \tag{2}$$

which is the same as for an unsigned binary number except that $y_i$ can be $-1$. This yields a redundant number representation. For example, 6 can be represented as $[0110]_{SD}$, $[10\bar{1}0]_{SD}$ or $[1\bar{1}10]_{SD}$. Zero, however, has a unique representation.

To represent an SD digit $y$, two bits, $y^-$ and $y^+$, are required. That is, $y = [y^- y^+]$. Using this digit encoding, the value of an $N$-digit SD number $Y = [y_{N-1} \ldots y_0]_{SD}$ is given by Eq. 3.

$$Y = \sum_{i=0}^{N-1} 2^i y_i^+ - \sum_{i=0}^{N-1} 2^i y_i^- \tag{3}$$

Note that, unlike the 2's complement representation, it is possible to represent any integer and its negation with an equal number of digits. The negation of an integer is a very simple operation in the SD number system. The negation of $y = [y^- y^+]$ is $(-y) = [y^+ y^-]$, as can be seen by negating Eq. 3. No logic gates are required for this operation.

By exploiting the redundancy of the signed-digit number representation, carry propagation is limited to one bit position when adding SD numbers.
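The two-bit digit encoding and the wire-swap negation of Section 2.2 can be checked with a short Python sketch. The function names and the choice of MSB-first digit lists are assumptions made for this illustration; the sketch models values only, not the hardware.

```python
from itertools import product

def sd_value(digits):
    """Value of an SD number given MSB-first digits in {-1, 0, 1} (Eq. 2)."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value

def to_bit_pairs(digits):
    """Encode each digit y as the bit pair (y_minus, y_plus)."""
    return [(1 if d == -1 else 0, 1 if d == 1 else 0) for d in digits]

def value_from_pairs(pairs):
    """Eq. 3: weighted sum of the plus bits minus weighted sum of the minus bits."""
    n = len(pairs)
    plus = sum(p << (n - 1 - i) for i, (_, p) in enumerate(pairs))
    minus = sum(m << (n - 1 - i) for i, (m, _) in enumerate(pairs))
    return plus - minus

def negate(pairs):
    """Negation swaps the two bits of every digit; no arithmetic is needed."""
    return [(p, m) for (m, p) in pairs]

# Enumerate all 4-digit SD representations of 6 and of 0.
reps_of_6 = [d for d in product((-1, 0, 1), repeat=4) if sd_value(d) == 6]
reps_of_0 = [d for d in product((-1, 0, 1), repeat=4) if sd_value(d) == 0]

pairs = to_bit_pairs((1, 0, -1, 0))   # one of the representations of 6
print(reps_of_6)
print(value_from_pairs(pairs), value_from_pairs(negate(pairs)))
```

The enumeration confirms the redundancy (several digit strings for 6) and the unique representation of zero for a fixed digit length.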
Addition of two numbers $X$ and $Y$ is performed according to the set of rules presented in Table 1, in which $c_i$ denotes the (non-propagating) carry and $u_i$ the interim sum. These rules avoid any carry propagation when the final sum $s_i$ is computed according to Eq. 4. Consequently, addition is performed in constant time, regardless of operand widths.

Table 1 Rules for adding SD numbers (inputs: $x_i$, $y_i$ and whether neither or at least one of $x_{i-1}$, $y_{i-1}$ is 1; outputs: $c_i$ and $u_i$).

$$s_i = u_i + c_i, \qquad s_i \in \{\bar{1}, 0, 1\} \tag{4}$$

2.3 Combining RNS and SD

The use of the SD number system has been suggested as a way to eliminate carry propagation within RNS arithmetic circuits. The carry-free properties of the SD number system provide constant-time addition operations. The parallel processing capabilities of the residue number system result in faster and more area-efficient multiplication operations. The use of signed-digit representation also implies efficient modulo arithmetic, which helps to simplify crucial RNS operations.

An important consideration when designing RNS systems is the choice of the moduli set. Sets with elements of the forms $2^n$, $2^n - 1$ and $2^n + 1$ are of special interest. Such low-cost moduli facilitate the use of simplified arithmetic units. The properties of the SD number system help to further simplify modulo arithmetic for low-cost moduli. Addition modulo $2^n - 1$ and $2^n + 1$ is performed using SD adders with end-around-carry logic [1]. Due to the limited carry propagation in SD adders, there is no delay penalty for the end-around-carry operations. Furthermore, unlike in the 2's complement number system, the result of modulo $2^n + 1$ addition is represented using $n$ SD digits, since representations for sums greater than $2^n - 1$ are taken from the negative range.

Modulo multiplication by powers of two with respect to the low-cost moduli relies on simple shift operations, according to the rules in Eq. 5, where $x = [x_{n-1} \ldots x_0]_{SD}$ is an $n$-digit SD number. These operations are accomplished by wiring connections appropriately.

$$\begin{aligned}
|2^a x|_{2^n - 1} &= [x_{n-a-1} \ldots x_0 \; x_{n-1} \ldots x_{n-a}]_{SD} \\
|2^a x|_{2^n} &= [x_{n-a-1} \ldots x_0 \; 0 \ldots 0]_{SD} \\
|2^a x|_{2^n + 1} &= [x_{n-a-1} \ldots x_0 \; \bar{x}_{n-1} \ldots \bar{x}_{n-a}]_{SD}
\end{aligned} \tag{5}$$

2.4 Previous Work

The RNS and SD number systems are well known and have been thoroughly studied in the literature, for example in [2–6]. The possibility of combining RNS and SD arithmetic has also, to a lesser extent, been studied, most notably by Wei and Shimizu [7–9]. In [1], Lindström et al. present efficient forward and reverse converters for the combined RNS/SD number system using the popular moduli set $\{2^n - 1, 2^n, 2^n + 1\}$. In [10], Lindahl and Bengtsson present direct-form finite impulse response (FIR) filter implementations using RNS/SD. Their work shows that the use of the RNS/SD number representation can reduce circuit area and power dissipation while the clock period is retained.

3 Moduli Selection in RNS/SD

One of the most important considerations when designing RNS systems is the choice of the moduli set. The choice of moduli affects the complexity of forward and reverse converters as well as RNS arithmetic circuits. In [11], Abdallah and Skavantzos state that the moduli set $S = \{m_1, \ldots, m_L\}$ should be chosen such that the moduli $m_i$ satisfy the following criteria:

1. They should be pairwise prime. That is, $\gcd(m_i, m_j) = 1$ for all $m_i \neq m_j$.
2. Each modulus $m_i$ should be as small as possible so that operations modulo $m_i$ require minimum computational time.
3. The moduli $m_i$ should imply simple binary-to-RNS and RNS-to-binary conversions as well as simple RNS arithmetic.
4. The moduli product should be large enough to implement the desired dynamic range.
5. The moduli should provide a well-balanced decomposition of the dynamic range. This means that the difference in word length between the moduli should be as small as possible.

Sets with all elements being of the forms $2^n$, $2^n - 1$ and $2^n + 1$ satisfy the requirement of simple conversions and efficient modulo arithmetic. Since the SD number system is used to represent residues, addition and subtraction are performed in constant time, regardless of operand widths.
Consequently, criteria 2 and 5 are less important for adder-based RNS/SD applications. However, for multiplication-intensive applications, moduli sets with small and balanced moduli result in faster and more area-efficient implementations.

3.1 Parameterized Moduli Sets

Several types of moduli sets have been considered by RNS researchers, and a large number of different parameterized moduli sets have been suggested in the literature. The parameterized sets consist of a small number of low-cost moduli of a fixed form, where each modulus is expressed as a function of a parameter, say $n$. The dynamic range of such sets can easily be scaled by adjusting $n$. The use of parameterized moduli implies efficient RNS conversions, since it is possible to take advantage of moduli-set-specific properties, such as attractive closed-form expressions for the moduli product and for the multiplicative inverses required for reverse conversion. However, as the number of moduli is increased, such attractive properties are rare and come at the cost of balance in residue word lengths.

Five parameterized moduli sets, $S_1, \ldots, S_5$, have been selected for evaluation in this work.

$$\begin{aligned}
S_1 &= \{2^n - 1,\; 2^n + 1\} \\
S_2 &= \{2^n - 1,\; 2^n,\; 2^n + 1\} \\
S_3 &= \{2^n - 1,\; 2^n,\; 2^n + 1,\; 2^{2n} + 1\} \\
S_4 &= \{2^{n-1},\; 2^{n-1} - 1,\; 2^n - 1,\; 2^n + 1\} \\
S_5 &= \{2^n,\; 2^n - 1,\; 2^{n-1} - 1,\; 2^{n-1} + 1\}
\end{aligned}$$

Moduli-set-specific forward and reverse converters have been implemented for RNS/SD number systems based on each of the five sets. The forward conversion technique is outlined in Section 4. Reverse converters for the parameterized moduli sets are presented in Section 5.

3.2 General Moduli Sets

If large dynamic ranges are required, a general moduli set consisting of a larger number of moduli might result in better performance. If low-cost moduli are used, RNS/SD forward conversions for such sets are as efficient as for the parameterized moduli sets. Reverse conversion, on the other hand, is a significantly more difficult task.
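To make the trade-off between the parameterized sets of Section 3.1 and a general set more concrete, the following Python sketch tabulates pairwise coprimality, dynamic range and word-length spread. The helper names are invented for this illustration; the general set {128, 129, 127, 65, 17} is the one used for the encoder in Fig. 1. Dynamic range is counted here as ⌊log2 M⌋ bits and the word length of a low-cost modulus as the rounded value of log2 m, both of which are assumptions of this sketch.

```python
from math import gcd, log2, prod
from itertools import combinations

def parameterized_sets(n):
    """The five parameterized moduli sets S1..S5 for a given n."""
    return {
        "S1": (2**n - 1, 2**n + 1),
        "S2": (2**n - 1, 2**n, 2**n + 1),
        "S3": (2**n - 1, 2**n, 2**n + 1, 2**(2 * n) + 1),
        "S4": (2**(n - 1), 2**(n - 1) - 1, 2**n - 1, 2**n + 1),
        "S5": (2**n, 2**n - 1, 2**(n - 1) - 1, 2**(n - 1) + 1),
    }

def summarize(moduli):
    """Return (pairwise coprime?, dynamic range in bits, word-length spread)."""
    coprime = all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))
    range_bits = prod(moduli).bit_length() - 1       # floor(log2 M)
    # Word length n_i of a low-cost modulus 2^n, 2^n - 1 or 2^n + 1.
    widths = [round(log2(m)) for m in moduli]
    return coprime, range_bits, max(widths) - min(widths)

general = (128, 129, 127, 65, 17)   # general set used for the encoder in Fig. 1

for name, s in parameterized_sets(4).items():
    print(name, s, summarize(s))
print("general", general, summarize(general))
```

For n = 4 the sketch reports that S5 is not pairwise coprime while S4 is, which reflects the parity restriction on n discussed for these two sets in Section 5.4.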
A generic RNS/SD decoder must handle the adverse properties of the Chinese Remainder Theorem, that is, modulo $M$ operations for a large-valued $M$ and multiplications by constant factors which do not necessarily have attractive forms. A decoder for general moduli sets has been designed and the implementation is outlined in Section 6.

4 RNS/SD Encoders

For each moduli set presented in Section 3, an RNS/SD encoder has been developed. The encoder for a moduli set $S = \{m_1, m_2, \ldots, m_L\}$ converts an integer in binary form into $L$ SD residues. Modulo reduction for low-cost moduli is straightforward when using the SD number system. To construct a residue $x_i$ from an integer $X$, $X$ is partitioned into vectors of the same length as the corresponding modulus $m_i$. The last vector is padded with constant zeros if necessary.

$$x_i = |X|_{m_i} = \left| k_{i,0} + 2^{n_i} k_{i,1} + 2^{2 n_i} k_{i,2} + \ldots + 2^{l n_i} k_{i,l} \right|_{m_i}, \tag{6}$$

where

$$\begin{aligned}
k_{i,0} &= [x_{n_i - 1} \, x_{n_i - 2} \ldots x_0], \\
k_{i,1} &= [x_{2 n_i - 1} \, x_{2 n_i - 2} \ldots x_{n_i}], \\
&\;\;\vdots \\
k_{i,l} &= [0 \ldots 0 \, x_{W-1} \ldots x_{l n_i}].
\end{aligned}$$

Here $n_i$ is the word length of modulus $m_i$ and $W$ is the word length of $X$. Since $m_i$ is either $2^{n_i}$, $2^{n_i} - 1$ or $2^{n_i} + 1$, the rules of Eq. 5 apply for multiplication by powers of two and Eq. 6 simplifies to

$$x_i = \begin{cases}
\left| k_{i,0} + k_{i,1} + k_{i,2} + k_{i,3} + \ldots \right|_{2^{n_i} - 1} & \text{if } m_i = 2^{n_i} - 1, \\
k_{i,0} & \text{if } m_i = 2^{n_i}, \\
\left| k_{i,0} - k_{i,1} + k_{i,2} - k_{i,3} + \ldots \right|_{2^{n_i} + 1} & \text{if } m_i = 2^{n_i} + 1.
\end{cases}$$

The RNS/SD encoders consist of a number of multi-operand SD modulo adders, one for each $m_i \neq 2^{n_i}$. Since the input $X$ is in binary form, it is possible to reduce the complexity of the encoder by using simplified SD adder cells on the first levels of each adder tree. Figure 1 shows the encoder for the RNS/SD number system with moduli set {128, 129, 127, 65, 17}.

5 RNS/SD Decoders for Parameterized Moduli Sets

5.1 RNS/SD Decoder for Moduli Set S1

The proposed architecture for decoding SD residues with respect to the $S_1 = \{m_1, m_2\} = \{2^n - 1, 2^n + 1\}$ moduli set is based on the Chinese Remainder Theorem, as presented in Section 2.1.
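The chunk-wise reduction of Section 4 (Eq. 6 and its three-case simplification) can be modelled in software as follows. This is a sketch of the arithmetic only: the function names and the string labels for the modulus forms are assumptions of this example, and the result is normalized to [0, m) with Python's `%` operator, whereas the hardware adders may leave an SD result in the negative range.

```python
def chunks(x, n):
    """Split the binary integer x into n-bit vectors k_0, k_1, ... (k_0 first)."""
    out = []
    while x:
        out.append(x & ((1 << n) - 1))
        x >>= n
    return out or [0]

def modulus(n, kind):
    """Numeric value of a low-cost modulus; kind labels are specific to this sketch."""
    return {"2^n-1": (1 << n) - 1, "2^n": 1 << n, "2^n+1": (1 << n) + 1}[kind]

def encode_residue(x, n, kind):
    """Residue of x for a low-cost modulus of word length n."""
    k = chunks(x, n)
    if kind == "2^n":
        return k[0]                                   # just the lowest n bits
    if kind == "2^n-1":
        return sum(k) % modulus(n, kind)              # 2^n == 1: plain sum
    # 2^n == -1 modulo 2^n + 1: chunks are added with alternating signs
    alternating = sum(c if i % 2 == 0 else -c for i, c in enumerate(k))
    return alternating % modulus(n, kind)

# Moduli set {128, 129, 127, 65, 17} from Fig. 1 as (word length, form) pairs.
fig1_set = [(7, "2^n"), (7, "2^n+1"), (7, "2^n-1"), (6, "2^n+1"), (4, "2^n+1")]

x = 123456789
print([encode_residue(x, n, kind) for n, kind in fig1_set])
```

Each residue only needs additions and subtractions of short vectors, which is why one multi-operand modulo adder per modulus not of the form 2^n is enough.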
For an RNS with two moduli, the CRT procedure in Eq. 1 is reduced to

$$X = \left| \hat{M}_1 \left| \hat{M}_1^{-1} \right|_{m_1} x_1 + \hat{M}_2 \left| \hat{M}_2^{-1} \right|_{m_2} x_2 \right|_M. \tag{7}$$

Figure 1 RNS/SD encoder for moduli set {128, 129, 127, 65, 17}.

For the particular set $S_1$, we have

$$M = 2^{2n} - 1, \qquad \hat{M}_1 = \frac{M}{m_1} = 2^n + 1, \qquad \hat{M}_2 = \frac{M}{m_2} = 2^n - 1.$$

It is easy to see that the two multiplicative inverses needed for computation of Eq. 7 are both powers of two.

Claim: $\left| \hat{M}_1^{-1} \right|_{m_1} = 2^{n-1}$

Proof:
$$\left| \hat{M}_1^{-1} \hat{M}_1 \right|_{m_1} = \left| 2^{n-1} (2^n + 1) \right|_{2^n - 1} = \left| 2^{n-1} 2^n + 2^{n-1} \right|_{2^n - 1} = \left| 2^{n-1} + 2^{n-1} \right|_{2^n - 1} = \left| 2^n \right|_{2^n - 1} = 1$$

Claim: $\left| \hat{M}_2^{-1} \right|_{m_2} = 2^{n-1}$

Proof:
$$\left| \hat{M}_2^{-1} \hat{M}_2 \right|_{m_2} = \left| 2^{n-1} (2^n - 1) \right|_{2^n + 1} = \left| 2^{n-1} 2^n - 2^{n-1} \right|_{2^n + 1} = \left| -2^{n-1} - 2^{n-1} \right|_{2^n + 1} = \left| -2^n \right|_{2^n + 1} = 1$$

Inserting the derived expressions for $\hat{M}_1$, $\hat{M}_2$, $|\hat{M}_1^{-1}|_{m_1}$, $|\hat{M}_2^{-1}|_{m_2}$ and $M$ into Eq. 7 yields

$$\begin{aligned}
X &= \left| (2^n + 1) 2^{n-1} x_1 + (2^n - 1) 2^{n-1} x_2 \right|_{2^{2n} - 1} \\
  &= \left| \left( 2^{2n-1} + 2^{n-1} \right) x_1 + \left( 2^{2n-1} - 2^{n-1} \right) x_2 \right|_{2^{2n} - 1} \\
  &= \left| A x_1 + B x_2 \right|_{2^{2n} - 1}
\end{aligned} \tag{8}$$

Using the rules for multiplication by powers of two from Eq. 5, together with the fact that $x_1$ and $x_2$ both have digit length $n$ in the SD number system, $X$ of Eq. 8 can be computed as the sum of two $2n$-digit SD vectors, formed by concatenation, rotation and negation.

$$\begin{aligned}
A x_1 &= \left| 2^{2n-1} x_1 + 2^{n-1} x_1 \right|_{2^{2n}-1} = [x_{1,0} \; x_{1,n-1} \ldots x_{1,0} \; x_{1,n-1} \ldots x_{1,1}]_{SD} \\
B x_2 &= \left| 2^{2n-1} x_2 - 2^{n-1} x_2 \right|_{2^{2n}-1} = [x_{2,0} \; \bar{x}_{2,n-1} \ldots \bar{x}_{2,0} \; x_{2,n-1} \ldots x_{2,1}]_{SD}
\end{aligned}$$

No logic gates are required to form $A x_1$ and $B x_2$. One modulo $2^{2n} - 1$ SD adder is sufficient to generate $X$. The result will be in the range $(-M, M)$, due to the fact that SD modulo adders use the negative range as well as the positive. If the output is required to be in the range $[0, M)$, the correct result is obtained by adding $M = [1^{2n}]_{SD}$ (a string of $2n$ ones) to $X$ when $X$ is negative. Adding constant ones to an SD integer is a simple operation, as shown in [1]. Carry-look-ahead (CLA) adders are used to obtain the binary representation of $X$, according to Eq. 3. In order to minimize the extra delay introduced by this range correction, both $X$ and $X + M$ are decoded to binary form, using two CLAs operating in parallel. The correct value is selected by examining the carry-out bit of the adder for $X$.
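The S1 decoder can be checked numerically with a small Python model that builds the two 2n-digit vectors purely by concatenation, negation and rotation, as Eq. 8 suggests. This is a sketch under stated assumptions: digits are kept in MSB-first lists, the helper names are invented here, the residue value 2^n modulo 2^n + 1 is represented by the n-digit SD string for -1, and the final normalization to [0, M) uses Python's `%` instead of the parallel CLA selection described in the text.

```python
def sd_value(digits):
    """Value of an MSB-first SD digit list."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value

def to_sd_digits(v, n):
    """n SD digits (MSB first) of an integer v with |v| < 2^n."""
    sign = -1 if v < 0 else 1
    return [sign * ((abs(v) >> i) & 1) for i in reversed(range(n))]

def rotl(digits, a):
    """Rotate left by a positions: multiplication by 2^a modulo 2^N - 1 (Eq. 5)."""
    a %= len(digits)
    return digits[a:] + digits[:a]

def decode_s1(x1, x2, n):
    """Reverse conversion for S1 = {2^n - 1, 2^n + 1} following Eq. 8."""
    M = (1 << (2 * n)) - 1
    # A residue equal to 2^n (modulo 2^n + 1) is taken from the negative range.
    v2 = -1 if x2 == (1 << n) else x2
    d1, d2 = to_sd_digits(x1, n), to_sd_digits(v2, n)
    ax1 = rotl(d1 + d1, n - 1)                   # (2^(2n-1) + 2^(n-1)) x1
    bx2 = rotl(d2 + [-d for d in d2], n - 1)     # (2^(2n-1) - 2^(n-1)) x2
    return (sd_value(ax1) + sd_value(bx2)) % M

n = 3
m1, m2 = (1 << n) - 1, (1 << n) + 1
print(pow(m2, -1, m1), pow(m1, -1, m2))      # both inverses equal 2^(n-1)
print(decode_s1(10 % m1, 10 % m2, n))        # recovers 10
```

The three-argument `pow` with exponent -1 (Python 3.8 or later) returns the modular inverse and confirms the two claims above for the chosen n.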
The hardware architecture of the decoder is depicted in Fig. 2.

Figure 2 RNS/SD decoder for moduli set S1.

5.2 RNS/SD Decoder for Moduli Set S2

The set $S_2 = \{2^n - 1, 2^n, 2^n + 1\}$ is probably the most widely used moduli set for RNS. It is also the moduli set that has been most intensively studied in the literature. The decoding of binary-residue number systems based on set $S_2$ is studied, for example, in [12–14]. An efficient decoder for RNS systems with SD residues is presented in [1]. This is also the decoder that has been used in this work, with some minor modifications. The conversion technique is outlined again in this section since the new decoder for moduli set $S_3$, presented in Section 5.3, is based on a similar approach.

The decoder is based upon a modified formulation of the Chinese Remainder Theorem, the New CRT-I, as presented in [15]. According to the New CRT-I, the binary representation $X$ of a residue number $\{x_1, x_2, \ldots, x_L\}$ can be computed as

$$\begin{aligned}
X &= x_1 + m_1 X', \\
X' &= \left| k_1 (x_2 - x_1) + k_2 m_2 (x_3 - x_2) + \ldots + k_{L-1} m_2 m_3 \cdots m_{L-1} (x_L - x_{L-1}) \right|_{m_2 m_3 \cdots m_L},
\end{aligned} \tag{9}$$

where $\{m_1, m_2, \ldots, m_L\}$ is the moduli set and $k_1, k_2, \ldots, k_{L-1}$ are multiplicative inverses, given by

$$\left| k_1 m_1 \right|_{m_2 m_3 \cdots m_L} = 1, \qquad \left| k_2 m_1 m_2 \right|_{m_3 m_4 \cdots m_L} = 1, \qquad \ldots, \qquad \left| k_{L-1} m_1 m_2 \cdots m_{L-1} \right|_{m_L} = 1.$$

If the elements of $S_2$ are rearranged such that $m_1 = 2^n$, $m_2 = 2^n + 1$ and $m_3 = 2^n - 1$, then the two multiplicative inverses $k_1$ and $k_2$ are both powers of two.

Claim: $k_1 = 2^n$

Proof:
$$\left| k_1 m_1 \right|_{m_2 m_3} = \left| 2^n \cdot 2^n \right|_{2^{2n} - 1} = \left| 2^{2n} \right|_{2^{2n} - 1} = 1$$

Claim: $k_2 = 2^{n-1}$

Proof:
$$\left| k_2 m_1 m_2 \right|_{m_3} = \left| 2^{n-1} \cdot 2^n (2^n + 1) \right|_{2^n - 1} = \left| 2^{n-1} \cdot 2^n \cdot 2^n + 2^{n-1} \cdot 2^n \right|_{2^n - 1} = \left| 2^{n-1} + 2^{n-1} \right|_{2^n - 1} = \left| 2^n \right|_{2^n - 1} = 1$$

Using the expressions for $m_1$, $m_2$ and $m_3$, together with the derived expressions for $k_1$ and $k_2$, Eq. 9 simplifies to

$$\begin{aligned}
X &= x_1 + 2^n X', \\
X' &= \left| 2^n (x_2 - x_1) + 2^{n-1} \left( 2^n + 1 \right) (x_3 - x_2) \right|_{2^{2n} - 1}.
\end{aligned} \tag{10}$$

By expanding the terms of Eq. 10 and grouping the coefficients of $x_1$, $x_2$ and $x_3$, the expression for $X'$ can be rewritten as the sum of three terms

$$X' = \left| A x_1 + B x_2 + C x_3 \right|_{2^{2n} - 1}, \tag{11}$$

where

$$A = -2^n, \qquad B = -2^{2n-1} + 2^{n-1}, \qquad C = 2^{2n-1} + 2^{n-1}.$$

No logic gates are required to form SD representations of $A x_1$, $B x_2$ and $C x_3$. Using the rules for multiplication by powers of two from Eq. 5, we have

$$\begin{aligned}
A x_1 &= [\bar{x}_{1,n-1} \ldots \bar{x}_{1,0} \; 0^n]_{SD}, \\
B x_2 &= [\bar{x}_{2,0} \; x_{2,n-1} \ldots x_{2,0} \; \bar{x}_{2,n-1} \ldots \bar{x}_{2,1}]_{SD}, \\
C x_3 &= [x_{3,0} \; x_{3,n-1} \ldots x_{3,0} \; x_{3,n-1} \ldots x_{3,1}]_{SD}.
\end{aligned}$$

The result $X$ is the concatenation of $X'$ and $x_1$. Two SD modulo adders are required to generate $X'$ according to Eq. 11. To make sure that the result is in the positive range, $M = [1^{2n} 0^n]_{SD}$ is added to $X$ when $X$ is negative. Note that this has no effect on the lower part of $X$.
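The New CRT-I of Eq. 9 can be exercised with a short Python model. This is a sketch, assuming 0-based Python lists for the moduli and residues and the built-in three-argument `pow` (Python 3.8 or later) for the modular inverses; the function name `new_crt1` is made up for this example. It reconstructs integers for the reordered S2 set and confirms that both inverses are powers of two.

```python
from math import prod

def new_crt1(residues, moduli):
    """New CRT-I reconstruction (Eq. 9): X = x_1 + m_1 * X'."""
    L = len(moduli)
    tail = prod(moduli[1:])                  # m_2 m_3 ... m_L
    acc = 0
    for i in range(1, L):
        lead = prod(moduli[:i])              # m_1 ... m_i
        rest = prod(moduli[i:])              # m_{i+1} ... m_L
        k_i = pow(lead, -1, rest)            # |k_i m_1 ... m_i| = 1 modulo rest
        # Term k_i * m_2 ... m_i * (x_{i+1} - x_i); the empty product is 1.
        acc += k_i * prod(moduli[1:i]) * (residues[i] - residues[i - 1])
    return residues[0] + moduli[0] * (acc % tail)

n = 4
s2 = [2**n, 2**n + 1, 2**n - 1]              # reordered S2: m1, m2, m3
k1 = pow(s2[0], -1, s2[1] * s2[2])
k2 = pow(s2[0] * s2[1], -1, s2[2])
print(k1 == 2**n, k2 == 2**(n - 1))          # both inverses are powers of two

X = 1234
print(new_crt1([X % m for m in s2], s2))     # recovers 1234
```

Because the outer sum is multiplied by m_1 = 2^n, the low n bits of X come directly from x_1, which is why the hardware forms X by concatenation.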
Figure 3 shows the hardware architecture of the RNS/SD reverse converter for moduli set $S_2$.

Figure 3 RNS/SD decoder for moduli set S2.

5.3 RNS/SD Decoder for Moduli Set S3

The four-moduli set $S_3 = \{2^n - 1, 2^n, 2^n + 1, 2^{2n} + 1\}$ is an extension of the popular $S_2$ moduli set, and has been suggested as a way to increase the dynamic range of the RNS. The resulting RNS decoder is as efficient as the $S_2$ decoder, while the dynamic range is increased from $3n - 1$ bits to $5n - 1$ bits. An adder-based binary-residue decoder for the $S_3$ set is presented in [16]. By applying a new moduli reordering scheme and by exploiting the properties of the SD number system, the number of terms which need to be added has been reduced from six for the decoder from [16] to four for the decoder proposed in this section.

For an RNS with four moduli, the New CRT-I procedure from Eq. 9 is reduced to

$$\begin{aligned}
X &= x_1 + m_1 X', \\
X' &= \left| k_1 (x_2 - x_1) + k_2 m_2 (x_3 - x_2) + k_3 m_2 m_3 (x_4 - x_3) \right|_{m_2 m_3 m_4}.
\end{aligned} \tag{12}$$

The elements of $S_3$ are rearranged such that $m_1 = 2^n$, $m_2 = 2^{2n} + 1$, $m_3 = 2^n + 1$ and $m_4 = 2^n - 1$. Using this ordering, $k_1$, $k_2$ and $k_3$ in Eq. 12 are powers of two.

Claim: $k_1 = 2^{3n}$

Proof:
$$\left| k_1 m_1 \right|_{m_2 m_3 m_4} = \left| 2^{3n} \cdot 2^n \right|_{2^{4n} - 1} = \left| 2^{4n} \right|_{2^{4n} - 1} = 1$$

Claim: $k_2 = 2^{n-1}$

Proof:
$$\left| k_2 m_1 m_2 \right|_{m_3 m_4} = \left| 2^{n-1} \cdot 2^n (2^{2n} + 1) \right|_{(2^n - 1)(2^n + 1)} = \left| 2^{n-1} \cdot 2^n \cdot 2 \right|_{2^{2n} - 1} = \left| 2^{2n} \right|_{2^{2n} - 1} = 1$$

Claim: $k_3 = 2^{n-2}$

Proof:
$$\left| k_3 m_1 m_2 m_3 \right|_{m_4} = \left| 2^{n-2} \cdot 2^n (2^{2n} + 1)(2^n + 1) \right|_{2^n - 1} = \left| 2^{n-2} \cdot 2 \cdot 2 \right|_{2^n - 1} = \left| 2^n \right|_{2^n - 1} = 1$$

Inserting the expressions for $m_1 \ldots m_4$ and $k_1 \ldots k_3$ into Eq. 12 yields

$$\begin{aligned}
X &= x_1 + 2^n X', \\
X' &= \left| 2^{3n} (x_2 - x_1) + 2^{n-1} (2^{2n} + 1)(x_3 - x_2) + 2^{n-2} (2^{2n} + 1)(2^n + 1)(x_4 - x_3) \right|_{2^{4n} - 1}.
\end{aligned} \tag{13}$$

By expanding all terms in the expression for $X'$ of Eq. 13 and grouping the coefficients of each residue $x_1, \ldots, x_4$, we find that $X'$ can be rewritten as

$$X' = \left| A x_1 + B x_2 + C x_3 + D x_4 \right|_{2^{4n} - 1}, \tag{14}$$

where

$$\begin{aligned}
A &= -2^{3n}, \\
B &= 2^{3n-1} - 2^{n-1}, \\
C &= -2^{4n-2} + 2^{3n-2} - 2^{2n-2} + 2^{n-2}, \\
D &= 2^{4n-2} + 2^{3n-2} + 2^{2n-2} + 2^{n-2}.
\end{aligned}$$

Studying $A$, $B$, $C$ and $D$, we find that the distance between two consecutive non-zero digits in the SD representation of each term is equal to the word length of the corresponding residue ($n$ bits for $A$, $C$, $D$ and $2n$ bits for $B$). Consequently, no logic gates are required to form $A x_1$, $B x_2$, $C x_3$ and $D x_4$. Again, we use the rules for multiplication by powers of two from Eq. 5 to form terms using concatenation, rotation and negation.

$$\begin{aligned}
A x_1 &= [\bar{x}_{1,n-1} \ldots \bar{x}_{1,0} \; 0^{3n}]_{SD} \\
B x_2 &= [x_{2,n} \ldots x_{2,0} \; \bar{x}_{2,2n-1} \ldots \bar{x}_{2,0} \; x_{2,2n-1} \ldots x_{2,n+1}]_{SD} \\
C x_3 &= [\bar{x}_{3,1} \, \bar{x}_{3,0} \; x_{3,n-1} \ldots x_{3,0} \; \bar{x}_{3,n-1} \ldots \bar{x}_{3,0} \; x_{3,n-1} \ldots x_{3,0} \; \bar{x}_{3,n-1} \ldots \bar{x}_{3,2}]_{SD} \\
D x_4 &= [x_{4,1} \, x_{4,0} \; x_{4,n-1} \ldots x_{4,0} \; x_{4,n-1} \ldots x_{4,0} \; x_{4,n-1} \ldots x_{4,0} \; x_{4,n-1} \ldots x_{4,2}]_{SD}
\end{aligned}$$

Three modulo $2^{4n} - 1$ SD adders are required to generate $X'$. The result $X$ is the concatenation of $X'$ and $x_1$. As for the decoder for moduli set $S_2$, range correction is carried out by adding $M = [1^{4n} 0^n]_{SD}$ to $X$ when $X$ is negative. Figure 4 depicts the hardware architecture of the RNS/SD decoder.

5.4 RNS/SD Decoder for Moduli Sets S4 and S5

The set $S_4 = \{2^{n-1}, 2^{n-1} - 1, 2^n - 1, 2^n + 1\}$ is a balanced moduli set, well suited for large dynamic ranges. However, the elements of $S_4$ are pairwise prime for even values of $n$ only. This might be a disadvantage when tailoring the set for a given dynamic range. To overcome this problem, the set $S_5 = \{2^n, 2^n - 1, 2^{n-1} - 1, 2^{n-1} + 1\}$ will be used as a complement to $S_4$ for odd values of $n$. $S_5$ has a form similar to that of $S_4$; only the exponents differ. The elements of $S_5$ are pairwise prime for odd values of $n$ only. The two sets can be expressed in a common form as

$$\{m_1, m_2, m_3, m_4\} = \{2^a, 2^a - 1, 2^b - 1, 2^b + 1\},$$

where $a = n - 1$, $b = n$ for $S_4$ and $a = n$, $b = n - 1$ for $S_5$.

The proposed decoders for RNS/SD number systems using these moduli are SD implementations of a two-level approach to RNS decoding, detailed in [17]. On the first level, the moduli set is decomposed into two subsets, $\{m_1, m_2\}$ and $\{m_3, m_4\}$. The corresponding residue subsets ($\{x_1, x_2\}$ for $\{m_1, m_2\}$ and $\{x_3, x_4\}$ for $\{m_3, m_4\}$) are decoded using two reverse converters operating in parallel. The second level is a decoder for an RNS with moduli set $\{m_1 m_2, m_3 m_4\}$, where the residues $X_1$ (modulo $m_1 m_2$) and $X_2$ (modulo $m_3 m_4$) are the results from the first conversion step.
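The two-level structure described above can be modelled as three applications of a two-moduli decoding step. The following Python sketch is an illustration under assumptions: it uses the same two-moduli New CRT-I step on both levels for brevity (whereas the paper uses the CRT decoder of Section 5.1 for the subset {m3, m4}), and the function names are invented here.

```python
def decode_pair(xa, ma, xb, mb):
    """Two-moduli New CRT-I step: X = x_a + m_a * |k (x_b - x_a)|_{m_b}."""
    k = pow(ma, -1, mb)                 # multiplicative inverse of m_a modulo m_b
    return xa + ma * ((k * (xb - xa)) % mb)

def decode_two_level(residues, moduli):
    """Decode four residues via the subsets {m1, m2} and {m3, m4}."""
    x1, x2, x3, x4 = residues
    m1, m2, m3, m4 = moduli
    X1 = decode_pair(x1, m1, x2, m2)                 # first level, value mod m1*m2
    X2 = decode_pair(x3, m3, x4, m4)                 # first level, value mod m3*m4
    return decode_pair(X1, m1 * m2, X2, m3 * m4)     # second level

# S4 for n = 4 in the common form {2^a, 2^a - 1, 2^b - 1, 2^b + 1} with a = 3, b = 4.
s4 = (8, 7, 15, 17)
X = 9999
print(decode_two_level([X % m for m in s4], s4))     # recovers 9999
```

The second-level inverse is the only constant here without a closed form, which is the reason an adder/scaler with a precalculated constant is needed on that level.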
Figure 4 RNS/SD decoder for moduli set S3.

The first-level converter for moduli subset $\{m_1, m_2\} = \{2^a, 2^a - 1\}$ is a variant of the New CRT-I decoders presented in Sections 5.2 and 5.3. The required multiplicative inverse has the value 1. The proof of this is trivial, since $|2^a|_{2^a - 1} = 1$. Inserting the expressions for $m_1$ and $m_2$ into Eq. 9 yields

$$X_1 = x_1 + 2^a X_1', \qquad X_1' = \left| x_2 - x_1 \right|_{2^a - 1}. \tag{15}$$

The other decoder on the first level is the CRT decoder from Section 5.1, where

$$X_2 = \left| \left( 2^{2b-1} + 2^{b-1} \right) x_3 + \left( 2^{2b-1} - 2^{b-1} \right) x_4 \right|_{2^{2b} - 1}. \tag{16}$$

One modulo $2^a - 1$ SD adder is needed to compute $X_1'$ in Eq. 15. $X_1$ is the concatenation of $X_1'$ and $x_1$. The computation of $X_2$ according to Eq. 16 requires one modulo $2^{2b} - 1$ SD adder.

On the second level, two residues are decoded with respect to the moduli set $\{2^a (2^a - 1), 2^{2b} - 1\}$. Equation 9 is reduced to

$$X = X_1 + 2^a \left( 2^a - 1 \right) X', \qquad X' = \left| k (X_2 - X_1) \right|_{2^{2b} - 1}. \tag{17}$$

Equation 17 differs from the applications of the New CRT-I seen so far. The multiplicative inverse $k$ does not have a closed-form expression. A modulo $2^{2b} - 1$ adder/scaler is required to compute $X' = |k (X_2 - X_1)|_{2^{2b} - 1}$ for a precalculated value of $k$. The final result $X$ is computed as the regular (not modulo) sum of two terms, $[X' \, X_1]$ and $-2^a X'$, where $[X' \, X_1]$ is the concatenation of $X'$ and $X_1$. Range correction is carried out by adding $M = m_1 m_2 m_3 m_4$ to negative values of $X$ using simplified SD adder cells before $X$ is converted to binary form. The complete decoder is depicted in Fig. 5.

Figure 5 RNS/SD decoder for moduli sets S4 and S5.

6 RNS/SD Decoders for General Moduli Sets

A reverse conversion technique for RNS/SD number systems using general moduli sets is presented. The only constraint given for the moduli set is that one of the elements, say $m_1$, should be a power of two. The conversion technique presented here is inspired by the work of Wang et al. [18]. Although the general sets studied in this work consist of low-cost moduli exclusively, the technique is applicable to all coprime moduli sets with one element of the form $2^n$.

Figure 6 RNS/SD decoder for general moduli sets.

In [18], Wang et al. propose a new formulation of the Chinese Remainder Theorem. For an RNS with moduli set $\{m_1, \ldots, m_L\}$ and residues $\{x_1, \ldots, x_L\}$, the value of $X$ is

$$X = x_1 + m_1 \left| X' \right|_{m_2 m_3 \cdots m_L}, \qquad X' = \sum_{i=1}^{L} k_i x_i, \tag{18}$$

where

$$k_1 = \frac{\hat{M}_1 \left| \hat{M}_1^{-1} \right|_{m_1} - 1}{m_1}, \qquad k_i = \frac{\hat{M}_i \left| \hat{M}_i^{-1} \right|_{m_i}}{m_1} \quad \text{for } i = 2, 3, \ldots, L.$$

$\hat{M}_i$ and $|\hat{M}_i^{-1}|_{m_i}$ are from the original formulation of the CRT in Eq. 1, that is

$$M = \prod_{i=1}^{L} m_i, \qquad \hat{M}_i = \frac{M}{m_i}, \qquad \left| \hat{M}_i \hat{M}_i^{-1} \right|_{m_i} = 1.$$

MSD(k):
  if (k = 0):
    return {}
  else:
    find e, such that 2^e ≤ k < 2^(e+1)
    if (3k > 2^(e+2)):
      return {2^(e+1), −MSD(2^(e+1) − k)}
    else:
      return {2^e, MSD(k − 2^e)}

Here −MSD(·) denotes the returned set with every element negated.

Figure 7 Algorithm for finding a minimal signed-digit representation of an integer k.
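The recoding algorithm of Fig. 7 translates almost line by line into Python. The sketch below is one reading of the figure: the comparison direction and the negation of the recursive result are chosen so that MSD(383) yields {512, −128, −1}, the example quoted in Section 6.1. The helper `times_constant` is an addition of this sketch that shows how each returned term becomes one shifted, possibly negated partial product.

```python
def msd(k):
    """Minimal signed-digit recoding of k >= 0 as a list of signed powers of two."""
    if k == 0:
        return []
    e = k.bit_length() - 1                  # 2^e <= k < 2^(e+1)
    if 3 * k > (1 << (e + 2)):              # k lies closer to 2^(e+1): overshoot
        return [1 << (e + 1)] + [-t for t in msd((1 << (e + 1)) - k)]
    return [1 << e] + msd(k - (1 << e))

def times_constant(x, k):
    """Multiply x by the constant k using only shifts, negations and additions."""
    total = 0
    for term in msd(k):
        shift = abs(term).bit_length() - 1
        total += (x << shift) if term > 0 else -(x << shift)
    return total

print(msd(383))                 # signed powers of two summing to 383
print(times_constant(5, 383))   # equals 5 * 383
```

Every term in the returned list corresponds to one non-zero SD digit, so `len(msd(k))` is the number of partial products generated for the constant k.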

The proposed converter has two parts, a multiplication-accumulation (MA) array and a modulo reduction unit. The MA array is used to generate $X'$. The factors $k_1, k_2, \ldots, k_L$ are constants and are calculated a priori. The modulo operation of Eq. 18 and the final range correction are carried out by the modulo reduction unit. As described in earlier sections, $X$ of Eq. 18 can be formed using concatenation if the modulus $m_1$ is chosen to be a power of two. The hardware architecture of the general converter is depicted in Fig. 6. Implementations of the MA array and the modulo reduction unit are detailed in Sections 6.1 and 6.2.

6.1 The SD Multiplication-Accumulation Array

The task of the MA array is to compute $\sum_{i=1}^{L} k_i x_i$, where $x_1, x_2, \ldots, x_L$ are variables and $k_1, k_2, \ldots, k_L$ are integer constants. Wang et al. present an implementation of an MA array for variables in binary form. The MA architecture outlined in [18] uses the Modified Booth recoding algorithm to form partial products, which are added using a Wallace tree adder.

The partial product generation of the MA array proposed here relies on a minimal signed-digit recoding scheme. The algorithm MSD(k) is used to find SD representations for the constant factors $k_1, k_2, \ldots, k_L$. In [19], it is proven that the algorithm given in Fig. 7 results in representations of minimal Hamming weight, that is, with a minimum number of non-zero digits. The resulting SD representation has no two adjacent non-zero digits. Thus, for an integer $k$, the number of non-zero digits grows no faster than about half of $\log_2 k$. For example, MSD(383) returns $\{512, -128, -1\}$, which corresponds to an SD representation of $[10\bar{1}000000\bar{1}]_{SD}$. Each non-zero digit in the minimal SD representations results in a partial product. The partial products are formed using shift and negation operations. For example, $383x$ is computed as $(x \ll 9) - (x \ll 7) - x$.

The operation of the MA array is best explained using an example. Consider the five-moduli set $S = \{16, 17, 9, 7, 5\}$ with a dynamic range of 16 bits. The constants $k_1, \ldots, k_5$ and the corresponding minimal SD representations are precalculated.

$$\begin{aligned}
k_1 &= 1{,}004 = [100000\bar{1}0\bar{1}00]_{SD}, \\
k_2 &= 4{,}725 = [10010100\bar{1}0101]_{SD}, \\
k_3 &= 2{,}380 = [100101010\bar{1}00]_{SD}, \\
k_4 &= 1{,}530 = [10\bar{1}00000\bar{1}010]_{SD}, \\
k_5 &= 1{,}071 = [100010\bar{1}000\bar{1}]_{SD}.
\end{aligned}$$

The SD representations of $k_1, \ldots, k_5$ contain a total of 22 non-zero digits. Consequently, 22 partial products need to be added. Figure 8 shows the resulting partial product array, where each row represents a partial product. The array contains a large number of constant zero operands, depicted by dashes in Fig. 8. The zero operands will not affect the result and can be eliminated by compression of the partial product array. As many zero operands as possible are removed, while the weights of non-constant operands are preserved. Figure 9 shows the compressed partial product array for the given example.

Figure 8 Partial product array.

Figure 9 Compressed partial product array.

As seen in Fig. 9, the computation of $X' = 1{,}004 x_1 + 4{,}725 x_2 + 2{,}380 x_3 + 1{,}530 x_4 + 1{,}071 x_5$ is achieved by adding eight terms, each 16 bits wide. Because of the carry-free properties of SD adders, there is no need to employ a complicated adder tree structure (Wallace, Dadda etc.). A binary tree of SD adders is used. Since some of the compressed partial products contain constant zeros, simplified SD adder cells are used where possible. For the example case of $S = \{16, 17, 9, 7, 5\}$, the adder tree has three levels. Thus, the total delay of the multiplication-accumulation unit is approximately three times the delay of an SD full adder cell.

6.2 The SD Modulo Reduction Unit

Figure 10 Modulo reduction unit.

The modulo reduction unit computes $|X'|_M$, where $X'$ is the result from the MA step and $M$ is the moduli product with $m_1$ excluded, that is $M = m_2 m_3 \cdots m_L$. Since no modulo reduction is performed in the MA stage, the word length of $X'$ is greater than the word length of $M$. Let $n$ be the digit length of $X'$ and let $a = \lceil \log_2 M \rceil$.
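The constants of the worked example can be reproduced directly from Eq. 18. The Python sketch below computes k_1, ..., k_L for the set {16, 17, 9, 7, 5} and uses them for a software-only reverse conversion; the function names are made up for this illustration, the first list element is assumed to play the role of m_1, and the modular inverses rely on the three-argument `pow` of Python 3.8 or later.

```python
from math import prod

def wang_constants(moduli):
    """Constants k_1..k_L of Eq. 18; moduli[0] plays the role of m_1."""
    M = prod(moduli)
    m1 = moduli[0]
    ks = []
    for i, m in enumerate(moduli):
        M_hat = M // m
        t = M_hat * pow(M_hat, -1, m)        # M_hat_i * |M_hat_i^-1|_{m_i}
        # k_1 subtracts 1 before the division by m_1; both divisions are exact.
        ks.append((t - 1) // m1 if i == 0 else t // m1)
    return ks

def decode(residues, moduli):
    """Reverse conversion following Eq. 18 (software model of the converter)."""
    ks = wang_constants(moduli)
    x_prime = sum(k * x for k, x in zip(ks, residues))    # output of the MA array
    return residues[0] + moduli[0] * (x_prime % prod(moduli[1:]))

moduli = [16, 17, 9, 7, 5]
print(wang_constants(moduli))                 # constants of the worked example
X = 54321
print(decode([X % m for m in moduli], moduli))
```

In this model the multiplication-accumulation and the modulo reduction are two separate steps, mirroring the split between the MA array and the modulo reduction unit.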
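Two SD vectors are created from $X$:

$$X = 2^{a-1} X_{\mathrm{high}} + X_{\mathrm{low}}, \qquad X_{\mathrm{high}} = [X_{n-1} \ldots X_{a-1}]_{SD}, \qquad X_{\mathrm{low}} = [X_{a-2} \ldots X_0]_{SD}.$$

Since $X_{\mathrm{low}}$ has digit-length $a - 1$, we know for sure that $-M < X_{\mathrm{low}} < M$. A ROM look-up table is used to generate $X_{\mathrm{LUT}} = |2^{a-1} X_{\mathrm{high}}|_M$. It is not practical to use redundant signed-digit numbers for ROM addressing. Instead, $X_{\mathrm{high}}$ is decomposed into its binary components $X^{+}_{\mathrm{high}}$ and $X^{-}_{\mathrm{high}}$, which are unsigned binary numbers. Two ROM look-up tables are used to find $X_{\mathrm{LUT}}$:

$$X^{+}_{\mathrm{LUT}} = |2^{a-1} X^{+}_{\mathrm{high}}|_M, \qquad X^{-}_{\mathrm{LUT}} = |2^{a-1} X^{-}_{\mathrm{high}}|_M, \qquad X_{\mathrm{LUT}} = X^{+}_{\mathrm{LUT}} - X^{-}_{\mathrm{LUT}}.$$

Figure 11 Transposed form FIR filter.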
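The multiply-accumulate step and the split-and-look-up reduction can be modelled in software for the example set $S = \{16, 17, 9, 7, 5\}$. The following Python sketch rests on several assumptions that are not taken from the text: all function names are hypothetical, the non-adjacent form is used as one convenient SD representation of $X$, the final correction into $[0, M)$ is a plain range test over the offsets $0$, $-M$, $+M$ and $+2M$, and the converter output is assumed to be assembled as $x_1 + m_1 |X|_M$. That last assumption is consistent with the constants $k_1, \ldots, k_5$ but is not restated in this section.

```python
from math import prod

MODULI = [16, 17, 9, 7, 5]              # example set S from the text
K = [1004, 4725, 2380, 1530, 1071]      # constants k_1..k_5 from the text
M = prod(MODULI[1:])                    # m_2 * ... * m_L = 5355
A = (M - 1).bit_length()                # a = ceil(log2 M) = 13


def sd_digits(value):
    """Signed digits (-1, 0, +1) of value >= 0, least significant first."""
    digits = []
    while value:
        d = 2 - (value % 4) if value & 1 else 0
        digits.append(d)
        value = (value - d) >> 1
    return digits


def multiply_accumulate(residues):
    """MA step: X = k_1*x_1 + ... + k_5*x_5, with no modulo reduction."""
    return sum(k * x for k, x in zip(K, residues))


def lut(address):
    """ROM contents: |2^(a-1) * address|_M for an unsigned binary address."""
    return (address << (A - 1)) % M


def reduce_mod_m(x):
    """Software model of the reduction of X to the range [0, M)."""
    digits = sd_digits(x)
    low, high = digits[:A - 1], digits[A - 1:]
    x_low = sum(d * (1 << i) for i, d in enumerate(low))            # -M < x_low < M
    x_high_plus = sum(1 << i for i, d in enumerate(high) if d == 1)
    x_high_minus = sum(1 << i for i, d in enumerate(high) if d == -1)
    x_lut = lut(x_high_plus) - lut(x_high_minus)                    # -M < x_lut < M
    total = x_lut + x_low                                           # -2M < total < 2M
    for offset in (0, -M, M, 2 * M):
        if 0 <= total + offset < M:
            return total + offset
    raise ValueError("unreachable when -2M < total < 2M")


def reverse_convert(residues):
    """Assumed final assembly (not from the text): x_1 + m_1 * |X|_M."""
    return residues[0] + MODULI[0] * reduce_mod_m(multiply_accumulate(residues))


value = 12345
print(reverse_convert([value % m for m in MODULI]))   # 12345
```

The two calls to `lut` correspond to the two ROM accesses addressed by the unsigned components of the high part; in this model they share one function, mirroring the fact that both tables hold the same contents.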

The two look-up tables are identical, and a single two-port ROM, addressed by $n - a + 1$ bits, can be used for the look-up operations. $X^{+}_{\mathrm{LUT}}$ and $X^{-}_{\mathrm{LUT}}$ are unsigned binary numbers in the range $[0, M)$. The SD subtractor cell for unsigned binary operands consists of just two logic gates and the gate depth is one. The result, $X_{\mathrm{LUT}}$, is an $a$-digit SD number in the range $(-M, M)$.

The result of the modulo operation is the sum of $X_{\mathrm{LUT}}$ and $X_{\mathrm{low}}$. These two numbers are both in the range $(-M, M)$. Thus, their sum is in the range $(-2M, 2M)$. Four potential results are computed:

$$R_0 = X_{\mathrm{LUT}} + X_{\mathrm{low}}, \quad R_1 = X_{\mathrm{LUT}} + X_{\mathrm{low}} - M, \quad R_2 = X_{\mathrm{LUT}} + X_{\mathrm{low}} + M, \quad R_3 = X_{\mathrm{LUT}} + X_{\mathrm{low}} + 2M.$$

The constant terms $-M$, $M$ and $2M$ are added to $X_{\mathrm{low}}$ using simplified SD adders in parallel with the look-up operation. Exactly one of the potential results is in the desired range $[0, M)$. $R_0, \ldots, R_3$ are converted to binary form using four carry-look-ahead (CLA) adders operating in parallel. The correct result is selected by examining the carry-out bits of the CLA adders. The hardware architecture of the modulo reduction unit is depicted in Fig. 10.

7 RNS/SD FIR Filters

In order to evaluate the performance of RNS/SD processing using the presented moduli sets, RNS/SD finite impulse response (FIR) filters have been implemented as reference designs. The filter designs implement programmable $N$-tap FIR filters

$$y(n) = \sum_{k=1}^{N} a_k \, x(n-k)$$

realized in transposed form as shown in Fig. 11. The filter coefficients $a_1, \ldots, a_N$ are calculated a priori. For an RNS with moduli set $S = \{m_1, m_2, \ldots, m_L\}$, the FIR filter is decomposed into $L$ subfilters operating in parallel, each subfilter using modulo $m_i$ arithmetic. Each filter tap consists of a modulo adder, a modulo multiplier and a register. Figure 12 shows an SD filter tap and Fig. 13 depicts an RNS/SD FIR filter, complete with forward and reverse RNS/SD converters. Implementation results are presented in Section 8.

Figure 12 Mod $m_i$ FIR filter tap.

Figure 13 RNS/SD FIR filter.

8 VLSI Implementation Results

The presented designs have been coded in structural-level VHDL and mapped to standard cells using Synopsys Design Compiler and a UMC 0.13 μm CMOS cell library with eight metal layers and a core voltage of 1.2 V. The VHDL designs were compiled for typical operating and wire load conditions and synthesised for four different equivalent (binary) word lengths (16, 24, 32 and 40 bits). For the parameterized moduli sets, the parameter $n$ was chosen such that the resulting moduli product was as small as possible, but at least equal to the desired dynamic range. General moduli sets of length five and six have also been evaluated. The general sets were chosen according to the criteria for effective moduli sets given in Section 3. The moduli sets used for VLSI implementation are presented in Table 2. Note that no six-moduli set has been selected for the 16-bit dynamic range. It is not possible to form a set of six coprime low-cost moduli with a moduli product as small as $2^{16}$.
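The decomposition of the FIR filter into $L$ subfilters, each using modulo $m_i$ arithmetic, can be illustrated with a small behavioural model. The sketch below is not the authors' hardware: it uses ordinary Python integers instead of SD digits, a textbook Chinese remainder theorem routine in place of the reverse converters, and non-negative coefficients and samples whose outputs stay below the dynamic range of the example set. All function names are hypothetical.

```python
from math import prod

MODULI = [16, 17, 9, 7, 5]   # example set from the text; moduli product 85,680


def fir_direct(coeffs, samples):
    """Reference: y(n) = sum_{k=1..N} a_k * x(n - k), with x(n) = 0 for n < 0."""
    return [sum(a * samples[n - k]
                for k, a in enumerate(coeffs, start=1) if n - k >= 0)
            for n in range(len(samples))]


def fir_subfilter(coeffs, samples, m):
    """Transposed-form FIR subfilter in which every operation is taken modulo m."""
    a = [c % m for c in coeffs]
    regs = [0] * len(a)                      # one register per filter tap
    out = []
    for x in samples:
        out.append(regs[0])                  # y(n) depends only on earlier samples
        xr = x % m                           # forward conversion of the sample
        shifted = regs[1:] + [0]             # value arriving from the next tap
        # Each tap: modulo multiplier, modulo adder, register.
        regs = [(ak * xr + s) % m for ak, s in zip(a, shifted)]
    return out


def crt(residues, moduli):
    """Textbook CRT, standing in for the reverse converter."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        mi = big_m // m
        total += r * mi * pow(mi, -1, m)     # modular inverse, Python 3.8+
    return total % big_m


def rns_fir(coeffs, samples, moduli=MODULI):
    """Run one subfilter per modulus, then recombine the outputs sample by sample."""
    channels = [fir_subfilter(coeffs, samples, m) for m in moduli]
    return [crt(rs, moduli) for rs in zip(*channels)]


coeffs = [3, 1, 4, 1, 5, 9, 2, 6]            # an 8-tap example
samples = list(range(1, 21))
print(rns_fir(coeffs, samples) == fir_direct(coeffs, samples))   # True
```

Because the subfilters never exchange data, their word lengths are set by the individual moduli rather than by the full dynamic range, which is the property the area comparisons in Section 8 rely on.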

Table 2 Moduli sets used for VLSI implementation.

Moduli set | 16 bits: n, values | 24 bits: n, values | 32 bits: n, values | 40 bits: n, values
S1 | 8, {255, 257} | 12, {4095, 4097} | 16, {65535, 65537} | 20, {2^20 - 1, 2^20 + 1}
S2 | 6, {63, 64, 65} | 8, {255, 256, 257} | 11, {2,047, 2,048, 2,049} | 14, {16,383, 16,384, 16,385}
S3 | 4, {15, 16, 17, 257} | 5, {31, 32, 33, 1,025} | 7, {127, 128, 129, 16,385} | 8, {255, 256, 257, 65,537}
S4/S5 | 5, {32, 31, 15, 17} | 7, {128, 127, 63, 65} | 9, {512, 511, 255, 257} | 11, {2,048, 2,047, 1,023, 1,025}
Five moduli | {16, 17, 9, 7, 5} | {64, 65, 31, 17, 7} | {128, 129, 127, 65, 17} | {512, 511, 257, 129, 65}
Six moduli | - | {32, 33, 31, 17, 7, 5} | {128, 127, 65, 31, 17, 7} | {256, 257, 129, 127, 31, 17}

Table 3 Performance evaluation for RNS/SD encoders.

Moduli set | Delay [ns]: 16 bits, 24 bits, 32 bits, 40 bits | Area [mm^2]: 16 bits, 24 bits, 32 bits, 40 bits
S ,24 0,24 0,
S
S 3 1,34 1,34 1,34 1,
S 4 /S 5 1,35 1,35 1,35 1,
Five moduli
Six moduli

Table 4 Performance evaluation of RNS/SD decoders.

Moduli set | Delay [ns]: 16 bits, 24 bits, 32 bits, 40 bits | Area [mm^2]: 16 bits, 24 bits, 32 bits, 40 bits
S 1 3, ,65 6,
S
S 3 4,50 5, ,
S 4 /S
Five moduli
Six moduli 8,

Table 5 Performance evaluation of 8-tap RNS/SD FIR filters.

Moduli set | Delay [ns]: 16 bits, 24 bits, 32 bits, 40 bits | Area [mm^2]: 16 bits, 24 bits, 32 bits, 40 bits | Area × Delay: 16 bits, 24 bits, 32 bits, 40 bits
S 1 4,17 4,
S 2 3,86 4,11 4,82 4,
S 3 4,16 4,66 4,81 4,
S 4 /S
Five moduli ,89 4,
Six moduli 3,65 4,
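The two basic properties of the sets in Table 2, pairwise coprimality and a moduli product matching the nominal dynamic range, can be checked mechanically. The Python sketch below copies the moduli from Table 2; the 40-bit S1 entry is taken to be $\{2^{20} - 1, 2^{20} + 1\}$, following the pattern of the other S1 entries, and the helper names are hypothetical. The bit length of the moduli product is used as a loose measure of dynamic range, since some products fall just short of a power of two (for example, $255 \cdot 257 = 2^{16} - 1$).

```python
from itertools import combinations
from math import gcd, prod

# Moduli sets copied from Table 2, keyed by nominal dynamic range in bits.
TABLE_2 = {
    16: {"S1": [255, 257], "S2": [63, 64, 65], "S3": [15, 16, 17, 257],
         "S4/S5": [32, 31, 15, 17], "five": [16, 17, 9, 7, 5]},
    24: {"S1": [4095, 4097], "S2": [255, 256, 257], "S3": [31, 32, 33, 1025],
         "S4/S5": [128, 127, 63, 65], "five": [64, 65, 31, 17, 7],
         "six": [32, 33, 31, 17, 7, 5]},
    32: {"S1": [65535, 65537], "S2": [2047, 2048, 2049],
         "S3": [127, 128, 129, 16385], "S4/S5": [512, 511, 255, 257],
         "five": [128, 129, 127, 65, 17], "six": [128, 127, 65, 31, 17, 7]},
    40: {"S1": [2**20 - 1, 2**20 + 1],        # assumed from the S1 pattern
         "S2": [16383, 16384, 16385], "S3": [255, 256, 257, 65537],
         "S4/S5": [2048, 2047, 1023, 1025], "five": [512, 511, 257, 129, 65],
         "six": [256, 257, 129, 127, 31, 17]},
}


def pairwise_coprime(moduli):
    """True if every pair of moduli has greatest common divisor 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


def product_bits(moduli):
    """Bit length of the moduli product."""
    return prod(moduli).bit_length()


for bits, sets in TABLE_2.items():
    for name, moduli in sets.items():
        print(bits, name, pairwise_coprime(moduli), product_bits(moduli))
```

Running the loop lists, for every set, whether it is pairwise coprime and how many bits its moduli product occupies, which makes it easy to see how far each parameterized set overshoots the requested range.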

8.1 RNS/SD Encoders

Table 3 shows VLSI implementation results for RNS/SD encoders using the moduli sets from Table 2. For the parameterized moduli sets, the circuit delay is not affected by the value of the parameter n. The RNS/SD encoders for general moduli sets, on the other hand, consist of different adder-tree structures for different dynamic ranges. Thus, the circuit delay is not constant. The circuit area grows linearly with increased dynamic range for all encoders.

8.2 RNS/SD Decoders

Table 4 shows VLSI implementation results for the proposed RNS/SD decoders. As seen in Table 4, the decoder for moduli set S 1 has the smallest area and, for the 16-bit dynamic range, also the shortest circuit delay. For the larger dynamic ranges, the decoder for moduli set S 2 has the shortest delay. The decoders for sets S 4 and S 5 have considerably longer delay and larger area, due to the constant multipliers in the second stage of the converters.

8.3 RNS/SD FIR Filters

Implementation results for 8-tap FIR filters are presented in Table 5. When implementing the FIR filters, pipeline stages were added in the RNS/SD converters to maintain a clock cycle that is determined by the critical path of the filter taps. The RNS/SD forward conversions introduce an additional latency of one clock cycle. The reverse conversions introduce an additional latency of two clock cycles for filters using moduli sets S 1,...,S 3 and three clock cycles for filters with moduli sets S 4 and S 5. The reverse conversions for general moduli sets introduce a latency of three clock cycles. For the case of 8-tap FIR filters we see that generic moduli sets with five or six moduli result in designs with the best area-delay products. For even longer filters, the impact of the forward and backward converters on the total circuit area will decrease. This will further favor the longer moduli sets.

9 Conclusions

This work has presented new forward and reverse converters for signed-digit residue number systems. Since the performance of RNS/SD arithmetic circuits depends on the choice of the moduli set (a set of pairwise prime numbers), the purpose of this work has been to compare RNS/SD number systems based on different sets. Four moduli-set-specific conversion techniques are proposed. A conversion technique for general moduli sets consisting of any number of coprime moduli has also been presented. Finite impulse response (FIR) filters have been used in order to evaluate the performance of RNS/SD processing using the proposed moduli sets. All designs have been implemented in a commercially available 0.13 μm CMOS process. The designs have been compared with respect to delay, area and area-delay products. The implementation results show that the complexity of RNS/SD converters grows as the number of moduli is increased. However, if the designs are large enough, the increased complexity of the converters is outweighed by area savings in the RNS/SD processing units. For the case of FIR filters it is shown that generic moduli sets with five or six moduli result in designs with the best area-delay products.

References

1. Lindström, A., Nordseth, M., Bengtsson, L., & Omondi, A. (2004). Arithmetic circuits combining residue and signed-digit representations. In Lecture notes in computer science (LNCS) (Vol. 2823, pp ). Springer.
2. Szabó, N. S., & Tanaka, R. I. (1967). Residue arithmetic and its applications to computer technology. McGraw-Hill (December).
3. Soderstrand, M., & Jenkins, W. (1986). Residue number system arithmetic: Modern applications in digital signal processing. IEEE Press.
4. Wang, W., Swamy, M., & Ahmad, M. (2003). RNS application in digital image processing. In Proceedings of the 3rd IEEE international workshop on system-on-chip for real-time applications (pp ) (July).
5. Avizienis, A. (1961). Signed-digit number representation for fast parallel arithmetic.
IRE Transactions on Electronic Computers, EC-10,
6. Parhami, B. (1988). Carry-free addition of recoded binary signed-digit numbers. IEEE Transactions on Computers, 37(11), (November).
7. Wei, S., & Shimizu, K. (2000). A novel residue arithmetic hardware algorithm using a signed-digit number representation. IEICE Transactions on Information and Systems, E83-D(12), (December).
8. Wei, S., & Shimizu, K. (2001). Fast residue arithmetic multipliers based on a signed-digit number system. In Proceedings of the 8th IEEE international conference on electronics, circuits and systems (Vol. 1, pp ) (September).

9. Wei, S., & Shimizu, K. (2002). Residue signed-digit arithmetic circuit with a complement of modulus and the application to RSA encryption processor. In Proceedings of the 9th IEEE international conference on electronics, circuits and systems (Vol. 2, pp. 591-594) (September).
10. Lindahl, A., & Bengtsson, L. (2005). A low-power FIR filter using combined residue and radix-2 signed-digit representation. In Proceedings of the 8th EUROMICRO conference on digital system design (DSD 05) (pp ). Porto, Portugal: IEEE Computer Society Press (August-September).
11. Abdallah, M., & Skavantzos, A. (1995). A systematic approach for selecting practical moduli sets for residue number systems. In Proceedings of the 27th IEEE southeastern symposium on system theory (pp ) (March).
12. Vinnakota, B., & Rao, V. B. (1994). Fast conversion techniques for binary-residue number systems. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, CAS-41(12), (December).
13. Wang, W., Swamy, M., Ahmad, M., & Wang, Y. (1999). The applications of the new Chinese remainder theorems for three moduli sets. In Proceedings of the 1999 IEEE Canadian conference on electrical and computer engineering (Vol. 1, pp ) (May).
14. Wang, Y., Song, X., Aboulhamid, M., & Shen, H. (2002). Adder based residue to binary number converters for ($2^n - 1$, $2^n$, $2^n + 1$). IEEE Transactions on Signal Processing, 50(7), (July).
15. Wang, Y. (2000). Residue-to-binary converters based on new Chinese remainder theorems. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 47(3), (March).
16. Cao, B., Chang, C., & Srikanthan, T. (2003). An efficient reverse converter for the 4-moduli set {$2^n - 1$, $2^n$, $2^n + 1$, $2^{2n} + 1$} based on the new Chinese remainder theorem. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 50(10), (October).
17. Skavantzos, A., & Stouraitis, T. (1999).
Grouped-moduli residue number systems for fast signal processing. In Proceedings of the 1999 IEEE international symposium on circuits and systems (ISCAS 99) (Vol. 3, pp ) (May).
18. Wang, W., Swamy, M., & Ahmad, M. (2000). An area-time efficient residue-to-binary converter. In Proceedings of the 43rd IEEE midwest symposium on circuits and systems (pp ) (August).
19. Shallit, J. (2005). A primer on balanced binary representations. Retrieved October 2005, from ca/ shallit/papers/bbr.pdf.

Andreas Persson obtained the M.Sc. degree from Chalmers University of Technology, Gothenburg, Sweden. He is now pursuing the Ph.D. degree at the Centre for Research on Embedded Systems (CERES) at Halmstad University, Sweden.

Lars Bengtsson obtained the M.Sc. and Ph.D. degrees from Chalmers University of Technology, Gothenburg, Sweden, in 1983 and 1997, respectively. After working in industry for some years as a hardware and software engineer, he was recruited as senior lecturer and later promoted to associate professor at Halmstad University, Sweden. He subsequently moved to Chalmers, where he was appointed associate professor. His research interests lie in the areas of embedded and networked processors, active RFID, and digital VLSI circuits.
