Semi-Custom VLSI Design and Implementation of a New Efficient RNS Division Algorithm

AHMAD A. HIASAT¹ AND HODA ABDEL-ATY-ZOHDY²

¹ Elect. Eng. Dept., Princess Sumaya University, PO Box 1438, Amman 11941, Jordan
² Elect. & Sys. Eng. Dept., Oakland University, Rochester, MI 48309, USA
Email: aahiasat@rss.gov.jo

In this paper we introduce a new algorithm for division in the residue number system, which can be applied to any moduli set. Simulation results indicate that the algorithm is faster than the most competitive published work. To further improve this speed, we customize the algorithm to serve two specific moduli sets: $(2^k, 2^k-1, 2^{k-1}-1)$ and $(2^k+1, 2^k, 2^k-1)$. The customization eliminates memory devices (ROMs), thus increasing the speed of operation. A semi-custom VLSI design of this algorithm for the moduli set $(2^k+1, 2^k, 2^k-1)$ has been implemented, fabricated and tested.

Received August 3, 1998; revised April 26, 1999.

1. INTRODUCTION

The residue number system (RNS) has the advantage of carry-free arithmetic operations. Thus, using residue arithmetic would in principle increase the computer processing speed. In particular, addition, subtraction and multiplication can be performed on each residue digit concurrently and independently. However, there are drawbacks associated with RNS. These include the difficulty of residue operations such as division, sign detection and overflow detection. Generally speaking, all reported algorithms on division in RNS [3, 4, 5, 6, 7, 8, 9, 10, 11] have the disadvantage of lengthy arithmetic operations, large execution time and complex hardware requirements. The complexity of these algorithms is due to mixed-radix conversion (MRC) and to performing difficult residue operations.

The moduli sets $(2^k, 2^k-1, 2^{k-1}-1)$ and $(2^k+1, 2^k, 2^k-1)$ are particularly important in applications which require a high degree of precision [12, 13, 14]. Some recursive digital filters require a high degree of precision in their computations in order to accurately control the frequency characteristics and to eliminate the occurrence of instabilities.
Moduli sets like $(2^k, 2^k-1, 2^{k-1}-1)$ and $(2^k+1, 2^k, 2^k-1)$ are among the very few systems that can deal with such critical situations [14]. The properties of these sets become more apparent in hardware considerations because most moduli are diminished or augmented powers of two. Residue addition for diminished powers of two is of the carry-add type, and multiplication by a power of two is equivalent to left rotation. Similarly, residue addition for augmented powers of two is of the carry-subtract type. Therefore, these sets can play an increased role in implementing an RNS arithmetic unit for computers.

(Part of this paper is based on [1]. Another part of this paper is based on [2].)

The advances in VLSI technology have suggested novel approaches to the implementation of arithmetic units over finite rings. RNS supports the main VLSI design properties and features, such as simple connections, concurrency and modularity. Independence of residue digits eliminates complex interconnection patterns among different logic components. This independence leads to concurrency, where an arithmetic operation can be carried out on all residue digits concurrently. The similarity in processing architecture for each modulus offers functional and layout modularity.

In this paper, we present a new division algorithm. The main idea is based on selecting an approximate quotient that is guaranteed to produce a non-negative remainder in every iteration before the division procedure is completed. The main features of this algorithm, as compared with others [3, 4, 5, 6, 7, 8, 9, 10, 11], are that no sign determination, overflow detection, scaling or MRC is needed. Moreover, there is no need for base extension or for auxiliary or redundant moduli. The algorithm speed does not depend on the number of moduli, but on the dynamic range. Nevertheless, this new algorithm is still based upon converting the residue representation of the dividend and the divisor to a weighted code in order to derive some information regarding the position of the most-significant non-zero bit contained in a residue number. The algorithm is then customized to serve the above-mentioned moduli sets.
This customization reduces the hardware and time requirements associated with converting the residue representation to a weighted code. The customized structure has been designed and implemented using VLSI design tools. The layout has then been fabricated and tested to verify the integrity and functionality of the design.

In order to evaluate the performance of this new design, it has to be compared with other RNS division algorithms. Chren's algorithm [5] has a slightly better performance than another algorithm [4] in terms of the mean number of residue operations needed for each division problem. Nevertheless, it requires MRC and residue scaling. If lookup tables are used, then MRC and scaling would require $(N-1)$ and $(2N-1)$ memory cycles respectively [12]. Chren's algorithm also requires a redundant modulus, which represents another drawback.

Gamberger [6] presented an algorithm which does not use MRC. The number of iterations for each division problem is proportional to the magnitude of the divisor. The mean number of iterations is, thus, very high. Moreover, the hardware implementation suggested by Gamberger is very complicated and expensive due to the utilization of an auxiliary RNS.

Hung and Parhami [11] introduced two RNS division algorithms based on the approximate-sign detection technique. The faster of the two [11] requires much more hardware than the slower one. Hung and Parhami [11] indicated that intermediate to these algorithms are a number of choices that offer speed/cost tradeoffs. Although the faster algorithm of Hung and Parhami [11] and the algorithm of Lu and Chiang [3] have the same time complexity, the latter has a better hardware complexity.

The most competitive work, introduced by Lu and Chiang [3], does not use MRC; however, it utilizes the fractional representation of $X/M$ to detect the parity of a residue number and hence to check whether an overflow has taken place. Lu and Chiang's algorithm requires $2\log_2 Q$ steps, where $Q$ is the quotient. Each step consists of several residue additions and subtractions, one residue multiplication, two memory access cycles and one multi-operand addition (in fact, in parts II and IV of Lu and Chiang's algorithm, more than one multi-operand addition might be needed).
Realization II of the new algorithm requires $\log_2 Q$ steps, where each step consists of one residue multiplication ($Q_i Y$), one residue subtraction ($X - Q_i Y$) that is performed in parallel with one residue addition ($Q = Q + Q_i$), one multi-operand addition and two memory access cycles: one to get the fractional representations of the residue digits, and the other to obtain $Q_i$.

NOTE. Following the prevalent method in the literature [4, 5, 12, 13], execution time for residue arithmetic algorithms is measured by the mean number of basic residue arithmetic operations needed by each algorithm. The basic residue operations are addition, subtraction and multiplication.

The following notational convention has been adopted for this paper:

$\{m_1, m_2, \ldots, m_N\}$: moduli set of $N$ pairwise relatively prime positive integers.
$M = \prod_{i=1}^{N} m_i$: dynamic range.
For any integer $X \in [0, M)$, the residue representation of $X$ is $X \xrightarrow{RNS} (x_1, x_2, \ldots, x_N)$.
$x_i = |X|_{m_i}$, i.e. $x_i = X \pmod{m_i}$.
$\hat{m}_i = M/m_i$.
$|1/\hat{m}_i|_{m_i}$: multiplicative inverse of $\hat{m}_i$ (i.e. $\big|\,|1/\hat{m}_i|_{m_i}\,\hat{m}_i\big|_{m_i} = 1$).
$\lceil \cdot \rceil$: the ceiling value of $(\cdot)$, that is, the next integer greater than or equal to $(\cdot)$.
$\lfloor \cdot \rfloor$: the floor value of $(\cdot)$, that is, the preceding integer less than or equal to $(\cdot)$.

Define a function $h(I)$ such that

$$h(I) = \begin{cases} 1 + \lfloor \log_2 I \rfloor, & \text{if } I \text{ is an integer} > 0 \\ 0, & \text{if } I = 0 \\ \lfloor \log_2 I \rfloor, & \text{if } 0 < I < 1. \end{cases}$$

2. DIVISION ALGORITHM

Assume that $X$, $Y$ and $Q$ are non-negative integers such that $Q = \lfloor X/Y \rfloor$, $Y \neq 0$. The following steps introduce the basic idea for division in RNS:

1. Set the quotient $Q$ to zero: $Q = 0$.
2. Find the position of the most-significant non-zero bit in the divisor $Y$, say $k$; that is, $k = h(Y)$.
3. Find the position of the most-significant non-zero bit in the dividend $X$, say $j$; that is, $j = h(X)$.
4. If $j > k$, then: $Q' = Q + 2^{j-k-1}$, $X' = X - 2^{j-k-1}Y$, $Q = Q'$, $X = X'$. Go to Step 3.
5. If $j = k$, then: $X' = X - Y$, $j' = h(|X'|_M)$, so if $j' < j$ then $Q = Q + 1$. Otherwise, $Q$ is unchanged. In either case, end the procedure.
6. If $j < k$, then $Q$ is unchanged. End the procedure.

This basic algorithm can be used effectively with RNS division arithmetic.
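The six steps above can be sketched in Python. This is only an illustrative sketch under stated assumptions: it works on ordinary integers rather than on residue digits, the names `h` and `divide` are ours, `int.bit_length()` stands in for the lookup that an RNS realization would use to evaluate $h(I)$, and the partial quotient is taken as $2^{j-k-1}$.

```python
def h(i: int) -> int:
    """Position of the MS non-zero bit: 1 + floor(log2 i) for i > 0, and 0 for i = 0."""
    return i.bit_length()


def divide(x: int, y: int, m: int) -> int:
    """Return floor(x / y) for 0 <= x, y < m using the steps of Section 2.

    m plays the role of the dynamic range M; it is needed only in Step 5,
    where the difference x - y is taken modulo M.
    """
    if y == 0:
        raise ZeroDivisionError("divisor is zero")
    q = 0                            # Step 1
    k = h(y)                         # Step 2
    while True:
        j = h(x)                     # Step 3
        if j > k:                    # Step 4: partial quotient 2^(j-k-1)
            qi = 1 << (j - k - 1)
            q += qi
            x -= qi * y              # remainder stays non-negative
        elif j == k:                 # Step 5: decide between x >= y and x < y
            j_prime = h((x - y) % m)
            if j_prime < j:
                q += 1
            return q
        else:                        # Step 6: x < y, nothing left to add
            return q
```

For instance, `divide(100, 7, 105)` walks through the partial quotients 8, 4 and 2 before terminating in Step 6. The design point worth noting is that no sign test on `x - y` is ever made; the comparison in Step 5 relies only on bit positions.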
An important feature of this algorithm is the selection of the partial quotient to be $2^{j-k-1}$; hence the quantity $(X - 2^{j-k-1}Y)$ is guaranteed to be non-negative as long as $X > Y$. It should be emphasized that $X$ and $Y$ in the above algorithm are binary representations of non-negative integers and that the algorithm is still correct when the fractional representation is adopted. The proofs of both cases, integer and fractional representations, are introduced in the next two subsections.

2.1. Proof of correctness of the algorithm: integer representation

Before proving the algorithm, the following lemma has to be introduced.

LEMMA 1. For any residue integers $X, Y \in [0, M)$, for which $j = h(X)$, $k = h(Y)$ and $j = k \neq 0$, then $j > j'$ if $X \geq Y$ and $j \leq j'$ if $X < Y$, where $j' = h(|X - Y|_M)$.

Proof. Since $j = k$, then $X$ and $Y$ can be expressed as $X = 2^j + a$ and $Y = 2^j + b$, where $0 \leq a, b \leq 2^j - 1$.

For the case $X \geq Y$ (i.e. $a \geq b$): $|X - Y|_M = X - Y \geq 0$, then $X - Y = a - b$, but since $0 \leq a - b \leq 2^j - 1$, then $j' = h(X - Y) = h(a - b) < j$.

For the case $X < Y$ (i.e. $a < b$): $|X - Y|_M = M + X - Y$, so $j' = h(M + X - Y) = h(M + a - b)$. The minimum value of $j'$ happens when $a = 0$, $b = 2^j - 1$ and $M = Y + 1$. Upon substituting these values: $j' \geq h(2^j + 1) = j$, or equivalently $j' \geq j$.

Based on the above lemma, the proof of the algorithm is as follows.

For the case $j > k$: since $Y < 2^{k+1}$ and $X \geq 2^j$, then $X/Y > 2^{j-k-1} = Q_i$ (i.e. $Q_i$ is the $i$th partial quotient). Hence the estimate of the quotient in each iteration is guaranteed to produce a positive remainder. Assume there are $v$ iterations which satisfy the condition $j > k$; then the total of the partial quotients resulting from this case is $Q'$, where $Q' = \sum_{i=1}^{v} Q_i$.

For the case $j = k$ (i.e. the $(v+1)$th iteration), two possibilities are expected:

1. $X < Y$. This case is detected according to Lemma 1 by evaluating $j' = h(|X - Y|_M)$. Hence, if $j' \geq j$ then $Q_{v+1} = 0$. The procedure is then stopped.
2. $X \geq Y$. This case is detected according to Lemma 1 by evaluating $j' = h(|X - Y|_M)$. Hence, if $j' < j$ then $Q_{v+1} = 1$. The procedure is then stopped.

For the case $j < k$, it is obvious that $X < Y$, hence the procedure has to be stopped. Therefore, the quotient would be

$$Q = \lfloor X/Y \rfloor = \sum_{i=1}^{v} Q_i + Q_{v+1}, \qquad Q_{v+1} = \begin{cases} 1, & \text{if } j_{v+1} > j' \\ 0, & \text{otherwise} \end{cases}$$

where $j_{v+1} = h(X)$ in the $(v+1)$th iteration (i.e. when $j = k$).

2.2. Proof of correctness of the algorithm: fractional representation

LEMMA 2. In RNS, for any fractional representations $X/M$, $Y/M$ where $X, Y \in [0, M)$, $j = h(X/M)$, $k = h(Y/M)$ and $j = k$, then $j' \neq -1$ if $X \geq Y$, and $j' = -1$ if $X < Y$, where $j' = h(|X - Y|_M / M)$.

Proof. Since $X/M$ and $Y/M$ are of the same order (i.e. $j = k$), $2^j \leq X/M < 2^{j+1}$ and similarly $2^j \leq Y/M < 2^{j+1}$.

For the case $X \geq Y$: $|X - Y|_M = X - Y \geq 0$, then $0 \leq (X - Y)/M < 2^j$. Hence $j' = h((X - Y)/M) < j$, or equivalently $j' < j$. Since the highest value of $j$ is $-1$, then $j' \leq -2$. Note that for the special case $X = Y$, $(X - Y)/M = 0$, and by definition $h(0) = 0$. Consequently, if $X \geq Y$ then $j' \neq -1$.

For the case $X < Y$: $|X - Y|_M = M - (Y - X)$, then $j' = h((M + X - Y)/M)$, or $j' = h(1 - (Y - X)/M)$; but since $X < Y$, then $0 < (Y - X)/M < 2^j$, or $j' \geq h(1 - 2^j)$. Since $X, Y < M$, the maximum value of $j$ is $-1$.
Therefore, $0 > j' \geq h(1 - 2^{-1})$, or $j' = -1$.

The proof of the algorithm for the case when the dividend and divisor are fractional quantities uses Lemma 2 and follows the same approach as the proof given for the integer representation. The proof leads to the result that the quotient $Q$ can be expressed as

$$Q = \sum_{i=1}^{v} Q_i + Q_{v+1}, \qquad Q_{v+1} = \begin{cases} 1, & \text{if } j' \neq -1 \\ 0, & \text{otherwise.} \end{cases}$$

3. REALIZATION OF THE ALGORITHM IN RNS

3.1. Realization I

This realization is based on the integer representation outlined in the previous section. It is quite useful for small and medium dynamic ranges where all the bits of the residue digits can be applied to a single RAM in order to evaluate $h(I)$. The proposed structure for Realization I, shown in Figure 1, can be described as follows: by applying $Y$ to RAM 1, $k = h(Y)$ is evaluated, where $k$ is expressed in $\beta$ bits, $\beta = \lceil \log_2(\log_2 M) \rceil$. Similarly, by applying $X$ to RAM 1, $j = h(X)$ is also evaluated, where $j$ is also expressed in $\beta$ bits. The partial quotient $Q_i$ is computed by applying $j$ and $k$ to RAM 2. A residue multiplier then multiplies the partial quotient by $Y$. The output of the multiplier, i.e. $Q_i Y$, is subtracted from $X$ to produce a new remainder. The procedure is repeated and the residue adder accumulates the partial quotients until $j < k$.

RAM 1 accepts $\sum_{i=1}^{N} n_i$ address lines, where $n_i = \lceil \log_2 m_i \rceil$ (i.e. the bits of the residue representation of $X$ or $Y$); thus it has a size of $(2^{\sum n_i} \times \beta)$ bits. RAM 2 accepts the bits of $j$ and $k$ (a total of $2\beta$ bits); thus its size would be $(2^{2\beta} \times \sum_{i=1}^{N} n_i)$ bits. In the case that $j - k$ is first evaluated before being applied to RAM 2, then RAM 2 would have the size $(2^{\beta+1} \times \sum_{i=1}^{N} n_i)$ bits.

This realization requires $\log_2 Q$ iterations. Each iteration consists of two consecutive memory cycles followed by two consecutive residue operations. The realization is very attractive for many digital-signal processing applications which utilize small and medium dynamic ranges.

3.2. Realization II

This realization is based on the fractional representation outlined in the previous section.
It is quite useful for large dynamic ranges where the bits of the residue digits cannot be applied to a single RAM in order to evaluate $h(I)$. Van Vu [15] developed a conversion technique based on the CRT. This technique uses a fractional representation of weighted numbers. It is given by

$$\frac{X}{M} = \sum_{i=1}^{N} \frac{1}{m_i} \left| \frac{x_i}{\hat{m}_i} \right|_{m_i} - p \qquad (1)$$

where $p$ is a non-negative integer. Hence, the value of $X/M$ can be obtained by evaluating the right-hand side of (1). This is basically done by letting each $x_i$ address a table which stores $\frac{1}{m_i}\left| x_i / \hat{m}_i \right|_{m_i}$. The outputs of these $N$ tables can be added using a multi-operand adder. Any integer overflow resulting from this adder is disregarded, since it represents the integer part of the summation result. The fractional values stored in the tables should be expressed using $t$ bits, where $t \geq \lceil \log_2 MN \rceil$ if $M$ is odd and $t \geq \lceil \log_2 MN \rceil - 1$ otherwise [15].

FIGURE 1. Proposed implementation of the algorithm (Realization I).

FIGURE 2. Proposed implementation of the algorithm (Realization II).

The proposed structure for Realization II, shown in Figure 2, can be described as follows: apply the residue digits of the divisor to the fractional representation circuit to obtain $Y/M$. This requires a memory cycle followed by a multi-operand addition. $Y/M$ is applied to a priority encoder to evaluate $k = h(Y/M)$. Next, and using the same approach, $j = h(X/M)$ is evaluated. The bits of $j$ and $k$ (a total of $2\beta$ bits), or their difference, are applied to a RAM to produce the partial quotient $Q_i$. This $Q_i$ is applied to a residue multiplier to compute $Q_i Y$. The output of the multiplier ($Q_i Y$) is subtracted from $X$ using a residue subtractor. The output of the residue subtractor is applied again to the fractional representation circuit as long as $j \geq k$.

4. EVALUATION

The complexity of the proposed residue-based divider is highly dependent on the complexity of the individual components used in the design (e.g. residue multiplier, adder, etc.). Many residue-based multipliers can be found in the literature [16, 17, 18, 19]. These differ in their structures and in their area and time complexities (i.e. gate number, silicon area, time delay, etc.). The same statement can be made about the other components in this proposed divider [12].
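As a rough illustration of the fractional representation in equation (1), the following Python sketch evaluates $X/M$ directly from the residue digits and then reads off $h(X/M)$. It is a sketch under stated assumptions, not the hardware datapath: exact rationals from `fractions.Fraction` replace the $t$-bit table entries, a loop replaces the priority encoder, and the helper names `fractional_value` and `h_frac` are ours. The three-argument `pow` with exponent `-1` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import prod


def fractional_value(residues, moduli):
    """Evaluate X/M as FRAC(sum_i (1/m_i) * |x_i * inverse(m_hat_i)|_{m_i})."""
    big_m = prod(moduli)
    total = Fraction(0)
    for x_i, m_i in zip(residues, moduli):
        m_hat = big_m // m_i
        inv = pow(m_hat, -1, m_i)            # multiplicative inverse of m_hat mod m_i
        total += Fraction((x_i * inv) % m_i, m_i)
    return total - (total.numerator // total.denominator)  # drop the integer part p


def h_frac(f):
    """floor(log2 f) for 0 < f < 1, and 0 for f = 0."""
    if f == 0:
        return 0
    n = 0
    while f < 1:                              # count doublings until f reaches 1
        f *= 2
        n -= 1
    return n


# Moduli {16, 15, 7}, X = 312, i.e. residues (8, 12, 4)
value = fractional_value((8, 12, 4), (16, 15, 7))
position = h_frac(value)
```

Here `value` equals 312/1680 and `position` is the (negative) index of its most-significant non-zero fractional bit, which is the quantity the priority encoder in Figure 2 delivers.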

For example, the area and time complexities of the priority encoder presented in [20] are given by $O(n)$ and $O(\log n)$, respectively. The proposed design in Figure 2 consists of different devices, namely RAMs, a multi-operand binary adder, a priority encoder followed by a RAM, a residue-based adder, a subtractor and a multiplier. For the first $N$ RAMs, the size of the $i$th RAM is $(2^{n_i} \times n)$, where $n_i = \lceil \log_2 m_i \rceil$. The multi-operand adder accepts $N$ operands of $n$ bits each. The least significant (LS) $n$ output bits of this adder constitute $X/M$. The encoder accepts these LS $n$ bits of the adder and produces $h(I)$, expressed in $\beta$ bits. The RAM following the encoder outputs the residue representation of the estimated quotient. This estimated quotient is then applied to the different residue-based arithmetic components. Each of these components accepts $N$ residue digits from each operand, where each residue digit is expressed in $n_i$ bits.

To compare the performance of this algorithm with that of Lu and Chiang's algorithm, computer programs simulating both algorithms have been developed to calculate the mean of residue operations (MORO). The mean of multi-operand additions (MOMA) has also been calculated and compared, where it applies. For simulation purposes, three moduli sets were selected to serve different dynamic ranges. These sets are: M1 = (7, 11, 13, 15), M2 = (11, 13, 15, 19, 23, 29, 31) and M3 = (29, 31, 43, 47, 53, 55, 59, 61, 63). Simulation results are listed in Table 1.

TABLE 1. Simulation results.

Moduli   MORO                        MOMA
set      Ours    Lu and Chiang's     Ours    Lu and Chiang's
M1       2.68    0.34                0       0
M2       2.67    0.36                3.04    5.496
M3       2.72    0                   3.089   5.504

For moduli set M1, the results are exact: all possible combinations of dividends and divisors were simulated. However, for M2 and M3, a sample of 200 million randomly generated numbers within the dynamic range defined by each moduli set was simulated. The simulation of the new algorithm is based on Figure 1 for M1, and on Figure 2 for M2 and M3. Simulation of Lu and Chiang's algorithm is based on the flowchart given in [3].
For the moduli set M1, Table 1 indicates that the new algorithm is four times faster than Lu and Chiang's algorithm. This conclusion applies to every moduli set where all the bits of the residue digits can be applied simultaneously to a single RAM. For other moduli sets like M2 and M3, which have very large dynamic ranges, the new algorithm is still four times faster regarding the number of basic residue operations. Moreover, the average number of multi-operand additions needed is almost half that needed by the other algorithm.

5. CUSTOMIZED DIVISION ALGORITHM

In Realization II, determining the position of the highest power of two contained in any residue number $I$, which we referred to as $h(I)$, is an important time-delay element in the operation of the proposed residue divider. In this section, we customize the same algorithm to serve two specific moduli sets: $(2^k, 2^k-1, 2^{k-1}-1)$ and $(2^k+1, 2^k, 2^k-1)$. This customization eliminates the need for ROMs and thus reduces the delay contributed by evaluating $h(I)$.

5.1. Evaluating h(I) for the moduli set $(2^k, 2^k-1, 2^{k-1}-1)$

Define $m_1 = 2^k$, $m_2 = 2^k - 1$, $m_3 = 2^{k-1} - 1$. Then $\hat{m}_1 = (2^k - 1)(2^{k-1} - 1)$, $\hat{m}_2 = 2^k(2^{k-1} - 1)$, $\hat{m}_3 = 2^k(2^k - 1)$. The residue representation of $X$ is $(x_1, x_2, x_3)$. The multiplicative inverses of $\hat{m}_1$, $\hat{m}_2$ and $\hat{m}_3$ are [21]: $2^{k-1} + 1$, $2^k - 3$ and $2^{k-2}$, respectively. Substituting the corresponding values of $\hat{m}_i$ and their multiplicative inverses in (1):

$$\frac{X}{M} = \frac{1}{2^k}\left|(2^{k-1}+1)x_1\right|_{2^k} + \frac{1}{2^k-1}\left|(2^k-3)x_2\right|_{2^k-1} + \frac{1}{2^{k-1}-1}\left|2^{k-2}x_3\right|_{2^{k-1}-1} - p. \qquad (2)$$

Since $X < M$, then $X/M < 1$. Hence

$$\frac{X}{M} = \mathrm{FRAC}\left(\frac{1}{2^k}\left|(2^{k-1}+1)x_1\right|_{2^k} + \frac{1}{2^k-1}\left|(2^k-3)x_2\right|_{2^k-1} + \frac{1}{2^{k-1}-1}\left|2^{k-2}x_3\right|_{2^{k-1}-1}\right) \qquad (3)$$

where FRAC(...) denotes the fractional part of the operand.

The circular shift property [13] states that modulo $(2^p - 1)$ multiplication of an integer by $2^n$, where $p$ and $n$ are positive integers, is equivalent to an $n$-bit circular left shift (e.g. for $|2^3 \times 27|_{31}$, the binary form of 27 is 11011, and a 3-bit circular left shift gives 11110, which is decimal 30). Therefore, to simplify the terms on the right-hand side (RHS) of (3), we proceed as follows.

Evaluate $(1/2^k)\left|(2^{k-1}+1)x_1\right|_{2^k}$.
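Assuming that the binary form of $x_1$ is given by $b_{1(k-1)} b_{1(k-2)} \ldots b_{11} b_{10}$, then

$$\left|(2^{k-1}+1)x_1\right|_{2^k} = \left|2^{k-1}x_1 + x_1\right|_{2^k} \xrightarrow{\text{binary}} \Big| b_{1(k-1)} b_{1(k-2)} \ldots b_{11} b_{10} \underbrace{00\ldots000}_{(k-1)\ \text{zeros}} + b_{1(k-1)} b_{1(k-2)} \ldots b_{12} b_{11} b_{10} \Big|_{2^k}.$$

Recalling that for mod $2^k$ only the LS $k$ bits are significant, then $\left|(2^{k-1}+1)x_1\right|_{2^k} = b_x b_{1(k-2)} \ldots b_{11} b_{10}$, where $b_x = (b_{10}\ \mathrm{XOR}\ b_{1(k-1)})$: the shifted copy of $x_1$ contributes only $b_{10}$ at bit position $k-1$, and the carry out of that position is discarded. Now, let $R_1$ represent the binary form of $(1/2^k)\left|(2^{k-1}+1)x_1\right|_{2^k}$; then $R_1$ is obtained by multiplying the binary form of $1/2^k$ by that of $\left|(2^{k-1}+1)x_1\right|_{2^k}$. That is,

$$R_1 = 0.b_x b_{1(k-2)} \ldots b_{11} b_{10}. \qquad (4)$$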
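The bit-level rule behind (4) can be checked exhaustively for small word lengths. The short Python sketch below is only a sanity check under our reading that $b_x$ is the exclusive-OR of $b_{10}$ and $b_{1(k-1)}$; the function name `r1_word` is hypothetical.

```python
def r1_word(x1: int, k: int) -> int:
    """Build the k-bit word b_x b_{1(k-2)} ... b_{10} from the bits of x1."""
    msb = (x1 >> (k - 1)) & 1            # b_{1(k-1)}
    lsb = x1 & 1                         # b_{10}
    b_x = msb ^ lsb                      # replaces the MS bit
    low = x1 & ((1 << (k - 1)) - 1)      # LS (k-1) bits are unchanged
    return (b_x << (k - 1)) | low


def matches_product(k: int) -> bool:
    """Compare the bit rule against |(2^(k-1) + 1) * x1| mod 2^k for every x1."""
    factor = (1 << (k - 1)) + 1
    return all(r1_word(x1, k) == (factor * x1) % (1 << k) for x1 in range(1 << k))
```

No multiplier is needed for this term: a single two-input gate on the outer bits of $x_1$ produces the whole word.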

Evaluate $[1/(2^k-1)]\left|(2^k-3)x_2\right|_{2^k-1}$. Assuming that the binary form of $x_2$ is given by $b_{2(k-1)} b_{2(k-2)} \ldots b_{21} b_{20}$, then $\left|(2^k-3)x_2\right|_{2^k-1} = \left|2^k x_2 - 3x_2\right|_{2^k-1}$. But since $\left|2^k x_2\right|_{2^k-1} = \left|x_2\right|_{2^k-1}$, then $\left|(2^k-3)x_2\right|_{2^k-1} = \left|-(2x_2)\right|_{2^k-1}$. Based on the circular left-shift property:

$$\left|-(2x_2)\right|_{2^k-1} \xrightarrow{\text{binary}} \left|-(b_{2(k-2)} b_{2(k-3)} \ldots b_{21} b_{20} b_{2(k-1)})\right|_{2^k-1}$$

or equivalently,

$$\left|-(b_{2(k-2)} b_{2(k-3)} \ldots b_{21} b_{20} b_{2(k-1)})\right|_{2^k-1} = \Big|\underbrace{11\ldots1}_{k\ \text{ones}\ (=2^k-1)} - (b_{2(k-2)} b_{2(k-3)} \ldots b_{21} b_{20} b_{2(k-1)})\Big|_{2^k-1}.$$

Assuming $x_2 \neq 0$, then

$$\left|(2^k-3)x_2\right|_{2^k-1} \xrightarrow{\text{binary}} \bar{b}_{2(k-2)} \bar{b}_{2(k-3)} \ldots \bar{b}_{21} \bar{b}_{20} \bar{b}_{2(k-1)}$$

where $\bar{b}$ denotes the complement of the bit $b$. However, if $x_2 = 0$, then $\left|(2^k-3)x_2\right|_{2^k-1} = 0$.

On the other hand, the term $1/(2^k-1)$ can be written as $2^{-k}/(1-2^{-k})$. Recall that any fraction of the form $q/(1-q)$, where $|q| < 1$, can be expanded in a power series as $q/(1-q) = \sum_{i=1}^{\infty} q^i$. Therefore $2^{-k}/(1-2^{-k}) = 2^{-k} + 2^{-2k} + 2^{-3k} + 2^{-4k} + \ldots$ Based on the error analysis introduced in [15], the MS $(3k+1)$ bits are the only significant bits in our computations. Let $R_2$ represent the binary form of $[1/(2^k-1)]\left|(2^k-3)x_2\right|_{2^k-1}$; then $R_2$ is obtained by considering the MS $(3k+1)$ bits of multiplying $1/(2^k-1)$ by $\left|(2^k-3)x_2\right|_{2^k-1}$. That is,

$$R_2 = 0.\underbrace{\bar{b}_{2(k-2)} \ldots \bar{b}_{20} \bar{b}_{2(k-1)}}_{k\ \text{bits}}\ \underbrace{\bar{b}_{2(k-2)} \ldots \bar{b}_{20} \bar{b}_{2(k-1)}}_{k\ \text{bits}}\ \underbrace{\bar{b}_{2(k-2)} \ldots \bar{b}_{20} \bar{b}_{2(k-1)}}_{k\ \text{bits}}\ \bar{b}_{2(k-2)} \qquad (5)$$

where the terms are concatenated.

Evaluate $[1/(2^{k-1}-1)]\left|2^{k-2}x_3\right|_{2^{k-1}-1}$. Assuming that the binary form of $x_3$ is given in $(k-1)$ bits by $b_{3(k-2)} b_{3(k-3)} \ldots b_{31} b_{30}$, then based on the circular left-shift property: $\left|2^{k-2}x_3\right|_{2^{k-1}-1} \xrightarrow{\text{binary}} b_{30} b_{3(k-2)} b_{3(k-3)} \ldots b_{31}$. On the other hand, the term $1/(2^{k-1}-1)$ can be written as $2^{-(k-1)}/(1-2^{-(k-1)})$. Thus, it can be expanded in a power series as $2^{-(k-1)}/(1-2^{-(k-1)}) = 2^{-(k-1)} + 2^{-2(k-1)} + 2^{-3(k-1)} + 2^{-4(k-1)} + \ldots$ Now, let $R_3$ be the binary form of $[1/(2^{k-1}-1)]\left|2^{k-2}x_3\right|_{2^{k-1}-1}$; then $R_3$ is obtained by considering the MS $(3k+1)$ bits of multiplying $1/(2^{k-1}-1)$ by $\left|2^{k-2}x_3\right|_{2^{k-1}-1}$. That is,

$$R_3 = 0.\underbrace{b_{30} b_{3(k-2)} \ldots b_{32} b_{31}}_{(k-1)\ \text{bits}}\ \underbrace{b_{30} b_{3(k-2)} \ldots b_{32} b_{31}}_{(k-1)\ \text{bits}}\ \underbrace{b_{30} b_{3(k-2)} \ldots b_{32} b_{31}}_{(k-1)\ \text{bits}}\ \underbrace{b_{30} b_{3(k-2)} b_{3(k-3)} b_{3(k-4)}}_{4\ \text{bits}}. \qquad (6)$$

Therefore, (3) can be rewritten as

$$\frac{X}{M} = \mathrm{FRAC}(R_1 + R_2 + R_3). \qquad (7)$$

Thus $h(X/M)$ is nothing but the position of the MS non-zero bit of $X/M$.

EXAMPLE. Consider the moduli set {16, 15, 7}. To find $h(X/M)$ where $X = (8, 12, 4)$ (i.e. $X = 312$ and $M = 1680$):

$x_1 = 8 \xrightarrow{\text{binary}} 1000$, so $R_1 = 0.1000$
$x_2 = 12 \xrightarrow{\text{binary}} 1100$, so $R_2 = 0.0110\ 0110\ 0110\ 0$
$x_3 = 4 \xrightarrow{\text{binary}} 100$, so $R_3 = 0.010\ 010\ 010\ 010\ 0$.

Therefore, $X/M = \mathrm{FRAC}(R_1 + R_2 + R_3) = 0.0010111110000$. Since the MS non-zero bit is in the third location, $h(X/M) = -3$.

In order to implement (7), one three-operand binary adder is needed. A carry-save adder (CSA) followed by a carry-propagate adder (CPA) can realize the addition of the three operands.

5.2. Evaluating h(I) for the moduli set $(2^k+1, 2^k, 2^k-1)$

The residue decoder introduced by Sweidan and Hiasat [22] has the advantages of reduced hardware requirements and extremely wide fixed-point dynamic ranges, since its upper bound is not limited by a memory size. Moreover, it requires only a total of four $2k$-bit binary adders, which makes it very attractive compared to other published decoders [23, 24]. In this paper, we propose a hardware layout that can decode residue digits of the moduli set $(2^k+1, 2^k, 2^k-1)$ into their binary equivalent. The new layout is an improvement of that presented in [22]. In this new contribution, we reduce the number of adders needed for the decoding operation from four $2k$-bit binary adders to one $2k$-bit three-operand binary adder. It has been proved in [22] that

$$\lfloor X/2^k \rfloor = \left|A + B + C - x_1\right|_{2^{2k}-1} \qquad (8)$$

where

$A = \left|(2^{2k-1} + 2^{k-1})x_3\right|_{2^{2k}-1}$
$B = \left|(2^{2k} - 2^k - 1)x_2\right|_{2^{2k}-1}$
$C = \left|(2^{2k-1} + 2^{k-1})x_1\right|_{2^{2k}-1}$.

Assuming that $x_1$, $x_2$ and $x_3$ have the following binary format:

$x_1 = b_{1k} b_{1(k-1)} \ldots b_{11} b_{10}$
$x_2 = b_{2(k-1)} b_{2(k-2)} \ldots b_{21} b_{20}$
$x_3 = b_{3(k-1)} b_{3(k-2)} \ldots b_{31} b_{30}$,

then using the circular left-shift property, $A$, $B$ and $C$ can be expressed as [22]:

$A = b_{30} b_{3(k-1)} \ldots b_{32} b_{31} b_{30} b_{3(k-1)} \ldots b_{32} b_{31}$
$B = \bar{b}_{2(k-1)} \bar{b}_{2(k-2)} \ldots \bar{b}_{21} \bar{b}_{20} \underbrace{11\ldots1}_{k\ \text{ones}}$
$C = b_x b_{1(k-1)} \ldots b_{12} b_{11} b_x b_{1(k-1)} \ldots b_{12} b_{11}$

where $b_x = b_{10}\ \mathrm{OR}\ b_{1k}$.

FIGURE 3. Proposed implementation of the division algorithm customized for the moduli sets $(2^k, 2^k-1, 2^{k-1}-1)$ and $(2^k+1, 2^k, 2^k-1)$.

By redefining $R_1 = A$, $R_2 = B - x_1$ and $R_3 = C$, (8) can be rewritten as

$$\lfloor X/2^k \rfloor = \left|R_1 + R_2 + R_3\right|_{2^{2k}-1}. \qquad (9)$$

Case I: Since $R_1 = A$, the binary representation is the same:

$$R_1 = b_{30} b_{3(k-1)} b_{3(k-2)} \ldots b_{32} b_{31} b_{30} b_{3(k-1)} b_{3(k-2)} \ldots b_{32} b_{31}. \qquad (10)$$

Case II: Since $R_2 = B - x_1$, then for the case $x_1 < 2^k$, and using the 2's complement notation, $R_2 = B + (\text{1's complement of } x_1) + 1$. Noting that the LS $k$ bits of $B$ are all ones, the LS $k$ bits of the result of the subtraction are simply the 1's complement of $x_1$, with an overflow of 1 at the $(k+1)$th bit. Based on 2's complement, this overflow indicates that the result of the subtraction is positive, hence it can be disregarded. However, when $x_1 (< 2^k) = x_2 = 0$, then $R_2 = \left|2^{2k} - 1\right|_{2^{2k}-1} = 0$. Therefore, $R_2$ can be expressed in binary format as

$$R_2 = \begin{cases} 0, & \text{if } x_1 (< 2^k) = x_2 = 0 \\ \bar{b}_{2(k-1)} \bar{b}_{2(k-2)} \ldots \bar{b}_{21} \bar{b}_{20} \bar{b}_{1(k-1)} \bar{b}_{1(k-2)} \ldots \bar{b}_{11} \bar{b}_{10}, & \text{otherwise.} \end{cases} \qquad (11)$$

Case III: $x_1 = 2^k$. In this case, the $(k+1)$th bit of $x_1$ is 1; thus the values of $R_2$ and $B$ are the same, because in the computation of $R_2$ we used the LS $k$ bits of $x_1$, which are all zeros in this case. Therefore, the format of $R_2$ is not changed. However, to take care of this non-zero $(k+1)$th bit of $x_1$, it has to be subtracted from $R_3$. Therefore

$$R_3 = \begin{cases} b_x b_{1(k-1)} \ldots b_{12} b_{11} b_x b_{1(k-1)} \ldots b_{12} b_{11}, & \text{if } x_1 \neq 2^k \\ \bar{b}_x \bar{b}_{1(k-1)} \ldots \bar{b}_{12} \bar{b}_{11} b_x b_{1(k-1)} \ldots b_{12} b_{11}, & \text{if } x_1 = 2^k. \end{cases} \qquad (12)$$

Equation (9) is simply accomplished by adding $R_1$, $R_2$ and $R_3$. The output should then be incremented by any output carry. Nevertheless, a carry resulting from this adder can be neglected as long as the output does not have the value $(2^n - 1)$, where $0 \leq n \leq 2k$. This can be justified by the fact that $h(I) = h(I + 1)$ if $I \neq (2^n - 1)$.
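The decoding identity used in this subsection can be checked numerically. The Python sketch below is an arithmetic model only, not the gate-level design: it evaluates $A$, $B$ and $C$ with ordinary modular arithmetic under our reading of (8), namely $\lfloor X/2^k \rfloor = |A + B + C - x_1|_{2^{2k}-1}$, and compares the result with a direct shift. The function names `high_part` and `check_all` are hypothetical.

```python
def high_part(x1: int, x2: int, x3: int, k: int) -> int:
    """Return floor(X / 2**k) from the residues of X modulo (2**k + 1, 2**k, 2**k - 1)."""
    mod = (1 << (2 * k)) - 1                    # 2^(2k) - 1
    w = (1 << (2 * k - 1)) + (1 << (k - 1))     # 2^(2k-1) + 2^(k-1)
    a = (w * x3) % mod                          # A
    b = ((mod - (1 << k)) * x2) % mod           # B, multiplier 2^(2k) - 2^k - 1
    c = (w * x1) % mod                          # C
    return (a + b + c - x1) % mod


def check_all(k: int) -> bool:
    """Exhaustively compare high_part with X >> k over the whole dynamic range."""
    m1, m2, m3 = (1 << k) + 1, 1 << k, (1 << k) - 1
    return all(
        high_part(x % m1, x % m2, x % m3, k) == x >> k
        for x in range(m1 * m2 * m3)
    )


# Moduli {17, 16, 15}, X = 827, i.e. residues (11, 11, 2)
top = high_part(11, 11, 2, 4)
x_rebuilt = (top << 4) | 11          # concatenate the k bits of x2
position = x_rebuilt.bit_length()    # h(X) for an integer X
```

In hardware the three products reduce to rotations and complements of the residue bits, so only the three-operand addition remains; the model above is useful mainly for confirming the constants before committing them to a layout.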
If $I = 2^n - 1$, however, the output carry would be significant, and the priority encoder proposed in implementing the residue divider can take care of this special case. A single carry-save adder can add these three operands. A few logic gates are also needed to detect the $(k+1)$th bit of $x_1$ and to select the proper format of $R_3$. Using the formula $X = \lfloor X/2^k \rfloor 2^k + x_2$, the value of $X$ can be obtained by concatenating the $k$ bits of $x_2$ to the $2k$ output bits of the three-operand adder. These $3k$ bits and the carry are applied to the priority encoder.

EXAMPLE. Consider the moduli set {17, 16, 15}. To find $h(X)$ where $X = (11, 11, 2)$ (i.e. $X = 827$):

$x_1 = 11 \xrightarrow{\text{binary}} 01011$, so $R_3 = 1101\ 1101$
$x_2 = 11 \xrightarrow{\text{binary}} 1011$, so $R_2 = 0100\ 0100$
$x_3 = 2 \xrightarrow{\text{binary}} 0010$, so $R_1 = 0001\ 0001$.

Therefore, the adder output is $(R_1 + R_2 + R_3) = 0011\ 0010$, where the overflow has been neglected. Since $X = \lfloor X/2^k \rfloor 2^k + x_2$, concatenating the bits of $x_2$ to those of the adder yields $h(X) = h(0011\ 0010\ 1011) = 10$ (i.e. the 10th position).

Figure 3 shows the new proposed hardware realization of an RNS divider for the moduli sets $(2^k, 2^k-1, 2^{k-1}-1)$ and $(2^k+1, 2^k, 2^k-1)$. The operation of this divider is self-explanatory. The propagation delay in Figure 3, as compared with that in Figure 2, has been reduced by one memory access cycle per iteration. Because the memory access cycle is very significant compared with the delay of the other components, and because division is an iterative procedure, this reduction becomes increasingly significant as the number of iterations per division problem increases. This implies that the new proposed realization is much faster for these particular moduli sets. Moreover, the reduction in hardware requirements is another substantial improvement.

6. VLSI IMPLEMENTATION OF A RESIDUE-BASED ARITHMETIC DIVIDER

A pipelined design for a residue-based arithmetic divider for the moduli set $(2^k+1, 2^k, 2^k-1)$ has been implemented, fabricated and tested. The detailed design of the implemented circuit is shown in Figure 4. Data path sizes are also shown.

FIGURE 4. Block diagram of the implemented divider.

The implementation was accomplished using Octtools-5.2 with a standard cell MSU2.3 library. For prototype purposes, $k$ was selected to be four. Thus, the total number of input pins is 13 for each operand. The clock, an X/Y selector and reset are another three inputs. Similarly, the output quotient is expressed in 13 bits. Division-completed (DIVC) is a one-bit output that goes high to validate the output quotient and sets the flag that the division process is completed. Division-by-zero (DBZ) is another output bit, which sets a flag if the divisor is zero. The design has an integrated circuit area of $(1.792 \times 1.675)$ mm². The tiny padframe (40PC22×22) was used to accommodate this design. Test results showed that the design can run at a clock speed of 5 MHz.
The number of clock cycles required for each division problem depends on both the dividend and the divisor. However, the average number over different division problems is eight clock cycles.

7. CONCLUSIONS

This paper has presented a new general division algorithm for RNS, which is faster than other previously proposed algorithms. The algorithm was then customized to serve two specific moduli sets: $(2^k, 2^k-1, 2^{k-1}-1)$ and $(2^k+1, 2^k, 2^k-1)$. An RNS divider would then require only a binary adder, a priority encoder, a ROM, a residue adder, a residue subtractor and a residue multiplier. These reduced hardware requirements and processing time qualify the new realization to be very practical for many computing applications and therefore enable RNS to play an increased role in designing arithmetic logic units for general-purpose computers. The proposed customized hardware has been implemented on silicon and test results have been presented.

REFERENCES

[1] Hiasat, A. and Abdel-Aty-Zohdy, H. (1997) Design and implementation of an RNS division algorithm. Proc. 13th Symp. Computer Arithmetic (Asilomar, CA), pp. 240–249.
[2] Hiasat, A. and Abdel-Aty-Zohdy, H. (1995) High-speed division algorithm for residue number system. Proc. 1995 IEEE International Symposium on Circuits and Systems (ISCAS), vol. 3, pp. 1996–1999.
[3] Lu, M. and Chiang, J. (1992) A novel division algorithm for residue number system. IEEE Trans. Comput., 41, 1026–1032.

[4] Banerji, D., Cheung, T. and Ganesan, V. (1981) A high-speed division method in residue arithmetic. Proc. 5th IEEE Symp. Comput. Arithmetic, pp. 158–164.
[5] Chren Jr, W. (1990) A new residue number division algorithm. Computer Math. Appl., 19, 13–29.
[6] Gamberger, D. (1991) New approach to integer division in residue number system. Proc. 10th Symp. Comput. Arith., pp. 84–91.
[7] Keir, Y., Cheney, P. and Tannenbaum, M. (1962) Division and overflow detection in residue number systems. IRE Trans. Electron. Comput., 11, 501–507.
[8] Lin, L., Leiss, E. and McInnis, B. (1984) Division and sign detection algorithm for residue number systems. Comput. Math. Appl., 10, 331–342.
[9] Kinoshita, E., Kosako, H. and Kojima, Y. (1973) General division in the symmetric residue number system. IEEE Trans. Computers, 22, 134–142.
[10] Hitz, M. and Kaltofen, E. (1995) Integer division in residue number system. IEEE Trans. Computers, 44, 240–248.
[11] Hung, C. and Parhami, B. (1994) An approximate sign detection method for residue numbers and its applications to RNS division. Computer Math. Appl., 27, 23–35.
[12] Soderstrand, M., Jenkins, W., Jullien, G. and Taylor, F. (eds) (1986) Residue Number System Arithmetic: Modern Applications in Digital Signal Processing. IEEE Press, New York.
[13] Szabo, N. and Tanaka, R. (1967) Residue Arithmetic and Its Applications to Computer Technology. McGraw Hill, New York.
[14] Jenkins, W. (1979) Recent advances in residue number techniques for recursive digital filtering. IEEE Trans. Acoust. Speech and Signal Processing, 27, 19–31.
[15] Van Vu, T. (1985) Efficient implementations of Chinese remainder theorem for sign detection and residue decoding. IEEE Trans. Comput., 34, 646–651.
[16] Hiasat, A. (1996) Semi-custom VLSI design for RNS multipliers using combinational logic approach. ICECS 96, 2, 935–938.
[17] Radhakrishnan, D. and Yuan, Y. (1992) Novel approaches to the design of VLSI RNS multipliers. IEEE Trans. Circ. Sys-II: Analog and Digital Signal Processing, 39, 52–57.
[18] Alia, G. and Martinelli, E. (1991) A VLSI modulo m multiplier. IEEE Trans. Comp., 40, 873–878.
[19] Elleithy, K. and Bayoumi, M. (1995) A systolic architecture for modulo multiplication. IEEE Trans. Circ. Sys-II: Analog and Digital Signal Processing, 42, 725–729.
[20] Wada, K., Hagihara, K. and Tokura, N. (1984) Area-time optimal fast implementations of several functions in a VLSI model. IEEE Trans. Computers, 33, 435–440.
[21] Hiasat, A. and Zohdy, H. (1998) Residue to binary converter for the moduli $(2^k, 2^k-1, 2^{k-1}-1)$. IEEE Trans. Circuits and Systems Part II, 45, 204–209.
[22] Sweidan, A. and Hiasat, A. (1988) New efficient memoryless, residue to binary converter. IEEE Trans. Circuits and Systems, 35, 1441–1444.
[23] Bernardson, B. (1985) Fast memoryless, over 64 bits, residue to decimal converter. IEEE Trans. Circuits and Systems, 32, 298–300.
[24] Ibrahim, K. and Saloum, S. (1988) An efficient residue to binary converter design. IEEE Trans. Circuits and Systems, 35, 1156–1158.