Semi-Custom VLSI Design and Implementation of a New Efficient RNS Division Algorithm


AHMAD A. HIASAT¹ AND HODA ABDEL-ATY-ZOHDY²

¹ Elect. Eng. Dept., Princess Sumaya University, PO Box 438, Amman 94, Jordan
² Elect. & Sys. Eng. Dept., Oakland University, Rochester, MI 48309, USA
Email: aahiasat@ss.gov.jo

In this paper we introduce a new algorithm for division in the residue number system, which can be applied to any moduli set. Simulation results indicate that the algorithm is faster than the most competitive published work. To further improve this speed, we customize the algorithm to serve two specific moduli sets: $(2^k, 2^k - 1, 2^{k-1} - 1)$ and $(2^k + 1, 2^k, 2^k - 1)$. The customization eliminates memory devices (ROMs), thus increasing the speed of operation. A semi-custom VLSI design of this algorithm for the moduli set $(2^k + 1, 2^k, 2^k - 1)$ has been implemented, fabricated and tested.

Received August 3, 1998; revised April 26, 1999.

1. INTRODUCTION

The residue number system (RNS) has the advantage of carry-free arithmetic operations. Thus, using residue arithmetic would in principle increase the computer processing speed. In particular, addition, subtraction and multiplication can be performed on each residue digit concurrently and independently. However, there are drawbacks associated with RNS. These drawbacks include the difficulty of residue operations like division, sign detection and overflow detection.

Generally speaking, all reported algorithms on division in RNS [3, 4, 5, 6, 7, 8, 9, 10, 11] have the disadvantage of lengthy arithmetic operations, large execution time and complex hardware requirements. The complexity of these algorithms is due to mixed-radix conversion (MRC) and to performing difficult residue operations.

The moduli sets $(2^k, 2^k - 1, 2^{k-1} - 1)$ and $(2^k + 1, 2^k, 2^k - 1)$ are particularly important in applications which require a high degree of precision [12, 13, 14]. Some recursive digital filters require a high degree of precision in their computations in order to accurately control the frequency characteristics and to eliminate the occurrence of instabilities.
Moduli sets like $(2^k, 2^k - 1, 2^{k-1} - 1)$ and $(2^k + 1, 2^k, 2^k - 1)$ are among the very few systems that can deal with such critical situations [14]. The properties of these sets become more apparent in hardware considerations because most moduli are diminished or augmented powers of two. Residue addition for diminished powers of two is of the carry-add type, and multiplication by a power of two is equivalent to left rotation. Similarly, residue addition for augmented powers of two is of the carry-subtract type. Therefore, they can play an increased role in implementing an RNS arithmetic unit for computers.

Part of this paper is based on [1]. Another part of this paper is based on [2].

The advances in VLSI technology have suggested novel approaches to the implementation of arithmetic units over finite rings. RNS supports the main VLSI design properties and features like simple connections, concurrency and modularity. Independence of residue digits eliminates complex interconnection patterns among different logic components. This independence leads to concurrency, where an arithmetic operation can be carried out on all residue digits concurrently. The similarity in processing architecture for each modulus offers functional and layout modularity.

In this paper, we present a new division algorithm. The main idea is based on selecting an approximate quotient that is guaranteed to produce a non-negative remainder until the division procedure is completed. The main features of this algorithm, as compared with others [3, 4, 5, 6, 7, 8, 9, 10, 11], are: no sign determination, overflow detection, scaling or MRC is needed. Moreover, there is no need for base extension or for auxiliary or redundant moduli. The algorithm speed is not dependent on the number of moduli, but on the dynamic range. Nevertheless, this new algorithm is still based upon converting the residue representation of the dividend and the divisor to a weighted code in order to derive some information regarding the position of the most-significant non-zero bit contained in a residue number. The algorithm is then customized to serve the above-mentioned moduli sets.
This customization reduces the hardware and time requirements associated with converting the residue representation to a weighted code. The customized structure has been designed and implemented using VLSI design tools. The layout has then been fabricated and tested to verify the integrity and functionality of the design.

In order to evaluate the performance of this new design, it has to be compared with other RNS division algorithms. Chen's algorithm [5] has a slightly better performance compared to another algorithm [4] in terms of the mean of residue operations needed for each division problem. Nevertheless, it requires MRC and residue scaling. If lookup tables are used, then MRC and scaling would require $(N - 1)$ and $(2N - 1)$ memory cycles, respectively [2]. Chen's algorithm also requires a redundant modulus, which represents another drawback. Gamberger [6] presented an algorithm which does not use MRC. The number of iterations for each division problem is proportional to the magnitude of the divisor. The mean of the number of iterations is thus very high. Moreover, the hardware implementation suggested by Gamberger is very complicated and expensive due to the utilization of an auxiliary RNS. Hung and Parhami [11] introduced two RNS division algorithms based on the approximate-sign detection technique. The faster of the two [11] requires much more hardware than the slower one. Hung and Parhami [11] indicated that intermediate to these algorithms are a number of choices that offer speed/cost tradeoffs. Although the faster algorithm of Hung and Parhami [11] and the algorithm of Lu and Chiang [3] have the same time complexity, the latter has a better hardware complexity.

The most competitive work, introduced by Lu and Chiang [3], does not use MRC; however, it utilizes the fractional representation of $X/M$ to detect the parity of a residue number and hence to check whether an overflow has taken place. Lu and Chiang's algorithm requires $2 \log_2 Q$ steps, where $Q$ is the quotient. Each step consists of several residue additions and subtractions, one residue multiplication, two memory access cycles and one multi-operand addition (in fact, in parts II and IV of Lu and Chiang's algorithm, more than one multi-operand addition might be needed).
Realization II of the new algorithm requires $\log_2 Q$ steps, where each step consists of one residue multiplication ($Q_i Y$), one residue subtraction ($X - Q_i Y$) that is performed in parallel with one residue addition ($Q = Q + Q_i$), one multi-operand addition and two memory access cycles: one to get the fractional representations of the residue digits, and the other to obtain $Q_i$.

NOTE. Following the prevalent method in the literature for measuring the execution time of residue arithmetic algorithms [4, 5, 12, 13], the mean number of basic residue arithmetic operations needed by each algorithm is computed. The basic residue operations are: addition, subtraction and multiplication.

The following notational convention has been adopted for this paper:

$\{m_1, m_2, \ldots, m_N\}$: moduli set of $N$ pairwise relatively prime positive integers.
$M = \prod_{i=1}^{N} m_i$: dynamic range.
For any integer $X \in [0, M)$, the residue representation of $X$ is $X \xrightarrow{\mathrm{RNS}} (x_1, x_2, \ldots, x_N)$.
$x_i = |X|_{m_i}$, i.e. $x_i = X \pmod{m_i}$.
$\hat{m}_i = M/m_i$.
$|1/\hat{m}_i|_{m_i}$: multiplicative inverse of $\hat{m}_i$ (i.e. $\big|\, |1/\hat{m}_i|_{m_i} \, \hat{m}_i \big|_{m_i} = 1$).
$\lceil \cdot \rceil$: the ceiling value of $(\cdot)$; that is, the next integer greater than or equal to $(\cdot)$.
$\lfloor \cdot \rfloor$: the floor value of $(\cdot)$; that is, the preceding integer less than or equal to $(\cdot)$.

Define a function $h(I)$ such that:

$$h(I) = \begin{cases} 1 + \lfloor \log_2 I \rfloor, & \text{if } I \text{ is an integer} > 0 \\ 0, & \text{if } I = 0 \\ \lfloor \log_2 I \rfloor, & \text{if } 0 < I < 1. \end{cases}$$

2. DIVISION ALGORITHM

Assume that $X$, $Y$ and $Q$ are non-negative integers such that $Q = \lfloor X/Y \rfloor$, $Y \neq 0$. The following steps introduce the basic idea for division in RNS:

1. Set the quotient $Q$ to zero; $Q = 0$.
2. Find the position of the most-significant non-zero bit in the divisor $Y$, say $k$; that is, $k = h(Y)$.
3. Find the position of the most-significant non-zero bit in the dividend $X$, say $j$; that is, $j = h(X)$.
4. If $j > k$, then: $Q' = Q + 2^{j-k-1}$, $X' = X - 2^{j-k-1} Y$, $Q = Q'$, $X = X'$. Go to Step 3.
5. If $j = k$, then: $X' = X - Y$, $j' = h(|X'|_M)$, so if $j' < j$ then $Q = Q + 1$. Otherwise, $Q$ is unchanged. In either case, end the procedure.
6. If $j < k$, then $Q$ is unchanged. End the procedure.

This basic algorithm can be used effectively with RNS division arithmetic.
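As a rough illustration, Steps 1 to 6 can be sketched in Python on ordinary integers. This is only a sketch under stated assumptions: the names `h` and `divide` are hypothetical, the operands are kept in weighted (binary) form instead of as residue digits, and the argument `m` stands for the dynamic range $M$, which is needed only in Step 5 to form $|X - Y|_M$.

```python
def h(i: int) -> int:
    """Most-significant non-zero bit position: 1 + floor(log2 i), and 0 for i == 0."""
    return i.bit_length()


def divide(x: int, y: int, m: int) -> int:
    """Return floor(x / y) for 0 <= x < m and 0 < y < m, following Steps 1-6."""
    if not (0 <= x < m and 0 < y < m):
        raise ValueError("operands must satisfy 0 <= x < m and 0 < y < m")
    q = 0                                # Step 1
    k = h(y)                             # Step 2
    while True:
        j = h(x)                         # Step 3
        if j > k:                        # Step 4: partial quotient 2^(j-k-1)
            q_i = 1 << (j - k - 1)
            q += q_i
            x -= q_i * y                 # remainder stays positive
        elif j == k:                     # Step 5: decide via h(|x - y|_m)
            if h((x - y) % m) < j:
                q += 1
            return q
        else:                            # Step 6: x < y, nothing left to add
            return q


M = 7 * 11 * 13 * 15                     # dynamic range of the moduli set (7, 11, 13, 15)
print(divide(827, 15, M))                # prints 55
```

For the call shown, the partial quotients are 32, 16, 4, 2 and 1, which sum to 55; the loop then stops in Step 6 because the remainder 2 has fewer significant bits than the divisor.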
An important feature of this algorithm is the selection of the partial quotient to be $2^{j-k-1}$; hence the quantity $(X - 2^{j-k-1} Y)$ is guaranteed to be non-negative as long as $X > Y$. It should be emphasized that $X$ and $Y$ in the above algorithm are binary representations of non-negative integers and that the algorithm is still correct when the fractional representation is adopted. The proofs of both cases, integer and fractional representations, are introduced in the next two subsections.

2.1. Proof of correctness of the algorithm: integer representation

Before proving the algorithm, the following lemma has to be introduced.

LEMMA 1. For any residue integers $X, Y \in [0, M)$ for which $j = h(X)$, $k = h(Y)$ and $j = k \neq 0$, then $j > j'$ if $X \geq Y$ and $j \leq j'$ if $X < Y$, where $j' = h(|X - Y|_M)$.

Proof. Since $j = k$, then $X$ and $Y$ can be expressed as $X = 2^{j-1} + a$ and $Y = 2^{j-1} + b$, where $0 \leq a, b \leq 2^{j-1} - 1$.

For the case $X \geq Y$ (i.e. $a \geq b$): $|X - Y|_M = X - Y \geq 0$, then $X - Y = a - b$, but since $0 \leq a - b \leq 2^{j-1} - 1$, then $j' = h(X - Y) = h(a - b) < j$.

For the case $X < Y$ (i.e. $a < b$): $|X - Y|_M = M + X - Y$, so $j' = h(M + X - Y) = h(M + a - b)$. The minimum value of $j'$ happens when $a = 0$, $b = (2^{j-1} - 1)$ and $M = Y + 1$. Upon substituting these values: $j' \geq h(2^{j-1} + 1) = j$, or equivalently $j' \geq j$.

Based on the above lemma, the proof of the algorithm is as follows. For the case $j > k$, since $Y < 2^k$ and $X \geq 2^{j-1}$, then $X/Y > 2^{j-k-1} = Q_i$ (i.e. $Q_i$ is the $i$th partial quotient). Hence the estimate of the quotient in each iteration is guaranteed to produce a positive remainder. Assume there are $v$ iterations which satisfy the condition $j > k$; then the total of the partial quotients resulting from this case is $\sum_{i=1}^{v} Q_i$.

For the case $j = k$ (i.e. the $(v+1)$th iteration), two possibilities are expected:

1. $X < Y$. This case is detected according to Lemma 1 by evaluating $j' = h(|X - Y|_M)$. Hence, if $j' \geq j$ then $Q_{v+1} = 0$. The procedure is then stopped.
2. $X \geq Y$. This case is detected according to Lemma 1 by evaluating $j' = h(|X - Y|_M)$. Hence, if $j' < j$ then $Q_{v+1} = 1$. The procedure is then stopped.

For the case $j < k$, it is obvious that $X < Y$, hence the procedure has to be stopped. Therefore, the quotient would be

$$Q = \lfloor X/Y \rfloor = \sum_{i=1}^{v} Q_i + Q_{v+1}, \qquad Q_{v+1} = \begin{cases} 1, & \text{if } j_{v+1} > j' \\ 0, & \text{otherwise} \end{cases}$$

where $j_{v+1} = h(X)$ in the $(v+1)$th iteration (i.e. when $j = k$).

2.2. Proof of correctness of the algorithm: fractional representation

LEMMA 2. In RNS, for any fractional representations $X/M$, $Y/M$ where $X, Y \in [0, M)$, $j = h(X/M)$, $k = h(Y/M)$ and $j = k$, then $j' \neq -1$ if $X \geq Y$, and $j' = -1$ if $X < Y$, where $j' = h(|X - Y|_M / M)$.

Proof. Since $X/M$ and $Y/M$ are of the same order (i.e. $j = k$), then $2^j \leq X/M < 2^{j+1}$ and similarly $2^j \leq Y/M < 2^{j+1}$.

For the case $X \geq Y$: $|X - Y|_M = X - Y \geq 0$, then $0 \leq (X - Y)/M < 2^j$. Hence $j' = h((X - Y)/M) < j$. Since the highest value of $j$ is $-1$, then $j' \leq -2$. Note that for the special case $X = Y$, $(X - Y)/M = 0$, and by definition $h(0) = 0$. Consequently, if $X \geq Y$ then $j' \neq -1$.

For the case $X < Y$: $|X - Y|_M = M - (Y - X)$, then $j' = h((M + X - Y)/M)$, or $j' = h(1 - (Y - X)/M)$. But since $X < Y$, then $0 < (Y - X)/M < 2^j$, or $j' \geq h(1 - 2^j)$. Since $X, Y < M$, then the maximum value of $j$ is $-1$.
Therefore, $0 > j' \geq h(1 - 2^{-1})$, or $j' = -1$.

The proof of the algorithm for the case when the dividend and divisor are fractional quantities uses Lemma 2 and follows the same approach given in the proof for the integer representation. The proof leads to the result that the quotient $Q$ can be expressed as

$$Q = \sum_{i=1}^{v} Q_i + Q_{v+1}, \qquad Q_{v+1} = \begin{cases} 1, & \text{if } j' \neq -1 \\ 0, & \text{otherwise.} \end{cases}$$

3. REALIZATION OF THE ALGORITHM IN RNS

3.1. Realization I

This realization is based on the integer representation outlined in the previous section. It is quite useful for small and medium dynamic ranges where all the bits of the residue digits can be applied to a single RAM in order to evaluate $h(I)$. The proposed structure for Realization I, shown in Figure 1, can be described as follows: by applying $Y$ to RAM 1, $k = h(Y)$ is evaluated, where $k$ is expressed in $\beta$ bits, $\beta = \lceil \log_2(\log_2 M) \rceil$. Similarly, by applying $X$ to RAM 1, $j = h(X)$ is also evaluated, where $j$ is also expressed in $\beta$ bits. The partial quotient $Q_i$ is computed by applying $j$ and $k$ to RAM 2. A residue multiplier then multiplies the partial quotient by $Y$. The output of the multiplier, i.e. $Q_i Y$, is subtracted from $X$ to produce a new remainder. The procedure is repeated, and the residue adder accumulates the partial quotients, until $j < k$.

RAM 1 accepts $\sum_{i=1}^{N} n_i$ bits, where $n_i = \lceil \log_2 m_i \rceil$ address lines (i.e. the bits of the residue representation of $X$ or $Y$), thus it has a size of $(2^{\sum n_i} \times \beta)$ bits. However, RAM 2 accepts the bits of $j$ and $k$ (a total of $2\beta$ bits), thus its size would be $(2^{2\beta} \times \sum n_i)$. In the case that $j - k$ is first evaluated before being applied to RAM 2, then RAM 2 would have the size $(2^{\beta+1} \times \sum n_i)$. This realization requires $\log_2 Q$ iterations. Each iteration consists of two consecutive memory cycles followed by two consecutive residue operations. The realization is very attractive for many digital-signal processing applications which utilize small and medium dynamic ranges.

3.2. Realization II

This realization is based on the fractional representation outlined in the previous section.
It is quite useful for large dynamic ranges where the bits of the residue digits cannot be applied to a single RAM in order to evaluate $h(I)$. Van Vu [15] developed a conversion technique based on the Chinese remainder theorem (CRT). This technique uses the fractional representation of weighted numbers and is given by

$$\frac{X}{M} = \sum_{i=1}^{N} \frac{1}{m_i} \left| \frac{x_i}{\hat{m}_i} \right|_{m_i} - p \qquad (1)$$

where $p$ is a non-negative integer and $|x_i/\hat{m}_i|_{m_i}$ denotes $x_i$ multiplied by the multiplicative inverse of $\hat{m}_i$, reduced modulo $m_i$. Hence, the value of $X/M$ can be obtained by evaluating the right-hand side of (1). This is basically done by letting each $x_i$ address a table which stores $\frac{1}{m_i} |x_i/\hat{m}_i|_{m_i}$. The outputs of these $N$ tables can be added using a multi-operand adder. Any integer overflow resulting from this adder is disregarded since it represents the integer part of the summation result. The fractional values stored in the tables should be expressed using $t$ bits, where $t \geq \lceil \log_2 MN \rceil$ if $M$ is odd and $t \geq \lceil \log_2 MN \rceil - 1$ otherwise [15].

FIGURE 1. Proposed implementation of the algorithm (Realization I). [Blocks: RAM 1, RAM 2, residue multiplier, residue subtractor, residue adder; signals j and k; output Q.]

FIGURE 2. Proposed implementation of the algorithm (Realization II). [Blocks: RAM 1 to RAM N, multi-operand adder, encoder, RAM, residue multiplier, residue subtractor, residue adder; signals j and k; output Q.]

The proposed structure for Realization II, shown in Figure 2, can be described as follows: apply the residue digits of the divisor $Y$ to the fractional representation circuit to obtain $Y/M$. This requires a memory cycle followed by a multi-operand addition. $Y/M$ is applied to a priority encoder to evaluate $k = h(Y/M)$. Next, using the same approach, $j = h(X/M)$ is evaluated. The bits of $j$ and $k$ (a total of $2\beta$ bits, where $\beta$ is the number of bits used to express each of them), or their difference, are applied to a RAM to produce the partial quotient $Q_i$. This $Q_i$ is applied to a residue multiplier to compute $Q_i Y$. The output of the multiplier ($Q_i Y$) is subtracted from $X$ using a residue subtractor. The output of the residue subtractor is applied again to the fractional representation circuit as long as $j \geq k$.

4. EVALUATION

The complexity of the proposed residue-based divider is highly dependent on the complexity of the individual components being used in the design (e.g. residue multiplier, adder, etc.). Many residue-based multipliers can be found in the literature [16, 17, 18, 19]. These differ in their structures and in their area and time complexities (i.e. gate number, silicon area, time delay, etc.). The same statement can be made about other components in this proposed divider [2].

For example, the area and time complexities of the priority encoder presented in [20] are given by $O(n)$ and $O(\log n)$, respectively. The proposed design in Figure 2 consists of different devices; namely RAMs, a multi-operand binary adder, a priority encoder followed by a RAM, a residue-based adder, subtractor and multiplier. For the first $N$ RAMs, the size of the $i$th RAM is $(2^{n_i} \times n)$, where $n_i = \lceil \log_2 m_i \rceil$. The multi-operand adder accepts $N$ operands of $n$ bits each. The least significant (LS) $n$ output bits of this adder constitute $X/M$. The encoder accepts these LS $n$ bits of the adder and produces $h(I)$, expressed in $\beta$ bits (with $\beta$ as defined for Realization I). The RAM following the encoder outputs the residue representation of the estimated quotient. This estimated quotient is then applied to the different residue-based arithmetic components. Each of these components accepts $N$ residue digits from each operand, where each residue digit is expressed in $n_i$ bits.

To compare the performances of this algorithm and Lu and Chiang's algorithm, computer programs simulating both algorithms have been developed to calculate the mean of residue operations (MORO). The mean of multi-operand additions (MOMA) has been calculated and compared, where it applies. For simulation purposes, three moduli sets were selected to serve different dynamic ranges. These sets are: M1 = (7, 11, 13, 15), M2 = (11, 13, 15, 19, 23, 29, 31) and M3 = (29, 31, 43, 47, 53, 55, 59, 61, 63). Simulation results are listed in Table 1.

TABLE 1. Simulation results.

              MORO                        MOMA
Moduli set    Ours    Lu and Chiang's     Ours     Lu and Chiang's
M1            2.68    0.34                0        0
M2            2.67    0.36                3.04     5.496
M3            2.72    0                   3.089    5.504

For moduli set M1, the results are exact: all possible combinations of dividends and divisors were simulated. However, for M2 and M3, a sample of 200 million randomly generated numbers within the dynamic range defined by each moduli set was simulated. The simulation of the new algorithm is based on Figure 1 for M1, and on Figure 2 for M2 and M3. Simulation of Lu and Chiang's algorithm is based on the flowchart given in [3].
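The fractional-representation step on which Realization II (Figure 2) relies can be illustrated with a short Python sketch of Equation (1). The function names are hypothetical, exact rational arithmetic (`fractions.Fraction`) stands in for the $t$-bit lookup tables, and the built-in `pow(a, -1, m)` (Python 3.8 or later) supplies the multiplicative inverse; none of this is taken from the original simulation programs.

```python
from fractions import Fraction
from math import floor, prod


def fractional_representation(residues, moduli):
    """X/M from the residue digits via Equation (1); the integer part p is discarded."""
    big_m = prod(moduli)
    total = Fraction(0)
    for x_i, m_i in zip(residues, moduli):
        m_hat = big_m // m_i
        inv = pow(m_hat, -1, m_i)                    # multiplicative inverse of m_hat mod m_i
        total += Fraction((x_i * inv) % m_i, m_i)    # one "table entry" per residue digit
    return total - floor(total)                      # keep only the fractional part


def h_frac(f):
    """floor(log2 f) for 0 < f < 1, and 0 for f == 0, computed without floating point."""
    if f == 0:
        return 0
    e = f.numerator.bit_length() - f.denominator.bit_length()
    return e if Fraction(2) ** e <= f else e - 1


moduli = (16, 15, 7)
x = 312
residues = tuple(x % m for m in moduli)              # (8, 12, 4)
f = fractional_representation(residues, moduli)
print(f == Fraction(x, prod(moduli)), h_frac(f))     # prints: True -3
```

In hardware the table entries are truncated to $t$ bits, so the sum is only an approximation of $X/M$; the sketch keeps exact fractions so that the identity in Equation (1) can be checked directly.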
For the moduli set M1, Table 1 indicates that the new algorithm is four times faster than Lu and Chiang's algorithm. This conclusion applies to every moduli set where all the bits of the residue digits can be applied simultaneously to a single RAM. For other moduli sets like M2 and M3, which have very large dynamic ranges, the new algorithm is still four times faster regarding the number of basic residue operations. Moreover, the average number of multi-operand additions needed is almost half that needed by the other algorithm.

5. CUSTOMIZED DIVISION ALGORITHM

In Realization II, determining the position of the highest power of two contained in any residue number $I$, which we referred to as $h(I)$, is an important time-delay element in the operation of the proposed residue divider. In this section, we customize the same algorithm to serve two specific moduli sets: $(2^k, 2^k - 1, 2^{k-1} - 1)$ and $(2^k + 1, 2^k, 2^k - 1)$. This customization eliminates the need for ROMs and thus reduces the delay contributed by evaluating $h(I)$.

5.1. Evaluating h(I) for the moduli set $(2^k, 2^k - 1, 2^{k-1} - 1)$

Define $m_1 = 2^k$, $m_2 = 2^k - 1$, $m_3 = 2^{k-1} - 1$. Then $\hat{m}_1 = (2^k - 1)(2^{k-1} - 1)$, $\hat{m}_2 = 2^k (2^{k-1} - 1)$, $\hat{m}_3 = 2^k (2^k - 1)$. The residue representation of $X$ is $(x_1, x_2, x_3)$. The multiplicative inverses for $\hat{m}_1$, $\hat{m}_2$ and $\hat{m}_3$ are [2]: $2^{k-1} + 1$, $2^k - 3$ and $2^{k-2}$, respectively. Substituting the corresponding values of $\hat{m}_i$ and their multiplicative inverses in (1):

$$\frac{X}{M} = \frac{1}{2^k} \left| (2^{k-1} + 1) x_1 \right|_{2^k} + \frac{1}{2^k - 1} \left| (2^k - 3) x_2 \right|_{2^k - 1} + \frac{1}{2^{k-1} - 1} \left| 2^{k-2} x_3 \right|_{2^{k-1} - 1} - p. \qquad (2)$$

Since $X < M$, then $X/M < 1$. Hence

$$\frac{X}{M} = \mathrm{FRAC}\left( \frac{1}{2^k} \left| (2^{k-1} + 1) x_1 \right|_{2^k} + \frac{1}{2^k - 1} \left| (2^k - 3) x_2 \right|_{2^k - 1} + \frac{1}{2^{k-1} - 1} \left| 2^{k-2} x_3 \right|_{2^{k-1} - 1} \right) \qquad (3)$$

where FRAC(...) denotes the fractional part of the operand. The circular shift property [3] states that modulo $(2^p - 1)$ multiplication of an integer by $2^n$, where $p$ and $n$ are positive integers, is equivalent to an $n$-bit circular left shift (e.g. $|2^3 \cdot 27|_{31}$: a 3-bit circular left shift of binary 11011 gives 11110, i.e. decimal 30). Therefore, to simplify the terms on the right-hand side (RHS) of (3), we proceed as follows:

Evaluate $(1/2^k) \left| (2^{k-1} + 1) x_1 \right|_{2^k}$.
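Assuming that the binary form of $x_1$ is given by $b_{1(k-1)} b_{1(k-2)} \ldots b_{11} b_{10}$, then $|(2^{k-1} + 1) x_1|_{2^k} = |2^{k-1} x_1 + x_1|_{2^k}$, which in binary is

$$\Big| \, b_{1(k-1)} b_{1(k-2)} \ldots b_{11} b_{10} \underbrace{00 \ldots 0}_{(k-1)\ \text{zeros}} + \; b_{1(k-1)} b_{1(k-2)} \ldots b_{12} b_{11} b_{10} \Big|_{2^k}.$$

Recalling that for mod $2^k$ only the LS $k$ bits are significant, then $|(2^{k-1} + 1) x_1|_{2^k} = b_x b_{1(k-2)} \ldots b_{11} b_{10}$, where $b_x = (b_{10} \text{ XOR } b_{1(k-1)})$. Now, let $R_1$ represent the binary form of $(1/2^k) |(2^{k-1} + 1) x_1|_{2^k}$; then $R_1$ is obtained by multiplying the binary form of $1/2^k$ by that of $|(2^{k-1} + 1) x_1|_{2^k}$. That is,

$$R_1 = 0.b_x b_{1(k-2)} \ldots b_{11} b_{10}. \qquad (4)$$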

Evaluate $[1/(2^k - 1)] \left| (2^k - 3) x_2 \right|_{2^k - 1}$.

Assuming that the binary form of $x_2$ is given by $b_{2(k-1)} b_{2(k-2)} \ldots b_{21} b_{20}$, then $|(2^k - 3) x_2|_{2^k - 1} = |2^k x_2 - 3 x_2|_{2^k - 1}$. But since $|2^k x_2|_{2^k - 1} = |x_2|_{2^k - 1}$, then $|(2^k - 3) x_2|_{2^k - 1} = |-(2 x_2)|_{2^k - 1}$. Based on the circular left-shift property, $|2 x_2|_{2^k - 1}$ has the binary form $b_{2(k-2)} b_{2(k-3)} \ldots b_{21} b_{20} b_{2(k-1)}$. Or equivalently,

$$\left| -(b_{2(k-2)} b_{2(k-3)} \ldots b_{21} b_{20} b_{2(k-1)}) \right|_{2^k - 1} = \underbrace{(11 \ldots 1)}_{k\ \text{ones}\ (= 2^k - 1)} - \; (b_{2(k-2)} b_{2(k-3)} \ldots b_{21} b_{20} b_{2(k-1)}).$$

Assuming $x_2 \neq 0$, then $|(2^k - 3) x_2|_{2^k - 1}$ has the binary form $\bar{b}_{2(k-2)} \bar{b}_{2(k-3)} \ldots \bar{b}_{21} \bar{b}_{20} \bar{b}_{2(k-1)}$, where $\bar{b}$ denotes the complement of the bit $b$. However, if $x_2 = 0$, then $|(2^k - 3) x_2|_{2^k - 1} = 0$.

On the other hand, the term $1/(2^k - 1)$ can be written as $2^{-k}/(1 - 2^{-k})$. Recall that any fraction in the form $q/(1 - q)$, where $q < 1$, can be expanded in a power series as $q/(1 - q) = \sum_{i=1}^{\infty} q^i$. Therefore $2^{-k}/(1 - 2^{-k}) = 2^{-k} + 2^{-2k} + 2^{-3k} + 2^{-4k} + \ldots$ Based on the error analysis introduced in [15], the MS $(3k + 1)$ bits are the only significant bits in our computations. Let $R_2$ represent the binary form of $[1/(2^k - 1)] |(2^k - 3) x_2|_{2^k - 1}$; then $R_2$ is obtained by considering the MS $(3k + 1)$ bits of multiplying $1/(2^k - 1)$ by $|(2^k - 3) x_2|_{2^k - 1}$. That is,

$$R_2 = 0.\underbrace{\bar{b}_{2(k-2)} \bar{b}_{2(k-3)} \ldots \bar{b}_{21} \bar{b}_{20} \bar{b}_{2(k-1)}}_{k\ \text{bits}} \,\|\, \underbrace{\bar{b}_{2(k-2)} \bar{b}_{2(k-3)} \ldots \bar{b}_{21} \bar{b}_{20} \bar{b}_{2(k-1)}}_{k\ \text{bits}} \,\|\, \underbrace{\bar{b}_{2(k-2)} \bar{b}_{2(k-3)} \ldots \bar{b}_{21} \bar{b}_{20} \bar{b}_{2(k-1)}}_{k\ \text{bits}} \,\|\, \bar{b}_{2(k-2)} \qquad (5)$$

where $\|$ implies that the terms are concatenated.

Evaluate $[1/(2^{k-1} - 1)] \left| 2^{k-2} x_3 \right|_{2^{k-1} - 1}$.

Assuming that the binary form of $x_3$ is given in $(k - 1)$ bits by $b_{3(k-2)} b_{3(k-3)} \ldots b_{31} b_{30}$, then based on the circular left-shift property, $|2^{k-2} x_3|_{2^{k-1} - 1}$ has the binary form $b_{30} b_{3(k-2)} b_{3(k-3)} \ldots b_{31}$. On the other hand, the term $1/(2^{k-1} - 1)$ can be written as $2^{-(k-1)}/(1 - 2^{-(k-1)})$. Thus, it can be expanded in a power series as $2^{-(k-1)}/(1 - 2^{-(k-1)}) = 2^{-(k-1)} + 2^{-2(k-1)} + 2^{-3(k-1)} + 2^{-4(k-1)} + \ldots$ Now, let $R_3$ be the binary form of $[1/(2^{k-1} - 1)] |2^{k-2} x_3|_{2^{k-1} - 1}$; then $R_3$ is obtained by considering the MS $(3k + 1)$ bits of multiplying $1/(2^{k-1} - 1)$ by $|2^{k-2} x_3|_{2^{k-1} - 1}$. That is,

$$R_3 = 0.\underbrace{b_{30} b_{3(k-2)} \ldots b_{32} b_{31}}_{(k-1)\ \text{bits}} \,\|\, \underbrace{b_{30} b_{3(k-2)} \ldots b_{32} b_{31}}_{(k-1)\ \text{bits}} \,\|\, \underbrace{b_{30} b_{3(k-2)} \ldots b_{32} b_{31}}_{(k-1)\ \text{bits}} \,\|\, \underbrace{b_{30} b_{3(k-2)} b_{3(k-3)} b_{3(k-4)}}_{4\ \text{bits}}. \qquad (6)$$

Therefore, (3) can be rewritten as

$$\frac{X}{M} = \mathrm{FRAC}(R_1 + R_2 + R_3). \qquad (7)$$

Thus $h(X/M)$ is nothing but the position of the MS non-zero bit of $X/M$.

EXAMPLE. Consider the moduli set {16, 15, 7}. To find $h(X/M)$ where $X = (8, 12, 4)$ (i.e. $X = 312$ and $M = 1680$), then:

$x_1 = 8$, binary 1000, so $R_1 = 0.1000$
$x_2 = 12$, binary 1100, so $R_2 = 0.0110\ 0110\ 0110\ 0$
$x_3 = 4$, binary 100, so $R_3 = 0.010\ 010\ 010\ 010\ 0$.

Therefore, $X/M = \mathrm{FRAC}(R_1 + R_2 + R_3) = 0.001011111000$. Since the MS non-zero bit is in the third location, then $h(X/M) = -3$.

In order to implement (7), one three-operand binary adder is needed. A carry-save adder (CSA) followed by a carry-propagate adder (CPA) can realize the addition of the three operands.

5.2. Evaluating h(I) for the moduli set $(2^k + 1, 2^k, 2^k - 1)$

The residue decoder introduced by Sweidan and Hiasat [22] has the advantages of reduced hardware requirements and extremely wide fixed-point dynamic ranges, since its upper bound is not limited by a memory size. Moreover, it requires only a total of four $2k$-bit binary adders, which makes it very attractive compared to other published decoders [23, 24]. In this paper, we propose a hardware layout that can decode the residue digits of the moduli set $(2^k + 1, 2^k, 2^k - 1)$ into their binary equivalent. The new layout is an improvement of that presented in [2]. In this new contribution, we reduce the number of adders needed for the decoding operation from four $2k$-bit binary adders to one $2k$-bit three-operand binary adder. It has been proved in [2] that

$$\lfloor X/2^k \rfloor = \left| A + B + C - x_1 \right|_{2^{2k} - 1} \qquad (8)$$

where:

$A = \left| (2^{2k-1} + 2^{k-1}) x_3 \right|_{2^{2k} - 1}$
$B = \left| (2^{2k} - 2^k - 1) x_2 \right|_{2^{2k} - 1}$
$C = \left| (2^{2k-1} + 2^{k-1}) x_1 \right|_{2^{2k} - 1}$.

Assuming that $x_1$, $x_2$ and $x_3$ have the following binary format:

$x_1 = b_{1k} b_{1(k-1)} \ldots b_{11} b_{10}$
$x_2 = b_{2(k-1)} b_{2(k-2)} \ldots b_{21} b_{20}$
$x_3 = b_{3(k-1)} b_{3(k-2)} \ldots b_{31} b_{30}$,

then using the circular left-shift property, $A$, $B$ and $C$ can be expressed as [2]:

$A = b_{30} b_{3(k-1)} \ldots b_{32} b_{31} \, b_{30} b_{3(k-1)} \ldots b_{32} b_{31}$
$B = \bar{b}_{2(k-1)} \bar{b}_{2(k-2)} \ldots \bar{b}_{21} \bar{b}_{20} \underbrace{11 \ldots 1}_{k\ \text{ones}}$
$C = b_x b_{1(k-1)} \ldots b_{12} b_{11} \, b_x b_{1(k-1)} \ldots b_{12} b_{11}$

where $b_x = b_{10} \text{ OR } b_{1k}$. By redefining $R_1 = A$, $R_2 = B - x_1$ and $R_3 = C$, then (8) can be rewritten as

$$\lfloor X/2^k \rfloor = \left| R_1 + R_2 + R_3 \right|_{2^{2k} - 1}. \qquad (9)$$

Case I: Since $R_1 = A$, then the binary representation is the same:

$$R_1 = b_{30} b_{3(k-1)} b_{3(k-2)} \ldots b_{32} b_{31} \, b_{30} b_{3(k-1)} b_{3(k-2)} \ldots b_{32} b_{31}. \qquad (10)$$

Case II: Since $R_2 = B - x_1$, then for the case $x_1 < 2^k$, and using the 2's complement notation, $R_2 = B + (\text{1's complement of } x_1) + 1$. Noting that the LS $k$ bits of $B$ are all ones, then the LS $k$ bits of the result of the subtraction are simply the 1's complement of $x_1$, with an overflow of 1 at the $(k + 1)$th bit. Based on 2's complement, this overflow indicates that the result of the subtraction is positive, hence it can be disregarded. However, when $|x_1|_{2^k} = x_2 = 0$, then $R_2 = |2^{2k} - 1|_{2^{2k} - 1} = 0$. Therefore, $R_2$ can be expressed in binary format as

$$R_2 = \begin{cases} 0, & \text{if } |x_1|_{2^k} = x_2 = 0 \\ \bar{b}_{2(k-1)} \bar{b}_{2(k-2)} \ldots \bar{b}_{21} \bar{b}_{20} \, \bar{b}_{1(k-1)} \bar{b}_{1(k-2)} \ldots \bar{b}_{11} \bar{b}_{10}, & \text{otherwise.} \end{cases} \qquad (11)$$

Case III: $x_1 = 2^k$. In this case, the $(k + 1)$th bit of $x_1$ is 1, thus the values of $R_2$ and $B$ are the same, because in the computation of $R_2$ we used the LS $k$ bits of $x_1$, which are all zeros in this case. Therefore, the format of $R_2$ is not changed. However, to take care of this non-zero $(k + 1)$th bit of $x_1$, it has to be subtracted from $R_3$. Therefore

$$R_3 = \begin{cases} b_x b_{1(k-1)} \ldots b_{12} b_{11} \, b_x b_{1(k-1)} \ldots b_{12} b_{11}, & \text{if } x_1 < 2^k \\ \bar{b}_x \bar{b}_{1(k-1)} \ldots \bar{b}_{12} \bar{b}_{11} \, b_x b_{1(k-1)} \ldots b_{12} b_{11}, & \text{if } x_1 = 2^k. \end{cases} \qquad (12)$$

FIGURE 3. Proposed implementation of the division algorithm customized for the moduli sets $(2^k, 2^k - 1, 2^{k-1} - 1)$ and $(2^k + 1, 2^k, 2^k - 1)$. [Blocks: adder with inputs R1, R2, R3; encoder; RAM; residue multiplier, residue subtractor, residue adder; signals j and k; output Q.]
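Equations (8) and (9) can be checked numerically. The Python sketch below was written for that purpose and is not the paper's hardware: it evaluates $A$, $B$ and $C$ by ordinary modular arithmetic instead of bit rotations, forms $\lfloor X/2^k \rfloor$ modulo $2^{2k} - 1$, and appends the bits of $x_2$. The function name `decode` is hypothetical, and the coefficient of $B$ is taken to be $2^{2k} - 2^k - 1$, i.e. $-2^k$ modulo $2^{2k} - 1$.

```python
def decode(x1: int, x2: int, x3: int, k: int) -> int:
    """Recover X from its residues modulo (2^k + 1, 2^k, 2^k - 1)."""
    mod = (1 << (2 * k)) - 1                        # 2^(2k) - 1
    coeff = (1 << (2 * k - 1)) + (1 << (k - 1))     # 2^(2k-1) + 2^(k-1)
    a = (coeff * x3) % mod                          # A in Equation (8)
    b = ((mod - (1 << k)) * x2) % mod               # B: coefficient 2^(2k) - 2^k - 1
    c = (coeff * x1) % mod                          # C in Equation (8)
    r1, r2, r3 = a, b - x1, c                       # R1, R2, R3 of Equation (9)
    high = (r1 + r2 + r3) % mod                     # floor(X / 2^k)
    return (high << k) | x2                         # concatenate the k bits of x2


k = 4
x = 827
residues = (x % 17, x % 16, x % 15)                 # (11, 11, 2)
value = decode(*residues, k)
print(value, value.bit_length())                    # prints: 827 10
```

The second printed number is the position of the most-significant non-zero bit of the decoded value, which is the quantity the priority encoder extracts in the customized divider.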
Howeve, if I = 2 n, then the output cay would be significant and the pioity encode poposed in implementing the esidue divide can take cae of this special case. A single cay-save adde can add these thee opeands. Few logic gates ae also needed to detect the (k + )th bit of and to select the pope fomat of R 3. Using the fomula = /2 k 2 k + 2, then the value of can be obtained by concatenation of the k bits of 2 to the 2k output bits of the thee-opeand adde. These 3k bits and the cay ae applied to the pioity encode. EAMPLE. Conside the moduli set {7, 6, 5}. Tofind h() whee = (,, 2) (i.e. = 827), then: = binay 00, so R 3 = 0 0 2 = binay 0, so R 2 = 000 000 3 = 2 binay 000, so R = 000 000. Theefoe, adde output = (R + R 2 + R 3) = 00 000, whee the oveflow has been neglected. Since = /2 k 2 k + 2, then concatenation of bits of 2 to that of the adde yields: h() = h(00 000 0) = 0 (i.e. 0th position). Figue 3 shows the new poposed hadwae ealization of an RNS divide fo the moduli sets (2 k, 2 k, 2 k )

and $(2^k + 1, 2^k, 2^k - 1)$. The operation of this divider is self-explanatory. The propagation delay in Figure 3, as compared with that in Figure 2, has been reduced by a memory access cycle per iteration. Since the memory access cycle is very significant compared with the delay of the other components, and since division is an iterative procedure, this reduction becomes increasingly significant as the number of iterations per division problem increases. This implies that the new proposed realization is much faster for these particular moduli sets. Moreover, the reduction in hardware requirements is another substantial improvement.

FIGURE 4. Block diagram of the implemented divider.

6. VLSI IMPLEMENTATION OF A RESIDUE-BASED ARITHMETIC DIVIDER

A pipelined design for a residue-based arithmetic divider for the moduli set $(2^k + 1, 2^k, 2^k - 1)$ has been implemented, fabricated and tested. The detailed design of the implemented circuit is shown in Figure 4. Data path sizes are also shown. The implementation was accomplished using Octtools-5.2 with a standard cell MSU2.3 library. For prototype purposes, $k$ was selected to be four. Thus, the total number of input pins is 13 for each operand. The clock, an X/Y selector and reset are another three inputs. Similarly, the output quotient is expressed in 13 bits. Division-completed (DIVC) is a one-bit output that goes high to validate the output quotient and sets the flag that the division process is completed. Division-by-zero (DBZ) is another output bit which sets a flag if the divisor is zero. The design has an integrated circuit area of (1.792 × 1.675) mm². The tiny padframe (40PC22X22) was used to accommodate this design. Test results showed that the design can run at a clock speed of 5 MHz.
The number of clock cycles required for each division problem depends on both the dividend and the divisor. However, the average number over different division problems is eight clock cycles.

7. CONCLUSIONS

This paper has presented a new general division algorithm for RNS, which is faster than other previously proposed algorithms. The algorithm was then customized to serve two specific moduli sets: $(2^k, 2^k - 1, 2^{k-1} - 1)$ and $(2^k + 1, 2^k, 2^k - 1)$. An RNS divider would then require only a binary adder, a priority encoder, a ROM, a residue adder, a residue subtractor and a residue multiplier. The reduced hardware requirements and processing time make the new realization very practical for many computing applications and therefore enable RNS to play an increased role in designing arithmetic logic units for general purpose computers. The proposed customized hardware has been implemented on silicon and test results have been presented.

REFERENCES

[1] Hiasat, A. and Abdel-Aty-Zohdy, H. (1997) Design and implementation of an RNS division algorithm. Proc. 13th Symp. Computer Arithmetic (Asilomar, CA), pp. 240–249.
[2] Hiasat, A. and Abdel-Aty-Zohdy, H. (1995) High-speed division algorithm for residue number system. Proc. 1995 IEEE International Symposium on Circuits and Systems (ISCAS), vol. 3, pp. 1996–1999.
[3] Lu, M. and Chiang, J. (1992) A novel division algorithm for residue number system. IEEE Trans. Comput., 41, 1026–1032.

[4] Banerji, D., Cheung, T. and Ganesan, V. (1981) A high-speed division method in residue arithmetic. Proc. 5th IEEE Symp. Comput. Arithmetic, pp. 158–164.
[5] Chren Jr., W. (1990) A new residue number division algorithm. Computer Math. Appl., 19, 13–29.
[6] Gamberger, D. (1991) New approach to integer division in residue number system. Proc. 10th Symp. Comput. Arith., pp. 84–91.
[7] Kier, Y., Cheney, P. and Tannenbaum, M. (1962) Division and overflow detection in residue number systems. IRE Trans. Electron. Comput., 11, 501–507.
[8] Lin, L., Leiss, E. and McInnis, B. (1984) Division and sign detection algorithm for residue number systems. Comput. Math. Appl., 10, 331–342.
[9] Kinoshita, E., Kosako, H. and Kojima, Y. (1973) General division in the symmetric residue number system. IEEE Trans. Computers, 22, 134–142.
[10] Hitz, M. and Kaltofen, E. (1995) Integer division in residue number system. IEEE Trans. Computers, 44, 240–248.
[11] Hung, C. and Parhami, B. (1994) An approximate sign detection method for residue numbers and its applications to RNS division. Computer Math. Appl., 27, 23–35.
[12] Soderstrand, M., Jenkins, W., Jullien, G. and Taylor, F. (eds) (1986) Residue Number System Arithmetic: Modern Applications in Digital Signal Processing. IEEE Press, New York.
[13] Szabo, N. and Tanaka, R. (1967) Residue Arithmetic and Its Applications to Computer Technology. McGraw Hill, New York.
[14] Jenkins, W. (1979) Recent advances in residue number techniques for recursive digital filtering. IEEE Trans. Acoust. Speech and Signal Processing, 27, 19–31.
[15] Van Vu, T. (1985) Efficient implementations of Chinese remainder theorem for sign detection and residue decoding. IEEE Trans. Comput., 34, 646–651.
[16] Hiasat, A. (1996) Semi-custom VLSI design for RNS multipliers using combinational logic approach. ICECS 96, 2, 935–938.
[17] Radhakrishnan, D. and Yuan, Y. (1992) Novel approaches to the design of VLSI RNS multipliers. IEEE Trans. Circ. Sys-II: Analog and Digital Signal Processing, 39, 52–57.
[18] Alia, G. and Martinelli, E. (1991) A VLSI modulo m multiplier. IEEE Trans. Comp., 40, 873–878.
[19] Elleithy, K. and Bayoumi, M. (1995) A systolic architecture for modulo multiplication. IEEE Trans. Circ. Sys-II: Analog and Digital Signal Processing, 42, 725–729.
[20] Wada, K., Hagihara, K. and Tokura, N. (1984) Area-time optimal fast implementations of several functions in a VLSI model. IEEE Trans. Computers, 33, 435–440.
[21] Hiasat, A. and Zohdy, H. (1998) Residue to binary converter for the moduli $(2^k, 2^k - 1, 2^{k-1} - 1)$. IEEE Trans. Circuits and Systems Part II, 45, 204–209.
[22] Sweidan, A. and Hiasat, A. (1988) New efficient memoryless, residue to binary converter. IEEE Trans. Circuits and Systems, 35, 1441–1444.
[23] Bernardson, B. (1985) Fast memoryless, over 64 bits, residue to decimal converter. IEEE Trans. Circuits and Systems, 32, 298–300.
[24] Ibrahim, K. and Saloum, S. (1988) An efficient residue to binary converter design. IEEE Trans. Circuits and Systems, 35, 1156–1158.