COMPARISON OF FPGA IMPLEMENTATION OF THE MOD M REDUCTION

Latin American Applied Research 37:93-97 (2007)

J-P. DESCHAMPS and G. SUTTER

Escola Tècnica Superior d'Enginyeria, Universitat Rovira i Virgili, Tarragona, Spain, jeanpierre.deschamps@urv.net; http://www.etse.urv.es

Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid, Spain, gustavo.sutter@ii.uam.es; http://www.ii.uam.es

Abstract: Several algorithms for computing $x \bmod m$ are presented, among others the reduction mod $B^k - a$, the pre-computation of $B^{ik} \bmod m$, a generalized version of the Barrett algorithm and a modified version of the same Barrett algorithm. The four mentioned algorithms, as well as the classical integer non-restoring division algorithm, have been synthesized and implemented within xc3s4000 components.

Keywords: Arithmetic in FPGA, Galois Field, Cryptography, modular operation.

I. INTRODUCTION

Arithmetic operations over the finite ring $Z_m = \{0, 1, \ldots, m-1\}$ are used as computation primitives for executing numerous cryptographic algorithms, especially those related to the use of public keys (asymmetric cryptography). Classical examples are ciphering/deciphering, authentication and digital signature protocols based on RSA-type or elliptic curve algorithms. One of the basic operations is the modulo $m$ reduction. Combined with operations over the set $Z$ of integers (sum, subtraction, product, and so on), it makes it possible to perform the same operations over $Z_m$. A straightforward solution consists of using an integer division algorithm. Nevertheless, more efficient algorithms have been proposed (Blake et al, 2002; Hankerson et al, 2004). In this paper several algorithms are described, namely the reduction mod $B^k - a$, the pre-computation of $B^{ik} \bmod m$, a generalized version of the Barrett algorithm and a modified version of the same Barrett algorithm. The four mentioned algorithms, as well as the classical integer non-restoring division algorithm, have been synthesized and implemented within xc3s4000 components.

II. ALGORITHM

In this section the following problem is studied: given two naturals $x$ and $m$, compute $z = x \bmod m$. Throughout, $x$ is an $n$-digit number and $m$ a $k$-digit number.

A. Integer division

A straightforward method consists of performing the integer division of $x$ by $m$, that is, $x = q \cdot m + z$, $z < m$. For that purpose, any division algorithm can be used, for example the non-restoring division algorithm (Deschamps et al, 2006).

Algorithm 1. Non-restoring reduction

    y := m*(2**(n-k));
    rems(1) := x - y;
    for i in 1 .. n-k loop
      if rems(i) < 0 then rems(i+1) := 2*rems(i) + y;
      else rems(i+1) := 2*rems(i) - y;
      end if;
    end loop;
    if rems(n-k+1) < 0 then z := rems(n-k+1)/(2**(n-k)) + m;
    else z := rems(n-k+1)/(2**(n-k));
    end if;

The core of the algorithm is an $(n-k)$-step iteration. If a ripple-carry $k$-bit adder-subtractor is used, the computation time is about

$$\mathrm{time}(n,k) \approx (n-k) \cdot k \cdot T_{FA}, \tag{1}$$

where $T_{FA}$ is the delay of a full-adder.

B. Reduction mod $B^k - a$

Assume that $B^{k-1} \le m < B^k$, where $B$ is a natural number $\ge 2$. Then $m = B^k - a$ where $1 \le a \le B^k - B^{k-1}$. Compute the following quotients $q_i$ and remainders $r_i$:

$$x = q_0 B^k + r_0, \quad q_0 a = q_1 B^k + r_1, \quad q_1 a = q_2 B^k + r_2, \quad \ldots, \quad q_{s-2}\, a = q_{s-1} B^k + r_{s-1}. \tag{2}$$

Multiply the second equation of (2) by $(B^k/a)$, the third one by $(B^k/a)^2$, ..., the last one by $(B^k/a)^{s-1}$, and sum up the $s$ equations; the result is

$$x = r_0 + r_1 (B^k/a) + r_2 (B^k/a)^2 + \ldots + r_{s-1} (B^k/a)^{s-1} + q_{s-1} B^k (B^k/a)^{s-1}. \tag{3}$$

As $a < B^k$, that is, $B^k/a > 1$, there exists a value of $s$ such that

$$x < B^k (B^k/a)^{s-1}, \tag{4}$$

and thus $q_{s-1} = 0$. Let $s$ be the smallest value for which $q_{s-1} = 0$. Notice that if $r_{s-1} = 0$ then the last equation
of (2), with $q_{s-1} = 0$, is $q_{s-2} \cdot a = 0$, that is, $q_{s-2} = 0$, so that $s$ would not be the smallest value for which $q_{s-1} = 0$. Thus

$$x = r_0 + r_1 (B^k/a) + r_2 (B^k/a)^2 + \ldots + r_{s-1} (B^k/a)^{s-1}, \quad \text{with } r_{s-1} > 0. \tag{5}$$

By summing up the $s$ equations of system (2), with $q_{s-1} = 0$, the following relation is obtained:

$$x = q_0 (B^k - a) + q_1 (B^k - a) + \ldots + q_{s-2} (B^k - a) + r_0 + r_1 + \ldots + r_{s-1}. \tag{6}$$

Define

$$r = r_0 + r_1 + \ldots + r_{s-1}. \tag{7}$$

According to (6) and (7),

$$x \equiv r \bmod m, \quad \text{with } m = B^k - a. \tag{8}$$

Comparing (5) with (7), it is obvious that if $s > 1$, that is, if $x \ge B^k$, then $r < x$. If $r$ is still greater than or equal to $B^k$, the same method can be used in order to get $r' \equiv x \bmod m$, with $r' < r$. After a finite number of iterations, a number $r''$ is obtained such that $r'' \equiv x \bmod m$ and $r'' < B^k$, so that $z = r'' - q \cdot m$ where $0 \le q \le B-1$. In particular, if $B = 2$ then $z$ is either $r''$ or $r'' - m$.

To summarize, the mod $m$ reduction algorithm, with $m = B^k - a$, is the following:

Algorithm 2. mod m reduction algorithm, with $m = B^k - a$

    r := x mod B**k; q := x/B**k;
    loop
      loop
        r := r + (q*a mod B**k); q := q*a/B**k;
        if q = 0 then exit; end if;
      end loop;
      q := r/B**k; r := r mod B**k;
      if q = 0 then exit; end if;
    end loop;
    while r >= m loop r := r-m; end loop;

If $B$ is the base (or a power of the base) of the chosen numeration system, then the division by $B^k$ and the mod $B^k$ reduction are trivial operations. The only non-trivial operations are multiplication by $a$, sums (remainder accumulation) and subtractions (final reduction).

The number of executions of the internal loop body can be estimated as follows: a sufficient condition for $q_{s-1}$ being equal to 0 is (4), which is equivalent to $s > (\log x - \log a)/(k \log B - \log a)$. Thus $s \approx (\log x - \log a)/(k \log B - \log a)$. In particular, if $x = B^n - 1$, that is, the greatest $n$-digit $B$-ary number, then

$$s \approx (n - \log_B a)/(k - \log_B a), \tag{9}$$

and, assuming that $\log_B a$ is much smaller than $k$ and $n$,

$$s \approx n/k. \tag{10}$$

As regards the reduction rate of the algorithm, that is, the relation between an initial value $x$ and the value $r$ obtained after a first execution of the internal loop, notice that $r$ is smaller than $s \cdot B^k$, so that the number $d(r)$ of $B$-ary digits of $r$ satisfies the condition

$$d(r) \le k + \log_B s, \tag{11}$$

where $s$ is approximately equal to (10). Thus $d(r_{max}) \approx \log_B n + k - \log_B k < \log_B n + k$.

In order to define the size of the variable $r = r_0 + r_1 + \ldots + r_{s-1}$, the following values are previously calculated (see (9) and (11)): $s = \lceil (n - \log_2 a)/(k - \log_2 a) \rceil$, $t = \lceil \log_2 s \rceil$, so that $r$ can be represented as a $(k+t)$-bit number.

The core of the algorithm is an $(n/k)$-step iteration. Each step includes the multiplication of an $(n-k)$-bit number $q$ by a $k$-bit number $a$, and the sum of a $(k+t)$-bit number $r$ and a $k$-bit number. The computation time of the multiplier depends on the particular value of $a$. Nevertheless, in order to get an estimation of the computation time, it will be assumed that a parallel multiplier is used. Its computation time is about $((n-k) + 2k - 2) \cdot T_{FA} \approx (n+k) \cdot T_{FA}$ (Deschamps et al, 2006). The step duration is approximately equal to $(n+k) \cdot T_{FA} + (k+t) \cdot T_{FA}$. If $n + 2k \gg t$ then the computation time is approximately equal to

$$\mathrm{time}(n,k) \approx (n/k) \cdot (n + 2k) \cdot T_{FA}. \tag{12}$$

C. Pre-computation of $B^{ik} \bmod m$

Assume again that $B^{k-1} \le m < B^k$, and that $x$ is represented in base $B^k$, i.e.

$$x = x_{s-1} B^{(s-1)k} + x_{s-2} B^{(s-2)k} + \ldots + x_1 B^k + x_0, \quad \text{where } x_{s-1} > 0. \tag{13}$$

The following values must have been previously computed: $b_0 = 1$, $b_1 = B^k \bmod m$, $b_2 = B^{2k} \bmod m$, ..., $b_{s-1} = B^{(s-1)k} \bmod m$. Then $x \equiv x_{s-1} b_{s-1} + x_{s-2} b_{s-2} + \ldots + x_1 b_1 + x_0 b_0 \bmod m$, and the problem is reduced to the computation of $r \bmod m$ where

$$r = x_{s-1} b_{s-1} + x_{s-2} b_{s-2} + \ldots + x_1 b_1 + x_0 b_0. \tag{14}$$

Observe that $b_i = (B^{ik} \bmod m) < m < B^k \le B^{ik}$ for all $i > 0$. Comparing (14) with (13), it is obvious that if $s > 1$, that is, if $x \ge B^k$, then $r < x$. If $r$ is still greater than or equal to $B^k$, the same method can be used in order to get $r' \equiv x \bmod m$ with $r' < r$.
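The repeated application of relation (14) can be sketched in software as follows. This is a minimal illustration only, not the synthesized circuit: the function names `precompute_powers`, `one_pass` and `reduce_precomputed` are hypothetical, and arbitrary-precision Python integers stand in for the fixed-width registers of a hardware implementation.

```python
def precompute_powers(m: int, k: int, s: int, B: int = 2) -> list[int]:
    """Constants b_i = B**(i*k) mod m for i = 0 .. s-1 (hypothetical helper)."""
    return [pow(B, i * k, m) for i in range(s)]


def one_pass(x: int, b: list[int], k: int, B: int = 2) -> int:
    """Evaluate (14): weight every base-B**k digit of x by b_i instead of B**(i*k)."""
    r, i = 0, 0
    while x > 0:
        x, digit = divmod(x, B ** k)   # peel off the next base-B**k digit
        r += digit * b[i]
        i += 1
    return r


def reduce_precomputed(x: int, m: int, k: int, s: int, B: int = 2) -> int:
    """Reduce x mod m by repeating (14) until a single base-B**k digit is left."""
    assert B ** (k - 1) <= m < B ** k, "m must be a k-digit number"
    assert 0 <= x < B ** (s * k), "x must fit in s base-B**k digits"
    b = precompute_powers(m, k, s, B)
    while x >= B ** k:                 # each pass strictly decreases x
        x = one_pass(x, b, k, B)
    while x >= m:                      # final correction, at most B-1 subtractions
        x -= m
    return x


print(reduce_precomputed(0xFFFF, 239, 8, 2))  # 49
```

For $x = 65535$, $m = 239$ and $k = 8$, the successive values of $r$ are 4590, 527 and 49, which shows the strict decrease $r < x$ used in the termination argument.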
After a finite number of iterations, a number $r''$ is obtained such that $r'' \equiv x \bmod m$ and $r'' < B^k$, so that $z = r'' - q \cdot m$ where $0 \le q \le B-1$. In particular, if $B = 2$ then $z$ is either $r''$ or $r'' - m$.

To summarize, the mod $m$ reduction algorithm, with pre-computation of $B^{ik} \bmod m$, is the following (it is assumed that the constants $b_i = B^{ik} \bmod m$ have been previously calculated):

Algorithm 3. mod m reduction, with pre-computation of $B^{ik} \bmod m$

    main: loop
      --represent x as an s-digit number:
      vector_x(0) := x mod base**k; q := x/base**k;
      for i in 1 .. s-1 loop
        vector_x(i) := q mod base**k; q := q/base**k;
      end loop;
      --end of computation detection:
      one_digit := true;
      for i in 1 .. s-1 loop
        if vector_x(i) /= 0 then one_digit := false; exit; end if;
      end loop;
      --main computation
      if one_digit then exit main;
      else
        x := vector_x(0);
        internal: for i in 1 .. s-1 loop
          x := x + vector_x(i)*b(i);
        end loop internal;
      end if;
    end loop main;
    r := vector_x(0);
    while r >= m loop r := r-m; end loop;

The mod $B^k$ reduction and the integer division by $B^k$ are trivial operations. The only non-trivial operations are products of base-$B^k$ digits (vector_x(i)*b(i)) and sums, as well as the end of computation detection.

Let $n$ be the number of $B$-ary digits of $x$. According to (13), $x_{max} = B^{sk} - 1$, so that $n = s \cdot k$ and the number $s$ of executions of the internal loop body is

$$s = n/k. \tag{15}$$

As regards the reduction rate of the algorithm, notice that $r$ is smaller than $s \cdot B^{2k}$, so that the number $d(r)$ of $B$-ary digits of $r$ satisfies the condition $d(r) \le 2k + \log_B s$, where $s$ is equal to (15). Thus

$$d(r_{max}) \approx \log_B n + 2k - \log_B k < \log_B n + 2k. \tag{16}$$

The core of the algorithm is an $(n/k)$-step iteration. Each step includes the product of two $k$-bit numbers $x_i$ and $b_i$, and the sum of two $(\log_2 n + 2k)$-bit numbers. The total computing time is approximately equal to

$$\mathrm{time}(n,k) \approx (n/k) \cdot (\log_2 n + 5k) \cdot T_{FA}. \tag{17}$$

D. Barrett reduction algorithms

A generalized version of the Barrett algorithm (Blake et al, 2002; Hankerson et al, 2004) is presented.

D.1 n-digit to (k+t)-digit reduction

Assume that $m$ belongs to the range $B^{k-1} < m < B^k$, where $B$ is the base (or a power of the base) of the chosen numeration system (if $m$ is a power of $B$ the computation of $x \bmod m$ is trivial). The value of $z = x \bmod m$ is the remainder of the integer division of $x$ by $m$, that is, $x = q \cdot m + z$, $z < m$. The Barrett algorithm starts with the computation of an approximation $q'$ of $q = \lfloor x/m \rfloor$ such that

$$q - a \le q' \le q. \tag{18}$$

Compute

$$r = x - q' \cdot m. \tag{19}$$

Taking into account that $z = x - q \cdot m$, then, according to (18), $z \le r \le z + a \cdot m$. Let $t$ be the integer such that

$$B^t \ge a + 1. \tag{20}$$

Then $r \le z + a \cdot m < (a+1) \cdot m < B^{k+t}$. Thus $0 \le z \le r < B^{k+t}$, so that

$$r = r \bmod B^{k+t} = (x - q' \cdot m) \bmod B^{k+t}. \tag{21}$$

Furthermore, according to (19),

$$r \bmod m = x \bmod m = z. \tag{22}$$

The following algorithm, which includes a function approximation generating an approximation $q'$ of $x/m$ (see relation (18)), computes a $(k+t)$-digit number $r$ equivalent to $x \bmod m$:

Algorithm 4. n-digit to (k+t)-digit reduction

    q := approximation(x, m);
    r := ((x mod B**(k+t)) - (q*m mod B**(k+t))) mod B**(k+t);

If $a = 2$ and $B \ge 3$, then condition (20) is $B^t \ge 3$ and is satisfied if $t = 1$. Thus $x - q' \cdot m$ can be computed mod $B^{k+1}$. This case corresponds to the classical Barrett algorithm.

D.2 A first approximation of q

Let $x$ and $m$ be expressed in base $B$:

$$x = x_{n-1} B^{n-1} + x_{n-2} B^{n-2} + \ldots + x_0 B^0, \quad m = m_{k-1} B^{k-1} + m_{k-2} B^{k-2} + \ldots + m_0 B^0, \quad \text{where } m_{k-1} > 0.$$

The approximation $q'$ of $q = \lfloor x/m \rfloor$ is

$$q' = \left\lfloor \lfloor x/B^{k-1} \rfloor \cdot \lfloor B^n/m \rfloor \,/\, B^{n-k+1} \right\rfloor.$$

It can be demonstrated (Hankerson et al, 2004) that $q \le q' + 2$, that is, $a = 2$. According to (20) the value of $t$ must be chosen in such a way that $B^t \ge 3$. Thus if $B = 2$, then $t = 2$ (the computation is performed mod $B^{k+2}$); if $B > 2$ (classical Barrett algorithm), then $t = 1$ (the computation is performed mod $B^{k+1}$).
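The quotient estimate of section D.2 combined with the truncated subtraction (21) can be sketched in software as follows. This is an illustrative model under stated assumptions, not the authors' circuit: the function name `barrett_reduce` and its parameters are hypothetical, and the constant $\lfloor B^n/m \rfloor$ is computed inline although in hardware it would be pre-calculated.

```python
def barrett_reduce(x: int, m: int, n: int, B: int = 2) -> int:
    """Reduce an n-digit x modulo m using the first approximation of q (a = 2)."""
    # k = number of base-B digits of m, so that B**(k-1) <= m < B**k
    k = 1
    while B ** k <= m:
        k += 1
    assert n >= k and 0 <= x < B ** n, "x must be an n-digit number"
    t = 2 if B == 2 else 1             # smallest t with B**t >= a + 1 = 3, see (20)
    c = B ** n // m                    # pre-computed constant floor(B**n / m)
    y = x // B ** (k - 1)              # drop the k-1 low-order digits of x
    q_approx = (y * c) // B ** (n - k + 1)
    mod = B ** (k + t)
    # Relation (21): only the k+t low-order digits take part in the subtraction.
    r = ((x % mod) - (q_approx * m) % mod) % mod
    while r >= m:                      # at most a = 2 corrective subtractions
        r -= m
    return r


print(barrett_reduce(0xFFFF, 239, 16))  # 49
```

In this example $k = 8$, $t = 2$ and $\lfloor 2^{16}/239 \rfloor = 274$; the estimate is $q' = 273$ while $q = 274$, so $r = 288$ before the single corrective subtraction.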

To summarize, the following algorithm computes $z = x \bmod m$. The constant

$$c = \lfloor B^n/m \rfloor \tag{23}$$

must have been previously calculated.

Algorithm 5. Generalized Barrett reduction

    y := x/B**(k-1);
    w := y*c;
    q := (w/B**(n-k+1)) mod B**(k+t);
    r := ((x mod B**(k+t)) - ((q*m) mod B**(k+t))) mod B**(k+t);
    while r >= m loop r := r-m; end loop;

The division by $B^{k-1}$ or $B^{n-k+1}$ and the mod $B^{k+t}$ reduction are trivial operations. The only non-trivial operations are the multiplications (by $c$ and by $m$) and the subtractions.

Comment: In the classical Barrett algorithm (Blake et al, 2002; Hankerson et al, 2004), $n$ is assumed to be equal to $2k$, so that $c = \lfloor B^{2k}/m \rfloor$ and $q' = \left\lfloor \lfloor x/B^{k-1} \rfloor \cdot \lfloor B^{2k}/m \rfloor \,/\, B^{k+1} \right\rfloor$.

Assuming (best case approximation) that the first value of $r$ is already smaller than $m$, the computation time is the sum of the delays of an $(n-k+1)$-bit by $(n-k+1)$-bit multiplier (computation of $w$), a $(k+t)$-bit by $k$-bit multiplier (computation of $q \cdot m$) and a $(k+t+1)$-bit subtractor. It is approximately equal to $((3(n-k+1) - 2) + (k+t+2k-2) + (k+t+1)) \cdot T_{FA}$. If $2t \ll 3n + k$, then

$$\mathrm{time}(n,k) \approx (3n + k) \cdot T_{FA}. \tag{24}$$

A drawback of the Barrett algorithm is the high cost of the multipliers. The cost of an $n$-bit by $m$-bit multiplier (here $m$ denotes an operand width, not the modulus) is proportional to $n \cdot m$ (Deschamps et al, 2006). Thus, the total cost of both multipliers is proportional to $(n-k+1)^2 + (k+t) \cdot k \approx (n-k)^2 + k^2$, whose minimum value (for $k$ smaller than $n$) is $n^2/2$ (when $k = n/2$).

D.3 A second approximation of q

In order to reduce the computation complexity (basically the computation of $w$), a worse approximation of $q$ can be computed. First observe that $c = \lfloor B^n/m \rfloor$ is an at most $(n-k+1)$-digit number. Thus $w = y \cdot c = c_0 B^0 y + c_1 B^1 y + \ldots + c_{n-k} B^{n-k} y$ and

$$q' = \lfloor y \cdot c / B^{n-k+1} \rfloor = \lfloor c_0 B^{-n+k-1} y + c_1 B^{-n+k} y + \ldots + c_{n-k} B^{-1} y \rfloor. \tag{25}$$

Define $q'' = c_0 \lfloor B^{-n+k-1} y \rfloor + c_1 \lfloor B^{-n+k} y \rfloor + \ldots + c_{n-k} \lfloor B^{-1} y \rfloor$, that is,

$$q'' = c_0 v_0 + c_1 v_1 + \ldots + c_{n-k} v_{n-k}, \quad \text{with } v_i = \lfloor y / B^{n-k-i+1} \rfloor, \; i = 0, 1, \ldots, n-k. \tag{26}$$

Obviously $q'' \le q'$. Furthermore $q' \le q'' + c_0 + c_1 + \ldots + c_{n-k} = q'' + \mathrm{weight}(c)$, where $\mathrm{weight}(c)$ is the sum of all digits of $c$.
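The two bounds just stated, $q'' \le q' \le q'' + \mathrm{weight}(c)$, can be spot-checked numerically. The sketch below is an assumption-laden illustration: the helper names `digits` and `q_estimates` are hypothetical, and the check uses software integers rather than the digit-serial datapath.

```python
def digits(c: int, B: int, count: int) -> list[int]:
    """Base-B digits c_0 .. c_{count-1} of c, least significant first."""
    out = []
    for _ in range(count):
        c, d = divmod(c, B)
        out.append(d)
    return out


def q_estimates(x: int, m: int, n: int, k: int, B: int = 2) -> tuple[int, int, int]:
    """Return (q', q'', weight(c)) following relations (25) and (26)."""
    assert B ** (k - 1) < m < B ** k, "m must be a k-digit number, not a power of B"
    assert 0 <= x < B ** n, "x must be an n-digit number"
    c = B ** n // m                    # constant (23), at most n-k+1 digits
    y = x // B ** (k - 1)
    cd = digits(c, B, n - k + 1)       # c_0 .. c_{n-k}
    q1 = (y * c) // B ** (n - k + 1)   # q' of (25): one wide multiplication
    # q'' of (26): each digit c_i multiplies an already truncated v_i
    q2 = sum(ci * (y // B ** (n - k - i + 1)) for i, ci in enumerate(cd))
    return q1, q2, sum(cd)


q1, q2, w = q_estimates(0xFFFF, 239, 16, 8)
assert q2 <= q1 <= q2 + w
```

The same function can be looped over many values of $x$ to see how far $q''$ actually falls below $q'$ for a given constant $c$.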
Thus $q' - \mathrm{weight}(c) \le q'' \le q'$ and $q - 2 - \mathrm{weight}(c) \le q'' \le q$, that is, $q''$ is an approximation (18) of $q$ such that $a = 2 + \mathrm{weight}(c)$.

Algorithm 6. Modified Barrett reduction

    y := x/B**(k-1);
    for i in 0 .. n-k loop
      v(i) := (y/B**(n-k-i+1)) mod B**(k+t);
    end loop;
    q := (c(0)*v(0) + c(1)*v(1)) mod B**(k+t);
    for i in 2 .. n-k loop
      q := (q + c(i)*v(i)) mod B**(k+t);
    end loop;
    r := ((x mod B**(k+t)) - ((q*m) mod B**(k+t))) mod B**(k+t);
    while r >= m loop r := r-m; end loop;

The division by $B^{k-1}$ or $B^{n-k-i+1}$ and the mod $B^{k+t}$ reduction are trivial operations. The only non-trivial operations are multiplications by $B$-ary digits $c_i$ (a trivial operation if $B = 2$), multiplication by $m$, additions and subtractions.

The computation is divided into two parts. First, an $(n-k)$-step iteration computes $q''$. The corresponding time is approximately $(n-k) \cdot (k+t) \cdot T_{FA} \approx (n-k) \cdot k \cdot T_{FA}$. Assuming again (best case approximation) that the first value of $r$ is smaller than $m$, the second part consists of a $(k+t)$-bit by $k$-bit product ($q \cdot m$) and a $(k+t)$-bit subtraction, that is, a delay equal to $((3k+t-2) + (k+t)) \cdot T_{FA} \approx 4k \cdot T_{FA}$. Thus, the total time is about

$$\mathrm{time}(n,k) \approx (n-k+4) \cdot k \cdot T_{FA} \approx (n-k) \cdot k \cdot T_{FA}. \tag{27}$$

E. Summary

The main results are summarized in table 1. The approximate computation time, expressed in full-adder delays, is given for every reduction method. In particular, the values obtained when $n = 2k$ are computed: they correspond to the case where $x$ is the result of multiplying two elements of $Z_m$, that is, two $k$-bit numbers.

Table 1. Computation time, expressed in full-adder delays, for reducing an n-bit number modulo a k-bit number

| algorithm | time(n,k) | time(2k,k) |
|---|---|---|
| non-restoring division | $(n-k) \cdot k$ | $k^2$ |
| mod $2^k - a$ | $(n/k) \cdot (n+2k)$ | $8k$ |
| pre-comput. of $2^{ik} \bmod m$ | $(n/k) \cdot (\log_2 n + 5k)$ | $10k$ |
| Barrett | $3n + k$ | $7k$ |
| modified Barrett | $(n-k) \cdot k$ | $k^2$ |

As long as only the computation time is considered, and assuming that the approximations are reasonably good, the Barrett algorithm is the best choice. Nevertheless, as noted above, its cost is $O(n^2)$ and could be prohibitively high for great values of $n$ (see next section).

III. FPGA IMPLEMENTATIONS

Reduction circuits, with $n = 2k$ = 16, 64, 256 and 1024, have been synthesized using ISE 6.3i (Xilinx, 2006). The results for an xc3s4000-5 device are given in tables 2 to 6. The cost is expressed in number of slices. Apart from the logic slices, both Barrett algorithms need a lot of 18-by-18-bit multipliers. The xc3s4000-5 device contains 96 such multipliers, an insufficient number for implementing the Barrett algorithms for great values of $n$. This fact is indicated by dashes in the corresponding table rows. Reduction circuits for $n = 64$ and $m = 239$, so that $k = 8$, have also been synthesized (table 7).

Table 2. Non-restoring division: cost and computation time (n = 2k)

| n | cost (slices) | min. period (ns) | time (ns) |
|---|---|---|---|
| 16 | 49 | 7 | 60 |
| 64 | 133 | 9 | 300 |
| 256 | 430 | 14 | 1,800 |
| 1024 | 1619 | 36 | 19,000 |

Table 3. Reduction mod $2^k - a$: cost and computation time (n = 2k)

| n | cost (slices) | min. period (ns) | time (ns) |
|---|---|---|---|
| 16 | 25 | 6 | 25 |
| 64 | 72 | 8 | 35 |
| 256 | 240 | 13 | 55 |
| 1024 | 918 | 35 | 140 |

Table 4. Pre-computation of $2^{ik} \bmod m$: cost and computation time (n = 2k)

| n | cost (slices) | min. period (ns) | time (ns) |
|---|---|---|---|
| 16 | 42 | 6 | 50 |
| 64 | 144 | 9 | 75 |
| 256 | 536 | 20 | 160 |
| 1024 | 2061 | 62 | 500 |

Table 5. Barrett algorithm: cost and computation time (n = 2k)

| n | cost (slices) | min. period (ns) | time (ns) |
|---|---|---|---|
| 16 | 31 | 8 | 25 |
| 64 | 130 | 10 | 30 |
| 256 | - | - | - |
| 1024 | - | - | - |

Table 6. Modified Barrett algorithm: cost and computation time (n = 2k)

| n | cost (slices) | min. period (ns) | time (ns) |
|---|---|---|---|
| 16 | 62 | 9 | 80 |
| 64 | 373 | 17 | 650 |
| 256 | 4,245 | 25 | 3,300 |
| 1024 | - | - | - |

Table 7. Cost and computation time (n = 64 and m = 239)

| algorithm | cost (slices) | min. period (ns) | time (ns) |
|---|---|---|---|
| non-restoring division | 118 | 14 | 850 |
| mod $2^k - a$ | 101 | 14 | 300 |
| pre-comput. of $2^{ik} \bmod m$ | 116 | 20 | 600 |
| Barrett | 546 | 13 | 50 |
| modified Barrett | 215 | 19 | 1,600 |

IV. COMMENTS AND CONCLUSIONS

According to both the theoretical analysis (table 1) and the practical synthesis results (tables 2 to 7), the fastest circuits are obtained with the Barrett algorithm. Nevertheless, the corresponding costs are excessive for great values of $n$. The second best solution, as regards the computation time, is the reduction mod $2^k - a$.

These conclusions are valid as long as generic reduction circuits are considered. For specific values of $n$ and $m$, the pre-computation option could be an interesting alternative (chapter 8 of Deschamps et al, 2006). For small values of $n$, the best option is a block of ROM storing the $2^n$ pre-computed values of $x \bmod m$.

In the case where the reduction is part of an algorithm including a lot of multiplications, for example an exponentiation algorithm, an alternative solution is the Montgomery product (Montgomery, 1985). It has not been studied in this paper, which is dedicated to reduction circuits, but is one of the main topics of another (not yet published) work on finite ring and field operations.

REFERENCES

Blake, I.V., G. Seroussi and N. Smart, Elliptic Curves in Cryptography, Cambridge University Press (2002)

Hankerson, D., A.J. Menezes and S. Vanstone, Guide to Elliptic Curve Cryptography, Springer (2004)

Deschamps, J.-P., G.A. Bioul and G.D. Sutter, Synthesis of Arithmetic Circuits, Wiley (2006)

Montgomery, P., Modular Multiplication without Trial Division, Mathematics of Computation, 44, 519-521 (1985)

Xilinx Inc, http://www.xilinx.com (2006)

Received: April 14, 2006. Accepted: September 8, 2006. Recommended by Special Issue Editor Hilda Larrondo.