1658 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 5, MAY 2017

A General Digit-Serial Architecture for Montgomery Modular Multiplication

Serdar Süer Erdem, Tuğrul Yanık, and Anıl Çelebi

Abstract: The Montgomery algorithm is a fast modular multiplication method frequently used in cryptographic applications. This paper investigates the digit-serial implementations of the Montgomery algorithm for large integers. A detailed analysis is given and a tight upper bound is presented for the intermediate results obtained during the digit-serial computation. Based on this analysis, an efficient digit-serial Montgomery modular multiplier architecture using carry-save adders is proposed and its complexity is presented. In this architecture, pipelined carry-select adders are used to perform two final tasks: adding the carry-save vectors representing the modular product and subtracting the modulus from this sum, if further reduction is needed. The proposed architecture can be designed for any digit size $\delta$ and modulus $\theta$. This paper also presents logic formulas for the bits of the precomputation $-\theta^{-1} \bmod 2^{\delta}$ used in the Montgomery algorithm for $\delta \le 8$. Finally, an evaluation of the proposed architecture on Virtex 7 FPGAs is presented.

Index Terms: Carry-save addition, carry-select addition, Montgomery modular multiplication, RSA cryptosystem.

I. INTRODUCTION

MODULAR multiplication with large integers is the main computation in many public key cryptosystems such as RSA [1] and elliptic curve cryptography (ECC) [2], [3]. Conventional modular multiplication is an expensive operation because it requires division. The Montgomery algorithm [4] is an efficient modular multiplication technique that replaces costly division with multiplication and bit shift operations. This algorithm uses a precomputation based on the modulus to achieve this improvement. Montgomery modular multiplication with a modulus $\theta$ can be implemented in two different ways.
1) The whole multiplier is multiplied by the multiplicand. Then, the resulting product is reduced by using the precomputation $\psi = -\theta^{-1} \bmod 2^{n}$, where $n$ is an integer not less than the multiplier bit length.
2) Multiplication and reduction steps are interleaved. The multiplier is multiplied by the multiplicand, $\delta$ bits at a time. After each multiplication, the partial product is reduced and accumulated. In reduction, the precomputation $\psi = -\theta^{-1} \bmod 2^{\delta}$ is used.

Manuscript received June 7, 2016; revised October 2, 2016 and December 5, 2016; accepted January 5. Date of publication February 2, 2017; date of current version April 24.
S. S. Erdem is with the Department of Electronics Engineering, Gebze Technical University, Gebze, Turkey (serdem@gtu.edu.tr).
T. Yanık is with the Department of Computer Engineering, Celal Bayar University, Muradiye, Turkey (tugrul.yanik@cbu.edu.tr).
A. Çelebi is with the Department of Electronics and Communication Engineering, Kocaeli University, İzmit, Turkey (anilcelebi@kocaeli.edu.tr).
Digital Object Identifier /TVLSI

Hardware architectures of Montgomery modular multiplication that process the multiplier $\delta = 1$ bit at a time are abundant in the literature [5]–[14]. In RSA and ECC applications, the modulus $\theta$ is an odd integer. Thus, the precomputation $\psi = -\theta^{-1} \bmod 2^{\delta} = 1$ always when $\delta = 1$. The advantages of these architectures are obvious. They do not need to compute and store the precomputation. Also, their area complexities are low because they process the multiplier bit by bit.

However, the hardware architectures with $\delta = 1$ have a major drawback. They need a large number of clock cycles to perform a single modular multiplication since they process the multiplier one bit at a time. When $\delta = 1$, at least $n/\delta = n$ clocks are needed to perform a modular multiplication, where $n$ is the bit length of the multiplier. However, the bit length of the multiplier $n$ is very large in many applications.
For example, typically $n = 1024$ or $n = 2048$ for RSA.

Hardware architectures with $\delta > 1$ have also been proposed in the literature to obtain speed improvement at the expense of area [12], [15]–[18]. These are digit-serial implementations and are frequently called radix-$2^{\delta}$ or high-radix Montgomery multiplication. A radix-4 architecture using lookup tables is proposed in [12] and a radix-8 architecture using Booth encoding is proposed in [15]. Unfortunately, it is very difficult to extend these architectures to support larger $\delta$. The architecture in [16] uses a special modulus $\theta$ so that the precomputation $\psi = \pm 1 \bmod 2^{\delta}$. The architecture in [17] uses the precomputation $\psi = -\theta^{-1} \bmod 2^{\delta}$ for a general modulus $\theta$ and digit size $\delta$. However, this architecture computes not $n$-bit $\times$ $\delta$-bit partial products but $\delta$-bit $\times$ $\delta$-bit partial products in each clock cycle. Thus, it needs not $n/\delta$ but more than $(n/\delta)^2$ clock cycles to perform a modular multiplication. Also, the Montgomery multiplier in [18] converts one of the operands into a sparse integer representation by a method based on canonic signed digit recoding. Though the resulting sparse representation of the operand enables fast multiplication, the computation time becomes input dependent. Thus, it is very vulnerable to side-channel attacks.

This paper presents an efficient digit-serial hardware architecture for the Montgomery algorithm. As in many previous works, carry-save adders are used to accumulate partial products to avoid carry propagation delay. The proposed architecture does not put any restrictions on the modulus $\theta$ and digit size $\delta$, unlike [12], [15], and [16]. Also, it computes a modular multiplication in $n/\delta + c$ clock cycles for a small constant $c$. Because its computation time is fixed, it is less vulnerable to simple side-channel attacks, compared with the multipliers with input-dependent computation times in [13], [14], and [18].

In this paper, a thorough mathematical analysis of the Montgomery algorithm is made and the exact bit lengths of the carry-save vectors representing the intermediate variables are determined. Using this analysis, a detailed bit-level description of the proposed multiplier and a complete evaluation of its complexities are presented.

A final subtraction with modulus $\theta$ can be needed at the end of the Montgomery algorithm. Thus, fast adders must be used in the multiplier to perform addition and subtraction with carry-save vectors. We show how to use carry-select adders (CSAs) to handle this final subtraction without increasing the critical path delay.

Also, we present the logic formulae for the bits of $\psi = -\theta^{-1} \bmod 2^{8} = f(\theta_0, \theta_1, \ldots, \theta_7)$ in terms of the modulus bits $\theta_i$. With these formulas, the precomputation $\psi$ can easily be computed for the designs with digit size $\delta \le 8$. In practical applications, this digit size is sufficient and a larger digit size increases the area complexity considerably. The higher bits of the precomputation $\psi = -\theta^{-1} \bmod 2^{\delta}$ can be calculated by the modular inversion algorithms in [19] and [20].

This paper is organized as follows. Section II introduces the Montgomery algorithm. Section III discusses the digit-serial computation of the Montgomery algorithm. Also, it presents a tight upper bound for intermediate results and shows the fast computation of the required precomputation. Section IV proposes a fast digit-serial architecture for Montgomery multiplication. Section V presents the complexity analysis and FPGA implementation results. Section VI discusses conclusions and future work.

II. MONTGOMERY MULTIPLICATION

Montgomery multiplication of the integers $a$ and $b$ is defined for some integer $\delta$ as follows:

$$r = ab2^{-\delta} \bmod \theta.$$
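This computation can be performed without division using the precomputation $-\theta^{-1} \bmod 2^{\delta}$ (note that $\theta$ must be odd).

A. Montgomery Domain

In Montgomery multiplication, $a$, $b$, and $r$ are actually the residual representations of some integers $\tilde{a}$, $\tilde{b}$, and $\tilde{r}$:

$$a = \tilde{a}2^{\delta} \bmod \theta, \quad b = \tilde{b}2^{\delta} \bmod \theta, \quad r = \tilde{r}2^{\delta} \bmod \theta.$$

This special representation is called the Montgomery domain. Note that Montgomery multiplication satisfies

$$r = ab2^{-\delta} \bmod \theta$$
$$\tilde{r}2^{\delta} \bmod \theta = (\tilde{a}2^{\delta} \bmod \theta)(\tilde{b}2^{\delta} \bmod \theta)2^{-\delta} \bmod \theta$$
$$\tilde{r} = \tilde{a}\tilde{b} \bmod \theta.$$

As seen, the Montgomery product $r = ab2^{-\delta} \bmod \theta$ is actually the Montgomery domain computation of the modular multiplication $\tilde{r} = \tilde{a}\tilde{b} \bmod \theta$. The Montgomery product $r = ab2^{-\delta} \bmod \theta$ can be computed faster than $\tilde{r} = \tilde{a}\tilde{b} \bmod \theta$ when the precomputation $-\theta^{-1} \bmod 2^{\delta}$ is used. Of course, converting the integers $\tilde{a}$, $\tilde{b}$, and $\tilde{r}$ into $a$, $b$, and $r$ in the Montgomery domain and converting them back have a cost. Nevertheless, this cost is affordable if a large number of modular multiplications are performed in the Montgomery domain.

B. Montgomery Reduction

The Montgomery multiplication $r = ab2^{-\delta} \bmod \theta$ can be split into the multiplication step $u = ab$ and the reduction step $r = u2^{-\delta} \bmod \theta$. The Montgomery modular reduction $r = u2^{-\delta} \bmod \theta$ can be estimated without division as follows:

$$r = u2^{-\delta} \bmod \theta \equiv \hat{r} = (u + \theta q)/2^{\delta} \qquad (1)$$

where $q = u\psi \bmod 2^{\delta}$ is a parameter depending on the precomputation $\psi = -\theta^{-1} \bmod 2^{\delta}$.

The rationale behind this modular reduction is as follows. Because $\theta$ is odd, $\gcd(\theta, 2^{\delta}) = 1$ and there exist two integers $(2^{-\delta} \bmod \theta)$ and $(-\theta^{-1} \bmod 2^{\delta})$ such that

$$2^{\delta}(2^{-\delta} \bmod \theta) - \theta(-\theta^{-1} \bmod 2^{\delta}) = 1.$$

The precomputation is $\psi = -\theta^{-1} \bmod 2^{\delta}$. Thus

$$2^{-\delta} \bmod \theta = (1 + \theta\psi)/2^{\delta}$$

and the Montgomery modular reduction is

$$r = u2^{-\delta} \bmod \theta \equiv (u + u\theta\psi)/2^{\delta} \equiv (u + \theta(u\psi \bmod 2^{\delta}))/2^{\delta} \equiv (u + \theta q)/2^{\delta}$$

where $q = u\psi \bmod 2^{\delta}$.

C. Calculation of Montgomery Product

Using the reduction in (1), the Montgomery modular multiplication $r = ab2^{-\delta} \bmod \theta$ can be estimated without division as follows:

$$r = ab2^{-\delta} \bmod \theta \equiv \hat{r} = (ab + \theta q)/2^{\delta}$$

where $q = ab\psi \bmod 2^{\delta}$ is a parameter depending on the precomputation $\psi = -\theta^{-1} \bmod 2^{\delta}$.

When $a < a_{\sup}$ and $b < 2^{\delta}$, Montgomery modular multiplication for integer $\delta$ is

$$r = ab2^{-\delta} \bmod \theta \equiv \hat{r} = \frac{ab + \theta q}{2^{\delta}} < a_{\sup} + \theta \qquad (2)$$

where $q = ab\psi \bmod 2^{\delta}$ and $\psi = -\theta^{-1} \bmod 2^{\delta}$. Note that a final subtraction of the modulus $\theta$ may be needed to reduce $\hat{r}$ below the upper bound $a_{\sup}$ as follows:

$$r = ab2^{-\delta} \bmod \theta \equiv \bar{r} = \hat{r} - \varepsilon\theta < a_{\sup} \qquad (3)$$

where

$$\varepsilon = \begin{cases} 1, & \text{if } \hat{r} \ge \theta \\ 0, & \text{if } \hat{r} < \theta. \end{cases}$$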
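The reduction in (1) and the bound in (2) can be tried out numerically. The following is a minimal Python sketch, not the authors' implementation: the function names and the operand values are illustrative assumptions, and Python 3.8 or later is assumed so that `pow` accepts a negative exponent together with a modulus.

```python
def neg_inverse_mod_pow2(theta: int, delta: int) -> int:
    """Return psi = -theta^(-1) mod 2^delta; theta must be odd."""
    if theta % 2 == 0:
        raise ValueError("theta must be odd")
    modulus = 1 << delta
    return (-pow(theta, -1, modulus)) % modulus


def montgomery_reduce(u: int, theta: int, delta: int) -> int:
    """Estimate u * 2^(-delta) mod theta as (u + theta*q) / 2^delta, following (1)."""
    mask = (1 << delta) - 1
    psi = neg_inverse_mod_pow2(theta, delta)
    q = (u * psi) & mask
    t = u + theta * q
    assert t & mask == 0  # the low delta bits cancel, so the shift is exact
    return t >> delta


# Illustrative values (assumptions): a < a_sup = theta and b < 2^delta.
theta, delta = 185, 8
a, b = 150, 201
r_hat = montgomery_reduce(a * b, theta, delta)          # estimate of (2)
r_exact = (a * b * pow(2, -delta, theta)) % theta       # reference value
r_bar = r_hat - theta if r_hat >= theta else r_hat      # final subtraction of (3)
```

Here `r_hat` is congruent to the exact product and stays below `a_sup + theta`, so a single conditional subtraction is enough to recover `r_exact`.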

In practice, the operands $a$ and $b$ and the modulus $\theta$ are represented by $n \ge \log_2 \theta$ bits. Also, the upper bound $a_{\sup}$ must not be less than the modulus $\theta$. That is

$$a, b, \theta < 2^{n}, \quad a < a_{\sup} \in [\theta, 2^{n}].$$

Then, Montgomery multiplication for integer $n$ is

$$r = ab2^{-n} \bmod \theta \equiv \hat{r} = \frac{ab + \theta q}{2^{n}} < a_{\sup} + \theta \qquad (4)$$

where $q = ab\psi \bmod 2^{n}$ and $\psi = -\theta^{-1} \bmod 2^{n}$. Note that a final subtraction of the modulus $\theta$ may be needed to reduce $\hat{r}$ below the upper bound $a_{\sup}$ as follows:

$$r = ab2^{-n} \bmod \theta \equiv \bar{r} = \hat{r} - \varepsilon\theta < a_{\sup} \qquad (5)$$

where

$$\varepsilon = \begin{cases} 1, & \text{if } \hat{r} \ge a_{\sup} \in [\theta, 2^{n}] \\ 0, & \text{if } \hat{r} < a_{\sup} \in [\theta, 2^{n}]. \end{cases}$$

Now, there are two practical cases to consider here.
1) $a_{\sup} = \theta < 2^{n}$.
2) $\theta < a_{\sup} = 2^{n}$.

For the first case, the inequality in (4) becomes

$$r = ab2^{-n} \bmod \theta \equiv \hat{r} = (ab + \theta q)/2^{n} < 2\theta$$

where $q = ab\psi \bmod 2^{n}$ and $\psi = -\theta^{-1} \bmod 2^{n}$. Also, after the final reduction in (5), the result $\bar{r}$ is equal to the Montgomery product

$$r = ab2^{-n} \bmod \theta = \bar{r} = \begin{cases} \hat{r} - \theta, & \text{if } \hat{r} \ge \theta \\ \hat{r}, & \text{if } \hat{r} < \theta. \end{cases}$$

The second case is called incomplete arithmetic [21]. For the second case, the inequality in (4) becomes

$$r = ab2^{-n} \bmod \theta \equiv \hat{r} = (ab + \theta q)/2^{n} < 2^{n} + \theta$$

where $q = ab\psi \bmod 2^{n}$ and $\psi = -\theta^{-1} \bmod 2^{n}$. Also, after the final reduction in (5), the result $\bar{r}$ is equivalent to the Montgomery product but may be larger than it

$$r = ab2^{-n} \bmod \theta \equiv \bar{r} = \begin{cases} \hat{r} - \theta, & \text{if } \hat{r} \ge 2^{n} \\ \hat{r}, & \text{if } \hat{r} < 2^{n}. \end{cases}$$

Note that the final result satisfies $\bar{r} < 2^{n}$, and $\bar{r} < \theta$ does not necessarily hold. Thus, $\bar{r}$ is only an $n$-bit value equivalent to the Montgomery product in the second case. The purpose of the incomplete arithmetic is to keep the operands and the results in the range $[0, 2^{n})$.

III. INTERLEAVED MODULAR MULTIPLICATION

The following algorithm implements the Montgomery multiplication $r = ab2^{-n} \bmod \theta$ given by (4) and (5).
1) $v = ab$ for $b = (b_{n-1}, \ldots, b_0)_2$.
2) $u = (v + \theta q)/2^{n}$ for $q = v\psi \bmod 2^{n}$.
3) If $u \ge a_{\sup}$, then $r = u - \theta$; else $r = u$.
Here, $\psi = -\theta^{-1} \bmod 2^{n}$ and $a < a_{\sup} \in [\theta, 2^{n}]$.
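The three steps above, together with the two choices of $a_{\sup}$, can be sketched in Python as follows. This is a software illustration under assumed names and toy operand values, not the hardware described in the paper.

```python
def montgomery_multiply(a: int, b: int, theta: int, n: int, a_sup: int) -> int:
    """Steps 1) to 3): multiply, reduce with psi = -theta^(-1) mod 2^n, subtract once."""
    if not (theta % 2 == 1 and theta <= a_sup <= (1 << n)
            and a < a_sup and b < (1 << n)):
        raise ValueError("need odd theta <= a_sup <= 2^n, a < a_sup, b < 2^n")
    radix = 1 << n
    psi = (-pow(theta, -1, radix)) % radix
    v = a * b                                  # Step 1
    q = (v * psi) % radix
    u = (v + theta * q) >> n                   # Step 2, exact division by 2^n
    return u - theta if u >= a_sup else u      # Step 3


def exact_product(a: int, b: int, theta: int, n: int) -> int:
    """Reference value a*b*2^(-n) mod theta computed with a modular inverse."""
    return (a * b * pow(2, -n, theta)) % theta


n, theta = 8, 185                               # toy sizes, chosen for illustration
complete = montgomery_multiply(150, 160, theta, n, a_sup=theta)       # case 1
incomplete = montgomery_multiply(250, 255, theta, n, a_sup=1 << n)    # case 2
```

With `a_sup = theta` the result equals the Montgomery product; with `a_sup = 2^n` it is only guaranteed to be an n-bit value congruent to it, which is the incomplete arithmetic of the second case.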
The operands are multiplied in Step 1) and the resulting product is reduced in Step 2). These multiplication and reduction steps can be interleaved as follows:
1) $u = 0$.
2) For $i = 0$ to $n/\delta - 1$,
   a) $v = u + a\beta$ for $\beta = (b_{i\delta+\delta-1}, \ldots, b_{i\delta})_2$.
   b) $u = (v + \theta q)/2^{\delta}$ for $q = v\psi \bmod 2^{\delta}$.
3) If $u \ge a_{\sup}$, then $r = u - \theta$; else $r = u$.
Here, $\psi = -\theta^{-1} \bmod 2^{\delta}$ and $a < a_{\sup} \in [\theta, 2^{n}]$. In this algorithm, the operand $a$ is multiplied by $\delta$ bits of the operand $b$ in Step 2a) and the resulting product is reduced in Step 2b).

A. Fast Calculation of $\psi = -\theta^{-1} \bmod 2^{\delta}$ for $\delta \le 8$

The precomputation $\psi$ must be calculated from the modulus $\theta$ and loaded into a register before modular multiplication. For this purpose, we present the boolean functions that give bits 0 to 7 of $\psi$ in terms of the bits of the modulus $\theta$ as follows:

$$\psi_0 = 1, \quad \psi_1 = \bar{\theta}_1, \quad \psi_2 = \bar{\theta}_2, \quad \psi_3 = \overline{\theta_1 \oplus \theta_2 \oplus \theta_3}$$

ψ 4 = (θ 1 θ 3 )(θ 2 θ 3 ) θ 4
ψ 5 = (θ 1 θ 3 )(θ 2 θ 3 )(θ 3 θ 4 ) θ 1 θ 4 θ 5
ψ 6 = (θ 1 θ 5 )(θ 2 θ 5 )(θ 3 θ 5 )(θ 4 θ 5 ) (θ 1 θ 4 )(θ 2 θ 4 )(θ 3 θ 5 ) (θ 1 θ 5 )(θ 2 θ 4 ) θ 6,
ψ 7 = (θ 1 θ 6 )(θ 2 θ 6 )(θ 3 θ 6 )(θ 4 θ 6 )(θ 5 θ 6 ) (θ 1 θ 2 )(θ 2 θ 4 )(θ 4 θ 6 )(θ 5 θ 6 ) (θ 1 θ 6 )(θ 2 θ 6 )(θ 3 θ 6 ) (θ 1 θ 2 )(θ 4 θ 6 ) θ 7. (6)

These boolean functions can be implemented in hardware easily and used in digit-serial modular multipliers with digit size $\delta \le 8$. As an example, let $\theta \bmod 2^{8} = 185$. That is

$$\theta \bmod 2^{8} = (\theta_7 \ldots \theta_1 \theta_0)_2 = (10111001)_2 = 185$$

and the proposed boolean functions yield

$$\psi = (\psi_7 \ldots \psi_1 \psi_0)_2 = (01110111)_2 = 119.$$

Note that the calculated $\psi$ is really equal to $-\theta^{-1} \bmod 2^{\delta}$ for any $\delta \le 8$ as follows:

$$\theta\psi = 185 \times 119 = 22015 \equiv -1 \pmod{2^{\delta}}.$$

These formulas were found partly by induction and partly by trial and error. Their correctness was checked with mathematical software.

B. Intermediate Results

In each iteration of the interleaved modular multiplication, the variable $u$ is recalculated. The upper bound of $u$ must be known to decide the number of bits needed to represent it in hardware implementations.
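To see how large $u$ actually gets, the interleaved steps of this section can be run in software. The sketch below is a plain-integer Python model with assumed names and toy parameters; it assumes that $\delta$ divides $n$ and records the largest intermediate $u$ alongside the result.

```python
def montgomery_multiply_interleaved(a: int, b: int, theta: int,
                                    n: int, delta: int, a_sup: int):
    """Digit-serial Montgomery multiplication; returns (r, largest u seen)."""
    if n % delta != 0:
        raise ValueError("this sketch assumes delta divides n")
    mask = (1 << delta) - 1
    psi = (-pow(theta, -1, 1 << delta)) & mask   # psi = -theta^(-1) mod 2^delta
    u = 0                                        # Step 1
    u_max = 0
    for i in range(n // delta):                  # Step 2
        beta = (b >> (i * delta)) & mask         # next delta bits of b
        v = u + a * beta                         # Step 2a
        q = (v * psi) & mask
        u = (v + theta * q) >> delta             # Step 2b, exact division
        u_max = max(u_max, u)
    r = u - theta if u >= a_sup else u           # Step 3
    return r, u_max


# Toy parameters (assumptions): 16-bit operands, 4-bit digits, a_sup = 2^n.
n, delta, theta = 16, 4, 47545
a, b, a_sup = 65535, 65535, 1 << 16
r, u_max = montgomery_multiply_interleaved(a, b, theta, n, delta, a_sup)
```

In this run `u_max` stays below `a_sup + theta`, so `n + 1` bits suffice for `u` when `a_sup = 2^n`.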
The following theorem gives the value of u in each iteration and an upper bound for these values. To the best of our knowledge, the current literature does not demonstrate such a theorem and its proof.

Theorem 1: In iteration $i \in \{0, 1, \ldots, n/\delta - 1\}$ of the interleaved Montgomery modular multiplication

$$u = \frac{aB + \theta\left(aB(-\theta^{-1}) \bmod 2^{(i+1)\delta}\right)}{2^{(i+1)\delta}} < a_{\sup} + \theta$$

for the operands $a < a_{\sup}$ and $b < 2^{n}$, where $B = b \bmod 2^{(i+1)\delta} = (b_{(i+1)\delta-1}, \ldots, b_0)_2$.

Proof: Let the Montgomery product of $a$ and $B$ with respect to the integer $(i+1)\delta$ be denoted as follows:

$$r_{(i+1)\delta} = aB2^{-(i+1)\delta} \bmod \theta.$$

The Montgomery product with respect to the integer $\delta$ and its estimate are given by (2). The multiplier $b < 2^{\delta}$ in (2) and the multiplier $B < 2^{(i+1)\delta}$ here. Thus, we replace $b$ with $B$ and $2^{\delta}$ with $2^{(i+1)\delta}$ in (2) and obtain the estimate of $r_{(i+1)\delta}$ as follows:

$$r_{(i+1)\delta} \equiv \hat{r}_{(i+1)\delta} = (aB + \theta q)/2^{(i+1)\delta} < a_{\sup} + \theta$$

where $q = aB(-\theta^{-1}) \bmod 2^{(i+1)\delta}$. Note that $u$ given in the theorem is nothing else than the estimate of the Montgomery product of $a$ and $B$ with respect to the integer $(i+1)\delta$

$$\hat{r}_{(i+1)\delta} \equiv r_{(i+1)\delta} = aB2^{-(i+1)\delta} \bmod \theta.$$

As a result, the theorem actually claims that

$$u = \hat{r}_{(i+1)\delta} < a_{\sup} + \theta.$$

Now, we must prove that $u = \hat{r}_{(i+1)\delta}$. According to the theorem, the zeroth iteration of the interleaved Montgomery algorithm computes

$$u = \hat{r}_{(i+1)\delta}\big|_{i=0} = \hat{r}_{\delta} = \frac{aB + \theta\left(aB(-\theta^{-1}) \bmod 2^{\delta}\right)}{2^{\delta}}$$

where $B = (b_{\delta-1}, \ldots, b_0)_2 < 2^{\delta}$. This is true because after Steps 2a) and 2b) in the zeroth iteration of the algorithm,

$$u = (a\beta + \theta q)/2^{\delta}$$

where $q = a\beta(-\theta^{-1}) \bmod 2^{\delta}$ and $\beta = (b_{\delta-1}, \ldots, b_0)_2$.

Now, the proof can be completed by induction if we show that when the $i$th iteration computes $u = \hat{r}_{(i+1)\delta}$, the $(i+1)$th iteration computes $u = \hat{r}_{(i+2)\delta}$ in the interleaved Montgomery algorithm. Also, note that

$$\beta = (b_{(i+1)\delta+\delta-1}, \ldots, b_{(i+1)\delta})_2 = \sum_{j=0}^{\delta-1} b_{(i+1)\delta+j}2^{j} = \frac{1}{2^{(i+1)\delta}} \sum_{j=(i+1)\delta}^{(i+1)\delta+\delta-1} b_j 2^{j}$$

in the $(i+1)$th iteration. Let $k = (i+1)\delta$. Then, we must show that if some iteration computes $u = \hat{r}_{k}$, the next iteration computes $u = \hat{r}_{k+\delta}$ where

$$\beta = \frac{1}{2^{k}} \sum_{j=k}^{k+\delta-1} b_j 2^{j}.$$

Assume that some iteration computes

$$u = \hat{r}_{k} = \frac{a\sum_{j=0}^{k-1} b_j 2^{j} + \theta\left[a\sum_{j=0}^{k-1} b_j 2^{j}(-\theta^{-1}) \bmod 2^{k}\right]}{2^{k}}.$$

As seen from the interleaved Montgomery algorithm, each iteration performs

$$u \leftarrow \frac{u + a\beta + \theta q}{2^{\delta}} = \frac{u + a\beta + \theta\left((u + a\beta)\psi \bmod 2^{\delta}\right)}{2^{\delta}}.$$

Thus, the next iteration must compute

$$u = \hat{r} = \frac{\hat{r}_{k} + a\beta + \theta\left[(\hat{r}_{k} + a\beta)(-\theta^{-1}) \bmod 2^{\delta}\right]}{2^{\delta}}.$$

Let us multiply both numerator and denominator by $2^{k}$

$$\hat{r} = \frac{2^{k}(\hat{r}_{k} + a\beta) + \theta\left[2^{k}(\hat{r}_{k} + a\beta)(-\theta^{-1}) \bmod 2^{k+\delta}\right]}{2^{k+\delta}}.$$

Note that $2^{k}(\hat{r}_{k} + a\beta) = P + \theta Q$ where

$$P = a\sum_{j=0}^{k+\delta-1} b_j 2^{j}, \qquad Q = a\sum_{j=0}^{k-1} b_j 2^{j}(-\theta^{-1}) \bmod 2^{k}.$$

Then, the computation in the next iteration becomes

$$\hat{r} = \frac{P + \theta Q + \theta\left[\left(P(-\theta^{-1}) - Q\right) \bmod 2^{k+\delta}\right]}{2^{k+\delta}}.$$

Note that $P(-\theta^{-1}) \bmod 2^{k+\delta} \ge Q$ as seen in the following:

$$P(-\theta^{-1}) \bmod 2^{k+\delta} = a\sum_{j=0}^{k+\delta-1} b_j 2^{j}(-\theta^{-1}) \bmod 2^{k+\delta} \ge Q = a\sum_{j=0}^{k-1} b_j 2^{j}(-\theta^{-1}) \bmod 2^{k}.$$

Thus, we can further simplify the computation in the next iteration as follows:

$$\hat{r} = \frac{P + \theta\left[P(-\theta^{-1}) \bmod 2^{k+\delta}\right]}{2^{k+\delta}}.$$

Then, $\hat{r} = \hat{r}_{k+\delta}$ as follows:

$$\hat{r} = \frac{a\sum_{j=0}^{k+\delta-1} b_j 2^{j} + \theta\left[a\sum_{j=0}^{k+\delta-1} b_j 2^{j}(-\theta^{-1}) \bmod 2^{k+\delta}\right]}{2^{k+\delta}}.$$

Now, the proof by induction is complete.

IV. PROPOSED MODULAR MULTIPLIER

Algorithm 1 gives the bit-level interleaved Montgomery modular multiplication. The operands and the modulus $\theta$ in this algorithm are $n$ bit. That is

$$a < a_{\sup} = 2^{n}, \quad b < 2^{n}, \quad \theta < 2^{n}.$$

Then, it follows from Theorem 1 that the intermediate variable $u$ in Algorithm 1 satisfies

$$u < a_{\sup} + \theta = 2^{n} + \theta$$

and is thus an $(n+1)$-bit variable. Also, note that

$$a < 2^{n}, \quad \beta < 2^{\delta}, \quad a\beta < 2^{n+\delta}.$$

Then, as seen from Algorithm 1

$$v = a\beta + u < 2^{n+\delta} + 2^{n+1} \qquad (7)$$

and $v$ is an $(n+\delta+1)$-bit variable.
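The closed form of Theorem 1 and the bit lengths derived from it can be checked numerically for particular operands. The Python sketch below is such a spot check under assumed names, with $a_{\sup} = 2^{n}$ and $\delta$ dividing $n$; it is a sanity test on sample values and does not replace the proof.

```python
def check_theorem1(a: int, b: int, theta: int, n: int, delta: int) -> bool:
    """Compare each iteration's u with the closed form and check the bounds on u and v."""
    mask = (1 << delta) - 1
    psi = (-pow(theta, -1, 1 << delta)) & mask
    u = 0
    for i in range(n // delta):
        beta = (b >> (i * delta)) & mask
        v = u + a * beta                                  # Step 2a
        u = (v + theta * ((v * psi) & mask)) >> delta     # Step 2b
        k = (i + 1) * delta
        big_b = b % (1 << k)                              # B = b mod 2^((i+1)*delta)
        q_k = (a * big_b * (-pow(theta, -1, 1 << k))) % (1 << k)
        closed_form = (a * big_b + theta * q_k) >> k      # u as stated in Theorem 1
        if u != closed_form:
            return False
        if u >= (1 << n) + theta:                         # u < a_sup + theta
            return False
        if v >= (1 << (n + delta)) + (1 << (n + 1)):      # bound (7)
            return False
        if u.bit_length() > n + 1 or v.bit_length() > n + delta + 1:
            return False
    return True


# Toy operands (assumptions), all below 2^16 with an odd modulus.
ok = check_theorem1(65535, 65535, 47545, 16, 4)
```

The same function can be called with other digit sizes, for example `delta = 2` or `delta = 8`, as long as `delta` divides `n`.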

Algorithm 1: Digit-Serial Montgomery Modular Multiplication for a $\delta$-bit Digit Size

Algorithm 2: Digit-Serial Montgomery Modular Multiplication With Carry-Save Addition for a $\delta$-bit Digit Size

Algorithm 2 is the same as Algorithm 1, except that it keeps the variables $u$ and $v$ in carry-save form. Fig. 1 illustrates Algorithm 2. The following key observations can be made from Fig. 1.
1) The first accumulator takes $u$ in carry-save form and computes $v = a\beta + u$ in Step 2a) of Algorithm 2 in carry-save form.
2) $v$ can be represented by an $(n+\delta-1)$-bit save vector and an $(n+\delta)$-bit carry vector because $v < 2^{n+\delta} + 2^{n+1}$ as seen from (7). However, the least significant $\delta$ bits of the carry vector and the save vector are added together for use in the calculation of $q = \psi v \bmod 2^{\delta}$.
3) The second accumulator takes the carry-save vectors representing $v$ and computes $\theta q + v$ in Step 2b) of Algorithm 2, whose least significant $\delta$ bits are always zero. Then, these zero bits are removed to obtain $u = (\theta q + v)/2^{\delta}$.

Fig. 1. Digit-serial Montgomery modular multiplication circuit.

A. Area Complexity

The area complexity of the multiplier is as follows:

$$\begin{aligned} &2n + \delta && \text{flip-flops} \\ &2n\delta + \delta^{2}/2 - 3\delta/2 && \text{AND gates} \\ &2n\delta + \delta^{2}/2 - 5\delta/2 + 1 && \text{full adders} \\ &3\delta - 2 && \text{half adders.} \end{aligned} \qquad (8)$$

The above complexity does not include the area needed to store the multiplicands and modulus. Also, the complexity of computing $\psi = -\theta^{-1} \bmod 2^{\delta}$ is neglected because this cost is relatively small and very implementation dependent. The area complexity of the proposed multiplier is the sum of the following area requirements.
1) Storing the carries and sums $(c_j^{(u)}, s_j^{(u)})$ requires $2n + 1$ flip-flops. Also, storing $\psi$ requires $\delta$ flip-flops as seen from Fig. 1. However, one flip-flop can be saved since $\psi_0 = 1$ always.
2) The first accumulator requires $n\delta$ AND gates and $n\delta$ adders to accumulate the partial products $a\beta$. However, $\delta - 2$ of the adders are half adders as can be understood from Fig. 2.
3) The calculation of $q_i$ requires $\delta(\delta+1)/2$ AND gates, $(\delta-2)(\delta-1)/2$ full adders, and $\delta - 1$ half adders as can be understood from Fig. 3. However, $\delta$ AND gates can be saved because $\psi_0 = 1$ always.
4) The second accumulator requires $n\delta$ AND gates to compute $\theta q_i$ and $(n-1)(\delta-2) + 2(n+\delta-1)$ adders to add them to the outputs $(c_j^{(v)}, s_j^{(v)})$ of the first accumulator, as can be understood from Fig. 3. However, $\delta$ AND gates can be saved because $\theta_0 = 1$ always. Also, $\delta + 2$ of the adders are half adders. Moreover, one half adder can be saved since the sum $s_0^{(v)} + \theta_0 q_0$ is redundant. Note that because $\psi_0 = \theta_0 = 1$ always

$$s_0^{(v)} + \theta_0 q_0 = s_0^{(v)} + \theta_0 s_0^{(v)} \psi_0 = 2s_0^{(v)}.$$

B. Time Complexity

The critical path delay of the multiplier is

$$\begin{cases} 3T_{AND} + T_{HA} + (\delta + 4)T_{FA}, & \text{for } \delta > 2 \\ 3T_{AND} + T_{HA} + 4T_{FA}, & \text{for } \delta = 2 \end{cases} \qquad (9)$$

where $T_{AND}$ is the AND gate delay, $T_{FA}$ is the full adder delay, and $T_{HA}$ is the half adder delay. The critical path delay is the sum of the following delays as can be understood from Figs. 2 and 3:
1) the delay $T_{AND} + (i+1)T_{FA}$ to compute $s_i^{(v)}$ in the first accumulator for $i < \delta$;
2) the delay $T_{AND} + T_{FA} + T_{HA}$ to compute $q_i$ from $s_i^{(v)}$ for $2 \le i < \delta$;
3) the delay $T_{AND} + (\delta - i + 2)T_{FA}$ to compute $\theta q_i$ and accumulate them in the second accumulator for $2 \le i < \delta$.

Fig. 2. Accumulation of the partial products $a\beta$ in Step 2a) for the digit size $\delta = 6$ bits and the operand size $n = 10$ bits.

Fig. 3. Computation of $q$ and accumulation of $\theta q$ in Step 2b) for the digit size $\delta = 6$ bits and the operand size $n = 10$ bits.

C. Final Reduction

In Step 3 of Algorithm 2, the $(n+1)$-bit addition

$$t = (c^{(u)}[n]\; c^{(u)}[n-1]\; c^{(u)}[n-2] \ldots c^{(u)}[0])_2 + (0\; s^{(u)}[n-1]\; s^{(u)}[n-2] \ldots s^{(u)}[0])_2 \qquad (10)$$

is performed. The output $r = t$ if the most significant bit of the sum $t[n] = 0$. Otherwise

$$r = (t[n]\; t[n-1]\; t[n-2] \ldots t[1]\; t[0])_2 - (0\; \theta[n-1]\; \theta[n-2] \ldots \theta[1]\; \theta[0])_2. \qquad (11)$$

The data are kept in carry-save form in the previous steps of the proposed algorithm to avoid large carry propagation delays. Thanks to this strategy, the critical path delay for the implementation of the previous steps is linear in the digit size $\delta$ but not in the operand size $n$, as seen in (9). This is desirable since usually $n \gg 1$ in practical applications. Unfortunately, the final addition and subtraction in Step 3 of Algorithm 2 require $(n+1)$-bit carry and borrow propagations. Thus, these operations must be performed by a fast adder and even in more than one clock cycle. Otherwise, the critical path delay will be $O(n)$ instead of $O(\delta)$.

CSA is a fast and simple adder. Thus, it is a good choice to implement the final addition and subtraction. Fig. 4 shows an example Verilog implementation of Step 3 of Algorithm 2 with CSA. In this example, the operands of a 1024-bit sum are divided into smaller 168- and 172-bit blocks. These smaller operand blocks are added with two different ripple carry adders, one with carry-in 0 and the other with carry-in 1. One of these two additions is always correct, and thus the correct 1024-bit sum is actually obtained. However, the correct input carries must be determined and the correct bits must be selected using multiplexers.
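The carry-select idea described above, and its use for the final addition (10) and subtraction (11), can be modelled in software. The Python sketch below is an assumption-laden illustration rather than a translation of the Verilog in Fig. 4: the function names, the block width parameter, and the operand values are hypothetical, the blocks are processed sequentially instead of in parallel, and no pipelining is modelled.

```python
def carry_select_add(x: int, y: int, width: int, block: int) -> int:
    """Add two width-bit values block by block, selecting between precomputed sums."""
    result, carry, pos = 0, 0, 0
    while pos < width:
        w = min(block, width - pos)
        m = (1 << w) - 1
        xb, yb = (x >> pos) & m, (y >> pos) & m
        sum0 = xb + yb          # block adder assuming carry-in 0
        sum1 = xb + yb + 1      # block adder assuming carry-in 1
        chosen = sum1 if carry else sum0   # multiplexer driven by the real carry
        result |= (chosen & m) << pos
        carry = chosen >> w
        pos += w
    return result | (carry << width)       # carry-out becomes bit `width`


def final_reduction(c: int, s: int, theta: int, n: int, block: int) -> int:
    """Step 3: t = c + s as in (10); subtract theta as in (11) when t[n] is set."""
    low_mask = (1 << n) - 1
    t = carry_select_add(c, s, n + 1, block)
    neg_theta = (1 << n) - theta                     # negated modulus, modulo 2^n
    tt = carry_select_add(t & low_mask, neg_theta, n, block) & low_mask
    return tt if (t >> n) & 1 else t


# Toy values (assumptions): n = 16, theta = 47545, 5-bit blocks.
reduced = final_reduction(1 << 16, 100, 47545, 16, 5)      # t[n] = 1, so theta is subtracted
unchanged = final_reduction(30000, 20000, 47545, 16, 5)    # t[n] = 0, so t is returned
```

In hardware both candidate sums of every block are computed in parallel, so only the short carry selection chain runs across blocks; the loop here reproduces the result, not the timing.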
Though the additions are almost doubled and multiplexers are used, the 168- and 172-bit additions in the example are performed 1024/168 and 1024/172 times faster than the 1024-bit addition. Thus, the addition with CSA is faster than the usual addition. Moreover, a 1024-bit addition can be pipelined and performed in more than one clock cycle to reduce the critical path. The additions in Fig. 4 are performed in two clock cycles.

There are two 1024-bit additions in Fig. 4. The first one is the sum of the carry-save vectors in (10) and the second one is the subtraction with the modulus $\theta$ in (11)

$$t[1024{:}0] = C[1024{:}0] + S[1023{:}0]$$
$$tt[1023{:}0] = t[1023{:}0] + \bar{\theta}[1023{:}0]$$

where $\bar{\theta}[1023{:}0]$ is the negated modulus. In Fig. 4, the enableadd signal must be high for three clock cycles. The correct values of the addition and subtraction results are obtained during these three clock cycles as follows.
1) t[511:0] and the carry mmc1 in the first clock cycle.
2) t[1024:512], tt[511:0], and the borrow mmb1 in the second clock cycle.
3) tt[1023:512] in the third clock cycle.

As seen, the subtraction result tt[1023:0] lags one clock cycle since it depends on the addition result t[1024:0]. In the fourth clock cycle, the carry-out of the addition result t[1024] is checked. If it is set, the output r[1023:0] is the subtraction result. If not, the output r[1023:0] is the addition result.

The area complexity of Step 3 of Algorithm 2 for the operand size $n$ is roughly as follows:

$$4nA_{FA} + 2nA_{REG} + 3nA_{MUX2} \qquad (12)$$

where $A_{FA}$, $A_{REG}$, and $A_{MUX2}$ are, respectively, the full adder, flip-flop, and multiplexer areas. Nearly $4n$ full adders are needed because the addition and the subtraction in Step 3 of Algorithm 2 both require $n$ full adders, and the carry-select strategy almost doubles the adders. At least $2n$ flip-flops are needed because the addition and subtraction results must both be stored. Almost $n$ multiplexers are needed for the addition with carry-select adders. Almost $n$ multiplexers are needed for the subtraction with carry-select adders. Exactly $n$ multiplexers are needed to select one of the addition result and the subtraction result.

Fig. 4. Example Verilog implementation of Step 3 of Algorithm 2 for 1024-bit operands with CSA using a 2-stage pipeline. Here, C[1023:0] and S[1023:0] are carry-save vectors. $\bar{\theta}[1023{:}0]$ is the negated modulus.

The critical path delays of the carry-select adders in Step 3 of Algorithm 2 must be reduced by pipelining so that these delays do not exceed the critical path delay of the Montgomery multiplication performed in carry-save form. Then, the number of clock cycles required for Step 3 is

$$P + 2 \qquad (13)$$

where $P$ is the number of pipeline stages used in the carry-select adders. $P$ cycles are needed to perform the final addition and subtraction in Step 3 with carry-select adders. The $(P+1)$th cycle is needed to finish the final subtraction, which depends on the final addition result. The $(P+2)$th cycle is needed to select one of the addition and subtraction results.

V. PERFORMANCE COMPARISONS

TABLE I: AREA COMPLEXITY OF DIFFERENT IMPLEMENTATIONS

Table I gives the area complexities of some Montgomery multipliers with digit size $\delta = 1$ bit and of the proposed multiplier for digit sizes $\delta = 2, 4, 8$ bits. As seen, the proposed multiplier has a larger area complexity because its digit size $\delta > 1$.
The area complexity of the proposed multiplier is the sum of the complexities in (8) and (12) plus the register $a$ and the register $\theta$ shown in Fig. 1.

Note that the multipliers with digit size $\delta = 1$ bit do not need fast adders. This is because they always keep their operands and results in carry-save form. These multipliers avoid the final subtraction in the last step of Montgomery multiplication using the clever scheme in [22]. In this scheme, the operands are bounded by two times the modulus. That is $a, b < 2\theta$. And, the Montgomery multiplication for the integer $n + 2$

$$\hat{r} = (ab + \theta q)/2^{n+2} \equiv ab2^{-(n+2)} \bmod \theta$$

is computed instead of $(ab + \theta q)/2^{n} \equiv ab2^{-n} \bmod \theta$. It can be shown that $\hat{r} < 2\theta$ in this scheme. Thus, the operands and the result are all bounded by $2\theta$ for the multiplier designs with digit size $\delta = 1$ bit in Table I. Consequently, these multipliers can perform successive modular multiplications by keeping their inputs and outputs in carry-save form.

Table II compares the time complexity of the proposed multiplier with the complexities of the Montgomery multipliers with digit size $\delta = 1$ bit. As seen, though the proposed multiplier has a larger critical path delay, it requires fewer clock cycles to finish its computation as the digit size $\delta$ gets larger. Thus, the total computation time of the proposed multiplier is better for large $\delta$. The time complexity of the proposed multiplier is obtained from (9) and (13).
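The bound claimed for the subtraction-free scheme of [22], namely that operands below $2\theta$ give a result below $2\theta$ when the reduction uses $2^{n+2}$, can be spot-checked in software. The Python sketch below uses an assumed function name and toy parameters and is only an illustration of that bound, not a description of the designs in Table I.

```python
def montgomery_multiply_no_subtraction(a: int, b: int, theta: int, n: int) -> int:
    """Return (a*b + theta*q) / 2^(n+2) for operands a, b < 2*theta, with no final subtraction."""
    if not (theta % 2 == 1 and theta < (1 << n) and a < 2 * theta and b < 2 * theta):
        raise ValueError("need odd theta < 2^n and a, b < 2*theta")
    k = n + 2
    mask = (1 << k) - 1
    psi = (-pow(theta, -1, 1 << k)) & mask      # psi = -theta^(-1) mod 2^(n+2)
    v = a * b
    q = (v * psi) & mask
    return (v + theta * q) >> k                 # exact division by 2^(n+2)


# Toy check (assumptions): theta = 185, n = 8, largest allowed operands.
theta, n = 185, 8
worst = montgomery_multiply_no_subtraction(2 * theta - 1, 2 * theta - 1, theta, n)
```

The bound follows from $ab < 4\theta^{2}$, $q < 2^{n+2}$, and $4\theta < 2^{n+2}$, which give $(ab + \theta q)/2^{n+2} < 2\theta$; the output can therefore be fed back as an operand of the next multiplication.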

TABLE II: TIME COMPLEXITY OF DIFFERENT IMPLEMENTATIONS

TABLE III: AREA AND TIME PERFORMANCES OF FPGA IMPLEMENTATIONS

The number of cycles required by the proposed multiplier is $n/\delta + P_{\delta} + 2 \approx n/\delta$. This is because $512 \le n \le 2048$, while the number of pipeline stages of the fast adders $P_{\delta}$ is a small number in practice. Thus, the total computation time of the proposed multiplier can be approximated by

$$(\delta + 4)\frac{n}{\delta}T_{FA} + \frac{n}{\delta}(3T_{AND} + T_{HA}).$$

As a result, the total computation time of the proposed multiplier is less than those of the other works in Table II for large digit size $\delta$.

FPGA implementation results are presented in Table III. We have written Verilog code for the proposed digit-serial multiplier and the Montgomery multipliers proposed in [8] and [10]. The Verilog code is synthesized and implemented in the Xilinx Vivado tool for Virtex-7, device xc7vx330t, package ffg1157, with speed grade -3. Table III gives the timing results from the implementation process for the default Vivado optimization strategy. The Vivado tool also gives the required numbers of slices, LUTs, and flip-flops for the implementation. It gives these numbers for the total implementation and for the individual modules (fast adder unit and Montgomery multiplier) separately. The complexities given in Table III are for the total implementations.

As seen, the shortest clock period belongs to the multiplier proposed in [10]. However, the smallest area-time product belongs to the proposed multiplier with digit size $\delta = 2$ bits. As $\delta$ gets larger, the total computation time decreases, but the area-time product gets larger too. Nevertheless, the proposed multiplier with digit size $\delta = 4$ bits still has a smaller area-time product than the multipliers proposed in [8] and [10]. Also, the proposed design with digit size $\delta = 2$ requires fewer FPGA slices than the designs in [8] and [10].
This is interesting because, though the proposed design needs fewer flip-flops, the designs in [8] and [10] have a substantial advantage over the proposed design in terms of combinational circuits as seen from Table I. There are two explanations for this. First, FPGA slices have dedicated carry chains, and thus the fast additions in the proposed design can be performed more efficiently. Second, the designs in [8] and [10] keep their inputs and outputs in carry-save form, unlike the proposed design. Because very large numbers are multiplied, representing all the data with two vectors complicates the FPGA design. In particular, the input/output units of these designs require a large number of slices.

One of the important improvements of the proposed design is the total computation time. The proposed multiplier has a smaller computation time than [8] and [10] for all digit sizes. The proposed multiplier with digit size $\delta = 4$ bits has a computation time at least two times smaller than [8] and [10]. However, the gain in computation time declines for the digit size $\delta = 8$.

Tables I–III give a very good idea of the performance of the proposed digit-serial multiplier. There are many other high-radix Montgomery multipliers in the literature. However, some of them are designed for a certain digit size [12], [15] and some are designed for certain types of modulus [16]. Also, the performances of these multipliers are usually given for Virtex 2 devices, which are considerably old technology. The work in [18] gives Virtex 5 performance results. However, this work is not a complete design. It neither gives the details of the fast adder implementation required for the final subtraction nor mentions the complexity of this implementation.

The multiplier in [18] tries to convert one of the operands into a sparse integer representation by a method based on canonic signed digit recoding. After the conversion, signed digits are multiplied by the second operand and zero digits are skipped. Our multiplier can also be optimized using similar techniques. However, the required number of clock cycles for computation would then vary greatly according to the operands.
The computation would be shorter for operands with many consecutive ones and zeros but longer for other types of operands. Besides causing many other problems, this variable computation time can lead to side-channel attacks in cryptographic applications such as RSA and ECC.

VI. CONCLUSION

This paper presents a detailed analysis of the Montgomery algorithm and its digit-serial computation. The precomputation $\psi = -\theta^{-1} \bmod 2^{\delta}$ is given in (6) for digit size δ ≤ 8. The computed intermediate results and their tight upper bound are given in Theorem 1. Using our analysis, we have developed a general digit-serial architecture for the Montgomery algorithm, which can be implemented for any modulus θ and any digit size δ. To shorten the critical path of the multiplier, the accumulated product is kept in carry-save form. However, an addition and a subtraction are then needed to obtain the final result. We have shown in detail how this addition and subtraction are performed with CSAs. CSAs are very suitable for FPGA implementations because FPGAs have fast carry chains to perform large additions. The CSA strategy simply divides the huge additions required by applications into smaller additions, which can be performed by the carry chains in FPGAs within a desired time.

The total computation time of the proposed digit-serial multiplier is much less than that of classical bit-serial multipliers. However, this improvement comes at the expense of area. The area requirement increases rapidly with digit size δ. The FPGA implementation results show that the minimum area-time product is achieved for digit size δ = 2. The proposed architecture with digit size δ = 2 has a considerably better area-time product than classical bit-serial multiplication. Also, the proposed architecture with digit size δ = 4 has an area-time product comparable to that of classical bit-serial multiplication.
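As a software illustration of the digit-serial computation summarized above, the following Python sketch processes one δ-bit digit of the first operand per iteration and ends with a conditional subtraction of the modulus. It is a textbook digit-serial Montgomery loop, not the paper's hardware architecture: plain integers replace the carry-save vectors, the sign convention ψ = −θ⁻¹ mod 2^δ is the standard Montgomery one and is assumed here, and the names `montgomery_digit_serial` and `n_bits` are hypothetical. It requires Python 3.8 or later for the three-argument `pow` with a negative exponent.

```python
def montgomery_digit_serial(a: int, b: int, theta: int, n_bits: int, delta: int) -> int:
    """Return a * b * 2^(-delta * m) mod theta, where m = ceil(n_bits / delta).

    Assumes theta is odd, theta < 2^n_bits, and 0 <= a, b < theta.
    """
    if theta % 2 == 0:
        raise ValueError("theta must be odd")
    radix = 1 << delta
    mask = radix - 1
    # Precomputation that depends only on the modulus (assumed sign convention).
    psi = (-pow(theta, -1, radix)) % radix
    m = -(-n_bits // delta)  # number of delta-bit digits of a

    p = 0
    for i in range(m):
        a_i = (a >> (delta * i)) & mask   # i-th digit of a
        p += a_i * b                      # accumulate partial product
        q = ((p & mask) * psi) & mask     # digit that clears the low delta bits
        p = (p + q * theta) >> delta      # exact division by 2^delta
    # With a, b < theta the loop keeps p < b + theta < 2 * theta,
    # so one conditional subtraction completes the reduction.
    if p >= theta:
        p -= theta
    return p


# Usage: check the result against the definition for a few digit sizes.
theta, a, b, n_bits = 239, 123, 200, 8
for delta in (1, 2, 4, 8):
    m = -(-n_bits // delta)
    r = montgomery_digit_serial(a, b, theta, n_bits, delta)
    assert (r << (delta * m)) % theta == (a * b) % theta
```

A larger digit size reduces the number of loop iterations from `n_bits` to about `n_bits / delta`, at the cost of wider digit products in each step, which mirrors the area versus time trade-off discussed above.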
As future research, the scheme in [22] can be used to avoid the final subtraction in the last step of the Montgomery algorithm, and a new digit-serial modular multiplier can be developed. In this way, applications can perform successive modular multiplications while keeping the operands and the results in carry-save form at all times. The advantage of such an architecture is that it will not need any fast adders. Its disadvantage is that each operand and each computed result is represented with a carry vector and a save vector. Thus, the cost of storing and manipulating the data will increase.

REFERENCES

[1] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Commun. ACM, vol. 21, no. 2, pp , Feb
[2] N. Koblitz, "Elliptic curve cryptosystems," Math. Comput., vol. 48, no. 177, pp ,
[3] V. S. Miller, "Use of elliptic curves in cryptography," in Advances in Cryptology CRYPTO (Lecture Notes in Computer Science), vol. 218, H. C. Williams, Ed. New York, NY, USA: Springer-Verlag, 1986, pp
[4] P. L. Montgomery, "Modular multiplication without trial division," Math. Comput., vol. 44, no. 170, pp , Apr
[5] C.-C. Yang, T.-S. Chang, and C.-W. Jen, "A new RSA cryptosystem hardware design based on Montgomery's algorithm," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 45, no. 7, pp , Jul
[6] A. F. Tenca and Ç. K. Koç, "A scalable architecture for Montgomery multiplication," in Cryptographic Hardware and Embedded Systems (Lecture Notes in Computer Science), Ç. K. Koç and C. Paar, Eds. London, U.K.: Springer-Verlag, 1999, pp
[7] A. F. Tenca and Ç. K. Koç, "A scalable architecture for modular multiplication based on Montgomery's algorithm," IEEE Trans. Comput., vol. 52, no. 9, pp , Sep
[8] C. McIvor, M. McLoone, and J. V. McCanny, "Modified Montgomery modular multiplication and RSA exponentiation techniques," IEE Proc.-Comput. Digit. Techn., vol. 151, no. 6, pp , Nov
[9] D. M. Harris, R. Krishnamurthy, M. Anders, S. Mathew, and S. Hsu, "An improved unified scalable radix-2 Montgomery multiplier," in Proc. 17th IEEE Symp. Comput. Arithmetic, Jun. 2005, pp
[10] M. D. Shieh, J. H. Chen, H. H. Wu, and W. C. Lin, "A new modular exponentiation architecture for efficient design of RSA cryptosystem," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 9, pp , Sep
[11] M. D. Shieh and W.-C. Lin, "Word-based Montgomery modular multiplication algorithm for low-latency scalable architectures," IEEE Trans. Comput., vol. 59, no. 8, pp , Aug
[12] M. Huang, K. Gaj, and T. El-Ghazawi, "New hardware architectures for Montgomery modular multiplication algorithm," IEEE Trans. Comput., vol. 60, no. 7, pp , Jul
[13] S.-R. Kuang, J.-P. Wang, K.-C. Chang, and H.-W. Hsu, "Energy-efficient high-throughput Montgomery modular multipliers for RSA cryptosystems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 11, pp , Nov
[14] S.-R. Kuang, K.-Y. Wu, and R.-Y. Lu, "Low-cost high-performance VLSI architecture for Montgomery modular multiplication," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 2, pp , Feb
[15] A. F. Tenca, G. Todorov, and Ç. K. Koç, "High-radix design of a scalable modular multiplier," in Proc. 3rd Int. Workshop Cryptogr. Hardw. Embedded Syst. (CHES), 2001, pp

[16] M. Knezevic, F. Vercauteren, and I. Verbauwhede, "Faster interleaved modular multiplication based on Barrett and Montgomery reduction methods," IEEE Trans. Comput., vol. 59, no. 12, pp , Dec
[17] A. Miyamoto, N. Homma, T. Aoki, and A. Satoh, "Systematic design of RSA processors based on high-radix Montgomery multipliers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 7, pp. 1136–1146, Jul. 2011.
[18] A. Rezai and P. Keshavarzi, "High-throughput modular multiplication and exponentiation algorithms using multibit-scan multibit-shift technique," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 9, pp. 1710–1719, Sep. 2015.
[19] S. R. Dussé and B. S. Kaliski, "A cryptographic library for the Motorola DSP56000," in Advances in Cryptology EUROCRYPT, vol. 473, I. B. Damgard, Ed. New York, NY, USA: Springer-Verlag, 1990, pp
[20] O. Arazi and H. Qi, "On calculating multiplicative inverses modulo 2^m," IEEE Trans. Comput., vol. 57, no. 10, pp , Oct
[21] T. Yanık, E. Savaş, and Ç. K. Koç, "Incomplete reduction in modular arithmetic," IEE Proc.-Comput. Digit. Techn., vol. 149, no. 2, pp , Mar
[22] C. D. Walter, "Montgomery exponentiation needs no final subtractions," Electron. Lett., vol. 35, no. 21, pp , Oct

Tuğrul Yanık received the B.S. degree in computer engineering from Agean University, Izmir, Turkey, in 1996, the M.S. degree in computer science and engineering from the Oregon Graduate Institute of Science and Technology, Beaverton, OR, USA, in 1999, and the Ph.D. degree in electrical and computer engineering from Oregon State University, Corvallis, OR, USA, in He was an Assistant Professor with the Department of Computer Engineering, Celal Bayar University, Manisa, Turkey. His current research interests include cryptography and network security, computer arithmetic, and architecture.

Serdar Süer Erdem received the B.S. degree in electrical and electronics engineering from Boğaziçi University, Istanbul, Turkey, in 1992, the M.S. degree in electrical and computer engineering from Pennsylvania State University, State College, PA, USA, in 1996, and the Ph.D. degree in electrical and computer engineering from Oregon State University, Corvallis, OR, USA, in He was a Research and Development Software Engineer with a number of technology companies. In 2004, he joined the Faculty of Electronics Engineering, Gebze Institute of Technology, Gebze, Turkey. His current research interests include cryptography, embedded systems security, computer arithmetic, finite fields, and network security.

Anıl Çelebi received the B.S., M.S., and Ph.D. degrees in electronics and communication engineering from Kocaeli University, Kocaeli, Turkey, in 2002, 2005, and 2008, respectively. Since 2002, he has been with the Department of Electronics and Telecommunications Engineering, Kocaeli University, where he is currently an Assistant Professor. His current research interests include VLSI design and implementation for analog/mixed signal systems, image processing, and video coding systems.
