FPGA Implementation of Pairings using Residue Number System and Lazy Reduction

Size: px

Start display at page:

Download "FPGA Implementation of Pairings using Residue Number System and Lazy Reduction"

Britney Dixon
6 years ago
Views:

1 FPGA Implementation of Pairings using Residue Number System and Lazy Reduction Ray C.C. Cheung 1, Sylvain Duquesne 2, Junfeng Fan 4, Nicolas Guillermin 2,3, Ingrid Verbauhede 4, and Gavin Xiaoxu Yao 1 1 Department of Electronic Engineering City University of Hong Kong, Hong Kong SAR, r.cheung@cityu.edu.hk, gavin.yao@student.cityu.edu.hk 2 IRMAR, UMR CNRS 6625, Université Rennes 1 Campus de Beaulieu, Rennes cedex, France 3 DGA.IS, La Roche Marguerite Bruz, France sylvain.duquesne@univ-rennes1.fr, nicolas.guillermin@m4x.org 4 Katholieke Universiteit Leuven, COSIC & IBBT Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium {junfeng.fan, ingrid.verbauhede}@esat.kuleuven.be Abstract. Recently, a lot of progress has been made in the implementation of pairings in both hardare and softare. In this paper, e present to FPGA-based high speed pairing designs using the Residue Number System and lazy reduction. We sho that by combining RNS, hich is naturally suitable for parallel architectures, and lazy reduction, hich performs one reduction for multiple multiplications, the speed of pairing computation in hardare can be largely increased. The results sho that both designs achieve higher speed than previous designs. The fastest version computes an optimal ate pairing at 126-bit security level in ms, hich is 2 times faster than all previous hardare implementations at the same security level. Keyords: Optimal Pairing, Residue Number System, Lazy Reduction, FPGA 1 Introduction and Motivation Bilinear pairings on elliptic curves have been introduced in cryptography in the middle of 90 s for cryptanalysis [18, 33]. In 2000, Joux introduced the first constructive use of pairings ith a tripartite key exchange protocol [25]. In the last decade many pairing-based schemes such as identity-based encryption [10], This ork as supported in part by the European Commission s ECRYPT II NoE (ICT ), by the Belgian State s IAP program P6/26 BCRYPT, by the K.U. Leuven-BOF (OT/06/40), by the Research Council K.U. Leuven: GOA TENSE (GOA/11/007), by French ANR projects no. 07-BLAN-0248 ALGOL and 09- BLAN CHIC, and by City University of Hong Kong Start-up Grant

2 identity-based signatures [12] and short signatures [11] have been proposed and studied. Compared ith other popular public key cryptosystems, e.g. Elliptic Curve Cryptography (ECC) [30, 34] and RSA [41], pairing computation is much more complicated. For this reason, efficient implementation of cryptographic pairings has received increasing interests [3, 8, 16, 20, 23, 27, 36]. The computation of a pairing can be broken don into modular operations in the underlying fields. For example, one optimal ate pairing [44] defined on a 256-bit Barreto-Naehrig (BN) curve [7] requires around 10 4 modular multiplications [3]. Thus, having an efficient modular multiplier is the key step to a high performance pairing processor. In this ork, e are interested in hardare implementation of pairings over large characteristic fields. In this case one possible optimization is to use lazy reduction. Lazy reduction in pairing computation as introduced by Scott [42] and then generalized by Aranha et al. in [3]. In short, it performs one reduction for expressions like A i B j, here A i, B j F p. Aranha et al. have shon that lazy reduction can significantly speed up optimal ate pairings in softare [3]. As suggested by Duquesne [14], lazy reduction can be combined ith the Residue Number System (RNS) to further reduce the complexity. Besides, the RNS distributes computation over a group of small integers, and is naturally suitable for parallel implementations [22, 29]. In this paper, e propose to FPGA-based pairing processors that use RNS representation and lazy reduction. The first design, referred as Design I, is based on a general (but enhanced) RNS data-path. Design I is scalable in terms of security level and flexible in terms of target devices. On an Altera Stratix III FPGA, Design I computes a pairing at 126-bit security level in 1.07 ms. The second design, referred as Design II, has an optimized architecture for pairings over 254-bit curves. We use a set of parameters that leads to a reduced complexity and an optimized datapath that benefits from the reduction. Design II computes a pairing at 126-bit security level in ms on a Xilinx Virtex-6 FPGA. To the best of our knoledge, these are the first hardare designs of pairing using RNS, and the designs outperform all previous hardare implementations at a similar security level [16, 19, 27]. The rest of the paper is organised as follos. Section 2 and Section 3 provide the background on optimal ate pairing and RNS, respectively. In Section 4 and Section 5 e describe the architecture of Design I and Design II. Section 6 describes the control of the data-path and the high level scheduling. We give a detailed analysis of the performance in Section 7. Finally, Section 8 concludes the paper. 2 Optimal Ate Pairings 2.1 Pairings on Barreto-Naehrig Curves A bilinear pairing is a non-degenerate map from G 1 G 2 to G T hich is linear in both components. Popular pairings such as Tate pairing [6], ate pairing [24], R-ate pairing [32], optimal pairing [44] choose G 1 and G 2 to be specific cyclic subgroups of E(F p k), and G T to be a subgroup of F p k.

3 Let F p be a finite field and let E be an elliptic curve defined over F p. Let l be a large prime dividing #E(F p ) and k the embedding degree ith respect to l, namely, the smallest positive integer k such that l p k 1. Small embedding degrees can be easily obtained using supersingular curves. Hoever, it is too small (k 2) if large characteristic base fields are used. Thus, e use ordinary curves ith prescribed embedding degrees constructed via the complex multiplication method as surveyed in [17]. We focus on the most popular one to date, namely the Barreto-Naehrig curves [7]. The reason for their popularity is that they are ell-suited for 128-bit security level and they have degree 6 tist. Let u Z such that p = 36u u u 2 + 6u + 1 and l = 36u u u 2 + 6u + 1 are prime. A BN curve is an elliptic curve defined over F p by E : y 2 = x 3 + b, here b 0 such that #E = l, and it has 12 as an embedding degree. In this paper, e mainly focus on the discussion of optimal ate pairing [36] because it is the most efficient to date for BN curves. Let r = 6u + 2, an optimal ate pairing on BN curves is defined as follos [2, 36]: a opt : E(F p 12) Ker(π p p) E(F p )[l] F p 12/ ( F p 12 ) l (Q, P ) ( ) p 12 1 l f (r,q) (P ) g (rq,πp(q))(p ) g (rq+πp(q), πp 2(Q)) (P ) here π p is the Frobenius map on the curve (π p (x, y) = (x p, y p )), and g (Q1,Q 2) is the line through Q 1 and Q Pairing Computation and Parameter Selection Pairing computation The computation consists of to main functions, f (r,q) and f p12 1 l. The function f (r,q) has the folloing divisor: div(f (r,q) ) = r(q) (rq) (r 1)(O). It is normally computed using a double-and-add method (also knon as Miller s loop [35]). Concerning f p12 1 l, also knon as the final exponentiation, Koblitz and Menezes sho in [31] that it can be split in to steps due to the integer factorization p 12 1 l = ( p 6 1 ) ( p ) ( p 4 p 2 ) + 1. l The first step is poering to p 6 1 and to p This is easily obtained via cheap Frobenius computations and an inversion. The second step is poering to hich is called the hard part of the final exponentiation. p 4 p 2 +1 l

4 Parameter selection The selection of parameters has an essential impact on the security and the performance of a pairing computation. It is explained in [39] ho to generate BN curves ith nice properties. For this ork e choose to curves both ith b = 2. The first one, BN 126, is defined by u = ( ) and has already been used in [2, 39]. It ensures only 126 bits of security, but is ell suited to registers on a general-purpose CPU hen lazy reduction is used. Hoever, FPGA architectures have no such constraints, since multipliers of larger size can alays be constructed ith DSP slices. Hence, e also consider a curve BN 128 defined by u = ( ) hich ensures 128 bits of security. Finally, e propose BN 192, the BN curve defined by u = ( ). Optimal pairing defined on this curve provides 192 bits of security [38]. In the three cases, the extension fields are defined as follos: F p 2 = F p [i]/(i 2 + 1) F p 6 = F p 2[β]/(β 3 (1 + i)) F p 12 = F p 6[Γ ]/(Γ 2 β) = F p 2[γ]/(γ 6 (1 + i)) This toer of extensions has many advantages, the most important being an efficient multiplication algorithm for the canonical polynomial base. Note that BN curves alays have degree 6 tists. This means E is isomorphic over F p 12 to a curve E defined by y 2 = x 3 + b ζ, here ζ is neither a square nor a cube in F p 2. In our case e take ζ = 1 + i so that it defines both the sextic extension of F p 2 and the tist. Then e can define tisted versions of pairings on E (F p 2) E(F p )[l]. In other ords, the coordinates of Q can be ritten as (x Q ζ 1 3, y Q ζ 1 2 ) here x Q and y Q are in F p 2. For u selected as a negative integer, the optimal ate pairing for BN curves is computed by Algorithm 1 [2] here dbl, add and hard-part are given in appendix. 3 Residue Number System A Residue Number System (RNS) represents a large integer using a set of smaller integers. Let B = {b 1, b 2,..., b n } be a set of pairise co-prime integers, and M B = n i=1 b i. For any integer X, 0 X < M B, there is a unique RNS representation on B: {X} B = {x 1, x 2,..., x n }, here x i = X bi, 1 i n. Throughout the paper e use a b to denote a mod b. Given {X} B, one can recover X using the Chinese Remainder Theorem (CRT): n X = x i B 1 i B i here B i = bi MB M B. (1) b i i=1 The set B is also knon as a base, and each element b i, 1 i n, is called an RNS modulus or an RNS channel. RNS representation admits efficient parallel computations. Consider to integers X, Y and their RNS representations {X} B = {x 1, x 2,..., x n } and {Y } B = {y 1, y 2,..., y n }, then e have { X Y MB } B = { x 1 y 1 b1,..., x n y n bn }, {+,,, /}. (2)

5 Algorithm 1 Optimal ate pairing on BN curves for u < 0 Require: P E(F p)[l], Q = (x Qγ 2, y Qγ 3 ) E(F p 12) Ker(π p p) ith x Q and y Q F p 2, r = 6u + 2 = s 1 i=0 ri2i, here u < 0. Ensure: a opt(q, P ) F p 12 1: T = (X T γ 2, Y T γ 3, Z T ) (x Qγ 2, y Qγ 3, 1), f 1 2: for i = s 2 donto 0 do 3: T, g dbl(t, P ), f f 2 g 4: if r i = 1 then 5: T, g add(t, Q, P ), f f g 6: end if 7: end for 8: T T, f f p6 (f p6 is equivalent to f 1 as noticed in [3]) 9: Q 1 π p(q), Q 2 π p(q 1) 10: T, g add(t, Q 1, P ), f f g 11: T, g add(t, Q 2, P ), f f g ( ) p : f f p6 1 13: f hard-part(f, u ) 14: return f Note that the division is available only if Y is co-prime ith M B. For all these operations, computations beteen x i and y i have no dependency on other channels, hich makes RNS naturally suitable for parallel implementations. The computation of X bi is called a channel reduction. To accelerate this operation, pseudo-mersenne numbers of the form b i = 2 ε i, here ε i < 2 /2, are typically selected as RNS moduli. Hence, the computation of X bi is performed using 2 times of X X/2 ε i + (X mod 2 ), and a correction step in the end to bring the result back to the range [0, b i ). 3.1 RNS Montgomery Reduction Using RNS representation ensures efficient computation in Z/M B Z. Unfortunately, it can t be applied directly in F p since M B is not prime. One ay to utilize RNS for field multiplication is to combine RNS and Montgomery reduction [22, 29]. This is shon in Algorithm 2. RNS Montgomery reduction requires to bases, B and C, ith M C co-prime to M B. The reason of including C is that division by M B is not possible in B. Note that the size of M B and M C, compared to p, determine the upper bound of input X. Guillermin found that if X < αp 2, M B > αp and M C > 2p, then Algorithm 2 has output S < 2p [22, Proposition 1]. This is an important principle for base selection. 3.2 Base Extension The operation to transform the representation in one RNS base to another base is called Base Extension (BE). To compute {X} C = {x 1, x 2,..., x n} from {X} B =

6 Algorithm 2 RNS Montgomery reduction [5] Require: RNS bases B and C ith M B > αp, M C > 2p, p coprime ith M B M C, {X} B and {X} C being the RNS representations of X < αp 2. Precomputed: { p 1 MB } B, { M 1 B M } C C and {p} C. Ensure: {S} B, {S} C such that S p = XM 1 B p and S < 2p 1: {Q} B {X} B { p 1 } B 2: {Q} B Base Extension {Q} C 3: {S} C ( {X} C + {Q} C {p} C ) {M 1 B 4: {S} B Base Extension {S} C } C {x 1, x 2,..., x n }, one can use the Posch-Posch method [40]. Given {X} B, for (1), there must exist an integer λ < n such that: n X = x i B 1 n n i B i = ξ i B i = ξ i B i λ M B (3) bi MB MB i=1 i i=1 here ξ i = x i B 1, 1 i n. In the Posch-Posch method, λ can be bi calculated by the folloing equation: n λ = i=1 ξ i B i n = M B i=1 In [29], ξ i /b i is further approximated by ξ i /2 as b i is chosen as a pseudo- Mersenne number near 2. Once λ is obtained, {X} C = {x 1,..., x n} can be computed as follos: n n x j = ξ i B i λ M B = ξ i B i cj λ M B cj. (5) cj cj i=1 B i cj and M B cj, 1 i, j n, can be precomputed once B and C are fixed. The folloing algorithm describes the computation of X ck. i=1 ξ i b i i=1 (4) 4 Design I: A Scalable Architecture In this section e propose an enhancement of the Cox-Roer architecture first proposed in [29] such that it is suitable for pairing computation. 4.1 Cox-Roer Architecture The Cox-Roer architecture as first proposed by Kaamura et al. in [29]. It as first implemented in a VLSI design of an RSA cryptosystem [37]. It as later enhanced by Guillermin [22] to support all arithmetic operations in F p and fast Elliptic Curve Scalar Multiplications (ECSM).

7 Algorithm 3 Base extension algorithm for k-th element of X [29] Require: X bi for i {1,.., n} Ensure: X ck Precomputed: B 1 i bi, B i ck, 1 i n, M B ck 1: ξ i x i B 1 i bi for i {1,.., n} 2: ψ 0, z 0 3: for i = 0 to (n 1) do 4: ψ ψ + ξ i 5: ρ ψ/2 //ρ {0, 1} 6: ψ ψ 2 7: z z + (ξ i B i ck ) ρ M B ck 8: end for 9: return z Fig. 1 shos our Cox-Roer implementation. The top-level structure is similar to the one proposed by [29]. It is consists of n similar Roer units performing in parallel operations in one RNS channel of B or C. The Cox unit is only used during the base extension (Algorithm 3) to generate the ρ. The Roer unit normally consists of a multiplier and channel reduction logic, and serves as the orkhorse of both multiplication (operation in (2)) and reduction (step 1, 3 in Algorithm 2 and step 1, 7 in Algorithm 3). The Cox-Roer is driven by a microcoded sequencer, and the sequencer can be easily reprogrammed to provide support for different algorithms. While the basic structure of our design resembles the Cox-Roer architecture in Guillermin s ECC processor [22], our architecture has an optimized pipeline structure and a more aggressive memory organization specifically designed for pairing support. 4.2 Cox-Roer Parametrization for Pairing First e need to parametrize the Cox-Roer to support the pairings on the three curves defined in Section 2.2. Popular Altera and Xilinx FPGAs have embedded multipliers or even multipliers (Virtex-5 and higher). Moreover, a full multiplier can be built by efficiently combining several such small multipliers. It is thus natural to select =18 or =36. For the implementation of BN 126 or BN 128, =36 gives the best trade-off. Indeed the gain in frequency brought by smaller multipliers and adders does not compensate the necessary cycle surplus of reductions. The parameter n can then be set to 8 for BN 126 and BN 128, and 19 for BN 192. Because of our chosen arithmetic (see Section 6), the value α defined in Section 3.1 reaches 198 for all the curves. The orst case is reached during the schoolbook multiplication in F p 12. One can verify that, in order to have M B > αp, =33 is enough to support BN 126 and =34 for both BN 126 and BN 128. For BN 192, is set to 35. In this architecture, e used the same method to select bases as in [22], ith pseudo-mersennes of the form t i = 2 ε i, t i B C and ε i positive.

8 An adaptation of Guillermin s architecture is necessary to provide pairing support. As the number of local variables and precomputations is much larger for pairings than that of ECSM (in fact, the Miller s loop has a built-in ECSM), e use a single triple port RAM of 256 ords instead of the ROM and a group of 16 registers to store precomputed values and temporary results. This is enough to support all curves listed in Section 2.2. Fig. 1. Design I: architecture of the Cox-Roer and its pipeline structure command sequencer in main bus out RAM 2 ti q Cox Ro1 Ro2 Ron main MUX q + 1 RAM q + 1 coxi main bus sequencer Triple port RAM 2q + 1 main bus ALU RAM inv main MUX register modular adder MUX modular subtracter unsigned multiplier 0 or 1 bit shifter

9 4.3 Pipeline Architecture Our goal is to keep the maximal frequency already available in [22]. Additive hardare must be carefully introduced, to keep the critical path under control. On the other hand, e can easily raise the pipeline depth ithout a lot of cycle loss. Indeed, pairing computation has more parallelizable operations than classical elliptic curve scalar multiplication over F p. The architecture in [37] and [22] uses 2-stage and 5-stage pipelines, respectively. For the implementation of pairings, e found that a pipeline of up to 10 stages can still be efficiently filled during the hole pairing computation except for F p inversion. Based on Guillermin s architecture, to accumulators are included in the pipeline to efficiently support pairing computation. Fig. 1 shos the adapted pipeline architecture. The first 4 stages perform a b ti, here t i B C. We also introduce shift and MUX units in the first 4 stages to compute 2 a b ti, hich are utilized to accelerate squarings in the extension fields. Three multipliers (, (q+1) and (q+1) q) are used, here q= log 2 ε i. In the 6 th stage e implemented to independent accumulators. They are preceded by a subtracter (hich computes a b ti ) and a MUX in the 5 th stage. The output of the first 4 stages can then be independently added, subtracted or ignored by both accumulators. These to accumulators, together ith the use of toer extensions, save a lot of cycles during the pairing computation. See Section 6 for details. On this architecture, an RNS multiplication costs 2 cycles and the results can be accumulated immediately. An RNS reduction costs 2n+3 cycles as previously described in [22]. 5 Design II: Hardare/Algorithm Co-optimization In this section, e propose an optimized design that achieves an even higher throughput. The improvement comes mainly from to tricks: a set of good bases that admits a less-expensive base extension, and a fine-tuned pipeline structure that allos a higher frequency. 5.1 Base Selection Revisited The core observation here is that the complexity of base extension can be reduced if the moduli in the to bases are close to each other. The base extension is the most computational expensive operation in the RNS Montgomery algorithm. It requires n 2 times of multiplications. Indeed, (5) can be ritten as a matrix multiplication belo. x 1. x n B 1 c1 B n c B 1 cn B n cn ξ 1. ξ n M B c1 γ. M B cn (6)

10 Note that the elements in the matrix, B i cj, 1 i, j n, are constants and are generated as follos: B i cj = n k=1,k i b k = cj n k=1,k i (b k c j ) (7) cj Define B i,j := n k=1,k i (b k c j ). When b k and c j are close to each other, the difference b k c j is small. In practice, B i,j could be much smaller than c j if n is relatively small. Note that using B i cj or B i,j makes no difference in the final results due to the channel reduction on the products. Furthermore, Bi,j for 1 i, j n ill be predictably divisible by 2 n 2 if c j is odd. Consider the to bases B = {b 1, b 2,, b n } and C = {c 1, c 2,, c n }. Since there is at most one even number in B C, (b i c j ) is divided by 2 unless b i or c j is even. In practice, Bi,j can have more than n 2 zero bits in the least significant bits (LSBs). To shrink the operand size of the matrix multiplications, e can use truncate the least significant zeros of Bi,j, and restore the correct results by a simple left-shift. We denote B i,j the B i,j after truncation. Considering the size of p and the Cox-Roer architecture, e again select n = 8 and thus b i (and c j ) close to The selection of moduli also takes into account the size of B i,j for 1 i, j n. The folloing 16 moduli ( = 33) ere selected as the bases. B = {2 1, 2 9, 2 + 3, , 2 + 5, 2 + 9, 2 31, }, C = {2, 2 + 1, 2 3, , 2 13, 2 21, 2 25, 2 33}. After applying the truncation of zero bits, e manage to reduce the bitlength of all B i,j (actually, B i,j ) from standard 34 to 25. While this complexity reduction seems negligible, it admits notable savings in hardare design. A detailed analysis of the base selection is given in Appendix A A Fine-tuned Roer for Pairing Computation Our refinement focuses solely on the design of Roers. Fig. 2(c) shos the overvie of a Roer. It consists of a dual mode multiplier, 3 accumulators, a channel reduction module, a channel adder, a triple port RAM for multiplier inputs and a RAM for adder inputs. Dual mode multiplier The multiplier in the Roer is built to support the bases selected in Section 5.1. There are to types of multiplication executed in an RNS multiplication: multiplication (step 1, 3 of Algorithm 2) and multiplication (step 1, 7 of Algorithm 3). The DSP slices in recent Xilinx FPGAs are made of a signed bit multiplier, a 17-bit left shifter, and an accumulator. The dual mode multiplier is built ith four DSP slices, hich supports either unsigned multiplication or to signed multiplications in parallel. Fig. 2(a) illustrates the structure of the dual mode multiplier.

11 Fig. 2. The fine-tuned Roer architecture B[33:17] or B' i,j B[33:0] or B' i,j B' i,j+4 A[33:0] or ξ j B[16:0] or B' i,j+4 A[33:0] ξ j ξ j+4 A[33:0] or ξ j+4 ρ {0, 1, 2} B' i,j ξ j + B' i,j+4 ξ j+4 { M B ci, M C bi } AB[69:0] <<bias cj Acc0 Acc1 Acc2 +/- +/- <<17 {1, 2, 3, 6} AB[69:0] B' i,j ξ j + B' i,j+4 ξ j+4 (a) Dual mode multiplier [73:0] (b) Accumulators RAM0 Dual Mode Multiplier DSP DSP DSP DSP ξ j ξ j+4 [73:0] [73:33] [32:0] ε i 25x35 multiplier ith 2-stage pipeline realized by 2 DSP48E1s 35x25 35x25 [48:33] Adder logic ε i [32:0] Constant multiplier realized by adders RAM1 Accumulators Acc0 Acc1 Acc2 ρ t i +/- [33:0] Adder Subtracter Channel Adder Channel Reduction ξ i [33:0] Adder/Subtracter (c) Overvie of Roer i (d) Channel reduction For pairing computation, multiplication by small constants (2, 3, 6) is idely employed. Therefore, a constant multiplier is also included in the pipeline. Because to multiplications are executed simultaneously in the base extension, the number of cycles to perform the Cox-Roer algorithm (Algorithm 3) is reduced from n to n/2. This is a significant speedup of the base extension. Other optimizations The architecture of channel reduction is shon in Fig. 2(d). As the Hamming eights of all the ε i are either equal to or less than 3 in nonadjacent form, multiplication by ε i can be realized by 4 adders instead of multipliers. In order to maintain a high operating frequency, e use a three-stage pipeline to realize the channel reduction. To achieve the maximum usage of the multiplier, a separate adder together ith a dedicated RAM is included. Because of the lazy reduction inside the Roer, the rite port of RAM0 is not alays occupied by the multiplier. Therefore, the addition and the multiplication can run in parallel. While most of the

12 additions in the pairing computation are performed by the accumulators, the separate adder is useful in point additions and doublings. While this architecture requires less cycles for reduction than Design I, it is less scalable. Indeed, in order to support larger p, e need to choose larger n. The bitlength of B i,j increases quickly hen n goes up, and to multiplications in parallel becomes impossible on the current Roer. 6 Scheduling the Pairing Algorithm In this section e present the implementation choices for every step of the pairing computation. 6.1 Arithmetic in F p 2: Back to the Schoolbook Method Karatsuba and derived interpolation methods have been used intensively in field operations in the literature to save the expensive multiplication [3, 8, 23, 42]. Karatsuba uses 3 instead of 4 multiplications at the cost of 3 extra additions. In a normal positional number system, this method saves computation poer, as multiplications are much more expensive than additions. Hoever, in RNS, the complexities of a multiplication and an addition are the same. Hence, the schoolbook method involves less operations (counting both additions and multiplications) and is preferred. On both architectures, a full F p 2 multiplication finishes in 8 cycles, and a squaring in only 6 cycles. 6.2 Arithmetic in F p 12: Interpolation ith Parsimony Let X = {x 1,, x 12 }, Y = {y 1,, y 12 } and Z = X Y F p 12. To accelerate F p 12 arithmetic, e aim to execute only once x i y j, for all {i, j} {1,, 12} 2. As i 2 = 1 and γ 6 =(1 + i), the result of x i y j is either added to or subtracted from at most to components of Z. This is the reason for the presence of to independent accumulators. Moreover, the result of the multiplication can be multiplied by 2 on the fly before it is accumulated, hich speeds up the squaring in F p 12. We estimate if the interpolation techniques at higher levels may save cycles or not. We exclude the reduction step, therefore the cycle count is equivalent on the flexible as ell as the optimized design. We only consider Karatsuba in F p 12 over F p 6 hich is the best case for these techniques (interpolation at other levels F p 12/F p 4, F p 6/F p 2 or F p 4/F p 2 ould give orse results). Four different types of multiplications and squaring are needed during pairing, each one requiring a different algorithm: Squaring during the Miller loop: the operand of this squaring has no specific structure. We propose to use schoolbook squaring. Thanks to the multiplication by 2 included in the pipeline of both Design I and Design II, a squaring costs 156 cycles. The interpolation on F p 12/F p 6 leads to the folloing

13 formula: (X H + Γ X L ) 2 = ( (X H + X L )(X H + Γ 2 X L ) (1 + Γ 2 )X H X L ) + Γ (2X H X L ) (ith X H, X L F p 6). We found no method to go belo 156 cycles on both designs. Multiplication by the line in the Miller loop: half the values of B are equal to 0, therefore the schoolbook multiplication costs 144 cycles. Interpolation techniques do not manage to take the advantage of the half size operand. Squaring during final exponentiation: during the hard part of FE, the operands to be squared are in G Φ6 (F p 2). We use the formulae given in [21] (Algorithm A.1). On Design I, e need 84 cycles to implement it. This approach is much more efficient than interpolation techniques. 5 Multiplication during final exponentiation: Without counting pipeline bubbles, Karatsuba requires 278 and 254 cycles on Design I and Design II, respectively, hile the schoolbook method requires 288 on both. Note that there are only 20 such multiplications in the hole pairing for BN 126, and the savings are really limited. Because of the moderate gain and the complication brought to the sequencer, e decided not to implement it. 6.3 F p Inversion In the final exponentiation of pairing, an F p 12 inversion is required. It can be done ith only one inversion, 35 reductions and additional multiplications/additions in F p [14, 23], but the remaining inversion in F p is very expensive. Since comparison in RNS is difficult, inversion through exponentiation (X 1 X p 2 mod p) is used. For this operation, e use a simple square and multiply algorithm ith least significant bit first. It is more memory consuming, but it allos to perform multiplications in parallel. On a pipelined datapath, LSB-first exponentiation is more efficient than the MSB-first method. In total, an inversion needs log(p) 1 squarings and many multiplications, but the cost of multiplications is hidden in the pipeline. 6.4 Higher Level Scheduling Based on the operation over F p and its extension fields, e are ready to implement the hole pairing computation. The folloing to points are important to efficiently utilize the datapath and to have limited memory usage: control of the number of local variables to limit the size of RAMs; control of the dependency beteen operations, to avoid pipeline bubbles. For the first point, the step hich requires the most live local variables is in the hard-part of the final exponentiation. We use the algorithm described by Scott et al. [43], hich is to date the fastest ay to compute the hard-part (Algorithm A.1). To keep its full poer hile limiting the number of local variables 5 More recently, Karabina introduced a compressed form for elements in F p 12 hich require less operations to be squared [3, 28]. Unfortunately, this method involves extra inversions so that it is not suitable for our designs.

14 (ith a classical register allocation technique) e slightly rearranged their original formulae. For the second point, excluding the F p inversion here idle states cannot be avoided, the most constrained part is the F p 2 arithmetic in the Miller loop. Therefore e rearranged the projective coordinates addition and doubling formulae to emphasize the inherent parallelism. Formulas can be found in Algorithm 4 and 5 in appendix. On both architectures, e managed to eliminate almost all pipeline bubbles in the Miller loop. Idle states on the multiplier remain less than %1 of the time on both architectures in the pairing computation, excluding the F p inversion. 7 Implementation Results and Analysis 7.1 Area The prototype of the proposed pairing coprocessors ere implemented on commercial FPGAs. Table 1 gives the logic utilization of both designs. Design I We synthesized the first design for n = 8 on three different FPGAs : EP2C35: A lo cost (less than $100) 65 nm node Altera FPGA. It is designed for industrial series production; EP2S30: A 65 nm node high end series of Altera. It is picked for the sake of comparison ith the Xilinx Virtex-4 used in [15]. EP3SE50: A 40 nm node high end FPGA. Note that it is the smallest FPGA of the series. BN 192 (n = 19) is also implemented on this device. The area consumption is given in ALM, the equivalence of the Xilinx Virtex-4 slice, and in LE for the Cyclone, hich can be considered as a simple 4 4 LUT ith carry chain and registers. We refer the reader to [1] for details. We let the synthesizer decide ho to implement multiplications (ith speed constraints). Note that this choice lead to the maximal use of DSP blocks. We used also embedded RAMs: the RAM of each Roer is built the M4k or M9k (depending on the FPGA). We gave no instructions to the design suite for the implementation of the sequencer (containing 20 kb microcode), therefore they ere placed in the big available memories. The same design provides support for the to curves defined in 2.2, ith the only difference being the content of the RAM blocks (precomputed values and the sequencer). Design II Design II is implemented on a Xilinx Virtex-6 XC6VLX240T-2 FPGA, hich embeds DSP slices. As there are 8 Roers and each Roer contains 4 DSP slices, the total number of DSPs is 32. Data RAMs used in each Roer are implemented ith distributed memory blocks. The block RAMs (BRAMs) serve as the microcode sequencer. Thanks to the fine-tuned pipeline, the coprocessor can operate at 250MHz.

15 Table 1. Logic Utilization n Device Freq. Multipliers Logic Data Sequencer Elements Memory Cyclone II 91 MHz bit Mult LE 32 M4k 35 M4k 8 Stratix II 165 MHz 72 DSP18el 4227 ALMs 32 M4k 1 M512 Design I Stratix III 165 MHz 72 DSP18el 4233 ALMs 16 M9k 1 M M9k 19 Stratix III 131 MHz 171 DSP18el 9910 ALMs 38 M9k 1 M M9k Design II 8 Virtex MHz 32 DSP48E1s 7032 Slices Kb BRAMs 7.2 Performance Table 2 gives number of cycles used by the sub-functions used in optimal ate pairing on both Design I and Design II. Due to the carefully selected bases and dual mode multiplier, Design II achieved a loer cycle count. The improvement mainly comes from the speedup of RNS reduction: 19 cycles on Design I compared to 12 cycles on Design II hen n = Comparison and Discussion Table 3 lists the performance of softare and hardare implementations reported in recent literature. Both Design I and II achieve a better performance than previous hardare implementations [16, 19, 27]. Due to the use of different platforms, a fair comparison is difficult. Nevertheless, compared ith the design of [16] hich also uses Virtex-6, Design II achieves a speedup of factor 2. The softare implementations achieve very high performance. Although e did not break the softare record, the speed of our design is already close to that of softare. The speedup comes mainly from three improvements. First, RNS multiplication has loer complexity than traditional integer multiplications. For example, an 256-bit multiplication on RNS involves multiplications, hile a 256- bit integer multiplication requires multiplications using Karatsuba s method. Second, the use of lazy reduction reduces the cost of multiplications in extension fields. Third, RNS representation is very parallelization-friendly. Indeed, it only involves relatively small numbers (no long carry propagation) and alays uses data in the local memory (no inter-core communication overhead). The performance of both Design I and II has demonstrated the efficiency of RNS in pairing implementations, and confirms ith actual implementations on FPGA the analysis of [14]. Table 2. Cycle count in one optimal pairing Curve Mul./Red. 2T and T+Q and f 2 f g Miller s Final Total in F p g (T,T ) (P ) g (T,Q) (P ) Loop Exp. BN ,530 89, ,111 2 / Design I BN ,480 94, ,502 BN / , , ,849 Design II BN / ,116 81, ,111

16 Table 3. Performance comparison of softare and hardare implementations of pairings Design Pairing Security Platform Algorithm Area Freq. Cycle Delay [bit] [MHz] [ 10 3 ] [ms] Altera FPGA LE (Cyclone II) 35 mult. 126 Altera FPGA RNS 4233 A Design I optimal ate (Stratix III) Montgomery 72 DSPs 192 Altera FPGA 9910 A (Stratix III) 171 DSPs Design II optimal ate 126 Xilinx FPGA RNS 7032 slices (Virtex-6) Montgomery 32 DSPs [16] ate Xilinx FPGA Hybrid 4014 slices optimal ate (Virtex-6) Montgomery 42 DSPs Tate 1, [19] ate 128 Xilinx FPGA Blakley 52k Slices 50 1, optimal ate (Virtex-4) Tate 11, [27] ate 128 ASIC Montgomery 97 kgates 338 7, optimal ate (130 nm) 5, [15] Tate over Xilinx FPGA 4755 Slices 128 F (Virtex-4) - 7 BRAMs [2] optimal Eta Xilinx over F Virtex-4 Slices [23] ate - 15, bit Core2 Montgomery 2400 optimal ate 10, [20] ate bit Core2 Montgomery , [36] optimal ate 128 Core2 Quad Hybrid Mult , [8] optimal ate 126 Core i7 Montgomery , [3] optimal ate 126 Phenom II Montgomery , [4] η T over F Xeon , [9] η T over F Core i , [2] opt. Eta F Core i , Estimated by the authors. 8 Conclusions In this paper, e demonstrated that RNS together ith lazy reduction is a really competitive approach for pairing computation in hardare. Thanks to the use of RNS, both datapath and memory can be nicely parallelized. The results sho that both designs e proposed achieve higher speed than all previous designs in hardare. The fastest version computes an optimal ate pairing at 126-bit security level in ms, hich is 2 times faster than all previous hardare implementations. Moreover, e also reported the first hardare pairing implementation at 192-bit security level. For future ork, e ould like to further optimize the pipeline structure to achieve higher speed, particularly for 192-bit optimal pairing. We ould like implement using RNS a pairing processor that provides 256-bit security. For these levels, BN curves may become less competitive, and have to be compared to other approaches such like KSS curves [26]. Also, as shon in Section 5, carefully selected RNS bases can help to reduce the complexity of RNS base extension. Hoever, it is not clear ho much e can benefit from it hen the

17 number of channels is relatively large, and this is definitely part of the design exploration for the implementation of pairings at higher security levels. References 1. Altera eb site D. Aranha, J.-L. Beuchat, J. Detrey, and N. Estibals. Optimal eta pairing on supersingular genus-2 binary hyperelliptic curves. Cryptology eprint Archive, Report 2010/559, D. Aranha, K. Karabina, P. Longa, C. H. Gebotys, and J. López. Faster explicit formulas for computing pairings over ordinary curves. In Advances in Cryptology EUROCRYPT 2011, volume 6632 of LNCS, pages Springer, Heidelberg, D. Aranha, J. López, and D. Hankerson. High-speed parallel softare implementation of the η T pairing. In Topics in Cryptology - CT-RSA 2010, volume 5985 of LNCS, pages Springer, Heidelberg, J.-C. Bajard, L.-S. Didier, and P. Kornerup. An RNS Montgomery modular multiplication algorithm. Computers, IEEE Transactions on, 47(7): , P. Barreto, H. Kim, B. Lynn, and M. Scott. Efficient algorithms for pairing-based cryptosystems. In Advances in Cryptology CRYPTO 2002, volume 2442 of LNCS, pages Springer, Heidelberg, P. Barreto and M. Naehrig. Pairing-friendly elliptic curves of prime order. In Selected areas in cryptography SAC 2005, volume 3897 of LNCS, pages Springer, J.-L. Beuchat, J. González-Díaz, S. Mitsunari, E. Okamoto, F. Rodríguez- Henríquez, and T. Teruya. High-speed softare implementation of the optimal ate pairing over Barreto-Naehrig curves. In Pairing-Based Cryptography - Pairing 2010, volume 6487 of LNCS, pages Springer, Heidelberg, J.-L. Beuchat, E. López-Trejo, L. Martínez-Ramos, S. Mitsunari, and F. Rodríguez- Henríquez. Multi-core implementation of the Tate pairing over supersingular elliptic curves. In Cryptology and Netork Security, volume 5888 of LNCS, pages Springer, Heidelberg, D. Boneh and M. Franklin. Identity-based encryption from the Weil pairing. In Advances in Cryptology CRYPTO 2001, volume 2139 of LNCS, pages Springer, Heidelberg, D. Boneh, B. Lynn, and H. Shacham. Short signatures from the Weil pairing. Journal of Cryptology, 17(4): , J. C. Cha and J. H. Cheon. An identity-based signature from gap Diffie-Hellman groups. In International Workshop on Theory and Practice in Public Key Cryptography PKC 2003, volume 2567, pages 18 30, Springer, Heidelberg, C. Costello, T. Lange, and M. Naehrig. Faster pairing computations on curves ith high-degree tists. In Public Key Cryptography PKC 2010, volume 6056 of LNCS, pages Springer, Heidelberg, S. Duquesne. RNS arithmetic in F p k and application to fast pairing computation. Cryptology eprint Archive, Report 2010/555, to appear in Journal of Mathematical Cryptology. 15. N. Estibals. Compact hardare for computing the Tate pairing over 128-bitsecurity supersingular curves. In Pairing-Based Cryptography - Pairing 2010, volume 6487 of LNCS, pages Springer, Heidelberg, 2010.

18 16. J. Fan, F. Vercauteren, and I. Verbauhede. Efficient hardare implementation of F p-arithmetic for pairing-friendly curves. Computers, IEEE Transactions on, PP(99):1, D. Freeman, M. Scott, and E. Teske. A taxonomy of pairing-friendly elliptic curves. Journal of Cryptology, 23: , G. Frey and H.G. Rück. A remark concerning m-divisibility and the discrete logarithm in the divisor class group of curves. Mathematics of computation, 62(206): , S. Ghosh, D. Mukhopadhyay, and D. Roychodhury. High speed flexible pairing cryptoprocessor on FPGA platform. In Pairing-Based Cryptography - Pairing 2010, volume 6487 of LNCS, pages Springer, Heidelberg, P. Grabher, J. Großschädl, and D. Page. On softare parallel implementation of cryptographic pairings. In Selected Areas in Cryptography, volume 5381 of LNCS, pages Springer, Heidelberg, R. Granger and M. Scott. Faster squaring in the cyclotomic subgroup of sixth degree extensions. Public Key Cryptography PKC 2010, 6056: , N. Guillermin. A High Speed Coprocessor for Elliptic Curve Scalar Multiplications over F p. In Cryptographic Hardare and Embedded Systems-CHES 2010, volume 6225 of LNCS, pages 48 64, Springer, Heidelberg, D. Hankerson, A. Menezes, and M. Scott. Softare Implementation of Pairings, volume 2 of Cryptology and Information Security Series, pages IOS Press, M. Joye and G. Neven edition, F. Hess, N.P. Smart, and F. Vercauteren. The Eta pairing revisited. Information Theory, IEEE Transactions on, 52(10): , A. Joux. A one round protocol for tripartite Diffie-Hellman. Journal of Cryptology, 17: , E. J. Kachisa, E. F. Schaefer, and M. Scott. Constructing brezing-eng pairingfriendly elliptic curves using elements in the cyclotomic field. In Pairing-Based Cryptography - Pairing 2008, volume 5209 of LNCS, pages , Springer, Heidelberg, D. Kammler, D. Zhang, P. Schabe, H. Scharaechter, M. Langenberg, D. Auras, G. Ascheid, and R. Mathar. Designing an ASIP for cryptographic pairings over Barreto-Naehrig curves. In Cryptographic Hardare and Embedded Systems-CHES 2009, volume 5747 of LNCS, pages Springer, Heidelberg, K. Karabina. Squaring in cyclotomic subgroups. Cryptology eprint Archive, Report 2010/542, S. Kaamura, M. Koike, F. Sano, and A. Shimbo. Cox-roer architecture for fast parallel montgomery multiplication. In Advances in Cryptology EUROCRYPT 2000, volume 1807 of LNCS, pages Springer, Heidelberg, N. Koblitz. Elliptic Curve Cryptosystem. Math. Comp., 48: , N. Koblitz and A. Menezes. Pairing-based cryptography at high security levels. Cryptography and coding, 3796:13 36, E. Lee, H.-S. Lee, and C.-M. Park. Efficient and generalized pairing computation on abelian varieties. Information Theory, IEEE Transactions on, 55(4): , A.J. Menezes, T. Okamoto, and S.A. Vanstone. Reducing elliptic curve logarithms to logarithms in a finite field. Information Theory, IEEE Transactions on, 39(5): , V. Miller. Uses of Elliptic Curves in Cryptography. In Advances in Cryptology: Proceedings of CRYPTO 85, volume 218 of LNCS, pages Springer, Heidelberg, 1985.

19 35. V. S. Miller. The Weil pairing, and its efficient calculation. Journal of Cryptology, 17: , /s M. Naehrig, R. Niederhagen, and P. Schabe. Ne softare speed records for cryptographic pairings. In Progress in Cryptology LATINCRYPT 2010, volume 6212 of LNCS, pages Springer, Heidelberg, H. Nozaki, M. Motoyama, A. Shimbo, and S. Kaamura. Implementation of RSA algorithm based on RNS montgomery multiplication. In Cryptographic Hardare and Embedded Systems-CHES 2001, volume 2162 of LNCS, pages , Springer, Heidelberg, National Institute of Standard and technology. Key management, management.html. 39. G.C.C.F. Pereira, M.A. Simplício, M. Naehrig, and P.S.L.M. Barreto. A family of implementation-friendly BN elliptic curves. Journal of Systems and Softare, K. Posch and R. Posch. Base extension using a convolution sum in residue number systems. Computing, 50:93 104, R. L. Rivest, A. Shamir, and L. Adleman. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Communications of the ACM, 21(2): , M. Scott. Implementing cryptographic pairings. In Pairing-Based Cryptography - Pairing 2007, volume 4575 of LNCS, pages Springer, Heidelberg, M. Scott, N. Benger, M. Charlemagne, L. Dominguez Perez, and E. Kachisa. On the final exponentiation for calculating pairings on ordinary elliptic curves. Pairing- Based Cryptography Pairing 2009, volume 5671 of LNCS, pages 78 88, Springer, Heidelberg, F. Vercauteren. Optimal pairings. Information Theory, IEEE Transactions on, 56(1): , A Appendix A.1 Sub-routines for Optimal Ate Pairing Algorithm 4 dbl, doubling step Require: T = (X T γ 2, Y T γ 3, Z T ) E(F p 12) ith X T, Y T and Z T F p 2, P = (x P, y P ) E(F p). Ensure: The point 2T and the evaluation in P of the equation of the tangent line in T to the curve up to multiplicative factors in F p 2. 1: B Y 2 T, C 3Z 2 T, D 2X T Y T 2: F B + 3iC, G B 3iC, H 3C, t 3 B + ic, A X 2 T, E 2Y T Z Y 3: X 2T DF, Y 2T G 2 + 4HC, Z 2T 4BE, t 0 Ey P, t 1 3Ax P 4: return (X 2T γ 2, Y 2T γ 3, Z 2T ), t 0 + t 1γ + t 3γ 3 These algorithms are specific to the BN curve y 2 = x and the toer of extensions given in Subsection 2.1. Jacobian coordinates are usually used for pairing computations [8, 23, 36] but projective coordinates are more interesting

20 in our case [3, 13]. Using the degree 6 tist on the curve, the point Q can be ritten as (x Q γ 2, y Q γ 3 ) ith x Q and y Q F p 2. The doubling step of the Miller loop consists of to stages: the doubling of a temporary projective point T = (X T γ 2, Y T γ 3, Z T ) ith X T, Y T and Z T F p 2 and the evaluation in P of the tangent line in T to the curve. This is given in Algorithm 4 here the classical formulae are rearranged in a ay to highlight the reductions (every temporary result needs a reduction except F, G, H and t 3 ), and the inherent parallelism in the local variables (each line can be implemented in random order). This is important to avoid idle states in Cox-Roer. The total cost of this step is 4 multiplications, 5 squarings, 8 reductions in F p 2 and 2 modular multiplications of an element of F p 2 by an element of F p. Note that multiplications like 2X T Y T can be transformed into squaring of (X T + Y T ) at the cost of some extra additions [3, 13], thus it is not interesting for our design. In the same ay, the cost of the addition step given by Algorithm 5 is 11 multiplications, 2 squarings, 11 reductions in F p 2 and 2 modular multiplications of an element of F p 2 by an element of F p. Algorithm 5 add, addition step Require: T = (X T γ 2, Y T γ 3, Z T ) E(F p 12) ith X T, Y T and Z T F p 2, Q = (x Qγ 2, y Qγ 3 ) E(F p 12), P = (x P, y P ) E(F p). Ensure: The point T + Q and the evaluation in P of the equation of the line passing through T and Q up to multiplicative factors in F p 2. 1: E x QZ T X T, F y QZ T Y T 2: E 2 E 2, F 2 F 2 3: A F 2Z T 2X T E 2 EE 2, B X T E 2, E 3 EE 2 4: X T +Q AE, Z T +Q Z T E 3, t 3 F x Q Ey Q 5: Y T +Q F (B A) y QE 3, t 0 Ey P, t 1 F x P 6: return (X T +Qγ 2, Y T +Qγ 3, Z T +Q), t 0 + t 1γ + t 3γ 3 Algorithm 6 hard-part, hard part of the final exponentiation according [43] Require: f F p 12 of order p 4 p 2 + 1, x = u. Ensure: f (p4 p 2 +1)/l ith p and l as in 2.1. {Computation of the y i} ( ( 1: y 0 f p f p2 f p3, y 1 f x, y 3 y1 x, y 5 y3 x, y 4 y p 5, y6 y4y5 = f x3 f x3) p) (f x2) p) 2: y 5 y p 3, y2 y 1 5 (=, y4 y1y2 f x / ( ( 3: y 2 y p 5 = f x2) p 2), y 5 y 1 3 ( = 1/f x2), y 3 y p 1 ((mx ) p ), y 1 f 1 {Multi-addition chain for computing y 0.y 2 1.y 6 2.y 12 3.y 18 4.y 30 5.y 36 6 } 4: t 0 y 2 6, t 0 t 0y 4, t 0 t 0y 5, t 1 y 3y 5, t 1 t 1t 0, t 0 t 0y 2, t 1 t 2 1 5: t 1 t 1t 0, t 1 t 2 1, t 0 t 1y 1, t 1 t 1y 0, t 0 t 2 0, t 0 t 0t 1 6: return t 0

21 Algorithm 7 Squaring during final exponentiation (hard-part) [21] Require: A = 5 i=0 aiγi F p 12 ith a i F p 2, 0 i 5 Ensure: A 2 A 0 3a (1 + i)a 2 3 2a 0, A 1 6(1 + i)a 2a 5 + 2a 1 A 2 3a (1 + i)a 2 4 2a 2, A 3 6a 0a 3 + 2a 3 A 4 3a (1 + i)a 2 5 2a 4, A 5 6a 1a 4 + 2a 5 return 5 i=0 Aiγi A.2 RNS Parameter Selection Since p should be around 254 bits, e set n to be 8 and the moduli are chosen near We consider for i, j [1,n] the folloing issues: (1) bitlength of B i cj and C i bj ; (2) Hamming eight of b i and c j. A simple (bounded) exhaustive search program returns the bases shon in Section 5. We denote the bitlength of all B i,j in a length matrix, L B, and the bitlength matrix of all Ci,j as L C. For the selected bases, e have the folloing L B and L C L B = , L C = After truncating the least significant zeros, e get the folloing bitlength, denoted as L B and L C. Note that e applied the same truncation parameter for all the numbers in the same ro. The number of zeros truncated is chosen as min{z(b 1,j ), z(b 2,j ),, z(b n,j )} here z(b i,j ) gives the number of zeros at the LSBs of B i,j L B = , L C = No all of B i,j and C i,j are less than 25 bits, and they fit in one operand of an FPGA DSP slice, hile the standard 34-bit operands do not. In fact, in the implementation e do not truncate more zeros as far as all elements in that ro fit in 25 bits.

A FPGA pairing implementation using the Residue Number System. Sylvain Duquesne, Nicolas Guillermin

A FPGA pairing implementation using the Residue Number System Sylvain Duquesne, Nicolas Guillermin IRMAR, UMR CNRS 6625 Université Rennes 1 Campus de Beaulieu 35042 Rennes cedex, France sylvain.duquesne@univ-rennes1.fr,