
Efficient Algorithms for Multiplication on Elliptic Curves
by Volker Müller
TI-9/97, 22. April 1997
Institut für theoretische Informatik

Volker Müller
Technische Hochschule Darmstadt, Fachbereich Informatik
Alexanderstr., 64283 Darmstadt, Germany
Email: vmueller@cdc.informatik.th-darmstadt.de
28th April 1997

Abstract

We describe new fast algorithms for multiplying points on elliptic curves over finite fields of characteristic greater than three. In contrast to the standard binary algorithm, these algorithms use representations of the multiplier with negative coefficients. Timings of the new algorithms show that they are up to 25% faster than the standard binary multiplication algorithm. This running time improvement is especially important for using elliptic curve cryptosystems on smart cards.

Key words: elliptic curve cryptosystem, multiplication.

1 Introduction

The growing importance of public key cryptography in the last decade has motivated the search for optimal algorithms for fast exponentiation in various groups. Exponentiation is the main bottleneck in several cryptosystems such as RSA and ElGamal. In recent years, elliptic curve public key cryptosystems have become more and more popular. These cryptosystems are variants of the ElGamal scheme, but they use the group of points on an elliptic curve over a finite field (for a description of such systems, see [3], [6] or [2]). Here, multiplication of a point by a large integer is the most time consuming operation of the encryption and decryption procedures. In this paper, we describe four new algorithms for this key operation, which use special properties of elliptic curves. These new algorithms lead to a running time improvement of up to 25%. Moreover, the algorithms are memory efficient, so that they can also be used for elliptic curve cryptosystem implementations on smart cards.

We start with a short introduction to elliptic curves over finite fields of characteristic greater than three. It should be mentioned that the techniques of this paper can also be used for elliptic curves over fields of characteristic two. Let $p > 3$ be a prime, and let $\mathbb{F}_q$ be the finite field with $q = p^n$ elements. An elliptic curve $E$ over $\mathbb{F}_q$ can be defined by an equation of the form

$$y^2 = x^3 + a_4 x + a_6, \qquad (1)$$

where $a_4, a_6 \in \mathbb{F}_q$ and $4a_4^3 + 27a_6^2 \ne 0$. The set $E(\mathbb{F}_q)$ of points on $E$ over $\mathbb{F}_q$ is given by the set of solutions in $\mathbb{F}_q^2$ to (1) together with a "point at infinity" $O$. This set $E(\mathbb{F}_q)$ forms a finite abelian (additive) group. There exist simple algebraic formulas for adding two arbitrary points in $E(\mathbb{F}_q)$ (see [1]).

For the speed of elliptic curve cryptosystems, the number of elementary field operations for point addition is important. Here we are only interested in "quadratic" field operations, i.e. we do not count operations which can be done in linear time. One important observation is that negating a point is "for free", since for any nonzero point $P = (x, y) \in E(\mathbb{F}_q)$ the negative point is given as $-P = (x, -y)$. If we count the quadratic field operations of the other basic point operations, we get the following results: doubling a point takes one multiplication, two squarings and one inversion; adding two different points can be done with one multiplication, one squaring and one inversion in $\mathbb{F}_q$. In practice, the inversion is by far the most time consuming part of these operations (in the computer algebra library LiDIA, one inversion of a random element in a 55 bit prime field takes about the same time as 25 multiplications in this field).

In the remainder of this paper, let $m \in \mathbb{N}_{>0}$, and let $P \in E(\mathbb{F}_q)$ be a nonzero point on some given elliptic curve $E$. In the following sections, we describe several new algorithms for computing the multiple $m \cdot P \in E(\mathbb{F}_q)$. These algorithms are designed to exploit the special properties of elliptic curves.
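The group law and the standard binary method referred to above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the curve is taken over a prime field $\mathbb{F}_p$, `None` stands for the point at infinity $O$, and the function names `neg`, `add` and `binary_multiply` are hypothetical helpers, not part of LiDIA or of the paper.

```python
from typing import Optional, Tuple

Point = Optional[Tuple[int, int]]  # None models the point at infinity O (an assumption)


def neg(P: Point, p: int) -> Point:
    """Negation needs no quadratic field operation: -(x, y) = (x, -y)."""
    if P is None:
        return None
    x, y = P
    return (x, -y % p)


def add(P: Point, Q: Point, a4: int, p: int) -> Point:
    """Affine addition or doubling on y^2 = x^3 + a4*x + a6 over F_p."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None  # P + (-P) = O; also covers doubling a point with y = 0
    if P == Q:
        lam = (3 * x1 * x1 + a4) * pow(2 * y1 % p, -1, p) % p   # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p         # chord slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)


def binary_multiply(m: int, P: Point, a4: int, p: int) -> Point:
    """Standard left-to-right binary method: one doubling per bit, one addition per 1-bit."""
    H = None
    for bit in bin(m)[2:]:
        H = add(H, H, a4, p)
        if bit == "1":
            H = add(H, P, a4, p)
    return H
```

On the toy curve $y^2 = x^3 + 2x + 3$ over $\mathbb{F}_{97}$, for instance, `add((3, 6), (3, 6), 2, 97)` yields `(80, 10)`. Each call to `add` spends exactly one modular inversion, which is the operation the algorithms of this paper try to save.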
2 Left-to-Right Addition Chains

The usual method for computing $m \cdot P$ is a variant of binary exponentiation. The running time of this algorithm depends on the bit length of $m$ and on the number of ones in the binary decomposition of $m$. Morain and Olivos [7] developed an extension of this binary algorithm which uses a decomposition of the form

$$m = \sum_{i=0}^{k} m_i 2^i, \qquad m_i \in \{0, 1, -1\}. \qquad (2)$$

In general, the number of nonzero coefficients in this representation is smaller than the number of ones in the binary decomposition of $m$. Since negating a point on an elliptic curve is free, their algorithm is faster than the usual binary method (see Section 5, where we list some timings). The algorithm works as follows: it reads the bits of the binary decomposition of $m$ "from the right to the left" (i.e. from the low order bit to the high order bit). For each bit, the algorithm reacts according to the current state of a given finite automaton, changes the state and multiplies on the fly. In [7], two suitable finite automata are given.

We generalize the idea of [7] to describe an algorithm which uses a decomposition like (2), but reads the bits of $m$ in the opposite direction, i.e. bits are handled "from the left to the right" (from the high order bit to the low order bit). Again, the multiplication algorithm is "given" by a finite automaton.

2.1 The Basic Version

The basic idea of [7] is the observation that blocks of 1's in the binary decomposition of $m$ can be substituted by "equivalent" bit blocks which have fewer nonzero entries. For example, the computation of $15 \cdot P$ with the binary method takes 3 doublings and 3 point additions, but using the equality $15 \cdot P = 16 \cdot P - P$ it can also be done in 4 doublings and 1 addition. In this example, we have substituted the bit block $(0\,1\,1\,1\,1)_2$ by the "equivalent" block $(1\,0\,0\,0\,{-1})_2$. In general, we substitute a block $(0\,1^a)_2$, $a \ge 2$, in the binary decomposition of $m$ by the block $(1\,0^{a-1}\,{-1})_2$.

A multiplication algorithm which uses this idea can be described by a finite automaton. The states of this automaton "store" the current situation:

state 0: The algorithm has read a 0-bit.
state 11: This state indicates that the algorithm is inside a block of 1's.
state 01: The previous bit was a 0-bit, but the current bit is a 1. We do not know whether the current 1-bit starts a block of 1's or not. Therefore we have to use "lazy evaluation" and wait for the next bit. If the next bit is 0, then we have an isolated 1 and we go back to state 0, otherwise we are in a block of 1's and we switch to state 11.

Figure 1 describes the actions of the finite automaton A as a graph. The current bit and the corresponding operation are written at the edges of this graph. Note that the algorithm induced by Automaton A needs two doublings and one addition when it is in state 01 and reads a 0-bit. For this situation, it might be advantageous to precompute (and store) $2 \cdot P$ and use the equation $2 \cdot (2 \cdot H + P) = 4 \cdot H + 2 \cdot P$. This transformation is especially useful together with the observation which we will describe in Theorem 1. Note further that the correctness of this algorithm follows directly from the construction.

2.2 The Improved Version

We can improve the algorithm induced by Figure 1 even more with the following observation already used in [7]: if there is an isolated 0 between two blocks of 1's, then we can use the substitution $({-1}\;1)_2 = (0\;{-1})_2$ to do the transformation

$$(0\,1^a\,0\,1^b)_2 \;\longrightarrow\; (1\,0^{a-1}\,{-1}\;1\,0^{b-1}\,{-1})_2 \;\longrightarrow\; (1\,0^{a}\,{-1}\;0^{b-1}\,{-1})_2.$$
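The automaton of Section 2.1, and the refinement with one additional state that the remainder of this subsection introduces, can be sketched as a small state machine. The sketch models the point $P$ as the integer 1, so that $H = h \cdot P$ is tracked through $h$ alone and the result can be checked against $m$. The transition table is an assumption reconstructed from the state descriptions and the edge labels of Figures 1 and 2; the state labels (in particular `"110"` for the extra state of Version B) and the function name are hypothetical. Operations on $H = O$ are not counted.

```python
def multiply_automaton(m: int, version: str = "B"):
    """Return (h, doublings, additions) with h*P the computed multiple of P."""
    ops = {"dbl": 0, "add": 0}

    def dbl(h):
        ops["dbl"] += h != 0          # doubling O costs nothing
        return 2 * h

    def add(h, r):
        ops["add"] += h != 0          # adding to O costs nothing
        return h + r

    h, state = 0, "0"                 # H = O
    for bit in bin(m)[2:]:            # bits from high order to low order
        if state == "0":
            if bit == "0":
                h = dbl(h)
            else:
                state = "01"          # lazy evaluation: wait for the next bit
        elif state == "01":
            if bit == "0":            # isolated 1: H = 2*(2*H + P)
                h, state = dbl(add(dbl(h), 1)), "0"
            else:                     # block of 1's starts: H = 4*(H + P)
                h, state = dbl(dbl(add(h, 1))), "11"
        elif state == "11":
            if bit == "1":
                h = dbl(h)
            elif version == "A":      # block ends: H = 2*(H - P)
                h, state = dbl(add(h, -1)), "0"
            else:
                state = "110"         # Version B: delay until the next bit is known
        else:                         # state "110" (Version B only)
            if bit == "0":            # H = 4*(H - P)
                h, state = dbl(dbl(add(h, -1))), "0"
            else:                     # isolated 0: H = 2*(2*H - P)
                h, state = dbl(add(dbl(h), -1)), "11"
    # output step at the end of the input
    if state == "01":
        h = add(dbl(h), 1)
    elif state == "11":
        h = add(h, -1)
    elif state == "110":
        h = dbl(add(h, -1))
    return h, ops["dbl"], ops["add"]
```

With this sketch, `multiply_automaton(15, "A")` uses 4 doublings and 1 addition, in line with the example $15 \cdot P = 16 \cdot P - P$ of Section 2.1. On a real curve, each `add(h, ±1)` corresponds to one point addition with $P$ or $-P$.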

[Figure 1: Finite Automaton, Version A. Edge operations: H = O (start), H = 2*H, H = 2*(2*H+P), H = 4*(H+P), H = 2*(H-P); outputs: H, H = 2*H+P, H = H-P.]

We change Automaton A appropriately to take care of this equation by introducing a new state. If we are leaving a block of 1's (i.e. we are in state 11 and we read a 0-bit), we have to delay the computation until we know the bit following the 0-bit. Therefore we go to the new state and read the next bit. After this bit has been read, the algorithm can decide whether the 0-bit really is an isolated bit between two blocks of 1's or not, and react correctly. We describe the corresponding finite Automaton B in Figure 2.

[Figure 2: Finite Automaton, Version B. Edge operations: H = O (start), H = 2*H, H = 2*(2*H+P), H = 4*(H+P), H = 4*(H-P), H = 2*(2*H-P); outputs: H, H = 2*H+P, H = H-P, H = 2*(H-P).]

Note that the correctness of Automaton B follows directly from the correctness of Automaton A and the construction. Moreover, the remarks made about Automaton A remain true: it might be advantageous to replace the operations $H = 2 \cdot (2 \cdot H \pm P)$ by a precomputation and the corresponding operations $H = 4 \cdot H \pm 2 \cdot P$.

It should be observed that Automaton B does not always induce a method with fewer operations than the standard binary method. If we choose for example $m = 26 = (1\,1\,0\,1\,0)_2$, then the algorithm induced by Automaton B needs one doubling more than the standard method. Nevertheless, the new algorithm is in practice very often better than the standard method, as we will see in Section 5.

3 Using a 4-adic Decomposition of the Multiplier

The ordinary binary algorithm uses a 2-adic decomposition of the multiplier $m$. In this section, we describe a "left-to-right" multiplication algorithm which uses a 4-adic decomposition of $m$. Let the 4-adic representation of $m$ be given as

$$m = \sum_{i=0}^{s} n_i 4^i, \qquad 0 \le n_i < 4, \quad n_s \ne 0. \qquad (3)$$

A multiplication algorithm based on (3) can process the coefficients $n_i$ either in ascending or in descending order. Note that the processing direction is of great importance, since only for the descending order $n_s, n_{s-1}, \ldots$ can the algorithm use a precomputed table of points. This claim follows directly from the equation

$$m \cdot P = 4 \cdot \bigl(\cdots 4 \cdot (4 \cdot n_s P + n_{s-1} P) \cdots + n_1 P\bigr) + n_0 P.$$

It is easy to see from this equality that only additions of points $r \cdot P$ for $r < 4$ are necessary, and these points can be precomputed and stored in a table.

3.1 Computing $4 \cdot H$

Another interesting point is the computation of $4 \cdot H$ for various points $H \in E(\mathbb{F}_q)$. This operation obviously is a key operation in a 4-adic multiplication algorithm. The naive algorithm would double $H$ twice. Such an algorithm needs two inversions, two multiplications and four squarings in the given field. In this section, we describe an alternative algorithm which only needs one inversion (but more multiplications and squarings).

We use the theory of division polynomials as explained in [1, page 45]. Using these polynomials, we can express multiplication of a "formal point" by a pair of rational functions. Computing $4 \cdot H$ for some given nonzero point $H \in E(\mathbb{F}_q)$ then means evaluating these rational functions. First we define some division polynomials which we will need in the alternative algorithm:

$$\psi_2(x, y) = 2y,$$
$$\psi_3(x, y) = 3x^4 + 6a_4 x^2 + 12 a_6 x - a_4^2,$$
$$\omega_2(x, y) = 2x^6 + 10 a_4 x^4 + 40 a_6 x^3 - 10 a_4^2 x^2 - 8 a_4 a_6 x - 2 a_4^3 - 16 a_6^2,$$
$$\psi_4(x, y) = \psi_2(x, y) \cdot \omega_2(x, y).$$

Note that the coefficients of these polynomials only depend on the elliptic curve $E$ in use. Therefore they can be precomputed and stored; we neglect the cost of computing the polynomial coefficients. Thus we can assume that the evaluation of all these polynomials at a given point $H$ can be done with 7 multiplications and 1 squaring in $\mathbb{F}_q$.

A useful observation is the fact that $4 \cdot H = O$ if and only if $\psi_4(H) = 0$. Therefore the polynomials $\omega_2(x, y)$ and $\psi_2(x, y)$ should be evaluated first. Apart from these values, we need two other values: $\alpha(H) = \psi_3(H)^3$ and $\beta(H) = \psi_2(H)^4 \cdot \omega_2(H)$. All these values can be computed with 3 squarings and 2 multiplications. Then we can compute

$$\theta_4(x, y) = \bigl(\beta(x, y) - \alpha(x, y)\bigr) \cdot \psi_3(x, y),$$
$$\omega_4(x, y) = 2^{-1} \cdot \Bigl( \bigl(-2\,\alpha(x, y) + 3\,\beta(x, y) - \omega_2(x, y)^2\bigr) \cdot \alpha(x, y) - \beta(x, y)^2 \Bigr).$$

It should be mentioned that multiplication by $2^{-1}$ can be performed in linear time if we use the fact that for $r \in \mathbb{F}_p$ we have $2^{-1} r = r/2$ if $r$ is even, and $2^{-1} r = (r + p)/2$ if $r$ is odd (here, $p$ is the characteristic of $\mathbb{F}_q$). The connection to the original problem is described in [1, Prop. .7.8, page 47]. We can deduce that

$$4 \cdot H = \left( x(H) - \frac{\theta_4(H)}{\psi_4(H)^2}, \; \frac{\omega_4(H)}{\psi_4(H)^3} \right).$$

Therefore we can find $4 \cdot H$ by inverting $\psi_4(H)$ and using the values $\theta_4(H)$ and $\omega_4(H)$. If we count the cost for all these operations, we get the following theorem.

Theorem 1. There exists an algorithm which computes $4 \cdot H$ in at most 14 multiplications, 7 squarings and one inversion in $\mathbb{F}_q$ for any point $H \in E(\mathbb{F}_q)$.
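The formulas of this section can be checked against the naive "double twice" method with a short Python sketch over a prime field. It is an illustration only: the names `double` and `quadruple` are hypothetical, the identifiers `om2`, `alpha`, `beta`, `theta4` and `om4` are assumed names for the quantities evaluated above, and no attempt is made to reach the operation count of Theorem 1 (powers are simply computed with `pow`).

```python
def double(P, a4, p):
    """Naive affine doubling on y^2 = x^3 + a4*x + a6 over F_p; None models O."""
    if P is None or P[1] % p == 0:
        return None
    x, y = P
    lam = (3 * x * x + a4) * pow(2 * y % p, -1, p) % p
    x3 = (lam * lam - 2 * x) % p
    return (x3, (lam * (x - x3) - y) % p)


def quadruple(x, y, a4, a6, p):
    """Compute 4*(x, y) with a single inversion via division polynomials."""
    x2 = x * x % p
    x3 = x2 * x % p
    x4 = x2 * x2 % p
    x6 = x3 * x3 % p
    psi2 = 2 * y % p
    psi3 = (3 * x4 + 6 * a4 * x2 + 12 * a6 * x - a4 * a4) % p
    om2 = (2 * x6 + 10 * a4 * x4 + 40 * a6 * x3 - 10 * a4 * a4 * x2
           - 8 * a4 * a6 * x - 2 * a4 ** 3 - 16 * a6 * a6) % p
    psi4 = psi2 * om2 % p
    if psi4 == 0:
        return None                       # 4*H = O exactly when psi4(H) = 0
    alpha = pow(psi3, 3, p)
    beta = pow(psi2, 4, p) * om2 % p
    theta4 = (beta - alpha) * psi3 % p
    half = (p + 1) // 2                   # 2^{-1} mod p for an odd prime p
    om4 = half * ((3 * beta - 2 * alpha - om2 * om2) * alpha - beta * beta) % p
    inv = pow(psi4, -1, p)                # the only inversion
    inv2 = inv * inv % p
    return ((x - theta4 * inv2) % p, om4 * inv2 % p * inv % p)
```

On the toy curve $y^2 = x^3 + 2x + 3$ over $\mathbb{F}_{97}$, `quadruple(3, 6, 2, 3, 97)` returns `(3, 91)`, the same point as `double(double((3, 6), 2, 97), 2, 97)`.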
In the introduction of this paper, we mentioned that in many finite field implementations one inversion has about the same cost as 25 multiplications. If we then compare the naive algorithm with the algorithm described in this section, we get the following result: the naive algorithm needs the equivalent of about 52 multiplications and 4 squarings, the new algorithm needs 39 multiplications and 7 squarings. Since in a clever implementation one squaring is up to twice as fast as a multiplication, we expect the new algorithm to be approximately 20% faster than the naive algorithm. A comparison of practical timings can be found in Section 5.
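As a worked version of this estimate, let $\sigma$ denote the cost of one squaring measured in multiplications ($\sigma$ is a symbol introduced here for illustration only), and let one inversion cost 25 multiplications as above. Then the ratio of the two running times is

\[
\frac{T_{\mathrm{new}}}{T_{\mathrm{naive}}} = \frac{25 + 14 + 7\sigma}{2 \cdot 25 + 2 + 4\sigma} = \frac{39 + 7\sigma}{52 + 4\sigma},
\qquad
\sigma = \tfrac{1}{2}: \ \frac{42.5}{54} \approx 0.79,
\qquad
\sigma = 1: \ \frac{46}{56} \approx 0.82 .
\]

Under these assumptions the expected saving lies at roughly 18% to 21%, depending on how fast a squaring is relative to a multiplication.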

3.2 The 4-adic Multiplication Algorithm

Using (3) and the previously mentioned facts, we can immediately describe the following 4-adic multiplication algorithm.

Algorithm 2 (Multiplication of Points Using 4-adic Decompositions)
Input: $m \in \mathbb{N}$ and $P \in E(\mathbb{F}_q)$.
Output: $m \cdot P$.
(1) compute and store $T_i = i \cdot P$ for all $1 \le i \le 3$.
(2) compute the representation $m = \sum_{i=0}^{s} n_i 4^i$, $0 \le n_i < 4$, $n_s \ne 0$.
(3) set $H = T_{n_s}$.
(4) for ($i = s - 1$ downto $0$) do
(5)     set $H = 4 \cdot H$.
(6)     if ($n_i > 0$) then
(7)         set $H = H + T_{n_i}$.
(8) od
(9) return ($H$)

We can assume that for a random multiplier $m$ about half the bits in the binary decomposition of $m$ are 1-bits. Therefore we expect that the standard binary multiplication algorithm needs about $\log_2(m)$ point doublings and $\frac{1}{2} \log_2(m)$ point additions. The length of a 4-adic decomposition (3) of $m$ is only half the binary length of $m$. Since multiplication by 4 can be done faster than two doublings, we expect the "doubling part" of Algorithm 2 to be faster than the corresponding part of the binary method. Unfortunately, Algorithm 2 needs one additional point addition for each nonzero coefficient $n_i$ in (3). If we assume that for a random integer $m$ the coefficients $n_i$ behave like random elements in $\mathbb{Z}/4\mathbb{Z}$, then about a fourth of these coefficients should be zero. Therefore we expect that $\frac{3}{8} \log_2(m)$ point additions are necessary in step (7) of Algorithm 2. We will describe the practical behavior of this algorithm in Section 5. The next section combines the two main ideas of this paper to reduce the number of nonzero coefficients in (3).

4 Addition Chains and 4-adic Decompositions

We have already mentioned that only about one fourth of the coefficients in (3) are expected to be zero. Fortunately, we can combine the ideas of the 4-adic multiplication Algorithm 2 with the ideas of the addition-subtraction chains introduced in Section 2. Again we substitute blocks of 1's in the binary decomposition by "equivalent" blocks.

In the 4-adic situation, there is the small difficulty that not every substitution gives an improvement. For example, we know that $23 = (1\,1\,3)_4 = (1\,0\,1\,1\,1)_2 = (1\,1\,0\,0\,{-1})_2 = (1\,2\,{-1})_4$, but this substitution does not reduce the number of nonzero coefficients in the 4-adic expansion of 23. This fact complicates the description of a finite automaton which defines the improved 4-adic multiplication algorithm. The idea for the reduction is however simple: assume that the current situation is $(x\;3\;y)_4$, and we are reading the 3-coefficient. If we use the equality $3 = 4 - 1$, we get the equivalent block $((x+1)\;0\;(y-4))_4$. Testing all possible values for $x, y$, we find the situations where a substitution will increase the number of zero coefficients. These cases are stored in a finite automaton.

We do not describe the automaton with a graph, but we give a list of states and corresponding actions. The construction of this finite automaton is straightforward (the states store the last coefficient read, as in Section 2), except that input coefficients 3 are handled more carefully. If the automaton reads a 3-coefficient directly after a 0-coefficient, then it goes to state delay to wait for the next coefficient. If we are in a block of 3-coefficients of length at least two, then this block can be exchanged, otherwise a substitution does not pay.

The finite automaton starts in state $n_s$ if $n_s < 3$, and in state delay if $n_s = 3$. Moreover, we initialize $H = O$. Then it reads the coefficients of the 4-adic decomposition (3) of $m$ in descending order, starting with $n_{s-1}$. Let $0 \le n \le 3$ be the last coefficient which the algorithm has read. The following list describes the actions which the automaton should perform in a given state with input $n$.

state 0: If $n \le 2$, then goto state $n$, and set $H = 4 \cdot H$. If $n = 3$, then goto state delay.
state 1: If $n \le 2$, then set $H = 4 \cdot H + P$, otherwise set $H = 4 \cdot H + 2 \cdot P$. Goto state $n$.
state 2: If $n \le 2$, then set $H = 4 \cdot H + 2 \cdot P$, otherwise set $H = 4 \cdot H + 3 \cdot P$. Goto state $n$.
state 3: If $1 \le n \le 3$, then set $H = 4 \cdot H$, otherwise set $H = 4 \cdot H - P$. If $n = 2$, then goto state -2, if $n = 1$, then goto state -3, otherwise goto state $n$.
state -2: If $n \le 2$, then set $H = 4 \cdot H - 2 \cdot P$, otherwise set $H = 4 \cdot H - P$. Goto state $n$.
state -3: If $n < 3$, then set $H = 4 \cdot H - 3 \cdot P$, otherwise set $H = 4 \cdot H - 2 \cdot P$. Goto state $n$.
state delay: If $n \le 1$, then set $H = 16 \cdot H + 3 \cdot P$, else set $H = 4 \cdot (4 \cdot H + P)$. If $n = 2$, then goto state -2, else goto state $n$.

If the automaton has read all coefficients of (3) and is currently in state $n$, the algorithm should output the result

$H = 4 \cdot H + n \cdot P$ for $n$ = 0, 1, 2, -2, -3,
$H = 4 \cdot H - P$ for $n$ = 3,
$H = 16 \cdot H + 3 \cdot P$ for $n$ = delay.

The correctness of this procedure follows directly by construction. Obviously, all point additions in this procedure should be done as fast as possible. Therefore the points $r \cdot P$ for $2 \le r \le 3$ should be precomputed and stored in a table. Note again that we can derive the points $-2P$ and $-3P$ for free. Moreover, multiplication by 4 should be done with the new algorithm introduced in Theorem 1. In the following section, we give running times for all algorithms described in this paper. We will see that the ideas of this section improve the speed of the 4-adic multiplication algorithm again by about 5%.

5 Timings of the Algorithms

First we count the number of elementary field operations that have to be performed. We restrict our attention to the algorithms using 4-adic expansions, since the variants of the Morain/Olivos algorithm can be analyzed exactly as in [7].

We assume that $m$ behaves like a random integer. Then we can expect that about half the bits of the binary expansion of $m$ are zero. Therefore the binary algorithm has to do $\log_2(m)$ doublings and $\frac{1}{2} \log_2(m)$ additions. Moreover, we expect that about one fourth of the coefficients in a 4-adic expansion like (3) are zero. Since the length of such an expansion is only half the binary length of $m$, Algorithm 2 needs $\frac{1}{2} \log_2(m)$ multiplications by four and $\frac{3}{8} \log_2(m)$ additions. The expected number of additions in the improved 4-adic algorithm is slightly smaller, while the length of the expansion remains the same. If we look at blocks $(x\;3\;y)_4$ and count the "good" values for $x, y$, then we find exactly 9 such possibilities. Therefore the probability that a 3-coefficient can be exchanged to 0 without "deletion" of another 0-coefficient is $\frac{9}{16}$. Therefore we expect the number of additions to be $\frac{39}{128} \log_2(m)$. Using the result of Theorem 1, we compute the expected number of elementary field operations as follows.
Operation        Binary Method     4-adic Algorithm        Improved 4-adic Algorithm
Multiplication   3 log2(m)         (31/4) log2(m) + 4      (487/64) log2(m) + 4
Squaring         (5/2) log2(m)     (31/8) log2(m) + 3      (487/128) log2(m) + 3
Inversion        (3/2) log2(m)     (7/8) log2(m) + 2       (103/128) log2(m) + 2

If we compare the expected number of operations of the binary method with the 4-adic algorithm, then the 4-adic version should be superior if one inversion takes longer than 10 multiplications. If we assume that a squaring takes the same time as a multiplication, and that one inversion takes about 25 multiplications, then we expect the 4-adic algorithm to need about 78% of the time of the binary algorithm; the improved 4-adic algorithm should be even a bit faster and take 74% of that time.
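The addition counts used in this analysis can be explored with a small simulation that models $P$ as the integer 1, so that no curve arithmetic is needed. The sketch below runs Algorithm 2 and the automaton of Section 4 on integers and counts point additions. The transition table follows the state list of Section 4 as read here, which is an assumption, and the function names are hypothetical. Additions onto $H = O$ are treated as free in both variants.

```python
def four_adic_digits(m):
    """Coefficients n_s, ..., n_0 of the 4-adic decomposition (3), high order first."""
    digits = []
    while m:
        digits.append(m % 4)
        m //= 4
    return digits[::-1]


def plain_four_adic(m):
    """Algorithm 2 on integers: returns (h, additions)."""
    digits = four_adic_digits(m)
    h, adds = digits[0], 0            # H = T_{n_s} is a table lookup
    for n in digits[1:]:
        h = 4 * h
        if n > 0:
            h, adds = h + n, adds + 1
    return h, adds


def improved_four_adic(m):
    """Automaton of Section 4 on integers: returns (h, additions)."""
    digits = four_adic_digits(m)
    h, adds = 0, 0
    state = "delay" if digits[0] == 3 else digits[0]
    for n in digits[1:]:
        if state == 0:
            if n <= 2:
                h = 4 * h
            nxt = n if n <= 2 else "delay"
        elif state in (1, 2, -2, -3):
            adds += h != 0
            h = 4 * h + (state if n <= 2 else state + 1)   # carry when n = 3
            nxt = n
        elif state == 3:
            if n >= 1:
                h = 4 * h
            else:
                adds += h != 0
                h = 4 * h - 1
            nxt = {0: 0, 1: -3, 2: -2, 3: 3}[n]
        else:                                              # state "delay"
            adds += h != 0
            h = 16 * h + 3 if n <= 1 else 4 * (4 * h + 1)
            nxt = -2 if n == 2 else n
        state = nxt
    # output step
    if state == "delay":
        adds += h != 0
        h = 16 * h + 3
    elif state == 3:
        adds += h != 0
        h = 4 * h - 1
    else:
        if state != 0:
            adds += h != 0
        h = 4 * h + state
    return h, adds
```

For example, $m = 63 = (3\,3\,3)_4$ needs two additions with Algorithm 2 but only one with the automaton, since $63 = 64 - 1$; for $m = 23$, the example from Section 4, both variants need two additions.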

After these theoretical observations, we list practical timings. We implemented the standard binary method for multiplication, the second improved algorithm of Morain/Olivos [7] and all algorithms described in this paper. The basis for these implementations is the computer algebra library LiDIA (see [4]). All tests were done on a sparc4 machine.

In the first table, we compare the naive algorithm for computing $4 \cdot H$ with the new idea of Section 3.1. We list the average time (in milliseconds) of such an operation for the smallest prime field with the given bit length. This average time was computed by multiplying random points on random elliptic curves over $\mathbb{F}_p$ by 4.

Bit length of p   Double twice   New Method   Rate
100               .68            .53          79%
150               .99            .8           8%
200               .24            .99          8%
250               .4             .2           8%
300               .83            .5           82%

This table shows that the running time improvements which we expected in Section 3.1 can almost be achieved in practice.

The following table lists timings for the standard binary multiplication algorithm, the second improved algorithm of Morain/Olivos [7], the new Algorithm Version A (Figure 1), the new Algorithm Version B (Figure 2), the 4-adic Algorithm 2 and the Improved 4-adic Algorithm. We chose five random elliptic curves over the smallest prime field $\mathbb{F}_p$, where $p$ has the given bit length. For each curve, we multiply a random point $P$ with 2 random integers $m < p$. The table lists the average time (in milliseconds) of one such multiplication and the relative time compared to the standard binary method.

log2(p)   Std    Morain/Oli.   Version A    Version B    4-adic       Impr. 4-adic
100       75     57 (9%)       54 (89%)     48 (85%)     48 (85%)     4 (8%)
150       387    35 (9%)       343 (88%)    325 (84%)    323 (84%)    37 (79%)
200       662    594 (9%)      578 (87%)    55 (83%)     545 (82%)    57 (78%)
250       4      933 (9%)      97 (87%)     862 (83%)    853 (82%)    88 (78%)
300       65     447 (9%)      393 (86%)    37 (82%)     288 (8%)     29 (75%)
350       2247   25 (9%)       952 (87%)    847 (82%)    798 (8%)     7 (76%)
400       353    2739 (9%)     2655 (87%)   2525 (83%)   252 (82%)    2364 (77%)

These timings show that the new algorithms really lead to a significant running time improvement. The description of the new multiplication algorithms is simple enough that these algorithms can also be used on smart cards. Depending on the memory capacity of the card, one can achieve either an improvement of up to 18% with Algorithm Version B (no additional memory necessary), or 25% with the Improved 4-adic Algorithm (only about $4 \log_2(p)$ bits of additional memory required). Since speed is an important requirement for smart card applications, the new algorithms are of great importance for smart card implementations.

Finally, it should be remarked that the ideas of this paper can also be used for elliptic curves defined over finite fields of characteristic two.

References

[1] I. Connell: Elliptic Curve Handbook, Draft July 1995, available on ftp://math.mcgill.ca/pub/ech/.
[2] IEEE P1363 Working Draft: Public Key Cryptography, Draft, August 6 1996, available on ftp://stdsbbs.ieee.org/pub/p363/.
[3] N. Koblitz: Elliptic Curve Cryptosystems, Mathematics of Computation, 48, 1987, 203-209.
[4] LiDIA: A Library for Computational Number Theory, available on http://www.informatik.th-darmstadt.de/ti/
[5] A. Menezes: Elliptic Curve Public Key Cryptosystems, Kluwer Academic Publishers, 1993.
[6] V.S. Miller: Use of Elliptic Curves in Cryptography, Advances in Cryptology - CRYPTO 85, Lecture Notes in Computer Science No. 218, 1986, 417-426.
[7] F. Morain and J. Olivos: Speeding up the Computations on an Elliptic Curve using Addition-Subtraction Chains, in F. Morain, Courbes Elliptiques et Tests de Primalité, Doctoral Thesis, Université Lyon I, 1990.