Elliptic Curve Method for Integer Factorization on Parallel Architectures

EDIC RESEARCH PROPOSAL

Andrea Miele
I&C, EPFL

Abstract: The elliptic curve method (ECM) for integer factorization is an algorithm that uses the algebraic structure of the set of points of an elliptic curve to factor integers. The running time of ECM depends on the size of the smallest prime divisor of the number to be factored. One of its main applications is the cofactorization step of the number field sieve algorithm, which is used to assess the security of the RSA cryptosystem. The principal goal of this proposal is the efficient implementation of ECM on highly parallel low-cost devices such as graphics cards. This requires theoretical and practical study of parallel algorithms for elliptic curve and finite field arithmetic.

Index Terms: ECM, finite field arithmetic, elliptic curves, Edwards curves, integer factorization.

Proposal submitted to committee: December 8th, 2011. Candidacy exam date: December 15th, 2011. Candidacy exam committee: Emre Telatar, Arjen Lenstra, Amin Shokrollahi.

This research plan has been approved: Date; Doctoral candidate (name and signature); Thesis director (name and signature); Thesis co-director, if applicable (name and signature); Doct. prog. director R. Urbanke (signature).

I. INTRODUCTION

The implementation and study of algorithms for integer factorization are crucial for the security assessment of several public-key cryptosystems. The Number Field Sieve (NFS) [1] is the best known method for factoring integers with large prime factors (such as RSA moduli), and it directly impacts the security of RSA. The Elliptic Curve Method (ECM) [2] for integer factorization is expected to outperform NFS only if the composite integer $n$ to be factored has some prime divisors that are small compared to the size of $n$.
However, ECM plays a relevant role in the NFS cofactorization step, in which many small composite integers (100 to 200 bits) need to be factored. This task can be offloaded to low-cost highly parallel devices such as graphics cards. ECM also has two applications to large integers that can be accelerated on such devices. One is the factorization of numbers whose size is out of reach for NFS; this application is of interest only in the context of recreational mathematics. The other is the factorization of multiprime RSA moduli. In this variant of RSA, the modulus is built up from $r > 2$ primes of about the same size, which speeds up the decryption step when the Chinese Remainder Theorem is used.

The problem of implementing ECM efficiently on low-cost highly parallel devices is relevant beyond integer factorization. Several cryptological applications other than ECM rely on implementations of finite field arithmetic and elliptic curve arithmetic, e.g., protocols based on Elliptic Curve Cryptography (ECC).

Recent graphics processing units (GPUs) are an interesting platform for the implementation of ECM and the underlying arithmetic. In recent years they have evolved from simple parallel graphics pipelines to many-core architectures with full hardware and software support for general-purpose computations. This has led to the popular concept of general-purpose computing on graphics processing units (GPGPU). GPUs are suitable for applications that involve many independent parallel computations on different chunks of data, with little or no synchronization needed between such computations.

The papers described in this proposal cover the essential background related to ECM and its implementation. The classic "Factoring integers with elliptic curves" [2] by Hendrik Lenstra, from 1987, introduced ECM.
All the facts needed to explain why and when it works are described there, along with two variants of the factoring algorithm and a conjecture on its expected running time. The second paper, "Speeding the Pollard and Elliptic Curve Methods of Factorization" [3], describes several improvements applicable to ECM and other factoring methods that must be taken into account when implementing these algorithms efficiently. The last one, "Twisted Edwards Curves Revisited" [4], presents the fastest known algorithms for performing group operations on elliptic curves, which can speed up several cryptological applications including ECM [5]. Section II gives detailed descriptions of the papers, followed by the research proposal in Section III.

II. SURVEY OF THE SELECTED PAPERS

Notation: The symbol $\log$ without an explicit subscript for the base denotes the natural logarithm throughout the paper.

A. Factoring integers with elliptic curves

In this paper, Hendrik Lenstra proposes the elliptic curve method (ECM) for factoring positive integers. It is obtained from Pollard's $(p-1)$-method by replacing the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^*$ of residues modulo $p$ with the group of points on a random elliptic curve modulo $p$.

1) Elliptic curves over finite fields: Let $K$ be a field; the author focuses on the case $K = \mathbb{F}_p$ for some prime number $p > 3$. A pair $(a, b) \in K^2$ for which $4a^3 + 27b^2 \neq 0$ defines an elliptic curve over $K$ corresponding to the short Weierstrass equation

$$y^2 = x^3 + ax + b. \qquad (1)$$

The elliptic curve defined by $(a, b)$ is denoted by $E_{a,b}$, or by $E$. The set of points $E(K)$ of $E_{a,b}$ over $K$ is defined by

$$E(K) = \{(x : y : z) \in \mathbb{P}^2(K) : y^2 z = x^3 + axz^2 + bz^3\}.$$

Here $\mathbb{P}^2(K)$ denotes the projective plane over $K$, i.e., the set of equivalence classes of triples $(x, y, z) \in K^3$, $(x, y, z) \neq (0, 0, 0)$; two triples $(x, y, z)$ and $(x', y', z')$ are equivalent if there exists $c \in K^*$ such that $cx = x'$, $cy = y'$ and $cz = z'$. The equivalence class containing $(x, y, z)$ is denoted by $(x : y : z)$.

Given an elliptic curve $E$ over $K$, the point $(0 : 1 : 0) \in E(K)$ is the zero point of the curve; it is denoted by $O$ and it is the only point with $z = 0$. All the other points of $E$ are of the form $(x : y : 1)$, where $x, y \in K$ satisfy Eq. (1). The set $E(K)$ has the structure of an abelian group, with the group law defined as follows (additive notation):

Identity element: $O + P = P + O = P$ for all $P \in E(K)$.

Inverse: given $P = (x_1 : y_1 : 1) \neq O$ and $Q = (x_2 : y_2 : 1) \neq O$, then $P + Q = O$ if and only if $x_1 = x_2$ and $y_1 = -y_2$; thus $-(x : y : z) = (x : -y : z)$.

Addition: otherwise, let $\lambda \in K$ be given by $\lambda = (y_1 - y_2)/(x_1 - x_2)$ if $P \neq Q$ and $\lambda = (3x_1^2 + a)/(2y_1)$ if $P = Q$. Then $P + Q = R$, where $R = (x_3 : y_3 : 1)$ with $x_3 = \lambda^2 - x_1 - x_2$ and $y_3 = -\lambda x_3 - y_1 + \lambda x_1$.
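The group law above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `ec_add` and `ec_mul`, the use of `None` for the zero point $O$, and the toy curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ are assumptions of this example, not part of the surveyed paper.

```python
# Affine group law on y^2 = x^3 + a*x + b over F_p (p > 3 prime).
# Convention of this sketch: the zero point O = (0 : 1 : 0) is None,
# every other point (x : y : 1) is the tuple (x, y).

def ec_add(P, Q, a, p):
    """Return P + Q in E_{a,b}(F_p)."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if (x1 - x2) % p == 0 and (y1 + y2) % p == 0:
        return None                                   # Q = -P, so P + Q = O
    if (x1 - x2) % p == 0:
        # P = Q: tangent slope (3*x1^2 + a) / (2*y1)
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p
    else:
        # P != Q: chord slope (y1 - y2) / (x1 - x2)
        lam = (y1 - y2) * pow((x1 - x2) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    y3 = (-lam * x3 - y1 + lam * x1) % p
    return (x3, y3)

def ec_mul(k, P, a, p):
    """Return k*P using the binary (double-and-add) addition chain."""
    R = None
    while k > 0:
        if k & 1:
            R = ec_add(R, P, a, p)
        P = ec_add(P, P, a, p)
        k >>= 1
    return R

# Toy example: E: y^2 = x^3 + 2x + 2 over F_17, P = (5, 1).
print(ec_add((5, 1), (5, 1), 2, 17))   # (6, 3)
print(ec_mul(19, (5, 1), 2, 17))       # None: this toy group has 19 points
```

The modular inverse uses the three-argument `pow` with exponent `-1`, which requires Python 3.8 or later.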
2) Elliptic curves modulo a composite $n$: Consider the set of all triples $(x, y, z) \in (\mathbb{Z}/n\mathbb{Z})^3$ for which $\gcd(x, y, z, n) = 1$. The group of units $(\mathbb{Z}/n\mathbb{Z})^*$ acts on this set by $u(x, y, z) = (ux, uy, uz)$. The orbits under this action (the orbit of a triple is the set of triples it can be transformed into) are the points of the projective plane over $\mathbb{Z}/n\mathbb{Z}$. The orbit of $(x, y, z)$ is denoted by $(x : y : z)$, and the set of all orbits by $\mathbb{P}^2(\mathbb{Z}/n\mathbb{Z})$.

Given $a, b \in \mathbb{Z}/n\mathbb{Z}$, let $E = E_{a,b}$ be the curve defined over $\mathbb{Z}/n\mathbb{Z}$ by the equation $y^2 = x^3 + ax + b$. The set of points $E(\mathbb{Z}/n\mathbb{Z})$ of $E$ over $\mathbb{Z}/n\mathbb{Z}$ is defined by

$$E(\mathbb{Z}/n\mathbb{Z}) = \{(x : y : z) \in \mathbb{P}^2(\mathbb{Z}/n\mathbb{Z}) : y^2 z = x^3 + axz^2 + bz^3\}.$$

If $6(4a^3 + 27b^2) \in (\mathbb{Z}/n\mathbb{Z})^*$, then $E$ is called an elliptic curve over $\mathbb{Z}/n\mathbb{Z}$ and the set $E(\mathbb{Z}/n\mathbb{Z})$ has a natural abelian group law. The author avoids using this group structure and instead defines a pseudo-addition on a subset of $E(\mathbb{Z}/n\mathbb{Z})$. This operation can fail in some cases, namely when one attempts to compute the multiplicative inverse of an element $u \in \mathbb{Z}/n\mathbb{Z}$ that is not a unit, so that $\gcd(u, n) > 1$; such a failure can reveal a non-trivial divisor of $n$.

Let $O$ denote the point $(0 : 1 : 0)$ of $\mathbb{P}^2(\mathbb{Z}/n\mathbb{Z})$, and let the subset $V_n$ of $\mathbb{P}^2(\mathbb{Z}/n\mathbb{Z})$ consist of the finite points together with $O$:

$$V_n = \{(x : y : 1) : x, y \in \mathbb{Z}/n\mathbb{Z}\} \cup \{O\}.$$

For $P \in V_n$ and a prime $p$ dividing $n$, $P_p$ denotes the point in $\mathbb{P}^2(\mathbb{F}_p)$ obtained by reducing the coordinates of $P$ modulo $p$. Notice that $P_p = O_p \iff P = O$.

Given $n \in \mathbb{Z}_{>1}$, $a \in \mathbb{Z}/n\mathbb{Z}$ and $P, Q \in V_n$, the author designs an algorithm that either computes a non-trivial divisor $d$ of $n$, or determines a point $R \in V_n$ with the following property: if $p$ is any prime divisor of $n$ for which there exists $b \in \mathbb{F}_p$ such that $6(4\bar{a}^3 + 27b^2) \neq 0$ for $\bar{a} = a \pmod p$, $P_p \in E_{\bar{a},b}(\mathbb{F}_p)$ and $Q_p \in E_{\bar{a},b}(\mathbb{F}_p)$, then $R_p = P_p + Q_p$ in the group $E_{\bar{a},b}(\mathbb{F}_p)$. The algorithm first attempts to compute $(x_1 - x_2)^{-1} \pmod n$ (see the group law formulae in paragraph II-A1) using the Euclidean algorithm, which outputs $d = \gcd(x_1 - x_2, n)$.
If $1 < d < n$ the addition fails and a non-trivial factor of $n$ is found. If $d = 1$ the algorithm determines a point $R$ with the above property. If $d = n$ it attempts to compute $(y_1 + y_2)^{-1} \pmod n$ (notice that in this case $x_1 = x_2$, so that $P_p = \pm Q_p$ for every suitable prime $p$), and the value $e = \gcd(y_1 + y_2, n)$ is used exactly as the value $d$, except that if $e = n$ the output is $R = O$ (i.e., $P = -Q$ in $V_n$). If the algorithm determines a point $R$, it is denoted by $P + Q$, and the partial binary operation on $V_n$ is called addition. If the ordinary Euclidean algorithm is used, $O((\log n)^2)$ bit operations are performed.

Using a sequence of pseudo-additions, one can devise an algorithm that computes the following. Given $k \in \mathbb{Z}_{>0}$, $n \in \mathbb{Z}_{>1}$, $a \in \mathbb{Z}/n\mathbb{Z}$ and $P \in V_n$, it either calculates a non-trivial divisor $d$ of $n$, or determines a point $R \in V_n$ with $R_p = k \cdot P_p$ in the group $E_{\bar{a},b}(\mathbb{F}_p)$, for suitable $b$ and $p$ as for the pseudo-addition. If the algorithm determines such a point $R$, it is denoted by $kP$, and the partial operation defined in this way is called multiplication. The number of additions performed by the algorithm depends on which addition chain is used for computing $kP$ and on whether $kP$ is defined or not. An addition chain for $n \in \mathbb{Z}_{>0}$ is a sequence of positive integers $v_0 = 1, v_1, \ldots, v_m = n$ where for each $0 < j \le m$, $v_j = v_h + v_l$ for some $0 \le h, l < j$. If $k = k_1 k_2$ for some $k_1, k_2 \in \mathbb{Z}_{>0}$, $kP$ can be computed as $kP = k_1(k_2 P)$. So if $k = \prod_r r^{e(r)}$, where $r$ ranges over a finite set of positive integers and each $e(r)$ is a positive integer, $kP$ can be computed by performing $e(r)$ multiplications by $r$ for each $r$.

3) Introduction: Pollard's $(p-1)$-method aims to find a non-trivial divisor of a given positive integer $n$ using Fermat's little theorem. The idea of the algorithm is to pick a random residue modulo $n$, say $c$, and to compute its $k$-th power modulo

$n$. The value of $k$ is chosen as the product of small prime powers less than a bound $B$ (e.g., $k = \mathrm{lcm}(1, 2, \ldots, B)$). One hopes that for some prime factor $p$ of $n$, the number $p - 1$ will divide $k$. The algorithm computes $c^k \bmod n$ and $d = \gcd((c^k \bmod n) - 1, n)$. If for some prime factor $p$ of $n$, $k$ is divisible by $p - 1$, then $d$ is a non-trivial factor of $n$ by Fermat's little theorem, unless all prime factors of $n$ are found simultaneously, i.e., $d = n$. If for some prime factor $p$ of $n$, $p - 1$ is a product of primes less than $B$, i.e., it is $B$-smooth, the algorithm is likely to succeed. If instead for each prime $p$ dividing $n$ the number $p - 1$ has a large prime factor, then Pollard's $(p-1)$-method needs a large bound $B$ (i.e., a large running time) to have a reasonable chance of success.

ECM uses the group of points on a random elliptic curve modulo $p$ instead of $(\mathbb{Z}/p\mathbb{Z})^*$. First fix $k = \mathrm{lcm}(1, 2, \ldots, B)$ as for Pollard's $(p-1)$-method, and select a random elliptic curve $E$ defined over $\mathbb{Z}/n\mathbb{Z}$ (as in paragraph II-A4 or using a suitable parametrization) and a point $P$ on $E$ with coordinates in $\mathbb{Z}/n\mathbb{Z}$, where $n$ is the number to factor. Next, compute the multiple $k \cdot P$ of $P$ using the group law of the curve. In practice one can use the pseudo-addition algorithm described in paragraph II-A2. If for some prime divisor $p$ of $n$, $k \cdot P$ and the zero point $O$ of the curve become the same modulo $p$ (but not modulo $n$), the algorithm succeeds. This corresponds to the failure of an inversion while computing the pseudo-addition. One can modify the pseudo-addition to work with projective coordinates, with $O = (0 : 1 : 0)$, and avoid inversions. In this case one must explicitly check for the above condition, which is now equivalent to $p$ dividing the $z$ (or $x$) coordinate of the result, by calculating the greatest common divisor of that $z$ (or $x$) with $n$.
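The outline above, combined with the pseudo-addition of paragraph II-A2, can be sketched as follows. This is a minimal sketch under stated assumptions rather than Lenstra's exact procedure: the names `FactorFound`, `inv_mod`, `pseudo_add`, `pseudo_mul` and `ecm_one_curve` are hypothetical, the binary addition chain is used, and the toy input ($n = 35$, $a = 1$, $P = (0 : 1 : 1)$, $B = 5$) is chosen only to keep the numbers small.

```python
from math import gcd

class FactorFound(Exception):
    """Signals that an inversion modulo n failed and exposed a divisor d of n."""
    def __init__(self, d):
        super().__init__(d)
        self.d = d

def inv_mod(u, n):
    """Inverse of u modulo n; raises FactorFound(gcd(u, n)) if u is not a unit."""
    d = gcd(u % n, n)
    if d != 1:
        raise FactorFound(d)
    return pow(u % n, -1, n)          # three-argument pow, Python 3.8+

def pseudo_add(P, Q, a, n):
    """Partial addition on V_n; None stands for O, (x, y) for (x : y : 1)."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if (x1 - x2) % n != 0:
        lam = (y1 - y2) * inv_mod(x1 - x2, n) % n     # fails if 1 < d < n
    elif (y1 + y2) % n == 0:
        return None                                   # e = n: R = O
    else:
        lam = (3 * x1 * x1 + a) * inv_mod(y1 + y2, n) % n
    x3 = (lam * lam - x1 - x2) % n
    y3 = (lam * (x1 - x3) - y1) % n
    return (x3, y3)

def pseudo_mul(k, P, a, n):
    """Partial multiplication k*P built from pseudo-additions (binary chain)."""
    R = None
    while k > 0:
        if k & 1:
            R = pseudo_add(R, P, a, n)
        P = pseudo_add(P, P, a, n)
        k >>= 1
    return R

def ecm_one_curve(n, a, P, B):
    """Try one curve with k = lcm(1, ..., B); return a divisor of n or None."""
    k = 1
    for r in range(2, B + 1):
        k = k * r // gcd(k, r)
    try:
        pseudo_mul(k, P, a, n)
    except FactorFound as found:
        return found.d
    return None

print(ecm_one_curve(35, 1, (0, 1), 5))   # a non-trivial divisor of 35
```

In this toy case the curve $y^2 = x^3 + x + 1$ has 9 points modulo 5 and 5 points modulo 7, so $k = 60$ annihilates $P$ modulo 7 but not modulo 5, and an inversion has to fail. With different parameters the function may simply return `None`, in which case another curve would be tried.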
ECM has the same properties as Pollard's $(p-1)$-method, with the order $p - 1$ of $(\mathbb{Z}/p\mathbb{Z})^*$ replaced by the order of the group $E(\mathbb{Z}/p\mathbb{Z})$ of points of $E$ with coordinates in $\mathbb{Z}/p\mathbb{Z}$. Hasse's theorem (1934) [2] states that the order of $E(\mathbb{Z}/p\mathbb{Z})$ is of the form $p + 1 - t_p$, where $t_p$ is an integer that depends on $E$ and $p$ and satisfies $|t_p| \le 2\sqrt{p}$. If there exists a prime factor $p$ of $n$ such that the number $p + 1 - t_p$ is $B$-smooth (so that $k$ is a multiple of it), then ECM is likely to find a non-trivial divisor of $n$. The author proves that if an elliptic curve over $\mathbb{F}_p$, where $p > 3$ is prime, is chosen at random, then its order is approximately uniformly distributed in the interval $(p + 1 - 2\sqrt{p}, p + 1 + 2\sqrt{p})$ (this is in fact proved only for the interval $(p + 1 - \sqrt{p}, p + 1 + \sqrt{p})$). It follows that, if the algorithm fails, it can be run again with a different elliptic curve. This will likely yield a new value of $t_p$, so the number $p + 1 - t_p$ gets a new chance to be $B$-smooth.

It will be shown that, under certain assumptions and with a suitable choice of parameters (see paragraph II-A7 for the details), given a positive integer $g$, ECM finds a non-trivial divisor of $n$ within time $g K(p) M(n)$ with probability at least $1 - e^{-g}$, where the function $K : \mathbb{R}_{>0} \to \mathbb{R}_{>0}$ is such that $K(x) = e^{\sqrt{(2 + o(1)) \log x \log\log x}}$ for $x \to \infty$, $p$ is the least prime factor of $n$, and $M(n)$ is an upper bound for the time required by a single addition on an elliptic curve modulo $n$. The worst case occurs if $n = pq$ with $p$, $q$ primes $\approx \sqrt{n}$, and the time becomes $g M(n) e^{(1 + o(1)) \sqrt{\log n \log\log n}}$. Several other algorithms have an expected running time given by the latter expression, but independent of the size of the prime factors of $n$. For example, the expected running time of the Quadratic Sieve (QS) [6] is the same as that of ECM in the worst case. However, ECM is expected to be faster in the presence of small prime factors.

4) ECM with one curve: Let $n, v, w \in \mathbb{Z}_{>1}$ and $a, x, y \in \mathbb{Z}/n\mathbb{Z}$ be given.
For each integer $r \ge 2$, denote by $e(r)$ the largest integer $m$ such that $r^m \le v + 2\sqrt{v} + 1$, and put

$$k = \prod_{r=2}^{w} r^{e(r)}.$$

Given $P = (x : y : 1) \in V_n$, attempt to compute $kP$ using the pseudo-addition method described in paragraph II-A2. If it fails, a non-trivial divisor $d$ of $n$ is found. If it succeeds in computing $kP$, the algorithm terminates with no factors found.

5) ECM trying several curves: Given $n, v, w, h \in \mathbb{Z}_{>1}$, generate $a, x, y \in \mathbb{Z}/n\mathbb{Z}$ at random, and apply algorithm (II-A4) to $n, v, w, a, x, y$. If a non-trivial divisor $d$ of $n$ is found, halt. Otherwise repeat the above procedure unless it has already been applied $h$ times. The choice of $a, x, y$ determines the elliptic curve used.

In algorithm (II-A4), the value $v$ may be thought of as an upper bound for the divisor $d$ that one hopes to find, though the algorithm can determine a divisor $d$ larger than $v$. The parameter $w$ determines the execution time and the probability of success: the larger $w$, the larger both are. In algorithm (II-A5), $w$ determines the execution time of the algorithm on a single curve and $h$ is the number of curves that will be tried. In this case the probability of success is a function of $w$ and $h$.

6) When does the algorithm succeed?: The author proves a sufficient condition for the success of the algorithm.

Proposition 1. Let $n, v, w \in \mathbb{Z}_{>1}$ and $a, x, y \in \mathbb{Z}/n\mathbb{Z}$ be as in algorithm (II-A4), put $b = y^2 - x^3 - ax \in \mathbb{Z}/n\mathbb{Z}$ and $P = (x : y : 1) \in V_n$ (see paragraph II-A2). Let $p$ and $q$ be prime divisors of $n$ satisfying the following conditions.
1) $p \le v$;
2) $6(4\bar{a}^3 + 27\bar{b}^2) \neq 0$ for $\bar{a} = a \pmod p$, $\bar{b} = b \pmod p$;
3) each prime divisor $r$ of $\#E_{\bar{a},\bar{b}}(\mathbb{F}_p)$ satisfies $r \le w$;
4) $6(4\hat{a}^3 + 27\hat{b}^2) \neq 0$ for $\hat{a} = a \pmod q$, $\hat{b} = b \pmod q$;
5) $\#E_{\hat{a},\hat{b}}(\mathbb{F}_q)$ is not divisible by the largest prime number dividing the order of $P_p$ (see paragraph II-A2).
Then algorithm II-A4 finds a non-trivial divisor of $n$.

7) Efficiency: Assume that the addition chain used for computing $k \cdot P$ uses the binary representation of $k$.
Then $O(\log k)$ pseudo-additions are performed. Let $M(n)$ be an upper bound for the time, measured in bit operations, required to perform one pseudo-addition (see paragraph II-A2). Then algorithm (II-A4) requires time $O(w (\log v) M(n))$, since $k$ is such that $\log k = O(w \log v)$. Algorithm (II-A5) requires at most $h$ times as much, i.e., $O(h w (\log v) M(n))$ (neglecting the time required by the random number generator used). Using Proposition 1 and an estimate of the number of elliptic

curves over $\mathbb{F}_p$ whose order is not divisible by a given prime $l$, the author proves the following.

1) Let $n, v, w \in \mathbb{Z}_{>1}$ be such that $n$ has at least two distinct prime divisors $> 3$, and such that the smallest prime factor $p$ of $n$ for which $p > 3$ satisfies $p \le v$. Put

$$u = \#\{s \in \mathbb{Z} : |s - (p + 1)| < \sqrt{p}, \text{ and each prime dividing } s \text{ is } \le w\}.$$

Then the triple $(a, x, y)$ results in the success of the algorithm with probability not much less than the probability $u/(2[\sqrt{p}] + 1)$ that a random integer in the interval $(p + 1 - \sqrt{p}, p + 1 + \sqrt{p})$ has all its prime factors $\le w$.

2) (Corollary) Let $w \in \mathbb{Z}_{>1}$ be such that $u \ge 3$ and let $f(w) = u/(2[\sqrt{p}] + 1)$ be the above probability. Assume that in algorithm (II-A5) each triple $(a, x, y)$ is generated uniformly at random and successive triples are generated independently. There exists an effectively computable constant $c > 1$ such that for any $h \in \mathbb{Z}_{>1}$ the success probability of algorithm (II-A5) on input $n, v, w, h$ is at least $1 - c^{-h f(w)/\log v}$.

The author observes that choosing $h \approx (\log v)/f(w)$ provides a reasonable chance of success. If $h \approx (\log v)/f(w)$, algorithm (II-A5) requires time $O((\log v)^2 (w/f(w)) M(n))$. To minimize the running time it therefore suffices to minimize $w/f(w)$. The optimal value of $w$ is determined as follows. Define

$$L(x) = e^{\sqrt{\log x \log\log x}}$$

for a real number $x > e$. Given $\alpha \in \mathbb{R}_{>0}$, the probability that a random positive integer $s \le x$ has all its prime factors $\le L(x)^{\alpha}$ is $L(x)^{-1/(2\alpha) + o(1)}$ for $x \to \infty$. This is stated in a theorem of Canfield, Erdős and Pomerance. The author conjectures that this result remains valid if $s$ is a random integer in the interval $(x + 1 - \sqrt{x}, x + 1 + \sqrt{x})$. Putting $x = p$, this implies that $f(L(p)^{\alpha}) = L(p)^{-1/(2\alpha) + o(1)}$ for $p \to \infty$, for any fixed positive $\alpha$ and $f(w)$ as in the corollary above. If $w = L(p)^{\alpha}$ then $w/f(w) = L(p)^{1/(2\alpha) + \alpha + o(1)}$ for $p \to \infty$, the optimal choice of $w$ being

$$w = L(p)^{1/\sqrt{2} + o(1)}, \qquad w/f(w) = L(p)^{\sqrt{2} + o(1)}, \qquad \text{for } p \to \infty.$$

The choice of $w$ depends on $p$, the least prime factor $> 3$ of $n$, which is not known beforehand.
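The optimal exponent quoted above follows from a one-line minimization. The worked step below is a sketch that treats the $o(1)$ terms as negligible; the symbol $\theta(\alpha)$ is notation introduced only for this illustration.

```latex
\[
  \frac{w}{f(w)} = L(p)^{\theta(\alpha) + o(1)}, \qquad
  \theta(\alpha) = \alpha + \frac{1}{2\alpha}, \qquad
  \theta'(\alpha) = 1 - \frac{1}{2\alpha^{2}} = 0
  \;\Longrightarrow\; \alpha = \frac{1}{\sqrt{2}},
\]
\[
  \theta\!\left(\tfrac{1}{\sqrt{2}}\right)
  = \frac{1}{\sqrt{2}} + \frac{\sqrt{2}}{2} = \sqrt{2}.
\]
```

Since $\theta''(\alpha) = 1/\alpha^{3} > 0$ for $\alpha > 0$, this stationary point is a minimum, which matches the exponents $1/\sqrt{2}$ for $w$ and $\sqrt{2}$ for $w/f(w)$.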
In practice $p$ is replaced by $v$ in the above formula for $w$, and algorithm (II-A5) is performed for a reasonable increasing sequence of values of $v$. Using these facts (notice that the factor $(\log v)^2$ in the execution time above is $L(p)^{o(1)}$), the author provides the following conjectural running time estimate for ECM.

Conjecture 1. There is a function $K : \mathbb{R}_{>0} \to \mathbb{R}_{>0}$ with $K(x) = e^{\sqrt{(2 + o(1)) \log x \log\log x}}$ for $x \to \infty$ such that the following assertion holds. Let $n \in \mathbb{Z}_{>1}$ be an integer that is not a prime power and that is not divisible by 2 or 3, and let $g$ be any positive integer. Then algorithm (II-A5), when performed with suitable values for $v, w, h$, can be used to find a non-trivial divisor of $n$ with probability at least $1 - e^{-g}$, within time $g K(p) M(n)$, where $p$ denotes the least prime factor of $n$ and $M(n)$ denotes an upper bound for the time required by the pseudo-addition algorithm defined in paragraph II-A2, measured in bit operations.

ECM can be repeated until it leads to the complete factorization of $n$, with expected time at most $L(n)^{1 + o(1)} = e^{(1 + o(1)) \sqrt{\log n \log\log n}}$ for $n \to \infty$. The worst case occurs if the second largest prime factor of $n$ is not much smaller than $\sqrt{n}$, so that $n$ is built up from some small primes and two large primes of the same size.

8) Conclusions: If the second largest prime factor of $n$ is much smaller than $\sqrt{n}$, ECM is asymptotically faster than several other algorithms whose conjectured expected execution time is $L(n)^{1 + o(1)}$ independently of the size of the prime factors of $n$. In practice, however, these algorithms may be faster in the worst case, because of the different constants hidden in the asymptotics. ECM can also be used to recognize numbers that are built up from prime factors smaller than a given bound, a problem that must be solved in several factoring algorithms.

B. Speeding the Pollard and Elliptic Curve Methods of Factorization

In this paper the author presents some techniques to speed up several algorithms for integer factorization.
1) Introduction: Four factoring algorithms are considered in this paper: ECM, Pollard's $(p-1)$-method, Pollard's Rho method and Williams' $(p+1)$-method. In the context of this research proposal, the techniques to speed up ECM are the most relevant, and the following description focuses on them. In some cases, such techniques can be adapted to the other algorithms.

All the aforementioned algorithms involve computations modulo the composite number $n$ to be factored, which is assumed to have a prime factor $p$. At the end of each step of these algorithms one must compute the gcd of a partial result with $n$, hoping that it will be a non-trivial divisor of $n$. It is possible to avoid taking a gcd at each step by replacing it with a multiplication modulo $n$ and computing a single gcd at the end of the last step. This is accomplished by applying the following observation:

$$p \mid \gcd(xy \bmod n, n) \iff p \mid \gcd(x, n) \text{ or } p \mid \gcd(y, n).$$

It follows that if $k$ steps are performed, and at the end of step $i$ the gcd of the result $x_i \bmod n$ and $n$ would have to be computed, it is possible to accumulate the results by multiplying them together. Then, after the last step, the gcd of the final product and $n$ is computed, i.e., $d = \gcd(x_1 x_2 \cdots x_k \bmod n, n)$. In this way $k$ gcd's are replaced by $k - 1$ multiplications modulo $n$ and one gcd with $n$. It can happen that $d = n$ (i.e., all the prime factors of $n$ have been found), in which case one

must backtrack to check whether all the factors were found at once in a single step, or different divisors were found at different steps. In the latter case the algorithm is successful. The main technique studied in the following is the stage two, or continuation, of ECM.

2) ECM stage two: The version of ECM presented in II-A4 will be referred to as stage one of the algorithm. It can be summarized as follows. To factor a composite $n$, select a random elliptic curve $E$ modulo $n$ and a point $P = (x, y)$ on it, and then compute $Q = kP$, where $k > 0$ is an integer divisible by all prime powers less than a positive integer bound $B_1$. If $p$ is a prime factor of $n$, stage one succeeds when $k$ is divisible by the order of $P$ on the curve $E$ modulo $p$ (but not by the order of $P$ on the curve $E$ modulo all the other prime factors of $n$), in which case $Q = kP = O$ on $E$ modulo $p$ and a non-trivial divisor is found through a gcd computation. If stage one fails, the point $Q$ on $E$ modulo $n$ is output. The number of curve operations required to compute $Q$ is $O(\log k) = O(B_1)$. In case of failure, one can increase the bound $B_1$ and run ECM again, or simply abandon it.

Assume now that $sQ = O$ on $E$ modulo $p$ for some prime factor $p$ of $n$ (but not for all of them), where $s$ is a prime between $B_1$ and a larger value $B_2$. In other words, one assumes that the order of $Q$ modulo $p$ is $s$ (i.e., the order of $P$ modulo $p$ is $B_1$-smooth except for the prime $s$). In this case, one can run stage one again with the bound $B_1$ increased to the value of $B_2$ to have a good chance of success. The number of elliptic curve operations will be $O(B_2)$. A better alternative is to run the stage two, or continuation, of the algorithm, which is tailored to cases in which the order of $Q$ (or $P$) is of the above form. The idea is to attempt to find the prime $s$ such that $sQ = O$ on $E$ modulo $p$ in a smart way.
One wants to increase the chance of success of each run of the algorithm on a given curve at a small additional cost (e.g., comparable with the cost of the stage one just executed). This reduces the overall expected running time.

The standard continuation entails testing each prime $s$ between $B_1$ and $B_2$, one after the other. This can be done naively by computing $sQ$ for each $s$, but the cost would be comparable to running stage one again with $B_1 = B_2$. A smarter approach arises from the observation that, if $s_j$ denotes the $j$-th prime, the difference $s_{j+1} - s_j$ is known to be small. The idea is to pre-compute the points $(s_{j+1} - s_j)Q$ for all the differences of consecutive primes in the interval $(B_1, B_2)$ and store them in a table. One can then use the table to compute $s_{j+1}Q$ as $(s_{j+1} - s_j)Q + s_jQ$ for $j \ge \pi(B_1) + 1$. If the largest difference between two consecutive primes in the interval $(B_1, B_2)$ is $D$, then the table has at most $D/2$ entries, which can be computed in $O(D)$ elliptic curve operations. The number of elliptic curve operations needed to compute the first point $s_{\pi(B_1)+1}Q$ is $O(\log s_{\pi(B_1)+1})$. Finally, the number of elliptic curve operations required to compute the multiple of $Q$ for each prime in $(B_1, B_2)$ using the pre-computed differences is $\pi(B_2) - \pi(B_1)$. The overall cost of this continuation is roughly $\pi(B_2) - \pi(B_1)$ elliptic curve operations plus $\pi(B_2) - \pi(B_1)$ modular gcd's or multiplications. This is not a significant improvement over running stage one again with $B_1$ increased to $B_2$.

3) Baby-step giant-step approach: The performance of stage two can be improved by using a time-memory trade-off technique to look for the prime $s$. The idea is to represent each prime in $(B_1, B_2)$ in a sort of radix-$w$ representation, where $w$ is an integer such that $w \approx \sqrt{B_2}$. Let $v_1 = \lfloor B_1/w \rfloor$ and $v_2 = \lceil B_2/w \rceil$.
Assume that affine coordinates are used. For each $v$ such that $v_1 \le v \le v_2$ and each $u$ such that $0 \le u < w$, compute $vwQ = (x_{vwQ}, y_{vwQ})$ and $uQ = (x_{uQ}, y_{uQ})$. Then compute

$$h = \prod_{v} \prod_{u} (x_{vwQ} - x_{uQ}) \bmod n \qquad (2)$$

over each $u$ and $v$ such that $s = vw + u$ for some prime $s$ in $(B_1, B_2)$, in $\pi(B_2) - \pi(B_1)$ modular multiplications. Finally check whether $\gcd(h, n)$ gives a non-trivial divisor of $n$. The number of elliptic curve operations is now reduced from $\pi(B_2) - \pi(B_1)$ to $O(\sqrt{B_2})$. Memory requirements have changed from $D/2$ to $O(\sqrt{B_2})$. The cost is further reduced by storing the points $uQ$ only for $u$ such that $\gcd(u, w) = 1$, thus dropping points for which $u$ does not correspond to any prime. Moreover, the points $vwQ$ need not be stored and can be computed as needed if the primes are processed in ascending order. More memory can be saved by reducing the value of $w$.

Performance can be further improved if two primes are tested at once. To do so, one must look for pairs $(v, u)$ such that every prime in the interval $(B_1, B_2)$ is represented as $vw \pm u$ for some pair $(v, u)$. Now consider the polynomial $g(m) = m^2$ and observe that, given two primes represented by the pair $(v, u)$, $s_1 = vw + u$ and $s_2 = vw - u$,

$$vw \pm u \mid g(vw) - g(u) = (vw)^2 - u^2.$$

The idea is to store the points $g(vw)Q$ and $g(u)Q$ corresponding to the pairs found in tables, and then recover them through table look-ups to compute $\gcd(x_{g(vw)Q} - x_{g(u)Q}, n)$. To keep the tables small, the values of $v$ and $u$ should be restricted. A possible choice is

$$u \le u_{\max}, \qquad v_1 = \lfloor B_1/w \rfloor \le v \le \lceil B_2/w \rceil,$$

where $u_{\max} \ge w/2$ is selected in advance. Building the tables requires $O(v_2 - v_1) + O(u_{\max})$ elliptic curve operations. The number of gcd's or modular multiplications performed to look for a non-trivial divisor is then proportional to the number of pairs $(v, u)$ required to represent all primes in $(B_1, B_2)$, so this number should be reduced as much as possible.
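The representation of primes as $vw \pm u$ can be illustrated with a short Python sketch. It uses the simplest possible rule, rounding each prime to the nearest multiple of $w$ (so $u \le w/2$), which is an assumption of this example and not the pairing strategy of the paper; the names `primes_between` and `pair_primes` are hypothetical.

```python
from math import isqrt

def primes_between(lo, hi):
    """Primes s with lo < s < hi, by trial division (fine for a small demo)."""
    return [s for s in range(max(lo + 1, 2), hi)
            if all(s % d for d in range(2, isqrt(s) + 1))]

def pair_primes(B1, B2, w):
    """Map each pair (v, u) to the primes in (B1, B2) of the form v*w + u or v*w - u.

    v*w is the multiple of w nearest to the prime, so 0 <= u <= w/2.
    A pair that covers two primes tests both with a single
    x-coordinate difference in stage two.
    """
    pairs = {}
    for s in primes_between(B1, B2):
        v = (s + w // 2) // w          # index of the nearest multiple of w
        u = abs(s - v * w)
        pairs.setdefault((v, u), []).append(s)
    return pairs

pairs = pair_primes(10, 50, 6)
for (v, u), covered in sorted(pairs.items()):
    print(f"v={v}, u={u}: primes {covered}")
print(len(pairs), "pairs cover", sum(len(c) for c in pairs.values()), "primes")
```

With $w = 6$ every prime above 3 is of the form $6v \pm 1$, so in this toy run 7 pairs cover the 11 primes between 10 and 50. Allowing $u_{\max} > w/2$, as in the text, gives more freedom to pair up primes that this simple rule leaves alone.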
One idea for devising an algorithm that finds such pairs is based on the following observation: given two primes $s_1 = vw + u$ and $s_2 = vw - u$, their sum $s_1 + s_2$ is a multiple of $2w$; conversely, if $s_1 + s_2$ is a multiple of $2w$, then $s_1 = vw + u$ and $s_2 = vw - u$ for some $u$ and $v$. The idea is to maintain a queue $Q_q$ for each $q$, where $q$ ranges over the residues modulo $2w$ with $|q| \le w$. For each prime $s$ to be paired, compute $q = s \bmod 2w$ and $a$ such that $2aw + q = s$. Then store $a$ into the queue $Q_q$, unless there is an $a'$ (corresponding to the prime $2a'w - q$) in $Q_{-q}$ such that $u = w(a - a') + q$ is less than $u_{\max}$. If this is the case, then two primes have been paired. After all the primes

are processed as described, some elements corresponding to unpaired primes may remain in some queues, in which case they are paired with a composite.

4) FFT continuation: Another possible approach is the Fast Fourier Transform (FFT) continuation, which splits the interval $(B_1, B_2)$ into smaller intervals of length $w$ and pre-computes several multiples of the point $Q$ as above. The double product in (2) is now viewed as a polynomial $h(x)$, whose roots are the $x$ coordinates of the points $uQ$, evaluated at a sequence of values (the $x$ coordinates of the points $vwQ$). For each $0 \le u < w$ with $\gcd(u, w) = 1$ and each $v_1 \le v \le v_2$, where $v_1 = \lfloor B_1/w \rfloor$ and $v_2 = \lceil B_2/w \rceil$, compute the points $uQ = (x_{uQ}, y_{uQ})$ and $vwQ = (x_{vwQ}, y_{vwQ})$. Then compute the coefficients of the polynomial

$$h(x) = \prod_{u} (x - x_{uQ}) \bmod n$$

as follows.
1) Write $h(x)$ recursively as the product of two monic polynomials of degrees as close as possible, and store each polynomial in a binary tree. If $\varphi(w)$ is a power of 2, the tree has $\log_2 \varphi(w)$ levels. The $i$-th level (the root corresponds to $i = \log_2 \varphi(w)$ and the leaves to $i = 0$) has at most $\varphi(w)/2^i$ polynomials of degree $\varphi(w)/2^{\log_2 \varphi(w) - i} = 2^i$.
2) These polynomials are multiplied together pairwise from the leaves up to the root (which is $h(x)$), using fast algorithms for polynomial multiplication that require $O(d \log d)$ operations for two polynomials of degree $d$.

The cost is then $O(\varphi(w)(\log \varphi(w))^2)$ operations modulo $n$, where $\varphi(w)$ is the number of positive integers less than $w$ and coprime to $w$. The value $\varphi(w)$ is the degree of $h(x)$, since $h(x)$ has as many roots as there are different values of $u$. Next evaluate $h_v = h(x_{vwQ})$ for each $v_1 \le v \le v_2$ and compute $h = \prod_v h_v$. Finally check whether $\gcd(h, n)$ gives a non-trivial divisor of $n$. A polynomial of degree $d$ can be evaluated at $d$ successive terms of a geometric progression in $d \log d$ steps, so the above evaluation can be accomplished in $O(\varphi(w) \log \varphi(w))$ steps (if $\varphi(w) \approx \lceil B_2/w \rceil - \lfloor B_1/w \rfloor$).
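The tree construction of $h(x)$ and the final gcd can be sketched as follows. This is a simplified illustration: schoolbook polynomial multiplication and Horner evaluation stand in for the FFT-based routines that give the stated complexity, and the names `poly_mul`, `poly_from_roots`, `poly_eval` and `stage2_gcd` are hypothetical. The lists `roots` and `points` play the roles of the values $x_{uQ}$ and $x_{vwQ}$.

```python
from math import gcd

def poly_mul(f, g, n):
    """Schoolbook product of two polynomials mod n (coefficients low to high)."""
    out = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] = (out[i + j] + fi * gj) % n
    return out

def poly_from_roots(roots, n):
    """Build h(x) = prod (x - r) mod n bottom-up, level by level of the tree."""
    if not roots:
        return [1]
    level = [[(-r) % n, 1] for r in roots]        # leaves: x - r
    while len(level) > 1:
        nxt = [poly_mul(level[i], level[i + 1], n)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                        # odd node is carried upward
            nxt.append(level[-1])
        level = nxt
    return level[0]                               # root: h(x)

def poly_eval(f, x, n):
    """Horner evaluation of f at x modulo n."""
    acc = 0
    for c in reversed(f):
        acc = (acc * x + c) % n
    return acc

def stage2_gcd(roots, points, n):
    """Accumulate h = prod_v h(point_v) mod n and take a single gcd with n."""
    h = poly_from_roots(roots, n)
    acc = 1
    for x in points:
        acc = acc * poly_eval(h, x, n) % n
    return gcd(acc, n)

print(poly_from_roots([1, 2, 3, 4], 101))   # [24, 51, 35, 91, 1]
print(stage2_gcd([3, 4], [10, 2], 35))      # 7, because 10 = 3 (mod 7)
```

The second call mimics the situation in which some $x_{vwQ}$ and some $x_{uQ}$ agree modulo one prime factor of $n$ but not modulo $n$: the accumulated product is then divisible by that prime only.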
Montgomery suggests choosing $w \approx \sqrt{B_2}$ for good asymptotic performance.

5) Montgomery curves: The equation

$$By^2 = x^3 + Ax^2 + x \qquad (3)$$

defines a Montgomery curve. Montgomery curves provide faster arithmetic than Weierstrass curves in contexts in which the $y$ coordinate of points can be dropped. This amounts to identifying points up to sign; despite that, scalar multiplication can still be computed. In ECM this can be exploited because, as seen so far, the only computation involved on elliptic curves is scalar multiplication. There is no need to determine the sign of a point at any time, and only the $x$ coordinate is of interest.

Given two points on a Montgomery curve, $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$, and their difference $P_4 = P_1 - P_2 = (x_4, y_4)$, it is possible to derive efficient formulae for computing the $x$ coordinate of their sum $P_3 = P_1 + P_2 = (x_3, y_3)$ that do not involve $y$ coordinates. This is done by manipulating the product $x_3 x_4$ using (3) and introducing projective coordinates, i.e., $x = X/Z$. Given the ratios $X_1/Z_1$ and $X_2/Z_2$ for distinct points $P_1$ and $P_2$, the ratio $X_3/Z_3$ of their sum is given by:

$$X_3 = 4Z_4 (X_1 X_2 - Z_1 Z_2)^2, \qquad Z_3 = 4X_4 (X_1 Z_2 - Z_1 X_2)^2.$$

These formulae can be computed using 2 squarings and 4 multiplications by caching some intermediate values. Given $X_1/Z_1$ for $P_1$, the ratio $X_3/Z_3$ of $P_3 = 2P_1$ is given by:

$$X_3 = (X_1^2 - Z_1^2)^2, \qquad Z_3 = (4X_1 Z_1)\left[(X_1 - Z_1)^2 + ((A + 2)/4)(4X_1 Z_1)\right].$$

These formulae can be computed using 2 squarings and 3 multiplications by caching some intermediate values. Since the above addition formulae require the difference of the two points, the scalar multiplication ($Q = kP$ for a positive integer $k$) is performed using a special case of addition chains (see the end of paragraph II-A2 for the definition of addition chain) called Lucas chains [7].
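The two formulae above are enough to compute $x(kP)$ without ever touching $y$. The sketch below uses the Montgomery ladder, which is one particular binary Lucas chain (the paper's chains may be shorter); the names `xadd`, `xdbl` and `ladder` and the toy parameters $A = 3$, $n = 13$, $x = 2$ are assumptions of this example. The doubling formula is written with $(A + 2) X_1 Z_1$ in place of $((A + 2)/4)(4 X_1 Z_1)$, which is the same value and avoids a division by 4 modulo $n$.

```python
def xadd(P1, P2, P4, n):
    """x-only addition: (X3 : Z3) of P1 + P2, given P4 = P1 - P2, modulo n."""
    X1, Z1 = P1
    X2, Z2 = P2
    X4, Z4 = P4
    X3 = 4 * Z4 * (X1 * X2 - Z1 * Z2) ** 2 % n
    Z3 = 4 * X4 * (X1 * Z2 - Z1 * X2) ** 2 % n
    return (X3, Z3)

def xdbl(P1, A, n):
    """x-only doubling: (X3 : Z3) of 2*P1 on B*y^2 = x^3 + A*x^2 + x, modulo n."""
    X1, Z1 = P1
    X3 = (X1 * X1 - Z1 * Z1) ** 2 % n
    Z3 = 4 * X1 * Z1 * ((X1 - Z1) ** 2 + (A + 2) * X1 * Z1) % n
    return (X3, Z3)

def ladder(k, x, A, n):
    """Return (X : Z) with X/Z = x(k*P) for k >= 1, where x = x(P).

    Invariant: (R0, R1) = (m*P, (m+1)*P), so the difference R1 - R0 is
    always P and the differential addition formula applies at every step.
    """
    P = (x % n, 1)
    R0, R1 = P, xdbl(P, A, n)
    for bit in bin(k)[3:]:                 # bits of k after the leading 1
        if bit == "1":
            R0, R1 = xadd(R0, R1, P, n), xdbl(R1, A, n)
        else:
            R0, R1 = xdbl(R0, A, n), xadd(R0, R1, P, n)
    return R0

# Toy example modulo the prime 13 with A = 3 and x(P) = 2.
X, Z = ladder(6, 2, 3, 13)
print(X * pow(Z, -1, 13) % 13)             # 4, the affine x coordinate of 6P
```

The constant $B$ of Eq. (3) never appears, since the $x$-only formulae do not depend on it. In ECM one would keep $(X : Z)$ modulo the composite $n$ and take $\gcd(Z, n)$ at the end instead of inverting $Z$.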
Suyama's parametrization for Montgomery curves allows one to select a curve (and fix a point on it) whose order is divisible by 12. This is desirable when looking for curves whose order is divisible by small prime powers, as in ECM (because it is already divisible by 3 and $2^2$).

6) Conclusions: This paper describes several techniques to improve ECM, above all the continuation, or stage two, of ECM (executed upon the failure of stage one, i.e., the original algorithm), which reduces the expected running time of the algorithm. Montgomery curves provide fast arithmetic for ECM, but twisted Edwards curves (presented in the next section) are asymptotically faster.

C. Twisted Edwards Curves Revisited

In this paper [4], the authors present fast algorithms for computing group operations on twisted Edwards curves, which lead to the fastest elliptic curve scalar multiplication and can speed up both ECC and cryptanalytic applications (e.g., ECM). The following notation is used to analyze the algorithms: M: field multiplication, S: field squaring, I: field inversion, D: multiplication by a curve constant.

1) Introduction: Recently Edwards curves have gained attention in the context of cryptology because of their fast arithmetic. Edwards introduced a normal form for elliptic curves along with the addition law; such curves are defined by $x^2 + y^2 = c^2 + c^2 x^2 y^2$ [8]. Bernstein and Lange introduced a more general version of these curves, defined by $x^2 + y^2 = c^2(1 + dx^2 y^2)$ or $x^2 + y^2 = 1 + dx^2 y^2$, along with the first algorithm for computing group operations in projective coordinates (e.g., point addition requires 10M+1S+1D) [9]. These curves are today known as Edwards curves. Bernstein and Lange also introduced inverted Edwards coordinates, resulting in point addition with cost 9M+1S+1D [10]. Finally, Bernstein and other authors introduced a generalization of Edwards curves, i.e., twisted Edwards curves [11].
The authors of [4] present the fastest group arithmetic for twisted Edwards curves, obtained by using an additional coordinate, i.e., the extended twisted Edwards coordinate system. They design a fast algorithm for scalar multiplication by mixing this system with the standard one.

2) Twisted Edwards curves: The following terms characterize the group law (additive notation) on elliptic curves:
unified: point addition formulae that remain valid when the two input points are identical.
complete: point addition formulae defined for all inputs.
mixed: point addition formulae that add an affine point to a point in a given projective representation.

Let $K$ be a field of odd characteristic. Edwards curves are defined by $x^2 + y^2 = c^2(1 + dx^2 y^2)$ where $c, d \in K$ with $cd(1 - dc^4) \ne 0$. This form is a special case of the more general twisted Edwards form defined by $E_{E,a,d} : ax^2 + y^2 = 1 + dx^2 y^2$ where $a, d \in K$ with $ad(a - d) \ne 0$ (Edwards curves represent the special case where $a$ can be rescaled to 1). Group operation formulae for these curves can be found in [11]. The inversion I is usually more expensive than M. It is then convenient to use projective coordinates to avoid it:

$$(aX^2 + Y^2)Z^2 = Z^4 + dX^2 Y^2. \qquad (4)$$

Eq. (4) defines the projective closure of the curve $ax^2 + y^2 = 1 + dx^2 y^2$. The identity element is $(0 : 1 : 1)$ and the negative of $(X : Y : Z)$ is $(-X : Y : Z)$. For all nonzero $\lambda \in K$, $(X : Y : Z) = (\lambda X : \lambda Y : \lambda Z)$. This system is denoted by $\mathcal{E}$.

3) Extended Twisted Edwards Coordinates: A new coordinate $t = xy$ is introduced to represent a point $(x, y)$ on $ax^2 + y^2 = 1 + dx^2 y^2$ in extended affine coordinates $(x, y, t)$. The map $(x, y, t) \mapsto (x : y : t : 1)$ allows passing to projective coordinates. For all nonzero $\lambda \in K$, $(X : Y : T : Z) = (\lambda X : \lambda Y : \lambda T : \lambda Z)$ satisfies Eq. (4) and corresponds to the extended affine point $(X/Z, Y/Z, T/Z)$ with $Z \ne 0$. The auxiliary coordinate $T$ has the property $T = XY/Z$. This system is called extended twisted Edwards coordinates and is denoted by $\mathcal{E}^e$. The identity element is $(0 : 1 : 0 : 1)$.
The negative of $(X : Y : T : Z)$ is $(-X : Y : -T : Z)$. Given $(X : Y : Z)$ in $\mathcal{E}$, passing to $\mathcal{E}^e$ can be performed in 3M+1S by computing $(XZ : YZ : XY : Z^2)$, whereas given $(X : Y : T : Z)$ in $\mathcal{E}^e$, passing to $\mathcal{E}$ is cost-free by dropping $T$.

Unified Addition in $\mathcal{E}^e$. Given $(X_1 : Y_1 : T_1 : Z_1)$ and $(X_2 : Y_2 : T_2 : Z_2)$ with $Z_1 \ne 0$ and $Z_2 \ne 0$, then $(X_1 : Y_1 : T_1 : Z_1) + (X_2 : Y_2 : T_2 : Z_2) = (X_3 : Y_3 : T_3 : Z_3)$ where

$$\begin{aligned}
X_3 &= (X_1 Y_2 + Y_1 X_2)(Z_1 Z_2 - dT_1 T_2), \\
Y_3 &= (Y_1 Y_2 - aX_1 X_2)(Z_1 Z_2 + dT_1 T_2), \\
T_3 &= (Y_1 Y_2 - aX_1 X_2)(X_1 Y_2 + Y_1 X_2), \\
Z_3 &= (Z_1 Z_2 - dT_1 T_2)(Z_1 Z_2 + dT_1 T_2).
\end{aligned} \qquad (5)$$

These unified formulae are complete if $d$ is not a square in $K$ and $a$ is a square in $K$, and they can be computed with a 9M+2D algorithm by caching some intermediate results. An 8M+2D mixed addition algorithm can be derived by setting $Z_2 = 1$, i.e., adding $(X_1 : Y_1 : T_1 : Z_1)$ and an extended affine point $(x_2, y_2, x_2 y_2)$, which can be written as $(x_2 : y_2 : x_2 y_2 : 1)$. If $\mathcal{E}^e$ is used, an 8M+1D point addition algorithm can be devised if $a = -1$, transforming the curve into a more convenient form. The operations can be reduced to 7M+1D by setting $Z_2 = 1$ and using a mixed addition algorithm.

Dedicated Addition in $\mathcal{E}^e$. In this case the formulae are similar to (5) but they are independent of the curve constant $d$. The operations can be performed with a 9M+1D algorithm, and a mixed addition algorithm can be derived by setting $Z_2 = 1$. The case $a = -1$ yields an 8M algorithm, which can be reduced further to 7M by setting $Z_2 = 1$.

Dedicated Doubling in $\mathcal{E}^e$. The authors provide doubling formulae that are independent of the curve constant $d$. The operations can be performed with a 4M+4S+1D algorithm, which can be improved by mixing $\mathcal{E}^e$ with $\mathcal{E}$. These formulae do not require the $T_1$ coordinate of the point to be doubled. Notice that these formulae are slower than the 3M+4S+1D ones in $\mathcal{E}$ [11].
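The unified addition (5) can be sketched in Python as follows. This is an illustrative sketch, not the operation-optimized algorithm of [4]: the products are written out directly for readability rather than arranged to reach the 9M+2D count, the function names are hypothetical, and a simple double-and-add loop that uses the unified addition for both doublings and additions stands in for a real scalar multiplication routine. Arithmetic is done modulo a prime `p`; `pow(Z, -1, p)` requires Python 3.8 or later.

```python
def ext_add(P, Q, a, d, p):
    """Unified addition (5) in extended twisted Edwards coordinates mod p."""
    X1, Y1, T1, Z1 = P
    X2, Y2, T2, Z2 = Q
    A = (X1 * Y2 + Y1 * X2) % p
    B = (Y1 * Y2 - a * X1 * X2) % p
    C = Z1 * Z2 % p
    D = d * T1 * T2 % p
    X3 = A * (C - D) % p
    Y3 = B * (C + D) % p
    T3 = A * B % p
    Z3 = (C - D) * (C + D) % p
    return X3, Y3, T3, Z3

def to_extended(x, y, p):
    """Map an affine point (x, y) to (x : y : xy : 1)."""
    return x % p, y % p, x * y % p, 1

def to_affine(P, p):
    """Map (X : Y : T : Z) with Z != 0 back to (X/Z, Y/Z)."""
    X, Y, _, Z = P
    zi = pow(Z, -1, p)
    return X * zi % p, Y * zi % p

def scalar_mul(k, P, a, d, p):
    """Left-to-right double-and-add using only the unified addition."""
    R = (0, 1, 0, 1)                     # identity element
    for bit in bin(k)[2:]:
        R = ext_add(R, R, a, d, p)       # unified formulae also double
        if bit == '1':
            R = ext_add(R, P, a, d, p)
    return R
```

Because the formulae are unified, the same routine handles `ext_add(R, R, ...)`; when $a$ is a square and $d$ is not, they are also complete, so the identity needs no special case.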
4) Applications: The authors focus on the implementation of scalar multiplication on parallel architectures. In particular, they present a detailed comparison between scalar multiplication in extended twisted Edwards coordinates using unified addition only and the Montgomery ladder on Montgomery curves. Both provide theoretical protection against Simple Power Analysis (SPA), since addition and doubling are performed using the same sequence of field operations. Extended twisted Edwards coordinates turn out to be faster in parallel environments, i.e., when 2 or 4 processors are used (up to 66.7% on 4 processors). In the context of ECM this comparison is not relevant, since SPA protection is not needed. However, the authors also propose a fast algorithm for scalar multiplication based on dedicated formulae for addition and doubling, which are faster than the ones for unified addition. This algorithm mixes twisted Edwards coordinates $\mathcal{E}$ with extended twisted Edwards coordinates $\mathcal{E}^e$ and uses a windowing technique. According to the authors, it is the fastest scalar multiplication algorithm for elliptic curves.

Fast Scalar Multiplication. Scalar multiplication on twisted Edwards curves involves mainly point doublings and can be sped up by mixing $\mathcal{E}^e$ and $\mathcal{E}$, replacing slower doublings in $\mathcal{E}^e$ with faster doublings in $\mathcal{E}$ and using the fact that no consecutive additions are performed:
1) If a point doubling is followed by another point doubling, use $\mathcal{E} \leftarrow 2\mathcal{E}$.
2) If a point doubling is followed by a point addition, use
a) $\mathcal{E}^e \leftarrow 2\mathcal{E}$ for the doubling step and then,
b) $\mathcal{E} \leftarrow \mathcal{E}^e + \mathcal{E}^e$ for the point addition step.

$\mathcal{E} \leftarrow 2\mathcal{E}$ is performed using the 3M+4S+1D formulae in [11]. The operation $\mathcal{E}^e \leftarrow 2\mathcal{E}$ can be performed as follows: instead of passing from $\mathcal{E}$ to $\mathcal{E}^e$ in 3M+1S as described in paragraph (II-C3), the dedicated doubling formulae in $\mathcal{E}^e$ are used, since they do not require the input $T_1$ and so can be used for $\mathcal{E}^e \leftarrow 2\mathcal{E}$. $\mathcal{E} \leftarrow \mathcal{E}^e + \mathcal{E}^e$ is based on the dedicated addition formulae in $\mathcal{E}^e$. The computation of $T_3$ can be avoided.
This compensates for the extra field multiplication necessary to compute $T_3$ in $\mathcal{E}^e \leftarrow 2\mathcal{E}$. The authors show the cost estimates, in terms of M performed, for 256-bit fast scalar multiplication under different S/M and D/M scenarios. Twisted Edwards curves with $a = -1$ and mixed coordinates are always faster than Edwards curves, inverted Edwards curves and Montgomery curves.
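The coordinate-mixing rule can be illustrated with a small cost model in Python. This is a sketch under stated assumptions, not the algorithm of [4]: it uses plain left-to-right double-and-add instead of the windowing technique, the names `mixed_schedule`, `cost_in_M` and `COST` are hypothetical, and the 7M entry for the addition is an assumption obtained from the 8M dedicated addition for $a = -1$ minus the one multiplication saved by not computing $T_3$.

```python
# (M, S, D) counts per operation; the first two are the counts quoted above,
# the third is an assumption: 8M dedicated addition (a = -1) without T3.
COST = {
    'E<-2E':    (3, 4, 1),
    'Ee<-2E':   (4, 4, 1),
    'E<-Ee+Ee': (7, 0, 0),
}

def mixed_schedule(k):
    """List the operations for computing kP (k >= 1) by double-and-add."""
    ops = []
    for bit in bin(k)[3:]:            # skip the leading 1: start from P
        if bit == '1':
            ops.append('Ee<-2E')      # doubling followed by an addition
            ops.append('E<-Ee+Ee')
        else:
            ops.append('E<-2E')       # doubling not followed by an addition
    return ops

def cost_in_M(k, s_over_m, d_over_m):
    """Total cost in field multiplications for given S/M and D/M ratios."""
    total = 0.0
    for op in mixed_schedule(k):
        m, s, d = COST[op]
        total += m + s * s_over_m + d * d_over_m
    return total
```

For example, `mixed_schedule(0b1011)` yields one $\mathcal{E} \leftarrow 2\mathcal{E}$ doubling followed by two doubling-plus-addition pairs; only the doublings that precede an addition pay for the extra coordinate.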

Fast Scalar Multiplication in parallel. Mixing $\mathcal{E}$ with $\mathcal{E}^e$ in the scalar multiplication algorithm does not seem to provide sources of parallelism that can be exploited. However, the authors show that, using 4 processors, the doubling operation $\mathcal{E}^e \leftarrow 2\mathcal{E}^e$ can be performed with a 1M+1S algorithm and the addition operation $\mathcal{E}^e \leftarrow \mathcal{E}^e + \mathcal{E}^e$ can be performed with a 2M algorithm (a 2M+2S algorithm and a 4M algorithm, respectively, using 2 processors). This suggests using $\mathcal{E}^e$ only when working in parallel settings.

5) Conclusions: This paper introduces a new representation $\mathcal{E}^e$ for twisted Edwards curves and describes its group operations. A fast scalar multiplication algorithm using dedicated formulae is presented, which is designed by mixing $\mathcal{E}^e$ and $\mathcal{E}$. It is 4% to 18% faster than the algorithms in the literature and can be further sped up by a factor of 3.54 using 4 processors in parallel. This algorithm can be used to accelerate ECM.

III. RESEARCH PROPOSAL

This research proposal addresses the problem of implementing ECM efficiently on parallel architectures, which requires the study of parallel algorithms for elliptic curve arithmetic and finite field arithmetic. The efficiency of finite field arithmetic depends mainly on the modular multiplication operation. The first research goal is therefore the study of the implementation of algorithms for modular multiplication on GPUs (e.g., a comparison between schoolbook and Karatsuba multiplication, and an implementation of FFT multiplication using floating point arithmetic). Edwards curves provide the fastest elliptic curve arithmetic. Studying their efficiency on parallel architectures is relevant for ECM and for all applications using elliptic curves (see [5] for an example). The second goal is therefore the comparison between Edwards curve arithmetic and Montgomery curve arithmetic on GPUs. This implies a comparison between the sliding window algorithm and Montgomery's PRAC algorithm [7] for scalar multiplication.
Building on these insights, the third goal is the efficient implementation of ECM for factoring numbers up to roughly 200 bits, which can be used effectively as a sub-routine for co-factorization in the NFS. This fits within the RSA moduli factorization project at LACAL. The fourth goal is the optimization of ongoing work at LACAL on a high-throughput implementation of ECM for factoring larger numbers on GPUs. This is applicable in the context of the ECM record pursuit. Another goal of practical relevance is the optimization of the high-throughput GPU implementation of RSA developed at LACAL.

From a more theoretical perspective, there are several challenges that this proposal aims to take on. One is the optimization of Edwards curve arithmetic, with a focus on reducing the memory requirements and the number of additions performed. The second is the search for efficient curves for ECM (see [12]). The third is the implementation of stage two of ECM on parallel architectures, which is hindered by the memory requirements of the variants known in the literature.

Another interesting problem that can be explored is the study and implementation of higher genus curve arithmetic. One interesting application would be implementing the hyperelliptic curve method for factorization (HECM) introduced in [13]. In this work the author presents an implementation of HECM on central processing units (CPUs) derived from GMP-ECM. This implementation is faster than GMP-ECM for large numbers and can be improved by optimizing the squaring operation. Tackling the implementation of higher genus curve arithmetic on parallel architectures is also relevant for other problems, e.g., the elliptic curve discrete logarithm problem.

Although the choice of GPUs as the main implementation platform seems reasonable, the evolution of other architectures, such as multi-core CPUs and field programmable gate arrays (FPGAs), must not be neglected.
The debate on which of them is the most convenient for parallel applications is lively and so far has no clear winner (see [14] for a recent ECM implementation on FPGAs). The integration of CPUs with graphics processors (e.g., the AMD Fusion family), the availability of high-level synthesis tools for FPGAs and the constant improvement of GPGPU architectures complicate the picture even more.

REFERENCES

[1] A. K. Lenstra and H. W. Lenstra, Jr., Eds., The development of the number field sieve, ser. Lecture Notes in Mathematics. Berlin: Springer-Verlag, 1993, vol. 1554.
[2] H. W. Lenstra, "Factoring integers with elliptic curves," The Annals of Mathematics, vol. 126, no. 3, pp. 649–673, Nov. 1987.
[3] P. L. Montgomery, "Speeding the Pollard and Elliptic Curve Methods of Factorization," Mathematics of Computation, vol. 48, no. 177, pp. 243–264, 1987.
[4] H. Hisil, K. K.-H. Wong, G. Carter, and E. Dawson, "Twisted Edwards Curves Revisited," in Proceedings of the 14th International Conference on the Theory and Application of Cryptology and Information Security: Advances in Cryptology, ser. ASIACRYPT '08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 326–343.
[5] D. J. Bernstein, T.-R. Chen, C.-M. Cheng, T. Lange, and B.-Y. Yang, "ECM on Graphics Cards," in Proceedings of the 28th Annual International Conference on Advances in Cryptology: the Theory and Applications of Cryptographic Techniques, ser. EUROCRYPT '09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 483–501.
[6] C. Pomerance, "The Quadratic Sieve Factoring Algorithm," in EUROCRYPT '84, 1984, pp. 169–182.
[7] P. L. Montgomery, "Evaluating recurrences of form $X_{m+n} = f(X_m, X_n, X_{m-n})$ via Lucas chains," 1992, URL: ftp://ftp.cwi.nl/pub/pmontgom/lucas.ps.gz.
[8] H. M. Edwards, "A Normal Form for Elliptic Curves," Bulletin of the American Mathematical Society, vol. 44, no. 3, pp. 393–422, July 2007.
[9] D. J. Bernstein and T. Lange, "Faster addition and doubling on elliptic curves," in Proceedings of the Advances in Cryptology 13th international conference on Theory and application of cryptology and information security, ser. ASIACRYPT '07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 29–50.
[10] D. J. Bernstein and T. Lange, "Inverted Edwards coordinates," in Proceedings of the 17th international conference on Applied algebra, algebraic algorithms and error-correcting codes, ser. AAECC '07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 20–27.
[11] D. J. Bernstein, P. Birkner, M. Joye, T. Lange, and C. Peters, "Twisted Edwards curves," in Proceedings of the Cryptology in Africa 1st international conference on Progress in cryptology, ser. AFRICACRYPT '08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 389–405.
[12] D. J. Bernstein, P. Birkner, and T. Lange, "Starfish on strike," in Proceedings of the 1st international conf. on Progress in cryptology: cryptology and information security in Latin America, ser. LATINCRYPT '10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 61–80.
[13] R. Cosset, "Factorization with genus 2 curves," Mathematics of Computation, vol. 79, pp. 1191–1208, 2010.
[14] K. Gaj, S. Kwon, P. Baier, P. Kohlbrenner, H. Le, M. Khaleeluddin, R. Bachimanchi, and M. Rogawski, "Area-time efficient implementation of the elliptic curve method of factoring in reconfigurable hardware for application in the number field sieve," IEEE Trans. Comput., vol. 59, pp. 1264–1280, September 2010.