Faster polynomial multiplication over nite elds

Size: px

Start display at page:

Download "Faster polynomial multiplication over nite elds"

Augustine Lawrence
6 years ago
Views:

1 Faster polynomial multiplication over nite elds David Harvey School of Mathematics and Statistics University of New South Wales Sydney NSW 2052 Australia Joris van der Hoeven a, Grégoire Lecerf b Laboratoire d'informatique UMR 76 CNRS École polytechnique 928 Palaiseau Cedex France a. vdhoeven@lix.polytechnique.fr b. lecerf@lix.polytechnique.fr July, 204 Let p be a prime, and let M p (n) denote the bit complexity of multiplying two polynomials in F p [X] of degree less than n. For n large compared to p, we establish the bound M p (n) = O(n log n 8 log n log p), where log is the iterated logarithm. This is the rst known Fürertype complexity bound for F p [X], and improves on the previously best known bound M p (n) = O(n log n log log n log p). Keywords: Polynomial multiplication, nite eld, algorithm, complexity bound, FFT A.C.M. subject classification: G..0 Computer-arithmetic, F.2. Computations in nite elds A.M.S. subject classification: 68W30, 68Q7, 68W40. Introduction Given a ring R, a basic problem in complexity theory is to nd an upper bound for the cost of multiplying two polynomials in R[X] of degree less than n. Several complexity models may be considered. In algebraic complexity models, such as the straight-line program model [9, Chapter 4], we count the number of ring operations in R, denoted M R (n). In this model, the best currently known bound M R (n) = O(n log n log log n) was obtained by Cantor and Kaltofen [0]. More precisely, their algorithm performs O(n log n log log n) additions and subtractions and O(n log n) multiplications in R. Their bound may be viewed as an algebraic analogue of the Schönhage Strassen result for integer multiplication [32] and generalises previous work by Schönhage [3]. These algorithms all rely on suitable incarnations of the fast Fourier transform (FFT) [2]. For details we refer the reader to [7, Chapter 8]. In this paper we are mainly interested in the case that R is the nite eld F p = Z/ p Z for some prime p. The standard bit complexity model based on deterministic multitape Turing machines [26] is more realistic in this setting, as it takes into account the dependence on p. We write M p (n) for the bit complexity of multiplying two polynomials in F p [X] of degree less than n. Two basic approaches are known for obtaining good asymptotic bounds for M p (n). If n is not too large compared to p, then one may use Kronecker substitution, which converts the problem to integer multiplication by packing the coecients into large integers. Let I(n) denote the bit complexity of n-bit integer multiplication. Throughout the paper, we make the customary assumption that I(n)/n is increasing, and for convenience we dene I(x) := I(dxe) for x 2 R, x > 0. According to our recent sharpening [8] of Fürer's bound [5], we may take where log x denotes the iterated logarithm for x 2 R, i.e., I(n) = O(n log n 8 log n ); (.) log x := min fk 2 N: log k x 6 g; (.2) log k := log k log:

2 2 Faster polynomial multiplication over finite fields If log n = O(log p), then Kronecker substitution yields M p (n) = O(I(n log p)) (see section 2.6), so we have M p (n) = O n log p log (n log p) 8 log (n log p) : However, Kronecker substitution leads to inferior bounds when n is large compared to p, due to coecient growth in the lifted product in Z[X]. In this situation the best existing method is the algebraic SchönhageStrassen algorithm (i.e., the CantorKaltofen algorithm). In the Turing model this yields the bound M p (n) = O(n log n log log n log p + n log n I(log p)): The rst term dominates for large enough n, say for loglog p 8 log p =O(loglog n), and we get simply M p (n) = O(n log n log log n log p): (.3) Since Fürer's breakthrough result, there has been a gap between what is known for integer multiplication and for multiplication in F p [X]. Namely, it has remained an open question whether the factor log log n appearing in (.3) can be improved to K log n for some K >. It seems that Fürer's construction does not work in F p [X]... Our contributions The main result of this paper is a new algorithm that completely closes the gap mentioned above. Theorem.. The bound holds uniformly for all n > 2 and all primes p. M p (n) = O n log p log(n log p) 8 log (nlog p) The bound follows immediately via Kronecker substitution if log n = O(log p). The bulk of the paper concerns the reverse case where n is large compared to p. For example, if log log p = O(log n), the bound becomes simply M p (n) = O(n log n 8 log n log p): As promised, this replaces the factor log log n in (.3) by 8 log n. The basic idea of the new algorithm is as follows. We rst construct a special extension F p of F p, where is exponentially smaller than n, and for which p has many small divisors. These divisors are themselves exponentially smaller than n, yet they are so numerous that their product is comparable to n. We now convert the given multiplication problem in F p [X] to multiplication in F p [Y ], by cutting up the polynomials into small pieces, and we then multiply in F p [Y ] by using FFTs over F p. Applying the CooleyTukey method to the small divisors of p, we decompose the FFTs into transforms of exponentially shorter lengths (not necessarily powers of two). We use Bluestein's method to convert each short transform to a convolution over F p, and then use Kronecker substitution (from bivariate to univariate polynomials) to convert this to multiplication in F p [X]. These latter multiplications are then handled recursively. We continue the recursion until n is comparable to p, at which point we switch to using Kronecker substitution (from polynomials to integers). In many respects this approach is very similar to the algorithm for integer multiplication introduced in [8], but there are various additional technical issues to address. We recommend [8] as a gentle introduction to the main ideas. We also prove the following conditional result. Theorem.2. Assume Conjectures 8., 8.3 and 8.5. Then the bound M p (n) = O n log p log(n log p) 4 log (nlog p) holds uniformly for all n > 2 and all primes p.

3 David Harvey, Joris van der Hoeven, Grégoire Lecerf 3 Of these conjectures, Conjecture 8.3 seems to be in reach for specialists in analytic number theory; Conjecture 8.5 is at least as hard as Artin's conjecture on primitive roots, but may be tractable under the generalised Riemann hypothesis (GRH); and Conjecture 8. is almost as strong as the LenstraPomeranceWagsta conjecture on the distribution of Mersenne primes, and seems to be inaccessible at present..2. Outline of the paper The paper is structured as follows. In section 2, we start with a survey of relevant basic techniques: discrete Fourier transforms (DFTs) and FFTs, Bluestein's chirp transform, and Kronecker substitution. Most of this material is repeated from [8, Section 2] for the convenience of the reader; however, sections 2.5 and 2.6 dier substantially. In section 3, we recall basic complexity results for arithmetic in nite elds. In particular, we consider the construction of irreducible polynomials in F p [X], algorithms for nding roots of unity in F p, and the cost of arithmetic in F p and F p [Y ]. In section 4, we show how to construct special extensions of F p whose multiplicative group has a large subgroup of highly smooth order, i.e., is divisible by many small integers. The main tool is [2, Theorem 3]. Section 5 gives complexity bounds for functions that satisfy recurrence inequalities involving postcompositions with logarithmically slow functions. The prototype of such an inequality is T (n) 6 K T (log n) + L, where K and L are constants. The denitions and theorems are duplicated from [8, Section 5]; for the proofs, see [8]. To minimise the constant K in the bound M p (n) = O(n log n K log n log p) (for n large relative to p), we need one more tool: in section 6, we present a polynomial analogue of the CrandallFagin convolution algorithm [3]. This allows us to convert a cyclic convolution over F p of length n, where n is arbitrary, into a cyclic convolution over F p of somewhat smaller length N, where N is prescribed and where 2 n/n. (We can still obtain K = 6 without using this CrandallFagin analogue, but we do not know how to reach K = 8 without it.) Section 7 is devoted to the proof of Theorem.. Section 8 gives the proof of Theorem.2, and discusses the three number-theoretic hypotheses of that theorem. The last section oers some nal notes and suggested directions for generalisation. We rst quickly dispense with the bit complexity of multiplication in F q [X] for prime powers q, and in (Z/m Z)[X] for arbitrary integers m >. We then sketch some algebraic complexity bounds for polynomial multiplication over F p -algebras and (Z / m Z)-algebras, especially in the straightline program model. Our techniques also give rise to new strategies for polynomial evaluationinterpolation over F q. This may for instance be applied to the ecient multiplication of polynomial matrices over F q. Although we have not implemented any of our algorithms yet, we conclude by a few remarks on their expected eciency in practice. Notations. We use Hardy's notations f g for f = o(g), and f g for f = O(g) and g = O(f). The soft-oh notation f(n) = O ~ (g(n)) means that f(n) = g(n) (log(3 + g(n))) O() (see [7, Chapter 25, Section 7] for details). The symbol R > denotes the set of non-negative real numbers, and N denotes f0; ; 2; :::g. For a ring R and n 2 N, we write R[X] n := fp 2 R[X]: deg P < ng. We will write lg n := dlog n/log 2e. In expressions like log log p or lg lg p, we tacitly assume that the function is adjusted so as to take positive values for small primes such as p = 2. Acknowledgments. We would like to thank Karim Belabas for directing us to the paper [2], and Igor Shparlinski for answering some analytic number theory questions. 2. Survey of classical tools This section recalls basic facts on Fourier transforms and related techniques used in subsequent sections. For more details and historical references we refer the reader to standard books on the subject such as [3, 9, 7, 30]. 2.. Arrays and sorting In the Turing model, we have available a xed number of linear tapes. An n n d array M i;:::;i d of b-bit elements is stored as a linear array of n n d b bits. We generally assume that the elements are ordered lexicographically by (i ; :::; i d ), though this is just an implementation detail.

4 4 Faster polynomial multiplication over finite fields What is signicant from a complexity point of view is that occasionally we must switch representations, to access an array (say 2-dimensional) by rows or by columns. In the Turing model, we may transpose an n n 2 matrix of b-bit elements in time O(b n n 2 lg min (n ; n 2 )), using the algorithm of [7, Appendix]. Briey, the idea is to split the matrix into two halves along the short dimension, and transpose each half recursively. We will also require more complex rearrangements of data, for which we resort to sorting. Suppose that X is a totally ordered set, whose elements are represented by bit strings of length b, and suppose that we can compare elements of X in time O(b). Then an array of n elements of X may be sorted in time O(b n lg n) using merge sort [22], which can be implemented eciently on a Turing machine Discrete Fourier transforms Let R be a commutative ring with identity and let n >. An element! 2 R is said to be a principal n-th root of unity if! n = and n X (! i ) k = 0 (2.) k=0 for all i 2 f;:::; n g. In this case, we dene the discrete Fourier transform (or DFT) of an n-tuple a = (a 0 ; :::; a n ) 2 R n with respect to! to be DFT! (a) = a^= (a^0; :::; a^n ) 2 R n where a^i := a 0 + a! i + + a n! (n )i : That is, a^i is the evaluation of the polynomial A(X) := a 0 + a X + + a n X n at! i. If! is a principal n-th root of unity, then so is its inverse! =! n, and we have DFT! (DFT! (a)) = n a: Indeed, writing b := DFT! (DFT! (a)), the relation (2.) implies that n b i = X j=0 a^j! ji = X j=0 n n X k=0 where i;k = if i = k and i;k = 0 otherwise. n a k! j(k i) = X n X n a k! j(k i) = X a k (n i;k ) = n a i ; k=0 j=0 k=0 Remark 2.. In all of the new algorithms introduced in this paper, we actually work over a eld, whose characteristic does not divide n. In this setting, the concept of principal root of unity coincides with the more familiar primitive root of unity The CooleyTukey FFT Let! be a principal n-th root of unity and let n = n n 2 where < n < n. Then! n is a principal n 2 -th root of unity and! n2 is a principal n -th root of unity. Moreover, for any i 2 f0; :::; n g and i 2 2 f0; :::; n 2 g, we have a^i n 2 +i 2 = X k =0 X n n 2 k 2=0 n = X k =0 a k2 n +k! (k2n+k)(in2+i2) n 2! ki2 X k 2 =0 a k2n +k (! n ) k2i2!(! n2 ) ki : (2.2) If A and A 2 are algorithms for computing DFTs of length n and n 2, we may use (2.2) to construct an algorithm A A 2 for computing DFTs of length n as follows. For each k 2 f0; :::; n g, the sum inside the brackets corresponds to the i 2 -th coecient of a DFT of the n 2 -tuple (a 0n+k ; :::; a (n2 )n +k ) 2 R n2 with respect to! n. Evaluating these inner DFTs requires n calls to A 2. Next, we multiply by the twiddle factors! ki2, at a cost of n operations in R. (Actually, fewer than n multiplications are required, as some of the twiddle factors are equal to. This optimisation, while important in practice, has no asymptotic eect on the algorithms discussed in this paper.) Finally, for each i 2 2 f0; :::; n 2 g, the outer sum corresponds to the i -th coecient of a DFT of an n -tuple in R n with respect to! n2. These outer DFTs require n 2 calls to A.

5 David Harvey, Joris van der Hoeven, Grégoire Lecerf 5 Denoting by F R (n) the number of ring operations needed to compute a DFT of length n, and assuming that we have available a precomputed table of twiddle factors, we obtain F R (n n 2 ) 6 n F R (n 2 ) + n 2 F R (n ) + n: For a factorisation n = n n d, this yields recursively d F R (n) 6 X i= n n i F R (n i ) + (d ) n: (2.3) The corresponding algorithm is denoted A A d. The operation is neither commutative nor associative; the above expression will always be taken to mean (((A A 2 ) A 3 ) ) A d. The above discussion requires several modications in the Turing model. Assume that elements of R are represented by b bits. First, for A A 2, we must add a rearrangement cost of O(b n lg min (n ; n 2 )) to eciently access the rows and columns for the recursive subtransforms (see section 2.). For the general case A A d, the total rearrangement cost is bounded by O( P i b n lg n i) = O(b n lg n). Second, we will sometimes use non-algebraic algorithms to compute the subtransforms, so it may not make sense to express their cost in terms of F R. The relation (2.3) therefore becomes d F(n) 6 X i= n n i F(n i ) + (d ) n m R + O(b n lg n); (2.4) where F(n) is the (Turing) cost of a transform of length n over R, and where m R is the cost of a single multiplication in R. Finally, we point out that A A 2 requires access to a table of twiddle factors! ii2, ordered lexicographically by (i ; i 2 ), for 0 6 i < n, 0 6 i 2 < n 2. Assuming that we are given as input a precomputed table of the form ;!; :::;! n, we must show how to extract the required twiddle factor table in the correct order. We rst construct a list of triples (i ; i 2 ; i i 2 ), ordered by (i ; i 2 ), in time O(n lg n); then sort by i i 2 in time O(n lg 2 n) (see section 2.); then merge with the given root table to obtain a table (i ; i 2 ;! ii2 ), ordered by i i 2, in time O(n (b + lg n)); and nally sort by (i ; i 2 ) in time O(n lg n (b + lg n)). The total cost of the extraction is thus O(n lg n (b + lg n)). The corresponding cost for A A d is determined as follows. Assuming that the table ;!; :::;! n is given as input, we rst extract the subtables of (n n i )-th roots of unity for i = d ; :::; 2 in time O((n n d + + n n 2 ) (b + lg n)) = O(n (b + lg n)). Extracting the twiddle factor table for the decomposition (n n i ) n i then costs O(n n i lg n (b + lg n)); the total over all i is again O(n lg n (b + lg n)). Remark 2.2. An alternative approach is to compute the twiddle factors directly in the correct order. When working over C, as in [8, Section 3], this requires a slight increase in the working precision. Similar comments apply to the root tables used in Bluestein's algorithm in section Fast Fourier multiplication Let! be a principal n-th root of unity in R and assume that n is invertible in R. Consider two polynomials A = a a n X n and B = b b n X n in R[X]. Let C = c c n X n be the polynomial dened by c := n DFT! (DFT!(a) DFT! (b)); where the product of the DFTs is taken pointwise. By construction, we have c^ = a^ b^, which means that C(! i ) = A(! i ) B(! i ) for all i 2 f0; :::; n g. The product S = s s n X n of A and B modulo X n also satises S(! i ) = A(! i ) B(! i ) for all i. Consequently, s^ = a^ b^, s = DFT! (s^)/n = c, whence C = S. For polynomials A; B 2 R[X] with deg A < n and deg B < n, we thus obtain an algorithm for the computation of A B modulo X n using at most 3 F R (n) + O(n) operations in R. Modular products of this type are also called cyclic convolutions. If deg (A B) < n, then we may recover the product A B from its reduction modulo X n. This multiplication method is called FFT multiplication.

6 6 Faster polynomial multiplication over finite fields If one of the arguments (say B) is xed and we want to compute many products A B (or cyclic convolutions) for dierent A, then we may precompute DFT! (b), after which each new product A B can be computed using only 2 F R (n) + O(n) operations in R Bluestein's chirp transform We have shown above how to multiply polynomials using DFTs. Inversely, it is possible to reduce the computation of DFTs of arbitrary length, not necessarily a power of two to polynomial multiplication [5], as follows. Let! be a principal n-th root of unity and consider an n-tuple a 2 R n. First consider the case that n is odd. Let f i :=! (i2 i)/2 ; f i 0 :=! (i2 +i)/2 ; g i :=! ( i2 i)/2 : Note that (i 2 i)/2 and (i 2 + i)/2 are integers, and that g i+n =! ( (i+n)2 (i+n))/2 =! ( i2 i)/2! n( i (n+)/2) = g i : Let F := f 0 a f n a n X n, G:= g g n X n and C := c 0 + +c n X n F G modulo X n. Then n f i 0 c i = X j=0 n f i 0 (f j a j ) g i j = X j=0 n! (i2 +i)/2! (j2 j)/2! ( (i j)2 (i j))/2 a j = X j=0! ij a j = a^i: In other words, the computation of a DFT of odd length n reduces to a cyclic convolution product of the same length n, together with O(n) additional operations in R. Notice that the polynomial G is xed and independent of a in this product. Now suppose that n is even. In this case we require that 2 be invertible in R, and that! n/2 =. Let := ( ) n/2, and put Then and f i :=! i2 ; f i 0 :=! i2 +i ; g i :=! i2 +! i2 i : g n i+ =! i2! i2 i ; g i+n = g i ; 2 g i + g n 2 i+ =! i2 ; 2 g i g n 2 i+ =! i2 i : 2 As above, let F := f 0 a f n a n X n, G := g g n X n and C := c c n X n F G modulo X n. Then for i 2 0; :::; n we have 2 and similarly 2 f i 0 2 f i c i + c n i+ = n 2 2 f X i c i c i+ n 2 j=0 n = X j=0 f j a j g i j + g i j+ n 2 n! i2! j2! (i j)2 a j = X j=0! 2ij a j = a^2i ; n = X n! i2 +i! j2! (i j)2 (i j) a j = X! (2i+) j a j = a^2i+ : j=0 j=0 Again, the DFT reduces to a convolution of length n, together with O(n) additional operations in R. The only complication in the Turing model is the cost of extracting the f i, f i 0 and g i in the correct order. For example, consider the f i in the case that n is odd. Given as input a precomputed table ;!;! 2 ; :::;! n, we may extract the f i in time O(n lg n (b + lg n)) by applying the strategy from section 2.3 to the pairs (i; 2 (i2 i) mod n) for 0 6 i < n. The other sequences are handled similarly. For the g i in the case that n is even, we also need to perform O(n) additions in R; assuming that additions in R have cost O(b), this is already covered by the above bound.

7 David Harvey, Joris van der Hoeven, Grégoire Lecerf 7 Remark 2.3. Bluestein's original formulation used the weights f i :=! i2 /2, g i :=! i2 /2. This has two drawbacks in our setting. First, it requires the presence of 2 n-th roots of unity, which may not exist in R. Second, if n is odd, it leads to negacyclic convolutions, rather than cyclic convolutions. The variants used here avoid both of these problems. Remark 2.4. It is also possible to give variants of the new multiplication algorithms in which Bluestein's transform is replaced by a dierent method for converting DFTs to convolutions, such as Rader's algorithm [29] Kronecker substitution Multiplication in F p [X] may be reduced to multiplication in Z using the classical technique of Kronecker substitution [7, Corollary 8.27]. More precisely, let n > 0 and suppose that we are given two polynomials A; B 2 F p [X] of degree less than n. Let A ~ ; B ~ 2 Z[X] be lifts of these polynomials, with coecients A i and B i satisfying 0 6 A i ; B i < p. Then for the product C ~ = A ~ B ~ we have 0 6 C i < n p 2, and the coecients of C ~ may be read o the integer product C ~ (2 N ) = A ~ (2 N ) B ~ (2 N ) where N := 2 lg p + lg n. We deduce the coecients of C := A B by dividing each C i by p. This shows that M p (n) = O(I(n (lg p + lg n)) + n I(lg p)). Using the assumption that I(n)/n is increasing, we obtain M p (n) = O(I(n (lg p + lg n))): If we also assume that lg n = O(lg p), i.e., that the degree is not too large, then this becomes simply M p (n) = O(I(n lg p)): (2.5) A second type of Kronecker substitution reduces bivariate polynomial multiplication to the univariate case. Indeed, let n > and >, and suppose that A; B 2 F p [X ; Z], where deg X A; deg X B < n and deg Z A; deg Z B <. We may then recover C := A B from the univariate product C(Y 2 ; Y ) = A(Y 2 ; Y ) B(Y 2 ; Y ) in F p [Y ]. Note that A(Y 2 ; Y ) and B(Y 2 ; Y ) have degree less than 2 n, so the cost of the bivariate product is M p (2 n ) + O(n lg p). The same method works for computing cyclic convolutions: to multiply A; B 2 F p [X ; Z] / (X n ), the same substitution leads to a product in F p [Y ] / (Y 2n ). The cost is thus M p 0 (2 n ) + O(n lg p), where M p 0 (d) denotes the cost of a multiplication in F p [X]/(X d ). 3. Arithmetic in finite fields Let p be a prime and let >. In this section we review basic results concerning arithmetic in F p and F p [Y ]. We assume throughout that F p is represented as F p [Z]/P for some irreducible monic polynomial P 2F p [Z] of degree. Thus an element of F p is represented uniquely by its residue modulo P, i.e., by a polynomial F 2 F p [Z] of degree less than. Lemma 3.. Let p be a prime and let >. We may compute a monic irreducible polynomial P 2 F p [Z] of degree in time O ~ ( 4 p /2 ). Proof. See [33, Theorem 3.2]. The above complexity bound is very pessimistic in practice. Better complexity bounds are known if we allow randomised algorithms, or unproved hypotheses. For instance, assuming GRH, the bound reduces to ( log p) O() []. We now consider the cost of arithmetic in F p, assuming that P is given. Addition and subtraction in F p have cost O( lg p). For multiplication we will always use the SchönhageStrassen algorithm. Denote by S p () the cost of multiplying polynomials in F p [Z] by this method, i.e., using the polynomial variant [3] for the polynomials themselves, followed by the integer version [32] to handle the coecient multiplications. Then we have S p () = O( lg lg lg lg p lg lg p lg lg lg p):

8 8 Faster polynomial multiplication over finite fields Of course, we could use the new multiplication algorithm recursively for these products, but it turns out that SchönhageStrassen is fast enough, and leads to a simpler complexity analysis in section 7. Let D p () denote the cost of dividing a polynomial in F p [Z] 2 by P, returning the quotient and remainder. Using Newton's method [7, Chapter 9] we have D p () = O(S p ()). Thus elements of F p may be multiplied in time O(S p ()). Let N > be a divisor of p. To compute a DFT over F p of length N, we must rst have access to a primitive N-th root of unity in F p. In general it is very dicult to nd a primitive root of F p, as it requires knowledge of the factorisation of p. However, to nd a primitive N-th root, Lemma 3.3 below shows that it is enough to know the factorisation of N. The construction relies on the following existence result. Lemma 3.2. Let ` be a positive integer such that p` > c r 4 (log r + ) 4 2, where r is the number of distinct prime divisors of p, and where c > 0 is a certain absolute constant. Then there exists a monic irreducible polynomial 2 F p [Z] of degree ` such that modulo P is a primitive root of unity for F p = F p [Z]/P. Proof. This is [34, Theorem.]. Lemma 3.3. Assume that N divides p and that the factorisation of N is given. Then we may compute a primitive N-th root of unity in F p [Z]/P in time O ~ ( 9 p). Proof. Testing whether a given 2 F p is a primitive N-th root of unity reduces to checking that N = and N /s =/ for every prime divisor s of N. According to Lemma 3.2, it suces to apply this test to := ( mod P ) (p )/N for 2 F p [Z] running over all monic polynomials of degree ` := dlog(c r 4 (log r + ) 4 2 )/log pe, where r = O( lg p) is as in the lemma. The number of candidates is at most p` 6 c r 4 (log r + ) 4 2 p = O ~ ( 6 p), and each candidate can be tested in time O ~ (r lg N ( lg p)) = O ~ ( 3 lg 3 p) using binary exponentiation. Of course, we can do much better if randomised algorithms are allowed, since (p )/N is reasonably likely to be a primitive N-th root for randomly selected. Finally, we consider polynomial multiplication over F p. Let n >, and let A; B 2 F p [X] n. Let A ~ ; B ~ 2 F p [X ; Z] be their natural lifts, i.e., of degree less than with respect to Z. The bivariate product C ~ := A ~ B ~ may be computed using Kronecker substitution (section 2.6) in time M p (2 n )+O(n lg p). Writing C ~ = P2n 2 i=0 C ~ i(z) X i, to recover A B we must divide each C ~ i by P. Denoting by M p (n) the complexity of the original multiplication problem, we obtain M p (n) = M p (2 n ) + O(n D p ()) = M p (2 n ) + O(n S p ()): As in section 2.6, the same method may be used for cyclic convolutions. If M 0 p (n) denotes the cost of multiplication in F p [X]/(X n ), then we get M 0 p (n) = M 0 p (2 n ) + O(n S p ()): Remark 3.4. In the other direction, we can also reduce multiplication in F p [X] to multiplication in F p [Y ] by splitting the inputs into chunks of size b / 2c. This leads to the bound M p (n) 6 M p (dn/b/2ce)+o(n lg p) for n>. A variant of this procedure is developed in detail in section Preparing for DFTs of large smooth orders The aim of this section is to prove the following theorem, which allows us to construct small extensions F p of F p containing many roots of unity of low order. Recall that a positive integer is said to be y-smooth if all of its prime divisors are less than or equal to y. Theorem 4.. There exist computable absolute constants c 3 > c 2 > 0 and n 0 2 N with the following properties. Let p be a prime and let n > n 0. Then there exists an integer in the interval (lg n) c2lg lg lg n < < (lg n) c3lg lg lg n ;

9 David Harvey, Joris van der Hoeven, Grégoire Lecerf 9 and a ( + )-smooth integer M > n, such that M j p. Moreover, given p and n, we may compute and the prime factorisation of M in time O((lg n) lg lg n ). The proof depends on a series of preparatory lemmas. For > 2, dene Y H := q: q prime; q j For example, H 36 = 9990 = and H 37 = 2. Note that H is always squarefree and ( + )-smooth. We are interested in nding for which H is unusually large; that is, for which has many divisors d such that d + happens to be prime. Most of the heavy lifting is done by the following remarkable result. Lemma 4.2. Dene 0 (k) := min p 2 N: H > k. There exist computable absolute constants c 5 > c 4 > 0 such that for any integer k > 00 we have (log k) c4log log log k < 0 (k) < (log k) c5log log log k : p Proof. This is part of [2, Theorem 3]. (The threshold k could of course be replaced by any xed power of k. It is stated this way in [2] because that paper is concerned with primality testing.) The link between H and F p is as follows. Lemma 4.3. Let p be a prime and let > 2. Then there exists a ( + )-smooth integer M > H /( + ) such that M j p. Proof. We take M := H / p if p divides H, and otherwise M := H. In the former case we must have p 6 +, so in both cases M > H /( + ). To see that M j p, consider any prime divisor q =/ p of H. Then q j, so q j p by Fermat's little theorem. Remark 4.4. The integer M constructed in Lemma 4.3 only takes into account the structure of H, and ignores p itself. In practice, p will often have small factors other than those in H, possibly including repeated factors (which are never detected by H ). For example, H 6 =237, but 9 6 = In an implementation, one should always incorporate these highly valuable accidental factors into M. We will ignore them in our theoretical discussion. Next we give a simple sieving algorithm that tabulates approximations of log H for all up to a prescribed bound. Lemma 4.5. Let m > 2. In time O(m 2 ), we may compute integers `; :::; `m with j` log H j 6 ; ( 6 6 m): (4.) Proof. Initialise a table of ` with ` := 0 for = ; :::; m. Since log H 6 P q6+ log q = O(), it suces to set aside O(lg m) bits for each `. For each integer q = 2; :::; m, perform the following steps. First test whether q is prime, discarding it if not. Using trial division, the cost is q /2+o(), so m 3/2+o() = O(m 2 ) overall. Now assume that q is found to be prime. Using a fast algorithm for computing logarithms [8], compute an integer L q such that jl q 2 r log qj 6, where r := + lg m, in time O (lg m) +o(). Scan through the table, replacing ` by ` + L q for those divisible by q, i.e., every (q )-th entry, in time O(m lg m). The total cost for each prime q is O(m lg m), and there are O(m/lg m) primes, so the overall cost is O(m 2 ). At the end, we divide each ` by 2 r, yielding the required approximations satisfying (4.). Proof of Theorem 4.. We are given p and n as input. Applying Lemma 4.2 with k := n 3, we nd that for large enough n there exists some with (log (n 3 )) c4log log log (n3) < < (log (n 3 )) c5log log log (n3 )

10 0 Faster polynomial multiplication over finite fields and such that H > n 3/2. Choose any positive c 2 < c 4 log 2 and any c 3 > c 5 log 2. Since we have log log(n 3 ) log log log(n 3 ) (log 2) log lg n lg lg lg n; (lg n) c2lg lg lg n < (log(n 3 )) c4log log log(n3) < < (log(n 3 )) c5log log log(n3) c3lg lg lg n < (lg n) for large n. Using Lemma 4.5, we may nd one such in time O(m 2 ), where m := (lg n) c3lg lg lg n. Let M be as in Lemma 4.3. Then M divides p, and M > H + > n3/2 m + > n for large n. We may compute the prime factorisation of M by simply enumerating the primes q 6 +, q =/ p, and checking whether q divides for each q. This can be done in time O( 2 ) = O(m 2 ). The main multiplication algorithm depends on a reduction of a long DFT to many short DFTs. It is essential to have some control over the long and short transform lengths. The following result packages together the prime divisors of the abovementioned M, to obtain a long transform length N near a given target L, and short transform lengths N i near a given target S. Theorem 4.6. Let p, n,, M be as in Theorem 4.. Let L and S be positive integers such that < S < L < M. Then there exist ( + )-smooth integers N ; :::; N d with the following properties: a) N := N N d divides M (and hence divides p ). b) L 6 N 6 ( + ) L. c) S 6 N i 6 S 3 for all i. Given, S, L, and the prime factorisation of M, we may compute such N ; :::; N d (and their factorisations) in time O ~ ( 3 ). Proof. Let M = N N s be the prime decomposition of M. Taking d minimal with N N d > L, we ensure that (a) and (b) are satised. At this stage we have N i S for all i. As long as the tuple (N ; :::; N d ) contains an entry N i with N i < S, we pick the two smallest entries N i and N j and replace them by a single entry N i N j. Clearly, this does not alter the product N of all entries, so (a) and (b) continue to hold. Furthermore, as long as there exist two entries N i and N j with N i < S and N j < S, new entries N i N j will always be smaller than S 2. Only at the very last step of the loop, the second smallest entry N j might be larger than S. In that case, the product of the entry N i with N i < S with N j is still bounded by S S 2 = S 3. This shows that condition (c) is satised at the end of the loop. Determining d requires at most s = O() multiplications of integers less than M. There are at most s iterations of the replacement loop. Each iteration must scan through at most s integers of bit size O(log M) = O() and perform one multiplication on such integers. 5. Logarithmically slow recurrence inequalities Let : (x 0 ;)! R be a smooth increasing function, for some x 0 2 R. We say that : (x 0 ; )! R > is an iterator of if is increasing and if (x) = ((x)) + (5.) for all suciently large x. For instance, the standard iterated logarithm log dened in (.2) is an iterator of log. An analogous iterator may be dened for any smooth increasing function : (x 0 ; )! R for which there exists some > x 0 such that (x) 6 x for all x >. Indeed, in that case, (x) := min fk 2 N: k (x) 6 g

11 David Harvey, Joris van der Hoeven, Grégoire Lecerf is well-dened and satises (5.) for all x >. It will sometimes be convenient to increase x 0 so that (x) 6 x is satised on the whole domain of. We say that is logarithmically slow if there exists an ` 2 N such that (log ` exp `)(x) = log x + O() (5.2) for x!. For example, the functions log(2 x), 2 log x, (logx) 2 and (log x) log log x are logarithmically slow, with ` = 0; ; 2; 3 respectively. Lemma 5.. Let : (x 0 ; )! R be a logarithmically slow function. Then there exists > x 0 such that (x) 6 x for all x >. Consequently all logarithmically slow functions admit iterators. In this paper, the main role played by logarithmically slow functions is to measure size reduction in multiplication algorithms. In other words, multiplication of objects of size n will be reduced to multiplication of objects of size n 0, where n 0 6 (n) for some logarithmically slow function (x). The following result asserts that, from the point of view of iterators, such functions are more or less interchangeable with log x. Lemma 5.2. For any iterator of a logarithmically slow function, we have (x) = log x + O(): The next result is our main tool for converting recurrence inequalities into actual asymptotic bounds for solutions. Proposition 5.3. Let K >, B > 0 and ` 2 N. Let x 0 > exp `(), and let : (x 0 ; )! R be a logarithmically slow function such that (x) 6 x for all x > x 0. Then there exists a positive constant C (depending on x 0,, K, B and `) with the following property. Let > x 0 and L > 0. Let S R, and let T : S! R > be any function satisfying the following recurrence. First, T (y) 6 L for all y 2 S, y 6. Second, for all y 2 S, y >, there exist y ; :::; y d 2 S with y i 6 (y), and weights ; :::; d > 0 with P i i =, such that T (y) 6 K + B log ` y X d i T (y i ) + L: i= Then we have T (y) 6 C L K log y log for all y 2 S, y >. 6. The CrandallFagin algorithm Consider the problem of computing the product of two n-bit integers modulo m:= 2 n. The naive approach is to compute their ordinary 2 n-bit product and then reduce modulo m. The reduction cost is negligible because of the special form of m. If n is divisible by a high power of two, one can save a factor of two by using the fact that FFTs naturally compute cyclic convolutions. An ingenious algorithm of Crandall and Fagin [3] allows for the gain of this precious factor of two for arbitrary n. Their algorithm is routinely used in the extreme case where n is prime, in the large-scale GIMPS search for Mersenne primes (see A variant of the CrandallFagin algorithm was a key ingredient of our algorithm for integer multiplication that conjecturally achieves the bound I(n) = O(n log n 4 log n ) [8, Section 9]. In this section we present yet another variant, for computing products in F p [X]/(X n ). 6.. The CrandallFagin reduction for polynomials Let n, N and be positive integers with 6 N 6 n and > 2 dn / N e. Our aim is to reduce multiplication in F p [X] / (X n ) to multiplication in F p [Y ] / (Y N ). In the applications in subsequent sections, N will be a divisor of p with many small factors, so that we can multiply eciently in F p [Y ]/(Y N ) using FFTs over F p.

12 2 Faster polynomial multiplication over finite fields Suppose that F p is represented as F p [Z]/P, for some irreducible P 2 F p [Z] of degree. The reduction relies on the existence of an element 2 F p such that N = Z. (This is the analogue of the real N-th root of 2 appearing in the usual CrandallFagin algorithm, which was originally formulated over C.) It is easy to see that such may not always exist; for example, there is no cube root of Z in F 6 = F 2 [Z]/(Z 4 + Z + ). Nevertheless, the next result shows that if is large enough, then we can always nd some modulus P for which a suitable exists. Proposition 6.. Suppose that p /2 > N. Then we may compute an irreducible polynomial P 2 F p [Z] of degree, and an element 2 F p = F p [Z]/P such that N = Z, in time O ~ ( 9 p). Proof. We rst observe that if 2 F p is a primitive root, then N cannot lie in a proper subeld of F p. (This property is independent of P.) Indeed, if N 2 F p` for some proper divisor ` of, then every N-th power in F p lies in F p`. This contradicts the fact that the number of N-th powers in F p is at least + (p )/N > p /N > p /2 > p`. Now we give an algorithm for computing P and. We start by using Lemma 3. to compute an irreducible P ~ 2 F p [U] of degree in time O ~ 4 p /2, and we temporarily represent F p as F p [U]/P ~. Using Lemma 3.2, we may construct a list of O ~ ( 6 p) candidates for primitive roots of F p (see the proof of Lemma 3.3). For each candidate, we compute N and its powers ; N ; :::; ( )N. If these are linearly dependent over F p then N belongs to a proper subeld (and so, as shown above, is not actually a primitive root). Thus we must eventually encounter some for which they are linearly independent. We take P to be the minimal polynomial of this N, which may be computed by nding a linear relation among ; N ; :::; N. Let ': F p [Z]/P! F p [U]/P ~ be the isomorphism that sends Z to N. The matrix of ' with respect to the standard bases ; Z; :::; Z and ; U ; :::; U is given by the coecients of ; N ; :::; ( )N. The inverse of this matrix yields the matrix of ' : F p [U]/P ~! F p [Z]/P. We then set = ' (), so that N = ' ( N ) = Z. For each candidate, the cost of computing the necessary powers of is O ~ (( + lg N) lg p) = O ~ ( 2 lg 2 p), and the various linear algebra steps require time O ~ ( 3 lg p) using classical matrix arithmetic. In the remainder of this section, we x some P and as in Proposition 6., and assume that F p is represented as F p [Z]/P. Suppose that we wish to compute the product of u; v 2 F p [X]/(X n ). The presentation here closely follows that of [8, Section 9.2]. We decompose u and v as where N u = X i=0 N u i X ei ; v = X i=0 e i := dn i/n e; u i ; v i 2 F p [X] ei+ e i : Notice that e i+ e i takes only two possible values: bn/n c or dn/n e. For 0 6 i < N, let v i X ei ; (6.) c i := N e i n i; (6.2) so that 0 6 c i < N. For any 0 6 i ; i 2 < N, dene i;i 2 2 Z as follows. Choose 2 f0; g so that i := i + i 2 N lies in the interval 0 6 i < N, and put From (6.2), we have i ;i 2 := e i + e i2 e i n: c i + c i2 c i = N (e i + e i2 e i ) n (i + i 2 i) = N i;i 2 : Since the left hand side lies in the interval ( N ; 2 N), this shows that i ;i 2 2 f0; g. Now, since e i + e i2 = e i + i ;i 2 (mod n), we have u v = X i =0 N N X i 2 =0 N X u i v i2 X ei +ei 2 = w i X ei (mod X n ); i=0

13 David Harvey, Joris van der Hoeven, Grégoire Lecerf 3 where w i := X i +i 2=i (mod N) X i ;i 2 ui v i2 : Since u i 2 F p [X] dn/n e and similarly for v i2, we have w i 2 F p [X] 2dn/N e. Note that we may recover u v from w 0 ; :::; w N in time O(n lg p), by a standard overlap-add procedure (provided that N = O(n/lg n)). Now, regarding u i and v i as elements of F p by sending X to Z, dene polynomials U ; V 2 F p [Y ]/(Y N ) by U i := ci u i and V i := ci v i for 06i<N, and let W = W 0 + +W N Y N := U V be their (cyclic) product. Then X w~ i := ci W i = ci U i V i2 = X ci +ci 2 ci u i v i2 = X Z i ;i 2 ui v i2 i +i 2 =i (mod N) coincides with the reinterpretation of w i as an element of F p. Moreover, we may recover w i unambiguously from w~ i, as > 2 dn / N e and w i 2 F p [X] 2dn/N e. Altogether, this shows how to reduce multiplication in F p [X]/(X n ) to multiplication in F p [Y ]/(Y N ). Remark 6.2. The pair (e i+ ; c i+ ) can be computed from (e i ; c i ) in O(lg n) bit operations, so we may compute the sequences e 0 ; :::; e N and c 0 ; :::; c N in time O(N lg n). Moreover, since c i+ c i takes on only two possible values, we may compute the sequence c0 ; :::; cn using O(N) multiplications in F p. 7. The main algorithm Consider the problem of computing t > products u v; :::; u t v with u ; :::; u t ; v 2 F p [X]/(X n ), i.e., t products with one xed operand. Denote the cost of this operation by C p;t (n). Our algorithm for this problem will perform t + forward DFTs and t inverse DFTs, so it is convenient to introduce the normalisation C C p (n) := sup p;t (n) t> 2 t + : This is well-dened since clearly C p;t (n) 6 t C p; (n). Roughly speaking, C p (n) may be thought of as the notional cost of a single DFT. The problem of multiplying two polynomials in F p [X] of degree less than k may be reduced to the above problem by using zero-padding, i.e., taking n := 2 k and t :=. Since C p; (n) 6 3 C p (n), we obtain M p (k) 6 3 C(2 k) + O(k lg p). Thus it suces to obtain a good bound for C p (n). The next theorem gives the core of the new algorithm for the case that n is large relative to p. Theorem 7.. There exist x 0 > 2 and a logarithmically slow function : (x 0 ; )! R with the following property. For all integers n > x 0, there exist integers n ; :::; n d 6 (n), and weights ; :::; d > 0 with P i i =, such that X d C p (n) n lg p lg (n lg p) 6 uniformly for n > max (x 0 ; p 2 ). 8 + O lg lg n i= C p (n i ) i + O(); (7.) n i lg p lg(n i lg p) Proof. We wish to compute t > products u v; :::; u t v with u ; :::; u t ; v 2 F p [X]/(X n ), for some suciently large n. We assume throughout that p 2 < n. Choose parameters. Using Theorem 4., we obtain integers and M with (lg n) c2lg lg lg n < < (lg n) c3lg lg lg n ; and so that M >n, M j p, and M is ( +)-smooth. We choose long and short target transform lengths L := dn/ 3 e and S := (lg lg n)2. For large enough n we then have < S < L < n 6 M, so we may apply Theorem 4.6. This yields ( + )-smooth integers N ; :::; N d, with known factorisations, such that N := N N d divides p and lies in the range L 6 N 6 ( + ) L, and such that S 6 N i 6 S 3 for all i. Finally we set := d2 dn/n e/e, so that N also divides p, and we put n i := 2 N i. All of these parameters may be computed in time O((lg n) lg lg n ) = O(n).

14 4 Faster polynomial multiplication over finite fields For large n, we observe that the following estimates hold. First, we have n 3 6 N 6 2 n 2 : Since log lg lg n lg lg lg n, it follows that log 2 N = + O lg lg n lg n; and also that log lg lg n lg lg lg n. We have N = 2 n + O(N ) = 2 n + O(n/), so N = 2 + O n: lg n To estimate d, note that N > S d and log S (lg lg n) 3 lg lg lg n, so lg n d = O (lg lg n) 3 : lg lg lg n Since p 6 n we have lg (n lg p) = + O lg n; (7.2) lg lg n and as n i > N i > S > (lg n) lg lg n, we similarly obtain lg (n i lg p) = + O lg n i : (7.3) lg lg n CrandallFagin reduction. Let us check that the hypotheses of section 6. are satised, to enable the reduction to multiplication in F p [Y ]/(Y N ). We certainly have 6N 6 n and >2 dn/n e. For Proposition 6., observe that > 2 lg n lg N, so p /2 > 2 /2 > N for large n. Thus we obtain an irreducible P 2 F p [Z] of degree, and an element 2 F p = F p [Z]/P such that N = Z, in time O ~ ( 9 p) = O ~ ( 27 n /2 ) = O(n). Each multiplication in F p has cost O(S p ()) (see section 3). Computing the sequences e i and c i costs O(N lg n) = O(n lg n), and computing the sequence ci costs O(N S p ()). The initial splitting and nal overlap-add phases require time O(t n lg p), and the multiplications by ci and ci cost O(t N S p ()). Long transforms. The factorisation of N is known, and N divides p, so by Lemma 3.3 we may compute a primitive N-th root of unity! 2 F p in time O ~ ( 9 p) = O(n). We will multiply in F p [Y ]/(Y N ) by using DFTs with respect to!. The table of roots ;!; :::;! N may be computed in time O(N S p ()). In a moment, we will describe an algorithm A i for computing a short DFT of length N i with respect to! i :=! N /Ni ; we then use the algorithm A := A A d for the main transform of length N (see section 2.3). The corresponding twiddle factor tables may be extracted in time O(N lg N ( lg p + lg N)) = O(n lg n lg p). Let D denote the complexity of A, and for t 0 > let D i;t 0 denote the cost of performing t 0 independent DFTs of length N i using A i. Then by (2.4) we have d D 6 X i= The last term is simply O(n lg n lg p). D i;n /Ni + O(d N S p ()) + O(( lg p) N lg N): Bluestein conversion. We now begin constructing A i, assuming that t 0 > independent transforms are sought. We rst use Bluestein's algorithm (section 2.5) to convert each DFT of length N i to a multiplication in F p [X]/(X Ni ). We must check that 2 is invertible in F p if N i is even; indeed, if N i is even, then so is p, so p =/ 2. The Bluestein conversion contributes O(t 0 N i S p ()) to the cost of A i. We must also compute a suitable table of roots, once at the top level. We rst extract the table ;! i ; :::;! i N i from the top level table in time O(N lg p) = O(n lg p), and then sort them into the correct order (and perform any necessary additions) in time O(N i lg N i ( lg p + lg N)) = O(S 3 lg N i lg p) = O(lg N i (n lg p)). Over all i the cost is O(lg N (n lg p)) = O(n lg n lg p).

15 David Harvey, Joris van der Hoeven, Grégoire Lecerf 5 Kronecker substitution. We nally convert each multiplication in F p [X]/(X Ni ) to a multiplication in F p [X]/(X ni ) using Kronecker substitution (see section 2.6). The latter multiplications have cost C p;t 0(n i ), since one argument is xed. After the multiplications, we must also perform t 0 N i divisions by P to recover the results in F p [X]/(X Ni ), at cost O(t 0 N i S p ()). Consolidating the estimates for the Bluestein conversion and Kronecker substitution, we have D i;t 0 6 C p;t 0(n i ) + O(t 0 N i S p ()): 6 (2 t 0 + ) C p (n i ) + O(t 0 N i S p ()): For t 0 = N /N i > N /S 3 lg lg n, this becomes D i;n /Ni 6 (2 + O(/lg lg n)) (N /N i ) C p (n i ) + O(N S p ()): Conclusion. Summing over i yields X d D O lg lg n Since and this becomes i= lg n d N S p () = O (lg lg n) 3 lg lg lg n D 6 = O n lg n lg p = O(n lg n lg p) N N i C p (n i ) + O(d N S p ()) + O(n lg n lg p): (lg lg lg n)2 lg lg n N = 2 N n N i 2 N i = 4 + O ; lg n n i 8 + O lg lg n X d i= N lg lg lg lg p lg lg p lg lg lg p n n i C p (n i ) + O(n lg n lg p): Let i := log N i /log N, so that P i i =. Since log lg n i = log 2 N i + O(log ) = + O log 2 N i = + O log 2 N i log S lg lg n and lg n = ( + O(/lg lg n)) log 2 N, we get X d D O lg lg n i= i n lg n n i lg n i C p (n i ) + O(n lg n lg p): To compute the desired t products, we must execute t + forward transforms, and t inverse transforms. Each product also requires O(N) pointwise multiplications in F p and O(N) multiplications by /N. These have cost O(N S p ()), which is absorbed by the O(n lg n lg p) term. Thus we obtain C p;t (n) 6 (2 t + ) D + O(t n lg n lg p): Dividing by (2 t + ) n lg n lg p, taking suprema over t >, and using (7.2) and (7.3), yields the bound (7.). Finally, since n i = O(S 3 3 ) = O (lg n) 3c3lg lg lg n((lg lg n)2 +) = O (lg n) (lg lg n)3 ; we have log n i = O((log log n) 4 ) and hence log log log n i 6 log log log log n + C for some constant C and all large n. We may then take (x) := exp 3 (log 4 x + C). Now we may prove the main theorem announced in the introduction.

16 6 Faster polynomial multiplication over finite fields Proof of Theorem.. For n > 2 dene T p (n) := C p (n) n lg p lg(n lg p) : It suces to prove that T p (n) = O(8 log (nlg p) ), uniformly in n and p. Let x 0 and (x) be as in Theorem 7.. Increasing x 0 if necessary, we may assume that (x) 6 x for x > x 0, and that x 0 > exp (exp ()). Let p := max(x 0 ; p 2 ) for each p. We will consider the regions n6 p and n > p separately. First consider the case n 6 p. There are only nitely many primes p < (x 0 ) /2, so we may assume that p 2 >x 0 and that n6 p = p 2. In this region we use Kronecker substitution: by (2.5) and (.) we have and since M p (n) = O(I(n lg p)) = O n lg p lg (n lg p) 8 log (nlg p) ; C C p (n) = sup p;t (n) t> 2 t + 6 sup t C p; (n) t> 2 t + = O(C p;(n)) = O(M p (n)); we get T p (n) = O(8 log (n lg p) ) uniformly for n 6 p. In fact, this even shows that T p (n) = O(8 log p ) uniformly for n 6 p. Now consider the case n > p. Here we invoke Theorem 7. to obtain absolute constants B; L > 0 such that for every n > p, there exist n ; :::; n d 6 (n) and ; :::; d such that T p (n) B log log n X d i T p (n i ) + L: Set L p := max (L; max 26n6p T p (n)). Applying Proposition 5.3 with ` := 2, K := 8, S := f2; 3; :::g, we nd that T p (n) = O(L p 8 log n log p ) uniformly for n > p. Since log p = log p + O() and L p = O(8 log p ), we conclude that T p (n) = O(8 log n ) = O(8 log (nlg p) ) uniformly for n > p. 8. Conjecturally faster multiplication Recall that in [8] we established the bound I(n)=O(n lg n K log n ) with K =8 for the complexity of integer multiplication. We also proved that this can be improved to K = 4, assuming the following slight weakening of the LenstraPomeranceWagsta conjecture on the distribution of Mersenne primes [28, 35] (see [8, Section 9] for further discussion). Conjecture 8.. Let m (x) be the number of Mersenne primes less than x. Then there exist constants 0 < a < b such that for all x > 3, i= a log log x < m (x) < b log log x: The source of the (conjectural) speedup is as follows. The K = 8 algorithm of [8, Section 6] computes DFTs over C, and so we encounter the short product obstruction. Namely, to compute the product of two real numbers with p signicant bits using FFT algorithms, we are forced to compute a full product of two p-bit integers, and then truncate the result to p bits. To achieve K = 4, we replaced the coecient ring C by the nite eld F p [i], where p = 2 q is a Mersenne prime. Multiplication modulo 2 q is a cyclic product rather than a short product, and this saves a factor of two at each recursion level. The aim of this section is to outline a credible strategy for achieving the same improvement, from K = 8 to K = 4, in the context of multiplication in F p [X], at least under certain plausible number-theoretic hypotheses. Consider the problem of multiplying polynomials in F p [X] of degree less than n. In the Kronecker substitution region, i.e., for lg n = O(lg p), we can of course achieve K = 4 if we assume Conjecture 8.. Now consider the region where n is much larger than p. In the algorithm of section 7, we reduced the multiplication problem to computing products in F p [Y ]/(Y Ni ), where F p = F p [Z]/P for some monic irreducible P 2 F p [Z] of degree, and where N i is a short transform length. These multiplications were in turn converted to multiplications in F p [X] / (X 2Ni ) via Kronecker substitution. It is exactly this factor of two in zero-padding that we wish to avoid.

Even faster integer multiplication

Even faster integer multiplication David Harvey School of Mathematics and Statistics University of New South Wales Sydney NSW 2052 Australia Email: d.harvey@unsw.edu.au Joris van der Hoeven a, Grégoire