The Middle Product Algorithm I

Size: px

Start display at page:

Download "The Middle Product Algorithm I"

Laurel Kelley
5 years ago
Views:

1 AAECC 4, (004) Digital Object Identifier (DOI) 0.007/s The Middle Product Algorithm I Speeding up the Division and Square Root of Power Series Guillaume Hanrot, Michel Quercia, Paul Zimmermann INRIA-Lorraine/LORIA, 65, rue du jardin botanique, 5460 Villers-lès-Nancy Cedex, France ( {hanrot, zimmerma}@loria.fr) 3 rue de Montchapet, 000 Dijon, France ( michel.quercia@prepas.org) Received: July 0, 000; revised version: November 4, 003 Published online: February 3, 004 Springer-Verlag 004 Abstract. We present new algorithms for the inverse, division, and square root of power series. The key trick is a new algorithm MiddleProduct or, for short, MP computing the n middle coefficients of a (n ) n full product in the same number of multiplications as a full n n product. This improves previous work of Brent, Mulders, Karp and Markstein, Burnikel and Ziegler. These results apply both to series and polynomials. Keywords: Middle product, Inversion, Division, Square root, Newton s method. Introduction One of the major tools for performing arithmetic or computing special functions is Newton s iteration. Indeed its quadratic rate of convergence (so that the number of correct digits or terms doubles at each step) and its self-correcting character make it especially suitable for this kind of computations. In full generality, Newton s rule for computing a root of a differentiable function f(x)can be written as x k+ = x k f(x k )/f (x k ). As a rule of thumb, the term x k should be seen as the main term, whereas f(x k )/f (x k ) should be seen as the correcting term. In particular, not all the digits of f(x k )/f (x k ) are significant, since the low weight ones will be corrected in the next iteration. Furthermore, since f(x k ) is supposed to tend to zero quickly, one can expect important cancellations in its evaluation. The present paper shows how to evaluate only the meaningful part of f(x k ), i.e., the middle digits, in some situations where the evaluation of f is mainly based on a multiplication. As a consequence, new algorithms for inversion, division and square root are derived. Another application of the middle product is the efficient computation of some transposed multiplications [].

2 46 G. Hanrot et al. We will make use of the following notations: M(n) denotes the complexity of the underlying multiplication algorithm on two polynomials of degree n; K(n) denotes the complexity of Karatsuba s algorithm, given by K() =, K(n) = K( n/ ) + K( n/ ); FFT (n) denotes the complexity of the Schönhage-Strassen multiplication algorithm [0]. The paper is organized as follows: describes the basic trick in the two most usual nontrivial multiplication models, namely Karatsuba and FFT. and 3 show how this trick can be used to compute the square and the inverse of a power series; 4 and 5 study the problem of computing the quotient and square root of power series. In 6, an implementation of these algorithms is presented, which confirms the complexity results in practice. We conclude the paper by a table summarizing the previous and new worstcase complexities under the Karatsuba and FFT models, for the middle product, inverse, quotient, square and square root operations, following work of Brent [], Mulders [9], Burnikel and Ziegler []. Finally, an appendix gives a formal description of how to transform any multiplication algorithm of a certain type into a middle product algorithm with the same complexity; this is linked to the so-called transposition principle. Newton s Inverse Iteration and the Basic Trick Newton s iteration for computing /A where A is a number, a polynomial, or a series follows the recurrence: x k+ = x k + x k ( Ax k ). () Suppose we are looking for an approximation of /A to precision n. In the last step of Newton s iteration, x k is then accurate to precision n/, and we have to compute Ax k to precision n, where the n/ most significant coefficients or bits of Ax k vanish with. To get an approximation x k+ to precision n of /A, we multiply x k by the n/ most significant coefficients of Ax k. The cost of this last step was 3M(n/) so far: M(n/) for the product Ax k which splits into two products of n/ coefficients, and M(n/) for the product of x k by Ax k. The total cost of Newton s iteration to compute /A to precision n is It may seem more natural to express the complexities of operations on power series in term of multiplications on power series (short products) instead of multiplications on polynomials (full products). For example, if M (n) denotes the complexity of a short product, Brent gives in [, Table 7.] upper bounds of 3M (n), 4M (n), and 5.5M (n) for the inverse, quotient, and square root respectively. However all known subquadratic algorithms for short products are based on full products, and in the FFT case, it is not even known whether a short product can be computed faster than a full product! This explains our choice of the full product as basic operation. By most significant, we mean the low order terms in the case of power series, and the leftmost digits in the usual notation for floating-point numbers.

3 The Middle Product Algorithm I 47 thus 3FFT (n) for FFT multiplication and 3 K(n) for Karatsuba multiplication [6]. As we know in advance that the upper n/ bits of Ax k will vanish with, a natural question is the following: is there a faster way to compute directly the n/ required bits from position n/ton of the product Ax k? We give in this section a positive answer under the Karatsuba and FFT models, and show this result applies to many multiplication algorithms. Note that other papers already investigated the computation of only part (usually the most significant bits) of the product of power series or floating-point numbers. See e.g. [8], [9].. Karatsuba Model for n a Power of We first describe our trick and the corresponding algorithm in the simple case when n is a power of, and when the underlying multiplication algorithm is the Karatsuba method. Basic trick. For the sake of clarity, we suppose A is a Taylor series in the variable t (at t = 0). Assume we have A = a 0 + a t + a t and x = x 0 + x t. We have Ax = a 0 x 0 + (a 0 x + a x 0 )t + (a x + a x 0 )t + a x t 3 and we want to compute (a 0 x + a x 0 )t + (a x + a x 0 )t. The algorithm works as follows (see Fig. ): Algorithm MiddleProduct. Input: x = x 0 + x t,a = a 0 + a t + a t Output: h = a 0 x + a x 0 and l = a x + a x 0. α (a 0 + a )x. β a (x x 0 ) 3. γ (a + a )x 0 4. h α β 5. l γ + β. Now we can use this idea recursively (note that here and in the sequel we use MP as a shortcut for MiddleProduct). This yields the following algorithm, described in the case when n is a power of. See. for the general case. Algorithm MP([x 0,...,x n ], [a 0,...,a n ]). 0. If n =, return [a 0 x 0 ]. p n/.. α MP([x p,...,x p ], [a 0 + a p,...,a p + a 3p ]) 3. β MP([x p x 0,...,x p x p ], [a p,...,a 3p ])

4 48 G. Hanrot et al. A 0 A A A 3 X 0 u v X u v u + v = α = MP(X, [A 0,A ] + [A,A ]) v u = β = MP(X X 0, [A,A ]) u + v = γ = MP(X 0, [A,A ] + [A,A 3 ]) Fig.. MiddleProduct recursion. 4. γ MP([x 0,...,x p ], [a p + a p,...,a 3p + a 4p ]) 5. Return [α β,γ + β]. Theorem When n is a power of two, Algorithm MP returns [c n,...,c n ] where c k = i+j=k a ix j, using exactly K(n) = n log 3 ring multiplications.. Karatsuba Model: The Odd Case In this section, we generalize the trick described above to deal with the case when n is not a power of. If R is the coefficient ring of our power series, and y := [y 0,...,y n ] R n, z := [z 0,...,z n ] R n, we define MP(y, z) = [ i+j=k y iz j ] n k n. Theorem Karatsuba s algorithm can be adapted so as to yield an algorithm computing MP(y, z) in K(n) multiplications. Proof. Consider the following algorithm: Algorithm MP-Karatsuba([y 0,...,y n ], [z 0,...,z n ]). 0. If n =, return [y 0 z 0 ].. n 0 n/, n n/.. α MP-Karatsuba([y n0,...,y n ], [z 0 +z n,...,z n +z 3n ]) 3. If n is even 3. β MP-Karatsuba([y n y 0,...,y n y n0 ], [z n,...,z 3n ]) else 3. β MP-Karatsuba([y n0,y n y 0,...,y n y n0 ], [z n,...,z 3n ])

5 The Middle Product Algorithm I γ MP-Karatsuba([y 0,...,y n0 ], [z n +z n,...,z n +n 0 + z n ]) 5. Return [α 0 β 0,...,α n β n,γ 0 + β 0,...,γ n0 + β n0 ]. It is easily checked that the number MP (n) of multiplications on an input of size n satisfies MP() =, MP (n) = MP( n/ ) + MP( n/ ), hence MP (n) = K(n). A straightforward calculation or using the general formulation developed in the appendix then allows one to prove that this algorithm indeed computes MP(y, z)..3 The FFT Model Theorem 3 Schönhage-Strassen s FFT algorithm can be adapted so as to yield an algorithm computing MP(y,z)inFFT (n) multiplications. Proof. We give the algorithm and refer to the general framework described in the appendix for a detailed proof. As for the complexity, the algorithm amounts to three FFTs of size n, which is exactly the same as what is done in a single FFT multiplication. Algorithm MP-FFT([y 0,...,y n ], [z 0,...,z n ]). 0. ω primitive (n)-th root of unity. α FFT(ω, [y n,...,y 0, 0,...,0]) }{{} n. β FFT(ω, [z 0,...,z n, 0]) 3. γ FFT(ω, [α 0 β 0,...,α n β n ]) 4. Return [γ 0 /n,...,γ n /n]. FFT(a,v) computes [ n i=0 ami v i ] 0 m n, where a is a root of unity of order n. At step 3, γ = [c n,...,c n,c n,c n + c 0,...,c 3n + c n ], where c k = i+j=k y iz j..4 General Situation The two algorithms described in that section can be generalized: to any multiplication algorithm of a certain quite general kind, we can associate a middle product algorithm of the same complexity. (See [5, Def. 8] for the definition of a bilinear algorithm.) Theorem 4 From any bilinear (n, n) multiplication algorithm, one can derive an algorithm computing a (n, n ) middle product with exactly the same number of ring multiplications.

6 40 G. Hanrot et al. Since the proof of that theorem is quite technical, and not required in the rest of the paper, we postpone it in an appendix. In the following sections, we consider an algorithm MP(y, z) taking y with n terms, z with n terms, and returning their middle product, with the same complexity K(n) or FFT (n) depending on the computation model than multiplication. We then use this algorithm to build new algorithms for basic operations on power series or polynomials: square, inverse, quotient, square root. Sometimes to simplify the algorithms we allow z to have more terms; in such a case we consider only the first n terms from z, and we write MP(y, z, n) to avoid confusion. We also write mul(a, b) for the full product of two series or polynomials of n terms (giving n terms), mul high(a, b) for the (short) product of two series of n terms, giving the n most significant terms, of order 0 to n, and mul low(a, b) for the short product giving the n least significant terms, of order n to n. Squaring Power Series In this section, we show how to use the MP algorithm to compute the square of a power series, or equivalently the upper (or lower) half part from the square of a polynomial. Theorem 5 Using the MP algorithm, one can compute the square of a power series of order n in R(n) K(n)+ ring multiplications under the Karatsuba model. Proof. We use the following algorithm. Let A = n i=0 a it i = A 0 + t p A,we have A = A 0 + A 0A t p mod t n, plus an extra term ap t p when n is odd. At step3wehaveα = A mod t n p, and at step 4 (where the second argument of MP should read [a ] for n = 3) β gives the coefficients of degree n p to n ofa 0 + A 0A t p, thus α + t n p β, plus ap t p when n is odd, gives A mod t n. Algorithm MP-square([a 0,...,a n ]).. If n =, return [a0 ].. p n/ 3. α MP-square([a 0,...,a n p ]) 4. β MP([a 0,...,a p ], [a n p+,...,a p, a p,...,a n ]) 5. If n mod = then β p β p + ap 6. Return [α 0,...,α n p,β 0,...,β p ]. The number R(n) of ring multiplications satisfies the recurrence R(n) = R( n/ )+M( n/ )+(n mod ), with R() =. In the Karatsuba model, assume that R(k) K(k)+ for k<n.ifn = k,wehaver(n) = R(k)+K(k) 3K(k)+ K(n)+. If n = k +, then R(n) = R(k + ) + K(k) +

7 The Middle Product Algorithm I 4 K(k+)+K(k)+3 K(k+)+K(k)+ K(n)+ K(k + ). 3, using the fact that K(k) + 3 Power Series Inversion In this section, we explain how the MP algorithm can be applied, when plugged into Newton s iteration, to compute inverses. Theorem 6 Using the MP algorithm, one can inverse a power series of order n in I (n) = K(n) ring multiplications under Karatsuba s model, and I (n) FFT (n) using FFT. Remark. The analysis of all the FFT variants from now on is biased by the fact that exact FFT complexity (i.e., exact formulas for FFT (n)) highly depends on the implementation. We shall thus content ourselves with reasonable estimates, assuming mainly that FFT (n) FFT (n) and FFT (n + ) FFT (n). Proof. We describe the algorithm, which simply rests on Newton s iteration, with the usual product replaced by an MP call. Algorithm MP-inv([a 0,...,a n ]). 0. If n =, return [/a 0 ].. p n/.. α MP-inv([a 0,...,a n p ]) 3. β MP([α 0,...,α n p ], [a,...,a n, 0],n p) 4. γ mul high([α 0,...,α p ], [β 0,...,β p ]) 5. Return [α 0,...,α n p, γ 0,..., γ p ]. The algorithm performs just one division. The number I (n) of ring multiplications is given by I() = 0, I (n) = I( n/ )+M( n/ )+M( n/ ). Under Karatsuba s framework, we have I (n) = K(n), since I (n) + satisfies the same recurrence as K(n). In the FFT case, one gets I (n) FFT (n). Remark. The mul high call at step 4 of MP-inv has a better complexity than K(n) for most values of n if Mulders algorithm is used, which results in a better overall complexity for MP-inv. See [9] for more details. 3 It can be shown that K(n + ) K(n) = z(n)+ where z(n) is the number of zeroes in the binary expansion of n.

8 4 G. Hanrot et al. 4 Power Series Division 4. Newton s Method From our inversion algorithm, we can mechanically deduce a division algorithm as follows: Theorem 7 The quotient of two order n power series Mulders short division can be computed in 4 3 K(n) under the Karatsuba model, and 5 FFT (n) under the FFT model. As a consequence, the division with remainder of a polynomial of degree n by a polynomial of degree n can be performed in 7 FFT (n). Proof. We use Karp and Markstein s trick [7] to incorporate the dividend in the last Newton s iteration. The corresponding algorithm is as follows: Algorithm MP-div-KM([b 0,...,b n ], [a 0,...,a n ]). 0. If n =, return [b 0 /a 0 ].. p n/.. α MP-inv([a 0,...,a n p ]); 3. β mul high(α, [b 0,...,b n p ]); 4. γ MP([β 0,...,β n p ], [a,...,a n, 0],n p); 5. δ mul([α 0,...,α p ], [b n p γ 0,...,b n γ p ]); 6. Return [β 0,...,β n p,δ 0,...,δ p ]. The total cost is equivalent to that of one inversion plus one short multiplication of size n/, i.e., I (n)+m(n/),or 4 3 K(n) with Karatsuba, and 5 FFT (n) using the FFT. Remark. In the above algorithm we consider that the mul high call at step 3 costs M(n p) multiplications. Actually it may be less expensive when one uses Mulders short product under the Karatsuba model. Nevertheless, the worst-case complexity 4 K(n) already improves on the previous best-known 3 algorithm of average complexity.397k(n) from Mulders [9, Table 6]. However, see also Direct Division Using MP Algorithm MP may also be used directly i.e., without Newton s iteration to compute the quotient of polynomials or power series. This new algorithm is described and analyzed in this section. Assume that we are dividing B by A, and split B as (B 0,B ), and A as (A 0,A ). Let the quotient be Q = (Q 0,Q ). The algorithm is illustrated by Figure.

9 The Middle Product Algorithm I 43 A 0 A Q 0 Q B r = 0 MP(Q 0,A) B r Fig.. Algorithm MP-divide. Namely, the upper part of the quotient Q 0 is obtained by dividing B 0 by A 0. The lower part is then obtained by dividing B minus the middle terms of the product AQ 0 hence MP(Q 0,A) bya 0. This gives more formally the following algorithm: Algorithm MP-divide ([b 0,...,b n ], [a 0,...,a n ]). 0. If n =, return [b 0 /a 0 ].. p n/.. α MP-divide([b 0,...,b n p ], [a 0,...,a n p ]). 3. r MP(α, [a,...,a n, 0],n p). 4. β MP-divide([b n p r 0,...,b n r p ], [a 0,...,a p ]). 5. Return [α, β]. Theorem 8 Algorithm MP-divide with inputs of n terms correctly computes the first n terms of their quotient, using D(n) = K(n) n ring multiplications under Karatsuba model. Proof. The correctness follows easily from the explanations of the picture above. Let D(n) be the number of multiplications in the base field, D d (n) be the number of divisions. Then one has D d (n) = D d ( n/ ) + D d ( n/ ), and D d () =, hence D d (n) = n. As for D(n),wehaveD(n) = D( n/ ) + D( n/ ) + M( n/ ), D() = 0. Under Karatsuba model, we get D(n) = K(n) n by a simple induction, since the recurrence relations are closely related to those that one obtains for Karatsuba method. In the case of the FFT, we get for n large enough the reasonable estimate D(n) FFT (n) log (n). As a consequence, Newton s method should be preferred in that case. 4.3 Division With Remainder Consider two series A and B of n terms. Since Algorithm MP-divide computes their quotient Q in K(n) operations, and the remainder can be ob-

10 44 G. Hanrot et al. tained from B AQ in K(n) additional operations, this yields a division with remainder in K(n) operations, which equals the best known complexity in the Karatsuba model []. (Considering series at infinity instead of zero, all results transpose to polynomials.) We show in this section how to compute the remainder, i.e., the low part from AQ,in 3 K(n) operations, which gives a division with remainder in 5 3 K(n) operations. In the following algorithm we assume that the high part S of AQ is known (which is the case in division, since this is just the high part of the dividend B). Algorithm ShortRem([a 0,...,a n ], [q 0,...,q n ], [s 0,...,s n ]). 0. p n/.. T mul low([a p,...,a n ], [q p,...,q n ]).. If n is even.. U mul([a 0 + a p,...,a p + a n ], [q 0 + q p,...,q p + q n ]);. Return [u 0 +u p t 0 s 0 s p,...,u p +u p t p s p s n,u p s p s n,t 0,...,t p ] else.3 U mul([a 0 +a p,...,a p +a n,a p ], [q 0 +q p,...,q p + q n,q p ]);.4 Return [u p t 0 s p,u 0 +u p t s 0 s p,...,u p 4 +u p 4 t p 3 s p 4 s p 4,u p 3 +u p 3 s p 3 s p 3,u p +u p s p s n,t 0,...,t p 3 ]. Theorem 9 Algorithm ShortRem correctly computes the low half of AQ, and uses K(n) operations under the Karatsuba model, when A and Q have 3 n terms. Proof. The idea is that the remainder comes from the computation AQ, which Karatsuba decomposes as three multiplications A 0 Q 0, A Q and (A 0 +A )(Q 0 + Q ). But in the present case, the upper half is already known, and we can recover A 0 Q 0 from it and the two other products. More precisely, put A = A 0 + t p A, Q = Q 0 + t p Q, and write A Q = T 0 + t n p T, A 0 Q 0 = Z 0 + t p Z, (A 0 + A )(Q 0 + Q ) = U 0 + t n p U, S = S 0 + t n p S. The low part of the remainder is simply the low part of A Q, i.e., T.By definition, S is the high part of A 0 Q 0 + t p (A 0 Q + A Q 0 ) + t p A Q. () If n is even, S 0 = Z 0 and S = Z + U 0 T 0 Z 0. This implies that Z = S + T 0 U 0 + S 0. The high part of the remainder is the high part of the low part of (), i.e., the low part of (A 0 Q + A Q 0 ) plus the high part of A Q, namely T 0 + U T Z = U 0 + U T S 0 S.

11 The Middle Product Algorithm I 45 If now n is odd, we have S 0 = Z 0, S = Z + t(u 0 T 0 Z 0 ), hence Z = S + ts 0 + tt 0 tu 0, and the high part of the remainder is U T Z + tt 0 = tu 0 + U T ts 0 S, and the result follows. The computation of T is a n n product, which costs K( n/ ); that of U = (A 0 + A ) (Q 0 + Q ) is a full n n product, which costs K( n/ ). Thus ShortRem costs K( n/ ) + K( n/ ) 3 K(n). Remark. Since we only need the sum of U 0 or tu 0 and U, we may replace the full product (A 0 + A ) (Q 0 + Q ) by a product mod t p, which would directly give the sum U 0 + U or tu 0 + U.Ifp is even or, better, divisible by a large power of, the evaluation of such a product can be improved by simple FFT-like techniques, using the fact that (x, y) (x + y)/ + t k (x y)/is an isomorphism from R[t]/(t k ) R[t]/(t k + ) onto R[t]/(t p ), when p = k, if the characteristic of the ring R is odd. Remark. When using FFT multiplication, Algorithm ShortRem is of little interest, since it replaces a multiplication of size n by two multiplications of size n/. However, the fact that the product (A 0 +A )(Q 0 +Q ) is needed only modulo t p (whereas FFT might compute it modulo t p ) can probably be used to speed up things. 5 Square Root of Power Series 5. Newton s Method From our division algorithm, we can derive a square root algorithm by the classical Newton s iteration x k+ = x k + A x k x k. Theorem 0 The square root of a power series of order n can be computed in S(n) K(n) ring multiplications under the Karatsuba model, S(n) FFT (n) under the FFT model. 7 Proof. The corresponding algorithm is as follows: Algorithm MP-sqrt([a 0,...,a n ]). 0. If n =, return [ a 0 ].. p n/.. α MP-sqrt([a 0,...,a n p ]); 3. β mul([α,...,α n p ], [α,...,α n p ]); 4. γ MP-divide([a n p β n p,...,a n β n 3 ], α); 5. Return [α 0,...,α n p,γ 0,...,γ p ].

12 46 G. Hanrot et al. where at step 4 it is assumed that β n p = 0ifn =, β n 3 = 0ifnis even, and α is truncated to p terms. The number S(n) of ring multiplications satisfies S() = 0, S(n) = S( n/ ) + D( n/ ) + M( n/ ). We get S(n) K(n) n under the Karatsuba model and S(n) 7 FFT (n) under the FFT model. Corollary The square root (resp. square root with remainder) of a power series can be computed in 3 4 K(n) (resp. 5 K(n)) multiplications under the 4 Karatsuba model, and 3FFT (n) (resp. 4FFT (n)) multiplications under the FFT model. Proof. If one replaces the full product from step 3 by a short square giving the n p high order terms of α (see ), one gets S(n) 3 K(n) under the 4 Karatsuba model. In practice, one just needs to replace the call to mul by a call to MP-square. For the FFT model, instead of dividing A xk by x k at step 4 of MP-sqrt, we multiply by y k where y k = y k + y k ( x k y k ) is an approximation of / A, computed together with x k. For the last step, this replaces one division of size n/ by one multiplication of size n/ the last y k is not needed and for the preceding steps this replaces one division of size n/ j by three multiplications of size n/ j for j, whence a total gain of M(n/) for D(n) 5 M(n). To get the complexities for the square root with remainder, just add K(n) under the Karatsuba model (using Algorithm MP-square) and FFT (n) under the FFT model. 5. A Variant We can also give a slightly more general algorithm.assume that A is given to precision n, and we have an approximation A = S +t p R to precision p. Write S = S + δt p, with δ of order n p. One has A = S + Rt p δt p (S δt p ). To get the right S, we just need that Rt p δt p (S δt p ) be 0 to precision n, hence δ = R/(S + δt p ) to precision n p. This identity is meaningful as soon as p>0, since the quotient is performed online : the term of degree k of δ only depends on the terms of δ of degree k p. However for our recursive algorithm to work, all the first half bits of (S + δt p ) are needed to compute the first half of δ. Hence we can give an in place formulation of MP-divide finding δ as soon as p >(n p)/, i.e., p > n/3. Algorithm MP-sqrt corresponds to the case p = n/, hence we can ignore the term δt p, and simply compute R/(S) using MP-divide. Precisely, MP-div-inplace(a,q,l,n) computes the quotient of [a 0,...,a n ]by[q 0,...,q n ] and puts the result in [q l,...,q l+n ]. Algorithm MP-div-inplace ([a 0,...,a n ], [q 0,...,q l+n ],l,n). 0. If n =, q l a 0 /q 0. Return.

13 The Middle Product Algorithm I 47. p n/.. MP-div-inplace(a,q,l,n p); 3. γ MP([q l,...,q l+n p ], [q,...,q n, 0]); 4. MP-div-inplace([a n p γ 0,...,a n γ p ],q,l+n p, p). Algorithm MP-sqrt-generic([a 0,...,a n ],p). 0. If n = return [ a 0 ].. α MP-sqrt-generic([a 0,...,a p ]);. β MP-square(α); 3. MP-div-inplace([a p β p,...,a p β 0,a p,...,a n ], α, p +,n p); 4. Return [α 0,...,α p, α 0,...,α n p ]. One can then try to optimize the choice of the splitting point p.a theoretical analysis in the Karatsuba model, taking p = βn, givesβ opt 0.4, with a cost 0.734K(n). Experimental observations yield the following rule: for n between k and k + k, choose p = k, except for n = k + where one should choose p = k + ; for n between k + k and k+, choose p = n k. Using those simple rules, we seem to get the optimal number of multiplications up to a 5% loss; furthermore those rules seem to be asymptotically optimal. The asymptotic behaviour for S(n)/K(n) is then lim sup S(n)/K(n) = 3/4(eg. for n = k where we use in fact the first algorithm) and lim inf S(n)/K(n) seems to be 7/8 (roughly for n up to 6 ), and obtained for n = k + k. On average we get 5 6 S(k) k= 5 K(k) Implementation Results Since the algorithms we propose offer only a gain on the constant factor, it seemed to us that a realistic (by opposition with complexity estimates) implementation was required. We describe it shortly in the present section, and give experimental results on several architectures. We choose to implement the algorithms on power series with coefficients in Z/pZ, with p the largest prime such that p fits into a machine word (p = 655 on 3-bit machines, p = on 64-bit machines). For small sizes of arguments, quadratic algorithms are used. The optimal threshold for each algorithm is determined by a tuning program. Experiments have been done for series from degree 6 to 000. We discuss the results algorithm by algorithm. 6. Middle Product The middle product algorithm has been compared with Karatsuba s algorithm. The thresholds with the quadratic version are very similar on all architectures,

14 48 G. Hanrot et al. though slightly smaller for the middle product. In practice, we recover the expected result MP (n) K(n). We sometimes get (ev6 and k7) a little better, roughly 0.9K(n)..4 MP K(n).4 MP K(n) Middle product on a k7-550mhz Middle product on a ev6-500mhz.4 MP K(n).4 MP K(n) Middle product on a sparcv9-96 MHz Middle product on an R0K-94 MHz 6. Short Square We have compared the MP-square algorithm, of theoretical complexity K(n) (see ), 4 with the following algorithm: Algorithm Kara-square([a 0,...,a n ]). 0. If n = return [a 0 ].. p n/ ; 4 A referee proposed the following Mulders-like trick to improve MP-square: take p = ( β)n instead of p = n/, which leads to the recurrence R(n) = R(βn) + K(( β)n) + R((β )n) under the Karatsuba model, with an optimal β 0.560, and a theoretical average cost of about 0.48K(n). The same idea also applies to Kara-square with p = βn, β opt 0.58, and a cost of about 0.485K(n).

15 The Middle Product Algorithm I 49. α mul([a 0,...,a n p ], [a 0,...,a n p ]); 3. β mul high([a 0,...,a n p ], [a n p,...,a n ]); 4. Return [α 0,...,α n p,α n p + β 0,...,α n + β p ]. The theoretical complexity of this algorithm is M( n ), i.e., K(n) with 3 Karatsuba multiplication. However, a square is usually less expensive to compute than a product, and costs K(n) under reasonable assumptions. With the 3 average gain due to Mulders algorithm, we get an algorithm of practical complexity about 0.5K(n). On the following diagrams, the upper curve is the time for execution of a full square using Karatsuba, the middle one is the time for computing a short square using Kara-square and the lower one the time for computing a short square using MP-square.. square MP-Square Kara-Square K(n)/. square MP-Square Kara-Square K(n)/ Short square on a k7-550mhz Short square on a ev6-500mhz. square MP-Square Kara-Square K(n)/. square MP-Square Kara-Square K(n)/ Short square on a sparcv9-96 MHz Short square on an R0K-94 MHz

16 430 G. Hanrot et al. 6.3 Division We have compared our MP-divide algorithm with Burnikel-Ziegler division [], in which we have used short products to compute the remainder associated to the high part of the quotient. We get the following diagrams (lower curve is MP-divide):.8 MP-Div Kara-Div K(n).8 MP-Div Kara-Div K(n) Division on a k7-550mhz Division on a ev6-500mhz.8 MP-Div Kara-Div K(n).8 MP-Div Kara-Div K(n) Division on a sparcv9-96 MHz Division on an R0K-94 MHz 6.4 Square Root We have compared our MP-Sqrt algorithm (lower curve) with Karatsuba square root [3], in which we have used the Kara-square algorithm. We get the following diagrams:

17 The Middle Product Algorithm I 43. Sqrt MP-Sqrt 3/4 K(n). Sqrt MP-Sqrt 3/4 K(n) Square root on a k7-550mhz Square root on a ev6-500mhz. Sqrt MP-Sqrt 3/4 K(n). Sqrt MP-Sqrt 3/4 K(n) Square root on a sparcv9-96 MHz Square root on an R0K-94 MHz 6.5 Synthesis The complexity results obtained in this paper, which might be subject to caution since they deal with constants, are reliable. Our experiments confirm that we really obtain those constants in practice. We even get slightly better since the middle product seems to perform slightly better than Karatsuba on a variety of machines. 7 Conclusion This paper presents new algorithms for basic operations on polynomials or power series. Those algorithms are based on a new algorithm, called middle product, that computes the n middle coefficients of a (n ) n product as efficiently as a full n n product. Table gives a detailed analysis of the

18 43 G. Hanrot et al. Table. Previous and new best-known complexities for several operations on power series and polynomials, under the Karatsuba and FFT models, within O(n) memory usage. Square Karatsuba model FFT model previous this paper previous this paper K(n) K(n) [Th. 5] M(n) M(n) Middle product K(n) K(n) [Th. ] M(n) M(n) [Th. 3] Inverse Quotient... with remainder K(n) [] Square root K(n) [3]... with remainder 3 K(n) [] K(n) [Th. 6] 3M(n) M(n) [Th. 6] 3 K(n) [] K(n) [Th. 8] 7 M(n) 5 M(n) [Th. 7] 5 K(n) [Th. 9] 9 M(n) 7 M(n) [Th. 7] 3 3 K(n) [Co. ] 7 M(n) 3M(n) [Co. ] 4 3 K(n) [3] 5 K(n) [Co. ] 9 M(n) 4M(n) [Co. ] 4 number of coefficient operations used by those algorithms, compared to previously known algorithms. 5 An implementation in Maple of these algorithms is available at the URL zimmerma/papers/mp.mpl. It is possible to obtain floating point versions of the middle product algorithm using only O() extra memory with respect to the polynomial version at each recursion step. This, and the applications to multiprecision floating point arithmetic, is a work in progress and will be presented in a forthcoming paper [3]. In [], Victor Shoup asks if a matrix/vector product of the following form can be reduced to a single multiplication of polynomials of degree less than n: v 0 v v n v v v n. v n v n v n b 0 b. b n. This is exactly MP([b n,...,b 0 ], [v 0,...,v n ]), whence the middle product algorithm partially answers to that open question: it does not reduce the matrix/vector product to a single n n product, but to the same number of ring multiplications, provided the product algorithm uses linear forms. This is a particular instance of the transposition principle problem [4, Open Problem 6] in the case of Toeplitz matrices: if M is the (n ) n matrix with M i,j = x i j for 0 i j <n, and M i,j = 0 otherwise, y is the vector (a 0,...,a n ) T, and z is the vector (a n,...,a 0 ) T, then M y gives the full product of [x 0,...,x n ] 5 For the sake of simplicity, we consider that a full square costs as much as a full product, and we don t reuse Fourier transforms when a given variable is used in several multiplications.

19 The Middle Product Algorithm I 433 and [a 0,...,a n ], whereas M T z gives MP([x 0,...,x n ], [a 0,...,a n ]), in reversed order. The algorithms of Table need only a O(n) memory space, i.e., proportional to the input size. If one enables a larger memory usage, then other algorithms exist. In particular, the relaxed DAC-multiplication proposed by van der Hoeven in [] can be adapted to yield a division with remainder in exactly K(n) scalar operations, using O(nlog n) memory. However the fast variant from [, 4.3.], which works within O(n) memory, requires more than K(n) operations asymptotically. Acknowledgements. The authors wish to thank one of the anonymous referees for his clever comments that greatly helped to improve the presentation of the results, and Éric Schost who pointed out the link with the transposition principle. References. Brent, R.P.: The complexity of multiple-precision arithmetic. In: The Complexity of Computational Problem Solving, R.S. Anderssen and R.P. Brent, (eds.), University of Queensland Press, 976, pp Burnikel, C., Ziegler, J.: Fast recursive division. Research Report MPI-I-98--0, MPI Saarbrücken, Hanrot, G., Quercia, M., Zimmermann, P.: The Middle Product Algorithm, II. Floating point arithmetic. In preparation 4. Kaltofen, E.: Challenges of symbolic computation: My favorite open problems. J. Symbolic Comput. 9(6), (000) 5. Kaminski, M., Kirpatrick, D.G., Bshouty, N.H.: Addition requirements for matrix and transposed matrix products. J. Algorithms, 9, (988) 6. Karatsuba, A.A., Ofman, Y.P.: Multiplication of multiplace numbers by automata. Dokl. Akad. Nauk SSSR 45(), (96) 7. Karp, A.H., Markstein, P.: High-precision division and square root. ACM Trans. Math. Softw. 3(4), (997) 8. Krandick, W., Johnson, J.R.: Efficient multiprecision floating point multiplication with exact rounding. RISC-Linz Report Series 93-76, RISC-Linz, Johannes Kepler University, Mulders, T.: On short multiplications and divisions. AAECC (), (000) 0. Schönhage, A., Strassen, V.: Schnelle Multiplikation großer Zahlen. Computing 7, 8 9 (97). Shoup, V.: Efficient computation of minimal polynomials in algebraic extension of finite fields. In: Proceedings of the 999 International Symposium on Symbolic and Algebraic Computation (Vancouver, Canada), S. Dooley, (ed.), ACM, pp van der Hoeven, J.: Relax, but don t be too lazy. J. Symbolic Comput. 34(6), (00) 3. Zimmermann, P.: Karatsuba square root. Research Report 3805, INRIA, 999 Appendix. The General Situation In this section, we give a more general and formal presentation of the trick described in. This presentation is suitable for any bilinear multiplication

20 434 G. Hanrot et al. algorithm, and gives, in the case of Karatsuba and FFT multiplication, corresponding algorithms of the same complexity for the middle product (see Theorems and 3). In the sequel, R is any commutative ring. Theorem Let M p,q,n : R p R q R n and n,p,q : R n R p R q be the bilinear forms defined by M p,q,n (y, z) = y j z k n,p,q (x, y) = x i y j. j+k=i+p 0 i<n i+j=k 0 k<q Then, for any (X,Y,Z) R n R p R q, we have (X M p,q,n (Y, Z)) = ( n,p,q (X, Ỹ) Z), (3) where ( ) denotes the canonical inner product of two vectors of the same length and Ỹ = (y p i ) 0 i<p. The n,p,q and M p,q,n bilinear mappings are related to the product of power series in the following way. Let X = (x 0,...,x n ) R n, Y = (y 0,...,y p ) R p and Z = (z 0,...,z q ) R q.wehave ( ) n,p,q (X, Y ) = x i y j,..., x i y j i+j=0 i+j=q ( = coefficients of t 0,...,t q in ( x i t i )( ) y j t j ) ; ( ) M p,q,n (Y, Z) = y j z k,..., y j z k = j+k=p j+k=n+p ( coefficients of t p,...,t n+p in ( y j t j )( ) z k t k ). For p = n and q = n, n,p,q and M p,q,n correspond respectively to the usual multiplication and the middle product. Proof. [Theorem ] We have n (X M p,q,n (Y, Z)) = x i y j z k = x i y j z k i=0 = 0 i<n 0 j<p 0 k<q i+j=k = 0 k<q which concludes the proof. 0 j<p 0 k<q j+k=i+p x i y p j z k z k 0 i<n 0 j<p i+j=k 0 i<n 0 j<p 0 k<q i+p j=k x i y p j = ( n,p,q (X, Ỹ) Z),

21 The Middle Product Algorithm I 435 We are now able to prove our main result, Theorem 4: Proof of Theorem 4. Suppose that we have a formula n,p,q (X, Y ) = l (a m X)(b m Y)c m (4) m= with a m R n, b m R p and c m R q. Such a formula will be called a decomposition of n,p,q into l ring multiplications. Then we have according to Theorem (X M p,q,n (Y, Z)) = ( n,p,q (X, Ỹ) Z) = l (a m X)(b m Ỹ )(c m Z). m= Therefore, X being an arbitrary vector in R n : M p,q,n (Y, Z) = l (b m Ỹ )(c m Z)a m, (5) m= and M p,q,n can also be decomposed into l ring multiplications, with same coefficients. Remark. Formulas (4) (5) show that the complexities of a (n, n) full product and a (n, n ) middle product are equal when one only counts multiplications between linear forms in the operands. However, if one also takes into account the time to compute those linear forms and the time to combine the products, then the complexities of n,n,n and M n,n,n remain comparable according to the transposition principle (see the following paragraph). Matrix multiplication. With the above notations, let us write a m = (a m,,..., a m,n ), b m = (b m,,...,b m,p ), c m = (c m,,...,c m,q ) and let A, B, C be the (l, n), (l, p) and (l, q) corresponding matrices. Then formulas (4) (5) read: n,p,q (X, Y ) = C T ((AX T ) (BY T )), M p,q,n (Y, Z) = A T ((BỸ T ) (CZ T )), where denotes the component-wise product of two vectors of length l. From the transposition principle [5, Def. 8 and Th. 4], we conclude that one can transform a bilinear algorithm P computing n,p,q (X, Y ) into a bilinear algorithm M computing M p,q,n (Y, Z).Ifα(X ), β(x ) and γ(x ) denote the number of additions, multiplications by a known scalar and multiplications between two indeterminates for algorithm X then we have α(m) = α(p) + q p, β(m) = β(p), γ (M) = γ(p).

22 436 G. Hanrot et al. n 0 n 0 n n 0 X = X 0 X X Z = Z 0 Z Y = Y 0 Y Y Ỹ = Ỹ Ỹ Ỹ 0 Z 0 Z n n n n 0 Fig. 3. Karatsuba and MP splitting. This means, in the case p = n, q = n, that from any multiplication bilinear algorithm using M(n) multiplications in R we can derive an algorithm for MP using M(n) multiplications in R, and n more additions. We now illustrate how the algorithms of Theorems and 3 can be obtained from Theorem. The Karatsuba Model Let X = (x i ) 0 i<n R n, Y = (y j ) 0 j<n R n and Z = (z k ) 0 k<n R n. Write n = n 0 + n, with 0 <n 0 n and split X, Y in the following way. X = [X 0,X,X ] with X 0 = (x 0,...,x n0 ), X = (x n0,...,x n ), X = (x n,...,x n ), Y = [Y 0,Y,Y ] with Y 0 = (y 0,...,y n0 ), Y = (y n0,...,y n ), Y = (y n,...,y n ). Note that the subvectors X and Y are empty when n 0 = n. Furthermore, we define Z 0,Z 0,Z and Z as the following subvectors of Z: Z 0 = (z 0,...,z n ), Z 0 = (z n,...,z 3n ), Z = (z n,...,z n +n 0 ), Z = (z n,...,z n ). Karatsuba s method amounts to make a recursive call on low halves and high halves of the operands, and on the differences of those operands, i.e., to compute a = n,n,n ([X 0,X ], [Y 0,Y ]), b = n0,n 0,n 0 (X,Y ), c = n,n,n ([X 0 X,X ], [Y 0 Y,Y ]),

23 The Middle Product Algorithm I 437 and then the high part of the result is b, the low part is a and the middle part is a c + b. In our formalism, we write ( n,n,n (X, Ỹ) Z) = ( n,n,n ([X 0,X ], [Ỹ, Ỹ ]) Z 0 + Z 0 ) ( n,n,n ([X 0 X,X ], [Ỹ Ỹ 0, Ỹ ]) Z 0 ) +( n0,n 0,n 0 (X, Ỹ 0 ) Z + Z ). Then, using equation (3) and Karatsuba s method, we can write (X M n,n,n (Y, Z)) = ( n,n,n (X, Ỹ) Z) = ( n,n,n ([X 0,X ], [Ỹ, Ỹ ]) Z 0 + Z 0 ) ( n,n,n ([X 0 X,X ], [Ỹ Ỹ 0, Ỹ ]) Z 0 ) +( n0,n 0,n 0 (X, Ỹ 0 ) Z + Z ) = ([X 0,X ] M n,n,n ([Y,Y ],Z 0 + Z 0 )) ([X 0 X,X ] M n,n,n ([Y,Y Y 0 ],Z 0 )) +(X M n0,n 0,n 0 (Y 0,Z + Z )) = (X [α β,γ + low(β)]) where α = M n,n,n ([Y,Y ],Z 0 + Z 0 ), β = M n,n,n ([Y,Y Y 0 ],Z 0 ), low(β) = (β 0,...,β n0 ), γ = M n0,n 0,n 0 (Y 0,Z + Z ). X being an arbitrary vector of R n, we infer: M n,n,n (Y, Z) = [α β,γ + low(β)]. (6) This formula is exactly Algorithm MP-Karatsuba from.. The FFT Model Let ω be a primitive (n)-th root of unity, X R n, Y R n and Z R n. The FFT algorithm becomes, in our formalism,

24 438 G. Hanrot et al. n,n,n (X, Y ) = n ((ω mi ) 0 i<n X)((ω mj ) 0 j<n Y )(ω mk ) 0 k<n. n m=0 Hence we obtain from (4) (5) the formula M n,n,n (Y, Z) = n ((ω mj ) 0 j<n Ỹ )((ω mk ) 0 k<n Z)(ω mi ) 0 i<n. n m=0 In more classical notations, if f m (n) is the m-th component of the direct Fourier transform, and g m (n) that of the inverse one, we get y j z k = n ω mi f m (n) n (ỹ,0,...,0)g(n) m (z), 0 i<n, 0 j<n 0 k<n j+k=i+n m=0 which is exactly Algorithm MP-FFT from.3. This means that MP(y, z) can be computed in two Fourier transforms of size n and one inverse Fourier transform of size n, which is exactly the same as for a full n n product. Remark. The preceding formulas are still valid if we suppose X of length n+: n+,n,n (X, Y ) = n ((ω mi ) 0 i n X)((ω mj ) 0 j<n Y )(ω mk ) 0 k<n, n M n,n,n+ (Y, Z) = n m=0 n ((ω mj ) 0 j<n Ỹ )((ω mk ) 0 k<n Z)(ω mi ) 0 i n, m=0 therefore we can compute the n+ middle coefficients of the product ( y j t j ) ( z k t k ) with no additional operation (assuming no particular optimization).

Short Division of Long Integers. (joint work with David Harvey)

Short Division of Long Integers (joint work with David Harvey) Paul Zimmermann October 6, 2011 The problem to be solved Divide efficiently a p-bit floating-point number by another p-bit f-p number in the