
On Bivariate Hensel Lifting and its Parallelization

Laurent Bernardin
Institut für Wissenschaftliches Rechnen, ETH Zürich
bernardin@inf.ethz.ch

Abstract. We present a new parallel algorithm for performing linear Hensel lifting of bivariate polynomials over a finite field. The sequential version of our algorithm has a running time of O(mn^4) for lifting m univariate polynomials of degree n with respect to a bivariate polynomial of degree n in both variables, assuming that we use classical polynomial multiplication. Our parallel algorithm further reduces this complexity to O(mn^4/s) on s processing nodes, assuming that s < n. We also present an asymptotically faster algorithm, which has a complexity of O((ln m) n^2 ln n) operations in the coefficient field, using fast polynomial multiplication and O(n ln m) processors. Experimental results on a massively parallel, distributed memory machine confirm that our algorithm scales well on high numbers of processing nodes.

1 Introduction

Given polynomials f_1, ..., f_m ∈ F_p[x], pairwise relatively prime, and a primitive, square-free polynomial f ∈ F_p[x, y] such that

    f ≡ ∏_{i=1}^m f_i  (mod y),

bivariate Hensel lifting aims to construct, for a given bound k, polynomials f_1^{(k)}, ..., f_m^{(k)} ∈ F_p[x, y] such that

    (1)  ∀i: f_i^{(k)} ≡ f_i  (mod y)
    (2)  f ≡ ∏_{i=1}^m f_i^{(k)}  (mod y^k)

If k is sufficiently large, the f_i^{(k)} obtained can be used to compute a factorization of f over F_p.

We restrict ourselves to the case of bivariate polynomials over a finite field. In practice, bivariate polynomials are common. Moreover, state-of-the-art algorithms for factoring polynomials in more than two variables rely on multiple factorizations of bivariate polynomials [4, 5]. For these reasons, it is important to have a fast way of lifting bivariate polynomials. Parallel factorization algorithms for sparse multivariate polynomials have been presented in [9]. We use a dense lifting approach, which is most effective for bivariate polynomials. Only as the number of variables increases does it become more and more important to use sparse techniques to prevent exponential behavior in the number of variables. As mentioned above, we restrict ourselves to polynomials over finite fields, although the same ideas can be applied over any ring.
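To make the setting concrete, here is a minimal, self-contained Python sketch (illustrative only; the paper's implementation is in Maple) that stores a bivariate polynomial over F_p as a list of univariate coefficient lists indexed by the power of y, and checks the starting congruence f ≡ f_1 f_2 (mod y) on a toy example. The helper names and the toy polynomial are assumptions made for illustration.

    # Toy illustration of the Hensel lifting setup over F_p (hypothetical helper names).
    p = 7  # a small prime; the paper works over a finite field F_p

    def umul(a, b):
        """Multiply two univariate polynomials (coefficient lists, lowest degree first) mod p."""
        res = [0] * (len(a) + len(b) - 1)
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                res[i + j] = (res[i + j] + ai * bj) % p
        return res

    # f(x, y) = (x + 1 + y)(x + 2 + y) over F_7, stored as coefficients of y^0, y^1, y^2,
    # each of which is a univariate polynomial in x (again as a coefficient list).
    f = [
        [2, 3, 1],  # y^0: (x+1)(x+2) = x^2 + 3x + 2
        [3, 2],     # y^1: (x+1) + (x+2) = 2x + 3
        [1],        # y^2: 1
    ]

    # Univariate images f_1, f_2: pairwise relatively prime in F_p[x].
    f1 = [1, 1]  # x + 1
    f2 = [2, 1]  # x + 2

    # Condition f = f_1 * f_2 (mod y): the y^0 part of f equals the product of the images.
    assert f[0] == umul(f1, f2)
    print("f(x,0) =", f[0], "= f1*f2 mod", p)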

There are two approaches to Hensel lifting. Linear lifting starts with polynomials f_i^{(1)} = f_i^{(0)} and iteratively constructs polynomials f_i^{(j)} such that

    (1)  f_i^{(j)} ≡ f_i^{(j-1)}  (mod y^{j-1})
    (2)  f ≡ ∏_{i=1}^m f_i^{(j)}  (mod y^j)

The bound k is reached after k lifting steps. Quadratic lifting also starts with f_i^{(1)} = f_i^{(0)} but constructs polynomials f_i^{(2^j)} such that

    (1)  f_i^{(2^j)} ≡ f_i^{(2^{j-1})}  (mod y^{2^{j-1}})
    (2)  f ≡ ∏_{i=1}^m f_i^{(2^j)}  (mod y^{2^j})

The bound k is reached after log_2 k lifting steps. If classical multiplication is used, the asymptotic complexity of both approaches is equivalent [7]. Parallelizing the quadratic algorithm is tempting, as it involves large polynomial multiplications that can easily be parallelized using Karatsuba's algorithm. However, in practice, the sequential quadratic lifting algorithm is not able to compete with the linear algorithm, at least for bivariate polynomials of degree up to 1000 in both variables. For this reason we will concentrate on a parallel version of linear Hensel lifting.

Above we assume that we can evaluate f(x, y) at y = 0 such that deg_x(f(x, y)) = deg_x(f(x, 0)) and such that f(x, 0) is square-free. If this does not hold, we compute the translated polynomial f̃ = f(x, y + α) such that f̃(x, 0) satisfies the above conditions. It is shown in [6] that this translation can be done using O(n^3) operations in the coefficient field (with n a bound on the degree of f in both variables). We assume in the following that the coefficient field F_p contains such an α. For more details on this selection process and on the case where F_p does not contain a suitable evaluation value, see [2].

2 The Sequential Lifting Algorithm

Linear lifting algorithms for dense bivariate polynomials over finite fields given in [3, 10, 8] need O(mn^5) operations in F_p, with m the number of factors and n a bound on the degree in each variable of the polynomial to factor. We present a sequential algorithm that is an order of magnitude faster than these, needing only O(mn^4) coefficient operations. We then describe our parallel version of this algorithm.

We are given polynomials f ∈ F_p[x, y] and f_i^{(1)} ∈ F_p[x], i = 1..m, such that

    f ≡ ∏_{i=1}^m f_i^{(1)}  (mod y)    (1)

with deg_x(f) ≤ n and deg_y(f) ≤ n. Assume we want to lift the f_i^{(1)}, i = 1..m, up to degree n in y, i.e. compute f_i^{(n)}, i = 1..m, such that

    f ≡ ∏_{i=1}^m f_i^{(n)}  (mod y^n)    (2)

and

    ∀i = 1..m:  f_i^{(1)} ≡ f_i^{(n)}  (mod y)    (3)
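The evaluation-point condition discussed above (the degree in x must be preserved at y = 0, and f(x, 0) must be square-free) is cheap to test. The following self-contained Python sketch checks it for a toy polynomial by comparing degrees and computing gcd(f(x,0), d/dx f(x,0)) over F_p; the helper names and the example are illustrative assumptions, not the paper's code.

    # Check the evaluation point y = 0: deg_x f(x,0) = deg_x f(x,y) and f(x,0) square-free.
    p = 7

    def trim(a):
        a = [c % p for c in a]
        while len(a) > 1 and a[-1] == 0:
            a.pop()
        return a

    def prem(a, b):
        """Remainder of a modulo b in F_p[x]; coefficient lists, lowest degree first."""
        a, b = trim(a), trim(b)
        inv = pow(b[-1], p - 2, p)          # inverse of b's leading coefficient (Fermat)
        while len(a) >= len(b) and a != [0]:
            q = a[-1] * inv % p
            s = len(a) - len(b)
            for i in range(len(b)):
                a[s + i] = (a[s + i] - q * b[i]) % p
            a = trim(a)
        return a

    def pgcd(a, b):
        while trim(b) != [0]:
            a, b = b, prem(a, b)
        return trim(a)

    def deriv(a):
        return trim([(i * c) % p for i, c in enumerate(a)][1:] or [0])

    # f(x, y) = (x+1+y)(x+2+y): coefficients of y^0, y^1, y^2, each a polynomial in x.
    f = [[2, 3, 1], [3, 2], [1]]
    f0 = trim(f[0])
    deg_x_f = max(len(trim(c)) - 1 for c in f)      # degree of f in x
    assert len(f0) - 1 == deg_x_f                   # deg_x is preserved at y = 0
    assert len(pgcd(f0, deriv(f0))) == 1            # square-free: gcd with derivative is constant
    print("y = 0 is a valid evaluation point")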

Using linear Hensel lifting, we want to compute, at step k, the f_i^{(k)} ∈ F_p[x, y], i = 1..m, such that

    f ≡ ∏_{i=1}^m f_i^{(k)}  (mod y^k)    (4)

with

    f_i^{(k)} ≡ f_i^{(k-1)}  (mod y^{k-1})    (5)

For (5) to hold, we set

    f_i^{(k)} := f_i^{(k-1)} + σ_i y^{k-1}    (6)

with σ_i ∈ F_p[x]. Plugging (6) into (4), we see that lifting from y^{k-1} to y^k amounts to solving for the σ_i's in

    (f - ∏_{j=1}^m f_j^{(k-1)}) / y^{k-1} ≡ Σ_{i=1}^m σ_i ∏_{j=1, j≠i}^m f_j^{(0)}  (mod y)    (7)

(7) is a univariate Diophantine equation in F_p[x] that can be solved by first precomputing the solutions α_i of

    Σ_{i=1}^m α_i ∏_{j=1, j≠i}^m f_j^{(0)} ≡ 1  (mod y)    (8)

Now we can easily compute the σ_i at each step by multiplying the α_i with the left-hand side of (7) and reducing modulo f_i^{(0)}. This means that solving the Diophantine equation (7) has a cost of O(m) multiplications in the coefficient ring F_p[x] and thus a total cost of O(mM(n)) operations in F_p, where M(n) is the complexity of multiplying two univariate polynomials of degree n.

Before we can solve the Diophantine equation, we have to compute the left-hand side of (7):

    (f - ∏_{j=1}^m f_j^{(k-1)}) / y^{k-1}  (mod y)    (9)

We notice that only the coefficient of y^{k-1} in the numerator, denoted by

    C_k = (f - ∏_{j=1}^m f_j^{(k-1)})^{[y^{k-1}]}    (10)

is needed. We will now discuss how to efficiently compute c_k such that

    C_k = f^{[y^{k-1}]} - c_k    (11)

Our idea is to compute the product of the f_i^{(k-1)} modulo y^k at each step, reusing sub-products already computed in the previous step. At step k we thus have to compute

    ∏_{i=1}^m f_i^{(k-1)}  (mod y^k)    (12)

In the following we will denote the coefficient of y^j in f_i^{(k-1)} as u_i^{[j]} (noting that u_i^{[j]} does not depend on k, since later lifting steps only add coefficients of higher powers of y).
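For m = 2 factors, the precomputation (8) amounts to the extended Euclidean algorithm in F_p[x]. The following self-contained Python sketch (illustrative names; a sketch under these assumptions, not the paper's implementation) computes α_1, α_2 with α_1 f_2^{(0)} + α_2 f_1^{(0)} = 1 and then solves one instance of (7), for a given right-hand side C_k, by the multiply-and-reduce step described above.

    # Precomputation (8) and one Diophantine solve (7) for m = 2 over F_p (illustrative sketch).
    p = 7

    def trim(a):
        a = [c % p for c in a]
        while len(a) > 1 and a[-1] == 0:
            a.pop()
        return a

    def padd(a, b):
        n = max(len(a), len(b))
        return trim([((a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)) % p for i in range(n)])

    def pmul(a, b):
        res = [0] * (len(a) + len(b) - 1)
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                res[i + j] = (res[i + j] + ai * bj) % p
        return trim(res)

    def pdivmod(a, b):
        """Quotient and remainder in F_p[x]; coefficient lists, lowest degree first."""
        a, b = trim(a), trim(b)
        q = [0] * max(1, len(a) - len(b) + 1)
        inv = pow(b[-1], p - 2, p)
        while len(a) >= len(b) and a != [0]:
            c = a[-1] * inv % p
            s = len(a) - len(b)
            q[s] = c
            for i in range(len(b)):
                a[s + i] = (a[s + i] - c * b[i]) % p
            a = trim(a)
        return trim(q), a

    def ext_gcd(a, b):
        """Return (g, s, t) with s*a + t*b = g (monic) in F_p[x]."""
        r0, r1, s0, s1, t0, t1 = trim(a), trim(b), [1], [0], [0], [1]
        while r1 != [0]:
            q, r = pdivmod(r0, r1)
            r0, r1 = r1, r
            s0, s1 = s1, padd(s0, [(-c) % p for c in pmul(q, s1)])
            t0, t1 = t1, padd(t0, [(-c) % p for c in pmul(q, t1)])
        inv = [pow(r0[-1], p - 2, p)]
        return pmul(r0, inv), pmul(s0, inv), pmul(t0, inv)

    f1, f2 = [1, 1], [2, 1]                    # f_1^{(0)} = x + 1,  f_2^{(0)} = x + 2
    g, s, t = ext_gcd(f1, f2)                  # s*f1 + t*f2 = 1, so (8) holds with
    alpha1, alpha2 = t, s                      # alpha_1*f2 + alpha_2*f1 = 1
    assert g == [1]
    Ck = [4, 5]                                # a sample left-hand side of (7), deg < deg(f1*f2)
    sigma1 = pdivmod(pmul(alpha1, Ck), f1)[1]  # sigma_i = alpha_i * C_k  mod  f_i^{(0)}
    sigma2 = pdivmod(pmul(alpha2, Ck), f2)[1]
    assert padd(pmul(sigma1, f2), pmul(sigma2, f1)) == trim(Ck)
    print("sigma_1 =", sigma1, " sigma_2 =", sigma2)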

We will compute the product (12) iteratively, factor by factor. We define

    P_q := ∏_{i=1}^{q} f_i^{(k-1)}  (mod y^k)    (13)

with c_k = P_m^{[y^{k-1}]}. The product of the first two factors gives

    P_2 = Σ_{l=0}^{k-1} ( Σ_{q=0}^{l} u_1^{[q]} u_2^{[l-q]} ) y^l

For successive i we can compute P_i as

    P_i = P_{i-1} · f_i^{(k-1)} = Σ_{l=0}^{k-1} ( Σ_{q=0}^{l} p^{[q]} u_i^{[l-q]} ) y^l

with p^{[l]} := P_{i-1}^{[y^l]}. Note that p is used exclusively for simplifying the notation and that the p's are different for varying i and k.

Moving to the next step, the same expansions hold with k replaced by k + 1, and all coefficient products computed at step k reappear unchanged.
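The following Python sketch illustrates the reuse: a cached partial product is extended by one power of y per step, and only the new anti-diagonal of coefficient products is multiplied. It is a simplified, sequential illustration with hypothetical helper names; the paper's scheme additionally accounts for the factors themselves gaining a new coefficient at every lifting step.

    # Incremental computation of a partial product P = a*b (mod y^k), reusing the
    # coefficients computed in earlier lifting steps (illustrative sketch).
    p = 7

    def umul(a, b):
        res = [0] * (len(a) + len(b) - 1)
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                res[i + j] = (res[i + j] + ai * bj) % p
        return res

    def uadd(a, b):
        n = max(len(a), len(b))
        return [((a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)) % p for i in range(n)]

    def extend_product(P, a, b, k):
        """Append the coefficient of y^k of a*b to the cached product P = a*b mod y^k.

        a, b are lists of univariate x-polynomials (coefficients of y^0, y^1, ...);
        only the products a[q]*b[k-q] of the new anti-diagonal are multiplied here,
        everything of lower order is already stored in P."""
        coeff = [0]
        for q in range(k + 1):
            if q < len(a) and k - q < len(b):
                coeff = uadd(coeff, umul(a[q], b[k - q]))
        P.append(coeff)
        return P

    # Two factors, known so far up to y^1 (as they would be after one lifting step).
    f1 = [[1, 1], [3]]   # (x+1) + 3y
    f2 = [[2, 1], [5]]   # (x+2) + 5y
    P2 = [umul(f1[0], f2[0])]            # product mod y^1
    P2 = extend_product(P2, f1, f2, 1)   # now product mod y^2
    P2 = extend_product(P2, f1, f2, 2)   # coefficient of y^2: here 3*5 = 15 = 1 mod 7
    print(P2)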

We see that, moving from one step to the next, the only products that we have to compute and that have not already been computed in the previous step are those involving the newly added top coefficients of f_1^{(k-1)} and f_2^{(k-1)} for the first two factors, and those involving the newly added top coefficients of f_i^{(k-1)} and of P_{i-1} for each subsequent factor i. Thus the total number of multiplications needed at step k of the lifting is equal to

    M_k = (k + 2) + (m - 2)(k + 2) = (k + 2)(m - 1)    (14)

Supposing that we want to lift our univariate image polynomials up to degree n, we need a total number of

    Σ_{k=1}^{n} (k + 2)(m - 1) = ((m - 1)/2)(n^2 + 5n)    (15)

multiplications in the coefficient ring F_p[x] for computing the left-hand sides of the arising Diophantine equations. Thus we get a total running time of O(mn^2) multiplications in F_p[x] for the linear lifting algorithm. Supposing that the degrees in x of the factors are bounded by n, the total complexity of linear Hensel lifting in terms of field operations is O(mn^2 M(n)). Assuming classical multiplication, we get a complexity of O(mn^4).

In order to store the output, i.e. the m factors lifted to degree n, we need memory for mn^2 elements from F_p. In addition to these, we need to store the products P_i. These require extra memory to hold (m - 2)n^2 elements from F_p. This means that the amount of required working memory is less than the amount required to store the result.

3 The Parallel Lifting Algorithm

We will now discuss how to implement this algorithm in parallel. Each step needs the computation of a sum of products of univariate polynomials. We will distribute this computation evenly across the available processing nodes. One node will be reserved for collecting the results from the slave nodes and for solving the Diophantine equation at each step. The parallel linear Hensel lifting algorithm is outlined in Table 1.

At steps (D) and (G), the slave nodes need to compute a convolution of the form

    Σ_{i=0}^{k} a_i b_{k-i}

This is done by distributing the products from this sum evenly across the available slave nodes, as sketched below. The partial sums are then added together on the master node. The cost of this addition is O((n/s) n) and although it could be reduced to O(n ln(n/s)) using a binary tree shaped algorithm, the efficiency gain would be marginal, as the cost of this step is comparatively small.

Note that while the master node solves the Diophantine equation necessary to lift the factors to y^k, the slave nodes are already working on the convolution product that the master node will need in order to lift the factors to y^{k+1}. This gives us a nice computation overlap and prevents the master node from ever having to spin idly, waiting for results from the slave nodes.
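A minimal Python sketch of this distribution (sequentially simulated; the node communication, which the paper implements via message passing in Maple, is omitted, and all names are illustrative):

    # Distributing the convolution sum_{i=0}^{k} a_i * b_{k-i} over s slave nodes
    # (simulated sequentially here).
    p = 7

    def umul(a, b):
        res = [0] * (len(a) + len(b) - 1)
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                res[i + j] = (res[i + j] + ai * bj) % p
        return res

    def uadd(a, b):
        n = max(len(a), len(b))
        return [((a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)) % p for i in range(n)]

    def convolution_coefficient(a, b, k, s):
        """Coefficient of y^k of the product, with the k+1 products dealt out round-robin
        to s slaves; each slave adds its products locally, the master adds the partial sums."""
        partial = [[0] for _ in range(s)]
        for i in range(k + 1):                      # product i goes to slave i mod s
            if i < len(a) and k - i < len(b):
                partial[i % s] = uadd(partial[i % s], umul(a[i], b[k - i]))
        total = [0]                                 # master adds the s partial sums
        for ps in partial:
            total = uadd(total, ps)
        return total

    a = [[1, 1], [3], [2, 4]]                       # y-coefficients, each a polynomial in x
    b = [[2, 1], [5], [6]]
    print(convolution_coefficient(a, b, 2, s=3))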

Step (A)  Master: Input: f ∈ F_p[x, y]; f_i^{(0)} ∈ F_p[x], i = 1..m.
          Initialize u_i = f_i^{(0)} for i = 1..m.
          Initialize P_i = ∏_{j=1}^{i} f_j^{(0)} for i = 2..m.
          Precompute the α_i from (8).
Step (B)  → send u_1, ..., u_m and P_2, ..., P_m from master to slaves →
Step (C)  Iterate steps (C)-(J) for k from 1 to n.
          Master: compute the u_i^{[k]}, i = 1..m, via equation (7);
          compute u_1^{[k]} u_2^{[0]}, u_1^{[k]} u_2^{[1]}, u_1^{[0]} u_2^{[k]}, u_1^{[1]} u_2^{[k]}.
          Slave(s): compute u_1^{[1]} u_2^{[k-1]} + ... + u_1^{[k-1]} u_2^{[1]}.
Step (D)  ← send u_1^{[1]} u_2^{[k-1]} + ... + u_1^{[k-1]} u_2^{[1]} from slaves to master ←
Step (E)  → send u_1^{[k]}, ..., u_m^{[k]} from master to slaves →
Step (F)  Master: update P_2.
Step (G)  Iterate steps (G)-(J) for i from 3 to m.
          Master: compute p^{[0]} u_i^{[k]}, p^{[k]} u_i^{[0]}, p^{[k]} u_i^{[1]}.
          Slave(s): compute p^{[1]} u_i^{[k-1]} + ... + p^{[k-1]} u_i^{[1]}.
Step (H)  ← send p^{[1]} u_i^{[k-1]} + ... + p^{[k-1]} u_i^{[1]} from slaves to master ←
Step (I)  Master: update P_i.
Step (J)  → send the new coefficients of P_i from master to slaves →

Table 1: Parallel Algorithm

At step k of each iteration, the master node has to perform m multiplications and m divisions of univariate polynomials of degree n in order to solve the Diophantine equation, plus 4 + 3(m - 2) multiplications of univariate polynomials of degree n to compute its share of the convolutions. This amounts to a total cost of O(mnM(n)) operations in F_p in order to lift the univariate image polynomials up to degree n. Assuming a number of s slave nodes, each one has to compute O(mk/s) multiplications of univariate polynomials of degree n at step k of the lifting. The total work of a single slave node sums up to O((mn^2/s) M(n)) operations in F_p, or O(mn^4/s) operations in F_p assuming classical polynomial multiplication.

4 Experimental Results

We have implemented our algorithm on a massively parallel, distributed memory machine, an Intel Paragon, using a version of Maple that has been extended with message passing primitives [1]. Table 2 summarizes the timings from lifting two degree-n image polynomials up to degree 2n in the second variable; such a lifting is needed for the factorization of a bivariate polynomial of degree 2n in both variables. Our examples are over the coefficient field Z_3. Times are given in wall-clock (real time) seconds. The speedup factor is computed as

    (time on one node) / (time on s nodes)

and we define efficiency as

    speedup / (number of nodes)

Table 2: Paragon timings (wall-clock time, speedup and efficiency for n = 100, 200, 300, 400 and 500 on increasing numbers of processing nodes).
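As a small illustration of these two definitions (with made-up numbers, not the measurements from Table 2):

    # Speedup and efficiency as defined above; the timings here are invented for illustration.
    def speedup(t1, ts):
        return t1 / ts

    def efficiency(t1, ts, s):
        return speedup(t1, ts) / s

    t1, ts, s = 1000.0, 40.0, 32     # hypothetical wall-clock times in seconds on 1 and s nodes
    print(speedup(t1, ts), efficiency(t1, ts, s))   # 25.0 and 0.78125, i.e. about 78%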

The n = 500 example corresponds to the factorization of a dense bivariate polynomial with degree 1000 in both variables. Its expanded form would have over a million terms. On a state-of-the-art workstation, a Digital Alpha 500/333, the same lifting takes 099s, compared to 704s on 4 nodes of the Paragon. For n = 1000, we could not get sequential timings on the Paragon, as our environment imposes a job limit of 0 hours. However, we ran it on our Digital Alpha workstation, where it took 44 hours. Using 8 nodes on the Paragon, we could reduce this time to hours, yielding a speedup of .

The sequential algorithm performs slightly worse than the expected O(n^4). This is due to the overhead of Maple's garbage collecting memory manager, which increases with the memory usage. Distributing the computations across more nodes, we also distribute the memory usage. This explains the super-linear speedups that we encountered.

5 An Asymptotic Improvement

A further improvement of our lifting algorithm is to compute the product ∏_{i=1}^m f_i^{(k-1)} in parallel. We achieve this using a binary tree structured algorithm to combine the f_i^{(k-1)}, i = 1..m, two by two. We assume that m is a power of two; if that is not the case, we pad using dummy factors. First we define T_{1,i} := f_i^{(k-1)}. Next we can compute

    T_{2,i} = T_{2,i} + Δ_{2,i} y^k

with

    Δ_{2,i} = (T_{1,2i-1})^{[y^0]} (T_{1,2i})^{[y^k]} + ... + (T_{1,2i-1})^{[y^k]} (T_{1,2i})^{[y^0]} = Σ_{q=0}^{k} (T_{1,2i-1})^{[y^q]} (T_{1,2i})^{[y^{k-q}]}

We can see that T_{2,i} ≡ f_{2i-1}^{(k-1)} f_{2i}^{(k-1)}, with T_{log_2 m + 1, 1} = ∏_{i=1}^m f_i^{(k-1)} (mod y^k). Now we can compute successive T_{j,i} with

    T_{j,i} = T_{j,i} + Δ_{j,i} y^k

and

    Δ_{j,i} = (T_{j-1,2i-1})^{[y^0]} (T_{j-1,2i})^{[y^k]} + ... + (T_{j-1,2i-1})^{[y^k]} (T_{j-1,2i})^{[y^0]} = Σ_{q=0}^{k} (T_{j-1,2i-1})^{[y^q]} (T_{j-1,2i})^{[y^{k-q}]}

At each step k, similarly to our initial algorithm, the master node computes the u_i^{[k]} and those products of the Δ_{j,i} that involve coefficients of y^k, while the slave nodes compute the remaining products of the Δ_{j,i}. As in the initial algorithm, we overlap the computation of the Δ_{j,i} on the slave nodes with the computation of the u_i^{[k-1]} and of the master's share of the Δ_{j,i} for step k - 1 on the master.

This algorithm reduces the overall complexity to O((ln m + mn/s) n M(n)) on s processing nodes. Assuming O(n ln m) processors, the running time is O((ln m) n M(n)). Further assuming that we use fast univariate polynomial multiplication (M(n) = n ln n), we can claim an asymptotic running time of O((ln m) n^2 ln n) operations in F_p using O(n ln m) processors.
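The tree combination itself is easy to picture. The sketch below builds the T_{j,i} for a toy example in plain Python, sequentially and by recomputing full truncated products rather than the incremental Δ updates, and without the master/slave split; all helper names are illustrative assumptions.

    # Binary-tree combination of the factors (pad to a power of two with the constant
    # factor 1), as in Section 5; a sequential sketch of the tree structure only.
    p = 7

    def umul(a, b):
        res = [0] * (len(a) + len(b) - 1)
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                res[i + j] = (res[i + j] + ai * bj) % p
        return res

    def uadd(a, b):
        n = max(len(a), len(b))
        return [((a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)) % p for i in range(n)]

    def bmul_trunc(a, b, k):
        """Product of two y-truncated bivariate polynomials, keeping only y^0 .. y^{k-1}."""
        res = [[0] for _ in range(k)]
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                if i + j < k:
                    res[i + j] = uadd(res[i + j], umul(ai, bj))
        return res

    def product_tree(factors, k):
        """Combine the factors two by two; level j of the returned list holds the T_{j,i}."""
        level = list(factors)
        while len(level) & (len(level) - 1):      # pad with dummy factor 1 up to a power of two
            level.append([[1]])
        levels = [level]
        while len(level) > 1:
            level = [bmul_trunc(level[i], level[i + 1], k) for i in range(0, len(level), 2)]
            levels.append(level)
        return levels                              # levels[-1][0] is the full product mod y^k

    f1 = [[1, 1], [3]]     # (x+1) + 3y
    f2 = [[2, 1], [5]]     # (x+2) + 5y
    f3 = [[3, 1], [1]]     # (x+3) + y
    tree = product_tree([f1, f2, f3], k=2)
    print(tree[-1][0])     # f1*f2*f3 mod y^2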

6 Conclusions and Further Work

We have presented a new algorithm for parallel bivariate Hensel lifting. The new algorithm has an asymptotic complexity of O((ln m) n^2 ln n) operations in F_p on O(n ln m) processors. Additionally, it behaves well in practice, as experiments on a massively parallel machine have shown.

The more variables a polynomial involves, the sparser it will be in practice. For this reason, even if our algorithm can be generalized to polynomials in many variables, it will be less efficient, as it is inherently dense. A subject of further research will be how to parallelize sparse multivariate Hensel lifting algorithms [4, 5].

References

[1] Bernardin, L. Maple on a massively parallel, distributed memory machine. In Proceedings of PASCO '97 (1997). To appear.
[2] Bernardin, L., and Monagan, M. B. Efficient multivariate factorization over finite fields. In Proceedings of AAECC '97 (1997), Lecture Notes in Computer Science, Springer-Verlag. To appear.
[3] Geddes, K. O., Czapor, S. R., and Labahn, G. Algorithms for Computer Algebra. Kluwer Academic Publishers, Boston, 1992.
[4] Kaltofen, E. Sparse Hensel lifting. In Proceedings of Eurocal '85, Vol. II (1985), B. F. Caviness, Ed., vol. 204 of Lecture Notes in Computer Science, Springer-Verlag, pp. 4-17.
[5] Kaltofen, E., and Trager, B. M. Computing with polynomials given by black boxes for their evaluations: Greatest common divisors, factorization, separation of numerators and denominators. Journal of Symbolic Computation 9, 3 (March 1990), 300-320.
[6] Knuth, D. E. Seminumerical Algorithms, vol. 2 of The Art of Computer Programming. Addison-Wesley, 1981.
[7] Mulders, T., and Bernardin, L. An analysis of linear versus quadratic Hensel lifting. In preparation, 1997.
[8] Viry, G. Factorization of multivariate polynomials with coefficients in F_p. Journal of Symbolic Computation 15, 4 (April 1993), 371-392.
[9] Wang, P. S. Parallel polynomial operations on SMPs: an overview. Journal of Symbolic Computation 21, 4 (1996), 397-410.
[10] Zippel, R. E. Effective Polynomial Computation. Kluwer Academic Publishers, Boston, 1993.