Balanced Dense Polynomial Multiplication on Multicores

1 Balanced Dense Polynomial Multiplication on Multicores Yuzhen Xie SuperTech Group, CSAIL MIT joint work with Marc Moreno Maza ORCCA, UWO ACA09, Montreal, June 26, 2009

3 Introduction
Motivation: multicore-enabling parallel polynomial arithmetic in computer algebra; fast dense polynomial multiplication via FFT; multivariate polynomials over finite fields.
Framework:
- Assume 1-D FFT (in particular 1-D TFT) as a black box.
- Rely on the modpn C library (shipped with Maple 13) for 1-D FFT and 1-D TFT computations and for modular integer arithmetic (Montgomery trick).
- Implement in Cilk++ targeting multicores: provably efficient work-stealing scheduling; easy-to-use, low-overhead parallel constructs (cilk_for, cilk_spawn, cilk_sync); Cilkscreen for data-race detection and parallelism analysis.
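
To make these constructs concrete, here is a minimal sketch (not from the talk) of how they are typically used; it assumes a Cilk-enabled C++ compiler (Cilk++/OpenCilk) and uses a stubbed 1-D FFT standing in for the modpn black box.

    #include <cilk/cilk.h>

    // Stub for the 1-D FFT black box (provided by modpn in the real code).
    void fft_1d(long *data, long n, long p) { (void)data; (void)n; (void)p; }

    // cilk_for: the s1 row transforms are independent, so the scheduler
    // may run them on any available cores.
    void fft_rows(long *a, long s1, long s2, long p) {
        cilk_for (long i = 0; i < s1; ++i)
            fft_1d(a + i * s2, s2, p);
    }

    // cilk_spawn / cilk_sync: fork-join parallelism over two halves.
    void fft_two_halves(long *a, long n, long p) {
        cilk_spawn fft_1d(a, n / 2, p);   // runs in parallel with the next call
        fft_1d(a + n / 2, n / 2, p);
        cilk_sync;                        // wait for the spawned call to finish
    }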

7 FFT-based Multivariate Multiplication (Algorithm Review)
Let k be a field and f, g ∈ k[x_1 < ... < x_n] be polynomials. Define d_i = deg(f, x_i) and d'_i = deg(g, x_i), for all i. Assume there exists a primitive s_i-th root of unity ω_i ∈ k for all i, where s_i is a power of 2 satisfying s_i ≥ d_i + d'_i + 1. Then fg can be computed as follows.
Step 1. Evaluate f and g at each point P (i.e. compute f(P), g(P)) of the n-dimensional grid ((ω_1^e1, ..., ω_n^en), 0 ≤ e_1 < s_1, ..., 0 ≤ e_n < s_n) via n-D FFT.
Step 2. Evaluate fg at each point P of the grid, simply by computing f(P) g(P).
Step 3. Interpolate fg (from its values on the grid) via n-D FFT.
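
A rough sketch (not the authors' code) of these three steps in the bivariate case n = 2: the 2-D transforms are built from the 1-D black boxes assumed above (declared but not implemented here; in the talk's setting they come from modpn), and modular-arithmetic details such as Montgomery multiplication and the scaling of the inverse transform are glossed over.

    #include <cilk/cilk.h>
    #include <vector>

    // 1-D black boxes: forward and inverse FFT/TFT of length n over k = Z/pZ.
    // (Declarations only; the inverse is assumed to include its scaling.)
    void fft_1d(long *data, long n, long p);
    void inv_fft_1d(long *data, long n, long p);

    // 2-D transform of an s1 x s2 row-major array: 1-D transforms on all rows,
    // then on all columns. Each cilk_for exposes s1 (resp. s2) independent
    // 1-D transforms, which is exactly the parallelism counted in this talk.
    void fft_2d(long *a, long s1, long s2, long p, bool inverse) {
        void (*t)(long *, long, long) = inverse ? inv_fft_1d : fft_1d;
        cilk_for (long i = 0; i < s1; ++i)
            t(a + i * s2, s2, p);
        cilk_for (long j = 0; j < s2; ++j) {         // columns, via a temp buffer
            std::vector<long> col(s1);
            for (long i = 0; i < s1; ++i) col[i] = a[i * s2 + j];
            t(col.data(), s1, p);
            for (long i = 0; i < s1; ++i) a[i * s2 + j] = col[i];
        }
    }

    // Steps 1-3 for n = 2: f and g are zero-padded s1 x s2 coefficient arrays;
    // the product fg is returned in place of f.
    void multiply_2d(long *f, long *g, long s1, long s2, long p) {
        fft_2d(f, s1, s2, p, false);                 // Step 1: evaluate f on the grid
        fft_2d(g, s1, s2, p, false);                 // Step 1: evaluate g on the grid
        cilk_for (long t = 0; t < s1 * s2; ++t)      // Step 2: pointwise products
            f[t] = (f[t] * g[t]) % p;                // (overflow handling omitted)
        fft_2d(f, s1, s2, p, true);                  // Step 3: interpolate fg
    }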

8 Performance of Bivariate Interpolation in Step 3 (d_1 = d_2)
[Figure: speedup of bivariate interpolation vs. number of cores.]
Thanks to Dr. Frigo for his cache-efficient code for matrix transposition!

9 Performance of Bivariate Multiplication (d_1 = d_2 = d'_1 = d'_2)
[Figure: speedup of bivariate multiplication vs. number of cores.]

10 Challenges: Irregular Input Data
[Figure: multiplication speedup vs. number of cores, comparing linear speedup with bivariate (32765, 63), 8-variate (all degrees 4), 4-variate (1023, 1, 1, 1023) and univariate inputs.]
These unbalanced data patterns are common in symbolic computation.

11 Performance Analysis by VTune
[Table: for each benchmark, the number of variables, the degrees of the two input polynomials, the size of the product, INST_RETIRED.ANY (x 10^9), clocks per instruction retired, L2 cache miss rate (x 10^-3), modified data sharing ratio (x 10^-3), and the time on 8 cores (s).]

14 Complexity Analysis (1/2)
Let s = s_1 ... s_n. The number of operations in k for computing fg via n-D FFT is
    (9/2) Σ_{i=1..n} (Π_{j≠i} s_j) s_i lg(s_i) + (n+1) s  =  (9/2) s lg(s) + (n+1) s.
Under our 1-D FFT black box assumption, the span of Step 1 is (9/2)(s_1 lg(s_1) + ... + s_n lg(s_n)), and the parallelism of Step 1 is lower bounded by s / max(s_1, ..., s_n).   (1)
Let L be the size of a cache line. For some constant c > 0, the number of cache misses of Step 1 is upper bounded by
    c s n / L + c s (1/s_1 + ... + 1/s_n).   (2)
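
To see what the parallelism bound (1) means in practice (an illustrative calculation, not from the slides): for a univariate product of size s = 2^26 done by a single 1-D FFT, the bound s / max(s_1) is 1, i.e. the black-box assumption exposes no parallelism at all; splitting the same size into a balanced bivariate grid s_1 = s_2 = 2^13 raises the bound to s / max(s_1, s_2) = 2^13.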

17 Complexity Analysis (2/2)
Let Q(s_1, ..., s_n) denote the total number of cache misses for the whole algorithm. For some constant c we obtain
    Q(s_1, ..., s_n) ≤ c s (n+1)/L + c s (1/s_1 + ... + 1/s_n).   (3)
Since n / s^(1/n) ≤ 1/s_1 + ... + 1/s_n, the bound in (3) is minimized for equal s_i, and we deduce, when s_i = s^(1/n) holds for all i,
    Q(s_1, ..., s_n) ≤ n c s (2/L + 1/s^(1/n)).   (4)
Remark 1: For n ≥ 2, Expr. (4) is minimized at n = 2 and s_1 = s_2 = √s. Moreover, when n = 2 and for a fixed s = s_1 s_2, Expr. (1) is maximized at s_1 = s_2 = √s.
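
As a numerical illustration of Remark 1 (not from the slides): with s = 2^26 and a cache line of L = 64 coefficients, Expr. (4) evaluates to about 2cs(2/64 + 2^-13) ≈ 0.063 cs for n = 2, but to about 4cs(2/64 + 2^-6.5) ≈ 0.17 cs for n = 4, so among grids with equal s_i the bivariate one incurs the fewest cache misses.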

18 Our Solutions (1) Contraction to bivariate from multivariate (2) Extension from univariate to bivariate (3) Balanced multiplication by extension and contraction

21 Solution 1: Contraction to Bivariate from Multivar.
Example. Let f ∈ k[x, y, z] where k = Z/41Z, with d_x = d_y = 1, d_z = 3, in recursive dense representation:
[Table: the coefficient array of f, with one block per z^0, z^1, z^2, z^3, each block holding the coefficients for y^0, y^1 (and, innermost, x^0, x^1).]
The coefficients (not the monomials) are stored in a contiguous array; the monomial x^e1 y^e2 z^e3 is indexed by e_1 + (d_x + 1) e_2 + (d_x + 1)(d_y + 1) e_3.
Contracting f(x, y, z) to p(u, v) by x^e1 y^e2 ↦ u^(e_1 + (d_x + 1) e_2), z^e3 ↦ v^e3:
[Table: the same coefficient array, now read as p(u, v), with one block per v^0, v^1, v^2, v^3.]
Remark 2: The coefficient array is essentially unchanged by contraction; this is a property of the recursive dense representation.
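
A minimal sketch (not from the talk) of the index bookkeeping behind this contraction: the same flat array is addressed either through the trivariate index or through the contracted bivariate index, and the two always agree, so no data is moved.

    // Flat index of x^e1 y^e2 z^e3 in the recursive dense representation of f,
    // with partial degrees dx, dy, dz (dx = dy = 1, dz = 3 in the example).
    long index3(long e1, long e2, long e3, long dx, long dy) {
        return e1 + (dx + 1) * e2 + (dx + 1) * (dy + 1) * e3;
    }

    // Flat index of u^a v^e3 after contracting {x, y} to u, where
    // a = e1 + (dx+1)*e2 and deg(p, u) = du = (dx+1)*(dy+1) - 1.
    long index2(long a, long e3, long du) {
        return a + (du + 1) * e3;
    }

    // Since du + 1 = (dx+1)*(dy+1), index2(e1 + (dx+1)*e2, e3, du) equals
    // index3(e1, e2, e3, dx, dy) for every monomial: the coefficient array
    // of p(u, v) is literally the coefficient array of f(x, y, z).

When two polynomials are multiplied through contraction, the strides have to accommodate the partial degrees of the product rather than of a single factor; this is why the benchmark below maps inputs of shape 2047 x 3 x 3 x 2047 to a 6141 x 6141 bivariate problem.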

22 Performance of Contraction (timing)
[Figure: time (seconds) vs. number of cores for 4-variate multiplication with degrees (1023, 1, 1, 1023), comparing the 4-D TFT method (size 2047 x 3 x 3 x 2047) with the balanced 2-D TFT method via contraction (size 6141 x 6141).]

23 Performance of Contraction (speedup)
[Figure: speedup vs. number of cores for 4-variate multiplication with degrees (1023, 1, 1, 1023): balanced 2-D TFT method (size 6141 x 6141), 4-D TFT method (size 2047 x 3 x 3 x 2047), and net speedup.]

24 Performance of Contraction for a Large Range of Problems (timing on 1 processor)
[Figure: time vs. d_1 = d'_1 and d_4 = d'_4 (with d_2 = d'_2 = d_3 = d'_3 = 1), comparing the 4-D TFT method on 1 core, Kronecker substitution of 4-D to 1-D TFT on 1 core, and contraction of 4-D to 2-D TFT on 1 core.]

25 Performance of Contraction for a Large Range of Problems (speedup)
[Figure: speedup vs. d_1 = d'_1 and d_4 = d'_4 (with d_2 = d'_2 = d_3 = d'_3 = 1), comparing contraction of 4-D to 2-D TFT on 16 cores and on 8 cores (speedups and net gains) with the 4-D TFT method on 16 cores.]

28 Solution 2: Extension from Univariate to Bivariate
Example: Consider f, g ∈ k[x] univariate, with deg(f) = 7 and deg(g) = 8; fg has dense size 16. We compute an integer b such that fg can be obtained from f_b g_b using nearly square 2-D FFTs, where f_b := Φ_b(f), g_b := Φ_b(g) and Φ_b : x^e ↦ u^(e rem b) v^(e quo b).
Here b = 3 works, since deg(f_b g_b, u) = deg(f_b g_b, v) = 4; moreover, the dense size of f_b g_b is 25.
Proposition: For any non-constant f, g ∈ k[x], one can always compute b such that |deg(f_b g_b, u) - deg(f_b g_b, v)| ≤ 2 and the dense size of f_b g_b is at most twice that of fg.
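
One simple way to choose b, consistent with this example though not necessarily the authors' exact rule, is to aim for deg(f_b g_b, u) ≈ deg(f_b g_b, v); a small sketch:

    #include <cmath>

    // Partial degrees of f_b = Phi_b(f) for univariate f of degree df:
    //   deg(f_b, u) = min(df, b - 1),   deg(f_b, v) = df / b (integer division).
    // Taking b near sqrt((df + dg) / 2) makes the u- and v-degrees of the
    // product f_b * g_b roughly equal.
    long choose_b(long df, long dg) {
        long b = (long)std::lround(std::sqrt((df + dg) / 2.0));
        return b < 1 ? 1 : b;
    }
    // For df = 7 and dg = 8 this returns b = 3; the product then has
    // deg_u = deg_v = 4 and dense size 25 <= 2 * 16, as stated above.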

30 Extension of f(x) to f_b(u, v) in Recursive Dense Representation
[Table: the coefficient array of f, indexed by x^0, ..., x^7.]
[Table: the same coefficients re-read as f_b, with one block per v^0, v^1, v^2, each block holding the coefficients of u^0, u^1, u^2; the last block is padded with a zero.]
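
The extension itself is only a re-indexing; a small sketch (hypothetical helper, not the modpn interface) building the array of f_b from the dense coefficient vector of f:

    #include <vector>

    // Recursive dense array of f_b = Phi_b(f): block e2 (the power of v) holds
    // the coefficients of u^0 .. u^(b-1), i.e. of x^(b*e2) .. x^(b*e2 + b - 1).
    std::vector<long> extend(const std::vector<long> &f, long b) {
        long df = (long)f.size() - 1;          // degree of f
        long rows = df / b + 1;                // deg(f_b, v) + 1
        std::vector<long> fb(rows * b, 0);     // zero padding for the last block
        for (long e = 0; e <= df; ++e)
            fb[(e / b) * b + (e % b)] = f[e];  // x^e -> u^(e rem b) v^(e quo b)
        return fb;
    }
    // Note that (e / b) * b + (e % b) == e: as with contraction, the array is
    // simply copied (and zero-padded); no coefficient changes its position.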

37 Conversion to Univariate from the Bivariate Product
The bivariate product: deg(f_b g_b, u) = 4, deg(f_b g_b, v) = 4.
[Table: the coefficient array of f_b g_b, one block of u^0, ..., u^4 per v^0, ..., v^4, with entries labelled c_0, c_1, ..., c_14 block by block.]
Convert to univariate: deg(fg, x) = 15. Each entry c_j standing for u^e1 v^e2 is added into the coefficient of x^(e1 + 3 e2); the coefficients of x^0, ..., x^8 are c_0, c_1, c_2, c_3 + c_5, c_4 + c_6, c_7, c_8 + c_10, c_9 + c_11, c_12, and similarly for x^9, ..., x^15.
Remark 4: Converting back to fg from f_b g_b requires only one traversal of the coefficient array and at most deg(fg, x) additions.
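
A sketch of this conversion (not the authors' code): every entry of the bivariate product array is added into position e1 + b*e2 of the univariate result, which reproduces the sums c_3 + c_5, c_4 + c_6, ... above in a single pass.

    #include <vector>

    // C holds the (du+1) x (dv+1) coefficients of f_b * g_b over Z/pZ, stored
    // block by block by increasing power of v; b is the extension parameter.
    std::vector<long> convert_back(const std::vector<long> &C,
                                   long du, long dv, long b, long p) {
        std::vector<long> fg(du + b * dv + 1, 0);
        for (long e2 = 0; e2 <= dv; ++e2)          // power of v
            for (long e1 = 0; e1 <= du; ++e1)      // power of u
                fg[e1 + b * e2] = (fg[e1 + b * e2] + C[e2 * (du + 1) + e1]) % p;
        return fg;
    }
    // In the example du = dv = 4 and b = 3, so fg is allocated up to x^16, but
    // the u^4 v^4 entry is zero and deg(fg, x) = 15 as stated on the slide.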

38 Remark 3
Our extension technique provides an alternative to the Schönhage-Strassen algorithm for handling problems whose sizes are too large for the available primitive roots of unity, a limitation of FFTs over finite fields.

39 Performance of Extension (timing)
[Figure: time (seconds) vs. number of cores for a univariate multiplication, comparing the 1-D TFT method with the balanced 2-D TFT method via extension (size 10085 x 10085).]

40 Performance of Extension (speedup)
[Figure: speedup vs. number of cores for the same univariate multiplication: 1-D TFT method, balanced 2-D TFT method (size 10085 x 10085), and net speedup.]

41 Performance of Extension for a Large Range of Problems (timing)
[Figure: time vs. the input degrees, comparing extension of 1-D to 2-D TFT on 1 core, the 1-D TFT method on 1 core, and extension of 1-D to 2-D TFT on 2 cores and 16 cores (speedups and net gains).]

44 Solution 3: Balanced Multiplication
Definition. A pair of bivariate polynomials p, q ∈ k[u, v] is balanced if deg(p, u) + deg(q, u) = deg(p, v) + deg(q, v).
Algorithm. Let f, g ∈ k[x_1 < ... < x_n]. W.l.o.g. one can assume d_1 >> d_i and d'_1 >> d'_i for 2 ≤ i ≤ n (up to variable re-ordering and contraction). Then we obtain fg by
Step 1. Extending x_1 to {u, v}.
Step 2. Contracting {v, x_2, ..., x_n} to v.
Remark 5: The above extension Φ_b can be determined such that f_b, g_b is a (nearly) balanced pair and f_b g_b has dense size at most twice that of fg.
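
A sketch of the degree bookkeeping behind this algorithm (illustrative only; the sizes below follow from the definitions of Φ_b and contraction, but the authors' implementation may choose b differently): extending x_1 with Φ_b gives a product of u-size 2b - 1, and contracting {v, x_2, ..., x_n} with product-sized strides gives a v-size of (d_1 quo b + d'_1 quo b + 1) · Π_{i≥2} (d_i + d'_i + 1); b is chosen to make the two close.

    #include <cstdio>
    #include <vector>

    // u- and v-sizes of the bivariate product obtained by extending x1 with
    // Phi_b and then contracting {v, x2, ..., xn}; d[i] and dp[i] are the
    // partial degrees of f and g in x_{i+1}.
    void product_sizes(const std::vector<long> &d, const std::vector<long> &dp,
                       long b, long &su, long &sv) {
        su = 2 * b - 1;
        sv = d[0] / b + dp[0] / b + 1;
        for (size_t i = 1; i < d.size(); ++i) sv *= d[i] + dp[i] + 1;
    }

    int main() {
        // A shape like the last benchmark, degrees (d1, 2, 2, 2) in both inputs;
        // d1 = 4095 is chosen here purely for illustration.
        std::vector<long> d = {4095, 2, 2, 2}, dp = d;
        for (long b = 64; b <= 1024; b *= 2) {   // scan a few candidate values of b
            long su, sv;
            product_sizes(d, dp, b, su, sv);
            std::printf("b = %4ld   u-size = %5ld   v-size = %5ld\n", b, su, sv);
        }
        return 0;   // the nearly balanced choices sit where u-size ~ v-size
    }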

45 Performance of Balanced Multiplication for a Large Range of Problems (timing)
[Figure: time vs. d_1 and d'_1 (with d_2 = d_3 = d_4 = 2 and d'_2 = d'_3 = d'_4 = 2), comparing extension + contraction of 4-D to 2-D TFT on 1 core, Kronecker substitution of 4-D to 1-D TFT on 1 core, extension + contraction on 2 cores (1.96x speedup, 1.75x net gain), and extension + contraction on 16 cores.]

46 Conclusion
Summary and future work: We have obtained efficient techniques and implementations for dense polynomial multiplication on multicores:
- use balanced bivariate multiplication as a kernel,
- contract multivariate to bivariate,
- extend univariate to bivariate,
- combine contraction and extension.
Our work in progress includes normal forms, GCD/resultant computations, and a polynomial solver via triangular decompositions.
Acknowledgements: This work was supported by NSERC and the MITACS NCE of Canada, and by NSF grants. We are very grateful for the help of Professor Charles E. Leiserson at MIT, Dr. Matteo Frigo and all other members of Cilk Arts. Thanks to all who have been supporting us!
