Balanced Dense Polynomial Multiplication on Multicores

Balanced Dense Polynomial Multiplication on Multicores. Yuzhen Xie, SuperTech Group, CSAIL, MIT; joint work with Marc Moreno Maza, ORCCA, UWO. ACA09, Montreal, June 26, 2009.

Introduction

Motivation: multicore-enabling parallel polynomial arithmetic in computer algebra:
- fast dense polynomial multiplication via FFT,
- multivariate polynomials over finite fields.

Framework:
- Assume 1-D FFT (in particular 1-D TFT) as a black box.
- Rely on the modpn C library (shipped with Maple 13) for 1-D FFT and 1-D TFT computations and for modular integer arithmetic (Montgomery trick).
- Implement in Cilk++ targeting multicores: provably efficient work-stealing scheduling; easy-to-use, low-overhead parallel constructs (cilk_for, cilk_spawn, cilk_sync); Cilkscreen for data-race detection and parallelism analysis.

FFT-based Multivariate Multiplication (Algorithm Review)

Let k be a field and f, g ∈ k[x_1 < ... < x_n] be polynomials. Define d_i = deg(f, x_i) and d_i' = deg(g, x_i), for all i. Assume there exists a primitive s_i-th root of unity ω_i ∈ k for all i, where s_i is a power of 2 satisfying s_i ≥ d_i + d_i' + 1. Then fg can be computed as follows.

Step 1. Evaluate f and g at each point P (i.e. f(P), g(P)) of the n-dimensional grid ((ω_1^{e_1}, ..., ω_n^{e_n}), 0 ≤ e_1 < s_1, ..., 0 ≤ e_n < s_n) via n-D FFT.
Step 2. Evaluate fg at each point P of the grid, simply by computing f(P) g(P).
Step 3. Interpolate fg (from its values on the grid) via n-D FFT.
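As a concrete (and heavily simplified) illustration of Steps 1-3, the following Python sketch multiplies two bivariate polynomials over Z/257Z with a 2-D number-theoretic transform. The prime 257, its primitive root 3, and the recursive radix-2 transform are illustrative stand-ins for the 1-D FFT/TFT black box provided by modpn; none of the names below come from that library.

```python
# Sketch of Steps 1-3 for the bivariate case over k = Z/257Z.
# 257 is a small Fourier prime (257 - 1 = 2^8) with primitive root 3,
# chosen only for illustration.
P = 257

def ntt(a, w):
    """Evaluate a (length a power of 2) at the powers of the root w, mod P."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], w * w % P)
    odd = ntt(a[1::2], w * w % P)
    out, x = [0] * n, 1
    for i in range(n // 2):
        t = x * odd[i] % P
        out[i] = (even[i] + t) % P
        out[i + n // 2] = (even[i] - t) % P
        x = x * w % P
    return out

def root(s):
    """A primitive s-th root of unity in Z/257Z (requires s | 256)."""
    return pow(3, (P - 1) // s, P)

def transform_2d(grid, invert=False):
    """2-D transform = 1-D transforms along rows, then along columns."""
    s2, s1 = len(grid), len(grid[0])
    w1, w2 = root(s1), root(s2)
    if invert:
        w1, w2 = pow(w1, P - 2, P), pow(w2, P - 2, P)
    grid = [ntt(row, w1) for row in grid]
    cols = [ntt([grid[j][i] for j in range(s2)], w2) for i in range(s1)]
    grid = [[cols[i][j] for i in range(s1)] for j in range(s2)]
    if invert:
        inv = pow(s1 * s2, P - 2, P)   # divide by s1*s2 for the inverse
        grid = [[c * inv % P for c in row] for row in grid]
    return grid

def mul_2d(f, g, s1, s2):
    """fg for bivariate f, g given as coefficient grids h[j][i] of u^i v^j."""
    pad = lambda h: [[h[j][i] if j < len(h) and i < len(h[0]) else 0
                      for i in range(s1)] for j in range(s2)]
    F, G = transform_2d(pad(f)), transform_2d(pad(g))                    # Step 1
    H = [[F[j][i] * G[j][i] % P for i in range(s1)] for j in range(s2)]  # Step 2
    return transform_2d(H, invert=True)                                  # Step 3

# f = 1 + 2u + 3v + uv and g = 5 + u + v: degrees (1, 1) for both factors,
# so s1 = s2 = 4 >= d_i + d_i' + 1 = 3 suffices.
f = [[1, 2], [3, 1]]
g = [[5, 1], [1, 0]]
fg = mul_2d(f, g, 4, 4)
```

Since the product degrees (2, 2) fit inside the 4x4 grid, no wraparound occurs and `fg` equals the schoolbook product reduced mod 257.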

Performance of Bivariate Interpolation in Step 3 (d_1 = d_2)

[Figure: speedup of bivariate interpolation vs. number of cores (1 to 16), one curve per size d_1 = d_2 ∈ {2047, 4095, 8191, 16383}.]

Thanks to Dr. Frigo for his cache-efficient code for matrix transposition!

Performance of Bivariate Multiplication (d_1 = d_2 = d_1' = d_2')

[Figure: speedup of bivariate multiplication vs. number of cores (1 to 16), one curve per size d ∈ {1023, 2047, 4095, 8191}.]

Challenges: Irregular Input Data

[Figure: multiplication speedup vs. number of cores (1 to 16): linear speedup (reference); bivariate, degrees (32765, 63); 8-variate, all degrees 4; 4-variate, degrees (1023, 1, 1, 1023); univariate, degree 25427968.]

These unbalanced data patterns are common in symbolic computation.

Performance Analysis by VTune

No. | Degrees of the two input polynomials | Size of product
 1  | (8191, 8191)                         | 268402689
 2  | (259575, 258)                        | 268401067
 3  | (63, 63, 63, 63)                     | 260144641
 4  | 8 vars. of deg. 5                    | 214358881

No. | INST_RETIRED.ANY (x10^9) | Clocks per Instruction Retired | L2 Cache Miss Rate (x10^-3) | Modified Data Sharing Ratio (x10^-3) | Time on 8 cores (s)
 1  | 659.555                  | 0.810                          | 0.333                       | 0.078                                | 16.15
 2  | 713.882                  | 0.890                          | 0.735                       | 0.192                                | 19.52
 3  | 714.153                  | 0.854                          | 1.096                       | 0.635                                | 22.44
 4  | 1331.340                 | 1.418                          | 1.177                       | 0.576                                | 72.99

Complexity Analysis (1/2)

Let s = s_1 ⋯ s_n. The number of operations in k for computing fg via n-D FFT is

    (9/2) Σ_{i=1}^{n} (Π_{j≠i} s_j) s_i lg(s_i) + (n+1) s = (9/2) s lg(s) + (n+1) s.

Under our 1-D FFT black-box assumption, the span of Step 1 is (9/2)(s_1 lg(s_1) + ⋯ + s_n lg(s_n)), and the parallelism of Step 1 is lower bounded by

    s / max(s_1, ..., s_n).    (1)

Let L be the size of a cache line. For some constant c > 0, the number of cache misses of Step 1 is upper bounded by

    ncs / L + cs (1/s_1 + ⋯ + 1/s_n).    (2)

Complexity Analysis (2/2)

Let Q(s_1, ..., s_n) denote the total number of cache misses for the whole algorithm. For some constant c we obtain

    Q(s_1, ..., s_n) ≤ cs (n+1) / L + cs (1/s_1 + ⋯ + 1/s_n).    (3)

Since n / s^{1/n} ≤ 1/s_1 + ⋯ + 1/s_n, with equality when s_i = s^{1/n} holds for all i, we deduce for that choice of the s_i

    Q(s_1, ..., s_n) ≤ ncs (2/L + 1/s^{1/n}).    (4)

Remark 1: For n ≥ 2, Expr. (4) is minimized at n = 2 and s_1 = s_2 = √s. Moreover, when n = 2, under a fixed s = s_1 s_2, Expr. (1) is maximized at s_1 = s_2 = √s.
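Remark 1 can be checked numerically. The sketch below evaluates bound (3) for several factorizations of the same grid size s; the constants c = 1 and L = 64 are arbitrary placeholders chosen only to compare shapes, not measured machine parameters.

```python
from math import prod

# Cache-miss upper bound (3): Q <= c*s*(n+1)/L + c*s*(1/s_1 + ... + 1/s_n),
# evaluated for several factorizations of the same total size s = 2^20.
# c = 1 and L = 64 are illustrative constants, not measured values.
def q_bound(shape, c=1.0, L=64):
    s, n = prod(shape), len(shape)
    return c * s * (n + 1) / L + c * s * sum(1.0 / si for si in shape)

shapes = {
    "balanced bivariate": (2**10, 2**10),          # n = 2, s_1 = s_2 = sqrt(s)
    "skewed bivariate": (2**4, 2**16),             # n = 2, unbalanced
    "balanced 4-variate": (2**5, 2**5, 2**5, 2**5) # n = 4
}
bounds = {name: q_bound(shape) for name, shape in shapes.items()}
best = min(bounds, key=bounds.get)
```

As Remark 1 predicts, the balanced bivariate shape gives the smallest bound among these candidates.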

Our Solutions
(1) Contraction to bivariate from multivariate.
(2) Extension from univariate to bivariate.
(3) Balanced multiplication by extension and contraction.

Solution 1: Contraction to Bivariate from Multivariate

Example. Let f ∈ k[x, y, z] where k = Z/41Z, with d_x = d_y = 1, d_z = 3, in recursive dense representation:

f:
  z^0: y^0: [8, 40]   y^1: [7, 24]
  z^1: y^0: [16, 0]   y^1: [5, 2]
  z^2: y^0: [21, 17]  y^1: [3, 37]
  z^3: y^0: [18, 4]   y^1: [29, 16]
(each pair lists the coefficients of x^0 and x^1)

The coefficients (not the monomials) are stored in a contiguous array. The monomial x^{e_1} y^{e_2} z^{e_3} is indexed by e_1 + (d_x+1) e_2 + (d_x+1)(d_y+1) e_3.

Contracting f(x, y, z) to p(u, v) by x^{e_1} y^{e_2} ↦ u^{e_1 + (d_x+1) e_2}, z^{e_3} ↦ v^{e_3}:

p:
  v^0: [8, 40, 7, 24]
  v^1: [16, 0, 5, 2]
  v^2: [21, 17, 3, 37]
  v^3: [18, 4, 29, 16]
(each row lists the coefficients of u^0, u^1, u^2, u^3)

Remark 2: The coefficient array is essentially unchanged by contraction, which is a property of the recursive dense representation.
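Remark 2 can be verified on the example's own data. The sketch below (helper names `index3`, `index2` are ours) checks that every monomial of f occupies the same flat position before and after contraction, so the coefficient array is reused verbatim.

```python
# The trivariate example: f in (Z/41Z)[x, y, z] with d_x = d_y = 1, d_z = 3.
# In recursive dense representation, x^e1 y^e2 z^e3 sits at flat index
# e1 + (d_x+1)*e2 + (d_x+1)*(d_y+1)*e3.
coeffs = [8, 40, 7, 24, 16, 0, 5, 2, 21, 17, 3, 37, 18, 4, 29, 16]
dx, dy, dz = 1, 1, 3

def index3(e1, e2, e3):
    return e1 + (dx + 1) * e2 + (dx + 1) * (dy + 1) * e3

# Contraction x^e1 y^e2 -> u^(e1 + (dx+1)*e2), z^e3 -> v^e3 gives p(u, v)
# with d_u = (dx+1)*(dy+1) - 1 and d_v = dz, indexed by eu + (d_u+1)*ev.
du, dv = (dx + 1) * (dy + 1) - 1, dz

def index2(eu, ev):
    return eu + (du + 1) * ev

# Every monomial of f lands at the same flat position after contraction,
# so contraction does not move any coefficient (Remark 2).
same_position = all(
    index3(e1, e2, e3) == index2(e1 + (dx + 1) * e2, e3)
    for e1 in range(dx + 1) for e2 in range(dy + 1) for e3 in range(dz + 1)
)

# Reading the same array with the bivariate shape recovers the rows of p.
p_rows = [coeffs[index2(0, ev):index2(0, ev) + du + 1] for ev in range(dv + 1)]
```

The rows recovered this way are exactly the v^0, ..., v^3 rows shown for p above.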

Performance of Contraction (timing)

[Figure: time (s) vs. number of cores (1 to 16), 4-variate multiplication with degrees (1023, 1, 1, 1023): 4-D TFT method, size 2047x3x3x2047, vs. balanced 2-D TFT method, size 6141x6141.]

Performance of Contraction (speedup)

[Figure: speedup vs. number of cores (1 to 16) for the same problem: balanced 2-D TFT method, size 6141x6141 (with net speedup curve), vs. 4-D TFT method, size 2047x3x3x2047.]

Performance of Contraction for a Large Range of Problems (timing on 1 core)

[Figure: time (s) over d_1 = d_1' and d_4 = d_4' ranging over 1024..2048 (d_2 = d_2' = d_3 = d_3' = 1): 4-D TFT method (43.5-179.9 s); Kronecker substitution of 4-D to 1-D TFT (35.8- s); contraction of 4-D to 2-D TFT (19.8-86.2 s).]

Performance of Contraction for a Large Range of Problems (speedup)

[Figure: speedup over the same degree ranges: contraction of 4-D to 2-D TFT on 16 cores (8.2-13.2x speedup, 15.9-29.9x net gain); on 8 cores (6.5-7.7x speedup, 12.8-16.5x net gain); 4-D TFT method on 16 cores (2.7-3.4x speedup).]

Solution 2: Extension from Univariate to Bivariate

Example. Consider f, g ∈ k[x] univariate, with deg(f) = 7 and deg(g) = 8; fg has dense size 16. We compute an integer b such that fg can be performed via f_b g_b using nearly square 2-D FFTs, where f_b := Φ_b(f), g_b := Φ_b(g) and Φ_b : x^e ↦ u^{e rem b} v^{e quo b}.

Here b = 3 works, since deg(f_b g_b, u) = deg(f_b g_b, v) = 4; moreover the dense size of f_b g_b is 25.

Proposition. For any non-constant f, g ∈ k[x], one can always compute b such that |deg(f_b g_b, u) − deg(f_b g_b, v)| ≤ 2 and the dense size of f_b g_b is at most twice that of fg.

Extension of f(x) to f_b(u, v) in Recursive Dense Representation

f:
  x^0 x^1 x^2 x^3 x^4 x^5 x^6 x^7
  32  13  5   7   8   11  40  35

f_b (b = 3):
  v^0: [32, 13, 5]
  v^1: [7, 8, 11]
  v^2: [40, 35, 0]
(each row lists the coefficients of u^0, u^1, u^2)

Conversion to Univariate from the Bivariate Product

The bivariate product: deg(f_b g_b, u) = 4, deg(f_b g_b, v) = 4, with c_{i+5j} denoting the coefficient of u^i v^j:

f_b g_b:
  v^0: [c_0,  c_1,  c_2,  c_3,  c_4]
  v^1: [c_5,  c_6,  c_7,  c_8,  c_9]
  v^2: [c_10, c_11, c_12, c_13, c_14]
  ...

Convert to univariate via u^i v^j ↦ x^{i + 3j}: deg(fg, x) = 15.

fg:
  x^0: c_0        x^1: c_1        x^2: c_2
  x^3: c_3 + c_5  x^4: c_4 + c_6  x^5: c_7
  x^6: c_8 + c_10 x^7: c_9 + c_11 x^8: c_12
  and similarly for x^9, ..., x^15.

Remark 4: Converting back to fg from f_b g_b requires only one traversal of the coefficient array, and at most deg(fg, x) additions.
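The extension/conversion round trip can be sketched as follows. Here f carries the slide's coefficients while g's coefficients are made up for illustration, and the bivariate product is computed naively just to exercise Φ_b and the conversion (in the actual algorithm it would go through balanced 2-D FFTs).

```python
# Round trip of Solution 2 for the example: deg(f) = 7, deg(g) = 8, b = 3.
# Phi_b sends x^e to u^(e % b) v^(e // b); converting back accumulates the
# bivariate coefficient of u^i v^j into x^(i + b*j), which produces the
# c_3 + c_5, c_4 + c_6, ... pattern of overlapping terms.
P = 41  # same illustrative prime field as in Solution 1

def extend(poly, b):
    """Phi_b: univariate coefficient list -> dict {(i, j): coeff}."""
    return {(e % b, e // b): c for e, c in enumerate(poly) if c}

def mul_bivariate(fb, gb):
    """Naive bivariate product, standing in for the balanced 2-D FFT step."""
    res = {}
    for (i1, j1), c1 in fb.items():
        for (i2, j2), c2 in gb.items():
            key = (i1 + i2, j1 + j2)
            res[key] = (res.get(key, 0) + c1 * c2) % P
    return res

def convert_back(h, b, deg):
    """u^i v^j -> x^(i + b*j); overlapping terms are added (Remark 4)."""
    out = [0] * (deg + 1)
    for (i, j), c in h.items():
        out[i + b * j] = (out[i + b * j] + c) % P
    return out

f = [32, 13, 5, 7, 8, 11, 40, 35]   # deg 7, coefficients from the slide
g = [1, 2, 3, 4, 5, 6, 7, 8, 9]     # deg 8, made-up coefficients
b = 3
fg = convert_back(mul_bivariate(extend(f, b), extend(g, b)), b, 15)
```

Since i + b*j equals the original exponent e for each image monomial, the round trip reproduces the direct univariate product exactly.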

Remark 3: Our extension technique provides an alternative to the Schönhage-Strassen algorithm for handling problems whose sizes are too large for the available primitive roots of unity, a limitation of FFTs over finite fields.

Performance of Extension (timing)

[Figure: time (s) vs. number of cores (1 to 16), univariate multiplication of degree 25427968: 1-D TFT method, size 50855937, vs. balanced 2-D TFT method, size 10085x10085.]

Performance of Extension (speedup)

[Figure: speedup vs. number of cores (1 to 16) for the same problem: 1-D TFT method, size 50855937, vs. balanced 2-D TFT method, size 10085x10085 (with net speedup curve).]

Performance of Extension for a Large Range of Problems (timing)

[Figure: time (s) for degrees up to about 32.5x10^6: extension of 1-D to 2-D TFT on 1 core (2.2-80.1 s); 1-D TFT method on 1 core (1.8-59.7 s); extension on 2 cores (1.96-2.0x speedup, 1.5-1.7x net gain); extension on 16 cores (8.0-13.9x speedup, 6.5-11.5x net gain).]

Solution 3: Balanced Multiplication

Definition. A pair of bivariate polynomials p, q ∈ k[u, v] is balanced if deg(p, u) + deg(q, u) = deg(p, v) + deg(q, v).

Algorithm. Let f, g ∈ k[x_1 < ... < x_n]. W.l.o.g. one can assume d_1 >> d_i and d_1' >> d_i' for 2 ≤ i ≤ n (up to variable re-ordering and contraction). Then we obtain fg by:
Step 1. Extending x_1 to {u, v}.
Step 2. Contracting {v, x_2, ..., x_n} to v.

Remark 5: The above extension Φ_b can be determined such that (f_b, g_b) is (nearly) a balanced pair and f_b g_b has dense size at most twice that of fg.
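Remark 5 can be illustrated with a brute-force search for b; this is only a sketch of the balance criterion, not the paper's actual determination of Φ_b, and the helper name `pick_b` is ours.

```python
# Brute-force search for an extension exponent b making (f_b, g_b) nearly
# balanced. For univariate inputs of degrees df and dg:
#   deg(f_b g_b, u) = min(df, b-1) + min(dg, b-1),
#   deg(f_b g_b, v) = df//b + dg//b,
# and we require the dense size of f_b g_b to stay within twice that of fg.
def pick_b(df, dg):
    best, best_skew = None, None
    for b in range(2, max(df, dg) + 1):
        deg_u = min(df, b - 1) + min(dg, b - 1)
        deg_v = df // b + dg // b
        skew = abs(deg_u - deg_v)                 # balance defect
        size = (deg_u + 1) * (deg_v + 1)          # dense size of f_b g_b
        if size <= 2 * (df + dg + 1) and (best is None or skew < best_skew):
            best, best_skew = b, skew
    return best, best_skew

b, skew = pick_b(7, 8)   # the Solution 2 example: deg(f) = 7, deg(g) = 8
```

For the example the search recovers b = 3 with a perfectly balanced pair, matching Solution 2.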

Performance of Balanced Multiplication for a Large Range of Problems (timing)

[Figure: time (s) for d_1 ranging over 32768..65536 (d_2 = d_3 = d_4 = 2): extension + contraction of 4-D to 2-D TFT on 1 core (7.6-15.7 s); Kronecker substitution of 4-D to 1-D TFT on 1 core (6.8-14.1 s); ext. + contr. on 2 cores (1.96x speedup, 1.75x net gain); ext. + contr. on 16 cores (7.0-11.3x speedup, 6.2-10.3x net gain).]

Summary and Future Work

We have obtained efficient techniques and implementations for dense polynomial multiplication on multicores:
- use balanced bivariate multiplication as a kernel,
- contract multivariate to bivariate,
- extend univariate to bivariate,
- combine contraction and extension.
Our work in progress includes normal forms, GCDs/resultants, and a polynomial system solver via triangular decompositions.

Acknowledgements: This work was supported by NSERC and the MITACS NCE of Canada, and by NSF Grants 0540248, 0615215, 0541209, and 0621511. We are very grateful for the help of Professor Charles E. Leiserson at MIT, Dr. Matteo Frigo, and all other members of Cilk Arts. Thanks to all who have been supporting us!