Toward High Performance Matrix Multiplication for Exact Computation

Size: px

Start display at page:

Download "Toward High Performance Matrix Multiplication for Exact Computation"

Amberly Norris
6 years ago
Views:

1 Toward High Performance Matrix Multiplication for Exact Computation Pascal Giorgi Joint work with Romain Lebreton (U. Waterloo) Funded by the French ANR project HPAC Séminaire CASYS - LJK, April 2014

2 Motivations Matrix multiplication plays a central role in computer algebra. algebraic complexity of O(n ω ) with ω < [Williams 2011] Modern processors provide many levels of parallelism. superscalar, SIMD units, multiple cores High performance matrix multiplication numerical computing = classic algorithm + hardware arithmetic exact computing numerical computing algebraic algorithm is not the most efficient ( complexity model) arithmetic is not directly in the hardware (e.g. Z, F q, Z[x], Q[x, y, z]).

3 Motivation : Superscalar processor with SIMD Hierarchical memory : L1 cache : 32kB - 4 cycles L2 cache : 256kB - 12 cycles L3 cache : 8MB - 36 cycles RAM : 32GB - 36 cycles + 57ns

4 Motivations : practical algorithms High performance algorithms (rule of thumb) best asymptotic complexity is not alway faster : constants matter better arithmetic count is not always faster : caches matter process multiple data at the same time : vectorization fine/coarse grain task parallelism matter : multicore parallelism

5 Motivations : practical algorithms High performance algorithms (rule of thumb) best asymptotic complexity is not alway faster : constants matter better arithmetic count is not always faster : caches matter process multiple data at the same time : vectorization fine/coarse grain task parallelism matter : multicore parallelism Our goal : try to incorporate these rules into exact matrix multiplications

6 Outline 1 Matrix multiplication with small integers 2 Matrix multiplication with multi-precision integers 3 Matrix multiplication with polynomials

7 Outline 1 Matrix multiplication with small integers 2 Matrix multiplication with multi-precision integers 3 Matrix multiplication with polynomials

8 Matrix multiplication with small integers This corresponds to the case where each integer result holds in one processor register : where s is the register size. A, B Z n n such that AB < 2 s Main interests ring isomorphism : computation over Z/pZ is congruent to Z/2 s Z when p(n 1) 2 < 2 s. its a building block for matrix mutiplication with larger integers

9 Matrix multiplication with small integers Two possibilities for hardware support : use floating point mantissa, i.e. s = 2 53, use native integer, i.e. s = Using floating point historically, the first approach in computer algebra [Dumas, Gautier, Pernet 2002] out of the box performance from optimized BLAS handle matrix with entries < 2 26 Using native integers apply same optimizations as BLAS libraries [Goto, Van De Geijn 2008] handle matrix with entries < 2 32

10 Matrix multiplication with small integers floating point integers Nehalem (2008) SSE4 128-bits 1 mul+1 add 1 mul+2 add Sandy Bridge (2011) AVX 256-bits 1 mul+1 add Haswell (2013) AVX2 256-bits 2 FMA 1 mul+2 add # vector operations per cycle (pipelined) OpenBlas: double - SSE OpenBlas: double - AVX OpenBlas: double - AVX2 Our code: int - SSE Our code: int - AVX2 30 GFLOPS Matrix dimensions benchmark on Intel 2.60GHz

11 Matrix multiplication with small integers Matrix multiplication modulo a small integer Let p such that (p 1) 2 n < perform the multiplication in Z using BLAS 2 reduce the result modulo p OpenBlas mod p - Classic mod p - Winograd 40 GFLOPS Matrix dimensions benchmark on Intel 2.60GHz

12 Outline 1 Matrix multiplication with small integers 2 Matrix multiplication with multi-precision integers 3 Matrix multiplication with polynomials

13 Matrix multiplication with multi-precision integers Direct approach Let M(k) be the bit complexity of k-bit integers multiplication and A, B Z n n such that A, B O(2 k ). Computing AB using direct algorithm costs n ω M(k) bit operations. not best possible complexity, i.e. M(k) is super-linear not efficient in practice Remark: Use evaluation/interpolation technique for better performances!!!

14 Multi-modular matrix multiplication Multi-modular approach AB < M = k m i, i=1 with primes m i O(1) then AB can be reconstructed with the CRT from (AB) mod m i. 1 for each m i compute A i = A mod m i and B i = B mod m i 2 for each m i compute C i = A i B i mod m i 3 reconstruct C = AB from (C 1,..., C k ) Bit complexity : O(n ω k + n 2 R(k)) where R(k) is the cost of reduction/reconstruction

15 Multi-modular matrix multiplication Multi-modular approach AB < M = k m i, i=1 with primes m i O(1) then AB can be reconstructed with the CRT from (AB) mod m i. 1 for each m i compute A i = A mod m i and B i = B mod m i 2 for each m i compute C i = A i B i mod m i 3 reconstruct C = AB from (C 1,..., C k ) Bit complexity : O(n ω k + n 2 R(k)) where R(k) is the cost of reduction/reconstruction R(k) = O(M(k) log(k)) using divide and conquer strategy R(k) = O(k 2 ) using naive approach

16 Multi-modular matrix multiplication Improving naive approach with linear algebra reduction/reconstruction of n 2 data corresponds to matrix multiplication improve the bit complexity from O(n 2 k 2 ) to O(n 2 k ω 1 ) benefit from optimized matrix multiplication, i.e. SIMD Remark : A similar approach has been used by [Doliskani, Schost 2010] in a non-distributed code.

17 Multi-modular reductions of an integer matrix Let us assume M = k i=1 m i < β k with m i < β. Multi-reduction of a single entry Let a = a 0 + a 1 β +... a k 1 β k 1 be a value to reduce modm i then a m1 1 β m1... β k 1 m1 a0 m 1. = Q. a mk 1 β mk... β k 1 mk a k 1 m k with Q < kβ 2 Lemma : if kβ 2 O(1) than the reduction of n 2 integers modulo the m i s costs O(n 2 k ω 1 ) + O(n 2 k) bit operations.

18 Multi-modular reconstruction of an integer matrix Let us assume M = k i=1 m i < β k with m i < β and M i = M/m i CRT formulae : a = ( Reconstruction of a single entry k i=1 a m1 M i M 1 i mi ) mod M Let M i M 1 i mi = α (i) 0 + α(i) 1 β +... α(i) k 1 βk 1 be the CRT constants, then a 0 α (1) 0... α (1) k 1 a m1. = a k 1 α (k) 0... α (k) a mk k 1 with a i < kβ 2 and a = a a k 1 β k 1 mod M the CRT solution. Lemma : if kβ 2 O(1) than the reconstruction of n 2 integers from their images modulo the m i s costs O(n 2 k ω 1 ) + O (n 2 k) bit operations.

19 Matrix multiplication with multi-precision integers Implementation of multi-modular approach choose β = 2 16 to optimize β-adic conversions choose m i s.t. nβm i < 2 53 and use BLAS dgemm use a linear storage for multi-modular matrices Compare sequential performances with : FLINT library 1 : uses divide and conquer Mathemagix library 2 : uses divide and conquer Doliskani s code 3 : uses dgemm for reductions only courtesy of J. Doliskani

20 Matrix multiplication with multi-precision integers Flint Mathemagix Doliskani s code Our code Integer matrix multiplication (matrix dim = 32) 4.00 Time in seconds Entry bitsize benchmark on Intel 2.0GHz

21 Matrix multiplication with multi-precision integers Flint Mathemagix Doliskani s code Our code Integer matrix multiplication (matrix dim = 128) Time in seconds Entry bitsize benchmark on Intel 2.0GHz

22 Matrix multiplication with multi-precision integers Flint Mathemagix Doliskani s code Our code Integer matrix multiplication (matrix dim = 512) Time in seconds Entry bitsize benchmark on Intel 2.0GHz

23 Matrix multiplication with multi-precision integers Flint Mathemagix Doliskani s code Our code Integer matrix multiplication (matrix dim = 1024) Time in seconds Entry bitsize benchmark on Intel 2.0GHz

24 Parallel multi-modular matrix multiplication 1 for i = 1... k compute A i = A mod m i and B i = B mod m i 3 reconstruct C = AB from (C 1,..., C k ) Parallelization of multi-modular reduction/reconstruction each thread reduces (resp. reconstructs) a chunk of the given matrix thread 0 thread 1 thread 2 thread 3 A 0 = A mod m 0 A 1 = A mod m 1 A 2 = A mod m 2 A 3 = A mod m 3 A 4 = A mod m 4 A 5 = A mod m 5 A 6 = A mod m 6

25 Parallel multi-modular matrix multiplication 2 for i = 1... k compute C i = A i B i mod m i Parallelization of modular multiplication each thread computes a bunch of matrix multiplications modm i thread 0 C 0 = A 0B 0 mod m 0 C 1 = A 1B 1 mod m 1 thread 1 C 2 = A 2B 2 mod m 2 C 3 = A 3B 3 mod m 3 thread 2 C 4 = A 4B 4 mod m 4 C 5 = A 5B 5 mod m 5 thread 3 C 6 = A 6B 6 mod m 6

26 Parallel Matrix multiplication with multi-precision integers based on OpenMP task CPU affinity (hwloc-bind), allocator (tcmalloc) still under progress for better memory strategy!!! Speedup of parallel integer matrix mul5plica5on speedup dim = 1024 bitsize = 1024 dim=1024 bitsize = 4096 dim = 256 bitsize = number of cores benchmark on Intel 2.0GHz (2 NUMA with 6 cores)

27 Outline 1 Matrix multiplication with small integers 2 Matrix multiplication with multi-precision integers 3 Matrix multiplication with polynomials

28 Matrix multiplication over F p [x] We consider the easiest case : A, B F p [x] n n such that deg(ab) < k = 2 t p is a Fourier prime, i.e. p = 2 t q + 1 p is such that n(p 1) 2 < 2 53 Complexity O(n ω k + n 2 k log(k)) op. in F p using evaluation/interpolation with FFT

29 Matrix multiplication over F p [x] We consider the easiest case : A, B F p [x] n n such that deg(ab) < k = 2 t p is a Fourier prime, i.e. p = 2 t q + 1 p is such that n(p 1) 2 < 2 53 Complexity O(n ω k + n 2 k log(k)) op. in F p using evaluation/interpolation with FFT Remark: using Vandermonde matrix on can get a similar approach as for integers, i.e. O(n ω k + n 2 k ω 1 )

30 Matrix multiplication over F p [x] Evaluation/Interpolation scheme Let θ a primitive kth root of unity in F p. 1 for i = 1... k compute A i = A(θ i 1 ) and B i = B(θ i 1 ) 2 for i = 1... k compute C i = A i B i F p 3 interpolate C = AB from (C 1,..., C k ) steps 1 and 3 : O(n 2 ) call to FFT k over F p [x] step 2 : k matrix multiplications modulo a small prime p

31 FFT with SIMD over F p Butterly operation modulo p compute X + Y mod p and (X Y )θ 2i mod p. Barret s modular multiplication with a constant (NTL) calculate into [0, 2p) to remove two conditionals [Harvey 2014] Let X, Y [0, 2p), W [0, p), p < β/4 and W = W β/p. Algorithm: Butterfly(X,Y,W,W,p) 1: X := X + Y mod 2p 2: T := X Y + 2p 3: Q := W T /β 1 high short product 4: Y := (WT Qp) mod β 2 low short products 5: return (X, Y )

32 FFT with SIMD over F p Butterly operation modulo p compute X + Y mod p and (X Y )θ 2i mod p. Barret s modular multiplication with a constant (NTL) calculate into [0, 2p) to remove two conditionals [Harvey 2014] Let X, Y [0, 2p), W [0, p), p < β/4 and W = W β/p. Algorithm: Butterfly(X,Y,W,W,p) 1: X := X + Y mod 2p 2: T := X Y + 2p 3: Q := W T /β 1 high short product 4: Y := (WT Qp) mod β 2 low short products 5: return (X, Y ) SSE/AVX provide 16 or 32-bits low short product no high short product available (use full product)

33 Matrix multiplication over F p [x] Implementation radix-4 FFT with 128-bits SSE (29 bits primes) BLAS-based matrix multiplication over F p [FFLAS-FFPACK library] Time in seconds 1 024,00 512,00 256,00 128,00 64,00 32,00 16,00 8,00 4,00 2,00 1,00 0,50 0,25 0,13 0,06 0,03 0,02 Polynomial Matrix Mul9plica9on Performances deg=1024 deg=2048 deg=4096 deg=8192 deg=512 deg=256 deg=512 deg=64 deg=32 n=16 n=16 n=16 n=16 n=128 n=256 n=512 n=1024 n=2048 Matrix with dimension (n) and degree (deg) FLINT MMX our code benchmark on Intel 2.0GHz

34 Matrix multiplication over Z[x] A, B Z[x] n n such that deg(ab) < d and (AB) i < k Complexity O (n ω d log(d) log(k)) bit op. using Kronecker substitution O(n ω d log(k) + n 2 d log(d) log(k)) bit op. using CRT+FFT Remark: if the result s degree and bitsize are not too large, CRT with Fourier primes might suffice.

35 Matrix multiplication over Z[x] Implementation use CRT with Fourier primes re-use multi-modular reduction/reconstruction with linear algebra re-use multiplication in F p [x] Time in seconds FLINT our code 1 log(k)=64 log(k)=128 log(k)=256 log(k)=512 log(k)=64 log(k)=128 log(k)=256 n=16 ; deg=8192 n=128 ; deg=512 benchmark on Intel 2.0GHz

36 Parallel Matrix multiplication over Z[x] Very first attempt (work still in progress) parallel CRT with linear algebra (same code as in Z case) perform each multiplication over F p [x] in parallel some part of the code still sequential n d log(k) 6 cores 12 cores time seq s s s s

37 Polynomial Matrix in LinBox (proposition) Generic handler class for Polynomial Matrix template<s i z e t type, s i z e t s t o r a g e, c l a s s F i e l d > c l a s s P o l y n o m i a l M a t r i x ; Specialization for different memory strategy // M a t r i x o f p o l y n o m i a l s template<c l a s s F i e l d > c l a s s P o l y n o m i a l M a t r i x <PMType : : p o l f i r s t, PMStorage : : p l a i n, F i e l d >; // P o l y n o m i a l o f m a t r i c e s template<c l a s s F i e l d > c l a s s P o l y n o m i a l M a t r i x <PMType : : m a t f i r s t, PMStorage : : p l a i n, F i e l d >; // P o l y n o m i a l o f m a t r i c e s ( p a r t i a l view on monomials ) template<c l a s s F i e l d > c l a s s P o l y n o m i a l M a t r i x <PMType : : m a t f i r s t, PMStorage : : view, F i e l d >;

38 Conclusion High performance tools for exact linear algebra : matrix multiplication through floating points multi-dimensional CRT FFT for polynomial over wordsize prime fields adaptative matrix representation We provide in the LinBox library ( efficient sequential/parallel matrix multiplication over Z efficient sequential matrix multiplication over F p [x] and Z[x]

Dense Arithmetic over Finite Fields with CUMODP

Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,