Toward High Performance Matrix Multiplication for Exact Computation


Toward High Performance Matrix Multiplication for Exact Computation
Pascal Giorgi, joint work with Romain Lebreton (U. Waterloo)
Funded by the French ANR project HPAC
Séminaire CASYS - LJK, April 2014

Motivations
Matrix multiplication plays a central role in computer algebra:
- algebraic complexity of O(n^ω) with ω < 2.3727 [Williams 2011]
Modern processors provide many levels of parallelism:
- superscalar, SIMD units, multiple cores
High performance matrix multiplication:
- numerical computing = classic algorithm + hardware arithmetic
- exact computing ≠ numerical computing:
  - the algebraic algorithm is not always the most efficient one (complexity model)
  - the arithmetic is not directly supported in hardware (e.g. Z, F_q, Z[x], Q[x, y, z])

Motivation: superscalar processor with SIMD
Hierarchical memory:
- L1 cache: 32 kB - 4 cycles
- L2 cache: 256 kB - 12 cycles
- L3 cache: 8 MB - 36 cycles
- RAM: 32 GB - 36 cycles + 57 ns

Motivations: practical algorithms
High performance algorithms (rules of thumb):
- the best asymptotic complexity is not always faster: constants matter
- a better arithmetic count is not always faster: caches matter
- process multiple data at the same time: vectorization
- fine/coarse grain task parallelism matters: multicore parallelism
Our goal: try to incorporate these rules into exact matrix multiplication.

Outline
1. Matrix multiplication with small integers
2. Matrix multiplication with multi-precision integers
3. Matrix multiplication with polynomials


Matrix multiplication with small integers
This corresponds to the case where each entry of the result holds in one processor register:
A, B ∈ Z^(n×n) such that |AB| < 2^s, where s is the register size.
Main interests:
- ring isomorphism: computation over Z/pZ can be carried out in Z/2^s Z when n(p-1)^2 < 2^s
- it is a building block for matrix multiplication with larger integers

Matrix multiplication with small integers
Two possibilities for hardware support:
- use the floating point mantissa, i.e. s = 53,
- use native integers, i.e. s = 64.
Using floating point:
- historically the first approach in computer algebra [Dumas, Gautier, Pernet 2002]
- out-of-the-box performance from optimized BLAS
- handles matrices with entries < 2^26
Using native integers:
- apply the same optimizations as BLAS libraries [Goto, Van De Geijn 2008]
- handles matrices with entries < 2^32

Matrix multiplication with small integers
Number of vector operations per cycle (pipelined):

  Nehalem (2008), SSE4 128-bit:      floating point: 1 mul + 1 add   integer: 1 mul + 2 add
  Sandy Bridge (2011), AVX 256-bit:  floating point: 1 mul + 1 add   integer: -
  Haswell (2013), AVX2 256-bit:      floating point: 2 FMA           integer: 1 mul + 2 add

[Plot: GFLOPS vs matrix dimension (32 to 8192) for OpenBLAS double (SSE, AVX, AVX2) and our integer code (SSE, AVX2); benchmark on Intel i7-4960HQ @ 2.60GHz]

Matrix multiplication with small integers
Matrix multiplication modulo a small integer: let p be such that (p-1)^2 n < 2^53.
1. perform the multiplication in Z using BLAS
2. reduce the result modulo p
[Plot: GFLOPS vs matrix dimension (32 to 8192) for OpenBLAS, mod p Classic and mod p Winograd; benchmark on Intel i7-4960HQ @ 2.60GHz]
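The two steps above can be sketched in a few lines. This is a minimal illustration, not the talk's code: a plain triple loop stands in for the BLAS dgemm call, and the name mul_mod_p is chosen here for the example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Multiply A and B modulo p via exact double-precision accumulation.
// Valid when (p-1)^2 * n < 2^53, so every dot product is computed exactly
// inside the 53-bit mantissa before the single reduction at the end.
std::vector<double> mul_mod_p(const std::vector<double>& A,
                              const std::vector<double>& B,
                              int n, double p) {
    std::vector<double> C(n * n, 0.0);
    // stand-in for a BLAS dgemm call: C = A * B exactly over Z
    for (int i = 0; i < n; ++i)
        for (int l = 0; l < n; ++l)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + l] * B[l * n + j];
    for (double& c : C) c = std::fmod(c, p);  // reduce once, modulo p
    return C;
}
```

Replacing the loop by dgemm gives exactly the BLAS-based scheme of the slide, since the reduction cost is only O(n^2).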

Outline
1. Matrix multiplication with small integers
2. Matrix multiplication with multi-precision integers
3. Matrix multiplication with polynomials

Matrix multiplication with multi-precision integers
Direct approach: let M(k) be the bit complexity of multiplying k-bit integers, and let A, B ∈ Z^(n×n) with |A|, |B| ∈ O(2^k). Computing AB with the direct algorithm costs O(n^ω M(k)) bit operations:
- not the best possible complexity, since M(k) is super-linear
- not efficient in practice
Remark: use an evaluation/interpolation technique for better performance!

Multi-modular matrix multiplication
Multi-modular approach: choose word-size primes m_i (each of O(1) machine words) such that |AB| < M = m_1 ... m_k; then AB can be reconstructed with the CRT from the images (AB) mod m_i.
1. for each m_i, compute A_i = A mod m_i and B_i = B mod m_i
2. for each m_i, compute C_i = A_i B_i mod m_i
3. reconstruct C = AB from (C_1, ..., C_k)
Bit complexity: O(n^ω k + n^2 R(k)), where R(k) is the cost of reduction/reconstruction:
- R(k) = O(M(k) log(k)) using a divide and conquer strategy
- R(k) = O(k^2) using the naive approach
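The three-step scheme can be demonstrated at toy scale. This sketch uses only two fixed primes and a brute-force modular inverse (a real implementation uses many primes and the extended Euclidean algorithm); all names are chosen for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// schoolbook n x n matrix product modulo m
std::vector<int64_t> matmul_mod(const std::vector<int64_t>& A,
                                const std::vector<int64_t>& B,
                                int n, int64_t m) {
    std::vector<int64_t> C(n * n, 0);
    for (int i = 0; i < n; ++i)
        for (int l = 0; l < n; ++l)
            for (int j = 0; j < n; ++j)
                C[i*n+j] = (C[i*n+j] + (A[i*n+l] % m) * (B[l*n+j] % m)) % m;
    return C;
}

// Multi-modular product with two word-size primes: compute C mod m1 and
// C mod m2, then lift each entry by the CRT.  Correct while |C| < m1*m2.
std::vector<int64_t> crt_matmul(const std::vector<int64_t>& A,
                                const std::vector<int64_t>& B,
                                int n, int64_t m1, int64_t m2) {
    auto C1 = matmul_mod(A, B, n, m1);
    auto C2 = matmul_mod(A, B, n, m2);
    int64_t inv = 1;                       // inverse of m1 mod m2, found
    while ((m1 * inv) % m2 != 1) ++inv;    // by brute force (fine for a sketch)
    std::vector<int64_t> C(n * n);
    for (int i = 0; i < n * n; ++i) {
        int64_t t = ((C2[i] - C1[i]) % m2 + m2) % m2;
        C[i] = C1[i] + m1 * ((t * inv) % m2);   // Garner-style lift
    }
    return C;
}
```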

Multi-modular matrix multiplication
Improving the naive approach with linear algebra:
- the reduction/reconstruction of n^2 data corresponds to a matrix multiplication
- improves the bit complexity from O(n^2 k^2) to O(n^2 k^(ω-1))
- benefits from optimized matrix multiplication, i.e. SIMD
Remark: a similar approach has been used by [Doliskani, Schost 2010] in code that was not distributed.

Multi-modular reductions of an integer matrix
Let us assume M = m_1 ... m_k < β^k, with m_i < β.
Multi-reduction of a single entry: let a = a_0 + a_1 β + ... + a_{k-1} β^{k-1} be the value to reduce modulo the m_i, and let Q be the k×k matrix with entries Q_{i,j} = |β^j|_{m_i}. Then

  ( |a|_{m_1}, ..., |a|_{m_k} )^T  ≡  Q · ( a_0, ..., a_{k-1} )^T   (mod the m_i)

where the entries of the product Q · (a_0, ..., a_{k-1})^T are bounded by kβ^2, so a single cheap final reduction mod each m_i finishes the job.
Lemma: if kβ^2 fits in O(1) machine words, then the reduction of n^2 integers modulo the m_i costs O(n^2 k^(ω-1)) + O(n^2 k) bit operations.

Multi-modular reconstruction of an integer matrix
Let us assume M = m_1 ... m_k < β^k with m_i < β, and let M_i = M/m_i.
CRT formula: a = ( Σ_{i=1}^{k} |a|_{m_i} · M_i · |M_i^{-1}|_{m_i} ) mod M.
Reconstruction of a single entry: write the CRT constants in base β as M_i |M_i^{-1}|_{m_i} = α_0^(i) + α_1^(i) β + ... + α_{k-1}^(i) β^{k-1}. Then

  ( ã_0, ..., ã_{k-1} )^T  =  [ α_j^(i) ]_{j,i} · ( |a|_{m_1}, ..., |a|_{m_k} )^T

with ã_j < kβ^2, and a = ( ã_0 + ... + ã_{k-1} β^{k-1} ) mod M is the CRT solution.
Lemma: if kβ^2 fits in O(1) machine words, then the reconstruction of n^2 integers from their images modulo the m_i costs O(n^2 k^(ω-1)) + O~(n^2 k) bit operations.

Matrix multiplication with multi-precision integers
Implementation of the multi-modular approach:
- choose β = 2^16 to optimize β-adic conversions
- choose the m_i such that nβm_i < 2^53 and use BLAS dgemm
- use a linear storage for multi-modular matrices
Compare sequential performance with:
- the FLINT library [1]: uses divide and conquer
- the Mathemagix library [2]: uses divide and conquer
- Doliskani's code [3]: uses dgemm for reductions only
[1] www.flintlib.org  [2] www.mathemagix.org  [3] courtesy of J. Doliskani

Matrix multiplication with multi-precision integers
[Plot: integer matrix multiplication, matrix dim = 32; time in seconds vs entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]

Matrix multiplication with multi-precision integers
[Plot: integer matrix multiplication, matrix dim = 128; time in seconds vs entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]

Matrix multiplication with multi-precision integers
[Plot: integer matrix multiplication, matrix dim = 512; time in seconds vs entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]

Matrix multiplication with multi-precision integers
[Plot: integer matrix multiplication, matrix dim = 1024; time in seconds vs entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]

Parallel multi-modular matrix multiplication
Steps 1 and 3:
  1. for i = 1...k, compute A_i = A mod m_i and B_i = B mod m_i
  3. reconstruct C = AB from (C_1, ..., C_k)
Parallelization of multi-modular reduction/reconstruction: each thread reduces (resp. reconstructs) a chunk of the given matrix.
[Diagram: the reductions A_0 = A mod m_0 through A_6 = A mod m_6, with each of threads 0-3 handling one chunk of the matrix across all moduli]

Parallel multi-modular matrix multiplication
Step 2:
  2. for i = 1...k, compute C_i = A_i B_i mod m_i
Parallelization of the modular multiplications: each thread computes a bunch of the matrix multiplications mod m_i.
[Diagram: thread 0 computes C_0 = A_0 B_0 mod m_0 and C_1 = A_1 B_1 mod m_1; thread 1 computes C_2 and C_3; thread 2 computes C_4 and C_5; thread 3 computes C_6]

Parallel matrix multiplication with multi-precision integers
- based on OpenMP tasks
- CPU affinity (hwloc-bind), allocator (tcmalloc)
- work still in progress for a better memory strategy
[Plot: speedup of parallel integer matrix multiplication vs number of cores (2, 6, 8, 12) for (dim=1024, bitsize=1024), (dim=1024, bitsize=4096) and (dim=256, bitsize=8192); benchmark on Intel Xeon-2620 @ 2.0GHz (2 NUMA nodes with 6 cores each)]

Outline
1. Matrix multiplication with small integers
2. Matrix multiplication with multi-precision integers
3. Matrix multiplication with polynomials

Matrix multiplication over F_p[x]
We consider the easiest case:
- A, B ∈ F_p[x]^(n×n) such that deg(AB) < k = 2^t
- p is a Fourier prime, i.e. p = 2^t q + 1
- p is such that n(p-1)^2 < 2^53
Complexity: O(n^ω k + n^2 k log(k)) operations in F_p, using evaluation/interpolation with the FFT.
Remark: using a Vandermonde matrix one can get an approach similar to the integer case, i.e. O(n^ω k + n^2 k^(ω-1)).

Matrix multiplication over F_p[x]
Evaluation/interpolation scheme: let θ be a primitive k-th root of unity in F_p.
1. for i = 1...k, compute A_i = A(θ^(i-1)) and B_i = B(θ^(i-1))
2. for i = 1...k, compute C_i = A_i B_i over F_p
3. interpolate C = AB from (C_1, ..., C_k)
Steps 1 and 3: O(n^2) calls to an FFT of size k over F_p; step 2: k matrix multiplications modulo the small prime p.

FFT with SIMD over F_p
Butterfly operation modulo p: compute X + Y mod p and (X - Y)θ^(2i) mod p.
Barrett's modular multiplication with a constant (as in NTL); compute in [0, 2p) to remove two conditionals [Harvey 2014].
Let X, Y ∈ [0, 2p), W ∈ [0, p), p < β/4 and W' = ⌊Wβ/p⌋.
Algorithm: Butterfly(X, Y, W, W', p)
1: X' := X + Y mod 2p
2: T := X - Y + 2p
3: Q := ⌊W'T/β⌋          (1 high short product)
4: Y' := (WT - Qp) mod β  (2 low short products)
5: return (X', Y')
SSE/AVX provide 16- and 32-bit low short products; no high short product is available (use a full product).
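A scalar sketch of this butterfly, assuming β = 2^32 and p < 2^30 = β/4 so that every intermediate fits in 64 bits; the real code works in SIMD registers, and these function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

constexpr uint64_t BETA = uint64_t(1) << 32;   // beta = 2^32 in this sketch

// Precompute W' = floor(W * beta / p) for a fixed twiddle factor W in [0, p).
uint64_t precompute_quotient(uint64_t W, uint64_t p) {
    return (W * BETA) / p;   // W < p < 2^30, so W*BETA < 2^62: no overflow
}

// One Harvey butterfly: inputs X, Y in [0, 2p), outputs in [0, 2p), with
// X_out = X + Y (mod p) and Y_out = (X - Y) * W (mod p), no final correction.
std::pair<uint64_t, uint64_t> butterfly(uint64_t X, uint64_t Y, uint64_t W,
                                        uint64_t Wp, uint64_t p) {
    uint64_t Xo = X + Y;
    if (Xo >= 2 * p) Xo -= 2 * p;              // X + Y mod 2p
    uint64_t T = X - Y + 2 * p;                // in [0, 4p), stays nonnegative
    uint64_t Q = (Wp * T) >> 32;               // high short product floor(W'T/beta)
    uint64_t Yo = (W * T - Q * p) & (BETA - 1); // two low short products mod beta
    return {Xo, Yo};                           // Harvey: W*T - Q*p lies in [0, 2p)
}
```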

Matrix multiplication over F_p[x]
Implementation:
- radix-4 FFT with 128-bit SSE (29-bit primes)
- BLAS-based matrix multiplication over F_p [FFLAS-FFPACK library]
[Plot: polynomial matrix multiplication performance; time in seconds for FLINT, Mathemagix (MMX) and our code over various matrix dimensions and degrees, from (n=16, deg=1024...8192) to (n=2048, deg=32); benchmark on Intel Xeon-2620 @ 2.0GHz]

Matrix multiplication over Z[x]
A, B ∈ Z[x]^(n×n) such that deg(AB) < d and the coefficients of AB are bounded by k.
Complexity:
- O~(n^ω d log(d) log(k)) bit operations using Kronecker substitution
- O(n^ω d log(k) + n^2 d log(d) log(k)) bit operations using CRT + FFT
Remark: if the result's degree and bitsize are not too large, CRT with Fourier primes may suffice.

Matrix multiplication over Z[x]
Implementation:
- use CRT with Fourier primes
- re-use the multi-modular reduction/reconstruction with linear algebra
- re-use the multiplication in F_p[x]
[Plot: time in seconds for FLINT vs our code, for (n=16, deg=8192) with log(k) = 64, 128, 256, 512 and (n=128, deg=512) with log(k) = 64, 128, 256; benchmark on Intel Xeon-2620 @ 2.0GHz]

Parallel matrix multiplication over Z[x]
Very first attempt (work still in progress):
- parallel CRT with linear algebra (same code as in the Z case)
- perform each multiplication over F_p[x] in parallel
- some parts of the code are still sequential

  n    d     log(k)   speedup (6 cores)   speedup (12 cores)   sequential time
  64   1024  600      3.52                4.88                 61.1s
  32   4096  600      3.68                5.02                 64.4s
  32   2048  1024     3.95                5.73                 54.5s
  128  128   1024     3.76                5.55                 53.9s

Polynomial matrix in LinBox (proposition)
Generic handler class for polynomial matrices:

template<size_t type, size_t storage, class Field>
class PolynomialMatrix;

Specializations for different memory strategies:

// Matrix of polynomials
template<class Field>
class PolynomialMatrix<PMType::polfirst, PMStorage::plain, Field>;

// Polynomial of matrices
template<class Field>
class PolynomialMatrix<PMType::matfirst, PMStorage::plain, Field>;

// Polynomial of matrices (partial view on monomials)
template<class Field>
class PolynomialMatrix<PMType::matfirst, PMStorage::view, Field>;

Conclusion
High performance tools for exact linear algebra:
- matrix multiplication through floating point
- multi-dimensional CRT
- FFT for polynomials over word-size prime fields
- adaptive matrix representation
We provide in the LinBox library (www.linalg.org):
- efficient sequential/parallel matrix multiplication over Z
- efficient sequential matrix multiplication over F_p[x] and Z[x]