Toward High Performance Matrix Multiplication for Exact Computation
|
|
- Amberly Norris
- 6 years ago
- Views:
Transcription
1 Toward High Performance Matrix Multiplication for Exact Computation Pascal Giorgi Joint work with Romain Lebreton (U. Waterloo) Funded by the French ANR project HPAC Séminaire CASYS - LJK, April 2014
2 Motivations Matrix multiplication plays a central role in computer algebra. algebraic complexity of O(n ω ) with ω < [Williams 2011] Modern processors provide many levels of parallelism. superscalar, SIMD units, multiple cores High performance matrix multiplication numerical computing = classic algorithm + hardware arithmetic exact computing numerical computing algebraic algorithm is not the most efficient ( complexity model) arithmetic is not directly in the hardware (e.g. Z, F q, Z[x], Q[x, y, z]).
3 Motivation : Superscalar processor with SIMD Hierarchical memory : L1 cache : 32kB - 4 cycles L2 cache : 256kB - 12 cycles L3 cache : 8MB - 36 cycles RAM : 32GB - 36 cycles + 57ns
4 Motivations : practical algorithms High performance algorithms (rule of thumb) best asymptotic complexity is not alway faster : constants matter better arithmetic count is not always faster : caches matter process multiple data at the same time : vectorization fine/coarse grain task parallelism matter : multicore parallelism
5 Motivations : practical algorithms High performance algorithms (rule of thumb) best asymptotic complexity is not alway faster : constants matter better arithmetic count is not always faster : caches matter process multiple data at the same time : vectorization fine/coarse grain task parallelism matter : multicore parallelism Our goal : try to incorporate these rules into exact matrix multiplications
6 Outline 1 Matrix multiplication with small integers 2 Matrix multiplication with multi-precision integers 3 Matrix multiplication with polynomials
7 Outline 1 Matrix multiplication with small integers 2 Matrix multiplication with multi-precision integers 3 Matrix multiplication with polynomials
8 Matrix multiplication with small integers This corresponds to the case where each integer result holds in one processor register : where s is the register size. A, B Z n n such that AB < 2 s Main interests ring isomorphism : computation over Z/pZ is congruent to Z/2 s Z when p(n 1) 2 < 2 s. its a building block for matrix mutiplication with larger integers
9 Matrix multiplication with small integers Two possibilities for hardware support : use floating point mantissa, i.e. s = 2 53, use native integer, i.e. s = Using floating point historically, the first approach in computer algebra [Dumas, Gautier, Pernet 2002] out of the box performance from optimized BLAS handle matrix with entries < 2 26 Using native integers apply same optimizations as BLAS libraries [Goto, Van De Geijn 2008] handle matrix with entries < 2 32
10 Matrix multiplication with small integers floating point integers Nehalem (2008) SSE4 128-bits 1 mul+1 add 1 mul+2 add Sandy Bridge (2011) AVX 256-bits 1 mul+1 add Haswell (2013) AVX2 256-bits 2 FMA 1 mul+2 add # vector operations per cycle (pipelined) OpenBlas: double - SSE OpenBlas: double - AVX OpenBlas: double - AVX2 Our code: int - SSE Our code: int - AVX2 30 GFLOPS Matrix dimensions benchmark on Intel 2.60GHz
11 Matrix multiplication with small integers Matrix multiplication modulo a small integer Let p such that (p 1) 2 n < perform the multiplication in Z using BLAS 2 reduce the result modulo p OpenBlas mod p - Classic mod p - Winograd 40 GFLOPS Matrix dimensions benchmark on Intel 2.60GHz
12 Outline 1 Matrix multiplication with small integers 2 Matrix multiplication with multi-precision integers 3 Matrix multiplication with polynomials
13 Matrix multiplication with multi-precision integers Direct approach Let M(k) be the bit complexity of k-bit integers multiplication and A, B Z n n such that A, B O(2 k ). Computing AB using direct algorithm costs n ω M(k) bit operations. not best possible complexity, i.e. M(k) is super-linear not efficient in practice Remark: Use evaluation/interpolation technique for better performances!!!
14 Multi-modular matrix multiplication Multi-modular approach AB < M = k m i, i=1 with primes m i O(1) then AB can be reconstructed with the CRT from (AB) mod m i. 1 for each m i compute A i = A mod m i and B i = B mod m i 2 for each m i compute C i = A i B i mod m i 3 reconstruct C = AB from (C 1,..., C k ) Bit complexity : O(n ω k + n 2 R(k)) where R(k) is the cost of reduction/reconstruction
15 Multi-modular matrix multiplication Multi-modular approach AB < M = k m i, i=1 with primes m i O(1) then AB can be reconstructed with the CRT from (AB) mod m i. 1 for each m i compute A i = A mod m i and B i = B mod m i 2 for each m i compute C i = A i B i mod m i 3 reconstruct C = AB from (C 1,..., C k ) Bit complexity : O(n ω k + n 2 R(k)) where R(k) is the cost of reduction/reconstruction R(k) = O(M(k) log(k)) using divide and conquer strategy R(k) = O(k 2 ) using naive approach
16 Multi-modular matrix multiplication Improving naive approach with linear algebra reduction/reconstruction of n 2 data corresponds to matrix multiplication improve the bit complexity from O(n 2 k 2 ) to O(n 2 k ω 1 ) benefit from optimized matrix multiplication, i.e. SIMD Remark : A similar approach has been used by [Doliskani, Schost 2010] in a non-distributed code.
17 Multi-modular reductions of an integer matrix Let us assume M = k i=1 m i < β k with m i < β. Multi-reduction of a single entry Let a = a 0 + a 1 β +... a k 1 β k 1 be a value to reduce modm i then a m1 1 β m1... β k 1 m1 a0 m 1. = Q. a mk 1 β mk... β k 1 mk a k 1 m k with Q < kβ 2 Lemma : if kβ 2 O(1) than the reduction of n 2 integers modulo the m i s costs O(n 2 k ω 1 ) + O(n 2 k) bit operations.
18 Multi-modular reconstruction of an integer matrix Let us assume M = k i=1 m i < β k with m i < β and M i = M/m i CRT formulae : a = ( Reconstruction of a single entry k i=1 a m1 M i M 1 i mi ) mod M Let M i M 1 i mi = α (i) 0 + α(i) 1 β +... α(i) k 1 βk 1 be the CRT constants, then a 0 α (1) 0... α (1) k 1 a m1. = a k 1 α (k) 0... α (k) a mk k 1 with a i < kβ 2 and a = a a k 1 β k 1 mod M the CRT solution. Lemma : if kβ 2 O(1) than the reconstruction of n 2 integers from their images modulo the m i s costs O(n 2 k ω 1 ) + O (n 2 k) bit operations.
19 Matrix multiplication with multi-precision integers Implementation of multi-modular approach choose β = 2 16 to optimize β-adic conversions choose m i s.t. nβm i < 2 53 and use BLAS dgemm use a linear storage for multi-modular matrices Compare sequential performances with : FLINT library 1 : uses divide and conquer Mathemagix library 2 : uses divide and conquer Doliskani s code 3 : uses dgemm for reductions only courtesy of J. Doliskani
20 Matrix multiplication with multi-precision integers Flint Mathemagix Doliskani s code Our code Integer matrix multiplication (matrix dim = 32) 4.00 Time in seconds Entry bitsize benchmark on Intel 2.0GHz
21 Matrix multiplication with multi-precision integers Flint Mathemagix Doliskani s code Our code Integer matrix multiplication (matrix dim = 128) Time in seconds Entry bitsize benchmark on Intel 2.0GHz
22 Matrix multiplication with multi-precision integers Flint Mathemagix Doliskani s code Our code Integer matrix multiplication (matrix dim = 512) Time in seconds Entry bitsize benchmark on Intel 2.0GHz
23 Matrix multiplication with multi-precision integers Flint Mathemagix Doliskani s code Our code Integer matrix multiplication (matrix dim = 1024) Time in seconds Entry bitsize benchmark on Intel 2.0GHz
24 Parallel multi-modular matrix multiplication 1 for i = 1... k compute A i = A mod m i and B i = B mod m i 3 reconstruct C = AB from (C 1,..., C k ) Parallelization of multi-modular reduction/reconstruction each thread reduces (resp. reconstructs) a chunk of the given matrix thread 0 thread 1 thread 2 thread 3 A 0 = A mod m 0 A 1 = A mod m 1 A 2 = A mod m 2 A 3 = A mod m 3 A 4 = A mod m 4 A 5 = A mod m 5 A 6 = A mod m 6
25 Parallel multi-modular matrix multiplication 2 for i = 1... k compute C i = A i B i mod m i Parallelization of modular multiplication each thread computes a bunch of matrix multiplications modm i thread 0 C 0 = A 0B 0 mod m 0 C 1 = A 1B 1 mod m 1 thread 1 C 2 = A 2B 2 mod m 2 C 3 = A 3B 3 mod m 3 thread 2 C 4 = A 4B 4 mod m 4 C 5 = A 5B 5 mod m 5 thread 3 C 6 = A 6B 6 mod m 6
26 Parallel Matrix multiplication with multi-precision integers based on OpenMP task CPU affinity (hwloc-bind), allocator (tcmalloc) still under progress for better memory strategy!!! Speedup of parallel integer matrix mul5plica5on speedup dim = 1024 bitsize = 1024 dim=1024 bitsize = 4096 dim = 256 bitsize = number of cores benchmark on Intel 2.0GHz (2 NUMA with 6 cores)
27 Outline 1 Matrix multiplication with small integers 2 Matrix multiplication with multi-precision integers 3 Matrix multiplication with polynomials
28 Matrix multiplication over F p [x] We consider the easiest case : A, B F p [x] n n such that deg(ab) < k = 2 t p is a Fourier prime, i.e. p = 2 t q + 1 p is such that n(p 1) 2 < 2 53 Complexity O(n ω k + n 2 k log(k)) op. in F p using evaluation/interpolation with FFT
29 Matrix multiplication over F p [x] We consider the easiest case : A, B F p [x] n n such that deg(ab) < k = 2 t p is a Fourier prime, i.e. p = 2 t q + 1 p is such that n(p 1) 2 < 2 53 Complexity O(n ω k + n 2 k log(k)) op. in F p using evaluation/interpolation with FFT Remark: using Vandermonde matrix on can get a similar approach as for integers, i.e. O(n ω k + n 2 k ω 1 )
30 Matrix multiplication over F p [x] Evaluation/Interpolation scheme Let θ a primitive kth root of unity in F p. 1 for i = 1... k compute A i = A(θ i 1 ) and B i = B(θ i 1 ) 2 for i = 1... k compute C i = A i B i F p 3 interpolate C = AB from (C 1,..., C k ) steps 1 and 3 : O(n 2 ) call to FFT k over F p [x] step 2 : k matrix multiplications modulo a small prime p
31 FFT with SIMD over F p Butterly operation modulo p compute X + Y mod p and (X Y )θ 2i mod p. Barret s modular multiplication with a constant (NTL) calculate into [0, 2p) to remove two conditionals [Harvey 2014] Let X, Y [0, 2p), W [0, p), p < β/4 and W = W β/p. Algorithm: Butterfly(X,Y,W,W,p) 1: X := X + Y mod 2p 2: T := X Y + 2p 3: Q := W T /β 1 high short product 4: Y := (WT Qp) mod β 2 low short products 5: return (X, Y )
32 FFT with SIMD over F p Butterly operation modulo p compute X + Y mod p and (X Y )θ 2i mod p. Barret s modular multiplication with a constant (NTL) calculate into [0, 2p) to remove two conditionals [Harvey 2014] Let X, Y [0, 2p), W [0, p), p < β/4 and W = W β/p. Algorithm: Butterfly(X,Y,W,W,p) 1: X := X + Y mod 2p 2: T := X Y + 2p 3: Q := W T /β 1 high short product 4: Y := (WT Qp) mod β 2 low short products 5: return (X, Y ) SSE/AVX provide 16 or 32-bits low short product no high short product available (use full product)
33 Matrix multiplication over F p [x] Implementation radix-4 FFT with 128-bits SSE (29 bits primes) BLAS-based matrix multiplication over F p [FFLAS-FFPACK library] Time in seconds 1 024,00 512,00 256,00 128,00 64,00 32,00 16,00 8,00 4,00 2,00 1,00 0,50 0,25 0,13 0,06 0,03 0,02 Polynomial Matrix Mul9plica9on Performances deg=1024 deg=2048 deg=4096 deg=8192 deg=512 deg=256 deg=512 deg=64 deg=32 n=16 n=16 n=16 n=16 n=128 n=256 n=512 n=1024 n=2048 Matrix with dimension (n) and degree (deg) FLINT MMX our code benchmark on Intel 2.0GHz
34 Matrix multiplication over Z[x] A, B Z[x] n n such that deg(ab) < d and (AB) i < k Complexity O (n ω d log(d) log(k)) bit op. using Kronecker substitution O(n ω d log(k) + n 2 d log(d) log(k)) bit op. using CRT+FFT Remark: if the result s degree and bitsize are not too large, CRT with Fourier primes might suffice.
35 Matrix multiplication over Z[x] Implementation use CRT with Fourier primes re-use multi-modular reduction/reconstruction with linear algebra re-use multiplication in F p [x] Time in seconds FLINT our code 1 log(k)=64 log(k)=128 log(k)=256 log(k)=512 log(k)=64 log(k)=128 log(k)=256 n=16 ; deg=8192 n=128 ; deg=512 benchmark on Intel 2.0GHz
36 Parallel Matrix multiplication over Z[x] Very first attempt (work still in progress) parallel CRT with linear algebra (same code as in Z case) perform each multiplication over F p [x] in parallel some part of the code still sequential n d log(k) 6 cores 12 cores time seq s s s s
37 Polynomial Matrix in LinBox (proposition) Generic handler class for Polynomial Matrix template<s i z e t type, s i z e t s t o r a g e, c l a s s F i e l d > c l a s s P o l y n o m i a l M a t r i x ; Specialization for different memory strategy // M a t r i x o f p o l y n o m i a l s template<c l a s s F i e l d > c l a s s P o l y n o m i a l M a t r i x <PMType : : p o l f i r s t, PMStorage : : p l a i n, F i e l d >; // P o l y n o m i a l o f m a t r i c e s template<c l a s s F i e l d > c l a s s P o l y n o m i a l M a t r i x <PMType : : m a t f i r s t, PMStorage : : p l a i n, F i e l d >; // P o l y n o m i a l o f m a t r i c e s ( p a r t i a l view on monomials ) template<c l a s s F i e l d > c l a s s P o l y n o m i a l M a t r i x <PMType : : m a t f i r s t, PMStorage : : view, F i e l d >;
38 Conclusion High performance tools for exact linear algebra : matrix multiplication through floating points multi-dimensional CRT FFT for polynomial over wordsize prime fields adaptative matrix representation We provide in the LinBox library ( efficient sequential/parallel matrix multiplication over Z efficient sequential matrix multiplication over F p [x] and Z[x]
Dense Arithmetic over Finite Fields with CUMODP
Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,
More informationSolving Sparse Rational Linear Systems. Pascal Giorgi. University of Waterloo (Canada) / University of Perpignan (France) joint work with
Solving Sparse Rational Linear Systems Pascal Giorgi University of Waterloo (Canada) / University of Perpignan (France) joint work with A. Storjohann, M. Giesbrecht (University of Waterloo), W. Eberly
More informationFaster arithmetic for number-theoretic transforms
University of New South Wales 7th October 2011, Macquarie University Plan for talk 1. Review number-theoretic transform (NTT) 2. Discuss typical butterfly algorithm 3. Improvements to butterfly algorithm
More informationExact Linear Algebra Algorithmic: Theory and Practice
Exact Linear Algebra Algorithmic: Theory and Practice ISSAC 15 Tutorial Clément Pernet Université Grenoble Alpes, Inria, LIP-AriC July 6, 2015 C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 1
More informationQ-adic Transform revisited
Q-adic Transform revisited Jean-Guillaume Dumas To cite this version: Jean-Guillaume Dumas. Q-adic Transform revisited. 2007. HAL Id: hal-00173894 https://hal.archives-ouvertes.fr/hal-00173894v3
More informationhal , version 1-29 Feb 2008
Compressed Modular Matrix Multiplication Jean-Guillaume Dumas Laurent Fousse Bruno Salvy February 29, 2008 Abstract We propose to store several integers modulo a small prime into a single machine word.
More informationFast, Parallel Algorithm for Multiplying Polynomials with Integer Coefficients
, July 4-6, 01, London, UK Fast, Parallel Algorithm for Multiplying Polynomials with Integer Coefficients Andrzej Chmielowiec Abstract This paper aims to develop and analyze an effective parallel algorithm
More informationMatrix multiplication over word-size prime fields using Bini s approximate formula
Matrix multiplication over word-size prime fields using Bini s approximate formula Brice Boyer, Jean-Guillaume Dumas To cite this version: Brice Boyer, Jean-Guillaume Dumas. Matrix multiplication over
More informationComputing Characteristic Polynomials of Matrices of Structured Polynomials
Computing Characteristic Polynomials of Matrices of Structured Polynomials Marshall Law and Michael Monagan Department of Mathematics Simon Fraser University Burnaby, British Columbia, Canada mylaw@sfu.ca
More informationPUTTING FÜRER ALGORITHM INTO PRACTICE WITH THE BPAS LIBRARY. (Thesis format: Monograph) Linxiao Wang. Graduate Program in Computer Science
PUTTING FÜRER ALGORITHM INTO PRACTICE WITH THE BPAS LIBRARY. (Thesis format: Monograph) by Linxiao Wang Graduate Program in Computer Science A thesis submitted in partial fulfillment of the requirements
More informationFast and Small: Multiplying Polynomials without Extra Space
Fast and Small: Multiplying Polynomials without Extra Space Daniel S. Roche Symbolic Computation Group School of Computer Science University of Waterloo CECM Day SFU, Vancouver, 24 July 2009 Preliminaries
More informationDense Linear Algebra over Finite Fields: the FFLAS and FFPACK packages
Dense Linear Algebra over Finite Fields: the FFLAS and FFPACK packages Jean-Guillaume Dumas, Thierry Gautier, Pascal Giorgi, Clément Pernet To cite this version: Jean-Guillaume Dumas, Thierry Gautier,
More informationHigh Performance and Reliable Algebraic Computing
High Performance and Reliable Algebraic Computing Soutenance d habilitation à diriger des recherches Clément Pernet Université Joseph Fourier (Grenoble 1) November 25, 2014 Rapporteurs : Daniel Augot,
More informationChapter 1 Divide and Conquer Polynomial Multiplication Algorithm Theory WS 2015/16 Fabian Kuhn
Chapter 1 Divide and Conquer Polynomial Multiplication Algorithm Theory WS 2015/16 Fabian Kuhn Formulation of the D&C principle Divide-and-conquer method for solving a problem instance of size n: 1. Divide
More informationOn the Computational Complexity of the Discrete Pascal Transform
6 th International Conference Logic and Applications LAP 207, September 8-22, 207, Dubrovnik, Croatia On the Computational Complexity of the Discrete Pascal Transform Dušan B. Gajić, Radomir S. Stanković
More informationPolynomial Selection. Thorsten Kleinjung École Polytechnique Fédérale de Lausanne
Polynomial Selection Thorsten Kleinjung École Polytechnique Fédérale de Lausanne Contents Brief summary of polynomial selection (no root sieve) Motivation (lattice sieving, monic algebraic polynomial)
More informationOutput-sensitive algorithms for sumset and sparse polynomial multiplication
Output-sensitive algorithms for sumset and sparse polynomial multiplication Andrew Arnold Cheriton School of Computer Science University of Waterloo Waterloo, Ontario, Canada Daniel S. Roche Computer Science
More informationWhat s the best data structure for multivariate polynomials in a world of 64 bit multicore computers?
What s the best data structure for multivariate polynomials in a world of 64 bit multicore computers? Michael Monagan Center for Experimental and Constructive Mathematics Simon Fraser University British
More informationPolynomial arithmetic and applications in number theory
Polynomial arithmetic and applications in number theory September 26, 2008 What is my research about? Computational number theory. Most of the time, I use big computers to multiply and divide big polynomials
More informationPrime Fields 04/05/2007. Hybrid system simulator for ODE 1. Galois field. The issue. Prime fields: naïve implementation
Galois field The issue Topic: finite fields with word size cardinality Field: 4 arithmetic operators to implement (+, -, *, /) We will focus on axpy: r = a x + y (operation mainly used in linear algebra
More informationClass invariants by the CRT method
Class invariants by the CRT method Andreas Enge Andrew V. Sutherland INRIA Bordeaux-Sud-Ouest Massachusetts Institute of Technology ANTS IX Andreas Enge and Andrew Sutherland Class invariants by the CRT
More informationParallel Polynomial Evaluation
Parallel Polynomial Evaluation Jan Verschelde joint work with Genady Yoffe University of Illinois at Chicago Department of Mathematics, Statistics, and Computer Science http://www.math.uic.edu/ jan jan@math.uic.edu
More informationClaude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique
Claude Tadonki MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Monthly CRI Seminar MINES ParisTech - CRI June 06, 2016, Fontainebleau (France)
More informationSolving Sparse Integer Linear Systems
Solving Sparse Integer Linear Systems Wayne Eberly, Mark Giesbrecht, Giorgi Pascal, Arne Storjohann, Gilles Villard To cite this version: Wayne Eberly, Mark Giesbrecht, Giorgi Pascal, Arne Storjohann,
More informationOld and new algorithms for computing Bernoulli numbers
Old and new algorithms for computing Bernoulli numbers University of New South Wales 25th September 2012, University of Ballarat Bernoulli numbers Rational numbers B 0, B 1,... defined by: x e x 1 = n
More informationOn the use of polynomial matrix approximant in the block Wiedemann algorithm
On the use of polynomial matrix approximant in the block Wiedemann algorithm ECOLE NORMALE SUPERIEURE DE LYON Laboratoire LIP, ENS Lyon, France Pascal Giorgi Symbolic Computation Group, School of Computer
More informationSoftware implementation of ECC
Software implementation of ECC Radboud University, Nijmegen, The Netherlands June 4, 2015 Summer school on real-world crypto and privacy Šibenik, Croatia Software implementation of (H)ECC Radboud University,
More informationExact Arithmetic on a Computer
Exact Arithmetic on a Computer Symbolic Computation and Computer Algebra William J. Turner Department of Mathematics & Computer Science Wabash College Crawfordsville, IN 47933 Tuesday 21 September 2010
More informationFast reversion of formal power series
Fast reversion of formal power series Fredrik Johansson LFANT, INRIA Bordeaux RAIM, 2016-06-29, Banyuls-sur-mer 1 / 30 Reversion of power series F = exp(x) 1 = x + x 2 2! + x 3 3! + x 4 G = log(1 + x)
More informationAddition sequences and numerical evaluation of modular forms
Addition sequences and numerical evaluation of modular forms Fredrik Johansson (INRIA Bordeaux) Joint work with Andreas Enge (INRIA Bordeaux) William Hart (TU Kaiserslautern) DK Statusseminar in Strobl,
More informationERLANGEN REGIONAL COMPUTING CENTER
ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,
More informationParallel Integer Polynomial Multiplication Changbo Chen, Svyatoslav Parallel Integer Covanov, Polynomial FarnamMultiplication
Parallel Integer Polynomial Multiplication Parallel Integer Polynomial Multiplication Changbo Chen 1 Svyatoslav Covanov 2,3 Farnam Mansouri 2 Marc Moreno Maza 2 Ning Xie 2 Yuzhen Xie 2 1 Chinese Academy
More informationSpace- and Time-Efficient Polynomial Multiplication
Space- and Time-Efficient Polynomial Multiplication Daniel S. Roche Symbolic Computation Group School of Computer Science University of Waterloo ISSAC 2009 Seoul, Korea 30 July 2009 Univariate Polynomial
More informationFast evaluation of iterated multiplication of very large polynomials: An application to chinese remainder theory
ANZIAM J. 48 (CTAC2006) pp.c709 C724, 2007 C709 Fast evaluation of iterated multiplication of very large polynomials: An application to chinese remainder theory D. Laing 1 B. Litow 2 (Received 30 August
More informationFast Polynomial Multiplication
Fast Polynomial Multiplication Marc Moreno Maza CS 9652, October 4, 2017 Plan Primitive roots of unity The discrete Fourier transform Convolution of polynomials The fast Fourier transform Fast convolution
More informationChapter 1 Divide and Conquer Algorithm Theory WS 2016/17 Fabian Kuhn
Chapter 1 Divide and Conquer Algorithm Theory WS 2016/17 Fabian Kuhn Formulation of the D&C principle Divide-and-conquer method for solving a problem instance of size n: 1. Divide n c: Solve the problem
More informationHPMPC - A new software package with efficient solvers for Model Predictive Control
- A new software package with efficient solvers for Model Predictive Control Technical University of Denmark CITIES Second General Consortium Meeting, DTU, Lyngby Campus, 26-27 May 2015 Introduction Model
More informationComputing Bernoulli numbers
Computing Bernoulli numbers David Harvey (joint work with Edgar Costa) University of New South Wales 27th September 2017 Jonathan Borwein Commemorative Conference Noah s on the Beach, Newcastle, Australia
More informationParallel Transposition of Sparse Data Structures
Parallel Transposition of Sparse Data Structures Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng Department of Computer Science, Virginia Tech Niels Bohr Institute, University of Copenhagen Scientific Computing
More informationBig Prime Field FFT on the GPU
Big Prime Field FFT on the GPU Liangyu Chen, Svyatoslav Covanov, Davood Mohajerani, Marc Moreno Maza To cite this version: Liangyu Chen, Svyatoslav Covanov, Davood Mohajerani, Marc Moreno Maza. Big Prime
More informationExactly Solve Integer Linear Systems Using Numerical Methods
Exactly Solve Integer Linear Systems Using Numerical Methods Zhendong Wan Department of Computer Science, University of Delaware, Newark, DE, USA Abstract We present a new fast way to exactly solve non-singular
More informationAlgorithms and Methods for Fast Model Predictive Control
Algorithms and Methods for Fast Model Predictive Control Technical University of Denmark Department of Applied Mathematics and Computer Science 13 April 2016 Background: Model Predictive Control Model
More informationCRYPTOGRAPHIC COMPUTING
CRYPTOGRAPHIC COMPUTING ON GPU Chen Mou Cheng Dept. Electrical Engineering g National Taiwan University January 16, 2009 COLLABORATORS Daniel Bernstein, UIC, USA Tien Ren Chen, Army Tanja Lange, TU Eindhoven,
More informationOn the evaluation of modular polynomials
On the evaluation of modular polynomials Andrew V. Sutherland Massachusetts Institute of Technology ANTS X July 10, 2012 http://math.mit.edu:/ drew 1 / 16 Introduction Let l be a prime and let F q be a
More informationFixed-Point Arithmetic in SHE Schemes
Fixed-Point Arithmetic in SHE Schemes Anamaria Costache 1, Nigel P. Smart 1, Srinivas Vivek 1, Adrian Waller 2 1 University of Bristol 2 Thales UK Research & Technology July 6, 2016 Outline Motivation
More informationFast reversion of power series
Fast reversion of power series Fredrik Johansson November 2011 Overview Fast power series arithmetic Fast composition and reversion (Brent and Kung, 1978) A new algorithm for reversion Implementation results
More informationIn-place Arithmetic for Univariate Polynomials over an Algebraic Number Field
In-place Arithmetic for Univariate Polynomials over an Algebraic Number Field Seyed Mohammad Mahdi Javadi 1, Michael Monagan 2 1 School of Computing Science, Simon Fraser University, Burnaby, B.C., V5A
More informationComputing modular polynomials with the Chinese Remainder Theorem
Computing modular polynomials with the Chinese Remainder Theorem Andrew V. Sutherland Massachusetts Institute of Technology ECC 009 Reinier Bröker Kristin Lauter Andrew V. Sutherland (MIT) Computing modular
More informationTowards High Performance Multivariate Factorization. Michael Monagan. This is joint work with Baris Tuncer.
Towards High Performance Multivariate Factorization Michael Monagan Center for Experimental and Constructive Mathematics Simon Fraser University British Columbia This is joint work with Baris Tuncer. To
More informationA parallel implementation for polynomial multiplication modulo a prime.
A parallel implementation for polynomial multiplication modulo a prime. ABSTRACT Marshall Law Department of Mathematics Simon Fraser University Burnaby, B.C. Canada. mylaw@sfu.ca. We present a parallel
More informationFast Polynomials Multiplication Using FFT
Li Chen lichen.xd at gmail.com Xidian University January 17, 2014 Outline 1 Discrete Fourier Transform (DFT) 2 Discrete Convolution 3 Fast Fourier Transform (FFT) 4 umber Theoretic Transform (TT) 5 More
More informationElliptic Curves Spring 2013 Lecture #3 02/12/2013
18.783 Elliptic Curves Spring 2013 Lecture #3 02/12/2013 3.1 Arithmetic in finite fields To make explicit computations with elliptic curves over finite fields, we need to know how to perform arithmetic
More informationFour-Dimensional GLV Scalar Multiplication
Four-Dimensional GLV Scalar Multiplication ASIACRYPT 2012 Beijing, China Patrick Longa Microsoft Research Francesco Sica Nazarbayev University Elliptic Curve Scalar Multiplication A (Weierstrass) elliptic
More informationEliminations and echelon forms in exact linear algebra
Eliminations and echelon forms in exact linear algebra Clément PERNET, INRIA-MOAIS, Grenoble Université, France East Coast Computer Algebra Day, University of Waterloo, ON, Canada, April 9, 2011 Clément
More informationGPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic
GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago
More informationSolving Sparse Integer Linear Systems. Pascal Giorgi. Dali team - University of Perpignan (France)
Solving Sarse Integer Linear Systems Pascal Giorgi Dali team - University of Perignan (France) in collaboration with A. Storjohann, M. Giesbrecht (University of Waterloo), W. Eberly (University of Calgary),
More informationComponent-level Parallelization of Triangular Decompositions
Component-level Parallelization of Triangular Decompositions Marc Moreno Maza & Yuzhen Xie University of Western Ontario, Canada February 1, 2007 Interactive Parallel Computation in Support of Research
More informationInteger multiplication with generalized Fermat primes
Integer multiplication with generalized Fermat primes CARAMEL Team, LORIA, University of Lorraine Supervised by: Emmanuel Thomé and Jérémie Detrey Journées nationales du Calcul Formel 2015 (Cluny) November
More informationBalanced Dense Polynomial Multiplication on Multicores
Balanced Dense Polynomial Multiplication on Multicores Yuzhen Xie SuperTech Group, CSAIL MIT joint work with Marc Moreno Maza ORCCA, UWO ACA09, Montreal, June 26, 2009 Introduction Motivation: Multicore-enabling
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms Chapter 13 Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael T. Heath Parallel Numerical Algorithms
More informationTR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems
TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a
More informationBalanced dense polynomial multiplication on multi-cores
Balanced dense polynomial multiplication on multi-cores The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher
More informationModular polynomials and isogeny volcanoes
Modular polynomials and isogeny volcanoes Andrew V. Sutherland February 3, 010 Reinier Bröker Kristin Lauter Andrew V. Sutherland (MIT) Modular polynomials and isogeny volcanoes 1 of 9 Isogenies An isogeny
More informationImplementation of the DKSS Algorithm for Multiplication of Large Numbers
Implementation of the DKSS Algorithm for Multiplication of Large Numbers Christoph Lüders Universität Bonn The International Symposium on Symbolic and Algebraic Computation, July 6 9, 2015, Bath, United
More informationSmoothness Testing of Polynomials over Finite Fields
Smoothness Testing of Polynomials over Finite Fields Jean-François Biasse and Michael J. Jacobson Jr. Department of Computer Science, University of Calgary 2500 University Drive NW Calgary, Alberta, Canada
More informationWhy solve integer linear systems exactly? David Saunders - U of Delaware
NSF CDI, 2007Oct30 1 Why solve integer linear systems exactly?... an enabling tool for discovery and innovation? David Saunders - U of Delaware LinBox library at linalg.org NSF CDI, 2007Oct30 2 LinBoxers
More informationPolynomial multiplication and division using heap.
Polynomial multiplication and division using heap. Michael Monagan and Roman Pearce Department of Mathematics, Simon Fraser University. Abstract We report on new code for sparse multivariate polynomial
More informationTips Geared Towards R. Adam J. Suarez. Arpil 10, 2015
Tips Geared Towards R Departments of Statistics North Carolina State University Arpil 10, 2015 1 / 30 Advantages of R As an interpretive and interactive language, developing an algorithm in R can be done
More informationEfficient implementation of the Hardy-Ramanujan-Rademacher formula
Efficient implementation of the Hardy-Ramanujan-Rademacher formula or: Partitions in the quintillions Fredrik Johansson RISC-Linz July 10, 2013 2013 SIAM Annual Meeting San Diego, CA Supported by Austrian
More informationFaster Homomorphic Evaluation of Discrete Fourier Transforms
Faster Homomorphic Evaluation of Discrete Fourier Transforms Anamaria Costache, Nigel P. Smart, and Srinivas Vivek University of Bristol, Bristol, UK Abstract. We present a methodology to achieve low latency
More informationEven faster integer multiplication
Even faster integer multiplication DAVID HARVEY School of Mathematics and Statistics University of New South Wales Sydney NSW 2052 Australia Email: d.harvey@unsw.edu.au JORIS VAN DER HOEVEN a, GRÉGOIRE
More informationBlock AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark
Block AIR Methods For Multicore and GPU Per Christian Hansen Hans Henrik B. Sørensen Technical University of Denmark Model Problem and Notation Parallel-beam 3D tomography exact solution exact data noise
More informationTowards High Performance Multivariate Factorization. Michael Monagan. This is joint work with Baris Tuncer.
Towards High Performance Multivariate Factorization Michael Monagan Center for Experimental and Constructive Mathematics Simon Fraser University British Columbia This is joint work with Baris Tuncer. The
More informationFaster ECC over F 2. School of Computer and Communication Sciences EPFL, Switzerland 2 CertiVox Labs.
Faster ECC over F 2 521 1 Robert Granger 1 and Michael Scott 2 1 Laboratory for Cryptologic Algorithms School of Computer and Communication Sciences EPFL, Switzerland robbiegranger@gmail.com 2 CertiVox
More informationPOLY : A new polynomial data structure for Maple 17 that improves parallel speedup.
: A new polynomial data structure for Maple 17 that improves parallel speedup. Centre for Experimental and Constructive Mathematics Simon Fraser University. Maplesoft presentation, August 14th, 2012 This
More informationCounting points on smooth plane quartics
Counting points on smooth plane quartics David Harvey University of New South Wales Number Theory Down Under, University of Newcastle 25th October 2014 (joint work with Andrew V. Sutherland, MIT) 1 / 36
More informationUniversity of Bristol - Explore Bristol Research. Peer reviewed version. Link to publication record in Explore Bristol Research PDF-document
Huelse, D., & Hemmer, M. (29). Generic implementation of a modular gcd over algebraic extension fields. Paper presented at 25th European Workshop on Computational Geometry, Brussels, Belgium. Peer reviewed
More informationThree Ways to Test Irreducibility
Three Ways to Test Irreducibility Richard P. Brent Australian National University joint work with Paul Zimmermann INRIA, Nancy France 12 Feb 2009 Outline Polynomials over finite fields Irreducibility criteria
More informationAlgorithms for Solving Linear Differential Equations with Rational Function Coefficients
Algorithms for Solving Linear Differential Equations with Rational Function Coefficients Erdal Imamoglu Department of Mathematics Florida State University May 25, 2017 1 / 37 1 Introduction 2 Formal Solutions,
More informationS XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library
S6349 - XMP LIBRARY INTERNALS Niall Emmart University of Massachusetts Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library High Performance Modular Exponentiation A^K mod P Where A,
More informationAn Approach to Hensel s Lemma
Irish Math. Soc. Bulletin 47 (2001), 15 21 15 An Approach to Hensel s Lemma gary mcguire Abstract. Hensel s Lemma is an important tool in many ways. One application is in factoring polynomials over Z.
More informationCache Complexity and Multicore Implementation for Univariate Real Root Isolation
Cache Complexity and Multicore Implementation for Univariate Real Root Isolation Changbo Chen, Marc Moreno Maza and Yuzhen Xie University of Western Ontario, London, Ontario, Canada E-mail: cchen5,moreno,yxie@csd.uwo.ca
More information(Group-theoretic) Fast Matrix Multiplication
(Group-theoretic) Fast Matrix Multiplication Ivo Hedtke Data Structures and Efficient Algorithms Group (Prof Dr M Müller-Hannemann) Martin-Luther-University Halle-Wittenberg Institute of Computer Science
More informationSome notes on efficient computing and setting up high performance computing environments
Some notes on efficient computing and setting up high performance computing environments Andrew O. Finley Department of Forestry, Michigan State University, Lansing, Michigan. April 17, 2017 1 Efficient
More informationDistributed computing of simultaneous Diophantine approximation problems
Stud. Univ. Babeş-Bolyai Math. 59(2014), No. 4, 557 566 Distributed computing of simultaneous Diophantine approximation problems Norbert Tihanyi, Attila Kovács and Ádám Szűcs Abstract. In this paper we
More informationComputing over Z, Q, K[X]
Computing over Z, Q, K[X] Clément PERNET M2-MIA Calcul Exact Outline Introduction Chinese Remainder Theorem Rational reconstruction Problem Statement Algorithms Applications Dense CRT codes Extension to
More informationComputers and Mathematics with Applications. Ramanujan s class invariants and their use in elliptic curve cryptography
Computers and Mathematics with Applications 59 ( 9 97 Contents lists available at Scienceirect Computers and Mathematics with Applications journal homepage: www.elsevier.com/locate/camwa Ramanujan s class
More informationBon anniversaire Marc!
Bon anniversaire Marc! The full counting function of Gessel walks is algebraic Alin Bostan (INRIA) joint work with Manuel Kauers (INRIA/RISC) Journées GECKO/TERA 2008 Context: restricted lattice walks
More informationCOMPUTER ARITHMETIC. 13/05/2010 cryptography - math background pp. 1 / 162
COMPUTER ARITHMETIC 13/05/2010 cryptography - math background pp. 1 / 162 RECALL OF COMPUTER ARITHMETIC computers implement some types of arithmetic for instance, addition, subtratction, multiplication
More informationCover and Decomposition Index Calculus on Elliptic Curves made practical
Cover and Decomposition Index Calculus on Elliptic Curves made practical Application to a previously unreachable curve over F p 6 Vanessa VITSE Antoine JOUX Université de Versailles Saint-Quentin, Laboratoire
More informationPartitions in the quintillions or Billions of congruences
Partitions in the quintillions or Billions of congruences Fredrik Johansson November 2011 The partition function p(n) counts the number of ways n can be written as the sum of positive integers without
More informationCPSC 518 Introduction to Computer Algebra Asymptotically Fast Integer Multiplication
CPSC 518 Introduction to Computer Algebra Asymptotically Fast Integer Multiplication 1 Introduction We have now seen that the Fast Fourier Transform can be applied to perform polynomial multiplication
More informationAlgorithms for Structured Linear Systems Solving and their Implementation
Algorithms for Structured Linear Systems Solving and their Implementation Seung Gyu Hyun Cheriton School of Computer Science University of Waterloo sghyun@edu.uwaterloo.ca Romain Lebreton LIRMM Université
More informationRSA Implementation. Oregon State University
RSA Implementation Çetin Kaya Koç Oregon State University 1 Contents: Exponentiation heuristics Multiplication algorithms Computation of GCD and Inverse Chinese remainder algorithm Primality testing 2
More informationComputing L-series of geometrically hyperelliptic curves of genus three. David Harvey, Maike Massierer, Andrew V. Sutherland
Computing L-series of geometrically hyperelliptic curves of genus three David Harvey, Maike Massierer, Andrew V. Sutherland The zeta function Let C/Q be a smooth projective curve of genus 3 p be a prime
More informationOn the Factor Refinement Principle and its Implementation on Multicore Architectures
Home Search Collections Journals About Contact us My IOPscience On the Factor Refinement Principle and its Implementation on Multicore Architectures This article has been downloaded from IOPscience Please
More informationScientific Computing: An Introductory Survey
Scientific Computing: An Introductory Survey Chapter 12 Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction permitted for noncommercial,
More informationFast Computation of the Nth Term of an Algebraic Series over a Finite Prime Field
Fast Computation of the Nth Term of an Algebraic Series over a Finite Prime Field Alin Bostan 1 Gilles Christol 2 Philippe Dumas 1 1 Inria (France) 2 Institut de mathématiques de Jussieu (France) Specfun
More informationPresentation of Normal Bases
Presentation of Normal Bases Mohamadou Sall mohamadou1.sall@ucad.edu.sn University Cheikh Anta Diop, Dakar (Senegal) Pole of Research in Mathematics and their Applications in Information Security (PRMAIS)
More informationClass polynomials for abelian surfaces
Class polynomials for abelian surfaces Andreas Enge LFANT project-team INRIA Bordeaux Sud-Ouest andreas.enge@inria.fr http://www.math.u-bordeaux.fr/~aenge LFANT seminar 27 January 2015 (joint work with
More information