Implementation of the DKSS Algorithm for Multiplication of Large Numbers

Implementation of the DKSS Algorithm for Multiplication of Large Numbers
Christoph Lüders, Universität Bonn
The International Symposium on Symbolic and Algebraic Computation (ISSAC), July 6-9, 2015, Bath, United Kingdom

Introduction
In 2008, De, Kurur, Saha & Saptharishi (DKSS) published a paper on how to multiply large numbers based on ideas of Fürer's algorithm. Their procedure was implemented and compared to Schönhage-Strassen multiplication to see how it performs in practice. But first, some context...

Representation of Large Numbers
On 64-bit machines a word can hold non-negative values < W = 2^64. A large number 0 <= a < W^n is represented as an array of n words: (a_0, a_1, ..., a_(n-1)). Each word a_i is a digit of a in base W. Ordinary (grade-school) multiplication of a * b: multiply each a_i with each b_j. Run-time is O(n^2). Function name: OMUL. Can we do better?
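
To make the word-array representation concrete, here is a minimal sketch of grade-school multiplication on digit arrays. It is not the talk's OMUL code; for portability it uses 32-bit digits with 64-bit intermediate products instead of full 64-bit words, and the function name omul is only illustrative.

    #include <cstdint>
    #include <vector>

    // Grade-school multiplication of two non-negative numbers given as
    // little-endian digit arrays in base 2^32. Costs O(n^2) digit products.
    std::vector<uint32_t> omul(const std::vector<uint32_t>& a,
                               const std::vector<uint32_t>& b) {
        std::vector<uint32_t> r(a.size() + b.size(), 0);
        for (size_t i = 0; i < a.size(); ++i) {
            uint64_t carry = 0;
            for (size_t j = 0; j < b.size(); ++j) {
                // 32x32 -> 64-bit product plus the digit already there plus carry
                uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
                r[i + j] = (uint32_t)t;   // low 32 bits stay in place
                carry    = t >> 32;       // high 32 bits are carried on
            }
            r[i + b.size()] += (uint32_t)carry;
        }
        return r;
    }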

Multiplication: Karatsuba
(Karatsuba 1960): Cut the numbers a and b in half. With the help of some linear-time operations, only 3 half-sized multiplications are needed:
a = a_0 + a_1 W^n, b = b_0 + b_1 W^n
P_0 = a_0 b_0, P_1 = (a_0 - a_1)(b_0 - b_1), P_2 = a_1 b_1
ab = P_0 (1 + W^n) - P_1 W^n + P_2 (W^n + W^(2n))
When applied recursively, the run-time is O(n^(log_2 3)) ≈ O(n^1.58). Function name: KMUL.
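
As an illustration of the three-multiplication scheme, here is a small sketch that applies Karatsuba recursively to signed coefficient vectors (so the differences a_0 - a_1 need no borrow handling; carry propagation into base W would come afterwards). The type Poly and the names poly_omul and poly_kmul are my own, not the talk's KMUL, and coefficients are assumed small enough not to overflow int64_t.

    #include <cstdint>
    #include <vector>
    using Poly = std::vector<int64_t>;   // coefficient vector, index = power of the split point

    // Schoolbook polynomial product, used below the recursion threshold.
    Poly poly_omul(const Poly& a, const Poly& b) {
        if (a.empty() || b.empty()) return {};
        Poly r(a.size() + b.size() - 1, 0);
        for (size_t i = 0; i < a.size(); ++i)
            for (size_t j = 0; j < b.size(); ++j)
                r[i + j] += a[i] * b[j];
        return r;
    }

    // Karatsuba on polynomials of equal even length 2n:
    //   a = a0 + a1*x^n, b = b0 + b1*x^n,
    //   P0 = a0*b0, P1 = (a0-a1)*(b0-b1), P2 = a1*b1,
    //   a*b = P0*(1 + x^n) - P1*x^n + P2*(x^n + x^(2n)).
    Poly poly_kmul(const Poly& a, const Poly& b) {
        size_t len = a.size();
        if (len <= 8 || len % 2 != 0 || b.size() != len) return poly_omul(a, b);
        size_t n = len / 2;
        Poly a0(a.begin(), a.begin() + n), a1(a.begin() + n, a.end());
        Poly b0(b.begin(), b.begin() + n), b1(b.begin() + n, b.end());
        Poly d0(n), d1(n);
        for (size_t i = 0; i < n; ++i) { d0[i] = a0[i] - a1[i]; d1[i] = b0[i] - b1[i]; }
        Poly P0 = poly_kmul(a0, b0), P1 = poly_kmul(d0, d1), P2 = poly_kmul(a1, b1);
        Poly r(2 * len - 1, 0);
        for (size_t i = 0; i < P0.size(); ++i) { r[i] += P0[i]; r[i + n] += P0[i]; }
        for (size_t i = 0; i < P1.size(); ++i) { r[i + n] -= P1[i]; }
        for (size_t i = 0; i < P2.size(); ++i) { r[i + n] += P2[i]; r[i + 2 * n] += P2[i]; }
        return r;
    }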

Multiplication: Toom-Cook
(Toom 1963, Cook 1966): Cut the numbers into k >= 2 pieces and perform only 2k - 1 small multiplications plus some linear-time operations. Run-time is O(n^(log_k(2k-1))). For k = 3, 4, 5 this is O(n^1.46), O(n^1.40), O(n^1.37). Function name for k = 3: T3MUL. Problem: the number of linear-time operations grows quickly with k.
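
A sketch of one Toom-3 step on signed coefficient vectors may help: each factor is split into k = 3 parts, evaluated at the five points 0, 1, -1, 2 and infinity, and only 2k - 1 = 5 part-sized products are needed before interpolating the degree-4 result. This is a toy model of the idea behind T3MUL, not the talk's code, and it reuses the illustrative Poly type and poly_omul helper from the Karatsuba sketch above.

    // One Toom-3 step for polynomials of length 3n (parts a0, a1, a2 of length n each).
    Poly poly_t3mul(const Poly& a, const Poly& b) {
        size_t len = a.size(), n = len / 3;
        if (len % 3 != 0 || b.size() != len) return poly_omul(a, b);
        auto part = [n](const Poly& p, size_t k) { return Poly(p.begin() + k*n, p.begin() + (k+1)*n); };
        Poly a0 = part(a,0), a1 = part(a,1), a2 = part(a,2);
        Poly b0 = part(b,0), b1 = part(b,1), b2 = part(b,2);
        auto eval = [n](const Poly& p0, const Poly& p1, const Poly& p2, int64_t x) {
            Poly r(n);
            for (size_t i = 0; i < n; ++i) r[i] = p0[i] + x*p1[i] + x*x*p2[i];
            return r;
        };
        // Five part-sized products instead of nine.
        Poly v0   = poly_omul(a0, b0);                                  // x = 0
        Poly v1   = poly_omul(eval(a0,a1,a2, 1), eval(b0,b1,b2, 1));    // x = 1
        Poly vm1  = poly_omul(eval(a0,a1,a2,-1), eval(b0,b1,b2,-1));    // x = -1
        Poly v2   = poly_omul(eval(a0,a1,a2, 2), eval(b0,b1,b2, 2));    // x = 2
        Poly vinf = poly_omul(a2, b2);                                  // x = infinity
        // Interpolate c0..c4 of the product c(x) = c0 + c1 x + ... + c4 x^4, per coefficient.
        Poly r(2*len - 1, 0);
        for (size_t i = 0; i < 2*n - 1; ++i) {
            int64_t c0 = v0[i], c4 = vinf[i];
            int64_t t1 = (v1[i] + vm1[i]) / 2;          // = c0 + c2 + c4
            int64_t t2 = (v1[i] - vm1[i]) / 2;          // = c1 + c3
            int64_t c2 = t1 - c0 - c4;
            int64_t c3 = (v2[i] - c0 - 4*c2 - 16*c4 - 2*t2) / 6;   // exact division
            int64_t c1 = t2 - c3;
            r[i] += c0; r[i+n] += c1; r[i+2*n] += c2; r[i+3*n] += c3; r[i+4*n] += c4;
        }
        return r;
    }

The growing pile of additions, subtractions and exact divisions in the interpolation is exactly the "number of linear-time operations" that makes large k unattractive.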

Multiplication: FFT Methods
(Strassen 1968): Cut the numbers a and b into n/2 pieces each and interpret the pieces as coefficients of polynomials in R[x], R a ring. Evaluate the polynomials at n points, multiply the sample values and interpolate to obtain the product; then propagate carries. If ω is a primitive n-th root of unity in R, evaluation and interpolation can be done at the points ω^k, 0 <= k < n, using the fast Fourier transform (FFT) with O(n log n) steps. Function name: QMUL. Problem: the larger n becomes, the more precision is needed in the coefficient ring R. This limits the length of the input numbers.
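
A compact sketch of the evaluate-multiply-interpolate idea, using the usual double-precision FFT over the complex numbers; the rounding at the end is exactly where the precision problem shows up once the pieces or the transform length grow too large. This is only a toy model, and the names fft and qmul are illustrative, not the talk's QMUL.

    #include <cmath>
    #include <complex>
    #include <cstdint>
    #include <vector>

    using cd = std::complex<double>;

    // In-place radix-2 FFT; invert = true computes the inverse transform.
    void fft(std::vector<cd>& a, bool invert) {
        size_t n = a.size();                       // n must be a power of two
        for (size_t i = 1, j = 0; i < n; ++i) {    // bit-reversal permutation
            size_t bit = n >> 1;
            for (; j & bit; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) std::swap(a[i], a[j]);
        }
        const double pi = std::acos(-1.0);
        for (size_t len = 2; len <= n; len <<= 1) {
            double ang = 2 * pi / len * (invert ? -1 : 1);
            cd wlen(std::cos(ang), std::sin(ang));
            for (size_t i = 0; i < n; i += len) {
                cd w(1);
                for (size_t j = 0; j < len / 2; ++j) {
                    cd u = a[i + j], v = a[i + j + len / 2] * w;
                    a[i + j] = u + v;
                    a[i + j + len / 2] = u - v;
                    w *= wlen;
                }
            }
        }
        if (invert) for (cd& x : a) x /= (double)n;
    }

    // Multiply two numbers given as little-endian digit vectors in a small base.
    std::vector<int64_t> qmul(const std::vector<int64_t>& a,
                              const std::vector<int64_t>& b, int64_t base) {
        size_t n = 1;
        while (n < a.size() + b.size()) n <<= 1;
        std::vector<cd> fa(a.begin(), a.end()), fb(b.begin(), b.end());
        fa.resize(n); fb.resize(n);
        fft(fa, false); fft(fb, false);                 // evaluate both polynomials
        for (size_t i = 0; i < n; ++i) fa[i] *= fb[i];  // pointwise products
        fft(fa, true);                                  // interpolate
        std::vector<int64_t> r(n);
        int64_t carry = 0;
        for (size_t i = 0; i < n; ++i) {                // round and propagate carries
            int64_t t = (int64_t)std::llround(fa[i].real()) + carry;
            r[i] = t % base;
            carry = t / base;
        }
        return r;                                       // carry is 0 if n was chosen large enough
    }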

Multiplication: Schönhage-Strassen
(Schönhage & Strassen 1971): Use R = Z/(2^K + 1)Z and ω = 2 as a primitive 2K-th root of unity for the FFT. Multiplications by ω^k are just cyclic shifts and can be done in linear time. Run-time is O(N log N log log N), and the coefficient length is O(√N). Function name: SMUL. Problem: the order of ω is not very high. Except for 2, there are generally no higher-order roots of unity, so the FFT length is quite limited. Nevertheless, Schönhage-Strassen is the standard for multiplication of large numbers of more than about 150,000 bits.
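
The reason ω = 2 is so convenient: modulo 2^K + 1 we have 2^K ≡ -1, so multiplying by a power of two only shifts bits and negates the part that wraps around. A minimal sketch, with K chosen small enough that everything fits in a 64-bit word and can be checked against direct arithmetic; the function shift_mod_fermat is illustrative and unrelated to the talk's SMUL code.

    #include <cassert>
    #include <cstdint>

    // Compute x * 2^s mod (2^K + 1) using only shifts and one subtraction.
    // Bits shifted above position K wrap around with a sign change, since 2^K = -1.
    uint64_t shift_mod_fermat(uint64_t x, unsigned s, unsigned K) {
        uint64_t m = (1ULL << K) + 1;              // the modulus 2^K + 1
        x %= m;
        s %= 2 * K;                                // 2 has order 2K modulo 2^K + 1
        if (s >= K) { x = (m - x) % m; s -= K; }   // use 2^K = -1 to fold a half turn
        uint64_t low  = (x << s) & ((1ULL << K) - 1);   // bits staying below 2^K
        uint64_t high = (x << s) >> K;                  // bits that wrap around
        return (low + m - high) % m;               // low - high (mod 2^K + 1)
    }

    int main() {
        const unsigned K = 16;
        const uint64_t m = (1ULL << K) + 1;
        for (uint64_t x = 0; x < m; ++x)
            for (unsigned s = 0; s < 2 * K; ++s) {
                uint64_t ref = x;                           // direct computation
                for (unsigned i = 0; i < s; ++i) ref = (ref * 2) % m;
                assert(shift_mod_fermat(x, s, K) == ref);
            }
    }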

Crossover Points Between Algorithms
[Figure: log-log plot of execution cycles versus input size in 64-bit words for OMUL, KMUL, T3MUL, QMUL and SMUL. Crossover points: KMUL from 28 words, T3MUL from 152 words, SMUL from more than 2464 words.]

Multiplication: DKSS
(De, Kurur, Saha & Saptharishi 2008): Use the polynomial quotient ring R = P[α]/(α^m + 1) with P = Z/p^c Z, p = h * 2M + 1 prime. Select M ≈ N/log^2 N and m ≈ log N (rounded to powers of 2), M > m. Let µ = M/m. From a generator of F_p, calculate a primitive 2M-th root of unity ρ in P[α] with ρ^µ = α. With α as primitive 2m-th root of unity and modulus (α^m + 1), multiplications by α^k are cyclic shifts: fast! ρ is a high-order root of unity: large FFT length.

Multiplication: DKSS (continued)
A length-2M FFT is calculated like this (2M = µ * 2m):
- Interpret the coefficients as a matrix with 2m rows and µ columns.
- Do µ length-2m FFTs (on the columns) with α as root of unity.
- Perform bad multiplications on the coefficients, i.e. multiply them by powers ρ^k.
- Do 2m length-µ FFTs (on the rows) by calling the FFT routine recursively.
Multiplication in R is reduced to integer multiplication by use of Kronecker-Schönhage substitution. Run-time is O(N log N * K^(log* N)) with K = 16, and the coefficient length is O(log^2 N). Function name: DKSS_MUL.
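
The row/column scheme is the classical Cooley-Tukey matrix decomposition of the FFT. The sketch below shows the same three phases (column transforms with the cheap root, twiddle factors in place of the bad multiplications, row transforms) over the small prime field Z/257Z instead of the DKSS ring R = P[α]/(α^m + 1); it illustrates the structure only, and the names dft and dft_two_level are mine, not DKSS_MUL internals.

    #include <cassert>
    #include <cstdint>
    #include <vector>

    using u64 = uint64_t;

    u64 pow_mod(u64 b, u64 e, u64 p) {          // p small enough that b*b fits in 64 bits
        u64 r = 1 % p;
        for (; e; e >>= 1, b = b * b % p) if (e & 1) r = r * b % p;
        return r;
    }

    // Naive DFT of length N over Z/pZ with N-th root of unity w.
    std::vector<u64> dft(const std::vector<u64>& x, u64 w, u64 p) {
        size_t N = x.size();
        std::vector<u64> X(N, 0);
        for (size_t k = 0; k < N; ++k)
            for (size_t n = 0; n < N; ++n)
                X[k] = (X[k] + x[n] * pow_mod(w, n * k % N, p)) % p;
        return X;
    }

    // The same DFT via the matrix decomposition N = N1 * N2:
    // N1 column transforms of length N2, twiddle factors, N2 row transforms of length N1.
    std::vector<u64> dft_two_level(const std::vector<u64>& x, size_t N1, size_t N2, u64 w, u64 p) {
        size_t N = N1 * N2;
        assert(x.size() == N);
        std::vector<u64> Y(N), X(N);
        // 1) N1 inner transforms of length N2 with the "cheap" root w^N1 (the role of alpha).
        for (size_t n1 = 0; n1 < N1; ++n1) {
            std::vector<u64> col(N2);
            for (size_t n2 = 0; n2 < N2; ++n2) col[n2] = x[n1 + N1 * n2];
            col = dft(col, pow_mod(w, N1, p), p);
            for (size_t k2 = 0; k2 < N2; ++k2) Y[n1 + N1 * k2] = col[k2];
        }
        // 2) "Bad multiplications": twiddle by powers of the full root w (the role of rho).
        for (size_t n1 = 0; n1 < N1; ++n1)
            for (size_t k2 = 0; k2 < N2; ++k2)
                Y[n1 + N1 * k2] = Y[n1 + N1 * k2] * pow_mod(w, n1 * k2 % N, p) % p;
        // 3) N2 outer transforms of length N1 with root w^N2 (the recursion point in DKSS).
        for (size_t k2 = 0; k2 < N2; ++k2) {
            std::vector<u64> row(N1);
            for (size_t n1 = 0; n1 < N1; ++n1) row[n1] = Y[n1 + N1 * k2];
            row = dft(row, pow_mod(w, N2, p), p);
            for (size_t k1 = 0; k1 < N1; ++k1) X[k2 + N2 * k1] = row[k1];
        }
        return X;
    }

    int main() {
        const u64 p = 257;                   // 2^8 + 1, a small Fermat prime
        const size_t N = 16, N1 = 4, N2 = 4;
        u64 w = pow_mod(3, (p - 1) / N, p);  // 3 generates (Z/257Z)*, so w has order 16
        std::vector<u64> x(N);
        for (size_t i = 0; i < N; ++i) x[i] = (i * i + 1) % p;
        assert(dft(x, w, p) == dft_two_level(x, N1, N2, w, p));
    }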

Multiplication: Simplified DKSS
In genuine DKSS, the prime p is searched for at run-time. To keep that time low, p must be kept small, so the input numbers are encoded as k-variate polynomials, k constant. Since the input length is limited by the available memory, we can instead precompute all of the required primes p and generators of F_p. This allows the use of univariate polynomials and simplifies the calculation of the root of unity ρ. We can use c = 1 and hence P = Z/pZ. For the 64-bit architecture, only 6 primes need to be precomputed.
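
Because the FFT length 2M is a power of two, the precomputed primes have the form p = h * 2^s + 1. The following sketch shows how such a table could be generated with a deterministic Miller-Rabin test; it mirrors the idea only, and the exponent range, the function names and the printed primes are illustrative, not the table actually used by the implementation.

    #include <cstdint>
    #include <cstdio>

    using u64  = uint64_t;
    using u128 = unsigned __int128;

    u64 pow_mod(u64 b, u64 e, u64 m) {
        u64 r = 1; b %= m;
        for (; e; e >>= 1, b = (u128)b * b % m) if (e & 1) r = (u128)r * b % m;
        return r;
    }

    // Deterministic Miller-Rabin for 64-bit integers (these bases suffice below 2^64).
    bool is_prime(u64 n) {
        if (n < 2) return false;
        for (u64 q : {2ULL, 3ULL, 5ULL, 7ULL, 11ULL, 13ULL, 17ULL, 19ULL, 23ULL, 29ULL, 31ULL, 37ULL})
            if (n % q == 0) return n == q;
        u64 d = n - 1; int s = 0;
        while (d % 2 == 0) { d /= 2; ++s; }
        for (u64 a : {2ULL, 3ULL, 5ULL, 7ULL, 11ULL, 13ULL, 17ULL, 19ULL, 23ULL, 29ULL, 31ULL, 37ULL}) {
            u64 x = pow_mod(a, d, n);
            if (x == 1 || x == n - 1) continue;
            bool composite = true;
            for (int i = 1; i < s && composite; ++i) {
                x = (u128)x * x % n;
                if (x == n - 1) composite = false;
            }
            if (composite) return false;
        }
        return true;
    }

    int main() {
        // For each exponent s, find a small odd h with p = h*2^s + 1 prime.
        // Such a p supports FFT lengths up to 2^s, since its multiplicative
        // group has order h*2^s and therefore contains a 2^s-th root of unity.
        for (int s = 40; s <= 45; ++s)
            for (u64 h = 1; ; h += 2) {
                u64 p = (h << s) + 1;
                if (is_prime(p)) {
                    std::printf("s=%d: p = %llu * 2^%d + 1\n", s, (unsigned long long)h, s);
                    break;
                }
            }
    }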

Comparison of Execution Time
[Figure: log-log plot of execution cycles versus input size in 64-bit words (10^4 to 10^8) for SMUL and DKSS_MUL.]

Quotient of Run-times
[Figure: plot of the quotient T_DKSS_MUL / T_SMUL versus input size in 64-bit words (10^4 to 10^8); the quotient stays roughly between 28 and 36.]

Results
For the numbers tested (up to 1.27 GB input size, with a total of 26 GB of temporary memory required):
- DKSS_MUL is between 27 and 36 times slower than SMUL.
- DKSS_MUL requires 2.3 times as much temporary memory as SMUL.
- About 80 % of the run-time is spent on bad multiplications, i.e. multiplications by powers ρ^k that are not powers of α. Another 9 % is spent on pointwise products.
- Recursion did not take place: even with the largest inputs, the inner multiplications were just 195 words long.
- Cache effects did not slow it down, either.

When Will DKSS Beat Schönhage-Strassen?
Model the SMUL run-time as T_σ ≈ σ * N log N log log N and the DKSS_MUL run-time as T_η ≈ η * N log N * K^(log* N), K = 16. Find the fitting constants σ and η from the measured run-times. Setting T_σ >= T_η and cancelling N log N leaves σ log log N >= η K^(log* N); solving this numerically gives N ≈ 10^(10^4796)!!

Future Work
Some ideas:
- Exploit the sparseness of the factors in the underlying multiplication. Estimated speed-up: factor 2.
- Use a variant of Kronecker-Schönhage substitution (Harvey). The parameters p, M and m should be selected with more care. Estimated speed-up: maybe 30 %.
- Speed up the modular reduction (Montgomery's trick or similar). Estimated speed-up: about 22 %.
Total estimated possible speed-up: factor 3.2, but even then DKSS_MUL would still be at best 8.5 times slower than SMUL.

Source Code & Thanks
The implementation was done in C++ and assembly language under Windows as part of BIGNUM, my large-integer library. Its multiplication compares favorably with MPIR (GMP for Windows) and is on average only 1.3 times slower. The source code is available from http://www.wrogn.com/bignum and is licensed under the LGPL. Many thanks to Andreas Weber and Michael Clausen.