Short Division of Long Integers. (joint work with David Harvey)

Similar documents
The Euclidean Division Implemented with a Floating-Point Multiplication and a Floor

arxiv: v1 [cs.na] 8 Feb 2016

Faster arithmetic for number-theoretic transforms

Low-Weight Polynomial Form Integers for Efficient Modular Multiplication

Division-Free Binary-to-Decimal Conversion

Reliable real arithmetic. one-dimensional dynamical systems

Optimized Binary64 and Binary128 Arithmetic with GNU MPFR

Remainders. We learned how to multiply and divide in elementary

In-place Arithmetic for Univariate Polynomials over an Algebraic Number Field

feb abhi shelat Matrix, FFT

Lecture 11. Advanced Dividers

Sparse Polynomial Multiplication and Division in Maple 14

Modern Computer Arithmetic

Faster arbitrary-precision dot product and matrix multiplication

New Results on the Distance Between a Segment and Z 2. Application to the Exact Rounding

A new multiplication algorithm for extended precision using floating-point expansions. Valentina Popescu, Jean-Michel Muller,Ping Tak Peter Tang

EDULABZ INTERNATIONAL NUMBER SYSTEM

Modern Computer Arithmetic

Partitions in the quintillions or Billions of congruences

shelat 16f-4800 sep Matrix Mult, Median, FFT

Continued fractions and number systems: applications to correctly-rounded implementations of elementary functions and modular arithmetic.

2WF15 - Discrete Mathematics 2 - Part 1. Algorithmic Number Theory

Accurate Multiple-Precision Gauss-Legendre Quadrature

Fast Polynomial Multiplication

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017

Computing Bernoulli numbers

Discrete Mathematics U. Waterloo ECE 103, Spring 2010 Ashwin Nayak May 17, 2010 Recursion

Computer Architecture 10. Residue Number Systems

On Schönhage s algorithm and subquadratic integer gcd computation

Implementation of the DKSS Algorithm for Multiplication of Large Numbers

Exact Arithmetic on a Computer

Faster integer multiplication using short lattice vectors

Binary Multipliers. Reading: Study Chapter 3. The key trick of multiplication is memorizing a digit-to-digit table Everything else was just adding

Algorithms CMSC Basic algorithms in Number Theory: Euclid s algorithm and multiplicative inverse


Curve41417: Karatsuba revisited

Computing Machine-Efficient Polynomial Approximations

More Polynomial Equations Section 6.4

Chapter 1: Solutions to Exercises

Efficient implementation of the Hardy-Ramanujan-Rademacher formula

feb abhi shelat FFT,Median

Automating the Verification of Floating-point Algorithms

Fast and Small: Multiplying Polynomials without Extra Space

Polynomial multiplication and division using heap.

The tangent FFT. D. J. Bernstein University of Illinois at Chicago

On the computation of the reciprocal of floating point expansions using an adapted Newton-Raphson iteration

An algorithm for the Jacobi symbol

Lesson 7.1 Polynomial Degree and Finite Differences

MATH Dr. Halimah Alshehri Dr. Halimah Alshehri

Algorithm Design and Analysis

Discrete Mathematics CS October 17, 2006

Algorithms (II) Yu Yu. Shanghai Jiaotong University

Fall 2017 Test II review problems

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations

Experience in Factoring Large Integers Using Quadratic Sieve

Implementation of float float operators on graphic hardware

RSA Implementation. Oregon State University

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations

CSE 421 Algorithms. T(n) = at(n/b) + n c. Closest Pair Problem. Divide and Conquer Algorithms. What you really need to know about recurrences

Computer Arithmetic. MATH 375 Numerical Analysis. J. Robert Buchanan. Fall Department of Mathematics. J. Robert Buchanan Computer Arithmetic

Lecture 2: Number Representations (2)

The Middle Product Algorithm I

Computation of the error functions erf and erfc in arbitrary precision with correct rounding

Speeding up characteristic 2: I. Linear maps II. The Å(Ò) game III. Batching IV. Normal bases. D. J. Bernstein University of Illinois at Chicago

Computation of the omega function

Formal verification of IA-64 division algorithms

Integer factorization, part 1: the Q sieve. part 2: detecting smoothness. D. J. Bernstein

Accurate Solution of Triangular Linear System

What s the best data structure for multivariate polynomials in a world of 64 bit multicore computers?

3.3 Dividing Polynomials. Copyright Cengage Learning. All rights reserved.

not to be republished NCERT REAL NUMBERS CHAPTER 1 (A) Main Concepts and Results

Math /Foundations of Algebra/Fall 2017 Foundations of the Foundations: Proofs

SCORE BOOSTER JAMB PREPARATION SERIES II

Fast evaluation of iterated multiplication of very large polynomials: An application to chinese remainder theory

Residue Number Systems. Alternative number representations. TSTE 8 Digital Arithmetic Seminar 2. Residue Number Systems.

Arithmetic in Integer Rings and Prime Fields

Introduction CSE 541

CSE 20 DISCRETE MATH. Fall

A New Algorithm to Compute Terms in Special Types of Characteristic Sequences

Discrete Mathematics and Probability Theory Summer 2014 James Cook Note 5

A parallel implementation for polynomial multiplication modulo a prime.

COMPUTER ARITHMETIC. 13/05/2010 cryptography - math background pp. 1 / 162

S XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library

Issues in Implementation of Public Key Cryptosystems

Notes on arithmetic. 1. Representation in base B

SCALED REMAINDER TREES

IEEE P1363 / D13 (Draft Version 13). Standard Specifications for Public Key Cryptography

Lecture 8: Number theory

Arithmétique et Cryptographie Asymétrique

L1 2.1 Long Division of Polynomials and The Remainder Theorem Lesson MHF4U Jensen

Hyperelliptic-curve cryptography. D. J. Bernstein University of Illinois at Chicago

Tunable Floating-Point for Energy Efficient Accelerators

Algebra Review. Terrametra Resources. Lynn Patten

Elements of Floating-point Arithmetic

Integer multiplication and the truncated product problem

Basic elements of number theory

Basic elements of number theory

π π π points:= { seq([n*cos(pi/4*n), N*sin(Pi/4*N)], N=0..120) }:

Counting points on smooth plane quartics

SUFFIX PROPERTY OF INVERSE MOD

Transcription:

Short Division of Long Integers (joint work with David Harvey) Paul Zimmermann October 6, 2011

The problem to be solved Divide efficiently a p-bit floating-point number by another p-bit f-p number in the 100-10000 digit range Paul Zimmermann - Short Division of Long Integers October 6, 2011-2

From www.mpfr.org/mpfr-3.0.0/timings.html (ms): Maple Mathematica Sage GMP MPF MPFR digits 12.00 6.0.1 4.5.2 5.0.1 3.0.0 100 mult 0.0020 0.0006 0.00053 0.00011 0.00012 div 0.0029 0.0017 0.00076 0.00031 0.00032 sqrt 0.032 0.0018 0.00132 0.00055 0.00049 1000 mult 0.0200 0.007 0.0039 0.0036 0.0028 div 0.0200 0.015 0.0071 0.0040 0.0058 sqrt 0.160 0.011 0.0064 0.0049 0.0047 10000 mult 0.80 0.28 0.11 0.107 0.095 div 0.80 0.56 0.28 0.198 0.261 sqrt 3.70 0.36 0.224 0.179 0.176 Paul Zimmermann - Short Division of Long Integers October 6, 2011-3

What is GMP (GNU MP)? the most popular library for arbitrary-precision arithmetic distributed under a free license (LGPL) from gmplib.org main developer is Torbjörn Granlund contains several layers: mpn (arrays of words), mpz (integers), mpq (rationals), mpf (floating-point numbers) mpn is the low-level layer, with optimized assembly code for common hardware, and provides optimized implementations of state-of-the-art algorithms Paul Zimmermann - Short Division of Long Integers October 6, 2011-4

Can we do better than GMP? An anonymous reviewer said: What are the paper s weaknesses? The resulting performance, in the referee s opinion, is only marginally better a standard exact-quotient algorithm in GMP. One can expect about 10% improvement. It seems to be a weak result for the sophisticated recursive algorithm with the big error analysis effort. Paul Zimmermann - Short Division of Long Integers October 6, 2011-5

What is GNU MPFR? a widely used library for arbitrary-precision floating-point arithmetic distributed under a free license (LGPL) from mpfr.org main developers are Guillaume Hanrot, Vincent Lefèvre, Patrick Pélissier, Philippe Théveny and Paul Zimmermann contrary to GMP mpf, implements correct rounding and mathematical functions (exp, log, sin,...) implements Sections 3.7 (Extended and extendable precisions) and 9.2 (Recommended correctly rounded functions) of IEEE 754-2008 aims to be (at least) as efficient than other arbitrary-precision floating-point without correct rounding Paul Zimmermann - Short Division of Long Integers October 6, 2011-6

The problem to be solved (binary fp division) Assume we want to divide a > 0 of p bits by b > 0 of p bits, with a quotient c of p bits. First write a = m a 2 ea and b = m b 2 e b such that: m b has exactly p bits 2 p 1 m a /m b < 2 p (m a has 2p 1 or 2p bits) The problem reduces to finding the p-bit correct rounding of m a /m b with the given rounding mode. We do not assume that the divisor b is invariant, thus we do not allow precomputations involving b. Paul Zimmermann - Short Division of Long Integers October 6, 2011-7

Division routine mpfr div in MPFR 3.0.x The MPFR division routine relies on the (GMP) low-level division with remainder mpn divrem. mpn divrem computes q and r such that m a = qm b + r with 0 r < m b. Since 2 p 1 m a /m b < 2 p, q has exactly p bits. The correct rounding of the quotient is q or q + 1 depending on the rounding mode. For rounding to nearest, if r < m b /2, the correct rounding is q; if r > m b /2, the correct rounding is q + 1. Paul Zimmermann - Short Division of Long Integers October 6, 2011-8

What s new with GMP 5? In GMP 5, the floating-point division (mpf div) calls mpn div q, which only computes the (exact) quotient, and is faster (on average) than mpn divrem or its equivalent mpn tdiv qr. This is based on an approximate Barrett s algorithm, presented at ICMS 2006. In most cases computing one more word of the quotient is enough to decide the correct rounding: pad the dividend with two zero low words pad the divisor with one zero low word one will obtain one extra quotient low word Paul Zimmermann - Short Division of Long Integers October 6, 2011-9

Our goal Design an approximate division routine for arrays of n words An array of n words [a n 1,..., a 1, a 0 ] represents the integer a n 1 β n 1 + + a 1 β + a 0 with β = 2 64 We want a rigorous error analysis and a O(n) error Paul Zimmermann - Short Division of Long Integers October 6, 2011-10

Plan of the talk Mulders short product Mulders short division Barrett s algorithm l-fold Barrett s algorithm (cf Hasenplaugh, Gaubatz, Gopal, Arith 18) Paul Zimmermann - Short Division of Long Integers October 6, 2011-11

Mulders short product for polynomials (2000) Short product: compute the upper half of U V, U and V having n terms (degree n 1) With Karatsuba s multiplication, can save 20% over a full product. Paul Zimmermann - Short Division of Long Integers October 6, 2011-12

Our variant of Mulders s algorithm for integers Algorithm ShortMul. Input: U = n 1 i=0 u iβ i, V = n 1 i=0 v iβ i, integer n Output: an integer approximation W of UV β n 1: if n < n 0 then 2: W ShortMulNaive(U, V, n) 3: else 4: choose a parameter k, n/2 + 1 k < n, l n k 5: write U = U 1 β l + U 0, V = V 1 β l + V 0 6: write U = U 1 βk + U 0, V = V 1 βk + V 0 7: W 11 Mul(U 1, V 1, k) 2k words 8: W 10 ShortMul(U 1, V 0, l) l most significant words 9: W 01 ShortMul(U 0, V 1, l) l most significant words 10: W W 11 β 2l n + W 10 + W 01 Paul Zimmermann - Short Division of Long Integers October 6, 2011-13

Lemma The output of Algorithm ShortMul satisfies UV β n (n 1) < W UV β n. (In other words, the error is less than n ulps.) Paul Zimmermann - Short Division of Long Integers October 6, 2011-14

Mulders short division (2000) U is unknown V is known W = UV is known 1. estimate U high from V high and W high, subtract U high V high from W 2. compute U high V low and subtract from W 3. estimate U low from V high and the remainder W Paul Zimmermann - Short Division of Long Integers October 6, 2011-15

Our variant of Mulders short division for integers Algorithm ShortDiv. Input: W = 2n 1 i=0 w iβ i, V = n 1 i=0 v iβ i, with V β n /2 Output: an integer approximation U of Q = W /V 1: if n < n 1 then 2: U Div(W, V ) Returns W /V 3: else 4: choose a parameter k, n/2 < k n, l n k 5: write W = W 1 β 2l + W 0, V = V 1 β l + V 0, V = V 1 βk + V 0 6: (U 1, R 1 ) DivRem(W 1, V 1 ) 7: write U 1 = U 1 βk l + S with 0 S < β k l 8: T ShortMul(U 1, V 0, l) 9: W 01 R 1 β l + (W 0 div β l ) T β k 10: while W 01 < 0 do (U 1, W 01 ) (U 1 1, W 01 + V ) 11: U 0 ShortDiv(W 01 div β k l, V 1, l) 12: return U 1 β l + U 0 Paul Zimmermann - Short Division of Long Integers October 6, 2011-16

Lemma The output U of Algorithm ShortDiv satisfies, with Q = W /V : Q U Q + 2(n 1). (In other words, the error is less than 2n ulps.) Paul Zimmermann - Short Division of Long Integers October 6, 2011-17

The optimal cutoff k in ShortMul and ShortDiv heavily depends on n. There is no simple formula. Instead, we determine the best k(n) by tuning, for say n < 1000 words (about 20000 digits). For ShortMul the best k varies between 0.5n and 0.64n, for ShortDiv it varies between 0.54n and 0.88n (for a particular computer and a given version of GMP). Paul Zimmermann - Short Division of Long Integers October 6, 2011-18

Barrett s Algorithm (1987) Goal: given W and V, compute quotient Q and remainder R: W = QV + R 1. compute an approximation I of 1/V 2. compute an approximation Q = WI of the quotient 3. (optional) compute the remainder R = W QV and adjust if necessary When V is not invariant, computing 1/V is quite expensive: l-fold reduction from Hasenplaugh, Gaubatz, Gopal (Arith 18, 2007) (LSB variant) for l = 2, HGG is exactly Karp-Markstein division (1997) Paul Zimmermann - Short Division of Long Integers October 6, 2011-19

2-fold division (Karp-Markstein) 1. compute an approximation I of 1/V to n/2 words 2. deduce the upper n/2 words Q 1 = ShortMul(W, I, n/2) 3. subtract Q 1 V from W, giving W 4. deduce the lower n/2 words Q 0 = ShortMul(W, I, n/2) In step 3, Q 1 V is a (n/2) n multiplication, giving a 3n/2 product. However, we know the upper n/2 words match with W, and we are not interested in the lower n/2 words. This is exactly a middle product (Hanrot, Quercia, Zimmermann, 2004): Q 1 middle lower product upper V Paul Zimmermann - Short Division of Long Integers October 6, 2011-20

The 3-fold division algorithm Paul Zimmermann - Short Division of Long Integers October 6, 2011-21

The integer middle product (Harvey 2009) Input: X of m words and Y of n words, with m n Lemma X = m 1 i=0 Output:MP m,n (X, Y ) = n 1 x i β i, Y = y j β j j=0 0 i<m,0 j<n n 1 i+j m 1 x i y j β i+j n+1 (XY β n 1 MP m,n (X, Y )) mod β m < (n 1)β n Classical case: m = 2n 1 with n 2 word-products. Quadratic-time algorithms: n 2. Karatsuba-like middle product: O(n 1.58... ). FFT-variant: O(M(n)). Paul Zimmermann - Short Division of Long Integers October 6, 2011-22

l-fold Barrett division Algorithm FoldDiv(l), l 2. Input: W = 2n 1 i=0 w i β i, V = n 1 i=0 v iβ i, with V β n /2, W < β n V Output: an integer approximation U of Q = W /V 1: if n < n 2 then return U Div(W, V ) 2: k n/l 3: write V = V 1 β n (k+1) + V 0 V 1 has k + 1 words 4: I (β 2(k+1) 1)/V 1 5: r n, W r W, U 0 6: while r > k + 1 do invariant: 0 W r < β r V 7: Q r ShortMul(W r div β n+r (k+1), I, k + 1) 8: Q r min(q r, β k+1 1) 9: T r MP r+1,k+1 (V div β n r, Q r ) 10: W r k (W r T r β n 1 ) mods β n+r k 11: U U + Q r β r (k+1) 12: if W r k < 0 then W r k W r k + β r k V, U U β r k 13: r r k 14: Q r ShortMul(W r div β n+r (k+1), I, k + 1) 15: U U + (Q r div β k+1 r ) Paul Zimmermann - Short Division of Long Integers October 6, 2011-23

Theorem Assuming n + 8 < β/2 and l n/2, Algorithm FoldDiv(l) returns an approximation U of Q = W /V, with error less than 2n. Paul Zimmermann - Short Division of Long Integers October 6, 2011-24

Experimental results Hardware: gcc16.fsffrance.org, 2.2Ghz AMD Opteron 8354 GMP: changeset 131005cc271b from 5.0 branch ( 5.0.1) mulmid patch from David Harvey (threshold 36 words) n 100 200 500 1000 mpn mul n 7.52 22.4 80.8 225 ShortMul 0.76 0.81 0.89 0.85 mpn invert 1.21 1.32 1.59 1.57 mpn mulmid n 1.12 1.20 1.45 1.59 mpn tdiv qr 1.74 1.86 2.35 2.46 mpn div q 1.22 1.34 1.79 1.87 ShortDiv 1.34 1.32 1.62 1.75 FoldDiv(2) 1.37 1.36 1.62 1.74 FoldDiv(3) 1.34 1.35 1.61 1.73 FoldDiv(4) 1.35 1.32 1.63 1.76 Paul Zimmermann - Short Division of Long Integers October 6, 2011-25

0.45 0.4 mpn_div_q ShortDiv FoldDiv2(2) FoldDiv2(3) FoldDiv2(4) 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 0 100 200 300 400 500 600 700 800 900 1000 Paul Zimmermann - Short Division of Long Integers October 6, 2011-26

Algorithm ShortMul is implemented in GNU MPFR since version 2.2.0 (2005) Extended to the MPFR squaring operation in 2010 Algorithm ShortDiv is implemented in GNU MPFR since version 3.1.0 (2011) Algorithm FoldDiv is not (yet) implemented since it requires a middle-product routine, which is not (yet) provided by GMP Paul Zimmermann - Short Division of Long Integers October 6, 2011-27

From www.mpfr.org/mpfr-3.0.0/timings.html (ms): Maple Mathematica Sage GMP MPF MPFR digits 12.00 6.0.1 4.5.2 5.0.1 3.0.0 100 mult 0.0020 0.0006 0.00053 0.00011 0.00012 div 0.0029 0.0017 0.00076 0.00031 0.00032 sqrt 0.032 0.0018 0.00132 0.00055 0.00049 1000 mult 0.0200 0.007 0.0039 0.0036 0.0028 div 0.0200 0.015 0.0071 0.0040 0.0058 sqrt 0.160 0.011 0.0064 0.0049 0.0047 10000 mult 0.80 0.28 0.11 0.107 0.095 div 0.80 0.56 0.28 0.198 0.261 sqrt 3.70 0.36 0.224 0.179 0.176 Paul Zimmermann - Short Division of Long Integers October 6, 2011-28

MPFR 3.1.0 (canard à l orange, Oct 3, 2011): Maple Mathematica Sage GMP MPF MPFR digits 12.00 6.0.1 4.7 5.0.2 3.1.0 100 mult 0.0020 0.0006 0.00056 0.00011 0.00013 sqr 0.00051 0.00009 0.00010 div 0.0029 0.0017 0.00078 0.00031 0.00031 sqrt 0.032 0.0018 0.00114 0.00056 0.00049 1000 mult 0.0200 0.007 0.0040 0.0036 0.0030 sqr 0.0029 0.0024 0.0018 div 0.0200 0.015 0.0070 0.0041 0.0048 sqrt 0.160 0.011 0.0061 0.0050 0.0047 10000 mult 0.80 0.28 0.113 0.107 0.095 sqr 0.086 0.076 0.064 div 0.80 0.56 0.267 0.198 0.183 sqrt 3.70 0.36 0.183 0.178 0.176 Paul Zimmermann - Short Division of Long Integers October 6, 2011-29

Conclusion Our contributions: two variants of Mulders short product and short division for integers, with detailed description and rigorous error analysis a detailed description and rigorous error analysis of the l-fold division for integers we get a 10% speedup, and more speedup can be obtained for FoldDiv, by using a Toom-3 middle product, a faster (approximate) inverse based on the same ideas,... Benchmarks are a good way to improve software tools! Still to do: design an approximate inverse using the l-fold algorithm Adapt the FoldDiv algorithm for an approximate inverse and update the error analysis Paul Zimmermann - Short Division of Long Integers October 6, 2011-30