A new multiplication algorithm for extended precision using floating-point expansions. Valentina Popescu, Jean-Michel Muller,Ping Tak Peter Tang

Size: px

Start display at page:

Download "A new multiplication algorithm for extended precision using floating-point expansions. Valentina Popescu, Jean-Michel Muller,Ping Tak Peter Tang"

Rudolph Berry
5 years ago
Views:

1 A new multiplication algorithm for extended precision using floating-point expansions Valentina Popescu, Jean-Michel Muller,Ping Tak Peter Tang ARITH 23 July 2016 AMPAR CudA Multiple Precision ARithmetic library

2 1/1 Target applications 1 Need massive parallel computations! high performance computing using graphics processors GPUs; 2 Need more precision than standard available (up to few hundred bits)! extend precision using floating-point expansions.

1/1 Target applications 1 Need massive parallel computations! high performance computing using graphics processors GPUs; 2 Need more precision than standard available (up to few hundred bits)!

3 1/1 Target applications 1 Need massive parallel computations! high performance computing using graphics processors GPUs; 2 Need more precision than standard available (up to few hundred bits)! extend precision using floating-point expansions. Chaotic dynamical systems: bifurcation analysis; compute periodic orbits (e.g., finding sinks in the Hénon map, iterating the Lorenz attractor); celestial mechanics (e.g., long term stability of the solar system). Experimental mathematics: ill-posed SDP problems in computational geometry (e.g., computation of kissing numbers); quantum chemistry/information; polynomial optimization etc. x x1

4 2/1 Extended precision Existing libraries: GNU MPFR - not ported on GPU; GARPREC & CUMP - tuned for big array operations: data generated on host, operations on device; QD & GQD - limited to double-double and quad-double; no correctness proofs.

5 2/1 Extended precision Existing libraries: GNU MPFR - not ported on GPU; GARPREC & CUMP - tuned for big array operations: data generated on host, operations on device; QD & GQD - limited to double-double and quad-double; no correctness proofs. What we need: support for arbitrary precision; runs both on CPU and GPU; easy to use. CAMPARY CudaA Multiple Precision ARithmetic library

6 3/1 CAMPARY (CudaA Multiple Precision ARithmetic library) Our approach: multiple-term representation floating-pointexpansions

7 3/1 CAMPARY (CudaA Multiple Precision ARithmetic library) Our approach: multiple-term representation floating-pointexpansions Pros: usedirectlyavailableandhighlyoptimizednative FP infrastructure; straightforwardly portable to highly parallel architectures, such as GPUs; su cientlysimple and regular algorithms for addition.

8 3/1 CAMPARY (CudaA Multiple Precision ARithmetic library) Our approach: multiple-term representation floating-pointexpansions Pros: usedirectlyavailableandhighlyoptimizednative FP infrastructure; straightforwardly portable to highly parallel architectures, such as GPUs; su cientlysimple and regular algorithms for addition. Cons: morethanonerepresentation; existingmultiplication algorithms do not generalize well for an arbitrary number of terms; di cultrigorouserroranalysis! lack of thorough error bounds.

9 4/1 Non-overlapping expansions R = e 1 can be represented, using a p =5(in radix 2) system,as: Less compact R 8 = x 0 + x 1 + x 2 : < x 0 =1.1000e 1; R 8 = y 0 + y 1 + y 2 + y 3 + y 4 + y 5 : x 1 =1.0010e 3; : x 2 =1.0110e 6. Most compact R = z 0 + z 1 : z0 =1.1101e 1; z 1 =1.1000e 8. >< >: y 0 =1.0000e 1; y 1 =1.0000e 2; y 2 =1.0000e 3; y 3 =1.0000e 5; y 4 =1.0000e 8; y 5 =1.0000e 9.

10 4/1 Non-overlapping expansions R = e 1 can be represented, using a p =5(in radix 2) system,as: Less compact R 8 = x 0 + x 1 + x 2 : < x 0 =1.1000e 1; R 8 = y 0 + y 1 + y 2 + y 3 + y 4 + y 5 : x 1 =1.0010e 3; : x 2 =1.0110e 6. Most compact R = z 0 + z 1 : z0 =1.1101e 1; z 1 =1.1000e 8. >< >: y 0 =1.0000e 1; y 1 =1.0000e 2; y 2 =1.0000e 3; y 3 =1.0000e 5; y 4 =1.0000e 8; y 5 =1.0000e 9. Solution: the FP expansions are required to be non-overlapping. Definition: ulp -nonoverlapping. For an expansion u 0,u 1,...,u n 1 if for all 0 <i<n,wehave u i apple ulp (u i 1 ). Example: p =5(in radix 2) 8 x >< 0 =1.1010e 2; x 1 =1.1101e 7; >: x 2 =1.0000e 11; x 3 =1.1000e 17.

11 4/1 Non-overlapping expansions R = e 1 can be represented, using a p =5(in radix 2) system,as: Less compact R 8 = x 0 + x 1 + x 2 : < x 0 =1.1000e 1; R 8 = y 0 + y 1 + y 2 + y 3 + y 4 + y 5 : x 1 =1.0010e 3; : x 2 =1.0110e 6. Most compact R = z 0 + z 1 : z0 =1.1101e 1; z 1 =1.1000e 8. >< >: y 0 =1.0000e 1; y 1 =1.0000e 2; y 2 =1.0000e 3; y 3 =1.0000e 5; y 4 =1.0000e 8; y 5 =1.0000e 9. Solution: the FP expansions are required to be non-overlapping. Definition: ulp -nonoverlapping. For an expansion u 0,u 1,...,u n 1 if for all 0 <i<n,wehave u i apple ulp (u i 1 ). Example: p =5(in radix 2) 8 x >< 0 =1.1010e 2; x 1 =1.1101e 7; >: x 2 =1.0000e 11; x 3 =1.1000e 17. Restriction: n apple 12 for single-precision and n apple 39 for double-precision.

12 5/1 Error-Free Transforms: Fast2Sum & 2MultFMA Algorithm 1 (Fast2Sum (a, b)) s RN (a + b) z RN (s a) e RN (b z) return (s, e) Algorithm 2 (2MultFMA (a, b)) p RN (a b) e fma(a, b, p) return (p, e) Requirement: e a e b ;! Uses 3 FP operations. Requirement: e a + e b e min + p 1;! Uses 2 FP operations.

13 6/1 Existing multiplication algorithms 1 Priest s multiplication [Pri91]: verycomplexandcostly; basedonscalarproducts; uses re-normalization after each step; computestheentireresultand truncates a-posteriori; comes with an error bound and correctness proof.

14 6/1 Existing multiplication algorithms 1 Priest s multiplication [Pri91]: verycomplexandcostly; basedonscalarproducts; uses re-normalization after each step; computestheentireresultand truncates a-posteriori; comes with an error bound and correctness proof. 2 quad-double multiplication in QD library [Hida et.al. 07]: does not straightforwardly generalize; canleadtoo(n 3 ) complexity; worstcaseerrorboundispessimistic; nocorrectnessproofispublished.

15 7/1 New multiplication algorithms requires: ulp -nonoverlapping FP expansion x =(x 0,x 1,...,x n 1 ) and y =(y 0,y 1,...,y m 1 ); ensures: ulp -nonoverlapping FP expansion =( 0, 1,..., r 1 ). Let me explain it with an example...

16 Example: n =4,m=3and r =4 8/1

17 8/1 Example: n =4,m=3and r =4 paper-and-pencil intuition; on-the-fly truncation ;

18 8/1 Example: n =4,m=3and r =4 paper-and-pencil intuition; on-the-fly truncation ; term-times-expansion products, x i y; error correction term, r.

19 9/1 Example: n =4,m=3and r =4 [ r p b ]+2containers of size b (s.t. 3b >2p); b + c = p 1, s.t.wecanadd2 c numbers without error (binary64! b =45, binary32! b =18); starting exponent e = e x0 + e y0 ; each bin s LSB has a fixed weight; bins initialized with e (i+1)b+p 1 ;

20 9/1 Example: n =4,m=3and r =4 [ r p b ]+2containers of size b (s.t. 3b >2p); b + c = p 1, s.t.wecanadd2 c numbers without error (binary64! b =45, binary32! b =18); starting exponent e = e x0 + e y0 ; each bin s LSB has a fixed weight; bins initialized with e (i+1)b+p 1 ; the number of leading bits, `; accumulation done using a Fast2Sum and addition [Rump09].

21 Example: n =4,m=3and r =4 10 / 1

22 10 / 1 Example: n =4,m=3and r =4 subtract initial value;

23 10 / 1 Example: n =4,m=3and r =4 subtract initial value; apply renormalization step to B: Fast2Sum and branches; rendertheresult ulp -nonoverlapping.

24 11 / 1 Error bound Exact result computed as: n 1;m 1 X i=0;j=0 x i y j = m+n 2 X k=0 X x i y j. i+j=k On-the-fly truncation! three error sources:

25 11 / 1 Error bound Exact result computed as: n 1;m 1 X i=0;j=0 x i y j = m+n 2 X k=0 X x i y j. i+j=k On-the-fly truncation! three error sources: discarded partial products: weaddonlythefirst r+1 P m+n P 2 k=r+1 P i+j=k k=1 k; x i y j apple x 0 y 0 2 (p h i 1)(r+1) (p 1) 2 + m+n r 2 (1 2 (p 1) ) (p 1) ;

26 11 / 1 Error bound Exact result computed as: n 1;m 1 X i=0;j=0 x i y j = m+n 2 X k=0 X x i y j. i+j=k On-the-fly truncation! three error sources: discarded partial products: weaddonlythefirst r+1 P m+n P 2 k=r+1 P i+j=k k=1 k; x i y j apple x 0 y 0 2 (p h i 1)(r+1) (p 1) 2 + m+n r 2 (1 2 (p 1) ) (p 1) discarded rounding errors: wecomputethelastr +1partial products using simple FP multiplication; weneglectr +1error terms bounded by x 0 y 0 2 (p 1)r 2 p ; ;

27 11 / 1 Error bound Exact result computed as: n 1;m 1 X i=0;j=0 x i y j = m+n 2 X k=0 X x i y j. i+j=k On-the-fly truncation! three error sources: discarded partial products: weaddonlythefirst r+1 P m+n P 2 k=r+1 P i+j=k k=1 k; x i y j apple x 0 y 0 2 (p h i 1)(r+1) (p 1) 2 + m+n r 2 (1 2 (p 1) ) (p 1) discarded rounding errors: wecomputethelastr +1partial products using simple FP multiplication; weneglectr +1error terms bounded by x 0 y 0 2 (p 1)r 2 p ; error caused by the renormalization step: errorlessorequalto x 0 y 0 2 (p 1)r. ;

28 11 / 1 Error bound Exact result computed as: n 1;m 1 X i=0;j=0 x i y j = m+n 2 X k=0 X x i y j. i+j=k On-the-fly truncation! three error sources: discarded partial products: weaddonlythefirst r+1 P m+n P 2 k=r+1 P i+j=k k=1 k; x i y j apple x 0 y 0 2 (p h i 1)(r+1) (p 1) 2 + m+n r 2 (1 2 (p 1) ) (p 1) discarded rounding errors: wecomputethelastr +1partial products using simple FP multiplication; weneglectr +1error terms bounded by x 0 y 0 2 (p 1)r 2 p ; error caused by the renormalization step: errorlessorequalto x 0 y 0 2 (p 1)r. Tight error bound: x 0 y 0 2 (p 1)r [1 + (r +1)2 p (p 1) 2 (p 1) (1 2 (p 1) ) 2 + m + n r 2 A]. 1 2 (p 1) ;

29 12 / 1 Comparison Table : Worst case FP operation count when the input and output expansions are of size r. r New algorithm Priest s mul.[pri91]

30 Comparison Table : Worst case FP operation count when the input and output expansions are of size r. r New algorithm Priest s mul.[pri91] Table : Performance in Mops/s for multiplying two FP expansions on a Tesla K40c GPU, using CUDA 7.5 software architecture, running on a single thread of execution. precision not supported. d x,d y,d r New algorithm QD 2, 2, , 2, , 3, , 3, , 4, , 4, , 4, , 8, , 8, , 16, / 1

31 13 / 1 Conclusions AMPAR CudA Multiple Precision ARithmetic library Available online at: algorithm with strong regularity; based on partial products accumulation; uses a fixed-point structure that is floating-point friendly; thorough error analysis and tight error bound; natural fit for GPUs; proved to be too complex for small precisions; performance gains with increased precision.

32 Comparison Table : Performance in Mops/s for multiplying two FP expansions on the CPU; d x and d y represent the number of terms in the input expansions and d r is the size of the computed result. precision not supported d x,d y,d r New algorithm QD MPFR 2, 2, , 2, , 3, , 3, , 4, , 4, , 4, , 8, , 8, , 16,

On the computation of the reciprocal of floating point expansions using an adapted Newton-Raphson iteration

On the computation of the reciprocal of floating point expansions using an adapted Newton-Raphson iteration Mioara Joldes, Valentina Popescu, Jean-Michel Muller ASAP 2014 1 / 10 Motivation Numerical problems