Accurate Algorithms in Floating Point Arithmetic


1 Accurate Algorithms in Floating Point Arithmetic. Philippe LANGLOIS, DALI Research Team, University of Perpignan, France. SCAN 2006 at Duisburg, Germany, September 2006. Philippe LANGLOIS (UPVD), Accurate Algorithms in Floating Point Arithmetic, SCAN 2006 at Duisburg, 1 / 42

2 Compensated algorithms in floating point arithmetic
1 Compensated algorithms: why and how?
2 Basic bricks towards compensated algorithms
3 Compensated Horner algorithm
4 Performance: arithmetic and processor issues
5 Conclusion

3 Algorithm accuracy vs. condition number
For a given working precision u:
Backward stable algorithms: the solution accuracy is no worse than condition number × computing precision (× O(problem dimension)).
Examples: arithmetic operators, classic algorithms for summation, dot product, Horner's method, backward substitution, GEPP, ...
How to manage ill-conditioned cases, i.e., when condition number > 1/u?

4 Algorithm accuracy vs. condition number
For a given working precision u:
Backward stable algorithms: the solution accuracy is no worse than condition number × computing precision (× O(problem dimension)).
Examples: arithmetic operators, classic algorithms for summation, dot product, Horner's method, backward substitution, GEPP, ...
How to manage ill-conditioned cases, i.e., when condition number > 1/u?
Highly accurate or even faithful algorithms: the solution accuracy is independent of the condition number (over a range of it). A faithfully rounded result is sometimes available.
Examples: Priest's algorithm, distillation algorithms, Rump-Ogita-Oishi for summation and dot product, iterative refinement with extra precision (polynomial evaluation of a canonical form needs the roots!), ...

5 Compensated algorithms
A compensated algorithm corrects the generated rounding errors:
the solution accuracy is no worse than computing precision + condition number × (computing precision)^K;
the computation is performed only with the working precision u;
the performance is no worse than expansion libraries (double-double, quad-double) and even more general ones (ARPREC, MPFR, ...).
It also returns a validated dynamic error bound. (bonus)

6 Compensated algorithms
A compensated algorithm corrects the generated rounding errors:
the solution accuracy is no worse than computing precision + condition number × (computing precision)^K;
the computation is performed only with the working precision u;
the performance is no worse than expansion libraries (double-double, quad-double) and even more general ones (ARPREC, MPFR, ...).
It also returns a validated dynamic error bound. (bonus)
Accuracy as if computed in twice the working precision — an alternative to double-double as in, e.g., XBLAS:
Compensated summation: Neumaier (74), Sum2 in Ogita-Rump-Oishi (05).
Compensated Horner algorithm and compensated backward substitution.

7 Compensated algorithms
A compensated algorithm corrects the generated rounding errors:
the solution accuracy is no worse than computing precision + condition number × (computing precision)^K;
the computation is performed only with the working precision u;
the performance is no worse than expansion libraries (double-double, quad-double) and even more general ones (ARPREC, MPFR, ...).
It also returns a validated dynamic error bound. (bonus)
Accuracy as if computed in twice the working precision — an alternative to double-double as in, e.g., XBLAS:
Compensated summation: Neumaier (74), Sum2 in Ogita-Rump-Oishi (05).
Compensated Horner algorithm and compensated backward substitution.
Accuracy as if computed in K times the working precision:
Compensated summation: SumK, Ogita-Rump-Oishi (05).
Compensated Horner algorithm.
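The Sum2 algorithm mentioned above fits in a few lines of Python (Python floats are IEEE-754 doubles with round to nearest). This is a minimal sketch of Ogita-Rump-Oishi's compensated summation, using Knuth's TwoSum as the basic brick:

```python
def two_sum(a, b):
    # Knuth's branch-free TwoSum: a + b == s + e exactly (round to nearest)
    s = a + b
    b_app = s - a
    a_app = s - b_app
    return s, (a - a_app) + (b - b_app)

def sum2(values):
    # Sum2 (Ogita-Rump-Oishi): classic recursive summation, plus the
    # compensation of all the generated rounding errors
    s = 0.0
    c = 0.0  # running sum of the rounding errors
    for v in values:
        s, e = two_sum(s, v)
        c += e
    return s + c
```

On the ill-conditioned sum 1.0 + 1e100 + 1.0 − 1e100, plain recursive summation returns 0.0 while sum2 recovers the exact value 2.0, i.e., accuracy as if the sum had been accumulated in doubled precision.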

8 Algorithm accuracy vs. condition number
[figure: relative forward error vs. condition number (range 1 to 1/u⁴); backward stable algorithms degrade from u proportionally to the condition number; compensated algorithms stay at accuracy u up to cond ≈ 1/u, then degrade up to cond ≈ 1/u²; highly accurate / faithful algorithms stay at u over the whole range]

9 Introducing example: rounding errors when solving Tx = b
The classic substitution algorithm computes x_i = (b_i − Σ_{j=1}^{i−1} t_{i,j} x_j) / t_{i,i}, for i = 1 : n.
Algorithm (... and its generated rounding errors, inner loop):
s_{i,0} = b_i
for j = 1 : i−1 loop
  p_{i,j} = t_{i,j} ⊗ x̂_j   { generates π_{i,j} }
  s_{i,j} = s_{i,j−1} ⊖ p_{i,j}   { generates σ_{i,j} }
end loop
x̂_i = s_{i,i−1} ⊘ t_{i,i}   { generates δ_i }
The global forward error in the computed x̂ is (Δx_i)_{i=1:n}, with
Δx_i = x_i − x̂_i = (1/t_{i,i}) Σ_{j=1}^{i−1} (σ_{i,j} − π_{i,j} − t_{i,j} Δx_j) + δ_i.

10 Algorithm (Inlining the computation of the correcting term Δx)
for i = 1 : n loop
  s_{i,0} = b_i ; u_{i,0} = 0
  for j = 1 : i−1 loop
    (p_{i,j}, π_{i,j}) = TwoProd(t_{i,j}, x̂_j)
    (s_{i,j}, σ_{i,j}) = TwoSum(s_{i,j−1}, −p_{i,j})
    u_{i,j} = u_{i,j−1} ⊖ (t_{i,j} ⊗ Δx̂_j ⊕ π_{i,j} ⊖ σ_{i,j})
  end loop
  (x̂_i, δ_i) = ApproxTwoDiv(s_{i,i−1}, t_{i,i})
  Δx̂_i = u_{i,i−1} ⊘ t_{i,i} ⊕ δ_i
end loop

11 Algorithm (Inlining the computation of the correcting term Δx)
for i = 1 : n loop
  s_{i,0} = b_i ; u_{i,0} = 0
  for j = 1 : i−1 loop
    (p_{i,j}, π_{i,j}) = TwoProd(t_{i,j}, x̂_j)
    (s_{i,j}, σ_{i,j}) = TwoSum(s_{i,j−1}, −p_{i,j})
    u_{i,j} = u_{i,j−1} ⊖ (t_{i,j} ⊗ Δx̂_j ⊕ π_{i,j} ⊖ σ_{i,j})
  end loop
  (x̂_i, δ_i) = ApproxTwoDiv(s_{i,i−1}, t_{i,i})
  Δx̂_i = u_{i,i−1} ⊘ t_{i,i} ⊕ δ_i
end loop
Algorithm (Compensating x̂ by adding the correction Δx̂)
for i = 1 : n loop
  x̄_i = x̂_i ⊕ Δx̂_i
end loop
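The compensated substitution above can be sketched in Python for a lower triangular system. This is an illustrative reconstruction, not the authors' code: `trisolve` and `comp_trisolve` are hypothetical names, TwoSum/TwoProd are the classic Knuth and Dekker error free transformations, and the exactly representable division residue r = s − x̂_i ⊗ t_{i,i} stands in for ApproxTwoDiv:

```python
def two_sum(a, b):
    # Knuth's TwoSum: a + b == s + e exactly (round to nearest)
    s = a + b
    b_app = s - a
    a_app = s - b_app
    return s, (a - a_app) + (b - b_app)

def two_prod(a, b):
    # Dekker's TwoProd with Veltkamp splitting: a * b == p + e exactly
    def split(v, f=2**27 + 1):
        c = f * v
        hi = c - (c - v)
        return hi, v - hi
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    return p, al * bl - (((p - ah * bh) - al * bh) - ah * bl)

def trisolve(T, b):
    # classic forward substitution for a lower triangular T
    x = []
    for i in range(len(b)):
        s = b[i]
        for j in range(i):
            s -= T[i][j] * x[j]
        x.append(s / T[i][i])
    return x

def comp_trisolve(T, b):
    # compensated substitution: run the classic algorithm, accumulate the
    # rounding errors pi, sigma, delta into a correcting term dx, then add it
    n = len(b)
    x, dx = [0.0] * n, [0.0] * n
    for i in range(n):
        s, c = b[i], 0.0
        for j in range(i):
            p, pi = two_prod(T[i][j], x[j])    # p + pi == T[i][j] * x[j]
            s, sigma = two_sum(s, -p)          # exact subtraction error
            c += sigma - pi - T[i][j] * dx[j]  # update the correcting term
        xi = s / T[i][i]
        h, q = two_prod(xi, T[i][i])           # h + q == xi * T[i][i]
        delta = ((s - h) - q) / T[i][i]        # division rounding error
        x[i], dx[i] = xi, c / T[i][i] + delta
    return [xv + dv for xv, dv in zip(x, dx)]
```

On a 2×2 system with a severe cancellation in the second row (t₂₁·x₁ nearly equal to b₂), the classic solver loses about half the digits of x₂ while the compensated one stays accurate to working precision, in line with the bound on slide 12.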

12 Accuracy as in doubled precision for Tx = b
dtrsv: IEEE-754 double precision substitution algorithm; dtrsv_cor: compensated substitution algorithm; dtrsv_x: XBLAS substitution with inner computation in double-double arithmetic.
[figure: relative forward error vs. condition number (up to ≈ 10³⁵) for the three routines; dtrsv follows the bound u·n·cond (2), while dtrsv_cor and dtrsv_x follow u + u²·n·cond (3), staying at accuracy u up to cond ≈ 1/(n·u) and degrading up to cond ≈ 1/(n·u²)]
Compensated substitution: accuracy as if computed in doubled precision. The actual accuracy of compensated substitution is bounded by ‖x̄ − x‖/‖x‖ ≤ u + n·cond(T, x)·u².

13 Measured running times are interesting
The general behavior:
1. Overhead to double the precision: measured running times are significantly better than expected from the theoretical flop count.
2. Compensated versus double-double: the compensated algorithm runs about two times faster than its double-double counterpart.
An example:
Env. 1: Intel Celeron, 2.4 GHz, 256 kB L2 cache, GCC. Env. 2: Pentium 4, 3.0 GHz, 1024 kB L2 cache, GCC.
[table: measured time ratios (min / mean / max) and theoretical flop ratios for compensated/double, double-double/double and double-double/compensated on both environments; numeric entries lost in transcription]

14 Our first step towards compensated algorithms
CENA: a software for automatic linear correction of rounding errors.
Motivation: automatic improvement and verification of the accuracy of a result computed by a program considered as a black box.
Principle: compute a linear approximation of the global absolute error with respect to the generated elementary rounding errors, and use it as a correcting term.
CENA = algorithmic differentiation + computation of elementary rounding errors + running error analysis + overloading facilities (e.g., Fortran 90).

15 Our first step towards compensated algorithms
CENA: a software for automatic linear correction of rounding errors.
Motivation: automatic improvement and verification of the accuracy of a result computed by a program considered as a black box.
Principle: compute a linear approximation of the global absolute error with respect to the generated elementary rounding errors, and use it as a correcting term.
CENA = algorithmic differentiation + computation of elementary rounding errors + running error analysis + overloading facilities (e.g., Fortran 90).
Should be considered as a development tool: easy to use, helps identify where to use more precision.
Step 2: inlining the correction!
Many more details in BIT 00, JCAM 01, TCS 03.

16 1 Compensated algorithms: why and how?
2 Basic bricks towards compensated algorithms — Classic error free transformations
3 Compensated Horner algorithm
4 Performance: arithmetic and processor issues
5 Conclusion

17 Classic error free transformations
Error free transformations (EFT) are properties and algorithms to compute the generated elementary rounding errors at the current working precision.
Related works:
For a, b entries in F, a ∘ b = fl(a ∘ b) + ε, with the error ε an output in F.
EFT are so named by Ogita-Rump-Oishi (SISC, 05); sum-and-roundoff pair and product-and-roundoff pair in Priest (PhD, 92); no special name before. Recent and interesting discussion in Nievergelt (TOMS, 04).
Algorithms to compute the error term are extremely dependent on the properties of the floating point arithmetic, e.g., basis of representation, rounding mode, ...

18 Classic error free transformations
Be careful: some of the following relations only apply to IEEE-754 arithmetic with rounding to the nearest.
Summation: Møller (65), Kahan (65), Knuth (74), Priest (91). (s, σ) = TwoSum(a, b) is such that a + b = s + σ and s = a ⊕ b.
Product: Veltkamp and Dekker (72), ... (p, π) = TwoProd(a, b) is such that a × b = p + π and p = a ⊗ b.

19 Classic error free transformations
Be careful: some of the following relations only apply to IEEE-754 arithmetic with rounding to the nearest.
Summation: Møller (65), Kahan (65), Knuth (74), Priest (91). (s, σ) = TwoSum(a, b) is such that a + b = s + σ and s = a ⊕ b.
Product: Veltkamp and Dekker (72), ... (p, π) = TwoProd(a, b) is such that a × b = p + π and p = a ⊗ b.
FMA: Boldo-Muller (05). (p, δ₁, δ₂) = ErrFMA(a, b, c) is such that p = FMA(a, b, c) and a × b + c = p + δ₁ + δ₂.

20 Classic error free transformations
Be careful: some of the following relations only apply to IEEE-754 arithmetic with rounding to the nearest.
Summation: Møller (65), Kahan (65), Knuth (74), Priest (91). (s, σ) = TwoSum(a, b) is such that a + b = s + σ and s = a ⊕ b.
Product: Veltkamp and Dekker (72), ... (p, π) = TwoProd(a, b) is such that a × b = p + π and p = a ⊗ b.
FMA: Boldo-Muller (05). (p, δ₁, δ₂) = ErrFMA(a, b, c) is such that p = FMA(a, b, c) and a × b + c = p + δ₁ + δ₂.
Division and square root: Pichat (77), Markstein (90). If d = a ⊘ b, then the residue r = a − d × b is a floating point value. EFT exist for the division remainder and for the square root, but are not useful hereafter.

21 Not error free transformations
Approximate division: Langlois-Nativel (98). (d, δ̂) = ApproxTwoDiv(a, b) is such that a/b = d + δ with d = a ⊘ b and |δ − δ̂| ≤ u·|δ|.
Approximate square root: Langlois-Nativel (98).

22 Algorithms to compute EFT: two examples
Algorithm (fast-two-sum, Dekker (71)):
function [s, σ] = FastTwoSum(a, b)   % requires |a| ≥ |b|
  s := a ⊕ b
  b_app := s ⊖ a
  σ := b ⊖ b_app
Cost: 3 flops and 1 test.
Algorithm (two-sum, without pre-ordering, Knuth (74)):
function [s, σ] = TwoSum(a, b)
  s := a ⊕ b
  b_app := s ⊖ a
  a_app := s ⊖ b_app
  δ_b := b ⊖ b_app
  δ_a := a ⊖ a_app
  σ := δ_a ⊕ δ_b
Cost: 6 flops.
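Both algorithms transcribe directly into Python, whose floats are IEEE-754 doubles with round to nearest — exactly the arithmetic both algorithms require; a minimal sketch:

```python
def fast_two_sum(a, b):
    # Dekker's FastTwoSum, 3 flops; requires |a| >= |b|
    s = a + b
    b_app = s - a
    return s, b - b_app

def two_sum(a, b):
    # Knuth's TwoSum, 6 flops; no precondition and no branch
    s = a + b
    b_app = s - a
    a_app = s - b_app
    return s, (a - a_app) + (b - b_app)
```

In both cases s + σ recovers a + b exactly: for instance two_sum(1.0, 2**-60) returns (1.0, 2**-60) — the addend that disappears from the rounded sum reappears exactly in the error term σ.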

23 Error Free Transformations: summary
EFT for proving theorems on error bounds:
Standard model: fl(a ∘ b) = (a ∘ b)(1 + η) with |η| ≤ u; EFT: a ∘ b = fl(a ∘ b) + ε.
Mathematical equalities between floating point entries.
Algorithms to compute EFT:
yield the generated rounding errors,
are performed at the working precision, without change of rounding mode,
benefit better from hardware units (pipelines).
Cost in #flops: TwoSum: 6; FastTwoSum: 3 (+1 test); TwoProd: 17; TwoProdFMA: 2; ThreeFMA: 17.

24 1 Compensated algorithms: why and how?
2 Basic bricks towards compensated algorithms
3 Compensated Horner algorithm — Compensated Horner algorithm; Validated error bounds; Faithful rounding with the compensated Horner algorithm
4 Performance: arithmetic and processor issues
5 Conclusion

25 Accuracy of the Horner algorithm
We consider the polynomial p(x) = Σ_{i=0}^{n} a_i x^i, with a_i ∈ F, x ∈ F.
Algorithm (Horner):
function r_0 = Horner(p, x)
  r_n = a_n
  for i = n−1 : −1 : 0
    r_i = r_{i+1} ⊗ x ⊕ a_i
  end
Condition number: cond(p, x) = Σ |a_i| |x|^i / |p(x)| ≥ 1.
Relative accuracy of the evaluation with the Horner scheme:
|Horner(p, x) − p(x)| / |p(x)| ≤ γ_{2n} cond(p, x), with γ_{2n} ≈ 2nu.
u is the computing precision; for IEEE-754 double (53-bit mantissa, rounding to the nearest), u = 2⁻⁵³ ≈ 1.1 × 10⁻¹⁶.

26 EFT for the Horner scheme
Let p be a polynomial of degree n with coefficients in F. If no underflow occurs,
p(x) = Horner(p, x) + (p_π + p_σ)(x),
with p_π(X) = Σ_{i=0}^{n−1} π_i X^i, p_σ(X) = Σ_{i=0}^{n−1} σ_i X^i, and π_i, σ_i ∈ F.

27 EFT for the Horner scheme
Let p be a polynomial of degree n with coefficients in F. If no underflow occurs,
p(x) = Horner(p, x) + (p_π + p_σ)(x),
with p_π(X) = Σ_{i=0}^{n−1} π_i X^i, p_σ(X) = Σ_{i=0}^{n−1} σ_i X^i, and π_i, σ_i ∈ F.
Algorithm (Horner scheme):
function r_0 = Horner(p, x)
  r_n = a_n
  for i = n−1 : −1 : 0 loop
    p_i = r_{i+1} ⊗ x   % rounding error π_i ∈ F
    r_i = p_i ⊕ a_i   % rounding error σ_i ∈ F
  end loop
Algorithm (EFT for Horner):
function [r_0, p_π, p_σ] = EFTHorner(p, x)
  r_n = a_n
  for i = n−1 : −1 : 0 loop
    [p_i, π_i] = TwoProd(r_{i+1}, x)
    [r_i, σ_i] = TwoSum(p_i, a_i)
    p_π[i] = π_i ; p_σ[i] = σ_i
  end loop

28 The compensated Horner algorithm and its accuracy
(p_π + p_σ)(x) is the global error affecting Horner(p, x); we compute an approximation of (p_π + p_σ)(x) as a correcting term.
Algorithm (Compensated Horner algorithm):
function r̄ = CompHorner(p, x)
  [r̂, p_π, p_σ] = EFTHorner(p, x)   % r̂ = Horner(p, x)
  ĉ = Horner(p_π ⊕ p_σ, x)
  r̄ = r̂ ⊕ ĉ
Theorem. Given p a polynomial with floating point coefficients and x a floating point value,
|CompHorner(p, x) − p(x)| / |p(x)| ≤ u + γ²_{2n} cond(p, x), with γ²_{2n} ≈ (2nu)².
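CompHorner fits in a few lines of Python. This sketch fuses EFTHorner and the Horner evaluation of the correcting term (p_π ⊕ p_σ)(x) into a single loop, and uses Dekker's TwoProd so that no FMA is assumed; coefficients are passed from a_n down to a_0:

```python
def two_sum(a, b):
    # Knuth's TwoSum: a + b == s + e exactly (round to nearest)
    s = a + b
    b_app = s - a
    a_app = s - b_app
    return s, (a - a_app) + (b - b_app)

def two_prod(a, b):
    # Dekker's TwoProd with Veltkamp splitting: a * b == p + e exactly
    def split(v, f=2**27 + 1):
        c = f * v
        hi = c - (c - v)
        return hi, v - hi
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    return p, al * bl - (((p - ah * bh) - al * bh) - ah * bl)

def horner(coeffs, x):
    # classic Horner scheme; coeffs = [a_n, ..., a_0]
    r = coeffs[0]
    for a in coeffs[1:]:
        r = r * x + a
    return r

def comp_horner(coeffs, x):
    # compensated Horner: EFTHorner plus the Horner evaluation of the
    # error polynomials p_pi and p_sigma, fused into one loop
    r = coeffs[0]
    c = 0.0
    for a in coeffs[1:]:
        p, pi = two_prod(r, x)     # p + pi == r * x
        r, sigma = two_sum(p, a)   # r + sigma == p + a
        c = c * x + (pi + sigma)   # Horner step on the correcting term
    return r + c
```

For p(x) = (x − 1)⁵ in expanded form at x = 1 + 2⁻¹³ (condition number ≈ 2⁷⁰), plain Horner loses every significant digit — it even returns the wrong sign — while the compensated result is accurate to working precision, as the theorem predicts.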

29 ... as if computed in doubled precision
[figure: accuracy of polynomial evaluation, n = 25; relative forward error vs. condition number (up to ≈ 10³⁵) for Horner, DDHorner and CompHorner, against the bounds γ_{2n} cond and u + γ²_{2n} cond]
Expected accuracy improvement compared to Horner.
The a priori bound is correct but pessimistic.
The same accuracy is provided by CompHorner and DDHorner (Horner with inner computation in double-double).

30 How to validate the accuracy of the computed evaluation?
Classic dynamic approaches:
Compensated rule of thumb, or approximate Wilkinson-style running error bound.
Interval arithmetic.

31 A dynamic and validated error bound
Theorem. Given a polynomial p with floating point coefficients and a floating point value x, consider res = CompHorner(p, x). The absolute forward error affecting the evaluation is bounded according to
|CompHorner(p, x) − p(x)| ≤ fl( u·|res| + ( γ_{4n+2} HornerSum(|p_π|, |p_σ|, |x|) + 2u²·|res| ) ).
The dynamic bound is computed entirely in floating point arithmetic, with round-to-the-nearest mode only.
The dynamic bound improves the a priori bound.
The proof uses EFT and the standard model of floating point arithmetic (as in the Ogita-Rump-Oishi paper).
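The dynamic bound can be computed alongside the evaluation itself. The sketch below follows our reading of the formula above (with γ_k = ku/(1 − ku) and u = 2⁻⁵³), accumulating HornerSum(|p_π|, |p_σ|, |x|) in the same loop as the compensated evaluation; `comp_horner_with_bound` is an illustrative name, and two_sum/two_prod are the usual Knuth and Dekker EFTs, repeated here so the block is self-contained:

```python
U = 2.0 ** -53  # unit roundoff of IEEE-754 double precision

def two_sum(a, b):
    # Knuth's TwoSum: a + b == s + e exactly
    s = a + b
    b_app = s - a
    a_app = s - b_app
    return s, (a - a_app) + (b - b_app)

def two_prod(a, b):
    # Dekker's TwoProd with Veltkamp splitting: a * b == p + e exactly
    def split(v, f=2**27 + 1):
        c = f * v
        hi = c - (c - v)
        return hi, v - hi
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    return p, al * bl - (((p - ah * bh) - al * bh) - ah * bl)

def gamma(k):
    return k * U / (1.0 - k * U)

def comp_horner_with_bound(coeffs, x):
    # returns (res, bound): the compensated evaluation and a dynamic bound
    # on |res - p(x)|, both computed in floating point only
    r, c, hs = coeffs[0], 0.0, 0.0
    ax = abs(x)
    for a in coeffs[1:]:
        p, pi = two_prod(r, x)
        r, sigma = two_sum(p, a)
        c = c * x + (pi + sigma)               # correcting term
        hs = hs * ax + (abs(pi) + abs(sigma))  # HornerSum(|p_pi|, |p_sigma|, |x|)
    res = r + c
    n = len(coeffs) - 1
    bound = U * abs(res) + (gamma(4 * n + 2) * hs + 2 * U * U * abs(res))
    return res, bound
```

On the (x − 1)⁵ example the returned bound safely encloses the true error while remaining far smaller than what the a priori bound of slide 28 would give, which is the point of the dynamic estimate.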

32 The dynamic bound improves the a priori bound
Example: evaluating p₅(x) = (x − 1)⁵ for 1024 points around x = 1.
[figure: absolute forward error vs. argument x; the actual forward error lies below the dynamic error bound, which lies well below the a priori error bound]

33 Motivation for faithfully rounding a real value
A faithfully rounded result = one of the two neighbouring floating point values of the exact result.
Interest: sign determination, e.g., for geometric predicates.

34 Motivation for faithfully rounding a real value
A faithfully rounded result = one of the two neighbouring floating point values of the exact result.
Interest: sign determination, e.g., for geometric predicates.
[figure: accuracy of polynomial evaluation with CompHorner, n = 50; relative forward error vs. condition number against the bound u + γ²_{2n} cond, with the accuracy degrading between cond ≈ 1/u and cond ≈ 1/u²]

35 Two sufficient conditions for faithful rounding
Let p be a polynomial of degree n and x a floating point value.
Theorem. CompHorner(p, x) is a faithful rounding of p(x) as soon as one of the next two bounds is satisfied:
An a priori bound on the condition number:
cond(p, x) < ((1 − u)/(2 + u)) · u/γ²_{2n} ≈ 1/(8n²u).
A dynamic bound, validated for floating-point evaluation:
fl( (γ_{2n−1}/(1 − 2(n + 1)u)) · Horner(|p_π| ⊕ |p_σ|, |x|) ) < (u/2)·|CompHorner(p, x)|.
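The a priori test is cheap to implement: under our reading of the theorem it suffices to compare cond(p, x) — computable from Σ|a_i||x|^i and an accurate estimate of |p(x)|, e.g. a compensated evaluation — against the threshold ((1 − u)/(2 + u))·u/γ²_{2n}. A hedged sketch, with the hypothetical helper name `a_priori_faithful`:

```python
U = 2.0 ** -53  # unit roundoff, IEEE-754 double precision

def gamma(k):
    return k * U / (1.0 - k * U)

def a_priori_faithful(coeffs, x, p_of_x):
    # coeffs = [a_n, ..., a_0]; p_of_x: an accurate estimate of p(x),
    # e.g. from a compensated evaluation.  Returns True when the a priori
    # condition guarantees that CompHorner(p, x) is faithfully rounded.
    n = len(coeffs) - 1
    num = 0.0
    for a in coeffs:               # Horner evaluation of sum |a_i| |x|^i
        num = num * abs(x) + abs(a)
    cond = num / abs(p_of_x)
    threshold = (1.0 - U) / (2.0 + U) * U / gamma(2 * n) ** 2
    return cond < threshold
```

For (x − 1)⁵ the check succeeds at the well-conditioned point x = 2 (cond = 243) and, as expected, fails at x = 1 + 2⁻¹³ where cond ≈ 2⁷⁰ far exceeds the ≈ 1/(8n²u) threshold.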

36 Faithful rounding: explanations
The previous error bound is too large to prove faithful rounding:
|CompHorner(p, x) − p(x)| / |p(x)| ≤ u + γ²_{2n} cond(p, x), with γ²_{2n} ≈ (2nu)².
We use a result from Rump-Ogita-Oishi (Nov. 2005).
In practice, for IEEE double precision:
A priori bound: cond up to ≈ 1/(8n²u), i.e., about 10¹² for n = 20 and about 10¹¹ for n = 100.
Dynamic bound: the previous bound is improved but still not optimal.

37 Checking for faithful Horner
[figure: accuracy of polynomial evaluation with the Horner scheme, n = 50; relative forward error vs. condition number against the bound u + γ²_{2n} cond, with the faithfulness threshold ((1 − u)/(2 + u))·u·γ⁻²_{2n} marked between u and 1/u]

38 1 Compensated algorithms: why and how?
2 Basic bricks towards compensated algorithms
3 Compensated Horner algorithm
4 Performance: arithmetic and processor issues — How to benefit from the FMA instruction? How to explain measured performances?
5 Conclusion

39 Two compensated Horner algorithms with FMA
Left: the same EFT, with the FMA inside TwoProd: p(x) = Horner(p, x) + (p_π + p_σ)(x).
Right: a variant of the previous EFT when the FMA is the inner operation of the Horner algorithm: p(x) = HornerFMA(p, x) + (p_ε + p_ϕ)(x).
Algorithm (EFT with FMA):
function [h, p_π, p_σ] = EFTHornerFMA(p, x)
  s_n = a_n
  for i = n−1 : −1 : 0 loop
    [p_i, π_i] = TwoProdFMA(s_{i+1}, x)
    [s_i, σ_i] = TwoSum(p_i, a_i)
    p_π[i] = π_i ; p_σ[i] = σ_i
  end loop
  h = s_0
Algorithm (EFT for the FMA):
function [h, p_ε, p_ϕ] = EFTHornerThreeFMA(p, x)
  u_n = a_n
  for i = n−1 : −1 : 0 loop
    [u_i, ε_i, ϕ_i] = ThreeFMA(u_{i+1}, x, a_i)
    p_ε[i] = ε_i ; p_ϕ[i] = ϕ_i
  end loop
  h = u_0

40 Corresponding results
Theorem. Given p a polynomial with floating point coefficients and x a floating point value, we have
|CompHornerFMA(p, x) − p(x)| / |p(x)| ≤ u + (1 + u)·γ_n·γ_{2n}·cond(p, x) ≈ u + 2n²u²·cond(p, x) + O(u³),
and the CompHornerFMA flop count is 10n − 1. We also have
|CompHornerThreeFMA(p, x) − p(x)| / |p(x)| ≤ u + (1 + u)·γ²_n·cond(p, x) ≈ u + n²u²·cond(p, x) + O(u³),
and the CompHornerThreeFMA flop count is 19n.

41 Accuracy experiments: no surprise
[figure: relative forward error vs. condition number (up to ≈ 10⁴⁰) for Horner, HornerFMA, CompHornerFMA, CompHorner3FMA and DDHorner, against the bounds γ_n cond and u + (1 + u)γ²_n cond]

42 For the same output accuracy, prefer the FMA to compensate Horner without FMA
Environments: Itanium 1, 733 MHz, GCC 2.96; Itanium 2, 900 MHz, GCC; Itanium 2, 1.6 GHz, GCC.
[table: measured mean and theoretical running-time ratios of CompHornerFMA, CompHorner3FMA and DDHorner over HornerFMA for each environment; numeric entries lost in transcription]

43 How to explain measured performances?

44 How to explain measured performances?
Theoretical ratios (flops): CompHorner/Horner ≈ 10.5; DDHorner/Horner ≈ 14; DDHorner/CompHorner ≈ 1.33.
Some practical ratios (average running times for degrees 5 to 200), for CompHorner/Horner, DDHorner/Horner and DDHorner/CompHorner:
[table: measured ratios on Pentium 4, 3.00 GHz (GCC, ICC); Athlon 64, 2.00 GHz (GCC); Itanium 2, 1.4 GHz (GCC, ICC); numeric entries lost in transcription]

45 How to explain measured performances?
Theoretical ratios (flops): CompHorner/Horner ≈ 10.5; DDHorner/Horner ≈ 14; DDHorner/CompHorner ≈ 1.33.
Some practical ratios (average running times for degrees 5 to 200), for CompHorner/Horner, DDHorner/Horner and DDHorner/CompHorner:
[table: measured ratios on Pentium 4, 3.00 GHz (GCC, ICC); Athlon 64, 2.00 GHz (GCC); Itanium 2, 1.4 GHz (GCC, ICC); numeric entries lost in transcription]
Why such a difference between theoretical and measured running times?
Why does CompHorner run between 2 and 5 times faster than DDHorner?

46 What is Instruction-Level Parallelism?
"All processors since about 1985, including those in the embedded space, use pipelining to overlap the execution of instructions and improve performance. This potential overlap among instructions is called instruction-level parallelism (ILP) since the instructions can be evaluated in parallel." (Hennessy & Patterson)

47 What is Instruction-Level Parallelism?
"All processors since about 1985, including those in the embedded space, use pipelining to overlap the execution of instructions and improve performance. This potential overlap among instructions is called instruction-level parallelism (ILP) since the instructions can be evaluated in parallel." (Hennessy & Patterson)
A wide range of techniques exploit the parallelism among instructions: pipelining, superscalar architectures, ...
More ILP thanks to fewer instruction dependences:
if two instructions are parallel, they can execute simultaneously in a pipeline;
if two instructions are dependent, they must be executed in order.
Here data dependence is the main source of instruction dependences (no control dependences nor name dependences).

48 Data dependences
An instruction i is data dependent on an instruction j if either:
instruction j produces a result that may be used by instruction i, or
there exists a chain of dependences of the first type between i and j.
[figure: two dependence chains, j → i directly and j → k → i]

49 Data dependences
An instruction i is data dependent on an instruction j if either:
instruction j produces a result that may be used by instruction i, or
there exists a chain of dependences of the first type between i and j.
[figure: two dependence chains, j → i directly and j → k → i]
If two instructions are data dependent, they cannot execute simultaneously.
Dependences are properties of programs: the presence of a data dependence in an instruction sequence reflects a data dependence in the source code.
What about the dependences in CompHorner and DDHorner?

50 Identification of the difference between CompHorner and DDHorner
function r = CompHorner(P, x)
  s_n = a_n ; c_n = 0
  for i = n−1 : −1 : 0
    [p_i, π_i] = 2Prod(s_{i+1}, x)
    [s_i, σ_i] = 2Sum(p_i, a_i)
    c_i = c_{i+1} ⊗ x ⊕ (π_i ⊕ σ_i)
  end
  r = s_0 ⊕ c_0
function r = DDHorner(P, x)
  sh_n = a_n ; sl_n = 0
  for i = n−1 : −1 : 0
    %% [ph_i, pl_i] = [sh_{i+1}, sl_{i+1}] × x
    [th, tl] = 2Prod(sh_{i+1}, x)
    tl = sl_{i+1} ⊗ x ⊕ tl
    [ph_i, pl_i] = Fast2Sum(th, tl)
    %% [sh_i, sl_i] = [ph_i, pl_i] + a_i
    [th, tl] = 2Sum(ph_i, a_i)
    tl = tl ⊕ pl_i
    [sh_i, sl_i] = Fast2Sum(th, tl)
  end
  r = sh_0

51 In red, the only differences between CompHorner and DDHorner
function r = CompHorner(P, x)
  s_n = a_n ; c_n = 0
  for i = n−1 : −1 : 0
    [p_i, π_i] = 2Prod(s_{i+1}, x)
    t_i = c_{i+1} ⊗ x ⊕ π_i
    [s_i, σ_i] = 2Sum(p_i, a_i)
    c_i = t_i ⊕ σ_i
  end
  r = s_0 ⊕ c_0
function r = DDHorner(P, x)
  sh_n = a_n ; sl_n = 0
  for i = n−1 : −1 : 0
    %% [ph_i, pl_i] = [sh_{i+1}, sl_{i+1}] × x
    [th, tl] = 2Prod(sh_{i+1}, x)
    tl = sl_{i+1} ⊗ x ⊕ tl
    [ph_i, pl_i] = Fast2Sum(th, tl)
    %% [sh_i, sl_i] = [ph_i, pl_i] + a_i
    [th, tl] = 2Sum(ph_i, a_i)
    tl = tl ⊕ pl_i
    [sh_i, sl_i] = Fast2Sum(th, tl)
  end
  r = sh_0

52 Difference in the data flow graph
[figure: data flow graphs of the inner loops of CompHorner and DDHorner, showing all data dependences between the splitter, ⊗ and ⊕ nodes]
We represent all data dependences in the inner loop of each algorithm.
The critical path in floating point operations for CompHorner is half as long as in DDHorner.
More parallelism among floating point operations in CompHorner than in DDHorner: TwoProd, TwoSum and the next iteration can be pipelined, whereas the renormalisation (Fast2Sum) in DDHorner forbids such ILP.
Thus more potential ILP, and greater practical performance!

53 1 Compensated algorithms: why and how?
2 Basic bricks towards compensated algorithms
3 Compensated Horner algorithm
4 Performance: arithmetic and processor issues
5 Conclusion — Summary and work in progress; Acknowledgements

54 To finish...
Main message: compensated algorithms are an interesting alternative to existing libraries such as double-double or quad-double:
accuracy as if computed with K times the available precision;
efficiency in terms of running time, without too much portability sacrifice, since only the IEEE-754 precisions (single, double) are used;
a dynamic bound to control the accuracy of the compensated result.
Some work in progress:
GPU and Sony-IBM Cell: these new exotic processors are extremely powerful architectures, but only in single precision!
Highly accurate Horner scheme thanks to compensation and the faithful summation of Rump-Ogita-Oishi.

55 Acknowledgements
The SCAN 06 organizers, and Professor W. Luther in particular.
Stef Graillat and Nicolas Louvet — go to his talk this afternoon at 15:15.
Professor B. Goossens and D. Parello, for their expertise in micro-architecture and compiler optimisation.
The recent Rump-Ogita-Oishi papers, for the motivation of these developments.


More information

Adaptive and Efficient Algorithm for 2D Orientation Problem

Adaptive and Efficient Algorithm for 2D Orientation Problem Japan J. Indust. Appl. Math., 26 (2009), 215 231 Area 2 Adaptive and Efficient Algorithm for 2D Orientation Problem Katsuhisa Ozaki, Takeshi Ogita, Siegfried M. Rump and Shin ichi Oishi Research Institute

More information

Floating Point Arithmetic and Rounding Error Analysis

Floating Point Arithmetic and Rounding Error Analysis Floating Point Arithmetic and Rounding Error Analysis Claude-Pierre Jeannerod Nathalie Revol Inria LIP, ENS de Lyon 2 Context Starting point: How do numerical algorithms behave in nite precision arithmetic?

More information

Computation of the error functions erf and erfc in arbitrary precision with correct rounding

Computation of the error functions erf and erfc in arbitrary precision with correct rounding Computation of the error functions erf and erfc in arbitrary precision with correct rounding Sylvain Chevillard Arenaire, LIP, ENS-Lyon, France Sylvain.Chevillard@ens-lyon.fr Nathalie Revol INRIA, Arenaire,

More information

Reproducible, Accurately Rounded and Efficient BLAS

Reproducible, Accurately Rounded and Efficient BLAS Reproducible, Accurately Rounded and Efficient BLAS Chemseddine Chohra, Philippe Langlois, David Parello To cite this version: Chemseddine Chohra, Philippe Langlois, David Parello. Reproducible, Accurately

More information

Automatic Verified Numerical Computations for Linear Systems

Automatic Verified Numerical Computations for Linear Systems Automatic Verified Numerical Computations for Linear Systems Katsuhisa Ozaki (Shibaura Institute of Technology) joint work with Takeshi Ogita (Tokyo Woman s Christian University) Shin ichi Oishi (Waseda

More information

Implementation of float float operators on graphic hardware

Implementation of float float operators on graphic hardware Implementation of float float operators on graphic hardware Guillaume Da Graça David Defour DALI, Université de Perpignan Outline Why GPU? Floating point arithmetic on GPU Float float format Problem &

More information

ACCURATE FLOATING-POINT SUMMATION PART II: SIGN, K-FOLD FAITHFUL AND ROUNDING TO NEAREST

ACCURATE FLOATING-POINT SUMMATION PART II: SIGN, K-FOLD FAITHFUL AND ROUNDING TO NEAREST submitted for publication in SISC, November 13, 2005, revised April 13, 2007 and April 15, 2008, accepted August 21, 2008; appeared SIAM J. Sci. Comput. Vol. 31, Issue 2, pp. 1269-1302 (2008) ACCURATE

More information

Optimizing Scientific Libraries for the Itanium

Optimizing Scientific Libraries for the Itanium 0 Optimizing Scientific Libraries for the Itanium John Harrison Intel Corporation Gelato Federation Meeting, HP Cupertino May 25, 2005 1 Quick summary Intel supplies drop-in replacement versions of common

More information

Formal verification of IA-64 division algorithms

Formal verification of IA-64 division algorithms Formal verification of IA-64 division algorithms 1 Formal verification of IA-64 division algorithms John Harrison Intel Corporation IA-64 overview HOL Light overview IEEE correctness Division on IA-64

More information

Extended precision with a rounding mode toward zero environment. Application on the CELL processor

Extended precision with a rounding mode toward zero environment. Application on the CELL processor Extended precision with a rounding mode toward zero environment. Application on the CELL processor Hong Diep Nguyen (hong.diep.nguyen@ens-lyon.fr), Stef Graillat (stef.graillat@lip6.fr), Jean-Luc Lamotte

More information

MANY numerical problems in dynamical systems

MANY numerical problems in dynamical systems IEEE TRANSACTIONS ON COMPUTERS, VOL., 201X 1 Arithmetic algorithms for extended precision using floating-point expansions Mioara Joldeş, Olivier Marty, Jean-Michel Muller and Valentina Popescu Abstract

More information

INVERSION OF EXTREMELY ILL-CONDITIONED MATRICES IN FLOATING-POINT

INVERSION OF EXTREMELY ILL-CONDITIONED MATRICES IN FLOATING-POINT submitted for publication in JJIAM, March 15, 28 INVERSION OF EXTREMELY ILL-CONDITIONED MATRICES IN FLOATING-POINT SIEGFRIED M. RUMP Abstract. Let an n n matrix A of floating-point numbers in some format

More information

A new multiplication algorithm for extended precision using floating-point expansions. Valentina Popescu, Jean-Michel Muller,Ping Tak Peter Tang

A new multiplication algorithm for extended precision using floating-point expansions. Valentina Popescu, Jean-Michel Muller,Ping Tak Peter Tang A new multiplication algorithm for extended precision using floating-point expansions Valentina Popescu, Jean-Michel Muller,Ping Tak Peter Tang ARITH 23 July 2016 AMPAR CudA Multiple Precision ARithmetic

More information

Convergence of Rump s Method for Inverting Arbitrarily Ill-Conditioned Matrices

Convergence of Rump s Method for Inverting Arbitrarily Ill-Conditioned Matrices published in J. Comput. Appl. Math., 205(1):533 544, 2007 Convergence of Rump s Method for Inverting Arbitrarily Ill-Conditioned Matrices Shin ichi Oishi a,b Kunio Tanabe a Takeshi Ogita b,a Siegfried

More information

Automated design of floating-point logarithm functions on integer processors

Automated design of floating-point logarithm functions on integer processors 23rd IEEE Symposium on Computer Arithmetic Santa Clara, CA, USA, 10-13 July 2016 Automated design of floating-point logarithm functions on integer processors Guillaume Revy (presented by Florent de Dinechin)

More information

On the computation of the reciprocal of floating point expansions using an adapted Newton-Raphson iteration

On the computation of the reciprocal of floating point expansions using an adapted Newton-Raphson iteration On the computation of the reciprocal of floating point expansions using an adapted Newton-Raphson iteration Mioara Joldes, Valentina Popescu, Jean-Michel Muller ASAP 2014 1 / 10 Motivation Numerical problems

More information

On various ways to split a floating-point number

On various ways to split a floating-point number On various ways to split a floating-point number Claude-Pierre Jeannerod Jean-Michel Muller Paul Zimmermann Inria, CNRS, ENS Lyon, Université de Lyon, Université de Lorraine France ARITH-25 June 2018 -2-

More information

Optimizing the Representation of Intervals

Optimizing the Representation of Intervals Optimizing the Representation of Intervals Javier D. Bruguera University of Santiago de Compostela, Spain Numerical Sofware: Design, Analysis and Verification Santander, Spain, July 4-6 2012 Contents 1

More information

Newton-Raphson Algorithms for Floating-Point Division Using an FMA

Newton-Raphson Algorithms for Floating-Point Division Using an FMA Newton-Raphson Algorithms for Floating-Point Division Using an FMA Nicolas Louvet, Jean-Michel Muller, Adrien Panhaleux Abstract Since the introduction of the Fused Multiply and Add (FMA) in the IEEE-754-2008

More information

Towards Fast, Accurate and Reproducible LU Factorization

Towards Fast, Accurate and Reproducible LU Factorization Towards Fast, Accurate and Reproducible LU Factorization Roman Iakymchuk 1, David Defour 2, and Stef Graillat 3 1 KTH Royal Institute of Technology, CSC, CST/PDC 2 Université de Perpignan, DALI LIRMM 3

More information

Some issues related to double rounding

Some issues related to double rounding Noname manuscript No. will be inserted by the editor) Some issues related to double rounding Érik Martin-Dorel Guillaume Melquiond Jean-Michel Muller Received: date / Accepted: date Abstract Double rounding

More information

ERROR ESTIMATION OF FLOATING-POINT SUMMATION AND DOT PRODUCT

ERROR ESTIMATION OF FLOATING-POINT SUMMATION AND DOT PRODUCT accepted for publication in BIT, May 18, 2011. ERROR ESTIMATION OF FLOATING-POINT SUMMATION AND DOT PRODUCT SIEGFRIED M. RUMP Abstract. We improve the well-known Wilkinson-type estimates for the error

More information

Parallel Reproducible Summation

Parallel Reproducible Summation Parallel Reproducible Summation James Demmel Mathematics Department and CS Division University of California at Berkeley Berkeley, CA 94720 demmel@eecs.berkeley.edu Hong Diep Nguyen EECS Department University

More information

Parallel Polynomial Evaluation

Parallel Polynomial Evaluation Parallel Polynomial Evaluation Jan Verschelde joint work with Genady Yoffe University of Illinois at Chicago Department of Mathematics, Statistics, and Computer Science http://www.math.uic.edu/ jan jan@math.uic.edu

More information

Continued fractions and number systems: applications to correctly-rounded implementations of elementary functions and modular arithmetic.

Continued fractions and number systems: applications to correctly-rounded implementations of elementary functions and modular arithmetic. Continued fractions and number systems: applications to correctly-rounded implementations of elementary functions and modular arithmetic. Mourad Gouicem PEQUAN Team, LIP6/UPMC Nancy, France May 28 th 2013

More information

Error Bounds for Iterative Refinement in Three Precisions

Error Bounds for Iterative Refinement in Three Precisions Error Bounds for Iterative Refinement in Three Precisions Erin C. Carson, New York University Nicholas J. Higham, University of Manchester SIAM Annual Meeting Portland, Oregon July 13, 018 Hardware Support

More information

Faithfully Rounded Floating-point Computations

Faithfully Rounded Floating-point Computations 1 Faithfully Rounded Floating-point Computations MARKO LANGE, Waseda University SIEGFRIED M. RUMP, Hamburg University of Technology We present a pair arithmetic for the four basic operations and square

More information

Verification methods for linear systems using ufp estimation with rounding-to-nearest

Verification methods for linear systems using ufp estimation with rounding-to-nearest NOLTA, IEICE Paper Verification methods for linear systems using ufp estimation with rounding-to-nearest Yusuke Morikura 1a), Katsuhisa Ozaki 2,4, and Shin ichi Oishi 3,4 1 Graduate School of Fundamental

More information

Tight and rigourous error bounds for basic building blocks of double-word arithmetic. Mioara Joldes, Jean-Michel Muller, Valentina Popescu

Tight and rigourous error bounds for basic building blocks of double-word arithmetic. Mioara Joldes, Jean-Michel Muller, Valentina Popescu Tight and rigourous error bounds for basic building blocks of double-word arithmetic Mioara Joldes, Jean-Michel Muller, Valentina Popescu SCAN - September 2016 AMPAR CudA Multiple Precision ARithmetic

More information

Formal Verification of Mathematical Algorithms

Formal Verification of Mathematical Algorithms Formal Verification of Mathematical Algorithms 1 Formal Verification of Mathematical Algorithms John Harrison Intel Corporation The cost of bugs Formal verification Levels of verification HOL Light Formalizing

More information

1 Floating point arithmetic

1 Floating point arithmetic Introduction to Floating Point Arithmetic Floating point arithmetic Floating point representation (scientific notation) of numbers, for example, takes the following form.346 0 sign fraction base exponent

More information

Getting tight error bounds in floating-point arithmetic: illustration with complex functions, and the real x n function

Getting tight error bounds in floating-point arithmetic: illustration with complex functions, and the real x n function Getting tight error bounds in floating-point arithmetic: illustration with complex functions, and the real x n function Jean-Michel Muller collaborators: S. Graillat, C.-P. Jeannerod, N. Louvet V. Lefèvre

More information

FAST VERIFICATION OF SOLUTIONS OF MATRIX EQUATIONS

FAST VERIFICATION OF SOLUTIONS OF MATRIX EQUATIONS published in Numer. Math., 90(4):755 773, 2002 FAST VERIFICATION OF SOLUTIONS OF MATRIX EQUATIONS SHIN ICHI OISHI AND SIEGFRIED M. RUMP Abstract. In this paper, we are concerned with a matrix equation

More information

SRT Division and the Pentium FDIV Bug (draft lecture notes, CSCI P415)

SRT Division and the Pentium FDIV Bug (draft lecture notes, CSCI P415) SRT Division and the Pentium FDIV Bug (draft lecture notes, CSCI P4) Steven D. Johnson September 0, 000 Abstract This talk explains the widely publicized design error in a 994 issue of the Intel Corp.

More information

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2) INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder

More information

Homework 2 Foundations of Computational Math 1 Fall 2018

Homework 2 Foundations of Computational Math 1 Fall 2018 Homework 2 Foundations of Computational Math 1 Fall 2018 Note that Problems 2 and 8 have coding in them. Problem 2 is a simple task while Problem 8 is very involved (and has in fact been given as a programming

More information

1 Backward and Forward Error

1 Backward and Forward Error Math 515 Fall, 2008 Brief Notes on Conditioning, Stability and Finite Precision Arithmetic Most books on numerical analysis, numerical linear algebra, and matrix computations have a lot of material covering

More information

Lecture 7. Floating point arithmetic and stability

Lecture 7. Floating point arithmetic and stability Lecture 7 Floating point arithmetic and stability 2.5 Machine representation of numbers Scientific notation: 23 }{{} }{{} } 3.14159265 {{} }{{} 10 sign mantissa base exponent (significand) s m β e A floating

More information

Hierarchical Approach for Deriving a Reproducible LU factorization

Hierarchical Approach for Deriving a Reproducible LU factorization Hierarchical Approach for Deriving a Reproducible LU factorization Roman Iakymchuk, Stef Graillat, David Defour, Erwin Laure, Enrique Quintana-Ortí To cite this version: Roman Iakymchuk, Stef Graillat,

More information

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 9. Datapath Design Jacob Abraham Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 October 2, 2017 ECE Department, University of Texas at Austin

More information

Automating the Verification of Floating-point Algorithms

Automating the Verification of Floating-point Algorithms Introduction Interval+Error Advanced Gappa Conclusion Automating the Verification of Floating-point Algorithms Guillaume Melquiond Inria Saclay Île-de-France LRI, Université Paris Sud, CNRS 2014-07-18

More information

arxiv: v1 [cs.na] 8 Feb 2016

arxiv: v1 [cs.na] 8 Feb 2016 Toom-Coo Multiplication: Some Theoretical and Practical Aspects arxiv:1602.02740v1 [cs.na] 8 Feb 2016 M.J. Kronenburg Abstract Toom-Coo multiprecision multiplication is a well-nown multiprecision multiplication

More information

CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 5. Ax = b.

CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 5. Ax = b. CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 5 GENE H GOLUB Suppose we want to solve We actually have an approximation ξ such that 1 Perturbation Theory Ax = b x = ξ + e The question is, how

More information

Numerical Program Optimization by Automatic Improvement of the Accuracy of Computations. Nasrine Damouche* and Matthieu Martel

Numerical Program Optimization by Automatic Improvement of the Accuracy of Computations. Nasrine Damouche* and Matthieu Martel Int. J. of Intelligent Engineering Informatics, Vol. x, No. x, 1 26 1 Numerical Program Optimization by Automatic Improvement of the Accuracy of Computations Nasrine Damouche* and Matthieu Martel Université

More information

Math 411 Preliminaries

Math 411 Preliminaries Math 411 Preliminaries Provide a list of preliminary vocabulary and concepts Preliminary Basic Netwon s method, Taylor series expansion (for single and multiple variables), Eigenvalue, Eigenvector, Vector

More information

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

Faster arithmetic for number-theoretic transforms

Faster arithmetic for number-theoretic transforms University of New South Wales 7th October 2011, Macquarie University Plan for talk 1. Review number-theoretic transform (NTT) 2. Discuss typical butterfly algorithm 3. Improvements to butterfly algorithm

More information

Computing correctly rounded logarithms with fixed-point operations. Julien Le Maire, Florent de Dinechin, Jean-Michel Muller and Nicolas Brunie

Computing correctly rounded logarithms with fixed-point operations. Julien Le Maire, Florent de Dinechin, Jean-Michel Muller and Nicolas Brunie Computing correctly rounded logarithms with fixed-point operations Julien Le Maire, Florent de Dinechin, Jean-Michel Muller and Nicolas Brunie Outline Introduction and context Algorithm Results and comparisons

More information

CS 700: Quantitative Methods & Experimental Design in Computer Science

CS 700: Quantitative Methods & Experimental Design in Computer Science CS 700: Quantitative Methods & Experimental Design in Computer Science Sanjeev Setia Dept of Computer Science George Mason University Logistics Grade: 35% project, 25% Homework assignments 20% midterm,

More information

Calculate Sensitivity Function Using Parallel Algorithm

Calculate Sensitivity Function Using Parallel Algorithm Journal of Computer Science 4 (): 928-933, 28 ISSN 549-3636 28 Science Publications Calculate Sensitivity Function Using Parallel Algorithm Hamed Al Rjoub Irbid National University, Irbid, Jordan Abstract:

More information

Numerics and Error Analysis

Numerics and Error Analysis Numerics and Error Analysis CS 205A: Mathematical Methods for Robotics, Vision, and Graphics Doug James (and Justin Solomon) CS 205A: Mathematical Methods Numerics and Error Analysis 1 / 30 A Puzzle What

More information

Fast and accurate Bessel function computation

Fast and accurate Bessel function computation 0 Fast and accurate Bessel function computation John Harrison, Intel Corporation ARITH-19 Portland, OR Tue 9th June 2009 (11:00 11:30) 1 Bessel functions and their computation Bessel functions are certain

More information

Round-off error propagation and non-determinism in parallel applications

Round-off error propagation and non-determinism in parallel applications Round-off error propagation and non-determinism in parallel applications Vincent Baudoui (Argonne/Total SA) vincent.baudoui@gmail.com Franck Cappello (Argonne/INRIA/UIUC-NCSA) Georges Oppenheim (Paris-Sud

More information

Chapter 4 No. 4.0 Answer True or False to the following. Give reasons for your answers.

Chapter 4 No. 4.0 Answer True or False to the following. Give reasons for your answers. MATH 434/534 Theoretical Assignment 3 Solution Chapter 4 No 40 Answer True or False to the following Give reasons for your answers If a backward stable algorithm is applied to a computational problem,

More information

Gaussian Elimination for Linear Systems

Gaussian Elimination for Linear Systems Gaussian Elimination for Linear Systems Tsung-Ming Huang Department of Mathematics National Taiwan Normal University October 3, 2011 1/56 Outline 1 Elementary matrices 2 LR-factorization 3 Gaussian elimination

More information

McBits: Fast code-based cryptography

McBits: Fast code-based cryptography McBits: Fast code-based cryptography Peter Schwabe Radboud University Nijmegen, The Netherlands Joint work with Daniel Bernstein, Tung Chou December 17, 2013 IMA International Conference on Cryptography

More information

Compute the behavior of reality even if it is impossible to observe the processes (for example a black hole in astrophysics).

Compute the behavior of reality even if it is impossible to observe the processes (for example a black hole in astrophysics). 1 Introduction Read sections 1.1, 1.2.1 1.2.4, 1.2.6, 1.3.8, 1.3.9, 1.4. Review questions 1.1 1.6, 1.12 1.21, 1.37. The subject of Scientific Computing is to simulate the reality. Simulation is the representation

More information

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu Performance Metrics for Computer Systems CASS 2018 Lavanya Ramapantulu Eight Great Ideas in Computer Architecture Design for Moore s Law Use abstraction to simplify design Make the common case fast Performance

More information

Exact Determinant of Integer Matrices

Exact Determinant of Integer Matrices Takeshi Ogita Department of Mathematical Sciences, Tokyo Woman s Christian University Tokyo 167-8585, Japan, ogita@lab.twcu.ac.jp Abstract. This paper is concerned with the numerical computation of the

More information

The Classical Relative Error Bounds for Computing a2 + b 2 and c/ a 2 + b 2 in Binary Floating-Point Arithmetic are Asymptotically Optimal

The Classical Relative Error Bounds for Computing a2 + b 2 and c/ a 2 + b 2 in Binary Floating-Point Arithmetic are Asymptotically Optimal -1- The Classical Relative Error Bounds for Computing a2 + b 2 and c/ a 2 + b 2 in Binary Floating-Point Arithmetic are Asymptotically Optimal Claude-Pierre Jeannerod Jean-Michel Muller Antoine Plet Inria,

More information

The Euclidean Division Implemented with a Floating-Point Multiplication and a Floor

The Euclidean Division Implemented with a Floating-Point Multiplication and a Floor The Euclidean Division Implemented with a Floating-Point Multiplication and a Floor Vincent Lefèvre 11th July 2005 Abstract This paper is a complement of the research report The Euclidean division implemented

More information

Isolating critical cases for reciprocals using integer factorization. John Harrison Intel Corporation ARITH-16 Santiago de Compostela 17th June 2003

Isolating critical cases for reciprocals using integer factorization. John Harrison Intel Corporation ARITH-16 Santiago de Compostela 17th June 2003 0 Isolating critical cases for reciprocals using integer factorization John Harrison Intel Corporation ARITH-16 Santiago de Compostela 17th June 2003 1 result before the final rounding Background Suppose

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Decompositions, numerical aspects Gerard Sleijpen and Martin van Gijzen September 27, 2017 1 Delft University of Technology Program Lecture 2 LU-decomposition Basic algorithm Cost

More information

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects Numerical Linear Algebra Decompositions, numerical aspects Program Lecture 2 LU-decomposition Basic algorithm Cost Stability Pivoting Cholesky decomposition Sparse matrices and reorderings Gerard Sleijpen

More information

Propagation of Roundoff Errors in Finite Precision Computations: a Semantics Approach 1

Propagation of Roundoff Errors in Finite Precision Computations: a Semantics Approach 1 Propagation of Roundoff Errors in Finite Precision Computations: a Semantics Approach 1 Matthieu Martel CEA - Recherche Technologique LIST-DTSI-SLA CEA F91191 Gif-Sur-Yvette Cedex, France e-mail : mmartel@cea.fr

More information

Accelerating linear algebra computations with hybrid GPU-multicore systems.

Accelerating linear algebra computations with hybrid GPU-multicore systems. Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)

More information

Jim Lambers MAT 610 Summer Session Lecture 2 Notes

Jim Lambers MAT 610 Summer Session Lecture 2 Notes Jim Lambers MAT 610 Summer Session 2009-10 Lecture 2 Notes These notes correspond to Sections 2.2-2.4 in the text. Vector Norms Given vectors x and y of length one, which are simply scalars x and y, the

More information

Skew-Tolerant Circuit Design

Skew-Tolerant Circuit Design Skew-Tolerant Circuit Design David Harris David_Harris@hmc.edu December, 2000 Harvey Mudd College Claremont, CA Outline Introduction Skew-Tolerant Circuits Traditional Domino Circuits Skew-Tolerant Domino

More information

1.1 COMPUTER REPRESENTATION OF NUM- BERS, REPRESENTATION ERRORS

1.1 COMPUTER REPRESENTATION OF NUM- BERS, REPRESENTATION ERRORS Chapter 1 NUMBER REPRESENTATION, ERROR ANALYSIS 1.1 COMPUTER REPRESENTATION OF NUM- BERS, REPRESENTATION ERRORS Floating-point representation x t,r of a number x: x t,r = m t P cr, where: P - base (the

More information

Elements of Floating-point Arithmetic

Elements of Floating-point Arithmetic Elements of Floating-point Arithmetic Sanzheng Qiao Department of Computing and Software McMaster University July, 2012 Outline 1 Floating-point Numbers Representations IEEE Floating-point Standards Underflow

More information

Dense LU factorization and its error analysis

Dense LU factorization and its error analysis Dense LU factorization and its error analysis Laura Grigori INRIA and LJLL, UPMC February 2016 Plan Basis of floating point arithmetic and stability analysis Notation, results, proofs taken from [N.J.Higham,

More information

FLOATING POINT ARITHMETHIC - ERROR ANALYSIS

FLOATING POINT ARITHMETHIC - ERROR ANALYSIS FLOATING POINT ARITHMETHIC - ERROR ANALYSIS Brief review of floating point arithmetic Model of floating point arithmetic Notation, backward and forward errors 3-1 Roundoff errors and floating-point arithmetic

More information

Representable correcting terms for possibly underflowing floating point operations

Representable correcting terms for possibly underflowing floating point operations Representable correcting terms for possibly underflowing floating point operations Sylvie Boldo & Marc Daumas E-mail: Sylvie.Boldo@ENS-Lyon.Fr & Marc.Daumas@ENS-Lyon.Fr Laboratoire de l Informatique du

More information

PowerPoints organized by Dr. Michael R. Gustafson II, Duke University

PowerPoints organized by Dr. Michael R. Gustafson II, Duke University Part 1 Chapter 4 Roundoff and Truncation Errors PowerPoints organized by Dr. Michael R. Gustafson II, Duke University All images copyright The McGraw-Hill Companies, Inc. Permission required for reproduction

More information

FLOATING POINT ARITHMETHIC - ERROR ANALYSIS

FLOATING POINT ARITHMETHIC - ERROR ANALYSIS FLOATING POINT ARITHMETHIC - ERROR ANALYSIS Brief review of floating point arithmetic Model of floating point arithmetic Notation, backward and forward errors Roundoff errors and floating-point arithmetic

More information

Fall 2008 CSE Qualifying Exam. September 13, 2008

Fall 2008 CSE Qualifying Exam. September 13, 2008 Fall 2008 CSE Qualifying Exam September 13, 2008 1 Architecture 1. (Quan, Fall 2008) Your company has just bought a new dual Pentium processor, and you have been tasked with optimizing your software for

More information

Notes on floating point number, numerical computations and pitfalls

Notes on floating point number, numerical computations and pitfalls Notes on floating point number, numerical computations and pitfalls November 6, 212 1 Floating point numbers An n-digit floating point number in base β has the form x = ±(.d 1 d 2 d n ) β β e where.d 1

More information

4.2 Floating-Point Numbers

4.2 Floating-Point Numbers 101 Approximation 4.2 Floating-Point Numbers 4.2 Floating-Point Numbers The number 3.1416 in scientific notation is 0.31416 10 1 or (as computer output) -0.31416E01..31416 10 1 exponent sign mantissa base

More information

An Introduction to Differential Algebra

An Introduction to Differential Algebra An Introduction to Differential Algebra Alexander Wittig1, P. Di Lizia, R. Armellin, et al. 1 ESA Advanced Concepts Team (TEC-SF) SRL, Milan Dinamica Outline 1 Overview Five Views of Differential Algebra

More information

EECS150 - Digital Design Lecture 27 - misc2

EECS150 - Digital Design Lecture 27 - misc2 EECS150 - Digital Design Lecture 27 - misc2 May 1, 2002 John Wawrzynek Spring 2002 EECS150 - Lec27-misc2 Page 1 Outline Linear Feedback Shift Registers Theory and practice Simple hardware division algorithms

More information

Floating Point Number Systems. Simon Fraser University Surrey Campus MACM 316 Spring 2005 Instructor: Ha Le

Floating Point Number Systems. Simon Fraser University Surrey Campus MACM 316 Spring 2005 Instructor: Ha Le Floating Point Number Systems Simon Fraser University Surrey Campus MACM 316 Spring 2005 Instructor: Ha Le 1 Overview Real number system Examples Absolute and relative errors Floating point numbers Roundoff

More information

Measurement & Performance

Measurement & Performance Measurement & Performance Timers Performance measures Time-based metrics Rate-based metrics Benchmarking Amdahl s law Topics 2 Page The Nature of Time real (i.e. wall clock) time = User Time: time spent

More information

Can You Count on Your Computer?

Can You Count on Your Computer? Can You Count on Your Computer? Professor Nick Higham School of Mathematics University of Manchester higham@ma.man.ac.uk http://www.ma.man.ac.uk/~higham/ p. 1/33 p. 2/33 Counting to Six I asked my computer

More information

Measurement & Performance

Measurement & Performance Measurement & Performance Topics Timers Performance measures Time-based metrics Rate-based metrics Benchmarking Amdahl s law 2 The Nature of Time real (i.e. wall clock) time = User Time: time spent executing

More information

Program 1 Foundations of Computational Math 1 Fall 2018

Program 1 Foundations of Computational Math 1 Fall 2018 Program 1 Foundations of Computational Math 1 Fall 2018 Due date: 11:59PM on Friday, 28 September 2018 Written Exercises Problem 1 Consider the summation σ = n ξ i using the following binary fan-in tree

More information