Floating Point Arithmetic and Rounding Error Analysis


1 Floating Point Arithmetic and Rounding Error Analysis. Claude-Pierre Jeannerod, Nathalie Revol. Inria, LIP, ENS de Lyon.

2 Context
Starting point: how do numerical algorithms behave in finite precision arithmetic?
Typically, basic matrix computations (Ax = b, ...) in floating-point arithmetic as specified by IEEE 754.
Finite precision ⇒ most operations (+, −, ×, /, √, ...) must round their result.
What is the effect of all such rounding errors on the computed solution x̂?

3 Rounding error analysis
Old and nontrivial question [von Neumann, Turing, Wilkinson, ...]
In this lecture, two approaches:
A priori analysis (this lecture):
- Goal: a bound on ‖x̂ − x‖/‖x‖ for any input and format
- Tool: the many nice properties of floating-point
- Ideal: readable, provably tight bound + short proof
A posteriori, automatic analysis (Nathalie's lecture):
- Goal: x̂ and an enclosure of x̂ − x for a given input and format
- Tool: interval arithmetic based on floating-point
- Ideal: a narrow interval, computed fast

5 Outline: Context · Floating-point arithmetic · A priori analysis · Conclusion

6 Floating-point arithmetic
An approximation of arithmetic over ℝ.
1940's: first implementations [Zuse's computers]. 1985: full specification [IEEE 754 standard]. Today: IEEE arithmetic everywhere!
Although often considered as fuzzy, it is highly structured and has many nice mathematical properties.
How to exploit these properties for rigorous analyses?

7 Floating-point numbers
Rational numbers of the form M · β^E, where M, E ∈ ℤ, |M| < β^p, E + p − 1 ∈ [e_min, e_max]:
base β, precision p, exponent range defined by e_min and e_max.
Floats in C have β = 2, p = 24, and [e_min, e_max] = [−126, 127].

8 Floating-point numbers
We assume e_min = −∞ and e_max = +∞ (unbounded exponent range), β even, and p ≥ 2.
Definition: the set F of floating-point numbers in base β and precision p is
F := {0} ∪ {M · β^E : M, E ∈ ℤ, β^(p−1) ≤ |M| < β^p}.

9 Floating-point numbers: other representations
x ∈ F\{0} ⇒ x = ±m · β^e, with significand m = (d_0.d_1 ... d_(p−1))_β ∈ [1, β).
Three useful units:
- unit in the first place: ufp(x) = β^e,
- unit in the last place: ulp(x) = β^(e−p+1),
- unit roundoff: u = (1/2) β^(1−p).
Alternative views, which display the structure of F very well:
x ∈ ulp(x)·ℤ,   |x| = (1 + 2ku)·ufp(x) with k ∈ ℕ,   F ∩ [1, β) = {1, 1 + 2u, 1 + 4u, ...}.

10 Floating-point numbers: some properties
F can be seen as a structured grid with many nice properties:
- symmetry: f ∈ F ⇒ −f ∈ F;
- auto-similarity: f ∈ F, e ∈ ℤ ⇒ f·β^e ∈ F;
- F ∩ [1, β) = {1, 1 + 2u, 1 + 4u, ...};
- F ∩ [β^e, β^(e+1)) has (β − 1)β^(p−1) equally spaced elements, with spacing equal to 2uβ^e;
- neighborhood of 1 ∈ F:  ..., 1 − 4u/β, 1 − 2u/β, 1, 1 + 2u, 1 + 4u, ...
  Hence 1 is β times closer to its predecessor than to its successor.
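A minimal numerical sketch of these units in binary64 (β = 2, p = 53), assuming Python 3.9+ for math.ulp and math.nextafter:

    import math

    p = 53
    u = 0.5 * 2.0 ** (1 - p)                       # unit roundoff u = (1/2) * beta**(1-p) = 2**-53

    print(math.ulp(1.0) == 2 * u)                  # True: ulp(1) = 2u
    print(math.nextafter(1.0, 2.0) == 1 + 2 * u)   # True: successor of 1 is 1 + 2u
    print(math.nextafter(1.0, 0.0) == 1 - u)       # True: predecessor of 1 is 1 - 2u/beta = 1 - u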

11 Rounding functions
Round-to-nearest function RN: ℝ → F such that, for all t ∈ ℝ,
|RN(t) − t| = min_{f ∈ F} |f − t|, with a given tie-breaking rule.
Properties:
- t ∈ F ⇒ RN(t) = t;
- RN is nondecreasing;
- for a reasonable tie-breaking rule: RN(−t) = −RN(t) and RN(tβ^e) = RN(t)β^e for all e ∈ ℤ.

12 Rounding functions
More generally, a rounding function is any map ○: ℝ → F such that
t ∈ F ⇒ ○(t) = t;   t ≤ t′ ⇒ ○(t) ≤ ○(t′).
Adding just one extra constraint gives the usual directed roundings:
- rounding down: RD(t) ≤ t;
- rounding up: t ≤ RU(t);
- rounding to zero: |RZ(t)| ≤ |t|.
Key property for interval arithmetic: for all t ∈ ℝ, t ∈ [RD(t), RU(t)].
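Standard Python only rounds to nearest, but the enclosure property can still be illustrated by stepping from RN(t) to a neighboring float; a small sketch assuming Python 3.9+ (the helper enclose below is my own illustration, not from the slides):

    import math
    from fractions import Fraction

    def enclose(t: Fraction):
        """Return floats (lo, hi) with lo <= t <= hi, i.e. [RD(t), RU(t)]."""
        rn = float(t)                                       # round to nearest
        lo = rn if Fraction(rn) <= t else math.nextafter(rn, -math.inf)
        hi = rn if Fraction(rn) >= t else math.nextafter(rn, math.inf)
        return lo, hi

    lo, hi = enclose(Fraction(1, 10))                       # 1/10 is not a binary float
    print(Fraction(lo) <= Fraction(1, 10) <= Fraction(hi))  # True
    print(lo.hex(), hi.hex())                               # two consecutive binary64 numbers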

13 Error bounds for real numbers
For t ∈ ℝ, t ≠ 0:
E₁(t) := |RN(t) − t| / |t| ≤ u/(1+u),   E₂(t) := |RN(t) − t| / |RN(t)| ≤ u.
Proof: assume 1 ≤ t < β, so that RN(t) ∈ {1, 1 + 2u, 1 + 4u, ..., β}.
Then |RN(t) − t| ≤ (1/2) · 2u = u.
Dividing by |RN(t)| ≥ 1 gives directly the bound on E₂.
If t ≥ 1 + u then the bound E₁(t) ≤ u/(1+u) follows.
Else 1 ≤ t < 1 + u ⇒ RN(t) = 1 ⇒ E₁(t) = (t − 1)/t < u/(1+u).
The bound u/(1+u) is sharp and well known [Dekker'71, Holm'80, Knuth'81-98], but the simpler bound u is almost always used in practice.

14 Error bounds for real numbers
Example for the usual binary formats (u = 2^(−p); u/(1+u) is barely smaller):
- p = 11 (half):   u ≈ 4.9·10⁻⁴
- p = 24 (float):  u ≈ 6.0·10⁻⁸
- p = 53 (double): u ≈ 1.1·10⁻¹⁶
- p = 113 (quad):  u ≈ 9.6·10⁻³⁵
For directed roundings, replace these bounds by 2u.
Conclusion: in all cases, the relative errors due to rounding can be bounded by a tiny quantity which depends only on the format.
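A quick empirical check of these bounds in binary64, measuring the rounding error exactly with rational arithmetic (a sketch, not from the slides):

    import random
    from fractions import Fraction

    u = Fraction(1, 2**53)                       # unit roundoff of binary64
    for _ in range(10_000):
        t = Fraction(random.randint(1, 10**18), random.randint(1, 10**18))
        rn = Fraction(float(t))                  # float() rounds to nearest, ties to even
        assert abs(rn - t) <= u / (1 + u) * t    # E1(t) <= u/(1+u)
        assert abs(rn - t) <= u * rn             # E2(t) <= u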

15 Correct rounding
The result is the composition of two functions: the basic operation is performed exactly, and the exact result is then rounded:
x, y ∈ F, op ∈ {+, −, ×, /} ⇒ return r̂ := RN(x op y).
Correct rounding extends to square root and to the FMA (fused multiply-add: xy + z).
The error bounds on E₁ and E₂ yield two standard models:
r̂ = (x op y)(1 + δ₁),   |δ₁| ≤ u/(1+u) =: u₁,
r̂ = (x op y)/(1 + δ₂),  |δ₂| ≤ u.

16 Example
Let r = (x + y)/2 be evaluated naively as r̂ = RN(RN(x + y)/2).
High relative accuracy is ensured:
r̂ = (RN(x + y)/2)(1 + δ₂),  |δ₂| ≤ u₁
  = ((x + y)/2)(1 + δ₁)(1 + δ₂),  |δ₁|, |δ₂| ≤ u₁
  =: r(1 + ε),  |ε| ≤ 2u.
We'd also like to have min(x, y) ≤ r̂ ≤ max(x, y)...

17 Example
Not always true: already for β = 10 and p = 3 the computed midpoint can land strictly outside [min(x, y), max(x, y)] (a concrete decimal instance is checked in the sketch below).
True if β = 2 or sign(x) ≠ sign(y).
Proof for base two (with x ≤ y): r̂ := RN(RN(x + y)/2) = RN((x + y)/2), since halving is exact.
x ≤ (x + y)/2 ≤ y ⇒ RN(x) ≤ RN((x + y)/2) ≤ RN(y) ⇒ x ≤ r̂ ≤ y.
Repair the other cases using r = x + (y − x)/2. [Sterbenz'74, Boldo'15]
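A concrete instance of the decimal failure, emulated with Python's decimal module set to a 3-digit, round-to-nearest format (the particular values 5.03 and 5.04 are my own illustration, not necessarily the ones used on the slide):

    from decimal import Decimal, getcontext, ROUND_HALF_EVEN

    getcontext().prec = 3                       # beta = 10, p = 3
    getcontext().rounding = ROUND_HALF_EVEN     # round to nearest, ties to even

    x, y = Decimal("5.03"), Decimal("5.04")
    s = x + y                                   # RN(10.07) = 10.1
    r = s / 2                                   # RN(10.1 / 2) = 5.05
    print(r, r > max(x, y))                     # 5.05 True: the computed midpoint escapes [x, y]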

18 Other typical floating-point surprises
β = 2, any precision p ⇒ 1/10 ∉ F.
Loss of algebraic properties: commutativity of + and × is preserved by correct rounding, but associativity and distributivity are lost: in general, RN(RN(x + y) + z) ≠ RN(x + RN(y + z)).
Catastrophic cancellation: 2 floating-point operations are enough to produce a result with relative error 1.
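Both surprises are one-liners in binary64 (a sketch):

    from fractions import Fraction

    print(Fraction(0.1) == Fraction(1, 10))            # False: 1/10 is not a binary float
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))      # False: rounded addition is not associative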

19 Catastrophic cancellation
For example, if x = 1, y = u/β, and z = −1, then r̂ := RN(RN(x + y) + z) = 0,
and, since r := x + y + z = u/β is nonzero, we obtain |r̂ − r| / |r| = 1.
Possible workarounds:
- sorting the input (if possible);
- rewriting: a² − b² = (a + b)(a − b);
- compensation: compute the rounding errors, and use them later in the algorithm in order to compensate for their effect. [Kahan, Rump, ...]
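The same two-operation example, run in binary64 (β = 2, u = 2⁻⁵³); a minimal sketch:

    u = 2.0 ** -53
    x, y, z = 1.0, u / 2, -1.0              # y = u / beta

    r_hat = (x + y) + z                     # RN(RN(1 + u/2) - 1) = RN(1 - 1) = 0
    r = y                                   # exact value of x + y + z
    print(r_hat, abs(r_hat - r) / abs(r))   # 0.0 1.0: all relative accuracy is lost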

20 Conditions for exact subtraction
Sterbenz' lemma [Sterbenz'74]: x, y ∈ F, y/2 ≤ x ≤ 2y ⇒ x − y ∈ F.
Valid for any base β.
Applications: Cody and Waite's range reduction, Kahan's accurate algorithms (discriminants, triangle area), ...
Proof [Hauser'96]: assume 0 < y ≤ x ≤ 2y. Then ulp(y) ≤ ulp(x), so x, y ∈ β^e·ℤ with β^e = ulp(y).
Hence (x − y)/β^e is an integer such that 0 ≤ (x − y)/β^e ≤ y/β^e = y/ulp(y) < β^p.
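A randomized check of the lemma in binary64, with exactness verified in rational arithmetic (a sketch):

    import random
    from fractions import Fraction

    for _ in range(10_000):
        y = random.uniform(1.0, 2.0) * 2.0 ** random.randint(-30, 30)
        x = y * random.uniform(0.5, 2.0)          # a float in [y/2, 2y], since RN is monotone
        assert Fraction(x - y) == Fraction(x) - Fraction(y)   # x - y is computed exactly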

21 Representable error terms
Addition and multiplication: x, y ∈ F, op ∈ {+, ×} ⇒ x op y − RN(x op y) ∈ F.
Division and square root: x − y·RN(x/y) ∈ F,  x − RN(√x)² ∈ F.
Noted quite early. [Dekker'71, Pichat'76, Bohlender et al.'91]
RN required only for ADD and SQRT. [Boldo & Daumas'03]
FMA: its error is the sum of two floats. [Boldo & Muller'11]

22 Error-free transformations (EFT)
Floating-point algorithms for computing such error terms exactly:
- x + y − RN(x + y): in 6 additions [Møller'65, Knuth], and not less [Kornerup, Lefèvre, Louvet, Muller'12];
- xy − RN(xy): can be obtained in 17 additions and multiplications [Dekker'71, Boldo'06],
  and in only 2 ops if an FMA is available: ẑ := RN(xy) ⇒ xy − ẑ = FMA(x, y, −ẑ).
Similar FMA-based EFTs exist for DIV, SQRT... and FMA itself.
EFTs are key for extended precision algorithms: error compensation [Kahan'65, ..., Higham'96, Ogita, Rump, Oishi'04+, Graillat, Langlois, Louvet'05+, ...], floating-point expansions [Priest'91, Shewchuk'97, Joldes, Muller, Popescu'14+].
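The 6-addition EFT for the sum (Knuth's TwoSum), with its exactness checked in rational arithmetic (a sketch):

    from fractions import Fraction

    def two_sum(x, y):
        """Return (s, e) with s = RN(x + y) and s + e = x + y exactly (6 additions)."""
        s = x + y
        xp = s - y
        yp = s - xp
        e = (x - xp) + (y - yp)
        return s, e

    x, y = 1.0, 2.0 ** -60
    s, e = two_sum(x, y)
    print(s, e)                                                    # 1.0 8.673617379884035e-19
    print(Fraction(s) + Fraction(e) == Fraction(x) + Fraction(y))  # True: the error term is exact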

23 Optimal relative error bounds
When t can be any real number, E₁(t) ≤ u/(1+u) and E₂(t) ≤ u are best possible:
for t := 1 + u, RN(t) is 1 or 1 + 2u, so |t − RN(t)| = u.
Hence E₁(t) = u/(1 + u) and, if rounding ties to even, RN(t) = 1 and thus E₂(t) = u.
These are examples of optimal bounds:
- valid for all (t, RN) with t of a certain type;
- attained for some (t, RN) with t parametrized by β and p.
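This worst case is easy to reproduce in binary64 (ties to even), using exact rationals (a sketch):

    from fractions import Fraction

    u = Fraction(1, 2**53)
    t = 1 + u                              # halfway between the floats 1 and 1 + 2u
    rn = Fraction(float(t))                # ties-to-even picks 1.0

    print(abs(rn - t) / t == u / (1 + u))  # True: E1(t) attains u/(1+u)
    print(abs(rn - t) / rn == u)           # True: E2(t) attains u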

24 Can we do better when t = x op y and x, y ∈ F?
This depends on op and, sometimes, on β and p. [J. & Rump'14]

t      | optimal bound on E₁(t)                   | optimal bound on E₂(t)
x ± y  | u/(1+u)                                  | u
xy     | u/(1+u) (*)                              | u (*)
x/y    | u/(1+u) if β > 2,  u − 2u² if β = 2      | u if β > 2,  (u − 2u²)/(1 + u − 2u²) if β = 2
√x     | 1 − 1/√(1+2u)                            | √(1+2u) − 1

(*) if β > 2 or 2^p + 1 is not a Fermat prime.
⇒ two standard models for each arithmetic operation.
Application: sharper bounds and/or much simpler proofs.

25 Outline: Context · Floating-point arithmetic · A priori analysis · Conclusion

26 Classical approach: Wilkinson's analysis
This is the most common way to guarantee a priori that the computed solution x̂ has some kind of numerical quality:
- the forward error ‖x̂ − x‖ is 'small',
- the backward error, i.e. a perturbation ΔA such that (A + ΔA)x̂ = b, is 'small'.
Developed by Wilkinson in the 1950s and 1960s.
Relies almost exclusively on the first standard model: r̂ = (x op y)(1 + δ), |δ| ≤ u.
Eminently powerful: see Higham's book Accuracy and Stability of Numerical Algorithms (SIAM).

27 Example of analysis: a² − b²
Applying the standard model to each operation gives:
r̂ = (RN(a²) − RN(b²))(1 + δ₃) = (a²(1 + δ₁) − b²(1 + δ₂))(1 + δ₃),  |δᵢ| ≤ u₁.
⇒ r̂ − r = a²(δ₁ + δ₃ + δ₁δ₃) − b²(δ₂ + δ₃ + δ₂δ₃)
⇒ |r̂ − r| ≤ 2u (a² + b²)
⇒ |r̂ − r| / |r| ≤ 2u·C,  with condition number C := (a² + b²)/|a² − b²|.
Bound easy to derive and to interpret:
- if C = O(1) then the relative error is in O(u): highly accurate!
- if C ≳ 1/u then the relative error is only bounded by ≈ 1: catastrophic cancellation may occur.

28 Example of analysis: (a + b)(a − b)
r̂ := RN(RN(a + b) · RN(a − b)) = (a + b)(a − b)(1 + δ₁)(1 + δ₂)(1 + δ₃),  |δᵢ| ≤ u₁.
⇒ |r̂ − r| / |r| ≤ (1 + u)³ − 1 ≈ 3u. Always highly accurate!
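A small binary64 experiment contrasting the two formulas when a ≈ b, with exact errors measured in rational arithmetic (a sketch):

    import random
    from fractions import Fraction

    b = 1.0
    a = 1.0 + random.uniform(1e-9, 2e-9)             # severe cancellation: C ~ 1e9
    r = Fraction(a) ** 2 - Fraction(b) ** 2           # exact reference value

    r1 = a * a - b * b                                # naive: error bounded by ~2uC
    r2 = (a + b) * (a - b)                            # rewritten: error bounded by ~3u, whatever C
    rel = lambda x: float(abs(Fraction(x) - r) / r)
    print(f"naive: {rel(r1):.1e}   rewritten: {rel(r2):.1e}")   # e.g. ~1e-08 versus ~1e-16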

29 Floating-point summation
Given x₁, ..., x_n ∈ F, evaluate their sum in any order.
Classical analysis [Wilkinson'60]: apply the standard model n − 1 times, and deduce that the computed value ŝ ∈ F satisfies
|ŝ − Σᵢ xᵢ| ≤ α Σᵢ |xᵢ|,  α = (1 + u)^(n−1) − 1.
Easy to derive, valid for any order, asymptotically optimal: error / error bound → 1 as u → 0.
But, even with u replaced by u/(1+u), α = (n − 1)u + O(u²), which hides a constant.
So α is classically bounded as α ≤ γ_(n−1), where γ_ℓ := ℓu/(1 − ℓu), ℓu < 1. [Higham'96]

30 A simpler, O(u²)-free bound
Theorem [Rump'12]: for recursive summation, one can take α = (n − 1)u.
To prove this (a numerical check is sketched below):
- don't use just the (refined) standard model
  |RN(x + y) − (x + y)| ≤ (u/(1+u)) |x + y|;   (1)
- but combine it with the lower-level property
  |RN(x + y) − (x + y)| ≤ |f − (x + y)| for all f ∈ F, hence ≤ min{|x|, |y|};   (2)
- conclude by induction on n, with a clever case distinction comparing |x_n| to u·Σ_{i<n} |x_i| and using either (1) or (2).
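A numerical check of the O(u²)-free bound for recursive summation in binary64, with the exact sum computed in rational arithmetic (a sketch):

    import random
    from fractions import Fraction

    u = Fraction(1, 2**53)
    xs = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

    s_hat = 0.0
    for x in xs:
        s_hat += x                                   # one rounding per addition

    exact = sum(Fraction(x) for x in xs)
    bound = (len(xs) - 1) * u * sum(abs(Fraction(x)) for x in xs)
    print(abs(Fraction(s_hat) - exact) <= bound)     # True: |s_hat - sum| <= (n-1) u sum|x_i|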

31 Wilkinson's bounds revisited

Problem                        | Classical α         | New α         | Ref.
summation                      | (n − 1)u + O(u²)    | (n − 1)u      | [1]
dot product, matrix product    | nu + O(u²)          | nu            | [1]
Euclidean norm                 | (n/2 + 1)u + O(u²)  | (n/2 + 1)u    | [2]
Tx = b, A = LU                 | nu + O(u²)          | nu            | [2]
A = RᵀR                        | (n + 1)u + O(u²)    | (n + 1)u      | [2]
xⁿ (recursive, β = 2)          | (n − 1)u + O(u²)    | (n − 1)u (*)  | [3]
product x₁x₂···x_n             | (n − 1)u + O(u²)    | (n − 1)u (*)  | [4]
polynomial evaluation (Horner) | 2nu + O(u²)         | 2nu (*)       | [4]

(*) if n < c·u^(−1/2).
[1]: with Rump'13;  [2]: with Rump'14;  [3]: Graillat, Lefèvre, Muller'14;  [4]: with Bünger and Rump'14.

32 Kahan's algorithm for ad − bc
Kahan's algorithm uses the FMA to evaluate det [a b; c d] = ad − bc:
ŵ := RN(bc);  e := RN(ŵ − bc);  f̂ := RN(ad − ŵ);  r̂ := RN(f̂ + e).
The operation ad − bc is not in IEEE 754, but very common: complex arithmetic, discriminant of a quadratic equation, robust orientation predicates using tests like 'ad − bc > ε?'.
If evaluated naively, ad − bc can lead to highly inaccurate results: |r̂ − r| / |r| can be of the order of u⁻¹ ≫ 1.
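A sketch of the algorithm in Python, assuming Python 3.13+ where math.fma is available (on older versions one would call C's fma instead); the inputs below are my own illustration of a case where the naive formula loses most of its accuracy:

    import math
    from fractions import Fraction

    def kahan_det(a, b, c, d):
        w = b * c                     # w = RN(bc)
        e = math.fma(-b, c, w)        # e = RN(w - bc) = w - bc exactly (FMA-based product EFT)
        f = math.fma(a, d, -w)        # f = RN(ad - w)
        return f + e                  # r = RN(f + e)

    a = d = 1.0 + 2.0 ** -27
    b, c = 1.0 + 2.0 ** -26 + 2.0 ** -50, 1.0
    exact = Fraction(a) * Fraction(d) - Fraction(b) * Fraction(c)

    rel = lambda x: float(abs(Fraction(x) - exact) / abs(exact))
    print(rel(a * d - b * c), rel(kahan_det(a, b, c, d)))   # ~6.7e-02 versus 0.0 here (always <= 2u)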

33 Kahan's algorithm for ad − bc
Analysis in the standard model [Higham'96]:
|r̂ − r| / |r| ≤ 2u (1 + u|bc| / (2|r|)),
⇒ high relative accuracy as long as u|bc| ≲ 2|r|.
When u|bc| ≳ 2|r|, the error bound can be > 1 and does not even allow us to conclude that sign(r̂) = sign(r).
In fact, Kahan's algorithm is always highly accurate:
- the standard model alone fails to predict this;
- misinterpreting bounds ⇒ dismissing good algorithms.

34 Further analysis [J., Louvet, Muller'13]
The key is an ulp-analysis of the error terms ε₁ and ε₂ defined by:
ŵ := RN(bc);  e := RN(ŵ − bc);  f̂ := RN(ad − ŵ) = ad − ŵ + ε₁;  r̂ := RN(f̂ + e) = f̂ + e + ε₂.
Since e is exactly ŵ − bc, we have r̂ − r = ε₁ + ε₂.
Furthermore, we can prove that |εᵢ| ≤ (β/2) ulp(r) for i = 1, 2.
Proposition: |r̂ − r| ≤ β ulp(r) ≤ 2βu |r|.
These bounds already mean that Kahan's algorithm is always highly accurate.

35 Further analysis [J., Louvet, Muller'13]
We can do better via a case analysis comparing |ε₂| to (1/2) ulp(r):
Theorem: the relative error satisfies |r̂ − r| / |r| ≤ 2u, and the leading constant 2 is best possible.

36 Certificate of optimality
This is an explicit input set, parametrized by β and p, such that error / error bound → 1 as u → 0.
Example, for Kahan's algorithm for r = ad − bc: explicit values of a, b, c, d (each a power of β plus at most a few low-order digits) give a relative error |r̂ − r| / |r| whose ratio to the bound 2u equals 1/(1 + 2u) = 1 − 2u + O(u²).
Optimality is asymptotic, but often OK in practice: for β = 2 and p = 11, such an example already has relative error very close to 2u.
The certificate consists of sparse, symbolic floating-point data, which we can handle automatically. [J., Louvet, Muller, Plet]

37 Outline: Context · Floating-point arithmetic · A priori analysis · Conclusion

38 Summary
Floating-point arithmetic is specified rigorously by IEEE 754, highly structured, and much richer than the standard model.
Exploiting this structure leads to enhanced a priori analysis:
- optimal standard models for the basic arithmetic operations,
- simpler and sharper Wilkinson-like bounds,
- proofs of nice behavior of some numerical kernels.
On-going research:
- consider directed roundings as well;
- take underflow and overflow into account.
