The Classical Relative Error Bounds for Computing √(a² + b²) and c/√(a² + b²) in Binary Floating-Point Arithmetic are Asymptotically Optimal


-1- The Classical Relative Error Bounds for Computing √(a² + b²) and c/√(a² + b²) in Binary Floating-Point Arithmetic are Asymptotically Optimal. Claude-Pierre Jeannerod, Jean-Michel Muller, Antoine Plet (Inria, CNRS, ENS Lyon, Université de Lyon, Lyon, France). ARITH 24, London, July 2017.

√(a² + b²) and c/√(a² + b²) are basic building blocks of numerical computing: computation of 2D-norms, Givens rotations, etc. Setting: radix-2, precision-p FP arithmetic, round-to-nearest, unbounded exponent range. Classical analyses: relative error bounded by 2u for √(a² + b²), and by 3u + O(u²) for c/√(a² + b²), where u = 2^(−p) is the unit roundoff. Main results: the O(u²) term is not needed; these error bounds are asymptotically optimal; the bounds and their asymptotic optimality remain valid when an FMA is used to evaluate a² + b².

-3- Introduction and notation. A radix-2, precision-p FP number of exponent e and integral significand M ≤ 2^p − 1 is written x = M · 2^(e−p+1). RN(t) is t rounded to nearest, ties-to-even (so RN(a²) is the result of the FP multiplication a*a, assuming the round-to-nearest mode); RD(t) is t rounded towards −∞; u = 2^(−p) is the unit roundoff. We have RN(t) = t(1 + ε) with |ε| ≤ u/(1 + u) < u.
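
As a concrete illustration of this notation (not from the talk), the short Python sketch below extracts the integral significand M and the exponent e of a positive binary64 number, for which p = 53; the helper name decompose is mine.

```python
import math

def decompose(x: float, p: int = 53):
    """Write x > 0 as M * 2**(e - p + 1), with 2**(p-1) <= M <= 2**p - 1 (binary64: p = 53)."""
    m, exp = math.frexp(x)             # x = m * 2**exp, with 0.5 <= m < 1
    return int(m * 2**p), exp - 1      # (integral significand M, exponent e)

print(decompose(1.0))   # (4503599627370496, 0), since 1 = 2**52 * 2**(0 - 53 + 1)
print(decompose(0.1))   # (7205759403792794, -4): the binary64 number nearest 1/10
```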

-4- Relative error due to rounding (Knuth). If 2^e ≤ t < 2^(e+1), then |t − RN(t)| ≤ 2^(e−p) = u · 2^e, and if t ≥ 2^e (1 + u), then |t − RN(t)|/t ≤ u/(1 + u); if t = 2^e (1 + τu) with τ ∈ [0, 1), then |t − RN(t)|/t = τu/(1 + τu) < u/(1 + u). Hence the maximum relative error due to rounding is bounded by u/(1 + u); this bound is attained, so no further general improvement is possible. (Figure: a real t with 2^e ≤ t < 2^(e+1) and its rounding t̂ = RN(t), at distance |t − t̂| ≤ 2^(e−p) = (1/2) ulp(t).)
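
A short check, mine rather than the speakers', that this worst case really is attained in binary64 (p = 53): t = 1 + u lies exactly halfway between two consecutive floating-point numbers, and rounding it to nearest (ties to even) loses exactly u/(1 + u) in relative terms.

```python
from fractions import Fraction

u = Fraction(1, 2**53)                    # unit roundoff of binary64
t = 1 + u                                 # exact rational value, halfway between 1 and 1 + 2u
rn_t = Fraction(t.numerator / t.denominator)   # int/int division rounds to nearest, ties to even
print(abs(rn_t - t) / t == u / (1 + u))   # True: the bound u/(1+u) is attained at t = 1 + u
```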

Wobbling relative error. For t ≠ 0, define (Rump's ufp function) ufp(t) = 2^⌊log2 |t|⌋. We then have the following. Lemma 1: Let t ∈ ℝ. If w · 2^e ≤ t < 2^(e+1), where e = log2(ufp(t)) ∈ ℤ   (1)   (in other words, if 1 ≤ w ≤ t/ufp(t)), then |RN(t) − t| / t ≤ u/w.

Figure 1: If we know that w ≤ t/ufp(t) = t/2^e, then |RN(t) − t|/t ≤ u/w. (The figure shows the binade [2^e, 2^(e+1)): for y just above 2^e, the relative rounding error |y − ŷ|/y can be as large as u/(1 + u), the largest possible value, whereas for z close to 2^(e+1) it is at most about u/2.) The bound on the relative error of rounding t is therefore largest when t is just above a power of 2.

Figure 2: Relative error |RN(t) − t|/t due to rounding, for 1 ≤ t ≤ 8 and p = 4. (The plot oscillates between 0 and roughly 0.06 ≈ u/(1 + u), peaking just above each power of 2.)
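
The figure is easy to reproduce. The sketch below, my own illustration rather than code from the talk, simulates round-to-nearest at the toy precision p = 4 with exact rational arithmetic and samples the relative rounding error on [1, 8]; the rounding helper rn(t, p) reappears in a later sketch.

```python
from fractions import Fraction

P = 4                                    # the toy precision of Figure 2
U = Fraction(1, 2**P)                    # u = 2**-p

def rn(t: Fraction, p: int = P) -> Fraction:
    """Round t > 0 to the nearest precision-p binary floating-point value (ties to even)."""
    e = t.numerator.bit_length() - t.denominator.bit_length()
    if Fraction(2)**e > t:
        e -= 1                           # now 2**e <= t < 2**(e+1)
    ulp = Fraction(2)**(e - p + 1)
    m = (t / ulp) // 1                   # integral significand (an int)
    frac = t / ulp - m                   # discarded fraction of an ulp, in [0, 1)
    if frac > Fraction(1, 2) or (frac == Fraction(1, 2) and m % 2 == 1):
        m += 1
    return m * ulp

# Sample the relative error |RN(t) - t| / t on [1, 8]: it oscillates and peaks,
# close to u/(1+u), just above t = 1, 2 and 4.
for i in range(64):
    t = 1 + Fraction(i, 9)
    print(float(t), float(abs(rn(t) - t) / t))
```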

On the quality of error bounds. When giving, for some algorithm, a relative error bound that is a function B(p) of the precision p (or, equivalently, of u = 2^(−p)): if there exist FP inputs, parameterized by p, for which the bound is attained for every p ≥ p0, the bound is optimal; if there exist FP inputs, parameterized by p, for which the relative error E(p) satisfies E(p)/B(p) → 1 as p → ∞ (or, equivalently, as u → 0), the bound is asymptotically optimal. If a bound is asymptotically optimal, there is no need to try to obtain a substantially better bound.

-9- Computation of √(a² + b²).
Algorithm 1 (without FMA): s_a ← RN(a²); s_b ← RN(b²); s ← RN(s_a + s_b); ρ ← RN(√s); return ρ.
Algorithm 2 (with FMA): s_b ← RN(b²); s ← RN(a² + s_b); ρ ← RN(√s); return ρ.
Classical result: the relative error of both algorithms is at most 2u + O(u²). Jeannerod and Rump (2016): the relative error of Algorithm 1 is at most 2u. These bounds are tight: in binary64 arithmetic, with a = 1723452922282957/2^64 and b = 4503599674823629/2^52, both algorithms have relative error 1.99999993022... u.
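
The following Python sketch, which is mine and not part of the talk, reproduces this binary64 experiment. Algorithm 2's fused operation RN(a·a + s_b) is emulated by forming the sum exactly with rationals and rounding once, and the reference value √(a² + b²) is computed with the decimal module at 60 digits.

```python
import math
from fractions import Fraction
from decimal import Decimal, getcontext

getcontext().prec = 60                    # ample precision for the reference value
u = 2.0**-53

def to_nearest_binary64(q: Fraction) -> float:
    """Round an exact rational to the nearest binary64 (int/int division is correctly rounded)."""
    return q.numerator / q.denominator

def alg1(a: float, b: float) -> float:
    """Algorithm 1 (no FMA): each binary64 operation below is a correctly rounded RN(.)."""
    sa = a * a
    sb = b * b
    s = sa + sb
    return math.sqrt(s)

def alg2(a: float, b: float) -> float:
    """Algorithm 2 (with FMA): RN(a*a + sb) is emulated with exact rational arithmetic."""
    sb = b * b
    s = to_nearest_binary64(Fraction(a) * Fraction(a) + Fraction(sb))
    return math.sqrt(s)

def exact_norm(a: float, b: float) -> Decimal:
    q = Fraction(a)**2 + Fraction(b)**2
    return (Decimal(q.numerator) / Decimal(q.denominator)).sqrt()

a = 1723452922282957 / 2**64              # the binary64 pair quoted on the slide
b = 4503599674823629 / 2**52
ref = exact_norm(a, b)
for name, alg in (("Algorithm 1", alg1), ("Algorithm 2", alg2)):
    rel = abs(Decimal(alg(a, b)) - ref) / ref
    print(name, "relative error =", float(rel) / u, "u")
```

Both lines print a relative error of about 1.99999993 u, matching the value quoted above.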

Comparing both algorithms? Both algorithms are rather equivalent in terms of worst-case error. For 1,000,000 randomly chosen pairs (a, b) of binary64 numbers with the same exponent: same result in 90.08% of cases; Algorithm 2 (FMA) is more accurate in 6.26% of cases; Algorithm 1 is more accurate in 3.65% of cases. For 100,000 randomly chosen pairs (a, b) of binary64 numbers with exponents satisfying e_a − e_b = 26: same result in 83.90% of cases; Algorithm 2 (FMA) is more accurate in 13.79% of cases; Algorithm 1 is more accurate in 2.32% of cases. Algorithm 2 wins, but not by a big margin.
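
The comparison can be re-run approximately with the sketch below; it reuses alg1, alg2 and exact_norm from the previous block. The way random binary64 numbers with a prescribed exponent are drawn here (uniform random significand bits) is my assumption, since the talk does not specify the sampling, so the percentages will only roughly match.

```python
import random
import struct
from decimal import Decimal

def random_binary64(exponent: int) -> float:
    """A positive binary64 value with the given exponent and uniformly random significand bits."""
    bits = ((exponent + 1023) << 52) | random.getrandbits(52)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

random.seed(1)
tie = win1 = win2 = 0
trials = 20_000                            # fewer trials than the talk's 1,000,000
for _ in range(trials):
    a, b = random_binary64(0), random_binary64(0)     # same exponent
    ref = exact_norm(a, b)
    e1 = abs(Decimal(alg1(a, b)) - ref)
    e2 = abs(Decimal(alg2(a, b)) - ref)
    tie += e1 == e2
    win1 += e1 < e2
    win2 += e2 < e1
print(tie / trials, win1 / trials, win2 / trials)
# expected: roughly 0.90 ties, ~0.04 where Algorithm 1 is better, ~0.06 where Algorithm 2 is better
```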

-11- Our main result for √(a² + b²). Theorem 2: For p ≥ 12, there exist floating-point inputs a and b for which the result ρ of Algorithm 1 or Algorithm 2 satisfies |ρ − √(a² + b²)| / √(a² + b²) = 2u − ε, with ε = O(u^(3/2)). Consequence: the relative error bounds are asymptotically optimal.

-12- Building the generic input values a and b (generic: they are given as a function of p).
1. We restrict to a and b such that 0 < a < b.
2. We pick b such that the largest possible absolute error, that is, (1/2) ulp(b²), is committed when computing b². To maximize the relative error, b² must be slightly above an even power of 2.
3. We pick a small enough that the computed approximation to a² + b² is slightly above the same power of 2.
We choose b = 1 + 2^(−p/2) if p is even, and b = 1 + ⌊2^((p−3)/2) · √2⌋ · 2^(−p+1) if p is odd. Example (p even): b = 1 + 2^(−p/2) gives b² = 1 + 2^(−p/2+1) + 2^(−p), so RN(b²) = 1 + 2^(−p/2+1).

-13- Building the generic input values a and b (continued).
4. In Algorithm 1, when computing s_a + s_b, the significand of s_a is right-shifted by a number of positions equal to the difference of their exponents. This gives the form s_a should have to produce a large relative error.
5. We choose a equal to the square root of that value, adequately rounded.
Figure 3: Constructing suitable generic inputs to Algorithms 1 and 2. (The figure aligns the significands of s_b and s_a for the addition: we would like the retained part of s_a to maximize the error of the computation of s, and the part that is shifted out to be of the form 0111... or 1000..., again so as to maximize the error of the computation of s.)

-14- Generic values for √(a² + b²), for even p: b = 1 + 2^(−p/2), and a = RD(2^(−3p/4) · √G),   (2)   where G = 2^(p+2) · (2^(−1) + δ · 2^(−p/2+1) + 2^(−p/2)), with δ = 1 if ⌈2^(p/2) · √2⌉ is odd and δ = 2 otherwise.

-15- Table 1: Relative errors of Algorithm 1 or Algorithm 2 for our generic values a and b, for various even values of p between 16 and 56.
p = 16: 1.97519352187392... u
p = 20: 1.99418559548869... u
p = 24: 1.99873332158282... u
p = 28: 1.99967582969338... u
p = 32: 1.99990783760560... u
p = 36: 1.99997442258505... u
p = 40: 1.99999449547633... u
p = 44: 1.99999835799502... u
p = 48: 1.99999967444005... u
p = 52: 1.99999989989669... u
p = 56: 1.99999997847972... u
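
Values of this kind can be checked mechanically once candidate inputs a and b are known. The self-contained harness below is my sketch (the generic formulas themselves are in the paper and are not reproduced here); it runs Algorithm 1 at an arbitrary precision p using exact rational arithmetic, with the same rounding helper rn as in the Figure 2 sketch, and reports the relative error in units of u. As a sanity check it is applied to the binary64 pair quoted earlier, at p = 53.

```python
import math
from fractions import Fraction

def rn(t: Fraction, p: int) -> Fraction:
    """Round t > 0 to the nearest precision-p binary floating-point value (ties to even)."""
    e = t.numerator.bit_length() - t.denominator.bit_length()
    if Fraction(2)**e > t:
        e -= 1
    ulp = Fraction(2)**(e - p + 1)
    m = (t / ulp) // 1
    frac = t / ulp - m
    if frac > Fraction(1, 2) or (frac == Fraction(1, 2) and m % 2 == 1):
        m += 1
    return m * ulp

def rn_sqrt(t: Fraction, p: int) -> Fraction:
    """RN(sqrt(t)) in precision p: integer square root with ~4p guard bits, then one rounding."""
    k = 4 * p + 8
    scaled = t * Fraction(4)**k
    r = Fraction(math.isqrt(scaled.numerator // scaled.denominator), 2**k)
    return rn(r, p)

def alg1_error_in_u(a: Fraction, b: Fraction, p: int) -> float:
    """Relative error of Algorithm 1 at precision p, in units of u = 2**-p."""
    s = rn(rn(a * a, p) + rn(b * b, p), p)
    rho = rn_sqrt(s, p)
    d = rho * rho / (a * a + b * b) - 1    # (rho / sqrt(a**2 + b**2))**2 - 1
    eps = abs(d / 2 - d * d / 8)           # |sqrt(1 + d) - 1|, up to O(d**3)
    return float(eps * 2**p)

# Sanity check with the binary64 pair quoted earlier (p = 53): prints ~1.99999993
a = Fraction(1723452922282957, 2**64)
b = Fraction(4503599674823629, 2**52)
print(alg1_error_in_u(a, b, 53))
```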

Generic values for √(a² + b²), for odd p. We choose b = 1 + η, with η = ⌊2^((p−3)/2) · √2⌋ · 2^(−p+1), and a = RN(√H), with H = 2^(−(p+3)/2) + 2η − 3 · 2^(−p) + 2^(−(3p+3)/2).

-17- Table 2: Relative errors of Algorithm 1 or Algorithm 2 for our generic values a and b, for various odd values of p between 53 and 113.
p = 53: 1.9999999188175005308... u
p = 57: 1.9999999764537355319... u
p = 61: 1.9999999949811629228... u
p = 65: 1.9999999988096732861... u
p = 69: 1.9999999997055095283... u
p = 73: 1.9999999999181918151... u
p = 77: 1.9999999999800815518... u
p = 81: 1.9999999999954499727... u
p = 101: 1.9999999999999949423... u
p = 105: 1.9999999999999986669... u
p = 109: 1.9999999999999996677... u
p = 113: 1.9999999999999999175... u

The case of c/√(a² + b²).
Algorithm 3 (without FMA): s_a ← RN(a²); s_b ← RN(b²); s ← RN(s_a + s_b); ρ ← RN(√s); g ← RN(c/ρ); return g.
Algorithm 4 (with FMA): s_b ← RN(b²); s ← RN(a² + s_b); ρ ← RN(√s); g ← RN(c/ρ); return g.
Straightforward error analysis: relative error at most 3u + O(u²). Theorem 3: If p ≥ 3, then the relative error committed when approximating c/√(a² + b²) by the result g of Algorithm 3 or 4 is less than 3u.
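
A quick empirical illustration of Theorem 3 at the binary64 level; this is a sketch of mine, and the input range [0.5, 2] is an arbitrary choice.

```python
import math
import random
from decimal import Decimal, getcontext
from fractions import Fraction

getcontext().prec = 60
u = 2.0**-53

def alg3(a: float, b: float, c: float) -> float:
    """Algorithm 3: g = RN(c / RN(sqrt(RN(RN(a*a) + RN(b*b)))))."""
    return c / math.sqrt(a * a + b * b)

def exact_quotient(a: float, b: float, c: float) -> Decimal:
    q = Fraction(a)**2 + Fraction(b)**2
    return Decimal(c) / (Decimal(q.numerator) / Decimal(q.denominator)).sqrt()

random.seed(0)
worst = 0.0
for _ in range(10_000):
    a, b, c = (random.uniform(0.5, 2.0) for _ in range(3))
    ref = exact_quotient(a, b, c)
    rel = float(abs(Decimal(alg3(a, b, c)) - ref) / ref)
    worst = max(worst, rel / u)
print("largest observed relative error:", worst, "u  (Theorem 3 bound: 3u)")
```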

-19- Sketch of the proof. Previous result on the computation of squares: if p ≥ 3, then s_a = a²(1 + ε1) and s_b = b²(1 + ε2) with |ε1|, |ε2| ≤ u/(1 + 3u) =: u3. There exist ε3 and ε4 such that |ε3|, |ε4| ≤ u/(1 + u) =: u1, with s = (s_a + s_b)(1 + ε3) for Algorithm 3 and s = (a² + s_b)(1 + ε4) for Algorithm 4; in both cases, (a² + b²)(1 − u1)(1 − u3) ≤ s ≤ (a² + b²)(1 + u1)(1 + u3). The relative error of division in radix-2 FP arithmetic is at most u − 2u² (Jeannerod and Rump, 2016), hence g = (c / (√s (1 + ε5))) (1 + ε6) with |ε5| ≤ u1 and |ε6| ≤ u − 2u².

Sketch of the proof (continued). It follows that (c/√s) · (1 − u + 2u²)/(1 + u1) ≤ g ≤ (c/√s) · (1 + u − 2u²)/(1 − u1). Consequently, ζ_low · c/√(a² + b²) ≤ g ≤ ζ_high · c/√(a² + b²), with ζ_low := (1/√((1 + u1)(1 + u3))) · (1 − u + 2u²)/(1 + u1) and ζ_high := (1/√((1 − u1)(1 − u3))) · (1 + u − 2u²)/(1 − u1). To conclude, we check that 1 − 3u < ζ_low and ζ_high < 1 + 3u for all u ≤ 1/2.

Asymptotic optimality of the bound for c/√(a² + b²). Theorem 4: For p ≥ 12, there exist floating-point inputs a, b, and c for which the result g returned by Algorithm 3 or Algorithm 4 satisfies |g − c/√(a² + b²)| / (c/√(a² + b²)) = 3u − ε, with ε = O(u^(3/2)). The generic values of a and b used to prove Theorem 4 are the same as the ones we chose for √(a² + b²), and we use c = 1 + 2^(−p+1) + 3 · 2^(−p/2−1) for even p, and c = 1 + 3 · 2^(−(p+1)/2) + 2^(−p+1) for odd p.

-22- Table 3: Relative errors obtained, for various precisions p, when running Algorithm 3 or Algorithm 4 with our generic values a, b, and c.
p = 24: 2.998002589136762596763498... u
p = 53: 2.999999896465758351542169... u
p = 64: 2.999999997359196820010396... u
p = 113: 2.999999999999999896692295... u
p = 128: 2.999999999999999999566038... u

-23- Conclusion. We have recalled the relative error bound 2u for √(a² + b²), slightly improved the bound for c/√(a² + b²) (from 3u + O(u²) to 3u), and considered variants that take advantage of the possible availability of an FMA. These bounds are asymptotically optimal, so trying to significantly refine them further is hopeless. Unbounded exponent range: our results hold provided that no underflow or overflow occurs. Handling spurious overflows and underflows requires an exception handler and/or scaling of the input values: if the scaling introduces rounding errors, then our bounds may not hold anymore; if a and b (and c) are scaled by a power of 2, our analyses still apply.
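
For completeness, here is a rough sketch (mine, with the usual caveat that scaling a very small input can still push it into the subnormal range) of the power-of-two scaling mentioned above, wrapped around Algorithm 1:

```python
import math

def hypot_scaled(a: float, b: float) -> float:
    """Algorithm 1 preceded by an exact power-of-two scaling of the inputs,
    to avoid spurious overflow/underflow in a*a and b*b."""
    m = max(abs(a), abs(b))
    if m == 0.0:
        return 0.0
    e = math.frexp(m)[1]                              # m = f * 2**e, 0.5 <= f < 1
    sa, sb = math.ldexp(a, -e), math.ldexp(b, -e)     # exact as long as nothing goes subnormal
    return math.ldexp(math.sqrt(sa * sa + sb * sb), e)

print(hypot_scaled(3e200, 4e200))   # 5e+200, whereas 3e200 * 3e200 alone would overflow
```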