ERROR BOUNDS ON COMPLEX FLOATING-POINT MULTIPLICATION

Size: px

Start display at page:

Download "ERROR BOUNDS ON COMPLEX FLOATING-POINT MULTIPLICATION"

Ernest Thompson
6 years ago
Views:

1 MATHEMATICS OF COMPUTATION Volume 00, Number 0, Pages S (XX) ERROR BOUNDS ON COMPLEX FLOATING-POINT MULTIPLICATION RICHARD BRENT, COLIN PERCIVAL, AND PAUL ZIMMERMANN In memory of Erin Brent ( ) Abstract. Given floating-point arithmetic with t-digit base-β significands in which all arithmetic operations are performed as if calculated to infinite precision and rounded to a nearest representable value, we prove that the product of complex values z 0 and z 1 can be computed with maximum absolute error z 0 z 1 1 β1 t 5. In particular, this provides relative error bounds of 4 5 and 53 5 for IEEE 754 single and double precision arithmetic respectively, provided that overflow, underflow, and denormals do not occur. We also provide the numerical worst cases for IEEE 754 single and double precision arithmetic. 1. Introduction In an earlier paper [], the second author made the claim that the maximum relative error which can occur when computing the product z 0 z 1 of two complex values using floating-point arithmetic is ɛ 5, where ɛ is the maximum relative error which can result from rounded floating-point addition, subtraction, or multiplication. While reviewing that paper a few years later, the other two authors noted that the proof given was incorrect, although the result claimed was true. Since the bound of ɛ 8 which is commonly used [1] is suboptimal, we present here a corrected proof of the tighter bound. Interestingly, by explicitly finding worst-case inputs, we can demonstrate that our error bound is effectively optimal. Throughout this paper, we concern ourselves with floating-point arithmetic with t-digit base-β significands, denote by ulp(x) for x 0 the (unique) power of β such that β t 1 x /ulp(x) < β t, and write ɛ = 1 ulp(1) = 1 β1 t ; we also define ulp(0) = 0. The notations x y, x y, and x y represent rounded-to-nearest floating-point addition, subtraction, and multiplication of the values x and y.. An Error Bound Theorem 1. Let z 0 = a 0 + b 0 i and z 1 = a 1 + b 1 i, with a 0, b 0, a 1, b 1 floatingpoint values with t-digit base-β significands, and let z = ((a 0 a 1 ) (b 0 b 1 )) + ((a 0 b 1 ) (b 0 a 1 ))i be computed. Providing that no overflow or underflow occur, no denormal values are produced, arithmetic results are correctly rounded to a nearest representable value, z 0 z 1 0, and ɛ 5, the relative error z (z 0 z 1 ) 1 1 is less than ɛ 5 = 1 β1 t 5. 1 c 1997 American Mathematical Society

2 RICHARD BRENT, COLIN PERCIVAL, AND PAUL ZIMMERMANN Proof. Let a 0, b 0, a 1, and b 1 be chosen such that the relative error is maximized. By multiplying z 0 and z 1 by powers of i and/or taking complex conjugates, we can assume without loss of generality that (1) () 0 a 0, b 0, a 1, b 1 b 0 b 1 a 0 a 1 and given our assumptions that overflow, underflow, and denormals do not occur, and that rounding is performed to a nearest representable value, we can conclude that for any x occurring in the computation, the error introduced when rounding x is at most 1 ulp(x) and is strictly less than ɛ x. We note that the error I(z z 0 z 1 ) in the imaginary part of z is bounded as follows: I(z z 0 z 1 ) = ((a 0 b 1 ) (b 0 a 1 )) (a 0 b 1 + b 0 a 1 ) and consider two cases: a 0 b 1 a 0 b 1 + b 0 a 1 b 0 a 1 + ((a 0 b 1 ) (b 0 a 1 )) (a 0 b 1 + b 0 a 1 ) Case I1: ulp(a 0 b 1 + b 0 a 1 ) < ulp(a 0 b 1 + b 0 a 1 ) Using first the definition of ulp and second the assumption above, we must have a 0 b 1 + b 0 a 1 < β t ulp(a 0 b 1 + b 0 a 1 ) a 0 b 1 + b 0 a 1 and therefore (a0 b 1 + b 0 a 1 ) β t ulp(a 0 b 1 + b 0 a 1 ) < (a0 b 1 + b 0 a 1 ) (a 0 b 1 + b 0 a 1 ) a 0 b 1 a 0 b 1 + b 0 a 1 b 0 a 1 ɛ (a 0 b 1 + b 0 a 1 ). However, β t ulp(a 0 b 1 + b 0 a 1 ) is a representable floating-point value; so given our assumption that rounding is performed to a nearest representable value, we must now have ((a 0 b 1 ) (b 0 a 1 )) (a 0 b 1 + b 0 a 1 ) < ɛ (a 0 b 1 + b 0 a 1 ). Case I: ulp(a 0 b 1 + b 0 a 1 ) ulp(a 0 b 1 + b 0 a 1 ) From our assumption that the results of arithmetic operations are correctly rounded, we obtain ((a 0 b 1 ) (b 0 a 1 )) (a 0 b 1 + b 0 a 1 ) 1 ulp(a 0 b 1 + b 0 a 1 ) 1 ulp(a 0b 1 + b 0 a 1 ) ɛ (a 0 b 1 + b 0 a 1 ). Combining these two cases with the earlier-stated bound, we obtain I(z z 0 z 1 ) a 0 b 1 a 0 b 1 + b 0 a 1 b 0 a 1 + ((a 0 b 1 ) (b 0 a 1 )) (a 0 b 1 + b 0 a 1 ) < ɛ (a 0 b 1 ) + ɛ (b 0 a 1 ) + ɛ (a 0 b 1 + b 0 a 1 ) = ɛ (a 0 b 1 + b 0 a 1 ).

3 ERROR BOUNDS ON COMPLEX FLOATING-POINT MULTIPLICATION 3 Now that we have a bound on the imaginary part of the error, we turn our attention to the real part, and consider the following four cases (where the examples given apply to β = ): ulp(b 0 b 1 ) ulp(a 0 a 1 ) ulp(a 0 a 1 b 0 b 1 ) e.g., z 0 = z 1 = i ulp(b 0 b 1 ) < ulp(a 0 a 1 b 0 b 1 ) < ulp(a 0 a 1 ) e.g., z 0 = z 1 = i ulp(a 0 a 1 b 0 b 1 ) ulp(b 0 b 1 ) < ulp(a 0 a 1 ) e.g., z 0 = z 1 = i ulp(a 0 a 1 b 0 b 1 ) < ulp(b 0 b 1 ) = ulp(a 0 a 1 ) e.g., z 0 = z 1 = i. Since we have assumed that b 0 b 1 a 0 a 1, we know that ulp(b 0 b 1 ) ulp(a 0 a 1 ), and thus these four cases cover all possible inputs. Consequently, it suffices to prove the required bound for each of these four cases. Case R1: ulp(b 0 b 1 ) ulp(a 0 a 1 ) ulp(a 0 a 1 b 0 b 1 ) Note that the right inequality can only be strict if a 0 a 1 rounds up to a power of β and b 0 b 1 = 0. We observe that a 0 a 1 b 0 b 1 < a 0 a 1 b 0 b 1 + ɛ (a 0 a 1 + b 0 b 1 ) and bound the real part of the complex error as follows: R(z z 0 z 1 ) a 0 a 1 a 0 a 1 + b 0 b 1 b 0 b 1 + ((a 0 a 1 ) (b 0 b 1 )) (a 0 a 1 b 0 b 1 ) 1 ulp(a 0a 1 ) + 1 ulp(b 0b 1 ) + 1 ulp(a 0 a 1 b 0 b 1 ) 1 ulp(a 0 a 1 b 0 b 1 ) + 1 ulp(b 0b 1 ) + 1 ulp(a 0 a 1 b 0 b 1 ) < ɛ (a 0 a 1 b 0 b 1 ) + ɛ (b 0 b 1 ) < ɛ (a 0 a 1 b 0 b 1 ) + ɛ (a 0 a 1 + b 0 b 1 ). Applying the triangle inequality, we now observe that z z 0 z 1 = R(z z 0 z 1 ) + I(z z 0 z 1 ) as required. < ɛ (a 0 a 1 b 0 b 1 ) + (a 0 b 1 + b 0 a 1 ) + ɛ (a 0 a 1 + b 0 b 1 ) 3 ɛ 7 z 0z (a 0b 1 b 0 a 1 ) 1 7 (a 0a 1 5b 0 b 1 ) + ɛ z 0 z 1 ( ) ɛ 3/7 + ɛ z 0 z 1 < ɛ 5 z 0 z 1 Case R: ulp(b 0 b 1 ) < ulp(a 0 a 1 b 0 b 1 ) < ulp(a 0 a 1 ) Noting that ulp(x) < ulp(y) implies ulp(x) β 1 ulp(y) 1 ulp(y), we obtain R(z z 0 z 1 ) 1 ulp(a 0a 1 ) + 1 ulp(b 0b 1 ) + 1 ulp(a 0 a 1 b 0 b 1 ) 7 8 ulp(a 0a 1 ) ( ) 7 ɛ 4 a 0a 1

4 4 RICHARD BRENT, COLIN PERCIVAL, AND PAUL ZIMMERMANN and therefore z z 0 z 1 = R(z z 0 z 1 ) + I(z z 0 z 1 ) (7 ) < ɛ 4 a 0a 1 + (a 0 b 1 + b 0 a 1 ) 104 = ɛ 07 z 0z (a 0b 1 b 0 a 1 ) (79a 0a 1 18b 0 b 1 ) as required. ɛ 104/07 z 0 z 1 < ɛ 5 z 0 z 1 Case R3: ulp(a 0 a 1 b 0 b 1 ) ulp(b 0 b 1 ) < ulp(a 0 a 1 ) In this case, there is no rounding error introduced in computing the difference between a 0 a 1 and b 0 b 1 since ulp(a 0 a 1 b 0 b 1 ) ulp(b 0 b 1 ) ulp(b 0 b 1 ) and ulp(a 0 a 1 b 0 b 1 ) < ulp(a 0 a 1 ) ulp(a 0 a 1 ). Also, so we have ulp(b 0 b 1 ) 1 β ulp(a 0a 1 ) 1 ulp(a 0a 1 ) R(z z 0 z 1 ) 1 ulp(a 0a 1 ) + 1 ulp(b 0b 1 ) 3 4 ulp(a 0a 1 ) ( ) 3 ɛ a 0a 1 and consequently z z 0 z 1 = R(z z 0 z 1 ) + I(z z 0 z 1 ) (3 ) < ɛ a 0a 1 + (a 0 b 1 + b 0 a 1 ) 56 = ɛ 55 z 0z (a 0b 1 b 0 a 1 ) 1 0 (3a 0a 1 3b 0 b 1 ) as required. ɛ 56/55 z 0 z 1 < ɛ 5 z 0 z 1 Case R4: ulp(a 0 a 1 b 0 b 1 ) < ulp(b 0 b 1 ) = ulp(a 0 a 1 ) In this case, there is again no rounding error introduced in computing the difference between a 0 a 1 and b 0 b 1, so we obtain R(z z 0 z 1 ) a 0 a 1 a 0 a 1 + b 0 b 1 b 0 b 1 < ɛ (a 0 a 1 + b 0 b 1 )

5 ERROR BOUNDS ON COMPLEX FLOATING-POINT MULTIPLICATION 5 and consequently z z 0 z 1 = R(z z 0 z 1 ) + I(z z 0 z 1 ) < ɛ (a 0 a 1 + b 0 b 1 ) + (a 0 b 1 + b 0 a 1 ) = ɛ 5 z 0 z 1 (a 0 b 1 b 0 a 1 ) 4(a 0 a 1 b 0 b 1 ) ɛ 5 z 0 z 1 as required. 3. Worst-case multiplicands for β = Having proved an upper bound on the relative error which can result from floating-point rounding when computing the product of complex values, we now turn to a more number-theoretic problem: finding precise worst-case inputs for β =. Starting with the assumption that some inputs produce errors very close to the proven upper bound, we will repeatedly reduce the set of possible inputs until an exhaustive search becomes feasible. Theorem. Let β = and assume that z 0 = a 0 + b 0 i 0 and z 1 = a 1 + b 1 i 0, where a 0, b 0, a 1, b 1 are floating-point values with t-digit base-β significands, and z = ((a 0 a 1 ) (b 0 b 1 )) + ((a 0 b 1 ) (b 0 a 1 ))i are such that (1) () (3) (4) 0 a 0, b 0, a 1, b 1 b 0 b 1 a 0 a 1 b 0 a 1 a 0 b 1 1/ a 0 a 1 < 1 and no overflow, underflow, or denormal values occur during the computation of z. Assume further that the results of arithmetic operations are correctly rounded to a nearest representable value and that (5) z z 0 z 1 z 0 z 1 for some positive integer n. Then for some integers j xy, k xy satisfying > ɛ ( ) 5 nɛ > ɛ max 104/07, 3/7 + ɛ a 0 a 1 = 1/ + (j aa + 1/)ɛ + k aa ɛ a 0 b 1 = 1/ + (j ab + 1/)ɛ + k ab ɛ b 0 a 1 = 1/ + (j ba + 1/)ɛ + k ba ɛ b 0 b 1 = 1/ + (j bb + 1/)ɛ + k bb ɛ 0 j aa, j ab, j ba, j bb < n 4 k aa, k bb < n and a 0 b 0, a 1 b 1. k ab, k ba < n

6 6 RICHARD BRENT, COLIN PERCIVAL, AND PAUL ZIMMERMANN Proof. From equation (5), we note that ɛ nɛ < 11/07 < 4 ; we will use this trivial bound later without explicit comment. From the proof of Theorem 1, we know that Case R4 must hold, i.e., there is no error introduced in the computation of the difference between a 0 a 1 and b 0 b 1, and ulp(b 0 b 1 ) = ulp(a 0 a 1 ). From inequalities () and (4) above, this implies that 1/ b 0 b 1 a 0 a 1 < 1 R(z z 0 z 1 ) a 0 a 1 a 0 a 1 + b 0 b 1 b 0 b 1 ɛ. We can now obtain lower bounds on z 0 z 1 and z z 0 z 1, using the fact that (a 0 a 1 )(b 0 b 1 ) = (a 0 b 1 )(b 0 a 1 ): z 0 z 1 = ( a 0 + b ( 0) a 1 + b ) 1 = (a 0 a 1 ) + (a 0 b 1 ) + (b 0 a 1 ) + (b 0 b 1 ) (1/) + (a 0 b 1 ) + (1/)4 (a 0 b 1 ) + (1/) 1 z z 0 z 1 > z 0 z 1 ɛ (5 nɛ) ɛ (5 nɛ) as well as an upper bound on z 0 z 1 : We now note that z 0 z 1 104ɛ 07 z 0 z 1 < < z z 0 z 1 = R(z z 0 z 1 ) + I(z z 0 z 1 ) < ɛ + (ɛ (a 0 b 1 + b 0 a 1 )) ɛ + 4ɛ z 0 z 1 (a 0 b 1 ) z 0 z 1 (a 0 a 1 ) (b 0 b 1 ) = so b 0 a 1 a 0 b 1 109/196 < 1 and a 0 b 1 + b 0 a 1 109/49 (1 + ɛ) < ; this implies that ulp(b 0 a 1 ) ulp(a 0 b 1 ) ulp(1/) and ulp(a 0 b 1 + b 0 a 1 ) ulp(1), and therefore a 0 b 1 a 0 b 1 ɛ/ b 0 a 1 b 0 a 1 ɛ/ ((a 0 b 1 ) (b 0 a 1 )) (a 0 b 1 + b 0 a 1 ) ɛ I(z z 0 z 1 ) ɛ/ + ɛ/ + ɛ = ɛ which allows us to place upper bounds on z z 0 z 1 and z 0 z 1 : z z 0 z 1 = R(z z 0 z 1 ) + I(z z 0 z 1 ) (ɛ) + (ɛ) = 5ɛ z 0 z 1 < z z 0 z 1 ɛ (5 nɛ) 5 5 nɛ. Combining the known lower bound ɛ (5 nɛ) for z z 0 z 1 with the upper bounds on the error contributed by each individual rounding step, we find that ɛ/ (1 1 nɛ)ɛ < a 0 a 1 a 0 a 1 ɛ/

7 ERROR BOUNDS ON COMPLEX FLOATING-POINT MULTIPLICATION 7 ɛ/ (1 1 nɛ)ɛ < b 0 b 1 b 0 b 1 ɛ/ ɛ/ ( 4 nɛ)ɛ < a 0 b 1 a 0 b 1 ɛ/ ɛ/ ( 4 nɛ)ɛ < b 0 a 1 b 0 a 1 ɛ/ and similarly, by combining the upper bound on z 0 z 1 with lower bound of 1/ for each pairwise product, we obtain 5 1/ b 0 b 1 a 0 a 1 5 nɛ nɛ 4 = 0 4nɛ 5 1/ b 0 a 1 a 0 b 1 5 nɛ nɛ 4 = 0 4nɛ. Now consider the possible values for a 0 a 1 which satisfy these restrictions. Since it is the product of two values which are expressible using t digits of significand, a 0 a 1 can be exactly represented using t digits of significand; but since 1/ a 0 a 1 < 1, this implies that a 0 a 1 is an integer multiple of ɛ. There is therefore at least one pair of integers j aa, k aa with 0 j aa < ɛ 1 /, k aa ɛ 1 / for which a 0 a 1 = 1/ + (j aa + 1/)ɛ + k aa ɛ. Since a 0 a 1 is the closest multiple of ɛ to a 0 a 1, this implies that ɛ/ (1 1 nɛ)ɛ < a 0 a 1 a 0 a 1 = ɛ/ k aa ɛ i.e., k aa < n, and similarly k aa ɛ < 1 1 nɛ < 1 (1 nɛ) = nɛ 1/ + j aa ɛ a 0 a nɛ 0 4nɛ < 1/4 + nɛ/4 < 1/ + nɛ 4 i.e., 0 j aa < n/4. Applying the same argument to a 0 b 1, b 0 a 1, and b 0 b 1 allows us to infer that they possess the same structure, as required. To complete the proof, we note that the rounding errors from the products a 0 a 1 and b 0 b 1 must be in opposite directions (in order that they accumulate when subtracted), while the rounding errors from the products a 0 b 1 and b 0 a 1 must be in the same direction (in order that they accumulate when added); consequently, we must have a 0 b 0 and a 1 b 1. Corollary 1. Assume that the preconditions of Theorem are satisfied, and assume further that 1 (6) a 0 < 1 and n ɛ 1/. Then 1 < a 0, b 0, a 1, b 1 < 1. Proof. Assume that a 1 1. Then we can write a 0 = 1/ + Aɛ a 1 = 1 + Bɛ for some 0 A, B < (ɛ) 1. From Theorem, we have 1/ + (A + B)ɛ + ABɛ = a 0 a 1 = 1/ + (j aa + 1/)ɛ + k aa ɛ for some 0 j aa < n/4, k aa < n.

8 8 RICHARD BRENT, COLIN PERCIVAL, AND PAUL ZIMMERMANN As a result, we must have A + B n/4 1/ ɛ 1/, and since 0 A, B this implies 0 ABɛ ɛ/8. However, by reducing the equation above modulo ɛ, we find that ABɛ ɛ/ + k aa ɛ, which contradicts our bounds on ABɛ. Consequently, we can conclude that a 1 < 1. Now we note that a 0 a 1 > 1/ and a 0 < 1, so a 1 > 1/, and we have both of the bounds required for a 1. Applying the same argument to the other products provides the same bounds for a 0, b 0, and b 1. Corollary. Assume that the preconditions of Corollary 1 are satisfied, and assume further that n ɛ 1/ and ɛ 6. Then Proof. From Theorem, we obtain that j aa j ab j ba + j bb = 0, a 0 b 0 a 1 b 1 < 3nɛ. a 0 (a 1 b 1 ) = a 0 a 1 a 0 b 1 = (j aa j ab )ɛ + (k aa k ab )ɛ where j aa j ab < n 4, k aa k ab < 3n, and since a 0 > 1 (from Corollary 1), we can conclude that a 1 b 1 < n ɛ + 3nɛ. Since a 1 and b 1 are integer multiples of ɛ and 3nɛ < ɛ/, we conclude that a 1 b 1 n ɛ. Applying the same argument to the product a 1 (a 0 b 0 ) provides the same bound for a 0 b 0. We now note that (jaa j ab j ba + j bb )ɛ + (k aa k ab k ba + k bb )ɛ = a0 b 0 a 1 b 1 ( n ) ɛ < ɛ 4 from our assumed upper bound on n, and consequently we can conclude that j aa j ab j ba + j bb = 0. Finally, this allows us to write as required. a 0 b 0 a 1 b 1 = k aa k ab k ba + k bb ɛ < 3nɛ Corollary 3. Assume that the preconditions of Corollary 1 are satisfied, and assume further that n 1 4 ɛ 1/. Then (a 0 b 0 )(a 1 b 1 ) = (j aa j ab )(j aa j ba )ɛ (a 0 b 0 )(a 1 b 1 )k aa = (k aa k ab )(k aa k ba )ɛ Proof. For brevity and clarity, we will write (a 0 b 0 )(a 1 b 1 ) = xɛ and note that x is an integer between 3n and 3n, from Corollary. Then xa 0 a 1 = x ( + x j aa + 1 ) ɛ + xk aa ɛ, xa 0 a 1 = a 0(a 1 b 1 ) a1(a 0 b 0 ) ɛ ɛ = ((j aa j ab ) + (k aa k ab )ɛ) ((j aa j ba ) + (k aa k ba )ɛ) = (j aa j ab )(j aa j ba ) + ((j aa j ab )(k aa k ba ) + (j aa j ba )(k aa k ab )) ɛ + (k aa k ab )(k aa k ba )ɛ.

9 ERROR BOUNDS ON COMPLEX FLOATING-POINT MULTIPLICATION 9 Consequently, x (j aa j ab )(j aa j ba ) = ((j aa j ab )(k aa k ba ) + (j aa j ba )(k aa k ab ) (j aa + 1)x) ɛ + ((k aa k ab )(k aa k ba ) k aa x) ɛ, ( x (j aa j ab )(j aa j ba ) n 3n 4 + n 3n ( n ) ) n ɛ ( + 3n ) 3n + 6n ɛ = ( 3n + 3n ) ɛ + 1 n ɛ ɛ ɛ < 1, and since the only integer with absolute value less than one is zero, we can conclude that x = (j aa j ab )(j aa j ba ) as required. We now consider xa 0 a 1 ɛ modulo 1 ɛ 1, and note that and further that xk aa xa 0 a 1 ɛ (k aa k ab )(k aa k ba ) xk aa (k aa k ab )(k aa k ba ) 3n n + 3n 3n = 1n 4 1ɛ 1 64 < 1 ɛ 1 and therefore xk aa = (k aa k ab )(k aa k ba ). Theorem 3. Let β = and assume that z 0 = a 0 + b 0 i, z 1 = a 1 + b 1 i, and z = ((a 0 a 1 ) (b 0 b 1 )) + ((a 0 b 1 ) (b 0 a 1 ))i are such that (1) () (3) (4) (6) 0 a 0, b 0, a 1, b 1 b 0 b 1 a 0 a 1 b 0 a 1 a 0 b 1 1/ a 0 a 1 < 1 1/ a 0 < 1 and no overflow, underflow, or denormal values occur during the computation of z. Assume further that the results of arithmetic operations are correctly rounded to the nearest representable value, and that (5) z z 0 z 1 z 0 z 1 > ɛ ( ) 5 nɛ > ɛ max 104/07, 3/7 + ɛ

10 10 RICHARD BRENT, COLIN PERCIVAL, AND PAUL ZIMMERMANN for some n < 1 4 ɛ 1/ and ɛ 6. Then there exist integers c 0, d 0, α 0, β 0, c 1, d 1, α 1, β 1 satisfying a 0 = c 0 d 0 (1 + α 0 ɛ) b 0 = c 0 d 0 (1 + β 0 ɛ) a 1 = c 1 d 1 (1 + α 1 ɛ) b 1 = c 1 d 1 (1 + β 1 ɛ) gcd(c 0, d 0 ) = 1 gcd(c 1, d 1 ) = 1 c 0 c 1 = d 0 d 1 < 3n d 0 c 0 d 0 d 1 c 1 d 1 1 < a 0,b 0, a 1, b 1 < 1 α 0 β 0 ɛ 1 (mod d 0 ) α 0 β 0 α 1 β 1 ɛ 1 (mod d 1 ) α 1 β 1 min (α 0, β 0 ) + min (α 1, β 1 ) 0 max ( α 0, β 0 ) max ( α 1, β 1 ) < n Proof. Let the values j aa, j ab, j ba, j bb, k aa, k ab, k ba, and k bb be as constructed in Theorem, and further let g 0 = gcd(j aa j ab, (a 1 b 1 )/ɛ). From Corollary 1 we know that 1/ < a 1, b 1 < 1, so a 1 and b 1 are multiples of ɛ; consequently g 0 must be an integer. By the same argument, g 1 = gcd(j aa j ba, (a 0 b 0 )/ɛ) is an integer. Now note that g 0 (a 1 b 1 )ɛ 1 (a 1 b 1 )a 0 ɛ = (j aa j ab )ɛ 1 + (k aa k ab ) and since g 0 (j aa j ab ), we can conclude that g 0 (k aa k ab ). By the same argument, g 1 (k aa k ba ). We now write c 0 = j aa j ab d 0 = a 1 b 1 g 0 g 0 ɛ c 1 = j aa j ba d 1 = a 0 b 0 g 1 g 1 ɛ e 0 = k aa k ab g 0 e 1 = k aa k ba g 1 and note that these values are all integers; further, from Corollary 3 we have d 0 d 1 k aa = e 0 e 1 and d 0 d 1 = c 0 c 1, and since gcd(c 0, d 0 ) = gcd(c 1, d 1 ) = 1 by construction, this implies gcd(c 0, c 1 ) = 1. We now observe that and therefore 1 + a 0 = a 0(a 1 b 1 ) = c 0g 0 ɛ + e 0 g 0 ɛ = c 0 + e 0 ɛ a 1 b 1 d 0 g 0 ɛ d 0 a 1 = a 1(a 0 b 0 ) = c 1g 1 ɛ + e 1 g 1 ɛ = c 1 + e 1 ɛ a 0 b 0 d 1 g 1 ɛ d 1 ( j aa + 1 ) ɛ + k aa ɛ = a 0 a 1 = c 0c 1 d 0 d 1 + c 0e 1 + e 0 c 1 d 0 d 1 ɛ + e 0e 1 d 0 d 1 ɛ = 1 + c 0e 1 + e 0 c 1 d 0 d 1 ɛ + k aa ɛ

11 ERROR BOUNDS ON COMPLEX FLOATING-POINT MULTIPLICATION 11 and thus (using d 0 d 1 = c 0 c 1 ) c 0 c 1 (j aa + 1) = c 0 e 1 + e 0 c 1. Consequently c 0 e 0 c 1 and c 1 c 0 e 1, and since gcd(c 0, c 1 ) = 1 it follows that c 0 e 0 and c 1 e 1. Writing e 0 = c 0 α 0, e 1 = c 1 α 1 for integers α 0, α 1, we now have a 0 = c 0 (1 + α 0 ɛ) a 1 = c 1 (1 + α 1 ɛ) d 0 d 1 and taking β 0 = α 0 + c 1 g 1, β 1 = α 1 + c 0 g 0, we have b 0 = c 0 (1 + β 0 ɛ) d 0 b 1 = c 1 (1 + β 1 ɛ) d 1 as required. The remaining conditions can be obtained by remembering that a 0, b 0, a 1, and b 1 are integer multiples of ɛ, and by using the bounds on j xy and k xy given in Theorem. Corollary 4. In IEEE 754 single-precision arithmetic (β =, t = 4, ɛ = 4 ), using nearest even rounding mode, the values 1 a 0 = 3 b 0 = (1 4ɛ) a 1 = 3 (1 + 11ɛ) b 1 = (1 + 5ɛ) 3 result in a relative error δ ɛ 5 168ɛ ɛ in z, and δ is the largest possible relative error for for IEEE 754 single-precision inputs provided that overflow, underflow, and denormals do not occur. Proof. Straightforward computation for the values given establishes that a 0 a 1 = 1 (1 + 11ɛ) a 0 a 1 = 1 (1 + 1ɛ) b 0 b 1 = 1 (1 + ɛ 0ɛ ) b 0 b 1 = 1 R(z 0 z 1 ) = 5ɛ + 10ɛ R(z ) = 6ɛ a 0 b 1 = 1 (1 + 5ɛ) a 0 b 1 = 1 (1 + 4ɛ) b 0 a 1 = 1 (1 + 7ɛ 44ɛ ) b 0 a 1 = 1 (1 + 6ɛ) I(z 0 z 1 ) = 1 + 6ɛ ɛ I(z ) = 1 + 4ɛ z z 0 z 1 = ɛ (5 108ɛ + O(ɛ )) z 0 z 1 = 1 + 1ɛ + O(ɛ ) and the ratio of these provides the error as stated. To prove that this is the largest possible relative error for IEEE single-precision inputs, we note that the mappings z 0 z 0 i, z 1 z 1 i, (z 0, z 1 ) ( z 0, z 1 ), (z 0, z 1 ) (z 1, z 0 ), z 0 z 0 j, and z 1 z 1 k do not affect the relative error in z ; consequently, this allows us to assume without loss of generality that conditions (1-4) and (6) are satisfied by the worst-case inputs. Using the results of Theorem 3, an exhaustive computer search (taking about five minutes in MAPLE on the second author s 1.4 GHz laptop) completes the proof. 1 Note that while 3 is not an IEEE 754 single-precision value, 3 (1 + 5ɛ) and 3 (1 + 11ɛ) are, since ɛ ɛ (mod 3).

12 1 RICHARD BRENT, COLIN PERCIVAL, AND PAUL ZIMMERMANN Corollary 5. In IEEE 754 double-precision arithmetic (β =, t = 53, ɛ = 53 ), using nearest even rounding mode, the values a 0 = 3 4 (1 + 4ɛ) b 0 = 3 4 a 1 = 3 (1 + 7ɛ) b 1 = (1 + 1ɛ) 3 result in a relative error in z of approximately ɛ 5 96ɛ ɛ , and this is the worst possible provided that overflow, underflow, and denormals do not occur. Proof. Straightforward computation for the values given establishes that z z 0 z 1 = ɛ (5 36ɛ + O(ɛ )) z 0 z 1 = 1 + 1ɛ + O(ɛ ) and the ratio of these provides the error as stated. As in Corollary 4, an exhaustive search using the results of Theorem 3 (again, taking just a few minutes) completes the proof. For β = and t > 6, the constructions given in Corollaries 4 and 5 for a 0, b 0, a 1, b 1 provide for even and odd t respectively relative errors of ɛ 5 168ɛ + O(ɛ ) and ɛ 5 96ɛ + O(ɛ ). We believe that these are the worst-case inputs for all sufficiently large t when β =. 4. A note on methods The existence of this paper serves a strong demonstration of the power of experimental mathematics. The initial result the upper bound of 5ɛ was discovered experimentally seven years ago, on the basis of testing a few million random single-precision products. Experimental methods became even more important when it came to the results concerning worst-case inputs. Here the approach taken was to perform an exhaustive search, taking several hours on the second author s laptop, of IEEE single-precision inputs, using only a few arguments from Theorem 1 to prune the search. Once the worst few sets of inputs had been enumerated, it became clear that they possessed the structure described in Theorem 3, and it was natural to conjecture that this structure would be satisfied by the worst-case inputs in any precision. As is common with such problems, once the required result was known, constructing a proof was fairly straightforward. References 1. N.J. Higham, Accuracy and Stability of Numerical Algorithms, Second Edition, SIAM, 00.. C. Percival, Rapid multiplication modulo the sum and difference of highly composite numbers, Math. Comp. 7 (00), Mathematical Sciences Institute, Australian National University, Canberra, ACT 000, Australia address: complex@rpbrent.com IRMACS Centre, Simon Fraser University, Burnaby, BC, Canada address: cperciva@irmacs.sfu.ca INRIA Lorraine/LORIA, 615 rue du Jardin Botanique, F-5460 Villers-lès-Nancy Cedex, France address: zimmerma@loria.fr

Error Bounds on Complex Floating-Point Multiplication

Error Bounds on Complex Floating-Point Multiplication Richard Brent, Colin Percival, Paul Zimmermann To cite this version: Richard Brent, Colin Percival, Paul Zimmermann. Error Bounds on Complex Floating-Point