Error Bounds for Arithmetic Operations on Computers Without Guard Digits

IMA Journal of Numerical Analysis (1983) 3, 153-160
0272-4979/83/020153+08 $03.00/0 © 1983 Academic Press Inc. (London) Limited

F. W. J. OLVER
The University of Maryland and the National Bureau of Standards, U.S.A.

[Received 22 November 1982]

For computers having no form of guard digit in the accumulator register it is not possible to bound the relative error of floating-point subtraction processes in a satisfactory manner. This paper describes some modifications of recent error analyses to cover this situation, including the evaluation of sums and inner products, and indicates the corresponding modifications for the solution of systems of linear algebraic equations.

1. Introduction

Two recent papers (Olver, 1982; Olver & Wilkinson, 1982) have described the construction and computation of a posteriori error bounds for various algebraic operations with floating-point arithmetic, including the solution of linear algebraic equations by Gaussian elimination. The analysis employs a new definition of relative error, called relative precision (rp). This is used in conjunction with the conventional definition of absolute error, or absolute precision (ap).

A basic assumption of the two papers is that it is possible to assign a common rp bound to all the abbreviation processes (chopping or rounding) that accompany internal floating-point arithmetic operations, including addition, subtraction, multiplication and division. Generally this assumption is quite satisfactory for computers that are equipped with at least one guard digit in the accumulator register, that is, a register used for storing intermediate results in addition or subtraction. But without any form of guard digit, or indicator bit,† the assumption is unrealistic; see Olver (1978), especially the numerical example in Section 4.6. The purpose of the present paper is to supply modifications to the analyses to cover these excepted cases.

† Yohe (1973).

2. Addition and Subtraction: Ap Forms

Let $a_1$ and $a_2$ be real numbers in normalized floating-point form, given by

$$a_1 = r^{p_1} \times m_1, \qquad a_2 = r^{p_2} \times m_2, \tag{2.1}$$

where $r$ is the internal radix of the computing facility, the exponents $p_1$ and $p_2$ are signed integers, and the mantissae $m_1$ and $m_2$ are signed rational numbers such that

$$r^{-1} \le |m_1| < 1, \qquad r^{-1} \le |m_2| < 1. \tag{2.2}$$

Also, let the number of digits in each mantissa be denoted by $d$. Setting aside the trivial cases in which $a_1$ or $a_2$ vanishes, we assume that $|a_1| \ge |a_2| > 0$.

The sum

$$a_1 + a_2 = s, \tag{2.3}$$

say, is computed in the following manner. First, $a_2$ is converted to unnormalized form in an intermediate register in such a way that it has the same exponent, $p_1$, as $a_1$. This is effected by preshifting $m_2$ a total of $p_1 - p_2$ places to the right. Secondly, $a_2$ is either chopped or rounded† to the same number of places as $a_1$, and the resulting approximation is denoted by $\hat a_2$. Thirdly, $\hat a_2$ is added to $a_1$, the sum is renormalized, if necessary, and the mantissa chopped or rounded to $d$ digits to yield the final answer $\bar s$, say.

† In the case of rounding it is assumed that $r$ is even. When $r = 2$, or a power of 2, one way of effecting the rounding is simply to add (sign $a_2$) × (first neglected bit of $a_2$) × (one unit in the last retained place) to the retained part of $a_2$.

THEOREM 2.1

$$\bar s = a_1(1 + \theta_1) + a_2(1 + \theta_2), \tag{2.4}$$

where

$$|\theta_1|,\ |\theta_2| < r^{1-d} \tag{2.5}$$

if the abbreviation mode is chopping, or

$$|\theta_1|,\ |\theta_2| \le \tfrac12 (1 + r^{-1})\, r^{1-d} \tag{2.6}$$

if the abbreviation mode is rounding.

There are four cases to consider, depending on whether addition or subtraction is being performed and whether the abbreviation mode is chopping or rounding. For the rounding cases the result is a well-known formula of Wilkinson (1963, Chapter 1, Sections 17-18). For completeness, however, we supply proofs in all four cases. Without loss of generality we suppose that $p_1 = 0$ throughout the proofs.

(i) Addition with chopping. In this case and the next we assume that $a_1 \ge a_2 > 0$. From the definition of $\hat a_2$ we have

$$0 \le a_2 - \hat a_2 < r^{-d}. \tag{2.7}$$

Hence from (2.3) we derive

$$0 \le s - (a_1 + \hat a_2) < r^{-d}. \tag{2.8}$$

Suppose first that $a_1 + \hat a_2 \le 1$. Then no further abbreviations take place after forming $\hat a_2$.‡ In consequence $\bar s = a_1 + \hat a_2$ and

$$0 \le s - \bar s < r^{-d}. \tag{2.9}$$

In (2.4) set $\theta_2 = 0$, so that $\theta_1 = (\bar s - s)/a_1$. Then

$$0 \le -\theta_1 < r^{-d}/a_1 \le r^{1-d}, \tag{2.10}$$

since $a_1 \ge r^{-1}$; compare (2.1) and (2.2). Therefore (2.5) is satisfied.

‡ If $a_1 + \hat a_2 = 1$, then renormalization occurs but no abbreviation error is introduced.

Alternatively, assume that $a_1 + \hat a_2 > 1$. In this event the final digit of $a_1 + \hat a_2$ is chopped to form $\bar s$, and we have

$$0 \le a_1 + \hat a_2 - \bar s \le (r - 1)\, r^{-d}. \tag{2.11}$$

Addition of this result to (2.8) yields

$$0 \le s - \bar s < r^{1-d}. \tag{2.12}$$

In (2.4) set $\theta_1 = \theta_2 = \theta$, so that $\theta = (\bar s - s)/s$. Then

$$0 \le -\theta < r^{1-d}/s < r^{1-d}, \tag{2.13}$$

since $s \ge a_1 + \hat a_2 > 1$. Again, (2.5) is satisfied.

(ii) Addition with rounding. In place of (2.7) and (2.8) we have, for all methods of rounding,

$$-\tfrac12 r^{-d} \le a_2 - \hat a_2 \le \tfrac12 r^{-d} \tag{2.14}$$

and

$$-\tfrac12 r^{-d} \le s - (a_1 + \hat a_2) \le \tfrac12 r^{-d}. \tag{2.15}$$

Suppose first that $a_1 + \hat a_2 \le 1$. Then no further abbreviations take place after forming $\hat a_2$. In consequence $\bar s = a_1 + \hat a_2$; hence with $\theta_2 = 0$ we find that

$$|\theta_1| \le \tfrac12 r^{1-d}; \tag{2.16}$$

compare (2.10). Therefore (2.6) is satisfied.

Alternatively, assume that $a_1 + \hat a_2 > 1$. Then the final digit of $a_1 + \hat a_2$ is rounded to form $\bar s$. Accordingly,

$$|a_1 + \hat a_2 - \bar s| \le \tfrac12 r^{1-d}. \tag{2.17}$$

Addition of this result to (2.15) yields

$$|s - \bar s| \le \tfrac12 (1 + r^{-1})\, r^{1-d}. \tag{2.18}$$

With $\theta_1 = \theta_2 = \theta = (\bar s - s)/s$ we see that

$$|\theta| \le \tfrac12 (1 + r^{-1})\, r^{1-d}/s. \tag{2.19}$$

Because $a_1 + \hat a_2 > 1$ it follows that $a_1 + \hat a_2 \ge 1 + r^{-d}$. Using (2.15) we then derive

$$s \ge 1 + \tfrac12 r^{-d} > 1, \tag{2.20}$$

and from this inequality and (2.19) it is clear that (2.6) is satisfied.

(iii) Subtraction with chopping. In this case and the next we replace $a_2$ by $-a_2$, so that (2.3) and (2.4) become

$$a_1 - a_2 = s \tag{2.21}$$

and

$$\bar s = a_1(1 + \theta_1) - a_2(1 + \theta_2), \tag{2.22}$$

respectively. We may again assume that $a_1 \ge a_2 > 0$.

The inequalities (2.7) continue to apply. We also have $\bar s = a_1 - \hat a_2$, since no abbreviation takes place on subtracting $\hat a_2$ from $a_1$. From these results and (2.21) we see that

$$0 \le \bar s - s < r^{-d}. \tag{2.23}$$

In (2.22) set $\theta_2 = 0$, so that $\theta_1 = (\bar s - s)/a_1$. Then we have

$$0 \le \theta_1 < r^{-d}/a_1 \le r^{1-d}; \tag{2.24}$$

compare (2.10). Therefore (2.5) is satisfied.

(iv) Subtraction with rounding. The inequalities (2.14) continue to apply, and no abbreviation takes place on subtracting $\hat a_2$ from $a_1$. Hence

$$-\tfrac12 r^{-d} \le \bar s - s \le \tfrac12 r^{-d}. \tag{2.25}$$

In (2.22) set $\theta_2 = 0$, so that $\theta_1 = (\bar s - s)/a_1$. Then we have

$$|\theta_1| \le \tfrac12 r^{-d}/a_1 \le \tfrac12 r^{1-d}. \tag{2.26}$$

Therefore (2.6) is satisfied. This completes the proof of Theorem 2.1.

3. Addition: Rp Forms

For the floating-point addition of two non-negative numbers we require an rp form of error bound. In this section we employ the same notation and make the same assumptions as in the opening two paragraphs of Section 2, except that we now suppose $a_1 \ge a_2 \ge 0$.

THEOREM 3.1

$$\bar s \simeq s;\ \operatorname{rp}(r^{1-d}) \tag{3.1}$$

if the abbreviation mode is chopping, or

$$\bar s \simeq s;\ \operatorname{rp}\{\tfrac12 (1 + r^{-1})\, r^{1-d}\} \tag{3.2}$$

if the abbreviation mode is rounding.

Neither of these results is deducible immediately from Theorem 2.1. The relation (3.1) is easily derived from the inequalities (2.9) and (2.12), however, and it was also included in Section 4.5 of Olver (1978). Furthermore, Section 4.7 of the same reference indicates that (3.1) is valid when the abbreviation mode is rounding. To prove the stronger result (3.2) we proceed as follows.

Again, without loss of generality we suppose that $p_1 = 0$. Suppose first that $a_1 + \hat a_2 \le 1$. Then the only abbreviation that takes place is the rounding of $a_2$ to form $\hat a_2$. This is equivalent to the rounding of $s$ to form $\bar s$; hence by Olver (1978), Section 4.3,† we have

$$\bar s \simeq s;\ \operatorname{rp}(\tfrac12 r^{1-d}), \tag{3.3}$$

and so (3.2) is satisfied.

† In this reference it is assumed that in the event of a "tie" the rounding is always upward. However, it is easily verified that if downward rounding is used in these critical cases, then (3.3) continues to apply.

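Alternatively, assume that $a_1 + \hat a_2 > 1$. The rounding of $a_2$ to form $\hat a_2$ is described by the relations

$$a_2 = \hat a_2 + \delta r^{-d}, \qquad -\tfrac12 \le \delta \le \tfrac12. \tag{3.4}$$

Similarly, the rounding of $a_1 + \hat a_2$ to form $\bar s$ is described by

$$a_1 + \hat a_2 = \bar s + k r^{-d}, \tag{3.5}$$

where $k$ is an integer that satisfies

$$-\tfrac12 r \le k \le \tfrac12 r, \tag{3.6}$$

or†

$$1 \le k \le \tfrac12 r \quad \text{if } \bar s = 1. \tag{3.7}$$

† The restriction $k \ge 1$ in (3.7) stems from the condition $a_1 + \hat a_2 > 1$.

Eliminating $\hat a_2$ from (3.4) and (3.5) we obtain

$$a_1 + a_2 = \bar s\, e^{u}, \tag{3.8}$$

where

$$u = \ln\{1 + (k + \delta)\, r^{-d}/\bar s\}. \tag{3.9}$$

To establish (3.2) we need to show that

$$|u| \le \tfrac12 (1 + r^{-1})\, r^{1-d}. \tag{3.10}$$

There are three subcases to be considered. Suppose first that $\bar s = 1$. From (3.7) and the inequalities in (3.4) we have

$$\tfrac12 \le k + \delta \le \tfrac12 r + \tfrac12.$$

Therefore $u$ is positive. Furthermore,

$$u = \ln\{1 + (k + \delta)\, r^{-d}\} < (k + \delta)\, r^{-d} \le (\tfrac12 r + \tfrac12)\, r^{-d}. \tag{3.11}$$

Secondly, suppose that $\bar s \ge 1 + r^{1-d}$ and $k + \delta \ge 0$. Then $u$ is non-negative. From (3.4) and (3.6) we derive $k + \delta \le \tfrac12 r + \tfrac12$. Hence, since $\bar s > 1$,

$$0 \le u < \ln\{1 + (\tfrac12 r + \tfrac12)\, r^{-d}\} < (\tfrac12 r + \tfrac12)\, r^{-d}. \tag{3.12}$$

Thirdly, suppose that $\bar s \ge 1 + r^{1-d}$ and $k + \delta < 0$. From (3.4) and (3.6) we have $-\tfrac12 r - \tfrac12 \le k + \delta$. Therefore

$$1 + \frac{(k + \delta)\, r^{-d}}{\bar s} \ge 1 - \frac{(\tfrac12 r + \tfrac12)\, r^{-d}}{1 + r^{1-d}} = \frac{1 + (\tfrac12 r - \tfrac12)\, r^{-d}}{1 + r^{1-d}}.$$

Since $r \ge 2$ and $d \ge 1$ this quantity is positive. Hence from (3.9) we see that

$$-u \le \ln\left\{\frac{1 + r^{1-d}}{1 + (\tfrac12 r - \tfrac12)\, r^{-d}}\right\},$$

and so

$$0 < -u < \ln\{1 + (\tfrac12 r + \tfrac12)\, r^{-d}\} < (\tfrac12 r + \tfrac12)\, r^{-d}. \tag{3.13}$$

The desired inequality (3.10) follows on combining (3.11), (3.12) and (3.13), because $(\tfrac12 r + \tfrac12)\, r^{-d} = \tfrac12 (1 + r^{-1})\, r^{1-d}$. This completes the proof of Theorem 3.1.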
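As an informal check of Theorems 2.1 and 3.1, the three-step addition procedure of Section 2 can be simulated in exact rational arithmetic. The following Python sketch is not part of the original analysis. The function names (`exponent`, `abbreviate`, `add_no_guard`, `theta_bound`, `bounds_hold`) are illustrative, rounding ties are assumed to be resolved away from zero, and the random test values and exponent range are arbitrary choices.

```python
from fractions import Fraction
from math import floor, log, trunc
import random

def exponent(x, r):
    """Integer p with r**(p-1) <= |x| < r**p, so the mantissa lies in [1/r, 1)."""
    x, p = abs(x), 0
    while x >= Fraction(r) ** p:
        p += 1
    while x < Fraction(r) ** (p - 1):
        p -= 1
    return p

def abbreviate(x, unit, mode):
    """Chop x toward zero, or round it (ties away from zero), to a multiple of unit."""
    q = x / unit
    if mode == "chop":
        n = trunc(q)
    else:
        n = floor(abs(q) + Fraction(1, 2)) * (1 if q >= 0 else -1)
    return n * unit

def add_no_guard(a1, a2, r, d, mode):
    """Sum of two d-digit radix-r numbers formed without a guard digit (Section 2)."""
    a1, a2 = Fraction(a1), Fraction(a2)
    if abs(a1) < abs(a2):
        a1, a2 = a2, a1                       # ensure |a1| >= |a2|
    if a1 == 0:
        return Fraction(0)
    unit = Fraction(r) ** (exponent(a1, r) - d)
    a2_hat = abbreviate(a2, unit, mode)       # preshift a2, keep only a1's places
    t = a1 + a2_hat                           # exact: both are multiples of unit
    if t == 0:
        return Fraction(0)
    # Renormalize and abbreviate the mantissa to d digits.
    return abbreviate(t, Fraction(r) ** (exponent(t, r) - d), mode)

def theta_bound(r, d, mode):
    """Right-hand side of (2.5) for chopping or (2.6) for rounding."""
    b = Fraction(r) ** (1 - d)
    return b if mode == "chop" else Fraction(1, 2) * (1 + Fraction(1, r)) * b

def bounds_hold(r, d, mode, trials=2000, seed=0):
    """Test the ap bound implied by (2.4) and the rp bound of Theorem 3.1."""
    rng = random.Random(seed)
    b = theta_bound(r, d, mode)

    def rand_float():
        m = rng.randrange(r ** (d - 1), r ** d)          # normalized d-digit mantissa
        return Fraction(m) * Fraction(r) ** (rng.randint(-3, 3) - d)

    for _ in range(trials):
        a1, a2 = rand_float(), rand_float()
        for x2 in (a2, -a2):                             # addition and subtraction
            s_bar = add_no_guard(a1, x2, r, d, mode)
            if abs(s_bar - (a1 + x2)) > (abs(a1) + abs(x2)) * b:
                return False
        s_bar = add_no_guard(a1, a2, r, d, mode)         # non-negative operands
        if abs(log(float(s_bar / (a1 + a2)))) > float(b) * (1 + 1e-9):
            return False
    return True
```

With $r = 10$ and $d = 3$, `add_no_guard(Fraction(1), Fraction(-999, 1000), 10, 3, "chop")` returns 1/100, whereas the exact difference is 1/1000. The relative error is large, but the absolute error 9/1000 stays within $(|a_1| + |a_2|)\, r^{1-d}$, which is the behaviour that motivates the ap form of bound for subtraction.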

4. Sums and Inner Products

In this section we describe modifications that are needed in the analysis of Olver (1982) to include computing facilities that have no form of guard digit in the accumulator register. As in this reference, we make provision for two modes (or levels) of floating-point operation: a main mode, $\mathcal{M}$, for the main computations and a lower mode, $\mathcal{L}$, for error-bound computations. We continue to employ the symbols $r$, $d$ and $d_1$ to denote, respectively, the radix of the internal number system of the computing facility, the number of digits assigned to the mantissa in $\mathcal{M}$ and the number of digits assigned to the mantissa in $\mathcal{L}$. We also continue to use single bars to denote numbers stored in $\mathcal{M}$ and double bars to denote numbers stored in $\mathcal{L}$.

In the present paper, however, we define $\gamma$ to be a convenient member of $\mathcal{L}$ that equals or exceeds the common rp error bound for abbreviation, multiplication and division in $\mathcal{M}$, and also satisfies

$$\gamma \ge r^{1-d} \ \text{(chopping)}; \qquad \gamma \ge \tfrac12 (1 + r^{-1})\, r^{1-d} \ \text{(rounding)}. \tag{4.1}$$

In consequence, the rules for multiplication and division in $\mathcal{M}$ are unchanged. Thus if $\bar a_1$ and $\bar a_2$ are any two real numbers stored in $\mathcal{M}$, then their computed product $\overline{\bar a_1 \bar a_2}$ and computed quotient $\overline{\bar a_1/\bar a_2}$ satisfy

$$\overline{\bar a_1 \bar a_2} \simeq \bar a_1 \bar a_2;\ \operatorname{rp}(\gamma), \tag{4.2}$$

$$\overline{\bar a_1/\bar a_2} \simeq \bar a_1/\bar a_2;\ \operatorname{rp}(\gamma), \tag{4.3}$$

provided that $\bar a_2 \ne 0$ in the case of division. For addition (and subtraction) we now have

$$\overline{\bar a_1 + \bar a_2} \simeq \bar a_1 + \bar a_2;\ \operatorname{ap}\{(|\bar a_1| + |\bar a_2|)\gamma\}, \tag{4.4}$$

this result being derived from (4.1) and Theorem 2.1.

Similarly, we define $\gamma_1$ to be a convenient member of $\mathcal{L}$ that equals or exceeds the common rp error bound for abbreviation, multiplication and division in $\mathcal{L}$, and satisfies

$$\gamma_1 \ge r^{1-d_1} \ \text{(chopping)}; \qquad \gamma_1 \ge \tfrac12 (1 + r^{-1})\, r^{1-d_1} \ \text{(rounding)}. \tag{4.5}$$

In consequence, if $\bar{\bar\alpha}_1$ and $\bar{\bar\alpha}_2$ are two non-negative members of $\mathcal{L}$, then from the definition of rp we have

$$\bar{\bar\alpha}_1 \bar{\bar\alpha}_2\, e^{-\gamma_1} \le \overline{\overline{\bar{\bar\alpha}_1 \bar{\bar\alpha}_2}} \le \bar{\bar\alpha}_1 \bar{\bar\alpha}_2\, e^{\gamma_1}, \qquad (\bar{\bar\alpha}_1/\bar{\bar\alpha}_2)\, e^{-\gamma_1} \le \overline{\overline{\bar{\bar\alpha}_1/\bar{\bar\alpha}_2}} \le (\bar{\bar\alpha}_1/\bar{\bar\alpha}_2)\, e^{\gamma_1}, \tag{4.6}$$

provided that $\bar{\bar\alpha}_2 \ne 0$ in the case of division. Furthermore,

$$(\bar{\bar\alpha}_1 + \bar{\bar\alpha}_2)\, e^{-\gamma_1} \le \overline{\overline{\bar{\bar\alpha}_1 + \bar{\bar\alpha}_2}} \le (\bar{\bar\alpha}_1 + \bar{\bar\alpha}_2)\, e^{\gamma_1}, \tag{4.7}$$

this result being a consequence of (4.5) and Theorem 3.1. In other words, with the new definition of $\gamma_1$ the exponential rule formulated in Olver (1982, Section 3) continues

to apply. Indeed, if we retain the conditions imposed in that paper on $\gamma$ and $\gamma_1$ (in particular $\gamma \le \gamma_1$), then except for $(\mathit{3.12})$ the results in Sections 3, 4, 5, 8 and 9 of Olver (1982) apply unchanged.

The necessary modifications to the error bounds for sums and inner products given in Sections 6 and 7 of Olver (1982) are obtained by replacing formula $(\mathit{6.4})$ by (4.4) whenever addition takes place: here and in the rest of this section we use italics to indicate equation numbers that belong to Olver (1982).

In the case of sums we find that $(\mathit{6.7})$ applies, provided that the definition $(\mathit{6.8})$ of $\rho_m$ is changed to

[Display (4.8) is illegible in the source.]

From this result and the exponential rule we see that the computable form $(\mathit{6.15})$ becomes

[Display (4.9) is illegible in the source.]

where $\sigma_m$ is now defined by

[Display (4.10) is illegible in the source.]

In evaluating this bound the grouping brackets must be obeyed.

In the case of inner products, we find that $(\mathit{7.6})$ remains valid, provided that $(\mathit{7.7})$ is replaced by

[Display (4.11) is illegible in the source.]

The computable form $(\mathit{7.11})$ also remains valid, provided that the definition $(\mathit{7.10})$ of $\sigma_m$ is changed to

[Display (4.12) is illegible in the source.]

Again, grouping brackets must be obeyed.

In summary, the essential modifications to the previous results for sums and inner products are to replace the term $|s_j|\gamma$ in the error bounds by $(|a_1| + |a_2| + \cdots + |a_m|)\gamma$ in the case of sums, and by $(|a_1 b_1| + |a_2 b_2| + \cdots + |a_m b_m|)\gamma$ in the case of inner products, and to make a minor change in the compensating factors.

5. Conclusions

On computers that are not equipped with some form of guard digit in the accumulator register, the relative error in the computed difference of two floating-point numbers may be quite large. In consequence, a bound for the absolute error should be used. This paper supplies the necessary modifications of error bounds that were obtained for sums and inner products in an earlier paper. The main effect of the changes on the error bounds is to replace the absolute value of a sum by the

corresponding sum of the absolute values of the individual terms. In consequence, the modified forms of bound are weaker, but not grossly so in general.

Corresponding modifications of the a posteriori error bounds for Gaussian elimination that were obtained recently by J. H. Wilkinson and the present writer can be constructed by straightforward application of the results for sums and inner products supplied in the present paper.

The writer is pleased to acknowledge helpful suggestions by D. W. Lozier and the referee. This work was supported by the U.S. Army Research Office, Durham, under Contract DAAG 29-80-C-0032 and by the National Science Foundation under Grant MCS 81-11725.

REFERENCES

OLVER, F. W. J. 1978 A new approach to error arithmetic. SIAM J. num. Analysis 15, 368-393.
OLVER, F. W. J. 1982 Further developments of rp and ap error analysis. IMA J. num. Analysis 2, 249-274.
OLVER, F. W. J. & WILKINSON, J. H. 1982 A posteriori error bounds for Gaussian elimination. IMA J. num. Analysis 2, 377-406.
WILKINSON, J. H. 1963 Rounding Errors in Algebraic Processes. National Physical Laboratory Notes on Appl. Sci. No. 32. London: Her Majesty's Stationery Office.
YOHE, J. M. 1973 Roundings in floating-point arithmetic. IEEE Trans. Comput. C-22, 577-586.