Error Bounds for Arithmetic Operations on Computers Without Guard Digits


IMA Journal of Numerical Analysis (1983) 3, 153-160

F. W. J. OLVER
The University of Maryland and the National Bureau of Standards, U.S.A.

[Received 22 November 1982]

0272-4979/83/020153+08 \$03.00/0 © 1983 Academic Press Inc. (London) Limited

For computers having no form of guard digit in the accumulator register it is not possible to bound the relative error of floating-point subtraction processes in a satisfactory manner. This paper describes some modifications of recent error analyses to cover this situation, including the evaluation of sums and inner products, and indicates the corresponding modifications for the solution of systems of linear algebraic equations.

1. Introduction

Two recent papers (Olver, 1982; Olver & Wilkinson, 1982) have described the construction and computation of a posteriori error bounds for various algebraic operations with floating-point arithmetic, including the solution of linear algebraic equations by Gaussian elimination. The analysis employs a new definition of relative error, called relative precision (rp). This is used in conjunction with the conventional definition of absolute error, or absolute precision (ap).

A basic assumption of the two papers is that it is possible to assign a common rp bound to all the abbreviation processes (chopping or rounding) that accompany internal floating-point arithmetic operations, including addition, subtraction, multiplication and division. Generally this assumption is quite satisfactory for computers that are equipped with at least one guard digit in the accumulator register, that is, a register used for storing intermediate results in addition or subtraction. But without any form of guard digit, or indicator bit,† the assumption is unrealistic; see Olver (1978), especially the numerical example in Section 4.6. The purpose of the present paper is to supply modifications to the analyses to cover these excepted cases.

† Yohe (1973).

2. Addition and Subtraction: Ap Forms

Let $a_1$ and $a_2$ be real numbers in normalized floating-point form, given by

$$a_1 = r^{p_1} \times m_1, \qquad a_2 = r^{p_2} \times m_2, \tag{2.1}$$

where $r$ is the internal radix of the computing facility, the exponents $p_1$ and $p_2$ are signed integers, and the mantissae $m_1$ and $m_2$ are signed rational numbers such that

$$r^{-1} \le |m_1| < 1, \qquad r^{-1} \le |m_2| < 1. \tag{2.2}$$

Also, let the number of digits in each mantissa be denoted by $d$. Setting aside the trivial cases in which $a_1$ or $a_2$ vanishes, we assume that $|a_1| \ge |a_2| > 0$.

The sum

$$a_1 + a_2 = s, \tag{2.3}$$

say, is computed in the following manner. First, $a_2$ is converted to unnormalized form in an intermediate register in such a way that it has the same exponent, $p_1$, as $a_1$. This is effected by preshifting $m_2$ a total of $p_1 - p_2$ places to the right. Secondly, $a_2$ is either chopped or rounded† to the same number of places as $a_1$, and the resulting approximation is denoted by $\hat{a}_2$. Thirdly, $\hat{a}_2$ is added to $a_1$, the sum is renormalized, if necessary, and the mantissa chopped or rounded to $d$ digits to yield the final answer $\bar{s}$, say.

† In the case of rounding it is assumed that $r$ is even. When $r = 2$, or a power of 2, one way of effecting the rounding is simply to add (sign $a_2$) × (first neglected bit of $a_2$), as a unit in the last retained bit position, to the retained part of $a_2$.

THEOREM 2.1

$$\bar{s} = a_1(1 + \theta_1) + a_2(1 + \theta_2), \tag{2.4}$$

where

$$|\theta_1|, |\theta_2| < r^{1-d} \tag{2.5}$$

if the abbreviation mode is chopping, or

$$|\theta_1|, |\theta_2| \le \tfrac{1}{2}(1 + r^{-1}) r^{1-d} \tag{2.6}$$

if the abbreviation mode is rounding.

There are four cases to consider, depending on whether addition or subtraction is being performed and whether the abbreviation mode is chopping or rounding. For the rounding cases the result is a well-known formula of Wilkinson (1963, Chapter 1, Sections 17-18). For completeness, however, we supply proofs in all four cases. Without loss of generality we suppose that $p_1 = 0$ throughout the proofs.

(i) Addition with chopping. In this case and the next we assume that $a_1 \ge a_2 > 0$. From the definition of $\hat{a}_2$ we have

$$0 \le a_2 - \hat{a}_2 < r^{-d}. \tag{2.7}$$

Hence from (2.3) we derive

$$0 \le s - (a_1 + \hat{a}_2) < r^{-d}. \tag{2.8}$$

Suppose first that $a_1 + \hat{a}_2 \le 1$. Then no further abbreviations take place after forming $\hat{a}_2$.‡ In consequence $\bar{s} = a_1 + \hat{a}_2$ and

$$0 \le s - \bar{s} < r^{-d}. \tag{2.9}$$

In (2.4) set $\theta_2 = 0$, so that $\theta_1 = (\bar{s} - s)/a_1$. Then

$$0 \le -\theta_1 < r^{-d}/a_1 \le r^{1-d}, \tag{2.10}$$

since $a_1 \ge r^{-1}$; compare (2.1) and (2.2). Therefore (2.5) is satisfied.

‡ If $a_1 + \hat{a}_2 = 1$, then renormalization occurs but no abbreviation error is introduced.

Alternatively, assume that $a_1 + \hat{a}_2 > 1$. In this event the final digit of $a_1 + \hat{a}_2$ is chopped to form $\bar{s}$, and we have

$$0 \le a_1 + \hat{a}_2 - \bar{s} \le (r - 1) r^{-d}. \tag{2.11}$$

Addition of this result to (2.8) yields

$$0 \le s - \bar{s} < r^{1-d}. \tag{2.12}$$

In (2.4) set $\theta_1 = \theta_2 = \theta$, so that $\theta = (\bar{s} - s)/s$. Then

$$0 \le -\theta < r^{1-d}/s < r^{1-d}, \tag{2.13}$$

since $s \ge a_1 + \hat{a}_2 > 1$. Again, (2.5) is satisfied.

(ii) Addition with rounding. In place of (2.7) and (2.8) we have, for all methods of rounding,

$$-\tfrac{1}{2} r^{-d} \le a_2 - \hat{a}_2 \le \tfrac{1}{2} r^{-d} \tag{2.14}$$

and

$$-\tfrac{1}{2} r^{-d} \le s - (a_1 + \hat{a}_2) \le \tfrac{1}{2} r^{-d}. \tag{2.15}$$

Suppose first that $a_1 + \hat{a}_2 \le 1$. Then no further abbreviations take place after forming $\hat{a}_2$. In consequence $\bar{s} = a_1 + \hat{a}_2$; hence with $\theta_2 = 0$ we find that

$$|\theta_1| \le \tfrac{1}{2} r^{1-d}; \tag{2.16}$$

compare (2.10). Therefore (2.6) is satisfied.

Alternatively, assume that $a_1 + \hat{a}_2 > 1$. Then the final digit of $a_1 + \hat{a}_2$ is rounded to form $\bar{s}$. Accordingly,

$$|a_1 + \hat{a}_2 - \bar{s}| \le \tfrac{1}{2} r^{1-d}. \tag{2.17}$$

Addition of this result to (2.15) yields

$$|s - \bar{s}| \le \tfrac{1}{2}(1 + r^{-1}) r^{1-d}. \tag{2.18}$$

With $\theta_1 = \theta_2 = \theta = (\bar{s} - s)/s$ we see that

$$|\theta| \le \tfrac{1}{2}(1 + r^{-1}) r^{1-d}/s. \tag{2.19}$$

Because $a_1 + \hat{a}_2 > 1$ it follows that $a_1 + \hat{a}_2 \ge 1 + r^{-d}$. Using (2.15) we then derive

$$s \ge 1 + \tfrac{1}{2} r^{-d} > 1, \tag{2.20}$$

and from this inequality and (2.19) it is clear that (2.6) is satisfied.

(iii) Subtraction with chopping. In this case and the next we replace $a_2$ by $-a_2$, so that (2.3) and (2.4) become

$$a_1 - a_2 = s \tag{2.21}$$

and

$$\bar{s} = a_1(1 + \theta_1) - a_2(1 + \theta_2), \tag{2.22}$$

respectively. We may again assume that $a_1 \ge a_2 > 0$.

The inequalities (2.7) continue to apply. We also have $\bar{s} = a_1 - \hat{a}_2$, since no abbreviation takes place on subtracting $\hat{a}_2$ from $a_1$. From these results and (2.21) we see that

$$0 \le \bar{s} - s < r^{-d}. \tag{2.23}$$

In (2.22) set $\theta_2 = 0$, so that $\theta_1 = (\bar{s} - s)/a_1$. Then we have

$$0 \le \theta_1 < r^{-d}/a_1 \le r^{1-d}; \tag{2.24}$$

compare (2.10). Therefore (2.5) is satisfied.

(iv) Subtraction with rounding. The inequalities (2.14) continue to apply, and no abbreviation takes place on subtracting $\hat{a}_2$ from $a_1$. Hence

$$-\tfrac{1}{2} r^{-d} \le \bar{s} - s \le \tfrac{1}{2} r^{-d}. \tag{2.25}$$

In (2.22) set $\theta_2 = 0$, so that $\theta_1 = (\bar{s} - s)/a_1$. Then we have

$$|\theta_1| \le \tfrac{1}{2} r^{-d}/a_1 \le \tfrac{1}{2} r^{1-d}. \tag{2.26}$$

Therefore (2.6) is satisfied. This completes the proof of Theorem 2.1.

3. Addition: Rp Forms

For the floating-point addition of two non-negative numbers we require an rp form of error bound. In this section we employ the same notation and make the same assumptions as in the opening two paragraphs of Section 2 except that we now suppose $a_1 \ge a_2 \ge 0$.

THEOREM 3.1

$$\bar{s} \simeq s; \quad \mathrm{rp}(r^{1-d}) \tag{3.1}$$

if the abbreviation mode is chopping, or

$$\bar{s} \simeq s; \quad \mathrm{rp}\{\tfrac{1}{2}(1 + r^{-1}) r^{1-d}\} \tag{3.2}$$

if the abbreviation mode is rounding.

Neither of these results is deducible immediately from Theorem 2.1. The relation (3.1) is easily derived from the inequalities (2.9) and (2.12), however, and it was also included in Section 4.5 of Olver (1978). Furthermore, Section 4.7 of the same reference indicates that (3.1) is valid when the abbreviation mode is rounding. To prove the stronger result (3.2) we proceed as follows.

Again, without loss of generality we suppose that $p_1 = 0$. Suppose first that $a_1 + \hat{a}_2 \le 1$. Then the only abbreviation that takes place is the rounding of $a_2$ to form $\hat{a}_2$. This is equivalent to the rounding of $s$ to form $\bar{s}$; hence by Olver (1978), Section 4.3,† we have

$$\bar{s} \simeq s; \quad \mathrm{rp}(\tfrac{1}{2} r^{1-d}), \tag{3.3}$$

and so (3.2) is satisfied.

† In this reference it is assumed that in the event of a "tie" the rounding is always upward. However, it is easily verified that if downward rounding is used in these critical cases, then (3.3) continues to apply.

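Alternatively, assume that $a_1 + \hat{a}_2 > 1$. The rounding of $a_2$ to form $\hat{a}_2$ is described by the relations

$$a_2 = \hat{a}_2 + \vartheta r^{-d}, \qquad -\tfrac{1}{2} \le \vartheta \le \tfrac{1}{2}. \tag{3.4}$$

Similarly, the rounding of $a_1 + \hat{a}_2$ to form $\bar{s}$ is described by

$$a_1 + \hat{a}_2 = \bar{s} + k r^{-d}, \tag{3.5}$$

where $k$ is an integer that satisfies

$$-\tfrac{1}{2} r \le k \le \tfrac{1}{2} r \quad \text{if } \bar{s} \ge 1 + r^{1-d}, \tag{3.6}$$

or†

$$1 \le k \le \tfrac{1}{2} r \quad \text{if } \bar{s} = 1. \tag{3.7}$$

† The restriction $k \ge 1$ in (3.7) stems from the condition $a_1 + \hat{a}_2 > 1$.

Eliminating $\hat{a}_2$ from (3.4) and (3.5) we obtain

$$a_1 + a_2 = \bar{s} e^{u}, \tag{3.8}$$

where

$$u = \ln\{1 + (k + \vartheta) r^{-d}/\bar{s}\}. \tag{3.9}$$

To establish (3.2) we need to show that

$$|u| \le \tfrac{1}{2}(1 + r^{-1}) r^{1-d}. \tag{3.10}$$

There are three subcases to be considered. Suppose first that $\bar{s} = 1$. From (3.7) and the inequalities in (3.4) we have

$$\tfrac{1}{2} \le k + \vartheta \le \tfrac{1}{2} r + \tfrac{1}{2}.$$

Therefore $u$ is positive. Furthermore,

$$u = \ln\{1 + (k + \vartheta) r^{-d}\} < (k + \vartheta) r^{-d} \le (\tfrac{1}{2} r + \tfrac{1}{2}) r^{-d}. \tag{3.11}$$

Secondly, suppose that $\bar{s} \ge 1 + r^{1-d}$ and $k + \vartheta \ge 0$. Then $u$ is non-negative. From (3.4) and (3.6) we derive $k + \vartheta \le \tfrac{1}{2} r + \tfrac{1}{2}$. Hence

$$0 \le u \le (k + \vartheta) r^{-d}/\bar{s} < (\tfrac{1}{2} r + \tfrac{1}{2}) r^{-d}. \tag{3.12}$$

Thirdly, suppose that $\bar{s} \ge 1 + r^{1-d}$ and $k + \vartheta < 0$. From (3.4) and (3.6) we have $-\tfrac{1}{2} r - \tfrac{1}{2} \le k + \vartheta$. Therefore

$$1 + (k + \vartheta) r^{-d}/\bar{s} \ge 1 - \frac{(\tfrac{1}{2} r + \tfrac{1}{2}) r^{-d}}{1 + r^{1-d}}.$$

Since $r \ge 2$ and $d \ge 1$ this quantity is positive. Hence from (3.9) we see that

$$e^{-u} \le \frac{1 + r^{1-d}}{1 + r^{1-d} - (\tfrac{1}{2} r + \tfrac{1}{2}) r^{-d}} < 1 + (\tfrac{1}{2} r + \tfrac{1}{2}) r^{-d},$$

and so

$$0 < -u < \ln\{1 + (\tfrac{1}{2} r + \tfrac{1}{2}) r^{-d}\} < (\tfrac{1}{2} r + \tfrac{1}{2}) r^{-d}. \tag{3.13}$$

The desired inequality (3.10) follows on combining (3.11), (3.12) and (3.13), because $(\tfrac{1}{2} r + \tfrac{1}{2}) r^{-d} = \tfrac{1}{2}(1 + r^{-1}) r^{1-d}$. This completes the proof of Theorem 3.1.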
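The three-step addition procedure of Section 2 and the bounds of Theorems 2.1 and 3.1 can be checked numerically. The following Python sketch is not part of the original paper: it simulates a radix-$r$, $d$-digit machine without a guard digit in exact rational arithmetic, and the names `exponent`, `abbreviate`, `add_no_guard` and `bound` are hypothetical helpers introduced only for illustration. Ties in rounding are resolved away from zero, which is assumed here to be one of the rounding methods covered by the theorems.

```python
from fractions import Fraction
import math


def exponent(x, r):
    """Return the integer p with r**(p-1) <= |x| < r**p (x must be nonzero)."""
    p, ax = 0, abs(x)
    while ax >= Fraction(r) ** p:
        p += 1
    while ax < Fraction(r) ** (p - 1):
        p -= 1
    return p


def abbreviate(x, unit, mode):
    """Chop (toward zero) or round (ties away from zero) x to a multiple of unit."""
    q = abs(x) / unit
    n = q.numerator // q.denominator
    if mode == "round" and q - n >= Fraction(1, 2):
        n += 1
    return (n if x >= 0 else -n) * unit


def add_no_guard(a1, a2, r, d, mode):
    """Add two d-digit radix-r numbers following the three steps of Section 2."""
    a1, a2 = Fraction(a1), Fraction(a2)
    if abs(a1) < abs(a2):
        a1, a2 = a2, a1              # ensure |a1| >= |a2|
    if a1 == 0:
        return Fraction(0)
    # Steps 1 and 2: preshift a2 to the exponent of a1 and abbreviate it there,
    # so that no digit beyond the d places of a1 is kept (no guard digit).
    a2_hat = abbreviate(a2, Fraction(r) ** (exponent(a1, r) - d), mode)
    # Step 3: add, renormalize, and abbreviate the mantissa to d digits.
    t = a1 + a2_hat
    if t == 0:
        return Fraction(0)
    return abbreviate(t, Fraction(r) ** (exponent(t, r) - d), mode)


def bound(r, d, mode):
    """Right-hand side of (2.5) for chopping, or of (2.6) for rounding."""
    b = Fraction(r) ** (1 - d)
    return b if mode == "chop" else Fraction(1, 2) * (1 + Fraction(1, r)) * b


r, d = 10, 3

# Subtraction: 0.100e1 - 0.999e0 with chopping.
a1, a2 = Fraction(1), Fraction(-999, 1000)
s = a1 + a2
s_bar = add_no_guard(a1, a2, r, d, "chop")
print(s, s_bar)  # 1/1000 1/100
# Absolute-error form implied by Theorem 2.1.
print(abs(s_bar - s) <= (abs(a1) + abs(a2)) * bound(r, d, "chop"))  # True

# Addition of non-negative numbers with rounding: rp form of Theorem 3.1.
b1, b2 = Fraction(999, 1000), Fraction(456, 100000)
t_bar = add_no_guard(b1, b2, r, d, "round")
print(abs(math.log(float(t_bar / (b1 + b2)))) <= float(bound(r, d, "round")))  # True
```

With $r = 10$, $d = 3$ and chopping, the computed difference of 1.00 and 0.999 is 0.01 whereas the exact difference is 0.001, so the relative error is large. The absolute error 0.009 nevertheless lies within $(|a_1| + |a_2|) r^{1-d} = 0.01999$, which is what (2.4) and (2.5) imply.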

4. Sums and Inner Products

In this section we describe modifications that are needed in the analysis of Olver (1982) to include computing facilities that have no form of guard digit in the accumulator register. As in this reference, we make provision for two modes (or levels) of floating-point operation: a main mode, $\mathcal{M}$, for the main computations and a lower mode, $\mathcal{L}$, for error-bound computations. We continue to employ the symbols $r$, $d$ and $d_1$ to denote, respectively, the radix of the internal number system of the computing facility, the number of digits assigned to the mantissa in $\mathcal{M}$ and the number of digits assigned to the mantissa in $\mathcal{L}$. We also continue to use single bars to denote numbers stored in $\mathcal{M}$ and double bars to denote numbers stored in $\mathcal{L}$.

In the present paper, however, we define $\gamma$ to be a convenient member of $\mathcal{L}$ that equals or exceeds the common rp error bound for abbreviation, multiplication and division in $\mathcal{M}$, and also satisfies

$$\gamma \ge r^{1-d} \ \text{(chopping)}; \qquad \gamma \ge \tfrac{1}{2}(1 + r^{-1}) r^{1-d} \ \text{(rounding)}. \tag{4.1}$$

In consequence, the rules for multiplication and division in $\mathcal{M}$ are unchanged. Thus if $a_1$ and $a_2$ are any two real numbers stored in $\mathcal{M}$, then their computed product $\overline{a_1 a_2}$ and computed quotient $\overline{a_1/a_2}$ satisfy

$$\overline{a_1 a_2} \simeq a_1 a_2; \quad \mathrm{rp}(\gamma), \tag{4.2}$$

$$\overline{a_1/a_2} \simeq a_1/a_2; \quad \mathrm{rp}(\gamma), \tag{4.3}$$

provided that $a_2 \ne 0$ in the case of division. For addition (and subtraction) we now have

$$\overline{a_1 + a_2} \simeq a_1 + a_2; \quad \mathrm{ap}\{(|a_1| + |a_2|)\gamma\}, \tag{4.4}$$

this result being derived from (4.1) and Theorem 2.1.

Similarly, we define $\gamma_1$ to be a convenient member of $\mathcal{L}$ that equals or exceeds the common rp error bound for abbreviation, multiplication and division in $\mathcal{L}$, and satisfies

$$\gamma_1 \ge r^{1-d_1} \ \text{(chopping)}; \qquad \gamma_1 \ge \tfrac{1}{2}(1 + r^{-1}) r^{1-d_1} \ \text{(rounding)}. \tag{4.5}$$

In consequence, if $\alpha_1$ and $\alpha_2$ are two non-negative members of $\mathcal{L}$, then from the definition of rp we have

$$\alpha_1 \alpha_2 e^{-\gamma_1} \le \overline{\overline{\alpha_1 \alpha_2}} \le \alpha_1 \alpha_2 e^{\gamma_1}, \qquad (\alpha_1/\alpha_2) e^{-\gamma_1} \le \overline{\overline{\alpha_1/\alpha_2}} \le (\alpha_1/\alpha_2) e^{\gamma_1}, \tag{4.6}$$

provided that $\alpha_2 \ne 0$ in the case of division. Furthermore,

$$(\alpha_1 + \alpha_2) e^{-\gamma_1} \le \overline{\overline{\alpha_1 + \alpha_2}} \le (\alpha_1 + \alpha_2) e^{\gamma_1}, \tag{4.7}$$

this result being a consequence of (4.5) and Theorem 3.1. In other words, with the new definition of $\gamma_1$ the exponential rule formulated in Olver (1982, Section 3) continues to apply. Indeed, if we retain the conditions imposed on $\gamma$ and $\gamma_1$ in Olver (1982), then except for *(3.12)* the results in Sections 3, 4, 5, 8 and 9 of Olver (1982) apply unchanged.

The necessary modifications to the error bounds for sums and inner products given in Sections 6 and 7 of Olver (1982) are obtained by replacing formula *(6.4)* by (4.4) whenever addition takes place: here and in the rest of this section we use italics to indicate equation numbers that belong to Olver (1982).

In the case of sums we find that *(6.7)* applies, provided that the definition *(6.8)* of $\rho_m$ is changed to

[Equation (4.8): modified definition of $\rho_m$; the display is not recoverable from the scan.]

From this result and the exponential rule we see that the computable form *(6.15)* becomes

[Equation (4.9): modified computable bound; not recoverable from the scan.]

where the auxiliary quantity entering this bound is now defined by

[Equation (4.10): not recoverable from the scan.]

In evaluating this bound the grouping brackets must be obeyed.

In the case of inner products, we find that *(7.6)* remains valid, provided that *(7.7)* is replaced by

[Equation (4.11): modified definition of $\rho_m$ for inner products; only fragments survive in the scan.]

The computable form *(7.11)* also remains valid, provided that the definition *(7.10)* of the corresponding auxiliary quantity is changed to

[Equation (4.12): only fragments survive in the scan.]

Again, grouping brackets must be obeyed.

In summary, the essential modifications to the previous results for sums and inner products are to replace the term $|\bar{s}_m|\gamma$ in the error bounds by $(|a_1| + |a_2| + \cdots + |a_m|)\gamma$ in the case of sums, and by $(|a_1 b_1| + |a_2 b_2| + \cdots + |a_m b_m|)\gamma$ in the case of inner products, and to make a minor change in the compensating factors.

5. Conclusions

On computers that are not equipped with some form of guard digit in the accumulator register, the relative error in the computed difference of two floating-point numbers may be quite large. In consequence, a bound for the absolute error should be used. This paper supplies the necessary modifications of error bounds that were obtained for sums and inner products in an earlier paper. The main effect of the changes on the error bounds is to replace the absolute value of a sum by the corresponding sum of the absolute values of the individual terms. In consequence, the modified forms of bound are weaker, but not grossly so in general.

Corresponding modifications of the a posteriori error bounds for Gaussian elimination that were obtained recently by J. H. Wilkinson and the present writer can be constructed by straightforward application of the results for sums and inner products supplied in the present paper.

The writer is pleased to acknowledge helpful suggestions by D. W. Lozier and the referee. This work was supported by the U.S. Army Research Office, Durham, under Contract DAAG 29-80-C-0032 and by the National Science Foundation under Grant MCS 81-11725.

REFERENCES

OLVER, F. W. J. 1978 A new approach to error arithmetic. SIAM J. num. Analysis 15, 368-393.
OLVER, F. W. J. 1982 Further developments of rp and ap error analysis. IMA J. num. Analysis 2, 249-274.
OLVER, F. W. J. & WILKINSON, J. H. 1982 A posteriori error bounds for Gaussian elimination. IMA J. num. Analysis 2, 377-406.
WILKINSON, J. H. 1963 Rounding Errors in Algebraic Processes. National Physical Laboratory Notes on Appl. Sci. No. 32. London: Her Majesty's Stationery Office.
YOHE, J. M. 1973 Roundings in floating-point arithmetic. IEEE Trans. Comput. C-22, 577-586.
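As a supplementary illustration of the addition rule (4.4) in Section 4, the following Python sketch accumulates an ap bound during recursive summation on a simulated machine without guard digits. It is not taken from the paper. It assumes chopping with $\gamma = r^{1-d}$, assumes every term is exactly representable with $d$ digits, and evaluates the bound in exact rational arithmetic instead of a lower mode with compensating factors. The function names are hypothetical, and the running bound is simply (4.4) applied at each step, not a transcription of (4.8) to (4.10).

```python
from fractions import Fraction


def exponent(x, r):
    """Return the integer p with r**(p-1) <= |x| < r**p (x must be nonzero)."""
    p, ax = 0, abs(x)
    while ax >= Fraction(r) ** p:
        p += 1
    while ax < Fraction(r) ** (p - 1):
        p -= 1
    return p


def chop(x, unit):
    """Chop x toward zero to a multiple of unit."""
    q = abs(x) / unit
    n = q.numerator // q.denominator
    return (n if x >= 0 else -n) * unit


def add_no_guard(a1, a2, r, d):
    """Chopped addition without a guard digit (three steps of Section 2)."""
    if abs(a1) < abs(a2):
        a1, a2 = a2, a1
    if a1 == 0:
        return Fraction(0)
    a2_hat = chop(a2, Fraction(r) ** (exponent(a1, r) - d))
    t = a1 + a2_hat
    if t == 0:
        return Fraction(0)
    return chop(t, Fraction(r) ** (exponent(t, r) - d))


def sum_with_bound(terms, r, d):
    """Return (computed sum, ap bound) for left-to-right summation."""
    gamma = Fraction(r) ** (1 - d)          # assumption: chopping, see (4.1)
    s_bar, err = Fraction(terms[0]), Fraction(0)
    for a in terms[1:]:
        a = Fraction(a)
        # (4.4): each addition contributes at most (|s_bar| + |a|) * gamma.
        err += (abs(s_bar) + abs(a)) * gamma
        s_bar = add_no_guard(s_bar, a, r, d)
    return s_bar, err


terms = [Fraction(1), Fraction(-999, 1000), Fraction(123, 100000)]
s_bar, err = sum_with_bound(terms, 10, 3)
print(s_bar, sum(terms), err)
print(abs(s_bar - sum(terms)) <= err)                          # True
print(abs(s_bar - sum(terms)) <= abs(s_bar) * Fraction(1, 100))  # False
```

In this example the computed sum is 0.0112 while the exact sum is 0.00223. The accumulated bound built from sums of absolute values covers the error, whereas a bound proportional to the absolute value of the computed sum alone does not, which is the situation the modified bounds are designed to handle.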