Round-off Errors and Computer Arithmetic - (1.2)

1. Round-off Errors:

Round-off error is produced when a calculator or computer is used to perform real-number calculations. The arithmetic performed in a machine involves numbers with only a finite number of digits, so the calculated results are only approximations of the actual numbers. Formats for single, double and extended precision, and their standards, are given in the IEEE report Binary Floating Point Arithmetic Standard 754-1985. The following examples show how a calculator and a computer store and work with real numbers.

Example: Using a TI-89, compute the values of $(1 + 1/n)^n$ for four increasing powers of ten $n$, the third of which is $n = 10^{14}$.

[Table: $n$ versus the TI-89 value of $(1 + 1/n)^n$. The first two values begin 2.7182818284...; the values for $n = 10^{14}$ and for the largest $n$ are both 1.0.]

It is known that

$$\lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n = e = 2.7182818284590452354\ldots$$

Why are the last two values 1.0? What went wrong? Let us check out a few more numbers.

[Table: further TI-89 values of $(1 + 1/n)^n$, with a note on how the calculator treats each input. One of the computed values is larger than $e$, so it is no longer accurate.]

For $n = 10^{14}$ the calculator treats $1 + 10^{-14}$ as 1.0, and then

$$\left(1 + 10^{-14}\right)^{10^{14}} = 1.0^{10^{14}} = 1.0.$$

I think the precision of a TI-89 is 14 digits. So it either truncates the 15th digit if it is less than 5, or adds 1 to the 14th digit if the 15th digit is 5 or higher.

Example: Consider a PC that implements a 64-bit (binary digit) representation for a real number:

    1 bit    11 bits                   52 bits
    s        e_1 e_2 ... e_11          m_1 m_2 ... m_52
    sign     exponent c                mantissa f
             (characteristic)          (fraction)

where

$$c = e_1 2^{10} + e_2 2^{9} + \cdots + e_{11} 2^{0}, \qquad f = m_1 \left(\tfrac{1}{2}\right)^{1} + m_2 \left(\tfrac{1}{2}\right)^{2} + \cdots + m_{52} \left(\tfrac{1}{2}\right)^{52}.$$

The system gives a floating-point number of the form

$$(-1)^s \, 2^{c-1023} \, (1 + f).$$

Note that $0 \le c \le 2^{10} + 2^{9} + \cdots + 2^{0} = 2047$ and $0 \le f \le \tfrac{1}{2} + \cdots + \left(\tfrac{1}{2}\right)^{52} = 1 - 2^{-52}$. Let $c_{\max} = 2047$ and $f_{\max} = 1 - 2^{-52}$.
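The field layout above can be checked on a real machine. The following is a minimal sketch, assuming that a Python `float` is stored as a 64-bit IEEE 754 double (true on common platforms); the helper name `fields` and the sample value 180.0 are illustrative choices, not part of the notes. Note that the actual IEEE standard additionally reserves $c = 0$ and $c = 2047$ for special values, which the simplified model in these notes ignores.

```python
import struct

def fields(x: float):
    """Return (s, c, f) for the 64-bit pattern of x: sign, characteristic, fraction."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63                          # 1 sign bit
    c = (bits >> 52) & 0x7FF                # 11-bit characteristic
    f = (bits & ((1 << 52) - 1)) / 2**52    # 52-bit mantissa as a fraction in [0, 1)
    return s, c, f

s, c, f = fields(180.0)
print(s, c, f)                                    # 0 1030 0.40625
# Rebuild the number from the formula (-1)^s * 2^(c-1023) * (1+f).
print((-1) ** s * 2.0 ** (c - 1023) * (1 + f))    # 180.0
```

Printing `format(c, "011b")` shows the characteristic bit by bit if the binary pattern itself is wanted.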

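Since all machine numbers are of the form $(-1)^s 2^{c-1023}(1+f)$, the minimum positive number is

$$2^{0-1023}(1 + 0) = 2^{-1023} \approx 1.1125369292536 \times 10^{-308}$$

and the maximum number (in magnitude) is

$$2^{c_{\max}-1023}(1 + f_{\max}) = 2^{2047-1023}\left(2 - 2^{-52}\right) \approx 3.5953862697246 \times 10^{308}.$$

Any number $x$ occurring in a computation with $|x|$ below this minimum results in underflow and is represented by 0; any $x$ with $|x|$ above this maximum results in overflow, and the computation is stopped.

Example: Consider the machine number

    x:  0  10000000110  0110100000...0

Find the interval $I$ which contains all real numbers whose machine number is $x$. We need to find a lower bound $a$ and an upper bound $b$ for $x$; then $I = (a, b)$.

$$c = 2^{10} + 2^{2} + 2^{1} = 1030, \qquad f = 2^{-2} + 2^{-3} + 2^{-5} = 0.40625,$$

$$x = (-1)^0 \, 2^{1030-1023} \, (1 + 0.40625) = 180.0.$$

The next machine number smaller than $x$ is

    0  10000000110  0110011111...1

for which

$$f_a = 2^{-2} + 2^{-3} + 2^{-6} + 2^{-7} + \cdots + 2^{-52} = 2^{-2} + 2^{-3} + 2^{-6}\left(1 + 2^{-1} + \cdots + 2^{-46}\right) = 2^{-2} + 2^{-3} + 2^{-5}\left(1 - 2^{-47}\right) = f - 2^{-52},$$

$$a = 2^{1030-1023}\left(1 + f - 2^{-52}\right) = x - 2^{7} \cdot 2^{-52} = 180 - 2^{-45}.$$

The next machine number larger than $x$ is

    0  10000000110  0110100000...01

for which $f_b = f + 2^{-52}$ and

$$b = 2^{1030-1023}\left(1 + f + 2^{-52}\right) = x + 2^{-45} = 180 + 2^{-45}.$$

Since $2^{-45} \approx 2.8421709430404 \times 10^{-14}$, every real number whose machine number is $x$ lies in

$$(a, b) = \left(180 - 2.8421709430404 \times 10^{-14},\; 180 + 2.8421709430404 \times 10^{-14}\right).$$

Note that if $x = -180$ (1 10000000110 0110100000...0), then

$$(a, b) = \left(-180 - 2.8421709430404 \times 10^{-14},\; -180 + 2.8421709430404 \times 10^{-14}\right).$$

Given $x = 180$, its binary representation can be computed as follows. The number is positive, so $s = 0$. Since $2^7 = 128 \le 180 < 2^8 = 256$, subtract the largest power of 2 that fits and repeat the same idea on the remainder:

$$180 - 128 = 52, \quad 52 - 32 = 20, \quad 20 - 16 = 4, \quad 4 - 4 = 0,$$

so

$$180 = 2^7 + 2^5 + 2^4 + 2^2 = 2^7\left(1 + 2^{-2} + 2^{-3} + 2^{-5}\right), \qquad f = 2^{-2} + 2^{-3} + 2^{-5}.$$

Hence $c - 1023 = 7$ and $c = 1030$. Since $1030 - 2^{10} = 6$ and $6 = 2^2 + 2^1$, $c$ is 10000000110 in binary, and

    x:  0  10000000110  0110100000...0

Now let the machine numbers be represented in the normalized decimal floating-point form, as they are displayed on a screen, say as $k$-digit decimal machine numbers:

$$\pm 0.d_1 d_2 \ldots d_k \times 10^n, \quad \text{where } 1 \le d_1 \le 9 \text{ and } 0 \le d_i \le 9 \text{ for } i = 2, 3, \ldots, k.$$

Let $fl(x)$ be the floating-point form of $x$. Remember that $fl(x)$ is a machine approximation of the true value of $x$. Let $x = 0.d_1 d_2 \ldots d_k d_{k+1} d_{k+2} \ldots \times 10^n$.

Chopping method: $fl(x) = 0.d_1 d_2 \ldots d_k \times 10^n$.

Rounding method:

$$fl(x) = \begin{cases} 0.d_1 d_2 \ldots d_k \times 10^n & \text{if } d_{k+1} < 5, \\ \left(0.d_1 d_2 \ldots d_k + 10^{-k}\right) \times 10^n & \text{if } d_{k+1} \ge 5. \end{cases}$$

Example: Give the floating-point form of $\pi$ using (a) 5-digit chopping and (b) 5-digit rounding.

$$\pi = 3.14159265358979324\ldots = 0.314159265358979324\ldots \times 10^1$$

(a) $fl(\pi) = 0.31415 \times 10^1$
(b) $fl(\pi) = 0.31416 \times 10^1$

2. Absolute Error and Relative Error:

Let $p^*$ be an approximation to $p$. Then the absolute error is defined as $|p - p^*|$ and the relative error is defined as $\dfrac{|p - p^*|}{|p|}$, provided that $p \ne 0$.

Example: Let $p = \pi$ and $p^* = fl(\pi) = 0.31416 \times 10^1$. Find the absolute error and relative error of this approximation.

$$|\pi - 0.31416 \times 10^1| \approx 0.0000073464102, \qquad \frac{|\pi - 0.31416 \times 10^1|}{|\pi|} \approx 2.33843 \times 10^{-6}.$$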
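The chopping and rounding rules, and the two error measures, can be tried out with Python's `decimal` module. This is a minimal sketch rather than a definitive implementation: the function names `fl`, `abs_error` and `rel_error` are hypothetical, and the 15-digit value of $\pi$ is an assumed input used only for illustration.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def fl(x: Decimal, k: int, mode: str = "round") -> Decimal:
    """k-digit decimal floating-point form of x: mode is 'chop' or 'round'."""
    if x == 0:
        return x
    n = x.adjusted() + 1                  # exponent n in x = 0.d1d2... * 10^n
    quantum = Decimal(1).scaleb(n - k)    # place value of the k-th digit
    rounding = ROUND_DOWN if mode == "chop" else ROUND_HALF_UP
    return x.quantize(quantum, rounding=rounding)

def abs_error(p: Decimal, p_star: Decimal) -> Decimal:
    return abs(p - p_star)

def rel_error(p: Decimal, p_star: Decimal) -> Decimal:
    return abs(p - p_star) / abs(p)       # requires p != 0

PI = Decimal("3.14159265358979")          # assumed 15-digit value of pi
print(fl(PI, 5, "chop"), fl(PI, 5))       # 3.1415 3.1416
print(abs_error(PI, fl(PI, 5)))           # absolute error of the rounded value
print(rel_error(PI, fl(PI, 5)))           # relative error of the rounded value
```

`ROUND_DOWN` truncates toward zero, which matches chopping, and `ROUND_HALF_UP` adds one unit in the $k$-th digit exactly when $d_{k+1} \ge 5$.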

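Example: Let $p_1 = 0.31 \times 10^{-4}$ and $p_1^* = 0.30 \times 10^{-4}$; and let $p_2 = 0.31 \times 10^{4}$ and $p_2^* = 0.30 \times 10^{4}$. Compute the absolute error and relative error for each approximation.

$$|p_1^* - p_1| = |0.30 \times 10^{-4} - 0.31 \times 10^{-4}| = 0.000001, \qquad \frac{|p_1^* - p_1|}{|p_1|} = \frac{0.000001}{0.31 \times 10^{-4}} = 0.032258064516129$$

$$|p_2^* - p_2| = |0.30 \times 10^{4} - 0.31 \times 10^{4}| = 100.0, \qquad \frac{|p_2^* - p_2|}{|p_2|} = \frac{100.0}{0.31 \times 10^{4}} = 0.032258064516129$$

The absolute error of $p_2^*$ is much larger than the one for $p_1^*$, but the relative errors for $p_1^*$ and $p_2^*$ are the same. From this example we see that the absolute error depends on the magnitude of $p$, whereas the relative error does not. So the relative error is usually used to evaluate the closeness of the approximation.

3. Significant Digits:

The number $p^*$ is said to approximate $p$ to $t$ significant digits if $t$ is the largest nonnegative integer for which

$$\frac{|p - p^*|}{|p|} \le 5 \times 10^{-t}.$$

Example: Consider $p_1$, $p_1^*$, $p_2$ and $p_2^*$ in the previous example. In each case, find $t$.

For $i = 1$ and $i = 2$,

$$\frac{|p_i - p_i^*|}{|p_i|} = 0.032258064516129 = 3.2258064516129 \times 10^{-2} \le 5 \times 10^{-2},$$

while this value is larger than $5 \times 10^{-3}$. Hence $p_i^*$ approximates $p_i$ to 2 significant digits.

Usually, we do not know the exact value of $p$. If we know that the relative error when $p^*$ approximates $p$ is at most $10^{-t}$, and we know the value of $p^*$, we can find the largest interval containing $p$. Take $p > 0$. Since $\dfrac{|p - p^*|}{|p|} \le 10^{-t}$,

$$-10^{-t} p \le p - p^* \le 10^{-t} p \;\Longrightarrow\; \left(1 - 10^{-t}\right) p \le p^* \le \left(1 + 10^{-t}\right) p \;\Longrightarrow\; \frac{p^*}{1 + 10^{-t}} \le p \le \frac{p^*}{1 - 10^{-t}}.$$

If instead we know $p$ and the relative error when $p^*$ approximates $p$ is at most $10^{-t}$, how can we find the largest interval in which $p^*$ must lie? From the middle inequality above,

$$\left(1 - 10^{-t}\right) p \le p^* \le \left(1 + 10^{-t}\right) p.$$

Example: Find the largest interval containing $p$ if $p^* = 0.31416 \times 10^1$ is used to approximate $p$ with relative error at most $10^{-5}$ (which guarantees at least 5 significant digits).

$$\frac{0.31416 \times 10^1}{1 + 10^{-5}} \le p \le \frac{0.31416 \times 10^1}{1 - 10^{-5}} \;\Longrightarrow\; 0.314156858431416 \times 10^1 \le p \le 0.314163141631416 \times 10^1$$

Example: Find the largest interval in which $p^*$ must lie to approximate $\pi$ with relative error at most $10^{-5}$.

$$\left(1 - 10^{-5}\right)\pi \le p^* \le \left(1 + 10^{-5}\right)\pi \;\Longrightarrow\; 3.141561237663 \le p^* \le 3.141624069516$$

The loss of accuracy due to round-off error can often be avoided by a reformulation of the problem.

Example: Solve the equation $x^2 + 62.10x + 1 = 0$ using 4-digit rounding arithmetic.

We know that if $b^2 - 4ac > 0$, then the equation $ax^2 + bx + c = 0$ has two real solutions,

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}, \qquad x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}.$$

Now compute $x_1$ and $x_2$ step by step in 4-digit rounding arithmetic:

    Step  Expression                        Value
    1     b^2                               62.10 x 62.10 = 3856.41 -> 3856
    2     b^2 - 4ac                         3856 - 4 = 3852
    3     sqrt(b^2 - 4ac)                   sqrt(3852) = 62.06448... -> 62.06
    4     -b + sqrt(b^2 - 4ac)              -62.10 + 62.06 = -0.04
    5     2a                                2
    6     (-b + sqrt(b^2 - 4ac)) / (2a)     -0.04 / 2 = -0.02 = x_1
    7     -b - sqrt(b^2 - 4ac)              -62.10 - 62.06 = -124.16 -> -124.2
    8     (-b - sqrt(b^2 - 4ac)) / (2a)     -124.2 / 2 = -62.10 = x_2

True solutions (or solutions computed using $k$-digit rounding arithmetic with $k \gg 4$):

$$\sqrt{b^2 - 4ac} = \sqrt{62.10 \times 62.10 - 4} = 62.067785525\ldots$$

$$x_1 = \frac{-62.10 + 62.067785525}{2} = -0.0161072374, \qquad x_2 = \frac{-62.10 - 62.067785525}{2} = -62.0838927626$$

Relative errors:

$$\frac{|x_1 - x_1^*|}{|x_1|} = \frac{|-0.0161072374 + 0.02|}{0.0161072374} \approx 0.24168, \qquad \frac{|x_2 - x_2^*|}{|x_2|} = \frac{|-62.0838927626 + 62.10|}{62.0838927626} \approx 2.594 \times 10^{-4}$$

Why is the approximation of $x_1$ so poor? Note that Step 4 involves a subtraction of two nearly equal numbers. Check the relative error of this subtraction:

$$\frac{|\text{approximation} - \text{true difference}|}{|\text{true difference}|} = \frac{|-0.04 - (-62.10 + 62.067785525)|}{|-62.10 + 62.067785525|} \approx 0.2416778$$

This is the same relative error as for $x_1$: the division by $2a$ in Step 6 is exact, so all of the error comes from the subtraction. If the subtraction of two numbers of close magnitude can be avoided, then the accuracy of the computation of $x_1$ can be improved. Rewrite the formula for $x_1$:

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} = \frac{b^2 - (b^2 - 4ac)}{2a\left(-b - \sqrt{b^2 - 4ac}\right)} = \frac{-2c}{b + \sqrt{b^2 - 4ac}}$$

In 4-digit rounding arithmetic, $62.10 + 62.06 = 124.16$ rounds to $124.2$, so

$$x_1 = \frac{-2.000}{62.10 + 62.06} = \frac{-2.000}{124.2} = -0.01610, \qquad \frac{|x_1 - x_1^*|}{|x_1|} = \frac{|-0.0161072374 + 0.01610|}{0.0161072374} \approx 4.493 \times 10^{-4}.$$

Example: The nested method. Let

$$P_n(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0,$$

where the $a_i$ are real numbers. How many multiplications and additions are needed to evaluate $P_n(x_0)$? Rewrite $P_n(x)$:

$$P_n(x) = (a_n x + a_{n-1})\, x^{n-1} + a_{n-2} x^{n-2} + \cdots + a_1 x + a_0$$
$$= \big((a_n x + a_{n-1})\, x + a_{n-2}\big)\, x^{n-2} + a_{n-3} x^{n-3} + \cdots + a_1 x + a_0$$
$$\;\;\vdots$$
$$= \Big(\cdots\big((a_n x + a_{n-1})\, x + a_{n-2}\big)\, x + \cdots + a_1\Big)\, x + a_0$$

Each step of the form $(\,\cdot\,)\,x + a_i$ requires 1 multiplication and 1 addition, and there are $n$ of them. So in total $n$ multiplications and $n$ additions are needed. By the way, 1 multiplication together with 1 addition is also counted as 1 flop (floating point operation).

Example: For a cubic with leading coefficient 4, $P(x) = 4x^3 + a_2 x^2 + a_1 x + a_0$, the nested form is $P(x) = \big((4x + a_2)\,x + a_1\big)\,x + a_0$. To evaluate $P(x_0)$, compute $b_2 = 4x_0 + a_2$, then $b_1 = b_2 x_0 + a_1$, and finally $P(x_0) = b_1 x_0 + a_0$, using 3 multiplications and 3 additions.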
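The nested method above can be sketched in a few lines of Python. The function name `horner`, the operation counters, and the sample polynomial $2x^3 - 6x^2 + 2x - 1$ evaluated at $x_0 = 3$ are all hypothetical choices for illustration; they are not the example from the notes.

```python
def horner(coeffs, x0):
    """Evaluate a polynomial by the nested method.

    coeffs lists a_n, a_(n-1), ..., a_0 (highest degree first).
    Returns (value, multiplications, additions).
    """
    result = coeffs[0]
    mults = adds = 0
    for a in coeffs[1:]:
        result = result * x0 + a   # one multiplication and one addition per step
        mults += 1
        adds += 1
    return result, mults, adds

# Hypothetical cubic 2x^3 - 6x^2 + 2x - 1 at x0 = 3:
# ((2*3 - 6)*3 + 2)*3 - 1 = 5
print(horner([2, -6, 2, -1], 3))   # (5, 3, 3)
```

The counters confirm the operation count stated above: a degree-$n$ polynomial takes $n$ multiplications and $n$ additions.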
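The quadratic example earlier in this section can be replayed with Python's `decimal` module, which rounds every operation to a chosen number of significant digits. This is a sketch under stated assumptions: the function name `quadratic_4digit` is hypothetical, round-half-up is assumed for the 4-digit arithmetic, and the rewritten formula $-2c/(b + \sqrt{b^2 - 4ac})$ is used as given, which avoids cancellation only when $b > 0$.

```python
import math
from decimal import Decimal, localcontext, ROUND_HALF_UP

def quadratic_4digit(a: str, b: str, c: str, digits: int = 4):
    """Return (x1_naive, x2_naive, x1_rewritten) in `digits`-digit rounding arithmetic."""
    with localcontext() as ctx:
        ctx.prec = digits                  # every operation is rounded to `digits` digits
        ctx.rounding = ROUND_HALF_UP
        a, b, c = Decimal(a), Decimal(b), Decimal(c)
        root = (b * b - 4 * a * c).sqrt()
        x1_naive = (-b + root) / (2 * a)   # subtracts two nearly equal numbers
        x2_naive = (-b - root) / (2 * a)
        x1_rewritten = (-2 * c) / (b + root)  # no cancellation when b > 0
    return x1_naive, x2_naive, x1_rewritten

x1_naive, x2_naive, x1_rewritten = quadratic_4digit("1", "62.10", "1")
print(x1_naive, x2_naive, x1_rewritten)    # -0.02 -62.1 -0.01610

# Reference root in double precision, used only to measure the errors.
x1_true = (-62.10 + math.sqrt(62.10 ** 2 - 4)) / 2
for label, approx in (("naive", x1_naive), ("rewritten", x1_rewritten)):
    rel = abs(x1_true - float(approx)) / abs(x1_true)
    print(label, rel)
```

The naive formula loses almost a quarter of the value of $x_1$, while the rewritten formula is accurate to roughly the precision of the arithmetic itself.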