Round-off Errors and Computer Arithmetic - (1.2)

1. Round-off Errors:

Round-off error is produced when a calculator or computer is used to perform real-number calculations. The arithmetic performed in a machine involves numbers with only a finite number of digits, so the calculated results are only approximations of the actual numbers. Formats for single, double and extended precision, and their standards, are given in the IEEE report Binary Floating Point Arithmetic Standard 754-1985. Any real number can be represented by four values as follows:

$$(-1)^{\text{sign}} \times \text{significand} \times b^{\text{exponent}}$$

The designers of the IEEE 754 binary floating point arithmetic standard selected a base of $b = 2$ because it avoided the sudden jumps in representable values suffered by larger bases. They then took the position that the 32-bit format would set aside 1 bit for the sign (sign = 0 or 1), 8 bits for the exponent and 23 bits for the significand. This format represents magnitudes in the range from $2^{-128}$ (about $10^{-39}$) to $2^{127}$ (about $10^{38}$). It also provides a precision of about 7 decimal digits, since $2^{24}$ is about 16,000,000 (16,777,216). The 64-bit format sets aside three additional bits for the exponent (i.e. 11 bits) while leaving 52 bits for the significand. This provides a considerably increased range of roughly $10^{-300}$ to $10^{300}$ and a rough doubling of precision to about 15 decimal digits.

Let us look through the following examples to see how a calculator and a computer store and work with real numbers.

Example: Using a TI-89, compute the values of $(1 + 1/n)^n$ for $n = 10^{12}, 10^{13}, 10^{14}, 10^{15}$.

  $n = 10^{12}$: 2.7182818284577
  $n = 10^{13}$: 2.7182818284589
  $n = 10^{14}$: 1.0
  $n = 10^{15}$: 1.0

It is known that $\lim_{n \to \infty} (1 + 1/n)^n = e = 2.718281828459\ldots$ Why is $(1 + 1/n)^n = 1.0$ for $n = 10^{14}$ and $n = 10^{15}$? What went wrong?

Checking a few more values of $n$ shows that the calculator replaces an input it cannot hold exactly by a nearby 14-digit number before carrying out the computation. One of the resulting values is larger than $e$. Since $(1 + 1/n)^n$ increases toward $e$ from below, such a value is no longer accurate.

The calculator treats $1 + 10^{-14}$ as 1.0, and then $1.0^{10^{14}} = 1.0$. The precision of a TI-89 is 14 digits, so it either truncates the 15th digit if it is less than 5 or adds 1 to the 14th digit if it is 5 or higher.

Example: Using MatLab 14, compute the values of $(1 + 1/n)^n$ for $n = 10^{12}, 10^{13}, 10^{14}, 10^{15}$:

  x=[10^12; 10^13; 10^14; 10^15]; y=(1+1./x).^x
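The same experiment can be sketched in Python, used here only for illustration since the notes themselves rely on a TI-89 and MATLAB. The helper names `limit_double` and `limit_decimal14` are hypothetical. `limit_double` uses ordinary 64-bit floats; `limit_decimal14` uses the standard `decimal` module with 14 significant digits as a rough stand-in for a 14-digit calculator, and it is an assumption that this resembles the TI-89's internal arithmetic.

```python
from decimal import Context, Decimal, ROUND_HALF_UP


def limit_double(n):
    """Evaluate (1 + 1/n)**n in 64-bit floating point (Python float)."""
    n = float(n)
    return (1.0 + 1.0 / n) ** n


# Assumed calculator model: 14 significant decimal digits, round half up.
CTX14 = Context(prec=14, rounding=ROUND_HALF_UP)


def limit_decimal14(n):
    """Evaluate (1 + 1/n)**n with every operation rounded to 14 digits."""
    n = Decimal(n)
    base = CTX14.add(Decimal(1), CTX14.divide(Decimal(1), n))
    return CTX14.power(base, n)


for k in (12, 13, 14, 15):
    print(f"n = 10^{k}: double        = {limit_double(10**k):.14f}")

# With 14 digits, 1 + 10^-14 is already rounded to exactly 1.
for k in (14, 15):
    print(f"n = 10^{k}: 14-digit model = {limit_decimal14(10**k)}")
```

In both models the damage is done before the power is taken: `1 + 1/n` is rounded to the nearest representable number, and raising it to the power `n` magnifies that small rounding error.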

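MATLAB returns

  y = 2.71852349603724
      2.71611003408690
      2.71611003408702
      3.03503520654926

while exp(1) returns 2.71828182845905.

From these two examples, we know that $e$ is not computed as $(1 + 1/n)^n$ for very large $n$ in MatLab 14 or on the TI-89. How is the value of $e$ computed?

Example: Consider a PC that implements a 64-bit (binary digit) representation for a real number:

  sign: 1 bit, $s$
  exponent: 11 bits $e_1 e_2 \ldots e_{11}$, giving $c$ (the characteristic)
  mantissa (significand): 52 bits $m_1 m_2 \ldots m_{52}$, giving $f$ (the fraction)

$$c = e_1 2^{10} + e_2 2^{9} + \cdots + e_{10} 2^{1} + e_{11} 2^{0}, \qquad f = m_1 \left(\tfrac{1}{2}\right)^{1} + m_2 \left(\tfrac{1}{2}\right)^{2} + \cdots + m_{52} \left(\tfrac{1}{2}\right)^{52}$$

The system gives a floating-point number of the form

$$(-1)^s \, 2^{c - 1023} \, (1 + f).$$

Note that

$$0 \le c \le 2^{10} + 2^{9} + \cdots + 2^{1} + 1 = 2^{11} - 1 = 2047, \qquad 0 \le f \le \tfrac{1}{2} + \cdots + \left(\tfrac{1}{2}\right)^{52} = 1 - 2^{-52} < 1.$$

Let $c_{\max} = 2047$ and $f_{\max} = 1 - 2^{-52}$. Since all machine numbers are of the form $(-1)^s 2^{c-1023}(1+f)$, the minimum number in magnitude is

$$2^{0 - 1023}(1 + 0) = 2^{-1023} \approx 1.1125369292536 \times 10^{-308}$$

and the maximum number in magnitude is

$$2^{c_{\max} - 1023}(1 + f_{\max}) = 2^{2047 - 1023}\left(1 + 1 - 2^{-52}\right) = 2^{1024}\left(2 - 2^{-52}\right) \approx 3.5953862697246 \times 10^{308}.$$

(IEEE 754 itself reserves $c = 0$ and $c = 2047$ for zero, subnormal numbers, infinities and NaN, so the limits of an actual double are somewhat narrower than in this simplified model.)

Any number $x$ occurring in a computation with $|x| < 2^{-1023}$ results in underflow and is represented by 0; one with $|x|$ larger than the maximum results in overflow, and the computation will be stopped.

Example: Consider the machine number $x$:

  0 10000000110 011010000...0 (52 fraction digits)

Find the interval $I$ which contains all real numbers whose machine number is $x$. We need to find a lower bound $a$ and an upper bound $b$ of $x$, and then $I = (a, b)$.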
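The decoding rule $(-1)^s 2^{c-1023}(1+f)$ can be sketched in Python with exact rational arithmetic. This follows the simplified model of these notes, in which every value of $c$ from 0 to 2047 is treated as an ordinary exponent; the function name `decode_bits` is hypothetical.

```python
from fractions import Fraction


def decode_bits(sign_bit, exponent_bits, fraction_bits):
    """Exact value of a 64-bit pattern under the simplified model
    (-1)^s * 2^(c - 1023) * (1 + f), with no special cases for c = 0 or 2047."""
    if len(sign_bit) != 1 or len(exponent_bits) != 11 or len(fraction_bits) != 52:
        raise ValueError("expected 1 + 11 + 52 bits")
    s = int(sign_bit, 2)
    c = int(exponent_bits, 2)                      # characteristic, 0..2047
    f = Fraction(int(fraction_bits, 2), 2**52)     # fraction, 0 <= f < 1
    return (-1) ** s * Fraction(2) ** (c - 1023) * (1 + f)


# c = 1023 and f = 0 give the number 1.
one = decode_bits("0", "01111111111", "0" * 52)

# Extremes of the simplified model.
smallest = decode_bits("0", "0" * 11, "0" * 52)
largest = decode_bits("0", "1" * 11, "1" * 52)
print(one, smallest == Fraction(1, 2**1023), largest == 2**1024 * (2 - Fraction(1, 2**52)))
```

Using `Fraction` keeps every value exact, so the extremes $2^{-1023}$ and $2^{1024}(2 - 2^{-52})$ can be checked even though the latter is too large for a Python float.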

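$$c = 2^{10} + 2^{2} + 2^{1} = 1030, \qquad f = 2^{-2} + 2^{-3} + 2^{-5} = 0.40625,$$

$$x = (-1)^0 \, 2^{1030 - 1023} (1 + 0.40625) = 2^{7} \times 1.40625 = 180.0$$

The next machine number smaller than $x$ is

  0 10000000110 011001111...1 (52 fraction digits)

for which

$$f_a = 2^{-2} + 2^{-3} + 2^{-6} + 2^{-7} + \cdots + 2^{-52} = 0.40625 - 2^{-52}, \qquad a = 2^{7}\left(1 + 0.40625 - 2^{-52}\right) = 180 - 2^{-45}.$$

The next machine number larger than $x$ is

  0 10000000110 011010000...01 (52 fraction digits)

for which

$$f_b = 0.40625 + 2^{-52}, \qquad b = 2^{7}\left(1 + 0.40625 + 2^{-52}\right) = 180 + 2^{-45}, \qquad 2^{-45} \approx 2.8421709430404 \times 10^{-14}.$$

Hence all real numbers whose machine number is $x$ lie in

$$(a, b) = \left(180 - 2.8421709430404 \times 10^{-14},\ 180 + 2.8421709430404 \times 10^{-14}\right).$$

Example: Given $x = 180$, find the binary representation of $x$. Since $x$ is positive, $s = 0$.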
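Both the bit pattern of 180 and its neighbouring machine numbers can be cross-checked in Python with the standard `struct` module. This is a sketch that assumes Python floats are IEEE 754 doubles (true on common platforms); for an ordinary number such as 180 the IEEE fields agree with the simplified model used here. The helper names `bit_fields` and `neighbours` are hypothetical.

```python
import struct


def bit_fields(x):
    """Split a float into its sign bit, 11 exponent bits and 52 fraction bits."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = format(raw, "064b")
    return bits[0], bits[1:12], bits[12:]


def neighbours(x):
    """Adjacent machine numbers below and above a positive finite float,
    obtained by stepping the 64-bit pattern down and up by one."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    (below,) = struct.unpack(">d", struct.pack(">Q", raw - 1))
    (above,) = struct.unpack(">d", struct.pack(">Q", raw + 1))
    return below, above


sign, exponent, fraction = bit_fields(180.0)
print(sign, exponent, fraction)

below, above = neighbours(180.0)
print(180.0 - below, above - 180.0)   # both gaps equal 2**-45
```

Stepping the integer bit pattern by one is a simple way to reach the adjacent machine number without relying on version-specific library functions.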

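Subtract the largest possible power of 2 at each step ($2^8 = 256 > 180$, so start with $2^7 = 128$, and use the same idea below):

$$180 - 2^{7} = 52, \qquad 52 - 2^{5} = 20, \qquad 20 - 2^{4} = 4, \qquad 4 - 2^{2} = 0.$$

So

$$180 = 2^{7} + 2^{5} + 2^{4} + 2^{2} = 2^{7}\left(1 + 2^{-2} + 2^{-3} + 2^{-5}\right), \qquad f = 2^{-2} + 2^{-3} + 2^{-5}.$$

Hence $c - 1023 = 7$, so $c = 1030 = 1024 + 6 = 2^{10} + 2^{2} + 2^{1}$, and

  x = 0 10000000110 011010000...0

2. Chopping and Rounding Methods:

Now let the machine numbers be represented in the normalized decimal floating-point form, as they are displayed on a screen, say as $k$-digit decimal machine numbers:

$$\pm 0.d_1 d_2 \ldots d_k \times 10^{n}, \qquad 1 \le d_1 \le 9, \quad 0 \le d_i \le 9 \ \text{for } i = 2, 3, \ldots, k.$$

Let $fl(x)$ be the floating-point form of $x$. Remember that $fl(x)$ is a machine approximation of the true value of $x$. Let $x = 0.d_1 d_2 \ldots d_k d_{k+1} d_{k+2} \ldots \times 10^{n}$.

Chopping method: $fl(x) = 0.d_1 d_2 \ldots d_k \times 10^{n}$

Rounding method:

$$fl(x) = \begin{cases} 0.d_1 d_2 \ldots d_k \times 10^{n} & \text{if } d_{k+1} < 5 \\ 0.d_1 d_2 \ldots d_k \times 10^{n} + 10^{n-k} & \text{if } d_{k+1} \ge 5 \end{cases}$$

Example: Give the floating-point form of $\pi$ using (a) 5-digit chopping and (b) 5-digit rounding.

$$\pi = 3.14159265358979\ldots = 0.314159265358979\ldots \times 10^{1}$$

(a) $fl(\pi) = 0.31415 \times 10^{1}$
(b) $fl(\pi) = 0.31416 \times 10^{1}$

3. Absolute Error and Relative Error:

Let $p^*$ be an approximation to $p$. Then the absolute error is defined as $|p - p^*|$ and the relative error is defined as $\dfrac{|p - p^*|}{|p|}$, provided that $p \ne 0$.

Example: Let $p = \pi$ and $p^* = fl(\pi) = 0.31416 \times 10^{1}$. Find the absolute error and relative error of this approximation.

$$|\pi - 0.31416 \times 10^{1}| \approx 7.34641 \times 10^{-6}, \qquad \frac{|\pi - 0.31416 \times 10^{1}|}{|\pi|} \approx 2.33843 \times 10^{-6}$$

Example: Let $p_1 = 0.31 \times 10^{-4}$ and $p_1^* = 0.30 \times 10^{-4}$; and let $p_2 = 0.31 \times 10^{4}$ and $p_2^* = 0.30 \times 10^{4}$. Compute the absolute error and relative error for each approximation.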
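The chopping and rounding rules and the two error measures defined above can be sketched in Python with the standard `decimal` module. The function names are hypothetical; `ROUND_DOWN` plays the role of chopping and `ROUND_HALF_UP` the role of the rounding rule (add $10^{n-k}$ when $d_{k+1} \ge 5$).

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP


def fl(x, k, rounding):
    """Reduce x (a decimal string or Decimal) to k significant digits."""
    return Context(prec=k, rounding=rounding).plus(Decimal(x))


def fl_chop(x, k):
    """k-digit chopping: discard d_{k+1}, d_{k+2}, ..."""
    return fl(x, k, ROUND_DOWN)


def fl_round(x, k):
    """k-digit rounding: round up in magnitude when d_{k+1} >= 5."""
    return fl(x, k, ROUND_HALF_UP)


def absolute_error(p, p_star):
    return abs(Decimal(p) - Decimal(p_star))


def relative_error(p, p_star):
    p = Decimal(p)
    if p == 0:
        raise ValueError("relative error is undefined for p = 0")
    return absolute_error(p, p_star) / abs(p)


PI = "3.14159265358979"          # pi truncated to 15 digits
print(fl_chop(PI, 5))            # 3.1415
print(fl_round(PI, 5))           # 3.1416
print(absolute_error(PI, fl_round(PI, 5)))
print(relative_error(PI, fl_round(PI, 5)))
```

Passing the numbers as decimal strings avoids introducing a second, binary round-off error while studying the decimal one.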

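$$|p_1 - p_1^*| = |0.31 \times 10^{-4} - 0.30 \times 10^{-4}| = 0.000001, \qquad \frac{|p_1 - p_1^*|}{|p_1|} = \frac{0.01 \times 10^{-4}}{0.31 \times 10^{-4}} = 0.03225806451\ldots$$

$$|p_2 - p_2^*| = |0.31 \times 10^{4} - 0.30 \times 10^{4}| = 100.0, \qquad \frac{|p_2 - p_2^*|}{|p_2|} = \frac{0.01 \times 10^{4}}{0.31 \times 10^{4}} = 0.03225806451\ldots$$

The absolute error of $p_2^*$ is much larger than the one for $p_1^*$, while the relative errors for $p_1^*$ and $p_2^*$ are the same. From this example, we see that the absolute error depends on the magnitude of $p$; the relative error, on the other hand, does not. So the relative error is usually used to evaluate the closeness of an approximation.

4. Significant Digits:

The number $p^*$ is said to approximate $p$ to $t$ significant digits if $t$ is the largest nonnegative integer for which

$$\frac{|p - p^*|}{|p|} \le 5 \times 10^{-t}.$$

Example: Consider $p_1$, $p_1^*$, $p_2$ and $p_2^*$ in the previous example. In each case, find $t$.

For $i = 1$ and $i = 2$,

$$\frac{|p_i - p_i^*|}{|p_i|} = 0.03225806451\ldots = 3.225806451\ldots \times 10^{-2} \le 5 \times 10^{-2},$$

while the relative error exceeds $5 \times 10^{-3}$. Hence $p_i^*$ approximates $p_i$ to 2 significant digits.

Usually, we do not know the exact value of $p$. If we know that the relative error when $p^*$ approximates $p$ is at most $10^{-t}$, and we know the value of $p^*$, we can find the largest interval containing $p$. Taking $p > 0$:

$$\frac{|p - p^*|}{|p|} \le 10^{-t} \;\Longrightarrow\; 1 - 10^{-t} \le \frac{p^*}{p} \le 1 + 10^{-t} \;\Longrightarrow\; \frac{p^*}{1 + 10^{-t}} \le p \le \frac{p^*}{1 - 10^{-t}}.$$

If instead we know $p$ and that the relative error when $p^*$ approximates $p$ is at most $10^{-t}$, the largest interval in which $p^*$ must lie follows from $|p - p^*| \le 10^{-t} |p|$:

$$p\left(1 - 10^{-t}\right) \le p^* \le p\left(1 + 10^{-t}\right).$$

Example: Find the largest interval containing $p$ if $p^* = 0.31416 \times 10^{1}$ approximates $p$ with relative error at most $10^{-5}$ (so to at least 5 significant digits).

$$\frac{0.31416 \times 10^{1}}{1 + 10^{-5}} \le p \le \frac{0.31416 \times 10^{1}}{1 - 10^{-5}}, \qquad 0.314156858431416 \times 10^{1} \le p \le 0.314163141631416 \times 10^{1}$$

Example: Find the largest interval in which $p^*$ must lie to approximate $\pi$ with relative error at most $10^{-5}$.

$$\pi\left(1 - 10^{-5}\right) \le p^* \le \pi\left(1 + 10^{-5}\right), \qquad 3.141561237663 \le p^* \le 3.141624069516$$

The loss of accuracy due to round-off error can often be avoided by a reformulation of the problem.

Example: Solve the equation $x^2 + 62.10x + 1 = 0$ using 4-digit rounding arithmetic.

We know that if $b^2 - 4ac > 0$ then the equation $ax^2 + bx + c = 0$ has two real solutions, and they are

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}, \qquad x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}.$$

Now compute $x_1$ and $x_2$ step by step in 4-digit rounding arithmetic:

  Step 1. $b^2$: $62.10 \times 62.10 = 3856.41 \to 3856$
  Step 2. $b^2 - 4ac$: $3856 - 4(1)(1) = 3852$
  Step 3. $\sqrt{b^2 - 4ac}$: $\sqrt{3852} = 62.06448\ldots \to 62.06$
  Step 4. $-b + \sqrt{b^2 - 4ac}$: $-62.10 + 62.06 = -0.04$
  Step 5. $2a$: $2(1) = 2$
  Step 6. $x_1 = \dfrac{-b + \sqrt{b^2 - 4ac}}{2a}$: $-0.04 / 2 = -0.02$
  Step 7. $-b - \sqrt{b^2 - 4ac}$: $-62.10 - 62.06 = -124.16 \to -124.2$
  Step 8. $x_2 = \dfrac{-b - \sqrt{b^2 - 4ac}}{2a}$: $-124.2 / 2 = -62.10$

True solutions (or solutions computed using $k$-digit rounding arithmetic with $k > 4$):

$$\sqrt{b^2 - 4ac} = \sqrt{62.10 \times 62.10 - 4(1)(1)} = 62.0677855\ldots$$

$$x_1 = \frac{-62.10 + 62.0677855\ldots}{2} = -0.0161072374\ldots, \qquad x_2 = \frac{-62.10 - 62.0677855\ldots}{2} = -62.0838927626\ldots$$

Relative errors:

$$\frac{|x_1 - x_1^*|}{|x_1|} = \frac{|-0.0161072374 + 0.02|}{0.0161072374} \approx 0.24168, \qquad \frac{|x_2 - x_2^*|}{|x_2|} = \frac{|-62.0838927626 + 62.10|}{62.0838927626} \approx 2.594 \times 10^{-4}$$

Why is the approximation of $x_1$ so poor? Note that Step 4 involves a subtraction of two nearly equal numbers, so it is worth checking the relative error of that subtraction on its own.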
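The eight steps of the 4-digit computation can be replayed in Python. This is a sketch that models "4-digit rounding arithmetic" with a `decimal` context of precision 4, so that every individual operation is rounded to 4 significant digits; the names `CTX4` and `quadratic_4digit` are hypothetical.

```python
import math
from decimal import Context, Decimal, ROUND_HALF_UP

# Every operation performed through CTX4 is rounded to 4 significant digits.
CTX4 = Context(prec=4, rounding=ROUND_HALF_UP)


def quadratic_4digit(a, b, c):
    """Standard quadratic formula, with each step rounded to 4 digits."""
    a, b, c = Decimal(a), Decimal(b), Decimal(c)
    b2 = CTX4.multiply(b, b)                                    # step 1
    four_ac = CTX4.multiply(CTX4.multiply(Decimal(4), a), c)
    disc = CTX4.subtract(b2, four_ac)                           # step 2
    root = CTX4.sqrt(disc)                                      # step 3
    two_a = CTX4.multiply(Decimal(2), a)                        # step 5
    x1 = CTX4.divide(CTX4.add(CTX4.minus(b), root), two_a)      # steps 4, 6
    x2 = CTX4.divide(CTX4.subtract(CTX4.minus(b), root), two_a) # steps 7, 8
    return x1, x2


def relative_error(true_value, approx):
    return abs(true_value - float(approx)) / abs(true_value)


x1, x2 = quadratic_4digit("1", "62.10", "1")

# Reference roots in double precision, far more accurate than 4 digits.
root_true = math.sqrt(62.10**2 - 4.0)
x1_true = (-62.10 + root_true) / 2.0
x2_true = (-62.10 - root_true) / 2.0

print(x1, relative_error(x1_true, x1))
print(x2, relative_error(x2_true, x2))
```

The two roots come out of the same formula with the same 4-digit arithmetic, yet the root that requires subtracting nearly equal numbers is far less accurate than the other.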

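For the subtraction in Step 4:

$$\frac{|\text{approximation} - \text{true difference}|}{|\text{true difference}|} = \frac{|-0.04 - (-62.10 + 62.0677855\ldots)|}{|-62.10 + 62.0677855\ldots|} \approx 0.24168$$

If the subtraction of two numbers of close magnitude can be avoided, then the accuracy of the computation of $x_1$ can be improved. Rewrite the formula for $x_1$:

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} = \frac{b^2 - (b^2 - 4ac)}{2a\left(-b - \sqrt{b^2 - 4ac}\right)} = \frac{-2c}{b + \sqrt{b^2 - 4ac}}$$

In 4-digit rounding arithmetic, $62.10 + 62.06 = 124.16 \to 124.2$, so

$$x_1 = \frac{-2.000}{124.2} = -0.0161030\ldots \to -0.01610, \qquad \frac{|-0.0161072374 + 0.01610|}{0.0161072374} \approx 4.49 \times 10^{-4}.$$

Example: The nested method. Let $P_n(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$, where the $a_i$ are real numbers. How many multiplications and additions are needed to evaluate $P_n(x_0)$?

Rewrite $P_n(x)$:

$$P_n(x) = \left(a_n x^{n-1} + a_{n-1} x^{n-2} + \cdots + a_1\right)x + a_0 = \left(\left(a_n x^{n-2} + a_{n-1} x^{n-3} + \cdots + a_2\right)x + a_1\right)x + a_0 = \cdots = \left(\cdots\left(\left(a_n x + a_{n-1}\right)x + a_{n-2}\right)x + \cdots + a_1\right)x + a_0$$

Each step of the form $(\cdot)\,x + a_{i-1}$ requires 1 multiplication and 1 addition, and there are $n$ of them. So in total $n$ multiplications and $n$ additions are needed. By the way, 1 multiplication together with 1 addition is counted as 1 flop (floating point operation).

Example: $P(x) = 4x^{\ldots} + \ldots$ Evaluate $P(\ldots)$ in nested form: start from the leading coefficient 4, then repeatedly multiply by the evaluation point and add the next coefficient.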
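The nested method can be sketched in Python as follows. The function name `horner` and the sample polynomial $2x^3 - 6x^2 + 2x - 1$ are illustrative assumptions, not taken from the notes; the function also counts operations so that the "$n$ multiplications and $n$ additions" claim can be checked directly.

```python
def horner(coeffs, x0):
    """Evaluate a polynomial by the nested method.

    coeffs lists the coefficients from a_n down to a_0.
    Returns (P(x0), number of multiplications, number of additions).
    """
    if not coeffs:
        raise ValueError("at least one coefficient is required")
    result = coeffs[0]
    multiplications = additions = 0
    for a in coeffs[1:]:
        result = result * x0 + a      # one multiplication and one addition
        multiplications += 1
        additions += 1
    return result, multiplications, additions


# Illustrative polynomial (an assumption): 2x^3 - 6x^2 + 2x - 1 at x0 = 3.
value, mults, adds = horner([2, -6, 2, -1], 3)
print(value, mults, adds)   # 5 3 3
```

For a degree-3 polynomial the loop runs three times, which matches the count of $n$ multiplications and $n$ additions derived above.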