ECS 231 Computer Arithmetic 1 / 27

Size: px

Start display at page:

Download "ECS 231 Computer Arithmetic 1 / 27"

Sherilyn O’Connor’
5 years ago
Views:

1 ECS 231 Computer Arithmetic 1 / 27

2 Outline 1 Floating-point numbers and representations 2 Floating-point arithmetic 3 Floating-point error analysis 4 Further reading 2 / 27

3 Outline 1 Floating-point numbers and representations 2 Floating-point arithmetic 3 Floating-point error analysis 4 Further reading 3 / 27

4 Floating-point numbers and representations 1. Floating-point (FP) representation of numbers (scientific notation): sign significand base exponent 2. FP representation of a nonzero binary number: x = ± b 0.b 1 b 2 b p 1 2 E. (1) It is normalized, i.e., b0 = 1 (the hidden bit) Precision (= p) is the number of bits in the significand (mantissa) (including the hidden bit). Machine epsilon ɛ = 2 (p 1), the gap between the number 1 and the smallest FP number that is greater than Special numbers: 0, 0,,, NaN(= Not a Number ). 4 / 27

5 IEEE standard All computers designed since 1985 use the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std ), represent each number as a binary number and use binary arithmetic. Essentials of the IEEE standard: consistent representation of FP numbers correctly rounded FP operations (using various rounding modes) consistent treatment of exceptional situation such as division by zero. 5 / 27

6 IEEE single precision format Single format takes 32 bits (=4 bytes) long: 8 23 s E f sign exponent It represents the number binary point ( 1) s (1.f) 2 E 127 fraction The leading 1 in the fraction need not be stored explicitly since it is always 1 (hidden bit) E min = ( ) 2 = (1) 10, E max = ( ) 2 = (254) 10. E 127 in exponent avoids the need for storage of a sign bit. The range of positive normalized numbers: N min = Emin 127 = N max = Emax Special repsentations for 0, ± and NaN. 6 / 27

7 IEEE double pecision format Double format takes 64 bits (= 8 bytes) long: s E f sign exponent It represents the numer binary point ( 1) s (1.f) 2 E 1023 fraction The range of positive normalized numbers is from N min = N max = Special repsentations for 0, ± and NaN. 7 / 27

8 Summary I Precision and machine epsilon of IEEE single, double and extended formats Format Precision p Machine epsilon ɛ = 2 p 1 single 24 ɛ = double 53 ɛ = extended 64 ɛ = Extra: Higham s lecture for additional formats, such as half (16 bits) and quadruple (128 bits). 8 / 27

9 Rounding modes Let a positive real number x be in the normalized range, i.e., N min x N max, and write in the normalized form x = (1.b 1 b 2 b p 1 b p b p+1...) 2 E, Then the closest fp number less than or equal to x is i.e., x is obtained by truncating. x = 1.b 1 b 2 b p 1 2 E The next fp number bigger than x (also the next one that bigger than x) is x + = ((1.b 1 b 2 b p 1 ) + ( )) 2 E If x is negative, the situtation is reversed. 9 / 27

10 Correctly rounding modes: round down: round(x) = x round up: round(x) = x + round towards zero: round(x) = x if x 0 round(x) = x + if x 0 round to nearest: round(x) = x or x + whichever is nearer to x. 1 1 except that if x > N max, round(x) =, and if x < N max, round(x) =. In the case of tie, i.e., x and x + are the same distance from x, the one with its least significant bit equal to zero is chosen. 10 / 27

11 Rounding error When the round to nearest (IEEE default rounding mode) is in effect, Therefore, we have relerr = relerr(x) = round(x) x x 1 2 ɛ = , single , double. 11 / 27

12 Outline 1 Floating-point numbers and representations 2 Floating-point arithmetic 3 Floating-point error analysis 4 Further reading 12 / 27

13 Floating-point arithmetic IEEE rules for correctly rounded fp operations: if x and y are correctly rounded fp numbers, then fl(x + y) = round(x + y) = (x + y)(1 + δ) fl(x y) = round(x y) = (x y)(1 + δ) fl(x y) = round(x y) = (x y)(1 + δ) fl(x/y) = round(x/y) = (x/y)(1 + δ) where δ 1 2 ɛ IEEE standard also requires that correctly rounded remainder and square root operations be provided. 13 / 27

14 Floating-point arithmetic, cont d IEEE standard response to exceptions Event Example Set result to Invalid operation 0/0, 0 NaN Division by zero Finite nonzero/0 ± Overflow x > N max ± or ±N max underflow x 0, x < N min ±0, ±N min or subnormal Inexact whenever fl(x y) x y correctly rounded value 14 / 27

15 Floating-point arithmetic error Let x and ŷ be the fp numbers and that x = x(1 + τ 1 ) and ŷ = y(1 + τ 2 ), for τ i τ 1 where τ i could be the relative errors in the process of collecting/getting the data from the original source or the previous operations, and Question: how do the four basic arithmetic operations behave? 15 / 27

16 Floating-point arithmetic error: +, Addition and subtraction: fl( x + ŷ) = ( x + ŷ)(1 + δ) = x(1 + τ 1)(1 + δ) + y(1 + τ 2)(1 + δ) = x + y + x(τ 1 + δ + O(τɛ)) + y(τ 2 + δ + O(τɛ)) ( = (x + y) 1 + x (τ1 + δ + O(τɛ)) + y ) (τ2 + δ + O(τɛ)) x + y x + y (x + y)(1 + δ), where δ 1 2 ɛ, δ x + y x + y (τ + 12 ɛ + O(τɛ) ). 16 / 27

17 Floating-point arithmetic error: +, Three possible cases: 1. If x and y have the same sign, i.e., xy > 0, then x + y = x + y ; this implies δ τ + 1 ɛ + O(τɛ) 1. 2 Thus fl( x + ŷ) approximates x + y well. 2. If x y x + y 0, then ( x + y )/ x + y 1; this implies that δ could be nearly or much bigger than 1. This is so called catastrophic cancellation, it causes relative errors or uncertainties already presented in x and ŷ to be magnified. 3. In general, if ( x + y )/ x + y is not too big, fl( x + ŷ) provides a good approximation to x + y. 17 / 27

18 Catastrophic cancellation: example 1 Computing x + 1 x straightforward causes substantial loss of significant digits for large n x fl( x + 1) fl( x) fl(fl( x + 1) fl( x) 1.00e e e e e e e e e e e e e e e e e e e e e e e e e e e e+00 Catastrophic cancellation can sometimes be avoided if a formula is properly reformulated. In the present case, one can compute x + 1 x almost to full precision by using the equality x + 1 x = 1 x x. 18 / 27

19 Catastrophic cancellation: example 2 Consider the function Note that f(x) = 1 cos x x 2 0 f(x) < 1/2 for all x 0 Let x = , then the computed is completely wrong! fl(f(x)) = Alternatively, the function can be re-written as ( ) 2 sin(x/2) f(x) =. x/2 Consequently, for x = , then the computed fl(f(x)) = < 1/2 is fine! 19 / 27

20 Floating-point arithmetic error:, / Multiplication and Division: fl( x ŷ) = ( x ŷ)(1 + δ) = xy(1 + τ 1 )(1 + τ 2 )(1 + δ) xy(1 + δ ), fl( x/ŷ) = ( x/ŷ)(1 + δ) = (x/y)(1 + τ 1 )(1 + τ 2 ) 1 (1 + δ) xy(1 + δ ), where Thus δ = τ 1 + τ 2 + δ + O(τɛ), δ 2τ ɛ + O(τɛ), δ = τ 1 τ 2 + δ + O(τɛ). δ 2τ ɛ + O(τɛ) we can conclude that multiplication and division are very well-behaved! 20 / 27

21 Outline 1 Floating-point numbers and representations 2 Floating-point arithmetic 3 Floating-point error analysis 4 Further reading 21 / 27

22 Floating-point error analysis Illustrate the basic idea of error analysis through a simple example. Consider the inner product: x T y = x 1 y 1 + x 2 y 2 + x 3 y 3, assuming already x i s and y j s are fp numbers. fl(x T y) is computed in the following order: fl(x T y) = fl ( fl(fl(x 1 y 1 ) + fl(x 2 y 2 )) + fl(x 3 y 3 ) ). By the fp arithmetic model, we have fl(x T y) = fl ( fl(x 1 y 1 (1 + ɛ 1 ) + x 2 y 2 (1 + ɛ 2 )) + x 3 y 3 (1 + ɛ 3 ) ) = fl ( (x 1 y 1 (1 + ɛ 1 ) + x 2 y 2 (1 + ɛ 2 ))(1 + δ 1 ) + x 3 y 3 (1 + ɛ 3 ) ) = ( (x 1 y 1 (1 + ɛ 1 ) + x 2 y 2 (1 + ɛ 2 ))(1 + δ 1 ) + x 3 y 3 (1 + ɛ 3 ) ) (1 + δ 2 ) = x 1 y 1 (1 + ɛ 1 )(1 + δ 1 )(1 + δ 2 ) + x 2 y 2 (1 + ɛ 2 )(1 + δ 1 )(1 + δ 2 ) +x 3 y 3 (1 + ɛ 3 )(1 + δ 2 ), where ɛ i 1 2 ɛ and δ j 1 2 ɛ. 22 / 27

23 Floating-point error analysis, cont d There are two ways to interpret the errors in the computed fl(x T y): Forward error analysis Backward error analysis 23 / 27

24 Forward error analysis We have where fl(x T y) = x T y + E, It implies that E =x 1 y 1 (ɛ 1 + δ 1 + δ 2 ) + x 2 y 2 (ɛ 2 + δ 1 + δ 2 ) + x 3 y 3 (ɛ 3 + δ 2 ) + O(ɛ 2 ). E 1 2 ɛ(3 x 1y x 2 y x 3 y 3 ) + O(ɛ 2 ) 3 2 ɛ x T y + O(ɛ 2 ). This bound on E tells the worst-case difference between the exact x T y and its computed value. 24 / 27

25 Backward error analysis We can also write where fl(x T y) = x T ŷ = (x + x) T (y + y), x 1 = x 1 (1 + ɛ 1 ), ŷ 1 = y 1 (1 + δ 1 )(1 + δ 2 ) y 1 (1 + δ 1 ), x 2 = x 2 (1 + ɛ 2 ), ŷ 2 = y 2 (1 + δ 1 )(1 + δ 2 ) y 2 (1 + δ 2 ), x 3 = x 3 (1 + ɛ 3 ), ŷ 3 = y 3 (1 + δ 2 ) y 3 (1 + δ 3 ). and δ 1 = δ 2 ɛ + O(ɛ 2 ) and δ ɛ. This says the computed value fl(x T y) is the exact inner product of a slightly perturbed x and ŷ. 25 / 27

26 Outline 1 Floating-point numbers and representations 2 Floating-point arithmetic 3 Floating-point error analysis 4 Further reading 26 / 27

27 Further reading 1. D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 18(1):5 48, Rencet lecture by N. Higham on the latest development on low precision and multiprecision arithmetic Discussion of numerical disasters: T. Huckle, Collection of software bugs huckle/bugse.html Bits and Bugs: A Scientific and Historical Review of Software Failures in Computational Science by T. Huckle and T. Neckel, SIAM, March / 27

1 Floating point arithmetic

1 Floating point arithmetic Introduction to Floating Point Arithmetic Floating point arithmetic Floating point representation (scientific notation) of numbers, for example, takes the following form.346 0 sign fraction base exponent