Floating point arithmetic


Kristian Barker

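1 Introduction to Floating Point Arithmetic

Floating point arithmetic

Floating point representation (scientific notation) of numbers takes the following form, made up of a sign, a fraction, a base, and an exponent:

$$\pm\,\text{fraction} \times \text{base}^{\text{exponent}}$$

1. In general, floating point numbers with $p$ digits and base $\beta$ are represented as

$$\pm\,.b_1 b_2 \cdots b_p \times \beta^{e}, \qquad 0 \le b_i \le \beta - 1. \tag{1.1}$$

It is normalized if $b_1 \ne 0$.

2. The number $p$ of digits determines the accuracy (precision) of representing an arbitrary number $\alpha$ in the format (1.1). Given a number that has more than $p$ digits,

$$\alpha = \pm\,.b_1 b_2 \cdots b_p b_{p+1} \cdots b_m \times \beta^{e},$$

it has to be rounded before it fits the format (1.1). A natural choice is to use the nearest number of the form (1.1) to approximate $\alpha$. That nearest number is

$$\hat\alpha = \pm\,.b_1 b_2 \cdots b_{p-1} \hat b_p \times \beta^{e},$$

where

$$\hat b_p = \begin{cases} b_p, & \text{if } b_{p+1}.b_{p+2} \cdots b_m < \tfrac12\beta, \\ b_p + 1, & \text{if } b_{p+1}.b_{p+2} \cdots b_m \ge \tfrac12\beta. \end{cases}$$

It can be seen that the absolute representation error satisfies

$$|\alpha - \hat\alpha| \le .\underbrace{00 \cdots 0}_{p\ 0\text{'s}}(\tfrac12\beta) \times \beta^{e} = (\tfrac12\beta)\,\beta^{-p-1} \times \beta^{e} = \tfrac12 \beta^{e-p},$$

and, because $b_1 \ge 1$ for a normalized number, the relative representation error satisfies

$$\frac{|\alpha - \hat\alpha|}{|\alpha|} \le \frac{\tfrac12 \beta^{e-p}}{.b_1 b_2 \cdots b_p b_{p+1} \cdots b_m \times \beta^{e}} = \frac{\tfrac12 \beta^{-p}}{.b_1 b_2 \cdots b_p b_{p+1} \cdots b_m} \le \frac{\tfrac12 \beta^{-p}}{\beta^{-1}} = \tfrac12 \beta^{1-p}.$$

So $\tfrac12\beta^{1-p}$ is the maximum relative representation error, which is also half the distance between 1 and the next larger floating point number $1 + \beta^{1-p}$. We have

$$\hat\alpha = \alpha + (\hat\alpha - \alpha) = \alpha\left(1 + \frac{\hat\alpha - \alpha}{\alpha}\right) \equiv \alpha(1 + \delta), \tag{1.2}$$

where $|\delta| \le u \overset{\text{def}}{=} \tfrac12\beta^{1-p}$, the unit roundoff.

Extra care should be taken in the case $b_p = \beta - 1$, because adding 1 to $b_p$ then carries into the preceding digits. For example rounding to 6 decimal digit number gives .3460.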
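The rounding rule and the bound $|\delta| \le u$ above can be checked numerically. The following is a minimal sketch, assuming base $\beta = 10$ and using Python's `decimal` module with `ROUND_HALF_UP` as a stand-in for the rounding rule in item 2. The helper names `round_to_p_digits`, `relative_error`, and `unit_roundoff` are hypothetical and do not come from the notes.

```python
from decimal import Context, Decimal, ROUND_HALF_UP
from fractions import Fraction


def round_to_p_digits(alpha: str, p: int) -> Decimal:
    """Round the decimal string alpha to p significant digits (ties go up in magnitude)."""
    ctx = Context(prec=p, rounding=ROUND_HALF_UP)
    return ctx.create_decimal(alpha)


def relative_error(alpha: str, p: int) -> Fraction:
    """Exact relative representation error |alpha - alpha_hat| / |alpha|."""
    exact = Fraction(alpha)
    rounded = Fraction(round_to_p_digits(alpha, p))
    return abs(exact - rounded) / abs(exact)


def unit_roundoff(beta: int, p: int) -> Fraction:
    """u = (1/2) * beta**(1 - p), as defined after equation (1.2)."""
    return Fraction(beta) ** (1 - p) / 2


if __name__ == "__main__":
    print(round_to_p_digits("3.14159", 4))
    print(relative_error("3.14159", 4) <= unit_roundoff(10, 4))
    # Carry case b_p = beta - 1: the trailing 9s roll over into earlier digits.
    print(round_to_p_digits("0.1299996", 6))
```

Exact rational arithmetic (`Fraction`) is used for the error so that the comparison with $u$ is not itself affected by roundoff.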

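3. The range in which the exponent $e$ can vary, $-e_{\min} \le e \le e_{\max}$, determines the largest and smallest representable numbers in magnitude, where $e_{\min}$ and $e_{\max}$ are positive integers. Any normalized number $\alpha$ of the form (1.1) satisfies

$$|\alpha| \le .\underbrace{(\beta-1)(\beta-1) \cdots (\beta-1)}_{p\ (\beta-1)\text{'s}} \times \beta^{e_{\max}} = (1 - \beta^{-p})\,\beta^{e_{\max}} = (1 - .\underbrace{00 \cdots 0}_{p-1\ 0\text{'s}}1) \times \beta^{e_{\max}},$$

and

$$|\alpha| \ge .1\underbrace{00 \cdots 0}_{p-1\ 0\text{'s}} \times \beta^{-e_{\min}} = \beta^{-e_{\min}-1}.$$

Overflow occurs when a number is bigger in magnitude than $(1 - \beta^{-p})\,\beta^{e_{\max}}$. Typically computation will be aborted when Overflow occurs.

Underflow occurs when a number is smaller in magnitude than $\beta^{-e_{\min}-1}$. One choice is to flush the number to zero when Underflow occurs, but a numerically more sensible choice is the so-called Gradual Underflow through a transition of subnormal numbers.

IEEE floating point standard for binary arithmetic

It is the most common standard, used e.g. on SUN, DEC, IBM, HP workstations and all PCs.

IEEE Single Precision takes 4 bytes = 32 bits:

    field   s (sign)   e (exponent)   f (fraction)
    bits    1          8              23

The binary point sits between the exponent field and the fraction field. The format represents

$$(-1)^{s} \times 2^{e-127} \times (1 + f)$$

(note: the leading 1 in the fraction need not be stored explicitly, because it is always 1. This hidden bit accounts for the "1+" here). The maximum relative representation error (unit roundoff) is $2^{-24} \approx 6 \times 10^{-8}$, and the range of positive normalized numbers is from $2^{-126}$ to $(2 - 2^{-23}) \times 2^{127}$, or about from $1.2 \times 10^{-38}$ to $3.4 \times 10^{38}$.

IEEE Double Precision takes 8 bytes = 64 bits:

    field   s (sign)   e (exponent)   f (fraction)
    bits    1          11             52

Again the binary point sits between the exponent field and the fraction field. The format represents

$$(-1)^{s} \times 2^{e-1023} \times (1 + f).$$

The maximum relative representation error is $2^{-53} \approx 1.1 \times 10^{-16}$, and the range of positive normalized numbers is from $2^{-1022}$ to $(2 - 2^{-52}) \times 2^{1023}$, or about from $2.2 \times 10^{-308}$ to $1.8 \times 10^{308}$.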
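The two formats can be inspected directly. The sketch below assumes a Python runtime whose `float` is an IEEE double, which is the usual case but not something the notes state. It uses the standard `struct` module to split a single-precision number into the fields $s$, $e$, $f$, and `sys.float_info` to read off the double-precision parameters. The function names `single_fields` and `single_value` are hypothetical.

```python
import struct
import sys


def single_fields(x: float):
    """Return the (s, e, f) bit fields of x after rounding it to IEEE single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def single_value(s: int, e: int, f: int) -> float:
    """Evaluate (-1)^s * 2^(e-127) * (1 + f / 2^23); valid for normalized numbers, 1 <= e <= 254."""
    return (-1) ** s * 2.0 ** (e - 127) * (1 + f / 2**23)


# Double precision parameters as reported by the runtime (assumed IEEE double).
u_double = sys.float_info.epsilon / 2      # unit roundoff: half the gap between 1 and the next float
smallest_normal = sys.float_info.min       # smallest positive normalized number
largest = sys.float_info.max               # largest finite number

# Default IEEE behaviour for this multiplication and division in Python:
overflowed = largest * 2                   # overflow produces +infinity
subnormal = smallest_normal / 4            # gradual underflow produces a nonzero subnormal

if __name__ == "__main__":
    print(single_fields(-0.15625))
    print(u_double, smallest_normal, largest)
    print(overflowed, subnormal)
```

Note that the multiplication here returns infinity rather than aborting, which is the IEEE default result for Overflow; some Python operations (for example `2.0 ** 1024`) raise an exception instead.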

IEEE Arithmetic Exceptions and default results

    Exception type    | Example                      | Default result
    Invalid operation | 0/0, 0 × ∞, √(−1)            | NaN (Not a Number)
    Overflow          |                              | ±∞
    Divide by zero    | Finite nonzero/0             | ±∞
    Underflow         |                              | Subnormal numbers
    Inexact           | Whenever fl(x ∘ y) ≠ x ∘ y   | Correctly rounded result

Floating point arithmetic models

Let $\circ \in \{+, -, \times, \div\}$ be one of the four basic arithmetic operations, and let $x$ and $y$ be two floating point numbers. It is unlikely that the exact $x \circ y$ fits into the working floating point number system without roundoff. For example, suppose we work with 4 decimal digits: sums and products of such numbers often require more than 4 decimal digits to hold the results, so roundoffs are inevitable.

Ideally a computer could perform basic arithmetic operations (and square root $\sqrt{\ }$) in such a way that the computed result is the exact value rounded to the nearest floating point number; this implies that at best the computed result $\mathrm{fl}(x \circ y)$ satisfies

$$\mathrm{fl}(x \circ y) = (x \circ y)(1 + \delta), \qquad \text{for some } |\delta| \le \tfrac12\beta^{1-p}. \tag{1.3}$$

IEEE Floating Point Arithmetic requires this! When this is guaranteed, we say that $x \circ y$ is correctly rounded.

2 Floating point error analysis

How Do the Four Basic Arithmetic Operations Behave?

Let $u$ be the unit roundoff of the working floating point number system. Let $\hat x$ and $\hat y$ be floating point numbers such that $\hat x = x(1 + \epsilon_1)$ and $\hat y = y(1 + \epsilon_2)$, with $|\epsilon_i| \le \epsilon$, where $\epsilon$ could be the error in the process of collecting the data.

Addition and subtraction.

$$\begin{aligned}
\mathrm{fl}(\hat x + \hat y) &= (\hat x + \hat y)(1 + \delta), \qquad |\delta| \le u \\
&= x(1 + \epsilon_1)(1 + \delta) + y(1 + \epsilon_2)(1 + \delta) \\
&= x + y + x\,(\epsilon_1 + \delta + O(\epsilon u)) + y\,(\epsilon_2 + \delta + O(\epsilon u)) \\
&= (x + y)\left(1 + \frac{x}{x+y}\,(\epsilon_1 + \delta + O(\epsilon u)) + \frac{y}{x+y}\,(\epsilon_2 + \delta + O(\epsilon u))\right) \\
&\equiv (x + y)(1 + \hat\delta).
\end{aligned}$$

Subtraction behaves the same as addition does. $\hat\delta$ can be bounded as follows:

$$|\hat\delta| \le \frac{|x| + |y|}{|x + y|}\,\bigl[\epsilon + u + O(\epsilon u)\bigr].$$
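The bound on $\hat\delta$ can be checked with exact rational arithmetic. This is a minimal sketch under stated assumptions: the working system is IEEE double (so $u = 2^{-53}$), the only data error is the rounding of $x$ and $y$ to doubles (so $\epsilon = u$), and no overflow or underflow occurs. The $O(\epsilon u)$ term is replaced by its exact value $\epsilon u$, which makes the bound rigorous. The function name `addition_error` is hypothetical.

```python
import sys
from fractions import Fraction

# Unit roundoff of IEEE double precision, assuming Python floats are IEEE doubles.
U = Fraction(sys.float_info.epsilon) / 2


def addition_error(x: Fraction, y: Fraction):
    """Return (|delta_hat|, bound) for fl(x_hat + y_hat) as an approximation to x + y.

    x and y are exact rationals with x + y != 0; x_hat and y_hat are their
    nearest doubles, so |eps_i| <= u.
    """
    x_hat, y_hat = float(x), float(y)       # x_hat = x(1 + eps_1), y_hat = y(1 + eps_2)
    computed = Fraction(x_hat + y_hat)      # exact value of the double fl(x_hat + y_hat)
    exact = x + y
    delta_hat = abs(computed - exact) / abs(exact)
    eps = U                                 # data error equals one rounding here
    bound = (abs(x) + abs(y)) / abs(exact) * (eps + U + eps * U)
    return delta_hat, bound


if __name__ == "__main__":
    # Same sign: the amplification factor (|x| + |y|) / |x + y| equals 1.
    print(addition_error(Fraction(1, 3), Fraction(2, 7)))
    # Nearly opposite values: the factor, and hence the bound, is large.
    print(addition_error(Fraction(1, 3), -Fraction(333333333, 1000000000)))
```

The second call shows how the factor $(|x| + |y|)/|x + y|$ alone decides whether the bound is useful.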

1. If $x$ and $y$ have the same sign, i.e., $xy > 0$, then $|x + y| = |x| + |y|$; this implies $|\hat\delta| \le \epsilon + u + O(\epsilon u) \ll 1$. Thus $\mathrm{fl}(\hat x + \hat y)$ approximates $x + y$ no worse than $\hat x$ and $\hat y$ approximate $x$ and $y$.

2. If $x \approx -y$, so that $|x + y| \approx 0$, then $(|x| + |y|)/|x + y| \gg 1$; this implies that $|\hat\delta|$ could be nearly 1 or much bigger than 1. Thus $\mathrm{fl}(\hat x + \hat y)$ may turn out to have nothing to do with the true $x + y$. This is the so-called catastrophic cancellation, which happens when a floating point number is subtracted from another nearly equal floating point number.

3. In general, if $(|x| + |y|)/|x + y|$ is not too big, $\mathrm{fl}(\hat x + \hat y)$ provides a good approximation to $x + y$.

(Footnote 2) In our model of floating point arithmetic, we know that there is a small relative error associated with individual arithmetic operations. It is important to realize, however, that this is not necessarily the case when a sequence of operations is involved.

Example 2.1. Computing $\sqrt{n+1} - \sqrt{n}$ straightforwardly causes substantial loss of significant digits for large $n$.

[Table: $n$ = 1.00e+10, 1.00e+11, ..., 1.00e+16 versus $\mathrm{fl}(\sqrt{n+1})$, $\mathrm{fl}(\sqrt{n})$, and $\mathrm{fl}(\mathrm{fl}(\sqrt{n+1}) - \mathrm{fl}(\sqrt{n}))$. The computed differences have exponents e-06, e-06, e-07, e-07, e-08, e-08, and e+00 for the last row.]

Catastrophic cancellation may sometimes be avoided if a formula is properly reformulated. In the present case, one can compute $\sqrt{n+1} - \sqrt{n}$ almost to full precision by using

$$\sqrt{n+1} - \sqrt{n} = \frac{1}{\sqrt{n+1} + \sqrt{n}},$$

as shown in the following table.

[Table: $n$ = 1.00e+10, 1.00e+11, ..., 1.00e+16 versus $\mathrm{fl}(1/(\sqrt{n+1} + \sqrt{n}))$. The last computed value has exponent e-09.]

In fact, one has

$$\mathrm{fl}\bigl(1/(\sqrt{n+1} + \sqrt{n})\bigr) = (\sqrt{n+1} - \sqrt{n})(1 + \delta), \tag{2.4}$$

where $|\delta| \le 5u + O(u^2)$.
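The two formulas of Example 2.1 can be compared against a high-precision reference. The following is a sketch assuming IEEE double precision for the `float` computations and using Python's `decimal` module at 50 digits as the reference, which is a choice made here and not the method used to produce the tables in the notes. All function names are hypothetical.

```python
import math
from decimal import Decimal, localcontext


def naive_difference(n: float) -> float:
    """sqrt(n+1) - sqrt(n) evaluated directly; suffers cancellation for large n."""
    return math.sqrt(n + 1) - math.sqrt(n)


def stable_difference(n: float) -> float:
    """The reformulation 1 / (sqrt(n+1) + sqrt(n)); no subtraction of nearby numbers."""
    return 1 / (math.sqrt(n + 1) + math.sqrt(n))


def reference_difference(n: int) -> Decimal:
    """sqrt(n+1) - sqrt(n) computed with 50 significant decimal digits."""
    with localcontext() as ctx:
        ctx.prec = 50
        return Decimal(n + 1).sqrt() - Decimal(n).sqrt()


def relative_error(computed: float, reference: Decimal) -> float:
    """Relative error of a double against the high-precision reference."""
    return float(abs(Decimal(computed) - reference) / reference)


if __name__ == "__main__":
    for k in range(10, 17):
        n = 10 ** k
        ref = reference_difference(n)
        naive = naive_difference(float(n))
        stable = stable_difference(float(n))
        print(f"{float(n):.2e}  naive={naive:.4e}  stable={stable:.4e}  "
              f"err_naive={relative_error(naive, ref):.1e}  "
              f"err_stable={relative_error(stable, ref):.1e}")
```

For $n = 10^{16}$ the sum $n + 1$ is not representable in double precision and rounds back to $n$, so the naive formula returns exactly zero, while the reformulated one still returns a value close to $1/(2\sqrt{n})$.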

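Multiplication and division. These two operations are very well-behaved.

$$\begin{aligned}
\mathrm{fl}(\hat x \times \hat y) &= (\hat x \times \hat y)(1 + \delta) = xy\,(1 + \epsilon_1)(1 + \epsilon_2)(1 + \delta) \equiv xy\,(1 + \hat\delta_{\times}), \\
\mathrm{fl}(\hat x / \hat y) &= (\hat x / \hat y)(1 + \delta) = (x/y)\,(1 + \epsilon_1)(1 + \epsilon_2)^{-1}(1 + \delta) \equiv (x/y)\,(1 + \hat\delta_{\div}),
\end{aligned}$$

where

$$\hat\delta_{\times} = \epsilon_1 + \epsilon_2 + \delta + O(\epsilon^2 + \epsilon u), \qquad \hat\delta_{\div} = \epsilon_1 - \epsilon_2 + \delta + O(\epsilon^2 + \epsilon u).$$

Thus $|\hat\delta_{\times}| \le 2\epsilon + u + O(\epsilon^2 + \epsilon u)$ and $|\hat\delta_{\div}| \le 2\epsilon + u + O(\epsilon^2 + \epsilon u)$.

Forward and backward error analysis. We illustrate the idea through an example. Consider the computation of the inner product of two vectors $x, y \in \mathbb{R}^3$,

$$x^T y \overset{\text{def}}{=} x_1 y_1 + x_2 y_2 + x_3 y_3,$$

assuming that the $x_i$'s and $y_j$'s are already floating point numbers. It is likely that $\mathrm{fl}(x^T y)$ is computed in the following order:

$$\mathrm{fl}(x^T y) = \mathrm{fl}\bigl(\mathrm{fl}(\mathrm{fl}(x_1 y_1) + \mathrm{fl}(x_2 y_2)) + \mathrm{fl}(x_3 y_3)\bigr).$$

Adopting the floating point arithmetic model, we have

$$\begin{aligned}
\mathrm{fl}(x^T y) &= \mathrm{fl}\bigl(\mathrm{fl}(x_1 y_1 (1 + \epsilon_1) + x_2 y_2 (1 + \epsilon_2)) + x_3 y_3 (1 + \epsilon_3)\bigr) \\
&= \mathrm{fl}\bigl((x_1 y_1 (1 + \epsilon_1) + x_2 y_2 (1 + \epsilon_2))(1 + \delta_1) + x_3 y_3 (1 + \epsilon_3)\bigr) \\
&= \bigl((x_1 y_1 (1 + \epsilon_1) + x_2 y_2 (1 + \epsilon_2))(1 + \delta_1) + x_3 y_3 (1 + \epsilon_3)\bigr)(1 + \delta_2) \\
&= x_1 y_1 (1 + \epsilon_1)(1 + \delta_1)(1 + \delta_2) + x_2 y_2 (1 + \epsilon_2)(1 + \delta_1)(1 + \delta_2) + x_3 y_3 (1 + \epsilon_3)(1 + \delta_2),
\end{aligned}$$

where $|\epsilon_i| \le u$ and $|\delta_j| \le u$. Now there are two ways to interpret the errors in the computed $\mathrm{fl}(x^T y)$.

1. We have $\mathrm{fl}(x^T y) = x^T y + E$, where

$$E = x_1 y_1 (\epsilon_1 + \delta_1 + \delta_2) + x_2 y_2 (\epsilon_2 + \delta_1 + \delta_2) + x_3 y_3 (\epsilon_3 + \delta_2) + O(u^2),$$

$$|E| \le u\,(3|x_1 y_1| + 3|x_2 y_2| + 2|x_3 y_3|) + O(u^2).$$

This bound on $E$ tells the worst case difference between the exact $x^T y$ and its computed value. Such an error analysis is the so-called Forward Error Analysis.

2. We can also write $\mathrm{fl}(x^T y) = \hat x^T \hat y$, where

$$\begin{aligned}
\hat x_1 &= x_1 (1 + \epsilon_1), & \hat y_1 &= y_1 (1 + \delta_1)(1 + \delta_2) \equiv y_1 (1 + \hat\delta_1), \\
\hat x_2 &= x_2 (1 + \epsilon_2), & \hat y_2 &= y_2 (1 + \delta_1)(1 + \delta_2) \equiv y_2 (1 + \hat\delta_2), \\
\hat x_3 &= x_3 (1 + \epsilon_3), & \hat y_3 &= y_3 (1 + \delta_2) \equiv y_3 (1 + \hat\delta_3).
\end{aligned}$$

(Footnote 3) There are many ways to distribute the factors $(1 + \epsilon_i)$ and $(1 + \delta_j)$ to $x_i$ and $y_j$. In this case it is even possible to make either $\hat x \equiv x$ or $\hat y \equiv y$.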

It can be seen that $\hat\delta_1 = \hat\delta_2$ with $|\hat\delta_1| \le 2u + O(u^2)$, and $|\hat\delta_3| \le u$. This says that the computed value $\mathrm{fl}(x^T y)$ is the exact inner product of slightly perturbed vectors $\hat x$ and $\hat y$. Such an error analysis is the so-called Backward Error Analysis.
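Both interpretations can be verified on a concrete inner product. The sketch below assumes IEEE double precision, nonzero products and partial sums, and no overflow or underflow. It recovers the individual rounding errors $\epsilon_i$ and $\delta_j$ exactly with rational arithmetic, then checks the forward bound and the backward identity $\mathrm{fl}(x^T y) = \hat x^T \hat y$. The first-order bound is replaced here by the rigorous expression $((1+u)^3 - 1)$ and $((1+u)^2 - 1)$, which agrees with $3u$ and $2u$ up to $O(u^2)$. The function name `analyze_inner_product` is hypothetical.

```python
import sys
from fractions import Fraction

# Unit roundoff of IEEE double precision, assuming Python floats are IEEE doubles.
U = Fraction(sys.float_info.epsilon) / 2


def analyze_inner_product(x, y):
    """Forward and backward error data for a 3-term inner product of doubles.

    Returns (E, forward_bound, (delta_1, delta_2), backward_ok).
    Assumes all products x_i * y_i and partial sums are nonzero.
    """
    X = [Fraction(v) for v in x]
    Y = [Fraction(v) for v in y]

    # Floating point evaluation in the order used in the text.
    p = [x[i] * y[i] for i in range(3)]     # fl(x_i y_i)
    s1 = p[0] + p[1]                        # fl(fl(x_1 y_1) + fl(x_2 y_2))
    s2 = s1 + p[2]                          # fl(x^T y)

    # Recover the individual relative rounding errors exactly.
    eps = [Fraction(p[i]) / (X[i] * Y[i]) - 1 for i in range(3)]
    d1 = Fraction(s1) / (Fraction(p[0]) + Fraction(p[1])) - 1
    d2 = Fraction(s2) / (Fraction(s1) + Fraction(p[2])) - 1

    # Forward error: E = fl(x^T y) - x^T y and a rigorous version of its bound.
    exact = sum(X[i] * Y[i] for i in range(3))
    E = Fraction(s2) - exact
    forward_bound = (((1 + U) ** 3 - 1) * (abs(X[0] * Y[0]) + abs(X[1] * Y[1]))
                     + ((1 + U) ** 2 - 1) * abs(X[2] * Y[2]))

    # Backward error: perturbed data whose exact inner product equals fl(x^T y).
    x_hat = [X[i] * (1 + eps[i]) for i in range(3)]
    y_hat = [Y[0] * (1 + d1) * (1 + d2), Y[1] * (1 + d1) * (1 + d2), Y[2] * (1 + d2)]
    backward_ok = sum(a * b for a, b in zip(x_hat, y_hat)) == Fraction(s2)

    return E, forward_bound, (d1, d2), backward_ok


if __name__ == "__main__":
    E, bound, deltas, ok = analyze_inner_product([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
    print(float(E), float(bound), ok)
```

The backward check is an exact equality, not an approximate one: with the recovered $\epsilon_i$ and $\delta_j$, the perturbed vectors reproduce the computed value with no remaining error.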
