FLOATING POINT ARITHMETIC - ERROR ANALYSIS

Outline: brief review of floating point arithmetic; model of floating point arithmetic; notation, backward and forward errors.

Roundoff errors and floating-point arithmetic

The basic problem: the set A of all numbers representable on a given machine is finite - but we would like to use this set to perform the standard arithmetic operations (+, *, -, /) on an infinite set. The usual algebra rules are no longer satisfied, since the results of operations are rounded: basic algebra breaks down in floating point arithmetic.

Example: in floating point arithmetic, in general

    a + (b + c) ≠ (a + b) + c

Matlab experiment: for 10,000 random triples a, b, c, count the number of instances in which the two sides above differ. Do the same thing for multiplication.
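A minimal Matlab sketch of this experiment (the count of 10,000 triples is from the slide; the use of uniform random numbers is an illustrative choice):

% Count how often floating point addition and multiplication fail to be
% associative on 10,000 random triples.
n = 10000;
a = rand(n,1); b = rand(n,1); c = rand(n,1);
add_mismatch = sum( (a + (b + c)) ~= ((a + b) + c) );
mul_mismatch = sum( (a .* (b .* c)) ~= ((a .* b) .* c) );
fprintf('addition not associative in %d of %d cases\n', add_mismatch, n);
fprintf('multiplication not associative in %d of %d cases\n', mul_mismatch, n);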

Floating point representation: real numbers are represented in two parts, a mantissa (significand) and an exponent. If the representation is in the base β then:

    x = ±(.d_1 d_2 ... d_t)_β × β^e

.d_1 d_2 ... d_t is a fraction in the base-β representation (generally the form is normalized in that d_1 ≠ 0), and e is an integer.

Often it is more convenient to rewrite the above as:

    x = ±(m/β^t) × β^e = ±m × β^(e-t)

The mantissa m is an integer with 0 ≤ m < β^t.
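As a quick illustration (not from the slides), Matlab's two-output log2 exposes this decomposition for β = 2: it returns F and E with x = F × 2^E and 0.5 ≤ |F| < 1, so m = F × 2^53 is the integer mantissa of a double precision number (t = 53).

% Decompose a double precision number into integer mantissa and exponent.
x = 0.1;
[F, E] = log2(x);        % x = F * 2^E with 0.5 <= |F| < 1
m = F * 2^53;            % integer mantissa, 0 <= m < 2^53
fprintf('x = %d * 2^%d\n', m, E - 53);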

Machine precision - machine epsilon

Notation: fl(x) = closest floating point representation of the real number x ("rounding").

When a number x is very small, there is a point when 1 + x == 1 in a machine sense: the computer no longer makes a difference between 1 and 1 + x.

Machine epsilon: the smallest number ε such that 1 + ε is a float that is different from one is called machine epsilon. Denoted by macheps or eps, it represents the distance from 1 to the next larger floating point number. With the previous representation, eps is equal to β^(-(t-1)).

Example: in IEEE standard double precision, β = 2 and t = 53 (this includes the "hidden bit"). Therefore eps = 2^(-52) ≈ 2.2 × 10^(-16).

Unit round-off: a real number x can be approximated by a floating point number fl(x) with relative error no larger than u = (1/2) β^(-(t-1)). u is called the unit round-off. In fact one can easily show:

    fl(x) = x(1 + δ)   with   |δ| < u

Matlab experiment: find the machine epsilon on your computer.

There have been many discussions on what conditions/rules should be satisfied by floating point arithmetic. The IEEE standard is a set of standards adopted by many CPU manufacturers.
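A minimal sketch of this experiment (repeated halving is one standard way to do it; Matlab's built-in eps is used only as a cross-check):

% Find machine epsilon: the smallest power of 2 whose sum with 1 differs from 1.
e = 1;
while 1 + e/2 > 1
    e = e/2;
end
fprintf('computed macheps = %g, built-in eps = %g\n', e, eps);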

Rule 1. fl(x) = x(1 + ε), where |ε| ≤ u.

Rule 2. For all operations ⊙ (one of +, -, *, /):

    fl(x ⊙ y) = (x ⊙ y)(1 + ε_⊙),   where |ε_⊙| ≤ u

Rule 3. For the +, * operations:

    fl(a ⊙ b) = fl(b ⊙ a)

Matlab experiment: verify Rule 3 experimentally with 10,000 randomly generated numbers a_i, b_i.
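A sketch of the Rule 3 check (uniform random pairs are an arbitrary choice; with IEEE arithmetic both counts should come out as zero):

% Verify commutativity of floating point + and * on 10,000 random pairs.
n = 10000;
a = rand(n,1); b = rand(n,1);
add_violations = sum( (a + b) ~= (b + a) );
mul_violations = sum( (a .* b) ~= (b .* a) );
fprintf('Rule 3 violations: %d for +, %d for *\n', add_violations, mul_violations);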

Example: consider the sum of 3 numbers: y = a + b + c, done as fl(fl(a + b) + c):

    η   = fl(a + b) = (a + b)(1 + ε_1)
    y_1 = fl(η + c) = (η + c)(1 + ε_2)
        = [(a + b)(1 + ε_1) + c](1 + ε_2)
        = [(a + b + c) + (a + b)ε_1](1 + ε_2)
        = (a + b + c)[1 + (a + b)/(a + b + c) ε_1(1 + ε_2) + ε_2]

So, disregarding the higher order term ε_1 ε_2:

    fl(fl(a + b) + c) = (a + b + c)(1 + ε_3),   ε_3 ≈ (a + b)/(a + b + c) ε_1 + ε_2
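A small numerical illustration of this amplification (the particular values of a, b, c are an arbitrary choice; single precision plays the role of fl and double precision serves as the "exact" reference):

% When a + b nearly cancels with c, the factor (a+b)/(a+b+c) amplifies the
% rounding error made in the first addition.
a = single(1.0); b = single(1e-8); c = single(-1.0);
y  = (a + b) + c;                          % computed in single precision
ye = (double(a) + double(b)) + double(c);  % reference in double precision
rel_err = abs(double(y) - ye) / abs(ye);
fprintf('relative error = %g, unit roundoff u = %g\n', rel_err, eps('single')/2);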

If we redid the computation as y_2 = fl(a + fl(b + c)) we would find

    fl(a + fl(b + c)) = (a + b + c)(1 + ε_4),   ε_4 ≈ (b + c)/(a + b + c) ε_1 + ε_2

The error is amplified by the factor (a + b)/y in the first case and (b + c)/y in the second case.

In order to sum n numbers accurately, it is better to start with the small numbers first. [However, sorting before adding is not worth it.] But watch out if the numbers have mixed signs!
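A sketch of an experiment that makes the ordering effect visible (the test data - one large value followed by many small ones of the same sign - is an assumption chosen so the effect shows up in single precision):

% Summation order: adding the small numbers first gives a more accurate
% single precision sum than adding the large number first.
x = single([1e8, ones(1, 100000)]);        % one large value, then many 1's
s_large_first = single(0);
for k = 1:numel(x), s_large_first = s_large_first + x(k); end
s_small_first = single(0);
for k = numel(x):-1:1, s_small_first = s_small_first + x(k); end
s_exact = sum(double(x));                  % reference computed in double
fprintf('large first: %.1f   small first: %.1f   exact: %.1f\n', ...
        s_large_first, s_small_first, s_exact);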

The absolute value notation

For a given vector x, |x| is the vector with components |x_i|, i.e., |x| is the component-wise absolute value of x. Similarly for matrices:

    |A| = { |a_ij| },  i = 1,...,m;  j = 1,...,n

An obvious result: the basic inequality |fl(a_ij) - a_ij| ≤ u |a_ij| translates into

    fl(A) = A + E   with   |E| ≤ u |A|

A ≤ B means a_ij ≤ b_ij for all 1 ≤ i ≤ m, 1 ≤ j ≤ n.

Error Analysis: Inner product

Inner products are in the innermost parts of many calculations. Their analysis is important.

Lemma: if |δ_i| ≤ u and nu < 1 then

    Π_{i=1}^n (1 + δ_i) = 1 + θ_n   where   |θ_n| ≤ nu / (1 - nu)

Common notation:

    γ_n ≡ nu / (1 - nu)

Prove the lemma [Hint: use induction].
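A quick numerical sanity check of the lemma's bound (the random choice of the δ_i and the value of n are arbitrary; this only illustrates the bound, it is not a proof):

% Check that |prod(1 + delta_i) - 1| stays below n*u/(1 - n*u) when |delta_i| <= u.
u = eps/2;                        % unit roundoff in double precision
n = 1000;
delta = u * (2*rand(n,1) - 1);    % random perturbations in [-u, u]
theta_n = prod(1 + delta) - 1;
gamma_n = n*u / (1 - n*u);
fprintf('|theta_n| = %.3e, bound gamma_n = %.3e\n', abs(theta_n), gamma_n);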

One can use the following simpler result:

Lemma: if |δ_i| ≤ u and nu < .01 then

    Π_{i=1}^n (1 + δ_i) = 1 + θ_n   where   |θ_n| ≤ 1.01 nu

Example: the previous sum of numbers can be written

    fl(a + b + c) = a(1 + ε_1)(1 + ε_2) + b(1 + ε_1)(1 + ε_2) + c(1 + ε_2)
                  = a(1 + θ_1) + b(1 + θ_2) + c(1 + θ_3)
                  = exact sum of slightly perturbed inputs,

where all the θ_i's satisfy |θ_i| ≤ 1.01 nu (here n = 2).

Alternatively, one can write the forward bound:

    |fl(a + b + c) - (a + b + c)| ≤ |a θ_1| + |b θ_2| + |c θ_3|

Backward and forward errors

Assume the approximation ŷ to y = alg(x) is computed by some algorithm with arithmetic precision ε. A possible analysis: find an upper bound for the forward error

    Δy = |y - ŷ|

This is not always easy. Alternative question: find an equivalent perturbation of the initial data x that produces the result ŷ. In other words, find Δx so that:

    alg(x + Δx) = ŷ

The value of |Δx| is called the backward error. An analysis that finds an upper bound for |Δx| is called backward error analysis.
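As a hedged illustration (not from the slides): for alg(x) = sqrt(x), the backward error of a computed ŷ is the Δx with sqrt(x + Δx) = ŷ exactly, i.e. Δx = ŷ² - x, which is easy to evaluate:

% Forward and backward error for sqrt computed in single precision,
% with double precision standing in for exact arithmetic.
x    = 2;
yhat = double(sqrt(single(x)));     % computed result
y    = sqrt(x);                     % reference ("exact") result
forward_err  = abs(y - yhat);
backward_err = abs(yhat^2 - x);     % |dx| such that sqrt(x + dx) = yhat exactly
fprintf('forward error  = %.3e\n', forward_err);
fprintf('backward error = %.3e\n', backward_err);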

Example: let

    A = [ a  b ]        B = [ d  e ]
        [ 0  c ]            [ 0  f ]

Consider the product:

    fl(A·B) = [ ad(1 + ε_1)    (ae(1 + ε_2) + bf(1 + ε_3))(1 + ε_4) ]
              [ 0              cf(1 + ε_5)                          ]

with |ε_i| ≤ u for i = 1, ..., 5. The result can be written as:

    [ a   b(1 + ε_3)(1 + ε_4) ]   [ d(1 + ε_1)   e(1 + ε_2)(1 + ε_4) ]
    [ 0   c(1 + ε_5)          ]   [ 0            f                   ]

So fl(A·B) = (A + E_A)(B + E_B). The backward errors E_A, E_B satisfy:

    |E_A| ≤ 2u |A| + O(u²) ;   |E_B| ≤ 2u |B| + O(u²)
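A small numerical check of the entrywise consequence |fl(A·B) - A·B| ≤ 2u |A|·|B| (to first order) for this example; the random upper triangular test matrices are an arbitrary choice, single precision plays fl and double precision plays exact arithmetic:

% Entrywise check of |fl(A*B) - A*B| <= 2u |A||B| for 2x2 upper triangular matrices.
A = triu(double(single(rand(2))));   % entries exactly representable in single
B = triu(double(single(rand(2))));
u = eps('single')/2;
err   = abs(double(single(A)*single(B)) - A*B);   % actual rounding error
bound = 2*u*abs(A)*abs(B);                        % first order bound
disp([err bound])                                 % each err entry should be below bound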

When solving Ax = b by Gaussian elimination, we will see that a bound on e_x such that

    A(x_computed + e_x) = b

holds exactly is much harder to find than bounds on E_A, e_b such that

    (A + E_A) x_computed = b + e_b

holds exactly.

Note: in many instances backward errors are more meaningful than forward errors. If the initial data is accurate to only 4 digits, say, then my algorithm for computing x need not guarantee a much smaller backward error; a backward error of order 10^(-4) is acceptable.

Main result on inner products:

Backward error expression:

    fl(x^T y) = [x .* (1 + d_x)]^T [y .* (1 + d_y)]

where |d_x| ≤ 1.01 nu and |d_y| ≤ 1.01 nu. One can show that the equality remains valid even if one of d_x, d_y is absent.

Forward error expression:

    |fl(x^T y) - x^T y| ≤ γ_n |x|^T |y|   with   0 ≤ γ_n ≤ 1.01 nu

Here |x| denotes the elementwise absolute value of x and .* elementwise multiplication.

The above assumes nu ≤ .01. For double precision, u ≈ 1.1 × 10^(-16), so this holds for n up to about 9 × 10^13.
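A sketch of a numerical check of the forward bound (vector length and data are arbitrary; single precision plays fl, double precision the exact value):

% Check |fl(x'*y) - x'*y| <= gamma_n * |x|'*|y| for a single precision dot product.
n = 1000;
x = double(single(randn(n,1)));            % data exactly representable in single
y = double(single(randn(n,1)));
u = eps('single')/2;
gamma_n = n*u / (1 - n*u);
dot_fl = double(single(x)' * single(y));   % dot product accumulated in single
dot_ex = x' * y;                           % reference in double
fprintf('error = %.3e, bound = %.3e\n', abs(dot_fl - dot_ex), gamma_n * (abs(x)'*abs(y)));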

Consequence of the lemma:

    |fl(A·B) - A·B| ≤ γ_n |A| · |B|

Another way to write the result (less precise) is

    |fl(x^T y) - x^T y| ≤ n u |x|^T |y| + O(u²)

Assume you use single precision, for which u = 2^(-24) ≈ 6 × 10^(-8). What is the largest n for which nu ≤ 0.01 holds? Any conclusions for the use of single precision arithmetic?

What does the main result on inner products imply for the case when y = x? [Contrast the relative accuracy you get in this case vs. the general case when y ≠ x.]

Show that for any x, y there exist Δx, Δy such that

    fl(x^T y) = (x + Δx)^T y,   with   |Δx| ≤ γ_n |x|
    fl(x^T y) = x^T (y + Δy),   with   |Δy| ≤ γ_n |y|

(Continuation) Let A be an m × n matrix, x an n-vector, and y = Ax. Show that there exists a matrix ΔA such that

    fl(y) = (A + ΔA) x,   with   |ΔA| ≤ γ_n |A|

(Continuation) From the above, derive a result about a column of the product of two matrices A and B. Does a similar result hold for the product AB as a whole?

Supplemental notes: Floating Point Arithmetic

In most computing systems, real numbers are represented in two parts: a mantissa and an exponent. If the representation is in the base β then:

    x = ±(.d_1 d_2 ... d_m)_β × β^e

.d_1 d_2 ... d_m is a fraction in the base-β representation, and e is an integer - it can be negative, positive or zero. Generally the form is normalized in that d_1 ≠ 0.

Example: in base 10 (for illustration), 1000.2 can be written as .10002e+04, and 1.07 can be written as .10700e+01.

Problem with floating point arithmetic: we have to live with limited precision.

Example: assume that we have only 5 digits of accuracy in the mantissa and 2 digits for the exponent (excluding signs):

    ±.d_1 d_2 d_3 d_4 d_5 × 10^(±e_1 e_2)

Try to add 1000.2 = .10002e+04 and 1.07 = .10700e+01.

First task: align decimal points. The number with the smallest exponent is (internally) rewritten so its exponent matches the largest one:

    1.07 = .000107e+04

Second task: add mantissas:

    .10002 + .000107 = .100127

Third task: round the result. The result has 6 digits - we can use only 5, so we can

    chop the result:  .10012e+04 ;
    round the result: .10013e+04 ;

Fourth task: normalize the result if needed (not needed here).

Result with rounding: .10013e+04, i.e. 1001.3.

Redo the same steps with other operands.
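A small Matlab sketch of the 5-digit decimal rounding step (the decomposition through log10 is just one way to simulate decimal arithmetic and is an illustrative choice):

% Simulate 5-digit decimal arithmetic for the example above: chop vs round.
s = 1000.2 + 1.07;                         % exact sum: 1001.27
e = floor(log10(s)) + 1;                   % s = m * 10^e with 0.1 <= m < 1
m = s / 10^e;
chopped = floor(m * 1e5) / 1e5 * 10^e;     % keep 5 digits by chopping
rounded = round(m * 1e5) / 1e5 * 10^e;     % keep 5 digits by rounding to nearest
fprintf('chop: %.5g   round: %.5g\n', chopped, rounded);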

Some More Examples

Each operation fl(x ⊙ y) proceeds in 4 steps:
1. Line up exponents (for addition & subtraction).
2. Compute a temporary exact answer.
3. Normalize the temporary result.
4. Round to the nearest representable number (round-to-even in case of a tie).

[The slide works through three numerical examples, listing for each one the temporary result, the normalized result, and the rounded result. The surviving annotations indicate: one ordinary round-to-nearest case (the result is closer to the upper of the two neighboring values); one case where the result is exactly halfway between two representable values and the round-to-even rule decides; and one case where the exponent is already at its minimum, so the result is too small to normalize and is left unnormalized.]

The IEEE standard

32 bit (single precision):

    [ sign (1 bit) | exponent (8 bits) | mantissa (23 bits) ]

In binary, the mantissa of a normalized number always starts with a leading one, so this leading one does not need to be represented: one bit is gained. This is the "hidden bit".

Largest exponent: 127; smallest: -126. [Bias of 127.]

64 bit (double precision):

    [ sign (1 bit) | exponent (11 bits) | mantissa (52 bits) ]

Bias of 1023: if c is the content of the exponent field, the actual exponent value is c - 1023, and the number represented is ±1.m × 2^(c-1023).

An exponent field of 2047 (all ones) is reserved for special use.

Largest exponent: 1023; smallest: -1022. Including the hidden bit, the mantissa has a total of 53 bits (52 bits represented, one hidden).
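As a quick way to look at these fields (not part of the slides), Matlab's num2hex prints the raw IEEE bit pattern of a double as 16 hexadecimal digits: 1 sign bit, 11 exponent bits, 52 mantissa bits:

% Inspect IEEE double precision bit patterns.
num2hex(1)      % 3ff0000000000000 : exponent field 1023, i.e. actual exponent 0
num2hex(2)      % 4000000000000000 : exponent field 1024, i.e. actual exponent 1
num2hex(-1)     % bff0000000000000 : same as 1 but with the sign bit set
num2hex(eps)    % 3cb0000000000000 : exponent field 971, i.e. eps = 2^(971-1023) = 2^-52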

Take the number 1.0 and see what happens if you add 1/2, 1/4, ..., 2^(-i). Do not forget the hidden bit!

[The slide lists the hidden bit, the exponent field and the 52 represented mantissa bits of 1 + 2^(-i) for several values of i: the added power of two sets a single mantissa bit further and further to the right, until it falls off the end of the 52 represented bits. (The exponent part shown has 12 bits and includes the sign.)]

Conclusion:

    fl(1 + 2^(-52)) ≠ 1   but:   fl(1 + 2^(-53)) == 1 !!
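This is easy to confirm directly in Matlab:

% 2^-52 is the last represented bit of the mantissa of 1.0; 2^-53 falls off the end.
1 + 2^-52 > 1     % true: fl(1 + 2^-52) is the next float after 1
1 + 2^-53 == 1    % true: fl(1 + 2^-53) rounds back to 1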

Special Values

Exponent field = all zeros (smallest possible value): no hidden bit. If in addition all mantissa bits are 0, the number is exactly zero. Otherwise this encodes unnormalized numbers, leading to gradual underflow.

Exponent field = all ones (largest possible value): the number represented is Inf, -Inf or NaN.
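A few Matlab one-liners that exercise these special values:

% Special values in IEEE double precision arithmetic.
1/0               % Inf
-1/0              % -Inf
0/0               % NaN
realmin           % smallest normalized double, 2^-1022
realmin/2^52      % smallest unnormalized (denormal) double, 2^-1074
realmin/2^53      % underflows to 0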

Appendix to set 3: Analysis of inner products

Consider

    s_n = fl(x_1 y_1 + x_2 y_2 + ... + x_n y_n)

In what follows the η_i's come from the multiplications and the ε_i's come from the additions. They satisfy |η_i| ≤ u and |ε_i| ≤ u. The inner product s_n is computed as:

1. s_1 = fl(x_1 y_1) = (x_1 y_1)(1 + η_1)

2. s_2 = fl(s_1 + fl(x_2 y_2)) = fl(s_1 + x_2 y_2 (1 + η_2))
       = (x_1 y_1 (1 + η_1) + x_2 y_2 (1 + η_2))(1 + ε_2)
       = x_1 y_1 (1 + η_1)(1 + ε_2) + x_2 y_2 (1 + η_2)(1 + ε_2)

3. s_3 = fl(s_2 + fl(x_3 y_3)) = fl(s_2 + x_3 y_3 (1 + η_3))
       = (s_2 + x_3 y_3 (1 + η_3))(1 + ε_3)

Expand:

    s_3 = x_1 y_1 (1 + η_1)(1 + ε_2)(1 + ε_3)
        + x_2 y_2 (1 + η_2)(1 + ε_2)(1 + ε_3)
        + x_3 y_3 (1 + η_3)(1 + ε_3)

Induction shows that [with the convention that ε_1 ≡ 0]

    s_n = Σ_{i=1}^n x_i y_i (1 + η_i) Π_{j=i}^n (1 + ε_j)

Q: How many factors appear in the coefficient of x_i y_i?
A: When i > 1: 1 + (n - i + 1) = n - i + 2.
   When i = 1: n (since ε_1 = 0 does not count).
Bottom line: always ≤ n.

For each of these products,

    (1 + η_i) Π_{j=i}^n (1 + ε_j) = 1 + θ_i,   with   |θ_i| ≤ γ_n

so:

    s_n = Σ_{i=1}^n x_i y_i (1 + θ_i)   with   |θ_i| ≤ γ_n

or:

    fl(Σ_{i=1}^n x_i y_i) = Σ_{i=1}^n x_i y_i + Σ_{i=1}^n x_i y_i θ_i   with   |θ_i| ≤ γ_n

This leads to the final result (forward form)

    |fl(Σ_{i=1}^n x_i y_i) - Σ_{i=1}^n x_i y_i| ≤ γ_n Σ_{i=1}^n |x_i y_i|

or (backward form)

    fl(Σ_{i=1}^n x_i y_i) = Σ_{i=1}^n x_i y_i (1 + θ_i)   with   |θ_i| ≤ γ_n
