FLOATING POINT ARITHMETIC - ERROR ANALYSIS

- Brief review of floating point arithmetic
- Model of floating point arithmetic
- Notation, backward and forward errors

Roundoff errors and floating-point arithmetic

The basic problem: the set A of all numbers representable on a given machine is finite, but we would like to use this set to perform the standard arithmetic operations (+, -, *, /) on an infinite set. Since the results of operations are rounded, the usual algebra rules are no longer satisfied: basic algebra breaks down in floating point arithmetic.

Example: in floating point arithmetic,

$a + (b + c) \ne (a + b) + c$  in general.

Matlab experiment: for 10,000 random triples, count the number of instances in which the above inequality holds. Do the same for multiplication.
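A minimal sketch of this experiment in Matlab (variable names are illustrative):

    % Count how often floating point + and * fail to be associative.
    n = 10000;
    a = rand(n,1); b = rand(n,1); c = rand(n,1);
    add_fail  = sum(a + (b + c) ~= (a + b) + c);
    mult_fail = sum(a .* (b .* c) ~= (a .* b) .* c);
    fprintf('addition: %d of %d triples non-associative\n', add_fail, n);
    fprintf('multiplication: %d of %d triples non-associative\n', mult_fail, n);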

Floating point representation: real numbers are represented in two parts, a mantissa (significand) and an exponent. If the representation is in base $\beta$, then

$x = \pm (.d_1 d_2 \cdots d_t)_\beta \; \beta^e$

where $.d_1 d_2 \cdots d_t$ is a fraction in the base-$\beta$ representation (generally the form is normalized, so that $d_1 \ne 0$) and $e$ is an integer. It is often more convenient to rewrite the above as

$x = \pm (m/\beta^t)\, \beta^e = \pm m\, \beta^{e-t}$

where the mantissa $m$ is an integer with $0 \le m \le \beta^t - 1$.
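As an illustration of the normalized form with $\beta = 2$, Matlab's two-output log2 returns exactly this fraction/exponent splitting (a sketch; the value 6.25 is an arbitrary example):

    % Decompose x into x = f * 2^e with 0.5 <= |f| < 1 (normalized fraction).
    [f, e] = log2(6.25);     % f = 0.78125, e = 3, since 6.25 = 0.78125 * 2^3
    m = f * 2^53;            % integer mantissa m with x = m * 2^(e-53) (t = 53)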

Machine precision - machine epsilon

Notation: fl(x) = the closest floating point representation of the real number x ("rounding").

When a number x is very small, there is a point at which 1 + x == 1 in the machine sense: the computer no longer distinguishes between 1 and 1 + x.

Machine epsilon: the smallest number $\epsilon$ such that $1 + \epsilon$ is a float different from one is called machine epsilon. Denoted by macheps or eps, it represents the distance from 1 to the next larger floating point number. With the previous representation, eps is equal to $\beta^{-(t-1)}$.

Example: in IEEE standard double precision, $\beta = 2$ and $t = 53$ (this includes the "hidden bit"). Therefore eps $= 2^{-52}$.

Unit round-off: a real number x can be approximated by a floating point number fl(x) with relative error no larger than $u = \frac{1}{2}\beta^{-(t-1)}$. The number $u$ is called the unit round-off. In fact one can easily show:

$fl(x) = x(1 + \delta)$ with $|\delta| \le u$

Matlab experiment: find the machine epsilon on your computer (see the sketch below).

There have been many discussions on what conditions/rules should be satisfied by floating point arithmetic. The IEEE standard is a set of rules adopted by many CPU manufacturers.
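A sketch of this experiment by repeated halving (the built-in eps is used only as a cross-check):

    % Halve a candidate until adding it to 1 no longer changes 1.
    e = 1.0;
    while 1.0 + e/2 > 1.0
        e = e/2;
    end
    fprintf('computed eps = %g, built-in eps = %g\n', e, eps);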

Rule 1. $fl(x) = x(1 + \epsilon)$, where $|\epsilon| \le u$.

Rule 2. For $\odot$ any one of the operations $+, -, *, /$:

$fl(x \odot y) = (x \odot y)(1 + \epsilon_\odot)$, where $|\epsilon_\odot| \le u$

Rule 3. For the commutative operations $+$ and $*$:

$fl(a \odot b) = fl(b \odot a)$

Matlab experiment: verify Rule 3 experimentally with 10,000 randomly generated pairs $a_i, b_i$ (see the sketch below).
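A sketch of the Rule 3 experiment; the violation counts should be zero on IEEE-conforming hardware:

    % Commutativity of rounded + and * should hold exactly.
    n = 10000;
    a = rand(n,1); b = rand(n,1);
    fprintf('fl(a+b) ~= fl(b+a): %d cases\n', sum(a + b ~= b + a));
    fprintf('fl(a*b) ~= fl(b*a): %d cases\n', sum(a .* b ~= b .* a));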

Example: consider the sum of 3 numbers, $y = a + b + c$, done as $fl(fl(a + b) + c)$:

$\eta = fl(a + b) = (a + b)(1 + \epsilon_1)$

$y_1 = fl(\eta + c) = (\eta + c)(1 + \epsilon_2)$
$\quad = [(a + b)(1 + \epsilon_1) + c](1 + \epsilon_2)$
$\quad = [(a + b + c) + (a + b)\epsilon_1](1 + \epsilon_2)$
$\quad = (a + b + c)\left[1 + \frac{a + b}{a + b + c}\,\epsilon_1(1 + \epsilon_2) + \epsilon_2\right]$

So, disregarding the second order term $\epsilon_1\epsilon_2$:

$fl(fl(a + b) + c) = (a + b + c)(1 + \epsilon_3)$, with $\epsilon_3 \approx \frac{a + b}{a + b + c}\,\epsilon_1 + \epsilon_2$

If we redid the computation as $y_2 = fl(a + fl(b + c))$ we would find

$fl(a + fl(b + c)) = (a + b + c)(1 + \epsilon_4)$, with $\epsilon_4 \approx \frac{b + c}{a + b + c}\,\epsilon_1 + \epsilon_2$

The error is amplified by the factor $(a + b)/y$ in the first case and $(b + c)/y$ in the second case. Hence, in order to sum n numbers accurately, it is better to start with the small numbers first. [However, fully sorting before adding is usually not worth the cost.] But watch out if the numbers have mixed signs!
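Two small illustrations of these points (a sketch; the value 1e16 and the series $1/k^2$ are just convenient examples):

    % Associativity failure with mixed magnitudes and signs:
    a = 1; b = 1e16; c = -1e16;
    (a + b) + c     % 0: the contribution of a is lost in fl(a + b)
    a + (b + c)     % 1: exact

    % Summing 1/k^2 in single precision: small terms first is more accurate.
    n = 1e6;
    terms = single(1 ./ ((1:n).^2));
    ref = sum(double(terms));               % double precision reference
    s1 = single(0); for k = 1:n,    s1 = s1 + terms(k); end  % large terms first
    s2 = single(0); for k = n:-1:1, s2 = s2 + terms(k); end  % small terms first
    fprintf('large first err: %g, small first err: %g\n', ...
            abs(double(s1) - ref), abs(double(s2) - ref));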

The absolute value notation

For a given vector $x$, $|x|$ is the vector with components $|x_i|$; i.e., $|x|$ is the componentwise absolute value of $x$. Similarly for matrices:

$|A| = \{\, |a_{ij}| \,\}_{i=1,\dots,m;\; j=1,\dots,n}$

An obvious result: the basic inequality $|fl(a_{ij}) - a_{ij}| \le u\,|a_{ij}|$ translates into

$fl(A) = A + E$ with $|E| \le u\,|A|$

Notation: $A \le B$ means $a_{ij} \le b_{ij}$ for all $1 \le i \le m$, $1 \le j \le n$.

Error analysis: inner products

Inner products occur in the innermost parts of many calculations, so their analysis is important.

Lemma: if $|\delta_i| \le u$ and $nu < 1$, then

$\prod_{i=1}^n (1 + \delta_i) = 1 + \theta_n$ where $|\theta_n| \le \frac{nu}{1 - nu}$

Common notation: $\gamma_n \equiv \frac{nu}{1 - nu}$.

Exercise: prove the lemma. [Hint: use induction.]

One can use the following simpler result:

Lemma: if $|\delta_i| \le u$ and $nu < .01$, then

$\prod_{i=1}^n (1 + \delta_i) = 1 + \theta_n$ where $|\theta_n| \le 1.01\,nu$

Example: the previous sum of numbers can be written

$fl(a + b + c) = a(1 + \epsilon_1)(1 + \epsilon_2) + b(1 + \epsilon_1)(1 + \epsilon_2) + c(1 + \epsilon_2)$
$\quad = a(1 + \theta_1) + b(1 + \theta_2) + c(1 + \theta_3)$
$\quad =$ exact sum of slightly perturbed inputs,

where all the $\theta_i$ satisfy $|\theta_i| \le 1.01\,nu$ (here $n = 2$). Alternatively, one can write the forward bound:

$|fl(a + b + c) - (a + b + c)| \le |a\theta_1| + |b\theta_2| + |c\theta_3|$

Backward and forward errors

Assume the approximation $\hat y$ to $y = alg(x)$ is computed by some algorithm with arithmetic precision $\epsilon$. A possible analysis: find an upper bound for the forward error

$\Delta y = |y - \hat y|$

This is not always easy. An alternative question: find an equivalent perturbation of the initial data $x$ that produces the result $\hat y$ exactly. In other words, find $\Delta x$ so that

$alg(x + \Delta x) = \hat y$

The value of $|\Delta x|$ is called the backward error. An analysis that finds an upper bound for $|\Delta x|$ is called backward error analysis.
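A toy illustration of the definition, taking alg(x) = sqrt(x) as a hypothetical example (the computed dx is itself subject to rounding, so this is only a sketch):

    % For yhat = fl(sqrt(x)), dx = yhat^2 - x gives sqrt(x + dx) = yhat exactly.
    x = 2;
    yhat = sqrt(x);
    dx = yhat^2 - x;        % backward error of the computed square root
    fprintf('relative backward error |dx|/x = %g (u = %g)\n', abs(dx)/x, eps/2);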

Example: let

$A = \begin{pmatrix} a & b \\ 0 & c \end{pmatrix}, \qquad B = \begin{pmatrix} d & e \\ 0 & f \end{pmatrix}$

Consider the product:

$fl(A \cdot B) = \begin{pmatrix} (ad)(1 + \epsilon_1) & [ae(1 + \epsilon_2) + bf(1 + \epsilon_3)](1 + \epsilon_4) \\ 0 & cf(1 + \epsilon_5) \end{pmatrix}$

with $|\epsilon_i| \le u$ for $i = 1, \dots, 5$. The result can be written as

$\begin{pmatrix} a & b(1 + \epsilon_3)(1 + \epsilon_4) \\ 0 & c(1 + \epsilon_5) \end{pmatrix} \begin{pmatrix} d(1 + \epsilon_1) & e(1 + \epsilon_2)(1 + \epsilon_4) \\ 0 & f \end{pmatrix}$

So $fl(A \cdot B) = (A + E_A)(B + E_B)$, where the backward errors $E_A$, $E_B$ satisfy

$|E_A| \le 2u\,|A| + O(u^2), \qquad |E_B| \le 2u\,|B| + O(u^2)$

When solving $Ax = b$ by Gaussian elimination, we will see that a bound on $e_x$ such that

$A(x_{computed} + e_x) = b$

holds exactly is much harder to find than bounds on $E_A$, $e_b$ such that

$(A + E_A)\,x_{computed} = b + e_b$

holds exactly.

Note: in many instances backward errors are more meaningful than forward errors. If the initial data is accurate to only 4 digits, say, then an algorithm for computing $x$ need not guarantee a backward error of less than $10^{-10}$, for example; a backward error of order $10^{-4}$ is acceptable.
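A sketch of how this is measured in practice: the normwise backward error of a computed solution can be estimated from the residual (here perturbations are allowed in A only; the random test problem is just an example):

    % Backward error estimate for xhat = A\b from the residual.
    n = 100;
    A = randn(n); b = randn(n,1);
    xhat = A \ b;                               % Gaussian elimination
    r = b - A*xhat;                             % residual
    beta = norm(r) / (norm(A) * norm(xhat));    % normwise backward error in A
    fprintf('backward error %.2e vs u = %.2e\n', beta, eps/2);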

Main result on inner products:

Backward error expression:

$fl(x^T y) = [x \mathbin{.*} (1 + d_x)]^T\, [y \mathbin{.*} (1 + d_y)]$

where $|d_\star| \le 1.01\,nu$, $\star = x, y$. One can show that the equality remains valid even if one of $d_x$, $d_y$ is absent.

Forward error expression:

$|fl(x^T y) - x^T y| \le \gamma_n\, |x|^T |y|$ with $0 \le \gamma_n \le 1.01\,nu$

Here $|x|$ denotes the elementwise absolute value of $x$ and $.*$ denotes elementwise multiplication. The above assumes $nu \le .01$. For $u = 2.0 \times 10^{-16}$, this holds for $n \le 4.5 \times 10^{13}$.

Consequence of the lemma:

$|fl(AB) - AB| \le \gamma_n\, |A|\,|B|$

Another way to write the result (less precise) is

$|fl(x^T y) - x^T y| \le n\,u\, |x|^T |y| + O(u^2)$
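A sketch that tests the entrywise matrix-product bound in single precision, using a double precision product as reference (names and sizes are illustrative):

    % Check |fl(AB) - AB| <= gamma_n |A||B| entrywise, with u = eps('single')/2.
    n = 200; u = eps('single')/2; gamma_n = n*u / (1 - n*u);
    A = single(randn(n)); B = single(randn(n));
    E = abs(double(A*B) - double(A)*double(B));    % actual rounding errors
    bound = gamma_n * (abs(double(A)) * abs(double(B)));
    fprintf('bound holds everywhere: %d\n', all(E(:) <= bound(:)));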

Exercise: assume you use single precision, for which $u = 2 \times 10^{-6}$. What is the largest $n$ for which $nu \le 0.01$ holds? Any conclusions for the use of single precision arithmetic?

Exercise: what does the main result on inner products imply for the case $y = x$? [Contrast the relative accuracy you get in this case with the general case $y \ne x$.]

Exercise: show that for any $x, y$ there exist $\Delta x, \Delta y$ such that

$fl(x^T y) = (x + \Delta x)^T y$, with $|\Delta x| \le \gamma_n |x|$
$fl(x^T y) = x^T (y + \Delta y)$, with $|\Delta y| \le \gamma_n |y|$

Exercise (continuation): let $A$ be an $m \times n$ matrix, $x$ an $n$-vector, and $y = Ax$. Show that there exists a matrix $\Delta A$ such that

$fl(y) = (A + \Delta A)x$, with $|\Delta A| \le \gamma_n |A|$

Exercise (continuation): from the above, derive a result about a column of the product of two matrices $A$ and $B$. Does a similar result hold for the product $AB$ as a whole?

Supplemental notes: Floating Point Arithmetic

In most computing systems, real numbers are represented in two parts: a mantissa and an exponent. If the representation is in base $\beta$ then

$x = \pm (.d_1 d_2 \cdots d_m)_\beta \; \beta^e$

where $.d_1 d_2 \cdots d_m$ is a fraction in the base-$\beta$ representation and $e$ is an integer, which can be negative, positive, or zero. Generally the form is normalized, so that $d_1 \ne 0$.

Example: in base 10 (for illustration):

1. 1000.12345 can be written as $0.100012345 \times 10^4$
2. 0.000812345 can be written as $0.812345 \times 10^{-3}$

Problem with floating point arithmetic: we have to live with limited precision.

Example: assume that we have only 5 digits of accuracy in the mantissa and 2 digits for the exponent (excluding signs):

$.d_1 d_2 d_3 d_4 d_5 \;|\; e_1 e_2$

Try to add 1000.2 = .10002e+04 and 1.07 = .10700e+01:

1000.2 = .10002 × 10^4;  1.07 = .10700 × 10^1

First task: align the decimal points. The number with the smaller exponent is (internally) rewritten so its exponent matches the larger one: 1.07 = 0.000107 × 10^4.

Second task: add the mantissas:

0.10002 + 0.000107 = 0.100127

Third task: round the result. The result has 6 digits but we can use only 5, so we can either

chop the result: .10012, or round the result: .10013.

Fourth task: normalize the result if needed (not needed here). The result with rounding is .10013 × 10^4.

Exercise: redo the same thing with 7000.2 + 4000.3, and then with 6999.2 + 4000.3.
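These decimal examples can be mimicked in Matlab by rounding every result to 5 significant digits (a sketch; round with the 'significant' option performs the rounding step only, not chopping):

    % Toy 5-digit decimal arithmetic: round each result to 5 significant digits.
    fl5 = @(x) round(x, 5, 'significant');
    fl5(1000.2 + 1.07)      % 1001.3 = .10013e+04, as in the worked example
    fl5(7000.2 + 4000.3)    % the 6th digit of 11000.5 cannot be kept
    fl5(6999.2 + 4000.3)    % compare with the previous case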

Some more examples

Each operation fl(x ∘ y) proceeds in 4 steps:
1. Line up the exponents (for addition and subtraction).
2. Compute a temporary, exact answer.
3. Normalize the temporary result.
4. Round to the nearest representable number (round-to-even in case of a tie).

Example 1: .40015e+02 + .60010e+02
  temporary: 1.00025e+02
  normalize: .100025e+03
  round:     .10002e+03   (exactly halfway between values: round to even)

Example 2: .40010e+02 + .50001e-03
  temporary: .4001050001e+02   (already normalized)
  round:     .40011e+02        (round to nearest: closer to the upper value)

Example 3: .41015e-98 - .41010e-98
  temporary: .00005e-98
  normalize: .00050e-99   (cannot normalize fully: the exponent is at its minimum)
  round:     .00050e-99   (nonzero but unnormalized: too small)

The IEEE standard

32 bit (single precision):

[ sign (1 bit) | exponent (8 bits) | mantissa (23 bits) ]

In binary, the leading one in the normalized mantissa does not need to be represented, so one bit is gained: the "hidden bit".

Largest exponent: $2^7 - 1 = 127$; smallest: $-126$. [Bias of 127.]

64 bit (double precision):

[ sign (1 bit) | exponent (11 bits) | mantissa (52 bits) ]

Bias of 1023: if $c$ is the content of the exponent field, the actual exponent value is $c - 1023$. The pattern $e + bias = 2047$ (all ones) is reserved for special use.

Largest exponent: 1023; smallest: $-1022$. Including the hidden bit, the mantissa has a total of 53 bits (52 bits represented, one hidden).
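A sketch that exposes these fields for a concrete double (the value $-6.25 = -1.5625 \times 2^2$ is an arbitrary example):

    % Pull apart the sign, exponent, and fraction fields of an IEEE double.
    u64 = typecast(-6.25, 'uint64');                       % raw 64 bits
    s = bitget(u64, 64);                                   % sign bit: 1
    c = double(bitand(bitshift(u64, -52), uint64(2047)));  % exponent field
    e = c - 1023;                                          % unbiased exponent: 2
    f = bitand(u64, bitshift(uint64(1), 52) - 1);          % 52 stored fraction bits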

Take the number 1.0 and see what happens if you add 1/2, 1/4, ..., $2^{-i}$. Do not forget the hidden bit!

hidden bit | 52 stored bits
1 . 1 0 0 ... 0 0   (1 + 2^{-1})
1 . 0 1 0 ... 0 0   (1 + 2^{-2})
1 . 0 0 1 ... 0 0   (1 + 2^{-3})
...
1 . 0 0 0 ... 0 1   (1 + 2^{-52})
1 . 0 0 0 ... 0 0   (1 + 2^{-53}: rounded back to 1)

(Note: the exponent part has 12 bits and includes the sign.)

Conclusion: $fl(1 + 2^{-52}) \ne 1$, but $fl(1 + 2^{-53}) == 1$!
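The conclusion is easy to verify directly (a two-line sketch):

    % 2^-52 = eps flips the last stored bit; 2^-53 is rounded away (round to even).
    disp(1 + 2^-52 > 1)     % logical 1 (true)
    disp(1 + 2^-53 == 1)    % logical 1 (true)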

Special values

Exponent field = 00000000000 (smallest possible value): no hidden bit. If all mantissa bits are also 0, the number is exactly zero; otherwise this encodes unnormalized (denormal) numbers, allowing gradual underflow.

Exponent field = 11111111111 (largest possible value): the number represented is Inf, -Inf, or NaN.
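A brief sketch of these special encodings in action:

    realmin            % 2^-1022: smallest normalized double (exponent field = 1)
    realmin / 2^52     % 2^-1074: smallest positive denormal (gradual underflow)
    1/0                % Inf  (exponent field all ones, fraction zero)
    0/0                % NaN  (exponent field all ones, fraction nonzero)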

Appendix to set 3: analysis of inner products

Consider $s_n = fl(x_1 y_1 + x_2 y_2 + \cdots + x_n y_n)$. In what follows, the $\eta_i$ come from $*$ and the $\epsilon_i$ come from $+$; they satisfy $|\eta_i| \le u$ and $|\epsilon_i| \le u$. The inner product $s_n$ is computed as:

1. $s_1 = fl(x_1 y_1) = (x_1 y_1)(1 + \eta_1)$

2. $s_2 = fl(s_1 + fl(x_2 y_2)) = fl(s_1 + x_2 y_2(1 + \eta_2))$
$\quad = (x_1 y_1(1 + \eta_1) + x_2 y_2(1 + \eta_2))(1 + \epsilon_2)$
$\quad = x_1 y_1(1 + \eta_1)(1 + \epsilon_2) + x_2 y_2(1 + \eta_2)(1 + \epsilon_2)$

3. $s_3 = fl(s_2 + fl(x_3 y_3)) = fl(s_2 + x_3 y_3(1 + \eta_3)) = (s_2 + x_3 y_3(1 + \eta_3))(1 + \epsilon_3)$

Expanding:

$s_3 = x_1 y_1(1 + \eta_1)(1 + \epsilon_2)(1 + \epsilon_3) + x_2 y_2(1 + \eta_2)(1 + \epsilon_2)(1 + \epsilon_3) + x_3 y_3(1 + \eta_3)(1 + \epsilon_3)$

Induction shows that [with the convention $\epsilon_1 \equiv 0$]

$s_n = \sum_{i=1}^n x_i y_i (1 + \eta_i) \prod_{j=i}^n (1 + \epsilon_j)$

Q: How many $(1 + \cdot)$ factors appear in the coefficient of $x_i y_i$?
A: When $i > 1$: $1 + (n - i + 1) = n - i + 2$. When $i = 1$: $n$ (since $\epsilon_1 = 0$ does not count). Bottom line: always at most $n$.

For each of these products,

$(1 + \eta_i) \prod_{j=i}^n (1 + \epsilon_j) = 1 + \theta_i$ with $|\theta_i| \le \gamma_n$

so

$s_n = \sum_{i=1}^n x_i y_i (1 + \theta_i)$, i.e., $fl\!\left(\sum_{i=1}^n x_i y_i\right) = \sum_{i=1}^n x_i y_i + \sum_{i=1}^n x_i y_i \theta_i$ with $|\theta_i| \le \gamma_n$

This leads to the final result (forward form)

$\left| fl\!\left(\sum_{i=1}^n x_i y_i\right) - \sum_{i=1}^n x_i y_i \right| \le \gamma_n \sum_{i=1}^n |x_i y_i|$

or (backward form)

$fl\!\left(\sum_{i=1}^n x_i y_i\right) = \sum_{i=1}^n x_i y_i (1 + \theta_i)$ with $|\theta_i| \le \gamma_n$
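A sketch that mirrors this term-by-term accumulation in single precision and compares the final error against the $\gamma_n$ bound (names are illustrative; the double precision sum serves as the "exact" reference):

    % Accumulate fl(x'*y) as in the s_1, s_2, ... recursion above.
    n = 10000; u = eps('single')/2; gamma_n = n*u / (1 - n*u);
    x = single(randn(n,1)); y = single(randn(n,1));
    s = single(0);
    for i = 1:n
        s = s + x(i)*y(i);    % one eta_i (product) and one eps_i (sum) per step
    end
    ref = double(x)' * double(y);
    fprintf('error %.3g <= bound %.3g\n', ...
            abs(double(s) - ref), gamma_n * (abs(double(x))' * abs(double(y))));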