FLOATING POINT ARITHMETIC - ERROR ANALYSIS

Brief review of floating point arithmetic
Model of floating point arithmetic
Notation, backward and forward errors

Reading: Text 13-15; GvL 2.7; Ort 9.2; AB 1.4.1.2

Roundoff errors and floating-point arithmetic

The basic problem: the set A of all representable numbers on a given machine is finite, but we would like to use this set to perform the standard arithmetic operations (+, *, -, /) on an infinite set. The usual rules of algebra are no longer satisfied, since the results of operations are rounded: basic algebra breaks down in floating point arithmetic.

Example: in floating point arithmetic,

    a + (b + c) ≠ (a + b) + c

Matlab experiment: for 10,000 random triples, count the number of instances for which the above inequality holds. Do the same for multiplication.
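A minimal Matlab sketch of this experiment (the uniform random distribution and the counting scheme are one possible choice):

    % Count how often floating-point addition / multiplication
    % fail to be associative for 10,000 random triples.
    n = 10000;
    a = rand(n,1); b = rand(n,1); c = rand(n,1);
    add_fail = sum(a + (b + c) ~= (a + b) + c);
    mul_fail = sum(a .* (b .* c) ~= (a .* b) .* c);
    fprintf('addition: %d of %d,  multiplication: %d of %d\n', ...
            add_fail, n, mul_fail, n)

Typically a nonzero count appears in both cases.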

Machine precision - machine epsilon

When a number x is very small, there is a point at which 1 + x == 1 in the machine sense: the computer no longer distinguishes between 1 and 1 + x.

Machine epsilon: the smallest number ɛ such that fl(1 + ɛ) ≠ 1. This number is denoted by u, sometimes by eps.

Matlab experiment: find the machine epsilon on your computer.

There have been many discussions on what conditions/rules should be satisfied by floating point arithmetic. The IEEE standard is a set of rules adopted by many CPU manufacturers.
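One way to carry out this experiment in Matlab (a sketch; repeated halving is a standard way to locate the threshold):

    % Halve x until 1 + x is no longer distinguishable from 1.
    x = 1;
    while 1 + x/2 > 1
        x = x/2;
    end
    x      % last x with fl(1 + x) > 1
    eps    % Matlab's built-in value: 2^-52 ~ 2.2204e-16 in double precision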

Rule 1. fl(x) = x(1 + ɛ), where |ɛ| ≤ u

Rule 2. For all operations ⊙ (one of +, −, ×, /),

    fl(x ⊙ y) = (x ⊙ y)(1 + ɛ'), where |ɛ'| ≤ u

Rule 3. For the +, × operations,

    fl(a ⊙ b) = fl(b ⊙ a)

Matlab experiment: verify Rule 3 experimentally with 10,000 randomly generated pairs a_i, b_i.
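Rule 3 can be probed with a short Matlab sketch along these lines:

    % Look for counterexamples to commutativity of fl(+) and fl(*).
    n = 10000;
    a = rand(n,1); b = rand(n,1);
    any(a + b ~= b + a)      % expect 0: no counterexample found
    any(a .* b ~= b .* a)    % expect 0: no counterexample found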

Example: Consider the sum of 3 numbers: y = a + b + c, computed as fl(fl(a + b) + c):

    η   = fl(a + b) = (a + b)(1 + ɛ_1)
    y_1 = fl(η + c) = (η + c)(1 + ɛ_2)
        = [(a + b)(1 + ɛ_1) + c](1 + ɛ_2)
        = [(a + b + c) + (a + b)ɛ_1](1 + ɛ_2)
        = (a + b + c) [ 1 + (a + b)/(a + b + c) · ɛ_1(1 + ɛ_2) + ɛ_2 ]

So, disregarding the second-order term ɛ_1 ɛ_2,

    fl(fl(a + b) + c) = (a + b + c)(1 + ɛ_3),   ɛ_3 ≈ (a + b)/(a + b + c) · ɛ_1 + ɛ_2

If we redid the computation as y_2 = fl(a + fl(b + c)) we would find

    fl(a + fl(b + c)) = (a + b + c)(1 + ɛ_4),   ɛ_4 ≈ (b + c)/(a + b + c) · ɛ_1 + ɛ_2

The error on ɛ_1 is amplified by the factor (a + b)/y in the first case and (b + c)/y in the second case. In order to sum n numbers more accurately, it is better to start with the small numbers first. [However, sorting before adding is usually not worth the cost!] But watch out if the numbers have mixed signs!
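A Matlab illustration of the effect of ordering (a sketch in single precision; the value 1e-8 is chosen to lie below half an ulp of 1 in single precision):

    % Adding n tiny terms to 1: largest-first vs smallest-first.
    n = 1e6;
    t = 1e-8 * ones(n, 1, 'single');
    s1 = single(1);
    for k = 1:n, s1 = s1 + t(k); end   % large number first: fl(s1 + 1e-8) == s1 every time
    s2 = sum(t) + single(1);           % small numbers first: they accumulate to ~0.01
    fprintf('large-first: %.6f   small-first: %.6f   exact: %.6f\n', s1, s2, 1.01)

Here s1 stays exactly 1, while s2 comes out close to the exact value 1.01.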

The absolute value notation

For a given vector x, |x| is the vector with components |x_i|, i.e., |x| is the component-wise absolute value of x. Similarly for matrices:

    |A| = { |a_ij| },  i = 1, ..., m;  j = 1, ..., n

An obvious result: the basic inequality |fl(a_ij) − a_ij| ≤ u |a_ij| translates into

    fl(A) = A + E   with   |E| ≤ u |A|

Also, A ≤ B means a_ij ≤ b_ij for all 1 ≤ i ≤ m, 1 ≤ j ≤ n.

Error Analysis: Inner products

Inner products occur in the innermost parts of many calculations, so their analysis is important.

Lemma: If |δ_i| ≤ u and nu < 1, then

    Π_{i=1}^{n} (1 + δ_i) = 1 + θ_n,   where   |θ_n| ≤ nu / (1 − nu)

Common notation:  γ_n := nu / (1 − nu)

Prove the lemma. [Hint: use induction.]

One can use the following simpler result:

Lemma: If |δ_i| ≤ u and nu < .01, then

    Π_{i=1}^{n} (1 + δ_i) = 1 + θ_n,   where   |θ_n| ≤ 1.01 nu

The previous sum of three numbers can be written

    fl(a + b + c) = a(1 + ɛ_1)(1 + ɛ_2) + b(1 + ɛ_1)(1 + ɛ_2) + c(1 + ɛ_2)
                  = a(1 + θ_1) + b(1 + θ_2) + c(1 + θ_3)
                  = exact sum of slightly perturbed inputs,

where each θ_i satisfies |θ_i| ≤ 1.01 nu (here n = 3). This implies the forward bound:

    |fl(a + b + c) − (a + b + c)| ≤ |a| |θ_1| + |b| |θ_2| + |c| |θ_3|

Main result on inner products:

Backward error expression:

    fl(x^T y) = [x .* (1 + d_x)]^T [y .* (1 + d_y)]

where |d_x| ≤ 1.01 nu and |d_y| ≤ 1.01 nu (componentwise). One can show the equality remains valid even if one of d_x, d_y is absent.

Forward error expression:

    |fl(x^T y) − x^T y| ≤ γ_n |x|^T |y|,   where   0 ≤ γ_n ≤ 1.01 nu

Here |x| denotes the elementwise absolute value of x and .* denotes elementwise multiplication. The above assumes nu ≤ .01; for u = 2.0 × 10^−16, this holds for n ≤ 4.5 × 10^13.
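A quick Matlab check of the forward bound (a sketch: the single-precision inner product is compared against a double-precision reference; converting x and y to single adds two more rounding errors, so γ_{n+2} is used to be safe):

    n = 1000;  u = eps('single')/2;            % unit roundoff, single precision
    x = randn(n,1);  y = randn(n,1);
    ip_computed = double(single(x)' * single(y));
    ip_exact    = x' * y;                      % double result as (near-)exact reference
    err   = abs(ip_computed - ip_exact);
    m = n + 2;                                 % account for the input conversions
    bound = (m*u/(1 - m*u)) * (abs(x)' * abs(y));   % gamma_m * |x|^T |y|
    [err, bound]                               % err should not exceed bound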

Consequence of the lemma (for a matrix product with inner dimension n):

    |fl(A B) − A B| ≤ γ_n |A| |B|

Another (less precise) way to write the inner product result is

    |fl(x^T y) − x^T y| ≤ n u |x|^T |y| + O(u^2)

Assume you use single precision, for which u = 2 × 10^−6. What is the largest n for which nu ≤ 0.01 holds? Any conclusions for the use of single precision arithmetic?

What does the main result on inner products imply for the case y = x? [Contrast the relative accuracy you get in this case with the general case y ≠ x.]

Backward and forward errors

Assume the approximation ŷ to y = alg(x) is computed by some algorithm with arithmetic precision ɛ. A possible analysis: find an upper bound for the forward error

    |Δy| = |y − ŷ|

This is not always easy. An alternative question: find an equivalent perturbation of the initial data x that produces the result ŷ. In other words, find Δx so that

    alg(x + Δx) = ŷ

The value of |Δx| is called the backward error. An analysis that finds an upper bound for |Δx| is called backward error analysis.

Example: Let

    A = [ a  b ]        B = [ d  e ]
        [ 0  c ]            [ 0  f ]

Consider the product:

    fl(A·B) = [ (ad)(1 + ɛ_1)    (ae(1 + ɛ_2) + bf(1 + ɛ_3))(1 + ɛ_4) ]
              [ 0                cf(1 + ɛ_5)                          ]

with |ɛ_i| ≤ u for i = 1, ..., 5. The result can be written as:

    [ a    b(1 + ɛ_3)(1 + ɛ_4) ]   [ d(1 + ɛ_1)   e(1 + ɛ_2)(1 + ɛ_4) ]
    [ 0    c(1 + ɛ_5)          ] × [ 0            f                   ]

So fl(A·B) = (A + E_A)(B + E_B), where the backward errors E_A, E_B satisfy

    |E_A| ≤ 2u |A| + O(u^2) ;   |E_B| ≤ 2u |B| + O(u^2)

When solving Ax = b by Gaussian elimination, we will see that a bound on e_x such that

    A(x_computed + e_x) = b

holds exactly is much harder to find than bounds on E_A, e_b such that

    (A + E_A) x_computed = b + e_b

holds exactly.

Note: In many instances backward errors are more meaningful than forward errors: if the initial data is accurate to only 4 digits, say, then my algorithm for computing x need not guarantee a backward error of less than 10^−10, for example. A backward error of order 10^−4 is acceptable.
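The residual gives a computable handle on this (a Matlab sketch; the formula eta = norm(r)/(norm(A)*norm(xc)) is one classical normwise relative backward error measure, due to Rigal and Gaches, with the perturbation placed in A only):

    % Backward error of a computed solution to Ax = b.
    n = 100;
    A = randn(n);  b = randn(n,1);
    xc = A \ b;                          % Gaussian elimination with pivoting
    r  = b - A*xc;                       % residual
    eta = norm(r) / (norm(A)*norm(xc))   % typically of order u ~ 1e-16

Even when A is ill-conditioned and the forward error in xc is large, eta stays of order u for a backward-stable solver.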

Show that for any x, y there exist Δx, Δy such that

    fl(x^T y) = (x + Δx)^T y,   with |Δx| ≤ γ_n |x|
    fl(x^T y) = x^T (y + Δy),   with |Δy| ≤ γ_n |y|

(Continuation) Let A be an m × n matrix, x an n-vector, and y = Ax. Show that there exists a matrix ΔA such that

    fl(y) = (A + ΔA) x,   with |ΔA| ≤ γ_n |A|

(Continuation) From the above, derive a result about a column of the product of two matrices A and B. Does a similar result hold for the product AB as a whole?

Supplemental notes: Floating Point Arithmetic

In most computing systems, real numbers are represented in two parts: a mantissa and an exponent. If the representation is in base β, then:

    x = ±(.d_1 d_2 ... d_m)_β × β^e

where .d_1 d_2 ... d_m is a fraction in the base-β representation, and e is an integer, which can be negative, positive, or zero. Generally, the form is normalized in that d_1 ≠ 0.

Example: In base 10 (for illustration):

1. 1000.12345 can be written as .100012345 × 10^4
2. 0.000812345 can be written as .812345 × 10^−3

Problem with floating point arithmetic: we have to live with limited precision.

Example: Assume that we have only 5 digits of accuracy in the mantissa and 2 digits for the exponent (excluding signs):

    .d_1 d_2 d_3 d_4 d_5   e_1 e_2

Try to add 1000.2 = .10002e+04 and 1.07 = .10700e+01:

    1000.2 = .1 0 0 0 2  e 0 4 ;    1.07 = .1 0 7 0 0  e 0 1

First task: align the decimal points. The number with the smallest exponent is (internally) rewritten so its exponent matches the largest one:

    1.07 = 0.000107 × 10^4

Second task: add the mantissas:

      0.1 0 0 0 2
    + 0.0 0 0 1 0 7
    = 0.1 0 0 1 2 7

Third task: round the result. The result has 6 digits but we can use only 5, so we can

    chop the result:  .1 0 0 1 2 ;    or round the result:  .1 0 0 1 3

Fourth task: normalize the result if needed (not needed here). Result with rounding:

    .1 0 0 1 3  e 0 4

Redo the same thing with 7000.2 + 4000.3 or 6999.2 + 4000.3.
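The toy 5-digit arithmetic can be mimicked in Matlab by rounding every result to 5 significant digits (fl5 below is a hypothetical helper, not a built-in model of this arithmetic):

    fl5 = @(x) round(x, 5, 'significant');
    fl5(1000.2 + 1.07)     % 1001.3, i.e. .10013e+04, as derived above

Note that the two exercise sums produce the decimal ties 11000.5 and 10999.5, which is exactly where the choice between chopping, rounding half away from zero, and rounding to even becomes visible.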

Some More Examples

Each operation fl(x ⊙ y) proceeds in 4 steps:

1. Line up the exponents (for addition and subtraction).
2. Compute a temporary exact answer.
3. Normalize the temporary result.
4. Round to the nearest representable number (round to even in case of a tie).

Three worked examples in 5-digit decimal arithmetic:

(a)  .40015e+02 + .60010e+02
     temporary: 1.00025e+02  ->  normalize: .100025e+03  ->  round: .10002e+03
     (exactly halfway between two representable values: round to even)

(b)  .40010e+02 + .50001e-04
     temporary: .4001050001e+02  ->  round: .40011e+02
     (the discarded part is not all zeros, so the result is closer to the upper value: round to nearest)

(c)  .41015e-98 − .41010e-98
     temporary: .00005e-98  ->  normalize: .00050e-99
     (too small to normalize fully: the exponent is already at its minimum, so the result stays unnormalized)

The IEEE standard

32 bit (single precision):

    | sign | 8-bit exponent | 23-bit mantissa |

In binary, the leading one of a normalized mantissa does not need to be represented: one bit is gained. This is the hidden bit.

Largest exponent: 2^7 − 1 = 127; smallest: −126. [Bias of 127.]

64 bit (double precision):

    | sign | 11-bit exponent | 52-bit mantissa |

Bias of 1023, so if c is the content of the exponent field, the scale factor represented is 2^(c − 1023).

    e + bias = 2047 (all ones) is reserved for special use.

Largest exponent: 1023; smallest: −1022. With the hidden bit, the mantissa has 53 significant bits, of which 52 are represented.

Take the number 1.0 and see what happens if you add 1/2, 1/4, ..., 2^−i. Do not forget the hidden bit!

    hidden bit | 52 represented bits           | exponent
    1          | 1 0 0 0 0 0 0 0 0 0 0 0 ...   | e e       (1 + 2^−1)
    1          | 0 1 0 0 0 0 0 0 0 0 0 0 ...   | e e       (1 + 2^−2)
    1          | 0 0 1 0 0 0 0 0 0 0 0 0 ...   | e e       (1 + 2^−3)
    ...
    1          | ... 0 0 0 0 0 0 0 0 0 0 1     | e e       (1 + 2^−52)
    1          | ... 0 0 0 0 0 0 0 0 0 0 0     | e e       (1 + 2^−53 rounds back to 1)

Conclusion:

    fl(1 + 2^−52) ≠ 1   but   fl(1 + 2^−53) == 1 !!
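The same experiment in Matlab (a sketch) confirms where the last represented bit sits:

    (1 + 2^-52) > 1     % true: 2^-52 flips the last of the 52 represented bits
    (1 + 2^-53) == 1    % true: the tie rounds back to 1 (round to even)
    eps                 % 2^-52 ~ 2.2204e-16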

Special Values

Exponent field = 00000000000 (smallest possible value): no hidden bit; all bits == 0 means exactly zero. This allows for unnormalized numbers, leading to gradual underflow.

Exponent field = 11111111111 (largest possible value): the number represented is Inf, -Inf, or NaN.