Floating Point Number Systems. Simon Fraser University Surrey Campus MACM 316 Spring 2005 Instructor: Ha Le


Overview
- Real number system
- Examples
- Absolute and relative errors
- Floating point numbers
- Roundoff error analysis
- Conditioning and stability
- A stability analysis
- Rate of convergence

Real number system. The arithmetic of the mathematically defined real number system, denoted by ℝ, is used. ℝ is infinite in (1) extent, i.e., there are numbers x ∈ ℝ arbitrarily large, and (2) density, i.e., any interval I = {x : a ≤ x ≤ b} of ℝ is an infinite set. Computer systems can only represent finite sets of numbers, so all actual implementations of algorithms must use approximations to ℝ and inexact arithmetic.

Example 1: Evaluate I_n = ∫_0^1 x^n/(x + α) dx.

I_0 = ∫_0^1 1/(x + α) dx = ln((α + 1)/α),
I_n + α I_{n−1} = ∫_0^1 (x^n + α x^{n−1})/(x + α) dx = ∫_0^1 x^{n−1} dx = 1/n
⟹ I_n = 1/n − α I_{n−1}, I_0 = ln((α + 1)/α).

Using single precision floating point arithmetic: α = 0.5 gives I_100 = 6.64 × 10^(−3), while α = 2.0 gives I_100 = 2.1 × 10^22.

Note. If α > 1, then x + α > 1 for 0 ≤ x ≤ 1. Hence ∫_0^1 x^n/(x + α) dx ≤ ∫_0^1 x^n dx = 1/(n + 1), so the value 2.1 × 10^22 above cannot possibly be correct.
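The two outcomes above are easy to reproduce. The sketch below is our illustration, not part of the slides: it runs the recurrence in Python's double precision rather than single precision (the effect is the same, only the size of the initial error differs), and the helper name forward_recurrence is ours.

```python
import math

def forward_recurrence(alpha, n_max):
    """Iterate I_n = 1/n - alpha * I_{n-1} from I_0 = ln((alpha+1)/alpha)."""
    I = math.log((alpha + 1.0) / alpha)
    for n in range(1, n_max + 1):
        I = 1.0 / n - alpha * I
    return I

stable = forward_recurrence(0.5, 100)    # rounding errors damped by (-0.5)^n
unstable = forward_recurrence(2.0, 100)  # rounding errors amplified by (-2)^n
```

For α = 0.5 the result stays of the order 10^(−3), consistent with the bound 1/(n + 1); for α = 2.0 the tiny representation error in I_0 is multiplied by 2^100 and the result is astronomically wrong.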

Example 2: Evaluate e^(−5.5). Recall

e^y = Σ_{n=0}^∞ y^n/n! = 1 + y + y²/2! + y³/3! + ⋯,

using a calculator which carries five significant figures.

Method 1. x_1 = e^(−5.5) ≈ Σ_{n=0}^{20} (−5.5)^n/n! = 0.0026363.
Method 2. x_2 = e^(−5.5) = 1/e^{5.5} ≈ 1 / (Σ_{n=0}^{20} (5.5)^n/n!) = 0.0040865.

Note. The correct answer, up to five significant digits, is x_e = e^(−5.5) = 0.0040868.
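The five-figure calculator can be simulated by rounding every intermediate result to five significant decimal digits. The sketch below is our own illustration (round_sig and exp_series are hypothetical helpers, and Python's round uses round-half-to-even rather than a calculator's half-up, so the exact digits may differ slightly from the slide's); it shows why the alternating series of Method 1 loses accuracy while Method 2 does not.

```python
import math

def round_sig(x, s=5):
    """Round x to s significant decimal digits (stand-in for the calculator)."""
    if x == 0.0:
        return 0.0
    return round(x, s - 1 - math.floor(math.log10(abs(x))))

def exp_series(y, terms=21, s=5):
    """Sum y^n/n! for n = 0..terms-1, rounding after every operation."""
    total, term = 0.0, 1.0
    for n in range(terms):
        if n > 0:
            term = round_sig(term * y / n, s)
        total = round_sig(total + term, s)
    return total

x1 = exp_series(-5.5)                  # Method 1: alternating series directly
x2 = round_sig(1.0 / exp_series(5.5))  # Method 2: reciprocal of e^5.5
xe = math.exp(-5.5)                    # reference value, 0.0040868...
```

In Method 1 the partial sums reach magnitudes near 40, so each five-digit rounding can inject an absolute error of order 10^(−3), comparable to the answer itself; in Method 2 all terms are positive and the relative error stays small.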

Absolute and Relative Error. Computed result: x; correct mathematical result: x_e.

Err_abs = |x_e − x|, Err_rel = |x_e − x| / |x_e|.

Definition. The significant digits in a number are the digits starting with the first, i.e., leftmost, nonzero digit (e.g., in 0.0040868 the significant digits are 40868). x is said to approximate x_e to about s significant digits if the relative error satisfies

0.5 × 10^(−s) ≤ |x_e − x| / |x_e| < 5.0 × 10^(−s).

Example 3: Relative Error and Significant Digits. In Example 2, x_e = 0.0040868, x_1 = 0.0026363, x_2 = 0.0040865.

Method 1. 0.5 × 10^(−1) ≤ Err_rel = |x_e − x_1| / |x_e| ≈ 3.5 × 10^(−1) < 5.0 × 10^(−1). Hence x_1 has approximately one significant digit correct (in this example, x_1 in fact has zero correct digits).

Method 2. 0.5 × 10^(−4) ≤ Err_rel = |x_e − x_2| / |x_e| ≈ 0.7 × 10^(−4) < 5.0 × 10^(−4). Hence x_2 has approximately four significant digits correct (in this example, x_2 is indeed correct to four significant digits).

Representation of Numbers in ℝ. Let β ∈ ℕ \ {0, 1} be the base for a number system, e.g., β = 10 (decimal), β = 2 (binary), β = 16 (hexadecimal). Each x ∈ ℝ can be represented by an infinite base-β expansion in the normalized form

±0.d_0 d_1 d_2 … d_{t−1} d_t … × β^p,

where p ∈ ℤ, d_0 ≠ 0, and the d_k are digits in base β, i.e., d_k ∈ {0, 1, …, β − 1}.

Example. 732.5051 = 0.7325051 × 10^3, 0.005612 = 0.5612 × 10^(−2).

Floating Point Numbers. Recall: ℝ is infinite in extent and density. Floating point number systems limit the infinite density of ℝ by representing only a finite number, t, of digits in the expansion, and the infinite extent of ℝ by representing only a finite number of integer values for the exponent p, i.e., L ≤ p ≤ U for specified integers L < 0 and U > 0. Therefore, each number in such a system is precisely of the form

±0.d_0 d_1 d_2 … d_{t−1} × β^p, L ≤ p ≤ U, d_0 ≠ 0,

or 0 (a very special floating point number).

Two Standardized Systems. A floating point number system is denoted by F(β, t, L, U), or simply by F when the parameters are understood. Two standardized systems for digital computers are widely used in the design of software and hardware:

IEEE single precision: β = 2, t = 24, L = −127, U = 128;
IEEE double precision: β = 2, t = 53, L = −1023, U = 1024.

Note. An exception occurs if the exponent is out of range, which leads to a state called overflow if the exponent is too large, or underflow if the exponent is too small.

Truncation of a Real Number. Let x = ±0.d_0 d_1 … d_{n−1} d_n … d_{t−1} … × β^p. Using n digits (rule shown for β = 10):

Rounding: x̃ = ±0.d_0 d_1 … d_{n−1} × 10^p if 0 ≤ d_n ≤ 4, and x̃ = ±0.d_0 d_1 … (d_{n−1} + 1) × 10^p if 5 ≤ d_n ≤ 9.

Chopping: x̃ = ±0.d_0 d_1 … d_{n−1} × 10^p.
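Both truncation modes for β = 10 can be sketched in a few lines. This is our illustration, not from the slides: the function name truncate is ours, and it uses Python floats for convenience even though real hardware works in base 2.

```python
import math

def truncate(x, n, mode="round"):
    """Keep n decimal digits of the normalized mantissa 0.d0 d1 ... * 10^p."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    m = abs(x)
    p = math.floor(math.log10(m)) + 1    # exponent so that m / 10^p is in [0.1, 1)
    scaled = m / 10.0**p * 10.0**n       # first n digits, shifted left of the point
    kept = round(scaled) if mode == "round" else math.floor(scaled)
    return sign * kept * 10.0**(p - n)

r = truncate(2 / 3, 5, "round")   # approximately 0.66667
c = truncate(2 / 3, 5, "chop")    # approximately 0.66666
```

Note how chopping always moves toward zero while rounding can go either way; for the same n, the worst-case error of chopping is twice that of rounding, which is exactly the factor of 2 between the two values of E on the next slide.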

Relationship between x ∈ ℝ and fl(x) ∈ F. For x ∈ ℝ, let fl(x) ∈ F(β, t, L, U) be its floating point approximation. Then

|x − fl(x)| / |x| ≤ E.   (1)

E is the machine epsilon, or unit roundoff error:

E = (1/2) β^(1−t) for rounding, E = β^(1−t) for chopping.

By (1), fl(x) − x = δx for some δ such that |δ| ≤ E. Hence

fl(x) = x(1 + δ), −E ≤ δ ≤ E.

Example. Denote the addition operator in F by ⊕. For w, z ∈ F, w ⊕ z = fl(w + z) = (w + z)(1 + δ).
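For IEEE double precision (β = 2, t = 53, rounding) the formula gives E = (1/2)·2^(1−53) = 2^(−53). A quick check, assuming Python's float is an IEEE double (true on all common platforms); the choice of 0.1 as the test value is ours.

```python
import sys
from fractions import Fraction

E = Fraction(1, 2**53)                     # unit roundoff: (1/2) * 2^(1-53)
assert sys.float_info.epsilon == 2.0**-52  # machine spacing at 1.0: 2^(1-53) * 2

# fl(0.1) = 0.1 * (1 + delta) with |delta| <= E, verified exactly:
x = Fraction(1, 10)      # the real number 0.1
flx = Fraction(0.1)      # the double nearest to 0.1, converted exactly
delta = (flx - x) / x
assert abs(delta) <= E

# E is exactly the largest error invisible next to 1.0:
assert 1.0 + 2.0**-53 == 1.0   # rounds back to 1.0 (tie, round-to-even)
assert 1.0 + 2.0**-52 > 1.0
```

Fraction(0.1) converts the stored double without rounding, so the inequality |δ| ≤ E is checked in exact rational arithmetic rather than with more floating point.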

Roundoff Error Analysis: an Exercise. How does (a ⊕ b) ⊕ c differ from the true sum a + b + c?

(a ⊕ b) ⊕ c = ((a + b)(1 + δ_1)) ⊕ c = ((a + b)(1 + δ_1) + c)(1 + δ_2)
 = (a + b + c) + (a + b)δ_1 + (a + b + c)δ_2 + (a + b)δ_1δ_2.

⟹ |(a + b + c) − ((a ⊕ b) ⊕ c)| ≤ (|a| + |b| + |c|)(|δ_1| + |δ_2| + |δ_1δ_2|).

If a + b + c ≠ 0, then

Err_rel = |(a + b + c) − ((a ⊕ b) ⊕ c)| / |a + b + c| ≤ ((|a| + |b| + |c|) / |a + b + c|) (2E + E²).

If |a + b + c| ≈ |a| + |b| + |c| (e.g., a, b, c ≥ 0, or a, b, c ≤ 0), then Err_rel is bounded by 2E + E², which is small. If |a + b + c| ≪ |a| + |b| + |c|, then Err_rel can be quite large.
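Both regimes can be exhibited directly in IEEE double precision, where E = 2^(−53) ≈ 1.1 × 10^(−16). The values below are our own choices, not from the slides.

```python
# Same signs: |a+b+c| = |a|+|b|+|c|, so the relative error is of order E.
a, b, c = 1.0, 3.1416, 2.0
good = (a + b) + c
assert abs(good - 6.1416) / 6.1416 < 1e-15

# Cancellation: |a+b+c| << |a|+|b|+|c|.  The spacing of doubles near 1e16
# is 2.0, so almost every digit of b is lost before the subtraction.
a, b, c = 1e16, 3.1416, -1e16
bad = (a + b) + c          # 4.0, not 3.1416: relative error about 27%
```

The individual rounding errors are the same size in both cases; it is the division by the tiny true sum that makes the second result useless, exactly as the bound above predicts.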

Roundoff Error Analysis: a Generalization.

Addition of N numbers. If Σ_{i=1}^N x_i ≠ 0, then

Err_rel = |Σ_{i=1}^N x_i − fl(Σ_{i=1}^N x_i)| / |Σ_{i=1}^N x_i| ≤ (Σ_{i=1}^N |x_i| / |Σ_{i=1}^N x_i|) · 1.01 N E.

(The appearance of the factor 1.01 is an artificial technicality.)

Product of N numbers. If x_i ≠ 0, 1 ≤ i ≤ N, then

Err_rel = |Π_{i=1}^N x_i − fl(Π_{i=1}^N x_i)| / |Π_{i=1}^N x_i| ≤ 1.01 N E.

Roundoff Error Analysis: an Example. In F(10, 5, −10, 10), let a = 10000., b = 3.1416, c = −10000. Then |a| + |b| + |c| = 20003.1416 and a + b + c = 3.1416. Hence

0.5 × 10^0 ≤ Err_rel ≤ 6367.2 (2E + E²) ≈ 0.6 < 5.0 × 10^0.

This relative error bound implies that there may be no significant digits correct in the result. Indeed,

(a ⊕ b) ⊕ c = 10003. ⊕ (−10000.) = 3.0000.

Therefore, the computed sum actually has one significant digit correct.

Conditioning. Consider a problem P with input values I and output values O. If a relative change of size ∆I in one or more input values causes a relative change in the mathematically correct output values which is guaranteed to be small (i.e., not too much larger than ∆I), then P is said to be well-conditioned. Otherwise, P is said to be ill-conditioned.

Remark. The above definition is independent of any particular choice of algorithm and of any particular number system. It is a statement about the mathematical problem.

Condition Number. P: I = {x}, O = {f(x)}. From the Taylor series expansion:

f(x + ∆x) = f(x) + f′(x)∆x + (1/2) f″(x)∆x² + O(∆x³) ≈ f(x) + f′(x)∆x,

assuming that ∆x is small. Hence

|f(x) − f(x + ∆x)| / |f(x)| ≈ |f′(x)∆x| / |f(x)| = (|x f′(x)| / |f(x)|) · (|∆x| / |x|) = κ(P) · |∆x| / |x|,

where κ(P) = |x f′(x) / f(x)| is the condition number of P.
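The formula for κ(P) is immediate to evaluate once f and f′ are known. A small sketch (the helper name condition_number and the √x example are ours):

```python
import math

def condition_number(f, fprime, x):
    """kappa(P) = |x * f'(x) / f(x)| for the problem of evaluating f at x."""
    return abs(x * fprime(x) / f(x))

# f(x) = e^x: kappa = |x|, so evaluating e^x worsens as |x| grows.
k_exp = condition_number(math.exp, math.exp, -5.5)

# f(x) = sqrt(x): kappa = 1/2 for every x > 0 -- always well-conditioned.
k_sqrt = condition_number(math.sqrt, lambda t: 0.5 / math.sqrt(t), 100.0)
```

For e^x the derivative equals the function, so κ(P) = |x|; for √x, x f′(x)/f(x) = x · (1/(2√x)) / √x = 1/2 independent of x.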

Condition Number: an Example. In Example 2, I = {x = −5.5}, O = {f(x) = e^x}, and κ(P) = |x| = 5.5. Hence, roundoff errors (in relative error) of size E can lead to relative errors in the output bounded by

Err_rel ≤ κ(P) E.

For example, if E ≈ 10^(−5), then κ(P)E ≈ 5.5 × 10^(−5), which lies in the band

0.5 × 10^(−4) ≤ Err_rel < 5.0 × 10^(−4),

and we should expect to have about four significant digits correct.

Stability. Consider a problem P with condition number κ(P), and suppose that we apply Algorithm A to solve P. If we can guarantee that the computed output values from A have relative errors not too much larger than the errors due to the condition number κ(P), then A is said to be stable. Otherwise, if the computed output values from A can have much larger relative errors, then A is said to be unstable.

Example. The problem P in Example 2 is well-conditioned; Method 1 is unstable, while Method 2 is stable.

A Stability Analysis. In Example 1, computing I_n = ∫_0^1 x^n/(x + α) dx is reduced to solving the recurrence

I_n = 1/n − α I_{n−1}, I_0 = ln((α + 1)/α).

We state without proof that this is a well-conditioned problem. Suppose that the floating point representation of I_0 introduces some error ɛ_0. For simplicity, assume that no other errors are introduced at each stage of the computation after I_0 is computed. Let (I_n)_A and (I_n)_E be the approximate value and the exact value of I_n, respectively. Then

(I_n)_E = 1/n − α (I_{n−1})_E, (I_n)_A = 1/n − α (I_{n−1})_A.

Set ɛ_n = (I_n)_A − (I_n)_E. Then ɛ_n = (−α) ɛ_{n−1} = (−α)^n ɛ_0. If |α| > 1, then any initial error ɛ_0 is magnified by an unbounded amount as n → ∞. On the other hand, if |α| < 1, then any initial error is damped out. We conclude that the algorithm is stable if |α| < 1, and unstable if |α| > 1.
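The relation ɛ_n = (−α)^n ɛ_0 can be confirmed in exact rational arithmetic, where the only "error" is the one we inject into I_0 ourselves. This sketch is our illustration; the starting value is a rational stand-in for ln(3/2) and the injected ɛ_0 is an arbitrary choice.

```python
from fractions import Fraction

def run(alpha, I0, n_max):
    """Iterate I_n = 1/n - alpha * I_{n-1} exactly, using Fractions."""
    I = I0
    for n in range(1, n_max + 1):
        I = Fraction(1, n) - alpha * I
    return I

alpha = Fraction(2)
I0 = Fraction(405465, 10**6)    # rational stand-in for ln(3/2) = 0.405465...
eps0 = Fraction(1, 10**8)       # injected representation error in I_0
eps30 = run(alpha, I0 + eps0, 30) - run(alpha, I0, 30)
assert eps30 == (-alpha)**30 * eps0   # = 2^30 * 10^-8, about 10.7
```

Because the recurrence is affine in I, the difference between the two runs obeys exactly the error recurrence derived above, so the assertion holds with equality, not just approximately.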

Taylor Series. The set of all functions that have n continuous derivatives on a set X is denoted C^n(X), and the set of functions that have derivatives of all orders on X is denoted C^∞(X), where X consists of all numbers for which the functions are defined. Taylor's theorem provides the most important tool for this course.

Taylor's Theorem. Suppose f ∈ C^n[a, b], that f^(n+1) exists on [a, b], and x_0 ∈ [a, b]. For every x ∈ [a, b], there exists a number ξ(x) between x_0 and x with

f(x) = P_n(x) + R_n(x),

where P_n(x) is the nth Taylor polynomial and R_n(x) is the remainder term (or truncation error):

P_n(x) = Σ_{k=0}^n (f^(k)(x_0)/k!) (x − x_0)^k = f(x_0) + f′(x_0)(x − x_0) + ⋯ + (f^(n)(x_0)/n!)(x − x_0)^n,

R_n(x) = (f^(n+1)(ξ(x)) / (n + 1)!) (x − x_0)^(n+1).

Taylor Series: an Example. Find the third degree Taylor polynomial P_3(x) for f(x) = sin x at the expansion point x_0 = 0. Note that f ∈ C^∞(ℝ), so Taylor's theorem is applicable.

P_3(x) = f(0) + f′(0)x + (f″(0)/2)x² + (f‴(0)/3!)x³.

f′(x) = cos x ⟹ f′(0) = cos 0 = 1,
f″(x) = −sin x ⟹ f″(0) = −sin 0 = 0,
f‴(x) = −cos x ⟹ f‴(0) = −cos 0 = −1.

Also, f(0) = sin 0 = 0. Hence P_3(x) = x − x³/6.

What about the truncation error?

R_3(x) = (f⁗(ξ)/4!) x⁴ = (sin ξ / 24) x⁴, where ξ is between 0 and x.

How big could the error be at x = π/2?

R_3(π/2) = (sin ξ / 24)(π/2)⁴, where ξ ∈ (0, π/2).

Since |sin ξ| ≤ 1 for all ξ ∈ (0, π/2),

|R_3(π/2)| ≤ (1/24)(π/2)⁴ ≈ 0.25.

Actual error?

|f(π/2) − P_3(π/2)| = |sin(π/2) − (π/2 − (π/2)³/6)| ≈ 0.075.

In this example, the error bound is much larger than the actual error.
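Both numbers are easy to reproduce (a minimal sketch, assuming Python's math module):

```python
import math

x = math.pi / 2
p3 = x - x**3 / 6                 # P_3(x) = x - x^3/6
actual = abs(math.sin(x) - p3)    # true truncation error, about 0.075
bound = (math.pi / 2)**4 / 24     # |R_3(pi/2)| <= (1/24)(pi/2)^4, about 0.25
assert actual <= bound
```

The bound takes |sin ξ| at its worst case of 1, while the actual ξ gives a smaller value, which is why the bound overshoots by a factor of about 3 here.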

Rate of Convergence. Throughout this course, we will study numerical methods which solve a problem by constructing a sequence of (hopefully) better and better approximations which converge to the required solution. A technique is required to compare the convergence rates of different methods.

Definition. Suppose {α_n}_{n=1}^∞ converges to a number α, and {β_n}_{n=1}^∞ is a sequence known to converge to zero. If a positive constant K exists with

|α_n − α| ≤ K |β_n| for large n,

then we say that {α_n}_{n=1}^∞ converges to α with rate of convergence O(β_n). This is indicated by writing α_n = α + O(β_n).

In nearly every situation, we use β_n = 1/n^p for some number p > 0. Usually we compare how fast α_n → α with how fast β_n = 1/n^p → 0, saying, e.g., that {α_n}_{n=1}^∞ converges to α like 1/n or like 1/n².

Example. 1/n² → 0 faster than 1/n, and 1/n³ → 0 faster than 1/n². We are most interested in the largest value of p with α_n = α + O(1/n^p).

To find the rate of convergence, we can use the definition: find the largest p such that

|α_n − α| ≤ K β_n = K (1/n^p) for n large,

or equivalently, find the largest p so that

lim_{n→∞} |α_n − α| / β_n = lim_{n→∞} |α_n − α| / (1/n^p) = K.

Note. K must be a constant; it cannot be ∞.

Rate of Convergence: an Example. For α_n = (n + 1)/n² and α̂_n = (n + 3)/n³,

lim_{n→∞} α_n = lim_{n→∞} α̂_n = 0 = α.

lim_{n→∞} |α_n − α| / (1/n^p) = lim_{n→∞} ((n + 1)/n²) n^p = 1 if p = 1, ∞ if p ≥ 2;

lim_{n→∞} |α̂_n − α| / (1/n^p) = lim_{n→∞} ((n + 3)/n³) n^p = 0 if p = 1, 1 if p = 2, ∞ if p ≥ 3.

Hence, α_n = 0 + O(1/n), and α̂_n = 0 + O(1/n²).
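The limits above can be checked numerically by evaluating |α_n − α| n^p at one large n (a quick sketch; the helper name scaled_error and the choice n = 10^6 are ours):

```python
def scaled_error(seq, p, n=10**6):
    """Approximate lim |alpha_n - alpha| * n^p for sequences with alpha = 0."""
    return seq(n) * n**p

a_n    = lambda n: (n + 1) / n**2   # = 1/n   + 1/n^2
ahat_n = lambda n: (n + 3) / n**3   # = 1/n^2 + 3/n^3

assert abs(scaled_error(a_n, 1) - 1.0) < 1e-5     # finite limit at p = 1
assert scaled_error(a_n, 2) > 1e5                 # blows up at p = 2
assert abs(scaled_error(ahat_n, 2) - 1.0) < 1e-5  # finite limit at p = 2
assert scaled_error(ahat_n, 3) > 1e5              # blows up at p = 3
```

A finite nonzero value at p and a divergent one at p + 1 is exactly the numerical signature of the largest admissible p in the definition.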

Rate of Convergence: another Example. For α_n = sin(1/n), we have lim_{n→∞} α_n = 0. For all n ∈ ℕ \ {0}, sin(1/n) > 0. Hence, to find the rate of convergence of α_n, we need to find the largest p so that

lim_{n→∞} |α_n − α| / β_n = lim_{n→∞} sin(1/n) / (1/n^p) = K.

Using the change of variable h = 1/n:

lim_{n→∞} sin(1/n) / (1/n^p) = lim_{h→0} sin h / h^p.

Applying the Taylor series expansion of sin h at h = 0:

lim_{h→0} sin h / h^p = lim_{h→0} (h − h³/6 + ⋯) / h^p = 1 if p = 1, ∞ if p ∈ ℕ \ {0, 1}.

Hence, the rate of convergence is O(h), i.e., O(1/n).

Rate of Convergence for Functions. Suppose that lim_{h→0} G(h) = 0 and lim_{h→0} F(h) = L. If a positive constant K exists with

|F(h) − L| ≤ K |G(h)| for sufficiently small h,

then we write F(h) = L + O(G(h)). In general, G(h) = h^p, where p > 0, and we are interested in finding the largest value of p for which F(h) = L + O(h^p).

Rate of Convergence for Functions: an Example. Let F(h) = cos h + (1/2)h². Then L = lim_{h→0} F(h) = 1. We have

lim_{h→0} (F(h) − L) / h^p = lim_{h→0} (cos h + (1/2)h² − 1) / h^p
 = lim_{h→0} ((1 − (1/2)h² + (1/24)h⁴ − ⋯) + (1/2)h² − 1) / h^p
 = lim_{h→0} ((1/24)h⁴ − ⋯) / h^p = 1/24 if p = 4, ∞ if p > 4.

Hence, F(h) = 1 + O(h⁴).
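Numerically, (F(h) − 1)/h⁴ settles near 1/24 ≈ 0.0417 while dividing by any higher power of h blows up. A quick check (our illustration, assuming IEEE doubles; h must not be taken so small that the subtraction F(h) − 1 is destroyed by cancellation):

```python
import math

def F(h):
    return math.cos(h) + 0.5 * h * h

ratio4 = (F(0.01) - 1.0) / 0.01**4   # close to 1/24
ratio5 = (F(0.01) - 1.0) / 0.01**5   # grows without bound as h shrinks
```

At h = 0.01 the numerator is about 4.2 × 10^(−10), comfortably above the rounding noise of order 10^(−16), so the fourth-order behaviour is still cleanly visible.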