CME 302: NUMERICAL LINEAR ALGEBRA, FALL 2005/06
LECTURE 5: Ax = b

GENE H. GOLUB

Date: October 26, 2005, version 1.0. Notes originally due to James Lambers; minor editing by Lek-Heng Lim.

1 Perturbation Theory

Suppose we want to solve $Ax = b$. We actually have an approximation $\xi$ such that

$$x = \xi + e.$$

The question is, how can we use norms to bound the relative error in $\xi$? We define the residual $r$ by

$$r = b - A\xi = A(x - \xi) = Ae.$$

Note that $r = 0$ if and only if $A\xi = b$, i.e., if $\xi$ is the exact solution. From the relations $\|r\| \le \|A\|\,\|e\|$ and $\|e\| \le \|A^{-1}\|\,\|r\|$, we obtain

$$\frac{\|r\|}{\|A\|} \le \|e\| \le \|A^{-1}\|\,\|r\|.$$

It follows that the relative error is bounded as follows: since $\|A\|\,\|x\| \ge \|b\|$,

$$\frac{\|r\|}{\|A\|\,\|x\|} \le \frac{\|e\|}{\|x\|} \le \frac{\|A^{-1}\|\,\|r\|}{\|x\|} \le \|A^{-1}\|\,\|A\|\,\frac{\|r\|}{\|b\|}.$$

The quantity

$$\kappa(A) = \|A^{-1}\|\,\|A\|$$

is called the condition number of $A$. The condition number serves as a measure of how a perturbation in the data of the problem $Ax = b$ affects the solution.

How does a perturbation in $A$ affect the solution $x$? To answer this question, we define $A(\epsilon)$ to be a function of the size of the perturbation $\epsilon$, with $A(0) = A$. Starting with

$$A(\epsilon) A^{-1}(\epsilon) = I$$

and differentiating with respect to $\epsilon$ yields

$$\frac{dA(\epsilon)}{d\epsilon} A^{-1}(\epsilon) + A(\epsilon) \frac{dA^{-1}(\epsilon)}{d\epsilon} = 0$$

or

$$\frac{dA^{-1}(\epsilon)}{d\epsilon} = -A^{-1}(\epsilon) \frac{dA(\epsilon)}{d\epsilon} A^{-1}(\epsilon).$$

Now, suppose that $x(\epsilon)$ satisfies

$$(A + \epsilon E) x(\epsilon) = b.$$

Using Taylor series, we obtain

$$x(\epsilon) = (A + \epsilon E)^{-1} b = x(0) + \epsilon \frac{dx(\epsilon)}{d\epsilon}\Big|_{\epsilon=0} + O(\epsilon^2)$$
$$= x(0) + \epsilon \left( -A^{-1}(\epsilon) \frac{dA(\epsilon)}{d\epsilon} A^{-1}(\epsilon) \right)\Big|_{\epsilon=0} b + O(\epsilon^2)$$
$$= x(0) + \epsilon (-A^{-1} E A^{-1}) b + O(\epsilon^2)$$
$$= x(0) + \epsilon (-A^{-1} E) x + O(\epsilon^2).$$

Taking norms, we obtain

$$\|x(\epsilon) - x(0)\| \le \epsilon \,\|A^{-1}\|^2 \,\|E\| \,\|b\| + O(\epsilon^2),$$

from which it follows that the relative perturbation in $x$ is

$$\frac{\|x(\epsilon) - x\|}{\|x\|} \le \epsilon \,\|A^{-1}\|\,\|A\|\, \frac{\|E\|}{\|A\|} + O(\epsilon^2) = \kappa(A)\,\rho + O(\epsilon^2),$$

where $\rho = \epsilon \|E\| / \|A\|$ is the relative perturbation in $A$.

Since the exact solution to $Ax = b$ is given by $x = A^{-1}b$, we are also interested in examining $(A + E)^{-1}$ where $E$ is some perturbation. Can we say something about $\|(A + E)^{-1} - A^{-1}\|$? We assume that $\|A^{-1}E\| = r < 1$. We have

$$A + E = A(I + A^{-1}E) = A(I - F), \qquad F = -A^{-1}E,$$

which yields

$$\|(I - F)^{-1}\| \le \frac{1}{1 - r}.$$

From

$$(A + E)^{-1} - A^{-1} = (I + A^{-1}E)^{-1} A^{-1} - A^{-1} = (I + A^{-1}E)^{-1} \big( A^{-1} - (I + A^{-1}E) A^{-1} \big) = (I + A^{-1}E)^{-1} (-A^{-1} E A^{-1}),$$

we obtain

$$\|(A + E)^{-1} - A^{-1}\| \le \frac{1}{1 - r} \,\|A^{-1}\|^2 \,\|E\|$$

or

$$\frac{\|(A + E)^{-1} - A^{-1}\|}{\|A^{-1}\|} \le \frac{1}{1 - r} \,\kappa(A)\, \frac{\|E\|}{\|A\|}.$$
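
As a concrete illustration of the residual bounds above, the following short Python sketch (an illustration added to these notes, not part of the original; it uses NumPy and an arbitrarily chosen Hilbert test matrix) compares the true relative error of a perturbed solution with the bounds $\|r\|/(\|A\|\|x\|) \le \|e\|/\|x\| \le \kappa(A)\|r\|/\|b\|$.

```python
import numpy as np

# An ill-conditioned test matrix (the 6 x 6 Hilbert matrix, chosen arbitrarily).
n = 6
A = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1.0)
x = np.ones(n)                           # exact solution
b = A @ x

rng = np.random.default_rng(0)
xi = x + 1e-8 * rng.standard_normal(n)   # an approximate solution, x = xi + e
r = b - A @ xi                           # residual r = b - A xi = A e
e = x - xi                               # error

kappa = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)
lower = np.linalg.norm(r) / (np.linalg.norm(A, 2) * np.linalg.norm(x))
upper = kappa * np.linalg.norm(r) / np.linalg.norm(b)

print(f"kappa(A)       = {kappa:.3e}")
print(f"lower bound    = {lower:.3e}")
print(f"relative error = {np.linalg.norm(e) / np.linalg.norm(x):.3e}")
print(f"upper bound    = {upper:.3e}")
# The true relative error always lies between the two residual-based bounds.
```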

2 Inexact Computation

2.1 Number representation

Real numbers can be represented using floating-point notation: a floating-point representation such as

$$y = \pm . d_1 \cdots d_s d_{s+1} \cdots \times 10^e.$$

A real number may not necessarily have a unique floating-point representation, as the following examples indicate:

$$y = \pm .8999\ldots \times 10^0 = \pm .9000\ldots \times 10^0,$$

or, even worse,

$$y = +.999\ldots \times 10^0 = +.100\ldots \times 10^1.$$

How can we represent $y$? We can use a chopped representation

$$\tilde y = \pm . d_1 \cdots d_s \times 10^e,$$

or possibly a rounded representation

$$\tilde y = \pm . d_1 \cdots d_{s-1} \tilde d_s \times 10^{\bar e}, \qquad \tilde d_s = \begin{cases} d_s & d_{s+1} < 5, \\ (d_s + 1) \bmod 10 & d_{s+1} > 5, \\ ? & d_{s+1} = 5, \end{cases}$$

where the value of $\tilde d_s$ when $d_{s+1} = 5$ depends on what convention is used for rounding. Note that the exponent $e$ can change if $d_{s+1} > 5$, since a carry can propagate through all $s$ digits. We often write $\tilde y = fl(y)$, where $fl(\cdot)$ means "floating-point representation of".

Floating point numbers can be represented in base $\beta$ as

$$y = \pm . d_1 \cdots d_s \times \beta^e,$$

where $m \le e \le M$ and the digits $d_j$ satisfy

$$1 \le d_1 \le \beta - 1, \qquad 0 \le d_j \le \beta - 1, \quad j = 2, \ldots, s.$$

This is a normalized floating-point number. The sequence of significant digits $d_1 \cdots d_s$ is called the mantissa, and the number $e$ is called the exponent.

Suppose $s = 1$, $m = -1$, and $M = 1$. Then the representable positive numbers are

$$+.1 \times 10^{-1}, \; +.2 \times 10^{-1}, \; \ldots, \; +.9 \times 10^{-1},$$
$$+.1 \times 10^{0}, \;\; +.2 \times 10^{0}, \;\; \ldots, \; +.9 \times 10^{0},$$
$$+.1 \times 10^{1}, \;\; +.2 \times 10^{1}, \;\; \ldots, \; +.9 \times 10^{1},$$

together with the 27 corresponding negative numbers. Note that the distribution is not uniform: in this example, the representable numbers are

$$9, \; 8, \; \ldots, \; 1, \; .9, \; .8, \; \ldots, \; .1, \; .09, \; \ldots, \; .01, \; 0, \; -.01, \; \ldots,$$

and there are large gaps between the integers.

Now, suppose that $s = 3$, and that we multiply two numbers $y_1 = .999$ and $y_2 = .999$. The exact product is $y_1 y_2 = .998001$, but the computed product is $fl(y_1 y_2) = .998$. We will write $fl(y_1 \,\mathrm{op}\, y_2)$ to indicate some numerical calculation.

On the IBM 360, we have base 16 (hexadecimal). In general,

$$y = \pm(. d_1 \cdots d_s)_\beta \times \beta^e = \pm\left( \frac{d_1}{\beta} + \frac{d_2}{\beta^2} + \cdots + \frac{d_s}{\beta^s} \right) \beta^e, \qquad \beta = 16,$$

with $1 \le d_1 \le \beta - 1$ and $0 \le d_j \le \beta - 1$ for $2 \le j \le s$. For the IBM 360, single-precision floating-point numbers are represented using $s = 6$, $m = -64$ and $M = 63$, while double-precision floating-point numbers are represented using $s = 14$, $m = -64$, and $M = 63$. If, for the result of any operation, $e < -64$, then underflow has occurred. On the other hand, the scenario $e > 63$ is called overflow.

2.2 Roundoff Error in Arithmetic Operations

We need to consider the error arising from computing the results of arithmetic expressions. In general,

$$z = fl(x \,\mathrm{op}\, y) = (x \,\mathrm{op}\, y)(1 + \delta),$$

where $|\delta| \le \beta^{-(s-1)} \equiv u$ when results are truncated, or $|\delta| \le \frac{1}{2}\beta^{-(s-1)}$ when results are rounded. Therefore the relative error in $fl(x \,\mathrm{op}\, y)$ is

$$\frac{|fl(x \,\mathrm{op}\, y) - (x \,\mathrm{op}\, y)|}{|x \,\mathrm{op}\, y|} = |\delta| \le u,$$

where $u$ is known as the unit roundoff.
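
To make the unit roundoff concrete, here is a small Python sketch (an illustration added to these notes, not from the original) that simulates an $s$-digit decimal mantissa with chopping and with rounding, reproduces the $.999 \times .999$ example, and checks the bound $|\delta| \le u$ empirically.

```python
import math
import random

def fl(y, s, mode="round"):
    """Return y represented with an s-digit decimal mantissa, y = +-.d1...ds x 10^e."""
    if y == 0.0:
        return 0.0
    e = math.floor(math.log10(abs(y))) + 1   # exponent so the mantissa lies in [.1, 1)
    m = y / 10.0**e                          # normalized mantissa
    scaled = m * 10.0**s
    digits = math.trunc(scaled) if mode == "chop" else round(scaled)
    return digits / 10.0**s * 10.0**e

s = 3
print(fl(0.999 * 0.999, s))                  # 0.998, as in the example above

# Check |delta| <= u for both conventions: u = beta^(1-s), halved for rounding.
u = {"chop": 10.0**(1 - s), "round": 0.5 * 10.0**(1 - s)}
for mode in ("chop", "round"):
    worst = max(abs(fl(y, s, mode) - y) / abs(y)
                for y in (random.uniform(0.001, 100.0) for _ in range(10000)))
    print(f"{mode:5s}: worst relative error {worst:.2e} <= u = {u[mode]:.2e}")
```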

We can use this idea sequentially to bound the error in more complex computations. Suppose we want to compute $s_n = x_1 + x_2 + \cdots + x_n$. If we use the numerical algorithm

$$s_0 = 0, \quad s_1 = s_0 + x_1, \quad s_2 = s_1 + x_2, \quad \ldots, \quad s_n = s_{n-1} + x_n,$$

the corresponding computer algorithm is

$$\sigma_0 = 0, \quad \sigma_1 = fl(\sigma_0 + x_1), \quad \sigma_2 = fl(\sigma_1 + x_2), \quad \ldots, \quad \sigma_n = fl(\sigma_{n-1} + x_n).$$

Expanding the expressions for the $\sigma_i$, we obtain

$$\sigma_1 = fl(\sigma_0 + x_1) = (\sigma_0 + x_1)(1 + \delta_1),$$
$$\sigma_2 = fl(\sigma_1 + x_2) = (\sigma_1 + x_2)(1 + \delta_2) = \sigma_1(1 + \delta_2) + x_2(1 + \delta_2) = (\sigma_0 + x_1)(1 + \delta_1)(1 + \delta_2) + x_2(1 + \delta_2),$$
$$\vdots$$
$$\sigma_n = x_1(1 + \delta_1)\cdots(1 + \delta_n) + x_2(1 + \delta_2)\cdots(1 + \delta_n) + \cdots + x_n(1 + \delta_n) = \sum_{j=1}^n x_j \prod_{k=j}^n (1 + \delta_k),$$

where

$$1 + \eta_j = \prod_{k=j}^n (1 + \delta_k), \qquad |\eta_j| \le (n - j + 1)u + O(u^2).$$

In summary,

$$\sigma_n = \sum_{j=1}^n x_j (1 + \eta_j) = x_1 + \cdots + x_n + \sum_{j=1}^n \eta_j x_j.$$

Since the bound on $|\eta_j|$ is largest for the numbers added first, the error bound $\sum_j |\eta_j|\,|x_j|$ is smallest if $0 < |x_{i_1}| \le |x_{i_2}| \le \cdots \le |x_{i_n}|$. It would seem to imply that we should add the numbers in that order, since the numbers are weighted. A better procedure is to add the numbers in pairs, resulting in the bound

$$|\eta_i| \le p\,u + O(u^2) \quad \text{for all } i, \qquad n = 2^p.$$

In this case, the error is uniformly distributed.

For the product of $n$ numbers $p = x_1 x_2 \cdots x_n$, we obtain the computed results

$$\Pi_1 = fl(x_1 x_2) = (x_1 x_2)(1 + \epsilon_1),$$
$$\Pi_2 = fl(\Pi_1 x_3) = (x_1 x_2 x_3)(1 + \epsilon_1)(1 + \epsilon_2),$$
$$\vdots$$
$$\Pi_{n-1} = fl(\Pi_{n-2} x_n) = (x_1 \cdots x_n)(1 + \epsilon_1)\cdots(1 + \epsilon_{n-1}) = p(1 + \eta_{n-1}),$$

where

$$|\eta_{n-1}| \le (n - 1)u + O(u^2).$$
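
The difference between serial and pairwise summation is easy to observe numerically. The Python sketch below (added for illustration; not in the original notes) sums $2^{16}$ single-precision numbers both ways and compares against a double-precision reference.

```python
import numpy as np

def serial_sum(x):
    """sigma_k = fl(sigma_{k-1} + x_k): left-to-right accumulation in float32."""
    s = np.float32(0.0)
    for v in x:
        s = np.float32(s + v)
    return s

def pairwise_sum(x):
    """Add the numbers in pairs; the error bound grows like log2(n)*u, not n*u."""
    if len(x) == 1:
        return np.float32(x[0])
    mid = len(x) // 2
    return np.float32(pairwise_sum(x[:mid]) + pairwise_sum(x[mid:]))

rng = np.random.default_rng(0)
x = rng.random(2**16, dtype=np.float32)
exact = np.sum(x, dtype=np.float64)          # double-precision reference

for name, f in [("serial", serial_sum), ("pairwise", pairwise_sum)]:
    err = abs(np.float64(f(x)) - exact) / exact
    print(f"{name:8s} relative error = {err:.2e}")
```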

2.3 Common Computations

It is frequently necessary to compute the inner product

$$s = \sum_{i=1}^n u_i v_i.$$

Now

$$fl(u_i v_i) = (u_i v_i)(1 + \gamma_i), \qquad |\gamma_i| \le u.$$

Therefore, if we wish to compute

$$fl(s) = \sum_{j=1}^n u_j v_j (1 + \eta_j),$$

then $(1 + \eta_j)$ will depend on how the computation is performed. If we do serial addition, then

$$(1 + \eta_j) = \prod_{k=j}^n (1 + \delta_k)\,(1 + \gamma_j).$$

If we do pairwise addition and $n = 2^p$, then

$$(1 + \eta_j) = \left( \prod_{k=1}^p (1 + \delta_{i_k j}) \right)(1 + \gamma_j).$$

In computing inner products one must be quite careful; otherwise the underflow or overflow problem can be quite serious. Suppose one wishes to compute

$$s = \sum_{i=1}^n x_i^2.$$

It is quite possible that overflow will occur. One solution is to scale. Suppose $x_{\max} \ge |x_j|$, $j = 1, 2, \ldots, n$, and $x_{\max} = \alpha \beta^p$. Then one should compute

$$s = \beta^{2p} \sum_{i=1}^n \left( \frac{x_i}{\beta^p} \right)^2.$$

This requires two passes; there are better ways of performing the calculation.

We can build up analysis of many problems in this manner. This has been done extensively for linear algebra by J. H. Wilkinson. The analysis can be automated; R. Moore has used interval arithmetic for determining bounds on the solution.

It is a general principle that one should add terms of a similar sign. Suppose we wish to compute

$$s^2 = \sum_{i=1}^n (x_i - \bar x)^2, \qquad \bar x = \frac{1}{n} \sum_{i=1}^n x_i. \tag{2.1}$$

It is well known that

$$s^2 = \sum_{i=1}^n x_i^2 - n \bar x^2, \tag{2.2}$$

and this formula is frequently used. But it is very bad, especially if the $x_i$'s are large but $s^2$ is small: then you can be subtracting two large numbers. Formula (2.1) is more stable but requires two passes: first you must compute the mean $\bar x$, and then the quantity $s^2$. We could try writing

$$s^2 = \sum_{i=1}^n (x_i - \nu)^2 - n(\nu - \bar x)^2,$$

where $\nu$ is some approximation to $\bar x$.
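
A quick Python sketch (an added illustration, not from the original notes) shows the cancellation in formula (2.2) when the $x_i$ are large but $s^2$ is small; single precision makes the effect easy to see against a double-precision reference.

```python
import numpy as np

rng = np.random.default_rng(1)
# Large values, tiny spread: s^2 is small, but sum(x_i^2) and n*xbar^2 are huge.
x = (1e4 + 1e-2 * rng.standard_normal(10**5)).astype(np.float32)
n = np.float32(len(x))

xd = x.astype(np.float64)
exact = np.sum((xd - xd.mean())**2)              # double-precision reference

# Formula (2.2): one pass, but subtracts two nearly equal large numbers.
one_pass = np.sum(x * x, dtype=np.float32) - n * np.float32(x.mean())**2

# Formula (2.1): two passes, sums terms of similar sign and size.
xbar = np.float32(x.mean())
two_pass = np.sum((x - xbar)**2, dtype=np.float32)

print(f"exact (double): {exact:.6e}")
print(f"formula (2.2) : {one_pass:.6e}")   # garbage; can even come out negative
print(f"formula (2.1) : {two_pass:.6e}")   # close to the reference
```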

3 IEEE floating point numbers

These notes are due to Lieven Vandenberghe of UCLA.

3.1 Floating point numbers with base 10

Notation:

$$x = \pm(. d_1 d_2 \cdots d_n)_{10} \times 10^e$$

- $.d_1 d_2 \cdots d_n$ is the mantissa ($d_i$ integer, $0 \le d_i \le 9$, $d_1 \ne 0$ if $x \ne 0$)
- $n$ is the mantissa length (or precision)
- $e$ is the exponent ($e_{\min} \le e \le e_{\max}$)

Interpretation:

$$x = \pm(d_1 \cdot 10^{-1} + d_2 \cdot 10^{-2} + \cdots + d_n \cdot 10^{-n}) \times 10^e$$

Example (with $n = 7$):

$$12.625 = +(.1262500)_{10} \times 10^2 = +(1 \cdot 10^{-1} + 2 \cdot 10^{-2} + 6 \cdot 10^{-3} + 2 \cdot 10^{-4} + 5 \cdot 10^{-5} + 0 \cdot 10^{-6} + 0 \cdot 10^{-7}) \times 10^2$$

This format is used in pocket calculators.

Properties:

- a finite set of numbers
- unequally spaced: the distance between floating point numbers varies
- the smallest number greater than 1 is $1 + 10^{-n+1}$
- the smallest number greater than 10 is $10 + 10^{-n+2}$, etc.
- largest positive number: $x_{\max} = +(.99\cdots9)_{10} \times 10^{e_{\max}} = (1 - 10^{-n})\,10^{e_{\max}}$
- smallest positive number: $x_{\min} = +(.10\cdots0)_{10} \times 10^{e_{\min}} = 10^{e_{\min}-1}$

3.2 Floating point numbers with base 2

Notation:

$$x = \pm(. d_1 d_2 \cdots d_n)_2 \times 2^e$$

- $.d_1 d_2 \cdots d_n$ is the mantissa ($d_i \in \{0, 1\}$, $d_1 = 1$ if $x \ne 0$)
- $n$ is the mantissa length (or precision)
- $e$ is the exponent ($e_{\min} \le e \le e_{\max}$)

Interpretation:

$$x = \pm(d_1 2^{-1} + d_2 2^{-2} + \cdots + d_n 2^{-n}) \times 2^e$$

Example (with $n = 8$):

$$12.625 = +(.11001010)_2 \times 2^4 = +(1 \cdot 2^{-1} + 1 \cdot 2^{-2} + 0 \cdot 2^{-3} + 0 \cdot 2^{-4} + 1 \cdot 2^{-5} + 0 \cdot 2^{-6} + 1 \cdot 2^{-7} + 0 \cdot 2^{-8}) \times 2^4$$

This format is used in almost all computers.

Properties:

- a finite set of unequally spaced numbers
- largest positive number: $x_{\max} = +(.11\cdots1)_2 \times 2^{e_{\max}} = (1 - 2^{-n})\,2^{e_{\max}}$
- smallest positive number: $x_{\min} = +(.10\cdots0)_2 \times 2^{e_{\min}} = 2^{e_{\min}-1}$
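
The base-2 example above is easy to check in Python. The sketch below (added for illustration) recovers the mantissa bits and exponent of 12.625 using math.frexp, which returns $m$ and $e$ with $x = m \cdot 2^e$ and $0.5 \le m < 1$, exactly the normalized $(.d_1 d_2 \cdots d_n)_2 \times 2^e$ convention used here.

```python
import math

x = 12.625
m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= m < 1
bits = []
for _ in range(8):              # first n = 8 mantissa digits d_1 ... d_8
    m *= 2
    d = int(m)                  # next binary digit
    bits.append(d)
    m -= d

print(f"12.625 = +(.{''.join(map(str, bits))})_2 x 2^{e}")
# prints: 12.625 = +(.11001010)_2 x 2^4
```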

In practice, the number system also includes subnormal numbers: unnormalized small numbers ($d_1 = 0$, $e = e_{\min}$), and the number 0.

3.3 IEEE floating point standard

Specifies two binary floating point number formats:

- IEEE standard single precision: $n = 24$, $e_{\min} = -125$, $e_{\max} = 128$; requires 32 bits: 1 sign bit, 23 bits for the mantissa, 8 bits for the exponent
- IEEE standard double precision: $n = 53$, $e_{\min} = -1021$, $e_{\max} = 1024$; requires 64 bits: 1 sign bit, 52 bits for the mantissa, 11 bits for the exponent; used in almost all modern computers

3.4 Machine precision

Definition: the machine precision of a binary floating point number system with mantissa length $n$ is defined as

$$\epsilon_M = 2^{-n}.$$

Example: IEEE standard double precision ($n = 53$): $\epsilon_M = 2^{-53} \approx 1.1102 \times 10^{-16}$.

Interpretation: $1 + 2\epsilon_M$ is the smallest floating point number greater than 1:

$$(.10\cdots01)_2 \times 2^1 = 1 + 2^{1-n} = 1 + 2\epsilon_M.$$

3.5 Rounding error

A floating-point number system is a finite set of numbers; all other numbers must be rounded. Notation: $fl(x)$ is the floating-point representation of $x$.

Rounding rules used in practice:

- numbers are rounded to the nearest floating-point number
- in case of a tie: round to the number with least significant bit 0 ("round to nearest even")

Example: numbers $x \in (1, 1 + 2\epsilon_M)$ are rounded to 1 or $1 + 2\epsilon_M$:

- $fl(x) = 1$ if $1 < x \le 1 + \epsilon_M$
- $fl(x) = 1 + 2\epsilon_M$ if $1 + \epsilon_M < x < 1 + 2\epsilon_M$

This gives another interpretation of $\epsilon_M$: numbers between 1 and $1 + \epsilon_M$ are indistinguishable from 1.

3.6 Rounding error and machine precision

General result (no proof):

$$\frac{|fl(x) - x|}{|x|} \le \epsilon_M.$$

- machine precision gives a bound on the relative error due to rounding
- the number of correct (decimal) digits in $fl(x)$ is roughly $-\log_{10} \epsilon_M$, i.e., about 15 or 16 in IEEE double precision
- this is a fundamental limit on the accuracy of numerical computation

Department of Computer Science, Gates Building 2B, Room 280, Stanford, CA 94305-9025
E-mail address: golub@stanford.edu
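
These machine-precision facts can be verified directly in Python, whose floats are IEEE double precision. A short sketch (added for illustration; note that many systems report $2\epsilon_M = 2^{-52}$ as "epsilon", whereas these notes define $\epsilon_M = 2^{-53}$):

```python
import sys

eps_M = 2.0**-53
print(sys.float_info.epsilon)               # 2.220446049250313e-16, i.e. 2*eps_M

# 1 + 2*eps_M is the smallest float greater than 1:
print(1.0 + 2 * eps_M > 1.0)                # True
print(1.0 + eps_M == 1.0)                   # True: 1 + eps_M is a tie and rounds
                                            # to nearest even, which is 1.0
print(1.0 + 1.5 * eps_M == 1.0 + 2 * eps_M) # True: rounds up to 1 + 2*eps_M
```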