Matrix Computations (Chapter 1-2)


1 Matrix Computations (Chapters 1-2)
Mehrdad Sheikholeslami
Department of Electrical Engineering, University at Buffalo
The Best Group, Winter 2016

2 Chapter 1
1.1 Basic Algorithms and Notation
1.2 Exploiting Structure: If a matrix has structure, then it is usually possible to exploit it.
1.3 Block Matrices and Algorithms: A block matrix is a matrix with matrix entries.
1.4 Vectorization and Re-Use Issues

3 1.1 Basic Algorithms and Notation
Matrix Notation: Let R denote the set of real numbers. A in R^{m x n} means A = (a_ij) is an m-by-n array with entries a_ij in R, rows running from (a_11, ..., a_1n) to (a_m1, ..., a_mn).
Matrix Operations:
Transposition (R^{m x n} -> R^{n x m}): C = A^T, c_ij = a_ji
Addition (R^{m x n} + R^{m x n} -> R^{m x n}): C = A + B, c_ij = a_ij + b_ij
Scalar-matrix multiplication (R x R^{m x n} -> R^{m x n}): C = aA, c_ij = a*a_ij
Matrix-matrix multiplication (R^{m x p} x R^{p x n} -> R^{m x n}): C = AB, c_ij = sum_{k=1}^{p} a_ik*b_kj
Vector Notation:
x in R^n: x = (x_1, ..., x_n)^T, x_i in R, a column vector.
x in R^{1 x n}: x = (x_1, ..., x_n), x_i in R, a row vector.
If x is a column vector then y = x^T is a row vector.

4 Vector Operations:
Scalar-vector multiplication (R x R^n -> R^n): z = ax, z_i = a*x_i
Vector addition (R^n + R^n -> R^n): z = x + y, z_i = x_i + y_i
The dot product (inner product) (R^{1 x n} x R^n -> R): c = x^T y = sum_{i=1}^{n} x_i*y_i
  c = 0
  for i = 1:n
    c = c + x(i)y(i)
It involves n multiplications and n additions. It is an O(n) operation, meaning that the amount of work is linear in the dimension.
Vector multiply (Hadamard product) (R^n x R^n -> R^n): z = x.*y, z_i = x_i*y_i
An important update form (saxpy): y = ax + y, y_i = a*x_i + y_i ("=" is being used to denote assignment, not mathematical equality)
  for i = 1:n
    y(i) = ax(i) + y(i)
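A minimal NumPy sketch of the two loops above (the names dot_loop and saxpy are illustrative, not from the text):

import numpy as np

def dot_loop(x, y):
    # c = x^T y via the explicit O(n) loop from the slide
    c = 0.0
    for i in range(x.size):
        c += x[i] * y[i]
    return c

def saxpy(a, x, y):
    # y = a*x + y, updating y in place component by component
    for i in range(x.size):
        y[i] = a * x[i] + y[i]
    return y

x = np.array([1.0, 2.0, 3.0]); y = np.array([4.0, 5.0, 6.0])
print(dot_loop(x, y))        # 32.0, same as x @ y
print(saxpy(2.0, x, y))      # [ 6.  9. 12.], same as 2*x + y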

5 Generalized saxpy, or gaxpy: Suppose A in R^{m x n} and we wish to compute the update y = Ax + y.
A standard way is to update the components one at a time: y_i = sum_{j=1}^{n} a_ij*x_j + y_i, i = 1:m.
gaxpy (row version):
  for i = 1:m
    for j = 1:n
      y(i) = A(i,j)x(j) + y(i)
gaxpy (column version):
  for j = 1:n
    for i = 1:m
      y(i) = A(i,j)x(j) + y(i)
The inner loop in either gaxpy algorithm carries out a saxpy operation.
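A small NumPy sketch of the two gaxpy orderings (function names are illustrative); both produce A @ x + y, but they sweep through A in different orders:

import numpy as np

def gaxpy_row(A, x, y):
    # row version: for each i, accumulate the dot product A(i,:)*x into y(i)
    m, n = A.shape
    for i in range(m):
        for j in range(n):
            y[i] += A[i, j] * x[j]
    return y

def gaxpy_col(A, x, y):
    # column version: for each j, do the saxpy y = x(j)*A(:,j) + y
    m, n = A.shape
    for j in range(n):
        for i in range(m):
            y[i] += A[i, j] * x[j]
    return y

A = np.arange(6.0).reshape(2, 3); x = np.ones(3)
print(gaxpy_row(A, x, np.zeros(2)))  # matches A @ x
print(gaxpy_col(A, x, np.zeros(2)))  # same result, different access pattern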

6 Partitioned matrices:
From the row point of view, a matrix is a stack of row vectors: A in R^{m x n}, A = [r_1^T; ...; r_m^T], r_k in R^n.
Then the row version of gaxpy becomes:
  for i = 1:m
    y(i) = r_i^T x + y(i)
Alternatively, a matrix is a collection of column vectors: A in R^{m x n}, A = [c_1, ..., c_n], c_k in R^m.
Then the column version of gaxpy becomes:
  for j = 1:n
    y = x(j)c_j + y

7 The colon notation: To specify a column or row of a matrix: if A in R^{m x n}, then A(k,:) = [a_k1, ..., a_kn] and A(:,k) = [a_1k, ..., a_mk]^T.
Then the row version of gaxpy becomes:
  for i = 1:m
    y(i) = A(i,:)x + y(i)
And the column version of gaxpy becomes:
  for j = 1:n
    y = x(j)A(:,j) + y
With the colon notation we are able to suppress iteration details.

8 Outer product update: A = A + xy^T, A in R^{m x n}, x in R^m, y in R^n
  for i = 1:m
    for j = 1:n
      A(i,j) = A(i,j) + x(i)y(j)
To eliminate the j-loop with the colon notation:
  for i = 1:m
    A(i,:) = A(i,:) + x(i)y^T
Or, if we make the i-loop the inner loop:
  for j = 1:n
    A(:,j) = A(:,j) + y(j)x
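A NumPy sketch of the two colon-notation forms of the outer product update (names are illustrative); both agree with A + np.outer(x, y):

import numpy as np

def outer_update_rows(A, x, y):
    # A = A + x*y^T, one row of A at a time (the i-loop version above)
    for i in range(A.shape[0]):
        A[i, :] += x[i] * y
    return A

def outer_update_cols(A, x, y):
    # A = A + x*y^T, one column of A at a time (the j-loop version above)
    for j in range(A.shape[1]):
        A[:, j] += y[j] * x
    return A

A = np.zeros((2, 3)); x = np.array([1.0, 2.0]); y = np.array([3.0, 4.0, 5.0])
print(outer_update_rows(A.copy(), x, y))   # equals np.outer(x, y)
print(outer_update_cols(A.copy(), x, y))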

9 Matrix-Matrix Multiplication Methods: dot product version, saxpy version, outer product version. Although mathematically equivalent, they can have different levels of performance.
Scalar-Level Specification, the ijk variant (identify rows of C (and A) with i, columns of C (and B) with j, and the summation index with k):
C = AB + C, A in R^{m x p}, B in R^{p x n}, C in R^{m x n}
  for i = 1:m
    for j = 1:n
      for k = 1:p
        C(i,j) = A(i,k)B(k,j) + C(i,j)

10 So we have 3! = 6 variants. Each of the six possibilities features an inner loop operation (dot product or saxpy) and has its own pattern of data flow. For example:
The ijk variant: a dot product inner loop, with access to a row of A and a column of B.
The jki variant: a saxpy inner loop, with access to a column of C and a column of A.
Loop Order | Inner Loop | Middle Loop          | Inner Loop Data Access
ijk        | dot        | vector x matrix      | A by row, B by column
jik        | dot        | matrix x vector      | A by row, B by column
ikj        | saxpy      | row gaxpy            | B by row, C by row
jki        | saxpy      | column gaxpy         | A by column, C by column
kij        | saxpy      | row outer product    | B by row, C by row
kji        | saxpy      | column outer product | A by column, C by column

11 Matrix Multiplication: dot product version
If A in R^{m x p}, B in R^{p x n}, C in R^{m x n} are given, this algorithm overwrites C with AB + C:
  for i = 1:m
    for j = 1:n
      C(i,j) = A(i,:)B(:,j) + C(i,j)
The inner two loops of the ijk variant define a row-oriented gaxpy operation.
Matrix Multiplication: a saxpy formulation
Suppose A and C are column partitioned. If A in R^{m x p}, B in R^{p x n}, C in R^{m x n} are given, this algorithm overwrites C with AB + C:
  for j = 1:n
    for k = 1:p
      C(:,j) = A(:,k)B(k,j) + C(:,j)
Note that the k-loop oversees a gaxpy operation:
  for j = 1:n
    C(:,j) = AB(:,j) + C(:,j)
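A NumPy sketch of these two formulations (illustrative names), checked against the built-in product:

import numpy as np

def matmul_dot(A, B, C):
    # ijk/dot version: each C(i,j) is updated by the inner product A(i,:)*B(:,j)
    m, p = A.shape; n = B.shape[1]
    for i in range(m):
        for j in range(n):
            C[i, j] += A[i, :] @ B[:, j]
    return C

def matmul_saxpy(A, B, C):
    # jk/saxpy version: column C(:,j) is built up from saxpys with the columns of A
    m, p = A.shape; n = B.shape[1]
    for j in range(n):
        for k in range(p):
            C[:, j] += A[:, k] * B[k, j]
    return C

A = np.random.rand(4, 3); B = np.random.rand(3, 5)
print(np.allclose(matmul_dot(A, B, np.zeros((4, 5))), A @ B))    # True
print(np.allclose(matmul_saxpy(A, B, np.zeros((4, 5))), A @ B))  # True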

12 Matrix Multiplication: outer product version
Consider the kij variant:
  for k = 1:p
    for j = 1:n
      for i = 1:m
        C(i,j) = A(i,k)B(k,j) + C(i,j)
The inner two loops oversee the outer product update C = a_k b_k^T + C, where A is column partitioned into columns a_k and B is row partitioned into rows b_k^T.
If A in R^{m x p}, B in R^{p x n}, C in R^{m x n} are given, this algorithm overwrites C with AB + C:
  for k = 1:p
    C = A(:,k)B(k,:) + C

13 The notion of "level": The dot product and saxpy are examples of level-1 operations. The amount of data and the amount of arithmetic are linear in the dimension for these operations.
An m-by-n outer product or gaxpy operation involves a quadratic amount of data (O(mn)) and a quadratic amount of work (O(mn)). They are examples of level-2 operations.
The matrix update C = AB + C is a level-3 operation; it involves a quadratic amount of data and a cubic amount of work.
Theorem: If A in R^{m x p} and B in R^{p x n}, then (AB)^T = B^T A^T. The proof is in the textbook.
Complex Matrices: The scaling, addition and multiplication of complex matrices correspond to the real case. However, transposition becomes conjugate transposition: C = A^H, c_ij = conj(a_ji).
The dot product of complex n-vectors x and y is prescribed as s = x^H y = sum_{i=1}^{n} conj(x_i) y_i.
Finally, if A = B + iC, then we designate the real and imaginary parts of A by Re(A) = B and Im(A) = C respectively.

14 1.2 Exploiting Structure
The efficiency of a given matrix algorithm depends on many things. In this section we treat the amount of required arithmetic and storage.
Band Matrices and x-0 Notation: A in R^{m x n} has lower bandwidth p if a_ij = 0 whenever i > j + p, and upper bandwidth q if j > i + q implies a_ij = 0. (For example, an 8-by-5 matrix with lower bandwidth 1 and upper bandwidth 2 has nonzeros, marked "x", only on the main diagonal, the first subdiagonal, and the first two superdiagonals.)
Diagonal matrix manipulation: Matrices with upper and lower bandwidth zero are diagonal. If D in R^{m x n} is diagonal, then D = diag(d_1, ..., d_q), q = min(m,n), d_i = d_ii.
If D is diagonal and A is a matrix, then DA is a row scaling of A and AD is a column scaling of A.

15 Band terminology for m-by-n matrices
Type of Matrix   | Lower Bandwidth | Upper Bandwidth
diagonal         | 0               | 0
upper triangular | 0               | n-1
lower triangular | m-1             | 0
tridiagonal      | 1               | 1
upper bidiagonal | 0               | 1
lower bidiagonal | 1               | 0
upper Hessenberg | 1               | n-1
lower Hessenberg | m-1             | 1

16 Triangular Matrix Multiplication
If A and B are both n-by-n and upper triangular, the 3-by-3 case of C = AB looks like this:
C = [ a_11*b_11   a_11*b_12 + a_12*b_22   a_11*b_13 + a_12*b_23 + a_13*b_33
      0           a_22*b_22               a_22*b_23 + a_23*b_33
      0           0                       a_33*b_33 ]
It suggests that the product is upper triangular and that its upper triangular entries are the result of abbreviated inner products.
If A, B in R^{n x n} are upper triangular, then for C = AB we have:
  C = 0
  for i = 1:n
    for j = i:n
      for k = i:j
        C(i,j) = A(i,k)B(k,j) + C(i,j)
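A NumPy sketch of the abbreviated-inner-product loop above (name is illustrative); the result matches the full product because the skipped terms are zero:

import numpy as np

def upper_tri_matmul(A, B):
    # C = AB for upper triangular A, B; only the abbreviated inner products are formed
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):          # C is upper triangular, so j >= i
            for k in range(i, j + 1):  # a_ik = 0 for k < i and b_kj = 0 for k > j
                C[i, j] += A[i, k] * B[k, j]
    return C

n = 5
A = np.triu(np.random.rand(n, n)); B = np.triu(np.random.rand(n, n))
print(np.allclose(upper_tri_matmul(A, B), A @ B))   # True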

17 To quantify the savings in the algorithm on the last slide we need some tools for measuring the amount of work.
Flops: A flop is a floating point operation.
A dot product or saxpy operation of length n: 2n flops (n multiplications and n additions).
The gaxpy y = Ax + y with A in R^{m x n}, and the outer product update A = A + xy^T: 2mn flops.
The matrix multiply update C = AB + C with A in R^{m x p}, B in R^{p x n}, C in R^{m x n}: 2mnp flops.
Now to count the flops in triangular matrix multiplication: c_ij (i <= j) requires 2(j-i+1) flops. The total number of flops is sum_{i=1}^{n} sum_{j=i}^{n} 2(j-i+1), which is approximately n^3/3.
We find that this multiplication requires one-sixth the number of flops of a full matrix multiplication.
Flop counting is a necessarily crude approach to measuring program efficiency, and we must not infer too much from it, because it captures only one of several dimensions of the efficiency issue.

18 Band Storage
Suppose A in R^{n x n} has lower bandwidth p and upper bandwidth q, and that p and q are much smaller than n. Such a matrix can be stored in a (p+q+1)-by-n array A.band with the convention that a_ij = A.band(i-j+q+1, j).
For example, if A is the 6-by-6 matrix with lower bandwidth p = 1 and upper bandwidth q = 2 whose nonzero entries are
A = [ a_11 a_12 a_13 0    0    0
      a_21 a_22 a_23 a_24 0    0
      0    a_32 a_33 a_34 a_35 0
      0    0    a_43 a_44 a_45 a_46
      0    0    0    a_54 a_55 a_56
      0    0    0    0    a_65 a_66 ]
then
A.band = [ *    *    a_13 a_24 a_35 a_46
           *    a_12 a_23 a_34 a_45 a_56
           a_11 a_22 a_33 a_44 a_55 a_66
           a_21 a_32 a_43 a_54 a_65 *   ]
(the entries marked * are unused).
Band gaxpy: With this data structure, our column-oriented gaxpy algorithm becomes the following. Suppose A in R^{n x n} has lower bandwidth p and upper bandwidth q and is stored in A.band format. If x, y in R^n, then this algorithm overwrites y with Ax + y:
  for j = 1:n
    ytop = max(1, j-q)
    ybot = min(n, j+p)
    atop = max(1, q+2-j)
    abot = atop + ybot - ytop
    y(ytop:ybot) = x(j)A.band(atop:abot, j) + y(ytop:ybot)
This algorithm involves just 2n(p+q+1) flops under the assumption that p and q are much smaller than n.

19 Symmetry
We say that A in R^{n x n} is symmetric if A^T = A. Thus
A = [ 1 2 3
      2 4 5
      3 5 6 ]
is symmetric. The storage requirement can be halved if we store only the lower triangle of elements, e.g. A.vec = [1 2 3 4 5 6].
In general, with this data structure we have a_ij = A.vec((j-1)n - j(j-1)/2 + i), i >= j.
Symmetric storage gaxpy: Suppose A in R^{n x n} is symmetric and is stored in A.vec format. If x, y in R^n, then this algorithm overwrites y with Ax + y:
  for j = 1:n
    for i = 1:j-1
      y(i) = A.vec((i-1)n - i(i-1)/2 + j)x(j) + y(i)
    for i = j:n
      y(i) = A.vec((j-1)n - j(j-1)/2 + i)x(j) + y(i)
This algorithm requires the same 2n^2 flops that an ordinary gaxpy requires. Halving the storage requirement is purchased with some awkward subscripting.
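A NumPy sketch of the packed A.vec storage and the symmetric gaxpy (the helper names sym_pack and sym_gaxpy are illustrative; the index formula is the 1-based one above, shifted for 0-based Python):

import numpy as np

def sym_pack(A):
    # Store the lower triangle of a symmetric matrix column by column (the A.vec format)
    n = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(n)])

def sym_gaxpy(Avec, x, y):
    # Overwrite y with A@x + y, where A is symmetric and stored in Avec packed format
    n = x.size
    idx = lambda i, j: (j - 1) * n - j * (j - 1) // 2 + i - 1   # i >= j, both 1-based
    for j in range(1, n + 1):
        for i in range(1, j):            # use symmetry: a_ij = a_ji for i < j
            y[i - 1] += Avec[idx(j, i)] * x[j - 1]
        for i in range(j, n + 1):
            y[i - 1] += Avec[idx(i, j)] * x[j - 1]
    return y

A = np.array([[1., 2, 3], [2, 4, 5], [3, 5, 6]])
x = np.array([1., 1, 1]); y = np.zeros(3)
print(sym_gaxpy(sym_pack(A), x, y))   # matches A @ x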

20 Store by Diagonal
Symmetric matrices can also be stored by diagonal. In this structure we represent A with a vector holding the main diagonal first, followed by the subdiagonals.
If A = [ 1 2 3
         2 4 5
         3 5 6 ]
then A.diag = [1 4 6 2 5 3].
In general, a_{i+k,i} = A.diag(i + nk - k(k-1)/2) for k >= 0.
Store-by-diagonal gaxpy: Suppose A in R^{n x n} is symmetric and is stored in A.diag format. If x, y in R^n, then this algorithm overwrites y with Ax + y:
  for i = 1:n
    y(i) = A.diag(i)x(i) + y(i)
  for k = 1:n-1
    t = nk - k(k-1)/2
    for i = 1:n-k
      y(i) = A.diag(i+t)x(i+k) + y(i)
    for i = 1:n-k
      y(i+k) = A.diag(i+t)x(i) + y(i+k)

21 A Note on Overwriting and Workspaces
Overwriting input data is another way to control the amount of memory that a matrix computation requires. Consider the n-by-n matrix multiplication problem C = AB with the proviso that the input matrix B is to be overwritten by the output matrix C. We cannot simply overwrite B in
  C(1:n,1:n) = 0
  for j = 1:n
    for k = 1:n
      C(:,j) = C(:,j) + A(:,k)B(k,j)
because B(:,j) is needed throughout the entire k-loop. A linear workspace is needed to hold the jth column of the product until it is safe to overwrite B(:,j):
  for j = 1:n
    w(1:n) = 0
    for k = 1:n
      w(:) = w(:) + A(:,k)B(k,j)
    B(:,j) = w(:)
This linear workspace is usually not significant in a matrix computation that already involves a 2-dimensional array of the same order.
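A NumPy sketch of the workspace idea (the function name is illustrative): B is overwritten with A @ B, using only one extra length-n vector w:

import numpy as np

def inplace_overwrite_B(A, B):
    # Overwrite B with A @ B, using a length-n workspace w for the current column
    n = B.shape[0]
    w = np.zeros(n)
    for j in range(n):
        w[:] = 0.0
        for k in range(n):
            w += A[:, k] * B[k, j]   # gaxpy building column j of the product
        B[:, j] = w                  # only now is it safe to overwrite B(:,j)
    return B

A = np.random.rand(4, 4); B = np.random.rand(4, 4)
print(np.allclose(inplace_overwrite_B(A, B.copy()), A @ B))   # True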

22 1.3 Block Matrices and Algorithms
Block algorithms are increasingly important in high performance computing. By a block algorithm we essentially mean an algorithm that is rich in matrix-matrix multiplication. Algorithms of this type turn out to be more efficient in many computing environments than those that are organized at a lower linear algebraic level.
Block Matrix Notation: Column and row partitioning are special cases of matrix blocking. In general we can partition both the rows and columns of an m-by-n matrix A to obtain a q-by-r block matrix
A = [ A_11 ... A_1r
      ...
      A_q1 ... A_qr ]
where block row alpha has m_alpha rows, block column beta has n_beta columns, m_1 + ... + m_q = m and n_1 + ... + n_r = n. A_{alpha,beta} designates the (alpha,beta) block, or submatrix; it has dimension m_alpha-by-n_beta.

23 Block Matrix Manipulation
Block matrices combine just like matrices with scalar entries, as long as certain dimension requirements are met. Let B = (B_{alpha,beta}) be partitioned conformably with the matrix A on the last slide. The sum C = A + B can also be regarded as a q-by-r block matrix with C_{alpha,beta} = A_{alpha,beta} + B_{alpha,beta}.
Multiplication is a little trickier.
Lemma: Suppose A in R^{m x p} and B in R^{p x n}. If A is partitioned into block rows A_1, ..., A_q (with m_1 + ... + m_q = m) and B is partitioned into block columns B_1, ..., B_r (with n_1 + ... + n_r = n), then
AB = C = [ C_11 ... C_1r
           ...
           C_q1 ... C_qr ]
where C_{alpha,beta} = A_alpha B_beta for alpha = 1:q and beta = 1:r.

24 Another lemma: Suppose A in R^{m x p} and B in R^{p x n}. If A is partitioned into block columns A_1, ..., A_s and B is partitioned conformably into block rows B_1, ..., B_s (with p_1 + ... + p_s = p), then
AB = C = sum_{gamma=1}^{s} A_gamma B_gamma.
Theorem 1.3: Based on the two lemmas, for general block matrix multiplication we have: if A = (A_{alpha,gamma}) is a q-by-s block matrix and B = (B_{gamma,beta}) is an s-by-r block matrix partitioned conformably, then C = AB is a q-by-r block matrix with
C_{alpha,beta} = sum_{gamma=1}^{s} A_{alpha,gamma} B_{gamma,beta},  alpha = 1:q, beta = 1:r.

25 Submatrix Designation
Suppose A in R^{m x n} and that i = (i_1, ..., i_r) and j = (j_1, ..., j_c) are integer vectors with the property that i_1, ..., i_r in {1, 2, ..., m} and j_1, ..., j_c in {1, 2, ..., n}. We let A(i, j) denote the r-by-c submatrix
A(i,j) = [ A(i_1,j_1) ... A(i_1,j_c)
           ...
           A(i_r,j_1) ... A(i_r,j_c) ]
Then A(i_1:i_2, j_1:j_2) is the submatrix obtained by extracting rows i_1 through i_2 and columns j_1 through j_2, e.g.
A(3:5, 1:2) = [ a_31 a_32
                a_41 a_42
                a_51 a_52 ]

26 Block Matrix Multiplication
Multiplication of block matrices can be arranged in several possible ways, just as ordinary, scalar-level matrix multiplication can. Different blockings for A, B and C can set the stage for block versions of the dot product, saxpy and outer product algorithms.
We assume that these three matrices are all n-by-n and that n = Nl. If A = (A_{alpha,beta}), B = (B_{alpha,beta}) and C = (C_{alpha,beta}) are N-by-N block matrices with l-by-l blocks, then from Theorem 1.3:
C_{alpha,beta} = sum_{gamma=1}^{N} A_{alpha,gamma} B_{gamma,beta} + C_{alpha,beta},  alpha = 1:N, beta = 1:N
If we organize a matrix multiplication procedure around this summation we obtain:
  for alpha = 1:N
    i = (alpha-1)l + 1 : alpha*l
    for beta = 1:N
      j = (beta-1)l + 1 : beta*l
      for gamma = 1:N
        k = (gamma-1)l + 1 : gamma*l
        C(i,j) = A(i,k)B(k,j) + C(i,j)
Block saxpy and block outer product versions are obtained analogously.
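A NumPy sketch of the blocked loop above (the function name is illustrative); each block update is a level-3 operation on l-by-l submatrices:

import numpy as np

def block_matmul(A, B, C, l):
    # C = AB + C with all matrices n-by-n, n = N*l, processed in l-by-l blocks
    n = A.shape[0]
    N = n // l
    for alpha in range(N):
        i = slice(alpha * l, (alpha + 1) * l)
        for beta in range(N):
            j = slice(beta * l, (beta + 1) * l)
            for gamma in range(N):
                k = slice(gamma * l, (gamma + 1) * l)
                C[i, j] += A[i, k] @ B[k, j]   # level-3 block update
    return C

n, l = 8, 2
A = np.random.rand(n, n); B = np.random.rand(n, n)
print(np.allclose(block_matmul(A, B, np.zeros((n, n)), l), A @ B))   # True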

27 Complex Matrix Multiplication
Consider the complex matrix multiplication update C_1 + iC_2 = (A_1 + iA_2)(B_1 + iB_2) + (C_1 + iC_2). It can be expressed in real arithmetic as
[ C_1 ]   [ A_1  -A_2 ] [ B_1 ]   [ C_1 ]
[ C_2 ] = [ A_2   A_1 ] [ B_2 ] + [ C_2 ]
A Divide-and-Conquer Matrix Multiplication
[ C_11 C_12 ]   [ A_11 A_12 ] [ B_11 B_12 ]
[ C_21 C_22 ] = [ A_21 A_22 ] [ B_21 B_22 ]
In the ordinary algorithm C(i,j) = A(i,1)B(1,j) + A(i,2)B(2,j); there are 8 block multiplies and 4 block adds. Strassen (1969) showed how to compute C with just 7 multiplies and 18 adds. One step of Strassen's method involves 7/8ths the arithmetic of the fully conventional algorithm (page 31).
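A NumPy sketch of the real block formulation of the complex update (the function name is illustrative); it reproduces the complex product using only real matrix multiplies:

import numpy as np

def complex_matmul_real(A1, A2, B1, B2, C1, C2):
    # (C1 + i*C2) = (A1 + i*A2)(B1 + i*B2) + (C1 + i*C2), carried out in real arithmetic
    # via the block form [C1; C2] += [A1 -A2; A2 A1][B1; B2]
    C1 = A1 @ B1 - A2 @ B2 + C1
    C2 = A2 @ B1 + A1 @ B2 + C2
    return C1, C2

n = 3
A = np.random.rand(n, n) + 1j * np.random.rand(n, n)
B = np.random.rand(n, n) + 1j * np.random.rand(n, n)
C1, C2 = complex_matmul_real(A.real, A.imag, B.real, B.imag, np.zeros((n, n)), np.zeros((n, n)))
print(np.allclose(C1 + 1j * C2, A @ B))   # True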

28 1.4 Vectorization and Re-Use Issues
Vector pipeline computers are able to perform vector operations very fast because of special hardware that exploits the fact that a vector operation is a very regular sequence of scalar operations. The performance of such a computer depends upon the length of the vector operands, the vector stride, vector loads and stores, and the level of data re-use.
(Figures: pipelining an arithmetic operation; a 3-cycle adder; pipelined addition.)

29 Vector Operations
A vector pipeline computer comes with a collection of vector instructions that take place in vector registers. Vectors travel between the registers and memory. An important attribute of a vector processor is the length of its vector registers, v_L. A length-n vector operation must be broken down into subvector operations of length v_L or less.
The Vector Length Issue: Suppose the pipeline for the vector operation "op" takes tau_op cycles to set up, and assume that one component of the result is obtained per cycle once the pipeline is filled. The time required for an n-dimensional op is
T_op(n) = (tau_op + n)mu,  n <= v_L
where mu is the cycle time and v_L is the length of the vector hardware. If a vector is longer than the vector hardware length, it must be broken down; writing n = n_1*v_L + n_0 with n_0 <= v_L,
T_op(n) = n_1(tau_op + v_L)mu + (tau_op + n_0)mu.
We simplify the above to
T_op(n) = (n + tau_op*ceil(n/v_L))mu
where ceil(a) is the smallest integer greater than or equal to a.

30 If rho flops per component are involved, then the effective rate of computation for general n is given by
R_op(n) = rho*n / T_op(n).
If mu is in seconds, then R_op is in flops per second.
Now let us look at the performance of the matrix multiply update C = AB + C:
  for i = 1:m
    for j = 1:n
      for k = 1:p
        C(i,j) = A(i,k)B(k,j) + C(i,j)
This is the ijk variant and its innermost loop oversees a length-p dot product, so its cycle count is
T_ijk = mnp + mn*ceil(p/v_L)*tau_dot.

31 A similar analysis for each of the other variants leads to the following table:
Variant | Cycles
ijk     | mnp + mn*ceil(p/v_L)*tau_dot
jik     | mnp + mn*ceil(p/v_L)*tau_dot
ikj     | mnp + mp*ceil(n/v_L)*tau_sax
jki     | mnp + np*ceil(m/v_L)*tau_sax
kij     | mnp + mp*ceil(n/v_L)*tau_sax
kji     | mnp + np*ceil(m/v_L)*tau_sax
Assume that tau_dot and tau_sax are equal. If m, n, and p are all smaller than v_L, the most efficient variants are those with the longest inner loops. Otherwise the distinction between the six options is small.
The Stride Issue.
The Vector Touch Issue: The time required to read or write a vector to memory is comparable to the time required to engage the vector in a dot product or saxpy.

32 Chapter 2
2.1 Basic Ideas from Linear Algebra
2.2 Vector Norms
2.3 Matrix Norms
2.4 Finite Precision Matrix Computations
2.5 Orthogonality and the SVD
2.6 Projections and the CS Decomposition
2.7 The Sensitivity of Square Linear Systems

33 2.1 Basic Ideas from Linear Algebra
Independence: A set of vectors {a_1, ..., a_n} in R^m is linearly independent if sum_{j=1}^{n} alpha_j a_j = 0 implies alpha(1:n) = 0. Otherwise, a nontrivial combination of the a_j is zero and {a_1, ..., a_n} is said to be linearly dependent.
Subspace: A subspace of R^m is a subset that is also a vector space. Given a collection of vectors a_1, ..., a_n in R^m, the set of all linear combinations of these vectors is a subspace referred to as the span of {a_1, ..., a_n}:
span{a_1, ..., a_n} = { sum_{j=1}^{n} beta_j a_j : beta_j in R }.
Basis: The subset {a_i1, ..., a_ik} is a maximal linearly independent subset of {a_1, ..., a_n} if it is linearly independent and is not properly contained in any linearly independent subset of {a_1, ..., a_n}. If {a_i1, ..., a_ik} is maximal, then span{a_1, ..., a_n} = span{a_i1, ..., a_ik}, and {a_i1, ..., a_ik} is a basis for span{a_1, ..., a_n}.
Dimension: If S is a subspace of R^m, then it is possible to find independent vectors a_1, ..., a_k in S such that S = span{a_1, ..., a_k}. All bases for a subspace S have the same number of elements. This number is the dimension and is denoted by dim(S).

34 Range: There are two important subspaces associated with an m-by-n matrix A. The range of A is defined by
ran(A) = { y in R^m : y = Ax for some x in R^n }.
Null Space: null(A) = { x in R^n : Ax = 0 }.
If A = [a_1, ..., a_n] is a column partitioning, then ran(A) = span{a_1, ..., a_n}.
Rank: rank(A) = dim(ran(A)). It can be shown that rank(A) = rank(A^T). We say that A in R^{m x n} is rank deficient if rank(A) < min{m, n}. If A in R^{m x n}, then
dim(null(A)) + rank(A) = n.

35 Matrix Inverse
The n-by-n identity (unit) matrix I_n is defined by the column partitioning I_n = [e_1, ..., e_n], where e_k is the kth canonical vector: e_k = (0, ..., 0, 1, 0, ..., 0)^T with the 1 in position k.
If A and X are in R^{n x n} and satisfy AX = I, then X is the inverse of A and is denoted by A^{-1}. If A^{-1} exists, then A is said to be nonsingular; otherwise it is singular.
Matrix Inverse Properties:
The inverse of a product is the reverse product of the inverses: (AB)^{-1} = B^{-1} A^{-1}.
The transpose of the inverse is the inverse of the transpose: (A^{-1})^T = (A^T)^{-1} = A^{-T}.
Sherman-Morrison-Woodbury formula: (A + UV^T)^{-1} = A^{-1} - A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}, where A in R^{n x n} and U and V are n-by-k. A rank-k correction to a matrix results in a rank-k correction of the inverse. Here we assume that both A and I + V^T A^{-1} U are nonsingular.

36 The Determinant
If A = (a) in R^{1 x 1}, then det(A) = a. The determinant of A in R^{n x n} is defined in terms of order n-1 determinants:
det(A) = sum_{j=1}^{n} (-1)^{j+1} a_1j det(A_1j).
Here A_1j is the (n-1)-by-(n-1) matrix obtained by deleting the first row and jth column of A.
Determinant Properties:
det(AB) = det(A)det(B),  A, B in R^{n x n}
det(A^T) = det(A),  A in R^{n x n}
det(cA) = c^n det(A),  c in R, A in R^{n x n}
det(A) != 0 if and only if A is nonsingular,  A in R^{n x n}

37 Differentiation
Suppose alpha is a scalar and that A(alpha) is an m-by-n matrix with entries a_ij(alpha). If each a_ij(alpha) is a differentiable function of alpha, then by A'(alpha) we mean the matrix
A'(alpha) = dA(alpha)/dalpha = ( da_ij(alpha)/dalpha ) = ( a'_ij(alpha) ).
The differentiation of a parameterized matrix turns out to be a handy way to examine the sensitivity of various matrix problems.

38 2.2 Vector Norms
Norms serve the same purpose on vector spaces that absolute value does on the real line: they furnish a measure of distance. More precisely, R^n together with a norm on R^n defines a metric space.
A vector norm on R^n is a function f: R^n -> R that satisfies the following properties:
f(x) >= 0 for all x in R^n (with f(x) = 0 if and only if x = 0)
f(x + y) <= f(x) + f(y) for all x, y in R^n
f(ax) = |a| f(x) for all a in R, x in R^n
This function is denoted with double bar notation: f(x) = ||x||. Subscripts on the double bar are used to distinguish between various norms.
A useful class of vector norms are the p-norms:
||x||_p = ( |x_1|^p + ... + |x_n|^p )^{1/p}
Of these, the 1-, 2- and infinity-norms are the most important:
||x||_1 = |x_1| + ... + |x_n|
||x||_2 = ( |x_1|^2 + ... + |x_n|^2 )^{1/2} = (x^T x)^{1/2}
||x||_inf = max_{1<=i<=n} |x_i|
A unit vector with respect to the norm ||.|| is a vector that satisfies ||x|| = 1.
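A small NumPy sketch of the p-norm formula, checked against the library routine (the name p_norm is illustrative):

import numpy as np

def p_norm(x, p):
    # ||x||_p = (|x_1|^p + ... + |x_n|^p)^(1/p); the infinity norm is the limit p -> inf
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0, 0.0])
print(p_norm(x, 1), np.linalg.norm(x, 1))            # 7.0
print(p_norm(x, 2), np.linalg.norm(x, 2))            # 5.0
print(np.max(np.abs(x)), np.linalg.norm(x, np.inf))  # 4.0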

39 Some Vector Norm Properties:
A classic result concerning p-norms is the Holder inequality: |x^T y| <= ||x||_p ||y||_q, where 1/p + 1/q = 1.
All norms on R^n are equivalent, meaning that if ||.||_alpha and ||.||_beta are norms on R^n, then there exist positive constants c_1 and c_2 such that
c_1 ||x||_alpha <= ||x||_beta <= c_2 ||x||_alpha for all x in R^n.
For example, for all x in R^n: ||x||_2 <= ||x||_1 <= sqrt(n) ||x||_2.
Absolute and Relative Error: Suppose xhat in R^n is an approximation to x in R^n. For a given vector norm,
E_abs = ||xhat - x|| is the absolute error and E_rel = ||xhat - x|| / ||x|| is the relative error.
Convergence: We say that a sequence {x^(k)} of n-vectors converges to x if lim_{k->inf} ||x^(k) - x|| = 0.

40 2.3 Matrix Norms
The definition of a matrix norm is equivalent to the definition of a vector norm. In particular, f: R^{m x n} -> R is a matrix norm if it satisfies:
f(A) >= 0 for all A in R^{m x n}
f(A + B) <= f(A) + f(B) for all A, B in R^{m x n}
f(aA) = |a| f(A) for all a in R, A in R^{m x n}
Double bar notation with subscripts is used to designate matrix norms: f(A) = ||A||.
Frobenius norm: ||A||_F = ( sum_{i=1}^{m} sum_{j=1}^{n} |a_ij|^2 )^{1/2}
p-norms: ||A||_p = sup_{x != 0} ||Ax||_p / ||x||_p
Note that the 2-norm on R^{3 x 2} is a different function from the 2-norm on R^{5 x 6}.

41 Some Matrix Norm Properties
The Frobenius and p-norms (especially p = 1, 2 and infinity) satisfy certain inequalities that are frequently used in the analysis of matrix computations:
||A||_2 <= ||A||_F <= sqrt(n) ||A||_2
max_{i,j} |a_ij| <= ||A||_2 <= sqrt(mn) max_{i,j} |a_ij|
||A||_1 = max_{1<=j<=n} sum_{i=1}^{m} |a_ij|
||A||_inf = max_{1<=i<=m} sum_{j=1}^{n} |a_ij|
(1/sqrt(n)) ||A||_inf <= ||A||_2 <= sqrt(m) ||A||_inf
(1/sqrt(m)) ||A||_1 <= ||A||_2 <= sqrt(n) ||A||_1
If A in R^{m x n}, 1 <= i_1 <= i_2 <= m and 1 <= j_1 <= j_2 <= n, then ||A(i_1:i_2, j_1:j_2)||_p <= ||A||_p.
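A quick NumPy spot-check of a few of these inequalities on a random matrix (np.linalg.norm computes the 1-, 2-, infinity- and Frobenius norms used here):

import numpy as np

m, n = 5, 3
A = np.random.rand(m, n)
n2 = np.linalg.norm(A, 2); nF = np.linalg.norm(A, 'fro')
ninf = np.linalg.norm(A, np.inf)
print(n2 <= nF <= np.sqrt(n) * n2)                                 # True
print(np.abs(A).max() <= n2 <= np.sqrt(m * n) * np.abs(A).max())   # True
print(ninf / np.sqrt(n) <= n2 <= np.sqrt(m) * ninf)                # True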

42 2.4 Finite Precision Matrix Computations
The Floating Point Numbers: When calculations are performed on a computer, each arithmetic operation is generally affected by roundoff error. This error arises because the machine hardware can only represent a subset of the real numbers. We denote this subset by F and refer to its elements as floating point numbers. The floating point number system on a particular computer is characterized by four integers: the base beta, the precision t, and the exponent range [L, U]. In particular, F consists of all numbers f of the form
f = +/- .d_1 d_2 ... d_t x beta^e,  0 <= d_i < beta, d_1 != 0, L <= e <= U
together with zero. Notice that for a nonzero f in F we have m <= |f| <= M, where
m = beta^(L-1) and M = beta^U (1 - beta^(-t)).
As an example, if beta = 2, t = 3, L = 0 and U = 2, then the non-negative elements of F can be displayed as hash marks on the real axis. A typical value for (beta, t, L, U) might be (2, 56, -64, 64).

43 A Model of Floating Point Arithmetic
To make general pronouncements about the effect of rounding errors on a given algorithm, it is necessary to have a model of computer arithmetic on F. To this end, define the set G = { x in R : m <= |x| <= M } together with zero, and the operator fl: G -> F by
fl(x) = the nearest c in F to x, with ties handled by rounding away from zero.
The fl operator can be shown to satisfy
fl(x) = x(1 + epsilon),  |epsilon| <= u
where u is the unit roundoff defined by u = (1/2) beta^(1-t).
Let a and b be any two floating point numbers and let "op" denote any of the four arithmetic operations +, -, *, /. If a op b is in G, then in our model of floating point arithmetic we assume that the computed version of (a op b) is given by fl(a op b). It follows that fl(a op b) = (a op b)(1 + epsilon) with |epsilon| <= u, thus
| fl(a op b) - (a op b) | <= u |a op b|.

44 Cancellation
Another important aspect of finite precision arithmetic is the phenomenon of catastrophic cancellation. This term refers to the extreme loss of correct significant digits when small numbers are additively computed from larger numbers. A well known example is the computation of e^(-alpha) via its Taylor series with alpha > 0. The roundoff error associated with this method is approximately u times the largest partial sum.
The Absolute Value Notation: Suppose A in R^{m x n}. We wish to quantify the errors associated with its floating point representation:
[fl(A)]_ij = fl(a_ij) = a_ij (1 + epsilon_ij),  |epsilon_ij| <= u
Roundoff in Dot Products: The standard dot product algorithm is
  s = 0
  for k = 1:n
    s = s + x(k)y(k)
Here, x and y are n-by-1 floating point vectors. One issue is the distinction between computed and exact quantities; we use the fl(.) operator to signify computed quantities.

45 Thus fl(x^T y) denotes the computed output of the last algorithm. We now bound |fl(x^T y) - x^T y|. If s_p = fl( sum_{k=1}^{p} x_k y_k ), then s_1 = x_1 y_1 (1 + delta_1) with |delta_1| <= u, and for p = 2:n
s_p = fl( s_{p-1} + fl(x_p y_p) ) = ( s_{p-1} + x_p y_p (1 + delta_p) )(1 + epsilon_p),  |delta_p|, |epsilon_p| <= u.
With a little algebra,
fl(x^T y) = s_n = sum_{k=1}^{n} x_k y_k (1 + gamma_k), where (1 + gamma_k) = (1 + delta_k) prod_{j=k}^{n} (1 + epsilon_j)
with the convention that epsilon_1 = 0. Thus
| fl(x^T y) - x^T y | <= sum_{k=1}^{n} |x_k y_k| |gamma_k|.
Now to bound the quantities gamma_k in terms of u:
Lemma: If (1 + alpha) = prod_{k=1}^{n} (1 + alpha_k) where |alpha_k| <= u and nu <= .01, then |alpha| <= 1.01nu.
Applying the lemma to our problem: | fl(x^T y) - x^T y | <= 1.01nu |x|^T |y|.
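A small experiment in the spirit of this bound (illustrative, not from the text): the dot product is run in single precision and compared with a double precision reference on the same data, together with the 1.01*n*u*|x|^T|y| bound.

import numpy as np

n = 10000
x = np.random.randn(n).astype(np.float32)
y = np.random.randn(n).astype(np.float32)

u = 0.5 * 2.0 ** (1 - 24)                        # unit roundoff for IEEE single (beta=2, t=24)
exact = np.dot(x.astype(np.float64), y.astype(np.float64))
computed = float(np.dot(x, y))                   # accumulated in single precision
bound = 1.01 * n * u * float(np.dot(np.abs(x), np.abs(y)))
print(abs(computed - exact), bound)              # the observed error sits (well) below the bound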

46 Dot Product Accumulation
If the dot product is accumulated in extended precision, the computed result has good relative error: fl(x^T y) = x^T y (1 + delta) with |delta| roughly of size u. Thus the ability to accumulate dot products is very appealing.
Forward and Backward Error Analyses: Every roundoff bound given above is the consequence of a forward error analysis. An alternative style of characterizing the roundoff errors in an algorithm is backward error analysis. Here, the rounding errors are related to the data of the problem rather than to its solution.

47 2.5 Orthogonality and the SVD
Orthogonality: A set of vectors {x_1, ..., x_p} in R^m is orthogonal if x_i^T x_j = 0 whenever i != j, and orthonormal if x_i^T x_j = delta_ij. Orthogonal vectors are maximally independent, for they point in totally different directions.
A collection of subspaces S_1, ..., S_p in R^m is mutually orthogonal if x^T y = 0 whenever x in S_i and y in S_j for i != j.
The orthogonal complement of a subspace S in R^m is defined by S_perp = { y in R^m : y^T x = 0 for all x in S }.
A matrix Q in R^{m x m} is said to be orthogonal if Q^T Q = QQ^T = I. If Q = [q_1, ..., q_m] is orthogonal, then the q_i form an orthonormal basis for R^m.
Theorem: If V_1 in R^{n x r} has orthonormal columns, then there exists V_2 in R^{n x (n-r)} such that V = [V_1 V_2] is orthogonal. Note that ran(V_2) = ran(V_1)_perp.

48 Norms and Orthogonal Transformations
The 2-norm is invariant under orthogonal transformation, for if Q^T Q = I, then ||Qx||_2^2 = x^T Q^T Q x = x^T x = ||x||_2^2.
The matrix 2-norm and the Frobenius norm are also invariant with respect to orthogonal transformation. It is easy to show that for all orthogonal Q and Z of appropriate dimensions we have
||QAZ||_F = ||A||_F and ||QAZ||_2 = ||A||_2.

49 Singular Value Decomposition (SVD)
Theorem: If A is a real m-by-n matrix, then there exist orthogonal matrices U = [u_1, ..., u_m] in R^{m x m} and V = [v_1, ..., v_n] in R^{n x n} such that
U^T A V = diag(sigma_1, ..., sigma_p) in R^{m x n},  p = min(m, n)
where sigma_1 >= sigma_2 >= ... >= sigma_p >= 0. The sigma_i are the singular values of A, and the vectors u_i and v_i are the ith left singular vector and the ith right singular vector respectively.
The SVD reveals a great deal about the structure of a matrix. If the SVD of A is given, we define r by sigma_1 >= ... >= sigma_r > sigma_{r+1} = ... = sigma_p = 0. Then
rank(A) = r
null(A) = span{v_{r+1}, ..., v_n}
ran(A) = span{u_1, ..., u_r}
and we have the SVD expansion A = sum_{i=1}^{r} sigma_i u_i v_i^T.

50 The Thin SVD
If A = U Sigma V^T in R^{m x n} is the SVD of A and m >= n, then A = U_1 Sigma_1 V^T, where
U_1 = U(:, 1:n) = [u_1, ..., u_n] in R^{m x n} and Sigma_1 = Sigma(1:n, 1:n) = diag(sigma_1, ..., sigma_n) in R^{n x n}.
This trimmed-down version of the SVD is referred to as the thin SVD.
Rank Deficiency and the SVD: One of the most valuable aspects of the SVD is that it enables us to deal sensibly with the concept of matrix rank. Rounding errors and fuzzy data make rank determination a nontrivial exercise.
Theorem: Let the SVD of A in R^{m x n} be given. If k < r = rank(A) and A_k = sum_{i=1}^{k} sigma_i u_i v_i^T, then
min_{rank(B)=k} ||A - B||_2 = ||A - A_k||_2 = sigma_{k+1}.
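A NumPy check of this statement (illustrative): the truncated SVD A_k leaves a 2-norm error equal to sigma_{k+1}.

import numpy as np

m, n, k = 8, 6, 2
A = np.random.rand(m, n)
U, s, Vt = np.linalg.svd(A, full_matrices=False)     # thin SVD: U is m-by-n here
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # A_k = sum_{i<=k} sigma_i u_i v_i^T
print(np.linalg.norm(A - A_k, 2), s[k])              # the two numbers agree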

51 The previous theorem says that the smallest singular value of A is the 2-norm distance of A to the set of all rank-deficient matrices. It also follows that the set of full rank matrices in R^{m x n} is both open and dense.
Unitary Matrices: Over the complex field the unitary matrices correspond to the orthogonal matrices. In particular, Q in C^{n x n} is unitary if Q^H Q = QQ^H = I_n. Unitary matrices preserve the 2-norm. The SVD of a complex matrix involves unitary matrices: if A in C^{m x n}, then there exist unitary matrices U in C^{m x m} and V in C^{n x n} such that
U^H A V = diag(sigma_1, ..., sigma_p) in R^{m x n},  p = min(m, n)
where sigma_1 >= sigma_2 >= ... >= sigma_p >= 0.

52 2.6 Projections and the CS Decomposition
If the object of a computation is to compute a matrix or a vector, then norms are useful for assessing the accuracy of the answer or for measuring progress during an iteration. If the object of a computation is to compute a subspace, then to make similar comments we need to be able to quantify the distance between two subspaces. Orthogonal projections are critical in this regard.
Orthogonal Projection: Let S be a subspace of R^n. P in R^{n x n} is the orthogonal projection onto S if ran(P) = S, P^2 = P and P^T = P. From this definition it is easy to show that if x in R^n, then Px in S and (I - P)x in S_perp.
SVD-Related Projections: There are several important orthogonal projections associated with the singular value decomposition. Suppose A = U Sigma V^T in R^{m x n} is the SVD of A and that r = rank(A). If we partition U = [U_r | U_r~] and V = [V_r | V_r~], where U_r and V_r have r columns, then
V_r V_r^T = projection onto null(A)_perp = ran(A^T)
V_r~ V_r~^T = projection onto null(A)
U_r U_r^T = projection onto ran(A)
U_r~ U_r~^T = projection onto ran(A)_perp = null(A^T)

53 Distance between Subspaces
Suppose S_1 and S_2 are subspaces of R^n and that dim(S_1) = dim(S_2). We define the distance between the two spaces by
dist(S_1, S_2) = ||P_1 - P_2||_2
where P_i is the orthogonal projection onto S_i. The distance between a pair of subspaces can be characterized in terms of the blocks of a certain orthogonal matrix.
Theorem: Suppose W = [W_1 W_2] and Z = [Z_1 Z_2] are n-by-n orthogonal matrices, where W_1 and Z_1 have k columns. If S_1 = ran(W_1) and S_2 = ran(Z_1), then
dist(S_1, S_2) = ||W_1^T Z_2||_2.
Note that if S_1 and S_2 are subspaces of R^n with the same dimension, then 0 <= dist(S_1, S_2) <= 1.
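A NumPy sketch of the projector-based distance (the function name is illustrative; the inputs are assumed to have orthonormal columns and equal column counts):

import numpy as np

def subspace_dist(Q1, Q2):
    # dist(S1, S2) = ||P1 - P2||_2, where P_i is the orthogonal projector onto S_i = ran(Q_i)
    P1 = Q1 @ Q1.T
    P2 = Q2 @ Q2.T
    return np.linalg.norm(P1 - P2, 2)

# Two 2-dimensional subspaces of R^4 spanned by orthonormal columns
Q1, _ = np.linalg.qr(np.random.rand(4, 2))
Q2, _ = np.linalg.qr(np.random.rand(4, 2))
print(subspace_dist(Q1, Q2))        # a number between 0 and 1
print(subspace_dist(Q1, Q1))        # 0.0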

54 The CS Decomposition
The blocks of an orthogonal matrix partitioned into 2-by-2 form have highly related SVDs. This is the gist of the CS decomposition.
Theorem (The CS Decomposition, Thin Version): Consider the matrix
Q = [ Q_1
      Q_2 ],  Q_1 in R^{m_1 x n}, Q_2 in R^{m_2 x n}
where m_1 >= n and m_2 >= n. If the columns of Q are orthonormal, then there exist orthogonal matrices U_1 in R^{m_1 x m_1}, U_2 in R^{m_2 x m_2} and V_1 in R^{n x n} such that
U_1^T Q_1 V_1 = C and U_2^T Q_2 V_1 = S
where C = diag(cos(theta_1), ..., cos(theta_n)), S = diag(sin(theta_1), ..., sin(theta_n)), and 0 <= theta_1 <= theta_2 <= ... <= theta_n <= pi/2.

55 Using the same sort of techniques:
Theorem (CS Decomposition, General Version): If
Q = [ Q_11 Q_12
      Q_21 Q_22 ]
is a 2-by-2 partitioning of an n-by-n orthogonal matrix, then there exist orthogonal matrices U = diag(U_1, U_2) and V = diag(V_1, V_2) such that U^T Q V consists of identity blocks, zero blocks and the diagonal blocks C and S, with the nontrivial part having the form
[ C  S
  S -C ]
where C = diag(c_1, ..., c_p) and S = diag(s_1, ..., s_p) are square diagonal matrices with 0 <= c_i, s_i <= 1 and C^2 + S^2 = I.

56 2.7 The Sensitivity of Square Systems
We now use some of the tools developed in the previous sections to analyze the linear system problem Ax = b, where A in R^{n x n} is nonsingular and b in R^n. We want to examine how perturbations in A and b affect the solution x.
SVD Analysis: If A = sum_{i=1}^{n} sigma_i u_i v_i^T = U Sigma V^T is the SVD of A, then
x = A^{-1} b = (U Sigma V^T)^{-1} b = sum_{i=1}^{n} (u_i^T b / sigma_i) v_i.
This expansion shows that small changes in A or b can induce relatively large changes in x if sigma_n is small.
sigma_n is the distance from A to the set of singular matrices. As the matrix of coefficients approaches this set, it is intuitively clear that the solution should be increasingly sensitive to changes.

57 Condition
A precise measure of linear system sensitivity can be obtained by considering the parameterized system
(A + epsilon*F) x(epsilon) = b + epsilon*f,  x(0) = x
where F in R^{n x n} and f in R^n. If A is nonsingular, then x(epsilon) is differentiable in a neighborhood of zero. Moreover, x'(0) = A^{-1}(f - Fx), and thus the Taylor series expansion for x(epsilon) has the form
x(epsilon) = x + epsilon*x'(0) + O(epsilon^2).
Using any vector norm and consistent matrix norm we obtain
||x(epsilon) - x|| / ||x|| <= epsilon * ||A^{-1}|| * ( ||f|| / ||x|| + ||F|| ) + O(epsilon^2).
For square matrices A, define the condition number k(A) = ||A|| ||A^{-1}||, with the convention that k(A) = infinity for singular A. Using the inequality ||b|| <= ||A|| ||x||,
||x(epsilon) - x|| / ||x|| <= k(A) (rho_A + rho_b) + O(epsilon^2)
where rho_A = epsilon ||F|| / ||A|| and rho_b = epsilon ||f|| / ||b|| represent the relative errors in A and b respectively. Thus the relative error in x can be k(A) times the relative errors in A and b. In this sense the condition number k(A) quantifies the sensitivity of Ax = b.

58 Note that k(.) depends on the underlying norm, and subscripts are used accordingly. If k(A) is large, then A is said to be an ill-conditioned matrix. This is a norm-dependent property. However, any two condition numbers k_alpha(.) and k_beta(.) on R^{n x n} are equivalent, in that constants c_1 and c_2 can be found for which
c_1 k_alpha(A) <= k_beta(A) <= c_2 k_alpha(A) for all A in R^{n x n}.
Matrices with small condition numbers are said to be well-conditioned.
Determinants and Nearness to Singularity: Now we want to find out how well determinant size measures ill-conditioning. It turns out there is little correlation between det(A) and the condition of Ax = b. For example, the n-by-n matrix B_n that is unit upper triangular with -1 in every entry above the diagonal has determinant 1, but k_inf(B_n) = n*2^(n-1).
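A NumPy illustration of this example (the helper B is illustrative): the determinant stays 1 while the infinity-norm condition number grows like n*2^(n-1).

import numpy as np

def B(n):
    # Unit upper triangular with -1 above the diagonal: det = 1 for every n
    return np.triu(-np.ones((n, n)), 1) + np.eye(n)

for n in (5, 10, 20):
    Bn = B(n)
    print(n, np.linalg.det(Bn), np.linalg.cond(Bn, np.inf), n * 2.0 ** (n - 1))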

59 A Rigorous Norm Bound
The previous approach is a little unsatisfying because it is contingent on epsilon being small enough and because it sheds no light on the size of the O(epsilon^2) term. Now we establish a perturbation theorem that is completely rigorous.
Lemma: Suppose
Ax = b,  A in R^{n x n}, b in R^n
(A + DeltaA) y = b + Deltab,  DeltaA in R^{n x n}, Deltab in R^n
with ||DeltaA|| <= epsilon ||A|| and ||Deltab|| <= epsilon ||b||. If epsilon*k(A) = r < 1, then A + DeltaA is nonsingular and
||y|| / ||x|| <= (1 + r) / (1 - r).
Theorem: If the conditions of the lemma hold, then
||y - x|| / ||x|| <= (2*epsilon / (1 - r)) k(A).

60 A More Refined Perturbation Theory
Theorem: Suppose
Ax = b,  A in R^{n x n}, b in R^n
(A + DeltaA) y = b + Deltab,  DeltaA in R^{n x n}, Deltab in R^n
and that |DeltaA| <= delta |A| and |Deltab| <= delta |b| componentwise. If delta*k_inf(A) = r < 1, then A + DeltaA is nonsingular and
||y - x||_inf / ||x||_inf <= (2*delta / (1 - r)) || |A^{-1}| |A| ||_inf.
We refer to the value || |A^{-1}| |A| ||_inf as the Skeel condition number.


More information

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects Numerical Linear Algebra Decompositions, numerical aspects Program Lecture 2 LU-decomposition Basic algorithm Cost Stability Pivoting Cholesky decomposition Sparse matrices and reorderings Gerard Sleijpen

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 1: Course Overview & Matrix-Vector Multiplication Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 20 Outline 1 Course

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Numerical Linear Algebra Background Cho-Jui Hsieh UC Davis May 15, 2018 Linear Algebra Background Vectors A vector has a direction and a magnitude

More information

Equality: Two matrices A and B are equal, i.e., A = B if A and B have the same order and the entries of A and B are the same.

Equality: Two matrices A and B are equal, i.e., A = B if A and B have the same order and the entries of A and B are the same. Introduction Matrix Operations Matrix: An m n matrix A is an m-by-n array of scalars from a field (for example real numbers) of the form a a a n a a a n A a m a m a mn The order (or size) of A is m n (read

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 5: Numerical Linear Algebra Cho-Jui Hsieh UC Davis April 20, 2017 Linear Algebra Background Vectors A vector has a direction and a magnitude

More information

Review of Basic Concepts in Linear Algebra

Review of Basic Concepts in Linear Algebra Review of Basic Concepts in Linear Algebra Grady B Wright Department of Mathematics Boise State University September 7, 2017 Math 565 Linear Algebra Review September 7, 2017 1 / 40 Numerical Linear Algebra

More information

Linear Algebra for Machine Learning. Sargur N. Srihari

Linear Algebra for Machine Learning. Sargur N. Srihari Linear Algebra for Machine Learning Sargur N. srihari@cedar.buffalo.edu 1 Overview Linear Algebra is based on continuous math rather than discrete math Computer scientists have little experience with it

More information

Linear Algebra. Session 12

Linear Algebra. Session 12 Linear Algebra. Session 12 Dr. Marco A Roque Sol 08/01/2017 Example 12.1 Find the constant function that is the least squares fit to the following data x 0 1 2 3 f(x) 1 0 1 2 Solution c = 1 c = 0 f (x)

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 3: Positive-Definite Systems; Cholesky Factorization Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 11 Symmetric

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

SUMMARY OF MATH 1600

SUMMARY OF MATH 1600 SUMMARY OF MATH 1600 Note: The following list is intended as a study guide for the final exam. It is a continuation of the study guide for the midterm. It does not claim to be a comprehensive list. You

More information

Introduction. Vectors and Matrices. Vectors [1] Vectors [2]

Introduction. Vectors and Matrices. Vectors [1] Vectors [2] Introduction Vectors and Matrices Dr. TGI Fernando 1 2 Data is frequently arranged in arrays, that is, sets whose elements are indexed by one or more subscripts. Vector - one dimensional array Matrix -

More information

Theorem A.1. If A is any nonzero m x n matrix, then A is equivalent to a partitioned matrix of the form. k k n-k. m-k k m-k n-k

Theorem A.1. If A is any nonzero m x n matrix, then A is equivalent to a partitioned matrix of the form. k k n-k. m-k k m-k n-k I. REVIEW OF LINEAR ALGEBRA A. Equivalence Definition A1. If A and B are two m x n matrices, then A is equivalent to B if we can obtain B from A by a finite sequence of elementary row or elementary column

More information

Linear Algebra (Review) Volker Tresp 2018

Linear Algebra (Review) Volker Tresp 2018 Linear Algebra (Review) Volker Tresp 2018 1 Vectors k, M, N are scalars A one-dimensional array c is a column vector. Thus in two dimensions, ( ) c1 c = c 2 c i is the i-th component of c c T = (c 1, c

More information

This can be accomplished by left matrix multiplication as follows: I

This can be accomplished by left matrix multiplication as follows: I 1 Numerical Linear Algebra 11 The LU Factorization Recall from linear algebra that Gaussian elimination is a method for solving linear systems of the form Ax = b, where A R m n and bran(a) In this method

More information

Linear Algebra Highlights

Linear Algebra Highlights Linear Algebra Highlights Chapter 1 A linear equation in n variables is of the form a 1 x 1 + a 2 x 2 + + a n x n. We can have m equations in n variables, a system of linear equations, which we want to

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2 MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS SYSTEMS OF EQUATIONS AND MATRICES Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 2

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 2 EE/ACM 150 - Applications of Convex Optimization in Signal Processing and Communications Lecture 2 Andre Tkacenko Signal Processing Research Group Jet Propulsion Laboratory April 5, 2012 Andre Tkacenko

More information

Matrix Operations. Linear Combination Vector Algebra Angle Between Vectors Projections and Reflections Equality of matrices, Augmented Matrix

Matrix Operations. Linear Combination Vector Algebra Angle Between Vectors Projections and Reflections Equality of matrices, Augmented Matrix Linear Combination Vector Algebra Angle Between Vectors Projections and Reflections Equality of matrices, Augmented Matrix Matrix Operations Matrix Addition and Matrix Scalar Multiply Matrix Multiply Matrix

More information

MATH2210 Notebook 2 Spring 2018

MATH2210 Notebook 2 Spring 2018 MATH2210 Notebook 2 Spring 2018 prepared by Professor Jenny Baglivo c Copyright 2009 2018 by Jenny A. Baglivo. All Rights Reserved. 2 MATH2210 Notebook 2 3 2.1 Matrices and Their Operations................................

More information

NOTES on LINEAR ALGEBRA 1

NOTES on LINEAR ALGEBRA 1 School of Economics, Management and Statistics University of Bologna Academic Year 207/8 NOTES on LINEAR ALGEBRA for the students of Stats and Maths This is a modified version of the notes by Prof Laura

More information

Lecture Notes to be used in conjunction with. 233 Computational Techniques

Lecture Notes to be used in conjunction with. 233 Computational Techniques Lecture Notes to be used in conjunction with 233 Computational Techniques István Maros Department of Computing Imperial College London V29e January 2008 CONTENTS i Contents 1 Introduction 1 2 Computation

More information

Matrix Algebra for Engineers Jeffrey R. Chasnov

Matrix Algebra for Engineers Jeffrey R. Chasnov Matrix Algebra for Engineers Jeffrey R. Chasnov The Hong Kong University of Science and Technology The Hong Kong University of Science and Technology Department of Mathematics Clear Water Bay, Kowloon

More information

Chapter 4 No. 4.0 Answer True or False to the following. Give reasons for your answers.

Chapter 4 No. 4.0 Answer True or False to the following. Give reasons for your answers. MATH 434/534 Theoretical Assignment 3 Solution Chapter 4 No 40 Answer True or False to the following Give reasons for your answers If a backward stable algorithm is applied to a computational problem,

More information

Review of Matrices and Block Structures

Review of Matrices and Block Structures CHAPTER 2 Review of Matrices and Block Structures Numerical linear algebra lies at the heart of modern scientific computing and computational science. Today it is not uncommon to perform numerical computations

More information