Lecture Notes for Mat-inf 4130, Tom Lyche


Lecture Notes for Mat-inf 4130, 2016

Tom Lyche

August 15, 2016


Contents

Preface

0 A Short Review of Linear Algebra
  Notation
  Vector Spaces and Subspaces
  Linear independence and bases
  Subspaces
  The vector spaces R^n and C^n
  Linear Systems
  Basic properties
  The inverse matrix
  Determinants
  Eigenvalues, eigenvectors and eigenpairs
  Algorithms and Numerical Stability

Part I: Linear Systems

1 Diagonally dominant tridiagonal matrices; three examples
  Cubic Spline Interpolation
  Polynomial interpolation
  Piecewise linear and cubic spline interpolation
  Give me a moment
  LU Factorization of a Tridiagonal System
  Exercises for section 1.1
  Computer exercises for section 1.1
  A two point boundary value problem
  Diagonal dominance
  Exercises for section 1.2
  An Eigenvalue Problem
  The Buckling of a Beam
  The eigenpairs of the 1D test matrix
  Block Multiplication and Triangular Matrices
  Block multiplication
  Triangular matrices
  Exercises for section 1.4
  Review Questions

2 Gaussian elimination and LU Factorizations
  Gaussian Elimination and LU-factorization
  3 by 3 example
  Gauss and LU
  Banded Triangular Systems
  Algorithms for triangular systems
  Counting operations
  The LU and LDU Factorizations
  Existence and uniqueness
  Block LU factorization
  The PLU-Factorization
  Pivoting
  Permutation matrices
  Pivot strategies
  Review Questions

3 LDL* Factorization and Positive definite Matrices
  The LDL* factorization
  Positive Definite and Semidefinite Matrices
  The Cholesky factorization
  Positive definite and positive semidefinite criteria
  Semi-Cholesky factorization of a banded matrix
  The non-symmetric real case
  Review Questions

4 Orthonormal and Unitary Transformations
  Inner products, orthogonality and unitary matrices
  Real and complex inner products
  Orthogonality
  Unitary and orthogonal matrices
  The Householder Transformation
  Householder Triangulation
  The number of arithmetic operations
  Solving linear systems using unitary transformations
  The QR Decomposition and QR Factorization
  Existence
  QR and Gram-Schmidt
  Givens Rotations
  Review Questions

Part II: Some matrix theory and least squares

5 Eigenpairs and Similarity Transformations
  Defective and nondefective matrices
  Similarity transformations
  Geometric multiplicity of eigenvalues and the Jordan Form
  Algebraic and geometric multiplicity of eigenvalues
  The Jordan form
  The Schur decomposition and normal matrices
  The Schur decomposition
  Unitary and orthogonal matrices
  Normal matrices
  The quasi triangular form
  Hermitian Matrices
  The Rayleigh Quotient
  Minmax Theorems
  The Hoffman-Wielandt Theorem
  Left Eigenvectors
  Biorthogonality
  Review Questions

6 The Singular Value Decomposition
  The SVD always exists
  The matrices A*A, AA*
  Further properties of SVD
  The singular value factorization
  SVD and the Four Fundamental Subspaces
  A Geometric Interpretation
  Determining the Rank of a Matrix Numerically
  The Frobenius norm
  Low rank approximation
  Review Questions

7 Norms and perturbation theory for linear systems
  Vector Norms
  Matrix Norms
  Consistent and subordinate matrix norms
  Operator norms
  The operator p-norms
  Unitary invariant matrix norms
  Absolute and monotone norms
  The Condition Number with Respect to Inversion
  Proof that the p-norms are Norms
  Review Questions

8 Least Squares
  Examples
  Curve Fitting
  Geometric Least Squares theory
  Sum of subspaces and orthogonal projections
  Numerical Solution
  Normal equations
  QR factorization
  Least squares and singular value decomposition
  The generalized inverse
  Perturbation Theory for Least Squares
  Perturbing the right hand side
  Perturbing the matrix
  Perturbation Theory for Singular Values
  The Minmax Theorem for Singular Values and the Hoffman-Wielandt Theorem
  Review Questions

Part III: Kronecker Products and Fourier Transforms

9 The Kronecker Product
  The 2D Poisson problem
  The test matrices
  The Kronecker Product
  Properties of the 2D Test Matrices
  Review Questions

10 Fast Direct Solution of a Large Linear System
  Algorithms for a Banded Positive Definite System
  Cholesky factorization
  Block LU factorization of a block tridiagonal matrix
  Other methods
  A Fast Poisson Solver based on Diagonalization
  A Fast Poisson Solver based on the discrete sine and Fourier transforms
  The discrete sine transform (DST)
  The discrete Fourier transform (DFT)
  The fast Fourier transform (FFT)
  A Poisson solver based on the FFT
  Review Questions

Part IV: Iterative Methods for Large Linear Systems

11 The Classical Iterative Methods
  Classical Iterative Methods; Component Form
  The discrete Poisson system
  Classical Iterative Methods; Matrix Form
  Fixed-point form
  The splitting matrices for the classical methods
  Convergence
  Richardson's method
  Convergence of SOR
  Convergence of the classical methods for the discrete Poisson matrix
  Number of iterations
  Stopping the iteration
  Powers of a matrix
  The spectral radius
  Neumann series
  The Optimal SOR Parameter ω
  Review Questions

12 The Conjugate Gradient Method
  Quadratic Minimization and Steepest Descent
  The Conjugate Gradient Method
  Derivation of the method
  The conjugate gradient algorithm
  Numerical example
  Implementation issues
  Convergence
  The Main Theorem
  The number of iterations for the model problems
  Krylov spaces and the best approximation property
  Proof of the Convergence Estimates
  Chebyshev polynomials
  Convergence proof for steepest descent
  Monotonicity of the error
  Preconditioning
  Preconditioning Example
  A variable coefficient problem
  Applying preconditioning
  Review Questions

Part V: Eigenvalues and Eigenvectors

13 Numerical Eigenvalue Problems
  Eigenpairs
  Gerschgorin's Theorem
  Perturbation of Eigenvalues
  Nondefective matrices
  Unitary Similarity Transformation of a Matrix into Upper Hessenberg Form
  Assembling Householder transformations
  Computing a Selected Eigenvalue of a Symmetric Matrix
  The inertia theorem
  Approximating λ_m
  Review Questions

14 The QR Algorithm
  The Power Method and its variants
  The power method
  The inverse power method
  Rayleigh quotient iteration
  The basic QR Algorithm
  Relation to the power method
  Invariance of the Hessenberg form
  Deflation
  The Shifted QR Algorithms
  Review Questions

Part VI: Appendix

A Computer Arithmetic
  A.1 Absolute and Relative Errors
  A.2 Floating Point Numbers
  A.3 Rounding and Arithmetic Operations
  A.3.1 Rounding
  A.3.2 Arithmetic operations

B Differentiation of Vector Functions

Bibliography

Index

Preface

These lecture notes contain the text for a course in matrix analysis and numerical linear algebra given at the beginning graduate level at the University of Oslo. Each chapter corresponds approximately to one week of lectures, but some contain more. In my own lectures I have not covered Sections 2.4, 3.3, 3.4, 5.5, 7.4, 8.4.2, 8.5, 11.5, 12.4, 12.5.

Earlier versions of this manuscript were converted to LaTeX by Are Magnus Bruaset and Njål Foldnes. A special thanks goes to Christian Schulz, Georg Muntingh and Øyvind Ryan, who helped me with the exercise sessions and have provided solutions to all problems in these notes.

Oslo, August 2016
Tom Lyche


Chapter 0

A Short Review of Linear Algebra

In this introductory chapter we give a compact introduction to linear algebra with emphasis on R^n and C^n. For a more elementary introduction, see for example the book [24]. We start by introducing the notation used.

0.1 Notation

The following sets and notations will be used in this book.

1. The sets of natural numbers, integers, rational numbers, real numbers, and complex numbers are denoted by N, Z, Q, R, C, respectively.

2. We use the colon equal symbol v := e to indicate that the symbol v is defined by the expression e.

3. R^n is the set of n-tuples of real numbers which we will represent as column vectors. Thus x ∈ R^n means
\[ x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \]
where x_i ∈ R for i = 1, ..., n. Row vectors are normally identified using the transpose operation. Thus if x ∈ R^n then x is a column vector and x^T is a row vector.

4. Addition and scalar multiplication are denoted and defined by
\[ x + y = \begin{bmatrix} x_1 + y_1 \\ \vdots \\ x_n + y_n \end{bmatrix}, \qquad a x = \begin{bmatrix} a x_1 \\ \vdots \\ a x_n \end{bmatrix}, \qquad x, y \in R^n,\ a \in R. \]

5. R^{m×n} is the set of matrices A with real elements. The integers m and n are the number of rows and columns in the tableau
\[ A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}. \]
The element in the ith row and jth column of A will be denoted by a_{i,j}, a_{ij}, A(i, j) or (A)_{i,j}. We use the notations
\[ a_{:j} = \begin{bmatrix} a_{1j} \\ a_{2j} \\ \vdots \\ a_{mj} \end{bmatrix}, \qquad a_{i:}^T = [a_{i1}, a_{i2}, \ldots, a_{in}], \qquad A = [a_{:1}, a_{:2}, \ldots, a_{:n}] = \begin{bmatrix} a_{1:}^T \\ a_{2:}^T \\ \vdots \\ a_{m:}^T \end{bmatrix} \]
for the columns a_{:j} and rows a_{i:}^T of A. We often drop the colon and write a_j and a_i^T with the risk of some confusion. If m = 1 then A is a row vector, if n = 1 then A is a column vector, while if m = n then A is a square matrix. In this text we will denote matrices by boldface capital letters A, B, C, ... and vectors most often by boldface lower case letters x, y, z, ....

6. A complex number is a number written in the form x = a + ib, where a, b are real numbers and i, the imaginary unit, satisfies i^2 = -1. The set of all such numbers is denoted by C. The numbers a = Re x and b = Im x are the real and imaginary part of x. The number \bar{x} := a - ib is called the complex conjugate of x = a + ib, and |x| := \sqrt{\bar{x}x} = \sqrt{a^2 + b^2} the absolute value or modulus of x. The complex exponential function can be defined by
\[ e^x = e^{a+ib} := e^a(\cos b + i \sin b). \]
In particular, e^{iπ/2} = i, e^{iπ} = -1, e^{2iπ} = 1. We have e^{x+y} = e^x e^y for all x, y ∈ C. The polar form of a complex number is
\[ x = a + ib = r e^{i\theta}, \qquad r = |x| = \sqrt{a^2 + b^2}, \quad \cos\theta = \frac{a}{r}, \quad \sin\theta = \frac{b}{r}. \]
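The colon notation for columns and rows in item 5 maps directly onto MATLAB's indexing, which the algorithms later in these notes use. A small illustrative sketch; the matrix below is just an arbitrary example:

% Illustrate the colon notation of item 5 on a small example matrix.
A = [1 2 3; 4 5 6; 7 8 9];   % A in R^(3x3)
a2  = A(:,2);                % the column a_{:2} = [2; 5; 8]
a3T = A(3,:);                % the row a_{3:}^T = [7 8 9]
A23 = A(2,3);                % the element a_{2,3} = 6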

7. For matrices and vectors with complex elements we use the notation A ∈ C^{m×n} and x ∈ C^n. We define complex row vectors using either the transpose x^T or the conjugate transpose operation x^* := \bar{x}^T = [\bar{x}_1, \ldots, \bar{x}_n].

8. For x, y ∈ C^n and a ∈ C the operations of vector addition and scalar multiplication are defined by component operations as in the real case (cf. 4.).

9. The arithmetic operations on rectangular matrices are
- matrix addition C = A + B if A, B, C are matrices of the same size, i.e., with the same number of rows and columns, and c_{ij} = a_{ij} + b_{ij} for all i, j.
- multiplication by a scalar C = αA, where c_{ij} = α a_{ij} for all i, j.
- matrix multiplication C = AB, where A ∈ C^{m×p}, B ∈ C^{p×n}, C ∈ C^{m×n}, and c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj} for i = 1, ..., m, j = 1, ..., n.
- element-by-element matrix operations: the product C with c_{ij} = a_{ij} b_{ij}, the quotient D with d_{ij} = a_{ij}/b_{ij}, and the power E with e_{ij} = a_{ij}^r, where all matrices are of the same size and r is suitable. The element-by-element product is known as the Schur product and also the Hadamard product.

10. Let A ∈ R^{m×n} or A ∈ C^{m×n}. The transpose A^T and conjugate transpose A^* are n×m matrices with elements a^T_{ij} = a_{ji} and a^*_{ij} = \bar{a}_{ji}, respectively. If B is an n×p matrix then (AB)^T = B^T A^T and (AB)^* = B^* A^*.

11. The unit vectors in R^n and C^n are denoted by
\[ e_1 := \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad e_2 := \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad e_3 := \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad \ldots, \quad e_n := \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}, \]
while I_n = I := [\delta_{ij}]_{i,j=1}^n, where
\[ \delta_{ij} := \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise}, \end{cases} \tag{1} \]
is the identity matrix of order n. Both the columns and the transpose of the rows of I are the unit vectors e_1, e_2, ..., e_n.

12. Some matrices with many zeros have names indicating their shape. Suppose A ∈ R^{n×n} or A ∈ C^{n×n}. Then A is
- diagonal if a_{ij} = 0 for i ≠ j.

- upper triangular or right triangular if a_{ij} = 0 for i > j.
- lower triangular or left triangular if a_{ij} = 0 for i < j.
- upper Hessenberg if a_{ij} = 0 for i > j + 1.
- lower Hessenberg if a_{ij} = 0 for i < j + 1.
- tridiagonal if a_{ij} = 0 for |i - j| > 1.
- d-banded if a_{ij} = 0 for |i - j| > d.

13. We use the following notations for diagonal and tridiagonal n×n matrices:
\[ \mathrm{diag}(d_i) = \mathrm{diag}(d_1, \ldots, d_n) := \begin{bmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_n \end{bmatrix}, \]
\[ B = \mathrm{tridiag}(a_i, d_i, c_i) = \mathrm{tridiag}(a, d, c) := \begin{bmatrix} d_1 & c_1 & & & \\ a_1 & d_2 & c_2 & & \\ & \ddots & \ddots & \ddots & \\ & & a_{n-2} & d_{n-1} & c_{n-1} \\ & & & a_{n-1} & d_n \end{bmatrix}. \]
Here b_{ii} = d_i for i = 1, ..., n, b_{i+1,i} = a_i, b_{i,i+1} = c_i for i = 1, ..., n-1, and b_{ij} = 0 otherwise.

14. Suppose A ∈ C^{m×n} and 1 ≤ i_1 < i_2 < ... < i_r ≤ m, 1 ≤ j_1 < j_2 < ... < j_c ≤ n. The matrix A(i, j) ∈ C^{r×c} is the submatrix of A consisting of rows i := [i_1, ..., i_r] and columns j := [j_1, ..., j_c]:
\[ A(i, j) := A\begin{pmatrix} i_1 & i_2 & \cdots & i_r \\ j_1 & j_2 & \cdots & j_c \end{pmatrix} = \begin{bmatrix} a_{i_1,j_1} & a_{i_1,j_2} & \cdots & a_{i_1,j_c} \\ a_{i_2,j_1} & a_{i_2,j_2} & \cdots & a_{i_2,j_c} \\ \vdots & \vdots & & \vdots \\ a_{i_r,j_1} & a_{i_r,j_2} & \cdots & a_{i_r,j_c} \end{bmatrix}. \]
For the special case of consecutive rows and columns we use the notation
\[ A(r_1 : r_2, c_1 : c_2) := \begin{bmatrix} a_{r_1,c_1} & a_{r_1,c_1+1} & \cdots & a_{r_1,c_2} \\ a_{r_1+1,c_1} & a_{r_1+1,c_1+1} & \cdots & a_{r_1+1,c_2} \\ \vdots & \vdots & & \vdots \\ a_{r_2,c_1} & a_{r_2,c_1+1} & \cdots & a_{r_2,c_2} \end{bmatrix}. \]

0.2 Vector Spaces and Subspaces

Many mathematical systems have analogous properties to vectors in R^2 or R^3.
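Before turning to vector spaces, it may help to see that the matrix notation of items 13 and 14 above also has direct MATLAB counterparts. A small sketch; the numerical values are arbitrary examples:

% diag, tridiag and submatrix notation in MATLAB (arbitrary example data).
d = [2 2 2 2];  a = [-1 -1 -1];  c = [-1 -1 -1];
D = diag(d);                            % diag(d_1,...,d_4)
B = diag(d) + diag(a,-1) + diag(c,1);   % tridiag(a,d,c), here a 4x4 matrix
S = B([1 3], [2 3 4]);                  % submatrix A(i,j) with i = [1 3], j = [2 3 4]
C = B(2:4, 1:3);                        % consecutive block A(2:4, 1:3)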

17 0.2. Vector Spaces and Subspaces 5 Definition 0.1 (Real vector space) A real vector space is a nonempty set V, whose objects are called vectors, together with two operations + : V V V and : R V V, called addition and scalar multiplication, satisfying the following axioms for all vectors u, v, w in V and scalars c, d in R. (V1) The sum u + v is in V, (V2) u + v = v + u, (V3) u + (v + w) = (u + v) + w, (V4) There is a zero vector 0 such that u + 0 = u, (V5) For each u in V there is a vector u in V such that u + ( u) = 0, (S1) The scalar multiple c u is in V, (S2) c (u + v) = c u + c v, (S3) (c + d) u = c u + d u, (S4) c (d u) = (cd) u, (S5) 1 u = u. The scalar multiplication symbol is often omitted, writing cv instead of c v. We define u v := u + ( v). We call V a complex vector space if the scalars consist of all complex numbers C. In this book a vector space is either real or complex. From the axioms it follows that 1. The zero vector is unique. 2. For each u V the negative u of u is unique. 3. 0u = 0, c0 = 0, and u = ( 1)u. Here are some examples 1. The space R n, where n N, is a real vector space. 2. Similarly, C n is a complex vector space. 3. Let D be a subset of R and d N. The set V of all functions f, g : D R d is a real vector space with (f + g)(t) := f(t) + g(t), (cf)(t) := cf(t), t D, c R.

18 6 Chapter 0. A Short Review of Linear Algebra Two functions f, g in V are equal if f(t) = g(t) for all t D. The zero element is the zero function given by f(t) = 0 for all t D and the negative of f is given by f = ( 1)f. In the following we will use boldface letters for functions only if d > For n 0 the space Π n of polynomials of degree at most n consists of all polynomials p : R R, p : R C, or p : C C of the form p(t) = a 0 + a 1 t + a 2 t a n t n, (2) where the coefficients a 0,..., a n are real or complex numbers. p is called the zero polynomial if all coefficients are zero. All other polynomials are said to be nontrivial. The degree of a nontrivial polynomial p given by (2) is the smallest integer 0 k n such that p(t) = a a k t k with a k 0. The degree of the zero polynomial is not defined. Π n is a vector space if we define addition and scalar multiplication as for functions. Definition 0.2 (Linear combination) For n 1 let X := {x 1,..., x n } be a set of vectors in a vector space V and let c 1,..., c n be scalars. 1. The sum c 1 x c n x n is called a linear combination of x 1,..., x n. 2. The linear combination is nontrivial if c j x j 0 for at least one j. 3. The set of all linear combinations of elements in X is denoted span(x ). 4. A vector space is finite dimensional if it has a finite spanning set; i. e., there exists n N and {x 1,..., x n } in V such that V = span({x 1,..., x n }). Example 0.3 (Linear combinations) 1. Any x = [x 1,..., x m ] T in C m can be written as a linear combination of the unit vectors as x = x 1 e 1 +x 2 e 2 + +x m e m. Thus, C m = span({e 1,..., e m }) and C m is finite dimensional. Similarly R m is finite dimensional. 2. Let Π = n Π n be the space of all polynomials. Π is a vector space that is not finite dimensional. For suppose Π is finite dimensional. Then Π = span({p 1,..., p m }) for some polynomials p 1,..., p m. Let d be an integer such that the degree of p j is less than d for j = 1,..., m. A polynomial of degree d cannot be written as a linear combination of p 1,..., p m, a contradiction.

19 0.2. Vector Spaces and Subspaces Linear independence and bases Definition 0.4 (Linear independence) A set X = {x 1,..., x n } of nonzero vectors in a vector space is linearly dependent if 0 can be written as a nontrivial linear combination of {x 1,..., x n }. Otherwise X is linearly independent. A set of vectors X = {x 1,..., x n } is linearly independent if and only if c 1 x c n x n = 0 = c 1 = = c n = 0. (3) Suppose {x 1,..., x n } is linearly independent. Then 1. If x span(x ) then the scalars c 1,..., c n in the representation x = c 1 x c n x n are unique. 2. Any nontrivial linear combination of x 1,..., x n is nonzero, Lemma 0.5 (Linear independence and span) Suppose v 1,..., v n span a vector space V and that w 1,..., w k are linearly independent vectors in V. Then k n. Proof. Suppose k > n. Write w 1 as a linear combination of elements from the set X 0 := {v 1,..., v n }, say w 1 = c 1 v c n v n. Since w 1 0 not all the c s are equal to zero. Pick a nonzero c, say c i1. Then v i1 can be expressed as a linear combination of w 1 and the remaining v s. So the set X 1 := {w 1, v 1,..., v i1 1, v i1+1,..., v n } must also be a spanning set for V. We repeat this for w 2 and X 1. In the linear combination w 2 = d i1 w 1 + j i 1 d j v j, we must have d i2 0 for some i 2 with i 2 i 1. For otherwise w 2 = d 1 w 1 contradicting the linear independence of the w s. So the set X 2 consisting of the v s with v i1 replaced by w 1 and v i2 replaced by w 2 is again a spanning set for V. Repeating this process n 2 more times we obtain a spanning set X n where v 1,..., v n have been replaced by w 1,..., w n. Since k > n we can then write w k as a linear combination of w 1,..., w n contradicting the linear independence of the w s. We conclude that k n. Definition 0.6 (basis) A finite set of vectors {v 1,..., v n } in a vector space V is a basis for V if 1. span{v 1,..., v n } = V. 2. {v 1,..., v n } is linearly independent.

20 8 Chapter 0. A Short Review of Linear Algebra Theorem 0.7 (Basis subset of a spanning set) Suppose V is a vector space and that {v 1,..., v n } is a spanning set for V. Then we can find a subset {v i1,..., v ik } that forms a basis for V. Proof. If {v 1,..., v n } is linearly dependent we can express one of the v s as a nontrivial linear combination of the remaining v s and drop that v from the spanning set. Continue this process until the remaining v s are linearly independent. They still span the vector space and therefore form a basis. Corollary 0.8 (Existence of a basis) A vector space is finite dimensional if and only if it has a basis. Proof. Let V = span{v 1,..., v n } be a finite dimensional vector space. By Theorem 0.7, V has a basis. Conversely, if V = span{v 1,..., v n } and {v 1,..., v n } is a basis then it is by definition a finite spanning set. Theorem 0.9 (Dimension of a vector space) Every basis for a vector space V has the same number of elements. This number is called the dimension of the vector space and denoted dim V. Proof. Suppose X = {v 1,..., v n } and Y = {w 1,..., w k } are two bases for V. By Lemma 0.5 we have k n. Using the same Lemma with X and Y switched we obtain n k. We conclude that n = k. The set of unit vectors {e 1,..., e n } form a basis for both R n and C n. Theorem 0.10 (Enlarging vectors to a basis ) Every linearly independent set of vectors {v 1,..., v k } in a finite dimensional vector space V can be enlarged to a basis for V. Proof. If {v 1,..., v k } does not span V we can enlarge the set by one vector v k+1 which cannot be expressed as a linear combination of {v 1,..., v k }. The enlarged set is also linearly independent. Continue this process. Since the space is finite dimensional it must stop after a finite number of steps Subspaces Definition 0.11 (Subspace) A nonempty subset S of a real or complex vector space V is called a subspace of V if

21 0.2. Vector Spaces and Subspaces 9 (V1) The sum u + v is in S for any u, v S. (S1) The scalar multiple cu is in S for any scalar c and any u S. Using the operations in V, any subspace S of V is a vector space, i. e., all 10 axioms V 1 V 5 and S1 S5 are satisfied for S. In particular, S must contain the zero element in V. This follows since the operations of vector addition and scalar multiplication are inherited from V. Example 0.12 (Examples of subspaces) 1. {0}, where 0 is the zero vector is a subspace, the trivial subspace. The dimension of the trivial subspace is defined to be zero. All other subspaces are nontrivial. 2. V is a subspace of itself. 3. span(x ) is a subspace of V for any X = {x 1,..., x n } V. Indeed, it is easy to see that (V1) and (S1) hold. 4. The sum of two subspaces R and S of a vector space V is defined by R + S := {r + s : r R and s S}. (4) Clearly (V1) and (S1) hold and it is a subspace of V. 5. The intersection of two subspaces R and S of a vector space V is defined by R S := {x : x R and x S}. (5) It is a subspace of V. 6. The union of two subspaces R and S of a vector space V is defined by In general it is not a subspace of V. R S := {x : x R or x S}. (6) 7. A sum of two subspaces R and S of a vector space V is called a direct sum and denoted R S if R S = {0}. The subspaces R and S are called complementary in the subspace R S. Theorem 0.13 (Dimension formula for sums of subspaces) Let R and S be two finite dimensional subspaces of a vector space V. Then In particular, for a direct sum dim(r + S) = dim(r) + dim(s) dim(r S). (7) dim(r S) = dim(r) + dim(s). (8)

22 10 Chapter 0. A Short Review of Linear Algebra Proof. Let {u 1,..., u p } be a basis for R S, where {u 1,..., u p } =, the empty set, in the case R S = {0}. We use Theorem 0.10 to extend {u 1,..., u p } to a basis {u 1,..., u p, r 1,..., r q } for R and a basis {u 1,..., u p, s 1,..., s t } for S. Every x R+S can be written as a linear combination of {u 1,..., u p, r 1,..., r q, s 1,..., s t } so these vectors span R + S. We show that they are linearly independent and hence a basis. Suppose u + r + s = 0, where u := p j=1 α ju j, r := q j=1 ρ jr j, and s := t j=1 σ js j. Now r = (u+s) belongs to both R and to S and hence r R S. Therefore r can be written as a linear combination of u 1,..., u p say r := p j=1 β ju j. But then 0 = p j=1 β ju j q j=1 ρ jr j and since {u 1,..., u p, r 1,..., r q } is linearly independent we must have β 1 = = β p = ρ 1 = = ρ q = 0 and hence r = 0. We then have u + s = 0 and by linear independence of {u 1,..., u p, s 1,..., s t } we obtain α 1 = = α p = σ 1 = = σ t = 0. We have shown that the vectors {u 1,..., u p, r 1,..., r q, s 1,..., s t } constitute a basis for R + S. But then dim(r + S) = p + q + t = (p + q) + (p + t) p = dim(r) + dim(s) dim(r S) and (7) follows. (7) implies (8) since dim{0} = 0. It is convenient to introduce a matrix transforming a basis in a subspace into a basis for the space itself. Lemma 0.14 (Change of basis matrix) Suppose S is a subspace of a finite dimensional vector space V and let {s 1,..., s n } be a basis for S and {v 1,..., v m } a basis for V. Then each s j can be expressed as a linear combination of v 1,..., v m, say s j = m a ij v i for j = 1,..., n. (9) i=1 If x S then x = n j=1 c js j = m i=1 b iv i for some coefficients b := [b 1,..., b m ] T, c := [c 1,..., c n ] T. Moreover b = Ac, where A = [a ij ] C m n is given by (9). The matrix A has linearly independent columns. Proof. (9) holds for some a ij since s j V and {v 1,..., v m } spans V. Since {s 1,..., s n } is a basis for S and {v 1,..., v m } a basis for V, every x S can be written x = n j=1 c js j = m i=1 b iv i for some scalars (c j ) and (b i ). But then m b i v i = x = i=1 n j=1 c j s j (9) = n ( m ) m ( n ) c j a ij v i = a ij c j vi. j=1 Since {v 1,..., v m } is linearly independent it follows that b i = n j=1 a ijc j for i = 1,..., m or b = Ac. Finally, to show that A has linearly independent columns i=1 i=1 j=1

23 0.3. Linear Systems 11 suppose b := Ac = 0 for some c = [c 1,..., c n ] T. Define x := n j=1 c js j. Then x = m i=1 b iv i and since b = 0 we have x = 0. But since {s 1,..., s n } is linearly independent it follows that c = 0. The matrix A in Lemma 0.14 is called a change of basis matrix The vector spaces R n and C n When V = R m we can think of n vectors in R m, say x 1,..., x n, as a set X := {x 1,..., x n } or as the columns of a matrix X = [x 1,..., x n ] R m n. A linear combination can then be written as a matrix times vector Xc, where c = [c 1,..., c n ] T is the vector of scalars. Thus span(x ) = span(x) = {Xc : c R n }. Of course the same holds for C m. In R m and C m each of the following statements is equivalent to linear independence of X. (i) Xc = 0 c = 0, (ii) X has linearly independent columns, Definition 0.15 (Column space and null space) Associated with a matrix X = [x 1,..., x n ] R m n are the following subspaces 1. The subspace span(x) is called the column space of X. It is the smallest subspace containing X = {x 1,..., x n }. 2. span(x T ) is called the row space of X. It is generated by the rows of X written as column vectors. 3. The subspace ker(x) := {y R n : Xy = 0} is called the null space or kernel space of X. Note that the subspace ker(x) is nontrivial if and only if X is linearly dependent. 0.3 Linear Systems Consider a linear system a 11 x 1 + a 12 x a 1n x n = b 1 a 21 x 1 + a 22 x a 2n x n = b a m1 x 1 + a m2 x a mn x n = b m

24 12 Chapter 0. A Short Review of Linear Algebra of m equations in n unknowns. Here for all i, j, the coefficients a ij, the unknowns x j, and the components of the right hand sides b i, are real or complex numbers. The system can be written as a vector equation x 1 a 1 + x 2 a x n a n = b, where a j = [ ] T a 1j,..., a mj C m for j = 1,..., n and b = [ ] T b 1,..., b m C m. It can also be written as a matrix equation a 11 a 12 a 1n x 1 b 1 a 21 a 22 a 2n x 2 Ax = = b 2. = b. a m1 a m2 a mn x n The system is homogeneous if b = 0 and it is said to be underdetermined, square, or overdetermined if m < n, m = n, or m > n, respectively Basic properties A linear system has a unique solution, infinitely many solutions, or no solution. To discuss this we first consider the real case, and a homogeneous underdetermined system. Lemma 0.16 (Underdetermined system) Suppose A R m n with m < n. Then there is a nonzero x R n such that Ax = 0. Proof. Suppose A R m n with m < n. The n columns of A span a subspace of R m. Since R m has dimension m the dimension of this subspace is at most m. By Lemma 0.5 the columns of A must be linearly dependent. It follows that there is a nonzero x R n such that Ax = 0. A square matrix is either nonsingular or singular. Definition 0.17 (Real nonsingular or singular matrix) A square matrix A R n n is said to be nonsingular if the only real solution of the homogeneous system Ax = 0 is x = 0. The matrix is singular if there is a nonzero x R n such that Ax = 0. b m Theorem 0.18 (Linear systems; existence and uniqueness) Suppose A R n n. The linear system Ax = b has a unique solution x R n for any b R n if and only if the matrix A is nonsingular.

25 0.3. Linear Systems 13 Proof. Suppose A is nonsingular. We define B = [ A b ] R n (n+1) by adding a column to A. [ By Lemma ] 0.16 there is a nonzero z R n+1 such that Bz = 0. If z we write z = where z = [ ] T z z 1,..., z n R n and z n+1 R, then n+1 [ ] z Bz = [A b] = A z + z z n+1 b = 0. n+1 We cannot have z n+1 = 0 for then A z = 0 for a nonzero z, contradicting the nonsingularity of A. Define x := z/z n+1. Then ( ) z Ax = A = 1 A z = 1 ( zn+1 b ) = b, z n+1 z n+1 z n+1 so x is a solution. Suppose Ax = b and Ay = b for x, y R n. Then A(x y) = 0 and since A is nonsingular we conclude that x y = 0 or x = y. Thus the solution is unique. Conversely, if Ax = b has a unique solution for any b R n then Ax = 0 has a unique solution which must be x = 0. Thus A is nonsingular. For the complex case we have Lemma 0.19 (Complex underdetermined system) Suppose A C m n with m < n. Then there is a nonzero x C n such that Ax = 0. Definition 0.20 (Complex nonsingular matrix) A square matrix A C n n is said to be nonsingular if the only complex solution of the homogeneous system Ax = 0 is x = 0. The matrix is singular if it is not nonsingular. Theorem 0.21 (Complex linear system; existence and uniqueness) Suppose A C n n. The linear system Ax = b has a unique solution x C n for any b C n if and only if the matrix A is nonsingular.

26 14 Chapter 0. A Short Review of Linear Algebra James Joseph Sylvester, The word matrix to denote a rectangular array of numbers, was first used by Sylvester in The inverse matrix Suppose A C n n is a square matrix. A matrix B C n n is called a right inverse of A if AB = I. A matrix C C n n is said to be a left inverse of A if CA = I. We say that A is invertible if it has both a left- and a right inverse. If A has a right inverse B and a left inverse C then C = CI = C(AB) = (CA)B = IB = B and this common inverse is called the inverse of A and denoted by A 1. Thus the inverse satisfies A 1 A = AA 1 = I. We want to characterize the class of invertible matrices and start with a lemma. Theorem 0.22 (Product of nonsingular matrices) If A, B, C C n n with AB = C then C is nonsingular if and only if both A and B are nonsingular. In particular, if AB = I or BA = I then A is nonsingular and A 1 = B. Proof. Suppose both A and B are nonsingular and let Cx = 0. Then ABx = 0 and since A is nonsingular we see that Bx = 0. Since B is nonsingular we have x = 0. We conclude that C is nonsingular. For the converse suppose first that B is singular and let x C n be a nonzero vector so that Bx = 0. But then Cx = (AB)x = A(Bx) = A0 = 0 so C is singular. Finally suppose B is nonsingular, but A is singular. Let x be a nonzero vector such that A x = 0. By Theorem 0.21 there is a vector x such that Bx = x and x is nonzero since x is nonzero. But then Cx = (AB)x = A(Bx) = A x = 0 for a nonzero vector x and C is singular.

27 0.3. Linear Systems 15 Theorem 0.23 (When is a square matrix invertible?) A square matrix is invertible if and only if it is nonsingular. Proof. Suppose first A is a nonsingular matrix. By Theorem 0.21 each of the linear [ systems ] Ab i = e i has [ a unique solution ] [ b i for i ] = 1,..., n. Let B = b1,..., b n. Then AB = Ab1,..., Ab n = e1,..., e n = I so that A has a right inverse B. By Theorem 0.22 B is nonsingular since I is nonsingular and AB = I. Since B is nonsingular we can use what we have shown for A to conclude that B has a right inverse C, i.e. BC = I. But then AB = BC = I so B has both a right inverse and a left inverse which must be equal so A = C. Since BC = I we have BA = I, so B is also a left inverse of A and A is invertible. Conversely, if A is invertible then it has a right inverse B. Since AB = I and I is nonsingular, we again use Theorem 0.22 to conclude that A is nonsingular. To verify that some matrix B is an inverse of another matrix A it is enough to show that B is either a left inverse or a right inverse of A. This calculation also proves that A is nonsingular. We use this observation to give simple proofs of the following results. Corollary 0.24 (Basic properties of the inverse matrix) Suppose A, B C n n are nonsingular and c is a nonzero constant. 1. A 1 is nonsingular and (A 1 ) 1 = A. 2. C = AB is nonsingular and C 1 = B 1 A A T is nonsingular and (A T ) 1 = (A 1 ) T =: A T. 4. A is nonsingular and (A ) 1 = (A 1 ) =: A. 5. ca is nonsingular and (ca) 1 = 1 c A 1. Proof. 1. Since A 1 A = I the matrix A is a right inverse of A 1. Thus A 1 is nonsingular and (A 1 ) 1 = A. 2. We note that (B 1 A 1 )(AB) = B 1 (A 1 A)B = B 1 B = I. Thus AB is invertible with the indicated inverse since it has a left inverse. 3. Now I = I T = (A 1 A) T = A T (A 1 ) T showing that (A 1 ) T is a right inverse of A T. The proof of part 4 is similar. 4. The matrix 1 c A 1 is a one sided inverse of ca.
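Two of the items in Corollary 0.24 are easy to test numerically. A minimal MATLAB sketch, assuming random well conditioned (hence nonsingular) matrices; inv is used only for illustration, not as a recommended way to solve systems:

% Numerical check of two items in Corollary 0.24.
rng(2); n = 5;
A = eye(n) + 0.1*randn(n);  B = eye(n) + 0.1*randn(n);   % well conditioned, hence nonsingular
norm(inv(A*B) - inv(B)*inv(A))      % item 2: (AB)^(-1) = B^(-1) A^(-1), up to rounding error
norm(inv(A') - inv(A)')             % item 3: (A^T)^(-1) = (A^(-1))^T, up to rounding error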

Exercise 0.25 (The inverse of a general 2×2 matrix) Show that
\[ \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \alpha \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}, \qquad \alpha = \frac{1}{ad - bc}, \]
for any a, b, c, d such that ad − bc ≠ 0.

Exercise 0.26 (The inverse of a special 2×2 matrix) Find the inverse of
\[ A = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}. \]

Exercise 0.27 (Sherman-Morrison formula) Suppose A ∈ C^{n×n}, and B, C ∈ R^{n×m} for some n, m ∈ N. If (I + C^T A^{-1} B)^{-1} exists then
\[ (A + BC^T)^{-1} = A^{-1} - A^{-1}B(I + C^T A^{-1} B)^{-1} C^T A^{-1}. \]

0.4 Determinants

The first systematic treatment of determinants was given by Cauchy in 1812. He adopted the word determinant. The first use of determinants was made by Leibniz in 1693 in a letter to De L'Hôspital. By the beginning of the 20th century the theory of determinants filled four volumes of almost 2000 pages (Muir; historic references can be found in this work). The main use of determinants in this text will be to study the characteristic polynomial of a matrix and to show that a matrix is nonsingular.

For any A ∈ C^{n×n} the determinant of A is defined as the number
\[ \det(A) = \sum_{\sigma \in S_n} \mathrm{sign}(\sigma)\, a_{\sigma(1),1} a_{\sigma(2),2} \cdots a_{\sigma(n),n}. \tag{10} \]
This sum ranges over all n! permutations of {1, 2, ..., n}. Moreover, sign(σ) = (−1)^k, where k is the number of times a bigger integer precedes a smaller one in σ. We also denote the determinant by (Cayley, 1841)
\[ \begin{vmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{vmatrix}. \]
From the definition we have
\[ \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{21}a_{12}. \]

29 0.4. Determinants 17 The first term on the right corresponds to the identity permutation ɛ given by ɛ(i) = i, i = 1, 2. The second term comes from the permutation σ = {2, 1}. For n = 3 there are six permutations of {1, 2, 3}. Then a 11 a 12 a 13 a 21 a 22 a 23 a 31 a 32 a 33 = a 11 a 22 a 33 a 11 a 32 a 23 a 21 a 12 a 33 + a 21 a 32 a 13 + a 31 a 12 a 23 a 31 a 22 a 13. This follows since sign({1, 2, 3}) = sign({2, 3, 1}) = sign({3, 1, 2}) = 1, and noting that interchanging two numbers in a permutation reverses it sign we find sign({2, 1, 3}) = sign({3, 2, 1}) = sign({1, 3, 2}) = 1. To compute the value of a determinant from the definition can be a trying experience. It is often better to use elementary operations on rows or columns to reduce it to a simpler form. For example, if A is triangular then det(a) = a 11 a 22 a nn, the product of the diagonal elements. In particular, for the identity matrix det(i) = 1. The elementary operations using either rows or columns are 1. Interchanging two rows(columns): det(b) = det(a), 2. Multiply a row(column) by a scalar: α, det(b) = α det(a), 3. Add a constant multiple of one row(column) to another row(column): det(b) = det(a). where B is the result of performing the indicated operation on A. If only a few elements in a row or column are nonzero then a cofactor expansion can be used. These expansions take the form det(a) = det(a) = n ( 1) i+j a ij det(a ij ) for i = 1,..., n, row (11) j=1 n ( 1) i+j a ij det(a ij ) for j = 1,..., n, column. (12) i=1 Here A i,j denotes the submatrix of A obtained by deleting the ith row and jth column of A. For A C n n and 1 i, j n the determinant det(a ij ) is called the cofactor of a ij. Example 0.28 (Determinant equation for a straight line) The equation for a straight line through two points (x 1, y 1 ) and (x 2, y 2 ) in the plane can be written as the equation 1 x y det(a) := 1 x 1 y 1 1 x 2 y 2 = 0

30 18 Chapter 0. A Short Review of Linear Algebra involving a determinant of order 3. We can compute this determinant using row operations of type 3. Subtracting row 2 from row 3 and then row 1 from row 2, and then using a cofactor expansion on the first column we obtain 1 x y 1 x 1 y 1 1 x 2 y 2 = 1 x y 0 x 1 x y 1 y 0 x 2 x 1 y 2 y 1 = x 1 x y 1 y x 2 x 1 y 2 y 1 = (x 1 x)(y 2 y 1 ) (y 1 y)(x 2 x 1 ). Rearranging the equation det(a) = 0 we obtain y y 1 = y 2 y 1 x 2 x 1 (x x 1 ) which is the slope form of the equation of a straight line. We will freely use, without proofs, the following properties of determinants. If A, B are square matrices of order n with real or complex elements, then 1. det(ab) = det(a) det(b). 2. det(a T ) = det(a), and det(a ) = det(a), (complex conjugate). 3. det(aa) = a n det(a), for a C. 4. A is singular if and only if det(a) = 0. [ ] C D 5. If A = for some square matrices C, E then det(a) = det(c) det(e). 0 E 6. Cramer s rule Suppose A C n n is nonsingular and b C n. Let x = [x 1, x 2,..., x n ] T be the unique solution of Ax = b. Then x j = det(a j(b)), j = 1, 2,..., n, det(a) where A j (b) denote the matrix obtained from A by replacing the jth column of A by b. 7. Adjoint formula for the inverse. If A C n n is nonsingular then A 1 = 1 det(a) adj(a), where the matrix adj(a) C n n with elements adj(a) i,j = ( 1) i+j det(a j,i ) is called the adjoint of A. Moreover, A j,i denotes the submatrix of A obtained by deleting the jth row and ith column of A.

31 0.4. Determinants Cauchy-Binet formula: Let A C m p, B C p n and C = AB. Suppose 1 r min{m, n, p} and let i = {i 1,..., i r } and j = {j 1,..., j r } be integers with 1 i 1 < i 2 < < i r m and 1 j 1 < j 2 < < j r n. Then c i1,j 1 c i1,j r.. = k a i1,k 1 a i1,k r b k1,j 1 b k1,j r...., c ir,j 1 c ir,j r a irk 1 a ir,k r b kr,j 1 b kr,j r where we sum over all k = {k 1,..., k r } with 1 k 1 < k 2 < < k r p. More compactly, det ( C(i, j) ) = k det ( A(i, k) ) det ( B(k, j) ), (13) Note the resemblance to the formula for matrix multiplication. Arthur Cayley, (left), Gabriel Cramer (center), Alexandre- Thophile Vandermonde, (right). The notation for determinants is due to Cayley Exercise 0.29 (Cramer s rule; special case) Solve the following system by Cramers rule: [ ] [ ] [ 1 2 x1 3 = 2 1 x 2 6 ] Exercise 0.30 (Adjoint matrix; special case) Show that if A = 3 2 6, 6 3 2

32 20 Chapter 0. A Short Review of Linear Algebra then Moreover, adj(a) = adj(a)a = = det(a)i. Exercise 0.31 (Determinant equation for a plane) Show that x y z 1 x 1 y 1 z 1 1 x 2 y 2 z 2 1 = 0. x 3 y 3 z 3 1 is the equation for a plane through three points (x 1, y 1, z 1 ), (x 2, y 2, z 2 ) and (x 3, y 3, z 3 ) in space. Exercise 0.32 (Signed area of a triangle) Let P i = (x i, y i ), i = 1, 2, 3, be three points in the plane defining a triangle T. Show that the area of T is 1 A(T ) = x 1 x 2 x 3 y 1 y 2 y 3 The area is positive if we traverse the vertices in counterclockwise order.. Exercise 0.33 (Vandermonde matrix) Show that 1 x 1 x 2 1 x n x 2 x 2 2 x n 1 2 =.... i x j ), i>j(x 1 x n x 2 n x n 1 n where i>j (x i x j ) = n i=2 (x i x 1 )(x i x 2 ) (x i x i 1 ). This determinant is called the Vandermonde determinant. 2 1 Hint: A(T ) = A(ABP 3 P 1 ) + A(P 3 BCP 2 ) A(P 1 ACP 2 ), c.f. Figure 1 2 Hint: subtract x k n times column k from column k+1 for k = n 1, n 2,..., 1.
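Exercise 0.33 can be checked numerically; a small MATLAB sketch with arbitrary distinct nodes (MATLAB's vander orders columns by decreasing powers, so it is flipped to match the exercise):

% Check the Vandermonde determinant formula det(V) = prod_{i>j}(x_i - x_j).
x = [1 2 4 7];                  % arbitrary distinct nodes
n = numel(x);
V = fliplr(vander(x));          % V(i,j) = x_i^(j-1), as in Exercise 0.33
p = 1;
for i = 2:n
    for j = 1:i-1
        p = p*(x(i) - x(j));    % the product over i > j
    end
end
abs(det(V) - p)                 % should be tiny relative to |p|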

33 0.4. Determinants 21 P 3 P 2 P 1 A B C Figure 1: The triangle T defined by the three points P 1, P 2 and P 3. Exercise 0.34 (Cauchy determinant (1842)) Let α = [α 1,..., α n ] T, β = [β 1,..., β n ] T be in R n. a) Consider the matrix A R n n with elements a i,j = 1/(α i + β j ), i, j = 1, 2,..., n. Show that det(a) = P g(α)g(β) where P = n i=1 n j=1 a ij, and for γ = [γ 1,..., γ n ] T g(γ) = n (γ i γ 1 )(γ i γ 2 ) (γ i γ i 1 ) i=2 Hint: Multiply the ith row of A by n j=1 (α i +β j ) for i = 1, 2,..., n. Call the resulting matrix C. Each element of C is a product of n 1 factors α r + β s. Hence det(c) is a sum of terms where each term contain precisely n(n 1) factors α r + β s. Thus det(c) = q(α, β) where q is a polynomial of degree at most n(n 1) in α i and β j. Since det(a) and therefore det(c) vanishes if α i = α j for some i j or β r = β s for some r s, we have that q(α, β) must be divisible by each factor in g(α) and g(β). Since g(α) and g(β) is a polynomial of degree n(n 1), we have q(α, β) = kg(α)g(β) for some constant k independent of α and β. Show that k = 1 by choosing β i + α i = 0, i = 1, 2,..., n. b) Notice that the cofactor of any element in the above matrix A is the determinant of a matrix of similar form. Use the cofactor and determinant of A to represent the elements of A 1 = (b j,k ). Answer: b j,k = (α k + β j )A k ( β j )B j ( α k ),

34 22 Chapter 0. A Short Review of Linear Algebra where A k (x) = s k ( αs x α s α k ), B k (x) = ( ) βs x. β s β k s k Exercise 0.35 (Inverse of the Hilbert matrix) Let H n = (h i,j ) be the n n matrix with elements h i,j = 1/(i + j 1). Use Exercise 0.34 to show that the elements t n i,j in T n = H 1 n are given by t n i,j = f(i)f(j) i + j 1, where ( i 2 n 2 ) f(i+1) = f(i), i = 1, 2,..., f(1) = n. i Eigenvalues, eigenvectors and eigenpairs Suppose A C n n is a square matrix, λ C and x C n. We say that (λ, x) is an eigenpair for A if Ax = λx and x is nonzero. The scalar λ is called an eigenvalue and x is said to be an eigenvector. 3 The set of eigenvalues is called the spectrum of A and is denoted by σ(a). For example, σ(i) = {1,..., 1} = {1}. Eigenvalues are the roots of the characteristic polynomial. Lemma 0.36 (Characteristic equation) For any A C n n we have λ σ(a) det(a λi) = 0. Proof. Suppose (λ, x) is an eigenpair for A. The equation Ax = λx can be written (A λi)x = 0. Since x is nonzero the matrix A λi must be singular with a zero determinant. Conversely, if det(a λi) = 0 then A λi is singular and (A λi)x = 0 for some nonzero x C n. Thus Ax = λx and (λ, x) is an eigenpair for A. The expression det(a λi) is a polynomial of exact degree n in λ. n = 3 we have a 11 λ a 12 a 13 det(a λi) = a 21 a 22 λ a 23 a 31 a 32 a 33 λ. For 3 The word eigen is derived from German and means own

35 0.5. Eigenvalues, eigenvectors and eigenpairs 23 Expanding this determinant by the first column we find det(a λi) = (a 11 λ) a 22 λ a 23 a 32 a 33 λ a 21 a 12 a 13 a 32 a 33 λ a + a 12 a a 22 λ a 23 = (a 11 λ)(a 22 λ)(a 33 λ) + r(λ) for some polynomial r of degree at most one. In general det(a λi) = (a 11 λ)(a 22 λ) (a nn λ) + r(λ), (14) where each term in r(λ) has at most n 2 factors containing λ. It follows that r is a polynomial of degree at most n 2, det(a λi) is a polynomial of exact degree n in λ and the eigenvalues are the roots of this polynomial. We observe that det(a λi) = ( 1) n det(λi A) so det(a λi) = 0 if and only if det(λi A) = 0. Definition 0.37 (Characteristic polynomial of a matrix) The function π A : C C given by π A (λ) = det(a λi) is called the characteristic polynomial of A. The equation det(a λi) = 0 is called the characteristic equation of A. By the fundamental theorem of algebra an n n matrix has, counting multiplicities, precisely n eigenvalues λ 1,..., λ n some of which might be complex even if A is real. The complex eigenpairs of a real matrix occur in complex conjugate pairs. Indeed, taking the complex conjugate on both sides of the equation Ax = λx with A real gives Ax = λx. Theorem 0.38 (Sums and products of eigenvalues; trace) For any A C n n trace(a) = λ 1 + λ λ n, det(a) = λ 1 λ 2 λ n, (15) where the trace of A C n n is the sum of its diagonal elements trace(a) := a 11 + a a nn. (16) Proof. We compare two different expansions of π A. On the one hand from (14) we find π A (λ) = ( 1) n λ n + c n 1 λ n c 0, where c n 1 = ( 1) n 1 trace(a) and c 0 = π A (0) = det(a). On the other hand π A (λ) = (λ 1 λ) (λ n λ) = ( 1) n λ n + d n 1 λ n d 0,

where d_{n-1} = (−1)^{n-1}(λ_1 + ··· + λ_n) and d_0 = λ_1 λ_2 ··· λ_n. Since c_j = d_j for all j we obtain (15).

For a 2×2 matrix the characteristic equation takes the convenient form
\[ \lambda^2 - \mathrm{trace}(A)\lambda + \det(A) = 0. \tag{17} \]
Thus a 2×2 matrix A with trace(A) = 4 and det(A) = 3 has π_A(λ) = λ² − 4λ + 3.

Since A is singular ⟺ Ax = 0 for some x ≠ 0 ⟺ Ax = 0x for some x ≠ 0 ⟺ zero is an eigenvalue of A, we obtain

Theorem 0.39 (Zero eigenvalue) The matrix A ∈ C^{n×n} is singular if and only if zero is an eigenvalue.

Since the determinant of a triangular matrix is equal to the product of the diagonal elements, the eigenvalues of a triangular matrix are found on the diagonal. In general it is not easy to find all eigenvalues of a matrix. However, sometimes the dimension of the problem can be reduced. Since the determinant of a block triangular matrix is equal to the product of the determinants of the diagonal blocks, we obtain

Theorem 0.40 (Eigenvalues of a block triangular matrix) If A = \begin{bmatrix} B & D \\ 0 & C \end{bmatrix} is block triangular then π_A = π_B π_C.
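Theorem 0.38 is easy to test numerically. A quick MATLAB sketch with a random matrix; the discrepancies should be of the order of rounding error:

% Check Theorem 0.38: trace = sum of eigenvalues, det = product of eigenvalues.
rng(3); A = randn(5);
lam = eig(A);
abs(trace(A) - sum(lam))      % approximately zero
abs(det(A) - prod(lam))       % approximately zero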

37 0.6. Algorithms and Numerical Stability 25 A list of freely available software for solving linear algebra problems can be found at


Part I

Linear Systems


Chapter 1

Diagonally dominant tridiagonal matrices; three examples

In this introductory chapter we consider three problems originating from:
- cubic spline interpolation,
- a two point boundary value problem,
- an eigenvalue problem for a two point boundary value problem.

Each of these problems leads to a linear algebra problem with a matrix which is diagonally dominant and tridiagonal. Taking advantage of this structure we can show existence, uniqueness and characterization of a solution, and derive efficient algorithms to compute a numerical solution. For a particular tridiagonal test matrix we determine all its eigenvectors and eigenvalues. We will need these later when studying more complex problems. We end the chapter with an introduction to block multiplication, a powerful tool in matrix analysis and numerical linear algebra. Block multiplication is applied to derive some basic facts about triangular matrices.

1.1 Cubic Spline Interpolation

We consider the following interpolation problem. Given an interval [a, b], n + 1 equidistant sites
\[ x_i = a + \frac{i-1}{n}(b - a), \qquad i = 1, 2, \ldots, n+1, \tag{1.1} \]
and y-values y := [y_1, ..., y_{n+1}]^T ∈ R^{n+1}, we seek a function g : [a, b] → R such that
\[ g(x_i) = y_i, \qquad \text{for } i = 1, \ldots, n+1. \tag{1.2} \]

Figure 1.1: The polynomial of degree 13 interpolating f(x) = arctan(10x) + π/2 on [−1, 1]. See text.

For simplicity we only consider equidistant sites. More generally they could be any a ≤ x_1 < x_2 < ··· < x_{n+1} ≤ b.

1.1.1 Polynomial interpolation

Since there are n + 1 interpolation conditions in (1.2), a natural choice for the function g is a polynomial of degree n. As shown in most books on numerical methods, such a g is uniquely defined and there are good algorithms for computing it. Evidently, when n = 1, g is the straight line
\[ g(x) = y_1 + \frac{y_2 - y_1}{x_2 - x_1}(x - x_1), \tag{1.3} \]
known as the linear interpolation polynomial. Polynomial interpolation is an important technique which often gives good results, but the interpolant g can have undesirable oscillations when n is large. As an example, consider the function given by f(x) = arctan(10x) + π/2, x ∈ [−1, 1]. The function f and the polynomial g of degree at most 13 satisfying (1.2) with [a, b] = [−1, 1] and y_i = f(x_i), i = 1, ..., 14, are shown in Figure 1.1. The interpolant has large oscillations near the ends of the range. This is an example of the Runge phenomenon and is due to the fact that the sites are uniformly spaced. Using a larger n will only make the oscillations bigger.
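The setup of Figure 1.1 is easy to reproduce. A small MATLAB sketch; polyfit may warn that the problem is badly conditioned, which is part of the point:

% Reproduce the setup of Figure 1.1: degree 13 interpolation at 14 uniform sites.
n  = 13;
x  = linspace(-1, 1, n+1);            % the sites x_1,...,x_{n+1} of (1.1)
y  = atan(10*x) + pi/2;               % data from f(x) = arctan(10x) + pi/2
c  = polyfit(x, y, n);                % coefficients of the interpolating polynomial
xx = linspace(-1, 1, 401);
plot(xx, atan(10*xx) + pi/2, xx, polyval(c, xx), x, y, 'o')
legend('f', 'degree 13 interpolant', 'data')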

1.1.2 Piecewise linear and cubic spline interpolation

To avoid oscillations like the one in Figure 1.1, piecewise linear interpolation can be used. An example is shown in Figure 1.2. The interpolant g approximates the original function quite well, and for some applications, like plotting, the linear interpolant using many points is what is used.

Figure 1.2: The piecewise linear polynomial interpolating f(x) = arctan(10x) + π/2 at 14 uniform points on [−1, 1].

Note that g is a piecewise polynomial of the form
\[ g(x) := \begin{cases} p_1(x), & \text{if } x_1 \le x < x_2, \\ p_2(x), & \text{if } x_2 \le x < x_3, \\ \quad\vdots \\ p_{n-1}(x), & \text{if } x_{n-1} \le x < x_n, \\ p_n(x), & \text{if } x_n \le x \le x_{n+1}, \end{cases} \tag{1.4} \]
where each p_i is a polynomial of degree 1. In particular, p_1 is given in (1.3) and the other polynomials p_i are given by similar expressions. The piecewise linear interpolant is continuous, but the first derivative will usually have jumps at the interior sites. We can obtain a smoother approximation by letting g be a piecewise polynomial of higher degree. With degree 3 (cubic) we obtain continuous derivatives of order 2 (C^2). We consider here the following functions giving examples of C^2 cubic spline interpolants.

Definition 1.1 (The D^2-spline problem) Given n ∈ N, an interval [a, b], y ∈ R^{n+1}, knots (sites) x_1, ..., x_{n+1} given by (1.1) and numbers µ_1, µ_{n+1}. The problem is to find a function g : [a, b] → R such that
- g is of the form (1.4) with each p_i a cubic polynomial (a piecewise cubic polynomial),
- g ∈ C^2[a, b], i.e., derivatives of order ≤ 2 are continuous on [a, b] (smoothness),
- g(x_i) = y_i, i = 1, 2, ..., n + 1 (interpolation conditions),
- g''(a) = µ_1, g''(b) = µ_{n+1} (D^2 boundary conditions).

Figure 1.3: A cubic spline with one knot interpolating f(x) = x^4 on [0, 2].

We call g a D^2-spline. It is called an N-spline or natural spline if µ_1 = µ_{n+1} = 0.

Example 1.2 (A D^2-spline) Suppose we choose n = 2 and sample data from the function f : [0, 2] → R given by f(x) = x^4. Thus we consider the D^2-spline problem with [a, b] = [0, 2], y := [0, 1, 16]^T and µ_1 = g''(0) = 0, µ_3 = g''(2) = 48. The knots are x_1 = 0, x_2 = 1 and x_3 = 2. The function g given by
\[ g(x) := \begin{cases} p_1(x) = -\tfrac12 x + \tfrac32 x^3, & \text{if } 0 \le x < 1, \\ p_2(x) = 1 + 4(x-1) + \tfrac92 (x-1)^2 + \tfrac{13}{2}(x-1)^3, & \text{if } 1 \le x \le 2, \end{cases} \tag{1.5} \]
is a D^2-spline solving this problem. Indeed, p_1 and p_2 are cubic polynomials. For smoothness we find p_1(1) = p_2(1) = 1, p_1'(1) = p_2'(1) = 4, p_1''(1) = p_2''(1) = 9, which implies that g ∈ C^2[0, 2]. Finally we check that the interpolation and boundary conditions hold. Indeed, g(0) = p_1(0) = 0, g(1) = p_2(1) = 1, g(2) = p_2(2) = 16, g''(0) = p_1''(0) = 0 and g''(2) = p_2''(2) = 48. Note that p_1'''(x) = 9 ≠ 39 = p_2'''(x), showing that the third derivative of g is piecewise constant with a jump discontinuity at the interior knot. A plot of f and g is shown in Figure 1.3. It is hard to distinguish one from the other.
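The conditions verified in Example 1.2 can also be checked numerically. A minimal MATLAB sketch, using the two pieces written out in (1.5) and their derivatives computed by hand:

% Check the smoothness, interpolation and boundary conditions of Example 1.2.
p1 = @(x) -x/2 + (3/2)*x.^3;                                  % piece on [0,1)
p2 = @(x) 1 + 4*(x-1) + (9/2)*(x-1).^2 + (13/2)*(x-1).^3;     % piece on [1,2]
% values, first and second derivatives agree at the interior knot x = 1:
[p1(1) - p2(1), (-1/2 + 9/2) - 4, 9*1 - 9]                    % all zero
% interpolation and boundary conditions:
[p1(0), p2(1), p2(2)]                                         % 0, 1, 16
[9*0, 9 + 39*(2-1)]                                           % g''(0) = 0, g''(2) = 48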

We note the following.
- The C^2 condition is equivalent to p_{i-1}^{(j)}(x_i) = p_i^{(j)}(x_i), j = 0, 1, 2, i = 2, ..., n.
- The extra boundary conditions D^2 or N are introduced to obtain a unique interpolant. Indeed, counting requirements we have 3(n − 1) C^2 conditions, n + 1 conditions (1.2), and two boundary conditions, adding up to 4n. Since a cubic polynomial has four coefficients, this number is equal to the number of coefficients of the n polynomials p_1, ..., p_n and gives hope for uniqueness of the interpolant.

1.1.3 Give me a moment

Existence and uniqueness of a solution of the D^2-spline problem hinges on the nonsingularity of a linear system of equations that we now derive. The unknowns are derivatives at the knots. Here we use second derivatives, which are sometimes called moments. We start with the following lemma.

Lemma 1.3 (Representing each p_i using (0, 2) interpolation) Given a < b, h = (b − a)/n with n ≥ 2, x_i = a + (i − 1)h, and numbers y_i, µ_i for i = 1, ..., n + 1. For i = 1, ..., n there are unique cubic polynomials p_i such that
\[ p_i(x_i) = y_i, \quad p_i(x_{i+1}) = y_{i+1}, \quad p_i''(x_i) = \mu_i, \quad p_i''(x_{i+1}) = \mu_{i+1}. \tag{1.6} \]
Moreover,
\[ p_i(x) = c_{i,1} + c_{i,2}(x - x_i) + c_{i,3}(x - x_i)^2 + c_{i,4}(x - x_i)^3, \qquad i = 1, \ldots, n, \tag{1.7} \]
where
\[ c_{i,1} = y_i, \quad c_{i,2} = \frac{y_{i+1} - y_i}{h} - \frac{h}{3}\mu_i - \frac{h}{6}\mu_{i+1}, \quad c_{i,3} = \frac{\mu_i}{2}, \quad c_{i,4} = \frac{\mu_{i+1} - \mu_i}{6h}. \tag{1.8} \]

Proof. Consider p_i in the form (1.7) for some 1 ≤ i ≤ n. Evoking (1.6) we find p_i(x_i) = c_{i,1} = y_i. Since p_i''(x) = 2c_{i,3} + 6c_{i,4}(x − x_i) we obtain c_{i,3} from p_i''(x_i) = 2c_{i,3} = µ_i (a moment), and then c_{i,4} from p_i''(x_{i+1}) = µ_i + 6h c_{i,4} = µ_{i+1}. Finally we find c_{i,2} by solving p_i(x_{i+1}) = y_i + c_{i,2}h + \frac{\mu_i}{2}h^2 + \frac{\mu_{i+1} - \mu_i}{6h}h^3 = y_{i+1}. For j = 0, 1, 2, 3 the shifted powers (x − x_i)^j constitute a basis for cubic polynomials and the formulas (1.8) are unique by construction. It follows that p_i is unique.

Theorem 1.4 (Constructing a D^2-spline) Suppose for some moments µ_1, ..., µ_{n+1} that each p_i is given as in Lemma 1.3 for i = 1, ..., n. If in addition
\[ \mu_{i-1} + 4\mu_i + \mu_{i+1} = \frac{6}{h^2}\,(y_{i+1} - 2y_i + y_{i-1}), \qquad i = 2, \ldots, n, \tag{1.9} \]
then the function g given by (1.4) solves a D^2-spline problem.

Proof. Suppose for 1 ≤ i ≤ n that p_i is given as in Lemma 1.3 for some µ_1, ..., µ_{n+1}. Consider the C^2 requirement. Since p_{i-1}(x_i) = p_i(x_i) = y_i and p_{i-1}''(x_i) = p_i''(x_i) = µ_i for i = 2, ..., n, it follows that g ∈ C^2 if and only if p_{i-1}'(x_i) = p_i'(x_i) for i = 2, ..., n. By (1.7)
\[
\begin{aligned}
p_{i-1}'(x_i) &= c_{i-1,2} + 2h\,c_{i-1,3} + 3h^2 c_{i-1,4} = \frac{y_i - y_{i-1}}{h} + \frac{h}{6}\mu_{i-1} + \frac{h}{3}\mu_i, \\
p_i'(x_i) &= c_{i,2} = \frac{y_{i+1} - y_i}{h} - \frac{h}{3}\mu_i - \frac{h}{6}\mu_{i+1}.
\end{aligned} \tag{1.10}
\]
A simple calculation shows that p_{i-1}'(x_i) = p_i'(x_i) if and only if (1.9) holds. Finally consider the function g given by (1.4). If (1.9) holds then g ∈ C^2[a, b]. By construction g(x_i) = y_i, i = 1, ..., n + 1, g''(a) = p_1''(x_1) = µ_1 and g''(b) = p_n''(x_{n+1}) = µ_{n+1}. It follows that g solves the D^2-spline problem.

In order for the D^2-spline to exist we need to show that µ_2, ..., µ_n always can be determined from (1.9). For n ≥ 3, and with µ_1 and µ_{n+1} given, (1.9) can be written in the form
\[
\begin{bmatrix}
4 & 1 & & & \\
1 & 4 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & 4 & 1 \\
 & & & 1 & 4
\end{bmatrix}
\begin{bmatrix} \mu_2 \\ \mu_3 \\ \vdots \\ \mu_{n-1} \\ \mu_n \end{bmatrix}
=
\frac{6}{h^2}
\begin{bmatrix} \delta^2 y_2 \\ \delta^2 y_3 \\ \vdots \\ \delta^2 y_{n-1} \\ \delta^2 y_n \end{bmatrix}
-
\begin{bmatrix} \mu_1 \\ 0 \\ \vdots \\ 0 \\ \mu_{n+1} \end{bmatrix},
\qquad \delta^2 y_i := y_{i+1} - 2y_i + y_{i-1}. \tag{1.11}
\]
This is a square linear system of equations. We recall (see Theorem 0.18) that a square system Ax = b has a solution for all right-hand sides b if and only if the coefficient matrix A is nonsingular, i.e., the homogeneous system Ax = 0 only has the solution x = 0. Moreover, the solution is unique. We need to show that the coefficient matrix in (1.11) is nonsingular. We observe that the matrix in (1.11) is strictly diagonally dominant in accordance with the following definition.

Definition 1.5 (Strict diagonal dominance) The matrix A = [a_{ij}] ∈ C^{n×n} is strictly diagonally dominant if
\[ |a_{ii}| > \sum_{j \ne i} |a_{ij}|, \qquad i = 1, \ldots, n. \tag{1.12} \]
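The strict diagonal dominance of the matrix in (1.11) is easy to confirm numerically. The MATLAB sketch below assembles the system for arbitrary data, checks (1.12), and solves with backslash; the tridiagonal solver of Section 1.1.4 could be used instead:

% Assemble and solve the moment system (1.11) for given data y and boundary moments.
n = 8;  h = 1/n;
y   = rand(n+1,1);  mu1 = 0;  mun1 = 0;                  % arbitrary data, natural boundary moments
d2y = y(3:n+1) - 2*y(2:n) + y(1:n-1);                    % delta^2 y_i, i = 2,...,n
e   = ones(n-1,1);
A   = diag(4*e) + diag(e(1:n-2),-1) + diag(e(1:n-2),1);  % tridiag(1,4,1) of order n-1
b   = 6/h^2*d2y;  b(1) = b(1) - mu1;  b(end) = b(end) - mun1;
all(abs(diag(A)) > sum(abs(A),2) - abs(diag(A)))         % strict diagonal dominance (1.12)
mu  = A\b;                                               % the moments mu_2,...,mu_n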

Theorem 1.6 (Strict diagonal dominance) A strictly diagonally dominant matrix is nonsingular. Moreover, if A ∈ C^{n×n} is strictly diagonally dominant then the solution x of Ax = b is bounded as follows:
\[ \max_{1 \le i \le n} |x_i| \le \max_{1 \le i \le n} \frac{|b_i|}{\sigma_i}, \qquad \text{where } \sigma_i := |a_{ii}| - \sum_{j \ne i} |a_{ij}|. \tag{1.13} \]

Proof. We first show that the bound (1.13) holds for any solution x. Choose k so that |x_k| = max_i |x_i|. Then
\[ |b_k| = \Big| a_{kk}x_k + \sum_{j \ne k} a_{kj}x_j \Big| \ge |a_{kk}||x_k| - \sum_{j \ne k} |a_{kj}||x_j| \ge |x_k| \Big( |a_{kk}| - \sum_{j \ne k} |a_{kj}| \Big), \]
and this implies max_{1≤i≤n} |x_i| = |x_k| ≤ |b_k|/σ_k ≤ max_{1≤i≤n} (|b_i|/σ_i). For the nonsingularity, if Ax = 0, then max_{1≤i≤n} |x_i| ≤ 0 by (1.13), and so x = 0.

Theorem 1.6 implies that the system (1.11) has a unique solution, giving rise to a function g detailed in Lemma 1.3 and solving the D^2-spline problem. For uniqueness, suppose g_1 and g_2 are two D^2-splines interpolating the same data. Then g := g_1 − g_2 is an N-spline satisfying (1.2) with y = 0. The solution [µ_2, ..., µ_n]^T of (1.11) and also µ_1 = µ_{n+1} are zero. It follows from (1.8) that all coefficients c_{i,j} are zero. We conclude that g = 0 and g_1 = g_2.

Example 1.7 (Cubic B-spline) For the N-spline with [a, b] = [0, 4] and y = [0, 1/6, 2/3, 1/6, 0] the linear system (1.9) takes the form
\[ \begin{bmatrix} 4 & 1 & 0 \\ 1 & 4 & 1 \\ 0 & 1 & 4 \end{bmatrix} \begin{bmatrix} \mu_2 \\ \mu_3 \\ \mu_4 \end{bmatrix} = \begin{bmatrix} 2 \\ -6 \\ 2 \end{bmatrix}. \]
The solution is µ_2 = µ_4 = 1, µ_3 = −2. The knot set is {0, 1, 2, 3, 4}. Using (1.8) (cf. Exercise 1.18) we find
\[ g(x) := \begin{cases} p_1(x) = \tfrac16 x^3, & \text{if } 0 \le x < 1, \\ p_2(x) = \tfrac16 + \tfrac12(x-1) + \tfrac12(x-1)^2 - \tfrac12(x-1)^3, & \text{if } 1 \le x < 2, \\ p_3(x) = \tfrac23 - (x-2)^2 + \tfrac12(x-2)^3, & \text{if } 2 \le x < 3, \\ p_4(x) = \tfrac16 - \tfrac12(x-3) + \tfrac12(x-3)^2 - \tfrac16(x-3)^3 = \tfrac16(4-x)^3, & \text{if } 3 \le x \le 4. \end{cases} \tag{1.14} \]
A plot of this spline is shown in Figure 1.4. On (0, 4) the function g equals the nonzero part of a C^2 cubic B-spline.
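The moments in Example 1.7 can be recomputed in a couple of MATLAB lines; the system and data are exactly those stated in the example:

% Recompute the moments of the cubic B-spline in Example 1.7.
h = 1;  y = [0 1/6 2/3 1/6 0];                  % N-spline data on the knots 0,1,2,3,4
d2y = y(3:5) - 2*y(2:4) + y(1:3);               % second differences delta^2 y_i
A = [4 1 0; 1 4 1; 0 1 4];                      % coefficient matrix of (1.9)
mu = A \ (6/h^2 * d2y')                         % gives mu = [1; -2; 1]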

Figure 1.4: A cubic B-spline.

1.1.4 LU Factorization of a Tridiagonal System

To find the D^2-spline g we have to solve the tridiagonal system (1.11). Consider solving a general tridiagonal linear system Ax = b where A = tridiag(a_i, d_i, c_i) ∈ C^{n×n}. Instead of using Gaussian elimination we can try to construct triangular matrices L and U such that the product A = LU has the form
\[
\begin{bmatrix}
d_1 & c_1 & & & \\
a_1 & d_2 & c_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & a_{n-2} & d_{n-1} & c_{n-1} \\
 & & & a_{n-1} & d_n
\end{bmatrix}
=
\begin{bmatrix}
1 & & & \\
l_1 & 1 & & \\
 & \ddots & \ddots & \\
 & & l_{n-1} & 1
\end{bmatrix}
\begin{bmatrix}
u_1 & c_1 & & & \\
 & u_2 & c_2 & & \\
 & & \ddots & \ddots & \\
 & & & u_{n-1} & c_{n-1} \\
 & & & & u_n
\end{bmatrix}. \tag{1.15}
\]
If L and U can be determined and u_1, ..., u_n are nonzero, we can find x by solving two simpler systems Lz = b and Ux = z. For n = 3 equation (1.15) takes the form
\[
\begin{bmatrix} d_1 & c_1 & 0 \\ a_1 & d_2 & c_2 \\ 0 & a_2 & d_3 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 \\ l_1 & 1 & 0 \\ 0 & l_2 & 1 \end{bmatrix}
\begin{bmatrix} u_1 & c_1 & 0 \\ 0 & u_2 & c_2 \\ 0 & 0 & u_3 \end{bmatrix}
=
\begin{bmatrix} u_1 & c_1 & 0 \\ l_1 u_1 & l_1 c_1 + u_2 & c_2 \\ 0 & l_2 u_2 & l_2 c_2 + u_3 \end{bmatrix},
\]
and the systems Lz = b and Ux = z can be written
\[
\begin{bmatrix} 1 & 0 & 0 \\ l_1 & 1 & 0 \\ 0 & l_2 & 1 \end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix},
\qquad
\begin{bmatrix} u_1 & c_1 & 0 \\ 0 & u_2 & c_2 \\ 0 & 0 & u_3 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
=
\begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix}.
\]

u_1 = d_1, l_1 = a_1/u_1, u_2 = d_2 - l_1 c_1, l_2 = a_2/u_2, u_3 = d_3 - l_2 c_2,
z_1 = b_1, z_2 = b_2 - l_1 z_1, z_3 = b_3 - l_2 z_2,
x_3 = z_3/u_3, x_2 = (z_2 - c_2 x_3)/u_2, x_1 = (z_1 - c_1 x_2)/u_1.

In general, if

    u_1 = d_1,   l_k = a_k/u_k,   u_{k+1} = d_{k+1} - l_k c_k,   k = 1, 2, ..., n-1,      (1.16)

then A = LU. If u_1, u_2, ..., u_{n-1} are nonzero then (1.16) is well defined. If in addition u_n ≠ 0 then we can solve Lz = b and Ux = z for z and x. We formulate this as two algorithms.

Algorithm 1.8 (trifactor) Vectors l ∈ C^{n-1}, u ∈ C^n are computed from a, c ∈ C^{n-1}, d ∈ C^n. This implements the LU factorization of a tridiagonal matrix.

    function [l,u] = trifactor(a,d,c)
    % [l,u] = trifactor(a,d,c)
    u = d; l = a;
    for k = 1:length(a)
        l(k) = a(k)/u(k);
        u(k+1) = d(k+1) - l(k)*c(k);
    end

Algorithm 1.9 (trisolve) The solution x of a tridiagonal system with r right hand sides is computed from a previous call to trifactor. Here l ∈ C^{n-1} and u ∈ C^n are output from trifactor and b ∈ C^{n x r} for some r ∈ N.

    function x = trisolve(l,u,c,b)
    % x = trisolve(l,u,c,b)
    x = b;
    n = size(b,1);
    for k = 2:n
        x(k,:) = b(k,:) - l(k-1)*x(k-1,:);
    end
    x(n,:) = x(n,:)/u(n);
    for k = (n-1):-1:1
        x(k,:) = (x(k,:) - c(k)*x(k+1,:))/u(k);
    end

Since division by zero can occur, the algorithms will not work in general, but for tridiagonal strictly diagonally dominant systems Theorem 1.10 below guarantees that they succeed; see also the short usage sketch that follows.4

4 It can be shown that any strictly diagonally dominant linear system has a unique LU factorization.
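As a usage sketch (assuming the functions trifactor and trisolve above are on the MATLAB path), the following factors a strictly diagonally dominant tridiagonal matrix and solves one system; the full matrix A is formed only to check the residual.

    n = 10;
    a = -ones(n-1,1); c = -ones(n-1,1); d = 4*ones(n,1);  % tridiag(-1,4,-1), strictly dominant
    b = ones(n,1);
    [l,u] = trifactor(a,d,c);
    x = trisolve(l,u,c,b);
    A = diag(d) + diag(a,-1) + diag(c,1);                 % for checking only
    norm(A*x - b)                                         % of order machine precision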

50 38 Chapter 1. Diagonally dominant tridiagonal matrices; three examples Theorem 1.10 (LU of a tridiagonal strictly dominant system) A strictly diagonally dominant tridiagonal matrix has a unique LU-factorization of the form (1.15). Proof. We show that the u k s in (1.16) are nonzero for k = 1,..., n. For this it is sufficient to show by induction that u k σ k + c k, where, σ k := d k a k 1 c k > 0, k = 1,..., n, (1.17) and where a 0 := c n := 0. By assumption u 1 = d 1 = σ 1 + c 1. Suppose u k σ k + c k for some 1 k n 1. Then c k / u k < 1 and by (1.16) and strict diagonal dominance u k+1 = d k+1 l k c k = d k+1 a kc k d k+1 a k c k u k u k d k+1 a k = σ k+1 + c k+1. (1.18) Corollary 1.11 (Stability of the LU factorization) Suppose A C n n is tridiagonal and strictly diagonally dominant with computed elements in the LU factorization given by (1.16). Then (1.17) holds, u 1 = d 1 and l k = a k u k a k d k a k 1, u k+1 d k+1 + a k c k d k a k 1, k = 1,..., n 1. Proof. Using (1.16) and (1.17) for 1 k n 1 we find (1.19) l k = a k u k a k d k a k 1, u k+1 d k+1 + l k c k d k+1 + a k c k d k a k 1. For a strictly diagonally dominant tridiagonal matrix it follows from Corollary 1.11 that the LU factorization algorithm trifactor is stable. We cannot have severe growth in the computed elements u k and l k. The number of arithmetic operations to compute the LU factorization of a tridiagonal matrix of order n using (1.16) is 3n 3, while the number of arithmetic operations for Algorithm trisolve is r(5n 4), where r is the number of right-hand sides. This means that the complexity to solve one

51 1.1. Cubic Spline Interpolation 39 tridiagonal system is O(n), or more precisely 8n 7, and this number only grows linearly 5 with n Exercises for section 1.1 Exercise 1.12 (The shifted power basis is a basis) Show that the polynomials {(x x i ) j } 0 j n is a basis for polynomials of degree n. 6 Exercise 1.13 (The natural spline, n = 1) How can one define an N-spline when n = 1? Exercise 1.14 (Bounding the moments) Show that for the N-spline the solution of the linear system (1.11) is bounded as follows: 7 max µ j 3 2 j n h 2 max y i+1 2y i + y i 1. 2 i n Exercise 1.15 (Moment equations for 1. derivative boundary conditions) Suppose instead of the D 2 boundary conditions we use D 1 conditions given by g (a) = s 1 and g (b) = s n+1 for some given numbers s 1 and s n+1. Show that the linear system for the moments of a D 1 -spline can be written µ 1 µ 2. µ n µ n+1 = 6 h 2 y 2 y 1 hs 1 δ 2 y 2 δ 2 y 3. δ 2 y n 1 δ 2 y n hs n+1 y n+1 + y n, (1.20) where δ 2 y i := y i+1 2y i + y i 1, i = 2,..., n. Hint: Use (1.10) to compute g (x 1 ) and g (x n+1 ), Is g unique? 5 We show in Section that Gaussian elimination on a full n n system is an O(n 3 ) process 6 Hint: consider an arbitrary polynomial of degree n and expanded it in Taylor series around x i 7 Hint, use Theorem 1.6

52 40 Chapter 1. Diagonally dominant tridiagonal matrices; three examples Exercise 1.16 (Minimal norm property of the natural spline) Study proof of the following theorem 8. Theorem 1.17 (Minimal norm property of a cubic spline) Suppose g is an N-spline. Then b ( g (x) ) b 2 ( dx h (x) ) 2 dx a a for all h C 2 [a, b] such that h(x i ) = g(x i ), i = 1,..., n + 1. Proof. Let h be any interpolant as in the theorem. We first show the orthogonality condition b a g e = 0, e := h g. (1.21) Integration by parts gives b a g e = [ g e ] b b a a g e. The first term is zero since g is continuous and g (b) = g (a) = 0. For the second term, since g is equal to a constant v i on each subinterval (x i, x i+1 ) and e(x i ) = 0, for i = 1,..., n + 1 b a g e = n i=1 xi+1 x i g e = n xi+1 v i i=1 Writing h = g + e and using (1.21) b ( b h ) 2 ( = g + e ) 2 x i e = n ( v i e(xi+1 ) e(x i ) ) = 0. i=1 a = = a b a b ( b g ) 2 ( b + e ) g e a a ( b g ) 2 ( b + e ) 2 ( g ) 2 a a a and the proof is complete. 8 The name spline is inherited from a physical analogue, an elastic ruler that is used to draw smooth curves. Heavy weights, called ducks, are used to force the ruler to pass through, or near given locations. (Cf. Figure 1.5). The ruler will take a shape that minimizes its potential energy. Since the potential energy is proportional to the integral of the square of the curvature, and the curvature can be approximated by the second derivative it follows from Theorem 1.17 that the mathematical N-spline approximately models the physical spline.

53 1.1. Cubic Spline Interpolation 41 Figure 1.5: A physical spline with ducks Computer exercises for section 1.1 Exercise 1.18 (Computing the D 2 -spline) Let g be the D 2 -spline corresponding to an interval [a, b], a vector y R n+1 and µ 1, µ n+1. The vector x = [x 1,..., x n ] and the coefficient matrix C R n 4 in (1.7) are returned in the following algorithm. It uses algorithms 1.8 and 1.9 to solve the tridiagonal linear system. Algorithm 1.19 (splineint) 1 function [ x,c]= s p l i n e i n t ( a, b, y, mu1, munp1) 2 y=y ( : ) ; n=length ( y ) 1; 3 h=(b a ) /n ; x=a : h : b h ; c=ones ( n 2,1) ; 4 [ l, u]= t r i f a c t o r ( c, 4 ones ( n 1,1), c ) ; 5 b1=6/h ˆ2 ( y ( 3 : n+1) 2 y ( 2 : n )+y ( 1 : n 1) ) ; 6 b1 ( 1 )=b1 ( 1 ) mu1 ; b1 ( n 1)=b1 ( n 1) munp1 ; 7 mu= [ mu1 ; t r i s o l v e ( l, u, c, b1 ) ; munp1 ] ; 8 C=zeros (4 n, 1 ) ; 9 C( 1 : 4 : 4 n 3)=y ( 1 : n ) ; 10 C( 2 : 4 : 4 n 2)=(y ( 2 : n+1) y ( 1 : n ) ) /h h mu( 1 : n )/3 h mu( 2 : n+1) / 6 ; 11 C( 3 : 4 : 4 n 1)=mu( 1 : n ) / 2 ; 12 C( 4 : 4 : 4 n )=(mu( 2 : n+1) mu( 1 : n ) ) /(6 h ) ; 13 C=reshape (C, 4, n ) ; Use the algorithm to compute the c i,j in Example 1.7.
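A usage sketch (assuming splineint, trifactor and trisolve are all available): the call below reproduces the coefficients of Example 1.7; whether each p_i appears as a row or a column of C depends on the final reshape in Algorithm 1.19.

    y = [0; 1/6; 2/3; 1/6; 0];
    [x, C] = splineint(0, 4, y, 0, 0);   % N-spline: mu_1 = mu_5 = 0
    disp(C)                              % compare with the coefficients in (1.14)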

54 42 Chapter 1. Diagonally dominant tridiagonal matrices; three examples Figure 1.6: The cubic spline interpolating f(x) = arctan(10x)+π/2 at 14 equidistant sites on [ 1, 1]. The exact function is also shown. Exercise 1.20 (Spline evaluation) To plot a piecewise polynomial g in the form (1.4) we need to compute values g(r j ) at a number of sites r = [r 1,..., r m ] R m for some reasonably large integer m. To determine g(r j ) for some j we need to find an integer i j so that g(r j ) = p ij (r j ). Given k N, t = [t 1,..., t k ] and a real number x. We consider the problem of computing an integer i so that i = 1 if x < t 2, i = k if x t k, and t i x < t i+1 otherwise. If x R m is a vector then an m-vector i should be computed, such that the jth component of i gives the location of the jth component of x. The following Matlab function determines i = [i 1,..., i m ]. It uses the built in Matlab functions length, min, sort, find. Algorithm 1.21 (findsubintervals) 1 function i = f i n d s u b i n t e r v a l s ( t, x ) 2 %i = f i n d s u b i n t e r v a l s ( t, x ) 3 k=length ( t ) ; m=length ( x ) ; 4 i f k<2 5 i=ones (m, 1 ) ; 6 else 7 t ( 1 )=min( x ( 1 ), t ( 1 ) ) 1; 8 [, j ]= sort ( [ t ( : ), x ( : ) ] ) ; 9 i =(find ( j>k ) (1:m) ) ; 10 end Use findsubintervals and the following algorithm to make the plots in

55 1.2. A two point boundary value problem 43 Figure 1.6. Algorithm 1.22 (splineval) Given the output x, C of Algorithm 1.19 defining a cubic spline g, and a vector X. The vector G = g(x) is computed. 1 function [X,G]= s p l i n e v a l ( x, C,X) 2 m=length (X) ; 3 i=f i n d s u b i n t e r v a l s ( x,x) ; 4 G=zeros (m, 1 ) ; 5 for j =1:m 6 k=i ( j ) ; 7 t=x( j ) x ( k ) ; 8 G( j ) =[1, t, t ˆ2, t ˆ 3 ] C( k, : ) ; 9 end 1.2 A two point boundary value problem Consider the simple two point boundary value problem u (x) = f(x), x [0, 1], u(0) = 0, u(1) = 0, (1.22) where f is a given continuous function on [0, 1] and u is an unknown function. This problem is also known as the one-dimensional (1D) Poisson problem. In principle it is easy to solve (1.22) exactly. We just integrate f twice and determine the two integration constants so that the homogeneous boundary conditions u(0) = u(1) = 0 are satisfied. For example, if f(x) = 1 then u(x) = x(x 1)/2 is the solution. Suppose f cannot be integrated exactly. Problem (1.22) can then be solved approximately using the finite difference method. We need a difference approximation to the second derivative. If g is a function differentiable at x then g (x) = lim h 0 g(x + h 2 ) g(x h 2 ) h and applying this to a function u that is twice differentiable at x u u (x + h 2 (x) = lim ) u (x h 2 ) u(x+h) u(x) h = lim h 0 h h 0 u(x + h) 2u(x) + u(x h) = lim h 0 h 2. u(x) u(x h) h h To define the points where this difference approximation is used we choose a positive integer m, let h := 1/(m+1) be the discretization parameter, and replace the interval [0, 1] by grid points x j := jh for j = 0, 1,..., m + 1. We then obtain

56 44 Chapter 1. Diagonally dominant tridiagonal matrices; three examples approximations v j to the exact solution u(x j ) for j = 1,..., m by replacing the differential equation by the difference equation v j 1 + 2v j v j+1 h 2 = f(jh), j = 1,..., m, v 0 = v m+1 = 0. Moving the h 2 factor to the right hand side this can be written as an m m linear system v 1 f(h). T v = v 2 f(2h) 0. = h v m 1 f ( (m 1)h ) =: b. v m f(mh) (1.23) The matrix T is called the second derivative matrix and will occur frequently in this book. It is our second example of a tridiagonal matrix, T = tridiag(a i, d i, c i ) R m m, where in this case a i = c i = 1 and d i = 2 for all i Diagonal dominance We want to show that (1.23) has a unique solution. Note that T is not strictly diagonally dominant. However, T is weakly diagonally dominant in accordance with the following definition. Definition 1.23 (Diagonal dominance) The matrix A = [a ij ] C n n is weakly diagonally dominant if a ii j i a ij, i = 1,..., n. (1.24) We showed in Theorem 1.6 that a strictly diagonally dominant matrix is nonsingular. This is in general not true in the weakly diagonally dominant case. Consider the 3 matrices A 1 = 1 2 1, A 2 = 0 0 0, A 3 = They are all weakly diagonally dominant, but A 1 and A 2 are singular, while A 3 is nonsingular. Indeed, for A 1 column two is the sum of columns one and three, A 2 has a zero row, and det(a 3 ) = 4 0. It follows that for the nonsingularity and existence of an LU-factorization we need some conditions in addition to weak diagonal dominance. Here are some sufficient conditions..

57 1.2. A two point boundary value problem 45 Theorem 1.24 (Weak diagonal dominance) Suppose A = tridiag(a i, d i, c i ) C n n is tridiagonal and weakly diagonally dominant. If in addition d 1 > c 1 and a i 0 for i = 1,..., n 2, then A has a unique LU factorization (1.15). If in addition d n 0, then A is nonsingular. Proof. The proof is similar to the proof of Theorem 1.6. The matrix A has an LU factorization if the u k s in (1.16) are nonzero for k = 1,..., n 1. For this it is sufficient to show by induction that u k > c k for k = 1,..., n 1. By assumption u 1 = d 1 > c 1. Suppose u k > c k for some 1 k n 2. Then c k / u k < 1 and by (1.16) and since a k 0 u k+1 = d k+1 l k c k = d k+1 a kc k d k+1 a k c k > d k+1 a k. (1.25) u k u k This also holds for k = n 1 if a n 1 0. By (1.25) and weak diagonal dominance u k+1 > d k+1 a k c k+1 and it follows by induction that an LU factorization exists. It is unique since any LU factorization must satisfy (1.16). For the nonsingularity we need to show that u n 0. For then by Lemma 1.35, both L and U are nonsingular, and this is equivalent to A = LU being nonsingular. If a n 1 0 then by (1.16) u n > d n a n 1 0 by weak diagonal dominance, while if a n 1 = 0 then again by (1.25) u n d n > 0. Consider now the special system T v = b given by (1.23). The matrix T is weakly diagonally dominant and satisfies the additional conditions in Theorem Thus it is nonsingular and we can solve the system in O(n) arithmetic operations using the algorithms trifactor and trisolve. We could use the explicit inverse of T, given in Exercise 1.26, to compute the solution of T v = b as v = T 1 b. However this is not a good idea. In fact, all elements in T 1 are nonzero and the calculation of T 1 b requires O(n 2 ) operations Exercises for section 1.2 Exercise 1.25 (LU factorization of 2. derivative matrix) Show that T = LU, where L = , U = m 0 0 m 1 m 1 1 m+1 m m (1.26)

58 46 Chapter 1. Diagonally dominant tridiagonal matrices; three examples is the LU factorization of T. Exercise 1.26 (Inverse of the 2. derivative matrix) Let S R m m have elements s ij given by s i,j = s j,i = 1 m + 1 j( m + 1 i ), 1 j i m. (1.27) Show that ST = I and conclude that T 1 = S. Exercise 1.27 (Central difference approximation of 2. derivative) Consider δ 2 f(x) := f(x + h) 2f(x) + f(x h) h 2, h > 0, f : [x h, x + h] R. 1. Show using Taylor expansion that if f C 2 [x h, x + h] then for some η 2 δ 2 f(x) = f (η 2 ), x h < η 2 < x + h. 2. If f C 4 [x h, x + h] then for some η 4 δ 2 f(x) = f (x) + h2 12 f (4) (η 4 ), x h < η 4 < x + h. δ 2 f(x) is known as the central difference approximation to the second derivative at x. Exercise 1.28 (Two point boundary value problem) We consider a finite difference method for the two point boundary value problem u (x) + r(x)u (x) + q(x)u(x) = f(x), for x [a, b], u(a) = g 0, u(b) = g 1. (1.28) We assume that the given functions f, q and r are continuous on [a, b] and that q(x) 0 for x [a, b]. It can then be shown that (1.28) has a unique solution u. To solve (1.28) numerically we choose m N, h = (b a)/(m+1), x j = a+jh for j = 0, 1,..., m + 1 and solve the difference equation v j 1 + 2v j v j+1 h 2 with v 0 = g 0 and v m+1 = g 1. +r(x j ) v j+1 v j 1 +q(x j )v j = f(x j ), j = 1,..., m, (1.29) 2h

59 1.3. An Eigenvalue Problem 47 (a) Show that (1.29) leads to a tridiagonal linear system Av = b, where A = tridiag(a j, d j.c j ) R m m has elements a j = 1 h 2 r(x j), c j = 1 + h 2 r(x j), d j = 2 + h 2 q(x j ), and h 2 f(x 1 ) a 1 g 0, if j = 1, b j = h 2 f(x j ), if 2 j m 1, h 2 f(x m ) c m g 1, if j = m. (b) Show that the linear system satisfies the conditions in Theorem 1.24 if the spacing h is so small that h 2 r(x) < 1 for all x [a, b]. (c) Propose a method to find v 1,..., v m. Exercise 1.29 (Two point boundary value problem; computation) (a) Consider the problem (1.28) with r = 0, f = q = 1 and boundary conditions u(0) = 1, u(1) = 0. The exact solution is u(x) = 1 sinh x/ sinh 1. Write a computer program to solve (1.29) for h = 0.1, 0.05, 0.025, , and compute the error max 1 j m u(x j ) v j for each h. (b) Make a combined plot of the solution u and the computed points v j, j = 0,..., m + 1 for h = 0.1. (c) One can show that the error is proportional to h p for some integer p. Estimate p based on the error for h = 0.1, 0.05, 0.025, An Eigenvalue Problem Recall that if A C n n is a square matrix and Ax = λx for some nonzero x C n, then λ C is called an eigenvalue and x an eigenvector. We call (λ, x) an eigenpair of A The Buckling of a Beam Consider a horizontal beam of length L located between 0 and L on the x-axis of the plane. We assume that the beam is fixed at x = 0 and x = L and that a force F is applied at (L, 0) in the direction towards the origin. This situation can be modeled by the boundary value problem Ry (x) = F y(x), y(0) = y(l) = 0, (1.30)

60 48 Chapter 1. Diagonally dominant tridiagonal matrices; three examples where y(x) is the vertical displacement of the beam at x, and R is a constant defined by the rigidity of the beam. We can transform the problem to the unit interval [0, 1] by considering the function u : [0, 1] R given by u(t) := y(tl). Since u (t) = L 2 y (tl), the problem (1.30) then becomes u (t) = Ku(t), u(0) = u(1) = 0, K := F L2 R. (1.31) Clearly u = 0 is a solution, but we can have nonzero solutions corresponding to certain values of the K known as eigenvalues. The corresponding function u is called an eigenfunction. If F = 0 then K = 0 and u = 0 is the only solution, but if the force is increased it will reach a critical value where the beam will buckle and maybe break. This critical value corresponds to the smallest eigenvalue of (1.31). With u(t) = sin (πt) we find u (t) = π 2 u(t) and this u is a solution if K = π 2. It can be shown that this is the smallest eigenvalue of (1.31) and solving for F we find F = π2 R L. 2 We can approximate this eigenvalue numerically. Using the same finite difference approximation as in Section 1.2 we obtain v j 1 + 2v j v j+1 h 2 = Kv j, j = 1,..., m, h = 1 m + 1, v 0 = v m+1 = 0, where v j u(jh) for j = 0,..., m + 1. If we define λ := h 2 K then we obtain the equation T v = λv, with v = [v 1,..., v m ] T, (1.32) and T := tridiag m ( 1, 2, 1) is the matrix given by (1.23). The problem now is to determine the eigenvalues of T. Normally we would need a numerical method to determine the eigenvalues of a matrix, but for this simple problem the eigenvalues can be determined exactly. We show in the next subsection that the smallest eigenvalue of (1.32) is given by λ = 4 sin 2 (πh/2). Since λ = h 2 K = h2 F L 2 we can solve for F to obtain F = 4 sin2 (πh/2)r h 2 L 2. For small h this is a good approximation to the value π2 R L 2 Exercise 1.30 (Approximate force) Show that F = 4 sin2 (πh/2)r h 2 L 2 = π2 R L 2 + O(h2 ) The eigenpairs of the 1D test matrix R we computed above. The second derivative matrix T = tridiag( 1, 2, 1) is a special case of the tridiagonal matrix T 1 := tridiag(a, d, a) (1.33)

61 1.3. An Eigenvalue Problem 49 where a, d R. We call this the 1D test matrix. It is symmetric and strictly diagonally dominant if d > 2 a. We show that the eigenvectors are the columns of the sine matrix defined by S = [ sin jkπ ] m R m m. (1.34) m + 1 j,k=1 For m = 3, sin π 4 sin 2π 4 sin 3π 4 t 1 t S = [s 1, s 2, s 3 ] = sin 2π 4 sin 4π 4 sin 6π 4 = 1 0 1, t := 1. sin 3π 4 sin 6π 4 sin 9π t 1 t 2 4 Lemma 1.31 (Eigenpairs of 1D test matrix) Suppose T 1 = (t kj ) k,j = tridiag(a, d, a) R m m with m 2, a, d R, and let h = 1/(m + 1). 1. We have T 1 s j = λ j s j for j = 1,..., m, where s j = [sin(jπh), sin(2jπh),..., sin(mjπh)] T, (1.35) λ j = d + 2a cos(jπh). (1.36) 2. The eigenvalues are distinct and the eigenvectors are orthogonal Proof. We find for 1 < k < m (T 1 s j ) k = s T j s k = m + 1 δ j,k = 1 2 2h δ j,k, j, k = 1,..., m. (1.37) m t k,l sin(ljπh) = a [ sin ( (k 1)jπh ) + sin ( (k + 1)jπh )] + d sin(kjπh) l=1 = 2a cos(jπh) sin(kjπh) + d sin(kjπh) = λ j s k,j. This also holds for k = 1, m, and part 1 follows. Since jπh = jπ/(m + 1) (0, π) for j = 1,..., m and the cosine function is strictly monotone decreasing on (0, π) the eigenvalues are distinct, and since T 1 is symmetric it follows from Lemma 1.32 below that the eigenvectors s j are orthogonal. To finish the proof of (1.37) we compute m m s T j s j = sin 2 (kjπh) = sin 2 (kjπh) = 1 2 k=1 = m m k=0 k=0 cos(2kjπh) = m + 1, 2 m ( ) 1 cos(2kjπh) k=0

62 50 Chapter 1. Diagonally dominant tridiagonal matrices; three examples since the last cosine sum is zero. We show this by summing a geometric series of complex exponentials. With i = 1 we find m m m cos(2kjπh) + i sin(2kjπh) = e 2ikjπh = e2i(m+1)jπh 1 e 2ijπh 1 k=0 and (1.37) follows. k=0 k=0 = 0, Recall that the conjugate transpose of a matrix is defined by A := A T, where A is obtained from A by taking the complex conjugate of all elements. A matrix A C n n is Hermitian if A = A. A real symmetric matrix is Hermitian. Lemma 1.32 (Eigenpairs of a Hermitian matrix) The eigenvalues of a Hermitian matrix are real. Moreover, eigenvectors corresponding to distinct eigenvalues are orthogonal. Proof. Suppose A = A and Ax = λx with x 0. We multiply both sides of Ax = λx from the left by x and divide by x x to obtain λ = x Ax x x. Taking complex conjugates we find λ = λ = (x Ax) (x x) = x A x x x = x Ax x x = λ, and λ is real. Suppose that (λ, x) and (µ, y) are two eigenpairs for A with µ λ. Multiplying Ax = λx by y gives λy x = y Ax = (x A y) = (x Ay) = (µx y) = µy x, using that µ is real. Since λ µ it follows that y x = 0, which means that x and y are orthogonal. Exercise 1.33 (Eigenpairs T of order 2) Compute directly the eigenvalues and eigenvectors for T when n = 2 and thus verify Lemma 1.31 in this case. 1.4 Block Multiplication and Triangular Matrices Block multiplication is a powerful and essentiasl tool for dealing with matrices. It will be used extensively in this book. We will also need some basic facts about triangular matrices Block multiplication A rectangular matrix A can be partitioned into submatrices by drawing horizontal lines between selected rows and vertical lines between selected columns. For

63 1.4. Block Multiplication and Triangular Matrices 51 example, the matrix can be partitioned as (i) [ ] A11 A 12 = A 21 A 22 (iii) a T 1: a T 2: = a T 3: A = ,, (ii) [ ] a :1, a :2, a :3 = (iv) [ ] A 11, A 12 = In (i) the matrix A is divided into four submatrices A 11 = [ 1 ], A 12 = [ 2, 3 ], A 21 = [ 4 7], and A 22 = [ ] 5 6, 8 9 while in (ii) and (iii) A has been partitioned into columns and rows, respectively. The submatrices in a partition are often referred to as blocks and a partitioned matrix is sometimes called a block matrix. In the following we assume that A C m p and B C p n. Here are some rules and observations for block multiplication. 1. If B = [ b :1,..., b :n ] is partitioned into columns then the partition of the product AB into columns is AB = [ Ab :1, Ab :2,..., Ab :n ]. In particular, if I is the identity matrix of order p then A = AI = A [ e 1, e 2,..., e p ] = [ Ae1, Ae 2,..., Ae p ] and we see that column j of A can be written Ae j for j = 1,..., p. 2. Similarly, if A is partitioned into rows then a T 1: a T 2: AB =. B = a T m: a T 1:B a T 2:B. a T m:b and taking A = I it follows that row i of B can be written e T i B for i = 1,..., m.,.,

64 52 Chapter 1. Diagonally dominant tridiagonal matrices; three examples 3. It is often useful to write the matrix-vector product Ax as a linear combination of the columns of A Ax = x 1 a :1 + x 2 a :2 + + x p a :p. 4. If B = [ B 1, B 2 ], where B1 C p r and B 2 C p (n r) then A [ B 1, B 2 ] = [ AB1, AB 2 ]. This follows from Rule 1. by an appropriate grouping of columns. [ ] A1 5. If A =, where A A 1 C k p and A 2 C (m k) p then 2 [ ] A1 B = A 2 [ ] A1 B. A 2 B This follows from Rule 2. by a grouping of rows. 6. If A = [ [ ] ] B1 A 1, A 2 and B =, where A B 1 C m s, A 2 C m (p s), 2 B 1 C s n and B 2 C (p s) n then [ ] [ ] B A1, A 1 2 = [ ] A B 1 B 1 + A 2 B 2. 2 Indeed, (AB) ij = p k=1 a ikb kj = s k=1 a ikb kj + p k=s+1 a ikb kj = (A 1 B 1 ) ij + (A 2 B 2 ) ij = (A 1 B 1 + A 2 B 2 ) ij. [ ] [ ] A11 A 7. If A = 12 B11 B and B = 12 then A 21 A 22 B 21 B 22 [ ] [ ] [ ] A11 A 12 B11 B 12 A11 B = 11 + A 12 B 21 A 11 B 12 + A 12 B 22, A 21 A 22 B 21 B 22 A 21 B 11 + A 22 B 21 A 21 B 12 + A 22 B 22 provided the vertical partition in A matches the horizontal one in B, i.e. the number of columns in A 11 and A 21 equals the number of rows in B 11 and B 12 and the number of columns in A equals the number of rows in B. To show this we use Rule 4. to obtain [[ ] [ ] [ ] [ ]] A11 A AB = 12 B11 A11 A, 12 B12. A 22 B 21 A 22 B 22 A 21 A 21 We complete the proof using Rules 5. and Consider finally the general case. If all the matrix products A ik B kj in s C ij = A ik B kj, k=1 i = 1,..., p, j = 1,..., q

65 1.4. Block Multiplication and Triangular Matrices 53 are well defined then A 11 A 1s B 11 B 1q.... A p1 A ps B s1 B sq The requirements are that = C 11 C 1q... C p1 C pq the number of columns in A is equal to the number of rows in B. the position of the vertical partition lines in A has to mach the position of the horizontal partition lines in B. The horizontal lines in A and the vertical lines in B can be anywhere Triangular matrices We need some basic facts about triangular matrices and we start with Lemma 1.34 (Inverse of a block triangular matrix) Suppose [ ] A11 A A = 12 0 A 22 where A, A 11 and A 22 are square matrices. Then A is nonsingular if and only if both A 11 and A 22 are nonsingular. In that case [ ] A 1 A 1 11 C = 0 A 1, (1.38) 22 for some matrix C. Proof. Suppose A is nonsingular. We partition B := A 1 conformally with A and have [ ] [ ] [ ] B11 B BA = 12 A11 A 12 I 0 = = I B 21 B 22 0 A 22 0 I Using block-multiplication we find B 11 A 11 = I, B 21 A 11 = 0, B 21 A 12 + B 22 A 22 = I, B 11 A 12 + B 12 A 22 = 0. The first equation implies that A 11 is nonsingular, this in turn implies that B 21 = 0A 1 11 = 0 in the second equation, and then the third equation simplifies to B 22 A 22 = I. We conclude that also A 22 is nonsingular. From the fourth equation we find B 12 = C = A 1 11 A 12A 1 22.

66 54 Chapter 1. Diagonally dominant tridiagonal matrices; three examples Conversely, if A 11 and A 22 are nonsingular then [ A 1 11 A 1 11 A 12A 1 ] [ ] 22 A11 A 12 0 A 1 = 22 0 A 22 and A is nonsingular with the indicated inverse. Consider now a triangular matrix. [ ] I 0 = I 0 I Lemma 1.35 (Inverse of a triangular matrix) An upper (lower) triangular matrix A = [a ij ] C n n is nonsingular if and only if the diagonal elements a ii, i = 1,..., n are nonzero. In that case the inverse is upper (lower) triangular with diagonal elements a 1 ii, i = 1,..., n. Proof. We use induction on n. The result holds for n = 1. The 1-by-1 matrix A = [a 11 ] is nonsingular if and only if a 11 0 and in that case A 1 = [a 1 11 ]. Suppose the result holds for n = k and let A C (k+1) (k+1) be upper triangular. We partition A in the form [ ] Ak a A = k 0 a k+1,k+1 and note that A k C k k is upper triangular. By Lemma 1.34 A is nonsingular if and only if A k and (a k+1,k+1 ) are nonsingular and in that case [ A A 1 1 ] k c = 0 a 1, k+1,k+1 for some c C n. By the induction hypothesis A k is nonsingular if and only if the diagonal elements a 11,..., a kk of A k are nonzero and in that case A 1 k is upper triangular with diagonal elements a 1 ii, i = 1,..., k. The result for A follows. Lemma 1.36 (Product of triangular matrices) The product C = AB = (c ij ) of two upper (lower) triangular matrices A = (a ij ) and B = (b ij ) is upper (lower) triangular with diagonal elements c ii = a ii b ii for all i. Proof. Exercise. A matrix is called unit triangular if it is triangular with 1 s on the diagonal. Lemma 1.37 (Unit triangular matrices) For a unit upper (lower) triangular matrix A C n n :

1. A is nonsingular and the inverse is unit upper (lower) triangular.

2. The product of two unit upper (lower) triangular matrices is unit upper (lower) triangular.

Proof. 1. follows from Lemma 1.35, while Lemma 1.36 implies 2.

Exercises for section 1.4

Exercise 1.38 (Matrix element as a quadratic form) For any matrix A show that a_{ij} = e_i^T A e_j for all i, j.

Exercise 1.39 (Outer product expansion of a matrix) For any matrix A ∈ C^{m x n} show that A = Σ_{i=1}^{m} Σ_{j=1}^{n} a_{ij} e_i e_j^T.

Exercise 1.40 (The product A^T A) Let B = A^T A. Explain why this product is defined for any matrix A. Show that b_{ij} = a_{:i}^T a_{:j} for all i, j.

Exercise 1.41 (Outer product expansion) For A ∈ R^{m x n} and B ∈ R^{p x n} show that AB^T = a_{:1} b_{:1}^T + a_{:2} b_{:2}^T + ... + a_{:n} b_{:n}^T. This is called the outer product expansion of the columns of A and B.

Exercise 1.42 (System with many right hand sides; compact form) Suppose A ∈ R^{m x n}, B ∈ R^{m x p}, and X ∈ R^{n x p}. Show that AX = B if and only if A x_{:j} = b_{:j} for j = 1, ..., p.

Exercise 1.43 (Block multiplication example) Suppose A = [A_1, A_2] and B = [B_1; 0]. When is AB = A_1 B_1?

Exercise 1.44 (Another block multiplication example) Suppose A, B, C ∈ R^{n x n} are given in block form by

    A := [lambda  a^T; 0  A_1],   B := [1  0^T; 0  B_1],   C := [1  0^T; 0  C_1],

where A_1, B_1, C_1 ∈ R^{(n-1) x (n-1)}. Show that

    CAB = [lambda  a^T B_1; 0  C_1 A_1 B_1].
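Identities like the one in Exercise 1.44 are easy to check numerically before proving them. A minimal MATLAB sketch with random blocks (our own illustration):

    n = 5; lambda = 2; a = randn(n-1,1);
    A1 = randn(n-1); B1 = randn(n-1); C1 = randn(n-1);
    A = [lambda, a'; zeros(n-1,1), A1];
    B = blkdiag(1, B1);
    C = blkdiag(1, C1);
    norm(C*A*B - [lambda, a'*B1; zeros(n-1,1), C1*A1*B1])   % zero up to rounding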

1.5 Review Questions

How do we define nonsingularity of a matrix?
Define the second derivative matrix T. How did we show that it is nonsingular?
Why do we not use the explicit inverse of T to solve the linear system Tx = b?
What are the eigenpairs of the matrix T?
Why are the diagonal elements of a Hermitian matrix real?
Is the matrix [1, 1+i; 1+i, 2] Hermitian? Symmetric?
Is a weakly diagonally dominant matrix nonsingular? Is a strictly diagonally dominant matrix always nonsingular?
Does a tridiagonal matrix always have an LU factorization?
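For the Hermitian question above, note that in MATLAB the operator ' is the conjugate transpose and .' the plain transpose:

    A = [1, 1+1i; 1+1i, 2];
    isequal(A, A.')   % true:  A is symmetric
    isequal(A, A')    % false: A is not Hermitian, since conj(1+1i) = 1-1i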

Chapter 2

Gaussian elimination and LU Factorizations

Carl Friedrich Gauss (left), Myrick Hascall Doolittle (right).

Numerical methods for solving systems of linear equations are often based on writing a matrix as a product of simpler matrices. Such a factorization is useful if the corresponding matrix problem for each of the factors is simple to solve, and no extra numerical stability issues are introduced. An example of a factorization was encountered in Chapter 1, where we saw how an LU factorization can be used to solve certain tridiagonal systems in O(n) operations. Other factorizations, based on unitary matrices, will be considered later in this book. In this chapter we first consider Gaussian elimination. Gaussian elimination leads to an LU factorization of the coefficient matrix or, more generally, to a PLU factorization if row interchanges are introduced. Here P is a permutation matrix.

70 58 Chapter 2. Gaussian elimination and LU Factorizations We also consider the general theory of LU factorizations. 2.1 Gaussian Elimination and LU-factorization by 3 example Gaussian elimination with row interchanges is the classical method for solving n linear equations in n unknowns 9. We first recall how it works on a 3 3 system. Example 2.1 (Gaussian elimination on a 3 3 system) Consider a nonsingular system of three equations in three unknowns: a (1) 11 x 1 + a (1) 12 x 2 + a (1) 13 x 3 = b (1) 1, I a (1) 21 x 1 + a (1) 22 x 2 + a (1) 23 x 3 = b (1) 2, II a (1) 31 x 1 + a (1) 32 x 2 + a (1) 33 x 3 = b (1) 3. III. To solve this system by Gaussian elimination suppose a (1) We subtract l (1) 21 := a (1) 21 /a(1) 11 times equation I from equation II and l(1) 31 := a (1) 31 /a(1) 11 times equation I from equation III. The result is a (1) 11 x 1 + a (1) 12 x 2 + a (1) 13 x 3 = b (1) 1, I a (2) 22 x 2 + a (2) 23 x 3 = b (2) 2, II a (2) 32 x 2 + a (2) 33 x 3 = b (2) 3, III, where b (2) i = b (1) i l (1) i1 b(1) i for i = 2, 3 and a (2) ij = a (1) ij l (1) i,1 a1 1j a (1) 11 a (1) 21 for i, j = 2, 3. If = 0 and a(1) 21 0 we first interchange equation I and equation II. If a(1) 11 = = 0 we interchange equation I and III. Since the system is nonsingular the first column cannot be zero and an interchange is always possible. If a (2) 22 0 we subtract l(2) 32 := a(2) 32 /a(2) 22 times equation II from equation III to obtain a (1) 11 x 1 + a (1) 12 x 2 + a (1) 13 x 3 = b (1) 1, I a (2) 22 x 2 + a (2) 23 x 3 = b (2) 2, II a (3) 33 x 3 = b (3) 3, III, where a (3) 33 = a(2) 33 l(2) 32 a2 23 and b (3) 3 = b (2) 3 l (2) 32 b(2) 2. If a (2) 22 = 0 then a(2) 32 0 (cf. Section 2.4) and we first interchange equation II and equation III. The reduced 9 The method was known long before Gauss used it in It was further developed by Doolittle in 1881, see [7].

71 2.1. Gaussian Elimination and LU-factorization 59 1 k-1 k n 1 k-1 k n A 1 A 2 A k A n Figure 2.1: Gaussian elimination system is easy to solve since it is upper triangular. Starting from the bottom and moving upwards we find x 3 = b (3) 3 /a(3) 33 x 2 = (b (2) 2 a (2) 23 x 3)/a (2) 22 x 1 = (b (1) 1 a (1) 12 x 2 a (1) 13 x 3)/a (1) 11. This is known as back substitution. If a (k) kk 0, k = 1, 2 then LU := l (1) a (1) 11 a (1) 12 a (1) a (2) l (1) 31 l (2) 22 a (2) a 33 a (1) 11 a (1) 12 a (1) 13 = l (1) 21 a(1) 11 l (1) 21 a(1) 12 + a(2) 22 l (1) 21 a(1) 13 + a(2) 23 = l (1) 31 a(1) 31 l (1) 31 a(1) 12 + l(2) 32 a(2) 22 l (1) 31 a(1) 13 + l(2) 32 a(2) 23 + a(3) 33 a (1) 11 a (1) 12 a (1) 13 a (1) 21 a (1) 22 a (1) 23 a (1) 31 a (1) 32 a (1) 33 Thus Gaussian elimination leads to an LU factorization of the coefficient matrix A (1) (cf. the proof of Theorem 2.5) Gauss and LU In Gaussian elimination without row interchanges we start with a linear system Ax = b and generate a sequence of equivalent systems A (k) x = b (k) for k = 1,..., n, where A (1) = A, b (1) = b, and A (k) has zeros under the diagonal in its first k 1 columns. Thus A (n) is upper triangular and the system A (n) x = b (n) is easy to solve. The process is illustrated in Figure 2.1.
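The whole process, for a full matrix and without row interchanges, can be summarized by the short MATLAB sketch below (our own illustration; it builds L and U explicitly and assumes that every pivot U(k,k) it meets is nonzero; the componentwise update appears in (2.2) below).

    function [L,U] = naivelu(A)
    % naivelu: LU factorization by Gaussian elimination without row interchanges.
    % Assumes all pivots encountered are nonzero.
    n = size(A,1);
    L = eye(n); U = A;
    for k = 1:n-1
        for i = k+1:n
            L(i,k) = U(i,k)/U(k,k);                 % multiplier
            U(i,k:n) = U(i,k:n) - L(i,k)*U(k,k:n);  % eliminate below the pivot
        end
    end
    end

Then A = L*U up to rounding.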

72 60 Chapter 2. Gaussian elimination and LU Factorizations The matrix A (k) takes the form a (1) 1,1 a (1) 1,k 1 a (1) 1,k a (1) 1,j a (1) 1,n a (k 1) k 1,k 1 a (k 1) k 1,k a (k 1) k 1,j a (k 1) k 1,n A (k) a (k) k,k a (k) k,j a (k) k,n =... a (k) i,k a (k) i,j a (k) i,n... a (k) n,k a (k) n,j a (k) n,n. (2.1) The process transforming A (k) into A (k+1) for k = 1,..., n 1 can be described as follows. for i = k + 1 : n l (k) ik = a(k) ik /a(k) kk for j = k : n a (k+1) ij = a (k) ij l (k) ik a(k) kj (2.2) For j = k it follows from (2.2) that a (k+1) ik = a (k) ik a(k) ik a (k) kk a (k) kk = 0 for i = k + 1,..., n. Thus A (k+1) will have zeros under the diagonal in its first k columns and the elimination is carried one step further. The numbers l (k) ik in (2.2) are called multipliers. To characterize matrices for which Gaussian elimination with no row interchanges is possible we start with a definition. Definition 2.2 (Principal submatrix) For k = 1,..., n the matrices A [k] C k k given by A [k] := A(1 : k, 1 : k) = a 11 a k1.. a k1 a kk are called the leading principal submatrices of A C n n. More generally, a matrix B C k k is called a principal submatrix of A if B = A(r, r), where r = [r 1,..., r k ] for some 1 r 1 < < r k n. Thus, b i,j = a ri,r j, i, j = 1,..., k.

73 2.1. Gaussian Elimination and LU-factorization 61 The determinant of a (leading) principal submatrix is called a (leading) principal minor. A principal submatrix is leading if r j = j for j = 1,..., k. Also a principal submatrix is special in that it uses the same rows and columns of A. For k = 1 The only principal submatrices of order k = 1 are the diagonal elements of A. Example 2.3 (Principal submatrices) ] The principal submatrices of A = are [ The leading principal submatrices are [1], [5], [9], [ ], [ ], [ ], A. [1], [ ], A. Gaussian elimination with no row interchanges is valid if and only if the pivots a (k) kk are nonzero for k = 1,..., n 1. Theorem 2.4 We have a (k) k,k 0 for k = 1,..., n 1 if and only if the leading principal submatrices A [k] of A are nonsingular for k = 1,..., n 1. Moreover det(a [k] ) = a (1) 11 a(2) 22 a(k), k = 1,..., n. (2.3) Proof. Let B k = A (k) k 1 be the upper left k 1 corner of A(k) given by (2.1). Observe that the elements of B k are computed from A by using only elements from A [k 1]. Since the determinant of a matrix does not change under the operation of subtracting a multiple of one row from another row the determinant of A [k] equals the product of diagonal elements of B k+1 and (2.3) follows. But then a (1) 11 a(k) kk 0 for k = 1,..., n 1 if and only if det(a [k]) 0 for k = 1,..., n 1, or equivalently A [k] is nonsingular for k = 1,..., n 1. Gaussian elimination is a way to compute the LU-factorization of the coefficient matrix. Theorem 2.5 Suppose A C n n and that A [k] is nonsingular for k = 1,..., n 1. Then Gaussian elimination with no row interchanges results in an LU-factorization of A. In particular A = LU, where 1 where the l (j) ij L =. l (1) l (1) n1 l (2) n2 1 and a (i) ij are given by (2.2). kk, U = a (1) 11 a (1) 1n...., (2.4) a n nn

74 62 Chapter 2. Gaussian elimination and LU Factorizations Proof. From (2.2) we have for all i, j l (k) ik a(k) kj = a(k) ij a (k+1) ij for k < min(i, j), and l (k) ij a(j) jj = a(j) ij for i > j. Thus for i j we find (LU) ij = n k=1 while for i > j (LU) ij = l (k) ik u i 1 kj = n k=1 l (k) ik a(k) kj k=1 + a(i) ij = i 1 k=1 l (k) ik u j 1 kj = l (k) ik a(k) kj + l ija (j) jj k=1 ( a (k) ij = j 1 k=1 a (k+1) ) (i) ij + a ij = a(1) ij = a ij, ( a (k) ij (2.5) a (k+1) ) (j) ij + a ij = a ij. (2.6) Note that this Theorem holds even if A is singular. Since L is nonsingular the matrix U is then singular, and we must have a n nn = 0 when A is singular. 2.2 Banded Triangular Systems Once we have an LU-factorization of A the system Ax = b is solved in two steps. Since LUx = b we have Ly = b, where y := Ux. We first solve Ly = b, for y and then Ux = y for x Algorithms for triangular systems A nonsingular triangular linear system Ax = b is easy to solve. By Lemma 1.35 A has nonzero diagonal elements. Consider first the lower triangular case. For n = 3 the system is a a 21 a 22 0 a 31 a 32 a 33 x 1 x 2 x 3 = From the first equation we find x 1 = b 1 /a 11. Solving the second equation for x 2 we obtain x 2 = (b 2 a 21 x 1 )/a 22. Finally the third equation gives x 3 = (b 3 a 31 x 1 a 32 x 2 )/a 33. This process is known as forward substitution. In general x k = ( k 1 ) b k a k,j x j /akk, k = 1, 2,..., n. (2.7) j=1 When A is a lower triangular band matrix the number of arithmetic operations necessary to find x can be reduced. Suppose A is a lower triangular d-banded, b 1 b 2 b 3.

75 2.2. Banded Triangular Systems 63 a a 21 a a 32 a a 43 a 44 0, a 54 a 55 a a 21 a a 31 a 32 a a 42 a 43 a a 53 a 54 a 55 Figure 2.2: Lower triangular 5 5 band matrices: d = 1 (left) and d = 2 right. so that a k,j = 0 for j / {l k, l k + 1,..., k for k = 1, 2,..., n, and where l k := max(1, k d), see Figure 2.2. For a lower triangular d-band matrix the calculation in (2.7) can be simplified as follows k 1 x k = ( ) b k a k,j x j /akk, k = 1, 2,..., n. (2.8) j=l k Note that (2.8) reduces to (2.7) if d = n. Letting A(k, l k : (k 1)) x(l k : (k 1)) denote the sum k 1 j=l k a kj x j we arrive at the following algorithm, where the intial r in the name signals that this algorithm is row oriented. For each k we take the inner product of a part of a row with the already computed unknowns. Algorithm 2.6 (forwardsolve) Given a nonsingular lower triangular d-banded matrix A C n n and b C n. An x C n is computed so that Ax = b. 1 function x=r f o r w a r d s o l v e (A, b, d ) 2 n=length ( b ) ; x=b ; 3 x ( 1 )=b ( 1 ) /A( 1, 1 ) ; 4 for k=2:n 5 l k=max( 1, k d ) ; 6 x ( k )=(b ( k ) A( k, ( l k : ( k 1) ) x ( l k : ( k 1) ) ) /A( k, k ) ; 7 end A system Ax = b, where A is upper triangular must be solved by back substitution or bottom-up. We first find x n from the last equation and then move upwards for the remaining unknowns. For an upper triangular d-banded matrix this leads to the following algorithm.

76 64 Chapter 2. Gaussian elimination and LU Factorizations Algorithm 2.7 (backsolve) Given a nonsingular upper triangular d-banded matrix A C n n and b C n. An x C n is computed so that Ax = b. 1 function x=r b a c k s o l v e (A, b, d ) 2 n=length ( b ) ; x=b ; 3 x ( n )=b ( n ) /A( n, n ) ; 4 for k=n 1: 1:1 5 uk=min( n, k+d ) ; 6 x ( k )=(b ( k ) A( k, ( k+1) : uk ) x ( ( k+1) : uk ) ) /A( k, k ) ; 7 end Exercise 2.8 ( Column oriented backsolve) In this exercise we develop column oriented vectorized versions of forward and backward substitution. Consider the system Ax = b, where A C n n is lower triangular. Suppose after k 1 steps of the algorithm we have a reduced system in the form a k,k 0 0 a k+1,k a k+1,k a n,k a n n x k x k+1. x n = b k b k+1. b n. This system is of order n k + 1. The unknowns are x k,..., x n. a) We see that x k = b k /a k,k and eliminating x k from the remaining equations we obtain a system of order n k with unknowns x k+1,..., x n a k+1,k a k+2,k+1 a k+2,k a n,k+1 a n,n x k+1. x n = b k+1. b n x k a k+1,k. a n,k. Thus at the kth step, k = 1, 2,... n we set x k = b k /A(k, k) and update b as follows: b((k + 1) : n) = b((k + 1) : n) x(k) A((k + 1) : n, k). This leads to the following algorithm.

77 2.2. Banded Triangular Systems 65 Algorithm 2.9 (Forward Solve (column oriented)) Given a nonsingular lower triangular d-banded matrix A C n n and b C n. An x C n is computed so that Ax = b. 1 function x=c f o r w a r d s o l v e (A, b, d ) 2 x=b ; n=length ( b ) ; 3 for k=1:n 1 4 x ( k )=b ( k ) /A( k, k ) ; uk=min( n, k+d ) ; 5 b ( ( k+1) : uk )=b ( ( k+1) : uk ) A( ( k+1) : uk, k ) x ( k ) ; 6 end 7 x ( n )=b ( n ) /A( n, n ) ; 8 end b) Suppose now A C n n is nonsingular, upper triangular, d-banded, and b C n. Justify the following column oriented vectorized algorithms for solving Ax = b. Algorithm 2.10 (Backsolve (column oriented)) Given a nonsingular upper triangular d-banded matrix A C n n and b C n. An x C n is computed so that Ax = b. 1 function x=c b a c k s o l v e (A, b, d ) 2 x=b ; n=length ( b ) ; 3 for k=n : 1:2 4 x ( k )=b ( k ) /A( k, k ) ; l k=max( 1, k d ) ; 5 b ( l k : ( k 1) )=b ( l k : ( k 1) ) A( l k : ( k 1), k ) x ( k ) ; 6 end 7 x ( 1 )=b ( 1 ) /A( 1, 1 ) ; 8 end Exercise 2.11 (Computing the inverse of a triangular matrix) Suppose A C n n is a nonsingular upper triangular matrix with upper triangular inverse B = [b 1,..., b n ]. The kth column b k of B is the solution of the linear systems Ab k = e k. Show that b k (k) = 1/a(k, k) for k = 1,..., n, and explain why we can find b k by solving the linear systems A((k + 1):n, (k + 1):n)b k ((k + 1):n) = A((k + 1):n, k)b k (k), k = 1,..., n 1. (2.9) Is it possible to store the interesting part of b k in A as soon as it is computed? What if A is upper triangular? A(1:k, 1:k)b k (1:k) = I(1:k, k), k = n, n 1,..., 1, upper triangular (2.10)
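A usage sketch for the triangular solvers (assuming rforwardsolve, rbacksolve, cforwardsolve and cbacksolve above are implemented as listed): for a full triangular matrix one passes the bandwidth d = n-1.

    n = 6; d = n-1;
    L = tril(rand(n)) + n*eye(n);       % nonsingular lower triangular
    U = triu(rand(n)) + n*eye(n);       % nonsingular upper triangular
    b = rand(n,1);
    x = rforwardsolve(L,b,d);  norm(L*x - b)   % small residual
    y = rbacksolve(U,b,d);     norm(U*y - b)
    norm(x - cforwardsolve(L,b,d))             % row and column versions agree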

78 66 Chapter 2. Gaussian elimination and LU Factorizations Counting operations It is useful to have a number which indicates the amount of work an algorithm requires. In this book we measure this by estimating the total number of arithmetic operations. We count both additions, subtractions, multiplications and divisions, but not work on indices. As an example we show that the calculation to find the LU factorization of a full matrix of order n using Gaussian elimination is exactly N LU := 2 3 n3 1 2 n2 1 n. (2.11) 6 Let M, D, A, S be the number of multiplications, divisions, additions, and subtractions. In (2.2) the multiplications and subtractions occur in the calculation of a k+1 ij = a (k) ij l (k) ik a(k) kj which is carried out (n k) 2 times. Moreover, each calculation involves one subtraction and one multiplication. Thus we find M + S = 2 n 1 k=1 (n k)2 = 2 n 1 m=1 m2 = 2 3 n(n 1)(n 1 2 ). For each k there are n k divisions giving a sum of n 1 m=1 (n k) = 1 2n(n 1). Since there are no additions we obtain the total M + D + A + S = 2 3 n(n 1)(n 1 2 ) n(n 1) = N LU given by (2.11). We are only interested in N LU when n is large and for such n the term 2 3 n3 dominates. We therefore regularly ignore lower order terms and use number of operations both for the exact count and for the highest order term. We also say more loosely that the the number of operations is O(n 3 ). We will use the number of operations counted in one of these ways as a measure of the complexity of an algorithm and say that the complexity of LU factorization of a full matrix is O(n 3 ) or more precisely 2 3 n3. We will compare the number of arithmetic operations of many algorithms with the number of arithmetic operations of Gaussian elimination and define for n N the number G n as follows: Definition 2.12 (G n := 2 3 n3 ) We define G n := 2 3 n3. There is a quick way to arrive at the leading term 2n 3 /3. We only consider the operations contributing to this term. In (2.2) the leading term comes from the inner loop contributing to M + S. Then we replace sums by integrals letting the summation indices be continuous variables and adjust limits of integration in an insightful way to simplify the calculation. Thus, n 1 M + S = 2 (n k) 2 2 k=1 n 1 1 (n k) 2 dk 2 n 0 (n k) 2 dk = 2 3 n3

79 2.3. The LU and LDU Factorizations 67 and this is the correct leading term. Consider next N S, the number of forward plus backward substitutions. By (2.7) we obtain N S = 2 n (2k 1) 2 k=1 n 1 (2k 1)dk 4 n 0 kdk = 2n 2. The last integral actually give the exact value for the sum in this case (cf. (2.14)). We see that LU factorization is an O(n 3 ) process while solving a triangular system requires O(n 2 ) arithmetic operations. Thus, if n = 10 6 and one arithmetic operation requires c = seconds of computing time then cn 3 = 10 4 seconds 3 hours and cn 2 = 0.01 second, giving dramatic differences in computing time. Exercise 2.13 (Finite sums of integers) Use induction on m, or some other method, to show that m = 1 m(m + 1), (2.12) m 2 = 1 3 m(m + 1 )(m + 1), 2 (2.13) m 1 = m 2, (2.14) (m 1)m = 1 (m 1)m(m + 1). (2.15) 3 Exercise 2.14 (Multiplying triangular matrices) Show that the matrix multiplication AB can be done in 1 3 n(2n2 + 1) G n arithmetic operations when A R n n is lower triangular and B R n n is upper triangular. What about BA? 2.3 The LU and LDU Factorizations Gaussian elimination without row interchanges is one way of computing an LU factorization of a matrix. There are other ways that can be advantageous for certain kind of problems. Here we consider the general theory of LU factorizations. Recall that A = LU is an LU factorization of A C n n if L C n n is lower triangular and U C n n is upper triangular, i. e., l 1,1 0 u 1,1 u 1,n L =....., U = l n,1 l n,n 0 u n,n

80 68 Chapter 2. Gaussian elimination and LU Factorizations To find an LU factorization there is one equation for each of the n 2 elements in A, and L and U contain a total of n 2 + n unknown elements. There are several ways to restrict the number of unknowns to n 2. L1U: l ii = 1 all i, LU1: u ii = 1 all i, LDU: A = LDU, l ii = u ii = 1 all i, D = diag(d 11,..., d nn ) Existence and uniqueness Henry Jensen, (left), Prescott Durand Crout, Jensen worked on LU factorizations. His name is also associated with a very useful inequality (cf. Theorem 7.40). Consider the L1U factorization. Three things can happen. An L1U factorization exists and is unique, it exists, but it is not unique, or it does not exist. The 2 2 case illustrates this. Example 2.15 (L1U of 2 2 matrix) Let a, b, c, d C. An L1U factorization of A = [ a b [ ] a b = c d [ 1 0 l 1 1 ] c d must satisfy the equations ] [ ] [ ] u1 u 2 u1 u = 2 0 u 3 u 1 l 1 u 2 l 1 + u 3 for the unknowns l 1 in L and u 1, u 2, u 3 in U. The equations are u 1 = a, u 2 = b, al 1 = c, bl 1 + u 3 = d. (2.16) These equations do not always have a solution. Indeed, the main problem is the equation al 1 = c. There are essentially three cases 1. a 0: The matrix has a unique L1U factorization.

81 2.3. The LU and LDU Factorizations a = c = 0: The L1U factorization exists, but it is not unique. Any value for l 1 can be used. 3. a = 0, c 0: No L1U factorization exists. Consider the four matrices [ 2 1 A 1 := 1 2 ], A 2 := [ ] 0 1, A := [ ] 0 1, A := [ ] From the previous discussion it follows that A 1 has a unique L1U factorization, A 2 has no L1U factorization, A 3 has an L1U factorization but it is not unique, and A 4 has a unique L1U factorization even if it is singular. In preparation for the main theorem about LU factorization we prove a simple lemma. Lemma 2.16 (L1U of leading principal sub matrices) Suppose A = LU is an L1U factorization of A C n n. For k = 1,..., n let A [k], L [k], U [k] be the leading principal submatrices of A, L, U, respectively. Then A [k] = L [k] U [k] is an LU factorization of A [k] for k = 1,..., n. Proof. For k = 1,..., n 1 we partition A = LU as follows: [ ] [ ] [ ] [ ] A[k] B k L[k] 0 U = [k] S k L[k] U = [k] L [k] S k, (2.17) C k F k M k N k 0 T k M k U [k] M k S k + N k T k where F k, N k, T k C n k,n k. Comparing blocks we find A [k] = L [k] U [k]. Since L [k] is unit lower triangular and U [k] is upper triangular this is an L1U factorization of A [k]. The following theorem gives a necessary and sufficient condition for existence of a unique LU factorization. The conditions are the same for the three factorizations L1U, LU1 and LDU. Theorem 2.17 (LU Theorem) A square matrix A C n n has a unique L1U (LU1, LDU) factorization if and only if the leading principal submatrices A [k] of A are nonsingular for k = 1,..., n 1. Proof. Suppose A [k] is nonsingular for k = 1,..., n 1. Under these conditions Gaussian elimination gives an L1U factorization (cf. Theorem 2.5). We give another proof here that in addition to showing uniqueness also gives alternative ways to compute the L1U factorization. The proofs for the LU1 and LDU factorizations are similar and left as exercises.

82 70 Chapter 2. Gaussian elimination and LU Factorizations Figure 2.3: André-Louis Cholesky, (left), John von Neumann, We use induction on n to show that A has a unique L1U factorization. The result is clearly true for n = 1, since the unique L1U factorization of a 1 1 matrix is [a 11 ] = [1][a 11 ]. Suppose that A [n 1] has a unique L1U factorization A [n 1] = L n 1 U n 1, and that A [1],..., A [n 1] are nonsingular. By block multiplication [ ] A[n 1] c A = n = r T n a nn [ ] [ ] [ Ln 1 0 U n 1 u n Ln 1 U n 1 L n 1 u n l T = n 1 0 u nn l T n U n 1 l T n u n + u nn (2.18) if and only if A [n 1] = L n 1 U n 1 and l n, u n C n 1 and u nn C are determined from U T n 1l n = r n, L n 1 u n = c n, u nn = a nn l T n u n. (2.19) Since A [n 1] is nonsingular it follows that L n 1 and U n 1 are nonsingular and therefore l n, u n, and u nn are uniquely given. Thus (2.18) gives a unique L1U factorization of A. Conversely, suppose A has a unique L1U factorization A = LU. By Lemma 2.16 A [k] = L [k] U [k] is an L1U factorization of A [k] for k = 1,..., n 1. Suppose A [k] is singular for some k n 1. We will show that this leads to a contradiction. Let k be the smallest integer so that A [k] is singular. Since A [j] is nonsingular for j k 1 it follows from what we have already shown that A [k] = L [k] U [k] is the unique L1U factorization of A [k]. The matrix U [k] is singular since A [k] is singular and L [k] is nonsingular. By (2.17) we have U T [k]m T k = C T k. This can be written as n k linear systems for the columns of M T k. By assumption M T k exists, but since U T [k] is singular M k is not unique, a contradiction. By combining the last two equations in (2.19) we obtain with k = n ],

83 2.3. The LU and LDU Factorizations 71 U T k 1l k = r k, [ Lk 1 ] [ 0 uk l T k 1 u kk ] = This can be used in an algorithm to compute the L1U factorization. Moreover, if A is d-banded then the first k d components in r k and c k are zero so both L and U will be d-banded. Thus we can use the banded rforwardsolve Algorithm 2.6 to solve the lower triangular system U T k 1l k = r k for the kth row l T k in L and the kth column [ u k u kk ] in U for k = 2,..., n. This leads to the following algorithm to compute the L1U factorization of a d-banded matrix. The algorithm will fail if the conditions in the LU theorem are not satisfied. Algorithm 2.18 (L1U factorization) [ ck a kk ]. 1 function [ L,U]=L1U(A, d ) 2 n=length (A) ; 3 L=eye ( n, n ) ; U=zeros ( n, n ) ;U( 1, 1 )=A( 1, 1 ) ; 4 for k=2:n 5 km=max( 1, ( k d ) ) ; 6 L( k,km : ( k 1) )=r f o r w a r d s o l v e (U(km : ( k 1),km : ( k 1) ),A( k,km : ( k 1) ), d ) ; 7 U(km: k, k )=r f o r w a r d s o l v e (L(km: k,km: k ),A(km: k, k ), d ) ; 8 end For each k we essentially solve a lower triangular linear system of order d. Thus the number of arithmetic operation for this algorithm is O(d 2 n). Exercise 2.19 (# operations for banded triangular systems) Show that for 1 d n algorithms 2.18 requires exactly N LU (n, d) := (2d 2 + d)n (d 2 + d)(8d + 1)/6 = O(d 2 n) operations. 10. In particular, for a full matrix d = n 1 and we find N LU (n, n) = 2 3 n3 1 2 n2 1 6 n G n in agreement with the exact count (2.11) for Gaussian elimination, while for a tridiagonal matrix N LU (n, 1) = 3n 3 = O(n). Remark 2.20 (LU of upper triangular matrix) A matrix A C n n can have an LU factorization even if A [k] is singular for some k < n. By Theorem 3.3 such an LU factorization cannot be unique. An L1U factorization of an upper triangular matrix A is A = IA so it always exists even if A has zeros somewhere on the diagonal. By Lemma 1.35, if some a kk is zero then A [k] is singular and the L1U factorization is not unique. In particular, for the zero matrix any unit lower triangular matrix can be used as L in an L1U factorization. 10 Hint: Consider the cases 2 k d and d + 1 k n separately.
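A quick check of Algorithm 2.18 (assuming L1U and the banded rforwardsolve are implemented as listed): factor a tridiagonal matrix and measure the residual.

    n = 8; d = 1;                          % tridiagonal: bandwidth d = 1
    A = diag(4*ones(n,1)) + diag(ones(n-1,1),1) + diag(ones(n-1,1),-1);
    [L,U] = L1U(A,d);
    norm(L*U - A)                          % of order machine precision when the LU theorem applies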

84 72 Chapter 2. Gaussian elimination and LU Factorizations Exercise 2.21 (L1U and LU1) Show that the matrix A 3 in Example 2.15 has no LU1 or LDU factorization. Give an example of a matrix that has an LU1 factorization, but no LDU or L1U factorization. Exercise 2.22 (LU of nonsingular matrix) Show that the following is equivalent for a nonsingular matrix A C n n. 1. A has an LDU factorization. 2. A has an L1U factorization. 3. A has an LU1 factorization. Exercise 2.23 (Row interchange) Show that A = [ ] has a unique L1U factorization. Note that we have only interchanged rows in Example Exercise 2.24 (LU and determinant) Suppose A has an L1U factorization A = LU. Show that det(a [k] ) = u 11 u 22 u kk for k = 1,..., n. Exercise 2.25 (Diagonal elements in U) Suppose A C n n and A [k] is nonsingular for k = 1,..., n 1. Use Exercise 2.24 to show that the diagonal elements u kk in the L1U factorization are u 11 = a 11, u kk = det(a [k]), for k = 2,..., n. (2.20) det(a [k 1] ) Exercise 2.26 (Proof of LDU theorem) Give a proof of the LU theorem for the LDU case. Exercise 2.27 (Proof of LU1 theorem) Give a proof of the LU theorem for the LU1 case.

85 2.3. The LU and LDU Factorizations Block LU factorization Suppose A C n n is a block matrix of the form A 11 A 1m A :=.., (2.21) A m1 A mm where each diagonal block A ii is square. We call the factorization I U 11 U 1m L 21 I U 21 U 2m A = LU = L m1 L m,m 1 I U mm (2.22) a block L1U factorization of A. Here the ith diagonal blocks I and U ii in L and U have the same size as A ii, the ith diagonal block in A. Block LU1 and block LDU factorizations are defined similarly. The results for element-wise LU factorization carry over to block LU factorization as follows. Theorem 2.28 (Block LU theorem) Suppose A C n n is a block matrix of the form (2.21). Then A has a unique block LU factorization (2.22) if and only if the leading principal block submatrices A {k} := are nonsingular for k = 1,..., m 1. A 11 A 1k.. A k1 A kk Proof. Suppose A {k} is nonsingular for k = 1,..., m 1. Following the proof in Theorem 2.17 suppose A {m 1} has a unique block LU factorization A {m 1} = L {m 1} U {m 1}, and that A {1},..., A {m 1} are nonsingular. Then L {m 1} and U {m 1} are nonsingular and [ ] A{m 1} B A = C T A mm [ ] L{m 1} 0 [ U {m 1} L 1 = C T U 1 {m 1} B ] (2.23) {m 1} I 0 A mm C T U 1 {m 1} L 1 {m 1} B, is a block LU factorization of A. It is unique by derivation. Conversely, suppose A has a unique block LU factorization A = LU. Then as in Lemma 2.16 it is

86 74 Chapter 2. Gaussian elimination and LU Factorizations easily seen that A {k} = L {k} U {k} is the unique block LU factorization of A [k] for k = 1,..., m. The rest of the proof is similar to the proof of Theorem Remark 2.29 (Comparing LU and block LU) The number of arithmetic operations for the block LU factorization is the same as for the ordinary LU factorization. An advantage of the block method is that it combines many of the operations into matrix operations. Remark 2.30 (A block LU is not an LU) Note that (2.22) is not an LU factorization of A since the U ii s are not upper triangular in general. To relate the block LU factorization to the usual LU factorization we assume that each U ii has an LU factorization U ii = L ii Ũ ii. Then A = ˆLÛ, where ˆL := L diag( L 1 ii ) and Û := diag( L ii )U, and this is an ordinary LU factorization of A. Exercise 2.31 (Making block LU into LU) Show that ˆL is unit lower triangular and Û is upper triangular. 2.4 The PLU-Factorization Theorem 2.4 shows that Gaussian elimination can fail on a nonsingular system. A simple example is [ ] [ x1 x 2 ] = [ 1 1 ]. We show here that any nonsingular linear system can be solved by Gaussian elimination if we incorporate row interchanges Pivoting Interchanging two rows (and/or two columns) during Gaussian elimination is known as pivoting. The element which is moved to the diagonal position (k, k) is called the pivot element or pivot for short, and the row containing the pivot is called the pivot row. Gaussian elimination with row pivoting can be described as follows. 1. Choose r k k so that a (k) r k,k Interchange rows r k and k of A (k). 3. Eliminate by computing l (k) ik and a(k+1) ij using (2.2). To show that Gaussian elimination can always be carried to completion by using suitable row interchanges suppose by induction on k that A (k) is nonsingular. Since A (1) = A this holds for k = 1. By Lemma 1.34 the lower right diagonal block

87 2.4. The PLU-Factorization 75 in A (k) is nonsingular. But then at least one entry in the first column of that block must be nonzero and it follows that r k exists so that a (k) r k,k 0. But then A(k+1) is nonsingular since it is computed from A (k) using row operations preserving the nonsingularity. We conclude that A (k) is nonsingular for k = 1,..., n Permutation matrices Row interchanges can be described in terms of permutation matrices. Definition 2.32 A permutation matrix is a matrix of the form P = I(:, p) = [e i1, e i2,..., e in ] R n,n, where e i1,..., e in is a permutation of the unit vectors e 1,..., e n R n. Every permutation p = [i 1,..., i n ] T of the integers 1, 2,..., n gives rise to a permutation matrix and vice versa. Post-multiplying a matrix A by a permutation matrix results in a permutation of the columns, while pre-multiplying by a permutation matrix gives a permutation of the rows. In symbols AP = A(:, p), P T A = A(p, :). (2.24) Indeed, AP = (Ae i1,..., Ae in ) = A(:, p) and P T A = (A T P ) T = (A T (:, p)) T = A(p, :). Since P T P = I the inverse of P is equal to its transpose, P 1 = P T and P P T = I as well. We will use a particularly simple permutation matrix. Definition 2.33 We define a (j,k)-interchange matrix I jk by interchanging column j and k of the identity matrix. Since I jk = I kj, and we obtain the identity by applying I jk twice, we see that I 2 jk = I and an interchange matrix is symmetric and equal to its own inverse. Pre-multiplying a matrix by an interchange matrix interchanges two rows of the matrix, while post-multiplication interchanges two columns. We can keep track of the row interchanges using pivot vectors p k. We define p := p n, where p 1 := [1, 2,..., n] T, and p k+1 := I rk,kp k for k = 1,..., n 1. (2.25) We obtain p k+1 from p k by interchanging the entries r k and k in p k. In particular, since r k k, the first k 1 components in p k and p k+1 are the same. There is a close relation between the pivot vectors p k and the corresponding interchange matrices P k := I rk,k. Since P k I(p k, :) = I(P k p k, :) = I(p k+1, :) we obtain P T = P n 1 P 1 = I(p, :), P = P 1 P 2 P n 1 = I(:, p). (2.26)
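A small Matlab sketch may make the identities in (2.24) and (2.26) concrete. Everything below (the particular permutation p and the 4 by 4 test matrix) is chosen only for illustration and is not from the text.

    % Minimal sketch: P = I(:,p) and the identities AP = A(:,p), P'*A = A(p,:).
    p  = [3 1 4 2];                % an arbitrary permutation of 1:4
    I4 = eye(4);
    P  = I4(:,p);                  % permutation matrix built from p
    A  = magic(4);                 % any 4x4 matrix will do
    norm(A*P  - A(:,p))            % 0: post-multiplication permutes the columns
    norm(P'*A - A(p,:))            % 0: pre-multiplication permutes the rows
    norm(P'*P - I4)                % 0: P is orthogonal, so inv(P) = P'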

88 76 Chapter 2. Gaussian elimination and LU Factorizations Instead of interchanging the rows of A during elimination we can keep track of the ordering of the rows using the pivot vectors p k. The Gaussian elimination in Section 2.1 with row pivoting starting with a (1) ij = a ij can be described as follows: p = [1,..., n] T ; for k = 1 : n 1 choose r k k so that a (k) p rk,k 0. p = I rk,kp for i = k + 1 : n a (k) p i,k = a(k) p i,k /a(k) p k,k for j = k : n a (k+1) p i,j = a (k) p i,j a(k) p i,k a(k) p k,j (2.27) This leads to the following factorization: Theorem 2.34 Gaussian elimination with row pivoting on a nonsingular matrix A C n,n leads to the factorization A = P LU, where P is a permutation matrix, L is lower triangular with ones on the diagonal, and U is upper triangular. More explicitly, P = I(:, p), where p = I rn 1,n 1 I r1,1[1,..., n] T, and L = 1 a (1) p 2, a (1) p n,1 a (2) p n,2 1 a (1) p, U = 1,1 a (1) p 1,n.... a p (n) n,n. (2.28) Proof. The proof is analogous to the proof for LU-factorization without pivoting. From (2.27) we have for all i, j a (k) p i,k a(k) p k,j = a(k) p i,j a(k+1) p i,j for k < min(i, j), and a (k) p i,j a(j) p = j,j a(j) p i,j for i > j. Thus for i j we find (LU) ij = n i 1 l i,k u kj = a (k) p i,k a(k) p k,j + a(i) p i,j k=1 k=1 k=1 i 1 ( (k) = a p ) i,j a(k+1) (i) p i,j + a p = i,j a(1) p = a i,j p i,j = ( P T A ) ij,

while for $i > j$
$$(LU)_{ij} = \sum_{k=1}^{n} l_{ik}u_{kj} = \sum_{k=1}^{j-1} a^{(k)}_{p_i,k}\,a^{(k)}_{p_k,j} + a^{(j)}_{p_i,j}\,a^{(j)}_{p_j,j} = \sum_{k=1}^{j-1}\bigl(a^{(k)}_{p_i,j} - a^{(k+1)}_{p_i,j}\bigr) + a^{(j)}_{p_i,j} = a^{(1)}_{p_i,j} = a_{p_i,j} = \bigl(P^TA\bigr)_{ij}.$$

The PLU factorization can also be written $P^TA = LU$. This shows that for a nonsingular matrix there is a permutation of the rows of $A$ so that the permuted matrix has an LU factorization.

Pivot strategies

In (2.27) we did not say which of the nonzero elements we should choose. A common strategy is to choose the largest element in absolute value. This is known as partial pivoting:
$$\bigl|a^{(k)}_{r_k,k}\bigr| := \max\bigl\{\bigl|a^{(k)}_{i,k}\bigr| : k \le i \le n\bigr\},$$
with $r_k$ the smallest such index in case of a tie. The following example illustrates that small pivots should be avoided.

Example 2.35 Applying Gaussian elimination without row interchanges to the linear system
$$10^{-4}x_1 + 2x_2 = 4, \qquad x_1 + x_2 = 3$$
we obtain the upper triangular system
$$10^{-4}x_1 + 2x_2 = 4, \qquad (1 - 2\cdot 10^{4})\,x_2 = 3 - 4\cdot 10^{4}.$$
The exact solution is
$$x_2 = \frac{3 - 4\cdot 10^{4}}{1 - 2\cdot 10^{4}} \approx 2.0000, \qquad x_1 = (4 - 2x_2)\cdot 10^{4} \approx 1.0000.$$
Suppose we round the result of each arithmetic operation to three digits. The solutions $\mathrm{fl}(x_1)$ and $\mathrm{fl}(x_2)$ computed in this way are
$$\mathrm{fl}(x_2) = 2, \qquad \mathrm{fl}(x_1) = 0.$$

The computed value $0$ of $x_1$ is completely wrong. Suppose instead we apply Gaussian elimination to the same system, but where we have interchanged the two equations. The system is
$$x_1 + x_2 = 3, \qquad 10^{-4}x_1 + 2x_2 = 4$$
and we obtain the upper triangular system
$$x_1 + x_2 = 3, \qquad (2 - 10^{-4})\,x_2 = 4 - 3\cdot 10^{-4}.$$
Now the solution is computed as follows:
$$x_2 = \frac{4 - 3\cdot 10^{-4}}{2 - 10^{-4}} \approx 2, \qquad x_1 = 3 - x_2 \approx 1.$$
In this case rounding each calculation to three digits produces $\mathrm{fl}(x_1) = 1$ and $\mathrm{fl}(x_2) = 2$, which is quite satisfactory since it is the exact solution rounded to three digits.

Related to partial pivoting is scaled partial pivoting. Here $r_k$ is the smallest index such that
$$\frac{\bigl|a^{(k)}_{r_k,k}\bigr|}{s_{r_k}} := \max\Bigl\{\frac{\bigl|a^{(k)}_{i,k}\bigr|}{s_i} : k \le i \le n\Bigr\}, \qquad s_k := \max_{1\le j\le n}|a_{kj}|.$$
This can sometimes give more accurate results if the coefficient matrix has coefficients of wildly different sizes. Note that the scaling factors $s_k$ are computed using the initial matrix.

It is also possible to interchange both rows and columns. The choice
$$\bigl|a^{(k)}_{r_k,s_k}\bigr| := \max\bigl\{\bigl|a^{(k)}_{i,j}\bigr| : k \le i, j \le n\bigr\},$$
with $r_k, s_k$ the smallest such indices in case of a tie, is known as complete pivoting. Complete pivoting is known to be more numerically stable than partial pivoting, but it requires a lot of searching and is seldom used in practice.

Exercise 2.36 (Using PLU for $A^T$) Suppose we know the factors $P, L, U$ in a PLU factorization $A = PLU$ of $A \in \mathbb{R}^{n\times n}$. Explain how we can solve the system $A^Tx = b$ economically.

Exercise 2.37 (Using PLU for determinant) Suppose we know the factors $P, L, U$ in a PLU factorization $A = PLU$ of $A \in \mathbb{R}^{n\times n}$. Explain how we can use this to compute the determinant of $A$.
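As an illustration related to Exercises 2.36 and 2.37, the following Matlab sketch reuses computed PLU factors to solve both $Ax = b$ and $A^Tx = b$ by triangular solves only, and reads off the determinant. Note that Matlab's built-in lu returns the factorization in the form $PA = LU$, i.e. $A = P^TLU$, which is the convention $A = PLU$ of the text with the transposed permutation; the matrix and right hand side below are arbitrary examples.

    % Minimal sketch: reuse PLU factors for A*x = b and A'*y = b.
    A = [2 1 1; 4 3 3; 8 7 9];     % arbitrary nonsingular example
    b = [1; 2; 3];
    [L,U,P] = lu(A);               % Matlab convention: P*A = L*U, so A = P'*L*U
    % Solve A*x = b:  P'*L*U*x = b  <=>  L*(U*x) = P*b
    x = U\(L\(P*b));
    % Solve A'*y = b: A' = U'*L'*P, so U'*(L'*(P*y)) = b
    y = P'*(L'\(U'\b));
    % Determinant from the factors: det(A) = det(P')*prod(diag(U))
    dA = det(P)*prod(diag(U));
    disp(norm(A*x - b)); disp(norm(A'*y - b)); disp(abs(dA - det(A)))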

91 2.5. Review Questions 79 Exercise 2.38 (Using PLU for A 1 ) Suppose the factors P, L, U in a PLU factorization of A R n,n are known. Use Exercise 2.14 to show that it takes approximately 2G n arithmetic operations to compute A 1 = U 1 L 1 P T. Here we have not counted the final multiplication with P T which amounts to n row interchanges. 2.5 Review Questions When is a triangular matrix nonsingular? What is the general condition for Gaussian elimination without row interchanges to be well defined? What is the content of the LU theorem? Approximately how many arithmetic operations are needed for the multiplication of two square matrices? The LU factorization of a matrix? the solution of Ax = b, when A is triangular? What is a PLU factorization? When does it exist? What is complete pivoting?


Chapter 3
LDL* Factorization and Positive definite Matrices

Recall that a matrix $A \in \mathbb{C}^{n\times n}$ is Hermitian if $A^* = A$, i.e., $\overline{a_{ji}} = a_{ij}$ for all $i, j$. A real Hermitian matrix is symmetric. Note that the diagonal elements of a Hermitian matrix must be real.

3.1 The LDL* factorization

There are special versions of the LU factorization for Hermitian and positive definite matrices. The most important ones are

LDL*: an LDU factorization with $U = L^*$ and $D \in \mathbb{R}^{n\times n}$ a diagonal matrix,
LL*: an LU factorization with $U = L^*$ and $l_{ii} > 0$ for all $i$.

A matrix $A$ having an LDL* factorization must be Hermitian, since $D$ is real, so that $A^* = (LDL^*)^* = LD^*L^* = A$. The LL* factorization is called a Cholesky factorization.

Example 3.1 (LDL* of 2 by 2 Hermitian matrix) Let $a, d \in \mathbb{R}$ and $b \in \mathbb{C}$. An LDL* factorization of a $2\times 2$ Hermitian matrix must satisfy the equations
$$\begin{bmatrix} a & \bar b \\ b & d \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ l_1 & 1 \end{bmatrix}\begin{bmatrix} d_1 & 0 \\ 0 & d_2 \end{bmatrix}\begin{bmatrix} 1 & \bar l_1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} d_1 & d_1\bar l_1 \\ d_1 l_1 & d_1|l_1|^2 + d_2 \end{bmatrix}$$
for the unknowns $l_1$ in $L$ and $d_1, d_2$ in $D$. They are determined from
$$d_1 = a, \qquad a l_1 = b, \qquad d_2 = d - a|l_1|^2. \tag{3.1}$$
There are essentially three cases:

94 82 Chapter 3. LDL* Factorization and Positive definite Matrices 1. a 0: The matrix has a unique LDL* factorization. Note that d 1 and d 2 are real. 2. a = b = 0: The LDL* factorization exists, but it is not unique. Any value for l 1 can be used. 3. a = 0, b 0: No LDL* factorization exists. Lemma 2.16 carries over to the Hermitian case. Lemma 3.2 ( LDL* of leading principal sub matrices) Suppose A = LDL is an LDL* factorization of A C n n. For k = 1,..., n let A [k],l [k] and D [k] be the leading principal submatrices of A,L and D, respectively. Then A [k] = L [k] D [k] L [k] is an LDL* factorization of A [k] for k = 1,..., n. Proof. For k = 1,..., n 1 we partition A = LDL as follows: [ ] [ ] [ ] [ A[k] B A = k L[k] 0 D[k] 0 L = [k] M ] k B k F k M k N k 0 E k 0 N = LDU, (3.2) k where F k, N k, E k C n k,n k. Block multiplication gives A [k] = L [k] D [k] L [k]. Since L [k] is unit lower triangular and D [k] is real and diagonal this is an LDL* factorization of A [k]. Theorem 3.3 (LDL* theorem) The matrix A C n n has a unique LDL* factorization if and only if A = A and A [k] is nonsingular for k = 1,..., n 1. Proof. We essentially repeat the proof of Theorem 2.17 incorporating the necessary changes. Suppose A = A and that A [k] is nonsingular for k = 1,..., n 1. Note that A [k] = A [k] for k = 1,..., n. We use induction on n to show that A has a unique LDL* factorization. The result is clearly true for n = 1, since the unique LDL* factorization of a 1-by-1 matrix is [a 11 ] = [1][a 11 ][1] and a 11 is real since A = A. Suppose that A [n 1] has a unique LDL* factorization A [n 1] = L n 1 D n 1 L n 1, and that A [1],..., A [n 1] are nonsingular. By definition D n 1 is real. Using block multiplication [ ] [ ] [ ] [ ] A[n 1] a A = n Ln 1 0 Dn 1 0 L a = n 1 l n n a nn l n 1 0 d nn 0 1 [ Ln 1 D = n 1 L ] (3.3) n 1 L n 1 D n 1 l n l nd n 1 L n 1 l nd n 1 l n + d nn if and only if A [n 1] = L n 1 D n 1 L n 1, and a n = L n 1 D n 1 l n, a nn = l nd n 1 l n + d nn. (3.4)

Thus we obtain an LDL* factorization of $A$ that is unique since $L_{n-1}$ and $D_{n-1}$ are nonsingular. Also $d_{nn}$ is real since $a_{nn}$ and $D_{n-1}$ are real. For the converse we use Lemma 3.2 in the same way as Lemma 2.16 was used to prove Theorem 2.17.

Here is an analog of Algorithm 2.18 that tries to compute the LDL* factorization of a matrix. It uses the upper part of the matrix.

Algorithm 3.4 (LDL* factorization)

    function [L,dg] = ldl(A,d)
    % [L,dg] = ldl(A,d)
    n = length(A);
    L = eye(n,n); dg = zeros(n,1); dg(1) = A(1,1);
    for k = 2:n
        m = rforwardsolve(L(1:k-1,1:k-1), A(1:k-1,k), d);
        L(k,1:k-1) = (m./dg(1:k-1))';      % row k of L
        dg(k) = A(k,k) - L(k,1:k-1)*m;
    end

The number of arithmetic operations for the LDL* factorization is approximately $\frac12 G_n$, half the number of operations needed for the LU factorization. Indeed, in the L1U factorization we needed to solve two triangular systems to find the vectors $s$ and $m$, while only one such system is needed to find $l_n$ in the symmetric case (3.3). The work to find $d_{nn}$ is $O(n)$ and does not contribute to the highest order term.

Example 3.5 (A factorization) Is the factorization
$$\begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1/3 & 1 \end{bmatrix}\begin{bmatrix} 3 & 0 \\ 0 & 8/3 \end{bmatrix}\begin{bmatrix} 1 & 1/3 \\ 0 & 1 \end{bmatrix}$$
an LDL* factorization?

3.2 Positive Definite and Semidefinite Matrices

Real symmetric positive definite matrices occur often in scientific computing. In this section we study some properties of such matrices. We consider the complex case as well. The real non-symmetric case is considered in a separate section.

Definition 3.6 (Positive definite matrix) Suppose $A \in \mathbb{C}^{n\times n}$ is a square Hermitian matrix. The function $f: \mathbb{C}^n \to \mathbb{R}$ given by
$$f(x) = x^*Ax = \sum_{i=1}^n\sum_{j=1}^n a_{ij}\bar{x}_i x_j$$

96 84 Chapter 3. LDL* Factorization and Positive definite Matrices is called a quadratic form. Note that f is real valued. Indeed, f(x) = x Ax = (x Ax) = x A x = f(x) since A is Hermitian. We say that A is (i) positive definite if A = A and x Ax > 0 for all nonzero x C n. (ii) positive semidefinite if A = A and x Ax 0 for all x C n. (iii) negative (semi)definite if A = A and A is positive (semi)definite. We observe that 1. The zero-matrix is positive semidefinite, while the unit matrix is positive definite. 2. The matrix A is positive definite if and only if it is positive semidefinite and x Ax = 0 = x = A positive definite matrix A is nonsingular. For if Ax = 0 then x Ax = 0 and this implies that x = If A is real then it is enough to show definiteness for real vectors only. Indeed, if A R n n, A T = A and x T Ax > 0 for all nonzero x R n then z Az > 0 for all nonzero z C n. For if z = x + iy 0 with x, y R n then z Az = (x iy) T A(x + iy) = x T Ax iy T Ax + ix T Ay i 2 y T Ay = x T Ax + y T Ay and this is positive since at least one of the real vectors x, y is nonzero. Example 3.7 (Gradient and hessian) Symmetric positive definite matrices is important in nonlinear optimization. Consider (cf. (B.1)) the gradient f and hessian Hf of a function f : Ω R n R f(x) = f(x) x 1. f(x) x n Rn, Hf(x) = 2 f(x) x 1 x f(x) x n x f(x) x 1 x n. 2 f(x) x n x n Rn n. We assume that f has continuous first and second partial derivatives on Ω. Under suitable conditions on the domain Ω it is shown in advanced calculus texts that if f(x) = 0 and Hf(x) is symmetric positive definite then x is a local minimum for f. This can be shown using the second-order Taylor expansion (B.2). Moreover, x is a local maximum if f(x) = 0 and Hf(x) is negative definite.
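As a small numerical illustration of Example 3.7 (the function f below and its critical point are my own choices, not from the text), one can test a critical point by checking that the Hessian there is symmetric positive definite, here simply by computing its eigenvalues.

    % Minimal sketch: classify a critical point via the Hessian.
    % f(x,y) = x^2 + x*y + y^2 has gradient (2x+y, x+2y), so (0,0) is critical.
    H = [2 1; 1 2];                % Hessian of f at (0,0); constant for this f
    lambda = eig(H);               % eigenvalues 1 and 3, both positive
    if all(lambda > 0)
        disp('Hessian is symmetric positive definite: local minimum')
    end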

97 3.2. Positive Definite and Semidefinite Matrices 85 A matrix A C m n has rank n or equivalently linearly independent columns if and only if Ax = 0 = x = 0. The Euclidian norm on C n is given by y 2 := n y y = y j 2, y = [y 1,..., y n ] T C n. (3.5) j=1 Lemma 3.8 (The matrix A A) The matrix A A is positive semidefinite for any m, n N and A C m n. It is positive definite if and only if A has linearly independent columns. Proof. Clearly A A is Hermitian. The rest follows from the relation xa Ax = Ax 2 2 (cf. (3.5)). Clearly A A is positive semidefinite. We have xa Ax = 0 Ax 2 = 0 Ax = 0. This implies that x = 0 if A A either has linearly independent columns or is positive definite. Lemma 3.9 (T is symmetric positive definite) The second derivative matrix T = tridiag( 1, 2, 1) R n n is symmetric positive definite. Proof. Clearly T is symmetric. For any x R n x T T x = 2 n i=1 n 1 = i=1 n 1 x 2 i x i x i+1 i=1 n 1 x 2 i 2 i=1 n x i 1 x i i=2 n 1 x i x i+1 + i=1 n 1 = x x 2 n + (x i+1 x i ) 2. i=1 x 2 i+1 + x x 2 n Thus x T T x 0 and if x T T x = 0 then x 1 = x n = 0 and x i = x i+1 for i = 1,..., n 1 which implies that x = 0. Hence T is positive definite The Cholesky factorization. Recall that a principal submatrix B = A(r, r) C k k of a matrix A C n n has elements b i,j = a ri,r j for i, j = 1,..., k, where 1 r 1 < < r k n. It is a leading principal submatrix, denoted A [k] if r = [1, 2,..., k] T. Lemma 3.10 (Submatrices) Any principal sub matrix of a positive (semi)definite matrix is positive (semi) definite.

98 86 Chapter 3. LDL* Factorization and Positive definite Matrices Proof. Suppose the submatrix B = A(r, r) is defined by the rows and columns r = [r 1,..., r k ] T of A. Let X = [e r1,..., e rk ] C n k. Then B := X AX. Let y C k and set x := Xy C n. Then y By = y X AXy = x Ax. This is nonnegative if A is positive semidefinite and positive if A is positive definite and X has linearly independent columns. For then x is nonzero if y is nonzero. Theorem 3.11 (LDL* and LL*) The following is equivalent for a Hermitian matrix A C n n. 1. A is positive definite, 2. A has an LDL* factorization with positive diagonal elements in D, 3. A has a Cholesky factorization. If the Cholesky factorization exists it is unique. Proof. 1 = 2: Suppose A is positive definite. By Lemma 3.10 the leading principal submatrices A [k] C k k are positive definite and by Lemma 3.17 they are nonsingular for k = 1,..., n 1. By Theorem 2.17 A has a unique LDL* factorization A = LDL. The ith diagonal element in D is positive, d ii = e i De i = e i L 1 AL 1 e i = x i Ax i > 0. Indeed, x i := L 1 e i is nonzero since L 1 is nonsingular. 2 = 3: Suppose if A has an LDL* factorization A = LDL with positive diagonal elements d ii in D. Then A = SS, where S := LD 1/2 and D 1/2 := diag( d 11,..., d nn ), and this is a Cholesky factorization of A. 3 = 1: Suppose A has a Cholesky factorization A = LL. Since L has positive diagonal elements it is nonsingular and A is positive definite by Lemma 3.8. For uniqueness suppose LL = SS are two Cholesky factorizations of the positive definite matrix A. Since A is nonsingular both L and S are nonsingular. Then S 1 L = S L 1, where by Lemma 1.35 S 1 L is lower triangular and S L 1 is upper triangular, with diagonal elements l ii /s ii and s ii /l ii, respectively. But then both matrices must be equal to the same diagonal matrix and l 2 ii = s2 ii. By positivity l ii = s ii and we conclude that S 1 L = I = S L T which means that L = S. A Cholesky factorization can also be written in the equivalent form A = R R, where R = L is upper triangular with positive diagonal elements. Example 3.12 (2 2)

99 3.2. Positive Definite and Semidefinite Matrices 87 The matrix A = [ ] 2 1 = 1 2 [ ] 2 1 has an LDL*- and a Cholesky-factorization given by 1 2 [ ] [ ] [ ] = 0 1 [ ] [ ] 2 0 1/ 2 1/ /2 0 3/2 There are many good algorithms for finding the Cholesky factorization of a matrix, see [6]. The following version for finding the factorization of a matrix with bandwidth d uses the LDL* factorization algorithm 3.4. Only the upper part of A is used. The algorithm uses the Matlab command diag. Algorithm 3.13 (bandcholesky) 1 function L=bandcholesky (A, d ) 2 %L=b a n d c h o l e s k y (A, d ) 3 [ L, dg]=ldl(a, d ) ; 4 L=L diag ( sqrt ( dg ) ) ; As for the LDL* factorization the leading term in an operation count for a band matrix is O(d 2 n). When d is small this is a considerable saving compared to the count 1 2 G n = n 3 /3 for a full matrix Positive definite and positive semidefinite criteria Not all Hermitian matrices are positive definite, and sometimes we can tell just by glancing at the matrix that it cannot be positive definite. Here are some necessary conditions. We consider only the real case. Theorem 3.14 (Necessary conditions for positive definite) If A R n n is positive definite then for all i, j with i j 1. a ii > 0, 2. a ij < (a ii + a jj )/2, 3. a ij < a ii a jj. Proof. Clearly a ii = e T i Ae i > 0 and Part 1 follows. If αe i + βe j 0 then 0 < (αe i + βe j ) T A(αe i + βe j ) = α 2 a ii + β 2 a jj + 2αβa ij. (3.6) Taking α = 1, β = ±1 we obtain a ii + a jj ± 2a ij > 0 and this implies Part 2. Taking α = a ij, β = a ii in (3.6) we find 0 < a 2 ija ii + a 2 iia jj 2a 2 ija ii = a ii (a ii a jj a 2 ij).

100 88 Chapter 3. LDL* Factorization and Positive definite Matrices Since a ii > 0 Part 3 follows. Example 3.15 (Not positive definite) Consider the matrices [ ] 0 1 A 1 =, A = [ ] 1 2, A = [ ] Here A 1 and A 3 are not positive definite, since a diagonal element is not positive. A 2 is not positive definite since neither Part 2 nor Part 3 in Theroem 3.14 are satisfied. The matrix [ ] enjoys all the necessary conditions in in Theorem But to decide if it is positive definite it is nice to have sufficient conditions as well. We start by considering eigenvalues of a positive (semi)definite matrix. Lemma 3.16 (positive eigenvalues) A matrix is positive (semi)definite if and only if it is Hermitian and all its eigenvalues are positive (nonnegative). Proof. Suppose A is positive (semi)definite. Then A is Hermitian by definition, and if Ax = λx and x is nonzero, then x Ax = λx x. This implies that λ > 0( 0) since A is positive (semi)definite and x x = x 2 2 > 0. Conversely, suppose A C n n is Hermitian with positive (nonnegative) eigenvalues λ 1,..., λ n. By Theorem 5.38 (the spectral theorem) there is a matrix U C n n with U U = UU = I such that U AU = diag(λ 1,..., λ n ). Let x C n and define z := U x = [z 1,..., z n ] T C n. Then x = UU x = Uz and by the spectral theorem x Ax = z U AUz = z diag(λ 1,..., λ n )z = n λ j z j 2 0. It follows that A is positive semidefinite. Since U is nonsingular we see that z = U x is nonzero if x is nonzero, and therefore A is positive definite. j=1 Lemma 3.17 (positive semidefinite and nonsingular) A matrix is positive definite if and only if it is positive semidefinite and nonsingular. Proof. If A is positive definite then it is positive semidefinite and if Ax = 0 then x Ax = 0 which implies that x = 0. Conversely, if A is positive semidefinite then it is Hermitian with nonnegative eigenvalues. Since it is nonsingular all eigenvalues are positive showing that A is positive definite.
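In Matlab one can combine several of these criteria as quick numerical checks. The sketch below applies three of them (the eigenvalues of Lemma 3.16, the leading principal minors of Theorem 3.18 below, and the Cholesky factorization, which by Theorem 3.11 exists exactly for positive definite matrices) to the second derivative matrix T of Lemma 3.9; the size n = 5 is arbitrary.

    % Minimal sketch: three numerical tests of positive definiteness for
    % T = tridiag(-1,2,-1), which Lemma 3.9 shows is symmetric positive definite.
    n = 5;
    T = 2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);
    min(eig(T))                                    % smallest eigenvalue, positive
    minors = arrayfun(@(k) det(T(1:k,1:k)), 1:n)   % all leading minors positive
    [R,flag] = chol(T);                            % flag == 0 iff T is positive definite
    flag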

101 3.2. Positive Definite and Semidefinite Matrices 89 The following necessary and sufficient conditions can be used to decide if a matrix is positive definite. Theorem 3.18 (Positive definite characterization) The following statements are equivalent for a Hermitian matrix A C n n. 1. A is positive definite. 2. A has only positive eigenvalues. 3. All leading principal submatrices have a positive determinant. 4. A = BB for a nonsingular B C n n. Proof. 1 2: This follows from Lemma = 3: A positive definite matrix has positive eigenvalues, and since the determinant of a matrix equals the product of its eigenvalues (cf. Theorem 0.38) the determinant is positive. Every leading principal submatrix of a positive definite matrix is positive definite (cf. Lemma 3.10) and therefore has a positive determinant. 3 = 4: Since a leading principal submatrix has a positive determinant it is nonsingular and Theorem 2.17 implies that A has a unique LDL* factorization A = LDL. By Lemma 3.2 A [k] = L [k] D [k] L [k] is an LDL* factorization of A [k], k = 1,..., n. But then for all k det(a [k] ) = det(l [k] ) det(d [k] ) det(l [k]) = det(d [k] ) = d 11 d kk > 0. But then D has positive diagonal elements and we have A = BB, where B := LD 1/2. 4 = 1: This follows from Lemma 3.8. Example 3.19 (Positive definite characterization) [ ] 3 1 Consider the symmetric matrix A := We have x T Ax = 2x x (x 1 + x 2 ) 2 > 0 for all nonzero x showing that A is positive definite. 2. The eigenvalues of A are λ 1 = 2 and λ 2 = 4. They are positive showing that A is positive definite since it is symmetric. 3. We find det(a [1] ) = 3 and det(a [2] ) = 8 showing again that A is positive definite.

102 90 Chapter 3. LDL* Factorization and Positive definite Matrices 4. Finally A is positive definite since by Example 3.5 we have A = BB, B = [ 1 0 1/3 1 ] [ /3 ]. Exercise 3.20 (Positive definite characterizations) Show [ ] directly that all 4 characterizations in Theorem 3.18 hold for the matrix Semi-Cholesky factorization of a banded matrix A Hermitian positive semidefinite matrix has a factorization that is similar to the Cholesky factorization. Definition 3.21 (Semi-Cholesky factorization) A factorization A = LL where L is lower triangular with nonnegative diagonal elements is called a semi-cholesky factorization. Note that a semi-cholesky factorization of a positive definite matrix is necessarily a Cholesky factorization. For if A is positive definite then it is nonsingular and then L must be nonsingular. Thus the diagonal elements of L cannot be zero. Theorem 3.22 (Characterization, semi-cholesky factorization) A matrix A C n n has a semi-cholesky factorization A = LL if and only if it is symmetric positive semidefinite. Proof. If A = LL is a semi-cholesky factorization then x Ax = L x and A is positive semidefinite. For the converse we use induction on n. A positive semidefinite matrix of order one has a semi-cholesky factorization since the one and only element in A is nonnegative. Suppose any symmetric positive semidefinite matrix of order n 1 has a semi-cholesky factorization and suppose A C n n is symmetric positive semidefinite. We partition A as follows [ ] α v A =, α C, v C n 1, B C (n 1) (n 1). (3.7) v B There are two cases. Suppose first α = e 1Ae 1 > 0. We claim that C := B vv /α is symmetric positive semidefinite. C is symmetric. To show that C is positive

103 3.3. Semi-Cholesky factorization of a banded matrix 91 semidefinite we consider any y C n 1 and define x := [ v y/α, y ] C n. Then [ ] [ ] 0 x Ax = [ v y/α, y α v v ] y/α v B y [ ] = [0, (v y)v /α + y v B] y/α (3.8) y = (v y)(v y)/α + y By = y Cy, since (v y)v y = (v y) v y = y vv y. So C C (n 1) (n 1) is symmetric positive semidefinite and by the induction hypothesis it has a semi-cholesky factorization C = L 1 L 1. The matrix [ ] L β v := /β 0 L, β := α, (3.9) 1 is upper triangular with nonnegtive diagonal elements and [ ] [ ] [ ] LL β 0 β v = /β α v v/β L 1 0 L = = A 1 v B is a semi-cholesky factorization of A. If α = 0 then 11 v = 0. Moreover, B C (n 1) (n 1) in (3.7) is positive semidefinite and therefore [ has] a semi-cholesky factorization B = L 1 L 1. But then LL 0 0, where L = is a semi-cholesky factorization of A. Indeed, L 0 L 1 is lower triangular and [ ] [ ] [ ] LL = 0 L 1 0 L = = A. 1 0 B Recall that a matrix A is d-banded if a ij = 0 for i j > d. A (semi-) Cholesky factorization preserves bandwidth. Theorem 3.23 (Bandwidth semi-cholesky factor) The semi-cholesky factor L given by (3.9) has the same bandwidth as A. Proof. Suppose A C n n is d-banded. Then v = [u, 0 ] in (3.7), where u C d, and therefore C := B vv /α differs from B only in the upper left d d corner. It follows that C has the same bandwidth as B and A. By induction on 11 A semidefinite version of Part 3 in Lemma 3.14 shows that a ij a ii a jj. Thus, if a ii = 0 then a ij = 0 for all j.

$n$, $C = L_1L_1^*$, where $L_1$ has the same bandwidth as $C$. But then $L$ in (3.9) has the same bandwidth as $A$.

Consider now implementing an algorithm based on the previous discussion. Since $A$ is symmetric we only need to use the lower part of $A$. The first column of $L$ is $[\beta,\, v^*/\beta]^*$ if $\alpha > 0$. If $\alpha = 0$ then the first column of $A$ is zero (cf. the footnote in the proof of Theorem 3.22) and this is also the first column of $L$. We obtain

    if A(1,1) > 0
        A(1,1) = sqrt(A(1,1))
        A(2:n,1) = A(2:n,1)/A(1,1)
        for j = 2:n
            A(j:n,j) = A(j:n,j) - A(j,1)*A(j:n,1)                    (3.10)
        end
    end

Here we store the first column of $L$ in the first column of $A$ and the lower part of $C = B - vv^*/\alpha$ in the lower part of $A(2\!:\!n,\,2\!:\!n)$. The code can be made more efficient when $A$ is a $d$-banded matrix. We simply replace all occurrences of $n$ by $\min(i + d, n)$. Continuing the reduction we arrive at the following algorithm.

Algorithm 3.24 (bandsemi-Cholesky) Suppose $A$ is positive semidefinite. A lower triangular matrix $L$ is computed so that $A = LL^*$. This is the Cholesky factorization of $A$ if $A$ is positive definite and a semi-Cholesky factorization of $A$ otherwise. The algorithm uses the Matlab command tril.

    function L = bandsemicholeskyL(A,d)
    % L = bandsemicholeskyL(A,d)
    n = length(A);
    for k = 1:n
        kp = min(n,k+d);               % last row/column affected by column k
        if A(k,k) > 0
            A(k,k) = sqrt(A(k,k));
            A(k+1:kp,k) = A(k+1:kp,k)/A(k,k);
            for j = k+1:kp
                A(j:kp,j) = A(j:kp,j) - A(j,k)*A(j:kp,k);
            end
        else
            A(k:kp,k) = zeros(kp-k+1,1);
        end
    end
    L = tril(A);

In the algorithm we overwrite the lower triangle of $A$ with the elements of

105 3.3. Semi-Cholesky factorization of a banded matrix 93 L. Column k of L is zero for those k where l kk = 0. We reduce round-off noise by forcing those rows to be zero. In the semidefinite case no update is necessary and we do nothing. Deciding when a diagonal element is zero can be a problem in floating point arithmetic. We end the section with some necessary and sufficient conditions for a matrix to be positive semidefinite. Theorem 3.25 (Positive semidefinite characterization) The following is equivalent for a Hermitian matrix A C n n. 1. A is positive semidefinite. 2. A has only nonnegative eigenvalues. 3. All principal submatrices have a nonnegative determinant. 4. A = BB for some B C n n. Proof. 1 2: This follows from Lemma : follows from Theorem 3.22 below, while is a consequence of Theorem To prove one first shows that ɛi + A is symmetric positive definite for all ɛ > 0 (Cf. page 567 of [26]). But then x Ax = lim ɛ 0 x (ɛi + A)x 0 for all x C n. Example 3.26 (Positive semidefinite [ characterization) ] 1 1 Consider the symmetric matrix A := We have x Ax = x x x 1 x 2 + x 2 x 1 = (x 1 + x 2 ) 2 0 for all x R 2 showing that A is positive semidefinite. 2. The eigenvalues of A are λ 1 = 2 and λ 2 = 0 and they are nonnegative showing that A is positive semidefinite since it is symmetric. 3. There are three principal sub matrices, and they have determinats det([a 11 ]) = 1, det([a 22 ]) = 1 and det(a) = 0 and showing again that A is positive semidefinite. [ ] 4. Finally A is positive semidefinite since A = BB 1 0, where B =. 1 0 In part 4 of Theorem 3.25 we require nonnegativity of all principal minors, while only positivity of leading principal minors was required for positive definite

106 94 Chapter 3. LDL* Factorization and Positive definite Matrices matrices (cf. Theorem 3.18). To see that nonnegativity of the leading principal minors is not enough consider the matrix A := [ ] The leading principal minors are nonnegative, but A is not positive semidefinite. 3.4 The non-symmetric real case In this section we say that a matrix A R n n is positive semidefinite if x Ax 0 for all x R n and positive definite if x Ax > 0 for all nonzero x R n. Thus we do not require A to be symmetric. This means that some of the eigenvalues can be complex (cf. Example 3.28). Note that a non-symmetric positive definite matrix is nonsingular, but in Exercise 3.29 you show that the converse is not true. We have the following theorem. Theorem 3.27 (The non-symmetric case) Suppose A R n,n is positive definite. 1. Every principal submatrix of A is positive definite, 2. A has a unique LU factorization, 3. the real eigenvalues of A are positive, 4. det(a) > 0, 5. a ii a jj > a ij a ji, for i j. Proof. 1. The proof is the same as for Lemma Since all leading submatrices are positive definite they are nonsingular and the result follows from the LU Theorem Suppose (λ, x) is an eigenpair of A and that λ is real. Since A is real we can choose x to be real. Multiplying Ax = λx by x T and solving for λ we find λ = xt Ax x T x > The determinant of A equals the product of its eigenvalues. The eigenvalues are either real and positive or occur in complex conjugate pairs. The product of two nonzero complex conjugate numbers is positive. 5. The principal submatrix [ a ii a ij a ji a jj ] has a positive determinant.
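Since for real $x$ and real $A$ one has $x^TAx = x^T\bigl(\frac{A+A^T}{2}\bigr)x$, a real (not necessarily symmetric) matrix is positive definite precisely when its symmetric part is. The following Matlab sketch uses this observation on a small non-symmetric example of my own choosing; it is only an illustration, not part of the text.

    % Minimal sketch: test definiteness of a non-symmetric real matrix via its
    % symmetric part, since x'*A*x = x'*((A+A')/2)*x for real x and real A.
    A = [2 1; -1 2];               % arbitrary non-symmetric example
    S = (A + A')/2;                % symmetric part of A
    eig(S)                         % both eigenvalues positive => A positive definite
    eig(A)                         % eigenvalues of A itself are complex, 2 +- i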

107 3.5. Review Questions 95 Example 3.28 (2 2 positive definite) A non-symmetric positive definite matrix can have complex eigenvalues. The family of matrices [ ] 2 2 a A[a] :=, a R a 1 is positive definite for any a R. Indeed x T Ax = 2x (2 a)x 1 x 2 + ax 2 x 1 + x 2 2 = x (x 1 + x 2 ) 2 > 0. The eigenvalues of A[a] are positive for a [1 5 2, ] and complex for other values of a. Exercise 3.29 (A counterexample) In the non-symmetric case a nonsingular positive semidefinite matrix [ is not] necessarily positive definite. Show this by considering the matrix A := Review Questions What is the content of the LDL* theorem? Is A A always positive definite? What class of matrices has a Cholesky factorization? What is the bandwidth of the Cholesky factor of a band matrix? For a symmetric matrix give 3 conditions that are equivalent to positive definiteness What class of matrices has a semi-cholesky factorization?


109 Chapter 4 Orthonormal and Unitary Transformations Alston Scott Householder, (left), James Hardy Wilkinson, (right). Householder and Wilkinson are two of the founders of modern numerical analysis and scientific computing. In Gaussian elimination and LU factorization we solve a linear system by transforming it to triangular form. These are not the only kind of transformations that can be used for such a task. Matrices with orthonormal columns, called unitary matrices can be used to reduce a square matrix to upper triangular form and more generally a rectangular matrix to upper triangular (also called upper trapezoidal) form. This lead to a decomposition of a rectangular matrix known as a QR decomposition and a reduced form which we refer to as a QR factorization. The QR decomposition and factorization will be used in later chapters to solve least squares- and eigenvalue problems. Unitary transformations have the advantage that they preserve the Euclidian norm of a vector. This means that when a unitary transformation is applied to an inaccurate vector then the error will not grow. Thus a unitary transformation is numerically stable. We consider two classes of unitary transformations known as Householder- and Givens transformations, respectively. 97

110 98 Chapter 4. Orthonormal and Unitary Transformations 4.1 Inner products, orthogonality and unitary matrices An inner product or scalar product in a vector space is a function mapping pairs of vectors into a scalar Real and complex inner products Definition 4.1 (Inner product) An inner product in a complex vector space V is a function V V C satisfying for all x, y, z V and all a, b C the following conditions: 1. x, x 0 with equality if and only if x = 0. (positivity) 2. x, y = y, x (skew symmetry) 3. ax + by, z = a x, z + b y, z. (linearity) The pair (V,, ) is called an inner product space. Note the complex conjugate in 2. We find x, ay + bz = a x, y + b x, z, ax, ay = a 2 x, y. (4.1) An inner product in a real vector space V is real valued function satisfying Properties 1,2,3 in Definition 4.1, where we can replace skew symmetry by symmetry x, y = y, x (symmetry). In the real case we have linearity in the second variable since we can remove the complex conjugates in (4.1). The standard inner product in C n is given by x, y := y x = x T y = It is clearly an inner product in C n. The function is called the inner product norm. n x j y j. j=1 : V R, x x := x, x (4.2) The inner product norm for the standard inner product is the Euclidian norm x = x 2 = x x.

111 4.1. Inner products, orthogonality and unitary matrices 99 Viktor Yakovlevich Bunyakovsky, (left), Augustin-Louis Cauchy, (center), Karl Hermann Amandus Schwarz, (right). The name Bunyakovsky is also associated with the Cauchy-Schwarz inequality. The following inequality holds for any inner product. Theorem 4.2 (Cauchy-Schwarz inequality) For any x, y in a real or complex inner product space x, y x y, (4.3) with equality if and only if x and y are linearly dependent. Proof. If y = 0 then x, y = x, 0y = 0 x, y = 0 and y = 0. Thus the inequality holds with equality, and x and y are linearly dependent. So assume y 0. Define x, y z := x ay, a := y, y. Then z, y = x, y a y, y = 0 so that by 2. and (4.1) But then ay, z + z, ay = a z, y + a z, y = 0. (4.4) x 2 = x, x = z + ay, z + ay (4.4) = z, z + ay, ay (4.1) = z 2 + a 2 y 2 a 2 y 2 = x, y 2 y 2. Multiplying by y 2 gives (4.3). We have equality if and only if z = 0, which means that x and y are linearly dependent. Theorem 4.3 (Inner product norm) For all x, y in an inner product space and all a in C we have

112 100 Chapter 4. Orthonormal and Unitary Transformations 1. x 0 with equality if and only if x = 0. (positivity) 2. ax = a x. (homogeneity) 3. x + y x + y. (subadditivity) In general a function : C n R that satisfies these there properties is called a vector norm. A class of vector norms called p-norms will be studied in Chapter 7. Proof. The first statement is an immediate consequence of positivity, while the second one follows from (4.1). Expanding x + ay 2 = x + ay, x + ay using (4.1) we obtain x + ay 2 = x 2 + a y, x + a x, y + a 2 y 2, a C, x, y V. (4.5) Now (4.5) with a = 1 and the Cauchy-Schwarz inequality implies x + y 2 x x y + y 2 = ( x + y ) 2. Taking square roots completes the proof. In the real case the Cauchy-Schwarz inequality implies that 1 x,y x y 1 for nonzero x and y, so there is a unique angle θ in [0, π] such that cos θ = x, y x y. (4.6) This defines the angle between vectors in a real inner product space. Exercise 4.4 (The A A inner product) Suppose A C m n has linearly independent columns. Show that x, y := x A Ay defines an inner product on C n. Exercise 4.5 (Angle between vectors in complex case) Show that in the complex case there is a unique angle θ in [0, π/2] such that Orthogonality cos θ = x, y x y. (4.7) Definition 4.6 (Orthogonality) Two vectors x, y in a real or complex inner product space are orthogonal or perpendicular, denoted as x y, if x, y = 0. The vectors are orthonormal if in addition x = y = 1.
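A tiny Matlab sketch, with two arbitrary vectors of my own choosing, illustrates the angle formula (4.6) and the orthogonality test for the standard inner product:

    % Minimal sketch: angle between two real vectors, cf. (4.6).
    x = [1; 2; 2];  y = [2; -2; 1];        % arbitrary example vectors
    c = (y'*x)/(norm(x)*norm(y));          % cos(theta) for the standard inner product
    theta = acos(c)                        % angle in [0,pi]; pi/2 here
    y'*x                                   % 0 here, so x and y are orthogonal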

113 4.1. Inner products, orthogonality and unitary matrices 101 From the definitions (4.6), (4.7) of angle θ between two vectors in R n or C n it follows that x y if and only if θ = π/2. Theorem 4.7 (Pythagoras) For a real or complex inner product space x + y 2 = x 2 + y 2, if x y. (4.8) Proof. We set a = 1 in (4.5) and use the orthogonality. Pythagoras of Samos, BC 570-BC 495 (left), Jørgen Pedersen Gram, (center), Erhard Schmidt, (right). Definition 4.8 (Orthogonal- and orthonormal bases) A set of nonzero vectors {v 1,..., v k } in a subspace S of a real or complex inner product space is an orthogonal basis for S if it is a basis for S and v i, v j = 0 for i j. It is an orthonormal basis for S if it is a basis for S and v i, v j = δ ij for all i, j. A basis for a subspace of an inner product space can be turned into an orthogonal- or orthonormal basis for the subspace by the following construction. Theorem 4.9 (Gram-Schmidt) Let {s 1,..., s k } be a basis for a real or complex inner product space (S,, ). Define v 1 := s 1, j 1 v j := s j i=1 s j, v i v i, v i v i, j = 2,..., k. (4.9) Then {v 1,..., v k } is an orthogonal basis for S and the normalized vectors { } v1 {u 1,..., u k } := v 1,..., v k v k form an orthonormal basis for S.
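Before turning to the proof, the construction (4.9) can be written out for the standard inner product in R^n as the following minimal Matlab sketch; the three basis vectors are an arbitrary example of mine, not from the text.

    % Minimal sketch of (4.9) for the standard inner product <x,y> = y'*x in R^n.
    S = [1 2 0; 1 0 1; 0 1 1];     % columns s_1, s_2, s_3: an arbitrary basis of R^3
    [n,k] = size(S);
    V = zeros(n,k); U = zeros(n,k);
    for j = 1:k
        v = S(:,j);
        for i = 1:j-1
            v = v - (V(:,i)'*S(:,j))/(V(:,i)'*V(:,i)) * V(:,i);   % subtract projections
        end
        V(:,j) = v;                % orthogonal basis vector v_j
        U(:,j) = v/norm(v);        % orthonormal basis vector u_j
    end
    U'*U                           % identity up to rounding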

114 102 Chapter 4. Orthonormal and Unitary Transformations v 2 v 1 := s 1 s 2 cv 1 Figure 4.1: The construction of v 1 and v 2 in Gram-Schmidt. The constant c is given by c := s 2, v 1 / v 1, v 1. Proof. To show that {v 1,..., v k } is an orthogonal basis for S we use induction on k. Define subspaces S j := span{s 1,..., s j } for j = 1,..., k. Clearly v 1 = s 1 is an orthogonal basis for S 1. Suppose for some j 2 that v 1,..., v j 1 is an orthogonal basis for S j 1 and let v j be given by (4.9) as a linear combination of s j and v 1,..., v j 1. Now each of these v i is a linear combination of s 1,..., s i, and we obtain v j = j i=1 a is i for some a 0,..., a j with a j = 1. Since s 1,..., s j are linearly independent and a j 0 we deduce that v j 0. By the induction hypothesis j 1 s j, v i v j, v l = s j, v l v i, v i v i, v l = s j, v l s j, v l v l, v l v l, v l = 0 i=1 for l = 1,..., j 1. Thus v 1,..., v j is an orthogonal basis for S j. If {v 1,..., v k } is an orthogonal basis for S then clearly {u 1,..., u k } is an orthonormal basis for S. Sometimes we want to extend an orthogonal basis for a subspace to an orthogonal basis for a larger space. Theorem 4.10 (Orthogonal Extension of basis) Suppose S T are finite dimensional subspaces of a vector space V. An orthogonal basis for S can always be extended to an orthogonal basis for T. Proof. Suppose dim S := k < dim T = n. Using Theorem 0.10 we first extend an orthogonal basis s 1,..., s k for S to a basis s 1,..., s k, s k+1,..., s n for T, and then apply the Gram-Schmidt process to this basis obtaining an orthogonal basis v 1,..., v n for T. This is an extension of the basis for S since v i = s i for i = 1,..., k. We show this by induction. Clearly v 1 = s 1. Suppose for some 2

115 4.2. The Householder Transformation 103 r < k that v j = s j for j = 1,..., r 1. Consider (4.9) for j = r. Since s r, v i = s r, s i = 0 for i < r we obtain v r = s r. Letting S = span(s 1,..., s k ) and T be R n or C n we obtain Corollary 4.11 (Extending orthogonal vectors to a basis) For 1 k < n a set {s 1,..., s k } of nonzero orthogonal vectors in R n or C n can be extended to an orthogonal basis for the whole space Unitary and orthogonal matrices In the rest of this chapter orthogonality is in terms of the standard inner product in C n given by x, y := y x = n j=1 x jy j. A matrix U C n n is unitary if U U = I. If U is real then U T U = I and U is called an orthogonal matrix. Unitary and orthogonal matrices have orthonormal columns. If U U = I the matrix U is nonsingular, U 1 = U and UU = I as well. Moreover, both the columns and rows of a unitary matrix of order n form orthonormal bases for C n. We also note that the product of two unitary matrices is unitary. Indeed if U 1U 1 = I and U 2U 2 = I then (U 1 U 2 ) (U 1 U 2 ) = U 2U 1U 1 U 2 = I. Theorem 4.12 (Unitary matrix) The matrix U C n n is unitary if and only if Ux, Uy = x, y for all x, y C n. In particular, if U is unitary then Ux 2 = x 2 for all x C n. Proof. If U U = I and x, y C n then Ux, Uy = (Uy) (Ux) = y U Ux = y x = x, y. Conversely, if Ux, Uy = x, y for all x, y C n i, j = 1,..., n then U U = I since for (U U) i,j = e T i U Ue j = (Ue i ) (Ue j ) = Ue j, Ue i = e j, e i = e i e j = δ i,j. The last part of the theorem follows immediately by taking y = x: 4.2 The Householder Transformation We consider now a unitary matrix with many useful properties. Definition 4.13 (Householder transformation) A matrix H C n n of the form H := I uu, where u C n and u u = 2

116 104 Chapter 4. Orthonormal and Unitary Transformations is called a Householder transformation. The name elementary reflector is also used. ] In the real case and for n = 2 we find H =. A Householder [ 1 u 2 1 u1u2 u 2u 1 1 u 2 2 transformation is Hermitian and unitary. Indeed, H = (I uu ) = H and H H = H 2 = (I uu )(I uu ) = I 2uu + u(u u)u = I. In the real case H is symmetric and orthonormal. There are several ways to represent a Householder transformation. Householder used I 2uu, where u u = 1. For any nonzero v R n the matrix H := I 2 vvt v T v (4.10) is a Householder transformation. Indeed, H = I uu T, where u := 2 v v 2 has length 2. Two vectors in R n of the same length can be mapped into each other by a Householder transformation. Lemma 4.14 Suppose x, y R n with x 2 = y 2 and v := x y 0. Then ( I 2 vv T v T v ) x = y. Proof. Since x T x = y T y we have v T v = (x y) T (x y) = 2x T x 2y T x = 2v T x. (4.11) But then ( ) I 2 vvt v T v x = x 2v T x v T v v = x v = y. and A geometric interpretation of this lemma is shown in Figure 4.2. We have H = I 2vvT v T v = P vvt vvt v T, where P := I v v T v, P x = x vt x v T v v (4.11) = x 1 2 v = 1 (x + y). 2 It follows that Hx is the reflected image of x. The mirror M := {w R n : w T v = 0} contains the vector (x + y)/2 and has normal x y. Example 4.15 (Reflector) Suppose x := [1, 0, 1] T and y := [ 1, 0, 1] T. Then v = x y = [2, 0, 0] T and H := I 2vvT v T v = [ ] = 0 1 0,

117 4.2. The Householder Transformation 105 mirror y = Hx P x x v = x y Figure 4.2: The Householder transformation in Example 4.15 P := I vvt v T v = [ ] = 0 1 0, [ M := {w R 3 : w T w1 ] v = 0} = { w 2 : 2w w 1 = 0} 3 is the yz plane (cf. Figure 4.2), Hx = [ 1, 0, 1] T = y, and P x = [0, 0, 1] T = (x + y)/2 M. Householder transformations can be used to produce zeros in vectors. In the following Theorem we map any vector in C n into a multiple of the first unit vector. Theorem 4.16 (Zeros in vectors) For any x C n there is a Householder transformation H C n n such that Hx = ae 1, a = ρ x 2, ρ := { x 1 / x 1, if x 1 0, 1, otherwise. Proof. If x = 0 then a = 0, any u with u 2 = 2 will do, and we choose u := 2e 1 in this case. Suppose x 0. we define u := z + e 1, where z := ρx/ x 2. (4.12) 1 + z1

118 106 Chapter 4. Orthonormal and Unitary Transformations Since ρ = 1 we have ρ x 2 z = ρ 2 x = x. Moreover, z 2 = 1 and z 1 = x 1 / x 2 is real so that u u = (z+e1) (z+e 1) 1+z 1 = 2+2z1 1+z 1 = 2. Finally, Hx = x (u x)u = ρ x 2 (z (u z)u) = ρ x 2 (z (z + e 1)z 1 + z 1 (z + e 1 )) = ρ x 2 (z (z + e 1 )) = ρ x 2 e 1 = ae 1. The formulas in Theorem 4.16 are implemented in the following algorithm adapted from [28]. To any given x C n a number a and a vector u with u u = 2 is computed so that (I uu )x = ae 1. Algorithm 4.17 (Generate a Householder transformation) 1 function [ u, a]= housegen ( x ) 2 a=norm( x ) ; 3 i f a==0 4 u=x ; u ( 1 )=sqrt ( 2 ) ; return ; 5 end 6 i f x ( 1 )== 0 7 r =1; 8 else 9 r=x ( 1 ) /abs ( x ( 1 ) ) ; 10 end 11 u=conj ( r ) x/a ; 12 u ( 1 )=u ( 1 ) +1; 13 u=u/ sqrt ( u ( 1 ) ) ; 14 a= r a ; 15 end Note that In Theorem 4.16 the first component of z is z 1 = x 1 / x 2 0. Since z 2 = 1 we have z 1 2. It follows that we avoid cancelation error when computing 1 + z 1 and u and a are computed in a numerically stable way. In order to compute Hx for a vector x we do not need to form the matrix H. Indeed, Hx = (I uu )x = x (u x)u. If u, x R m this requires 2m operations to find u T x, m operations for (u T x)u and m operations for the final subtraction of the two vectors, a total of 4m arithmetic operations. If A R m n then 4mn operations are required for HA = A (u T A)u, i. e., 4m operations for each of the n columns of A. Householder transformations can also be used to zero out only the lower part of a vector. Suppose x T := [y, z] T, where y C k, z C n k for some

119 4.3. Householder Triangulation k < n. The command [û, a] := housegen(z) defines a Householder transformation Ĥ = I ûû so that Ĥz = ae 1. With u T := [0, û] T C n we see that u u = û û = 2, and Hx = [ y ae 1 ], where H := I uu = [ ] [ ] I 0 0 [0 û ] = 0 I û [ ] I 0, 0 Ĥ defines a Householder transformation that produces zeros in the lower part of x. Exercise 4.18 (What does algorithm housegen do when x = e 1?) Determine H in Algorithm 4.17 when x = e 1. Exercise 4.19 (Examples of Householder transformations) If x, y R n with x 2 = y 2 and v := x y 0 then it follows from Example 4.15 that ( ) I 2 vvt v T v x = y. Use this to construct a Householder transformation H such that Hx = y in the following cases. [ ] [ ] 3 5 a) x =, y =. 4 0 b) x = 2 2 1, y = Exercise 4.20 (2 2 Householder transformation) Show that a real 2 2 Householder transformation can be written in the form [ ] cos φ sin φ H =. sin φ cos φ Find Hx if x = [cos φ, sin φ] T. 4.3 Householder Triangulation We say that a matrix R C m n is upper trapezoidal, if r i,j = 0 for j < i and i = 2, 3..., m. Upper trapezoidal matrices corresponding to m < n, m = n, and m > n look as follows: x x x x x x x x x x x 0 x x x, 0 x x x 0 0 x x, 0 x x 0 0 x. 0 0 x x x 0 0 0

120 108 Chapter 4. Orthonormal and Unitary Transformations In this section we consider a method for bringing a matrix to upper trapezoidal form using Householder transformations. We treat the cases m > n and m n separately and consider first m > n. We describe how to find a sequence H 1,..., H n of Householder transformations such that [ ] R1 A n+1 := H n H n 1 H 1 A = = R, 0 and where R 1 is upper triangular. We define A 1 := A, A k+1 = H k A k, k = 1, 2,..., n. Suppose A k is upper trapezoidal in its first k 1 columns (which is true for k = 1) a (1) 1,1 a (1) 1,k 1 a (1) 1,k a (1) 1,j a (1) 1,n a (k 1) k 1,k 1 a (k 1) k 1,k a (k 1) k 1,j a (k 1) k 1,n a (k) k,k a (k) k,j a (k) k,n A k =... a (k) i,k a (k) i,j a (k) (4.13) i,n... a (k) m,k a (k) m,j a (k) m,n [ ] Bk C = k. 0 D k Let Ĥk := I û k û k be a Householder transformation that maps the first column [a (k) k,k,..., a(k) m,k ]T of D k to a multiple of e 1, Ĥ k (D k e 1 ) = a k e 1. Using [ ] Ik 1 0 Algorithm 4.17 we have [û k, a k ] = housegen(d k e 1 ). Then H k := is a 0 Ĥ k Householder transformation and [ ] [ ] Bk C k Bk+1 C A k+1 := H k A k = = k+1, 0 Ĥ k D k 0 D k+1 where B k+1 C k k is upper triangular and D k+1 C (m k) (n k). Thus A k+1 is upper trapezoidal in its first k columns and the reduction has been carried one step further. At the end R := A n+1 = [ ] R 1 0, where R1 is upper triangular. The process can also be applied to A C m n if m n. If m = 1 then A is already in upper trapezoidal form. Suppose m > 1. In this case m 1 Householder transformations will suffice and H m 1 H 1 A is upper trapezoidal. In an algorithm we can store most of the vector û k = [u kk,..., u mk ] T and the matrix and A k in A. However, the elements u k,k and a k = r k,k have to

121 4.3. Householder Triangulation 109 compete for the diagonal in A. For m = 4 and n = 3 the two possibilities look as follows: u 11 r 12 r 13 r 11 r 12 r 13 A = u 21 u 22 r 23 u 31 u 32 u 33 or A = u 21 r 22 r 23 u 31 u 32 r 33. u 41 u 42 u 43 u 41 u 42 u 43 Whatever alternative is chosen, if the looser is needed, it has to be stored in a separate vector. In the following algorithm we store a k = r k,k in A. We also apply the Householder transformations to a second matrix B. We will see that the algorithm can be used to solve linear systems and least squares problems with right hand side(s) B and to compute the product of the Householder transformations by choosing B = I. Algorithm 4.21 (Householder triangulation) Suppose A C m n, B C m r and let s := min(n, m 1). The algorithm uses housegen to compute Householder transformations H 1,..., H s such that R = H s H 1 A is upper trapezoidal and C = H s H 1 B. If B is the empty matrix then C is the empty matrix with m rows and 0 columns. 1 function [R,C] = h o u s e t r i a n g (A,B) 2 [m, n]= size (A) ; r=size (B, 2 ) ; A=[A,B ] ; 3 for k=1:min( n,m 1) 4 [ v,a( k, k ) ]= housegen (A( k :m, k ) ) ; 5 C=A( k :m, k+1:n+r ) ; A( k :m, k+1:n+r )=C v ( v C) ; 6 end 7 R=triu (A( :, 1 : n ) ) ; C=A( :, n+1:n+r ) ; Here v = û k and the update is computed as ĤkC = (I vv )C = C v(v C). The Matlab command triu extracts the upper triangular part of A introducing zeros in rows n + 1,..., m The number of arithmetic operations The bulk of the work in Algorithm 4.21 is the computation of C v (v C) for each k. Since in Algorithm 4.21, C C (m k+1) (n+r k) and m n the cost of computing the update C v (v T C) in the real case is 4(m k + 1)(n + r k) arithmetic operations. This implies that the work in Algorithm 4.21 can be estimated as n 0 4(m k)(n + r k)dk = 2m(n + r) (n + r)3. (4.14) For m = n and r = 0 this gives 4n 3 /3 = 2G n for the number of arithmetic operations to bring a matrix A R n n to upper triangular form using Householder transformations.
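Assuming the functions housegen (Algorithm 4.17) and housetriang (Algorithm 4.21) are available on the Matlab path, the following sketch shows the typical use discussed in the next subsection: triangulate [A, b] and solve the resulting triangular system by back substitution. The test matrix and right hand side are arbitrary examples of mine.

    % Minimal sketch (assumes housegen/housetriang from Algorithms 4.17/4.21):
    % solve A*x = b by Householder triangulation and back substitution.
    A = [2 -1 0; -1 2 -1; 0 -1 2];     % arbitrary nonsingular example
    b = [1; 0; 1];
    [R,c] = housetriang(A,b);          % R = H_{n-1}...H_1*A, c = H_{n-1}...H_1*b
    n = length(b); x = zeros(n,1);
    for k = n:-1:1                     % back substitution for R*x = c
        x(k) = (c(k) - R(k,k+1:n)*x(k+1:n))/R(k,k);
    end
    disp(norm(A*x - b))                % small residual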

122 110 Chapter 4. Orthonormal and Unitary Transformations Solving linear systems using unitary transformations Consider now the linear system Ax = b, where A is square. Using Algorithm 4.21 we obtain an upper triangular system Rx = c that is upper triangular and nonsingular if A is nonsingular. Thus, it can be solved by back substitution and we have a method for solving linear systems that is an alternative to Gaussian elimination. The two methods are similar since they both reduce A to upper triangular form using certain transformations and they both work for nonsingular systems. Which method is better? Here is a short discussion. Advantages with Householder: Row interchanges are not necessary, but see [6]. Numerically stable. Advantages with Gauss Half the number of arithmetic operations compared to Householder. Row interchanges are often not necessary. Usually stable (but no guarantee). Linear systems can be constructed where Gaussian elimination will fail numerically even if row interchanges are used, see [36]. On the other hand the transformations used in Householder triangulation are unitary so the method is quite stable. So why is Gaussian elimination more popular than Householder triangulation? One reason is that the number of arithmetic operations in (4.14) when m = n is 4n 3 /3 = 2G n, which is twice the number for Gaussian elimination. Numerical stability can be a problem with Gaussian elimination, but years and years of experience shows that it works well for most practical problems and pivoting is often not necessary. Also Gaussian elimination often wins for banded and sparse problems. 4.4 The QR Decomposition and QR Factorization Gaussian elimination without row interchanges results in an LU factorization A = LU of A R n n. Consider Householder triangulation of A. Applying Algorithm 4.21 gives R = H n 1 H 1 A implying the factorization A = QR, where Q = H 1 H n 1 is orthonormal and R is upper triangular. This is known as a QR-factorization of A Existence For a rectangular matrix we define the following.

123 4.4. The QR Decomposition and QR Factorization 111 Definition 4.22 (QR decomposition) Let A C m n with m, n N. We say that A = QR is a QR decomposition of A if Q C m,m is square and unitary and R C m n is upper trapezoidal. If m n then R takes the form [ ] R1 R = 0 m n,n where R 1 C n n is upper triangular and 0 m n,n is the zero matrix with m n rows and n columns. For m n we call A = Q 1 R 1 a QR factorization of A if Q 1 C m n has orthonormal columns and R 1 C n n is upper triangular. Suppose m n. A QR factorization is obtained from a QR decomposition A = QR by simply using the first n columns of Q and the first n rows of R. Indeed, if we partition Q as [Q 1, Q 2 ] and R = [ ] R 1 0, where Q1 R m n and R 1 R n n then A = Q 1 R 1 is a QR factorization of A. On the other hand a QR factorization A = Q 1 R 1 of A can be turned into a QR decomposition by extending the set of columns {q 1,..., q n } of Q 1 into an orthonormal basis {q 1,..., q n, q n+1,..., q m } for R m and adding m n rows of zeros to R 1. We then obtain the QR decomposition A = QR, where Q = [q 1,..., q m ] and R = [ ] R 1 0. Example 4.23 (QR decomposition and factorization) An example of a QR decomposition is A = = = QR, while a QR factorization A = Q 1 R 1 is obtained by dropping the last column of Q and the last row of R, so that A = = Q 1 R Consider existence and uniqueness. Theorem 4.24 (Existence of QR decomposition) Any matrix A C m n with m, n N has a QR decomposition. Proof. If m = 1 then A is already in upper trapezoidal form and A = [1]A is a QR decomposition of A. Suppose m > 1 and set s := min(m 1, n). Note that

124 112 Chapter 4. Orthonormal and Unitary Transformations the function housegen(x) returns the vector u in a Householder transformation for any vector x. With B = I in Algorithm 4.21 we obtain R = CA and C = H s H 2 H 1. Thus A = QR is a QR decomposition of A since Q := C = H 1 H s is a product of unitary matrices and therefore unitary. Theorem 4.25 (Uniqueness of QR factorization) If m n and A is real then the QR factorization is unique if A has linearly independent columns and R has positive diagonal elements. Proof. Let A = Q 1 R 1 be a QR factorization of A. Now A T A = R T 1 Q T 1 Q 1 R 1 = R T 1 R 1. Since A T A is symmetric positive definite the matrix R 1 is nonsingular, and if its diagonal elements are positive this is the Cholesky factorization of A T A. Since the Cholesky factorization is unique it follows that R 1 is unique and since necessarily Q 1 = AR 1 1, it must also be unique. Example 4.26 (QR decomposition and factorization) [ Consider finding the QR decomposition and factorization of the matrix A = 2 1 ] 1 2 using the method of the uniqueness proof of Theorem We find B := A T A = [ ] The Cholesky factorization of B = R T R is given by [ R = ] Now R 1 = 1 3 [ ] so Q = [ AR 1 = ] Since A is square A = QR is both the QR decomposition and QR factorization of A. The QR factorization can be used to prove a classical determinant inequality. Theorem 4.27 (Hadamard s inequality) For any A = [a 1,..., a n ] C n n we have det(a) n a j 2. (4.15) Equality holds if and only if A has a zero column or the columns of A are orthogonal. j=1 Proof. Let A = QR be a QR factorization of A. Since 1 = det(i) = det(q Q) = det(q ) det(q) = det(q) det(q) = det(q) 2 we have det(q) = 1. Let R = [r 1,..., r n ]. Then (A A) jj = a j 2 2 = (R R) jj = r j 2 2, and det(a) = det(qr) = det(r) = n r jj j=1 n r j 2 = j=1 n a j 2. j=1

125 4.4. The QR Decomposition and QR Factorization 113 The inequality is proved. We clearly have equality if A has a zero column, for then both sides of (4.15) are zero. Suppose the columns are nonzero. We have equality if and only if r jj = r j 2 for j = 1,..., n. This happens if and only if R is diagonal. But then A A = R R is diagonal, which means that the columns of A are orthogonal. Exercise 4.28 (QR decomposition) A = , Q = , R = Show that Q is orthonormal and that QR is a QR decomposition of A. Find a QR factorization of A. Exercise 4.29 (Householder triangulation) a) Let A := Find Householder transformations H 1, H 2 R 3 3 such that H 2 H 1 A is upper triangular. b) Find the QR factorization of A, when R has positive diagonal elements QR and Gram-Schmidt The Gram-Schmidt orthogonalization of the columns of A can be used to find the QR factorization of A. Theorem 4.30 (QR and Gram-Schmidt) Suppose A R m n has rank n and let v 1,..., v n be the result of applying Gram Schmidt to the columns a 1,..., a n of A, i. e., v 1 = a 1, j 1 v j = a j i=1 a T j v i v T i v v i, for j = 2,..., n. (4.16) i

126 114 Chapter 4. Orthonormal and Unitary Transformations Let Q 1 := [q 1,..., q n ], q j := v j, j = 1,..., n and v j 2 v 1 2 a T 2 q 1 a T 3 q 1 a T n 1q 1 a T n q 1 0 v 2 2 a T 3 q 2 a T n 1q 2 a T n q 2 0 v 3 2 a T n 1q 3 a T n q 3 R 1 := vn 1 2 a T n q n 1 (4.17) Then A = Q 1 R 1 is the unique QR factorization of A. 0 v n 2 Proof. Let Q 1 and R 1 be given by (4.17). The matrix Q 1 is well defined and has orthonormal columns, since {q 1,..., q n } is an orthonormal basis for span(a) by Theorem 4.9. By (4.16) j 1 a j = v j + i=1 a T j v i v T i v i j 1 v i = r jj q j + q i r ij = Q 1 R 1 e j, j = 1,..., n. Clearly R 1 has positive diagonal elements and the factorization is unique. i=1 Example 4.31 (QR using Gram-Schmidt) Consider ] finding the QR decomposition and factorization of the matrix A = = [a1, a 2 ] using Gram-Schmidt. Using (4.16) we find v 1 = a 1 and [ v 2 = a 2 at 2 v1 v v T 1 v1 1 = 3 5 [ 1 2 ]. Thus Q = [q [ 1, q 2 ], where q 1 = ] 5 and q 2 = 1 5 [ 1 2 ]. By (4.17) we find [ ] v1 R 1 = R = 2 a T 2 q 1 = 1 [ ] v and this agrees with what we found in Example Exercise 4.32 (QR using Gram-Schmidt, II) Construct Q 1 and R 1 in Example 4.23 using Gram-Schmidt orthogonalization. Warning. The Gram-Schmidt orthogonalization process should not be used to compute the QR factorization numerically. The columns of Q 1 computed in floating point arithmetic using Gram-Schmidt orthogonalization will often be far from orthogonal. There is a modified version of Gram-Schmidt which behaves better numerically, see [3]. Here we only considered Householder transformations (cf. Algorithm 4.21).
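To make the discussion concrete, here is a minimal MATLAB sketch of the classical Gram-Schmidt factorization (4.16)-(4.17). The function name cgs_qr is ours, and the sketch is meant for experimentation only, precisely because of the warning above; Householder triangulation (Algorithm 4.21) or modified Gram-Schmidt should be preferred in actual computations.

function [Q1, R1] = cgs_qr(A)
% Classical Gram-Schmidt QR factorization of a full rank A (illustration only;
% see the warning above about loss of orthogonality in floating point).
[m, n] = size(A);
Q1 = zeros(m, n); R1 = zeros(n, n);
for j = 1:n
    v = A(:, j);
    for i = 1:j-1
        R1(i, j) = Q1(:, i)'*A(:, j);   % r_ij = a_j^T q_i as in (4.17)
        v = v - R1(i, j)*Q1(:, i);      % the Gram-Schmidt step (4.16)
    end
    R1(j, j) = norm(v);                 % r_jj = ||v_j||_2
    Q1(:, j) = v/R1(j, j);
end
end

On a well conditioned matrix both norm(A - Q1*R1) and norm(Q1'*Q1 - eye(n)) are small; when the columns of A are nearly linearly dependent the second quantity degrades, which is exactly the point of the warning.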

127 4.5. Givens Rotations 115

4.5 Givens Rotations

[Figure 4.3: A plane rotation.]

In some applications, the matrix we want to triangulate has a special structure. Suppose for example that A ∈ R^{n×n} is square and upper Hessenberg, as illustrated by a Wilkinson diagram for n = 4:

A = [x x x x; x x x x; 0 x x x; 0 0 x x].

Only one element in each column needs to be annihilated, and a full Householder transformation will be inefficient. In this case we can use a simpler transformation.

Definition 4.33 (Givens rotation, plane rotation) A plane rotation (also called a Givens rotation) is a matrix P ∈ R^{2×2} of the form

P := [c s; -s c], where c^2 + s^2 = 1.

A plane rotation is orthonormal and there is a unique angle θ ∈ [0, 2π) such that c = cos θ and s = sin θ. Moreover, the identity matrix is a plane rotation corresponding to θ = 0.

Exercise 4.34 (Plane rotation) Show that if x = [r cos α; r sin α] then P x = [r cos(α - θ); r sin(α - θ)]. Thus P rotates a vector x in the plane through an angle θ clockwise. See Figure 4.3.

Suppose x = [x_1; x_2] ≠ 0 and set r := ||x||_2, c := x_1/r, s := x_2/r. Then

P x = (1/r) [x_1 x_2; -x_2 x_1] [x_1; x_2] = (1/r) [x_1^2 + x_2^2; 0] = [r; 0],

128 116 Chapter 4. Orthonormal and Unitary Transformations

and we have introduced a zero in x. We can take P = I when x = 0.

For an n-vector x ∈ R^n and 1 ≤ i < j ≤ n we define a rotation in the i, j-plane as a matrix P_ij = (p_kl) ∈ R^{n×n} with p_kl = δ_kl except for the positions ii, jj, ij, ji, which are given by

[p_ii p_ij; p_ji p_jj] = [c s; -s c], where c^2 + s^2 = 1.

Thus, for n = 4,

P_12 = [c s 0 0; -s c 0 0; 0 0 1 0; 0 0 0 1],  P_13 = [c 0 s 0; 0 1 0 0; -s 0 c 0; 0 0 0 1],  P_23 = [1 0 0 0; 0 c s 0; 0 -s c 0; 0 0 0 1].

[Photographs: Karl Adolf Hessenberg (left) and James Wallace Givens, Jr. (right).]

Premultiplying a matrix by a rotation in the i, j-plane changes only rows i and j of the matrix, while postmultiplying by such a rotation changes only columns i and j. In particular, if B = P_ij A and C = A P_ij then B(k,:) = A(k,:) and C(:,k) = A(:,k) for all k ≠ i, j, and

[B(i,:); B(j,:)] = [c s; -s c] [A(i,:); A(j,:)],   [C(:,i), C(:,j)] = [A(:,i), A(:,j)] [c s; -s c].   (4.18)

Givens rotations can be used as an alternative to Householder transformations for solving linear systems. It can be shown that for a dense system of order n the number of arithmetic operations is asymptotically 2n^3, corresponding to the work of three Gaussian eliminations, while the work using Householder transformations corresponds to two Gaussian eliminations. However, for matrices with a special structure Givens rotations can be used to advantage. As an example consider an upper Hessenberg matrix A ∈ R^{n×n}. It can be transformed to upper triangular

129 4.6. Review Questions 117

form using rotations P_{i,i+1} for i = 1,..., n-1. For n = 4 the process can be illustrated as follows:

A = [x x x x; x x x x; 0 x x x; 0 0 x x]
  --P_12--> [r_11 r_12 r_13 r_14; 0 x x x; 0 x x x; 0 0 x x]
  --P_23--> [r_11 r_12 r_13 r_14; 0 r_22 r_23 r_24; 0 0 x x; 0 0 x x]
  --P_34--> [r_11 r_12 r_13 r_14; 0 r_22 r_23 r_24; 0 0 r_33 r_34; 0 0 0 r_44].

For an algorithm see Exercise 4.35. This reduction is used in the QR method discussed in Chapter 14.

Exercise 4.35 (Solving upper Hessenberg system using rotations) Let A ∈ R^{n×n} be upper Hessenberg and nonsingular, and let b ∈ R^n. The following algorithm solves the linear system Ax = b using rotations P_{k,k+1} for k = 1,..., n-1. It uses the back solve Algorithm 2.7. Determine the number of arithmetic operations of this algorithm.

Algorithm 4.36 (Upper Hessenberg linear system)

function x = rothesstri(A, b)
% Solve Ax = b for an upper Hessenberg A using Givens rotations and back substitution.
n = length(A); A = [A b];                   % work on the augmented matrix [A b]
for k = 1:n-1
    r = norm([A(k,k), A(k+1,k)]);
    if r > 0
        c = A(k,k)/r; s = A(k+1,k)/r;
        A([k k+1], k+1:n+1) = [c s; -s c]*A([k k+1], k+1:n+1);
    end
    A(k,k) = r; A(k+1,k) = 0;               % the rotation zeros the subdiagonal entry
end
x = rbacksolve(A(:,1:n), A(:,n+1), n);      % back substitution (Algorithm 2.7)

4.6 Review Questions

What is a Householder transformation? Why are they good for numerical work?
What are the main differences between solving a linear system by Gaussian elimination and by Householder transformations?
What are the differences between a QR decomposition and a QR factorization?
Does every matrix have a QR decomposition?
What is a Givens transformation?


Part II

Some matrix theory and least squares


133 Chapter 5 Eigenpairs and Similarity Transformations We have seen that a Hermitian matrix is positive definite if and only if it has positive eigenvalues. Eigenvalues and some related quantities called singular values occur in many branches of applied mathematics and are also needed for a deeper study of linear systems. In this and the next chapter we study eigenvalues and singular values. Recall that if A C n n is a square matrix, λ C and x C n then (λ, x) is an eigenpair for A if Ax = λx and x is nonzero. The scalar λ is called an eigenvalue and x is said to be an eigenvector. The set of eigenvalues is called the spectrum of A and is denoted by σ(a). For example, σ(i) = {1,..., 1} = {1}. The eigenvalues are the roots of the characteristic polynomial of A given for λ C by π A (λ) = det(a λi). The equation det(a λi) = 0 is called the characteristic equation of A. Equivalently the characteristic equation can be written det(λi A) = Defective and nondefective matrices For the eigenvectors we will see that it is important to know if the eigenvectors of a matrix of order n form a basis for C n. We say that A is: defective if the eigenvectors do not form a basis, nondefective if the eigenvectors form a basis, We have the following sufficient condition for a matrix to be nondefective. 121

134 122 Chapter 5. Eigenpairs and Similarity Transformations

Theorem 5.1 (Distinct eigenvalues) A matrix with distinct eigenvalues is nondefective.

Proof. The proof is by contradiction. Suppose the eigenvectors of the eigenpairs (λ_k, x_k), k = 1,..., n, of A are linearly dependent, and let m be the smallest integer such that {x_1,..., x_m} is linearly dependent. Thus Σ_{j=1}^{m} c_j x_j = 0, where at least one c_j is nonzero. Since the value of the sum is zero there must be at least two nonzero c_j's, so m ≥ 2. We find

Σ_{j=1}^{m} c_j x_j = 0  ⟹  Σ_{j=1}^{m} c_j A x_j = Σ_{j=1}^{m} c_j λ_j x_j = 0.

From the last relation we subtract Σ_{j=1}^{m} c_j λ_m x_j = 0 and find Σ_{j=1}^{m-1} c_j (λ_j - λ_m) x_j = 0. But since λ_j - λ_m ≠ 0 for j = 1,..., m-1 and at least one c_j ≠ 0 for j < m, we see that {x_1,..., x_{m-1}} is linearly dependent, contradicting the minimality of m.

If some of the eigenvalues occur with multiplicity higher than one, then the matrix can be either defective or nondefective.

Example 5.2 (Defective and nondefective matrices) Consider the matrices

I := [1 0; 0 1],   J := [1 1; 0 1].

Since Ix = x and λ_1 = λ_2 = 1, any vector x ∈ C^2 is an eigenvector for I. In particular the two unit vectors e_1 and e_2 are eigenvectors and form an orthonormal basis for C^2. We conclude that the identity matrix is nondefective. The matrix J also has the eigenvalue one with multiplicity two, but since Jx = x if and only if x_2 = 0, any eigenvector must be a multiple of e_1. Thus J is defective.

If the eigenvectors x_1,..., x_n form a basis for C^n then any x ∈ C^n can be written

x = Σ_{j=1}^{n} c_j x_j  for some scalars c_1,..., c_n.

We call this an eigenvector expansion of x. By definition any nondefective matrix has an eigenvector expansion.

Example 5.3 (Eigenvector expansion example) The eigenpairs of [2 1; 1 2] are (1, [1, -1]^T) and (3, [1, 1]^T). Any x = [x_1, x_2]^T ∈ C^2 has the eigenvector expansion

x = ((x_1 + x_2)/2) [1; 1] + ((x_1 - x_2)/2) [1; -1].
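A quick numerical check of Example 5.3 (a hedged sketch; the test vector is an arbitrary choice of ours): MATLAB's eig returns the eigenvectors of the nondefective matrix [2 1; 1 2], and solving a small linear system gives the expansion coefficients.

% Eigenvector expansion of an arbitrary vector x, as in Example 5.3.
A = [2 1; 1 2];
[X, D] = eig(A);      % columns of X are eigenvectors, D = diag(1, 3)
x = [5; -3];          % any vector in R^2
c = X\x;              % coefficients: x = c(1)*X(:,1) + c(2)*X(:,2)
norm(X*c - x)         % (numerically) zero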

135 5.1. Defective and nondefective matrices Similarity transformations We need a transformation that can be used to simplify a matrix without changing the eigenvalues. Definition 5.4 (Similar matrices) Two matrices A, B C n n are said to be similar if there is a nonsingular matrix S C n n such that B = S 1 AS. The transformation A B is called a similarity transformation. The columns of S are denoted by s 1, s 2,..., s n. We note that 1. Similar matrices have the same eigenvalues, they even have the same characteristic polynomial. Indeed, by the product rule for determinants det(ac) = det(a) det(c) so that π B (λ) = det(s 1 AS λi) = det ( S 1 (A λi)s ) = det(s 1 ) det(a λi) det(s) = det(s 1 S) det(a λi) = π A (λ), since det(i) = (λ, x) is an eigenpair for S 1 AS if and only if (λ, Sx) is an eigenpair for A. In fact (S 1 AS)x = λx if and only if A(Sx) = λ(sx). 3. If S 1 AS = D = diag(λ 1,..., λ n ) we can partition AS = SD by columns to obtain [As 1,..., As n ] = [λ 1 s 1,..., λ n s n ]. Thus the columns of S are eigenvectors of A. Moreover, A is nondefective since S is nonsingular. Conversely, if A is nondefective then it can be diagonalized by a similarity transformation S 1 AS, where the columns of S are eigenvectors of A. 4. For any square matrices A, C C n n the two products AC and CA have the same characteristic polynomial. More generally, for rectangular matrices A C m n and C C n m, with say m > n, the bigger matrix has m n extra zero eigenvalues π AC (λ) = λ m n π CA (λ), λ C. (5.1) To show this define for any m, n N block triangular matrices of order n+m by [ ] [ ] [ ] AC I A E :=, F :=, S =. C 0 C CA 0 I [ ] The matrix S is nonsingular with S 1 I A =. Moreover, ES = SF so 0 I E and F are similar and therefore have the same characteristic polynomials that are the products of the characteristic polynomial of the diagonal blocks. But then π E (λ) = λ n π AC (λ) = π F (λ) = λ m π CA (λ). This implies the statements for m n.
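The relation (5.1) is easy to test numerically. The following hedged sketch uses random rectangular matrices and compares the sorted eigenvalues of AC and CA; the two lists agree up to rounding and the ordering of complex conjugate pairs.

% AC and CA share their nonzero eigenvalues; AC has m - n extra zeros, see (5.1).
m = 5; n = 3;
A = randn(m, n); C = randn(n, m);
mu = sort(eig(A*C));          % m eigenvalues, sorted (by magnitude if complex)
nu = sort(eig(C*A));          % n eigenvalues
abs(mu(1:m-n))                % the m - n extra eigenvalues: (numerically) zero
max(abs(mu(m-n+1:end) - nu))  % the remaining ones agree up to rounding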

136 124 Chapter 5. Eigenpairs and Similarity Transformations Exercise 5.5 (Eigenvalues of a block triangular matrix) What are the eigenvalues of the matrix R 8,8? (5.2) Exercise 5.6 (Characteristic polynomial of transpose) We have det(b T ) = det(b) and det(b) = det(b)for any square matrix B. Use this to show that 1. π A T = π A, 2. π A (λ) = π A (λ). Exercise 5.7 (Characteristic polynomial of inverse) Suppose (λ, x) is an eigenpair for A C n n. Show that 1. If A is nonsingular then (λ 1, x) is an eigenpair for A (λ k, x) is an eigenpair for A k for k Z. Exercise 5.8 (The power of the eigenvector expansion) Show that if A C n n is nondefective with eigenpairs (λ j, x j ), j = 1,..., n then for any x C n and k N A k x = n c j λ k j x j for some scalars c 1,..., c n. (5.3) j=1 Show that if A is nonsingular then (5.3) holds for all k Z. Exercise 5.9 (Eigenvalues of an idempotent matrix) Let λ σ(a) where A 2 = A C n n. Show that λ = 0 or λ = 1. (A matrix is called idempotent if A 2 = A). Exercise 5.10 (Eigenvalues of a nilpotent matrix) Let λ σ(a) where A k = 0 for some k N. Show that λ = 0. (A matrix A C n n such that A k = 0 for some k N is called nilpotent). Exercise 5.11 (Eigenvalues of a unitary matrix) Let λ σ(a), where A A = I. Show that λ = 1.

137 5.2. Geometric multiplicity of eigenvalues and the Jordan Form 125 Exercise 5.12 (Nonsingular approximation of a singular matrix) Suppose A C n n is singular. Then we can find ɛ 0 > 0 such that A + ɛi is nonsingular for all ɛ C with ɛ < ɛ 0. Hint: det(a) = λ 1 λ 2 λ n, where λ i are the eigenvalues of A. Exercise 5.13 (Companion matrix) For q 0,..., q n 1 C let p(λ) = λ n + q n 1 λ n q 0 be a polynomial of degree n in λ. We derive two matrices that have ( 1) n p as its characteristic polynomial. a) Show that p = ( 1) n π A where A = q n 1 q n 2 q 1 q A is called the companion matrix of f. b) Show that p = ( 1) n π B where q q 1 B = q q n 1. Thus B can also be regarded as a companion matrix for p. 5.2 Geometric multiplicity of eigenvalues and the Jordan Form We have seen that a nondefective matrix can be diagonalized by its eigenvectors, while a defective matrix does not enjoy this property. The following question arises. How close to a diagonal matrix can we reduce a general matrix by a similarity transformation? We give one answer to this question, called the Jordan form, or the Jordan canonical form, in Theorem For a proof, see for example [15]. The Jordan form is an important tool in matrix analysis and it has applications to systems of differential equations, see [14]. We first take a closer look at the multiplicity of eigenvalues.

138 126 Chapter 5. Eigenpairs and Similarity Transformations Algebraic and geometric multiplicity of eigenvalues Linear independence of eigenvectors depends on the multiplicity of the eigenvalues in a nontrivial way. For multiple eigenvalues we need to distinguish between two kinds of multiplicities. Suppose A C n n has k distinct eigenvalues λ 1,..., λ k with multiplicities a 1,..., a k so that π A (λ) := det(a λi) = (λ 1 λ) a1 (λ k λ) a k, λ i λ j, i j, k a i = n. (5.4) The positive integer a i = a(λ i ) = a A (λ i ) is called the multiplicity, or more precisely the algebraic multiplicity of the eigenvalue λ i. The multiplicity of an eigenvalue is simple (double, triple) if a i is equal to one (two, three). To define a second kind of multiplicity we consider for each λ σ(a) the nullspace ker(a λi) := {x C n : (A λi)x = 0} (5.5) of A λi. The nullspace is a subspace of C n consisting of all eigenvectors of A corresponding to the eigenvalue λ. The dimension of the subspace must be at least one since A λi is singular. Definition 5.14 (Geometric multiplicity) The geometric multiplicity g = g(λ) = g A (λ) of an eigenvalue λ of A is the dimension of the nullspace ker(a λi). i=1 Example 5.15 (Geometric multiplicity) The n n identity matrix I has the eigenvalue λ = 1 with π I (λ) = (1 λ) n. Since I λi is the zero matrix when λ = 1, the nullspace of I λi is all of n-space and it follows[ that ] a = g = n. On the other hand we saw in Example 5.2 that the 1 1 matrix J := has the eigenvalue λ = 1 with a = 2 and any eigenvector is a 0 1 multiple of e 1. Thus g = 1. Theorem 5.16 (Geometric multiplicity of similar matrices) Similar matrices have the same eigenvalues with the same algebraic and geometric multiplicities. Proof. Similar matrices have the same characteristic polynomials and only the invariance of geometric multiplicity needs to be shown. Suppose λ σ(a), dim ker(s 1 AS λi) = k, and dim ker(a λi) = l. We need to show that k = l. Suppose v 1,..., v k is a basis for ker(s 1 AS λi). Then S 1 ASv i = λv i

139 5.2. Geometric multiplicity of eigenvalues and the Jordan Form 127

or ASv_i = λSv_i, i = 1,..., k. But then {Sv_1,..., Sv_k} ⊂ ker(A - λI), which implies that k ≤ l. Similarly, if w_1,..., w_l is a basis for ker(A - λI) then {S^{-1}w_1,..., S^{-1}w_l} ⊂ ker(S^{-1}AS - λI), which implies that l ≤ k. We conclude that k = l.

Exercise 5.17 (Find eigenpair example) Find eigenvalues and eigenvectors of A = . Is A defective?

5.2.2 The Jordan form

[Photographs: Marie Ennemond Camille Jordan (left) and William Rowan Hamilton (right).]

Definition 5.18 (Jordan block) A Jordan block of order m, denoted J_m(λ), is an m × m matrix of the form

J_m(λ) := λ I_m + E_m,   (5.6)

where E_m has ones on the superdiagonal and zeros elsewhere, so that J_m(λ) has λ on the diagonal and ones on the superdiagonal. A 3 × 3 Jordan block has the form

J_3(λ) = [λ 1 0; 0 λ 1; 0 0 λ].

Since a Jordan block is upper triangular, λ is an eigenvalue of J_m(λ) and any eigenvector must be a multiple of e_1. Indeed, if J_m(λ)v = λv for some v = [v_1,..., v_m]^T then v_2 = ... = v_m = 0. Thus the eigenvalue λ of J_m(λ) has algebraic multiplicity a = m and geometric multiplicity g = 1.

The Jordan form is a decomposition of a matrix into Jordan blocks.
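A small hedged MATLAB sketch of Definition 5.18 (the block size and eigenvalue are arbitrary choices): build J_m(λ) = λI_m + E_m and read off its two multiplicities.

% A Jordan block J_m(lambda) = lambda*I_m + E_m, with ones on the superdiagonal.
m = 4; lambda = 2;
J = lambda*eye(m) + diag(ones(m-1, 1), 1);
eig(J)                        % lambda with algebraic multiplicity m
m - rank(J - lambda*eye(m))   % dim ker(J - lambda*I): geometric multiplicity 1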

140 128 Chapter 5. Eigenpairs and Similarity Transformations Theorem 5.19 (The Jordan form of a matrix) Suppose A C n n has k distinct eigenvalues λ 1,..., λ k of algebraic multiplicities a 1,..., a k and geometric multiplicities g 1,..., g k. There is a nonsingular matrix S C n n such that J := S 1 AS = diag(u 1,..., U k ), with U i C ai ai, (5.7) where each U i is block diagonal having g i Jordan blocks along the diagonal U i = diag(j mi,1 (λ i ),..., J mi,gi (λ i )). (5.8) Here m i,1,..., m i,gi are integers and they are unique if they are ordered so that m i,1 m i,2 m i,gi. Moreover, a i = g i j=1 m i,j for all i. We note that 1. The matrices S and J in (5.7) are called Jordan factors. We also call J the Jordan form of A. 2. The columns of S are called principal vectors. They satisfy the matrix equation AS = SJ. 3. Each U i is upper triangular with the eigenvalue λ i on the diagonal and consists of g i Jordan blocks. These Jordan blocks can be taken in any order and it is customary to refer to any such block diagonal matrix as the Jordan form of A. Example 5.20 (Jordan form) As an example consider the Jordan form J := diag(u 1, U 2 ) = R 8 8. (5.9) We encountered this matrix in Exercise 5.5. The eigenvalues together with their algebraic and geometric multiplicities can be read off directly from the Jordan form. U 1 = diag(j 3 (2), J 2 (2), J 1 (2)) and U 2 = J 2 (3). 2 is an eigenvalue of algebraic multiplicity 6 and geometric multiplicity 3, the number of Jordan blocks corresponding to λ = 2. 3 is an eigenvalue of algebraic multiplicity 2 and geometric multiplicity 1.

141 5.2. Geometric multiplicity of eigenvalues and the Jordan Form 129 The columns of S = [s 1,..., s 8 ] are called generalized eigenvectors of A. They are found from the columns of J as follows As 1 = 2s 1, As 2 = s 1 + 2s 2, As 3 = s 2 + 2s 3, As 4 = 2s 4, As 5 = s 4 + 2s 5, As 6 = 2s 6, As 7 = 3s 7, As 8 = s 7 + 3s 8. We see that the generalized eigenvector corresponding to the first column in a Jordan block is an eigenvector of A. The remaining generalized eigenvectors are not eigenvectors. The matrix J := is also a Jordan form of A. In any Jordan form of this A the sizes of the 4 Jordan blocks J 3 (2), J 2 (2), J 1 (2), J 2 (3) are uniquely given. The Jordan form implies 12 Corollary 5.21 (Geometric multiplicity) We have 1. The geometric multiplicity of an eigenvalue is always bounded above by the algebraic multiplicity of the eigenvalue. 2. The number of linearly independent eigenvectors of a matrix equals the sum of the geometric multiplicities of the eigenvalues. 3. A matrix A C n n has n linearly independent eigenvectors if and only if the algebraic and geometric multiplicity of all eigenvalues are the same. Proof. 1. The algebraic multiplicity a i of an eigenvalue λ i is equal to the size of the corresponding U i. Moreover each U i contains g i Jordan blocks of size m i,j 1. Thus g i a i. 2. Since A and J are similar the geometric multiplicities of the eigenvalues of these matrices are the same, and it is enough to prove statement 2 for the 12 Corollary 5.21 can also be shown without using the Jordan form, see [16].

142 130 Chapter 5. Eigenpairs and Similarity Transformations Jordan factor J. We show this only for the matrix J given by (5.9). The general case should then be clear. There are only 4 eigenvectors of J, namly e 1, e 4, e 6, e 7 corresponding to the 4 Jordan blocks. These 4 vectors are clearly linearly independent. Moreover there are k = 2 distinct eigenvalues and g 1 + g 2 = = Since g i a i for all i and i a i = n we have i g i = n if and only if a i = g i for i = 1,..., k. Exercise 5.22 (Jordan example) For the Jordan form of the matrix A = [ ] we have J = [ ] Find S Exercise 5.23 (A nilpotent matrix) Show that (J m (λ) λi) r = [ ] 0 I m r 0 0 for 1 r m 1 and conclude that (J m (λ) λi) m = 0. Exercise 5.24 (Properties of the Jordan form) Let J be the Jordan form of a matrix A C n n as given in Theorem Then for r = 0, 1, 2,..., m = 2, 3,..., and any λ C 1. A r = SJ r S 1, 2. J r = diag(u r 1,..., U r k), 3. U r i = diag(j mi,1 (λ i ) r,..., J mi,gi (λ i ) r ), 4. J m (λ) r = (E m + λi m ) r = min{r,m 1} ( r ) k=0 k λ r k E k m. Exercise 5.25 (Powers of a Jordan block) Find J 100 and A 100 for the matrix in Exercise Exercise 5.26 (The minimal polynomial) Let J be the Jordan form of a matrix A C n n as given in Theorem The polynomial k µ A (λ) := (λ i λ) mi where m i := max m i,j, (5.10) 1 j g i i=1 is called the minimal polynomial of A. We define the matrix polynomial µ A (A) by replacing the factors λ i λ by λ i I A.

143 5.3. The Schur decomposition and normal matrices

1. We have π_A(λ) = Π_{i=1}^{k} Π_{j=1}^{g_i} (λ_i - λ)^{m_{i,j}}. Use this to show that the minimal polynomial divides the characteristic polynomial, i.e., π_A = µ_A ν_A for some polynomial ν_A.

2. Show that µ_A(A) = 0 if and only if µ_A(J) = 0.

3. (can be difficult) Use Exercises 5.23, 5.24 and the maximality of m_i to show that µ_A(A) = 0. Thus a matrix satisfies its minimal equation. Finally show that the degree of any polynomial p such that p(A) = 0 is at least as large as the degree of the minimal polynomial.

4. Use 2. to show the Cayley-Hamilton Theorem, which says that a matrix satisfies its characteristic equation π_A(A) = 0.

Exercise 5.27 (Big Jordan example) Find the Jordan form of the matrix A given by (5.11).

5.3 The Schur decomposition and normal matrices

[Photographs: Issai Schur (left) and John William Strutt, Lord Rayleigh (right), the English physicist after whom the Rayleigh quotient (see Section 5.4.1) is named.]

144 132 Chapter 5. Eigenpairs and Similarity Transformations The Schur decomposition We turn now to unitary similarity transformations S 1 AS, where S = U is unitary. Thus S 1 = U and a unitary similarity transformation takes the form U AU Unitary and orthogonal matrices Although not every matrix can be diagonalized it can be brought into triangular form by a unitary similarity transformation. Theorem 5.28 (Schur decomposition) For each A C n n there exists a unitary matrix U C n n such that R := U AU is upper triangular. The matrices U and R in the Schur decomposition are called Schur factors. We call A = URU the Schur factorization of A. Proof. We use induction on n. For n = 1 the matrix U is the 1 1 identity matrix. Assume that the theorem is true for matrices of order k and suppose A C n n, where n := k + 1. Let (λ 1, v 1 ) be an eigenpair for A with v 1 2 = 1. By Theorem 4.10 we can extend v 1 to an orthonormal basis {v 1, v 2,..., v n } for C n. The matrix V := [v 1,..., v n ] C n n is unitary, and V AV e 1 = V Av 1 = λ 1 V v 1 = λ 1 e 1. It follows that [ V λ1 x AV = 0 M ], for some M C k k and x C k. (5.12) By the induction hypothesis there is a unitary matrix W 1 C (n 1) (n 1) such that W 1MW 1 is upper triangular. Define [ ] 1 0 W = and U = V W. 0 W 1 Then W and U are unitary and [ ] [ U AU = W (V 1 0 λ1 x AV )W = 0 W 1 0 M [ ] λ1 x = W 1 0 W 1MW 1 is upper triangular. ] [ W 1 ]
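In MATLAB a Schur decomposition can be computed with schur; a brief hedged sketch with a random test matrix of our choosing:

% Schur decomposition A = U R U^* with R upper triangular (Theorem 5.28).
% For a real A the default is the real quasi-triangular form of Section 5.3.3;
% the 'complex' flag always gives a triangular R.
A = randn(5);
[U, R] = schur(A, 'complex');
norm(U'*U - eye(5))        % U is unitary
norm(U*R*U' - A)           % A = U R U^*
norm(tril(R, -1))          % R is upper triangular
sort(diag(R))              % the eigenvalues of A appear on the diagonal of R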

145 5.3. The Schur decomposition and normal matrices 133 If A has complex eigenvalues then U will be complex even if A is real. The following is a real version of Theorem Theorem 5.29 (Schur form, real eigenvalues) For each A R n n with real eigenvalues there exists an orthogonal matrix U R n n such that U T AU is upper triangular. Proof. Consider the proof of Theorem Since A and λ 1 are real the eigenvector v 1 is real and the matrix W is real and W T W = I. By the induction hypothesis V is real and V T V = I. But then also U = V W is real and U T U = I. A real matrix with some complex eigenvalues can only be reduced to block triangular form by a real unitary similarity transformation. We consider this in Section Exercise 5.30 (Schur decomposition example) Show that a Schur decomposition of A = [ ] is U T AU = [ ] 1 1 [ 0 4, where U = ] Exercise 5.31 (Deflation example) By using the unitary transformation V on the n n matrix A, we obtain a matrix M of order n 1. M has the same eigenvalues as A except λ. Thus we can find another eigenvalue of A by working with a smaller matrix M. This is an example of a deflation technique which is very useful in numerical work. The second derivative matrix T := [ ] has an eigenpair (2, x 1 ), where x 1 = [ 1, 0, 1] T. Find the remaining eigenvalues using deflation. Hint: We can extend x 1 to a basis {x 1, x 2, x 3 } for R 3 by defining x 2 = [0, 1, 0] T, x 3 = [1, 0, 1] T. This is already an orthogonal basis and normalizing we obtain the orthogonal matrix V = We obtain (5.12) with λ = 2 and [ M = ]. We can now find the remaining eigenvalues of A from the 2 2 matrix M.

146 134 Chapter 5. Eigenpairs and Similarity Transformations Normal matrices A matrix A C n n is normal if A A = AA. Examples of normal matrices are 1. A = A, (Hermitian) 2. A = A, (Skew-Hermitian) 3. A = A 1, (Unitary) 4. A = diag(d 1,..., d n ). (Diagonal) If A is diagonal then A A = diag(d 1 d 1,..., d n d n ) = diag( d 1 2,..., d n 2 ) = AA, and A is normal. The 2. derivative matrix T in (1.23) is normal. The eigenvalues of a normal matrix can be complex (cf. Exercise 5.35). However in the Hermitian case the eigenvalues are real (cf. Lemma 1.32). The following theorem shows that A has a set of orthonormal eigenvectors if and only if it is normal. Theorem 5.32 (Spectral theorem for normal matrices) A matrix A C n n is normal if and only if there exists a unitary matrix U C n n such that U AU = D is diagonal. If D = diag(λ 1,..., λ n ) and U = [u 1,..., u n ] then (λ j, u j ), j = 1,..., n are orthonormal eigenpairs for A. Proof. If B = U AU, with B diagonal, and U U = I, then A = UBU and AA = (UBU )(UB U ) = UBB U and A A = (UB U )(UBU ) = UB BU. Now BB = B B since B is diagonal, and A is normal. Conversely, suppose A A = AA. By Theorem 5.28 we can find U with U U = I such that B := U AU is upper triangular. Since A is normal B is normal. Indeed, BB = U AUU A U = U AA U = U A AU = B B. The proof is complete if we can show that an upper triangular normal matrix B must be diagonal. The diagonal elements in E := B B and F := BB are given by n i n n e ii = b ki b ki = b ki 2 and f ii = b ik b ik = b ik 2. k=1 k=1 k=1 k=i

147 5.3. The Schur decomposition and normal matrices 135 The result now follows by equating e ii and f ii for i = 1, 2,..., n. In particular for i = 1 we have b 11 2 = b b b 1n 2, so b 1k = 0 for k = 2, 3,..., n. Suppose B is diagonal in its first i 1 rows so that b jk = 0 for j = 1,..., i 1, k = j+1,..., n. Then e ii = i b ki 2 = b ii 2 = k=1 n b ik 2 = f ii and it follows that b ik = 0, k = i+1,..., n. By induction on the rows we see that B is diagonal. The last part of the theorem follows from Section k=i Example 5.33 The orthogonal diagonalization of A = [ 2 [ diag(1, 3), where U = ] ] is U T AU = Exercise 5.34 (Skew-Hermitian matrix) Suppose C = A + ib, where A, B R n n. Show that C is skew-hermitian if and only if A T = A and B T = B. Exercise 5.35 (Eigenvalues of a skew-hermitian matrix) Show that any eigenvalue of a skew-hermitian matrix is purely imaginary. Exercise 5.36 (Eigenvector expansion using orthogonal eigenvectors) Show that if the eigenpairs (λ 1, u 1 ),..., (λ n, u n ) of A C n n are orthogonal, i. e., u j u k = 0 for j k then the eigenvector expansions of x, Ax C n take the form n n x = c j u j, Ax = c j λ j u j, where c j = u j x u j u. (5.13) j j=1 j= The quasi triangular form How far can we reduce a real matrix A with some complex eigenvalues by a real unitary similarity transformation? To study this we note that the complex eigenvalues of a real matrix occur in conjugate pairs, λ = µ + iν, λ = µ iν, where µ, ν are real. The real 2 2 matrix [ ] µ ν M = (5.14) ν µ has eigenvalues λ = µ + iν and λ = µ iν.

148 136 Chapter 5. Eigenpairs and Similarity Transformations Definition 5.37 (Quasi-triangular matrix) We say that a matrix is quasi-triangular if it is block triangular with only 1 1 and 2 2 blocks on the diagonal. Moreover, no 2 2 block should have real eigenvalues. As an example consider the matrix D 1 R 1,2 R 1,3 [ R := 0 D 2 R 2,3 2 1, D 1 := D 3 Since R is block triangular π R = π D1 π D2 π D3. We find ], D 2 := [ 1 ] [ 3 2, D 3 := 1 1 π D1 (λ) = π D3 (λ) = λ 2 4λ + 5, π D2 (λ) = λ 1, and the eigenvalues D 1 and D 3 are λ 1 = 2+i, λ 2 = 2 i, while D 2 obviously has the eigenvalue λ = 1. Any A R n n can be reduced to quasi-triangular form by a real orthogonal similarity transformation. For a proof see [31]. We will encounter the quasi triangular form in Chapter Hermitian Matrices The special cases where A is Hermitian, or real and symmetric, deserve special attention. Theorem 5.38 (Spectral theorem, complex form) Suppose A C n n is Hermitian. Then A has real eigenvalues λ 1,..., λ n. Moreover, there is a unitary matrix U C n n such that U AU = diag(λ 1,..., λ n ). For any such U the columns {u 1,..., u n } of U are orthonormal eigenvectors of A and Au j = λ j u j for j = 1,..., n. Proof. That the eigenvalues are real was shown in Lemma The rest follows from Theorem There is also a real version. Theorem 5.39 (Spectral Theorem (real form)) Suppose A R n n is symmetric. Then A has real eigenvalues λ 1, λ 2,..., λ n. There is an orthogonal matrix U R n n such that U T AU = diag(λ 1, λ 2,..., λ n ). For any such U the columns {u 1,..., u n } of U are orthonormal eigenvectors of A and Au j = λ j u j for j = 1,..., n. Proof. Since a real symmetric matrix has real eigenvalues and eigenvectors this follows from Theorem ].
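A hedged numerical illustration of Theorem 5.39 (the symmetric test matrix is random): for real symmetric input MATLAB's eig returns an orthogonal eigenvector matrix.

% Spectral theorem, real form: A = U*D*U' with U orthogonal and D real diagonal.
A = randn(4); A = (A + A')/2;   % a random symmetric test matrix
[U, D] = eig(A);
norm(U'*U - eye(4))             % U is orthogonal
norm(U*D*U' - A)                % A = U D U^T
diag(D)'                        % the real eigenvalues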

149 5.4. Hermitian Matrices The Rayleigh Quotient The Rayleigh quotient is an important tool when studying eigenvalues. Definition 5.40 (Rayleigh quotient) For A C n n and a nonzero x the number is called a Rayleigh quotient. R(x) = R A (x) := x Ax x x If (λ, x) is an eigenpair for A then R(x) = x Ax x x = λ. Equation (5.15) in the following lemma shows that the Rayleigh quotient of a normal matrix is a convex combination of its eigenvalues. Lemma 5.41 (Convex combination of the eigenvalues) Suppose A C n n is normal with orthonormal eigenpairs (λ j, u j ), j = 1, 2,..., n. Then the Rayleigh quotient is a convex combination of the eigenvalues of A n i=1 R A (x) = λ i c i 2 n j=1 c j, x 0, x = n c 2 j u j. (5.15) Proof. By orthonormality of the eigenvectors x x = n n i=1 j=1 c iu i c j u j = n j=1 c j 2. Similarly, x Ax = n n i=1 j=1 c iu i c j λ j u j = n i=1 λ i c i 2. and (5.15) follows. This is clearly a combination of nonnegative quantities and a convex combination since n i=1 c i 2 / n j=1 c j 2 = 1. j= Minmax Theorems There are some useful characterizations of the eigenvalues of a Hermitian matrix. First we show Theorem 5.42 (Minmax) Suppose A C n n is Hermitian with eigenvalues λ 1,..., λ n, ordered so that λ 1 λ n. Let 1 k n. For any subspace S of C n of dimension n k + 1 λ k max R(x), (5.16) x S x 0 with equality for S = S := span(u k,..., u n ) and x = u k. Here (λ j, u j ), 1 j n are orthonormal eigenpairs for A.

150 138 Chapter 5. Eigenpairs and Similarity Transformations Proof. Let S be any subspace of C n of dimension n k + 1 and define S := span(u 1,..., u k ). We need to find y S so that R(y) λ k. Now S + S := {s + s : s S, s S } is a subspace of C n and by (7) dim(s S ) = dim(s) + dim(s ) dim(s + S ) (n k + 1) + k n = 1. It follows that S S is nonempty. Let y S S = k j=1 c ju j with k j=1 c j 2 = 1. Defining c j = 0 for k + 1 j n, we obtain by Lemma 5.41 max R(x) R(y) = x S x 0 n λ j c j 2 = j=1 k λ j c j 2 j=1 k λ k c j 2 = λ k, j=1 and (5.16) follows. If y S, say y = n j=k d ju j with n by Lemma 5.41 R(y) = n j=k λ j d j 2 have max x S x 0 R(u k ) = λ k. j=k d j 2 = 1 then again λ k, and since y S is arbitrary we R(x) λ k and equality in (5.16) follows for S = S. There is also a maxmin version of this result. Moreover, Theorem 5.43 (Maxmin) Suppose A C n n is Hermitian with eigenvalues λ 1,..., λ n, ordered so that λ 1 λ n. Let 1 k n. For any subspace S of C n of dimension k λ k min R(x), (5.17) x S x 0 with equality for S = S := span(u 1,..., u k ) and x = u k. Here (λ j, u j ), 1 j n are orthonormal eigenpairs for A. Proof. The proof is very similar to the proof of Theorem We define S := span(u k,..., u n ) and show that R(y) λ k for some y S S. It is easy to see that R(y) λ k for any y S. These theorems immediately lead to classical minmax and maxmin characterizations. Corollary 5.44 (The Courant-Fischer Theorem ) Suppose A C n n is Hermitian with eigenvalues λ 1,..., λ n, ordered so that λ 1 λ n. Then λ k = min max R(x) = dim(s)=n k+1 x S x 0 max min R(x), k = 1,..., n. (5.18) dim(s)=k x S x 0
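For the extreme indices k = 1 and k = n, (5.18) says that the Rayleigh quotient of a Hermitian matrix always lies between λ_n and λ_1. A small hedged check with random vectors (matrix and sample size are our choices):

% lambda_n <= R_A(x) <= lambda_1 for every nonzero x (the cases k = 1, n of (5.18)).
A = randn(5); A = (A + A')/2;
lam = sort(eig(A), 'descend');            % lambda_1 >= ... >= lambda_n
R = @(x) (x'*A*x)/(x'*x);                 % the Rayleigh quotient
vals = zeros(1000, 1);
for i = 1:1000
    vals(i) = R(randn(5, 1));
end
[lam(end), min(vals), max(vals), lam(1)]  % observed values lie in [lambda_n, lambda_1]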

151 5.4. Hermitian Matrices 139 Using Theorem 5.42 we can prove inequalities of eigenvalues without knowing the eigenvectors and we can get both upper and lower bounds. Theorem 5.45 (Eigenvalue perturbation for Hermitian matrices) Let A, B C n n be Hermitian with eigenvalues α 1 α 2 α n and β 1 β 2 β n. Then α k + ε n β k α k + ε 1, for k = 1,..., n, (5.19) where ε 1 ε 2 ε n are the eigenvalues of E := B A. Proof. Since E is a difference of Hermitian matrices it is Hermitian and the eigenvalues are real. Let (α j, u j ), j = 1,..., n be orthonormal eigenpairs for A and let S := span{u k,..., u n }. By Theorem 5.42 we obtain β k max x S x 0 R B (x) max x S x 0 R A (x)+max x S x 0 R E (x) max x S x 0 R A (x)+max R E (x) = α k +ε 1, x C n x 0 and this proves the upper inequality. For the lower one we define D := E and observe that ε n is the largest eigenvalue of D. Since A = B + D it follows from the result just proved that α k β k ε n, which is the same as the lower inequality. In many applications of this result the eigenvalues of the matrix E will be small and then the theorem states that the eigenvalues of B are close to those of A. Moreover, it associates a unique eigenvalue of A with each eigenvalue of B. Exercise 5.46 (Eigenvalue perturbation for Hermitian matrices) Show that in Theorem 5.45, if E is symmetric positive semidefinite then β i α i The Hoffman-Wielandt Theorem We can also give a bound involving all eigenvalues. The following theorem shows that the eigenvalue problem for a normal matrix is well conditioned. Theorem 5.47 (Hoffman-Wielandt Theorem ) Suppose A, B C n n are both normal matrices with eigenvalues λ 1,..., λ n and µ 1,..., µ n, respectively. Then there is a permutation i 1,..., i n of 1, 2,..., n such that n n n µ ij λ j 2 a ij b ij 2. (5.20) j=1 i=1 j=1 For a proof of this theorem see [[30], p. 190]. For a Hermitian matrix we can use the identity permutation if we order both set of eigenvalues in nonincreasing or nondecreasing order.
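Both perturbation results are easy to test. The following hedged sketch (sizes and perturbation level are our choices) checks the bounds (5.19) and, for the Hermitian case with both spectra sorted, the Hoffman-Wielandt bound (5.20) with the identity permutation.

% Perturbation bounds for Hermitian matrices: (5.19) and (5.20).
n = 6;
A = randn(n); A = (A + A')/2;
E = 1e-3*randn(n); E = (E + E')/2;
B = A + E;
alpha = sort(eig(A), 'descend');
beta  = sort(eig(B), 'descend');
epsE  = sort(eig(E), 'descend');
all(beta <= alpha + epsE(1) + 1e-12)          % upper bound in (5.19)
all(beta >= alpha + epsE(n) - 1e-12)          % lower bound in (5.19)
sum((beta - alpha).^2) <= norm(E, 'fro')^2    % Hoffman-Wielandt for the Hermitian case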

152 140 Chapter 5. Eigenpairs and Similarity Transformations Exercise 5.48 (Hoffman-Wielandt) Show that (5.20) does not hold for the matrices A := [ ] and B := [ ] Why does this not contradict the Hoffman-Wielandt theorem? 5.5 Left Eigenvectors Definition 5.49 (Left and right eigenpairs) Suppose A C n n is a square matrix, λ C and y C n. We say that (λ, y) is a left eigenpair for A if y A = λy or equivalently A y = λy, and y is nonzero. We say that (λ, y) is a right eigenpair for A if Ay = λy and y is nonzero. If (λ, y) is a left eigenpair then λ is called a left eigenvalue and y a left eigenvector. Similarly if (λ, y) is a right eigenpair then λ is called a right eigenvalue and y a right eigenvector. In this book an eigenpair will always mean a right eigenpair. A left eigenvector is an eigenvector of A. If λ is a left eigenvalue of A then λ is an eigenvalue of A and then λ is an eigenvalue of A (cf. Exercise 5.7). Thus left and right eigenvalues are identical, but left and right eigenvectors are in general different. For a Hermitian matrix the right and left eigenpairs are the same. Using right and left linearly independent eigenpairs we get some useful eigenvector expansions. Theorem 5.50 (Biorthogonal eigenvector expansion) If A C n n has linearly independent right eigenvectors {x 1,..., x n } then there exists a set of left eigenvectors {y 1,..., y n } with y i x j = δ i,j. Conversely, if A C n n has linearly independent left eigenvectors {y 1,..., y n } then there exists a set of right eigenvectors {x 1,..., x n } with y i x j = δ i,j. For any scaling of these sets we have the eigenvector expansions v = n y j v y j x x j = j j=1 n x k v y k x y k. (5.21) k Proof. For any right eigenpairs (λ 1, x 1 ),..., (λ n, x n ) and left eigenpairs (λ 1, y 1 ),..., (λ n, y n ) of A we have AX = XD, Y A = DY, where k=1 X := [x 1,..., x n ], Y := [y 1,..., y n ], D := diag(λ 1,..., λ n ). Suppose X is nonsingular. Then AX = XD = A = XDX 1 = X 1 A = DX 1 and it follows that Y := X 1 contains a collection of left eigenvectors such that Y X = I. Thus the columns of Y are linearly independent and y i x j = δ i,j. Similarly, if Y is nonsingular then AY = Y D and it follows that X := Y contains a collection of linearly independent right eigenvectors such that Y X = I. If v = n j=1 c jx j then y i v = n j=1 c jy i x j = c i y i x i, so c i =

153 5.5. Left Eigenvectors 141 y i v/y i x i for i = 1,..., n and the first expansion in (5.21) follows. The second expansion follows similarly. For a Hermitian matrix the right eigenvectors {x 1,..., x n } are also left eigenvectors and (5.21) takes the form n x j v = v x j x x j. (5.22) j j=1 Exercise 5.51 (Biorthogonal expansion) Determine right and left eigenpairs for the matrix A := [ ] and the two expansions in (5.21) for any v R Biorthogonality Left- and right eigenvectors corresponding to distinct eigenvalues are orthogonal. Theorem 5.52 (Biorthogonality) Suppose (µ, y) and (λ, x) are left and right eigenpairs of A C n n. If λ µ then y x = 0. Proof. Using the eigenpair relation in two ways we obtain y Ax = λy x = µy x and we conclude that y x = 0. Right and left eigenvectors corresponding to the same eigenvalue are sometimes orthogonal, sometimes not. Theorem 5.53 (Simple eigenvalue) Suppose (λ, x) and (λ, y) are right and left eigenpairs of A C n n. If λ has algebraic multiplicity one then y x 0. Proof. Assume that x 2 = 1. We have (cf. (5.12)) [ ] V λ z AV =, 0 M where V is unitary and V e 1 = x. We show that if y x = 0 then λ is also an eigenvalue of M contradicting the multiplicity assumption of λ. Let u := V y. Then (V A V )u = V A y = λv y = λu, so (λ, u) is an eigenpair of V A V. But then y x = u V V e 1 = u e 1. Suppose that u e 1 = 0, i. e., u = [ v 0 ] for some nonzero v Cn 1. Then [ ] [ [ ] [ V A λ V u = z M v] = M = λ v v]

154 142 Chapter 5. Eigenpairs and Similarity Transformations and λ is an eigenvalue of M. The case with multiple eigenvalues is more complicated. For example, the matrix A := [ ] has one eigenvalue λ = 1 of algebraic multiplicity two, one right eigenvector x = e 1 and one left eigenvector y = e 2. Thus x and y are orthogonal. Exercise 5.54 (Generalized Rayleigh quotient) For A C n n and any y, x C n with y x 0 the quantity R(y, x) = R A (y, x) := y Ax y x is called a generalized Rayleigh quotient for A. Show that if (λ, x) is a right eigenpair for A then R(y, x) = λ for any y with y x 0. Also show that if (λ, y) is a left eigenpair for A then R(y, x) = λ for any x with y x Review Questions Does A, A T and A have the same eigenvalues? What about A A and AA? What is the geometric multiplicity of an eigenvalue? Can it be bigger than the algebraic multiplicity? What is the Jordan form of a matrix? What are the eigenvalues of a diagonal matrix? What are the Schur factors of a matrix? What is a quasi-triangular matrix? Give some classes of normal matrices. Why are normal matrices important? State the Courant-Fischer theorem State the Hoffman-Wielandt theorem for Hermitian matrices What is a left eigenvector of a matrix?

155 Chapter 6 The Singular Value Decomposition The singular value decomposition is useful both for theory and practice. Some of its applications include solving over-determined equations, principal component analysis in statistics, numerical determination of the rank of a matrix, algorithms used in search engines, and the theory of matrices. We know from Theorem 5.32 that a square matrix A can be diagonalized by a unitary similarity transformation if and only if it is normal, that is A A = AA. In particular, if A C n n is normal then it has a set of orthonormal eigenpairs (λ 1, u 1 ),..., (λ n, u n ). Letting U := [u 1,..., u n ] C n n and D := diag(λ 1,..., λ n ) we have the spectral decomposition A = UDU, where U U = I. (6.1) The singular value decomposition (SVD) is a decomposition of a matrix in the form A = UΣV, where U and V are unitary, and Σ is a nonnegative diagonal matrix, i. e., Σ ij = 0 for all i j and σ i := Σ ii 0 for all i. The diagonal elements σ i are called singular values, while the columns of U and V are called singular vectors. To be a singular value decomposition the singular values should be ordered, i. e., σ i σ i+1 for all i. Example 6.1 (SVD) The following is a singular value decomposition of a rectangular matrix. A = = [ ] 3 4 = UΣV. (6.2) Indeed, U and V are unitary since the columns (singular vectors) are orthonormal, and Σ is a nonnegative diagonal matrix with singular values σ 1 = 2 and σ 2 =

156 144 Chapter 6. The Singular Value Decomposition Exercise 6.2 (SVD1) Show that the decomposition [ ] 1 1 A := = 1 [ ] [ ] [ ] is both a spectral decomposition and a singular value decomposition. = UDU T (6.3) Exercise 6.3 (SVD2) Show that the decomposition [ ] 1 1 A := = 1 [ ] [ ] [ ] =: UΣV T (6.4) is a singular value decomposition. Show that A is defective so it cannot be diagonalized by any similarity transformation. 6.1 The SVD always exists The singular value decomposition is closely related to the eigenpairs of A A and AA The matrices A A, AA Theorem 6.4 (The matrices A A, AA ) Suppose m, n N and A C m n. 1. The matrices A A C n n and AA C m m have the same nonzero eigenvalues with the same algebraic multiplicities. Moreover the extra eigenvalues of the larger matrix are all zero. 2. The matrices A A C n n and AA C m m are Hermitian with nonnegative eigenvalues. 3. Let (λ j, v j ) be orthonormal eigenpairs for A A with λ 1 λ r > 0 = λ r+1 = = λ n. Then {Av 1,..., Av r } is an orthogonal basis for the column space span(a) := {Ay C m : y C n } and {v r+1,..., v n } is an orthonormal basis for the nullspace ker(a) := {y C n : Ay = 0}. 4. Let (λ j, u j ) be orthonormal eigenpairs for AA. If λ j > 0, j = 1,..., r and λ j = 0, j = r + 1,..., m then {A u 1,..., A u r } is an orthogonal basis for the column space span(a ) and {u r+1,..., u m } is an orthonormal basis for the nullspace ker(a ).

157 6.1. The SVD always exists The rank of A equals the number of positive eigenvalues of A A and AA. Proof. 1. The characteristic polynomials of the two matrices are closely related as stated in (5.1) and shown there. 2. Clearly A A and AA are Hermitian. If A Av = λv with v 0, then λ = v A Av v v = Av 2 2 v (6.5) and the eigenvalues of A A are nonnegative. By part 1 also AA has nonnegative eigenvalues. 3. By orthonormality of v 1,..., v n we have (Av j ) Av k = v j A Av k = λ k v j v k = 0 for j k, showing that Av 1,..., Av n are orthogonal vectors. Moreover, (6.5) implies that Av 1,..., Av r are nonzero and Av j = 0 for j = r + 1,..., n. In particular, the elements of {Av 1,..., Av r } and {v r+1,..., v n } are linearly independent vectors in span(a) and ker(a), respectively. The proof will be complete once it is shown that span(a) span(av 1,..., Av r ) and ker(a) span(v r+1,..., v n ). Suppose x span(a). Then x = Ay for some y C n, Let y = n j=1 c jv j be an eigenvector expansion of y. Since Av j = 0 for j = r + 1,..., n we obtain x = Ay = n j=1 c jav j = r j=1 c jav j span(av 1,..., Av r ). Finally, if y = n j=1 c jv j ker(a), then we have Ay = r j=1 c jav j = 0, and c 1 = = c r = 0 since Av 1,..., Av r are linearly independent. But then y = n j=r+1 c jv j span(v r+1,..., v n ). 4. Since AA = B B with B := A this follows from part 3 with A = B. 5. By part 1 and 2 A A and AA have the same number r of positive eigenvalues and by part 3 and 4 r is the rank of A. The following theorem shows, in a constructive way, that any matrix has a singular value decomposition. Theorem 6.5 (Existence of SVD) Suppose for m, n, r N that A C m n has rank r, and that (λ j, v j ) are orthonormal eigenpairs for A A with λ 1 λ r > 0 = λ r+1 = = λ n. Define 1. V := [v 1,..., v n ] C n n,

158 146 Chapter 6. The Singular Value Decomposition 2. Σ R m n is a diagonal matrix with diagonal elements σ j := λ j for j = 1,..., min(m, n), 3. U := [u 1,..., u m ] C m m, where u j = σ 1 j Av j for j = 1,..., r and u r+1,..., u m is an extension of u 1,..., u r to an orthonormal basis u 1,..., u m for C m. Then A = UΣV is a singular value decomposition of A. Proof. Let U, Σ, V be as in the theorem. The vectors u 1,..., u r are orthonormal since Av 1,..., Av r are orthogonal and σ j = Av j 2 > 0, j = 1,..., r by (6.5). But then U and V are unitary and Σ is a nonnegative diagonal matrix. Moreover, UΣ = U[σ 1 e 1,..., σ r e r, 0,..., 0] = [σ 1 u 1,..., σ r u r, 0,..., 0] = [Av 1,..., Av n ]. Thus UΣ = AV and since V is unitary we find UΣV = AV V = A and we have an SVD of A with σ 1 σ 2 σ r. Example 6.6 (Find SVD) Derive the SVD in (6.2). Discussion: We have A = and eigenpairs of B := A T A = 1 25 [ 52 ] are found from [ [ [ ] [ ] B = 4, B = 1. 4] 4] 3 3 [ ] 3 4 Thus σ 1 = 2, σ 2 = 1, and V = 1 5. Now u = Au/σ 1 = [1, 2, 2] T /3, u 2 = Av 2 /σ 2 = [2, 2, 1] T /3. For an SVD we also need u 3 which is any vector of length one orthogonal to u 1 and u 2. u 3 = [2, 1, 2] T /3 is such a vector and we obtain the singular value decomposition A = [ ] 3 4. (6.6) Exercise 6.7 (SVD examples) Find the singular value decomposition of the following matrices

159 6.2. Further properties of SVD 147 (a) A = [ 3 4 (b) A = ] Exercise 6.8 (More SVD examples) Find the singular value decomposition of the following matrices (a) A = e 1 the first unit vector in R m. (b) A = e T n the last unit vector in R n. (c) A = [ ]. Exercise 6.9 (Singular values of a normal matrix) Show that 1. the singular values of a normal matrix are the absolute values of its eigenvalues, 2. the singular values of a symmetric positive semidefinite matrix are its eigenvalues. Exercise 6.10 (The matrices A A, AA and SVD) Show the following: If A = UΣV is a singular value decomposition of A C m n then 1. A A = V diag(σ 2 1,..., σ 2 n)v is a spectral decomposition of A A. 2. AA = U diag(σ 2 1,..., σ 2 m)u is a spectral decomposition of AA. 3. The columns of U are orthonormal eigenvectors of AA. 4. The columns of V are orthonormal eigenvectors of A A. 6.2 Further properties of SVD We first consider a reduced SVD that is often convenient.
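In MATLAB, svd(A) returns the full decomposition, while the economy-size call corresponds to the reduced form discussed next; when A has full column rank the two reduced objects coincide with the singular value factorization. A hedged sketch with a small test matrix of our choosing:

% Full SVD versus the economy-size (reduced) form.
A = [1 2; 3 4; 5 6];            % 3 x 2, full column rank
[U, S, V]    = svd(A);          % U is 3x3, S is 3x2, V is 2x2
[U1, S1, V1] = svd(A, 'econ');  % U1 is 3x2, S1 is 2x2: the reduced form
norm(A - U*S*V')                % both reproduce A
norm(A - U1*S1*V1')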

160 148 Chapter 6. The Singular Value Decomposition The singular value factorization Suppose A = UΣV is a singular value decomposition of A of rank r. Consider the block partitions U = [U 1, U 2 ] C m m, U 1 := [u 1,..., u r ], U 2 := [u r+1,..., u m ], V = [V 1, V 2 ] C n n, V 1 := [v 1,..., v r ], V 2 := [v r+1,..., v n ], [ ] Σ1 0 Σ = r,n r R m n, where Σ 0 m r,r 0 1 := diag(σ 1,..., σ r ). m r,n r (6.7) Thus Σ 1 contains the r positive singular values on the diagonal and for k, l 0 the symbol 0 k,l = [ ] denotes the empty matrix if k = 0 or l = 0, and the zero matrix with k rows and l columns otherwise. We obtain by block multiplication a reduced factorization A = UΣV = U 1 Σ 1 V 1. (6.8) As an example: [ ] 1 1 = 1 [ ] [ ] [ ] = 1 [ ] 1 [2 ] 1 [ ] Definition 6.11 (SVF) Let m, n, r N and suppose A C m n with 1 r min(m, n). A singular value factorization (SVF) is a factorization of A C m n of the form A = U 1 Σ 1 V 1, where U 1 C m r and V 1 C n r have orthonormal columns, and Σ 1 R r r is a diagonal matrix with σ 1 σ r > 0. An SVD and an SVF of a matrix A are closely related. 1. Let A have rank r and let A = UΣV be an SVD of A. Then A = U 1 Σ 1 V 1 is an SVF of A. Moreover, U 1, V 1 contain the first r columns of U, V and Σ 1 is a diagonal matrix with the positive singular values on the diagonal. 2. Conversely, suppose A = U 1 Σ 1 V 1 is a singular value factorization of A with Σ 1 R r r. Extend U 1 and V 1 in any way to unitary matrices U C m m and V C n n, and let Σ be given by (6.7). Then A = UΣV is an SVD of A. Moreover, r is uniquely given as the rank of A. 3. If A = [u 1,..., u r ] diag(σ 1,..., σ r )[v 1,..., v r ] is a singular value factorization of A then r A = σ j u j v j. (6.9) j=1 This is known as the outer product form of the SVF.
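The outer product form (6.9) is easily verified numerically; a hedged sketch with a random matrix:

% A = sum_{j=1}^r sigma_j u_j v_j^*  (the outer product form (6.9)).
A = randn(5, 3);
[U, S, V] = svd(A);
r = rank(A);                            % r = 3 for a random 5 x 3 matrix
B = zeros(size(A));
for j = 1:r
    B = B + S(j, j)*U(:, j)*V(:, j)';   % add the rank-one term sigma_j u_j v_j'
end
norm(A - B)                             % (numerically) zero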

161 6.2. Further properties of SVD We note that a nonsingular square matrix has full rank and only positive singular values. Thus the SVD and SVF are the same for a nonsingular matrix. Example 6.12 (r < n < m) Find the SVF and SVD of Discussion: Eigenpairs of are derived from B [ ] 1 = 4 1 A = B := A T A = [ ] 1, B 1 [ ] [ ] 1 = 0 1 [ ] 1 1 and we find σ 1 = 2, σ 2 = 0, Thus r = 1, m = 3, n = 2 and Σ 1 0 Σ = 0 0, Σ 1 = [2], V = 1 [ ] We find u 1 = Av 1 /σ 1 = s 1 / 2, where s 1 = [1, 1, 0] T, and the SVF of A is given by A = [2] 1 [ ] To find an SVD we need to extend u 1 to an orthonormal basis for R 3. We first extend s 1 to a basis {s 1, s 2, s 3 } for R 3, apply the Gram-Schmidt orthogonalization process to {s 1, s 2, s 3 }, and then normalize. Choosing the basis we find from (4.9) s 1 = [ 11 w 1 = s 1, w 2 = s 2 st 2 w 1 w T 1 w w 1 = 1 0 ], s 2 = [ 01 0 ], s 3 = [ 00 1 ], [ ] 1/2 1/2, w 3 = s 3 st 3 w 1 0 w T 1 w w 1 st 3 w 2 1 w T 2 w w 2 = 2 Normalizing the w i s we obtain u 1 = w 1 / w 1 2 = [1/ 2, 1/ 2, 0] T, u 2 = w 2 / w 2 2 = [ 1/ 2, 1/ 2, 0] T, and u 3 = s 3 / s 3 2 = [0, 0, 1] T. Therefore, A = UΣV T, where U := [ 1/ 2 1/ 2 0 1/ 2 1/ ] R 3,3, Σ := [ 00 [ 2 0 ] 0 0 R 3,2, V := 1 [ 1 1 ] R 2, ].

162 150 Chapter 6. The Singular Value Decomposition The method we used to find the singular value decomposition in the examples and exercises can be suitable for hand calculation with small matrices, but it is not appropriate as a basis for a general purpose numerical method. In particular, the Gram-Schmidt orthogonalization process is not numerically stable, and forming A A can lead to extra errors in the computation. Standard computer implementations of the singular value decomposition ([31] ) first reduces A to bidiagonal form and then use an adapted version of the QR algorithm where the matrix A A is not formed. The QR algorithm is discussed in Chapter 14. Exercise 6.13 (Nonsingular matrix) [ ] Derive the SVF and SVD of the matrix 13 A = Also find its spectral decomposition UDU T. The matrix A is normal, but the spectral decomposition is not an SVD. Why? Exercise 6.14 (Full row rank) Find 14 the SVF and SVD of A := 1 15 [ ] R SVD and the Four Fundamental Subspaces The singular vectors form orthonormal bases for the four fundamental subspaces span(a), ker(a), span(a ), and ker(a ). Theorem 6.15 (Singular vectors and orthonormal bases) For positive integers m, n let A C m n have rank r and a singular value decomposition A = [u 1,..., u m ]Σ[v 1,..., v n ] = UΣV. Then the singular vectors satisfy Av i = σ i u i, i = 1,..., r, Av i = 0, i = r + 1,..., n, A u i = σ i v i, i = 1,..., r, A (6.10) u i = 0, i = r + 1,..., m. Moreover, 1. {u 1,..., u r } is an orthonormal basis for span(a), 2. {u r+1,..., u m } is an orthonormal basis for ker(a ), 3. {v 1,..., v r } is an orthonormal basis for span(a ), 4. {v r+1,..., v n } is an orthonormal basis for ker(a). [ ] [ ] [ ] 13 Answer: A = Hint: Take the transpose of the matrix in (6.2) (6.11)

163 6.2. Further properties of SVD 151 Proof. If A = UΣV then AV = UΣ, or in terms of the block partition (6.7) A[V 1, V 2 ] = [U 1, U 2 ] [ ] Σ But then AV 1 = U 1 Σ 1, AV 2 = 0, and this implies the first part of (6.10). Taking conjugate transpose of A = UΣV gives A = V Σ T U or A U = V Σ T. Using the block partition as before we obtain the last part of (6.10). It follows from Theorem 6.4 that {Av 1,..., Av r } is an orthogonal basis for span(a), {A u 1,..., A u r } is an orthogonal basis for span(a ), {v r+1,..., v m } is an orthonormal basis for ker(a) and{u r+1,..., u m } is an orthonormal basis for ker(a ). By (6.10) {u 1,..., u r } is an orthonormal basis for span(a) and {v 1,..., v r } is an orthonormal basis for span(a ). Exercise 6.16 (Counting dimensions of fundamental subspaces) Suppose A C m n. Show using SVD that 1. rank(a) = rank(a ). 2. rank(a) + null(a) = n, 3. rank(a) + null(a ) = m, where null(a) is defined as the dimension of ker(a). Exercise 6.17 (Rank and nullity relations) Use Theorem 6.4 to show that for any A C m n 1. rank A = rank(a A) = rank(aa ), 2. null(a A) = null A, and null(aa ) = null(a ). Exercise 6.18 (Orthonormal bases example) Let A and B be as in Example 6.6. Give orthonormal bases for span(b) and ker(b). Exercise 6.19 (Some spanning sets) Show for any A C m n that span(a A) = span(v 1 ) = span(a ) Exercise 6.20 (Singular values and eigenpair of composite matrix) Let A C m n with m n have singular values σ 1,..., σ n, left singular vectors u 1,..., u m C m, and right singular vectors v 1,..., v n C n. Show that the matrix [ ] 0 A C := A 0

164 152 Chapter 6. The Singular Value Decomposition has the n + m eigenpairs where p i = {(σ 1, p 1 ),..., (σ n, p n ), ( σ 1, q 1 ),..., ( σ n, q n ), (0, r n+1 ),..., (0, r m )}, [ ui v i ], q i = [ ui ], r v j = i A Geometric Interpretation [ ] uj, for i = 1,..., n and j = n + 1,..., m. 0 The singular value decomposition and factorization give insight into the geometry of a linear transformation. Consider the linear transformation T : R n R m given by T z := Az where A R m n. Assume that rank(a) = n. The function T maps the unit sphere S := {z R n : z 2 = 1} onto an ellipsoid E := AS = {Az : z S} in R m. Theorem 6.21 (SVF ellipse) Suppose A R m,n has rank r = n, and let A = U 1 Σ 1 V T 1 factorization of A. Then be a singular value E = U 1 Ẽ where Ẽ := {y = [y 1,..., y n ] T R n : y2 1 σ y2 n σn 2 = 1}. Proof. Suppose z S. Now Az = U 1 Σ 1 V T 1 z = U 1 y, where y := Σ 1 V T 1 z. Since rank(a) = n it follows that V 1 = V is square so that V 1 V T 1 = I. But then V 1 Σ 1 1 y = z and we obtain 1 = z 2 2 = V 1 Σ 1 1 y 2 2 = Σ 1 1 y 2 2 = y2 1 σ y2 n σn 2. This implies that y Ẽ. Finally, x = Az = U 1Σ 1 V T 1 z = U 1 y, where y Ẽ implies that E = U 1 Ẽ. The equation 1 = y y2 σ1 2 n σ 2 describes an ellipsoid in R n with semiaxes n of length σ j along the unit vectors e j for j = 1,..., n. Since the orthonormal transformation U 1 y x preserves length, the image E = AS is a rotated ellipsoid with semiaxes along the left singular vectors u j = Ue j, of length σ j, j = 1,..., n. Since Av j = σ j u j, for j = 1,..., n the right singular vectors defines points in S that are mapped onto the semiaxes of E. Example 6.22 (Ellipse) Consider the transformation A : R 2 R 2 given by the matrix A := 1 [ ]

165 6.3. Determining the Rank of a Matrix Numerically 153 Figure 6.1: The ellipse y 2 1/9 + y 2 2 = 1 (left) and the rotated ellipse AS (right). in Example Recall that σ 1 = 3, σ 2 = 1, u 1 = [3, 4] T /5 and u 2 = [ 4, 3] T /5. The ellipses y1/σ y2/σ = 1 and E = AS = U 1 Ẽ are shown in Figure 6.1. Since y = U T 1 x = [3/5x 1 + 4/5x 2, 4/5x 1 + 3/5x 2 ] T, the equation for the ellipse on the right is ( 3 5 x x 2) 2 + ( 4 5 x x 2) 2 = 1, Determining the Rank of a Matrix Numerically In many elementary linear algebra courses a version of Gaussian elimination, called Gauss-Jordan elimination, is used to determine the rank of a matrix. To carry this out by hand for a large matrix can be a Herculean task and using a computer and floating point arithmetic the result will not be reliable. Entries, which in the final result should have been zero, will have nonzero values because of round-off errors. As an alternative we can use the singular value decomposition to determine rank. Although success is not at all guaranteed, the result will be more reliable than if Gauss-Jordan elimination is used. By Theorem 6.5 the rank of a matrix is equal to the number of nonzero singular values and if we have computed the singular values, then all we have to do is to count the nonzero ones. The problem however is the same as for Gaussian elimination. Due to round-off errors none of the computed singular values are likely to be zero The Frobenius norm This commonly occurring matrix norm will be used here in a discussion of how many of the computed singular values can possibly be considered to be zero. The

The Frobenius norm of a matrix A ∈ C^{m×n} is defined by
‖A‖_F := ( Σ_{i=1}^m Σ_{j=1}^n |a_{ij}|² )^{1/2}.   (6.12)
There is a relation between the Frobenius norm of a matrix and its singular values. First we derive some elementary properties of this norm. A systematic study of matrix norms is given in the next chapter.

Lemma 6.23 (Frobenius norm properties)
For any m, n ∈ N and any matrix A ∈ C^{m×n}:
1. ‖A^*‖_F = ‖A‖_F,
2. ‖A‖_F² = Σ_{j=1}^n ‖a_{:j}‖_2²,
3. ‖UA‖_F = ‖AV‖_F = ‖A‖_F for any unitary matrices U ∈ C^{m×m} and V ∈ C^{n×n},
4. ‖AB‖_F ≤ ‖A‖_F ‖B‖_F for any B ∈ C^{n×k}, k ∈ N,
5. ‖Ax‖_2 ≤ ‖A‖_F ‖x‖_2 for all x ∈ C^n.

Proof.
1. ‖A^*‖_F² = Σ_{j=1}^n Σ_{i=1}^m |a_{ij}|² = Σ_{i=1}^m Σ_{j=1}^n |a_{ij}|² = ‖A‖_F².
2. This follows since the Frobenius norm is the Euclidian norm of a vector, ‖A‖_F = ‖vec(A)‖_2, where vec(A) ∈ C^{mn} is the vector obtained by stacking the columns of A on top of each other.
3. Recall that if U^*U = I then ‖Ux‖_2 = ‖x‖_2 for all x ∈ C^n. Applying this to each column a_{:j} of A we find ‖UA‖_F² = Σ_{j=1}^n ‖Ua_{:j}‖_2² = Σ_{j=1}^n ‖a_{:j}‖_2² = ‖A‖_F². Similarly, since V^*V = I, we find ‖AV‖_F = ‖V^*A^*‖_F = ‖A^*‖_F = ‖A‖_F, using 1. and the first part.
4. Using the Cauchy–Schwarz inequality and 2. we obtain
‖AB‖_F² = Σ_{i=1}^m Σ_{j=1}^k |a_{i:}^T b_{:j}|² ≤ Σ_{i=1}^m Σ_{j=1}^k ‖a_{i:}‖_2² ‖b_{:j}‖_2² = ‖A‖_F² ‖B‖_F².
5. Since ‖v‖_F = ‖v‖_2 for a vector, this follows by taking k = 1 and B = x in 4.
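The properties in Lemma 6.23 are easy to try out numerically. The following is a minimal MATLAB sketch; the sizes and the random test matrices are arbitrary choices of this example, not taken from the text.

% check some Frobenius norm properties on random test matrices
m = 5; n = 3; k = 4;                    % arbitrary sizes
A = randn(m,n) + 1i*randn(m,n);         % random complex test matrix
B = randn(n,k) + 1i*randn(n,k);
[U,~] = qr(randn(m) + 1i*randn(m));     % random unitary U (m-by-m)
[V,~] = qr(randn(n) + 1i*randn(n));     % random unitary V (n-by-n)
norm(A','fro') - norm(A,'fro')          % property 1: ~0
norm(U*A,'fro') - norm(A,'fro')         % property 3: ~0
norm(A*V,'fro') - norm(A,'fro')         % property 3: ~0
norm(A*B,'fro') <= norm(A,'fro')*norm(B,'fro')   % property 4: prints 1 (true)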

Theorem 6.24 (Frobenius norm and singular values)
We have ‖A‖_F = √(σ_1² + ⋯ + σ_n²), where σ_1, …, σ_n are the singular values of A.

Proof. Using Lemma 6.23 we find ‖A‖_F = ‖U^*AV‖_F = ‖Σ‖_F = √(σ_1² + ⋯ + σ_n²).

Low rank approximation

Suppose m ≥ n ≥ 1 and A ∈ C^{m×n} has a singular value decomposition A = U[D; 0]V^*, where D = diag(σ_1, …, σ_n). We choose ε > 0 and let 1 ≤ r ≤ n be the smallest integer such that σ_{r+1}² + ⋯ + σ_n² < ε². Define A′ := U[D′; 0]V^*, where D′ := diag(σ_1, …, σ_r, 0, …, 0) ∈ R^{n×n}. By Lemma 6.23,
‖A − A′‖_F = ‖U[D − D′; 0]V^*‖_F = ‖[D − D′; 0]‖_F = √(σ_{r+1}² + ⋯ + σ_n²) < ε.
Thus, if ε is small then A is near a matrix A′ of rank r. This can be used to determine rank numerically. We choose an r such that √(σ_{r+1}² + ⋯ + σ_n²) is small. Then we postulate that rank(A) = r since A is close to a matrix of rank r. The following theorem shows that of all m×n matrices of rank r, A′ is closest to A measured in the Frobenius norm.

Theorem 6.25 (Best low rank approximation)
Suppose A ∈ R^{m×n} has singular values σ_1 ≥ ⋯ ≥ σ_n ≥ 0. For any r ≤ rank(A) we have
‖A − A′‖_F = min_{B ∈ R^{m×n}, rank(B) = r} ‖A − B‖_F = √(σ_{r+1}² + ⋯ + σ_n²).
For the proof of this theorem we refer to p. 322 of [31].

Exercise 6.26 (Rank example)
Consider the singular value decomposition
A := =
(a) Give orthonormal bases for span(A), span(A^T), ker(A) and ker(A^T).

(b) Explain why for all matrices B ∈ R^{4×3} of rank one we have ‖A − B‖_F ≥ 6.
(c) Give a matrix A_1 of rank one such that ‖A − A_1‖_F = 6.

Exercise 6.27 (Another rank example)
Let A be the n×n matrix that for n = 4 takes the form
A = [1, −1, −1, −1; 0, 1, −1, −1; 0, 0, 1, −1; 0, 0, 0, 1].
Thus A is upper triangular with diagonal elements one and all elements above the diagonal equal to −1. Let B be the matrix obtained from A by changing the (n, 1) element from zero to −2^{2−n}.
(a) Show that Bx = 0, where x := [2^{n−2}, 2^{n−3}, …, 2^0, 1]^T. Conclude that B is singular, det(A) = 1, and ‖A − B‖_F = 2^{2−n}. Thus even if det(A) is not small, the Frobenius norm of A − B is small for large n, and the matrix A is very close to being singular for large n.
(b) Use Theorem 6.25 to show that the smallest singular value σ_n of A is bounded above by 2^{2−n}.

6.4 Review Questions
Consider an SVD and an SVF of a matrix A. What are the singular values of A? How is the SVD defined? How can we find an SVF if we know an SVD? How can we find an SVD if we know an SVF? What are the relations between the singular vectors? Which singular vectors form bases for span(A) and ker(A^*)? How are the Frobenius norm and singular values related?
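To tie Section 6.3 and Exercise 6.27 together, here is a minimal MATLAB sketch; the size n = 50 and the tolerance are illustrative assumptions of this example. It builds the triangular matrix of the exercise, confirms that det(A) = 1 while the smallest singular value is tiny, and counts singular values above a tolerance to estimate the numerical rank.

n = 50;
A = eye(n) - triu(ones(n),1);   % ones on the diagonal, -1 above (Exercise 6.27)
s = svd(A);
det(A)                          % = 1, yet A is very close to a singular matrix
s(end)                          % smallest singular value; Exercise 6.27 bounds it by 2^(2-n)
tol = n*eps(s(1));              % an illustrative tolerance
sum(s > tol)                    % numerical rank: number of singular values above tol
rank(A)                         % MATLAB's rank uses a similar singular value test
                                % and typically reports n-1 here, despite det(A) = 1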

169 Chapter 7 Norms and perturbation theory for linear systems To measure the size of vector and matrices we use norms. 7.1 Vector Norms Definition 7.1 (Vector norm) A (vector) norm in a real (resp. complex) vector space V is a function : V R that satisfies for all x, y in V and all a in R (resp. C) 1. x 0 with equality if and only if x = 0. (positivity) 2. ax = a x. (homogeneity) 3. x + y x + y. (subadditivity) The triple (V, R, ) (resp. (V, C, )) is called a normed vector space and the inequality 3. is called the triangle inequality. In this book the vector space will be one of R n, C n or one of the matrix spaces R m n, or C m n. Vector addition is defined by element wise addition and scalar multiplication is defined by multiplying every element by the scalar. In this book we will use the following family of vector norms on V = C n and V = R n. 157

Otto Ludwig Hölder (left), Hermann Minkowski (right).

Definition 7.2 (Vector p-norms)
We define for p ≥ 1 and x ∈ R^n or x ∈ C^n the p-norms by
‖x‖_p := ( Σ_{j=1}^n |x_j|^p )^{1/p},   (7.1)
‖x‖_∞ := max_{1≤j≤n} |x_j|.   (7.2)
The most important cases are p = 1, 2, ∞:
1. ‖x‖_1 := Σ_{j=1}^n |x_j|, (the one-norm or l_1-norm)
2. ‖x‖_2 := ( Σ_{j=1}^n |x_j|² )^{1/2}, (the two-norm, l_2-norm, or Euclidian norm)
3. ‖x‖_∞ := max_{1≤j≤n} |x_j|, (the infinity-norm, l_∞-norm, or max norm)

Some remarks are in order.
1. That the Euclidian norm is a vector norm follows from Theorem 4.3. In Section 7.4 we show that the p-norms are vector norms for 1 ≤ p ≤ ∞.
2. The triangle inequality ‖x + y‖_p ≤ ‖x‖_p + ‖y‖_p is called Minkowski's inequality.
3. To prove it one first establishes Hölder's inequality
Σ_{j=1}^n |x_j y_j| ≤ ‖x‖_p ‖y‖_q,  1/p + 1/q = 1,  x, y ∈ C^n.   (7.3)

171 7.1. Vector Norms 159 The relation 1 p + 1 q = 1 means that if p = 1 then q = and if p = 2 then q = 2, and the Hölder s inequality is the same as the Cauchy-Schwarz inequality (cf. Theorem 4.2) for the Euclidian norm. 4. The infinity norm is related to the other p-norms by lim x p = x for all x C n. (7.4) p 5. The equation (7.4) clearly holds for x = 0. For x 0 we write x p := x ( n j=1 ) ( x j ) 1/p p. x Now each term in the sum is not greater than one and at least one term is equal to one, and we obtain x x p n 1/p x, p 1. (7.5) Since lim p n 1/p = 1 for any n N we see that (7.4) follows. We return now to the general case. Definition 7.3 (Equivalent norms) We say that two norms and on V are equivalent if there are positive constants m and M such that for all vectors x V we have m x x M x. (7.6) By (7.5) the p- and -norms are equivalent for any p 1. This result is generalized in the following theorem. Theorem 7.4 (Basic properties of vector norms) The following holds for a normed vector space (V, C, ). 1. x y x y, for all x, y C n (inverse triangle inequality). 2. The vector norm is a continuous function V R. 3. All vector norms on V are equivalent provided V is finite dimensional. Proof. 1. Since x = x y + y x y + y we obtain x y x y. By symmetry x y = y x y x and we obtain the inverse triangle inequality.

172 160 Chapter 7. Norms and perturbation theory for linear systems 2. This follows from the inverse triangle inequality. 3. The following proof can be skipped by those who do not have the necessary background in advanced calculus. Define the unit sphere S := {y V : y = 1}. The set S is a closed and bounded set and the function f : S R given by f(y) = y is continuous by what we just showed. Therefore f attains its minimum and maximum value on S. Thus, there are positive constants m and M such that m y M, y S. (7.7) For any x V one has y := x/ x S, and (7.6) follows if we apply (7.7) to these y. 7.2 Matrix Norms For simplicity we consider only norms on the vector space (C m n, C). All results also holds for (R m n, R). A matrix norm is simply a norm on these vector spaces. The Frobenius norm A F := ( m n a ij 2) 1/2 i=1 j=1 is a matrix norm. Indeed, writing all elements in A in a string of length mn we see that the Frobenius norm is the 2-norm on the Euclidian space C mn. Adapting Theorem 7.4 to the matrix situation gives Theorem 7.5 (Matrix norm equivalence) All matrix norms on C m n are equivalent. Thus, if and are two matrix norms on C m n then there are positive constants µ and M such that µ A A M A holds for all A C m n. Moreover, a matrix norm is a continuous function. Any vector norm V on C mn defines a matrix norm on C m n given by A := vec(a) V, where vec(a) C mn is the vector obtained by stacking the columns of A on top of each other. In particular, to the p vector norms for p = 1, 2,, we have the corresponding sum norm, Frobenius norm, and max norm defined by m n A S := a ij, A F := ( m n a ij 2) 1/2, A M := max a ij. (7.8) i=1 j=1 i=1 j=1 i,j
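As a quick illustration of the vector p-norms and of the sum, Frobenius and max norms in (7.8), here is a minimal MATLAB sketch; the vector and matrix are arbitrary examples chosen for this sketch, not taken from the text.

x = [3; -4; 12];
norm(x,1)                 % one-norm: 19
norm(x,2)                 % two-norm: 13
norm(x,inf)               % infinity-norm: 12
norm(x,3)                 % a general p-norm, here p = 3
A = [1 -2; 3 4];
S = sum(abs(A(:)))        % sum norm: 10
F = norm(A,'fro')         % Frobenius norm: sqrt(30)
M = max(abs(A(:)))        % max norm: 4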

173 7.2. Matrix Norms 161 Ferdinand Georg Frobenius, ( ). Of these norms the Frobenius norm is the most useful. Some of its properties were derived in Lemma 6.23 and Theorem Consistent and subordinate matrix norms Since matrices can be multiplied it is useful to have an analogue of subadditivity for matrix multiplication. For square matrices the product AB is defined in a fixed space C n n, while in the rectangular case matrix multiplication combines matrices in different spaces. The following definition captures this distinction. Definition 7.6 (Consistent matrix norms) A matrix norm is called consistent on C n n if 4. AB A B (submultiplicativity) holds for all A, B C n n. A matrix norm is consistent if it is defined on C m n for all m, n N, and 4. holds for all matrices A, B for which the product AB is defined. Clearly the Frobenius norm is defined for all m, n N. From Lemma 6.23 it follows that the Frobenius norm is consistent. Exercise 7.7 (Consistency of sum norm?) Show that the sum norm is consistent. Exercise 7.8 (Consistency of max norm?) Show that the max norm is not consistent by considering [ ]. Exercise 7.9 (Consistency of modified max norm)

174 162 Chapter 7. Norms and perturbation theory for linear systems (a) Show that the norm A := mn A M, A C m n is a consistent matrix norm. (b) Show that the constant mn can be replaced by m and by n. For a consistent matrix norm on C n n we have the inequality A k A k for k N. (7.9) When working with norms one often has to bound the vector norm of a matrix times a vector by the norm of the matrix times the norm of the vector. This leads to the following definition. Definition 7.10 (Subordinate matrix norms) Suppose m, n N are given, let on C m and β on C n be vector norms, and let be a matrix norm on C m n. We say that the matrix norm is subordinate to the vector norms and β if Ax A x β for all A C m n and all x C n. By Lemma 6.23 we have Ax 2 A F x 2, for all x C n. Thus the Frobenius norm is subordinate to the Euclidian vector norm. Exercise 7.11 (What is the sum norm subordinate to?) Show that the sum norm is subordinate to the l 1 -norm. Exercise 7.12 (What is the max norm subordinate to?) (a) Show that the max norm is subordinate to the and 1 norm, i. e., Ax A M x 1 holds for all A C m n and all x C n. (b) Show that if A M = a kl, then Ae l = A M e l 1. (c) Show that A M = max x 0 Ax x Operator norms Corresponding to vector norms on C n and C m there is an induced matrix norm on C m n which we call the operator norm. It is possible to consider one vector norm on C m and another vector norm on C n, but we treat only the case of one vector norm defined on C n for all n N In the case of one vector norm on C m and another vector norm β on C n we would define A := max x 0 Ax x β.

175 7.2. Matrix Norms 163 Definition 7.13 (Operator norm) Let be a vector norm defined on C n for all n N. For given m, n N and A C m n we define A := max x 0 Ax x. (7.10) We call this the operator norm corresponding to the vector norm. With a risk of confusion we use the same symbol for the operator norm and the corresponding vector norm. Before we show that the operator norm is a matrix norm we make some observations. 1. It is enough to take the max over subsets of C n. For example The set A = max Ax. (7.11) x =1 S := {x C n : x = 1} (7.12) is the unit sphere in C n with respect to the vector norm. It is enough to take the max over this unit sphere since max x 0 Ax x = max x 0 A ( x x ) = max y =1 Ay. 2. The operator norm is subordinate to the corresponding vector norm. Thus, Ax A x for all A C m n and x C n. (7.13) 3. We can use max instead of sup in (7.10). This follows by the following compactness argument. The unit sphere S given by (7.12) is bounded. It is also finite dimensional and closed, and hence compact. Moreover, since the vector norm : S R is a continuous function, it follows that the function f : S R given by f(x) = Ax is continuous. But then f attains its max and min and we have A = Ax for some x S. (7.14) Lemma 7.14 (The operator norm is a matrix norm) For any vector norm the operator norm given by (7.10) is a consistent matrix norm. Moreover, I = 1. Proof. We use (7.11). In 2. and 3. below we take the max over the unit sphere S given by (7.12).

1. Nonnegativity is obvious. If ‖A‖ = 0 then ‖Ay‖ = 0 for each y ∈ C^n. In particular, each column Ae_j of A is zero. Hence A = 0.
2. ‖cA‖ = max_x ‖cAx‖ = max_x |c| ‖Ax‖ = |c| ‖A‖.
3. ‖A + B‖ = max_x ‖(A + B)x‖ ≤ max_x ‖Ax‖ + max_x ‖Bx‖ = ‖A‖ + ‖B‖.
4. ‖AB‖ = max_{x≠0} ‖ABx‖/‖x‖ = max_{Bx≠0} (‖ABx‖/‖Bx‖)(‖Bx‖/‖x‖) ≤ max_{y≠0} (‖Ay‖/‖y‖) max_{x≠0} (‖Bx‖/‖x‖) = ‖A‖ ‖B‖.
That ‖I‖ = 1 for any operator norm follows immediately from the definition.

Since ‖I‖_F = √n, we see that the Frobenius norm is not an operator norm for n > 1.

The operator p-norms

Recall that the p or l_p vector norms (7.1) are given by
‖x‖_p := ( Σ_{j=1}^n |x_j|^p )^{1/p}, p ≥ 1,  ‖x‖_∞ := max_{1≤j≤n} |x_j|.
The operator norms ‖·‖_p defined from these p-vector norms are used quite frequently for p = 1, 2, ∞. We define for any 1 ≤ p ≤ ∞
‖A‖_p := max_{x≠0} ‖Ax‖_p / ‖x‖_p = max_{‖y‖_p = 1} ‖Ay‖_p.   (7.15)
For p = 1, 2, ∞ we have explicit expressions for these norms.

Theorem 7.15 (one-two-inf-norms)
For A ∈ C^{m×n} we have
‖A‖_1 = max_{1≤j≤n} ‖Ae_j‖_1 = max_{1≤j≤n} Σ_{k=1}^m |a_{k,j}|, (max column sum)
‖A‖_2 = σ_1, (largest singular value of A)
‖A‖_∞ = max_{1≤k≤m} ‖e_k^T A‖_1 = max_{1≤k≤m} Σ_{j=1}^n |a_{k,j}|. (max row sum)
(7.16)
The two-norm ‖A‖_2 is also called the spectral norm of A.

Proof. The result for p = 2 follows from the singular value decomposition A = UΣV^*, or Σ = U^*AV, where U and V are square and unitary, and Σ is a diagonal matrix with nonnegative diagonal elements σ_1 ≥ ⋯ ≥ σ_n ≥ 0. Moreover,

Av_1 = σ_1u_1, where v_1 and u_1 are the first columns of V and U, respectively (cf. (6.10)). It follows that ‖Av_1‖_2 = σ_1‖u_1‖_2 = σ_1. The result will follow if we can show that ‖Ax‖_2 ≤ σ_1 for all x ∈ C^n with ‖x‖_2 = 1. Let x = Vc. Then ‖Vc‖_2 = ‖c‖_2 = 1, so that
‖Ax‖_2² = ‖AVc‖_2² = ‖U^*AVc‖_2² = ‖Σc‖_2² = Σ_{j=1}^n σ_j²|c_j|² ≤ σ_1² Σ_{j=1}^n |c_j|² = σ_1².

For p = 1, ∞ we proceed as follows:
(a) We derive a constant K_p such that ‖Ax‖_p ≤ K_p for any x ∈ C^n with ‖x‖_p = 1.
(b) We give an extremal vector y ∈ C^n with ‖y‖_p = 1 so that ‖Ay‖_p = K_p.
It then follows from (7.15) that ‖A‖_p = ‖Ay‖_p = K_p.

1-norm: Define K_1, c and y by K_1 := ‖Ae_c‖_1 = max_{1≤j≤n} ‖Ae_j‖_1 and y := e_c, a unit vector. Then ‖y‖_1 = 1 and we obtain
(a) ‖Ax‖_1 = Σ_{k=1}^m |Σ_{j=1}^n a_{kj}x_j| ≤ Σ_{k=1}^m Σ_{j=1}^n |a_{kj}| |x_j| = Σ_{j=1}^n ( Σ_{k=1}^m |a_{kj}| ) |x_j| ≤ K_1.
(b) ‖Ay‖_1 = K_1.

∞-norm: Define K_∞, r and y by K_∞ := ‖e_r^T A‖_1 = max_{1≤k≤m} ‖e_k^T A‖_1 and y := [e^{−iθ_1}, …, e^{−iθ_n}]^T, where a_{rj} = |a_{rj}|e^{iθ_j} for j = 1, …, n.
(a) ‖Ax‖_∞ = max_{1≤k≤m} |Σ_{j=1}^n a_{kj}x_j| ≤ max_{1≤k≤m} Σ_{j=1}^n |a_{kj}| |x_j| ≤ K_∞.
(b) ‖Ay‖_∞ = max_{1≤k≤m} |Σ_{j=1}^n a_{kj}e^{−iθ_j}| = K_∞. The last equality is correct because |Σ_{j=1}^n a_{kj}e^{−iθ_j}| ≤ Σ_{j=1}^n |a_{kj}| ≤ K_∞, with equality for k = r.

Example 7.16 (Comparing one-two-inf-norms)
The largest singular value of the matrix A := 1/15 [ ] is σ_1 = 2 (cf. Example 6.14). We find
‖A‖_1 = 29/15, ‖A‖_2 = 2, ‖A‖_∞ = 37/15, ‖A‖_F = √5.
We observe that the values of these norms do not differ by much.
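The formulas in Theorem 7.15 can be checked directly in MATLAB; the matrix below is an arbitrary stand-in of this sketch, not the one from Example 7.16.

A = [1 -2 3; 4 5 -6];                 % arbitrary test matrix
norm(A,1)   - max(sum(abs(A),1))      % one-norm equals max column sum: ~0
norm(A,inf) - max(sum(abs(A),2))      % infinity-norm equals max row sum: ~0
norm(A,2)   - max(svd(A))             % two-norm equals largest singular value: ~0
norm(A,'fro')                         % Frobenius norm, always >= norm(A,2)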

178 166 Chapter 7. Norms and perturbation theory for linear systems In some cases the spectral norm is equal to an eigenvalue of the matrix. Theorem 7.17 (Spectral norm) Suppose A C n n has singular values σ 1 σ 2 σ n and eigenvalues λ 1 λ 2 λ n. Then A 2 = σ 1 and A 1 2 = 1 σ n, (7.17) A 2 = λ 1 and A 1 2 = 1 λ n, if A is Hermitian positive definite, (7.18) A 2 = λ 1 and A 1 2 = 1, if A is normal. (7.19) λ n For the norms of A 1 we assume of course that A is nonsingular. Proof. Since 1/σ n is the largest singular value of A 1, (7.17) follows. By Theorem 6.9 the singular values of a Hermitian positive definite matrix (normal matrix) are equal to the eigenvalues (absolute value of the eigenvalues). This implies (7.18) and (7.19). The following result is sometimes useful. Theorem 7.18 (Spectral norm bound) For any A C m n we have A 2 2 A 1 A. Proof. Let (σ 2, v) be an eigenpair for A A corresponding to the largest singular value σ of A. Then A 2 2 v 1 = σ 2 v 1 = σ 2 v 1 = A Av 1 A 1 A 1 v 1. Observing that A 1 = A by Theorem 7.15 and canceling v 1 proves the result. Exercise 7.19 (Spectral norm) Let m, n N and A C m n. Show that A 2 = max x y Ax. 2= y 2=1 Exercise 7.20 (Spectral norm of the inverse) Suppose A C n n is nonsingular. Show that Ax 2 σ n for all x C n with x 2 = 1. Use this and (7.17) to show that A 1 2 = max x 0 x 2 Ax 2.
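A small MATLAB check of Theorems 7.17 and 7.18 for a Hermitian positive definite matrix; the matrix is an arbitrary example chosen for this sketch.

A = [4 1; 1 3];                        % symmetric positive definite
lam = sort(eig(A),'descend');
norm(A,2) - lam(1)                     % spectral norm equals largest eigenvalue: ~0
norm(inv(A),2) - 1/lam(end)            % norm of inverse equals 1/smallest eigenvalue: ~0
norm(A,2)^2 <= norm(A,1)*norm(A,inf)   % Theorem 7.18: prints 1 (true)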

179 7.2. Matrix Norms 167 Exercise 7.21 (p-norm example) Let [ 2 1 A = 1 2 Compute A p and A 1 p for p = 1, 2,. ] Unitary invariant matrix norms Definition 7.22 (Unitary invariant norm) A matrix norm on C m n is called unitary invariant if UAV = A for any A C m n and any unitary matrices U C m m and V C n n. When a unitary invariant matrix norm is used, the size of a perturbation is not increased by a unitary transformation. Thus if U and V are unitary then U(A + E)V = UAV + F, where F = E. It follows from Lemma 6.23 that the Frobenius norm is unitary invariant. We show here that this also holds for the spectral norm. Theorem 7.23 (Unitary invariant norms) The Frobenius norm and the spectral norm are unitary invariant. Moreover A F = A F and A 2 = A 2. Proof. The results for the Frobenius norm follow from Lemma Suppose A C m n and let U C m m and V C n n be unitary. Since the 2-vector norm is unitary invariant we obtain UA 2 = max x 2=1 UAx 2 = max x 2=1 Ax 2 = A 2. Now A and A have the same nonzero singular values, and it follows from Theorem 7.15 that A 2 = A 2. Moreover V is unitary. Using these facts we find AV 2 = (AV ) 2 = V A 2 = A 2 = A 2. It can be shown that the spectral norm is the only unitary invariant operator norm, see [15] p Exercise 7.24 (Unitary invariance of the spectral norm) Show that V A 2 = A 2 holds even for a rectangular V as long as V V = I. Exercise 7.25 ( AU 2 rectangular A) Find A R 2 2 and U R 2 1 with U T U = I such that AU 2 < A 2. Thus, in general, AU 2 = A 2 does not hold for a rectangular U even if U U = I.
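Unitary invariance (Theorem 7.23) is also easy to verify numerically; the random orthogonal factors below come from QR factorizations of random matrices and are assumptions of this sketch, not part of the text.

A = randn(4,3);
[U,~] = qr(randn(4));                 % random orthogonal 4-by-4
[V,~] = qr(randn(3));                 % random orthogonal 3-by-3
norm(U*A*V,'fro') - norm(A,'fro')     % ~0: Frobenius norm is unitary invariant
norm(U*A*V,2)     - norm(A,2)         % ~0: spectral norm is unitary invariant
norm(U*A*V,1)     - norm(A,1)         % generally NOT zero: the one-norm is not unitary invariant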

180 168 Chapter 7. Norms and perturbation theory for linear systems Exercise 7.26 (p-norm of diagonal matrix) Show that A p = ρ(a) := max λ i (the largest eigenvalue of A), 1 p, when A is a diagonal matrix. Exercise 7.27 (spectral norm of a column vector) A vector a C m can also be considered as a matrix A C m,1. (a) Show that the spectral matrix norm (2-norm) of A equals the Euclidean vector norm of a. (b) Show that A p = a p for 1 p Absolute and monotone norms A vector norm on C n is an absolute norm if x = x for all x C n. Here x := [ x 1,..., x n ] T, the absolute values of the components of x. Clearly the vector p norms are absolute norms. We state without proof (see Theorem of [15]) that a vector norm on C n is an absolute norm if and only if it is a monotone norm, i. e., x i y i, i = 1,..., n = x y, for all x, y C n. Absolute and monotone matrix norms are defined as for vector norms. Exercise 7.28 (Norm of absolute value matrix) If A C m n has elements a ij, let A R m n be the matrix with elements a ij. [ ] 1+i 2 (a) Compute A if A =, i = i (b) Show that for any A C m n A F = A F, A p = A p for p = 1,. (c) Show that for any A C m n A 2 A 2. (d) Find a real symmetric 2 2 matrix A such that A 2 < A 2. The study of matrix norms will be continued in Chapter The Condition Number with Respect to Inversion Consider the system of two linear equations x 1 + x 2 = 20 x 1 + ( )x 2 =

whose exact solution is x_1 = x_2 = 10. If we replace the second equation by x_1 + ( )x_2 = , the exact solution changes to x_1 = 30, x_2 = −10. Here a small change in one of the coefficients, from to , changed the exact solution by a large amount.

A mathematical problem in which the solution is very sensitive to changes in the data is called ill-conditioned. Such problems can be difficult to solve on a computer. In this section we consider what effect a small change (perturbation) in the data A, b has on the solution x of a linear system Ax = b. Suppose y solves (A + E)y = b + e, where E is a (small) n×n matrix and e a (small) vector. How large can y − x be? To measure this we use vector and matrix norms. In this section ‖·‖ will denote a vector norm on C^n and also a matrix norm on C^{n×n} which for any A, B ∈ C^{n×n} and any x ∈ C^n satisfy ‖AB‖ ≤ ‖A‖ ‖B‖ and ‖Ax‖ ≤ ‖A‖ ‖x‖. This holds if the matrix norm is the operator norm corresponding to the given vector norm, but is also satisfied for the Frobenius matrix norm and the Euclidian vector norm. This follows from Lemma 6.23.

Suppose x and y are vectors in C^n that we want to compare. The difference ‖y − x‖ measures the absolute error in y as an approximation to x, while ‖y − x‖/‖x‖ and ‖y − x‖/‖y‖ are measures for the relative error. We consider first a perturbation in the right-hand side b.

Theorem 7.29 (Perturbation in the right-hand side)
Suppose A ∈ C^{n×n} is nonsingular, b, e ∈ C^n, b ≠ 0, and Ax = b, Ay = b + e. Then
(1/K(A)) ‖e‖/‖b‖ ≤ ‖y − x‖/‖x‖ ≤ K(A) ‖e‖/‖b‖,  K(A) := ‖A‖ ‖A^{-1}‖.   (7.20)

Proof. Subtracting Ax = b from Ay = b + e we have A(y − x) = e, or y − x = A^{-1}e. Combining ‖y − x‖ = ‖A^{-1}e‖ ≤ ‖A^{-1}‖ ‖e‖ and ‖b‖ = ‖Ax‖ ≤ ‖A‖ ‖x‖ we obtain the upper bound in (7.20). Combining ‖e‖ ≤ ‖A‖ ‖y − x‖ and ‖x‖ ≤ ‖A^{-1}‖ ‖b‖ we obtain the lower bound.

Consider (7.20). ‖e‖/‖b‖ is a measure of the size of the perturbation e relative to the size of b. The upper bound says that ‖y − x‖/‖x‖ in the worst case can be K(A) = ‖A‖ ‖A^{-1}‖ times as large as ‖e‖/‖b‖. K(A) is called the condition number with respect to inversion of a matrix, or just the condition number, if it is clear from the context that we are talking about solving

182 170 Chapter 7. Norms and perturbation theory for linear systems linear systems or inverting a matrix. The condition number depends on the matrix A and on the norm used. If K(A) is large, A is called ill-conditioned (with respect to inversion). If K(A) is small, A is called well-conditioned (with respect to inversion). We always have K(A) 1. For since x = Ix I x for any x we have I 1 and therefore A A 1 AA 1 = I 1. Since all matrix norms are equivalent, the dependence of K(A) on the norm chosen is less important than the dependence on A. Sometimes one chooses the spectral norm when discussing properties of the condition number, and the l 1, l, or Frobenius norm when one wishes to compute it or estimate it. The following explicit expressions for the 2-norm condition number follow from Theorem Theorem 7.30 (Spectral condition number) Suppose A C n n is nonsingular with singular values σ 1 σ 2 σ n > 0 and eigenvalues λ 1 λ 2 λ n > 0. Then K 2 (A) := A 2 A 1 2 = σ 1 /σ n. Moreover, { λ 1 /λ n, if A is Hermitian positive definite, K 2 (A) = (7.21) λ 1 / λ n, if A is normal. It follows that A is ill-conditioned with respect to inversion if and only if σ 1 /σ n is large, or λ 1 /λ n is large when A is Hermitian positive definite. Suppose we have computed an approximate solution y to Ax = b. The vector r(y) := Ay b is called the residual vector, or just the residual. We can bound x y in terms of r. Theorem 7.31 (Perturbation and residual) Suppose A C n n, b C n, A is nonsingular and b 0. Let r(y) = Ay b for any y C n. If Ax = b then 1 r(y) y x K(A) r(y) K(A) b x b. (7.22) Proof. We simply take e = r(y) in Theorem If A is well-conditioned, (7.22) says that y x / x r(y) / b. In other words, the accuracy in y is about the same order of magnitude as the residual as long as b 1. If A is ill-conditioned, anything can happen. We can for example have an accurate solution even if the residual is large. Consider next the effect of a perturbation in the coefficient matrix. Suppose A, E C n n with A nonsingular. We like to compare the solution x and y of the systems Ax = b and (A + E)y = b. We expect A + E to be nonsingular if the elements of E are sufficiently small and we need to address this question. Consider first the case where A = I.

Theorem 7.32 (Nonsingularity of perturbation of identity)
Suppose B ∈ C^{n×n} and ‖B‖ < 1 for some consistent matrix norm on C^{n×n}. Then I − B is nonsingular and
1/(1 + ‖B‖) ≤ ‖(I − B)^{-1}‖ ≤ 1/(1 − ‖B‖).   (7.23)

Proof. Suppose I − B is singular. Then (I − B)x = 0 for some nonzero x ∈ C^n, and x = Bx, so that ‖x‖ = ‖Bx‖ ≤ ‖B‖ ‖x‖. But then ‖B‖ ≥ 1. It follows that I − B is nonsingular if ‖B‖ < 1. Next, since
‖I‖ = ‖(I − B)(I − B)^{-1}‖ ≤ ‖I − B‖ ‖(I − B)^{-1}‖ ≤ (‖I‖ + ‖B‖) ‖(I − B)^{-1}‖,
we obtain
‖I‖ / (‖I‖ + ‖B‖) ≤ ‖(I − B)^{-1}‖,   (7.24)
and since ‖I‖ ≥ 1, the left-hand side of (7.24) is at least 1/(1 + ‖B‖), which gives the lower bound in (7.23). Taking norms and using the inverse triangle inequality in
I = (I − B)(I − B)^{-1} = (I − B)^{-1} − B(I − B)^{-1}
gives ‖I‖ ≥ ‖(I − B)^{-1}‖ − ‖B(I − B)^{-1}‖ ≥ (1 − ‖B‖) ‖(I − B)^{-1}‖. If the matrix norm is an operator norm then ‖I‖ = 1 and the upper bound follows. We show in Section 11.4 that the upper bound also holds for the Frobenius norm, and more generally for any consistent matrix norm on C^{n×n}.

Theorem 7.33 (Nonsingularity of perturbation)
Suppose A, E ∈ C^{n×n}, b ∈ C^n with A nonsingular and b ≠ 0. If r := ‖A^{-1}E‖ < 1 for some matrix norm consistent on C^{n×n}, then A + E is nonsingular. If Ax = b and (A + E)y = b, then
‖y − x‖/‖y‖ ≤ ‖A^{-1}E‖ ≤ K(A) ‖E‖/‖A‖,   (7.25)
‖y − x‖/‖x‖ ≤ 2K(A) ‖E‖/‖A‖.   (7.26)
In (7.26) we have assumed that r ≤ 1/2.

184 172 Chapter 7. Norms and perturbation theory for linear systems Proof. Since r < 1 Theorem 7.32 implies that the matrix I B := I + A 1 E is nonsingular and then A + E = A ( I + A 1 E ) is nonsingular. Subtracting (A + E)y = b from Ax = b gives A(x y) = Ey or x y = A 1 Ey. Taking norms and dividing by y proves (7.25). Solving x y = A 1 Ey for y we obtain y = (I + A 1 E) 1 x. By (7.23) y (I + A 1 E) 1 x But then (7.26) follows from (7.25). x 1 A 1 E 2 x. In Theorem 7.33 we gave bounds for the relative error in x as an approximation to y and the relative error in y as an approximation to x. E / A is a measure for the size of the perturbation E in A relative to the size of A. The condition number again plays a crucial role. y x / y can be as large as K(A) times E / A. It can be shown that the upper bound can be attained for any A and any b. In deriving the upper bound we used the inequality A 1 Ey A 1 E y. For a more or less random perturbation E this is not a severe overestimate for A 1 Ey. In the situation where E is due to round-off errors (7.25) can give a fairly realistic estimate for y x / y. We end this section with a perturbation result for the inverse matrix. Again the condition number plays an important role. Theorem 7.34 (Perturbation of inverse matrix) Suppose A C n n is nonsingular and let be a consistent matrix norm on C n n. If E C n n is so small that r := A 1 E < 1 then A + E is nonsingular and (A + E) 1 A 1 1 r. (7.27) If r < 1/2 then (A + E) 1 A 1 A 1 2K(A) E A. (7.28) Proof. We showed in Theorem 7.33 that A + E is nonsingular and since (A + E) 1 = (I + A 1 E) 1 A 1 we obtain (A + E) 1 (I + A 1 E) 1 A 1 A 1 1 A 1 E and (7.27) follows. Since (A + E) 1 A 1 = ( I A 1 (A + E) ) (A + E) 1 = A 1 E(A + E) 1

185 7.3. The Condition Number with Respect to Inversion 173 we obtain by (7.27) (A + E) 1 A 1 A 1 E (A + E) 1 K(A) E A 1 A 1 r. Dividing by A 1 and setting r < 1/2 proves (7.28). Exercise 7.35 (Sharpness of perturbation bounds) The upper and lower bounds for y x / x given by (7.20) can be attained for any matrix A, but only for special choices of b. Suppose y A and y A 1 are vectors with y A = y A 1 = 1 and A = Ay A and A 1 = A 1 y A 1. (a) Show that the upper bound in (7.20) is attained if b = Ay A and e = y A 1. (b) Show that the lower bound is attained if b = y A 1 and e = Ay A. Exercise 7.36 (Condition number of 2. derivative matrix) In this exercise we will show that for m 1 4 π 2 (m + 1)2 2/3 < cond p (T ) 1 2 (m + 1)2, p = 1, 2,, (7.29) where T := tridiag( 1, 2, 1) R m m and cond p (T ) := T p T 1 p is the p- norm condition number of T. The p matrix norm is given by (7.15). You will need the explicit inverse of T given by (1.27) and the eigenvalues given in Lemma As usual we define h := 1/(m + 1). a) Show that for m 3 cond 1 (T ) = cond (T ) = 1 { h 2, m odd, 2 h 2 1, m even. and that cond 1 (T ) = cond (T ) = 3 for m = 2. b) Show that for p = 2 and m 1 we have ( cond 2 (T ) = cot 2 πh ) 2 c) Show the bounds ( = 1/ tan 2 πh ). 2 (7.30) 4 π 2 h < cond 2(T ) < 4 π 2 h 2. (7.31) Hint: For the upper bound use the inequality tan x > x valid for 0 < x < π/2. For the lower bound we use (without proof) the inequality cot 2 x > 1 x for x > 0. d) Show (7.29).
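The following MATLAB sketch illustrates both Exercise 7.36 and the bound (7.20): it builds T = tridiag(−1, 2, −1), compares cond_2(T) with cot²(πh/2) as in (7.30), and then perturbs the right-hand side of a linear system with T to see how the relative error compares with K(T)·‖e‖/‖b‖. The size m, the right-hand side and the perturbation are arbitrary choices of this example.

m = 100; h = 1/(m+1);
T = 2*eye(m) - diag(ones(m-1,1),1) - diag(ones(m-1,1),-1);
cond(T,2) - cot(pi*h/2)^2          % (7.30): agrees to rounding error
[cond(T,1) cond(T,2) cond(T,inf)]  % all of them grow like (m+1)^2
% perturb the right-hand side of T*x = b and compare with (7.20)
b = ones(m,1); e = 1e-8*randn(m,1);
x = T\b; y = T\(b+e);
relerr = norm(y-x)/norm(x)
bound  = cond(T,2)*norm(e)/norm(b) % upper bound from (7.20); relerr <= bound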

186 174 Chapter 7. Norms and perturbation theory for linear systems 7.4 Proof that the p-norms are Norms We want to show Theorem 7.37 (The p vector norms are norms) Let for 1 p and x C n x p := ( n x j p) 1/p, x := max x j. 1 j n j=1 Then for all 1 p, x, y C n and all a C 1. x p 0 with equality if and only if x = 0. (positivity) 2. ax p = a x p. (homogeneity) 3. x + y p x p + y p. (subadditivity) Positivity and homogeneity follows immediately. To show the subadditivity we need some elementary properties of convex functions. Definition 7.38 (Convex function) Let I R be an interval. A function f : I R is convex if f ( (1 λ)x 1 + λx 2 ) (1 λ)f(x1 ) + λf(x 2 ) (7.32) for all x 1, x 2 I with x 1 < x 2 and all λ [0, 1]. The sum n j=1 λ jx j is called a convex combination of x 1,..., x n if λ j 0 for j = 1,..., n and n j=1 λ j = 1. The convexity condition is illustrated in Figure 7.1. Lemma 7.39 (A sufficient condition for convexity) If f C 2 [a, b] and f (x) 0 for x [a, b] then f is convex. Proof. We recall the formula for linear interpolation with remainder, (cf a book on numerical methods) For any a x 1 x x 2 b there is a c [x 1, x 2 ] such that f(x) = x 2 x x 2 x 1 f(x 1 ) + x x 1 x 2 x 1 f(x 2 ) + (x x 1 )(x x 2 )f (c)/2 = (1 λ)f(x 1 ) + λf(x 2 ) + (x 2 x 1 ) 2 λ(λ 1)f (c)/2, λ := x x 1 x 2 x 1. Since λ [0, 1] the remainder term is not positive. Moreover, x = x 2 x x 2 x 1 x 1 + x x 1 x 2 x 1 x 2 = (1 λ)x 1 + λx 2

187 7.4. Proof that the p-norms are Norms 175 (1 λ)f(x 1 ) + λf(x 2 ) f(x) x 1 x = (1 λ)x 1 + λ x 2 x 2 Figure 7.1: A convex function. so that (7.32) holds, and f is convex. The following inequality is elementary, but can be used to prove many nontrivial inequalities. Theorem 7.40 (Jensen s inequality) Suppose I R is an interval and f : I R is convex. Then for all n N, all λ 1,..., λ n with λ j 0 for j = 1,..., n and n j=1 λ j = 1, and all z 1,..., z n I f( n λ j z j ) j=1 n λ j f(z j ). Proof. We use induction on n. The result is trivial for n = 1. Let n 2, assume the inequality holds for n 1, and let λ j, z j for j = 1,..., n be given as in the theorem. Since n 2 we have λ i < 1 for at least one i so assume without loss of generality that λ 1 < 1, and define u := n λ j j=2 1 λ 1 z j. Since n j=2 λ j = 1 λ 1 this is a convex combination of n 1 terms and the induction hypothesis implies that f(u) n λ j j=2 1 λ 1 f(z j ). But then by the convexity of f ( n ) n f λ j z j = f(λ 1 z 1 + (1 λ 1 )u) λ 1 f(z 1 ) + (1 λ 1 )f(u) λ j f(z j ) j=1 and the inequality holds for n. j=1 j=1

188 176 Chapter 7. Norms and perturbation theory for linear systems Corollary 7.41 (Weighted geometric/arithmetic mean inequality) Suppose n j=1 λ ja j is a convex combination of nonnegative numbers a 1,..., a n. Then n a λ1 1 aλ2 2 aλn n λ j a j, (7.33) where 0 0 := 0. Proof. The result is trivial if one or more of the a j s are zero so assume a j > 0 for all j. Consider the function f : (0, ) R given by f(x) = log x. Since f (x) = 1/x 2 > 0 for x (0, ), this function is convex. By Jensen s inequality j=1 j=1 j=1 log ( n ) n λ j a j λ j log(a j ) = log ( a λ1 1 ) aλn n or log ( a λ1 1 ) ( n aλn n log j=1 λ ) ja j. The inequality follows since exp(log x) = x for x > 0 and the exponential function is monotone increasing. Taking λ j = 1 n for all j in (7.33) we obtain the classical geometric/arithmetic mean inequality (a 1 a 2 a n ) 1 n 1 n n a j. (7.34) j=1 Corollary 7.42 (Hölder s inequality) For x, y C n and 1 p n x j y j x p y q, where 1 p + 1 q = 1. j=1 Proof. We leave the proof for p = 1 and p = as an exercise so assume 1 < p <. For any a, b 0 the weighted arithmetic/geometric mean inequality implies that a p b q p a + 1 q b, where 1 p + 1 = 1. (7.35) q If x = 0 or y = 0 there is nothing to prove so assume that both x and y are nonzero. Using 7.35 on each term we obtain 1 x p y q n x j y j = j=1 n ( xj p j=1 x p p ) 1 p ( y j q y q q ) 1 q n ( 1 x j p p x p + 1 y j q ) p q y q = 1 q j=1

189 7.4. Proof that the p-norms are Norms 177 and the proof of the inequality is complete. Corollary 7.43 (Minkowski s inequality) For x, y C n and 1 p x + y p x p + y p. Proof. We leave the proof for p = 1 and p = as an exercise so assume 1 < p <. We write x + y p p = n x j + y j p j=1 n x j x j + y j p 1 + j=1 n y j x j + y j p 1. We apply Hölder s inequality with exponent p and q to each sum. In view of the relation (p 1)q = p the result is x + y p p x p x + y p/q p + y p x + y p/q p and canceling the common factor, the inequality follows. j=1 = ( x p + y p ) x + y p 1 p, It is possible to characterize the p-noms that are derived from an inner product. We start with the following identity. Theorem 7.44 (Parallelogram identity) For all x, y in a real or complex inner product space x + y 2 + x y 2 = 2 x y 2. (7.36) Proof. We set a = ±1 in (4.5) and add the two equations. Theorem 7.45 (When is a norm an inner product norm?) To a given norm on a real or complex vector space V there exists an inner product on V such that x, x = x 2 if and only if the parallelogram identity (7.36) holds for all x, y V. Proof. If x, x = x 2 then Theorem 7.44 shows that the parallelogram identity holds. For the converse we prove the real case and leave the complex case as an exercise. Suppose (7.36) holds for all x, y in the real vector space V. We show that x, y := 1 4( x + y 2 x y 2), x, y V (7.37)

190 178 Chapter 7. Norms and perturbation theory for linear systems defines an inner product on V. Clearly 1. and 2. in Definition 4.1 hold. The hard part is to show 3. We need to show that Now x, z + y, z = x + y, z, x, y, z V, (7.38) ax, y = a x, y, a R, x, y V. (7.39) 4 x, z + 4 y, z (7.37) = x + z 2 x z 2 + y + z 2 y z 2 = ( z + x + y ) x y + 2 ( z x + y ) y x ( z + x + y ) x y 2 ( z x + y ) y x (7.36) = 2 z + x + y x y 2 2 z x + y 2 2 y x (7.37) = 8 x + y, z, 2 or x, z + y, z = 2 x + y, z, x, y, z V. 2 In particular, since y = 0 implies y, z = 0 we obtain x, z = 2 x 2, z for all x, z V. This means that 2 x+y 2, z = x + y, z for all x, y, z V and (7.38) follows. We first show (7.39) when a = n is a positive integer. By induction nx, y = (n 1)x + x, y (7.38) = (n 1)x, y + x, y = n x, y. (7.40) If m, n N then m 2 n (7.40) x, y = m nx, y (7.40) = mn x, y, m implying that (7.39) holds for positive rational numbers n m x, y = n x, y. m Now if a > 0 there is a sequence {a n } of positive rational numbers converging to a. For each n a n x, y = a n x, y (7.37) = 1 4( an x + y 2 a n x y 2). Taking limits and using continuity of norms we obtain a x, y = ax, y. This also holds for a = 0. Finally, if a < 0 then ( a) > 0 and from what we just showed ( a) x, y = ( a)x, y (7.37) = 1 4( ax + y 2 ax y 2) = ax, y,

191 7.4. Proof that the p-norms are Norms 179 so (7.39) also holds for negative a. Corollary 7.46 (Are the p-norms inner product norms?) For the p vector norms on V = R n or V = C n, 1 p, n 2, there is an inner product on V such that x, x = x 2 p for all x V if and only if p = 2. Proof. For p = 2 the p-norm is the Euclidian norm which corresponds to the standard inner product. If p 2 then the parallelogram identity (7.36) does not hold for say x := e 1 and y := e 2. Exercise 7.47 (When is a complex norm an inner product norm?) Given a vector norm in a complex vector space V, and suppose (7.36) holds for all x, y Show that x, y := 1 4( x + y 2 x y 2 + i x + iy 2 i x iy 2), (7.41) defines an inner product on V, where i = 1. The identity (7.41) is called the polarization identity. 16 Exercise 7.48 (p norm for p = 1 and p = ) Show that p is a vector norm in R n for p = 1, p =. Exercise 7.49 (The p- norm unit sphere) The set S p = {x R n : x p = 1} is called the unit sphere in R n with respect to p. Draw S p for p = 1, 2, for n = 2. Exercise 7.50 (Sharpness of p-norm inequalitiy) For p 1, and any x C n we have x x p n 1/p x (cf. (7.5)). Produce a vector x l such that x l = x l p and another vector x u such that x u p = n 1/p x u. Thus, these inequalities are sharp. Exercise 7.51 (p-norm inequalitiies for arbitrary p) If 1 q p then x p x q n 1/q 1/p x p, x C n. Hint: For the rightmost inequality use Jensen s inequality Cf. Theorem 7.40 with f(z) = z p/q and z i = x i q. For the left inequality consider first y i = x i / x, i = 1, 2,..., n. 16 Hint: We have x, y = s(x, y) + is(x, iy), where s(x, y) := 1 4 ( x + y 2 x y 2).

192 180 Chapter 7. Norms and perturbation theory for linear systems 7.5 Review Questions 7.5.1] What is a consistent matrix norm? what is a subordinate matrix norm? is an operator norm consistent? why is the Frobenius norm not an operator norm? what is the spectral norm of a matrix? how do we compute A? what is the spectral condition number of a symmetric positive definite matrix? Why is A 2 A F for any matrix A? What is the spectral norm of the inverse of a normal matrix?

Chapter 8 Least Squares

Consider the linear system Ax = b of m equations in n unknowns. It is overdetermined if m > n, square if m = n, and underdetermined if m < n. In each case the system can only be solved approximately if b ∉ span(A). One way to solve Ax = b approximately is to select a vector norm, say a p-norm, and look for x ∈ C^n which minimizes ‖Ax − b‖. The use of the one- and ∞-norms can be formulated as linear programming problems, while the Euclidian norm leads to a linear system and has applications in statistics. Only this norm is considered here.

Definition 8.1 (Least squares problem (LSQ))
Suppose m, n ∈ N, A ∈ C^{m×n} and b ∈ C^m. To find x ∈ C^n that minimizes E: C^n → R given by E(x) := ‖Ax − b‖_2² is called the least squares problem. A minimizer x is called a least squares solution.

Since the square root function is monotone, minimizing E(x) or √E(x) is equivalent.

Example 8.2 (Average)
Consider an overdetermined linear system of 3 equations in one unknown,
x_1 = 1,  x_1 = 1,  x_1 = 2,
that is, A = [1; 1; 1], x = [x_1], b = [1; 1; 2].

194 182 Chapter 8. Least Squares To solve this as a least squares problem we find Ax b 2 2 = (x 1 1) 2 + (x 1 1) 2 + (x 1 2) 2 = 3x 2 1 8x Setting the first derivative with respect to x 1 equal to zero we obtain 6x 1 8 = 0 or x 1 = 4/3, the average of b 1, b 2, b 3. The second derivative is positive and x 1 = 4/3 is a global minimum. We will show below the following results valid for any m, n N, A C m n and b C n. Theorem 8.3 (Existence) The least squares problem always has a solution. Theorem 8.4 (Uniqueness) The solution of the least squares problem is unique if and only if A has linearly independent columns. Theorem 8.5 (Characterization) x C n is a solution of the least squares problem if and only if A Ax = A b. The linear system A Ax = A b is known as the normal equations. By Lemma 3.8 the coefficient matrix A A is symmetric and positive semidefinite, and it is positive definite if and only if A has linearly independent columns. This is the same condition which guarantees that the least squares problem has a unique solution. 8.1 Examples Example 8.6 (Linear regression) We want to fit a straight line p(t) = x 1 + x 2 t to m 2 given data (t k, y k ) R 2, k = 1,..., m. This is part of the linear regression process in statistics. We obtain the linear system Ax = p(t 1 ). p(t m ) = 1 t 1. 1 t m [ x1 ] = x 2 y 1. y m = b. This is square for m = 2 and overdetermined for m > 2. The matrix A has linearly independent columns if and only if the set {t 1,..., t m } of sites contains at least least two distinct elements. For if say t i t j then 1 t 1 0 [ ] [ ] [ 1 ti c1 0 c 1. + c 2. =. = = = c 1 t j c 2 0] 1 = c 2 = 0. 1 t m 0

195 8.1. Examples 183 y t Figure 8.1: A least squares fit to data. Conversely, if t 1 = = t m then the columns of A are not linearly independent. The normal equations are [ ] 1 t 1 A T 1 1 Ax = t 1 t m. 1 t m [ ] y = t 1 t m. = y m [ x1 x 2 ] = [ m tk tk [ yk tk y k ] = A T b, t 2 k ] [ x1 x 2 where k ranges from 1 to m in the sums. By what we showed the coefficient matrix is positive semidefinite and positive definite if we at least two distinct cites. If m = 2 and t 1 t 2 then both systems Ax = b and A T Ax = A T b are square, and p is the linear interpolant to the data. Indeed, p is linear and p(t k ) = y k, k = 1, 2. With the data t y [ ] [ ] [ ] 4 10 x1 6 the normal equations become =. The data and the least x squares polynomial p(t) = x 1 + x 2 t = t are shown in Figure 8.1. ],
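A minimal MATLAB sketch of the straight-line fit in Example 8.6; the data below are placeholders (the numbers in the text's table were lost in extraction). It forms the normal equations and also solves the same problem with backslash.

t = [0; 1; 2; 3]; y = [0.1; 0.9; 2.1; 2.9];   % placeholder data (t_k, y_k)
A = [ones(size(t)) t];            % columns 1 and t, as in Example 8.6
x_ne = (A'*A) \ (A'*y)            % solve the normal equations A'*A*x = A'*y
x_bs = A \ y                      % the same least squares solution via backslash
p = @(s) x_bs(1) + x_bs(2)*s;     % the fitted line p(t) = x1 + x2*t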

196 184 Chapter 8. Least Squares Example 8.7 (Input/output model) Suppose we have a simple input/output model. To every input u R n we obtain an output y R. Assuming we have a linear relation y = u T x = n u i x i, between u and y, how can we determine x? Performing m n experiments we obtain a table of values i=1 u u 1 u 2 u m y y 1 y 2 y m. We would like to find x such that Ax = u T 1 u T 2. u T m x = y 1 y 2. y m = b. We can estimate x by solving the least squares problem min Ax b Curve Fitting Given size: 1 n m, sites: S := {t 1, t 2,..., t m } [a, b], y-values: y = [y 1, y 2,..., y m ] T R m, functions: φ j : [a, b] R, j = 1,..., n. Find a function (curve fit) p : [a, b] R given by p := n j=1 x jφ j such that p(t k ) y k for k = 1,..., m. The curve fitting problem can be solved by least squares from the following linear system: Ax = p(t 1 ). p(t m ) = φ 1 (t 1 ) φ n (t 1 ).. φ 1 (t m ) φ n (t m ) x 1. x n = y 1. y m =: b. (8.1)

197 8.1. Examples 185 Then we find x R n as a solution of the corresponding least squares problem given by m E(x) := Ax b 2 ( n ) 2. 2 = x j φ j (t k ) y k (8.2) k=1 Typical examples of functions φ j are polynomials, trigonometric functions, exponential functions, or splines. In (8.2) one can also include weights w k > 0 for k = 1,..., m and minimize E(x) := j=1 m ( n ) 2. w k x j φ j (t k ) y k k=1 j=1 If y k is an accurate observation, we can choose a large weight w k. This will force p(t k ) y k to be small. Similarly, a small w k will allow p(t k ) y k to be large. If an estimate for the standard deviation δy k in y k is known for each k, we can choose w k = 1/(δy k ) 2, k = 1, 2,..., m. For simplicity we will assume in the following that w k = 1 for all k. Lemma 8.8 (Curve fitting) Let A be given by (8.1). The matrix A T A is symmetric positive definite if and only if {φ 1,..., φ n } is linearly independent on S, i. e., p(t k ) := n x j φ j (t k ) = 0, k = 1,..., m x 1 = = x n = 0. (8.3) j=1 Proof. By Lemma 3.8 A T A is positive definite if and only if A has linearly independent columns. Since (Ax) k = n j=1 x jφ j (t k ), k = 1,..., m this is equivalent to (8.3). Example 8.9 (Ill conditioning and the Hilbert matrix) The normal equations can be extremely ill-conditioned. Consider the curve fitting problem using the polynomials φ j (t) := t j 1, for j = 1,..., n and equidistant sites t k = (k 1)/(m 1) for k = 1,..., m. The normal equations are B n x = c n, where for n = 3 B 3 x := m t k t 2 k tk t 2 k t 3 k x 1 x 2 = t 2 k t 3 k t 4 k x 3 yk tk y k t 2 k y k B n is symmetric positive definite if at least n of the t s are distinct. However B n is extremely ill-conditioned even for moderate n. Indeed, 1 m B n H n, where.

198 186 Chapter 8. Least Squares H n R n n is the Hilbert Matrix with i, j element 1/(i + j 1). Thus for n = 3 H 3 = The elements of 1 m B n are Riemann sums approximations to the elements of H n. In fact, if B n = [b i,j ] n i,j=1 then 1 m b i,j = 1 m m k=1 t i+j 2 k = 1 m m k= ( ) i+j 2 k 1 1 x i+j 2 1 dx = m 1 0 i + j 1 = h i,j. The elements of H 1 n are determined in Exercise We find K 1 (H 6 ) := H 6 1 H It appears that m B n and hence B n is ill-conditioned for moderate n at least if m is large. The cure for this problem is to use a different basis for polynomials. Orthogonal polynomials are an excellent choice. Another possibility is to use the shifted power basis (t t) j 1, j = 1,..., n, for a suitable t. Exercise 8.10 (Fitting a circle to points) In this problem we derive an algorithm to fit a circle (t c 1 ) 2 + (y c 2 ) 2 = r 2 to m 3 given points (t i, y i ) m i=1 in the (t, y)-plane. We obtain the overdetermined system (t i c 1 ) 2 + (y i c 2 ) 2 = r 2, i = 1,..., m, (8.4) of m equations in the three unknowns c 1, c 2 and r. This system is nonlinear, but it can be solved from the linear system t i x 1 + y i x 2 + x 3 = t 2 i + y 2 i, i = 1,..., m, (8.5) and then setting c 1 = x 1 /2, c 2 = x 2 /2 and r 2 = c c x 3. a) Derive (8.5) from (8.4). Explain how we can find c 1, c 2, r once [x 1, x 2, x 3 ] is determined. b) Formulate (8.5) as a linear least squares problem for suitable A and b. c) Does the matrix A in b) have linearly independent columns? d) Use (8.5) to find the circle passing through the three points (1, 4), (3, 2), (1, 0).
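A sketch of the least squares circle fit from Exercise 8.10, on made-up noisy points (the points and the noise level are assumptions of this example): it solves the linear system (8.5) in the least squares sense and recovers the center and radius as described in the exercise.

theta = linspace(0,2*pi,20)';                    % made-up data on a noisy circle
t = 2 + 3*cos(theta) + 0.05*randn(size(theta));  % true center (2,-1), radius 3
y = -1 + 3*sin(theta) + 0.05*randn(size(theta));
A = [t y ones(size(t))];                         % system (8.5): t*x1 + y*x2 + x3 = t.^2 + y.^2
rhs = t.^2 + y.^2;
x = A \ rhs;                                     % least squares solution of (8.5)
c1 = x(1)/2, c2 = x(2)/2, r = sqrt(c1^2 + c2^2 + x(3))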

199 8.2. Geometric Least Squares theory Geometric Least Squares theory The least squares problem can be studied as a quadratic minimization problem. In the real case we have E(x) := Ax b 2 2 = (Ax b) T (Ax b) = x T A T Ax 2x T A T b + b T b. Minimization of a quadratic function like E(x) will be considered in Chapter 12. Here we consider a geometric approach based on orthogonal sums of subspaces Sum of subspaces and orthogonal projections Suppose S and T are subspaces of R n or C n endowed with the usual inner product x, y = y x. 17 The following subsets are subspaces of R n or C n. Sum: S + T := {s + t : s S and t T }, direct sum S T : a sum where S T = {0}, orthogonal sum S T : a sum where s, t = 0 for all s S and t T. We note that Every x S T can be decomposed uniquely in the form x = s + t, where s S and t T. For if x = s 1 + t 1 = s 2 + t 2 for s 1, s 2 S and t 1, t 2 T, then s 1 s 2 = t 2 t 1 and it follows that s 1 s 2 and t 2 t 1 belong to both S and T and hence to S T. But then s 1 s 2 = t 2 t 1 = 0 so s 1 = s 2 and t 2 = t 1. An orthogonal sum is a direct sum. For if b S T then b is orthogonal to itself, b, b = 0, which implies that b = 0. Thus, every b S T can be written uniquely as b = b 1 + b 2, where b 1 S and b 2 T. The vectors b 1 and b 2 are called the orthogonal projections of b into S and T. For any s S we have b b 1, s = b 2, s = 0, see Figure 8.2. Using orthogonal sums and projections we can prove the existence, uniqueness and characterization theorems for least squares problems. For A C m n we consider the column space of A and the null space of A 17 other inner products could be used as well S := span(a), T := ker(a ).

200 188 Chapter 8. Least Squares b b 2 S b 1 Figure 8.2: The orthogonal projection of b into S. These are subspaces and the sum is an orthogonal sum S + T = S T. For if s span(a) then s = Ac for some c R n and if t ker(a ) then t, s = s t = c A t = 0. Moreover, (cf, Theorem 8.13) S + T equals all of C m C m = span(a) ker(a ). It follows that any b C m can be decomposed uniquely as b = b 1 + b 2, where b 1 is the orthogonal projection of b into span(a) and b 2 is the orthogonal projection of b into ker(a ). Proof of Theorem 8.3 Suppose x C n. Clearly Ax b 1 span(a) since it is a subspace and b 2 ker(a ). But then Ax b 1, b 2 = 0 and by Pythagoras Ax b 2 2 = (Ax b 1 ) b = Ax b b b with equality if and only if Ax = b 1. It follows that the set of all least squares solutions is {x C n : Ax = b 1 }. (8.6) This set is nonempty since b 1 span(a). Proof of Theorem 8.4 The set (8.6) contains exactly one element if and only if A has linearly independent columns. Proof of Theorem 8.5 If x solves the least squares problem then Ax b 1 = 0 and it follows that A (Ax

201 8.3. Numerical Solution 189 b 1 ) = 0 showing that the normal equations hold. Conversely, if A Ax = A b then A b 2 = 0 implies that A (Ax b 1 ) = 0. But then Ax b 1 span(a) ker(a ) showing that Ax b 1 = 0, and x is a least squares solution. 8.3 Numerical Solution We assume that m n, A R m n, b R m. Numerical methods can be based on normal equations, QR factorization, or Singular Value Factorization. We discuss each of these approaches in turn. Another possibility is to use an iterative method like the conjugate gradient method (cf. Exercise 12.18) Normal equations We assume that rank(a) = n, i. e., A has linearly independent columns. The coefficient matrix B := A A in the normal equations is symmetric positive definite, and we can solve these equations using the Cholesky factorization of B. Consider forming the normal equations. We can use either a column oriented (inner product)- or a row oriented (outer product) approach. 1. inner product: (A A) i,j = m k=1 a k,ia k,j, i, j = 1,..., n, (A b) i = m k=1 a k,ib k, i = 1,..., n, 2. outer product: A A = m k=1 a k,1. a k,n [a k1 a kn ], A b = m k=1 a k,1. a k,n b k. The outer product form is suitable for large problems since it uses only one pass through the data importing one row of A at a time from some separate storage. Consider the number of operations to find the least squares solution for real data. We need 2m arithmetic operations for each inner product. Since B is symmetric we only need to compute n(n+1)/2 such inner products. It follows that B can be computed in approximately mn 2 arithmetic operations. In conclusion the number of operations are mn 2 to find B, 2mn to find c := A b, n 3 /3 to find L, n 2 to solve L T y = c and n 2 to solve Lx = y. If m n it takes 4 3 n3 = 2G n arithmetic operations. If m is much bigger than n the number of operations is approximately mn 2, the work to compute B. Conditioning of A can be a problem with the normal equation approach. We have Theorem 8.11 (Spectral condition number of A A) Suppose 1 n m and that A C m n has linearly independent columns. Then K 2 (A A) := A A 2 (A A) 1 2 = λ 1 = σ2 1 λ n σn 2 = K 2 (A) 2, (8.7)

202 190 Chapter 8. Least Squares where λ 1 λ n > 0 are the eigenvalues of A A, and σ 1 σ n > 0 are the singular values of A. Proof. Since A A is Hermitian it follows from Theorem 7.30 that K 2 (A) = σ1 σ n and K 2 (A A) = λ1 λ n. But λ i = σi 2 by Theorem 6.5 and the proof is complete. It follows from Theorem 8.11 that the 2-norm condition number of B := A A is the square of the condition number of A and therefore can be quite large even if A is only mildly illconditioned. Another difficulty which can be encountered is that the computed A A might not be positive definite. See Problem 8.36 for an example QR factorization Gene Golub, square problems. He pioneered use of the QR factorization to solve least The QR factorization can be used to solve the least squares problem. We assume that rank(a) = n, i. e., A has linearly independent columns. Suppose A = Q 1 R 1 is a QR factorization of A. Since Q 1 C m n has orthonormal columns we find A A = R 1Q 1Q 1 R 1 = R 1R 1, A b = R 1Q 1b. Since A has rank n the matrix R 1 is nonsingular and can be canceled. Thus A Ax = A b = R 1 x = c 1, c 1 := Q 1b.
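The reduction A^*Ax = A^*b ⇔ R_1x = c_1 can be carried out with an economy-size QR factorization. The text uses its own housetriang and rbacksolve routines; as a stand-in, here is a minimal sketch with MATLAB's built-in qr on an arbitrary full-rank problem (the sizes and data are assumptions of this example).

m = 6; n = 3;
A = randn(m,n); b = randn(m,1);     % arbitrary full-rank least squares problem
[Q1,R1] = qr(A,0);                  % economy QR: A = Q1*R1 with Q1 m-by-n
x_qr = R1 \ (Q1'*b)                 % solve R1*x = c1, c1 = Q1'*b
x_ne = (A'*A) \ (A'*b)              % normal equations give the same x, up to rounding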

203 8.3. Numerical Solution 191 We can use Householder transformations or Givens rotations to find R 1 and c 1. Consider using the Householder triangulation algorithm Algorithm We find R = Q A and c = Q b, where A = QR is the QR decomposition of A. The matrices R 1 and c 1 are located in the first n rows of R and c. Using also Algorithm 2.7 we have the following method to solve the full rank least squares problem. 1. [R,c]=housetriang(A,b). 2. x=rbacksolve(r(1:n,1:n),c(1:n),n). Example 8.12 (Solution using QR factorization) Consider the least squares problem with A = and b = This is the matrix in Example The least squares solution x is found by solving the system x x 2 = x and we find x = [1, 0, 0]. Using Householder triangulation is a useful alternative to normal equations for solving full rank least squares problems. It can even be extended to rank deficient problems, see [3]. The 2 norm condition number for the system R 1 x = c 1 is K 2 (R 1 ) = K 2 (Q 1 R 1 ) = K 2 (A), and as discussed in the previous section this is the square root of K 2 (A A), the condition number for the normal equations. Thus if A is mildly ill-conditioned the normal equations can be quite ill-conditioned and solving the normal equations can give inaccurate results. On the other hand Algorithm 4.21 is quite stable. But using Householder transformations requires more work. The leading term in the number of arithmetic operations in Algorithm 4.21 is approximately 2mn 2 2n 3 /3, (cf. (4.14) while the number of arithmetic operations needed to form the normal equations, taking advantage of symmetry is approximately mn 2. Thus for m much larger than n using Householder triangulation requires twice as many arithmetic operations as the approach based on the normal equations. Also, Householder triangulation have problems taking advantage of the structure in sparse problems.

204 192 Chapter 8. Least Squares Least squares and singular value decomposition Consider a singular value decomposition of A and the corresponding singular value factorization: [ ] [ ] A = UΣV Σ1 0 V = [U 1, U 2 ] 1 = U Σ 1 V 1, Σ 1 = diag(σ 1,..., σ r ), V 2 where A has rank r so that σ 1 σ r > 0 and U 1 = [u 1,..., u r ], V 1 = [v 1,..., v r ]. Moreover, U U = I and V V = I. We recall (cf. Theorem 6.15) the set of columns of U 1 span(a), is an orthonormal basis for the column space the set of columns of U 2 is an orthonormal basis for the null space ker(a ), the set of columns of V 2 is an orthonormal basis for the null space ker(a), Theorem 8.13 (Orthogonal projection and least squares solution) Let A C m n, b C m, S := span(a) and T := ker(a ). Then 1. C m = S T is an orthogonal decomposition of C m with respect to the usual inner product s, t = t s. 2. If A = U 1 Σ 1 V 1 is a singular value factorization of A then the orthogonal projection b 1 of b into S is b 1 = U 1 U 1b = AA b, where A := V 1 Σ 1 1 U 1 C n m. (8.8) Proof. By block multiplication b = UU b = [U 1, U 2 [ U 1 U 2 ] b = b 1 + b 2, where b 1 = U 1 U 1b S and b 2 = U 2 U 2b T. Since U 2U 1 = 0 it follows that b 1, b 2 = b 2b 1 = 0 implying Part 1. Moreover, b 1 is the orthogonal projection into S. Since V 1V 1 = I we find and Part 2 follows. AA b = (U 1 Σ 1 V 1)(V 1 Σ 1 1 U 1)b = U 1 U 1b = b 1 Corollary 8.14 (LSQ characterization using SVD) x C n solves the least squares problem min x Ax b 2 2 if and only if x = A b+z, where A is given by (8.8) and z ker(a).
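Theorem 8.13 can be tried out numerically: the projection b_1 = AA†b of b onto span(A) leaves a residual b − b_1 orthogonal to the columns of A. A minimal MATLAB sketch with arbitrary data (pinv computes the generalized inverse A† via the SVD):

A = randn(5,2); b = randn(5,1);     % arbitrary data
b1 = A*pinv(A)*b;                   % orthogonal projection of b onto span(A)
A'*(b - b1)                         % ~0: the residual is orthogonal to span(A)
norm(A*(A\b) - b1)                  % ~0: A*x with x a least squares solution equals b1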

205 8.3. Numerical Solution 193 Proof. Let x be a least squares solution, i. e., Ax = b 1. If z := x A b then Az = Ax AA b = b 1 b 1 = 0 and x = A b + z. If x = A b + z with Az = 0 then A Ax = A A(A b + z) = A (AA b + Az) = A b 1 = A b. The last equality follows since b = b 1 + b 2 and b 2 ker(a ). Thus x satisfies the normal equations and by Theorem 8.5 is a least squares solution. Example 8.15 (Projections and least squares solutions) We have the singular value factorization (cf. Example 6.12) We then find A = 1 2 [ 1 1 A := = [2] 1 [ ] ] [1 2 ] 1 2 [ ] = 1 4 [ ] = AT. Thus, b 1 = U 1 U 1b = [1 1 0]b = b 1 (b 1 + b 2 )/ b 2 = AA b = (b 1 + b 2 )/ b 3 0 Moreover, the set of all least quares solutions is {x R 2 : Ax = b 1 } = {x 1, x 2 R : x 1 + x 2 = b 1 + b 2 }. (8.9) 2 Since ker(a) = {[ z z ] : z R} we obtain the general least squares solution x = 1 [ ] [ [ b b z b1+b 2 ] + = 4 + z b z] 1+b 2. b 3 4 z When rank(a) is less than the number of columns of A then ker(a) {0}, and we have a choice of z. One possible choice is z = 0 giving the solution A b. Theorem 8.16 (Minimal solution) The least squares solution with minimal Euclidian norm is x = A b corresponding to z = 0.

206 194 Chapter 8. Least Squares Proof. Suppose x = A b + z, with z ker(a). Recall that if the right singular vectors of A are partitioned as [v 1,..., v r, v r+1,..., v n ] = [V 1, V 2 ], then V 2 is a basis for ker(a). Moreover, V 2V 1 = 0 since V has orthonormal columns. If A = V 1 Σ 1 1 U 1 and z ker(a) then z = V 2 y for some y C n r and we obtain z A b = y V 2V 1 Σ 1 U 1b = 0. Thus z and A b are orthogonal so that by Pythagoras x 2 2 = A b + z 2 2 = A b z 2 2 A b 2 2 with equality for z = 0. Using MATLAB a least squares solution can be found using x=a\b if A has full rank.for rank deficient problems the function x=lscov(a,b) finds a least squares solution with a maximal number of zeros in x. Example 8.17 (Rank deficient least squares solution) Consider the least squares problem with A = [ ] and b = [1, 1]T. The singular value factorization,a and A b are given by A := 1 [ ] 1 [2] 1 [ ] 1 1, A = 1 [ ] [1 ] [ ] 1 1 [ ] 1 1/ = A, A b =. 1/2 Using Corollary 8.14 we find the general solution [1/2, 1/2]+[a, a] for any a C. lscov gives the solution [1, 0] T corresponding to a = 1/2, while the minimal norm solution is [1/2, 1/2] obtained for a = The generalized inverse Consider the matrix A := V 1 Σ 1 1 U 1 in (8.8). If A is square and nonsingular then A A = AA = I and A is the usual inverse of A. Thus A is a possible generalization of the usual inverse. The matrix A satisfies Properties (1)- (4) in Exercise 8.18 and these properties define A uniquely (cf. Exercise 8.19). The unique matrix B satisfying Properties (1)- (4) in Exercise 8.18 is called the generalized inverse or pseudo inverse of A and denoted A. It follows that A := V 1 Σ 1 1 U 1 for any singular value factorization U 1 Σ 1 V 1 of A. We show in Exercise 8.21 that if A has linearly independent columns then For further properties and examples see the exercises. A = (A A) 1 A. (8.10) Exercise 8.18 (The generalized inverse) Show that B := V 1 Σ 1 1 U 1 satisfies (1) ABA = A, (2) BAB = B, (3) (BA) = BA, and (4) (AB) = AB.
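For a rank-deficient problem like Example 8.17 the least squares solution is not unique. The entries of that example's matrix were lost in extraction, so the rank-one matrix below is an assumed stand-in chosen to be consistent with the quantities quoted there (σ_1 = 2, minimal norm solution [1/2, 1/2]). The sketch compares MATLAB's pinv, which returns the minimal norm solution A†b, with lscov.

A = [1 1; 1 1]; b = [1; 1];   % assumed stand-in: rank(A) = 1
x_min = pinv(A)*b             % minimal norm least squares solution: [1/2; 1/2]
x_ls  = lscov(A,b)            % a particular least squares solution (the text reports [1; 0] here)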

207 8.3. Numerical Solution 195 Exercise 8.19 (Uniqueness of generalized inverse) Given A C m n, and suppose B, C C n m satisfy ABA = A (1) ACA = A, BAB = B (2) CAC = C, (AB) = AB (3) (AC) = AC, (BA) = BA (4) (CA) = CA. Verify the following proof that B = C. B = (BA)B = (A )B B = (A C )A B B = CA(A B )B = CA(BAB) = (C)AB = C(AC)AB = CC A (AB) = CC (A B A ) = C(C A ) = CAC = C. Exercise 8.20 (Verify that a matrix ] is a generalized inverse) Show that the matrices A = [ and B = 1 4 [ ] satisfy the axioms in Exercise Thus we can conclude that B = A without computing the singular value decomposition of A. Exercise 8.21 (Linearly independent columns and generalized inverse) Suppose A C m n has linearly independent columns. Show that A A is nonsingular and A = (A A) 1 A. If A has linearly independent rows, then show that AA is nonsingular and A = A (AA ) 1. Exercise 8.22 (The generalized inverse of a vector) Show that u = (u u) 1 u if u C n,1 is nonzero. Exercise 8.23 (The generalized inverse of an outer product) If A = uv where u C m, v C n are nonzero, show that A = 1 α A, α = u 2 2 v 2 2. Exercise 8.24 (The generalized inverse of a diagonal matrix) Show that diag(λ 1,..., λ n ) = diag(λ 1,..., λ n) where λ i = { 1/λi, λ i 0 0 λ i = 0.

208 196 Chapter 8. Least Squares Exercise 8.25 (Properties of the generalized inverse) Suppose A C m n. Show that a) (A ) = (A ). b) (A ) = A. c) (αa) = 1 α A, α 0. Exercise 8.26 (The generalized inverse of a product) Suppose k, m, n N, A C m n, B C n k. Suppose A has linearly independent columns and B has linearly independent rows. a) Show that (AB) = B A. Hint: Let E = AF, F = B A. Show by using A A = BB = I that F is the generalized inverse of E. b) Find A R 1,2, B R 2,1 such that (AB) B A. Exercise 8.27 (The generalized inverse of the conjugate transpose) Show that A = A if and only if all singular values of A are either zero or one. Exercise 8.28 (Linearly independent columns) Show that if A has rank n then A(A A) 1 A b is the projection of b into span(a). (Cf. Exercise 8.21.) Exercise 8.29 (Analysis of the general linear system) Consider the linear system Ax = b where A C n n has rank r > 0 and b C n. Let [ ] U Σ1 0 AV = 0 0 represent the singular value decomposition of A. a) Let c = [c 1,..., c n ] T = U b and y = [y 1,..., y n ] T = V x. Show that Ax = b if and only if [ ] Σ1 0 y = c. 0 0 b) Show that Ax = b has a solution x if and only if c r+1 = = c n = 0. c) Deduce that a linear system Ax = b has either no solution, one solution or infinitely many solutions.
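The solvability test in Exercise 8.29 b) is easy to try out numerically. The following sketch is not part of the notes; the singular matrix A and the right-hand sides b1, b2 are made-up examples:

    % Sketch illustrating Exercise 8.29 with an assumed singular test matrix.
    A = [1 2; 2 4];                 % rank r = 1
    [U,Sig,V] = svd(A);
    r = rank(A);
    b1 = [1; 2];                    % lies in span(A): solvable
    b2 = [1; 0];                    % does not lie in span(A): no solution
    c1 = U'*b1; c2 = U'*b2;
    disp(c1(r+1:end))               % ~ 0, so Ax = b1 has (infinitely many) solutions
    disp(c2(r+1:end))               % nonzero, so Ax = b2 has no solution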

209 8.4. Perturbation Theory for Least Squares 197 Exercise 8.30 (Fredholm s alternative) For any A C m n, b C n show that one and only one of the following systems has a solution (1) Ax = b, (2) A y = 0, y b 0. In other words either b span(a), or we can find y ker(a ) such that y b 0. This is called Fredholm s alternative. 8.4 Perturbation Theory for Least Squares In this section we consider what effect small changes in the data A, b have on the solution x of the least squares problem min Ax b 2. If A has linearly independent columns then we can write the least squares solution x (the solution of A Ax = A b) as x = A b = A b 1, A := (A A) 1 A, where b 1 is the orthogonal projection of b into the column space span(a) Perturbing the right hand side Let us now consider the effect of a perturbation in b on x. Theorem 8.31 (Perturbing the right hand side) Suppose A C m n has linearly independent columns, and let b, e C m. Let x, y C n be the solutions of min Ax b 2 and min Ay b e 2. Finally, let b 1, e 1 be the orthogonal projections of b and e into span(a). If b 1 0, we have for any operator norm 1 e 1 y x K(A) e 1 K(A) b 1 x b 1, K(A) = A A. (8.11) Proof. Subtracting x = A b 1 from y = A b 1 + A e 1 we have y x = A e 1. Thus y x = A e 1 A e 1. Moreover, b 1 = Ax A x. Therefore y x / x A A e 1 / b 1 proving the rightmost inequality. From A(x y) = e 1 and x = A b 1 we obtain the leftmost inequality. (8.11) is analogous to the bound (7.20) for linear systems. We see that the number K(A) = A A generalizes the condition number A A 1 for a square matrix. The main difference between (8.11) and (7.20) is however that e / b in (7.20) has been replaced by e 1 / b 1, the orthogonal projections of e and b into span(a). If b lies almost entirely in ker(a ), i.e. b / b 1 is large,

210 198 Chapter 8. Least Squares τ = N( A T ) b b 2 e span( A ) e 1 b 1 Figure 8.3: Graphical interpretation of the bounds in Theorem then e 1 / b 1 can be much larger than e / b. This is illustrated in Figure 8.3. If b is almost orthogonal to span(a), e 1 / b 1 will normally be much larger than e / b. Example 8.32 (Perturbing the right hand side) Suppose A = 0 1, b = 0, e = For this example we can compute K(A) by finding A explicitly. Indeed, [ ] [ ] [ ] A T 1 1 A =, (A T A) =, A = (A T A) 1 A T = Thus K (A) = A A = 2 2 = 4 is quite small. Consider now the projections b 1 and e 1. We find AA =. [ ] Hence b 1 = AA b = [10 4, 0, 0] T, and e 1 = AA e = [10 6, 0, 0] T. Thus e 1 / b 1 = 10 2 and (8.11) takes the form y x x (8.12) To verify the bounds we compute the solutions as x = A b = [10 4, 0] T y = A (b + e) = [ , 0] T. Hence and x y x = = 10 2,

211 8.4. Perturbation Theory for Least Squares 199 in agreement with (8.12) Exercise 8.33 (Condition number) Let 1 2 A = 1 1, b = 1 1 a) Determine the projections b 1 and b 2 of b on span(a) and ker(a T ). b) Compute K(A) = A 2 A 2. For each A we can find b and e so that we have equality in the upper bound in (8.11). The lower bound is best possible in a similar way. b 1 b 2 b 3 Exercise 8.34 (Equality in perturbation bound). a) Let A C m n. Show that we have equality to the right in (8.11) if b = Ay A, e 1 = y A where Ay A = A, A y A = A. b) Show that we have equality to the left if we switch b and e in a). c) Let A be as in Example Find extremal b and e when the l norm is used Perturbing the matrix The analysis of the effects of a perturbation E in A is quite difficult. The following result is stated without proof, see [22, p. 51]. For other estimates see [3] and [30]. Theorem 8.35 (Perturbing the matrix) Suppose A, E C m n, m > n, where A has linearly independent columns and α := 1 E 2 A 2 > 0. Then A + E has linearly independent columns. Let b = b 1 + b 2 C m where b 1 and b 2 are the orthogonal projections into span(a) and ker(a ) respectively. Suppose b 1 0. Let x and y be the solutions of min Ax b 2 and min (A + E)y b 2. Then ρ = x y 2 x 2 1 α K(1 + βk) E 2 A 2, β = b 2 2 b 1 2, K = A 2 A 2. (8.13) (8.13) says that the relative error in y as an approximation to x can be at most K(1 + βk)/α times as large as the size E 2 / A 2 of the relative perturbation in A. β will be small if b lies almost entirely in span(a), and we have approximately ρ 1 α K E 2/ A 2. This corresponds to the estimate (7.26) for

212 200 Chapter 8. Least Squares linear systems. If β is not small, the term 1 α K2 β E 2 / A 2 will dominate. In other words, the condition number is roughly K(A) if β is small and K(A) 2 β if β is not small. Note that β is large if b is almost orthogonal to span(a) and that b 2 = b Ax is the residual of x. Exercise 8.36 (Problem using normal equations) Consider the least squares problems where A = , b = 2 3, ɛ R. 1 1+ɛ 2 a) Find the normal equations and the exact least squares solution. b) Suppose ɛ is small and we replace the (2, 2) entry 3+2ɛ+ɛ 2 in A T A by 3+2ɛ. (This will be done in a computer if ɛ < u, u being the round-off unit). For example, if u = then u = Solve A T Ax = A T b for x and compare with the x found in a). (We will get a much more accurate result using the QR factorization or the singular value decomposition on this problem). 8.5 Perturbation Theory for Singular Values In this section we consider what effect a small change in the matrix A has on the singular values The Minmax Theorem for Singular Values and the Hoffman-Wielandt Theorem We have a minmax and maxmin characterization for singular values. Theorem 8.37 (The Courant-Fischer Theorem for Singular Values) Suppose A C m,n has singular values σ 1, σ 2,..., σ n ordered so that σ 1 σ n. Then for k = 1,..., n Proof. Since σ k = min max Ax 2 = max dim(s)=n k+1 x S x 2 x 0 Ax 2 2 x 2 2 = Ax, Ax x, x min dim(s)=k x S x 0 = x, A Ax x, x Ax 2 x 2. (8.14) denotes the Rayleigh quotient R A A(x) of A A, and since the singular values of A are the nonnegative square roots of the eigenvalues of A A, the results follow from the Courant-Fischer Theorem for eigenvalues, see Theorem 5.44.
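A quick numerical illustration of Theorem 8.37 (my sketch, with an arbitrary random test matrix): for any x not equal to 0 the ratio ||Ax||_2/||x||_2 lies between the smallest and the largest singular value.

    % Sketch: the extreme singular values bound the ratio norm(A*x)/norm(x).
    rng(1); A = randn(5,3);         % assumed random test matrix
    s = svd(A);                     % s(1) >= s(2) >= s(3) > 0
    for trial = 1:5
      x = randn(3,1);
      q = norm(A*x)/norm(x);
      fprintf('%8.4f lies in [%8.4f, %8.4f]\n', q, s(end), s(1));
    end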

213 8.5. Perturbation Theory for Singular Values 201 By taking k = 1 and k = n in (8.14) we obtain for any A C m,n Ax 2 σ 1 = max x C n x 0 x 2, σ n = min x C n x 0 Ax 2 x 2. (8.15) This follows since the only subspace of C n of dimension n is C n itself. The Hoffman-Wielandt Theorem, see Theorem 5.47, for eigenvalues of Hermitian matrices can be written n µ j λ j 2 A B 2 F := j=1 n i=1 j=1 n a ij b ij 2, (8.16) where A, B C n,n are both Hermitian matrices with eigenvalues λ 1 λ n and µ 1 µ n, respectively. For singular values we have a similar result. Theorem 8.38 (Hoffman-Wielandt Theorem for singular values) For any m, n N and A, B C m,n we have n β j α j 2 A B 2 F. (8.17) j=1 where α 1 α n and β 1 β n are the singular values of A and B, respectively. Proof. We apply the Hoffman-Wielandt Theorem for eigenvalues to the Hermitian matrices C := [ 0 ] A A 0 and D := [ ] 0 B B C m+n,m+n. 0 If C and D has eigenvalues λ 1 λ m+n and µ 1 µ m+n, respectively then m+n λ j µ j 2 C D 2 F. (8.18) j=1 Suppose A has rank r and SVD UΣV. We use (6.10) and determine the eigen-

214 202 Chapter 8. Least Squares pairs of C as follows. [ ] [ ] [ ] [ ] [ ] 0 A ui Avi αi u A = 0 v i A = i ui = α u i α i v i, i = 1,..., r, i v i [ ] [ ] [ ] [ ] [ ] 0 A ui Avi αi u A = 0 v i A = i ui = α u i α i v i, i = 1,..., r, i v i [ ] [ ] [ ] [ [ ] 0 A ui 0 0 ui A = 0 0 A = = 0, i = r + 1,..., m, u i 0] 0 [ ] [ ] [ ] [ [ 0 A 0 Avi 0 0 A = = = 0, i = r + 1,..., n. 0 v i 0 0] vi] Thus C has the 2r eigenvalues α 1, α 1,..., α r, α r and m + n 2r additional zero eigenvalues. Similarly, if B has rank s then D has the 2s eigenvalues β 1, β 1,..., β s, β s and m + n 2s additional zero eigenvalues. Let Then We find t := max(r, s). λ 1 λ m+n = α 1 α t 0 = = 0 α t α 1, µ 1 µ m+n = β 1 β t 0 = = 0 β t β 1. m+n λ j µ j 2 = j=1 t α i β i 2 + i=1 t α i + β i 2 = 2 i=1 t α i β i 2 and [ ] C D 2 0 A B F = A B 2 0 F = B A 2 F + (B A) 2 F = 2 B A 2 F. But then (8.18) implies t i=1 α i β i 2 B A 2 F. Since t n and α i = β i = 0 for i = t + 1,..., n we obtain (8.17). The Hoffman-Wielandt Theorem for singular values, Theorem 8.38 shows that the singular values of a matrix are well conditioned. Changing the Frobenius norm of a matrix by small amount only changes the singular values by a small amount. Using the 2-norm we have a similar result. Theorem 8.39 (Perturbation of singular values) Let A, B R m n be rectangular matrices with singular values α 1 α 2 α n and β 1 β 2 β n. Then α j β j A B 2, for j = 1, 2,..., n. (8.19) i=1

215 8.6. Review Questions 203 Proof. Fix j and let S be the n j + 1 dimensional subspace for which the minimum in Theorem 8.37 is obtained for A. Then (B + (A B))x 2 α j = max x S x 2 x 0 max x S x 0 Bx 2 x 2 (A B)x 2 +max β j + A B 2. x S x 2 x 0 By symmetry we obtain β j α j + A B 2 and the proof is complete. The following result is an analogue of Theorem Theorem 8.40 (Generalized inverse when perturbing the matrix) Let A, E R m n have singular values α 1 α n and ɛ 1 ɛ n. If A 2 E 2 < 1 then 1. rank(a + E) rank(a), 2. (A + E) 2 where r is the rank of A. A 2 1 A 2 E 2 = 1 α r ɛ 1, Proof. Suppose A has rank r and let B := A + E have singular values β 1 β n. In terms of singular values the inequality A 2 E 2 < 1 can be written ɛ 1 /α r < 1 or α r > ɛ 1. By Theorem 8.39 we have α r β r ɛ 1, which implies β r α r ɛ 1 > 0, and this shows that rank(a+e) > r. To prove 2., the inequality β r α r ɛ 1 implies that (A + E) 2 1 β r 1 α r ɛ 1 = 1/α r 1 ɛ 1 /α r = A 2 1 A 2 E Review Questions Do the normal equations always have a solution? When is the least squares solution unique? Express the general least squares solution in terms of the generalized inverse Consider perturbing the right-hand side in a linear equation and a least squares problem. What is the main difference in the perturbation inequalities?

216 204 Chapter 8. Least Squares Why does one often prefer using QR factorization instead of normal equations for solving least squares problems What is an orthogonal sum? How is an orthogonal projection defined?

Part III  Kronecker Products and Fourier Transforms


219 Chapter 9 The Kronecker Product Leopold Kronecker, (left), Siméon Denis Poisson, (right). Matrices arising from 2D and 3D problems sometimes have a Kronecker product structure. Identifying a Kronecker structure can be very rewarding since it simplifies the study of such matrices The 2D Poisson problem Let Ω := (0, 1) 2 = {(x, y) : 0 < x, y < 1} be the open unit square with boundary Ω. Consider the problem u := 2 u x 2 2 u = f on Ω, (9.1) y2 u := 0 on Ω. Here the function f is given and continuous on Ω, and we seek a function u = u(x, y) such that (9.1) holds and which is zero on Ω. 207

220 208 Chapter 9. The Kronecker Product Let m be a positive integer. We solve the problem numerically by finding approximations v j,k u(jh, kh) on a grid of points given by Ω h := {(jh, kh) : j, k = 0, 1,..., m + 1}, where h = 1/(m + 1). The points Ω h := {(jh, kh) : j, k = 1,..., m} are the interior points, while Ω h \Ω h are the boundary points. The solution is zero at the boundary points. Using the difference approximation from Chapter 1 for the second derivative we obtain the following approximations for the partial derivatives 2 u(jh, kh) x 2 v j 1,k 2v j,k + v j+1,k h 2, 2 u(jh, kh) y 2 v j,k 1 2v j,k + v j,k+1 h 2. Inserting this in (9.1) we get the following discrete analog of (9.1) h v j,k = f j,k, (jh, kh) Ω h, v j,k = 0, (jh, kh) Ω h, (9.2) where f j,k := f(jh, kh) and h v j,k := v j 1,k + 2v j,k v j+1,k h 2 Multiplying both sides of (9.2) by h 2 we obtain + v j,k 1 + 2v j,k v j,k+1 h 2. (9.3) 4v j,k v j 1,k v j+1,k v j,k 1 v j,k+1 = h 2 f jk, (jh, kh) Ω h, v 0,k = v m+1,k = v j,0 = v j,m+1 = 0, j, k = 0, 1,..., m + 1. (9.4) The equations in (9.4) define a set of linear equations for the unknowns V := [v jk ] R m m. Observe that (9.4) can be written as a matrix equation in the form T V + V T = h 2 F with h = 1/(m + 1), (9.5) where T = tridiag( 1, 2, 1) R m m is the second derivative matrix given by (1.23) and F = (f jk ) = (f(jh, kh)) R m m. Indeed, the (j, k) element in T V + V T is given by m m T j,i v i,k + v j,i T i,k, i=1 and this is precisely the left hand side of (9.4). To write (9.4) in standard form Ax = b we need to order the unknowns v j,k in some way. The following operation of vectorization of a matrix gives one possible ordering. i=1

221 209 1,1 1,2 1,3 1,3 2,3 3, ,1 2,2 2,3 1,2 2,2 3, ,1 3,2 3,3 1,1 2,1 3, v in V - matrix j,k v in grid j,k x i in grid Figure 9.1: Numbering of grid points v j,k+1 x i+m -1 v v v j-1,k x x j,k j+1,k i-1 i i x -1 v x j,k-1 i-m -1 Figure 9.2: The 5-point stencil Definition 9.1 (vec operation) For any B R m n we define the vector vec(b) := [b 11,..., b m1, b 12,..., b m2,..., b 1n,..., b mn ] T R mn by stacking the columns of B on top of each other. Let n = m 2 and x := vec(v ) R n. Note that forming x by stacking the columns of V on top of each other means an ordering of the grid points which for m = 3 is illustrated in Figure 9.1. We call this the natural ordering. The elements in (9.4) form a 5-point stencil, as shown in Figure 9.2. To find the matrix A we note that for values of j, k where the 5-point stencil does not touch the boundary, (9.4) takes the form 4x i x i 1 x i+1 x i m x i+m = b i,

222 210 Chapter 9. The Kronecker Product Figure 9.3: Band structure of the 2D test matrix, n = 9, n = 25, n = 100 where x i = v jk and b i = h 2 f jk. This must be modified close to the boundary. We obtain the linear system Ax = b, A R n n, b R n, n = m 2, (9.6) where x = vec(v ), b = h 2 vec(f ) with F = (f jk ) R m m, and A is the Poisson matrix given by a ii = 4, i = 1,..., n, a i+1,i = a i,i+1 = 1, i = 1,..., n 1, i m, 2m,..., (m 1)m, a i+m,i = a i,i+m = 1, i = 1,..., n m, a ij = 0, otherwise. (9.7) For m = 3 we have the following matrix A = Exercise 9.2 (4 4 Poisson matrix) Write down the Poisson matrix for m = 2 and show that it is strictly diagonally dominant.
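The Poisson matrix (9.7) is easy to assemble directly from its index description. The following Matlab lines are an illustrative sketch (mine, not part of the notes); for m = 2 the resulting 4-by-4 matrix is the one asked for in Exercise 9.2, and the last line checks strict diagonal dominance.

    % Sketch: build the n-by-n Poisson matrix of (9.7), n = m^2.
    m = 2; n = m^2;
    A = 4*eye(n);
    for i = 1:n-1
      if mod(i,m) ~= 0              % skip i = m, 2m, ..., (m-1)m
        A(i,i+1) = -1; A(i+1,i) = -1;
      end
    end
    for i = 1:n-m
      A(i,i+m) = -1; A(i+m,i) = -1;
    end
    disp(A)
    % strict diagonal dominance for m = 2: 4 > sum of |off-diagonal| in every row
    all(abs(diag(A)) > sum(abs(A),2) - abs(diag(A)))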

223 9.1. The Kronecker Product The test matrices The (2-dimensional) Poisson matrix is a special case of the matrix T 2 = [a ij ] R n n with elements a ii = 2d, i = 1,..., n, a i,i+1 = a i+1,i = a, i = 1,..., n 1, i m, 2m,..., (m 1)m, a i,i+m = a i+m,i = a, i = 1,..., n m, a ij = 0, otherwise, (9.8) and where a, d are real numbers. We will refer to this matrix as simply the 2D test matrix. For m = 3 the 2D test matrix looks as follows 2d a 0 a a 2d a 0 a a 2d 0 0 a a 0 0 2d a 0 a 0 0 T 2 = 0 a 0 a 2d a 0 a 0. (9.9) 0 0 a 0 a 2d 0 0 a a 0 0 2d a a 0 a 2d a a 0 a 2d The partition into 3 3 sub matrices shows that T 2 is block tridiagonal. Properties of T 2 can be derived from properties of T 1 = tridiagonal(a, d, a) by using properties of the Kronecker product. 9.1 The Kronecker Product Definition 9.3 (Kronecker product) For any positive integers p, q, r, s we define the Kronecker product of two matrices A R p q and B R r s as a matrix C R pr qs given in block form as Ab 1,1 Ab 1,2 Ab 1,s Ab 2,1 Ab 2,2 Ab 2,s C = Ab r,1 Ab r,2 Ab r,s We denote the Kronecker product of A and B by C = A B. This definition of the Kronecker product is known more precisely as the left Kronecker product. In the literature one often finds the right Kronecker product which in our notation is given by B A. The Kronecker product u v = [ ] u T v 1,..., u T T v r of two column vectors u R p and v R r is a column vector of length p r.
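A practical remark (mine, not from the notes): since the (i, j) block of the left Kronecker product A ⊗ B is b_{ij}A, the A ⊗ B used here corresponds to kron(B,A) in Matlab, while the built-in kron(A,B) is the right Kronecker product mentioned above. A small check with two arbitrary 2-by-2 test matrices:

    % Sketch: the left Kronecker product A (x) B of Definition 9.3 equals kron(B,A).
    A = [1 2; 3 4]; B = [0 1; 1 0];                      % assumed test matrices
    C_left = [A*B(1,1) A*B(1,2); A*B(2,1) A*B(2,2)];     % Definition 9.3 written out
    norm(C_left - kron(B,A))                             % ~ 0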

224 212 Chapter 9. The Kronecker Product As examples of Kronecker products which are relevant for our discussion, if d a T 1 = a d a and I = a d then T 1 I + I T 1 = T T T 1 + di ai 0 ai di ai 0 ai di = T 2 given by (9.9). The same equation holds for any integer m 2 T 1 I + I T 1 = T 2, T 1, I R m m, T 2 R (m2 ) (m 2). (9.10) The sum of two Kronecker products involving the identity matrix is worthy of a special name. Definition 9.4 (Kronecker sum) For positive integers r, s, k, let A R r r, B R s s, and I k be the identity matrix of order k. The sum A I s + I r B is known as the Kronecker sum of A and B. In other words, the 2D test matrix T 2 is the Kronecker sum involving the 1D test matrix T 1. The following simple arithmetic rules hold for Kronecker products. For scalars λ, µ and matrices A, A 1, A 2, B, B 1, B 2, C of dimensions such that the operations are defined, we have ( ) ( ) ( ) λa µb = λµ A B, ( ) A1 + A 2 B = A1 B + A 2 B, A ( ) (9.11) B 1 + B 2 = A B1 + A B 2, (A B) C = A (B C). Note however that in general we have A B B A, but it can be shown that there are permutation matrices P, Q such that B A = P (A B)Q, see [16]. Exercise 9.5 (Properties of Kronecker products) Prove (9.11). The following mixed product rule is an essential tool for dealing with Kronecker products and sums.

225 9.1. The Kronecker Product 213 Lemma 9.6 (Mixed product rule) Suppose A, B, C, D are rectangular matrices with dimensions so that the products AC and BD are defined. Then the product (A B)(C D) is defined and (A B)(C D) = (AC) (BD). (9.12) Proof. If B R r,t and D R t,s for some integers r, s, t, then Ab 1,1 Ab 1,t Cd 1,1 Cd 1,s (A B)(C D) =..... Ab r,1 Ab r,t Cd t,1 Cd t,s Thus for all i, j ((A B)(C D)) i,j = AC t b i,k d k,j = (AC)(BD) i,j = ((AC) (BD)) i,j. k=1 Using the mixed product rule we obtain the following properties of Kronecker products and sums. Theorem 9.7 (Properties of Kronecker products) Suppose for r, s N that A R r,r and B R s,s are square matrices with eigenpairs (λ i, u i ) i = 1,..., r and (µ j, v j ), j = 1,..., s. Moreover, let F, V R r s. Then 1. (A B) T = A T B T, (this also holds for rectangular matrices). 2. If A and B are nonsingular then A B is nonsingular. with (A B) 1 = A 1 B If A and B are symmetric then A B and A I + I B are symmetric. 4. (A B)(u i v j ) = λ i µ j (u i v j ), i = 1,..., r, j = 1,..., s, 5. (A I s +I r B)(u i v j ) = (λ i +µ j )(u i v j ), i = 1,..., r, j = 1,..., s., 6. If one of A, B is symmetric positive definite and the other is symmetric positive semidefinite then A I + I B is symmetric positive definite. 7. AV B T = F (A B) vec(v ) = vec(f ), 8. AV + V B T = F (A I s + I r B) vec(v ) = vec(f ).

226 214 Chapter 9. The Kronecker Product Before giving the simple proofs of this theorem we present some comments. 1. The transpose (or the inverse) of an ordinary matrix product equals the transpose (or the inverse) of the matrices in reverse order. For Kronecker products the order is kept. 2. The eigenvalues of the Kronecker product (or sum) are the product (or sum) of the eigenvalues of the factors. The eigenvectors are the Kronecker products of the eigenvectors of the factors. In particular, the eigenvalues of the test matrix T 2 are sums of eigenvalues of T 1. We will find these eigenvalues in the next section. 3. Since we already know that T = tridiag( 1, 2, 1) is positive definite the 2D Poisson matrix A = T I + I T is also positive definite. 4. The system AV B T = F in part 7 can be solved by first finding W from AW = F, and then finding V from BV T = W T. This is preferable to solving the much larger linear system (A B) vec(v ) = vec(f ). 5. A fast way to solve the 2D Poisson problem in the form T V + V T = F will be considered in the next chapter. Proof. 1. Exercise. 2. By the mixed product rule ( A B )( A 1 B 1) = ( AA 1) ( BB 1) = I r I s = I rs.thus ( A B ) is nonsingular with the indicated inverse. 3. By 1, (A B) T = A T B T = A B. Moreover, since then A I and I B are symmetric, their sum is symmetric. 4. (A B)(u i v j ) = (Au i ) (Bv j ) = (λ i u i ) (µ j v j ) = (λ i µ j )(u i v j ), for all i, j, where we used the mixed product rule. 5. (A I s )(u i v j ) = λ i (u i v j ), and (I r B)(u i v j ) = µ j (u i v j ). The result now follows by summing these relations. 6. By 1, A I + I B is symmetric. Moreover, the eigenvalues λ i + µ j are positive since for all i, j, both λ i and µ j are nonnegative and one of them is positive. It follows that A I + I B is symmetric positive definite.

227 9.2. Properties of the 2D Test Matrices We partition V, F, and B T by columns as V = [v 1,..., v s ], F = [f 1,..., f s ] and B T = [b 1,..., b s ]. Then we have (A B) vec(v ) = vec(f ) Ab 11 Ab 1s v 1 f 1... =. Ab s1 Ab ss v s f s [ A b 1j v j,..., ] b sj v j = [f 1,..., f s ] j j 8. This follows immediately from (7) as follows A[V b 1,..., V b s ] = F AV B T = F. (A I s + I r B) vec(v ) = vec(f ) (AV I T s + I r V B T ) = F AV + V B T = F. For more on Kronecker products see [16]. 9.2 Properties of the 2D Test Matrices Using Theorem 9.7 we can derive properties of the 2D test matrix T 2 from those of T 1. We need to determine the eigenpairs of T 1. Theorem 9.8 (Eigenpairs of 2D test matrix) For fixed m 2 let T 2 be the matrix given by (9.8) and let h = 1/(m + 1). 1. We have T 2 x j,k = λ j,k x j,k for j, k = 1,..., m, where 2. The eigenvectors are orthogonal x j,k = s j s k, (9.13) s j = [sin(jπh), sin(2jπh),..., sin(mjπh)] T, (9.14) λ j,k = 2d + 2a cos(jπh) + 2a cos(kπh). (9.15) x T j,kx p,q = 1 4h 2 δ j,pδ k,q, j, k, p, q = 1,..., m. (9.16) 3. T 2 is symmetric positive definite if d > 0 and d 2 a.

228 216 Chapter 9. The Kronecker Product Proof. By Theorem 9.7 the eigenvalues of T 2 = T 1 I + I T 1 are sums of eigenvalues of T 1 and the eigenvectors are Kronecker producets of the eigenvectors of T 1. Part 1 now follows from Lemma Using the transpose rule, the mixed product rule and (1.37) we find for j, k, p, q = 1,..., m ( sj s k ) T ( sp s q ) = ( s T j s T k )( sp s q ) = ( s T j s p ) ( s T k s q ) = 1 4h 2 δ j,pδ k,q and part 2 follows. Since T 2 is symmetric, part 3 will follow if the eigenvalues are positive. But this is true if d > 0 and d 2 a. Thus T 2 is positive definite. Exercise 9.9 (2. derivative matrix is positive definite) Write down the eigenvalues of T = tridiag( 1, 2, 1) using Lemma 1.31 and conclude that T is symmetric positive definite. Exercise 9.10 (1D test matrix is positive definite?) Use Lemma 1.31 to show that the matrix T 1 := tridiag(a, d, a) R n n is symmetric positive definite if d > 0 and d 2 a. Exercise 9.11 (Eigenvalues for 2D test matrix of order 4) For m = 2 the matrix (9.8) is given by A = 2d a a 0 a 2d 0 a a 0 2d a 0 a a 2d Show that λ = 2a + 2d is an eigenvalue corresponding to the eigenvector x = [1, 1, 1, 1] T. Verify that apart from a scaling of the eigenvector this agrees with (9.15) and (9.14) for j = k = 1 and m = 2.. Exercise 9.12 (Nine point scheme for Poisson problem) Consider the following 9 point difference approximation to the Poisson problem u = f, u = 0 on the boundary of the unit square (cf. (9.1)) (a) ( h v) j,k = (µf) j,k j, k = 1,..., m (b) 0 = v 0,k = v m+1,k = v j,0 = v j,m+1, j, k = 0, 1,..., m + 1, (c) ( h v) j,k = [20v j,k 4v j 1,k 4v j,k 1 4v j+1,k 4v j,k+1 v j 1,k 1 v j+1,k 1 v j 1,k+1 v j+1,k+1 ]/(6h 2 ), (9.17) (d) (µf) j,k = [8f j,k + f j 1,k + f j,k 1 + f j+1,k + f j,k+1 ]/12.

229 9.2. Properties of the 2D Test Matrices 217 a) Write down the 4-by-4 system we obtain for m = 2. b) Find v j,k for j, k = 1, 2, if f(x, y) = 2π 2 sin (πx) sin (πy) and m = 2. Answer: v j,k = 5π 2 /66. It can be shown that (9.17) defines an O(h 4 ) approximation to (9.1). Exercise 9.13 (Matrix equation for nine point scheme) Consider the nine point difference approximation to (9.1) given by (9.17) in Problem a) Show that (9.17) is equivalent to the matrix equation T V + V T 1 6 T V T = h2 µf. (9.18) Here µf has elements (µf) j,k given by (9.17d) and T = tridiag( 1, 2, 1). b) Show that the standard form of the matrix equation (9.18) is Ax = b, where A = T I + I T 1 6 T T, x = vec(v ), and b = h2 vec(µf ). Exercise 9.14 (Biharmonic equation) Consider the biharmonic equation 2 u(s, t) := ( u(s, t) ) = f(s, t) (s, t) Ω, u(s, t) = 0, u(s, t) = 0 (s, t) Ω. (9.19) Here Ω is the open unit square. The condition u = 0 is called the Navier boundary condition. Moreover, 2 u = u xxxx + 2u xxyy + u yyyy. a) Let v = u. Show that (9.19) can be written as a system v(s, t) = f(s, t) (s, t) Ω u(s, t) = v(s, t) (s, t) Ω u(s, t) = v(s, t) = 0 (s, t) Ω. (9.20) b) Discretizing, using (9.3), with T = tridiag( 1, 2, 1) R m m, h = 1/(m+1), and F = ( f(jh, kh) ) m we get two matrix equations j,k=1 Show that T V + V T = h 2 F, T U + UT = h 2 V. (T I + I T )vec(v ) = h 2 vec(f ), (T I + I T )vec(u) = h 2 vec(v ). and hence A = (T I + I T ) 2 is the matrix for the standard form of the discrete biharmonic equation.

230 218 Chapter 9. The Kronecker Product c) Show that with n = m 2 the vector form and standard form of the systems in b) can be written T 2 U + 2T UT + UT 2 = h 4 F and Ax = b, (9.21) where A = T 2 I+2T T +I T 2 R n n, x = vec(u), and b = h 4 vec(f ). d) Determine the eigenvalues and eigenvectors of the matrix A in c) and show that it is symmetric positive definite. Also determine the bandwidth of A. e) Suppose we want to solve the standard form equation Ax = b. We have two representations for the matrix A, the product one in b) and the one in c). Which one would you prefer for the basis of an algorithm? Why? 9.3 Review Questions Consider the Poisson matrix. Write this matrix as a Kronecker sum, how are its eigenvalues and eigenvectors related to the second derivative matrix? is it symmetric? positive definite? What are the eigenpairs of T 1 := tridiagonal(a, d, a)? What are the inverse and transpose of a Kronecker product? give an economical general way to solve the linear system (A B) vec(v ) = vec(f )? Same for (A I s + I r B) vec(v ) = vec(f ).
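The last review question can be illustrated with a short Matlab sketch (mine; A, B and F are arbitrary test data, assumed nonsingular where needed). By part 7 of Theorem 9.7, (A ⊗ B) vec(V) = vec(F) is equivalent to A V B^T = F, so it suffices to solve one small system with A and one with B instead of one large system of order rs:

    % Sketch: solve (A (x) B) vec(V) = vec(F) without forming the Kronecker matrix.
    r = 4; s = 3;
    A = randn(r,r) + r*eye(r);      % assumed nonsingular test matrix
    B = randn(s,s) + s*eye(s);      % assumed nonsingular test matrix
    F = randn(r,s);
    W = A\F;                        % solve A*W = F, where W = V*B'
    V = (B\W')';                    % solve B*V' = W'
    % check against the vec form; recall that A (x) B here is kron(B,A) in Matlab
    err = norm(kron(B,A)*V(:) - F(:))

The Kronecker sum system in part 8 corresponds in the same way to the matrix equation A V + V B^T = F; solving that matrix equation directly is exactly what the fast Poisson solver in the next chapter does.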

Chapter 10  Fast Direct Solution of a Large Linear System

10.1 Algorithms for a Banded Positive Definite System

In this chapter we present a fast method for solving Ax = b, where A is the Poisson matrix (9.7). Thus, for n = 9,

    A = [  4 -1  0 -1  0  0  0  0  0
          -1  4 -1  0 -1  0  0  0  0
           0 -1  4  0  0 -1  0  0  0
          -1  0  0  4 -1  0 -1  0  0
           0 -1  0 -1  4 -1  0 -1  0
           0  0 -1  0 -1  4  0  0 -1
           0  0  0 -1  0  0  4 -1  0
           0  0  0  0 -1  0 -1  4 -1
           0  0  0  0  0 -1  0 -1  4 ]

      = [ T+2I   -I      0
          -I     T+2I   -I
           0     -I     T+2I ],

where T = tridiag(-1, 2, -1). For the matrix A we know by now that

1. It is symmetric positive definite.
2. It is banded.
3. It is block-tridiagonal.

232 220 Chapter 10. Fast Direct Solution of a Large Linear System Figure 10.1: Fill-inn in the Cholesky factor of the Poisson matrix (n = 100). 4. We know the eigenvalues and eigenvectors of A. 5. The eigenvectors are orthogonal Cholesky factorization Since A is symmetric positive definite we can use the Cholesky factorization A = LL T, with L lower triangular, to solve Ax = b. Since A and L has the same bandwidth d = n the complexity of this factorization is O(nd 2 ) = O(n 2 ). We need to store A, and this can be done in sparse form. The nonzero elements in L are shown in Figure 10.1 for n = 100. Note that most of the zeros between the diagonals in A have become nonzero in L. This is known as fill-inn Block LU factorization of a block tridiagonal matrix The Poisson matrix has a block tridiagonal structure. Consider finding the block LU factorization of a block tridiagonal matrix. We are looking for a factorization of the form D 1 C 1 A 1 D 2 C A m 2 D m 1 C m 1 A m 1 D m = I L 1 I L m 1 I U 1 C U m 1 C m 1 U m. (10.1) Here D 1,..., D m and U 1,..., U m are square matrices while A 1,..., A m 1,L 1,...,L m 1 and C 1,..., C m 1 can be rectangular.

233 10.2. A Fast Poisson Solver based on Diagonalization 221 Using block multiplication the formulas (1.16) generalize to U 1 = D 1, L k = A k U 1 k, U k+1 = D k+1 L k C k, k = 1, 2,..., m 1. (10.2) To solve the system Ax = b we partition b conformally with A in the form b T = [b T 1,..., b T m]. The formulas for solving Ly = b and Ux = y are as follows: y 1 = b 1, y k = b k L k 1 y k 1, k = 2, 3,..., m, x m = U 1 m y m, x k = U 1 k (y k C k x k+1 ), k = m 1,..., 2, 1. (10.3) The solution is then x T = [x T 1,..., x T m]. To find L k in (10.2) we solve the linear systems L k U k = A k. Similarly we need to solve a linear system to find x k in (10.3). The number of arithmetic operations using block factorizations is O(n 2 ), asymptotically the same as for Cholesky factorization. However we only need to store the m m blocks and using matrix operations can be an advantage Other methods Other methods include Iterative methods, (we study this in Chapters 11 and 12), multigrid. See [9], fast solvers based on diagonalization and the fast Fourier transform. See Sections 10.2, A Fast Poisson Solver based on Diagonalization The algorithm we now derive will only require O(n 3/2 ) arithmetic operations and we only need to work with matrices of order m. Using the fast Fourier transform the number of arithmetic operations can be reduced further to O(n log n). To start we recall that Ax = b can be written as a matrix equation in the form (cf. (9.5)) T V + V T = h 2 F with h = 1/(m + 1), where T = tridiag( 1, 2, 1) R m m is the second derivative matrix, V = (v jk ) R m m are the unknowns, and F = (f jk ) = (f(jh, kh)) R m m contains function values.

Recall that the eigenpairs of T are given by T s_j = λ_j s_j, j = 1, ..., m, where

    s_j = [sin(jπh), sin(2jπh), ..., sin(mjπh)]^T,
    λ_j = 2 - 2cos(jπh) = 4 sin^2(jπh/2),   h = 1/(m+1),
    s_j^T s_k = δ_{jk}/(2h) for all j, k.

Let

    S := [s_1, ..., s_m] = [sin(jkπh)]_{j,k=1}^m ∈ R^{m×m},   D := diag(λ_1, ..., λ_m).   (10.4)

Then

    T S = [T s_1, ..., T s_m] = [λ_1 s_1, ..., λ_m s_m] = S D,   S^2 = S^T S = (1/(2h)) I.

Define X ∈ R^{m×m} by V = S X S, where V is the solution of T V + V T = h^2 F. Then

    T V + V T = h^2 F
    ⇔  T S X S + S X S T = h^2 F                         (insert V = S X S)
    ⇔  S T S X S^2 + S^2 X S T S = h^2 S F S =: h^2 G    (multiply by S on both sides)
    ⇔  S^2 D X S^2 + S^2 X S^2 D = h^2 G                 (use T S = S D)
    ⇔  D X + X D = 4h^4 G                                (use S^2 = I/(2h)).

Since D is diagonal, the equation D X + X D = 4h^4 G is easy to solve. For the (j, k) element we find

    (D X + X D)_{j,k} = Σ_{l=1}^m d_{j,l} x_{l,k} + Σ_{l=1}^m x_{j,l} d_{l,k} = λ_j x_{j,k} + λ_k x_{j,k},

so that for all j, k

    x_{j,k} = 4h^4 g_{j,k}/(λ_j + λ_k) = h^4 g_{j,k}/(σ_j + σ_k),   σ_j := λ_j/4 = sin^2(jπh/2).

Thus to find V we compute

1. G = S F S,
2. x_{j,k} = h^4 g_{j,k}/(σ_j + σ_k), j, k = 1, ..., m,
3. V = S X S.

We can compute X, S and the σ_j without using loops. Using outer products, element-by-element division, and raising a matrix element by element to a power we find

1. X = h^4 G ./ M, where M := [σ_1; ...; σ_m][1, ..., 1] + [1; ...; 1][σ_1, ..., σ_m],
2. S = sin( πh [1; 2; ...; m][1, 2, ..., m] ),   [σ_1, ..., σ_m] = sin( (πh/2)[1, 2, ..., m] ).^2.

We now get the following algorithm to solve numerically the Poisson problem -Δu = f on Ω = (0,1)^2 and u = 0 on ∂Ω using the 5-point scheme, i.e., let m ∈ N, h = 1/(m+1), and F = (f(jh, kh)) ∈ R^{m×m}. We compute V ∈ R^{(m+2)×(m+2)} using diagonalization of T = tridiag(-1, 2, -1) ∈ R^{m×m}.

Algorithm 10.1 (Fast Poisson solver)

1 function V=fastpoisson(F)
2 %function V=fastpoisson(F)
3 m=length(F); h=1/(m+1); hv=pi*h*(1:m);
4 sigma=sin(hv/2).^2;
5 S=sin(hv'*(1:m));
6 G=S*F*S;
7 X=h^4*G./(sigma'*ones(1,m)+ones(m,1)*sigma);
8 V=zeros(m+2,m+2);
9 V(2:m+1,2:m+1)=S*X*S;

The formulas are fully vectorized. Since the computation of X in Algorithm 10.1 only requires O(m^2) arithmetic operations, the complexity of this algorithm is for large m determined by the four m-by-m matrix multiplications and is given by O(4 · 2m^3) = O(8 n^{3/2}). (It is possible to compute V using only two matrix multiplications and hence reduce the complexity to O(4 n^{3/2}); this is detailed in Exercise 10.8.) The method is very fast and will be used as a preconditioner for a more complicated problem in Chapter 12. In 2012 it took about 0.2 seconds on a laptop to find the 10^6 unknowns v_{j,k} on a 1000 × 1000 grid.

10.3 A Fast Poisson Solver based on the discrete sine and Fourier transforms

In Algorithm 10.1 we need to compute the product of the sine matrix S ∈ R^{m×m} given by (10.4) and a matrix A ∈ R^{m×m}. Since the matrices are m-by-m this will normally require O(m^3) operations. In this section we show that it is possible to calculate the products SA and AS in O(m^2 log_2 m) operations.

236 224 Chapter 10. Fast Direct Solution of a Large Linear System We need to discuss certain transforms known as the discrete sine transform, the discrete Fourier transform and the fast Fourier transform. In addition we have the discrete cosine transform which will not be discussed here. These transforms are of independent interest. They have applications to signal processing and image analysis, and are often used when one is dealing with discrete samples of data on a computer The discrete sine transform (DST) Given v = [v 1,..., v m ] T R m we say that the vector w = [w 1,..., w m ] T given by w j = m k=1 ( jkπ ) sin v k, m + 1 j = 1,..., m is the discrete sine transform (DST) of v. In matrix form we can write the DST as the matrix times vector w = Sv, where S is the sine matrix given by (10.4). We can then identify the matrix B = SA as the DST of A R m,n, i.e. as the DST of the columns of A. The product B = AS can also be interpreted as a DST. Indeed, since S is symmetric we have B = (SA T ) T which means that B is the transpose of the DST of the rows of A. It follows that we can compute the unknowns V in Algorithm 10.1 by carrying out discrete sine transforms on 4 m-by-m matrices in addition to the computation of X The discrete Fourier transform (DFT) Jean Baptiste Joseph Fourier, The fast computation of the DST is based on its relation to the discrete Fourier transform (DFT) and the fact that the DFT can be computed by a technique known as the fast Fourier transform (FFT). To define the DFT let for N N ω N = exp 2πi/N = cos(2π/n) i sin(2π/n), (10.5) where i = 1 is the imaginary unit. Given y = [y 1,..., y N ] T R N we say that

237 10.3. A Fast Poisson Solver based on the discrete sine and Fourier transforms 225 z = [z 1,..., z N ] T given by z = F N y, z j+1 = N 1 k=0 ω jk N y k+1, j = 0,..., N 1 is the discrete Fourier transform (DFT) of y. We can write this as a matrix times vector product z = F N y, where the Fourier matrix F N C N N has elements ω jk N, j, k = 0, 1,..., N 1. For a matrix we say that B = F N A is the DFT of A. As an example, since ω 4 = exp 2πi/4 = cos(π/2) i sin(π/2) = i we find ω 2 4 = ( i) 2 = 1, ω 3 4 = ( i)( 1) = i, ω 4 4 = ( 1) 2 = 1, ω 6 4 = i 2 = 1, ω 9 4 = i 3 = i, and so F 4 = ω 4 ω 2 4 ω ω 2 4 ω 4 4 ω ω 3 4 ω 6 4 ω 9 4 = i 1 i i 1 i. (10.6) The following lemma shows how the discrete sine transform of order m can be computed from the discrete Fourier transform of order 2m + 2. We recall that for any complex number w sin w = eiw e iw. 2i Lemma 10.2 (Sine transform as Fourier transform) Given a positive integer m and a vector x R m. Component k of Sx is equal to i/2 times component k + 1 of F 2m+2 z where In symbols z T = [0, x T, 0, x T B] R 2m+2, x T B := [x m,..., x 2, x 1 ]. (Sx) k = i 2 (F 2m+2z) k+1, k = 1,..., m. Proof. Let ω = ω 2m+2 = e 2πi/(2m+2) = e πi/(m+1). We note that ω jk = e πijk/(m+1), ω (2m+2 j)k = e 2πi e πijk/(m+1) = e πijk/(m+1).

238 226 Chapter 10. Fast Direct Solution of a Large Linear System Component k + 1 of F 2m+2 z is then given by (F 2m+2 z) k+1 = = 2m 1 j=0 m m ω jk z j+1 = x j ω jk x j ω (2m+2 j)k j=1 j=1 m ( x j e πijk/(m+1) e πijk/(m+1)) j=1 = 2i m j=1 ( ) jkπ x j sin = 2i(S m x) k. m + 1 Dividing both sides by 2i and noting 1/(2i) = i/(2i 2 ) = i/2, proves the lemma. It follows that we can compute the DST of length m by extracting m components from the DFT of length N = 2m The fast Fourier transform (FFT) From a linear algebra viewpoint the fast Fourier transform is a quick way to compute the matrix- vector product F N y. Suppose N is even. The key to the FFT is a connection between F N and F N/2 which makes it possible to compute the FFT of order N as two FFT s of order N/2. By repeating this process we can reduce the number of arithmetic operations to compute a DFT from O(N 2 ) to O(N log 2 N). Suppose N is even. The connection between F N and F N/2 involves a permutation matrix P N R N N given by P N = [e 1, e 3,..., e N 1, e 2, e 4,..., e N ], where the e k = (δ j,k ) are unit vectors. If A is a matrix with N columns [a 1,..., a N ] then AP N = [a 1, a 3,..., a N 1, a 2, a 4,..., a N ], i.e. post multiplying A by P N permutes the columns of A so that all the oddindexed columns are followed by all the even-indexed columns. For example we have from (10.6) P 4 = [e 1 e 3 e 2 e 4 ] = F 4P 4 = i i i i,

239 10.3. A Fast Poisson Solver based on the discrete sine and Fourier transforms 227 where we have indicated a certain block structure of F 4 P 4. These blocks can be related to the 2-by-2 matrix F 2. We define the diagonal scaling matrix D 2 by [ ] 1 0 D 2 = diag(1, ω 4 ) =. 0 i Since ω 2 = exp 2πi/2 = 1 we find [ 1 1 F 2 = 1 1 ] [, D 2 F 2 = 1 1 i i ], and we see that [ F F 4 P 4 = 2 D 2 F 2 F 2 D 2 F 2 This result holds in general. ]. Theorem 10.3 (Fast Fourier transform) If N = 2m is even then [ F F 2m P 2m = m D m F m F m D m F m ], (10.7) where D m = diag(1, ω N, ωn, 2..., ω m 1 N ). (10.8) Proof. Fix integers p, q with 1 p, q m and set j := p 1 and k := q 1. Since ω m m = 1, ω 2k 2m = ω k m, ω m 2m = 1, (F m ) p,q = ω jk m, (D m F m ) p,q = ω j 2m ωjk m, we find by considering elements in the four sub-blocks in turn (F 2m P 2m ) p,q = ω j(2k) 2m = ω jk m, (F 2m P 2m ) p+m,q = ω (j+m)(2k) 2m = ω m (j+m)k = ωm jk, (F 2m P 2m ) p,q+m = ω j(2k+1) 2m = ω j 2m ωjk m, (F 2m P 2m ) p+m,q+m = ω (j+m)(2k+1) 2m = ω j+m 2m m ω(j+m)k = ω j 2m ωjk m. It follows that the four m-by-m blocks of F 2m P 2m have the required structure. Using Theorem 10.3 we can carry out the DFT as a block multiplication. Let y R 2m and set w = P T 2my = [w T 1, w T 2 ] T, where w T 1 = [y 1, y 3,..., y 2m 1 ], w T 2 = [y 2, y 4,..., y 2m ].

Then

    F_{2m} y = F_{2m} P_{2m} P_{2m}^T y = F_{2m} P_{2m} w
             = [ F_m    D_m F_m ] [ w_1 ]   [ q_1 + q_2 ]
               [ F_m   -D_m F_m ] [ w_2 ] = [ q_1 - q_2 ],

where q_1 = F_m w_1 and q_2 = D_m (F_m w_2). In order to compute F_{2m} y we need to compute F_m w_1 and F_m w_2. Thus, by combining two FFTs of order m we obtain an FFT of order 2m. If n = 2^k then this process can be applied recursively as in the following Matlab function:

Algorithm 10.4 (Recursive FFT)

1 function z=fftrec(y)
2 %function z=fftrec(y)
3 y=y(:);
4 n=length(y);
5 if n==1
6   z=y;
7 else
8   q1=fftrec(y(1:2:n-1));
9   q2=exp(-2*pi*i/n).^((0:n/2-1)').*fftrec(y(2:2:n));
10  z=[q1+q2; q1-q2];
11 end

Statement 3 is included so that the input y ∈ R^n can be either a row or column vector, while the output z is a column vector. Such a recursive version of the FFT is useful for testing purposes, but is much too slow for large problems. A challenge for FFT code writers is to develop nonrecursive versions and also to handle efficiently the case where N is not a power of two. We refer to [34] for further details.

The complexity of the FFT is given by γN log_2 N for some constant γ independent of N. To show this for the special case when N is a power of two, let x_k be the complexity (the number of arithmetic operations) when N = 2^k. Since we need two FFTs of order N/2 = 2^{k-1} and a multiplication with the diagonal matrix D_{N/2}, it is reasonable to assume that x_k = 2x_{k-1} + γ2^k for some constant γ independent of k. Since x_0 = 0 we obtain by induction on k that x_k = γ k 2^k. Indeed, this holds for k = 0, and if x_{k-1} = γ(k-1)2^{k-1} then x_k = 2x_{k-1} + γ2^k = 2γ(k-1)2^{k-1} + γ2^k = γ k 2^k. Reasonable implementations of the FFT typically have γ ≈ 5, see [34].

The efficiency improvement using the FFT to compute the DFT is spectacular for large N. The direct multiplication F_N y requires O(8N^2) arithmetic

241 10.3. A Fast Poisson Solver based on the discrete sine and Fourier transforms 229 operations since complex arithmetic is involved. Assuming that the FFT uses 5N log 2 N arithmetic operations we find for N = the ratio 8N 2 5N log 2 N Thus if the FFT takes one second of computing time and the computing time is proportional to the number of arithmetic operations then the direct multiplication would take something like seconds or 23 hours A poisson solver based on the FFT We now have all the ingredients to compute the matrix products SA and AS using FFT s of order 2m + 2 where m is the order of S and A. This can then be used for quick computation of the exact solution V of the discrete Poisson problem in Algorithm We first compute H = SF using Lemma 10.2 and m FFT s, one for each of the m columns of F. We then compute G = HS by m FFT s, one for each of the rows of H. After X is determined we compute Z = SX and V = ZS by another 2m FFT s. In total the work amounts to 4m FFT s of order 2m + 2. Since one FFT requires O(γ(2m + 2) log 2 (2m + 2)) arithmetic operations the 4m FFT s amount to 8γm(m + 1) log 2 (2m + 2) 8γm 2 log 2 m = 4γn log 2 n, where n = m 2 is the size of the linear system Ax = b we would be solving if Cholesky factorization was used. This should be compared to the O(8n 3/2 ) arithmetic operations used in Algorithm 10.1 requiring 4 straightforward matrix multiplications with S. What is faster will depend heavily on the programming of the FFT and the size of the problem. We refer to [34] for other efficient ways to implement the DST. Exercise 10.5 (Fourier matrix) Show that the Fourier matrix F 4 is symmetric, but not Hermitian. Exercise 10.6 (Sine transform as Fourier transform) Verify Lemma 10.2 directly when m = 1. Exercise 10.7 (Explicit solution of the discrete Poisson equation) Show that the exact solution of the discrete Poisson equation (9.4) can be written V = (v i,j ) m i,j=1, where v ij = 1 (m + 1) 4 m m m p=1 r=1 k=1 l=1 m sin ( ) ( ipπ m+1 sin jrπ ) ( m+1 sin kpπ ) ( m+1 sin lrπ ) m+1 [ sin ( ) ] 2 [ pπ + sin ( ) ] 2 f p,r. rπ 2(m+1) 2(m+1)
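Exercise 10.6 asks for a hand verification of Lemma 10.2 for m = 1; the following sketch (mine, not part of the notes) checks a larger m numerically. Matlab's built-in fft uses the same sign convention ω_N = e^{-2πi/N} as the DFT defined above.

    % Sketch: compute the DST S*x via an FFT of length 2m+2 (Lemma 10.2).
    m = 8; h = 1/(m+1);
    S = sin(pi*h*(1:m)'*(1:m));     % the sine matrix (10.4)
    x = randn(m,1);
    z = [0; x; 0; -x(m:-1:1)];      % the vector z of length 2m+2 in Lemma 10.2
    w = fft(z);                     % the DFT of z
    dst = real(1i/2*w(2:m+1));      % (Sx)_k = (i/2)(F_{2m+2} z)_{k+1}
    norm(dst - S*x)                 % ~ 0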

242 230 Chapter 10. Fast Direct Solution of a Large Linear System Exercise 10.8 (Improved version of Algorithm 10.1) Algorithm 10.1 involves multiplying a matrix by S four times. In this problem we show that it is enough to multiply by S two times. We achieve this by diagonalizing only the second T in T V + V T = h 2 F. Let D = diag(λ 1,..., λ m ), where λ j = 4 sin 2 (jπh/2), j = 1,..., m. (a) Show that (b) Show that T X + XD = C, where X = V S, and C = h 2 F S. (T + λ j I)x j = c j j = 1,..., m, (10.9) where X = [x 1,..., x m ] and C = [c 1,..., c m ]. Thus we can find X by solving m linear systems, one for each of the columns of X. Recall that a tridiagonal m m system can be solved by Algorithms 1.8 and 1.9 in 8m 7 arithmetic operations. Give an algorithm to find X which only requires O(δm 2 ) arithmetic operations for some constant δ independent of m. (c) Describe a method to compute V which only requires O(4m 3 ) = O(4n 3/2 ) arithmetic operations. (d) Describe a method based on the fast Fourier transform which requires O(2γn log 2 n) where γ is the same constant as mentioned at the end of the last section. Exercise 10.9 (Fast solution of 9 point scheme) Consider the equation T V + V T 1 6 T V T = h2 µf, that was derived in Exercise 9.13 for the 9-point scheme. Define the matrix X by V = SXS = (x j,k ) where V is the solution of (9.18). Show that DX + XD 1 6 DXD = 4h4 G, where G = SµF S, where D = diag(λ 1,..., λ m ), with λ j = 4 sin 2 (jπh/2), j = 1,..., m, and that x j,k = h 4 g j,k σ j + σ k 2 3 σ jσ k, where σ j = sin 2 ( (jπh)/2 ) for j, k = 1, 2,..., m. Show that σ j + σ k 2 3 σ jσ k > 0 for j, k = 1, 2,..., m. Conclude that the matrix A in Exercise 9.13 b) is symmetric positive definite and that (9.17) always has a solution V.

243 10.3. A Fast Poisson Solver based on the discrete sine and Fourier transforms 231 Exercise (Algorithm for fast solution of 9 point scheme) Derive an algorithm for solving (9.17) which for large m requires essentially the same number of operations as in Algorithm (We assume that µf already has been formed). Exercise (Fast solution of biharmonic equation) For the biharmonic problem we derived in Exercise 9.14 the equation T 2 U + 2T UT + UT 2 = h 4 F. Define the matrix X = (x j,k ) by U = SXS where U is the solution of (9.21). Show that and that x j,k = D 2 X + 2DXD + XD 2 = 4h 6 G, where G = SF S, h 6 g j,k 4(σ j + σ k ) 2, where σ j = sin 2 ((jπh)/2) for j, k = 1, 2,..., m. Exercise (Algorithm for fast solution of biharmonic equation) Use Exercise to derive an algorithm function U=simplefastbiharmonic(F) which requires only O(δn 3/2 ) operations to find U in Problem some constant independent of n. Here δ is Exercise (Check algorithm for fast solution of biharmonic equation) In Exercise compute the solution U corresponding to F = ones(m,m). For some small m s check that you get the same solution obtained by solving the standard form Ax = b in (9.21). You can use x = A\b for solving Ax = b. Use F(:) to vectorize a matrix and reshape(x,m,m) to turn a vector x R m2 into an m m matrix. Use the Matlab command surf(u) for plotting U for, say, m = 50. Compare the result with Exercise by plotting the difference between both matrices. Exercise (Fast solution of biharmonic equation using 9 point rule) Repeat Exercises 9.14, and using the nine point rule (9.17) to solve the system (9.20).
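For the biharmonic exercises above, one possible sketch of simplefastbiharmonic (my assumed solution, not part of the notes; it mirrors Algorithm 10.1 and pads U with the zero boundary values) is:

    function U=simplefastbiharmonic(F)
    % Sketch: fast solver for the discrete biharmonic equation (9.21),
    % using U = S*X*S and x_{jk} = h^6 g_{jk}/(4(sigma_j+sigma_k)^2).
    m=length(F); h=1/(m+1); hv=pi*h*(1:m);
    sigma=sin(hv/2).^2;
    S=sin(hv'*(1:m));
    G=S*F*S;
    X=h^6*G./(4*(sigma'*ones(1,m)+ones(m,1)*sigma).^2);
    U=zeros(m+2,m+2);
    U(2:m+1,2:m+1)=S*X*S;

As in Algorithm 10.1, the cost is dominated by the four m-by-m matrix multiplications, i.e., O(n^{3/2}) arithmetic operations.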

244 232 Chapter 10. Fast Direct Solution of a Large Linear System 10.4 Review Questions Consider the Poisson matrix. What is the bandwidth of its Cholesky factor? approximately how many arithmetic operations does it take to find the Cholesky factor? same question for block LU, same question for the fast Poisson solver with and without FFT What is the discrete sine transform and discrete Fourier transform of a vector?

Part IV  Iterative Methods for Large Linear Systems


247 Chapter 11 The Classical Iterative Methods Gaussian elimination and Cholesky factorization are direct methods. In absence of rounding errors they find the exact solution using a finite number of arithmetic operations. In an iterative method we start with an approximation x 0 to the exact solution x and then compute a sequence {x k } such that hopefully x k x. Iterative methods are mainly used for large sparse systems, i. e., where many of the elements in the coefficient matrix are zero. The main advantages of iterative methods are reduced storage requirements and ease of implementation. In an iterative method the main work in each iteration is a matrix times vector multiplication, an operation which often does not need storing the matrix, not even in sparse form. In this chapter we consider the classical iterative methods of Richardson, Jacobi, Gauss-Seidel and an accelerated version of Gauss-Seidel s method called successive overrelaxation (SOR). David Young developed in his thesis a beautiful theory describing the convergence rate of SOR, see [37]. We give the main points of this theory specialized to the discrete Poisson matrix. With a careful choice of an acceleration parameter the amount of work using SOR on the discrete Poisson problem is the same as for the fast Poisson solver without FFT (cf. Algorithm 10.1). Moreover, SOR is not restricted to constant coefficient methods on a rectangle. However, to obtain fast convergence using SOR it is necessary to have a good estimate for an acceleration parameter. For convergence we need to study convergence of powers of matrices. 235

11.1 Classical Iterative Methods; Component Form

We start with an example showing how a linear system can be solved using an iterative method.

Example 11.1 (Iterative methods on a special 2 × 2 matrix)
Solving for the diagonal elements, the linear system

    [ 2  -1 ] [ y ]   [ 1 ]
    [ -1  2 ] [ z ] = [ 1 ]

can be written in component form as y = (z + 1)/2 and z = (y + 1)/2. Starting with y_0, z_0 we generate two sequences {y_k} and {z_k} using the difference equations y_{k+1} = (z_k + 1)/2 and z_{k+1} = (y_k + 1)/2. This is known as Jacobi's method. If y_0 = z_0 = 0 then we find y_1 = z_1 = 1/2 and in general y_k = z_k = 1 - 2^{-k} for k = 0, 1, 2, 3, .... The iteration converges to the exact solution [1, 1]^T, and the error is halved in each iteration.

We can improve the convergence rate by using the most current approximation in each iteration. This leads to Gauss-Seidel's method: y_{k+1} = (z_k + 1)/2 and z_{k+1} = (y_{k+1} + 1)/2. If y_0 = z_0 = 0 then we find y_1 = 1/2, z_1 = 3/4, y_2 = 7/8, z_2 = 15/16, and in general y_k = 1 - 2 · 4^{-k} and z_k = 1 - 4^{-k} for k = 1, 2, 3, .... The error is now reduced by a factor 4 in each iteration.

Consider the general case. Suppose A ∈ C^{n×n} is nonsingular and b ∈ C^n. Suppose we know an approximation x_k = [x_k(1), ..., x_k(n)]^T to the exact solution x of Ax = b.

Lewis Fry Richardson (left), Carl Gustav Jacob Jacobi (right).
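Example 11.1 is easy to reproduce; the following short Matlab sketch (mine, not part of the notes) runs five Jacobi and Gauss-Seidel steps on the 2-by-2 system and prints the errors 1 - y_k, which are halved and quartered per step, respectively.

    % Sketch: Jacobi and Gauss-Seidel on the 2x2 system of Example 11.1.
    yJ = 0; zJ = 0; yG = 0; zG = 0;
    for k = 1:5
      % Jacobi: use only the old values on the right-hand side
      [yJ, zJ] = deal((zJ+1)/2, (yJ+1)/2);
      % Gauss-Seidel: use the newest value as soon as it is available
      yG = (zG+1)/2;  zG = (yG+1)/2;
      fprintf('k=%d  Jacobi error %.2e   GS error %.2e\n', k, 1-yJ, 1-yG);
    end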

249 11.1. Classical Iterative Methods; Component Form 237 Philipp Ludwig von Seidel, (left), David M. Young Jr., (right) We need to assume that the rows are ordered so that A has nonzero diagonal elements. Solving the ith equation of Ax = b for x(i), we obtain a fixed-point form of Ax = b x(i) = ( i 1 n ) a ij x(j) a ij x(j) + b i /aii, i = 1, 2,..., n. (11.1) j=1 j=i+1 1. In Jacobi s method (J method) we substitute x k into the right hand side of (11.1) and compute a new approximation by x k+1 (i) = ( i 1 a ij x k (j) j=1 n j=i+1 a ij x k (j) + b i ) /aii, for i = 1, 2,..., n. (11.2) 2. Gauss-Seidel s method (GS method) is a modification of Jacobi s method, where we use the new x k+1 (i) immediately after it has been computed. x k+1 (i) = ( i 1 a ij x k+1 (j) j=1 n j=i+1 a ij x k (j) + b i ) /aii, for i = 1, 2,..., n. (11.3) 3. The Successive overrelaxation method (SOR method) is obtained by introducing an acceleration parameter 0 < ω < 2 in the GS method. We write x(i) = ωx(i) + (1 ω)x(i) and this leads to the method x k+1 (i) = ω ( i 1 a ij x k+1 (j) j=1 n j=i+1 a ij x k (j)+b i ) /aii +(1 ω)x k (i). (11.4) The SOR method reduces to the Gauss-Seidel method for ω = 1. Denoting the right hand side of (11.3) by x gs k+1 we can write (11.4) as x k+1 = ωx gs k+1 +

250 238 Chapter 11. The Classical Iterative Methods (1 ω)x k, and we see that x k+1 is located on the straight line passing through the two points x gs k+1 and x k. The restriction 0 < ω < 2 is necessary for convergence (cf. Theorem 11.14). Normally, the best results are obtained for the relaxation parameter ω in the range 1 ω < 2 and then x k+1 is computed by linear extrapolation, i. e., it is not located between x gs k+1 and x k. 4. We mention also briefly the symmetric successive overrelaxation method SSOR. One iteration in SSOR consists of two SOR sweeps. A forward SOR sweep (11.4), computing an approximation denoted x k+1/2 instead of x k+1, is followed by a backward SOR sweep computing x k+1 (i) = ω ( i 1 a ij x k+1/2 (j) j=1 n j=i+1 a ij x k+1 (j)+b i ) /aii +(1 ω)x k+1/2 (i) (11.5) in the order i = n, n 1, The method is slower and more complicated than the SOR method. Its main use is as a symmetric preconditioner. For if A is symmetric then SSOR combines the two SOR steps in such a way that the resulting iteration matrix is similar to a symmetric matrix. We will not discuss this method any further here and refer to Section 12.6 for an alternative example of a preconditioner. We will refer to the J, GS and SOR methods as the classical (iteration) methods The discrete Poisson system Consider the classical methods applied to the discrete Poisson matrix A R n n given by (9.7). Let n = m 2 and set h = 1/(m + 1). In component form the linear system Ax = b can be written (cf. (9.4)) 4v(i, j) v(i 1, j) v(i+1, j) v(i, j 1) v(i, j+1) = h 2 f i,j, i, j = 1,..., m, with homogenous boundary conditions also given in (9.4). Solving for v(i, j) we obtain the fixed point form v(i, j) = ( v(i 1, j)+v(i+1, j)+v(i, j 1)+v(i, j+1)+e i,j ) /4, ei,j := f i,j /(m+1) 2. (11.6)

The J, GS, and SOR methods take the form

J:   v_{k+1}(i,j) = ( v_k(i-1,j) + v_k(i,j-1) + v_k(i+1,j) + v_k(i,j+1) + e(i,j) )/4,
GS:  v_{k+1}(i,j) = ( v_{k+1}(i-1,j) + v_{k+1}(i,j-1) + v_k(i+1,j) + v_k(i,j+1) + e(i,j) )/4,
SOR: v_{k+1}(i,j) = ω( v_{k+1}(i-1,j) + v_{k+1}(i,j-1) + v_k(i+1,j) + v_k(i,j+1) + e(i,j) )/4 + (1-ω)v_k(i,j).   (11.7)

We note that for GS and SOR we have used the natural ordering, i.e., (i_1, j_1) < (i_2, j_2) if and only if j_1 < j_2, or j_1 = j_2 and i_1 < i_2. For the J method any ordering can be used.

In Algorithm 11.2 we give a Matlab program to test the convergence of Jacobi's method on the discrete Poisson problem. We carry out Jacobi iterations on the linear system (11.6) with F = (f_{ij}) ∈ R^{m×m}, starting with V_0 = 0 ∈ R^{(m+2)×(m+2)}. The output is the number of iterations k needed to obtain ||V^{(k)} - U||_M := max_{i,j} |v_{ij} - u_{ij}| < tol. Here [u_{ij}] ∈ R^{(m+2)×(m+2)} is the exact solution of (11.6) computed using the fast Poisson solver in Algorithm 10.1. We set k = K + 1 if convergence is not obtained in K iterations. In Table 11.1 we show the output k = k_n from this algorithm using F = ones(m, m) for m = 10, 50, K = 10^4, and a fixed tolerance tol. We also show the number of iterations for Gauss-Seidel and SOR with a value of ω known as the optimal acceleration parameter ω := 2/(1 + sin(π/(m+1))). We will derive this value later.

Algorithm 11.2 (Jacobi)

1 function k=jdp(F,K,tol)
2 m=length(F); U=fastpoisson(F); V=zeros(m+2,m+2); E=F/(m+1)^2;
3 for k=1:K
4   V(2:m+1,2:m+1)=(V(1:m,2:m+1)+V(3:m+2,2:m+1)...
5       +V(2:m+1,1:m)+V(2:m+1,3:m+2)+E)/4;
6   if max(max(abs(V-U)))<tol, return
7   end
8 end
9 k=K+1;

For the GS and SOR methods we have used Algorithm 11.3. This is the analog of Algorithm 11.2 using SOR instead of J to solve the discrete Poisson problem. w is an acceleration parameter with 0 < w < 2. For w = 1 we obtain Gauss-Seidel's method.

Table 11.1: The number of iterations k_n to solve the discrete Poisson problem with n unknowns (rows: J, GS, SOR; columns: k_n for n = 100, 2500, ...) using the methods of Jacobi, Gauss-Seidel, and SOR (see text).

Algorithm 11.3 (SOR)

1 function k=sordp(F,K,w,tol)
2 m=length(F); U=fastpoisson(F); V=zeros(m+2,m+2); E=F/(m+1)^2;
3 for k=1:K
4   for j=2:m+1
5     for i=2:m+1
6       V(i,j)=w*(V(i-1,j)+V(i+1,j)+V(i,j-1)...
7           +V(i,j+1)+E(i-1,j-1))/4+(1-w)*V(i,j);
8     end
9   end
10  if max(max(abs(V-U)))<tol, return
11  end
12 end
13 k=K+1;

We make several remarks about these programs and the results in Table 11.1.

1. The rate (speed) of convergence is quite different for the four methods. The J and GS methods converge, but rather slowly. The J method needs about twice as many iterations as the GS method. The improvement using the SOR method with optimal ω is spectacular.

2. We show later that the number of iterations k_n for a size n problem is k_n = O(n) for the J and GS methods and k_n = O(√n) for SOR with optimal ω. The choice of tol will only influence the constants multiplying n or √n.

3. From (11.7) it follows that each iteration requires O(n) arithmetic operations. Thus the number of arithmetic operations to achieve a given tolerance is O(k_n n). Therefore the number of arithmetic operations for the J and GS methods is O(n^2), while it is only O(n^{3/2}) for the SOR method with optimal ω. Asymptotically, for J and GS this is the same as using banded Cholesky, while SOR competes with the fast method (without FFT).

4. We do not need to store the coefficient matrix, so the storage requirement for these methods on the discrete Poisson problem is O(n), asymptotically the same as for the fast methods.

5. Jacobi's method has the advantage that it can easily be parallelized.

11.2 Classical Iterative Methods; Matrix Form

To study convergence we need matrix formulations of the classical methods.

Fixed-point form

In general we can construct an iterative method by choosing a nonsingular matrix M and writing Ax = b in the equivalent form

  Mx = (M − A)x + b.   (11.8)

The matrix M is known as a splitting matrix. The corresponding iterative method is given by

  Mx_{k+1} = (M − A)x_k + b   (11.9)

or

  x_{k+1} := Gx_k + c,  G = I − M^{-1}A,  c = M^{-1}b.   (11.10)

This is known as a fixed-point iteration. Starting with x_0 this defines a sequence {x_k} of vectors in C^n. For a general G ∈ C^{n×n} and c ∈ C^n a solution of x = Gx + c is called a fixed point. The fixed point is unique if I − G is nonsingular. If lim_{k→∞} x_k = x for some x ∈ C^n, then x is a fixed point, since x = lim_{k→∞} x_{k+1} = lim_{k→∞}(Gx_k + c) = G lim_{k→∞} x_k + c = Gx + c.

The matrix M can also be interpreted as a preconditioning matrix. We first write Ax = b in the equivalent form M^{-1}Ax = M^{-1}b, or x = x − M^{-1}Ax + M^{-1}b. This again leads to the iterative method (11.10), and M is chosen to reduce the condition number of A.

The splitting matrices for the classical methods

Different choices of M in (11.9) lead to different iterative methods. We now derive M for the classical methods. For GS and SOR it is convenient to write A as a

254 242 Chapter 11. The Classical Iterative Methods sum of three matrices, A = D A L A R, where A L, D, and A R are the lower, diagonal, and upper part of A, respectively. Thus D := diag(a 11,..., a nn ), 0 0 a 1,2 a 1,n a 2,1 0 A L := , A R :=. 0 a n 1,n. a n,1 a n,n (11.11) Theorem 11.4 (Splitting matrices for J, GS and SOR) The splitting matrices M J, M 1 and M ω for the J, GS and SOR methods are given by M J = D, M 1 = D A L, M ω = ω 1 D A L. (11.12) Proof. To find M we write the methods in the form (11.9) where the coefficient of b is equal to one. Moving a ii to the left hand side of the Jacobi iteration (11.2) we obtain the matrix form Dx k+1 = (D A)x k + b showing that M J = D. For the SOR method a matrix form is Dx k+1 = ω ( A L x k+1 + A R x k + b ) + (1 ω)dx k. (11.13) Dividing both sides by ω and moving A L x k+1 to the left hand side this takes the form ( ω 1 D A L ) xk+1 = A R x k + b + (ω 1 1)Dx k showing that M ω = ω 1 D A L. We obtain M 1 by letting ω = 1 in M ω. Example 11.5 (Splitting matrices) For the system [ ] [ ] [ 2 1 x1 1 = 1 2 x 2 1] we find and A L = M J = D = [ ] 0 0, D = 1 0 [ ] 2 0, A 0 2 R = [ ] 0 1, 0 0 [ ] [ ] 2 0, M 0 2 ω = ω 1 2ω 1 0 D A L = 1 2ω 1. The iteration matrix G ω = I M 1 ω A is given by [ ] [ ] [ ] [ ] 1 0 ω/ ω ω/2 G ω = 0 1 ω 2 = /4 ω/2 1 2 ω(1 ω)/2 1 ω + ω 2. /4 (11.14)
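The splitting matrices and the iteration matrix of Example 11.5 are easy to check numerically. The following is a minimal sketch, assuming the 2-by-2 system A = [2 -1; -1 2] of the example; the last line compares G_ω computed from the splitting with the closed form (11.14).

  % Numerical check of Example 11.5 for one value of omega.
  A  = [2 -1; -1 2];
  D  = diag(diag(A));  AL = -tril(A,-1);     % A = D - A_L - A_R
  w  = 1.5;                                  % any omega in (0,2)
  Gw = eye(2) - (D/w - AL)\A;                % G_omega = I - M_omega^{-1} A
  Gw_formula = [1-w, w/2; w*(1-w)/2, 1-w+w^2/4];
  disp(norm(Gw - Gw_formula))                % should be of the order of eps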

255 11.3. Convergence 243 For the J and GS method we have G J = I D 1 A = [ ] 0 1/2, G 1/2 0 1 = [ ] 0 1/2. (11.15) 0 1/4 We could have derived these matrices directly from the component form of the iteration. For example, for the GS method we have the component form x k+1 (1) = 1 2 x k(2) + 1 2, x k+1(2) = 1 2 x k+1(1) Substituting the value of x k+1 (1) from the first equation into the second equation we find x k+1 (2) = 1 2 (1 2 x k(2) ) = 1 4 x k(2) Thus x k+1 = [ ] xk+1 (1) = x k+1 (2) [ 0 1/2 0 1/4 ] [ ] xk (1) + x k (2) [ ] 1/2 = G 3/4 1 x k + c Convergence For Newton s method the choice of starting value is important. This is not the case for methods of the form x k+1 := Gx k + c. Definition 11.6 (Convergence of x k+1 := Gx k + c) We say that the iterative method x k+1 := Gx k + c converges if the sequence {x k } converges for any starting vector x 0. We have the following necessary and sufficient condition for convergence: Theorem 11.7 (Convergence of an iterative method) The iterative method x k+1 := Gx k + c converges if and only if lim k G k = 0. Proof. We subtract x = Gx + c from x k+1 = Gx k + c. The vector c cancels and we obtain x k+1 x = G(x k x). By induction on k x k x = G k (x 0 x), k = 0, 1, 2,... (11.16) Clearly x k x 0 if G k 0. The converse follows by choosing x 0 x = e j, the jth unit vector for j = 1,..., n. Theorem 11.8 (Sufficient condition for convergence) If G < 1 for some consistent matrix norm on C n n, then the iteration x k+1 = Gx k + c converges.

Proof. We have ‖x_k − x‖ = ‖G^k(x_0 − x)‖ ≤ ‖G^k‖ ‖x_0 − x‖ ≤ ‖G‖^k ‖x_0 − x‖ → 0 as k → ∞.

A necessary and sufficient condition for convergence involves the eigenvalues of G. We define the spectral radius of a matrix A ∈ C^{n×n} as the maximum absolute value of its eigenvalues,

  ρ(A) := max_{λ ∈ σ(A)} |λ|.   (11.17)

Theorem 11.9 (When does an iterative method converge?)
Suppose G ∈ C^{n×n} and c ∈ C^n. The iteration x_{k+1} = Gx_k + c converges if and only if ρ(G) < 1.

We will prove this theorem using the results on powers of a matrix in Section 11.4.

Richardson's method. Richardson's method (the R method) is defined by

  x_{k+1} = x_k + α(b − Ax_k).   (11.18)

Here we pick an acceleration parameter α and compute a new approximation by adding a multiple of the residual vector r_k := b − Ax_k. Note that we do not need the assumption of nonzero diagonal elements. Richardson considered this method more than a century ago. We will assume that α is real. If all eigenvalues of A have positive real parts then the R method converges provided α is positive and sufficiently small. We show this result for positive eigenvalues and leave the more general case to an exercise below.

Theorem (Convergence of Richardson's method)
If A has positive eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_n > 0 then the R method given by x_{k+1} = (I − αA)x_k + αb converges if and only if 0 < α < 2/λ_1. Moreover,

  ρ(I − αA) > ρ(I − α*A) = (κ − 1)/(κ + 1),  α ∈ R \ {α*},  κ := λ_1/λ_n,  α* := 2/(λ_1 + λ_n).   (11.19)

Proof. The eigenvalues of I − αA are 1 − αλ_j, j = 1, ..., n. We have

  ρ_α := ρ(I − αA) = max_j |1 − αλ_j| =  1 − αλ_1  if α ≤ 0,
                                         1 − αλ_n  if 0 < α ≤ α*,
                                         αλ_1 − 1  if α > α*,

see Figure 11.1.

Figure 11.1: The functions α ↦ |1 − αλ_1| and α ↦ |1 − αλ_n|.

Clearly 1 − αλ_n = αλ_1 − 1 for α = α* and

  ρ_{α*} = α*λ_1 − 1 = (λ_1 − λ_n)/(λ_1 + λ_n) = (κ − 1)/(κ + 1) < 1.

We have ρ_α < 1 if and only if α > 0 and αλ_1 − 1 < 1, showing convergence if and only if 0 < α < 2/λ_1, and ρ_α > ρ_{α*} for α ≤ 0 and α ≥ 2/λ_1. Finally, if 0 < α < α* then ρ_α = 1 − αλ_n > 1 − α*λ_n = ρ_{α*}, and if α* < α < 2/λ_1 then ρ_α = αλ_1 − 1 > α*λ_1 − 1 = ρ_{α*}.

For a symmetric positive definite matrix we obtain

Corollary (Rate of convergence for the R method)
Suppose A is symmetric positive definite with largest and smallest eigenvalues λ_max and λ_min, respectively. Richardson's method x_{k+1} = (I − αA)x_k + αb converges if and only if 0 < α < 2/λ_max. With α* := 2/(λ_max + λ_min) we have the error estimate

  ‖x_k − x‖_2 ≤ ((κ − 1)/(κ + 1))^k ‖x_0 − x‖_2,  k = 0, 1, 2, ...   (11.20)

where κ := λ_max/λ_min is the spectral condition number of A.

Proof. The spectral norm ‖·‖_2 is consistent and therefore ‖x_k − x‖_2 ≤ ‖I − α*A‖_2^k ‖x_0 − x‖_2. But for a symmetric positive definite matrix the spectral norm is equal to the spectral radius, and the result follows from (11.19).

Exercise (Richardson and Jacobi)
Show that if a_ii = d ≠ 0 for all i then Richardson's method with α := 1/d is the same as Jacobi's method.
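A minimal sketch of Richardson's method (11.18) with the optimal parameter α* = 2/(λ_max + λ_min) is given below, assuming A is symmetric positive definite. The function name and the use of eig (reasonable only for small test problems) are illustrative choices.

  % Richardson's method with the optimal acceleration parameter.
  function [x, k] = richardson(A, b, x, K, tol)
    lam   = eig(A);                        % fine for small test problems
    alpha = 2/(max(lam) + min(lam));       % optimal parameter alpha*
    nb = norm(b);
    for k = 0:K
      r = b - A*x;                         % residual r_k
      if norm(r) <= tol*nb, return, end
      x = x + alpha*r;                     % x_{k+1} = x_k + alpha*r_k
    end
    k = K + 1;                             % no convergence within K steps
  end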

258 246 Chapter 11. The Classical Iterative Methods Exercise ( R-method when eigenvalues have positive real part) Suppose all eigenvalues λ j of A have positive real parts u j for j = 1,..., n and that α is real. Show that the R method converges if and only if 0 < α < min j (2u j / λ j 2 ) Convergence of SOR The condition ω (0, 2) is necessary for convergence of the SOR method. Theorem (Necessary condition for convergence of SOR) Suppose A C n n is nonsingular with nonzero diagonal elements. If the SOR method applied to A converges then ω (0, 2). Proof. We have (cf. (11.13)) Dx k+1 = ω ( A L x k+1 + A R x k + b ) + (1 ω)dx k or x k+1 = ω ( Lx k+1 + Rx k + D 1 b ) + (1 ω)x k, where L := D 1 A L and R := D 1 A R. Thus (I ωl)x k+1 = (ωr + (1 ω)i)x k + D 1 b so the following form of the iteration matrix is obtained G ω = (I ωl) 1( ωr + (1 ω)i ). (11.21) We next compute the determinant of G ω. Since I ωl is lower triangular with ones on the diagonal, the same holds for the inverse by Lemma 1.35, and therefore the determinant of this matrix is equal to one. The matrix ωr+(1 ω)i is upper triangular with 1 ω on the diagonal and therefore its determinant equals (1 ω) n. It follows that det(g ω ) = (1 ω) n. Since the determinant of a matrix equals the product of its eigenvalues we must have λ 1 ω for at least one eigenvalue λ of G ω and we conclude that ρ(g ω ) ω 1. But then ρ(g ω ) 1 if ω is not in the interval (0, 2) and by Theorem 11.9 SOR diverges. The SOR method always converges for a symmetric positive definite matrix. Theorem (SOR on positive definite matrix) SOR converges for a symmetric positive definite matrix A R n n if and only if 0 < ω < 2. In particular, Gauss-Seidel s method converges for a symmetric positive definite matrix. Proof. By Theorem convergence implies 0 < ω < 2. Suppose now 0 < ω < 2 and let (λ, x) be an eigenpair for G ω. Note that λ and x can be complex. We need to show that λ < 1. The following identity will be shown below: ω 1 (2 ω) 1 λ 2 x Dx = (1 λ 2 )x Ax, (11.22) where D := diag(a 11,..., a nn ). Now x Ax and x Dx are positive for all nonzero x C n since a positive definite matrix has positive diagonal elements a ii =

259 11.3. Convergence 247 e T i Ae i > 0. It follows that the left hand side of (11.22) is nonnegative and then the right hand side must be nonnegative as well. This implies λ 1. It remains to show that we cannot have λ = 1. By (11.12) the eigenpair equation G ω x = λx can be written x (ω 1 D A L ) 1 Ax = λx or Ax = (ω 1 D A L )y, y := (1 λ)x. (11.23) Now Ax 0 implies that λ 1. To prove equation (11.22) consider the matrix E := ω 1 D + A R D. Since A R D = A L A we find Ey = (ω 1 D A L A)y (11.23) = Ax Ay = λax. Observe that (ω 1 D A L ) = ω 1 D A R so that by (11.23) (Ax) y + y (λax) = y (ω 1 D A R )y + y Ey = y (2ω 1 1)Dy = ω 1 (2 ω) 1 λ 2 x Dx. Since (Ax) = x A, y := (1 λ)x and y = (1 λ)x this also equals (Ax) y + y (λax) = (1 λ)x Ax + λ(1 λ)x Ax = (1 λ 2 )x Ax, and (11.22) follows. Exercise (Examle: GS converges, J diverges) Show (by finding its eigenvalues) that the matrix 1 a a a 1 a a a 1 is symmetric positive definite for 1/2 < a < 1. Thus, GS converges for these values of a. Show that the J method does not converge for 1/2 < a < 1. Exercise (Divergence example for J and GS) Show that both Jacobi s method and Gauss-Seidel s method diverge for A = [ ]. Exercise (Strictly diagonally dominance; The J method) Show that the J method converges if a ii > j i a ij for i = 1,..., n. Exercise (Strictly diagonally dominance; The GS method) Consider the GS method. Suppose r := max i r i < 1, where r i = a ij j i a. Show ii using induction on i that ɛ k+1 (j) r ɛ k for j = 1,..., i. Conclude that Gauss-Seidel s method is convergent when A is strictly diagonally dominant.
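For the exercise above on a matrix where GS converges while J diverges, a quick numerical look is easy; the sketch below assumes a value a in (1/2, 1) and builds the Jacobi and Gauss-Seidel iteration matrices directly from the splittings.

  % GS converges, J diverges: spectral radii for one value of a.
  a = 0.8;
  A = [1 a a; a 1 a; a a 1];
  disp(eig(A)')                          % all positive: A is positive definite
  GJ = eye(3) - diag(diag(A))\A;         % Jacobi iteration matrix
  G1 = eye(3) - tril(A)\A;               % Gauss-Seidel matrix, M = D - A_L
  fprintf('rho(GJ) = %.3f, rho(G1) = %.3f\n', ...
          max(abs(eig(GJ))), max(abs(eig(G1))));
  % rho(GJ) = 2a > 1 (J diverges), while rho(G1) < 1 (GS converges).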

Convergence of the classical methods for the discrete Poisson matrix

We know the eigenvalues of the discrete Poisson matrix A given by (9.7), and we can use this to estimate the number of iterations necessary to achieve a given accuracy for the various methods. Recall that by (9.15) the eigenvalues λ_{j,k} of A are

  λ_{j,k} = 4 − 2cos(jπh) − 2cos(kπh),  j, k = 1, ..., m,  h = 1/(m+1).

It follows that the largest and smallest eigenvalues of A, and the spectral condition number κ of A, are given by

  λ_max = 8cos^2 w,  λ_min = 8sin^2 w,  κ := cos^2 w / sin^2 w,  w := π/(2(m+1)).   (11.24)

Consider first the J method. The matrix G_J = I − D^{-1}A = I − A/4 has eigenvalues

  μ_{j,k} = 1 − λ_{j,k}/4 = (1/2)cos(jπh) + (1/2)cos(kπh),  j, k = 1, ..., m.   (11.25)

It follows that ρ(G_J) = cos(πh) < 1. Since G_J is symmetric it is normal, and the spectral norm is equal to the spectral radius (cf. Theorem 7.17). We obtain

  ‖x_k − x‖_2 ≤ ‖G_J‖_2^k ‖x_0 − x‖_2 = cos^k(πh) ‖x_0 − x‖_2,  k = 0, 1, 2, ...   (11.26)

The R method given by x_{k+1} = x_k + αr_k with α* = 2/(λ_max + λ_min) = 1/4 is the same as the J method, so (11.26) holds in this case as well. This also follows from the corollary above with κ given by (11.24).

For the SOR method it is possible to determine ρ(G_ω) explicitly for any ω ∈ (0, 2). The following result will be shown in Section 11.5.

Theorem (The spectral radius of the SOR matrix)
Consider the SOR iteration (11.7) with the natural ordering. The spectral radius of G_ω is

  ρ(G_ω) = (1/4)( ωβ + sqrt((ωβ)^2 − 4(ω − 1)) )^2   for 0 < ω ≤ ω*,
  ρ(G_ω) = ω − 1                                      for ω* < ω < 2,   (11.27)

where β := ρ(G_J) = cos(πh) and

  ω* := 2/(1 + sqrt(1 − β^2)) > 1.   (11.28)

Moreover,

  ρ(G_ω) > ρ(G_{ω*})  for ω ∈ (0, 2) \ {ω*}.   (11.29)
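Formula (11.27) is straightforward to verify numerically for a small discrete Poisson matrix. The sketch below builds the matrix explicitly in the natural ordering and compares the computed spectral radius of G_ω with the formula; the grid size and values of ω are illustrative, and a small guard keeps the square root real when the discriminant is zero up to roundoff.

  % Compare rho(G_omega) with formula (11.27) on a small Poisson matrix.
  m = 6; h = 1/(m+1); beta = cos(pi*h);
  T = 2*eye(m) - diag(ones(m-1,1),1) - diag(ones(m-1,1),-1);
  A = kron(T, eye(m)) + kron(eye(m), T);        % discrete Poisson matrix
  D = diag(diag(A)); AL = -tril(A,-1);
  wstar = 2/(1 + sqrt(1 - beta^2));
  for w = [0.8, 1.0, wstar]
      Gw   = eye(m^2) - (D/w - AL)\A;           % SOR iteration matrix
      rho  = max(abs(eig(Gw)));
      rhof = ((w*beta + sqrt(max((w*beta)^2 - 4*(w-1), 0)))/2)^2;
      fprintf('w = %6.4f  rho(Gw) = %8.6f  formula = %8.6f\n', w, rho, rhof);
  end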

Figure 11.2: ρ(G_ω) for ω ∈ [0, 2], for n = 100 (lower curve) and n = 2500 (upper curve).

A plot of ρ(G_ω) as a function of ω ∈ (0, 2) is shown in Figure 11.2 for n = 100 (lower curve) and n = 2500 (upper curve). As ω increases, the spectral radius of G_ω decreases monotonically to its minimum at ω = ω*. Then it increases linearly to the value one at ω = 2. We call ω* the optimal relaxation parameter. For the discrete Poisson problem we have β = cos(πh), and it follows from (11.27) and (11.28) that

  ω* = 2/(1 + sin(πh)),  ρ(G_{ω*}) = ω* − 1 = (1 − sin(πh))/(1 + sin(πh)),  h = 1/(m+1).   (11.30)

Letting ω = 1 in (11.27) we find ρ(G_1) = β^2 = ρ(G_J)^2 = cos^2(πh) for the GS method. Thus, for the discrete Poisson problem the J method needs twice as many iterations as the GS method for a given accuracy. The values of ρ(G_J), ρ(G_1), and ρ(G_{ω*}) = ω* − 1 are shown in Table 11.2 for n = 100 and n = 2500. We also show the smallest integer k_n such that ρ(G)^{k_n} ≤ tol. This is an estimate for the number of iterations needed to obtain an accuracy of tol. These values are comparable to the exact values given in Table 11.1.

Number of iterations

Consider next the rate of convergence of the iteration x_{k+1} = Gx_k + c. We would like to know how fast the iterative method converges.

Table 11.2: Spectral radii of G_J, G_1 and G_{ω*} for n = 100 and n = 2500, and the smallest integer k_n such that ρ(G)^{k_n} ≤ tol.

Suppose ‖·‖ is a matrix norm that is subordinate to a vector norm, also denoted by ‖·‖. Recall that

  x_k − x = G^k(x_0 − x).

For k sufficiently large

  ‖x_k − x‖ ≤ ‖G^k‖ ‖x_0 − x‖ ≈ ρ(G)^k ‖x_0 − x‖.

For the last formula we apply (11.34) below, which says that lim_{k→∞} ‖G^k‖^{1/k} = ρ(G). For Jacobi's method and the spectral norm we have ‖G_J^k‖_2 = ρ(G_J)^k (cf. (11.26)). For fast convergence we should use a G with small spectral radius.

Lemma (Number of iterations)
Suppose ρ(G) = 1 − η for some 0 < η < 1, that ‖·‖ is a consistent matrix norm on C^{n×n}, and let s ∈ N. Then

  k̂ := s log(10)/η   (11.31)

is an estimate for the smallest number of iterations k such that ρ(G)^k ≤ 10^{-s}.

Proof. The estimate k̂ is an approximate solution of the equation ρ(G)^k = 10^{-s}. Indeed, since log(1 − η) ≈ −η when η is small,

  k = −s log(10)/log(1 − η) ≈ s log(10)/η = k̂.

The following estimates are obtained. They agree with those we found numerically in Section 11.1.

R and J: ρ(G_J) = cos(πh) = 1 − η, with η = 1 − cos(πh) = π^2 h^2/2 + O(h^4) = π^2/(2n) + O(n^{-3/2}). Thus

  k_n = (2 log(10) s/π^2) n + O(√n) = O(n).

263 11.3. Convergence 251 GS: ρ(g 1 ) = cos 2 (πh) = 1 η, η = 1 cos 2 (πh) = sin 2 πh = π 2 h 2 +O(h 4 ) = π 2 /n + O(n 2 ). Thus k n = log(10)s π 2 n + O(n 1 ) = O(n). SOR: ρ(g ω ) = 1 sin(πh) 1+sin(πh) = 1 2πh + O(h2 ). Thus, k n = log(10)s n + O(n 1/2 ) = O( n). 2π Exercise (Convergence example for fix point iteration) Consider for a C [ ] [ ] [ ] [ ] x1 0 a x1 1 a x := = + =: Gx + c. a 0 1 a x 2 Starting with x 0 = 0 show by induction x 2 x k (1) = x k (2) = 1 a k, k 0, and conclude that the iteration converges to the fixed-point x = [1, 1] T for a < 1 and diverges for a > 1. Show that ρ(g) = 1 η with η = 1 a. Compute the estimate (11.31) for the rate of convergence for a = 0.9 and s = 16 and compare with the true number of iterations determined from a k Exercise (Estimate in Lemma can be exact) Consider the iteration in Example Show that ρ(g J ) = 1/2. Then show that x k (1) = x k (2) = 1 2 k for k 0. Thus the estimate in Lemma is exact in this case. We note that 1. The convergence depends on the behavior of the powers G k as k increases. The matrix M should be chosen so that all elements in G k converge quickly to zero and such that the linear system (11.9) is easy to solve for x k+1. These are conflicting demands. M should be an approximation to A to obtain a G with small elements, but then (11.9) might not be easy to solve for x k The convergence lim k G k 1/k = ρ(g) can be quite slow (cf. Exercise 11.25).
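As a small illustration of the estimate (11.31) applied to the discrete Poisson problem, the sketch below evaluates k̂ = s log(10)/η for J, GS and SOR using the values of η derived above; the accuracy target s is an assumed value.

  % Estimated iteration counts from (11.31) for J, GS and SOR.
  s = 4;                               % assumed accuracy target 10^(-s)
  for m = [10 50]
      h   = 1/(m+1);
      eta = [1-cos(pi*h), ...                        % J
             1-cos(pi*h)^2, ...                      % GS
             2*sin(pi*h)/(1+sin(pi*h))];             % SOR with optimal omega
      k   = ceil(s*log(10)./eta);
      fprintf('n = %4d : J %6d  GS %6d  SOR %4d\n', m^2, k(1), k(2), k(3));
  end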

264 252 Chapter 11. The Classical Iterative Methods Exercise (Slow spectral radius convergence) The convergence lim k A k 1/k = ρ(a) can be quite slow. Consider A :=. λ a λ a λ λ a λ. Rn n. If λ = ρ(a) < 1 then lim k A k = 0 for any a R. We show below that the (1, n) element of A k is given by f(k) := ( k n 1) a n 1 λ k n+1 for k n 1. (a) Pick an n, e.g. n = 5, and make a plot of f(k) for λ = 0.9, a = 10, and n 1 k 200. Your program should also compute max k f(k). Use your program to determine how large k must be before f(k) < (b) We can determine the elements of A k explicitly for any k. Let E := (A λi)/a. Show by induction that E k = [ ] 0 I n k 0 0 for 1 k n 1 and that E n = 0. (c) We have A k = (ae + λi) k = min{k,n 1} ( k ) j=0 j a j λ k j E j and conclude that the (1, n) element is given by f(k) for k n Stopping the iteration In Algorithms 11.2 and 11.3 we had access to the exact solution and could stop the iteration when the error was sufficiently small in the infinity norm. The decision when to stop is obviously more complicated when the exact solution is not known. One possibility is to choose a vector norm, keep track of x k+1 x k, and stop when this number is sufficiently small. The following result indicates that x k x can be quite large if G is close to one. Lemma (Be careful when stopping) Suppose G < 1 for some consistent matrix norm on C n n which is subordinate to a vector norm also denoted by. If x k = Gx k 1 + c and x = Gx + c. Then Proof. We find x k x k 1 1 G G x k x, k 1. (11.32) x k x = G(x k 1 x) G x k 1 x = G x k 1 x k + x k x G ( x k 1 x k + x k x ). Thus (1 G ) x k x G x k 1 x k which implies (11.32).
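For part (a) of the slow-spectral-radius exercise above, a possible sketch is the following; it assumes n = 5, λ = 0.9 and a = 10 as suggested there, and plots f(k) = binom(k, n-1) a^{n-1} λ^{k-n+1} on a logarithmic scale.

  % Plot f(k) and locate its maximum (Exercise on slow convergence, part (a)).
  n = 5; lam = 0.9; a = 10;
  k = n-1:200;
  f = arrayfun(@(kk) nchoosek(kk, n-1), k) .* a^(n-1) .* lam.^(k-n+1);
  semilogy(k, f), xlabel('k'), ylabel('f(k)')
  [fmax, i] = max(f);
  fprintf('max f = %g attained at k = %d\n', fmax, k(i));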

265 11.4. Powers of a matrix 253 Another possibility is to stop when the residual vector r k := b Ax k is sufficiently small in some norm. To use the residual vector for stopping it is convenient to write the iterative method (11.10) in an alternative form. If M is the splitting matrix of the method then by (11.9) we have Mx k+1 = Mx k Ax k +b. This leads to x k+1 = x k + M 1 r k, r k = b Ax k. (11.33) Testing on r k works fine if A is well conditioned, but Theorem 7.31 shows that the relative error in the solution can be much larger than the relative error in r k if A is ill-conditioned Powers of a matrix Let A C n n be a square matrix. In this section we consider the special matrix sequence {A k } of powers of A. We want to know when this sequence converges to the zero matrix. Such a sequence occurs in iterative methods (cf. (11.16)), in Markov processes in statistics, in the converge of geometric series of matrices (Neumann series cf. Section ) and in many other applications The spectral radius In this section we show the following theorem. Theorem (When is lim k A k = 0?) For any A C n n we have lim k Ak = 0 ρ(a) < 1, where ρ(a) is the spectral radius of A given by (11.17). Clearly ρ(a) < 1 is a necessary condition for lim k A k = 0. For if (λ, x) is an eigenpair of A with λ 1 and x 2 = 1 then A k x = λ k x, and this implies A k 2 A k x 2 = λ k x 2 = λ k, and it follows that A k does not tend to zero. The sufficiency condition is harder to show. We construct a consistent matrix norm on C n n such that A < 1 and then use Theorems 11.7 and We start with Theorem (Any consistent norm majorizes the spectral radius) For any matrix norm that is consistent on C n n and any A C n n we have ρ(a) A. Proof. Let (λ, x) be an eigenpair for A and define X := [x,..., x] C n n. Then λx = AX, which implies λ X = λx = AX A X. Since X 0 we obtain λ A.
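Returning to the form (11.33), a minimal sketch of the splitting iteration with a residual-based stopping test is given below. The splitting matrix M is supplied by the caller (M = D for J, M = D − A_L for GS, M = ω^{-1}D − A_L for SOR, cf. (11.12)); the function name and tolerance handling are illustrative.

  % Splitting iteration x_{k+1} = x_k + M^{-1} r_k, stopping on the residual.
  function [x, k] = splitting_solve(A, b, M, x, K, tol)
    nb = norm(b);
    for k = 0:K
      r = b - A*x;                       % residual r_k
      if norm(r) <= tol*nb, return, end
      x = x + M\r;                       % equivalent to M*x_{k+1} = (M-A)*x_k + b
    end
    k = K + 1;                           % no convergence within K iterations
  end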

266 254 Chapter 11. The Classical Iterative Methods The next theorem shows that if ρ(a) < 1 then A < 1 for some consistent matrix norm on C n n, thus completing the proof of Theorem Theorem (The spectral radius can be approximated by a norm) Let A C n n and ɛ > 0 be given. There is a consistent matrix norm on C n n such that ρ(a) A ρ(a) + ɛ. Proof. Let A have eigenvalues λ 1,..., λ n. By the Schur Triangulation Theorem 5.28 there is a unitary matrix U and an upper triangular matrix R = [r ij ] such that U AU = R. For t > 0 we define D t := diag(t, t 2,..., t n ) R n n, and note that the (i, j) element in D t RDt 1 is given by t i j r ij for all i, j. For n = 3 D t RD 1 t = λ 1 t 1 r 12 t 2 r 13 0 λ 2 t 1 r λ 3 For each B C n n and t > 0 we use the one norm to define the matrix norm B t := D t U BUD 1 t 1. We leave it as an exercise to show that t is a consistent matrix norm on C n n. We define B := B t, where t is chosen so large that the sum of the absolute values of all off-diagonal elements in D t RD 1 t is less than ɛ. Then n A = D t U AUD 1 t 1 = D t RD 1 t 1 = max ( D t RD 1 ) t max 1 j n ( λ j + ɛ) = ρ(a) + ɛ. 1 j n i=1 ij A consistent matrix norm of a matrix can be much larger than the spectral radius. However the following result holds. Theorem (Spectral radius convergence) For any consistent matrix norm on C n n and any A C n n we have lim k Ak 1/k = ρ(a). (11.34) Proof. If λ is an eigenvalue of A then λ k is an eigenvalue of A k for any k N. By Theorem we then obtain ρ(a) k = ρ(a k ) A k for any k N so that ρ(a) A k 1/k. Let ɛ > 0 and consider the matrix B := (ρ(a) + ɛ) 1 A. Then ρ(b) = ρ(a)/(ρ(a)+ɛ) < 1 and B k 0 by Theorem as k. Choose N N such that B k < 1 for all k N. Then for k N A k = (ρ(a) + ɛ) k B k = ( ρ(a) + ɛ ) k B k < ( ρ(a) + ɛ ) k.

267 11.4. Powers of a matrix 255 We have shown that ρ(a) A k 1/k ρ(a) + ɛ for k N. Since ɛ is arbitrary the result follows. Exercise (A special norm) Show that B t := D t U BUD 1 t 1 defined in the proof of Theorem is a consistent matrix norm on C n n Neumann series Carl Neumann., He studied potential theory. The Neumann boundary conditions are named after him. Let B be a square matrix. In this section we consider the Neumann series k=0 Bk which is a matrix analogue of a geometric series of numbers. Consider an infinite series k=0 A k of matrices in C n n. We say that the series converges if the sequence of partial sums {S m } given by S m = m k=0 A k converges. The series converges if and only if {S m } is a Cauchy sequence, i.e. to each ɛ > 0 there exists an integer N so that S l S m < ɛ for all l > m N. Theorem (Neumann series) Suppose B C n n. Then 1. The series k=0 Bk converges if and only if ρ(b) < If ρ(b) < 1 then (I B) is nonsingular and (I B) 1 = k=0 Bk. 3. If B < 1 for some consistent matrix norm on C n n then Proof. (I B) B. (11.35) 1. Suppose ρ(b) < 1. We show that S m := m k=0 Bk is a Cauchy sequence and hence convergent. Let ɛ > 0. By Theorem there is a consistent matrix norm on C n n such that B < 1. Then for l > m S l S m = l k=m+1 B k l k=m+1 B k B m+1 k=0 B k = B m+1 1 B.

268 256 Chapter 11. The Classical Iterative Methods But then {S m } is a Cauchy sequence provided N is such that B N+1 1 B < ɛ. Conversely, suppose (λ, x) is an eigenpair for B with λ 1. We find S m x = m k=0 Bk x = ( m k=0 λk) x. Since λ k does not tend to zero the series k=0 λk is not convergent and therefore {S m x} and hence {S m } does not converge. 2. We have ( m B k) (I B) = I+B+ +B m (B+ +B m+1 ) = I B m+1. (11.36) k=0 Since ρ(b) < 1 we conclude that B m+1 0 and hence taking limits in (11.36) we obtain ( k=0 Bk) (I B) = I which completes the proof of By 2: (I B) 1 = k=0 Bk k=0 B k = 1 1 B. Exercise (When is A + E nonsingular?) Suppose A C n n is nonsingular and E C n n. Show that A+E is nonsingular if and only if ρ(a 1 E) < The Optimal SOR Parameter ω The following analysis is only carried out for the discrete Poisson matrix. It also holds for the averaging matrix given by (9.8). A more general theory is presented in [37]. We will compare the eigenpair equations for G J and G ω. It is convenient to write these equations using the matrix formulation T V + V T = h 2 F. If G J v = µv is an eigenpair of G J then 1 4 (v i 1,j + v i,j 1 + v i+1,j + v i,j+1 ) = µv i,j, i, j = 1,..., m, (11.37) where V := vec(v) R m m and v i,j = 0 if i {0, m + 1} or j {0, m + 1}. Suppose (λ, w) is an eigenpair for G ω. By (11.21) (I ωl) 1( ωr + (1 ω)i ) w = λw or (ωr + λωl)w = (λ + ω 1)w, (11.38) where l i,i m = r i,i+m = 1/4 for all i, and all other elements in L and R are equal to zero. Let w = vec(w ), where W C m m. Then (11.38) can be written ω 4 (λw i 1,j + λw i,j 1 + w i+1,j + w i,j+1 ) = (λ + ω 1)w i,j, (11.39) where w i,j = 0 if i {0, m + 1} or j {0, m + 1}.

269 11.5. The Optimal SOR Parameter ω 257 Theorem (The optimal ω) Consider the SOR method applied to the discrete Poisson matrix (9.8), where we use the natural ordering. Moreover, assume ω (0, 2). 1. If λ 0 is an eigenvalue of G ω then is an eigenvalue of G J. µ := λ + ω 1 ωλ 1/2 (11.40) 2. If µ is an eigenvalue of G J and λ satisfies the equation then λ is an eigenvalue of G ω. µωλ 1/2 = λ + ω 1 (11.41) Proof. Suppose (λ, w) is an eigenpair for G ω. We claim that (µ, v) is an eigenpair for G J, where µ is given by (11.40) and v = (V ) with v i,j := λ (i+j)/2 w i,j. Indeed, replacing w i,j by λ (i+j)/2 v i,j in (11.39) and cancelling the common factor λ (i+j)/2 we obtain But then ω 4 (v i 1,j + v i,j 1 + v i+1,j + v i,j+1 ) = λ 1/2 (λ + ω 1)v i,j. G J v = (L + R)v = λ + ω 1 v = µv. ωλ 1/2 For the converse let (µ, v) be an eigenpair for G J and let et λ be a solution of (11.41). We define as before v =: vec(v ), W = vec(w ) with w i,j := λ (i+j)/2 v i,j. Inserting this in (11.37) and canceling λ (i+j)/2 we obtain 1 4 (λ1/2 w i 1,j + λ 1/2 w i,j 1 + λ 1/2 w i+1,j + λ 1/2 w i,j+1 ) = µw i,j. Multiplying by ωλ 1/2 we obtain ω 4 (λw i 1,j + λw i,j 1 + w i+1,j + w i,j+1 ) = ωµλ 1/2 w i,j, Thus, if ωµλ 1/2 = λ + ω 1 then by (11.39) (λ, w) is an eigenpair for G ω. Proof of Theorem Combining statement 1 and 2 in Theorem we see that ρ(g ω ) = λ(µ), where λ(µ) is an eigenvalue of G ω satisfying (11.41) for some eigenvalue µ of G J. The eigenvalues of G J are 1 2 cos(jπh) cos(kπh), j, k = 1,..., m, so µ is real and

270 258 Chapter 11. The Classical Iterative Methods both µ and µ are eigenvalues. Thus, to compute ρ(g ω ) it is enough to consider (11.41) for a positive eigenvalue µ of G J. Solving (11.41) for λ = λ(µ) gives λ(µ) := 1 4 ( ωµ ± (ωµ) 2 4(ω 1)) 2. (11.42) Both roots λ(µ) are eigenvalues of G ω. The discriminant is strictly decreasing on (0, 2) since d(ω) := (ωµ) 2 4(ω 1). d (ω) = 2(ωµ 2 2) < 2(ω 2) < 0. Moreover d(0) = 4 > 0 and d(2) = 4µ 2 4 < 0. As a function of ω, λ(µ) changes from real to complex when d(ω) = 0. The root in (0, 2) is In the complex case we find λ(µ) = 1 4 ω = ω(µ) := µ 2 2 µ 2 = µ. (11.43) 2 ( (ωµ) 2 + 4(ω 1) (ωµ) 2) = ω 1, ω(µ) < ω < 2. In the real case both roots of (11.42) are positive and the larger one is λ(µ) = 1 4 ( ωµ + (ωµ) 2 4(ω 1)) 2, 0 < ω ω(µ). (11.44) Both λ(µ) and ω(µ) are strictly increasing as functions of µ. It follows that λ(µ) is maximized for µ = ρ(g J ) =: β and for this value of µ we obtain (11.27) for 0 < ω ω(β) = ω. Evidently ρ(g ω ) = ω 1 is strictly increasing in ω < ω < 2. Equation (11.29) will follow if we can show that ρ(g ω ) is strictly decreasing in 0 < ω < ω. By differentiation d ( ωβ + ) (ωβ) dω 2 4(ω 1) = β (ωβ) 2 4(ω 1) + ωβ 2 2. (ωβ)2 4(ω 1) Since β 2 (ω 2 β 2 4ω + 4) < (2 ωβ 2 ) 2 the numerator is negative and the strict decrease of ρ(g ω ) in 0 < ω < ω follows.

271 11.6. Review Questions Review Questions Consider a matrix A C n n with nonzero diagonal elements. Define the J and GS method in component form, Do they always converge? Give a necessary and sufficient condition that A n 0. Is there a matrix norm consistent on C n n such that A < ρ(a)? What is a Neumann series? when does it converge? How do we define convergence of a fixed point iteration x k+1 = Gx k + c? When does it converge? Define Richardson s method.


273 Chapter 12 The Conjugate Gradient Method Magnus Rudolph Hestenes, (left), Eduard L. Stiefel, (right). The conjugate gradient method was published by Hestenes and Stiefel in 1952, [12] as a direct method for solving linear systems. Today its main use is as an iterative method for solving large sparse linear systems. On a test problem we show that it performs as well as the SOR method with optimal acceleration parameter, and we do not have to estimate any such parameter. However the conjugate gradient method is restricted to symmetric positive definite systems. We also consider a mathematical formulation of the preconditioned conjugate gradient method. It is used to speed up convergence of the conjugate gradient method. We only give one example of a possible preconditioner. See [1] for a more complete treatment of iterative methods and preconditioning. The conjugate gradient method can also be used for minimization and is related to a method known as steepest descent. This method and the conjugate gradient method are both minimization methods and iterative methods for solving linear equations. Throughout this chapter A R n n will be a symmetric positive definite matrix. We recall that A has positive eigenvalues and that the spectral (2-norm) condition number of A is given by κ := λmax λ min, where λ max and λ min are the largest 261

274 262 Chapter 12. The Conjugate Gradient Method and smallest eigenvalue of A. The analysis of the methods in this chapter is in terms of two inner products on R n, the usual inner product x, y = x T y with the associated Euclidian norm x 2 = x T x, and the A-inner product and the corresponding A-norm given by x, y A := x T Ay, y A := y T Ay, x, y R n, (12.1) We note that the A-inner product is an inner product on R n. Indeed, for any x, y, z R n 1. x, x A = x T Ax 0 and x, x A = 0 if and only if x = 0, since A is positve definite, 2. x, y A := x T Ay = (x T Ay) T = y T A T x = y T Ax = y, x A by symmetry of A, 3. x + y, z A := x T Az + y T Az = x, z A + y, z A, true for any A. By Theorem 4.3 the A-norm is a vector norm on R n since it is an inner product norm. Exercise 12.1 (A-norm) One can show that the A-norm is a vector norm on R n without using the fact that it is an inner product norm. Show this with the help of the Cholesky factorization of A Quadratic Minimization and Steepest Descent We start by discussing some aspect of quadratic minimization and its relation to solving linear systems. Consider for A R n n and b R n the quadratic function Q : R n R given by As an example, some level curves of Q(x, y) := 1 [ ] [ ] [ ] 2 1 x x y y Q(y) := 1 2 yt Ay b T y. (12.2) = x 2 xy + y 2 (12.3) are shown in Figure The level curves are ellipses and the graph of Q is a paraboloid (cf. Exercise 12.2). Exercise 12.2 (Paraboloid) Let A = UDU T be the spectral decomposition of A, i. e., U is orthonormal and

275 12.1. Quadratic Minimization and Steepest Descent 263 x 2 x 3 x 2 x 0 x 1 x 0 x 1 Figure 12.1: Level curves for Q(x, y) given by (12.3). Also shown is a steepest descent iteration (left) and a conjugate gradient iteration (right) to find the minimum of Q. (cf Examples 12.4,12.7). D = diag(λ 1,..., λ n ) is diagonal. Define new varables v = [v 1,..., v n ] T := U T y, and set c := U T b = [c 1,..., c n ] T. Show that Q(y) = 1 n n λ j vj 2 c j v j. 2 j=1 The following expansion will be used repeatedly. For y, h R n and ε R Q(y + εh) = Q(y) εh T r(y) ε2 h T Ah, where r(y) := b Ay. (12.4) Minimizing a quadratic function is equivalent to solving a linear system. Lemma 12.3 (Quadratic function) A vector x R n minimizes Q given by (12.2) if and only if Ax = b. Moreover, the residual r(y) := b Ay at any y R n is equal to the negative gradient, i. e., [ ] T r(y) = Q(y), where := y 1,..., y n. Proof. If y = x, ε = 1, and Ax = b then (12.4) simplifies to Q(x + h) = Q(x) ht Ah, and since A is symmetric positive definite Q(x + h) > Q(x) for j=1

276 264 Chapter 12. The Conjugate Gradient Method all nonzero h R n. It follows that x is the unique minimum of Q. Conversely, if Ax b and h := r(x), then by (12.4), Q(x + εh) Q(x) = ε(h T r(x) 1 2 εht Ah) < 0 for ε > 0 sufficiently small. Thus x does not minimize Q. By (12.4) for y R n y i Q(y) := lim ε = lim ε 0 ε showing that r(y) = Q(y). ε (Q(y + εe i) Q(y)) ( εe Ti r(y)) + 12 ) ε2 e Ti Ae i = e T i r(y), i = 1,..., n, A general class of minimization algorithms for Q and solution algorithms for a linear system is given as follows: 1. Choose x 0 R n. 2. For k = 0, 1, 2,... Choose a search direction p k, Choose a step length α k, Compute x k+1 = x k + α k p k. (12.5) We would like to generate a sequence {x k } that converges quickly to the minimum x of Q. For a fixed direction p k we say that α k is optimal if Q(x k+1 ) is as small as possible, i.e. Q(x k+1 ) = Q(x k + α k p k ) = min α R Q(x k + αp k ). By (12.4) we have Q(x k + αp k ) = Q(x k ) αp T k r k α2 p T k Ap k, where r k := b Ax k. Since p T k Ap k 0 we find a minimum α k by solving α Q(x k +αp k ) = 0. It follows that the optimal α k is uniquely given by pt k r k α k := p T k Ap. (12.6) k In the method of Steepest descent, also known as the Gradient method we choose p k = r k the negative gradient, and the optimal α k. Starting from x 0 we compute for k = 0, 1, 2... x k+1 = x k + ( r T k r k ) r T k Ar rk. (12.7) k
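A minimal sketch of the steepest descent iteration (12.7), organized as in (12.8), is given below; it assumes A is symmetric positive definite, and the function name and stopping test are illustrative. Applied to the 2-by-2 system of Example 12.4 below it reproduces the zig-zag behaviour indicated in the left part of Figure 12.1.

  % Steepest descent: p_k = r_k with the optimal step length (12.6).
  function [x, k] = steepestdescent(A, b, x, K, tol)
    r = b - A*x;
    for k = 0:K
      if norm(r) <= tol*max(norm(b),1), return, end
      t = A*r;
      a = (r'*r)/(r'*t);       % optimal step length
      x = x + a*r;
      r = r - a*t;             % residual update (12.9)
    end
    k = K + 1;
  end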

277 12.2. The Conjugate Gradient Method 265 This is similer to Richardson s method (11.18), but in that method we used a constant step length. Computationally, a step in the steepest descent iteration can be organized as follows p k = r k, t k = Ap k, α k = (r T k r k )/(p T k t k ), x k+1 = x k + α k p k, r k+1 = r k α k t k. (12.8) Here, and in general, the following updating of the residual is used: r k+1 = b Ax k+1 = b A(x k + α k p k ) = r k α k Ap k. (12.9) In the steepest descent method the choice p k = r k implies that the last two gradients are orthogonal. Indeed, by (12.9), r T k+1 r k = (r k α k Ap k ) T p k = 0 since α k = rt k r k p T k Ap k and A is symmetric. Example 12.4 (Steepest descent iteration) Suppose Q(x, y) is given by (12.3). Starting with x 0 = [ 1, 1/2] T and r 0 = Ax 0 = [3/2, 0] T we find t 0 = 3 [ ] 1 1/2, α0 = 1 2, x 1 = 4 1 [ 1 2 ], r 1 = [ 0 1 ] t 1 = [ ] 1 2, α1 = 1 2, x 2 = 4 1 [ ] 1 1/2, r2 = [ ] 1/2 0, and in general for k 1 t 2k 2 = k [ ] 1 1/2, x2k 1 = 4 k [ 1 2 ], r 2k 1 = 3 4 k [ 0 1 ] t 2k 1 = 3 4 k [ ] 1 2, x2k = 4 k [ ] 1 1/2, r2k = 3 4 k [ ] 1/2 0. Since α k = 1/2 is constant for all k the methods of Richardson, Jacobi and steepest descent are the same on this simple problem. See the left part of Figure The rate of convergence is determined from x j+1 2 / x j = r j+1 2 / r j 2 = 1/2 for all j. Exercise 12.5 (Steepest descent iteration) Verify the numbers in Example The Conjugate Gradient Method In the steepest descent method the last two gradients are orthogonal. In the conjugate gradient method all gradients are orthogonal 19. We achieve this by using A-orthogonal search directions i. e., p T i Ap j = 0 for all i j. 19 It is this property that has given the method its name.

278 266 Chapter 12. The Conjugate Gradient Method Derivation of the method As in the steepest descent method we choose a starting vector x 0 R n. If r 0 = b Ax 0 = 0 then x 0 is the exact solution and we are finished, otherwise we initially make a steepest descent step. It follows that r T 1 r 0 = 0 and p 0 := r 0. For the general case we define for j 0 We note that j 1 p j := r j ( rt j Ap i p T i Ap )p i, (12.10) i i=0 x j+1 := x j + α j p j α j := rt j r j p T j Ap, (12.11) j r j+1 = r j α j Ap j. (12.12) 1. p j is computed by the Gram-Schmidt orthogonalization process applied to the residuals r 0,..., r j using the A-inner product. The search directions are therefore A-orthogonal and nonzero as long as the residuals are linearly independent. 2. Equation (12.12) follows from (12.9). 3. It can be shown that the step length α j is optimal for all j (cf. Exercise 12.10)). Lemma 12.6 (The residuals are orthogonal) Suppose that for some k 0 that x j is well defined, r j 0, and r T i r j = 0 for i, j = 0, 1,..., k, i j. Then x k+1 is well defined and r T k+1 r j = 0 for j = 0, 1,..., k. Proof. Since the residuals r j are orthogonal and nonzero for j k, they are linearly independent, and it follows form the Gram-Schmidt Theorem 4.9 that p k is nonzero and p T k Ap i = 0 for i < k. But then x k+1 and r k+1 are well defined. Now r T k+1r j (12.12) = (r k α k Ap k ) T r j = r T k r j α k p T k A ( j 1 p j + ( rt j Ap i ) p T i Ap )p i i (12.10) i=0 p T k Ap i = 0 = r T k r j α k p T k Ap j = 0, j = 0, 1,..., k. That the final expression is equal to zero follows by orthogonality and A-orthogonality for j < k and by the definition of α k for j = k. This completes the proof.

279 12.2. The Conjugate Gradient Method 267 The conjugate gradient method is also a direct method. The residuals are orthogonal and therefore linearly independent if they are nonzero. Since dim R n = n the n + 1 residuals r 0,..., r n cannot all be nonzero and we must have r k = 0 for some k n. Thus we find the exact solution in at most n iterations. The expression (12.10) for p k can be greatly simplified. All terms except the last one vanish, since by orthogonality of the residuals r T (12.12) j Ap i = r T (r i r i+1 ) j = 0, i = 0, 1,..., j 2. α i With j = k + 1 (12.10) therefore takes the simple form p k+1 = r k+1 + β k p k and we find β k := rt k+1 Ap k p T k Ap k (12.12) = rt k+1 (r k+1 r k ) (12.11) α k p T k Ap = rt k+1 r k+1 k r T k r. (12.13) k To summarize, in the conjugate gradient method we start with x 0, p 0 = r 0 = b Ax 0 and then generate a sequence of vectors {x k } as follows: For k = 0,1, 2,... x k+1 := x k + α k p k, α k := rt k r k p T k Ap, k (12.14) r k+1 := r k α k Ap k, (12.15) p k+1 := r k+1 + β k p k, β k := rt k+1 r k+1 r T k r. (12.16) k The residuals and search directions are orthogonal and A-orthogonal, respectively. For computation we organize the iterations as follows for k = 0, 1, 2,... t k = Ap k, α k = (r T k r k )/(p T k t k ), x k+1 = x k + α k p k, r k+1 = r k α k t k, β k = (r T k+1r k+1 )/(r T k r k ), p k+1 := r k+1 + β k p k. (12.17) Note that (12.17) differs from (12.8) only in the computation of the search direction. Example 12.7 (Conjugate gradient iteration) Consider (12.17) applied to the positive definite linear system [ ] [ x 1 x 2 ] = [ 0 0 ].

280 268 Chapter 12. The Conjugate Gradient Method Starting as in Example 12.4 with x 0 = t 0 = [ 3 3/2 ], α0 = 1/2, x 1 = [ ] 1 1/2 t 1 = [ 0 9/8 ], α1 = 2/3, x 2 = 0, r 2 = 0. we find p 0 = r 0 = [ ] 3/2 0 and then [ ] 1/4, r 1/2 1 = [ ] 0 3/4, β0 = 1/4, p 1 = Thus x 2 is the exact solution as illustrated in the right part of Figure [ ] 3/8, 3/4 Exercise 12.8 (Conjugate gradient iteration, II) Do one ( iteration ) with the conjugate gradient method when x 0 = 0. (Answer: x 1 = b.) b T b b T Ab Exercise 12.9 (Conjugate gradient iteration, III) Do two conjugate gradient iterations for the system starting with x 0 = 0. [ ] [ ] [ x1 0 = x 2 3] Exercise (The cg step length is optimal) Show that the step length α k in the conjugate gradient method is optimal 20. Exercise (Starting value in cg) Show that the conjugate gradient method (12.17) for Ax = b starting with x 0 is the same as applying the method to the system Ay = r 0 := b Ax 0 starting with y 0 = The conjugate gradient algorithm In this section we give numerical examples and discuss implementation. The formulas in (12.17) form a basis for an algorithm. 20 Hint: use induction on k to show that p k = r k + k 1 j=0 a k,jr j for some constants a k,j. 21 Hint: The conjugate gradient method for Ay = r 0 can be written y k+1 := y k + γ k q k, γ k := st k s k q T, s k Aq k+1 := s k γ k Aq k, q k+1 := s k+1 + δ k q k, δ k := st k+1 s k+1 k s T. Show that k s k y k = x k x 0, s k = r k, and q k = p k, for k = 0, 1, 2....
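The computation in Example 12.7 can be checked directly from the formulas (12.17); the short script below assumes the system of Example 12.4 with b = 0 and x_0 = [-1; -1/2], and performs two conjugate gradient steps.

  % Two cg steps reproduce the exact solution of Example 12.7.
  A = [2 -1; -1 2];  b = [0; 0];
  x = [-1; -1/2];  r = b - A*x;  p = r;
  for k = 0:1
      t = A*p;  a = (r'*r)/(p'*t);
      x = x + a*p;  rnew = r - a*t;
      beta = (rnew'*rnew)/(r'*r);
      p = rnew + beta*p;  r = rnew;
  end
  disp(x')    % x_2 = [0 0], the exact solution after two iterations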

Algorithm 12.12 (Conjugate gradient iteration)
The symmetric positive definite linear system Ax = b is solved by the conjugate gradient method. x is a starting vector for the iteration. The iteration is stopped when ‖r_k‖_2/‖b‖_2 ≤ tol or k > itmax. K is the number of iterations used.

  function [x,K]=cg(A,b,x,tol,itmax)
  r=b-A*x; p=r; rho0=b'*b; rho=r'*r;
  for k=0:itmax
    if sqrt(rho/rho0)<=tol
      K=k; return
    end
    t=A*p; a=rho/(p'*t);
    x=x+a*p; r=r-a*t;
    rhos=rho; rho=r'*r;
    p=r+(rho/rhos)*p;
  end
  K=itmax+1;

The work involved in each iteration is

1. one matrix times vector product (t = Ap),
2. two inner products (p^T t and r^T r),
3. three vector-plus-scalar-times-vector operations (x = x + ap, r = r − at and p = r + (rho/rhos)p).

The dominating part is the computation of t = Ap.

Numerical example

We test the conjugate gradient method on two examples. For a similar test for the steepest descent method see the exercise on testing steepest descent in Section 12.3. Consider the matrix given by the Kronecker sum T_2 := T_1 ⊗ I + I ⊗ T_1, where T_1 := tridiag_m(a, d, a) ∈ R^{m×m} and a, d ∈ R. We recall that this matrix is symmetric positive definite if d > 0 and d ≥ 2|a| (cf. Theorem 9.8). We set h = 1/(m+1) and f = [1, ..., 1]^T ∈ R^n. We consider two problems.

1. a = 1/9, d = 5/18, the averaging matrix.
2. a = −1, d = 2, the Poisson matrix.

Implementation issues

Note that for our test problems T_2 only has O(5n) nonzero elements. Therefore, taking advantage of the sparseness of T_2, we can compute t in Algorithm 12.12

in O(n) arithmetic operations. With such an implementation the total number of arithmetic operations in one iteration is O(n). We also note that it is not necessary to store the matrix T_2.

Table 12.13: The number of iterations K for the averaging problem on a √n × √n grid for various n.

Table 12.15: The number of iterations K for the Poisson problem on a √n × √n grid for various n, together with K/√n.

To use the Conjugate Gradient Algorithm on the test matrix for large n it is advantageous to use a matrix equation formulation. We define matrices V, R, P, B, T ∈ R^{m×m} by x = vec(V), r = vec(R), p = vec(P), t = vec(T), and h^2 f = vec(B). Then T_2 x = h^2 f is equivalent to T_1 V + V T_1 = B, and t = T_2 p is equivalent to T = T_1 P + P T_1. This leads to the following algorithm for testing the conjugate gradient algorithm on the matrix A = tridiag_m(a,d,a) ⊗ I_m + I_m ⊗ tridiag_m(a,d,a) ∈ R^{m^2 × m^2}.

Algorithm (Testing conjugate gradient)

  function [V,K]=cgtest(m,a,d,tol,itmax)
  R=ones(m)/(m+1)^2; rho=sum(sum(R.*R)); rho0=rho; P=R;
  V=zeros(m,m); T1=sparse(tridiagonal(a,d,a,m));
  for k=1:itmax
    if sqrt(rho/rho0)<=tol
      K=k; return
    end
    T=T1*P+P*T1;
    a=rho/sum(sum(P.*T)); V=V+a*P; R=R-a*T;
    rhos=rho; rho=sum(sum(R.*R)); P=R+(rho/rhos)*P;
  end
  K=itmax+1;
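A possible test run is sketched below, assuming the testing algorithm above and the helper function tridiagonal(a,d,a,m) from earlier chapters are available; the tolerance and iteration limit are illustrative choices.

  % Illustrative calls for the averaging and Poisson test problems.
  [~, Kavg]  = cgtest(50, 1/9, 5/18, 1e-8, 400);   % averaging matrix
  [~, Kpois] = cgtest(50, -1, 2, 1e-8, 400);       % Poisson matrix
  fprintf('averaging: K = %d, Poisson: K = %d\n', Kavg, Kpois);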

283 12.3. Convergence 271 For both the averaging- and Poison matrix we use tol = For the averaging matrix we obtain the values in Table The convergence is quite rapid. It appears that the number of iterations can be bounded independently of n, and therefore we solve the problem in O(n) operations. This is the best we can do for a problem with n unknowns. Consider next the Poisson problem. In in Table we list K, the required number of iterations, and K/ n. The results show that K is much smaller than n and appears to be proportional to n. This is the same speed as for SOR and we don t have to estimate any acceleration parameter Convergence Leonid Vitaliyevich Kantorovich, (left), Aleksey Nikolaevich Krylov, (center), Pafnuty Lvovich Chebyshev, (right) The Main Theorem Recall that the A-norm of a vector x R n is given by x A := x T Ax. The following theorem gives upper bounds for the A-norm of the error in both steepest descent and conjugate gradients. Theorem (Error bound for steepest descent and conjugate gradients) Suppose A is symmetric positive definite. For the A-norms of the errors in the steepest descent method (12.7) the following upper bounds hold x x k A x x 0 A ( ) k κ 1 < e 2 κ k,, k > 0, (12.18) κ + 1

284 272 Chapter 12. The Conjugate Gradient Method while for the conjugate gradient method we have ( ) k x x k A κ 1 2 < 2e 2 k κ, k 0. (12.19) x x 0 A κ + 1 Here κ = cond 2 (A) := λ max /λ min is the spectral condition number of A, and λ max and λ min are the largest and smallest eigenvalue of A, respectively. Theorem implies 1. Since κ 1 κ+1 < 1 the steepest descent method always converges for a symmetric positive definite matrix. The convergence can be slow when κ 1 κ+1 is close to one, and this happens even for a moderately ill-conditioned A. 2. The rate of convergence for the conjugate gradient method appears to be determined by the square root of the spectral condition number. This is much better than the estimate for the steepest descent method. Especially for problems with large condition numbers. 3. The proofs of the estimates in (12.18) and (12.19) are quite different. This is in spite of their similar appearance The number of iterations for the model problems Consider the test matrix T 2 := tridiag m (a, d, a) I m + I m tridiag m (a, d, a) R (m2 ) (m 2). The eigenvalues were given in (9.15) as λ j,k = 2d + 2a cos(jπh) + 2a cos(kπh), j, k = 1,..., m. (12.20) For the averaging problem given by d = 5/18, a = 1/9, the largest and smallest eigenvalue of T 2 are given by λ max = cos (πh) and λ min = cos (πh). Thus κ A = cos(πh) 5 4 cos(πh) 9, and the condition number is bounded independently of n. It follows from (12.19) that the number of iterations can be bounded independently of the size n of the problem, and this is in agreement with what we observed in Table For the Poisson problem we have by (11.24) the condition number κ P = λ max = cos2 (πh/2) λ min sin 2 (πh/2) and κ P = cos(πh/2) sin(πh/2) 2 πh 2 n. π

285 12.3. Convergence 273 Thus, (see also Exercise 7.36) we solve the discrete Poisson problem in O(n 3/2 ) arithmetic operations using the conjugate gradient method. This is the same as for the SOR method and for the fast method without the FFT. In comparison the Cholesky Algorithm requires O(n 2 ) arithmetic operations both for the averaging and the Poisson problem. Exercise (Program code for testing steepest descent) Write a function K=sdtest(m,a,d,tol,itmax) to test the Steepest descent method on the matrix T 2. Make the analogues of Table and Table For Table it is enough to test for say n = 100, 400, 1600, 2500, and tabulate K/n instead of K/ n in the last row. Conclude that the upper bound (12.18) is realistic. Compare also with the number of iterations for the J and GS method in Table Exercise (Using cg to solve normal equations) Consider solving the linear system A T Ax = A T b by using the conjugate gradient method. Here A R m,n, b R m and A T A is positive definite 22. Explain why only the following modifications in Algorithm are necessary 1. r=a (b-a*x); p=r; 2. a=rho/(t *t); 3. r=r-a*a *t; Note that the condition number of the normal equations is cond 2 (A) 2, the square of the condition number of A Krylov spaces and the best approximation property For the convergence analysis of the conjugate gradient method certain subspaces of of R n called Krylov spaces play a central role. In fact the iterates in the conjugate gradient method are best approximation of the solution from these subspaces using the A-norm to measure the error. The Krylov spaces are defined by W 0 = {0} and W k = span(r 0, Ar 0, A 2 r 0,..., A k 1 r 0 ), k = 1, 2, 3,. They are nested subspaces W 0 W 1 W 2 W n R n with dim(w k ) k for all k 0. Moreover, If v W k then Av W k This system known as the normal equations appears in linear least squares problems and was considered in this context in Chapter 8.

286 274 Chapter 12. The Conjugate Gradient Method Lemma (Krylov space) For the iterates in the conjugate gradient method we have x k x 0 W k, r k, p k W k+1, k = 0, 1,..., (12.21) and r T k w = p T k Aw = 0, w W k. (12.22) Proof. (12.21) clearly holds for k = 0 since p 0 = r 0. Suppose it holds for some k 0. Then r k+1 = r k α k Ap k W k+2 and by p k+1 = r k+1 + β k p k (12.11) W k+2 and x k+1 x 0 = x k x 0 + α k p k W k+1. Thus (12.21) follows by induction. Since any w W k is a linear combination of {r 0, r 1,..., r k 1 } and also {p 0, p 1,..., p k 1 }, (12.22) follows. Theorem (Best approximation property) Suppose Ax = b, where A R n n is symmetric positive definite and {x k } is generated by the conjugate gradient method (cf. (12.14)). Then x x k A = min w W k x x 0 w A. (12.23) Proof. Fix k, let w W k and u := x k x 0 w. By (12.21) u W k and then (12.22) implies that r T k u = 0. Since (x x k) T Au = r T k u we find x x 0 w 2 A = (x x k + u) T A(x x k + u) = (x x k )A(x x k ) + 2r T k u + u T Au = x x k 2 A + u 2 A x x k 2 A. Taking square roots we obtain x x k A x x 0 w A with equality for u = 0 and the result follows. If x 0 = 0 then (12.23) says that x k is the element in W k that is closest to the solution x in the A-norm. More generally, if x 0 0 then x x k = (x x 0 ) (x k x 0 ) and x k x 0 is the element in W k that is closest to x x 0 in the A-norm. This is the orthogonal projection of x x 0 into W k, see Figure Recall that to each polynomial p(t) := m j=0 a jt m there corresponds a matrix polynomial p(a) := a 0 I + a 1 A + + a m A m. Moreover, if (λ j, u j ) are eigenpairs of A then (p(λ j ), u j ) are eigenpairs of p(a) for j = 1,..., n. Lemma (Krylov space and polynomials) Suppose Ax = b where A R n n is symmetric positive definite with orthonormal

287 12.3. Convergence 275 x x 0 x x k W k x k x 0 Figure 12.2: The orthogonal projection of x x 0 into W k. eigenpairs (λ j, u j ), j = 1, 2,..., n, and let r 0 := b Ax 0 for some x 0 R n. To each w W k there corresponds a polynomial P (t) := k 1 j=0 a jt k 1 such that w = P (A)r 0. Moreover, if r 0 = n j=1 σ ju j then x x 0 w 2 A = n j=1 σ 2 j λ j Q(λ j ) 2, Q(t) := 1 tp (t). (12.24) Proof. If w W k then w = a 0 r 0 + a 1 Ar a k 1 A k 1 r 0 for some scalars a 0,..., a k 1. But then w = P (A)r 0. We find x x 0 P (A)r 0 = A 1 (r 0 AP (A))r 0 = A 1 Q(A)r 0 and A(x x 0 P (A)r 0 ) = Q(A)r 0. Therefore, x x 0 P (A)r 0 2 A = c T A 1 c where c = (I AP (A))r 0 = Q(A)r 0. (12.25) Using the eigenvector expansion for r 0 we obtain c = n σ j Q(λ j )u j, A 1 c = j=1 Now (12.24) follows by the orthonormality of the eigenvectors. n i=1 σ i Q(λ i ) λ i u i. (12.26) We will use the following theorem to estimate the rate of convergence. Theorem (cg and best polynomial approximation) Suppose [a, b] with 0 < a < b is an interval containing all the eigenvalues of A.

288 276 Chapter 12. The Conjugate Gradient Method Then in the conjugate gradient method x x k A x x 0 A min Q Π k Q(0)=1 max Q(x), (12.27) a x b where Π k denotes the class of univariate polynomials of degree k with real coefficients. Proof. By (12.24) with Q(t) = 1 (corresponding to P (A) = 0) we find x x 0 2 A = n σ 2 j j=1 λ j. Therefore, by the best approximation property Theorem and (12.24), for any w W k x x k 2 A x x 0 w 2 A max a x b Q(x) 2 n j=1 σ 2 j λ j = max a x b Q(x) 2 x x 0 2 A, where Q Π k and Q(0) = 1. Minimizing over such polynomials Q and taking square roots the result follows. that In the next section we use properties of the Chebyshev polynomials to show x x k A x x 0 A min Q Π k Q(0)=1 max Q(x) = λ min x λ max 2 κ 1 a k + a k, a :=, (12.28) κ + 1 where κ = λ max /λ min is the spectral condition number of A. Ignoring the second term in the denominator this implies the first inequality in (12.19). Consider the second inequality in (12.19). The inequality x 1 x + 1 < e 2/x for x > 1 (12.29) follows from the familiar series expansion of the exponential function. Indeed, with y = 1/x, using 2 k /k! = 2, k = 1, 2, and 2 k /k! < 2 for k > 2, we find and (12.29) follows. e 2/x = e 2y (2y) k = < y k = 1 + y k! 1 y = x + 1 x 1 k=0 k=1 Exercise (Krylov space and cg iterations) Consider the linear system Ax = b where A = 1 2 1, and b =

289 12.4. Proof of the Convergence Estimates 277 a) Determine the vectors defining the Krylov spaces for k 3 taking as initial approximation x = 0. Answer: [b, Ab, A 2 b] = b) Carry out three CG-iterations on Ax = b. Answer: 0 2 8/3 3 [x 0, x 1, x 2, x 3 ] = 0 0 4/3 2, c) Verify that [r 0, r 1, r 2, r 3 ] = [Ap 0, Ap 1, Ap 2 ] = [p 0, p 1, p 2, p 3 ] = dim(w k ) = k for k = 0, 1, 2, 3. x 3 is the exact solution of Ax = b /3 0, / / / /9 0 r 0,..., r k 1 is an orthogonal basis for W k for k = 1, 2, 3. p 0,..., p k 1 is an A-orthogonal basis for W k for k = 1, 2, 3. { r k ɛ} is monotonically decreasing. { x k xɛ} is monotonically decreasing Proof of the Convergence Estimates Chebyshev polynomials The proof of the estimate (12.28) for the error in the conjugate gradient method is based on an extremal property of the Chebyshev polynomials. Suppose a < b, c [a, b] and k N. Consider the set S k of all polynomials Q of degree k such that Q(c) = 1. For any continuous function f on [a, b] we define f = max a x b f(x).,,

290 278 Chapter 12. The Conjugate Gradient Method We want to find a polynomial Q S k such that Q = min Q S k Q. We will show that Q is uniquely given as a suitably shifted and normalized version of the Chebyshev polynomial. The Chebyshev polynomial T n of degree n can be defined recursively by T n+1 (t) = 2tT n (t) T n 1 (t), n 1, t R, starting with T 0 (t) = 1 and T 1 (t) = t. Thus T 2 (t) = 2t 2 1, T 3 (t) = 4t 3 3t etc. In general T n is a polynomial of degree n. There are some convenient closed form expressions for T n. Lemma (Closed forms of Chebyshev polynomials) For n 0 1. T n (t) = cos (narccos t) for t [ 1, 1], 2. T n (t) = 1 2[( t + t2 1 ) n + ( t + t2 1 ) n] for t 1. Proof. 1. With P n (t) = cos (n arccos t) we have P n (t) = cos nφ, where t = cos φ. Therefore, P n+1 (t) + P n 1 (t) = cos (n + 1)φ + cos (n 1)φ = 2 cos φ cos nφ = 2tP n (t), and it follows that P n satisfies the same recurrence relation as T n. Since P 0 = T 0 and P 1 = T 1 we have P n = T n for all n Fix t with t 1 and let x n := T n (t) for n 0. The recurrence relation for the Chebyshev polynomials can then be written x n+1 2tx n + x n 1 = 0 for n 1, with x 0 = 1, x 1 = t. (12.30) To solve this difference equation we insert x n = z n into (12.30) and obtain z n+1 2tz n + z n 1 = 0 or z 2 2tz + 1 = 0. The roots of this equation are z 1 = t + t 2 1, z 2 = t t 2 1 = ( t + t 2 1 ) 1. Now z n 1, z n 2 and more generally c 1 z n 1 + c 2 z n 2 are solutions of (12.30) for any constants c 1 and c 2. We find these constants from the initial conditions x 0 = c 1 +c 2 = 1 and x 1 = c 1 z 1 + c 2 z 2 = t. Since z 1 + z 2 = 2t the solution is c 1 = c 2 = 1 2. We show that the unique solution to our minimization problem is Clearly Q S k. Q (x) = T k(u(x)) T k (u(c)), b + a 2x u(x) =. (12.31) b a

291 12.4. Proof of the Convergence Estimates 279 Theorem (A minimal norm problem) Suppose a < b, c [a, b] and k N. If Q S k and Q Q then Q > Q. Proof. Recall that a nonzero polynomial p of degree k can have at most k zeros. If p(z) = p (z) = 0, we say that p has a double zero at z. Counting such a zero as two zeros it is still true that a nonzero polynomial of degree k has at most k zeros. Q takes on its maximum 1/ T k (u(c)) at the k + 1 points µ 0,..., µ k in [a, b] such that u(µ i ) = cos(iπ/k) for i = 0, 1,..., k. Suppose Q S k and that Q Q. We have to show that Q Q. Let f Q Q. We show that f has at least k zeros in [a, b]. Since f is a polynomial of degree k and f(c) = 0, this means that f 0 or equivalently Q Q. Consider I j = [µ j 1, µ j ] for a fixed j. Let σ j = f(µ j 1 )f(µ j ). We have σ j 0. For if say Q (µ j ) > 0 then so that f(µ j ) 0. Moreover, Q(µ j ) Q Q = Q (µ j ) Q(µ j 1 ) Q Q = Q (µ j 1 ). Thus f(µ j 1 ) 0 and it follows that σ j 0. Similarly, σ j 0 if Q (µ j ) < 0. If σ j < 0, f must have a zero in I j since it is continuous. Suppose σ j = 0. Then f(µ j 1 ) = 0 or f(µ j ) = 0. If f(µ j ) = 0 then Q(µ j ) = Q (µ j ). But then µ j is a maximum or minimum both for Q and Q. If µ j (a, b) then Q (µ j ) = Q (µ j ) = 0. Thus f(µ j ) = f (µ j ) = 0, and f has a double zero at µ j. We can count this as one zero for I j and one for I j+1. If µ j = b, we still have a zero in I j. Similarly, if f(µ j 1 ) = 0, a double zero of f at µ j 1 appears if µ j 1 (a, b). We count this as one zero for I j 1 and one for I j. In this way we associate one zero of f for each of the k intervals I j, j = 1, 2,..., k. We conclude that f has at least k zeros in [a, b]. Exercise (Another explicit formula for the Chebyshev polynomial) Show that T n (t) = cosh(narccosh t) for t 1, where arccosh is the inverse function of cosh x := (e x + e x )/2. Theorem with a, and b, the smallest and largest eigenvalue of A, and c = 0 implies that the minimizing polynomial in (12.28) is given by ( ) ( ) b + a 2x b + a Q (x) = T k /T k. (12.32) b a b a
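The closed forms for the Chebyshev polynomials in the lemma above are easy to check against the three-term recurrence; a small sketch, with the degree and evaluation point chosen arbitrarily:

  % Chebyshev polynomial T_n(t): recurrence versus closed form for |t| >= 1.
  n = 7;  t = 1.3;
  T = [1, t];                               % T_0 and T_1
  for j = 2:n
      T(j+1) = 2*t*T(j) - T(j-1);           % T_j = 2 t T_{j-1} - T_{j-2}
  end
  closed = ((t + sqrt(t^2-1))^n + (t + sqrt(t^2-1))^(-n))/2;
  fprintf('recurrence %.12f   closed form %.12f\n', T(n+1), closed);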

292 280 Chapter 12. The Conjugate Gradient Method Figure 12.3: This is an illustration of the proof of Theorem for k = 3. f Q Q has a double zero at µ 1 and one zero between µ 2 and µ 3. By Lemma max a x b ( ) b + a 2x T k = max b a Moreover with t = (b + a)/(b a) we have t + κ + 1 t 2 1 =, κ 1 1 t 1 Tk (t) = 1. (12.33) κ = b/a. Thus again by Lemma we find ( ) ( ) b + a κ + 1 T k = T k = 1 b a κ 1 2 [ ( κ ) k ( ) ] k + 1 κ 1 + κ 1 κ + 1 (12.34) and (12.28) follows Convergence proof for steepest descent For the proof of (12.18) the following inequality will be used. Theorem (Kantorovich inequality) For any symmetric positive definite matrix A R n n 1 (yt Ay)(y T A 1 y) (M + m)2 (y T y) 2 4Mm y 0, y R n, (12.35) where M := λ max and m := λ min are the largest and smallest eigenvalue of A, respectively.

293 12.4. Proof of the Convergence Estimates 281 Proof. If (λ j, u j ) are orthonormal eigenpairs of A then (λ 1 j, u j ) are eigenpairs for A 1, j = 1,..., n. Let y = n j=1 c ju j be the corresponding eigenvector expansion of a vector y R n. By orthonormality, (cf. (5.15)) where a := yt Ay n y T y = t i λ i, t i = c 2 i n j=1 c2 j i=1 0, b := yt A 1 y y T y i = 1,..., n and = n i=1 t i λ i, (12.36) n t i = 1. (12.37) Thus a and b are convex combinations of the eigenvalues of A and A 1, respectively. Let c be a positive constant to be chosen later. By the geometric/arithmetic mean inequality (7.34) and (12.36) ab = (ac)(b/c) (ac + b/c)/2 = 1 2 i=1 n ( t i λi c + 1/(λ i c) ) = 1 2 where f : [mc, Mc] R is given by f(x) := x + 1/x. By (12.37) i=1 1 ab max 2 f(x). mc x Mc n t i f(λ i c), Since f C 2 and f is positive it follows from Lemma 7.39 that f is a convex function. But a convex function takes it maximum at one of the endpoints of the range (cf. Exercise 12.28) and we obtain i=1 1 ab max{f(mc), f(mc)}. (12.38) 2 Choosing c := 1/ mm we find f(mc) = f(mc) = (12.38) we obtain (y T Ay)(y T A 1 y) (M + m)2 (y T y) 2 = ab 4Mm, M m + m M = M+m mm. By the upper bound in (12.35). inequality as follows For the lower bound we use the Cauchy-Schwarz 1 = ( ( n ) n ) 2 2 t i = (t i λ i ) 1/2 (t i /λ i ) 1/2 ( n )( n ) t i λ i t i /λ i = ab. i=1 i=1 i=1 i=1
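The Kantorovich inequality (12.35) is easy to test numerically; here is a small MATLAB sketch (ours) using a random symmetric positive definite matrix.

% Numerical check of the Kantorovich inequality (12.35); a sketch.
n = 50;
R = randn(n); A = R'*R + n*eye(n);            % symmetric positive definite
y = randn(n, 1);
lhs = (y'*A*y)*(y'*(A\y))/(y'*y)^2;           % (y^T A y)(y^T A^{-1} y)/(y^T y)^2
ev = eig(A); M = max(ev); m = min(ev);
fprintf('1 <= %.4f <= %.4f\n', lhs, (M + m)^2/(4*M*m))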

294 282 Chapter 12. The Conjugate Gradient Method Exercise (Maximum of a convex function) Show that if f : [a, b] R is convex then max a x b f(x) max{f(a), f(b)}. Proof of (12.18) Let ɛ j := x x j, j = 0, 1,..., where Ax = b. It is enough to show that ( for then ɛ k A We find ɛ k+1 2 A ɛ k 2 A κ 1 κ+1 ) ɛ k 1 ( ) 2 κ 1, k = 0, 1, 2,..., (12.39) κ + 1 ( κ 1 κ+1 ) k ɛ0. It follows from (12.7) that ɛ k+1 = ɛ k α k r k, α k := rt k r k r T k Ar. k ɛ k 2 A = ɛ T k Aɛ k = r T k A 1 r k, ɛ k+1 2 A = (ɛ k α k r k ) T A(ɛ k α k r k ) = ɛ T k Aɛ k 2α k r T k Aɛ k + αkr 2 T k Ar k = ɛ k 2 A (rt k r k) 2 r T k Ar. k Combining these and using Kantorovich inequality ɛ k+1 2 A ɛ k 2 A = 1 and (12.39) is proved. (r T k r k) 2 (r T k Ar k)(r T k A 1 r k ) 1 4λ minλ max (λ min + λ max ) 2 = Monotonicity of the error ( ) 2 κ 1 κ + 1 The error analysis for the conjugate gradient method is based on the A-norm. We end this chapter by considering the Euclidian norm of the error, and show that it is strictly decreasing. Theorem (The error in cg is strictly decreasing) Let in the conjugate gradient method m be the smallest integer such that r m+1 = 0. For k m we have ɛ k+1 2 < ɛ k 2. More precisely, where ɛ j = x x j and Ax = b.. ɛ k 2 2 ɛ k = p k 2 2 p k 2 ( ɛ k 2 A + ɛ k+1 2 A) A

295 12.5. Preconditioning 283 Proof. For j m ɛ j = x m+1 x j = x m x j + α m p m = x m 1 x j + α m 1 p m 1 + α m p m =... so that ɛ j = By (12.40) and A-orthogonality By (12.16) and Lemma m i=j α i p i, α i = rt i r i p T i Ap. (12.40) i m m ɛ j 2 A = ɛ j Aɛ j = αi 2 p T i Ap i = i=j i=j (r T i r i) 2 p T i Ap. (12.41) i p T i p k = (r i + β i 1 p i 1 ) T p k = β i 1 p T i 1p k = = β i 1 β k (p T k p k ), and since β i 1 β k = (r T i r i)/(r T k r k) we obtain Since we obtain p T i p k = rt i r i r T k r p T k p k, i k. (12.42) k ɛ k 2 2 = ɛ k+1 + x k+1 x k 2 2 = ɛ k+1 + α k p k 2 2, ɛ k 2 2 ɛ k =α k ( 2p T k ɛ k+1 + α k p T k p k ) (12.40) = α k ( 2 m (12.42) = ( m + i=k and the Theorem is proved. i=k+1 m i=k+1 α i p T i p k + α k p T ) ( m k p k = + i=k ) r T k r k r T i r i r T i r i p T k Ap k p T i Ap i r T k r p T k p k k (12.41) = p k 2 ( 2 p k 2 ɛk 2 A + ɛ k+1 2 ) A. A m i=k+1 ) αk α i p T i p k 12.5 Preconditioning For problems Ax = b of size n, where both n and cond 2 (A) are large, it is often possible to improve the performance of the conjugate gradient method by using a

296 284 Chapter 12. The Conjugate Gradient Method technique known as preconditioning. Instead of Ax = b we consider an equivalent system BAx = Bb, where B is nonsingular and cond 2 (BA) is smaller than cond 2 (A). The matrix B will in many cases be the inverse of another matrix, B = M 1. We cannot use CG on BAx = Bb directly since BA in general is not symmetric even if both A and B are. But if B (and hence M) is symmetric positive definite then we can apply CG to a symmetrized system and then transform the recurrence formulas to an iterative method for the original system Ax = b. This iterative method is known as the preconditioned conjugate gradient method. We shall see that the convergence properties of this method is determined by the eigenvalues of BA. Suppose B is symmetric positive definite. By Theorem 3.18 there is a nonsingular matrix C such that B = C T C. (C is only needed for the derivation and will not appear in the final formulas). Now BAx = Bb C T (CAC T )C T x = C T Cb (CAC T )y = Cb, & x = C T y. We have 3 linear systems Ax = b (12.43) BAx = Bb (12.44) (CAC T )y = Cb, & x = C T y. (12.45) Note that (12.43) and (12.45) are symmetric positive definite linear systems. In addition to being symmetric positive definite the matrix CAC T is similar to BA. Indeed, C T (CAC T )C T = BA. Thus CAC T and BA have the same eigenvalues. Therefore, if we apply the conjugate gradient method to (12.45) then the rate of convergence will be determined by the eigenvalues of BA. We apply the conjugate gradient method to (CAC T )y = Cb. Denoting the search direction by q k and the residual by z k = Cb CAC T y k we obtain the following from (12.14), (12.15), and (12.16). With y k+1 = y k + α k q k, α k = z T k z k /q T k (CAC T )q k, z k+1 = z k α k (CAC T )q k, q k+1 = z k+1 + β k q k, β k = z T k+1z k+1 /z T k z k. x k := C T y k, p k := C T q k, s k := C T z k, r k := C 1 z k (12.46)

this can be transformed into

x_{k+1} = x_k + alpha_k p_k,   alpha_k = s_k^T r_k / (p_k^T A p_k),   (12.47)
r_{k+1} = r_k - alpha_k A p_k,   (12.48)
s_{k+1} = s_k - alpha_k BA p_k,   (12.49)
p_{k+1} = s_{k+1} + beta_k p_k,   beta_k = s_{k+1}^T r_{k+1} / (s_k^T r_k).   (12.50)

Here x_k will be an approximation to the solution x of Ax = b, r_k = b - A x_k is the residual in the original system, and s_k = Bb - BA x_k is the residual in the preconditioned system. This follows since by (12.46) r_k = C^{-1} z_k = b - C^{-1} C A C^T y_k = b - A x_k and s_k = C^T z_k = C^T C r_k = B r_k. We start with r_0 = b - A x_0, p_0 = s_0 = B r_0 and obtain the following preconditioned conjugate gradient algorithm for determining approximations x_k to the solution of a symmetric positive definite system Ax = b.

Algorithm (Preconditioned conjugate gradient) The symmetric positive definite linear system Ax = b is solved by the preconditioned conjugate gradient method on the system BAx = Bb, where B is symmetric positive definite. x is a starting vector for the iteration. The iteration is stopped when ||r_k||_2 / ||b||_2 <= tol or k > itmax. K is the number of iterations used.

function [x, K] = pcg(A, B, b, x, tol, itmax)
% Preconditioned conjugate gradient for Ax = b with preconditioner B.
r = b - A*x; p = B*r; s = p; rho = s'*r; rho0 = b'*b;
for k = 0:itmax
    if sqrt(rho/rho0) <= tol
        K = k; return
    end
    t = A*p; a = rho/(p'*t);
    x = x + a*p; r = r - a*t;          % update iterate and residual, cf. (12.47), (12.48)
    w = B*t; s = s - a*w;              % preconditioned residual, cf. (12.49)
    rhos = rho; rho = s'*r;
    p = s + (rho/rhos)*p;              % new search direction, cf. (12.50)
end
K = itmax + 1;

Apart from the extra vector s and the calculation of rho = s^T r, this algorithm is quite similar to the conjugate gradient algorithm without preconditioning. The main additional work is contained in w = B*t. We will discuss this further in connection with an example. There the inverse of B is known and we have to solve a linear system to find w. We have the following convergence result for this algorithm.

298 286 Chapter 12. The Conjugate Gradient Method Theorem (Error bound preconditioned cg) Suppose we apply a symmetric positive definite preconditioner B to the symmetric positive definite system Ax = b. Then the quantities x k computed in Algorithm satisfy the following bound: ( ) k x x k A κ 1 2, for k 0, x x 0 A κ + 1 where κ = λ max /λ min is the ratio of the largest and smallest eigenvalue of BA. Proof. Since Algorithm is equivalent to solving (12.45) by the conjugate gradient method Theorem implies that y y k CAC T y y 0 CAC T ( ) k κ 1 2, for k 0, κ + 1 where y k is the conjugate gradient approximation to the solution y of (12.45) and κ is the ratio of the largest and smallest eigenvalue of CAC T. Since BA and CAC T are similar this is the same as the κ in the theorem. By (12.46) we have y y k 2 CAC T = (y y k) T (CAC T )(y y k ) and the proof is complete. = (C T (y y k )) T A(C T (y y k )) = x x k 2 A We conclude that B should satisfy the following requirements for a problem of size n: 1. The eigenvalues of BA should be located in a narrow interval. Preferably one should be able to bound the length of the interval independently of n. 2. The evaluation of Bx for a given vector x should not be expensive in storage and arithmetic operations, ideally O(n) for both. In this book we only consider an example of a preconditioner. For a comprehensive treatment of preconditioners see [1] Preconditioning Example A variable coefficient problem Consider the problem ( x c(x, y) u x ) y ( ) c(x, y) u y = f(x, y) (x, y) Ω = (0, 1) 2 u(x, y) = 0 (x, y) Ω. (12.51)

299 12.6. Preconditioning Example 287 Here Ω is the open unit square while Ω is the boundary of Ω. The functions f and c are given and we seek a function u = u(x, y) such that (12.51) holds. We assume that c and f are defined and continuous on Ω and that c(x, y) > 0 for all (x, y) Ω. The problem (12.51) reduces to the Poisson problem in the special case where c(x, y) = 1 for (x, y) Ω. As for the Poisson problem we solve (12.51) numerically on a grid of points {(jh, kh) : j, k = 0, 1,..., m + 1}, where h = 1/(m + 1), and where m is a positive integer. Let (x, y) be one of the interior grid points. For univariate functions f, g we use the central difference approximations d (f(t) ddt ) dt g(t) (f(t + h2 ) ddt g(t + h/2) f(t h2 ) ddt g(t h2 ) ) /h ( f(t + h 2 )( g(t + h) g(t) ) f(t h 2 )( g(t) g(t h) )) /h 2 to obtain and x ( c u x )j,k c j+ 1 2,k (v j+1,k v j,k ) c j 1 2,k (v j,k v j 1,k ) h 2 ( u c y y )j,k c j,k+ 1 (v j,k+1 v 2 j,k ) c j,k 1 (v j,k v 2 j,k 1 ) h 2, where c p,q = c(ph, qh) and v j,k u(jh, kh). With these approximations the discrete analog of (12.51) turns out to be (P h v) j,k = h 2 f j,k j, k = 1,..., m v j,k = 0 j = 0, m + 1 all k or k = 0, m + 1 all j, (12.52) where (P h v) j,k = (c j,k 1 + c 2 j 1 2,k + c j+ 1 2,k + c j,k+ 1 )v j,k 2 c j,k 1 v j,k 1 c 2 j 1 2,k v j 1,k c j+ 1 2,k v j+1,k c j,k+ 1 v j,k+1 2 (12.53) and f j,k = f(jh, kh). As before we let V = (v j,k ) R m m and F = (f j,k ) R m m. The corresponding linear system can be written Ax = b where x = vec(v ), b = h 2 vec(f ), and the n-by-n coefficient matrix A is given by a i,i = c ji,k i 1 + c 2 j i 1 + c 2,ki j i+ 1 + c 2,ki j i,k i+ 1, 2 i = 1, 2,..., n a i+1,i = a i,i+1 = c ji+ 1 2,ki, i mod m 0 a i+m,i = a i,i+m = c ji,k i+ 1, 2 i = 1, 2,..., n m a i,j = 0 otherwise, (12.54)

300 288 Chapter 12. The Conjugate Gradient Method where (j i, k i ) with 1 j i, k i m is determined uniquely from the equation i = j i + (k i 1)m for i = 1,..., n. If c(x, y) = 1 for all (x, y) Ω we recover the Poisson matrix. In general we cannot write A as a Kronecker sum. But we can show that A is symmetric and it is positive definite as long as the function c is positive on Ω. Theorem (Positive definite matrix) If c(x, y) > 0 for (x, y) Ω then the matrix A given by (12.54) is symmetric positive definite. Proof. To each x R n there corresponds a matrix V R m m such that x = vec(v ). We claim that x T Ax = m m j=1 k=0 ( ) 2 m c j,k+ 1 vj,k+1 v 2 j,k + m k=1 j=0 c j+ 1 2,k ( vj+1,k v j,k ) 2, (12.55) where v 0,k = v m+1,k = v j,0 = v j,m+1 = 0 for j, k = 0, 1,..., m + 1. Since c j+ 1 2 and c,k j,k+ 1 correspond to values of c in Ω for the values of j, k in the sums it follows 2 that they are positive and from (12.55) we see that x T Ax 0 for all x R n. Moreover if x T Ax = 0 then all quadratic factors are zero and v j,k+1 = v j,k for k = 0, 1,..., m and j = 1,..., m. Now v j,0 = v j,m+1 = 0 implies that V = 0 and hence x = 0. Thus A is symmetric positive definite. It remains to prove (12.55). From the connection between (12.53) and (12.54) we have x T Ax = = m j=1 k=1 m j=1 k=1 m (P h v) j,k v j,k m ( c j,k 1 2 v2 j,k + c j 1 2,k v 2 j,k + c j+ 1 2,k v 2 j,k + c j,k+ 1 2 v2 j,k c j,k 1 v j,k 1v 2 j,k c j,k+ 1 v j,kv 2 j,k+1 ) c j 1 2,k v j 1,k v j,k c j+ 1 2,k v j,k v j+1,k. Using the homogenous boundary conditions we obtain

301 12.6. Preconditioning Example 289 It follows that m j=1 k=1 m m m j=1 k=1 m c j,k 1 2 v2 j,k = c j,k 1 2 v j,k 1v j,k = m m j=1 k=1 m j=1 k=1 x T Ax = and (12.55) follows. c j 1 2,k v 2 j,k = c j 1 2,k v j 1,k v j,k = + m j=1 k=0 m m m k=1 j=0 m m j=1 k=0 m m j=1 k=0 m m k=1 j=0 m m k=1 j=0 c j,k+ 1 2 v2 j,k+1, c j,k+ 1 2 v j,k+1v j,k, c j+ 1 2,k v 2 j+1,k, c j+ 1 2,k v j+1,k v j,k. ( c j,k+ 1 2 v 2 j,k + vj,k+1 2 ) 2v j,k v j,k+1 c j+ 1 2,k ( v 2 j,k + v 2 j+1,k 2v j,k v j+1,k ) Applying preconditioning Consider solving Ax = b, where A is given by (12.54) and b R n. Since A is positive definite it is nonsingular and the system has a unique solution x R n. Moreover we can use either Cholesky factorization or the block tridiagonal solver to find x. Since the bandwidth of A is m = n both of these methods require O(n 2 ) arithmetic operations for large n. If we choose c(x, y) 1 in (12.51), we get the Poisson problem. With this in mind, we may think of the coefficient matrix A p arising from the discretization of the Poisson problem as an approximation to the matrix (12.54). This suggests using B = A 1 p, the inverse of the discrete Poisson matrix as a preconditioner for the system (12.52). Consider Algorithm With this preconditioner the calculation w = Bt takes the form A p w k = t k. In Section 10.2 we developed a Simple fast Poisson Solver, Cf. Algorithm This method can be utilized to solve A p w = t. Consider the specific problem where c(x, y) = e x+y and f(x, y) = 1.
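For a concrete experiment one needs the matrix (12.54) in sparse form. The following MATLAB sketch (ours; the variable names are our own) assembles A and b directly from the stencil (12.53) for c(x, y) = e^{x+y} and f = 1.

% Assemble the coefficient matrix (12.54) and right hand side for (12.52); a sketch.
c = @(x, y) exp(x + y);  f = @(x, y) 1;
m = 50;  h = 1/(m + 1);  n = m^2;
A = sparse(n, n);  b = zeros(n, 1);
for k = 1:m
    for j = 1:m
        i = j + (k - 1)*m;                       % unknown at grid point (jh, kh)
        cw = c((j - 0.5)*h, k*h);  ce = c((j + 0.5)*h, k*h);
        cs = c(j*h, (k - 0.5)*h);  cn = c(j*h, (k + 0.5)*h);
        A(i, i) = cw + ce + cs + cn;
        if j > 1, A(i, i - 1) = -cw; end
        if j < m, A(i, i + 1) = -ce; end
        if k > 1, A(i, i - m) = -cs; end
        if k < m, A(i, i + m) = -cn; end
        b(i) = h^2*f(j*h, k*h);
    end
end

Filling a sparse matrix entry by entry is slow for large m; for timing experiments one would collect the entries and call sparse once, but the loop above keeps the correspondence with (12.54) explicit.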

302 290 Chapter 12. The Conjugate Gradient Method n K K/ n K pre Table 12.33: The number of iterations K (no preconditioning) and K pre (with preconditioning) for the problem (12.51) using the discrete Poisson problem as a preconditioner. We have used Algorithm (conjugate gradient without preconditioning), and Algorithm (conjugate gradient with preconditioning) to solve the problem (12.51). We used x 0 = 0 and ɛ = The results are shown in Table Without preconditioning the number of iterations still seems to be more or less proportional to n although the convergence is slower than for the constant coefficient problem. Using preconditioning speeds up the convergence considerably. The number of iterations appears to be bounded independently of n. Using a preconditioner increases the work in each iteration. For the present example the number of arithmetic operations in each iteration changes from O(n) without preconditioning to O(n 3/2 ) or O(n log 2 n) with preconditioning. This is not a large increase and both the number of iterations and the computing time is reduced significantly. Let us finally show that the number κ = λ max /λ min which determines the rate of convergence for the preconditioned conjugate gradient method applied to (12.51) can be bounded independently of n. Theorem (Eigevalues of preconditioned matrix) Suppose 0 < c 0 c(x, y) c 1 for all (x, y) [0, 1] 2. For the eigenvalues of the matrix BA = A 1 p A just described we have κ = λ max λ min c 1 c 0. Proof. Suppose A 1 p Ax = λx for some x R n \ {0}. Then Ax = λa p x. Multiplying this by x T and solving for λ we find λ = xt Ax x T A p x. We computed x T Ax in (12.55) and we obtain x T A p x by setting all the c s there

303 12.7. Review Questions 291 equal to one x T A p x = m m ( ) 2 m m ( ) 2. vi,j+1 v i,j + vi+1,j v i,j i=1 j=0 j=1 i=0 Thus x T A p x > 0 and bounding all the c s in (12.55) from below by c 0 and above by c 1 we find c 0 (x T A p x) x T Ax c 1 (x T A p x) which implies that c 0 λ c 1 for all eigenvalues λ of BA = A 1 p A. Using c(x, y) = e x+y as above, we find c 0 = e 2 and c 1 = 1. Thus κ e 2 7.4, a quite acceptable matrix condition number which explains the convergence results from our numerical experiment Review Questions Does the steepest descent and conjugate gradient method always converge? What kind of orthogonalities occur in the conjugate gradient method? What is a Krylov space? What is a convex function? How do SOR and conjugate gradient compare?


Part V  Eigenvalues and Eigenvectors


307 Chapter 13 Numerical Eigenvalue Problems 13.1 Eigenpairs Eigenpairs have applications in quantum mechanics, differential equations, elasticity in mechanics,... Consider the eigenpair problem for some classes of matrices A C n n. Diagonal Matrices. The eigenpairs are easily determined. Since Ae i = a ii e i the eigenpairs are (λ i, e i ), where λ i = a ii for i = 1,..., n. Moreover, the eigenvectors of A are linearly independent. Triangular Matrices Suppose A is upper or lower triangular. Consider finding the eigenvalues Since det(a λi) = n i=1 (a ii λ) the eigenvalues are λ i = a ii for i = 1,..., n, the diagonal elements of A. To determine the eigenvectors can be challenging. Block Diagonal Matrices Suppose A = diag(a 1, A 2,..., A r ), A i C mi mi. Here the eigenpair problem reduces to r smaller problems. Let A i X i = X i D i define the eigenpairs of A i for i = 1,..., r and let X := diag(x 1,..., X r ), D := diag(d 1,..., D r ). Then the eigenpairs for A are given by AD = diag(a 1,..., A r ) diag(x 1,..., X r ) = diag(a 1 X 1,..., A r X r ) = diag(x 1 D 1,..., X r D r ) = XD. Block Triangular matrices Matrices Let A 11, A 22,..., A rr be the diagonal 295

blocks of A. By Property 8 of determinants

det(A - lambda I) = prod_{i=1}^r det(A_{ii} - lambda I)

and the eigenvalues are found from the eigenvalues of the diagonal blocks.

In this and the next chapter we consider some numerical methods for finding one or more of the eigenvalues and eigenvectors of a matrix A in C^{n x n}. Maybe the first method which comes to mind is to form the characteristic polynomial pi_A of A, and then use a polynomial root finder, like Newton's method, to determine one or several of the eigenvalues. It turns out that this is not suitable as an all purpose method. One reason is that a small change in one of the coefficients of pi_A(lambda) can lead to a large change in the roots of the polynomial. For example, if pi_A(lambda) = lambda^16 and q(lambda) = lambda^16 - 10^{-16}, then the roots of pi_A are all equal to zero, while the roots of q are lambda_j = 10^{-1} e^{2 pi i j/16}, j = 1,..., 16. The roots of q have absolute value 0.1, and a perturbation in one of the polynomial coefficients of magnitude 10^{-16} has led to an error in the roots of approximately 0.1. The situation can be somewhat remedied by representing the polynomials using a different basis. In this text we will only consider methods which work directly with the matrix. But before that, in Section 13.3 we consider how much the eigenvalues change when the elements in the matrix are perturbed. We start with a simple but useful result for locating the eigenvalues.

13.2 Gerschgorin's Theorem

The following theorem is useful for locating eigenvalues of an arbitrary square matrix.

Theorem 13.1 (Gerschgorin's circle theorem) Suppose A in C^{n x n}. Define for i = 1, 2,..., n

R_i = {z in C : |z - a_{ii}| <= r_i},   r_i := sum_{j=1, j != i}^n |a_{ij}|,
C_j = {z in C : |z - a_{jj}| <= c_j},   c_j := sum_{i=1, i != j}^n |a_{ij}|.

Then any eigenvalue of A lies in R ∩ C, where R = R_1 ∪ R_2 ∪ ... ∪ R_n and C = C_1 ∪ C_2 ∪ ... ∪ C_n.

309 13.2. Gerschgorin s Theorem 297 Proof. Suppose (λ, x) is an eigenpair for A. We claim that λ R i, where i is such that x i = x. Indeed, Ax = λx implies that j a ijx j = λx i or (λ a ii )x i = j i a ijx j. Dividing by x i and taking absolute values we find λ a ii = j i a ij x j /x i j i a ij x j /x i r i since x j /x i 1 for all j. Thus λ R i. Since λ is also an eigenvalue of A T, it must be in one of the row disks of A T. But these are the column disks C j of A. Hence λ C j for some j. The set R i is a subset of the complex plane consisting of all points inside a circle with center at a ii and radius r i, c.f. Figure R i is called a (Gerschgorin) row disk. Imaginary axis a i,i r i Real axis Figure 13.1: The Gerschgorin disk R i. An eigenvalue λ lies in the union of the row disks R 1,..., R n and also in the union of the column disks C 1,..., C n. If A is Hermitian then R i = C i for i = 1, 2,..., n. Moreover, in this case the eigenvalues of A are real, and the Gerschgorin disks can be taken to be intervals on the real line. Example 13.2 (Gerschgorin) Let T = tridiag( 1, 2, 1) R m m be the second derivative matrix. Since A is Hermitian we have R i = C i for all i and the eigenvalues are real. We find R 1 = R m = {z R : z 2 1}, and R i = {z R : z 2 2}, i = 2, 3,..., m 1.

310 298 Chapter 13. Numerical Eigenvalue Problems We conclude that λ [0, 4] for any eigenvalue λ of T. To check this, we recall that by Lemma 1.31 the eigenvalues of T are given by [ ] 2 jπ λ j = 4 sin, j = 1, 2,..., m. 2(m + 1) [ When m is large the smallest eigenvalue 4 sin mπ 2(m+1)] 2 is very close to 4. Thus Gerschgorin s theo- [ the largest eigenvalue 4 sin rem gives a remarkably good estimate for large m. π 2(m+1)] 2 is very close to zero and Sometimes some of the Gerschgorin disks are distinct and we have Corollary 13.3 (Disjoint Gerschgorin disks) If p of the Gerschgorin row disks are disjoint from the others, the union of these disks contains precisely p eigenvalues. The same result holds for the column disks. Proof. Consider a family of matrices A(t) := D + t(a D), D := diag(a 11,..., a nn ), t [0, 1]. We have A(0) = D and A(1) = A. As a function of t, every eigenvalue of A(t) is a continuous function of t. This follows from Theorem 13.7, see Exercise The row disks R i (t) of A(t) have radius proportional to t, indeed R i (t) = {z C : z a ii tr i }, r i := n a ij. Clearly 0 t 1 < t 2 1 implies R i (t 1 ) R i (t 2 ) and R i (1) is a row disk of A for all i. Suppose p k=1 R i k (1) are disjoint from the other disks of A and set R p (t) := p k=1 R i k (t) for t [0, 1]. Now R p (0) contains only the p eigenvalues a i1,i 1,..., a ip,i p of A(0) = D. As t increases from zero to one the set R p (t) is disjoint from the other row disks of A and by the continuity of the eigenvalues cannot loose or gain eigenvalues. It follows that R p (1) must contain p eigenvalues of A. Example 13.4 Consider the matrix A = [ 1 ɛ1 ɛ 2 ɛ 3 2 ɛ 4 ɛ 5 ɛ 6 3 j=1 j i ], where ɛ i all i. By Corollary 13.3 the eigenvalues λ 1, λ 2, λ 3 of A are distinct and satisfy λ j j for j = 1, 2, 3.
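A Gerschgorin estimate like the one in Example 13.4 is easy to check numerically. In the MATLAB sketch below (ours) every off-diagonal entry is set to 10^{-2}, a value chosen only for illustration.

% Gerschgorin row disks: centres a_ii and radii r_i = sum_{j~=i} |a_ij|; a sketch.
e = 1e-2;
A = [1 e e; e 2 e; e e 3];
centres = diag(A);
radii = sum(abs(A), 2) - abs(centres);
disp([centres radii])          % three disjoint disks centred at 1, 2, 3
disp(eig(A))                   % one eigenvalue in each disk, cf. Corollary 13.3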

311 13.3. Perturbation of Eigenvalues 299 Semyon Aranovich Gershgorin, (left), Jacques Salomon Hadamard, (right). Exercise 13.5 (Nonsingularity using Gerschgorin) Consider the matrix A = Show using Gerschgorin s theorem that A is nonsingular. Exercise 13.6 (Gerschgorin, strictly diagonally dominant matrix) Show using Gerschgorin,s theorem that a strictly diagonally dominant matrix A ( a i,i > j i a i,j for all i) is nonsingular Perturbation of Eigenvalues In this section we study the following problem. Given matrices A, E C n n, where we think of E as a perturbation of A. By how much do the eigenvalues of A and A + E differ? Not surprisingly this problem is more complicated than the corresponding problem for linear systems. We illustrate this by considering two examples. Suppose A 0 := 0 is the zero matrix. If λ σ(a 0 + E) = σ(e), then λ E by Theorem 11.28, and any zero eigenvalue of A 0 is perturbed by at most E. On the other hand consider for ɛ > 0 the matrices A 1 :=....., E :=..... = ɛe n e T ɛ

312 300 Chapter 13. Numerical Eigenvalue Problems The characteristic polynomial of A 1 + E is π(λ) := ( 1) n (λ n ɛ), and the zero eigenvalues of A 1 are perturbed by the amount λ = E 1/n. Thus, for n = 16, a perturbation of say ɛ = gives a change in eigenvalue of 0.1. The following theorem shows that a dependence E 1/n is the worst that can happen. Theorem 13.7 (Elsner s theorem(1985)) Suppose A, E C n n. To every µ σ(a + E) there is a λ σ(a) such that µ λ K E 1/n 2, K = ( A 2 + A + E 2 ) 1 1/n. (13.1) Proof. Suppose A has eigenvalues λ 1,..., λ n and let λ 1 be one which is closest to µ. Let u 1 with u 1 2 = 1 be an eigenvector corresponding to µ, and extend u 1 to an orthonormal basis {u 1,..., u n } of C n. Note that (µi A)u 1 2 = (A + E)u 1 Au 1 2 = Eu 1 2 E 2, n n (µi A)u j 2 ( µ + Au j 2 ) ( ) n 1. (A + E) 2 + A 2 j=2 j=2 Using this and Hadamard s inequality (4.15) we find n µ λ 1 n µ λ j = det(µi A) = det ( (µi A)[u 1,..., u n ] ) j=1 (µi A)u 1 2 n j=2 (µi A)u j 2 E 2 ( (A + E) 2 + A 2 ) n 1. The result follows by taking nth roots in this inequality. It follows from this theorem that the eigenvalues depend continuously on the elements of the matrix. The factor E 1/n 2 shows that this dependence is almost, but not quite, differentiable. As an example, the eigenvalues of the matrix [ 1 ɛ 1 1 ] are 1 ± ɛ and this expression is not differentiable at ɛ = 0. Exercise 13.8 (Continuity of eigenvalues) Suppose A(t) := D + t(a D), D := diag(a 11,..., a nn ), t R. 0 t 1 < t 2 1 and that µ is an eigenvalue of A(t 2 ). Show, using Theorem 13.7 with A = A(t 1 ) and E = A(t 2 ) A(t 1 ), that A(t 1 ) has an eigenvalue λ such that λ µ C(t 2 t 1 ) 1/n, where C 2 ( D 2 + A D 2 ). Thus, as a function of t, every eigenvalue of A(t) is a continuous function of t.
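The example at the start of this section, the n-by-n shift matrix A_1 perturbed by epsilon in position (n, 1), can be reproduced numerically; the following MATLAB sketch (ours) shows the ||E||^{1/n} effect.

% Perturbing the n-by-n shift matrix by eps0 in the (n,1) corner; a sketch.
n = 16;  eps0 = 1e-16;
A1 = diag(ones(n - 1, 1), 1);        % ones on the superdiagonal, zeros elsewhere
E = zeros(n);  E(n, 1) = eps0;
sort(abs(eig(A1 + E)))               % all of magnitude about eps0^(1/n) = 0.1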

313 13.3. Perturbation of Eigenvalues Nondefective matrices Recall that a matrix is nondefective if the eigenvectors form a basis for C n. For nondefective matrices we can get rid of the annoying exponent 1/n in E 2. For a more general discussion than the one in the following theorem see [30]. Theorem 13.9 (Absolute errors) Suppose A C n n has linearly independent eigenvectors {x 1,..., x n } and let X = [x 1,..., x n ] be the eigenvector matrix. To any µ C and x C n with x p = 1 we can find an eigenvalue λ of A such that λ µ K p (X) r p, 1 p, (13.2) where r := Ax µx and K p (X) := X p X 1 p. If for some E C n n it holds that (µ, x) is an eigenpair for A + E, then we can find an eigenvalue λ of A such that λ µ K p (X) E p, 1 p, (13.3) Proof. If µ σ(a) then we can take λ = µ and (13.2), (13.3) hold trivially. So assume µ / σ(a). Since A is nondefective it can be diagonalized, we have A = XDX 1, where D = diag(λ 1,..., λ n ) and (λ j, x j ) are the eigenpairs of A for j = 1,..., n. Define D 1 := D µi. Then D1 1 = diag ( (λ 1 µ) 1,..., (λ n µ) 1) exists and XD 1 1 X 1 r = ( X(D µi)x 1) 1 r = (A µi) 1 (A µi)x = x. Using this and Lemma below we obtain 1 = x p = XD 1 1 X 1 r p D 1 1 pk p (X) r p = K p(x) r p min j λ j µ. But then (13.2) follows. If (A + E)x = µx then 0 = Ax µx + Ex = r + Ex. But then r p = Ex p E p. Inserting this in (13.2) proves (13.3). The equation (13.3) shows that for a nondefective matrix the absolute error can be magnified by at most K p (X), the condition number of the eigenvector matrix with respect to inversion. If K p (X) is small then a small perturbation changes the eigenvalues by small amounts. Even if we get rid of the exponent 1/n, the equation (13.3) illustrates that it can be difficult or sometimes impossible to compute accurate eigenvalues and eigenvectors of matrices with almost linearly dependent eigenvectors. On the other hand the eigenvalue problem for normal matrices is better conditioned. Indeed, if A is normal then it has a set of orthonormal eigenvectors and the eigenvector matrix is unitary. If we restrict attention to the 2-norm then K 2 (X) = 1 and (13.3) implies the following result.

314 302 Chapter 13. Numerical Eigenvalue Problems Theorem (Perturbations, normal matrix) Suppose A C n n is normal and let µ be an eigenvalue of A + E for some E C n n. Then we can find an eigenvalue λ of A such that λ µ E 2. For an even stronger result for Hermitian matrices see Corollary We conclude that the situation for the absolute error in an eigenvalue of a Hermitian matrix is quite satisfactory. Small perturbations in the elements are not magnified in the eigenvalues. In the proof of Theorem 13.9 we used that the p-norm of a diagonal matrix is equal to its spectral radius. Lemma (p-norm of a diagonal matrix) If A = diag(λ 1,..., λ n ) is a diagonal matrix then A p = ρ(a) for 1 p. Proof. For p = the proof is left as an exercise. For any x C n and p < we have Ax p = [λ 1 x 1,..., λ n x n ] T p = ( n λ j p x j p) 1/p ρ(a) x p. Ax Thus A p = max p x 0 x p ρ(a). But from Theorem we have ρ(a) A p and the proof is complete. j=1 Exercise ( -norm of a diagonal matrix) Give a direct proof that A = ρ(a) if A is diagonal. For the accuracy of an eigenvalue of small magnitude we are interested in the size of the relative error. Theorem (Relative errors) Suppose in Theorem 13.9 that A C n n is nonsingular. To any µ C and x C n with x p = 1, we can find an eigenvalue λ of A such that λ µ λ K p (X)K p (A) r p A p, 1 p, (13.4) where r := Ax µx. If for some E C n n it holds that (µ, x) is an eigenpair for A + E, then we can find an eigenvalue λ of A such that λ µ λ K p (X) A 1 E p K p (X)K p (A) E p A p, 1 p, (13.5)

315 13.4. Unitary Similarity Transformation of a Matrix into Upper Hessenberg Form303 Proof. Applying Theorem to A 1 we have for any λ σ(a) 1 λ A 1 p = K p(a) A p and (13.4) follows from (13.2). To prove (13.5) we define the matrices B := µa 1 and F := A 1 E. If (λ j, x) are the eigenpairs for A then ( µ λ j, x ) are the eigenpairs for B for j = 1,..., n. Since (µ, x) is an eigenpair for A + E we find (B + F I)x = (µa 1 A 1 E I)x = A 1( µi (E + A) ) x = 0. Thus (1, x) is an eigenpair for B + F. Applying Theorem 13.9 to this eigenvalue we can find λ σ(a) such that µ λ 1 K p(x) F p = K p (X) A 1 E p which proves the first estimate in (13.5). The second inequality in (13.5) follows from the submultiplicativity of the p-norm Unitary Similarity Transformation of a Matrix into Upper Hessenberg Form Before attempting to find eigenvalues and eigenvectors of a matrix (exceptions are made for certain sparse matrices), it is often advantageous to reduce it by similarity transformations to a simpler form. Orthogonal or unitary similarity transformations are particularly important since they are insensitive to round-off errors in the elements of the matrix. In this section we show how this reduction can be carried out. Recall that a matrix A C n n is upper Hessenberg if a i,j = 0 for j = 1, 2,..., i 2, i = 3, 4,..., n. We will reduce A C n n to upper Hessenberg form by unitary similarity transformations. Let A 1 = A and define A k+1 = H k A k H k for k = 1, 2,..., n 2. Here H k is a Householder transformation chosen to introduce zeros in the elements of column k of A k under the subdiagonal. The final matrix A n 1 will be upper Hessenberg. Householder transformations were used in Chapter 4 to reduce a matrix to upper triangular form. To preserve eigenvalues similarity transformations are needed and then the final matrix in the reduction cannot in general be upper triangular. If A 1 = A is Hermitian, the matrix A n 1 will be Hermitian and tridiagonal. For if A k = A k then A k+1 = (H k A k H k ) = H k A kh k = A k+1. Since A n 1 is upper Hessenberg and Hermitian, it must be tridiagonal. To describe the reduction to upper Hessenberg or tridiagonal form in more detail we partition A k as follows [ ] Bk C A k = k. D k E k

Suppose B_k in C^{k x k} is upper Hessenberg, and the first k-1 columns of D_k in C^{(n-k) x k} are zero, i.e. D_k = [0, 0,..., 0, d_k]. Let V_k = I - v_k v_k^* in C^{(n-k) x (n-k)} be a Householder transformation such that V_k d_k = alpha_k e_1. Define

H_k = [I_k, 0; 0, V_k] in C^{n x n}.

The matrix H_k is a Householder transformation, and we find

A_{k+1} = H_k A_k H_k = [I_k, 0; 0, V_k] [B_k, C_k; D_k, E_k] [I_k, 0; 0, V_k] = [B_k, C_k V_k; V_k D_k, V_k E_k V_k].

Now V_k D_k = [V_k 0,..., V_k 0, V_k d_k] = [0,..., 0, alpha_k e_1]. Moreover, the matrix B_k is not affected by the H_k transformation. Therefore the upper left (k+1) x (k+1) corner of A_{k+1} is upper Hessenberg and the reduction is carried one step further. The reduction stops with A_{n-1}, which is upper Hessenberg. To find A_{k+1} we use Algorithm 4.17 to find v_k and alpha_k. We store v_k in the kth column of a matrix L as L(k+1:n, k) = v_k. This leads to the following algorithm.

Algorithm (Householder reduction to Hessenberg form) This algorithm uses Householder similarity transformations to reduce a matrix A in C^{n x n} to upper Hessenberg form. The reduced matrix B is tridiagonal if A is symmetric. Details of the transformations are stored in a lower triangular matrix L. The elements of L can be used to assemble a unitary matrix Q such that B = Q* A Q. Algorithm 4.17 is used in each step of the reduction.

function [L, B] = hesshousegen(A)
% Reduce A to upper Hessenberg form B by Householder similarity transformations.
n = length(A); L = zeros(n, n); B = A;
for k = 1:n-2
    [v, B(k+1, k)] = housegen(B(k+1:n, k));                 % Householder vector for column k
    L(k+1:n, k) = v; B(k+2:n, k) = zeros(n-k-1, 1);
    C = B(k+1:n, k+1:n); B(k+1:n, k+1:n) = C - v*(v'*C);    % left multiplication by V_k
    C = B(1:n, k+1:n);   B(1:n, k+1:n)   = C - (C*v)*v';    % right multiplication by V_k
end

Exercise (Number of arithmetic operations, Hessenberg reduction) Show that the number of arithmetic operations for this algorithm is 10n^3/3 = 5 G_n.

Assembling Householder transformations

We can use the output of the algorithm above to assemble the matrix Q in R^{n x n} such that Q is orthonormal and Q* A Q is upper Hessenberg. We need to compute the

product Q = H_1 H_2 ... H_{n-2}, where

H_k = [I_k, 0; 0, I - v_k v_k^T]

and v_k in R^{n-k}. Since v_1 in R^{n-1} and v_{n-2} in R^2 it is most economical to assemble the product from right to left. We compute Q_{n-1} = I and Q_k = H_k Q_{k+1} for k = n-2, n-3,..., 1. Suppose Q_{k+1} has the form [I_k, 0; 0, U_k], where U_k in R^{(n-k) x (n-k)}. Then

Q_k = [I_k, 0; 0, I - v_k v_k^T] [I_k, 0; 0, U_k] = [I_k, 0; 0, U_k - v_k (v_k^T U_k)].

This leads to the following algorithm.

Algorithm (Assemble Householder transformations) Suppose [L,B] = hesshousegen(A) is the output of the Householder reduction algorithm above. This algorithm assembles an orthonormal matrix Q from the columns of L such that B = Q* A Q is upper Hessenberg.

function Q = accumulateQ(L)
% Accumulate Q = H_1 H_2 ... H_{n-2} from the Householder vectors stored in L.
n = length(L); Q = eye(n);
for k = n-2:-1:1
    v = L(k+1:n, k); C = Q(k+1:n, k+1:n);
    Q(k+1:n, k+1:n) = C - v*(v'*C);
end

Exercise (Assemble Householder transformations) Show that the number of arithmetic operations required by this algorithm is 4n^3/3 = 2 G_n.

Exercise (Tridiagonalize a symmetric matrix) If A is real and symmetric we can modify the reduction algorithm above as follows. To find A_{k+1} from A_k we have to compute V_k E_k V_k where E_k is symmetric. Dropping subscripts we have to compute a product of the form G = (I - v v^T) E (I - v v^T). Let w := Ev, beta := (1/2) v^T w and z := w - beta v. Show that G = E - v z^T - z v^T. Since G is symmetric, only the sub- or superdiagonal elements of G need to be computed. Computing G in this way, it can be shown that we need O(4n^3/3) operations to tridiagonalize a symmetric matrix by orthonormal similarity transformations. This is less than half the work to reduce a nonsymmetric matrix to upper Hessenberg form. We refer to [29] for a detailed algorithm.
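Assuming housegen from Chapter 4, hesshousegen and accumulateQ are available, the reduction can be checked numerically as follows (a sketch, ours):

% Check the Householder reduction to upper Hessenberg form; a sketch.
n = 8;  A = randn(n);
[L, B] = hesshousegen(A);
Q = accumulateQ(L);
norm(Q'*Q - eye(n))          % Q is orthonormal up to rounding
norm(Q'*A*Q - B)             % B is unitarily similar to A
norm(tril(B, -2))            % no entries below the first subdiagonal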

318 306 Chapter 13. Numerical Eigenvalue Problems some 1 m n. Using Householder similarity transformations as outlined in the previous section we can assume that A is symmetric and tridiagonal. d 1 c 1 c 1 d 2 c 2 A = (13.6) c n 2 d n 1 c n 1 c n 1 d n Suppose one of the off-diagonal elements is equal to zero, say c i = 0. We then have A = [ A 1 ] 0 0 A 2, where d 1 c 1 d i+1 c i+1 c 1 d 2 c 2 c i+1 d i+2 c i+2 A 1 = and A 2 = c i 2 d i 1 c i 1 c n 2 d n 1 c n 1 c i 1 d i c n 1 d n Thus A is block diagonal and we can split the eigenvalue problem into two smaller problems involving A 1 and A 2. We assume that this reduction has been carried out so that A is irreducible, i. e., c i 0 for i = 1,..., n 1. We first show that irreducibility implies that the eigenvalues are distinct. Lemma (Distinct eigevalues of a tridiagonal matrix) An irreducible, tridiagonal and symmetric matrix A R n n has n real and distinct eigenvalues. Proof. Let A be given by (13.6). By Theorem 5.39 the eigenvalues are real. Define for x R the polynomial p k (x) := det(xi k A k ) for k = 1,..., n, where A k is the upper left k k corner of A (the leading principal submatrix of order k). The eigenvalues of A are the roots of the polynomial p n. Using the last column to expand for k 2 the determinant p k+1 (x) we find p k+1 (x) = (x d k+1 )p k (x) c 2 kp k 1 (x). (13.7) Since p 1 (x) = x d 1 and p 2 (x) = (x d 2 )(x d 1 ) c 2 1 this also holds for k = 0, 1 if we define p 1 (x) = 0 and p 0 (x) = 1. For M sufficiently large we have p 2 ( M) > 0, p 2 (d 1 ) < 0, p 2 (+M) > 0. Since p 2 is continuous there are y 1 ( M, d 1 ) and y 2 (d 1, M) such that p 2 (y 1 ) = p 2 (y 2 ) = 0. It follows that the root d 1 of p 1 separates the roots of p 2, so y 1 and y 2 must be distinct. Consider next p 3 (x) = (x d 3 )p 2 (x) c 2 2p 1 (x) = (x d 3 )(x y 1 )(x y 2 ) c 2 2(x d 1 ).

319 13.5. Computing a Selected Eigenvalue of a Symmetric Matrix 307 Since y 1 < d 1 < y 2 we have for M sufficiently large p 3 ( M) < 0, p 3 (y 1 ) > 0, p 3 (y 2 ) < 0, p 3 (+M) > 0. Thus the roots x 1, x 2, x 3 of p 3 are separated by the roots y 1, y 2 of p 2. In the general case suppose for k 2 that the roots z 1,..., z k 1 of p k 1 separate the roots y 1,..., y k of p k. Choose M so that y 0 := M < y 1, y k+1 := M > y k. Then y 0 < y 1 < z 1 < y 2 < z 2 < z k 1 < y k < y k+1. We claim that for M sufficiently large p k+1 (y j ) = ( 1) k+1 j p k+1 (y j ) 0, for j = 0, 1,..., k + 1. This holds for j = 0, k + 1, and for j = 1,..., k since p k+1 (y j ) = c 2 kp k 1 (y j ) = c 2 k(y j z 1 ) (y j z k 1 ). It follows that the roots x 1,..., x k+1 are separated by the roots y 1,..., y k of p k and by induction the roots of p n (the eigenvalues of A) are distinct The inertia theorem We say that two matrices A, B C n n are congruent if A = E BE for some nonsingular matrix E C n n. By Theorem 5.32 a Hermitian matrix A is both congruent and similar to a diagonal matrix D, U AU = D where U is unitary. The eigenvalues of A are the diagonal elements of D. Let π(a), ζ(a) and υ(a) denote the number of positive, zero and negative eigenvalues of A. If A is Hermitian then all eigenvalues are real and π(a) + ζ(a) + υ(a) = n. Theorem (Sylvester s inertia theorem ) If A, B C n n are Hermitian and congruent then π(a) = π(b), ζ(a) = ζ(b) and υ(a) = υ(b). Proof. Suppose A = E BE, where E is nonsingular. Assume first that A and B are diagonal matrices. Suppose π(a) = k and π(b) = m < k. We shall show that this leads to a contradiction. Let E 1 be the upper left m k corner of E. Since m < k, we can find a nonzero x such that E 1 x = 0 (cf. Lemma 0.16). Let y T = [x T, 0 T ] C n, and z = [z 1,..., z n ] T = Ey. Then z i = 0 for i = 1, 2,..., m. If A has positive eigenvalues λ 1,..., λ k and B has eigenvalues µ 1,..., µ n, where µ i 0 for i m + 1 then y Ay = n λ i y i 2 = i=1 k λ i x i 2 > 0. i=1

320 308 Chapter 13. Numerical Eigenvalue Problems But n y Ay = y E BEy = z Bz = µ i z i 2 0, i=m+1 a contradiction. We conclude that π(a) = π(b) if A and B are diagonal. Moreover, υ(a) = π( A) = π( B) = υ(b) and ζ(a) = n π(a) υ(a) = n π(b) υ(b) = ζ(b). This completes the proof for diagonal matrices. Let in the general case U 1 and U 2 be unitary matrices such that U 1AU 1 = D 1 and U 2BU 2 = D 2 where D 1 and D 2 are diagonal matrices. Since A = E BE, we find D 1 = F D 2 F where F = U 2EU 1 is nonsingular. Thus D 1 and D 2 are congruent diagonal matrices. But since A and D 1, B and D 2 have the same eigenvalues, we find π(a) = π(d 1 ) = π(d 2 ) = π(b). Similar results hold for ζ and υ. Corollary (Counting eigenvalues using the LDL* factorization) Suppose A = tridiag(c i, d i, c i ) R n n is symmetric and that α R is such that A αi has an symmetric LU factorization, i.e. A αi = LDL T where L is unit lower triangular and D is diagonal. Then the number of eigenvalues of A strictly less than α equals the number of negative diagonal elements in D. The diagonal elements d 1 (α),..., d n (α) in D can be computed recursively as follows d 1 (α) = d 1 α, d k (α) = d k α c 2 k 1/d k 1 (α), k = 2, 3,..., n. (13.8) Proof. Since the diagonal elements in L in an LU factorization equal the diagonal elements in D in an LDL T factorization we see that the formulas in (13.8) follows immediately from (1.16). Since L is nonsingular, A αi and D are congruent. By the previous theorem υ(a αi) = υ(d), the number of negative diagonal elements in D. If Ax = λx then (A αi)x = (λ α)x, and λ α is an eigenvalue of A αi. But then υ(a αi) equals the number of eigenvalues of A which are less than α. Exercise (Counting eigenvalues) Consider the matrix in Exercise Determine the number of eigenvalues greater than 4.5. Exercise (Overflow in LDL* factorization)

321 13.5. Computing a Selected Eigenvalue of a Symmetric Matrix 309 Let for n N A n = R n n. a) Let d k be the diagonal elements of D in an LDL* factorization of A n. Show that < d k 10, k = 1, 2,..., n. b) Show that D n := det(a n ) > (5+ 24) n. Give n 0 N such that your computer gives an overflow when D n0 is computed in floating point arithmetic. Exercise (Simultaneous diagonalization) (Simultaneous diagonalization of two symmetric matrices by a congruence transformation). Let A, B R n n where A T = A and B is symmetric positive definite. Let B = U T DU where U is orthonormal and D = diag(d 1,..., d n ). Let  = D 1/2 UAU T D 1/2 where a) Show that  is symmetric. D 1/2 := diag ( d 1/2 1,..., d 1/2 ) n. Let  = Û T ˆDÛ where Û is orthonormal and ˆD is diagonal. Set E = U T D 1/2 Û T. b) Show that E is nonsingular and that E T AE = ˆD, E T BE = I. For a more general result see Theorem 10.1 in [19] Approximating λ m Corollary can be used to determine the mth eigenvalue of A, where λ 1 λ 2 λ n. Using Gerschgorin s theorem we first find an interval [a, b], such that (a, b) contains the eigenvalues of A. Let for x [a, b] ρ(x) := #{k : d k (x) > 0 for k = 1,..., n} be the number of eigenvalues of A which are strictly greater than x. Clearly ρ(a) = n, ρ(b) = 0. Choosing a tolerance ɛ and using bisection we proceed as

322 310 Chapter 13. Numerical Eigenvalue Problems follows: h = b a; for j = 1 : itmax c = (a + b)/2; if b a < eps h λ = (a + b)/2; return end k = ρ(c); if k m a = c else b = c; end (13.9) We generate a sequence {[a j, b j ]} of intervals, each containing λ m and b j a j = 2 j (b a). As it stands this method will fail if in (13.8) one of the d k (α) is zero. One possibility is to replace such a d k (α) by a suitable small number, say δ k = c k ɛ M, where ɛ M is the Machine epsilon, typically for Matlab. This replacement is done if d k (α) < δ k. Exercise (Program code for one eigenvalue) Suppose A = tridiag(c, d, c) is symmetric and tridiagonal with elements d 1,..., d n on the diagonal and c 1,..., c n 1 on the neighboring subdiagonals. Let λ 1 λ 2 λ n be the eigenvalues of A. We shall write a program to compute one eigenvalue λ m for a given m using bisection and the method outlined in (13.9). a) Write a function k=count(c,d,x) which for given x counts the number of eigenvalues of A strictly greater than x. Use the replacement described above if one of the d j (x) is close to zero. b) Write a function lambda=findeigv(c,d,m) which first estimates an interval (a, b] containing all eigenvalues of A and then generates a sequence {(a j, b j ]} of intervals each containing λ m. Iterate until b j a j (b a)ɛ M, where ɛ M is Matlab s machine epsilon eps. Typically ɛ M c) Test the program on T := tridiag( 1, 2, 1) of size 100. Compare the exact value of λ 5 with your result and the result obtained by using Matlab s built-in function eig. Exercise (Determinant of upper Hessenberg matrix) Suppose A C n n is upper Hessenberg and x C. We will study two algorithms to compute f(x) = det(a xi).

323 13.6. Review Questions 311 a) Show that Gaussian elimination without pivoting requires O(n 2 ) arithmetic operations. b) Show that the number of arithmetic operations is the same if partial pivoting is used. c) Estimate the number of arithmetic operations if Given s rotations are used. d) Compare the two methods discussing advantages and disadvantages Review Questions Suppose A, E C n n. To every µ σ(a + E) there is a λ σ(a) which is in some sense close to µ. What is the general result (Elsner s theorem)? what if A is non defective? what if A is normal? what if A is Hermitian? Can Gerschgorin s theorem be used to check if a matrix is nonsingular? How many arithmetic operation does it take to reduce a matrix by similarity transformations to upper Hessenberg form by Householder transformations? Give a condition ensuring that a tridiagonal symmetric matrix has real and distinct eigenvalues: What is the content of Sylvester s inertia theorem? Give an application of this theorem.
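One application of the inertia theorem is the bisection method of Exercise 13.26 above. A minimal MATLAB sketch of the two functions asked for there follows; the zero-pivot safeguard is the one described before the exercise, while the remaining details (bounds, tolerance) are our own choices, not the book's code.

function k = count(c, d, x)
% Number of eigenvalues of A = tridiag(c,d,c) strictly greater than x,
% using the recursion (13.8) and Sylvester's inertia theorem; a sketch.
n = length(d);  k = 0;
dk = d(1) - x;
if dk > 0, k = k + 1; end
for j = 2:n
    delta = abs(c(j - 1))*eps;
    if abs(dk) < delta, dk = delta; end     % safeguard against a tiny pivot
    dk = d(j) - x - c(j - 1)^2/dk;
    if dk > 0, k = k + 1; end
end

function lambda = findeigv(c, d, m)
% Bisection for lambda_m, eigenvalues ordered lambda_1 >= ... >= lambda_n; a sketch.
r = [abs(c(:)); 0] + [0; abs(c(:))];        % Gerschgorin radii for a tridiagonal matrix
a = min(d(:) - r);  b = max(d(:) + r);
tol = eps*(b - a);
while b - a > tol
    mid = (a + b)/2;
    if count(c, d, mid) >= m, a = mid; else, b = mid; end
end
lambda = (a + b)/2;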


Chapter 14  The QR Algorithm

The QR algorithm is a method to find all eigenvalues and eigenvectors of a matrix. It is related to a simpler method called the power method, and we start studying this method and its variants.

[Photo: Vera Nikolaevna Kublanovskaya]

The Power Method and its variants

These methods can be used to compute a single eigenpair of a matrix. They also play a role when studying properties of the QR algorithm.


More information

Linear Algebra Primer

Linear Algebra Primer Linear Algebra Primer David Doria daviddoria@gmail.com Wednesday 3 rd December, 2008 Contents Why is it called Linear Algebra? 4 2 What is a Matrix? 4 2. Input and Output.....................................

More information

Linear Algebra M1 - FIB. Contents: 5. Matrices, systems of linear equations and determinants 6. Vector space 7. Linear maps 8.

Linear Algebra M1 - FIB. Contents: 5. Matrices, systems of linear equations and determinants 6. Vector space 7. Linear maps 8. Linear Algebra M1 - FIB Contents: 5 Matrices, systems of linear equations and determinants 6 Vector space 7 Linear maps 8 Diagonalization Anna de Mier Montserrat Maureso Dept Matemàtica Aplicada II Translation:

More information

1 Matrices and Systems of Linear Equations. a 1n a 2n

1 Matrices and Systems of Linear Equations. a 1n a 2n March 31, 2013 16-1 16. Systems of Linear Equations 1 Matrices and Systems of Linear Equations An m n matrix is an array A = (a ij ) of the form a 11 a 21 a m1 a 1n a 2n... a mn where each a ij is a real

More information

ANSWERS (5 points) Let A be a 2 2 matrix such that A =. Compute A. 2

ANSWERS (5 points) Let A be a 2 2 matrix such that A =. Compute A. 2 MATH 7- Final Exam Sample Problems Spring 7 ANSWERS ) ) ). 5 points) Let A be a matrix such that A =. Compute A. ) A = A ) = ) = ). 5 points) State ) the definition of norm, ) the Cauchy-Schwartz inequality

More information

Numerical Methods - Numerical Linear Algebra

Numerical Methods - Numerical Linear Algebra Numerical Methods - Numerical Linear Algebra Y. K. Goh Universiti Tunku Abdul Rahman 2013 Y. K. Goh (UTAR) Numerical Methods - Numerical Linear Algebra I 2013 1 / 62 Outline 1 Motivation 2 Solving Linear

More information

Matrix Algebra Determinant, Inverse matrix. Matrices. A. Fabretti. Mathematics 2 A.Y. 2015/2016. A. Fabretti Matrices

Matrix Algebra Determinant, Inverse matrix. Matrices. A. Fabretti. Mathematics 2 A.Y. 2015/2016. A. Fabretti Matrices Matrices A. Fabretti Mathematics 2 A.Y. 2015/2016 Table of contents Matrix Algebra Determinant Inverse Matrix Introduction A matrix is a rectangular array of numbers. The size of a matrix is indicated

More information

NOTES on LINEAR ALGEBRA 1

NOTES on LINEAR ALGEBRA 1 School of Economics, Management and Statistics University of Bologna Academic Year 207/8 NOTES on LINEAR ALGEBRA for the students of Stats and Maths This is a modified version of the notes by Prof Laura

More information

Digital Workbook for GRA 6035 Mathematics

Digital Workbook for GRA 6035 Mathematics Eivind Eriksen Digital Workbook for GRA 6035 Mathematics November 10, 2014 BI Norwegian Business School Contents Part I Lectures in GRA6035 Mathematics 1 Linear Systems and Gaussian Elimination........................

More information

FINITE-DIMENSIONAL LINEAR ALGEBRA

FINITE-DIMENSIONAL LINEAR ALGEBRA DISCRETE MATHEMATICS AND ITS APPLICATIONS Series Editor KENNETH H ROSEN FINITE-DIMENSIONAL LINEAR ALGEBRA Mark S Gockenbach Michigan Technological University Houghton, USA CRC Press Taylor & Francis Croup

More information

Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) ( ) Knowledge Discovery and Data Mining 1 (VO) (707.003) Review of Linear Algebra Denis Helic KTI, TU Graz Oct 9, 2014 Denis Helic (KTI, TU Graz) KDDM1 Oct 9, 2014 1 / 74 Big picture: KDDM Probability Theory

More information

33AH, WINTER 2018: STUDY GUIDE FOR FINAL EXAM

33AH, WINTER 2018: STUDY GUIDE FOR FINAL EXAM 33AH, WINTER 2018: STUDY GUIDE FOR FINAL EXAM (UPDATED MARCH 17, 2018) The final exam will be cumulative, with a bit more weight on more recent material. This outline covers the what we ve done since the

More information

a 11 a 12 a 11 a 12 a 13 a 21 a 22 a 23 . a 31 a 32 a 33 a 12 a 21 a 23 a 31 a = = = = 12

a 11 a 12 a 11 a 12 a 13 a 21 a 22 a 23 . a 31 a 32 a 33 a 12 a 21 a 23 a 31 a = = = = 12 24 8 Matrices Determinant of 2 2 matrix Given a 2 2 matrix [ ] a a A = 2 a 2 a 22 the real number a a 22 a 2 a 2 is determinant and denoted by det(a) = a a 2 a 2 a 22 Example 8 Find determinant of 2 2

More information

A Brief Outline of Math 355

A Brief Outline of Math 355 A Brief Outline of Math 355 Lecture 1 The geometry of linear equations; elimination with matrices A system of m linear equations with n unknowns can be thought of geometrically as m hyperplanes intersecting

More information

B553 Lecture 5: Matrix Algebra Review

B553 Lecture 5: Matrix Algebra Review B553 Lecture 5: Matrix Algebra Review Kris Hauser January 19, 2012 We have seen in prior lectures how vectors represent points in R n and gradients of functions. Matrices represent linear transformations

More information

2. Every linear system with the same number of equations as unknowns has a unique solution.

2. Every linear system with the same number of equations as unknowns has a unique solution. 1. For matrices A, B, C, A + B = A + C if and only if A = B. 2. Every linear system with the same number of equations as unknowns has a unique solution. 3. Every linear system with the same number of equations

More information

c c c c c c c c c c a 3x3 matrix C= has a determinant determined by

c c c c c c c c c c a 3x3 matrix C= has a determinant determined by Linear Algebra Determinants and Eigenvalues Introduction: Many important geometric and algebraic properties of square matrices are associated with a single real number revealed by what s known as the determinant.

More information

CS 246 Review of Linear Algebra 01/17/19

CS 246 Review of Linear Algebra 01/17/19 1 Linear algebra In this section we will discuss vectors and matrices. We denote the (i, j)th entry of a matrix A as A ij, and the ith entry of a vector as v i. 1.1 Vectors and vector operations A vector

More information

Chapter 2:Determinants. Section 2.1: Determinants by cofactor expansion

Chapter 2:Determinants. Section 2.1: Determinants by cofactor expansion Chapter 2:Determinants Section 2.1: Determinants by cofactor expansion [ ] a b Recall: The 2 2 matrix is invertible if ad bc 0. The c d ([ ]) a b function f = ad bc is called the determinant and it associates

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Linear Algebra Summary. Based on Linear Algebra and its applications by David C. Lay

Linear Algebra Summary. Based on Linear Algebra and its applications by David C. Lay Linear Algebra Summary Based on Linear Algebra and its applications by David C. Lay Preface The goal of this summary is to offer a complete overview of all theorems and definitions introduced in the chapters

More information

SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear Algebra

SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear Algebra 1.1. Introduction SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear algebra is a specific branch of mathematics dealing with the study of vectors, vector spaces with functions that

More information

Numerical Methods in Matrix Computations

Numerical Methods in Matrix Computations Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices

More information

1 Multiply Eq. E i by λ 0: (λe i ) (E i ) 2 Multiply Eq. E j by λ and add to Eq. E i : (E i + λe j ) (E i )

1 Multiply Eq. E i by λ 0: (λe i ) (E i ) 2 Multiply Eq. E j by λ and add to Eq. E i : (E i + λe j ) (E i ) Direct Methods for Linear Systems Chapter Direct Methods for Solving Linear Systems Per-Olof Persson persson@berkeleyedu Department of Mathematics University of California, Berkeley Math 18A Numerical

More information

MATH 213 Linear Algebra and ODEs Spring 2015 Study Sheet for Midterm Exam. Topics

MATH 213 Linear Algebra and ODEs Spring 2015 Study Sheet for Midterm Exam. Topics MATH 213 Linear Algebra and ODEs Spring 2015 Study Sheet for Midterm Exam This study sheet will not be allowed during the test Books and notes will not be allowed during the test Calculators and cell phones

More information

Math Linear Algebra Final Exam Review Sheet

Math Linear Algebra Final Exam Review Sheet Math 15-1 Linear Algebra Final Exam Review Sheet Vector Operations Vector addition is a component-wise operation. Two vectors v and w may be added together as long as they contain the same number n of

More information

Problem Set (T) If A is an m n matrix, B is an n p matrix and D is a p s matrix, then show

Problem Set (T) If A is an m n matrix, B is an n p matrix and D is a p s matrix, then show MTH 0: Linear Algebra Department of Mathematics and Statistics Indian Institute of Technology - Kanpur Problem Set Problems marked (T) are for discussions in Tutorial sessions (T) If A is an m n matrix,

More information

Online Exercises for Linear Algebra XM511

Online Exercises for Linear Algebra XM511 This document lists the online exercises for XM511. The section ( ) numbers refer to the textbook. TYPE I are True/False. Lecture 02 ( 1.1) Online Exercises for Linear Algebra XM511 1) The matrix [3 2

More information

MATH 583A REVIEW SESSION #1

MATH 583A REVIEW SESSION #1 MATH 583A REVIEW SESSION #1 BOJAN DURICKOVIC 1. Vector Spaces Very quick review of the basic linear algebra concepts (see any linear algebra textbook): (finite dimensional) vector space (or linear space),

More information

1 Matrices and Systems of Linear Equations

1 Matrices and Systems of Linear Equations March 3, 203 6-6. Systems of Linear Equations Matrices and Systems of Linear Equations An m n matrix is an array A = a ij of the form a a n a 2 a 2n... a m a mn where each a ij is a real or complex number.

More information

A matrix is a rectangular array of. objects arranged in rows and columns. The objects are called the entries. is called the size of the matrix, and

A matrix is a rectangular array of. objects arranged in rows and columns. The objects are called the entries. is called the size of the matrix, and Section 5.5. Matrices and Vectors A matrix is a rectangular array of objects arranged in rows and columns. The objects are called the entries. A matrix with m rows and n columns is called an m n matrix.

More information

Chapter 1: Systems of linear equations and matrices. Section 1.1: Introduction to systems of linear equations

Chapter 1: Systems of linear equations and matrices. Section 1.1: Introduction to systems of linear equations Chapter 1: Systems of linear equations and matrices Section 1.1: Introduction to systems of linear equations Definition: A linear equation in n variables can be expressed in the form a 1 x 1 + a 2 x 2

More information

A matrix is a rectangular array of. objects arranged in rows and columns. The objects are called the entries. is called the size of the matrix, and

A matrix is a rectangular array of. objects arranged in rows and columns. The objects are called the entries. is called the size of the matrix, and Section 5.5. Matrices and Vectors A matrix is a rectangular array of objects arranged in rows and columns. The objects are called the entries. A matrix with m rows and n columns is called an m n matrix.

More information

Numerical Linear Algebra A Solution Manual. Georg Muntingh and Christian Schulz

Numerical Linear Algebra A Solution Manual. Georg Muntingh and Christian Schulz Numerical Linear Algebra A Solution Manual Georg Muntingh and Christian Schulz Contents Chapter A Short Review of Linear Algebra Exercise 5: The inverse of a general matrix Exercise 6: The inverse of

More information

1 9/5 Matrices, vectors, and their applications

1 9/5 Matrices, vectors, and their applications 1 9/5 Matrices, vectors, and their applications Algebra: study of objects and operations on them. Linear algebra: object: matrices and vectors. operations: addition, multiplication etc. Algorithms/Geometric

More information

Conceptual Questions for Review

Conceptual Questions for Review Conceptual Questions for Review Chapter 1 1.1 Which vectors are linear combinations of v = (3, 1) and w = (4, 3)? 1.2 Compare the dot product of v = (3, 1) and w = (4, 3) to the product of their lengths.

More information

The Eigenvalue Problem: Perturbation Theory

The Eigenvalue Problem: Perturbation Theory Jim Lambers MAT 610 Summer Session 2009-10 Lecture 13 Notes These notes correspond to Sections 7.2 and 8.1 in the text. The Eigenvalue Problem: Perturbation Theory The Unsymmetric Eigenvalue Problem Just

More information

NOTES FOR LINEAR ALGEBRA 133

NOTES FOR LINEAR ALGEBRA 133 NOTES FOR LINEAR ALGEBRA 33 William J Anderson McGill University These are not official notes for Math 33 identical to the notes projected in class They are intended for Anderson s section 4, and are 2

More information

Matrices A brief introduction

Matrices A brief introduction Matrices A brief introduction Basilio Bona DAUIN Politecnico di Torino Semester 1, 2014-15 B. Bona (DAUIN) Matrices Semester 1, 2014-15 1 / 41 Definitions Definition A matrix is a set of N real or complex

More information

Matrices and Linear Algebra

Matrices and Linear Algebra Contents Quantitative methods for Economics and Business University of Ferrara Academic year 2017-2018 Contents 1 Basics 2 3 4 5 Contents 1 Basics 2 3 4 5 Contents 1 Basics 2 3 4 5 Contents 1 Basics 2

More information

Mathematical Foundations of Applied Statistics: Matrix Algebra

Mathematical Foundations of Applied Statistics: Matrix Algebra Mathematical Foundations of Applied Statistics: Matrix Algebra Steffen Unkel Department of Medical Statistics University Medical Center Göttingen, Germany Winter term 2018/19 1/105 Literature Seber, G.

More information

Linear Algebra. Min Yan

Linear Algebra. Min Yan Linear Algebra Min Yan January 2, 2018 2 Contents 1 Vector Space 7 1.1 Definition................................. 7 1.1.1 Axioms of Vector Space..................... 7 1.1.2 Consequence of Axiom......................

More information

APPENDIX: MATHEMATICAL INDUCTION AND OTHER FORMS OF PROOF

APPENDIX: MATHEMATICAL INDUCTION AND OTHER FORMS OF PROOF ELEMENTARY LINEAR ALGEBRA WORKBOOK/FOR USE WITH RON LARSON S TEXTBOOK ELEMENTARY LINEAR ALGEBRA CREATED BY SHANNON MARTIN MYERS APPENDIX: MATHEMATICAL INDUCTION AND OTHER FORMS OF PROOF When you are done

More information

MTH 464: Computational Linear Algebra

MTH 464: Computational Linear Algebra MTH 464: Computational Linear Algebra Lecture Outlines Exam 2 Material Prof. M. Beauregard Department of Mathematics & Statistics Stephen F. Austin State University March 2, 2018 Linear Algebra (MTH 464)

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2 MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS SYSTEMS OF EQUATIONS AND MATRICES Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

LINEAR ALGEBRA WITH APPLICATIONS

LINEAR ALGEBRA WITH APPLICATIONS SEVENTH EDITION LINEAR ALGEBRA WITH APPLICATIONS Instructor s Solutions Manual Steven J. Leon PREFACE This solutions manual is designed to accompany the seventh edition of Linear Algebra with Applications

More information

Applied Linear Algebra

Applied Linear Algebra Applied Linear Algebra Gábor P. Nagy and Viktor Vígh University of Szeged Bolyai Institute Winter 2014 1 / 262 Table of contents I 1 Introduction, review Complex numbers Vectors and matrices Determinants

More information

The following definition is fundamental.

The following definition is fundamental. 1. Some Basics from Linear Algebra With these notes, I will try and clarify certain topics that I only quickly mention in class. First and foremost, I will assume that you are familiar with many basic

More information

Lecture 2 INF-MAT : A boundary value problem and an eigenvalue problem; Block Multiplication; Tridiagonal Systems

Lecture 2 INF-MAT : A boundary value problem and an eigenvalue problem; Block Multiplication; Tridiagonal Systems Lecture 2 INF-MAT 4350 2008: A boundary value problem and an eigenvalue problem; Block Multiplication; Tridiagonal Systems Tom Lyche Centre of Mathematics for Applications, Department of Informatics, University

More information

(f + g)(s) = f(s) + g(s) for f, g V, s S (cf)(s) = cf(s) for c F, f V, s S

(f + g)(s) = f(s) + g(s) for f, g V, s S (cf)(s) = cf(s) for c F, f V, s S 1 Vector spaces 1.1 Definition (Vector space) Let V be a set with a binary operation +, F a field, and (c, v) cv be a mapping from F V into V. Then V is called a vector space over F (or a linear space

More information

A Review of Matrix Analysis

A Review of Matrix Analysis Matrix Notation Part Matrix Operations Matrices are simply rectangular arrays of quantities Each quantity in the array is called an element of the matrix and an element can be either a numerical value

More information

Introduction to Matrices

Introduction to Matrices POLS 704 Introduction to Matrices Introduction to Matrices. The Cast of Characters A matrix is a rectangular array (i.e., a table) of numbers. For example, 2 3 X 4 5 6 (4 3) 7 8 9 0 0 0 Thismatrix,with4rowsand3columns,isoforder

More information

Determinants by Cofactor Expansion (III)

Determinants by Cofactor Expansion (III) Determinants by Cofactor Expansion (III) Comment: (Reminder) If A is an n n matrix, then the determinant of A can be computed as a cofactor expansion along the jth column det(a) = a1j C1j + a2j C2j +...

More information

Applied Linear Algebra

Applied Linear Algebra Applied Linear Algebra Peter J. Olver School of Mathematics University of Minnesota Minneapolis, MN 55455 olver@math.umn.edu http://www.math.umn.edu/ olver Chehrzad Shakiban Department of Mathematics University

More information

ELEMENTS OF MATRIX ALGEBRA

ELEMENTS OF MATRIX ALGEBRA ELEMENTS OF MATRIX ALGEBRA CHUNG-MING KUAN Department of Finance National Taiwan University September 09, 2009 c Chung-Ming Kuan, 1996, 2001, 2009 E-mail: ckuan@ntuedutw; URL: homepagentuedutw/ ckuan CONTENTS

More information

Elementary Linear Algebra

Elementary Linear Algebra Matrices J MUSCAT Elementary Linear Algebra Matrices Definition Dr J Muscat 2002 A matrix is a rectangular array of numbers, arranged in rows and columns a a 2 a 3 a n a 2 a 22 a 23 a 2n A = a m a mn We

More information

Algebra C Numerical Linear Algebra Sample Exam Problems

Algebra C Numerical Linear Algebra Sample Exam Problems Algebra C Numerical Linear Algebra Sample Exam Problems Notation. Denote by V a finite-dimensional Hilbert space with inner product (, ) and corresponding norm. The abbreviation SPD is used for symmetric

More information

Linear Algebra. Carleton DeTar February 27, 2017

Linear Algebra. Carleton DeTar February 27, 2017 Linear Algebra Carleton DeTar detar@physics.utah.edu February 27, 2017 This document provides some background for various course topics in linear algebra: solving linear systems, determinants, and finding

More information

5.6. PSEUDOINVERSES 101. A H w.

5.6. PSEUDOINVERSES 101. A H w. 5.6. PSEUDOINVERSES 0 Corollary 5.6.4. If A is a matrix such that A H A is invertible, then the least-squares solution to Av = w is v = A H A ) A H w. The matrix A H A ) A H is the left inverse of A and

More information