APPLIED LINEAR ALGEBRA

Giorgio Picci

November 24, 2015

Contents

1 LINEAR VECTOR SPACES AND LINEAR MAPS
1.1 Linear Maps and Matrices
1.2 Inverse of a Linear Map
1.3 Inner products and norms
1.4 Inner products in coordinate spaces (1)
1.5 Inner products in coordinate spaces (2)
1.6 Adjoints
1.7 Subspaces
1.8 Image and kernel of a linear map
1.9 Invariant subspaces in R^n
1.10 Invariant subspaces and block-diagonalization
1.11 Eigenvalues and Eigenvectors

2 SYMMETRIC MATRICES
2.1 Generalizations: Normal, Hermitian and Unitary matrices
2.2 Change of Basis
2.3 Similarity
2.4 Similarity again
2.5 Problems
2.6 Skew-Hermitian matrices (1)
2.7 Skew-Symmetric matrices (2)
2.8 Square roots of positive semidefinite matrices
2.9 Projections in R^n
2.10 Projections on general inner product spaces
2.11 Gramians
2.12 Example: Polynomial vector spaces

3 LINEAR LEAST SQUARES PROBLEMS
3.1 Weighted Least Squares
3.2 Solution by the Orthogonality Principle
3.3 Matrix least-squares Problems
3.4 A problem from subspace identification
3.5 Relation with Left- and Right-Inverses
3.6 The Pseudoinverse
3.7 The Euclidean pseudoinverse
3.8 The Pseudoinverse and Orthogonal Projections
3.9 Linear equations
3.10 Unfeasible linear equations and Least Squares
3.11 The Singular value decomposition (SVD)
3.12 Useful Features of the SVD
3.13 Matrix Norms
3.14 Generalization of the SVD
3.15 SVD and the Pseudoinverse

4 NUMERICAL ASPECTS OF L-S PROBLEMS
4.1 Numerical Conditioning and the Condition Number
4.2 Conditioning of the Least Squares Problem
4.3 The QR Factorization
4.4 The role of orthogonality
4.5 Fourier series and least squares
4.6 SVD and least squares

5 INTRODUCTION TO INVERSE PROBLEMS
5.1 Ill-posed problems
5.2 From ill-posed to ill-conditioned
5.3 Regularized Least Squares problems

6 Vector spaces of second order random variables (1)
6.1 Vector spaces of second order random variables (2)
6.2 About random vectors
6.3 Sequences of second order random variables
6.4 Principal Components Analysis (PCA)
6.5 Bayesian Least Squares Estimation
6.6 The Orthogonal Projection Lemma
6.7 Block-diagonalization of Symmetric Positive Definite matrices
6.8 The Matrix Inversion Lemma (ABCD Lemma)
6.9 Change of basis
6.10 Cholesky Factorization
6.11 Bayesian estimation for a linear model
6.12 Use of the Matrix Inversion Lemma
6.13 Interpretation as a regularized least squares
6.14 Application to Canonical Correlation Analysis (CCA)
6.15 Computing the CCA in coordinates

7 KRONECKER PRODUCTS
7.1 Eigenvalues
7.2 Vectorization
7.3 Mixing ordinary and Kronecker products: The mixed-product property
7.4 Lyapunov equations
7.5 Symmetry
7.6 Sylvester equations
7.7 General Stein equations

8 Circulant Matrices
8.1 The Symbol of a Circulant
8.2 The finite Fourier Transform
8.3 Back to Circulant matrices

Notation
A^T : transpose of A.
A^* : conjugate transpose of (the complex matrix) A.
sigma(A) : the spectrum (set of the eigenvalues) of A.
Sigma(A) : the set of singular values of A.
A^+ : pseudoinverse of A.
Im(A) : image of A.
ker(A) : kernel of A.
A^{-1}{.} : inverse image with respect to A.
A^{-R} : right-inverse of A (A A^{-R} = I).
A^{-L} : left-inverse of A (A^{-L} A = I).

1 LINEAR VECTOR SPACES AND LINEAR MAPS

A vector space is a mathematical structure formed by a collection of elements called vectors, which may be added together and multiplied by numbers, called scalars. Scalars may be real or complex numbers or, in general, elements of any field F; accordingly the vector space is called a real, a complex, or an F vector space. The operations of vector addition and multiplication by a scalar must satisfy certain natural axioms which we shall not need to report here. The modern definition of vector space was introduced by Giuseppe Peano in 1888. Examples of (real) vector spaces are the arrows in a fixed plane or in three-dimensional space representing forces or velocities in Physics. Vectors may however be very general objects such as functions or polynomials, provided they can be added together and multiplied by scalars to give elements of the same kind. The vector space composed of all the n-tuples of real or complex numbers is known as a coordinate space and is usually denoted by R^n or C^n.

1.1 Linear Maps and Matrices

The concepts of linear independence, basis, coordinates, etc. are taken for granted. Vector spaces admitting a basis consisting of a finite number n of elements are called n-dimensional vector spaces. Example: the complex numbers C are a two-dimensional real vector space, with a two-dimensional basis consisting of 1 and the imaginary unit i.
A function between two vector spaces f : V -> W is a linear map if for all scalars alpha, beta and all vectors v_1, v_2 in V
f(alpha v_1 + beta v_2) = alpha f(v_1) + beta f(v_2).
When V and W are finite dimensional, say n- and m-dimensional, a linear map can be represented by an m x n matrix with elements in the field of scalars. The matrix acts by multiplication on the coordinates of the vectors of V, written as n x 1 matrices (which are called column vectors), and provides the coordinates of the image vectors in W. The matrix hence depends on the choice of basis in the two vector spaces.
The set of all n x m matrices with elements in R (resp. C) forms a real (resp. complex) vector space of dimension mn. These vector spaces are denoted R^{n x m} or C^{n x m} respectively.

1.2 Inverse of a Linear Map

Let V and W be finite dimensional, say n- and m-dimensional. By choosing bases in the two spaces any linear map f : V -> W is represented by a matrix A in C^{m x n}.
Proposition 1.1 If f : V -> W is invertible the matrix A must also be invertible and the two vector spaces must have the same dimension (say n).
Invertible matrices are also called non-singular. The inverse A^{-1} can be computed by the so-called Cramer rule
A^{-1} = (1/det A) Adj(A)
where the algebraic adjoint Adj(A) is the transpose of the matrix having in position (i, j) the determinant of the complement to row i and column j (an (n-1) x (n-1) matrix) multiplied by the factor (-1)^{i+j}. This rule is seldom used for actual computations. There is a wealth of algorithms to compute inverses which apply to matrices of specific structure. In fact, computing inverses is seldom of interest per se; one may rather have to look for algorithms which compute solutions of a linear system of equations Ax = b.
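
As a purely illustrative aside (not part of the original notes; the matrix and right-hand side are arbitrary), the Python/NumPy sketch below builds A^{-1} from the adjugate as in Cramer's rule and then, as the text recommends, solves Ax = b directly instead of forming the inverse.

```python
import numpy as np

# A small invertible matrix and right-hand side (arbitrary example).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

def adjugate(M):
    """Transpose of the cofactor matrix, as in the Cramer rule."""
    n = M.shape[0]
    C = np.empty_like(M)
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(M, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return C.T

A_inv = adjugate(A) / np.linalg.det(A)
print(np.allclose(A_inv, np.linalg.inv(A)))   # True

# In practice one solves Ax = b directly instead of forming A^{-1}.
x = np.linalg.solve(A, b)
print(np.allclose(A @ x, b))                  # True
```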

1.3 Inner products and norms

An inner product on V is a map <.,.> : V x V -> C satisfying the following requirements.
Conjugate symmetry: <x, y> equals the complex conjugate of <y, x>.
Linearity in the first argument: <ax, y> = a<x, y> and <x + y, z> = <x, z> + <y, z>.
Positive-definiteness: <x, x> >= 0, and <x, x> = 0 if and only if x = 0.
The norm induced by an inner product is ||x|| = +sqrt(<x, x>). This is the length of the vector x. Directly from the axioms one can prove the Cauchy-Schwarz inequality: for x, y elements of V
|<x, y>| <= ||x|| ||y||
with equality if and only if x and y are linearly dependent. This is one of the most important inequalities in mathematics. It is also known in the Russian literature as the Cauchy-Bunyakovsky-Schwarz inequality.

1.4 Inner products in coordinate spaces (1)

In the vector space R^n (in C^n one must use conjugation), the bilinear function <.,.> : R^n x R^n -> R, <u, v> := u^T v (u, v column vectors), has all the prescribed properties of an inner product. It induces the Euclidean norm on R^n: ||.|| : R^n -> R_+, ||u|| := sqrt(u^T u).
The bilinear form defined on C^{n x m} x C^{n x m} by
<A, B> := tr(A B-bar^T) = tr(B-bar^T A)   (1.1)
where tr denotes trace and B-bar is the complex conjugate of B, is a bona fide inner product on C^{n x m}. The matrix norm defined by the inner product (1.1),
||A||_F := <A, A>^{1/2} = [tr(A A-bar^T)]^{1/2}   (1.2)
is called the Frobenius, or weak, norm of A.
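
A quick numerical check of (1.1)-(1.2), added here as an illustration (Python/NumPy, arbitrary random complex matrices; not part of the notes).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2)) + 1j * rng.standard_normal((3, 2))
B = rng.standard_normal((3, 2)) + 1j * rng.standard_normal((3, 2))

# Inner product (1.1): <A, B> = tr(A conj(B)^T) = tr(conj(B)^T A).
ip = np.trace(A @ B.conj().T)
print(np.isclose(ip, np.trace(B.conj().T @ A)))        # True

# Frobenius norm (1.2) agrees with numpy's built-in norm.
fro = np.sqrt(np.trace(A @ A.conj().T).real)
print(np.isclose(fro, np.linalg.norm(A, 'fro')))       # True
```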

1.5 Inner products in coordinate spaces (2)

More general inner products in C^n can be defined as follows.
Definition 1.1 A square matrix A in C^{n x n} is Hermitian if A-bar^T = A and positive semidefinite if x-bar^T A x >= 0 for all x in C^n. The matrix is called positive definite if x-bar^T A x can be zero only when x = 0.
There are well-known tests of positive definiteness based on checking the signs of the principal minors, which should all be positive. Given a Hermitian positive definite matrix Q we define the weighted inner product <.,.>_Q in the coordinate space C^n by setting
<x, y>_Q := x-bar^T Q y.
This clearly satisfies the axioms of an inner product.
Problem 1.1 Show that any inner product in C^n must have this structure for a suitable Q. Is Q uniquely defined?

1.6 Adjoints

Consider a linear map A : X -> Y, where both X and Y are finite-dimensional vector spaces endowed with inner products <.,.>_X and <.,.>_Y respectively.
Definition 1.2 The adjoint of A : X -> Y is a linear map A^* : Y -> X defined by the relation
<y, Ax>_Y = <A^* y, x>_X,  for all x in X, y in Y.   (1.3)
Problem 1.2 Prove that A^* is well-defined by the condition (1.3). Hint: here you must use the fact that X and Y are finite-dimensional.
Example: Let A : C^n -> C^m where the spaces are equipped with weighted inner products, say <x_1, x_2>_{C^n} = x_1-bar^T Q_1 x_2 and <y_1, y_2>_{C^m} = y_1-bar^T Q_2 y_2, where Q_1, Q_2 are Hermitian positive definite matrices. Then we have

Proposition 1.2 The adjoint of the linear map A : X -> Y defined by a matrix A : C^n -> C^m with weighted inner products as above is
A^* = Q_1^{-1} A-bar^T Q_2   (1.4)
where A-bar^T is the conjugate transpose of A.
Problem 1.3 Prove Proposition 1.2.
Let A : C^n -> C^m and assume that Q_1 = I and Q_2 = I are both identity matrices. Both inner products in this case are Euclidean inner products. Then A^* = A-bar^T, i.e. the adjoint is the Hermitian conjugate. In particular, for a real matrix the adjoint is just the transpose. For any square Hermitian matrix the adjoint coincides with the original matrix. The linear map defined by the matrix is then called a self-adjoint operator. In the real case self-adjoint operators are represented by symmetric matrices. Note that all this is true only if the inner products are Euclidean.
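
The defining relation (1.3) and formula (1.4) can be checked numerically. The sketch below is an illustration with randomly generated Hermitian positive definite weights Q_1, Q_2; the helper hpd is ad hoc, not something from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))   # A : C^n -> C^m

def hpd(k):
    """A random Hermitian positive definite k x k matrix."""
    M = rng.standard_normal((k, k)) + 1j * rng.standard_normal((k, k))
    return M.conj().T @ M + k * np.eye(k)

Q1, Q2 = hpd(n), hpd(m)

# Adjoint according to (1.4): A* = Q1^{-1} conj(A)^T Q2.
A_star = np.linalg.solve(Q1, A.conj().T @ Q2)

# Check the defining relation (1.3): <y, Ax>_{Q2} = <A* y, x>_{Q1}.
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
y = rng.standard_normal(m) + 1j * rng.standard_normal(m)
lhs = y.conj() @ Q2 @ (A @ x)
rhs = (A_star @ y).conj() @ Q1 @ x
print(np.isclose(lhs, rhs))    # True
```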

1.7 Subspaces

A subset X of a vector space V which is itself a vector space with the same field of scalars is called a subspace. Subspaces of finite-dimensional vector spaces are automatically closed with respect to any inner-product-induced norm topology. This is not necessarily so if V is infinite dimensional.
Definition 1.3 Let X, Y be subspaces of V. Then
1. X + Y := {v in V : v = x + y, x in X, y in Y} is called the vector sum of X and Y.
2. When X intersect Y = {0} the vector sum is called direct. Notation: X (+.) Y.
3. When X (+.) Y = V the subspace Y is called a direct complement of X in V.
Notation: often in the literature the vector sum is denoted by a + and the direct vector sum by a (+). We shall use the latter symbol for the orthogonal direct sum.
Let R^n = X (+.) Y with dim X = k and dim Y = m. Then n = k + m and there exists a basis in R^n such that
v in X  <=>  v = [x; 0], x in R^k, 0 in R^m,   and   v in Y  <=>  v = [0; y], 0 in R^k, y in R^m.

1.8 Image and kernel of a linear map

Definition 1.4 Let A in R^{n x m}.
1. Im(A) := {v in R^n : v = Aw, w in R^m} (a subspace of R^n).
2. ker(A) := {v in R^m : Av = 0} (a subspace of R^m).
Definition 1.5 Let V be a subspace of R^n. The orthogonal complement of V is defined as
V^perp := {w in R^n : <w, v> = 0 for all v in V} = {w in R^n : <v, w> = 0 for all v in V}.
Matlab:
orth: Q = orth(A) computes a matrix Q whose columns are an orthonormal basis for Im(A) (i.e. Q^T Q = I, Im(Q) = Im(A), and the number of columns of Q is rank(A)).
null: Z = null(A) computes a matrix Z whose columns are an orthonormal basis for ker(A).
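
For readers working in Python rather than Matlab, scipy.linalg offers the analogous routines orth and null_space; the sketch below is illustrative, with an arbitrary rank-deficient matrix.

```python
import numpy as np
from scipy.linalg import orth, null_space

# A 3x4 matrix of rank 2 (the third row is the sum of the first two).
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 1.0],
              [1.0, 1.0, 3.0, 2.0]])

Q = orth(A)          # orthonormal basis of Im(A): Q^T Q = I, rank(A) columns
Z = null_space(A)    # orthonormal basis of ker(A): A Z = 0

print(Q.shape, Z.shape)                      # (3, 2) (4, 2)
print(np.allclose(Q.T @ Q, np.eye(2)))       # True
print(np.allclose(A @ Z, 0))                 # True
```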

Proposition 1.3 Let A in R^{n x m}. Then
1. ker(A) = ker(A^T A).
2. ker(A) = [Im(A^T)]^perp, that is R^m = ker(A) (+) Im(A^T).
3. Im(A) = [ker(A^T)]^perp, that is R^n = Im(A) (+) ker(A^T).
4. Im(A) = Im(A A^T).
Proof.
1. Let v in ker(A) => Av = 0 => A^T A v = 0 => v in ker(A^T A). Let v in ker(A^T A) => A^T A v = 0 => v^T A^T A v = 0 => ||Av||^2 = 0 => Av = 0 => v in ker(A).
2. v in ker(A) <=> Av = 0 <=> v^T A^T = 0 <=> v^T A^T w = 0 for all w in R^n <=> v in [Im(A^T)]^perp.
3. Immediate consequence of 2.
4. Immediate consequence of 1. and 2.
Hence: if V = Im(A) then V^perp = Im(B), where B can be computed in Matlab as B = null(A').

Intersecting kernels is easy:
ker(A) intersect ker(B) = ker([A; B]).
Similarly, adding images is easy:
Im(A) + Im(B) = Im([A B]).
Adding kernels can now be done by using the image representation. For example, ker(A) + ker(B) can be computed by representing ker(A) as Im(A_1) and ker(B) as Im(B_1) (Matlab function null). Intersection of images can be done as
Im(A) intersect Im(B) = [ker(A^T)]^perp intersect [ker(B^T)]^perp = [ker(A^T) + ker(B^T)]^perp.
Problem 1.4 State and prove Proposition 1.3 for complex matrices.
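
A sketch of these recipes in Python (illustrative only; the matrices are random, with B sharing a row with A so that the image intersection is nontrivial).

```python
import numpy as np
from scipy.linalg import null_space, orth

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 5))
B = np.vstack([A[0], rng.standard_normal(5)])   # B shares its first row with A

# ker(A) intersect ker(B) = ker([A; B])
K = null_space(np.vstack([A, B]))

# Im(A^T) + Im(B^T) = Im([A^T  B^T])
S = orth(np.hstack([A.T, B.T]))

# Im(A^T) intersect Im(B^T) = [ker(A) + ker(B)]^perp
inter = null_space(np.vstack([null_space(A).T, null_space(B).T]))

print(K.shape[1], S.shape[1], inter.shape[1])   # expected (generically): 2 3 1
```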

1.9 Invariant subspaces in R^n

Let A in R^{n x m} and let V be a subspace of R^m. We denote by AV the image of V through the map A. If V = Im(V), then AV = Im(AV).
Definition 1.6 Let A in R^{n x n} and let V be a subspace of R^n. We say that V is invariant for A, or A-invariant, if AV is contained in V (i.e. x in V => Ax in V). If, in addition, V^perp is also invariant, we say that V is a reducing subspace.
It is trivial that both Im(A) and ker(A) are A-invariant. Let V be a matrix whose columns form a basis for the subspace V. Then V is A-invariant if and only if Im(AV) is contained in Im(V).
Problem 1.5 Let A in R^{n x n} and V be a subspace of R^n. Prove that:
1. If A is invertible, V is A-invariant if and only if it is A^{-1}-invariant.
2. V is A-invariant if and only if V^perp is A^T-invariant.
3. If V is invariant and A is symmetric, i.e. A = A^T, then V is a reducing subspace.

1.10 Invariant subspaces and block-diagonalization

Let A in R^{n x n} and R^n = V (+.) W where V is A-invariant. Then there is a choice of basis in R^n with respect to which A has the representation
A-hat = [A_1  A_{12}; 0  A_2]
where A_1 in R^{k x k} with k = dim V. In any such basis, vectors in V are represented as columns [v; 0] with the last n - k components equal to zero.
If both V and W are invariant then there is a basis in R^n with respect to which A has a block-diagonal representation
A-hat = [A_1  0; 0  A_2].
The key to finding invariant subspaces is spectral analysis.

1.11 Eigenvalues and Eigenvectors

Along some directions a square matrix A in R^{n x n} acts like multiplication by a scalar:
Av = lambda v;
the scalar factor lambda is called the eigenvalue associated with the eigenvector v. Eigenvectors are actually directions in space and are usually normalized to unit norm. In general eigenvalues (and eigenvectors) are complex, as they must be roots of the characteristic polynomial equation
chi_A(lambda) := det(A - lambda I) = 0
which is of degree n in lambda and hence has n (not necessarily distinct) complex roots {lambda_1, ..., lambda_n}. This set is called the spectrum of A and is denoted sigma(A). The multiplicity of lambda_k as a root of the characteristic polynomial is called its algebraic multiplicity. When the eigenvectors are linearly independent they form a basis in which the matrix A looks like multiplication by a diagonal matrix whose elements are the eigenvalues. Unfortunately this happens only for special classes of matrices.

2 SYMMETRIC MATRICES

Proposition 2.1 Let A = A^T in R^{n x n}. Then
1. The eigenvalues of A are real and the eigenvectors can be chosen to form a real orthonormal basis.
2. A is diagonalizable by an orthogonal transformation (there is T s.t. T^T T = I and T^T A T diagonal).
Proof (sketch).
1. Let lambda be an eigenvalue and v a corresponding eigenvector; then lambda v-bar^T v = v-bar^T A v = (A v-bar)^T v = lambda-bar v-bar^T v, so lambda = lambda-bar. Therefore solutions of the real equation (A - lambda I)v = 0 can be chosen to be real.
2. If v_1 and v_2 are eigenvectors corresponding to lambda_1 != lambda_2, then v_1^T A v_2 = lambda_1 v_1^T v_2 = lambda_2 v_1^T v_2, so v_1^T v_2 = 0. Hence ker(A - lambda_i I) is orthogonal to ker(A - lambda_j I) whenever i != j.
3. ker(A - lambda_k I) is A-invariant, in fact reducing. Hence A restricted to this subspace is represented by an n_k x n_k matrix having lambda_k as its only eigenvalue and characteristic polynomial (lambda - lambda_k)^{r_k}, where r_k is the algebraic multiplicity of lambda_k. But then dimension and degree of the characteristic polynomial must coincide; therefore n_k = r_k.
4. Hence we can take a basis of r_k orthonormal eigenvectors corresponding to each distinct lambda_k. Then A is diagonalizable by an orthonormal choice of basis.
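
An illustrative numerical companion to Proposition 2.1 (not part of the notes): for a symmetric matrix, numpy.linalg.eigh returns real eigenvalues and an orthonormal eigenvector basis.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
A = M + M.T                      # a real symmetric matrix

lam, T = np.linalg.eigh(A)       # real eigenvalues (ascending) and orthonormal eigenvectors

print(np.allclose(T.T @ T, np.eye(4)))            # T is orthogonal
print(np.allclose(T.T @ A @ T, np.diag(lam)))     # T^T A T is diagonal
```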

2.1 Generalizations: Normal, Hermitian and Unitary matrices

Definition 2.1 A matrix A in C^{n x n} is:
1. Hermitian if A^* = A, where A^* = A-bar^T, and skew-Hermitian if A^* = -A.
2. Unitary if A A^* = A^* A = I.
3. Normal if A A^* = A^* A.
Real unitary matrices are called orthogonal. Both have orthonormal columns with respect to the proper inner product. Clearly, Hermitian, skew-Hermitian and unitary matrices are all normal. It can be shown that all normal matrices are diagonalizable. In particular,
Proposition 2.2 Let A in C^{n x n} be Hermitian. Then
1. The eigenvalues of A are real and the eigenvectors can be chosen to form an orthonormal basis of C^n.
2. A is diagonalizable by a unitary transformation: i.e. there exists T in C^{n x n} with T^* T = T T^* = I such that T^* A T is diagonal.
Problem 2.1 Prove that all eigenvalues of a Hermitian positive semidefinite matrix are nonnegative.

2.2 Change of Basis

Let V be a finite-dimensional vector space with basis u_1, ..., u_n and let A : V -> V be a linear map represented in the given basis by the n x n matrix A. Let u-hat_1, ..., u-hat_n be another basis of V.
Problem 2.2 How does the matrix A change under the change of basis u_1, ..., u_n -> u-hat_1, ..., u-hat_n?
Answer:
Theorem 2.1 Let T be the matrix of coordinate vectors whose k-th column represents u-hat_k in terms of the old basis u_1, ..., u_n. Then the matrix representation of A with respect to the new basis is A-hat = T^{-1} A T.
Proof: using matrix notation,
u-hat_k = sum_j t_{j,k} u_j   <=>   [u-hat_1 ... u-hat_n] = [u_1 ... u_n] T.   (2.5)

Clearly T must be invertible (show this by contradiction). Then write A u-hat_k := sum_j A-hat_{j,k} u-hat_j, k = 1, ..., n, as a row matrix
[A u-hat_1 ... A u-hat_n] = [u-hat_1 ... u-hat_n] A-hat = [u_1 ... u_n] T A-hat.
Take any vector x in V having coordinate vector xi-hat in C^n with respect to the new basis and xi with respect to the old basis, and let eta = A xi and eta-hat := A-hat xi-hat, so that
[u-hat_1 ... u-hat_n] eta-hat = [u-hat_1 ... u-hat_n] A-hat xi-hat = [u_1 ... u_n] T A-hat xi-hat;
but in the old basis the left member [u-hat_1 ... u-hat_n] eta-hat is [u_1 ... u_n] A xi, so that by uniqueness of the coordinates A xi = T A-hat xi-hat, and since from (2.5) it follows that xi-hat = T^{-1} xi, we have the assertion.

2.3 Similarity

Definition 2.2 Matrices A and B in C^{n x n} are similar if there is a nonsingular T in C^{n x n} such that B = T^{-1} A T.
Problem 2.3 Show that similar matrices have the same eigenvalues (easy!).
The following is a classical problem in Linear Algebra.
Problem 2.4 Show that a matrix is similar to its own transpose; i.e. there is a nonsingular T in R^{n x n} such that A^T = T^{-1} A T.
Hence A and A^T must have the same eigenvalues. Solving this problem requires the use of the Jordan Form, which we shall not dig into. But you may easily prove that
Proposition 2.3 A Jordan block matrix J(lambda) (the square matrix with lambda on the main diagonal, 1 on the first subdiagonal and zeros elsewhere) is similar to its transpose.

2.4 Similarity again

The following is the motivation for introducing the Jordan Form of a matrix.
Problem 2.5 Find necessary and sufficient conditions for two matrices of the same dimension to be similar.
Having the same eigenvalues is just a necessary condition. For example you may check that the two matrices
J_1(lambda) := [lambda 0; 1 lambda],   J_2(lambda) := [lambda 0; 0 lambda]
have the same eigenvalue(s) but are not similar. The Jordan Canonical Form of a matrix A is a block-diagonal matrix made of square Jordan blocks like J(lambda_k) (not necessarily distinct), of dimension >= 1, where the lambda_k are the (distinct) eigenvalues of A.

Some Jordan blocks may actually be of dimension 1 x 1, so the Jordan Canonical Form may contain a diagonal submatrix. For example the identity matrix is already in Jordan Canonical Form. The Jordan form is unique modulo permutation of the sub-blocks.
Theorem 2.2 Two matrices of the same dimension are similar if and only if they have the same Jordan Canonical Form.

2.5 Problems
1. Describe the Jordan canonical form of a symmetric matrix.
2. Show that any Jordan block J(lambda) of arbitrary dimension n x n has just one eigenvector.
3. Compute the second and third power of a Jordan block J(lambda) of dimension 3 x 3 and find the normalized eigenvectors of J(lambda), J(lambda)^2, J(lambda)^3.

2.6 Skew-Hermitian matrices (1)

Recall that a matrix A in C^{n x n} is skew-Hermitian if A^* = -A, and skew-symmetric if A^T = -A.
Problem 2.6 Prove that for skew-Hermitian matrices the quadratic form x-bar^T A x must be identically zero.
Therefore A is positive semidefinite if and only if its Hermitian component
A_H := (1/2)[A + A^*]
is such. Hence there is no loss of generality in assuming that a positive semidefinite matrix is Hermitian (or symmetric in the real case).
Problem 2.7 Prove that a skew-symmetric matrix of odd dimension is always singular. Is this also true for skew-Hermitian matrices?

2.7 Skew-Symmetric matrices (2)

The eigenvalues of a skew-symmetric matrix always come in pairs +/-lambda (except in the odd-dimensional case, where there is an additional unpaired 0 eigenvalue).
Problem 2.8 Show that for a real skew-symmetric matrix
chi_A(lambda) = chi_{A^T}(lambda) = (-1)^n chi_A(-lambda).
Hence the nonzero eigenvalues of a real skew-symmetric matrix are all purely imaginary and thus of the form i lambda_1, -i lambda_1, i lambda_2, -i lambda_2, ..., where each of the lambda_k is real. Hint: the characteristic polynomial of a real skew-symmetric matrix has real coefficients.
Since the eigenvalues of a real skew-symmetric matrix are imaginary, it is not possible to diagonalize one by a real matrix. However, it is possible to bring every skew-symmetric matrix to a block-diagonal form by an orthogonal transformation.

Proposition 2.4 Every 2n x 2n real skew-symmetric matrix can be written in the form A = Q Sigma Q^T, where Q is orthogonal and Sigma is block diagonal with 2 x 2 blocks
[0  lambda_k; -lambda_k  0],  k = 1, ..., r,
on the diagonal and zeros elsewhere, for real lambda_k. The nonzero eigenvalues of this matrix are +/- i lambda_k. In the odd-dimensional case Sigma always has at least one row and column of zeros.
The proof is based on the fact that a matrix M is Hermitian if and only if iM is skew-Hermitian. In particular, if A is real skew-symmetric then iA is Hermitian, and A is unitarily similar to a diagonal matrix with +/- i lambda_k on the main diagonal.

More generally, every complex skew-symmetric matrix can be written in the form A = U Sigma U^T, where U is unitary and Sigma has the block-diagonal form given above with complex lambda_k. This is an example of the Youla decomposition of a complex square matrix.
The following is a remarkable relation between orthogonal and skew-symmetric matrices for n = 2:
[cos theta  -sin theta; sin theta  cos theta] = exp( theta [0 -1; 1 0] ).
In fact, the matrix on the left is just a representation of a general rotation matrix in R^{2 x 2}. This exponential representation of orthogonal matrices holds in general. In R^3 it is the relation between rotations and angular velocity. In fact the external (or wedge) product omega ^ v is just the action on the coordinates of v of the skew-symmetric matrix
omega^ = [0  -omega_z  omega_y; omega_z  0  -omega_x; -omega_y  omega_x  0]
and a rotation in R^3 can be represented by an orthogonal matrix R in R^{3 x 3} given by the exponential of a skew-symmetric matrix like omega^.
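
A small numerical sketch of this exponential relation using scipy.linalg.expm (illustration only; the angle theta and the vector omega below are arbitrary values).

```python
import numpy as np
from scipy.linalg import expm

# 2x2 case: exp of theta*[[0,-1],[1,0]] is the rotation by theta.
theta = 0.7
J = np.array([[0.0, -1.0], [1.0, 0.0]])
R2 = expm(theta * J)
print(np.allclose(R2, [[np.cos(theta), -np.sin(theta)],
                       [np.sin(theta),  np.cos(theta)]]))   # True

# 3x3 case: the cross-product matrix of omega exponentiates to a rotation.
w = np.array([0.3, -0.2, 0.5])
W = np.array([[0.0, -w[2], w[1]],
              [w[2], 0.0, -w[0]],
              [-w[1], w[0], 0.0]])
R3 = expm(W)
print(np.allclose(R3.T @ R3, np.eye(3)), np.isclose(np.linalg.det(R3), 1.0))  # True True
```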

2.8 Square roots of positive semidefinite matrices

Let A in R^{n x n} with A = A^T >= 0. Even if A > 0, there are many, in general rectangular, matrices Q such that Q Q^T = A. Any such matrix is called a square root of A. However, there is only one symmetric square root.
Proposition 2.5 Let A = A^T >= 0. Then there exists a unique matrix A^{1/2} such that A^{1/2} = (A^{1/2})^T >= 0 and A^{1/2}(A^{1/2})^T = (A^{1/2})^2 = A.
Proof. Existence: let T be such that T^T T = I and T^T A T = diag(lambda_1, ..., lambda_n) with lambda_i >= 0, so that A = T diag(lambda_1, ..., lambda_n) T^T. Then
A^{1/2} := T diag(sqrt(lambda_1), ..., sqrt(lambda_n)) T^T
has the desired properties. Uniqueness: see below.

Problem 2.9 Prove that if v is an eigenvector of A with eigenvalue lambda then it is also an eigenvector of A^{1/2} with eigenvalue sqrt(lambda). (Hint: prove that if v is an eigenvector of A with eigenvalue lambda, then T^T v is an eigenvector of diag(lambda_1, ..., lambda_n) with the same eigenvalue. Since the latter matrix is diagonal, this means that only some of the entries of T^T v can be different from zero (which ones?). It follows that T^T v is also an eigenvector of diag(sqrt(lambda_1), ..., sqrt(lambda_n)) with eigenvalue sqrt(lambda), and hence the conclusion.)
Now let S = S^T >= 0 be such that S^2 = S S = A. We now prove that S = A^{1/2}. Let U be an orthogonal matrix (U U^T = U^T U = I) diagonalizing S. This means that U^T S U = D, where D = diag(d_1, d_2, ..., d_n) is a diagonal matrix and d_i >= 0. Then U^T A U = U^T S S U = U^T S U D = D^2, i.e. the i-th column of U is an eigenvector of A with eigenvalue d_i^2. In view of Problem 2.9, the i-th column of U is also an eigenvector of A^{1/2} with eigenvalue d_i. Then U^T S U = U^T A^{1/2} U = D, i.e. S = A^{1/2}.
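
The construction in the existence proof is easy to check numerically; the sketch below (illustrative, with a random positive definite test matrix) also compares it with scipy.linalg.sqrtm.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(4)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)            # symmetric positive definite

# A^{1/2} = T diag(sqrt(lambda)) T^T from the spectral decomposition (Proposition 2.5).
lam, T = np.linalg.eigh(A)
A_half = T @ np.diag(np.sqrt(lam)) @ T.T

print(np.allclose(A_half, A_half.T))        # symmetric
print(np.allclose(A_half @ A_half, A))      # squares back to A
print(np.allclose(A_half, sqrtm(A)))        # agrees with scipy.linalg.sqrtm
```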

2.9 Projections in R^n

Here we work in the real vector space R^n endowed with the Euclidean inner product. More general notions of projection will be encountered later.
Definition 2.3 Let Pi in R^{n x n}. Pi is a projection matrix if Pi = Pi^2. Pi is an orthogonal projection if it is a projection and Pi = Pi^T.
Note that Pi is an (orthogonal) projection <=> I - Pi is also an (orthogonal) projection. Let Pi be a projection and V = Im(Pi). We say that Pi projects onto V.
Proposition 2.6 If Pi is an orthogonal projector that projects onto V, then I - Pi projects onto V^perp.
Proof. For any x, y in R^n, v := Pi x is orthogonal to w := (I - Pi)y (in fact v^T w = x^T Pi (I - Pi) y = 0), hence Im(I - Pi) is contained in V^perp. Conversely, let x in V^perp; then 0 = x^T Pi x = x^T Pi Pi x = x^T Pi^T Pi x = ||Pi x||^2, so Pi x = 0, i.e. (I - Pi)x = x and x in Im(I - Pi).

Proposition 2.7 If Pi is an orthogonal projection, then sigma(Pi) is contained in {0, 1}; i.e. the eigenvalues are either 0 or 1. Pi is in fact similar to [I 0; 0 0], where the dimension of the identity block is that of the range space.
Proof. Pi is symmetric, hence diagonalizable by Proposition 2.1, and since it is positive semidefinite (as x^T Pi x = x^T Pi^T Pi x = ||Pi x||^2 >= 0) it has real nonnegative eigenvalues. If Pi v = lambda v with v != 0, then lambda v = Pi v = Pi^2 v = lambda^2 v, so lambda = lambda^2, i.e. lambda in {0, 1}.
Proposition 2.8 Let Pi_k be orthogonal projections onto V_k = Im(Pi_k), k = 1, 2. Then
1. Pi := Pi_1 + Pi_2 is a projection <=> V_1 is orthogonal to V_2, in which case Pi projects onto the orthogonal direct sum V_1 (+) V_2.
2. Pi := Pi_1 Pi_2 is a projection iff Pi_1 Pi_2 = Pi_2 Pi_1, in which case Pi projects onto the intersection of V_1 and V_2.
For a proof see the book of Halmos [8, pp. 44-49]. This material lies at the foundations of spectral theory in Hilbert spaces.

2.10 Projections on general inner product spaces

Problem 2.10 Prove that if A is a linear map in an arbitrary inner product space (V, <.,.>) then
V = Im(A) (+) ker(A^*) = ker(A) (+) Im(A^*).
Hence if A is self-adjoint then V = Im(A) (+) ker(A). (Hint: the proof follows from the proof of Proposition 1.3.)
Definition 2.4 In an arbitrary inner product space (V, <.,.>) an idempotent linear map P : V -> V, i.e. such that P^2 = P, is called a projection. If P is self-adjoint, i.e. P^* = P, the projection is an orthogonal projection.
Hence if X is the range space of a projection, we have PX = X, and if P is an orthogonal projection, the orthogonal complement X^perp is the kernel, i.e. P X^perp = 0. Note that all facts exposed in Section 2.9 hold true in this more general context, provided you substitute the Euclidean space R^n with (V, <.,.>) and the transpose with the adjoint.

2.11 Gramians

Let {v_1, ..., v_n} be vectors in (V, <.,.>). Their Gramian is the Hermitian (in the real case symmetric) matrix
G(v_1, ..., v_n) := [<v_i, v_j>]_{i,j=1,...,n},
the matrix whose (i, j) entry is <v_i, v_j>. The Gramian is always positive semidefinite. In fact, let v = sum_k x_k v_k with x_k in C; then
||v||^2 = x^* G(v_1, ..., v_n) x
where x := [x_1 ... x_n]^T. If the v_k's are linearly independent, x is the vector of coordinates of v.
Problem 2.11 Show that G(v_1, ..., v_n) is positive definite if and only if the vectors {v_1, ..., v_n} are linearly independent.

2.12 Example: Polynomial vector spaces

Let (V, <.,.>) be the vector space of real polynomials restricted to the interval [-1, 1] with inner product
<p, q> := integral from -1 to +1 of p(x) q(x) dx.
This space is not finite dimensional, but if we only consider polynomials of degree less than or equal to n we obtain, for each n, a vector subspace of dimension n + 1. V has a natural basis consisting of the monomials 1, x, x^2, ..., the coordinates of a vector p(x) in V with respect to this basis being just the ordinary coefficients of the polynomial. To find an orthogonal basis we may use the classical Gram-Schmidt sequential orthogonalization procedure (see Section 4.3). In this way we obtain the Legendre Polynomials
P_0(x) = 1, P_1(x) = x, P_2(x) = (1/2)(3x^2 - 1), P_3(x) = (1/2)(5x^3 - 3x), P_4(x) = (1/8)(35x^4 - 30x^2 + 3), P_5(x) = (1/8)(63x^5 - 70x^3 + 15x), P_6(x) = (1/16)(231x^6 - 315x^4 + 105x^2 - 5), etc.
There are books written about polynomial vector spaces, see [4].
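
As an illustration (not in the notes), one can verify the orthogonality of the listed polynomials with numpy's Legendre module; the diagonal Gramian entries come out as 2/(2k+1).

```python
import numpy as np
from numpy.polynomial import legendre as L

def inner(p, q):
    """<p, q> = integral of p(x) q(x) over [-1, 1], exact for polynomials."""
    prod = L.legmul(p, q)          # product, coefficients in the Legendre basis
    integ = L.legint(prod)         # antiderivative
    return L.legval(1.0, integ) - L.legval(-1.0, integ)

# P_k in the Legendre basis is the k-th coordinate (unit) vector.
P = [np.eye(7)[k] for k in range(7)]

gram = np.array([[inner(P[i], P[j]) for j in range(7)] for i in range(7)])
print(np.allclose(gram, np.diag(2.0 / (2 * np.arange(7) + 1))))   # True
```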

3 LINEAR LEAST SQUARES PROBLEMS

Problem 3.1 Fit, in some reasonable way, a parametric model of known structure to measured data.
Given: measured output data (y_1, ..., y_N), assumed real-valued for now, and input (or exogenous) variables (u_1, ..., u_N), in N experiments, plus a candidate class of parametric models (from a priori information)
y-hat_t(theta) = f(u_t, theta, t),  t = 1, ..., N,  theta in Theta, a subset of R^p.
Use a quadratic approximation criterion
V(theta) := sum_{t=1}^{N} [y_t - y-hat_t(theta)]^2 = sum_{t=1}^{N} [y_t - f(u_t, theta, t)]^2.
The best model corresponds to the value(s) theta-hat of theta minimizing V(theta):
V(theta-hat) = min over theta in Theta of V(theta).
This is a simple empirical rule for constructing models from measured data. It may come out of statistical estimation criteria in problems with probabilistic side information. Obviously theta-hat depends on (y_1, ..., y_N) and (u_1, ..., u_N); theta-hat = theta-hat(y_1, ..., y_N; u_1, ..., u_N) is called a Least-Squares Estimator of theta. No statistical significance is attached to this word.

3.1 Weighted Least Squares

It is reasonable to weight the modeling errors by some positive coefficients q_t corresponding to more or less reliable results of the experiments. This leads to Weighted Least Squares criteria of the type
V_Q(theta) := sum_{t=1}^{N} q_t [y_t - f(u_t, theta, t)]^2,
where q_1, ..., q_N are positive numbers, which are large for reliable data and small for bad data. In general one may introduce a symmetric positive-definite weight matrix Q and let
V_Q(theta) = [y - f(u, theta)]^T Q [y - f(u, theta)] = ||y - f(u, theta)||^2_Q,   y = [y_1 ... y_N]^T,   f(u, theta) = [f(u_1, theta, 1) ... f(u_N, theta, N)]^T.   (3.6)
The minimization of V_Q(theta) can be done analytically when the model is linear in the parameters, that is
f(u_t, theta, t) = sum_{i=1}^{p} s_i(u_t, t) theta_i,  t = 1, ..., N.

Since u_t is a known quantity we can rewrite this as f(u_t, theta, t) := s^T(t) theta, with s^T(t) a p-dimensional row vector which is a known function of u and of the index t. Using vector notation and introducing the N x p signal matrix
S = [s^T(1); ...; s^T(N)]
we get the linear model class { y-hat_theta = S theta, theta in Theta } and the problem becomes to minimize with respect to theta the quadratic form
V_Q(theta) = [y - S theta]^T Q [y - S theta] = ||y - S theta||^2_Q.   (3.7)
The minimization can be done by elementary calculus. However, it is more instructive to do it by geometric means using the Orthogonal Projection Lemma. Make R^N into an inner product space by introducing the inner product <x, y>_Q = x^T Q y and let the corresponding norm be denoted by ||.||_Q. Let S (script) be the linear subspace of R^N spanned by the columns of the matrix S. Then the minimization of ||y - S theta||^2_Q is just the minimum-distance problem of finding the vector y-hat in S of shortest distance from the data vector y. See the picture below.

3.2 Solution by the Orthogonality Principle

[Figure: the data vector y, the subspace S = span(S), and the orthogonal projection S theta-hat of y onto S.]

The minimizer of V_Q(theta) = ||y - S theta||^2_Q must render the error y - S theta orthogonal (according to the scalar product <x, y>_Q) to the subspace S, or, equivalently, to the columns of S, that is
S^T Q (y - S theta) = 0,
which can be rewritten as
S^T Q S theta = S^T Q y.   (3.8)
These are the famous normal equations of the Least-Squares problem.

Let us first assume that
rank S = p <= N.   (3.9)
This is an identifiability condition of the model class: each model corresponds 1:1 to a unique value of the parameter. Under this condition equation (3.8) has a unique solution, which we denote theta-hat(y), given by
theta-hat(y) = [S^T Q S]^{-1} S^T Q y,   (3.10)
which is a linear function of the observations y. For short we shall write theta-hat(y) = Ay. Then S theta-hat(y) := SAy is the orthogonal projection of y onto the subspace S = span(S). In other words the matrix P in R^{N x N}, defined as P = SA, is the orthogonal projector, with respect to the inner product <.,.>_Q, from R^N onto S. In fact P is idempotent (P = P^2), since SASA = S(AS)A = SA (note that AS = I); however P is not symmetric, as it would be in the ordinary Euclidean metric, but rather
P^T = (SA)^T = A^T S^T = Q S [S^T Q S]^{-1} S^T = Q S A Q^{-1} = Q P Q^{-1},   (3.11)
so P^T is just similar to P. Actually, from (1.4) we see that P is a self-adjoint operator with respect to the inner product <.,.>_Q. Therefore the projection P of the least squares problem is self-adjoint, as all bona fide orthogonal projectors should be.
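
A small synthetic example of (3.8)-(3.11) in Python (all data and names below are made up for illustration): the weighted LS estimate solves the normal equations, and P = SA is idempotent and self-adjoint with respect to <.,.>_Q.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 20, 3
S = rng.standard_normal((N, p))                  # signal matrix, rank p
Q = np.diag(rng.uniform(0.5, 2.0, size=N))       # positive definite weight matrix
y = S @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)

# Normal equations (3.8): S^T Q S theta = S^T Q y.
theta_hat = np.linalg.solve(S.T @ Q @ S, S.T @ Q @ y)

# Projection P = S [S^T Q S]^{-1} S^T Q onto span(S), orthogonal w.r.t. <.,.>_Q.
A = np.linalg.solve(S.T @ Q @ S, S.T @ Q)
P = S @ A
print(np.allclose(P @ P, P))                            # idempotent
print(np.allclose(P.T @ Q, Q @ P))                      # Q-self-adjoint, cf. (3.11)
print(np.allclose(S.T @ Q @ (y - S @ theta_hat), 0))    # residual orthogonal to span(S)
```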

3.3 Matrix least-squares Problems

We first discuss a dual, row-wise version of the least-squares problem: y is now an N-dimensional row vector which we want to model as theta^T S, where S is a signal matrix whose rowspace S is made of known N-dimensional row vectors. Consider the dual LS problem
min over theta of ||y - theta^T S||_Q.
Problem 3.2 Describe the solution of the dual LS problem.
A matrix generalization, which has applications to statistical system identification, follows. Arrange N successive observations made in parallel from m channels as rows of an m x N matrix Y (we shall only worry about real data here). The k-th row collects the N measurements from the k-th channel:
y_k := [y_{k,1} y_{k,2} ... y_{k,N}],   Y := [y_1; y_2; ...; y_m].
We want to model each y_k as a distinct linear combination, via p parameters, of the rows of a given signal matrix S. We assume S in R^{p x N} with the same number of columns as Y.

One may then generalize the standard LS problem to matrix-valued data as follows. Let Y in R^{m x N} and S in R^{p x N} be known real matrices and consider the problem
min over Theta in R^{m x p} of ||Y - Theta S||_F   (Frobenius norm)   (3.12)
where Theta in R^{m x p} is an unknown matrix parameter. The Frobenius norm could actually be weighted by a positive definite weight matrix Q. The problem can be solved for each row y_k by the orthogonality principle. Let S be the rowspace of S and denote by theta_k^T S, theta_k in R^p, a vector in S. Then the optimality condition is y_k - theta_k^T S orthogonal to S, i.e.
y_k Q S^T = theta_k^T S Q S^T,  k = 1, 2, ..., m,
so that, assuming S of full row rank p, the solution is
Theta-hat = Y Q S^T [S Q S^T]^{-1}.

3.4 A problem from subspace identification

Assume you observe the trajectories of the state, input and output variables of a linear MIMO stationary stochastic system
[x(t+1); y(t)] = [A B; C D] [x(t); u(t)] + [K; J] w(t)
where w is white noise. With the observed trajectories from some time t onwards one constructs the data matrices (all having N + 1 columns)
Y_t := [y_t, y_{t+1}, y_{t+2}, ..., y_{t+N}]
U_t := [u_t, u_{t+1}, u_{t+2}, ..., u_{t+N}]
X_t := [x_t, x_{t+1}, x_{t+2}, ..., x_{t+N}]
X_{t+1} := [x_{t+1}, x_{t+2}, ..., x_{t+N+1}]
If the data obey the linear equation above, there must exist a corresponding white noise trajectory W_t := [w_t, w_{t+1}, w_{t+2}, ..., w_{t+N}] such that
[X_{t+1}; Y_t] = [A B; C D] [X_t; U_t] + [K; J] W_t.
From this model one can now attempt to estimate the matrix parameter Theta := [A B; C D] based on the observed data. This leads to a matrix LS problem of the kind formulated on the previous page. In practice the state trajectory is not observable and must previously be estimated from input-output data.

3.5 Relation with Left- and Right-Inverses

It is obvious that A = [S^T Q S]^{-1} S^T Q is a left-inverse of S, i.e. AS = I, for any non-singular Q. Left- and right-inverses are related to least-squares problems. Let A in R^{m x n} and let Q_1 and Q_2 be symmetric positive definite. Consider the following weighted least-squares problems:
If rank A = n,  min over x in R^n of ||Ax - b_2||^2_{Q_2}   (3.13)
If rank A = m,  min over y in R^m of ||A^T y - b_1||^2_{Q_1}   (3.14)
where b_1 in R^n, b_2 in R^m are fixed column vectors. From formula (3.10) (and its dual) we get:
Proposition 3.1 The solution to Problem (3.13) can be written as x = A^{-L} b_2, where A^{-L} is the left-inverse given by
A^{-L} = [A^T Q_2 A]^{-1} A^T Q_2,
while that of Problem (3.14) can be written as y^T = b_1^T A^{-R}, where A^{-R} is the right-inverse given by
A^{-R} = Q_1 A^T [A Q_1 A^T]^{-1}.

Conversely, we can show that any left- or right-inverse admits a representation as a solution of a weighted Least-Squares problem.
Proposition 3.2 Assume rank A = n and let A^{-L} be a particular left-inverse of A. Then there is a Hermitian positive-definite matrix Q such that
A^{-L} = [A^* Q A]^{-1} A^* Q   (3.15)
and, in case rank A = m, a dual statement holds for an arbitrary right-inverse.
Proof. The property of being a left-inverse is independent of the metrics on C^m and C^n. Hence we may assume Euclidean metrics. Since A has linearly independent columns we can write A = R A-tilde, where A-tilde := [I; 0] and R in C^{m x m} is invertible. Any left-inverse A-tilde^{-L} must be of the form A-tilde^{-L} = [I T] with T arbitrary. There exists a square matrix Q-tilde such that A-tilde^* Q-tilde A-tilde = Q-tilde_{11} is invertible and
(A-tilde^* Q-tilde A-tilde)^{-1} A-tilde^* Q-tilde = [I T] = A-tilde^{-L}.
In fact, just let T = Q-tilde_{11}^{-1} Q-tilde_{12}, which is clearly still arbitrary. Without loss of generality we may actually choose Q-tilde_{11} = I, Q-tilde_{12} = T.

To get a representation of A-tilde^{-L} of the form (3.15), we just need to make sure that there exists Q-tilde with Q-tilde = Q-tilde^* > 0. To this end we may just choose Q-tilde_{21} = Q-tilde_{12}^* and Q-tilde_{22} = Q-tilde_{22}^* > Q-tilde_{21} Q-tilde_{11}^{-1} Q-tilde_{12}. In general, A^{-L} A = I means that A^{-L} R A-tilde = I, i.e. A^{-L} R is a left-inverse of A-tilde; that is
A^{-L} R = A-tilde^{-L} = (A-tilde^* Q-tilde A-tilde)^{-1} A-tilde^* Q-tilde = [A^* (R^{-1})^* Q-tilde R^{-1} A]^{-1} A^* (R^{-1})^* Q-tilde   (3.16)
and, renaming Q := (R^{-1})^* Q-tilde R^{-1}, we get A^{-L} = (A^* Q A)^{-1} A^* Q. Since R is invertible, Q can be taken to be Hermitian and positive definite. The statement is proved.
Problem 3.3 For a fixed A of full column rank the left-inverses can be parametrized in terms of Q. Is this parametrization 1:1?
The full-rank condition simplifies the discussion but is not essential. When rank S < p but still p <= N, the model can be reparametrized by using a smaller number of parameters.
Problem 3.4 Show how to reparametrize in a 1:1 way a model with rank S < p but still p <= N.

3.6 The Pseudoinverse

For an arbitrary A, the least squares problems (3.13), (3.14) have no unique solution. When (3.9) does not hold, one can bring in the pseudoinverse of S^T Q S. The following definition is for arbitrary weighted inner-product spaces.
Consider a linear map between finite-dimensional inner product spaces A : X -> Y. For concreteness we may think of A as an m x n (in general complex) matrix and of the spaces as endowed with weighted inner products <x_1, x_2>_{C^n} = x_1-bar^T Q_1 x_2 and <y_1, y_2>_{C^m} = y_1-bar^T Q_2 y_2, where Q_1, Q_2 are Hermitian positive definite matrices. Recall the basic fact which holds for arbitrary linear operators on finite-dimensional inner product spaces A : X -> Y.
Lemma 3.1 We have
X = ker(A) (+) Im(A^*),  Y = Im(A) (+) ker(A^*)   (3.17)
where the orthogonal complements and the adjoint are with respect to the inner products in X and Y.
Below is a key observation for the introduction of generalized inverses of a linear map.

Proposition 3.3 The restriction of A to the orthogonal complement of its nullspace, (ker A)^perp = Im A^*, is a bijective map onto its range Im A.
Proof. Let y_1 be an arbitrary element of Im(A), so that y_1 = Ax for some x in C^n, and let x = x_1 + x_2 be split according to the orthogonal decomposition C^n = ker(A) (+) Im A^*. Then there is an x_2 such that y_1 = A x_2. This x_2 in Im A^* must be unique, since A(x_2 - x_2') = 0 implies x_2 - x_2' in ker(A), which is orthogonal to Im A^*, so that it must be that x_2 - x_2' = 0. Therefore the restriction of A to Im A^* is injective. Hence the restriction of A to Im(A^*) is a map onto Im(A) which has an inverse. This inverse can be extended to the whole space Y by making its kernel equal to the orthogonal complement of Im(A). The extension is called the Moore-Penrose generalized inverse, or simply the pseudoinverse, of A and is denoted A-dagger.
Proposition 3.4 The pseudoinverse A-dagger is the unique linear transformation Y -> X which satisfies the following two conditions:
x in Im(A^*)  =>  A-dagger A x = x   (3.18)
y in ker(A^*)  =>  A-dagger y = 0.   (3.19)
Moreover
Im A-dagger = Im A^*,  ker(A-dagger) = ker(A^*).   (3.20)

[Figure 1: Proposition 3.3 -- A restricted to Im(A^*) is a bijection onto Im(A).]

Proof. Equation (3.18) follows by definition of the inverse of the map A restricted to Im(A^*). The second equation defines the map A-dagger on the orthogonal complement of Im(A); in fact Im(A)^perp = ker(A^*). Therefore (3.18), (3.19) define A-dagger as a linear map unambiguously on the whole space Y. This just means that A-dagger is the unique linear map satisfying the two conditions (3.18), (3.19).

Corollary 3.1 Let A in C^{m x n} be block-diagonal: A = diag{A_1, 0} where A_1 in C^{p x p}, p < n, is invertible. Then, irrespective of the inner product in C^n,
A-dagger = [A_1^{-1} 0; 0 0].   (3.21)
Problem 3.5 Prove Corollary 3.1. (Hint: identify the various subspaces and use the basic relations (3.18), (3.19).)
The following facts follow from (3.18), (3.19).
Proposition 3.5
1. A-dagger A and A A-dagger are self-adjoint maps.
2. A-dagger A is the orthogonal projector of X onto Im(A^*).
3. I - A-dagger A is the orthogonal projector of X onto ker(A).
4. A A-dagger is the orthogonal projector of Y onto Im(A).
5. I - A A-dagger is the orthogonal projector of Y onto ker(A^*).
Proof. Let x = x_1 + x_2 be the orthogonal decomposition of x induced by X = ker(A) (+) Im(A^*). To prove (1) take any x, z in X and note that by (3.18), <z, A-dagger A x>_X =

<z, A-dagger A x_2>_X = <z_2, x_2>_X = <A-dagger A z_2, x>_X = <A-dagger A z, x>_X.
2. Clearly A-dagger A is idempotent, since by (3.18) A-dagger A x = A-dagger A x_2 = x_2 and hence A-dagger A A-dagger A x = A-dagger A x_2 = A-dagger A x. The same is obviously true for A A-dagger. Hence by (1) A-dagger A is an orthogonal projection onto Im(A^*).
To prove (3) and (5) let, dually, b = b_1 + b_2 be the orthogonal decomposition of b in Y = ker(A^*) (+) Im(A); then, by (3.19),
A A-dagger b = A A-dagger b_2 = b_2,
since A-dagger is the inverse of A restricted to Im(A).
Problem 3.6 Describe the pseudoinverse of an orthogonal projection P.

Certain properties of the inverse are shared by the pseudoinverse only to a limited extent.
Products: if A in R^{m x n}, B in R^{n x p}, the product formula (AB)^+ = B^+ A^+ is generally not true. It holds if A has orthonormal columns (i.e. A^T A = I_n), or B has orthonormal rows (i.e. B B^T = I_n), or A has all columns linearly independent (full column rank) and B has all rows linearly independent (full row rank). The last property yields the solution to
Problem 3.7 Let A in R^{n x m} with rank(A) = r. Consider a full-rank factorization of the form A = LR with L in R^{n x r}, R in R^{r x m}. Prove that
A^+ = R^T (L^T A R^T)^{-1} L^T.
Problem 3.8 Prove that [A^T]^+ = [A^+]^T.
Show that the result below is not true for arbitrary invertible maps (matrices) T_1, T_2.
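
A numerical check of the full-rank-factorization formula of Problem 3.7 (illustrative; the rank-2 matrix and its factors are generated at random).

```python
import numpy as np

rng = np.random.default_rng(6)
# Build a 5x4 matrix of rank 2 through a full rank factorization A = L R.
L = rng.standard_normal((5, 2))
R = rng.standard_normal((2, 4))
A = L @ R

# A^+ = R^T (L^T A R^T)^{-1} L^T
A_plus = R.T @ np.linalg.inv(L.T @ A @ R.T) @ L.T
print(np.allclose(A_plus, np.linalg.pinv(A)))     # matches the Moore-Penrose pseudoinverse
```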

Proposition 3.6 Let A : X -> Y and let T_1 : X -> X_1 and T_2 : Y -> Y_2 be unitary maps, i.e. T_1^* T_1 = I, T_2^* T_2 = I. Then
(T_1 A T_2)^dagger = T_2^{-1} A-dagger T_1^{-1} = T_2^* A-dagger T_1^*.   (3.22)
Problem 3.9 Show that A^+ = (A^T A)^+ A^T and A^+ = A^T (A A^T)^+, and thereby prove the product formulas
A^+ (A^T)^+ = (A^T A)^+,  (A^T)^+ A^+ = (A A^T)^+.

Least Squares and the Moore-Penrose pseudo-inverse

The following result provides a characterization of the pseudoinverse in terms of least squares.
Theorem 3.1 The vector x_0 := A-dagger b is the minimizer of the least-squares problem
min over x in R^n of ||Ax - b||_{Q_2}
which has minimum Q_1-norm.
Proof. Let V(x) := ||Ax - b||^2_{Q_2} and let L, M be square matrices such that L^* L = Q_1 and M^* M = Q_2. By defining x-hat := Lx and scaling A and b according to A-hat := M A L^{-1}, b-hat := M b, we can rephrase our problem in Euclidean metrics and rewrite V(x) as ||A-hat x-hat - b-hat||^2, where ||.|| is the Euclidean norm. Further, let
x-hat = x-hat_1 + x-hat_2,  x-hat_1 in ker(A-hat), x-hat_2 in Im(A-hat^*)   (3.23)
b-hat = b-hat_1 + b-hat_2,  b-hat_1 in Im(A-hat), b-hat_2 in ker(A-hat^*)   (3.24)

be the orthogonal sum decompositions according to (3.17). Now V(x) - V(x_0) is equal to
V(x-hat) - V(x-hat_0) = ||A-hat(x-hat_1 + x-hat_2) - (b-hat_1 + b-hat_2)||^2 - ||A-hat x-hat_0 - (b-hat_1 + b-hat_2)||^2
= ||(A-hat x-hat_2 - b-hat_1) - b-hat_2||^2 - ||(A-hat x-hat_0 - b-hat_1) - b-hat_2||^2
= ||A-hat x-hat_2 - b-hat_1||^2 + ||b-hat_2||^2 - (||A-hat x-hat_0 - b-hat_1||^2 + ||b-hat_2||^2)
= ||A-hat x-hat_2 - b-hat_1||^2 - ||A-hat A-hat^dagger (b-hat_1 + b-hat_2) - b-hat_1||^2
= ||A-hat x-hat_2 - b-hat_1||^2 >= 0,
the last equality following from Proposition 3.4. Hence x_0 = L^{-1} x-hat_0 is a minimum point of V(x). However, all x-hat = x-hat_1 + x-hat_2 such that A-hat x-hat_2 - b-hat_1 = 0 are also minimum points. For all these solutions it must however hold that x-hat_2 = x-hat_0, for A-hat(x-hat_2 - x-hat_0) = 0 implies that x-hat_2 - x-hat_0 in ker(A-hat), so that x-hat_2 - x-hat_0 must be zero. Hence
||x-hat||^2 = ||x-hat_1 + x-hat_2||^2 = ||x-hat_1 + x-hat_0||^2 = ||x-hat_1||^2 + ||x-hat_0||^2 >= ||x-hat_0||^2,
which is by definition equivalent to ||x||^2_{Q_1} >= ||x_0||^2_{Q_1}.

3.7 The Euclidean pseudoinverse

Below is the classical definition of the pseudoinverse of a matrix.
Theorem 3.2 The (Euclidean) Moore-Penrose pseudoinverse A^+ of a real (or complex) matrix A in R^{n x m} is the unique matrix satisfying the four conditions:
1. A A^+ A = A
2. A^+ A A^+ = A^+
3. A^+ A is symmetric (resp. Hermitian)
4. A A^+ is symmetric (resp. Hermitian)
Proof. The proof of existence is via the Singular Value Decomposition; see Theorem 3.6.
Proof of uniqueness: let A_1^+, A_2^+ be two matrices satisfying 1., 2., 3., 4. and let D := A_1^+ - A_2^+. Then:
From 1.: A D A = 0.
From 4.: A D = A A_1^+ - A A_2^+ is symmetric, so A D = D^T A^T; hence D^T A^T A = (A D) A = A D A = 0 and, transposing, A^T A D = 0, so the columns of D lie in ker(A^T A) = ker(A).
From 2.+3.: A^T (A_1^+)^T A_1^+ = A_1^+ and A^T (A_2^+)^T A_2^+ = A_2^+, so D = A^T [(A_1^+)^T A_1^+ - (A_2^+)^T A_2^+], i.e. the columns of D lie in Im(A^T) = [ker(A)]^perp.
Hence D = 0.
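
The four conditions are easy to verify numerically for numpy's pinv; an illustrative sketch with an arbitrary rank-deficient matrix.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))   # 6x4, rank 3
Ap = np.linalg.pinv(A)

print(np.allclose(A @ Ap @ A, A))            # 1. A A+ A = A
print(np.allclose(Ap @ A @ Ap, Ap))          # 2. A+ A A+ = A+
print(np.allclose((Ap @ A).T, Ap @ A))       # 3. A+ A symmetric
print(np.allclose((A @ Ap).T, A @ Ap))       # 4. A A+ symmetric
```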

3.8 The Pseudoinverse and Orthogonal Projections

The proposition below is the Euclidean version of Proposition 3.5.
Proposition 3.7 Let A in R^{n x m}. Then
1. A A^+ is the orthogonal projection onto Im(A).
2. A^+ A is the orthogonal projection onto Im(A^T).
3. I - A A^+ projects onto [Im(A)]^perp = ker(A^T).
4. I - A^+ A projects onto [Im(A^T)]^perp = ker(A).
Proof.
1. (A A^+)^2 = A A^+ A A^+ = A A^+. Moreover A A^+ is symmetric (Theorem 3.2, condition 4). Hence A A^+ is an orthogonal projection and Im(A A^+) is contained in Im(A). Conversely, Im(A) = Im(A A^+ A) is contained in Im(A A^+).
2. Similar proof: Im(A^+ A) = Im(A^T (A^T)^+) and, by 1. applied to A^T, Im(A^T (A^T)^+) = Im(A^T).
3. and 4. follow from 1. and 2.
These projections are all related to least squares problems.

Problem 3.10 Assume A is an m x n (in general complex) matrix acting on spaces endowed with weighted inner products <x_1, x_2>_{C^n} = x_1-bar^T Q_1 x_2 and <y_1, y_2>_{C^m} = y_1-bar^T Q_2 y_2, where Q_1, Q_2 are Hermitian positive definite matrices. Denote by A^+ the Euclidean pseudoinverse of A. What is the relation between A-dagger and A^+?
Solution: Let L, M be square matrices such that L^* L = Q_1 and M^* M = Q_2, so that T_1 : x -> Lx and T_2 : y -> My are unitary maps onto Euclidean spaces, T_1 : X -> C^n and T_2 : Y -> C^m. Since T_1 x = Lx, it follows from (1.4) that T_1^* = Q_1^{-1} L^* = L^{-1}, and similarly T_2^* = M^{-1}. Now the action of A can be decomposed as X -> C^n -> C^m -> Y through a certain matrix A-hat : C^n -> C^m, namely A = T_2^{-1} A-hat T_1; but T_2 is unitary, i.e. T_2^{-1} = T_2^*, and hence formula (3.22) applies. Since for A-hat on Euclidean spaces A-hat^dagger = A-hat^+, we get
A-dagger = T_1^{-1} A-hat^+ T_2 = L^{-1} A-hat^+ M,  with  A-hat = M A L^{-1}.

3.9 Linear equations

Let A in R^{n x m}, B in R^{n x p}. Consider the equation
A X = B.   (3.25)
Theorem 3.3
1. Equation (3.25) admits solutions iff the columns of B lie in Im(A) (i.e. Im(B) is contained in Im(A)).
2. If equation (3.25) admits solutions, all its solutions are given by
X = A^+ B + (I - A^+ A) C   (3.26)
for an arbitrary C in R^{m x p}.
Proof. If Im(B) is contained in Im(A) then there is a Y such that B = AY. Let X = A^+ B + (I - A^+ A) C = A^+ A Y + (I - A^+ A) C; then A X = A A^+ A Y + A(I - A^+ A) C = A Y + 0 = B. Conversely, let Z be a solution of (3.25) and Delta := Z - A^+ B; then A Delta = A Z - A A^+ B = A Z - B = 0, so the columns of Delta lie in
ker(A) = [Im(A^T)]^perp = [Im(A^+)]^perp = [Im(A^+ A)]^perp = Im(I - A^+ A),
i.e. there is a C such that Delta = (I - A^+ A) C.

Let A in R^{n x m}, B in R^{n x p}. Assume that A X = B admits solutions. What is the meaning of X = A^+ B?
Example: let p = 1 (i.e. X = x in R^m and B = b in R^n are vectors):
A x = b.   (3.27)
The solutions of this equation are all of the form x = A^+ b + (I - A^+ A) c, where the two terms are orthogonal; then
||x||^2 = [b^T (A^+)^T + c^T (I - A^+ A)^T][A^+ b + (I - A^+ A) c] = b^T (A^+)^T A^+ b + c^T (I - A^+ A)^2 c = ||A^+ b||^2 + ||(I - A^+ A) c||^2.
Hence x = A^+ b is the minimum norm solution of (3.27). For p > 1 use the Frobenius norm of a matrix X: ||X||_F := sqrt(tr(X^T X)) (note that ||X||^2_F = sum_{ij} X_{ij}^2). Since the solutions of (3.25) are all of the form X = A^+ B + (I - A^+ A) C,
||X||^2_F = ||A^+ B||^2_F + ||(I - A^+ A) C||^2_F,
so X = A^+ B is the minimum norm solution.
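
A sketch contrasting the minimum-norm solution x = A^+ b of an underdetermined system with another solution of the form (3.26) (arbitrary example data, for illustration only).

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((2, 5))        # underdetermined: many solutions of Ax = b
b = rng.standard_normal(2)

x_min = np.linalg.pinv(A) @ b          # minimum-norm solution A^+ b
c = rng.standard_normal(5)
x_other = x_min + (np.eye(5) - np.linalg.pinv(A) @ A) @ c   # another solution, cf. (3.26)

print(np.allclose(A @ x_min, b), np.allclose(A @ x_other, b))   # both solve Ax = b
print(np.linalg.norm(x_min) <= np.linalg.norm(x_other))         # True: A^+ b has minimum norm
```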

3.10 Unfeasible linear equations and Least Squares

Let A in R^{n x m}, B in R^{n x p} and assume that the columns of B do not all lie in Im(A). Then there is no X solving the equation A X = B. One may instead solve the equation in the (approximate) least squares sense (as MATLAB does).
Problem 3.11 Find X minimizing ||A X - B||_F.
Consider again the case p = 1 (minimize ||A x - b||). For every x in R^m there is a y in R^n such that A x = A A^+ y (this generalizes S theta = P y):
A x - b = A A^+ y - b = A A^+ (y - b + b) - b = A A^+ (y - b) + [-(I - A A^+) b]
where the first term belongs to Im(A) and the second to [Im(A)]^perp, so that
||A x - b||^2 = ||A A^+ (y - b)||^2 + ||(I - A A^+) b||^2.
Now y-hat = b is a solution of the LS problem min over y of [ ||A A^+ (y - b)||^2 + ||(I - A A^+) b||^2 ]. Hence x-hat = A^+ b is a solution of min over x of ||A x - b||^2, and min over x of ||A x - b||^2 = ||(I - A A^+) b||^2.
Proposition 3.8 For p = 1, x-hat = A^+ b is the solution to Problem 3.11 of minimum norm.

Problem 3.12 Prove the minimum norm property. Compare with Theorem 3.1. Are we repeating the same proof?
For p > 1 the computations are the same:
Proposition 3.9 X-hat = A^+ B is the solution of Problem 3.11 of minimum Frobenius norm.
Problem 3.13 Parametrize all solutions to Problem 3.11.

3.11 The Singular value decomposition (SVD)

We shall first do the SVD for real matrices. In Section 3.14 we shall generalize to general linear maps in inner product spaces.
Problem 3.14 Let A in R^{m x n} and r := min{n, m}. Show that A A^T and A^T A share the first r eigenvalues. How are the eigenvectors related?
Theorem 3.4 Let A in R^{m x n} of rank r <= min(m, n). One can find two orthogonal matrices U in R^{m x m} and V in R^{n x n} and positive numbers {sigma_1, ..., sigma_r}, the singular values of A, such that
A = U [Sigma 0; 0 0] V^T,  Sigma = diag{sigma_1, ..., sigma_r}.   (3.28)
Let U = [U_r U_r-tilde], V = [V_r V_r-tilde], where the submatrices U_r, V_r keep only the first r columns of U, V. We get a full-rank factorization of A:
A = U_r Sigma V_r^T = [u_1, ..., u_r] Sigma [v_1, ..., v_r]^T
where U_r^T U_r = I_r = V_r^T V_r, but U_r U_r^T != I_m and V_r V_r^T != I_n.
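
An illustrative numpy sketch of (3.28) and of the full-rank factorization A = U_r Sigma V_r^T (example matrix chosen at random).

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))   # 5x4, rank 3

U, s, Vt = np.linalg.svd(A)            # full SVD: U is 5x5, Vt is 4x4
r = int(np.sum(s > 1e-10))             # numerical rank
Ur, Sr, Vr = U[:, :r], np.diag(s[:r]), Vt[:r, :].T

print(r)                                               # 3
print(np.allclose(A, Ur @ Sr @ Vr.T))                  # full-rank factorization
print(np.allclose(Ur.T @ Ur, np.eye(r)),
      np.allclose(Vr.T @ Vr, np.eye(r)))               # True True
```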