IV. Matrix Approximation using Least-Squares


The SVD and Matrix Approximation

We begin with the following fundamental question. Let $A$ be an $M \times N$ matrix with rank $R$. What is the closest matrix to $A$ that has rank $r$?$^1$ As before, we will use the Frobenius norm to measure the distance between two matrices:
$$\|A - B\|_F^2 = \sum_{m=1}^M \sum_{n=1}^N |A[m,n] - B[m,n]|^2.$$
Recall that $\|X\|_F^2$ is also equal to the sum of the squares of the singular values of $X$. We can now formulate our problem as
$$\underset{X}{\text{minimize}}\ \|A - X\|_F^2 \quad\text{subject to}\quad \operatorname{rank}(X) = r. \tag{1}$$
The functional above is standard least-squares, but the constraint set (the set of all $M \times N$ matrices that have a rank of $r$) is a complicated entity. Nevertheless, as with many things in this class, the SVD reveals the solution immediately.

Low-rank approximation. Let $A$ be a matrix with SVD
$$A = U\Sigma V^T = \sum_{p=1}^R \sigma_p u_p v_p^T.$$
Then (1) is solved simply by truncating the SVD:
$$\hat X = \sum_{p=1}^r \sigma_p u_p v_p^T = U_r \Sigma_r V_r^T,$$
where $U_r$ contains the first $r$ columns of $U$, $V_r$ contains the first $r$ columns of $V$, and $\Sigma_r$ contains the first $r$ rows and $r$ columns of $\Sigma$.

$^1$We will assume that $r < R$, as for $r = R$ the answer is easy, and for $R < r \le \min(M, N)$ the question is not well-posed.

The result above, known as the Eckart-Young theorem, is an immediate consequence of the following lemma, which we will actually use again later in this set of notes.

Subspace Approximation Lemma. For fixed $A$ with SVD $A = U\Sigma V^T$, the optimization program
$$\underset{Q:\,M\times r,\ \Theta:\,r\times N}{\text{minimize}}\ \|A - Q\Theta\|_F^2 \quad\text{subject to}\quad Q^TQ = I, \tag{2}$$
has solution $\hat Q = U_r$ and $\hat\Theta = U_r^T A$, where $U_r = [\,u_1\ u_2\ \cdots\ u_r\,]$ contains the first $r$ columns of $U$.

We prove this lemma in the Technical Details section at the end of the notes. To see how it implies the Eckart-Young theorem, we can interpret the search over $M\times r$ matrices $Q$ with orthonormal columns as a search over all possible column spaces of dimension $r$. Then the search over $\Theta$ finds the best linear combinations within that column space to approximate the columns of $A$. Since any rank-$r$ matrix can be represented this way, the optimization program (2) is equivalent to (1); if $\hat Q$, $\hat\Theta$ solve (2), then $\hat A = \hat Q\hat\Theta$ solves (1). Also note that
$$\hat\Theta = U_r^T U\Sigma V^T = [\,I\ \ 0\,]\,\Sigma V^T,$$
where $I$ is the $r\times r$ identity matrix and $0$ is an $r\times(R-r)$ matrix of zeros. This matrix of zeros has the same effect as removing all but the first $r$ terms along the diagonal of $\Sigma$ and all but the first $r$ rows of $V^T$. Thus
$$\hat Q\hat\Theta = U_r[\,I\ \ 0\,]\,\Sigma V^T = U_r\Sigma_r V_r^T.$$

What is the error between $A$ and its best rank-$r$ approximation $\hat A$? Well,
$$A - \hat A = \sum_{p=r+1}^R \sigma_p u_p v_p^T,$$
and so the error matrix has singular values $\sigma_{r+1},\ldots,\sigma_R$. Since the Frobenius norm (squared) can be calculated by summing the squares of the singular values,
$$\|A - \hat A\|_F^2 = \sum_{p=r+1}^R \sigma_p^2.$$

In what follows, we use this low-rank matrix approximation result to develop two fundamental tools: total least-squares and principal components analysis.
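
To make the recipe concrete, here is a minimal numerical sketch in Python with numpy; the matrix $A$, its dimensions, and the target rank $r$ below are invented purely for illustration and are not taken from the notes:

    import numpy as np

    def low_rank_approx(A, r):
        """Best rank-r approximation of A in the Frobenius norm (truncated SVD)."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
        return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 4))      # an example M x N matrix
    r = 2
    A_hat = low_rank_approx(A, r)

    # The squared Frobenius error equals the sum of the squared discarded singular values.
    s = np.linalg.svd(A, compute_uv=False)
    print(np.linalg.norm(A - A_hat, 'fro')**2, np.sum(s[r:]**2))

The two printed numbers should agree, which is exactly the error formula derived above.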

Total Least-Squares

Our fundamental approach thus far to solving $y \approx Ax$ has been to solve
$$\underset{x}{\text{minimize}}\ \|y - Ax\|_2^2.$$
Thought of another way, if we can't find an $x$ such that $y = Ax$ exactly, we are looking for the smallest possible perturbation we could add to $y$ so that there is an exact solution. Mathematically, the standard least-squares program above is equivalent to solving
$$\underset{\delta y,\,x}{\text{minimize}}\ \|\delta y\|_2^2 \quad\text{subject to}\quad y + \delta y = Ax.$$
This reformulation makes it clear that least-squares implicitly assumes that all of the error (i.e., all of the reasons we can't find an exact solution) lies in the measured data $y$. But what if the entries of $A$ are also subject to error? That is, how can we account for modeling error as well as measurement error?

Total least-squares (TLS) is a framework for doing exactly this in a principled manner. TLS finds the smallest perturbations $\delta y$, $\delta A$ such that
$$y + \delta y = (A + \delta A)x$$
has an exact solution. It does this by solving
$$\underset{\delta A,\,\delta y,\,x}{\text{minimize}}\ \|\delta A\|_F^2 + \|\delta y\|_2^2 \quad\text{subject to}\quad y + \delta y = (A + \delta A)x.$$

Example: 1D linear regression

Say we are given a set of points $(a_1, y_1), (a_2, y_2), \ldots, (a_M, y_M)$, and suppose that the goal is to find the best line that fits these points. (For simplicity, we will only consider lines that pass through the origin.) That is, we are looking for the slope $x$ such that the $a_m x$ are as close to the $y_m$ as possible.

The standard least-squares framework models this problem as follows. We observe $y_m = a_m x + \text{noise}$, or in matrix form,
$$y = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_M \end{bmatrix} x + \text{noise}.$$
The solution is of course
$$\hat x = (A^TA)^{-1}A^Ty = \frac{\sum_{m=1}^M a_m y_m}{\sum_{m=1}^M a_m^2}.$$
This solution minimizes the size of the residual,
$$\|r\|_2^2 = \|y - Ax\|_2^2 = \sum_{m=1}^M |y_m - a_m x|^2.$$
Geometrically, we are choosing the slope that minimizes the sum of the squares of the vertical distances of the points to the line we use to approximate them:

[figure omitted in this transcription: data points with vertical residuals to the fitted line]

In contrast, the TLS estimate (which we will see how to compute below) minimizes the perpendicular distance in the plane from the points to the line we choose:

[figure omitted in this transcription: data points with perpendicular residuals to the fitted line]

This distance includes changes in both the $a_m$ and the $y_m$.
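
As a small worked illustration of the closed-form slope $\hat x = \sum_m a_m y_m / \sum_m a_m^2$, here is a short Python sketch; the data below are synthetic, generated only for this example. (A sketch of the TLS estimate itself follows the derivation in the next subsection.)

    import numpy as np

    rng = np.random.default_rng(1)
    M = 50
    a = rng.uniform(-1.0, 1.0, size=M)                # abscissas a_1, ..., a_M
    x_true = 2.0
    y = x_true * a + 0.1 * rng.standard_normal(M)     # noisy observations y_m

    # Least-squares slope: minimizes the sum of squared *vertical* distances.
    x_ls = np.sum(a * y) / np.sum(a * a)
    print(x_ls)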

Solving TLS

We will assume that $A$ is an $M\times N$ matrix with $M > N$ and $\operatorname{rank}(A) = N$ (i.e., $A$ is overdetermined with full column rank). The problem only really makes sense if $\operatorname{rank}(A) < M$, otherwise there is always an exact solution. By being careful with the details, the method we present here can also be extended to the case where $\operatorname{rank}(A) < N < M$, but I will leave it to you to fill in those gaps.

We want to find $\delta A$, $\delta y$, $x$ such that $y + \delta y = (A + \delta A)x$, with $\delta y$, $\delta A$ of minimal size. Rewrite this as
$$(A + \delta A)x - (y + \delta y) = 0$$
$$\begin{bmatrix} A + \delta A & y + \delta y\end{bmatrix}\begin{bmatrix} x \\ -1\end{bmatrix} = 0$$
$$(C + \Delta)\begin{bmatrix} x \\ -1\end{bmatrix} = 0,$$
where
$$C = \begin{bmatrix} A & y\end{bmatrix}, \qquad \Delta = \begin{bmatrix}\delta A & \delta y\end{bmatrix}.$$
Note that both $C$ and $\Delta$ are $M\times(N+1)$ matrices. The progression of equations above says that we are looking for a $\Delta$ (of minimal size) such that there is a vector of the form $\begin{bmatrix} x \\ -1\end{bmatrix}$ in the nullspace of $C + \Delta$. Since
$$v\in\operatorname{Null}(C+\Delta) \;\Rightarrow\; \alpha v\in\operatorname{Null}(C+\Delta) \quad\text{for all } \alpha\in\mathbb{R},$$
and $x$ is arbitrary, we are really just asking that $C+\Delta$ have a nullspace; as long as there is at least one vector in the nullspace whose last entry is nonzero, we can find a vector of the required form just by normalizing. In short, this means that our task is to find a $\Delta$ such that the $M\times(N+1)$ matrix $C+\Delta$ is rank deficient, that is, $\operatorname{rank}(C+\Delta) < N+1$. Put another way, we want to solve the optimization program
$$\underset{\Delta}{\text{minimize}}\ \|\Delta\|_F^2 \quad\text{subject to}\quad \operatorname{rank}(C+\Delta) = N.$$
Making the substitution $X = C+\Delta$, this is equivalent to solving
$$\underset{X}{\text{minimize}}\ \|X - C\|_F^2 \quad\text{subject to}\quad \operatorname{rank}(X) = N,$$
and then taking $\hat\Delta = \hat X - C$. This is a low-rank approximation problem$^2$, and we now know exactly how to solve it. Take the SVD of $C$,
$$C = W\Gamma Z^T = \sum_{n=1}^{N+1}\gamma_n w_n z_n^T,$$
and create $\hat X$ by leaving out the last term in the sum above$^3$:
$$\hat X = \sum_{n=1}^N\gamma_n w_n z_n^T.$$
Then
$$\hat\Delta = \hat X - C = -\gamma_{N+1}\,w_{N+1}z_{N+1}^T.$$

$^2$Or at least a lower-rank approximation problem.
$^3$If $C$ has fewer than $N+1$ non-zero singular values, then it is already rank deficient, and we can take $\hat X = C$, $\hat\Delta = 0$.

Now we are ready to construct the actual estimate $\hat x$. Recall that we want a vector such that
$$(C+\hat\Delta)\begin{bmatrix} x \\ -1\end{bmatrix} = 0, \quad\text{meaning}\quad \hat X\begin{bmatrix} x \\ -1\end{bmatrix} = 0.$$
The null space of $\hat X$ is (by construction) simply the span of $z_{N+1}$, meaning we need to find a scalar $\alpha$ such that
$$\begin{bmatrix} x \\ -1\end{bmatrix} = \alpha\, z_{N+1}.$$
Thus we can take
$$\hat x_{\mathrm{TLS}} = \frac{-1}{z_{N+1}[N+1]}\begin{bmatrix} z_{N+1}[1] \\ z_{N+1}[2] \\ \vdots \\ z_{N+1}[N]\end{bmatrix}.$$
If it happens that $z_{N+1}[N+1] = 0$, this means $\delta y = 0$, and we would need an $x$ such that $(A + \delta A)x = y$ exactly. Such an $x$ may or may not exist (and probably doesn't), so in this case there is no TLS solution.

In the special case where the smallest singular value of $C = [\,A\ \ y\,]$ is not unique, i.e.
$$\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_q > \gamma_{q+1} = \gamma_{q+2} = \cdots = \gamma_{N+1}$$
for some $q < N$, the TLS solution may not be unique. We take
$$Z' = \begin{bmatrix} z_{q+1} & z_{q+2} & \cdots & z_{N+1}\end{bmatrix}$$
and try to find a vector in its span that has the right form; any vector $x$ such that
$$\begin{bmatrix} x \\ -1\end{bmatrix}\in\operatorname{Span}\left(\{z_{q+1},\ldots,z_{N+1}\}\right)$$
is equally good. All we need is a $\beta$ such that the last entry of $Z'\beta$ is equal to $-1$.
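
The construction above translates directly into a few lines of numpy. The sketch below assumes the generic case (the smallest singular value of $C$ is unique and the last entry of $z_{N+1}$ is nonzero); the data are synthetic and only for illustration:

    import numpy as np

    def tls(A, y):
        """Total least-squares estimate via the SVD of C = [A  y]."""
        C = np.column_stack([A, y])       # M x (N+1)
        _, _, Zt = np.linalg.svd(C)       # rows of Zt are z_1^T, ..., z_{N+1}^T
        z = Zt[-1, :]                     # right singular vector for the smallest singular value
        if np.isclose(z[-1], 0.0):
            raise ValueError("last entry of z_{N+1} is zero: no TLS solution")
        return -z[:-1] / z[-1]            # x_TLS = -(1 / z_{N+1}[N+1]) * (first N entries)

    rng = np.random.default_rng(2)
    A_true = rng.standard_normal((30, 2))
    x_true = np.array([1.0, -0.5])
    A = A_true + 0.05 * rng.standard_normal(A_true.shape)   # modeling error in A
    y = A_true @ x_true + 0.05 * rng.standard_normal(30)    # measurement error in y

    print(tls(A, y))                                 # should be close to x_true
    print(np.linalg.lstsq(A, y, rcond=None)[0])      # ordinary least-squares, for comparison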

Principal Components Analysis

Principal Components Analysis (PCA) is a standard technique for dimensionality reduction of data sets. It is a way to automatically find simplifying linear relationships in the data. It is used everywhere in signal processing, machine learning, and statistics, with applications including data compression, pattern recognition, and factor analysis.

There are two ways to think about PCA. The first is statistical: we are trying to find a transform that is carefully tuned to the (second-order) statistics of the data. The second is geometrical: given a set of vectors, we are trying to find a subspace of a certain dimension that comes closest to containing this set.

The Karhunen-Loeve Transform

The Karhunen-Loeve (KL) transform is an orthobasis that is tailored to the statistics of a class of random vectors. Suppose that $x\in\mathbb{R}^D$ is random and has$^4$ mean and covariance
$$\mathrm{E}[x] = 0, \qquad \mathrm{E}[xx^T] = R.$$
Then the KL transform (or the KL basis) is simply the eigenvector basis $V$ of $R = V\Lambda V^T$:
$$x = \sum_{n=1}^D\alpha_n v_n, \qquad \alpha_n = \langle x, v_n\rangle.$$
This transform has the property that if we want to truncate the sum above (i.e., compress the vector by using fewer than $D$ numbers to represent it), we get an error that is optimal in the mean-square sense.

Let's set this problem up carefully. We want to find a subspace $\mathcal{T}$ of dimension $K$ such that when we project $x$ onto $\mathcal{T}$, we lose as little of $x$ (in expectation) as possible. We want to solve
$$\underset{\mathcal{T}}{\text{minimize}}\ \mathrm{E}\left[\min_{t\in\mathcal{T}}\|x - t\|_2^2\right] \quad\text{subject to}\quad \dim(\mathcal{T}) = K.$$
For a fixed $\mathcal{T}$, we know how to solve the inner optimization program if we have an orthobasis for it, so we can re-write the above as a search over sets of $K$ orthonormal vectors in $\mathbb{R}^D$:
$$\underset{Q:\,D\times K}{\text{minimize}}\ \mathrm{E}\left[\|x - QQ^Tx\|_2^2\right] \quad\text{subject to}\quad Q^TQ = I.$$

$^4$Modifying this discussion to vectors that are not zero-mean is straightforward.

Now notice that
$$\mathrm{E}\left[\|x - QQ^Tx\|_2^2\right] = \mathrm{E}\left[\|(I - QQ^T)x\|_2^2\right] = \mathrm{E}\left[\operatorname{trace}\big((I-QQ^T)xx^T(I-QQ^T)\big)\right] = \operatorname{trace}\big((I-QQ^T)\,\mathrm{E}[xx^T]\,(I-QQ^T)\big) = \operatorname{trace}\big((I-QQ^T)R(I-QQ^T)\big),$$
where in the second step above we have used the fact that for any vector $v$, $\|v\|_2^2 = \operatorname{trace}(vv^T)$. Now notice that
$$\operatorname{trace}\big((I-QQ^T)R(I-QQ^T)\big) = \operatorname{trace}(R) - 2\operatorname{trace}(QQ^TR) + \operatorname{trace}(QQ^TRQQ^T).$$
We now apply three facts: $\operatorname{trace}(R)$ does not depend on $Q$; $\operatorname{trace}(QQ^TRQQ^T) = \operatorname{trace}(Q^TRQ\,Q^TQ) = \operatorname{trace}(Q^TRQ)$ since $Q^TQ = I$; and $\operatorname{trace}(QQ^TR) = \operatorname{trace}(Q^TRQ)$. These transform the problem into the equivalent program
$$\underset{W:\,D\times K}{\text{maximize}}\ \operatorname{trace}(W^TRW) \quad\text{subject to}\quad W^TW = I.$$
In the Technical Details section below, we show that this expression is maximized by taking $Q = [\,v_1\ v_2\ \cdots\ v_K\,]$, where $v_1,\ldots,v_K$ are the eigenvectors of $R$ corresponding to the $K$ largest eigenvalues.

Moral: The best (in terms of mean-squared error) way to get a $K$-term approximation of random data is to transform into the orthobasis formed by the eigenvectors of the covariance matrix, and then truncate the coefficients to $K$ terms. This set of eigenvectors $V$ is called the KL transform. In some sense, $v_1,\ldots,v_K$ are the $K$ most important features of $x$; they are completely determined by the covariance matrix $R$.

Examples in $\mathbb{R}^2$: [figures omitted in this transcription]
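
A short numpy sketch of the KL truncation; the covariance $R$ and the draw of $x$ below are invented solely for illustration:

    import numpy as np

    rng = np.random.default_rng(3)
    D, K = 5, 2

    # An arbitrary covariance matrix R (symmetric positive definite).
    B = rng.standard_normal((D, D))
    R = B @ B.T + 0.1 * np.eye(D)

    # KL basis: eigenvectors of R, ordered by decreasing eigenvalue.
    lam, V = np.linalg.eigh(R)                 # eigh returns ascending eigenvalues
    V = V[:, np.argsort(lam)[::-1]]

    # Best K-term approximation of a zero-mean draw x with covariance R.
    x = np.linalg.cholesky(R) @ rng.standard_normal(D)
    alpha = V.T @ x                            # coefficients alpha_n = <x, v_n>
    x_K = V[:, :K] @ alpha[:K]                 # keep the K most important features
    print(np.linalg.norm(x - x_K))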

PCA on observed data

A very similar procedure to the one above solves a common geometrical problem. Suppose that I have a bunch of data points $x_1, x_2, \ldots, x_N \in \mathbb{R}^D$, and I want to find the $K$-dimensional affine space (subspace plus offset) that comes closest to containing them.

Example

We don't even need to think of the data as random here; they are just points that we want to fit with a hyperplane. Here is a picture from Chapter 14 of Hastie, Tibshirani, and Friedman$^5$:

[figure omitted in this transcription: data points in the plane and the best-fitting affine subspace]

$^5$This is pulled from Chapter 14 of Hastie, Tibshirani, and Friedman's The Elements of Statistical Learning.

Our goal is to find an offset $\mu\in\mathbb{R}^D$ and a matrix $Q$ with orthonormal columns such that
$$x_n \approx \mu + Q\theta_n \quad\text{for all } n = 1,\ldots,N,$$
for some $\theta_n\in\mathbb{R}^K$. We cast this as the following optimization problem. Given $x_1,\ldots,x_N$, solve
$$\underset{\mu,\,Q,\,\{\theta_n\}}{\text{minimize}}\ \sum_{n=1}^N\|x_n - \mu - Q\theta_n\|_2^2 \quad\text{subject to}\quad Q^TQ = I.$$
If we fix $\mu$ and $Q$, then by arguments very similar to those we have made before, the optimal $\theta_n$ are given by
$$\hat\theta_n = Q^T(x_n - \mu).$$
This means our objective reduces to solving
$$\underset{\mu,\,Q}{\text{minimize}}\ \sum_{n=1}^N\|(I - QQ^T)(x_n - \mu)\|_2^2 \quad\text{subject to}\quad Q^TQ = I.$$
The offset $\mu$ is unconstrained; if we again fix $Q$, we can solve for the optimal $\mu$ by taking a gradient and setting it equal to zero:
$$\nabla_\mu\left(\sum_{n=1}^N\|(I-QQ^T)(x_n-\mu)\|_2^2\right) = -2\sum_{n=1}^N(I-QQ^T)(x_n-\mu) = -2(I-QQ^T)\left(\sum_{n=1}^N x_n - N\mu\right).$$
We can make the gradient zero by taking the offset $\mu$ to be the sample mean (the average of all the observed vectors):
$$\hat\mu = \frac{1}{N}\sum_{n=1}^N x_n.$$
All that remains is solving for $Q$. We have
$$\underset{Q:\,D\times K}{\text{minimize}}\ \sum_{n=1}^N\|(I - QQ^T)(x_n - \hat\mu)\|_2^2 \quad\text{subject to}\quad Q^TQ = I.$$
Again, using an argument that perfectly parallels the one in the Technical Details section below, this program is solved by forming
$$S = \sum_{n=1}^N(x_n - \hat\mu)(x_n - \hat\mu)^T,$$
taking an eigenvalue decomposition $S = W\Lambda W^T$, and then taking
$$Q = \begin{bmatrix} w_1 & w_2 & \cdots & w_K\end{bmatrix},$$
where $w_1,\ldots,w_K$ are the eigenvectors of $S$ corresponding to the $K$ largest eigenvalues.

So even though we posed this problem as being purely geometrical, the answer parallels the statistical KL transform: we simply replace the true covariance matrix $R$ with the sample covariance $N^{-1}S$.
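
In code, the whole procedure is just a mean subtraction followed by an eigenvalue decomposition of the scatter matrix $S$. Here is a minimal sketch with made-up data (the dimensions and noise level are arbitrary choices for this example):

    import numpy as np

    def pca_fit(X, K):
        """Fit a K-dimensional affine set to the columns x_1, ..., x_N of the D x N matrix X."""
        mu = X.mean(axis=1, keepdims=True)          # sample mean = optimal offset
        S = (X - mu) @ (X - mu).T                   # D x D scatter matrix
        lam, W = np.linalg.eigh(S)                  # ascending eigenvalues
        Q = W[:, np.argsort(lam)[::-1][:K]]         # eigenvectors for the K largest eigenvalues
        return mu, Q

    rng = np.random.default_rng(4)
    D, N, K = 4, 200, 2

    # Data that lie close to a 2-dimensional affine set.
    Q_true, _ = np.linalg.qr(rng.standard_normal((D, K)))
    X = 3.0 + Q_true @ rng.standard_normal((K, N)) + 0.05 * rng.standard_normal((D, N))

    mu, Q = pca_fit(X, K)
    resid = (np.eye(D) - Q @ Q.T) @ (X - mu)        # should be small relative to X - mu
    print(np.linalg.norm(resid) / np.linalg.norm(X - mu))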

Technical Details: Subspace Approximation Lemma

We prove the Subspace Approximation Lemma stated earlier in these notes. First, with $Q$ fixed, we can break the optimization over $\Theta$ into a series of least-squares problems. Let $a_1,\ldots,a_N$ be the columns of $A$, and $\theta_1,\ldots,\theta_N$ be the columns of $\Theta$. Then
$$\underset{\Theta}{\text{minimize}}\ \|A - Q\Theta\|_F^2$$
is exactly the same as
$$\underset{\theta_1,\ldots,\theta_N}{\text{minimize}}\ \sum_{n=1}^N\|a_n - Q\theta_n\|_2^2.$$
The above is our classic closest-point problem, and it is optimized by taking $\theta_n = Q^Ta_n$ (since the columns of $Q$ are orthonormal). Thus we can write the original problem (2) as
$$\underset{Q:\,M\times r}{\text{minimize}}\ \sum_{n=1}^N\|a_n - QQ^Ta_n\|_2^2 \quad\text{subject to}\quad Q^TQ = I,$$
and then take $\hat\Theta = \hat Q^TA$. Expanding the functional and using the fact that $(I - QQ^T)^2 = I - QQ^T$, we have
$$\sum_{n=1}^N\|a_n - QQ^Ta_n\|_2^2 = \sum_{n=1}^N a_n^T(I - QQ^T)a_n = \sum_{n=1}^N\|a_n\|_2^2 - \sum_{n=1}^N a_n^TQQ^Ta_n.$$

Since the first term does not depend on $Q$, our optimization program is equivalent to
$$\underset{Q:\,M\times r}{\text{maximize}}\ \sum_{n=1}^N a_n^TQQ^Ta_n \quad\text{subject to}\quad Q^TQ = I.$$
Now recall that for any vector $v$, $\langle v, v\rangle = \operatorname{trace}(vv^T)$. Thus
$$\sum_{n=1}^N a_n^TQQ^Ta_n = \sum_{n=1}^N\operatorname{trace}(Q^Ta_na_n^TQ) = \operatorname{trace}\left(Q^T\left(\sum_{n=1}^N a_na_n^T\right)Q\right) = \operatorname{trace}\left(Q^T(AA^T)Q\right).$$
The matrix $AA^T$ has eigenvalue decomposition
$$AA^T = U\Sigma^2U^T,$$
where $U$ and $\Sigma$ come from the SVD of $A$ (we will take $U$ to be $M\times M$, possibly adding zeros down the diagonal of $\Sigma^2$). Now
$$\operatorname{trace}\left(Q^T(AA^T)Q\right) = \operatorname{trace}\left(Q^TU\Sigma^2U^TQ\right) = \operatorname{trace}\left(W^T\Sigma^2W\right),$$
where $W = U^TQ$. Notice that $W$ also has orthonormal columns, as
$$W^TW = Q^TUU^TQ = Q^TQ = I.$$
Thus our optimization program has become
$$\underset{W:\,M\times r}{\text{maximize}}\ \operatorname{trace}(W^T\Sigma^2W) \quad\text{subject to}\quad W^TW = I.$$

After we solve this, we can take any $\hat Q$ such that $\hat W = U^T\hat Q$.

This last optimization program is equivalent to a simple linear program that is solvable by inspection. Let $w_1,\ldots,w_r$ be the columns of $W$. Then
$$\operatorname{trace}(W^T\Sigma^2W) = \sum_{p=1}^r w_p^T\Sigma^2w_p = \sum_{p=1}^r\sum_{m=1}^M w_p[m]^2\,\sigma_m^2 = \sum_{m=1}^M h[m]\,\sigma_m^2, \quad\text{where}\quad h[m] = \sum_{p=1}^r w_p[m]^2$$
is the sum of the squares of the entries in row $m$ of $W$. Since the sum of the squares of every column of $W$ is one, the sum of the squares of all the entries of $W$ must be $r$, and so
$$\sum_{m=1}^M h[m] = r.$$
It is clear that $h[m]$ is non-negative, but it is also true that $h[m] \le 1$. Here is why: since the columns of $W$ are orthonormal, they can be considered as part of an orthonormal basis for $\mathbb{R}^M$. That is, there is an $M\times(M-r)$ matrix $W_0$ such that the $M\times M$ matrix $[\,W\ \ W_0\,]$ has both orthonormal columns and orthonormal rows, so the sum of the squares of each of its rows is equal to one. Thus the sum of the squares of the first $r$ entries of each row (which is exactly $h[m]$) cannot be larger than one.

Thus the maximum value that $\operatorname{trace}(W^T\Sigma^2W)$ can take is given by the linear program
$$\underset{h\in\mathbb{R}^M}{\text{maximize}}\ \sum_{m=1}^M h[m]\,\sigma_m^2 \quad\text{subject to}\quad \sum_{m=1}^M h[m] = r, \quad 0\le h[m]\le 1.$$
We can intuit the answer to this program. Since all of the $\sigma_m^2$ and all of the $h[m]$ are non-negative, we want to put as much weight as possible on the largest singular values. Since the weights are constrained to be no larger than 1, this simply means we max out the first $r$ terms; the solution to the program above is
$$\hat h[m] = \begin{cases} 1, & m = 1,\ldots,r, \\ 0, & m = r+1,\ldots,M.\end{cases}$$
This means that the sums of the squares of the first $r$ rows of $\hat W$ are equal to one, while the rest are zero. There might be many such matrices that fit this bill, but one of them is
$$\hat W = \begin{bmatrix} I \\ 0\end{bmatrix},$$
where $I$ is the $r\times r$ identity matrix and $0$ is an $(M-r)\times r$ matrix of all zeros. It is easy to see that choosing $\hat Q = [\,u_1\ u_2\ \cdots\ u_r\,]$ satisfies
$$U^T\hat Q = \begin{bmatrix} I \\ 0\end{bmatrix}.$$
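
If you would like to sanity-check the lemma numerically, a quick random experiment such as the following (a sketch with arbitrary dimensions) compares $\hat Q = U_r$ against randomly drawn orthonormal competitors:

    import numpy as np

    rng = np.random.default_rng(5)
    M, N, r = 8, 6, 3
    A = rng.standard_normal((M, N))

    U, s, Vt = np.linalg.svd(A)
    Q_hat = U[:, :r]
    best = np.linalg.norm(A - Q_hat @ (Q_hat.T @ A), 'fro')**2

    # No Q with orthonormal columns should beat Q_hat = U_r.
    for _ in range(1000):
        Q, _ = np.linalg.qr(rng.standard_normal((M, r)))
        err = np.linalg.norm(A - Q @ (Q.T @ A), 'fro')**2
        assert err >= best - 1e-9

    print(best, np.sum(s[r:]**2))    # both equal the sum of the discarded squared singular values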