IV. Matrix Approximation using Least-Squares


The SVD and Matrix Approximation

We begin with the following fundamental question. Let $A$ be an $M \times N$ matrix with rank $R$. What is the closest matrix to $A$ that has rank $r$?$^1$ As before, we will use the Frobenius norm to measure the distance between two matrices:
$$\|A - B\|_F^2 = \sum_{m=1}^M \sum_{n=1}^N |A[m,n] - B[m,n]|^2.$$
Recall that $\|X\|_F^2$ is also equal to the sum of the squares of the singular values of $X$. We can now formulate our problem as
$$\underset{X}{\text{minimize}}\ \|A - X\|_F^2 \quad\text{subject to}\quad \operatorname{rank}(X) = r. \tag{1}$$
The functional above is standard least-squares, but the constraint set (the set of all $M \times N$ matrices that have a rank of $r$) is a complicated entity. Nevertheless, as with many things in this class, the SVD reveals the solution immediately.

Low-rank approximation. Let $A$ be a matrix with SVD
$$A = U\Sigma V^T = \sum_{p=1}^R \sigma_p u_p v_p^T.$$
Then (1) is solved simply by truncating the SVD:
$$\hat X = \sum_{p=1}^r \sigma_p u_p v_p^T = U_r \Sigma_r V_r^T,$$
where $U_r$ contains the first $r$ columns of $U$, $V_r$ contains the first $r$ columns of $V$, and $\Sigma_r$ contains the first $r$ rows and $r$ columns of $\Sigma$.

$^1$We will assume that $r < R$, as for $r = R$ the answer is easy, and for $R < r \le \min(M, N)$ the question is not well-posed.

The result above, known as the Eckart-Young theorem, is an immediate consequence of the following lemma, which we will actually use again later in this set of notes.

Subspace Approximation Lemma. For fixed $A$ with SVD $A = U\Sigma V^T$, the optimization program
$$\underset{Q:\,M\times r,\ \Theta:\,r\times N}{\text{minimize}}\ \|A - Q\Theta\|_F^2 \quad\text{subject to}\quad Q^TQ = I, \tag{2}$$
has solution $\hat Q = U_r$ and $\hat\Theta = U_r^T A$, where $U_r = [\,u_1\ u_2\ \cdots\ u_r\,]$ contains the first $r$ columns of $U$.

We prove this lemma in the Technical Details section at the end of the notes. To see how it implies the Eckart-Young theorem, we can interpret the search over $M\times r$ matrices $Q$ with orthonormal columns as a search over all possible column spaces of dimension $r$. Then the search over $\Theta$ finds the best linear combinations within that column space to approximate the columns of $A$. Since any rank-$r$ matrix can be represented this way, the optimization program (2) is equivalent to (1); if $\hat Q$, $\hat\Theta$ solve (2), then $\hat A = \hat Q\hat\Theta$ solves (1). Also note that
$$\hat\Theta = U_r^T U\Sigma V^T = [\,I\ \ 0\,]\,\Sigma V^T,$$
where $I$ is the $r\times r$ identity matrix and $0$ is an $r\times(R-r)$ matrix of zeros. This matrix of zeros has the same effect as removing all but the first $r$ terms along the diagonal of $\Sigma$ and all but the first $r$ rows of $V^T$. Thus
$$\hat Q\hat\Theta = U_r[\,I\ \ 0\,]\,\Sigma V^T = U_r\Sigma_r V_r^T.$$

What is the error between $A$ and its best rank-$r$ approximation $\hat A$? Well,
$$A - \hat A = \sum_{p=r+1}^R \sigma_p u_p v_p^T,$$
and so the error matrix has singular values $\sigma_{r+1},\ldots,\sigma_R$. Since the Frobenius norm (squared) can be calculated by summing the squares of the singular values,
$$\|A - \hat A\|_F^2 = \sum_{p=r+1}^R \sigma_p^2.$$

In what follows, we use this low-rank matrix approximation result to develop two fundamental tools: total least-squares and principal components analysis.
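
To make the recipe concrete, here is a minimal numerical sketch in Python with numpy; the matrix $A$, its dimensions, and the target rank $r$ below are invented purely for illustration and are not taken from the notes:

    import numpy as np

    def low_rank_approx(A, r):
        """Best rank-r approximation of A in the Frobenius norm (truncated SVD)."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
        return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 4))      # an example M x N matrix
    r = 2
    A_hat = low_rank_approx(A, r)

    # The squared Frobenius error equals the sum of the squared discarded singular values.
    s = np.linalg.svd(A, compute_uv=False)
    print(np.linalg.norm(A - A_hat, 'fro')**2, np.sum(s[r:]**2))

The two printed numbers should agree, which is exactly the error formula derived above.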

Total Least-Squares

Our fundamental approach thus far to solving $y \approx Ax$ has been to solve
$$\underset{x}{\text{minimize}}\ \|y - Ax\|_2^2.$$
Thought of another way, if we can't find an $x$ such that $y = Ax$ exactly, we are looking for the smallest possible perturbation we could add to $y$ so that there is an exact solution. Mathematically, the standard least-squares program above is equivalent to solving
$$\underset{\delta y,\,x}{\text{minimize}}\ \|\delta y\|_2^2 \quad\text{subject to}\quad y + \delta y = Ax.$$
This reformulation makes it clear that least-squares implicitly assumes that all of the error (i.e., all of the reasons we can't find an exact solution) lies in the measured data $y$. But what if the entries of $A$ are also subject to error? That is, how can we account for modeling error as well as measurement error?

Total least-squares (TLS) is a framework for doing exactly this in a principled manner. TLS finds the smallest perturbations $\delta y$, $\delta A$ such that
$$y + \delta y = (A + \delta A)x$$
has an exact solution. It does this by solving
$$\underset{\delta A,\,\delta y,\,x}{\text{minimize}}\ \|\delta A\|_F^2 + \|\delta y\|_2^2 \quad\text{subject to}\quad y + \delta y = (A + \delta A)x.$$

Example: 1D linear regression

Say we are given a set of points $(a_1, y_1), (a_2, y_2), \ldots, (a_M, y_M)$, and suppose that the goal is to find the best line that fits these points. (For simplicity, we will only consider lines that pass through the origin.) That is, we are looking for the slope $x$ such that the $a_m x$ are as close to the $y_m$ as possible.

The standard least-squares framework models this problem as follows. We observe $y_m = a_m x + \text{noise}$, or in matrix form,
$$y = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_M \end{bmatrix} x + \text{noise}.$$
The solution is of course
$$\hat x = (A^TA)^{-1}A^Ty = \frac{\sum_{m=1}^M a_m y_m}{\sum_{m=1}^M a_m^2}.$$
This solution minimizes the size of the residual,
$$\|r\|_2^2 = \|y - Ax\|_2^2 = \sum_{m=1}^M |y_m - a_m x|^2.$$
Geometrically, we are choosing the slope that minimizes the sum of the squares of the vertical distances of the points to the line we use to approximate them:

[figure omitted in this transcription: data points with vertical residuals to the fitted line]

In contrast, the TLS estimate (which we will see how to compute below) minimizes the perpendicular distance in the plane from the points to the line we choose:

[figure omitted in this transcription: data points with perpendicular residuals to the fitted line]

This distance includes changes in both the $a_m$ and the $y_m$.
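
As a small worked illustration of the closed-form slope $\hat x = \sum_m a_m y_m / \sum_m a_m^2$, here is a short Python sketch; the data below are synthetic, generated only for this example. (A sketch of the TLS estimate itself follows the derivation in the next subsection.)

    import numpy as np

    rng = np.random.default_rng(1)
    M = 50
    a = rng.uniform(-1.0, 1.0, size=M)                # abscissas a_1, ..., a_M
    x_true = 2.0
    y = x_true * a + 0.1 * rng.standard_normal(M)     # noisy observations y_m

    # Least-squares slope: minimizes the sum of squared *vertical* distances.
    x_ls = np.sum(a * y) / np.sum(a * a)
    print(x_ls)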

Solving TLS

We will assume that $A$ is an $M\times N$ matrix with $M > N$ and $\operatorname{rank}(A) = N$ (i.e., $A$ is overdetermined with full column rank). The problem only really makes sense if $\operatorname{rank}(A) < M$, otherwise there is always an exact solution. By being careful with the details, the method we present here can also be extended to the case where $\operatorname{rank}(A) < N < M$, but I will leave it to you to fill in those gaps.

We want to find $\delta A$, $\delta y$, $x$ such that $y + \delta y = (A + \delta A)x$, with $\delta y$, $\delta A$ of minimal size. Rewrite this as
$$(A + \delta A)x - (y + \delta y) = 0$$
$$\begin{bmatrix} A + \delta A & y + \delta y\end{bmatrix}\begin{bmatrix} x \\ -1\end{bmatrix} = 0$$
$$(C + \Delta)\begin{bmatrix} x \\ -1\end{bmatrix} = 0,$$
where
$$C = \begin{bmatrix} A & y\end{bmatrix}, \qquad \Delta = \begin{bmatrix}\delta A & \delta y\end{bmatrix}.$$
Note that both $C$ and $\Delta$ are $M\times(N+1)$ matrices. The progression of equations above says that we are looking for a $\Delta$ (of minimal size) such that there is a vector of the form $\begin{bmatrix} x \\ -1\end{bmatrix}$ in the nullspace of $C + \Delta$. Since
$$v\in\operatorname{Null}(C+\Delta) \;\Rightarrow\; \alpha v\in\operatorname{Null}(C+\Delta) \quad\text{for all } \alpha\in\mathbb{R},$$
and $x$ is arbitrary, we are really just asking that $C+\Delta$ have a nullspace; as long as there is at least one vector in the nullspace whose last entry is nonzero, we can find a vector of the required form just by normalizing. In short, this means that our task is to find a $\Delta$ such that the $M\times(N+1)$ matrix $C+\Delta$ is rank deficient, that is, $\operatorname{rank}(C+\Delta) < N+1$. Put another way, we want to solve the optimization program
$$\underset{\Delta}{\text{minimize}}\ \|\Delta\|_F^2 \quad\text{subject to}\quad \operatorname{rank}(C+\Delta) = N.$$
Making the substitution $X = C+\Delta$, this is equivalent to solving
$$\underset{X}{\text{minimize}}\ \|X - C\|_F^2 \quad\text{subject to}\quad \operatorname{rank}(X) = N,$$
and then taking $\hat\Delta = \hat X - C$. This is a low-rank approximation problem$^2$, and we now know exactly how to solve it. Take the SVD of $C$,
$$C = W\Gamma Z^T = \sum_{n=1}^{N+1}\gamma_n w_n z_n^T,$$
and create $\hat X$ by leaving out the last term in the sum above$^3$:
$$\hat X = \sum_{n=1}^N\gamma_n w_n z_n^T.$$
Then
$$\hat\Delta = \hat X - C = -\gamma_{N+1}\,w_{N+1}z_{N+1}^T.$$

$^2$Or at least a lower-rank approximation problem.
$^3$If $C$ has fewer than $N+1$ non-zero singular values, then it is already rank deficient, and we can take $\hat X = C$, $\hat\Delta = 0$.

Now we are ready to construct the actual estimate $\hat x$. Recall that we want a vector such that
$$(C+\hat\Delta)\begin{bmatrix} x \\ -1\end{bmatrix} = 0, \quad\text{meaning}\quad \hat X\begin{bmatrix} x \\ -1\end{bmatrix} = 0.$$
The null space of $\hat X$ is (by construction) simply the span of $z_{N+1}$, meaning we need to find a scalar $\alpha$ such that
$$\begin{bmatrix} x \\ -1\end{bmatrix} = \alpha\, z_{N+1}.$$
Thus we can take
$$\hat x_{\mathrm{TLS}} = \frac{-1}{z_{N+1}[N+1]}\begin{bmatrix} z_{N+1}[1] \\ z_{N+1}[2] \\ \vdots \\ z_{N+1}[N]\end{bmatrix}.$$
If it happens that $z_{N+1}[N+1] = 0$, this means $\delta y = 0$, and we would need an $x$ such that $(A + \delta A)x = y$ exactly. Such an $x$ may or may not exist (and probably doesn't), so in this case there is no TLS solution.

In the special case where the smallest singular value of $C = [\,A\ \ y\,]$ is not unique, i.e.
$$\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_q > \gamma_{q+1} = \gamma_{q+2} = \cdots = \gamma_{N+1}$$
for some $q < N$, the TLS solution may not be unique. We take
$$Z' = \begin{bmatrix} z_{q+1} & z_{q+2} & \cdots & z_{N+1}\end{bmatrix}$$
and try to find a vector in its span that has the right form; any vector $x$ such that
$$\begin{bmatrix} x \\ -1\end{bmatrix}\in\operatorname{Span}\left(\{z_{q+1},\ldots,z_{N+1}\}\right)$$
is equally good. All we need is a $\beta$ such that the last entry of $Z'\beta$ is equal to $-1$.
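
The construction above translates directly into a few lines of numpy. The sketch below assumes the generic case (the smallest singular value of $C$ is unique and the last entry of $z_{N+1}$ is nonzero); the data are synthetic and only for illustration:

    import numpy as np

    def tls(A, y):
        """Total least-squares estimate via the SVD of C = [A  y]."""
        C = np.column_stack([A, y])       # M x (N+1)
        _, _, Zt = np.linalg.svd(C)       # rows of Zt are z_1^T, ..., z_{N+1}^T
        z = Zt[-1, :]                     # right singular vector for the smallest singular value
        if np.isclose(z[-1], 0.0):
            raise ValueError("last entry of z_{N+1} is zero: no TLS solution")
        return -z[:-1] / z[-1]            # x_TLS = -(1 / z_{N+1}[N+1]) * (first N entries)

    rng = np.random.default_rng(2)
    A_true = rng.standard_normal((30, 2))
    x_true = np.array([1.0, -0.5])
    A = A_true + 0.05 * rng.standard_normal(A_true.shape)   # modeling error in A
    y = A_true @ x_true + 0.05 * rng.standard_normal(30)    # measurement error in y

    print(tls(A, y))                                 # should be close to x_true
    print(np.linalg.lstsq(A, y, rcond=None)[0])      # ordinary least-squares, for comparison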

Principal Components Analysis

Principal Components Analysis (PCA) is a standard technique for dimensionality reduction of data sets. It is a way to automatically find simplifying linear relationships in the data. It is used everywhere in signal processing, machine learning, and statistics, with applications including data compression, pattern recognition, and factor analysis.

There are two ways to think about PCA. The first is statistical: we are trying to find a transform that is carefully tuned to the (second-order) statistics of the data. The second is geometrical: given a set of vectors, we are trying to find a subspace of a certain dimension that comes closest to containing this set.

The Karhunen-Loeve Transform

The Karhunen-Loeve (KL) transform is an orthobasis that is tailored to the statistics of a class of random vectors. Suppose that $x\in\mathbb{R}^D$ is random and has$^4$ mean and covariance
$$\mathrm{E}[x] = 0, \qquad \mathrm{E}[xx^T] = R.$$
Then the KL transform (or the KL basis) is simply the eigenvector basis $V$ of $R = V\Lambda V^T$:
$$x = \sum_{n=1}^D\alpha_n v_n, \qquad \alpha_n = \langle x, v_n\rangle.$$
This transform has the property that if we want to truncate the sum above (i.e., compress the vector by using fewer than $D$ numbers to represent it), we get an error that is optimal in the mean-square sense.

Let's set this problem up carefully. We want to find a subspace $\mathcal{T}$ of dimension $K$ such that when we project $x$ onto $\mathcal{T}$, we lose as little of $x$ (in expectation) as possible. We want to solve
$$\underset{\mathcal{T}}{\text{minimize}}\ \mathrm{E}\left[\min_{t\in\mathcal{T}}\|x - t\|_2^2\right] \quad\text{subject to}\quad \dim(\mathcal{T}) = K.$$
For a fixed $\mathcal{T}$, we know how to solve the inner optimization program if we have an orthobasis for it, so we can re-write the above as a search over sets of $K$ orthonormal vectors in $\mathbb{R}^D$:
$$\underset{Q:\,D\times K}{\text{minimize}}\ \mathrm{E}\left[\|x - QQ^Tx\|_2^2\right] \quad\text{subject to}\quad Q^TQ = I.$$

$^4$Modifying this discussion to vectors that are not zero-mean is straightforward.

Now notice that
$$\mathrm{E}\left[\|x - QQ^Tx\|_2^2\right] = \mathrm{E}\left[\|(I - QQ^T)x\|_2^2\right] = \mathrm{E}\left[\operatorname{trace}\big((I-QQ^T)xx^T(I-QQ^T)\big)\right] = \operatorname{trace}\big((I-QQ^T)\,\mathrm{E}[xx^T]\,(I-QQ^T)\big) = \operatorname{trace}\big((I-QQ^T)R(I-QQ^T)\big),$$
where in the second step above we have used the fact that for any vector $v$, $\|v\|_2^2 = \operatorname{trace}(vv^T)$. Now notice that
$$\operatorname{trace}\big((I-QQ^T)R(I-QQ^T)\big) = \operatorname{trace}(R) - 2\operatorname{trace}(QQ^TR) + \operatorname{trace}(QQ^TRQQ^T).$$
We now apply three facts: $\operatorname{trace}(R)$ does not depend on $Q$; $\operatorname{trace}(QQ^TRQQ^T) = \operatorname{trace}(Q^TRQ\,Q^TQ) = \operatorname{trace}(Q^TRQ)$ since $Q^TQ = I$; and $\operatorname{trace}(QQ^TR) = \operatorname{trace}(Q^TRQ)$. These transform the problem into the equivalent program
$$\underset{W:\,D\times K}{\text{maximize}}\ \operatorname{trace}(W^TRW) \quad\text{subject to}\quad W^TW = I.$$
In the Technical Details section below, we show that this expression is maximized by taking $Q = [\,v_1\ v_2\ \cdots\ v_K\,]$, where $v_1,\ldots,v_K$ are the eigenvectors of $R$ corresponding to the $K$ largest eigenvalues.

Moral: The best (in terms of mean-squared error) way to get a $K$-term approximation of random data is to transform into the orthobasis formed by the eigenvectors of the covariance matrix, and then truncate the coefficients to $K$ terms. This set of eigenvectors $V$ is called the KL transform. In some sense, $v_1,\ldots,v_K$ are the $K$ most important features of $x$; they are completely determined by the covariance matrix $R$.

Examples in $\mathbb{R}^2$: [figures omitted in this transcription]
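
A short numpy sketch of the KL truncation; the covariance $R$ and the draw of $x$ below are invented solely for illustration:

    import numpy as np

    rng = np.random.default_rng(3)
    D, K = 5, 2

    # An arbitrary covariance matrix R (symmetric positive definite).
    B = rng.standard_normal((D, D))
    R = B @ B.T + 0.1 * np.eye(D)

    # KL basis: eigenvectors of R, ordered by decreasing eigenvalue.
    lam, V = np.linalg.eigh(R)                 # eigh returns ascending eigenvalues
    V = V[:, np.argsort(lam)[::-1]]

    # Best K-term approximation of a zero-mean draw x with covariance R.
    x = np.linalg.cholesky(R) @ rng.standard_normal(D)
    alpha = V.T @ x                            # coefficients alpha_n = <x, v_n>
    x_K = V[:, :K] @ alpha[:K]                 # keep the K most important features
    print(np.linalg.norm(x - x_K))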

PCA on observed data

A very similar procedure to the one above solves a common geometrical problem. Suppose that I have a bunch of data points $x_1, x_2, \ldots, x_N \in \mathbb{R}^D$, and I want to find the $K$-dimensional affine space (subspace plus offset) that comes closest to containing them.

Example

We don't even need to think of the data as random here; they are just points that we want to fit with a hyperplane. Here is a picture from Chapter 14 of Hastie, Tibshirani, and Friedman$^5$:

[figure omitted in this transcription: data points in the plane and the best-fitting affine subspace]

$^5$This is pulled from Chapter 14 of Hastie, Tibshirani, and Friedman's The Elements of Statistical Learning.

Our goal is to find an offset $\mu\in\mathbb{R}^D$ and a matrix $Q$ with orthonormal columns such that
$$x_n \approx \mu + Q\theta_n \quad\text{for all } n = 1,\ldots,N,$$
for some $\theta_n\in\mathbb{R}^K$. We cast this as the following optimization problem. Given $x_1,\ldots,x_N$, solve
$$\underset{\mu,\,Q,\,\{\theta_n\}}{\text{minimize}}\ \sum_{n=1}^N\|x_n - \mu - Q\theta_n\|_2^2 \quad\text{subject to}\quad Q^TQ = I.$$
If we fix $\mu$ and $Q$, then by arguments very similar to those we have made before, the optimal $\theta_n$ are given by
$$\hat\theta_n = Q^T(x_n - \mu).$$
This means our objective reduces to solving
$$\underset{\mu,\,Q}{\text{minimize}}\ \sum_{n=1}^N\|(I - QQ^T)(x_n - \mu)\|_2^2 \quad\text{subject to}\quad Q^TQ = I.$$
The offset $\mu$ is unconstrained; if we again fix $Q$, we can solve for the optimal $\mu$ by taking a gradient and setting it equal to zero:
$$\nabla_\mu\left(\sum_{n=1}^N\|(I-QQ^T)(x_n-\mu)\|_2^2\right) = -2\sum_{n=1}^N(I-QQ^T)(x_n-\mu) = -2(I-QQ^T)\left(\sum_{n=1}^N x_n - N\mu\right).$$
We can make the gradient zero by taking the offset $\mu$ to be the sample mean (the average of all the observed vectors):
$$\hat\mu = \frac{1}{N}\sum_{n=1}^N x_n.$$
All that remains is solving for $Q$. We have
$$\underset{Q:\,D\times K}{\text{minimize}}\ \sum_{n=1}^N\|(I - QQ^T)(x_n - \hat\mu)\|_2^2 \quad\text{subject to}\quad Q^TQ = I.$$
Again, using an argument that perfectly parallels the one in the Technical Details section below, this program is solved by forming
$$S = \sum_{n=1}^N(x_n - \hat\mu)(x_n - \hat\mu)^T,$$
taking an eigenvalue decomposition $S = W\Lambda W^T$, and then taking
$$Q = \begin{bmatrix} w_1 & w_2 & \cdots & w_K\end{bmatrix},$$
where $w_1,\ldots,w_K$ are the eigenvectors of $S$ corresponding to the $K$ largest eigenvalues.

So even though we posed this problem as being purely geometrical, the answer parallels the statistical KL transform: we simply replace the true covariance matrix $R$ with the sample covariance $N^{-1}S$.
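
In code, the whole procedure is just a mean subtraction followed by an eigenvalue decomposition of the scatter matrix $S$. Here is a minimal sketch with made-up data (the dimensions and noise level are arbitrary choices for this example):

    import numpy as np

    def pca_fit(X, K):
        """Fit a K-dimensional affine set to the columns x_1, ..., x_N of the D x N matrix X."""
        mu = X.mean(axis=1, keepdims=True)          # sample mean = optimal offset
        S = (X - mu) @ (X - mu).T                   # D x D scatter matrix
        lam, W = np.linalg.eigh(S)                  # ascending eigenvalues
        Q = W[:, np.argsort(lam)[::-1][:K]]         # eigenvectors for the K largest eigenvalues
        return mu, Q

    rng = np.random.default_rng(4)
    D, N, K = 4, 200, 2

    # Data that lie close to a 2-dimensional affine set.
    Q_true, _ = np.linalg.qr(rng.standard_normal((D, K)))
    X = 3.0 + Q_true @ rng.standard_normal((K, N)) + 0.05 * rng.standard_normal((D, N))

    mu, Q = pca_fit(X, K)
    resid = (np.eye(D) - Q @ Q.T) @ (X - mu)        # should be small relative to X - mu
    print(np.linalg.norm(resid) / np.linalg.norm(X - mu))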

Technical Details: Subspace Approximation Lemma

We prove the Subspace Approximation Lemma stated earlier in these notes. First, with $Q$ fixed, we can break the optimization over $\Theta$ into a series of least-squares problems. Let $a_1,\ldots,a_N$ be the columns of $A$, and $\theta_1,\ldots,\theta_N$ be the columns of $\Theta$. Then
$$\underset{\Theta}{\text{minimize}}\ \|A - Q\Theta\|_F^2$$
is exactly the same as
$$\underset{\theta_1,\ldots,\theta_N}{\text{minimize}}\ \sum_{n=1}^N\|a_n - Q\theta_n\|_2^2.$$
The above is our classic closest-point problem, and it is optimized by taking $\theta_n = Q^Ta_n$ (since the columns of $Q$ are orthonormal). Thus we can write the original problem (2) as
$$\underset{Q:\,M\times r}{\text{minimize}}\ \sum_{n=1}^N\|a_n - QQ^Ta_n\|_2^2 \quad\text{subject to}\quad Q^TQ = I,$$
and then take $\hat\Theta = \hat Q^TA$. Expanding the functional and using the fact that $(I - QQ^T)^2 = I - QQ^T$, we have
$$\sum_{n=1}^N\|a_n - QQ^Ta_n\|_2^2 = \sum_{n=1}^N a_n^T(I - QQ^T)a_n = \sum_{n=1}^N\|a_n\|_2^2 - \sum_{n=1}^N a_n^TQQ^Ta_n.$$

Since the first term does not depend on $Q$, our optimization program is equivalent to
$$\underset{Q:\,M\times r}{\text{maximize}}\ \sum_{n=1}^N a_n^TQQ^Ta_n \quad\text{subject to}\quad Q^TQ = I.$$
Now recall that for any vector $v$, $\langle v, v\rangle = \operatorname{trace}(vv^T)$. Thus
$$\sum_{n=1}^N a_n^TQQ^Ta_n = \sum_{n=1}^N\operatorname{trace}(Q^Ta_na_n^TQ) = \operatorname{trace}\left(Q^T\left(\sum_{n=1}^N a_na_n^T\right)Q\right) = \operatorname{trace}\left(Q^T(AA^T)Q\right).$$
The matrix $AA^T$ has eigenvalue decomposition
$$AA^T = U\Sigma^2U^T,$$
where $U$ and $\Sigma$ come from the SVD of $A$ (we will take $U$ to be $M\times M$, possibly adding zeros down the diagonal of $\Sigma^2$). Now
$$\operatorname{trace}\left(Q^T(AA^T)Q\right) = \operatorname{trace}\left(Q^TU\Sigma^2U^TQ\right) = \operatorname{trace}\left(W^T\Sigma^2W\right),$$
where $W = U^TQ$. Notice that $W$ also has orthonormal columns, as
$$W^TW = Q^TUU^TQ = Q^TQ = I.$$
Thus our optimization program has become
$$\underset{W:\,M\times r}{\text{maximize}}\ \operatorname{trace}(W^T\Sigma^2W) \quad\text{subject to}\quad W^TW = I.$$

After we solve this, we can take any $\hat Q$ such that $\hat W = U^T\hat Q$.

This last optimization program is equivalent to a simple linear program that is solvable by inspection. Let $w_1,\ldots,w_r$ be the columns of $W$. Then
$$\operatorname{trace}(W^T\Sigma^2W) = \sum_{p=1}^r w_p^T\Sigma^2w_p = \sum_{p=1}^r\sum_{m=1}^M w_p[m]^2\,\sigma_m^2 = \sum_{m=1}^M h[m]\,\sigma_m^2, \quad\text{where}\quad h[m] = \sum_{p=1}^r w_p[m]^2$$
is the sum of the squares of the entries in row $m$ of $W$. Since the sum of the squares of every column of $W$ is one, the sum of the squares of all the entries of $W$ must be $r$, and so
$$\sum_{m=1}^M h[m] = r.$$
It is clear that $h[m]$ is non-negative, but it is also true that $h[m] \le 1$. Here is why: since the columns of $W$ are orthonormal, they can be considered as part of an orthonormal basis for $\mathbb{R}^M$. That is, there is an $M\times(M-r)$ matrix $W_0$ such that the $M\times M$ matrix $[\,W\ \ W_0\,]$ has both orthonormal columns and orthonormal rows, so the sum of the squares of each of its rows is equal to one. Thus the sum of the squares of the first $r$ entries of each row (which is exactly $h[m]$) cannot be larger than one.

Thus the maximum value that $\operatorname{trace}(W^T\Sigma^2W)$ can take is given by the linear program
$$\underset{h\in\mathbb{R}^M}{\text{maximize}}\ \sum_{m=1}^M h[m]\,\sigma_m^2 \quad\text{subject to}\quad \sum_{m=1}^M h[m] = r, \quad 0\le h[m]\le 1.$$
We can intuit the answer to this program. Since all of the $\sigma_m^2$ and all of the $h[m]$ are non-negative, we want to put as much weight as possible on the largest singular values. Since the weights are constrained to be no larger than 1, this simply means we max out the first $r$ terms; the solution to the program above is
$$\hat h[m] = \begin{cases} 1, & m = 1,\ldots,r, \\ 0, & m = r+1,\ldots,M.\end{cases}$$
This means that the sums of the squares of the first $r$ rows of $\hat W$ are equal to one, while the rest are zero. There might be many such matrices that fit this bill, but one of them is
$$\hat W = \begin{bmatrix} I \\ 0\end{bmatrix},$$
where $I$ is the $r\times r$ identity matrix and $0$ is an $(M-r)\times r$ matrix of all zeros. It is easy to see that choosing $\hat Q = [\,u_1\ u_2\ \cdots\ u_r\,]$ satisfies
$$U^T\hat Q = \begin{bmatrix} I \\ 0\end{bmatrix}.$$
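
If you would like to sanity-check the lemma numerically, a quick random experiment such as the following (a sketch with arbitrary dimensions) compares $\hat Q = U_r$ against randomly drawn orthonormal competitors:

    import numpy as np

    rng = np.random.default_rng(5)
    M, N, r = 8, 6, 3
    A = rng.standard_normal((M, N))

    U, s, Vt = np.linalg.svd(A)
    Q_hat = U[:, :r]
    best = np.linalg.norm(A - Q_hat @ (Q_hat.T @ A), 'fro')**2

    # No Q with orthonormal columns should beat Q_hat = U_r.
    for _ in range(1000):
        Q, _ = np.linalg.qr(rng.standard_normal((M, r)))
        err = np.linalg.norm(A - Q @ (Q.T @ A), 'fro')**2
        assert err >= best - 1e-9

    print(best, np.sum(s[r:]**2))    # both equal the sum of the discarded squared singular values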