Least squares

We have seen several least squares problems thus far, and we will see more in the upcoming lectures. For this reason it is good to have a more general picture of these problems and how to solve them.

Version 1: Given an m × n matrix A, where m > n, find a unit length vector x that minimizes ||Ax||. Here, and in the rest of this lecture, the norm is the L2 norm. Note that minimizing the L2 norm is equivalent to minimizing the sum of squares of the elements of the vector Ax, since the L2 norm is just the square root of that sum of squares. Also, the reason we restrict the minimization to x of unit length is that, without this restriction, the minimum is achieved at x = 0, which is uninteresting.

We can solve this constrained least squares problem using Lagrange multipliers, by finding the x that minimizes:

    x^T A^T A x + λ(x^T x − 1).    (*)

Taking the derivative of (*) with respect to λ and setting it to 0 enforces that x has unit length. Taking derivatives with respect to the components of x gives

    A^T A x + λx = 0,

which says that x is an eigenvector of A^T A (with eigenvalue −λ). So which eigenvector gives the least value of x^T A^T A x? Since x^T A^T A x = ||Ax||² ≥ 0 for every x, the eigenvalues of A^T A are all greater than or equal to 0. Thus, to minimize ||Ax||, we take the (unit) eigenvector of A^T A with the smallest eigenvalue; this smallest eigenvalue is guaranteed to be non-negative.

Version 2: Given an m × n matrix A with m > n, and given an m-vector b, minimize ||Ax − b||. If b = 0 then we have the same problem as above, so let's assume b ≠ 0. We don't need Lagrange multipliers in this case, since the trivial solution x = 0 is no longer a minimizer, so we don't need to avoid it. Instead, we expand:

    ||Ax − b||² = (Ax − b)^T (Ax − b) = x^T A^T A x − 2 b^T A x + b^T b

and then take partial derivatives with respect to the components of x and set them to 0. This gives

    2 A^T A x − 2 A^T b = 0,

or

    A^T A x = A^T b    (1)

which are called the normal equations.
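A minimal numerical sketch of version 1 (my own example, not part of the original notes), using NumPy: the minimizing unit vector is the eigenvector of A^T A with smallest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))  # m = 6 > n = 3

# Eigen-decomposition of the symmetric matrix A^T A.
# np.linalg.eigh returns eigenvalues in ascending order.
evals, evecs = np.linalg.eigh(A.T @ A)
x = evecs[:, 0]                  # unit eigenvector with smallest eigenvalue

# ||Ax|| for this x is no larger than for any other unit vector we try.
for _ in range(100):
    y = rng.standard_normal(3)
    y /= np.linalg.norm(y)
    assert np.linalg.norm(A @ x) <= np.linalg.norm(A @ y) + 1e-12
```

Note that ||Ax||² = x^T A^T A x equals the smallest eigenvalue exactly, since x is a unit eigenvector.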
Assume the columns of the m × n matrix A have full rank n, i.e. they are linearly independent, so that A^T A is invertible. (This is not so obvious.) Then the solution is

    x = (A^T A)^{-1} A^T b.

last updated: 8th Dec, 2016 at 08:30
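A quick sketch of version 2 in NumPy (my own example): solve the normal equations directly and cross-check against the library least squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 3))   # full column rank with probability 1
b = rng.standard_normal(8)

# Solve the normal equations A^T A x = A^T b.
x = np.linalg.solve(A.T @ A, A.T @ b)

# Cross-check against the library solver, which uses a more
# numerically stable method than forming A^T A explicitly.
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(x, x_ref)
```

In practice one prefers `lstsq` (or a QR/SVD-based solve) because forming A^T A squares the condition number of the problem.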
What is the geometric interpretation of Eq. (1)? Since b is an m-dimensional vector and A is an m × n matrix, we can uniquely write b as a sum of a vector in the column space of A and a vector in the space orthogonal to the column space of A. To minimize ||Ax − b||, by definition we find the x such that the distance from Ax to b is as small as possible. This is done by choosing x such that Ax is the component of b that lies in the column space of A, that is, Ax is the orthogonal projection of b onto the column space of A. Note that if b already belonged to the column space of A then the error Ax − b would be 0 and there would be an exact solution.

Pseudoinverse of A

In the above problem, we define the pseudoinverse of A to be the n × m matrix

    A⁺ ≡ (A^T A)^{-1} A^T.

Recall we are assuming m > n, so that A maps from a lower dimensional space to a higher dimensional space.¹ Note that A⁺A = I. Moreover, AA⁺ = A(A^T A)^{-1} A^T, and AA⁺ projects any vector b in R^m onto the column space of A; that is, it removes from b the component that is perpendicular to the column space of A. This is illustrated in the figure below. Note that the plane in the figure on the right contains the origin, since it includes Ax = 0, where x = 0.

[Figure: on the left, A maps x in R^n to Ax in R^m; on the right, A⁺ maps b in R^m to x = A⁺b in R^n, with Ax the projection of b onto the column space of A.]

The pseudoinverse maps in the reverse direction of A, namely it maps b in an m-dimensional space to an x in an n-dimensional space. Rather than inverting the mapping A, however, it only inverts the component of b that belongs to the column space of A, i.e. A⁺A = I. It ignores the component of b that is orthogonal to the column space of A.

¹ Note that if m = n then A⁺ = A^{-1}.
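The two key properties of the pseudoinverse can be checked numerically (a sketch of my own, using NumPy): A⁺ is a left inverse of A, and AA⁺ is the orthogonal projector onto the column space of A.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 2))          # m = 5 > n = 2, full column rank
A_pinv = np.linalg.inv(A.T @ A) @ A.T    # A+ = (A^T A)^{-1} A^T

# A+ is a left inverse of A ...
assert np.allclose(A_pinv @ A, np.eye(2))

# ... and P = A A+ is the orthogonal projector onto col(A):
P = A @ A_pinv
b = rng.standard_normal(5)
assert np.allclose(P @ (P @ b), P @ b)    # projecting twice changes nothing
assert np.allclose(A.T @ (b - P @ b), 0)  # residual is orthogonal to col(A)
```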
Non-linear least squares (Gauss-Newton method)

Let's look at a common application of version 2 above. Suppose we have a differentiable function f(x) from R^n to R^m, with components f_i(x). The problem we consider now is: given an initial value x₀ in R^n, find a nearby x that minimizes ||f(x)||. Note that this problem is not well defined, in the sense that "nearby" is not well defined. Nonetheless, it is worth considering problems for which one can seek to improve the solution from some initial x₀.

Consider a linear approximation of the m-vector of functions f(x) in the neighborhood of x₀, and try to minimize

    || f(x₀) + ∇f(x₀)(x − x₀) ||,

where the Jacobian ∇f is evaluated at x₀; it is an m × n matrix. Notice that if you square this function then you get a quadratic that increases as any component of x goes to infinity. So there is a unique minimum of this quadratic, and that is what we want to find. This linearized minimization problem is of the form we saw above in version 2, with A = ∇f(x₀) and b = ∇f(x₀) x₀ − f(x₀). However, since we are using a linear approximation to f(x), we do not expect to minimize ||f(x)|| exactly. Instead, we iterate

    x^(k+1) ← x^(k) + Δx,

where Δx is the step obtained from the linearized problem. At each step, we evaluate the Jacobian at the new point x^(k).

Examples

To fit a line y = mx + b to a set of points (x_i, y_i), we found the m and b that minimized Σ_i (y_i − m x_i − b)². This is a linear least squares problem (version 2).

To fit a line x cos θ + y sin θ = r to a set of points (x_i, y_i), we found the θ and r that minimized Σ_i (a x_i + b y_i − r)², where a = cos θ and b = sin θ, subject to the constraint a² + b² = 1. As we saw in lecture 15, this reduces to solving for the (a, b) that minimize Σ_i (a(x_i − x̄) + b(y_i − ȳ))², which is the version 1 least squares problem. Once we have θ, we can solve for r (recall lecture 15), namely r = x̄ cos θ + ȳ sin θ.

Vanishing point detection (lecture 15) is of the form of version 2.
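The Gauss-Newton iteration above can be sketched on a toy problem of my own choosing (not from the lecture): fitting y = a·exp(b·x) to data, where the residuals f_i(a, b) = a·exp(b·x_i) − y_i are non-linear in the parameters.

```python
import numpy as np

# Toy data generated from known parameters, so the minimum residual is zero.
x_data = np.linspace(0.0, 1.0, 20)
a_true, b_true = 2.0, -1.5
y_data = a_true * np.exp(b_true * x_data)

theta = np.array([1.5, -1.0])                # initial guess (a0, b0) near the truth
for _ in range(20):
    a, b = theta
    r = a * np.exp(b * x_data) - y_data      # residual vector f(theta), length m
    # Jacobian of the residuals, evaluated at the current iterate.
    J = np.column_stack([np.exp(b * x_data),                 # df/da
                         a * x_data * np.exp(b * x_data)])   # df/db
    # Gauss-Newton step: solve the linearized least squares problem J d = -r.
    d = np.linalg.lstsq(J, -r, rcond=None)[0]
    theta = theta + d
```

Each pass through the loop is exactly one "version 2" solve, with the Jacobian re-evaluated at the new point, as described above.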
Recall the image registration problem where we wanted the (h_x, h_y) that minimizes:

    Σ_{(x,y) in Ngd(x₀,y₀)} { I(x + h_x, y + h_y) − J(x, y) }²

This is a non-linear least squares problem, since the (h_x, h_y) are parameters of I(x + h_x, y + h_y). To solve the problem, we took a first order Taylor series expansion of I(x + h_x, y + h_y) and thereby turned it into a linear least squares problem. Our initial estimate of (h_x, h_y) was (0, 0). We then iterated to try to find a better solution. Note that we required A to be of rank 2, so that A^T A is invertible. This condition says that the second moment matrix M must be invertible.
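One linearized registration step can be sketched on a synthetic image (my own toy example, with an analytically shifted bilinear image so the gradients are nearly exact):

```python
import numpy as np

h, w = 20, 20
Y, X = np.mgrid[0:h, 0:w].astype(float)

def img(x, y):
    # A smooth bilinear test image.
    return 0.5 * x + 0.3 * y + 0.05 * x * y

I = img(X, Y)
hx_true, hy_true = 0.3, -0.2
J = img(X + hx_true, Y + hy_true)        # J(x, y) = I(x + hx, y + hy)

# First-order Taylor: I(x+hx, y+hy) ~ I + Ix*hx + Iy*hy,
# which turns the problem into a linear least squares solve for (hx, hy).
Iy, Ix = np.gradient(I)                  # axis 0 varies with y, axis 1 with x
A = np.column_stack([Ix.ravel(), Iy.ravel()])
b = (J - I).ravel()
hx, hy = np.linalg.lstsq(A, b, rcond=None)[0]
```

Here A^T A is the 2 × 2 second moment matrix; the bilinear image has gradients pointing in varying directions, so it is invertible as required.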
SVD (Singular Value Decomposition)

In the version 1 least squares problem, we needed to find the eigenvector of A^T A that had the smallest eigenvalue. In the following, we use the eigenvectors and eigenvalues of A^T A to decompose A into a product of simple matrices. This singular value decomposition is a very heavily used tool in data analysis in many fields. Strangely, it is not taught in most introductory linear algebra courses. For this reason, I give you the derivation.

Let A be an m × n matrix with m ≥ n. The first step is to note that the n × n matrix A^T A is symmetric and positive semi-definite. (It is obvious that A^T A is symmetric, and it is easy to see that the eigenvalues of A^T A are non-negative since x^T A^T A x = ||Ax||² ≥ 0 for any real x.) It follows from basic linear algebra that A^T A has an orthonormal set of eigenvectors.

[Why? The claim is that any symmetric real valued matrix M has an orthogonal (and hence orthonormal) set of eigenvectors. Here is why. First, if we have two eigenvectors v₁ and v₂ with distinct eigenvalues λ₁ and λ₂, then v₁^T M v₂ = v₁^T M^T v₂ because M is symmetric, so λ₁ v₁^T v₂ = λ₂ v₁^T v₂, and since we are assuming λ₁ ≠ λ₂, it follows that v₁^T v₂ = 0. Second, if we have two eigenvectors v₁ and v₂ with the same eigenvalue, then we can perform Gram-Schmidt orthogonalization to make the two eigenvectors orthogonal. Note I have not shown that if M is n × n then there must be n independent eigenvectors. I SHOULD FILL THAT IN ONE DAY.]

Let V be an n × n matrix whose columns are the orthonormal eigenvectors of A^T A. Since the eigenvalues are non-negative, we can write² them as σ_i², that is, σ_i is the square root of the i-th eigenvalue. We take σ_i ≥ 0, i.e. the non-negative square root. Define Σ to be an n × n diagonal matrix with values Σ_ii = σ_i on the diagonal. The elements σ_i are called the singular values of A. I emphasize: they are the square roots of the eigenvalues of A^T A.
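This relationship between singular values and eigenvalues is easy to verify numerically (my own check, using NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))

# The singular values of A are the square roots of the eigenvalues of A^T A.
evals = np.linalg.eigvalsh(A.T @ A)          # ascending order, all >= 0
sigma = np.linalg.svd(A, compute_uv=False)   # descending order
assert np.allclose(np.sqrt(evals[::-1]), sigma)
```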
Also note that Σ² is an n × n diagonal matrix, and A^T A V = V Σ². Next define Ũ as follows:

    Ũ_{m×n} ≡ A_{m×n} V_{n×n}.    (2)

Then

    Ũ^T Ũ = V^T A^T A V = V^T V Σ² = Σ².

Thus, the n columns of Ũ are orthogonal, and the i-th column has length σ_i. We now normalize the columns of Ũ by defining an m × n matrix U whose columns are orthonormal (length 1), so that U Σ = Ũ. Substituting into Eq. (2) and right multiplying by V^T gives us

    A_{m×n} = U_{m×n} Σ_{n×n} V^T_{n×n}.

Thus, any m × n matrix A can be decomposed into A = U Σ V^T where U is m × n, Σ is an n × n diagonal matrix, and V is n × n. A similar construction can be given when m < n.

² Forgive me for using the symbol σ yet again, but σ is always used in the SVD.
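The construction in this derivation can be carried out directly in code (a sketch of my own, assuming the singular values are all nonzero so the column normalization is well defined):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))

# V: orthonormal eigenvectors of A^T A; sigma_i: square roots of eigenvalues.
evals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(evals)[::-1]      # sort descending, the usual SVD convention
evals, V = evals[order], V[:, order]
sigma = np.sqrt(evals)

# U~ = A V has orthogonal columns of length sigma_i; normalize to get U.
U_tilde = A @ V
U = U_tilde / sigma                  # divides column i by sigma[i]

# The product U Sigma V^T recovers A.
assert np.allclose(U @ np.diag(sigma) @ V.T, A)
```

The final assertion holds exactly in theory, since U Σ V^T = Ũ V^T = A V V^T = A.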
One often defines the singular value decomposition of A slightly differently than this, namely one defines U to be m × m, by adding m − n more orthonormal columns. One also needs to add m − n rows of 0's to Σ to make it m × n, giving

    A_{m×n} = U_{m×m} Σ_{m×n} V^T_{n×n}.

For our purposes there is no important difference between these two decompositions.

Finally, note that Matlab has a function svd which computes the singular value decomposition. One can use svd(A) to compute the eigenvectors and eigenvalues of A^T A.
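In NumPy (as an aside of my own, since the notes mention the Matlab function), both forms are available from the same routine, via the `full_matrices` flag:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 3))

U_full, s, Vt = np.linalg.svd(A)                         # "full" SVD: U is 5 x 5
U_thin, s2, Vt2 = np.linalg.svd(A, full_matrices=False)  # "thin" SVD: U is 5 x 3
assert U_full.shape == (5, 5) and U_thin.shape == (5, 3)
assert np.allclose(s, s2)

# Either form reconstructs A; the extra columns of the full U multiply
# only the zero rows that pad Sigma out to m x n.
assert np.allclose(U_thin @ np.diag(s2) @ Vt2, A)
```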