Least Squares
Tom Lyche
Centre of Mathematics for Applications, Department of Informatics, University of Oslo
October 26, 2010
Linear system

Linear system $Ax = b$, $A \in \mathbb{C}^{m,n}$, $b \in \mathbb{C}^m$, $x \in \mathbb{C}^n$.
- under-determined ($m < n$): no solution, or an infinite number of solutions.
- square ($m = n$): a unique solution if $A$ is nonsingular; otherwise either no solution or an infinite number of solutions.
- over-determined ($m > n$): either no solution, a unique solution, or an infinite number of solutions. But an over-determined system usually has no solution.
The Least Squares Problem

Definition. Given $A \in \mathbb{C}^{m,n}$ and $b \in \mathbb{C}^m$. We call an $x \in \mathbb{C}^n$ which minimizes $\|r(x)\|_2^2 = \|Ax - b\|_2^2$ a least squares solution of $Ax = b$. We set $E(x) := \|Ax - b\|_2^2 = \|r(x)\|_2^2$. To find an $x$ which minimizes $E(x)$ is called the Least Squares Problem (LSQ). Since the square root function is monotone, minimizing $E(x)$ or $\sqrt{E(x)}$ is equivalent.
Example

The over-determined system $x_1 = 1$, $x_1 = 1$, $x_1 = 2$:
$$A = \begin{bmatrix}1\\1\\1\end{bmatrix},\quad x = [x_1],\quad b = \begin{bmatrix}1\\1\\2\end{bmatrix},$$
$$\|Ax - b\|_2^2 = (x_1-1)^2 + (x_1-1)^2 + (x_1-2)^2.$$
Setting the first derivative with respect to $x_1$ equal to zero we obtain
$$2(x_1-1) + 2(x_1-1) + 2(x_1-2) = 0 \implies x_1 = 4/3,$$
the average of $b_1, b_2, b_3$. The second derivative is positive, so $x_1 = 4/3$ is a global minimum.
The Normal Equations

For real $A$ and $b$,
$$E(x) = \|Ax - b\|_2^2 = x^TBx - 2c^Tx + \beta, \qquad B = A^TA,\ c = A^Tb,\ \beta = b^Tb,$$
$$\nabla E(x) := \Big[\frac{\partial E}{\partial x_1}, \ldots, \frac{\partial E}{\partial x_n}\Big]^T = 2(Bx - c).$$
$x$ a least squares solution $\implies Bx - c = 0 \iff A^*Ax = A^*b$ (the Normal Equations). We will show the converse: $A^*Ax = A^*b \implies x$ is a least squares solution.
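As a quick numerical check of the normal equations, here is a minimal NumPy sketch (the matrix and right-hand side are made up for illustration):

```python
import numpy as np

# Illustrative over-determined system (made-up data).
A = np.array([[1.0, 3.0], [1.0, 3.0], [1.0, -1.0], [1.0, -1.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])

# Solve the normal equations A^T A x = A^T b directly.
B = A.T @ A
c = A.T @ b
x_normal = np.linalg.solve(B, c)

# Compare with the library least squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_normal, x_lstsq))  # True
```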
Example 2

We choose $m$ abscissas $t_1, \ldots, t_m$, $n$ functions $\varphi_1, \varphi_2, \ldots, \varphi_n$ defined for $t \in \{t_1, t_2, \ldots, t_m\}$, and $m$ positive numbers $w_1, \ldots, w_m$. We want to find $x = [x_1, x_2, \ldots, x_n]^T$ such that
$$E(x) := \sum_{k=1}^m w_k\Big[\sum_{j=1}^n x_j\varphi_j(t_k) - y_k\Big]^2$$
is as small as possible. Typical examples of functions might be polynomials, trigonometric functions, exponential functions, or splines. The numbers $w_k$ are called weights.
Example 2 is a Least Squares Problem

$$E(x) := \sum_{k=1}^m w_k\Big[\sum_{j=1}^n x_j\varphi_j(t_k) - y_k\Big]^2 = \|Ax - b\|_2^2,$$
with $A = [w_k^{1/2}\varphi_j(t_k)] \in \mathbb{R}^{m,n}$ and $b = [w_k^{1/2}y_k] \in \mathbb{R}^m$. Then
$$B = A^TA = \Big[\sum_{k=1}^m w_k\varphi_i(t_k)\varphi_j(t_k)\Big]_{i,j=1}^n \in \mathbb{R}^{n,n}, \qquad c = A^Tb = \Big[\sum_{k=1}^m w_ky_k\varphi_i(t_k)\Big]_{i=1}^n \in \mathbb{R}^n.$$
Take $w_i = 1$, $i = 1, \ldots, m$, and $\varphi_1(t) = 1$, $\varphi_2(t) = t$, ..., $\varphi_n(t) = t^{n-1}$. The normal equations for $n = 3$:
$$B_3x = \begin{bmatrix}m & \sum t_k & \sum t_k^2\\ \sum t_k & \sum t_k^2 & \sum t_k^3\\ \sum t_k^2 & \sum t_k^3 & \sum t_k^4\end{bmatrix}\begin{bmatrix}x_1\\x_2\\x_3\end{bmatrix} = \begin{bmatrix}\sum y_k\\ \sum t_ky_k\\ \sum t_k^2y_k\end{bmatrix}.$$
$B_3$ is symmetric positive definite if there are at least $n$ distinct $t$'s.
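A sketch of forming and solving these $n = 3$ normal equations with illustrative data:

```python
import numpy as np

# Made-up sample points and values from a quadratic.
t = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * t + 0.5 * t**2

# Vandermonde-type matrix with columns 1, t, t^2.
A = np.vander(t, 3, increasing=True)

# Normal equations B x = c for the quadratic fit.
B = A.T @ A          # entries are sums of powers of t_k
c = A.T @ y          # entries are sums of t_k^i * y_k
x = np.linalg.solve(B, c)
print(x)  # approximately [1.0, 2.0, 0.5]
```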
Hilbert Matrix

Take $t_i = (i-1)/(m-1)$, $i = 1, \ldots, m$. Then $\frac{1}{m}B_n \approx H_n$, the Hilbert matrix with entries $1/(i+j-1)$, e.g.
$$H_3 = \begin{bmatrix}1 & \tfrac12 & \tfrac13\\ \tfrac12 & \tfrac13 & \tfrac14\\ \tfrac13 & \tfrac14 & \tfrac15\end{bmatrix}.$$
Indeed,
$$\frac1m\sum_{k=1}^m t_k^{i+j} = \frac1m\sum_{k=1}^m\Big(\frac{k-1}{m-1}\Big)^{i+j} \approx \int_0^1 x^{i+j}\,dx = \frac{1}{i+j+1}.$$
$K_1(H_6) \approx 3\cdot 10^7$. Extremely ill-conditioned even for moderate $n$. Use a different basis for the polynomials.
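A short sketch showing how fast the conditioning degrades (SciPy provides the Hilbert matrix directly):

```python
import numpy as np
from scipy.linalg import hilbert

# 1-norm condition numbers K_1(H_n) grow explosively with n.
for n in range(3, 9):
    H = hilbert(n)
    print(n, np.linalg.cond(H, 1))
```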
What is next?
- The Pseudoinverse
- Orthogonal Projections
- LSQ: Existence and Uniqueness
- Numerical Solution Methods
- Perturbation Theory
Recall SVD, SVF, and Outer Product Form

$$A = U\Sigma V^* = U_1\Sigma_1V_1^* = \sigma_1u_1v_1^* + \cdots + \sigma_ru_rv_r^*.$$
Figure: Three forms of the SVD: decomposition (left), factorization (center), and outer product (right).
The Pseudoinverse

Suppose $A = U_1\Sigma_1V_1^* \in \mathbb{C}^{m,n}$ is a singular value factorization of $A$. The matrix $A^\dagger := V_1\Sigma_1^{-1}U_1^*$ is called the pseudo-inverse of $A$. If $A \in \mathbb{C}^{m,n}$ then $A^\dagger \in \mathbb{C}^{n,m}$. If (1) $ABA = A$, (2) $BAB = B$, (3) $(BA)^* = BA$, and (4) $(AB)^* = AB$, then $B = A^\dagger$. Hence $A^\dagger$ is independent of the factorization chosen to represent it. If $A$ is square and nonsingular then $A^\dagger A = AA^\dagger = I$ and $A^\dagger$ is the usual inverse of $A$. Any matrix has a pseudo-inverse, and so $A^\dagger$ is a generalization of the usual inverse.
Example

$$A = \begin{bmatrix}1&1\\1&1\\0&0\end{bmatrix} = \begin{bmatrix}1/\sqrt2\\1/\sqrt2\\0\end{bmatrix}\,[2]\,\begin{bmatrix}1/\sqrt2 & 1/\sqrt2\end{bmatrix},$$
$$B := A^\dagger = \begin{bmatrix}1/\sqrt2\\1/\sqrt2\end{bmatrix}\,[1/2]\,\begin{bmatrix}1/\sqrt2 & 1/\sqrt2 & 0\end{bmatrix} = \frac14\begin{bmatrix}1&1&0\\1&1&0\end{bmatrix}.$$
If we guess a candidate $B$ for $A^\dagger$ and verify (1) $ABA = A$, (2) $BAB = B$, (3) $(BA)^* = BA$, and (4) $(AB)^* = AB$, then $B = A^\dagger$.
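A sketch verifying this example numerically with NumPy:

```python
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0]])

# Pseudo-inverse via the SVD-based library routine.
B = np.linalg.pinv(A)
print(B)  # [[0.25, 0.25, 0.0], [0.25, 0.25, 0.0]]

# Check the four Penrose conditions (A is real, so * is transpose).
print(np.allclose(A @ B @ A, A))        # (1) ABA = A
print(np.allclose(B @ A @ B, B))        # (2) BAB = B
print(np.allclose((B @ A).T, B @ A))    # (3) (BA)* = BA
print(np.allclose((A @ B).T, A @ B))    # (4) (AB)* = AB
```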
What is next?
- The Pseudoinverse
- Orthogonal Projections
- LSQ: Existence and Uniqueness
- Numerical Solution Methods
- Perturbation Theory
Recall Direct Sum and Orthogonal Sum

Suppose $S$ and $T$ are subspaces of $\mathbb{R}^n$ or $\mathbb{C}^n$. We define
- Sum: $X := S + T := \{s + t : s \in S \text{ and } t \in T\}$.
- Direct sum: if $S \cap T = \{0\}$, then $S \oplus T := S + T$.
- Orthogonal sum: suppose $\langle\cdot,\cdot\rangle$ is an inner product on $\mathbb{R}^n$ or $\mathbb{C}^n$. $S \oplus T$ is an orthogonal sum if $\langle s, t\rangle = 0$ for all $s \in S$ and all $t \in T$.
$\mathrm{span}(A) \oplus \ker(A^*)$ is an orthogonal sum with respect to the usual inner product $\langle s, t\rangle := s^*t$. For if $y = Ax \in \mathrm{span}(A)$ and $z \in \ker(A^*)$ then $y^*z = (Ax)^*z = x^*(A^*z) = 0$.
- Orthogonal complement: $T = S^\perp := \{x \in X : \langle s, x\rangle = 0 \text{ for all } s \in S\}$.
Basic facts

Suppose $S$ and $T$ are subspaces of $\mathbb{R}^n$ or $\mathbb{C}^n$. Then
- $S + T = T + S$, and $S + T$ is a subspace of $\mathbb{R}^n$ or $\mathbb{C}^n$.
- $\dim(S + T) = \dim S + \dim T - \dim(S \cap T)$.
- $\dim(S \oplus T) = \dim S + \dim T$.
- $\mathbb{C}^m = \mathrm{span}(A) \oplus \ker(A^*)$.
- Every $v \in S \oplus T$ can be decomposed uniquely as $v = s + t$, where $s \in S$ and $t \in T$. If $S \oplus T$ is an orthogonal sum, then $s$ is called the orthogonal projection of $v$ into $S$.
Pythagoras

[Figure: right triangle with legs $s \in S$ and $t$, hypotenuse $v = s + t$.]
If $\langle s, t\rangle = 0$ then $\|s + t\|^2 = \|s\|^2 + \|t\|^2$. Here $\|v\| := \sqrt{\langle v, v\rangle}$.
Orthogonal bases for span(A) and ker(A*)

$A = U\Sigma V^*$, $A^* = V\Sigma^TU^*$, so $AV = U\Sigma$ and $A^*U = V\Sigma^T$:
$$A[V_1\ V_2] = [U_1\ U_2]\begin{bmatrix}\Sigma_1 & 0\\ 0 & 0\end{bmatrix}, \qquad A^*[U_1\ U_2] = [V_1\ V_2]\begin{bmatrix}\Sigma_1 & 0\\ 0 & 0\end{bmatrix}.$$
Thus $AV_1 = U_1\Sigma_1$, $AV_2 = 0$, $A^*U_1 = V_1\Sigma_1$, $A^*U_2 = 0$.
- The columns of $U_1$ form an orthonormal basis for $\mathrm{span}(A)$.
- The columns of $U_2$ form an orthonormal basis for $\ker(A^*)$.
Orthogonal Projections and SVD

$$A = [U_1\ U_2]\begin{bmatrix}\Sigma_1 & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix}V_1^*\\ V_2^*\end{bmatrix},$$
where the columns of $U_1$ form an orthonormal basis for $\mathrm{span}(A)$ and those of $U_2$ an orthonormal basis for $\ker(A^*)$. Let $b \in \mathbb{C}^m$. Then
$$b = UU^*b = [U_1\ U_2]\begin{bmatrix}U_1^*\\ U_2^*\end{bmatrix}b = U_1(U_1^*b) + U_2(U_2^*b).$$
$b_1 := U_1U_1^*b \in \mathrm{span}(A)$ is the orthogonal projection of $b$ into $\mathrm{span}(A)$; $b_2 := U_2U_2^*b \in \ker(A^*)$ is the orthogonal projection into $\ker(A^*)$.
Orthogonal Projections and SVD

$$b_1 = AA^\dagger b, \qquad b_2 = (I - AA^\dagger)b.$$
[Figure: $b$ decomposed into $b_1 \in \mathrm{span}(A)$ and $b_2$ orthogonal to $\mathrm{span}(A)$.]
Example

The singular value decomposition of $A = \begin{bmatrix}1&0\\0&1\\0&0\end{bmatrix}$ is $A = I_3AI_2$. Then
$$A^\dagger = I_2\begin{bmatrix}1&0&0\\0&1&0\end{bmatrix}I_3 = \begin{bmatrix}1&0&0\\0&1&0\end{bmatrix},$$
$$AA^\dagger = \begin{bmatrix}1&0\\0&1\\0&0\end{bmatrix}\begin{bmatrix}1&0&0\\0&1&0\end{bmatrix} = \begin{bmatrix}1&0&0\\0&1&0\\0&0&0\end{bmatrix}, \qquad I_3 - AA^\dagger = \begin{bmatrix}0&0&0\\0&0&0\\0&0&1\end{bmatrix}.$$
If $b = [b_1, b_2, b_3]^T$, then $b_1 = AA^\dagger b = [b_1, b_2, 0]^T$ and $b_2 = (I_3 - AA^\dagger)b = [0, 0, b_3]^T$.
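The same projections computed numerically (a small NumPy sketch with an arbitrary vector $b$):

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
b = np.array([3.0, 4.0, 5.0])   # arbitrary illustrative vector

P = A @ np.linalg.pinv(A)       # orthogonal projector onto span(A)
b1 = P @ b                      # [3, 4, 0]
b2 = (np.eye(3) - P) @ b        # [0, 0, 5]
print(b1, b2)

# Equivalent route via the SVD: b1 = U1 U1^* b.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
r = np.sum(s > 1e-12)           # numerical rank
U1 = U[:, :r]
print(np.allclose(U1 @ (U1.T @ b), b1))  # True
```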
What is next?
- The Pseudoinverse
- Orthogonal Projections
- LSQ: Existence and Uniqueness
- Numerical Solution Methods
- Perturbation Theory
Existence, Uniqueness, Characterization

Theorem. The least squares problem always has a solution. The solution is unique if and only if $A$ has linearly independent columns. Moreover, the following are equivalent:
1. $x$ is a solution of the least squares problem.
2. $A^*Ax = A^*b$.
3. $x = A^\dagger b + z$ for some $z \in \ker(A)$, where $A^\dagger$ is the pseudo-inverse of $A$.
We have $\|x\|_2 \ge \|A^\dagger b\|_2$ for all solutions $x$ of the least squares problem.
Proof: Existence

Let $b = b_1 + b_2$, where $b_1 \in \mathrm{span}(A)$ and $b_2 \in \ker(A^*)$ are the orthogonal projections into $\mathrm{span}(A)$ and $\ker(A^*)$, respectively. Since $b_2^*v = 0$ for any $v \in \mathrm{span}(A)$, we have $b_2^*(b_1 - Ax) = 0$ for any $x \in \mathbb{C}^n$. Therefore, for $x \in \mathbb{C}^n$,
$$\|b - Ax\|_2^2 = \|(b_1 - Ax) + b_2\|_2^2 = \|b_1 - Ax\|_2^2 + \|b_2\|_2^2 \ge \|b_2\|_2^2,$$
with equality if and only if $Ax = b_1$. Since $b_1 \in \mathrm{span}(A)$ we can always find such an $x$, and existence follows.
Proof: 1, 2, 3 and Uniqueness

$1 \iff 2$: By what we have shown, $x$ solves the least squares problem if and only if $Ax = b_1$, so that $b - Ax = b_1 + b_2 - Ax = b_2 \in \ker(A^*)$, i.e. $A^*(b - Ax) = 0$, i.e. $A^*Ax = A^*b$.
$1 \implies 3$: Suppose $Ax = b_1$ and define $z := x - A^\dagger b$. Then $Az = Ax - AA^\dagger b = b_1 - b_1 = 0$, so $z \in \ker(A)$.
$3 \implies 1$: If $x = A^\dagger b + z$ with $z \in \ker(A)$, then $Ax = AA^\dagger b + Az = b_1$.
If $A$ has linearly independent columns, then $\ker(A) = \{0\}$ and $x = A^\dagger b$ is the unique solution.
Proof: Minimum norm property

Suppose $x = A^\dagger b + z$, with $z \in \ker(A)$, is a solution. Let $[v_1, \ldots, v_r, v_{r+1}, \ldots, v_n] = [V_1, V_2]$ be the right singular vectors of $A$; the columns of $V_2$ form a basis for $\ker(A)$ and $V_2^*V_1 = 0$. Since $A^\dagger = V_1\Sigma_1^{-1}U_1^*$ and $z \in \ker(A)$, we have $z = V_2y$ for some $y \in \mathbb{C}^{n-r}$ and
$$z^*A^\dagger b = y^*V_2^*V_1\Sigma_1^{-1}U_1^*b = 0.$$
Thus $z$ and $A^\dagger b$ are orthogonal, so that
$$\|x\|_2^2 = \|A^\dagger b + z\|_2^2 = \|A^\dagger b\|_2^2 + \|z\|_2^2 \ge \|A^\dagger b\|_2^2.$$
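A small numerical illustration of the minimum norm property for a rank-deficient matrix (a sketch; the data are made up):

```python
import numpy as np

# Rank-deficient matrix: the two columns are equal.
A = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
b = np.array([1.0, 3.0, 2.0])

x_min = np.linalg.pinv(A) @ b    # minimal norm least squares solution
z = np.array([1.0, -1.0])        # basis vector of ker(A)

# Every x_min + t*z is also a least squares solution
# (constant residual), but has larger norm for t != 0.
for t in (0.0, 0.5, 1.0):
    x = x_min + t * z
    print(np.linalg.norm(A @ x - b), np.linalg.norm(x))
```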
What is next?
- The Pseudoinverse
- Orthogonal Projections
- LSQ: Existence and Uniqueness
- Numerical Solution Methods
  - Normal Equations
  - QR Decomposition and Factorization
  - SVD and SVF
- Perturbation Theory
Numerical Solution: Normal Equations

Assume $A \in \mathbb{R}^{m,n}$, $b \in \mathbb{R}^m$, and that $A$ has linearly independent columns ($m \ge n$). Then $B := A^TA$ is symmetric positive definite, and we can use the Cholesky factorization $B = R^TR$.
Computing A^T A

Write $A = [a_{:1}, \ldots, a_{:n}]$ by columns and let $a_{1:}^T, \ldots, a_{m:}^T$ be its rows.
1. $(A^TA)_{ij} = a_{:i}^Ta_{:j}$, $(A^Tb)_i = a_{:i}^Tb$ (inner product form).
2. $A^TA = \sum_{i=1}^m a_{i:}a_{i:}^T$, $A^Tb = \sum_{i=1}^m b_ia_{i:}$ (outer product form).
The outer product form is suitable for large problems since it uses only one pass through the data, importing one row of $A$ at a time from some separate storage.
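A sketch of the one-pass outer product accumulation, assuming the rows arrive from some stream:

```python
import numpy as np

def accumulate_normal_equations(rows, n):
    """Form B = A^T A and c = A^T b one row at a time.

    `rows` yields pairs (a_i, b_i): a row of A and the matching
    entry of b; only one row is held in memory at a time.
    """
    B = np.zeros((n, n))
    c = np.zeros(n)
    for a_i, b_i in rows:
        B += np.outer(a_i, a_i)   # rank-one update a_i a_i^T
        c += b_i * a_i
    return B, c

# Tiny illustration with made-up rows.
data = [(np.array([1.0, 2.0]), 1.0), (np.array([1.0, 3.0]), 2.0)]
B, c = accumulate_normal_equations(data, 2)
print(np.linalg.solve(B, c))
```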
Complexity

Consider the number of operations to compute $B := A^TA$. We need $2m$ flops to find each $a_{:i}^Ta_{:j}$. Since $B$ is symmetric we only need to compute $n(n+1)/2$ such inner products. It follows that $B$ can be computed in $O(mn^2)$ flops. The computation of $B$ using outer products can also be done in $O(mn^2)$ flops by accumulating only one half of each outer product. In conclusion, the number of operations is $O(mn^2)$ to find $B$, $2mn$ to find $A^Tb$, about $n^3/3$ to find $R$, $O(n^2)$ to solve $R^Ty = c$, and $O(n^2)$ to solve $Rx = y$. Since $m \ge n$, the bulk of the work is to find $B$.
Condition Number Issue: Squaring the Trouble

A problem with the normal equation approach is that the linear system can be poorly conditioned: the 2-norm condition number of $B := A^TA$ is the square of the condition number of $A$,
$$K_2(B) = (\sigma_1/\sigma_n)^2 = K_2(A)^2.$$
A further difficulty which can be encountered is that, in floating point arithmetic, the computed $A^TA$ might not be positive definite.
Numerical Solution using the QR Factorization

Assume $A \in \mathbb{R}^{m,n}$, $b \in \mathbb{R}^m$, and that $A$ has linearly independent columns ($m \ge n$). Suppose $A = Q_1R_1$ is a QR factorization of $A$. Then
$$A^TA = R_1^TQ_1^TQ_1R_1 = R_1^TR_1, \qquad A^Tb = R_1^TQ_1^Tb.$$
Since $A$ has rank $n$, the matrix $R_1^T$ is nonsingular and can be canceled. Thus
$$A^TAx = A^Tb \iff R_1x = c_1, \quad c_1 := Q_1^Tb.$$
$R_1x = c_1$, $c_1 := Q_1^Tb$. We can use Householder transformations or Givens rotations to find $R_1$ and $c_1$. Consider using the Householder triangulation algorithm: we find $R = Q^TA$ and $c = Q^Tb$, where $A = QR$ is the QR decomposition of $A$. The matrices $R_1$ and $c_1$ are located in the first $n$ rows of $R$ and $c$.
Example

Consider the least squares problem with
$$A = \begin{bmatrix}1&3&1\\1&3&7\\1&-1&-4\\1&-1&2\end{bmatrix} \quad\text{and}\quad b = \begin{bmatrix}1\\1\\1\\1\end{bmatrix}.$$
A QR decomposition $A = QR$ is
$$\begin{bmatrix}1&3&1\\1&3&7\\1&-1&-4\\1&-1&2\end{bmatrix} = \frac12\begin{bmatrix}1&1&-1&-1\\1&1&1&1\\1&-1&-1&1\\1&-1&1&-1\end{bmatrix}\begin{bmatrix}2&2&3\\0&4&5\\0&0&6\\0&0&0\end{bmatrix}.$$
Example

A QR factorization $A = Q_1R_1$ is obtained by dropping the last column of $Q$ and the last row of $R$, so that
$$A = \frac12\begin{bmatrix}1&1&-1\\1&1&1\\1&-1&-1\\1&-1&1\end{bmatrix}\begin{bmatrix}2&2&3\\0&4&5\\0&0&6\end{bmatrix} = Q_1R_1.$$
The least squares solution $x$ is found by solving the system
$$\begin{bmatrix}2&2&3\\0&4&5\\0&0&6\end{bmatrix}\begin{bmatrix}x_1\\x_2\\x_3\end{bmatrix} = Q_1^Tb = \frac12\begin{bmatrix}1&1&1&1\\1&1&-1&-1\\-1&1&-1&1\end{bmatrix}\begin{bmatrix}1\\1\\1\\1\end{bmatrix} = \begin{bmatrix}2\\0\\0\end{bmatrix};$$
we find $x = [1, 0, 0]^T$.
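The same computation with a library QR factorization (a sketch; note that numpy's qr may choose different signs in $Q_1$ and $R_1$, which does not affect $x$):

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[1.0, 3.0, 1.0],
              [1.0, 3.0, 7.0],
              [1.0, -1.0, -4.0],
              [1.0, -1.0, 2.0]])
b = np.ones(4)

# Reduced QR factorization A = Q1 R1.
Q1, R1 = np.linalg.qr(A)        # Q1 is 4x3, R1 is 3x3
c1 = Q1.T @ b
x = solve_triangular(R1, c1)    # back substitution for R1 x = c1
print(x)  # [1, 0, 0]
```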
Complexity

The leading term in the number of flops to compute a QR decomposition by Householder triangulation is approximately $2mn^2 - 2n^3/3$. The number of flops needed to form the normal equations, taking advantage of symmetry, is approximately $mn^2$. Thus for $m$ much larger than $n$, Householder triangulation requires twice as many flops as an approach based on the normal equations. Also, Householder triangulation has problems taking advantage of the structure in sparse problems.
Condition Number: Less Trouble

The 2-norm condition number for the system $R_1x = c_1$ is
$$K_2(R_1) = K_2(Q_1R_1) = K_2(A) = \sqrt{K_2(A^TA)},$$
the square root of the condition number for the normal equations. Thus if $A$ is mildly ill-conditioned, the normal equations can be quite ill-conditioned, and solving the normal equations can give inaccurate results. The QR factorization approach is quite stable.
Numerical Solution using the Singular Value Factorization

This method can be used even if $A$ does not have full rank. It requires knowledge of the pseudo-inverse of $A$. $x = A^\dagger b + z$ is a least squares solution for any $z \in \ker(A)$. When $\mathrm{rank}(A)$ is less than the number of columns of $A$, then $\ker(A) \ne \{0\}$, and we have a choice of $z$. One possible choice is $z = 0$, giving the minimal norm solution $A^\dagger b$.
Example

Find all least squares solutions of $Ax = b$ with $A = \begin{bmatrix}1&1\\1&1\\0&0\end{bmatrix}$.
The pseudo-inverse of $A$ is $A^\dagger = \frac14\begin{bmatrix}1&1&0\\1&1&0\end{bmatrix}$, and $[1, -1]^T$ is a basis for $\ker(A)$. If $b = [b_1, b_2, b_3]^T$, then for any $z \in \mathbb{R}$ the vector
$$x = \frac14\begin{bmatrix}1&1&0\\1&1&0\end{bmatrix}\begin{bmatrix}b_1\\b_2\\b_3\end{bmatrix} + z\begin{bmatrix}1\\-1\end{bmatrix}$$
is a solution of $\min_x\|Ax - b\|_2$, and this gives all solutions. $z = 0$ gives the minimal norm solution.
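A sketch of the SVD route for this rank-deficient problem (arbitrary right-hand side):

```python
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
b = np.array([1.0, 2.0, 3.0])   # arbitrary illustrative vector

# Minimal norm solution from the singular value factorization.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = np.sum(s > 1e-12)                     # numerical rank (here r = 1)
x_min = Vt[:r].T @ ((U[:, :r].T @ b) / s[:r])

# Same result from the library routine.
print(np.allclose(x_min, np.linalg.pinv(A) @ b))  # True
```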
Matlab

x = lscov(A,b) returns the ordinary least squares solution to the linear system of equations A*x = b. b can also be an m-by-k matrix, and lscov returns one solution for each column of b. When rank(A) < n, lscov sets the maximum possible number of elements of x to zero to obtain a basic solution; this is not the same as the minimal norm solution. lscov uses a QR factorization.
What is next?
- The Pseudoinverse
- Orthogonal Projections
- LSQ Theory: Existence and Uniqueness
- Numerical Solution Methods
- Perturbation Theory
  - Perturbing the right-hand side
Perturbing the right-hand side

Theorem. Suppose $A \in \mathbb{C}^{m,n}$ has linearly independent columns, and let $b, e \in \mathbb{C}^m$. Let $x, y \in \mathbb{C}^n$ be the solutions of $\min_x\|Ax - b\|_2$ and $\min_y\|Ay - (b + e)\|_2$. Finally, let $b_1, e_1$ be the orthogonal projections of $b$ and $e$ on $\mathrm{span}(A)$. If $b_1 \ne 0$, we have for any operator norm
$$\frac{1}{K(A)}\frac{\|e_1\|}{\|b_1\|} \le \frac{\|y - x\|}{\|x\|} \le K(A)\frac{\|e_1\|}{\|b_1\|}, \qquad K(A) = \|A\|\,\|A^\dagger\|.$$
[Figure: $b$ and $e$ with their projections $b_1$ and $e_1$ onto $\mathrm{span}(A)$; $b_2 \in \ker(A^*)$.]
Proof

Subtracting $x = A^\dagger b$ from $y = A^\dagger b + A^\dagger e$ gives $y - x = A^\dagger e = A^\dagger e_1$, since $A^\dagger e_2 = 0$ for the component $e_2 \in \ker(A^*)$. Thus
$$\|y - x\| = \|A^\dagger e_1\| \le \|A^\dagger\|\,\|e_1\|.$$
Moreover, $b_1 = Ax$, so $\|b_1\| \le \|A\|\,\|x\|$. Therefore
$$\frac{\|y - x\|}{\|x\|} \le \|A^\dagger\|\,\|e_1\|\,\frac{\|A\|}{\|b_1\|},$$
proving the rightmost inequality. From $A(y - x) = e_1$ and $x = A^\dagger b_1$ we obtain the leftmost inequality.
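A numerical sanity check of the bound (a sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)
e = 1e-4 * rng.standard_normal(6)   # small perturbation

Adag = np.linalg.pinv(A)
x = Adag @ b                        # solution for b
y = Adag @ (b + e)                  # solution for b + e

P = A @ Adag                        # projector onto span(A)
b1, e1 = P @ b, P @ e
K = np.linalg.norm(A, 2) * np.linalg.norm(Adag, 2)

rel_err = np.linalg.norm(y - x) / np.linalg.norm(x)
rho = np.linalg.norm(e1) / np.linalg.norm(b1)
print(rho / K <= rel_err <= K * rho)  # True
```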