Orthogonal Projection and Least Squares
Prof. Philip Pennance (http://pennance.us). Version: December 12, 2016.

1. Let V be a vector space. A linear transformation P : V → V is called a projection if it is idempotent, that is, if P² = P.

2. Exercise: If P is a projection then so is Q = I − P. Moreover, im Q = ker P and ker Q = im P.

3. Claim: Let P : V → V be a projection. Then V is a direct sum: V = im P ⊕ ker P.
Proof. Let v ∈ V. Then v = Pv + (I − P)v ∈ im P + ker P. Now let y ∈ im P ∩ ker P. Since y ∈ im P there exists z ∈ V such that y = Pz. Since y ∈ ker P, Py = P²z = 0. By idempotency P²z = Pz, and so y = 0.

4. If W is a complete inner product space, then a projection P on W is called orthogonal if im P and ker P are orthogonal subspaces.

5. Claim: Let P be a projection. The following are equivalent:
(a) P is self-adjoint.
(b) P is orthogonal.
Proof: Suppose P is orthogonal, so that im P and ker P are orthogonal subspaces. Then for all x and y,
    ⟨Px, y − Py⟩ = ⟨x − Px, Py⟩ = 0.
Hence ⟨x, Py⟩ = ⟨Px, Py⟩ = ⟨Px, y⟩, and P is self-adjoint.
Conversely, if P is self-adjoint then
    ⟨Px, y − Py⟩ = ⟨P²x, y − Py⟩ = ⟨Px, P(I − P)y⟩ = ⟨Px, (P − P²)y⟩ = 0,
so every vector of im P is orthogonal to every vector of ker P = im(I − P).

6. Claim: Let W be a Hilbert space. If U is a closed (under the norm topology) subspace of W, then the orthogonal projection onto U exists.
Proof (from Wikipedia): Let x ∈ W and define f(u) = ‖x − u‖ for u ∈ U. The infimum of f exists and, by completeness, f attains a minimum at some u ∈ U. Let Px = u. Clearly P² = P. Now let e = x − Px. Then for any nonzero vector u ∈ U,
    ‖e − (⟨e, u⟩/‖u‖²) u‖² = ‖e‖² − ⟨e, u⟩²/‖u‖².
From this it follows that unless ⟨e, u⟩ = 0, the vector w = Px + (⟨e, u⟩/‖u‖²) u satisfies ‖x − w‖ < ‖x − Px‖, contradicting minimality. Thus, for all u ∈ U and x ∈ W,
    ⟨x − Px, u⟩ = 0.
In particular, ⟨x − Px, Px⟩ = 0. Also, for any u ∈ U,
    ⟨(x + y) − P(x + y), u⟩ = 0  and  ⟨(x − Px) + (y − Py), u⟩ = 0.
Subtraction yields ⟨Px + Py − P(x + y), u⟩ = 0. Finally, choosing u = Px + Py − P(x + y) shows that Px + Py = P(x + y). By a similar argument λPx = P(λx) for every scalar λ. Hence P is linear.
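A quick numerical sketch of items 4 and 5 (NumPy is assumed; the matrices below are illustrative choices): a symmetric idempotent matrix has im P ⊥ ker P, while an idempotent but non-symmetric matrix is a projection that is not orthogonal.

```python
import numpy as np

# Symmetric idempotent matrix: orthogonal projection onto span{a}.
a = np.array([[1.0], [2.0]])
P_orth = (a @ a.T) / (a.T @ a)

# Idempotent but not symmetric: an oblique (non-orthogonal) projection.
P_obl = np.array([[1.0, 1.0],
                  [0.0, 0.0]])

for P in (P_orth, P_obl):
    assert np.allclose(P @ P, P)                       # idempotent, so P is a projection
    im_vec  = P @ np.array([1.0, 0.0])                 # a vector in im P
    ker_vec = (np.eye(2) - P) @ np.array([0.0, 1.0])   # a vector in ker P = im(I - P)
    print("symmetric:", np.allclose(P, P.T), " <im, ker> =", float(im_vec @ ker_vec))
```

The first matrix prints a zero inner product, the second does not, matching item 5.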
7. Orthogonal Projection: Special Case. Let a be a nonzero column vector in ℝⁿ with span C(a) = {ta : t ∈ ℝ}, and let x ∈ ℝⁿ. The orthogonal projection of x on C(a) is the unique point Pₐx ∈ C(a) which satisfies
    ⟨x − Pₐx, a⟩ = 0.
[Figure: the point x, its orthogonal projection Pₐx, and the origin O on the line C(a).]
Writing Pₐx = ta and solving for t yields
    Pₐx = (⟨x, a⟩/⟨a, a⟩) a.

8. Exercise: The matrix of Pₐ relative to the standard basis is
    P = a(aᵀa)⁻¹aᵀ.

9. Example: In statistics, the mean of a data vector y ∈ ℝⁿ is determined by projection onto the vector 1 = (1, 1, …, 1)ᵀ ∈ ℝⁿ. Specifically,
    P₁(y) = (⟨y, 1⟩/⟨1, 1⟩) 1 = ((y₁ + ⋯ + yₙ)/n) 1 = ȳ1.
We write ȳ for the mean vector ȳ1 when no confusion can arise.

10. Claim: Let V be a finite dimensional inner product space and W a proper subspace of V with basis (e₁, …, e_k). Let g be the metric matrix, g_{ij} = ⟨eᵢ, eⱼ⟩. If x ∈ V, then the orthogonal projection of x on W is given by
    P_W x = c₁e₁ + ⋯ + c_k e_k,
where
    c = g⁻¹ (⟨e₁, x⟩, ⟨e₂, x⟩, …, ⟨e_k, x⟩)ᵀ.
Proof: Write
    x = c₁e₁ + ⋯ + c_k e_k + x⊥,   (1)
where x⊥ ⊥ W. Taking inner products with each of the eⱼ gives
    Σᵢ₌₁ᵏ ⟨eⱼ, eᵢ⟩ cᵢ = ⟨eⱼ, x⟩.
In matrix form,
    g c = (⟨e₁, x⟩, ⟨e₂, x⟩, …, ⟨e_k, x⟩)ᵀ.
Hence
    c = g⁻¹ (⟨e₁, x⟩, ⟨e₂, x⟩, …, ⟨e_k, x⟩)ᵀ.   (2)
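A minimal NumPy sketch of the recipe in item 10 (the basis vectors and the data are arbitrary choices made here for illustration): form the metric matrix g, solve g c = (⟨eᵢ, x⟩), and check that the residual is orthogonal to W.

```python
import numpy as np

# Basis e_1, e_2 of a plane W in R^3 (columns of E); x is the vector to project.
E = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x = np.array([1.0, 2.0, 0.0])

g   = E.T @ E                    # metric matrix g_ij = <e_i, e_j>
rhs = E.T @ x                    # right-hand side (<e_1, x>, <e_2, x>)
c   = np.linalg.solve(g, rhs)    # projection coefficients, equation (2)
Px  = E @ c                      # P_W x = c_1 e_1 + c_2 e_2

print(c)                         # [0. 1.]
print(Px)                        # [0. 1. 1.]
print(E.T @ (x - Px))            # ~[0. 0.]: the residual x - P_W x is orthogonal to W
```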
11. Special Cases
(a) If the basis is orthonormal, then g is the k × k identity matrix I, so that cᵢ = ⟨eᵢ, x⟩. In this case
    P_W x = Σᵢ₌₁ᵏ ⟨eᵢ, x⟩ eᵢ.
(b) If the basis is merely orthogonal, then g is diagonal and easily inverted, giving
    cᵢ = ⟨eᵢ, x⟩ / ⟨eᵢ, eᵢ⟩.

12. Let V = ℝᵐ and let W ⊆ V be the column space C(A) of an m × n real matrix A with independent columns A₁, A₂, …, Aₙ, and let y ∈ ℝᵐ. Then equation (2) for projection onto W takes the form
    g c = (⟨A₁, y⟩, ⟨A₂, y⟩, …, ⟨Aₙ, y⟩)ᵀ.
Since
    (⟨A₁, y⟩, ⟨A₂, y⟩, …, ⟨Aₙ, y⟩)ᵀ = Aᵀy  and  g(i, j) = ⟨Aᵢ, Aⱼ⟩ = (AᵀA)(i, j),
the projection coefficient vector c is given by
    AᵀA c = Aᵀy.
In statistics, this is called the normal equation. The orthogonal projection of y onto W = C(A) is
    P_W y = Σᵢ cᵢAᵢ = Ac = A(AᵀA)⁻¹Aᵀ y.
Hence projection onto the column space C(A) has matrix
    P = A(AᵀA)⁻¹Aᵀ.   (3)

13. Equation (3) can be proven directly using properties of the row and column spaces.

14. Lemma: If the columns Aᵢ of A are independent, then the (square) matrix AᵀA is invertible.
Proof: It suffices to show that AᵀAx = 0 has the unique solution x = 0. Now
    AᵀAx = 0 ⟹ xᵀAᵀAx = 0 ⟹ (Ax)ᵀ(Ax) = 0 ⟹ Ax = Σᵢ xᵢAᵢ = 0.
Since the Aᵢ are independent it must be that x = 0.

15. Claim: Let A be an m × n real matrix with independent columns. Then the matrix P = A(AᵀA)⁻¹Aᵀ determines an orthogonal projection onto C(A).
Proof: Notice P² = P, so P is indeed a projection. Since PA = A, the subspace C(A) is P-invariant. Moreover, if x ∈ ℝᵐ and y = (AᵀA)⁻¹Aᵀx, then Px = Ay = Σᵢ yᵢAᵢ ∈ C(A), and so P is a projection onto C(A). Since the columns of A are independent, A has a left inverse. From this fact and the preceding lemma we have
    b ∈ ker P ⟺ Pb = 0 ⟺ A(AᵀA)⁻¹Aᵀb = 0 ⟺ (AᵀA)⁻¹Aᵀb = 0 ⟺ Aᵀb = 0 ⟺ b ∈ N(Aᵀ).
But the left nullspace N(Aᵀ) is orthogonal to the column space C(A). Since im P and ker P are orthogonal subspaces, it follows that the projection P is orthogonal.
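The following NumPy sketch checks the properties established in items 12 through 15 for the matrix P = A(AᵀA)⁻¹Aᵀ; the random matrix A is just an illustrative choice (its columns are independent with probability one).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))            # a 5 x 2 matrix with independent columns
P = A @ np.linalg.inv(A.T @ A) @ A.T       # projection matrix of equation (3)

print(np.allclose(P @ P, P))               # True: idempotent, so P is a projection
print(np.allclose(P, P.T))                 # True: self-adjoint, hence orthogonal (item 5)
print(np.allclose(P @ A, A))               # True: P fixes C(A), so im P = C(A)

y = rng.standard_normal(5)
print(np.allclose(A.T @ (y - P @ y), 0))   # True: the residual y - Py lies in N(A^T)
```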
16. Equation (3) can also be obtained geometrically as follows. Let A be an m × n real matrix with independent columns, and let the matrix P represent orthogonal projection onto C(A). Let y ∈ ℝᵐ and ŷ = Py. Since ŷ ∈ C(A) there exists x̂ such that Ax̂ = ŷ. By orthogonality, (Ax̂ − y) ⊥ C(A), and I − P is the orthogonal projection onto N(Aᵀ). Thus if y ∈ ℝᵐ, ŷ = Py and e = y − ŷ, then the residual vector e belongs to the left nullspace.
[Figure: the row space R(A) ⊆ ℝⁿ has dimension n and x̂ ↦ Ax̂ = ŷ; in ℝᵐ, y = e + ŷ with ŷ ∈ C(A) (dimension n) and e ∈ N(Aᵀ) (dimension m − n).]
But C(A)⊥ = N(Aᵀ), hence
    Aᵀ(Ax̂ − y) = 0.
Since AᵀA is invertible it follows that
    x̂ = (AᵀA)⁻¹Aᵀy.
Thus the projection of y on C(A) is ŷ = Ax̂, where x̂ is the solution of
    AᵀA x̂ = Aᵀy.

17. The action of an arbitrary m × n real matrix is illustrated in the following diagram (due to Strang).
[Figure: ℝⁿ splits into the row space R(A) (dimension r) and the nullspace N(A) (dimension n − r); writing x = x_r + x_n with x_r ∈ R(A) and x_n ∈ N(A), we have Ax_r = b, Ax_n = 0, and hence Ax = b in C(A) ⊆ ℝᵐ.]

18. When A has independent columns, the nullspace N(A) is trivial and
    ℝᵐ = C(A) ⊕ N(Aᵀ).
[Figure: b ∈ C(A) (dimension r) inside ℝᵐ, with orthogonal complement N(Aᵀ) (dimension m − r).]

19. Example: Consider the problem of finding the line y = b + mx which best fits the points (1, 1), (2, 2) and (3, 2). An exact fit would require that
    b + 1m = 1
    b + 2m = 2
    b + 3m = 2,
or, in matrix form, Ax = y, where A has rows (1, 1), (1, 2), (1, 3), x = (b, m)ᵀ and y = (1, 2, 2)ᵀ. This system has no solution since y ∉ C(A). Here AᵀA has rows (3, 6) and (6, 14), and Aᵀy = (5, 11)ᵀ, so the normal equation AᵀAx̂ = Aᵀy reads
    3b̂ + 6m̂ = 5,  6b̂ + 14m̂ = 11.
Solving for the projection coefficients,
    (b̂, m̂)ᵀ = (AᵀA)⁻¹Aᵀy = (2/3, 1/2)ᵀ.
The projection of y on the column space of A is
    ŷ = Ax̂ = (2/3)(1, 1, 1)ᵀ + (1/2)(1, 2, 3)ᵀ = (7/6, 5/3, 13/6)ᵀ,
or equivalently
    b̂ + 1m̂ = 7/6,  b̂ + 2m̂ = 5/3,  b̂ + 3m̂ = 13/6.
It follows that the regression line ŷ = b̂ + m̂x contains the points (1, 7/6), (2, 5/3), and (3, 13/6). Since the projection ŷ is the point in C(A) which minimizes ‖ŷ − y‖, it follows that the sum of the squares of the vertical distances of the data points (1, 1), (2, 2) and (3, 2) from the regression line is minimized. For this reason, the above procedure is known as the method of least squares. The vector
    e = y − ŷ = (−1/6, 1/3, −1/6)ᵀ
is called the residual vector and is orthogonal to C(A).
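A short NumPy check of the computation in item 19 (np.linalg.lstsq appears only as an independent cross-check; it is not part of the derivation above):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])

xhat = np.linalg.solve(A.T @ A, A.T @ y)   # normal equation A^T A xhat = A^T y
yhat = A @ xhat                            # projection of y onto C(A)
e    = y - yhat                            # residual vector

print(xhat)                                # [0.6667 0.5]      i.e. bhat = 2/3, mhat = 1/2
print(yhat)                                # [1.1667 1.6667 2.1667] = (7/6, 5/3, 13/6)
print(e)                                   # [-0.1667 0.3333 -0.1667] = (-1/6, 1/3, -1/6)
print(np.allclose(A.T @ e, 0))             # True: e is orthogonal to C(A)
print(np.allclose(np.linalg.lstsq(A, y, rcond=None)[0], xhat))   # same answer
```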
20. The Mean. The simplest case of least squares uses a model spanned by the single vector 1 = (1, 1, …, 1)ᵀ ∈ ℝⁿ. Let y = (y₁, y₂, …, yₙ)ᵀ. As noted previously, the projection of y on the vector 1 is just the mean vector ȳ1. This means that ȳ is the real number b ∈ ℝ which minimizes
    ‖y − b1‖² = Σᵢ (yᵢ − b)².
We can also obtain this result using the normal equation. Let A = 1 = (1, 1, …, 1)ᵀ, regarded as an n × 1 matrix, and write b1 = (b, b, …, b)ᵀ. The equation Ab = y has no solution unless y belongs to the one-dimensional column space C(A). The normal equation
    AᵀA b̂ = Aᵀy,  that is,  n b̂ = y₁ + ⋯ + yₙ,
gives b̂ = ȳ, so, as already proven, the projection of y on C(A) is Aȳ = ȳ1.
Let v₁ = 1 ∈ ℝⁿ and extend to a basis v₁ = 1, v₂, …, vₙ. The difference
    Cₙ(y) = y − ȳ1,
which lies in the span of v₂, …, vₙ, is called the centering of y. The map Cₙ is just the projection of y onto the subspace perpendicular to 1; this is the subspace in which the variance lives. The random variable
    S² = (1/(n − 1)) Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² = ‖Y − Ȳ1‖²/(n − 1)
provides an unbiased estimate of the variance of the random variable Y (of which y is an instance); see the Corollary below.
[Figure: the data vector y decomposes as the mean vector ȳ1 plus the centering Cₙ(y) = y − ȳ1.]
Finally we note that, since the projection is orthogonal, (y − ȳ1) ⊥ 1.
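A small NumPy sketch of item 20 with made-up data: the mean is recovered as a projection coefficient, and the centered vector is orthogonal to 1.

```python
import numpy as np

y   = np.array([3.0, 5.0, 4.0, 8.0])
one = np.ones_like(y)

ybar  = (y @ one) / (one @ one)       # <y, 1> / <1, 1> = ordinary mean
meanv = ybar * one                    # mean vector  ybar * 1
c     = y - meanv                     # centering C_n(y)

print(ybar, np.mean(y))               # both 5.0
print(c @ one)                        # 0.0: centered data is orthogonal to 1
print(np.allclose(y, meanv + c))      # True: y decomposes as mean + centering
```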
21. Statistical Appendix. Let Y = (Y₁, Y₂, …, Yₙ) be a vector of independent random variables with means and standard deviations specified by
    µ = (µ₁, µ₂, …, µₙ),  σ = (σ₁, σ₂, …, σₙ),
and let u ∈ ℝⁿ. Then:
(a) E(Y·u) = u·µ and Var(Y·u) = u²·σ², where σ² = (σ₁², σ₂², …, σₙ²) and u² = (u₁², u₂², …, uₙ²).
(b) If Yᵢ ~ (µ, σ²) (that is, each Yᵢ has mean µ and variance σ²) and ‖u‖ = 1, then
    E(Y·u) = (u·1)µ  and  Var(Y·u) = σ².
Moreover E(Y·u) = 0 whenever u·1 = 0.
(c) If Yᵢ ~ (µ, σ²), ‖u‖ = 1 and u ⊥ 1, then E(Y·u)² = σ².
(d) If Yᵢ ~ N(µ, σ²) and u, v are unit vectors in ℝⁿ, the following are equivalent:
    i. Y·u and Y·v are independent;
    ii. u·v = 0.
(e) If Yᵢ ~ N(µ, σ²) and ‖u‖ = 1, then Y·u is also normal, with variance σ².

22. Corollary: Let Y = (Y₁, Y₂, …, Yₙ) ∈ ℝⁿ be a random vector with independent components Yᵢ ~ (µ, σ²), and let
    S² = (1/(n − 1)) Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² = ‖Y − Ȳ1‖²/(n − 1).
Then E S² = σ².
Proof. Let u₁ = (1/√n) 1 ∈ ℝⁿ and extend (say, by Gram-Schmidt) to an orthonormal basis u₁, u₂, …, uₙ. By orthonormal expansion,
    Y = Σᵢ₌₁ⁿ P_{uᵢ}Y = P_{u₁}Y + Σᵢ₌₂ⁿ P_{uᵢ}Y = Ȳ1 + Σᵢ₌₂ⁿ P_{uᵢ}Y.
Hence
    ‖Y − Ȳ1‖² = ‖Σᵢ₌₂ⁿ P_{uᵢ}Y‖² = Σᵢ₌₂ⁿ (uᵢ·Y)².
For i ≥ 2, ‖uᵢ‖ = 1 and uᵢ ⊥ 1, so by (c), E(uᵢ·Y)² = σ². Summing these n − 1 terms gives E ‖Y − Ȳ1‖² = (n − 1)σ², and therefore E S² = σ².
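A Monte Carlo sanity check of the Corollary (the normal distribution, sample size and parameters below are arbitrary choices; the Corollary itself only requires independent components with common mean and variance):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, reps = 5, 2.0, 100_000

Y  = rng.normal(loc=3.0, scale=sigma, size=(reps, n))   # each row is one sample Y
S2 = Y.var(axis=1, ddof=1)                              # ||Y - Ybar 1||^2 / (n - 1)

print(S2.mean(), sigma**2)   # the empirical mean of S^2 is close to sigma^2 = 4
```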
23. Linear Regression in ℝⁿ. For n ≥ 2 an orthogonal basis is used in which u₁ = (1/√n) 1, u₂, …, u_p are designated as basis vectors for the model space M, and the remaining vectors u_{p+1}, …, u_{p+q} (so p + q = n) as a basis for the error space M⊥:
    u₁ = (1/√n) 1, u₂, …, u_p  (model space),  u_{p+1}, …, u_{p+q}  (error space, which carries the variance).
Notice that if u is a unit vector in M⊥ and Y = (Y₁, Y₂, …, Yₙ) ∈ ℝⁿ is a random vector with independent Yᵢ ~ (µ, σ²), then (as in the proof of the Corollary) E(Y·u)² = σ², and it follows that
    E[ (1/q) Σᵢ₌₁^q (Y·u_{p+i})² ] = σ².
In the special case n = 2 we have q = 1 and
    u₁ = (1, 1)/√2,  u₂ = (−1, 1)/√2.
Then, as advertised,
    E[(Y·u₂)²] = E[(Y₂ − Y₁)²/2] = σ².
For details, please see [2].

[Figure: "Linear Regression in Higher Dimensions" (diagram taken from Pruim [3]): the response y decomposes as the fit ŷ plus the residuals e = y − ŷ; within the model space, ŷ decomposes as the mean ȳ1 plus the effect ŷ − ȳ1.]

Sources
1. Gilbert Strang, Introduction to Linear Algebra.
2. David Saville and Graham R. Wood, Statistical Methods: The Geometric Approach (Springer Texts in Statistics).
3. Randall Pruim, Foundations and Applications of Statistics: An Introduction Using R (Pure and Applied Undergraduate Texts), whose color scheme the diagram above follows.