Chapter 6: Orthogonality (Last Updated: November 7, 7)

These notes are derived primarily from Linear Algebra and Its Applications by David Lay (4th ed.). A few theorems have been moved around.

1. Inner Products

We now return to a discussion of the geometry of vectors. There are many applications of the notion of orthogonality, some of which we will discuss. A basic (geometric) question that we will address shortly is the following. Suppose you are given a plane P and a point p (in R³). What is the distance from p to P? That is, what is the length of the shortest possible line segment that one could draw from p to P?

Definition 1. Let u, v ∈ Rⁿ. The inner product of u and v is defined as

  u·v = uᵀv = [u₁ ⋯ uₙ] [v₁; ⋮ ; vₙ] = u₁v₁ + ⋯ + uₙvₙ.

The inner product is also referred to as the dot product. Another product, the cross product, will be discussed at a later time.

Example 1. Let a = (3, 5) and let b = (b₁, b₂) be vectors in R². Compute a·b and b·a.

Theorem 2. Let u, v, w ∈ Rⁿ and let c ∈ R. Then
(1) u·v = v·u.
(2) (u + v)·w = (u·w) + (v·w).
(3) (cu)·v = c(u·v).
(4) u·u ≥ 0, and u·u = 0 if and only if u = 0.

The inner product is a useful tool to study the geometry of vectors.

Definition 2. The length (or norm) of v ∈ Rⁿ is the non-negative scalar ‖v‖ defined by

  ‖v‖ = √(v·v) = √(v₁² + ⋯ + vₙ²), so that ‖v‖² = v·v.

A unit vector is a vector of length 1. Note that ‖cv‖ = |c| ‖v‖ for c ∈ R. If v ≠ 0, then u = v/‖v‖ is the unit vector in the same direction as v.

Example 3. Find the lengths of a and b in Example 1 and find their associated unit vectors.
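The definitions above translate directly into code. Here is a minimal Python sketch (the helper names `dot`, `norm`, and `normalize` are ours, not from the notes):

```python
import math

def dot(u, v):
    # inner product: u . v = u1*v1 + ... + un*vn
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(v):
    # length of v: ||v|| = sqrt(v . v)
    return math.sqrt(dot(v, v))

def normalize(v):
    # unit vector v/||v|| in the same direction as a nonzero v
    n = norm(v)
    return [vi / n for vi in v]
```

For instance, `dot([1, 2], [3, 4])` returns 11, and `normalize([3, 4])` returns `[0.6, 0.8]`, a unit vector.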
Recall that the distance between two points (a₁, b₁) and (a₂, b₂) in R² is given by the well-known distance formula √((a₁ − a₂)² + (b₁ − b₂)²). We can similarly define distance between vectors.

Definition 3. For u, v ∈ Rⁿ, the distance between u and v, written d(u, v), is the length of the vector u − v. That is, d(u, v) = ‖u − v‖.

Example 4. Find the distance between a and b in Example 1.

Definition 4. Two vectors u, v ∈ Rⁿ are said to be orthogonal (to each other) if u·v = 0.

Orthogonality generalizes the idea of perpendicular lines in R². Two lines through the origin (represented by vectors u, v) are perpendicular if and only if the distance from u to v equals the distance from u to −v. We compute

  [d(u, −v)]² = ‖u − (−v)‖² = ‖u + v‖² = (u + v)·(u + v) = ‖u‖² + 2 u·v + ‖v‖²,
  [d(u, v)]² = ‖u − v‖² = (u − v)·(u − v) = ‖u‖² − 2 u·v + ‖v‖².

Hence, these two quantities are equal if and only if 2 u·v = −2 u·v; equivalently, u·v = 0. The next theorem now follows directly.

Theorem 5 (The Pythagorean Theorem). Two vectors u, v ∈ Rⁿ are orthogonal if and only if ‖u + v‖² = ‖u‖² + ‖v‖².

Definition 5. Let W ⊆ Rⁿ be a subspace. The set W⊥ = {z ∈ Rⁿ : z·w = 0 for all w ∈ W} is called the orthogonal complement of W.

Part of your homework will be to show that W⊥ is a subspace of Rⁿ.

Example 6. Let W be a plane through the origin in R³ and let L be the line through 0 perpendicular to W. If z ∈ L and w ∈ W are nonzero, then the line segment from 0 to z is perpendicular to the line segment from 0 to w. In fact, L⊥ = W and W⊥ = L.

Theorem 7. Let A be an m × n matrix. Then (Row A)⊥ = Nul A and (Col A)⊥ = Nul Aᵀ.

Proof. If x ∈ Nul A, then Ax = 0 by definition. Hence, x is orthogonal to each row of A. Since the rows of A span Row A, x ∈ (Row A)⊥. Conversely, if x ∈ (Row A)⊥, then x is orthogonal to each row of A and so Ax = 0, that is, x ∈ Nul A. The proof of the second statement is similar.

Let u, v ∈ R² be nonzero and let θ be the angle between them. By the Law of Cosines,

  ‖u − v‖² = ‖u‖² + ‖v‖² − 2‖u‖‖v‖ cos θ.

Rearranging gives

  ‖u‖‖v‖ cos θ = ½ [‖u‖² + ‖v‖² − ‖u − v‖²]
               = ½ [(u₁² + u₂²) + (v₁² + v₂²) − (u₁ − v₁)² − (u₂ − v₂)²]
               = u₁v₁ + u₂v₂ = u·v.

Hence, cos θ = (u·v) / (‖u‖‖v‖).
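The distance and angle formulas above can be checked numerically. A short Python sketch (helper names are ours; `dot` and `norm` are as before):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def dist(u, v):
    # d(u, v) = ||u - v||
    return norm([a - b for a, b in zip(u, v)])

def angle(u, v):
    # angle between nonzero u and v: cos(theta) = (u . v) / (||u|| ||v||)
    return math.acos(dot(u, v) / (norm(u) * norm(v)))
```

Orthogonal vectors give angle π/2: for example, u = (1, 2) and v = (2, −1) satisfy u·v = 0.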
2. Orthogonal Sets

Definition 6. A set of vectors {u₁, …, uₚ} in Rⁿ is said to be an orthogonal set if uᵢ·uⱼ = 0 for all i ≠ j. If, in addition, each uᵢ is a unit vector, then the set is said to be orthonormal.

Example 8. Show that a given set of three vectors in R³ is orthogonal. Is it orthonormal? If not, find a set of orthonormal vectors with the same span.

Theorem 9. If S = {u₁, …, uₚ} is an orthogonal set of nonzero vectors in Rⁿ, then S is linearly independent and hence a basis for the subspace spanned by S.

Proof. Write 0 = c₁u₁ + ⋯ + cₚuₚ. Then

  0 = 0·u₁ = (c₁u₁ + ⋯ + cₚuₚ)·u₁ = c₁(u₁·u₁) + ⋯ + cₚ(uₚ·u₁) = c₁(u₁·u₁).

Since u₁·u₁ ≠ 0 (because u₁ ≠ 0), c₁ = 0. Repeating this argument with u₂, …, uₚ gives c₂ = ⋯ = cₚ = 0. Hence, S is linearly independent.

Let {u₁, …, uₚ} be an orthogonal basis for a subspace W of Rⁿ. Let y ∈ W and write y = c₁u₁ + ⋯ + cₚuₚ. Then y·uᵢ = cᵢ(uᵢ·uᵢ), and so

  cᵢ = (y·uᵢ) / (uᵢ·uᵢ),  i = 1, …, p.

Example 10. Show that a given set S of three vectors is an orthogonal basis of R³, and express a given vector x as a linear combination of these vectors.

Here is an easier version of the problem hinted at in the beginning of this chapter. Given a point p and a line L (in R²), what is the distance from p to L? The solution uses orthogonal projections. Let L = Span{u} and let p be given by the vector y. We need the length of the vector through y orthogonal to u. By translation, this is equivalent to the length of the vector z = y − ŷ, where ŷ = αu for some scalar α. Then

  0 = z·u = (y − αu)·u = y·u − α(u·u).

Hence, α = (y·u)/(u·u), and so ŷ = ((y·u)/(u·u)) u. Note that if we replace u by cu for any nonzero scalar c, this definition does not change, and thus we have defined the projection using any nonzero vector of L.
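The weight formula cᵢ = (y·uᵢ)/(uᵢ·uᵢ) is one of the main conveniences of an orthogonal basis: no linear system has to be solved. A small Python sketch (function name ours):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def coords(y, basis):
    # weights of y relative to an ORTHOGONAL basis: c_i = (y . u_i)/(u_i . u_i)
    return [dot(y, u) / dot(u, u) for u in basis]
```

For the orthogonal basis u₁ = (1, 1), u₂ = (1, −1) of R² and y = (3, 1), the weights come out to c₁ = 2 and c₂ = 1, and indeed 2u₁ + 1u₂ = y.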
Definition 7. Given vectors y, u ∈ Rⁿ with u ≠ 0, and L = Span{u}, the orthogonal projection of y onto L is defined as

  ŷ = proj_L y = ((y·u)/(u·u)) u.

Note that this gives a decomposition of the vector y as y = ŷ + z, where ŷ ∈ L and z ∈ L⊥. Hence, every vector in Rⁿ can be written (uniquely) as the sum of an element of L and an element of L⊥. Since L ∩ L⊥ = {0}, it follows that dim L⊥ = n − 1. In the next section we will generalize this to larger subspaces.

Example 11. Compute the orthogonal projection of a given vector y ∈ R² onto the line L through a given vector u and the origin. Use this to find the distance from y to L.

Definition 8. If W is a subspace of Rⁿ spanned by an orthonormal set S = {u₁, …, uₚ}, then we say S is an orthonormal basis of W.

Example 12. The standard basis {e₁, …, eₙ} is an orthonormal basis of Rⁿ.

Theorem 13. An m × n matrix U has orthonormal columns if and only if UᵀU = I.

Proof. Write U = [u₁ ⋯ uₙ]. Then

  UᵀU = [ u₁ᵀ ] [u₁ u₂ ⋯ uₙ] = [ u₁ᵀu₁  u₁ᵀu₂  ⋯  u₁ᵀuₙ ]
        [ u₂ᵀ ]                [ u₂ᵀu₁  u₂ᵀu₂  ⋯  u₂ᵀuₙ ]
        [  ⋮  ]                [   ⋮      ⋮    ⋱    ⋮   ]
        [ uₙᵀ ]                [ uₙᵀu₁  uₙᵀu₂  ⋯  uₙᵀuₙ ]

Hence, UᵀU = I if and only if uᵢ·uᵢ = 1 for all i and uᵢ·uⱼ = 0 for all i ≠ j.

Theorem 14. Let U be an m × n matrix with orthonormal columns and let x, y ∈ Rⁿ. Then
(1) ‖Ux‖ = ‖x‖,
(2) (Ux)·(Uy) = x·y,
(3) (Ux)·(Uy) = 0 if and only if x·y = 0.

Proof. We will prove (1); the rest are left as an exercise. Write U = [u₁ ⋯ uₙ]. Then

  ‖Ux‖² = (Ux)·(Ux) = (u₁x₁ + ⋯ + uₙxₙ)·(u₁x₁ + ⋯ + uₙxₙ)
        = Σ_{i,j} (uᵢxᵢ)·(uⱼxⱼ) = Σ_{i,j} xᵢxⱼ (uᵢ·uⱼ) = Σᵢ xᵢ² (uᵢ·uᵢ) = Σᵢ xᵢ² = ‖x‖².
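The projection of Definition 7, together with the observation that ‖y − proj_L y‖ is the distance from y to L, is a short computation. A Python sketch (function names and the sample data are ours):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def proj_line(y, u):
    # orthogonal projection of y onto L = Span{u}: ((y . u)/(u . u)) u
    c = dot(y, u) / dot(u, u)
    return [c * ui for ui in u]

def dist_to_line(y, u):
    # the distance from y to L is ||y - proj_L(y)||
    p = proj_line(y, u)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)))
```

With the hypothetical data y = (7, 6) and u = (4, 2), the projection is (8, 4) and the distance from y to L is √5.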
3. Orthogonal Projections

The next definition generalizes projections onto lines.

Definition 9. Let W be a subspace of Rⁿ with orthogonal basis {u₁, …, uₚ}. For y ∈ Rⁿ, the orthogonal projection of y onto W is given by

  proj_W y = ((y·u₁)/(u₁·u₁)) u₁ + ⋯ + ((y·uₚ)/(uₚ·uₚ)) uₚ.

This definition matches our previous one when W is 1-dimensional. Note that proj_W y ∈ W because it is a linear combination of basis elements. Also note that the definition simplifies when the basis {u₁, …, uₚ} is orthonormal: in this case, if we let U = [u₁ ⋯ uₚ], then proj_W y = UUᵀy for all y ∈ Rⁿ.

Theorem 15 (Orthogonal Decomposition Theorem). Let W be a subspace of Rⁿ with orthogonal basis {u₁, …, uₚ}. Then each y ∈ Rⁿ can be written uniquely in the form y = ŷ + z, where ŷ ∈ W and z ∈ W⊥. In fact, ŷ = proj_W y and z = y − ŷ.

Proof. Note that if W = {0}, then the theorem is trivial. As noted above, proj_W y ∈ W. We claim z = y − ŷ ∈ W⊥:

  z·u₁ = (y − ŷ)·u₁ = y·u₁ − ŷ·u₁ = y·u₁ − ((y·u₁)/(u₁·u₁)) (u₁·u₁) = y·u₁ − y·u₁ = 0.

The same computation applies to u₂, …, uₚ. By linearity, z·w = 0 for every w ∈ W, so z ∈ W⊥.

To prove uniqueness, let y = w + x be another decomposition with w ∈ W and x ∈ W⊥. Then w + x = y = ŷ + z, so w − ŷ = z − x. But w − ŷ ∈ W and z − x ∈ W⊥. Since W ∩ W⊥ = {0}, we get w − ŷ = 0, so w = ŷ. Similarly, z = x.

We will show in the next section that every subspace has an orthogonal basis.

Corollary 16. Let W be a subspace of Rⁿ with orthogonal basis {u₁, …, uₚ}. Then y ∈ W if and only if proj_W y = y.

Example 17. Let W = Span{u₁, u₂} for a given pair of orthogonal vectors u₁, u₂ ∈ R³. Write a given vector y as the sum of a vector ŷ ∈ W and a vector z ∈ W⊥.

Theorem 18 (Best Approximation Theorem). Let W be a subspace of Rⁿ and y ∈ Rⁿ. Then ŷ = proj_W y is the point of W closest to y, in the sense that ‖y − ŷ‖ < ‖y − v‖ for all v ∈ W with v ≠ ŷ.
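Definition 9 sums the one-dimensional projections over an orthogonal basis, and the decomposition y = ŷ + z of Theorem 15 then takes one subtraction. A Python sketch (names ours):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def proj_subspace(y, basis):
    # proj_W(y) = sum_i ((y . u_i)/(u_i . u_i)) u_i, for an ORTHOGONAL basis of W
    p = [0.0] * len(y)
    for u in basis:
        c = dot(y, u) / dot(u, u)
        p = [pi + c * ui for pi, ui in zip(p, u)]
    return p
```

For example, projecting y = (1, 2, 3) onto the xy-plane W = Span{(1, 0, 0), (0, 1, 0)} gives ŷ = (1, 2, 0), and z = y − ŷ = (0, 0, 3) lies in W⊥.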
4. The Gram-Schmidt Process

Orthogonal projections give us a way to find an orthogonal basis for any subspace W of Rⁿ.

Example 19. Let W = Span{x₁, x₂} for given linearly independent vectors x₁, x₂. Construct an orthogonal basis for W.

Let v₁ = x₁ and W₁ = Span{v₁}. It suffices to find a vector v₂ ∈ W orthogonal to W₁. Let p = proj_{W₁} x₂ ∈ W₁. Then x₂ = p + (x₂ − p), where x₂ − p ∈ W₁⊥. Set

  v₂ = x₂ − p = x₂ − ((x₂·v₁)/(v₁·v₁)) v₁.

Now v₁·v₂ = 0 and v₁, v₂ ∈ W. Hence, {v₁, v₂} is an (orthogonal) basis for W. Note that if we want an orthonormal basis for W, we can just take the unit vectors associated to v₁ and v₂.

This process can continue. Say W were three-dimensional. We could then let W₂ = Span{v₁, v₂} and subtract from x₃ its projection onto W₂. We'll prove the next theorem using this idea.

Theorem 20 (The Gram-Schmidt Process). Given a basis {x₁, …, xₚ} for a nonzero subspace W ⊆ Rⁿ, define

  v₁ = x₁,
  v₂ = x₂ − ((x₂·v₁)/(v₁·v₁)) v₁,
  v₃ = x₃ − ((x₃·v₁)/(v₁·v₁)) v₁ − ((x₃·v₂)/(v₂·v₂)) v₂,
  ⋮
  vₚ = xₚ − ((xₚ·v₁)/(v₁·v₁)) v₁ − ((xₚ·v₂)/(v₂·v₂)) v₂ − ⋯ − ((xₚ·vₚ₋₁)/(vₚ₋₁·vₚ₋₁)) vₚ₋₁.

Then {v₁, …, vₚ} is an orthogonal basis for W. In addition, Span{v₁, …, v_k} = Span{x₁, …, x_k} for all 1 ≤ k ≤ p.

Proof. For 1 ≤ k ≤ p, set W_k = Span{x₁, …, x_k} and V_k = Span{v₁, …, v_k}. Since v₁ = x₁, it (trivially) holds that W₁ = V₁ and that {v₁} is orthogonal. Suppose for some k, 1 ≤ k < p, that W_k = V_k and that {v₁, …, v_k} is an orthogonal set. Define

  v_{k+1} = x_{k+1} − proj_{W_k} x_{k+1}.
By the Orthogonal Decomposition Theorem, v_{k+1} is orthogonal to W_k. Since v_{k+1} is the difference of x_{k+1} ∈ W_{k+1} and a vector of W_k ⊆ W_{k+1}, we have v_{k+1} ∈ W_{k+1}; and since x_{k+1} ∉ W_k, we have v_{k+1} ≠ 0. Hence, {v₁, …, v_{k+1}} is an orthogonal set of k + 1 nonzero vectors in W_{k+1} and hence a basis of W_{k+1}. Thus, W_{k+1} = V_{k+1}. The result now follows by induction.

Example 21. Let W = Span{x₁, x₂, x₃} with the xᵢ below. Construct an orthogonal basis for W.

  x₁ = (1, 1, 1), x₂ = (0, 1, 1), x₃ = (0, 0, 1).

Set v₁ = x₁. Then

  v₂ = x₂ − ((x₂·v₁)/(v₁·v₁)) v₁ = (0, 1, 1) − (2/3)(1, 1, 1) = (−2/3, 1/3, 1/3).

Now,

  v₃ = x₃ − ((x₃·v₁)/(v₁·v₁)) v₁ − ((x₃·v₂)/(v₂·v₂)) v₂
     = (0, 0, 1) − (1/3)(1, 1, 1) − (1/2)(−2/3, 1/3, 1/3) = (0, −1/2, 1/2).

Hence, an orthogonal basis for W is {v₁, v₂, v₃}.
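The Gram-Schmidt process can be implemented in a few lines. A sketch of the classical version, without normalization (it assumes the inputs are linearly independent; names are ours):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(xs):
    # v_k = x_k - sum over j < k of ((x_k . v_j)/(v_j . v_j)) v_j
    vs = []
    for x in xs:
        v = list(x)
        for w in vs:
            c = dot(x, w) / dot(w, w)
            v = [vi - c * wi for vi, wi in zip(v, w)]
        vs.append(v)
    return vs
```

Running it on x₁ = (1, 1, 1), x₂ = (0, 1, 1), x₃ = (0, 0, 1) produces three pairwise-orthogonal vectors spanning the same subspace.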
5. Least-Squares Problems

In data science, one often wants to approximate a set of data by a curve; for instance, one might hope to construct the line that best fits the data. This is known (by one name) as linear regression. In this section we'll study the linear algebra approach to this problem.

Suppose the system Ax = b is inconsistent. Previously, we gave up all hope of solving such a system because no solution existed. However, if we give up the idea that we must find an exact solution and instead focus on finding an approximate solution, then we may have hope.

Definition 10. If A is an m × n matrix and b ∈ Rᵐ, a least-squares solution of Ax = b is a vector x̂ ∈ Rⁿ such that for all x ∈ Rⁿ,

  ‖b − Ax̂‖ ≤ ‖b − Ax‖.

Geometrically, we think of Ax̂ as the projection of b onto Col A. That is, if b̂ = proj_{Col A} b, then the equation Ax = b̂ is consistent. Let x̂ ∈ Rⁿ be a solution (there may be several). By the Best Approximation Theorem, b̂ is the point of Col A closest to b, and so x̂ is a least-squares solution of Ax = b.

By the Orthogonal Decomposition Theorem, b − b̂ is orthogonal to Col A. Hence, if aⱼ is any column of A, then aⱼ·(b − b̂) = 0; that is, aⱼᵀ(b − b̂) = 0. But aⱼᵀ is a row of Aᵀ, and so Aᵀ(b − b̂) = 0. Replacing b̂ with Ax̂ and expanding, we get

  AᵀAx̂ = Aᵀb.

The equations corresponding to this system are the normal equations for Ax = b. We have now essentially proven the following theorem.

Theorem 22. The set of least-squares solutions of Ax = b coincides with the nonempty set of solutions of the normal equations AᵀAx = Aᵀb.

Example 23. Find a least-squares solution of an inconsistent system Ax = b, with A and b as given. We use the normal equations: first compute AᵀA and Aᵀb, then solve AᵀAx = Aᵀb by inverting AᵀA:

  x̂ = (AᵀA)⁻¹Aᵀb.

In general, when AᵀA is invertible, the least-squares solution x̂ is unique and is given by x̂ = (AᵀA)⁻¹Aᵀb.
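To make the normal equations concrete, here is a small Python sketch for the two-column case, solving AᵀAx̂ = Aᵀb by inverting the 2 × 2 matrix AᵀA (assumed invertible; all names and the sample system are ours):

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def lstsq_2col(A, b):
    # least-squares solution of Ax = b for an m x 2 matrix A, via the
    # normal equations: x_hat = (A^T A)^{-1} A^T b
    At = transpose(A)
    AtA = [[sum(r * s for r, s in zip(u, v)) for v in At] for u in At]
    Atb = matvec(At, b)
    (p, q), (r, s) = AtA
    det = p * s - q * r  # nonzero when A has linearly independent columns
    inv = [[s / det, -q / det], [-r / det, p / det]]
    return matvec(inv, Atb)
```

For the (hypothetical) inconsistent system with A = [[1, 0], [0, 1], [1, 1]] and b = (1, 1, 1), the least-squares solution is x̂ = (2/3, 2/3).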
As an application of this, we'll see how to fit a line to data using least-squares. To match notation commonly used in statistical analysis, we write the equation Ax = b as Xβ = y. The matrix X is referred to as the design matrix, β as the parameter vector, and y as the observation vector.

Suppose we have a set of data points (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ), perhaps from some experiments. We would like to model this data by a line in order to predict outcomes that did not appear in our experiments. Say this line is written y = β₀ + β₁x. The residual of a point (xᵢ, yᵢ) is the vertical distance |yᵢ − (β₀ + β₁xᵢ)| from that point to the line. The least-squares line is the line that minimizes the sum of the squares of the residuals.

Suppose the data all lay on the line. Then the points would satisfy

  β₀ + β₁x₁ = y₁
  β₀ + β₁x₂ = y₂
  ⋮
  β₀ + β₁xₙ = yₙ.

We could write this system as Xβ = y, where

  X = [ 1  x₁ ]        [ y₁ ]
      [ 1  x₂ ]  , β = [ β₀ ] , y = [ y₂ ]
      [ ⋮   ⋮ ]        [ β₁ ]       [ ⋮  ]
      [ 1  xₙ ]                     [ yₙ ]

If the data does not lie on a line (and this is likely), then we want β to be the least-squares solution of Xβ = y, which minimizes the distance between Xβ and y.

Example 24. Find the equation y = β₀ + β₁x of the least-squares line that best fits four given data points. We build the matrix X and vector y from the data as above. For the least-squares solution of Xβ = y, we solve the normal equations XᵀXβ = Xᵀy, obtaining

  [β₀; β₁] = (XᵀX)⁻¹Xᵀy.
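Because XᵀX is only 2 × 2 here, the least-squares line has a closed form. A Python sketch (function name and sample data are ours):

```python
def fit_line(points):
    # least-squares line y = b0 + b1*x through data (x_i, y_i):
    # X^T X = [[n, sum x], [sum x, sum x^2]],  X^T y = [sum y, sum xy],
    # solved by Cramer's rule; det != 0 unless all x_i coincide
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    det = n * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1
```

Collinear data is recovered exactly: `fit_line([(0, 1), (1, 3), (2, 5)])` returns `(1.0, 2.0)`, the line y = 1 + 2x.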
7. Diagonalization of Symmetric Matrices

We have already seen that it can be quite time-intensive to determine whether a matrix is diagonalizable. We'll see that there are certain cases in which a matrix is always diagonalizable.

Definition 11. A matrix A is symmetric if Aᵀ = A.

Example 25. Let

  A = [  3  −2   4 ]
      [ −2   6   2 ]
      [  4   2   3 ]

Note that Aᵀ = A, so A is symmetric. The characteristic polynomial of A is χ_A(t) = (t + 2)(t − 7)², so the eigenvalues are −2 and 7. The corresponding eigenspaces have bases

  λ = −2: {(−1, −1/2, 1)};   λ = 7: {(1, 0, 1), (−1/2, 1, 0)}.

Hence, A is diagonalizable. Now we use Gram-Schmidt to find an orthogonal basis for R³. Note that the eigenvector for λ = −2 is already orthogonal to both eigenvectors for λ = 7, so we only need to orthogonalize within the eigenspace for λ = 7:

  v₁ = (1, 0, 1), v₂ = (−1/2, 1, 0) − ((−1/2)/2)(1, 0, 1) = (−1/4, 1, 1/4), v₃ = (−1, −1/2, 1).

Finally, we normalize each vector:

  u₁ = (1/√2, 0, 1/√2), u₂ = (−1/√18, 4/√18, 1/√18), u₃ = (−2/3, −1/3, 2/3).

Now the matrix U = [u₁ u₂ u₃] is orthogonal, and so UᵀU = I.

Theorem 26. If A is symmetric, then any two eigenvectors from different eigenspaces are orthogonal.

Proof. Let v₁, v₂ be eigenvectors for A with corresponding eigenvalues λ₁, λ₂, λ₁ ≠ λ₂. Then

  λ₁(v₁·v₂) = (λ₁v₁)ᵀv₂ = (Av₁)ᵀv₂ = v₁ᵀAᵀv₂ = v₁ᵀAv₂ = v₁ᵀ(λ₂v₂) = λ₂(v₁·v₂).

Hence, (λ₁ − λ₂)(v₁·v₂) = 0. Since λ₁ ≠ λ₂, we must have v₁·v₂ = 0.

Based on the previous theorem, we say that the eigenspaces of a symmetric matrix A are mutually orthogonal.

Definition 12. An n × n matrix A is orthogonally diagonalizable if there exist an orthogonal n × n matrix P and a diagonal matrix D such that A = PDPᵀ.

Theorem 27. If A is orthogonally diagonalizable, then A is symmetric.
Proof. Since A is orthogonally diagonalizable, A = PDPᵀ for some orthogonal matrix P and diagonal matrix D. Then A is symmetric because

  Aᵀ = (PDPᵀ)ᵀ = (Pᵀ)ᵀDᵀPᵀ = PDPᵀ = A.

It turns out the converse of the above theorem is also true! The set of eigenvalues of a matrix A is called the spectrum of A and is denoted σ_A.

Theorem 28 (The Spectral Theorem for symmetric matrices). Let A be a (real) n × n symmetric matrix. Then the following hold.
(1) A has n real eigenvalues, counting multiplicities.
(2) For each eigenvalue λ of A, geomult_λ(A) = algmult_λ(A).
(3) The eigenspaces are mutually orthogonal.
(4) A is orthogonally diagonalizable.

Proof. Every eigenvalue of a symmetric matrix is real (this is one of the problems on the extra credit homework assignment). The second part of (1) as well as (2) are immediate consequences of (4). We proved (3) in Theorem 26. Note that (4) is trivial when A has n distinct eigenvalues, by (3). We prove (4) by induction on n. Clearly the result holds when A is 1 × 1. Assume every (n − 1) × (n − 1) symmetric matrix is orthogonally diagonalizable.

Let A be n × n, let λ₁ be an eigenvalue of A, and let u₁ be a (unit) eigenvector for λ₁. By the Gram-Schmidt process, we may extend u₁ to an orthonormal basis {u₁, u₂, …, uₙ} for Rⁿ, where {u₂, …, uₙ} is a basis for (Span{u₁})⊥. Set U = [u₁ u₂ ⋯ uₙ]. Then

  UᵀAU = [ u₁ᵀAu₁  ⋯  u₁ᵀAuₙ ]   [ λ₁  * ]
         [   ⋮          ⋮    ] = [  0  B ]
         [ uₙᵀAu₁  ⋯  uₙᵀAuₙ ]

The first column is as indicated because uᵢᵀAu₁ = uᵢᵀ(λ₁u₁) = λ₁(uᵢ·u₁) = λ₁δ_{i1}. As UᵀAU is symmetric, the block * = 0, and B is a symmetric (n − 1) × (n − 1) matrix, which is orthogonally diagonalizable (by the inductive hypothesis) with some eigenvalues λ₂, …, λₙ. Because A and UᵀAU are similar, the eigenvalues of A are λ₁, λ₂, …, λₙ. Since B is orthogonally diagonalizable, there exists an orthogonal matrix Q such that QᵀBQ = D′, where the diagonal entries of D′ are λ₂, …, λₙ. Now

  [ 1  0 ]ᵀ [ λ₁  0 ] [ 1  0 ]   [ λ₁  0    ]   [ λ₁  0  ]
  [ 0  Q ]  [ 0   B ] [ 0  Q ] = [ 0   QᵀBQ ] = [ 0   D′ ]
Note that [1 0; 0 Q] is orthogonal. Set V = U [1 0; 0 Q]. As the product of orthogonal matrices is orthogonal, V is itself orthogonal, and VᵀAV is diagonal. This completes the induction.

Suppose A is orthogonally diagonalizable, so A = UDUᵀ, where U = [u₁ ⋯ uₙ] and D is the diagonal matrix whose diagonal entries are the eigenvalues λ₁, …, λₙ of A. Then

  A = UDUᵀ = λ₁u₁u₁ᵀ + ⋯ + λₙuₙuₙᵀ.

This is known as the spectral decomposition of A. Each uᵢuᵢᵀ is called a projection matrix because (uᵢuᵢᵀ)x is the orthogonal projection of x onto Span{uᵢ}.

Example 29. Construct a spectral decomposition of the matrix A in Example 25. Recall that

  A = [  3  −2   4 ]
      [ −2   6   2 ]
      [  4   2   3 ]

and our orthonormal basis of R³ consisting of eigenvectors of A was

  u₁ = (1/√2, 0, 1/√2), u₂ = (−1/√18, 4/√18, 1/√18), u₃ = (−2/3, −1/3, 2/3).

Setting U = [u₁ u₂ u₃] gives UᵀAU = D = diag(7, 7, −2). The projection matrices are

  u₁u₁ᵀ = [ 1/2  0  1/2 ]   u₂u₂ᵀ = [ 1/18  −2/9  −1/18 ]   u₃u₃ᵀ = [  4/9   2/9  −4/9 ]
          [  0   0   0  ]           [ −2/9   8/9   2/9  ]           [  2/9   1/9  −2/9 ]
          [ 1/2  0  1/2 ]           [ −1/18  2/9   1/18 ]           [ −4/9  −2/9   4/9 ]

The spectral decomposition is

  A = 7u₁u₁ᵀ + 7u₂u₂ᵀ − 2u₃u₃ᵀ.
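A spectral decomposition is easy to verify numerically. The following Python sketch uses our reading of the matrix and (unnormalized) eigenvectors from the example above: it normalizes each eigenvector, forms the projection matrices uᵢuᵢᵀ, and reassembles A as a weighted sum of them.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [vi / n for vi in v]

def outer(u):
    # projection matrix u u^T for a unit vector u
    return [[a * b for b in u] for a in u]

# Our reading of the example: eigenvalues 7, 7, -2 with (orthogonal)
# eigenvectors (1,0,1), (-1/4,1,1/4) after Gram-Schmidt, and (-1,-1/2,1).
pairs = [(7, [1, 0, 1]), (7, [-0.25, 1, 0.25]), (-2, [-1, -0.5, 1])]
A = [[0.0] * 3 for _ in range(3)]
for lam, v in pairs:
    P = outer(normalize(v))  # projection onto Span{v}
    A = [[a + lam * p for a, p in zip(ra, rp)] for ra, rp in zip(A, P)]
```

Up to rounding, the reassembled A agrees with [[3, -2, 4], [-2, 6, 2], [4, 2, 3]], confirming A = 7u₁u₁ᵀ + 7u₂u₂ᵀ − 2u₃u₃ᵀ for this matrix.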