Linear Algebra, Summer 2011, pt. 3

September 20, 2011

Contents

1 Orthogonality.
  1.1 The length of a vector.
  1.2 Orthogonal vectors.
  1.3 Orthogonal Subspaces.
  1.4 Orthogonality and the fundamental subspaces.
2 Projection onto subspaces.
  2.1 Projection onto one dimensional subspaces.
  2.2 Least squares.
  2.3 Projection onto subspaces.
3 Orthogonal Matrices and Gram-Schmidt.
  3.1 Gram-Schmidt.

1 Orthogonality.

We have seen that basis vectors are a great way of describing a vector space, in part because each vector can be expressed uniquely as a linear combination of basis vectors. However, not all bases are created equal: there are some that we like more than others. Computationally and aesthetically, we prefer basis vectors which are at right angles to each other and of length one; that is to say, basis vectors which look like the standard basis vectors of R^2 or R^3. We will do this in three steps:

1. What is the length of a vector?
2. What does it mean for two vectors, or two subspaces, to be orthogonal?
3. Given a basis for a subspace, how do we create orthogonal vectors which span the same subspace?

Let us hit these questions right away.

1.1 The length of a vector.

Given a vector x = (x_1, x_2) in R^2, we are probably comfortable with the idea that the length of this vector (which we will denote by ||x||) is given by the Pythagorean theorem:

    ||x|| = √(x_1^2 + x_2^2).

Also, for x = (x_1, x_2, x_3) ∈ R^3, the formula is

    ||x|| = √(x_1^2 + x_2^2 + x_3^2).

Then we generalize this to x = (x_1, x_2, ..., x_n) ∈ R^n.

Definition 1.1 (Length of a vector). For x = (x_1, x_2, ..., x_n) ∈ R^n, we define

    ||x|| := √(x_1^2 + x_2^2 + ... + x_n^2).

We may prefer to write this as a multiplication of matrices: ||x||^2 = x^T x, where we are writing x as a column vector, as usual. As an example, we expect the vector (2, 3, 4) to have length √29, and indeed,

    (2  3  4) (2, 3, 4)^T = 4 + 9 + 16 = 29 = ||(2, 3, 4)||^2.
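As a quick numerical check (not part of the original notes), the following sketch computes the length of (2, 3, 4) both ways, once from the definition ||x||^2 = x^T x and once with numpy's built-in norm.

```python
import numpy as np

x = np.array([2.0, 3.0, 4.0])

# Length from the definition: ||x||^2 = x^T x
length_sq = x @ x            # 4 + 9 + 16 = 29
length = np.sqrt(length_sq)  # sqrt(29) ≈ 5.385

# Same answer from the library routine
assert np.isclose(length, np.linalg.norm(x))
print(length_sq, length)
```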

1.2 Orthogonal vectors.

To motivate the definition of orthogonal vectors, we will again turn to the Pythagorean theorem. First let's state the theorem in terms of vectors:

Theorem 1.1 (Pythagorean Theorem). If x and y are orthogonal, then ||x||^2 + ||y||^2 = ||x − y||^2.

This is the theorem in two or three dimensions. What we do is define orthogonality so that this theorem is true! In particular, we need

    x^T x + y^T y = (x − y)^T (x − y) = x^T x + y^T y − 2 x^T y.

This is true only when x^T y = 0. Hence, we say that two vectors x and y are orthogonal if x^T y = 0.

Example 1.1. The vectors (1, 1, 3) and (−1, −2, 1) are orthogonal, since

    (1, 1, 3) · (−1, −2, 1) = 1·(−1) + 1·(−2) + 3·1 = −3 + 3 = 0.

Here is a surprising statement that is easy to prove:

Proposition 1.1. If {x_1, ..., x_n} are mutually orthogonal (i.e., each is orthogonal to all the others) and nonzero, then they are linearly independent.

Proof. This is a beautiful proof, since we use only the definitions of linear independence and orthogonality. In particular, we will take a linear combination of the vectors adding to zero, then take a dot product, and show each coefficient must be zero. Specifically, suppose

    c_1 x_1 + ... + c_n x_n = 0.

We will then take a dot product of each side with x_1, which is orthogonal to all the other x_j's, except for itself. We get

    (c_1 x_1 + ... + c_n x_n) · x_1 = c_1 ||x_1||^2 = 0 · x_1 = 0.

Since x_1 is nonzero, we conclude that c_1 must equal 0. Repeating this with each x_j in turn, we find that all of c_1, ..., c_n must be 0; hence {x_1, ..., x_n} are linearly independent.
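A small numerical sketch (my own check, using numpy) of Example 1.1 and of the Pythagorean identity that motivated the definition:

```python
import numpy as np

x = np.array([1.0, 1.0, 3.0])
y = np.array([-1.0, -2.0, 1.0])

# Orthogonality: x^T y = 0
print(x @ y)  # 0.0

# Pythagorean theorem for orthogonal vectors: ||x||^2 + ||y||^2 = ||x - y||^2
lhs = x @ x + y @ y
rhs = (x - y) @ (x - y)
assert np.isclose(lhs, rhs)
```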

Another important notion we should record is the following:

Definition 1.2 (Orthonormal). We say a collection of vectors {x_1, ..., x_n} is orthonormal provided

1. x_j^T x_k = 0 for all j ≠ k, and
2. ||x_j|| = 1 for all j = 1, 2, ..., n.

This definition is definitely motivated by our favorite basis for euclidean space: {(1, 0, 0), (0, 1, 0), (0, 0, 1)}. We can readily verify that these vectors have length 1 and are mutually orthogonal. In fact, it is not hard to describe all such bases in R^2.

Example 1.2 (Rotation matrices). There is a matrix that rotates vectors (x, y) by a fixed amount θ, which we'll call R_θ. To derive a formula for it, let's take a point (x, y) ∈ R^2 and write it in polar coordinates, (x, y) = (r cos ω, r sin ω). Then we hope that

    R_θ (r cos ω, r sin ω)^T = (r cos(ω + θ), r sin(ω + θ))^T = (r cos ω cos θ − r sin ω sin θ, r cos ω sin θ + r sin ω cos θ)^T.

From this we can deduce that

    R_θ = [ cos θ  −sin θ
            sin θ   cos θ ].

Now if we wish to rotate the usual basis for R^2 by an angle θ, we will get

    R_θ (1, 0)^T = (cos θ, sin θ)^T   and   R_θ (0, 1)^T = (−sin θ, cos θ)^T.

Up to flipping the sign of one of the two vectors, these are all the orthonormal bases for R^2.
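A quick sketch (not from the notes, numpy assumed) that builds R_θ, rotates the standard basis, and confirms the resulting columns are orthonormal:

```python
import numpy as np

def rotation(theta):
    """R_theta = [[cos θ, -sin θ], [sin θ, cos θ]], as derived above."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

theta = 0.7                 # any fixed angle
e1, e2 = np.eye(2)          # the usual basis (1,0), (0,1)
R = rotation(theta)

print(R @ e1)               # [cos θ, sin θ]
print(R @ e2)               # [-sin θ, cos θ]
print(R[:, 0] @ R[:, 1])    # 0.0: the columns are orthogonal
print(np.linalg.norm(R[:, 0]), np.linalg.norm(R[:, 1]))  # 1.0 and 1.0
```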

1.3 Orthogonal Subspaces.

In the previous section, we asked when two vectors were orthogonal, which we generalized to asking when a collection of vectors was orthogonal. Now we define what it means for two subspaces to be orthogonal. Looking around a room for an example of such a situation, we are tempted to say that the floor and a wall look like two orthogonal planes. However, this does not turn out to be the right definition to get all the mileage we want out of it. In particular, there are certain vectors in the wall that point in the same direction as vectors in the floor. The right definition is as follows:

Definition 1.3 (Orthogonal subspaces). Two subspaces V and W are orthogonal if for every v ∈ V and w ∈ W, v^T w = 0.

Thus, for example, the x-y plane and the y-z plane are not orthogonal subspaces of R^3, since (0, 1, 0) is in both the x-y plane and the y-z plane, and (0, 1, 0) · (0, 1, 0) = 1 ≠ 0.

Here is a useful criterion for deciding when two subspaces are orthogonal:

Proposition 1.2. If V and W are subspaces of R^k with bases {v_1, ..., v_m} and {w_1, ..., w_n}, respectively, then V and W are orthogonal if and only if their basis vectors are mutually orthogonal.

Proof. Certainly if V and W are orthogonal, then the basis vectors are mutually orthogonal: all vectors in V are orthogonal to all vectors in W. What about the other direction? Suppose we have a vector v ∈ V and a vector w ∈ W. Let A be the k × m matrix with the basis for V as its columns, and let B be the k × n matrix with the basis for W as its columns. Then we can write v and w as

    v = A c_1   and   w = B c_2,

for a unique choice of c_1, c_2. Now we have only the calculation

    v^T w = (A c_1)^T (B c_2) = c_1^T (A^T B) c_2.

But each entry of A^T B is the dot product of a basis vector of V with a basis vector of W. By hypothesis each of these is zero. Hence

    v^T w = c_1^T (A^T B) c_2 = c_1^T 0_{m×n} c_2 = 0.

Hence the dot product of any two vectors in the subspaces is zero, so orthogonal basis vectors imply orthogonal subspaces.
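Proposition 1.2 turns directly into a test you can run: stack the basis vectors of each subspace as matrix columns and check that A^T B is the zero matrix. A minimal sketch (the particular subspaces below are my own illustration, not from the notes):

```python
import numpy as np

def subspaces_orthogonal(A, B, tol=1e-12):
    """V = col(A) and W = col(B) are orthogonal exactly when A^T B = 0."""
    return bool(np.all(np.abs(A.T @ B) < tol))

# Illustrative bases: the x-axis versus the y-z plane in R^3.
A = np.array([[1.0], [0.0], [0.0]])
B = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
print(subspaces_orthogonal(A, B))  # True
```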

Notice that in R^3, we can have two orthogonal lines, or a plane and an orthogonal line, but there are not enough dimensions to fit two orthogonal planes: the planes would each have dimension 2, and 4 vectors in R^3 cannot be linearly independent (and so cannot be mutually orthogonal and nonzero). However, in R^4, there is more room: the plane spanned by {(1, 2, 0, 0), (0, 1, 0, 0)} is orthogonal to both the line spanned by (0, 0, 1, 5) and the line spanned by (0, 0, −5, 1), and these two lines are orthogonal to each other.

1.4 Orthogonality and the fundamental subspaces.

Here's a beautiful and surprising fact: given an m × n matrix A, the row space is orthogonal to the null space, and the column space is orthogonal to the left null space! First, we note that the row space and null space are both subspaces of R^n, and that C(A) and N(A^T) are both subspaces of R^m. Now we should prove this fact.

Theorem 1.2. Let A be an m × n matrix. Then the row space and null space are orthogonal subspaces of R^n.

Proof. The insight is that if we take x ∈ N(A), then Ax = 0. This means that the inner product of each row of A with x is equal to zero. In other words, each row of A is orthogonal to x. The rest of the proof is simply pointing out that if the rows of A are r_1, ..., r_m, then an arbitrary vector in C(A^T) can be written

    w = c_1 r_1 + ... + c_m r_m.

Taking the dot product of w with x gives us

    w^T x = c_1 r_1^T x + ... + c_m r_m^T x = 0.

Hence the dot product is zero for any vector in the row space and any vector in the null space.

We actually have a second, much cleaner proof!

Proof. Suppose v ∈ C(A^T) and x ∈ N(A). Then v = A^T y for some vector y ∈ R^m. In this case,

    v^T x = (A^T y)^T x = y^T A x = y^T 0 = 0.

We leave the other proof (that the column space is orthogonal to the left null space) as an exercise!

Example 1.3. Let

    A = [ 2  3
          4  6
          6  9 ].

Then the row space is one dimensional, equal to the multiples of (2, 3). Hence the null space must be all multiples of (−3, 2), since this is the only direction orthogonal to the line determined by (2, 3). Similarly, the column space is generated by (2, 4, 6), so the left null space must be the plane of all vectors orthogonal to this one. Pleasantly, the equation for this plane is 2x + 4y + 6z = 0, which is exactly what the condition (2, 4, 6) · (x, y, z) = 0 says. We've thus derived the equation for a plane by observing that it must be orthogonal to a certain line.

Notice that the null space is not just orthogonal to the row space. It contains every vector orthogonal to the row space. We have a word for this:

Definition 1.4 (Orthogonal complement). Given a subspace V, we call the subspace of all vectors orthogonal to V the orthogonal complement of V, and denote it by V^⊥ ("V perp").

Using our new words, we can say the following: the row space is the orthogonal complement of the null space, and the column space is the orthogonal complement of the left null space. This doesn't look like such a big deal, but it gives us the following theorem for free:

Theorem 1.3. The equation Ax = b has a solution if and only if b^T y = 0 whenever A^T y = 0.
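A numerical sketch (my own, numpy assumed) of Example 1.3 and of the solvability criterion in Theorem 1.3; the test vectors (2, 4, 6) and (1, 0, 0) are illustrative choices, not from the notes:

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [4.0, 6.0],
              [6.0, 9.0]])

# Example 1.3: the null space is spanned by (-3, 2), orthogonal to every row.
x = np.array([-3.0, 2.0])
print(A @ x)                          # [0. 0. 0.]

# A vector in the left null space: A^T y = 0, i.e. y is orthogonal to (2, 4, 6).
y = np.array([-2.0, 1.0, 0.0])
print(A.T @ y)                        # [0. 0.]

# Theorem 1.3: Ax = b is solvable exactly when b^T y = 0 for all such y.
print(np.array([2.0, 4.0, 6.0]) @ y)  # 0.0 -> (2, 4, 6) lies in C(A): solvable
print(np.array([1.0, 0.0, 0.0]) @ y)  # -2.0 -> (1, 0, 0) does not: unsolvable
```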

2 Projection onto subspaces.

We now turn our attention back to projecting onto subspaces. We first treat the easiest example, that of projection onto a line, and then general projection matrices. We should clarify our goal: given a subspace S, we hope to have a matrix P_S so that for any vector b, the vector P_S b is not only in S, but is the closest vector in S to b.

2.1 Projection onto one dimensional subspaces.

In our first situation, we will project onto a line. Intuitively, given vectors a and b, we are asking: how much does b point in the direction of a? Our answer starts by considering these two vectors in the plane. Note that this is general in a certain sense, since two vectors in R^3 still only span a plane. Now some trigonometry allows us to derive that

    a^T b = ||a|| ||b|| cos θ.

In particular, we notice that if α is the angle a = (a_1, a_2) makes with the x-axis, and β is the angle b = (b_1, b_2) makes with the x-axis (and we suppose that β > α), then sin α = a_2/||a||, cos α = a_1/||a||, and similarly for β and b. Now the angle between a and b is θ = β − α. We plug in:

    cos θ = cos β cos α + sin β sin α = (a_1 b_1 + a_2 b_2) / (||a|| ||b||) = a^T b / (||a|| ||b||).

Now let us call p the closest point to b on the line through a. Then the segment from p to b must be at a right angle to a. That is to say, a^T (p − b) = 0. However, we also know that p = λa, since the projection must lie on the line determined by a. Putting these together, we find

    λ a^T a − a^T b = 0,

so

    λ = a^T b / (a^T a).

Thus we find that

    p = (a^T b / a^T a) a.

However, our goal is to find a matrix P_a so that p = P_a b. Since a^T b / a^T a is just a number, we can move the vector a to the left side of the equation and get

    P_a = (a a^T) / (a^T a).

We will record a few observations about P_a:

1. If a ∈ R^n, then a a^T is an n × n matrix.
2. The matrix a a^T has rank 1, since each column (and row!) is a multiple of a. Hence the row and column space both have basis {a}.
3. P_a is symmetric.
4. The matrix is scale invariant: P_{2a} = P_a. Intuitively this is pleasing, since the subspace defined by 2a is the same as the subspace defined by a.
5. If we apply P_a to a vector already on the line determined by a, it doesn't do anything:

       P_a (λa) = (a a^T / a^T a)(λa) = λ a (a^T a) / (a^T a) = λa.

   Intuitively, if a point is already on this line, then the closest point to it on the line is itself.
6. P_a^2 = P_a. This is a generalization of the previous observation: since P_a b = λa lies on the line through a, applying P_a again doesn't change anything.
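The observations above are easy to check numerically. A minimal sketch (my own, numpy assumed), using a = (1, 2, 3), the vector of the example that follows:

```python
import numpy as np

def projection_onto_line(a):
    """P_a = (a a^T) / (a^T a): projects any vector onto the line through a."""
    a = np.asarray(a, dtype=float)
    return np.outer(a, a) / (a @ a)

P = projection_onto_line([1.0, 2.0, 3.0])
print(np.allclose(P, P.T))                                     # symmetric
print(np.allclose(P @ P, P))                                   # idempotent: P^2 = P
print(np.linalg.matrix_rank(P))                                # rank 1
print(np.allclose(P, projection_onto_line([2.0, 4.0, 6.0])))   # scale invariant
```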

Now an example.

Example 2.1 (Projection onto (1, 2, 3)). Suppose we wish to project vectors onto the line spanned by (1, 2, 3). Then by the formula above,

    P_(1,2,3) = (1/14) (1, 2, 3)^T (1  2  3) = (1/14) [ 1  2  3
                                                        2  4  6
                                                        3  6  9 ].

Then, for example, the vector b = (0, 0, 1) is projected to

    P_(1,2,3) b = (1/14) (3, 6, 9)^T = (3/14, 3/7, 9/14)^T.

2.2 Least squares.

Now let us look at an applied example. Suppose we are searching for a linear relationship between a person's height and their weight (in reality, we will get a much better fit from height^3, since volume and length live in different dimensions, but let's ignore this). Anyway, suppose we take a random sample of 5 people, and find their heights (in inches) are h = (69, 72, 80, 80, 73), and their respective weights are w = (173, 193, 228, 211, 187) (note: this is actual data from the 2010-2011 Rice basketball team, so the randomness of the data could be argued). Now we are claiming that hx = w, i.e.

    (69, 72, 80, 80, 73)^T x = (173, 193, 228, 211, 187)^T.

A reasonable, and computationally friendly, way to measure how good our fit is, is via least squares. That is to say, for any potential solution x, we have an error given by

    E(x) = (69x − 173)^2 + (72x − 193)^2 + (80x − 228)^2 + (80x − 211)^2 + (73x − 187)^2.

In matrix notation, E(x) = ||hx − w||^2.
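As a sketch (my own code, using the height and weight data as reconstructed above), the error function E is straightforward to evaluate for candidate slopes, and comparing a few values already suggests where the minimum sits:

```python
import numpy as np

h = np.array([69.0, 72.0, 80.0, 80.0, 73.0])       # heights (inches)
w = np.array([173.0, 193.0, 228.0, 211.0, 187.0])  # weights

def E(x):
    """Sum of squared errors ||h*x - w||^2 for a candidate slope x."""
    r = h * x - w
    return r @ r

print(E(2.5), E(2.6574))   # the second is much smaller (≈ 398.95)
```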

From calculus, we know that to find a minimizer of this expression, we must take a derivative and set it to zero:

    0 = 2·69(69x − 173) + 2·72(72x − 193) + 2·80(80x − 228) + 2·80(80x − 211) + 2·73(73x − 187).

Again, matrices simplify the notation quite a bit:

    0 = E'(x) = d/dx ||hx − w||^2 = 2 h^T (hx − w).

Hence, we have

    x = h^T w / (h^T h).

Wading through the equations above (or simply using our derived formula), we find that in this case,

    x = 12434/4679 ≈ 2.6574.

Looking back, we also had E(2.6574) ≈ 398.95, whose square root is about 19.97. However, the important thing is that even though hx = w is not solvable, the best (in the sense of least squares) solution to it is given by the projection of w onto h.

2.3 Projection onto subspaces.

Suppose instead of looking for a relationship between height and weight, we had more explanatory variables. For concreteness, let us add in a variable for how old each player is, and try to solve Ax = w, where A is now a matrix whose first column contains the heights, and whose second column contains the ages. For our particular example, we will take

    A = [ 69  19
          72  22
          80  22
          80  19
          73  21 ].

Also, x, rather than being a single number indicating how much height affects weight, will now be a vector x = (x_h, x_a), indicating the relative contributions of height and age to weight. Again, there is no solution to Ax = w, so we will try to find an x which minimizes

    E(x) = ||Ax − w||^2.

But we now recognize this as finding the closest vector Ax in the column space of A to w. Hence, we are just projecting w onto C(A). We have other information too: appealing to geometry, we know that the error vector Ax − w must be perpendicular to C(A). Hence (Ax − w) ∈ N(A^T), since the left null space contains all vectors perpendicular to the column space of A. In other words,

    A^T (Ax − w) = 0.

Then if we knew that A^T A was invertible, we could write down

    x = (A^T A)^{-1} A^T w,

so that our least squares approximation to w is

    Ax = A (A^T A)^{-1} A^T w.

It is important to notice that A is typically a very tall, very non-square matrix, so we usually cannot use the identity (A^T A)^{-1} = A^{-1} (A^T)^{-1} and conclude that this approximation equals w. Also, in practice, the best way to solve this problem is to take the problem Ax = b and instead solve A^T A x = A^T b. In this way we avoid having to find an inverse matrix, and can instead use Gaussian elimination.

There is also the condition we have used, that A^T A is invertible. We have a criterion for checking when this is true:

Theorem 2.1. If the columns of A are linearly independent, then A^T A is invertible.

We actually prove this by proving an easier fact: that A^T A and A have the same nullspace. Then if the columns of A are linearly independent, the nullspace of A contains only the zero vector, so the nullspace of A^T A contains only the zero vector. But A^T A is square (and symmetric, though that doesn't matter here), so A^T A is invertible.

Lemma 2.1. A^T A and A have the same nullspace.

Proof. To show this, we will first show that every vector in the nullspace of A is also in the nullspace of A^T A. Then we show the opposite: every vector in the nullspace of A^T A is in the nullspace of A. The first part is easy: if Ax = 0, then A^T A x = A^T 0 = 0. Now, as planned, suppose that A^T A x = 0. Taking the dot product with x, we get

    x^T A^T A x = ||Ax||^2 = 0.

But the only vector with length 0 is the zero vector, so Ax = 0, and x is also in the nullspace of A.

Going back to our example, we get that x ≈ (2.63, 0.12), so

    Ax ≈ (183.4, 191.6, 212.6, 212.2, 194.1)^T.

This compares vaguely well with the observed weights

    w = (173, 193, 228, 211, 187)^T.

As a point of interest, the square root of the squared error is about 19.96. Hence, adding the age of the players didn't really help all that much. Using the cube of the heights, this error drops to 7.8. Also note that as you add more explanatory variables, you should expect to get less error, since you are projecting onto larger subspaces.
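A sketch of the whole computation (my own code; the height/age/weight numbers are the values as reconstructed in the example above), solving the normal equations A^T A x = A^T w by elimination rather than forming an explicit inverse:

```python
import numpy as np

A = np.array([[69.0, 19.0],
              [72.0, 22.0],
              [80.0, 22.0],
              [80.0, 19.0],
              [73.0, 21.0]])
w = np.array([173.0, 193.0, 228.0, 211.0, 187.0])

# Normal equations: A^T A x = A^T w
x = np.linalg.solve(A.T @ A, A.T @ w)
w_fit = A @ x

print(x)                          # ≈ [2.63, 0.12]
print(w_fit)                      # ≈ [183.4, 191.6, 212.6, 212.2, 194.1]
print(np.linalg.norm(w_fit - w))  # ≈ 20, the "square root of the squared error"
```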

3 Orthogonal Matrices and Gram-Schmidt.

We first define an orthogonal matrix, which is a pretty bad name for the matrix: not because it lies, but because it doesn't tell the whole truth.

Definition 3.1 (Orthogonal Matrix). An orthogonal matrix is a square matrix whose columns are orthonormal.

Unwinding some definitions, we observe that an orthogonal matrix must be

- Square: n × n,
- Normalized: each column must have length 1, and
- Orthogonal: the columns are mutually orthogonal.

Let's first observe that the columns of an orthogonal matrix are an orthonormal basis for R^n. Now some examples:

Example 3.1 (Permutation matrices are orthogonal). Every permutation matrix is an orthogonal matrix:

    P = [ 0  1  0  0
          0  0  0  1
          1  0  0  0
          0  0  1  0 ]

has columns of length 1, and each is orthogonal to the others.

Example 3.2 (Rotation matrices are orthogonal). The rotation matrices we introduced earlier are also orthogonal. We defined

    R_θ := [ cos θ  −sin θ
             sin θ   cos θ ].

Notice that the length of each column is 1:

    (cos θ, sin θ) · (cos θ, sin θ) = cos^2 θ + sin^2 θ = 1,

and the columns are orthogonal:

    (−sin θ, cos θ) · (cos θ, sin θ) = 0.
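Both examples can be checked at once by verifying that Q^T Q is the identity, which is just the statement that the columns are orthonormal. A small sketch (my own, numpy assumed):

```python
import numpy as np

P = np.array([[0, 1, 0, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Orthonormal columns are equivalent to Q^T Q = I.
print(np.allclose(P.T @ P, np.eye(4)))  # True
print(np.allclose(R.T @ R, np.eye(2)))  # True
```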

Another great property of orthogonal matrices:

Theorem 3.1. If Q is an orthogonal matrix, then Q^{-1} = Q^T.

The proof is simply that each entry of the product Q^T Q is an inner product of two columns of Q. The inner products between distinct columns are zero, so the only nonzero entries of Q^T Q are along the diagonal, and these are the squared lengths of the columns of Q, which are 1. Hence Q^T Q = I.

Now these orthonormal bases have yet another great property: suppose you have a vector b which you'd like to write in the basis given by the columns of Q. More precisely, we wish to write

    b = x_1 q_1 + ... + x_n q_n.

In order to do this, we need to solve for each x_j. But we can do this by just taking the inner product of each side with q_j. Then we find

    q_j^T b = x_j q_j^T q_j = x_j.
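In code, this means the whole coefficient vector is just Q^T b, with no linear system to solve. A sketch (my own, numpy assumed) that reproduces the coefficients of Example 3.3 below, where Q is a rotation by π/3:

```python
import numpy as np

theta = np.pi / 3                      # Q below equals (1/2)[[1, -√3], [√3, 1]]
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
b = np.array([1.0, 2.0])

# Because the columns q_j are orthonormal, the coefficient of q_j is q_j^T b.
x = Q.T @ b
print(x)                               # [1/2 + √3, 1 - √3/2] ≈ [2.232, 0.134]
print(np.allclose(Q @ x, b))           # True: b = x_1 q_1 + x_2 q_2
```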

Example 3.3 (Changing bases). Suppose we have a vector b = (1, 2)^T. Let's first write it as a linear combination of the columns of I, the identity matrix. Then we get

    x_1 = (1, 0) · (1, 2) = 1   and   x_2 = (0, 1) · (1, 2) = 2,

so (1, 2) = 1·(1, 0) + 2·(0, 1), which we probably could have figured out before starting this. Let's now write the same vector as a sum of the columns of the orthogonal matrix

    Q = (1/2) [ 1  −√3
               √3    1 ].

A similar calculation to the above gives us

    x_1 = 1/2 + √3   and   x_2 = 1 − √3/2.

Then we get that

    (1, 2) = (1/2 + √3) · (1/2)(1, √3) + (1 − √3/2) · (1/2)(−√3, 1),

a much less obvious conclusion (though the calculation was mechanically simple)!

We give another example from calculus.

Example 3.4. Suppose we wish to analyze the graph of x^2 − 6xy + y^2 = 1. We know from calculus that since this is a second order equation, it must be a conic section: a parabola, hyperbola, or ellipse. However, this is not an equation that is typically taught to be recognized in calculus courses. In particular, the cross term −6xy throws us off. One way of getting around this is via a change of basis: we'll define

    (x, y)^T = R_θ (u, v)^T = (u cos θ − v sin θ, u sin θ + v cos θ)^T.

What we now do is choose θ so that there is no cross term. Through direct substitution, we find

    1 = (u cos θ − v sin θ)^2 − 6(u cos θ − v sin θ)(u sin θ + v cos θ) + (u sin θ + v cos θ)^2
      = u^2 (1 − 6 cos θ sin θ) − 6uv(cos^2 θ − sin^2 θ) + v^2 (1 + 6 sin θ cos θ).

Hence, we choose any θ (remember, this exercise is just to make calculations easier) so that cos^2 θ − sin^2 θ = 0. One such θ is π/4. Plugging in, we get

    −2u^2 + 4v^2 = 1.

This is an equation that we recognize as a hyperbola in the (u, v)-plane. Reflecting on our calculation, the (u, v)-plane is a rotation of the (x, y)-plane by an angle of π/4. We could alternatively view this as the graph of a rotated hyperbola in the (x, y)-plane.
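A quick numerical sanity check of Example 3.4 (my own sketch, numpy assumed): for any point, rotating (u, v) by π/4 and evaluating the original quadratic gives the same value as the cross-term-free quadratic −2u^2 + 4v^2.

```python
import numpy as np

theta = np.pi / 4
c, s = np.cos(theta), np.sin(theta)

# A few arbitrary (u, v) points pushed through x = u cos θ - v sin θ, y = u sin θ + v cos θ.
rng = np.random.default_rng(0)
for _ in range(3):
    u, v = rng.standard_normal(2)
    x, y = u * c - v * s, u * s + v * c
    lhs = x**2 - 6 * x * y + y**2    # quadratic with the cross term
    rhs = -2 * u**2 + 4 * v**2       # rotated quadratic without it
    assert np.isclose(lhs, rhs)
print("cross term eliminated by the rotation")
```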

3.1 Gram-Schmidt.

The goal of Gram-Schmidt is to take a (generally non-orthonormal) basis {x_1, ..., x_n} for a subspace S, and to produce an orthonormal basis {q_1, ..., q_n} for S. This is an algorithm, and a computationally simple one at that, though the arithmetic quickly becomes ugly.

Not finished!
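The notes break off here. As a placeholder, the following is a minimal sketch of the classical Gram-Schmidt iteration the section is setting up (my own code, not the authors'): from each x_j, subtract its projections onto the q's found so far, then normalize what remains.

```python
import numpy as np

def gram_schmidt(X):
    """Classical Gram-Schmidt.

    X: linearly independent vectors x_1, ..., x_n (rows of a list or array).
    Returns orthonormal vectors q_1, ..., q_n spanning the same subspace.
    """
    qs = []
    for x in np.asarray(X, dtype=float):
        # Remove the component of x along each q found so far (projection onto a line).
        for q in qs:
            x = x - (q @ x) * q
        qs.append(x / np.linalg.norm(x))
    return np.array(qs)

Q = gram_schmidt([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0]])
print(np.round(Q @ Q.T, 10))   # the 2x2 identity: the q's are orthonormal
```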