A Note on Two Different Types of Matrices and Their Applications

Arjun Krishnan

I really enjoyed Prof. Del Vecchio's Linear Systems Theory course and thought I'd give something back, so I've written a little note on a couple of topics that introduce applications of two different kinds of matrices you might not encounter in the course. Hopefully you'll find it interesting. There are other reasons I wrote this, and they are both purely selfish: I could use some proof-writing practice, and I really enjoy working these kinds of things out for myself. If I have inadvertently messed up a proof or ignored a deeper mathematical understanding of the objects under consideration, please let me know. The only excuse I can offer is that I'm an engineer, and I'm more comfortable with nuts and bolts than with tangent bundles and cohomologies. I'd appreciate any kind of feedback.

1 Rotation in higher dimensions

This material is based on a discussion I had with Prof. V. Balakrishnan when I was at the Indian Institute of Technology Madras. All the essential ideas are his; I just wrote them down and expanded on a few minor aspects.

1.1 Preliminaries

To fix ideas, let us see what defines a rotation in $\mathbb{R}^3$, and then extend this to a general finite-dimensional vector space. Rotation has the following properties:

1. It is clear from our intuition that rotation is a linear operator. (We can appeal to intuition here because we are still in the process of defining rotation.) That is, if we have two vectors $a, b$ and a real constant $\alpha$,
$$ R(\alpha a + b) = \alpha R(a) + R(b). $$
The picture is immediate in two dimensions, and with a little thought one realizes that it is equally valid in three dimensions if one looks down the axis of rotation and omits the components of $a$ and $b$ along this axis. All this tells us is that $R$ has a matrix representation in any basis, so we can now talk about matrices instead of operators.

2. Rotation leaves the inner product between two vectors unchanged.

That is,
$$ (Ra, Rb) = (a, b) \quad (1) $$
or, equivalently,
$$ a^T R^T R b = a^T b \quad (2) $$
for every $a, b \in \mathbb{R}^3$, where $(\cdot\,,\cdot)$ denotes the inner product. Choose some orthogonal basis and select the vectors $a, b$ from the basis set $\{e_1, e_2, \ldots, e_n\}$. By running over all possible combinations of $a$ and $b$ and using Eq. 1, we find that $R^T R = I$, where $I$ is the identity matrix. In other words, $R^T = R^{-1}$. It follows that
$$ \det(R R^{-1}) = \det(R R^T) = 1 \implies \det(R) = \pm 1. $$
The invariance of the inner product can be understood in more intuitive terms as the invariance of the angle between two vectors and of the length of vectors under the rotational transformation. It is simple to verify the converse and show that angle and length invariance (in $\mathbb{R}^2$ and $\mathbb{R}^3$, where the word "angle" has meaning) implies inner-product invariance. Matrices which preserve the inner product are called orthogonal.

3. Every rotation matrix has an orthogonal generator. I cooked this property up to suit our purposes; we will use it only to rule out $\det(R) = -1$. By "cooked up", I mean that it seems to capture the continuity property of rotations intuitively, so I'll proceed to define it mathematically. Define: for every rotation matrix $R$ and every $N \in \mathbb{N}$, there is an orthogonal matrix $dR$ such that $R = dR^N$. Perhaps a more talented mathematician could define this carefully in terms of norms and define an infinitesimal generator. If we fix some even $N$, we must have $\det(R) = \det(dR^N) = (\det(dR))^N$; since the generator is orthogonal, its determinant is $\pm 1$, and it follows that $\det(R) = +1$. [1]

In summary, the defining properties of rotation matrices $R$ are
$$ R^T R = I, \qquad \det(R) = +1, $$
and these are the properties we use to extend the concept of rotation to higher dimensions.

[1] When $\det(R) = -1$ - say in the case of a mirror inversion through the origin, where $R = -I$ - it is clear that no such generator exists when $N$ is even.
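As a quick numerical sanity check (my addition, not part of the original note), the sketch below builds a 3-D rotation about an arbitrary axis using the Rodrigues formula and verifies the two defining properties; the axis, angle, and helper name are just illustrative choices.

    # Sanity check (not from the note): build a 3-D rotation via the Rodrigues
    # formula and confirm R^T R = I and det(R) = +1. Axis and angle are arbitrary.
    import numpy as np

    def rotation_matrix(axis, theta):
        """Rotation by angle theta about the unit vector `axis` (Rodrigues formula)."""
        axis = np.asarray(axis, dtype=float)
        axis /= np.linalg.norm(axis)
        K = np.array([[0.0, -axis[2], axis[1]],
                      [axis[2], 0.0, -axis[0]],
                      [-axis[1], axis[0], 0.0]])        # cross-product matrix
        return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

    R = rotation_matrix([1.0, 2.0, 3.0], 0.7)
    print(np.allclose(R.T @ R, np.eye(3)))    # True: R is orthogonal
    print(np.isclose(np.linalg.det(R), 1.0))  # True: determinant is +1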

1.2 Axes of rotation

In $\mathbb{R}^3$, an axis of rotation is an invariant direction under the rotation operation. That is, a vector $v$ lies along an axis of rotation if
$$ Rv = v. $$
In more familiar terms, $v$ lies along the axis of rotation if it is an eigenvector corresponding to the eigenvalue $+1$.

Next, notice that the eigenvalues of a rotation matrix lie on the unit circle in the complex plane. Let $v$ be an eigenvector and $\lambda$ the corresponding eigenvalue. Then
$$ \bar{v}^T v = \bar{v}^T R^T R v = \bar{\lambda}\lambda\, \bar{v}^T v \implies |\lambda|^2 = 1. $$
So an eigenvalue is either complex of unit modulus or $\pm 1$. Suppose the eigenvalues are $\{\lambda_1, \lambda_2, \ldots, \lambda_p\}$ with appropriate multiplicities. Then we know
$$ \prod_{i=1}^{p} \lambda_i = \det(R) = +1. \quad (3) $$
This places the constraint that the eigenvalue $-1$ can appear only with even multiplicity. The complex eigenvalues appear in conjugate pairs anyway, since $R$ has only real entries, and each pair contributes $\lambda\bar{\lambda} = 1$ to the product in Eq. 3.

When $n = 2$ there can be no concept of an axis; in fact, the only rotation matrix with a $+1$ eigenvalue is the identity. [2] Suppose the dimension $n \geq 2$ of the space is even. Then it follows from the above that the eigenvalue $+1$ must also appear with even multiplicity. That is, in an even-dimensional space there can never exist a unique axis of rotation. Now suppose $n$ is odd. Then there is the possibility that $+1$ appears with multiplicity 1, and it is possible to specify a unique axis of rotation. Of course this is not certain, and we won't attempt to find conditions for it to happen. However, when $n = 3$, every rotation other than the identity has a unique axis of rotation. To paraphrase Prof. Balakrishnan, we are indeed lucky to be living in $\mathbb{R}^3$! [3]

[2] The student is encouraged to verify this. I always wanted to say that.
[3] One might argue that this is baseless rhetoric and that the extension of the concept of rotation to higher dimensions is an artificial construct - I agree.
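In the same illustrative spirit (again my own sketch, with arbitrary angles), the snippet below builds a generic 3-D rotation as a product of coordinate-axis rotations, confirms that its eigenvalues lie on the unit circle, and recovers the axis as the eigenvector with eigenvalue $+1$.

    # Recover the axis of a 3-D rotation numerically (illustrative check only).
    import numpy as np

    cx, sx = np.cos(0.3), np.sin(0.3)
    cz, sz = np.cos(1.1), np.sin(1.1)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])   # rotation about the x-axis
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])   # rotation about the z-axis
    R = Rz @ Rx                                             # a generic 3-D rotation

    eigvals, eigvecs = np.linalg.eig(R)
    k = np.argmin(np.abs(eigvals - 1.0))        # index of the eigenvalue closest to +1
    axis = np.real(eigvecs[:, k])
    print(np.allclose(R @ axis, axis))          # True: this direction is left invariant
    print(np.abs(eigvals))                      # all ones: eigenvalues on the unit circle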

2 The second partial derivative test

We are all familiar with the second partial derivative test for classifying the critical points of a function of two variables. There are some nice matrices to study when we want to generalize the test to higher dimensions - this is the motivation. The result relies on the spectral theorem for real symmetric matrices (so called because it says something about the eigenvalues of a matrix), and there are some neat proofs along the way. The discussion is elementary and I pretty much re-derived it from scratch. One problem with all the proofs is that I implicitly assume a basis, since we need to use the gradient operator. I believe the concepts are easy to generalize, and I vaguely recollect the terms Fréchet and Gâteaux being used in this regard...

2.1 The test in two variables

In the following, we will use the variable $x$ to represent both a vector and a single variable; the meaning will (hopefully) be clear from context. At a critical point, the gradient vector of a scalar function $f : \mathbb{R}^n \to \mathbb{R}$ is zero. To further classify critical points into maxima, minima and saddle points, we have the following test in $\mathbb{R}^2$. Assume that $f$ has continuous second derivatives at the critical point $(x_0, y_0)$ and construct the Hessian $H$ of the function at the critical point. This is the matrix of second partial derivatives:
$$ H = \begin{bmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{bmatrix}. $$
Then the test says:
$$ \begin{array}{ll} \det(H) > 0 \text{ and } f_{xx} > 0 & \text{Minimum} \\ \det(H) > 0 \text{ and } f_{xx} < 0 & \text{Maximum} \\ \det(H) < 0 & \text{Saddle point} \\ \det(H) = 0 & \text{Inconclusive} \end{array} \quad (4) $$
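To make the test concrete, here is a short sketch (my own example, not from the note) that applies Eq. 4 to $f(x, y) = x^3 - 3x + y^2$, which has critical points at $(\pm 1, 0)$; SymPy is assumed only as a convenient way to take the derivatives.

    # Apply the two-variable test (4) to f(x, y) = x^3 - 3x + y^2 (illustrative).
    import sympy as sp

    x, y = sp.symbols('x y')
    f = x**3 - 3*x + y**2

    H = sp.Matrix([[f.diff(x, x), f.diff(x, y)],
                   [f.diff(y, x), f.diff(y, y)]])      # Hessian of f

    for point in [(1, 0), (-1, 0)]:
        Hp = H.subs({x: point[0], y: point[1]})
        d, fxx = Hp.det(), Hp[0, 0]
        if d > 0:
            kind = 'minimum' if fxx > 0 else 'maximum'
        elif d < 0:
            kind = 'saddle point'
        else:
            kind = 'inconclusive'
        print(point, kind)    # (1, 0) minimum, (-1, 0) saddle point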

2.2 In higher dimensions

To examine these conditions in greater detail and generalize them to higher dimensions, we need a few statements about the structure of the problem. $\mathbb{R}^n$ has the usual inner product $(\cdot\,,\cdot)$ defined on it and the Euclidean norm induced by that inner product. The Cartesian basis is implicitly assumed, for the reasons mentioned earlier. Now, let us go back to the definition of an extremal point. A function has a minimum at a critical point $x_0$ if there exists an $\epsilon > 0$ such that
$$ f(x) \geq f(x_0) \quad \forall x \ \text{s.t.}\ \|x - x_0\| < \epsilon. \quad (5) $$
The condition for a maximum is the obvious analogue, and we have a saddle if for every $\epsilon > 0$ there are $x_1, x_2$ in the ball such that $f(x_1) > f(x_0)$ and $f(x_2) < f(x_0)$.

To analyze the behavior around the critical point, we use Taylor's theorem. For a function of several variables it can be written in terms of inner products or in matrix notation: around some point $x_0$,
$$ f(x) = f(x_0) + (\nabla f(x_0),\, x - x_0) + \tfrac{1}{2}\,(x - x_0,\, H_{f(x_0)}(x - x_0)) + R(x - x_0) $$
$$ f(x) = f(x_0) + (x - x_0)^T \nabla f(x_0) + \tfrac{1}{2}\,(x - x_0)^T H_{f(x_0)} (x - x_0) + R(x - x_0) $$
where $\nabla f(x_0)$ is the gradient vector and $H_{f(x_0)}$ is the Hessian matrix of $f$ at $x_0$. The key property we use here is that the remainder $R(x - x_0)$ can be made as small as we please by making $\|x - x_0\|$ small. This property is used so routinely that we hardly ever ask when it is applicable. For it to hold here we need a further assumption on $f$: the second partials are not only continuous but also differentiable at the critical point. Then there is an $\epsilon$-ball around $x_0$ in which the remainder $R(x - x_0)$ is smaller in absolute value than the other terms, provided they are not zero.

At the critical point the gradient is the zero vector, so the behavior of $f(x)$ is dictated by the second-derivative term. For there to be a minimum, say, the term $(x - x_0)^T H_{f(x_0)} (x - x_0)$ must be positive for every $x \neq x_0$ in a ball of radius $\epsilon$ about $x_0$; since the quadratic form only rescales under $x - x_0 \mapsto c\,(x - x_0)$, the condition must then hold in every direction. To summarize,
$$ \begin{array}{ll} x^T H_{x_0} x > 0 \;\; \forall x & \text{Minimum} \\ x^T H_{x_0} x < 0 \;\; \forall x & \text{Maximum} \\ x^T H_{x_0} x < 0 \;\; \forall x \in S \subset \mathbb{R}^n \ \text{and}\ x^T H_{x_0} x > 0 \;\; \forall x \in \mathbb{R}^n \setminus S & \text{Saddle} \end{array} \quad (6) $$
where $S$ is some subset of $\mathbb{R}^n$ and $H_{x_0}$ is the Hessian matrix. Clairaut's theorem tells us that the Hessian matrix is symmetric, i.e., the order of partial differentiation does not matter. A symmetric matrix $H$ with the first (respectively second) property in Eq. 6 is called positive-definite (negative-definite). Now we must study the properties of symmetric matrices to put these conditions in more useful terms.

2.2.1 Where does this Hessian come from?

We just stated Taylor's theorem in terms of the Hessian and the gradient - how does this form arise? In one or two dimensions, one can visualize a tangent line or plane $z = f(x_0)$. There is a minimum or maximum at a critical point if the function does not cross the tangent plane in some small ball of radius $\epsilon$ around the critical point. One way of using the more familiar single-variable Taylor theorem is then to consider the change in the value of $f(x)$ along every line through the critical point: if we would like a critical point to be a minimum, we require the second derivative to be positive in every direction. This is most conveniently understood using the gradient. The change in a function of several variables $f(x_1, \ldots, x_n)$ can be written as

$$ df = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\, dx_i. $$
Then we can define an operator $\nabla$ that gives the vector of partial derivatives, take its dot product with some unit vector $\hat{u} = (u_1, \ldots, u_n)$, and find the rate of change of $f(x)$ in the $u$ direction as
$$ \frac{df_u}{du} = (\hat{u} \cdot \nabla) f(x_0). $$
It follows that the second derivative - the change in the change of $f$ - along $u$ is
$$ \frac{d^2 f_u}{du^2} = (\hat{u} \cdot \nabla)\big((\hat{u} \cdot \nabla) f(x_0)\big). $$
To see how this can be written in terms of the Hessian in Eq. 6, we can use Einstein's summation notation. The derivative operator in the direction $u$ will be written as $u_i\, \partial_i f(x)$, where $\partial_i$ is shorthand for the partial derivative with respect to $x_i$. The second derivative is just $u_k\, \partial_k (u_i\, \partial_i f(x))$. Rearranging, we find $u_k\, (\partial_k \partial_i f(x))\, u_i$, which we can write as an explicit sum for clarity:
$$ \sum_{i,k} u_k\, \partial_k \partial_i f(x)\, u_i = u^T H u. $$
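The identity above is easy to check numerically. The sketch below (my example; the function, point, and direction are arbitrary) compares a central-difference estimate of $d^2 f(x_0 + t\hat{u})/dt^2$ at $t = 0$ with $\hat{u}^T H \hat{u}$, using a Hessian worked out by hand.

    # Check that the second directional derivative equals u^T H u (illustrative).
    import numpy as np

    def f(p):
        x, y, z = p
        return x**2 * y + y * z**3

    def hessian(p):
        x, y, z = p
        return np.array([[2*y, 2*x,     0.0],
                         [2*x, 0.0, 3*z**2],
                         [0.0, 3*z**2, 6*y*z]])   # worked out by hand for this f

    x0 = np.array([1.0, 2.0, 0.5])
    u = np.array([1.0, -1.0, 2.0])
    u /= np.linalg.norm(u)                        # unit direction

    h = 1e-4
    second_along_u = (f(x0 + h*u) - 2*f(x0) + f(x0 - h*u)) / h**2   # central difference
    print(second_along_u, u @ hessian(x0) @ u)    # the two numbers agree closely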

2.3 Properties of Symmetric Matrices

To better understand the conditions in Eq. 6, we state and indicate the proof of some results regarding symmetric matrices. We work in the restricted setting of $\mathbb{R}^n$ and only with symmetric matrices, rather than in a more general Hilbert space with symmetric operators, to avoid some technical issues. [4]

Definition (Tangent Space). Let $x \in \mathbb{R}^n$. Then the tangent space at $x$ is defined as
$$ T = \{\, t \mid (x, t) = 0 \,\}. $$
It follows from the linearity of the inner product that $T$ is a subspace, and one can show that $T \oplus \operatorname{span}\{x\} = \mathbb{R}^n$.

Lemma 1 (Eigenvalues of Symmetric Matrices). The eigenvalues of a real symmetric matrix are real.

Proof. Let $\lambda$ be a complex eigenvalue of $H$ and let $v$ be the associated eigenvector. Since $H$ is symmetric, $(Hv)^T = v^T H$. We know (from class) that the complex conjugates $\bar{\lambda}$ and $\bar{v}$ are also an eigenvalue-eigenvector pair. Then
$$ \bar{v}^T H v = \lambda\, \bar{v}^T v = \bar{\lambda}\, \bar{v}^T v, $$
and it follows that $\lambda = \bar{\lambda}$.

Lemma 2 (Inner Product Property). Let $A$ be a symmetric matrix. Then $(Av, x) = (v, Ax)$ for all $x, v$.

Proof. $(Av)^T = v^T A^T = v^T A$, so
$$ (Av, x) = (Av)^T x = v^T A x = (v, Ax). $$

Theorem 1 (Real Spectral Theorem). Real symmetric matrices are diagonalizable and can be written in the form $Q^T D Q$, where $Q$ is an orthogonal matrix and $D$ is a diagonal matrix, both with purely real entries.

Proof. Note: I st... I mean adapted this proof from somewhere on the net, and the person who posted it got it from some book (Apostol's Calculus, or something). There were no details in the stated proof, so I expanded it and wrote a proof for engineers. We will prove that we can construct $n$ eigenvectors and that they are mutually orthogonal. For a general diagonalizable matrix $A$, the similarity transformation to the eigenvector basis is $P^{-1} A P$, where $P$ has as its columns the eigenvectors of $A$; since the eigenvectors here turn out to be orthogonal, we can construct an orthogonal $P$ by simply normalizing them.

Claim 2. $A$ has at least one eigenvector.

Pf. Define the function $f(x) = (x, Ax)$ on the surface of the unit ball, $S = \{x \mid (x, x) = 1\}$, in $\mathbb{R}^n$. The gradient of $f$ is (see Appendix B)
$$ \nabla f = 2Ax. $$
Now, $S$ is a closed and bounded set and $f$ is continuous, so a generalization of the Weierstrass theorem (I'm not sure about the name) states that $f$ attains a maximum at some $x_1$. Let $T_{x_1}$ be the tangent space at $x_1$. Then $\nabla f$ is orthogonal to $T_{x_1}$ (see Appendix A). Hence $Ax_1 = 0$ or $Ax_1 = \lambda x_1$; in either case $x_1$ is an eigenvector.

Claim 3. $T_{x_1}$ is an invariant subspace of $A$.

[4] That basically means I don't know how and don't want to make a mistake.

Pf. Let $v \in T_{x_1}$. By Lemma 2, $(Ax_1, v) = (Av, x_1)$. But by Claim 2, $(Ax_1, v) = \lambda\,(x_1, v) = 0$. This implies $Av \in T_{x_1}$.

Then all we have to do is restrict $A$ to $T_{x_1}$ (apply it only to vectors in $T_{x_1}$), [5] consider the same function $f = (x, Ax)$ on the unit sphere in $T_{x_1}$, and repeat the procedure until all $n$ eigenvectors are found. It is clear that the eigenvectors are orthogonal. Since the eigenvalues and corresponding eigenvectors are real, it follows that $A$ is diagonalizable in the form specified.

[5] There is a theorem called the Invariant Subspace Theorem which tells us that if a linear operator $T$ leaves a subspace $S$ invariant, there is another operator $T_S$ defined on $S$ such that $T_S(x) = T(x)$ for all $x \in S$. Obvious, isn't it?

2.4 Conclusion

Since the Hessian $H$ can be diagonalized in the form specified by the spectral theorem, we can perform a coordinate transformation to the eigenvector basis. The condition for positive definiteness is unchanged in the new basis because
$$ x^T H x = (Qx)^T D\, (Qx). $$
It is then clear that positive definiteness requires all the eigenvalues of $H$ to be strictly positive, and negative definiteness requires them all to be strictly negative. The saddle-point condition requires some of the eigenvalues to be strictly positive and some to be strictly negative. Of course, if even one of the eigenvalues is zero, the second derivative is zero in some direction and we need higher-order tests to analyze the behavior of the critical point. It is easy to understand the conditions for the two-dimensional case given by Eq. 4 now. Pretty neat, no?
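Tying the conclusion back to Eq. 6, here is a minimal sketch (my own example, not from the note) that classifies a critical point of a three-variable function purely from the signs of the Hessian eigenvalues.

    # Classify a critical point in n dimensions via the Hessian eigenvalues.
    # f(x, y, z) = x^2 + x*y + y^2 - z^2 has a critical point at the origin;
    # its Hessian there is computed by hand below (illustrative example).
    import numpy as np

    H = np.array([[2.0, 1.0, 0.0],
                  [1.0, 2.0, 0.0],
                  [0.0, 0.0, -2.0]])      # Hessian of f at (0, 0, 0)

    eigs = np.linalg.eigvalsh(H)          # eigvalsh: real eigenvalues of a symmetric matrix
    if np.any(np.isclose(eigs, 0.0)):
        print('inconclusive (zero eigenvalue)')
    elif np.all(eigs > 0):
        print('minimum')
    elif np.all(eigs < 0):
        print('maximum')
    else:
        print('saddle point')             # prints 'saddle point': eigenvalues are 1, 3, -2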

A Constrained Optimization

First, an ugly proof. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a continuous function with continuously differentiable first partials, and let $S$ be the surface of the unit sphere in $\mathbb{R}^n$. Weierstrass' theorem states that a maximum of $f$ on $S$ exists - call it $x_0$. We need to show that $\nabla f(x_0)$ is orthogonal to every vector in the tangent space $T_{x_0}$. First notice that any vector $v$ can be expressed as a sum $\alpha x_0 + \beta \hat{t}$, where $\hat{t}$ is a unit vector in $T_{x_0}$ and $\beta \geq 0$ (because $T_{x_0}$ is a subspace and $T_{x_0} \oplus \operatorname{span}\{x_0\} = \mathbb{R}^n$). Now suppose the contrary: write $\nabla f(x_0) = \alpha x_0 + \beta \hat{t}$ with $\beta > 0$, so that it is not orthogonal to $T_{x_0}$. Taylor's theorem states that
$$ f(x) = f(x_0) + \nabla f(x_0) \cdot (x - x_0) + R(x - x_0). \quad (7) $$
Since the second derivative exists, there is a $\delta > 0$ such that for every $\|x - x_0\| < \delta$,
$$ |R(x - x_0)| < |\nabla f(x_0) \cdot (x - x_0)| $$
provided the right-hand side is not zero. Now choose an $x = a x_0 + b\hat{t}$ on $S$ with $x \neq x_0$, $b > 0$, and $a$ close to 1. It is clear that such a choice exists, because $a$ and $b$ must satisfy
$$ (x, x) = a^2 + b^2 = 1, \qquad \|x - x_0\|^2 = \big((a-1)x_0 + b\hat{t},\, (a-1)x_0 + b\hat{t}\big) = (a-1)^2 + b^2 < \delta^2 \implies 1 - \tfrac{\delta^2}{2} < a < 1. $$
The inner product in Eq. 7 is then $\alpha(a-1) + \beta b$. The term $\beta b$ is always positive, and $\alpha(a-1)$ is positive as well if $\alpha < 0$. Suppose instead that $\alpha > 0$; then we need $\beta b > \alpha(1-a)$. After a little algebraic manipulation (using $b = \sqrt{1-a^2}$), this becomes the condition
$$ a > \frac{\alpha^2 - \beta^2}{\alpha^2 + \beta^2}. $$
Since $\beta > 0$, the right-hand side is strictly less than 1, so we can certainly select an $a$ that does the job. For such an $x$, the first-order term in Eq. 7 is positive and dominates the remainder, so $f(x) > f(x_0)$, contradicting the maximality of $x_0$. Hence $\nabla f(x_0)$ must be orthogonal to $T_{x_0}$.

What we have done here is just a special case of a general principle in optimization problems called the method of Lagrange multipliers: given $f(x)$ subject to a constraint equation $g(x) = c$, the extremal points of $f(x)$ are found at the points where
$$ \nabla f = \lambda\, \nabla g, \qquad \lambda \in \mathbb{R}. $$
Why should this be the case? The gradient of $g$ always points in a direction normal to the constraint curve $g(x) = c$. When walking along this curve, we find extrema of $f(x)$ where $f$ has no first-order change in the directions tangential to the constraint curve; that is, the gradient of $f(x)$ must be normal to the constraint curve, i.e., parallel to $\nabla g$. [6] In our case the constraint is $g(x) = (x, x) = 1$ and $\nabla g = 2x$.

[6] There is a nice article on Lagrange multipliers on Wikipedia.
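As an illustration of the multiplier condition on exactly this constraint, the sketch below (my own; it assumes SciPy's SLSQP solver, which the note does not use) maximizes $(x, Ax)$ on the unit sphere and checks that the maximizer is an eigenvector of $A$ with the largest eigenvalue, i.e. that $\nabla f = 2Ax = \lambda\, \nabla g = 2\lambda x$ at the solution.

    # Maximize (x, Ax) subject to (x, x) = 1 and compare with the top eigenpair.
    # Illustrative sketch; assumes scipy is available.
    import numpy as np
    from scipy.optimize import minimize

    A = np.array([[2.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])       # a symmetric matrix (eigenvalues 1, 2, 4)

    objective = lambda x: -x @ A @ x       # maximize (x, Ax) = minimize its negative
    constraint = {'type': 'eq', 'fun': lambda x: x @ x - 1.0}   # g(x) = (x, x) - 1 = 0

    res = minimize(objective, x0=np.array([1.0, 1.0, 1.0]), method='SLSQP',
                   constraints=[constraint])

    x_star = res.x / np.linalg.norm(res.x)
    lam_max = np.max(np.linalg.eigvalsh(A))
    print(np.isclose(x_star @ A @ x_star, lam_max))              # True: constrained maximum
    print(np.allclose(A @ x_star, lam_max * x_star, atol=1e-4))  # True: an eigenvector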

B Gradient of (x, Ax)

We stated that $\nabla(x, Ax) = 2Ax$. The simplest proof uses Einstein's summation notation: a repeated index tells us to sum over that index. Let $A = \{a_{ij}\}$ and $x = \{x_j\}$. As an illustrative example, the $k$-th partial derivative of the $i$-th component of $Ax$ is
$$ \frac{\partial (Ax)_i}{\partial x_k} = \frac{\partial (a_{ij} x_j)}{\partial x_k} = a_{ik}. $$
Then, for the function $f(x) = (x, Ax)$, each component of the gradient vector is given by
$$ \big(\nabla (x, Ax)\big)_k = \frac{\partial (x_i\, a_{ij}\, x_j)}{\partial x_k} = a_{kj} x_j + x_i a_{ik}, $$
where we have used the product rule for differentiation. Since $A$ is symmetric, $a_{ik} = a_{ki}$, so both terms equal $a_{kj} x_j = (Ax)_k$ and the result follows immediately.
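Finally, a quick finite-difference check (mine, not the note's) that $\nabla(x, Ax) = 2Ax$ for a random symmetric $A$ and an arbitrary test point.

    # Central differences reproduce grad (x, Ax) = 2 A x for symmetric A (illustrative).
    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.standard_normal((4, 4))
    A = (M + M.T) / 2                       # a random symmetric matrix
    x = rng.standard_normal(4)

    f = lambda v: v @ A @ v                 # f(x) = (x, Ax)

    h = 1e-6
    grad_fd = np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(4)])
    print(np.allclose(grad_fd, 2 * A @ x))  # True: the gradient is 2Ax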