Principal Component Analysis (PCA)

PCA is a widely used statistical tool for dimension reduction. The objective of PCA is to find common factors, the so-called principal components, in the form of linear combinations of the variables under investigation, and to rank them according to their importance.

Our starting point consists of $T$ observations on $N$ variables, which will be arranged in a $T \times N$ matrix $R$,
$$R = \begin{pmatrix} r_{11} & r_{21} & \cdots & r_{N1} \\ r_{12} & r_{22} & \cdots & r_{N2} \\ \vdots & \vdots & & \vdots \\ r_{1T} & r_{2T} & \cdots & r_{NT} \end{pmatrix}.$$
That is, $r_{it}$ is the return of asset $i$ at time $t$. Usually centered data are used, so that $R'R/(T-1)$ is the sample covariance matrix (or correlation matrix) of the returns under study.
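As a numerical companion, the construction of $R$ and of the sample covariance matrix can be sketched in Python/NumPy (the simulated returns, dimensions, and seed are illustrative assumptions, not data from the text):

```python
import numpy as np

# Illustrative dimensions: T = 60 monthly observations on N = 5 assets.
rng = np.random.default_rng(0)
T, N = 60, 5
returns = rng.normal(loc=0.01, scale=0.05, size=(T, N))  # stand-in for real returns

# Center each column (asset) so that R'R / (T - 1) is the sample covariance matrix.
R = returns - returns.mean(axis=0)
cov = R.T @ R / (T - 1)

# Cross-check against NumPy's estimator (rowvar=False: columns are variables).
assert np.allclose(cov, np.cov(returns, rowvar=False))
```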
The First Principal Component

Let us start with one variable, say $p$. Variable $p$ takes $T$ values, arranged in a column vector $p = [p_1, \ldots, p_T]'$. $p$ is not yet determined, but let us proceed as if it were. Then our approximation takes the form $R \approx pa'$, where $a$ is an $N$-dimensional column vector, i.e.,
$$\begin{pmatrix} r_{11} & r_{21} & \cdots & r_{N1} \\ r_{12} & r_{22} & \cdots & r_{N2} \\ \vdots & \vdots & & \vdots \\ r_{1T} & r_{2T} & \cdots & r_{NT} \end{pmatrix} \approx \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_T \end{pmatrix} [\,a_1 \; \cdots \; a_N\,] = \begin{pmatrix} p_1 a_1 & p_1 a_2 & \cdots & p_1 a_N \\ p_2 a_1 & p_2 a_2 & \cdots & p_2 a_N \\ \vdots & \vdots & & \vdots \\ p_T a_1 & p_T a_2 & \cdots & p_T a_N \end{pmatrix}.$$
Thus, $r_{it}$ is approximated by $p_t a_i$. The matrix of discrepancies is $R - pa'$. Our criterion for choosing $p$ and $a$ will be to select these vectors such that the sum of squares of all $T \cdot N$ discrepancies is minimized, i.e.,
$$\sum_{i=1}^N \sum_{t=1}^T (r_{it} - p_t a_i)^2 = \mathrm{tr}[(R - pa')'(R - pa')], \qquad (1)$$
using property (14) of the trace (see Appendix). Note that the product $pa'$ remains unchanged when $p$ is multiplied by some scalar $c \neq 0$ and $a$ by $1/c$. By imposing
$$\sum_{t=1}^T p_t^2 = p'p = 1, \qquad (2)$$
we obtain uniqueness except for sign.
Then our objective function (1) becomes
$$S = \mathrm{tr}[(R - pa')'(R - pa')] = \mathrm{tr}(R'R) - \mathrm{tr}(ap'R) - \mathrm{tr}(R'pa') + \mathrm{tr}(a \underbrace{p'p}_{=1} a') = \mathrm{tr}(R'R) - 2p'Ra + a'a, \qquad (3)$$
using that, from (13), $\mathrm{tr}(ap'R) = \mathrm{tr}(p'Ra) = p'Ra$, $\mathrm{tr}(R'pa') = \mathrm{tr}(a'R'p) = a'R'p = p'Ra$, and $\mathrm{tr}(aa') = \mathrm{tr}(a'a) = a'a$.
Differentiating (3) with respect to $a$ (for given $p$) and setting the derivative equal to zero,
$$\frac{\partial S}{\partial a} = -2R'p + 2a = 0,$$
gives
$$a = R'p. \qquad (4)$$
Now substitute (4) in the objective function (3) to obtain $S = \mathrm{tr}(R'R) - p'RR'p$, showing that our new task is to maximize $p'RR'p$ with respect to $p$, subject to (2). The Lagrangian is
$$L = p'RR'p - \lambda(p'p - 1).$$
The first-order condition requires that
$$\frac{\partial L}{\partial p} = 2RR'p - 2\lambda p = 0, \quad \text{i.e.,} \quad (RR' - \lambda I)p = 0, \qquad (5)$$
where $I$ is the identity matrix. For (5) to have a nontrivial solution ($p \neq 0$), we must have that
$$\det(RR' - \lambda I) = 0, \qquad (6)$$
which means that $p$ is an eigenvector of the $T \times T$ positive semidefinite matrix $RR'$ corresponding to the eigenvalue (or root) $\lambda$. As $RR'$ has, in general, $N$ nonzero eigenvalues (if the sample covariance matrix is of full rank), we have to determine which eigenvalue is to be taken.
To do so, premultiply (5) by $p'$, resulting in
$$p'RR'p = \lambda p'p = \lambda, \qquad (7)$$
which, as we want to maximize $p'RR'p$, means that we should take the largest root of $RR'$. Note that all roots of $RR'$ are nonnegative, and the positive roots are those of $R'R$, which is $T-1$ times the sample covariance matrix of the returns under consideration. Note that by premultiplying (5) by $R'$ we also obtain
$$(R'R - \lambda I)\underbrace{R'p}_{=a} = (R'R - \lambda I)a = 0, \qquad (8)$$
which means that $a$ is an eigenvector of $R'R$ corresponding to the largest root of $R'R$ (note that $R'R$ and $RR'$ have the same nonzero eigenvalues). Furthermore, (4) and (5) imply
$$\lambda p \overset{(5)}{=} RR'p \overset{(4)}{=} Ra \quad \Longrightarrow \quad p = \frac{1}{\lambda} Ra. \qquad (9)$$
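The relations (4), (5), and (9) can be verified numerically; a minimal sketch with a simulated return matrix (all names, dimensions, and the seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 60, 5
R = rng.normal(size=(T, N))
R -= R.mean(axis=0)  # centered returns

# Work with the small N x N matrix R'R; its nonzero roots equal those of RR'.
lam, V = np.linalg.eigh(R.T @ R)  # eigenvalues in ascending order
lam1 = lam[-1]                    # largest root
v1 = V[:, -1]                     # unit-length eigenvector of R'R

a1 = np.sqrt(lam1) * v1           # coefficient vector, scaled so that p'p = 1 below
p1 = R @ a1 / lam1                # first principal component, eq. (9)

assert np.isclose(p1 @ p1, 1.0)                # normalization (2)
assert np.allclose(R.T @ p1, a1)               # a = R'p, eq. (4)
assert np.allclose(R @ (R.T @ p1), lam1 * p1)  # eigen-equation (5)
```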
Vector $p$ given by (9), which is a linear combination of the original variables in $R$, is the first principal component of the $N$ variables in $R$.
Other Principal Components

Let us use subscripts for the first principal component, i.e., $p_1$, $a_1$, $\lambda_1$, and similarly for the second, third, ... principal component. Currently, our matrix is approximated by $p_1 a_1'$. The residual matrix is $R - p_1 a_1'$, which in turn will be approximated by another principal component, $p_2$, with corresponding coefficient vector $a_2$. As before, for identification, put $p_2'p_2 = 1$. Then we want to minimize
$$S_2 = \mathrm{tr}[(R - p_1 a_1' - p_2 a_2')'(R - p_1 a_1' - p_2 a_2')].$$
It turns out that the second principal component $p_2$ is equal to the unit-length eigenvector of $RR'$ corresponding to the second largest eigenvalue, $\lambda_2$, of $RR'$ or, equivalently, of $R'R$.
Moreover, $a_2$ is the corresponding eigenvector of $R'R$, and
$$p_2 = \frac{1}{\lambda_2} R a_2.$$
We can go on in this way by deriving further principal components. The $i$th such component minimizes the sum of squares of the discrepancies that are left after the earlier components have done their work. The result is that $p_i$ is the unit-length characteristic vector of $RR'$ corresponding to the $i$th largest eigenvalue, $\lambda_i$. To find the length of vector $a_i$, use $p_i = Ra_i/\lambda_i$, which gives
$$p_i'p_i = 1 = a_i'R'Ra_i/\lambda_i^2 = a_i'a_i \lambda_i/\lambda_i^2 \quad \Longrightarrow \quad a_i'a_i = \lambda_i.$$
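The length result $a_i'a_i = \lambda_i$ can be checked for all components at once; a sketch under the same illustrative simulated setup:

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 60, 4
R = rng.normal(size=(T, N))
R -= R.mean(axis=0)

lam, V = np.linalg.eigh(R.T @ R)
lam, V = lam[::-1], V[:, ::-1]    # sort eigenpairs in descending order

A = V * np.sqrt(lam)              # column i is a_i, scaled to length sqrt(lam_i)
P = R @ A / lam                   # column i is p_i = R a_i / lam_i

for i in range(N):
    assert np.isclose(A[:, i] @ A[:, i], lam[i])  # a_i'a_i = lambda_i
    assert np.isclose(P[:, i] @ P[:, i], 1.0)     # p_i'p_i = 1
```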
As $R'R$ and $RR'$ have the same nonzero eigenvalues, one may also work in terms of the sample covariance matrix $\frac{1}{T-1}R'R$, which is of primary interest in our context. This means that we perform a PCA on the variables $R/\sqrt{T-1}$, where $R$ contains the centered (demeaned) returns. In general, if we use $r$ principal components to approximate the variables under study, the approximation is given by
$$R/\sqrt{T-1} \approx \sum_{i=1}^r p_i a_i' = PA',$$
where $P = [p_1, \ldots, p_r]$, $A = [a_1, \ldots, a_r]$, and an approximation for the covariance matrix is
$$R'R/(T-1) \approx AP'PA' = AA', \quad \text{as } P'P = I. \qquad (10)$$
$P'P = I$ follows from our normalization $p_i'p_i = 1$ and the fact that eigenvectors corresponding to different eigenvalues of symmetric matrices are orthogonal (see Appendix). Note that this means that the principal components are uncorrelated. Note also that this approximation will be singular as long as $r < N$. A full-rank covariance matrix can be obtained, however, quite similar to the Single Index Model, by adding a diagonal matrix of asset-specific error variance terms (which are assumed to be uncorrelated). The easiest way to do so is just to replace the diagonal elements of (10) with the sample variances of the individual assets.
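A sketch of the rank-$r$ covariance approximation (10) and of the diagonal replacement described above (dimensions, seed, and the choice $r = 2$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
T, N, r = 60, 6, 2
X = rng.normal(size=(T, N))
X -= X.mean(axis=0)
R = X / np.sqrt(T - 1)            # PCA on centered returns / sqrt(T - 1)

lam, V = np.linalg.eigh(R.T @ R)  # R'R is now the sample covariance matrix
lam, V = lam[::-1], V[:, ::-1]

A = V[:, :r] * np.sqrt(lam[:r])   # keep the first r components
cov_approx = A @ A.T              # eq. (10): rank r < N, hence singular

# Full-rank repair: put the sample variances back on the diagonal.
cov_full = cov_approx.copy()
np.fill_diagonal(cov_full, np.diag(R.T @ R))

assert np.linalg.matrix_rank(cov_approx) == r
assert np.all(np.linalg.eigvalsh(cov_full) > 0)  # positive definite, full rank
```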
The rationale behind this procedure is that we want to reduce the number of risk factors to a lower dimension. That is, we hope to capture the systematic part of asset covariation by using just a few principal components, while the covariation in the sample covariance matrix that is not captured by these first few principal components is attributed to random noise, i.e., it will not improve, and may even considerably deteriorate, forecasts of future asset covariances. As this is a statistical factor model, the factors need not have an economic or financial interpretation. The discussion of principal component analysis given here closely follows Henri Theil (1971), Principles of Econometrics, Amsterdam: John Wiley & Sons; see, in particular, pp. 46-56.
Choosing the Number of Principal Components

The eigenvalues may be used to measure the relative importance of the corresponding components. The argument is based on the criterion used: the sum of squares of all $T \cdot N$ discrepancies. Before any component is used, the discrepancies are the elements of $R$, and their sum of squares is
$$\sum_{i=1}^N \sum_{t=1}^T r_{it}^2 = \mathrm{tr}(R'R).$$
The residual sum of squared discrepancies with $r$ principal components is given by
$$\begin{aligned}
S &= \mathrm{tr}\Big[\Big(R - \sum_{i=1}^r p_i a_i'\Big)'\Big(R - \sum_{i=1}^r p_i a_i'\Big)\Big] \\
&= \mathrm{tr}(R'R) - 2\sum_{i=1}^r \mathrm{tr}(R'p_i a_i') + \sum_i \sum_j \mathrm{tr}(a_i p_i'p_j a_j') \\
&= \mathrm{tr}(R'R) - 2\sum_{i=1}^r \mathrm{tr}(R'p_i a_i') + \sum_{i=1}^r a_i'a_i \\
&= \mathrm{tr}(R'R) - 2\sum_i p_i'RR'p_i + \sum_i p_i'RR'p_i \\
&= \mathrm{tr}(R'R) - \sum_i p_i'RR'p_i = \mathrm{tr}(R'R) - \sum_i p_i'p_i \lambda_i = \mathrm{tr}(R'R) - \sum_{i=1}^r \lambda_i,
\end{aligned}$$
where the third equality uses $p_i'p_j = 0$ for $i \neq j$ and $p_i'p_j = 1$ for $i = j$.
Thus, component $i$ accounts for a reduction of the sum of squared discrepancies equal to $\lambda_i$. For example, component $i$ accounts for a fraction
$$\frac{\lambda_i}{\mathrm{tr}(R'R)} = \frac{\lambda_i}{\sum_{j=1}^N \lambda_j}$$
of the total variation, and the first $r$ principal components account for a fraction
$$\frac{\sum_{j=1}^r \lambda_j}{\mathrm{tr}(R'R)} = \frac{\sum_{j=1}^r \lambda_j}{\sum_{j=1}^N \lambda_j}$$
of the total variation.
The following selection methods are frequently used in practical work:

Percent of variance: For a fixed fraction $\delta$, choose $r$ as the smallest number for which
$$\frac{\sum_{j=1}^r \lambda_j}{\mathrm{tr}(R'R)} \geq \delta.$$

Average eigenvalue: Keep all principal components whose eigenvalues exceed the average eigenvalue, $N^{-1}\sum_j \lambda_j$.

Scree graphs: This method is named after the geological term scree (German: Geröllfeld), referring to the debris at the foot of a rocky cliff. In a plot of the ordered eigenvalues, the relevant eigenvalues form the cliff, and the unimportant components are represented by the smaller eigenvalues forming the scree.

Clearly, these methods do not represent formal statistical tests but rather rules of thumb.
Example

Consider our 24 stocks from the DAX, monthly returns over the period 1996-2001, 60 observations for each stock. The average eigenvalue is given by 91.6254.(1) Thus, when we use the Average Eigenvalue rule to determine the number of components, we will use the first 6 principal components. When we want to employ the Percent of variance rule with, for example, $\delta = 0.75$, we use the first 7 principal components. The Scree Graph also suggests something in this direction. (?)

(1) The eigenvalues are shown in the table on the next page.
  i    λ_i        λ_i / Σ_{j=1}^{24} λ_j    Σ_{j=1}^{i} λ_j / Σ_{j=1}^{24} λ_j
  1    746.2738   0.3394                    0.3394
  2    305.8800   0.1391                    0.4785
  3    183.0373   0.0832                    0.5617
  4    134.0729   0.0610                    0.6227
  5    115.0188   0.0523                    0.6750
  6     98.9506   0.0450                    0.7200
  7     82.4595   0.0375                    0.7575
  8     69.9632   0.0318                    0.7893
  9     66.0017   0.0300                    0.8193
 10     60.7800   0.0276                    0.8469
 11     54.2673   0.0247                    0.8716
 12     46.9439   0.0213                    0.8930
 13     42.5606   0.0194                    0.9123
 14     35.6098   0.0162                    0.9285
 15     27.8244   0.0127                    0.9412
 16     24.1203   0.0110                    0.9521
 17     23.3074   0.0106                    0.9627
 18     20.6172   0.0094                    0.9721
 19     15.4306   0.0070                    0.9791
 20     12.3780   0.0056                    0.9848
 21     11.6064   0.0053                    0.9900
 22      9.4735   0.0043                    0.9943
 23      7.7192   0.0035                    0.9979
 24      4.7125   0.0021                    1.0000
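The two numeric selection rules can be applied directly to the tabulated eigenvalues; a sketch (the helper function `select_components` is a hypothetical name, not from the text):

```python
import numpy as np

def select_components(eigenvalues, delta=0.75):
    """Smallest r with cumulative share >= delta, and count above the average."""
    lam = np.sort(np.asarray(eigenvalues))[::-1]
    frac = np.cumsum(lam) / lam.sum()
    r_percent = int(np.searchsorted(frac, delta) + 1)  # percent-of-variance rule
    r_average = int(np.sum(lam > lam.mean()))          # average-eigenvalue rule
    return r_percent, r_average

lam = [746.2738, 305.8800, 183.0373, 134.0729, 115.0188, 98.9506,
       82.4595, 69.9632, 66.0017, 60.7800, 54.2673, 46.9439,
       42.5606, 35.6098, 27.8244, 24.1203, 23.3074, 20.6172,
       15.4306, 12.3780, 11.6064, 9.4735, 7.7192, 4.7125]

# Matches the example: 7 components for delta = 0.75, 6 by the average rule.
print(select_components(lam, delta=0.75))  # (7, 6)
```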
[Figure: "Eigenvalues of Sample Covariance Matrix" — scree graph of the 24 ordered eigenvalues (y-axis from 0 to 800, x-axis component index from 0 to 25).]
Economic Interpretation of the Components

Compared to approaches using financial or macroeconomic variables as factors, the factors extracted using a purely statistical procedure such as PCA are more difficult to interpret (at least for equity portfolios). An exception is the first factor, which is usually highly correlated with an appropriate market index. That is, the first principal component captures the common trend. For our example, suppose we use the first 6 principal components. Then the correlations between these 6 components and the DAX index are as follows:

Component:    1       2       3       4       5        6
Correlation:  0.888   0.366   0.106   0.060   -0.081   -0.005
Appendix

The Trace of a Square Matrix

The trace of an $n \times n$ matrix $A$ is the sum of its diagonal elements:
$$\mathrm{tr}(A) = \sum_{i=1}^n a_{ii}. \qquad (11)$$
Clearly, $\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$. Moreover, for $A$ of order $m \times n$ and $B$ of order $n \times m$,
$$\mathrm{tr}(AB) = \mathrm{tr}(BA) = \sum_{i=1}^m \sum_{j=1}^n a_{ij} b_{ji}. \qquad (12)$$
It follows from (12) that, for conformable matrices $A$, $B$, and $C$ (permutation rule),
$$\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB). \qquad (13)$$
The sum of squares of all elements $a_{ij}$ of an $m \times n$ matrix $A$ can be written as the trace of $A'A$:
$$\mathrm{tr}(A'A) = \sum_{i=1}^m \sum_{j=1}^n a_{ij}^2. \qquad (14)$$
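Properties (12)-(14) are easy to confirm numerically; a minimal sketch with arbitrary conformable matrices (shapes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
C = rng.normal(size=(5, 3))

# (12)/(13): tr(AB) = tr(BA) and the cyclic permutation rule for three factors.
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))

# (14): tr(A'A) equals the sum of squares of all elements of A.
assert np.isclose(np.trace(A.T @ A), np.sum(A**2))
```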
Eigenvalues and Eigenvectors

An eigenvalue (or root) of an $n \times n$ matrix $A$ is a real or complex scalar $\lambda$ satisfying the equation
$$Ax = \lambda x \qquad (15)$$
for some nonzero vector $x$, which is an eigenvector corresponding to $\lambda$. Note that an eigenvector is only determined up to a scalar multiple. Equation (15) can be written as $(A - \lambda I)x = 0$, which requires that the matrix $A - \lambda I$ be singular or, equivalently,
$$\det(A - \lambda I) = 0. \qquad (16)$$
As det(a λi), which is known as the characteristic polynomial of matrix A, is a polynomial of degree n in λ, an n n matrix has n eigenvalues (counting multiplicities). For illustration, consider the 2 2 matrix [ ] a11 a A = 12. a 21 a 22 Matrix A s characteristic equation is [ ] λ a11 a P (λ) = det(λi 2 A) = det 12 a 21 λ a 22 = (λ a 11 )(λ a 22 ) a 12 a 21 = λ 2 (a 11 + a 22 )λ + a 11 a 22 a 12 a 21 = λ 2 tr(a)λ + det A = 0, which is polynomial of degree 2 in λ, i.e., a quadratic. Thus, A has eigenvalues λ 1 2 = tr(a) ± tr(a) 2 4 det A. (17) 2 25
A general property is that the sum $\lambda_1 + \cdots + \lambda_n$ of the eigenvalues of an $n \times n$ matrix $A$ is equal to its trace, i.e.,
$$\mathrm{tr}(A) = \sum_{i=1}^n a_{ii} = \sum_{i=1}^n \lambda_i. \qquad (18)$$
For our example, from (17), it is directly observable that $\lambda_1 + \lambda_2 = a_{11} + a_{22} = \mathrm{tr}(A)$. In general, the eigenvalues of a matrix may be real or complex. However, for positive definite symmetric matrices (e.g., covariance matrices), we have the following results:

i) The eigenvalues of a positive definite matrix are positive. To see this, recall that, for such a matrix, $x'Ax > 0$ for all $x \neq 0$.
Then, using the definition of an eigenvalue, $0 < x'Ax = \lambda x'x$ for a positive definite matrix; since $x'x > 0$, it follows that $\lambda > 0$.

ii) The eigenvectors of any symmetric matrix are orthogonal if they correspond to different roots. To see this, write $\lambda_1$ and $\lambda_2$ ($\lambda_1 \neq \lambda_2$) for the two roots, and $x$ and $y$ for the corresponding eigenvectors:
$$Ax = \lambda_1 x \qquad (19)$$
$$Ay = \lambda_2 y. \qquad (20)$$
Premultiply (19) by $y'$ and (20) by $x'$. Since $A' = A$ for a symmetric matrix $A$, $x'Ay = y'Ax$, and it follows that
$$0 = y'Ax - x'Ay = (\lambda_1 - \lambda_2)x'y.$$
Hence $x'y = 0$.
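Results i) and ii) can be illustrated numerically (the positive definite matrix $M'M + I$ is an illustrative construction, not from the text):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(4, 4))
A = M.T @ M + np.eye(4)          # symmetric positive definite by construction

# i) All roots of a positive definite matrix are positive.
lam, X = np.linalg.eigh(A)
assert np.all(lam > 0)

# ii) Eigenvectors to distinct roots are orthogonal (X'X = I up to rounding).
assert np.allclose(X.T @ X, np.eye(4), atol=1e-10)
```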
iii) For any $n \times m$ matrix $A$, $A'A$ and $AA'$ have the same nonzero eigenvalues. (The number of nonzero eigenvalues is equal to the rank of $A$.) To see this, note that premultiplication by $A'$ shows that $(AA' - \lambda I)x = 0$ implies $(A'A - \lambda I)A'x = 0$.
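The equality of the nonzero roots of $A'A$ and $AA'$ can be confirmed as follows (dimensions and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(6, 3))                       # n x m with n != m

lam_big = np.sort(np.linalg.eigvalsh(A @ A.T))    # 6 roots, 3 of them ~ 0
lam_small = np.sort(np.linalg.eigvalsh(A.T @ A))  # 3 roots

assert np.allclose(lam_big[-3:], lam_small)       # nonzero roots coincide
assert np.allclose(lam_big[:3], 0.0, atol=1e-9)   # the remaining roots vanish
assert np.linalg.matrix_rank(A) == 3              # their count equals rank(A)
```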