Chapter 12. FACTOR ANALYSIS


Charles Spearman (1863-1945). British psychologist. Spearman was an officer in the British Army in India and, upon his return at the age of 40 and influenced by the work of Galton, he decided to do his doctoral thesis on the objective measurement of intelligence. He proposed the first factor analysis model, based on a common factor, the g factor, and a specific component. Upon receiving his PhD he was named full professor and occupied the first Chair in Psychology at University College, London.

12.1 INTRODUCTION

Factor analysis is used to explain the variability among a set of observed variables in terms of a small number of latent, or unobserved, variables called factors. For example, suppose that we take twenty body measurements from a person: height, length of torso and extremities, shoulder width, weight, etc. It is intuitive to think that these measurements are not independent of each other, and that if we know one of them we can predict the others with a small margin of error.

One explanation for this fact is that these measurements are determined by the same genes and are therefore highly correlated, so that if we know some of them we can predict the values of the other variables with small error. As a second example, suppose we are interested in studying human development around the world and that we have many economic, social and demographic variables available, all of them, in general, interdependent and related to development. We can ask ourselves whether the development of a country depends on a small number of factors such that, if we knew their values, we could predict the whole set of variables for each country. As a third example, we use different tests to measure the intellectual capacity of an individual to process information and solve problems. We can ask ourselves whether there are factors, not directly observable, which explain the set of observed results. The set of these factors is what we will call intelligence, and it is important to know how many different dimensions this concept has and how to characterize and measure them.

Factor analysis came about thanks to the interest of Karl Pearson and Charles Spearman in understanding the dimensions of human intelligence in the 1930s; as a result, many of its advances were produced in the area of psychometry.

Factor analysis is related to principal components, but there are certain differences. First, principal components are constructed to explain variances, whereas factors are constructed to explain the covariances or correlations between the variables. Second, principal components is a descriptive tool, while factor analysis assumes a formal statistical model. On the other hand, principal components can be seen as a particular case of factor analysis, as we will see later.

12.2 THE FACTOR MODEL

Basic Hypothesis

Suppose that we observe a vector of variables x, of dimension (p × 1), on the elements of a population. The factor analysis model establishes that this vector is generated by the equation

x = μ + Λf + u    (12.1)

where:

1. f is a vector (m × 1) of latent variables, or unobserved factors. We assume that it follows a distribution N_m(0, I), that is, the factors have zero mean, are independent of each other and have a normal distribution.

2. Λ is a matrix (p × m) of unknown constants (m < p). It contains the coefficients which describe how the factors, f, affect the observed variables, x, and is called the loading matrix.

3. u is a vector (p × 1) of unobserved perturbations. It includes the effect of all the variables, other than the factors, which influence x. We assume that u has a distribution N_p(0, ψ), where ψ is diagonal, and that the perturbations are uncorrelated with the factors f.
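To make the generating equation concrete, here is a minimal Python sketch that simulates data from (12.1) with uncorrelated standard normal factors and a diagonal perturbation covariance. The dimensions and all numerical values are illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 1000, 5, 2                      # sample size, observed variables, factors

Lambda = np.array([[0.9, 0.0],            # illustrative (p x m) loading matrix
                   [0.8, 0.1],
                   [0.7, 0.2],
                   [0.1, 0.8],
                   [0.0, 0.9]])
mu = np.zeros(p)                          # means of the observed variables
psi = np.diag([0.2, 0.3, 0.4, 0.3, 0.2])  # diagonal covariance of the perturbations

F = rng.standard_normal((n, m))                          # factors f ~ N_m(0, I)
U = rng.multivariate_normal(np.zeros(p), psi, size=n)    # perturbations u ~ N_p(0, psi)
X = mu + F @ Lambda.T + U                                # X = 1 mu' + F Lambda' + U

# The implied covariance matrix is V = Lambda Lambda' + psi (equation (12.2) below);
# the sample covariance of the simulated data should be close to it.
V = Lambda @ Lambda.T + psi
print(np.round(np.cov(X, rowvar=False) - V, 2))
```

Each row of X is one observation x_i′ generated as in (12.1).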

With these three hypotheses we deduce that: (a) μ is the mean of the variable x, since both the factors and the perturbations have zero mean; (b) x has a normal distribution, being a sum of normal variables, and letting V be its covariance matrix, x ~ N_p(μ, V).

Equation (12.1) implies that, given a random sample of n elements generated by the factor model, each piece of data x_{ij} can be written as

x_{ij} = μ_j + λ_{j1} f_{1i} + ... + λ_{jm} f_{mi} + u_{ij},    i = 1, ..., n;  j = 1, ..., p,

which decomposes x_{ij}, the value observed on individual i for variable j, as a sum of m + 2 terms. The first is the mean of variable j, μ_j; the terms from the second to the (m+1)-th contain the effects of the m factors; and the last is a perturbation specific to each observation, u_{ij}. The effects of the factors on x_{ij} are the products of the coefficients λ_{j1}, ..., λ_{jm}, which depend on the relationship between each factor and the variable j (and are the same for all the items in the sample), times the values of the m factors for the sampled item i, f_{1i}, ..., f_{mi}. Joining the equations for all of the observations, the data matrix X, of dimensions (n × p), can be written as

X = 1μ′ + FΛ′ + U

where 1 is an (n × 1) vector of ones, F is an (n × m) matrix which contains the m factors for the n items of the sample, Λ′ is the transpose of the loading matrix (m × p), whose constant coefficients relate the variables and the factors, and U is an (n × p) matrix of perturbations.

Properties

The loading matrix Λ contains the covariances between the factors and the observed variables. Note that the (p × m) covariance matrix between the variables and the factors is obtained by multiplying (12.1) by f′ on the right and taking expected values:

E[(x − μ)f′] = ΛE[ff′] + E[uf′] = Λ

since, by hypothesis, the factors are uncorrelated (E[ff′] = I), have zero mean, and are uncorrelated with the perturbations (E[uf′] = 0). This equation indicates that the terms λ_{ij} of the loading matrix Λ represent the covariance between the variable x_i and the factor f_j and, as the factors have unit variance, they are also the regression coefficients when we explain the observed variables by the factors. In the particular case in which the x variables are standardized, the terms λ_{ij} are also the correlations between the variables and the factors.

The covariance matrix of the observations verifies, according to (12.1), that

V = E[(x − μ)(x − μ)′] = ΛE[ff′]Λ′ + E[uu′]

since E[fu′] = 0, the factors and the noise being uncorrelated. Thus, we obtain the fundamental property:

V = ΛΛ′ + ψ.    (12.2)

which establishes that the covariance matrix of the observed data can be decomposed as the sum of two matrices:

(1) The first, ΛΛ′, is a symmetric matrix of rank m < p. This matrix contains the part which is common to the set of variables and depends on the covariances between the variables and the factors.

(2) The second, ψ, is diagonal, and contains the specific part of each variable, which is independent of the rest.

This decomposition implies that the variances of the observed variables can be written as

σ_i² = Σ_{j=1}^m λ_{ij}² + ψ_i²,    i = 1, ..., p,

where the first term is the sum of the effects of the factors and the second is the variance of the perturbation. Letting

h_i² = Σ_{j=1}^m λ_{ij}²

be the sum of the factor effects, which we will call the communality, we get

σ_i² = h_i² + ψ_i²,    i = 1, ..., p.    (12.3)

This equality can be interpreted as a decomposition of the variance into

Observed variance = Common variance (communality) + Specific variance,

which is analogous to the classical decomposition of variability into an explained part and an unexplained part, as carried out in the analysis of variance. In the factor model the explained part is due to the factors and the unexplained part is due to the noise. This equation is the basis for the analysis which follows.

Example: Let us assume that we have three variables generated by two factors. The covariance matrix must verify

[σ_11 σ_12 σ_13; σ_21 σ_22 σ_23; σ_31 σ_32 σ_33] = [λ_11 λ_12; λ_21 λ_22; λ_31 λ_32] [λ_11 λ_21 λ_31; λ_12 λ_22 λ_32] + [ψ_11 0 0; 0 ψ_22 0; 0 0 ψ_33].

This equality provides 6 different equations (remember that, since V is symmetric, it has only 6 different terms). The first is

σ_11 = λ_11² + λ_12² + ψ_11.

Letting h_1² = λ_11² + λ_12² be the contribution of the two factors to variable 1, the six equations are

σ_ii = h_i² + ψ_i²,    i = 1, 2, 3,
σ_ij = λ_i1 λ_j1 + λ_i2 λ_j2,    i, j = 1, 2, 3,  i ≠ j.

Uniqueness of the model

In the factor model neither the loading matrix, Λ, nor the factors, f, are observable. This poses a problem of indeterminacy: we will say that two representations (Λ, f) and (Λ*, f*) are equivalent if Λf = Λ*f*. This situation leads to two types of indeterminacy: (1) a set of data can be explained with the same accuracy with correlated or with uncorrelated factors; (2) the factors are not determined uniquely. Let us analyze these two indeterminacies.

To show the first, note that if H is any nonsingular matrix, the representation (12.1) can be written as

x = μ + ΛHH⁻¹f + u    (12.4)

and, letting Λ* = ΛH be the new loading matrix and f* = H⁻¹f the new factors,

x = μ + Λ*f* + u,    (12.5)

where the new factors f* now have distribution N(0, H⁻¹(H⁻¹)′) and, thus, they are correlated. Analogously, starting from correlated factors, f ~ N(0, V_f), we can always find an equivalent expression of the variables using a model with uncorrelated factors. To show this, let A be a matrix such that V_f = AA′ (this matrix always exists if V_f is positive definite); then A⁻¹V_f(A⁻¹)′ = I, and writing x = μ + (ΛA)(A⁻¹f) + u, and taking Λ* = ΛA as the new coefficient matrix of the factors and f* = A⁻¹f as the new factors, the model is equivalent to another with uncorrelated factors. This indeterminacy is resolved in the hypotheses of the model by always taking the factors to be uncorrelated.

The second type of indeterminacy appears because, if H is orthogonal, the models x = μ + Λf + u and x = μ + (ΛH)(H′f) + u are indistinguishable. Both contain uncorrelated factors with an identity covariance matrix. In this sense, we say that the factor model is indeterminate with respect to rotations. This indeterminacy is resolved by imposing restrictions on the components of the loading matrix, as we will see in the next section.

Example: We assume that x = (x_1, x_2, x_3)′ follows a factor model M1, x = Λf + u, with two uncorrelated factors f = (f_1, f_2)′. We are going to write this model as another, equivalent, model with uncorrelated factors. Take the orthogonal matrix

H = (1/√2) [1 1; 1 −1],

which satisfies H⁻¹ = H′ = H. Then

x = (ΛH)(H′f) + u.

Denoting this model by M2, it can also be written as

x = Λ*g + u,

where Λ* = ΛH and the new factors, g, are related to the previous ones, f, by

g = H′f = (1/√2) [1 1; 1 −1] f,

and are thus a rotation of the initial factors. We now check that these new factors are also uncorrelated. Their covariance matrix is

V_g = H′V_f H,

and if V_f = I then V_g = H′H = I, from which it is deduced that the models M1 and M2 are indistinguishable.

Normalization of the factor model

Since the factor model is indeterminate with respect to rotations, the matrix Λ is not identified. This implies that even if we observed the whole population, so that μ and V were known, we could not determine Λ uniquely. The solution is to impose restrictions on its terms. The two principal estimation methods which we will study next use one of the two following normalizations.

Criterion 1. Require:

Λ′Λ = D = diagonal.    (12.6)

With this normalization, the vectors which define the effect of each factor on the p observed variables are orthogonal. In this way, besides being uncorrelated, the factors produce the most distinct possible effects on the variables. We are going to prove that this normalization defines a loading matrix uniquely. First assume that we have a matrix Λ such that the product Λ′Λ is not diagonal. We transform the factors with Λ* = ΛH, where H is the matrix which contains, as columns, the eigenvectors of Λ′Λ. Then

Λ*′Λ* = H′Λ′ΛH    (12.7)

and, since H diagonalizes Λ′Λ, the matrix Λ* verifies condition (12.6). We now see that this is the only matrix which does so. Suppose that we rotate this matrix and let Λ** = Λ*C, where C is orthogonal. Then the matrix Λ**′Λ** = C′(Λ*′Λ*)C will not, in general, be diagonal. Analogously, if we start with a matrix which verifies (12.6) and we rotate it, it will no longer verify this condition.
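As a quick numerical illustration of the two previous points, the following Python sketch (with an arbitrary, made-up loading matrix) checks that an orthogonal rotation of the loadings leaves the implied covariance matrix ΛΛ′ + ψ unchanged, and that rotating with the eigenvectors of Λ′Λ produces a loading matrix satisfying Criterion 1.

```python
import numpy as np

rng = np.random.default_rng(1)
Lambda = rng.normal(size=(5, 2))                 # illustrative 5 x 2 loading matrix
psi = np.diag(rng.uniform(0.2, 0.5, size=5))     # illustrative specific variances

# Any orthogonal H gives an equivalent model: the implied covariance is unchanged.
theta = 0.7
H = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
V1 = Lambda @ Lambda.T + psi
V2 = (Lambda @ H) @ (Lambda @ H).T + psi
print(np.allclose(V1, V2))                       # True: rotations are not identified

# Criterion 1: rotate with the eigenvectors of Lambda' Lambda so that the new
# loading matrix Lambda* satisfies Lambda*' Lambda* = D (diagonal).
_, Hstar = np.linalg.eigh(Lambda.T @ Lambda)
Lambda_star = Lambda @ Hstar
print(np.round(Lambda_star.T @ Lambda_star, 6))  # diagonal matrix
```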

When this normalization holds, postmultiplying equation (12.2) by Λ allows us to write

(V − ψ)Λ = ΛD,

which means that the columns of Λ are eigenvectors of the matrix V − ψ, which has the diagonal terms of D as its eigenvalues. This property is used in the estimation by the principal factor method.

Criterion 2. Require:

Λ′ψ⁻¹Λ = D = diagonal.    (12.8)

In this normalization the effects of the factors on the variables, weighted by the variances of the perturbations of each equation, become uncorrelated. As before, this normalization defines a loading matrix uniquely. To show this, assume that Λ′ψ⁻¹Λ is not diagonal and transform with Λ* = ΛH. Then

Λ*′ψ⁻¹Λ* = H′(Λ′ψ⁻¹Λ)H    (12.9)

and, since Λ′ψ⁻¹Λ is a symmetric and non-negative definite matrix, it can always be diagonalized if we choose as H the matrix which contains the eigenvectors of Λ′ψ⁻¹Λ in its columns. Analogously, if (12.8) is verified from the beginning and we rotate the loading matrix, this condition is no longer verified. This is the normalization used in maximum likelihood estimation. Its justification is that in this way the factors are conditionally independent given the data, as is shown in the Appendix. With this normalization, postmultiplying equation (12.2) by ψ⁻¹Λ we get

Vψ⁻¹Λ = Λ(D + I),

and premultiplying by ψ^{-1/2} the result is

ψ^{-1/2}Vψ⁻¹Λ = ψ^{-1/2}Λ(D + I),

which implies

(ψ^{-1/2}Vψ^{-1/2})(ψ^{-1/2}Λ) = ψ^{-1/2}Λ(D + I),

and we conclude that the matrix ψ^{-1/2}Vψ^{-1/2} has eigenvectors ψ^{-1/2}Λ with eigenvalues given by the diagonal of D + I. This property is used in maximum likelihood estimation.

Maximum number of factors

If we replace the theoretical covariance matrix, V, in (12.2) with the sample matrix, S, the system will be identified if it is possible to solve it uniquely. For this there is a restriction on the number of possible factors. The number of equations which we obtain from (12.2) is equal to the number of distinct terms of S, which is p + p(p − 1)/2 = p(p + 1)/2. The number of unknowns on the right-hand side is pm, the coefficients of the matrix Λ, plus the p terms of the diagonal of ψ, minus the restrictions imposed in order to identify the matrix Λ. Requiring Λ′ψ⁻¹Λ to be diagonal imposes m(m − 1)/2 restrictions on the terms of Λ. For the system to be determined there must be at least as many equations as unknowns. If there are fewer equations than unknowns it is impossible to find a single solution and the model is not identified. If the number of equations is exactly equal to the number of unknowns there will be a single solution. If there are more equations than unknowns, we can solve the system by least squares, finding the values of the parameters which minimize the estimation errors. Therefore we require

p + pm − m(m − 1)/2 ≤ p(p + 1)/2,

which is equivalent to

p + m ≤ p² − 2pm + m²,  that is,  (p − m)² ≥ p + m.

The reader can prove that this inequality implies that when p is not large (less than 10) then, approximately, the maximum number of factors must be less than half the number of variables minus one. For example, the maximum number of factors with 7 variables is 3.
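The bound (p − m)² ≥ p + m is easy to tabulate. The following small Python helper (the function name is our own) returns the largest admissible m for a given p:

```python
def max_factors(p: int) -> int:
    """Largest m with (p - m)**2 >= p + m, the identification bound derived above."""
    m = 0
    while (p - (m + 1)) ** 2 >= p + (m + 1):
        m += 1
    return m

for p in range(3, 11):
    print(p, max_factors(p))
# e.g. p = 7 gives 3, in line with the rule of thumb stated in the text.
```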

12.3 THE PRINCIPAL FACTOR METHOD

The principal factor method is a method for estimating the loading matrix based on principal components. It avoids the need to solve the maximum likelihood equations, which are more complex, and it has the advantage that the dimension of the system can be identified approximately. Because of its simplicity it is used in many computer programs. Its basis is the following: suppose that we can obtain an initial estimate of the covariance matrix of the perturbations, ψ̂. Then we can write

S − ψ̂ = ΛΛ′,    (12.10)

and, as S − ψ̂ is symmetric, it can always be decomposed as

S − ψ̂ = HGH′ = (HG^{1/2})(HG^{1/2})′    (12.11)

where H is square of order p and orthogonal, and G, also of order p, is diagonal and contains the eigenvalues of S − ψ̂. The factor model establishes that G must be diagonal of the form

G = [ G_1  O ; O  O ],

where G_1 is (m × m) and the zero blocks have the appropriate dimensions, since S − ψ̂ has rank m. Thus, if we let H_1 be the (p × m) matrix which contains the eigenvectors associated with the non-null eigenvalues in G_1, we can build an estimator of Λ as the (p × m) matrix

Λ̂ = H_1 G_1^{1/2}.    (12.12)

Note that the resulting normalization is

Λ̂′Λ̂ = G_1^{1/2} H_1′H_1 G_1^{1/2} = G_1 = diagonal    (12.13)

since the eigenvectors of a symmetric matrix are orthogonal, so that H_1′H_1 = I_m. Therefore, with this method we obtain an estimate Λ̂ whose columns are orthogonal to each other.

In practice, the estimation is carried out iteratively as follows:

1. Start (i = 1) from an initial estimate Λ̂_i and compute ψ̂_i = diag(S − Λ̂_i Λ̂_i′).

2. Calculate the square, symmetric matrix Q_i = S − ψ̂_i.

3. Obtain the spectral decomposition of Q_i, Q_i = H_{1i} G_{1i} H_{1i}′ + H_{2i} G_{2i} H_{2i}′, where G_{1i} contains the m greatest eigenvalues of Q_i and H_{1i} their eigenvectors. We choose m so that the remaining eigenvalues, contained in G_{2i}, are all small and of similar size. The matrix Q_i may not be positive definite and some of its eigenvalues can be negative. This is not a serious problem if these eigenvalues are small and we can take them to be near zero.

4. Take Λ̂_{i+1} = H_{1i} G_{1i}^{1/2} and go back to (1). Iterate until convergence is reached, that is, until ||Λ̂_{n+1} − Λ̂_n|| < ε.

The estimators obtained will be consistent, but not efficient as in the case of maximum likelihood. Neither are they invariant to linear transformations, as the ML estimators are; that is, the same results are not necessarily obtained with the covariance matrix as with the correlation matrix.
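The following Python sketch implements the iteration just described (steps 1 to 4). It is a minimal version for illustration: the initial specific variances are taken as 1/s^{jj} (the second of the two options discussed below), negative eigenvalues of Q_i are truncated at zero, and a simple sign convention is imposed on the eigenvectors so that the convergence criterion ||Λ̂_{n+1} − Λ̂_n|| < ε is meaningful.

```python
import numpy as np

def principal_factor(S, m, tol=1e-4, max_iter=200):
    """Principal factor estimation of the loadings (p x m) and specific variances (p,)."""
    p = S.shape[0]
    psi = 1.0 / np.diag(np.linalg.inv(S))          # initial specific variances: 1 / s^{jj}
    Lam = np.zeros((p, m))
    for _ in range(max_iter):
        Q = S - np.diag(psi)                       # step 2: reduced covariance matrix
        vals, vecs = np.linalg.eigh(Q)             # step 3: spectral decomposition
        idx = np.argsort(vals)[::-1][:m]           # keep the m largest eigenvalues
        G1 = np.clip(vals[idx], 0.0, None)         # small negative eigenvalues -> 0
        Lam_new = vecs[:, idx] * np.sqrt(G1)       # step 4: Lambda = H1 G1^{1/2}
        signs = np.sign(Lam_new[np.abs(Lam_new).argmax(axis=0), np.arange(m)])
        Lam_new = Lam_new * signs                  # fix the sign of each column
        if np.linalg.norm(Lam_new - Lam) < tol:    # convergence criterion
            return Lam_new, psi
        Lam = Lam_new
        psi = np.diag(S - Lam @ Lam.T).copy()      # step 1: update specific variances
    return Lam, psi
```

For example, Lam, psi = principal_factor(S, m=1) gives the kind of single-factor fit worked through by hand in the ACCIONES example below.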

To put this idea into practice we have to specify how to obtain the initial estimate Λ̂_1 or ψ̂_1, a problem which is known as communality estimation.

Communality estimation

Estimating the terms ψ_i² is equivalent to defining values for the diagonal terms, h_i², of ΛΛ′, since h_i² = s_i² − ψ̂_i². The following alternatives are used:

1. Take ψ̂_i = 0. This is equivalent to extracting the principal components of S. It amounts to taking ĥ_i² = s_i² (in the case of correlations, ĥ_i² = 1), which is clearly its maximum value, so that we may begin with a significant bias.

2. Take ψ̂_j² = 1/s^{jj}, where s^{jj} is the j-th diagonal element of the precision matrix S⁻¹. According to Appendix 3.2, this is equivalent to taking ĥ_j² as

ĥ_j² = s_j² − s_j²(1 − R_j²) = s_j² R_j²,    (12.14)

where R_j² is the multiple correlation coefficient between x_j and the rest of the variables. Intuitively, the greater R_j² is, the greater the communality ĥ_j². With this method we start with an estimate of h_i² that is biased downwards, since ĥ_i² ≤ h_i². To show this, suppose, for example, that the true model for the variable x_1 is

x_1 = Σ_{j=1}^m λ_{1j} f_j + u_1,    (12.15)

which is associated with the decomposition σ_1² = h_1² + ψ_1². The proportion of explained variance is h_1²/σ_1². If we write the regression equation

x_1 = b_2 x_2 + ... + b_p x_p + ε_1

and replace each variable with its expression in terms of the factors, we have

x_1 = b_2(Σ_j λ_{2j} f_j + u_2) + ... + b_p(Σ_j λ_{pj} f_j + u_p) + ε_1,    (12.16)

which leads to a decomposition of the variance σ_1² = ĥ_1² + ψ̂_1². Clearly ĥ_1² ≤ h_1², since in (12.16) the noises u_2, ..., u_p of the other equations are forced to enter together with the factors, instead of the factors alone as in (12.15). Moreover, it is possible that a factor affects x_1 but not the rest of the variables, so that it will not appear in equation (12.16). To summarize, the communality estimated from (12.16) is a lower bound for the true value of the communality.
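The equivalence between the two forms of the second option, ψ̂_j = 1/s^{jj} and ĥ_j² = s_j² R_j², can be checked numerically. The sketch below uses arbitrary simulated data and regresses each variable on the others to obtain R_j²; everything in it is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))  # arbitrary correlated data
S = np.cov(X, rowvar=False)

# Option 2: psi_j = 1 / s^{jj}, with s^{jj} the j-th diagonal element of S^{-1}
psi0 = 1.0 / np.diag(np.linalg.inv(S))
h2_0 = np.diag(S) - psi0                  # initial communalities h_j^2 = s_j^2 - psi_j

# Equivalent form h_j^2 = s_j^2 * R_j^2, with R_j^2 the squared multiple correlation
# of x_j on the remaining variables (computed here by ordinary least squares).
R2 = np.empty(4)
for j in range(4):
    y = X[:, j] - X[:, j].mean()
    Z = np.delete(X, j, axis=1)
    Z = Z - Z.mean(axis=0)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    R2[j] = 1.0 - np.sum((y - Z @ beta) ** 2) / np.sum(y ** 2)

print(np.allclose(h2_0, np.diag(S) * R2))  # True: the two expressions coincide
```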

Example: In this example we show in detail the iterations of the principal factor algorithm for the ACCIONES data in Annex I, working with the covariance matrix S of the data in logarithms. In order to estimate the loading matrix we carry out the steps of the principal factor algorithm described above. Before starting the algorithm we need to set the bound used to decide convergence. We take a large ε, 0.05, so that the algorithm converges in few iterations despite the accumulated rounding errors.

Step 1. Taking the second alternative for the initial estimation of the communalities, diag(ψ̂_i²) = 1/s^{jj}, where s^{jj} is the j-th diagonal element of the matrix S⁻¹.

Step 2. We calculate the square, symmetric matrix Q_i = S − ψ̂_i.

Step 3. Spectral decomposition of Q_i and separation into the two terms H_{1i}G_{1i}H_{1i}′ and H_{2i}G_{2i}H_{2i}′. The eigenvalues of Q_i are 0.379, 0.094 and a third, negative one; we observe that the matrix is therefore not positive definite. Since one eigenvalue is much larger than the rest, we take a single factor.

Step 4. We calculate Λ̂_{i+1} = H_{1i}G_{1i}^{1/2}. This is the first estimate of the loading matrix. We are going to iterate to improve this estimate and, to do so, we return to Step 1.

Step 1. We estimate the terms of the diagonal of ψ̂_i using ψ̂_i = diag(S − Λ̂Λ̂′).

Step 2. We calculate the square, symmetric matrix Q_i = S − ψ̂_i.

Step 3. Spectral decomposition of Q_i = H_{1i}G_{1i}H_{1i}′ + H_{2i}G_{2i}H_{2i}′.

Step 4. We calculate Λ̂_{i+1} = H_{1i}G_{1i}^{1/2} and check whether the convergence criterion ||Λ̂_{n+1} − Λ̂_n|| < ε is fulfilled. The norm of the difference is 0.106 > ε, so we return to Step 1 and iterate until the criterion is fulfilled.

Step 1. We estimate again ψ̂_i = diag(S − Λ̂Λ̂′).

Step 2. We calculate the square, symmetric matrix Q_i = S − ψ̂_i.

Step 3. Spectral decomposition of Q_i; we indicate only the first eigenvector and eigenvalue, the remainder going into H_{2i}G_{2i}H_{2i}′.

Step 4. We calculate Λ̂_{i+1} = H_{1i}G_{1i}^{1/2} and check whether the convergence criterion ||Λ̂_{n+1} − Λ̂_n|| < ε is fulfilled. The norm of the difference is now 0.05, so the convergence criterion has been fulfilled and we retain the model with the estimated parameters, x = Λ̂f + u with u ~ N(0, ψ̂). We see that the equation of the factor obtained is quite different from the first principal component obtained in exercise 5.1.

Example: For the INVEST database a descriptive analysis was carried out in Chapter 4, in which a logarithmic transformation of all the variables and the elimination of the US data were proposed. Using this set of data, once standardized, we illustrate the calculation of a single factor by the principal factor method (in the following example 2 factors are considered). We compare the two methods proposed above for initializing the algorithm with the standardized data. In the first case we start the iterations with ψ̂_j = 0, that is, ĥ²(0) = 1, and the number of iterations needed before converging is 6. The stopping criterion at step k of the algorithm is, in this case, that the maximum difference between the communalities at steps k and k − 1 be less than a small fixed bound. The following table shows the estimates of the communalities for steps i = 0, 1, 2, 3, 6 (rows: INTER.A, INTER.B, AGRIC, BIOLO, MEDIC, CHEMIS, ENGIN, PHYSIC; columns: ĥ²(0), ĥ²(1), ĥ²(2), ĥ²(3), ĥ²(6)).

The final column, ĥ²(6), gives the result once the algorithm has converged. If we begin the algorithm with the second method, ψ̂_j = 1 − R_j², so that ĥ²(0) = R_j², the number of iterations before convergence is 5. A second table, with the same layout, shows how the estimates of the communalities vary over the steps i = 0, 1, 2, 3, 5 for the same eight variables, the final column giving the result once convergence has been reached. Having initiated the algorithm at the point nearest to the final solution, the convergence is faster, and by the second iteration the result is already quite close to the final one. We can also see that the initial estimate of the communalities, ĥ²(0), is an upper bound for the final estimate, ĥ²(5). A third table presents the estimate Λ̂(0) from which we started in each of the two methods, together with the final loadings obtained. The second method gives a Λ̂(0) closer to the final result, especially for those variables where the specific variability is greater.

Generalizations

The principal factor method is a procedure for minimizing the function

F = tr[(S − ΛΛ′ − ψ)²].    (12.17)

Note that this function can be written as

F = Σ_{i=1}^p Σ_{j=1}^p (s_{ij} − v_{ij})²    (12.18)

where the v_{ij} are the elements of the matrix V = ΛΛ′ + ψ. However, using the spectral decomposition, given a square, symmetric and non-negative definite matrix S, the best least squares approximation in the sense of (12.18) by a matrix AA′ of rank m is obtained by taking A = HD^{1/2}, where H contains the eigenvectors associated with the m largest eigenvalues of S and D^{1/2} the square roots of those eigenvalues (see Appendix 5.2), which is what the principal factor method does. Harman (1976) developed the MINRES algorithm, which minimizes (12.17) more efficiently than the principal factor method, and Joreskog (1976) proposed ULS (unweighted least squares), which is based on differentiating (12.17), obtaining Λ̂ as a function of ψ, and then minimizing the resulting function with a Newton-Raphson type non-linear algorithm.
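Minimizing (12.17) can also be handed to a generic numerical optimizer. The sketch below is neither Harman's MINRES nor Joreskog's ULS algorithm; it simply passes the unweighted least squares objective to scipy.optimize.minimize, with the specific variances parameterized on the log scale to keep them positive. Function names and starting values are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

def uls_factor(S, m):
    """Minimize tr[(S - Lambda Lambda' - psi)^2] over loadings and specific variances."""
    p = S.shape[0]

    def objective(theta):
        Lam = theta[:p * m].reshape(p, m)
        psi = np.exp(theta[p * m:])              # positive specific variances
        R = S - Lam @ Lam.T - np.diag(psi)
        return np.sum(R ** 2)                    # = tr[(S - V)^2] for symmetric S - V

    # Start from the principal-component solution of S.
    vals, vecs = np.linalg.eigh(S)
    Lam0 = vecs[:, -m:] * np.sqrt(vals[-m:])
    psi0 = np.log(np.maximum(np.diag(S - Lam0 @ Lam0.T), 1e-3))
    theta0 = np.concatenate([Lam0.ravel(), psi0])
    res = minimize(objective, theta0, method="L-BFGS-B")
    Lam = res.x[:p * m].reshape(p, m)
    return Lam, np.exp(res.x[p * m:])
```

Note that the loadings returned this way are only defined up to an orthogonal rotation, as discussed in the normalization section.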

Example: With the INVEST data used in the previous example, we present the factor analysis with two factors performed with a computer program using the principal factor method. Table 12.1 indicates the variability explained by both factors. The second factor explains little variability (2%) but has been included because of its clear interpretation.

Table 12.1: Variability explained by the first two factors estimated using the principal factor method.

The principal factor algorithm begins with ψ̂_j = 1 − R_j², and 14 iterations are performed before converging to the loadings presented in Table 12.2, which gives the loadings of Factor 1 and Factor 2 and the specific variances ψ_i² for the variables INTER.A, INTER.B, AGRIC, BIOLO, MEDIC, CHEMIS, ENGIN and PHYSIC.

Table 12.2: Loading matrix of the factors and communalities.

The first factor is the sum of the publications in all the databases, and it gives an idea of volume: according to this factor the countries would be ordered by their scientific output. The second factor contrasts biomedical research with technological research. This second component separates Japan and the UK, countries with heavy scientific output. Figure 12.1 shows the distribution of the countries over these two factors. The reader should compare these results with those obtained in Chapter 5 (exercises 5.6 and 5.10) with principal components.

Figure 12.1: Representation of the countries on the plane formed by the first two factors.

12.4 MAXIMUM LIKELIHOOD ESTIMATION

ML estimation of parameters

Direct approach

The matrices of parameters can be formally estimated by maximum likelihood. The density function of the original observations is N_p(μ, V); therefore the likelihood is the one we saw in Chapter 10. Replacing μ with its estimator, the sample mean x̄, the support (log-likelihood) function for V is

L(V|X) = −(n/2) log|V| − (n/2) tr(SV⁻¹),    (12.19)

and replacing V by (12.2), the support function for Λ and ψ is

L(Λ, ψ) = −(n/2) [log|ΛΛ′ + ψ| + tr(S(ΛΛ′ + ψ)⁻¹)].    (12.20)

The maximum likelihood estimators are obtained by maximizing (12.20) with respect to the matrices Λ and ψ. Taking derivatives with respect to these matrices and after certain algebraic manipulations, which are shown in Appendix 12.1 (see also Anderson, 1984, or Lawley and Maxwell, 1971), the following equations are obtained:

ψ̂ = diag(S − Λ̂Λ̂′),    (12.21)

[ψ̂^{-1/2}(S − ψ̂)ψ̂^{-1/2}](ψ̂^{-1/2}Λ̂) = (ψ̂^{-1/2}Λ̂)D,    (12.22)

Λ̂′ψ̂⁻¹Λ̂ = D = diagonal,    (12.23)

where D is the normalization matrix. These three equations allow us to solve the system by a Newton-Raphson type iterative algorithm. The numerical solution is difficult to find because there may be no solution in which ψ̂ is positive definite, and it is then necessary to turn to estimation with restrictions. We observe that (12.22) is an eigenvalue equation: it tells us that ψ̂^{-1/2}Λ̂ contains eigenvectors of the symmetric matrix ψ̂^{-1/2}(S − ψ̂)ψ̂^{-1/2} and that D contains the eigenvalues. The iterative algorithm for solving these equations is:

1. Start with an initial estimate. If we have an estimate Λ̂_i (i = 1 the first time), from the principal factor method for example, the matrix ψ̂_i is calculated using ψ̂_i = diag(S − Λ̂_iΛ̂_i′). Alternatively, we can estimate the matrix ψ̂_i directly as in the principal factor method.

2. The square, symmetric matrix A_i is calculated as A_i = ψ̂_i^{-1/2}(S − ψ̂_i)ψ̂_i^{-1/2} = ψ̂_i^{-1/2}Sψ̂_i^{-1/2} − I. This matrix weights the terms of S by their importance in terms of the specific components.

3. The spectral decomposition of A_i is obtained, A_i = H_{1i}G_{1i}H_{1i}′ + H_{2i}G_{2i}H_{2i}′, where the m greatest eigenvalues of A_i are in the (m × m) diagonal matrix G_{1i}, the p − m smallest are in G_{2i}, and H_{1i} and H_{2i} contain the corresponding eigenvectors.

4. Take Λ̂_{i+1} = ψ̂_i^{1/2}H_{1i}G_{1i}^{1/2} and substitute it in the likelihood function, which is then maximized with respect to ψ. This part is easy to do with a non-linear optimization algorithm. With this result, go back to (2), iterating until convergence is reached.

It is possible for this algorithm to converge to a maximum point where some of the terms of the matrix ψ are negative. Such an inadmissible solution is sometimes called a Heywood solution. Existing programs change these values to positive numbers and attempt to find another maximum point, although the algorithm does not always converge.
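A compact Python sketch of this iteration is given below. It follows steps 1 to 4 literally, doing the inner maximization over ψ in step 4 with a generic numerical optimizer; it is only an illustration (no safeguards against Heywood solutions, and the helper names are ours), not a production routine.

```python
import numpy as np
from scipy.optimize import minimize

def ml_factor(S, m, n_iter=50):
    """Iterative ML estimation: eigen-step for Lambda given psi, then maximize over psi."""
    p = S.shape[0]
    psi = 1.0 / np.diag(np.linalg.inv(S))            # initial specific variances

    def neg_support(log_psi, Lam):
        # minus the support (12.20), up to the factor n/2
        V = Lam @ Lam.T + np.diag(np.exp(log_psi))
        _, logdet = np.linalg.slogdet(V)
        return logdet + np.trace(S @ np.linalg.inv(V))

    Lam = None
    for _ in range(n_iter):
        d = 1.0 / np.sqrt(psi)
        A = S * np.outer(d, d) - np.eye(p)           # step 2: psi^{-1/2}(S - psi)psi^{-1/2}
        vals, vecs = np.linalg.eigh(A)               # step 3: spectral decomposition
        idx = np.argsort(vals)[::-1][:m]
        G1 = np.clip(vals[idx], 0.0, None)
        Lam = np.sqrt(psi)[:, None] * vecs[:, idx] * np.sqrt(G1)  # step 4: psi^{1/2} H1 G1^{1/2}
        res = minimize(neg_support, np.log(psi), args=(Lam,))     # maximize (12.20) over psi
        psi = np.exp(res.x)
    return Lam, psi
```

In practice one would also monitor the support function and watch for Heywood cases, in which some entries of ψ̂ collapse towards zero.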

Appendix 12.1 shows the proof that the ML estimation is invariant to linear transformations of the variables. Therefore, the result of the estimation does not depend (as it does with principal components) on whether the covariance or the correlation matrix is used. An additional advantage of maximum likelihood is that we can obtain asymptotic variances of the estimators using the information matrix at the optimum.

We see that when the diagonal terms of the matrix ψ̂ are approximately equal, the ML estimation will lead to results similar to those of the principal factor method. Note that, substituting ψ̂ = kI in the ML estimating equations, both methods use the same normalization and equation (12.22) becomes analogous to (12.11), which is the one solved in the principal factor method.

The EM algorithm

An alternative procedure for maximizing the likelihood is to consider the factors as missing values and to apply the EM algorithm. The joint likelihood function of the data and the factors can be written as

f(x_1, ..., x_n, f_1, ..., f_n) = f(x_1, ..., x_n | f_1, ..., f_n) f(f_1, ..., f_n).

The support for the whole sample is

L(ψ, Λ | X, F) = −(n/2) log|ψ| − (1/2) Σ_i (x_i − Λf_i)′ψ⁻¹(x_i − Λf_i) − (1/2) Σ_i f_i′f_i,    (12.24)

where we assume that the mean of the variables x_i is zero, which is equivalent to replacing the mean by its sample estimator. We observe that, given the factors, the estimation of Λ could be done by regression. On the other hand, given the parameters, we could estimate the factors, as we will see later. In order to apply the EM algorithm we need:

(1) M step: maximize the complete likelihood with respect to Λ and ψ, assuming that the values f_i of the factors are known. This is easy to do, since the rows of Λ are obtained by regressing each variable on the factors, and the diagonal elements of ψ are the residual variances of these regressions.

(2) E step: calculate the expectation of the complete likelihood with respect to the distribution of the f_i given the parameters. Developing (12.24), it can be shown that the expressions which appear in the likelihood are the covariance matrix of the factors and the covariance matrix between the factors and the data. The details of this estimation can be seen in Bartholomew and Knott (1999, p. 49).

Other estimation methods

Because the maximum likelihood method is complicated, other approximate methods have been proposed which give estimators with similar asymptotic properties but simpler calculations. One of these is generalized least squares, which we present next. To justify it, we observe that the ML estimation can be reinterpreted as follows: if there are no restrictions on V, the ML estimator of this matrix is S, and substituting this estimate in (12.19) the support function at its maximum is

−(n/2) log|S| − np/2.

Maximizing the support function is equivalent to minimizing, with respect to V, the discrepancy function obtained by subtracting the support (12.19) from its maximum value. The function obtained from this difference is

F = (n/2) [tr(SV⁻¹) − p − log|SV⁻¹|],

which indicates that we want to make V as close to S as possible, measuring the distance between the two matrices through the trace and the determinant of the product SV⁻¹. We observe that, since V is estimated with restrictions, |SV⁻¹| ≤ 1 and the logarithm will be negative or null. If we concentrate on the first two terms and drop the determinant, the function to be minimized is

F_1 = tr(SV⁻¹) − p = tr(SV⁻¹ − I) = tr[(S − V)V⁻¹],

which computes the differences between the observed matrix S and the estimated V, but gives each difference a weight which depends on the size of V⁻¹. This leads to the idea of generalized least squares (GLS), where we minimize

tr{[(S − V)V⁻¹]²},

and it can be proved that if we iterate the GLS procedure we obtain asymptotically efficient estimators.

Example: We are going to illustrate the ML estimation for the INVEST data. Assuming two factors, we obtain the results shown in the following tables: Table 12.3 gives the variability explained by each factor, and Table 12.4 gives the loadings of Factor 1 and Factor 2 and the specific variances ψ_i² for the variables INTER.A, INTER.B, AGRIC, BIOLO, MEDIC, CHEMIS, ENGIN and PHYSIC.

Table 12.3: Variability explained by the first two factors estimated using maximum likelihood.

Table 12.4: Loading matrix of the factors.

If we compare these results with those obtained using the principal factor method (exercise 12.5) we see that the first factor is similar, although it increases the weight of physics and there are relative differences between the weights of the variables. The second factor shows more changes, but its interpretation is similar as well. The variances of the specific components show few changes between the two approaches. Figures 12.2 and 12.3 show the weights and the projection of the data onto the plane of the factors.

Figure 12.2: Weights of the INVEST variables in the two factors using ML estimation.

12.5 DETERMINING THE NUMBER OF FACTORS

Likelihood test

Suppose that a model with m factors has been estimated. The test of whether the decomposition holds can be set up as a likelihood ratio test:

H_0: V = ΛΛ′ + ψ    versus    H_1: V ≠ ΛΛ′ + ψ.

This test is similar to the partial sphericity test which we studied in Chapter 10, although there are differences due to the fact that here we do not require the specific components to have equal variances. Let V̂_0 be the covariance matrix of the data estimated under H_0. Then the likelihood ratio statistic is

λ = 2(ln L(H_1) − ln L(H_0)).

Figure 12.3: Projection of the countries onto the plane of the two factors using ML estimation.

By (12.19), the likelihood function under H_0 is

ln L(H_0) = −(n/2) log|V̂_0| − (n/2) tr(SV̂_0⁻¹),    (12.25)

whereas under H_1 the estimator of V is S and we have

ln L(H_1) = −(n/2) log|S| − (n/2) tr(SS⁻¹) = −(n/2) log|S| − np/2.    (12.26)

The likelihood ratio can be written using these two expressions as

λ = n (log|V̂_0| + tr(SV̂_0⁻¹) − log|S| − p).

It can be proved (Appendix 12.1) that the ML estimator of V under H_0, V̂_0, minimizes the distance to S measured with the trace, that is,

tr(SV̂_0⁻¹) = p;    (12.27)

therefore, the likelihood ratio is

λ = n log(|V̂_0| / |S|)    (12.28)

and thus measures the distance between V̂_0 and S in terms of the determinant, −n log|SV̂_0⁻¹|, which is the second term of the likelihood.

The test rejects H_0 when λ is greater than the 1 − α percentile of a χ²_g distribution with g degrees of freedom, given by g = dim(H_1) − dim(H_0). The dimension of the parameter space of H_1 is p + p(p − 1)/2 = p(p + 1)/2, equal to the number of distinct elements of V. The dimension of H_0 is pm, given by the matrix Λ, plus the p elements of ψ, minus the m(m − 1)/2 restrictions resulting from the condition that Λ′ψ⁻¹Λ must be diagonal. Therefore:

g = p + p(p − 1)/2 − pm − p + m(m − 1)/2 = (1/2)[(p − m)² − (p + m)].    (12.29)

Bartlett (1954) showed that the asymptotic χ² approximation improves in finite samples if a correction factor is introduced. With this modification, the test rejects H_0 if

(n − 1 − (2p + 4m + 5)/6) ln(|Λ̂Λ̂′ + ψ̂| / |S|) > χ²_{[(p−m)² − (p+m)]/2}(1 − α).    (12.30)

Generally, this test is applied sequentially: the model is estimated with a small value, m = m_1 (which can be m_1 = 1), and H_0 is tested. If it is rejected, we re-estimate with m = m_1 + 1, continuing until H_0 is accepted. An alternative procedure, proposed by Joreskog (1993), which works better against moderate deviations from normality, is the following: calculate the statistic (12.30) for m = 1, ..., m_max. Let X²_1, ..., X²_{m_max} be its values and g_1, ..., g_{m_max} their degrees of freedom. We calculate the differences X²_m − X²_{m+1} and treat them as values of a χ² with g_m − g_{m+1} degrees of freedom. If the value obtained is significant we increase the number of factors, and we proceed in this way until no significant improvement in the fit of the model is found.
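The sequential test is straightforward to code. The sketch below computes, for each m, the Bartlett-corrected statistic (12.30) from ML estimates of Λ and ψ and the sample covariance S, together with its degrees of freedom and p-value. The fitting routine is passed in as an argument; for instance, the illustrative ml_factor sketch from the previous section could be used.

```python
import numpy as np
from scipy.stats import chi2

def lr_number_of_factors(S, n, m_max, fit):
    """Bartlett-corrected likelihood ratio statistics for m = 1, ..., m_max factors.

    fit(S, m) must return (loadings, specific variances), e.g. the ml_factor sketch above.
    """
    p = S.shape[0]
    results = []
    for m in range(1, m_max + 1):
        Lam, psi = fit(S, m)                        # ML estimates for m factors
        V0 = Lam @ Lam.T + np.diag(psi)
        correction = n - 1 - (2 * p + 4 * m + 5) / 6.0
        stat = correction * np.log(np.linalg.det(V0) / np.linalg.det(S))
        g = ((p - m) ** 2 - (p + m)) // 2           # degrees of freedom (12.29)
        results.append((m, stat, g, 1 - chi2.cdf(stat, g)))
    return results
```

The output would be scanned for the smallest m whose p-value exceeds the chosen significance level, or successive statistics compared as in Joreskog's procedure.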

The test (12.28) has an interesting interpretation. The factor model establishes that the difference between the covariance matrix, S, of order (p × p), and a diagonal matrix of rank p, ψ, is approximately a symmetric matrix of rank m, ΛΛ′, that is,

S − ψ̂ ≈ Λ̂Λ̂′.

Premultiplying and postmultiplying by ψ̂^{-1/2} we get that the matrix A, given by

A = ψ̂^{-1/2}Sψ̂^{-1/2} − I,    (12.31)

must be asymptotically equal to the matrix

B = ψ̂^{-1/2}Λ̂Λ̂′ψ̂^{-1/2},    (12.32)

and therefore have rank m asymptotically, instead of rank p. It is shown in Appendix 12.2 that the test (12.28) is equivalent to checking whether the matrix A has rank m, which must be asymptotically true by (12.32), and that the statistic (12.28) can be written as

λ = n Σ_{i=m+1}^p log(1 + d_i)    (12.33)

where the d_i are the p − m smallest eigenvalues of the matrix A. The null hypothesis is rejected if λ is too large compared with the χ² distribution with (1/2)[(p − m)² − p − m] degrees of freedom. In Appendix 12.2 it is shown that this test is a particular case of the likelihood test of partial sphericity of a matrix presented in Chapter 10.

When the sample size is large and m is small compared to p, if the data do not follow a multivariate normal distribution the test generally leads to a rejection of H_0. This is a frequent problem in hypothesis testing with large samples, where we tend to reject H_0. Therefore, when deciding on the number of factors it is necessary to distinguish between practical significance and statistical significance, just as with any hypothesis test. This test is very sensitive to deviations from normality, so in practice the statistic (12.28) is used more as a measure of the fit of the model than as a formal test.

Selection criteria

An alternative to selecting the number of factors by a test is to treat the problem as a choice between models. We then estimate the factor model for different numbers of factors, calculate the support function at its maximum for each model and, applying the Akaike criterion, choose the model for which

AIC(m) = −2L(H_{0,m}) + 2n_p

is minimum. In this expression L(H_{0,m}) is the support function for the model with m factors, evaluated at the ML estimators, which is given by (12.25), and n_p is the number of parameters in the model. We observe that this expression takes into account that, by increasing m, the likelihood L(H_{0,m}) increases, or the deviance −2L(H_{0,m}) decreases, but this effect is balanced by the number of parameters, which enters as a penalty. Using the likelihood (12.20) and the condition (12.27), and since n_p = mp + p − m(m − 1)/2, we have

AIC(m) = n log|Λ̂Λ̂′ + ψ̂| + np + 2m[p + p/m − (m − 1)/2].

The AIC criterion can equivalently be described as minimizing the differences AIC(m) − AIC(H_1), where we subtract from all the models the same quantity, AIC(H_1), the value of AIC for the model which assumes no factor structure and estimates the covariance matrix without restrictions. The function to minimize is then

AIC*(m) = 2(L(H_1) − L(H_{0,m})) − 2g = λ(m) − 2g,

where λ(m) is the difference of supports (12.28), with V̂_0 estimated using m factors, and g is the number of degrees of freedom given by (12.29).

An alternative criterion is the BIC, which we saw in Chapter 11. With this criterion, instead of penalizing the number of parameters by 2 we penalize by log n. Applied to the factor model, using the differences of supports, the criterion is

BIC(m) = λ(m) − g log n.
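Both criteria can be computed directly from the uncorrected statistic λ(m) of (12.28) and the degrees of freedom (12.29). The sketch below, which again takes a fitting routine such as the illustrative ml_factor from the ML section, returns AIC*(m) and BIC(m) for a range of m; the model with the smallest value is selected.

```python
import numpy as np

def aic_bic(S, n, m_max, fit):
    """AIC*(m) = lambda(m) - 2 g and BIC(m) = lambda(m) - g log n for m = 1, ..., m_max.

    fit(S, m) must return (loadings, specific variances), e.g. the ml_factor sketch above.
    """
    p = S.shape[0]
    table = []
    for m in range(1, m_max + 1):
        Lam, psi = fit(S, m)
        V0 = Lam @ Lam.T + np.diag(psi)
        lam = n * np.log(np.linalg.det(V0) / np.linalg.det(S))   # statistic (12.28)
        g = ((p - m) ** 2 - (p + m)) // 2                        # degrees of freedom (12.29)
        table.append((m, lam - 2 * g, lam - g * np.log(n)))
    return table   # choose the m with the smallest AIC* (or BIC) value
```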

Example: We apply the maximum likelihood method to the INVEST data to carry out a test on the number of factors. Basing the test on equation (12.28) we obtain, for each m, the statistic λ, its degrees of freedom g_m, the p-value and the AIC and BIC values. For example, if m = 1 the number of degrees of freedom is (1/2)[(8 − 1)² − (8 + 1)] = 20. We see that for α = 0.05 we cannot reject the null hypothesis that one factor is sufficient. Nevertheless, the Akaike criterion indicates that the minimum is obtained with two factors, while the BIC criterion confirms, by a small difference, the choice of one factor. Since the p-value of the test is at the limit, we are going to compare this test with the procedure proposed by Joreskog. The first step is to use the correction proposed by Bartlett, which we carry out by multiplying the statistic χ²_m by (n − 1 − (2p + 4m + 5)/6)/n. For example, the corrected statistic for m = 1 is

X²_1 = ((20 − 1 − (2·8 + 4·1 + 5)/6)/20) × 31.1 = 23.06,

and a table of the values of m, X²_m, the differences X²_m − X²_{m+1}, the degrees of freedom g_m − g_{m+1} and the corresponding p-values is then built. This method indicates that we reject the hypothesis of one factor but cannot reject the hypothesis of two factors; thus we conclude that the number of factors chosen with Joreskog's method is equal to two. As we see, in this example Joreskog's criterion coincides with Akaike's.

Example: For the HBS data from the Household Budget Survey in the Annex, we apply the factor model estimated by maximum likelihood. The data have been transformed into logarithms to reduce the asymmetry, as was done in the principal components analysis presented in examples 4.2 and 4.3. For this analysis we have also standardized the observations. We accept the test of a single factor given the p-value obtained. The estimated weights of this factor are approximately a weighted average, with less weight given to the sections of food, clothing and footwear, as shown in Table 12.5, which gives the loading of the factor on each of the variables X_1, ..., X_9.

Table 12.5: Loading vector of the factor.

12.6 FACTOR ROTATION

As we saw earlier, the loading matrix is identified only up to multiplication by orthogonal matrices, which are equivalent to rotations. In factor analysis the space spanned by the columns of the loading matrix is defined, but any basis of this space can be a solution. In order to choose from among the possible solutions, the interpretation of the factors is taken into account. Intuitively, it is easier to interpret a factor when it is associated with a block of observed variables. This occurs if the columns of the loading matrix, which represent the effect of each factor on the observed variables, contain high values for certain variables and small values for the others. This idea can be made precise in different ways, which give rise to different criteria for defining the rotation. The coefficients of the orthogonal matrix which defines the rotation are obtained by optimizing an objective function which expresses the desired simplicity of the representation obtained as a result of the rotation. The most frequently used criterion is the varimax.

Varimax criterion

The interpretation of the factors is made easier if the factors which affect some variables do not affect others, and vice versa. This objective leads to the criterion of maximizing the variance of the coefficients which define the effects of each factor on the observed variables. In order to specify this criterion, let δ_{ij} be the coefficients of the loading matrix associated with factor j in the i = 1, ..., p equations after the rotation, and let δ_j be the vector forming column j of the loading matrix after the rotation. We want the variance of the squared coefficients of this vector to be maximal. The coefficients are squared in order to eliminate the signs, since we are interested in their absolute values. Letting δ̄_j = Σ_i δ_{ij}²/p be the mean of the squares of the components of the vector δ_j, the variability for factor j is

(1/p) Σ_{i=1}^p (δ_{ij}² − δ̄_j)² = (1/p) Σ_{i=1}^p δ_{ij}⁴ − (1/p²) (Σ_{i=1}^p δ_{ij}²)²,    (12.34)

and the criterion is to maximize the sum of these variances over all the factors,

q = (1/p) Σ_{j=1}^m Σ_{i=1}^p δ_{ij}⁴ − (1/p²) Σ_{j=1}^m (Σ_{i=1}^p δ_{ij}²)².    (12.35)

Let Λ be the loading matrix estimated initially. The problem is to find an orthogonal matrix M such that the matrix δ = ΛM, whose coefficients are δ_{ij} = λ_i′m_j, where λ_i′ is row i of the matrix Λ and m_j is column j of the matrix M we are looking for, satisfies the condition that these coefficients maximize (12.35).

The terms of the matrix M are obtained by differentiating (12.35) with respect to each of the terms m_{ij}, taking into account the orthogonality restrictions m_i′m_i = 1 and m_i′m_j = 0 (i ≠ j). The result is the varimax rotation.

Example: If we apply a varimax rotation to the ML estimation of the INVEST data from the previous example we obtain the result shown in Figure 12.4. The new loading matrix is the result of multiplying the loading matrix obtained through the ML estimation, shown in example 12.6, by the orthogonal matrix M that defines the rotation, δ = ΛM.

Figure 12.4: The result of applying a varimax rotation to the INVEST factors.
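The varimax rotation is simple to compute. The sketch below uses one common iterative scheme, which applies successive orthogonal Procrustes steps to the gradient of the varimax objective; it is an illustration under these assumptions, not the specific routine used to produce Figure 12.4.

```python
import numpy as np

def varimax(Lam, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a (p x m) loading matrix."""
    p, m = Lam.shape
    M = np.eye(m)                        # accumulated rotation matrix
    q_old = 0.0
    for _ in range(max_iter):
        delta = Lam @ M                  # current rotated loadings
        # Gradient of the varimax objective (12.35) with respect to the rotation
        grad = Lam.T @ (delta ** 3 - delta @ np.diag(np.mean(delta ** 2, axis=0)))
        U, s, Vt = np.linalg.svd(grad)
        M = U @ Vt                       # orthogonal Procrustes step
        q = s.sum()
        if q_old != 0 and q / q_old < 1 + tol:
            break
        q_old = q
    return Lam @ M, M                    # rotated loadings and the rotation matrix
```

Applied to an estimated loading matrix Lam, "rotated, M = varimax(Lam)" returns the rotated loadings δ and the rotation matrix M.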

Oblique rotations

The factor model is indeterminate not only with respect to orthogonal rotations but also with respect to oblique rotations. In fact, as we saw when discussing the uniqueness of the model, the model can be established with correlated or with uncorrelated factors. The solution obtained from the estimation of Λ always corresponds to uncorrelated factors, but we can ask ourselves whether there is a solution with correlated factors which has a more interesting interpretation. Mathematically this implies defining new factors f* = Hf, where H is a nonsingular matrix which can, in general, be interpreted as an oblique rotation. The new covariance matrix of the factors is V_{f*} = HH′. There are various procedures for obtaining oblique rotations, such as Quartimin, Oblimax, Promax, etc., on which more information can be found in the specialized literature. The problem with oblique rotations is that the factors are correlated and thus cannot be interpreted independently.

12.7 FACTOR ESTIMATION

In many problems the interest of factor analysis lies in determining the loading matrix, not in the particular values of the factors for the elements of the sample. Nevertheless, in other cases we want to obtain the values of the factors for the observed elements. There are two procedures for estimating the factors: the first, introduced by Bartlett, assumes that the vector of factor values for each observation is a parameter to be estimated; the second assumes that this vector is a random variable. Next we briefly review both procedures.

The factors as parameters

The (p × 1) vector of values of the variables for individual i, x_i, has a normal distribution with mean Λf_i, where f_i is the (m × 1) vector of the factors for element i of the sample, and covariance matrix ψ:

x_i ~ N_p(Λf_i, ψ).

The parameters f_i can be estimated by maximum likelihood, as shown in the Appendix. The resulting estimator is the generalized least squares estimator, given by

f̂_i = (Λ̂′ψ̂⁻¹Λ̂)⁻¹ Λ̂′ψ̂⁻¹ x_i,    (12.36)

which has a clearly intuitive interpretation: if we know Λ, the factor model x_i = Λf_i + u_i is a regression model with dependent variable x_i, explanatory variables the columns of Λ and parameters f_i. Since the perturbation u_i is distributed not as N(0, I) but as N(0, ψ), we have to use generalized least squares, which leads to (12.36).

The factors as random variables

The second method is to assume that the factors are random variables and to look for the linear predictor which minimizes the mean square error of prediction. As before, let f_i be the values of the factors for individual i and x_i the vector of observed variables.
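The Bartlett estimator (12.36) is a one-line computation once Λ̂ and ψ̂ are available. The sketch below applies it to every row of a data matrix assumed to be already centred. As a hedged aside (it is not derived in the excerpt above), the linear predictor that the random-variable approach leads to is usually written Λ̂′(Λ̂Λ̂′ + ψ̂)⁻¹x_i, and it is included here only for comparison.

```python
import numpy as np

def bartlett_scores(X, Lam, psi):
    """Bartlett (GLS) factor scores, equation (12.36), for each row of the centred data X."""
    P = np.diag(1.0 / psi)                              # psi^{-1}
    A = np.linalg.inv(Lam.T @ P @ Lam) @ Lam.T @ P      # (L' psi^-1 L)^-1 L' psi^-1
    return X @ A.T                                      # one (m,) score vector per row

def regression_scores(X, Lam, psi):
    """'Regression' predictor of the factors (stated here for comparison only)."""
    V = Lam @ Lam.T + np.diag(psi)
    return X @ np.linalg.solve(V, Lam)                  # x_i' V^{-1} Lambda for each row
```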


Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Inverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1

Inverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1 Inverse of a Square Matrix For an N N square matrix A, the inverse of A, 1 A, exists if and only if A is of full rank, i.e., if and only if no column of A is a linear combination 1 of the others. A is

More information

Topic 4 Unit Roots. Gerald P. Dwyer. February Clemson University

Topic 4 Unit Roots. Gerald P. Dwyer. February Clemson University Topic 4 Unit Roots Gerald P. Dwyer Clemson University February 2016 Outline 1 Unit Roots Introduction Trend and Difference Stationary Autocorrelations of Series That Have Deterministic or Stochastic Trends

More information

Vectors and Matrices Statistics with Vectors and Matrices

Vectors and Matrices Statistics with Vectors and Matrices Vectors and Matrices Statistics with Vectors and Matrices Lecture 3 September 7, 005 Analysis Lecture #3-9/7/005 Slide 1 of 55 Today s Lecture Vectors and Matrices (Supplement A - augmented with SAS proc

More information

Factor Analysis. Qian-Li Xue

Factor Analysis. Qian-Li Xue Factor Analysis Qian-Li Xue Biostatistics Program Harvard Catalyst The Harvard Clinical & Translational Science Center Short course, October 7, 06 Well-used latent variable models Latent variable scale

More information

UCLA STAT 233 Statistical Methods in Biomedical Imaging

UCLA STAT 233 Statistical Methods in Biomedical Imaging UCLA STAT 233 Statistical Methods in Biomedical Imaging Instructor: Ivo Dinov, Asst. Prof. In Statistics and Neurology University of California, Los Angeles, Spring 2004 http://www.stat.ucla.edu/~dinov/

More information

MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 6: Bivariate Correspondence Analysis - part II

MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 6: Bivariate Correspondence Analysis - part II MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 6: Bivariate Correspondence Analysis - part II the Contents the the the Independence The independence between variables x and y can be tested using.

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Analysis of variance, multivariate (MANOVA)

Analysis of variance, multivariate (MANOVA) Analysis of variance, multivariate (MANOVA) Abstract: A designed experiment is set up in which the system studied is under the control of an investigator. The individuals, the treatments, the variables

More information

The Common Factor Model. Measurement Methods Lecture 15 Chapter 9

The Common Factor Model. Measurement Methods Lecture 15 Chapter 9 The Common Factor Model Measurement Methods Lecture 15 Chapter 9 Today s Class Common Factor Model Multiple factors with a single test ML Estimation Methods New fit indices because of ML Estimation method

More information

ABOUT PRINCIPAL COMPONENTS UNDER SINGULARITY

ABOUT PRINCIPAL COMPONENTS UNDER SINGULARITY ABOUT PRINCIPAL COMPONENTS UNDER SINGULARITY José A. Díaz-García and Raúl Alberto Pérez-Agamez Comunicación Técnica No I-05-11/08-09-005 (PE/CIMAT) About principal components under singularity José A.

More information

LECTURE NOTE #11 PROF. ALAN YUILLE

LECTURE NOTE #11 PROF. ALAN YUILLE LECTURE NOTE #11 PROF. ALAN YUILLE 1. NonLinear Dimension Reduction Spectral Methods. The basic idea is to assume that the data lies on a manifold/surface in D-dimensional space, see figure (1) Perform

More information

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Statistics 203: Introduction to Regression and Analysis of Variance Course review Statistics 203: Introduction to Regression and Analysis of Variance Course review Jonathan Taylor - p. 1/?? Today Review / overview of what we learned. - p. 2/?? General themes in regression models Specifying

More information

Lecture 3: Latent Variables Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July25, 2006 Machine Learning Summer School, Taiwan

Lecture 3: Latent Variables Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July25, 2006 Machine Learning Summer School, Taiwan Lecture 3: Latent Variables Models and Learning with the EM Algorithm Sam Roweis Tuesday July25, 2006 Machine Learning Summer School, Taiwan Latent Variable Models What to do when a variable z is always

More information

The 3 Indeterminacies of Common Factor Analysis

The 3 Indeterminacies of Common Factor Analysis The 3 Indeterminacies of Common Factor Analysis James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) The 3 Indeterminacies of Common

More information

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Jeremy S. Conner and Dale E. Seborg Department of Chemical Engineering University of California, Santa Barbara, CA

More information

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Canonical Edps/Soc 584 and Psych 594 Applied Multivariate Statistics Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Canonical Slide

More information

Forecast comparison of principal component regression and principal covariate regression

Forecast comparison of principal component regression and principal covariate regression Forecast comparison of principal component regression and principal covariate regression Christiaan Heij, Patrick J.F. Groenen, Dick J. van Dijk Econometric Institute, Erasmus University Rotterdam Econometric

More information

LECTURE 10: NEYMAN-PEARSON LEMMA AND ASYMPTOTIC TESTING. The last equality is provided so this can look like a more familiar parametric test.

LECTURE 10: NEYMAN-PEARSON LEMMA AND ASYMPTOTIC TESTING. The last equality is provided so this can look like a more familiar parametric test. Economics 52 Econometrics Professor N.M. Kiefer LECTURE 1: NEYMAN-PEARSON LEMMA AND ASYMPTOTIC TESTING NEYMAN-PEARSON LEMMA: Lesson: Good tests are based on the likelihood ratio. The proof is easy in the

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental

More information

Appendix A: The time series behavior of employment growth

Appendix A: The time series behavior of employment growth Unpublished appendices from The Relationship between Firm Size and Firm Growth in the U.S. Manufacturing Sector Bronwyn H. Hall Journal of Industrial Economics 35 (June 987): 583-606. Appendix A: The time

More information

Principal Component Analysis (PCA) Theory, Practice, and Examples

Principal Component Analysis (PCA) Theory, Practice, and Examples Principal Component Analysis (PCA) Theory, Practice, and Examples Data Reduction summarization of data with many (p) variables by a smaller set of (k) derived (synthetic, composite) variables. p k n A

More information

Eigenvalues and diagonalization

Eigenvalues and diagonalization Eigenvalues and diagonalization Patrick Breheny November 15 Patrick Breheny BST 764: Applied Statistical Modeling 1/20 Introduction The next topic in our course, principal components analysis, revolves

More information

Exploratory Factor Analysis: dimensionality and factor scores. Psychology 588: Covariance structure and factor models

Exploratory Factor Analysis: dimensionality and factor scores. Psychology 588: Covariance structure and factor models Exploratory Factor Analysis: dimensionality and factor scores Psychology 588: Covariance structure and factor models How many PCs to retain 2 Unlike confirmatory FA, the number of factors to extract is

More information

FEEG6017 lecture: Akaike's information criterion; model reduction. Brendan Neville

FEEG6017 lecture: Akaike's information criterion; model reduction. Brendan Neville FEEG6017 lecture: Akaike's information criterion; model reduction Brendan Neville bjn1c13@ecs.soton.ac.uk Occam's razor William of Occam, 1288-1348. All else being equal, the simplest explanation is the

More information

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October Finding normalized and modularity cuts by spectral clustering Marianna Bolla Institute of Mathematics Budapest University of Technology and Economics marib@math.bme.hu Ljubjana 2010, October Outline Find

More information

3.1. The probabilistic view of the principal component analysis.

3.1. The probabilistic view of the principal component analysis. 301 Chapter 3 Principal Components and Statistical Factor Models This chapter of introduces the principal component analysis (PCA), briefly reviews statistical factor models PCA is among the most popular

More information

2/26/2017. This is similar to canonical correlation in some ways. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

2/26/2017. This is similar to canonical correlation in some ways. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 What is factor analysis? What are factors? Representing factors Graphs and equations Extracting factors Methods and criteria Interpreting

More information

Key Algebraic Results in Linear Regression

Key Algebraic Results in Linear Regression Key Algebraic Results in Linear Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) 1 / 30 Key Algebraic Results in

More information

Appendix A: Matrices

Appendix A: Matrices Appendix A: Matrices A matrix is a rectangular array of numbers Such arrays have rows and columns The numbers of rows and columns are referred to as the dimensions of a matrix A matrix with, say, 5 rows

More information

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data. Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two

More information

Retained-Components Factor Transformation: Factor Loadings and Factor Score Predictors in the Column Space of Retained Components

Retained-Components Factor Transformation: Factor Loadings and Factor Score Predictors in the Column Space of Retained Components Journal of Modern Applied Statistical Methods Volume 13 Issue 2 Article 6 11-2014 Retained-Components Factor Transformation: Factor Loadings and Factor Score Predictors in the Column Space of Retained

More information

NUCLEAR NORM PENALIZED ESTIMATION OF INTERACTIVE FIXED EFFECT MODELS. Incomplete and Work in Progress. 1. Introduction

NUCLEAR NORM PENALIZED ESTIMATION OF INTERACTIVE FIXED EFFECT MODELS. Incomplete and Work in Progress. 1. Introduction NUCLEAR NORM PENALIZED ESTIMATION OF IERACTIVE FIXED EFFECT MODELS HYUNGSIK ROGER MOON AND MARTIN WEIDNER Incomplete and Work in Progress. Introduction Interactive fixed effects panel regression models

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

Seminar über Statistik FS2008: Model Selection

Seminar über Statistik FS2008: Model Selection Seminar über Statistik FS2008: Model Selection Alessia Fenaroli, Ghazale Jazayeri Monday, April 2, 2008 Introduction Model Choice deals with the comparison of models and the selection of a model. It can

More information

Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA

Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA Principle Components Analysis: Uses one group of variables (we will call this X) In

More information

Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8]

Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8] 1 Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8] Insights: Price movements in one market can spread easily and instantly to another market [economic globalization and internet

More information

Psychology 454: Latent Variable Modeling How do you know if a model works?

Psychology 454: Latent Variable Modeling How do you know if a model works? Psychology 454: Latent Variable Modeling How do you know if a model works? William Revelle Department of Psychology Northwestern University Evanston, Illinois USA November, 2012 1 / 18 Outline 1 Goodness

More information

Structural Equation Modeling and Confirmatory Factor Analysis. Types of Variables

Structural Equation Modeling and Confirmatory Factor Analysis. Types of Variables /4/04 Structural Equation Modeling and Confirmatory Factor Analysis Advanced Statistics for Researchers Session 3 Dr. Chris Rakes Website: http://csrakes.yolasite.com Email: Rakes@umbc.edu Twitter: @RakesChris

More information

Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model

Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives

More information

Dimensionality Reduction

Dimensionality Reduction 394 Chapter 11 Dimensionality Reduction There are many sources of data that can be viewed as a large matrix. We saw in Chapter 5 how the Web can be represented as a transition matrix. In Chapter 9, the

More information

Short Answer Questions: Answer on your separate blank paper. Points are given in parentheses.

Short Answer Questions: Answer on your separate blank paper. Points are given in parentheses. ISQS 6348 Final exam solutions. Name: Open book and notes, but no electronic devices. Answer short answer questions on separate blank paper. Answer multiple choice on this exam sheet. Put your name on

More information

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77 Linear Regression Chapter 3 September 27, 2016 Chapter 3 September 27, 2016 1 / 77 1 3.1. Simple linear regression 2 3.2 Multiple linear regression 3 3.3. The least squares estimation 4 3.4. The statistical

More information

CONFIRMATORY FACTOR ANALYSIS

CONFIRMATORY FACTOR ANALYSIS 1 CONFIRMATORY FACTOR ANALYSIS The purpose of confirmatory factor analysis (CFA) is to explain the pattern of associations among a set of observed variables in terms of a smaller number of underlying latent

More information

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation 1 Outline. 1. Motivation 2. SUR model 3. Simultaneous equations 4. Estimation 2 Motivation. In this chapter, we will study simultaneous systems of econometric equations. Systems of simultaneous equations

More information

VAR Model. (k-variate) VAR(p) model (in the Reduced Form): Y t-2. Y t-1 = A + B 1. Y t + B 2. Y t-p. + ε t. + + B p. where:

VAR Model. (k-variate) VAR(p) model (in the Reduced Form): Y t-2. Y t-1 = A + B 1. Y t + B 2. Y t-p. + ε t. + + B p. where: VAR Model (k-variate VAR(p model (in the Reduced Form: where: Y t = A + B 1 Y t-1 + B 2 Y t-2 + + B p Y t-p + ε t Y t = (y 1t, y 2t,, y kt : a (k x 1 vector of time series variables A: a (k x 1 vector

More information

Factor Analysis of Data Matrices

Factor Analysis of Data Matrices Factor Analysis of Data Matrices PAUL HORST University of Washington HOLT, RINEHART AND WINSTON, INC. New York Chicago San Francisco Toronto London Contents Preface PART I. Introductory Background 1. The

More information

More Linear Algebra. Edps/Soc 584, Psych 594. Carolyn J. Anderson

More Linear Algebra. Edps/Soc 584, Psych 594. Carolyn J. Anderson More Linear Algebra Edps/Soc 584, Psych 594 Carolyn J. Anderson Department of Educational Psychology I L L I N O I S university of illinois at urbana-champaign c Board of Trustees, University of Illinois

More information

LEAST SQUARES METHODS FOR FACTOR ANALYSIS. 1. Introduction

LEAST SQUARES METHODS FOR FACTOR ANALYSIS. 1. Introduction LEAST SQUARES METHODS FOR FACTOR ANALYSIS JAN DE LEEUW AND JIA CHEN Abstract. Meet the abstract. This is the abstract. 1. Introduction Suppose we have n measurements on each of m variables. Collect these

More information

This section is an introduction to the basic themes of the course.

This section is an introduction to the basic themes of the course. Chapter 1 Matrices and Graphs 1.1 The Adjacency Matrix This section is an introduction to the basic themes of the course. Definition 1.1.1. A simple undirected graph G = (V, E) consists of a non-empty

More information

Statistical Distribution Assumptions of General Linear Models

Statistical Distribution Assumptions of General Linear Models Statistical Distribution Assumptions of General Linear Models Applied Multilevel Models for Cross Sectional Data Lecture 4 ICPSR Summer Workshop University of Colorado Boulder Lecture 4: Statistical Distributions

More information

Method of principal factors estimation of optimal number of factors: an information criteria approach

Method of principal factors estimation of optimal number of factors: an information criteria approach American Journal of Theoretical and Applied Statistics 2013; 2(6): 166-175 Published online October 30, 2013 (http://www.sciencepublishinggroup.com/j/ajtas) doi: 10.11648/j.ajtas.20130206.13 Method of

More information

Lecture Stat Information Criterion

Lecture Stat Information Criterion Lecture Stat 461-561 Information Criterion Arnaud Doucet February 2008 Arnaud Doucet () February 2008 1 / 34 Review of Maximum Likelihood Approach We have data X i i.i.d. g (x). We model the distribution

More information

Panel Data Models. Chapter 5. Financial Econometrics. Michael Hauser WS17/18 1 / 63

Panel Data Models. Chapter 5. Financial Econometrics. Michael Hauser WS17/18 1 / 63 1 / 63 Panel Data Models Chapter 5 Financial Econometrics Michael Hauser WS17/18 2 / 63 Content Data structures: Times series, cross sectional, panel data, pooled data Static linear panel data models:

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Seminar on Linear Algebra

Seminar on Linear Algebra Supplement Seminar on Linear Algebra Projection, Singular Value Decomposition, Pseudoinverse Kenichi Kanatani Kyoritsu Shuppan Co., Ltd. Contents 1 Linear Space and Projection 1 1.1 Expression of Linear

More information

Machine Learning (Spring 2012) Principal Component Analysis

Machine Learning (Spring 2012) Principal Component Analysis 1-71 Machine Learning (Spring 1) Principal Component Analysis Yang Xu This note is partly based on Chapter 1.1 in Chris Bishop s book on PRML and the lecture slides on PCA written by Carlos Guestrin in

More information

CHAPTER 7 INTRODUCTION TO EXPLORATORY FACTOR ANALYSIS. From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum

CHAPTER 7 INTRODUCTION TO EXPLORATORY FACTOR ANALYSIS. From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum CHAPTER 7 INTRODUCTION TO EXPLORATORY FACTOR ANALYSIS From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum 1997 144 CHAPTER 7 INTRODUCTION TO EXPLORATORY FACTOR ANALYSIS Factor analytic

More information

M M Cross-Over Designs

M M Cross-Over Designs Chapter 568 Cross-Over Designs Introduction This module calculates the power for an x cross-over design in which each subject receives a sequence of treatments and is measured at periods (or time points).

More information

Bare minimum on matrix algebra. Psychology 588: Covariance structure and factor models

Bare minimum on matrix algebra. Psychology 588: Covariance structure and factor models Bare minimum on matrix algebra Psychology 588: Covariance structure and factor models Matrix multiplication 2 Consider three notations for linear combinations y11 y1 m x11 x 1p b11 b 1m y y x x b b n1

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Dimensionality Reduction Techniques (DRT)

Dimensionality Reduction Techniques (DRT) Dimensionality Reduction Techniques (DRT) Introduction: Sometimes we have lot of variables in the data for analysis which create multidimensional matrix. To simplify calculation and to get appropriate,

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information