Stat 206, Week 6: Factor Analysis


James Johndrow

Introduction

Factor analysis aims to explain the correlations among a large set ($p$) of variables in terms of a smaller number ($m$) of underlying factors. The factors are assumed unobserved, and there is also observation noise/random error. Spearman thought of the following example. Consider children's exam performance in $X_1 =$ classics, $X_2 =$ french, $X_3 =$ english, with observed correlation matrix
$$R = \begin{pmatrix} 1 & .83 & .78 \\ .83 & 1 & .67 \\ .78 & .67 & 1 \end{pmatrix}.$$
A model reducing from $p = 3$ variables to $m = 1$ factor is
$$X_1 = l_1 F + \epsilon_1, \quad X_2 = l_2 F + \epsilon_2, \quad X_3 = l_3 F + \epsilon_3.$$
In this model, $F$ is referred to as a common factor (or latent factor, since it is unobserved), $l_1, l_2, l_3$ are the factor loadings, and $\epsilon_1, \epsilon_2, \epsilon_3$ are random errors. The common factor might have an interpretation as "general ability." The errors $\epsilon_j$ capture differences in ability in subject $j$ from general ability, and also the fact that an exam score is an imperfect measure of a student's ability.

The orthogonal factor model

We can generalize this to a factor model with $m$ common factors. A model with this structure that generates a random $p$-vector $X$ is given by
$$X = \mu + LF + \epsilon$$
where $\mu$ is a $p \times 1$ mean vector, $L$ is a $p \times m$ factor loadings matrix, $F$ is an $m \times 1$ vector of factors, and $\epsilon$ is a $p \times 1$ vector of random errors. We can also express this as

$$X_j = \mu_j + \sum_{k=1}^m l_{jk} F_k + \epsilon_j,$$
where $\mu_j$ is the mean of $X_j$, $l_{jk}$ is the loading of the $j$th component of $X$ on the $k$th factor, $F_k$ is the $k$th common factor, and $\epsilon_j$ is the $j$th specific factor. Here $j = 1, \dots, p$ indexes variables (components of the random vector $X$), and $k = 1, \dots, m < p$ indexes factors. We make the following assumptions on the unobserved random vectors $F, \epsilon$:

1. $E[F] = 0$, $\mathrm{cov}(F) = I_m$.
2. $E[\epsilon] = 0$, $\mathrm{cov}(\epsilon) = \Psi$, with $\Psi$ a diagonal matrix.
3. $\mathrm{cov}(F, \epsilon) = 0$.

The name "orthogonal factors" comes from the assumption $\mathrm{cov}(F) = I_m$. The elements of $\Psi = \mathrm{diag}(\psi_j)$ are called specific variances or uniquenesses. These assumptions have important consequences for the covariance structure:
$$\Sigma = \mathrm{cov}(X) = LL' + \Psi, \qquad \mathrm{cov}(X, F) = L.$$
Let's write these out coordinate-wise: $\mathrm{cov}(X_j, F_k) = l_{jk}$ is the loading of the $j$th variable on the $k$th factor, and
$$\Sigma_{jj} = \mathrm{var}(X_j) = (LL')_{jj} + \psi_j = h_j^2 + \psi_j,$$
where
$$h_j^2 = (LL')_{jj} = \sum_{k=1}^m l_{jk}^2$$
is the $j$th communality.
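These identities are easy to check by simulation. Below is a minimal R sketch (my own, not from the notes) that generates data from Spearman's one-factor model, using the loadings and uniquenesses fitted later in this handout, and verifies that the sample covariance approaches $LL' + \Psi$.

set.seed(1)
n <- 1e5
l <- c(.983, .844, .794)    # one-factor loadings (fitted values below)
psi <- c(.034, .287, .370)  # specific variances
f <- rnorm(n)               # common factor: E[F] = 0, var(F) = 1
eps <- matrix(rnorm(3*n), n, 3) %*% diag(sqrt(psi))  # errors with cov Psi
X <- outer(f, l) + eps      # X_j = l_j F + eps_j (taking mu = 0)
round(cov(X) - (l %*% t(l) + diag(psi)), 2)  # ~ 0: cov(X) = LL' + Psi
l^2 + psi                   # var(X_j) = h_j^2 + psi_j, each ~ 1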

Consider again Spearman's example. Here $p = 3$, $m = 1$, $L = (l_1, l_2, l_3)'$, and
$$\Sigma = LL' + \Psi = \begin{pmatrix} l_1^2 + \psi_1 & l_1 l_2 & l_1 l_3 \\ l_1 l_2 & l_2^2 + \psi_2 & l_2 l_3 \\ l_1 l_3 & l_2 l_3 & l_3^2 + \psi_3 \end{pmatrix}.$$

An identifiability issue exists with the model as we have presented it thus far. If $\Gamma$ is any $m \times m$ orthogonal matrix, then
$$LF = (L\Gamma)(\Gamma' F),$$
and so we can replace $L \to L\Gamma$, $F \to \Gamma' F$ without changing the dependence structure of $X$, since $\Sigma = LL' + \Psi = L\Gamma\Gamma' L' + \Psi$. Since the model doesn't change, we'll later use this in two ways:

1. To rotate the factors to make them more interpretable.
2. To assist in optimization for maximum likelihood estimation.

Exercise 1. This describes the non-identifiability completely. Prove that if $LL' = \tilde{L}\tilde{L}'$ then there exists an $m \times m$ orthogonal matrix $\Gamma$ such that $\tilde{L} = L\Gamma$. (Hint: use the singular value decomposition.)

We can make the model identifiable by imposing constraints on the loadings. For example, we can constrain
$$L' \Psi^{-1} L = \mathrm{diag}(a_1, \dots, a_m) \tag{1}$$
to be an $m \times m$ diagonal matrix, which is equivalent to saying that the columns of $\Psi^{-1/2} L$ are orthogonal. Sometimes the variant constraint that $L' D^{-1} L$ is diagonal is used, where $D = \mathrm{diag}(\Sigma_{11}, \dots, \Sigma_{pp})$.
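Numerically, the rotation invariance is immediate. Here is a short R check (my own sketch, with an arbitrary loadings matrix), using a random orthogonal matrix generated via a QR decomposition.

p <- 4; m <- 2
L <- matrix(c(.9, .8, .3, .2, .1, .3, .8, .7), p, m)  # arbitrary loadings
set.seed(2)
Gamma <- qr.Q(qr(matrix(rnorm(m*m), m, m)))  # random m x m orthogonal matrix
L.rot <- L %*% Gamma                         # rotated loadings
max(abs(L %*% t(L) - L.rot %*% t(L.rot)))    # ~ 0: LL' is unchanged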

The model accomplishes dimension reduction. We have seen that the orthogonal factor model yields the decomposition
$$\Sigma = LL' + \Psi. \tag{2}$$
$\Sigma$ has $p$ diagonal and $p(p-1)/2$ off-diagonal parameters, for a total of $\nu = p(p+1)/2$. Now let's count the parameters on the right side of (2). Since $\Psi$ is diagonal, it has $p$ parameters. It looks like $L$ has $pm$ parameters, but (1) imposes $m(m-1)/2$ constraints, so the right side actually has $\nu_0 = pm - m(m-1)/2 + p$ free parameters. The reduction in parameters from a general $\Sigma$ to the form (2) is thus
$$s = p(p+1)/2 - pm + m(m-1)/2 - p = \tfrac{1}{2}\left[(p-m)^2 - (p+m)\right].$$
Generally $s > 0$, and $s$ then represents the extent of dimension reduction accomplished by the factor model. The value $s$ can be huge when $m \ll p$, so factor models are particularly attractive in high-dimensional settings where $p$ is large, particularly if $n$ is not also very large. If $\Sigma = LL' + \Psi$ represents a null hypothesis, and a general $\Sigma$ represents the alternative, then $s = \nu - \nu_0$ also gives the degrees of freedom of the asymptotic $\chi^2$ distribution of the likelihood ratio test (see below).

The factor model is scale invariant. If $X$ follows the factor model $X = \mu + LF + \epsilon$, then so does $CX$ for a diagonal matrix $C$:
$$Y = CX = C\mu + CLF + C\epsilon = \mu_C + L_C F + \epsilon_C,$$
where $\mathrm{cov}(\epsilon_C) = C\Psi C'$ is diagonal since $C$ and $\Psi$ are diagonal. So, in this sense, factor analysis is unaffected by rescaling of the variables.
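As a quick illustration of the parameter counting, here is a small helper (a sketch of my own, not from the notes):

fa.df <- function(p, m) {
  nu  <- p * (p + 1) / 2              # parameters in a general Sigma
  nu0 <- p * m - m * (m - 1) / 2 + p  # parameters in L and Psi under (1)
  c(nu = nu, nu0 = nu0, s = nu - nu0)
}
fa.df(3, 1)   # Spearman's example: s = 0, exactly identified
fa.df(30, 4)  # a DJIA-sized problem: s = 321, a large reduction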

Use of the (sample) correlation matrix

Thus we may standardize the variables
$$Z = V^{-1/2}(X - \mu), \qquad V = \mathrm{diag}(\Sigma),$$
and specify a factor model for the population correlation matrix
$$P = L_z L_z' + \Psi_z, \qquad L_z = V^{-1/2} L, \quad \Psi_z = V^{-1/2} \Psi V^{-1/2}.$$
With data $x_i$ we standardize $z_i = \hat{V}^{-1/2}(x_i - \bar{x})$ with $\hat{V} = \mathrm{diag}(S)$ and fit
$$R = \hat{L}_z \hat{L}_z' + \hat{\Psi}_z.$$
In this case $\hat{\psi}_j = 1 - \hat{h}_j^2 = 1 - \sum_{k=1}^m \hat{l}_{jk}^2$, and $\hat{\Sigma}_{jj} = s_{jj}$.

Returning to Spearman's example, $p = 3$, $m = 1$, so $s = \tfrac{1}{2}[(3-1)^2 - (3+1)] = 0$. In fitting $R = \hat{L}_z \hat{L}_z' + \hat{\Psi}_z$, there are 6 equations in 6 unknowns. Indeed, you can check that
$$\hat{L}_z = \begin{pmatrix} .983 \\ .844 \\ .794 \end{pmatrix}, \qquad \hat{\Psi}_z = \mathrm{diag}(.034, .287, .370).$$
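Since $s = 0$ here, the six equations can be solved in closed form. A sketch of that solution in R (my own; the expressions are the standard ones for a one-factor model on three variables):

r12 <- .83; r13 <- .78; r23 <- .67  # off-diagonals of R
l1 <- sqrt(r12 * r13 / r23)         # from the equations l_i l_j = r_ij
l2 <- sqrt(r12 * r23 / r13)
l3 <- sqrt(r13 * r23 / r12)
round(c(l1, l2, l3), 3)             # reproduces the loadings above (to ~.001)
round(1 - c(l1, l2, l3)^2, 3)       # uniquenesses, from psi_j = 1 - l_j^2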

Methods of estimation

One method uses principal components. Motivation: start from the spectral decomposition of $\Sigma$, $\Sigma = U \Lambda U'$. We seek an approximation with a small number $m < p$ of factors. A natural choice (the best rank-$m$ approximation, as we have seen) is to set
$$\Sigma_m = U_m \Lambda_m U_m' = U_m \Lambda_m^{1/2} \Lambda_m^{1/2} U_m' = (U_m \Lambda_m^{1/2})(U_m \Lambda_m^{1/2})' = LL' \tag{3}$$
where $U_m$ consists of the first $m$ columns of $U$ and $\Lambda_m = \mathrm{diag}(\lambda_1, \dots, \lambda_m)$. Our factor model says that $\Sigma$ has the form $LL' + \Psi$, so if the $L$ in (3) were the correct one, then recalling that $\Psi$ is diagonal, we might set $\Psi = \mathrm{diag}(\Sigma - LL')$. With actual data, we mimic these steps, using either $S$ or $R$:

1. Perform a spectral decomposition of $S$ or $R$.
2. Choose $m < p$ and define $\hat{L} = \hat{U}_m \hat{\Lambda}_m^{1/2}$.
3. Set $\hat{\Psi} = \mathrm{diag}(S - \hat{L}\hat{L}')$.

How good is this fit? How do we choose $m$?

1. We can look at the residual matrix $\Delta_m = S - (\hat{L}\hat{L}' + \hat{\Psi})$, which we would like to be small. By definition, the diagonal elements of $\Delta_m$ are zero, and it can be shown that
$$\|\Delta_m\|_F^2 \le \hat{\lambda}_{m+1}^2 + \dots + \hat{\lambda}_p^2,$$
so one guide would be to choose $m$ such that the right side is small compared to $\sum_{j=1}^p \hat{\lambda}_j^2$.

2. We might alternatively evaluate the $k$th common factor by its contribution toward the total variance. Earlier, we saw that
$$\Sigma_{jj} = h_j^2 + \psi_j = \sum_{k=1}^m l_{jk}^2 + \psi_j \tag{4}$$
so the contribution to $\Sigma_{jj}$ from the $k$th factor is $l_{jk}^2$, and the contribution to the total variance $\mathrm{tr}(S) = \sum_{j=1}^p \Sigma_{jj}$ from the $k$th factor is
$$\sum_{j=1}^p l_{jk}^2. \tag{5}$$
N.B.: in the matrix of squared loadings $(l_{jk}^2)$, (4) takes a row sum, while (5) takes a column sum. With our principal component estimate, the sum of squares of the $k$th column is
$$\sum_{j=1}^p \hat{l}_{jk}^2 = \hat{\lambda}_k \|\hat{e}_k\|_2^2 = \hat{\lambda}_k,$$
so the proportional contribution to the total variance from the $k$th factor is
$$\frac{\hat{\lambda}_k}{\mathrm{tr}(S)} = \frac{\hat{\lambda}_k}{\sum_{j=1}^p \hat{\lambda}_j}.$$
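Here is a compact sketch of these three steps as an R function (my own, not from the notes), with a numerical check of the Frobenius bound on the residual matrix using a synthetic correlation matrix:

pc.factor <- function(R, m) {
  eig <- eigen(R)
  L <- eig$vectors[, 1:m, drop = FALSE] %*% diag(sqrt(eig$values[1:m]), m)
  Psi <- diag(diag(R - L %*% t(L)))  # put all remaining variance in Psi
  list(L = L, Psi = Psi, lambda = eig$values)
}
set.seed(3)
L0 <- matrix(runif(20, -1, 1), 10, 2) / 2     # synthetic true loadings
R0 <- cov2cor(L0 %*% t(L0) + 0.5 * diag(10))  # a valid correlation matrix
fit <- pc.factor(R0, 2)
Delta <- R0 - (fit$L %*% t(fit$L) + fit$Psi)  # residual matrix, zero diagonal
sum(Delta^2) <= sum(fit$lambda[3:10]^2)       # TRUE: bound on eigenvalues m+1..p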

Estimation for factor models

Let's estimate factor models for the DJIA data.

library(ggplot2)
library(reshape2)
load('../../datasets-other/djia/djia.rdata')
df <- djia.ldr[, 3:ncol(djia.ldr)]
n <- nrow(df); p <- ncol(df)
df <- data.frame(df)
names(df) <- colnames(df)
# transform each variable to approximate normal scores
for (j in 1:p) {
  Fj <- ecdf(df[, j])
  df[, j] <- qnorm(Fj(df[, j]) - 1/(2*n))
}
R <- cor(df); eig <- eigen(R)
lambda <- eig$values
cper <- data.frame(per.exp = cumsum(lambda)/sum(lambda))
cper$j <- seq(p)
ggplot(cper, aes(x = j, y = per.exp)) + geom_point() + ylim(c(0, 1))

[Figure 1: cumulative percent of variance explained, DJIA data]

Gamma <- eig$vectors
L <- Gamma %*% diag(sqrt(lambda))
m <- 4
L4 <- L[, 1:m]
df.L <- data.frame(L4)
names(df.L) <- paste('L', seq(m), sep = '')
df.L$stock <- names(df)
df.L <- melt(df.L, id = 'stock')
ggplot(df.L, aes(y = stock, x = variable, fill = value)) + geom_tile()

[Figure: heatmap of loadings on the first four factors, by stock]

psi <- diag(R) - diag(L4 %*% t(L4))
res3 <- R - L4 %*% t(L4) - diag(psi)
sum(c(res3^2))
## [1]
max(c(res3^2))
## [1]
sum(c(R^2))
## [1]
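Tying this back to the variance-contribution formula above: since we factor the correlation matrix, $\mathrm{tr}(R) = p$, so the proportional contribution of the $k$th factor is $\hat{\lambda}_k / p$. A short sketch (mine), continuing the session above:

lambda[1:m] / p       # proportional contribution of each retained factor
sum(lambda[1:m]) / p  # cumulative share, matching the scree plot at j = 4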

# res3 <- abs(res3)/abs(R)  # standardize to give a better notion of how much
#                           # error there is in each element
# res3 <- data.frame(res3)
# names(res3) <- names(df)
# res3$stock <- names(df)
# res3 <- melt(res3, id = 'stock')
# ggplot(res3, aes(x = variable, y = stock, fill = value)) + geom_tile()

Maximum likelihood method

load('../../datasets-other/djia/djia.rdata')
df <- djia.ldr[, 3:ncol(djia.ldr)]
n <- nrow(df); p <- ncol(df)
df <- data.frame(df)
names(df) <- colnames(df)
for (j in 1:p) {
  Fj <- ecdf(df[, j])
  df[, j] <- qnorm(Fj(df[, j]) - 1/(2*n))
}
fit4 <- factanal(df, 4, rotation = "none")
colSums(fit4$loadings^2)
## Factor1 Factor2 Factor3 Factor4
i.Psi <- diag(1/fit4$uniquenesses)
L <- fit4$loadings
LPL <- t(L) %*% i.Psi %*% L
# check that L' Psi^{-1} L is diagonal: the identifiability constraint (1)
max(LPL - diag(diag(LPL)))
## [1] e-16
Psi <- diag(fit4$uniquenesses)
res2 <- R - (L %*% t(L) + Psi)
sum(c(res2^2))
## [1]
max(c(res2^2))
## [1]
sum(R^2)
## [1]
Rhat <- L %*% t(L) + Psi
s <- ((p - m)^2 - (p + m))/2
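The degrees of freedom $s$ computed above feed the likelihood ratio test of $H_0: \Sigma = LL' + \Psi$ against a general $\Sigma$, mentioned earlier. A hedged sketch, continuing the session above, using the Bartlett-corrected form of the statistic (one standard version, as in Johnson and Wichern; factanal also reports its own version of this test):

# LR test of H0: Sigma = LL' + Psi, asymptotically chi-square on s df
lr.stat <- (n - 1 - (2*p + 4*m + 5)/6) * log(det(Rhat) / det(R))
pchisq(lr.stat, df = s, lower.tail = FALSE)  # approximate p-value
# factanal's built-in version of the test:
fit4$STATISTIC
fit4$PVAL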
