Sparse orthogonal factor analysis


Kohei Adachi and Nickolay T. Trendafilov

Abstract A sparse orthogonal factor analysis procedure is proposed for estimating the optimal solution with sparse loadings. In the procedure, an alternating least squares algorithm estimates the parameters for a specified sparseness of the loadings, and the suitable sparseness is selected by an information criterion. It is worth noting that the proposed procedure constrains the sparseness directly, without using a penalty function.

Key words: factor analysis, sparse loading matrix, direct sparseness constraint

1 Introduction

Factor analysis (FA) is classified as exploratory (EFA) or confirmatory (CFA). In EFA, the factor loading matrix is unconstrained and has rotational freedom, which is exploited to rotate the matrix so that it approximates a matrix with zero elements. In CFA, some loadings are constrained to be zero and the loading matrix has no rotational freedom (Mulaik, 2010). A loading matrix containing zero elements is said to be sparse, a property indispensable for loadings to be interpretable. In EFA, a loading matrix is rotated toward a sparse matrix, but literal sparseness is not attained, since rotated loadings cannot be exactly equal to zero. On the other hand, some loadings are fixed exactly at zero in CFA; the problem there, however, is that the number of zero loadings and their locations must be chosen by the user in a subjective manner. To overcome these difficulties, we propose a new FA procedure, which is neither EFA nor CFA, for estimating the optimal orthogonal factor solution with a sparse loading matrix that has a suitable number of zero elements, whose locations are also estimated computationally. The proposed procedure consists of the following two stages:

Kohei Adachi, Graduate School of Human Sciences, Osaka University, Japan; adachi@hus.osaka-u.ac.jp
Nickolay T. Trendafilov, Department of Mathematics and Statistics, Open University, UK; Nickolay.Trendafilov@open.ac.uk

[A] The optimal solution is obtained for a specified number of zero loadings.
[B] The optimal number of zero loadings is selected among the possible numbers.

Stages [A] and [B] are described in Sections 2-3 and Section 4, respectively. In the area of principal component analysis (PCA), many procedures, called sparse PCA, have been proposed in the last decade (e.g., Jolliffe, Trendafilov & Uddin, 2003; Zou, Hastie & Tibshirani, 2006). As in our FA procedure, they obtain sparse loadings. However, besides the difference between PCA and FA, our approach does not rely on penalty functions, which are the standard way of inducing sparseness in the existing sparse PCA.

2 Sparse Factor Problem

The main goal of FA is to estimate the p-variables × m-factors matrix Λ containing the loadings and the p × p diagonal matrix Ψ² containing the unique variances from the n-observations × p-variables (n > p) column-centred data matrix X. For this goal, FA can be formulated with several different loss functions, among which we choose

f(F, U, Λ, Ψ) = ‖X − (FΛ′ + UΨ)‖² = ‖X − ZB′‖²,  (1)

recently presented by de Leeuw (2004), Unkel and Trendafilov (2010), and Trendafilov and Unkel (2011). Here, B = [Λ, Ψ] is a p × (m + p) block matrix and Z = [F, U] is the n × (m + p) one containing the common and unique factor matrices F (n × m) and U (n × p), respectively. The factor score matrix Z is constrained to satisfy

n⁻¹Z′Z = I_{m+p},  (2)

with I_{m+p} the identity matrix of order m + p. We propose to minimize (1) over F, U, Λ, and Ψ subject to (2) and

SP(Λ) = q,  (3)

where SP(Λ) expresses the sparseness of Λ, i.e., the number of its elements that are zero, and q is a specified integer. The reason for choosing loss function (1) is that it can be rewritten as

f(F, U, Λ, Ψ) = ‖X − (FA′ + UΨ)‖² + n‖Λ − A‖², with A = n⁻¹X′F,  (1′)

and can thus easily be minimized over Λ subject to (3), as shown in the next section.
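To make the equivalence of (1) and (1′) concrete, the following minimal numpy check (our illustration, not the authors' code) draws random F, U, Λ, and Ψ satisfying constraint (2) and confirms that the two forms of the loss coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 100, 6, 2

# Z = [F, U] with n^{-1} Z'Z = I_{m+p}, built from the QR of a random matrix
Z = np.sqrt(n) * np.linalg.qr(rng.standard_normal((n, m + p)))[0]
F, U = Z[:, :m], Z[:, m:]

X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                      # column-centred data matrix

Lam = rng.standard_normal((p, m))        # loadings Lambda
Psi = np.diag(rng.uniform(0.2, 0.8, p))  # diagonal Psi

A = X.T @ F / n                          # A = n^{-1} X'F
f1 = np.linalg.norm(X - (F @ Lam.T + U @ Psi))**2    # loss (1)
f2 = np.linalg.norm(X - (F @ A.T + U @ Psi))**2 \
     + n * np.linalg.norm(Lam - A)**2                # rewriting (1')
print(np.isclose(f1, f2))                # True: (1) and (1') agree
```

The second term of (1′) shows why the constrained update of Λ is easy: for fixed Z and Ψ, it suffices to copy A and zero its q smallest elements in absolute value, which is exactly update (4) below.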

3 Algorithm

For minimizing (1) subject to (2) and (3), we alternately iterate the update of each parameter matrix. First, let us consider updating Λ so that (1), or equivalently (1′), is minimized subject to (3) while Z = [F, U] and Ψ are kept fixed. The optimal update of Λ = (λ_jk) is given by

λ_jk = 0 if |a_jk| ≤ a^[q], and λ_jk = a_jk otherwise,  (4)

where a_jk is the (j, k) element of

A = (a_jk) = n⁻¹X′F  (5)

and a^[q] is the q-th smallest absolute value among those of the elements of A.

Next, let us consider updating the diagonal matrix Ψ. We can find that (1) is minimized for

Ψ = diag(n⁻¹X′U)  (6)

when Z = [F, U] and Λ are fixed.

Finally, let us consider updating Z = [F, U] so that (1) is minimized subject to (2) with Λ and Ψ kept fixed. Since (1) can be rewritten as tr X′X + n tr BB′ − 2 tr(XB)′Z using (2), its minimum is found to be attained for

Z = n^{1/2}PQ′ = n^{1/2}P₁Q₁′ + n^{1/2}P₂Q₂′,  (7)

with P = [P₁, P₂] and Q = [Q₁, Q₂] obtained through the singular value decomposition (SVD) of the n × (m + p) matrix n^{−1/2}XB:

n^{−1/2}XB = PΔQ′ = P₁Δ₁Q₁′,  (8)

where Δ is the diagonal matrix of order m + p whose leading p × p diagonal block is the positive definite matrix Δ₁ and whose remaining diagonal block is O_m, the m × m matrix of zeros. Here, rank(XB) = p is assumed, and P and Q satisfy P′P = Q′Q = QQ′ = I_{m+p}, with P₁ and Q₁ being n × p and (m + p) × p matrices, respectively.

Although (7) and (8) show that Z cannot be uniquely determined, the p-variables × (m + p)-factors covariance matrix n⁻¹X′Z = [n⁻¹X′F, n⁻¹X′U] = [A, n⁻¹X′U] used for updates (4) and (6) is given uniquely by

n⁻¹X′Z = (n^{−1/2}X)′(n^{−1/2}Z) = (B⁺′Q₁Δ₁P₁′)(PQ′) = B⁺′Q₁Δ₁Q₁′.  (9)

This equality follows from the fact that the Moore-Penrose inverse of B is given by B⁺ = B′(BB′)⁻¹, since rank(XB) = p implies that B is of full row rank: the use of BB⁺ = I_p in (8) leads to n^{−1/2}X = P₁Δ₁Q₁′B⁺, which is transposed and post-multiplied by (7) to give (9) (Adachi, 2012). Comparing (9) with (5) and (6), we find that they can be rewritten as

A = B⁺′Q₁Δ₁Q₁′H_m,  (5′)
Ψ = diag(B⁺′Q₁Δ₁Q₁′H_p),  (6′)

using H_m = [I_m, O_{m×p}]′ and H_p = [O_{p×m}, I_p]′, with O_{m×p} the m × p matrix of zeros. Here, we should distinguish between Ψ on the left-hand side of (6′) and its counterpart in B = [Λ, Ψ] on the right-hand side: the former is the updated one, while the latter is the one from the previous iteration.

The above equations show that Λ and Ψ can be updated without obtaining Z, provided only that the sample covariance matrix S = n⁻¹X′X is available, even when the original data matrix X is not given. That is, (8) shows that the eigenvalue decomposition (EVD) B′SB = QΔ²Q′ gives the matrices Q₁ and Δ₁ needed in (5′) and (6′), with (5′) being used for (4). Further, the resulting loss function value can be computed without the use of X: substituting (2), (5) and (6) into an expanded form of loss function (1), we can rewrite it as f(Λ, Ψ) = n tr S + n tr ΛΛ′ − 2n tr A′Λ − n tr Ψ². This can be simplified further into f(B) = n{tr S − tr(ΛΛ′ + Ψ²)} = n(tr S − tr BB′) by noting that (4) implies tr A′Λ = tr Λ′Λ. Then, the standardized loss function

f_S(B) = 1 − tr BB′ / tr S,  (10)

which takes values within [0, 1], can be used for convenience instead of f(Λ, Ψ).
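The covariance-based updates (5′), (6′) and (4) make one full pass of the algorithm expressible through S and B alone. The following numpy sketch is our reading of such a pass under the assumptions stated above; it is illustrative code, not the authors' implementation.

```python
import numpy as np

def als_iteration(S, Lam, psi, q):
    """One pass of the updates: EVD of B'SB, then (6'), (5') and (4).

    S: p x p sample covariance matrix; Lam: p x m loadings;
    psi: length-p diagonal of Psi; q: required number of zero loadings.
    """
    p, m = Lam.shape
    B = np.hstack([Lam, np.diag(psi)])               # B = [Lambda, Psi]
    evals, Q = np.linalg.eigh(B.T @ S @ B)           # B'SB = Q Delta^2 Q'
    order = np.argsort(evals)[::-1][:p]              # p leading eigenpairs
    Q1 = Q[:, order]                                 # (m+p) x p
    Delta1 = np.sqrt(np.maximum(evals[order], 0.0))  # diagonal of Delta_1
    Bplus_t = np.linalg.solve(B @ B.T, B)            # B^{+'} = (BB')^{-1} B
    C = Bplus_t @ (Q1 * Delta1) @ Q1.T               # B^{+'} Q1 Delta1 Q1', eq. (9)
    psi_new = np.diag(C[:, m:]).copy()               # update (6')
    A = C[:, :m]                                     # A from (5')
    Lam_new = A.copy()                               # update (4):
    Lam_new.ravel()[np.argsort(np.abs(A), axis=None)[:q]] = 0.0  # zero q smallest |a_jk|
    return Lam_new, psi_new

def standardized_loss(Lam, psi, S):
    """f_S(B) = 1 - tr(BB')/tr(S), eq. (10)."""
    return 1.0 - (np.sum(Lam**2) + np.sum(psi**2)) / np.trace(S)
```

Iterating als_iteration until the decrease of standardized_loss becomes negligible realizes Steps 2-6 of the algorithm summarized below.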

The optimal solution with sparseness (3) is thus given by the following algorithm:

Step 1. Initialize B = [Λ, Ψ].
Step 2. Perform the EVD of B′SB.
Step 3. Update Ψ with (6′).
Step 4. Obtain A with (5′).
Step 5. Update Λ with (4).
Step 6. Finish if convergence is reached; otherwise, go back to Step 2.

To avoid missing the global minimum, we run the algorithm multiple times with different random initializations of B in Step 1, and the optimal run is selected via a procedure described in Section 5. We denote the resulting solution of B as B̂_q = [Λ̂_q, Ψ̂_q], where the subscript q indicates the particular number of zeros used in (3).

4 Sparseness Selection

Sparseness can be restated as parsimony: the greater SP(Λ) is, the fewer parameters are to be estimated and the greater the resulting loss function value is. Thus, sparseness selection means choosing the FA model with the optimal combination of attained loss function value and parsimony. For such model selection, we can use information criteria (Schwarz, 1978), which are defined using maximum likelihood (ML) estimates. Although an ML method is not used in our algorithm, we assume that B̂_q = [Λ̂_q, Ψ̂_q] is equivalent to the ML FA solution which maximizes the log likelihood

L(Λ, Ψ) = −0.5n{log|ΛΛ′ + Ψ²| + tr S(ΛΛ′ + Ψ²)⁻¹}

with the locations of the zero loadings constrained to be those of Λ̂_q. Under this assumption, we propose to use the information criterion BIC (Schwarz, 1978) for choosing the optimal q. For B̂_q, BIC can be expressed as

BIC(q) = −2L(Λ̂_q, Ψ̂_q) − q log n + c*,  (11)

with c* a constant irrelevant to q. The optimal sparseness is thus defined as

q̂ = argmin_{q_min ≤ q ≤ q_max} BIC(q),  (12)

and B̂_q̂ (i.e., B̂_q with q = q̂) is chosen as the final solution B̂, with q_min = m(m − 1)/2 and q_max = p(m − 1).
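A compact sketch of this selection rule follows. Here run_als is a hypothetical driver (not defined in the paper) that iterates the Section 3 updates to convergence for a given q and returns (Λ̂_q, diagonal of Ψ̂_q); the constant c* is dropped since it does not affect the argmin.

```python
import numpy as np

def log_likelihood(Lam, psi, S, n):
    """L = -0.5 n {log|Sigma| + tr(S Sigma^{-1})} with Sigma = Lam Lam' + Psi^2."""
    Sigma = Lam @ Lam.T + np.diag(psi**2)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * n * (logdet + np.trace(S @ np.linalg.inv(Sigma)))

def select_sparseness(S, n, p, m, run_als):
    """Minimize BIC(q) over q_min <= q <= q_max, eqs. (11)-(12)."""
    q_min, q_max = m * (m - 1) // 2, p * (m - 1)
    best = None
    for q in range(q_min, q_max + 1):
        Lam, psi = run_als(S, q)                     # hypothetical ALS driver
        bic = -2.0 * log_likelihood(Lam, psi, S, n) - q * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, q, Lam, psi)
    _, q_hat, Lam_hat, psi_hat = best
    return q_hat, Lam_hat, psi_hat
```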

5 Simulation Study

We performed a simulation study to assess the proposed procedure with respect to its exactness in identifying the true sparseness and the locations of zero loadings, the goodness of recovery of the parameter values, and its sensitivity to local minima.

Figure 1: Three loading matrices of simple structure (left) and two of bi-factor structure (right). [The matrix diagrams are not reproduced here; in them, a blank cell denotes a zero loading, # a non-zero loading, and r a loading that is randomly zero or non-zero.]

We used the five types of true Λ shown in Figure 1. For each type, we generated 40 sets of {Λ, Ψ, S} by the following steps: 1) each diagonal element of Ψ was set to u(0.1^{1/2}, 0.7^{1/2}); 2) each non-zero value in Λ was set to u(0.4, 1), while each element denoted by r in Figure 1 was randomly set to zero or u(0.4, 1); 3) Λ was normalized so as to satisfy diag(ΛΛ′ + Ψ²) = I_p; 4) setting n = 200p, we sampled each row of X from the centred p-variate normal distribution with covariance matrix ΛΛ′ + Ψ²; 5) the inter-variable correlation matrix S was obtained from X. Here, u(α, β) denotes a value drawn from the uniform distribution on the range [α, β].

The procedures described in Sections 2, 3 and 4 were applied to the resulting 200 (= 40 × 5) matrices S, where the algorithm in Section 3 was run multiple times with a two-optimal-solutions stopping procedure. Writing B̂_q^(l) for the solution of B resulting from the l-th run, the procedure is as follows (a code sketch is given after this description):

Phase 1. Set L_q = 50 and obtain B̂_q^(l) for l = 1, …, L_q; find l* = argmin_{1≤l≤L_q} f_S(B̂_q^(l)) and set B̂_q = B̂_q^(l*).
Phase 2. Finish if B̂_q is equivalent to a B̃_q resulting from the l̃-th run with l̃ ≠ l*; otherwise, go to Phase 3.
Phase 3. Set L_q := L_q + 1, and let B̃_q be the output from another run.
Phase 4. Exchange B̂_q for B̃_q if f_S(B̃_q) < f_S(B̂_q).
Phase 5. Finish if B̂_q and B̃_q are equivalent or L_q = 200; otherwise, go back to Phase 3.

Here, the equivalence of B̂_q = [Λ̂_q, Ψ̂_q] and B̃_q = [Λ̃_q, Ψ̃_q] is defined as 2⁻¹(‖Λ̂_q − Λ̃_q‖₁/(mp) + ‖Ψ̂_q1_p − Ψ̃_q1_p‖₁/p) being less than 10⁻³, where ‖·‖₁ denotes the sum of the absolute values of the elements of its argument and 1_p is the p × 1 vector of ones. Except for B̂_q and B̃_q, the remaining L_q − 2 solutions are local minimizers; clearly, the value of L_q indicates the sensitivity of the algorithm to local minima. We obtained the average of the L_q values over all q for each data set. The quartiles of those averages over the 200 data sets were 89, 120, and 155, which demonstrates high sensitivity to local minima. Nevertheless, the good performance of the proposed procedure is shown next.
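The stopping rule in Phases 1-5 can be sketched as follows; run_once is a hypothetical helper performing one randomly initialized run of the Section 3 algorithm and returning (Λ, diagonal of Ψ), and standardized_loss is the f_S sketch given in Section 3.

```python
import numpy as np

def equivalent(sol_a, sol_b, p, m, tol=1e-3):
    """Equivalence criterion: averaged L1 differences below 10^{-3}."""
    (La, pa), (Lb, pb) = sol_a, sol_b
    d = 0.5 * (np.abs(La - Lb).sum() / (m * p) + np.abs(pa - pb).sum() / p)
    return d < tol

def multi_start(S, q, p, m, run_once, L_init=50, L_max=200):
    sols = [run_once(S, q) for _ in range(L_init)]         # Phase 1
    f_S = lambda sol: standardized_loss(sol[0], sol[1], S)
    best = min(sols, key=f_S)
    if any(equivalent(best, s, p, m) for s in sols if s is not best):
        return best                                        # Phase 2: found twice
    L = L_init
    while L < L_max:
        L += 1
        cand = run_once(S, q)                              # Phase 3: one more run
        if f_S(cand) < f_S(best):
            best, cand = cand, best                        # Phase 4: keep the better
        if equivalent(best, cand, p, m):
            return best                                    # Phase 5: found twice
    return best                                            # give up after L_max runs
```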

Table 1 shows the distributions, over the 200 data sets, of the indices measuring the correctness of q̂ and B̂.

Table 1: Distributions of the indices for the correctness of the estimated sparseness and parameters. [The columns give percentiles of BSE, the identification rates R_00 and R_##, and the two mean absolute differences; the numerical entries are not reproduced here.]

The percentiles of BSE = (q − q̂)/q, which assesses the relative bias of the estimated sparseness from the true q, show that sparseness was satisfactorily estimated, though it tended to be underestimated. The indices R_00 and R_## are the rates of the zero and non-zero elements in the true Λ correctly identified by Λ̂; the non-zero elements are found in Table 1 to have been identified exactly. The fourth and fifth indices are the mean absolute differences ‖Λ − Λ̂‖₁/(pm) and ‖(Ψ² − Ψ̂²)1_p‖₁/p, whose percentiles show that the parameter values were recovered very well.

6 Conclusions

In order to overcome the difficulties with EFA and CFA, we proposed a new FA procedure in which the optimal solution is estimated subject to a direct sparseness constraint on the loadings and the best sparseness is selected using BIC. The simulation study demonstrated that the procedure recovers the true sparseness and parameter values well.

References

1. Adachi, K.: Some contributions to data-fitting factor analysis with empirical comparisons to covariance-fitting factor analysis. J. Japan. Soc. Comp. Stat., 25 (2012).
2. de Leeuw, J.: Least squares optimal scaling of partially observed linear systems. In: van Montfort, K., Oud, J., Satorra, A. (eds.) Recent Developments on Structural Equation Models: Theory and Applications. Kluwer Academic Publishers, Dordrecht (2004).
3. Jolliffe, I.T., Trendafilov, N.T., Uddin, M.: A modified principal component technique based on the LASSO. J. Comp. Graph. Stat., 12 (2003).
4. Trendafilov, N.T., Unkel, S.: Exploratory factor analysis of data matrices with more variables than observations. J. Comp. Graph. Stat., 20 (2011).
5. Unkel, S., Trendafilov, N.T.: Simultaneous parameter estimation in exploratory factor analysis: An expository review. International Stat. Review, 78 (2010).
6. Mulaik, S.A.: Foundations of Factor Analysis, Second Edition. CRC Press, Boca Raton (2010).
7. Schwarz, G.: Estimating the dimension of a model. Ann. Stat., 6 (1978).
8. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comp. Graph. Stat., 15 (2006).
