Regularized NNLS Algorithms for Nonnegative Matrix Factorization with Application to Text Document Clustering


Rafal Zdunek

Institute of Telecommunications, Teleinformatics and Acoustics, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, Wroclaw, Poland
rafal.zdunek@pwr.wroc.pl

R. Burduk et al. (Eds.): Computer Recognition Systems 4, AISC 95, springerlink.com, Springer-Verlag Berlin Heidelberg 2011

Abstract. Nonnegative Matrix Factorization (NMF) has recently received much attention, both in its algorithmic aspects and in applications. Text document clustering and supervised classification are important applications of NMF. Various types of numerical optimization algorithms have been proposed for NMF, including multiplicative, projected gradient descent, alternating least squares and active-set algorithms. In this paper, we discuss selected Non-Negatively constrained Least Squares (NNLS) algorithms (a family derived from the NNLS algorithm proposed by Lawson and Hanson) that belong to the class of active-set methods. We noticed that applying the NNLS algorithm to the Tikhonov regularized LS objective function with an exponentially decreasing regularization parameter considerably increases the accuracy of data clustering and reduces the risk of getting stuck in unfavorable local minima. Moreover, the experiments demonstrate that the regularized NNLS algorithm is superior to many well-known NMF algorithms used for text document clustering.

1 Introduction

NMF decomposes an input matrix into lower-rank factors that have nonnegative values and usually some physical meaning or interpretation. The reduced dimensionality is very useful in data analysis because it brings denoising, computational efficiency, greater interpretability and easier visualization. Hence, NMF has already found diverse applications [2, 4, 7, 11, 12, 18, 20, 21, 22, 23, 24] in multivariate data analysis, machine learning, and signal/image processing.

Text document clustering can be regarded as the unsupervised classification of documents into groups (clusters) that have similar semantic features. The grouping can basically be obtained with hierarchical (tree-like structure) or partitional (flat clustering) techniques [13]. Xu et al. [24] showed that NMF can perform text document partitioning by capturing semantic features in a document collection and grouping the documents according to the similarity of these features. Unlike Latent Semantic Indexing (LSI), NMF provides nonnegative semantic vectors that do not have to be orthogonal or even independent. Moreover, the coefficients of linear combinations of the semantic vectors are also nonnegative, which means that only additive combinations are permitted. The semantic vectors in NMF are also easier to interpret: each of them represents a base topic. Thus, each document can be represented by an additive combination of the base topics. Partitional clustering can be readily performed in the space of linear combination coefficients.

The intensive development of NMF-based techniques for document clustering has produced a variety of nonnegativity constrained optimization algorithms. Xu et al. [24] applied the standard and weighted Lee-Seung multiplicative algorithms to document clustering, and demonstrated that NMF outperforms SVD-based clustering methods. Shahnaz et al. [22] proposed the GD-CLS algorithm, which combines the standard multiplicative algorithm for updating the semantic vectors with the regularized projected LS algorithm, using a constant regularization parameter, for updating the other factor. Several NMF algorithms (tri-NMF, semi-orthogonal and convex NMF) for partitional clustering have been developed by Ding et al. [8, 9, 10, 19]. Another group of NMF algorithms is based on the so-called Graph NMF (GNMF) [6], which incorporates a priori information on the local structure of the clustered data. Locality Preserving NMF (LPNMF), proposed by Cai et al. [5] for document clustering, is an extension of GNMF.
The above-mentioned algorithms mostly perform multiplicative updates that ensure monotonic convergence, but their convergence rate is usually very slow. To accelerate convergence, several additive update algorithms have been proposed, including Projected Gradient (PG) descent, Alternating Least Squares (ALS) and active-set algorithms. A survey of PG and ALS algorithms for NMF can be found in [7, 25]. Since the estimated factors in NMF-based document clustering are expected to be large and very sparse, active-set algorithms seem to be a good choice.

In this paper, we discuss selected active-set algorithms that are inspired by the Non-Negative Least Squares (NNLS) algorithm, proposed by Lawson and Hanson [17] in 1974. Bro and de Jong [3] considerably accelerated the NNLS algorithm by rearranging the computations for cross-product matrices. The solution given by the NNLS algorithms is proved to be optimal according to the Karush-Kuhn-Tucker (KKT) conditions. Unfortunately, these algorithms are not very efficient for solving nonnegativity constrained linear systems with multiple Right-Hand Side (RHS) vectors, since they compute a complete pseudoinverse once for each RHS. To tackle this problem, Van Benthem and Keenan [1] devised the Fast Combinatorial NNLS (FC-NNLS) algorithm and experimentally demonstrated that it works efficiently for energy-dispersive X-ray spectroscopy data. The FC-NNLS algorithm also works efficiently for some NMF problems. Kim and Park [16] applied the FC-NNLS algorithm to the $\ell_1$- and $\ell_2$-norm regularized LS problems in NMF, and showed that the regularized NNLS algorithms work very efficiently for gene expression microarrays. Their approach assumes constant regularization parameters to enforce the desired degree of sparsity. Here we apply the FC-NNLS algorithm to text document clustering, assuming that the regularization parameter decreases exponentially with iterations to trigger a given character of the iterative updates.

The paper is organized as follows. The next section discusses the concept of NMF for document clustering. Section 3 is concerned with the NNLS algorithms. The experiments for text document clustering are presented in Section 4. Finally, the conclusions are given in Section 5.

2 NMF for Document Clustering

According to the model of document representation introduced in [24], the collection of documents to be clustered is first subjected to stop-word removal and word stemming, and then the whole processed collection is represented by the term-document matrix $Y = [y_{it}] \in \mathbb{R}_+^{I \times T}$, where $T$ is the number of documents in the collection, $I$ is the number of words after preprocessing of the documents, and $\mathbb{R}_+$ is the nonnegative orthant of the Euclidean space $\mathbb{R}$. Each entry of $Y$ is modeled by:

$$y_{it} = t_{it} \log\left(\frac{T}{d_i}\right), \tag{1}$$

where $t_{it}$ denotes the frequency of the $i$-th word in the $t$-th document, and $d_i$ is the number of documents containing the $i$-th word. Additionally, each column vector of $Y$ is normalized to the unit $\ell_2$-norm. The matrix $Y$ is very sparse, since each document contains only a small portion of the words that occur in the whole collection of documents.

The aim of NMF is to find lower-rank nonnegative matrices $A \in \mathbb{R}^{I \times J}$ and $X \in \mathbb{R}^{J \times T}$ such that $Y = AX \in \mathbb{R}^{I \times T}$, given the matrix $Y$, the lower rank $J$ (the number of topics), and possibly prior knowledge on the matrices $A$ and $X$.
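The weighting in (1) and the subsequent column normalization can be illustrated with a minimal sketch. It is not the preprocessing code used in the paper: the function name `term_document_matrix`, the toy count matrix, and the convention of giving weight zero to words that occur in no document are assumptions made for illustration only, and stop-word removal and stemming are taken as already done.

```python
import math

def term_document_matrix(counts):
    """Build Y from raw counts, where counts[i][t] is the frequency of
    word i in document t (a list of I rows, each of length T)."""
    I = len(counts)
    T = len(counts[0])
    # d_i: number of documents that contain word i
    d = [sum(1 for t in range(T) if counts[i][t] > 0) for i in range(I)]
    # Eq. (1): y_it = t_it * log(T / d_i); unused words get weight 0 (assumption)
    Y = [[counts[i][t] * math.log(T / d[i]) if d[i] > 0 else 0.0
          for t in range(T)] for i in range(I)]
    # Normalize every column to unit l2-norm; all-zero columns are left as is
    for t in range(T):
        norm = math.sqrt(sum(Y[i][t] ** 2 for i in range(I)))
        if norm > 0:
            for i in range(I):
                Y[i][t] /= norm
    return Y

# Hypothetical toy collection: 3 words, 3 documents
counts = [
    [2, 0, 1],  # word 0 occurs in documents 0 and 2
    [0, 3, 0],  # word 1 occurs only in document 1
    [1, 1, 1],  # word 2 occurs in every document, so log(T / d_i) = 0
]
Y = term_document_matrix(counts)
```

A word that appears in every document receives zero weight, which is one reason the resulting matrix is sparse even beyond the sparsity of the raw counts.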
Assuming that each column vector of $Y = [y_1, \ldots, y_T]$ represents a single observation (a datum point in $\mathbb{R}^I$), and $J$ is the a priori known number of clusters (topics), we can interpret the column vectors of $A = [a_1, \ldots, a_J]$ as the semantic feature vectors or centroids (indicating the directions of the central points of clusters in $\mathbb{R}^I$) and the entry $x_{jt}$ in $X = [x_{jt}]$ as an indicator (probability) of how the $t$-th document is related to the $j$-th topic.

To estimate the matrices $A$ and $X$ from $Y$, we assume the squared Euclidean distance:

$$D(Y \| AX) = \frac{1}{2} \| Y - AX \|_F^2. \tag{2}$$

The gradient matrices of (2) with respect to $A$ and $X$ are expressed by:

$$G_A = [g_{ij}^{(A)}] = \nabla_A D(Y \| AX) = (AX - Y) X^T \in \mathbb{R}^{I \times J}, \tag{3}$$

$$G_X = [g_{jt}^{(X)}] = \nabla_X D(Y \| AX) = A^T (AX - Y) \in \mathbb{R}^{J \times T}. \tag{4}$$

From the stationarity of the objective function, we have $G_A = 0$ and $G_X = 0$, which leads to the ALS algorithm:

$$A \leftarrow Y X^T (X X^T)^{-1}, \qquad X \leftarrow (A^T A)^{-1} A^T Y. \tag{5}$$

The update rules (5) do not guarantee nonnegative solutions, which are essential for NMF. Some additional operations are necessary to enforce nonnegative entries. The simplest approach is to replace negative entries with zeros, which leads to the projected ALS algorithm:

$$A \leftarrow P_{\Omega_A}\left[ Y X^T (X X^T)^{-1} \right], \qquad X \leftarrow P_{\Omega_X}\left[ (A^T A)^{-1} A^T Y \right], \tag{6}$$

where

$$\Omega_A = \{ A \in \mathbb{R}^{I \times J} : a_{ij} \ge 0 \}, \qquad \Omega_X = \{ X \in \mathbb{R}^{J \times T} : x_{jt} \ge 0 \}. \tag{7}$$

Unfortunately, the projected ALS algorithm does not ensure that the limit point is optimal in the sense that it satisfies the KKT optimality conditions:

$$g_{ij}^{(A)} = 0 \ \text{if} \ a_{ij} > 0, \quad \text{and} \quad g_{ij}^{(A)} > 0 \ \text{if} \ a_{ij} = 0, \tag{8}$$

$$g_{jt}^{(X)} = 0 \ \text{if} \ x_{jt} > 0, \quad \text{and} \quad g_{jt}^{(X)} > 0 \ \text{if} \ x_{jt} = 0, \tag{9}$$

and the complementary slackness:

$$g_{ij}^{(A)} a_{ij} = 0, \qquad g_{jt}^{(X)} x_{jt} = 0. \tag{10}$$

3 NNLS Algorithms

Several update rules that satisfy the KKT conditions have been proposed in the literature. Here we restrict our considerations to the NNLS algorithms for updating the matrix $X$, given the matrices $A$ and $Y$. The discussed algorithms can also update the matrix $A$ by applying them to the transposed system $X^T A^T = Y^T$.

3.1 NNLS Algorithm for Single RHS

The NNLS algorithm was originally proposed by Lawson and Hanson [17], and it is currently used by the command lsqnonneg(.) in Matlab. This algorithm (denoted by LH-NNLS) iteratively partitions the unknown variables into the passive set $P$, which contains the basic variables, and the active set $R$, which contains the active constraints, and updates only the basic variables until the complementary slackness condition in (10) is met. Let $P = \{ j : x_{jt} > 0 \}$ and $R = \{1, \ldots, J\} \setminus P$, with the partition $\forall t : x_t = [x_t^{(P)}; x_t^{(R)}]^T \in \mathbb{R}^J$ and $g_t = \nabla_{x_t} D(y_t \| A x_t) = [g_t^{(P)}; g_t^{(R)}]^T \in \mathbb{R}^J$.
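The partition into the passive set $P$ and the active set $R$, together with the conditions (9) and (10) for a single column, can be checked numerically. The following is a small illustrative sketch, not part of the original algorithms: the helper names `gradient` and `partition_and_check`, the tolerance `tol`, and the toy system are assumptions, and the strict inequality in (9) is relaxed to a non-strict one up to the tolerance.

```python
def gradient(A, x, y):
    """g = A^T (A x - y), with A given as a list of I rows of length J."""
    I, J = len(A), len(A[0])
    residual = [sum(A[i][j] * x[j] for j in range(J)) - y[i] for i in range(I)]
    return [sum(A[i][j] * residual[i] for i in range(I)) for j in range(J)]

def partition_and_check(A, x, y, tol=1e-10):
    """Return the passive set P, the active set R, the gradient g, and a flag
    telling whether x satisfies the KKT conditions up to tol."""
    g = gradient(A, x, y)
    J = len(x)
    passive = [j for j in range(J) if x[j] > 0]    # set P: basic variables
    active = [j for j in range(J) if x[j] <= 0]    # set R: active constraints
    kkt = (all(x[j] >= 0 for j in range(J))            # feasibility
           and all(abs(g[j]) <= tol for j in passive)  # g_j = 0 where x_j > 0
           and all(g[j] >= -tol for j in active))      # g_j >= 0 where x_j = 0
    return passive, active, g, kkt

# Hypothetical toy system with I = 3 and J = 2
A = [[1, 0], [0, 1], [0, 0]]
y = [2, -1, 0]
optimal = partition_and_check(A, [2.0, 0.0], y)      # nonnegative LS solution
not_optimal = partition_and_check(A, [1.0, 0.0], y)  # feasible but not stationary
```

In this toy case the first candidate has a vanishing gradient on $P$ and a positive gradient on $R$, so the check passes; the second candidate is feasible but has a nonzero gradient on a passive variable, so it fails.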

The columns of $A$ can also be partitioned in a similar way: $A = [A_P \ A_R]$, where $A_P = [a_{*,P}]$ and $A_R = [a_{*,R}]$. The basic variables can be estimated by solving the unconstrained LS problem:

$$\min_{x_t^{(P)}} \left\{ \| A_P x_t^{(P)} - y_t \|_2 \right\}, \tag{11}$$

where $A_P$ has full column rank. The nonbasic variables at the KKT optimality point should be equal to zero. In contrast to the projected ALS, the NNLS algorithm does not replace negative entries with zeros, which would be equivalent to determining the nonbasic variables from unconstrained LS updates; instead, it starts with all variables nonbasic and tries to iteratively update the set of basic variables. The LH-NNLS algorithm is given by Algorithm 1.

Algorithm 1: LH-NNLS Algorithm
Input: $A \in \mathbb{R}^{I \times J}$, $y_t \in \mathbb{R}^I$
Output: $x_t \ge 0$ such that $x_t = \arg\min_{x_t} \| A x_t - y_t \|_2$
1 Initialization: $P = \emptyset$, $R = \{1, \ldots, J\}$, $x_t = 0$, $g_t = -A^T y_t$;
2 repeat
3   $k \leftarrow k + 1$;
4   $m = \arg\min_{j \in R} \{ g_{jt} \}$; // the constraint to add
5   if $g_{mt} < 0$ then
6     $P \leftarrow P \cup m$, and $R \leftarrow R \setminus m$;
7   else
8     stop with $x_t$ as an optimal solution
9   $\bar{x}_t^{(P)} = \left( (A_P)^T A_P \right)^{-1} (A_P)^T y_t$, where $A_P = [a_{*,P}] \in \mathbb{R}^{I \times |P|}$;
10  while $\bar{x}_{jt}^{(P)} \le 0$ for $j = 1, \ldots, J$ do
11    $\alpha = \min_{j \in P,\ \bar{x}_{jt}^{(P)} \le 0} \left\{ \frac{x_{jt}^{(P)}}{x_{jt}^{(P)} - \bar{x}_{jt}^{(P)}} \right\}$; // the step length
12    $x_t \leftarrow \left[ x_t^{(P)} + \alpha (\bar{x}_t^{(P)} - x_t^{(P)}); 0 \right]^T$;
13    $N = \{ j : x_{jt} = 0 \}$; // the constraint to drop
14    $P \leftarrow P \setminus N$, and $R \leftarrow R \cup N$;
15    $\bar{x}_t^{(P)} = \left( (A_P)^T A_P \right)^{-1} (A_P)^T y_t$, where $A_P = [a_{*,P}] \in \mathbb{R}^{I \times |P|}$;
16  $x_t \leftarrow \left[ \bar{x}_t^{(P)}; 0 \right]^T$;
17 until $R = \emptyset$ or $\max_{j \in R} \{ g_{jt} \} < \mathrm{tol}$;

Bro and de Jong [3] considerably sped up this algorithm for $I \gg J$ by precomputing the normal matrix $A^T A$ and the vector $A^T y_t$, and then replacing steps 9 and 15 in Algorithm 1 with:

$$\bar{x}_t^{(P)} = \left( (A^T A)_{P,P} \right)^{-1} (A^T y_t)_P. \tag{12}$$

Unfortunately, the inverse of $(A^T A)_{P,P}$ must be computed for each $t$, which is very expensive if the number of RHS vectors is very large.

3.2 Regularized NNLS Algorithm for Multiple RHS

Van Benthem and Keenan [1] tackled the problem of the high computational cost for multiple RHS. They noticed that for a sparse solution with multiple column vectors, the probability of finding column vectors that have the same layout of zero entries (active constraints) is high. Hence, after detecting such vectors in $X$, their passive entries can be updated by computing the inverse of $(A^T A)_{P,P}$ only once. The NNLS algorithm proposed by Van Benthem and Keenan is referred to as FC-NNLS. The CSSLS algorithm (Algorithm 3) is a part of the FC-NNLS algorithm. We applied the FC-NNLS algorithm to the penalized Euclidean function:

$$D(Y \| AX) = \frac{1}{2} \| Y - AX \|_F^2 + \lambda \| X \|_F^2, \tag{13}$$

where $\lambda$ is a regularization parameter. In consequence, we have the Regularized FC-NNLS algorithm (Algorithm 2). The aim of using regularization for the discussed problems is to trigger the character of the iterations rather than to stabilize ill-posed problems, since the matrix $A$ is not expected to be ill-conditioned.

Algorithm 2: Regularized FC-NNLS Algorithm
Input: $A \in \mathbb{R}^{I \times J}$, $Y \in \mathbb{R}^{I \times T}$, $\lambda \in \mathbb{R}_+$
Output: $X \ge 0$ such that $X = \arg\min_X \| AX - Y \|_F + \lambda \| X \|_F$
1 Initialization: $M = \{1, \ldots, T\}$, $N = \{1, \ldots, J\}$;
2 Precompute: $B = [b_{ij}] = A^T A + \lambda I_J$ and $C = [c_{jt}] = A^T Y$;
3 $X = B^{-1} C$; // unconstrained minimizer
4 $P = [p_{jt}]$, where $p_{jt} = 1$ if $x_{jt} > 0$, and $0$ otherwise; // passive entries
5 $F = \{ t \in M : \sum_j p_{jt} \ne J \}$; // set of columns to be optimized
6 $x_{jt} \leftarrow x_{jt}$ if $p_{jt} = 1$, and $0$ otherwise;
7 while $F \ne \emptyset$ do
8   $P_F = [p_{*,F}] \in \mathbb{R}^{J \times |F|}$, $C_F = [c_{*,F}] \in \mathbb{R}^{J \times |F|}$;
9   $[x_{*,F}] = \mathrm{cssls}(B, C_F, P_F)$; // solved with the CSSLS algorithm
10  $H = \{ t \in F : \min_{j \in N} \{ x_{jt} \} < 0 \}$; // columns with negative variables
11  while $H \ne \emptyset$ do
12    $\forall s \in H$, select the variables to move out of the passive set $P$;
13    $P_H = [p_{*,H}] \in \mathbb{R}^{J \times |H|}$, $C_H = [c_{*,H}] \in \mathbb{R}^{J \times |H|}$;
14    $[x_{*,H}] = \mathrm{cssls}(B, C_H, P_H)$;
15    $H = \{ t \in F : \min_{j \in N} \{ x_{jt} \} < 0 \}$; // columns with negative variables
16  $W = [w_{jt}] = C_F - B X_F$, where $X_F = [x_{*,F}]$; // negative gradient
17  $Z = \{ t \in F : \sum_j w_{jt} (1 - P_F)_{jt} = 0 \}$; // set of optimized columns
18  $F \leftarrow F \setminus Z$; // set of columns to be optimized
19  $p_{jt} = 1$ if $j = \arg\max_j \{ w_{jt} (1 - P_F)_{jt}, \ \forall t \in F \}$, and $p_{jt}$ otherwise; // updating

Algorithm 3: CSSLS Algorithm
Input: $B \in \mathbb{R}^{J \times J}$, $C \in \mathbb{R}^{J \times K}$, $P \in \mathbb{R}^{J \times K}$
Output: $X$
1 $M = \{1, \ldots, K\}$, $N = \{1, \ldots, J\}$, $P = [p_1, \ldots, p_K]$;
2 Find the set of $L$ unique columns in $P$: $U = [u_1, \ldots, u_L] = \mathrm{unique}\{P\}$;
3 $d_j = \{ t \in M : p_t = u_j \}$; // columns with identical passive sets
4 for $j = 1, \ldots, L$ do
5   $[X]_{u_j, d_j} = \left( [B]_{u_j, u_j} \right)^{-1} [C]_{u_j, d_j}$

3.3 Regularized NNLS-NMF Algorithm

The Regularized NNLS-NMF algorithm is given by Algorithm 4.

Algorithm 4: Regularized NNLS-NMF Algorithm
Input: $Y \in \mathbb{R}^{I \times T}$, $J$ - lower rank, $\lambda_0$ - initial regularization parameter, $\tau$ - decay rate, $\bar{\lambda}$ - minimal value of the regularization parameter
Output: Factors $A$ and $X$
1 Initialize (randomly) $A$ and $X$;
2 repeat
3   $k \leftarrow k + 1$;
4   $\lambda = \bar{\lambda} + \lambda_0 \exp\{ -\tau k \}$; // regularization parameter schedule
5   $X \leftarrow \mathrm{fcnnls}(A, Y, \lambda)$; // update for X
6   $d_j^{(X)} = \sqrt{\sum_{t=1}^{T} x_{jt}^2}$, $X \leftarrow \mathrm{diag}\{ (d_j^{(X)})^{-1} \} X$, $A \leftarrow A \, \mathrm{diag}\{ d_j^{(X)} \}$;
7   $\bar{A} \leftarrow \mathrm{fcnnls}(X^T, Y^T, \lambda)$, $A = \bar{A}^T$; // update for A
8   $d_j^{(A)} = \sqrt{\sum_{i=1}^{I} a_{ij}^2}$, $X \leftarrow \mathrm{diag}\{ d_j^{(A)} \} X$, $A \leftarrow A \, \mathrm{diag}\{ (d_j^{(A)})^{-1} \}$;
9 until the stop criterion is satisfied;

Motivated by [26], we propose to gradually decrease the regularization parameter $\lambda$ with the alternating iterations for NMF, starting from a large value $\lambda_0$. Considering the updates for $X$ and $A$ in terms of the Singular Value Decomposition (SVD) of these matrices, it is clear that if $\lambda$ is large, only the right singular vectors that correspond to the largest singular values take part in the updating process. When the number of iterations is large (in practice about 100), $\lambda \to \bar{\lambda}$, where we take $\bar{\lambda} = 10^{-12}$, and the singular vectors corresponding to the smallest singular values participate in the updates.

4 Experiments

The experiments are carried out for two databases of text documents. The first database (D1) is created from the Reuters database by randomly selecting 500 documents from the following topics: acq, coffee, crude, earn, gold, interest, money-fx, ship, sugar, trade. The database D1 has 5960 distinct words; thus $Y \in \mathbb{R}^{5960 \times 500}$ and $J = 10$. The other database (D2) comes from the TopicPlanet document collection. We selected 178 documents classified into 6 topics: air-travel, broadband, cruises, domain-names, investments, technologies, which gives 8054 words. Thus $Y \in \mathbb{R}^{8054 \times 178}$ and $J = 6$.

For clustering, we used the following algorithms: standard Lee-Seung NMF for the Euclidean distance [18], GD-CLS [22], Projected ALS [2, 7], Uni-orth NMF [10] separately for the matrices A and X, Bi-orth NMF [10], Convex-NMF [8], Newton NMF [7], k-means (from Matlab 2008) with a cosine measure, LSI, and Regularized FC-NNLS (RFC-NNLS). All the algorithms are terminated after 20 iterations. For the RFC-NNLS algorithm, we set the following parameters: $\bar{\lambda} = 10^{-12}$, $\lambda_0 = 10^{8}$ and $\tau = 0.2$. All the algorithms are intensively tested under the software developed by M. Jankowiak [14].

Table 1 Mean accuracy, standard deviation, and averaged elapsed time (in seconds) over 10 MC runs of the tested NMF algorithms. Columns, for each of TopicPlanet and Reuters: Accuracy, Std., Time [sec]. Rows: Lee-Seung NMF, GD-CLS, Projected ALS, Uni-orth (A), Uni-orth (X), Bi-orth, Convex-NMF, Newton-NMF, k-means, LSI, RFC-NNLS.

Each tested algorithm (except for LSI) is run for 10 Monte Carlo (MC) trials with a random initialization. The algorithms are evaluated (Table 1) in terms of the averaged accuracy of clustering, the standard deviation, and the averaged elapsed time over 10 trials. The accuracy is expressed as the ratio of correctly classified documents to the total number of documents.

5 Conclusions

We proposed the Tikhonov regularized version of the FC-NNLS (RFC-NNLS) algorithm and efficiently applied it to text document clustering. The experiments demonstrate that the RFC-NNLS algorithm outperforms all the tested algorithms in terms of accuracy and resistance to initialization. The computational complexity of the RFC-NNLS algorithm is comparable to that of fast algorithms such as the standard Lee-Seung NMF, ALS, Newton-NMF, and LSI.

Acknowledgment. This work was supported by the habilitation grant N N ( ) from the Ministry of Science and Higher Education, Poland.

References

[1] Benthem, M.H.V., Keenan, M.R.: J. Chemometr. 18 (2004)
[2] Berry, M., Browne, M., Langville, A., Pauca, P., Plemmons, R.: Comput. Stat. Data An. 52 (2007)
[3] Bro, R., Jong, S.D.: J. Chemometr. 11 (1997)
[4] Buciu, I., Pitas, I.: Application of non-negative and local nonnegative matrix factorization to facial expression recognition. In: Proc. Intl. Conf. Pattern Recognition (ICPR) (2004)
[5] Cai, D., He, X., Wu, X., Bao, H., Han, J.: Locality preserving nonnegative matrix factorization. In: Proc. IJCAI 2009 (2009)
[6] Cai, D., He, X., Wu, X., Han, J.: Nonnegative matrix factorization on manifold. In: Proc. 8th IEEE Intl. Conf. Data Mining (ICDM) (2008)
[7] Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.I.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley and Sons, Chichester (2009)
[8] Ding, C., Li, T., Jordan, M.I.: IEEE T. Pattern. Anal. 32 (2010)
[9] Ding, C., Li, T., Peng, W.: Nonnegative matrix factorization and probabilistic latent semantic indexing: Equivalence, chi-square statistic, and a hybrid method. In: Proc. AAAI National Conf. Artificial Intelligence (AAAI 2006) (2006)
[10] Ding, C., Li, T., Peng, W., Park, H.: Orthogonal nonnegative matrix tri-factorizations for clustering. In: Proc. 12th ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining. ACM Press, New York (2006)
[11] Du, Q., Kopriva, I.: Neurocomputing 72 (2009)
[12] Heiler, M., Schnoerr, C.: J. Mach. Learn. Res. 7 (2006)
[13] Jain, A.K., Murty, M.N., Flynn, P.J.: ACM Comput. Surv. 31 (1999)
[14] Jankowiak, M.: Application of nonnegative matrix factorization for text document classification. MSc thesis (supervised by Dr. R. Zdunek), Wroclaw University of Technology, Poland (2010) (in Polish)
[15] Kim, H., Park, H.: Bioinformatics 23 (2007)
[16] Kim, H., Park, H.: SIAM J. Matrix Anal. A 30 (2008)
[17] Lawson, C.L., Hanson, R.J.: Solving Least Squares Problems. Prentice-Hall, Englewood Cliffs (1974)
[18] Lee, D.D., Seung, H.S.: Nature 401 (1999)
[19] Li, T., Ding, C.: The relationships among various nonnegative matrix factorization methods for clustering. In: Proc. 6th Intl. Conf. Data Mining (ICDM 2006). IEEE Computer Society, Washington DC, USA (2006)
[20] O'Grady, P., Pearlmutter, B.: Neurocomputing 72 (2008)
[21] Sajda, P., Du, S., Brown, T.R., Stoyanova, R., Shungu, D.C., Mao, X., Parra, L.C.: IEEE T. Med. Imaging 23 (2004)
[22] Shahnaz, F., Berry, M., Pauca, P., Plemmons, R.: Inform. Process. Manag. 42 (2006)
[23] Sra, S., Dhillon, I.S.: Nonnegative matrix approximation: Algorithms and Applications. UTCS Technical Report TR-06-27, Austin, USA (2006)
[24] Xu, W., Liu, X., Gong, Y.: Document clustering based on non-negative matrix factorization. In: SIGIR 2003: Proc. 26th Annual Intl. ACM SIGIR Conf. Research and Development in Information Retrieval. ACM Press, New York (2003)
[25] Zdunek, R., Cichocki, A.: Comput. Intel. Neurosci. (939567) (2008)
[26] Zdunek, R., Phan, A.H., Cichocki, A.: Aust. J. Intel. Inform. Process. Syst. 12 (2010)
