Cholesky Decomposition Rectification for Non-negative Matrix Factorization


Tetsuya Yoshida
Graduate School of Information Science and Technology, Hokkaido University
N-14 W-9, Sapporo, Japan

Abstract. We propose a method based on Cholesky decomposition for Non-negative Matrix Factorization (NMF). NMF learns local representations thanks to its non-negativity constraint. However, when NMF is used as a representation learning method, the issues caused by the non-orthogonality of the learned representation have not been addressed. Since NMF learns both feature vectors and data vectors in the feature space, the proposed method 1) estimates the metric in the feature space from the learned feature vectors, 2) applies Cholesky decomposition to the metric to obtain its upper triangular factor, and 3) uses the upper triangular matrix as a linear mapping for the data vectors. The proposed approach is evaluated on several real-world datasets. The results indicate that it is effective and improves performance.

1 Introduction

Previous representation learning methods have not explicitly considered the characteristics of the algorithms applied to the learned representation [4]. When applying Non-negative Matrix Factorization (NMF) [5,6,8,1] to document clustering, the number of features is in most cases set to the number of clusters [8,2]. However, when the number of features is increased, the non-orthogonality of the features learned by NMF hinders the effective utilization of the learned representation.

We propose a method based on Cholesky decomposition [3] to remedy the problem caused by the non-orthogonality of the features learned by NMF. Since NMF learns both feature vectors and data vectors in the feature space, the proposed method 1) first estimates the metric in the feature space from the learned feature vectors, 2) applies Cholesky decomposition to the metric to obtain its upper triangular factor, and 3) finally uses the upper triangular matrix as a linear mapping for the data vectors. The proposed method is evaluated on several document clustering problems, and the results indicate its effectiveness. In particular, the proposed method enables the effective utilization of the representation learned by NMF without modifying the algorithms applied to that representation. No label information is required to exploit the metric in the feature space, and the method is fast and robust, since it relies on Cholesky decomposition [3].

M. Kryszkiewicz et al. (Eds.): ISMIS 2011, LNAI 6804. © Springer-Verlag Berlin Heidelberg 2011

2 Cholesky Decomposition Rectification for NMF

We use a bold capital letter for a matrix and a lower-case italic letter for a vector. X_ij stands for the (i, j)-th element of a matrix X, tr(·) stands for the trace of a matrix, and X^T stands for the transpose of X.

2.1 Non-negative Matrix Factorization

For a specified number of features q, Non-negative Matrix Factorization (NMF) [6] factorizes a non-negative matrix X = [x_1, ..., x_n] ∈ R_+^{p×n} into two non-negative matrices U = [u_1, ..., u_q] ∈ R_+^{p×q} and V = [v_1, ..., v_n] ∈ R_+^{q×n} such that

    X ≈ UV    (1)

Each x_i is approximated as a linear combination of u_1, ..., u_q. The matrices U and V are obtained by minimizing the objective function

    J_0 = ||X − UV||^2    (2)

where ||·|| stands for a matrix norm; in this paper we focus on the Frobenius norm ||·||_F [6]. Compared with methods based on eigenvalue analysis such as PCA, each element of U and V is non-negative, and their column vectors are not necessarily orthogonal in Euclidean space.

2.2 Clustering with NMF

Besides image analysis [5], NMF has also been applied to document clustering [8,2]. In most approaches that use NMF for document clustering, the number of features is set to the number of clusters [8,2]. Each instance is assigned to the cluster c* with the maximal value in the constructed representation v:

    c* = argmax_c v_c    (3)

where v_c stands for the c-th element of v.

2.3 Representation Learning with NMF

When NMF is viewed as a dimensionality reduction method, some learning method such as SVM (Support Vector Machine) or kmeans is applied to the learned representation V. In many cases, methods which assume Euclidean space (such as kmeans) are used for learning on V [4]. However, to the best of our knowledge, the issues arising from the non-orthogonality of the learned representation have not been dealt with.
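To make the setting of Sections 2.1 and 2.2 concrete before turning to the rectification, the following is a minimal sketch of NMF-based document clustering in Python. It uses scikit-learn's NMF with the Frobenius objective of eq. (2) as a stand-in for the multiplicative-update algorithm of [6]; the toy data, the solver settings, and the variable names are illustrative assumptions, not part of the original method.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative term-document matrix X in R_+^{p x n}: p terms, n documents.
rng = np.random.default_rng(0)
X = rng.random((200, 50))          # p = 200 terms, n = 50 documents

q = 5                              # number of features (= number of clusters here)
# scikit-learn factorizes X ~ W H; passing the p x n matrix directly means
# W (p x q) plays the role of U and H (q x n) the role of V in eq. (1).
model = NMF(n_components=q, init="nndsvda", solver="mu",
            beta_loss="frobenius", max_iter=300, random_state=0)
U = model.fit_transform(X)         # p x q, columns u_1, ..., u_q (feature vectors)
V = model.components_              # q x n, columns v_1, ..., v_n (data vectors)

# Eq. (3): assign each document to the cluster with the largest entry of its v.
labels = np.argmax(V, axis=0)
print(labels[:10])
```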

2.4 Cholesky Decomposition Rectification

One reason for the above problem is that, when the learned representation V is used, the squared distance between a pair of instances (v_i, v_j) is usually calculated as (v_i − v_j)^T (v_i − v_j), implicitly assuming that v_i is represented in some Euclidean space. However, since u_1, ..., u_q learned by NMF are in general not orthogonal to each other, this calculation is not appropriate when NMF is used to learn V. If we know the metric M which reflects the non-orthogonality of the feature space, the squared distance can instead be calculated as

    (v_i − v_j)^T M (v_i − v_j)    (4)

This corresponds to the squared Mahalanobis generalized distance.

We exploit the property of NMF that the data matrix X is decomposed into i) U, whose column vectors span the feature space, and ii) V, which is the representation of the data in that feature space. Based on this property, the proposed method 1) first estimates the metric of the feature space from the learned feature vectors, 2) applies Cholesky decomposition to the metric to obtain its upper triangular factor, and 3) finally uses the upper triangular matrix as a linear mapping for the data vectors. Some learning algorithm is then applied to the transformed representation from step 3), as in [4]. We explain steps 1) and 2) below. Note that the proposed method makes it possible to utilize the representation learned by NMF effectively without modifying the algorithms applied to that representation.

Estimation of Metric via NMF. In NMF, when the data matrix X is approximated and represented as V in the feature space, the explicit representation of the features in the original data space is also obtained as U. Thus, by normalizing each u such that u^T u = 1 as in [8], we estimate the metric M as the Gram matrix of the features:

    M = U^T U,  s.t. u_l^T u_l = 1, l = 1, ..., q    (5)

Contrary to other metric learning approaches, no label information is required to estimate M in our approach. Furthermore, since each data vector is approximated (embedded) in the feature space spanned by u_1, ..., u_q, it seems rather natural to estimate the metric of the feature space from U via eq. (5).

Cholesky Decomposition Rectification. Since the metric M is estimated by eq. (5), M is guaranteed to be symmetric positive semi-definite. Thus, based on linear algebra [3], M can be decomposed by Cholesky decomposition with an upper triangular matrix T as

    M = T^T T    (6)

By substituting eq. (6) into eq. (4), we obtain the rectified representation TV:

    V → TV    (7)

based on the upper triangular matrix T from the Cholesky decomposition.
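The rectification step itself needs only the learned factors. Below is a minimal numpy sketch of eqs. (5)-(7), assuming U and V come from some NMF run (e.g., the sketch in Section 2.2). The compensating rescaling of V (so that the product UV is unchanged by the column normalization of U) and the small diagonal jitter before the factorization are assumptions of this sketch, added for numerical safety; note that numpy's Cholesky routine returns a lower triangular factor, so the upper triangular T of eq. (6) is obtained by transposition.

```python
import numpy as np

def cholesky_rectify(U, V, eps=1e-10):
    """Cholesky decomposition rectification (eqs. (5)-(7)).

    U : (p, q) non-negative feature matrix, columns u_1, ..., u_q
    V : (q, n) non-negative representation, columns v_1, ..., v_n
    Returns the normalized U, the upper triangular T, and the rectified TV.
    """
    # Normalize each column so that u_l^T u_l = 1 (as in [8]) and rescale V
    # accordingly so that the product U V is unchanged.
    norms = np.maximum(np.linalg.norm(U, axis=0), eps)
    U_n = U / norms
    V_n = V * norms[:, None]

    # Eq. (5): estimate the metric of the feature space as the Gram matrix.
    M = U_n.T @ U_n                                # (q, q), symmetric PSD

    # Eq. (6): M = T^T T with T upper triangular.
    # numpy returns the lower triangular L with M = L L^T, hence T = L^T.
    L = np.linalg.cholesky(M + eps * np.eye(M.shape[0]))   # jitter: assumption
    T = L.T

    # Eq. (7): rectified representation TV. Euclidean distances on TV equal
    # the Mahalanobis distances (v_i - v_j)^T M (v_i - v_j) of eq. (4) on V.
    return U_n, T, T @ V_n

# Example (assuming U, V from a previous NMF run):
# U_n, T, TV = cholesky_rectify(U, V)
# Any Euclidean-space method (kmeans, skmeans, ...) is then applied to TV.
```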

Algorithm 1. Cholesky Decomposition Rectification for NMF (CNMF)

CNMF(X, algNMF, q, pars)
Require: X ∈ R_+^{p×n}   // data matrix
Require: algNMF           // the NMF algorithm to be used
Require: q                // the number of features
Require: pars             // other parameters of algNMF
 1: U, V := run algNMF on X with q (and pars) s.t. u_l^T u_l = 1, l = 1, ..., q
 2: M := U^T U
 3: T := Cholesky decomposition of M s.t. M = T^T T
 4: return U, TV

The proposed algorithm CNMF is shown in Algorithm 1.

3 Evaluations

3.1 Experimental Settings

Datasets. We evaluated the proposed algorithm on the 20 Newsgroup data (20NG)¹. Each document is represented in the standard vector space model based on the occurrences of terms. We created three datasets from 20NG (Multi5, Multi10, Multi15, with 5, 10, and 15 clusters). To create one sample of a dataset, 50 documents were sampled from each group (cluster), and 10 samples were created for each dataset. For each sample, we conducted stemming using the porter stemmer² and MontyTagger³, removed stop words, and selected the 2,000 words with the largest mutual information. We also conducted experiments on the TREC datasets; however, results on other datasets are omitted due to the page limit.

¹ jrennie/20newsgroups/. 20news was utilized.
² martin/porterstemmer
³ hugo/montytagger

Evaluation Measures. For each dataset, the cluster assignment was evaluated with respect to Normalized Mutual Information (NMI). Let C and Ĉ stand for the random variables over the true and assigned clusters. NMI is defined as

    NMI = I(Ĉ; C) / ((H(Ĉ) + H(C)) / 2)  ∈ [0, 1]

where H(·) is the Shannon entropy and I(·;·) is the mutual information. NMI measures how well the assigned clusters agree with the true clusters: the larger the NMI, the better the result.

Comparison. We applied the proposed method to 1) NMF [6], 2) WNMF [8], and 3) GNMF [1], and evaluated its effectiveness. Since these methods are partitioning-based clustering methods, we assume that the number of clusters k is specified. WNMF [8] first converts the data matrix X using the weighting scheme of Ncut [7] and applies the standard NMF algorithm to the converted data. GNMF [1] constructs the m-nearest-neighbor graph and uses the graph Laplacian of its adjacency matrix A as a regularization term:

    J_2 = ||X − UV||^2 + λ tr(V L V^T)    (8)

where L = D − A (D is the degree matrix) and λ is the regularization parameter.
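For reference, the regularizer in eq. (8) needs the Laplacian L = D − A of an m-nearest-neighbor graph over the documents. The following numpy sketch builds it with cosine similarity as the pairwise measure (the measure stated in the Parameters paragraph below); the binary, symmetrized edge weighting is an assumption of this sketch, since [1] supports several weighting schemes.

```python
import numpy as np

def knn_graph_laplacian(X, m=10):
    """Unnormalized Laplacian L = D - A of the m-nearest-neighbor graph of the
    columns of X (one column per document), using cosine similarity."""
    n = X.shape[1]
    Xn = X / np.maximum(np.linalg.norm(X, axis=0, keepdims=True), 1e-12)
    S = Xn.T @ Xn                        # (n, n) cosine similarities
    np.fill_diagonal(S, -np.inf)         # exclude self-loops

    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(S[i])[-m:]     # indices of the m most similar documents
        A[i, nbrs] = 1.0                 # binary weights (assumed; [1] also allows
                                         # other weighting schemes)
    A = np.maximum(A, A.T)               # symmetrize the adjacency matrix

    D = np.diag(A.sum(axis=1))           # degree matrix
    return D - A                         # graph Laplacian used in eq. (8)

# Example: L = knn_graph_laplacian(X, m=10); GNMF then adds
# lambda * np.trace(V @ L @ V.T) to the reconstruction error of eq. (2).
```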

Parameters. Cosine similarity was used as the pairwise similarity measure. We varied the value of q in the experiments. The number of neighbors m was set to 10 in GNMF, and λ was set to 100 based on [1]. The maximum number of iterations was set to 30.

Evaluation Procedure. As standard clustering methods based on Euclidean space, kmeans and skmeans were applied to the learned representation matrix V from each method and to the proposed representation TV in eq. (7). Since NMF finds a local optimum, the result (U, V) depends on the initialization; we therefore ran 10 random initializations for the same data matrix. Furthermore, since both kmeans and skmeans are affected by the initial cluster assignment, clustering of the same representation (either V or TV) was repeated 10 times with random initial assignments.
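A minimal sketch of this evaluation loop is given below: kmeans is applied both to V and to the rectified TV, with several random restarts, and each assignment is scored with NMI. scikit-learn's normalized_mutual_info_score with arithmetic averaging matches the definition in Section 3.1; using KMeans alone (the paper also evaluates skmeans, i.e. spherical kmeans) and the restart counts here are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def evaluate_representation(R, true_labels, k, n_restarts=10, seed=0):
    """Cluster the columns of R (q x n) with kmeans and report the mean NMI
    over several random initializations."""
    scores = []
    for r in range(n_restarts):
        km = KMeans(n_clusters=k, n_init=1, init="random", random_state=seed + r)
        pred = km.fit_predict(R.T)       # one row per document
        # NMI = I(C_hat; C) / ((H(C_hat) + H(C)) / 2), as in Section 3.1
        scores.append(normalized_mutual_info_score(
            true_labels, pred, average_method="arithmetic"))
    return float(np.mean(scores))

# Example (assuming V, TV, and ground-truth labels y are available):
# nmi_v  = evaluate_representation(V,  y, k=5)
# nmi_tv = evaluate_representation(TV, y, k=5)
# print(nmi_v, nmi_tv)
```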

3.2 Results

[Fig. 1: NMI results on the 20 Newsgroup datasets (Multi5, Multi10, Multi15) as a function of the number of features q; upper row: kmeans, lower row: skmeans. Curves: NMF, NMF+c, WNMF, WNMF+c, GNMF, GNMF+c.]

The reported figures are the average over the 10 samples of each dataset.⁴ The horizontal axis corresponds to the number of features q, the vertical axis to NMI. In the legend, solid lines correspond to NMF, dotted lines to WNMF, and dashed lines to GNMF. In addition, "+c" stands for the results obtained by applying the proposed method in eq. (7) and constructing TV for each method.

The results in Fig. 1 show that the proposed method improves the performance of kmeans (the standard Euclidean distance) and skmeans (cosine similarity in Euclidean space). Thus, the proposed method can be said to be effective in improving performance. In particular, skmeans was substantially improved (lower row in Fig. 1). In addition, when the proposed method is applied to WNMF (blue dotted WNMF+c), equivalent or even better performance was obtained compared with GNMF. On the other hand, the proposed method was not effective for GNMF, since the presupposition in Section 2.4 does not hold for GNMF. As the number of features q increases, the performance of NMF and WNMF degrades. In contrast, with the proposed method, NMF+c and WNMF+c are very robust with respect to the increase of q. Thus, the proposed method can be said to be effective for utilizing a large number of features in NMF.

⁴ The average of 1,000 runs is reported for each dataset.

4 Concluding Remarks

We proposed a method based on Cholesky decomposition to remedy the problem caused by the non-orthogonality of the features learned by Non-negative Matrix Factorization (NMF). Since NMF learns both feature vectors and data vectors in the feature space, the proposed method 1) first estimates the metric in the feature space from the learned feature vectors, 2) applies Cholesky decomposition to the metric to obtain its upper triangular factor, and 3) finally uses the upper triangular matrix as a linear mapping for the data vectors. The proposed method enables the effective utilization of the representation learned by NMF without modifying the algorithms applied to that representation.

References

1. Cai, D., He, X., Wu, X., Han, J.: Non-negative matrix factorization on manifold. In: Proc. of ICDM 2008 (2008)
2. Ding, C., Li, T., Peng, W., Park, H.: Orthogonal nonnegative matrix tri-factorizations for clustering. In: Proc. of KDD 2006 (2006)
3. Harville, D.A.: Matrix Algebra From a Statistician's Perspective. Springer, Heidelberg (2008)
4. Kamvar, S.D., Klein, D., Manning, C.D.: Spectral learning. In: Proc. of IJCAI 2003 (2003)
5. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401 (1999)
6. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Proc. of Neural Information Processing Systems (NIPS) (2001)
7. von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17(4) (2007)
8. Xu, W., Liu, X., Gong, Y.: Document clustering based on non-negative matrix factorization. In: Proc. of SIGIR 2003 (2003)
