NMF based Gene Selection Algorithm for Improving Performance of the Spectral Cancer Clustering

Andri Mirzal
Faculty of Computing, Universiti Teknologi Malaysia, Skudai, Johor Bahru, Malaysia
Email: andrimirzal@utm.my

Abstract—Analyzing cancers using microarray gene expression datasets is currently an active research area in the medical community. There are many tasks related to this research, e.g., clustering and classification, data compression, and sample characterization. In this paper, we discuss the task of cancer clustering. Spectral clustering is one of the most commonly used methods in cancer clustering. As gene expression datasets are usually highly imbalanced, i.e., they contain only a few tissue samples (hundreds at most) but each sample is expressed by thousands of genes, filtering out irrelevant and potentially misleading gene expressions is a necessary step to improve the performance of the method. In this paper, we propose an unsupervised gene selection algorithm based on the nonnegative matrix factorization (NMF). Our algorithm makes use of the clustering capability of the NMF to select the most informative genes. Clustering performance of the spectral method is then evaluated by comparing the results on the original datasets with the results on the pruned datasets. Our results suggest that the proposed algorithm can be used to improve the clustering performance of the spectral method.

Keywords—cancer clustering, gene selection, nonnegative matrix factorization, spectral clustering

I. INTRODUCTION

Cancer clustering is the task of grouping samples from patients with cancers so that samples of the same type are clustered in the same group (usually each group refers to a specific cancer type) [1]. In some datasets, normal tissues are also included for control purposes [2]. In the literature, one can find two related terms, cancer clustering and cancer classification, which are sometimes used interchangeably.
In this paper we explicitly differentiate these terms: cancer clustering refers to the unsupervised task of grouping the samples, and cancer classification refers to the supervised task in which classifiers are trained first before being used to classify the samples. In recent years, thousands of new gene expression datasets have been generated. These datasets usually consist of only a few samples (hundreds at most), but each sample is represented by thousands of gene expressions. This characteristic makes analyzing the datasets quite challenging because most clustering and classification techniques perform poorly when the number of samples is small. In addition, the high dimensionality of the data suggests that many of the gene expressions are actually irrelevant and possibly misleading, and thus a gene selection procedure should be employed to clean the data. For the classification problem, the small number of samples creates the additional problem of overfitting the classifiers [3]. The use of gene selection procedures to improve classification performance has been extensively studied [1], [3]–[17]. Most of the proposed methods are based on support vector machines (SVMs), and it has been shown that these methods can significantly improve the performance of the classifiers. In cancer clustering research, however, gene selection is not yet well studied. The common approach is to use all dimensions, which potentially reduces the performance of the clustering algorithms because the data can contain irrelevant and misleading gene expressions. In this paper, we propose an unsupervised gene selection algorithm based on the nonnegative matrix factorization (NMF). NMF is a matrix factorization technique that decomposes a nonnegative matrix into a pair of other nonnegative matrices. It has been successfully applied in many problem domains including clustering [4]–[6], [18]–[32], image analysis [33]–[37], and feature extraction [4], [5], [38], [39].
The proposed algorithm is designed around the fact that the NMF can group similar genes in an unsupervised manner and that the membership degrees of each gene to the clusters are given directly by the entries in the corresponding column of the coefficient matrix. We then use the proposed algorithm to improve the performance of spectral clustering.

II. THE SPECTRAL CLUSTERING

Spectral clustering is a family of multiway clustering techniques that make use of eigenvectors of the data matrix to perform the clustering. Depending on the choice of matrix, the number of eigenvectors, and the algorithm used to infer clusters from the eigenvectors, many spectral clustering algorithms are available, e.g., [40]–[42] (a detailed discussion of spectral clustering can be found in ref. [43]). Here we use the spectral clustering algorithm proposed by Ng et al. [41]. We choose this algorithm because of its intuitiveness and clustering capability. Algorithm 1 outlines the algorithm, where ℝ₊^(M×N) denotes an M-by-N nonnegative matrix and 𝔹^(M×K) denotes an M-by-K binary matrix.

III. THE PROPOSED ALGORITHM

Given a nonnegative data matrix A ∈ ℝ₊^(M×N), the NMF decomposes the matrix into the basis matrix B ∈ ℝ₊^(M×R) and the coefficient matrix C ∈ ℝ₊^(R×N) such that A ≈ BC.
Algorithm 1: A spectral clustering algorithm by Ng et al. [41]
1. Input: rectangular data matrix A ∈ ℝ₊^(M×N) with M data points, and the number of clusters K.
2. Construct a symmetric affinity matrix Ȧ ∈ ℝ₊^(M×M) from A by using the Gaussian kernel.
3. Normalize Ȧ by Ȧ ← D^(−1/2) Ȧ D^(−1/2), where D is a diagonal matrix with D_ii = Σ_j Ȧ_ij.
4. Compute the K eigenvectors that correspond to the K largest eigenvalues of Ȧ, and form X̂ = [x̂_1, ..., x̂_K] ∈ ℝ^(M×K), where x̂_k is the k-th eigenvector.
5. Normalize every row of X̂, i.e., X̂_ij ← X̂_ij / (Σ_j X̂_ij²)^(1/2).
6. Apply k-means clustering on the rows of X̂ to obtain the clustering indicator matrix X ∈ 𝔹^(M×K).

To compute B and C, usually the following optimization problem is used:

min_{B,C} J(B, C) = (1/2) ‖A − BC‖²_F   s.t. B ≥ 0, C ≥ 0,   (1)

where ‖X‖_F denotes the Frobenius norm of X. Many algorithms have been proposed to solve the optimization problem in eq. (1). However, for clustering purposes, there is not much performance difference between the standard NMF algorithm proposed by Lee and Seung [44] and the more advanced and application-specific algorithms [4]–[6], [18]–[32]. Accordingly, we use the standard NMF algorithm. Algorithm 2 outlines the algorithm, where b_mr^(k) denotes the (m, r) entry of B at the k-th iteration, X^T denotes the transpose of X, and δ denotes a small positive number to avoid division by zero.

Algorithm 2: The standard NMF algorithm [44].
1. Initialization: B^(0) > 0 and C^(0) > 0.
2. for k = 0, ..., maxiter do
     b_mr^(k+1) ← b_mr^(k) (A C^(k)T)_mr / ((B^(k) C^(k) C^(k)T)_mr + δ)   ∀m, r
     c_rn^(k+1) ← c_rn^(k) (B^(k+1)T A)_rn / ((B^(k+1)T B^(k+1) C^(k))_rn + δ)   ∀r, n
   end for

Let A denote the sample-by-gene matrix containing the gene expression data and R denote the number of cancer classes. By using Algorithm 2 to factorize A into B and C, the n-th column of C describes the clustering membership degrees of the n-th gene to each cluster: the more positive the entry, the more likely the gene belongs to the corresponding cluster. In the hard clustering case, the membership is determined by the most positive entry.
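As a concrete illustration, the multiplicative updates of the standard NMF algorithm (Algorithm 2) can be sketched in NumPy as follows. This is a minimal sketch, not the implementation used in the experiments; the function name and the random initialization scheme are our own assumptions.

```python
import numpy as np

def nmf(A, R, maxiter=100, delta=1e-8, seed=0):
    """Standard NMF with multiplicative updates (Lee and Seung [44]).

    Factorizes a nonnegative M-by-N matrix A into B (M-by-R) and
    C (R-by-N) so that A is approximately B @ C.  delta guards the
    denominators against division by zero, as in Algorithm 2.
    """
    rng = np.random.default_rng(seed)
    M, N = A.shape
    B = rng.random((M, R)) + delta  # B^(0) > 0
    C = rng.random((R, N)) + delta  # C^(0) > 0
    for _ in range(maxiter):
        B *= (A @ C.T) / (B @ C @ C.T + delta)   # update for b_mr
        C *= (B.T @ A) / (B.T @ B @ C + delta)   # update for c_rn
    return B, C
```

Each update keeps the factors nonnegative by construction, and (without the δ term) is known not to increase the Frobenius objective in eq. (1).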
Further, if we normalize each column of C, i.e., c_rn ← c_rn / Σ_r c_rn, the entries in each row become comparable, and consequently the r-th row of C describes the membership strengths of the genes to the r-th cluster. Thus, we can sort these rows to find the genes most informative to the corresponding clusters. By choosing some top genes for each cluster, we can select the most informative genes and remove irrelevant and misleading ones. This process is the core of our algorithm. But because the NMF does not have the uniqueness property, the process is repeated so that only genes that consistently come out at the top are selected. Because of this repetition, we introduce a scoring scheme that assigns predefined scores to the top genes at each trial, and the genes with the largest cumulative scores are then selected as the most informative genes. Our scoring scheme is based on the MotoGP scoring system, but the scores are assigned only to the top 10 genes in each cluster (the scores for the top 10 genes are: 25, 20, 16, 13, 11, 10, 9, 8, 7, and 6). Algorithm 3 outlines the complete gene selection procedure.

Algorithm 3: NMF based gene selection algorithm.
1. Input: gene expression data matrix A ∈ ℝ₊^(M×N) (the rows correspond to the samples and the columns correspond to the genes) and the number of clusters R.
2. Normalize each column of A, i.e., a_mn ← a_mn / Σ_m a_mn.
3. for l = 0, ..., L do
   a. Compute C using Algorithm 2.
   b. Normalize each column of C, i.e., c_rn ← c_rn / Σ_r c_rn.
   c. Sort each row of C in descending order.
   d. Assign scores to the top 10 genes in each row of C.
   e. Accumulate the scores by adding the current scores to the previous ones.
4. end for
5. Select the top G genes according to the cumulative scores.

IV. EXPERIMENTAL RESULTS

To evaluate the capability of the proposed algorithm in improving the performance of spectral clustering, six publicly available cancer datasets from the work of Souto et al.
[45], in which they compiled the first comprehensive collection of such datasets gathered from many resources (there are 35 datasets in total), were used. Table I summarizes the information of the datasets.

TABLE I. CANCER DATASETS.
Dataset name       | Tissue   | #Samples | #Genes | #Classes
Nutt-2003-v2       | Brain    | 28       | 1070   | 2
Armstrong-2002-v2  | Blood    | 72       | 2194   | 3
Tomlins-2006-v2    | Prostate | 92       | 1288   | 4
Pomeroy-2002-v2    | Brain    | 42       | 1379   | 5
Yeoh-2002-v2       | Bone     | 248      | 2526   | 6
Su-2001            | Multi    | 174      | 1571   | 10

As shown, the datasets are quite representative: the number of classes varies from 2 to 10, the number of samples varies from tens to hundreds, and we also have one dataset, Su-2001, that contains multiple types of cancers. There are some parameters that need to be chosen. The first is maxiter in Algorithm 2. Here we set maxiter to 100, as the standard NMF algorithm is known to be fast in minimizing the
error only during the first few iterations [49]. The second is the number of trials L in Algorithm 3. After several attempts, we found that there was not much performance gain between L = 100 and L > 100; thus we set L to 100. The third is the number of top genes G in step 5 of Algorithm 3. After several attempts, G was set to 20, 1600, 250, 300, 2000, and 200 for Nutt, Armstrong, Tomlins, Pomeroy, Yeoh, and Su respectively. And δ in Algorithm 2 was set to 10^(−8).

To evaluate clustering performance, two metrics were used: Accuracy and Adjusted Rand Index (ARI). Accuracy is the most commonly used metric to measure the performance of clustering algorithms in the medical community. It measures the fraction of samples that fall in the dominant class of their cluster. Accuracy is defined as [3]:

Accuracy = (1/M) Σ_{r=1}^{R} max_s c_rs,

where r and s denote the r-th cluster and the s-th reference class respectively, R denotes the number of clusters produced by the clustering algorithm, M denotes the number of samples, and c_rs denotes the number of samples in the r-th cluster that belong to the s-th class. The values of Accuracy are between 0 and 1, with 1 indicating perfect agreement between the reference classes and the clustering results. In the machine learning community, this metric is also known as Purity [5]. The Adjusted Rand Index (ARI) has values ranging from −1 to 1, with 1 indicating perfect agreement and values near 0 or negative corresponding to clusters found by chance. ARI is defined as [46]–[48]:

ARI = [Σ_{r,s} C(c_rs, 2) − Σ_r C(c_r, 2) Σ_s C(c_s, 2) / C(M, 2)] / [(1/2)(Σ_r C(c_r, 2) + Σ_s C(c_s, 2)) − Σ_r C(c_r, 2) Σ_s C(c_s, 2) / C(M, 2)],

where C(·, 2) denotes the "choose 2" binomial coefficient, c_r denotes the number of samples in the r-th cluster, and c_s denotes the number of samples in the s-th class.

The experiment procedure is as follows. First, Algorithm 3 was used to select the top genes from the original data matrix A ∈ ℝ₊^(M×N). Then a new pruned data matrix Â ∈ ℝ₊^(M×G) was formed with the top G genes. This matrix was then input to Algorithm 1 to obtain the clustering indicator matrix. The clustering quality was then measured by using Accuracy and ARI.
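Both metrics can be computed directly from the contingency table c_rs. The following minimal sketch (function names are our own) builds the table and evaluates the two formulas above:

```python
import numpy as np
from math import comb

def contingency(true_labels, pred_labels):
    """Build the R-by-S table c_rs: samples in cluster r that belong to class s."""
    clusters = sorted(set(pred_labels))
    classes = sorted(set(true_labels))
    c = np.zeros((len(clusters), len(classes)), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        c[clusters.index(p), classes.index(t)] += 1
    return c

def accuracy(true_labels, pred_labels):
    """Accuracy (Purity): (1/M) * sum_r max_s c_rs."""
    c = contingency(true_labels, pred_labels)
    return c.max(axis=1).sum() / c.sum()

def ari(true_labels, pred_labels):
    """Adjusted Rand Index computed from the contingency table."""
    c = contingency(true_labels, pred_labels)
    sum_rs = sum(comb(int(x), 2) for x in c.ravel())       # sum over all c_rs
    sum_r = sum(comb(int(x), 2) for x in c.sum(axis=1))    # cluster marginals c_r
    sum_s = sum(comb(int(x), 2) for x in c.sum(axis=0))    # class marginals c_s
    expected = sum_r * sum_s / comb(int(c.sum()), 2)       # chance-expected index
    max_index = 0.5 * (sum_r + sum_s)
    return (sum_rs - expected) / (max_index - expected)
```

A clustering identical to the reference classes up to a relabeling gives Accuracy = 1 and ARI = 1, while a clustering independent of the classes gives an ARI near 0.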
Because of the nonuniqueness of the NMF, this procedure was repeated 100 times to obtain more statistically sound results. Fig. 1 shows the performance of spectral clustering with and without the gene selection procedure. As shown, spectral clustering performed quite well on three datasets (Armstrong, Pomeroy, and Su) and produced rather unsatisfactory results on the other three (Nutt, Tomlins, and Yeoh). The gene selection improved the clustering performance of spectral clustering in all cases, with larger improvements observed in the cases where the clustering results were rather unsatisfactory. This implies that there are not many irrelevant and misleading genes in the first group of datasets, so that the results with and without gene selection are comparable; in the second group, by contrast, there are some such genes, and they were removed by the gene selection process. Tables II and III give the detailed experimental results over the 100 trials, where the values are displayed in the format: average value ± standard deviation.

Fig. 1. Performance of the spectral clustering with and without gene selection measured by Accuracy and ARI (average values over 100 runs). [Bar charts: (a) Accuracy and (b) ARI for Nutt, Armstrong, Tomlins, Pomeroy, Yeoh, and Su, without and with gene selection.]

TABLE II. ACCURACY AND ARI WITHOUT GENE SELECTION.
Dataset name       | Accuracy       | ARI
Nutt-2003-v2       | 0.571 ± 0.000  | 0.00 ± 0.000
Armstrong-2002-v2  | 0.861 ± 0.000  | 0.64 ± 0.000
Tomlins-2006-v2    | 0.587 ± 0.04   | 0.7 ± 0.019
Pomeroy-2002-v2    | 0.759 ± 0.03   | 0.503 ± 0.035
Yeoh-2002-v2       | 0.675 ± 0.030  | 0.347 ± 0.048
Su-2001            | 0.744 ± 0.030  | 0.53 ± 0.050

TABLE III. ACCURACY AND ARI WITH GENE SELECTION.
Dataset name       | Accuracy       | ARI
Nutt-2003-v2       | 0.664 ± 0.036  | 0.095 ± 0.033
Armstrong-2002-v2  | 0.878 ± 0.014  | 0.667 ± 0.037
Tomlins-2006-v2    | 0.679 ± 0.031  | 0.343 ± 0.045
Pomeroy-2002-v2    | 0.767 ± 0.018  | 0.540 ± 0.03
Yeoh-2002-v2       | 0.730 ± 0.07   | 0.408 ± 0.030
Su-2001            | 0.746 ± 0.033  | 0.569 ± 0.044
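To make steps 3a–3e and 5 of Algorithm 3 concrete, the score-accumulation loop can be sketched as follows. This is a minimal sketch under our own naming assumptions; the inlined nmf helper re-implements the multiplicative updates of Algorithm 2 and is not the code used in the experiments.

```python
import numpy as np

# MotoGP-style scores assigned to the top 10 genes of each cluster row
SCORES = np.array([25, 20, 16, 13, 11, 10, 9, 8, 7, 6])

def nmf(A, R, maxiter=100, delta=1e-8, seed=0):
    """Algorithm 2: standard multiplicative-update NMF."""
    rng = np.random.default_rng(seed)
    B = rng.random((A.shape[0], R)) + delta
    C = rng.random((R, A.shape[1])) + delta
    for _ in range(maxiter):
        B *= (A @ C.T) / (B @ C @ C.T + delta)
        C *= (B.T @ A) / (B.T @ B @ C + delta)
    return B, C

def select_genes(A, R, G, L=100):
    """Algorithm 3: return indices of the G top-scoring genes.

    A is sample-by-gene; R is the number of cancer classes."""
    A = A / A.sum(axis=0, keepdims=True)          # step 2: normalize columns of A
    cumulative = np.zeros(A.shape[1])
    for trial in range(L):                        # step 3: repeat over L trials
        _, C = nmf(A, R, seed=trial)              # 3a: compute C
        C = C / C.sum(axis=0, keepdims=True)      # 3b: normalize columns of C
        for row in C:                             # 3c-3e: score the top 10 genes
            top = np.argsort(row)[::-1][:len(SCORES)]
            cumulative[top] += SCORES[:len(top)]
    return np.argsort(cumulative)[::-1][:G]       # step 5: top G genes overall
```

Varying the random seed per trial stands in for the nonuniqueness of the NMF: only genes that rank highly across many factorizations accumulate large scores.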
V. CONCLUSION

We have presented a gene selection algorithm based on the NMF to select the most informative genes from a microarray gene expression dataset. The experimental results showed that the proposed algorithm improved the performance of spectral clustering, with more visible improvements observed in the cases where spectral clustering produced rather unsatisfactory results.

ACKNOWLEDGMENT

The author would like to thank the reviewers for useful comments. This research was supported by the Ministry of Higher Education of Malaysia and Universiti Teknologi Malaysia under Exploratory Research Grant Scheme R.J130000.788.4L095.

REFERENCES
[1] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, Vol. 286(5439), pp. 531-537, 1999.
[2] S.L. Pomeroy, P. Tamayo, M. Gaasenbeek, L.M. Sturla, M. Angelo, M.E. McLaughlin, J.Y.H. Kim, L.C. Goumnerova, P.M. Black, C. Lau, J.C. Allen, D. Zagzag, J.M. Olson, T. Curran, C. Wetmore, J.A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D.N. Louis, J.P. Mesirov, E.S. Lander, and T.R. Golub, Prediction of central nervous system embryonal tumour outcome based on gene expression, Nature, Vol. 415(6870), pp. 436-442, 2002.
[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Gene selection for cancer classification using support vector machines, Machine Learning, Vol. 46(1-3), pp. 389-422, 2002.
[4] J.P. Brunet, P. Tamayo, T.R. Golub, and J.P. Mesirov, Metagenes and molecular pattern discovery using matrix factorization, Proc. Natl. Acad. Sci. USA, Vol. 101(12), pp. 4164-4169, 2004.
[5] C.H. Zheng, D.S. Huang, D. Zhang, and X.Z. Kong, Tumor clustering using nonnegative matrix factorization with gene selection, IEEE Transactions on Information Technology in Biomedicine, Vol. 13(4), pp. 599-607, 2009.
[6] N. Yuvaraj and P. Vivekanandan, An efficient SVM based tumor classification with symmetry non-negative matrix factorization using gene expression data, Proc. Int'l Conf. on Information Communication and Embedded Systems, pp. 761-768, 2013.
[7] M. Pirooznia, J.Y. Yang, M.Q. Yang, and Y. Deng, A comparative study of different machine learning methods on microarray gene expression data, BMC Genomics, Vol. 9(Suppl 1), S13, 2008.
[8] X. Liu, A. Krishnan, and A. Mondry, An entropy-based gene selection method for cancer classification using microarray data, BMC Bioinformatics, Vol. 6, p. 76, 2005.
[9] L. Wang, F. Chu, and W. Xie, Accurate cancer classification using expressions of very few genes, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 4(1), pp. 40-53, 2007.
[10] L.Y. Chuang, H.W. Chang, C.J. Tu, and C.H. Yang, Improved binary PSO for feature selection using gene expression data, Computational Biology and Chemistry, Vol. 32(1), pp. 29-37, 2008.
[11] P. Mitra and D.D. Majumder, Feature selection and gene clustering from gene expression data, Proc. 17th Int'l Conf. on Pattern Recognition, pp. 343-346, 2004.
[12] T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler, Support vector machine classification and validation of cancer tissue samples using microarray expression data, Bioinformatics, Vol. 16(10), pp. 906-914, 2000.
[13] S. Moon and H. Qi, Hybrid dimensionality reduction method based on support vector machine and independent component analysis, IEEE Transactions on Neural Networks and Learning Systems, Vol. 23(5), pp. 749-761, 2012.
[14] Y. Lee and C.K. Lee, Classification of multiple cancer types by multicategory support vector machines using gene expression data, Bioinformatics, Vol. 19(9), pp. 1132-1139, 2003.
[15] X. Zhang, X. Lu, Q. Shi, X. Xu, H.E. Leung, L.N. Harris, J.D. Iglehart, A. Miron, J.S. Liu, and W.H. Wong, Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data, BMC Bioinformatics, Vol. 7(197), 2006.
[16] Y. Lu and J. Han, Cancer classification using gene expression data, Information Systems, Vol. 28(4), pp. 243-268, 2003.
[17] H.H. Zhang, J. Ahn, X. Lin, and C. Park, Gene selection using support vector machines with non-convex penalty, Bioinformatics, Vol. 22(1), pp. 88-95, 2006.
[18] F. Shahnaz, M.W. Berry, V. Pauca, and R.J. Plemmons, Document clustering using nonnegative matrix factorization, Information Processing & Management, Vol. 42(2), pp. 373-386, 2006.
[19] W. Xu, X. Liu, and Y. Gong, Document clustering based on non-negative matrix factorization, Proc. ACM SIGIR, pp. 267-273, 2003.
[20] M. Berry, M. Brown, A. Langville, P. Pauca, and R.J. Plemmons, Algorithms and applications for approximate nonnegative matrix factorization, Computational Statistics and Data Analysis, Vol. 52(1), pp. 155-173, 2007.
[21] J. Yoo and S. Choi, Orthogonal nonnegative matrix factorization: multiplicative updates on Stiefel manifolds, Proc. 9th Int'l Conf. on Intelligent Data Engineering and Automated Learning, pp. 140-147, 2008.
[22] J. Yoo and S. Choi, Orthogonal nonnegative matrix tri-factorization for co-clustering: multiplicative updates on Stiefel manifolds, Information Processing & Management, Vol. 46(5), pp. 559-570, 2010.
[23] Y. Gao and G. Church, Improving molecular cancer class discovery through sparse non-negative matrix factorization, Bioinformatics, Vol. 21(21), pp. 3970-3975, 2005.
[24] D. Dueck, Q.D. Morris, and B.J. Frey, Multi-way clustering of microarray data using probabilistic sparse matrix factorization, Bioinformatics, Vol. 21(1), pp. 145-151, 2005.
[25] H. Kim and H. Park, Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis, Bioinformatics, Vol. 23(12), pp. 1495-1502, 2007.
[26] K. Devarajan, Nonnegative matrix factorization: an analytical and interpretive tool in computational biology, PLoS Computational Biology, Vol. 4(7), e1000029, 2008.
[27] H. Kim and H. Park, Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method, SIAM J. Matrix Anal. Appl., Vol. 30(2), pp. 713-730, 2008.
[28] P. Carmona-Saez, R.D. Pascual-Marqui, F. Tirado, J.M. Carazo, and A. Pascual-Montano, Biclustering of gene expression data by non-smooth non-negative matrix factorization, BMC Bioinformatics, Vol. 7(78), 2006.
[29] K. Inamura, T. Fujiwara, Y. Hoshida, T. Isagawa, M.H. Jones, C. Virtanen, M. Shimane, Y. Satoh, S. Okumura, K. Nakagawa, E. Tsuchiya, S. Ishikawa, H. Aburatani, H. Nomura, and Y. Ishikawa, Two subclasses of lung squamous cell carcinoma with different gene expression profiles and prognosis identified by hierarchical clustering and non-negative matrix factorization, Oncogene, Vol. 24, pp. 7105-7113, 2005.
[30] P. Fogel, S.S. Young, D.M. Hawkins, and N. Ledirac, Inferential, robust non-negative matrix factorization analysis of microarray data, Bioinformatics, Vol. 23(1), pp. 44-49, 2007.
[31] G. Wang, A.V. Kossenkov, and M.F. Ochs, LS-NMF: a modified non-negative matrix factorization algorithm utilizing uncertainty estimates, BMC Bioinformatics, Vol. 7(175), 2006.
[32] J.J.Y. Wang, X. Wang, and X. Gao, Non-negative matrix factorization by maximizing correntropy for cancer clustering, BMC Bioinformatics, Vol. 14(107), 2013.
[33] P.O. Hoyer, Non-negative matrix factorization with sparseness constraints, Journal of Machine Learning Research, Vol. 5, pp. 1457-1469, 2004.
[34] S.Z. Li, X.W. Hou, H.J. Zhang, and Q.S. Cheng, Learning spatially localized, parts-based representation, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 207-212, 2001.
[35] D. Wang and H. Lu, On-line learning parts-based representation via incremental orthogonal projective non-negative matrix factorization, Signal Processing, Vol. 93(6), pp. 1608-1623, 2013.
[36] A. Pascual-Montano, J.M. Carazo, K. Kochi, D. Lehman, and R.D. Pascual-Marqui, Nonsmooth nonnegative matrix factorization, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28(3), pp. 403-415, 2006.
[37] N. Gillis and F. Glineur, A multilevel approach for nonnegative matrix factorization, Journal of Computational and Applied Mathematics, Vol. 236(7), pp. 1708-1723, 2012.
[38] H. Kim and H. Park, Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method, SIAM J. Matrix Anal. Appl., Vol. 30(2), pp. 713-730, 2008.
[39] W. Kim, B. Chen, J. Kim, Y. Pan, and H. Park, Sparse nonnegative matrix factorization for protein sequence motif discovery, Expert Systems with Applications, Vol. 38(10), pp. 13198-13207, 2011.
[40] J. Shi and J. Malik, Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22(8), pp. 888-905, 2000.
[41] A. Ng, M.I. Jordan, and Y. Weiss, On spectral clustering: analysis and an algorithm, Proc. Advances in Neural Information Processing Systems, pp. 849-856, 2002.
[42] S.X. Yu and J. Shi, Multiclass spectral clustering, Proc. IEEE Int'l Conf. on Computer Vision, pp. 313-319, 2003.
[43] U. von Luxburg, A tutorial on spectral clustering, Statistics and Computing, Vol. 17, pp. 395-416, 2007.
[44] D. Lee and H. Seung, Learning the parts of objects by non-negative matrix factorization, Nature, Vol. 401(6755), pp. 788-791, 1999.
[45] M.C.P. Souto, I.G. Costa, D.S.A. Araujo, T.B. Ludermir, and A. Schliep, Clustering cancer gene expression data: a comparative study, BMC Bioinformatics, Vol. 9(497), 2008.
[46] W.M. Rand, Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, Vol. 66(336), pp. 846-850, 1971.
[47] L. Hubert and P. Arabie, Comparing partitions, Journal of Classification, Vol. 2(1), pp. 193-218, 1985.
[48] N.X. Vinh, J. Epps, and J. Bailey, Information theoretic measures for clustering comparison: is a correction for chance necessary?, Proc. 26th Annual Int'l Conf. on Machine Learning, pp. 1073-1080, 2009.
[49] C.J. Lin, On the convergence of multiplicative update algorithms for nonnegative matrix factorization, IEEE Transactions on Neural Networks, Vol. 18(6), pp. 1589-1596, 2007.