Multi-Task Co-clustering via Nonnegative Matrix Factorization

Saining Xie, Hongtao Lu and Yangcheng He
Shanghai Jiao Tong University

Abstract

Recent results have empirically shown that, given several related tasks with different data distributions and an algorithm that can utilize both task-specific and cross-task knowledge, the clustering performance of each task can be significantly enhanced. This kind of unsupervised learning method is called multi-task clustering. We tackle the multi-task clustering problem via a 3-factor nonnegative matrix factorization. The objective of our approach consists of two parts: (1) within-task co-clustering: co-cluster the data in the input space of each task individually; (2) cross-task regularization: learn and refine the relations of the feature spaces among different tasks. We show that our approach has a sound information-theoretic background, and experimental evaluation shows that it outperforms many state-of-the-art single-task and multi-task clustering methods.

1. Introduction

Many real-world problems can be seen as a series of related, yet self-contained tasks. One good example, also illustrated in [8]: given web pages from several (e.g., four) different universities, we aim to classify (cluster) these pages into categories such as Student, Faculty, Project, and Course. Regarding the categorization (clustering) of each school's dataset as a single task, we then have four tasks. These tasks are closely related through the similar contents and common vocabulary they share, so it is natural to tackle them simultaneously rather than follow the more traditional approach of dealing with each task independently of the others.

The remaining question is how to handle the relations between different tasks. We cannot simply merge all tasks together and use traditional methods, because the data distributions differ. This situation leads to an important approach to machine learning, multi-task learning [2], in which how to characterize task relations becomes the most fundamental concern. Representative works include [12] and [6]. Based on the same philosophy, researchers have recently begun to focus on the unsupervised version of multi-task learning. In [9] and [8], a subspace learning method and a kernel method were proposed to explicitly tackle the multi-task clustering problem; both are derived from supervised multi-task learning methods. However, they use only the raw term features for text clustering tasks, which is sometimes insufficient. Intuitively, instead of using the raw terms and clustering only the instances (documents), we would like to cluster the features (words) at the same time, because the feature clusters, which represent concepts, are more stable among different tasks; this idea is called co-clustering. Following the information-theoretic co-clustering approach of [4], Self-taught Clustering (STC) [3] was proposed, whose objective is to minimize the loss in mutual information before and after clustering and to use the common feature clusters as a bridge for knowledge transfer. STC can be regarded as a domain adaptation method for clustering, a problem similar to multi-task clustering. However, STC handles only two tasks, i.e., the target data and the auxiliary data, and only the clustering performance on the target data is evaluated.
In this paper, based on the above observations, we follow the basic idea of information-theoretic co-clustering on both the instance space and the feature space, and propose a novel algorithm for multi-task clustering. Instead of using the original iterative algorithm for information-theoretic co-clustering, we show that the problem can also be solved in the Nonnegative Matrix Factorization (NMF) framework [10]. Because real-world data and their probability distributions are nonnegative, our model is interpretable compared to others. Furthermore, many applications exploit the (co-)clustering aspect of NMF for the nice characteristics shown in [15], [17]. In detail, we explore 3-factor NMF based on the KL-divergence [5]. A similar idea was also used in [11], [13].

2. The Proposed Method

2.1 Problem Formulation

Suppose we are given several clustering tasks X^{(1)}, X^{(2)}, ..., X^{(i)}, ...; each task can be regarded as a discrete random variable. The i-th task X^{(i)} takes values from the set {x^{(i)}_1, ..., x^{(i)}_{n_i}}, where n_i is the number of instances in the i-th task. Let Z^{(i)} be the discrete random variable, taking values from the set {z^{(i)}_1, ..., z^{(i)}_d}, that corresponds to the feature space of the i-th task with dimensionality d. We assume that the dimensionality of the feature vector is the same for all tasks; the bag-of-words model used in our experiments performs this augmentation automatically by padding zeros. Denote the clustering functions as C_x^{(i)}: X^{(i)} -> \tilde{X}^{(i)} and C_z^{(i)}: Z^{(i)} -> \tilde{Z}^{(i)}; \tilde{X}^{(i)} and \tilde{Z}^{(i)} are used to denote these two functions for brevity. The goal of multi-task clustering is to partition the data set X^{(i)} of each task into c clusters {\tilde{x}^{(i)}_j}_{j=1}^{c}. We assume that the number of clusters is the same in every task, as is also assumed in the existing multi-task literature.

2.2 Objective Function

We first review the preliminaries of information-theoretic co-clustering (ITCC). The mutual information I(X; Z) between two random variables X and Z is a fundamental measure of the information X contains about Z (and vice versa). ITCC judges the quality of a co-clustering by the resulting loss in mutual information,

    I(X; Z) - I(\tilde{X}; \tilde{Z}).    (1)

Definition 1. Let q(X, Z) denote the joint probability distribution of X and Z with respect to the co-clusterings C_x(X) and C_z(Z); formally,

    q(x, z) = p(\tilde{x}, \tilde{z}) \, p(x \mid \tilde{x}) \, p(z \mid \tilde{z}),    (2)

where p(\tilde{x}, \tilde{z}) is defined as

    p(\tilde{x}, \tilde{z}) = \sum_{x \in \tilde{x}} \sum_{z \in \tilde{z}} p(x, z).    (3)

Lemma 2. When the co-clustering functions C_x^{(i)}(X) and C_z^{(i)}(Z) are fixed, the objective function of ITCC in equation (1) can be reformulated as

    I(X; Z) - I(\tilde{X}; \tilde{Z}) = D(p(X, Z) \,\|\, q(X, Z)),    (4)

where D(\cdot \| \cdot) denotes the Kullback-Leibler (KL) divergence, also known as relative entropy, D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}. The proof can be found in [4].

Based on the above preliminaries, we model the objective function of our method as

    J = \sum_i \big[ I(X^{(i)}; Z^{(i)}) - I(\tilde{X}^{(i)}; \tilde{Z}^{(i)}) \big] + \lambda \sum_{i<j} \big[ I(Z^{(i)}; Z^{(j)}) - I(\tilde{Z}^{(i)}; \tilde{Z}^{(j)}) \big].    (5)

[Figure 1. A simple illustration of our model (document spaces X1-X4 with their word spaces Z1-Z4).]

By Lemma 2, we reformulate the function as

    J = \sum_i D\big(p(X^{(i)}, Z^{(i)}) \,\|\, q(X^{(i)}, Z^{(i)})\big) + \lambda \sum_{i<j} D\big(p(Z^{(i)}, Z^{(j)}) \,\|\, q(Z^{(i)}, Z^{(j)})\big).    (6)

Let p(X^{(i)}, Z^{(i)}) be the joint probability distribution with respect to the i-th task. We denote it as a matrix P^{(i)} of size d x n_i, whose rows are indexed by words and whose columns by documents. Let p(Z^{(i)}, Z^{(j)}) be the joint probability distribution with respect to the i-th and j-th tasks. It is a matrix of size d x d and describes the similarity of the words in the raw vocabulary. We use W^{(i,j)} to denote this matrix; its entry W^{(i,j)}_{w_1,w_2} is the joint probability of the co-occurrence of words w_1 and w_2 between tasks i and j. Both matrices can be estimated from data observations.

Note that our model contains two parts working simultaneously: first, task-specific co-clustering, in which we co-cluster the data of each task individually; second, cross-task regularization, in which we mine and refine the relations between the feature clusters of all the tasks. In information-theoretic terms, this objective minimizes the loss in mutual information (MI), before and after co-clustering, both between the instance space and the feature space of each task and between any two feature spaces from different tasks.
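To make Lemma 2 concrete, the short NumPy helper below computes the loss in mutual information for fixed hard co-clustering assignments by building q(x, z) from equations (2) and (3). It is an illustrative sketch of our own (the name itcc_loss and its interface are not from the paper):

import numpy as np

def itcc_loss(P, row_labels, col_labels, eps=1e-12):
    """Loss in mutual information, I(X;Z) - I(X~;Z~), computed via
    Lemma 2 as D(p(X,Z) || q(X,Z)) with q(x,z) = p(x~,z~) p(x|x~) p(z|z~)."""
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    P = P / P.sum()                            # ensure a joint distribution
    n, d = P.shape
    Kr, Kc = row_labels.max() + 1, col_labels.max() + 1
    R = np.zeros((n, Kr)); R[np.arange(n), row_labels] = 1.0   # row indicators
    C = np.zeros((d, Kc)); C[np.arange(d), col_labels] = 1.0   # column indicators
    Pc = R.T @ P @ C                           # p(x~, z~), equation (3)
    px, pz = P.sum(axis=1), P.sum(axis=0)      # marginals p(x), p(z)
    pxt = Pc.sum(axis=1)[row_labels]           # p(x~(x)) for every row x
    pzt = Pc.sum(axis=0)[col_labels]           # p(z~(z)) for every column z
    Q = Pc[np.ix_(row_labels, col_labels)] \
        * np.outer(px / (pxt + eps), pz / (pzt + eps))         # equation (2)
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / (Q[mask] + eps))).sum())

Minimizing equation (5) amounts to driving this quantity down for each task's P^{(i)} and, weighted by lambda, for each cross-task matrix W^{(i,j)}.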

However, the joint distributions in the above equations are difficult to optimize in a traditional way. Note that the key to this optimization task is to find a matrix approximation. Following the idea of [7], we use the multiplicative update rules of 3-factor NMF, which fit our problem naturally: the task-specific co-clustering and the cross-task regularization can both be formulated in an NMF framework, and hence can be jointly optimized. We use the KL divergence as the error criterion for NMF, so the formulation stays consistent with the information-theoretic background.

A 3-factor nonnegative matrix factorization is a decomposition of a nonnegative dyadic data matrix P \in \mathbb{R}^{M \times D}_+ that takes the form P \approx U S V^T:

    \min_{U \ge 0,\, S \ge 0,\, V \ge 0} D(P \,\|\, U S V^T),    (7)

where U \in \mathbb{R}^{M \times K_w}_+, S \in \mathbb{R}^{K_w \times K_d}_+ and V \in \mathbb{R}^{D \times K_d}_+. In our setting the M rows of the matrix index the d words and the D columns index the documents; K_w and K_d are the numbers of word and document clusters, respectively.

To adapt the NMF framework to our objective function, where the matrix to be factorized is a joint probability distribution, we require \sum_{ij} P_{ij} = 1 in equation (8). Defining the normalizing matrices D_U \triangleq \mathrm{diag}(\mathbf{1}^T U) and D_V \triangleq \mathrm{diag}(\mathbf{1}^T V), where \mathbf{1} = [1, ..., 1]^T, we obtain

    P \approx (U D_U^{-1})(D_U S D_V)(V D_V^{-1})^T.    (8)

Comparing (8) with (2), one can see that the distributions p(z \mid \tilde{z}) and p(x \mid \tilde{x}) are associated with (U D_U^{-1}) and (V D_V^{-1}) respectively, and the joint distribution p(\tilde{x}, \tilde{z}) is represented by the entries of (D_U S D_V). Now we can reformulate (6) as

    J = \sum_i D\big(P^{(i)} \,\|\, U^{(i)} S_W^{(i)} V^{(i)T}\big) + \lambda \sum_{i<j} D\big(W^{(i,j)} \,\|\, U^{(i)} S^{(i,j)} U^{(j)T}\big).    (9)

Expanding the KL divergence gives

    J = \sum_t \sum_{ij} \Big\{ P^{(t)}_{ij} \log \frac{P^{(t)}_{ij}}{[U^{(t)} S_W^{(t)} V^{(t)T}]_{ij}} - P^{(t)}_{ij} + [U^{(t)} S_W^{(t)} V^{(t)T}]_{ij} \Big\}
        + \lambda \sum_t \sum_{l=t+1}^{N} \sum_{ij} \Big\{ W^{(t,l)}_{ij} \log \frac{W^{(t,l)}_{ij}}{[U^{(t)} S^{(t,l)} U^{(l)T}]_{ij}} - W^{(t,l)}_{ij} + [U^{(t)} S^{(t,l)} U^{(l)T}]_{ij} \Big\}    (10)

    s.t. \sum_i (U^{(t)})_{ij} = 1, \quad \sum_i (V^{(t)})_{ij} = 1.

2.3 Optimization Algorithm

The objective function in (10), though it seems complicated at first glance, can be optimized easily by a set of multiplicative updates, derived in a way similar to [14]. We use the iterative normalization technique [18] to handle the normalization constraints: the corresponding factor matrices are normalized during the iterative process.

Theorem 3 (The multiplicative update rules of NMFMTCC). Suppose we have N tasks; then 1 \le t < l \le N, and the updates are

    V^{(t)}_{jk} \leftarrow V^{(t)}_{jk} \, \frac{\sum_i \big(P^{(t)}_{ij} / [U^{(t)} S_W^{(t)} V^{(t)T}]_{ij}\big) [U^{(t)} S_W^{(t)}]_{ik}}{\sum_i [U^{(t)} S_W^{(t)}]_{ik}},

    U^{(t)}_{ij} \leftarrow U^{(t)}_{ij} \, \frac{\sum_a \big(P^{(t)}_{ia} / [U^{(t)} S_W^{(t)} V^{(t)T}]_{ia}\big) [S_W^{(t)} V^{(t)T}]_{ja} + \lambda A_{ij}}{\sum_a [S_W^{(t)} V^{(t)T}]_{ja} + \lambda B_{ij}},

where

    A_{ij} = \sum_{l=1}^{t-1} \sum_a \frac{W^{(l,t)}_{ai} \, [U^{(l)} S^{(l,t)}]_{aj}}{[U^{(l)} S^{(l,t)} U^{(t)T}]_{ai}} + \sum_{l=t+1}^{N} \sum_a \frac{W^{(t,l)}_{ia} \, [S^{(t,l)} U^{(l)T}]_{ja}}{[U^{(t)} S^{(t,l)} U^{(l)T}]_{ia}},

    B_{ij} = \sum_{l=1}^{t-1} \sum_a [U^{(l)} S^{(l,t)}]_{aj} + \sum_{l=t+1}^{N} \sum_a [S^{(t,l)} U^{(l)T}]_{ja},

and

    (S_W^{(t)})_{ab} \leftarrow (S_W^{(t)})_{ab} \, \frac{\sum_{ij} U^{(t)}_{ia} \big(P^{(t)}_{ij} / [U^{(t)} S_W^{(t)} V^{(t)T}]_{ij}\big) V^{(t)}_{jb}}{\sum_i U^{(t)}_{ia} \sum_j V^{(t)}_{jb}},

    S^{(t,l)}_{ab} \leftarrow S^{(t,l)}_{ab} \, \frac{\sum_{ij} U^{(t)}_{ia} \big(W^{(t,l)}_{ij} / [U^{(t)} S^{(t,l)} U^{(l)T}]_{ij}\big) U^{(l)}_{jb}}{\sum_i U^{(t)}_{ia} \sum_j U^{(l)}_{jb}}.

We summarize the optimization of (10) in Algorithm 1.

Algorithm 1: NMFMTCC
Input: the number of document clusters K_d, the number of word clusters K_w, the joint distribution matrices {P^{(1)}, P^{(2)}, ..., P^{(N)}}, and the trade-off parameter \lambda.
1. Compute the joint probability distribution matrix W^{(t,l)} for each task pair (P^{(t)}, P^{(l)}), t < l.
2. Obtain the initial U^{(i)} and V^{(i)} for each task i by simultaneous K-means clustering of rows and columns, then set U^{(i)} <- U^{(i)} + 0.2 and V^{(i)} <- V^{(i)} + 0.2.
3. Initialize S_W^{(t)} and S^{(t,l)} with row-normalized constants.
repeat
4. For each task t, update U^{(t)} and V^{(t)};
5. For each task t, normalize U^{(t)} and V^{(t)};
6. For each task t, update S_W^{(t)};
7. For each pair of tasks t and l, t < l, update S^{(t,l)};
until convergence
Output: the document clustering result V^{(t)} and the word clustering result U^{(t)} for each task t.
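For concreteness, the following NumPy sketch implements the updates of Theorem 3 and the loop of Algorithm 1 under stated simplifications. It is our own illustrative reading, not the authors' code: the K-means initialization of step 2 is replaced by a random one (keeping the +0.2 offset), the cross-task matrices W are assumed given, and all names (nmfmtcc, lam, and so on) are ours.

import numpy as np

EPS = 1e-12  # guard against division by zero

def nmfmtcc(P, W, Kw, Kd, lam=1.0, n_iter=50, seed=0):
    """Sketch of Algorithm 1. P[t]: word-by-document joint matrix (d x n_t),
    entries summing to 1. W[(t, l)] for t < l: d x d cross-task word matrix.
    Returns word factors U[t] (d x Kw) and document factors V[t] (n_t x Kd)."""
    rng = np.random.default_rng(seed)
    N, d = len(P), P[0].shape[0]
    # Random initialization plus the 0.2 offset (the paper uses K-means here).
    U = [rng.random((d, Kw)) + 0.2 for _ in range(N)]
    V = [rng.random((P[t].shape[1], Kd)) + 0.2 for t in range(N)]
    Sw = [np.full((Kw, Kd), 1.0 / Kd) for _ in range(N)]
    S = {(t, l): np.full((Kw, Kw), 1.0 / Kw)
         for t in range(N) for l in range(t + 1, N)}

    for _ in range(n_iter):
        for t in range(N):
            # Document factor V^(t): only the within-task term is involved.
            R = P[t] / (U[t] @ Sw[t] @ V[t].T + EPS)       # P / (U S_W V^T)
            A = U[t] @ Sw[t]                               # d x Kd
            V[t] *= (R.T @ A) / (A.sum(axis=0, keepdims=True) + EPS)
            # Word factor U^(t): within-task plus lambda-weighted cross terms.
            R = P[t] / (U[t] @ Sw[t] @ V[t].T + EPS)
            B = Sw[t] @ V[t].T                             # Kw x n_t
            num = R @ B.T
            den = np.tile(B.sum(axis=1), (d, 1))
            for l in range(N):
                if l == t:
                    continue
                # Orient W and S so that W_tl ~ U[t] M U[l]^T.
                if t < l:
                    Wtl, M = W[(t, l)], S[(t, l)]
                else:
                    Wtl, M = W[(l, t)].T, S[(l, t)].T
                Rw = Wtl / (U[t] @ M @ U[l].T + EPS)
                num += lam * (Rw @ (U[l] @ M.T))
                den += lam * np.tile((U[l] @ M.T).sum(axis=0), (d, 1))
            U[t] *= num / (den + EPS)
            # Iterative normalization: columns sum to one.
            U[t] /= U[t].sum(axis=0, keepdims=True) + EPS
            V[t] /= V[t].sum(axis=0, keepdims=True) + EPS
            # Within-task core matrix S_W^(t).
            R = P[t] / (U[t] @ Sw[t] @ V[t].T + EPS)
            Sw[t] *= (U[t].T @ R @ V[t]) / (np.outer(U[t].sum(0), V[t].sum(0)) + EPS)
        # Cross-task core matrices S^(t,l).
        for (t, l), Stl in S.items():
            Rw = W[(t, l)] / (U[t] @ Stl @ U[l].T + EPS)
            S[(t, l)] = Stl * (U[t].T @ Rw @ U[l]) / (np.outer(U[t].sum(0), U[l].sum(0)) + EPS)
    return U, V, Sw, S

Hard document assignments for task t can then be read off as V[t].argmax(axis=1), and word clusters as U[t].argmax(axis=1).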

2.4 Correctness of Convergence

We now show that the above update rules correctly converge to a local optimum. We show this for the update of V, with U, S and S_W fixed; the other update rules can be derived similarly. The objective function of V^{(t)} with the other variables fixed is

    J_{V^{(t)}} = \sum_{ij} \Big\{ P^{(t)}_{ij} \log \frac{P^{(t)}_{ij}}{[U^{(t)} S_W^{(t)} V^{(t)T}]_{ij}} + [U^{(t)} S_W^{(t)} V^{(t)T}]_{ij} \Big\},

where [U^{(t)} S_W^{(t)} V^{(t)T}]_{ij} = \sum_{ab} U^{(t)}_{ia} (S_W^{(t)})_{ab} V^{(t)}_{jb}. The derivative with respect to V^{(t)} is thus

    \frac{\partial J_{V^{(t)}}}{\partial V^{(t)}_{jk}} = \sum_i [U^{(t)} S_W^{(t)}]_{ik} - \sum_i \frac{P^{(t)}_{ij} \, [U^{(t)} S_W^{(t)}]_{ik}}{[U^{(t)} S_W^{(t)} V^{(t)T}]_{ij}}.

The KKT complementarity condition for the nonnegativity of V^{(t)} gives

    \Big( \sum_i [U^{(t)} S_W^{(t)}]_{ik} - \sum_i \frac{P^{(t)}_{ij} \, [U^{(t)} S_W^{(t)}]_{ik}}{[U^{(t)} S_W^{(t)} V^{(t)T}]_{ij}} \Big) V^{(t)}_{jk} = 0.    (11)

This is a fixed-point relation that any local minimum of V^{(t)} must satisfy, and it yields exactly the multiplicative update rule for V^{(t)} given above. The correctness of convergence is therefore guaranteed, since any solution the update rules converge to must satisfy (11).

3. Experimental Evaluations

In our experiments, we compare our proposed multi-task co-clustering method, NMFMTCC, with several widely used single-task clustering methods: K-means, Kernel K-means, Normalized Cut [16], and Graph-regularized Nonnegative Matrix Factorization (GNMF) [1]. We also report the results of these algorithms on the simply merged data (e.g., "All Kmeans"). Moreover, we compare our approach with three recently proposed multi-task clustering methods, LSSMTC, LNKMTC and LSKMTC [9], [8]. We also evaluate the clustering performance under different parameter settings.

The information-theory-based Normalized Mutual Information (NMI) is defined as

    \mathrm{NMI} = \frac{\sum_{i=1}^{c} \sum_{j=1}^{c} n_{i,j} \log \frac{n \, n_{i,j}}{n_i \hat{n}_j}}{\sqrt{\big(\sum_{i=1}^{c} n_i \log \frac{n_i}{n}\big)\big(\sum_{j=1}^{c} \hat{n}_j \log \frac{\hat{n}_j}{n}\big)}},    (12)

where n_i denotes the number of data points in cluster C_i (1 \le i \le c), \hat{n}_j is the number of data points in class G_j (1 \le j \le c), and n_{i,j} is the number of data points in the intersection of cluster C_i and class G_j. NMI measures how similar two partitions are and is suitable in our context; the larger the NMI, the better the clustering result.
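For reference, equation (12) can be computed directly from a pair of label vectors. The sketch below is our own helper written from the formula (scikit-learn's normalized_mutual_info_score computes a closely related quantity, with normalization options that differ slightly):

import numpy as np

def nmi(labels_pred, labels_true):
    """Normalized mutual information as in equation (12)."""
    labels_pred = np.asarray(labels_pred)
    labels_true = np.asarray(labels_true)
    n = len(labels_true)
    clusters, classes = np.unique(labels_pred), np.unique(labels_true)
    # n_{i,j}: contingency counts between clusters C_i and classes G_j.
    cont = np.array([[np.sum((labels_pred == ci) & (labels_true == gj))
                      for gj in classes] for ci in clusters], dtype=float)
    ni, nj = cont.sum(axis=1), cont.sum(axis=0)
    mask = cont > 0
    mi = (cont[mask] * np.log(n * cont[mask] / np.outer(ni, nj)[mask])).sum()
    h1 = -(ni * np.log(ni / n)).sum()   # entropy of the clustering
    h2 = -(nj * np.log(nj / n)).sum()   # entropy of the classes
    return mi / np.sqrt(h1 * h2)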
We experiment on the WebKB4 dataset, a subset of the WebKB dataset, which consists of seven classes of web pages collected from the computer science departments of four different universities. Frequently only four classes are used (student, faculty, course, project), hence the name WebKB4. We use the Rainbow toolkit for data preprocessing. We set the number of document clusters to the true number of classes for all clustering algorithms, and use parameter settings similar to those in [8] for K-means, Kernel K-means and NCut. The results of LSSMTC, LNKMTC and LSKMTC are quoted directly from the literature. For our method, we set K_d = 4 and K_w = 8; the trade-off parameter \lambda is set to 1 after a search over the grid {0, 0.5, 1, 5, 10, 100}, and the maximum number of iterations of NMFMTCC is set to 50. For each algorithm and parameter setting, we repeat the clustering 10 times; the averaged results are shown in Table 1.

[Figure 2. The clustering NMI of the four tasks (task1-task4) under different parameter settings: the trade-off parameter lambda and the number of word clusters.]

[Figure 3. The convergence of NMFMTCC: reconstruction error on WebKB4.]

Table 1. Results on the WebKB4 dataset (NMI %, mean ± std over 10 runs; entries lost in the transcription are marked "...").

Method       | task1        | task2 | task3 | task4
Kmeans       | 23.53 ± ...  | ...   | ...   | ... ± 0.00
KKM          | 21.65 ± ...  | ...   | ...   | ... ± 3.43
NCut         | 25.99 ± ...  | ...   | ...   | ... ± 0.77
GNMF         | 24.76 ± ...  | ...   | ...   | ... ± 0.17
All Kmeans   | 22.58 ± ...  | ...   | ...   | ... ± 0.79
All KKM      | 22.05 ± ...  | ...   | ...   | ... ± 2.86
All NCut     | 24.12 ± ...  | ...   | ...   | ... ± 0.34
All GNMF     | 26.16 ± ...  | ...   | ...   | ... ± 5.59
LSSMTC       | 33.69 ± ...  | ...   | ...   | ... ± 0.96
LNKMTC       | 36.34 ± ...  | ...   | ...   | ... ± 1.21
LSKMTC       | 40.85 ± ...  | ...   | ...   | ... ± 1.16
NMFMTCC(0)   | 31.53 ± ...  | ...   | ...   | ... ± 4.34
NMFMTCC      | 41.49 ± ...  | ...   | ...   | ...

4. Discussions

From the experimental results we draw the following conclusions. 1) Simply clustering the data of all the tasks together (e.g., All Kmeans, All KKM and All NCut) does not necessarily improve the clustering result, because the data distributions of the different tasks are not the same, and combining the data directly violates the i.i.d. assumption of single-task clustering.

2) Our proposed NMFMTCC method performs better than the other methods, mainly for the following reasons. We use the idea of co-clustering to cluster columns and rows simultaneously, while the other methods consider clustering in the instance space only. We generalize our model to pairwise knowledge transfer between task-specific feature spaces, i.e., any two feature spaces are connected during the optimization procedure. In the optimization stage, we adopt the NMF framework, which leads to an interpretable solution as well as good performance. From another point of view, NMF for the co-clustering problem can be seen as a soft version of information-theoretic co-clustering, and thus shares similar or even better characteristics. 3) The clustering accuracy for different values of the trade-off parameter \lambda is shown in Figure 2. The detailed clustering performance for \lambda = 0 is reported in Table 1 as NMFMTCC(0); note that with \lambda = 0 our method degenerates to an independent co-clustering method on each task. The clear drop in NMI shows that the cross-task regularization in our method does indeed help in a multi-task clustering problem.

5. Conclusions and Future Work

We proposed a Multi-task Co-clustering via Nonnegative Matrix Factorization (NMFMTCC) method. NMFMTCC follows the idea of the well-known information-theoretic co-clustering, but in a matrix factorization framework. We optimize an objective function that consists of two parts, task-specific co-clustering and cross-task feature space regularization. Experimental results show significant improvements over related methods. Besides the text clustering scenario, in the future we will try to apply the multi-task clustering idea to more applications, such as collaborative filtering and clustering-based natural image classification.

Acknowledgements

The work was supported by NSFC, the National High Technology Research and Development Program of China (No. 2008AA02Z310) and the 973 Program (No. 2009CB320901).

References

[1] D. Cai, X. He, J. Han, and T. Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1548-1560, 2011.
[2] R. Caruana. Multitask learning. Machine Learning, 28(1):41-75, 1997.
[3] W. Dai, Q. Yang, G. Xue, and Y. Yu. Self-taught clustering. In Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.
[4] I. Dhillon, S. Mallela, and D. Modha. Information-theoretic co-clustering. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2003.
[5] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006.
[6] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2004.
[7] E. Gaussier and C. Goutte. Relation between PLSA and NMF and implications. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2005.
[8] Q. Gu, Z. Li, and J. Han. Learning a kernel for multi-task clustering. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
[9] Q. Gu and J. Zhou. Learning the shared subspace for multi-task clustering and transductive transfer classification. In Ninth IEEE International Conference on Data Mining (ICDM '09). IEEE, 2009.
[10] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788-791, 1999.
[11] T. Li, V. Sindhwani, C. Ding, and Y. Zhang. Bridging domains with words: opinion analysis with matrix tri-factorizations. In Proceedings of the 10th SDM, 2010.
[12] A. Lindbeck and D. Snower. Multi-task learning and the reorganization of work. Journal of Labor Economics, 18(3), 2000.
[13] M. Long, W. Cheng, X. Jin, J. Wang, and D. Shen. Transfer learning via cluster correspondence inference. In 2010 IEEE 10th International Conference on Data Mining (ICDM). IEEE, 2010.
[14] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13, pages 556-562, 2001.
[15] F. Shahnaz, M. Berry, V. Pauca, and R. Plemmons. Document clustering using nonnegative matrix factorization. Information Processing & Management, 42(2):373-386, 2006.
[16] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
[17] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2003.
[18] F. Zhuang, P. Luo, H. Xiong, Q. He, Y. Xiong, and Z. Shi. Exploiting associations between word clusters and document classes for cross-domain text categorization. Statistical Analysis and Data Mining, 2010.
