Dimensional reduction of clustered data sets


Guido Sanguinetti

5th February 2007

Abstract

We present a novel probabilistic latent variable model to perform linear dimensional reduction on data sets which contain clusters. The model simultaneously performs the clustering and the dimensional reduction in an unsupervised manner, providing an optimal visualisation of the clustering. We prove that the resulting dimensional reduction yields Fisher's linear discriminant analysis as a limiting case. Testing the model on both real and artificial data shows marked improvements in the quality of the visualisation and clustering.

1 Introduction

Dimensional reduction techniques form an important chapter in statistical machine learning. Linear methods such as principal component analysis (PCA) and multidimensional scaling are classical statistical tools. While much current research focuses on non-linear dimensional reduction techniques, linear methods are still widely used in a number of applications, from computer vision to bioinformatics. In particular, PCA is still the preferred tool for dimensional reduction in applications such as microarray analysis, where the noisy nature of the data and the lack of knowledge of the underlying processes mean that linear models are generally more trusted.

A major advance in the understanding of dimensional reduction was provided by Tipping and Bishop's probabilistic interpretation of PCA [Tipping and Bishop, 1999b, PPCA]. The starting point for PPCA is a factor-analysis style latent variable model

y = W x + µ + ε.    (1)

Here y is a D-dimensional vector, W is a tall, thin matrix with D rows and q columns (with q < D), x is a q-dimensional latent variable, µ is a constant (mean) vector and ε ~ N(0, σ^2 I) is an error term. Placing a spherical unit normal prior on the latent variable x, Tipping and Bishop proved that, having observed i.i.d. data y_i, at maximum likelihood the columns of W span the principal subspace of the data. Therefore, the generative model (1) can be said to provide a probabilistic equivalent of PCA.
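For concreteness, the maximum likelihood solution of PPCA has the well-known closed form W_ML = U_q (Λ_q - σ^2 I_q)^{1/2} R, where U_q and Λ_q collect the leading eigenvectors and eigenvalues of the sample covariance and R is an arbitrary rotation [Tipping and Bishop, 1999b]. The following minimal NumPy sketch (our own illustration, not part of the paper; the toy data and variable names are arbitrary) recovers the principal subspace this way:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data from the generative model (1), y = W x + mu + eps, with mu = 0
    D, q, N, sigma2 = 10, 2, 500, 0.1
    W_true = rng.normal(size=(D, q))
    X = rng.normal(size=(N, q))
    Y = X @ W_true.T + np.sqrt(sigma2) * rng.normal(size=(N, D))

    # Maximum likelihood PPCA: W_ML = U_q (L_q - sigma2 I)^{1/2} R, with the ML
    # noise variance equal to the mean of the discarded eigenvalues of S
    S = np.cov(Y, rowvar=False, bias=True)
    eigval, eigvec = np.linalg.eigh(S)                 # ascending order
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
    sigma2_ml = eigval[q:].mean()
    W_ml = eigvec[:, :q] @ np.diag(np.sqrt(eigval[:q] - sigma2_ml))
    # The columns of W_ml span the principal subspace of the observed data.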

This result opened the way for a number of applications and generalisations: it allowed a principled treatment of missing data and of mixtures of PCA models [Tipping and Bishop, 1999a], noise propagation through the model [Sanguinetti et al., 2005], and, through different choices of priors, it allowed the linear dimensional reduction framework to be extended to cases where the data cannot be modelled as Gaussian [Girolami and Breitling, 2004].

One major limitation of PCA, which is shared by its extensions, is that it is based on matching the covariance of the data. As such, it is not obvious how to incorporate information on the structure of the data. For example, if the data set has a cluster structure, the dimensional reduction obtained through PCA is not guaranteed to respect this structure. A specific application where this problem is particularly acute is bioinformatics: many microarray studies contain samples from different conditions (e.g. different types of tumour, or control versus diseased), so the data can be expected to be clustered according to conditions [e.g. Golub et al., 1999, Pomeroy et al., 2002]. Dimensional reduction using PCA does not lead to a good visualisation, and further information needs to be included (such as restricting the data set to include only genes that are significantly differentially expressed in the different conditions).

The problem is even more significant if we view dimensional reduction as a feature extraction technique. It is generally assumed that the genes capturing most of the variation in the data are the most important for the physiological process under study [Alter et al., 2000]. However, when the data set contains clusters, the most important features are clearly the ones that discriminate between clusters, which, in general, do not coincide with the directions of greatest variation in the whole data set.

While unsupervised techniques for dimensional reduction of data sets with clusters are relatively understudied, the supervised case leads to linear discriminant analysis (LDA). It was Fisher [1936] who first provided the solution to the problem of identifying the optimal projection to separate two classes. Under the assumption that the class-conditional probabilities are Gaussian with identical covariance, he proved that the optimal one-dimensional projection to separate the data maximises the Rayleigh coefficient

J(e) = d^2 / (σ_1^2 + σ_2^2).    (2)

Here e is the unit vector that defines the projection, d is the projected distance between the (empirical) class centres and σ_i^2 is the projected empirical variance of class i. This criterion can be generalised to the general-covariance, multi-class case [De la Torre and Kanade, 2005].

In this note we propose a latent variable model to perform dimensional reduction of clustered data sets in an unsupervised way. The maximum likelihood solution for the model fulfils an optimality criterion which is a generalisation of Fisher's criterion, and reduces to maximising Rayleigh's coefficient under particular limiting conditions.

The rest of the paper is organised as follows: in the next section we describe the latent variable model and provide an expectation-maximisation (EM) algorithm for finding the maximum likelihood solution. We then prove that, under certain assumptions, the likelihood of the model is a monotonic function of Rayleigh's coefficient, so that LDA can be retrieved as a limiting case of our model. Finally, we discuss the advantages of our model, as well as possible generalisations and relationships with existing models.

2 Latent Variable Model

Let our data set be composed of N D-dimensional vectors y_i, i = 1, ..., N. We have prior knowledge that the data set contains K clusters, and we wish to obtain a q < D dimensional projection of the data which is optimal with respect to the clustered structure of the data. The latent variable model formulation we present is closely related to, and inspired by, the extreme component analysis (XCA) model of Welling et al. [2003]. We explain the data using two sets of latent variables which are forced to map to orthogonal subspaces in the data space; the model then takes the form

y = Ex + µ + Wz.    (3)

In this equation, E is a D × q matrix which defines the projection, so that E^T E = I_q, the identity matrix in q dimensions, µ is a D-dimensional mean vector, and W is a D × (D - q) matrix whose columns span the orthogonal complement of the subspace spanned by the columns of E, i.e. W^T E = 0. Notice that a key difference from the PPCA model of equation (1) is the absence of a noise term; this is common to our model and the XCA model.

Choosing spherical Gaussian prior distributions for the latent variables x and z recovers the XCA model of Welling et al. [2003]. Instead, in order to incorporate the clustered structure of the data, we select the following prior distributions

z ~ N(0, I_{D-q}),
x ~ Σ_{k=1}^K π_k N(m_k, σ^2 I_q),    Σ_{k=1}^K π_k = 1.    (4)

Thus, we have a mixture-of-Gaussians generative distribution in the directions determined by the columns of E, and a general Gaussian covariance structure in the orthogonal directions. Intuitively, the mixture-of-Gaussians part will tend to optimise the fit to the clusters, which (given the spherical covariances) will be obtained when the subspace spanned by E interpolates the centres of the clusters in the optimal (least squares) way. Meanwhile, the general Gaussian covariance in the orthogonal directions will seek to maximise the variance of each cluster in the directions orthogonal to E, hence minimising the sum of the projected variances. Therefore, the projection defined by E will seek an optimal compromise between separating the cluster centres and obtaining tight clusters.
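To make the generative process of equations (3)-(4) concrete, the short NumPy sketch below (an illustration of ours, not code from the paper; parameter values and names are arbitrary) draws a sample data set from the model, constructing an orthonormal E and an orthogonal complement W from a QR decomposition:

    import numpy as np

    rng = np.random.default_rng(1)
    D, q, K, N, sigma = 5, 2, 3, 300, 0.3

    # Orthonormal basis: the first q columns give E, the remaining D-q give W
    # (in the model W need only span the orthogonal complement of E; an
    # orthonormal W is the simplest choice for a toy sample).
    Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
    E, W = Q[:, :q], Q[:, q:]

    # Mixture prior on x, equation (4); the centres m_k sum to zero
    m = rng.normal(size=(K, q))
    m -= m.mean(axis=0)
    pi = np.full(K, 1.0 / K)

    labels = rng.choice(K, size=N, p=pi)
    x = m[labels] + sigma * rng.normal(size=(N, q))   # x ~ sum_k pi_k N(m_k, sigma^2 I_q)
    z = rng.normal(size=(N, D - q))                   # z ~ N(0, I_{D-q})
    Y = x @ E.T + z @ W.T                             # y = E x + W z   (mu = 0)

Projecting such a sample onto the columns of E recovers the mixture structure, while the projection onto the columns of W carries only the unstructured Gaussian component.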

From the definitions of the model and the priors, it is clear that we can set the mean vector µ to zero and consider centred data sets without loss of generality. We can therefore assume that the centres of the mixture components in the prior distribution for x sum to zero,

Σ_{k=1}^K m_k = 0.

2.1 Likelihood

Given the model (3) and the prior distributions on the latent variables (4), it is straightforward to marginalise the latent variables and obtain a likelihood for the model. Using the orthogonality of E and W, we readily obtain the log-likelihood

L(E, W, {m_k}, π, σ) = -(Nq/2) log σ^2 + Σ_{j=1}^N log [ Σ_{k=1}^K π_k exp( -||E^T y_j - m_k||^2 / (2σ^2) ) ] - (N/2) log|C| - (N/2) tr(C^{-1} S).    (5)

Here S = (1/N) Σ_{j=1}^N y_j y_j^T is the empirical covariance of the (centred) data, C = W W^T, and we have omitted a number of constants. With a slight notational abuse, we denote C^{-1} = W (W^T W)^{-2} W^T (using the pseudoinverse of W) and |C| the product of the non-zero eigenvalues of C.

We notice immediately that, as W is unconstrained (except for being orthogonal to E and of full column rank), at maximum likelihood W W^T will match exactly the covariance of the data in the directions perpendicular to E. Therefore, the last term in equation (5) will just be an additive constant proportional to D - q, irrespective of the other parameters, and can be dropped from the likelihood. Also, we can use the orthogonality between E and W to rewrite |C| in terms of E alone. This can be done by observing that, if P and P⊥ represent two mutually orthogonal projections that sum to the identity, then the determinant of a symmetric matrix A can be expressed as |A| = |P A P^T| |P⊥ A P⊥^T|. But since C matches exactly the covariance in the directions orthogonal to E, we obtain that |C| = |S| / |E^T S E|. Substituting these results into the log-likelihood (5), and omitting terms which do not depend on the parameters, we obtain

L(E, {m_k}, π, σ) = -(Nq/2) log σ^2 + Σ_{j=1}^N log [ Σ_{k=1}^K π_k exp( -||E^T y_j - m_k||^2 / (2σ^2) ) ] + (N/2) log|E^T S E|.    (6)

This expression is simply a mixture-of-Gaussians likelihood in the projected space spanned by the columns of E, plus a term that depends only on the projection E. This suggests a simple iterative strategy for maximising the likelihood.
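For reference, the profiled log-likelihood (6) is cheap to evaluate numerically. The sketch below (our own, with hypothetical function and variable names) computes it, up to additive constants, for a centred data matrix Y and given parameters, using a log-sum-exp for the mixture term:

    import numpy as np
    from scipy.special import logsumexp

    def log_likelihood(Y, E, m, sigma2, pi):
        # Equation (6), up to additive constants.
        # Y: (N, D) centred data; E: (D, q) with E^T E = I; m: (K, q); pi: (K,).
        N, q = Y.shape[0], E.shape[1]
        Yt = Y @ E                                             # E^T y_j for all j
        d2 = ((Yt[:, None, :] - m[None, :, :]) ** 2).sum(-1)   # ||E^T y_j - m_k||^2
        mix = logsumexp(np.log(pi)[None, :] - d2 / (2.0 * sigma2), axis=1).sum()
        S = Y.T @ Y / N
        _, logdet = np.linalg.slogdet(E.T @ S @ E)             # log |E^T S E|
        return -0.5 * N * q * np.log(sigma2) + mix + 0.5 * N * logdet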

2.2 E-M algorithm

We can optimise the likelihood (6) using an E-M algorithm [Dempster et al., 1977]. This is an iterative procedure that is guaranteed to converge to a (possibly local) maximum of the likelihood. We introduce unobserved binary class-membership vectors c ∈ R^K, and denote by ỹ_j = E^T y_j the projected data. Introducing the responsibilities

γ_jk = π_k exp( -||ỹ_j - m_k||^2 / (2σ^2) ) / Σ_{k'=1}^K π_{k'} exp( -||ỹ_j - m_{k'}||^2 / (2σ^2) ),    (7)

we obtain a lower bound on the log-likelihood of the form

Q(E, {m_k}, σ, π) = Σ_{j=1}^N Σ_{k=1}^K γ_jk log [ π_k p(ỹ_j | c_k, m_k, σ) ] + (N/2) log|E^T S E|,    (8)

where the class-conditional probabilities p(ỹ | c_k, Θ) are the Gaussian mixture components and Θ denotes collectively all the model parameters. Optimising the bound (8) with respect to the parameters leads to fixed-point equations. For the latent means these are

m_k = Σ_{j=1}^N γ_jk ỹ_j / Σ_{j=1}^N γ_jk.    (9)

This simply tells us that the latent means are the means of the projected data weighted by the responsibilities. Similarly, we obtain a fixed-point equation for σ^2,

σ^2 = (1/(Nq)) Σ_{j=1}^N Σ_{k=1}^K γ_jk ||ỹ_j - m_k||^2 = (1/(Nq)) Σ_{k=1}^K σ_k^2,    (10)

where we have introduced the projected variances σ_k^2 = Σ_{j=1}^N γ_jk ||ỹ_j - m_k||^2. Again, we can easily obtain an update for the mixing coefficients,

π_k = N_k / N = (1/N) Σ_{j=1}^N γ_jk.    (11)

Substituting equation (9) into equation (10), and equation (10) into equation (8), we obtain, ignoring constants,

Q(E) = log [ |E^T S E| / |E^T S_K E| ],    (12)

where

S_K = (1/N) Σ_{k=1}^K Σ_{j=1}^N γ_jk (y_j - ȳ_k)(y_j - ȳ_k)^T,    ȳ_k = Σ_{j=1}^N γ_jk y_j / Σ_{j=1}^N γ_jk.

Optimisation of the likelihood with respect to the projection directions E is then a matter of simple linear algebra, and reduces to a generalised eigenvalue problem.
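The resulting procedure alternates the E-step (7) with the M-step updates (9)-(11) and a re-estimation of E from the generalised eigenvalue problem implied by (12). The following NumPy/SciPy sketch of a single iteration is our own illustration (function and variable names are hypothetical; it assumes a centred data matrix Y and the determinant-ratio form of (12) given above):

    import numpy as np
    from scipy.linalg import eigh

    def em_step(Y, E, m, sigma2, pi):
        # One EM iteration for the clustered projection model, equations (7)-(12).
        # Y: (N, D) centred data; E: (D, q) orthonormal projection; m: (K, q)
        # latent means; sigma2: scalar variance; pi: (K,) mixing proportions.
        N, D = Y.shape
        q, K = E.shape[1], m.shape[0]
        S = Y.T @ Y / N                                    # empirical covariance

        # E-step: responsibilities, equation (7)
        Yt = Y @ E                                         # projected data E^T y_j
        d2 = ((Yt[:, None, :] - m[None, :, :]) ** 2).sum(-1)
        logw = np.log(pi)[None, :] - d2 / (2.0 * sigma2)
        logw -= logw.max(axis=1, keepdims=True)            # stabilised normalisation
        gamma = np.exp(logw)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: equations (9), (10) and (11)
        Nk = gamma.sum(axis=0)
        m_new = (gamma.T @ Yt) / Nk[:, None]
        d2_new = ((Yt[:, None, :] - m_new[None, :, :]) ** 2).sum(-1)
        sigma2_new = (gamma * d2_new).sum() / (N * q)
        pi_new = Nk / N

        # Soft within-cluster scatter S_K and the update of E from (12):
        # maximise |E^T S E| / |E^T S_K E|, i.e. take the top-q generalised
        # eigenvectors of (S, S_K) and orthonormalise (the objective depends
        # only on the spanned subspace).
        ybar = (gamma.T @ Y) / Nk[:, None]
        R = Y[:, None, :] - ybar[None, :, :]
        S_K = np.einsum('nk,nkd,nke->de', gamma, R, R) / N + 1e-9 * np.eye(D)
        _, V = eigh(S, S_K)                                # eigenvalues ascending
        E_new, _ = np.linalg.qr(V[:, -q:])
        return E_new, m_new, sigma2_new, pi_new, gamma

In a complete implementation this step would be iterated to convergence, re-projecting the data and the latent means after each update of E; the small jitter added to S_K keeps the generalised eigenproblem well posed when clusters are very tight.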

Equation (12) is noteworthy. The denominator is a soft-assignment, multi-dimensional generalisation of the sum of intra-cluster variances which appeared in the denominator of equation (2). Therefore, optimising the likelihood with respect to the projection subspace E will tend to make the projected clusters tighter. Furthermore, the matrix S can be decomposed as S_K + S_T [e.g. Bishop, 2006], where S_T is the generalisation of the inter-cluster variance appearing in the numerator of (2). Therefore, the likelihood bound is a monotonically increasing function of a generalisation of the Rayleigh coefficient, and can be viewed as a probabilistic version of LDA. In the next section we make this correspondence precise in the original Fisher's discriminant case, and show under which conditions the two techniques are equivalent.

3 Relationship with linear discriminant analysis

We now make a number of simplifying assumptions: we assume our data set to be composed of two clusters, and we choose to project it to one dimension, so that q = 1 and the projection is defined by a single vector e rather than a matrix E. If the clusters are well separated, assuming they are balanced in the data, the likelihood of equation (6) can be simplified by neglecting the probability that points in one cluster were generated by the other cluster, obtaining

L = -(Nq/2) log σ^2 - Σ_{k=1}^{2} Σ_{j=1}^{N_k} (1/(2σ^2)) [ (y_j - e m_k)^T e ]^2 + (N/2) log(e^T S e) + const,    (13)

where we have replaced π_i = 1/2 and N_i is the number of points assigned to cluster i. Using the fact that µ = (1/N) Σ_{j=1}^N y_j = 0, we write

y_j = µ_i + v_j,    (14)

where µ_i = (1/N_i) Σ_{j=1}^{N_i} y_j. The intra-cluster covariances are then S_i = (1/N_i) Σ_j v_j v_j^T. Since there are two clusters, we also have µ_1 = -µ_2 = µ and m_1 = -m_2 = m. Using equation (14), we obtain that S = S_1 + S_2 + 6µµ^T. Defining σ_i^2 = e^T S_i e, the projected variance of cluster i, and d = 2µ^T e, the projected distance separating the clusters, it is easy to compute the maximum likelihood estimates of m and σ^2, given respectively by m = d and σ^2 = α(σ_1^2 + σ_2^2), with α a constant. Notice that these can also be obtained from equations (9)-(10) by taking the limit in which the responsibilities are 0 or 1 (hard assignments).

Therefore, we obtain that the likelihood (13) can be rewritten as

L = log[ (σ_1^2 + σ_2^2 + 6d^2) / (σ_1^2 + σ_2^2) ] + const = log[ 1 + 6 J(e) ] + const,

where J(e) is the Rayleigh coefficient of equation (2). Therefore, the maximum likelihood solution is obtained at the maximum of the Rayleigh coefficient, showing that the model reduces to Fisher's discriminant in these conditions.
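This equivalence is easy to verify numerically. The sketch below (illustrative only; the toy data and names are ours) generates two balanced, well-separated clusters, computes Fisher's closed-form direction e ∝ S_W^{-1}(µ_1 - µ_2), and compares it with the direction selected by the model criterion, namely the leading generalised eigenvector of S with respect to S_K under hard assignments; the two directions coincide up to sign:

    import numpy as np
    from scipy.linalg import eigh

    rng = np.random.default_rng(2)

    # Two balanced, well-separated clusters in D dimensions
    D, N = 5, 400
    mu = rng.normal(size=D)
    mu *= 3.0 / np.linalg.norm(mu)
    Y = np.concatenate([ mu + rng.normal(scale=0.5, size=(N // 2, D)),
                        -mu + rng.normal(scale=0.5, size=(N // 2, D))])
    labels = np.repeat([0, 1], N // 2)
    Y -= Y.mean(axis=0)

    # Fisher's closed-form direction: e proportional to S_W^{-1} (mu_1 - mu_2)
    mu1, mu2 = Y[labels == 0].mean(axis=0), Y[labels == 1].mean(axis=0)
    S_W = np.cov(Y[labels == 0], rowvar=False) + np.cov(Y[labels == 1], rowvar=False)
    e_fisher = np.linalg.solve(S_W, mu1 - mu2)
    e_fisher /= np.linalg.norm(e_fisher)

    # Model criterion with hard assignments: leading generalised eigenvector of (S, S_K)
    S = Y.T @ Y / N
    R = Y - np.where(labels[:, None] == 0, mu1, mu2)   # within-cluster deviations
    S_K = R.T @ R / N
    _, V = eigh(S, S_K)                                # eigenvalues ascending
    e_model = V[:, -1] / np.linalg.norm(V[:, -1])

    print(abs(e_fisher @ e_model))                     # close to 1: the directions coincide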

4 Discussion

In this paper we have addressed the question of selecting an optimal linear projection of a high-dimensional data set, given prior knowledge of the presence of cluster structure in the data. The proposed solution is a generative model made up of two orthogonal latent variable models: one is an unconstrained, PPCA-like model [Tipping and Bishop, 1999b]; the other is a projection (i.e. the loading vectors have unit length) with a mixture-of-Gaussians prior on the latent variables. The result of this rather artificial choice of priors is that the PPCA-like part will tend to maximise the variance of the data in the directions orthogonal to the projection, hence forcing the projected clusters to maximise the inter-cluster spread while minimising the intra-cluster variance. Indeed, we prove that the maximum likelihood estimator for the projection plane in our model satisfies an unsupervised, soft-assignment generalisation of the multidimensional version of the criterion for LDA. We also prove that when the projection is one dimensional and the clusters are well separated, the likelihood of the model reduces to a monotonic function of the Rayleigh coefficient used in Fisher's discriminant analysis.

Our model is related to several previously proposed models. It can be viewed as an adaptation of the XCA model of Welling et al. [2003] to the case where the data set has a clustered structure; it is remarkable, though, that the generalised LDA criterion is obtained without imposing that the variance in the directions orthogonal to the projection be maximal. Bishop and Tipping [1998] dealt with the problem of visualising clustered data sets by proposing a hierarchical model based on a mixture of PPCAs [Tipping and Bishop, 1999a]. While this approach successfully addresses the problem of analysing the detailed structure of each cluster, it does not provide a single optimal projection for visualising all the clusters simultaneously. Perhaps the contribution closest to ours is Girolami [2001]. This paper considered the clustering problem using an independent component analysis (ICA) model with one latent binary variable corrupted by Gaussian noise. The author then proved that the maximum likelihood solution returned an unsupervised analogue of Fisher's discriminant. This approach is somewhat complementary to ours in that it uses discrete rather than continuous latent variables, but the end result is remarkably close. In fact, in the one-dimensional, two-class case the objective function of the ICA model is the same as the likelihood (6). The major difference comes when considering latent spaces of more than one dimension, which is generally the most important case. Generalising the ICA approach would enforce an independence constraint between the directions of the projection which is not plausible; this is not a problem in our approach. Also, the treatment of cases where the intra-cluster spread differs between clusters is much more straightforward in our model: in the approach of Girolami [2001] one has to introduce a class-dependent noise level, while in our approach it just amounts to a straightforward modification of the E-M algorithm.

There are several interesting directions in which to generalise the present work. One would be to extend it to non-linear dimensional reduction techniques such as those of Bishop et al. [1998] and Lawrence [2005]. While this is in principle possible, owing to the shared probabilistic nature of these models, in practice it will require considerable further work. Another important direction would be to automate model selection (such as the choice of the number of clusters or of latent dimensions) using Bayesian techniques such as Bayesian PCA and non-parametric Dirichlet process priors [Bishop and Tipping, 1998, Neal, 2000]. While this is feasible, careful consideration should be given to issues of numerical efficiency.

Acknowledgments

The author wishes to thank Mark Girolami for useful discussions and suggestions on ICA models for clustering.

References

O. Alter, P. O. Brown, and D. Botstein. Singular value decomposition for genome-wide expression data processing and modeling. PNAS, 97(18):10101-10106, 2000.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, Singapore, 2006.

C. M. Bishop, M. Svensen, and C. K. Williams. GTM: the generative topographic mapping. Neural Computation, 10(1):215-234, 1998.

C. M. Bishop and M. E. Tipping. A hierarchical latent variable model for data visualisation. IEEE Transactions PAMI, 20(3):281-293, 1998.

F. De la Torre and T. Kanade. Multimodal oriented discriminant analysis. In Proceedings of the International Conference on Machine Learning (ICML), 2005.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1-38, 1977.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188, 1936.

M. Girolami. Latent class and trait models for data classification and visualisation. In Independent Component Analysis: Principles and Practice. Cambridge University Press, Cambridge, 2001.

M. Girolami and R. Breitling. Biologically valid linear factor models of gene expression. Bioinformatics, 20(17):3021-3033, 2004.

T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531-537, 1999.

N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783-1816, 2005.

R. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249-265, 2000.

S. M. Pomeroy, P. Tamayo, M. Gaasenbeek, L. M. Sturla, M. Angelo, M. E. McLaughlin, J. Y. H. Kim, L. C. Goumnerova, P. M. Black, C. Lau, J. C. Allen, D. Zagzag, J. M. Olson, T. Curran, C. Wetmore, J. A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D. N. Louis, J. P. Mesirov, E. S. Lander, and T. R. Golub. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415:436-442, 2002.

G. Sanguinetti, M. Milo, M. Rattray, and N. D. Lawrence. Accounting for probe-level noise in principal component analysis of microarray data. Bioinformatics, 21(19):3748-3754, 2005.

M. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443-482, 1999a.

M. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611-622, 1999b.

M. Welling, F. Agakov, and C. K. Williams. Extreme component analysis. In Advances in Neural Information Processing Systems, 2003.