arxiv: v1 [stat.me] 7 Aug 2015

Size: px

Start display at page:

Download "arxiv: v1 [stat.me] 7 Aug 2015"

Rudolf Wilcox
5 years ago
Views:

1 Dimension reduction for model-based clustering Luca Scrucca Università degli Studi di Perugia August 0, 05 arxiv:508.07v [stat.me] 7 Aug 05 Abstract We introduce a dimension reduction method for visualizing the clustering structure obtained from a finite mixture of Gaussian densities. Information on the dimension reduction subspace is obtained from the variation on group means and, depending on the estimated mixture model, on the variation on group covariances. The proposed method aims at reducing the dimensionality by identifying a set of linear combinations, ordered by importance as quantified by the associated eigenvalues, of the original features which capture most of the cluster structure contained in the data. Observations may then be projected onto such a reduced subspace, thus providing summary plots which help to visualize the clustering structure. These plots can be particularly appealing in the case of high-dimensional data and noisy structure. The new constructed variables capture most of the clustering information available in the data, and they can be further reduced to improve clustering performance. We illustrate the approach on both simulated and real data sets. Keywords: Dimension reduction, Model-based clustering, Gaussian mixture, Sliced inverse regression, Feature selection. Introduction Model-based clustering assumes that the observed data are generated from a mixture of G components, each representing the probability distribution for a different group or cluster (McLachlan and Peel, 000). The general form of a finite mixture model is f(x) = G g= π gf g (x θ g ), where the π g s are the mixing probabilities such that π g 0 and π g =, f g ( ) and θ g are, respectively, the density and the parameters of the g-th component (g =,..., G). With continuous data, we often take the density for each mixture component to be the multivariate Gaussian φ(x µ g, Σ g ) with parameters θ g = (µ g, Σ g ). Thus, clusters are ellipsoidal, centered at the means µ g, and with other geometric features, such as volume, shape and orientation, determined by Σ g. Parsimonious parametrization of covariance matrices can be adopted through eigenvalue decomposition in the form Σ g = λ g D g A g D g, where λ g is a scalar controlling the volume of the ellipsoid, A g is a diagonal matrix specifying the shape of the density contours, and D g is an orthogonal matrix which determines the orientation of the corresponding ellipsoid (Banfield and Raftery, 99; Celeux and Govaert, 995). Fraley and Raftery (006, Table ) report some parametrizations of within-group covariance matrices available in the MCLUST software, and the corresponding geometric characteristics. Maximum likelihood estimates for this type of mixture models can be computed via the EM algorithm (Dempster et al, 977; Fraley and Raftery, 00), while model selection could be based on the Bayesian information criterion (BIC) (Fraley and Raftery, 998) or the integrated complete-data likelihood (ICL) criterion (Biernacki et al, 000). In this paper we propose a dimension reduction method for model-based clustering, where we pursue the estimation of a reduced subspace which captures most of the clustering structure contained in the data. Following the work of Li (99, 000) on Sliced Inverse Regression (SIR), information on the dimension reduction subspace is obtained from the variation on group means

2 and, depending on the estimated mixture model, on the variation on group covariances. The estimated directions, ordered by importance as quantified by the associated eigenvalues, span a dimension reduction subspace. Observations may then be plotted on such a subspace, hence providing summary plots which visualize the clustering structure, and may help to understand the clustering and improve accuracy. In Section we introduce the problem of clustering on a dimension reduced subspace, and we show that assignment of an observation to a given cluster is unchanged if performed on a suitable subspace. Then we discuss estimation of the directions which span the reduced subspace, and we provide some of their properties. In Section we review and adapt a greedy search procedure for selecting a subset of the new constructed variables which retains most of the clustering information contained in the data. In Section 4 we describe a data analysis example based on a synthetic data set; then we discuss results from simulation studies where we compare the proposed approach to model-based clustering performed on both the original variables and the leading principal components. Section 5 presents applications of the proposed procedure on some real data sets. The final section contains some concluding remarks. Dimension reduction for model-based clustering Classical procedures for dimensionality reduction are principal components analysis and factor analysis, both of which reduce dimensionality by forming linear combinations of the features. The first method seeks a lower-dimensional representation that account for the most variance of the features, while the second looks for the most correlation among the features. However, neither method correctly addresses the problem of visualizing any potential clustering structure. Chang (98) discussed a simulated example from a 5-dimensional mixture model to show the failure of principal components as a method for reducing the dimension of the data before clustering. Let X = 0.5 d+d Y +Z, where d = i (i =,..., 5), Y Bernoulli(0.), Z N(µ, Σ) with mean µ 5 = [0,..., 0] and covariance matrix Σ 5 5 = [σ ij ], σ ii =, σ ij = 0.f i f j, where the first 8 elements of f are 0.9 and the last 7 are 0.5. With this scheme the first 8 variables can be considered roughly as a block of variables with the same correlations, while the rest of the variables form another block. Using this scheme we simulated n = 00 data points. Figure shows the scatterplot matrix of the first, second and last principal components, and the first GMMDR variable (to be discussed on Section.) obtained from a two components EEE mixture model. As it can be seen, the first two PCs are not able to show any clustering information, but, as discussed by Chang (98), the first and the last PCs are needed. On the contrary, only one GMMDR variable is required to clearly separate the clusters. Furthermore, the coefficients (under unit norm constraint) of the linear combination defining such a direction reproduce the blocking structure used to simulate the variables, with the first 8 variables having coefficients approximately equal to 0. and the remaining 7 coefficients approximately equal to 0.6. Tipping and Bishop (999b) proposed a probabilistic principal component analysis based on a factor analysis model for the data assuming that the distribution of the errors are isotropic, i.e. the covariance matrix is proportional to the identity matrix. They also developed a mixture of probabilistic principal components analyzers model (Tipping and Bishop, 999a), and a hierarchical visualization algorithm for exploring clusters (Bishop and Tipping, 998). McLachlan and Peel (000) and McLachlan et al (00) used mixtures of factor analyzers, by postulating that the distribution of the data within any group or latent class follows a factor analysis model. Parsimonious Gaussian mixture models based on mixtures of factor analyzers were discussed by McNicholas and Murphy (008). In order to deal with high dimensional data, Bouveyron et al (007) proposed a family of parsimonious GMMs on clustering subspaces. All such models mainly address the problem of reducing the number of parameters to be estimated by a suitable reparametrization of the covariance matrix. Since they produce estimates of (latent) factor an-

3 PCA PCA PCA GMMDR Figure : Scatterplot matrix of st, nd and 5th PC, and st GMMDR variable for the Chang data with points marked according to cluster membership. alyzers, they can also be used to reduce dimensionality. However, as noted by McLachlan and Peel (000, p. 48), the resulting plots are not always useful in showing clustering structure. To support such claim, they presented an example using a simple two clusters synthetic data set where the differences between the group means are only in one variable, and with withingroup variation in the other variables being relatively large. They showed that both principal components and factor analyzers were unable to show the underlying clustering structure (see McLachlan and Peel, 000, p. 40, 54 55). We applied the GMMDR method proposed in this paper, and the data projected onto the subspace spanned by the first two directions are shown in Figure : the two clusters appear well separated with one group showing less variation than the other.. Clustering on a dimension reduced subspace Consider the (p ) vector X of random variables, and the discrete random variable Y taking G distinct values to indicate the G clusters. Let β denote a fixed (p d) matrix, where d p, such that Y X β X. This conditional independence statement says that the distribution of Y X is the same as that of Y β X for all values of X in its marginal sample space. As a consequence, the (p ) vector X can be replaced by the (d ) vector β X without loss of clustering information. If d < p, we have effectively reduced the dimension of the predictor vector. The matrix β provides a basis for the subspace S(β), which in a regression context is called a dimension-reduction subspace (DRS) for the regression of Y on X (Li, 99). It can be shown that a minimum DRS may not be unique, but when several of such subspaces exist they all have the same dimension. Let S Y X denote the intersection of all DRSs. Under various reasonable conditions given in Cook (998, Chap. 6), S Y X is a unique minimum DRS and it is

4 Dir Dir Figure : Plot of McLachlan and Peel (000, p. 9 40) synthetic data projected along the first two GMMDR directions, with points marked according to true group origin. The curve represents clustering boundary. called central DRS. The assumption Y X β X implies that Pr(Y = g X) = Pr(Y = g β X), so we may write the density for the g-th component of the mixture model as f g (x) = Pr(Y = g x)f X(x) Pr(Y = g) = Pr(Y = g β x)f X (x) Pr(Y = g) = f g (β f X (x) x) f β X (β x), and for any two groups, g and k, the ratio of the densities is f g (x) f k (x) = f g(β x) f k (β x). Thus, if Y X β X the ratio of the conditional densities is the same whether it is computed on the original variables space or on the DRS, so the clustering information is completely contained in S(β). For instance, consider the simple case of two clusters with bivariate Gaussian distribution x (Y = j) N(µ j, I ) for j =,, where µ = [, 0], µ = [, 0], and I is the ( ) identity matrix. The ratio of the densities is thus given by f (x) f (x) = exp{ (x µ ) (x µ ) + (x µ ) (x µ )} = exp{ x }. It is straightforward to see that the direction containing all the clustering information is β = [, 0], so the ratio of the densities projected along such direction is f (β x) f (β x) = exp{ (β x β µ ) + (β x β µ ) } = exp{ (x + ) + (x ) } = exp{ x }. () 4

5 In model-based clustering an observation is assigned to cluster g by the maximum a posteriori principle (MAP), i.e. to the cluster for which the conditional probability given the data is a maximum: π g f g (x) arg g max Pr(Y = g x) = G j= π jf j (x). By equation (), the last expression is equivalent to arg g max Pr(Y = g β x) = π g f g (β x) G j= π jf j (β x), so the assignment of an observation to a cluster is unchanged if performed on the DRS. Cook and Yin (00, p. 56) considered D(Y X) = arg g max Pr(Y = g x) and defined a discriminant subspace as a subspace S such that D(Y X) = D(Y P S X) for all values of X, where P S denotes the projection operator onto S with respect to the usual inner product. The intersection of all discriminant subspaces, if it exists, is itself a discriminant subspace, and is called the central discriminant subspace (CDS) and denoted by S D(Y X). Finally, the CDS is a subset of the central DRS, S D(Y X) S Y X, and in some cases may be a proper subset. In the case of known group memberships, a variety of methods for dimensionality reduction have been proposed (Cook and Yin, 00). These methods, such as Sliced Inverse Regression (SIR) and Sliced Average Variance Estimation (SAVE), are not restricted to any particular classification method. In the following subsection we discuss estimation of a dimension reduction subspace for a model-based clustering procedure.. Estimation of GMMDR directions Suppose we describe a set of n observations on p variables through a G components Gaussian mixture model (GMM) of the form f(x) = G g= π gφ(x µ g, Σ g ). From a graphical point of view, we pursue the smallest subspace that can capture the clustering information contained in the data. A natural starting point would require to look for those directions where the cluster means µ g vary as much as possible, provided each direction is orthogonal to the others. This amounts to solving the following optimization problem: arg β max β Σ B β, subject to β Σβ = I d, where Σ B = G g= π g(µ g µ)(µ g µ) is the between-cluster covariance matrix, Σ = n n i= (x i µ)(x i µ) is the covariance matrix with µ = G g= π gµ g, β R p d is the spanning matrix, and I d is the (d d) identity matrix. The solution to this constrained optmization is given by the eigendecomposition of the kernel matrix M I Σ B with respect to Σ. The eigenvectors, corresponding to the first d largest eigenvalues, [v,..., v d ] β, provide a basis for the subspace S(β) which shows the maximal variation among cluster means. There are at most d = min(p, G ) directions which span this subspace. This procedure is similar to the sliced inverse regression (SIR) algorithm (Li, 99). SIR is a dimension reduction method which exploits the information from the inverse mean function (see Li, 000, Chap. ). Instead of conditioning on the response variable (or a sliced version of it if continuous), here we condition on the cluster memberships. SIR directions span at least part of the dimension reduction subspace (Cook, 998, Chap. 6) and they provide an intuitive and useful basis for constructing summary plots. However, they may miss relevant structures in the data when within-cluster covariances are different. The SIR II method (Li, 000, Chap. 5) exploits the information coming from the differences in the class covariance matrices. The kernel matrix is now defined as: G M II = π g (Σ g Σ)Σ (Σ g Σ), g= 5

6 where Σ = G g= π gσ g is the pooled within-cluster covariance matrix, and directions are found through the eigendecomposition of M II with respect to Σ. Albeit the corresponding directions show differences in group covariances, they are usually not able to also show location differences. The proposed approach aims at finding those directions which, depending on the selected Gaussian mixture model, are able to display both variation in cluster means and variation in cluster covariances. Definition. Consider the following kernel matrix M = M I Σ M I + M II. () The basis of the dimension reduction subspace S(β) is the solution of the following constrained optimization: arg β max β Mβ, subject to β Σβ = I d. This is solved through the generalized eigendecomposition Mv i = l i Σv i, v i Σv j = if i = j, and 0 otherwise, l l... l d > 0. () Thus, the basis of the required subspace is given by the GMMDR directions β [v,..., v d ]. The kernel matrix () contains information from variation on both cluster means and cluster covariances. For mixture models which assume constant within-cluster covariance matrices, such as models E, EII, EEI, and EEE (see Fraley and Raftery, 006, Table ), the subspace spanned by M is equivalent to that spanned by M I. Remark. Let S(β) be the subspace spanned by the (p d) matrix β of eigenvectors from (), and µ g and Σ g be the means and covariance, respectively, for the g-th cluster. It can be easily seen that the projection of the parameters onto the subspace S(β) are given by β µ g and β Σ g β. Furthermore, the projection of the (n p) data matrix X onto S(β) can be computed as Z = Xβ; we call these GMMDR variables. Remark. The raw coefficients of the estimated directions are uniquely determined up to multiplication by a scalar, whereas the associated directions from the origin are unique. Hence, we can adjust their length such that they have unit length, i.e. each direction is normalized as β j v j / v j for j =,..., d. In matrix form, let D = diag(v V ), where V is the matrix of eigenvectors from (), then β V D /. For the GMMDR variables Cov(Z) = β Σβ = D / V ΣV D / = D = diag(/ v j ). Thus, they are uncorrelated, and with variances inversely related to the squared length of the eigenvectors of the kernel matrix, while GMMDR directions are orthogonal with respect to Σ-inner product. We now provide some properties as propositions whose proofs are reported in the Appendix A.. Proposition. GMMDR directions are invariant under affine transformation x Cx + a, for C a nonsingular matrix and a a vector of real values. Thus, GMMDR directions in the transformed scale are given by C β. Proposition. Each eigenvalue of the eigendecomposition in () can be decomposed in the sum of the contributions given by the squared variance of the between group means and the average of the squared within group variances along the corresponding direction of the projection subspace, i.e. l i = Var(E(Z i Y )) + E(Var(Z i Y ) ) for i =,..., d. 6

7 Based on this result we may provide an interpretation to the contribution of each direction to the visualization of the clustering structure. Furthermore, directions corresponding to small eigenvalues provide little or no information about differences in cluster means or covariances. Remark. Let X be the (n p) sample data matrix which we assume, with no loss of generality, to have zero mean column-vectors. The sample version M of () is obtained using the corresponding estimates from the fit of a Gaussian mixture model via the EM algorithm. Then, GMMDR directions are calculated from the generalized eigen-decomposition of M with respect to Σ. Remark 4. The dimension of the estimated subspace S(β) is d = min{p, G } for models which assume equal within-cluster covariance matrices (i.e. models E, EII, EEI, and EEE), while d p otherwise. The GMMDR directions are ordered based on the associated eigenvalues, so those directions corresponding to approximately zero eigenvalues can be discarded in practice since clusters largely overlap along such directions. A more formal approach to the selection of relevant directions is discussed in Section. Subset selection of GMMDR variables In the previous section we have discussed the estimation of GMMDR variables, which is basically a linear mapping method onto a suitable subspace. This can be viewed as a form of (soft) feature extraction, where the components are reduced through a set of linear combinations of the original variables. However, it may be possible that a subset of the estimated GMMDR variables provide either no clustering information or redundant clustering information already provided by other variables. Consider the partitioned (p p) matrix (β, β 0 ), which forms a basis for R p. Based on proposition of Cook and Yin (000, p.57), if β 0 X β X Y and β 0 X Y then S(β) is a DRS. This is useful since, albeit it does not indicate how to construct β, it indicates that unnecessary variables β 0 X are independent from relevant variables β X within groups, and that they are also marginally independent from Y. In our case, given that unnecessary GMMDR variables provide no clustering information but require parameters estimation, we would like to detect and discard them. In the following we consider a subset selection method for GMMDR variables following the approach of Raftery and Dean (006), who proposed the use of BIC to approximate Bayes factors for comparing mixture models fitted on nested subsets of variables. A generalization of this approach has been recently discussed by Maugis et al (009).. A criterion for feature selection Let S be the set of indices with dim(s ) = q indexing a subset of q features from the original d GMMDR variables Z (q d), and S = {S \ i} S be the set of dim(s ) = q obtained excluding the i-th feature index from S. The comparison of the two nested subsets can be recast as a model comparison problem and addressed using BIC difference as an approximation to Bayes factor. Following Raftery and Dean (006), the BIC difference for the inclusion of the i-th feature is given by BIC diff (Z i S ) = BIC clust (Z S ) BIC not clust (Z S ), (4) where BIC clust (Z S ) is the BIC value for the best model fitted using features in S, and BIC not clust (Z S ) is the BIC value for no clustering. The latter can be written as BIC not clust (Z S ) = BIC clust (Z S ) + BIC reg (Z i Z S ), i.e. the BIC value for the best model fitted using features in S plus the BIC value for the regression of the i-th feature on the remaining (q ) features in S. Noting that GMMDR 7

8 variables are orthogonal, the general formula for BIC reg (see Raftery and Dean, 006, eq. (7)) reduces to BIC reg (Z i Z S ) = n log(π) n log( σ i ) n (q+) log(n), where σ i is the maximum likelihood estimate of the i-th feature variance. In all cases the best clustering model is identified with respect to both the number of mixture components and model parametrization.. A greedy search algorithm for selecting the best subset GMMDR variables are defined as orthogonal linear combinations of the original variables, and it is likely that only a subset of them is needed for clustering. However, the space of all possible subsets of size q, with q ranging from to d, has number of elements equal to d. An exhaustive search soon becomes unfeasible, even for moderate values of d. To alleviate this problem, Raftery and Dean (006) proposed a greedy search algorithm for finding a local optimum in the model space. The search has both a forward step which evaluates inclusion of a proposed variable, and a backward step which evaluates exclusion of one of the currently included variables. The algorithm stops when consecutive inclusion and exclusion steps are rejected. Since GMMDR variables are orthogonal the search can be simplified by skipping the backward step. This allows us to considerably speed up the computing time. The search starts by selecting the single GMMDR variable which maximizes the BIC criterion. This is equivalent to using the statistic in equation (4), with S = and S = {i} for any i =,..., d. Thus, at the first step equation (4) measures the difference between the best clustering model and the single cluster model for each variable. In the following steps, we select the GMMDR variate which maximizes the BIC difference (4), so taking into account the features already included. We keep adding GMMDR variates until the BIC difference becomes negative. A detailed description of the greedy search algorithm is reported in Appendix A.. At each step, the search over the model space is usually performed with respect to both the model parametrization and the number of clusters. However, we may want to fix the model parameters at the estimated values obtained using the full set of features; in such a case, the order in which the GMMDR directions are selected follows the ordering of the associated eigenvalues.. An iterative procedure for feature selection The greedy search discussed in the previous section is likely to discard some of the GMMDR variates, in particular those associated with small eigenvalues which do not carry any clustering information once other features have been included. A GMM may then be fitted on such constructed variables and the corresponding GMMDR directions estimated. The feature selection step can then be repeated, and the whole process iterated until no directions could be dropped. Depending on the number of initial features, the number of iterations required is usually small. The whole process is described in the following algorithm: Algorithm: GMMDR estimation and feature selection process. Fit a GMM on the original data.. Estimate GMMDR directions using the method in Section. and project the data onto the estimated subspace.. Perform feature selection using the greedy search discussed in Section.. 4. Fit a GMM on the selected GMMDR variables and go to step. 5. Repeat steps 4 until none of the features could be dropped. 8

9 4 Simulations In this section we present a data analysis example based on a synthetic data set to describe the proposed approach, followed by some simulation studies. The adjusted Rand index (ARI), proposed by Hubert and Arabie (985), is adopted for evaluating the clustering obtained and comparing the results arising from competing mixture models. The ARI gives a measure of agreement between two partitions, one estimated by a statistical procedure independent of the labelling of the groups, and one being the true classification. This index has zero expected value in the case of random partition, and it is bounded above by. Thus, higher values represent a better performance. 4. Synthetic data on three overlapping clusters with different shapes Consider a simulated data set with three overlapping clusters on ten variables, with each cluster sample size equal to n g = 50 (g =,, ). The first three variables have been generated from a multivariate normal distribution with means µ = [0, 0, 0], µ = [4,, 6], µ = [, 4, ], and within-groups covariances Σ = , Σ = , and Σ = Seven noise variables, generated independently from a standard normal distribution, were also added. Thus, clustering information is only contained in the first three variables, with the remaining ones been simply noise. The correct model we hope to find is the VVV model with G =. The GMM with the largest BIC is the seven clusters EEI model, closely followed by the model with six clusters. Both models have the wrong number of components, thus yielding a poor classification accuracy. Constraining G = the selected model improves the ARI, albeit there are still some misclassified data points (see Table ). Using the best unconstrained GMM we obtained the GMMDR directions; there are d = 6 of such directions, but only the first three are retained by the feature selection step. As it can be seen from Table, the final VVV mixture model with clusters fitted on the three selected GMMDR variables is able to recover the original clustering structure. Table : Clustering results for the three clusters simulated data. Method Model G BIC ARI GMM (unconstrained) EEI GMM (constrained) VII GMMDR ( directions) VVV The left panel of Figure shows the eigenvalues associated with the GMMDR directions. From such a plot we can see that the first two directions are the most important, with the first showing both difference in means and variances among clusters, while the second almost only differences in means. The last direction adds a small amount of clustering information, mostly from differences in variances. The coefficients for the selected GMMDR directions (see the right panel of Figure ) indicate that only the first three variables seem to be required, since the coefficients for the remaining ones are almost zero. Graphs contained in Figure 4 use GMMDR directions to convey clustering information about the fitted GMM. Plots along the diagonal show univariate component densities, while plots above the diagonal report contours of the estimated densities for each component. Such 9

10 Dir Dir Dir Eigenvalues Eigenvalues Means contrib. Vars contrib. x x x x4 x5 x6 x7 x8 x9 x (4.46%) (40.8%) (5.75%) Directions Figure : Plot of eigenvalues for each estimated direction with contributions from means and variances differences among clusters (left panel) and coefficients defining the GMMDR directions (right panel). graphs clearly reflect the characteristics of the mixture model fitted; in particular, clusters have different volume, shape and orientation. Furthermore, it is quite clear that clusters on the projection subspace appear to be well separated along the first two directions, as it can be seen from the maximum a posteriori (MAP) classification areas below the diagonal of Figure Simulation studies Here we discuss the results from some simulation studies we conducted to compare the proposed GMMDR approach to model-based clustering performed on both the original variables and the leading principal components. The data sets used for such comparison are generated from some overlapping Gaussian mixtures with different covariance matrix structures. In addition, we investigated the effect of the inclusion of a certain number of redundant and noise variables. The former are variables correlated with those bringing clustering information, so they only provide redundant information. The latter are variables whose distribution do not depend on the cluster label, thus they only introduce some noise which tends to obscure the underlying clustering structure to be recovered. The simulation schemes used for generating the data are described below. Model : three overlapping clusters with common covariances. In the first simulation scheme each data set was generated from three overlapping clusters with equal mixing probabilities on three variables. These were generated from a multivariate normal distribution with 0

11 Dir Dir Dir Figure 4: Scatter plots of GMMDR variables for the three clusters simulated data: bivariate component density contours (above diagonal), marginal univariate component densities (diagonal), and MAP regions (below diagonal). means µ = [0, 0, 0], µ = [0,, ], µ = [,, ], and common covariance matrix Σ = To this basic setup, which correponds to a EEE mixture model (see Fraley and Raftery, 006, Table ), we added two further scenarios. In the first scenario (noise variables), seven noise variables were generated from independent standard normal variables. In the second scenario (redundant and noise variables), we generated three variables correlated with each clustering variable (with correlation coefficients equal to 0.9, 0.7, 0.5, respectively) and four independent standard normal variables. In both cases, there was a total of ten variables. Model : three overlapping clusters with constant shapes. In this simulation scheme we generated observations with equal prior from three classes using the mixture model VEV. The means for each class were µ = [0, 0, 0], µ = [4,, 6], and µ = [, 4, ], while for the covariance matrices Σ g = λ g D g AD g (g =,, ) the scale, shape and orientation

12 parameters were, respectively, λ = [0., 0.5, 0.8], A = diag(,, ), and D = , D =.., D = As for the previous model setup, we also considered two further scenarios with the addition of noise variables in the first case, and noise and redundant variables in the second case. Model : three overlapping clusters with unconstrained covariances. For this last scheme we simulated each data set from three overlapping clusters with equal mixing probabilities on three variables. Clustering variables were generated from a multivariate normal distribution with parameters set as in the synthetic example of Section 4., which correspond to a VVV parametrization with G =. Again, we also considered the effect of adding only noise variables, and noise and redundant variables. For each data set simulated as described above we fitted a GMM with number of clusters in the range [, 5] and different model parametrizations, then we selected the model with the highest BIC. A GMM was also fitted to the leading principal components computed from the correlation matrix, or equivalently using standardized variables; the number of components was chosen using the Kaiser s rule, i.e. selecting those with eigenvalues larger than one (Jolliffe, 00, p. 4 5). Finally, we applied the GMMDR approach discussed in Section., followed by the subset selection procedure of Section. For each estimated model we computed the adjusted Rand index (Hubert and Arabie, 985) as a criterion to evaluate the clustering partitions obtained: higher values of ARI correspond to better performance. The results of the simulation procedure based on 000 repetitions for increasing sample sizes are shown on Table, where we reported the average adjusted Rand index for the three fitting procedures to be compared and the corresponding standard errors. When no noise variables are included, the GMMDR procedure yields equivalent clustering accuracy to GMM on the original variables. However, when irrelevant features (either noise or redundant ones) are present the GMMDR procedure significantly improves the accuracy as measured by the adjusted Rand index. This is particular evident for smaller sample sizes (n = 00, 00) when noise variables are present, and when the underlying clusters have different within cluster covariance matrices. The PCA+GMM procedure leads to uniformly small values of ARI, confirming that reducing the dimensionality of the problem through principal components does not help to discover clustering structures in the data. As expected, as the sample size increases all the three methods improve in accuracy, with the GMM method on the original variables and the GMMDR approach leading essentially to the same accuracy for n = 000. A further aspect was also checked, the number of clusters selected by each procedure (results not shown here). Recalling that for all the simulation schemes the number of clusters was three, GMMDR almost always selects the true value, whereas GMM tends to select a larger number of components when noise variables are included and the clusters have different covariance structures. PCA+GMM performs worst than the other two methods, often under-estimating the true number of clusters in the presence of noise variables and for smaller sample sizes (n = 00, 00). Based on the simulation results we may conclude that when there are redundant or noise variables a suitable subset of GMMDR variables allows us to significantly improve cluster identification. The improvement is larger for small samples having different within-cluster covariances and in the presence of noise variables. Overall, using GMMDR variables do not produce worse results than using original variables, but often leads to a better selection of the clustering structure. There appears no reason to perform model-based clustering on principal components.

13 Table : Average adjusted Rand index (with standard errors within parenthesis) based on 000 simulations. No noise Noise variables Noise and redundant variables Model (EEE) n = 00 n = 00 n = 000 n = 00 n = 00 n = 000 n = 00 n = 00 n = 000 GMM (0.040) (0.066) (0.008) (0.6) (0.7) (0.0090) (0.889) (0.04) (0.0087) PCA+GMM (0.66) (0.0446) (0.0) (0.700) (0.56) (0.098) (0.4) (0.0909) (0.09) GMMDR (0.047) (0.07) (0.0088) (0.46) (0.04) (0.0090) (0.6) (0.097) (0.0088) Model (VEV) n = 00 n = 00 n = 000 n = 00 n = 00 n = 000 n = 00 n = 00 n = 000 GMM (0.00) (0.04) (0.007) (0.085) (0.079) (0.008) (0.09) (0.57) (0.0078) PCA+GMM (0.745) (0.8) (0.084) (0.86) (0.700) (0.47) (0.780) (0.48) (0.009) GMMDR (0.05) (0.04) (0.007) (0.0799) (0.065) (0.008) (0.08) (0.069) (0.0077) Model (VVV) n = 00 n = 00 n = 000 n = 00 n = 00 n = 000 n = 00 n = 00 n = 000 GMM (0.05) (0.004) (0.004) (0.60) (0.89) (0.008) (0.9) (0.094) (0.009) PCA+GMM (0.054) (0.005) (0.005) (0.66) (0.77) (0.06) (0.5) (0.0984) (0.06) GMMDR (0.046) (0.004) (0.004) (0.84) (0.00) (0.008) (0.609) (0.085) (0.0099) 4.. Unequal prior probabilities The previous simulation examples assumed equal prior group probabilities. It may be of interest to investigate if the results obtained depend on such assumption. We expect that, provided there is enough cluster information in the data, no dramatic change from previous findings should be observed. Thus, we replicated the previous simulation study for the more complex Model (VVV), but with unequal prior probabilities set at π = (0., 0., 0.6). The results are shown in Table. When no noise variables are included the accuracy of GMM and GMMDR are essentially the same as in the equal prior probabilities case, whereas PCA+GMM performs slightly worse. If irrelevant variables (either noise or redundant ones) are present both GMM and GMMDR yield somewhat worse results when n = 00 or 000. This seems to be due to the fact that BIC often selected models with more components than the true number of clusters. On the contrary, PCA+GMM seems to improve when redundant variables (i.e. variables correlated with clustering ones) are included. Since principal components are computed from the correlation matrix, we argued that correlated variables are likely to provide useful clustering information when some groups are represented by few observations. Overall, we note that GMMDR never performed worst than GMM, and it was also better than PCA+GMM except when irrelevant variables are included and n = High-dimensional data Clustering high-dimensional data is a challenging task. The main problem in fitting a GMM model is the large number of parameters which need to be estimated. To overcome this Ghosh and Chinnaiyan (00) suggested to perform model-based clustering on the first few principal components. Based on the conclusion of Chang (98), discussed also in Section, this approach does not seem to be advisable. Bouveyron et al (007) proposed a method for dealing with high

14 Table : Average adjusted Rand index (with standard errors within parenthesis) based on 000 simulations with unequal prior probabilities. Model (VVV) No noise Noise variables Noise and redundant variables n = 00 n = 00 n = 000 n = 00 n = 00 n = 000 n = 00 n = 00 n = 000 GMM (0.050) (0.005) (0.000) (0.468) (0.66) (0.60) (0.4) (0.88) (0.097) PCA+GMM (0.0970) (0.087) (0.05) (0.0748) (0.050) (0.07) (0.0590) (0.05) (0.054) GMMDR (0.074) (0.004) (0.00) (0.4) (0.78) (0.547) (0.0) (0.60) (0.600) dimensional data. In Section 4. we described a three clusters VEV model () on clustering variables. We also considered the inclusion of redundant variables (i.e., correlated with the clustering ones) and 4 noise variables; there was a total of ten variables according to the scheme { 4}. This basic setup has been generalized to investigated the behavior of the proposed approach as the number of variables increase. For k = {,,, 5} we generated 0k variables following the scheme {k k 4k}; for instance, if k =, there are 0 variables, 9 of them are clustering variables, 9 are correlated with the previous ones, and the remaining are noise variables. We only investigated the most difficult case, i.e. the case of a small sample size, by setting n = 00. As the number of variables increase the accuracy of all the three methods tend to decrease, with GMMDR always more accurate on the average that the other two methods (see Table 4). However, for p = 40 and p = 50 the average ARI for GMMDR dropped just above 0.6. Looking in more detail the results, we realized that, even though a subset of directions could provide better cluster accuracy, BIC frequently selected too many of such directions. To find evidence of this fact we replicated the simulation study, but using the simple model EII (common spherical covariance matrix) and selecting the GMMDR directions on the basis of the entropy criterion (Celeux and Soromen, 996). This provides a measure of the overlap of the clusters, and it is defined as G g= n i= t ig log(t ig ), where t ig denotes the conditional probability that the i-th unit arises from the g-th mixture component. Except for the case p = 0, where the ARI is smaller, and p = 0, where the ARI is essentially equivalent, the accuracy of GMMDR is largely increased (see Table 4). Note also that, although we constrained to fit the wrong EII model, the accuracy of GMM do not worsen. Based on this limited study, it seems that for high-dimensional data GMMDR directions are still able to recover the clustering structure of the data, but the BIC criterion overestimates how many of them are needed. When the entropy criterion is used to select relevant directions the clustering accuracy improve. An extreme form of high-dimensional data is the p n case. Clustering of gene expression data arising from microarray experiments is a typical application and model-based approaches have been proposed (Yeung et al, 00). In order to apply the approach discussed in this paper, a form of regularization is required for the inversion of the covariance matrix in (), such as one of those discussed in Bernard-Michel et al (009). However, these aspects require further studies. 5 Data analysis examples 5. Italian wines Forina et al (986) reported data on 78 wines grown in the same region in Italy but derived from three different cultivars (Barolo, Grignolino, Barbera). For each wine measurements of 4

15 Table 4: Average adjusted Rand index (with standard errors within parenthesis) based on 500 simulations for increasing dimensionality of the feature space and n = 00. The last two rows show the results from fitting a GMM with a common spherical and equal volume model (EII) and the corresponding GMMDR with relevant directions selected by the entropy criterion. Model (VEV) p = 0 p = 0 p = 0 p = 40 p = 50 GMM PCA+GMM (0.0850) (0.070) 0.78 (0.0744) (0.076) (0.078) GMMDR (0.8) (0.655) (0.574) (0.58) (0.8) 0.6 (0.084) (0.095) (0.960) (0.09) (0.90) GMM (EII) GMMDR (0.0889) (0.0707) (0.0696) (0.075) (0.0765) 0.88 (0.67) (0.087) (0.080) (0.58) (0.9) chemical and physical properties were made, such as the level of alcohol, the level of magnesium, the color intensity, etc. The data set is avalaible at UCI Machine learning data repository The following analyses were performed on a standardized scale. Modelling all the available variables with a GMM selects the model VEI with eight clusters, which gives a small adjusted Rand index (see Table 5). This can be largely improved by the variable selection method proposed by Raftery and Dean (006). A comparable accuracy is also achieved by McNicholas and Murphy (008), who applied a class of parsimonious Gaussian mixture models (PGMM) based on mixtures of factor analyzers; their selected model has q = latent factors and it is denoted as CUU (see McNicholas and Murphy, 008, Table ). We note, however, that, despite the improvement on accuracy, PGMM selects a wrong number of clusters (see Table 5). Table 5: Model-based clustering results for some models fitted to the wine data. Method Number of Model G ARI features GMM VEI GMM (best subset ) 5 VEV 0.78 PGMM CUU GMMDR 5 VEV 0.85 (Malic, Proline, Flavanoids, Intensity, OD80) We applied the GMMDR approach to this data set and we obtained a final VEV model with three clusters estimated on five GMMDR directions. This model yields an adjusted Rand index equal to 0.85 (see Table 5) and the resulting confusion matrix is reported in Table 6. As Figure 5 shows, the first two directions mainly exhibit differences in cluster means, whereas the remaining directions mostly represent differences in cluster variances. We also note that the first three GMMDR directions appear to contain most of the clustering information. Figure 6 shows a static view of a rotating D spin plot for the first three GMMDR directions with data points marked according to cluster membership; Barbera wines () appear to be well separated from the others along the first direction, while Barolo () and Grignolino () wines present a significant separation on the plane identified by the second and third directions. 5

16 Cluster Wines Barolo 57 0 Grignolino 64 5 Barbera Table 6: A classification table for the best model found using GMMDR on the wine data. Eigenvalues Eigenvalues Means contrib. Vars contrib. 4 5 Directions Figure 5: Plot of eigenvalues for each GMMDR direction with contributions from means and variances differences among clusters for the wine data. 5. Australian crabs data This data set contains measurements recorded on 00 specimens of Leptograpsus variegatus crabs on the shore in Western Australia (Campbell and Mahon, 974). Crabs are classified according to their colour form (blue and orange) and sex. There are 50 specimens of each sex of each species. Each specimen has 5 measurements: frontal lobe size (FL), rear width (RW), carapace length (CL) and width (CW), body depth (BD). The main interest is to distinguish between species and sex based on the available morphological measurements. Gaussian mixture modelling performs poorly for this dataset. The model with the highest BIC selects 9 components, which gives a small adjusted Rand index. If we constrain the number of components to match the actual number of classes, the performance is still poor (see Table 7). Raftery and Dean (006) obtained a large improvement through variable selection, ending up with a mixture model based on four variables. This dataset was also analyzed by Bouveyron et al (007, pp ) who obtained a 5% error rate with the high-dimensional data clustering (HDDC) model [a i b i Q i d i ]. Starting with the best unconstrained mixture model (EEE, G = 9) we estimated the GMMDR directions and then we performed feature selection. The final GMMDR model is fitted on three directions with 4 clusters having ellipsoidal, equal volume and shape but different orientation (EEV). The error rate is 7.50%, the same obtained by Raftery and Dean (006). 6

17 Dir Dir Dir Figure 6: A static view of a D spinning plot of the first three GMMDR directions for the wine data. Data points are marked according to cluster membership; the clusters identified refer mainly to () Barolo, () Grignolino, and () Barbera. The resulting accuracy is slightly worst than that for the HDDC model of Bouveyron et al (007), but it has the advantage of providing a visualization of the clustering structure. Table 7: Model-based clustering results for some models fitted to the crabs data. Method Model G Error ARI rate (%) GMM (unconstrained) EEE GMM (constrained) VEV GMM (best subset ) EEV HDDC GMMDR [a i b i Q i d i ] EEV (CW, RW, FL, BD) Summary plots (not shown here) indicate that the first direction mainly separates blue from orange type of crabs, while the second reflects sex differences. The last direction adds only marginal information contrasting blue and orange crabs within sex. Figure 7 plots the uncertainty u i = max g (ẑ ig ) projected onto the first two GMMDR directions, where ẑ ig is the estimated conditional probability that observation i belongs to group g. Misclassified crabs are indicated by a square surrounding the point. As it can be seen, orange and blue crabs appear well separated, whereas discrimination of sex within crabs species is more problematic: some points lie in the regions of higher uncertainty (darker areas), in particular for blue crabs. It is interesting to note that these aspects can also be read-off from the corresponding confusion matrix (see Table 8). 7

18 Cluster Species 4 BF BM OF OM Table 8: A classification table for the final GMMDR model fitted on the crabs data. 6 Conclusion In this paper we addressed the problem of dimension reduction for model-based clustering from the point of view of visualizing the clustering structure. Often, principal component analysis is used as a pre-processing step, and a clustering procedure is applied to the leading principal components. Probabilistic principal components and factor analyzers have been also used for dimension reduction. However, these do not necessarily capture the underlying clustering structure. Unlike other approaches (Tipping and Bishop, 999a,b; McLachlan et al, 00; Bouveyron et al, 007; McNicholas and Murphy, 008), the proposed method is not connected to any underlying latent variables structure, nor an EM-like algorithm is employed in the estimation. Instead, we aim at identifying the subspace which capture most of the clustering information contained in the data. Estimation of directions spanning such subspace is based on the eigendecomposition of a suitable kernel matrix, with unknown parameters estimated from a GMM selected as the best description of the data at hand. Visualization techniques can then be applied using the leading directions in order to obtain summary plots. The estimated directions can be further reduced by recasting the problem of feature selection as a model choice problem and using BIC to approximate Bayes factors. A forward greedy search algorithm is discussed to avoid the search over all possible subsets. The proposed methodology has been applied to simulated and real data sets. In all cases we were able to identify a dimension reduced subspace while preserving the cluster information contained in the original variables. Moreover, simulation studies showed that a suitable subset of directions allows for significant improvement in cluster identification. The improvement is larger when within-cluster covariances are different and redundant or noise variables are present. Finally, it is straightforward to extend the proposed approach in a supervised context, where the class membership for each unit is known in advance. Appendix A. Proofs Proof of proposition Define the transformed data matrix as X = XC + a, where C is a (p p) nonsingular matrix, a is a p-vector of real values, and is a n-vector of ones. It is easy to show that in the transformed scale Σ = C ΣC and the kernel matrix is given by M = C MC. The generalized eigendecompostion in the transformed scale satisfy the equation MCV = ΣCV L, so CV must be equal to the matrix V of eigenvalues of the generalized eigendecomposition of M with respect to Σ, and L is equal to the corresponding diagonal matrix of eigenvalues. Hence, V = C V, L = L, and the GMMDR directions in the transformed scale are given by β = C β. Furthermore, the corresponding variates Z = X β = Xβ + a C β, so they are simply a shifted version of the GMMDR variates in the original scale. 8

Dimension Reduction for Model-based Clustering via Mixtures of Multivariate t-distributions

Dimension Reduction for Model-based Clustering via Mixtures of Multivariate t-distributions by Katherine Morris A Thesis presented to The University of Guelph In partial fulfilment of requirements for