Why MixTRV? Motivations for a new Matlab package for Gaussian mixture inference, statistical hypothesis testing & model-based classification/clustering

Alexandre Lourme
Institut de Mathématiques de Bordeaux (IMB), UMR 5251, Université Bordeaux 1
Faculté d'Economie Gestion & AES, Université Montesquieu - Bordeaux IV

Introduction

Nowadays, modeling heterogeneous continuous data with a finite mixture of Gaussians is common, and software dedicated to this task has been proliferating for some years. The oldest programs, such as SNOB (Wallace and Boulton, 1968) or EMMIX (McLachlan et al., 1999), are gradually being replaced by a new generation of Matlab or R packages dedicated to Gaussian modeling within a wider scope: discriminant analysis, cluster analysis, high-dimensional data processing, variable selection, statistical hypothesis testing, etc. These recent packages differ not only in their purpose but also in the inferential method (maximum likelihood, maximum completed likelihood, minimum message length, Bayesian inference, etc.), in the parsimonious models they consider, in the embedded model selection criteria, etc. MixTRV is a Matlab package for classification, clustering and statistical hypothesis testing; it infers twenty-two parsimonious Gaussian mixture models by maximum likelihood in supervised, semi-supervised or unsupervised contexts. MixTRV is close to recent R packages such as bgmm (Biecek et al., 2012), mclust (Fraley et al., 2012), pgmm (McNicholas et al., 2011), mixmod (Lebret et al., n.d.) or upclass (Russell et al., n.d.). Nevertheless, MixTRV differs from the latter packages by several stability properties of its underlying parsimonious models, properties that matter both for representing and for interpreting the inferred model.

Several model families

When K Gaussians fit a sample of d-dimensional heterogeneous data, the Gaussian k (k ∈ {1,...,K}) is characterized by a center µ_k ∈ R^d, a covariance matrix Σ_k ∈ R^{d×d} (symmetric positive definite) and a weight π_k > 0 (with π_1 + ... + π_K = 1). In order to reduce the squared error of the inferred model, it is usual to consider a family of parsimonious models combining constraints on the previous parameters. In this spirit, each of bgmm, mclust, pgmm, mixmod, upclass and MixTRV is characterized by its own collection of models, defined by specific constraints on π_k, µ_k, Σ_k (k = 1,...,K). Let us review the model families of these packages so as to highlight, further on, the advantages of MixTRV.

mclust. Each covariance matrix Σ_k is symmetric positive definite, so its eigenvalues are positive real numbers α_{k,1},...,α_{k,d} > 0 and Σ_k is diagonalizable in an orthonormal basis of eigenvectors. So, writing Λ_k = diag(α_{k,1},...,α_{k,d}), there exists an orthogonal matrix D_k ∈ R^{d×d} such that:

Σ_k = D_k Λ_k D_k'.    (1)

The columns of D_k form an orthonormal basis of R^d. The canonical directions of this basis on the one hand and the principal axes of the ellipsoidal iso-density contours of the Gaussian k on the other hand are pairwise parallel. So, the matrix D_k characterizes the orientation of the Gaussian k.
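As a concrete illustration of decomposition (1), here is a minimal Matlab sketch (an arbitrary illustrative covariance matrix, not code taken from MixTRV) that recovers an orientation matrix D_k and the eigenvalue matrix Λ_k from Σ_k by eigendecomposition, with the eigenvalues sorted in decreasing order:

% Illustrative covariance matrix (any symmetric positive definite matrix works)
Sigma = [4.0 1.2; 1.2 1.0];

% Eigendecomposition: Sigma = D * Lambda * D'
[D, Lambda] = eig(Sigma);

% Sort the eigenvalues in decreasing order and reorder the eigenvectors accordingly
[alpha, idx] = sort(diag(Lambda), 'descend');
D      = D(:, idx);          % orientation (orthogonal matrix)
Lambda = diag(alpha);        % diagonal matrix of eigenvalues

% Check the reconstruction of decomposition (1)
reconstruction_error = norm(Sigma - D * Lambda * D')   % should be close to 0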

The models of mclust described in Fraley et al. (2012) propose ten covariance structures based on the following decomposition:

Σ_k = α_{k,1} D_k L_k D_k'    (2)

where α_{k,1} denotes the largest eigenvalue of Σ_k and L_k = Λ_k / α_{k,1}. These models are named EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, VVV. The letter V or E in first position indicates that the α_{k,1} (k = 1,...,K) are variable (V) or equal (E). V, E or I stands in second position when the matrices L_k (k = 1,...,K) are assumed to be variable (V), equal to each other (E), or all equal to the identity matrix (I). The letter V, E or I in third position means that the orthogonal matrices D_k (k = 1,...,K) are variable (V), equal (E) or that each of them is a permutation matrix (I); in the latter case the permutation matrices are homogeneous (resp. heterogeneous) when the matrices L_k (k = 1,...,K) are supposed to be equal (resp. variable). As regards the other parameters, mclust considers the weights π_k and the centers µ_k (k = 1,...,K) as free. So the mclust model family consists of the ten previous covariance structures.

mixmod. The mixmod software described in Biernacki et al. (2006) considers fourteen parsimonious covariance structures. These models are based on the following decomposition, derived from (1):

Σ_k = λ_k D_k A_k D_k'    (3)

where λ_k = |Σ_k|^{1/d} is the volume of the component k and A_k = Λ_k / λ_k its shape. Decompositions (2) and (3) are close since each of them is obtained from (1) by normalizing the matrix Λ_k; they only differ in the normalizing factor, which is the volume λ_k of the component k in (3) and the largest eigenvalue α_{k,1} of Σ_k in (2). The fourteen covariance models of Biernacki et al. (2006) combine constraints on the volume (λ_k), the shape (A_k) and the orientation (D_k) of the components, and are divided into three families. The spherical family includes two covariance models named λI and λ_kI; both assume that the matrices A_k and D_k are equal to the identity matrix, and the first (resp. the second) model considers the volumes λ_1,...,λ_K as homogeneous (resp. heterogeneous). The diagonal family consists of four covariance models named λB, λ_kB, λB_k, λ_kB_k, depending on whether (i) the volumes are homogeneous (λ) or free (λ_k) and (ii) the matrices D_kA_kD_k' (k = 1,...,K) are diagonal and equal (B) or just diagonal (B_k). The general family is composed of eight covariance models obtained by assuming the volumes (λ/λ_k), the shapes (A/A_k) and the orientations (D/D_k) to be homogeneous or heterogeneous; so λDA_kD' means that both volumes and orientations are homogeneous whereas shapes are free. mixmod considers no parsimonious hypotheses about the centers µ_k (k = 1,...,K), but the weights are either free (π_k) or equal (π). Combining these two assumptions about the weights with the fourteen covariance structures leads to the wide mixmod model family made of twenty-eight parsimonious models. The standard homoscedastic model with free weights, noted π_kλDAD', is one of them.
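The volume/shape/orientation reading of decompositions (2) and (3) can be checked numerically; the following minimal Matlab sketch (illustrative values, not mixmod or mclust code) extracts λ_k, A_k, D_k and L_k from a single covariance matrix and verifies both reconstructions:

% Illustrative covariance matrix of one component
Sigma = [4.0 1.2; 1.2 1.0];
d = size(Sigma, 1);

% Orientation D and eigenvalues alpha, sorted in decreasing order
[D, Lambda] = eig(Sigma);
[alpha, idx] = sort(diag(Lambda), 'descend');
D = D(:, idx);

% Volume lambda = |Sigma|^(1/d) and shape A = diag(alpha)/lambda (so det(A) = 1)
lambda = det(Sigma)^(1/d);
A = diag(alpha) / lambda;

% L of decomposition (2) normalizes by the largest eigenvalue instead
L = diag(alpha) / alpha(1);

% Both reconstructions recover Sigma (errors close to 0)
err3 = norm(Sigma - lambda   * D * A * D')   % decomposition (3)
err2 = norm(Sigma - alpha(1) * D * L * D')   % decomposition (2)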

pgmm. The models of pgmm are mixtures of factor analyzers (see McLachlan and Peel, 2000, Chap. 8). Such models are often used to fit high-dimensional data (see Bouveyron and Brunet-Saumard, 2014), but they are suitable for modeling Gaussian data even outside the specific context of high dimension. For a common given dimension q (q ≥ 1) of the latent spaces (see McLachlan and Peel, 2000), pgmm proposes twelve covariance structures gathered in a family called EPGMM and described in McNicholas and Murphy (2010). These structures rest on the hypothesis that each matrix Σ_k can be decomposed according to:

Σ_k = B_k B_k' + ω_k Δ_k    (4)

where B_k is a matrix with d rows and q columns (q independent of k), ω_k is a scalar and Δ_k is a diagonal matrix with determinant 1. When q is small compared to d, the columns of B_k define q directions in R^d which are close to the factorial axes of the Gaussian k; the term ω_kΔ_k is intended to capture the residual variability of the Gaussian k in the other directions of R^d. Each EPGMM model name has four letters among C (for constrained) and U (for unconstrained). In position 1 (resp. 2, 3), the letter C or U indicates whether the parameter B_k (resp. Δ_k, ω_k) is homogeneous or free with respect to k. In fourth position, the letter C or U indicates whether the matrices Δ_k are equal or not equal to the identity matrix. When the matrices Δ_k are assumed to be equal to the identity matrix, they are necessarily homogeneous with respect to k; so, if a model name ends with C then its second letter is also C. Combining the previous hypotheses leads to twelve covariance models. The model called CUCU, for example, supposes that the matrices B_k (k = 1,...,K) are homogeneous and the coefficients ω_k (k = 1,...,K) also, but that the matrices Δ_k (k = 1,...,K) are free.

upclass. Generally, in a model-based discriminant analysis context, model inference involves only the labelled data whereas the unlabelled data are used only in the classification step. The upclass software makes use of the information held in the unlabelled data at the inference step, by estimating a model on both labelled and unlabelled data (see Russell et al., n.d.). The parsimonious models of upclass are the same as those of mclust, so they inherit both their advantages and their drawbacks.

bgmm. Unlike the previously reviewed packages, bgmm (see Biecek et al., 2012) allows parsimonious hypotheses to be combined on both the covariance matrices Σ_k and the centers µ_k (k = 1,...,K). Each bgmm model name consists of four signs. The first (resp. the second) sign is a letter, either E or D, depending on whether the centers µ_k (resp. the covariance matrices Σ_k) are equal or different. The third sign is E when the d variances Σ_k(i,i) (i = 1,...,d) of each component and its d(d-1)/2 covariances Σ_k(i,j) (1 ≤ i < j ≤ d) are homogeneous, and D otherwise. The last sign is 0 if the d(d-1)/2 covariances Σ_k(i,j) (1 ≤ i < j ≤ d) of each component are supposed to be null, and D otherwise. For example, the model DDD0 assumes that the centers and the variances are free whereas the component covariances are null.

mixtrv. As each covariance matrix Σ_k is symmetric positive definite, it can be decomposed according to:

Σ_k = T_k R_k T_k    (5)

where T_k is the diagonal matrix of component standard deviations and R_k the associated correlation matrix. So T_k(i,j) = √Σ_k(i,j) if i = j and 0 otherwise, and R_k = T_k^{-1} Σ_k T_k^{-1}. Moreover the standardized mean V_k is defined by:

V_k = T_k^{-1} µ_k.    (6)

So the matrix T_k appears to be a scale parameter and T_k^{-1} a normalizing parameter; then V_k and R_k are respectively the center and the covariance matrix of the k-th normalized Gaussian component.
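Decompositions (5) and (6) are straightforward to compute; here is a minimal Matlab sketch (illustrative values, not the MixTRV implementation) that extracts T_k, R_k and V_k from a component's parameters and checks that they recover Σ_k and µ_k:

% Illustrative parameters of one Gaussian component
mu    = [10; -2];
Sigma = [4.0 1.2; 1.2 1.0];

% T: diagonal matrix of standard deviations;  R: correlation matrix (eq. (5))
T = diag(sqrt(diag(Sigma)));
R = T \ Sigma / T;            % R = inv(T) * Sigma * inv(T)

% V: standardized mean (eq. (6))
V = T \ mu;                   % V = inv(T) * mu

% The decomposition is canonical: T, R and V recover Sigma and mu exactly
err_Sigma = norm(Sigma - T * R * T)   % close to 0
err_mu    = norm(mu - T * V)          % close to 0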
MixTRV offers twenty-two parsimonious models. Each of them has a four-letter name. H or F stands in first, third and fourth position depending on whether the weights π_k, the correlation matrices R_k and the standardized mean vectors V_k are homogeneous (H) or free (F) with respect to k, and F, P or H in second position indicates that the standard deviation matrices T_k (k = 1,...,K) are free (F), proportional (P) or homogeneous (H).

For example, the model FPHF of MixTRV assumes that the weights and the standardized means are free across the components whereas the standard deviations are proportional and the correlations are homogeneous. One can check that the covariance structures of FPHF in MixTRV and of πλ_kDAD' in mixmod are the same. More generally, it often happens that several models belonging to different packages among mclust, mixmod, pgmm, bgmm and MixTRV share identical covariance structures; the covariance structures common to these packages are summarized in Table 1, Column 3. Like bgmm, MixTRV makes it possible to consider parsimonious hypotheses on the centers: when the vectors V_k and the matrices T_k are simultaneously homogeneous with respect to k, the centers µ_k (k = 1,...,K) are equal. Unlike (2), (3) and (4), the decomposition (5) is canonical; this ensures the existence and the uniqueness of each parameter V_k, R_k and T_k. Although it is not the main virtue of MixTRV, this advantage is convenient for inferring V_k, R_k, T_k and helpful for interpreting these parameters.

Stability properties

The five properties described below separate MixTRV from the other packages mclust, mixmod, pgmm, upclass and bgmm. Indeed, the MixTRV model family is the only one whose parsimonious models all satisfy each of the following properties.

Property 1 (Model Structure Scale Invariance). A random vector X ∈ R^d with a parametric distribution is scale invariant if the models of SX and of X are subject to the same constraints, whatever the diagonal positive definite matrix S ∈ R^{d×d}.

Illustrations. If X is distributed as a mixture of K Gaussians with homogeneous correlation matrices, R_1 = ... = R_K, then the component correlation matrices of SX are themselves homogeneous; so the model FFHF of MixTRV is scale invariant. If X is distributed as a mixture of K Gaussians with equal variances within each component, Σ_k(1,1) = ... = Σ_k(d,d), the component variances of SX are generally not equal; so the models DDED in bgmm, VII in mclust, π_kλI in mixmod, etc., are not scale invariant. Column 4 of Table 1 summarizes which models of bgmm, pgmm, mclust, mixmod and MixTRV satisfy (or not) Property 1: MixTRV is the only package all of whose models are scale invariant.

Here is one reason for discarding models that do not satisfy Property 1: they often lead to unsuitable graphical representations. For example, Fig. 1a depicts two Gaussian isodensity contours of a bivariate mixmod model π_kλ_kDA_kD', within orthonormal axes. Fig. 1b shows that when the x-axis scale is changed, the main axes of the ellipses are no longer parallel, whereas the two Gaussians represented still have the same orientation.

Fig. 1: The evidence of equal orientations depends on the axis scaling. (a) Orthonormal axes; (b) changed x-axis scale.
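The two illustrations of Property 1 given above can be checked numerically; the following minimal Matlab sketch (illustrative matrices and a hypothetical rescaling) shows that a component correlation matrix is unchanged by a diagonal rescaling whereas equal within-component variances are not preserved:

% Covariance of one component with equal variances on the diagonal
Sigma = [1.0 0.3; 0.3 1.0];
S     = diag([10 1]);                    % rescaling, e.g. first variable minutes -> other units

Sigma_rescaled = S * Sigma * S;          % covariance of S*X for this component

% Correlation matrices before and after rescaling are identical
T  = diag(sqrt(diag(Sigma)));            R  = T  \ Sigma          / T;
Ts = diag(sqrt(diag(Sigma_rescaled)));   Rs = Ts \ Sigma_rescaled / Ts;
norm(R - Rs)                             % close to 0: the constraint R_1 = ... = R_K survives

% Equal variances, however, are destroyed by the rescaling
diag(Sigma)'                             % [1 1]   : equal
diag(Sigma_rescaled)'                    % [100 1] : no longer equal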

Property 2 (Model Rank Scale Invariance). Γ denoting a likelihood-based model selection criterion among AIC (Akaike, 1974), BIC (Schwarz, 1978) and ICL (Biernacki et al., 2000), a set of models is Γ-scale invariant if rescaling the data does not change the model ranks related to Γ. For a model family to be AIC/BIC/ICL-scale invariant, each model of the family must satisfy Property 1. As each of bgmm, pgmm, mclust and mixmod includes at least one non-scale-invariant structure, these four packages are neither AIC- nor BIC- nor ICL-scale invariant. On the contrary, it can be proved that the MixTRV model collection is AIC-, BIC- and ICL-scale invariant (see Biernacki and Lourme, in press). Column 5 of Table 1 recalls that MixTRV is the only model collection which satisfies Property 2.

Illustration. Let us consider the following experimental design in order to illustrate the importance of Property 2. Each Gaussian mixture model of bgmm, pgmm, mclust, mixmod and MixTRV is fitted to the famous Old Faithful geyser eruptions (see Azzalini and Bowman, 1990), with K = 2 classes interpreted as short and long eruptions, and the four best models according to BIC within each family are recorded. This leads to Tables 2a, 2b and 2c, depending on whether the two variables Duration and Waiting are measured in minutes × minutes, in seconds × minutes, or both standardized (divided by their standard deviation). One can observe from Table 2 that for each of bgmm, pgmm, mclust and mixmod, the list of the four best models according to BIC depends on the units of the data, whereas the list remains unchanged for MixTRV.

When a model family does not satisfy Property 2, the model selection procedure depends on the measurement units and the constraints of the selected model cannot then be interpreted as a property of the data. In the previous example about the Old Faithful eruptions, the best mixmod model for BIC is π_kλ_kDA_kD' when Duration and Waiting are both measured in minutes, whereas π_kλ_kD_kAD_k' is preferred when both variables are standardized; so homogeneous orientation of the distributions is not an intrinsic property of short and long Old Faithful eruptions. On the contrary, the MixTRV model selected by BIC is FFHF for any measurement units, which supports that homogeneous correlation of Waiting and Duration among short and long eruptions is an intrinsic property of the Old Faithful eruptions.
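One way to see why Property 1 for every model of a family entails Property 2 is the following outline (a sketch of the argument in the spirit of Biernacki and Lourme, not a quotation of it). Writing the rescaled sample as Sx_1,...,Sx_n and ν_m for the number of free parameters of a scale-invariant model m:

% The parameter space of a scale-invariant model m is mapped onto itself by the
% rescaling, so only the Jacobian of y = Sx enters the maximized log-likelihood:
\ell^{\max}_m(Sx_1,\dots,Sx_n) = \ell^{\max}_m(x_1,\dots,x_n) - n\,\log\det S
% The number of free parameters \nu_m is unchanged, hence
\mathrm{BIC}_m(SX) = -2\,\ell^{\max}_m(SX) + \nu_m \log n = \mathrm{BIC}_m(X) + 2n\,\log\det S

The shift 2n log det S is the same for every model of the family, so the BIC differences between models, and therefore the BIC ranks (and likewise for AIC and ICL), do not depend on the units.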

Property 3 (Consistency of Canonical Projections). A random vector X ∈ R^d with a parametric distribution is consistent by projection onto the canonical planes if any random vector X' ∈ R^2 consisting of two distinct components of X is subject to the same constraints as X.

Illustrations. If X = (X_1,...,X_d) is distributed as a mixture of K Gaussians with homogeneous standardized means, V_1 = ... = V_K, then the component standardized mean vectors of X' = (X_i, X_j) are themselves homogeneous whatever the couple of distinct indexes (i, j); so the structure FFFH of MixTRV is consistent by projection onto the canonical planes. But if X = (X_1,...,X_d) is distributed as a mixture of K Gaussians with homogeneous volumes, λ_1 = ... = λ_K, the component volumes of X' = (X_1, X_2), for example, are generally not equal; so the mixmod model π_kλD_kA_kD_k' is not consistent by projection onto the canonical planes. Column 6 of Table 1 displays the bgmm, pgmm, mclust, mixmod and MixTRV models satisfying (or not) Property 3: MixTRV and bgmm are the only packages all of whose parsimonious models are consistent by projection onto the canonical planes.

Models which do not satisfy Property 3 are not easy to represent in dimension 2. For example, Fig. 2 depicts two Gaussian component isodensity contours related to a trivariate mixmod model π_kλD_kAD_k': the structure consisting of homogeneous volumes and shapes but free orientations does not persist by projection onto the x-y canonical plane.

Fig. 2: Unsustainability of the structure "homogeneous shapes, homogeneous volumes, free orientations" in the canonical planes.
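The volume illustration above is easy to reproduce; the following minimal Matlab sketch (illustrative diagonal covariance matrices) exhibits two trivariate components with equal volumes whose margins on the (X_1, X_2) canonical plane have different volumes:

% Two diagonal covariance matrices with the same volume |Sigma|^(1/3) = 1
Sigma1 = diag([1 1 1]);
Sigma2 = diag([4 0.5 0.5]);
vol = @(S) det(S)^(1/size(S,1));
[vol(Sigma1), vol(Sigma2)]            % [1 1]          : homogeneous volumes in R^3

% Margins on the canonical plane (X1, X2): keep rows/columns 1 and 2
M1 = Sigma1([1 2], [1 2]);
M2 = Sigma2([1 2], [1 2]);
[vol(M1), vol(M2)]                    % [1 1.4142...]  : volumes no longer equal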

Property 4 (Characterization of a Model by Bivariate Margins). A random vector X ∈ R^d with a parametric distribution is characterizable by its bivariate margins if its parameter necessarily satisfies any constraint complied with by the parameters of its projections onto the canonical planes. (This property is the converse of Property 3.)

Illustrations. Let X = (X_1,...,X_d) be a random vector in R^d distributed as a mixture of K Gaussians. If the component standard deviation matrices of (X_i, X_j) are proportional whatever the couple of distinct indexes (i, j), then the component standard deviation matrices of X are themselves proportional; so the model FPFF of MixTRV is characterizable by its bivariate margins. On the contrary, it is possible that every couple of margins (X_i, X_j) is distributed according to the model DDED of bgmm whereas X is not distributed according to DDED; so the bgmm model DDED is not characterizable by its bivariate margins. Column 7 of Table 1 summarizes which models of bgmm, pgmm, mclust, mixmod and MixTRV satisfy (or not) Property 4: MixTRV is the only package all of whose models are characterizable by their bivariate margins.

Property 5 (Likelihood Ratio Test Scale Invariance). A model collection is scale invariant as regards the Likelihood Ratio Test (LRT) if changing the units of the data leaves the likelihood ratio of any couple of nested models unchanged.

Illustration. The mclust model family is not scale invariant as regards the LRT. For example, the ratio of maximized likelihoods of the mclust models VEV and VVV (the latter being more complex than the former by three degrees of freedom), inferred on the turtle data of Jolicoeur and Mosimann (1960), is not the same when the three variables carapace length, width and height are all measured in cm as when carapace length and width are standardized (divided by their standard deviation). This means that, for a given significance level of the LRT, the parameter L_k of (2), which relates to the shape of the male and female turtle distributions, will be considered as homogeneous or free depending on the units of the data. So homogeneous distribution shapes is not an intrinsic property of male and female turtles, since it also depends on the measurement units.

Actually, none of the mclust, mixmod, pgmm and bgmm model families satisfies Property 5: within each of these collections there exists a couple of nested models whose likelihood ratio varies according to the units of the data. On the contrary, the MixTRV model set is LRT-scale invariant: for any couple of nested MixTRV models, the likelihood ratio remains the same whatever the units of the data. Table 1, Column 8 recalls that MixTRV is the only model family which satisfies Property 5.
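A sketch of why the likelihood ratio becomes unit-free when both nested models are scale invariant (an outline of the argument, using the same rescaling Y = SX as above):

% Nested models m_0 \subset m_1, rescaled sample Sx_1,...,Sx_n. When m_i is scale
% invariant, its maximized likelihood only picks up the Jacobian factor:
\sup_{\theta \in m_i} L(\theta; Sx_1,\dots,Sx_n) = (\det S)^{-n} \sup_{\theta \in m_i} L(\theta; x_1,\dots,x_n), \quad i = 0, 1
% The common factor cancels in the likelihood ratio:
\lambda(SX) = \frac{\sup_{m_0} L(\,\cdot\,; SX)}{\sup_{m_1} L(\,\cdot\,; SX)} = \frac{\sup_{m_0} L(\,\cdot\,; X)}{\sup_{m_1} L(\,\cdot\,; X)} = \lambda(X)

When one of the two constrained parameter sets is not mapped onto itself by the rescaling (for instance the VEV structure of mclust), its supremum does not transform by the factor (det S)^{-n} alone, and the ratio, hence the test decision, may depend on the measurement units.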

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716-723.

Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics, 39(3):357-365.

Biecek, P., Szczurek, E., Vingron, M., and Tiuryn, J. (2012). The R package bgmm: Mixture modeling with uncertain knowledge. Journal of Statistical Software, 47(3).

Biernacki, C., Celeux, G., and Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7):719-725.

Biernacki, C., Celeux, G., Govaert, G., and Langrognet, F. (2006). Model-based cluster and discriminant analysis with the mixmod software. Computational Statistics & Data Analysis, 51(2):587-600.

Biernacki, C. and Lourme, A. (in press). Stable and visualizable Gaussian parsimonious clustering models. Statistics and Computing.

Bouveyron, C. and Brunet-Saumard, C. (2014). Model-based clustering of high-dimensional data: A review. Computational Statistics & Data Analysis, 71:52-78.

Fraley, C., Raftery, A. E., Murphy, T. B., and Scrucca, L. (2012). mclust version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation. Technical Report 597, Department of Statistics, University of Washington.

Jolicoeur, P. and Mosimann, J. E. (1960). Size and shape variation in the painted turtle. A principal component analysis. Growth, 24:339-354.

Lebret, R., Iovleff, S., Langrognet, F., Biernacki, C., Celeux, G., and Govaert, G. (n.d.). Rmixmod: The R package of the model-based unsupervised, supervised and semi-supervised classification mixmod library. Journal of Statistical Software.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley Series in Probability and Statistics. Wiley-Interscience.

McLachlan, G. J., Peel, D., Basford, K. E., and Adams, P. (1999). The EMMIX software for the fitting of mixtures of normal and t-components. Journal of Statistical Software, 4(2).

McNicholas, P. D. and Murphy, T. B. (2010). Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics, 26(21):2705-2712.

McNicholas, P. D., Murphy, T. B., Jampani, K., McDaid, A., and Banks, L. (2011). pgmm version 1.0 for R: Model-based clustering and classification via latent Gaussian mixture models. Technical report, Department of Mathematics and Statistics, University of Guelph.

Russell, N., Cribbin, L., and Murphy, T. B. (n.d.). upclass: An R package for updating model-based classification rules.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464.

Wallace, C. S. and Boulton, D. M. (1968). An information measure for classification. The Computer Journal, 11(2):185-194.

Table 1: The parsimonious models of bgmm, pgmm, mixmod, mclust and MixTRV: common covariance structures, and summary of which models/families satisfy (+) or not (-) the stability properties 1 to 5. The covariance structures considered are:
bgmm: [D/E]DDD, [D/E]DD0, [D/E]DED, [D/E]DE0, [D/E]EDD, [D/E]ED0, [D/E]EED, [D/E]EE0;
pgmm: UUUU, UUCU, CUUU, CUCU, UCUU, UCCU, CCUU, CCCU, UCUC, UCCC, CCUC, CCCC;
mclust: VVV, VEV, EEV, EEE, VVI, EVI, VEI, EEI, VII, EII;
mixmod: [π_k/π]λ_kD_kA_kD_k', [π_k/π]λD_kA_kD_k', [π_k/π]λ_kD_kAD_k', [π_k/π]λD_kAD_k', [π_k/π]λ_kDA_kD', [π_k/π]λDA_kD', [π_k/π]λ_kDAD', [π_k/π]λDAD', [π_k/π]λ_kB_k, [π_k/π]λB_k, [π_k/π]λ_kB, [π_k/π]λB, [π_k/π]λ_kI, [π_k/π]λI;
mixtrv: [F/H]FFF, [F/H]PFF, [F/H]HFF, [F/H]FHF, [F/H]PHF, [F/H]HHF, [F/H]FFH, [F/H]PFH, [F/H]HFH, [F/H]FHH, [F/H]PHH.

Table 2: The four best models according to BIC within each family (bgmm, pgmm, mclust, mixmod, MixTRV), inferred on the Old Faithful data (K = 2), when the measurement units of Duration × Waiting vary: (a) min × min (original units), (b) sec × min, (c) standardized × standardized.
