Model-based cluster analysis: a Defence. Gilles Celeux, Inria Futurs


2 Model-based cluster analysis
Model-based clustering (MBC) assumes that the data come from a source with several subpopulations. Each subpopulation is modelled separately, and the overall population is a mixture of these subpopulations. The resulting model is a finite mixture model.

3 Finite Mixture Models
The general form of a mixture model with K groups is

f(x) = \sum_{k=1}^{K} p_k f_k(x),

where the p_k are the mixing proportions and the f_k(.) denote the densities of the groups. The parameterisation of the group densities depends on the nature (continuous or discrete) of the observed data.

4 A hidden structure model
The mixture model is an incomplete data structure model. The complete data are

y = (y_1,..., y_n) = ((x_1, z_1),..., (x_n, z_n)),

where the missing data z = (z_1,..., z_n), with z_i = (z_{i1},..., z_{iK}), are binary vectors such that z_{ik} = 1 iff x_i arises from group k. The z's define a partition P = (P_1,..., P_K) of the observed data x with P_k = {x_i | z_{ik} = 1}.

5 Multivariate Gaussian Mixture (MGM)
Multidimensional observations x = (x_1,..., x_n) in R^d are assumed to be a sample from a probability distribution with density

f(x_i | K, θ) = \sum_{k=1}^{K} p_k φ(x_i | m_k, Σ_k),

where the p_k are the mixing proportions and φ(. | m_k, Σ_k) denotes the Gaussian density with mean m_k and variance matrix Σ_k. This is the most popular model for clustering of quantitative data.
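
To make this density concrete, here is a minimal Python sketch (not part of the original slides) that evaluates the Gaussian mixture density above; the parameter values and names (`weights`, `means`, `covs`) are purely illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_mixture_density(x, weights, means, covs):
    """Evaluate f(x) = sum_k p_k * phi(x | m_k, Sigma_k) at a point x in R^d."""
    return sum(p * multivariate_normal.pdf(x, mean=m, cov=S)
               for p, m, S in zip(weights, means, covs))

# Toy two-component mixture in R^2 (illustrative values only).
weights = [0.6, 0.4]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.diag([2.0, 0.5])]
print(gaussian_mixture_density(np.array([1.0, 1.0]), weights, means, covs))
```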

6 Discrete Data
Observations to be classified are described with d discrete variables. Each variable j has m_j response levels. Data are represented in the following way: (x_1,..., x_n), where x_i = (x_i^{jh}; j = 1,..., d; h = 1,..., m_j) with x_i^{jh} = 1 if i has response level h for variable j, and x_i^{jh} = 0 otherwise.

7 The standard latent class model (LCM)
Data are supposed to arise from a mixture of g multivariate multinomial distributions with pdf

f(x_i; θ) = \sum_{k=1}^{g} p_k m_k(x_i; α_k), with m_k(x_i; α_k) = \prod_{j,h} (α_k^{jh})^{x_i^{jh}},

where
α_k^{jh} denotes the probability that variable j has level h if the object is in cluster k, and α_k = (α_k^{jh}; j = 1,..., d; h = 1,..., m_j),
p = (p_1,..., p_g) denotes the vector of mixing proportions of the g latent clusters,
θ = (p_k, α_k; k = 1,..., g) denotes the parameter vector of the latent class model to be estimated.
The latent class model assumes that the variables are conditionally independent given the latent clusters.
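
As an illustration (not from the slides), a minimal Python sketch of this latent class density under conditional independence; the data layout and all names are illustrative assumptions.

```python
import numpy as np

def lcm_density(x, p, alpha):
    """f(x_i; theta) = sum_k p_k * prod_{j,h} (alpha_k^{jh}) ** x_i^{jh},
    assuming the variables are conditionally independent given the cluster.
    x: list of one-hot (0/1) vectors, one per variable.
    alpha[k][j]: probability vector over the levels of variable j in cluster k."""
    total = 0.0
    for k, p_k in enumerate(p):
        m_k = 1.0
        for j, x_j in enumerate(x):
            m_k *= np.prod(np.asarray(alpha[k][j], dtype=float) ** np.asarray(x_j))
        total += p_k * m_k
    return total

# Two clusters, two variables with 3 and 2 levels (illustrative values only).
p = [0.5, 0.5]
alpha = [[[0.7, 0.2, 0.1], [0.9, 0.1]],
         [[0.1, 0.3, 0.6], [0.4, 0.6]]]
x = [[0, 1, 0], [1, 0]]   # object at level 2 of variable 1 and level 1 of variable 2
print(lcm_density(x, p, alpha))
```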

8 First Interest of MBC Many versatile or parsimonious models available

9 MGM: The variance matrix eigenvalue decomposition

Σ_k = V_k D_k A_k D_k^t,

where
V_k = |Σ_k|^{1/d} defines the component volume (d is the dimension of the observation space),
D_k, the matrix of eigenvectors of Σ_k, defines the component orientation,
A_k, the diagonal matrix of normalised eigenvalues, defines the component shape.
By allowing some of these quantities to vary between components, we get different and easily interpreted models.
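
The decomposition can be checked numerically; here is a hedged Python sketch (function name and example matrix are illustrative) that splits a covariance matrix into volume, orientation and shape as described above.

```python
import numpy as np

def volume_orientation_shape(Sigma):
    """Decompose Sigma_k = V_k * D_k A_k D_k^t with V_k = |Sigma_k|^(1/d) and |A_k| = 1."""
    d = Sigma.shape[0]
    eigvals, eigvecs = np.linalg.eigh(Sigma)      # columns of eigvecs are eigenvectors
    V = np.linalg.det(Sigma) ** (1.0 / d)         # volume
    D = eigvecs                                   # orientation
    A = np.diag(eigvals / V)                      # shape: normalised eigenvalues, det(A) = 1
    return V, D, A

Sigma = np.array([[4.0, 1.0],
                  [1.0, 1.0]])                    # illustrative covariance matrix
V, D, A = volume_orientation_shape(Sigma)
print(round(V, 4), round(np.linalg.det(A), 6))    # det(A) is 1 by construction
print(np.allclose(V * D @ A @ D.T, Sigma))        # True: the decomposition reconstructs Sigma
```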

10 28 different models
Following Banfield & Raftery (1993) or Celeux & Govaert (1995), a large range of 28 versatile models (from the most complex to the simplest one) derived from this eigenvalue decomposition can be considered.
The general family: assuming equal or free proportions, volumes, orientations and shapes leads to 16 different models.
The diagonal family: assuming in addition that the component variance matrices are diagonal leads to 8 models.
The spherical family: assuming in addition that the component variance matrices are proportional to the identity matrix leads to 4 models.

11 LCM: a Reparameterization
α_k → (a_k, ε_k), where the binary vector a_k = (a_k^1,..., a_k^d) provides the modal levels in cluster k, for each variable j:

a_k^{jh} = 1 if h = arg max_h α_k^{jh}, and a_k^{jh} = 0 otherwise,

and the ε_k^j can be regarded as scattering values:

ε_k^{jh} = 1 - α_k^{jh} if a_k^{jh} = 1, and ε_k^{jh} = α_k^{jh} if a_k^{jh} = 0.

For instance, if α_k^j = (0.7, 0.2, 0.1), the new parameters are a_k^j = (1, 0, 0) and ε_k^j = (0.3, 0.2, 0.1).
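
A small Python sketch of this reparameterization (illustrative, reproducing the (0.7, 0.2, 0.1) example on the slide):

```python
import numpy as np

def reparameterize(alpha_j):
    """Map the level probabilities alpha_k^j of one variable in one cluster
    to the modal indicator a_k^j and the scattering values eps_k^j."""
    alpha_j = np.asarray(alpha_j, dtype=float)
    a = np.zeros_like(alpha_j)
    a[np.argmax(alpha_j)] = 1.0                       # modal level gets a = 1
    eps = np.where(a == 1.0, 1.0 - alpha_j, alpha_j)  # 1 - alpha at the mode, alpha elsewhere
    return a, eps

a, eps = reparameterize([0.7, 0.2, 0.1])
print(a, eps)    # [1. 0. 0.] [0.3 0.2 0.1], matching the example above
```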

12 Five latent class models
Denoting h(i,j) the level of object i for variable j and h(j,k) the modal level of variable j in cluster k, the model can be written

f(x_i; θ) = \sum_{k} p_k \prod_{j} (1 - ε_k^{j h(j,k)})^{x_i^{j h(j,k)}} (ε_k^{j h(i,j)})^{1 - x_i^{j h(j,k)}}.

Using this form, it is possible to impose various constraints on the scattering parameters ε_k^{jh}. The models we consider are the following:
the standard latent class model [ε_k^{jh}]: the scattering depends upon clusters, variables and levels;
[ε_k^j]: the scattering depends upon clusters and variables but not upon levels;
[ε_k]: the scattering depends upon clusters, but not upon variables;
[ε^j]: the scattering depends upon variables, but not upon clusters;
[ε]: the scattering is constant over variables and clusters.

13 Second interest of MBC Many algorithms to estimate the mixture model from different points of view

14 Maximum likelihood estimation
The EM algorithm is the reference tool for deriving the ML estimates in a mixture model.
E step: compute the conditional probabilities t_{ik}, i = 1,..., n, k = 1,..., K, that x_i arises from the kth component for the current value of the mixture parameters.
M step: update the mixture parameter estimates by maximising the expected value of the completed likelihood. This amounts to weighting observation i for group k with the conditional probability t_{ik}.
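
As a rough illustration (a minimal sketch, not the implementation used in any of the packages cited later), one EM iteration for a univariate Gaussian mixture; all names and the toy data are assumptions.

```python
import numpy as np
from scipy.stats import norm

def em_step(x, p, mu, sigma):
    """One EM iteration for a univariate Gaussian mixture."""
    # E step: t[i, k] = conditional probability that x_i arises from component k
    dens = np.stack([p_k * norm.pdf(x, m_k, s_k)
                     for p_k, m_k, s_k in zip(p, mu, sigma)], axis=1)
    t = dens / dens.sum(axis=1, keepdims=True)
    # M step: update parameters, each observation weighted by t[i, k]
    n_k = t.sum(axis=0)
    p_new = n_k / len(x)
    mu_new = (t * x[:, None]).sum(axis=0) / n_k
    sigma_new = np.sqrt((t * (x[:, None] - mu_new) ** 2).sum(axis=0) / n_k)
    return p_new, mu_new, sigma_new, t

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(5.0, 1.0, 100)])
p, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    p, mu, sigma, t = em_step(x, p, mu, sigma)
print(np.round(p, 2), np.round(mu, 2), np.round(sigma, 2))
```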

15 Classification EM
E step: compute the conditional probabilities t_{ik}, i = 1,..., n, k = 1,..., K, that x_i arises from the kth component for the current value of the mixture parameters.
C step: assign each point x_i to the component maximising the conditional probability t_{ik} (MAP principle).
M step: update the mixture parameter estimates by maximising the completed likelihood.
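
A minimal sketch of the C step only (illustrative; in CEM it would be inserted between the E and M steps of the EM sketch above):

```python
import numpy as np

def c_step(t):
    """C step: hard-assign each point to the component maximising t_ik (MAP principle)."""
    z = np.zeros_like(t)
    z[np.arange(t.shape[0]), np.argmax(t, axis=1)] = 1.0
    return z

# In CEM, the M step is then run with z in place of t, so that each observation
# contributes with weight 0 or 1 to its assigned component.
```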

16 Features of CEM
CEM aims at maximising the complete likelihood, in which the component label of each sample point is included in the data set.
Contrary to EM, CEM converges in a finite number of iterations.
CEM is a K-means-like algorithm.
CEM provides biased estimates of the mixture parameters.

17 Stochastic EM
E step: compute the conditional probabilities t_{ik}, i = 1,..., n, k = 1,..., K, that x_i arises from the kth component for the current value of the mixture parameters.
S step: assign each point x_i at random to one of the components according to the distribution defined by (t_{ik}, k = 1,..., K).
M step: update the mixture parameter estimates by maximising the completed likelihood.
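
A minimal sketch of the S step only (illustrative, replacing the C step of the previous sketch):

```python
import numpy as np

def s_step(t, rng=None):
    """S step: draw each label z_i from the categorical distribution (t_i1, ..., t_iK)."""
    rng = rng or np.random.default_rng()
    n, K = t.shape
    z = np.zeros_like(t)
    for i in range(n):
        z[i, rng.choice(K, p=t[i])] = 1.0
    return z

# As in CEM, the M step is then run with the simulated z in place of t.
```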

18 Features of SEM
SEM generates a Markov chain whose stationary distribution is (more or less) concentrated around the ML parameter estimator. Thus a natural parameter estimate from a SEM sequence is the mean of the iterate values obtained after a burn-in period (SEMmean). An alternative is to take the parameter value leading to the largest likelihood in the SEM sequence (SEMmax). Different variants (Monte Carlo EM, Simulated Annealing EM) are possible.

19 Strategies for getting a sensible ML estimate
Aim: find the maximum likelihood value for a specified model and component number. Biernacki, Celeux & Govaert (2003) proposed simple strategies acting in two ways to deal with this problem.
Starting values: use of CEM or SEM.
Repeating: run EM, or algorithms combining CEM and EM, several times (a sketch of this restart idea is given below).
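
A rough illustration of the repeating strategy, assuming the em_step function from the EM sketch above is in scope; all other names and defaults are illustrative assumptions, not the strategy of Biernacki, Celeux & Govaert (2003) itself.

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(x, p, mu, sigma):
    dens = np.stack([p_k * norm.pdf(x, m_k, s_k)
                     for p_k, m_k, s_k in zip(p, mu, sigma)], axis=1)
    return np.log(dens.sum(axis=1)).sum()

def best_of_restarts(x, K, n_starts=10, n_iter=100, seed=0):
    """Run EM from several random starts and keep the solution with the largest likelihood."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        p = np.full(K, 1.0 / K)
        mu = rng.choice(x, size=K, replace=False)        # random observations as starting centres
        sigma = np.full(K, x.std())
        for _ in range(n_iter):
            p, mu, sigma, _ = em_step(x, p, mu, sigma)   # em_step from the EM sketch above
        ll = log_likelihood(x, p, mu, sigma)
        if best is None or ll > best[0]:
            best = (ll, p, mu, sigma)
    return best
```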

20 Third interest of MBC Finite mixture models can be compared and assessed in an objective way

21 Model selection
Choosing a parsimonious model in a collection of models amounts to solving the bias-variance dilemma: too simple a model leads to a large approximation error, while too complex a model leads to a large estimation error. Standard model selection criteria are AIC and BIC; both are penalised likelihood criteria.

22 The AIC criterion
AIC approximates the deviance of a model m with ν_m free parameters,

d(x) = 2[log p(x) - log p(x | m, \hat θ_m)],

on a single test observation x. The penalisation is an estimation of the expected deviance on a test observation X,

D(X) = 2 E[log p(X) - log p(X | m, \hat θ_m)].

Assuming that the data arose from a distribution belonging to the collection of models in competition, AIC is

AIC(m) = 2 log p(x | m, \hat θ_m) - 2ν_m.

23 The BIC criterion
BIC is a Bayesian criterion. It approximates the integrated likelihood of the data,

p(x | m) = \int p(x | m, θ_m) π(θ_m) dθ_m,

π(θ_m) being a prior distribution for the parameter θ_m. BIC is

BIC(m) = log p(x | m, \hat θ_m) - (ν_m / 2) log(n).

This approximation is appropriate when one and only one of the competing models is true.
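
To fix ideas, a minimal sketch computing AIC and BIC as written on these two slides (both criteria are to be maximised here); the parameter count shown is for a K-component Gaussian mixture in R^d with free proportions, means and covariance matrices, and all names are illustrative.

```python
import math

def aic(max_loglik, nu):
    """AIC(m) = 2 log p(x | m, theta_hat) - 2 nu_m."""
    return 2.0 * max_loglik - 2.0 * nu

def bic(max_loglik, nu, n):
    """BIC(m) = log p(x | m, theta_hat) - (nu_m / 2) log n."""
    return max_loglik - 0.5 * nu * math.log(n)

def nu_gaussian_mixture(K, d):
    """Free parameters: (K - 1) proportions + K*d means + K*d*(d+1)/2 covariance terms."""
    return (K - 1) + K * d + K * d * (d + 1) // 2

print(bic(max_loglik=-1234.5, nu=nu_gaussian_mixture(K=3, d=2), n=200))
```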

24 Choosing a mixture model in a density estimation context
Despite theoretical difficulties in the mixture context, simulation experiments (see Roeder & Wasserman 1997) show that BIC works well at a practical level to choose a sensible Gaussian mixture model. See also the good performance of the MML criterion proposed by Figueiredo & Jain (2002). Simulation experiments (Nadolski & Viele 2004) show that a variant of AIC, AIC3, which replaces the penalty 2ν_m with 3ν_m, works well at a practical level to choose a sensible latent class mixture model.

25 Assessing K in a clustering context
Assessing the number of groups K is an important but difficult problem. Mixture modelling can be regarded as a semi-parametric tool for density estimation or as a model for cluster analysis. In the density estimation context, BIC does the job quite well. In the cluster analysis context, since BIC does not take the clustering purpose into account when assessing K, it has a tendency to overestimate K regardless of the separation of the clusters. To overcome this limitation, it can be advantageous to choose K so as to obtain the mixture giving rise to the data partition with the greatest evidence (Biernacki, Celeux & Govaert 2000).

26 The ICL criterion: definition
This leads to considering the integrated likelihood of the complete data (x, z) (or integrated completed likelihood),

p(x, z | K) = \int_{Θ_K} p(x, z | K, θ) π(θ | K) dθ,

where

p(x, z | K, θ) = \prod_{i=1}^{n} p(x_i, z_i | K, θ), with p(x_i, z_i | K, θ) = \prod_{k=1}^{K} p_k^{z_{ik}} [φ(x_i | a_k)]^{z_{ik}},

a_k denoting the parameters of the kth component. To approximate this integrated complete likelihood, a BIC-like approximation is possible. It leads to the criterion

ICL(K) = log p(x, \hat z | K, \hat θ) - (ν_K / 2) log n,

where the missing data have been replaced by their most probable values for the parameter estimate \hat θ.

27 Behavior of the ICL criterion
Roughly speaking, the ICL criterion is the BIC criterion penalised by the estimated mean entropy

E(K) = - \sum_{k=1}^{K} \sum_{i=1}^{n} t_{ik} log t_{ik} ≥ 0,

t_{ik} denoting the conditional probability that x_i arises from the kth mixture component (1 ≤ i ≤ n and 1 ≤ k ≤ K). Because of this additional entropy term, ICL favours K values giving rise to a data partition with the greatest evidence. ICL appears to provide a stable and reliable estimate of K for real data sets, and also for simulated data sets from mixtures whose components do not overlap too much. But ICL, which does not aim to discover the true number of mixture components, can underestimate the number of components for simulated data arising from mixtures with poorly separated components.
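
The relation between ICL and BIC above translates directly into code; a minimal, self-contained sketch with illustrative names:

```python
import numpy as np

def icl(max_loglik, nu, n, t, eps=1e-12):
    """ICL(K) ~ BIC(K) - E(K), with E(K) = -sum_{i,k} t_ik log t_ik >= 0."""
    bic = max_loglik - 0.5 * nu * np.log(n)
    entropy = -np.sum(t * np.log(t + eps))   # estimated mean entropy of the fuzzy partition
    return bic - entropy

# t is the n x K matrix of conditional probabilities from the fitted mixture;
# well-separated clusters give t close to 0/1, a small entropy, and ICL close to BIC.
```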

28 Fourth interest of MBC Special questions can be tackled in a proper way in the MBC context

29 Robust Cluster Analysis
Using multivariate Student distributions instead of multivariate Gaussian distributions attenuates the influence of outliers (McLachlan & Peel 2000).
Including in the mixture a group drawn from a uniform distribution makes it possible to take noisy data into account (DasGupta & Raftery 1998).
Shrinking the group variance matrices in a proper way is another possibility (see for instance Ciuperca, Idier & Ridolfi 2002).

30 Some other special questions solved with MBC
Imposing specific constraints, such as groups with known distributions.
Dealing with data missing at random in a proper way (Hunt & Basford 1999, 2001).
Variable selection for model-based clustering with a stepwise procedure using the BIC criterion (Raftery & Dean 2006); see also Law, Figueiredo & Jain (2004) for an alternative approach.
Simple and efficient models in semi-supervised classification (see for instance Ganesalingam & McLachlan 1978,...).
Assuming the variables are conditionally independent given the groups makes it possible, in a simple way, to treat continuous and discrete data within the same MBC framework.

31 Concluding Remarks
Finite mixture analysis is definitely a powerful framework for model-based cluster analysis.
Many free and valuable software packages are available: C.A.MAN, EMMIX, FLEXMIX, MClust, MIXMOD, MULTIMIX, SNOB,...
Since the seminal paper of Diebolt & Robert (1993), Bayesian inference for mixtures has received a lot of attention. Bayesian analysis of univariate mixtures has become the standard Bayesian tool for density estimation. It cannot yet be regarded as a standard tool for Bayesian clustering of multivariate data, since several problems are not completely solved: slow convergence, difficulties with subjective or weakly informative priors, identifiability (label switching),...
