
Submitted to the Brazilian Journal of Probability and Statistics

A Bayesian sparse finite mixture model for clustering data from a heterogeneous population

Erlandson F. Saraiva^a, Adriano K. Suzuki^b and Luís A. Milan^c
a Instituto de Matemática, Universidade Federal de Mato Grosso do Sul
b Departamento de Matemática Aplicada e Estatística, Universidade de São Paulo
c Departamento de Estatística, Universidade Federal de São Carlos

Abstract. In this paper we introduce a Bayesian approach for clustering data using a sparse finite mixture model (SFMM). The SFMM is a finite mixture model with a large, previously fixed number of components k, in which many components may be empty. In this model, the number of components k can be interpreted as the maximum number of distinct mixture components. We then explore a prior distribution for the weights of the mixture model that takes into account the possibility that the number of clusters k_c (i.e., non-empty components) is random and smaller than the number of components k of the finite mixture model. In order to determine clusters we develop an MCMC algorithm called the Split-Merge allocation sampler. In this algorithm the split-merge strategy is data-driven and is inserted within the algorithm in order to improve the mixing of the Markov chain with respect to the number of clusters. The performance of the method is verified using simulated datasets and three real datasets. The first real data set is the benchmark galaxy data, while the second and third are the publicly available Enzyme and Acidity data sets, respectively.

MSC 2010 subject classifications: Primary 62xx; secondary 62Mxx
Keywords and phrases. Mixture model, Bayesian approach, Gibbs sampling, Metropolis-Hastings, Split-merge update.

1 Introduction

The main goal of clustering analysis is to group the observed data into homogeneous clusters. The first works on clustering are due to Sneath (1957) and Sokal and Michener (1958). Since the publication of these articles the clustering problem has been studied by many authors, such as Hartigan and Wong (1979), Binder (1978), Anderson (1985), Banfield and Raftery (1993), Bensmail et al. (1997) and Witten and Tibshirani (2010). According to Bouveyron and Brunet (2013), more and more scientific fields require clustering data with the aim of understanding the phenomenon of interest. Usual clustering approaches are distance-based, such as k-means (MacQueen, 1967) and hierarchical clustering (Ward, 1963).

Although simple and visually appealing, these methods can be implemented only for cases where the number of clusters is known. Besides, according to Oh and Raftery (2007), these methods are not based on standard principles of statistical inference and they do not provide a statistically based method for choosing the number of clusters.

Under a probabilistic approach, McLachlan and Basford (1988), Banfield and Raftery (1993), McLachlan and Peel (2000) and Fraley and Raftery (2002) proposed the use of a finite mixture model in which each component of the mixture represents a cluster. In these methods, partitions are determined by the EM (expectation-maximization) algorithm and the number of components of the mixture is, in general, determined by comparing fitted models with different numbers of components using some model selection criterion, such as AIC (Akaike, 1974; Bozdogan, 1987) or BIC (Schwarz, 1978). A similar strategy is adopted in the Bayesian approach, considering the DIC (Spiegelhalter et al., 2002) as model selection criterion; see for example Celeux et al. (2000). This can be seen as a drawback to be overcome, since in practice it may be very tedious to fit several models and afterwards compare them according to a model selection criterion. Also, in many cases the estimation depends on iterative methods which may not converge, imposing additional difficulties to the process. Thus, a practical and efficient computational algorithm to estimate the number of clusters jointly with the component-specific parameters is desirable. Under this scenario, the Bayesian approach has been successful, in particular due to the reversible-jump Markov chain Monte Carlo algorithm proposed by Richardson and Green (1997) in the context of Gaussian mixture models. However, one difficulty frequently encountered when implementing a reversible-jump algorithm is the construction of efficient transition proposals that lead to a reasonable acceptance rate.

In this paper we consider a sparse finite mixture model for clustering data. This model is a finite mixture model with k components, previously fixed as a large number, where many components can be empty. The large value assumed for k can be interpreted as the maximum number of distinct mixture components. The term sparse refers to the existence of many empty components. The main motivation to consider this model is the fact that a finite mixture model is a population model; given an observed sample y, not all k components may have observations in the sample and we may have empty components. In addition, assuming this model, we avoid the need to use the reversible-jump MCMC method. This allows us to implement the split-merge movements using the observed data instead of using a transition function as in an RJMCMC algorithm.

Thus, we assume that the number of clusters k_c (i.e., the number of non-empty components) is an unknown quantity, but smaller than k. To estimate k_c jointly with the other parameters of interest we consider a Bayesian approach. In this approach, we set up a symmetric Dirichlet prior distribution with hyperparameter γ on the weights of the mixture. Then, using the Dirichlet integral, we integrate out the weights and derive the prior and posterior allocation probabilities.

In order to simulate from the joint posterior distribution of the parameters of interest we develop an MCMC algorithm, the Split-Merge Allocation Sampler (SMAS). This algorithm consists of three steps. In the first step the parameters of the non-empty components (i.e., clusters) are updated from their conditional posterior distributions and the parameters of the empty components are updated from the prior distribution. In the second step a Gibbs sampling is performed to update the latent indicator variables using the posterior allocation probabilities. The third step consists of a split-merge move that changes the number of clusters by ±1. This step was inserted within the algorithm in order to allow a major change in the configuration of the latent variables in a single iteration of the algorithm, consequently increasing the mixing of the Markov chain with respect to the number of clusters. The split and merge movements are developed in a way that makes them reversible. In a split move each observation is allocated to one of two new partitions based on probabilities which are calculated according to the marginal likelihood function of the observations previously allocated to the two new partitions. Given the proposed allocation of observations, the candidate values for the component parameters are generated from the conditional posterior distributions. Choosing the posterior density as the candidate-generating density, the likelihood ratio and the ratio of prior densities in the Metropolis-Hastings acceptance probability cancel the corresponding terms in the posterior density, simplifying the computation of the acceptance probability. Thus, the proposed algorithm performs a standard Metropolis-Hastings update with an acceptance probability which depends only on the data associated with the component(s) selected for a split or a merge. Our approach does not require the specification of transition functions to estimate the number of clusters jointly with the component-specific parameters, and when a new cluster is created through a split or a merge movement it determines a new partition of the data set. This is due to the way we implement the split-merge movements, which is data-driven instead of parameter-based.

In order to verify the performance of the SMAS we develop a simulation study considering data sets generated with different numbers of clusters. We also verify its sensitivity to the choice of the value of the hyperparameter γ of the Dirichlet prior distribution. To do this we explore two scenarios. In the first we set γ = r, where r is a value of a grid G. In the second we consider one more hierarchical level, with a prior distribution on γ, and estimate it from its posterior distribution. For each artificial dataset we present the performance of the SMAS in terms of posterior probability for the number of clusters, convergence, mixing and autocorrelation. We also apply the methodology to three real data sets. The first is the benchmark data on velocities of distant galaxies diverging from our own, previously analyzed by Roeder and Wasserman (1997), Escobar and West (1995), Richardson and Green (1997), Stephens (2000) and Saraiva et al. (2014), available in the R software. The second and the third are the Enzyme and Acidity datasets extracted from the website /mapjg/mixdata.

The remainder of the paper is structured as follows. In Section 2 we describe the sparse finite mixture model for clustering data. In Section 3 we introduce the hierarchical Bayesian approach and develop the SMAS algorithm. In Section 4 the proposed sampler is applied to simulated data sets to assess its performance and to real data sets to illustrate its use. Section 5 concludes the paper with final remarks. Additional details are provided in the supplementary material, denoted by the prefix SM when referred to in this paper. Table 1 presents the main notation used throughout the article.

Table 1. Mathematical notation.

Notation                         Description
k                                Number of components
k_c                              Number of clusters
θ_j                              Parameter of the j-th component, for j = 1, ..., k
θ = (θ_1, ..., θ_k)              The whole vector of parameters
w_j                              Weight of the j-th component, for j = 1, ..., k
Y_i                              The i-th sampled value, for i = 1, ..., n
c_i                              The i-th indicator variable, for i = 1, ..., n
y = (y_1, ..., y_n)              The vector of independent observations
c = (c_1, ..., c_n)              The vector of latent indicator variables
Φ = (θ, c, k_c)                  Current state
Φ^sp = (θ^sp, c^sp, k_c^sp)      Split state
Φ^me = (θ^me, c^me, k_c^me)      Merge state

2 Mixture model for clustering

Consider a population composed of k subpopulations, such that the sampling units are homogeneous with respect to the characteristic under study within each subpopulation and heterogeneous among the subpopulations. Let w_1, ..., w_k be the relative frequencies of each subpopulation in relation to the overall population, with 0 ≤ w_j ≤ 1 and ∑_{j=1}^{k} w_j = 1. Assume that each subpopulation j is modeled by a probability distribution F(θ_j) indexed by a parameter θ_j (scalar or vector), for j = 1, ..., k.

Suppose that the sampling process consists of choosing a subpopulation j with probability w_j and then sampling a value Y_i from this subpopulation, for j = 1, ..., k and i = 1, ..., n, where n is the sample size. Then we can represent each sample unit by the pair (Y_i, c_i), where c_i is an indicator variable that assumes a value in the set {1, ..., k} with probabilities {w_1, ..., w_k}, respectively. Thus, we have that

(Y_i | c_i = j, θ_j) ∼ F(θ_j)   and   P(C_i = j | w) = w_j,     (2.1)

where w = (w_1, ..., w_k), for i = 1, ..., n and j = 1, ..., k. In many practical problems, such as clustering problems, the indicator variables are non-observable (also known as latent variables). The probability of the i-th observation coming from subpopulation j is w_j, for i = 1, ..., n and j = 1, ..., k. The marginal probability density function for Y_i = y_i is given by

f(y_i | θ, w) = ∑_{j=1}^{k} w_j f(y_i | θ_j),     (2.2)

where f(y_i | θ_j) is the probability density function of F(θ_j), θ = (θ_1, ..., θ_k) is the whole vector of parameters and w = (w_1, ..., w_k) are the weights.

Model (2.2) is known in the literature as a finite mixture model. In this model each probability density function, f(· | θ_j), corresponds to a component of the mixture. The finite mixture model is a natural probabilistic approach for clustering. However, as model (2.2) is a population model, given an observed sample y = (y_1, ..., y_n) not all k components may have observations in the sample and we may have empty components, i.e., fewer than k components are represented in the sample (Fruhwirth-Schnatter, 2017). In this case, the number of clusters (i.e., non-empty components) is smaller than the number of components k. Walli et al. (2016) and Fruhwirth-Schnatter (2017) call this model a sparse finite mixture model.
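For illustration (this sketch is ours, not part of the paper), the generative process in (2.1)-(2.2) can be simulated as follows, assuming Gaussian components F(θ_j) = N(µ_j, σ_j²); the returned labels play the role of the latent indicator variables c_i (0-based here).

```python
import numpy as np

def sample_finite_mixture(n, w, mu, sigma, seed=None):
    """Draw n observations from a finite Gaussian mixture.

    w, mu, sigma are length-k arrays of weights, component means and standard
    deviations; returns (y, c), where c holds the latent component labels.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                       # enforce sum-to-one
    c = rng.choice(len(w), size=n, p=w)   # P(C_i = j | w) = w_j
    y = rng.normal(np.asarray(mu)[c], np.asarray(sigma)[c])
    return y, c

# Example: two well-separated Gaussian clusters
y, c = sample_finite_mixture(500, w=[0.5, 0.5], mu=[0.0, 3.0], sigma=[1.0, 1.0], seed=1)
```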

Our interest is to make inference about the number of clusters k_c and the latent variables c jointly with the component parameters. Because of this, we consider c as parameters to be estimated in model (2.2).

2.1 Joint distribution for complete data

Let (y, c) be the complete data, where y = (y_1, ..., y_n) is the vector of independent observations and c = (c_1, ..., c_n) is the vector of latent indicator variables, with y and c being paired. Let k_c ≤ k be the number of clusters defined by the configuration c. From model (2.1), the joint probability of the latent indicator variables c is given by

π(C = c | w, k) = ∏_{j=1}^{k} w_j^{n_j},     (2.3)

where n_j is the number of observations y_i allocated to subpopulation j, for j = 1, ..., k. The distribution of the number of observations assigned to each component, n_1, ..., n_k, called the occupation numbers, is multinomial,

(n_1, ..., n_k | n, w) ∼ Multinomial(n, w),     (2.4)

where n = n_1 + ... + n_k.

Let D_j = {y_i ; c_i = j} be the set of observations allocated to component j, for j = 1, ..., k. Without loss of generality, assume that clusters are labelled from 1 to k_c and that the empty components are labelled from k_c + 1 to k. Thus, the joint distribution of the complete data (y, c) conditional on the mixing proportions w, component parameters θ and number of components k is

P(Y = y, C = c | w, θ, k) = P(Y = y | c, w, θ, k) π(C = c | w, k)     (2.5)
    = ∏_{j=1}^{k} ∏_{i=1}^{n} [w_j f(y_i | θ_j)]^{I_{c_i}(j)} = ∏_{j=1}^{k_c} w_j^{n_j} ∏_{D_j} f(y_i | θ_j),

where I_{c_i}(j) = 1 if c_i = j and I_{c_i}(j) = 0 otherwise, for i = 1, ..., n and j = 1, ..., k.

At this point some remarks about label switching are necessary.

Note that the cluster labels j = 1, ..., k_c are not uniquely determined and a permutation of the labels would lead to the same model. Since our interest lies in inferences on partitions, the non-identifiability of the labels would cause a problem in the posterior computation and the allocation probabilities would be useless for partitioning the observations (Stephens, 2000). Therefore we impose restrictions on the component means of the clusters to get identifiability, i.e., we assume that µ_1, ..., µ_{k_c} are the component means of the clusters and that µ_1 < ... < µ_{k_c}. However, this does not prevent the MCMC algorithm described in the next section from being applicable under other labellings. For further discussion and additional references about label switching see Stephens (2000), Jasra et al. (2005) and their references.

3 Bayesian approach

In order to estimate c and the number of clusters k_c jointly with the component parameters θ, we consider the following hierarchical Bayesian model

Y_i | c_i = j, θ, k ∼ F(θ_j),     (3.1)
c_i | w, k ∼ Discrete(w_1, ..., w_k),
θ_j ∼ G(η_j),
w | γ, k ∼ Dirichlet(γ),

for i = 1, ..., n, where G(η_j) is the prior distribution for the component parameters θ_j, η_j (scalar or vector) are the hyperparameters, for j = 1, ..., k, and Dirichlet(γ) represents the symmetric Dirichlet distribution with parameter γ, γ > 0, and probability density function given by

π(w | γ, k) = [Γ(kγ) / Γ(γ)^k] ∏_{j=1}^{k} w_j^{γ-1}.     (3.2)

The choice of the prior distribution G(η_j) for the parameter θ_j depends on the probability distribution F(θ_j) assumed for Y_i, for i = 1, ..., n and j = 1, ..., k. In Section 4 we discuss the choice of G(η_j) in the context in which F(θ_j) is a Gaussian distribution with mean µ_j and variance σ_j², i.e., θ_j = (µ_j, σ_j²), for j = 1, ..., k.

It is important to note that, depending on the value fixed for the hyperparameter γ, the Dirichlet sampling may generate some weights w_j ≈ 0, and consequently the multinomial sampling given in (2.4) leads to partitions with n_j = 0 for some j ∈ {1, ..., k}. According to Walli et al. (2016) and Fruhwirth-Schnatter (2017) this happens when we consider a small value for the hyperparameter γ.
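As a small numerical illustration of this sparsity effect (ours, not from the paper), the snippet below draws symmetric Dirichlet weights and the multinomial occupation numbers of (2.4) for a few values of γ; for small γ most of the k components typically receive no observations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 10

for gamma in [0.01, 0.5, 2.0]:
    w = rng.dirichlet(np.full(k, gamma))   # symmetric Dirichlet(gamma) weights
    n_j = rng.multinomial(n, w)            # occupation numbers, eq. (2.4)
    k_c = int(np.sum(n_j > 0))             # non-empty components, i.e. clusters
    print(f"gamma = {gamma}: k_c = {k_c}, n_j = {n_j}")
```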

In order to verify the sensitivity of the results to the choice of the value of the hyperparameter γ we explore two scenarios. First we set γ = r, where r is a value of the set G = {R_1, R_2}, in which R_1 is a grid from 0.01 to 0.09 with a fixed increment step equal to 0.01 and R_2 is a grid from 0.1 to 2 with a fixed increment step equal to 0.1. In the second scenario we consider one more hierarchical level with a gamma prior distribution on γ, γ ∼ Γ(a, b), for a, b > 0. The parametrization of the gamma distribution is such that the expected value is a/b and the variance is a/b².

Since our main interest lies in the configuration c, we integrate out the mixing proportions. Taking the product of equations (2.3) and (3.2) and integrating out the mixing proportions, we can write the joint probability of c given γ and k as

π(C = c | γ, k) = [Γ(kγ) / (Γ(γ)^k Γ(n + kγ))] ∏_{j=1}^{k} Γ(n_j + γ).     (3.3)

The joint probability of the complete data (y, c) given θ, γ and k is given by

P(Y = y, C = c | θ, γ, k) = P(Y = y | c, θ, k) π(C = c | γ, k) = [∏_{j=1}^{k_c} ∏_{D_j} f(y_i | θ_j)] π(C = c | γ, k),     (3.4)

where π(C = c | γ, k) is given in (3.3).

Using the Dirichlet integral and keeping all but a single indicator variable fixed, the conditional distribution of a single latent indicator variable c_i given all the others, denoted by c_{-i} = (c_1, ..., c_{i-1}, c_{i+1}, ..., c_n), is given by

π(c_i = j | c_{-i}, γ) = (n_{j,-i} + γ) / (n + kγ - 1),     (3.5)

where n_{j,-i} is the number of observations in D_j except the observation y_i, for i = 1, ..., n and j = 1, ..., k. From (3.5), the probability of observation y_i being assigned to one empty component is

π(c_i = j | c_{-i}, γ) = γ / (n + kγ - 1),     (3.6)

for j ∈ {k_{c_{-i}} + 1, ..., k}, where k_{c_{-i}} is the number of clusters defined by the configuration c_{-i}, for i = 1, ..., n. The expression in (3.6) is the probability of an observation y_i defining a new cluster, for i = 1, ..., n.
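A minimal Python sketch of the collapsed prior allocation probabilities (3.5)-(3.6) follows (an illustration, not code from the paper); for one observation it returns the prior probability of joining each existing cluster and the per-component probability of opening a new cluster.

```python
import numpy as np

def prior_allocation_probs(counts_minus_i, k, gamma):
    """Prior allocation probabilities for one observation, eqs. (3.5)-(3.6).

    counts_minus_i: occupation numbers of the current clusters with the i-th
    observation removed; k is the number of mixture components and gamma the
    symmetric Dirichlet hyperparameter.
    """
    counts_minus_i = np.asarray(counts_minus_i, dtype=float)
    n = counts_minus_i.sum() + 1                    # total sample size
    denom = n + k * gamma - 1
    p_existing = (counts_minus_i + gamma) / denom   # eq. (3.5)
    p_new_each = gamma / denom                      # eq. (3.6), per empty component
    k_empty = k - len(counts_minus_i)
    return p_existing, p_new_each, k_empty

p_exist, p_new, k_empty = prior_allocation_probs([120, 200, 179], k=10, gamma=0.05)
# p_exist.sum() + k_empty * p_new == 1, up to floating-point error
```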

3.1 Posterior distribution

Using Bayes' theorem, the joint posterior distribution upon which inference is based is given by

π(θ, c | y, γ, k) ∝ P(Y = y | c, θ, k) π(C = c | γ, k) π(θ | k),     (3.7)

where P(Y = y | c, θ, k) π(C = c | γ, k) = L(θ | y, c, γ, k) is the complete-data likelihood function for θ, which is equal to the sampling distribution given in (3.4), regarded as a function of the unknown parameters θ, and π(θ | k) = ∏_{j=1}^{k} π_G(θ_j | η_j) is the joint prior distribution for θ, with π_G(θ_j | η_j) being the probability density function of the prior distribution G(η_j), for j = 1, ..., k.

In order to sample from the joint posterior distribution in (3.7) and estimate the parameters of interest, we propose the Split-Merge Allocation Sampler (SMAS). This is an MCMC algorithm that makes use of the following three move types:

(a) update the parameters θ;
(b) update the allocation indicators c;
(c) split one cluster into two, or combine two clusters into one.

3.1.1 Updating the parameters. Conditional on the configuration c, we have k_c clusters and (k - k_c) empty components. The conditional posterior distribution for θ_j is given by

π(θ_j | y, c, k) ∝ L(θ_j | D_j) π_G(θ_j | η_j),     (3.8)

where

L(θ_j | D_j) = ∏_{D_j} f(y_i | θ_j), if D_j ≠ ∅;   L(θ_j | D_j) = 1, if D_j = ∅,     (3.9)

is the likelihood function for component j, for i = 1, ..., n and j = 1, ..., k. Thus, we update the component parameters according to Algorithm 1.

Algorithm 1: Let the state of the Markov chain consist of c = (c_1, ..., c_n) and θ = (θ_1, ..., θ_k). Conditional on the configuration c, update the component parameters θ as follows:

(1) For clusters j, j = 1, ..., k_c, generate θ_j from the posterior distribution given in (3.8);
(2) For empty components j, j = k_c + 1, ..., k, generate θ_j from the prior distribution, θ_j ∼ π_G(θ_j);

(3) Accept the updated values, θ^updated, only if the adjacency condition for the component means of the clusters is met, i.e., if µ_1^updated < ... < µ_{k_c}^updated. Otherwise, keep θ^updated = θ.

3.1.2 Updating the allocation indicators. Given the component parameters θ, the conditional posterior probability for C_i = j is given by

π(c_i = j | y_i, θ_j, c_{-i}, γ) = [(n_{j,-i} + γ) / (n + kγ - 1)] f(y_i | θ_j),     (3.10)

for j = 1, ..., k_{c_{-i}} and i = 1, ..., n. We now need to define the probability of an observation y_i being allocated to an empty component, i.e., of creating a new cluster. In order that the probability of creating a new cluster does not depend on the parameter value θ_j, which is generated from the prior distribution, we integrate the parameters out in this case. Thus, the conditional posterior probability for C_i = j is

π(c_i = j | y_i, c_{-i}, γ, η_j) = [γ / (n + kγ - 1)] ∫ f(y_i | θ_j) π_G(θ_j | η_j) dθ_j,     (3.11)

for j ∈ {k_{c_{-i}} + 1, ..., k} and i = 1, ..., n.

Let P_i = (P_{c_{-i}}, P_{0i}) be the vector of allocation probabilities for the i-th observation, where P_{c_{-i}} = (P_1, ..., P_{k_{c_{-i}}}) with P_j given in (3.10) and P_{0i} = (P_{k_{c_{-i}}+1}, ..., P_k) with P_j given in (3.11), for j = 1, ..., k_{c_{-i}}, j = k_{c_{-i}} + 1, ..., k and i = 1, ..., n. Thus, we update the latent indicator variables according to Algorithm 2.

Algorithm 2: Let the state of the Markov chain consist of c = (c_1, ..., c_n) and θ = (θ_1, ..., θ_k). Conditional on θ, update c = (c_1, ..., c_n) as follows. For i = 1, ..., n:

(1) remove c_i from the current state c, obtaining c_{-i} and k_{c_{-i}};
(2) calculate the vector of allocation probabilities P_i;
(3) generate an auxiliary variable Z_i = (Z_{i1}, ..., Z_{ik}) ∼ Multinomial(1, P_i);
(4) if Z_{ij} = 1, for j ∈ {1, ..., k_{c_{-i}}}, set c_i = j and do n_j = n_{j,-i} + 1;
(5) if Z_{ij} = 1, for j ∈ {k_{c_{-i}} + 1, ..., k}, do n_j = 1 and k_c = k_{c_{-i}} + 1. Generate a value for the component parameter θ_j of the new cluster from the posterior distribution π(θ_j | y_i). Relabel the k_c clusters in order to maintain the adjacency condition. If the component mean µ_{j*} of the new cluster is such that:

(a) µ_{j*} = min_{1 ≤ j ≤ k_c} µ_j, then set j* = 1 and relabel all the other clusters by adding 1 to their labels;
(b) µ_{j*} = max_{1 ≤ j ≤ k_c} µ_j, then set j* = k_c and keep the labels of all the other clusters;
(c) µ_j < µ_{j*} < µ_{j+1}, for some j ∈ {1, ..., k_c - 1}, then set j* = j + 1 and relabel all the other clusters with label greater than j by adding 1 to their labels.

At this point, the estimation procedure could be carried out by iterating between Algorithms 1 and 2. However, Algorithm 2 may be inefficient in situations where clusters have close means. This happens because the algorithm updates only one latent indicator variable at a time. Consequently, we may have a poor exploration of the observation clusters and the algorithm may become trapped in local modes. This is the reason why we insert a split-merge step within the MCMC algorithm in order to increase the mixing over k_c.

3.2 Split-merge movements

Let Φ = (θ, c, k_c) be the current state of the MCMC algorithm and Φ^sp = (θ^sp, c^sp, k_c^sp) and Φ^me = (θ^me, c^me, k_c^me) be the proposal states obtained by split and merge movements, respectively. As the dimension of the parametric space of θ = (θ_1, ..., θ_k) is fixed, the acceptance probability for a split or a merge movement is given by the Metropolis-Hastings acceptance probability (Chib and Greenberg, 1995), i.e., Ψ[Φ* | Φ] = min(1, A*), where

A* = [P(y | c*, θ*, k) π(c* | γ, k) π(θ* | η, k) q[Φ | Φ*]] / [P(y | c, θ, k) π(c | γ, k) π(θ | η, k) q[Φ* | Φ]],     (3.12)

where * means either a split or a merge, q[· | ·] is the transition proposal, which is obtained by a split or a merge depending on the type of proposal, P(y | c, θ, k) π(c | γ, k) is the complete-data likelihood function, which is equal to the sampling distribution given in (3.4), regarded as a function of the unknown parameters θ, and π(θ | k) = ∏_{j=1}^{k} π_G(θ_j | η_j) is the joint prior distribution for θ.

3.2.1 Split movement. Let P_{sp|k_c} and P_{me|k_c} be the probabilities of proposing a split and a merge, respectively, with P_{sp|k_c} + P_{me|k_c} = 1. These probabilities depend on k_c because if k_c = 1 we can only propose a split, P_{sp|k_c} = 1; on the other hand, if k_c = k we can only propose a merge, P_{me|k_c} = 1. For 2 ≤ k_c ≤ (k - 1) we use P_{sp|k_c} = P_{me|k_c} = 1/2.
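A small helper encoding these move-type probabilities might look as follows (illustrative only; the function name and interface are ours).

```python
import random

def move_type_probabilities(k_c, k):
    """Probabilities (P_sp, P_me) of proposing a split or a merge given k_c
    clusters out of k components, following Section 3.2.1."""
    if k_c <= 1:
        return 1.0, 0.0    # only a split is possible
    if k_c >= k:
        return 0.0, 1.0    # only a merge is possible
    return 0.5, 0.5        # otherwise choose between the two uniformly

p_sp, p_me = move_type_probabilities(k_c=4, k=10)
move = "split" if random.random() < p_sp else "merge"
```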

Provided we choose a split, let C_2 be the number of clusters with n_j ≥ 2. Select a component D_j, with n_j ≥ 2, with probability P_{j|C_2} = 1/C_2, and propose a split of the observations y_i ∈ D_j into two new sets D_{j_1} and D_{j_2} as follows:

(i) let y_{h_1} and y_{h_2} be the minimum and the maximum values of the set D_j, respectively;
(ii) set D_{j_1} = {y_{h_1}}, D_{j_2} = {y_{h_2}} and n_{j_1} = n_{j_2} = 1;
(iii) for i = 1, ..., n do the following:
(a) if c_i = j and y_i ∉ {y_{h_1}, y_{h_2}}, then allocate y_i to D_{j_1} with probability

P_{j_1}(y_i) = n_{j_1} ∫ f(y_i | θ_{j_1}) π(θ_{j_1} | D_{j_1}) dθ_{j_1} / [n_{j_1} ∫ f(y_i | θ_{j_1}) π(θ_{j_1} | D_{j_1}) dθ_{j_1} + n_{j_2} ∫ f(y_i | θ_{j_2}) π(θ_{j_2} | D_{j_2}) dθ_{j_2}],

where π(θ_m | D_m) is the posterior distribution for θ_m given D_m, for m = j_1, j_2;
(b) generate an indicator variable I_i ∼ Bernoulli(P_{j_1}(y_i)). If I_i = 1, then y_i ∈ D_{j_1}: do D_{j_1} = D_{j_1} ∪ {y_i}, n_{j_1} = n_{j_1} + 1 and c_i^sp = j_1. Otherwise, do D_{j_2} = D_{j_2} ∪ {y_i}, n_{j_2} = n_{j_2} + 1 and c_i^sp = j_2;
(c) if c_i = j and y_i = y_{h_1}, then set P_{j_1}(y_i) = 1 and c_i^sp = j_1. If c_i = j and y_i = y_{h_2}, then set P_{j_1}(y_i) = 0 and c_i^sp = j_2.

We fix the new labels as j_1 = j, j_2 = j + 1, and for all other labels j' > j we do j' = j' + 1, for j' ∈ {1, ..., k}. Thus, we have a new configuration c^sp = (c_1^sp, ..., c_n^sp) with k_c^sp = k_c + 1 clusters, where

(i) for all c_i ∈ c such that c_i = j, do c_i^sp = j_1 or c_i^sp = j_2 according to step (iii) above;
(ii) for all c_i ∈ c such that c_i = j' for j' < j, do c_i^sp = c_i;
(iii) for all c_i ∈ c such that c_i = j' for j' > j, do c_i^sp = c_i + 1;

for i = 1, ..., n, j ∈ {1, ..., k_c} and j' ∈ {1, ..., k} \ {j_1, j_2}. The probability of the configuration D_{j_1} and D_{j_2} is

P_alloc = ∏_{y_i ∈ D_{j_1}} P_{j_1}(y_i) ∏_{y_i ∈ D_{j_2}} (1 - P_{j_1}(y_i)).

Conditional on D_{j_1} and D_{j_2}, generate candidate values θ_{j_1}^sp and θ_{j_2}^sp for the parameters θ_{j_1} and θ_{j_2} from the posterior distributions π(θ_{j_1} | D_{j_1}) and π(θ_{j_2} | D_{j_2}), respectively. Thus we have a new vector of parameters θ^sp = (θ_1^sp, ..., θ_{k_c}^sp, θ_{k_c+1}^sp, ..., θ_k^sp), where

(i) for all θ_{j'}^sp ∈ θ^sp such that j' < j_1, do θ_{j'}^sp = θ_{j'};

(ii) for all θ_{j'}^sp ∈ θ^sp such that j' > j_2, do θ_{j'}^sp = θ_{j'-1};

for θ_{j'} ∈ θ and j' ∈ {1, ..., k} \ {j_1, j_2}.

Now we must check whether the adjacency condition is met in the split proposal, i.e., whether the component means of the clusters are in increasing numerical order, µ_{j_1-1}^sp < µ_{j_1}^sp < µ_{j_2}^sp < µ_{j_2+1}^sp. In the case where it is not, the proposal is rejected because the movement may not be reversible by the merge proposal. If the adjacency condition is met, we have a new configuration c^sp and a new set of parameters θ^sp with k_c^sp = k_c + 1 clusters. This transition proposal is denoted by Φ^sp | Φ and its probability is given by

q[Φ^sp | Φ] = P_{sp|k_c} P_{j|C_2} P_alloc π(θ_{j_1} | D_{j_1}) π(θ_{j_2} | D_{j_2}),     (3.13)

where π(θ_m | D_m) is the posterior density for θ_m, for m = j_1, j_2.

3.2.2 Merge movement. We treat the merge as the reverse of the split movement. This movement is initialized by choosing the sets D_{j_1} and D_{j_2}. We establish a criterion which merges clusters that are adjacent with respect to the current values of their means. This is due to the adjacency condition assumed in the split movement. Thus, the probability of selecting D_{j_1} and D_{j_2} for a merge is

P_{j_1,j_2} = P_{j_1} P_{j_2|j_1} + P_{j_2} P_{j_1|j_2} = 1, if k_c = 2;   = 3/(2k_c), if k_c > 2 and j_1 = 1 or j_2 = k_c;   = 1/k_c, if k_c > 2 and j_1 ≠ 1 and j_2 ≠ k_c;

where P_{b_1} is the probability of choosing cluster b_1 and P_{b_2|b_1} is the conditional probability of choosing cluster b_2 given the previous choice of b_1.

After selecting the clusters D_{j_1} and D_{j_2}, we join them into a single cluster D_j, i.e., we set D_j = D_{j_1} ∪ D_{j_2}. We fix the new label as j = j_1 and for all other labels j' with j' ≥ j_2 we do j' = j' - 1. Thus, we have a new configuration c^me = (c_1^me, ..., c_n^me) with k_c^me = k_c - 1 clusters, where

(a) for all c_i ∈ c such that c_i = j' and j' ≤ j_1, do c_i^me = c_i;
(b) for all c_i ∈ c such that c_i = j' and j' ≥ j_2, do c_i^me = c_i - 1;

for i = 1, ..., n, j' = 1, ..., k_c and j ∈ {1, ..., k_c - 1}.

Conditional on D_j, generate the candidate value θ_j^me for the parameter θ_j from the posterior distribution π(θ_j | D_j). This determines a new vector of parameters θ^me = (θ_1^me, ..., θ_{k_c-1}^me, θ_{k_c}^me, ..., θ_k^me), where

(i) for all θ_{j'}^me ∈ θ^me such that j' < j_1, do θ_{j'}^me = θ_{j'};
(ii) for all θ_{j'}^me ∈ θ^me such that j' ≥ j_2, do θ_{j'}^me = θ_{j'+1}, for θ_{j'} ∈ θ and j' ∈ {1, ..., k - 1} \ {j_1, j_2};

(iii) in order to complete θ^me, generate θ_k^me from the prior distribution, θ_k ∼ π_G(θ_k).

Here we also must check whether the adjacency condition is met, i.e., µ_{j-1}^me < µ_j^me < µ_{j+1}^me. In the case where it is not, the proposal is rejected because the movement may not be reversible by the split proposal. If the adjacency condition is met, the merge proposal determines the new configuration c^me and a new vector of parameters θ^me with k_c^me = k_c - 1 clusters. This transition proposal is denoted by Φ^me | Φ and its probability is given by

q[Φ^me | Φ] = P_{me|k_c} P_{j_1,j_2} π(θ_j | D_j) π_G(θ_k | η_k).     (3.14)

Having defined the transition probabilities for the split and merge movements, it is important to note that, given the current state Φ, the probability of proposing a split of the cluster D_j into D_{j_1} and D_{j_2} (i.e., the split state Φ^sp) is equivalent to being in the state with D_{j_1} and D_{j_2} merged into D_j (i.e., the merge state Φ^me) and proposing the move back to the current state Φ. In terms of transition probabilities this means that

q[Φ^sp | Φ] = q[Φ | Φ^me] = P_{sp|k_c^me} P_{j|C_2^me} P_alloc π(θ_{j_1} | D_{j_1}) π(θ_{j_2} | D_{j_2}).     (3.15)

Besides, it is also important to note that in the merge proposal there is only one way to merge the observations from two components into one component. Thus, we need to calculate the corresponding probability, P_alloc, of generating the current split state from the proposed merge state. This is done in the same way as in the split proposal, but now considering the current split state as known. Analogously to (3.15), we have that

q[Φ^me | Φ] = q[Φ | Φ^sp] = P_{me|k_c^sp} P_{j_1,j_2} π(θ_j | D_j) π_G(θ_k | η_k).     (3.16)

3.2.3 Acceptance probability. From equation (3.12), the acceptance probability for a split movement is Ψ[Φ^sp | Φ] = min(1, A^sp), where

A^sp = [P(y | c^sp, θ^sp, k) π(c^sp | γ, k) π(θ^sp | k) q[Φ | Φ^sp]] / [P(y | c, θ, k) π(c | γ, k) π(θ | k) q[Φ^sp | Φ]].

The ratio of the joint probabilities of y conditional on the latent indicator variables and component parameters is given by

P(y | c^sp, θ^sp, k) / P(y | c, θ, k) = L(θ_{j_1}^sp | D_{j_1}) L(θ_{j_2}^sp | D_{j_2}) / L(θ_j | D_j),     (3.17)

where L(θ_m | D_m) is given in (3.9), for m = j, j_1, j_2.

From (3.3), the ratio of the joint prior probabilities of the latent indicator variables is

π(c^sp | γ, k) / π(c | γ, k) = Γ(n_{j_1} + γ) Γ(n_{j_2} + γ) / [Γ(n_j + γ) Γ(γ)].     (3.18)

The ratio of the prior densities of the component parameters is

π(θ^sp | k) / π(θ | k) = ∏_{j=1}^{k} π_G(θ_j^sp | η_j) / ∏_{j=1}^{k} π_G(θ_j | η_j) = π_G(θ_{j_1}^sp | η_{j_1}) π_G(θ_{j_2}^sp | η_{j_2}) / [π_G(θ_j | η_j) π_G(θ_k | η_k)].     (3.19)

From (3.13) and (3.16), the transition probability ratio for the split proposal is

q[Φ | Φ^sp] / q[Φ^sp | Φ] = [P_{me|k_c^sp} P_{j_1,j_2} π(θ_j | D_j) π_G(θ_k | η_k)] / [P_{sp|k_c} P_{j|C_2} P_alloc π(θ_{j_1}^sp | D_{j_1}) π(θ_{j_2}^sp | D_{j_2})] = Q^sp P_r / P_alloc,     (3.20)

where

Q^sp = P_{me|k_c^sp} P_{j_1,j_2} / (P_{sp|k_c} P_{j|C_2}) =
  C_2/2,                                 if k_c = 1;
  (1/2)^{1-I_c(k-1)} 3C_2/(k_c + 1),     if k_c ∈ K_1;
  (1/2)^{1-I_c(k-1)} 2C_2/(k_c + 1),     if k_c ∈ K_2;

with I_c(k-1) = 1 if k_c = k - 1 and I_c(k-1) = 0 otherwise, K_1 = {2 ≤ k_c ≤ k - 1 and j_1 = 1 or j_2 = k_c^sp}, K_2 = {2 ≤ k_c ≤ k - 1 and j_1 ≠ 1 and j_2 ≠ k_c^sp}, and P_r is the ratio of posterior densities, given by

P_r = π(θ_j | D_j) π_G(θ_k) / [π(θ_{j_1} | D_{j_1}) π(θ_{j_2} | D_{j_2})]
    = [L(θ_j | D_j) / (L(θ_{j_1} | D_{j_1}) L(θ_{j_2} | D_{j_2}))] × [π_G(θ_j | η_j) π_G(θ_k | η_k) / (π_G(θ_{j_1} | η_{j_1}) π_G(θ_{j_2} | η_{j_2}))] × [I(D_{j_1}) I(D_{j_2}) / I(D_j)],

where I(D_m) = ∫ L(θ_m | D_m) π_G(θ_m | η_m) dθ_m is the normalizing constant, for m = j, j_1, j_2.

Multiplying (3.17), (3.18), (3.19) and (3.20), a split is accepted with probability Ψ[Φ^sp | Φ] = min(1, A^sp), for

A^sp = [I(D_{j_1}) I(D_{j_2}) / I(D_j)] × [Γ(n_{j_1} + γ) Γ(n_{j_2} + γ) / (Γ(n_j + γ) Γ(γ))] × Q^sp / P_alloc.

Similarly, the acceptance probability for a merge is Ψ[Φ^me | Φ] = min(1, A^me), with A^me = 1/A^sp.
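As an illustration (not code from the paper), A^sp can be evaluated on the log scale once the marginal likelihood I(D) of a component is available; log_marginal below is a placeholder for the component-specific normalizing constant, whose exact form for the Gaussian model of Section 4 is given in Section S-1 of the SM.

```python
from scipy.special import gammaln

def log_split_acceptance(log_marginal, D, D1, D2, gamma, log_Q_sp, log_P_alloc):
    """log A^sp: the marginal-likelihood ratio times the allocation-prior ratio,
    times Q^sp / P_alloc; log_marginal(data) must return log I(data)."""
    n, n1, n2 = len(D), len(D1), len(D2)
    log_ratio_I = log_marginal(D1) + log_marginal(D2) - log_marginal(D)
    log_ratio_c = (gammaln(n1 + gamma) + gammaln(n2 + gamma)
                   - gammaln(n + gamma) - gammaln(gamma))
    return log_ratio_I + log_ratio_c + log_Q_sp - log_P_alloc

# accept the split if log(u) < min(0, log A^sp) for u ~ Uniform(0, 1)
```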

3.2.4 Split-merge allocation sampler. The split-merge procedure is now described as an algorithm, denominated the Split-Merge Allocation Sampler (SMAS).

SMAS Algorithm: Initialize with a configuration c^(0). For the l-th iteration of the algorithm do:

(i) update the component parameters θ using Algorithm 1;
(ii) update the indicator variables c using Algorithm 2;
(iii) choose between a split or a merge with probabilities P_{sp|k_c} and P_{me|k_c};
(iv) accept the proposal with probability Ψ[Φ* | Φ], where * is either sp or me;
(a) if a split proposal is accepted, do k_c^(l) = k_c^(l-1) + 1;
(b) if a merge proposal is accepted, do k_c^(l) = k_c^(l-1) - 1;
(c) otherwise, keep k_c^(l) = k_c^(l-1).

In order to estimate the number of clusters, we discard the first B iterations as burn-in and calculate N(k_c = j), the number of times that k_c = j in the remaining L - B iterations, for j ∈ {1, ..., k}. Let P̂(k_c = j) = N(k_c = j)/(L - B) be the estimated posterior probability of k_c = j, for j = 1, ..., k. Then k̂_c = argmax_{1 ≤ j ≤ k} P̂(k_c = j) is the estimate of the number of clusters.

Conditional on k̂_c, consider

(i) L_{k̂_c} = ∑_{l=B+1}^{L} I_{k_c^(l)}(k̂_c) to be the number of iterations for which k_c = k̂_c in the L - B iterations, where I_{k_c^(l)}(k̂_c) = 1 if in the l-th iteration k_c^(l) = k̂_c and I_{k_c^(l)}(k̂_c) = 0 otherwise;
(ii) N_{ij} = ∑_{l=B+1}^{L} I_{c_i^(l)}(j) I_{k_c^(l)}(k̂_c) to be the number of times that observation y_i is allocated to component j in the L_{k̂_c} iterations, where I_{c_i^(l)}(j) = 1 if in the l-th iteration c_i = j and I_{c_i^(l)}(j) = 0 otherwise, for i = 1, ..., n and j = 1, ..., k.

Let P̂(c_i = j) = N_{ij}/L_{k̂_c} be the posterior probability that the observation y_i belongs to cluster j, for j = 1, ..., k̂_c. If P̂(c_i = j) = max_{1 ≤ j' ≤ k̂_c} P̂(c_i = j'), then we consider that y_i belongs to cluster j, for i = 1, ..., n and j = 1, ..., k̂_c.
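A brief sketch of this post-processing step (ours; it assumes the sampler has stored, after burn-in and thinning, the draws k_draws of k_c and a draws-by-n matrix c_draws of allocation vectors with labels 1, ..., k_c):

```python
import numpy as np

def summarize_smas(k_draws, c_draws):
    """Estimate k_c by its posterior mode and, conditional on it, the cluster
    membership probabilities and MAP cluster of each observation."""
    k_draws = np.asarray(k_draws)
    values, counts = np.unique(k_draws, return_counts=True)
    post_k = counts / counts.sum()            # estimated P(k_c = j)
    k_hat = int(values[np.argmax(post_k)])    # posterior mode of k_c

    c_kept = np.asarray(c_draws)[k_draws == k_hat]    # iterations with k_c = k_hat
    labels = np.arange(1, k_hat + 1)
    probs = np.stack([(c_kept == j).mean(axis=0) for j in labels], axis=1)
    z_hat = labels[np.argmax(probs, axis=1)]          # MAP cluster of each y_i
    return k_hat, dict(zip(values.tolist(), post_k)), probs, z_hat
```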

We estimate the parameters θ_j, j = 1, ..., k̂_c, by the average of the generated values, i.e.,

θ̂_j = (1/L_{k̂_c}) ∑_{l=B+1}^{L} θ_j^(l) I_{k_c^(l)}(k̂_c).

4 Data analysis

We illustrate the performance of the proposed method using simulated data and three real data sets. Following Richardson and Green (1997), we model these datasets by assuming a univariate normal mixture model. In model (2.2), f(y_i | θ_j) is the density of a normal distribution with mean µ_j and variance σ_j², and θ_j = (µ_j, σ_j²), for j = 1, ..., k. In order to exploit full conjugacy we consider the following prior distributions for the component parameters θ_j = (µ_j, σ_j²),

µ_j | σ_j², µ_0, λ ∼ N(µ_0, σ_j²/λ)   and   σ_j^{-2} | α, β ∼ Γ(α, β),

where µ_0, λ, α and β are hyperparameters, N(µ_0, σ_j²/λ) represents the normal distribution with mean µ_0 and variance σ_j²/λ, and Γ(α, β) represents the gamma distribution with shape parameter α and rate parameter β. The parametrization of the gamma distribution is such that the mean is α/β and the variance is α/β². These prior distributions are also used by Casella et al. (2000), Nobile and Fearnside (2007) and Saraiva et al. (2016).

Since it may be unrealistic to assume the availability of strong prior information regarding the component parameters θ_j in practice, we specify the hyperparameter values according to the guidelines of Richardson and Green (1997) and Saraiva et al. (2016). Thus, we set µ_0 = ε, α = 2 and β = 0.2/10R², where ε is the midpoint of the observed variation interval of the data and R is the length of this interval. We also fix the hyperparameter λ = 10^{-2} in order to get a prior distribution for the component means with large variance. The normalizing constant I(D_j) that appears in the acceptance probability of the split-merge move is presented in Section S-1 of the SM.
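Under this conjugate specification the marginal likelihood of the observations in one component has a closed form; the sketch below is ours, written from the standard normal-gamma result (the paper's exact expression is in Section S-1 of the SM), and could play the role of the log_marginal placeholder in the earlier acceptance-ratio sketch, e.g. with the hyperparameters fixed via functools.partial. The commented hyperparameter line reflects one possible reading of the values quoted above.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_normal_gamma(y, mu0, lam, alpha, beta):
    """log I(D) for one Gaussian component under the conjugate prior
    mu | sigma^2 ~ N(mu0, sigma^2 / lam) and sigma^{-2} ~ Gamma(alpha, rate=beta)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    ybar = y.mean() if n > 0 else 0.0
    ss = ((y - ybar) ** 2).sum()
    lam_n = lam + n
    alpha_n = alpha + 0.5 * n
    beta_n = beta + 0.5 * ss + 0.5 * lam * n * (ybar - mu0) ** 2 / lam_n
    return (gammaln(alpha_n) - gammaln(alpha)
            + alpha * np.log(beta) - alpha_n * np.log(beta_n)
            + 0.5 * (np.log(lam) - np.log(lam_n))
            - 0.5 * n * np.log(2.0 * np.pi))

# Possible hyperparameter choices from the data range (eps = midpoint, R = length):
# mu0, lam, alpha = eps, 1e-2, 2.0; beta = (0.2 / 10) * R**2   # one reading of 0.2/10R^2
```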

4.1 Simulated data sets

Consider the population model given in (2.2) with k = 10 components. To simulate the data sets, we set the number of clusters k_c and the component parameters of the k_c clusters as specified in Table 2. In setup A_1 the two clusters have equal variances and weights, while A_2 has different variances and weights. In A_3 the three clusters have equal variances and similar weights, and A_4 has different variances and weights. In A_5 we consider four clusters with the same weights and two clusters with higher variances in relation to the other two, and in A_6 we consider five clusters with different weights and one cluster with large variance in relation to the others.

Table 2. Number of clusters and parameter values used for simulating the datasets.

Dataset   Number of clusters   Parameter values
A_1       k_true = 2           µ_1 = 0, µ_2 = 3; σ_1 = 1, σ_2 = 1; w_1 = 0.50, w_2 = 0.50
A_2       k_true = 2           µ_1 = 0, µ_2 = 4; σ_1 = 1, σ_2 = 2; w_1 = 0.70, w_2 = 0.30
A_3       k_true = 3           µ_1 = -3, µ_2 = 0, µ_3 = 3; σ_1 = 1, σ_2 = 1, σ_3 = 1; w_1 = 0.30, w_2 = 0.40, w_3 = 0.30
A_4       k_true = 3           µ_1 = -5, µ_2 = 0, µ_3 = 7; σ_1 = 1, σ_2 = 2, σ_3 = 3; w_1 = 0.20, w_2 = 0.30, w_3 = 0.50
A_5       k_true = 4           µ_1 = -4, µ_2 = 0, µ_3 = 5, µ_4 = 10; σ_1 = 1, σ_2 = 2, σ_3 = 1, σ_4 = 2; w_1 = w_2 = w_3 = w_4 = 0.25
A_6       k_true = 5           µ_1 = -6, µ_2 = 0, µ_3 = 7, µ_4 = 15, µ_5 = 21; σ_1 = 1, σ_2 = 2, σ_3 = 3, σ_4 = 2, σ_5 = 1; w_1 = 0.15, w_2 = 0.20, w_3 = 0.30, w_4 = 0.20, w_5 = 0.15

The procedure for generating the data sets is as follows.

(i) For i = 1, ..., n, generate U_i ∼ U(0, 1); if ∑_{j'=1}^{j-1} w_{j'} < u_i ≤ ∑_{j'=1}^{j} w_{j'}, generate Y_i ∼ N(µ_j, σ_j²), with the fixed parameter values given in Table 2, for w_0 = 0 and j = 1, ..., k_c.
(ii) In order to record from which component each observation was generated, we define G = (G_1, ..., G_n) such that G_i = j if Y_i ∼ N(µ_j, σ_j²), for i = 1, ..., n and j = 1, ..., k_c.

In the remainder of the paper we simulate datasets of size n = 500; samples of size n = 1,000 are also simulated and the results are presented in Section S-3 of the SM.
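The generation procedure in items (i)-(ii) above can be sketched in Python as follows (illustrative; the example parameter values follow our reading of Table 2).

```python
import numpy as np

def generate_dataset(n, mu, sigma, w, seed=None):
    """Generate (y, G): a uniform draw is compared with the cumulative weights to
    pick the component j, then Y_i ~ N(mu_j, sigma_j^2), as in Section 4.1."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    cum_w = np.concatenate(([0.0], np.cumsum(w)))    # w_0 = 0, then partial sums
    u = rng.uniform(size=n)
    G = np.clip(np.searchsorted(cum_w, u, side="left"), 1, len(w))  # labels 1..k_c
    y = rng.normal(np.asarray(mu)[G - 1], np.asarray(sigma)[G - 1])
    return y, G

# Example in the spirit of setup A_2 (two clusters, unequal weights and variances)
y, G = generate_dataset(500, mu=[0.0, 4.0], sigma=[1.0, 2.0], w=[0.7, 0.3], seed=42)
```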

Figure 1 shows the values generated by cluster for datasets A_1 to A_6 with n = 500.

[Figure 1. Values generated by cluster for datasets A_1 to A_6 with n = 500; panels (a)-(f) show datasets A_1-A_6.]

We apply the proposed SMAS algorithm with L = 110,000 iterations and a burn-in of B = 10,000. We also thin the chain, keeping one draw out of every 20, which yields a sequence of 5,000 cases. These values were enough for reliable results.

For all values γ = r with r ∈ R_1, the SMAS places the maximum posterior probability on the true value of k_c. These results are presented in Section S-2 of the SM. For γ = r with r ∈ R_2, the SMAS tends to overestimate the number of clusters as the value of r increases. These results are presented in Figure 2, which shows the posterior probability of the true value of k_c and the k_c values with maximum posterior probability for each γ = r ∈ R_2. As one can note, for large values of γ the method overestimates the number of clusters. For datasets A_1, A_2 and A_4, the maximum posterior probability on the true value of k_c is obtained only for γ ∈ {0.1, 0.2}, while for datasets A_3 and A_5 it is obtained for γ ∈ {0.1, 0.2, 0.3} and for dataset A_6 for γ ∈ {0.1, 0.2, 0.3, 0.4}. Besides, for all simulated cases, the posterior probability on the true value of k_c goes to zero as γ approaches 2.

Motivated by this sensitivity to the choice of γ, we develop a second approach that places a gamma prior distribution on γ, γ ∼ Γ(a, b), a, b > 0.

[Figure 2. Posterior probability of the true value of k_c and k_c values with maximum posterior probability for n = 500, plotted against γ; panels (a)-(f) show datasets A_1-A_6. One plotting symbol represents the posterior probability of the true k_c value and the other the k_c value with maximum posterior probability.]

In this approach, rather than the value of γ, it is the choice of the hyperparameters a and b that influences the posterior probability of k_c. To verify the sensitivity with respect to the choice of the hyperparameter values we consider:

(i) a = b = 1, obtaining E(γ) = 1 and Var(γ) = 1. This prior represents the belief that the weights have a uniform distribution over the simplex;
(ii) a = 2 and b = 4, obtaining E(γ) = 0.5 and Var(γ) = 0.125. This prior was suggested by Escobar and West (1995) in the context of the Dirichlet process mixture model.

In order to simulate values from the conditional posterior distribution of γ, π(γ | c, y, θ, k), we implement a random walk Metropolis (RWM) algorithm. We consider the candidate-generating density q[γ* | γ], where γ* is a candidate value, with uniform distribution centered on the current value of γ, i.e., U(γ - ɛ, γ + ɛ). We set ɛ = 0.05. The candidate value γ* is accepted with probability Ψ(γ* | γ) = min(1, A_γ), where

A_γ = [π(c | γ*, k) π(γ*)] / [π(c | γ, k) π(γ)],

since the proposal kernels in the numerator and denominator cancel; π(c | γ, k) is given in (3.3) and π(γ) is the density of the gamma prior for γ. This algorithm is implemented as follows.

RWM algorithm: Let the state of the Markov chain be c = (c_1, ..., c_n), θ = (θ_1, ..., θ_k) and γ. Conditional on c and θ, update γ at the l-th iteration of the algorithm as follows.

(i) Generate γ* ∼ U(γ - ɛ, γ + ɛ);
(ii) Calculate Ψ(γ* | γ) = min(1, A_γ);
(iii) Generate u ∼ U(0, 1). If u ≤ Ψ(γ* | γ), accept γ* by setting γ^(l) = γ*. Otherwise, set γ^(l) = γ^(l-1).

Estimate γ by the average of the simulated values, i.e., γ̂ = (1/L_{k̂_c}) ∑_{l=B+1}^{L} γ^(l) I_{k_c^(l)}(k̂_c).
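A Python sketch of this random-walk Metropolis update follows (ours, under the stated assumptions; log_prior_c implements (3.3), the gamma prior is parametrized by shape a and rate b, and the rejection of non-positive proposals is our choice).

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import gamma as gamma_dist

def log_prior_c(counts, k, gam):
    """log pi(c | gamma, k) from eq. (3.3); counts are the occupation numbers n_j
    of all k components (zeros included for empty components)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    return (gammaln(k * gam) - k * gammaln(gam) - gammaln(n + k * gam)
            + gammaln(counts + gam).sum())

def rwm_update_gamma(gam, counts, k, a, b, eps=0.05, rng=None):
    """One random-walk Metropolis step for gamma with a U(gam - eps, gam + eps)
    proposal and a Gamma(a, rate=b) prior, as described in Section 4.1."""
    rng = np.random.default_rng(rng)
    gam_new = rng.uniform(gam - eps, gam + eps)
    if gam_new <= 0:                                  # outside the support: reject
        return gam
    log_A = (log_prior_c(counts, k, gam_new) - log_prior_c(counts, k, gam)
             + gamma_dist.logpdf(gam_new, a, scale=1.0 / b)
             - gamma_dist.logpdf(gam, a, scale=1.0 / b))
    return gam_new if np.log(rng.uniform()) < min(0.0, log_A) else gam
```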

Tables 3 and 4 show the estimates and the 95% credibility intervals for γ and the estimated posterior probabilities of k_c for datasets A_1 to A_6, for γ ∼ Γ(1, 1) and γ ∼ Γ(2, 4), respectively. All estimates for γ lead to the maximum posterior probability on the true value of k_c (highlighted in bold). The prior Γ(1, 1) on γ has led to higher posterior probabilities on k_c than the prior distribution Γ(2, 4). For each dataset the credibility intervals contain most of the values r ∈ G that lead to maximum posterior probability on the true value of k_c. For instance, for datasets A_1 and A_2, all values r that lead to maximum posterior probability on the true value of k_c, except r = 0.2, belong to the estimated credibility intervals.

Table 3. Estimates (95% credibility intervals) for γ and estimated posterior probabilities for k_c, with γ ∼ Γ(1, 1).

Data set   k_c true   Estimate (95% CI) for γ   Probabilities for k_c values
A_1        2          (0.0051, 0.118)
A_2        2          (0.0048, 0.12)
A_3        3          (0.0148, 0.188)
A_4        3          (0.0151, )
A_5        4          (0.028, 0.072)
A_6        5          (0.045, 0.471)

Table 4. Estimates (95% credibility intervals) for γ and estimated posterior probabilities for k_c, with γ ∼ Γ(2, 4).

Data set   k_c true   Estimate (95% CI) for γ   Probabilities for k_c values
A_1        2          (0.011, )
A_2        2          (0.0124, 0.16)
A_3        3          (0.02, )
A_4        3          (0.024, 0.257)
A_5        4          (0.0414, 0.4)
A_6        5          (0.0601, 0.51)

Performance of the SMAS. We now verify empirically the convergence of the sequence of posterior probabilities for k_c across iterations, the capacity to move between different values of k_c in the course of the iterations, and the estimated autocorrelation function (acf). We present the performance of the SMAS using the results obtained with γ = γ̂, where γ̂ is the estimated value of γ given in Table 3. For other values of γ (the estimates given in Table 4 and the values fixed as r for r ∈ G) with maximum posterior probability on the true value of k_c the results are similar.

Figure 3 shows the plots of the P̂(k_c) estimates across the iterations. To maintain a good visualization we display in each graphic only the three highest P̂(k_c) estimates. The number of iterations and the burn-in seem to be adequate to achieve stability of the posterior probabilities of k_c.

[Figure 3. A posteriori probability for k_c across iterations for the thinned sequence; panels (a)-(f) show datasets A_1-A_6.]

Figure 4 shows the sampled k_c values in the course of the iterations. The algorithm mixes well over k_c and remains, in most iterations, around the target true values of k_c. The sampled k_c values do not present significant autocorrelation, as shown by Figure 5.

Table 5 shows the percentage of split and merge moves accepted and the percentage of γ values accepted. These percentages show that split and merge moves are proposed and accepted in a balanced way. The high percentage of accepted values for γ is due to the way we implement the RWM, assuming a small value ε = 0.05.

Table 5. Percentages of split and merge movements and of γ values accepted, for data sets A_1 to A_6.

[Figure 4. Sampled k_c values for the thinned sequence; panels (a)-(f) show datasets A_1-A_6.]

[Figure 5. Estimated autocorrelation of the sampled k_c values for the thinned sequence; panels (a)-(f) show datasets A_1-A_6.]

Figure 2 in Section S-2.2 of the SM shows the simulated values and the clusters identified by the method. The clusters were satisfactorily identified. Section S-2.2 of the SM also presents the estimates for the components of each cluster (Tables 1 and 2) and the histogram of the observed data together with the estimated density function (Figure 3).

4.2 Galaxy data set

We now apply the proposed methodology to the well-known galaxy dataset, previously analyzed by Roeder and Wasserman (1997), Escobar and West (1995), Richardson and Green (1997) and Stephens (2000), among others. The data set refers to velocities (in 10³ km/s) of distant galaxies diverging from our own. The sample size is n = 82 observations. For more details on this dataset see Escobar and West (1995) and Richardson and Green (1997).

We consider the galaxy velocities as realizations from a mixture of k normal distributions and the Bayesian model in (3.1) with k = 10. We use the same hyperparameter specification used for the artificial datasets. To estimate the hyperparameter γ we consider the hierarchical approach, setting γ ∼ Γ(a, b) with a = b = 1. The number of iterations, burn-in and thinning value are the same as used in the previous analyses. Considering the SMAS algorithm, the estimate and the 95% credibility interval for γ are and (0.0182, 0.266), respectively. The estimated posterior probabilities for k_c, 1 ≤ k_c ≤ 10, are presented in Table 6. The maximum posterior probability is at k_c = 3. For the sake of comparison, estimates of k for this data set range from 3 to 4 for Roeder and Wasserman (1997) and Stephens (2000), from 5 to 7 for Richardson and Green (1997), and just 7 for Escobar and West (1995).

Table 6. Estimated posterior probabilities P̂(k_c), k_c = 1, ..., 10, for the galaxy data set, with γ ∼ Γ(1, 1).

Figure 6 shows the performance of the SMAS. Similarly to the simulation results, the SMAS sampler mixes well over k_c and shows satisfactory stability of the probabilities of k_c. The sampled k_c values do not present significant autocorrelation. The acceptance ratios for the split and merge moves are 1.818% and 1.464%, respectively. The acceptance ratio for the generated γ values is %. Figure 7(a) shows the clusters identified conditional on the estimate k̂_c = 3. Figure 7(b) shows the histogram of the observed data and the estimated density function. As noted by Roeder and Wasserman (1997) and Stephens


More information

Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US

Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US Gerdie Everaert 1, Lorenzo Pozzi 2, and Ruben Schoonackers 3 1 Ghent University & SHERPPA 2 Erasmus

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

7. Estimation and hypothesis testing. Objective. Recommended reading

7. Estimation and hypothesis testing. Objective. Recommended reading 7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing

More information

Bayesian Nonparametrics: Dirichlet Process

Bayesian Nonparametrics: Dirichlet Process Bayesian Nonparametrics: Dirichlet Process Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL http://www.gatsby.ucl.ac.uk/~ywteh/teaching/npbayes2012 Dirichlet Process Cornerstone of modern Bayesian

More information

Monte Carlo in Bayesian Statistics

Monte Carlo in Bayesian Statistics Monte Carlo in Bayesian Statistics Matthew Thomas SAMBa - University of Bath m.l.thomas@bath.ac.uk December 4, 2014 Matthew Thomas (SAMBa) Monte Carlo in Bayesian Statistics December 4, 2014 1 / 16 Overview

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Index. Pagenumbersfollowedbyf indicate figures; pagenumbersfollowedbyt indicate tables.

Index. Pagenumbersfollowedbyf indicate figures; pagenumbersfollowedbyt indicate tables. Index Pagenumbersfollowedbyf indicate figures; pagenumbersfollowedbyt indicate tables. Adaptive rejection metropolis sampling (ARMS), 98 Adaptive shrinkage, 132 Advanced Photo System (APS), 255 Aggregation

More information

A Simple Solution to Bayesian Mixture Labeling

A Simple Solution to Bayesian Mixture Labeling Communications in Statistics - Simulation and Computation ISSN: 0361-0918 (Print) 1532-4141 (Online) Journal homepage: http://www.tandfonline.com/loi/lssp20 A Simple Solution to Bayesian Mixture Labeling

More information

Model Selection in Bayesian Survival Analysis for a Multi-country Cluster Randomized Trial

Model Selection in Bayesian Survival Analysis for a Multi-country Cluster Randomized Trial Model Selection in Bayesian Survival Analysis for a Multi-country Cluster Randomized Trial Jin Kyung Park International Vaccine Institute Min Woo Chae Seoul National University R. Leon Ochiai International

More information

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA Intro: Course Outline and Brief Intro to Marina Vannucci Rice University, USA PASI-CIMAT 04/28-30/2010 Marina Vannucci

More information

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison AN IMPROVED MERGE-SPLIT SAMPLER FOR CONJUGATE DIRICHLET PROCESS MIXTURE MODELS David B. Dahl dbdahl@stat.wisc.edu Department of Statistics, and Department of Biostatistics & Medical Informatics University

More information

Using Model Selection and Prior Specification to Improve Regime-switching Asset Simulations

Using Model Selection and Prior Specification to Improve Regime-switching Asset Simulations Using Model Selection and Prior Specification to Improve Regime-switching Asset Simulations Brian M. Hartman, PhD ASA Assistant Professor of Actuarial Science University of Connecticut BYU Statistics Department

More information

BAYESIAN MODEL CRITICISM

BAYESIAN MODEL CRITICISM Monte via Chib s BAYESIAN MODEL CRITICM Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model

More information

A generalization of the Multiple-try Metropolis algorithm for Bayesian estimation and model selection

A generalization of the Multiple-try Metropolis algorithm for Bayesian estimation and model selection A generalization of the Multiple-try Metropolis algorithm for Bayesian estimation and model selection Silvia Pandolfi Francesco Bartolucci Nial Friel University of Perugia, IT University of Perugia, IT

More information

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida Bayesian Statistical Methods Jeff Gill Department of Political Science, University of Florida 234 Anderson Hall, PO Box 117325, Gainesville, FL 32611-7325 Voice: 352-392-0262x272, Fax: 352-392-8127, Email:

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Introduction to Markov Chain Monte Carlo & Gibbs Sampling

Introduction to Markov Chain Monte Carlo & Gibbs Sampling Introduction to Markov Chain Monte Carlo & Gibbs Sampling Prof. Nicholas Zabaras Sibley School of Mechanical and Aerospace Engineering 101 Frank H. T. Rhodes Hall Ithaca, NY 14853-3801 Email: zabaras@cornell.edu

More information

The Bayesian Choice. Christian P. Robert. From Decision-Theoretic Foundations to Computational Implementation. Second Edition.

The Bayesian Choice. Christian P. Robert. From Decision-Theoretic Foundations to Computational Implementation. Second Edition. Christian P. Robert The Bayesian Choice From Decision-Theoretic Foundations to Computational Implementation Second Edition With 23 Illustrations ^Springer" Contents Preface to the Second Edition Preface

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Integrated Non-Factorized Variational Inference

Integrated Non-Factorized Variational Inference Integrated Non-Factorized Variational Inference Shaobo Han, Xuejun Liao and Lawrence Carin Duke University February 27, 2014 S. Han et al. Integrated Non-Factorized Variational Inference February 27, 2014

More information

Markov Chain Monte Carlo in Practice

Markov Chain Monte Carlo in Practice Markov Chain Monte Carlo in Practice Edited by W.R. Gilks Medical Research Council Biostatistics Unit Cambridge UK S. Richardson French National Institute for Health and Medical Research Vilejuif France

More information

Bayesian Mixture Modeling Approach. to Account for Heterogeneity in Speed Data

Bayesian Mixture Modeling Approach. to Account for Heterogeneity in Speed Data Par, Zhang, and Lord Bayesian Mixture Modeling Approach to Account for Heterogeneity in Speed Data Byung-Jung Par Graduate Assistant Researcher Zachry Department of Civil Engineering Texas A&M University,

More information

A note on Reversible Jump Markov Chain Monte Carlo

A note on Reversible Jump Markov Chain Monte Carlo A note on Reversible Jump Markov Chain Monte Carlo Hedibert Freitas Lopes Graduate School of Business The University of Chicago 5807 South Woodlawn Avenue Chicago, Illinois 60637 February, 1st 2006 1 Introduction

More information

Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox Richard A. Norton, J.

Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox Richard A. Norton, J. Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox fox@physics.otago.ac.nz Richard A. Norton, J. Andrés Christen Topics... Backstory (?) Sampling in linear-gaussian hierarchical

More information

INTRODUCTION TO BAYESIAN STATISTICS

INTRODUCTION TO BAYESIAN STATISTICS INTRODUCTION TO BAYESIAN STATISTICS Sarat C. Dass Department of Statistics & Probability Department of Computer Science & Engineering Michigan State University TOPICS The Bayesian Framework Different Types

More information

Text Mining for Economics and Finance Latent Dirichlet Allocation

Text Mining for Economics and Finance Latent Dirichlet Allocation Text Mining for Economics and Finance Latent Dirichlet Allocation Stephen Hansen Text Mining Lecture 5 1 / 45 Introduction Recall we are interested in mixed-membership modeling, but that the plsi model

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Variable selection for model-based clustering

Variable selection for model-based clustering Variable selection for model-based clustering Matthieu Marbac (Ensai - Crest) Joint works with: M. Sedki (Univ. Paris-sud) and V. Vandewalle (Univ. Lille 2) The problem Objective: Estimation of a partition

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Topic Modelling and Latent Dirichlet Allocation

Topic Modelling and Latent Dirichlet Allocation Topic Modelling and Latent Dirichlet Allocation Stephen Clark (with thanks to Mark Gales for some of the slides) Lent 2013 Machine Learning for Language Processing: Lecture 7 MPhil in Advanced Computer

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information

Bayesian nonparametrics

Bayesian nonparametrics Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability

More information

On Bayesian Computation

On Bayesian Computation On Bayesian Computation Michael I. Jordan with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang Previous Work: Information Constraints on Inference Minimize the minimax risk under constraints

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture)

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 18 : Advanced topics in MCMC Lecturer: Eric P. Xing Scribes: Jessica Chemali, Seungwhan Moon 1 Gibbs Sampling (Continued from the last lecture)

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model

Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model David B Dahl 1, Qianxing Mo 2 and Marina Vannucci 3 1 Texas A&M University, US 2 Memorial Sloan-Kettering

More information

Overall Objective Priors

Overall Objective Priors Overall Objective Priors Jim Berger, Jose Bernardo and Dongchu Sun Duke University, University of Valencia and University of Missouri Recent advances in statistical inference: theory and case studies University

More information

BIOINFORMATICS. Gilles Guillot

BIOINFORMATICS. Gilles Guillot BIOINFORMATICS Vol. 00 no. 00 2008 Pages 1 19 Supplementary material for: Inference of structure in subdivided populations at low levels of genetic differentiation. The correlated allele frequencies model

More information

Gaussian kernel GARCH models

Gaussian kernel GARCH models Gaussian kernel GARCH models Xibin (Bill) Zhang and Maxwell L. King Department of Econometrics and Business Statistics Faculty of Business and Economics 7 June 2013 Motivation A regression model is often

More information

Clustering bi-partite networks using collapsed latent block models

Clustering bi-partite networks using collapsed latent block models Clustering bi-partite networks using collapsed latent block models Jason Wyse, Nial Friel & Pierre Latouche Insight at UCD Laboratoire SAMM, Université Paris 1 Mail: jason.wyse@ucd.ie Insight Latent Space

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Bayesian Analysis of Mixture Models with an Unknown Number of Components an alternative to reversible jump methods

Bayesian Analysis of Mixture Models with an Unknown Number of Components an alternative to reversible jump methods Bayesian Analysis of Mixture Models with an Unknown Number of Components an alternative to reversible jump methods Matthew Stephens Λ Department of Statistics University of Oxford Submitted to Annals of

More information

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals

More information

Label Switching and Its Simple Solutions for Frequentist Mixture Models

Label Switching and Its Simple Solutions for Frequentist Mixture Models Label Switching and Its Simple Solutions for Frequentist Mixture Models Weixin Yao Department of Statistics, Kansas State University, Manhattan, Kansas 66506, U.S.A. wxyao@ksu.edu Abstract The label switching

More information

Machine Learning Summer School

Machine Learning Summer School Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,

More information

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1 Parameter Estimation William H. Jefferys University of Texas at Austin bill@bayesrules.net Parameter Estimation 7/26/05 1 Elements of Inference Inference problems contain two indispensable elements: Data

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Lecture 4: Dynamic models

Lecture 4: Dynamic models linear s Lecture 4: s Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes hlopes@chicagobooth.edu

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

Bayesian model selection for computer model validation via mixture model estimation

Bayesian model selection for computer model validation via mixture model estimation Bayesian model selection for computer model validation via mixture model estimation Kaniav Kamary ATER, CNAM Joint work with É. Parent, P. Barbillon, M. Keller and N. Bousquet Outline Computer model validation

More information

Bayesian non-parametric model to longitudinally predict churn

Bayesian non-parametric model to longitudinally predict churn Bayesian non-parametric model to longitudinally predict churn Bruno Scarpa Università di Padova Conference of European Statistics Stakeholders Methodologists, Producers and Users of European Statistics

More information

Hierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31

Hierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31 Hierarchical models Dr. Jarad Niemi Iowa State University August 31, 2017 Jarad Niemi (Iowa State) Hierarchical models August 31, 2017 1 / 31 Normal hierarchical model Let Y ig N(θ g, σ 2 ) for i = 1,...,

More information

Bayesian semiparametric modeling and inference with mixtures of symmetric distributions

Bayesian semiparametric modeling and inference with mixtures of symmetric distributions Bayesian semiparametric modeling and inference with mixtures of symmetric distributions Athanasios Kottas 1 and Gilbert W. Fellingham 2 1 Department of Applied Mathematics and Statistics, University of

More information

Bayesian time series classification

Bayesian time series classification Bayesian time series classification Peter Sykacek Department of Engineering Science University of Oxford Oxford, OX 3PJ, UK psyk@robots.ox.ac.uk Stephen Roberts Department of Engineering Science University

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 18-16th March Arnaud Doucet

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 18-16th March Arnaud Doucet Stat 535 C - Statistical Computing & Monte Carlo Methods Lecture 18-16th March 2006 Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Trans-dimensional Markov chain Monte Carlo. Bayesian model for autoregressions.

More information

PROGRAMA INTERINSTITUCIONAL DE PÓS-GRADUAÇÃO EM ESTATÍSTICA UFSCar-USP. DEs-UFSCar e SME-ICMC-USP

PROGRAMA INTERINSTITUCIONAL DE PÓS-GRADUAÇÃO EM ESTATÍSTICA UFSCar-USP. DEs-UFSCar e SME-ICMC-USP ISSN 0104-0499 PROGRAMA INTERINSTITUCIONAL DE PÓS-GRADUAÇÃO EM ESTATÍSTICA UFSCar-USP DEs-UFSCar e SME-ICMC-USP BAYESIAN ESTIMATION FOR MIXTURE OF SIMPLEX DISTRIBUTION WITH UNKNOWN NUMBER OF COMPONENTS:

More information

Bayesian Modelling and Inference on Mixtures of Distributions Modelli e inferenza bayesiana per misture di distribuzioni

Bayesian Modelling and Inference on Mixtures of Distributions Modelli e inferenza bayesiana per misture di distribuzioni Bayesian Modelling and Inference on Mixtures of Distributions Modelli e inferenza bayesiana per misture di distribuzioni Jean-Michel Marin CEREMADE Université Paris Dauphine Kerrie L. Mengersen QUT Brisbane

More information

ComputationalToolsforComparing AsymmetricGARCHModelsviaBayes Factors. RicardoS.Ehlers

ComputationalToolsforComparing AsymmetricGARCHModelsviaBayes Factors. RicardoS.Ehlers ComputationalToolsforComparing AsymmetricGARCHModelsviaBayes Factors RicardoS.Ehlers Laboratório de Estatística e Geoinformação- UFPR http://leg.ufpr.br/ ehlers ehlers@leg.ufpr.br II Workshop on Statistical

More information