
Submitted to the Brazilian Journal of Probability and Statistics

A Bayesian sparse finite mixture model for clustering data from a heterogeneous population

Erlandson F. Saraiva^a, Adriano K. Suzuki^b and Luís A. Milan^c
a Instituto de Matemática, Universidade Federal de Mato Grosso do Sul
b Departamento de Matemática Aplicada e Estatística, Universidade de São Paulo
c Departamento de Estatística, Universidade Federal de São Carlos

Abstract. In this paper we introduce a Bayesian approach for clustering data using a sparse finite mixture model (SFMM). The SFMM is a finite mixture model with a large, previously fixed number of components k, in which many components may be empty. In this model, the number of components k can be interpreted as the maximum number of distinct mixture components. We then explore a prior distribution for the weights of the mixture model that takes into account the possibility that the number of clusters k_c (i.e., non-empty components) is random and smaller than the number of components k of the finite mixture model. In order to determine clusters we develop an MCMC algorithm called the Split-Merge allocation sampler. In this algorithm the split-merge strategy is data-driven and is inserted within the algorithm in order to improve the mixing of the Markov chain with respect to the number of clusters. The performance of the method is verified using simulated datasets and three real datasets. The first real data set is the benchmark galaxy data, while the second and third are the publicly available Enzyme and Acidity data sets, respectively.

MSC 2010 subject classifications: Primary 62xx; secondary 62Mxx
Keywords and phrases. Mixture model, Bayesian approach, Gibbs sampling, Metropolis-Hastings, Split-merge update.

1 Introduction

The main goal of clustering analysis is to group the observed data into homogeneous clusters. The first works on clustering are due to Sneath (1957) and Sokal and Michener (1958). Since the publication of these articles the clustering problem has been studied by many authors, such as Hartigan and Wong (1979), Binder (1978), Anderson (1985), Banfield and Raftery (1993), Bensmail et al. (1997) and Witten and Tibshirani (2010). According to Bouveyron and Brunet (2013), more and more scientific fields require clustering data with the aim of understanding the phenomenon of interest. Usual clustering approaches are distance-based, such as k-means (MacQueen, 1967) and hierarchical clustering (Ward, 1963).

Although simple and visually appealing, these methods can be implemented only for cases where the number of clusters is known. Besides, according to Oh and Raftery (2007), these methods are not based on standard principles of statistical inference and they do not provide a statistically based method for choosing the number of clusters.

Under a probabilistic approach, McLachlan and Basford (1988), Banfield and Raftery (1993), McLachlan and Peel (2000) and Fraley and Raftery (2002) proposed the use of a finite mixture model in which each component of the mixture represents a cluster. In these methods, partitions are determined by the EM (expectation-maximization) algorithm and the number of components of the mixture is, in general, determined by comparing fitted models with different numbers of components using some model selection criterion, such as AIC (Akaike, 1974; Bozdogan, 1987) or BIC (Schwarz, 1978). A similar strategy is adopted in the Bayesian approach, considering the DIC (Spiegelhalter et al., 2002) as model selection criterion; see for example Celeux et al. (2000). This can be seen as a drawback to be overcome, since in practice it may be very tedious to fit several models and afterwards compare them according to a model selection criterion. Also, in many cases the estimation depends on iterative methods which may not converge, imposing additional difficulties to the process. Thus, a practical and efficient computational algorithm to estimate the number of clusters jointly with the component-specific parameters is desirable. Under this scenario, the Bayesian approach has been successful, in particular due to the reversible-jump Markov chain Monte Carlo algorithm proposed by Richardson and Green (1997) in the context of Gaussian mixture models. However, one difficulty frequently encountered when implementing a reversible-jump algorithm is the construction of efficient transition proposals that lead to a reasonable acceptance rate.

In this paper we consider a sparse finite mixture model for clustering data. This model is a finite mixture model with k components, previously fixed as a large number, where many components can be empty. The large value assumed for k can be interpreted as the maximum number of distinct mixture components. The term sparse refers to the existence of many empty components. The main motivation to consider this model is the fact that a finite mixture model is a population model; given an observed sample y, not all k components may have observations in the sample and we may have empty components. In addition, assuming this model, we avoid the need to use the reversible-jump MCMC method. This allows us to implement the split-merge movements using the observed data instead of using a transition function as in an RJMCMC algorithm.

Thus, we assume that the number of clusters k_c (i.e., the number of non-empty components) is an unknown quantity, but smaller than k. To estimate k_c jointly with the other parameters of interest we consider a Bayesian approach. In this approach, we set up a symmetric Dirichlet prior distribution with hyperparameter γ on the weights of the mixture. Then, using the Dirichlet integral, we integrate out the weights and derive the prior and posterior allocation probabilities.

In order to simulate from the joint posterior distribution of the parameters of interest we develop an MCMC algorithm, the Split-Merge Allocation Sampler (SMAS). This algorithm consists of three steps. In the first step the parameters of the non-empty components (i.e., clusters) are updated from their conditional posterior distributions and the parameters of the empty components are updated from the prior distribution. In the second step a Gibbs sampling is performed to update the latent indicator variables using the posterior allocation probabilities. The third step consists of a split-merge move that changes the number of clusters by ±1. This step was inserted within the algorithm in order to allow a major change in the configuration of the latent variables in a single iteration of the algorithm, consequently increasing the mixing of the Markov chain with respect to the number of clusters. The split and merge movements are developed in a way that makes them reversible. In a split move each observation is allocated to one of two new partitions based on probabilities which are calculated according to the marginal likelihood function of the observations previously allocated to the two new partitions. Given the proposed allocation of observations, the candidate values for the component parameters are generated from the conditional posterior distributions. Choosing the posterior density as the candidate-generating density, the likelihood ratio and the ratio of prior densities in the Metropolis-Hastings acceptance probability cancel the corresponding terms in the posterior density, simplifying the computation of the acceptance probability. Thus, the proposed algorithm performs a standard Metropolis-Hastings update with an acceptance probability which depends only on the data associated with the component(s) selected for a split or a merge. Our approach does not require the specification of transition functions to estimate the number of clusters jointly with the component-specific parameters, and when a new cluster is created through a split or a merge movement it determines a new partition of the data set. This is due to the way we implement the split-merge movements, which is data-driven instead of parameter-based.

In order to verify the performance of the SMAS we develop a simulation study considering data sets generated with different numbers of clusters. We also verify its sensitivity to the choice of the value of the hyperparameter γ of the Dirichlet prior distribution. To do this we explore two scenarios. In the first we set γ = r, where r is a value of a grid G. In the second we consider one more hierarchical level, with a prior distribution on γ, and estimate it from its posterior distribution. For each artificial dataset we present the performance of the SMAS in terms of posterior probability for the number of clusters, convergence, mixing and autocorrelation. We also apply the methodology to three real data sets. The first is the benchmark data on velocities of distant galaxies diverging from our own, previously analyzed by Roeder and Wasserman (1997), Escobar and West (1995), Richardson and Green (1997), Stephens (2000) and Saraiva et al. (2014), available in the R software. The second and the third are the Enzyme and Acidity datasets extracted from the website /mapjg/mixdata.

The remainder of the paper is structured as follows. In Section 2 we describe the sparse finite mixture model for clustering data. In Section 3 we introduce the hierarchical Bayesian approach and develop the SMAS algorithm. In Section 4 the proposed sampler is applied to simulated data sets to assess its performance and to real data sets to illustrate its use. Section 5 concludes the paper with final remarks. Additional details are provided in the supplementary material, denoted by the prefix SM when referred to in this paper. Table 1 presents the main notation used throughout the article.

Table 1. Mathematical notation.

Notation                         Description
k                                Number of components
k_c                              Number of clusters
θ_j                              Parameter of the j-th component, for j = 1, ..., k
θ = (θ_1, ..., θ_k)              The whole vector of parameters
w_j                              Weight of the j-th component, for j = 1, ..., k
Y_i                              The i-th sampled value, for i = 1, ..., n
c_i                              The i-th indicator variable, for i = 1, ..., n
y = (y_1, ..., y_n)              The vector of independent observations
c = (c_1, ..., c_n)              The vector of latent indicator variables
Φ = (θ, c, k_c)                  Current state
Φ^sp = (θ^sp, c^sp, k_c^sp)      Split state
Φ^me = (θ^me, c^me, k_c^me)      Merge state

2 Mixture model for clustering

Consider a population composed of k subpopulations, such that the sampling units are homogeneous with respect to the characteristic under study within each subpopulation and heterogeneous among the subpopulations. Let w_1, ..., w_k be the relative frequencies of each subpopulation in relation to the overall population, with 0 ≤ w_j ≤ 1 and ∑_{j=1}^{k} w_j = 1. Assume that each subpopulation j is modeled by a probability distribution F(θ_j) indexed by a parameter θ_j (scalar or vector), for j = 1, ..., k.

Suppose that the sampling process consists of choosing a subpopulation j with probability w_j and then sampling a value Y_i from this subpopulation, for j = 1, ..., k and i = 1, ..., n, where n is the sample size. Then we can represent each sample unit by the pair (Y_i, c_i), where c_i is an indicator variable that assumes a value in the set {1, ..., k} with probabilities {w_1, ..., w_k}, respectively. Thus, we have that

(Y_i | c_i = j, θ_j) ∼ F(θ_j)   and   P(C_i = j | w) = w_j,     (2.1)

where w = (w_1, ..., w_k), for i = 1, ..., n and j = 1, ..., k. In many practical problems, such as clustering problems, the indicator variables are non-observable (also known as latent variables). The probability of the i-th observation coming from subpopulation j is w_j, for i = 1, ..., n and j = 1, ..., k. The marginal probability density function for Y_i = y_i is given by

f(y_i | θ, w) = ∑_{j=1}^{k} w_j f(y_i | θ_j),     (2.2)

where f(y_i | θ_j) is the probability density function of F(θ_j), θ = (θ_1, ..., θ_k) is the whole vector of parameters and w = (w_1, ..., w_k) are the weights.

Model (2.2) is known in the literature as a finite mixture model. In this model each probability density function, f(· | θ_j), corresponds to a component of the mixture. The finite mixture model is a natural probabilistic approach for clustering. However, as model (2.2) is a population model, given an observed sample y = (y_1, ..., y_n) not all k components may have observations in the sample and we may have empty components, i.e., fewer than k components are represented in the sample (Fruhwirth-Schnatter, 2017). In this case, the number of clusters (i.e., non-empty components) is smaller than the number of components k. Walli et al. (2016) and Fruhwirth-Schnatter (2017) call this model a sparse finite mixture model.
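For illustration (this sketch is ours, not part of the paper), the generative process in (2.1)-(2.2) can be simulated as follows, assuming Gaussian components F(θ_j) = N(µ_j, σ_j²); the returned labels play the role of the latent indicator variables c_i (0-based here).

```python
import numpy as np

def sample_finite_mixture(n, w, mu, sigma, seed=None):
    """Draw n observations from a finite Gaussian mixture.

    w, mu, sigma are length-k arrays of weights, component means and standard
    deviations; returns (y, c), where c holds the latent component labels.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                       # enforce sum-to-one
    c = rng.choice(len(w), size=n, p=w)   # P(C_i = j | w) = w_j
    y = rng.normal(np.asarray(mu)[c], np.asarray(sigma)[c])
    return y, c

# Example: two well-separated Gaussian clusters
y, c = sample_finite_mixture(500, w=[0.5, 0.5], mu=[0.0, 3.0], sigma=[1.0, 1.0], seed=1)
```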

Our interest is to make inference about the number of clusters k_c and the latent variables c jointly with the component parameters. Because of this, we consider c as parameters to be estimated in model (2.2).

2.1 Joint distribution for complete data

Let (y, c) be the complete data, where y = (y_1, ..., y_n) is the vector of independent observations and c = (c_1, ..., c_n) is the vector of latent indicator variables, with y and c being paired. Let k_c ≤ k be the number of clusters defined by the configuration c. From model (2.1), the joint probability of the latent indicator variables c is given by

π(C = c | w, k) = ∏_{j=1}^{k} w_j^{n_j},     (2.3)

where n_j is the number of observations y_i allocated to subpopulation j, for j = 1, ..., k. The distribution of the number of observations assigned to each component, n_1, ..., n_k, called the occupation numbers, is multinomial,

(n_1, ..., n_k | n, w) ∼ Multinomial(n, w),     (2.4)

where n = n_1 + ... + n_k.

Let D_j = {y_i ; c_i = j} be the set of observations allocated to component j, for j = 1, ..., k. Without loss of generality, assume that clusters are labelled from 1 to k_c and that the empty components are labelled from k_c + 1 to k. Thus, the joint distribution of the complete data (y, c) conditional on the mixing proportions w, component parameters θ and number of components k is

P(Y = y, C = c | w, θ, k) = P(Y = y | c, w, θ, k) π(C = c | w, k)     (2.5)
    = ∏_{j=1}^{k} ∏_{i=1}^{n} [w_j f(y_i | θ_j)]^{I_{c_i}(j)} = ∏_{j=1}^{k_c} w_j^{n_j} ∏_{D_j} f(y_i | θ_j),

where I_{c_i}(j) = 1 if c_i = j and I_{c_i}(j) = 0 otherwise, for i = 1, ..., n and j = 1, ..., k.

At this point some remarks about label switching are necessary.

Note that the cluster labels j = 1, ..., k_c are not uniquely determined and a permutation of the labels would lead to the same model. Since our interest lies in inferences on partitions, the non-identifiability of the labels would cause a problem in the posterior computation and the allocation probabilities would be useless for partitioning the observations (Stephens, 2000). Therefore we impose restrictions on the component means of the clusters to get identifiability, i.e., we assume that µ_1, ..., µ_{k_c} are the component means of the clusters and that µ_1 < ... < µ_{k_c}. However, this does not prevent the MCMC algorithm described in the next section from being applicable under other labellings. For further discussion and additional references about label switching see Stephens (2000), Jasra et al. (2005) and their references.

3 Bayesian approach

In order to estimate c and the number of clusters k_c jointly with the component parameters θ, we consider the following hierarchical Bayesian model

Y_i | c_i = j, θ, k ∼ F(θ_j),     (3.1)
c_i | w, k ∼ Discrete(w_1, ..., w_k),
θ_j ∼ G(η_j),
w | γ, k ∼ Dirichlet(γ),

for i = 1, ..., n, where G(η_j) is the prior distribution for the component parameters θ_j, η_j (scalar or vector) are the hyperparameters, for j = 1, ..., k, and Dirichlet(γ) represents the symmetric Dirichlet distribution with parameter γ, γ > 0, and probability density function given by

π(w | γ, k) = [Γ(kγ) / Γ(γ)^k] ∏_{j=1}^{k} w_j^{γ-1}.     (3.2)

The choice of the prior distribution G(η_j) for the parameter θ_j depends on the probability distribution F(θ_j) assumed for Y_i, for i = 1, ..., n and j = 1, ..., k. In Section 4 we discuss the choice of G(η_j) in the context in which F(θ_j) is a Gaussian distribution with mean µ_j and variance σ_j², i.e., θ_j = (µ_j, σ_j²), for j = 1, ..., k.

It is important to note that, depending on the value fixed for the hyperparameter γ, the Dirichlet sampling may generate some weights w_j ≈ 0, and consequently the multinomial sampling given in (2.4) leads to partitions with n_j = 0 for some j ∈ {1, ..., k}. According to Walli et al. (2016) and Fruhwirth-Schnatter (2017) this happens when we consider a small value for the hyperparameter γ.
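As a small numerical illustration of this sparsity effect (ours, not from the paper), the snippet below draws symmetric Dirichlet weights and the multinomial occupation numbers of (2.4) for a few values of γ; for small γ most of the k components typically receive no observations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 10

for gamma in [0.01, 0.5, 2.0]:
    w = rng.dirichlet(np.full(k, gamma))   # symmetric Dirichlet(gamma) weights
    n_j = rng.multinomial(n, w)            # occupation numbers, eq. (2.4)
    k_c = int(np.sum(n_j > 0))             # non-empty components, i.e. clusters
    print(f"gamma = {gamma}: k_c = {k_c}, n_j = {n_j}")
```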

In order to verify the sensitivity of the results to the choice of the value of the hyperparameter γ we explore two scenarios. First we set γ = r, where r is a value of the set G = {R_1, R_2}, in which R_1 is a grid from 0.01 to 0.09 with a fixed increment step equal to 0.01 and R_2 is a grid from 0.1 to 2 with a fixed increment step equal to 0.1. In the second scenario we consider one more hierarchical level with a gamma prior distribution on γ, γ ∼ Γ(a, b), for a, b > 0. The parametrization of the gamma distribution is such that the expected value is a/b and the variance is a/b².

Since our main interest lies in the configuration c, we integrate out the mixing proportions. Taking the product of equations (2.3) and (3.2) and integrating out the mixing proportions, we can write the joint probability of c given γ and k as

π(C = c | γ, k) = [Γ(kγ) / (Γ(γ)^k Γ(n + kγ))] ∏_{j=1}^{k} Γ(n_j + γ).     (3.3)

The joint probability of the complete data (y, c) given θ, γ and k is given by

P(Y = y, C = c | θ, γ, k) = P(Y = y | c, θ, k) π(C = c | γ, k) = [∏_{j=1}^{k_c} ∏_{D_j} f(y_i | θ_j)] π(C = c | γ, k),     (3.4)

where π(C = c | γ, k) is given in (3.3).

Using the Dirichlet integral and keeping all but a single indicator variable fixed, the conditional distribution of a single latent indicator variable c_i given all the others, denoted by c_{-i} = (c_1, ..., c_{i-1}, c_{i+1}, ..., c_n), is given by

π(c_i = j | c_{-i}, γ) = (n_{j,-i} + γ) / (n + kγ - 1),     (3.5)

where n_{j,-i} is the number of observations in D_j except the observation y_i, for i = 1, ..., n and j = 1, ..., k. From (3.5), the probability of observation y_i being assigned to one empty component is

π(c_i = j | c_{-i}, γ) = γ / (n + kγ - 1),     (3.6)

for j ∈ {k_{c_{-i}} + 1, ..., k}, where k_{c_{-i}} is the number of clusters defined by the configuration c_{-i}, for i = 1, ..., n. The expression in (3.6) is the probability of an observation y_i defining a new cluster, for i = 1, ..., n.
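A minimal Python sketch of the collapsed prior allocation probabilities (3.5)-(3.6) follows (an illustration, not code from the paper); for one observation it returns the prior probability of joining each existing cluster and the per-component probability of opening a new cluster.

```python
import numpy as np

def prior_allocation_probs(counts_minus_i, k, gamma):
    """Prior allocation probabilities for one observation, eqs. (3.5)-(3.6).

    counts_minus_i: occupation numbers of the current clusters with the i-th
    observation removed; k is the number of mixture components and gamma the
    symmetric Dirichlet hyperparameter.
    """
    counts_minus_i = np.asarray(counts_minus_i, dtype=float)
    n = counts_minus_i.sum() + 1                    # total sample size
    denom = n + k * gamma - 1
    p_existing = (counts_minus_i + gamma) / denom   # eq. (3.5)
    p_new_each = gamma / denom                      # eq. (3.6), per empty component
    k_empty = k - len(counts_minus_i)
    return p_existing, p_new_each, k_empty

p_exist, p_new, k_empty = prior_allocation_probs([120, 200, 179], k=10, gamma=0.05)
# p_exist.sum() + k_empty * p_new == 1, up to floating-point error
```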

3.1 Posterior distribution

Using Bayes' theorem, the joint posterior distribution upon which inference is based is given by

π(θ, c | y, γ, k) ∝ P(Y = y | c, θ, k) π(C = c | γ, k) π(θ | k),     (3.7)

where P(Y = y | c, θ, k) π(C = c | γ, k) = L(θ | y, c, γ, k) is the complete-data likelihood function for θ, which is equal to the sampling distribution given in (3.4), regarded as a function of the unknown parameters θ, and π(θ | k) = ∏_{j=1}^{k} π_G(θ_j | η_j) is the joint prior distribution for θ, with π_G(θ_j | η_j) being the probability density function of the prior distribution G(η_j), for j = 1, ..., k.

In order to sample from the joint posterior distribution in (3.7) and estimate the parameters of interest, we propose the Split-Merge Allocation Sampler (SMAS). This is an MCMC algorithm that makes use of the following three move types:

(a) update the parameters θ;
(b) update the allocation indicators c;
(c) split one cluster into two, or combine two clusters into one.

3.1.1 Updating the parameters. Conditional on the configuration c, we have k_c clusters and (k - k_c) empty components. The conditional posterior distribution for θ_j is given by

π(θ_j | y, c, k) ∝ L(θ_j | D_j) π_G(θ_j | η_j),     (3.8)

where

L(θ_j | D_j) = ∏_{D_j} f(y_i | θ_j), if D_j ≠ ∅;   L(θ_j | D_j) = 1, if D_j = ∅,     (3.9)

is the likelihood function for component j, for i = 1, ..., n and j = 1, ..., k. Thus, we update the component parameters according to Algorithm 1.

Algorithm 1: Let the state of the Markov chain consist of c = (c_1, ..., c_n) and θ = (θ_1, ..., θ_k). Conditional on the configuration c, update the component parameters θ as follows:

(1) For clusters j, j = 1, ..., k_c, generate θ_j from the posterior distribution given in (3.8);
(2) For empty components j, j = k_c + 1, ..., k, generate θ_j from the prior distribution, θ_j ∼ π_G(θ_j);

(3) Accept the updated values, θ^updated, only if the adjacency condition for the component means of the clusters is met, i.e., if µ_1^updated < ... < µ_{k_c}^updated. Otherwise, keep θ^updated = θ.

3.1.2 Updating the allocation indicators. Given the component parameters θ, the conditional posterior probability for C_i = j is given by

π(c_i = j | y_i, θ_j, c_{-i}, γ) = [(n_{j,-i} + γ) / (n + kγ - 1)] f(y_i | θ_j),     (3.10)

for j = 1, ..., k_{c_{-i}} and i = 1, ..., n. We now need to define the probability of an observation y_i being allocated to an empty component, i.e., of creating a new cluster. In order that the probability of creating a new cluster does not depend on the parameter value θ_j, which is generated from the prior distribution, we integrate the parameters out in this case. Thus, the conditional posterior probability for C_i = j is

π(c_i = j | y_i, c_{-i}, γ, η_j) = [γ / (n + kγ - 1)] ∫ f(y_i | θ_j) π_G(θ_j | η_j) dθ_j,     (3.11)

for j ∈ {k_{c_{-i}} + 1, ..., k} and i = 1, ..., n.

Let P_i = (P_{c_{-i}}, P_{0i}) be the vector of allocation probabilities for the i-th observation, where P_{c_{-i}} = (P_1, ..., P_{k_{c_{-i}}}) with P_j given in (3.10) and P_{0i} = (P_{k_{c_{-i}}+1}, ..., P_k) with P_j given in (3.11), for j = 1, ..., k_{c_{-i}}, j = k_{c_{-i}} + 1, ..., k and i = 1, ..., n. Thus, we update the latent indicator variables according to Algorithm 2.

Algorithm 2: Let the state of the Markov chain consist of c = (c_1, ..., c_n) and θ = (θ_1, ..., θ_k). Conditional on θ, update c = (c_1, ..., c_n) as follows. For i = 1, ..., n:

(1) remove c_i from the current state c, obtaining c_{-i} and k_{c_{-i}};
(2) calculate the vector of allocation probabilities P_i;
(3) generate an auxiliary variable Z_i = (Z_{i1}, ..., Z_{ik}) ∼ Multinomial(1, P_i);
(4) if Z_{ij} = 1, for j ∈ {1, ..., k_{c_{-i}}}, set c_i = j and do n_j = n_{j,-i} + 1;
(5) if Z_{ij} = 1, for j ∈ {k_{c_{-i}} + 1, ..., k}, do n_j = 1 and k_c = k_{c_{-i}} + 1. Generate a value for the component parameter θ_j of the new cluster from the posterior distribution π(θ_j | y_i). Relabel the k_c clusters in order to maintain the adjacency condition. If the component mean µ_{j*} of the new cluster is such that:

(a) µ_{j*} = min_{1 ≤ j ≤ k_c} µ_j, then set j* = 1 and relabel all the other clusters by adding 1 to their labels;
(b) µ_{j*} = max_{1 ≤ j ≤ k_c} µ_j, then set j* = k_c and keep the labels of all the other clusters;
(c) µ_j < µ_{j*} < µ_{j+1}, for some j ∈ {1, ..., k_c - 1}, then set j* = j + 1 and relabel all the other clusters with label greater than j by adding 1 to their labels.

At this point, the estimation procedure could be carried out by iterating between Algorithms 1 and 2. However, Algorithm 2 may be inefficient in situations where clusters have close means. This happens because the algorithm updates only one latent indicator variable at a time. Consequently, we may have a poor exploration of the observation clusters and the algorithm may become trapped in local modes. This is the reason why we insert a split-merge step within the MCMC algorithm in order to increase the mixing over k_c.

3.2 Split-merge movements

Let Φ = (θ, c, k_c) be the current state of the MCMC algorithm and Φ^sp = (θ^sp, c^sp, k_c^sp) and Φ^me = (θ^me, c^me, k_c^me) be the proposal states obtained by split and merge movements, respectively. As the dimension of the parametric space of θ = (θ_1, ..., θ_k) is fixed, the acceptance probability for a split or a merge movement is given by the Metropolis-Hastings acceptance probability (Chib and Greenberg, 1995), i.e., Ψ[Φ* | Φ] = min(1, A*), where

A* = [P(y | c*, θ*, k) π(c* | γ, k) π(θ* | η, k) q[Φ | Φ*]] / [P(y | c, θ, k) π(c | γ, k) π(θ | η, k) q[Φ* | Φ]],     (3.12)

where * means either a split or a merge, q[· | ·] is the transition proposal, which is obtained by a split or a merge depending on the type of proposal, P(y | c, θ, k) π(c | γ, k) is the complete-data likelihood function, which is equal to the sampling distribution given in (3.4), regarded as a function of the unknown parameters θ, and π(θ | k) = ∏_{j=1}^{k} π_G(θ_j | η_j) is the joint prior distribution for θ.

3.2.1 Split movement. Let P_{sp|k_c} and P_{me|k_c} be the probabilities of proposing a split and a merge, respectively, with P_{sp|k_c} + P_{me|k_c} = 1. These probabilities depend on k_c because if k_c = 1 we can only propose a split, P_{sp|k_c} = 1; on the other hand, if k_c = k we can only propose a merge, P_{me|k_c} = 1. For 2 ≤ k_c ≤ (k - 1) we use P_{sp|k_c} = P_{me|k_c} = 1/2.
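A small helper encoding these move-type probabilities might look as follows (illustrative only; the function name and interface are ours).

```python
import random

def move_type_probabilities(k_c, k):
    """Probabilities (P_sp, P_me) of proposing a split or a merge given k_c
    clusters out of k components, following Section 3.2.1."""
    if k_c <= 1:
        return 1.0, 0.0    # only a split is possible
    if k_c >= k:
        return 0.0, 1.0    # only a merge is possible
    return 0.5, 0.5        # otherwise choose between the two uniformly

p_sp, p_me = move_type_probabilities(k_c=4, k=10)
move = "split" if random.random() < p_sp else "merge"
```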

Provided we choose a split, let C_2 be the number of clusters with n_j ≥ 2. Select a component D_j, with n_j ≥ 2, with probability P_{j|C_2} = 1/C_2, and propose a split of the observations y_i ∈ D_j into two new sets D_{j_1} and D_{j_2} as follows:

(i) let y_{h_1} and y_{h_2} be the minimum and the maximum values of the set D_j, respectively;
(ii) set D_{j_1} = {y_{h_1}}, D_{j_2} = {y_{h_2}} and n_{j_1} = n_{j_2} = 1;
(iii) for i = 1, ..., n do the following:
(a) if c_i = j and y_i ∉ {y_{h_1}, y_{h_2}}, then allocate y_i to D_{j_1} with probability

P_{j_1}(y_i) = n_{j_1} ∫ f(y_i | θ_{j_1}) π(θ_{j_1} | D_{j_1}) dθ_{j_1} / [n_{j_1} ∫ f(y_i | θ_{j_1}) π(θ_{j_1} | D_{j_1}) dθ_{j_1} + n_{j_2} ∫ f(y_i | θ_{j_2}) π(θ_{j_2} | D_{j_2}) dθ_{j_2}],

where π(θ_m | D_m) is the posterior distribution for θ_m given D_m, for m = j_1, j_2;
(b) generate an indicator variable I_i ∼ Bernoulli(P_{j_1}(y_i)). If I_i = 1, then y_i ∈ D_{j_1}: do D_{j_1} = D_{j_1} ∪ {y_i}, n_{j_1} = n_{j_1} + 1 and c_i^sp = j_1. Otherwise, do D_{j_2} = D_{j_2} ∪ {y_i}, n_{j_2} = n_{j_2} + 1 and c_i^sp = j_2;
(c) if c_i = j and y_i = y_{h_1}, then set P_{j_1}(y_i) = 1 and c_i^sp = j_1. If c_i = j and y_i = y_{h_2}, then set P_{j_1}(y_i) = 0 and c_i^sp = j_2.

We fix the new labels as j_1 = j, j_2 = j + 1, and for all other labels j' > j we do j' = j' + 1, for j' ∈ {1, ..., k}. Thus, we have a new configuration c^sp = (c_1^sp, ..., c_n^sp) with k_c^sp = k_c + 1 clusters, where

(i) for all c_i ∈ c such that c_i = j, do c_i^sp = j_1 or c_i^sp = j_2 according to step (iii) above;
(ii) for all c_i ∈ c such that c_i = j' for j' < j, do c_i^sp = c_i;
(iii) for all c_i ∈ c such that c_i = j' for j' > j, do c_i^sp = c_i + 1;

for i = 1, ..., n, j ∈ {1, ..., k_c} and j' ∈ {1, ..., k} \ {j_1, j_2}. The probability of the configuration D_{j_1} and D_{j_2} is

P_alloc = ∏_{y_i ∈ D_{j_1}} P_{j_1}(y_i) ∏_{y_i ∈ D_{j_2}} (1 - P_{j_1}(y_i)).

Conditional on D_{j_1} and D_{j_2}, generate candidate values θ_{j_1}^sp and θ_{j_2}^sp for the parameters θ_{j_1} and θ_{j_2} from the posterior distributions π(θ_{j_1} | D_{j_1}) and π(θ_{j_2} | D_{j_2}), respectively. Thus we have a new vector of parameters θ^sp = (θ_1^sp, ..., θ_{k_c}^sp, θ_{k_c+1}^sp, ..., θ_k^sp), where

(i) for all θ_{j'}^sp ∈ θ^sp such that j' < j_1, do θ_{j'}^sp = θ_{j'};

(ii) for all θ_{j'}^sp ∈ θ^sp such that j' > j_2, do θ_{j'}^sp = θ_{j'-1};

for θ_{j'} ∈ θ and j' ∈ {1, ..., k} \ {j_1, j_2}.

Now we must check whether the adjacency condition is met in the split proposal, i.e., whether the component means of the clusters are in increasing numerical order, µ_{j_1-1}^sp < µ_{j_1}^sp < µ_{j_2}^sp < µ_{j_2+1}^sp. In the case where it is not, the proposal is rejected because the movement may not be reversible by the merge proposal. If the adjacency condition is met, we have a new configuration c^sp and a new set of parameters θ^sp with k_c^sp = k_c + 1 clusters. This transition proposal is denoted by Φ^sp | Φ and its probability is given by

q[Φ^sp | Φ] = P_{sp|k_c} P_{j|C_2} P_alloc π(θ_{j_1} | D_{j_1}) π(θ_{j_2} | D_{j_2}),     (3.13)

where π(θ_m | D_m) is the posterior density for θ_m, for m = j_1, j_2.

3.2.2 Merge movement. We treat the merge as the reverse of the split movement. This movement is initialized by choosing the sets D_{j_1} and D_{j_2}. We establish a criterion which merges clusters that are adjacent with respect to the current values of their means. This is due to the adjacency condition assumed in the split movement. Thus, the probability of selecting D_{j_1} and D_{j_2} for a merge is

P_{j_1,j_2} = P_{j_1} P_{j_2|j_1} + P_{j_2} P_{j_1|j_2} = 1, if k_c = 2;   = 3/(2k_c), if k_c > 2 and j_1 = 1 or j_2 = k_c;   = 1/k_c, if k_c > 2 and j_1 ≠ 1 and j_2 ≠ k_c;

where P_{b_1} is the probability of choosing cluster b_1 and P_{b_2|b_1} is the conditional probability of choosing cluster b_2 given the previous choice of b_1.

After selecting the clusters D_{j_1} and D_{j_2}, we join them into a single cluster D_j, i.e., we set D_j = D_{j_1} ∪ D_{j_2}. We fix the new label as j = j_1 and for all other labels j' with j' ≥ j_2 we do j' = j' - 1. Thus, we have a new configuration c^me = (c_1^me, ..., c_n^me) with k_c^me = k_c - 1 clusters, where

(a) for all c_i ∈ c such that c_i = j' and j' ≤ j_1, do c_i^me = c_i;
(b) for all c_i ∈ c such that c_i = j' and j' ≥ j_2, do c_i^me = c_i - 1;

for i = 1, ..., n, j' = 1, ..., k_c and j ∈ {1, ..., k_c - 1}.

Conditional on D_j, generate the candidate value θ_j^me for the parameter θ_j from the posterior distribution π(θ_j | D_j). This determines a new vector of parameters θ^me = (θ_1^me, ..., θ_{k_c-1}^me, θ_{k_c}^me, ..., θ_k^me), where

(i) for all θ_{j'}^me ∈ θ^me such that j' < j_1, do θ_{j'}^me = θ_{j'};
(ii) for all θ_{j'}^me ∈ θ^me such that j' ≥ j_2, do θ_{j'}^me = θ_{j'+1}, for θ_{j'} ∈ θ and j' ∈ {1, ..., k - 1} \ {j_1, j_2};

(iii) in order to complete θ^me, generate θ_k^me from the prior distribution, θ_k ∼ π_G(θ_k).

Here we also must check whether the adjacency condition is met, i.e., µ_{j-1}^me < µ_j^me < µ_{j+1}^me. In the case where it is not, the proposal is rejected because the movement may not be reversible by the split proposal. If the adjacency condition is met, the merge proposal determines the new configuration c^me and a new vector of parameters θ^me with k_c^me = k_c - 1 clusters. This transition proposal is denoted by Φ^me | Φ and its probability is given by

q[Φ^me | Φ] = P_{me|k_c} P_{j_1,j_2} π(θ_j | D_j) π_G(θ_k | η_k).     (3.14)

Having defined the transition probabilities for the split and merge movements, it is important to note that, given the current state Φ, the probability of proposing a split of the cluster D_j into D_{j_1} and D_{j_2} (i.e., the split state Φ^sp) is equivalent to being in the state with D_{j_1} and D_{j_2} merged into D_j (i.e., the merge state Φ^me) and proposing the move back to the current state Φ. In terms of transition probabilities this means that

q[Φ^sp | Φ] = q[Φ | Φ^me] = P_{sp|k_c^me} P_{j|C_2^me} P_alloc π(θ_{j_1} | D_{j_1}) π(θ_{j_2} | D_{j_2}).     (3.15)

Besides, it is also important to note that in the merge proposal there is only one way to merge the observations from two components into one component. Thus, we need to calculate the corresponding probability, P_alloc, of generating the current split state from the proposed merge state. This is done in the same way as in the split proposal, but now considering the current split state as known. Analogously to (3.15), we have that

q[Φ^me | Φ] = q[Φ | Φ^sp] = P_{me|k_c^sp} P_{j_1,j_2} π(θ_j | D_j) π_G(θ_k | η_k).     (3.16)

3.2.3 Acceptance probability. From equation (3.12), the acceptance probability for a split movement is Ψ[Φ^sp | Φ] = min(1, A^sp), where

A^sp = [P(y | c^sp, θ^sp, k) π(c^sp | γ, k) π(θ^sp | k) q[Φ | Φ^sp]] / [P(y | c, θ, k) π(c | γ, k) π(θ | k) q[Φ^sp | Φ]].

The ratio of the joint probabilities of y conditional on the latent indicator variables and component parameters is given by

P(y | c^sp, θ^sp, k) / P(y | c, θ, k) = L(θ_{j_1}^sp | D_{j_1}) L(θ_{j_2}^sp | D_{j_2}) / L(θ_j | D_j),     (3.17)

where L(θ_m | D_m) is given in (3.9), for m = j, j_1, j_2.

From (3.3), the ratio of the joint prior probabilities of the latent indicator variables is

π(c^sp | γ, k) / π(c | γ, k) = Γ(n_{j_1} + γ) Γ(n_{j_2} + γ) / [Γ(n_j + γ) Γ(γ)].     (3.18)

The ratio of the prior densities of the component parameters is

π(θ^sp | k) / π(θ | k) = ∏_{j=1}^{k} π_G(θ_j^sp | η_j) / ∏_{j=1}^{k} π_G(θ_j | η_j) = π_G(θ_{j_1}^sp | η_{j_1}) π_G(θ_{j_2}^sp | η_{j_2}) / [π_G(θ_j | η_j) π_G(θ_k | η_k)].     (3.19)

From (3.13) and (3.16), the transition probability ratio for the split proposal is

q[Φ | Φ^sp] / q[Φ^sp | Φ] = [P_{me|k_c^sp} P_{j_1,j_2} π(θ_j | D_j) π_G(θ_k | η_k)] / [P_{sp|k_c} P_{j|C_2} P_alloc π(θ_{j_1}^sp | D_{j_1}) π(θ_{j_2}^sp | D_{j_2})] = Q^sp P_r / P_alloc,     (3.20)

where

Q^sp = P_{me|k_c^sp} P_{j_1,j_2} / (P_{sp|k_c} P_{j|C_2}) =
  C_2/2,                                 if k_c = 1;
  (1/2)^{1-I_c(k-1)} 3C_2/(k_c + 1),     if k_c ∈ K_1;
  (1/2)^{1-I_c(k-1)} 2C_2/(k_c + 1),     if k_c ∈ K_2;

with I_c(k-1) = 1 if k_c = k - 1 and I_c(k-1) = 0 otherwise, K_1 = {2 ≤ k_c ≤ k - 1 and j_1 = 1 or j_2 = k_c^sp}, K_2 = {2 ≤ k_c ≤ k - 1 and j_1 ≠ 1 and j_2 ≠ k_c^sp}, and P_r is the ratio of posterior densities, given by

P_r = π(θ_j | D_j) π_G(θ_k) / [π(θ_{j_1} | D_{j_1}) π(θ_{j_2} | D_{j_2})]
    = [L(θ_j | D_j) / (L(θ_{j_1} | D_{j_1}) L(θ_{j_2} | D_{j_2}))] × [π_G(θ_j | η_j) π_G(θ_k | η_k) / (π_G(θ_{j_1} | η_{j_1}) π_G(θ_{j_2} | η_{j_2}))] × [I(D_{j_1}) I(D_{j_2}) / I(D_j)],

where I(D_m) = ∫ L(θ_m | D_m) π_G(θ_m | η_m) dθ_m is the normalizing constant, for m = j, j_1, j_2.

Multiplying (3.17), (3.18), (3.19) and (3.20), a split is accepted with probability Ψ[Φ^sp | Φ] = min(1, A^sp), for

A^sp = [I(D_{j_1}) I(D_{j_2}) / I(D_j)] × [Γ(n_{j_1} + γ) Γ(n_{j_2} + γ) / (Γ(n_j + γ) Γ(γ))] × Q^sp / P_alloc.

Similarly, the acceptance probability for a merge is Ψ[Φ^me | Φ] = min(1, A^me), with A^me = 1/A^sp.
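As an illustration (not code from the paper), A^sp can be evaluated on the log scale once the marginal likelihood I(D) of a component is available; log_marginal below is a placeholder for the component-specific normalizing constant, whose exact form for the Gaussian model of Section 4 is given in Section S-1 of the SM.

```python
from scipy.special import gammaln

def log_split_acceptance(log_marginal, D, D1, D2, gamma, log_Q_sp, log_P_alloc):
    """log A^sp: the marginal-likelihood ratio times the allocation-prior ratio,
    times Q^sp / P_alloc; log_marginal(data) must return log I(data)."""
    n, n1, n2 = len(D), len(D1), len(D2)
    log_ratio_I = log_marginal(D1) + log_marginal(D2) - log_marginal(D)
    log_ratio_c = (gammaln(n1 + gamma) + gammaln(n2 + gamma)
                   - gammaln(n + gamma) - gammaln(gamma))
    return log_ratio_I + log_ratio_c + log_Q_sp - log_P_alloc

# accept the split if log(u) < min(0, log A^sp) for u ~ Uniform(0, 1)
```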

3.2.4 Split-merge allocation sampler. The split-merge procedure is now described as an algorithm, denominated the Split-Merge Allocation Sampler (SMAS).

SMAS Algorithm: Initialize with a configuration c^(0). For the l-th iteration of the algorithm do:

(i) update the component parameters θ using Algorithm 1;
(ii) update the indicator variables c using Algorithm 2;
(iii) choose between a split or a merge with probabilities P_{sp|k_c} and P_{me|k_c};
(iv) accept the proposal with probability Ψ[Φ* | Φ], where * is either sp or me;
(a) if a split proposal is accepted, do k_c^(l) = k_c^(l-1) + 1;
(b) if a merge proposal is accepted, do k_c^(l) = k_c^(l-1) - 1;
(c) otherwise, keep k_c^(l) = k_c^(l-1).

In order to estimate the number of clusters, we discard the first B iterations as burn-in and calculate N(k_c = j), the number of times that k_c = j in the remaining L - B iterations, for j ∈ {1, ..., k}. Let P̂(k_c = j) = N(k_c = j)/(L - B) be the estimated posterior probability of k_c = j, for j = 1, ..., k. Then k̂_c = argmax_{1 ≤ j ≤ k} P̂(k_c = j) is the estimate of the number of clusters.

Conditional on k̂_c, consider

(i) L_{k̂_c} = ∑_{l=B+1}^{L} I_{k_c^(l)}(k̂_c) to be the number of iterations for which k_c = k̂_c in the L - B iterations, where I_{k_c^(l)}(k̂_c) = 1 if in the l-th iteration k_c^(l) = k̂_c and I_{k_c^(l)}(k̂_c) = 0 otherwise;
(ii) N_{ij} = ∑_{l=B+1}^{L} I_{c_i^(l)}(j) I_{k_c^(l)}(k̂_c) to be the number of times that observation y_i is allocated to component j in the L_{k̂_c} iterations, where I_{c_i^(l)}(j) = 1 if in the l-th iteration c_i = j and I_{c_i^(l)}(j) = 0 otherwise, for i = 1, ..., n and j = 1, ..., k.

Let P̂(c_i = j) = N_{ij}/L_{k̂_c} be the posterior probability that the observation y_i belongs to cluster j, for j = 1, ..., k̂_c. If P̂(c_i = j) = max_{1 ≤ j' ≤ k̂_c} P̂(c_i = j'), then we consider that y_i belongs to cluster j, for i = 1, ..., n and j = 1, ..., k̂_c.
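A brief sketch of this post-processing step (ours; it assumes the sampler has stored, after burn-in and thinning, the draws k_draws of k_c and a draws-by-n matrix c_draws of allocation vectors with labels 1, ..., k_c):

```python
import numpy as np

def summarize_smas(k_draws, c_draws):
    """Estimate k_c by its posterior mode and, conditional on it, the cluster
    membership probabilities and MAP cluster of each observation."""
    k_draws = np.asarray(k_draws)
    values, counts = np.unique(k_draws, return_counts=True)
    post_k = counts / counts.sum()            # estimated P(k_c = j)
    k_hat = int(values[np.argmax(post_k)])    # posterior mode of k_c

    c_kept = np.asarray(c_draws)[k_draws == k_hat]    # iterations with k_c = k_hat
    labels = np.arange(1, k_hat + 1)
    probs = np.stack([(c_kept == j).mean(axis=0) for j in labels], axis=1)
    z_hat = labels[np.argmax(probs, axis=1)]          # MAP cluster of each y_i
    return k_hat, dict(zip(values.tolist(), post_k)), probs, z_hat
```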

We estimate the parameters θ_j, j = 1, ..., k̂_c, by the average of the generated values, i.e.,

θ̂_j = (1/L_{k̂_c}) ∑_{l=B+1}^{L} θ_j^(l) I_{k_c^(l)}(k̂_c).

4 Data analysis

We illustrate the performance of the proposed method using simulated data and three real data sets. Following Richardson and Green (1997), we model these datasets by assuming a univariate normal mixture model. In model (2.2), f(y_i | θ_j) is the density of a normal distribution with mean µ_j and variance σ_j², and θ_j = (µ_j, σ_j²), for j = 1, ..., k. In order to exploit full conjugacy we consider the following prior distributions for the component parameters θ_j = (µ_j, σ_j²),

µ_j | σ_j², µ_0, λ ∼ N(µ_0, σ_j²/λ)   and   σ_j^{-2} | α, β ∼ Γ(α, β),

where µ_0, λ, α and β are hyperparameters, N(µ_0, σ_j²/λ) represents the normal distribution with mean µ_0 and variance σ_j²/λ, and Γ(α, β) represents the gamma distribution with shape parameter α and rate parameter β. The parametrization of the gamma distribution is such that the mean is α/β and the variance is α/β². These prior distributions are also used by Casella et al. (2000), Nobile and Fearnside (2007) and Saraiva et al. (2016).

Since it may be unrealistic to assume the availability of strong prior information regarding the component parameters θ_j in practice, we specify the hyperparameter values according to the guidelines of Richardson and Green (1997) and Saraiva et al. (2016). Thus, we set µ_0 = ε, α = 2 and β = 0.2/10R², where ε is the midpoint of the observed variation interval of the data and R is the length of this interval. We also fix the hyperparameter λ = 10^{-2} in order to get a prior distribution for the component means with large variance. The normalizing constant I(D_j) that appears in the acceptance probability of the split-merge move is presented in Section S-1 of the SM.
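Under this conjugate specification the marginal likelihood of the observations in one component has a closed form; the sketch below is ours, written from the standard normal-gamma result (the paper's exact expression is in Section S-1 of the SM), and could play the role of the log_marginal placeholder in the earlier acceptance-ratio sketch, e.g. with the hyperparameters fixed via functools.partial. The commented hyperparameter line reflects one possible reading of the values quoted above.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_normal_gamma(y, mu0, lam, alpha, beta):
    """log I(D) for one Gaussian component under the conjugate prior
    mu | sigma^2 ~ N(mu0, sigma^2 / lam) and sigma^{-2} ~ Gamma(alpha, rate=beta)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    ybar = y.mean() if n > 0 else 0.0
    ss = ((y - ybar) ** 2).sum()
    lam_n = lam + n
    alpha_n = alpha + 0.5 * n
    beta_n = beta + 0.5 * ss + 0.5 * lam * n * (ybar - mu0) ** 2 / lam_n
    return (gammaln(alpha_n) - gammaln(alpha)
            + alpha * np.log(beta) - alpha_n * np.log(beta_n)
            + 0.5 * (np.log(lam) - np.log(lam_n))
            - 0.5 * n * np.log(2.0 * np.pi))

# Possible hyperparameter choices from the data range (eps = midpoint, R = length):
# mu0, lam, alpha = eps, 1e-2, 2.0; beta = (0.2 / 10) * R**2   # one reading of 0.2/10R^2
```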

4.1 Simulated data sets

Consider the population model given in (2.2) with k = 10 components. To simulate the data sets, we set the number of clusters k_c and the component parameters of the k_c clusters as specified in Table 2. In setup A_1 the two clusters have equal variances and weights, while A_2 has different variances and weights. In A_3 the three clusters have equal variances and similar weights, and A_4 has different variances and weights. In A_5 we consider four clusters with the same weights and two clusters with higher variances in relation to the other two, and in A_6 we consider five clusters with different weights and one cluster with large variance in relation to the others.

Table 2. Number of clusters and parameter values used for simulating the datasets.

Dataset   Number of clusters   Parameter values
A_1       k_true = 2           µ_1 = 0, µ_2 = 3; σ_1 = 1, σ_2 = 1; w_1 = 0.50, w_2 = 0.50
A_2       k_true = 2           µ_1 = 0, µ_2 = 4; σ_1 = 1, σ_2 = 2; w_1 = 0.70, w_2 = 0.30
A_3       k_true = 3           µ_1 = -3, µ_2 = 0, µ_3 = 3; σ_1 = 1, σ_2 = 1, σ_3 = 1; w_1 = 0.30, w_2 = 0.40, w_3 = 0.30
A_4       k_true = 3           µ_1 = -5, µ_2 = 0, µ_3 = 7; σ_1 = 1, σ_2 = 2, σ_3 = 3; w_1 = 0.20, w_2 = 0.30, w_3 = 0.50
A_5       k_true = 4           µ_1 = -4, µ_2 = 0, µ_3 = 5, µ_4 = 10; σ_1 = 1, σ_2 = 2, σ_3 = 1, σ_4 = 2; w_1 = w_2 = w_3 = w_4 = 0.25
A_6       k_true = 5           µ_1 = -6, µ_2 = 0, µ_3 = 7, µ_4 = 15, µ_5 = 21; σ_1 = 1, σ_2 = 2, σ_3 = 3, σ_4 = 2, σ_5 = 1; w_1 = 0.15, w_2 = 0.20, w_3 = 0.30, w_4 = 0.20, w_5 = 0.15

The procedure for generating the data sets is as follows.

(i) For i = 1, ..., n, generate U_i ∼ U(0, 1); if ∑_{j'=1}^{j-1} w_{j'} < u_i ≤ ∑_{j'=1}^{j} w_{j'}, generate Y_i ∼ N(µ_j, σ_j²), with the fixed parameter values given in Table 2, for w_0 = 0 and j = 1, ..., k_c.
(ii) In order to record from which component each observation was generated, we define G = (G_1, ..., G_n) such that G_i = j if Y_i ∼ N(µ_j, σ_j²), for i = 1, ..., n and j = 1, ..., k_c.

In the remainder of the paper we simulate datasets of size n = 500; samples of size n = 1,000 are also simulated and the results are presented in Section S-3 of the SM.
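The generation procedure in items (i)-(ii) above can be sketched in Python as follows (illustrative; the example parameter values follow our reading of Table 2).

```python
import numpy as np

def generate_dataset(n, mu, sigma, w, seed=None):
    """Generate (y, G): a uniform draw is compared with the cumulative weights to
    pick the component j, then Y_i ~ N(mu_j, sigma_j^2), as in Section 4.1."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    cum_w = np.concatenate(([0.0], np.cumsum(w)))    # w_0 = 0, then partial sums
    u = rng.uniform(size=n)
    G = np.clip(np.searchsorted(cum_w, u, side="left"), 1, len(w))  # labels 1..k_c
    y = rng.normal(np.asarray(mu)[G - 1], np.asarray(sigma)[G - 1])
    return y, G

# Example in the spirit of setup A_2 (two clusters, unequal weights and variances)
y, G = generate_dataset(500, mu=[0.0, 4.0], sigma=[1.0, 2.0], w=[0.7, 0.3], seed=42)
```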

Figure 1 shows the values generated by cluster for datasets A_1 to A_6 with n = 500.

[Figure 1. Values generated by cluster for datasets A_1 to A_6 with n = 500; panels (a)-(f) show datasets A_1-A_6.]

We apply the proposed SMAS algorithm with L = 110,000 iterations and a burn-in of B = 10,000. We also thin the chain, keeping one draw out of every 20, which yields a sequence of 5,000 cases. These values were enough for reliable results.

For all values γ = r with r ∈ R_1, the SMAS places the maximum posterior probability on the true value of k_c. These results are presented in Section S-2 of the SM. For γ = r with r ∈ R_2, the SMAS tends to overestimate the number of clusters as the value of r increases. These results are presented in Figure 2, which shows the posterior probability of the true value of k_c and the k_c values with maximum posterior probability for each γ = r ∈ R_2. As one can note, for large values of γ the method overestimates the number of clusters. For datasets A_1, A_2 and A_4, the maximum posterior probability on the true value of k_c is obtained only for γ ∈ {0.1, 0.2}, while for datasets A_3 and A_5 it is obtained for γ ∈ {0.1, 0.2, 0.3} and for dataset A_6 for γ ∈ {0.1, 0.2, 0.3, 0.4}. Besides, for all simulated cases, the posterior probability on the true value of k_c goes to zero as γ approaches 2.

Motivated by this sensitivity to the choice of γ, we develop a second approach that places a gamma prior distribution on γ, γ ∼ Γ(a, b), a, b > 0.

[Figure 2. Posterior probability of the true value of k_c and k_c values with maximum posterior probability for n = 500, plotted against γ; panels (a)-(f) show datasets A_1-A_6. One plotting symbol represents the posterior probability of the true k_c value and the other the k_c value with maximum posterior probability.]

In this approach, rather than the value of γ, it is the choice of the hyperparameters a and b that influences the posterior probability of k_c. To verify the sensitivity with respect to the choice of the hyperparameter values we consider:

(i) a = b = 1, obtaining E(γ) = 1 and Var(γ) = 1. This prior represents the belief that the weights have a uniform distribution over the simplex;
(ii) a = 2 and b = 4, obtaining E(γ) = 0.5 and Var(γ) = 0.125. This prior was suggested by Escobar and West (1995) in the context of the Dirichlet process mixture model.

In order to simulate values from the conditional posterior distribution of γ, π(γ | c, y, θ, k), we implement a random walk Metropolis (RWM) algorithm. We consider the candidate-generating density q[γ* | γ], where γ* is a candidate value, with uniform distribution centered on the current value of γ, i.e., U(γ - ɛ, γ + ɛ). We set ɛ = 0.05. The candidate value γ* is accepted with probability Ψ(γ* | γ) = min(1, A_γ), where

A_γ = [π(c | γ*, k) π(γ*)] / [π(c | γ, k) π(γ)],

since the proposal kernels in the numerator and denominator cancel; π(c | γ, k) is given in (3.3) and π(γ) is the density of the gamma prior for γ. This algorithm is implemented as follows.

RWM algorithm: Let the state of the Markov chain be c = (c_1, ..., c_n), θ = (θ_1, ..., θ_k) and γ. Conditional on c and θ, update γ at the l-th iteration of the algorithm as follows.

(i) Generate γ* ∼ U(γ - ɛ, γ + ɛ);
(ii) Calculate Ψ(γ* | γ) = min(1, A_γ);
(iii) Generate u ∼ U(0, 1). If u ≤ Ψ(γ* | γ), accept γ* by setting γ^(l) = γ*. Otherwise, set γ^(l) = γ^(l-1).

Estimate γ by the average of the simulated values, i.e., γ̂ = (1/L_{k̂_c}) ∑_{l=B+1}^{L} γ^(l) I_{k_c^(l)}(k̂_c).
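A Python sketch of this random-walk Metropolis update follows (ours, under the stated assumptions; log_prior_c implements (3.3), the gamma prior is parametrized by shape a and rate b, and the rejection of non-positive proposals is our choice).

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import gamma as gamma_dist

def log_prior_c(counts, k, gam):
    """log pi(c | gamma, k) from eq. (3.3); counts are the occupation numbers n_j
    of all k components (zeros included for empty components)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    return (gammaln(k * gam) - k * gammaln(gam) - gammaln(n + k * gam)
            + gammaln(counts + gam).sum())

def rwm_update_gamma(gam, counts, k, a, b, eps=0.05, rng=None):
    """One random-walk Metropolis step for gamma with a U(gam - eps, gam + eps)
    proposal and a Gamma(a, rate=b) prior, as described in Section 4.1."""
    rng = np.random.default_rng(rng)
    gam_new = rng.uniform(gam - eps, gam + eps)
    if gam_new <= 0:                                  # outside the support: reject
        return gam
    log_A = (log_prior_c(counts, k, gam_new) - log_prior_c(counts, k, gam)
             + gamma_dist.logpdf(gam_new, a, scale=1.0 / b)
             - gamma_dist.logpdf(gam, a, scale=1.0 / b))
    return gam_new if np.log(rng.uniform()) < min(0.0, log_A) else gam
```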

Tables 3 and 4 show the estimates and the 95% credibility intervals for γ and the estimated posterior probabilities of k_c for datasets A_1 to A_6, for γ ∼ Γ(1, 1) and γ ∼ Γ(2, 4), respectively. All estimates for γ lead to the maximum posterior probability on the true value of k_c (highlighted in bold). The prior Γ(1, 1) on γ has led to higher posterior probabilities on k_c than the prior distribution Γ(2, 4). For each dataset the credibility intervals contain most of the values r ∈ G that lead to maximum posterior probability on the true value of k_c. For instance, for datasets A_1 and A_2, all values r that lead to maximum posterior probability on the true value of k_c, except r = 0.2, belong to the estimated credibility intervals.

Table 3. Estimates (95% credibility intervals) for γ and estimated posterior probabilities for k_c, with γ ∼ Γ(1, 1).

Data set   k_c true   Estimate (95% CI) for γ   Probabilities for k_c values
A_1        2          (0.0051, 0.118)
A_2        2          (0.0048, 0.12)
A_3        3          (0.0148, 0.188)
A_4        3          (0.0151, )
A_5        4          (0.028, 0.072)
A_6        5          (0.045, 0.471)

Table 4. Estimates (95% credibility intervals) for γ and estimated posterior probabilities for k_c, with γ ∼ Γ(2, 4).

Data set   k_c true   Estimate (95% CI) for γ   Probabilities for k_c values
A_1        2          (0.011, )
A_2        2          (0.0124, 0.16)
A_3        3          (0.02, )
A_4        3          (0.024, 0.257)
A_5        4          (0.0414, 0.4)
A_6        5          (0.0601, 0.51)

Performance of the SMAS. We now verify empirically the convergence of the sequence of posterior probabilities for k_c across iterations, the capacity to move between different values of k_c in the course of the iterations, and the estimated autocorrelation function (acf). We present the performance of the SMAS using the results obtained with γ = γ̂, where γ̂ is the estimated value of γ given in Table 3. For other values of γ (the estimates given in Table 4 and the values fixed as r for r ∈ G) with maximum posterior probability on the true value of k_c the results are similar.

Figure 3 shows the plots of the P̂(k_c) estimates across the iterations. To maintain a good visualization we display in each graphic only the three highest P̂(k_c) estimates. The number of iterations and the burn-in seem to be adequate to achieve stability of the posterior probabilities of k_c.

[Figure 3. A posteriori probability for k_c across iterations for the thinned sequence; panels (a)-(f) show datasets A_1-A_6.]

Figure 4 shows the sampled k_c values in the course of the iterations. The algorithm mixes well over k_c and remains, in most iterations, around the target true values of k_c. The sampled k_c values do not present significant autocorrelation, as shown by Figure 5.

Table 5 shows the percentage of split and merge moves accepted and the percentage of γ values accepted. These percentages show that split and merge moves are proposed and accepted in a balanced way. The high percentage of accepted values for γ is due to the way we implement the RWM, assuming a small value ε = 0.05.

Table 5. Percentages of split and merge movements and of γ values accepted, for data sets A_1 to A_6.

[Figure 4. Sampled k_c values for the thinned sequence; panels (a)-(f) show datasets A_1-A_6.]

[Figure 5. Estimated autocorrelation of the sampled k_c values for the thinned sequence; panels (a)-(f) show datasets A_1-A_6.]

Figure 2 in Section S-2.2 of the SM shows the simulated values and the clusters identified by the method. The clusters were satisfactorily identified. Section S-2.2 of the SM also presents the estimates for the components of each cluster (Tables 1 and 2) and the histogram of the observed data together with the estimated density function (Figure 3).

4.2 Galaxy data set

We now apply the proposed methodology to the well-known galaxy dataset, previously analyzed by Roeder and Wasserman (1997), Escobar and West (1995), Richardson and Green (1997) and Stephens (2000), among others. The data set refers to velocities (in 10³ km/s) of distant galaxies diverging from our own. The sample size is n = 82 observations. For more details on this dataset see Escobar and West (1995) and Richardson and Green (1997).

We consider the galaxy velocities as realizations from a mixture of k normal distributions and the Bayesian model in (3.1) with k = 10. We use the same hyperparameter specification used for the artificial datasets. To estimate the hyperparameter γ we consider the hierarchical approach, setting γ ∼ Γ(a, b) with a = b = 1. The number of iterations, burn-in and thinning value are the same as used in the previous analyses. Considering the SMAS algorithm, the estimate and the 95% credibility interval for γ are and (0.0182, 0.266), respectively. The estimated posterior probabilities for k_c, 1 ≤ k_c ≤ 10, are presented in Table 6. The maximum posterior probability is at k_c = 3. For the sake of comparison, estimates of k for this data set range from 3 to 4 for Roeder and Wasserman (1997) and Stephens (2000), from 5 to 7 for Richardson and Green (1997), and just 7 for Escobar and West (1995).

Table 6. Estimated posterior probabilities P̂(k_c), k_c = 1, ..., 10, for the galaxy data set, with γ ∼ Γ(1, 1).

Figure 6 shows the performance of the SMAS. Similarly to the simulation results, the SMAS sampler mixes well over k_c and shows satisfactory stability of the probabilities of k_c. The sampled k_c values do not present significant autocorrelation. The acceptance ratios for the split and merge moves are 1.818% and 1.464%, respectively. The acceptance ratio for the generated γ values is %. Figure 7(a) shows the clusters identified conditional on the estimate k̂_c = 3. Figure 7(b) shows the histogram of the observed data and the estimated density function. As noted by Roeder and Wasserman (1997) and Stephens


More information

Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US

Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US Gerdie Everaert 1, Lorenzo Pozzi 2, and Ruben Schoonackers 3 1 Ghent University & SHERPPA 2 Erasmus

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

7. Estimation and hypothesis testing. Objective. Recommended reading

7. Estimation and hypothesis testing. Objective. Recommended reading 7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing

More information

Bayesian Nonparametrics: Dirichlet Process

Bayesian Nonparametrics: Dirichlet Process Bayesian Nonparametrics: Dirichlet Process Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL http://www.gatsby.ucl.ac.uk/~ywteh/teaching/npbayes2012 Dirichlet Process Cornerstone of modern Bayesian

More information

Monte Carlo in Bayesian Statistics

Monte Carlo in Bayesian Statistics Monte Carlo in Bayesian Statistics Matthew Thomas SAMBa - University of Bath m.l.thomas@bath.ac.uk December 4, 2014 Matthew Thomas (SAMBa) Monte Carlo in Bayesian Statistics December 4, 2014 1 / 16 Overview

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Index. Pagenumbersfollowedbyf indicate figures; pagenumbersfollowedbyt indicate tables.

Index. Pagenumbersfollowedbyf indicate figures; pagenumbersfollowedbyt indicate tables. Index Pagenumbersfollowedbyf indicate figures; pagenumbersfollowedbyt indicate tables. Adaptive rejection metropolis sampling (ARMS), 98 Adaptive shrinkage, 132 Advanced Photo System (APS), 255 Aggregation

More information

A Simple Solution to Bayesian Mixture Labeling

A Simple Solution to Bayesian Mixture Labeling Communications in Statistics - Simulation and Computation ISSN: 0361-0918 (Print) 1532-4141 (Online) Journal homepage: http://www.tandfonline.com/loi/lssp20 A Simple Solution to Bayesian Mixture Labeling

More information

Model Selection in Bayesian Survival Analysis for a Multi-country Cluster Randomized Trial

Model Selection in Bayesian Survival Analysis for a Multi-country Cluster Randomized Trial Model Selection in Bayesian Survival Analysis for a Multi-country Cluster Randomized Trial Jin Kyung Park International Vaccine Institute Min Woo Chae Seoul National University R. Leon Ochiai International

More information

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA Intro: Course Outline and Brief Intro to Marina Vannucci Rice University, USA PASI-CIMAT 04/28-30/2010 Marina Vannucci

More information

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison AN IMPROVED MERGE-SPLIT SAMPLER FOR CONJUGATE DIRICHLET PROCESS MIXTURE MODELS David B. Dahl dbdahl@stat.wisc.edu Department of Statistics, and Department of Biostatistics & Medical Informatics University

More information

Using Model Selection and Prior Specification to Improve Regime-switching Asset Simulations

Using Model Selection and Prior Specification to Improve Regime-switching Asset Simulations Using Model Selection and Prior Specification to Improve Regime-switching Asset Simulations Brian M. Hartman, PhD ASA Assistant Professor of Actuarial Science University of Connecticut BYU Statistics Department

More information

BAYESIAN MODEL CRITICISM

BAYESIAN MODEL CRITICISM Monte via Chib s BAYESIAN MODEL CRITICM Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model

More information

A generalization of the Multiple-try Metropolis algorithm for Bayesian estimation and model selection

A generalization of the Multiple-try Metropolis algorithm for Bayesian estimation and model selection A generalization of the Multiple-try Metropolis algorithm for Bayesian estimation and model selection Silvia Pandolfi Francesco Bartolucci Nial Friel University of Perugia, IT University of Perugia, IT

More information

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida Bayesian Statistical Methods Jeff Gill Department of Political Science, University of Florida 234 Anderson Hall, PO Box 117325, Gainesville, FL 32611-7325 Voice: 352-392-0262x272, Fax: 352-392-8127, Email:

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Introduction to Markov Chain Monte Carlo & Gibbs Sampling

Introduction to Markov Chain Monte Carlo & Gibbs Sampling Introduction to Markov Chain Monte Carlo & Gibbs Sampling Prof. Nicholas Zabaras Sibley School of Mechanical and Aerospace Engineering 101 Frank H. T. Rhodes Hall Ithaca, NY 14853-3801 Email: zabaras@cornell.edu

More information

The Bayesian Choice. Christian P. Robert. From Decision-Theoretic Foundations to Computational Implementation. Second Edition.

The Bayesian Choice. Christian P. Robert. From Decision-Theoretic Foundations to Computational Implementation. Second Edition. Christian P. Robert The Bayesian Choice From Decision-Theoretic Foundations to Computational Implementation Second Edition With 23 Illustrations ^Springer" Contents Preface to the Second Edition Preface

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Integrated Non-Factorized Variational Inference

Integrated Non-Factorized Variational Inference Integrated Non-Factorized Variational Inference Shaobo Han, Xuejun Liao and Lawrence Carin Duke University February 27, 2014 S. Han et al. Integrated Non-Factorized Variational Inference February 27, 2014

More information

Markov Chain Monte Carlo in Practice

Markov Chain Monte Carlo in Practice Markov Chain Monte Carlo in Practice Edited by W.R. Gilks Medical Research Council Biostatistics Unit Cambridge UK S. Richardson French National Institute for Health and Medical Research Vilejuif France

More information

Bayesian Mixture Modeling Approach. to Account for Heterogeneity in Speed Data

Bayesian Mixture Modeling Approach. to Account for Heterogeneity in Speed Data Par, Zhang, and Lord Bayesian Mixture Modeling Approach to Account for Heterogeneity in Speed Data Byung-Jung Par Graduate Assistant Researcher Zachry Department of Civil Engineering Texas A&M University,

More information

A note on Reversible Jump Markov Chain Monte Carlo

A note on Reversible Jump Markov Chain Monte Carlo A note on Reversible Jump Markov Chain Monte Carlo Hedibert Freitas Lopes Graduate School of Business The University of Chicago 5807 South Woodlawn Avenue Chicago, Illinois 60637 February, 1st 2006 1 Introduction

More information

Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox Richard A. Norton, J.

Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox Richard A. Norton, J. Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox fox@physics.otago.ac.nz Richard A. Norton, J. Andrés Christen Topics... Backstory (?) Sampling in linear-gaussian hierarchical

More information

INTRODUCTION TO BAYESIAN STATISTICS

INTRODUCTION TO BAYESIAN STATISTICS INTRODUCTION TO BAYESIAN STATISTICS Sarat C. Dass Department of Statistics & Probability Department of Computer Science & Engineering Michigan State University TOPICS The Bayesian Framework Different Types

More information

Text Mining for Economics and Finance Latent Dirichlet Allocation

Text Mining for Economics and Finance Latent Dirichlet Allocation Text Mining for Economics and Finance Latent Dirichlet Allocation Stephen Hansen Text Mining Lecture 5 1 / 45 Introduction Recall we are interested in mixed-membership modeling, but that the plsi model

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Variable selection for model-based clustering

Variable selection for model-based clustering Variable selection for model-based clustering Matthieu Marbac (Ensai - Crest) Joint works with: M. Sedki (Univ. Paris-sud) and V. Vandewalle (Univ. Lille 2) The problem Objective: Estimation of a partition

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Topic Modelling and Latent Dirichlet Allocation

Topic Modelling and Latent Dirichlet Allocation Topic Modelling and Latent Dirichlet Allocation Stephen Clark (with thanks to Mark Gales for some of the slides) Lent 2013 Machine Learning for Language Processing: Lecture 7 MPhil in Advanced Computer

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information

Bayesian nonparametrics

Bayesian nonparametrics Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability

More information

On Bayesian Computation

On Bayesian Computation On Bayesian Computation Michael I. Jordan with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang Previous Work: Information Constraints on Inference Minimize the minimax risk under constraints

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture)

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 18 : Advanced topics in MCMC Lecturer: Eric P. Xing Scribes: Jessica Chemali, Seungwhan Moon 1 Gibbs Sampling (Continued from the last lecture)

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model

Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model David B Dahl 1, Qianxing Mo 2 and Marina Vannucci 3 1 Texas A&M University, US 2 Memorial Sloan-Kettering

More information

Overall Objective Priors

Overall Objective Priors Overall Objective Priors Jim Berger, Jose Bernardo and Dongchu Sun Duke University, University of Valencia and University of Missouri Recent advances in statistical inference: theory and case studies University

More information

BIOINFORMATICS. Gilles Guillot

BIOINFORMATICS. Gilles Guillot BIOINFORMATICS Vol. 00 no. 00 2008 Pages 1 19 Supplementary material for: Inference of structure in subdivided populations at low levels of genetic differentiation. The correlated allele frequencies model

More information

Gaussian kernel GARCH models

Gaussian kernel GARCH models Gaussian kernel GARCH models Xibin (Bill) Zhang and Maxwell L. King Department of Econometrics and Business Statistics Faculty of Business and Economics 7 June 2013 Motivation A regression model is often

More information

Clustering bi-partite networks using collapsed latent block models

Clustering bi-partite networks using collapsed latent block models Clustering bi-partite networks using collapsed latent block models Jason Wyse, Nial Friel & Pierre Latouche Insight at UCD Laboratoire SAMM, Université Paris 1 Mail: jason.wyse@ucd.ie Insight Latent Space

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Bayesian Analysis of Mixture Models with an Unknown Number of Components an alternative to reversible jump methods

Bayesian Analysis of Mixture Models with an Unknown Number of Components an alternative to reversible jump methods Bayesian Analysis of Mixture Models with an Unknown Number of Components an alternative to reversible jump methods Matthew Stephens Λ Department of Statistics University of Oxford Submitted to Annals of

More information

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals

More information

Label Switching and Its Simple Solutions for Frequentist Mixture Models

Label Switching and Its Simple Solutions for Frequentist Mixture Models Label Switching and Its Simple Solutions for Frequentist Mixture Models Weixin Yao Department of Statistics, Kansas State University, Manhattan, Kansas 66506, U.S.A. wxyao@ksu.edu Abstract The label switching

More information

Machine Learning Summer School

Machine Learning Summer School Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,

More information

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1 Parameter Estimation William H. Jefferys University of Texas at Austin bill@bayesrules.net Parameter Estimation 7/26/05 1 Elements of Inference Inference problems contain two indispensable elements: Data

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Lecture 4: Dynamic models

Lecture 4: Dynamic models linear s Lecture 4: s Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes hlopes@chicagobooth.edu

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

Bayesian model selection for computer model validation via mixture model estimation

Bayesian model selection for computer model validation via mixture model estimation Bayesian model selection for computer model validation via mixture model estimation Kaniav Kamary ATER, CNAM Joint work with É. Parent, P. Barbillon, M. Keller and N. Bousquet Outline Computer model validation

More information

Bayesian non-parametric model to longitudinally predict churn

Bayesian non-parametric model to longitudinally predict churn Bayesian non-parametric model to longitudinally predict churn Bruno Scarpa Università di Padova Conference of European Statistics Stakeholders Methodologists, Producers and Users of European Statistics

More information

Hierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31

Hierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31 Hierarchical models Dr. Jarad Niemi Iowa State University August 31, 2017 Jarad Niemi (Iowa State) Hierarchical models August 31, 2017 1 / 31 Normal hierarchical model Let Y ig N(θ g, σ 2 ) for i = 1,...,

More information

Bayesian semiparametric modeling and inference with mixtures of symmetric distributions

Bayesian semiparametric modeling and inference with mixtures of symmetric distributions Bayesian semiparametric modeling and inference with mixtures of symmetric distributions Athanasios Kottas 1 and Gilbert W. Fellingham 2 1 Department of Applied Mathematics and Statistics, University of

More information

Bayesian time series classification

Bayesian time series classification Bayesian time series classification Peter Sykacek Department of Engineering Science University of Oxford Oxford, OX 3PJ, UK psyk@robots.ox.ac.uk Stephen Roberts Department of Engineering Science University

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 18-16th March Arnaud Doucet

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 18-16th March Arnaud Doucet Stat 535 C - Statistical Computing & Monte Carlo Methods Lecture 18-16th March 2006 Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Trans-dimensional Markov chain Monte Carlo. Bayesian model for autoregressions.

More information

PROGRAMA INTERINSTITUCIONAL DE PÓS-GRADUAÇÃO EM ESTATÍSTICA UFSCar-USP. DEs-UFSCar e SME-ICMC-USP

PROGRAMA INTERINSTITUCIONAL DE PÓS-GRADUAÇÃO EM ESTATÍSTICA UFSCar-USP. DEs-UFSCar e SME-ICMC-USP ISSN 0104-0499 PROGRAMA INTERINSTITUCIONAL DE PÓS-GRADUAÇÃO EM ESTATÍSTICA UFSCar-USP DEs-UFSCar e SME-ICMC-USP BAYESIAN ESTIMATION FOR MIXTURE OF SIMPLEX DISTRIBUTION WITH UNKNOWN NUMBER OF COMPONENTS:

More information

Bayesian Modelling and Inference on Mixtures of Distributions Modelli e inferenza bayesiana per misture di distribuzioni

Bayesian Modelling and Inference on Mixtures of Distributions Modelli e inferenza bayesiana per misture di distribuzioni Bayesian Modelling and Inference on Mixtures of Distributions Modelli e inferenza bayesiana per misture di distribuzioni Jean-Michel Marin CEREMADE Université Paris Dauphine Kerrie L. Mengersen QUT Brisbane

More information

ComputationalToolsforComparing AsymmetricGARCHModelsviaBayes Factors. RicardoS.Ehlers

ComputationalToolsforComparing AsymmetricGARCHModelsviaBayes Factors. RicardoS.Ehlers ComputationalToolsforComparing AsymmetricGARCHModelsviaBayes Factors RicardoS.Ehlers Laboratório de Estatística e Geoinformação- UFPR http://leg.ufpr.br/ ehlers ehlers@leg.ufpr.br II Workshop on Statistical

More information