An Application of MCMC Methods and of Mixtures of Laws for the Determination of the Purity of a Product.
Julien Cornebise & Myriam Maumy

Julien Cornebise: E.S.I-E-A, 72 avenue Maurice Thorez, IVRY SUR SEINE, France, and LSTA, Université Pierre et Marie Curie - Paris VI, 175 rue du Chevaleret, PARIS, France (cornebis@et.esiea.fr)

Myriam Maumy: IRMA, Université Louis Pasteur - Strasbourg I, 7 rue René Descartes, STRASBOURG Cedex, France (mmaumy@math.u-strasbg.fr)

Abstract

The aim of this article is to illustrate practical issues that may arise when applying MCMC methods to a real case. MCMC methods are used on a mixture-of-distributions model in order to determine honesty specifications for a certain kind of product. First, assuming a known number of components, the parameters of each component are estimated using Gibbs sampling. Convergence and label switching are taken into account. Then, the determination of the number of components is addressed, through model selection with Bayes factors.

Résumé

The aim of this article is to illustrate practical questions that can arise when MCMC methods are applied to a real case. These methods are used on a mixture-of-distributions model, in order to determine honesty specifications for a certain type of product. Working first with a fixed number of components, the parameters of each component are estimated with the Gibbs sampler. Convergence and label switching are taken into account. Finally, the determination of the number of components is considered, using Bayes factors.

1. The Problem and its Modelization

A particular kind of product, available in stores almost everywhere, is subject to different kinds of adulteration (or fraud). Some products are pure and thus have low glucose and xylose rates, while others have been chemically modified (adjunction of extra components), which reduces their quality and increases their glucose and xylose rates.
Given a set of 1002 observations of the couple (glucose rate, xylose rate), can we determine: (i) the number K of kinds of production (K − 1 different possible modifications, plus one for pure products) and their parameters (mean, standard deviation); (ii) the proportion of each kind of production; and (iii) a way to detect whether a product
is pure or not (e.g. the 99% quantile of the distribution of the glucose rate for the honest kind of production)? We consider here only the univariate case, i.e. glucose and xylose rates separately. The approach could later be generalised to the two-dimensional variable. We model the distribution of the glucose (resp. the xylose) as a mixture of normal laws. We consider that the T = 1002 observations of the sample come from K distinct populations (1 honest and K − 1 modified), each population k ∈ {1, ..., K} following a normal law of density f_k and of parameters θ_k = (μ_k, σ_k). The density of an observation x_i, 1 ≤ i ≤ T, is

[x_i | π, θ] = Σ_{k=1}^K π_k f_k(x_i | θ_k),

where f_k(· | θ_k) is the density of a normal law of parameters θ_k = (μ_k, σ_k), and π_k is the probability of belonging to population k, with Σ_{k=1}^K π_k = 1. The parameters of interest are θ = (θ_1, ..., θ_K) and π = (π_1, ..., π_K). The matter of choosing the value of K will be addressed later. The hypothesis of normality may (and should) be reconsidered in a second step (e.g. a log-normal distribution). Other parametrisations for mixtures of normal laws have been published, for example by Robert in Droesbeke et al. (2002); this one avoids some label-switching troubles (see below), but loses the natural physical interpretation, which is very useful here. The parameters θ_k of each law are considered as random variables with their own distribution, so the notation [·] is used both for the conditional density and for the parametrized density. The distribution of density [θ, π] is the prior distribution, and its parameters are the hyperparameters. They make it possible to take into account, mathematically, the prior knowledge of the experts of the field, if available (here, the potential information held by the chemists). The general aim of the study is to estimate a function F(θ, π) of the real values of the parameters.
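As a minimal sketch (not from the paper; the number of components, weights, and parameter values below are purely illustrative), the mixture density above can be evaluated numerically with the standard library only:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x (sigma is the standard deviation)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, pi, theta):
    """[x | pi, theta] = sum_k pi_k f_k(x | theta_k) for a normal mixture."""
    return sum(p * normal_pdf(x, mu, sigma) for p, (mu, sigma) in zip(pi, theta))

# Illustrative K = 2 mixture: one "pure" and one "modified" population.
pi = [0.7, 0.3]
theta = [(1.0, 0.2), (3.0, 0.5)]
density = mixture_pdf(2.0, pi, theta)
```

Since the weights sum to one and each component integrates to one, the mixture is itself a proper density.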
This can be done by calculating E[F(θ, π) | x] = ∫ F(θ, π) [θ, π | x] d(θ, π), where [θ, π | x] is the density of the posterior distribution, i.e. the distribution of the parameters conditionally on the observations x = (x_1, ..., x_T). It can be computed via the Bayes formula: [θ, π | x] = [x | θ, π] [θ, π] / [x], where [x] is the marginal density of the observations, which does not depend on the parameters and can thus be treated as a normalizing constant. The first question is therefore to choose the prior distribution of the parameters, [θ, π]. Before going any further, we introduce in the model above the hidden variable z_i, i ∈ {1, ..., T}, which is not observed and is thus called a latent variable. z_i ∈ {1, ..., K} indicates the population of origin of the observation x_i, and z = (z_1, ..., z_T). The z_i are i.i.d., with

[z_i = k | π] = π_k,    [x_i | θ, π, z_i = k] = N(x_i | μ_k, σ_k),

where N(· | μ, σ) is the density of a univariate normal law. The analysis of mixture distributions by MCMC methods has been the subject of many publications, for example Diebolt and Robert (1990), Richardson and Green (1997), Stephens (1997), and Marin et al. (to appear).

2. Choosing the Prior and its Hyperparameters

The choice of the prior distribution of the parameters is always a much discussed point
in any Bayesian analysis. Several cases may arise. Either the experts of the field have valuable information about the distribution of the parameters (for example, the mean of each component is approximately known), or they have no information at all (or it should be ignored, for blind validation purposes). In the latter case, we can either use an empirical prior, i.e. hyperparameters built from the data, or a non-informative prior, i.e. a prior carrying no information at all, or a poorly informative prior, i.e. one carrying as little information as possible. Moreover, the prior is often chosen in a closed-by-sampling, or conjugate, family, i.e. such that conditioning on the sample (passing from the prior to the posterior distribution) only results in a change of the hyperparameters, not in a change of family. This simplifies implementation. This part of the Bayesian analysis is the most subjective one. A sensitivity analysis should be done after the study, to test the stability of the results with respect to the choices of hyperparameters and prior. Here we have chosen the following model, which often arises, because each law belongs to a closed-by-sampling family:

π = (π_1, ..., π_K) ~ Di(a_1, ..., a_K),    μ_k | σ_k ~ N(m_k, σ_k / √c_k),    σ_k² ~ IG(α_k, β_k),

where Di is a Dirichlet law and IG an Inverse Gamma. The hyperparameters are a_k, m_k, c_k, α_k, and β_k, k ∈ {1, ..., K}. For a non-informative prior, we would need a uniform law for π, which can be obtained with a_1 = ... = a_K = 1; a Dirac density for σ_k², which can be obtained with limit values of α_k and β_k; and a flat density on R for each μ_k, which is more difficult to obtain with limit values of c_k. In practice, a non-informative prior is not so easy to deal with: its implementation under Matlab or Scilab, for example, is not always possible, due to infinite values of the parameters or to degenerate densities. Therefore, a poorly informative prior, or even an empirical prior, should be used. This is what we have done here.
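As an illustrative sketch (the hyperparameter values below are placeholders, not the paper's choices), one draw from this conjugate prior can be obtained with standard-library generators; the Dirichlet draw uses the classical normalized-Gamma construction, and the Inverse Gamma is drawn by inverting a Gamma variate:

```python
import math
import random

def sample_prior(a, m, c, alpha, beta, rng=random):
    """Draw (pi, mu, sigma2) once from the conjugate prior:
    pi ~ Di(a_1, ..., a_K), sigma_k^2 ~ IG(alpha_k, beta_k),
    mu_k | sigma_k ~ N(m_k, sigma_k / sqrt(c_k))."""
    K = len(a)
    # Dirichlet via normalized Gamma(a_k, 1) draws.
    g = [rng.gammavariate(a_k, 1.0) for a_k in a]
    pi = [gk / sum(g) for gk in g]
    # Inverse-Gamma(alpha, beta): if G ~ Gamma(alpha, 1), then beta / G ~ IG(alpha, beta).
    sigma2 = [beta[k] / rng.gammavariate(alpha[k], 1.0) for k in range(K)]
    # mu_k given sigma_k, with standard deviation sigma_k / sqrt(c_k).
    mu = [rng.gauss(m[k], math.sqrt(sigma2[k] / c[k])) for k in range(K)]
    return pi, mu, sigma2

# Illustrative hyperparameters for K = 3 components.
pi, mu, sigma2 = sample_prior(a=[1, 1, 1], m=[0.0] * 3, c=[1.0] * 3,
                              alpha=[3.0] * 3, beta=[2.0] * 3)
```

With a_1 = ... = a_K = 1, the Dirichlet draw is uniform over the simplex, matching the non-informative choice for π discussed above.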
We have chosen: for all k ∈ {1, ..., K}, a_k = 1, c_k = 1, m_k = x̄ (the empirical mean), α_k = K, and β_k = ((K − 1)/T) Σ_{i=1}^T (x_i − x̄)². Thus, the proportions are non-informative, the mean of each component equals the sample's, and the mean of σ_k² equals the empirical variance (E[σ_k²] = β_k / (α_k − 1) for the Inverse Gamma). For reasons that will become clear further on (related to label switching and Bayes factors), we have chosen the same hyperparameters for each of the K components: this maintains the symmetry of the density (and therefore of the likelihood). The choice of the prior is clearly the weakest, the most arguable, and the most argued point of the analysis. Each of the many approaches has its good and bad points. The approach chosen here is neither the most rigorous nor the purest, but it allows easy implementation. Further discussions on the choice of the prior can be found, for example, in Droesbeke et al. (2002) and Gelman et al. (2003).

3. Gibbs Sampling, Complete Conditional Laws

The evaluation of the expectation E[F(θ, π) | x] is hard to achieve, either analytically or numerically, due either to its complex expression or to its highly multidimensional
nature. MCMC methods are a good way to solve this problem. The principle of Monte Carlo methods is to use N independent realizations (θ^(i), π^(i)) of random variables following the posterior distribution [θ, π | x], and to approximate

∫ F(θ, π) [θ, π | x] d(θ, π) ≈ (1/N) Σ_{i=1}^N F(θ^(i), π^(i)).

Here F will first be the identity function, to estimate the parameters, and then the 99%-quantile function of the component corresponding to the pure product. In this last case, in order to lower the producer risk, some practitioners advise using the empirical 95%-quantile of the sampled values F(θ^(i), π^(i)), instead of their empirical mean. The Markov chain part of MCMC methods solves the problem of sampling from the posterior distribution. Different algorithms exist; the two main ones are Metropolis-Hastings and the Gibbs sampler. We have used the latter, which is quite straightforward to implement. Its aim is to sample from the joint posterior distribution of (θ, π). It is based on the complete conditional laws. The general principle is the following. Let θ = (θ_1, ..., θ_n) be the vector of the parameters of the model; here, θ = (θ, π) = (μ_1, σ_1, ..., μ_K, σ_K, π_1, ..., π_K). Let θ_(−i) = (θ_1, ..., θ_{i−1}, θ_{i+1}, ..., θ_n) be the vector θ without its i-th component. The density of the complete conditional law of θ_i is [θ_i | θ_(−i), x], whose closed form is often known. Let us assume that we know all the complete conditional laws [θ_i | θ_(−i), x], i ∈ {1, ..., n}, and take arbitrary initial values θ^(0) = (θ_1^(0), ..., θ_n^(0)). The Gibbs sampling algorithm consists in successively sampling

θ_1^(j) from [θ_1 | θ_2^(j−1), θ_3^(j−1), ..., θ_n^(j−1), x],
θ_2^(j) from [θ_2 | θ_1^(j), θ_3^(j−1), ..., θ_n^(j−1), x],
...,
θ_n^(j) from [θ_n | θ_1^(j), θ_2^(j), ..., θ_{n−1}^(j), x],

for j ∈ {1, ..., N + M}, where M is a number of iterations that will be discarded. These are called burn-in iterations, and correspond to the time before convergence.
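This scheme can be sketched for the normal mixture, using the standard complete conditional laws (those listed next in the text). This is a minimal, stdlib-only illustration, not the authors' code; the function name, initialization, and tuning values are our own assumptions.

```python
import math
import random

def gibbs_mixture(x, K, hyper, n_iter=200, burn_in=100, seed=0):
    """Gibbs sampler for a univariate K-component normal mixture under the
    conjugate prior of Section 2; returns post-burn-in draws of (pi, mu, sigma2)."""
    rng = random.Random(seed)
    a, c, m, alpha, beta = (hyper[k] for k in ("a", "c", "m", "alpha", "beta"))
    T = len(x)
    # Arbitrary initial values: means spread over the data range, common variance.
    mu = [min(x) + (max(x) - min(x)) * (k + 1) / (K + 1) for k in range(K)]
    xbar = sum(x) / T
    sigma2 = [sum((xi - xbar) ** 2 for xi in x) / T] * K
    pi = [1.0 / K] * K
    draws = []
    for it in range(n_iter):
        # 1. Allocations: [z_i = k | ...] proportional to pi_k N(x_i | mu_k, sigma_k).
        z = []
        for xi in x:
            w = [pi[k] * math.exp(-0.5 * (xi - mu[k]) ** 2 / sigma2[k])
                 / math.sqrt(sigma2[k]) for k in range(K)]  # up to a constant
            u, cum, zi = rng.random() * sum(w), 0.0, K - 1
            for k in range(K):
                cum += w[k]
                if u <= cum:
                    zi = k
                    break
            z.append(zi)
        n = [z.count(k) for k in range(K)]
        s = [sum(x[i] for i in range(T) if z[i] == k) for k in range(K)]
        # 2. Weights: pi | ... ~ Di(a_1 + n_1, ..., a_K + n_K), via Gamma draws.
        g = [rng.gammavariate(a[k] + n[k], 1.0) for k in range(K)]
        pi = [gk / sum(g) for gk in g]
        # 3. Means and variances, component by component.
        for k in range(K):
            post_mean = (m[k] * c[k] + s[k]) / (c[k] + n[k])
            mu[k] = rng.gauss(post_mean, math.sqrt(sigma2[k] / (c[k] + n[k])))
            shape = alpha[k] + (n[k] + 1) / 2.0
            rate = beta[k] + 0.5 * (c[k] * (mu[k] - m[k]) ** 2
                   + sum((x[i] - mu[k]) ** 2 for i in range(T) if z[i] == k))
            sigma2[k] = rate / rng.gammavariate(shape, 1.0)  # Inverse-Gamma draw
        if it >= burn_in:
            draws.append((list(pi), list(mu), list(sigma2)))
    return draws
```

On well-separated synthetic data, the post-burn-in means of the mu draws recover the component locations; on real data, label switching (Section 5) must be handled before reading the draws component-wise.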
It can be shown that θ^(M) = (θ_1^(M), ..., θ_n^(M)) converges in law to the joint posterior distribution [θ_1, ..., θ_n | x]. The following N values are then considered as a sample from this distribution, and can be used to approximate F(θ) by an empirical mean, as mentioned above. In our model, we have the following complete conditional laws (to simplify the notations, the list of all parameters but θ is denoted (θ)):

z_i | (z_i), x ~ Mu(p_{i1}, ..., p_{iK}),  with p_{ik} proportional to π_k N(x_i | μ_k, σ_k),
π | (π), x ~ Di(a_1 + n_1, ..., a_K + n_K),  where n_k = #{i : z_i = k},
μ_k | (μ_k), x ~ N( (m_k c_k + s_k) / (c_k + n_k), σ_k / √(n_k + c_k) ),  where s_k = Σ_{i : z_i = k} x_i,
σ_k² | (σ_k), x ~ IG( α_k + (n_k + 1)/2, β_k + (1/2) ( c_k (μ_k − m_k)² + Σ_{i : z_i = k} (x_i − μ_k)² ) ).

Now the sampling can take place, for example with M = 5000 burn-in iterations. Convergence issues are discussed below. Details on the Metropolis-Hastings algorithm and variants of the Gibbs sampler can be found in Droesbeke et al. (2002), in Richardson and Green (1997), or in Marin et al. (2004).

4. Convergence Issues
Since the historically first convergence diagnostic (known as the "thick pen" one, Gelfand and Hills (1990)), two main kinds of methods have existed to determine whether the sampler has reached convergence or not, i.e. whether the values of the current generation can be considered as a sample from the target distribution: single-chain convergence diagnostics, and multi-chain ones. Single-chain diagnostics have been clearly advised against by Gelman and Rubin (1992). The principle of multi-chain diagnostics is to run multiple independent chains from very different starting points, and to test whether the last values of each chain come from the same distribution or not. If that is the case, we can assume that convergence has been reached. Gelman and Rubin (1992) have proposed a method which is often used, based on the comparison of within-chain and between-chain variances. At the beginning of the sampling, the chains are much influenced by their starting points, and the between-chain variance is well above the within-chain one. When each chain has reached the target distribution, the ratio of between-chain to within-chain variance should be around 1. This method has been much improved since then, and some more subtle tests are available. It is quite straightforward to implement (we do not detail it here, to save space), and has proven efficient in many cases. But here we encounter an important problem that makes it almost impossible to use: the label-switching problem.

5. The Label Switching Problem

A particularity of mixtures of laws is that the density of the observations, as well as the likelihood and the joint posterior density of the parameters (which is the target distribution of the Gibbs sampler), is symmetric, i.e. invariant under permutation of the components. Therefore, this last density has up to K! duplications of each mode, and the sampler can move freely from one mode to another, thus permuting the components.
The absence of label switching means that the sampler is stuck in a local mode, maybe because the modes are well separated (e.g. when K is very low). The space of parameters is then not completely explored, which is not sound. When estimating a function F(θ, π) which is itself invariant under permutation of the components (e.g. estimating the density at a given point), label switching is not a problem. But when trying to estimate the parameters of each component, this label switching has to be undone. Two approaches come to mind: imposing an identification constraint during the sampling, or post-processing the sampled values to undo the permutation. The first solution constrains the exploration of the parameter space by the sampler, and thus may alter the results; see Celeux and Robert (2000), Marin et al. (2004), and Stephens (2000). Forcing the prior to be highly separated (using very different hyperparameters for the components) is not a good idea either: label switching arises anyway, and the resulting lack of correspondence between prior and component would corrupt any interpretation based on the prior (such as Bayes factors). We applied here the second solution. The post-processing must be done very carefully: ordering on μ_k, or on σ_k, or on π_k, is not a good idea. Some components may be close in
mean but not in variance, and vice versa. Some methods use a loss function and clustering algorithms (e.g. K-means with K! classes) to determine to which mode (i.e. permutation) each sampled vector of parameters belongs. The reader is invited to consult the three references above for more details. Label switching is, alas, incompatible with convergence diagnostics such as those mentioned in Section 4: comparing the variance between chains is meaningless when components can swap! Moreover, clustering assumes that convergence has been reached; it would thus be nonsense to use variance-based diagnostics on post-processed samples.

6. Choosing a Model: Bayes Factors

Until now, we have worked with a given number K of components. The question is now to choose between different models. Let M = {M_1, ..., M_K} be a finite set of possible models (each one with k components, up to K; K = 7 in our application) to explain the observations x_i, parametrized by θ_k. The best model among the K possible ones is the one with the highest posterior probability. The posterior probability of model M_k is obtained via the Bayes formula:

[M_k | x] = [M_k] [x | M_k] / Σ_{M_k' ∈ M} [M_k'] [x | M_k'],

where [M_k] is a prior probability of the validity of M_k, with Σ_{M_k ∈ M} [M_k] = 1 (e.g. [M_k] = 1/K for all k), and [x | M_k], defined by [x | M_k] = ∫ [x | θ_k] [θ_k] dθ_k, is the prior predictive law of x under model M_k. Let us consider M_1 and M_2. The ratio of posterior probabilities, [M_2 | x] / [M_1 | x] = ([M_2] [x | M_2]) / ([M_1] [x | M_1]), is the posterior odds in favor of M_2 compared to M_1. The ratio B_21 = [x | M_2] / [x | M_1], which updates the prior odds [M_2] / [M_1], is called the Bayes factor of model M_2 relative to M_1. Kass and Raftery (1995) suggest a scale for interpreting Bayes factors, which can serve as a first indication but is far from being universal. The evaluation of Bayes factors requires integration, which can be done with MCMC methods.
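Once the marginal likelihoods [x | M_k] are estimated (by whatever method), turning them into posterior model probabilities is a direct application of the formula above. A small sketch, working on the log scale for numerical stability (the marginal-likelihood values in the example are illustrative, not from the application):

```python
import math

def posterior_model_probs(log_marglik, log_prior=None):
    """[M_k | x] = [M_k][x | M_k] / sum_k' [M_k'][x | M_k'], computed from
    log marginal likelihoods with the log-sum-exp trick."""
    K = len(log_marglik)
    if log_prior is None:                      # uniform prior [M_k] = 1/K
        log_prior = [-math.log(K)] * K
    log_post = [lm + lp for lm, lp in zip(log_marglik, log_prior)]
    mx = max(log_post)                         # subtract max before exponentiating
    w = [math.exp(lp - mx) for lp in log_post]
    return [wi / sum(w) for wi in w]

# The Bayes factor B_21 comes from the same quantities:
# B_21 = [x | M_2] / [x | M_1] = exp(log_marglik[1] - log_marglik[0]).
probs = posterior_model_probs([-1520.3, -1498.7, -1501.2])  # illustrative values
```

Working with log marginal likelihoods is essential in practice: for T = 1002 observations, the likelihoods themselves underflow double precision.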
Nevertheless, sampling from the prior distribution [θ_k] would not be efficient, because the prior is often very flat. Kass and Raftery (1995) advise using another method, based on samples from the posterior distribution [θ_k | x], making once more use of the Bayes formula. It implies continuing the Gibbs sampler, setting each parameter one after the other to its estimate, as described in Chib (1995) and Carlin and Chib (1995). There too, label switching is a problem, and that is why other methods, such as reversible jump or birth-death processes, are more and more often used, as in Stephens (2000).

7. Possible Improvements, Conclusion

We have tried here to illustrate some basic techniques of MCMC Bayesian analysis. They are quite straightforward to implement, and easy to use in a first approach of the problem. Nevertheless, they suffer from some particular drawbacks, such as label switching, which hinders convergence diagnostics and even the straightforward estimation of the parameters of interest. That is why many more efficient developments have been achieved in recent years. However, this kind of algorithm still deals better with this real-case dataset than the other algorithms we have tried on it (such as the EM algorithm).
Many points in our work remain to be improved: the hypothesis of normality should be reconsidered (log-normality, maybe), as well as the choice of the prior, the sensitivity analysis, and the convergence diagnostics. This illustrates the difficulties that practitioners may face before benefiting from the powerful tools that MCMC Bayesian methods are.

The authors would like to thank Philippe Girard (Nestlé Research Center, Lausanne, Switzerland), Eric Parent (ENGREF, Paris, France), and Luc Perrault (Hydro-Québec, Varennes, Québec, Canada) for their help and support.

Bibliography

[1] B.P. Carlin and S. Chib (1995). Bayesian model choice via Markov chain Monte Carlo methods. J. R. Statist. Soc. B, 57.
[2] G. Celeux, M. Hurn, and C. Robert (2000). Computational and inferential difficulties with mixture posterior distributions. J. Am. Stat. Assoc., 95.
[3] S. Chib (1995). Marginal likelihood from the Gibbs output. J. Am. Stat. Assoc., 90.
[4] J. Diebolt and C.P. Robert (1990). Bayesian estimation of finite mixture distributions, Part I: theoretical aspects. Technical Report 110, LSTA, Université Paris VI.
[5] J.J. Droesbeke, J. Fine, and G. Saporta, editors (2002). Méthodes Bayésiennes en statistique. Technip.
[6] A.E. Gelfand, S.E. Hills, A. Racine-Poon, and A.F.M. Smith (1990). Illustration of Bayesian inference in normal data models using Gibbs sampling. J. Am. Stat. Assoc., 85.
[7] A. Gelman and D.B. Rubin (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7.
[8] A. Gelman and D.B. Rubin (1992). A single series from the Gibbs sampler provides a false sense of security (with discussion). In J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, editors, Bayesian Statistics 4. Oxford University Press.
[9] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin (2003). Bayesian Data Analysis. Chapman and Hall.
[10] R.E. Kass and A.E. Raftery (1995). Bayes factors. J. Am. Stat. Assoc., 90.
[11] J.M. Marin, K. Mengersen, and C.P. Robert (to appear). Bayesian modelling and inference on mixtures of distributions. In D. Dey and C.R. Rao, editors, Handbook of Statistics 25. Elsevier-Sciences.
[12] S. Richardson and P.J. Green (1997). On Bayesian analysis of mixtures with an unknown number of components. J. R. Statist. Soc. B, 59(4).
[13] M. Stephens (1997). Bayesian Methods for Mixtures of Normal Distributions. PhD thesis, Magdalen College, Oxford.
[14] M. Stephens (2000). Bayesian analysis of mixtures with an unknown number of components: an alternative to reversible jump methods. Annals of Statistics.
[15] M. Stephens (2000). Dealing with label switching in mixture models. J. R. Stat. Soc. B, 62.
Christian P. Robert The Bayesian Choice From Decision-Theoretic Foundations to Computational Implementation Second Edition With 23 Illustrations ^Springer" Contents Preface to the Second Edition Preface
More informationSTAT 425: Introduction to Bayesian Analysis
STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte
More informationSubmitted to the Brazilian Journal of Probability and Statistics
Submitted to the Brazilian Journal of Probability and Statistics A Bayesian sparse finite mixture model for clustering data from a heterogeneous population Erlandson F. Saraiva a, Adriano K. Suzuki b and
More informationBayesian Phylogenetics:
Bayesian Phylogenetics: an introduction Marc A. Suchard msuchard@ucla.edu UCLA Who is this man? How sure are you? The one true tree? Methods we ve learned so far try to find a single tree that best describes
More informationBLIND SEPARATION OF TEMPORALLY CORRELATED SOURCES USING A QUASI MAXIMUM LIKELIHOOD APPROACH
BLID SEPARATIO OF TEMPORALLY CORRELATED SOURCES USIG A QUASI MAXIMUM LIKELIHOOD APPROACH Shahram HOSSEII, Christian JUTTE Laboratoire des Images et des Signaux (LIS, Avenue Félix Viallet Grenoble, France.
More informationLecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH
Lecture 5: Spatial probit models James P. LeSage University of Toledo Department of Economics Toledo, OH 43606 jlesage@spatial-econometrics.com March 2004 1 A Bayesian spatial probit model with individual
More informationSAMPLING ALGORITHMS. In general. Inference in Bayesian models
SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be
More informationBAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA
BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA Intro: Course Outline and Brief Intro to Marina Vannucci Rice University, USA PASI-CIMAT 04/28-30/2010 Marina Vannucci
More informationBayesian nonparametrics
Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability
More informationSTAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01
STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 Nasser Sadeghkhani a.sadeghkhani@queensu.ca There are two main schools to statistical inference: 1-frequentist
More informationMH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution
MH I Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution a lot of Bayesian mehods rely on the use of MH algorithm and it s famous
More informationHeriot-Watt University
Heriot-Watt University Heriot-Watt University Research Gateway Prediction of settlement delay in critical illness insurance claims by using the generalized beta of the second kind distribution Dodd, Erengul;
More informationStat 516, Homework 1
Stat 516, Homework 1 Due date: October 7 1. Consider an urn with n distinct balls numbered 1,..., n. We sample balls from the urn with replacement. Let N be the number of draws until we encounter a ball
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate
More informationSupplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements
Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model
More informationBayes: All uncertainty is described using probability.
Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w
More informationItem Parameter Calibration of LSAT Items Using MCMC Approximation of Bayes Posterior Distributions
R U T C O R R E S E A R C H R E P O R T Item Parameter Calibration of LSAT Items Using MCMC Approximation of Bayes Posterior Distributions Douglas H. Jones a Mikhail Nediak b RRR 7-2, February, 2! " ##$%#&
More informationIntroduction to Markov Chain Monte Carlo & Gibbs Sampling
Introduction to Markov Chain Monte Carlo & Gibbs Sampling Prof. Nicholas Zabaras Sibley School of Mechanical and Aerospace Engineering 101 Frank H. T. Rhodes Hall Ithaca, NY 14853-3801 Email: zabaras@cornell.edu
More informationProbabilistic Graphical Models Lecture 17: Markov chain Monte Carlo
Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,
More informationPenalized Loss functions for Bayesian Model Choice
Penalized Loss functions for Bayesian Model Choice Martyn International Agency for Research on Cancer Lyon, France 13 November 2009 The pure approach For a Bayesian purist, all uncertainty is represented
More informationStatistical Inference for Stochastic Epidemic Models
Statistical Inference for Stochastic Epidemic Models George Streftaris 1 and Gavin J. Gibson 1 1 Department of Actuarial Mathematics & Statistics, Heriot-Watt University, Riccarton, Edinburgh EH14 4AS,
More informationA Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait
A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute
More informationThe Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations
The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations John R. Michael, Significance, Inc. and William R. Schucany, Southern Methodist University The mixture
More informationModel-based cluster analysis: a Defence. Gilles Celeux Inria Futurs
Model-based cluster analysis: a Defence Gilles Celeux Inria Futurs Model-based cluster analysis Model-based clustering (MBC) consists of assuming that the data come from a source with several subpopulations.
More informationMarkov-Chain Monte Carlo
Markov-Chain Monte Carlo CSE586 Computer Vision II Spring 2010, Penn State Univ. References Recall: Sampling Motivation If we can generate random samples x i from a given distribution P(x), then we can
More informationGibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation
Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation Ian Porteous, Alex Ihler, Padhraic Smyth, Max Welling Department of Computer Science UC Irvine, Irvine CA 92697-3425
More informationOutline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution
Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model
More information16 : Approximate Inference: Markov Chain Monte Carlo
10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution
More informationBayesian inference for multivariate extreme value distributions
Bayesian inference for multivariate extreme value distributions Sebastian Engelke Clément Dombry, Marco Oesting Toronto, Fields Institute, May 4th, 2016 Main motivation For a parametric model Z F θ of
More informationBayesian Inference and Decision Theory
Bayesian Inference and Decision Theory Instructor: Kathryn Blackmond Laskey Room 2214 ENGR (703) 993-1644 Office Hours: Tuesday and Thursday 4:30-5:30 PM, or by appointment Spring 2018 Unit 6: Gibbs Sampling
More informationPOSTERIOR ANALYSIS OF THE MULTIPLICATIVE HETEROSCEDASTICITY MODEL
COMMUN. STATIST. THEORY METH., 30(5), 855 874 (2001) POSTERIOR ANALYSIS OF THE MULTIPLICATIVE HETEROSCEDASTICITY MODEL Hisashi Tanizaki and Xingyuan Zhang Faculty of Economics, Kobe University, Kobe 657-8501,
More informationWhat is the most likely year in which the change occurred? Did the rate of disasters increase or decrease after the change-point?
Chapter 11 Markov Chain Monte Carlo Methods 11.1 Introduction In many applications of statistical modeling, the data analyst would like to use a more complex model for a data set, but is forced to resort
More informationMarkov Chain Monte Carlo (MCMC)
Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can
More informationMonte Carlo Integration using Importance Sampling and Gibbs Sampling
Monte Carlo Integration using Importance Sampling and Gibbs Sampling Wolfgang Hörmann and Josef Leydold Department of Statistics University of Economics and Business Administration Vienna Austria hormannw@boun.edu.tr
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters
More informationDoing Bayesian Integrals
ASTR509-13 Doing Bayesian Integrals The Reverend Thomas Bayes (c.1702 1761) Philosopher, theologian, mathematician Presbyterian (non-conformist) minister Tunbridge Wells, UK Elected FRS, perhaps due to
More informationST 740: Markov Chain Monte Carlo
ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:
More informationStat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC
Stat 451 Lecture Notes 07 12 Markov Chain Monte Carlo Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapters 8 9 in Givens & Hoeting, Chapters 25 27 in Lange 2 Updated: April 4, 2016 1 / 42 Outline
More informationA generalization of the Multiple-try Metropolis algorithm for Bayesian estimation and model selection
A generalization of the Multiple-try Metropolis algorithm for Bayesian estimation and model selection Silvia Pandolfi Francesco Bartolucci Nial Friel University of Perugia, IT University of Perugia, IT
More informationWeakly informative reparameterisations for location-scale mixtures
Weakly informative reparameterisations for location-scale mixtures arxiv:1601.01178v4 [stat.me] 31 Jul 2017 Kaniav Kamary Université Paris-Dauphine, CEREMADE, and INRIA, Saclay Jeong Eun Lee Auckland University
More informationCSC 2541: Bayesian Methods for Machine Learning
CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 4 Problem: Density Estimation We have observed data, y 1,..., y n, drawn independently from some unknown
More informationBayesian Sparse Correlated Factor Analysis
Bayesian Sparse Correlated Factor Analysis 1 Abstract In this paper, we propose a new sparse correlated factor model under a Bayesian framework that intended to model transcription factor regulation in
More informationMixture models. Mixture models MCMC approaches Label switching MCMC for variable dimension models. 5 Mixture models
5 MCMC approaches Label switching MCMC for variable dimension models 291/459 Missing variable models Complexity of a model may originate from the fact that some piece of information is missing Example
More informationBayesian Nonparametric Regression for Diabetes Deaths
Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,
More informationStat 535 C - Statistical Computing & Monte Carlo Methods. Lecture February Arnaud Doucet
Stat 535 C - Statistical Computing & Monte Carlo Methods Lecture 13-28 February 2006 Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Limitations of Gibbs sampling. Metropolis-Hastings algorithm. Proof
More informationAnalysis of the Gibbs sampler for a model. related to James-Stein estimators. Jeffrey S. Rosenthal*
Analysis of the Gibbs sampler for a model related to James-Stein estimators by Jeffrey S. Rosenthal* Department of Statistics University of Toronto Toronto, Ontario Canada M5S 1A1 Phone: 416 978-4594.
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More informationBayesian Methods in Multilevel Regression
Bayesian Methods in Multilevel Regression Joop Hox MuLOG, 15 september 2000 mcmc What is Statistics?! Statistics is about uncertainty To err is human, to forgive divine, but to include errors in your design
More informationBayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference
Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference Osnat Stramer 1 and Matthew Bognar 1 Department of Statistics and Actuarial Science, University of
More informationAdvanced Statistical Modelling
Markov chain Monte Carlo (MCMC) Methods and Their Applications in Bayesian Statistics School of Technology and Business Studies/Statistics Dalarna University Borlänge, Sweden. Feb. 05, 2014. Outlines 1
More informationLecture 6: Markov Chain Monte Carlo
Lecture 6: Markov Chain Monte Carlo D. Jason Koskinen koskinen@nbi.ku.dk Photo by Howard Jackman University of Copenhagen Advanced Methods in Applied Statistics Feb - Apr 2016 Niels Bohr Institute 2 Outline
More informationBayesian time series classification
Bayesian time series classification Peter Sykacek Department of Engineering Science University of Oxford Oxford, OX 3PJ, UK psyk@robots.ox.ac.uk Stephen Roberts Department of Engineering Science University
More informationStatistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling
1 / 27 Statistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling Melih Kandemir Özyeğin University, İstanbul, Turkey 2 / 27 Monte Carlo Integration The big question : Evaluate E p(z) [f(z)]
More informationVariational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures
17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter
More informationIntroduction to Bayesian Methods
Introduction to Bayesian Methods Jessi Cisewski Department of Statistics Yale University Sagan Summer Workshop 2016 Our goal: introduction to Bayesian methods Likelihoods Priors: conjugate priors, non-informative
More information