
Label Switching and Its Simple Solutions for Frequentist Mixture Models

Weixin Yao
Department of Statistics, Kansas State University, Manhattan, Kansas 66506, U.S.A.
wxyao@ksu.edu

Abstract

The label switching problem for Bayesian mixtures has been extensively researched in recent years. However, much less attention has been paid to the label switching issue for frequentist mixture models. In this article, we discuss the label switching problem and the importance of solving it for frequentist mixture models when a simulation study or the bootstrap is used to evaluate the performance of mixture model estimators. We argue that many existing labeling methods for Bayesian mixtures cannot be simply applied to frequentist mixture models. Two new simple but effective labeling methods are proposed for frequentist mixture models. The new labeling methods can incorporate the information on the component labels of each sample, which is available in a simulation study or a parametric bootstrap for frequentist mixture models. Our empirical studies demonstrate that the newly proposed methods work well and provide better results than the traditionally used order constraint labeling. In addition, the simulation studies also demonstrate that the simple order constraint labeling can sometimes lead to severely biased and even meaningless estimates, and thus might provide misleading estimates of variation.

Key words: Complete likelihood; Label switching; Mixture models

1 Introduction

Label switching has long been known to be a challenging problem for Bayesian mixture modeling. It occurs due to the invariance of the mixture likelihood to permutations of the component labels. Many methods have been proposed to solve label switching for Bayesian mixtures. See, for example, Stephens (2000), Celeux, Hurn, and Robert (2000), Chung, Loken, and Schafer (2004), Geweke (2007), Yao and Lindsay (2009), Grün and Leisch (2009), Sperrin, Jaki, and Wit (2010), and Papastamoulis and Iliopoulos (2010). However, much less attention has been paid to the label switching issue for frequentist mixture models. (As far as we know, all label switching papers to date are devoted to Bayesian mixture models.) One of the reasons is that the label switching issue is more obvious and severe in Bayesian mixture analysis. In this article, we discuss the label switching problem and the importance of solving it for frequentist mixture models when a simulation study or the bootstrap is used to evaluate the performance of mixture model estimators. We argue that many existing labeling methods for Bayesian mixtures cannot be simply applied to frequentist mixture models. Two new simple but effective labeling methods are then proposed for frequentist mixture models.

Let $\mathbf{x} = (x_1, \ldots, x_n)$ be independent and identically distributed (iid) observations from a mixture density with $m$ components ($m$ is assumed to be known and finite):
$$p(x; \theta) = \pi_1 f(x; \lambda_1) + \pi_2 f(x; \lambda_2) + \cdots + \pi_m f(x; \lambda_m), \qquad (1.1)$$
where $\theta = (\pi_1, \ldots, \pi_{m-1}, \lambda_1, \ldots, \lambda_m)$, $\pi_j > 0$ for all $j$, and $\sum_{j=1}^m \pi_j = 1$. Then the likelihood function for $\mathbf{x} = (x_1, \ldots, x_n)$ is
$$L(\theta; \mathbf{x}) = \prod_{i=1}^n \{\pi_1 f(x_i; \lambda_1) + \pi_2 f(x_i; \lambda_2) + \cdots + \pi_m f(x_i; \lambda_m)\}. \qquad (1.2)$$

The maximum likelihood estimator (MLE) of $\theta$, obtained by maximizing (1.2), is straightforward to compute using the EM algorithm (Dempster et al., 1977). For a general introduction to mixture models, see, for example, Lindsay (1995), Böhning (1999), McLachlan and Peel (2000), and Frühwirth-Schnatter (2006).

For any permutation $\omega = (\omega(1), \ldots, \omega(m))$ of the identity permutation $(1, \ldots, m)$, define the corresponding permutation of the parameter vector $\theta$ by
$$\theta^{\omega} = (\pi_{\omega(1)}, \ldots, \pi_{\omega(m-1)}, \lambda_{\omega(1)}, \ldots, \lambda_{\omega(m)}).$$
A special feature of mixture models is that the likelihood function $L(\theta^{\omega}; \mathbf{x})$ is numerically the same as $L(\theta; \mathbf{x})$ for any permutation $\omega$. Hence if $\hat\theta$ is the MLE, $\hat\theta^{\omega}$ is also an MLE for any permutation $\omega$. If one is only interested in a point estimator, the MLE suffices for the purpose without any label switching problem. (Note that for Bayesian mixtures, label switching needs to be solved even for a point estimator.) However, if one wants to use a simulation study or a bootstrap approach to evaluate the variation of the MLE for mixture models, the label switching problem will similarly occur. Given a sequence of raw unlabeled estimates $(\hat\theta_1, \ldots, \hat\theta_N)$ of $\theta$, in order to measure their variation, one must first label these samples, i.e., find the labels $(\omega_1, \ldots, \omega_N)$ such that $(\hat\theta_1^{\omega_1}, \ldots, \hat\theta_N^{\omega_N})$ have the same label meaning. Without correct labels, the estimates tend to have serious bias and the estimated variation might also be misleading.

Theoretically, one may also estimate the variation of the parameter estimates by their asymptotic covariance matrix, computed by inverting the observed or expected information matrix at the MLE. In practice, however, this may be tedious analytically or computationally, although many computationally simpler methods have been proposed to estimate the observed information matrix. See, for example, Louis (1982), Meilijson (1989), Meng and Rubin (1991), and McLachlan and Krishnan (1997, Sect. 4.5).

However, it is well known that the estimates of the covariance matrix of the MLE based on the expected or observed information matrix are guaranteed to be inferentially valid only asymptotically. Basford, Greenway, McLachlan, and Peel (1997) compared the bootstrap and information-based approaches for some normal mixture models and found that unless the sample size was very large, the standard errors obtained by an information-based approach were too unstable to be recommended. Therefore, the bootstrap approach is usually preferred, and solving label switching is thus crucial.

One of the most commonly used solutions to label switching is to simply put an explicit parameter constraint on all the estimates so that only one permutation can satisfy it for each estimate. One main problem with order constraint labeling is that it can only use the label information of one component parameter at a time, which is not desirable when many component parameters can simultaneously provide label information. Another main problem with identifiability constraint labeling is the choice of constraint, especially for multivariate problems. Different order constraints may generate markedly different results, and it is difficult to anticipate the overall effect. In addition, as demonstrated by Stephens (2000), for Bayesian mixtures many choices of identifiability constraint do not completely remove the symmetry of the posterior distribution. As a result, the label switching problem may remain after imposing an identifiability constraint. Therefore, it is expected that many choices of identifiability constraint do not work well for frequentist mixture models either, which is also verified by our simulation studies in Section 3.

Many other labeling methods have been proposed for Bayesian mixture models. However, many of them depend on the special structure of Bayesian mixture models and cannot be directly applied to frequentist mixture models. For example, the popular relabeling algorithm (Stephens, 2000), based on the Kullback-Leibler divergence, needs to calculate the classification probabilities for the same set of observations for all sampled parameters.

However, in a frequentist simulation study, each parameter estimate is usually obtained from a different generated set of observations. The maximum a posteriori (MAP) labeling (Marin, Mengersen, and Robert, 2005) and the posterior modes associated labeling (Yao and Lindsay, 2009) both depend on the special structure of the posterior distribution. There are some exceptions, though, such as the normal likelihood based clustering method of Yao and Lindsay (2009) and the method of data-dependent priors (Chung, Loken, and Schafer, 2004), although more research is required on how to apply the latter method when the data are not univariate.

In this article, we propose two simple but effective labeling methods for frequentist mixture models. In a simulation study or a parametric bootstrap, the component labels of each observation are known. The proposed labeling methods try to make use of this valuable component label information. The first method does the labeling by maximizing the complete likelihood; the second does the labeling by minimizing the Euclidean distance between the classification probabilities and the latent true labels. Our empirical studies demonstrate that the newly proposed methods work well and provide better labeling results than the traditionally used order constraint labeling. It is well known that order constraint labeling methods often do not work well and cannot completely remove label switching for Bayesian mixture models. In this article, we use simulation studies to demonstrate similar undesirable results for frequentist mixture models, i.e., the simple order constraint labeling can sometimes lead to severely biased and even meaningless estimates, and thus might provide misleading estimates of variation for frequentist mixture models.

The structure of the paper is as follows. Section 2 introduces our new labeling methods. In Section 3, we use a simulation study to compare the proposed labeling methods with the traditionally used order constraint labeling. We summarize the proposed labeling methods and discuss some future research work in Section 4.

2 New Labeling Methods

Suppose $(\hat\theta_1, \ldots, \hat\theta_N)$ are $N$ raw unlabeled maximum likelihood estimates of the mixture model (1.1) in a simulation study or bootstrap procedure. Our objective is to find the labels $(\omega_1, \ldots, \omega_N)$ such that $(\hat\theta_1^{\omega_1}, \ldots, \hat\theta_N^{\omega_N})$ have the same meaning of component labels. Then we can use the labeled samples to evaluate the variation of the parameter estimates.

In a simulation study or a parametric bootstrap, the latent component labels of each observation are known. Our proposed new labeling methods try to make use of this valuable component label information. Suppose $\mathbf{x} = \{x_1, \ldots, x_n\}$ is a typical generated data set and the corresponding unlabeled MLE is $\hat\theta$. Define the latent variable $\mathbf{z} = \{z_{ij},\ i = 1, \ldots, n,\ j = 1, \ldots, m\}$, where
$$z_{ij} = \begin{cases} 1, & \text{if the } i\text{th observation } x_i \text{ is from the } j\text{th component}; \\ 0, & \text{otherwise}. \end{cases}$$

Complete likelihood based labeling: The first proposed method is to find the label $\omega$ for $\hat\theta$ by maximizing the complete likelihood of $(\mathbf{x}, \mathbf{z})$ over $\omega$,
$$L(\hat\theta^{\omega}; \mathbf{x}, \mathbf{z}) = \prod_{i=1}^n \prod_{j=1}^m \{\hat\pi_j^{\omega} f(x_i; \hat\lambda_j^{\omega})\}^{z_{ij}}, \qquad (2.1)$$
where $\hat\pi_j^{\omega} = \hat\pi_{\omega(j)}$ and $\hat\lambda_j^{\omega} = \hat\lambda_{\omega(j)}$. Unlike the mixture likelihood, the complete likelihood $L(\hat\theta; \mathbf{x}, \mathbf{z})$ is not invariant to permutations of the component labels, since the variable $\mathbf{z}$ carries the label information. Therefore $L(\hat\theta^{\omega}; \mathbf{x}, \mathbf{z})$ carries the label information of the parameter $\hat\theta^{\omega}$ and can be used to do the labeling. Here, we make use of the information in the latent variable $\mathbf{z}$ to break the permutation symmetry of the mixture likelihood.

Note that log{l(ˆθ ω ; y, z)} = = = where n i=1 n i=1 n i=1 m j=1 [ z ij log{ˆπ ω j f(x i ; ˆλ ω ] j )} (2.2) [ { m ˆπ ω j f(x i ; ˆλ ω }] j ) n m [ z ij log p(x i ; ˆθ ω + z ij log{p(x i ; ˆθ ω ] )} ) i=1 j=1 m z ij log p ij (ˆθ ω n ) + log p(x i ; ˆθ ω ) (2.3) j=1 j=1 i=1 p ij (θ ω ) = πω j f(x i ; λ ω j ) p(x i ; θ ω ) and p(x i ; θ ω ) = m j=1 πω j f(x i ; λ ω j ). Notice that the second term of (2.3) is log mixture likelihood and thus is invariant to the permutation of component labels of ˆθ. Therefore, we have the following result. Theorem 2.1. Maximizing L(ˆθ ω ; y, z) with respect to ω in (2.1) is equivalent to maximizing l 1 (ˆθ ω ; y, z) = n m z ij log p ij (ˆθ ω ), (2.4) i=1 j=1 which is equivalent to minimizing the Kullback-Leibler divergence if we consider z ij as the true classification probability and p ij (ˆθ) as the estimated classification. In practice, it is usually easier to work on (2.4) than (2.1), since the classification probabilities p ij (θ) is a byproduct of an EM algorithm. In addition, note that p ij (θ ω ) = p iω(j)(θ). Therefore, we don t need to find the classification probabilities for each permutation of θ and thus the computation of the complete likelihood labeling is usually very fast. Distance based labeling: The second method is to do labeling by minimizing the following Euclidian distance between the classification probabilities and the true latent 7

Distance based labeling: The second proposed method is to do the labeling by minimizing the following Euclidean distance between the classification probabilities and the true latent labels over $\omega$,
$$\ell_2(\hat\theta^{\omega}; \mathbf{x}, \mathbf{z}) = \sum_{i=1}^n \sum_{j=1}^m \{p_{ij}(\hat\theta^{\omega}) - z_{ij}\}^2. \qquad (2.5)$$
Here, we want to find the labels such that the estimated classification probabilities are as similar to the latent labels as possible based on the Euclidean distance. Note that
$$\sum_{i=1}^n \sum_{j=1}^m \{p_{ij}(\hat\theta^{\omega}) - z_{ij}\}^2 = \sum_{i=1}^n \sum_{j=1}^m \left[\{p_{ij}(\hat\theta^{\omega})\}^2 + z_{ij}^2\right] - 2 \sum_{i=1}^n \sum_{j=1}^m z_{ij}\, p_{ij}(\hat\theta^{\omega}).$$
The first part of the above formula is invariant to the labels. Thus minimizing the above Euclidean distance is equivalent to maximizing the second part. Therefore, we have the following result.

Theorem 2.2. Minimizing $\ell_2(\hat\theta^{\omega}; \mathbf{x}, \mathbf{z})$ with respect to $\omega$ in (2.5) is equivalent to maximizing
$$\ell_3(\hat\theta^{\omega}; \mathbf{x}, \mathbf{z}) = \sum_{i=1}^n \sum_{j=1}^m z_{ij}\, p_{ij}(\hat\theta^{\omega}). \qquad (2.6)$$

Note that the objective functions (2.4) and (2.6) are very similar, except that (2.4) applies a log transformation to $p_{ij}(\hat\theta^{\omega})$ while (2.6) does not.

Both of the proposed labeling methods can also be applied to the nonparametric bootstrap. Based on the MLE for the original sample $(x_1, \ldots, x_n)$, we can obtain the estimated classification probabilities $\{\hat p_{ij},\ i = 1, \ldots, n,\ j = 1, \ldots, m\}$. Then we can simply replace $z_{ij}$ in (2.4) by $\hat p_{ij}$, or let
$$z_{ij} = \begin{cases} 1, & \text{if } \hat p_{ij} > \hat p_{il} \text{ for all } l \neq j; \\ 0, & \text{otherwise}. \end{cases}$$
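Continuing the same illustrative sketch (again with assumed function names, not code from the paper), the distance based labeling only changes the objective from (2.4) to (2.6), and a small helper builds the 0/1 indicators from the estimated classification probabilities for the nonparametric bootstrap:

```python
import numpy as np
from itertools import permutations

def distance_label(p, z):
    """Euclidean distance based labeling (DISTLAT): maximize l_3 in (2.6),
    which by Theorem 2.2 minimizes the distance (2.5)."""
    m = p.shape[1]
    best_omega, best_l3 = None, -np.inf
    for omega in permutations(range(m)):
        l3 = np.sum(z * p[:, list(omega)])
        if l3 > best_l3:
            best_omega, best_l3 = omega, l3
    return best_omega

def hard_indicators(p_hat):
    """For the nonparametric bootstrap: 0/1 indicators z_ij built from the
    estimated classification probabilities of the original sample
    (the largest p_hat in each row wins)."""
    z = np.zeros_like(p_hat)
    z[np.arange(p_hat.shape[0]), p_hat.argmax(axis=1)] = 1.0
    return z
```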

3 Simulation Study

In this section, we use a simulation study to compare the proposed complete likelihood based labeling method (COMPLH) and the Euclidean distance based labeling method (DISTLAT) with the traditionally used order constraint labeling and the normal likelihood based labeling method (NORMLH) (Yao and Lindsay, 2009) for both univariate and multivariate normal mixture models.

It is well known that normal mixture models with unequal variances have an unbounded likelihood function, so the maximum likelihood estimate (MLE) is not well defined. A similar unboundedness issue also exists for multivariate normal mixture models with unequal covariances. There has been considerable research dealing with the unbounded mixture likelihood issue. See, for example, Hathaway (1985, 1986), Chen, Tan, and Zhang (2008), Chen and Tan (2009), and Yao (2010). However, since our focus is not directly on parameter estimation but on how to label the estimates after they are obtained, we assume, without loss of generality and for simplicity of computation only, equal variance (covariance) for univariate (multivariate) normal mixture models when using the EM algorithm to find the MLE. The EM algorithm is run from 20 randomly chosen initial values and stops when the maximum difference between the updated parameter estimates of two consecutive iterations is less than $10^{-5}$.

To compare different labeling results, we report the average and standard deviation of the labeled MLEs for the different labeling methods. Ideally labeled estimates should have small bias, so the bias is a good indicator of how well each labeling method performs. Note that the estimated standard errors produced by each labeling method cannot be used directly to compare the labeling methods, since the true standard errors are unknown even in a simulation setting. In addition, as discussed in Section 1, the estimated standard errors obtained by an information-based approach are usually too unstable when the sample size is not large.
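For concreteness, the following is a minimal sketch of the equal-variance univariate EM algorithm described above (20 random starts, $10^{-5}$ tolerance); it is our own illustration under those assumptions, not the author's code, and it also returns the classification probabilities $p_{ij}$ needed by COMPLH and DISTLAT.

```python
import numpy as np
from scipy.stats import norm

def em_normal_mixture(x, m=2, tol=1e-5, max_iter=1000, n_starts=20, rng=None):
    """EM for an m-component univariate normal mixture with a common variance.

    Runs from n_starts random initial values and keeps the solution with the
    highest log-likelihood.  Returns (pi, mu, sigma2, p), where p holds the
    classification probabilities p_ij.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    best = None
    for _ in range(n_starts):
        pi = np.full(m, 1.0 / m)
        mu = rng.choice(x, size=m, replace=False)    # random initial means
        sigma2 = np.var(x)
        prev = None
        for _ in range(max_iter):
            # E-step: classification probabilities p_ij
            dens = pi * norm.pdf(x[:, None], mu, np.sqrt(sigma2))
            p = dens / dens.sum(axis=1, keepdims=True)
            # M-step under the equal-variance restriction
            nj = p.sum(axis=0)
            pi = nj / len(x)
            mu = (p * x[:, None]).sum(axis=0) / nj
            sigma2 = (p * (x[:, None] - mu) ** 2).sum() / len(x)
            theta = np.concatenate([pi, mu, [sigma2]])
            if prev is not None and np.max(np.abs(theta - prev)) < tol:
                break
            prev = theta
        loglik = np.log(dens.sum(axis=1)).sum()
        if best is None or loglik > best[0]:
            best = (loglik, pi, mu, sigma2)
    _, pi, mu, sigma2 = best
    dens = pi * norm.pdf(x[:, None], mu, np.sqrt(sigma2))
    return pi, mu, sigma2, dens / dens.sum(axis=1, keepdims=True)
```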

Example 1. We are interested in evaluating the performance of the MLE for the mixture model
$$\pi_1 N(\mu_1, 1) + (1 - \pi_1) N(\mu_2, 1),$$
where $\pi_1 = 0.3$ and $\mu_1 = 0$. We consider the following four cases for $\mu_2$: I) $\mu_2 = 0.5$; II) $\mu_2 = 1$; III) $\mu_2 = 1.5$; IV) $\mu_2 = 2$. The four cases have unequal component proportions, and the separation between the two mixture components increases from Case I to Case IV. For each case, we run 500 replicates for sample sizes 50 and 200 and find the MLE for each replicate by the EM algorithm assuming equal variance. We consider five labeling methods: order constraint labeling based on the component means (OC-µ), order constraint labeling based on the component proportions (OC-π), normal likelihood based labeling (NORMLH), complete likelihood based labeling (COMPLH), and Euclidean distance based labeling (DISTLAT). Tables 1 and 2 report the average and standard deviation (Std) of the MLEs based on the different labeling methods for n = 50 and 200, respectively. Since equal variance is assumed, the variance estimates do not carry labeling information and are the same for all labeling methods; therefore, we do not report them in the tables.

From Tables 1 and 2, we can see that COMPLH and DISTLAT have similar results and smaller bias than all the other labeling methods, especially when the two components are close, and that NORMLH also has slightly smaller bias than OC-µ and OC-π, which have large bias, especially when the components are close. In addition, all labeling methods perform better and have smaller bias when the sample size increases. Therefore, it is expected that these five labeling methods will provide similar labeled estimates for all four cases considered in this example when the sample size is large enough.
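As a usage illustration only (the paper does not provide code), one cell of this experiment, say Case II with n = 50, could be simulated as follows, reusing em_normal_mixture, complete_likelihood_label, and distance_label from the sketches above; the averages and standard deviations of the labeled estimates can then be compared with the corresponding Table 1 entries.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 500
pi1, mu1, mu2 = 0.3, 0.0, 1.0                        # Case II of Example 1
labeled = []
for _ in range(reps):
    comp = rng.choice(2, size=n, p=[pi1, 1 - pi1])   # latent component labels
    x = rng.normal(np.where(comp == 0, mu1, mu2), 1.0)
    z = np.eye(2)[comp]                               # n x 2 indicator matrix z_ij
    pi_hat, mu_hat, _, p = em_normal_mixture(x, m=2, rng=rng)
    omega = list(complete_likelihood_label(p, z))     # COMPLH; use distance_label for DISTLAT
    labeled.append(np.r_[pi_hat[omega][:1], mu_hat[omega]])   # (pi_1, mu_1, mu_2)
labeled = np.asarray(labeled)
print("average:", labeled.mean(axis=0))
print("Std:    ", labeled.std(axis=0))
```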

Example 2. We generate independent and identically distributed (iid) samples $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ from
$$\pi_1\, N\!\left(\begin{pmatrix}\mu_{11}\\ \mu_{12}\end{pmatrix}, \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}\right) + (1-\pi_1)\, N\!\left(\begin{pmatrix}\mu_{21}\\ \mu_{22}\end{pmatrix}, \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}\right),$$
where $\mu_{11} = 0$ and $\mu_{12} = 0$. We consider the following four cases for $(\pi_1, \mu_{21}, \mu_{22})$:

I. $\pi_1 = 0.3$, $\mu_{21} = 0.5$, $\mu_{22} = 0.5$;
II. $\pi_1 = 0.3$, $\mu_{21} = 1$, $\mu_{22} = 1$;
III. $\pi_1 = 0.3$, $\mu_{21} = 1.5$, $\mu_{22} = 1.5$;
IV. $\pi_1 = 0.5$, $\mu_{21} = 3$, $\mu_{22} = 0$.

For each case, we run 500 replicates for sample sizes 100 and 400 and find the MLE for each replicate by the EM algorithm assuming equal covariance. We consider the following six labeling methods: order constraint labeling based on the component means of the first dimension (OC-µ1), order constraint labeling based on the component means of the second dimension (OC-µ2), order constraint labeling based on the component proportions (OC-π), NORMLH, COMPLH, and DISTLAT. Tables 3 and 4 report the average and standard deviation (Std) of the MLEs based on the different labeling methods for n = 100 and 400, respectively.

From the tables, we can see that, for Cases I to III, COMPLH and DISTLAT have similar results and provide smaller bias of the labeled estimates than all the other labeling methods, especially when the components are close and the sample size is not large. In addition, NORMLH also provides smaller bias than OC-µ1, OC-µ2, and OC-π. For Case IV, the component means of the second dimension are the same but the component means of the first dimension are well separated; in addition, the component proportions are the same. In this case,

the component means of the second dimension and the component proportions do not carry any label information. From the tables, we can see that OC-µ2 and OC-π provide unreasonable estimates and have large bias for both n = 100 and n = 400. Note that in this case OC-µ2 and OC-π will not work well even when the sample size is larger than 400, because the component parameters used in the order constraint are the wrong ones. However, OC-µ1, NORMLH, COMPLH, and DISTLAT all work well for both n = 100 and n = 400.

Example 3. We generate independent and identically distributed (iid) samples $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ from
$$\sum_{j=1}^{3} \pi_j\, N\!\left(\begin{pmatrix}\mu_{j1}\\ \mu_{j2}\end{pmatrix}, \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}\right),$$
where $\pi_1 = 0.2$, $\pi_2 = 0.3$, $\pi_3 = 0.5$, $\mu_{11} = 0$, and $\mu_{12} = 0$. We consider the following three cases for $(\mu_{21}, \mu_{22}, \mu_{31}, \mu_{32})$:

I. $\mu_{21} = 0.5$, $\mu_{22} = 0.5$, $\mu_{31} = 1$, $\mu_{32} = 1$;
II. $\mu_{21} = 1$, $\mu_{22} = 1$, $\mu_{31} = 2$, $\mu_{32} = 2$;
III. $\mu_{21} = 0$, $\mu_{22} = 2$, $\mu_{31} = 0$, $\mu_{32} = 4$.

For each case, we run 500 replicates for sample sizes 100 and 400 and find the MLE for each replicate by the EM algorithm assuming equal covariance. We consider the following six labeling methods: OC-µ1, OC-µ2, OC-π, NORMLH, COMPLH, and DISTLAT. Tables 5 and 6 report the average and standard deviation (Std) of the MLEs based on the different labeling methods for n = 100 and 400, respectively. The findings are similar to those of Example 2. From the tables, we can see that COMPLH and DISTLAT produce similar results in most cases and have overall better performance than all the other labeling methods, especially when the sample size is small.

4 Summary

The label switching issue has not received as much attention for frequentist mixture models as for Bayesian mixture models. In this article, we explain the importance of solving the label switching issue for frequentist mixture models and propose two new labeling methods. Based on the simulation study, the proposed complete likelihood based labeling method and the Euclidean distance based labeling method, which incorporate the information of the latent labels, have overall better performance than all the other methods considered in our simulation study. In addition, the order constraint labeling methods only work well when the constrained component parameters carry enough label information; they usually provide poor and even unreasonable estimates when the constrained component parameters are not well separated. Therefore, in practice, the choice of constraint is very sensitive and thus difficult, especially for multivariate problems. Different order constraints may generate markedly different results, and it is difficult to anticipate the overall effect.

As explained in Section 1, many of the labeling methods proposed for Bayesian mixtures cannot be directly applied to frequentist mixtures. However, more research is required on whether some of the Bayesian labeling methods can be applied, either directly or after some revision, to frequentist mixtures.

5 Acknowledgements

This work is related to my Ph.D. dissertation. I am indebted to my dissertation advisor, Bruce G. Lindsay, for his assistance and counsel in this research.

References

Basford, K. E., Greenway, D. R., McLachlan, G. J., and Peel, D. (1997). Standard errors of fitted means under normal mixture models. Computational Statistics, 12, 1-17.

Böhning, D. (1999). Computer-Assisted Analysis of Mixtures and Applications. Boca Raton, FL: Chapman and Hall/CRC.

Celeux, G. (1998). Bayesian inference for mixtures: The label switching problem. In Compstat 98 - Proceedings in Computational Statistics (eds. R. Payne and P. J. Green), 227-232. Physica, Heidelberg.

Celeux, G., Hurn, M., and Robert, C. P. (2000). Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association, 95, 957-970.

Chen, J., Tan, X., and Zhang, R. (2008). Inference for normal mixture in mean and variance. Statistica Sinica, 18, 443-465.

Chen, J. and Tan, X. (2009). Inference for multivariate normal mixtures. Journal of Multivariate Analysis, 100, 1367-1383.

Chung, H., Loken, E., and Schafer, J. L. (2004). Difficulties in drawing inferences with finite-mixture models: a simple example with a simple solution. The American Statistician, 58, 152-158.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1-38.

Frühwirth-Schnatter, S. (2001). Markov chain Monte Carlo estimation of classical and dynamic switching and mixture models. Journal of the American Statistical Association, 96, 194-209.

Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer.

Geweke, J. (2007). Interpretation and inference in mixture models: Simple MCMC works. Computational Statistics and Data Analysis, 51, 3529-3550.

Grün, B. and Leisch, F. (2009). Dealing with label switching in mixture models under genuine multimodality. Journal of Multivariate Analysis, 100, 851-861.

Hathaway, R. J. (1985). A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Annals of Statistics, 13, 795-800.

Hathaway, R. J. (1986). A constrained EM algorithm for univariate mixtures. Journal of Statistical Computation and Simulation, 23, 211-230.

Lindsay, B. G. (1995). Mixture Models: Theory, Geometry, and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, Vol. 5. Hayward, CA: Institute of Mathematical Statistics.

Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B, 44, 226-233.

Marin, J.-M., Mengersen, K. L., and Robert, C. P. (2005). Bayesian modelling and inference on mixtures of distributions. In Handbook of Statistics 25 (eds. D. Dey and C. R. Rao). North-Holland, Amsterdam.

McLachlan, G. J. and Krishnan, T. (1997). The EM Algorithm and Extensions. New York: Wiley.

McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. New York: Wiley.

Meilijson, I. (1989). A fast improvement of the EM algorithm on its own terms. Journal of the Royal Statistical Society, Series B, 51, 127-138.

Meng, X. L. and Rubin, D. B. (1991). Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. Journal of the American Statistical Association, 86, 899-909.

Papastamoulis, P. and Iliopoulos, G. (2010). An artificial allocations based solution to the label switching problem in Bayesian analysis of mixtures of distributions. Journal of Computational and Graphical Statistics, 19, 313-331.

Sperrin, M., Jaki, T., and Wit, E. (2010). Probabilistic relabeling strategies for the label switching problem in Bayesian mixture models. Statistics and Computing, 20, 357-366.

Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society, Series B, 62, 795-809.

Yao, W. (2010). A profile likelihood method for normal mixture with unequal variance. Journal of Statistical Planning and Inference, 140, 2089-2098.

Yao, W. and Lindsay, B. G. (2009). Bayesian mixture labeling by highest posterior density. Journal of the American Statistical Association, 104, 758-767.

Table 1: Average (Std) of Point Estimates Over 500 Repetitions When n = 50 for Example 1.

| Case | TRUE | OC-µ | OC-π | NORMLH | COMPLH | DISTLAT |
|------|------|------|------|--------|--------|---------|
| I | µ1: 0 | -0.547(0.705) | 0.250(1.456) | -0.246(1.240) | -0.040(1.356) | -0.017(1.371) |
| | µ2: 0.5 | 1.152(0.640) | 0.355(0.476) | 0.851(0.457) | 0.645(0.529) | 0.622(0.518) |
| | π1: 0.3 | 0.471(0.284) | 0.257(0.149) | 0.353(0.245) | 0.304(0.207) | 0.298(0.201) |
| II | µ1: 0 | -0.292(0.722) | 0.479(1.497) | -0.030(1.22) | 0.075(1.309) | 0.121(1.348) |
| | µ2: 1 | 1.518(0.617) | 0.748(0.516) | 1.257(0.470) | 1.152(0.499) | 1.106(0.491) |
| | π1: 0.3 | 0.462(0.277) | 0.266(0.151) | 0.359(0.241) | 0.328(0.220) | 0.312(0.207) |
| III | µ1: 0 | -0.131(0.711) | 0.441(1.436) | 0.026(1.110) | 0.031(1.112) | 0.071(1.167) |
| | µ2: 1.5 | 1.800(0.575) | 1.229(0.576) | 1.644(0.412) | 1.638(0.415) | 1.598(0.416) |
| | π1: 0.3 | 0.407(0.238) | 0.289(0.144) | 0.351(0.208) | 0.351(0.208) | 0.336(0.196) |
| IV | µ1: 0 | -0.083(0.709) | 0.288(1.322) | -0.030(0.901) | -0.013(0.940) | 0.002(0.975) |
| | µ2: 2 | 2.152(0.461) | 1.782(0.594) | 2.100(0.366) | 2.083(0.366) | 2.067(0.365) |
| | π1: 0.3 | 0.349(0.191) | 0.294(0.129) | 0.333(0.177) | 0.326(0.170) | 0.320(0.163) |

Table 2: Average (Std) of Point Estimates Over 500 Repetitions When n = 200 for Example 1.

| Case | TRUE | OC-µ | OC-π | NORMLH | COMPLH | DISTLAT |
|------|------|------|------|--------|--------|---------|
| I | µ1: 0 | -0.479(0.718) | 0.122(1.345) | -0.479(0.718) | -0.065(1.282) | -0.047(1.29) |
| | µ2: 0.5 | 1.003(0.613) | 0.401(0.381) | 0.720(0.366) | 0.588(0.367) | 0.571(0.369) |
| | π1: 0.3 | 0.451(0.293) | 0.251(0.162) | 0.318(0.235) | 0.273(0.192) | 0.269(0.187) |
| II | µ1: 0 | -0.212(0.728) | 0.342(1.332) | -0.004(1.130) | 0.050(1.178) | 0.077(1.199) |
| | µ2: 1 | 1.341(0.559) | 0.786(0.424) | 1.133(0.353) | 1.078(0.366) | 1.052(0.371) |
| | π1: 0.3 | 0.432(0.271) | 0.270(0.159) | 0.339(0.229) | 0.317(0.211) | 0.307(0.202) |
| III | µ1: 0 | -0.132(0.621) | 0.113(1.029) | -0.076(0.792) | -0.073(0.797) | -0.073(0.797) |
| | µ2: 1.5 | 1.596(0.358) | 1.351(0.429) | 1.540(0.273) | 1.537(0.275) | 1.537(0.275) |
| | π1: 0.3 | 0.332(0.190) | 0.287(0.139) | 0.307(0.165) | 0.306(0.163) | 0.306(0.163) |
| IV | µ1: 0 | -0.028(0.367) | 0.017(0.532) | -0.021(0.418) | -0.019(0.428) | -0.019(0.428) |
| | µ2: 2 | 2.009(0.220) | 1.964(0.277) | 2.002(0.187) | 2.000(0.189) | 2.000(0.189) |
| | π1: 0.3 | 0.305(0.103) | 0.300(0.092) | 0.303(0.099) | 0.302(0.097) | 0.302(0.097) |

Table 3: Average (Std) of Point Estimates Over 500 Repetitions When n = 100 for Example 2.

| Case | TRUE | OC-µ1 | OC-µ2 | OC-π | NORMLH | COMPLH | DISTLAT |
|------|------|-------|-------|------|--------|--------|---------|
| I | µ11: 0 | -0.246(0.633) | 0.306(0.819) | 0.308(1.136) | 0.252(1.130) | 0.145(1.091) | 0.164(1.099) |
| | µ12: 0 | 0.324(0.92) | -0.242(0.604) | 0.395(1.175) | 0.408(1.180) | 0.241(1.153) | 0.244(1.155) |
| | µ21: 0.5 | 0.914(0.593) | 0.363(0.868) | 0.360(0.366) | 0.417(0.358) | 0.524(0.404) | 0.504(0.399) |
| | µ22: 0.5 | 0.410(0.816) | 0.976(0.639) | 0.338(0.363) | 0.325(0.359) | 0.493(0.392) | 0.490(0.388) |
| | π1: 0.3 | 0.481(0.278) | 0.512(0.278) | 0.269(0.155) | 0.273(0.160) | 0.309(0.202) | 0.303(0.196) |
| II | µ11: 0 | -0.017(0.598) | 0.444(0.937) | 0.435(1.101) | 0.281(1.040) | 0.258(1.019) | 0.264(1.024) |
| | µ12: 0 | 0.435(0.916) | -0.046(0.631) | 0.392(1.123) | 0.272(1.070) | 0.213(1.027) | 0.222(1.036) |
| | µ21: 1 | 1.238(0.536) | 0.775(0.708) | 0.784(0.400) | 0.939(0.367) | 0.962(0.383) | 0.956(0.382) |
| | µ22: 1 | 0.750(0.782) | 1.232(0.534) | 0.793(0.399) | 0.913(0.378) | 0.972(0.396) | 0.963(0.390) |
| | π1: 0.3 | 0.441(0.258) | 0.431(0.256) | 0.280(0.148) | 0.300(0.174) | 0.322(0.196) | 0.317(0.191) |
| III | µ11: 0 | 0.033(0.563) | 0.23(0.841) | 0.233(0.902) | 0.150(0.814) | 0.136(0.789) | 0.138(0.797) |
| | µ12: 0 | 0.238(0.823) | 0.036(0.540) | 0.215(0.859) | 0.107(0.727) | 0.115(0.736) | 0.120(0.743) |
| | µ21: 1.5 | 1.551(0.354) | 1.355(0.505) | 1.352(0.391) | 1.435(0.329) | 1.448(0.331) | 1.447(0.319) |
| | µ22: 1.5 | 1.334(0.507) | 1.536(0.344) | 1.357(0.380) | 1.465(0.293) | 1.458(0.303) | 1.453(0.308) |
| | π1: 0.3 | 0.346(0.177) | 0.343(0.175) | 0.296(0.117) | 0.309(0.137) | 0.311(0.139) | 0.309(0.137) |
| IV | µ11: 0 | 0.030(0.257) | 1.536(1.510) | 1.629(1.568) | 0.030(0.257) | 0.030(0.256) | 0.030(0.256) |
| | µ12: 0 | -0.008(0.228) | -0.132(0.207) | -0.012(0.266) | -0.008(0.228) | -0.008(0.227) | -0.008(0.227) |
| | µ21: 3 | 2.998(0.248) | 1.493(1.510) | 1.399(1.434) | 2.998(0.248) | 2.998(0.248) | 2.998(0.248) |
| | µ22: 0 | 0.004(0.225) | 0.129(0.159) | 0.008(0.177) | 0.004(0.225) | 0.004(0.224) | 0.004(0.225) |
| | π1: 0.5 | 0.504(0.075) | 0.493(0.075) | 0.444(0.050) | 0.504(0.075) | 0.504(0.075) | 0.504(0.075) |

Table 4: Average (Std) of Point Estimates Over 500 Repetitions When n = 400 for Example 2.

| Case | TRUE | OC-µ1 | OC-µ2 | OC-π | NORMLH | COMPLH | DISTLAT |
|------|------|-------|-------|------|--------|--------|---------|
| I | µ11: 0 | -0.193(0.544) | 0.263(0.767) | 0.259(0.991) | 0.118(0.949) | 0.124(0.952) | 0.125(0.954) |
| | µ12: 0 | 0.325(0.820) | -0.195(0.580) | 0.292(1.060) | 0.144(1.030) | 0.172(1.034) | 0.173(1.035) |
| | µ21: 0.5 | 0.826(0.519) | 0.370(0.700) | 0.374(0.310) | 0.515(0.325) | 0.509(0.320) | 0.508(0.320) |
| | µ22: 0.5 | 0.337(0.739) | 0.857(0.573) | 0.369(0.304) | 0.517(0.312) | 0.490(0.315) | 0.488(0.313) |
| | π1: 0.3 | 0.471(0.286) | 0.474(0.286) | 0.266(0.167) | 0.301(0.207) | 0.294(0.200) | 0.293(0.199) |
| II | µ11: 0 | -0.008(0.488) | 0.258(0.712) | 0.210(0.798) | 0.102(0.706) | 0.104(0.708) | 0.104(0.708) |
| | µ12: 0 | 0.303(0.800) | -0.011(0.494) | 0.246(0.860) | 0.132(0.776) | 0.130(0.773) | 0.130(0.773) |
| | µ21: 1 | 1.077(0.345) | 0.809(0.537) | 0.857(0.316) | 0.965(0.275) | 0.963(0.277) | 0.963(0.277) |
| | µ22: 1 | 0.806(0.518) | 1.121(0.388) | 0.863(0.322) | 0.977(0.278) | 0.979(0.276) | 0.979(0.276) |
| | π1: 0.3 | 0.364(0.211) | 0.379(0.221) | 0.289(0.136) | 0.309(0.164) | 0.310(0.164) | 0.310(0.164) |
| III | µ11: 0 | -0.007(0.217) | 0.005(0.252) | -0.005(0.224) | -0.004(0.232) | -0.004(0.232) | -0.004(0.232) |
| | µ12: 0 | 0.022(0.287) | 0.007(0.208) | 0.022(0.294) | 0.013(0.259) | 0.013(0.259) | 0.013(0.259) |
| | µ21: 1.5 | 1.505(0.112) | 1.492(0.188) | 1.503(0.117) | 1.502(0.122) | 1.502(0.122) | 1.502(0.122) |
| | µ22: 1.5 | 1.493(0.158) | 1.508(0.135) | 1.493(0.143) | 1.501(0.120) | 1.501(0.119) | 1.501(0.119) |
| | π1: 0.3 | 0.303(0.067) | 0.305(0.071) | 0.302(0.063) | 0.302(0.063) | 0.302(0.063) | 0.302(0.063) |
| IV | µ11: 0 | 0.001(0.097) | 1.401(1.51) | 1.474(1.533) | 0.001(0.097) | 0.001(0.097) | 0.001(0.097) |
| | µ12: 0 | -0.004(0.082) | -0.052(0.063) | 0.001(0.086) | -0.004(0.082) | -0.004(0.082) | -0.004(0.082) |
| | µ21: 3 | 3.001(0.095) | 1.601(1.500) | 1.527(1.475) | 3.001(0.095) | 3.001(0.095) | 3.001(0.095) |
| | µ22: 0 | 0.005(0.082) | 0.053(0.063) | 0.002(0.078) | 0.005(0.082) | 0.005(0.082) | 0.005(0.082) |
| | π1: 0.5 | 0.500(0.031) | 0.498(0.031) | 0.475(0.019) | 0.500(0.031) | 0.499(0.031) | 0.500(0.031) |

Table 5: Average (Std) of Point Estimates Over 500 Repetitions When n = 100 for Example 3.

| Case | TRUE | OC-µ1 | OC-µ2 | OC-π | NORMLH | COMPLH | DISTLAT |
|------|------|-------|-------|------|--------|--------|---------|
| I | µ11: 0 | -0.451(0.719) | 0.403(1.110) | 0.435(1.504) | 0.260(1.480) | 0.045(1.255) | 0.079(1.225) |
| | µ12: 0 | 0.408(1.100) | -0.420(0.695) | 0.477(1.465) | 0.400(1.440) | -0.001(1.187) | 0.045(1.164) |
| | µ21: 0.5 | 0.607(0.353) | 0.636(0.853) | 0.627(0.863) | 0.802(0.837) | 0.658(0.911) | 0.642(0.988) |
| | µ22: 0.5 | 0.637(0.874) | 0.620(0.404) | 0.663(0.873) | 0.718(0.903) | 0.746(0.891) | 0.735(0.977) |
| | µ31: 1 | 1.600(0.666) | 0.715(1.090) | 0.691(0.389) | 0.692(0.364) | 1.052(0.522) | 1.034(0.505) |
| | µ32: 1 | 0.779(1.030) | 1.625(0.604) | 0.683(0.414) | 0.705(0.382) | 1.080(0.535) | 1.043(0.521) |
| | π1: 0.2 | 0.235(0.159) | 0.255(0.187) | 0.123(0.079) | 0.127(0.084) | 0.182(0.132) | 0.192(0.137) |
| | π2: 0.3 | 0.471(0.184) | 0.460(0.182) | 0.314(0.094) | 0.313(0.102) | 0.352(0.187) | 0.332(0.190) |
| II | µ11: 0 | -0.094(0.671) | 0.265(0.922) | 0.776(1.530) | -0.025(0.752) | 0.089(0.973) | 0.180(0.943) |
| | µ12: 0 | 0.332(1.070) | -0.108(0.618) | 0.905(1.599) | 0.039(0.815) | 0.149(1.085) | 0.145(1.039) |
| | µ21: 1 | 1.184(0.449) | 1.353(0.948) | 1.084(0.995) | 1.415(0.625) | 1.248(0.807) | 1.248(1.061) |
| | µ22: 1 | 1.298(0.988) | 1.228(0.444) | 1.058(1.001) | 1.144(0.575) | 1.256(0.809) | 1.352(0.977) |
| | µ31: 2 | 2.318(0.566) | 1.788(0.954) | 1.547(0.506) | 2.018(0.852) | 2.070(0.555) | 1.979(0.484) |
| | µ32: 2 | 1.876(0.852) | 2.387(0.609) | 1.543(0.541) | 2.323(0.683) | 2.102(0.566) | 2.009(0.498) |
| | π1: 0.2 | 0.226(0.131) | 0.233(0.126) | 0.141(0.076) | 0.223(0.123) | 0.204(0.118) | 0.210(0.113) |
| | π2: 0.3 | 0.420(0.167) | 0.432(0.167) | 0.325(0.074) | 0.450(0.154) | 0.380(0.166) | 0.337(0.179) |
| III | µ11: 0 | -0.634(0.595) | -0.032(0.630) | -0.027(1.120) | -0.457(0.734) | -0.060(0.819) | -0.040(0.699) |
| | µ12: 0 | 2.265(1.890) | 0.159(0.763) | 1.441(2.138) | 0.901(1.770) | 0.350(1.255) | 0.380(1.219) |
| | µ21: 0 | -0.010(0.193) | 0.049(0.754) | 0.010(0.483) | 0.438(0.726) | 0.033(0.781) | 0.031(0.958) |
| | µ22: 2 | 2.108(1.650) | 2.415(0.864) | 1.840(1.444) | 1.930(1.270) | 2.331(0.993) | 2.400(1.244) |
| | µ31: 0 | 0.634(0.616) | -0.027(0.779) | 0.006(0.295) | 0.009(0.329) | 0.015(0.537) | -0.001(0.407) |
| | µ32: 4 | 2.308(1.780) | 4.107(0.551) | 3.400(0.827) | 3.850(0.508) | 3.999(0.491) | 3.900(0.455) |
| | π1: 0.2 | 0.284(0.159) | 0.232(0.113) | 0.154(0.076) | 0.206(0.106) | 0.212(0.107) | 0.223(0.105) |
| | π2: 0.3 | 0.416(0.156) | 0.369(0.151) | 0.325(0.061) | 0.303(0.129) | 0.341(0.143) | 0.297(0.146) |

Table 6: Average (Std) of Point Estimates Over 500 Repetitions When n = 400 for Example 3.

| Case | TRUE | OC-µ1 | OC-µ2 | OC-π | NORMLH | COMPLH | DISTLAT |
|------|------|-------|-------|------|--------|--------|---------|
| I | µ11: 0 | -0.369(0.678) | 0.233(1.040) | 0.485(1.406) | 0.463(1.410) | -0.003(1.185) | -0.001(1.116) |
| | µ12: 0 | 0.206(1.020) | -0.364(0.658) | 0.406(1.320) | 0.392(1.320) | -0.014(1.11) | -0.019(1.059) |
| | µ21: 0.5 | 0.611(0.325) | 0.669(0.784) | 0.537(0.751) | 0.491(0.735) | 0.682(0.718) | 0.741(0.861) |
| | µ22: 0.5 | 0.673(0.678) | 0.588(0.294) | 0.581(0.766) | 0.524(0.761) | 0.664(0.705) | 0.729(0.819) |
| | µ31: 1 | 1.499(0.603) | 0.837(0.886) | 0.717(0.330) | 0.785(0.318) | 1.061(0.418) | 0.999(0.389) |
| | µ32: 1 | 0.806(0.890) | 1.461(0.545) | 0.697(0.339) | 0.769(0.320) | 1.036(0.435) | 0.976(0.394) |
| | π1: 0.2 | 0.228(0.166) | 0.231(0.171) | 0.111(0.077) | 0.111(0.078) | 0.157(0.124) | 0.168(0.125) |
| | π2: 0.3 | 0.474(0.190) | 0.468(0.201) | 0.309(0.092) | 0.316(0.105) | 0.364(0.187) | 0.322(0.189) |
| II | µ11: 0 | -0.063(0.492) | 0.132(0.717) | 0.733(1.403) | 0.063(0.392) | 0.062(0.782) | 0.041(0.550) |
| | µ12: 0 | 0.100(0.638) | -0.076(0.496) | 0.701(1.388) | 0.074(0.362) | 0.012(0.713) | 0.007(0.483) |
| | µ21: 1 | 1.164(0.427) | 1.266(0.850) | 0.988(0.908) | 1.318(1.110) | 1.220(0.763) | 1.318(1.001) |
| | µ22: 1 | 1.308(0.916) | 1.146(0.438) | 0.999(0.924) | 1.265(1.130) | 1.229(0.746) | 1.324(0.997) |
| | µ31: 2 | 2.259(0.512) | 1.960(0.672) | 1.637(0.405) | 1.977(0.327) | 2.076(0.388) | 1.999(0.327) |
| | µ32: 2 | 1.913(0.661) | 2.252(0.481) | 1.621(0.409) | 1.982(0.339) | 2.080(0.398) | 1.991(0.343) |
| | π1: 0.2 | 0.220(0.102) | 0.215(0.104) | 0.143(0.078) | 0.251(0.094) | 0.203(0.105) | 0.226(0.098) |
| | π2: 0.3 | 0.407(0.164) | 0.414(0.157) | 0.325(0.062) | 0.267(0.180) | 0.362(0.161) | 0.298(0.174) |
| III | µ11: 0 | -0.308(0.424) | -0.023(0.320) | 0.024(0.739) | -0.014(0.217) | 0.001(0.482) | 0.009(0.299) |
| | µ12: 0 | 2.100(1.790) | 0.016(0.441) | 0.748(1.609) | 0.050(0.398) | 0.052(0.598) | 0.062(0.533) |
| | µ21: 0 | -0.003(0.092) | 0.022(0.568) | -0.009(0.202) | 0.033(0.727) | 0.004(0.569) | 0.013(0.697) |
| | µ22: 2 | 2.078(1.630) | 2.167(0.718) | 1.735(1.157) | 2.192(0.958) | 2.162(0.791) | 2.178(0.916) |
| | µ31: 0 | 0.328(0.471) | 0.016(0.433) | 0.001(0.160) | -0.002(0.192) | 0.011(0.237) | -0.006(0.194) |
| | µ32: 4 | 2.029(1.730) | 4.022(0.289) | 3.722(0.568) | 3.964(0.277) | 3.992(0.272) | 3.965(0.275) |
| | π1: 0.2 | 0.310(0.153) | 0.214(0.080) | 0.167(0.071) | 0.221(0.077) | 0.206(0.083) | 0.217(0.078) |
| | π2: 0.3 | 0.389(0.146) | 0.324(0.123) | 0.323(0.045) | 0.284(0.122) | 0.309(0.111) | 0.288(0.119) |