Finite Mixture Model Diagnostics Using Resampling Methods


Bettina Grün (Johannes Kepler Universität Linz)
Friedrich Leisch (Universität für Bodenkultur Wien)

Abstract: This paper illustrates the implementation of resampling methods in flexmix as well as the application of resampling methods for model diagnostics of fitted finite mixture models. Convenience functions to perform these methods are available in package flexmix. The use of the methods is illustrated with an artificial example and the seizure data set.

Keywords: R, finite mixture models, resampling, bootstrap.

1. Implementation of resampling methods

The proposed framework for model diagnostics using resampling (Grün and Leisch 2004) allows the model fit of all kinds of mixture models to be investigated in the same way. The procedure is applicable to mixture models with different component-specific models and does not impose limitations such as, for example, on the dimension of the parameter space of the component-specific model. Apart from the fitting step, different component-specific models only require different random number generators for the parametric bootstrap.

The boot() function in flexmix is a generic S4 function with a method for fitted finite mixtures of class "flexmix" and is applicable to general finite mixture models. The function with its arguments and their defaults is given by:

boot(object, R, sim = c("ordinary", "empirical", "parametric"),
  initialize_solution = FALSE, keep_weights = FALSE, keep_groups = TRUE,
  verbose = 0, control, k, model = FALSE, ...)

The interface is similar to the boot() function in package boot (Davison and Hinkley 1997; Canty and Ripley 2010). The object is a fitted finite mixture of class "flexmix" and R denotes the number of resamples. The possible bootstrapping methods are "empirical" (also available as "ordinary") and "parametric". For the parametric bootstrap, sampling from the fitted mixture is performed using rflexmix(). For mixture models with different component-specific models rflexmix() requires a sampling method for the component-specific model. Argument initialize_solution allows one to select whether the EM algorithm is started in the original finite mixture solution or whether random initialization is performed.

The fitted mixture model might contain weights and group indicators. The weights are case weights and allow the amount of data to be reduced if observations are identical. This is useful, for example, for latent class analysis of multivariate binary data. The argument keep_weights indicates whether they should be kept for the bootstrapping. Group indicators allow one to specify that the component membership is identical over several observations, e.g., for repeated measurements of the same individual. The argument keep_groups indicates whether the grouping information should also be used in the bootstrapping. verbose indicates whether information on the progress should be printed. The control argument allows the EM algorithm to be controlled when fitting the model to each of the bootstrap samples; by default it is extracted from the fitted model provided by object. k allows the number of components to be specified and by default this is also taken from the fitted model provided. The model argument determines whether the model and the weights slot are also stored and returned for each sample. The returned object is of class "FLXboot" and otherwise only contains the fitted parameters, the fitted priors, the log likelihoods, the number of components of the fitted mixtures and the information whether the EM algorithm has converged.

The likelihood ratio test is implemented based on boot() in function LR_test() and returns an object of class "htest" containing the number of valid bootstrap replicates, the p-value, and the double negative log likelihood ratio test statistics for the original data and the bootstrap replicates. The plot method for "FLXboot" objects returns a parallel coordinate plot with the fitted parameters shown separately for each of the components.
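To make the mechanics concrete, the following added sketch (not part of flexmix itself) outlines what the parametric bootstrap essentially does for a fitted mixture fit: draw new responses from the fitted mixture via rflexmix(), refit the model with randomly initialized EM, and collect the parameters. It assumes the response variable is called y and that rflexmix() returns the simulated responses in $y[[1]], as in the FLXdist() example below; the actual boot() method additionally handles the empirical bootstrap, case weights, groupings and convergence information.

## Conceptual sketch of boot(fit, R, sim = "parametric"); fit is a fitted
## "flexmix" object, formula and data are those used for fitting.
manual_parametric_boot <- function(fit, formula, data, R = 100, k = 2) {
  replicate(R, {
    sim <- rflexmix(fit)                          # sample new responses from the fitted mixture
    data$y <- sim$y[[1]]                          # plug the simulated response into the data
    refit <- flexmix(formula, data = data, k = k) # EM with random initialization
    parameters(refit)                             # store the component-specific parameters
  })
}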

2. Artificial data set

In the following a finite mixture model is used as the underlying data generating process which is theoretically not identifiable. We assume a finite mixture of linear regression models with two components of equal size where the coverage condition is not fulfilled (Hennig 2000). Hence, intra-component label switching is possible, i.e., there exist two parameterizations implying the same mixture distribution which differ in how the component means at the covariate points are combined into components.

We assume that one measurement per object and a single categorical regressor with two levels are given. The usual design matrix for a model with intercept uses the two covariate points x1 = (1, 0)' and x2 = (1, 1)'. The mixture distribution is given by

  H(y | x, Θ) = 1/2 N(µ1(x), 0.1) + 1/2 N(µ2(x), 0.1),

where µk(x) = x'αk and N(µ, σ²) is the normal distribution with mean µ and variance σ². Now let µ1(x1) = 1, µ2(x1) = 2, µ1(x2) = -1 and µ2(x2) = 4. As Gaussian mixture distributions are generically identifiable, the means, variances and component weights are uniquely determined at each covariate point given the mixture distribution. However, as the coverage condition is not fulfilled, there are two possible solutions for α:

  Solution 1: α1^(1) = (2, 2)',  α2^(1) = (1, -2)',
  Solution 2: α1^(2) = (2, -3)', α2^(2) = (1, 3)'.

Solution 1 combines the means (2, 4) and (1, -1) across the two covariate points, whereas Solution 2 combines (2, -1) and (1, 4); both imply the same mixture distribution at x1 and x2.
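As a quick check (this snippet is an addition, not part of the original analysis), the component means implied by both solutions at the two covariate points can be computed directly:

## Both solutions yield the same set of component means at x1 and x2,
## only the assignment of means to components across x1 and x2 differs.
X <- rbind(x1 = c(1, 0), x2 = c(1, 1))
sol1 <- cbind(alpha1 = c(2, 2), alpha2 = c(1, -2))
sol2 <- cbind(alpha1 = c(2, -3), alpha2 = c(1, 3))
X %*% sol1   # means 2 and 1 at x1, 4 and -1 at x2
X %*% sol2   # means 2 and 1 at x1, -1 and 4 at x2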

We specify this artificial mixture distribution using FLXdist(). FLXdist() returns an unfitted finite mixture of class "FLXdist". The class of fitted finite mixture models, "flexmix", extends class "FLXdist". Each component follows a normal distribution. The parameters specified in a named list therefore consist of the regression coefficients and the standard deviation. Function FLXdist() has an argument formula for specifying the regression in each of the components, an argument k for the component weights and components for the parameters of each of the components.

> library("flexmix")
> Component_1 <- list(model_1 = list(coef = c(1, -2), sigma = sqrt(0.1)))
> Component_2 <- list(model_1 = list(coef = c(2, 2), sigma = sqrt(0.1)))
> ArtEx.mix <- FLXdist(y ~ x, k = rep(0.5, 2),
+   components = list(Component_1, Component_2))

We draw a balanced sample with 50 observations at each covariate point from the mixture model using rflexmix() after defining the data points for the covariates. rflexmix() accepts either an unfitted or a fitted finite mixture as input. For unfitted mixtures data has to be provided using the newdata argument. For already fitted mixtures data can optionally be provided; otherwise the data used for fitting the mixture is used.

> ArtEx.data <- data.frame(x = rep(0:1, each = 100/2))
> ArtEx.sim <- rflexmix(ArtEx.mix, newdata = ArtEx.data)
> ArtEx.data$y <- ArtEx.sim$y[[1]]
> ArtEx.data$class <- ArtEx.sim$class

In Figure 1 the sample is plotted together with the two solutions for combining x1 and x2, i.e., this illustrates intra-component label switching.

Figure 1: Balanced sample from the artificial example with the two theoretical solutions.
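The figure itself is not reproduced here; a minimal sketch of how a Figure 1-style plot could be drawn from the simulated data (an added illustration; the jitter on x only reduces overplotting) is:

## Sample with the regression lines of both theoretical solutions.
plot(y ~ jitter(x, amount = 0.05), data = ArtEx.data, xlab = "x", ylab = "y")
abline(a = 2, b = 2)            # Solution 1, alpha = (2, 2)'
abline(a = 1, b = -2)           # Solution 1, alpha = (1, -2)'
abline(a = 2, b = -3, lty = 2)  # Solution 2, alpha = (2, -3)'
abline(a = 1, b = 3, lty = 2)   # Solution 2, alpha = (1, 3)'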

We fit a finite mixture to the sample using stepFlexmix().

> ArtEx.fit <- stepFlexmix(y ~ x, data = ArtEx.data, k = 2, nrep = 5,
+   control = list(iter = 1000, tol = 1e-8, verbose = 0))

2 : * * * * *

The fitted mixture can be inspected using summary() and parameters().

> summary(ArtEx.fit)

Call:
stepFlexmix(y ~ x, data = ArtEx.data, control = list(iter = 1000,
    tol = 1e-08, verbose = 0), k = 2, nrep = 5)

       prior size post>0 ratio
Comp.1  0.56   55     77 0.714
Comp.2  0.44   45     65 0.692

'log Lik.' -82.52413 (df=7)
AIC: 179.0483   BIC: 197.2845

> parameters(ArtEx.fit)

                     Comp.1     Comp.2
coef.(Intercept)  1.9130903  0.9269647
coef.x            2.1270909 -2.0597489
sigma             0.3167696  0.2738558

Obviously the fitted mixture parameters correspond to the parameterization we used to specify the mixture distribution. Using standard asymptotic theory to analyze the fitted mixture model gives the following estimates for the standard deviations.

> ArtEx.refit <- refit(ArtEx.fit)
> summary(ArtEx.refit)

$Comp.1
            Estimate Std. Error z value  Pr(>|z|)
(Intercept) 1.914496   0.072060  26.568 < 2.2e-16 ***
x           2.125685   0.093203  22.807 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

$Comp.2
             Estimate Std. Error z value  Pr(>|z|)
(Intercept)  0.927075   0.068756  13.484 < 2.2e-16 ***
x           -2.059859   0.089780 -22.943 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The fitted mixture can also be analyzed using resampling techniques. To analyze the stability of the parameter estimates, while also taking the possibility of identifiability problems into account, the parametric bootstrap is used with random initialization. Function boot() can be used for the empirical or parametric bootstrap (specified by the argument sim). The logical argument initialize_solution specifies whether the initialization is in the original solution or random; by default random initialization is performed. The number of bootstrap samples is set by the argument R. Please note that the arguments are chosen to correspond to those of function boot() in package boot (Davison and Hinkley 1997). Only a small number of bootstrap samples is drawn here to keep the time needed to run the vignette within reasonable limits. However, for a sensible application of the bootstrap methods at least R equal to 100 should be used. If the output for this setting has been saved, it is loaded and used in the further analysis. Please see the appendix for the code generating the saved R output.

> ArtEx.bs <- boot(ArtEx.fit, R = 15, sim = "parametric")
> if ("boot-output.rda" %in% list.files()) load("boot-output.rda")
> ArtEx.bs

Call:
boot(ArtEx.fit, R = 15, sim = "parametric")

Function boot() returns an object of class "FLXboot". The default plot compares the bootstrap parameter estimates to the confidence intervals derived using standard asymptotic theory in a parallel coordinate plot (see Figure 2). Clearly two groups of parameter estimates of about equal size can be distinguished. One subset of the parameter estimates stays within the confidence intervals induced by standard asymptotic theory, while the second group corresponds to the second solution and clusters around these parameter values.

> print(plot(ArtEx.bs, ordering = "coef.x", col = Colors))

Figure 2: Diagnostic plot of the bootstrap results for the artificial example.
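A rough way to quantify the two groups visible in Figure 2 (an added snippet, not part of the original analysis) is to classify each bootstrapped component by whether its slope estimate is closer to the Solution 1 values (±2) or the Solution 2 values (±3):

## Heuristic classification of the bootstrap component estimates into the
## two theoretical solutions, based on the absolute slope estimate.
cx <- parameters(ArtEx.bs)[, "coef.x"]
solution <- ifelse(abs(cx) < 2.5, "Solution 1", "Solution 2")
table(solution)

With the saved bootstrap output this should show the two roughly equally sized groups described above.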

In the following the dip test is applied to check whether the parameter estimates follow a unimodal distribution. This is done for the aggregated parameter estimates, where unimodality implies that the parameter is not suitable for imposing an ordering constraint which induces a unique labelling. For the separate component analysis, which is made after imposing an ordering constraint on the coefficient of x, rejecting the null hypothesis of unimodality implies that identifiability problems are present, e.g., due to intra-component label switching.

> require("diptest")
> parameters <- parameters(ArtEx.bs)
> Ordering <- factor(as.vector(apply(matrix(parameters[, "coef.x"],
+   nrow = 2), 2, order)))
> Comp1 <- parameters[Ordering == 1, ]
> Comp2 <- parameters[Ordering == 2, ]
> dip.values.art <- matrix(nrow = ncol(parameters), ncol = 3,
+   dimnames = list(colnames(parameters),
+     c("Aggregated", "Comp 1", "Comp 2")))
> dip.values.art[, "Aggregated"] <- apply(parameters, 2, dip)
> dip.values.art[, "Comp 1"] <- apply(Comp1, 2, dip)
> dip.values.art[, "Comp 2"] <- apply(Comp2, 2, dip)
> dip.values.art

                 Aggregated     Comp 1     Comp 2
coef.(Intercept) 0.18235817 0.14606128 0.15026513
coef.x           0.19166688 0.15519639 0.14200493
sigma            0.06337873 0.06945877 0.07618370
prior            0.08223492 0.08195298 0.08195298

The critical value for column Aggregated is 0.088 and for the columns of the separate components 0.119. The component sizes as well as the standard deviations follow a unimodal distribution for the aggregated data as well as for each of the components. The regression coefficients are multimodal for the aggregated data as well as for each of the components. While from the aggregated case it might be concluded that imposing an ordering constraint on the intercept or the coefficient of x is suitable, the component-specific analyses reveal that a unique labelling was not achieved.
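Instead of comparing the dip statistics to tabulated critical values, p-values can be obtained directly with dip.test() from the same package. The following added snippet illustrates this for the aggregated bootstrap estimates:

## p-values of the dip test for each aggregated parameter column
## (dip.test() is provided by package diptest).
round(apply(parameters, 2, function(x) dip.test(x)$p.value), 4)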

3. Seizure

In Wang, Puterman, Cockburn, and Le (1996) a Poisson mixture regression is fitted to data from a clinical trial where the effect of intravenous gammaglobulin on the suppression of epileptic seizures is investigated. The data used are 140 observations from one treated patient, where treatment started on the 28th day. Three independent variables were included in the regression model: treatment, trend and the treatment-trend interaction. Treatment is a dummy variable indicating whether the treatment period has already started. Furthermore, the number of parental observation hours per day is available, and it is assumed that the number of epileptic seizures per observation hour follows a Poisson mixture distribution. The number of epileptic seizures per parental observation hour for each day is plotted in Figure 3. The fitted mixture distribution consists of two components which can be interpreted as representing good and bad days of the patient.

The mixture model can be formulated as

  H(y | x, Θ) = π1 P(λ1) + π2 P(λ2),

where λk = exp(x'αk) for k = 1, 2 and P(λ) is the Poisson distribution.

The data is loaded and the mixture fitted with two components.

> data("seizure", package = "flexmix")
> model <- FLXMRglm(family = "poisson", offset = log(seizure$Hours))
> control <- list(iter = 1000, tol = 1e-10, verbose = 0)
> seizmix <- stepFlexmix(Seizures ~ Treatment * log(Day),
+   data = seizure, k = 2, nrep = 5, model = model, control = control)

2 : * * * * *

The fitted regression lines for each of the two components are shown in Figure 3.
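To make the model formula concrete, the component-specific rates λk = exp(x'αk) can be computed from the fitted coefficients and compared with fitted(). This is an added sketch; it assumes that parameters(seizmix) returns the four regression coefficients per component in the usual model-matrix order ((Intercept), Treatmentyes, log(Day), Treatmentyes:log(Day)).

## Expected seizures per observation hour for each component on day 20
## (base period, Treatment = "no") and day 60 (treatment period).
cf  <- parameters(seizmix)          # one column per component (assumed layout)
x20 <- c(1, 0, log(20), 0)          # (Intercept), Treatmentyes, log(Day), interaction
x60 <- c(1, 1, log(60), log(60))
exp(t(cf) %*% cbind(day20 = x20, day60 = x60))
## These per-hour rates should agree with the fitted values divided by the
## offset variable Hours on the corresponding days:
(fitted(seizmix) / seizure$Hours)[seizure$Day %in% c(20, 60), ]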

> par(mar = c(5, 4, 2, 0) + 0.1)
> plot(Seizures/Hours ~ Day, data = seizure, pch = as.integer(seizure$Treatment))
> abline(v = 27.5, lty = 2, col = "grey")
> matplot(seizure$Day, fitted(seizmix)/seizure$Hours, type = "l",
+   add = TRUE, col = 1, lty = 1, lwd = 2)

Figure 3: Seizure data with the fitted values for the Wang et al. model. The plotting character for the observed values in the base period is a circle and for those in the treatment period a triangle.

The parametric bootstrap with random initialization is used to investigate identifiability problems and parameter stability.

> seizmix.bs <- boot(seizmix, R = 15, sim = "parametric")
> if ("boot-output.rda" %in% list.files()) load("boot-output.rda")
> print(plot(seizmix.bs, ordering = "coef.(Intercept)", col = Colors))

Figure 4: Diagnostic plot of the bootstrap results for the seizure data.

The diagnostic plot is given in Figure 4. The coloring is according to an ordering constraint on the intercept. Clearly the parameter estimates corresponding to the solution where the bad days from the base period are combined with the good days from the treatment period, and vice versa for the good days of the base period, can be distinguished and indicate the slight identifiability problems of the fitted mixture.

> parameters <- parameters(seizmix.bs)
> Ordering <- factor(as.vector(apply(matrix(parameters[, "coef.(Intercept)"],
+   nrow = 2), 2, order)))
> Comp1 <- parameters[Ordering == 1, ]
> Comp2 <- parameters[Ordering == 2, ]

For applying the dip test an ordering constraint on the intercept is used as well. The critical value for column Aggregated is 0.088 and for the columns of the separate components 0.119.

> dip.values.art <- matrix(nrow = ncol(parameters), ncol = 3,
+   dimnames = list(colnames(parameters),
+     c("Aggregated", "Comp 1", "Comp 2")))
> dip.values.art[, "Aggregated"] <- apply(parameters, 2, dip)
> dip.values.art[, "Comp 1"] <- apply(Comp1, 2, dip)
> dip.values.art[, "Comp 2"] <- apply(Comp2, 2, dip)
> dip.values.art

                           Aggregated     Comp 1     Comp 2
coef.(Intercept)           0.10161195 0.07671967 0.06252763
coef.Treatmentyes          0.13981763 0.10766545 0.07014211
coef.log(Day)              0.05572212 0.07262258 0.06296500
coef.Treatmentyes:log(Day) 0.15641412 0.08790622 0.06899360
prior                      0.15660112 0.09542677 0.09542677

For the aggregated results the hypothesis of unimodality cannot be rejected for the trend. For the component-specific analyses unimodality cannot be rejected only for the intercept (on which the ordering constraint was imposed) and, again, the trend. For all other parameter estimates unimodality is rejected, which indicates that the ordering constraint was able to impose a unique labelling only for its own parameter and not for the other parameters. This suggests identifiability problems.

A. Generation of saved R output

> ArtEx.bs <- boot(ArtEx.fit, R = 200, sim = "parametric")
> seizmix.bs <- boot(seizmix, R = 200, sim = "parametric")
> save(ArtEx.bs, seizmix.bs, file = "boot-output.rda")

References

Canty A, Ripley B (2010). boot: Bootstrap R (S-Plus) Functions. R package version 1.2-43, URL http://cran.r-project.org/package=boot.

Davison AC, Hinkley DV (1997). Bootstrap Methods and Their Application. Cambridge Series on Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, UK. ISBN 0-521-57391-2 (hardcover), 0-521-57471-4 (paperback).

Grün B, Leisch F (2004). Bootstrapping Finite Mixture Models. In J Antoch (ed.), COMPSTAT 2004 Proceedings in Computational Statistics, pp. 1115-1122. Physica Verlag, Heidelberg. ISBN 3-7908-1554-3.

Hennig C (2000). Identifiability of Models for Clusterwise Linear Regression. Journal of Classification, 17(2), 273-296.

Wang P, Puterman ML, Cockburn IM, Le ND (1996). Mixed Poisson Regression Models with Covariate Dependent Rates. Biometrics, 52, 381-400.

Affiliation:

Bettina Grün
Institut für Angewandte Statistik
Johannes Kepler Universität Linz
Altenbergerstraße 69
4040 Linz, Austria
E-mail: Bettina.Gruen@jku.at

Friedrich Leisch
Institut für Angewandte Statistik und EDV
Universität für Bodenkultur Wien
Peter Jordan Straße 82
1190 Wien, Austria
E-mail: Friedrich.Leisch@boku.ac.at
URL: http://www.statistik.lmu.de/~leisch/