An EM-Algorithm Based Method to Deal with Rounded Zeros in Compositional Data under Dirichlet Models. Rafiq Hijazi

Similar documents
Updating on the Kernel Density Estimation for Compositional Data

arxiv: v2 [stat.me] 16 Jun 2011

The Dirichlet distribution with respect to the Aitchison measure on the simplex - a first approach

Dealing With Zeros and Missing Values in Compositional Data Sets Using Nonparametric Imputation 1

Time Series of Proportions: A Compositional Approach

A Critical Approach to Non-Parametric Classification of Compositional Data

CoDa-dendrogram: A new exploratory tool. 2 Dept. Informàtica i Matemàtica Aplicada, Universitat de Girona, Spain;

Methodological Concepts for Source Apportionment

arxiv: v6 [stat.me] 7 Jun 2017

The Mathematics of Compositional Analysis

The k-nn algorithm for compositional data: a revised approach with and without zero values present

Regression analysis with compositional data containing zero values

Regression with compositional response having unobserved components or below detection limit values

Appendix 07 Principal components analysis

A latent Gaussian model for compositional data with zeroes

Reconstruction of individual patient data for meta analysis via Bayesian approach

Statistical Analysis of. Compositional Data

Exponential Family and Maximum Likelihood, Gaussian Mixture Models and the EM Algorithm. by Korbinian Schwinger

Regression with Compositional Response. Eva Fišerová

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

Bayes spaces: use of improper priors and distances between densities

Modified Kolmogorov-Smirnov Test of Goodness of Fit. Catalonia-BarcelonaTECH, Spain

Nonparametric hypothesis testing for equality of means on the simplex. Greece,

THE CLOSURE PROBLEM: ONE HUNDRED YEARS OF DEBATE

STA 4273H: Statistical Machine Learning

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Propensity score matching for multiple treatment levels: A CODA-based contribution

Bayesian Methods for Machine Learning

Approaching predator-prey Lotka-Volterra equations by simplicial linear differential equations

Pattern Recognition. Parameter Estimation of Probability Density Functions

Generative Clustering, Topic Modeling, & Bayesian Inference

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Error Propagation in Isometric Log-ratio Coordinates for Compositional Data: Theoretical and Practical Considerations

COMPOSITIONAL IDEAS IN THE BAYESIAN ANALYSIS OF CATEGORICAL DATA WITH APPLICATION TO DOSE FINDING CLINICAL TRIALS

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

More on Unsupervised Learning

Some Practical Aspects on Multidimensional Scaling of Compositional Data 2 1 INTRODUCTION 1.1 The sample space for compositional data An observation x

Regression Clustering

A Note on Bayesian Inference After Multiple Imputation

A latent Gaussian model for compositional data with structural zeroes

Pattern Recognition and Machine Learning

ASA Section on Survey Research Methods

Statistical Practice

Foundations of Statistical Inference

STA 4273H: Statistical Machine Learning

Chapter 17: Undirected Graphical Models

1 Introduction. 2 A regression model

Anders Skrondal. Norwegian Institute of Public Health London School of Hygiene and Tropical Medicine. Based on joint work with Sophia Rabe-Hesketh

CS6220: DATA MINING TECHNIQUES

Bayesian Networks in Educational Assessment

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements

PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA

A Comparison of Two MCMC Algorithms for Hierarchical Mixture Models

On Bicompositional Correlation. Bergman, Jakob. Link to publication

Likelihood-based inference with missing data under missing-at-random

Marginal Specifications and a Gaussian Copula Estimation

Estimation and model selection in Dirichlet regression

Fractional Imputation in Survey Sampling: A Comparative Review

and Comparison with NPMLE

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Nonparametric Bayesian Methods (Gaussian Processes)

Contents. Part I: Fundamentals of Bayesian Inference 1

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky

Models for Multivariate Panel Count Data

Learning Gaussian Process Models from Uncertain Data

A spatial scan statistic for multinomial data

Approximation of Survival Function by Taylor Series for General Partly Interval Censored Data

Probabilistic Graphical Models

Signal Interpretation in Hotelling s T 2 Control Chart for Compositional Data

Markov Chain Monte Carlo (MCMC)

Discriminant analysis for compositional data and robust parameter estimation

EXPLORATION OF GEOLOGICAL VARIABILITY AND POSSIBLE PROCESSES THROUGH THE USE OF COMPOSITIONAL DATA ANALYSIS: AN EXAMPLE USING SCOTTISH METAMORPHOSED

CS Lecture 19. Exponential Families & Expectation Propagation

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Patterns of Scalable Bayesian Inference Background (Session 1)

Overview. DS GA 1002 Probability and Statistics for Data Science. Carlos Fernandez-Granda

A Program for Data Transformations and Kernel Density Estimation

Genetic Algorithms: Basic Principles and Applications

Austrian Journal of Statistics

A general mixed model approach for spatio-temporal regression data

CPSC 540: Machine Learning

Exploring Compositional Data with the CoDa-Dendrogram

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

BAYESIAN DECISION THEORY

The EM Algorithm for the Finite Mixture of Exponential Distribution Models

Introduction An approximated EM algorithm Simulation studies Discussion

Cluster investigations using Disease mapping methods International workshop on Risk Factors for Childhood Leukemia Berlin May

Nonparametric Bayesian Methods - Lecture I

Probabilistic Graphical Models for Image Analysis - Lecture 4

Multistate Modeling and Applications

Nonresponse weighting adjustment using estimated response probability

ECE 5984: Introduction to Machine Learning

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

CPSC 540: Machine Learning

Statistical Estimation

Undirected Graphical Models

Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices

MH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution

Chapter 11. Stochastic Methods Rooted in Statistical Mechanics

Transcription:

An EM-Algorithm Based Method to Deal with Rounded Zeros in Compositional Data under Dirichlet Models Rafiq Hijazi Department of Statistics United Arab Emirates University P.O. Box 17555, Al-Ain United Arab Emirates rhijazi@uaeu.ac.ae Abstract Zeros in compositional data are classified into rounded zeros and essential zeros. The rounded zero corresponds to a small proportion or below detection limit value while the essential zero is an indication of the complete absence of the component in the composition. Several parametric and non-parametric imputation techniques have been proposed to replace rounded zeros and model the essential zeros under logratio model. In this paper, a new method based on EM algorithm is proposed for replacing rounded zeros. The proposed method is illustrated using simulated data. 1 Introduction Compositional data are non-negative proportions with unit-sum. This type of data arises whenever objects are classified into disjoint categories and the resulting relative frequencies are recorded, or partition a whole measurement into percentage contributions from its various parts. The sample space of compositional data is the simplex S D defined as S D = {(x 1,..., x D ) : x j > 0 for j = 1,..., D and D x j = 1} j=1 In compositional data analysis, the presence of zero components represents one of the main obstacles facing the application of both logratio analysis and Dirichlet regression. In a logratio analysis, we cannot take the logarithm of zero when applying the additive logistic transformation. In the Dirichlet model, the presence of zeros makes the probability density function vanish. In this paper, we propose a new technique, based on EM algorithm, for replacing the zeros under Dirichlet model. Section 2 gives an overview of zeros and zero replacement strategies in compositional data. The new EM algorithm based replacement method is described in Section 3. An application to illustrate the use of the proposed technique is presented in Section 4. Finally, concluding remarks are given in Section 5. 2 Zero Replacement in Compositional Data 2.1 Types of Zeros in Compositional Data Aitchison (1986) classified the zeros in compositional data into rounded or trace elements zeros and essential or true zeros. The trace zero is an artifact of the measurement process, where observation is recorded as zero when it is below the detection limit (BDL). Such zeros should be treated as missing values and should be replaced. On the other hand, often the observation is recorded as zero as an indication of the complete absence of the component in the composition. Such zeros are called essential or true zeros and their pattern of occurrence should be investigated and modeled. 2.2 Common Zero Replacement Methods As a solution to the rounded zeros problem, Aitchison (1986) suggested the reduction of the number of components in the composition by amalgamation. That is, eliminating the components with zero observations by combining them with some other components. Such approach is not appropriate when 1 1

the goal is modeling the original compositions or the model includes only three components. However, a more logical approach is to replace the rounded zeros by a small nonzero value that does not seriously distort the covariance structure of the data (Martín-Fernández et al. 2003a). The first replacement method, the additive replacement, proposed by Aitchison (1986) is simply replacing the zeros by a small value δ and then normalizing the imputed compositions. Fry et al. (2000) showed that the additive replacement is not subcompositionally coherent and consequently, distorts the covariance structure of the data set. Martín-Fernández et al. (2003) proposed an alternative method using a multiplicative replacement which preserves the ratios of nonzero components. Let x = (x 1,..., x D ) S D be a composition with rounded zeros. The multiplicative method replaces the composition x containing c zeros with a zero-free composition r S D according to the following replacement rule { δj if x r j = j = 0 (1) (1 cδ) x j if x j > 0 In addition, Martín-Fernández et al. (2003) emphasized that the best results are obtained when δ is close to 65% of the detection limit. However, since the multiplicative replacement imputes exactly the same value in all the zeros of the compositions, this replacement introduces artificial correlation between components which have zero values in the same composition. Besides these nonparametric approaches, a parametric approach based on applying a modified EM algorithm on the additive logratio transformation has been proposed (Martín-Fernández et al. (2003), Palarea-Albaladejo et al. (2007) and Palarea-Albaladejo and Martín-Fernández (2008)). However, none of these methods is applicable when the compositional data arise from Dirichlet distribution. Recently, Hijazi (2009) proposed a new method based on beta regression to replace the rounded zeros when the data arise from Dirichlet distribution. However, the proposed method might rarely replace the rounded zeros by values exceeding the detection limit. 3 EM Algorithm Based Method The EM (Expectation-Maximization) algorithm is a widely used iterative algorithm for parameter estimation by maximum likelihood method when part of the data is missing or censored data. The algorithm consists of two steps; the E and the M. The E step finds the conditional expectation of the missing data given the observed data and current estimated parameters, and then replaces missing values by estimated values. The M step performs maximum-likelihood estimation of the parameters using estimated parameter values as true values. In the compositional context, let X = [x ij ] be a matrix containing a random sample of n D- component compositions such that X = (X obs, X miss ) where X obs and X miss are the observed and the missing parts respectively. Assume that X has a Dirichlet distribution with parameters Λ = (λ 1, λ 2,, λ D ) and a rounded zero occurs when x ij < γ where γ is the detection limit. The completedata log-likelihood for Λ based on a sample X of size n is given by D l = log L = n log Γ (λ) n D log Γ (λ j ) + (λ j 1) log(x ij ) (2) j=1 i=1 j=1 where λ = D j λ j. A useful reparametrization of Dirichlet density is constructed by letting µ j = λ j /λ for j = 1, 2, ; D 1 while λ will be the dispersion parameter of the model. Since the Dirichlet distribution belongs to the exponential family, the complete data log-likelihood is linear in the non-observed data. Hence, on the t th iteration of the EM algorithm, the computation of the conditional expected complete data likelihood in the E-step reduces to the computation of the conditional expected value of the missing part, E[X miss X obs ; θ (t) ], given a parameter estimate θ (t) = Λ (t). The E-step in the EM algorithm should be modified to incorporate the detection limit. Hence, on the t th iteration in the E-step, the values of x ij should be replaced according to the following rule: { x (t) xij if x ij = ij γ E ( x ij x i, j, x ij < γ, Λ (t)) if x ij < γ (3) 2

for i = 1,, n and j = 1,, D, where x i, j denotes the set of observed variables for the i th composition of the data. The conditional expectation in the proposed method should be computed under Dirichlet distribution analogous to the work of Palarea-Albaladejo et al. (2007). The expectation in (3) is written as 1 γ Γ(λ) E (x j x j, x j < γ) = P (x j < γ) 0 Γ(λ j )Γ(λ λ j ) xλ j 1 j (1 x j ) λ λj 1 dx j (4) where λ and λ j are obtained from the Beta regression of x j on x j (Hijazi, 2009). The latest expression will reduce to E (x j x j, x j < γ) = λ jf 1 (γ) λf 2 (γ) where F 1 and F 2 are the cumulative distribution functions of beta random variables with parameters (λ j + 1, λ λ j ) and (λ j, λ λ j ), respectively. 4 Example: Simulated Data In this section, we present an application using a simulated data to illustrate the proposed replacement method. The simulated data in Figure 1 consist of 100 compositions randomly generated from D(1, 19, 30). This paper focuses on the new proposed replacement method, we will consider that the detection limit is 0.5%. This will result in 14 compositions with first component recorded as rounded zero as shown in Figure (1a). The Euclidean distance between the original data and the imputed data using the multiplicative method, beta regression based method and EM algorithm based method are 0.00775, 0.01389 and 0.00768, respectively. This indicates that the new replacement method yielded an imputed data which is slightly closer to the original than the one produced by the multiplicative method. The beta regression based methods appears to perform poorly in this case. The compositions with rounded zeros and the corresponding imputed compositions are shown in Figure (1b). The maximum likelihood estimates of the original data and the imputed data are given in Table (1). It is clear that the EM algorithm based method has resulted in the closest estimates compared to the original data, however, the other two methods have distorted the covariance structure of the original data. Table 1: Maximum likelihood estimates of original and imputed data (5) λ 1 λ 2 λ 3 Original Data 1.183 19.517 32.267 Multiplicative replacement 1.372 21.573 35.701 Beta regression replacement 1.477 22.595 37.409 EM algorithm replacement 1.154 19.278 31.869 5 Conclusion In this work we have proposed a new replacement method based on EM algorithm under Dirichlet model. The proposed method was compared with the multiplicative replacement and the beta regression based methods through an illustrative example. The new method outperforms the multiplicative replacement and beta regression based methods. This method gives positive imputed value and takes into account the detection limit of the part. A Monte Carlo simulation study should be conducted to evaluate the performance of the new technique with varying covariance structures and percentage of rounded zeros. 3

x2 x1 Zero Free Rounded Zeros x3 0.0 0.002 0.004 0.006 Original MR BRR EM 0.55 0.60 0.65 0.70 0.75 0.80 0.85 (a) (b) Figure 1: (a) The ternary diagram of the simulated data (b) The original compositions before rounding, imputed compositions with MR (multiplicative method), BRR (beta regression based method) and imputed compositions with EM (EM algorithm based method) 4

References Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability. Chapman & Hall Ltd., London (UK). (Reprinted in 2003 with additional material by The Blackburn Press). 416 p. Fry, J. M., Fry, T. R. L., and McLaren, K. R. (2000). Compositional data analysis and zeros in micro data. Applied Economics 32 (8), 953 959. Hijazi R. (2009). Dealing with Rounded Zeros in Compositional Data under Dirichlet Models. Proceedings of the 10th Islamic Countries Conference on Statistical Sciences 2009 (ICCS-X). Martín-Fernández, J.A., Barceló-Vidal, C., Pawlowsky-Glahn, V. (2003). Dealing With Zeros and Missing Values in Compositional Data Sets Using Nonparametric Imputation. Mathematical Geology 35 (3), 253 278. Martín-Fernández, J.A., Palarea-Albaladejo J, Gómez-García, J. (2003). Markov Chain Monte Carlo Method Applied to Rounding Zeros of Compositional Data: First Approach. Proceedings of CO- DAWORK 05, The 2nd Compositional Data Analysis Workshop. Palarea-Albaladejo, J., Martín-Fernández, J. (2008). A modified EM alr-algorithm for replacing rounded zeros in compositional data sets. Computers & Geosciences 34, 902 917. Palarea-Albaladejo, J., Martín-Fernández, J., Gómez-García, J. (2007). A parametric approach for dealing with compositional rounded zeros. Mathematical Geology 39, 625 645. 5