UDC 519.2

Statistical Analysis of Data Generated by a Mixture of Two Parametric Distributions

Yu. K. Belyaev, D. Källberg, P. Rydén
Department of Mathematics and Mathematical Statistics, Umeå University, SE 901 87 Umeå, Sweden

Abstract. We introduce a novel approach to estimating the parameters of a two-component mixture distribution. The method combines a grid-based approach with the method of moments and reparametrization. The grid approach enables the use of parallel computing, and the method can easily be combined with resampling techniques. We derive a reparametrization for the mixture of two Weibull distributions and apply the method to gene expression data from one gene and 409 ER+ cancer patients.

Keywords: mixture of parametric distributions, point process, consistency, accuracy of estimators, resampling, Palm intensities, responsibilities.

1. Introduction

Novel technologies in medicine and the manufacturing industry are generating high-dimensional and complex data which have the potential to provide vital information and knowledge, but statistical analysis remains a bottleneck. In cancer research the expression of thousands of genes is measured, with the objective to search for novel disease subtypes by applying cluster analysis [1, 2]. This requires that the dimension of the problem be reduced through variable selection [1, 2]. We address variable selection in a parametric framework for the case when there are two types of sub-diseases. For each variable we assume that the observations are generated by a two-component mixture distribution with a set of unknown parameters. This is a well-studied problem that has been addressed for more than 100 years, see e.g. [3-5]. Karl Pearson carried out the first attempt to estimate the normal mixture distribution using the method of moments; here we modify this approach and introduce a grid-based method that can be applied to a wide range of distribution families.
We introduce a measure that is a function of the responsibilities, which can easily be estimated and used to select informative genes for the cluster analysis.

2. Main section

We consider $n_d$ observations $x_1^{n_d} = \{x_1, \ldots, x_{n_d}\}$ from a population with two groups of individuals. Let $t_1^{n_d} = \{t_1, \ldots, t_{n_d}\}$ denote the individuals' unobservable group labels (1 or 2) and regard $\{x_j, t_j\}_{j=1,2,\ldots}$ as

observations of the independent and identically distributed (i.i.d.) random variables $\{X_j, T_j\}$, where $X_j$ has the probability density function (p.d.f.)
$$p[x, \theta] = \omega_1 p_1[x, \theta_1] + (1 - \omega_1)\, p_2[x, \theta_2].$$
Here $p_i[x, \theta_i]$ denotes the p.d.f. of $X_j$ given that $T_j = i$, and $\omega_1$ denotes the proportion of individuals belonging to group 1. Let $\theta = \{\omega_1, \theta_1, \theta_2\}$ denote the $k$ unknown parameters of the p.d.f., and suppose that the overall objective is to estimate a parameter that can be expressed as a function of the model parameters, i.e. $\varphi = g(\theta)$.

We propose a general solution to the above problem that can easily be combined with a resampling procedure in order to derive a confidence interval (CI) for the parameter of interest. In the first stage the $k$ parameters are divided into two groups: $l$ grid parameters $\theta_g$ and $k - l$ free parameters $\theta_f$. Through reparametrization the first $k - l$ moments can be expressed as algebraic functions of the parameters, i.e.
$$\mu_1 = f_1(\theta_g, \theta_f), \;\ldots,\; \mu_{k-l} = f_{k-l}(\theta_g, \theta_f), \qquad (1)$$
where $\mu_m = E[X_j^m]$, $m = 1, \ldots, k - l$. If the grid parameters are considered as known and the moments are empirically estimated, then the equation system defined in (1) can be solved. We propose a grid-based approach where, for each grid point, we estimate the $k - l$ free parameters via the reparametrization approach. Finally, $\theta$ is estimated with the grid-point estimate $\hat{\theta}$ that maximizes the log-likelihood function (or some other fitness criterion). Below we show how reparametrization can be used for the case when the data are described by a mixture of two Weibull distributions.

Using the theory of point processes [6] we can apply the Palm intensities to define the responsibilities
$$q_i[x_j, \theta] = P[T_j = i \mid X_j = x_j] = \frac{\omega_i\, p_i[x_j, \theta_i]}{\omega_1 p_1[x_j, \theta_1] + (1 - \omega_1)\, p_2[x_j, \theta_2]},$$
for $i = 1, 2$ and $j = 1, \ldots, n_d$, where $\omega_2 = 1 - \omega_1$.
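The two-group generative model at the start of this section can be sketched with a small simulation. The helper below is ours, not from the paper, and the Weibull components and parameter values are chosen purely for illustration:

```python
import numpy as np

def sample_mixture(n, omega1, draw1, draw2, rng):
    """Draw n observations {x_j, t_j}: the label T_j is 1 with probability
    omega_1 and 2 otherwise, and X_j is drawn from the chosen component."""
    t = np.where(rng.random(n) < omega1, 1, 2)
    x = np.where(t == 1, draw1(n, rng), draw2(n, rng))
    return x, t

rng = np.random.default_rng(1)
x, t = sample_mixture(
    1000, 0.3,
    lambda n, r: 1.0 * r.weibull(1.75, n),  # component 1: scale 1.0, shape 1.75
    lambda n, r: 7.5 * r.weibull(1.3, n),   # component 2: scale 7.5, shape 1.3
    rng,
)
```

In the estimation problem below, only `x` is observed; the labels `t` are latent, which is exactly why the responsibilities are needed.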
Suppose that the responsibilities are used to predict which cancer type the patients have; then the expected number of correctly classified (cc) individuals can be expressed as
$$\varphi_{cc} = \sum_{j=1}^{n_d} \left( q_1[x_j, \theta]^2 + q_2[x_j, \theta]^2 \right),$$
where $\theta$ are the true parameters. The measure $\varphi_{cc}$ can be estimated by replacing $\theta$ with $\hat{\theta}$. The estimated measure $\hat{\varphi}_{cc}(x_1^{n_d})$ can be used for variable selection in a high-dimensional cluster analysis problem.
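The responsibilities and the measure $\varphi_{cc}$ translate directly into code. The sketch below takes the component densities as callables, so it is not tied to any particular distribution family; the function names are ours, not from the paper:

```python
import numpy as np

def responsibilities(x, omega1, pdf1, pdf2):
    """q_i[x_j] = omega_i p_i[x_j] / (omega_1 p_1[x_j] + omega_2 p_2[x_j])."""
    w1 = omega1 * pdf1(x)
    w2 = (1.0 - omega1) * pdf2(x)
    q1 = w1 / (w1 + w2)
    return q1, 1.0 - q1

def phi_cc(x, omega1, pdf1, pdf2):
    """Expected number of correctly classified individuals,
    sum_j (q_1[x_j]^2 + q_2[x_j]^2); it ranges from n_d/2 to n_d."""
    q1, q2 = responsibilities(x, omega1, pdf1, pdf2)
    return float(np.sum(q1 ** 2 + q2 ** 2))
```

When the two components overlap completely, every $q_i = 1/2$ and $\varphi_{cc} = n_d/2$; well-separated components push $\varphi_{cc}$ towards $n_d$, which is what makes the measure useful for ranking genes.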

We now consider the special case when the data are described by a mixture of two 2-parameter Weibull distributions (Mix2W), where $X_j$ has the p.d.f.
$$p_{2W}[x, \theta_5] = \omega_1 p_W[x, \alpha_1, \beta_1] + (1 - \omega_1)\, p_W[x, \alpha_2, \beta_2],$$
with
$$p_W[x, \alpha, \beta] = \frac{\beta}{\alpha} \left( \frac{x}{\alpha} \right)^{\beta - 1} e^{-(x/\alpha)^{\beta}}, \quad x \geq 0,$$
where $\alpha > 0$ and $\beta > 0$ denote the scale and shape parameters, respectively. The first three moments of $X_j$ are
$$\mu_1 = \omega_1 \alpha_1 g_{11} + (1 - \omega_1) \alpha_2 g_{21}, \qquad (2)$$
$$\mu_2 = \omega_1 \alpha_1^2 g_{12} + (1 - \omega_1) \alpha_2^2 g_{22}, \qquad (3)$$
$$\mu_3 = \omega_1 \alpha_1^3 g_{13} + (1 - \omega_1) \alpha_2^3 g_{23}, \qquad (4)$$
where $g_{ik} = \Gamma[1 + k/\beta_i]$ and $\Gamma[\cdot]$ is the gamma function, $i = 1, 2$, $k = 1, 2, 3$. For known values of $\psi_5 = \{\mu_1, \mu_2, \mu_3, \beta_1, \beta_2\}$ this is a determined equation system with unknown parameters $\{\omega_1, \alpha_1, \alpha_2\}$. From (2) it follows that $\alpha_2$ can be expressed as a function of $\alpha_1$, and together with (3) we get that $\alpha_1$ is given by the solutions of a quadratic equation with coefficients depending on $\omega_1$. Inserting these solutions in (4) yields cubic equations that can be solved w.r.t. $\omega_1$ if the moments $\mu_1, \mu_2, \mu_3$ are replaced with the empirical moments $\hat{\mu}_1, \hat{\mu}_2, \hat{\mu}_3$, where $\hat{\mu}_k = \frac{1}{n_d} \sum_{j=1}^{n_d} x_j^k$. The local estimator $\tilde{\theta}_5$ of $\theta_5$ given $\{\beta_1, \beta_2\}$ is defined as the relevant solution (i.e. $\alpha_1, \alpha_2 > 0$ and $0 < \omega_1 < 1$) that maximizes the log-likelihood function. Note that for some values $\tilde{\psi} = \{\hat{\mu}_1, \hat{\mu}_2, \hat{\mu}_3, \beta_1, \beta_2\}$ there may not exist any relevant solution of the equation system.

Let $G$ denote the regular grid with grid points $\{\beta_1, \beta_2\}$. The global estimator $\hat{\theta}_5 = \{\hat{\omega}_1, \hat{\alpha}_1, \hat{\beta}_1, \hat{\alpha}_2, \hat{\beta}_2\}$ of $\theta_5$ is given by the local estimator $\tilde{\theta}_5$ that maximizes the log-likelihood function, i.e.
$$\hat{\theta}_5 = \arg\max_{\tilde{\theta}_5} \sum_{j=1}^{n_d} \log\left( \tilde{\omega}_1 p_W[x_j, \tilde{\alpha}_1, \beta_1] + (1 - \tilde{\omega}_1)\, p_W[x_j, \tilde{\alpha}_2, \beta_2] \right).$$
We call this algorithm the Hybrid, Reparametrization, and Discretization (HRD) algorithm. The HRD estimator $\hat{\theta}_5$ can be used to estimate the responsibilities and the measure $\varphi_{cc}$.
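As a rough illustration of the HRD idea, the sketch below lays a grid over the shape parameters $(\beta_1, \beta_2)$ and, at each grid point, solves the moment equations (2)-(4) for $(\omega_1, \alpha_1, \alpha_2)$, keeping the relevant solution with the largest log-likelihood. For simplicity it uses a generic numerical root-finder in place of the paper's closed-form quadratic/cubic reduction, so it is an approximation of the algorithm under our own assumptions, not the authors' implementation; all function names are ours:

```python
import numpy as np
from itertools import product
from scipy.optimize import fsolve
from scipy.special import gamma
from scipy.stats import weibull_min

def mix_moment(k, w, a1, b1, a2, b2):
    # mu_k = w * a1^k * Gamma(1 + k/b1) + (1 - w) * a2^k * Gamma(1 + k/b2)
    return w * a1 ** k * gamma(1 + k / b1) + (1 - w) * a2 ** k * gamma(1 + k / b2)

def mix_loglik(x, w, a1, b1, a2, b2):
    p = w * weibull_min.pdf(x, b1, scale=a1) + (1 - w) * weibull_min.pdf(x, b2, scale=a2)
    return np.sum(np.log(p)) if np.all(p > 0) else -np.inf

def hrd_estimate(x, beta_grid):
    """Grid over (beta_1, beta_2); per grid point, solve the three moment
    equations for (omega_1, alpha_1, alpha_2) and keep the relevant
    solution that maximizes the log-likelihood."""
    mu_hat = [np.mean(x ** k) for k in (1, 2, 3)]
    best, best_ll = None, -np.inf
    for b1, b2 in product(beta_grid, repeat=2):
        def moment_eqs(u):
            w, a1, a2 = u
            return [mix_moment(k, w, a1, b1, a2, b2) - mu_hat[k - 1]
                    for k in (1, 2, 3)]
        sol, _, ok, _ = fsolve(moment_eqs, [0.4, 0.5 * mu_hat[0], 1.5 * mu_hat[0]],
                               full_output=True)
        w, a1, a2 = sol
        if ok == 1 and 0 < w < 1 and a1 > 0 and a2 > 0:  # "relevant" solution
            ll = mix_loglik(x, w, a1, b1, a2, b2)
            if ll > best_ll:
                best_ll, best = ll, {"omega1": w, "alpha1": a1, "beta1": b1,
                                     "alpha2": a2, "beta2": b2}
    return best, best_ll
```

Because the grid points are processed independently, the loop over $(\beta_1, \beta_2)$ parallelizes trivially, which is the property the grid approach is designed to exploit.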
The approach can easily be combined with resampling in order to assess the accuracy of the estimators (see [7]).

Next we consider an example where we model microarray breast cancer data obtained from The Cancer Genome Atlas (TCGA). Expression

levels for the gene GLNT3 were taken from 409 patients with positive estrogen receptor (ER+) status, a well-known subgroup of the disease. The untransformed data were modeled with a mixture of two Weibull distributions, with the weight $\hat{\omega}_1 = 0.17$ and the scale and shape parameters $\hat{\alpha}_1 = 0.926$, $\hat{\beta}_1 = 1.75$, $\hat{\alpha}_2 = 7.514$, $\hat{\beta}_2 = 1.287$, see Fig. 1. The survival distribution function of the fitted model was close to the empirical survival distribution function, see Fig. 2.

Figure 1. Estimated components of the Mix2W model.

Figure 2. Empirical survival distribution function and the estimated mixture of two Weibull survival distribution functions.

3. Conclusions

The problem of how to estimate the parameters of a two-component mixture distribution is an old one that has attracted a lot of attention. The idea to lower the complexity of the problem by momentarily considering some parameters as fixed over a grid is attractive, since modern computers are powerful, the approach is well suited for parallelization, and it can easily be combined with resampling. Moreover, the values of

the log-likelihood function can easily be visualized for 1- or 2-dimensional grids, which can be very informative. For the Weibull example we used a 2-dimensional grid and fixed the shape parameters, but there are several alternative reparametrizations that could be considered. Generally, there is a tradeoff between the complexity of the grid and the complexity of the equation system: the higher the dimension of the grid, the simpler the equation system. One advantage of considering a high-dimensional grid is that there is no need to estimate higher-order moments, which can be difficult, in particular if the sample size is relatively small or if the data are contaminated with outliers. There are several open questions regarding the construction of the grid. For example: How should the boundaries be selected? How dense does the grid need to be? Can an iterative grid-based approach, allowing for uneven grid densities, be used?

Acknowledgments

This work was supported by grants from the Swedish Research Council, Dnr 340-2013-5185 (P. R.), the Kempe Foundations, Dnr JCK-1315 (D. K., P. R.), and the Faculty of Science and Technology, Umeå University (Yu. B., D. K., P. R.).

References

1. Freyhult E., Landfors M., Önskog J., Hvidsten T. R., Rydén P. Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering // BMC Bioinformatics. 2010. Vol. 11. Article 503.
2. Bolón-Canedo V., Sánchez-Maroño N., Alonso-Betanzos A., Benítez J. M., Herrera F. A review of microarray datasets and applied feature selection methods // Information Sciences. 2014. Vol. 282, no. 1. P. 111-135.
3. Bordes L., Mottelet S., Vandekerkhove P. Semiparametric estimation of a two-component mixture model // Annals of Statistics. 2006. Vol. 34, no. 3. P. 1204-1232.
4. Carta J. A., Ramírez P. Analysis of two-component mixture Weibull statistics for estimation of wind speed distributions // Renewable Energy. 2007. Vol. 32, no. 3. P. 518-531.
5.
Celeux G., Chauveau D., Diebolt J. Stochastic versions of the EM algorithm: an experimental study in the mixture case // Journal of Statistical Computation and Simulation. 1996. Vol. 55, no. 4. P. 287-314.
6. Kallenberg O. Foundations of modern probability. Springer, 2006.
7. Belyaev Yu. K., Nilsson L. Parametric maximum likelihood estimators and resampling. Statistical Research Report 1997-15, Department of Mathematical Statistics, Umeå University. ISSN 1401-730X.