Estimating the parameters of hidden binomial trials by the EM algorithm

Hacettepe Journal of Mathematics and Statistics, Volume 43 (5) (2014), 885-890

Degang Zhu

Received: 02:09:2013. Accepted: 02:12:2013.

Abstract. The EM algorithm has become a popular and efficient iterative procedure for computing the maximum likelihood (ML) estimate in the presence of incomplete data. Each iteration of the EM algorithm involves two steps, called the expectation step (E-step) and the maximization step (M-step). The complexity of the statistical model often makes the maximization step difficult. An identity on which the derivation of the EM algorithm is based is presented, and it is shown that deriving the iteration formula for the parameters of hidden binomial trials from this identity is much simpler than working through the usual M-step.

2000 AMS Classification: 62F10, 62-07, 62P10.

Keywords: Maximum likelihood; EM algorithm; incomplete data; hidden binomial trials; identity.

1. Introduction

The EM (Expectation-Maximization) algorithm has become a popular tool in statistical estimation problems involving incomplete data, or in problems that can be posed in a similar form, such as mixture estimation [7, 6]. The idea has been around for a long time, and some special cases of the algorithm were investigated much earlier [2], but the seminal reference that formalized EM and provided a proof of convergence is the paper by Dempster, Laird, and Rubin [1]. At present there are a number of popular and very useful references on the EM algorithm [6, 10, 8, 9, 5]. In this paper we first introduce an identity underlying the EM algorithm and then show that deriving the iterative formulae for parameter estimation in hidden binomial trials from this identity is simpler and easier than the usual procedure.

School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China; Department of Applied Mathematics, Nanjing Forestry University, Nanjing 210037, China. Email: dgzhu@njfu.edu.cn

First we give a brief introduction to the EM algorithm. Suppose $Y$ denotes the unobserved complete data in some statistical trial, $Y_{obs}$ the observed incomplete data, and $Y_{mis}$ the missing data, so that we may write $Y = (Y_{obs}, Y_{mis})$. The general idea is to choose $Y$ so that maximum likelihood estimation becomes trivial for the complete data. Suppose $Y$ has probability density $f(Y \mid \theta)$, where $\theta$ is the parameter vector. As mentioned above, each iteration of the EM algorithm involves two steps, the E-step and the M-step. In the E-step we calculate the conditional expectation

$Q(\theta \mid \theta^{(t)}) = E[\log f(Y \mid \theta) \mid Y_{obs}, \theta^{(t)}],$

where $\theta^{(t)}$ denotes the current estimate of $\theta$ after the $t$-th iteration. In the M-step we maximize $Q(\theta \mid \theta^{(t)})$ with respect to $\theta$ and take the maximizer

$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$

as the new parameter estimate, which commonly yields an iterative formula. The two steps are repeated until convergence occurs; comprehensive discussions of the convergence of the EM algorithm can be found in [6, 11]. The key idea of the EM algorithm is to generate a sequence $\{\theta^{(t)}\}$ that increases the log-likelihood of the observed data and to take the limit of $\{\theta^{(t)}\}$ as the estimate of the parameters. The algorithm is particularly suited to missing-data problems, where the very presence of missing data can make direct calculations cumbersome. The two steps may seem easy and clear; however, the M-step sometimes cannot be handled easily, so it is desirable to find procedures that make the M-step simpler. To this end we first introduce an identity.
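To fix ideas, the generic iteration can be written as a short driver. This is a minimal sketch, not taken from the paper, for a single scalar parameter $\theta \in (0,1)$; the name em and the interface q_func(theta, theta_t, y_obs), standing in for $Q(\theta \mid \theta^{(t)})$, are hypothetical, and SciPy is assumed to be available for the numerical M-step.

```python
# Minimal sketch of the generic E-step/M-step iteration for a scalar parameter in (0, 1).
# q_func(theta, theta_t, y_obs) is a hypothetical stand-in for Q(theta | theta^(t))
# computed in the E-step; the M-step is carried out by numerical maximization.
from scipy.optimize import minimize_scalar

def em(q_func, y_obs, theta0, tol=1e-8, max_iter=500):
    """Iterate theta^(t+1) = argmax_theta Q(theta | theta^(t)) until convergence."""
    theta = theta0
    for _ in range(max_iter):
        # M-step: maximize Q(. | theta^(t)) over (0, 1) by minimizing its negative.
        res = minimize_scalar(lambda th: -q_func(th, theta, y_obs),
                              bounds=(1e-10, 1.0 - 1e-10), method="bounded")
        theta_new = res.x
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```

When $Q(\theta \mid \theta^{(t)})$ can be maximized in closed form, as in Proposition 2.2 below, the numerical search in the M-step is of course unnecessary.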

2. Methods

2.1. Proposition [4]. Suppose the complete data $Y$ possess a probability density $f(Y \mid \theta)$, where $\theta$ is the unknown parameter vector. When $Y$ is not fully observed we write $Y = (Y_{obs}, Y_{mis})$, where $Y_{obs}$ are the observed incomplete data and $Y_{mis}$ are the missing data. The EM algorithm for estimating $\theta$ can be formulated from the following identity:

(2.1)  $\dfrac{\partial}{\partial\theta} Q(\theta \mid \theta^{(t)}) \Big|_{\theta=\theta^{(t)}} = \dfrac{\partial}{\partial\theta}\, l(\theta \mid Y_{obs}) \Big|_{\theta=\theta^{(t)}},$

where $\theta^{(t)}$ is any interior point of the parameter space and $l(\theta \mid Y_{obs})$ is the log-likelihood of the observed data.

Proof. Since $Y = (Y_{obs}, Y_{mis})$, we have

(2.2)  $f(Y \mid \theta) = f(Y_{obs}, Y_{mis} \mid \theta) = f(Y_{obs} \mid \theta)\, f(Y_{mis} \mid Y_{obs}, \theta).$

Taking logarithms on both sides of (2.2) and rearranging gives

(2.3)  $l(\theta \mid Y_{obs}) = l(\theta \mid Y) - \log f(Y_{mis} \mid Y_{obs}, \theta).$

Taking the conditional expectation given $Y_{obs}$ and $\theta^{(t)}$ on both sides, we obtain the log-likelihood of the observed data as

(2.4)  $l(\theta \mid Y_{obs}) = Q(\theta \mid \theta^{(t)}) - q(\theta \mid \theta^{(t)}),$

where $Q(\theta \mid \theta^{(t)}) = E[l(\theta \mid Y) \mid Y_{obs}, \theta^{(t)}]$ and $q(\theta \mid \theta^{(t)}) = E[\log f(Y_{mis} \mid Y_{obs}, \theta) \mid Y_{obs}, \theta^{(t)}]$.

By the entropy inequality [4], $q(\theta \mid \theta^{(t)})$ attains its maximum at $\theta = \theta^{(t)}$ under certain regularity conditions, so the first-order derivative of $q(\theta \mid \theta^{(t)})$ vanishes at $\theta = \theta^{(t)}$. Differentiating both sides of (2.4) with respect to $\theta$ at $\theta = \theta^{(t)}$ therefore yields

$\dfrac{\partial}{\partial\theta}\, l(\theta \mid Y_{obs}) \Big|_{\theta=\theta^{(t)}} = \dfrac{\partial}{\partial\theta} Q(\theta \mid \theta^{(t)}) \Big|_{\theta=\theta^{(t)}} + 0 = \dfrac{\partial}{\partial\theta} Q(\theta \mid \theta^{(t)}) \Big|_{\theta=\theta^{(t)}},$

which proves Proposition 2.1.

The value of Proposition 2.1 is that, at the current iterate, derivatives of the expected complete-data log-likelihood and of the observed-data log-likelihood can be exchanged for one another, which can simplify the maximization compared with attacking $Q(\theta \mid \theta^{(t)})$ directly. Following this idea, a similar identity aimed at simplifying the M-step in some special cases can be formulated as follows.

2.2. Proposition [4]. Consider a statistical model involving $N$ binomial trials with success probability $\theta$, of which $S$ result in success. Let $Y_{obs}$ denote the observed incomplete data. Binomial experiments usually involve a fixed number of trials; here the number $N$ is allowed to vary for greater generality. Then, in the implementation of the EM algorithm, we have the identity

(2.5)  $\theta^{(t+1)} = \dfrac{E[S \mid Y_{obs}, \theta^{(t)}]}{E[N \mid Y_{obs}, \theta^{(t)}]}.$

Proof. The complete-data likelihood can be written as

$\dfrac{N!}{S!\,(N-S)!}\, \theta^{S} (1-\theta)^{N-S} = k\, \theta^{S} (1-\theta)^{N-S},$

where $k = \frac{N!}{S!\,(N-S)!}$ does not involve $\theta$. The E-step computes the expected log-likelihood, namely

(2.6)  $Q(\theta \mid \theta^{(t)}) = E[\log k \mid Y_{obs}, \theta^{(t)}] + \log(\theta)\, E[S \mid Y_{obs}, \theta^{(t)}] + \log(1-\theta)\, E[N-S \mid Y_{obs}, \theta^{(t)}],$

whose first term does not depend on $\theta$. The derivative with respect to $\theta$ is

(2.7)  $\dfrac{\partial}{\partial\theta} Q(\theta \mid \theta^{(t)}) = \dfrac{1}{\theta}\, E[S \mid Y_{obs}, \theta^{(t)}] - \dfrac{1}{1-\theta}\, E[N-S \mid Y_{obs}, \theta^{(t)}].$

Setting this derivative equal to 0 and solving for $\theta$ yields the iterative formula (2.5), which completes the proof of Proposition 2.2.
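As an illustration of how Proposition 2.2 is used, consider a toy scenario that is not from the paper: $m$ independent experiments, each consisting of $n$ Bernoulli($\theta$) trials, where for every experiment only the indicator of at least one success is observed. Then $E[N \mid Y_{obs}, \theta^{(t)}] = mn$, and $E[S \mid Y_{obs}, \theta^{(t)}] = m_1 n \theta^{(t)} / (1 - (1-\theta^{(t)})^n)$, where $m_1$ is the number of experiments with an observed success. The following Python sketch of the update (2.5) uses hypothetical names and counts.

```python
# Toy illustration of the update (2.5) for hidden binomial trials (scenario assumed,
# not from the paper): m experiments of n Bernoulli(theta) trials each, where only
# the indicator "at least one success" is observed per experiment.

def update_2_5(theta, m1, m0, n):
    """One EM step: theta^(t+1) = E[S | Y_obs, theta^(t)] / E[N | Y_obs, theta^(t)]."""
    exp_S = m1 * n * theta / (1.0 - (1.0 - theta) ** n)  # experiments with a success
    exp_N = (m1 + m0) * n                                # N is non-random here
    return exp_S / exp_N

# Hypothetical counts: 37 experiments with a success, 13 without, n = 4 trials each.
theta = 0.2
for _ in range(50):
    theta = update_2_5(theta, 37, 13, 4)
print(theta)  # converges to the ML estimate of theta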

In order to examine the relationship between $\theta^{(t+1)}$ and $\theta^{(t)}$, we use Proposition 2.1 to obtain another specific identity. From (2.7) and (2.1) we get

(2.8)  $l'(\theta^{(t)} \mid Y_{obs}) = \dfrac{1}{\theta^{(t)}}\, E[S \mid Y_{obs}, \theta^{(t)}] - \dfrac{1}{1-\theta^{(t)}}\, E[N-S \mid Y_{obs}, \theta^{(t)}].$

Multiplying (2.8) by $\theta^{(t)}(1-\theta^{(t)}) / E[N \mid Y_{obs}, \theta^{(t)}]$, we have

$\dfrac{\theta^{(t)}(1-\theta^{(t)})}{E[N \mid Y_{obs}, \theta^{(t)}]}\, l'(\theta^{(t)} \mid Y_{obs})
= \dfrac{(1-\theta^{(t)})\, E[S \mid Y_{obs}, \theta^{(t)}] - \theta^{(t)}\, E[N-S \mid Y_{obs}, \theta^{(t)}]}{E[N \mid Y_{obs}, \theta^{(t)}]}
= \dfrac{E[S \mid Y_{obs}, \theta^{(t)}] - \theta^{(t)}\, E[N \mid Y_{obs}, \theta^{(t)}]}{E[N \mid Y_{obs}, \theta^{(t)}]}
= \dfrac{E[S \mid Y_{obs}, \theta^{(t)}]}{E[N \mid Y_{obs}, \theta^{(t)}]} - \theta^{(t)}
= \theta^{(t+1)} - \theta^{(t)},$

where the last equality follows from (2.5). Hence we obtain another identity

(2.9)  $\theta^{(t+1)} = \theta^{(t)} + \dfrac{\theta^{(t)}(1-\theta^{(t)})}{E[N \mid Y_{obs}, \theta^{(t)}]}\, l'(\theta^{(t)} \mid Y_{obs}).$

Starting from an initial value $\theta^{(0)}$, one cycles between the E- and M-steps until $\theta^{(t)}$ converges to a local maximum. As far as the choice of initial values is concerned, many methods have been proposed, because this choice can heavily influence both the speed of convergence of the algorithm and its ability to locate the global maximum [3]. In our view, each concrete problem requires its own analysis: a good way to construct initial guesses for the unknown parameters depends on the particular application.
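Continuing the toy scenario introduced after Proposition 2.2 (assumed, not from the paper), identity (2.9) can be checked numerically: one EM step computed from the observed-data score $l'(\theta^{(t)} \mid Y_{obs})$ coincides with the closed-form update (2.5). The score expression below is specific to that assumed scenario.

```python
# Numerical check of identity (2.9) on the toy scenario above (assumed, not from the
# paper): the gradient form reproduces the closed-form update (2.5).
m1, m0, n, theta = 37, 13, 4, 0.2        # hypothetical counts and current estimate
m = m1 + m0

# Closed-form update (2.5).
closed_form = m1 * n * theta / (1.0 - (1.0 - theta) ** n) / (m * n)

# Observed-data score l'(theta | Y_obs) for this scenario, then update (2.9)
# with E[N | Y_obs, theta] = m * n.
score = (m1 * n * (1.0 - theta) ** (n - 1) / (1.0 - (1.0 - theta) ** n)
         - m0 * n / (1.0 - theta))
gradient_form = theta + theta * (1.0 - theta) * score / (m * n)

print(closed_form, gradient_form)        # agree up to floating-point rounding
```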

3. Results

In this section we apply Proposition 2.2 to a twin-pairs problem in statistical genetics [4]. Consider random sampling from a population of twin pairs, and let $y$ be the number of female pairs, $x$ the number of male pairs, and $z$ the number of opposite-sex pairs. Suppose that monozygotic twins occur with probability $p$ and dizygotic twins with probability $1-p$, and that a boy is born with probability $q$ and a girl with probability $1-q$ (monozygotic twins necessarily share the same sex). Suppose there are $x_1$ monozygotic pairs and $x_2$ dizygotic pairs among the $x$ male pairs ($x_1 + x_2 = x$), and $y_1$ monozygotic pairs and $y_2$ dizygotic pairs among the $y$ female pairs ($y_1 + y_2 = y$). Because the data $x_1, x_2, y_1, y_2$ are unobserved, we call this problem one of hidden binomial trials. The goal is to estimate the parameter $\theta = (p, q)$. We now present two different ways to derive the iterative formulae for estimating the parameters.

3.1. Theorem. The iterative formulae for estimating $p$ and $q$ can be expressed by the following two identities:

(3.1)  $p^{(t+1)} = \dfrac{x^{(t)} + y^{(t)}}{x + y + z},$

(3.2)  $q^{(t+1)} = \dfrac{x^{(t)} + 2(x - x^{(t)}) + z}{x^{(t)} + 2(x - x^{(t)}) + y^{(t)} + 2(y - y^{(t)}) + 2z},$

where

(3.3)  $x^{(t)} = E(x_1 \mid x, p^{(t)}, q^{(t)}) = \dfrac{x\, p^{(t)} q^{(t)}}{p^{(t)} q^{(t)} + (1-p^{(t)})(q^{(t)})^2}, \qquad
y^{(t)} = E(y_1 \mid y, p^{(t)}, q^{(t)}) = \dfrac{y\, p^{(t)} (1-q^{(t)})}{p^{(t)} (1-q^{(t)}) + (1-p^{(t)})(1-q^{(t)})^2}.$

We give two proofs of Theorem 3.1. Proof 1 is based on the usual E-step and M-step, while Proof 2 is based on Proposition 2.2.

Proof 1. According to the genetic model described above, the probability that the $x$ male pairs consist of $x_1$ monozygotic and $x_2$ dizygotic pairs is

(3.4)  $l_1(p, q \mid x_1, x_2) = \dfrac{x!}{x_1!\, x_2!}\, (pq)^{x_1}\, [(1-p)q^2]^{x_2}.$

Similarly, the probability that the $y$ female pairs consist of $y_1$ monozygotic and $y_2$ dizygotic pairs is

(3.5)  $l_2(p, q \mid y_1, y_2) = \dfrac{y!}{y_1!\, y_2!}\, [p(1-q)]^{y_1}\, [(1-p)(1-q)^2]^{y_2},$

and the probability of the $z$ opposite-sex pairs is

(3.6)  $l_3(p, q \mid z) = [(1-p)\, 2q(1-q)]^{z}.$

Thus the likelihood of the complete data is

(3.7)  $L(p, q \mid x_1, x_2, y_1, y_2, z) = \dfrac{(x+y+z)!}{x!\, y!\, z!}\; l_1(p, q \mid x_1, x_2)\, l_2(p, q \mid y_1, y_2)\, l_3(p, q \mid z).$

Hence

(3.8)  $l = \log L(p, q \mid x_1, x_2, y_1, y_2, z) = \log\dfrac{(x+y+z)!}{x!\, y!\, z!} + \log\dfrac{x!}{x_1!\, x_2!} + \log\dfrac{y!}{y_1!\, y_2!} + x_1[\log p + \log q] + x_2[\log(1-p) + \log q^2] + y_1[\log p + \log(1-q)] + y_2[\log(1-p) + \log(1-q)^2] + z[\log(1-p) + \log 2q(1-q)].$

Applying the usual E-step and M-step [1] to the complete-data log-likelihood (3.8) yields the iterative formulae (3.1) and (3.2).

Proof 2. Since $p$ is the probability of monozygotic twinning and there are $x_1 + y_1$ monozygotic pairs among the $x + y + z$ twin pairs in total (opposite-sex pairs are necessarily dizygotic), we may take $N = x + y + z$, $S = x_1 + y_1$ and $Y_{obs} = (x, y)$ in (2.5). This gives

(3.9)  $p^{(t+1)} = \dfrac{E[S \mid Y_{obs}, \theta^{(t)}]}{E[N \mid Y_{obs}, \theta^{(t)}]} = \dfrac{E[x_1 + y_1 \mid Y_{obs}, \theta^{(t)}]}{E[x + y + z \mid Y_{obs}, \theta^{(t)}]} = \dfrac{E(x_1 \mid x, p^{(t)}, q^{(t)}) + E(y_1 \mid y, p^{(t)}, q^{(t)})}{x + y + z}.$

By the properties of the binomial distribution, (3.3) follows directly: conditional on a pair being a male pair, it is monozygotic with probability $p^{(t)} q^{(t)} / [p^{(t)} q^{(t)} + (1-p^{(t)})(q^{(t)})^2]$, so $x_1$ given $x$ is binomial with this success probability; the same argument applies to $y_1$ given $y$. Thus (3.9) is exactly (3.1).

Similarly, $q$ is the probability that a sex-determining trial results in a boy; each monozygotic pair corresponds to a single such trial, while each dizygotic pair corresponds to two independent trials. Among the $x$ male pairs there are $x_1$ monozygotic pairs (one trial each, resulting in a boy) and $x_2 = x - x_1$ dizygotic pairs (two trials each, both resulting in boys), and each of the $z$ opposite-sex pairs is dizygotic with exactly one trial resulting in a boy. Hence the number of trials resulting in a boy is $S = x_1 + 2(x - x_1) + z$, the number resulting in a girl is $y_1 + 2(y - y_1) + z$, and the total number of trials is $N = x_1 + 2(x - x_1) + y_1 + 2(y - y_1) + 2z$. Taking $Y_{obs} = (x, y, z)$ in (2.5), we obtain

(3.10)  $q^{(t+1)} = \dfrac{E[S \mid Y_{obs}, \theta^{(t)}]}{E[N \mid Y_{obs}, \theta^{(t)}]} = \dfrac{E[x_1 + 2(x - x_1) + z \mid Y_{obs}, p^{(t)}, q^{(t)}]}{E[x_1 + 2(x - x_1) + y_1 + 2(y - y_1) + 2z \mid Y_{obs}, p^{(t)}, q^{(t)}]}.$

Again by (3.3), (3.10) is exactly (3.2).
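Theorem 3.1 translates directly into a short iteration. The sketch below implements the E-step (3.3) and the M-step updates (3.1) and (3.2); the observed counts in the final line are hypothetical and serve only to illustrate the call.

```python
# Sketch of the EM iteration of Theorem 3.1 for the twin-pairs model.

def em_twin_pairs(x, y, z, p0=0.5, q0=0.5, tol=1e-10, max_iter=1000):
    """EM for (p, q) from x male pairs, y female pairs and z opposite-sex pairs."""
    p, q = p0, q0
    for _ in range(max_iter):
        # E-step, formula (3.3): expected numbers of monozygotic male/female pairs.
        x_t = x * p * q / (p * q + (1 - p) * q ** 2)
        y_t = y * p * (1 - q) / (p * (1 - q) + (1 - p) * (1 - q) ** 2)
        # M-step, formulae (3.1) and (3.2).
        p_new = (x_t + y_t) / (x + y + z)
        q_new = (x_t + 2 * (x - x_t) + z) / (
            x_t + 2 * (x - x_t) + y_t + 2 * (y - y_t) + 2 * z)
        if abs(p_new - p) < tol and abs(q_new - q) < tol:
            return p_new, q_new
        p, q = p_new, q_new
    return p, q

# Hypothetical data: 60 male pairs, 50 female pairs, 40 opposite-sex pairs.
print(em_twin_pairs(60, 50, 40))
```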

4. Conclusion and Discussion

From the two proofs of Theorem 3.1 given above, we conclude that deriving the iteration formulae for the parameters of hidden binomial trials from Proposition 2.2 is simpler than working through the usual M-step. Whether the iteration formula in Proposition 2.2 can be extended to other exponential families, such as hidden Poisson or hidden exponential trials, requires further research.

Acknowledgments. The author gratefully acknowledges the anonymous referees and the editor for their valuable comments, which substantially improved the paper. This research was supported in part by the Joint PhD Student Program of Shanghai University of Finance and Economics.

References

[1] A. P. Dempster, N. M. Laird and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 1-38, 1977.
[2] H. O. Hartley. Maximum likelihood estimation from incomplete data. Biometrics 14 (2), 174-194, 1958.
[3] D. Karlis and E. Xekalaki. Choosing initial values for the EM algorithm for finite mixtures. Computational Statistics & Data Analysis 41 (3), 577-590, 2003.
[4] K. Lange. Mathematical and Statistical Methods for Genetic Analysis. Springer, 2002.
[5] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, New York, 1987.
[6] G. McLachlan and T. Krishnan. The EM Algorithm and Extensions. John Wiley & Sons, 2007.
[7] G. McLachlan and D. Peel. Finite Mixture Models. Wiley, 2004.
[8] X.-L. Meng and D. B. Rubin. Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. Journal of the American Statistical Association 86 (416), 899-909, 1991.
[9] R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review 26 (2), 195-239, 1984.
[10] M. A. Tanner. Tools for Statistical Inference: Observed Data and Data Augmentation Methods. Springer-Verlag, New York, 1991.
[11] C. F. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics 11 (1), 95-103, 1983.