EM Algorithm

Last lecture (1/35)
- General optimization problems: Newton-Raphson, Fisher scoring, quasi-Newton
- Nonlinear regression models: Gauss-Newton
- Generalized linear models: iteratively reweighted least squares

Expectation-maximization (EM) algorithm (2/35)
An iterative algorithm for maximizing the likelihood when the model contains unobserved latent variables.
- It was initially invented by computer scientists for special cases, and was generalized by Arthur Dempster, Nan Laird, and Donald Rubin in a classic 1977 JRSSB paper, widely known as the DLR paper.
- The algorithm iterates between an E-step (expectation) and an M-step (maximization).
  - E-step: create a function for the expectation of the log-likelihood, evaluated using the current parameter estimates.
  - M-step: compute the parameters that maximize the expected log-likelihood found in the E-step.

Motivating example of the EM algorithm (3/35)
Assume people's heights (in cm) follow normal distributions with different means for males and females: N(µ_1, σ_1^2) for males and N(µ_2, σ_2^2) for females. We observe the heights of 5 people but not their genders: 182, 163, 175, 185, 158. We want to estimate µ_1, µ_2, σ_1 and σ_2.
This is the typical two-component normal mixture model, i.e., the data come from a mixture of two normal distributions, and the goal is to estimate the model parameters. We could, of course, form the likelihood function (a product of two-component mixture densities over the observations) and find its maximum by Newton-Raphson.

A sketch of an EM algorithm (4/35)
Some notation: for person i, denote his/her height by x_i and use Z_i to indicate gender (Z_i = 1 for male). Let π be the proportion of males in the population.
Start by choosing reasonable initial values. Then:
- In the E-step, compute the probability of each person being male or female, given the current model parameters. We have (after some derivation)
  λ_i^(k) = E[Z_i | x_i, µ_1^(k), µ_2^(k), σ_1^(k), σ_2^(k)] = π^(k) φ(x_i; µ_1^(k), σ_1^(k)) / [ π^(k) φ(x_i; µ_1^(k), σ_1^(k)) + (1 − π^(k)) φ(x_i; µ_2^(k), σ_2^(k)) ]
- In the M-step, update the parameters and the group proportion, treating the probabilities from the E-step as weights. The updates are basically weighted averages and variances. For example,
  µ_1^(k+1) = Σ_i λ_i^(k) x_i / Σ_i λ_i^(k),   µ_2^(k+1) = Σ_i (1 − λ_i^(k)) x_i / Σ_i (1 − λ_i^(k)),   π^(k+1) = Σ_i λ_i^(k) / 5

Example results (5/35)
We choose µ_1 = 175, µ_2 = 165, σ_1 = σ_2 = 10 as initial values. After the first E-step we have:

  Person              1      2      3      4      5
  x_i: height (cm)    179    165    175    185    158
  λ_i: prob. male     0.79   0.48   0.71   0.87   0.31

The parameter estimates after the first M-step (weighted averages and variances) are: µ_1 = 176, µ_2 = 167, σ_1 = 8.7, σ_2 = 9.2, π = 0.63.
At iteration 15 (converged), we have:

  Person         1             2             3             4             5
  Height (cm)    179           165           175           185           158
  Prob. male     9.999968e-01  4.009256e-03  9.990943e-01  1.000000e+00  2.443061e-06

The parameter estimates are: µ_1 = 179.6, µ_2 = 161.5, σ_1 = 4.1, σ_2 = 3.5, π = 0.6.

Another motivating example of the EM algorithm (6/35)
ABO blood groups:

  Genotype   Genotype frequency   Phenotype
  AA         p_A^2                A
  AO         2 p_A p_O            A
  BB         p_B^2                B
  BO         2 p_B p_O            B
  OO         p_O^2                O
  AB         2 p_A p_B            AB

The genotype frequencies above assume Hardy-Weinberg equilibrium.
For a random sample of n individuals, we observe their phenotypes but not their genotypes. We wish to obtain the MLEs of the underlying allele frequencies p_A, p_B, and p_O = 1 − p_A − p_B.
The likelihood is (from the multinomial):
  L(p_A, p_B) ∝ (p_A^2 + 2 p_A p_O)^{n_A} (p_B^2 + 2 p_B p_O)^{n_B} (p_O^2)^{n_O} (2 p_A p_B)^{n_AB}

Motivating example (allele counting algorithm) (7/35)
Let n_A, n_B, n_O, n_AB be the observed numbers of individuals with phenotypes A, B, O, AB, respectively. Let n_AA, n_AO, n_BB and n_BO be the unobserved numbers of individuals with genotypes AA, AO, BB and BO, respectively. They satisfy n_AA + n_AO = n_A and n_BB + n_BO = n_B.
1. Start with initial estimates p^(0) = (p_A^(0), p_B^(0), p_O^(0)).
2. Calculate the expected n_AA and n_BB, given the observed data and p^(k):
   n_AA^(k+1) = E(n_AA | n_A, p^(k)) = n_A p_A^(k) p_A^(k) / [ p_A^(k) p_A^(k) + 2 p_O^(k) p_A^(k) ],   n_BB^(k+1) = ?
3. Update p^(k+1), imagining that n_AA^(k+1), n_AO^(k+1), n_BB^(k+1) and n_BO^(k+1) were actually observed:
   p_A^(k+1) = (2 n_AA^(k+1) + n_AO^(k+1) + n_AB) / (2n),   p_B^(k+1) = ?
4. Repeat steps 2 and 3 until the estimates converge.
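The allele-counting iteration above translates into a few lines of R. The sketch below is only illustrative and is not part of the original slides; the function name and the example phenotype counts are made up for demonstration.

## Allele-counting EM for ABO frequencies (a minimal sketch; the function name
## and the example counts in the usage line are hypothetical).
abo_em <- function(nA, nB, nO, nAB, pA = 0.3, pB = 0.3, maxiter = 100, tol = 1e-8) {
  n <- nA + nB + nO + nAB
  for (k in seq_len(maxiter)) {
    pO <- 1 - pA - pB
    ## E-step: expected genotype counts given phenotypes and current allele frequencies
    nAA <- nA * pA^2 / (pA^2 + 2 * pA * pO)
    nAO <- nA - nAA
    nBB <- nB * pB^2 / (pB^2 + 2 * pB * pO)
    nBO <- nB - nBB
    ## M-step: allele counting as if the genotypes were observed
    pA_new <- (2 * nAA + nAO + nAB) / (2 * n)
    pB_new <- (2 * nBB + nBO + nAB) / (2 * n)
    if (max(abs(c(pA_new - pA, pB_new - pB))) < tol) break
    pA <- pA_new; pB <- pB_new
  }
  c(pA = pA_new, pB = pB_new, pO = 1 - pA_new - pB_new)
}
## e.g. abo_em(nA = 186, nB = 38, nO = 284, nAB = 13)   # hypothetical counts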

EM algorithm: applications (8/35)
The expectation-maximization algorithm (Dempster, Laird, & Rubin, 1977, JRSSB, 39:1-38) is a general iterative algorithm for parameter estimation by maximum likelihood (an optimization problem). It is useful when
- some of the random variables involved are not observed, i.e., are considered missing or incomplete;
- directly maximizing the target likelihood function is difficult, but one can introduce (missing) random variables so that maximizing the complete-data likelihood is simple.
Typical problems include:
- Filling in missing data in a sample
- Discovering the values of latent variables
- Estimating the parameters of HMMs
- Estimating the parameters of finite mixtures

Description of EM (9/35)
Consider (Y_obs, Y_mis) ~ f(y_obs, y_mis | θ), where we observe Y_obs but not Y_mis.
It can be difficult to find the MLE
  θ̂ = argmax_θ g(Y_obs | θ) = argmax_θ ∫ f(Y_obs, y_mis | θ) dy_mis.
But it could be easy to find
  θ̂_C = argmax_θ f(Y_obs, Y_mis | θ),
if we had observed Y_mis.
E-step: h^(k)(θ) = E{ log f(Y_obs, Y_mis | θ) | Y_obs, θ^(k) }
M-step: θ^(k+1) = argmax_θ h^(k)(θ)
Nice properties (compared to Newton-Raphson):
1. simplicity of implementation
2. stable monotone convergence
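To make the E/M alternation concrete, here is a generic driver loop in R. It is only a sketch, not from the slides; estep() and mstep() are hypothetical model-specific functions that build and maximize the surrogate h^(k)(θ) for a particular problem.

## Generic EM driver (a sketch; estep() and mstep() are hypothetical
## user-supplied functions for a specific model).
em_driver <- function(theta0, estep, mstep, maxiter = 1000, tol = 1e-8) {
  theta <- theta0
  for (k in seq_len(maxiter)) {
    h <- estep(theta)           # E-step: build the surrogate h^(k)(.) given theta^(k)
    theta_new <- mstep(h)       # M-step: maximize the surrogate
    if (max(abs(theta_new - theta)) < tol) break   # simple stopping rule
    theta <- theta_new
  }
  theta_new
}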

Justification of EM (10/35)
The E-step creates a surrogate function (often called the "Q function"), which is the expected value of the complete-data log-likelihood with respect to the conditional distribution of Y_mis given Y_obs, under the current parameter estimate θ^(k). The M-step maximizes this surrogate function.

Ascent property of EM (11/35)
Theorem: At each iteration of the EM algorithm,
  log g(Y_obs | θ^(k+1)) ≥ log g(Y_obs | θ^(k)),
and the equality holds if and only if θ^(k+1) = θ^(k).
Proof: The definition of θ^(k+1) gives
  E{ log f(Y_obs, Y_mis | θ^(k+1)) | Y_obs, θ^(k) } ≥ E{ log f(Y_obs, Y_mis | θ^(k)) | Y_obs, θ^(k) },
which, using the factorization f(Y_obs, Y_mis | θ) = c(Y_mis | Y_obs, θ) g(Y_obs | θ), can be expanded to
  E{ log c(Y_mis | Y_obs, θ^(k+1)) | Y_obs, θ^(k) } + log g(Y_obs | θ^(k+1)) ≥ E{ log c(Y_mis | Y_obs, θ^(k)) | Y_obs, θ^(k) } + log g(Y_obs | θ^(k)).   (1)
By the non-negativity of the Kullback-Leibler information, i.e., ∫ log{ p(x)/q(x) } p(x) dx ≥ 0 for densities p(x) and q(x), we have
  ∫ log[ c(y_mis | Y_obs, θ^(k)) / c(y_mis | Y_obs, θ^(k+1)) ] c(y_mis | Y_obs, θ^(k)) dy_mis = E{ log[ c(Y_mis | Y_obs, θ^(k)) / c(Y_mis | Y_obs, θ^(k+1)) ] | Y_obs, θ^(k) } ≥ 0.   (2)

Ascent property of EM (continued) (12/35)
Combining (1) and (2) yields log g(Y_obs | θ^(k+1)) ≥ log g(Y_obs | θ^(k)), which proves the inequality part of the theorem.
If the equality holds, i.e.,
  log g(Y_obs | θ^(k+1)) = log g(Y_obs | θ^(k)),   (3)
then by (1) and (2),
  E{ log c(Y_mis | Y_obs, θ^(k+1)) | Y_obs, θ^(k) } = E{ log c(Y_mis | Y_obs, θ^(k)) | Y_obs, θ^(k) }.
The Kullback-Leibler information is zero if and only if
  log c(Y_mis | Y_obs, θ^(k+1)) = log c(Y_mis | Y_obs, θ^(k)).   (4)
Combining (3) and (4), we have log f(Y | θ^(k+1)) = log f(Y | θ^(k)). The uniqueness of θ then leads to θ^(k+1) = θ^(k).

Example 1: Grouped multinomial data (13/35)
Suppose Y = (y_1, y_2, y_3, y_4) has a multinomial distribution with cell probabilities
  (1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4).
Then the likelihood for Y is given by
  L(θ | Y) ∝ [ (y_1 + y_2 + y_3 + y_4)! / (y_1! y_2! y_3! y_4!) ] (1/2 + θ/4)^{y_1} ((1 − θ)/4)^{y_2} ((1 − θ)/4)^{y_3} (θ/4)^{y_4}.
If we use Newton-Raphson to directly maximize L(θ | Y), we need
  l'(θ | Y) = (y_1/4) / (1/2 + θ/4) − (y_2 + y_3)/(1 − θ) + y_4/θ = y_1/(2 + θ) − (y_2 + y_3)/(1 − θ) + y_4/θ,
  l''(θ | Y) = −y_1/(2 + θ)^2 − (y_2 + y_3)/(1 − θ)^2 − y_4/θ^2.
The probability of the first cell is the trouble-maker! How to avoid it?

Example 1: Grouped multinomial data (continued) (14/35)
Suppose Y = (y_1, y_2, y_3, y_4) has a multinomial distribution with cell probabilities
  (1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4).
Define the complete data X = (x_0, x_1, y_2, y_3, y_4) to have a multinomial distribution with cell probabilities
  (1/2, θ/4, (1 − θ)/4, (1 − θ)/4, θ/4)
and to satisfy x_0 + x_1 = y_1.
Observed-data log-likelihood:
  l(θ | Y) ∝ y_1 log(1/2 + θ/4) + (y_2 + y_3) log(1 − θ) + y_4 log θ
Complete-data log-likelihood:
  l_C(θ | X) ∝ (x_1 + y_4) log θ + (y_2 + y_3) log(1 − θ)

Example 1: Grouped multinomial data (continued) (15/35)
E-step: evaluate
  x_1^(k+1) = E(x_1 | Y, θ^(k)) = y_1 (θ^(k)/4) / (1/2 + θ^(k)/4)
M-step: maximize the complete-data log-likelihood with x_1 replaced by x_1^(k+1):
  θ^(k+1) = (x_1^(k+1) + y_4) / (x_1^(k+1) + y_4 + y_2 + y_3)

Example 1: Grouped multinomial data (continued) (16/35)
We observe Y = (125, 18, 20, 34) and start EM with θ^(0) = 0.5. The columns below show the parameter update, the convergence to θ̂, and the convergence rate.

  k    θ^(k)         |θ^(k) − θ̂|   (θ^(k) − θ̂)/(θ^(k−1) − θ̂)
  0    .500000000    .126821498
  1    .608247423    .018574075    .1465
  2    .624321051    .002500447    .1346
  3    .626488879    .000332619    .1330
  4    .626777323    .000044176    .1328
  5    .626815632    .000005866    .1328
  6    .626820719    .000000779    .1328
  7    .626821395    .000000104
  8    .626821484    .000000014    Stop
  θ̂    .626821498
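The E- and M-step updates above are one line of R each. The sketch below is illustrative (the function name is made up); with Y = (125, 18, 20, 34) and θ^(0) = 0.5 it reproduces the iteration shown in the table.

## EM for the grouped multinomial example (a sketch; the function name is hypothetical).
em_multinomial <- function(y, theta = 0.5, maxiter = 50, tol = 1e-9) {
  for (k in seq_len(maxiter)) {
    ## E-step: expected count x1 in the (theta/4) part of the first cell
    x1 <- y[1] * (theta / 4) / (1/2 + theta / 4)
    ## M-step: closed-form maximizer of the complete-data log-likelihood
    theta_new <- (x1 + y[4]) / (x1 + y[4] + y[2] + y[3])
    if (abs(theta_new - theta) < tol) break
    theta <- theta_new
  }
  theta_new
}
## em_multinomial(c(125, 18, 20, 34))   # converges to about 0.6268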

Example 2: Normal mixtures (17/35)
Consider a J-group normal mixture, where x_1, ..., x_n ~ Σ_{j=1}^J p_j φ(x | µ_j, σ_j) and φ(· | µ, σ) is the normal density. This is the clustering/finite-mixture problem for which EM is typically used.
Define indicator variables for observation i: (y_i1, y_i2, ..., y_iJ) follows a multinomial distribution (with trial number 1) and cell probabilities p = (p_1, p_2, ..., p_J). Clearly Σ_j y_ij = 1. Given y_ij = 1 (and y_ij' = 0 for j' ≠ j), we assume x_i ~ N(µ_j, σ_j). You can check that, marginally, x_i ~ Σ_{j=1}^J p_j φ(x | µ_j, σ_j).
Here {x_i}_i is the observed data; {x_i, y_i1, ..., y_iJ}_i is the complete data.
Observed-data log-likelihood:
  l(µ, σ, p | x) ∝ Σ_i log Σ_{j=1}^J p_j φ(x_i | µ_j, σ_j)
Complete-data log-likelihood:
  l_C(µ, σ, p | x, y) ∝ Σ_i Σ_j y_ij { log p_j + log φ(x_i | µ_j, σ_j) }

Example 2: Normal mixtures (continued) (18/35)
Complete-data log-likelihood:
  l_C(µ, σ, p | x, y) ∝ Σ_i Σ_j y_ij { log p_j − (x_i − µ_j)^2 / (2σ_j^2) − log σ_j }
E-step: evaluate, for i = 1, ..., n and j = 1, ..., J,
  ω_ij^(k) = E(y_ij | x_i, µ^(k), σ^(k), p^(k)) = Pr(y_ij = 1 | x_i, µ^(k), σ^(k), p^(k))
           = p_j^(k) φ(x_i | µ_j^(k), σ_j^(k)) / Σ_{j'} p_{j'}^(k) φ(x_i | µ_{j'}^(k), σ_{j'}^(k))

Example 2: Normal mixtures (continued) (19/35)
M-step: maximize the complete-data log-likelihood with y_ij replaced by ω_ij^(k):
  p_j^(k+1) = n^{-1} Σ_i ω_ij^(k)
  µ_j^(k+1) = Σ_i ω_ij^(k) x_i / Σ_i ω_ij^(k)
  σ_j^(k+1) = [ Σ_i ω_ij^(k) (x_i − µ_j^(k))^2 / Σ_i ω_ij^(k) ]^{1/2}
Practice: when all groups share the same variance σ^2, what is the M-step update for σ^2?

Example 2: Normal mixtures in R (20-21/35)

### two-component EM
### p N(0,1) + (1-p) N(4,1)
EM_TwoMixtureNormal = function(p, mu1, mu2, sd1, sd2, X, maxiter=1000, tol=1e-5)
{
  diff=1
  iter=0
  while (diff>tol & iter<maxiter) {
    ## E-step: compute omega
    d1=dnorm(X, mean=mu1, sd=sd1)    # density in group 1
    d2=dnorm(X, mean=mu2, sd=sd2)    # density in group 2
    omega=d1*p/(d1*p+d2*(1-p))       # posterior probability of group 1
    ## M-step: update p, mu and sd
    p.new=mean(omega)
    mu1.new=sum(X*omega) / sum(omega)
    mu2.new=sum(X*(1-omega)) / sum(1-omega)
    resid1=X-mu1
    resid2=X-mu2
    sd1.new=sqrt(sum(resid1^2*omega) / sum(omega))
    sd2.new=sqrt(sum(resid2^2*(1-omega)) / sum(1-omega))
    ## calculate diff to check convergence
    diff=sqrt(sum((mu1.new-mu1)^2+(mu2.new-mu2)^2
                  +(sd1.new-sd1)^2+(sd2.new-sd2)^2))
    p=p.new; mu1=mu1.new; mu2=mu2.new; sd1=sd1.new; sd2=sd2.new;
    iter=iter+1;
    cat("Iter", iter, ": mu1=", mu1.new, ", mu2=", mu2.new,
        ", sd1=", sd1.new, ", sd2=", sd2.new,
        ", p=", p.new, ", diff=", diff, "\n")
  }
}

Example 2: Normal mixtures in R (continued) (22/35)
> ## simulation
> p0=0.3;
> n=5000;
> X1=rnorm(n*p0);              # n*p0 individuals from N(0,1)
> X2=rnorm(n*(1-p0), mean=4)   # n*(1-p0) individuals from N(4,1)
> X=c(X1,X2)                   # observed data
> hist(X, 50)

[Figure: histogram of X; x-axis: X, y-axis: Frequency]

Example 2: Normal mixtures in R (continued) (23/35)
> ## initial values for EM
> p=0.5
> mu1=quantile(X, 0.1);
> mu2=quantile(X, 0.9)
> sd1=sd2=sd(X)
> c(p, mu1, mu2, sd1, sd2)
 0.5000000 -0.3903964  5.0651073  2.0738555  2.0738555
> EM_TwoMixtureNormal(p, mu1, mu2, sd1, sd2, X)
Iter 1: mu1=0.8697, mu2=4.0109, sd1=2.1342, sd2=1.5508, p=0.3916, diff=1.7252
Iter 2: mu1=0.9877, mu2=3.9000, sd1=1.8949, sd2=1.2262, p=0.3843, diff=0.4345
Iter 3: mu1=0.8353, mu2=4.0047, sd1=1.7812, sd2=1.0749, p=0.3862, diff=0.2645
Iter 4: mu1=0.7203, mu2=4.0716, sd1=1.6474, sd2=0.9899, p=0.3852, diff=0.2070
...
Iter 44: mu1=-0.0048, mu2=3.9515, sd1=0.9885, sd2=1.0316, p=0.2959, diff=1.9e-05
Iter 45: mu1=-0.0048, mu2=3.9515, sd1=0.9885, sd2=1.0316, p=0.2959, diff=1.4e-05
Iter 46: mu1=-0.0049, mu2=3.9515, sd1=0.9885, sd2=1.0316, p=0.2959, diff=1.1e-05
Iter 47: mu1=-0.0049, mu2=3.9515, sd1=0.9885, sd2=1.0316, p=0.2959, diff=8.7e-06

In-class practice: Poisson mixture (24/35)
Using the same notation as in the normal mixture model, now assume the data come from a mixture of Poisson distributions. Consider x_1, ..., x_n ~ Σ_{j=1}^J p_j φ(x | λ_j), where φ(· | λ) is the Poisson probability mass function. Again use y_ij to indicate group assignment: (y_i1, y_i2, ..., y_iJ) follows a multinomial distribution with cell probabilities p = (p_1, p_2, ..., p_J).
Now the observed-data log-likelihood is
  l(λ, p | x) ∝ Σ_i log Σ_{j=1}^J p_j φ(x_i | λ_j)
and the complete-data log-likelihood is
  l_C(λ, p | x, y) ∝ Σ_i Σ_j y_ij { log p_j + x_i log λ_j − λ_j }
Derive the EM iterations!
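For reference, one possible implementation of the resulting updates, coded in the same style as the normal-mixture function above, is sketched below for two components; it is not from the slides, and the function name is made up.

## Two-component Poisson mixture EM (an illustrative sketch, not from the slides).
EM_TwoMixturePoisson = function(p, lam1, lam2, X, maxiter=1000, tol=1e-8)
{
  for (iter in seq_len(maxiter)) {
    ## E-step: posterior probability of component 1
    d1 = dpois(X, lambda=lam1)
    d2 = dpois(X, lambda=lam2)
    omega = d1*p / (d1*p + d2*(1-p))
    ## M-step: weighted proportion and weighted means
    p.new    = mean(omega)
    lam1.new = sum(omega*X) / sum(omega)
    lam2.new = sum((1-omega)*X) / sum(1-omega)
    diff = max(abs(c(p.new-p, lam1.new-lam1, lam2.new-lam2)))
    p = p.new; lam1 = lam1.new; lam2 = lam2.new
    if (diff < tol) break
  }
  c(p=p, lambda1=lam1, lambda2=lam2)
}
## e.g. X = c(rpois(300, 2), rpois(700, 8)); EM_TwoMixturePoisson(0.5, 1, 10, X)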

Example 3: Mixed-effects model (25/35)
Consider a longitudinal dataset of i = 1, ..., N subjects, each with n_i measurements of the outcome. The linear mixed-effects model is given by
  Y_i = X_i β + Z_i b_i + ε_i,   b_i ~ N_q(0, D),   ε_i ~ N_{n_i}(0, σ^2 I_{n_i}),   b_i, ε_i independent.
Observed-data log-likelihood:
  l(β, D, σ^2 | Y_1, ..., Y_N) ∝ Σ_i { −(1/2) (Y_i − X_i β)' Σ_i^{-1} (Y_i − X_i β) − (1/2) log|Σ_i| },
where Σ_i = Z_i D Z_i' + σ^2 I_{n_i}.
In fact, this likelihood can be directly maximized over (β, D, σ^2) by Newton-Raphson or Fisher scoring. Given (D, σ^2), and hence Σ_i, we obtain the β that maximizes the likelihood by solving
  ∂l(β, D, σ^2 | Y_1, ..., Y_N)/∂β = Σ_i X_i' Σ_i^{-1} (Y_i − X_i β) = 0,
which implies
  β = ( Σ_{i=1}^N X_i' Σ_i^{-1} X_i )^{-1} Σ_{i=1}^N X_i' Σ_i^{-1} Y_i.

Example 3: Mixed-effects model (continued) (26/35)
Complete-data log-likelihood: note the equivalence of (ε_i, b_i) and (Y_i, b_i) and the fact that (b_i, ε_i)' is normal with mean (0, 0)' and block-diagonal covariance with blocks D and σ^2 I_{n_i}. Then
  l_C(β, D, σ^2 | ε_1, ..., ε_N, b_1, ..., b_N) ∝ Σ_i { −(1/2) b_i' D^{-1} b_i − (1/2) log|D| − ε_i' ε_i / (2σ^2) − (n_i/2) log σ^2 }.
The parameters that maximize the complete-data log-likelihood are obtained as, each conditional on the other parameters,
  D = N^{-1} Σ_{i=1}^N b_i b_i'
  σ^2 = ( Σ_{i=1}^N n_i )^{-1} Σ_{i=1}^N ε_i' ε_i
  β = ( Σ_{i=1}^N X_i' X_i )^{-1} Σ_{i=1}^N X_i' (Y_i − Z_i b_i).

Example 3: Mixed-effects model (continued) (27/35)
E-step: evaluate
  E(b_i b_i' | Y_i, β^(k), D^(k), σ^2(k)),   E(ε_i' ε_i | Y_i, β^(k), D^(k), σ^2(k)),   E(b_i | Y_i, β^(k), D^(k), σ^2(k)).
We use the relationship E(b_i b_i' | Y_i) = E(b_i | Y_i) E(b_i | Y_i)' + Var(b_i | Y_i), so we need to calculate E(b_i | Y_i) and Var(b_i | Y_i).
Recall the conditional distribution for multivariate normal variables. Jointly,
  (Y_i, b_i)' ~ N( (X_i β, 0)',  [ Z_i D Z_i' + σ^2 I_{n_i}   Z_i D ;   D Z_i'   D ] ).
Let Σ_i = Z_i D Z_i' + σ^2 I_{n_i}. We know that
  E(b_i | Y_i) = 0 + D Z_i' Σ_i^{-1} (Y_i − X_i β)
  Var(b_i | Y_i) = D − D Z_i' Σ_i^{-1} Z_i D.

Example 3: Mixed-effects model (continued) (28/35)
Similarly, we use the relationship E(ε_i ε_i' | Y_i) = E(ε_i | Y_i) E(ε_i | Y_i)' + Var(ε_i | Y_i), together with the joint distribution
  (Y_i, ε_i)' ~ N( (X_i β, 0)',  [ Z_i D Z_i' + σ^2 I_{n_i}   σ^2 I_{n_i} ;   σ^2 I_{n_i}   σ^2 I_{n_i} ] ).
Let Σ_i = Z_i D Z_i' + σ^2 I_{n_i}. Then we have
  E(ε_i | Y_i) = 0 + σ^2 Σ_i^{-1} (Y_i − X_i β)
  Var(ε_i | Y_i) = σ^2 I_{n_i} − σ^4 Σ_i^{-1}.
M-step:
  D^(k+1) = N^{-1} Σ_{i=1}^N E(b_i b_i' | Y_i, β^(k), D^(k), σ^2(k))
  σ^2(k+1) = ( Σ_{i=1}^N n_i )^{-1} Σ_{i=1}^N E(ε_i' ε_i | Y_i, β^(k), D^(k), σ^2(k))
  β^(k+1) = ( Σ_{i=1}^N X_i' X_i )^{-1} Σ_{i=1}^N X_i' E(Y_i − Z_i b_i | Y_i, β^(k), D^(k), σ^2(k)).
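One EM iteration for this model can be written compactly in R. The sketch below is illustrative only; the function name and list-based data layout are assumptions, and E(ε_i'ε_i | Y_i) is computed as ||E(ε_i | Y_i)||^2 + tr{Var(ε_i | Y_i)}.

## One EM iteration for the linear mixed-effects model (a sketch; the function
## name and data layout are assumptions, not from the slides).
## Y, X, Z are lists over subjects i = 1..N.
em_lmm_step <- function(beta, D, sigma2, Y, X, Z) {
  N <- length(Y)
  q <- nrow(D)
  bb_sum <- matrix(0, q, q)     # accumulates E(b_i b_i' | Y_i)
  ee_sum <- 0; n_tot <- 0       # accumulates E(eps_i' eps_i | Y_i) and total n_i
  XtX <- 0; XtW <- 0            # accumulates X_i'X_i and X_i' E(Y_i - Z_i b_i | Y_i)
  for (i in seq_len(N)) {
    Yi <- Y[[i]]; Xi <- X[[i]]; Zi <- Z[[i]]; ni <- length(Yi)
    Sigma_i <- Zi %*% D %*% t(Zi) + sigma2 * diag(ni)
    Sinv <- solve(Sigma_i)
    ri <- Yi - Xi %*% beta
    ## E-step: conditional moments of b_i and eps_i given Y_i
    Eb <- D %*% t(Zi) %*% Sinv %*% ri
    Vb <- D - D %*% t(Zi) %*% Sinv %*% Zi %*% D
    Ee <- sigma2 * Sinv %*% ri
    trVe <- sigma2 * ni - sigma2^2 * sum(diag(Sinv))   # tr{Var(eps_i | Y_i)}
    bb_sum <- bb_sum + Eb %*% t(Eb) + Vb
    ee_sum <- ee_sum + sum(Ee^2) + trVe
    n_tot <- n_tot + ni
    XtX <- XtX + t(Xi) %*% Xi
    XtW <- XtW + t(Xi) %*% (Yi - Zi %*% Eb)
  }
  ## M-step: closed-form updates
  list(beta = solve(XtX, XtW), D = bb_sum / N, sigma2 = ee_sum / n_tot)
}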

Issues (29/35)
1. Stopping rules
- |l(θ^(k+1)) − l(θ^(k))| < ε for m consecutive steps, where l(θ) is the observed-data log-likelihood. This is bad! l(θ) may not change much even when θ does.
- ||θ^(k+1) − θ^(k)|| < ε for m consecutive steps. This could run into problems when the components of θ are of very different magnitudes.
- |θ_j^(k+1) − θ_j^(k)| < ε_1 (|θ_j^(k)| + ε_2) for j = 1, ..., p.
  In practice, take ε_1 = 10^{-8} and ε_2 = 10 ε_1 to 100 ε_1.
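The componentwise relative criterion above is a one-liner in R (a sketch; the function name is made up):

## Relative convergence check applied to each component of theta (a sketch).
em_converged <- function(theta_new, theta_old, eps1 = 1e-8, eps2 = 100 * 1e-8) {
  all(abs(theta_new - theta_old) < eps1 * (abs(theta_old) + eps2))
}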

Issues (continued) (30/35)
2. Local vs. global max
- There may be multiple modes.
- EM may converge to a saddle point.
- Solution: multiple starting points.
3. Starting points
- Use information from the context.
- Use a crude method (such as the method of moments).
- Use an alternative model formulation.
4. Slow convergence
- EM can be painfully slow to converge near the maximum.
- Solution: switch to another optimization algorithm when you get near the maximum.
5. Standard errors
- Numerical approximation of the Hessian matrix.
- Louis (1982), Meng and Rubin (1991).

Numerical approximation of the Hessian matrix (31/35)
Note: l(θ) = observed-data log-likelihood.
We estimate the gradient using
  {∇l(θ)}_i = ∂l(θ)/∂θ_i ≈ [ l(θ + δ_i e_i) − l(θ − δ_i e_i) ] / (2 δ_i),
where e_i is a unit vector with 1 for the i-th element and 0 otherwise. In calculating derivatives using this formula, I generally start with some medium-size δ and then repeatedly halve it until the estimated derivative stabilizes.
We can estimate the Hessian by applying the above formula twice:
  {∇^2 l(θ)}_ij ≈ [ l(θ + δ_i e_i + δ_j e_j) − l(θ + δ_i e_i − δ_j e_j) − l(θ − δ_i e_i + δ_j e_j) + l(θ − δ_i e_i − δ_j e_j) ] / (4 δ_i δ_j)
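These central-difference formulas translate directly into R. The sketch below assumes loglik is a user-supplied function returning the observed-data log-likelihood; the function names are made up for illustration, and a single step size delta is used for all components.

## Central-difference gradient and Hessian of a log-likelihood (a sketch;
## loglik is a hypothetical user-supplied function of the parameter vector).
num_grad <- function(loglik, theta, delta = 1e-4) {
  p <- length(theta)
  g <- numeric(p)
  for (i in seq_len(p)) {
    ei <- numeric(p); ei[i] <- delta
    g[i] <- (loglik(theta + ei) - loglik(theta - ei)) / (2 * delta)
  }
  g
}
num_hess <- function(loglik, theta, delta = 1e-4) {
  p <- length(theta)
  H <- matrix(0, p, p)
  for (i in seq_len(p)) {
    for (j in seq_len(p)) {
      ei <- numeric(p); ei[i] <- delta
      ej <- numeric(p); ej[j] <- delta
      H[i, j] <- (loglik(theta + ei + ej) - loglik(theta + ei - ej) -
                  loglik(theta - ei + ej) + loglik(theta - ei - ej)) / (4 * delta^2)
    }
  }
  H
}
## The estimated covariance of the MLE is then solve(-num_hess(loglik, theta_hat)).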

Louis estimator (1982) (32/35)
Notation:
  l_C(θ | Y_obs, Y_mis) = log f(Y_obs, Y_mis | θ)
  l_O(θ | Y_obs) = log { ∫ f(Y_obs, y_mis | θ) dy_mis }
  ∇l_C(θ | Y_obs, Y_mis), ∇l_O(θ | Y_obs) = gradients of l_C, l_O
  ∇^2 l_C(θ | Y_obs, Y_mis), ∇^2 l_O(θ | Y_obs) = second derivatives of l_C, l_O
We can prove that
  (5) ∇l_O(θ | Y_obs) = E{ ∇l_C(θ | Y_obs, Y_mis) | Y_obs }
  (6) −∇^2 l_O(θ | Y_obs) = E{ −∇^2 l_C(θ | Y_obs, Y_mis) | Y_obs } − E{ [∇l_C(θ | Y_obs, Y_mis)]^2 | Y_obs } + [∇l_O(θ | Y_obs)]^2
MLE: θ̂ = argmax_θ l_O(θ | Y_obs)
Louis variance estimator: { −∇^2 l_O(θ | Y_obs) }^{-1} evaluated at θ = θ̂.
Note: all of the conditional expectations can be computed within the EM algorithm using only ∇l_C and ∇^2 l_C, the first and second derivatives of the complete-data log-likelihood. The Louis estimator should be evaluated at the last step of EM.

Proof of (5) (33/35)
Proof: By the definition of l_O(θ | Y_obs),
  ∇l_O(θ | Y_obs) = (∂/∂θ) log { ∫ f(Y_obs, y_mis | θ) dy_mis }
                  = [ (∂/∂θ) ∫ f(Y_obs, y_mis | θ) dy_mis ] / ∫ f(Y_obs, y_mis | θ) dy_mis
                  = ∫ [ ∂f(Y_obs, y_mis | θ)/∂θ ] dy_mis / ∫ f(Y_obs, y_mis | θ) dy_mis.   (7)
Multiplying and dividing the integrand of the numerator by f(Y_obs, y_mis | θ) gives (5):
  ∇l_O(θ | Y_obs) = ∫ { [ ∂f(Y_obs, y_mis | θ)/∂θ ] / f(Y_obs, y_mis | θ) } f(Y_obs, y_mis | θ) dy_mis / ∫ f(Y_obs, y_mis | θ) dy_mis
                  = ∫ [ ∂ log f(Y_obs, y_mis | θ)/∂θ ] f(Y_obs, y_mis | θ) dy_mis / ∫ f(Y_obs, y_mis | θ) dy_mis
                  = E{ ∇l_C(θ | Y_obs, Y_mis) | Y_obs }.

Proof of (6) (34/35)
Proof: Taking an additional derivative of ∇l_O(θ | Y_obs) in expression (7), we obtain
  ∇^2 l_O(θ | Y_obs) = ∫ [ ∂^2 f(Y_obs, y_mis | θ)/∂θ^2 ] dy_mis / ∫ f(Y_obs, y_mis | θ) dy_mis − { ∫ [ ∂f(Y_obs, y_mis | θ)/∂θ ] dy_mis / ∫ f(Y_obs, y_mis | θ) dy_mis }^2
                     = ∫ [ ∂^2 f(Y_obs, y_mis | θ)/∂θ^2 ] dy_mis / ∫ f(Y_obs, y_mis | θ) dy_mis − [ ∇l_O(θ | Y_obs) ]^2.
To see how the first term breaks down, we take an additional derivative of
  ∫ [ ∂f(Y_obs, y_mis | θ)/∂θ ] dy_mis = ∫ [ ∂ log f(Y_obs, y_mis | θ)/∂θ ] f(Y_obs, y_mis | θ) dy_mis
with respect to θ to obtain
  ∫ [ ∂^2 f(Y_obs, y_mis | θ)/∂θ^2 ] dy_mis = ∫ { ∂^2 log f(Y_obs, y_mis | θ)/∂θ^2 + [ ∂ log f(Y_obs, y_mis | θ)/∂θ ]^2 } f(Y_obs, y_mis | θ) dy_mis.
Thus the first term equals
  E{ ∇^2 l_C(θ | Y_obs, Y_mis) | Y_obs } + E{ [ ∇l_C(θ | Y_obs, Y_mis) ]^2 | Y_obs },
and rearranging (multiplying through by −1) gives (6).

Convergence rate of EM (35/35)
Let I_C(θ) and I_O(θ) denote the complete information and the observed information, respectively. One can show that, when EM converges, the linear convergence rate
  (θ^(k+1) − θ̂) / (θ^(k) − θ̂)
approximates 1 − I_O(θ̂)/I_C(θ̂) (shown later). This means that:
- when missingness is small, EM converges quickly;
- otherwise, EM converges slowly.