Optimization Methods II. EM algorithms.

Aula 7 (Lecture 7). Anatoli Iambartsev, IME-USP.

[RC] Missing-data models. Demarginalization. The term EM algorithm was introduced by [DLR]. Consider the case where the density of the observations can be expressed as

$$g_\theta(x) = \int f_\theta(x, z)\, dz,$$

i.e., the observed density $g_\theta(x)$ is completed (demarginalized) into a joint density $f_\theta(x, z)$. This representation occurs in many statistical settings, including censoring models, mixtures, and latent variable models (tobit, probit, ARCH, stochastic volatility, etc.).

[RC] Example 5.12. The mixture model of Example 5.2 (see the previous lecture),

$$0.25\, N(\mu_1, 1) + 0.75\, N(\mu_2, 1),$$

can be expressed as a missing-data model, even though the (observed) likelihood can be computed in manageable time. Indeed, if we introduce a vector $z = (z_1, \ldots, z_n) \in \{1,2\}^n$ in addition to the sample $x = (x_1, \ldots, x_n)$ such that

$$P_\theta(Z_i = 1) = 1 - P_\theta(Z_i = 2) = 0.25, \qquad X_i \mid Z_i = z \sim N(\mu_z, 1),$$

we recover the mixture model of Example 5.2 as the marginal distribution of $X_i$. The (observed) likelihood is then obtained as $E\big(H(x, Z)\big)$ for

$$H(x, z) \propto \prod_{i:\, z_i = 1} \frac{1}{4} \exp\left\{-\frac{(x_i - \mu_1)^2}{2}\right\} \prod_{i:\, z_i = 2} \frac{3}{4} \exp\left\{-\frac{(x_i - \mu_2)^2}{2}\right\}.$$

[RC] Example 5.12 (cont.). Indeed, taking the proportionality constant in $H$ equal to one,

$$g_{\mu_1,\mu_2}(x) = \sum_{z \in \{1,2\}^n} f_{\mu_1,\mu_2}(x, z) = \sum_{z \in \{1,2\}^n} \frac{H(x, z)}{(\sqrt{2\pi})^n} = \prod_{i=1}^n \left( \frac{1}{4}\, \frac{e^{-(x_i - \mu_1)^2/2}}{\sqrt{2\pi}} + \frac{3}{4}\, \frac{e^{-(x_i - \mu_2)^2/2}}{\sqrt{2\pi}} \right).$$
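
As a quick numerical sanity check (not part of [RC]), the demarginalization identity above can be verified for a small sample by brute-force enumeration of $z \in \{1,2\}^n$. A minimal Python sketch, with illustrative values of $n$, $\mu_1$, $\mu_2$:

```python
import numpy as np
from itertools import product
from scipy.stats import norm

rng = np.random.default_rng(0)
mu1, mu2, n = 0.0, 2.5, 6                  # illustrative values, not from [RC]
z_true = rng.choice([1, 2], size=n, p=[0.25, 0.75])
x = rng.normal(np.where(z_true == 1, mu1, mu2), 1.0)

# Observed likelihood: product of the mixture densities
g_obs = np.prod(0.25 * norm.pdf(x, mu1, 1) + 0.75 * norm.pdf(x, mu2, 1))

# Demarginalized form: sum of H(x, z) / (sqrt(2*pi))^n over all z in {1,2}^n
g_sum = 0.0
for z in product([1, 2], repeat=n):
    z = np.asarray(z)
    H = np.prod(np.where(z == 1,
                         0.25 * np.exp(-(x - mu1) ** 2 / 2),
                         0.75 * np.exp(-(x - mu2) ** 2 / 2)))
    g_sum += H
g_sum /= (2 * np.pi) ** (n / 2)

print(g_obs, g_sum)                        # the two values coincide up to rounding
```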

[RC] Example 5.13. Censored data may come from experiments where some potential observations are replaced with a lower bound because they take too long to observe. Suppose that we observe $Y_1, \ldots, Y_m$, iid from $f(y \mid \theta)$, and that the remaining $(Y_{m+1}, \ldots, Y_n)$ are censored at the threshold $a$. The corresponding likelihood function is then

$$L(\theta \mid y) = \big(1 - F(a \mid \theta)\big)^{n-m} \prod_{i=1}^m f(y_i \mid \theta),$$

where $F$ is the cdf associated with $f$ and $y = (y_1, \ldots, y_m)$.

[RC] Example 5.13 (cont.). If we had observed the last $n - m$ values, say $z = (z_{m+1}, \ldots, z_n)$ with $z_i \geq a$ $(i = m+1, \ldots, n)$, we could have constructed the (complete-data) likelihood

$$L^c(\theta \mid y, z) = \prod_{i=1}^m f(y_i \mid \theta) \prod_{i=m+1}^n f(z_i \mid \theta).$$

Note that integrating the missing values out over $(a, +\infty)^{n-m}$ recovers the observed likelihood, $L(\theta \mid y) = \int L^c(\theta \mid y, z)\, dz$, and that the density of the missing data conditional on the observed data is the product of the $f(z_i \mid \theta)/\big(1 - F(a \mid \theta)\big)$'s, i.e., $f(z \mid \theta)$ restricted to $(a, +\infty)$.

Main Idea of EM algorithms. Demarginalization: complete the observed density into a joint density,

$$g_\theta(x) = \int f_\theta(x, z)\, dz.$$

Values of $z$ can be generated from the conditional distribution

$$k_\theta(z \mid x) = \frac{f_\theta(x, z)}{g_\theta(x)}.$$

Taking logarithms,

$$\log g_\theta(x) = \log f_\theta(x, z) - \log k_\theta(z \mid x),$$

or, in likelihood notation,

$$\log L(\theta \mid x) = \log L^c(\theta \mid x, z) - \log k_\theta(z \mid x),$$

where $L^c$ stands for the complete-data likelihood.

Main Idea of EM algorithms. Fix a value $\theta_0$ and take expectations with respect to the distribution $k_{\theta_0}(z \mid x)$:

$$\log L(\theta \mid x) = E_{k,\theta_0}\big[\log L^c(\theta \mid x, Z)\big] - E_{k,\theta_0}\big[\log k_\theta(Z \mid x)\big] =: Q(\theta \mid \theta_0, x) - H(\theta \mid \theta_0, x).$$

Theorem. Let $\theta_1$ maximize $Q$, i.e.,

$$Q(\theta_1 \mid \theta_0, x) = \max_\theta Q(\theta \mid \theta_0, x).$$

Then $\log L(\theta_1 \mid x) \geq \log L(\theta_0 \mid x)$.

Main Idea of EM algorithms. Proof. Write

$$\log L(\theta_1 \mid x) - \log L(\theta_0 \mid x) = \big( Q(\theta_1 \mid \theta_0, x) - Q(\theta_0 \mid \theta_0, x) \big) - \big( H(\theta_1 \mid \theta_0, x) - H(\theta_0 \mid \theta_0, x) \big).$$

By the definition of $\theta_1$,

$$Q(\theta_1 \mid \theta_0, x) - Q(\theta_0 \mid \theta_0, x) \geq 0.$$

For the second term,

$$H(\theta_1 \mid \theta_0, x) - H(\theta_0 \mid \theta_0, x) = E_{k,\theta_0}\big[\log k_{\theta_1}(Z \mid x)\big] - E_{k,\theta_0}\big[\log k_{\theta_0}(Z \mid x)\big] = E_{k,\theta_0}\left[\log \frac{k_{\theta_1}(Z \mid x)}{k_{\theta_0}(Z \mid x)}\right] \leq \log E_{k,\theta_0}\left[\frac{k_{\theta_1}(Z \mid x)}{k_{\theta_0}(Z \mid x)}\right] = \log 1 = 0,$$

where Jensen's inequality, $E[\log \xi] \leq \log E[\xi]$, was used. We thus have

$$Q(\theta_1 \mid \theta_0, x) - Q(\theta_0 \mid \theta_0, x) \geq 0, \qquad H(\theta_1 \mid \theta_0, x) - H(\theta_0 \mid \theta_0, x) \leq 0,$$

hence $\log L(\theta_1 \mid x) - \log L(\theta_0 \mid x) \geq 0$. This completes the proof of the theorem.

Main Idea of EM algorithms. Each iteration of the EM algorithm maximizes a function $Q$. Let $(\theta_t)$ be the sequence obtained recursively by

$$Q(\theta_{t+1} \mid \theta_t, x) = \max_\theta Q(\theta \mid \theta_t, x).$$

This recursive scheme consists of two steps, expectation and maximization, which give the scheme its name: the EM algorithm.

EM algorithm. Choose an initial parameter value $\theta_0$ and repeat:

E-step. Compute the expectation

$$Q(\theta \mid \theta_t, x) = E_{k,\theta_t}\big[\log L^c(\theta \mid x, Z)\big],$$

taken with respect to the distribution $k_{\theta_t}(z \mid x)$.

M-step. Maximize $Q(\theta \mid \theta_t, x)$ over $\theta$ and set

$$\theta_{t+1} = \arg\max_\theta Q(\theta \mid \theta_t, x);$$

then set $t = t + 1$ and return to the E-step.
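
In code, the iteration is a simple loop. The sketch below is a generic driver, not from the lecture; e_step and m_step are hypothetical user-supplied callables for the model at hand, and stopping once $\theta$ stabilizes is one common, if crude, convergence criterion.

```python
def em(theta0, x, e_step, m_step, tol=1e-8, max_iter=500):
    """Generic EM driver (illustrative sketch, scalar parameter for simplicity).

    e_step(theta_t, x) -> Q, a callable theta -> Q(theta | theta_t, x),
                          i.e. E_{k,theta_t}[log L^c(theta | x, Z)]
    m_step(Q)          -> argmax_theta Q(theta)
    Both are hypothetical hooks to be supplied for the model at hand.
    """
    theta = theta0
    for _ in range(max_iter):
        Q = e_step(theta, x)          # E-step: build Q(. | theta_t, x)
        theta_new = m_step(Q)         # M-step: maximize it over theta
        if abs(theta_new - theta) < tol:
            return theta_new          # theta has (numerically) stabilized
        theta = theta_new
    return theta
```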

[RC] Example 5.14 (continuation of Example 5.13). Assume again that $Y_1, \ldots, Y_m$ are iid with density $f(y \mid \theta)$ and that the remaining $Y_{m+1}, \ldots, Y_n$ are censored at the level $a$. The likelihood function is

$$L(\theta \mid y) = \big(1 - F(a \mid \theta)\big)^{n-m} \prod_{i=1}^m f(y_i \mid \theta),$$

where $F(a \mid \theta) = P(Y_i \leq a)$. If we had observed the last $n - m$ values, say $z = (z_{m+1}, \ldots, z_n)$ with $z_i \geq a$ $(i = m+1, \ldots, n)$, we could have constructed the (complete-data) likelihood

$$L^c(\theta \mid y, z) = \prod_{i=1}^m f(y_i \mid \theta) \prod_{i=m+1}^n f(z_i \mid \theta),$$

and the conditional density of the missing data is

$$k_\theta(z \mid y) = \prod_{i=m+1}^n \frac{f(z_i \mid \theta)}{1 - F(a \mid \theta)}.$$

[RC] Example 5.14 (cont.). Suppose that $f(y \mid \theta)$ corresponds to the $N(\theta, 1)$ distribution. The complete-data likelihood is

$$L^c(\theta \mid y, z) \propto \prod_{i=1}^m e^{-(y_i - \theta)^2/2} \prod_{i=m+1}^n e^{-(z_i - \theta)^2/2},$$

resulting in the expected complete-data log-likelihood

$$Q(\theta \mid \theta_0, y) = -\frac{1}{2} \sum_{i=1}^m (y_i - \theta)^2 - \frac{1}{2} \sum_{i=m+1}^n E_{k,\theta_0}\big[(Z_i - \theta)^2\big],$$

where, under $k_{\theta_0}$, the missing observations $Z_i$ follow a normal $N(\theta_0, 1)$ distribution truncated at $a$. Doing the M-step (i.e., differentiating $Q(\theta \mid \theta_0, y)$ in $\theta$ and setting the derivative equal to zero) then leads to the EM update

$$\hat{\theta} = \frac{m \bar{y} + (n - m)\, E_{k,\theta_0}(Z_1)}{n}.$$

[RC] Example 5.14 (cont.). Recall the EM update $\hat{\theta} = \big(m \bar{y} + (n - m)\, E_{k,\theta_0}(Z_1)\big)/n$. Since

$$E_{k,\theta_0}(Z_1) = \theta_0 + \frac{\varphi(a - \theta_0)}{1 - \Phi(a - \theta_0)},$$

where $\varphi$ and $\Phi$ are the standard normal pdf and cdf, respectively, the EM sequence is

$$\theta_{t+1} = \frac{m}{n}\, \bar{y} + \frac{n - m}{n} \left( \theta_t + \frac{\varphi(a - \theta_t)}{1 - \Phi(a - \theta_t)} \right).$$
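
A minimal Python implementation of this EM sequence, assuming $f$ is $N(\theta, 1)$ with censoring threshold $a$; the simulated data, the threshold, and the starting value below are illustrative choices, not taken from [RC].

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta_true, a, n = 1.0, 1.5, 200        # illustrative values, not from [RC]
sample = rng.normal(theta_true, 1.0, n)
y = sample[sample <= a]                 # observed values
m = y.size                              # the other n - m values are censored at a

theta = y.mean()                        # crude starting value theta_0
for _ in range(200):
    # EM update: theta_{t+1} = (m/n)*ybar + ((n-m)/n)*(theta_t + phi(a-theta_t)/(1-Phi(a-theta_t)))
    ez = theta + norm.pdf(a - theta) / (1 - norm.cdf(a - theta))
    theta_new = (m * y.mean() + (n - m) * ez) / n
    if abs(theta_new - theta) < 1e-10:
        break
    theta = theta_new

print(theta)                            # EM estimate of theta from the censored sample
```

The sequence typically stabilizes after a handful of iterations; any fixed point of this recursion is a stationary point of the observed-data likelihood $L(\theta \mid y)$.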

Principle of missing information (informally). From

$$\log L(\theta \mid x) = \log L^c(\theta \mid x, z) - \log k_\theta(z \mid x)$$

we obtain

$$-\frac{\partial^2 \log L(\theta \mid x)}{\partial \theta^2} = -\frac{\partial^2 \log L^c(\theta \mid x, z)}{\partial \theta^2} + \frac{\partial^2 \log k_\theta(z \mid x)}{\partial \theta^2},$$

that is,

Observed information = Complete information − Missing information.

Informally, this is related to the convergence rate of EM algorithms: the larger the proportion of missing information, the slower the algorithm converges.

EM algorithms for exponential family models. The algorithm is easier to implement when the complete data $(z, x)$ has a distribution from the exponential family in canonical form:

$$p(z, x \mid \theta) = \frac{b(z, x) \exp\big(\theta^T s(z, x)\big)}{a(\theta)}.$$

Let $y = (z, x)$ denote the complete data. We have

$$\log p(y \mid \theta) = \log b(y) + \theta^T s(y) - \log a(\theta),$$

$$\frac{\partial}{\partial \theta} \log p(y \mid \theta) = s(y) - \frac{1}{a(\theta)} \frac{\partial a(\theta)}{\partial \theta}.$$

Remember that $a(\theta) = \int b(y) \exp\big(\theta^T s(y)\big)\, dy$.

EM algorithms for exponential family models. From

$$\frac{\partial}{\partial \theta} \log p(y \mid \theta) = s(y) - \frac{1}{a(\theta)} \frac{\partial a(\theta)}{\partial \theta}, \qquad a(\theta) = \int b(y) \exp\big(\theta^T s(y)\big)\, dy,$$

we have

$$\frac{\partial \log a(\theta)}{\partial \theta} = \frac{1}{a(\theta)} \frac{\partial a(\theta)}{\partial \theta} = \frac{1}{a(\theta)} \int b(y)\, s(y) \exp\big(\theta^T s(y)\big)\, dy = \int s(y)\, \frac{b(y) \exp\big(\theta^T s(y)\big)}{a(\theta)}\, dy = E\big(s(y) \mid \theta\big).$$

Thus

$$\frac{\partial}{\partial \theta} \log p(y \mid \theta) = 0 \iff s(y) = E\big(s(y) \mid \theta\big).$$
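
As a familiar special case (not spelled out in the slides): for complete data $y_1, \ldots, y_n$ iid $N(\theta, 1)$, the canonical sufficient statistic is $s(y) = \sum_i y_i$ and $E\big(s(y) \mid \theta\big) = n\theta$, so the likelihood equation $s(y) = E\big(s(y) \mid \theta\big)$ reduces to $\sum_i y_i = n\theta$, i.e., the usual MLE $\hat{\theta} = \bar{y}$.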

EM algorithms for exponential family models. Implementation of the EM algorithm. E-step:

$$Q(\theta \mid \theta_t) = \int \log p(z, x \mid \theta)\, p(z \mid \theta_t, x)\, dz = \int \log b(z, x)\, p(z \mid \theta_t, x)\, dz + \theta^T \int s(z, x)\, p(z \mid \theta_t, x)\, dz - \log a(\theta).$$

Note that the first term plays no role in the M-step. M-step: we look for a stationary point,

$$\frac{\partial Q(\theta \mid \theta_t)}{\partial \theta} = \int s(z, x)\, p(z \mid \theta_t, x)\, dz - \frac{1}{a(\theta)} \frac{\partial a(\theta)}{\partial \theta} = E\big(s(z, x) \mid \theta_t, x\big) - E\big(s(z, x) \mid \theta\big).$$

Remember that $E\big(s(z, x) \mid \theta\big) = \dfrac{1}{a(\theta)} \dfrac{\partial a(\theta)}{\partial \theta}$.

EM algorithms for exponential family models. Thus, maximizing $Q(\theta \mid \theta_t, x)$ in the M-step is equivalent to solving, in $\theta$, the equation

$$E\big(s(z, x) \mid \theta_t, x\big) = E\big(s(z, x) \mid \theta\big).$$

If a solution exists, it is unique (for a minimal exponential family representation, $\log a(\theta)$ is strictly convex, so $Q(\theta \mid \theta_t, x)$ has at most one maximizer).
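
As a concrete check of this moment-matching form (an observation, not taken from [RC]): in the censored normal model of Example 5.14, the complete data have sufficient statistic $s(z, x) = \sum_{i=1}^m y_i + \sum_{i=m+1}^n z_i$ with $E\big(s(z, x) \mid \theta\big) = n\theta$, so the equation $E\big(s(z, x) \mid \theta_t, x\big) = E\big(s(z, x) \mid \theta\big)$ reads

$$m \bar{y} + (n - m)\, E_{k,\theta_t}(Z_1) = n\theta,$$

whose solution in $\theta$ is exactly the EM update of Example 5.14.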

Monte Carlo for the E-step. Given $\theta_t$ we need to compute

$$Q(\theta \mid \theta_t, x) = E_{k,\theta_t}\big[\log L^c(\theta \mid x, Z)\big].$$

When this expectation is difficult to compute explicitly, it can be approximated by Monte Carlo:

1. generate $z_1, \ldots, z_m$ according to $k_{\theta_t}(z \mid x)$;

2. compute $\hat{Q}(\theta \mid \theta_t) = \frac{1}{m} \sum_{i=1}^m \log L^c(\theta \mid z_i, x)$.

In the M-step one maximizes $\hat{Q}$ over $\theta$ to obtain $\theta_{t+1}$.
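
For the censored normal model of Example 5.14, the Monte Carlo E-step only requires draws from $N(\theta_t, 1)$ truncated to $(a, \infty)$, and the M-step stays in closed form. The sketch below illustrates this Monte Carlo EM idea and is not code from [RC]; it simply replaces $E_{k,\theta_t}(Z_1)$ in the update of Example 5.14 by a Monte Carlo average.

```python
import numpy as np
from scipy.stats import truncnorm

def mcem_update(theta_t, y, n, a, n_draws=2000, rng=None):
    """One Monte Carlo EM update for the censored N(theta, 1) model.

    Draw the missing observations from N(theta_t, 1) truncated to (a, inf)
    and plug their sample mean into the closed-form M-step of Example 5.14.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = y.size
    # scipy's truncnorm takes standardized bounds: (a - loc)/scale up to +inf
    z = truncnorm.rvs(a - theta_t, np.inf, loc=theta_t, scale=1.0,
                      size=n_draws, random_state=rng)
    return (m * y.mean() + (n - m) * z.mean()) / n
```

Iterating mcem_update (typically with an increasing number of draws) produces a noisy version of the deterministic EM sequence of Example 5.14.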

EM algorithm. [H] Hunter, D.R., On the Geometry of EM algorithms: this paper demonstrates how the geometry of EM algorithms can help explain how their rate of convergence is related to the proportion of missing data. In a footnote, [H] writes: "In a footnote, [DLR] refer to the comment of a referee, who noted that the use of the word algorithm may be criticized since EM is not, strictly speaking, an algorithm. However, EM is a recipe for creating algorithms, and thus we consider the set of EM algorithms to consist of all algorithms baked according to the EM recipe."

References.

[DLR] Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B, 39, 1-38.

[H] Hunter, D.R. (2003). On the Geometry of EM algorithms. Technical Report 0303, Department of Statistics, Penn State University, February 2003.

[RC] Robert, C.P. and Casella, G. (2010). Introducing Monte Carlo Methods with R. Use R! Series. Springer.