
Chapter 7: EM algorithm in exponential families: JAW 4.30-32

7.1 (i) The EM algorithm finds MLEs in problems with latent variables (sometimes called "missing data"): things you wish you could observe, but cannot.

(ii) Recall the homework example (with added parameter $\sigma^2$): $Y_i$ i.i.d. from the mixture $\frac{1}{2}N(-\theta, \sigma^2) + \frac{1}{2}N(\theta, \sigma^2)$, so
$$ f_Y(y; \theta, \sigma^2) = \tfrac{1}{2}(2\pi\sigma^2)^{-1/2} \exp(-y^2/(2\sigma^2))\, \exp(-\theta^2/(2\sigma^2)) \big( \exp(-y\theta/\sigma^2) + \exp(y\theta/\sigma^2) \big). $$

(iii) The observed-data log-likelihood is
$$ l_n(\theta, \sigma^2) = \mathrm{const} - (n/2)\log(\sigma^2) - \Big(\sum_i Y_i^2\Big)/(2\sigma^2) - n\theta^2/(2\sigma^2) + \sum_i \log\!\big( \exp(-Y_i\theta/\sigma^2) + \exp(Y_i\theta/\sigma^2) \big). $$

(iv) What is the sufficient statistic? How would we estimate $(\theta, \sigma^2)$?

(v) Now suppose each $Y_i$ carries a flag $Z_i = -1$ or $1$ according as observation $i$ comes from $N(-\theta, \sigma^2)$ or $N(\theta, \sigma^2)$. Let $X_i = (Y_i, Z_i)$. Then
$$ f_X(y, z; \theta, \sigma^2) = \tfrac{1}{2}(2\pi\sigma^2)^{-1/2} \exp\!\big( -(y - \theta z)^2/(2\sigma^2) \big), $$
$$ l_{c,n}(\theta, \sigma^2) = \sum_i \log f_X(Y_i, Z_i; \theta, \sigma^2) = \mathrm{const} - (n/2)\log(\sigma^2) - \sum_i (Y_i - \theta Z_i)^2/(2\sigma^2) $$
$$ \qquad\qquad = \mathrm{const} - (n/2)\log(\sigma^2) - (2\sigma^2)^{-1}\Big( \sum_i Y_i^2 - 2\theta \sum_i Y_i Z_i + n\theta^2 \Big). $$

(vi) What now is the sufficient statistic, if the $X_i$ were observed? How now could you estimate $(\theta, \sigma^2)$? The $Z_i$ are "latent variables"; the $X_i$ are the "complete data".

(vii) $E(l_{c,n} \mid Y)$ requires only
$$ E(Z_i \mid Y_i) = \frac{\phi((Y_i - \theta)/\sigma) - \phi((Y_i + \theta)/\sigma)}{\phi((Y_i - \theta)/\sigma) + \phi((Y_i + \theta)/\sigma)}. $$
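As a concrete illustration of (v)-(vii), here is a minimal NumPy sketch of the resulting EM iteration (not part of the notes; the closed-form M-step is obtained by maximizing the expected complete-data log-likelihood above, so check it against the homework). Note that the E-step formula in (vii) simplifies to $\tanh(Y_i\theta/\sigma^2)$.

```python
import numpy as np

def em_sign_mixture(y, theta=1.0, sigma2=1.0, n_iter=50):
    """EM for Y_i iid from 0.5*N(-theta, sigma2) + 0.5*N(theta, sigma2).

    E-step: delta_i = E(Z_i | Y_i) = tanh(Y_i * theta / sigma2), as in 7.1(vii).
    M-step: closed-form maximizer of the expected complete-data log-likelihood
            of 7.1(v) (a sketch, not the notes' own code).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    for _ in range(n_iter):
        # E-step: expected latent signs given the data and current parameters
        delta = np.tanh(y * theta / sigma2)
        # M-step: maximize Q(theta, sigma2; current values)
        theta = np.sum(y * delta) / n
        sigma2 = np.sum(y**2) / n - theta**2
    return theta, sigma2

# Illustration on simulated data: the estimates should come out near (2.0, 1.0),
# up to the sign ambiguity in theta.
rng = np.random.default_rng(0)
z = rng.choice([-1.0, 1.0], size=500)
y = rng.normal(loc=2.0 * z, scale=1.0)
print(em_sign_mixture(y))
```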

7.2 Defining the EM algorithm

(i) Data random variables $Y$, with probability measure $Q_\theta$, $\theta \in \Theta \subseteq R^k$. Suppose $Y$ has density $f_\theta$ w.r.t. some $\sigma$-finite measure $\mu$ (usually Lebesgue measure (pdf) or counting measure (pmf)).

(ii) $l(\theta) = \log L(\theta) = \log f_\theta(y)$, for observed data $Y = y$. Suppose maximization of $l(\theta)$ is messy/impossible; $k > 1$ but not huge.

(iii) Suppose we augment the random variables to complete data $X$: $Y = Y(X)$. Suppose $X$ has density $g_\theta(x)$. Then
$$ L(\theta) = f_\theta(y) = \int_{\{x : Y(x) = y\}} g_\theta(x)\, d\mu(x), $$
and $l_c(\theta) = \log g_\theta(x)$ is known as the complete-data log-likelihood.

(iv) Let $Q(\theta; \theta') = E_{\theta'}(l_c(\theta) \mid Y(X) = y)$. Then, writing $g_\theta(X) = h_\theta(X \mid Y = y) f_\theta(y)$,
$$ l_c(\theta; X) = \log h_\theta(X \mid Y = y) + l(\theta; y), $$
$$ Q(\theta; \theta') = H(\theta; \theta') + l(\theta), \quad \text{where } H(\theta; \theta') = E_{\theta'}(\log h_\theta(X \mid Y = y) \mid Y(X) = y). $$

(v) The EM algorithm is:
E-step: at the current estimate $\theta'$, compute $Q(\theta; \theta')$.
M-step: maximize $Q(\theta; \theta')$ w.r.t. $\theta$ to obtain a new estimate $\tilde\theta$.
Set $\theta' = \tilde\theta$ and repeat ad nauseam.
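In code, 7.2(v) is just a loop alternating two problem-specific maps. A minimal generic skeleton (my sketch, assuming the caller supplies the E-step expectations and the M-step maximizer) is:

```python
from typing import Callable, TypeVar

Theta = TypeVar("Theta")
Expectations = TypeVar("Expectations")

def em(theta0: Theta,
       e_step: Callable[[Theta], Expectations],
       m_step: Callable[[Expectations], Theta],
       n_iter: int = 100) -> Theta:
    """Generic EM loop, a sketch of 7.2(v).

    e_step(theta_prime) returns whatever conditional expectations are needed
    to form Q(theta; theta_prime); m_step(...) returns the maximizer of Q
    over theta.  Both ingredients are problem-specific.
    """
    theta = theta0
    for _ in range(n_iter):
        expectations = e_step(theta)   # E-step at the current estimate
        theta = m_step(expectations)   # M-step: maximize Q(.; theta')
    return theta
```

In practice one stops when the change in $\theta$ (or in $l(\theta)$) falls below a tolerance rather than after a fixed number of iterations.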

7.3 Why does EM work?

(i) Recall (see Kullback-Leibler information) that for any densities $p$ and $q$ of an r.v. $Z$, $E_q(\log p(Z)) = \int \log(p(z))\, q(z)\, d\mu(z)$ is maximized w.r.t. $p$ by $p = q$. ($K(q; p) = E_q \log(q/p) \geq 0$.)

(ii) Hence $H(\theta; \theta') \leq H(\theta'; \theta')$ for all $\theta, \theta'$.

(iii) Now with new/old estimates $\tilde\theta, \theta'$:
$$ l(\tilde\theta) - l(\theta') = Q(\tilde\theta; \theta') - Q(\theta'; \theta') - \big( H(\tilde\theta; \theta') - H(\theta'; \theta') \big) \quad \text{by 7.2(iv)} $$
$$ \geq H(\theta'; \theta') - H(\tilde\theta; \theta') \quad \text{by the M-step} $$
$$ \geq 0 \quad \text{by (ii).} $$
Also $l(\tilde\theta) - l(\theta') > 0$, unless $h_{\tilde\theta}(X \mid Y) = h_{\theta'}(X \mid Y)$.

(iv) Thus each step of EM cannot decrease $l(\theta)$, and usually increases it.

(v) If the MLE $\hat\theta$ is the unique stationary point of $l(\theta)$ in the interior of the parameter space, then $\theta' \to \hat\theta$.

(vi) In practice, EM is very robust, but can be very slow, especially in the final stages: convergence is first-order.

(vii) Caution: we do NOT use expectations to impute the missing data. We compute the expected complete-data log-likelihood. This normally involves using conditional expectations to impute the complete-data sufficient statistics. This is NOT the same thing; see hwk. And it could be more complicated than this, although not if we have chosen sensible complete data.

7.4 A multinomial example

(i) Bernstein (1928) used population data to validate the hypothesis that human ABO blood types are determined by 3 alleles, A, B and O, at a single genetic locus, rather than being 2 independent factors A/not-A, B/not-B.

(ii) Suppose that the population frequencies of the A, B and O alleles are $p$, $q$ and $r$ ($p + q + r = 1$); we want to estimate $(p, q, r)$.

(iii) We assume that the types of the two alleles carried by an individual are independent (Hardy-Weinberg equilibrium: 1908), and that individuals are independent ("unrelated").

(iv) ABO blood types are determined as follows:

    blood type   genotype   freq.
    A            AA         p^2
    A            AO         2pr
    B            BB         q^2
    B            BO         2qr
    AB           AB         2pq
    O            OO         r^2

(v) $Y \sim M_4(n, (p^2 + 2pr,\; q^2 + 2qr,\; 2pq,\; r^2))$, while $X \sim M_6(n, (p^2,\; 2pr,\; q^2,\; 2qr,\; 2pq,\; r^2))$.

(vi) $l(p, q, r)$ is easy to evaluate but hard to maximize:
$$ l(p, q, r) = \mathrm{const} + y_A \log(p^2 + 2pr) + y_B \log(q^2 + 2qr) + y_{AB} \log(2pq) + y_O \log(r^2). $$

(vii) $l_c(p, q, r)$ is easy to maximize:
$$ l_c(p, q, r) = \mathrm{const} + x_{AA} \log(p^2) + x_{AO} \log(2pr) + x_{BB} \log(q^2) + x_{BO} \log(2qr) + x_{AB} \log(2pq) + x_{OO} \log(r^2) $$
$$ = \mathrm{const} + (2x_{AA} + x_{AO} + x_{AB}) \log p + (2x_{BB} + x_{BO} + x_{AB}) \log q + (2x_{OO} + x_{AO} + x_{BO}) \log r. $$

(viii) E-step:
$$ \hat x_{AA} = E_{p,q,r}(X_{AA} \mid Y = y) = \frac{p^2}{p^2 + 2pr}\, y_A = \frac{p}{p + 2r}\, y_A, \quad \text{etc.} $$

(ix) M-step:
$$ \tilde p = (2n)^{-1}(2\hat x_{AA} + \hat x_{AO} + y_{AB}), \quad \tilde q = (2n)^{-1}(2\hat x_{BB} + \hat x_{BO} + y_{AB}), \quad \tilde r = 1 - \tilde p - \tilde q. $$

(x) This method was known to geneticists in the 1950s as "gene counting". (The EM algorithm dates to 1977: Dempster, Laird & Rubin.)

7.5 A mixture example (see prev hwk)

(i) $Y_i$ i.i.d., with pdf $f(y; \theta, \psi) = \theta f_1(y; \psi) + (1 - \theta) f_2(y; \psi)$.

(ii) $l(\theta, \psi) = \sum_i \log\big( \theta f_1(y_i; \psi) + (1 - \theta) f_2(y_i; \psi) \big)$.

(iii) Let $Z_i = I(Y_i \sim f_1)$, with $P(Z_i = 1) = \theta$. Then
$$ l_c(\theta, \psi) = \sum_i \big( Z_i \log\theta + (1 - Z_i)\log(1 - \theta) + Z_i \log f_1(y_i; \psi) + (1 - Z_i) \log f_2(y_i; \psi) \big). $$

(iv) $l_c$ is linear in the $Z_i$, so the E-step requires only
$$ \delta_i = E(Z_i \mid Y) = \frac{\theta f_1(y_i; \psi)}{\theta f_1(y_i; \psi) + (1 - \theta) f_2(y_i; \psi)}. $$

(v) M-step: $\tilde\theta = \sum_i \delta_i / n$, and $\tilde\psi$ maximizes $\sum_i \big( \delta_i \log f_1(y_i; \psi) + (1 - \delta_i) \log f_2(y_i; \psi) \big)$.

(vi) Example: $f_j(y_i; \psi) = \psi_j^{-1} \exp(-y_i/\psi_j)$. Then $\sum_i \delta_i \log f_1(y_i; \psi) = -\log\psi_1 \sum_i \delta_i - \sum_i \delta_i y_i / \psi_1$, so
$$ \tilde\psi_1 = \sum_i \delta_i y_i \Big/ \sum_i \delta_i, \qquad \tilde\psi_2 = \sum_i (1 - \delta_i) y_i \Big/ \sum_i (1 - \delta_i). $$

(vii) Be careful about identifiability: exchanging the probabilities and labels on the components gives the same mixture; e.g. fix $\psi_1 < \psi_2$.
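Here is a minimal Python sketch of the gene-counting iteration of 7.4 (not part of the notes); the blood-type counts in the usage line are made up purely for illustration.

```python
def abo_em(y_A, y_B, y_AB, y_O, n_iter=50):
    """Gene-counting EM for ABO allele frequencies, following 7.4.

    (y_A, y_B, y_AB, y_O) are the observed blood-type counts; returns the
    estimated allele frequencies (p, q, r).  A sketch, not the notes' code.
    """
    n = y_A + y_B + y_AB + y_O
    p, q = 1.0 / 3.0, 1.0 / 3.0            # start from equal allele frequencies
    for _ in range(n_iter):
        r = 1.0 - p - q
        # E-step: split the A and B phenotype counts into expected genotype counts
        x_AA = y_A * p**2 / (p**2 + 2.0 * p * r)
        x_AO = y_A - x_AA
        x_BB = y_B * q**2 / (q**2 + 2.0 * q * r)
        x_BO = y_B - x_BB
        # M-step: count alleles among the 2n gene copies
        p = (2.0 * x_AA + x_AO + y_AB) / (2.0 * n)
        q = (2.0 * x_BB + x_BO + y_AB) / (2.0 * n)
    return p, q, 1.0 - p - q

# Made-up counts, for illustration only
print(abo_em(y_A=200, y_B=50, y_AB=20, y_O=230))
```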

7.6 Other types of example

(i) Missing data (actually missing observations). Caution: we do NOT use expectations to impute the missing data.

(ii) Variance component models (see hwk 9):
(a) $Y = AZ + e$, $e \sim N_n(0, \tau^2 I)$, $Z \sim N_r(0, \sigma^2 G)$, $A$ an $n \times r$ matrix. Then $Y \sim N_n(0, \sigma^2 A G A' + \tau^2 I)$.
(b) $X = (Y, Z)$:
$$ l_c(\sigma^2, \tau^2) = -(n/2)\log(\tau^2) - (r/2)\log(\sigma^2) - (2\tau^2)^{-1}(y - Az)'(y - Az) - (2\sigma^2)^{-1} z' G^{-1} z. $$
(c) $\tilde\sigma^2 = r^{-1} E(Z'G^{-1}Z \mid y)$, $\tilde\tau^2 = n^{-1} E((Y - AZ)'(Y - AZ) \mid Y)$, but $E(Z'G^{-1}Z \mid y) \neq E(Z \mid y)' G^{-1} E(Z \mid y)$.
(d) Note $X \sim N_{n+r}(0, V)$, so we have the usual formulae $E(Z \mid Y) = V_{zy} V_{yy}^{-1} Y$, $\mathrm{var}(Z \mid Y) = V_{zz} - V_{zy} V_{yy}^{-1} V_{yz}$.
(e) Also, if $E(W) = 0$, then $E(W'BW) = E\big( \sum_{i,j} W_i W_j B_{ij} \big) = \sum_{i,j} \mathrm{var}(W)_{ij} B_{ij} = \mathrm{tr}(\mathrm{var}(W) B)$.

(iii) Censored data, age-of-onset data, competing-risks models, etc.

(iv) Hidden states, latent variables: models in genetics, biology, climate modelling, environmental modelling.

(v) General auxiliary variables: the latent variables do not have to mean anything; they are simply a tool such that the complete-data log-likelihood is easy.
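To illustrate (ii)(c)-(e), here is a small NumPy sketch (mine, not the notes') of the E-step quantities for one iteration. It uses the conditional-normal formulae of (d), and applies the quadratic-form identity of (e) to the zero-mean vector $Z - E(Z \mid y)$, giving $E(Z'G^{-1}Z \mid y) = \mathrm{tr}(G^{-1}\,\mathrm{var}(Z \mid y)) + E(Z \mid y)' G^{-1} E(Z \mid y)$; this is exactly why the naive $E(Z \mid y)'G^{-1}E(Z \mid y)$ in (c) is not enough.

```python
import numpy as np

def vc_e_step(y, A, G, sigma2, tau2):
    """E-step quantities for Y = A Z + e, Z ~ N_r(0, sigma2*G), e ~ N_n(0, tau2*I).

    Returns E(Z' G^{-1} Z | y) and E((Y - A Z)'(Y - A Z) | y), the conditional
    expectations needed for the M-step in 7.6(ii)(c).  A sketch only.
    """
    n, r = A.shape
    V_zz = sigma2 * G                          # var(Z)
    V_zy = V_zz @ A.T                          # cov(Z, Y) = sigma2 * G * A'
    V_yy = A @ V_zy + tau2 * np.eye(n)         # var(Y) = sigma2*A G A' + tau2*I
    # Conditional-normal formulae, 7.6(ii)(d)
    m = V_zy @ np.linalg.solve(V_yy, y)                   # E(Z | y)
    C = V_zz - V_zy @ np.linalg.solve(V_yy, V_zy.T)       # var(Z | y)
    G_inv = np.linalg.inv(G)
    # Quadratic-form identity of (e), applied to Z - E(Z|y):
    # E(Z' G^{-1} Z | y) = tr(G^{-1} C) + m' G^{-1} m
    E_zGz = np.trace(G_inv @ C) + m @ G_inv @ m
    # E((Y - A Z)'(Y - A Z) | y) = ||y - A m||^2 + tr(A C A')
    resid = y - A @ m
    E_rss = resid @ resid + np.trace(A @ C @ A.T)
    return E_zGz, E_rss

# Tiny usage example with made-up dimensions and data
rng = np.random.default_rng(1)
n, r = 6, 3
A = rng.normal(size=(n, r))
y = rng.normal(size=n)
print(vc_e_step(y, A, G=np.eye(r), sigma2=1.0, tau2=0.5))
```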

7.7 Likelihood in exponential families

(i) See Chapter 2 for definitions, the natural sufficient statistics $T_j$, the natural parameter space and parametrization $\pi_j$.

(ii) Moment formulae (see 2.2(vii)):
$$ E(t_j(X)) = -\frac{\partial \log c(\pi)}{\partial \pi_j}, \qquad \mathrm{Cov}(t_j(X), t_l(X)) = -\frac{\partial^2 \log c(\pi)}{\partial \pi_j\, \partial \pi_l}. $$

(iii) Likelihood equation for an exponential family (see 4.6):
$$ \frac{\partial l}{\partial \pi_j} = n\Big( n^{-1} \sum_i t_j(X_i) - E(t_j(X)) \Big). $$
The natural sufficient statistics $T_j$ are set equal to their expectations $n\tau_j$.

(iv) The information results (see 4.6):
$$ I(\pi) = J(\pi) = \mathrm{var}(T_1, \ldots, T_k), \qquad I(\tau) = (\mathrm{var}(T_1, \ldots, T_k))^{-1}, \quad \text{where } \tau_j = E(t_j(X)). $$
$(T_1, \ldots, T_k)$ achieves the (multiparameter) CRLB for $(\tau_1, \ldots, \tau_k)$.

(v) Suppose the complete data $X$ have exponential-family form: for an $n$-sample, $T_j(X) = \sum_{i=1}^n t_j(X_i)$ and
$$ \log g_\theta(X) = \log c(\theta) + \sum_{j=1}^k \pi_j(\theta) T_j(X) + \log w(X), $$
$$ Q(\theta; \theta') = \log c(\theta) + \sum_{j=1}^k \pi_j(\theta) E_{\theta'}(T_j(X) \mid Y) + E_{\theta'}(\log w(X) \mid Y). $$
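As a quick sanity check of (ii) and (iii) (not in the notes), take the Poisson family $f(x; \lambda) = e^{-\lambda}\lambda^x/x!$, with natural parameter $\pi = \log\lambda$, $t(x) = x$, $w(x) = 1/x!$ and $c(\pi) = \exp(-e^{\pi})$. Then
$$ E(t(X)) = -\frac{\partial \log c(\pi)}{\partial \pi} = e^{\pi} = \lambda, \qquad \mathrm{var}(t(X)) = -\frac{\partial^2 \log c(\pi)}{\partial \pi^2} = \lambda, $$
the likelihood equation sets $\bar X = E(X) = \lambda$, so $\hat\lambda = \bar X$, and the information about $\pi$ in an $n$-sample is $I(\pi) = \mathrm{var}\big(\sum_i X_i\big) = n\lambda$.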

7.8 EM for exponential families

(i) In the natural parametrization $\pi_j$ (dropping terms not involving $\pi$):
$$ Q(\pi; \pi') = \log c(\pi) + \sum_{j=1}^k \pi_j E_{\pi'}(T_j(X) \mid Y), $$
$$ \frac{\partial Q}{\partial \pi_j} = E_{\pi'}(T_j(X) \mid Y) + \frac{\partial \log c(\pi)}{\partial \pi_j} = E_{\pi'}(T_j(X) \mid Y) - E_{\pi}(T_j(X)). $$
Thus EM iteratively fits unconditioned to conditioned expectations of the $T_j$. At the MLE, $E_{\pi}(T_j(X) \mid Y) = E_{\pi}(T_j(X))$.

(ii) Recall $l(\pi) = \log g_\pi(X) - \log h_\pi(X \mid Y)$, but
$$ h_\pi(X \mid Y) = \frac{w(X) \exp\big( \sum_j \pi_j T_j(X) \big)}{\int_{\{x : Y(x) = y\}} w(x) \exp\big( \sum_j \pi_j T_j(x) \big)\, d\mu(x)} = c^*(\pi; Y)\, w(X) \exp\Big( \sum_j \pi_j T_j(X) \Big), $$
so $l(\pi) = \log c(\pi) - \log c^*(\pi; Y)$.

(iii) Hence, differentiating this:
$$ \frac{\partial l}{\partial \pi_j} = -E_{\pi}(T_j) + E_{\pi}(T_j \mid Y). $$
At the MLE: $E_{\pi}(T_j) = E_{\pi}(T_j \mid Y)$.

(iv) Differentiating again:
$$ -\frac{\partial^2 l}{\partial \pi_j\, \partial \pi_l} = \mathrm{Cov}(T_j, T_l) - \mathrm{Cov}(T_j, T_l \mid Y). $$
If $Y$ determines $X$, $\mathrm{var}(T(X) \mid Y) = 0$, and then the observed information is $\mathrm{var}(T)$, as for any exponential family. If $Y$ tells us nothing about $X$, $\mathrm{var}(T(X) \mid Y) = \mathrm{var}(T(X))$, and the observed information is 0. The information lost due to observing $Y$ rather than $X$ is $\mathrm{var}(T(X) \mid Y)$.
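As an illustration of (i) (my gloss, not spelled out in the notes), the gene-counting iteration of 7.4 has exactly this form. The complete-data log-likelihood there is that of a multinomial sample of $2n$ alleles, with sufficient statistics the allele counts
$$ N_A = 2X_{AA} + X_{AO} + X_{AB}, \qquad N_B = 2X_{BB} + X_{BO} + X_{AB}, $$
and $E_{p,q,r}(N_A) = 2np$, $E_{p,q,r}(N_B) = 2nq$. The M-step $\tilde p = (2n)^{-1} E_{p',q',r'}(N_A \mid Y)$ therefore sets the unconditional expectation $E_{\tilde p}(N_A) = 2n\tilde p$ equal to the conditional one, and at the MLE $E_{\hat p}(N_A \mid Y) = E_{\hat p}(N_A) = 2n\hat p$.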