Chapter 7: EM algorithm in exponential families: JAW 4.30-32

7.1

(i) The EM algorithm finds MLEs in problems with latent variables (sometimes called missing data): things you wish you could observe, but cannot.

(ii) Recall the homework example (with added parameter $\sigma^2$): $Y_i$ i.i.d. from the mixture $\tfrac{1}{2}N(-\theta,\sigma^2) + \tfrac{1}{2}N(\theta,\sigma^2)$, so
$$f_Y(y;\theta,\sigma^2) = \tfrac{1}{2}(2\pi\sigma^2)^{-1/2}\exp(-y^2/(2\sigma^2))\exp(-\theta^2/(2\sigma^2))\big(\exp(-y\theta/\sigma^2)+\exp(y\theta/\sigma^2)\big).$$

(iii) The observed-data log-likelihood is
$$\ell_n(\theta,\sigma^2) = \text{const} - (n/2)\log(\sigma^2) - \Big(\sum_i Y_i^2\Big)\big/(2\sigma^2) - n\theta^2/(2\sigma^2) + \sum_i \log\big(\exp(-Y_i\theta/\sigma^2)+\exp(Y_i\theta/\sigma^2)\big).$$

(iv) What is the sufficient statistic? How would we estimate $(\theta,\sigma^2)$?

(v) Now suppose each $Y_i$ carries a flag $Z_i = -1$ or $+1$ according as observation $i$ comes from $N(-\theta,\sigma^2)$ or $N(\theta,\sigma^2)$. Let $X_i = (Y_i, Z_i)$. Then
$$f_X(y,z;\theta,\sigma^2) = \tfrac{1}{2}(2\pi\sigma^2)^{-1/2}\exp\big(-(y-\theta z)^2/(2\sigma^2)\big),$$
$$\ell_{c,n}(\theta,\sigma^2) = \sum_i \log f(Y_i, Z_i;\theta,\sigma^2) = \text{const} - (n/2)\log(\sigma^2) - \sum_i (Y_i-\theta Z_i)^2/(2\sigma^2)$$
$$= \text{const} - (n/2)\log(\sigma^2) - (2\sigma^2)^{-1}\Big(\sum_i Y_i^2 - 2\theta\sum_i Y_iZ_i + \theta^2 n\Big).$$

(vi) What now is the sufficient statistic, if the $X_i$ were observed? How now could you estimate $(\theta,\sigma^2)$? The $Z_i$ are latent variables; the $X_i$ are the complete data.

(vii) $E(\ell_{c,n}\mid Y)$ requires only
$$E(Z_i\mid Y_i) = \frac{\phi((Y_i-\theta)/\sigma) - \phi((Y_i+\theta)/\sigma)}{\phi((Y_i-\theta)/\sigma) + \phi((Y_i+\theta)/\sigma)}.$$
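The notes ask how one would estimate $(\theta,\sigma^2)$ from the complete data; here is a minimal numerical sketch of the resulting EM iteration. The M-step formulas $\hat\theta = \sum_i Y_i\delta_i/n$ and $\hat\sigma^2 = (\sum_i Y_i^2 - n\hat\theta^2)/n$ are obtained by maximizing the imputed complete-data log-likelihood in (v); they are not spelled out above, and the function names, starting values and simulated data are illustrative only.

```python
import numpy as np
from scipy.stats import norm

def e_step(y, theta, sigma2):
    """delta_i = E(Z_i | Y_i) for the mixture 0.5*N(-theta, sigma2) + 0.5*N(theta, sigma2)."""
    s = np.sqrt(sigma2)
    num = norm.pdf((y - theta) / s) - norm.pdf((y + theta) / s)
    den = norm.pdf((y - theta) / s) + norm.pdf((y + theta) / s)
    return num / den

def m_step(y, delta):
    """Maximize l_c with the sufficient statistic sum(Y_i Z_i) replaced by sum(Y_i delta_i)."""
    n = len(y)
    theta = np.sum(y * delta) / n                 # from d l_c / d theta = 0
    sigma2 = (np.sum(y**2) - n * theta**2) / n    # from d l_c / d sigma2 = 0, at the new theta
    return theta, sigma2

def em(y, theta=1.0, sigma2=1.0, n_iter=200):
    # start away from theta = 0, which is a fixed point of the iteration
    for _ in range(n_iter):
        theta, sigma2 = m_step(y, e_step(y, theta, sigma2))
    return theta, sigma2

# illustration on simulated data
rng = np.random.default_rng(0)
z = rng.choice([-1.0, 1.0], size=1000)
y = z * 2.0 + rng.normal(scale=1.5, size=1000)
print(em(y))   # roughly (2.0, 2.25), up to the sign of theta
```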
7.2 Defining the EM algorithm

(i) Data random variables $Y$, with probability measure $Q_\theta$, $\theta \in \Theta \subseteq \mathbb{R}^k$. Suppose $Y$ has density $f_\theta$ w.r.t. some $\sigma$-finite measure $\mu$ (usually Lebesgue measure (pdf) or counting measure (pmf)).

(ii) $\ell(\theta) = \log L(\theta) = \log f_\theta(y)$, for observed data $Y = y$. Suppose maximization of $\ell(\theta)$ is messy/impossible; $k > 1$ but not huge.

(iii) Suppose we augment the random variables to complete data $X$: $Y = Y(X)$. Suppose $X$ has density $g_\theta(x)$. Then
$$L(\theta) = f_\theta(y) = \int_{x:\, y(x) = y} g_\theta(x)\, d\mu(x),$$
and $\ell_c(\theta) = \log g_\theta(x)$ is known as the complete-data log-likelihood.

(iv) Let $Q(\theta;\theta') = E_{\theta'}(\ell_c(\theta)\mid Y(X) = y)$. Then
$$g_\theta(X) = h_\theta(X \mid Y = y)\, f_\theta(y),$$
$$\ell_c(\theta; X) = \log h_\theta(X\mid Y=y) + \ell(\theta; y),$$
$$Q(\theta;\theta') = H(\theta;\theta') + \ell(\theta),$$
where $H(\theta;\theta') = E_{\theta'}(\log h_\theta(X\mid Y=y)\mid Y(X) = y)$.

(v) The EM algorithm is:
E-step: at the current estimate $\theta'$, compute $Q(\theta;\theta')$.
M-step: maximize $Q(\theta;\theta')$ w.r.t. $\theta$ to obtain a new estimate $\tilde\theta$.
Set $\theta' = \tilde\theta$ and repeat ad nauseam.
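A generic sketch of the iteration in (v); the function names (em_algorithm, e_step, m_step, loglik) are illustrative, not from the notes. The caller supplies an E-step that builds $Q(\cdot\,;\theta')$ and an M-step that maximizes it; tracking $\ell(\theta)$ at each pass anticipates the monotonicity property proved in 7.3.

```python
def em_algorithm(theta, e_step, m_step, loglik, tol=1e-8, max_iter=1000):
    """Generic EM loop.

    e_step(theta') returns Q(.; theta'), the expected complete-data log-likelihood;
    m_step(Q) returns argmax_theta Q(theta); loglik(theta) is the observed-data l(theta).
    """
    history = [loglik(theta)]
    for _ in range(max_iter):
        Q = e_step(theta)              # E-step at the current estimate theta'
        theta = m_step(Q)              # M-step: maximize Q(.; theta')
        history.append(loglik(theta))  # l(theta) is nondecreasing (see 7.3)
        if history[-1] - history[-2] < tol:
            break
    return theta, history
```

In the mixture sketch after 7.1 the E-step collapses to computing the $\delta_i$, since $Q$ depends on the latent data only through imputed sufficient statistics.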
7.3 Why does EM work?

(i) Recall (see Kullback-Leibler information) that for any densities $p$ and $q$ of a r.v. $Z$, $E_q(\log p(Z)) = \int \log(p(z))\, q(z)\, d\mu(z)$ is maximized w.r.t. $p$ by $p = q$. ($K(q;p) = E_q\log(q/p) \ge 0$.)

(ii) Hence $H(\theta;\theta') \le H(\theta';\theta')$ for all $\theta, \theta'$.

(iii) Now with new/old estimates $\tilde\theta, \theta'$:
$$\ell(\tilde\theta) - \ell(\theta') = Q(\tilde\theta;\theta') - Q(\theta';\theta') - \big(H(\tilde\theta;\theta') - H(\theta';\theta')\big) \quad\text{by 7.2(iv)}$$
$$\ge H(\theta';\theta') - H(\tilde\theta;\theta') \quad\text{by the M-step}$$
$$\ge 0 \quad\text{by (ii).}$$
Also $\ell(\tilde\theta) - \ell(\theta') > 0$, unless $h_{\tilde\theta}(X\mid Y) = h_{\theta'}(X\mid Y)$.

(iv) Thus each step of EM cannot decrease $\ell(\theta)$ and usually increases $\ell(\theta)$.

(v) If the MLE $\hat\theta$ is the unique stationary point of $\ell(\theta)$ in the interior of the parameter space, then $\theta' \to \hat\theta$.

(vi) In practice, EM is very robust, but can be very slow, especially in the final stages: convergence is first-order.

(vii) Caution: we do NOT use expectations to impute the missing data. We compute the expected complete-data log-likelihood. This normally involves using conditional expectations to impute the complete-data sufficient statistics. This is NOT the same thing (see hwk). And it could be more complicated than this, although not if we have chosen sensible complete data.
7.4 A multinomial example

(i) Bernstein (1928) used population data to validate the hypothesis that human ABO blood types are determined by 3 alleles, A, B and O, at a single genetic locus, rather than being 2 independent factors A/not-A, B/not-B.

(ii) Suppose that the population frequencies of A, B and O are $p$, $q$ and $r$ ($p + q + r = 1$); we want to estimate $(p, q, r)$.

(iii) We assume that the types of the two alleles carried by an individual are independent (Hardy-Weinberg equilibrium: 1908), and that individuals are independent ("unrelated").

(iv) ABO blood types are determined as follows:

blood type   genotype   freq.      blood type   genotype   freq.
A            AA         p^2        A            AO         2pr
B            BB         q^2        B            BO         2qr
AB           AB         2pq        O            OO         r^2

(v) $Y \sim M_4\big(n, (p^2 + 2pr,\; q^2 + 2qr,\; 2pq,\; r^2)\big)$; $\quad X \sim M_6\big(n, (p^2,\; 2pr,\; q^2,\; 2qr,\; 2pq,\; r^2)\big)$.

(vi) $\ell(p,q,r)$ is easy to evaluate but hard to maximize:
$$\ell(p,q,r) = \text{const} + y_A\log(p^2+2pr) + y_B\log(q^2+2qr) + y_{AB}\log(2pq) + y_O\log(r^2).$$

(vii) $\ell_c(p,q,r)$ is easy to maximize:
$$\ell_c(p,q,r) = \text{const} + x_{AA}\log(p^2) + x_{AO}\log(2pr) + x_{BB}\log(q^2) + x_{BO}\log(2qr) + x_{AB}\log(2pq) + x_{OO}\log(r^2)$$
$$= \text{const} + (2x_{AA} + x_{AO} + x_{AB})\log p + (2x_{BB} + x_{BO} + x_{AB})\log q + (2x_{OO} + x_{AO} + x_{BO})\log r.$$

(viii) E-step:
$$x_{AA} = E_{p,q,r}(X_{AA}\mid Y=y) = \frac{p^2}{p^2+2pr}\, y_A = \frac{p}{p+2r}\, y_A, \quad\text{etc.}$$

(ix) M-step:
$$\hat p = (2n)^{-1}(2x_{AA} + x_{AO} + y_{AB}), \quad \hat q = (2n)^{-1}(2x_{BB} + x_{BO} + y_{AB}), \quad \hat r = 1 - \hat p - \hat q.$$
(A code sketch of this iteration follows below.)

(x) This method was known to geneticists in the 1950s as gene counting. (The EM algorithm in its general form dates to 1977: Dempster, Laird & Rubin.)
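A minimal sketch of the gene-counting iteration (viii)-(ix). The function name, starting values and the blood-type counts below are illustrative, not from the notes.

```python
import numpy as np

def abo_em(yA, yB, yAB, yO, n_iter=100):
    """Gene-counting EM for ABO allele frequencies (p, q, r)."""
    n = yA + yB + yAB + yO
    p, q, r = 1/3, 1/3, 1/3                      # arbitrary starting values
    for _ in range(n_iter):
        # E-step: split the observed A and B counts into homozygote/heterozygote parts
        xAA = yA * p**2 / (p**2 + 2*p*r)         # = yA * p / (p + 2r)
        xAO = yA - xAA
        xBB = yB * q**2 / (q**2 + 2*q*r)
        xBO = yB - xBB
        # M-step: allele counting over the 2n alleles in the sample
        p = (2*xAA + xAO + yAB) / (2*n)
        q = (2*xBB + xBO + yAB) / (2*n)
        r = 1 - p - q
    return p, q, r

# illustrative sample of blood-type counts
print(abo_em(yA=186, yB=38, yAB=13, yO=284))
```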
7.5 A mixture example (see prev hwk)

(i) $Y_i$ i.i.d., with pdf $f(y;\theta,\psi) = \theta f_1(y;\psi) + (1-\theta) f_2(y;\psi)$.

(ii) $\ell(\theta,\psi) = \sum_i \log\big(\theta f_1(y_i;\psi) + (1-\theta) f_2(y_i;\psi)\big)$.

(iii) Let $Z_i = I(Y_i \sim f_1)$, with $P(Z_i = 1) = \theta$. Then
$$\ell_c(\theta,\psi) = \sum_i \big(Z_i\log\theta + (1-Z_i)\log(1-\theta) + Z_i\log f_1(y_i;\psi) + (1-Z_i)\log f_2(y_i;\psi)\big).$$

(iv) $\ell_c$ is linear in the $Z_i$, so the E-step requires only
$$\delta_i = E(Z_i\mid Y) = \frac{\theta f_1(y_i;\psi)}{\theta f_1(y_i;\psi) + (1-\theta) f_2(y_i;\psi)}.$$

(v) M-step: $\hat\theta = \sum_i \delta_i/n$, and $\hat\psi$ maximizes $\sum_i \big(\delta_i\log f_1(y_i;\psi) + (1-\delta_i)\log f_2(y_i;\psi)\big)$.

(vi) Example: $f_j(y_i;\psi) = \psi_j^{-1}\exp(-y_i/\psi_j)$. Then $\sum_i \delta_i\log f_1(y_i;\psi) = -\log\psi_1\sum_i\delta_i - \sum_i \delta_i y_i/\psi_1$, so
$$\hat\psi_1 = \sum_i\delta_i y_i\Big/\sum_i\delta_i, \qquad \hat\psi_2 = \sum_i(1-\delta_i)y_i\Big/\sum_i(1-\delta_i).$$
(A numerical sketch follows after this list.)

(vii) Be careful about identifiability: exchanging the probabilities and labels on the components gives the same mixture; e.g. fix $\psi_1 < \psi_2$.
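A minimal sketch of (iv)-(vi) for the two-component exponential mixture, using the label-fixing convention of (vii). The function name, starting values and simulated data are illustrative.

```python
import numpy as np

def exp_mixture_em(y, n_iter=500):
    """EM for theta*Exp(mean psi1) + (1-theta)*Exp(mean psi2), with psi1 < psi2 for identifiability."""
    theta, psi1, psi2 = 0.5, np.mean(y) / 2, np.mean(y) * 2   # crude starting values
    for _ in range(n_iter):
        # E-step: delta_i = E(Z_i | Y_i)
        f1 = np.exp(-y / psi1) / psi1
        f2 = np.exp(-y / psi2) / psi2
        delta = theta * f1 / (theta * f1 + (1 - theta) * f2)
        # M-step
        theta = np.mean(delta)
        psi1 = np.sum(delta * y) / np.sum(delta)
        psi2 = np.sum((1 - delta) * y) / np.sum(1 - delta)
        if psi1 > psi2:                       # enforce the labelling psi1 < psi2
            theta, psi1, psi2 = 1 - theta, psi2, psi1
    return theta, psi1, psi2

# illustration on simulated data
rng = np.random.default_rng(1)
z = rng.random(2000) < 0.3
y = np.where(z, rng.exponential(1.0, 2000), rng.exponential(5.0, 2000))
print(exp_mixture_em(y))   # roughly (0.3, 1.0, 5.0)
```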
7.6 Other types of example

(i) Actual missing data. Caution: we do NOT use expectations to impute the missing data.

(ii) Variance component models (see hwk 9):
(a) $Y = AZ + e$, $e \sim N_n(0, \tau^2 I)$, $Z \sim N_r(0, \sigma^2 G)$, $A$ an $n \times r$ matrix. Then $Y \sim N_n(0, \sigma^2 AGA' + \tau^2 I)$.
(b) $X = (Y, Z)$:
$$\ell_c(\sigma^2,\tau^2) = -(n/2)\log(\tau^2) - (r/2)\log(\sigma^2) - (2\tau^2)^{-1}(y - Az)'(y - Az) - (2\sigma^2)^{-1} z'G^{-1}z.$$
(c) $\hat\sigma^2 = r^{-1}E(Z'G^{-1}Z\mid y)$, $\hat\tau^2 = n^{-1}E\big((Y - AZ)'(Y - AZ)\mid Y\big)$, but note $E(Z'G^{-1}Z\mid y) \ne E(Z\mid y)'G^{-1}E(Z\mid y)$.
(d) Note $X \sim N_{n+r}(0, V)$, so we have the usual formulae $E(Z\mid Y) = V_{zy}V_{yy}^{-1}Y$, $\mathrm{var}(Z\mid Y) = V_{zz} - V_{zy}V_{yy}^{-1}V_{yz}$.
(e) Also, if $E(W) = 0$, then $E(W'BW) = E\big(\sum_{i,j} W_iW_jB_{ij}\big) = \sum_{i,j}\mathrm{var}(W)_{ij}B_{ij} = \mathrm{tr}(\mathrm{var}(W)B)$.
(A numerical sketch of one EM iteration is given after this list.)

(iii) Censored data, age-of-onset data, competing-risks models, etc.

(iv) Hidden states, latent variables: models in genetics, biology, climate modelling, environmental modelling.

(v) General auxiliary variables: the latent variables do not have to mean anything; they are simply a tool, chosen so that the complete-data log-likelihood is easy.
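To make (ii)(c)-(e) concrete, here is a minimal numpy sketch of one EM iteration for the variance-component model. The function name and the way the pieces are assembled are illustrative, not from hwk 9; the point is that the conditional expectations of the quadratic forms need $\mathrm{var}(Z\mid Y)$ as well as $E(Z\mid Y)$.

```python
import numpy as np

def vc_em_step(y, A, G, sigma2, tau2):
    """One EM iteration for Y = AZ + e, Z ~ N_r(0, sigma2*G), e ~ N_n(0, tau2*I)."""
    n, r = A.shape
    # covariance blocks of the joint normal X = (Y, Z)
    Vzz = sigma2 * G
    Vzy = sigma2 * G @ A.T
    Vyy = A @ Vzz @ A.T + tau2 * np.eye(n)
    # E-step: conditional mean and variance of Z given Y = y
    Vyy_inv = np.linalg.inv(Vyy)
    m = Vzy @ Vyy_inv @ y                      # E(Z | y)
    C = Vzz - Vzy @ Vyy_inv @ Vzy.T            # var(Z | y)
    # imputed quadratic forms: E(W'BW | y) = E(W|y)' B E(W|y) + tr(var(W|y) B)
    Ginv = np.linalg.inv(G)
    E_zGz = m @ Ginv @ m + np.trace(Ginv @ C)          # E(Z' G^{-1} Z | y)
    resid = y - A @ m
    E_rss = resid @ resid + np.trace(A.T @ A @ C)      # E((Y - AZ)'(Y - AZ) | y)
    # M-step
    return E_zGz / r, E_rss / n                # new (sigma2, tau2)
```

Iterating vc_em_step to convergence gives a stationary point of the likelihood of $Y$, normally the MLE of $(\sigma^2, \tau^2)$.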
7.7 Likelihood in exponential families

(i) See Chapter 2 for the definitions, the natural sufficient statistics $T_j$, the natural parameter space and the natural parametrization $\pi_j$.

(ii) Moment formulae (see 2.2(vii)):
$$E(t_j(X)) = -\frac{\partial \log c(\pi)}{\partial \pi_j}, \qquad \mathrm{Cov}(t_j(X), t_l(X)) = -\frac{\partial^2 \log c(\pi)}{\partial \pi_j\,\partial \pi_l}.$$
(A numerical check is given after this list.)

(iii) Likelihood equation for an exponential family (see 4.6):
$$\frac{\partial \ell}{\partial \pi_j} = n\Big(n^{-1}\sum_i t_j(X_i) - E(t_j(X))\Big).$$
At the MLE the natural sufficient statistics $T_j$ are set equal to their expectations $n\tau_j$, where $\tau_j = E(t_j(X))$.

(iv) The information results (see 4.6):
$$I(\pi) = J(\pi) = \mathrm{var}(T_1,\ldots,T_k), \qquad I(\tau) = \big(\mathrm{var}(T_1,\ldots,T_k)\big)^{-1},$$
where $\tau_j = E(t_j(X))$; $(T_1,\ldots,T_k)$ achieves the (multiparameter) CRLB for $(\tau_1,\ldots,\tau_k)$.

(v) Suppose the complete data $X$ have exponential-family form: for an $n$-sample, $T_j(X) = \sum_{i=1}^n t_j(X_i)$ and
$$\log g_\theta(X) = \log c(\theta) + \sum_{j=1}^k \pi_j(\theta)\, T_j(X) + \log w(X),$$
$$Q(\theta;\theta') = \log c(\theta) + \sum_{j=1}^k \pi_j(\theta)\, E_{\theta'}(T_j(X)\mid Y) + E_{\theta'}(\log w(X)\mid Y).$$
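A quick numeric sanity check of the moment formulae in (ii), using the Poisson family as the example (my choice, not from the notes): in the natural parametrization $\pi = \log\lambda$ with $t(x) = x$ and $w(x) = 1/x!$, we have $-\log c(\pi) = e^\pi$, so both derivatives should equal $\lambda$.

```python
import numpy as np

def neg_log_c(pi):
    """-log c(pi) for the Poisson family in natural form: c(pi) = exp(-e^pi)."""
    return np.exp(pi)

lam = 2.5
pi = np.log(lam)
h = 1e-5
# E(t(X)) = -d log c / d pi, via a central finite difference
mean_t = (neg_log_c(pi + h) - neg_log_c(pi - h)) / (2 * h)
# var(t(X)) = -d^2 log c / d pi^2
var_t = (neg_log_c(pi + h) - 2 * neg_log_c(pi) + neg_log_c(pi - h)) / h**2
print(mean_t, var_t)   # both approximately lambda = 2.5
```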
7.8 EM for exponential families

(i) In the natural parametrization $\pi_j$:
$$Q(\pi;\pi') = \log c(\pi) + \sum_{j=1}^k \pi_j\, E_{\pi'}(T_j(X)\mid Y) + E_{\pi'}(\log w(X)\mid Y),$$
$$\frac{\partial Q}{\partial \pi_j} = E_{\pi'}(T_j(X)\mid Y) + \frac{\partial \log c(\pi)}{\partial \pi_j} = E_{\pi'}(T_j(X)\mid Y) - E_\pi(T_j(X)).$$
Thus EM iteratively fits unconditioned to conditioned expectations of the $T_j$. At the MLE, $E_{\hat\pi}(T_j(X)\mid Y) = E_{\hat\pi}(T_j(X))$.

(ii) Recall $\ell(\pi) = \log g_\pi(X) - \log h_\pi(X\mid Y)$, but
$$h_\pi(X\mid Y) = \frac{w(X)\exp\big(\sum_j \pi_j T_j(X)\big)}{\int_{y(x)=y} w(x)\exp\big(\sum_j \pi_j T_j(x)\big)\, d\mu(x)} = c^*(\pi;Y)\, w(X)\exp\Big(\sum_j \pi_j T_j(X)\Big),$$
so $\ell(\pi) = \log c(\pi) - \log c^*(\pi;Y)$.

(iii) Hence, differentiating this:
$$\frac{\partial \ell}{\partial \pi_j} = -E_\pi(T_j) + E_\pi(T_j\mid Y).$$
At the MLE: $E_{\hat\pi}(T_j) = E_{\hat\pi}(T_j\mid Y)$.

(iv) Differentiating again:
$$-\frac{\partial^2 \ell}{\partial \pi_j\,\partial \pi_l} = \mathrm{Cov}(T_j, T_l) - \mathrm{Cov}(T_j, T_l\mid Y).$$
If $Y$ determines $X$, $\mathrm{var}(T(X)\mid Y) = 0$, and then the observed information is $\mathrm{var}(T)$, as for any exponential family. If $Y$ tells us nothing about $X$, $\mathrm{var}(T(X)\mid Y) = \mathrm{var}(T(X))$, and the observed information is 0. The information lost due to observing $Y$ rather than $X$ is $\mathrm{var}(T(X)\mid Y)$.
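A small self-contained illustration of (i) and (iii), with an example of my own (not from the notes): take complete data $X_i$ i.i.d. Poisson($\lambda$), a natural exponential family with $T = \sum_i X_i$, but suppose we only observe $Y_i = I(X_i = 0)$. The E-step imputes $E(T\mid Y)$ using $E(X_i\mid X_i>0) = \lambda/(1 - e^{-\lambda})$; the M-step matches $E_\lambda(T) = n\lambda$ to the imputed value. The EM fixed point satisfies $E_{\hat\lambda}(T\mid Y) = E_{\hat\lambda}(T)$, and here the MLE has the closed form $\hat\lambda = -\log\bar y$ for comparison.

```python
import numpy as np

rng = np.random.default_rng(3)
lam_true = 1.7
x = rng.poisson(lam_true, size=5000)     # complete data (never seen)
y = (x == 0).astype(float)               # observed data: indicator of a zero count
n, m = len(y), int(np.sum(y == 0))       # m = number of observations with X_i > 0

lam = 1.0
for _ in range(200):
    # E-step: E(T | Y) = sum_i E(X_i | Y_i); zero if Y_i = 1, lambda/(1 - e^(-lambda)) otherwise
    imputed_T = m * lam / (1 - np.exp(-lam))
    # M-step: match E_lambda(T) = n*lambda to the imputed sufficient statistic
    lam = imputed_T / n

print(lam, -np.log(np.mean(y)))          # EM fixed point agrees with the closed-form MLE
```

The slow, first-order approach to the fixed point here also illustrates the convergence remark in 7.3(vi).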