The EM Algorithm (Dempster, Laird, Rubin 1977)



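The missing data or incomplete data setting: an Observed Data Likelihood (ODL) that is a mixture or integral of Complete Data Likelihoods (CDL).

$$\mathrm{ODL}(\phi; Y) = [Y \mid \phi] = \int_{\mathcal{X}} [Y \mid X, \phi]\,[X \mid \phi]\,dX = \int_{\mathcal{X}} \mathrm{CDL}(\phi; Y, X)\,dX \tag{1a}$$

The goal: maximize the ODL (or the Observed Data Posterior, ODL × π). This is STEP E of the Empirical Bayes method.

[Diagram: the hyperparameter space $\Phi$, the complete-data sample space $\mathcal{X}$, and the observed-data sample space $\mathcal{Y}$; the set $\{X : g(X) = y\}$ in $\mathcal{X}$ is mapped to the single point $y$ in $\mathcal{Y}$.]

FORMULATION A: A set of datasets. Often $X$ represents the complete data, and there is a function $g$ which hides something about $X$, and $Y = g(X)$ is what is visible. So $g^{-1}(Y)$ is a subset of $\mathcal{X}$: the set of $X$'s consistent with $Y$.

$$\mathrm{ODL}(\phi; Y) = \int_{\mathcal{X}} I\{Y = g(X)\}\,[X \mid \phi]\,dX = \int_{g^{-1}(Y)} [X \mid \phi]\,dX \tag{1b}$$

Examples: $Y$ represents censored data, grouped data, rounded data.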
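As a small illustration of (1b), the sketch below evaluates the observed-data log-likelihood for rounded data. It assumes, purely for illustration, that the complete data are i.i.d. normal and that $g$ rounds to the nearest integer, so that $g^{-1}(y) = [y - 0.5, y + 0.5)$; the function names and the data values are hypothetical and not part of the original notes.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def interval_mass(y, mu, sigma, half_width=0.5):
    """Integral of [X | phi] over g^{-1}(y) = [y - half_width, y + half_width)."""
    return normal_cdf(y + half_width, mu, sigma) - normal_cdf(y - half_width, mu, sigma)

def rounded_log_odl(mu, sigma, ys):
    """log ODL(phi; Y) for independent rounded observations, phi = (mu, sigma)."""
    return sum(math.log(interval_mass(y, mu, sigma)) for y in ys)

# Hypothetical rounded observations (assumed values, for illustration only).
ys = [3, 4, 4, 5, 6]
print(rounded_log_odl(4.4, 1.0, ys))
```

Each factor of the ODL is the probability of the whole interval that rounds to the observed value, not a density evaluated at that value.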

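FORMULATION B: Data are missing. Sometimes we can write $X = (Y, Z)$, where $Z$ represents the missing data. Then

$$\begin{aligned}
\mathrm{ODL}(\phi; Y) &= \int_{\mathcal{X}} [Y \mid X, \phi]\,[X \mid \phi]\,dX = \int_{\mathcal{Y}}\int_{\mathcal{Z}} [Y \mid Y', Z, \phi]\,[Y', Z \mid \phi]\,dY'\,dZ \\
&= \int_{\mathcal{Z}} [Y, Z \mid \phi]\,dZ = \int_{\mathcal{Z}} [Y \mid Z, \phi]\,[Z \mid \phi]\,dZ,
\end{aligned} \tag{1c}$$

whose final expression looks similar to (1a) but is interpreted differently.

Examples: $Z$ represents random effects, latent or hidden variables, future data, nuisance parameters, missing covariates, missing targets, variances in mixtures, mixture proportions.

Detailed example of Formulation A: Censored data

$X$ = Complete data = $\{(X_i^{\text{event}}, X_i^{\text{cens}}) : i = 1, \dots, n\}$

$Y$ = Observed data = $\{(Y_i, c_i) : i = 1, \dots, n\}$

where $c_i = I\{X_i^{\text{event}} > X_i^{\text{cens}}\}$ and $Y_i = (X_i^{\text{event}})^{1 - c_i}\,(X_i^{\text{cens}})^{c_i}$, that is, the event time if it was observed and the censoring time otherwise.

In our language, $g$ maps the failure time $X_i^{\text{event}}$ to the point $Y_i$ if $c_i = 0$ and to the interval $(Y_i, \infty)$ if $c_i = 1$, so that in the censored case $g^{-1}(Y_i, c_i = 1) = (Y_i, \infty)$. For a censored observation, $g$ maps the true failure time for observation $i$ to the half-infinite interval. Every $X_i^{\text{event}}$ within the interval maps to the interval.

Another detailed example of Formulation A: Grouped data

Animals that are crosses of AB/ab x AB/ab are classified in 4 categories:

| $Y_j$ | Genotype | Pr(category \| θ) | Pr(category \| π) |
|---|---|---|---|
| 125 | AB | $(3 - 2\theta + \theta^2)/4$ | $1/2 + \pi/4$ |
| 18 | Ab | $(2\theta - \theta^2)/4$ | $(1 - \pi)/4$ |
| 20 | aB | $(2\theta - \theta^2)/4$ | $(1 - \pi)/4$ |
| 34 | ab | $(1 - 2\theta + \theta^2)/4$ | $\pi/4$ |

The two parameterizations agree when $\pi = (1 - \theta)^2$. Category AB is really 2 categories that cannot be distinguished.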

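Here $\theta$ = recombination frequency. With $n = Y_1 + Y_2 + Y_3 + Y_4$,

$$\begin{aligned}
\mathrm{ODL}[Y \mid \pi] &= \frac{n!}{Y_1!\,Y_2!\,Y_3!\,Y_4!} \left(\tfrac{1}{2} + \tfrac{\pi}{4}\right)^{Y_1} \left(\tfrac{1 - \pi}{4}\right)^{Y_2 + Y_3} \left(\tfrac{\pi}{4}\right)^{Y_4} \\
&= \sum_{X_1 = 0}^{Y_1} \frac{n!}{X_1!\,(Y_1 - X_1)!\,Y_2!\,Y_3!\,Y_4!} \left(\tfrac{1}{2}\right)^{X_1} \left(\tfrac{\pi}{4}\right)^{Y_1 - X_1} \left(\tfrac{1 - \pi}{4}\right)^{Y_2 + Y_3} \left(\tfrac{\pi}{4}\right)^{Y_4} \\
&= \sum_{X_1 = 0}^{Y_1} \mathrm{CDL}(\pi; X_1, Y_1, Y_2, Y_3, Y_4).
\end{aligned}$$

Detailed example of Formulation B: Discrete contaminated mixture distributions

$$Y_i \mid \sigma_i^2 \sim N(\beta, \sigma_i^2), \qquad [\sigma_i^2 = a_s] = p_s, \qquad \sum_s p_s = 1 \quad \text{(i.i.d.)}$$

Suppose that { }, $\{a_s\}$, $\{p_s\}$ are known. The complete data are $X = (Y, Z)$ with $Y = \{Y_1, \dots, Y_n\}$ and $Z = \{\sigma_i^2\}$. The likelihood is

$$\mathrm{ODL} = [Y \mid \beta] = \sum_{\sigma_1^2} \cdots \sum_{\sigma_n^2} \prod_{i=1}^{n} [\sigma_i^2]\,[Y_i \mid \beta, \sigma_i^2] = \prod_{i=1}^{n} \sum_s p_s\,[Y_i \mid \beta, \sigma_i^2 = a_s].$$

The forgetting function $g$ forgets the component $Z$. In this case, marginally, we can also write

$$\mathrm{ODL} = \prod_{i=1}^{n} [Y_i \mid \beta, \{\sigma_s^2\}] = [Y \mid \beta, \{\sigma_s^2\}].$$

(Here, in brackets we have the set of possible error variances, not the ones assigned to particular observations. Hence the subscript $s$.) The error distribution is a mixture distribution.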
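The identity between the sum over all variance assignments and the product of per-observation mixtures can be checked numerically. The sketch below is only an illustration: the data, the candidate variances and their probabilities are made-up values, and the function names are hypothetical.

```python
import itertools
import math

def normal_pdf(y, mean, var):
    """Density of N(mean, var) at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_odl_product(beta, ys, variances, probs):
    """log ODL as a product over observations of finite mixtures."""
    return sum(
        math.log(sum(p * normal_pdf(y, beta, a) for a, p in zip(variances, probs)))
        for y in ys
    )

def odl_full_sum(beta, ys, variances, probs):
    """ODL as a sum of complete-data likelihoods over every assignment of variances."""
    total = 0.0
    for assignment in itertools.product(range(len(variances)), repeat=len(ys)):
        term = 1.0
        for y, s in zip(ys, assignment):
            term *= probs[s] * normal_pdf(y, beta, variances[s])
        total += term
    return total

# Assumed illustrative values: a 10% contamination with inflated variance.
ys = [0.3, -1.2, 2.5]
variances = [1.0, 9.0]
probs = [0.9, 0.1]
print(math.exp(log_odl_product(0.2, ys, variances, probs)))
print(odl_full_sum(0.2, ys, variances, probs))
```

The full sum has (number of variances)^n terms, which is why the product form is the one used in practice.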

Another detailed example of Formulation B: Continuous contaminated mixture distributions

$$Y_i \mid \sigma_i^2 \sim N(\beta, \sigma_i^2), \qquad \sigma_i^2 \sim \text{InverseGamma}(\nu/2, \sigma_0^2) \quad \text{(i.i.d.)}$$

Here, the ODL can also be written as above. But also,

FACT: The t-distribution is obtained as a scale mixture of normals, by mixing the normal over a gamma (or chi-square) distribution. The pivotal parameter is the precision (inverse of the variance).

Therefore you COULD bypass the EM algorithm and write the likelihood function directly, then use Newton-Raphson or other maximization search algorithms. This will usually be faster, but often it will not work. The EM, however, will always work in the sense that the likelihood never decreases from one iteration to the next, so the iterates climb monotonically toward a local maximum (more precisely, a stationary point) of the ODL.
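The FACT above can be checked numerically. The sketch below assumes one standard parameterization that is not spelled out in the notes: the precision $\tau = 1/\sigma^2$ follows a Gamma distribution with shape $\nu/2$ and rate $\nu/2$, and $\beta = 0$, which should give the standard Student t density with $\nu$ degrees of freedom. The integration is a plain midpoint rule; all function names are hypothetical.

```python
import math

def normal_pdf(y, mean, var):
    """Density of N(mean, var) at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gamma_pdf(tau, shape, rate):
    """Density of Gamma(shape, rate) at tau > 0."""
    log_pdf = (shape * math.log(rate) - math.lgamma(shape)
               + (shape - 1.0) * math.log(tau) - rate * tau)
    return math.exp(log_pdf)

def mixture_density(y, nu, upper=60.0, steps=100_000):
    """Integrate N(y; 0, 1/tau) against Gamma(nu/2, rate nu/2) in tau (midpoint rule)."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        tau = (k + 0.5) * h
        total += normal_pdf(y, 0.0, 1.0 / tau) * gamma_pdf(tau, nu / 2.0, nu / 2.0)
    return total * h

def student_t_pdf(y, nu):
    """Closed-form standard Student t density with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - 0.5 * (nu + 1.0) * math.log1p(y * y / nu))

for y in (0.0, 1.5):
    print(y, mixture_density(y, 4.0), student_t_pdf(y, 4.0))
```

Because the marginal density is available in closed form, direct maximization is possible here, which is the shortcut the paragraph above describes.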

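The EM algorithm (for ODL)

Purpose: to maximize the ODL (to get $\hat{\phi}_{ML}$).

Define

$$\pi(X; Y, \phi) = [X \mid Y, \phi] = \frac{[Y, X \mid \phi]}{[Y \mid \phi]} = \frac{[X \mid \phi]}{[Y \mid \phi]} = \frac{\mathrm{CDL}}{\mathrm{ODL}}.$$

This is a kind of posterior distribution. Rearranging,

$$[Y \mid \phi] = \frac{[X \mid \phi]}{[X \mid Y, \phi]}, \quad \text{i.e.} \quad \mathrm{ODL}(\phi; Y) = \mathrm{CDL}(\phi; X)\,/\,\pi(X; Y, \phi),$$

or, taking logs,

$$\log \mathrm{ODL}(\phi; Y) = \log \mathrm{CDL}(\phi; X) - \log \pi(X; Y, \phi). \tag{2}$$

Oddly, the LHS does not have $X$ but the RHS does. Eq (2) holds for every $X$ consistent with $Y$, so also

$$\log \mathrm{ODL}(\phi; Y) = \int \big(\log \mathrm{CDL}(\phi; X) - \log \pi(X; Y, \phi)\big)\,dG(X) \tag{3}$$

for any distribution $G$ on $\mathcal{X}$, since the stuff in parentheses is constant in $X$. We'll choose $dG(X)/dX = \pi(X; Y, \phi^{(p)}) = [X \mid Y, \phi^{(p)}]$ for some current guess $\phi^{(p)}$ at the hyperparameter. Then

$$\log \mathrm{ODL}(\phi; Y) = Q(\phi, \phi^{(p)}) - H(\phi, \phi^{(p)}), \tag{4}$$

where $Q(\phi, \phi^{(p)}) = E \log \mathrm{CDL}(\phi; X)$ and $H(\phi, \phi^{(p)}) = E \log [X \mid Y, \phi]$, both expectations being over $[X \mid Y, \phi^{(p)}]$.

$H$ is a kind of entropy, maximized at $\phi = \phi^{(p)}$ (Jensen's inequality). So if you increase $Q$, log ODL will increase for two reasons: (1) $Q$ gets bigger, and (2) $H$ gets smaller.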
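The Jensen step can be written out explicitly. In the sketch below, $\phi^{(p)}$ denotes the current guess and all expectations are over $[X \mid Y, \phi^{(p)}]$; this notation is assumed here for the worked equation.

$$\begin{aligned}
H(\phi^{(p)}, \phi^{(p)}) - H(\phi, \phi^{(p)})
&= -E \log \frac{[X \mid Y, \phi]}{[X \mid Y, \phi^{(p)}]} \\
&\ge -\log E\,\frac{[X \mid Y, \phi]}{[X \mid Y, \phi^{(p)}]}
= -\log \int [X \mid Y, \phi]\,dX = -\log 1 = 0 .
\end{aligned}$$

Combining this with the decomposition of log ODL into $Q - H$ gives

$$\log \mathrm{ODL}(\phi; Y) - \log \mathrm{ODL}(\phi^{(p)}; Y)
= \big[Q(\phi, \phi^{(p)}) - Q(\phi^{(p)}, \phi^{(p)})\big] + \big[H(\phi^{(p)}, \phi^{(p)}) - H(\phi, \phi^{(p)})\big]
\ge Q(\phi, \phi^{(p)}) - Q(\phi^{(p)}, \phi^{(p)}),$$

so any $\phi$ that does not decrease $Q$ cannot decrease the ODL.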

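EM algorithm: Iteratively maximize $Q$ as a function of $\phi$, replacing $\phi^{(p)}$ by the new $\phi$ at each step.

Generalized EM: Iteratively increase $Q$ at each step.

Details of the exponential family case:

Suppose that the CDL is a member of the exponential family (normal, beta, Poisson, binomial, gamma, inverse gamma, negative binomial, ...).

E step = Calculate expected sufficient statistics $E(T \mid \text{data}, \phi^{(p)})$. Then $Q$ looks like a CDL with these sufficient statistics.

M step = Maximize this CDL.

Suppose the CDL is a complete exponential family (CEF):

$$f(x \mid \phi) = \mathrm{CDL}(\phi; x) = \frac{b(x)\,\exp\!\left(\sum_{j=1}^{k} \phi_j t_j(x)\right)}{a(\phi)},$$

where $\phi$ is the natural parameter, $t$ is the canonical sufficient statistic, and complete means (very roughly) that there are no constraints on $\phi$.

Example: Normal distribution

$$\mathrm{CDL} = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)
= \frac{(2\pi)^{-n/2}\,\exp\!\left(-\frac{1}{2\sigma^2}\sum_i x_i^2 + \frac{\mu}{\sigma^2}\sum_i x_i\right)}{\sigma^{n}\,\exp\!\left(n\mu^2 / (2\sigma^2)\right)},$$

where the denominator is $a(\phi)$ = normalizing constant, and

$$t_1 = -\tfrac{1}{2}\sum_i x_i^2, \quad t_2 = \sum_i x_i, \quad \phi_1 = \frac{1}{\sigma^2}, \quad \phi_2 = \frac{\mu}{\sigma^2}, \quad b(x) = (2\pi)^{-n/2}, \quad a(\phi) = \phi_1^{-n/2} \exp\!\left(\frac{n\phi_2^2}{2\phi_1}\right).$$

(If your model forced a constraint reducing the dimension of the parameter space, for example $\sigma = \mu$, then it would not be complete.)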
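The E and M steps described above can be sketched for the grouped-data (genetic linkage) example, using the counts (125, 18, 20, 34) from the linkage example in Dempster, Laird and Rubin (1977). The first cell, with probability $1/2 + \pi/4$, is split into an unobserved part with probability $1/2$ and a part with probability $\pi/4$; the complete data are then multinomial, a member of the exponential family. The function names, the starting value and the iteration count are hypothetical choices.

```python
import math

def log_odl(pi, counts):
    """Observed-data log-likelihood in pi, up to an additive constant."""
    y1, y2, y3, y4 = counts
    return (y1 * math.log(0.5 + pi / 4.0)
            + (y2 + y3) * math.log((1.0 - pi) / 4.0)
            + y4 * math.log(pi / 4.0))

def em_linkage(counts, pi=0.5, n_iter=50):
    """Run EM for the linkage model; return the estimate and the log-ODL path."""
    y1, y2, y3, y4 = counts
    history = [log_odl(pi, counts)]
    for _ in range(n_iter):
        # E step: expected count in the pi/4 part of the first cell.
        x_split = y1 * (pi / 4.0) / (0.5 + pi / 4.0)
        # M step: complete-data MLE, a binomial proportion.
        pi = (x_split + y4) / (x_split + y2 + y3 + y4)
        history.append(log_odl(pi, counts))
    return pi, history

counts = (125, 18, 20, 34)
pi_hat, history = em_linkage(counts)
print(round(pi_hat, 6))
```

The recorded log-ODL values never decrease from one iteration to the next, which is the monotonicity property of EM. For this model the ODL score equation reduces to the quadratic $197\pi^2 - 15\pi - 68 = 0$, so the EM limit can be compared against its positive root.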

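The cumulant generating function $\log a(\phi)$:

$$\frac{\partial}{\partial \phi} \log a(\phi) = E(t \mid \phi) = \mu, \qquad \frac{\partial^2}{\partial \phi^2} \log a(\phi) = \mathrm{Var}(t \mid \phi).$$

(The 2nd cumulant = 2nd central moment. In general, the $k$th derivative gives the $k$th cumulant of $t$.)

Convexity: Since $\partial^2 \log a(\phi) / \partial \phi^2 > 0$, $\log a$ is convex. Therefore

$$\log \mathrm{CDL}(\phi; x) = \log b(x) + \phi^T t(x) - \log a(\phi)$$

is concave in $\phi$, so there is a unique maximum.

Theorem: $\hat{\phi} = \hat{\phi}(x)$ maximizes $\mathrm{CDL}(\phi; x)$ if and only if $E(T \mid \phi = \hat{\phi}) = t(x)$.

EM Algorithm for CEF's

Suppose $f(x \mid \phi)$ is a complete exponential family:

E-Step: Given a current estimate $\phi^{(p)}$, let $t^{(p)} = E(t(x) \mid y, \phi^{(p)})$. This is a posterior mean.

M-Step: Find $\phi^{(p+1)}$ to solve $E(t(x) \mid \phi^{(p+1)}) = t^{(p)}$.

The M-step is an MLE calculation, because

$$\log f(x \mid \phi) = -\log a(\phi) + \phi^T t(x) + \log b(x)$$

has derivative $= 0$ at $t(x) = \partial \log a(\phi) / \partial \phi = E(t \mid \phi)$.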
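As a minimal sketch of these two steps, consider the censored-data setting with an assumed model that is not in the notes: failure times are i.i.d. exponential with mean `mean`, and censoring times are treated as fixed. The complete-data sufficient statistic is the sum of the failure times, with expectation $n \cdot$ `mean`. By the memoryless property, the expected failure time of an observation censored at $Y_i$ is $Y_i +$ `mean`. The data values and names below are hypothetical.

```python
def em_censored_exponential(times, censored, mean=1.0, n_iter=200):
    """EM for exponential failure times with right censoring.

    times[i] is the observed time Y_i; censored[i] is 1 if Y_i is a
    censoring time and 0 if it is an observed failure time.
    """
    n = len(times)
    n_censored = sum(censored)
    for _ in range(n_iter):
        # E step: expected sufficient statistic, the posterior mean of sum(X_i).
        t_expected = sum(times) + mean * n_censored
        # M step: solve E(t | mean) = n * mean = t_expected.
        mean = t_expected / n
    return mean

# Assumed illustrative data: two of six observations are censored.
times = [2.0, 3.5, 1.2, 4.0, 0.7, 5.0]
censored = [0, 1, 0, 1, 0, 0]
mean_hat = em_censored_exponential(times, censored)
print(mean_hat)
```

The fixed point of this iteration is the total observed time divided by the number of uncensored failures, which is the usual maximum likelihood estimate for this model. The error shrinks by the censored fraction at every iteration, so heavier censoring means slower convergence.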

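[Diagram: the EM iterations alternate E steps ($\phi^{(p)} \to t^{(p)}$) and M steps ($t^{(p)} \to \phi^{(p+1)}$), moving back and forth between the observed-data likelihood $\mathrm{ODL}(\phi; Y)$ and the complete-data likelihood $\mathrm{CDL}(\phi; T)$.]

Exponential family convexity implies uniqueness. (But existence? Check boundaries!)

Why does this correspond to the general EM?

$$\begin{aligned}
Q(\phi, \phi^{(p)}) &= E_\pi\big(\log \mathrm{CDL}(\phi; X)\big) \\
&= E_\pi\big(\log b(X) + \phi^T t(X) - \log a(\phi)\big) \\
&= \text{const} + \phi^T t_\pi - \log a(\phi) \\
&= \text{const} + \log \mathrm{CDL}(\phi; t_\pi),
\end{aligned}$$

where $t_\pi = E_\pi\big(t(X)\big) = E\big(t(X) \mid y, \phi^{(p)}\big)$ and the constants do not depend on $\phi$.

Maximizing $Q = E_\pi(\log \mathrm{CDL})$ is therefore the same as maximizing $\log \mathrm{CDL}(\phi; t_\pi)$. But maximizing a CEF is easy: pick $\phi$ such that $E(T \mid \phi) = t_\pi$.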


MATH 829: Introduction to Data Mining and Analysis. The EM algorithm (part 2). Dominique Guillot, Departments of Mathematical Sciences, University of Delaware. April 20, 2016.

Recall: We are given independent observations