EM algorithm and applications
Lecture #9
Background Readings: Chapters 11.2, 11.6 in the textbook, Biological Sequence Analysis, Durbin et al., 2001.
The EM algorithm
This lecture's plan:
1. Presentation and correctness proof of the EM algorithm.
2. Examples of implementations.
Model, Parameters, ML
A model with parameters θ is a probabilistic space M, in which each simple event y is determined by the values of random variables (dice). The parameters θ are the probabilities associated with the random variables. (In an HMM of length L, the simple events are HMM-sequences of length L, and the parameters are the transition probabilities m_kl and the emission probabilities e_k(b).)
An observed datum is a non-empty subset x ⊆ M. (In an HMM, it can be all the simple events which fit a given output sequence.)
Given observed data x, the ML method seeks parameters θ* which maximize the likelihood of the data, p(x|θ) = Σ_y p(x,y|θ). Finding such θ* is easy when the observed data is a simple event, but hard in general.
The EM algorithm
Assume a model with parameters as on the previous slide. Given observed data x, the likelihood of x under model parameters θ is p(x|θ) = Σ_y p(x,y|θ). (The y's are the simple events which comprise x, usually determined by the possible values of the hidden data.)
The EM algorithm receives x and parameters θ, and returns new parameters λ* s.t. p(x|λ*) > p(x|θ), i.e., the new parameters increase the likelihood of the observed data.
The EM algorithm
EM uses the current parameters θ to construct a simpler ML problem L_θ:
  L_θ(λ) = Π_y p(x,y|λ)^p(y|x,θ)
Guarantee: if L_θ(λ) > L_θ(θ), then p(x|λ) > p(x|θ).
[Figure: the log-likelihood log p(x|λ) and the lower-bound curve log L_θ(λ) = E_θ[log p(x,y|λ)], plotted against λ; the curves touch at λ = θ.]
Derivation of the EM Algorithm
Let x be the observed data, and let {(x,y_1),…,(x,y_k)} be the set of (simple) events which comprise x. Our goal is to find parameters θ* which maximize the sum
  p(x|θ*) = p(x,y_1|θ*) + p(x,y_2|θ*) + … + p(x,y_k|θ*)
As this is hard, we start with some parameters θ, and only find λ* s.t.:
  p(x|λ*) = Σ_{i=1..k} p(x,y_i|λ*) > Σ_{i=1..k} p(x,y_i|θ) = p(x|θ)
For given parameters θ, let p_i = p(y_i|x,θ) (note that p_1 + … + p_k = 1). We use the p_i's to define a virtual sampling, in which y_1 occurs p_1 times, y_2 occurs p_2 times, …, y_k occurs p_k times.
The EM algorithm looks for new parameters λ which maximize the likelihood of this "virtual" sampling. This likelihood is given by
  L_θ(λ) = p(y_1,x|λ)^p_1 · p(y_2,x|λ)^p_2 · … · p(y_k,x|λ)^p_k
The EM algorithm
In each iteration the EM algorithm does the following:
(E-step): Calculate  L_θ(λ) = Π_y p(y,x|λ)^p(y|x,θ)
(M-step): Find λ* which maximizes L_θ(λ).
(The next iteration sets θ ← λ* and repeats.)
Comments:
1. At the M-step we only need that L_θ(λ*) > L_θ(θ). This change yields the so-called Generalized EM algorithm. It is important when it is hard to find the optimal λ*.
2. Usually the logarithm, Q_θ(λ) = log L_θ(λ) = Σ_y p(y|x,θ)·log p(y,x|λ), is used.
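The E-step/M-step loop above can be sketched in code by brute-force enumeration of the hidden events. This is an illustrative sketch, not part of the lecture: the names `em_iteration` and `joint` are hypothetical, and the M-step is done here by grid search rather than in closed form.

```python
import math

def em_iteration(joint, hidden, theta, candidates):
    """One EM iteration. joint(lam, y) = p(x, y | lam); hidden lists the y's of x."""
    # E-step: posterior weights p(y | x, theta)
    px = sum(joint(theta, y) for y in hidden)
    w = {y: joint(theta, y) / px for y in hidden}
    # M-step (by grid search here): maximize Q_theta(lam) = sum_y w[y] * log p(x,y|lam)
    def Q(lam):
        return sum(w[y] * math.log(joint(lam, y)) for y in hidden)
    return max(candidates, key=Q)

# Toy model: a coin tossed twice where only the first outcome, T, is seen;
# lam is theta_T, and the hidden events are y1 = (T,H), y2 = (T,T).
def joint(lam, y):
    return {"TH": lam * (1 - lam), "TT": lam * lam}[y]

grid = [i / 1000 for i in range(1, 1000)]
lam = em_iteration(joint, ["TH", "TT"], 0.75, grid)
# lam lands on the closed-form update (1 + 0.75) / 2 = 0.875
```

The grid-search M-step is only for illustration; in the models treated below (dice, HMMs) the M-step has a closed form as relative frequencies.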
Correctness Theorem for the EM Algorithm
Theorem: Let x = {(x,y_1),…,(x,y_k)} be a collection of events, as in the setting of the EM algorithm, and let
  L_θ(λ) = Π_{i=1..k} prob(x,y_i|λ)^prob(y_i|x,θ)
Then the following holds: if L_θ(λ*) > L_θ(θ), then prob(x|λ*) > prob(x|θ).
Correctness proof of EM
Let prob(y_i|x,θ) = p_i and prob(y_i|x,λ*) = q_i. From the definition of conditional probability we have:
  prob(x,y_i|θ) = p_i·prob(x|θ),  prob(x,y_i|λ*) = q_i·prob(x|λ*).
By the EM assumption on θ and λ*:
  Π_{i=1..k} (q_i·prob(x|λ*))^p_i = L_θ(λ*) > L_θ(θ) = Π_{i=1..k} (p_i·prob(x|θ))^p_i
Since Σ_{i=1..k} p_i = Σ_{i=1..k} q_i = 1, we get:
  (Π_{i=1..k} q_i^p_i)·prob(x|λ*) > (Π_{i=1..k} p_i^p_i)·prob(x|θ)   [1]
Correctness proof of EM (end)
From the last slide:
  (Π_{i=1..k} q_i^p_i)·prob(x|λ*) > (Π_{i=1..k} p_i^p_i)·prob(x|θ)   [1]
By the ML principle (the p_i-weighted likelihood Π_i λ_i^p_i over distributions λ is maximized at λ_i = p_i), we have:
  Π_{i=1..k} q_i^p_i ≤ Π_{i=1..k} p_i^p_i
Dividing inequality [1] by Π_{i=1..k} p_i^p_i, we get:
  prob(x|λ*) > prob(x|θ)   QED
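The "ML principle" inequality used in the last step, Π_i q_i^p_i ≤ Π_i p_i^p_i for any two distributions p and q (Gibbs' inequality), can be checked numerically. A minimal sketch, not from the lecture, with randomly drawn distributions and illustrative helper names:

```python
import random

def weighted_product(p, q):
    """Compute prod_i q_i ** p_i for distributions p, q of equal length."""
    out = 1.0
    for pi, qi in zip(p, q):
        out *= qi ** pi
    return out

def random_dist(k):
    """A random probability distribution over k outcomes."""
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return [wi / s for wi in w]

random.seed(0)
for _ in range(200):
    p = random_dist(5)
    q = random_dist(5)
    # q = p always does at least as well as any other q:
    assert weighted_product(p, q) <= weighted_product(p, p) + 1e-12
```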
Example: Baum-Welch = EM for HMM
The Baum-Welch algorithm is the EM algorithm for HMMs.
E-step for HMM:  L_θ(λ) = Π_s p(s,x|λ)^p(s|x,θ)
where λ are the new parameters {m_kl, e_k(b)} and s ranges over the state sequences.
M-step for HMM: look for λ which maximizes L_θ(λ).
Recall that for an HMM,
  p(s,x|λ) = Π_{k,l} m_kl^M_kl(s) · Π_{k,b} e_k(b)^E_k(b)(s)
Baum-Welch = EM for HMM (cont)
Writing p(s,x|λ) as Π_{k,l} m_kl^M_kl(s) · Π_{k,b} e_k(b)^E_k(b)(s), we get
  L_θ(λ) = Π_s [Π_{k,l} m_kl^M_kl(s) · Π_{k,b} e_k(b)^E_k(b)(s)]^p(s|x,θ) = Π_{k,l} m_kl^M_kl · Π_{k,b} e_k(b)^E_k(b)
where M_kl = Σ_s p(s|x,θ)·M_kl(s) and E_k(b) = Σ_s p(s|x,θ)·E_k(b)(s).
As we showed, L_θ(λ) is maximized when the m_kl's and e_k(b)'s are the relative frequencies of the corresponding variables given x and θ, i.e.,
  m_kl = M_kl / Σ_{l'} M_kl'   and   e_k(b) = E_k(b) / Σ_{b'} E_k(b')
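The M-step above is just a row-wise normalization of the expected counts. A minimal sketch, assuming the expected counts M_kl and E_k(b) were already produced by the E-step (e.g., by forward-backward, not shown here); the toy counts below are made-up:

```python
def m_step(M, E):
    """Turn expected counts into probabilities by relative frequency.

    M[k][l] = expected number of k -> l transitions;
    E[k][b] = expected number of emissions of symbol b from state k.
    """
    m = {k: {l: M[k][l] / sum(M[k].values()) for l in M[k]} for k in M}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in E[k]} for k in E}
    return m, e

# Toy expected counts for a 2-state HMM over the alphabet {'A', 'B'}:
M = {0: {0: 3.0, 1: 1.0}, 1: {0: 2.0, 1: 2.0}}
E = {0: {'A': 1.5, 'B': 0.5}, 1: {'A': 1.0, 'B': 3.0}}
m, e = m_step(M, E)
# Each row of m and of e is now a probability distribution,
# e.g. m[0][0] = 3.0 / (3.0 + 1.0) = 0.75
```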
A simple example: EM for 2 coin tosses
Consider the following experiment: given a coin with two possible outcomes, H (head) and T (tail), with probabilities θ_H and θ_T = 1 − θ_H. The coin is tossed twice, but only the 1st outcome, T, is seen. So the data is x = (T,*).
We wish to apply the EM algorithm to get parameters that increase the likelihood of the data. Let the initial parameters be θ = (θ_H, θ_T) = (¼, ¾).
EM for 2 coin tosses (cont)
The hidden data which produce x are the sequences y_1 = (T,H) and y_2 = (T,T). Hence the likelihood of x with parameters (θ_H, θ_T) is
  p(x|θ) = P(x,y_1|θ) + P(x,y_2|θ) = θ_T·θ_H + θ_T²
For the initial parameters θ = (¼, ¾), we have:
  p(x|θ) = ¾·¼ + ¾·¾ = ¾
Note that in this case P(x,y_i|θ) = P(y_i|θ), for i = 1,2. We can always define y so that (x,y) = y (otherwise we set y' ← (x,y) and replace the y's by y''s).
EM for 2 coin tosses - E step
Calculate L_θ(λ) = L_θ(λ_H,λ_T). Recall: λ_H, λ_T are the new parameters, which we need to optimize:
  L_θ(λ) = p(x,y_1|λ)^p(y_1|x,θ) · p(x,y_2|λ)^p(y_2|x,θ)
  p(y_1|x,θ) = p(y_1,x|θ)/p(x|θ) = (¾·¼)/(¾) = ¼
  p(y_2|x,θ) = p(y_2,x|θ)/p(x|θ) = (¾·¾)/(¾) = ¾
Thus we have
  L_θ(λ) = p(x,y_1|λ)^¼ · p(x,y_2|λ)^¾
EM for 2 coin tosses - E step (cont)
For a sequence y of coin tosses, let N_H(y) be the number of H's in y, and N_T(y) the number of T's in y. Then
  p(y|λ) = λ_H^N_H(y) · λ_T^N_T(y)
In our example: y_1 = (T,H), y_2 = (T,T); hence
  N_H(y_1) = N_T(y_1) = 1,  N_H(y_2) = 0,  N_T(y_2) = 2
Example: 2 coin tosses - E step (cont)
Thus
  p(x,y_1|λ) = λ_T^N_T(y_1) · λ_H^N_H(y_1) = λ_T·λ_H
  p(x,y_2|λ) = λ_T^N_T(y_2) · λ_H^N_H(y_2) = λ_T²
  L_θ(λ) = p(x,y_1|λ)^¼ · p(x,y_2|λ)^¾ = (λ_T·λ_H)^¼ · (λ_T²)^¾ = λ_T^(¼·N_T(y_1) + ¾·N_T(y_2)) · λ_H^(¼·N_H(y_1) + ¾·N_H(y_2))
And in general L_θ(λ) = λ_T^N_T · λ_H^N_H, where here
  N_T = ¼·1 + ¾·2 = 7/4,  N_H = ¼·1 + ¾·0 = ¼
EM for 2 coin tosses - M step
Find λ* which maximizes L_θ(λ). As we already saw, λ_H^N_H · λ_T^N_T is maximized when:
  λ_H = N_H/(N_H + N_T);  λ_T = N_T/(N_H + N_T)
  λ_H = (¼)/(¼ + 7/4) = 1/8;  λ_T = (7/4)/(¼ + 7/4) = 7/8
That is, λ* = (1/8, 7/8) and p(x|λ*) = 7/8·1/8 + (7/8)² = 7/8.
[The optimal parameters (0,1) will never be reached by the EM algorithm!]
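The E and M steps of the two-toss example can be run in code. A minimal sketch following the slides' closed forms (the function name is illustrative); iterating it shows θ_T climbing toward, but never reaching, 1:

```python
def em_iteration(theta_T):
    """One EM iteration for x = (T,*) with hidden events y1=(T,H), y2=(T,T)."""
    theta_H = 1.0 - theta_T
    # p(x | theta) = theta_T * theta_H + theta_T**2 = theta_T
    p_y1 = theta_T * theta_H / theta_T   # p(y1 | x, theta) = theta_H
    p_y2 = theta_T * theta_T / theta_T   # p(y2 | x, theta) = theta_T
    # E-step: expected counts
    N_H = p_y1 * 1.0                     # y1 = (T,H) has one H
    N_T = p_y1 * 1.0 + p_y2 * 2.0        # y1 has one T, y2 = (T,T) has two
    # M-step: relative frequency of T
    return N_T / (N_H + N_T)

theta_T = 0.75
for _ in range(3):
    theta_T = em_iteration(theta_T)
# theta_T climbs 0.875, 0.9375, 0.96875, ... toward (but never reaching) 1
```

Each iteration reduces to the closed-form update θ_T ← (1 + θ_T)/2, which has the unreachable fixed point θ_T = 1 noted on the slide.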
EM for a single random variable (dice)
Now the probability of each y (= (x,y)) is given by a sequence of dice tosses. The dice has m outcomes, with probabilities λ_1,…,λ_m.
Let N_k(y) = #(times outcome k occurs in y). Then
  p(y|λ) = Π_{k=1}^m λ_k^N_k(y)
Let N_k be the expected value of N_k(y), given x and θ:
  N_k = E(N_k(y)|x,θ) = Σ_y p(y|x,θ)·N_k(y)
L_θ(λ) for one dice
  L_θ(λ) = Π_y p(y|λ)^p(y|x,θ) = Π_y Π_{k=1}^m λ_k^(N_k(y)·p(y|x,θ)) = Π_{k=1}^m λ_k^N_k
which is maximized for
  λ_k = N_k / Σ_{k'} N_k'
EM algorithm for n independent observations x_1,…,x_n: Expectation step
It can be shown that, if the x_j are independent, then:
  N_k = Σ_{j=1..n} Σ_{y_j} p(y_j|x_j,θ)·N_k(y_j,x_j) = Σ_{j=1..n} (1/p(x_j|θ)) Σ_{y_j} p(y_j,x_j|θ)·N_k(y_j,x_j)
Example: The ABO locus
A locus is a particular place on the chromosome. Each locus state (called a genotype) consists of two alleles, one paternal and one maternal. Some loci (plural of locus) determine distinguished features. The ABO locus, for example, determines blood type.
The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O. We wish to estimate the proportions in a population of the 6 genotypes.
Suppose we randomly sampled N individuals and found that N_a/a have genotype a/a, N_a/b have genotype a/b, etc. Then the MLE is given by:
  q_a/a = N_a/a / N,  q_a/o = N_a/o / N,  q_b/o = N_b/o / N,  q_b/b = N_b/b / N,  q_a/b = N_a/b / N,  q_o/o = N_o/o / N
The ABO locus (Cont.)
However, testing individuals for their genotype is very expensive. Can we estimate the proportions of the genotypes using the common cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)?
The problem is that among individuals measured to have blood type A, we don't know how many have genotype a/a and how many have genotype a/o. So what can we do?
The ABO locus (Cont.)
The Hardy-Weinberg equilibrium rule states that in equilibrium the frequencies of the three alleles q_a, q_b, q_o in the population determine the frequencies of the genotypes as follows:
  q_a/b = 2q_a·q_b,  q_a/o = 2q_a·q_o,  q_b/o = 2q_b·q_o,  q_a/a = (q_a)²,  q_b/b = (q_b)²,  q_o/o = (q_o)².
In fact, the Hardy-Weinberg equilibrium rule follows from modeling this problem as data x with hidden data y:
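The six Hardy-Weinberg formulas above are easy to compute and check in code. A minimal sketch, not from the lecture; the allele frequencies below are made-up illustrative values:

```python
def genotype_freqs(qa, qb, qo):
    """Genotype frequencies under Hardy-Weinberg, given allele frequencies."""
    return {
        'a/a': qa ** 2, 'b/b': qb ** 2, 'o/o': qo ** 2,
        'a/b': 2 * qa * qb, 'a/o': 2 * qa * qo, 'b/o': 2 * qb * qo,
    }

g = genotype_freqs(0.3, 0.2, 0.5)
# The six frequencies sum to (qa + qb + qo)**2 = 1, as a distribution should.
```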
The ABO locus (Cont.)
The dice outcomes are the three possible alleles a, b and o. The observed data are the blood types A, B, AB or O. Each blood type is determined by two successive random samplings of alleles, which form an ordered genotype pair; this is the hidden data. For instance, blood type A corresponds to the ordered genotype pairs (a,a), (a,o) and (o,a).
So we have three parameters of one dice, q_a, q_b, q_o, that we need to estimate. We start with parameters θ = (q_a, q_b, q_o), and then use EM to improve them.
EM setting for the ABO locus
The observed data x = (x_1,…,x_n) is a sequence of elements (blood types) from the set {A, B, AB, O}. E.g., (B,A,B,B,O,A,B,A,O,B,AB) are observations (x_1,…,x_11).
The hidden data (i.e., the y's) for each x_j is the set of ordered pairs of alleles that generates it. For instance, for A it is the set {aa, ao, oa}.
The parameters θ = {q_a, q_b, q_o} are the probabilities of the alleles.
EM for the ABO locus
For each observed blood type x_j ∈ {A, B, AB, O} and for each allele z ∈ {a, b, o} we compute N_z(x_j), the expected number of times that z appears in x_j:
  N_z(x_j) = Σ_{y_j} p(y_j|x_j,θ)·N_z(y_j)
where the sum is taken over the ordered genotype pairs y_j, and N_z(y_j) is the number of times allele z occurs in the pair y_j. E.g., N_a(o,b) = 0; N_b(o,b) = N_o(o,b) = 1.
EM for the ABO locus
The computation for blood type B:
  P(B|θ) = P((b,b)|θ) + P((b,o)|θ) + P((o,b)|θ) = q_b² + 2q_b·q_o
Since N_b((b,b)) = 2, and N_b((b,o)) = N_b((o,b)) = N_o((o,b)) = N_o((b,o)) = 1, the expected numbers of occurrences of o and b in B, N_o(B) and N_b(B), are given by:
  N_o(B) = Σ_y p(y|B,θ)·N_o(y) = 2q_b·q_o / (q_b² + 2q_b·q_o)
  N_b(B) = Σ_y p(y|B,θ)·N_b(y) = (2q_b² + 2q_b·q_o) / (q_b² + 2q_b·q_o)
Observe that N_b(B) + N_o(B) = 2.
EM for the ABO locus
Similarly, P(A|θ) = q_a² + 2q_a·q_o, and
  N_o(A) = 2q_a·q_o / (q_a² + 2q_a·q_o),  N_a(A) = (2q_a² + 2q_a·q_o) / (q_a² + 2q_a·q_o)
  P(AB|θ) = P((b,a)|θ) + P((a,b)|θ) = 2q_a·q_b;  N_a(AB) = N_b(AB) = 1
  P(O|θ) = P((o,o)|θ) = q_o²;  N_o(O) = 2
  [N_b(O) = N_a(O) = N_o(AB) = N_b(A) = N_a(B) = 0]
E step: compute N_a, N_b and N_o
Let #(A)=3, #(B)=5, #(AB)=1, #(O)=2 be the numbers of observations of A, B, AB, and O respectively (so N = 11).
  N_a = #(A)·N_a(A) + #(AB)·N_a(AB)
  N_b = #(B)·N_b(B) + #(AB)·N_b(AB)
  N_o = #(A)·N_o(A) + #(B)·N_o(B) + #(O)·N_o(O)
Note that N_a + N_b + N_o = 2N = 22.
M step: set λ* = (q_a*, q_b*, q_o*):
  q_a* = N_a/(2N);  q_b* = N_b/(2N);  q_o* = N_o/(2N)
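A full EM iteration for the ABO example can be put together from the closed-form expected counts of the previous slides. A sketch, with the observed counts from the slide (#(A)=3, #(B)=5, #(AB)=1, #(O)=2); the function name and uniform starting point are illustrative:

```python
def abo_em_step(qa, qb, qo, nA=3, nB=5, nAB=1, nO=2):
    """One EM iteration for allele frequencies from blood-type counts."""
    n = nA + nB + nAB + nO
    # E-step: expected allele counts per blood type (closed forms from the slides)
    Na_A = (2 * qa * qa + 2 * qa * qo) / (qa * qa + 2 * qa * qo)
    No_A = (2 * qa * qo) / (qa * qa + 2 * qa * qo)
    Nb_B = (2 * qb * qb + 2 * qb * qo) / (qb * qb + 2 * qb * qo)
    No_B = (2 * qb * qo) / (qb * qb + 2 * qb * qo)
    Na = nA * Na_A + nAB * 1.0
    Nb = nB * Nb_B + nAB * 1.0
    No = nA * No_A + nB * No_B + nO * 2.0
    assert abs(Na + Nb + No - 2 * n) < 1e-9   # total allele count is 2N
    # M-step: relative frequencies
    return Na / (2 * n), Nb / (2 * n), No / (2 * n)

qa, qb, qo = 1/3, 1/3, 1/3
for _ in range(20):
    qa, qb, qo = abo_em_step(qa, qb, qo)
# (qa, qb, qo) converges to a local maximum of the likelihood.
```

From the uniform start, the first iteration gives q_a* = 5/22, q_b* = 23/66, q_o* = 28/66, matching the E-step totals N_a + N_b + N_o = 22 above.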
EM for general discrete stochastic processes
Now we wish to maximize the likelihood of observation x with hidden data as before, i.e., maximize p(x|λ) = Σ_y p(x,y|λ). But this time the experiment (x,y) is generated by a general stochastic process. The only assumption we make is that the outcome of each experiment consists of a (finite) sequence of samplings of r discrete random variables (dice) Z_1,…,Z_r; each of the Z_i's can be sampled a few times. This can be realized by a probabilistic acyclic state machine, where at each state some Z_i is sampled, and the next state is determined by the outcome, until a final state is reached.
EM for processes with many dice
Example: In an HMM, the random variables are the transition and emission probabilities: a_kl, e_k(b). x is the visible information, y is the sequence s of states, and (x,y) is the complete HMM sequence. As before, we can redefine y so that (x,y) = y.
[Figure: an HMM with hidden states s_1, s_2, …, s_L-1, s_L emitting outputs X_1, X_2, …, X_L-1, X_L.]
EM for processes with many dice
Each random variable Z_l (l = 1,…,r) has m_l values z_l,1,…,z_l,m_l with probabilities {q_lk | k = 1,…,m_l}. Each y defines a sequence of outcomes (z_{l_1,k_1},…,z_{l_n,k_n}) of the random variables used in y. In the HMM, these are the specific transitions and emissions, defined by the states and outputs of the sequence y_j.
Let N_lk(y) = #(times z_lk appears in y).
EM for processes with many dice
Similarly to the single-dice case, we have:
  p(y|λ) = Π_{l=1}^r Π_{k=1}^{m_l} λ_lk^N_lk(y)
Define N_lk as the expected value of N_lk(y), given x and θ:
  N_lk = E(N_lk(y)|x,θ) = Σ_y p(y|x,θ)·N_lk(y)
L_θ(λ) for processes with many dice
  L_θ(λ) = Π_y p(y|λ)^p(y|x,θ) = Π_y Π_{l=1}^r Π_{k=1}^{m_l} λ_lk^(N_lk(y)·p(y|x,θ)) = Π_{l=1}^r Π_{k=1}^{m_l} λ_lk^N_lk
where N_lk = Σ_y p(y|x,θ)·N_lk(y) is the expected number of times that, given x and θ, the outcome of dice l was k.
L_θ(λ) is maximized for
  λ_lk = N_lk / Σ_{k'} N_lk'
EM algorithm for processes with many dice
Similarly to the one-dice case we get:
Expectation step: set N_lk to E(N_lk(y)|x,θ), i.e.: N_lk = Σ_y p(y|x,θ)·N_lk(y)
Maximization step: set λ_lk = N_lk / Σ_{k'} N_lk'
EM algorithm for n independent observations x_1,…,x_n: Expectation step
It can be shown that, if the x_j are independent, then:
  N_lk = Σ_{j=1..n} Σ_{y_j} p(y_j|x_j,θ)·N_lk(y_j,x_j) = Σ_{j=1..n} (1/p(x_j|θ)) Σ_{y_j} p(y_j,x_j|θ)·N_lk(y_j,x_j)
EM in Practice
Initial parameters:
- Random parameter setting
- Best guess from another source
Stopping criteria:
- Small change in likelihood of data
- Small change in parameter values
Avoiding bad local maxima:
- Multiple restarts
- Early pruning of unpromising runs
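The multiple-restarts strategy can be sketched as a small harness: run EM from several random initializations and keep the run with the highest final likelihood. This is an illustrative sketch, not from the lecture; `run_em` is a toy stand-in using the two-toss coin model, for which the final θ_T equals the likelihood p(x|θ):

```python
import random

def run_em(theta0, steps=5):
    """Toy stand-in for an EM run: the two-toss coin update theta <- (1+theta)/2,
    for which p(x | theta) = theta_T, so the final theta is also the likelihood."""
    theta = theta0
    for _ in range(steps):
        theta = (1.0 + theta) / 2.0
    return theta, theta  # (final parameter, final likelihood)

random.seed(1)
runs = [run_em(random.random()) for _ in range(5)]
best_theta, best_ll = max(runs, key=lambda r: r[1])
# Keep only the restart with the highest final likelihood.
```

In a real setting `run_em` would be a full EM run on the model at hand, the stopping criterion would be the small-change tests listed above, and unpromising restarts could be pruned after a few iterations instead of being run to completion.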