6.864 (Fall 2006): Lecture 18
The EM Algorithm

An Experiment / Some Intuition

I have three coins in my pocket:
Coin 0 has probability λ of heads;
Coin 1 has probability p_1 of heads;
Coin 2 has probability p_2 of heads.

For each trial I do the following:
First I toss Coin 0.
If Coin 0 turns up heads, I toss Coin 1 three times.
If Coin 0 turns up tails, I toss Coin 2 three times.

I don't tell you whether Coin 0 came up heads or tails, or whether Coin 1 or Coin 2 was tossed three times, but I do tell you how many heads/tails are seen at each trial. You see the following sequence:

    HHH, TTT, HHH, TTT, HHH

What would you estimate as the values for λ, p_1 and p_2?

Overview

The EM algorithm in general form
The EM algorithm for hidden Markov models (brute force)
The EM algorithm for hidden Markov models (dynamic programming)

Maximum Likelihood Estimation

We have data points x_1, x_2, ..., x_n drawn from some (finite or countable) set X.
We have a parameter vector Θ.
We have a parameter space Ω.
We have a distribution P(x | Θ) for any Θ ∈ Ω, such that

    Σ_{x ∈ X} P(x | Θ) = 1    and    P(x | Θ) ≥ 0 for all x.

We assume that our data points x_1, x_2, ..., x_n are drawn at random (independently, identically distributed) from a distribution P(x | Θ*) for some Θ* ∈ Ω.
Log-Likelihood

We have data points x_1, x_2, ..., x_n drawn from some (finite or countable) set X.
We have a parameter vector Θ, and a parameter space Ω.
We have a distribution P(x | Θ) for any Θ ∈ Ω.

The likelihood is

    Likelihood(Θ) = P(x_1, x_2, ..., x_n | Θ) = Π_{i=1}^{n} P(x_i | Θ)

The log-likelihood is

    L(Θ) = log Likelihood(Θ) = Σ_{i=1}^{n} log P(x_i | Θ)

Maximum Likelihood Estimation

Given a sample x_1, x_2, ..., x_n, choose

    Θ_ML = argmax_{Θ ∈ Ω} L(Θ) = argmax_{Θ ∈ Ω} Σ_i log P(x_i | Θ)

For example, take the coin example: say x_1 ... x_n has Count(H) heads, and (n − Count(H)) tails. Then

    L(Θ) = log (Θ^{Count(H)} (1 − Θ)^{n − Count(H)})
         = Count(H) log Θ + (n − Count(H)) log (1 − Θ)

We then have

    Θ_ML = Count(H) / n

A First Example: Coin Tossing

X = {H, T}. Our data points x_1, x_2, ..., x_n are a sequence of heads and tails, e.g. HHTTHHHTHH.
Parameter vector Θ is a single parameter, i.e., the probability of the coin coming up heads.
Parameter space Ω = [0, 1].
Distribution P(x | Θ) is defined as

    P(x | Θ) = Θ if x = H, and 1 − Θ if x = T

A Second Example: Probabilistic Context-Free Grammars

X is the set of all parse trees generated by the underlying context-free grammar. Our sample is n trees T_1 ... T_n such that each T_i ∈ X.
R is the set of rules in the context-free grammar.
N is the set of non-terminals in the grammar.
Θ_r for r ∈ R is the parameter for rule r.
Let R(α) ⊆ R be the rules of the form α → β for some β.
The parameter space Ω is the set of Θ ∈ [0, 1]^{|R|} such that for all α ∈ N,

    Σ_{r ∈ R(α)} Θ_r = 1
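The closed-form solution Θ_ML = Count(H) / n for the coin example is easy to check numerically. The following is a minimal sketch (the function name coin_mle is mine, not from the lecture):

```python
# Maximum likelihood estimate for the single-coin example:
# Theta_ML = Count(H) / n, the relative frequency of heads.
def coin_mle(tosses):
    """MLE of the heads probability from a string of 'H'/'T' tosses."""
    return tosses.count("H") / len(tosses)

# For the sequence HHTTHHHTHH: 7 heads out of 10 tosses.
print(coin_mle("HHTTHHHTHH"))  # 0.7
```

This is the value that maximizes Count(H) log Θ + (n − Count(H)) log (1 − Θ), as can be confirmed by setting the derivative with respect to Θ to zero.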
We have

    P(T | Θ) = Π_{r ∈ R} Θ_r^{Count(T, r)}

where Count(T, r) is the number of times rule r is seen in the tree T. Hence

    log P(T | Θ) = Σ_{r ∈ R} Count(T, r) log Θ_r

Multinomial Distributions

X is a finite set, e.g., X = {dog, cat, the, saw}.
Our sample x_1, x_2, ..., x_n is drawn from X, e.g., x_1, x_2, x_3 = dog, the, saw.
The parameter Θ is a vector in R^m where m = |X|, e.g., Θ_1 = P(dog), Θ_2 = P(cat), Θ_3 = P(the), Θ_4 = P(saw).
The parameter space is

    Ω = {Θ : Σ_{i=1}^{m} Θ_i = 1 and, for all i, Θ_i ≥ 0}

If our sample is x_1, x_2, x_3 = dog, the, saw, then

    L(Θ) = log P(x_1, x_2, x_3 = dog, the, saw) = log Θ_1 + log Θ_3 + log Θ_4

Maximum Likelihood Estimation for PCFGs

We have

    log P(T_i | Θ) = Σ_{r ∈ R} Count(T_i, r) log Θ_r

where Count(T_i, r) is the number of times rule r is seen in the tree T_i. And

    L(Θ) = Σ_i log P(T_i | Θ) = Σ_i Σ_{r ∈ R} Count(T_i, r) log Θ_r

Solving Θ_ML = argmax_{Θ ∈ Ω} L(Θ) gives

    Θ_r = Σ_i Count(T_i, r) / Σ_i Σ_{s ∈ R(α)} Count(T_i, s)

where r is of the form α → β for some β.

Models with Hidden Variables

Now say we have two sets X and Y, and a joint distribution P(x, y | Θ).
If we had fully observed data, (x_i, y_i) pairs, then

    L(Θ) = Σ_i log P(x_i, y_i | Θ)

If we have partially observed data, x_i examples only, then

    L(Θ) = Σ_i log P(x_i | Θ) = Σ_i log Σ_{y ∈ Y} P(x_i, y | Θ)
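The PCFG solution is relative-frequency estimation: each rule's count divided by the total count of rules with the same left-hand side. A minimal sketch follows; the (lhs, rhs) pair encoding and the rule counts are my own illustrative choices, not data from the lecture:

```python
from collections import defaultdict

# PCFG maximum likelihood estimation:
#   Theta_r = Count(r) / sum of Count(s) over rules s with the same lhs as r.
# Rules are represented as (lhs, rhs) pairs; counts are made-up numbers.
def pcfg_mle(rule_counts):
    lhs_totals = defaultdict(float)
    for (lhs, _), c in rule_counts.items():
        lhs_totals[lhs] += c
    return {rule: c / lhs_totals[rule[0]] for rule, c in rule_counts.items()}

counts = {("S", "NP VP"): 10, ("NP", "D N"): 6, ("NP", "N"): 4}
theta = pcfg_mle(counts)
print(theta[("NP", "D N")])  # 0.6
```

Note that the probabilities for each left-hand side sum to one by construction, so the result always lies in the parameter space Ω.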
The EM (Expectation Maximization) algorithm is a method for finding

    Θ_ML = argmax_Θ Σ_i log Σ_{y ∈ Y} P(x_i, y | Θ)

e.g., in the three coins example:

    Y = {H, T}
    X = {HHH, TTT, HTT, THH, HHT, TTH, HTH, THT}
    Θ = {λ, p_1, p_2}

and

    P(x, y | Θ) = P(y | Θ) P(x | y, Θ)

where

    P(y | Θ) = λ if y = H, and 1 − λ if y = T

and

    P(x | y, Θ) = p_1^h (1 − p_1)^t if y = H, and p_2^h (1 − p_2)^t if y = T

where h = number of heads in x, t = number of tails in x.

Various probabilities can be calculated, for example:

    P(x = THT, y = H | Θ) = λ p_1 (1 − p_1)^2
    P(x = THT, y = T | Θ) = (1 − λ) p_2 (1 − p_2)^2
    P(x = THT | Θ) = P(x = THT, y = H | Θ) + P(x = THT, y = T | Θ)
                   = λ p_1 (1 − p_1)^2 + (1 − λ) p_2 (1 − p_2)^2
    P(y = H | x = THT, Θ) = P(x = THT, y = H | Θ) / P(x = THT | Θ)
                          = λ p_1 (1 − p_1)^2 / (λ p_1 (1 − p_1)^2 + (1 − λ) p_2 (1 − p_2)^2)
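The joint, marginal, and posterior probabilities above can be computed directly from the model definition. A minimal sketch, assuming the function names joint and posterior_heads (mine, not from the lecture):

```python
# Three-coins model: P(x, y | Theta) = P(y | Theta) * P(x | y, Theta),
# where h and t are the numbers of heads and tails in x.
def joint(x, y, lam, p1, p2):
    h, t = x.count("H"), x.count("T")
    p = p1 if y == "H" else p2
    prior = lam if y == "H" else 1 - lam
    return prior * p**h * (1 - p)**t

def posterior_heads(x, lam, p1, p2):
    """P(y = H | x, Theta), by Bayes' rule."""
    num = joint(x, "H", lam, p1, p2)
    return num / (num + joint(x, "T", lam, p1, p2))

# For x = THT (one head, two tails):
lam, p1, p2 = 0.3, 0.3, 0.6
print(joint("THT", "H", lam, p1, p2))        # lambda * p1 * (1 - p1)^2
print(posterior_heads("THT", lam, p1, p2))
```

The posterior is exactly the ratio on the last line of the slide: the y = H joint probability divided by the marginal P(x | Θ).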
Fully observed data might look like:

    (HHH, H), (TTT, T), (HHH, H), (TTT, T), (HHH, H)

In this case the maximum likelihood estimates are:

    λ = 3/5    p_1 = 9/9    p_2 = 0/6

Partially observed data might look like:

    HHH, TTT, HHH, TTT, HHH

How do we find the maximum likelihood parameters?
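With fully observed (x, y) pairs, the maximum likelihood estimates are plain relative frequencies, as the fractions 3/5, 9/9 and 0/6 on the slide suggest. A minimal sketch of that computation:

```python
# Fully observed three-coins data: (x, y) pairs as on the slide.
data = [("HHH", "H"), ("TTT", "T"), ("HHH", "H"), ("TTT", "T"), ("HHH", "H")]

# lambda: fraction of trials where Coin 0 came up heads (y = H).
lam = sum(1 for _, y in data if y == "H") / len(data)

# p1, p2: fraction of heads among the tosses made with each coin.
tosses1 = "".join(x for x, y in data if y == "H")  # tosses made with Coin 1
tosses2 = "".join(x for x, y in data if y == "T")  # tosses made with Coin 2
p1 = tosses1.count("H") / len(tosses1)  # 9/9
p2 = tosses2.count("H") / len(tosses2)  # 0/6
print(lam, p1, p2)  # 0.6 1.0 0.0
```

The hard case, taken up next, is when the y values are hidden and only the toss sequences are observed.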
If the current parameters are λ, p_1, p_2:

    P(y = H | x = HHH) = P(HHH, H) / (P(HHH, H) + P(HHH, T))
                       = λ p_1^3 / (λ p_1^3 + (1 − λ) p_2^3)

    P(y = H | x = TTT) = P(TTT, H) / (P(TTT, H) + P(TTT, T))
                       = λ (1 − p_1)^3 / (λ (1 − p_1)^3 + (1 − λ) (1 − p_2)^3)

If λ = 0.3, p_1 = 0.3, p_2 = 0.6:

    P(y = H | x = HHH) = 0.0508
    P(y = H | x = TTT) = 0.6967

After filling in the hidden variables for each example, the partially observed data might look like:

    (HHH, H)    P(y = H | HHH) = 0.0508
    (HHH, T)    P(y = T | HHH) = 0.9492
    (TTT, H)    P(y = H | TTT) = 0.6967
    (TTT, T)    P(y = T | TTT) = 0.3033
    (HHH, H)    P(y = H | HHH) = 0.0508
    (HHH, T)    P(y = T | HHH) = 0.9492
    (TTT, H)    P(y = H | TTT) = 0.6967
    (TTT, T)    P(y = T | TTT) = 0.3033
    (HHH, H)    P(y = H | HHH) = 0.0508
    (HHH, T)    P(y = T | HHH) = 0.9492

New estimates:

    λ = (3 × 0.0508 + 2 × 0.6967) / 5 = 0.3092

    p_1 = (3 × 3 × 0.0508 + 0 × 2 × 0.6967) / (3 × 3 × 0.0508 + 3 × 2 × 0.6967) = 0.0987

    p_2 = (3 × 3 × 0.9492 + 0 × 2 × 0.3033) / (3 × 3 × 0.9492 + 3 × 2 × 0.3033) = 0.8244
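The E-step/M-step arithmetic above can be sketched in a few lines of Python (a sketch, not code from the lecture; the function name em_step is mine). Run from λ = 0.3, p_1 = 0.3, p_2 = 0.6 it reproduces the slide's numbers:

```python
# One EM iteration on the three-coins data HHH, TTT, HHH, TTT, HHH.
def em_step(data, lam, p1, p2):
    # E-step: posterior P(y = H | x) for each observed sequence.
    posts = []
    for x in data:
        h, t = x.count("H"), x.count("T")
        a = lam * p1**h * (1 - p1)**t
        b = (1 - lam) * p2**h * (1 - p2)**t
        posts.append(a / (a + b))
    # M-step: re-estimate from expected counts.
    lam_new = sum(posts) / len(data)
    p1_new = sum(q * x.count("H") for q, x in zip(posts, data)) / \
             sum(q * len(x) for q, x in zip(posts, data))
    p2_new = sum((1 - q) * x.count("H") for q, x in zip(posts, data)) / \
             sum((1 - q) * len(x) for q, x in zip(posts, data))
    return lam_new, p1_new, p2_new

data = ["HHH", "TTT", "HHH", "TTT", "HHH"]
lam, p1, p2 = em_step(data, 0.3, 0.3, 0.6)
print(round(lam, 4), round(p1, 4), round(p2, 4))  # 0.3092 0.0987 0.8244
```

The M-step divides expected heads by expected tosses for each coin, exactly as in the fractions for the new estimates above.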
Summary

Begin with parameters λ = 0.3, p_1 = 0.3, p_2 = 0.6.
Fill in the hidden variables, using

    P(y = H | x = HHH) = 0.0508    P(y = H | x = TTT) = 0.6967

Re-estimate the parameters to be λ = 0.3092, p_1 = 0.0987, p_2 = 0.8244.

Iteration  λ       p_1     p_2     p̃_1     p̃_2     p̃_3     p̃_4     p̃_5
0          0.3000  0.3000  0.6000  0.0508  0.6967  0.0508  0.6967  0.0508
1          0.3092  0.0987  0.8244  0.0008  0.9837  0.0008  0.9837  0.0008
2          0.3940  0.0012  0.9893  0.0000  1.0000  0.0000  1.0000  0.0000
3          0.4000  0.0000  1.0000  0.0000  1.0000  0.0000  1.0000  0.0000

The coin example for {HHH, TTT, HHH, TTT, HHH}. (Here p̃_i = P(y = H | x_i) at the current parameter values.) λ is now 0.4, indicating that the coin-tosser has probability 0.4 of selecting the tail-based coin.

Iteration  λ       p_1     p_2     p̃_1     p̃_2     p̃_3     p̃_4
0          0.3000  0.3000  0.6000  0.0508  0.6967  0.0508  0.6967
1          0.3738  0.0680  0.7578  0.0004  0.9714  0.0004  0.9714
2          0.4859  0.0004  0.9722  0.0000  1.0000  0.0000  1.0000
3          0.5000  0.0000  1.0000  0.0000  1.0000  0.0000  1.0000

The coin example for y = {HHH, TTT, HHH, TTT}. The solution that EM reaches is intuitively correct: the coin-tosser has two coins, one which always shows up heads, the other which always shows tails, and is picking between them with equal probability (λ = 0.5). The posterior probabilities p̃_i show that we are certain that coin 1 (the tail-based coin) generated y_2 and y_4, whereas coin 2 generated y_1 and y_3.

Iteration  λ       p_1     p_2     p̃_1     p̃_2     p̃_3     p̃_4
0          0.3000  0.3000  0.6000  0.1579  0.6967  0.0508  0.6967
1          0.4005  0.0974  0.6300  0.0375  0.9065  0.0025  0.9065
2          0.4632  0.0148  0.7635  0.0014  0.9842  0.0000  0.9842
3          0.4924  0.0005  0.8205  0.0000  0.9941  0.0000  0.9941
4          0.4970  0.0000  0.8284  0.0000  0.9949  0.0000  0.9949

The coin example for y = {HHT, TTT, HHH, TTT}. EM selects a tails-only coin, and a coin which is heavily heads-biased (p_2 = 0.8284). It is certain that y_1 and y_3 were generated by coin 2, as they contain heads. y_2 and y_4 could have been generated by either coin, but coin 1 is far more likely.
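Iterating the same update reproduces the first table's convergence to λ = 0.4, p_1 = 0, p_2 = 1. The loop below is a sketch under the same setup (the function name em is mine):

```python
# Run EM for a fixed number of iterations on the three-coins data.
def em(data, lam, p1, p2, iters=20):
    for _ in range(iters):
        q = []  # E-step: posterior P(y = H | x) for each sequence
        for x in data:
            h, t = x.count("H"), x.count("T")
            a = lam * p1**h * (1 - p1)**t
            b = (1 - lam) * p2**h * (1 - p2)**t
            q.append(a / (a + b))
        # M-step: re-estimate from expected counts.
        lam = sum(q) / len(data)
        p1 = sum(qi * x.count("H") for qi, x in zip(q, data)) / \
             sum(qi * len(x) for qi, x in zip(q, data))
        p2 = sum((1 - qi) * x.count("H") for qi, x in zip(q, data)) / \
             sum((1 - qi) * len(x) for qi, x in zip(q, data))
    return lam, p1, p2

lam, p1, p2 = em(["HHH", "TTT", "HHH", "TTT", "HHH"], 0.3, 0.3, 0.6)
print(round(lam, 4), round(p1, 4), round(p2, 4))  # 0.4 0.0 1.0
```

Changing the data or the initial parameters reproduces the other tables in this section, including the saddle-point behaviour when p_1 and p_2 are initialised to the same value.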
Iteration  λ       p_1     p_2     p̃_1     p̃_2     p̃_3     p̃_4
0          0.3000  0.7000  0.7000  0.3000  0.3000  0.3000  0.3000
1          0.3000  0.5000  0.5000  0.3000  0.3000  0.3000  0.3000
2          0.3000  0.5000  0.5000  0.3000  0.3000  0.3000  0.3000
3          0.3000  0.5000  0.5000  0.3000  0.3000  0.3000  0.3000
4          0.3000  0.5000  0.5000  0.3000  0.3000  0.3000  0.3000
5          0.3000  0.5000  0.5000  0.3000  0.3000  0.3000  0.3000
6          0.3000  0.5000  0.5000  0.3000  0.3000  0.3000  0.3000

The coin example for y = {HHH, TTT, HHH, TTT}, with p_1 and p_2 initialised to the same value. EM is stuck at a saddle point.

Iteration  λ       p_1     p_2     p̃_1     p̃_2     p̃_3     p̃_4
0          0.3000  0.6999  0.7000  0.2999  0.3002  0.2999  0.3002
1          0.3001  0.4998  0.5001  0.2996  0.3005  0.2996  0.3005
2          0.3001  0.4993  0.5003  0.2987  0.3014  0.2987  0.3014
3          0.3001  0.4978  0.5010  0.2960  0.3041  0.2960  0.3041
4          0.3001  0.4933  0.5029  0.2880  0.3123  0.2880  0.3123
5          0.3002  0.4798  0.5087  0.2646  0.3374  0.2646  0.3374
6          0.3010  0.4396  0.5260  0.2008  0.4158  0.2008  0.4158
7          0.3083  0.3257  0.5777  0.0739  0.6448  0.0739  0.6448
8          0.3594  0.1029  0.7228  0.0016  0.9500  0.0016  0.9500
9          0.4758  0.0017  0.9523  0.0000  0.9999  0.0000  0.9999
10         0.4999  0.0000  0.9999  0.0000  1.0000  0.0000  1.0000
11         0.5000  0.0000  1.0000  0.0000  1.0000  0.0000  1.0000

The coin example for y = {HHH, TTT, HHH, TTT}. If we initialise p_1 and p_2 to be a small amount away from the saddle point p_1 = p_2, the algorithm diverges from the saddle point and eventually reaches the global maximum.
Iteration  λ       p_1     p_2     p̃_1     p̃_2     p̃_3     p̃_4
0          0.3000  0.7001  0.7000  0.3001  0.2998  0.3001  0.2998
1          0.2999  0.5003  0.4999  0.3004  0.2995  0.3004  0.2995
2          0.2999  0.5008  0.4997  0.3013  0.2986  0.3013  0.2986
3          0.2999  0.5023  0.4990  0.3040  0.2959  0.3040  0.2959
4          0.3000  0.5068  0.4971  0.3122  0.2879  0.3122  0.2879
5          0.3000  0.5202  0.4913  0.3373  0.2645  0.3373  0.2645
6          0.3009  0.5605  0.4740  0.4157  0.2007  0.4157  0.2007
7          0.3082  0.6744  0.4223  0.6447  0.0739  0.6447  0.0739
8          0.3593  0.8972  0.2773  0.9500  0.0016  0.9500  0.0016
9          0.4758  0.9983  0.0477  0.9999  0.0000  0.9999  0.0000
10         0.4999  1.0000  0.0001  1.0000  0.0000  1.0000  0.0000
11         0.5000  1.0000  0.0000  1.0000  0.0000  1.0000  0.0000

The coin example for y = {HHH, TTT, HHH, TTT}. If we initialise p_1 and p_2 to be a small amount away from the saddle point p_1 = p_2, the algorithm diverges from the saddle point and eventually reaches the global maximum (this time with the roles of the two coins swapped: p_1 = 1, p_2 = 0).

The EM Algorithm

Θ^t is the parameter vector at the t-th iteration.
Choose Θ^0 (at random, or using various heuristics).
The iterative procedure is defined as

    Θ^t = argmax_Θ Q(Θ, Θ^{t−1})

where

    Q(Θ, Θ^{t−1}) = Σ_i Σ_{y ∈ Y} P(y | x_i, Θ^{t−1}) log P(x_i, y | Θ)
The EM Algorithm

The iterative procedure is defined as Θ^t = argmax_Θ Q(Θ, Θ^{t−1}), where

    Q(Θ, Θ^{t−1}) = Σ_i Σ_{y ∈ Y} P(y | x_i, Θ^{t−1}) log P(x_i, y | Θ)

Key points:

Intuition: fill in the hidden variables y according to P(y | x_i, Θ).
EM is guaranteed to converge to a local maximum, or saddle-point, of the likelihood function.
In general, if argmax_Θ Σ_i log P(x_i, y_i | Θ) has a simple (analytic) solution, then argmax_Θ Σ_i Σ_y P(y | x_i, Θ) log P(x_i, y | Θ) also has a simple (analytic) solution.

Overview

The EM algorithm in general form
The EM algorithm for hidden Markov models (brute force)
The EM algorithm for hidden Markov models (dynamic programming)

The Structure of Hidden Markov Models

Have N states, states 1 ... N. Without loss of generality, take N to be the final or stop state.
Have an alphabet K. For example K = {a, b}.
Parameter π_i for i = 1 ... N is the probability of starting in state i.
Parameter a_{i,j} for i = 1 ... (N − 1), and j = 1 ... N, is the probability of state j following state i.
Parameter b_i(o) for i = 1 ... (N − 1), and o ∈ K, is the probability of state i emitting symbol o.

An Example

Take N = 3 states. States are {1, 2, 3}. Final state is state 3.
Alphabet K = {the, dog}.
Distribution over initial state is π_1 = 1.0, π_2 = 0, π_3 = 0.
Parameters a_{i,j} are

           j = 1   j = 2   j = 3
    i = 1  0.5     0.5     0
    i = 2  0       0.5     0.5

Parameters b_i(o) are

           o = the   o = dog
    i = 1  0.9       0.1
    i = 2  0.1       0.9
A Generative Process

Pick the start state s_1 to be state i, for i = 1 ... N, with probability π_i.
Set t = 1.
Repeat while the current state s_t is not the stop state (N):
    Emit a symbol o_t ∈ K with probability b_{s_t}(o_t).
    Pick the next state s_{t+1} as state j with probability a_{s_t, j}.
    t = t + 1.

Probabilities Over Sequences

An output sequence is a sequence of observations o_1 ... o_T where each o_i ∈ K, e.g. "the dog the dog dog the".
A state sequence is a sequence of states s_1 ... s_T where each s_i ∈ {1 ... N}, e.g. 1 2 1 2 2 1.
The HMM defines a probability for each state/output sequence pair, e.g. the/1 dog/2 the/1 dog/2 the/2 dog/1 has probability

    π_1 b_1(the) a_{1,2} b_2(dog) a_{2,1} b_1(the) a_{1,2} b_2(dog) a_{2,2} b_2(the) a_{2,1} b_1(dog) a_{1,3}

Formally:

    P(s_1 ... s_T, o_1 ... o_T) = π_{s_1} (Π_{i=2}^{T} a_{s_{i−1}, s_i}) (Π_{i=1}^{T} b_{s_i}(o_i)) a_{s_T, N}

A Hidden Variable Problem

We have an HMM with N = 3, K = {e, f, g, h}.
We see the following output sequences in training data:

    e e f f g h h g

How would you choose the parameter values for π_i, a_{i,j}, and b_i(o)?

Another Hidden Variable Problem

We have an HMM with N = 3, K = {e, f, g, h}.
We see the following output sequences in training data:

    e g h e h f h g f g g e h

How would you choose the parameter values for π_i, a_{i,j}, and b_i(o)?
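The formal probability above can be sketched directly in code. The parameters below are the example HMM from the earlier slide (N = 3, final state 3, K = {the, dog}); the dictionary encoding is my own choice, and the state sequence differs from the slide's generic example because a_{2,1} = 0 in this particular HMM:

```python
# Example HMM: N = 3, final state 3, alphabet {the, dog}.
N = 3
pi = {1: 1.0, 2: 0.0, 3: 0.0}
a = {(1, 1): 0.5, (1, 2): 0.5, (1, 3): 0.0,
     (2, 1): 0.0, (2, 2): 0.5, (2, 3): 0.5}
b = {(1, "the"): 0.9, (1, "dog"): 0.1,
     (2, "the"): 0.1, (2, "dog"): 0.9}

def joint_prob(states, outputs):
    """P(s_1..s_T, o_1..o_T) = pi_{s_1} * prod of a's * prod of b's * a_{s_T, N}."""
    p = pi[states[0]]
    for i in range(1, len(states)):           # transitions a_{s_{i-1}, s_i}
        p *= a[(states[i - 1], states[i])]
    for s, o in zip(states, outputs):          # emissions b_{s_i}(o_i)
        p *= b[(s, o)]
    return p * a[(states[-1], N)]              # final transition to stop state

# the/1 the/1 dog/2 dog/2:
#   pi_1 b_1(the) a_{1,1} b_1(the) a_{1,2} b_2(dog) a_{2,2} b_2(dog) a_{2,3}
print(joint_prob([1, 1, 2, 2], ["the", "the", "dog", "dog"]))
```

This value is 0.9^4 × 0.5^4; any pair that uses a zero-probability transition, such as state 1 following state 2, gets probability zero.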