6.864: Lecture 5 (September 22nd, 2005) The EM Algorithm


1 6.864: Lecture 5 (September 22nd, 2005) The EM Algorithm

2 Overview
The EM algorithm in general form
The EM algorithm for hidden Markov models (brute force)
The EM algorithm for hidden Markov models (dynamic programming)

3 An Experiment/Some Intuition
I have three coins in my pocket: Coin 0 has probability λ of heads; Coin 1 has probability p1 of heads; Coin 2 has probability p2 of heads.
For each trial I do the following: first I toss Coin 0. If Coin 0 turns up heads, I toss Coin 1 three times; if Coin 0 turns up tails, I toss Coin 2 three times.
I don't tell you whether Coin 0 came up heads or tails, or whether Coin 1 or 2 was tossed three times, but I do tell you how many heads/tails are seen at each trial. You see the following sequence:
    HHH, TTT, HHH, TTT, HHH
What would you estimate as the values for λ, p1 and p2?

4 Maximum Likelihood Estimation
We have data points x_1, x_2, ..., x_n drawn from some (finite or countable) set X.
We have a parameter vector Θ and a parameter space Ω.
We have a distribution P(x | Θ) for any Θ ∈ Ω, such that
    Σ_{x ∈ X} P(x | Θ) = 1   and   P(x | Θ) ≥ 0 for all x ∈ X
We assume that our data points x_1, x_2, ..., x_n are drawn at random (independently, identically distributed) from a distribution P(x | Θ*) for some Θ* ∈ Ω.

5 Log-Likelihood
We have data points x_1, x_2, ..., x_n drawn from some (finite or countable) set X.
We have a parameter vector Θ and a parameter space Ω.
We have a distribution P(x | Θ) for any Θ ∈ Ω.
The likelihood is
    Likelihood(Θ) = P(x_1, x_2, ..., x_n | Θ) = Π_{i=1}^{n} P(x_i | Θ)
The log-likelihood is
    L(Θ) = log Likelihood(Θ) = Σ_{i=1}^{n} log P(x_i | Θ)

6 A First Example: Coin Tossing
X = {H, T}. Our data points x_1, x_2, ..., x_n are a sequence of heads and tails, e.g. HHTTHHHTHH.
The parameter vector Θ is a single parameter, i.e., the probability of the coin coming up heads.
The parameter space is Ω = [0, 1].
The distribution P(x | Θ) is defined as
    P(x | Θ) = Θ        if x = H
    P(x | Θ) = 1 - Θ    if x = T

7 Maximum Likelihood Estimation
Given a sample x_1, x_2, ..., x_n, choose
    Θ_ML = argmax_{Θ ∈ Ω} L(Θ) = argmax_{Θ ∈ Ω} Σ_i log P(x_i | Θ)
For example, take the coin example: say x_1 ... x_n has Count(H) heads and (n - Count(H)) tails. Then
    L(Θ) = log ( Θ^{Count(H)} (1 - Θ)^{n - Count(H)} )
         = Count(H) log Θ + (n - Count(H)) log(1 - Θ)
We now have
    Θ_ML = Count(H) / n
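
As a quick sanity check, here is a minimal Python sketch (mine, not part of the lecture) of the closed-form estimate Θ_ML = Count(H)/n for the sample above:

    # Closed-form coin MLE: Theta_ML = Count(H) / n.
    sample = "HHTTHHHTHH"            # the example sample from the slide
    count_h = sample.count("H")
    theta_ml = count_h / len(sample)
    print(theta_ml)                  # 0.7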

8 A Second Example: Probabilistic Context-Free Grammars
X is the set of all parse trees generated by the underlying context-free grammar. Our sample is n trees T_1 ... T_n such that each T_i ∈ X.
R is the set of rules in the context-free grammar.
N is the set of non-terminals in the grammar.
Θ_r for r ∈ R is the parameter for rule r.
Let R(α) ⊆ R be the rules of the form α → β for some β.
The parameter space Ω is the set of Θ ∈ [0, 1]^{|R|} such that for all α ∈ N,
    Σ_{r ∈ R(α)} Θ_r = 1

9 We have
    P(T | Θ) = Π_{r ∈ R} Θ_r^{Count(T, r)}
where Count(T, r) is the number of times rule r is seen in the tree T, and so
    log P(T | Θ) = Σ_{r ∈ R} Count(T, r) log Θ_r

10 Maximum Likelihood Estimation for PCFGs
We have
    log P(T | Θ) = Σ_{r ∈ R} Count(T, r) log Θ_r
where Count(T, r) is the number of times rule r is seen in the tree T. And,
    L(Θ) = Σ_i log P(T_i | Θ) = Σ_i Σ_{r ∈ R} Count(T_i, r) log Θ_r
Solving Θ_ML = argmax_{Θ ∈ Ω} L(Θ) gives
    Θ_r = Σ_i Count(T_i, r) / Σ_i Σ_{s ∈ R(α)} Count(T_i, s)
where r is of the form α → β for some β.
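
To make the counting concrete, here is a small Python sketch of the relative-frequency update Θ_r = Count(r) / Count(rules with the same left-hand side). The tree representation (each tree as a list of (lhs, rhs) rules) and the toy grammar are my own illustration, not the lecture's:

    from collections import defaultdict

    # Toy fully observed treebank: each tree is a list of (lhs, rhs) rules.
    trees = [
        [("S", ("NP", "VP")), ("NP", ("dog",)), ("VP", ("barks",))],
        [("S", ("NP", "VP")), ("NP", ("cat",)), ("VP", ("sleeps",))],
    ]

    rule_count = defaultdict(float)
    lhs_count = defaultdict(float)
    for tree in trees:
        for lhs, rhs in tree:
            rule_count[(lhs, rhs)] += 1
            lhs_count[lhs] += 1

    # Theta_r = Count(r) / sum of counts of rules with the same left-hand side.
    theta = {r: c / lhs_count[r[0]] for r, c in rule_count.items()}
    print(theta[("NP", ("dog",))])   # 0.5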

11 Multinomial Distributions
X is a finite set, e.g., X = {dog, cat, the, saw}.
Our sample x_1, x_2, ..., x_n is drawn from X, e.g., x_1, x_2, x_3 = dog, the, saw.
The parameter Θ is a vector in R^m where m = |X|, e.g., Θ_1 = P(dog), Θ_2 = P(cat), Θ_3 = P(the), Θ_4 = P(saw).
The parameter space is
    Ω = { Θ : Σ_{i=1}^{m} Θ_i = 1 and Θ_i ≥ 0 for all i }
If our sample is x_1, x_2, x_3 = dog, the, saw, then
    L(Θ) = log P(x_1, x_2, x_3 = dog, the, saw | Θ) = log Θ_1 + log Θ_3 + log Θ_4

12 Models with Hidden Variables
Now say we have two sets X and Y, and a joint distribution P(x, y | Θ).
If we had fully observed data, (x_i, y_i) pairs, then
    L(Θ) = Σ_i log P(x_i, y_i | Θ)
If we have partially observed data, x_i examples only, then
    L(Θ) = Σ_i log P(x_i | Θ) = Σ_i log Σ_{y ∈ Y} P(x_i, y | Θ)

13 The EM (Expectation Maximization) algorithm is a method for finding
    Θ_ML = argmax_Θ Σ_i log Σ_{y ∈ Y} P(x_i, y | Θ)

14 The Three Coins Example
In the three coins example:
    Y = {H, T}
    X = {HHH, TTT, HTT, THH, HHT, TTH, HTH, THT}
    Θ = {λ, p1, p2}
and
    P(x, y | Θ) = P(y | Θ) P(x | y, Θ)
where
    P(y | Θ) = λ        if y = H
    P(y | Θ) = 1 - λ    if y = T
and
    P(x | y, Θ) = p1^h (1 - p1)^t    if y = H
    P(x | y, Θ) = p2^h (1 - p2)^t    if y = T
where h = number of heads in x, t = number of tails in x.

15 The Three Coins Example
Various probabilities can be calculated, for example:
    P(x = THT, y = H | Θ) = λ p1 (1 - p1)^2
    P(x = THT, y = T | Θ) = (1 - λ) p2 (1 - p2)^2
    P(x = THT | Θ) = P(x = THT, y = H | Θ) + P(x = THT, y = T | Θ)
                   = λ p1 (1 - p1)^2 + (1 - λ) p2 (1 - p2)^2
    P(y = H | x = THT, Θ) = P(x = THT, y = H | Θ) / P(x = THT | Θ)
                          = λ p1 (1 - p1)^2 / ( λ p1 (1 - p1)^2 + (1 - λ) p2 (1 - p2)^2 )
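
The same calculation in a few lines of Python (my own sketch; the parameter values below are arbitrary illustrative numbers, not values from the lecture):

    # Joint and posterior probabilities for the three-coins model, x = THT.
    def joint(x, y, lam, p1, p2):
        p = p1 if y == "H" else p2           # which biased coin was tossed
        prior = lam if y == "H" else 1.0 - lam
        h, t = x.count("H"), x.count("T")
        return prior * p ** h * (1.0 - p) ** t

    lam, p1, p2 = 0.3, 0.4, 0.6              # arbitrary illustrative values
    x = "THT"
    num = joint(x, "H", lam, p1, p2)
    den = num + joint(x, "T", lam, p1, p2)
    print(num / den)                         # P(y = H | x = THT, Theta) ~ 0.391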


19 The Three Coins Example
Fully observed data might look like:
    (HHH, H), (TTT, T), (HHH, H), (TTT, T), (HHH, H)
In this case the maximum likelihood estimates are:
    λ = 3/5    p1 = 9/9    p2 = 0/6

20 The Three Coins Example
Partially observed data might look like:
    HHH, TTT, HHH, TTT, HHH
How do we find the maximum likelihood parameters?

21 The Three Coins Example
Partially observed data might look like:
    HHH, TTT, HHH, TTT, HHH
If the current parameters are λ, p1, p2:
    P(y = H | x = HHH) = P(HHH, H) / ( P(HHH, H) + P(HHH, T) )
                       = λ p1^3 / ( λ p1^3 + (1 - λ) p2^3 )
    P(y = H | x = TTT) = P(TTT, H) / ( P(TTT, H) + P(TTT, T) )
                       = λ (1 - p1)^3 / ( λ (1 - p1)^3 + (1 - λ) (1 - p2)^3 )

22 The Three Coins Example
If the current parameters are λ, p1, p2:
    P(y = H | x = HHH) = λ p1^3 / ( λ p1^3 + (1 - λ) p2^3 )
    P(y = H | x = TTT) = λ (1 - p1)^3 / ( λ (1 - p1)^3 + (1 - λ) (1 - p2)^3 )
If λ = 0.3, p1 = 0.3, p2 = 0.6:
    P(y = H | x = HHH) ≈ 0.0508
    P(y = H | x = TTT) ≈ 0.6967

23 The Three Coins Example
After filling in the hidden variable for each example, the partially observed data might look like:
    (HHH, H)  P(y = H | HHH) ≈ 0.0508     (HHH, T)  P(y = T | HHH) ≈ 0.9492
    (TTT, H)  P(y = H | TTT) ≈ 0.6967     (TTT, T)  P(y = T | TTT) ≈ 0.3033
    (HHH, H)  P(y = H | HHH) ≈ 0.0508     (HHH, T)  P(y = T | HHH) ≈ 0.9492
    (TTT, H)  P(y = H | TTT) ≈ 0.6967     (TTT, T)  P(y = T | TTT) ≈ 0.3033
    (HHH, H)  P(y = H | HHH) ≈ 0.0508     (HHH, T)  P(y = T | HHH) ≈ 0.9492

24 New Estimates: The Three Coins Example
    (HHH, H)  P(y = H | HHH) ≈ 0.0508
    (HHH, T)  P(y = T | HHH) ≈ 0.9492
    (TTT, H)  P(y = H | TTT) ≈ 0.6967
    (TTT, T)  P(y = T | TTT) ≈ 0.3033
The new estimates are relative frequencies under these weighted counts:
    λ  = (3 × 0.0508 + 2 × 0.6967) / 5 ≈ 0.3092
    p1 = expected heads from Coin 1 / expected tosses of Coin 1
       = (3 × 0.0508 × 3 + 2 × 0.6967 × 0) / ( 3 × (3 × 0.0508 + 2 × 0.6967) ) ≈ 0.0987
    p2 = expected heads from Coin 2 / expected tosses of Coin 2
       = (3 × 0.9492 × 3 + 2 × 0.3033 × 0) / ( 3 × (3 × 0.9492 + 2 × 0.3033) ) ≈ 0.8244

25 The Three Coins Example: Summary
Begin with parameters λ = 0.3, p1 = 0.3, p2 = 0.6.
Fill in the hidden variables, using
    P(y = H | x = HHH) ≈ 0.0508
    P(y = H | x = TTT) ≈ 0.6967
Re-estimate the parameters to be λ ≈ 0.3092, p1 ≈ 0.0987, p2 ≈ 0.8244.
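
The whole iteration fits in a short Python sketch (my own, not from the lecture; the printed values are computed here rather than copied from the slides):

    # One EM iteration for the three-coins model on the partially observed
    # data HHH, TTT, HHH, TTT, HHH, starting from lambda = 0.3, p1 = 0.3, p2 = 0.6.
    data = ["HHH", "TTT", "HHH", "TTT", "HHH"]
    lam, p1, p2 = 0.3, 0.3, 0.6

    def posterior_heads(x, lam, p1, p2):
        # P(y = H | x, Theta): probability that Coin 0 came up heads for trial x.
        h, t = x.count("H"), x.count("T")
        num = lam * p1 ** h * (1 - p1) ** t
        den = num + (1 - lam) * p2 ** h * (1 - p2) ** t
        return num / den

    # E-step: fill in the hidden Coin 0 outcome for every trial.
    q = [posterior_heads(x, lam, p1, p2) for x in data]

    # M-step: re-estimate the parameters from the expected counts.
    lam_new = sum(q) / len(data)
    p1_new = sum(qi * x.count("H") for qi, x in zip(q, data)) / (3 * sum(q))
    p2_new = sum((1 - qi) * x.count("H") for qi, x in zip(q, data)) / (3 * sum(1 - qi for qi in q))
    print(lam_new, p1_new, p2_new)   # approximately 0.3092, 0.0987, 0.8244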

26 [Table of EM iterations (λ, p1, p2 and the per-example posterior probabilities at each iteration) not reproduced here.] The coin example for y = {HHH, TTT, HHH, TTT}. The solution that EM reaches is intuitively correct: the coin-tosser has two coins, one which always shows up heads, the other which always shows tails, and is picking between them with equal probability (λ = 0.5). The posterior probabilities p_i show that we are certain that coin 1 (tail-biased) generated y_2 and y_4, whereas coin 2 generated y_1 and y_3.

27 [Table of EM iterations not reproduced here.] The coin example for {HHH, TTT, HHH, TTT, HHH}. λ is now 0.4, indicating that the coin-tosser has probability 0.4 of selecting the tail-biased coin.

28 [Table of EM iterations not reproduced here.] The coin example for y = {HHT, TTT, HHH, TTT}. EM selects a tails-only coin, and a coin which is heavily heads-biased. It's certain that y_1 and y_3 were generated by coin 2, as they contain heads. y_2 and y_4 could have been generated by either coin, but coin 1 is far more likely.

29 [Table of EM iterations not reproduced here.] The coin example for y = {HHH, TTT, HHH, TTT}, with p_1 and p_2 initialised to the same value. EM is stuck at a saddle point.

30 [Table of EM iterations not reproduced here.] The coin example for y = {HHH, TTT, HHH, TTT}. If we initialise p_1 and p_2 to be a small amount away from the saddle point p_1 = p_2, the algorithm diverges from the saddle point and eventually reaches the global maximum.


32 The EM Algorithm
Θ^t is the parameter vector at the t-th iteration.
Choose Θ^0 (at random, or using various heuristics).
The iterative procedure is defined as
    Θ^t = argmax_Θ Q(Θ, Θ^{t-1})
where
    Q(Θ, Θ^{t-1}) = Σ_i Σ_{y ∈ Y} P(y | x_i, Θ^{t-1}) log P(x_i, y | Θ)

33 The EM Algorithm
The iterative procedure is defined as Θ^t = argmax_Θ Q(Θ, Θ^{t-1}), where
    Q(Θ, Θ^{t-1}) = Σ_i Σ_{y ∈ Y} P(y | x_i, Θ^{t-1}) log P(x_i, y | Θ)
Key points:
Intuition: fill in the hidden variables y according to P(y | x_i, Θ).
EM is guaranteed to converge to a local maximum, or saddle-point, of the likelihood function.
In general, if
    argmax_Θ Σ_i log P(x_i, y_i | Θ)
has a simple (analytic) solution, then
    argmax_Θ Σ_i Σ_y P(y | x_i, Θ) log P(x_i, y | Θ)
also has a simple (analytic) solution.
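
As an illustration of the definition itself (rather than of the analytic M-step used above), here is a hedged Python sketch that takes the update Θ^t = argmax_Θ Q(Θ, Θ^{t-1}) literally and maximizes Q over a coarse grid of three-coins parameter values. The grid search and the parameter grid are my own illustrative choices, not part of the lecture:

    import math

    data = ["HHH", "TTT", "HHH", "TTT", "HHH"]
    Y = ["H", "T"]
    grid = [i / 10 for i in range(1, 10)]                     # 0.1 .. 0.9
    thetas = [(l, p1, p2) for l in grid for p1 in grid for p2 in grid]

    def joint(x, y, theta):
        lam, p1, p2 = theta
        p, prior = (p1, lam) if y == "H" else (p2, 1 - lam)
        h, t = x.count("H"), x.count("T")
        return prior * p ** h * (1 - p) ** t

    theta = (0.3, 0.3, 0.6)
    for _ in range(5):
        # E-step: posteriors P(y | x, theta_{t-1}) for every training example.
        post = []
        for x in data:
            z = sum(joint(x, y, theta) for y in Y)
            post.append({y: joint(x, y, theta) / z for y in Y})
        # M-step: pick the grid point maximizing Q(theta, theta_{t-1}).
        def q(th):
            return sum(p[y] * math.log(joint(x, y, th))
                       for x, p in zip(data, post) for y in Y)
        theta = max(thetas, key=q)

    print(theta)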

34 Overview
The EM algorithm in general form
The EM algorithm for hidden Markov models (brute force)
The EM algorithm for hidden Markov models (dynamic programming)

35 The Structure of Hidden Markov Models
We have N states, states 1 ... N.
Without loss of generality, take N to be the final or stop state.
We have an alphabet K, for example K = {a, b}.
Parameter λ_i for i = 1 ... N is the probability of starting in state i.
Parameter a_{i,j} for i = 1 ... (N-1) and j = 1 ... N is the probability of state j following state i.
Parameter b_i(o) for i = 1 ... (N-1) and o ∈ K is the probability of state i emitting symbol o.

36 An Example
Take N = 3 states. The states are {1, 2, 3}. The final state is state 3. The alphabet is K = {the, dog}.
The distribution over the initial state is λ_1 = 1.0, λ_2 = 0, λ_3 = 0.
The parameters a_{i,j} form a table with rows i = 1, 2 and columns j = 1, 2, 3 (numerical values not reproduced here).
The parameters b_i(o) form a table with rows i = 1, 2 and columns o = the, dog (numerical values not reproduced here).

37 A Generative Process
Pick the start state s_1 to be state i, for i = 1 ... N, with probability λ_i. Set t = 1.
Repeat while the current state s_t is not the stop state (N):
    Emit a symbol o_t ∈ K with probability b_{s_t}(o_t).
    Pick the next state s_{t+1} to be state j with probability a_{s_t, j}. Set t = t + 1.
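
This process translates directly into a Python sketch (my own; the transition and emission values below are made up for illustration and are not the numbers from the example slide):

    import random

    # Illustrative HMM with N = 3 states (state 3 is the stop state), K = {the, dog}.
    N = 3
    lam = {1: 1.0, 2: 0.0, 3: 0.0}                                    # start probabilities
    a = {1: {1: 0.2, 2: 0.5, 3: 0.3}, 2: {1: 0.4, 2: 0.3, 3: 0.3}}    # transitions
    b = {1: {"the": 0.9, "dog": 0.1}, 2: {"the": 0.2, "dog": 0.8}}    # emissions

    def pick(dist):
        # Sample a key from a {value: probability} dictionary.
        return random.choices(list(dist), weights=dist.values())[0]

    def generate():
        s, output = pick(lam), []
        while s != N:                      # the stop state N ends the sequence
            output.append(pick(b[s]))      # emit a symbol from state s
            s = pick(a[s])                 # move to the next state
        return output

    print(generate())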

38 Probabilities Over Sequences
An output sequence is a sequence of observations o_1 ... o_T where each o_i ∈ K, e.g.
    the dog the dog dog the
A state sequence is a sequence of states s_1 ... s_T where each s_i ∈ {1 ... N}, e.g.
    1 2 1 2 2 1
The HMM defines a probability for each state/output sequence pair, e.g.
    the/1 dog/2 the/1 dog/2 the/2 dog/1
has probability
    λ_1 b_1(the) a_{1,2} b_2(dog) a_{2,1} b_1(the) a_{1,2} b_2(dog) a_{2,2} b_2(the) a_{2,1} b_1(dog) a_{1,3}
Formally:
    P(s_1 ... s_T, o_1 ... o_T) = λ_{s_1} ( Π_{i=2}^{T} P(s_i | s_{i-1}) ) ( Π_{i=1}^{T} P(o_i | s_i) ) P(N | s_T)
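
A small Python sketch (mine, repeating the same made-up parameters as the sampling sketch above so it stands alone) that scores a state/output pair with exactly this product:

    # Joint probability of a state/output pair under the formula above.
    N = 3                                                             # stop state
    lam = {1: 1.0, 2: 0.0, 3: 0.0}
    a = {1: {1: 0.2, 2: 0.5, 3: 0.3}, 2: {1: 0.4, 2: 0.3, 3: 0.3}}    # made-up values
    b = {1: {"the": 0.9, "dog": 0.1}, 2: {"the": 0.2, "dog": 0.8}}

    def joint_prob(states, outputs):
        p = lam[states[0]]
        for t, (s, o) in enumerate(zip(states, outputs)):
            p *= b[s][o]                                   # emission at time t
            nxt = states[t + 1] if t + 1 < len(states) else N
            p *= a[s][nxt]                                 # transition (last one to N)
        return p

    print(joint_prob([1, 2, 1, 2, 2, 1], "the dog the dog the dog".split()))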

39 A Hidden Variable Problem
We have an HMM with N = 3, K = {e, f, g, h}.
We see the following output sequences in training data:
    e g
    e h
    f h
    f g
How would you choose the parameter values for λ_i, a_{i,j}, and b_i(o)?

40 Another Hidden Variable Problem
We have an HMM with N = 3, K = {e, f, g, h}.
We see the following output sequences in training data:
    e g h e h f h g f g g e h
(the grouping of these symbols into separate sequences has been lost in transcription)
How would you choose the parameter values for λ_i, a_{i,j}, and b_i(o)?

41 A Reminder: Models with Hidden Variables
Now say we have two sets X and Y, and a joint distribution P(x, y | Θ).
If we had fully observed data, (x_i, y_i) pairs, then
    L(Θ) = Σ_i log P(x_i, y_i | Θ)
If we have partially observed data, x_i examples only, then
    L(Θ) = Σ_i log P(x_i | Θ) = Σ_i log Σ_{y ∈ Y} P(x_i, y | Θ)

42 Hidden Markov Models as a Hidden Variable Problem
We have two sets X and Y, and a joint distribution P(x, y | Θ).
In hidden Markov models:
    each x ∈ X is an output sequence o_1 ... o_T
    each y ∈ Y is a state sequence s_1 ... s_T

43 Maximum Likelihood Estimates
We have an HMM with N = 3, K = {e, f, g, h}.
We see the following paired sequences in training data:
    e/1 g/2
    e/1 h/2
    f/1 h/2
    f/1 g/2
Maximum likelihood estimates:
    λ_1 = 1.0, λ_2 = 0.0, λ_3 = 0.0
For the parameters a_{i,j} (relative frequencies of transitions):
            j=1   j=2   j=3
    i=1     0.0   1.0   0.0
    i=2     0.0   0.0   1.0
For the parameters b_i(o) (relative frequencies of emissions):
            o=e   o=f   o=g   o=h
    i=1     0.5   0.5   0.0   0.0
    i=2     0.0   0.0   0.5   0.5
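
These relative-frequency estimates can be checked with a short Python sketch (my own; the data encoding as lists of (symbol, state) pairs is an assumption of the sketch):

    from collections import defaultdict

    # Fully observed HMM training data: each example is a list of (symbol, state)
    # pairs, and every sequence implicitly ends in the stop state N = 3.
    N = 3
    data = [[("e", 1), ("g", 2)], [("e", 1), ("h", 2)],
            [("f", 1), ("h", 2)], [("f", 1), ("g", 2)]]

    start, trans, emit = defaultdict(float), defaultdict(float), defaultdict(float)
    for seq in data:
        start[seq[0][1]] += 1
        for t, (o, s) in enumerate(seq):
            emit[(s, o)] += 1
            nxt = seq[t + 1][1] if t + 1 < len(seq) else N
            trans[(s, nxt)] += 1

    lam = {i: start[i] / len(data) for i in range(1, N + 1)}
    a = {(i, j): trans[(i, j)] / sum(trans[(i, k)] for k in range(1, N + 1))
         for i in range(1, N) for j in range(1, N + 1)}
    b = {(i, o): emit[(i, o)] / sum(emit[(i, x)] for x in "efgh")
         for i in range(1, N) for o in "efgh"}
    print(lam[1], a[(1, 2)], b[(1, "e")])   # 1.0 1.0 0.5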

44 The Likelihood Function for HMMs: Fully Observed Data
Say (x, y) = (o_1 ... o_T, s_1 ... s_T), and
    f(i, j, x, y) = number of times state j follows state i in (x, y)
    f(i, x, y) = number of times state i is the initial state in (x, y) (1 or 0)
    f(i, o, x, y) = number of times state i is paired with observation o
Then
    P(x, y) = Π_{i ∈ {1...N-1}} λ_i^{f(i, x, y)}
              × Π_{i ∈ {1...N-1}, j ∈ {1...N}} a_{i,j}^{f(i, j, x, y)}
              × Π_{i ∈ {1...N-1}, o ∈ K} b_i(o)^{f(i, o, x, y)}

45 The Likelihood Function for HMMs: Fully Observed Data
If we have training examples (x_l, y_l) for l = 1 ... m, then
    L(Θ) = Σ_{l=1}^{m} log P(x_l, y_l)
         = Σ_{l=1}^{m} [ Σ_{i ∈ {1...N-1}} f(i, x_l, y_l) log λ_i
                         + Σ_{i ∈ {1...N-1}, j ∈ {1...N}} f(i, j, x_l, y_l) log a_{i,j}
                         + Σ_{i ∈ {1...N-1}, o ∈ K} f(i, o, x_l, y_l) log b_i(o) ]

46 Maximizing this function gives the maximum-likelihood estimates:
    λ_i = Σ_l f(i, x_l, y_l) / Σ_l Σ_k f(k, x_l, y_l)
    a_{i,j} = Σ_l f(i, j, x_l, y_l) / Σ_l Σ_k f(i, k, x_l, y_l)
    b_i(o) = Σ_l f(i, o, x_l, y_l) / Σ_l Σ_{o' ∈ K} f(i, o', x_l, y_l)

47 The Likelihood Function for HMMs: Partially Observed Data
If we have training examples x_l for l = 1 ... m, then
    L(Θ) = Σ_{l=1}^{m} log Σ_y P(x_l, y)
and
    Q(Θ, Θ^{t-1}) = Σ_{l=1}^{m} Σ_y P(y | x_l, Θ^{t-1}) log P(x_l, y | Θ)

48 We have
    Q(Θ, Θ^{t-1}) = Σ_{l=1}^{m} Σ_y P(y | x_l, Θ^{t-1}) [ Σ_{i ∈ {1...N-1}} f(i, x_l, y) log λ_i
                      + Σ_{i ∈ {1...N-1}, j ∈ {1...N}} f(i, j, x_l, y) log a_{i,j}
                      + Σ_{i ∈ {1...N-1}, o ∈ K} f(i, o, x_l, y) log b_i(o) ]
                  = Σ_{l=1}^{m} [ Σ_{i ∈ {1...N-1}} g(i, x_l) log λ_i
                      + Σ_{i ∈ {1...N-1}, j ∈ {1...N}} g(i, j, x_l) log a_{i,j}
                      + Σ_{i ∈ {1...N-1}, o ∈ K} g(i, o, x_l) log b_i(o) ]
where each g is an expected count:
    g(i, x_l) = Σ_y P(y | x_l, Θ^{t-1}) f(i, x_l, y)
    g(i, j, x_l) = Σ_y P(y | x_l, Θ^{t-1}) f(i, j, x_l, y)
    g(i, o, x_l) = Σ_y P(y | x_l, Θ^{t-1}) f(i, o, x_l, y)

49 Maximizing this function gives the EM updates:
    λ_i = Σ_l g(i, x_l) / Σ_l Σ_k g(k, x_l)
    a_{i,j} = Σ_l g(i, j, x_l) / Σ_l Σ_k g(i, k, x_l)
    b_i(o) = Σ_l g(i, o, x_l) / Σ_l Σ_{o' ∈ K} g(i, o', x_l)
Compare these to the maximum likelihood estimates in the fully observed case:
    λ_i = Σ_l f(i, x_l, y_l) / Σ_l Σ_k f(k, x_l, y_l)
    a_{i,j} = Σ_l f(i, j, x_l, y_l) / Σ_l Σ_k f(i, k, x_l, y_l)
    b_i(o) = Σ_l f(i, o, x_l, y_l) / Σ_l Σ_{o' ∈ K} f(i, o', x_l, y_l)

50 A Hidden Variable Problem
We have an HMM with N = 3, K = {e, f, g, h}.
We see the following output sequences in training data:
    e g
    e h
    f h
    f g
How would you choose the parameter values for λ_i, a_{i,j}, and b_i(o)?

51 Four possible state sequences for the first example:
    e/1 g/1
    e/1 g/2
    e/2 g/1
    e/2 g/2

52 Four possible state sequences for the first example:
    e/1 g/1
    e/1 g/2
    e/2 g/1
    e/2 g/2
Each state sequence has a different probability:
    e/1 g/1    λ_1 a_{1,1} a_{1,3} b_1(e) b_1(g)
    e/1 g/2    λ_1 a_{1,2} a_{2,3} b_1(e) b_2(g)
    e/2 g/1    λ_2 a_{2,1} a_{1,3} b_2(e) b_1(g)
    e/2 g/2    λ_2 a_{2,2} a_{2,3} b_2(e) b_2(g)
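
Brute-force EM works exactly this way: enumerate every state sequence, score it, and normalize. Here is a hedged Python sketch of that E-step for one output sequence; the transition and emission values are made up (the slide's actual tables are not reproduced in this transcription):

    import itertools

    # Brute-force posterior over state sequences for one output sequence.
    N = 3                                                       # stop state
    lam = {1: 0.35, 2: 0.3, 3: 0.35}
    a = {1: {1: 0.3, 2: 0.4, 3: 0.3}, 2: {1: 0.3, 2: 0.4, 3: 0.3}}   # made-up values
    b = {1: {"e": 0.4, "f": 0.2, "g": 0.2, "h": 0.2},
         2: {"e": 0.2, "f": 0.2, "g": 0.3, "h": 0.3}}

    def joint(states, outputs):
        p = lam[states[0]]
        for t, (s, o) in enumerate(zip(states, outputs)):
            nxt = states[t + 1] if t + 1 < len(states) else N
            p *= b[s][o] * a[s][nxt]
        return p

    outputs = ["e", "g"]
    seqs = list(itertools.product([1, 2], repeat=len(outputs)))     # 4 sequences
    scores = {y: joint(list(y), outputs) for y in seqs}
    z = sum(scores.values())
    posteriors = {y: s / z for y, s in scores.items()}              # P(y | x, Theta)
    print(posteriors)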

53 A Hidden Variable Problem
Say we have initial parameter values λ_1 = 0.35, λ_2 = 0.3, λ_3 = 0.35, together with tables for a_{i,j} (i = 1, 2; j = 1, 2, 3) and b_i(o) (i = 1, 2; o = e, f, g, h) (the numerical values are not reproduced here).
Each state sequence then has a different probability:
    e/1 g/1    λ_1 a_{1,1} a_{1,3} b_1(e) b_1(g)
    e/1 g/2    λ_1 a_{1,2} a_{2,3} b_1(e) b_2(g)
    e/2 g/1    λ_2 a_{2,1} a_{1,3} b_2(e) b_1(g)
    e/2 g/2    λ_2 a_{2,2} a_{2,3} b_2(e) b_2(g)

54 A Hidden Variable Problem
Each state sequence has a different probability:
    e/1 g/1    λ_1 a_{1,1} a_{1,3} b_1(e) b_1(g)
    e/1 g/2    λ_1 a_{1,2} a_{2,3} b_1(e) b_2(g)
    e/2 g/1    λ_2 a_{2,1} a_{1,3} b_2(e) b_1(g)
    e/2 g/2    λ_2 a_{2,2} a_{2,3} b_2(e) b_2(g)
(the numerical values are not reproduced here)
Each state sequence also has a different conditional probability, obtained by normalizing, e.g.
    P(1 1 | e g, Θ) = P(e g, 1 1 | Θ) / Σ_y P(e g, y | Θ)
giving:
    e/1 g/1    P(1 1 | e g, Θ) = 0.28
    e/1 g/2    P(1 2 | e g, Θ) = 0.42
    e/2 g/1    P(2 1 | e g, Θ) = 0.18
    e/2 g/2    P(2 2 | e g, Θ) = 0.12

55 Fill in the hidden values for (e g), (e h), (f h), (f g):
    e/1 g/1    P(1 1 | e g, Θ) = 0.28
    e/1 g/2    P(1 2 | e g, Θ) = 0.42
    e/2 g/1    P(2 1 | e g, Θ) = 0.18
    e/2 g/2    P(2 2 | e g, Θ) = 0.12
and similarly for (e h), (f h) and (f g), e.g. P(2 2 | f g, Θ) = 0.162. (The remaining numerical values are not reproduced here.)

56 Calculate the expected counts Σ_l g(i, x_l) and Σ_l g(i, j, x_l) for each i and j; for example
    Σ_l g(3, x_l) = 0.0
    Σ_l g(1, 2, x_l) = 1.72
(the remaining numerical values are not reproduced here)

57 Calculate the expected counts Σ_l g(i, o, x_l) for each i and o; for example
    Σ_l g(1, e, x_l) = 1.4
    Σ_l g(2, e, x_l) = 0.6
(the remaining numerical values are not reproduced here)

58 Calculate the new estimates:
    λ_1 = Σ_l g(1, x_l) / ( Σ_l g(1, x_l) + Σ_l g(2, x_l) + Σ_l g(3, x_l) )
    λ_2 = ...,  λ_3 = 0
    a_{1,1} = Σ_l g(1, 1, x_l) / ( Σ_l g(1, 1, x_l) + Σ_l g(1, 2, x_l) + Σ_l g(1, 3, x_l) ) = 0.91 / ...
and similarly for the remaining entries of a_{i,j} (i = 1, 2; j = 1, 2, 3) and b_i(o) (i = 1, 2; o = e, f, g, h). (The numerical values of the resulting tables are not reproduced here.)

59 Iterating this 3 times gives λ_1 = ..., λ_2 = ..., λ_3 = 0, together with converged tables for a_{i,j} (i = 1, 2; j = 1, 2, 3) and b_i(o) (i = 1, 2; o = e, f, g, h). (The numerical values are not reproduced here.)

60 Overview
The EM algorithm in general form
The EM algorithm for hidden Markov models (brute force)
The EM algorithm for hidden Markov models (dynamic programming)

61 The Forward-Backward or Baum-Welch Algorithm
The aim is to (efficiently!) calculate the expected counts:
    g(i, x_l) = Σ_y P(y | x_l, Θ^{t-1}) f(i, x_l, y)
    g(i, j, x_l) = Σ_y P(y | x_l, Θ^{t-1}) f(i, j, x_l, y)
    g(i, o, x_l) = Σ_y P(y | x_l, Θ^{t-1}) f(i, o, x_l, y)

62 The Forward-Backward or Baum-Welch Algorithm
Suppose we could calculate the following quantities, given an input sequence o_1 ... o_T:
    α_i(t) = P(o_1 ... o_{t-1}, s_t = i | Θ)    (forward probabilities)
    β_i(t) = P(o_t ... o_T | s_t = i, Θ)        (backward probabilities)
The probability of being in state i at time t is
    p_t(i) = P(s_t = i | o_1 ... o_T, Θ) = P(s_t = i, o_1 ... o_T | Θ) / P(o_1 ... o_T | Θ)
           = α_i(t) β_i(t) / P(o_1 ... o_T | Θ)
Also,
    P(o_1 ... o_T | Θ) = Σ_i α_i(t) β_i(t)    for any t

63 Expected Initial Counts
As before, g(i, o_1 ... o_T) = expected number of times state i is the initial state (the state at time 1).
We can calculate this as
    g(i, o_1 ... o_T) = p_1(i)

64 Expected Emission Counts
As before, g(i, o, o_1 ... o_T) = expected number of times state i emits the symbol o.
We can calculate this as
    g(i, o, o_1 ... o_T) = Σ_{t : o_t = o} p_t(i)

65 The Forward-Backward or Baum-Welch Algorithm
Suppose we could calculate the following quantities, given an input sequence o_1 ... o_T:
    α_i(t) = P(o_1 ... o_{t-1}, s_t = i | Θ)    (forward probabilities)
    β_i(t) = P(o_t ... o_T | s_t = i, Θ)        (backward probabilities)
The probability of being in state i at time t and in state j at time t+1 is
    p_t(i, j) = P(s_t = i, s_{t+1} = j | o_1 ... o_T, Θ) = P(s_t = i, s_{t+1} = j, o_1 ... o_T | Θ) / P(o_1 ... o_T | Θ)
              = α_i(t) a_{i,j} b_i(o_t) β_j(t+1) / P(o_1 ... o_T | Θ)
Also, as before,
    P(o_1 ... o_T | Θ) = Σ_i α_i(t) β_i(t)    for any t

66 Expected Transition Counts
As before, g(i, j, o_1 ... o_T) = expected number of times state j follows state i.
We can calculate this as
    g(i, j, o_1 ... o_T) = Σ_t p_t(i, j)

67 Recursive Definitions for Forward Probabilities
Given an input sequence o_1 ... o_T:
    α_i(t) = P(o_1 ... o_{t-1}, s_t = i | Θ)    (forward probabilities)
Base case:
    α_i(1) = λ_i    for all i
Recursive case:
    α_j(t+1) = Σ_i α_i(t) a_{i,j} b_i(o_t)    for all j = 1 ... N and t = 1 ... T

68 Recursive Definitions for Backward Probabilities
Given an input sequence o_1 ... o_T:
    β_i(t) = P(o_t ... o_T | s_t = i, Θ)    (backward probabilities)
Base case:
    β_i(T+1) = 1    for i = N
    β_i(T+1) = 0    for i ≠ N
Recursive case:
    β_i(t) = Σ_j a_{i,j} b_i(o_t) β_j(t+1)    for all i = 1 ... N and t = 1 ... T
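
The two recursions translate directly into Python. Below is a hedged sketch (again with made-up transition and emission values, not the lecture's) that computes the forward and backward tables and the state posteriors p_t(i):

    # Forward-backward for a small HMM; states 1..N-1 emit, state N = 3 is the stop state.
    N = 3
    lam = {1: 0.35, 2: 0.3, 3: 0.35}
    a = {1: {1: 0.3, 2: 0.4, 3: 0.3}, 2: {1: 0.3, 2: 0.4, 3: 0.3}}   # made-up values
    b = {1: {"e": 0.4, "f": 0.2, "g": 0.2, "h": 0.2},
         2: {"e": 0.2, "f": 0.2, "g": 0.3, "h": 0.3}}

    def forward_backward(obs):
        T = len(obs)
        states = list(range(1, N))                         # emitting states 1 .. N-1
        # alpha[t][i] = P(o_1 .. o_{t-1}, s_t = i | Theta)
        alpha = {1: {i: lam[i] for i in states}}
        for t in range(1, T + 1):
            alpha[t + 1] = {j: sum(alpha[t][i] * a[i][j] * b[i][obs[t - 1]] for i in states)
                            for j in range(1, N + 1)}
        # beta[t][i] = P(o_t .. o_T | s_t = i, Theta)
        beta = {T + 1: {i: (1.0 if i == N else 0.0) for i in range(1, N + 1)}}
        for t in range(T, 0, -1):
            beta[t] = {i: sum(a[i][j] * b[i][obs[t - 1]] * beta[t + 1][j]
                              for j in range(1, N + 1)) for i in states}
            beta[t][N] = 0.0                               # the stop state never emits
        z = sum(alpha[1][i] * beta[1][i] for i in states)  # P(o_1 .. o_T | Theta)
        p = {t: {i: alpha[t][i] * beta[t][i] / z for i in states} for t in range(1, T + 1)}
        return p

    print(forward_backward(["e", "g"]))                    # p_t(i) for t = 1, 2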

69 Overview
The EM algorithm in general form (more about the 3 coin example)
The EM algorithm for hidden Markov models (brute force)
The EM algorithm for hidden Markov models (dynamic programming)
Briefly: The EM algorithm for PCFGs

70 EM for Probabilistic Context-Free Grammars
A PCFG defines a distribution P(S, T | Θ) over tree/sentence pairs (S, T).
If we had tree/sentence pairs (fully observed data), then
    L(Θ) = Σ_i log P(S_i, T_i | Θ)
Say we have sentences only, S_1 ... S_n; the trees are hidden variables. Then
    L(Θ) = Σ_i log Σ_T P(S_i, T | Θ)

71 EM for Probabilistic Context-Free Grammars
Say we have sentences only, S_1 ... S_n; the trees are hidden variables. Then
    L(Θ) = Σ_i log Σ_T P(S_i, T | Θ)
The EM algorithm is then Θ^t = argmax_Θ Q(Θ, Θ^{t-1}), where
    Q(Θ, Θ^{t-1}) = Σ_i Σ_T P(T | S_i, Θ^{t-1}) log P(S_i, T | Θ)

72 Remember:
    log P(S_i, T | Θ) = Σ_{r ∈ R} Count(S_i, T, r) log Θ_r
where Count(S, T, r) is the number of times rule r is seen in the sentence/tree pair (S, T). So
    Q(Θ, Θ^{t-1}) = Σ_i Σ_T P(T | S_i, Θ^{t-1}) log P(S_i, T | Θ)
                  = Σ_i Σ_T P(T | S_i, Θ^{t-1}) Σ_{r ∈ R} Count(S_i, T, r) log Θ_r
                  = Σ_i Σ_{r ∈ R} Count(S_i, r) log Θ_r
where
    Count(S_i, r) = Σ_T P(T | S_i, Θ^{t-1}) Count(S_i, T, r)
are the expected counts.

73 Solving Θ^t = argmax_{Θ ∈ Ω} Q(Θ, Θ^{t-1}) gives
    Θ_{α → β} = Σ_i Count(S_i, α → β) / Σ_i Σ_{s ∈ R(α)} Count(S_i, s)
There are efficient algorithms for calculating
    Count(S_i, r) = Σ_T P(T | S_i, Θ^{t-1}) Count(S_i, T, r)
for a PCFG: see (Baker 1979), called the Inside-Outside Algorithm. See also Manning and Schütze.
