Natural Language Processing

1 SFU NatLangLab Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University October 20, 2017

2 Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University Part 1: Introducing Hidden Markov Models 1

3 Modelling pairs of sequences
Input: sequence of words; Output: sequence of labels
Input: British left waffles on Falkland Islands
Output1: N N V P N N
Output2: N V N P N N
Tags: N = Noun, e.g. islands; V = Verb, e.g. leave, left; P = Preposition, e.g. on

4 Modelling pairs of sequences
Input: sequence of words; Output: sequence of labels
Input: British left waffles on Falkland Islands
Output1: N N V P N N
Output2: N V N P N N
3 states: S = {N, V, P}
Input sequence: x_1, x_2, ..., x_n
Output sequence: t_1, t_2, ..., t_n where t_i ∈ S
How many output sequences? |S|^n

5 Modelling pairs of sequences
Input: sequence of characters; Output: sequence of labels
Input: 北京大学生比赛 (7 chars)
Output1: BIBIIBI (7 labels)
Output2: BIIIBBI (7 labels)
Labels: B = Begin word, I = Inside word
BIBIIBI: 北京 大学生 比赛 (Beijing student competition)
BIIIBBI: 北京大学 生 比赛 (Peking University Health Competition)

6 Hidden Markov Models
Input: x; Output space: Y(x); Output: y ∈ Y(x)
We want to learn a function f such that f(x) = y

7 Hidden Markov Models: Conditional model
Construct the function f using a conditional probability: f(x) = argmax_{y ∈ Y(x)} p(y | x)
We can construct this function f using two principles:
Discriminative learning: find the best output y given input x
Generative modelling: model the joint probability p(x, y) to find p(y | x)

8 Hidden Markov Models: Generative Model
Start from the joint probability p(x, y): p(x, y) = p(y) p(x | y)
Also: p(x, y) = p(x) p(y | x)
Bayes Rule: p(y | x) = p(y) p(x | y) / p(x)

9 Hidden Markov Models: Generative Model
Bayes Rule: p(y | x) = p(y) p(x | y) / p(x)
where: p(x) = Σ_{y ∈ Y(x)} p(x, y) = Σ_{y ∈ Y(x)} p(y) p(x | y)
So using a generative model, we can find the best output y using:
p(y | x) = p(y) p(x | y) / Σ_{y ∈ Y(x)} p(y) p(x | y)

10 Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University Part 2: Algorithms for Hidden Markov Models 9

11 Hidden Markov Model
Model θ = (π_i, a_{i,j}, b_i(o))
π_i = p(i): starting at state i
a_{i,j} = p(j | i): transition from state i to state j
b_i(o) = p(o | i): output o at state i

12 Hidden Markov Model Algorithms HMM as parser: compute the best sequence of states for a given observation sequence. HMM as language model: compute probability of given observation sequence. HMM as learner: given a corpus of observation sequences, learn its distribution, i.e. learn the parameters of the HMM from the corpus. Learning from a set of observations with the sequence of states provided (states are not hidden) [Supervised Learning] Learning from a set of observations without any state information. [Unsupervised Learning] 11

13 HMM as Parser
Model: π = (A: 0.25, N: 0.75); transition table a_{i,j} over the states {A, N}; emission table b_i(o) over the words {clown, killer, problem, crazy} (table entries shown on the slide).
The task: for a given observation sequence, find the most likely state sequence.
a_{i,j} = p(j | i) and b_i(o) = p(o | i)

14 HMM as Parser
Find the most likely sequence of states for killer clown.
Score every possible sequence of states: AA, AN, NN, NA
P(killer clown, AA) = π_A b_A(killer) a_{A,A} b_A(clown) = 0.0
P(killer clown, AN) = π_A b_A(killer) a_{A,N} b_N(clown) = 0.0
P(killer clown, NN) = π_N b_N(killer) a_{N,N} b_N(clown) = 0.045
P(killer clown, NA) = π_N b_N(killer) a_{N,A} b_A(clown) = 0.0
Pick the state sequence with highest probability (NN = 0.045).
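
This brute-force scoring can be written in a few lines of Python. A minimal sketch, not the lecture's code: the numeric parameter values below are placeholders (the slide's full probability tables are not reproduced in this transcription), chosen only so that the A state emits nothing but crazy, as in the worked example.

    from itertools import product

    # Hypothetical HMM parameters over states A and N; the numbers are placeholders.
    pi = {"A": 0.25, "N": 0.75}                              # start probabilities
    a = {("A", "A"): 0.1, ("A", "N"): 0.9,
         ("N", "A"): 0.4, ("N", "N"): 0.6}                   # transition probabilities
    b = {"A": {"killer": 0.0, "clown": 0.0, "crazy": 1.0, "problem": 0.0},
         "N": {"killer": 0.3, "clown": 0.4, "crazy": 0.0, "problem": 0.3}}  # emissions

    def joint_prob(words, states):
        """P(words, states) = pi[s1] b_{s1}(o1) * prod_t a_{s_{t-1},s_t} b_{s_t}(o_t)."""
        p = pi[states[0]] * b[states[0]][words[0]]
        for t in range(1, len(words)):
            p *= a[(states[t - 1], states[t])] * b[states[t]][words[t]]
        return p

    words = ["killer", "clown"]
    scores = {seq: joint_prob(words, seq) for seq in product("AN", repeat=len(words))}
    print(max(scores, key=scores.get), scores)   # NN gets the highest score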

15 HMM as Parser
As we have seen, for an input of length 2 and an HMM with 2 states there are 2^2 possible state sequences.
In general, if we have q states and an input of length T, there are q^T possible state sequences.
Using our example HMM, for the input killer crazy clown problem we will have 2^4 possible state sequences to score.
Our naive algorithm takes exponential time to find the best state sequence for a given input.
The Viterbi algorithm uses dynamic programming to provide the best state sequence with a time complexity of q^2 · T.

16 Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University Part 3: Viterbi Algorithm for HMMs 15

17 Viterbi Algorithm for HMMs
For an input of length T: o_1, ..., o_T, we want to find the sequence of states s_1, ..., s_T.
Each s_t in this sequence is one of the states in the HMM.
So the task is to find the most likely sequence of states: argmax_{s_1,...,s_T} P(o_1, ..., o_T, s_1, ..., s_T)
The Viterbi algorithm solves this by creating a table V[s, t] where s is one of the states and t is an index between 1, ..., T.

18 Viterbi Algorithm for HMMs
Consider the input killer crazy clown problem.
So the task is to find the most likely sequence of states: argmax_{s_1,s_2,s_3,s_4} P(killer crazy clown problem, s_1, s_2, s_3, s_4)
A sub-problem is to find the most likely sequence of states for killer crazy clown: argmax_{s_1,s_2,s_3} P(killer crazy clown, s_1, s_2, s_3)

19 Viterbi Algorithm for HMMs
In our example there are two possible values for s_4:
max_{s_1,...,s_4} P(killer crazy clown problem, s_1, s_2, s_3, s_4) =
max { max_{s_1,s_2,s_3} P(killer crazy clown problem, s_1, s_2, s_3, N), max_{s_1,s_2,s_3} P(killer crazy clown problem, s_1, s_2, s_3, A) }
Similarly:
max_{s_1,...,s_3} P(killer crazy clown, s_1, s_2, s_3) =
max { max_{s_1,s_2} P(killer crazy clown, s_1, s_2, N), max_{s_1,s_2} P(killer crazy clown, s_1, s_2, A) }

20 Viterbi Algorithm for HMMs
Putting them together:
max_{s_1,s_2,s_3} P(killer crazy clown problem, s_1, s_2, s_3, N) = max { max_{s_1,s_2} P(killer crazy clown, s_1, s_2, N) · a_{N,N} · b_N(problem), max_{s_1,s_2} P(killer crazy clown, s_1, s_2, A) · a_{A,N} · b_N(problem) }
max_{s_1,s_2,s_3} P(killer crazy clown problem, s_1, s_2, s_3, A) = max { max_{s_1,s_2} P(killer crazy clown, s_1, s_2, N) · a_{N,A} · b_A(problem), max_{s_1,s_2} P(killer crazy clown, s_1, s_2, A) · a_{A,A} · b_A(problem) }
The best score is given by:
max_{s_1,...,s_4} P(killer crazy clown problem, s_1, s_2, s_3, s_4) = max { max_{s_1,s_2,s_3} P(killer crazy clown problem, s_1, s_2, s_3, N), max_{s_1,s_2,s_3} P(killer crazy clown problem, s_1, s_2, s_3, A) }

21 Viterbi Algorithm for HMMs
Provide an index for each input symbol: 1:killer 2:crazy 3:clown 4:problem
V[N, 3] = max_{s_1,s_2} P(killer crazy clown, s_1, s_2, N)
V[N, 4] = max_{s_1,s_2,s_3} P(killer crazy clown problem, s_1, s_2, s_3, N)
Putting them together:
V[N, 4] = max { V[N, 3] · a_{N,N} · b_N(problem), V[A, 3] · a_{A,N} · b_N(problem) }
V[A, 4] = max { V[N, 3] · a_{N,A} · b_A(problem), V[A, 3] · a_{A,A} · b_A(problem) }
The best score for the input is given by: max { V[N, 4], V[A, 4] }
To extract the best sequence of states we backtrack (same trick as obtaining alignments from minimum edit distance).

22 Viterbi Algorithm for HMMs
For an input of length T: o_1, ..., o_T, we want to find the sequence of states s_1, ..., s_T.
Each s_t in this sequence is one of the states in the HMM.
For each state q we initialize our table: V[q, 1] = π_q · b_q(o_1)
Then compute, for t = 1, ..., T−1 and for each state q: V[q, t+1] = max_{q'} { V[q', t] · a_{q',q} · b_q(o_{t+1}) }
After the loop terminates, the best score is max_q V[q, T].
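
A sketch of this recurrence in Python, with backpointers so that the best state sequence can be recovered by backtracking as on slide 21. It assumes the dictionary-style parameters pi, a, b from the earlier sketch; the function name and return convention are my own.

    def viterbi(words, states, pi, a, b):
        """Return (best score, best state sequence) for the observation sequence."""
        V = [{q: pi[q] * b[q][words[0]] for q in states}]   # V[q, 1] = pi_q * b_q(o_1)
        back = [{q: None for q in states}]                  # backpointers
        for t in range(1, len(words)):
            V.append({}); back.append({})
            for q in states:
                # V[q, t+1] = max_{q'} V[q', t] * a_{q',q} * b_q(o_{t+1})
                prev = max(states, key=lambda qp: V[t - 1][qp] * a[(qp, q)])
                V[t][q] = V[t - 1][prev] * a[(prev, q)] * b[q][words[t]]
                back[t][q] = prev
        last = max(states, key=lambda q: V[-1][q])          # best final state
        path = [last]
        for t in range(len(words) - 1, 0, -1):              # follow the backpointers
            path.append(back[t][path[-1]])
        return V[-1][last], list(reversed(path))

    # Example: viterbi(["killer", "crazy", "clown", "problem"], ["A", "N"], pi, a, b)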

23 Learning from Fully Observed Data
Model: π = (A: 0.25, N: 0.75); transition table a_{i,j} and emission table b_i(o) as on slide 13.
Viterbi algorithm: table V with columns killer:1, crazy:2, clown:3, problem:4 and rows A, N (filled in on the slide).

24 Learning from Fully Observed Data
Model: π = (A: 0.25, N: 0.75); transition table a_{i,j} and emission table b_i(o) as on slide 13.
Viterbi algorithm: table V with columns killer:1, crazy:2, clown:3, problem:4 and rows A, N (filled in on the slide).

25 Probability models of language: Question
Model: π = (V: 0.25, N: 0.75); transition table a_{i,j} over the states {V, N}; emission table b_i(o) over the words {time, flies, can} (table entries shown on the slide).
What is the best sequence of tags for each string below:
1. time
2. time flies
3. time flies can

26 Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University Part 4: HMM as a Language Model 25

27 Hidden Markov Model
Model θ = (π_i, a_{i,j}, b_i(o))
π_i: probability of starting at state i
a_{i,j}: probability of transition from state i to state j
b_i(o): probability of output o at state i

28 Hidden Markov Model Algorithms HMM as parser: compute the best sequence of states for a given observation sequence. HMM as language model: compute probability of given observation sequence. HMM as learner: given a corpus of observation sequences, learn its distribution, i.e. learn the parameters of the HMM from the corpus. Learning from a set of observations with the sequence of states provided (states are not hidden) [Supervised Learning] Learning from a set of observations without any state information. [Unsupervised Learning] 27

29 HMM as a Language Model
Find P(killer clown) = Σ_y P(y, killer clown)
P(killer clown) = P(AA, killer clown) + P(AN, killer clown) + P(NN, killer clown) + P(NA, killer clown)

30 HMM as a Language Model
Consider the input killer crazy clown problem.
So the task is to find the sum over all sequences of states: Σ_{s_1,s_2,s_3,s_4} P(killer crazy clown problem, s_1, s_2, s_3, s_4)
A sub-problem is to find the sum over all sequences of states for killer crazy clown: Σ_{s_1,s_2,s_3} P(killer crazy clown, s_1, s_2, s_3)

31 HMM as a Language Model
In our example there are two possible values for s_4:
Σ_{s_1,...,s_4} P(killer crazy clown problem, s_1, s_2, s_3, s_4) =
Σ_{s_1,s_2,s_3} P(killer crazy clown problem, s_1, s_2, s_3, N) + Σ_{s_1,s_2,s_3} P(killer crazy clown problem, s_1, s_2, s_3, A)
Very similar to the Viterbi algorithm. Sum instead of max, and that's the only difference!

32 HMM as a Language Model
Provide an index for each input symbol: 1:killer 2:crazy 3:clown 4:problem
V[N, 3] = Σ_{s_1,s_2} P(killer crazy clown, s_1, s_2, N)
V[N, 4] = Σ_{s_1,s_2,s_3} P(killer crazy clown problem, s_1, s_2, s_3, N)
Putting them together:
V[N, 4] = V[N, 3] · a_{N,N} · b_N(problem) + V[A, 3] · a_{A,N} · b_N(problem)
V[A, 4] = V[N, 3] · a_{N,A} · b_A(problem) + V[A, 3] · a_{A,A} · b_A(problem)
The total probability of the input is given by: V[N, 4] + V[A, 4]

33 HMM as a Language Model
For an input of length T: o_1, ..., o_T, we want to find P(o_1, ..., o_T) = Σ_{y_1,...,y_T} P(y_1, ..., y_T, o_1, ..., o_T)
Each y_t in this sequence is one of the states in the HMM.
For each state q we initialize our table: V[q, 1] = π_q · b_q(o_1)
Then compute recursively, for t = 1, ..., T−1 and for each state q: V[q, t+1] = Σ_{q'} V[q', t] · a_{q',q} · b_q(o_{t+1})
After the loop terminates, the probability of the input is Σ_q V[q, T].
So: Viterbi with sum instead of max gives us an algorithm for HMM as a language model. This algorithm is sometimes called the forward algorithm.
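
Replacing the max in the Viterbi sketch above with a sum gives the forward algorithm described here; again pi, a, b are the dictionary-style parameters assumed in the earlier sketches.

    def forward(words, states, pi, a, b):
        """Return P(o_1, ..., o_T) by summing over all state sequences."""
        V = [{q: pi[q] * b[q][words[0]] for q in states}]        # V[q, 1] = pi_q * b_q(o_1)
        for t in range(1, len(words)):
            V.append({q: b[q][words[t]] *
                         sum(V[t - 1][qp] * a[(qp, q)] for qp in states)
                      for q in states})                          # sum instead of max
        return sum(V[-1][q] for q in states)                     # total probability

    # Example: forward(["killer", "clown"], ["A", "N"], pi, a, b)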

34 Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University Part 5: Supervised Learning for HMMs 33

35 Hidden Markov Model Algorithms HMM as parser: compute the best sequence of states for a given observation sequence. HMM as language model: compute probability of given observation sequence. HMM as learner: given a corpus of observation sequences, learn its distribution, i.e. learn the parameters of the HMM from the corpus. Learning from a set of observations with the sequence of states provided (states are not hidden) [Supervised Learning] Learning from a set of observations without any state information. [Unsupervised Learning] 34

36 Hidden Markov Model
Model θ = (π_i, a_{i,j}, b_i(o))
π_i: probability of starting at state i
a_{i,j}: probability of transition from state i to state j
b_i(o): probability of output o at state i
Constraints: Σ_i π_i = 1, Σ_j a_{i,j} = 1, Σ_o b_i(o) = 1

37 HMM Learning from Labeled Data
The task: to find the values for the parameters of the HMM:
π_A, π_N
a_{A,A}, a_{A,N}, a_{N,N}, a_{N,A}
b_A(killer), b_A(crazy), b_A(clown), b_A(problem)
b_N(killer), b_N(crazy), b_N(clown), b_N(problem)

38 Learning from Fully Observed Data
Labeled Data L:
x1,y1: killer/N clown/N (x1 = killer, clown; y1 = N, N)
x2,y2: killer/N problem/N (x2 = killer, problem; y2 = N, N)
x3,y3: crazy/A problem/N ...
x4,y4: crazy/A clown/N
x5,y5: problem/N crazy/A clown/N
x6,y6: clown/N crazy/A killer/N

39 Learning from Fully Observed Data
Let's say we have m labeled examples: L = (x_1, y_1), ..., (x_m, y_m)
Each (x_l, y_l) = {o_1, ..., o_T, s_1, ..., s_T}
For each (x_l, y_l) we can compute the probability using the HMM:
(x_1 = killer, clown; y_1 = N, N): P(x_1, y_1) = π_N b_N(killer) a_{N,N} b_N(clown)
(x_2 = killer, problem; y_2 = N, N): P(x_2, y_2) = π_N b_N(killer) a_{N,N} b_N(problem)
(x_3 = crazy, problem; y_3 = A, N): P(x_3, y_3) = π_A b_A(crazy) a_{A,N} b_N(problem)
(x_4 = crazy, clown; y_4 = A, N): P(x_4, y_4) = π_A b_A(crazy) a_{A,N} b_N(clown)
(x_5 = problem, crazy, clown; y_5 = N, A, N): P(x_5, y_5) = π_N b_N(problem) a_{N,A} b_A(crazy) a_{A,N} b_N(clown)
(x_6 = clown, crazy, killer; y_6 = N, A, N): P(x_6, y_6) = π_N b_N(clown) a_{N,A} b_A(crazy) a_{A,N} b_N(killer)
∏_l P(x_l, y_l) = π_N^4 · π_A^2 · a_{N,N}^2 · a_{N,A}^2 · a_{A,N}^4 · a_{A,A}^0 · b_N(killer)^3 · b_N(clown)^4 · b_N(problem)^3 · b_A(crazy)^4

40 Learning from Fully Observed Data
We can easily collect the frequency of observing a word with a state (tag):
f(i, x, y) = number of times i is the initial state in (x, y)
f(i, j, x, y) = number of times j follows i in (x, y)
f(i, o, x, y) = number of times i is paired with observation o
Then according to our HMM the probability of x, y is:
P(x, y) = ∏_i π_i^{f(i,x,y)} · ∏_{i,j} a_{i,j}^{f(i,j,x,y)} · ∏_{i,o} b_i(o)^{f(i,o,x,y)}

41 Learning from Fully Observed Data
According to our HMM the probability of x, y is:
P(x, y) = ∏_i π_i^{f(i,x,y)} · ∏_{i,j} a_{i,j}^{f(i,j,x,y)} · ∏_{i,o} b_i(o)^{f(i,o,x,y)}
For the labeled data L = (x_1, y_1), ..., (x_l, y_l), ..., (x_m, y_m):
P(L) = ∏_{l=1}^m P(x_l, y_l) = ∏_{l=1}^m ∏_i π_i^{f(i,x_l,y_l)} · ∏_{i,j} a_{i,j}^{f(i,j,x_l,y_l)} · ∏_{i,o} b_i(o)^{f(i,o,x_l,y_l)}

42 Learning from Fully Observed Data
According to our HMM the probability of the labeled data is:
P(L) = ∏_{l=1}^m ∏_i π_i^{f(i,x_l,y_l)} · ∏_{i,j} a_{i,j}^{f(i,j,x_l,y_l)} · ∏_{i,o} b_i(o)^{f(i,o,x_l,y_l)}
The log probability of the labeled data (x_1, y_1), ..., (x_m, y_m) according to the HMM with parameters θ is:
L(θ) = Σ_{l=1}^m log P(x_l, y_l) = Σ_{l=1}^m [ Σ_i f(i, x_l, y_l) log π_i + Σ_{i,j} f(i, j, x_l, y_l) log a_{i,j} + Σ_{i,o} f(i, o, x_l, y_l) log b_i(o) ]

43 Learning from Fully Observed Data
L(θ) = Σ_{l=1}^m [ Σ_i f(i, x_l, y_l) log π_i + Σ_{i,j} f(i, j, x_l, y_l) log a_{i,j} + Σ_{i,o} f(i, o, x_l, y_l) log b_i(o) ]
θ = (π, a, b)
L(θ) is the log probability of the labeled data (x_1, y_1), ..., (x_m, y_m)
We want to find a θ that will give us the maximum value of L(θ)
Find the θ such that dL(θ)/dθ = 0

44 Learning from Fully Observed Data
L(θ) = Σ_{l=1}^m [ Σ_i f(i, x_l, y_l) log π_i + Σ_{i,j} f(i, j, x_l, y_l) log a_{i,j} + Σ_{i,o} f(i, o, x_l, y_l) log b_i(o) ]
The values of π_i, a_{i,j}, b_i(o) that maximize L(θ) are:
π_i = Σ_l f(i, x_l, y_l) / Σ_l Σ_k f(k, x_l, y_l)
a_{i,j} = Σ_l f(i, j, x_l, y_l) / Σ_l Σ_k f(i, k, x_l, y_l)
b_i(o) = Σ_l f(i, o, x_l, y_l) / Σ_l Σ_{o' ∈ V} f(i, o', x_l, y_l)

45 Learning from Fully Observed Data
Labeled Data:
x1,y1: killer/N clown/N
x2,y2: killer/N problem/N
x3,y3: crazy/A problem/N
x4,y4: crazy/A clown/N
x5,y5: problem/N crazy/A clown/N
x6,y6: clown/N crazy/A killer/N

46 Learning from Fully Observed Data
The values of π_i that maximize L(θ) are: π_i = Σ_l f(i, x_l, y_l) / Σ_l Σ_k f(k, x_l, y_l)
π_N = 2/3 and π_A = 1/3 because:
Σ_l f(N, x_l, y_l) = 4
Σ_l f(A, x_l, y_l) = 2

47 Learning from Fully Observed Data
The values of a_{i,j} that maximize L(θ) are: a_{i,j} = Σ_l f(i, j, x_l, y_l) / Σ_l Σ_k f(i, k, x_l, y_l)
a_{N,N} = 1/2; a_{N,A} = 1/2; a_{A,N} = 1 and a_{A,A} = 0 because:
Σ_l f(N, N, x_l, y_l) = 2
Σ_l f(N, A, x_l, y_l) = 2
Σ_l f(A, N, x_l, y_l) = 4
Σ_l f(A, A, x_l, y_l) = 0

48 Learning from Fully Observed Data
The values of b_i(o) that maximize L(θ) are: b_i(o) = Σ_l f(i, o, x_l, y_l) / Σ_l Σ_{o' ∈ V} f(i, o', x_l, y_l)
b_N(killer) = 3/10; b_N(clown) = 4/10; b_N(problem) = 3/10 and b_A(crazy) = 1 because:
Σ_l f(N, killer, x_l, y_l) = 3
Σ_l f(N, clown, x_l, y_l) = 4
Σ_l f(N, crazy, x_l, y_l) = 0
Σ_l f(N, problem, x_l, y_l) = 3
Σ_l f(A, killer, x_l, y_l) = 0
Σ_l f(A, clown, x_l, y_l) = 0
Σ_l f(A, crazy, x_l, y_l) = 4
Σ_l f(A, problem, x_l, y_l) = 0

49 Learning from Fully Observed Data
x1,y1: killer/N clown/N
x2,y2: killer/N problem/N
x3,y3: crazy/A problem/N
x4,y4: crazy/A clown/N
x5,y5: problem/N crazy/A clown/N
x6,y6: clown/N crazy/A killer/N
Estimated model (from the counts on the previous slides): π_A = 1/3, π_N = 2/3; a_{A,A} = 0, a_{A,N} = 1, a_{N,A} = 1/2, a_{N,N} = 1/2; b_N(clown) = 4/10, b_N(killer) = 3/10, b_N(problem) = 3/10, b_A(crazy) = 1 (all other emission probabilities are 0).
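
These relative-frequency estimates are just counts followed by normalization. A minimal Python sketch over the six labeled examples (the variable names are my own; no smoothing is applied):

    from collections import Counter

    # The labeled data from the slides: (observation sequence, tag sequence) pairs.
    data = [(["killer", "clown"], ["N", "N"]),
            (["killer", "problem"], ["N", "N"]),
            (["crazy", "problem"], ["A", "N"]),
            (["crazy", "clown"], ["A", "N"]),
            (["problem", "crazy", "clown"], ["N", "A", "N"]),
            (["clown", "crazy", "killer"], ["N", "A", "N"])]

    init, trans, emit = Counter(), Counter(), Counter()
    for words, tags in data:
        init[tags[0]] += 1                               # f(i, x, y): initial-state counts
        for t, (w, s) in enumerate(zip(words, tags)):
            emit[(s, w)] += 1                            # f(i, o, x, y): emission counts
            if t > 0:
                trans[(tags[t - 1], s)] += 1             # f(i, j, x, y): transition counts

    states = sorted({s for _, tags in data for s in tags})
    vocab = sorted({w for words, _ in data for w in words})
    pi = {i: init[i] / sum(init.values()) for i in states}
    a = {(i, j): trans[(i, j)] / sum(trans[(i, k)] for k in states)
         for i in states for j in states}
    b = {i: {w: emit[(i, w)] / sum(emit[(i, v)] for v in vocab) for w in vocab}
         for i in states}
    print(pi)                                # {'A': 0.333..., 'N': 0.666...}
    print(a[("A", "N")], b["N"]["clown"])    # 1.0 and 0.4, matching the slides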

50 Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University Part 6: Lagrange Multipliers 49

51 Hidden Markov Model
Model θ = (π_i, a_{i,j}, b_i(o))
π_i: probability of starting at state i
a_{i,j}: probability of transition from state i to state j
b_i(o): probability of output o at state i
Constraints: Σ_i π_i = 1, Σ_j a_{i,j} = 1, Σ_o b_i(o) = 1

52 Learning from Fully Observed Data
L(θ) = Σ_{l=1}^m [ Σ_i f(i, x_l, y_l) log π_i + Σ_{i,j} f(i, j, x_l, y_l) log a_{i,j} + Σ_{i,o} f(i, o, x_l, y_l) log b_i(o) ]
θ = (π, a, b)
L(θ) is the log probability of the labeled data (x_1, y_1), ..., (x_m, y_m)
We want to find a θ that will give us the maximum value of L(θ)
Find the θ such that dL(θ)/dθ = 0

53 Learning from Fully Observed Data
L(θ) = Σ_{l=1}^m [ Σ_i f(i, x_l, y_l) log π_i + Σ_{i,j} f(i, j, x_l, y_l) log a_{i,j} + Σ_{i,o} f(i, o, x_l, y_l) log b_i(o) ]
Find the θ such that dL(θ)/dθ = 0, where θ = (π, a, b)
Split up L(θ) into L(π), L(a), L(b)
Take partial derivatives for all i, j, o: ∂L(π)/∂π_i, ∂L(a)/∂a_{i,j}, ∂L(b)/∂b_i(o)
We must also obey the constraints: Σ_k π_k = 1, Σ_k a_{i,k} = 1, Σ_o b_i(o) = 1

54 Learning from Fully Observed Data
L(π) = Σ_{l=1}^m Σ_i f(i, x_l, y_l) log π_i
Let us focus on L(π) (the other two, a and b, are similar)
For the constraint Σ_k π_k = 1 we introduce a new variable into our search for a maximum:
L(π, λ) = L(π) + λ (1 − Σ_k π_k)
λ is called the Lagrange multiplier
λ penalizes any solution that does not obey the constraint
The constraint ensures that π is a probability distribution

55 Learning from Fully Observed Data
∂L(π)/∂π_i = ∂/∂π_i [ Σ_{l=1}^m f(i, x_l, y_l) log π_i (the only part with the variable π_i) + Σ_{l=1}^m Σ_{j: j≠i} f(j, x_l, y_l) log π_j (no π_i, so its derivative is 0) ]
We want a value of π_i such that ∂L(π, λ)/∂π_i = 0:
∂/∂π_i [ Σ_{l=1}^m f(i, x_l, y_l) log π_i + λ (1 − Σ_k π_k) ] = 0
∂/∂π_i Σ_{l=1}^m f(i, x_l, y_l) log π_i = Σ_l f(i, x_l, y_l) / π_i
∂/∂π_i [ λ − λ π_i − λ Σ_{j: j≠i} π_j ] = −λ
so Σ_l f(i, x_l, y_l) / π_i − λ = 0

56 Learning from Fully Observed Data
We can obtain a value of π_i with respect to λ:
∂L(π, λ)/∂π_i = Σ_{l=1}^m f(i, x_l, y_l) / π_i − λ = 0 (see previous slide)
π_i = Σ_{l=1}^m f(i, x_l, y_l) / λ   (1)
Combine the π_i from Eqn (1) with the constraint Σ_k π_k = 1:
λ = Σ_k Σ_{l=1}^m f(k, x_l, y_l)

57 Learning from Fully Observed Data
The value of π_i for which ∂L(π, λ)/∂π_i = 0 is Eqn (2), which can be combined with the value of λ from Eqn (3):
π_i = Σ_{l=1}^m f(i, x_l, y_l) / λ   (2)
λ = Σ_k Σ_{l=1}^m f(k, x_l, y_l)   (3)
π_i = Σ_{l=1}^m f(i, x_l, y_l) / Σ_k Σ_{l=1}^m f(k, x_l, y_l)

58 Learning from Fully Observed Data
L(θ) = Σ_{l=1}^m [ Σ_i f(i, x_l, y_l) log π_i + Σ_{i,j} f(i, j, x_l, y_l) log a_{i,j} + Σ_{i,o} f(i, o, x_l, y_l) log b_i(o) ]
The values of π_i, a_{i,j}, b_i(o) that maximize L(θ) are:
π_i = Σ_l f(i, x_l, y_l) / Σ_l Σ_k f(k, x_l, y_l)
a_{i,j} = Σ_l f(i, j, x_l, y_l) / Σ_l Σ_k f(i, k, x_l, y_l)
b_i(o) = Σ_l f(i, o, x_l, y_l) / Σ_l Σ_{o' ∈ V} f(i, o', x_l, y_l)

59 Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University Part 7: Unsupervised Learning for HMMs 58

60 Hidden Markov Model
Model θ = (π_i, a_{i,j}, b_i(o))
π_i: probability of starting at state i
a_{i,j}: probability of transition from state i to state j
b_i(o): probability of output o at state i

61 Hidden Markov Model Algorithms HMM as parser: compute the best sequence of states for a given observation sequence. HMM as language model: compute probability of given observation sequence. HMM as learner: given a corpus of observation sequences, learn its distribution, i.e. learn the parameters of the HMM from the corpus. Learning from a set of observations with the sequence of states provided (states are not hidden) [Supervised Learning] Learning from a set of observations without any state information. [Unsupervised Learning] 60

62 Learning from Unlabeled Data
Unlabeled Data U = x_1, ..., x_m:
x1: killer clown
x2: killer problem
x3: crazy problem
x4: crazy clown
y1, y2, y3, y4 are unknown. But we can enumerate all possible values for y1, y2, y3, y4.
For example, for x1: killer clown
x1,y1,1: killer/A clown/A  p_1 = π_A b_A(killer) a_{A,A} b_A(clown)
x1,y1,2: killer/A clown/N  p_2 = π_A b_A(killer) a_{A,N} b_N(clown)
x1,y1,3: killer/N clown/N  p_3 = π_N b_N(killer) a_{N,N} b_N(clown)
x1,y1,4: killer/N clown/A  p_4 = π_N b_N(killer) a_{N,A} b_A(clown)

63 Learning from Unlabeled Data
Assume some values for θ = (π, a, b).
We can compute P(y | x_l, θ) for any y for a given x_l:
P(y | x_l, θ) = P(x_l, y | θ) / Σ_{y'} P(x_l, y' | θ)
For example, we can compute P(NN | killer clown, θ) as follows:
P(NN | killer clown, θ) = π_N b_N(killer) a_{N,N} b_N(clown) / Σ_{i,j} π_i b_i(killer) a_{i,j} b_j(clown)
P(y | x_l, θ) is called the posterior probability

64 Learning from Unlabeled Data
Compute the posterior for all possible outputs for each example in training:
For x1: killer clown
x1,y1,1: killer/A clown/A  P(AA | killer clown, θ)
x1,y1,2: killer/A clown/N  P(AN | killer clown, θ)
x1,y1,3: killer/N clown/N  P(NN | killer clown, θ)
x1,y1,4: killer/N clown/A  P(NA | killer clown, θ)
For x2: killer problem
x2,y2,1: killer/A problem/A  P(AA | killer problem, θ)
x2,y2,2: killer/A problem/N  P(AN | killer problem, θ)
x2,y2,3: killer/N problem/N  P(NN | killer problem, θ)
x2,y2,4: killer/N problem/A  P(NA | killer problem, θ)
Similarly for x3: crazy problem, and x4: crazy clown
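
For a toy model like this, the posteriors can be computed by enumerating every state sequence. A minimal sketch (seq_prob recomputes the joint probability from the first sketch, this time taking the parameters pi, a, b as arguments; the function names are my own, and a real implementation would use forward-backward rather than enumeration):

    from itertools import product

    def seq_prob(words, states_seq, pi, a, b):
        """Joint probability P(words, states_seq) under the HMM (pi, a, b)."""
        p = pi[states_seq[0]] * b[states_seq[0]][words[0]]
        for t in range(1, len(words)):
            p *= a[(states_seq[t - 1], states_seq[t])] * b[states_seq[t]][words[t]]
        return p

    def posteriors(words, states, pi, a, b):
        """P(y | words, theta) for every state sequence y, by enumeration."""
        joint = {y: seq_prob(words, y, pi, a, b)
                 for y in product(states, repeat=len(words))}
        total = sum(joint.values())                  # = P(words | theta)
        return {y: p / total for y, p in joint.items()}

    # Example: posteriors(["killer", "clown"], ["A", "N"], pi, a, b)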

65 Learning from Unlabeled Data
For unlabeled data, the log probability of the data given θ is:
L(θ) = Σ_{l=1}^m log Σ_y P(x_l, y | θ) = Σ_{l=1}^m log Σ_y P(y | x_l, θ) P(x_l | θ)
Unlike the fully observed case, there is no simple solution to finding the θ that maximizes L(θ).
We instead initialize θ to some values, and then iteratively find better values of θ: θ_0, θ_1, ... using the following formula:
θ_t = argmax_θ Q(θ, θ_{t−1}), where Q(θ, θ_{t−1}) = Σ_{l=1}^m Σ_y P(y | x_l, θ_{t−1}) log P(x_l, y | θ)

66 Learning from Unlabeled Data
θ_t = argmax_θ Q(θ, θ_{t−1})
Q(θ, θ_{t−1}) = Σ_{l=1}^m Σ_y P(y | x_l, θ_{t−1}) log P(x_l, y | θ)
= Σ_{l=1}^m Σ_y P(y | x_l, θ_{t−1}) [ Σ_i f(i, x_l, y) log π_i + Σ_{i,j} f(i, j, x_l, y) log a_{i,j} + Σ_{i,o} f(i, o, x_l, y) log b_i(o) ]

67 Learning from Unlabeled Data
g(i, x_l) = Σ_y P(y | x_l, θ_{t−1}) f(i, x_l, y)
g(i, j, x_l) = Σ_y P(y | x_l, θ_{t−1}) f(i, j, x_l, y)
g(i, o, x_l) = Σ_y P(y | x_l, θ_{t−1}) f(i, o, x_l, y)
θ_t = argmax_{π,a,b} Σ_{l=1}^m [ Σ_i g(i, x_l) log π_i + Σ_{i,j} g(i, j, x_l) log a_{i,j} + Σ_{i,o} g(i, o, x_l) log b_i(o) ]

68 Learning from Unlabeled Data
Q(θ, θ_{t−1}) = Σ_{l=1}^m [ Σ_i g(i, x_l) log π_i + Σ_{i,j} g(i, j, x_l) log a_{i,j} + Σ_{i,o} g(i, o, x_l) log b_i(o) ]
The values of π_i, a_{i,j}, b_i(o) that maximize Q(θ, θ_{t−1}) are:
π_i = Σ_l g(i, x_l) / Σ_l Σ_k g(k, x_l)
a_{i,j} = Σ_l g(i, j, x_l) / Σ_l Σ_k g(i, k, x_l)
b_i(o) = Σ_l g(i, o, x_l) / Σ_l Σ_{o' ∈ V} g(i, o', x_l)

69 EM Algorithm for Learning HMMs
Initialize θ_0 at random. Let t = 0.
The EM Algorithm:
E-step: compute the expected values of y, P(y | x, θ), and calculate g(i, x), g(i, j, x), g(i, o, x)
M-step: compute θ_t = argmax_θ Q(θ, θ_{t−1})
Stop if L(θ_t) did not change much since the last iteration; else continue.
The above algorithm is guaranteed to improve the likelihood of the unlabeled data. In other words, L(θ_t) ≥ L(θ_{t−1}).
But it all depends on θ_0!
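
A sketch of one EM iteration built on the brute-force posteriors function from the earlier sketch. This is only illustrative: a real implementation would use the forward-backward (Baum-Welch) algorithm to get the expected counts without enumerating state sequences, and would guard against zero counts or add smoothing.

    from collections import Counter

    def em_step(corpus, states, pi, a, b):
        """One EM iteration: E-step accumulates expected counts g, M-step re-normalizes."""
        g_init, g_trans, g_emit = Counter(), Counter(), Counter()
        for words in corpus:
            for y, post in posteriors(words, states, pi, a, b).items():  # earlier sketch
                g_init[y[0]] += post                                     # g(i, x)
                for t, w in enumerate(words):
                    g_emit[(y[t], w)] += post                            # g(i, o, x)
                    if t > 0:
                        g_trans[(y[t - 1], y[t])] += post                # g(i, j, x)
        # M-step: the same relative-frequency formulas as in the supervised case,
        # with expected counts g in place of observed counts f (no smoothing).
        vocab = {w for words in corpus for w in words}
        new_pi = {i: g_init[i] / sum(g_init.values()) for i in states}
        new_a = {(i, j): g_trans[(i, j)] / sum(g_trans[(i, k)] for k in states)
                 for i in states for j in states}
        new_b = {i: {w: g_emit[(i, w)] / sum(g_emit[(i, v)] for v in vocab)
                     for w in vocab} for i in states}
        return new_pi, new_a, new_b

    # corpus = [["killer", "clown"], ["killer", "problem"], ["crazy", "problem"], ["crazy", "clown"]]
    # for _ in range(20): pi, a, b = em_step(corpus, ["A", "N"], pi, a, b)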

70 Acknowledgements
Many slides are borrowed from or inspired by the lecture notes of Michael Collins, Chris Dyer, Kevin Knight, Philipp Koehn, Adam Lopez, Graham Neubig and Luke Zettlemoyer from their NLP course materials. All mistakes are my own. A big thank you to all the students who read through these notes and helped me improve them.
