6.891: Lecture 8 (October 1st, 2003) Log-Linear Models for Parsing, and the EM Algorithm Part I


1 6.891: Lecture 8 (October 1st, 2003) Log-Linear Models for Parsing, and the EM Algorithm, Part I

2 Overview
Ratnaparkhi's Maximum-Entropy Parser
The EM Algorithm, Part I

3 Log-Linear Taggers: Independence Assumptions
The input sentence S, with length n = |S|, has |𝒯|^n possible tag sequences.
Each tag sequence T has a conditional probability
P(T | S) = ∏_{j=1}^n P(T(j) | S, j, T(1) ... T(j−1))   (chain rule)
         = ∏_{j=1}^n P(T(j) | S, j, T(j−2), T(j−1))    (independence assumptions)
Estimate P(T(j) | S, j, T(j−2), T(j−1)) using log-linear models
Use the Viterbi algorithm to compute argmax_{T ∈ 𝒯^n} log P(T | S)

4 A General Approach: (Conditional) History-Based Models
We've shown how to define P(T | S) where T is a tag sequence
How do we define P(T | S) if T is a parse tree (or another structure)?

5 A General Approach: (Conditional) History-Based Models
Step 1: represent a tree as a sequence of decisions d_1 ... d_m
T = ⟨d_1, d_2, ..., d_m⟩
m is not necessarily the length of the sentence
Step 2: the probability of a tree is
P(T | S) = ∏_{i=1}^m P(d_i | d_1 ... d_{i−1}, S)
Step 3: Use a log-linear model to estimate P(d_i | d_1 ... d_{i−1}, S)
Step 4: Search?? (answer we'll get to later: beam or heuristic search)
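A minimal sketch of Step 2 (Python written for these notes, not code from the lecture): `decision_prob` stands in for whatever conditional model Step 3 supplies, and a tree scored as a decision sequence accumulates one log probability per decision.

```python
import math

def log_prob_of_tree(decisions, sentence, decision_prob):
    """log P(T | S) for a tree T encoded as decisions d_1 ... d_m."""
    logp = 0.0
    for i, d in enumerate(decisions):
        history = decisions[:i]                      # d_1 ... d_{i-1}
        logp += math.log(decision_prob(history, d, sentence))
    return logp
```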

6 An Example Tree
S(questioned)
  NP(lawyer)
    DT  the
    NN  lawyer
  VP(questioned)
    Vt  questioned
    NP(witness)
      DT  the
      NN  witness
    PP(about)
      IN  about
      NP(revolver)
        DT  the
        NN  revolver

7 Ratnaparkhi's Parser: Three Layers of Structure
1. Part-of-speech tags
2. Chunks
3. Remaining structure

8 Step 1: represent a tree as a sequence of decisions d_1 ... d_m
Layer 1: Part-of-Speech Tags
the/DT lawyer/NN questioned/Vt the/DT witness/NN about/IN the/DT revolver/NN
T = ⟨d_1, d_2, ..., d_m⟩
The first n decisions are tagging decisions:
⟨d_1 ... d_n⟩ = ⟨DT, NN, Vt, DT, NN, IN, DT, NN⟩

9 Layer 2: Chunks
NP [the/DT lawyer/NN]  questioned/Vt  NP [the/DT witness/NN]  about/IN  NP [the/DT revolver/NN]
Chunks are defined as any phrase where all children are part-of-speech tags
(Other common chunks are ADJP, QP)

10 Step 1: represent a tree as a sequence of decisions d_1 ... d_m
Layer 2: Chunks
the/DT/Start(NP) lawyer/NN/Join(NP) questioned/Vt/Other the/DT/Start(NP) witness/NN/Join(NP) about/IN/Other the/DT/Start(NP) revolver/NN/Join(NP)
T = ⟨d_1, d_2, ..., d_m⟩
The first n decisions are tagging decisions; the next n decisions are chunk tagging decisions:
⟨d_1 ... d_2n⟩ = ⟨DT, NN, Vt, DT, NN, IN, DT, NN, Start(NP), Join(NP), Other, Start(NP), Join(NP), Other, Start(NP), Join(NP)⟩

11 Layer 3: Remaining Structure
Alternate between two classes of actions:
Join(X) or Start(X), where X is a label (NP, S, VP, etc.)
Check=YES or Check=NO
Meaning of these actions:
Start(X) starts a new constituent with label X (always acts on the leftmost constituent with no start or join label above it)
Join(X) continues a constituent with label X (always acts on the leftmost constituent with no start or join label above it)
Check=NO does nothing
Check=YES takes the previous Join or Start action, and converts it into a completed constituent

12 [Tree diagram: the chunked sentence before any layer-3 decisions: NP (the lawyer), Vt (questioned), NP (the witness), IN (about), NP (the revolver)]

13 Start(S) [NP (the lawyer) is labelled Start(S)]

14 Start(S) Check=NO

15 Start(S) Start(VP) [Vt (questioned) is labelled Start(VP)]

16 Start(S) Start(VP) Check=NO

17 Start(S) Start(VP) Join(VP) [NP (the witness) is labelled Join(VP)]

18 Start(S) Start(VP) Join(VP) Check=NO

19 Start(S) Start(VP) Join(VP) Start(PP) [IN (about) is labelled Start(PP)]

20 Start(S) Start(VP) Join(VP) Start(PP) Check=NO

21 Start(S) Start(VP) Join(VP) Start(PP) Join(PP) [NP (the revolver) is labelled Join(PP)]

22 Start(S) Start(VP) Join(VP) PP Check=YES [the Start(PP), Join(PP) sequence becomes a completed PP constituent]

23 Start(S) Start(VP) Join(VP) Join(VP) [the completed PP is labelled Join(VP)]

24 Start(S) VP Check=YES [the Start(VP), Join(VP), Join(VP) sequence becomes a completed VP constituent]

25 Start(S) Join(S) [the completed VP is labelled Join(S)]

26 S Check=YES [the Start(S), Join(S) sequence becomes a completed S: the parse is complete]

27 The Final Sequence of Decisions
⟨d_1 ... d_m⟩ = ⟨DT, NN, Vt, DT, NN, IN, DT, NN,
Start(NP), Join(NP), Other, Start(NP), Join(NP), Other, Start(NP), Join(NP),
Start(S), Check=NO, Start(VP), Check=NO, Join(VP), Check=NO, Start(PP), Check=NO, Join(PP), Check=YES, Join(VP), Check=YES, Join(S), Check=YES⟩
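As an illustration (not Ratnaparkhi's code), the layer-3 suffix of such a sequence can be replayed deterministically to rebuild the tree, following the action semantics from slide 11. The chunk and decision encodings below are assumptions made for the sketch.

```python
def replay_layer3(chunks, decisions):
    """Rebuild a tree from chunks plus alternating Start/Join and Check decisions.

    chunks:    e.g. [("NP", ["the", "lawyer"]), ("Vt", ["questioned"]), ...]
    decisions: e.g. [("Start", "S"), "Check=NO", ("Start", "VP"), "Check=NO", ...]
    """
    forest = [(None, chunk) for chunk in chunks]     # (annotation, subtree) pairs
    stream = iter(decisions)
    for kind, label in stream:                       # a Start(X) or Join(X) action
        i = next(j for j, (a, _) in enumerate(forest) if a is None)
        forest[i] = ((kind, label), forest[i][1])    # annotate leftmost unlabelled tree
        if next(stream) == "Check=YES":              # the Check decision that follows
            # reduce the most recent Start(X) Join(X)* run into one constituent
            start = max(j for j, (a, _) in enumerate(forest)
                        if a is not None and a[0] == "Start")
            children = [t for _, t in forest[start:i + 1]]
            forest[start:i + 1] = [(None, (label, children))]
    return forest[0][1]                              # the single completed tree
```

Replaying the Start/Join/Check portion of the sequence above over the five chunks from slide 10 yields the tree of slide 6.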

28 A General Approach: (Conditional) History-Based Models
Step 1: represent a tree as a sequence of decisions d_1 ... d_m
T = ⟨d_1, d_2, ..., d_m⟩
m is not necessarily the length of the sentence
Step 2: the probability of a tree is
P(T | S) = ∏_{i=1}^m P(d_i | d_1 ... d_{i−1}, S)
Step 3: Use a log-linear model to estimate P(d_i | d_1 ... d_{i−1}, S)
Step 4: Search?? (answer we'll get to later: beam or heuristic search)

29 Applying a Log-Linear Model
Step 3: Use a log-linear model to estimate P(d_i | d_1 ... d_{i−1}, S)
A reminder:
P(d_i | d_1 ... d_{i−1}, S) = e^{φ(⟨d_1 ... d_{i−1}, S⟩, d_i) · W} / Σ_{d ∈ A} e^{φ(⟨d_1 ... d_{i−1}, S⟩, d) · W}
where:
⟨d_1 ... d_{i−1}, S⟩ is the history
d_i is the outcome
φ maps a history/outcome pair to a feature vector
W is a parameter vector
A is the set of possible actions (may be context dependent)
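A direct transcription of this formula as a sketch (illustrative names only: `phi` is assumed to return the indices of the active features and `W` is the weight vector; in practice these would be fixed once and closed over):

```python
import math

def decision_prob(history, d, actions, phi, W):
    """P(d | history) under the log-linear model above."""
    def score(a):
        return sum(W[k] for k in phi(history, a))    # phi(history, a) . W for sparse phi
    scores = {a: score(a) for a in actions}
    m = max(scores.values())                         # subtract the max for stability
    Z = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[d] - m) / Z
```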

30 Reminder: Implementing FEATURE_VECTOR
Intermediate step: map the history/tag pair to a set of feature strings
Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/Vt from which Spain expanded its empire into the rest of the Western Hemisphere.
e.g., Ratnaparkhi's features:
TAG=Vt;Word=base
TAG=Vt;TAG-1=JJ
TAG=Vt;TAG-1=JJ;TAG-2=DT
TAG=Vt;SUFF1=e
TAG=Vt;SUFF2=se
TAG=Vt;SUFF3=ase
TAG=Vt;WORD-1=important
TAG=Vt;WORD+1=from

31 Reminder: Implementing FEATURE_VECTOR
Next step: match the strings to integers through a hash table
Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/Vt from which Spain expanded its empire into the rest of the Western Hemisphere.
e.g., Ratnaparkhi's features:
TAG=Vt;Word=base → 1315
TAG=Vt;TAG-1=JJ → 17
TAG=Vt;TAG-1=JJ;TAG-2=DT → 32908
TAG=Vt;SUFF1=e → 459
TAG=Vt;SUFF2=se → 1000
TAG=Vt;SUFF3=ase → 1509
TAG=Vt;WORD-1=important → 1806
TAG=Vt;WORD+1=from → 300
In this case, the sparse array is:
A.length = 8; A(1...8) = {1315, 17, 32908, 459, 1000, 1509, 1806, 300}
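A minimal sketch of this step (illustrative code, not Ratnaparkhi's): feature strings are interned in a dictionary the first time they are seen, and each history/outcome pair is then represented by the sparse array of its feature indices.

```python
feature_index = {}                       # e.g. "TAG=Vt;Word=base" -> 1315

def index_of(feature_string, frozen=False):
    """Look up (or assign) the integer index of a feature string."""
    if feature_string not in feature_index:
        if frozen:
            return None                  # unseen feature at test time
        feature_index[feature_string] = len(feature_index)
    return feature_index[feature_string]

def sparse_array(feature_strings, frozen=False):
    """Map the feature strings of one history/outcome pair to a sparse index array."""
    indices = (index_of(f, frozen) for f in feature_strings)
    return [i for i in indices if i is not None]
```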

32 Applying a Log-Linear Model
Step 3: Use a log-linear model to estimate
P(d_i | d_1 ... d_{i−1}, S) = e^{φ(⟨d_1 ... d_{i−1}, S⟩, d_i) · W} / Σ_{d ∈ A} e^{φ(⟨d_1 ... d_{i−1}, S⟩, d) · W}
The big question: how do we define φ?
Ratnaparkhi's method defines φ differently depending on whether the next decision is:
A tagging decision (same features as before for POS tagging!)
A chunking decision
A start/join decision after chunking
A Check=NO/Check=YES decision

33 Layer 2: Chunks
the/DT/Start(NP) lawyer/NN/Join(NP) questioned/Vt/Other the/DT/Start(NP) witness/NN/Join(NP) about/IN the/DT revolver/NN
Features for the chunk tagging decision at "witness" (proposed tag Join(NP)):
TAG=Join(NP);Word0=witness;POS0=NN
TAG=Join(NP);POS0=NN
TAG=Join(NP);Word+1=about;POS+1=IN
TAG=Join(NP);POS+1=IN
TAG=Join(NP);Word+2=the;POS+2=DT
TAG=Join(NP);POS+2=DT
TAG=Join(NP);Word-1=the;POS-1=DT;TAG-1=Start(NP)
TAG=Join(NP);POS-1=DT;TAG-1=Start(NP)
TAG=Join(NP);TAG-1=Start(NP)
TAG=Join(NP);Word-2=questioned;POS-2=Vt;TAG-2=Other
...

34 [Figure-only slide in the original; no text was transcribed]

35 Layer 3: Join or Start
Looks at the head word, constituent (or POS) label, and start/join annotation of the nth tree relative to the decision, where n = -2, -1
Looks at the head word and constituent (or POS) label of the nth tree relative to the decision, where n = 0, 1, 2
Looks at bigram features of the above for (-1, 0) and (0, 1)
Looks at trigram features of the above for (-2, -1, 0), (-1, 0, 1) and (0, 1, 2)
The above features with all combinations of head words excluded
Various punctuation features

36 Layer 3: Check=NO or Check=YES
A variety of questions concerning the proposed constituent

37 The Search Problem
In POS tagging, we could use the Viterbi algorithm because
P(T(j) | S, j, T(1) ... T(j−1)) = P(T(j) | S, j, T(j−2), T(j−1))
Now: decision d_i could depend on arbitrary decisions in the past ⇒ no chance for dynamic programming
Instead, Ratnaparkhi uses a beam search method
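A generic beam-search sketch for this setting (the real parser adds further pruning; `next_decisions` and `decision_prob(history, d, sentence)` are assumed interfaces, e.g. the log-linear model above with its feature map and weights closed over):

```python
import math
from heapq import nlargest

def beam_search(sentence, next_decisions, decision_prob, beam_size=20):
    """Keep the beam_size highest-probability decision histories at each step."""
    beam = [((), 0.0)]                               # (history, log probability)
    while any(next_decisions(h, sentence) for h, _ in beam):
        candidates = []
        for history, logp in beam:
            actions = next_decisions(history, sentence)
            if not actions:                          # already a complete parse
                candidates.append((history, logp))
                continue
            for d in actions:
                p = decision_prob(history, d, sentence)
                candidates.append((history + (d,), logp + math.log(p)))
        beam = nlargest(beam_size, candidates, key=lambda c: c[1])
    return max(beam, key=lambda c: c[1])             # best complete decision sequence
```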

38 Overview
Ratnaparkhi's Maximum-Entropy Parser
The EM Algorithm, Part I

39 An Experiment/Some Intuition
I have one coin in my pocket. Coin 0 has probability λ of heads.
I toss the coin 10 times, and see the following sequence: HHTTHHHTHH (7 heads out of 10)
What would you guess λ to be?

40 An Experiment/Some Intuition
I have three coins in my pocket. Coin 0 has probability λ of heads; Coin 1 has probability p_1 of heads; Coin 2 has probability p_2 of heads.
For each trial I do the following:
First I toss Coin 0
If Coin 0 turns up heads, I toss Coin 1 three times
If Coin 0 turns up tails, I toss Coin 2 three times
I don't tell you whether Coin 0 came up heads or tails, or whether Coin 1 or 2 was tossed three times, but I do tell you how many heads/tails are seen at each trial
You see the following sequence: ⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩
What would you estimate as the values for λ, p_1 and p_2?

41 Maximum Likelihood Estimation
We have a parameter vector Θ
We have a parameter space Ω
We have data points X_1, X_2, ..., X_n drawn from some (finite or countable) set 𝒳
We have a distribution P(X | Θ) for any Θ ∈ Ω, such that Σ_{X ∈ 𝒳} P(X | Θ) = 1 and P(X | Θ) ≥ 0 for all X
We assume that our data points X_1, X_2, ..., X_n are drawn at random (independently, identically distributed) from a distribution P(X | Θ*) for some Θ* ∈ Ω

42 A First Example: Coin Tossing
𝒳 = {H, T}. Our data points X_1, X_2, ..., X_n are a sequence of heads and tails, e.g. HHTTHHHTHH
The parameter vector Θ is a single parameter, i.e., the probability of the coin coming up heads
The parameter space is Ω = [0, 1]
The distribution P(X | Θ) is defined as
P(X | Θ) = Θ if X = H; 1 − Θ if X = T

43 Log-Likelihood
We have a parameter vector Θ, and a parameter space Ω
We have data points X_1, X_2, ..., X_n drawn from some (finite or countable) set 𝒳
We have a distribution P(X | Θ) for any Θ ∈ Ω
The likelihood is Likelihood(Θ) = P(X_1, X_2, ..., X_n | Θ) = ∏_{i=1}^n P(X_i | Θ)
The log-likelihood is L(Θ) = log Likelihood(Θ) = Σ_{i=1}^n log P(X_i | Θ)

44 Maximum Likelihood Estimation
Given a sample X_1, X_2, ..., X_n, choose
Θ_ML = argmax_{Θ ∈ Ω} L(Θ) = argmax_{Θ ∈ Ω} Σ_i log P(X_i | Θ)
For example, take the coin example: say X_1 ... X_n has Count(H) heads, and (n − Count(H)) tails. Then
L(Θ) = log ( Θ^{Count(H)} (1 − Θ)^{n − Count(H)} ) = Count(H) log Θ + (n − Count(H)) log(1 − Θ)
We then have Θ_ML = Count(H) / n
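This closed form follows by setting the derivative of L(Θ) to zero; writing c for Count(H), a one-line check:

```latex
\frac{dL(\Theta)}{d\Theta} = \frac{c}{\Theta} - \frac{n-c}{1-\Theta} = 0
\quad\Longrightarrow\quad c\,(1-\Theta) = (n-c)\,\Theta
\quad\Longrightarrow\quad \Theta_{ML} = \frac{c}{n}
```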

45 A Second Example: Probabilistic Context-Free Grammars
𝒳 is the set of all parse trees generated by the underlying context-free grammar. Our sample is n trees T_1 ... T_n such that each T_i ∈ 𝒳.
R is the set of rules in the context-free grammar
N is the set of non-terminals in the grammar
Θ_r for r ∈ R is the parameter for rule r
Let R(α) ⊆ R be the rules of the form α → β for some β
The parameter space Ω is the set of Θ ∈ [0, 1]^{|R|} such that for all α ∈ N, Σ_{r ∈ R(α)} Θ_r = 1

46 We have
P(T | Θ) = ∏_{r ∈ R} Θ_r^{Count(T, r)}
where Count(T, r) is the number of times rule r is seen in tree T
⇒ log P(T | Θ) = Σ_{r ∈ R} Count(T, r) log Θ_r

47 Maximum Likelihood Estimation for PCFGs
We have log P(T | Θ) = Σ_{r ∈ R} Count(T, r) log Θ_r, where Count(T, r) is the number of times rule r is seen in tree T
And, L(Θ) = Σ_i log P(T_i | Θ) = Σ_i Σ_{r ∈ R} Count(T_i, r) log Θ_r
Solving Θ_ML = argmax_{Θ ∈ Ω} L(Θ) gives
Θ_r = Σ_i Count(T_i, r) / Σ_i Σ_{s ∈ R(α)} Count(T_i, s)
where r is of the form α → β for some β
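A sketch of this estimate (illustrative code; rules are assumed to be (lhs, rhs) tuples and count_rules is an assumed helper returning the Count(T, r) values for one tree). The estimate is just a relative frequency, normalised over rules with the same left-hand side:

```python
from collections import Counter, defaultdict

def mle_rule_probs(trees, count_rules):
    """Theta_r = sum_i Count(T_i, r) / sum_i sum_{s in R(alpha)} Count(T_i, s)."""
    rule_counts = Counter()
    for T in trees:
        rule_counts.update(count_rules(T))           # Count(T, r) for every rule in T
    lhs_totals = defaultdict(float)
    for (lhs, _rhs), c in rule_counts.items():
        lhs_totals[lhs] += c
    return {r: c / lhs_totals[r[0]] for r, c in rule_counts.items()}
```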

48 Models with Hidden Variables
Now say we have two sets 𝒳 and 𝒴, and a joint distribution P(X, Y | Θ)
If we had fully observed data, (X_i, Y_i) pairs, then
L(Θ) = Σ_i log P(X_i, Y_i | Θ)
If we have partially observed data, X_i examples only, then
L(Θ) = Σ_i log P(X_i | Θ) = Σ_i log Σ_{Y ∈ 𝒴} P(X_i, Y | Θ)

49 The EM (Expectation-Maximization) algorithm is a method for finding
Θ_ML = argmax_Θ Σ_i log Σ_{Y ∈ 𝒴} P(X_i, Y | Θ)

50 The Three Coins Example
e.g., in the three coins example:
𝒴 = {H, T}
𝒳 = {HHH, TTT, HTT, THH, HHT, TTH, HTH, THT}
Θ = {λ, p_1, p_2}
and
P(X, Y | Θ) = P(Y | Θ) P(X | Y, Θ)
where
P(Y | Θ) = λ if Y = H; 1 − λ if Y = T
and
P(X | Y, Θ) = p_1^h (1 − p_1)^t if Y = H; p_2^h (1 − p_2)^t if Y = T
where h = number of heads in X, t = number of tails in X

51 The Three Coins Example
Fully observed data might look like:
(⟨HHH⟩, H), (⟨TTT⟩, T), (⟨HHH⟩, H), (⟨TTT⟩, T), (⟨HHH⟩, H)
In this case the maximum likelihood estimates are:
λ = 3/5, p_1 = 9/9 = 1, p_2 = 0/6 = 0

52 The Three Coins Example
Partially observed data might look like:
⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩
How do we find the maximum likelihood parameters?

53 The EM Algorithm
Θ^t is the parameter vector at the t-th iteration
Choose Θ^0 (at random, or using various heuristics)
The iterative procedure is defined as Θ^t = argmax_Θ Q(Θ, Θ^{t−1}), where
Q(Θ, Θ^{t−1}) = Σ_i Σ_{Y ∈ 𝒴} P(Y | X_i, Θ^{t−1}) log P(X_i, Y | Θ)

54 The EM Algorithm
The iterative procedure is defined as Θ^t = argmax_Θ Q(Θ, Θ^{t−1}), where
Q(Θ, Θ^{t−1}) = Σ_i Σ_{Y ∈ 𝒴} P(Y | X_i, Θ^{t−1}) log P(X_i, Y | Θ)
Key points:
Intuition: fill in the hidden Y variables according to P(Y | X_i, Θ)
EM is guaranteed to converge to a local maximum, or saddle-point, of the likelihood function
In general, if argmax_Θ Σ_i log P(X_i, Y_i | Θ) has a simple (analytic) solution, then argmax_Θ Σ_i Σ_Y P(Y | X_i, Θ) log P(X_i, Y | Θ) also has a simple (analytic) solution

55 The Three Coins Example
Partially observed data might look like:
⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩, ⟨TTT⟩, ⟨HHH⟩
Say X = ⟨HHH⟩, and the current parameters are λ, p_1, p_2. Then
P(⟨HHH⟩) = P(⟨HHH⟩, H) + P(⟨HHH⟩, T) = λ p_1^3 + (1 − λ) p_2^3
and
P(Y = H | ⟨HHH⟩) = P(⟨HHH⟩, H) / ( P(⟨HHH⟩, H) + P(⟨HHH⟩, T) ) = λ p_1^3 / ( λ p_1^3 + (1 − λ) p_2^3 )

56 The Three Coins Example
After filling in the hidden variables for each example, the partially observed data might look like:
(⟨HHH⟩, H)  P(Y = H | HHH) = 0.6
(⟨HHH⟩, T)  P(Y = T | HHH) = 0.4
(⟨TTT⟩, H)  P(Y = H | TTT) = 0.3
(⟨TTT⟩, T)  P(Y = T | TTT) = 0.7
(⟨HHH⟩, H)  P(Y = H | HHH) = 0.6
(⟨HHH⟩, T)  P(Y = T | HHH) = 0.4
(⟨TTT⟩, H)  P(Y = H | TTT) = 0.3
(⟨TTT⟩, T)  P(Y = T | TTT) = 0.7
(⟨HHH⟩, H)  P(Y = H | HHH) = 0.6
(⟨HHH⟩, T)  P(Y = T | HHH) = 0.4
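One full EM iteration for this model can be written in a few lines; the sketch below (illustrative, not from the lecture) performs the E-step with the posterior formula from the previous slide and the M-step with the resulting weighted counts.

```python
def em_step(trials, lam, p1, p2):
    """One EM update for the three-coins model; trials are strings like "HHH"."""
    def lik(x, p):                                    # P(x | coin with heads prob p)
        h = x.count("H")
        return (p ** h) * ((1 - p) ** (len(x) - h))

    sum_q = 0.0
    heads1 = tosses1 = heads2 = tosses2 = 0.0
    for x in trials:
        a = lam * lik(x, p1)                          # P(x, Y = H)
        b = (1 - lam) * lik(x, p2)                    # P(x, Y = T)
        q = a / (a + b)                               # E-step: P(Y = H | x)
        h, n = x.count("H"), len(x)
        sum_q += q
        heads1 += q * h                               # expected counts for Coin 1
        tosses1 += q * n
        heads2 += (1 - q) * h                         # expected counts for Coin 2
        tosses2 += (1 - q) * n
    # M-step: re-estimate the parameters from the expected counts
    return sum_q / len(trials), heads1 / tosses1, heads2 / tosses2

# e.g. em_step(["HHH", "TTT", "HHH", "TTT", "HHH"], 0.3, 0.3, 0.6)
```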

57 EM for Probabilistic Context-Free Grammars
A PCFG defines a distribution P(S, T | Θ) over tree/sentence pairs (S, T)
If we had tree/sentence pairs (fully observed data), then
L(Θ) = Σ_i log P(S_i, T_i | Θ)
Say we have sentences only, S_1 ... S_n, and the trees are hidden variables. Then
L(Θ) = Σ_i log Σ_T P(S_i, T | Θ)

58 EM for Probabilistic Context-Free Grammars
Say we have sentences only, S_1 ... S_n, and the trees are hidden variables. Then
L(Θ) = Σ_i log Σ_T P(S_i, T | Θ)
The EM algorithm is then Θ^t = argmax_Θ Q(Θ, Θ^{t−1}), where
Q(Θ, Θ^{t−1}) = Σ_i Σ_T P(T | S_i, Θ^{t−1}) log P(S_i, T | Θ)

59 Remember:
log P(S_i, T | Θ) = Σ_{r ∈ R} Count(S_i, T, r) log Θ_r
where Count(S, T, r) is the number of times rule r is seen in the sentence/tree pair (S, T)
⇒ Q(Θ, Θ^{t−1}) = Σ_i Σ_T P(T | S_i, Θ^{t−1}) log P(S_i, T | Θ)
= Σ_i Σ_T P(T | S_i, Θ^{t−1}) Σ_{r ∈ R} Count(S_i, T, r) log Θ_r
= Σ_i Σ_{r ∈ R} Count(S_i, r) log Θ_r
where Count(S_i, r) = Σ_T P(T | S_i, Θ^{t−1}) Count(S_i, T, r) are expected counts

60 Solving Θ^t = argmax_{Θ ∈ Ω} Q(Θ, Θ^{t−1}) gives
Θ_r = Σ_i Count(S_i, r) / Σ_i Σ_{s ∈ R(α)} Count(S_i, s)
where r is of the form α → β for some β
We'll see next week that there are efficient (dynamic programming) algorithms for the computation of
Count(S_i, r) = Σ_T P(T | S_i, Θ^{t−1}) Count(S_i, T, r)
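Given a routine for these expected counts, the update above mirrors the fully observed MLE from slide 47. A sketch, assuming an expected_counts(S, theta) helper that stands in for the dynamic-programming computation mentioned above:

```python
from collections import Counter, defaultdict

def pcfg_em_update(sentences, theta, expected_counts):
    """One PCFG EM update: relative frequencies of expected rule counts."""
    counts = Counter()
    for S in sentences:
        counts.update(expected_counts(S, theta))     # {r: E[Count(S, T, r)]} under theta
    lhs_totals = defaultdict(float)
    for (lhs, _rhs), c in counts.items():
        lhs_totals[lhs] += c
    return {r: c / lhs_totals[r[0]] for r, c in counts.items()}
```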
