{Probabilistic Stochastic} Context-Free Grammars (PCFGs)

Size: px

Start display at page:

Download "{Probabilistic Stochastic} Context-Free Grammars (PCFGs)"

Sandra Pierce
6 years ago
Views:

1 {Probabilistic Stochastic} Context-Free Grammars (PCFGs) 116

2 The velocity of the seismic waves rises to... S NP sg VP sg DT NN PP risesto... The velocity IN NP pl of the seismic waves 117

3 PCFGs APCFGGconsists of: A set of terminals, {w k },k=1,...,v A set of nonterminals, {N i },i =1,...,n A designated start symbol, N 1 A set of rules, {N i ζ j }, (where ζ j is a sequence of terminals and nonterminals) A corresponding set of probabilities on rules such that: i P(N i ζ j ) = 1 j 118

4 PCFG notation Sentence: sequence of words w 1 w m w ab : the subsequence w a w b N i ab : nonterminal Ni dominates w a w b N j N i w a w b ζ: Repeated derivation from N i gives ζ. 119

5 PCFG probability of a string P(w 1n ) = t P(w 1n,t) t a parse of w 1n = {t:yield(t)=w 1n } P(t) 120

6 A simple PCFG (in CNF) S NP VP 1.0 NP NP PP 0.4 PP PNP 1.0 NP astronomers 0.1 VP VNP 0.7 NP ears 0.18 VP VP PP 0.3 NP saw 0.04 P with 1.0 NP stars 0.18 V saw 1.0 NP telescopes

7 t 1 : S 1.0 NP 0.1 VP 0.7 astronomers V 1.0 NP 0.4 saw NP 0.18 PP 1.0 stars P 1.0 NP 0.18 with ears 122

8 t 2 : S 1.0 NP 0.1 VP 0.3 astronomers VP 0.7 PP 1.0 V 1.0 NP 0.18 P 1.0 NP 0.18 saw stars with ears 123

9 The two parse trees probabilities and the sentence probability P(t 1 ) = = P(t 2 ) = = P(w 15 ) = P(t 1 )+P(t 2 ) =

10 Assumptions of PCFGs 1. Place invariance (like time invariance in HMM): 2. Context-free: k P(N j k(k+c) ζ)is the same P(N j kl ζ words outside w k...w l )=P(N j kl ζ) 3. Ancestor-free: P(N j kl ζ ancestor nodes of Nj kl ) = P(Nj kl ζ) 125

11 Let the upper left index in i N j be an arbitrary identifying index for a particular token of a nonterminal. Then, P 2 NP 1 S the man 3 VP snores = P(1 S NP 12 VP33, 2 NP 12 the 1 man 2, 3 VP 33 sno =... = P(S NP V P)P(NP the man)p(v P snores) 126

12 Some features of PCFGs Reasons to use a PCFG, and some idea of their limitations: Partial solution for grammar ambiguity: a PCFG gives some idea of the plausibility of a sentence. But not a very good idea, as not lexicalized. Better for grammar induction (Gold 1967) Robustness. (Admit everything with low probability.) 127

13 Some features of PCFGs Gives a probabilistic language model for English. In practice, a PCFG is a worse language model for English than a trigram model. Can hope to combine the strengths of a PCFG and a trigram model. PCFG encodes certain biases, e.g., that smaller trees are normally more probable. 128

14 Improper (inconsistent) distributions S rhubarb P = 1 3 S S S P = 2 3 rhubarb rhubarb rhubarb rhubarb rhubarb rhubarb P(L) = = = 2 ( ) 2 2 ( Improper/inconsistent distribution 27 ) 3 2 = Not a problem if you estimate from parsed treebank: Chi and Geman 1998). 129

15 Questions for PCFGs Just as for HMMs, there are three basic questions we wish to answer: P(w 1m G) arg max t P(t w 1m,G) Learning algorithm. Find G such that P(w 1m G) is maximized. 130

16 Chomsky Normal Form grammars We ll do the case of Chomsky Normal Form grammars, which only have rules of the form: N i N j N k N i w j Any CFG can be represented by a weakly equivalent CFG in Chomsky Normal Form. It s straightforward to generalize the algorithm (recall chart parsing). 131

17 PCFG parameters We ll do the case of Chomsky Normal Form grammars, which only have rules of the form: N i N j N k N i w j The parameters of a CNF PCFG are: P(N j N r N s G) A n 3 matrix of parameters P(N j w k G) An nt matrix of parameters For j = 1,...,n, P(N j N r N s )+ P(N j w k ) = 1 r,s k 132

18 Probabilistic Regular Grammar: N i w j N k N i w j Start state, N 1 HMM: P(w 1n ) = 1 n w 1n whereas in a PCFG or a PRG: P(w) = 1 w L 133

19 Consider: P(John decided to bake a) High probability in HMM, low probability in a PRG or a PCFG. Implement via sink state. APRG Π Start HMM Finish 134

20 Comparison of HMMs (PRGs) and PCFGs X: NP N N N sink O: the big brown box NP the N big N brown N 0 box 135

21 Inside and outside probabilities This suggests: whereas for an HMM we have: Forwards = α i (t) = P(w 1(t 1),X t =i) Backwards = β i (t) = P(w tt X t = i) for a PCFG we make use of Inside and Outside probabilities, defined as follows: Outside = α j (p, q) = P(w 1(p 1),Npq,w j (q+1)m G) Inside = β j (p, q) = P(w pq Npq,G) j A slight generalization of dynamic Bayes Nets covers PCFG inference by the inside-outside algorithm (and-or tree of conjunctive daughters disjunctively chosen) 136

22 Inside and outside probabilities in PCFGs. N 1 α w 1 N j β w p 1 w p w q w q+1 w m 137

23 Probability of a string Inside probability P(w 1m G) = P(N 1 w 1m G) = P(w 1m,N1m 1,G) =β 1(1,m) Base case: We want to find β j (k, k) (the probability of a rule N j w k ): β j (k, k) = P(w k N j kk,g) = P(N j w k G) 138

24 Induction: We want to find β j (p, q), forp<q. As this is the inductive step using a Chomsky Normal Form grammar, the first rule must be of the form N j N r N s, so we can proceed by induction, dividing the string in two in various places and summing the result: N j N r N s w p w d w d+1 w q These inside probabilities can be calculated bottom up. 139

25 For all j, β j (p, q) = P(w pq N j pq,g) = r,s = r,s = r,s = r,s q 1 d=p q 1 d=p P(w pd,n r pd,w (d+1)q,n s (d+1)q Nj pq,g) P(N r pd,ns (d+1)q Nj pq,g) P(w pd N j pq,n r pd,ns (d+1)q,g) P(w (d+1)q N j pq,n r pd,ns (d+1)q,w pd,g) q 1 d=p P(N r pd,ns (d+1)q Nj pq,g) P(w pd N r pd, G)P(w (d+1)q N s (d+1)q,g) q 1 d=p P(N j N r N s )β r (p, d)β s (d + 1,q) 140

26 Calculation of inside probabilities (CKY algorithm) β NP = 0.1 β S = β S = β NP = 0.04 β VP = β VP = β V = β NP = 0.18 β NP = β P = 1.0 β PP = β NP = 0.18 astronomers saw stars with ears 141

27 Outside probabilities Probability of a string: For any k, 1 k m, P(w 1m G) = j P(w 1(k 1),w k,w (k+1)m,n j kk G) = j P(w 1(k 1),N j kk,w (k+1)m G) = j P(w k w 1(k 1),N j kk,w (k+1)n,g) α j (k, k)p(n j w k ) Inductive (DP) calculation: One calculates the outside probabilities top down (after determining the inside probabilities). 142

28 Outside probabilities Base Case: α 1 (1,m)=1 α j (1,m)=0,forj 1 Inductive Case: N 1 N f pe N j pq N g (q+1)e w 1 w p 1 w p w q w q+1 w e w e+1 w m 143

29 Outside probabilities Base Case: α 1 (1,m)=1 α j (1,m)=0,forj 1 Inductive Case: it s either a left or right branch we will some over both possibilities and calculate using outside and inside probabilities N 1 N f pe N j pq N g (q+1)e w 1 w p 1 w p w q w q+1 w e w e+1 w m 144

30 Outside probabilities inductive case AnodeNpq j might be the left or right branch of the parent node. We sum over both possibilities. N 1 N f eq N g e(p 1) N j pq w 1 w e 1 w e w p 1 w p w q w q+1 w m 145

31 Inductive Case: α j (p, q) = [ f,g m e=q+1 +[ f,g = [ m P(w 1(p 1),w (q+1)m,n f pe,n j pq,n g (q+1)e )] p 1 e=1 f,gnej e=q+1 P(w 1(p 1),w (q+1)m,n f eq,n g e(p 1),Nj pq)] P(w 1(p 1),w (e+1)m,n f pe)p(n j pq,n g (q+1)e Nf pe) p 1 P(w (q+1)e N g (q+1)e )] + [ P(w 1(e 1),w (q+1)m,neq) f f,g e=1 P(N g e(p 1),Nj pq Neq)P(w f e(p 1) N g e(p 1 )] = [ m α f (p, e)p(n f N j N g )β g (q + 1,e)] f,g e=q+1 +[ f,g p 1 e=1 α f (e, q)p(n f N g N j )β g (e, p 1)] 146

32 Overall probability of a node existing As with a HMM, we can form a product of the inside and outside probabilities. This time: α j (p, q)β j (p, q) = P(w 1(p 1),Npq,w j (q+1)m G)P(w pq Npq,G) j =P(w 1m,Npq G) j Therefore, p(w 1m,N pq G) = j α j (p, q)β j (p, q) Just in the cases of the root node and the preterminals, we know there will always be some such constituent. 147

33 Training a PCFG We construct an EM training algorithm, as for HMMs. We would like to calculate how often each rule is used: ˆP(N j ζ) = C(Nj ζ) γ C(N j γ) Have data count; else work iteratively from expectations of current model. Consider: α j (p, q)β j (p, q) = P(N 1 w 1m,N j = P(N 1 w 1m G)P(N j wpq G) wpq N 1 w 1m,G) We have already solved how to calculate P(N 1 w 1m ); let us call this probability π. Then: and P(N j w pq N 1 E(N j is used in the derivation) = w 1m,G)= α j(p, q)β j (p, q) π m m p=1 q=p α j (p, q)β j (p, q) π 148

34 In the case where we are not dealing with a preterminal, we substitute the inductive definition of β, and r,s,p > q: P(N j N r N s w pq N 1 w 1n,G)= q 1 d=p α j(p, q)p(n j N r N s )β r (p, d)β s (d + 1,q) Therefore the expectation is: E(N j N r N s,n j used) m 1 m q 1 p=1 q=p+1 d=p α j(p, q)p(n j N r N s )β r (p, d)β s (d + 1,q) π Now for the maximization step, we want: π P(N j N r N s ) = E(Nj N r N s,n j used) E(N j used) 149

35 Therefore, the reestimation formula, ˆP(N j N r N s ) is the quotient: ˆP(N j N r N s ) = Similarly, m 1 m q 1 p=1 q=p+1 d=p α j(p, q)p(n j N r N s )β r (p, d)β s (d + 1,q) m m p=1 q=1 α j(p, q)β j (p, q) E(N j w k N 1 w 1m,G)= mh=1 α j (h, h)p(n j w h,w h =w k ) Therefore, ˆP(N j w k ) = π mh=1 α j (h, h)p(n j w h,w h =w k ) mp=1 mq=1 α j (p, q)β j (p, q) Inside-Outside algorithm: repeat this process until the estimated probability change is small. 150

36 Multiple training instances: if we have training sentences W = (W 1,...W ω ), with W i = (w 1,...,w mi ) and we let u and v bet the common subterms from before: and u i (p,q,j,r,s)= q 1 d=p α j(p, q)p(n j N r N s )β r (p, d)β s (d + 1,q) P(N 1 W i G) v i (p,q,j)= α j(p, q)β j (p, q) P(N 1 W i G) Assuming the observations are independent, we can sum contributions: ωi=1 mi 1 mi ˆP(N j N r N s p=1 q=p+1 ) = u i(p,q,j,r,s) ωi=1 mi mi p=1 q=pv i (p,q,j) and ωi=1 ˆP(N j w k ) = ωi=1 mi p=1 {h:w h =w k } v i(h, h, j) mi q=p v i (p,q,j) 151

37 Problems with the Inside-Outside algorithm 1. Slow. Each iteration is O(m 3 n 3 ), where m = ω i=1 m i, and n is the number of nonterminals in the grammar. 2. Local maxima are much more of a problem. Charniak reports that on each trial a different local maximum was found. Use simulated annealing? Restrict rules by initializing some parameters to zero? Or HMM initialization? Reallocate nonterminals away from greedy terminals? 3. Lari and Young suggest that you need many more nonterminals available than are theoretically necessary to get good grammar learning (about a threefold increase?). This compounds the first problem. 4. There is no guarantee that the nonterminals that the algorithm learns will have any satisfactory resemblance to the kinds of nonterminals normally motivated in linguistic analysis (NP, VP, etc.). 152

Probabilistic Context-Free Grammar

Probabilistic Context-Free Grammar Petr Horáček, Eva Zámečníková and Ivana Burgetová Department of Information Systems Faculty of Information Technology Brno University of Technology Božetěchova 2, 612