Probabilistic Context-Free Grammar


1 Probabilistic Context-Free Grammar Petr Horáček, Eva Zámečníková and Ivana Burgetová Department of Information Systems Faculty of Information Technology Brno University of Technology Božetěchova 2, Brno, CZ FRVŠ MŠMT FR97/2011/G1

2 Outline Probabilistic Context-Free Grammar Definition and examples Properties and usage Probabilistic Context-Free Grammar 2 / 57

3 Outline Probabilistic Context-Free Grammar Definition and examples Properties and usage Inside and Outside Probabilities Definitions Algorithms Probabilistic Context-Free Grammar 3 / 57

4 Outline Probabilistic Context-Free Grammar Definition and examples Properties and usage Inside and Outside Probabilities Definitions Algorithms Inside-Outside Algorithm Idea, formal description and properties Probabilistic Context-Free Grammar 4 / 57

5 Topic Probabilistic Context-Free Grammar Definition and examples Properties and usage Inside and Outside Probabilities Definitions Algorithms Inside-Outside Algorithm Idea, formal description and properties Probabilistic Context-Free Grammar 5 / 57

6 Probabilistic Context-Free Grammar A probabilistic context-free grammar (PCFG; also called stochastic CFG, SCFG) is a context-free grammar in which a probability is assigned to each rule. Thus, some derivations become more likely than others. Definition A PCFG G is a quintuple G = (M, T, R, S, P), where M = {N^i : i = 1, ..., n} is a set of nonterminals, T = {w^k : k = 1, ..., V} is a set of terminals, R = {N^i → ζ^j : ζ^j ∈ (M ∪ T)*} is a set of rules, S = N^1 is the start symbol, and P is a corresponding set of probabilities on rules such that ∀i: Σ_j P(N^i → ζ^j) = 1. Probabilistic Context-Free Grammar 6 / 57

7 Probabilistic Context-Free Grammar Notations G grammar (PCFG); L(G) language generated by grammar G; t parse tree; {N^1, ..., N^n} nonterminal vocabulary; {w^1, ..., w^V} terminal vocabulary; N^1 start symbol; w_1 ... w_m sentence to be parsed; N^j_pq nonterminal N^j spans positions p through q in the string; α_j(p, q) outside probabilities; β_j(p, q) inside probabilities. Probabilistic Context-Free Grammar 7 / 57

8 PCFG Example S NP VP 1.0 PP P NP 1.0 VP V NP 0.7 VP VP PP 0.3 P with 1.0 V saw 1.0 NP NP PP 0.4 NP astronomers 0.1 NP ears 0.18 NP saw 0.04 NP stars 0.18 NP telescopes 0.1 Probabilistic Context-Free Grammar 8 / 57

9 PCFG Example S NP VP 1.0 PP P NP 1.0 VP V NP 0.7 VP VP PP 0.3 P with 1.0 V saw 1.0 NP NP PP 0.4 NP astronomers 0.1 NP ears 0.18 NP saw 0.04 NP stars 0.18 NP telescopes 0.1 t1: [S 1.0 [NP 0.1 astronomers] [VP 0.7 [V 1.0 saw] [NP 0.4 [NP 0.18 stars] [PP 1.0 [P 1.0 with] [NP 0.18 ears]]]]] Probabilistic Context-Free Grammar 9 / 57

10 PCFG Example S NP VP 1.0 PP P NP 1.0 VP V NP 0.7 VP VP PP 0.3 P with 1.0 V saw 1.0 NP NP PP 0.4 NP astronomers 0.1 NP ears 0.18 NP saw 0.04 NP stars 0.18 NP telescopes 0.1 t1: [S 1.0 [NP 0.1 astronomers] [VP 0.7 [V 1.0 saw] [NP 0.4 [NP 0.18 stars] [PP 1.0 [P 1.0 with] [NP 0.18 ears]]]]] P(t1) = 1.0 × 0.1 × 0.7 × 1.0 × 0.4 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0009072 Probabilistic Context-Free Grammar 10 / 57

11 PCFG Example S NP VP 1.0 PP P NP 1.0 VP V NP 0.7 VP VP PP 0.3 P with 1.0 V saw 1.0 NP NP PP 0.4 NP astronomers 0.1 NP ears 0.18 NP saw 0.04 NP stars 0.18 NP telescopes 0.1 Probabilistic Context-Free Grammar 11 / 57

12 PCFG Example S NP VP 1.0 PP P NP 1.0 VP V NP 0.7 VP VP PP 0.3 P with 1.0 V saw 1.0 NP NP PP 0.4 NP astronomers 0.1 NP ears 0.18 NP saw 0.04 NP stars 0.18 NP telescopes 0.1 t2: [S 1.0 [NP 0.1 astronomers] [VP 0.3 [VP 0.7 [V 1.0 saw] [NP 0.18 stars]] [PP 1.0 [P 1.0 with] [NP 0.18 ears]]]] Probabilistic Context-Free Grammar 12 / 57

13 PCFG Example S NP VP 1.0 PP P NP 1.0 VP V NP 0.7 VP VP PP 0.3 P with 1.0 V saw 1.0 NP NP PP 0.4 NP astronomers 0.1 NP ears 0.18 NP saw 0.04 NP stars 0.18 NP telescopes 0.1 t2: [S 1.0 [NP 0.1 astronomers] [VP 0.3 [VP 0.7 [V 1.0 saw] [NP 0.18 stars]] [PP 1.0 [P 1.0 with] [NP 0.18 ears]]]] P(t2) = 1.0 × 0.1 × 0.3 × 0.7 × 1.0 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0006804 Probabilistic Context-Free Grammar 13 / 57

14 PCFG Example For the sentence astronomers saw stars with ears, we can construct 2 parse trees. P(t1) = 0.0009072, P(t2) = 0.0006804. Sentence probability: P(w_15) = P(t1) + P(t2) = 0.0009072 + 0.0006804 = 0.0015876. Probabilistic Context-Free Grammar 14 / 57
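
To make the example concrete, here is a minimal Python sketch (not part of the original slides) that encodes the grammar above as a dictionary from rules to probabilities and computes the probability of a parse as the product of the probabilities of the rules used in it. The tuple encoding of trees and all identifier names are my own choices.

```python
# Minimal sketch: the example grammar as a rule -> probability map, and the two
# parses of "astronomers saw stars with ears" as nested tuples.
rules = {
    ("S", ("NP", "VP")): 1.0, ("PP", ("P", "NP")): 1.0,
    ("VP", ("V", "NP")): 0.7, ("VP", ("VP", "PP")): 0.3,
    ("P", ("with",)): 1.0, ("V", ("saw",)): 1.0,
    ("NP", ("NP", "PP")): 0.4, ("NP", ("astronomers",)): 0.1,
    ("NP", ("ears",)): 0.18, ("NP", ("saw",)): 0.04,
    ("NP", ("stars",)): 0.18, ("NP", ("telescopes",)): 0.1,
}

def tree_prob(tree):
    """P(t) = product of the probabilities of all rules used in tree t."""
    label, children = tree[0], tree[1:]
    if isinstance(children[0], str):          # preterminal rule N -> w
        return rules[(label, children)]
    p = rules[(label, tuple(child[0] for child in children))]
    for child in children:
        p *= tree_prob(child)
    return p

t1 = ("S", ("NP", "astronomers"),
           ("VP", ("V", "saw"),
                  ("NP", ("NP", "stars"),
                         ("PP", ("P", "with"), ("NP", "ears")))))
t2 = ("S", ("NP", "astronomers"),
           ("VP", ("VP", ("V", "saw"), ("NP", "stars")),
                  ("PP", ("P", "with"), ("NP", "ears"))))

print(tree_prob(t1))                  # ≈ 0.0009072
print(tree_prob(t2))                  # ≈ 0.0006804
print(tree_prob(t1) + tree_prob(t2))  # ≈ 0.0015876 = P(sentence)
```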

15 PCFG Assumptions 1 Place invariance: ∀k, l: P(N^j_{k(k+c)} → ζ) = P(N^j_{l(l+c)} → ζ) Probabilistic Context-Free Grammar 15 / 57

16 PCFG Assumptions 1 Place invariance: ∀k, l: P(N^j_{k(k+c)} → ζ) = P(N^j_{l(l+c)} → ζ) 2 Context-free: P(N^j_{kl} → ζ | anything outside k through l) = P(N^j_{kl} → ζ) Probabilistic Context-Free Grammar 16 / 57

17 PCFG Assumptions 1 Place invariance: ∀k, l: P(N^j_{k(k+c)} → ζ) = P(N^j_{l(l+c)} → ζ) 2 Context-free: P(N^j_{kl} → ζ | anything outside k through l) = P(N^j_{kl} → ζ) 3 Ancestor-free: P(N^j_{kl} → ζ | any ancestor nodes outside N^j_{kl}) = P(N^j_{kl} → ζ) Probabilistic Context-Free Grammar 17 / 57

18 PCFG Features Gives a probabilistic language model. Can give some idea of the plausibility of different parses of ambiguous sentences; however, only structure is taken into account, not lexical co-occurrence. Good for grammar induction: can be learned from positive data alone. Robust, able to deal with grammatical mistakes. In practice, a PCFG turns out to be a worse language model for English than n-gram models (no lexical context). However, we can combine the strengths of PCFGs (sentence structure) and n-gram models (lexical co-occurrence). Probabilistic Context-Free Grammar 18 / 57

19 PCFG Questions 1 Probability of a sentence w_1m according to grammar G: P(w_1m | G) = ? We will consider grammars in Chomsky Normal Form (without loss of generality). Probabilistic Context-Free Grammar 19 / 57

20 PCFG Questions 1 Probability of a sentence w_1m according to grammar G: P(w_1m | G) = ? 2 The most likely parse for a sentence: arg max_t P(t | w_1m, G) = ? We will consider grammars in Chomsky Normal Form (without loss of generality). Probabilistic Context-Free Grammar 20 / 57

21 PCFG Questions 1 Probability of a sentence w_1m according to grammar G: P(w_1m | G) = ? 2 The most likely parse for a sentence: arg max_t P(t | w_1m, G) = ? 3 Setting rule probabilities to maximize the probability of a sentence: arg max_G P(w_1m | G) = ? We will consider grammars in Chomsky Normal Form (without loss of generality). Probabilistic Context-Free Grammar 21 / 57

22 Topic Probabilistic Context-Free Grammar Definition and examples Properties and usage Inside and Outside Probabilities Definitions Algorithms Inside-Outside Algorithm Idea, formal description and properties Probabilistic Context-Free Grammar 22 / 57

23 Sentence Probability Definition Sentence probability of a sentence w_1m according to grammar G: P(w_1m | G) = Σ_t P(w_1m, t), where t is a parse tree of the sentence. Trivial solution: Find all parse trees, calculate and sum up their probabilities. Problem: Exponential time complexity in general - unsuitable in practice. Efficient solution: Using inside and outside probabilities. Probabilistic Context-Free Grammar 23 / 57

24 Inside and Outside Probabilities Definition Inside probability: β_j(p, q) = P(w_pq | N^j_pq, G) (Figure: parse tree with root N^1 over w_1 ... w_m; N^j dominates w_p ... w_q, the part of the string whose probability is β.) Probabilistic Context-Free Grammar 24 / 57

25 Inside and Outside Probabilities Definition Outside probability: α_j(p, q) = P(w_1(p-1), N^j_pq, w_(q+1)m | G) (Figure: parse tree with root N^1 over w_1 ... w_m; α covers everything outside the subtree rooted in N^j_pq, i.e. w_1 ... w_(p-1) and w_(q+1) ... w_m.) Probabilistic Context-Free Grammar 25 / 57

26 Inside and Outside Probabilities Definition Inside probability: β_j(p, q) = P(w_pq | N^j_pq, G) Outside probability: α_j(p, q) = P(w_1(p-1), N^j_pq, w_(q+1)m | G) (Figure: the inside probability β covers w_p ... w_q under N^j; the outside probability α covers the rest of the string and the rest of the tree.) Probabilistic Context-Free Grammar 26 / 57

27 Inside Algorithm Dynamic programming algorithm based on inside probabilities: P(w_1m | G) = P(w_1m | N^1_1m, G) = β_1(1, m) Calculates the inside probabilities recursively, bottom up. 1 Base case: β_j(k, k) = P(w_k | N^j_kk, G) = P(N^j → w_k | G) 2 Induction: β_j(p, q) = P(w_pq | N^j_pq, G) = Σ_{r,s} Σ_{d=p}^{q-1} P(N^j → N^r N^s) β_r(p, d) β_s(d + 1, q) (N^j spans positions p through q, splitting into N^r over p..d and N^s over d+1..q.) Probabilistic Context-Free Grammar 27 / 57

28 Inside Algorithm Example S NP VP 1.0 NP NP PP 0.4 PP P NP 1.0 NP astronomers 0.1 VP V NP 0.7 NP ears 0.18 VP VP PP 0.3 NP saw 0.04 P with 1.0 NP stars 0.18 V saw 1.0 NP telescopes 0.1 astronomers saw stars with ears Probabilistic Context-Free Grammar 28 / 57

29 Inside Algorithm Example S NP VP 1.0 NP NP PP 0.4 PP P NP 1.0 NP astronomers 0.1 VP V NP 0.7 NP ears 0.18 VP VP PP 0.3 NP saw 0.04 P with 1.0 NP stars 0.18 V saw 1.0 NP telescopes 0.1 β_NP(1,1) = 0.1 astronomers saw stars with ears Probabilistic Context-Free Grammar 29 / 57

30 Inside Algorithm Example S NP VP 1.0 NP NP PP 0.4 PP P NP 1.0 NP astronomers 0.1 VP V NP 0.7 NP ears 0.18 VP VP PP 0.3 NP saw 0.04 P with 1.0 NP stars 0.18 V saw 1.0 NP telescopes 0.1 β_NP(1,1) = 0.1, β_NP(2,2) = 0.04, β_V(2,2) = 1.0 astronomers saw stars with ears Probabilistic Context-Free Grammar 30 / 57

31 Inside Algorithm Example S NP VP 1.0 NP NP PP 0.4 PP P NP 1.0 NP astronomers 0.1 VP V NP 0.7 NP ears 0.18 VP VP PP 0.3 NP saw 0.04 P with 1.0 NP stars 0.18 V saw 1.0 NP telescopes 0.1 β_NP(1,1) = 0.1, β_NP(2,2) = 0.04, β_V(2,2) = 1.0, β_NP(3,3) = 0.18 astronomers saw stars with ears Probabilistic Context-Free Grammar 31 / 57

32 Inside Algorithm Example S NP VP 1.0 NP NP PP 0.4 PP P NP 1.0 NP astronomers 0.1 VP V NP 0.7 NP ears 0.18 VP VP PP 0.3 NP saw 0.04 P with 1.0 NP stars 0.18 V saw 1.0 NP telescopes 0.1 β_NP(1,1) = 0.1, β_NP(2,2) = 0.04, β_V(2,2) = 1.0, β_NP(3,3) = 0.18, β_P(4,4) = 1.0 astronomers saw stars with ears Probabilistic Context-Free Grammar 32 / 57

33 Inside Algorithm Example S NP VP 1.0 NP NP PP 0.4 PP P NP 1.0 NP astronomers 0.1 VP V NP 0.7 NP ears 0.18 VP VP PP 0.3 NP saw 0.04 P with 1.0 NP stars 0.18 V saw 1.0 NP telescopes 0.1 β_NP(1,1) = 0.1, β_NP(2,2) = 0.04, β_V(2,2) = 1.0, β_NP(3,3) = 0.18, β_P(4,4) = 1.0, β_NP(5,5) = 0.18 astronomers saw stars with ears Probabilistic Context-Free Grammar 33 / 57

34 Inside Algorithm Example S NP VP 1.0 NP NP PP 0.4 PP P NP 1.0 NP astronomers 0.1 VP V NP 0.7 NP ears 0.18 VP VP PP 0.3 NP saw 0.04 P with 1.0 NP stars 0.18 V saw 1.0 NP telescopes 0.1 β_NP(1,1) = 0.1, β_NP(2,2) = 0.04, β_V(2,2) = 1.0, β_NP(3,3) = 0.18, β_P(4,4) = 1.0, β_NP(5,5) = 0.18, β_VP(2,3) = 0.126 astronomers saw stars with ears Probabilistic Context-Free Grammar 34 / 57

35 Inside Algorithm Example S NP VP 1.0 NP NP PP 0.4 PP P NP 1.0 NP astronomers 0.1 VP V NP 0.7 NP ears 0.18 VP VP PP 0.3 NP saw 0.04 P with 1.0 NP stars 0.18 V saw 1.0 NP telescopes 0.1 β_NP(1,1) = 0.1, β_NP(2,2) = 0.04, β_V(2,2) = 1.0, β_NP(3,3) = 0.18, β_P(4,4) = 1.0, β_NP(5,5) = 0.18, β_VP(2,3) = 0.126, β_PP(4,5) = 0.18 astronomers saw stars with ears Probabilistic Context-Free Grammar 35 / 57

36 Inside Algorithm Example S NP VP 1.0 NP NP PP 0.4 PP P NP 1.0 NP astronomers 0.1 VP V NP 0.7 NP ears 0.18 VP VP PP 0.3 NP saw 0.04 P with 1.0 NP stars 0.18 V saw 1.0 NP telescopes 0.1 β_NP(1,1) = 0.1, β_NP(2,2) = 0.04, β_V(2,2) = 1.0, β_NP(3,3) = 0.18, β_P(4,4) = 1.0, β_NP(5,5) = 0.18, β_VP(2,3) = 0.126, β_PP(4,5) = 0.18, β_S(1,3) = 0.0126 astronomers saw stars with ears Probabilistic Context-Free Grammar 36 / 57

37 Inside Algorithm Example S NP VP 1.0 NP NP PP 0.4 PP P NP 1.0 NP astronomers 0.1 VP V NP 0.7 NP ears 0.18 VP VP PP 0.3 NP saw 0.04 P with 1.0 NP stars 0.18 V saw 1.0 NP telescopes 0.1 β_NP(1,1) = 0.1, β_NP(2,2) = 0.04, β_V(2,2) = 1.0, β_NP(3,3) = 0.18, β_P(4,4) = 1.0, β_NP(5,5) = 0.18, β_VP(2,3) = 0.126, β_PP(4,5) = 0.18, β_S(1,3) = 0.0126, β_NP(3,5) = 0.01296 astronomers saw stars with ears Probabilistic Context-Free Grammar 37 / 57

38 Inside Algorithm Example S NP VP 1.0 NP NP PP 0.4 PP P NP 1.0 NP astronomers 0.1 VP V NP 0.7 NP ears 0.18 VP VP PP 0.3 NP saw 0.04 P with 1.0 NP stars 0.18 V saw 1.0 NP telescopes 0.1 β_NP(1,1) = 0.1, β_NP(2,2) = 0.04, β_V(2,2) = 1.0, β_NP(3,3) = 0.18, β_P(4,4) = 1.0, β_NP(5,5) = 0.18, β_VP(2,3) = 0.126, β_PP(4,5) = 0.18, β_S(1,3) = 0.0126, β_NP(3,5) = 0.01296, β_VP(2,5) = 0.015876 astronomers saw stars with ears Probabilistic Context-Free Grammar 38 / 57

39 Inside Algorithm Example S NP VP 1.0 NP NP PP 0.4 PP P NP 1.0 NP astronomers 0.1 VP V NP 0.7 NP ears 0.18 VP VP PP 0.3 NP saw 0.04 P with 1.0 NP stars 0.18 V saw 1.0 NP telescopes 0.1 β_NP(1,1) = 0.1, β_NP(2,2) = 0.04, β_V(2,2) = 1.0, β_NP(3,3) = 0.18, β_P(4,4) = 1.0, β_NP(5,5) = 0.18, β_VP(2,3) = 0.126, β_PP(4,5) = 0.18, β_S(1,3) = 0.0126, β_NP(3,5) = 0.01296, β_VP(2,5) = 0.015876, β_S(1,5) = 0.0015876 astronomers saw stars with ears Probabilistic Context-Free Grammar 39 / 57
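
The hand-filled chart above can be reproduced with a short dynamic-programming sketch. This is a minimal illustration, assuming the grammar is in CNF and stored as separate binary and lexical rule dictionaries (my own conventions, not from the slides); spans are 1-based and inclusive, matching the β_j(p, q) notation.

```python
from collections import defaultdict

# Example grammar from the slides, split into binary and lexical (preterminal) rules.
binary = {("S", "NP", "VP"): 1.0, ("PP", "P", "NP"): 1.0,
          ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
          ("NP", "NP", "PP"): 0.4}
lexical = {("P", "with"): 1.0, ("V", "saw"): 1.0,
           ("NP", "astronomers"): 0.1, ("NP", "ears"): 0.18,
           ("NP", "saw"): 0.04, ("NP", "stars"): 0.18,
           ("NP", "telescopes"): 0.1}

def inside_probs(words, binary, lexical):
    """Inside probabilities beta[(A, p, q)] for a CNF PCFG (1-based, inclusive spans)."""
    m = len(words)
    beta = defaultdict(float)
    for k, w in enumerate(words, start=1):            # base case: beta_A(k, k) = P(A -> w_k)
        for (A, word), prob in lexical.items():
            if word == w:
                beta[(A, k, k)] = prob
    for span in range(2, m + 1):                      # induction: shorter spans first
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (A, B, C), prob in binary.items():
                for d in range(p, q):                 # split: B spans p..d, C spans d+1..q
                    beta[(A, p, q)] += prob * beta[(B, p, d)] * beta[(C, d + 1, q)]
    return beta

sentence = "astronomers saw stars with ears".split()
beta = inside_probs(sentence, binary, lexical)
print(beta[("S", 1, 5)])   # ≈ 0.0015876, matching the completed chart above
```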

40 Outside Algorithm Dynamic programming algorithm based on outside probabilities: P(w_1m | G) = Σ_j P(w_1(k-1), w_k, w_(k+1)m, N^j_kk | G) = Σ_j P(w_1(k-1), N^j_kk, w_(k+1)m | G) P(w_k | w_1(k-1), N^j_kk, w_(k+1)m, G) = Σ_j α_j(k, k) P(N^j → w_k) for any k such that 1 ≤ k ≤ m. Calculates the outside probabilities recursively, top down. Requires reference to inside probabilities. Probabilistic Context-Free Grammar 40 / 57

41 Outside Algorithm Case 1 (N^j_pq is the left child of N^f_pe, with right sibling N^g_(q+1)e): α_j(p, q) = Σ_{f,g} Σ_{e=q+1}^{m} α_f(p, e) P(N^f → N^j N^g) β_g(q + 1, e) Probabilistic Context-Free Grammar 41 / 57

42 Outside Algorithm Case 2 (N^j_pq is the right child of N^f_eq, with left sibling N^g_e(p-1)): α_j(p, q) = Σ_{f,g} Σ_{e=1}^{p-1} α_f(e, q) P(N^f → N^g N^j) β_g(e, p - 1) Probabilistic Context-Free Grammar 42 / 57

43 Outside Algorithm 1 Base case: α_1(1, m) = 1, α_j(1, m) = 0 for j ≠ 1 2 Induction: α_j(p, q) = Σ_{f,g} Σ_{e=q+1}^{m} α_f(p, e) P(N^f → N^j N^g) β_g(q + 1, e) + Σ_{f,g} Σ_{e=1}^{p-1} α_f(e, q) P(N^f → N^g N^j) β_g(e, p - 1) Probabilistic Context-Free Grammar 43 / 57
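
A matching sketch of the outside recursion, reusing the β table and the rule dictionaries from the inside sketch above (again my own data layout, not prescribed by the slides). Spans are processed from longest to shortest, since α for a span only depends on α of larger spans and on β.

```python
from collections import defaultdict

def outside_probs(words, binary, beta, start="S"):
    """Outside probabilities alpha[(A, p, q)], given the inside table beta."""
    m = len(words)
    alpha = defaultdict(float)
    alpha[(start, 1, m)] = 1.0                         # base case: alpha_1(1, m) = 1
    for span in range(m - 1, 0, -1):                   # induction: longer spans first
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (F, B, C), prob in binary.items():
                for e in range(q + 1, m + 1):          # case 1: B at (p, q) is a left child
                    alpha[(B, p, q)] += alpha[(F, p, e)] * prob * beta[(C, q + 1, e)]
                for e in range(1, p):                  # case 2: C at (p, q) is a right child
                    alpha[(C, p, q)] += alpha[(F, e, q)] * prob * beta[(B, e, p - 1)]
    return alpha

alpha = outside_probs(sentence, binary, beta)
# Sanity check: sum_j alpha_j(k, k) * P(N^j -> w_k) equals P(w_1m | G) for any k.
k = 3
print(sum(alpha[(A, k, k)] * p
          for (A, w), p in lexical.items() if w == sentence[k - 1]))  # ≈ 0.0015876 again
```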

44 Sentence Probability Summary Using inside probabilities: P(w_1m | G) = β_1(1, m) Using outside probabilities: P(w_1m | G) = Σ_j α_j(k, k) P(N^j → w_k) for any k such that 1 ≤ k ≤ m. Probability of a sentence w_1m and that there is some constituent spanning from word p to q: P(w_1m, N_pq | G) = Σ_j α_j(p, q) β_j(p, q) Probabilistic Context-Free Grammar 44 / 57

45 Finding the Most Likely Parse Modification of the inside algorithm: find the maximum element of the sum in each step and record which rule gave this maximum. We can define accumulators (similar to the Viterbi algorithm for HMMs): δ_i(p, q) = the highest probability of a parse of the subtree rooted in N^i_pq Probabilistic Context-Free Grammar 45 / 57

46 Finding the Most Likely Parse 1 Base case: δ_i(p, p) = P(N^i → w_p) 2 Induction: δ_i(p, q) = max_{1≤j,k≤n; p≤r<q} P(N^i → N^j N^k) δ_j(p, r) δ_k(r + 1, q) Backtrace: ψ_i(p, q) = arg max_{(j,k,r)} P(N^i → N^j N^k) δ_j(p, r) δ_k(r + 1, q) 3 Termination: P(t̂) = δ_1(1, m) We need to reconstruct the parse tree t̂ from the backtraces. Probabilistic Context-Free Grammar 46 / 57
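
The δ/ψ recursion can be sketched as follows; it mirrors the inside algorithm but replaces the sum with a max and records a backtrace. The dictionaries binary and lexical are the ones from the earlier sketches; this is an illustrative sketch, not a reference implementation.

```python
def best_parse(words, binary, lexical, start="S"):
    """Viterbi-style search for the most likely parse of a CNF PCFG."""
    m = len(words)
    delta, psi = {}, {}
    for k, w in enumerate(words, start=1):             # base case: delta_A(k, k) = P(A -> w_k)
        for (A, word), prob in lexical.items():
            if word == w:
                delta[(A, k, k)], psi[(A, k, k)] = prob, w
    for span in range(2, m + 1):                       # induction: keep the best rule and split
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (A, B, C), prob in binary.items():
                for d in range(p, q):
                    if (B, p, d) in delta and (C, d + 1, q) in delta:
                        score = prob * delta[(B, p, d)] * delta[(C, d + 1, q)]
                        if score > delta.get((A, p, q), 0.0):
                            delta[(A, p, q)] = score
                            psi[(A, p, q)] = (B, d, C)  # backtrace record
    def build(A, p, q):                                # reconstruct the parse tree from psi
        entry = psi[(A, p, q)]
        if isinstance(entry, str):
            return (A, entry)
        B, d, C = entry
        return (A, build(B, p, d), build(C, d + 1, q))
    if (start, 1, m) not in delta:
        return 0.0, None
    return delta[(start, 1, m)], build(start, 1, m)

print(best_parse(sentence, binary, lexical))  # ≈ 0.0009072 and the t1 parse from the example
```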

47 Topic Probabilistic Context-Free Grammar Definition and examples Properties and usage Inside and Outside Probabilities Definitions Algorithms Inside-Outside Algorithm Idea, formal description and properties Probabilistic Context-Free Grammar 47 / 57

48 Training a PCFG Assume a certain topology of the grammar G is given in advance: the number of terminals and nonterminals, the name of the start symbol, and the set of rules (we can have a given structure of the grammar, but we can also assume all possible rewriting rules exist). We want to set the probabilities of rules to maximize the likelihood of the training data: P̂(N^j → ζ) = C(N^j → ζ) / Σ_γ C(N^j → γ), where C(x) is the number of times the rule x is used. This is trivial if we have a parsed corpus for training. Probabilistic Context-Free Grammar 48 / 57
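
When a parsed corpus is available, the estimate P̂(N^j → ζ) = C(N^j → ζ) / Σ_γ C(N^j → γ) is plain relative-frequency counting. A minimal sketch, reusing the tuple-encoded trees from the earlier example (t1 and t2 stand in for a treebank here; everything else is my own naming):

```python
from collections import Counter, defaultdict

def mle_rule_probs(trees):
    """Relative-frequency rule probabilities from a list of parsed trees."""
    counts = Counter()
    def collect(tree):
        label, children = tree[0], tree[1:]
        if isinstance(children[0], str):               # preterminal rule N -> w
            counts[(label, children)] += 1
            return
        counts[(label, tuple(c[0] for c in children))] += 1
        for child in children:
            collect(child)
    for t in trees:
        collect(t)
    lhs_total = defaultdict(int)                       # C(N^j -> anything)
    for (lhs, _), c in counts.items():
        lhs_total[lhs] += c
    return {rule: c / lhs_total[rule[0]] for rule, c in counts.items()}

print(mle_rule_probs([t1, t2])[("VP", ("V", "NP"))])
# ≈ 0.667: VP -> V NP is used twice out of three VP expansions in t1 and t2
```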

49 Inside-Outside Algorithm Idea Usually, a parsed training corpus is not available. Hidden data problem: we can only directly observe the probabilities of sentences, not of rules. We can use an iterative algorithm to determine improving estimates: the inside-outside algorithm. 1 Begin with a given grammar topology and some initial probability estimates for the rules. 2 The probability of each parse of a training sentence according to G will act as our confidence in it. 3 Sum the probabilities of each rule being used in each place to give an expectation of how often each rule was used. 4 Use the expectations to refine the probability estimates and increase the likelihood of the training corpus according to G. Probabilistic Context-Free Grammar 49 / 57

50 Inside-Outside Algorithm α_j(p, q) β_j(p, q) = P(w_1m, N^j_pq | G) = P(w_1m | G) P(N^j_pq | w_1m, G), hence P(N^j_pq | w_1m, G) = α_j(p, q) β_j(p, q) / P(w_1m | G) To estimate the count of times the nonterminal N^j is used in the derivation: E(N^j is used in the derivation) = Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p, q) β_j(p, q) / P(w_1m | G) (1) Probabilistic Context-Free Grammar 50 / 57

51 Inside-Outside Algorithm If N^j is not a preterminal, we can substitute the inductive definition of β. Then, for any r, s, p, q: P(N^j_pq, N^j → N^r N^s | w_1m, G) = Σ_{d=p}^{q-1} α_j(p, q) P(N^j → N^r N^s) β_r(p, d) β_s(d + 1, q) / P(w_1m | G) To estimate the number of times this rule is used in the derivation: E(N^j → N^r N^s, N^j used) = Σ_{p=1}^{m-1} Σ_{q=p+1}^{m} Σ_{d=p}^{q-1} α_j(p, q) P(N^j → N^r N^s) β_r(p, d) β_s(d + 1, q) / P(w_1m | G) (2) Probabilistic Context-Free Grammar 51 / 57

52 Inside-Outside Algorithm For the maximization step, we want: P̂(N^j → N^r N^s) = E(N^j → N^r N^s, N^j used) / E(N^j used) Reestimation formula: P̂(N^j → N^r N^s) = (2)/(1) = Σ_{p=1}^{m-1} Σ_{q=p+1}^{m} Σ_{d=p}^{q-1} α_j(p, q) P(N^j → N^r N^s) β_r(p, d) β_s(d + 1, q) / Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p, q) β_j(p, q) (3) Analogously for preterminals, we get: P̂(N^j → w^k) = Σ_{h=1}^{m} α_j(h, h) P(w_h = w^k) β_j(h, h) / Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p, q) β_j(p, q) (4) Probabilistic Context-Free Grammar 52 / 57

53 Inside-Outside Algorithm Method 1 Initialize probabilities of rules in G. 2 Calculate inside probabilities for the training sentence. 3 Calculate outside probabilities for the training sentence. 4 Update the rule probabilities using reestimation formulas (3) and (4). 5 Repeat from step 2 until the change in estimated rule probabilities is sufficiently small. The probability of the training corpus according to G will improve (or at least not get worse): P(W | G_{i+1}) ≥ P(W | G_i), where i is the current iteration of training. Probabilistic Context-Free Grammar 53 / 57
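
A minimal sketch of one reestimation step on a single training sentence, following formulas (1)-(4) and reusing inside_probs and outside_probs from the sketches above. It assumes the sentence has non-zero probability under the current grammar; a full implementation would accumulate the expectations over the whole training corpus before normalising and would loop until convergence, as the method above describes.

```python
from collections import defaultdict

def reestimate(words, binary, lexical, start="S"):
    """One EM (inside-outside) update of the rule probabilities on one sentence."""
    m = len(words)
    beta = inside_probs(words, binary, lexical)
    alpha = outside_probs(words, binary, beta, start)
    sent_prob = beta[(start, 1, m)]                    # assumed > 0 in this sketch

    used = defaultdict(float)                          # formula (1): E(N^j used)
    for (A, p, q), b in beta.items():
        used[A] += alpha[(A, p, q)] * b / sent_prob

    new_binary, new_lexical = {}, {}
    for (A, B, C), prob in binary.items():             # formulas (2) and (3)
        expected = 0.0
        for p in range(1, m):
            for q in range(p + 1, m + 1):
                for d in range(p, q):
                    expected += (alpha[(A, p, q)] * prob *
                                 beta[(B, p, d)] * beta[(C, d + 1, q)]) / sent_prob
        new_binary[(A, B, C)] = expected / used[A] if used[A] else prob
    for (A, w), prob in lexical.items():               # formula (4)
        expected = sum(alpha[(A, h, h)] * beta[(A, h, h)] / sent_prob
                       for h in range(1, m + 1) if words[h - 1] == w)
        new_lexical[(A, w)] = expected / used[A] if used[A] else prob
    return new_binary, new_lexical
```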

54 Inside-Outside Algorithm Properties Time complexity: for each sentence, each iteration of training is O(m^3 n^3), where m is the length of the sentence and n is the number of nonterminals in the grammar. Relatively slow compared to linear models (such as HMMs). Problems with local maxima; highly sensitive to the initialization of parameters. Generally, we cannot guarantee any resemblance between the trained grammar and the kinds of structures commonly used in NLP (NP, VP, etc.). The only hard constraint is that N^1 remains the start symbol. We could impose further constraints. Probabilistic Context-Free Grammar 54 / 57

55 References Christopher D. Manning, Hinrich Schütze: Foundations of Statistical Natural Language Processing, MIT Press, 1999. Fei Xia: Inside-outside algorithm (presentation), University of Washington, LING572/inside-outside.ppt. Probabilistic Context-Free Grammar 55 / 57

56 Thank you for your attention!

57 End
