Week 13: Language Modeling II Smoothing in Language Modeling. Irina Sergienya



2 Couple of words first... There are many more smoothing techniques (e.g. Katz back-off, Jelinek-Mercer, ...) and techniques to improve LMs (e.g. caching, skipping models, clustering, sentence mixtures, ...). The concepts are the same but the formulas differ, so check several sources before implementing.

3 Recall: Language Models
Given a sentence, we would like to estimate how likely it is to see such a sentence in a language:
$P(w_1^{\text{length}(s)}) = \prod_i P(w_i \mid w_1^{i-1})$, with $P(w_k \mid w_1^{k-1}) = \frac{C(w_1^k)}{C(w_1^{k-1})}$
Problems:
- sparseness of training data (not enough to estimate probabilities)
- zero probability of unseen events
Solution: SMOOTHING!

4 Smoothing
Take some probability mass from seen events and assign it to unseen events.
[Figure: before smoothing, P(seen) = 1; after smoothing, P(seen) shrinks and the freed mass becomes P(unseen).]

5 Recall: Laplace Smoothing
$P(w_n \mid w_1^{n-1}) = \frac{C(w_1^n)}{C(w_1^{n-1})}$
$P_{Laplace}(w_n \mid w_1^{n-1}) = \frac{C(w_1^n) + 1}{C(w_1^{n-1}) + V}$
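To make the two formulas on this slide concrete, here is a minimal Python sketch for the bigram case. The toy corpus and the helper names `bigram_counts`, `p_mle` and `p_laplace` are illustrative assumptions, not part of the lecture.

```python
from collections import Counter

def bigram_counts(tokens):
    """Return (bigram counts, context counts) for a list of tokens."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])  # every token that starts a bigram
    return bigrams, contexts

def p_mle(word, prev, bigrams, contexts):
    """Unsmoothed estimate C(prev word) / C(prev); 0 for an unseen context."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

def p_laplace(word, prev, bigrams, contexts, vocab_size):
    """Add-one estimate (C(prev word) + 1) / (C(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)

# Toy corpus (assumption, for illustration only).
tokens = "i want to read a book i want a book".split()
bigrams, contexts = bigram_counts(tokens)
V = len(set(tokens))

print(p_mle("to", "a", bigrams, contexts))         # unseen bigram: zero
print(p_laplace("to", "a", bigrams, contexts, V))  # unseen bigram: small but non-zero
```

The unseen bigram "a to" gets probability zero under the raw estimate and a non-zero value after add-one smoothing, while the smoothed distribution over the vocabulary still sums to one.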

6 Recall: Good-Turing Smoothing
Use the count of events we have seen once to help estimate the count of events we have never seen.
[Figure: probability mass shifted between P(unseen), P(seen once) and P(seen > once).]

7 Recall: Good-Turing Smoothing
Use the count of events we have seen once to help estimate the count of events we have never seen.
$N_c$ = the count of events we've seen $c$ times.
Estimate (for $C(w) = c$): $P_{Good\text{-}Turing}(w) = \frac{1}{N} \cdot \frac{(c+1)\, N_{c+1}}{N_c}$
Here $N$ is $M$ from the previous lecture.
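The estimate above can be sketched in a few lines of Python. The toy counts and the name `good_turing` are assumptions for illustration. Two further choices are also assumptions rather than part of the slide: for c = 0 the function returns the total mass N_1 / N shared by all unseen events (N_0 is not known from the data), and when N_{c+1} = 0 it falls back to the unsmoothed c / N.

```python
from collections import Counter

def good_turing(counts):
    """Return a function c -> Good-Turing probability of one event seen c times."""
    total = sum(counts.values())           # N: total number of observations
    n = Counter(counts.values())           # n[c] = N_c: number of events seen c times

    def prob(c):
        if c == 0:
            return n[1] / total            # total mass reserved for all unseen events
        if n[c + 1] == 0:
            return c / total               # fallback (assumption): no N_{c+1} available
        return (c + 1) * n[c + 1] / (n[c] * total)

    return prob

# Toy counts (assumption, for illustration only): 18 observations in total.
counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
prob = good_turing(counts)

print(prob(0))  # mass for unseen events: N_1 / N
print(prob(1))  # probability of one event that was seen once
```

With these counts three events were seen once, so 3/18 of the mass goes to unseen events, and each singleton drops from 1/18 to 1/27.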

8 Slide from Dan Jurafsky, MOOC "Natural Language Processing": Language Modeling. Advanced: Good Turing Smoothing

9 Today
- Interpolation
- Absolute discounting
- Kneser-Ney Smoothing:
  - Back-off Kneser-Ney
  - Interpolated Kneser-Ney
  - Modified Kneser-Ney

10 Interpolation. Concept 1
Problem: no measurement is perfect.
Solution: combine several measurements (value 1, value 2, value 3), trusting each to a different degree = INTERPOLATION

12 Interpolation. Concept 2
Combine the measurements as a weighted sum:
$value = \alpha_1 \cdot value_1 + \alpha_2 \cdot value_2 + \alpha_3 \cdot value_3$, where the $\alpha_i$ are coefficients or weights.
Usually $\alpha_i \in [0, 1]$ and $\sum_i \alpha_i = 1$:
- two values: $\alpha_1 = \alpha$, $\alpha_2 = 1 - \alpha$
- three values: $\alpha_1 = \alpha$, $\alpha_2 = \beta$, $\alpha_3 = 1 - \alpha - \beta$

13 Interpolation. Concept 3
$value = \alpha_1 \cdot value_1 + \alpha_2 \cdot value_2 + \alpha_3 \cdot value_3$
value 1 = 112 kg - 90 kg = 22 kg
value 2 = 20 kg
value 3 = 25 kg
$\alpha_1 = 0.3$, $\alpha_2 = 0.2$, $\alpha_3 = 0.5$
value = 22 * 0.3 + 20 * 0.2 + 25 * 0.5 = 23.1, i.e. about 23 kg
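The weighted sum from this example can be checked with a short Python sketch. The function name `interpolate` is an assumption for illustration; the values and weights are the ones given on the slide.

```python
def interpolate(values, weights):
    """Weighted sum of several estimates; the weights are expected to sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(a * v for a, v in zip(weights, values))

# Three weight measurements in kg and the trust placed in each of them.
estimate = interpolate([22, 20, 25], [0.3, 0.2, 0.5])
print(round(estimate, 1))
```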

14 Examples of interpolation?
- Taking the mean value, e.g. splitting a bill equally ($\alpha_i = 1/n$)
- Predicting an election result from opinion polls
- Basically, any situation where you try to assess a true value from several sources

15 Linear Interpolation in LM
We've never seen "read a book", but we might have seen "a book", and we've certainly seen "book".
Linear interpolation:
$P_{INT}(w_i \mid w_{i-1}, w_{i-2}) = \alpha_3 P(w_i \mid w_{i-1}, w_{i-2}) + \alpha_2 P(w_i \mid w_{i-1}) + \alpha_1 P(w_i)$
P(book | read a) = 0.5 * 0 + 0.2 * … + … * 1.74*10^-3 = 6.42 * …
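A minimal Python sketch of linear interpolation over unigram, bigram and trigram estimates follows. The toy corpus, the helper names `ngram_counts` and `p_interpolated`, and the weight 0.3 for the unigram term are assumptions for illustration (0.5 and 0.2 are the trigram and bigram weights used on the slide).

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def p_interpolated(w, prev2, prev1, uni, bi, tri, alphas):
    """alpha3 * P(w | prev2 prev1) + alpha2 * P(w | prev1) + alpha1 * P(w)."""
    a1, a2, a3 = alphas
    total = sum(uni.values())
    p_uni = uni[(w,)] / total
    # Unsmoothed relative frequencies; an unseen context contributes 0.
    p_bi = bi[(prev1, w)] / uni[(prev1,)] if uni[(prev1,)] else 0.0
    p_tri = tri[(prev2, prev1, w)] / bi[(prev2, prev1)] if bi[(prev2, prev1)] else 0.0
    return a3 * p_tri + a2 * p_bi + a1 * p_uni

# Toy corpus (assumption): "read a book" never occurs, "a book" and "book" do.
tokens = "i read a paper and you bought a book".split()
uni, bi, tri = (ngram_counts(tokens, n) for n in (1, 2, 3))

p = p_interpolated("book", "read", "a", uni, bi, tri, alphas=(0.3, 0.2, 0.5))
print(p)  # non-zero even though the trigram "read a book" was never seen
```

The trigram term is zero here, so the whole estimate comes from the bigram and unigram terms, which is exactly the situation the slide describes.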

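16 Absolute discounting
Discount all non-zero n-gram counts by a small constant amount $D$ and interpolate with the lower-order (bigram) model:
$P_{AD}(w_i \mid w_{i-1}, w_{i-2}) = \frac{\max(C(w_{i-2} w_{i-1} w_i) - D,\, 0)}{C(w_{i-2} w_{i-1})} + (1 - \lambda)\, P_{AD}(w_i \mid w_{i-1})$
- first term: discounted n-gram
- $(1 - \lambda)$: interpolation weight
- $P_{AD}(w_i \mid w_{i-1})$: lower-order n-gram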

17 Absolute discounting. Interpolation weight
$P_{AD}(w_i \mid w_{i-1}, w_{i-2}) = \frac{\max(C(w_{i-2} w_{i-1} w_i) - D,\, 0)}{C(w_{i-2} w_{i-1})} + (1 - \lambda)\, P_{AD}(w_i \mid w_{i-1})$
If $Z$ seen word types occur after $w_{i-2} w_{i-1}$ in the training data, this reserves the probability mass $P(U) = \frac{Z \cdot D}{C(w_{i-2} w_{i-1})}$, to be distributed according to $P(w_i \mid w_{i-1})$.
Set: $(1 - \lambda) = P(U) = \frac{Z \cdot D}{C(w_{i-2} w_{i-1})}$
N.B.: with $N_1$, $N_2$ the number of n-grams that occur exactly once or twice, $D = \frac{N_1}{N_1 + 2 N_2}$ works well in practice.
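The discount and the reserved mass can be sketched in Python. To keep the example short it works one order lower than the slide (bigrams interpolated with unigram relative frequencies), which is a simplifying assumption, as are the toy corpus and the names `estimate_discount` and `p_abs_discount`.

```python
from collections import Counter

def estimate_discount(ngram_counts):
    """D = N1 / (N1 + 2 * N2), with N1, N2 the n-grams seen exactly once / twice."""
    n1 = sum(1 for c in ngram_counts.values() if c == 1)
    n2 = sum(1 for c in ngram_counts.values() if c == 2)
    return n1 / (n1 + 2 * n2)

def p_abs_discount(w, prev, bigrams, unigrams, total, d):
    """max(C(prev w) - D, 0) / C(prev) + (Z * D / C(prev)) * P(w)."""
    # C(prev): how often prev occurs as a bigram context.
    context_count = sum(c for (p, _), c in bigrams.items() if p == prev)
    if context_count == 0:
        return unigrams[w] / total                      # unseen context: unigram only
    # Z: number of distinct word types seen after prev.
    z = sum(1 for (p, _), c in bigrams.items() if p == prev and c > 0)
    reserved = z * d / context_count                    # the interpolation weight
    discounted = max(bigrams[(prev, w)] - d, 0) / context_count
    return discounted + reserved * unigrams[w] / total

# Toy corpus (assumption, for illustration only).
tokens = "i want to read a book i want a book".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
d = estimate_discount(bigrams)

print(d)
print(p_abs_discount("to", "want", bigrams, unigrams, len(tokens), d))
```

Because exactly Z * D of the context count is removed from the seen bigrams and handed to the lower-order model, the probabilities for a seen context still sum to one.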

18 Kneser-Ney Smoothing
Idea: higher-order models work better, but when a count is small or zero, lower-order models can help a lot. Lower-order models should be used wisely, though: "San Francisco" is common, so absolute discounting will give the unigram "Francisco" a high probability in future predictions, while "Francisco" actually occurs only after "San", so the bigram model is the better choice in this case.
The other idea is to take into account the contexts each word occurs in.

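19 Kneser-Ney Smoothing. Contexts
Number of different words $w_{i-1}$ that $w_i$ follows:
$N_{1+}(\bullet\, w_i) = |\{w_{i-1} : C(w_{i-1} w_i) > 0\}|$
$N_{1+}(\bullet\, \bullet) = \sum_{w_i} N_{1+}(\bullet\, w_i)$
e.g. $N_{1+}(\bullet\, \text{read}) = 2$, $N_{1+}(\bullet\, \text{a}) = 5$, $N_{1+}(\bullet\, \bullet) = 17$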

20 Kneser-Ney Smoothing. Lower-order
Replace raw counts with counts of contexts:
$P_{KN}(w_i) = \frac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\, \bullet)}$
e.g. $P_{KN}(\text{read}) = 2/17$, $P_{KN}(\text{a}) = 5/17$, $P_{KN}(\text{to}) = 6/17$
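The context counts and the resulting lower-order estimate can be sketched as follows. The toy corpus and the name `continuation_probs` are assumptions for illustration; they are chosen so that "francisco" is frequent but follows only one word.

```python
from collections import Counter

def continuation_probs(tokens):
    """P_KN(w) = N1+(. w) / N1+(. .), computed from distinct bigram types."""
    bigram_types = set(zip(tokens, tokens[1:]))
    # N1+(. w): number of distinct words that precede w.
    left_contexts = Counter(w for _, w in bigram_types)
    # N1+(. .): number of distinct bigram types overall.
    total_types = len(bigram_types)
    return {w: n / total_types for w, n in left_contexts.items()}

# Toy corpus (assumption): "francisco" occurs twice, but only ever after "san".
tokens = "i like san francisco you like san francisco we like the city".split()
p_kn = continuation_probs(tokens)

print(p_kn["francisco"])  # same as "city", despite the higher raw count
print(p_kn["like"])       # follows three different words, so it scores higher
```

Raw counts would rank "francisco" above "city"; counting contexts instead gives both the same lower-order probability.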

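21 Back-off Kneser-Ney Smoothing
KN smoothing: similar to absolute discounting, but use the KN estimate for the lower order:
$P_{BKN}(w_i \mid w_{i-1}) = \begin{cases} \frac{C(w_{i-1} w_i) - D}{C(w_{i-1})} & \text{if } C(w_{i-1} w_i) > 0 \\ \alpha(w_{i-1})\, P_{KN}(w_i) & \text{otherwise} \end{cases}$
where $P_{KN}(w_i) = \frac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\, \bullet)}$
Back off to the lower-order model when the bigram count is 0.
$\alpha$: normalization constant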

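22 Back-off Kneser-Ney Smoothing. Example
$P_{BKN}(w_i \mid w_{i-1}) = \begin{cases} \frac{C(w_{i-1} w_i) - D}{C(w_{i-1})} & \text{if } C(w_{i-1} w_i) > 0 \\ \alpha(w_{i-1})\, P_{KN}(w_i) & \text{otherwise} \end{cases}$
$D = 0.5$, $\alpha = 0.01$
Counts available (C(want a) = 10, C(want) = 292): $P_{BKN}(\text{a} \mid \text{want}) = (10 - 0.5)/292 \approx 0.0325$
Have never seen the bigram before: $P_{BKN}(\text{to} \mid \text{want}) = 0.01 \cdot 6/17 \approx 0.0035$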
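The two cases of this example can be reproduced with a small Python sketch. The function name `p_backoff_kn` and its parameter names are assumptions for illustration; the counts, D, alpha and the P_KN values are the ones used on the slides.

```python
def p_backoff_kn(bigram_count, context_count, p_kn_word, d, alpha):
    """Back-off Kneser-Ney for a single bigram.

    Seen bigram: discounted relative frequency.
    Unseen bigram: back off to alpha * P_KN(w).
    """
    if bigram_count > 0:
        return (bigram_count - d) / context_count
    return alpha * p_kn_word

# "want a" was seen 10 times and "want" 292 times; "want to" is treated as unseen.
p_a_given_want = p_backoff_kn(10, 292, 5 / 17, d=0.5, alpha=0.01)
p_to_given_want = p_backoff_kn(0, 292, 6 / 17, d=0.5, alpha=0.01)

print(round(p_a_given_want, 4))
print(round(p_to_given_want, 4))
```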

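23 Interpolated Kneser-Ney Smoothing
KN smoothing: similar to absolute discounting, but use the KN estimate for the lower order, and interpolate instead of backing off:
$P_{IKN}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1} w_i) - D,\, 0)}{C(w_{i-1})} + \alpha(w_{i-1})\, P_{KN}(w_i)$
where $P_{KN}(w_i) = \frac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\, \bullet)}$
Use IKN for the higher orders and the Kneser-Ney estimate $P_{KN}$ for the unigram.
$\alpha$: normalization constant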
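A minimal Python sketch of an interpolated Kneser-Ney bigram model follows. The slide does not spell out the normalization constant, so the sketch assumes the same reserved mass as in absolute discounting, alpha(w_{i-1}) = D * Z / C(w_{i-1}) with Z the number of distinct words seen after w_{i-1}. The toy corpus, the name `build_ikn` and the default D = 0.75 are also assumptions.

```python
from collections import Counter

def build_ikn(tokens, d=0.75):
    """Return prob(w, prev) for an interpolated Kneser-Ney bigram model."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter()   # C(prev) as a bigram context
    followers = Counter()       # Z: distinct words seen after prev
    left_contexts = Counter()   # N1+(. w): distinct words seen before w
    for (prev, w), c in bigrams.items():
        context_count[prev] += c
        followers[prev] += 1
        left_contexts[w] += 1
    n_types = len(bigrams)      # N1+(. .)

    def prob(w, prev):
        p_kn = left_contexts[w] / n_types
        if context_count[prev] == 0:
            return p_kn                                   # unseen context: lower order only
        alpha = d * followers[prev] / context_count[prev] # assumed normalization constant
        return max(bigrams[(prev, w)] - d, 0) / context_count[prev] + alpha * p_kn

    return prob

# Toy corpus (assumption, for illustration only).
tokens = "i like san francisco you like san francisco we like the city".split()
prob = build_ikn(tokens)

print(prob("san", "like"))   # seen bigram
print(prob("city", "like"))  # unseen bigram, still gets mass from P_KN
```

With this choice of alpha the mass removed by the discount is exactly the mass handed to the lower-order model, so the probabilities for a seen context sum to one.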

24 Modified Kneser-Ney Smoothing
Chen & Goodman introduced modified Kneser-Ney:
- Interpolation is used instead of back-off.
- Uses a separate discount for one- and two-counts instead of a single discount for all counts:
  $D(c) = \begin{cases} D_1 & \text{if } c = 1 \\ D_2 & \text{if } c = 2 \\ D_{3+} & \text{if } c \geq 3 \end{cases}$
- Estimates discounts on held-out data instead of using a formula based on training counts.
Modified Kneser-Ney consistently had the best performance.
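The count-dependent discount can be sketched as a small Python function. The name `modified_discount` and the three discount values are hypothetical placeholders; according to the slide, the real values would be estimated on held-out data.

```python
def modified_discount(count, d1, d2, d3_plus):
    """Pick the discount D(c): D1 for c = 1, D2 for c = 2, D3+ for c >= 3."""
    if count <= 0:
        return 0.0          # nothing to discount for unseen n-grams
    if count == 1:
        return d1
    if count == 2:
        return d2
    return d3_plus

# Hypothetical discount values, for illustration only.
D1, D2, D3_PLUS = 0.5, 0.75, 0.9

for c in (1, 2, 3, 10):
    print(c, c - modified_discount(c, D1, D2, D3_PLUS))  # discounted count
```

In an interpolated model this function would replace the single constant D in the numerator of the discounted term.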

25 Questions
- We've just seen interpolation with lower-order models. What else could be interpolated?
- Why not just take high-order models and back off or interpolate with lower-order models?

27 References
- Dan Jurafsky, Christopher Manning, MOOC "Natural Language Processing", lecture "Language Modeling. Advanced: Good Turing Smoothing"
- Dan Jurafsky, Christopher Manning, MOOC "Natural Language Processing", lecture "Advanced: Kneser-Ney Smoothing"
- Bill MacCartney, "NLP Lunch Tutorial: Smoothing", 2005
- Joshua T. Goodman, "A Bit of Progress in Language Modeling", 2001
- Philipp Koehn, "Statistical Machine Translation", chapter "Language models", 2009
- Daniel Jurafsky, James H. Martin, "Speech and Language Processing",
