Week 13: Language Modeling II
Smoothing in Language Modeling
Irina Sergienya, 07.07.2015
A couple of words first... There are many more smoothing techniques [e.g. Katz back-off, Jelinek-Mercer, ...] and techniques to improve LMs [e.g. caching, skipping models, clustering, sentence mixture, ...]. The concepts are the same but the formulas differ, so check several sources before implementation.
Recall: Language Models
Given a sentence s, we would like to estimate how likely it is to see such a sentence in a language:
$P(w_1^{\mathrm{length}(s)}) = \prod_i P(w_i \mid w_1^{i-1})$, with $P(w_k \mid w_1^{k-1}) = \frac{C(w_1^k)}{C(w_1^{k-1})}$
Problems: sparseness of training data (not enough to estimate probabilities), zero probability of unseen events.
Solution: SMOOTHING!
Smoothing
Take some probability mass from seen events and assign it to unseen events:
$P(\text{seen}) = 1 \;\rightarrow\; P(\text{seen}) = 0.999\ldots$, so that $P(\text{unseen}) > 0$.
Recall: Laplace Smoothing
$P(w_n \mid w_1^{n-1}) = \frac{C(w_1^n)}{C(w_1^{n-1})}$
$P_{\text{Laplace}}(w_n \mid w_1^{n-1}) = \frac{C(w_1^n) + 1}{C(w_1^{n-1}) + V}$, where $V$ is the vocabulary size.
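To make the formula concrete, here is a minimal Python sketch of add-one (Laplace) smoothing for a bigram model; the toy corpus, the function names, and the use of collections.Counter are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

# Toy corpus; in practice counts come from a large training set.
tokens = "i want to read a book i want a book".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)  # vocabulary size

def p_mle(w, prev):
    """Unsmoothed maximum-likelihood bigram probability C(prev w) / C(prev)."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

def p_laplace(w, prev):
    """Add-one smoothed bigram probability (C(prev w) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(p_mle("to", "want"), p_laplace("to", "want"))  # seen bigram
print(p_laplace("read", "want"))                     # unseen bigram now gets probability > 0
```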
Recall: Good-Turing Smoothing
Use the count of events we have seen once to help estimate the count of events we have never seen: probability mass moves from events seen more than once to events seen once, and from events seen once to unseen events ($P(\text{unseen}) \leftarrow P(\text{seen once})$, $P(\text{seen once}) \leftarrow P(\text{seen more than once})$).
Recall: Good-Turing Smoothing
Use the count of events we have seen once to help estimate the count of events we have never seen.
$N_c$ = the number of events we've seen $c$ times.
Estimate, for $C(w) = c$: $P_{\text{Good-Turing}}(w) = \frac{1}{N}(c+1)\frac{N_{c+1}}{N_c}$
(Here $N$ is the $M$ from the previous lecture.)
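A small sketch of the Good-Turing estimate above, using hypothetical word counts; real implementations also have to smooth the $N_c$ values themselves (they become noisy or zero for large $c$), which is omitted here.

```python
from collections import Counter

# Hypothetical word counts C(w) from a training corpus (illustrative only).
word_counts = {"the": 5, "a": 3, "book": 2, "read": 2, "to": 1, "want": 1, "novel": 1}
N = sum(word_counts.values())   # total number of observed tokens

# N_c = number of event types seen exactly c times
N_c = Counter(word_counts.values())

def p_good_turing(c):
    """Good-Turing probability of a word seen c times: (1/N) * (c+1) * N_{c+1} / N_c."""
    return (c + 1) * N_c[c + 1] / N_c[c] / N

print(p_good_turing(1))   # adjusted probability for a word seen once
print(N_c[1] / N)         # total probability mass reserved for unseen events
```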
[Slide from Dan Jurafsky, MOOC "Natural Language Processing": Language Modeling. Advanced: Good-Turing Smoothing]
Today
- Interpolation
- Absolute discounting
- Kneser-Ney Smoothing: Back-off Kneser-Ney, Interpolated Kneser-Ney, Modified Kneser-Ney
Interpolation. Concept (1)
Problem: no single measurement is perfect.
Solution: combine several measurements (with different degrees of trust in each) = INTERPOLATION.
Interpolation. Concept (2)
value $= \alpha_1 \cdot \text{value}_1 + \alpha_2 \cdot \text{value}_2 + \alpha_3 \cdot \text{value}_3$, where the $\alpha_i$ are coefficients (weights).
Usually $\alpha_i \in [0, 1]$ and $\sum_i \alpha_i = 1$.
For $i = 2$: $\alpha_1 = \alpha$, $\alpha_2 = 1 - \alpha$; for $i = 3$: $\alpha_1 = \alpha$, $\alpha_2 = \beta$, $\alpha_3 = 1 - \alpha - \beta$.
Interpolation. Concept (3)
value $= \alpha_1 \cdot \text{value}_1 + \alpha_2 \cdot \text{value}_2 + \alpha_3 \cdot \text{value}_3$
$\text{value}_1$ = 112 kg - 90 kg = 22 kg, $\text{value}_2$ = 20 kg, $\text{value}_3$ = 25 kg
$\alpha_1 = .3$, $\alpha_2 = .2$, $\alpha_3 = .5$
value = 22*.3 + 20*.2 + 25*.5 = 23.1, i.e. about 23 kg
Examples of interpolation? Take the mean value (split a bill equally): $\alpha_i = 1/n$. Estimate an election result from several opinion polls. Basically, anywhere you try to assess a true value from several sources.
Linear Interpolation in LM
We've never seen "read a book", but we might have seen "a book", and we've certainly seen "book".
Linear Interpolation: $P_{INT}(w_i \mid w_{i-1}, w_{i-2}) = \alpha_3 P(w_i \mid w_{i-1}, w_{i-2}) + \alpha_2 P(w_i \mid w_{i-1}) + \alpha_1 P(w_i)$
$P(\text{book} \mid \text{read a}) = .5 \cdot 0 + .2 \cdot 0.0006 + .3 \cdot 1.74 \cdot 10^{-3} = 6.42 \cdot 10^{-4}$
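The same computation as a small Python sketch; the component probabilities and weights are the ones on the slide, and the function name is an invented illustration.

```python
def p_interpolated(p_trigram, p_bigram, p_unigram, a3=0.5, a2=0.2, a1=0.3):
    """Linear interpolation of trigram, bigram and unigram estimates.

    The weights must sum to 1; the defaults are the illustrative values from the slide.
    """
    return a3 * p_trigram + a2 * p_bigram + a1 * p_unigram

# P(book | read a): trigram unseen, bigram and unigram estimates available.
print(p_interpolated(0.0, 0.0006, 1.74e-3))  # -> 0.000642 = 6.42e-4
```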
Absolute discounting
Discount every non-zero n-gram count by a small constant amount $D$ and interpolate with the lower-order (here: bigram) model:
$P_{AD}(w_i \mid w_{i-1}, w_{i-2}) = \underbrace{\frac{\max(C(w_{i-2} w_{i-1} w_i) - D,\, 0)}{C(w_{i-2} w_{i-1})}}_{\text{discounted n-gram}} + \underbrace{(1 - \lambda)}_{\text{interpolation weight}} \underbrace{P_{AD}(w_i \mid w_{i-1})}_{\text{lower-order n-gram}}$
Absolute discounting. Interpolation weight
$P_{AD}(w_i \mid w_{i-1}, w_{i-2}) = \frac{\max(C(w_{i-2} w_{i-1} w_i) - D,\, 0)}{C(w_{i-2} w_{i-1})} + (1 - \lambda) P_{AD}(w_i \mid w_{i-1})$
If $Z$ seen word types occur after $w_{i-2} w_{i-1}$ in the training data, discounting reserves the probability mass $P(U) = \frac{Z \cdot D}{C(w_{i-2} w_{i-1})}$ to be distributed according to $P_{AD}(w_i \mid w_{i-1})$. Set: $(1 - \lambda) = P(U) = \frac{Z \cdot D}{C(w_{i-2} w_{i-1})}$.
N.B.: with $N_1$, $N_2$ the number of n-grams that occur exactly once or twice, $D = \frac{N_1}{N_1 + 2 N_2}$ works well in practice.
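A sketch of bigram absolute discounting with the $D = N_1/(N_1 + 2N_2)$ heuristic; as a simplification the lower-order model is a raw MLE unigram rather than the recursively discounted $P_{AD}(w_i \mid w_{i-1})$ of the slide, and the toy corpus is made up.

```python
from collections import Counter

tokens = "i want to read a book i want a book to read".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N_tokens = len(tokens)

# Discount D = N1 / (N1 + 2*N2), from the bigram counts-of-counts.
n1 = sum(1 for c in bigrams.values() if c == 1)
n2 = sum(1 for c in bigrams.values() if c == 2)
D = n1 / (n1 + 2 * n2)

def p_unigram(w):
    return unigrams[w] / N_tokens

def p_abs_discount(w, prev):
    """max(C(prev w) - D, 0) / C(prev) + (1 - lambda) * P(w),
    where (1 - lambda) = Z * D / C(prev) and Z = number of types seen after prev."""
    Z = sum(1 for (p, _) in bigrams if p == prev)
    reserved_mass = Z * D / unigrams[prev]
    return max(bigrams[(prev, w)] - D, 0) / unigrams[prev] + reserved_mass * p_unigram(w)

print(p_abs_discount("to", "want"))    # seen bigram: discounted count plus back-off mass
print(p_abs_discount("book", "want"))  # unseen bigram: only the reserved mass contributes
```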
Kneser-Ney Smoothing
Idea: the higher-order models work better, but when a count is small or zero, the lower-order models can help a lot. The lower-order models should be used wisely, though: "San Francisco" is common, so absolute discounting (which falls back on raw unigram counts) will give "Francisco" a high probability in future predictions, while actually "Francisco" occurs only after "San" => the bigram model is better in this case. The further idea is to take into account the contexts each word occurs in.
Kneser-Ney Smoothing. Contexts
Number of different words $w_{i-1}$ that $w_i$ follows: $N_{1+}(\bullet\, w_i) = |\{w_{i-1} : C(w_{i-1} w_i) > 0\}|$
e.g. $N_{1+}(\bullet\ \text{read}) = 2$, $N_{1+}(\bullet\ \text{a}) = 5$
$N_{1+}(\bullet\ \bullet) = \sum_{w_i} N_{1+}(\bullet\, w_i)$, e.g. $N_{1+}(\bullet\ \bullet) = 2 + 6 + 2 + 5 + 2 = 17$
Kneser-Ney Smoothing. Lower-order
Replace raw counts with counts of contexts: $P_{KN}(w_i) = \frac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\ \bullet)}$
e.g. $P_{KN}(\text{read}) = 2/17$, $P_{KN}(\text{a}) = 5/17$, $P_{KN}(\text{to}) = 6/17$
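A sketch of the continuation probability $P_{KN}$, computed from bigram types over an invented toy corpus.

```python
from collections import Counter

tokens = "i want to read a book she wants to read a paper".split()
bigram_types = set(zip(tokens, tokens[1:]))

# N_{1+}(. w) = number of distinct words that w follows
continuation_counts = Counter(w for (_, w) in bigram_types)
total_bigram_types = len(bigram_types)  # N_{1+}(. .)

def p_continuation(w):
    """P_KN(w) = N_{1+}(. w) / N_{1+}(. .)"""
    return continuation_counts[w] / total_bigram_types

print(p_continuation("read"), p_continuation("a"))
```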
Back-off Kneser-Ney Smoothing
KN Smoothing: similar to absolute discounting, but use the KN estimate for the lower order:
$P_{BKN}(w_i \mid w_{i-1}) = \begin{cases} \dfrac{C(w_{i-1} w_i) - D}{C(w_{i-1})} & \text{if } C(w_{i-1} w_i) > 0 \\ \alpha \, P_{KN}(w_i) & \text{otherwise} \end{cases}$
where $P_{KN}(w_i) = \frac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\ \bullet)}$.
Back off to the lower-order model in case the bigram count is 0; $\alpha$ is a normalization constant.
Back-off Kneser-Ney Smoothing. Example
$P_{BKN}(w_i \mid w_{i-1}) = \begin{cases} \dfrac{C(w_{i-1} w_i) - D}{C(w_{i-1})} & \text{if } C(w_{i-1} w_i) > 0 \\ \alpha \, P_{KN}(w_i) & \text{otherwise} \end{cases}$
With $D = 0.5$, $\alpha = 0.01$:
Counts available: $P_{BKN}(\text{a} \mid \text{want}) = (10 - 0.5)/292 = 0.03253$
Bigram never seen before: $P_{BKN}(\text{to} \mid \text{want}) = 0.01 \cdot 6/17 = 0.00353$
Interpolated Kneser-Ney Smoothing
KN Smoothing: similar to absolute discounting, but use the KN estimate for the lower order and interpolate instead of backing off:
$P_{IKN}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1} w_i) - D,\, 0)}{C(w_{i-1})} + \alpha \, P_{KN}(w_i)$, where $P_{KN}(w_i) = \frac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\ \bullet)}$
IKN for the higher orders, the Kneser-Ney continuation estimate for the unigram; $\alpha$ is a normalization constant.
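A sketch of the interpolated Kneser-Ney bigram; the slide only says $\alpha$ is a normalization constant, so as an assumption the weight is set to the reserved mass $D \cdot Z / C(w_{i-1})$ (the choice that makes the distribution sum to 1), and a fixed $D = 0.5$ is reused from the back-off example.

```python
from collections import Counter

tokens = "i want to read a book i want a paper to read".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
bigram_types = set(bigrams)
continuation = Counter(w for (_, w) in bigram_types)
total_bigram_types = len(bigram_types)
D = 0.5  # fixed discount, as in the back-off example

def p_kn_unigram(w):
    """Continuation probability N_{1+}(. w) / N_{1+}(. .)."""
    return continuation[w] / total_bigram_types

def p_ikn(w, prev):
    """Discounted bigram estimate plus weighted continuation probability."""
    Z = sum(1 for (p, _) in bigrams if p == prev)  # word types seen after prev
    alpha = D * Z / unigrams[prev]                 # reserved mass, normalizes the model
    return max(bigrams[(prev, w)] - D, 0) / unigrams[prev] + alpha * p_kn_unigram(w)

print(p_ikn("to", "want"))    # seen bigram
print(p_ikn("book", "want"))  # unseen bigram, covered by the continuation probability
```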
Modified Kneser-Ney Smoothing
Chen & Goodman introduced modified Kneser-Ney:
- Interpolation is used instead of back-off.
- A separate discount is used for one- and two-counts instead of a single discount for all counts:
$D(c) = \begin{cases} D_1 & \text{if } c = 1 \\ D_2 & \text{if } c = 2 \\ D_{3+} & \text{if } c \ge 3 \end{cases}$
- Discounts are estimated on held-out data instead of using a formula based on training counts.
Modified Kneser-Ney consistently had the best performance.
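A tiny sketch of the count-dependent discount $D(c)$; the numeric discount values below are hypothetical placeholders, since (as the slide notes) modified Kneser-Ney estimates them on held-out data.

```python
def discount(c, d1=0.6, d2=0.9, d3plus=1.1):
    """Modified Kneser-Ney discount: separate values for counts of 1, 2, and 3+ (D(0) = 0).
    The default values are hypothetical; in practice they are tuned on held-out data."""
    if c == 0:
        return 0.0
    if c == 1:
        return d1
    if c == 2:
        return d2
    return d3plus

print([discount(c) for c in range(5)])  # -> [0.0, 0.6, 0.9, 1.1, 1.1]
```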
Questions
- We've just seen interpolation with lower-order models. What else could be interpolated?
- Why not just take high-order models and back off to, or interpolate with, lower-order models?
References
- Dan Jurafsky, Christopher Manning, MOOC "Natural Language Processing", lecture "Language Modeling. Advanced: Good Turing Smoothing"
- Dan Jurafsky, Christopher Manning, MOOC "Natural Language Processing", lecture "Advanced: Kneser-Ney Smoothing"
- Bill MacCartney, "NLP Lunch Tutorial: Smoothing", 2005
- Joshua T. Goodman, "A Bit of Progress in Language Modeling", 2001
- Philipp Koehn, "Statistical Machine Translation", chapter "Language Models", 2009
- Daniel Jurafsky, James H. Martin, "Speech and Language Processing", 1999