Language Models. Philipp Koehn. 11 September 2018

Size: px

Start display at page:

Download "Language Models. Philipp Koehn. 11 September 2018"

Beatrix Arnold
5 years ago
Views:

1 Language Models Philipp Koehn 11 September 2018

2 Language models 1 Language models answer the question: How likely is a string of English words good English? Help with reordering p LM (the house is small) > p LM (small the is house) Help with word choice p LM (I am going home) > p LM (I am going house)

3 N-Gram Language Models 2 Given: a string of English words W = w 1, w 2, w 3,..., w n Question: what is p(w )? Sparse data: Many good English sentences will not have been seen before Decomposing p(w ) using the chain rule: p(w 1, w 2, w 3,..., w n ) = p(w 1 ) p(w 2 w 1 ) p(w 3 w 1, w 2 )...p(w n w 1, w 2,...w n 1 ) (not much gained yet, p(w n w 1, w 2,...w n 1 ) is equally sparse)

4 Markov Chain 3 Markov assumption: only previous history matters limited memory: only last k words are included in history (older words less relevant) kth order Markov model For instance 2-gram language model: p(w 1, w 2, w 3,..., w n ) p(w 1 ) p(w 2 w 1 ) p(w 3 w 2 )...p(w n w n 1 ) What is conditioned on, here w i 1 is called the history

5 Estimating N-Gram Probabilities 4 Maximum likelihood estimation Collect counts over a large text corpus p(w 2 w 1 ) = count(w 1, w 2 ) count(w 1 ) Millions to billions of words are easy to get (trillions of English words available on the web)

6 Example: 3-Gram 5 Counts for trigrams and estimated word probabilities the green (total: 1748) word c. prob. paper group light party ecu the red (total: 225) word c. prob. cross tape army card , the blue (total: 54) word c. prob. box flag , angel trigrams in the Europarl corpus start with the red 123 of them end with cross maximum likelihood probability is =

7 How good is the LM? 6 A good model assigns a text of real English W a high probability This can be also measured with cross entropy: H(W ) = 1 n log p(w n 1 ) Or, perplexity perplexity(w ) = 2 H(W )

8 Example: 3-Gram 7 prediction p LM -log 2 p LM p LM (i </s><s>) p LM (would <s>i) p LM (like i would) p LM (to would like) p LM (commend like to) p LM (the to commend) p LM (rapporteur commend the) p LM (on the rapporteur) p LM (his rapporteur on) p LM (work on his) p LM (. his work) p LM (</s> work.) average 2.634

9 Comparison 1 4-Gram 8 word unigram bigram trigram 4-gram i would like to commend the rapporteur on his work </s> average perplexity

10 9 count smoothing

11 Unseen N-Grams 10 We have seen i like to in our corpus We have never seen i like to smooth in our corpus p(smooth i like to) = 0 Any sentence that includes i like to smooth will be assigned probability 0

12 Add-One Smoothing 11 For all possible n-grams, add the count of one. c = count of n-gram in corpus n = count of history v = vocabulary size p = c + 1 n + v But there are many more unseen n-grams than seen n-grams Example: Europarl 2-bigrams: 86, 700 distinct words 86, = 7, 516, 890, 000 possible bigrams but only about 30, 000, 000 words (and bigrams) in corpus

13 Add-α Smoothing 12 Add α < 1 to each count p = c + α n + αv What is a good value for α? Could be optimized on held-out set

14 What is the Right Count? 13 Example: the 2-gram red circle occurs in a 30 million word corpus exactly once 1 maximum likelihood estimation tells us that its probability is 30,000, but we would expect it to occur less often than that Question: How likely does a 2-gram that occurs once in a 30,000,000 word corpus occur in the wild? Let s find out: get the set of all 2-grams that occur once (red circle, funny elephant,...) record the size of this set: N 1 get another 30,000,000 word corpus for each word in the set: count how often it occurs in the new corpus (many occur never, some once, fewer twice, even fewer 3 times,...) sum up all these counts ( ) divide by N 1 that is our test count t c

15 Example: 2-Grams in Europarl 14 Count Adjusted count Test count c (c + 1) n n (c + α) t n+v 2 n+αv 2 c Add-α smoothing with α =

16 Deleted Estimation 15 Estimate true counts in held-out data split corpus in two halves: training and held-out counts in training C t (w 1,..., w n ) number of ngrams with training count r: N r total times ngrams of training count r seen in held-out data: T r Held-out estimator: p h (w 1,..., w n ) = T r N r N where count(w 1,..., w n ) = r Both halves can be switched and results combined p h (w 1,..., w n ) = T 1 r + T 2 r N(N 1 r + N 2 r ) where count(w 1,..., w n ) = r

17 Good-Turing Smoothing 16 Adjust actual counts r to expected counts r with formula r = (r + 1) N r+1 N r N r number of n-grams that occur exactly r times in corpus N 0 total number of n-grams Where does this formula come from? Derivation is in the textbook.

18 Good-Turing for 2-Grams in Europarl 17 Count Count of counts Adjusted count Test count r N r r t 0 7,514,941, ,132, , , , , , , , , adjusted count fairly accurate when compared against the test count

19 18 backoff and interpolation

20 Back-Off 19 In given corpus, we may never observe Scottish beer drinkers Scottish beer eaters Both have count 0 our smoothing methods will assign them same probability Better: backoff to bigrams: beer drinkers beer eaters

21 Interpolation 20 Higher and lower order n-gram models have different strengths and weaknesses high-order n-grams are sensitive to more context, but have sparse counts low-order n-grams consider only very limited context, but have robust counts Combine them p I (w 3 w 1, w 2 ) = λ 1 p 1 (w 3 ) + λ 2 p 2 (w 3 w 2 ) + λ 3 p 3 (w 3 w 1, w 2 )

22 Recursive Interpolation 21 We can trust some histories w i n+1,..., w i 1 more than others Condition interpolation weights on history: λ wi n+1,...,w i 1 Recursive definition of interpolation p I n(w i w i n+1,..., w i 1 ) = λ wi n+1,...,w i 1 p n (w i w i n+1,..., w i 1 ) + + (1 λ wi n+1,...,w i 1 ) p I n 1(w i w i n+2,..., w i 1 )

23 Back-Off 22 Trust the highest order language model that contains n-gram Requires p BO n (w i w i n+1,..., w i 1 ) = α n (w i w i n+1,..., w i 1 ) if count n (w i n+1,..., w i ) > 0 = d n (w i n+1,..., w i 1 ) p BO n 1(w i w i n+2,..., w i 1 ) else adjusted prediction model α n (w i w i n+1,..., w i 1 ) discounting function d n (w 1,..., w n 1 )

24 Back-Off with Good-Turing Smoothing 23 Previously, we computed n-gram probabilities based on relative frequency p(w 2 w 1 ) = count(w 1, w 2 ) count(w 1 ) Good Turing smoothing adjusts counts c to expected counts c count (w 1, w 2 ) count(w 1, w 2 ) We use these expected counts for the prediction model (but 0 remains 0) α(w 2 w 1 ) = count (w 1, w 2 ) count(w 1 ) This leaves probability mass for the discounting function d 2 (w 1 ) = 1 w 2 α(w 2 w 1 )

25 Example 24 Good Turing discounting is used for all positive counts count p GT count α 3 p(big a) 3 7 = = p(house a) 3 7 = = p(new a) 1 7 = = ( ) = 0.30 is left for back-off d 2 (a) Note: actual values for d 2 is slightly higher, since the predictions of the lowerorder model to seen events at this level are not used.

26 Diversity of Predicted Words 25 Consider the bigram histories spite and constant both occur 993 times in Europarl corpus only 9 different words follow spite almost always followed by of (979 times), due to expression in spite of 415 different words follow constant most frequent: and (42 times), concern (27 times), pressure (26 times), but huge tail of singletons: 268 different words More likely to see new bigram that starts with constant than spite Witten-Bell smoothing considers diversity of predicted words

27 Witten-Bell Smoothing 26 Recursive interpolation method Number of possible extensions of a history w 1,..., w n 1 in training data N 1+ (w 1,..., w n 1, ) = {w n : c(w 1,..., w n 1, w n ) > 0} Lambda parameters 1 λ w1,...,w n 1 = N 1+ (w 1,..., w n 1, ) N 1+ (w 1,..., w n 1, ) + w n c(w 1,..., w n 1, w n )

28 Witten-Bell Smoothing: Examples 27 Let us apply this to our two examples: 1 λ spite = = 1 λ constant = = N 1+ (spite, ) N 1+ (spite, ) + w n c(spite, w n ) = N 1+ (constant, ) N 1+ (constant, ) + w n c(constant, w n ) =

29 Diversity of Histories 28 Consider the word York fairly frequent word in Europarl corpus, occurs 477 times as frequent as foods, indicates and providers in unigram language model: a respectable probability However, it almost always directly follows New (473 times) Recall: unigram model only used, if the bigram model inconclusive York unlikely second word in unseen bigram in back-off unigram model, York should have low probability

30 Kneser-Ney Smoothing 29 Kneser-Ney smoothing takes diversity of histories into account Count of histories for a word N 1+ ( w) = {w i : c(w i, w) > 0} Recall: maximum likelihood estimation of unigram language model p ML (w) = c(w) i c(w i) In Kneser-Ney smoothing, replace raw counts with count of histories p KN (w) = N 1+( w) w i N 1+ ( w i )

31 Modified Kneser-Ney Smoothing 30 Based on interpolation Requires p BO n (w i w i n+1,..., w i 1 ) = α n (w i w i n+1,..., w i 1 ) if count n (w i n+1,..., w i ) > 0 = d n (w i n+1,..., w i 1 ) p BO n 1(w i w i n+2,..., w i 1 ) else adjusted prediction model α n (w i w i n+1,..., w i 1 ) discounting function d n (w 1,..., w n 1 )

32 Formula for α for Highest Order N-Gram Model 31 Absolute discounting: subtract a fixed D from all non-zero counts α(w n w 1,..., w n 1 ) = c(w 1,..., w n ) D w c(w 1,..., w n 1, w) Refinement: three different discount values D 1 if c = 1 D(c) = D 2 if c = 2 D 3+ if c 3

33 Discount Parameters 32 Optimal discounting parameters D 1, D 2, D 3+ can be computed quite easily Y = N 1 N 1 + 2N 2 D 1 = 1 2Y N 2 N 1 D 2 = 2 3Y N 3 N 2 D 3+ = 3 4Y N 4 N 3 Values N c are the counts of n-grams with exactly count c

34 Formula for d for Highest Order N-Gram Model 33 Probability mass set aside from seen events d(w 1,..., w n 1 ) = i {1,2,3+} D in i (w 1,..., w n 1 ) w n c(w 1,..., w n ) N i for i {1, 2, 3+} are computed based on the count of extensions of a history w 1,..., w n 1 with count 1, 2, and 3 or more, respectively. Similar to Witten-Bell smoothing

35 Formula for α for Lower Order N-Gram Models 34 Recall: base on count of histories N 1+ ( w) in which word may appear, not raw counts. α(w n w 1,..., w n 1 ) = N 1+( w 1,..., w n ) D w N 1+( w 1,..., w n 1, w) Again, three different values for D (D 1, D 2, D 3+ ), based on the count of the history w 1,..., w n 1

36 Formula for d for Lower Order N-Gram Models 35 Probability mass set aside available for the d function d(w 1,..., w n 1 ) = i {1,2,3+} D in i (w 1,..., w n 1 ) w n c(w 1,..., w n )

37 Interpolated Back-Off 36 Back-off models use only highest order n-gram if sparse, not very reliable. two different n-grams with same history occur once same probability one may be an outlier, the other under-represented in training To remedy this, always consider the lower-order back-off models Adapting the α function into interpolated α I function by adding back-off α I (w n w 1,..., w n 1 ) = α(w n w 1,..., w n 1 ) Note that d function needs to be adapted as well + d(w 1,..., w n 1 ) p I (w n w 2,..., w n 1 )

38 Evaluation 37 Evaluation of smoothing methods: Perplexity for language models trained on the Europarl corpus Smoothing method bigram trigram 4-gram Good-Turing Witten-Bell Modified Kneser-Ney Interpolated Modified Kneser-Ney

39 38 efficiency

40 Managing the Size of the Model 39 Millions to billions of words are easy to get (trillions of English words available on the web) But: huge language models do not fit into RAM

41 Number of Unique N-Grams 40 Number of unique n-grams in Europarl corpus 29,501,088 tokens (words and punctuation) Order Unique n-grams Singletons unigram 86,700 33,447 (38.6%) bigram 1,948,935 1,132,844 (58.1%) trigram 8,092,798 6,022,286 (74.4%) 4-gram 15,303,847 13,081,621 (85.5%) 5-gram 19,882,175 18,324,577 (92.2%) remove singletons of higher order n-grams

42 Efficient Data Structures 41 4-gram the very 3-gram backoff very 2-gram backoff large boff: very large boff: important boff: best boff: serious boff: accept p: acceptable p: accession p: accidents p: accountancy p: accumulated p: accumulation p: action p: additional p: administration p: large boff: important boff: best boff: serious boff: amount p: amounts p: and p: area p: companies p: cuts p: degree p: extent p: financial p: foreign p: majority p: number p: and p: areas p: challenge p: debate p: discussion p: fact p: international p: issue p: gram backoff aa-afns p: aachen p: aaiun p: aalborg p: aarhus p: aaron p: aartsen p: ab p: abacha p: aback p: Need to store probabilities for the very large majority the very language number Both share history the very large no need to store history twice Trie

43 Reducing Vocabulary Size 42 For instance: each number is treated as a separate token Replace them with a number token NUM but: we want our language model to prefer p LM (I pay in May 2007) > p LM (I pay 2007 in May ) not possible with number token p LM (I pay NUM in May NUM) = p LM (I pay NUM in May NUM) Replace each digit (with unique symbol, or 5), retain some distinctions p LM (I pay in May 5555) > p LM (I pay 5555 in May )

44 Summary 43 Language models: How likely is a string of English words good English? N-gram models (Markov assumption) Perplexity Count smoothing add-one, add-α deleted estimation Good Turing Interpolation and backoff Good Turing Witten-Bell Kneser-Ney Managing the size of the model

Language Models. Data Science: Jordan Boyd-Graber University of Maryland SLIDES ADAPTED FROM PHILIP KOEHN

Language Models. Data Science: Jordan Boyd-Graber University of Maryland SLIDES ADAPTED FROM PHILIP KOEHN Language Models Data Science: Jordan Boyd-Graber University of Maryland SLIDES ADAPTED FROM PHILIP KOEHN Data Science: Jordan Boyd-Graber UMD Language Models 1 / 8 Language models Language models answer