N-gram Language Model. Language Models. Outline. Language Model Evaluation. Given a text w = w 1...,w t,...,w w we can compute its probability by:

Size: px

Start display at page:

Download "N-gram Language Model. Language Models. Outline. Language Model Evaluation. Given a text w = w 1...,w t,...,w w we can compute its probability by:"

Verity Hamilton
5 years ago
Views:

1 N-gram Language Model 2 Given a text w = w 1...,w t,...,w w we can compute its probability by: Language Models Marcello Federico FBK-irst Trento, Italy 2016 w Y Pr(w) =Pr(w 1 ) Pr(w t h t ) (1) t=2 where h t = w 1,...,w t 1 indicates the history of word w t. Pr(w t h t ) becomes di cult to estimate as the history h t grows We take the n-gram approximation: h t w t n+1...w t 1 e.g. Full history: Pr(Parliament I declare resumed the session of the European) 3 gram : p(parliament the European) Outline 1 Language Model Evaluation 3 N-gram Language Models Evaluation of LMs Probability Smoothing Neural Network LMs LM Toolkits N-gram Queries by MT Search Data structures for N-grams Queries Conclusions Extrinsic: impact on task (e.g. BLEU score for MT) Intrinsic: prediction of w given an history h The perplexity (PP) measure is defined as: 1 where w is a su Property: PPisalways 1 PP =2 LP where LP = log 2 p(w) w ciently long test sample and p(w) is the LM probability. (2) Meaning: the expected prediction accuracy is one every PP guesses More practical meaning: the lower the PP the higher the BLEU scores! 1 [Exercise 1. Find PP of 1-gram LM on the sequence T H T H T H T T H T T H for p(t)=0.3, p(h)=0.7 and p(h)=0.3, p(t)=0.7. Comment the results.]

2 Train-set vs. test-set perplexity 4 Frequency Smoothing 6 PP is a function of a LM p( ) and a text w PP =2 LP LP = w 1 X log w 2 p(w t h t ) t=1 the lower the PP the better the LM if w is a new sequence, PP evaluates the LM generalisation capability if w belong to the training data, then PP is just a likelihood measure PP strongly penalizes zero probabilities (PP! +1) Notice: the natural feature function of the LM is h(f, e, a) = log p(e) N-gram Probabilities 5 Frequency Smoothing 7 Even estimating 3-gram probabilities 2 is not trivial due to: model complexity: e.g. 10,000 words correspond to 1 trillion 3-grams! data sparseness: e.g. most 3-grams are rare events even in huge corpora. Take o some fraction of probability from f( xy) and assign it to all words never observed after xy. discounted frequency f ( xy): Relative frequency estimate: MLE of any discrete conditional distribution is: p(w xy) f(w xy)= c(w xy) c(x yw) = c(w xy) c(x y) P w where n-gram counts c( ) are taken over the training corpus. Limitation: it gives probability zero to all new n-grams! zero-frequency probability (x y): 3 (x y)=1 0 apple f apple f X f (w xy) w2v We need frequency smoothing! How to redistribute the total discount? 2 We will often refer to trigrams just for simplicity, but without loss of generality. 3 Notice: by convention (x y)=1if f(w xy)=0for all w, i.e.c(x y)=0.

3 Frequency Smoothing 8 Frequency Smoothing of 1-grams 9 Redistribute ( ) according to the lower-order n-gram Unigram smoothing is needed to cope with out-of-vocabulary (OOV) words If c(aiming at snorkeling) =0then p(snorkeling aiming at) / (aiming at )p(snorkeling at) Back-o scheme: select the best available n-gram approximation Assumptions: V : size of observed vocabulary; N: size of training corpus U : upper-bound estimate of size of true vocabulary Laplace smoothing: f (w) = c(w) = V N + V N + V Then: 1-gram back-o /interpolation schemes collapse to: p(w xy)= f (w xy) if > 0 xy (xy)p(w y) otherwise (3) 8 < f (w) p(w) = 1 : U V if w 2 V if w/2 V (i.e. OOV word) (5) where xy is an appropriate normalization term. 4 Important: use a common value U when comparing/combining di erent LMs! 4 [Exercise 2. Find and expression for xy s.t. P w p(w xy)=1.] Frequency Smoothing 8 10 Redistribute ( ) according to the lower-order n-gram If c(aiming at snorkeling) =0then p(snorkeling aiming at) / (aiming at )p(snorkeling at) Witten-Bell estimate (WB) [Witten and Bell, 1991] Insight: count how often you would back-o after x y in the training data corpus: x y u x x y t t x y u w x y w x y t u x y u x y t assume (x y) / number of back-o s (i.e. 3) hence f (w xy) / relative frequency (linear discounting) Solution: Interpolation: sum up the two approximations p(w xy)=f (w xy)+ (x y)p(w y). (4) Smoothed frequencies are learned bottom-up, starting from 1-grams... (x y)= n(x y ) c(x y)+n(x y ) and f (w xy) = c(x yw) c(x y)+n(x y ) where c(x y)= P 5 w c(x yw) and n(x y ) = {w : c(x yw) > 0}. Pros: easy to compute, robust for small corpora, works with artificial data Cons: underestimates probability of frequent n-grams 5 [Exercise 3. Compute f (u xy) with WB on the above corpus. Try to relate WB with Laplace smoothing.]

4 11 13 Interpolation and back-o with WB discounting Trigram LMs estimated on the English Europarl corpus Logprobs of 3-grams of type aiming at logprob observed in training WB interpolation WB back-off Top frequent 3gr 'aiming at _' in Europarl Peaks correspond to very probable 2-grams interpolated with f respectively: at that, at national, at European Back-o performs slightly better than interpolation but costs more Kneser-Ney method (KN) [Kneser and Ney, 1995] Insight: lower order counts are only used in case of back-o estimate frequency of back-o s to y w in the training data (cf. WB) corpus: x y w x t y w t x y w u y w t y w u x y w u u y w replace c(y w) with n( yw)=# of observed back-o s (=3) Solution: (for 3-gram normal counts) n( yw) f (w y) =max n( y ), 0 (y) = {w : n( yw) > b c} n( y ) where n( yw)= {x : c(x yw) > 0} and n( y ) = {x w: c(x yw) > 0} Pros: better back-o probabilities, can be applied to other smoothing methods Cons: LM cannot be used to compute lower order n-gram probs Absolute Discounting (AD) [Ney and Essen, 1991] Insight: high counts are be more reliable than low counts subtract a small constant (0 < apple 1) from each count estimate via MLE with leaving-one-out on the training data Solution: (notice: one distinct c(xyw) f (w xy)=max c(xy) for each n-gram order), 0 (xy) = where n 1 n 1 +2n 2 apple 1 and n r = {x yw: c(x yw)=r}. 6 Pros: easy to compute, accurate estimate of frequent n-grams. Cons: problematic with small and artificial samples. k{w : c(xyw) > b c} c(xy) 6 [Exercise 4. Given the text in WB slide find the number of 3-grams, n1, n 2,, f (w x y) and (x y)] Modified Kneser-Ney (MKN) [Chen and Goodman, 1999] Insight: - specific non-linear discounting coe cients for infrequent n-grams - corrected counts for lower order n-grams [Kneser and Ney, 1995] Solution: f c(x yw) (c(x yw)) (w xy)= c(x y) where (0) = 0, (1) = D 1, (2) = D 2, (c) =D 3+ if c 3, coe cients are computed from n r statistics (=number of n-grams observed r times) Pros: more smoothing on singleton n-grams than other n-grams Cons: more sensitiveness to noise Important: MKN is the most popular smoothing method.

5 Approximate Smoothing 15 Neural Network LMs 17 LM Quantization [Federico and Bertoldi, 2006] Two codebooks for each n-gram level: one for f ( ), one for ( ) Improves storage e ciency: 4 bit codebook vs 32 bit float Slightly reduced discriminatory power Stupid back-o [Brants et al., 2007] no discounting, no corrected counts, no back-o normalization LM scores directly computed from counts f(w xy) if > 0 p(w xy)= k p(w y) otherwise where k = 0.4 and p(w) = c(w)/n. (6) Most promising among recent development on n-gram LMs. Idea: Map single word into a V -dimensional vector space Represent n-gram LM as a map between vector spaces Solution: Learnmapwithneuralnetwork(NN)architecture one hidden layer compress information (projection) second hidden layer performs the n-gram prediction other architectures are possible: e.g. recurrent NN Implementations: Continuous Space Language Model [Schwenk et al., 2006] Recurrent Neural Network Language Modeling Toolkit 7 Pros: Fast run-time, competitive when used jointly with standard model Cons: Computational cost of training phase Not easy to integrate into search algorithm (used in re-scoring) 7 Is Frequency Smoothing Necessary? 16 Neural Network LMs 18 From [Brants et al., 2007]. SB=stupid back-o, KN=modified Kneser-Ney Conclusion: proper smoothing useful up to few billion words training data! (From [Schwenk et al., 2006])

6 Language Modelling Toolkits 19 Randomized structures 21 Availability of large scale corpora has pushed research toward using huge LMs MT systems set for evaluations use LMs with over a billion of 5-grams Estimating accurate large scale LMs is still computationally costly Querying large LMs can be carried out rather e ciently (with adequate RAM) Available LM toolkits SRILM: training and run-time, Moses support, open source (no commercial) IRSTLM: training and run-time, Moses support, open source KENLM: run-time, Moses support, open source CSLM: training and run-time, Moses support, open source RNNLM: training and run-time, re-scoring, open source Store n-gram and their statistics in a giant hash table Replace n-grams with their 64 bit hash key Birthday attack: 1.25 p 2 64 = First false positive appears after 5 billion di erent n-grams! Hash table size N<<2 64 : pos = key mod N hence collisions may occur for key i 6= key j Options: Hash table (KENLM) Minimal perfect hash function: map n-grams to a unique integer (Google) Bloom filters: need way to incorporate prob and bow info (RandLM)... Interoperability The standard for n-gram LM representation is the so-called ARPA file format. Inverted Trie Structure to Query LM 20 Hash Table: Separate Chaining 22 1-gr 2-gr 3-gr world friends......! world world! friends hi my name is w idx bow pr w idx bow pr hi 3 1 w pr welcome aboard n-gram bow pr ptr E ciency analysis (worst case): logpr(! friends) 1. Search(!) + Search(friends) + Get(pr(!)) 2. Search(friends) + Get(bow(friends))+ Search() + Get(bow(,friends) In total: 4 binary searches (Improvements by using interpolated search!) Each query requires at least two memory accesses!

Hash Table: Open Addressing 23 Conclusion 25 During insertion, in case of collision, apply linear probing: put item in next free position add Load factor must be below 80%! world!

7 Hash Table: Open Addressing 23 Conclusion 25 During insertion, in case of collision, apply linear probing: put item in next free position add Load factor must be below 80%! world! friends my name is hi welcome aboard n-gram bow pr Important feature to consider in MMT: Trade-o of memory and speed Incremental training: adding new n-grams Distributed representation: data sharding Networking issues: latency, failures... What to start with? Fast training models: stupid back-o vs. approximate MKN Distributed NoSQL databases: memory vs speed? Randomized models: distributed computation? Define workflow for LM training Define simple API for LM query LM Query Speed 24 References 26 [Berger et al., 1996] Berger, A., Pietra, S. A. D., and Pietra, V. J. D. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1): [Brants et al., 2007] Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. (2007). Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages , Prague, Czech Republic. (From slide of Kenneth Heafield, MT Marathon 2014) [Chen and Goodman, 1999] Chen, S. F. and Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 4(13): [Federico and Bertoldi, 2006] Federico, M. and Bertoldi, N. (2006). How many bits are needed to store probabilities for phrase-based translation? In Proceedings on the Workshop on Statistical Machine Translation, pages , New York City. Association for Computational Linguistics. [Kneser and Ney, 1995] Kneser, R. and Ney, H. (1995). Improved backing-o for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages , Detroit, MI.

8 [Ney and Essen, 1991] Ney, H. and Essen, U. (1991). On smoothing techniques for bigram-based natural language modelling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages S12.11: , Toronto, Canada. [Rosenfeld, 1996] Rosenfeld, R. (1996). A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language, 10: [Schwenk et al., 2006] Schwenk, H., Dechelotte, D., and Gauvain, J.-L. (2006). Continuous space language models for statistical machine translation. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages , Sydney, Australia. Association for Computational Linguistics. [Witten and Bell, 1991] Witten, I. H. and Bell, T. C. (1991). The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Trans. Inform. Theory, IT-37(4):

Language Modelling. Marcello Federico FBK-irst Trento, Italy. MT Marathon, Edinburgh, M. Federico SLM MT Marathon, Edinburgh, 2012

Language Modelling. Marcello Federico FBK-irst Trento, Italy. MT Marathon, Edinburgh, M. Federico SLM MT Marathon, Edinburgh, 2012 Language Modelling Marcello Federico FBK-irst Trento, Italy MT Marathon, Edinburgh, 2012 Outline 1 Role of LM in ASR and MT N-gram Language Models Evaluation of Language Models Smoothing Schemes Discounting