Richer Interpolative Smoothing Based on Modified Kneser-Ney Language Modeling


Ehsan Shareghi, Trevor Cohn and Gholamreza Haffari
Faculty of Information Technology, Monash University
Computing and Information Systems, The University of Melbourne
first.last@{monash.edu,unimelb.edu.au}

Abstract

In this work we present a generalisation of Modified Kneser-Ney interpolative smoothing for richer smoothing via additional discount parameters. We provide mathematical underpinning for the estimator of the new discount parameters, and showcase the utility of our rich MKN language models on several European languages. We further explore the interdependency among the training data size, language model order, and number of discount parameters. Our empirical results illustrate that a larger number of discount parameters, i) allows for better allocation of mass in the smoothing process, particularly in the small data regime where statistical sparsity is severe, and ii) leads to significant reduction in perplexity, particularly for out-of-domain test sets which introduce a higher ratio of out-of-vocabulary words. (For the implementation see: https://github.com/eehsan/cstlm)

1 Introduction

Probabilistic language models (LMs) are the core of many natural language processing tasks, such as machine translation and automatic speech recognition. m-gram models, the cornerstone of language modeling, decompose the probability of an utterance into conditional probabilities of words given a fixed-length context. Due to the sparsity of events in natural language, smoothing techniques are critical for generalisation beyond the training text when estimating the parameters of m-gram LMs. This is particularly important when the training text is small, e.g. when building language models for translation or speech recognition in low-resource languages.

A widely used and successful smoothing method is interpolated Modified Kneser-Ney (MKN) (Chen and Goodman, 1999). This method uses a linear interpolation of higher and lower order m-gram probabilities, preserving probability mass via absolute discounting. In this paper, we extend MKN by introducing additional discount parameters, leading to a richer smoothing scheme. This is particularly important when statistical sparsity is more severe, i.e., in building high-order LMs on small data, or when out-of-domain test sets are used.

Previous research in MKN language modeling, and more generally m-gram models, has mainly dedicated efforts to making them faster and more compact (Stolcke et al., 2011; Heafield, 2011; Shareghi et al., 2015), using advanced data structures such as succinct suffix trees. An exception is Hierarchical Pitman-Yor Process LMs (Teh, 2006a; Teh, 2006b), providing a rich Bayesian smoothing scheme, for which Kneser-Ney smoothing corresponds to an approximate inference method. Inspired by this work, we directly enrich MKN smoothing, realising some of these reductions while remaining more efficient in learning and inference.

We provide estimators for our additional discount parameters by extending the discount bounds in MKN. We empirically analyze our enriched MKN LMs on several European languages in in- and out-of-domain settings. The results show that our discounting mechanism significantly improves the perplexity compared to MKN and offers a more elegant

way of dealing with out-of-vocabulary (OOV) words and domain mismatch.

2 Enriched Modified Kneser-Ney

Interpolative Modified Kneser-Ney (MKN) (Chen and Goodman, 1999) smoothing is widely accepted as a state-of-the-art technique and is implemented in leading LM toolkits, e.g., SRILM (Stolcke, 2002) and KenLM (Heafield, 2011). MKN uses lower order k-gram probabilities to smooth higher order probabilities. P(w | u) is defined as,

    P(w | u) = (c(uw) - D_m(c(uw))) / c(u) + (γ(u) / c(u)) · P̄(w | π(u))

where c(u) is the frequency of the pattern u, γ(·) is a constant ensuring the distribution sums to one, and P̄(w | π(u)) is the smoothed probability computed recursively based on a similar formula conditioned on the suffix of the pattern u, denoted by π(u). (Note that in all but the top layer of the hierarchy, continuation counts, which count the number of unique contexts, are used in place of the frequency counts (Chen and Goodman, 1999).) Of particular interest are the discount parameters D_m(·), which remove some probability mass from the maximum likelihood estimate of each event; this mass is redistributed over the smoothing distribution. The discounts are estimated as

    D_m(i) =
      0                                                       if i = 0
      1 - 2 (n_2[m]/n_1[m]) (n_1[m]/(n_1[m] + 2 n_2[m]))      if i = 1
      2 - 3 (n_3[m]/n_2[m]) (n_1[m]/(n_1[m] + 2 n_2[m]))      if i = 2
      3 - 4 (n_4[m]/n_3[m]) (n_1[m]/(n_1[m] + 2 n_2[m]))      if i ≥ 3

where n_i[m] is the number of unique m-grams of frequency i. This effectively leads to three discount parameters {D_m(1), D_m(2), D_m(3+)} for the distributions on a particular context length m.

2.1 Generalised MKN

Ney et al. (1994) characterized the data sparsity using the following empirical inequalities,

    3 n_3[m] < 2 n_2[m] < n_1[m]    for m ≤ 3.

It can be shown (see Appendix A) that these empirical inequalities can be extended to higher frequencies and larger contexts m > 3,

    (N-1) n_{N-1}[m] < ... < 2 n_2[m] < n_1[m] < Σ_{i>0} n_i[m] < n_0[m] < σ^m

where σ^m is the possible number of m-grams over a vocabulary of size σ, n_0[m] is the number of m-grams that never occurred, and Σ_{i>0} n_i[m] is the number of m-grams observed in the training data.

We use these inequalities to extend the discount depth of MKN, resulting in new discount parameters. The additional discount parameters increase the flexibility of the model in altering a wider range of raw counts, resulting in a more elegant way of assigning the mass in the smoothing process. In our experiments, we set the number of discounts to 10 for all levels of the hierarchy (compare this to 3 in MKN). We have selected the value of 10 arbitrarily; however, our approach can be used with a larger number of discount parameters, with the caveat that we would need to handle sparse counts in the higher orders. This results in the following estimators for the discounts,

    D_m(i) =
      0                                                                  if i = 0
      i - (i+1) (n_{i+1}[m]/n_i[m]) (n_1[m]/(n_1[m] + 2 n_2[m]))         if i < 10
      10 - 11 (n_11[m]/n_10[m]) (n_1[m]/(n_1[m] + 2 n_2[m]))             if i ≥ 10

It can be shown that the above estimators for our discount parameters are derived by maximizing a lower bound on the leave-one-out likelihood of the training set, following (Ney et al., 1994; Chen and Goodman, 1999) (see Appendix B for a proof sketch).
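For concreteness, the estimators above can be computed directly from the count-of-counts n_i[m]. The following is a minimal sketch of that computation, not the authors' released cstlm implementation: the function names, the Counter-based representation of m-gram counts, and the NaN fallback when a count-of-count is zero are assumptions made only for illustration, and the sketch ignores the continuation counts used below the top layer of the hierarchy.

    from collections import Counter

    # Illustrative sketch; not the authors' cstlm implementation.
    def count_of_counts(ngram_counts, max_i):
        """n[i] = number of unique m-grams whose training frequency is exactly i."""
        n = Counter()
        for c in ngram_counts.values():
            if 1 <= c <= max_i:
                n[c] += 1
        return n

    def discount(i, n, y):
        """D(i) = i - (i+1) * (n[i+1]/n[i]) * y, with y = n[1]/(n[1] + 2*n[2])."""
        if n[i] == 0:
            return float("nan")  # no m-grams of this frequency: estimator undefined
        return i - (i + 1) * y * n[i + 1] / n[i]

    def mkn_discounts(ngram_counts, depth=3):
        """Discounts D(1..depth); D(depth) applies to all counts >= depth.
        depth=3 mirrors vanilla MKN, depth=10 mirrors the generalised estimator."""
        n = count_of_counts(ngram_counts, depth + 1)
        y = n[1] / (n[1] + 2 * n[2])  # assumes the corpus has singleton m-grams
        return {i: discount(i, n, y) for i in range(1, depth + 1)}

    # Toy usage: 3-gram counts from a tiny corpus (continuation counts ignored).
    tokens = "the cat sat on the mat and the cat sat on the hat".split()
    m = 3
    counts = Counter(tuple(tokens[j:j + m]) for j in range(len(tokens) - m + 1))
    print(mkn_discounts(counts))            # D(1), D(2), D(3) as in vanilla MKN
    print(mkn_discounts(counts, depth=10))  # D(1)..D(10); mostly NaN on this toy data

On realistic corpora the count-of-counts up to n_11[m] are comfortably non-zero for low orders, but, as noted above, they become sparse for very high orders and very small corpora, which is exactly where the NaN (or otherwise degenerate) values appear in this sketch.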
3 Experiments

We compare the effect of using different numbers of discount parameters on perplexity using the Finnish (FI), Spanish (ES), German (DE), and English (EN) portions of the Europarl v7 (Koehn, 2005) corpus. For each language we excluded the first K sentences and used them as the in-domain test set (denoted as EU), skipped the second K sentences, and used the rest as the training set. The data was tokenized, sentence split, and the XML markup discarded.

We tested the effect of domain mismatch under two settings for out-of-domain test sets: i) mild, using the Spanish section of news-test 2013, the German and English sections of news-test 2014, and the Finnish section of news-test 2015 (all denoted as NT; available from http://www.statmt.org/{wmt13,14,15}/test.tgz), and ii) extreme, using a 24-hour period of Finnish and Spanish tweets streamed via the Twitter API on 17/05/2016 (denoted as TW), and the German and English sections of the patent descriptions of the medical translation task, http://www.statmt.org/wmt14/medical-task/ (denoted as MED). See Table 1 for statistics of the training and test sets.

[Table 1 and Figure 1 appear here; the numeric contents of Table 1 and the plotted values of Figure 1 are not recoverable from this transcription.]

Table 1: Perplexity for various m-gram orders, m, and training languages from Europarl, using different numbers of discount parameters for MKN. MKN (D[1...3]), MKN (D[1...4]), and MKN (D[1...10]) represent vanilla MKN, MKN with 1 more discount, and MKN with 7 more discount parameters, respectively. Test set sources EU, NT, TW, MED are Europarl, news-test, Twitter, and medical patent descriptions, respectively. OOV is reported as the ratio of OOV tokens to all tokens in the test set.

Figure 1: Percentage of perplexity reduction for pplx_{D[1...4]} and pplx_{D[1...10]} compared with pplx_{D[1...3]} on different training corpora (Europarl CZ, FI, ES, DE, EN, FR) and on news-test sets (NT), for two different LM orders m (left and right).

3.1 Results

Table 1 shows substantial reductions in perplexity on all languages for out-of-domain test sets when expanding the number of discount parameters from 3 in vanilla MKN to 4 and 10. Consider the English news-test (NT), in which even for a 2-gram language model a single extra discount parameter (m = 2, D[1...4]) improves the perplexity by 18 points, and this improvement quadruples to 77 points when using 10 discounts (m = 2, D[1...10]). This effect is consistent across the Europarl corpora and for all LM orders; we observe substantial improvements even for 2-gram models (see Figure 1). On the medical test set, which has a 9 times higher OOV ratio, the perplexity reduction shows a similar trend. However, these reductions vanish when an in-domain test set is used. Note that we use the same treatment of OOV words for computing the perplexities as is used in KenLM (Heafield, 2013).
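The perplexities in Table 1 are standard test-set perplexities; the paper only states that OOV words are treated as in KenLM. The sketch below shows one conventional way such an evaluation is computed, mapping OOV tokens to an <unk> symbol that the model scores like any other word. The lm_logprob callable, the <s>/</s> markers, and the <unk> convention are assumptions for illustration, not the authors' evaluation code.

    import math

    def perplexity(lm_logprob, test_sentences, vocab):
        """Perplexity of a tokenised test set under a language model.

        lm_logprob(word, context) is assumed to return the natural-log probability
        log P(word | context); OOV words are mapped to "<unk>" and scored, not skipped.
        """
        log_sum, n_tokens = 0.0, 0
        for sentence in test_sentences:
            context = ("<s>",)
            for word in list(sentence) + ["</s>"]:
                token = word if word in vocab else "<unk>"
                log_sum += lm_logprob(token, context)
                n_tokens += 1
                context = context + (token,)
        return math.exp(-log_sum / n_tokens)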
3.2 Analysis

Out-of-domain and Out-of-vocabulary. We selected Finnish, for which the number and ratio of OOVs are close on its out-of-domain and in-domain test sets (NT and EU), while showing a substantial reduction in perplexity on the out-of-domain test set; see the FI bars in Figure 1. Figure 2 (left) shows the full perplexity results for Finnish for vanilla MKN and our extensions when tested on the in-domain (EU) and out-of-domain (NT) test sets. The discount plot, Figure 2 (middle), illustrates the behaviour of the various discount parameters. We also measured the average hit length for queries by varying m on the in-domain and out-of-domain test sets. As illustrated in Figure 2 (right), the in-domain test set allows for longer matches to the training data as m grows. This indicates that having more discount parameters is not only useful for test sets with an extremely high number of OOVs, but also allows for a more elegant way of assigning mass in the smoothing process when there is a domain mismatch.
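The "average hit length" statistic is not formally defined in this excerpt; a natural reading, sketched below, is the average over test positions of the longest n-gram ending at that position that also occurs in the training data (the quantity a suffix-tree-backed model can look up directly). This exact definition and the helper names are assumptions.

    def all_ngrams(tokens, max_order):
        """Set of all n-grams (as tuples) of orders 1..max_order in a token list."""
        return {tuple(tokens[j:j + k])
                for k in range(1, max_order + 1)
                for j in range(len(tokens) - k + 1)}

    def average_hit_length(train_tokens, test_tokens, max_order=10):
        """Average, over test positions, of the longest n-gram ending at that
        position that also occurs in the training data."""
        train_ngrams = all_ngrams(train_tokens, max_order)
        total = 0
        for i in range(len(test_tokens)):
            hit = 0
            for k in range(1, min(max_order, i + 1) + 1):
                gram = tuple(test_tokens[i - k + 1:i + 1])
                if gram in train_ngrams:
                    hit = k
                else:
                    break  # a longer n-gram cannot occur if its suffix does not
            total += hit
        return total / len(test_tokens) if test_tokens else 0.0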

Figure 2: Statistics for the Finnish section of Europarl. The left plot illustrates the perplexity when tested on the out-of-domain (NT) and in-domain (EU) test sets, varying the LM order m. The middle plot shows the discount parameters D_i, i in [1...10], for different m-gram orders. The right plot corresponds to the average hit length on the EU and NT test sets. [Plots not recoverable from this transcription.]

Figure 3: Perplexity (z-axis) vs. LM order m (x-axis) vs. number of discounts D_i (y-axis) for German, trained on Europarl (left) and CommonCrawl 2014 (right) and tested on news-test. Arrows show the direction of the increase. [Surface plots not recoverable from this transcription.]

Interdependency of m, data size, and discounts. To explore the correlation between these factors we selected German and investigated this correlation on two different training data sizes: Europarl (61M words) and CommonCrawl 2014 (nearly 1B words). Figure 3 illustrates the correlation between these factors using the same test set but with the small and the large training set. The slopes of the surfaces indicate that the small training data regime (left), which has higher sparsity and more OOVs at test time, benefits substantially from the more accurate discounting compared to the large training set (right), in which the gain from discounting is slight. Nonetheless, the improvement in perplexity consistently grows with the introduction of more discount parameters even under the large training data regime, which suggests that more discount parameters, e.g., up to D_20, may be required for larger training corpora, to reflect the fact that even an event with a frequency of 20 might be considered rare in a corpus of nearly 1 billion words.

4 Conclusions

In this work we proposed a generalisation of Modified Kneser-Ney interpolative language modeling by introducing new discount parameters. We provide the mathematical proof for the discount bounds used in Modified Kneser-Ney, extend it further, and illustrate the impact of our extension empirically on different Europarl languages using in-domain and out-of-domain test sets. The empirical results on various training and test sets show that our proposed approach allows for a more elegant way of treating OOVs and mass assignments in interpolative smoothing. In future work, we will integrate our language model into the Moses machine translation pipeline to extrinsically measure its impact on translation quality, which is of particular use for the out-of-domain scenario.

Acknowledgements

This research was supported by National ICT Australia (NICTA) and an Australian Research Council Future Fellowship (project number FT15). This work was done when Ehsan Shareghi was an intern at IBM Research Australia.

A. Inequalities

We prove that these inequalities hold in expectation by making the reasonable assumption that events in

the natural language follow the power law (Clauset et al., 2009),

    p(C(u) = f) ∝ f^(-1 - 1/s),

where s is the parameter of the distribution and C(u) is the random variable denoting the frequency of the m-gram pattern u. We now compute the expected number of unique patterns having a specific frequency, E[n_i[m]]. Corresponding to each m-gram pattern u, let us define a random variable X_u which is 1 if the frequency of u is i and zero otherwise. It is not hard to see that n_i[m] = Σ_u X_u, and

    E[n_i[m]] = E[Σ_u X_u] = Σ_u E[X_u] = σ^m E[X_u]
              = σ^m ( p(C(u) = i) · 1 + p(C(u) ≠ i) · 0 )
              ∝ σ^m i^(-1 - 1/s).

We can verify that

    (i + 1) E[n_{i+1}[m]] < i E[n_i[m]]
    ⟺ (i + 1) σ^m (i + 1)^(-1 - 1/s) < i σ^m i^(-1 - 1/s)
    ⟺ i^(1/s) < (i + 1)^(1/s),

which completes the proof of the inequalities.

B. Discount bounds proof sketch

The leave-one-out (leaving out those m-grams which occurred only once) log-likelihood function of the interpolative smoothing is lower bounded by the backoff model's (Ney et al., 1994), hence the estimated discounts for the latter can be considered as an approximation for the discounts of the former. Consider a backoff model with absolute discounting parameter D, where P(w_i | w_{i-m+1}^{i-1}) is defined as:

    P(w_i | w_{i-m+1}^{i-1}) =
      (c(w_{i-m+1}^{i}) - D) / c(w_{i-m+1}^{i-1})                                          if c(w_{i-m+1}^{i}) > 0
      D · n_{1+}(w_{i-m+1}^{i-1} •) / c(w_{i-m+1}^{i-1}) · P̄(w_i | w_{i-m+2}^{i-1})        if c(w_{i-m+1}^{i}) = 0

where n_{1+}(w_{i-m+1}^{i-1} •) is the number of unique right contexts for the pattern w_{i-m+1}^{i-1}. Assume that for any choice of 0 < D < 1 we can define P̄ such that P(w_i | w_{i-m+1}^{i-1}) sums to 1. For readability we use the replacement λ(w_{i-m+1}^{i-1}) = n_{1+}(w_{i-m+1}^{i-1} •) / (c(w_{i-m+1}^{i-1}) - 1).

Following (Chen and Goodman, 1999), rewriting the leave-one-out log-likelihood for KN (Ney et al., 1994) to include more discounts (in this proof up to D_4) results in:

    Σ_{w_{i-m+1}^{i} : c(w_{i-m+1}^{i}) > 4} c(w_{i-m+1}^{i}) log [ (c(w_{i-m+1}^{i}) - 1 - D_4) / (c(w_{i-m+1}^{i-1}) - 1) ]
    + Σ_{j=2}^{4} Σ_{w_{i-m+1}^{i} : c(w_{i-m+1}^{i}) = j} c(w_{i-m+1}^{i}) log [ (c(w_{i-m+1}^{i}) - 1 - D_{j-1}) / (c(w_{i-m+1}^{i-1}) - 1) ]
    + Σ_{w_{i-m+1}^{i} : c(w_{i-m+1}^{i}) = 1} c(w_{i-m+1}^{i}) log [ (Σ_j n_j[m] D_j) · λ(w_{i-m+1}^{i-1}) · P̄ ]

which can be simplified to,

    Σ_{w_{i-m+1}^{i} : c(w_{i-m+1}^{i}) > 4} c(w_{i-m+1}^{i}) log (c(w_{i-m+1}^{i}) - 1 - D_4)
    + Σ_{j=2}^{4} j n_j[m] log (j - 1 - D_{j-1})
    + n_1[m] log (Σ_j n_j[m] D_j) + const.

To find the optimal D_1, D_2, D_3, D_4 we set the partial derivatives to zero. For D_3,

    - 4 n_4[m] / (3 - D_3) + n_1[m] n_3[m] / (Σ_j n_j[m] D_j) = 0
    ⟹ n_1[m] n_3[m] (3 - D_3) = 4 n_4[m] Σ_j n_j[m] D_j ≥ 4 n_4[m] n_1[m] D_1 > 0
    ⟹ 3 - 4 (n_4[m] / n_3[m]) D_1 > D_3.

And after taking c(w_{i-m+1}^{i}) = 5 out of the summation, for D_4:

    - Σ_{c(w_{i-m+1}^{i}) > 5} c(w_{i-m+1}^{i}) / (c(w_{i-m+1}^{i}) - 1 - D_4) - 5 n_5[m] / (4 - D_4) + n_1[m] n_4[m] / (Σ_j n_j[m] D_j) = 0
    ⟹ n_1[m] n_4[m] / (Σ_j n_j[m] D_j) - 5 n_5[m] / (4 - D_4) > 0
    ⟹ n_1[m] n_4[m] (4 - D_4) > 5 n_5[m] Σ_j n_j[m] D_j ≥ 5 n_5[m] n_1[m] D_1
    ⟹ 4 - 5 (n_5[m] / n_4[m]) D_1 > D_4.
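As a quick numerical illustration of the Appendix A argument, one can sample pattern frequencies from a discrete power law and check that i · n_i[m] is (in expectation) decreasing in i, which is the chain of inequalities used to justify the deeper discounts. The sampling scheme, the exponent s = 1, and the truncation at f_max are assumptions chosen only for this illustration.

    import random

    def empirical_count_of_counts(num_patterns=100_000, s=1.0, f_max=10_000, max_i=11):
        """Sample pattern frequencies f with P(f) proportional to f^(-1 - 1/s) and
        tally n[i], the number of patterns whose frequency is exactly i."""
        support = range(1, f_max + 1)
        weights = [f ** (-1.0 - 1.0 / s) for f in support]
        frequencies = random.choices(support, weights=weights, k=num_patterns)
        n = [0] * (max_i + 1)
        for f in frequencies:
            if f <= max_i:
                n[f] += 1
        return n

    n = empirical_count_of_counts()
    # Appendix A predicts (i + 1) * n[i + 1] < i * n[i], i.e. i * n[i] decreases with i.
    print([i * n[i] for i in range(1, 11)])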

References

Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359-394.

Aaron Clauset, Cosma Rohilla Shalizi, and Mark E. J. Newman. 2009. Power-law distributions in empirical data. SIAM Review, 51(4):661-703.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Workshop on Statistical Machine Translation.

Kenneth Heafield. 2013. Efficient Language Modeling Algorithms with Applications to Statistical Machine Translation. Ph.D. thesis, Carnegie Mellon University.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Machine Translation Summit.

Hermann Ney, Ute Essen, and Reinhard Kneser. 1994. On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language, 8(1):1-38.

Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, and Trevor Cohn. 2015. Compact, efficient and unlimited capacity: Language modeling with compressed suffix trees. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Andreas Stolcke, Jing Zheng, Wen Wang, and Victor Abrash. 2011. SRILM at sixteen: Update and outlook. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop.

Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing.

Yee Whye Teh. 2006a. A Bayesian interpretation of interpolated Kneser-Ney. Technical report, NUS School of Computing.

Yee Whye Teh. 2006b. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.