Richer Interpolative Smoothing Based on Modified Kneser-Ney Language Modeling


Ehsan Shareghi, Trevor Cohn and Gholamreza Haffari
Faculty of Information Technology, Monash University
Computing and Information Systems, The University of Melbourne

Abstract

In this work we present a generalisation of Modified Kneser-Ney interpolative smoothing for richer smoothing via additional discount parameters. We provide a mathematical underpinning for the estimator of the new discount parameters, and showcase the utility of our rich MKN language models on several European languages. We further explore the interdependency among the training data size, language model order, and number of discount parameters. Our empirical results illustrate that a larger number of discount parameters, i) allows for better allocation of mass in the smoothing process, particularly in the small data regime where statistical sparsity is severe, and ii) leads to a significant reduction in perplexity, particularly for out-of-domain test sets which introduce a higher ratio of out-of-vocabulary words. (1)

1 Introduction

Probabilistic language models (LMs) are the core of many natural language processing tasks, such as machine translation and automatic speech recognition. m-gram models, the cornerstone of language modeling, decompose the probability of an utterance into conditional probabilities of words given a fixed-length context. Due to the sparsity of events in natural language, smoothing techniques are critical for generalisation beyond the training text when estimating the parameters of m-gram LMs. This is particularly important when the training text is small, e.g. when building language models for translation or speech recognition in low-resource languages. A widely used and successful smoothing method is interpolated Modified Kneser-Ney (MKN) (Chen and Goodman, 1999). This method uses a linear interpolation of higher and lower order m-gram probabilities, preserving probability mass via absolute discounting.

(1) For the implementation see: https://github.com/eehsan/cstlm
In this paper, we extend MKN by introducing additional discount parameters, leading to a richer smoothing scheme. This is particularly important when statistical sparsity is more severe, i.e., in building high-order LMs on small data, or when out-of-domain test sets are used. Previous research in MKN language modeling, and more generally m-gram models, has mainly dedicated efforts to making them faster and more compact (Stolcke et al., 2011; Heafield, 2011; Shareghi et al., 2015) using advanced data structures such as succinct suffix trees. An exception is Hierarchical Pitman-Yor Process LMs (Teh, 2006a; Teh, 2006b), which provide a rich Bayesian smoothing scheme, for which Kneser-Ney smoothing corresponds to an approximate inference method. Inspired by this work, we directly enrich MKN smoothing, realising some of these reductions while remaining more efficient in learning and inference. We provide estimators for our additional discount parameters by extending the discount bounds in MKN. We empirically analyze our enriched MKN LMs on several European languages in in- and out-of-domain settings. The results show that our discounting mechanism significantly improves the perplexity compared to MKN and offers a more elegant

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 944-949, Austin, Texas, November 1-5, 2016. (c) 2016 Association for Computational Linguistics
way of dealing with out-of-vocabulary (OOV) words and domain mismatch.

2 Enriched Modified Kneser-Ney

Interpolative Modified Kneser-Ney (MKN) (Chen and Goodman, 1999) smoothing is widely accepted as a state-of-the-art technique and is implemented in leading LM toolkits, e.g., SRILM (Stolcke, 2002) and KenLM (Heafield, 2011). MKN uses lower order k-gram probabilities to smooth higher order probabilities, defining P(w | u) as

$$P(w \mid u) = \frac{c(uw) - D_m\big(c(uw)\big)}{c(u)} + \frac{\gamma(u)}{c(u)}\,\bar{P}\big(w \mid \pi(u)\big)$$

where c(u) is the frequency of the pattern u, γ(·) is a constant ensuring the distribution sums to one, and P̄(w | π(u)) is the smoothed probability computed recursively based on a similar formula conditioned on the suffix of the pattern u, denoted by π(u). (2) Of particular interest are the discount parameters D_m(·), which remove some probability mass from the maximum likelihood estimate of each event; this mass is redistributed over the smoothing distribution. The discounts are estimated as

$$D_m(i) = \begin{cases} 0, & \text{if } i = 0\\[1ex] 1 - 2\,\dfrac{n_1[\cdot^m]}{n_1[\cdot^m] + 2 n_2[\cdot^m]}\,\dfrac{n_2[\cdot^m]}{n_1[\cdot^m]}, & \text{if } i = 1\\[2ex] 2 - 3\,\dfrac{n_1[\cdot^m]}{n_1[\cdot^m] + 2 n_2[\cdot^m]}\,\dfrac{n_3[\cdot^m]}{n_2[\cdot^m]}, & \text{if } i = 2\\[2ex] 3 - 4\,\dfrac{n_1[\cdot^m]}{n_1[\cdot^m] + 2 n_2[\cdot^m]}\,\dfrac{n_4[\cdot^m]}{n_3[\cdot^m]}, & \text{if } i \geq 3 \end{cases}$$

where n_i[·^m] is the number of unique m-grams of frequency i. This effectively leads to three discount parameters {D_m(1), D_m(2), D_m(3+)} for the distributions on a particular context length, m.

(2) Note that in all but the top layer of the hierarchy, continuation counts, which count the number of unique contexts, are used in place of the frequency counts (Chen and Goodman, 1999).

2.1 Generalised MKN

Ney et al. (1994) characterized the data sparsity using the following empirical inequalities,

$$3 n_3[\cdot^m] < 2 n_2[\cdot^m] < n_1[\cdot^m] \qquad \text{for } m \leq 3$$

It can be shown (see Appendix A) that these empirical inequalities can be extended to higher frequencies and larger contexts m > 3,

$$(N+1)\, n_{N+1}[\cdot^m] < \dots < 2 n_2[\cdot^m] < n_1[\cdot^m] < \sum_{i>0} n_i[\cdot^m] \ll n_0[\cdot^m] < \sigma^m$$

where σ^m is the number of possible m-grams over a vocabulary of size σ, n_0[·^m] is the number of m-grams that never occurred, and Σ_{i>0} n_i[·^m] is the number of m-grams observed in the training data.
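As an illustrative aid (a sketch, not the authors' implementation; the toy m-gram frequencies below are invented), the discount estimator can be computed directly from the count-of-counts n_i. With depth 3 the general form i - (i+1)·Y·n_{i+1}/n_i reproduces the vanilla MKN discounts {D(1), D(2), D(3+)}, and a larger depth gives the extension discussed next:

```python
from collections import Counter

def mkn_discounts(ngram_freqs, depth=3):
    """Estimate absolute discounts D(1)..D(depth) from count-of-counts.

    ngram_freqs: dict mapping each m-gram to its training frequency.
    Vanilla MKN corresponds to depth=3, i.e. {D(1), D(2), D(3+)}, with
        D(i) = i - (i+1) * Y * n[i+1] / n[i],   Y = n[1] / (n[1] + 2*n[2])
    where n[i] is the number of distinct m-grams of frequency i.
    """
    n = Counter(ngram_freqs.values())   # n[i] = |{u : c(u) = i}|
    Y = n[1] / (n[1] + 2 * n[2])
    return {i: i - (i + 1) * Y * n[i + 1] / n[i] for i in range(1, depth + 1)}

# Toy data: 4 singleton bigrams, 3 doubletons, 2 tripletons, 1 with count 4.
freqs = {'a b': 1, 'b c': 1, 'c d': 1, 'd e': 1,
         'e f': 2, 'f g': 2, 'g h': 2,
         'h i': 3, 'i j': 3,
         'j k': 4}
D = mkn_discounts(freqs)   # Y = 4/(4+6) = 0.4
# D(1) = 1 - 2*0.4*(3/4) = 0.4
# D(2) = 2 - 3*0.4*(2/3) = 1.2
# D(3) = 3 - 4*0.4*(1/2) = 2.2
```

Here Y = n_1/(n_1 + 2n_2) is the common factor shared by all branches of the estimator; estimating D(i) requires n_{i+1} > 0, which is why deeper discount hierarchies need larger corpora.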
We use these inequalities to extend the discount depth of MKN, resulting in new discount parameters. The additional discount parameters increase the flexibility of the model in altering a wider range of raw counts, resulting in a more elegant way of assigning mass in the smoothing process. In our experiments, we set the number of discounts to 10 for all levels of the hierarchy (compare this to the 3 discounts of MKN). (3) This results in the following estimators for the discounts,

$$D_m(i) = \begin{cases} 0, & \text{if } i = 0\\[1ex] i - (i+1)\,\dfrac{n_1[\cdot^m]}{n_1[\cdot^m] + 2 n_2[\cdot^m]}\,\dfrac{n_{i+1}[\cdot^m]}{n_i[\cdot^m]}, & \text{if } i < 10\\[2ex] 10 - 11\,\dfrac{n_1[\cdot^m]}{n_1[\cdot^m] + 2 n_2[\cdot^m]}\,\dfrac{n_{11}[\cdot^m]}{n_{10}[\cdot^m]}, & \text{if } i \geq 10 \end{cases}$$

It can be shown that the above estimators for our discount parameters are derived by maximizing a lower bound on the leave-one-out likelihood of the training set, following (Ney et al., 1994; Chen and Goodman, 1999) (see Appendix B for the proof sketch).

(3) We have selected the value of 10 arbitrarily; however, our approach can be used with a larger number of discount parameters, with the caveat that we would need to handle sparse counts in the higher orders.

3 Experiments

We compare the effect of using different numbers of discount parameters on perplexity using the Finnish (FI), Spanish (ES), German (DE), and English (EN) portions of the Europarl v7 (Koehn, 2005) corpus. For each language we excluded the first K sentences and used them as the in-domain test set (denoted as EU), skipped the second K sentences, and used the rest as the training set. The data was tokenized, sentence split, and the XML markup discarded. We tested the effect of domain mismatch under two settings for out-of-domain test sets: i) mild, using the Spanish section of newstest 2013, the German and English sections of newstest 2014, and the Finnish section
of newstest 2015 (all denoted as NT), and ii) extreme, using a 24-hour period of Finnish and Spanish tweets streamed via the Twitter API on 17/05/… (denoted as TW), and the German and English sections of the patent descriptions of the medical translation task (denoted as MED). See Table 1 for statistics of the training and test sets.

Table 1: Perplexity for various m-gram orders, m, and training languages from Europarl, using different numbers of discount parameters for MKN. MKN (D_[1...3]), MKN (D_[1...4]), and MKN (D_[1...10]) represent vanilla MKN, MKN with 1 more discount, and MKN with 7 more discount parameters, respectively. Test set sources EU, NT, TW, MED are Europarl, newstest, Twitter, and medical patent descriptions, respectively. OOV is reported as the ratio |{OOV ∈ testset}| / |{w ∈ testset}|. [The numeric entries of the table were lost in transcription.]

Figure 1: Percentage of perplexity reduction for ppl_{D_[1...4]} and ppl_{D_[1...10]} compared with ppl_{D_[1...3]} on different training corpora (Europarl CZ, FI, ES, DE, EN, FR) and on newstest sets (NT), for two values of the LM order m (left and right panels; the values were lost in transcription).

3.1 Perplexity

Table 1 shows a substantial reduction in perplexity on all languages for out-of-domain test sets when expanding the number of discount parameters from 3 in vanilla MKN to 4 and 10. Consider the English newstest (NT), in which even for a 2-gram language model a single extra discount parameter (m = 2, D_[1...4]) improves the perplexity by 18 points, and this improvement quadruples to 77 points when using 10 discounts (m = 2, D_[1...10]). This effect is consistent across the Europarl corpora and for all LM orders. We observe substantial improvements even for m = 2-gram models (see Figure 1). On the medical test set, which has a 9 times higher OOV ratio, the perplexity reduction shows a similar trend.
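For reference, the perplexities above are of the standard form exp(-(1/N) Σ log P(w | context)). A minimal sketch (the `prob` callback and vocabulary here are hypothetical), with OOV tokens mapped to an `<unk>` type before scoring, mirroring the KenLM treatment used in these experiments:

```python
import math

def perplexity(sentences, prob, vocab):
    """exp of the average negative log-probability over all test tokens.

    OOV tokens are mapped to '<unk>' before scoring, so the model is
    expected to assign '<unk>' a non-zero probability.
    """
    log_sum, n_tokens = 0.0, 0
    for sentence in sentences:
        history = []
        for w in sentence:
            w = w if w in vocab else '<unk>'
            log_sum += math.log(prob(w, tuple(history)))
            history.append(w)
            n_tokens += 1
    return math.exp(-log_sum / n_tokens)

# Sanity check: a uniform model over a 4-type vocabulary has perplexity 4.
vocab = {'a', 'b', 'c', '<unk>'}
pp = perplexity([['a', 'b', 'zzz']], lambda w, h: 0.25, vocab)  # 'zzz' -> '<unk>'
```

Because OOVs are folded into `<unk>`, a test set with a high OOV ratio leans heavily on the smoothing distribution, which is where the extra discounts change the mass allocation.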
However, these reductions vanish when an in-domain test set is used. Note that we use the same treatment of OOV words for computing the perplexities as is used in KenLM (Heafield, 2013).

3.2 Analysis

Out-of-domain and out-of-vocabulary. We selected Finnish, for which the number and ratio of OOVs are close on its out-of-domain and in-domain test sets (NT and EU), while showing a substantial reduction in perplexity on the out-of-domain test set; see the FI bars in Figure 1. Figure 2 (left) shows the full perplexity results for Finnish for vanilla MKN and our extensions when tested on the in-domain (EU) and out-of-domain (NT) test sets. The discount plot, Figure 2 (middle), illustrates the behaviour of the various discount parameters. We also measured the average hit length for queries by varying m on the in-domain and out-of-domain test sets. As illustrated in Figure 2 (right), the in-domain test set allows for longer matches to the training data as
Figure 2: Statistics for the Finnish section of Europarl. The left plot illustrates the perplexity when tested on the out-of-domain (NT) and in-domain (EU) test sets, varying the LM order m. The middle plot shows the discount parameters D_i, i ∈ [1...10], for the different m-gram orders (2-gram up to 10-gram). The right plot corresponds to the average hit length on the EU and NT test sets. [The plotted values were lost in transcription.]

Figure 3: Perplexity (z-axis) vs. m (x-axis) vs. number of discounts D_i (y-axis) for German, trained on Europarl (left) and CommonCrawl 2014 (right) and tested on newstest. Arrows show the direction of increase. [The plotted values were lost in transcription.]

grows. This indicates that having more discount parameters is not only useful for test sets with an extremely high number of OOVs, but also allows for a more elegant way of assigning mass in the smoothing process when there is a domain mismatch.

Interdependency of m, data size, and discounts. To explore the correlation between these factors we selected German and investigated this correlation on two different training data sizes: Europarl (61M words) and CommonCrawl 2014 (984M words). Figure 3 illustrates the correlation between these factors using the same test set but with small and large training sets. Considering the slopes of the surfaces indicates that the small training data regime (left), which has higher sparsity and more OOVs at test time, benefits substantially from the more accurate discounting, compared with the large training set (right), for which the gain from discounting is slight. (4)

(4) Nonetheless, the improvement in perplexity consistently grows with the introduction of more discount parameters even in the large training data regime, which suggests that more discount parameters, e.g., up to D_30, may be required for larger training corpora, reflecting the fact that even an event with a frequency of 30 might be considered rare in a corpus of nearly 1 billion words.

4 Conclusions

In this work we proposed a generalisation of Modified Kneser-Ney interpolative language modeling by introducing new discount parameters. We provide the mathematical proof for the discount bounds used in Modified Kneser-Ney, extend it further, and illustrate the impact of our extension empirically on different Europarl languages using in-domain and out-of-domain test sets. The empirical results on various training and test sets show that our proposed approach allows for a more elegant way of treating OOVs and mass assignments in interpolative smoothing. In future work, we will integrate our language model into the Moses machine translation pipeline to extrinsically measure its impact on translation quality, which is of particular use in out-of-domain scenarios.

Acknowledgements

This research was supported by National ICT Australia (NICTA) and an Australian Research Council Future Fellowship (project number FT15). This work was done when Ehsan Shareghi was an intern at IBM Research Australia.

A. Inequalities

We prove that these inequalities hold in expectation by making the reasonable assumption that events in
natural language follow the power law (Clauset et al., 2009),

$$p\big(C(u) = f\big) \propto f^{-1-\frac{1}{s}}$$

where s is the parameter of the distribution and C(u) is the random variable denoting the frequency of the m-gram pattern u. We now compute the expected number of unique patterns having a specific frequency, E[n_i[·^m]]. Corresponding to each m-gram pattern u, let us define a random variable X_u which is 1 if the frequency of u is i and zero otherwise. It is not hard to see that n_i[·^m] = Σ_u X_u, and

$$\mathbb{E}\big[n_i[\cdot^m]\big] = \mathbb{E}\Big[\sum_u X_u\Big] = \sum_u \mathbb{E}[X_u] = \sigma^m\,\mathbb{E}[X_u] = \sigma^m \Big( p\big(C(u) = i\big) \cdot 1 + p\big(C(u) \neq i\big) \cdot 0 \Big) \propto \sigma^m\, i^{-1-\frac{1}{s}}.$$

We can then verify that

$$(i+1)\,\mathbb{E}\big[n_{i+1}[\cdot^m]\big] < i\,\mathbb{E}\big[n_i[\cdot^m]\big] \;\Longleftrightarrow\; (i+1)\,\sigma^m\,(i+1)^{-1-\frac{1}{s}} < i\,\sigma^m\, i^{-1-\frac{1}{s}} \;\Longleftrightarrow\; i^{\frac{1}{s}} < (i+1)^{\frac{1}{s}},$$

which completes the proof of the inequalities.

B. Discount bounds proof sketch

The leave-one-out (leaving out those m-grams which occurred only once) log-likelihood function of the interpolative smoothing is lower bounded by the backoff model's (Ney et al., 1994); hence the estimated discounts for the latter can be considered an approximation of the discounts for the former. Consider a backoff model with absolute discounting parameter D, where P(w_i | w_{i-m+1}^{i-1}) is defined as:

$$P(w_i \mid w_{i-m+1}^{i-1}) = \begin{cases} \dfrac{c(w_{i-m+1}^{i}) - D}{c(w_{i-m+1}^{i-1})}, & \text{if } c(w_{i-m+1}^{i}) > 0\\[2ex] D\,\dfrac{n_{1+}(w_{i-m+1}^{i-1}\,\bullet)}{c(w_{i-m+1}^{i-1})}\,\bar{P}(w_i \mid w_{i-m+2}^{i-1}), & \text{if } c(w_{i-m+1}^{i}) = 0 \end{cases}$$

where n_{1+}(w_{i-m+1}^{i-1} •) is the number of unique right contexts of the pattern w_{i-m+1}^{i-1}. Assume that for any choice of 0 < D < 1 we can define P̄ such that P(w_i | w_{i-m+1}^{i-1}) sums to 1. For readability we use the replacement λ(w_{i-m+1}^{i-1}) = n_{1+}(w_{i-m+1}^{i-1} •) / (c(w_{i-m+1}^{i-1}) - 1).
Following (Chen and Goodman, 1999), rewriting the leave-one-out log-likelihood for KN (Ney et al., 1994) to include more discounts (in this proof up to D_4) results in:

$$\sum_{\substack{w_{i-m+1}^{i}:\\ c(w_{i-m+1}^{i}) > 4}} c(w_{i-m+1}^{i}) \log \frac{c(w_{i-m+1}^{i}) - 1 - D_4}{c(w_{i-m+1}^{i-1}) - 1} \;+\; \sum_{j=2}^{4} \sum_{\substack{w_{i-m+1}^{i}:\\ c(w_{i-m+1}^{i}) = j}} c(w_{i-m+1}^{i}) \log \frac{c(w_{i-m+1}^{i}) - 1 - D_{j-1}}{c(w_{i-m+1}^{i-1}) - 1} \;+\; \sum_{\substack{w_{i-m+1}^{i}:\\ c(w_{i-m+1}^{i}) = 1}} c(w_{i-m+1}^{i}) \log \Big( \big(\textstyle\sum_j n_j[\cdot^m] D_j\big)\, \lambda(w_{i-m+1}^{i-1})\, \bar{P} \Big)$$

which can be simplified to,

$$\sum_{\substack{w_{i-m+1}^{i}:\\ c(w_{i-m+1}^{i}) > 4}} c(w_{i-m+1}^{i}) \log\big(c(w_{i-m+1}^{i}) - 1 - D_4\big) + \sum_{j=2}^{4} j\, n_j[\cdot^m]\, \log(j - 1 - D_{j-1}) + n_1[\cdot^m]\, \log\Big(\sum_j n_j[\cdot^m] D_j\Big) + \text{const}$$

To find the optimal D_1, D_2, D_3, D_4 we set the partial derivatives to zero. For D_3,

$$\frac{4\, n_4[\cdot^m]}{3 - D_3} = \frac{n_1[\cdot^m]\, n_3[\cdot^m]}{\sum_j n_j[\cdot^m] D_j}$$

$$n_1[\cdot^m]\, n_3[\cdot^m]\,(3 - D_3) = 4\, n_4[\cdot^m] \sum_j n_j[\cdot^m] D_j > 4\, n_4[\cdot^m]\, n_1[\cdot^m]\, D_1$$

$$3 - 4\,\frac{n_4[\cdot^m]}{n_3[\cdot^m]}\, D_1 > D_3$$

And after taking c(w_{i-m+1}^{i}) = 5 out of the summation, for D_4:

$$\frac{\partial}{\partial D_4} = -\sum_{\substack{w_{i-m+1}^{i}:\\ c(w_{i-m+1}^{i}) > 5}} \frac{c(w_{i-m+1}^{i})}{c(w_{i-m+1}^{i}) - 1 - D_4} \;-\; \frac{5\, n_5[\cdot^m]}{4 - D_4} \;+\; \frac{n_1[\cdot^m]\, n_4[\cdot^m]}{\sum_j n_j[\cdot^m] D_j} = 0$$

Since the first summation is positive,

$$\frac{n_1[\cdot^m]\, n_4[\cdot^m]}{\sum_j n_j[\cdot^m] D_j} > \frac{5\, n_5[\cdot^m]}{4 - D_4}$$

$$n_1[\cdot^m]\, n_4[\cdot^m]\,(4 - D_4) > 5\, n_5[\cdot^m] \sum_j n_j[\cdot^m] D_j > 5\, n_5[\cdot^m]\, n_1[\cdot^m]\, D_1$$

$$4 - 5\,\frac{n_5[\cdot^m]}{n_4[\cdot^m]}\, D_1 > D_4$$
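Both appendices lend themselves to quick numeric sanity checks. The sketch below is illustrative only (the power-law parameter `s`, the count-of-counts `n`, and the helper names are made up): it verifies i) that (i+1)E[n_{i+1}] < iE[n_i] holds under the power-law assumption of Appendix A, and ii) that the closed-form discounts of Section 2.1 sit exactly at the upper bounds derived above, since D(1) = 1 - 2Y n_2/n_1 simplifies algebraically to Y = n_1/(n_1 + 2n_2):

```python
def expected_n(i, s):
    """E[n_i] up to the constant factor sigma^m, under p(C(u)=f) ∝ f^(-1-1/s)."""
    return i ** (-1.0 - 1.0 / s)

# Appendix A: (i+1) E[n_{i+1}] < i E[n_i] reduces to (i+1)^(-1/s) < i^(-1/s),
# which holds for every i >= 1 and any power-law parameter s > 0.
for s in (0.5, 1.0, 2.5):
    for i in range(1, 1000):
        assert (i + 1) * expected_n(i + 1, s) < i * expected_n(i, s)

def discounts_meet_bounds(n, depth=4):
    """Check that the estimate D(i) = i - (i+1)*Y*n[i+1]/n[i] equals the
    Appendix B bound i - (i+1)*(n[i+1]/n[i])*D(1), via the identity D(1) = Y.

    n: count-of-counts, n[i] = number of distinct m-grams of frequency i.
    """
    Y = n[1] / (n[1] + 2 * n[2])
    D1 = 1 - 2 * Y * n[2] / n[1]
    assert abs(D1 - Y) < 1e-12            # D(1) = Y algebraically
    for i in range(2, depth + 1):
        estimate = i - (i + 1) * Y * (n[i + 1] / n[i])
        bound = i - (i + 1) * (n[i + 1] / n[i]) * D1
        assert abs(estimate - bound) < 1e-12
    return True

discounts_meet_bounds({1: 1000, 2: 400, 3: 200, 4: 120, 5: 80})
```

The second check mirrors how the estimators are obtained in the text: the derivation yields strict upper bounds, and the closed-form discounts adopt those bound values as approximations.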
References

Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359-394.

Aaron Clauset, Cosma Rohilla Shalizi, and Mark E. J. Newman. 2009. Power-law distributions in empirical data. SIAM Review, 51(4):661-703.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Workshop on Statistical Machine Translation.

Kenneth Heafield. 2013. Efficient Language Modeling Algorithms with Applications to Statistical Machine Translation. Ph.D. thesis, Carnegie Mellon University.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Machine Translation Summit.

Hermann Ney, Ute Essen, and Reinhard Kneser. 1994. On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language, 8(1):1-38.

Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, and Trevor Cohn. 2015. Compact, efficient and unlimited capacity: Language modeling with compressed suffix trees. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Andreas Stolcke, Jing Zheng, Wen Wang, and Victor Abrash. 2011. SRILM at sixteen: Update and outlook. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop.

Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing.

Yee Whye Teh. 2006a. A Bayesian interpretation of interpolated Kneser-Ney. Technical report, NUS School of Computing.

Yee Whye Teh. 2006b. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.