Richer Interpolative Smoothing Based on Modified Kneser-Ney Language Modeling


Ehsan Shareghi, Trevor Cohn and Gholamreza Haffari
Faculty of Information Technology, Monash University
Computing and Information Systems, The University of Melbourne
first.last@{monash.edu,unimelb.edu.au}

Abstract

In this work we present a generalisation of Modified Kneser-Ney interpolative smoothing for richer smoothing via additional discount parameters. We provide mathematical underpinning for the estimator of the new discount parameters, and showcase the utility of our rich MKN language models on several European languages. We further explore the interdependency among the training data size, language model order, and number of discount parameters. Our empirical results illustrate that a larger number of discount parameters, i) allows for better allocation of mass in the smoothing process, particularly in the small data regime where statistical sparsity is severe, and ii) leads to significant reduction in perplexity, particularly for out-of-domain test sets which introduce a higher ratio of out-of-vocabulary words. (For the implementation see: https://github.com/eehsan/cstlm)

1 Introduction

Probabilistic language models (LMs) are the core of many natural language processing tasks, such as machine translation and automatic speech recognition. m-gram models, the cornerstone of language modeling, decompose the probability of an utterance into conditional probabilities of words given a fixed-length context. Due to the sparsity of events in natural language, smoothing techniques are critical for generalisation beyond the training text when estimating the parameters of m-gram LMs. This is particularly important when the training text is small, e.g. when building language models for translation or speech recognition in low-resource languages.

A widely used and successful smoothing method is interpolated Modified Kneser-Ney (MKN) (Chen and Goodman, 1999). This method uses a linear interpolation of higher and lower order m-gram probabilities, preserving probability mass via absolute discounting. In this paper, we extend MKN by introducing additional discount parameters, leading to a richer smoothing scheme. This is particularly important when statistical sparsity is more severe, i.e., in building high-order LMs on small data, or when out-of-domain test sets are used.

Previous research in MKN language modeling, and more generally m-gram models, has mainly dedicated efforts to making them faster and more compact (Stolcke et al., 2011; Heafield, 2011; Shareghi et al., 2015), using advanced data structures such as succinct suffix trees. An exception is Hierarchical Pitman-Yor Process LMs (Teh, 2006a; Teh, 2006b), providing a rich Bayesian smoothing scheme, for which Kneser-Ney smoothing corresponds to an approximate inference method. Inspired by this work, we directly enrich MKN smoothing, realising some of these reductions while remaining more efficient in learning and inference.

We provide estimators for our additional discount parameters by extending the discount bounds in MKN. We empirically analyze our enriched MKN LMs on several European languages in in- and out-of-domain settings. The results show that our discounting mechanism significantly improves the perplexity compared to MKN and offers a more elegant

way of dealing with out-of-vocabulary (OOV) words and domain mismatch.

2 Enriched Modified Kneser-Ney

Interpolative Modified Kneser-Ney (MKN) (Chen and Goodman, 1999) smoothing is widely accepted as a state-of-the-art technique and is implemented in leading LM toolkits, e.g., SRILM (Stolcke, 2002) and KenLM (Heafield, 2011). MKN uses lower order k-gram probabilities to smooth higher order probabilities. P(w | u) is defined as,

    P(w | u) = (c(uw) - D_m(c(uw))) / c(u) + (γ(u) / c(u)) · P̄(w | π(u))

where c(u) is the frequency of the pattern u, γ(·) is a constant ensuring the distribution sums to one, and P̄(w | π(u)) is the smoothed probability computed recursively based on a similar formula conditioned on the suffix of the pattern u, denoted by π(u). (Note that in all but the top layer of the hierarchy, continuation counts, which count the number of unique contexts, are used in place of the frequency counts (Chen and Goodman, 1999).) Of particular interest are the discount parameters D_m(·), which remove some probability mass from the maximum likelihood estimate of each event; this mass is redistributed over the smoothing distribution. The discounts are estimated as

    D_m(i) =
      0                                                       if i = 0
      1 - 2 (n_2[m]/n_1[m]) (n_1[m]/(n_1[m] + 2 n_2[m]))      if i = 1
      2 - 3 (n_3[m]/n_2[m]) (n_1[m]/(n_1[m] + 2 n_2[m]))      if i = 2
      3 - 4 (n_4[m]/n_3[m]) (n_1[m]/(n_1[m] + 2 n_2[m]))      if i ≥ 3

where n_i[m] is the number of unique m-grams of frequency i. This effectively leads to three discount parameters {D_m(1), D_m(2), D_m(3+)} for the distributions on a particular context length m.

2.1 Generalised MKN

Ney et al. (1994) characterized the data sparsity using the following empirical inequalities,

    3 n_3[m] < 2 n_2[m] < n_1[m]    for m ≤ 3.

It can be shown (see Appendix A) that these empirical inequalities can be extended to higher frequencies and larger contexts m > 3,

    (N-1) n_{N-1}[m] < ... < 2 n_2[m] < n_1[m] < Σ_{i>0} n_i[m] < n_0[m] < σ^m

where σ^m is the possible number of m-grams over a vocabulary of size σ, n_0[m] is the number of m-grams that never occurred, and Σ_{i>0} n_i[m] is the number of m-grams observed in the training data.

We use these inequalities to extend the discount depth of MKN, resulting in new discount parameters. The additional discount parameters increase the flexibility of the model in altering a wider range of raw counts, resulting in a more elegant way of assigning the mass in the smoothing process. In our experiments, we set the number of discounts to 10 for all levels of the hierarchy (compare this to 3 in MKN). We have selected the value of 10 arbitrarily; however, our approach can be used with a larger number of discount parameters, with the caveat that we would need to handle sparse counts in the higher orders. This results in the following estimators for the discounts,

    D_m(i) =
      0                                                                  if i = 0
      i - (i+1) (n_{i+1}[m]/n_i[m]) (n_1[m]/(n_1[m] + 2 n_2[m]))         if i < 10
      10 - 11 (n_11[m]/n_10[m]) (n_1[m]/(n_1[m] + 2 n_2[m]))             if i ≥ 10

It can be shown that the above estimators for our discount parameters are derived by maximizing a lower bound on the leave-one-out likelihood of the training set, following (Ney et al., 1994; Chen and Goodman, 1999) (see Appendix B for a proof sketch).
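For concreteness, the estimators above can be computed directly from the count-of-counts n_i[m]. The following is a minimal sketch of that computation, not the authors' released cstlm implementation: the function names, the Counter-based representation of m-gram counts, and the NaN fallback when a count-of-count is zero are assumptions made only for illustration, and the sketch ignores the continuation counts used below the top layer of the hierarchy.

    from collections import Counter

    # Illustrative sketch; not the authors' cstlm implementation.
    def count_of_counts(ngram_counts, max_i):
        """n[i] = number of unique m-grams whose training frequency is exactly i."""
        n = Counter()
        for c in ngram_counts.values():
            if 1 <= c <= max_i:
                n[c] += 1
        return n

    def discount(i, n, y):
        """D(i) = i - (i+1) * (n[i+1]/n[i]) * y, with y = n[1]/(n[1] + 2*n[2])."""
        if n[i] == 0:
            return float("nan")  # no m-grams of this frequency: estimator undefined
        return i - (i + 1) * y * n[i + 1] / n[i]

    def mkn_discounts(ngram_counts, depth=3):
        """Discounts D(1..depth); D(depth) applies to all counts >= depth.
        depth=3 mirrors vanilla MKN, depth=10 mirrors the generalised estimator."""
        n = count_of_counts(ngram_counts, depth + 1)
        y = n[1] / (n[1] + 2 * n[2])  # assumes the corpus has singleton m-grams
        return {i: discount(i, n, y) for i in range(1, depth + 1)}

    # Toy usage: 3-gram counts from a tiny corpus (continuation counts ignored).
    tokens = "the cat sat on the mat and the cat sat on the hat".split()
    m = 3
    counts = Counter(tuple(tokens[j:j + m]) for j in range(len(tokens) - m + 1))
    print(mkn_discounts(counts))            # D(1), D(2), D(3) as in vanilla MKN
    print(mkn_discounts(counts, depth=10))  # D(1)..D(10); mostly NaN on this toy data

On realistic corpora the count-of-counts up to n_11[m] are comfortably non-zero for low orders, but, as noted above, they become sparse for very high orders and very small corpora, which is exactly where the NaN (or otherwise degenerate) values appear in this sketch.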
3 Experiments

We compare the effect of using different numbers of discount parameters on perplexity using the Finnish (FI), Spanish (ES), German (DE), and English (EN) portions of the Europarl v7 (Koehn, 2005) corpus. For each language we excluded the first K sentences and used them as the in-domain test set (denoted as EU), skipped the second K sentences, and used the rest as the training set. The data was tokenized, sentence split, and the XML markup discarded.

We tested the effect of domain mismatch under two settings for out-of-domain test sets: i) mild, using the Spanish section of news-test 2013, the German and English sections of news-test 2014, and the Finnish section of news-test 2015 (all denoted as NT; available from http://www.statmt.org/{wmt13,14,15}/test.tgz), and ii) extreme, using a 24-hour period of Finnish and Spanish tweets streamed via the Twitter API on 17/05/2016 (denoted as TW), and the German and English sections of the patent descriptions of the medical translation task, http://www.statmt.org/wmt14/medical-task/ (denoted as MED). See Table 1 for statistics of the training and test sets.

[Table 1 and Figure 1 appear here; the numeric contents of Table 1 and the plotted values of Figure 1 are not recoverable from this transcription.]

Table 1: Perplexity for various m-gram orders, m, and training languages from Europarl, using different numbers of discount parameters for MKN. MKN (D[1...3]), MKN (D[1...4]), and MKN (D[1...10]) represent vanilla MKN, MKN with 1 more discount, and MKN with 7 more discount parameters, respectively. Test set sources EU, NT, TW, MED are Europarl, news-test, Twitter, and medical patent descriptions, respectively. OOV is reported as the ratio of OOV tokens to all tokens in the test set.

Figure 1: Percentage of perplexity reduction for pplx_{D[1...4]} and pplx_{D[1...10]} compared with pplx_{D[1...3]} on different training corpora (Europarl CZ, FI, ES, DE, EN, FR) and on news-test sets (NT), for two different LM orders m (left and right).

3.1 Results

Table 1 shows substantial reductions in perplexity on all languages for out-of-domain test sets when expanding the number of discount parameters from 3 in vanilla MKN to 4 and 10. Consider the English news-test (NT), in which even for a 2-gram language model a single extra discount parameter (m = 2, D[1...4]) improves the perplexity by 18 points, and this improvement quadruples to 77 points when using 10 discounts (m = 2, D[1...10]). This effect is consistent across the Europarl corpora and for all LM orders; we observe substantial improvements even for 2-gram models (see Figure 1). On the medical test set, which has a 9 times higher OOV ratio, the perplexity reduction shows a similar trend. However, these reductions vanish when an in-domain test set is used. Note that we use the same treatment of OOV words for computing the perplexities as is used in KenLM (Heafield, 2013).
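The perplexities in Table 1 are standard test-set perplexities; the paper only states that OOV words are treated as in KenLM. The sketch below shows one conventional way such an evaluation is computed, mapping OOV tokens to an <unk> symbol that the model scores like any other word. The lm_logprob callable, the <s>/</s> markers, and the <unk> convention are assumptions for illustration, not the authors' evaluation code.

    import math

    def perplexity(lm_logprob, test_sentences, vocab):
        """Perplexity of a tokenised test set under a language model.

        lm_logprob(word, context) is assumed to return the natural-log probability
        log P(word | context); OOV words are mapped to "<unk>" and scored, not skipped.
        """
        log_sum, n_tokens = 0.0, 0
        for sentence in test_sentences:
            context = ("<s>",)
            for word in list(sentence) + ["</s>"]:
                token = word if word in vocab else "<unk>"
                log_sum += lm_logprob(token, context)
                n_tokens += 1
                context = context + (token,)
        return math.exp(-log_sum / n_tokens)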
3.2 Analysis

Out-of-domain and Out-of-vocabulary. We selected Finnish, for which the number and ratio of OOVs are close on its out-of-domain and in-domain test sets (NT and EU), while showing a substantial reduction in perplexity on the out-of-domain test set; see the FI bars in Figure 1. Figure 2 (left) shows the full perplexity results for Finnish for vanilla MKN and our extensions when tested on the in-domain (EU) and out-of-domain (NT) test sets. The discount plot, Figure 2 (middle), illustrates the behaviour of the various discount parameters. We also measured the average hit length for queries by varying m on the in-domain and out-of-domain test sets. As illustrated in Figure 2 (right), the in-domain test set allows for longer matches to the training data as m grows. This indicates that having more discount parameters is not only useful for test sets with an extremely high number of OOVs, but also allows for a more elegant way of assigning mass in the smoothing process when there is a domain mismatch.
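The "average hit length" statistic is not formally defined in this excerpt; a natural reading, sketched below, is the average over test positions of the longest n-gram ending at that position that also occurs in the training data (the quantity a suffix-tree-backed model can look up directly). This exact definition and the helper names are assumptions.

    def all_ngrams(tokens, max_order):
        """Set of all n-grams (as tuples) of orders 1..max_order in a token list."""
        return {tuple(tokens[j:j + k])
                for k in range(1, max_order + 1)
                for j in range(len(tokens) - k + 1)}

    def average_hit_length(train_tokens, test_tokens, max_order=10):
        """Average, over test positions, of the longest n-gram ending at that
        position that also occurs in the training data."""
        train_ngrams = all_ngrams(train_tokens, max_order)
        total = 0
        for i in range(len(test_tokens)):
            hit = 0
            for k in range(1, min(max_order, i + 1) + 1):
                gram = tuple(test_tokens[i - k + 1:i + 1])
                if gram in train_ngrams:
                    hit = k
                else:
                    break  # a longer n-gram cannot occur if its suffix does not
            total += hit
        return total / len(test_tokens) if test_tokens else 0.0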

Figure 2: Statistics for the Finnish section of Europarl. The left plot illustrates the perplexity when tested on the out-of-domain (NT) and in-domain (EU) test sets, varying the LM order m. The middle plot shows the discount parameters D_i, i in [1...10], for different m-gram orders. The right plot corresponds to the average hit length on the EU and NT test sets. [Plots not recoverable from this transcription.]

Figure 3: Perplexity (z-axis) vs. LM order m (x-axis) vs. number of discounts D_i (y-axis) for German, trained on Europarl (left) and CommonCrawl 2014 (right) and tested on news-test. Arrows show the direction of the increase. [Surface plots not recoverable from this transcription.]

Interdependency of m, data size, and discounts. To explore the correlation between these factors we selected German and investigated this correlation on two different training data sizes: Europarl (61M words) and CommonCrawl 2014 (nearly 1B words). Figure 3 illustrates the correlation between these factors using the same test set but with the small and the large training set. The slopes of the surfaces indicate that the small training data regime (left), which has higher sparsity and more OOVs at test time, benefits substantially from the more accurate discounting compared to the large training set (right), in which the gain from discounting is slight. Nonetheless, the improvement in perplexity consistently grows with the introduction of more discount parameters even under the large training data regime, which suggests that more discount parameters, e.g., up to D_20, may be required for larger training corpora, to reflect the fact that even an event with a frequency of 20 might be considered rare in a corpus of nearly 1 billion words.

4 Conclusions

In this work we proposed a generalisation of Modified Kneser-Ney interpolative language modeling by introducing new discount parameters. We provide the mathematical proof for the discount bounds used in Modified Kneser-Ney, extend it further, and illustrate the impact of our extension empirically on different Europarl languages using in-domain and out-of-domain test sets. The empirical results on various training and test sets show that our proposed approach allows for a more elegant way of treating OOVs and mass assignments in interpolative smoothing. In future work, we will integrate our language model into the Moses machine translation pipeline to extrinsically measure its impact on translation quality, which is of particular use for the out-of-domain scenario.

Acknowledgements

This research was supported by National ICT Australia (NICTA) and an Australian Research Council Future Fellowship (project number FT15). This work was done when Ehsan Shareghi was an intern at IBM Research Australia.

A. Inequalities

We prove that these inequalities hold in expectation by making the reasonable assumption that events in

the natural language follow the power law (Clauset et al., 2009),

    p(C(u) = f) ∝ f^(-1 - 1/s),

where s is the parameter of the distribution and C(u) is the random variable denoting the frequency of the m-gram pattern u. We now compute the expected number of unique patterns having a specific frequency, E[n_i[m]]. Corresponding to each m-gram pattern u, let us define a random variable X_u which is 1 if the frequency of u is i and zero otherwise. It is not hard to see that n_i[m] = Σ_u X_u, and

    E[n_i[m]] = E[Σ_u X_u] = Σ_u E[X_u] = σ^m E[X_u]
              = σ^m ( p(C(u) = i) · 1 + p(C(u) ≠ i) · 0 )
              ∝ σ^m i^(-1 - 1/s).

We can verify that

    (i + 1) E[n_{i+1}[m]] < i E[n_i[m]]
    ⟺ (i + 1) σ^m (i + 1)^(-1 - 1/s) < i σ^m i^(-1 - 1/s)
    ⟺ i^(1/s) < (i + 1)^(1/s),

which completes the proof of the inequalities.

B. Discount bounds proof sketch

The leave-one-out (leaving out those m-grams which occurred only once) log-likelihood function of the interpolative smoothing is lower bounded by the backoff model's (Ney et al., 1994), hence the estimated discounts for the latter can be considered as an approximation for the discounts of the former. Consider a backoff model with absolute discounting parameter D, where P(w_i | w_{i-m+1}^{i-1}) is defined as:

    P(w_i | w_{i-m+1}^{i-1}) =
      (c(w_{i-m+1}^{i}) - D) / c(w_{i-m+1}^{i-1})                                          if c(w_{i-m+1}^{i}) > 0
      D · n_{1+}(w_{i-m+1}^{i-1} •) / c(w_{i-m+1}^{i-1}) · P̄(w_i | w_{i-m+2}^{i-1})        if c(w_{i-m+1}^{i}) = 0

where n_{1+}(w_{i-m+1}^{i-1} •) is the number of unique right contexts for the pattern w_{i-m+1}^{i-1}. Assume that for any choice of 0 < D < 1 we can define P̄ such that P(w_i | w_{i-m+1}^{i-1}) sums to 1. For readability we use the replacement λ(w_{i-m+1}^{i-1}) = n_{1+}(w_{i-m+1}^{i-1} •) / (c(w_{i-m+1}^{i-1}) - 1).

Following (Chen and Goodman, 1999), rewriting the leave-one-out log-likelihood for KN (Ney et al., 1994) to include more discounts (in this proof up to D_4) results in:

    Σ_{w_{i-m+1}^{i} : c(w_{i-m+1}^{i}) > 4} c(w_{i-m+1}^{i}) log [ (c(w_{i-m+1}^{i}) - 1 - D_4) / (c(w_{i-m+1}^{i-1}) - 1) ]
    + Σ_{j=2}^{4} Σ_{w_{i-m+1}^{i} : c(w_{i-m+1}^{i}) = j} c(w_{i-m+1}^{i}) log [ (c(w_{i-m+1}^{i}) - 1 - D_{j-1}) / (c(w_{i-m+1}^{i-1}) - 1) ]
    + Σ_{w_{i-m+1}^{i} : c(w_{i-m+1}^{i}) = 1} c(w_{i-m+1}^{i}) log [ (Σ_j n_j[m] D_j) · λ(w_{i-m+1}^{i-1}) · P̄ ]

which can be simplified to,

    Σ_{w_{i-m+1}^{i} : c(w_{i-m+1}^{i}) > 4} c(w_{i-m+1}^{i}) log (c(w_{i-m+1}^{i}) - 1 - D_4)
    + Σ_{j=2}^{4} j n_j[m] log (j - 1 - D_{j-1})
    + n_1[m] log (Σ_j n_j[m] D_j) + const.

To find the optimal D_1, D_2, D_3, D_4 we set the partial derivatives to zero. For D_3,

    - 4 n_4[m] / (3 - D_3) + n_1[m] n_3[m] / (Σ_j n_j[m] D_j) = 0
    ⟹ n_1[m] n_3[m] (3 - D_3) = 4 n_4[m] Σ_j n_j[m] D_j ≥ 4 n_4[m] n_1[m] D_1 > 0
    ⟹ 3 - 4 (n_4[m] / n_3[m]) D_1 > D_3.

And after taking c(w_{i-m+1}^{i}) = 5 out of the summation, for D_4:

    - Σ_{c(w_{i-m+1}^{i}) > 5} c(w_{i-m+1}^{i}) / (c(w_{i-m+1}^{i}) - 1 - D_4) - 5 n_5[m] / (4 - D_4) + n_1[m] n_4[m] / (Σ_j n_j[m] D_j) = 0
    ⟹ n_1[m] n_4[m] / (Σ_j n_j[m] D_j) - 5 n_5[m] / (4 - D_4) > 0
    ⟹ n_1[m] n_4[m] (4 - D_4) > 5 n_5[m] Σ_j n_j[m] D_j ≥ 5 n_5[m] n_1[m] D_1
    ⟹ 4 - 5 (n_5[m] / n_4[m]) D_1 > D_4.
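As a quick numerical illustration of the Appendix A argument, one can sample pattern frequencies from a discrete power law and check that i · n_i[m] is (in expectation) decreasing in i, which is the chain of inequalities used to justify the deeper discounts. The sampling scheme, the exponent s = 1, and the truncation at f_max are assumptions chosen only for this illustration.

    import random

    def empirical_count_of_counts(num_patterns=100_000, s=1.0, f_max=10_000, max_i=11):
        """Sample pattern frequencies f with P(f) proportional to f^(-1 - 1/s) and
        tally n[i], the number of patterns whose frequency is exactly i."""
        support = range(1, f_max + 1)
        weights = [f ** (-1.0 - 1.0 / s) for f in support]
        frequencies = random.choices(support, weights=weights, k=num_patterns)
        n = [0] * (max_i + 1)
        for f in frequencies:
            if f <= max_i:
                n[f] += 1
        return n

    n = empirical_count_of_counts()
    # Appendix A predicts (i + 1) * n[i + 1] < i * n[i], i.e. i * n[i] decreases with i.
    print([i * n[i] for i in range(1, 11)])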

References

Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359-394.

Aaron Clauset, Cosma Rohilla Shalizi, and Mark E. J. Newman. 2009. Power-law distributions in empirical data. SIAM Review, 51(4):661-703.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Workshop on Statistical Machine Translation.

Kenneth Heafield. 2013. Efficient Language Modeling Algorithms with Applications to Statistical Machine Translation. Ph.D. thesis, Carnegie Mellon University.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Machine Translation Summit.

Hermann Ney, Ute Essen, and Reinhard Kneser. 1994. On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language, 8(1):1-38.

Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, and Trevor Cohn. 2015. Compact, efficient and unlimited capacity: Language modeling with compressed suffix trees. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Andreas Stolcke, Jing Zheng, Wen Wang, and Victor Abrash. 2011. SRILM at sixteen: Update and outlook. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop.

Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing.

Yee Whye Teh. 2006a. A Bayesian interpretation of interpolated Kneser-Ney. Technical report, NUS School of Computing.

Yee Whye Teh. 2006b. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.