Scalable Non-Markovian Sequential Modelling for Natural Language Processing


Scalable Non-Markovian Sequential Modelling for Natural Language Processing

by

Ehsan Shareghi Nojehdeh

Thesis submitted for the fulfillment of the requirements for the degree of
Doctor of Philosophy

Faculty of Information Technology
Monash University

October, 2017

To Elham

The ancient pond
a frog jumps in
the splash of water

Matsuo Bashō

Copyright by Ehsan Shareghi Nojehdeh 2017

Notice

Except as provided in the Copyright Act 1968, this thesis may not be reproduced in any form without the written permission of the author. I certify that I have made all reasonable efforts to secure copyright permissions for third-party content included in this thesis and have not knowingly added copyright content to my work without the owner's permission.

Scalable Non-Markovian Sequential Modelling for Natural Language Processing

Declaration

I declare that this thesis is my own work and has not been submitted in any form for another degree or diploma at any university or other institute of tertiary education. Information derived from the published and unpublished work of others has been acknowledged in the text and a list of references is given.

Ehsan Shareghi Nojehdeh
October 12, 2017

ABSTRACT

Scalable Non-Markovian Sequential Modelling for Natural Language Processing
Ehsan Shareghi Nojehdeh
Monash University, 2017

Markov models are popular means of modelling the underlying structure of natural language, which is naturally represented as sequences and trees. The locality assumption made in low-order Markov models such as n-gram language models is limiting, because if the data generation process exhibits long-range dependencies, modelling the distribution well requires consideration of long-range context. On the other hand, higher-order Markov, or infinite-order Non-Markovian[1], models pose computational and statistical challenges during learning and inference. In particular, under the large data setting their exponential number of parameters often results in estimation and sampler mixing issues, while representing the structure of the model, its sufficient statistics, or the sampler state can quickly become computationally inefficient and impractical.

In order to exploit global context, we propose a novel Non-Markovian model based on the Hierarchical Nonparametric Bayesian paradigm to incorporate potentially infinite-length context. We demonstrate better performance compared with finite-order Markov models on various structured and sequence prediction tasks. To address the computational complexity issues inherent in the nature of Non-Markovian models, we propose a new modelling framework based on lossless compressed data structures to represent the required sufficient statistics of the model compactly. This allows infinite-depth Hierarchical Nonparametric Bayesian models to be represented in a space proportional to the size of the input data, while enabling an efficient inference mechanism to be developed.

Using our compressed framework to represent the models, we explore its scalability under two Non-Markovian language modelling settings, using large-scale data and infinite context. First, we model the Kneser-Ney family of language models and illustrate that our approach is several orders of magnitude more memory efficient than the state-of-the-art, in training and

testing, while it is highly competitive in terms of the runtimes of both phases. When memory is a limiting factor at query time, our approach is orders of magnitude faster than the state-of-the-art. Second, we consider the full Hierarchical Nonparametric Bayesian language model and propose a fast and memory-efficient approximate inference scheme. Compared with the existing Hierarchical Nonparametric Bayesian language models, our approach has a memory footprint several orders of magnitude lower, allowing us to apply it to data sizes more than one hundred times larger than the largest data used in previous models. This is achieved while avoiding potential mixing issues and consistently outperforming the state-of-the-art count-based Kneser-Ney family of language models by a significant margin.

The results of this work, as well as being significant for sequence and structured prediction tasks in NLP, point to a new direction of developing complex but compact statistical models that can scale up to very large and potentially real-world datasets without the need for compute clusters.

[1] We use the term Non-Markovian to refer to Markov models where the order of the model is unbounded.

Acknowledgments

I will forever be thankful to my advisers Dr. Gholamreza Haffari, Dr. Trevor Cohn, and Prof. Ann Nicholson. I consider myself extraordinarily lucky to have been given a chance to learn and develop under their careful and patient guidance. I thank my panel members Prof. Wray Buntine and Prof. Ingrid Zukerman, and my thesis examiners Dr. Brian Roark (Google) and Dr. Mark Dras (Macquarie University) for their feedback on this thesis. My sincere thanks go to Prof. Buntine for his generosity in sharing his insights regarding nonparametric Bayesian modelling and inference. I would also like to thank Philip Chan for the support to run many resource-intensive experiments on Monash Advanced Research Computing Hybrid (MonARCH) servers, and Dr. Matthias Petri for helping me to use the succinct data structure library (SDSL). I would like to extend my appreciation to Danette Deriane for her unbounded support as the Graduate Research Student coordinator, National ICT Australia (NICTA) for contributing to my scholarship, and IBM Research Australia for giving me the opportunity to work on interesting projects during my internship.

A special feeling of gratitude goes to my wife Elham, who has been with me all these years and has made them the best years of my life. My dear parents Fatemeh and Yusef, and my brother Aydin who have always been supportive and encouraging over many years, I cannot thank you enough. Of course no acknowledgments would be complete without giving thanks to wonderful colleagues and friends. I would like to thank my colleagues at Monash, Mohammad Shamsur Rahman, Bo Chen, Sameen Maruf, Poorya ZareMoodi, Ming Liu, Narjes Askarian, and He Zhao for their moral support and kindness. My dear friends Nader Chmait, James Collier, Parthan Kasarapu, Han Duy Phan, Omid Zanganeh, Quan Hung Tran, Andisheh Partovi, Kai Siong Yow, Milad Chenaghlou, Xuhui Zhang, Ying Yang, and Dinithi Sumanaweera, you have my deepest feeling of gratitude.

Publications

The publications arising from my thesis are:

(Published) E. Shareghi, G. Haffari, T. Cohn, A. Nicholson, "Structured Prediction of Sequences and Trees using Infinite Contexts", Proceedings of the European Conference on Machine Learning (ECML), 2015, Porto, Portugal.

(Published) E. Shareghi, M. Petri, G. Haffari, T. Cohn, "Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees", Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), 2015, Lisbon, Portugal.

(Published) E. Shareghi, M. Petri, G. Haffari, T. Cohn, "Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees", Transactions of the Association for Computational Linguistics (TACL), 2016.

(Published) E. Shareghi, T. Cohn, G. Haffari, "Richer Interpolative Smoothing Based on Modified Kneser-Ney Language Modeling", Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), 2016, Austin, USA.

(Published) E. Shareghi, G. Haffari, T. Cohn, "Compressed Nonparametric Language Modelling", Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2017, Melbourne, Australia.

Contents

Abstract
Acknowledgments
Publications
List of Tables
List of Figures
List of Algorithms
List of Abbreviations
List of Symbols

1 Introduction
    Applications of non-Markovian sequence models for structure prediction in NLP
    Scalable non-Markovian sequence models
    Scalable Bayesian non-Markovian sequence models
    Thesis outline

2 Background
    Statistical Language Modelling
        Finite-order Markov (n-gram) Language Models

        Statistical Sparsity and Smoothing
            Kneser-Ney (KN) smoothing
            Modified Kneser-Ney (MKN) smoothing
        Existing approaches for n-gram language modelling
    Nonparametric Bayesian Language Modelling
        Hierarchical Pitman-Yor Process (HPYP) language model
        The relation between KN and HPYP language models
        Inference in HPYP
    Markovian vs. Non-Markovian Language Models
    Compressed Data Structures
        Basic Concepts
        Compressed Suffix Arrays (CSA) and Compressed Suffix Trees (CST)
        Compressed Vectors
    Summary

3 Applications of Non-Markovian Sequential Modelling for Structured Prediction
    Probabilistic Context Free Grammars and Extensions
    Modelling Structures as Sequences
        Minimal Assumption and MAP Parameter Estimation
    Prediction
        A* search
        MCMC sampling
    Experiments
        Morphological parsing
        Syntactic parsing
        Part-of-Speech tagging
        Analysis
    Summary

4 Scalable Non-Markovian Sequential Modelling
    Overview of KN and MKN language models
    Basic compressed Kneser-Ney language models
        Basic dual Cst approach

        Improved single Cst approach
    Extending KN to MKN and further speed-up
        Efficient precomputation and compressed counts
        Computing MKN probability
    Experiments
        Perplexity
        Benchmarking runtime and memory requirements
        Character-level language modelling
    Richer Modified Kneser-Ney Language Model
        Generalized MKN
        Experiments
    Summary

5 Scalable Bayesian Non-Markovian Sequential Modelling
    Hierarchical Pitman-Yor Process Language Model
    Compressed HPYP LM Representation
    Efficient Approximate Inference
        Joint distribution of table and customer counts n_w^u, t_w^u
        Sampling under the joint distribution of n_w^u, t_w^u
        Sampling concentration θ_u and discount d_u parameters
        Sampling under Cst mechanics
    Experiments
        Perplexity
        Memory and time
        Analysis
    Summary

6 Conclusions and Future Directions
    Future directions for the research presented in this thesis

Appendix A Introducing auxiliary variables into the joint distribution
Appendix B Sampling auxiliary variables

Appendix C Sampling discount parameter

Bibliography

List of Tables

2.1 The required quantities for computing a 4-gram probability under KN
The required quantities for computing a 4-gram probability under MKN
One-to-One mapping of interpolative smoothing under KN and MKN
One-to-One mapping of interpolative smoothing under KN and HPYP
Datasets statistics for parsing and part-of-speech tagging
Morphological parsing results
Syntactic parsing results
Part-of-speech tagging results
Comparison between predicted and gold standard grammar patterns
One-to-One mapping of interpolative smoothing under KN and MKN
Summary of Csa and Cst functions used and their time complexity of inference
Perplexity results on 32 GiB of text for various languages
Perplexity on various data sizes, domains, and n-gram orders
Perplexity results for 1 billion word corpus
Perplexity for various n-gram orders and discount parameters
One-to-One mapping of interpolative smoothing under KN and HPYP
Complexities of computing critical sampling quantities
Perplexity of KN, MKN, HPYP-based smoothing on various data sizes
Perplexity of our approach under various settings

List of Figures

1.1 Examples of syntax tree, part-of-speech tag sequence, and language model
A comparison of various n-gram modelling approaches
Example of the smoothing hierarchy for an interpolative LM of depth
Examples of Suffix Tree, Suffix Array, Burrows-Wheeler Transform, Wavelet Tree
Procedure for generating Burrows-Wheeler Transformation
Example of a Backward-Search procedure
RRR compressed bit vector
Directly Addressable Variable-Length Codes (DAC) compressed integer vector
Structure and sequence prediction tasks
Examples of CFG refinements
Smoothing hierarchy for infinite-order parsing
Directions of Hierarchical Pitman-Yor Process smoothing
Pitman-Yor generated samples vs. data distribution
Example of a hyper-graph representation of the search space
Example of a binarized morphological tree
Presenting a part-of-speech Hidden-Markov-Model via our representation
Quantities required for computing KN using the forward and backward Csts
Examples of character-level data structures
Time breakdown of Cst-based procedures required for KN

4.4 Time breakdown for KN and MKN with/out precomputation
Distribution of precomputed values
Graphical representation of our MKN computation
Memory and time comparison - Cst, KenLM, SRILM on a small set
Memory and time comparison - Cst and KenLM
Percentage of perplexity reduction for various European languages
Examples of perplexity, discount values, and average hit length
Perplexity vs. number of discount parameters
Hierarchy of Chinese Restaurants
Direction of search, interpolation, and sampling for "abc"
Memory and time comparison - Sequence Memoizer and our approach

List of Algorithms

1 Gibbs Sampler Algorithm of Teh (2006a)
2 Compute one-sided occurrence counts, N_{1+}(·α) or N_{1+}(α·), for pattern α
3 Two-sided occurrence counts, N_{1+}(·α·)
4 KN probability P(w_k | w_{k-(n-1)}^{k-1})
5 KN probability P(w_k | w_{k-(n-1)}^{k-1}) using a single Cst
6 N_{1+}(·α·), using forward Cst
7 Compute backward occurrence counts, N_{1+}(·α), using only forward Cst
8 N_{1,2,3+}(α·) or N_{1,2,3+}(·α)
9 Precomputing expensive counts N_{1,2}(α·), N_{1+}(·α·), N_{1+}(·α), N_{1,2}(·α)
10 MKN probability P(w_i | w_{i-(n-1)}^{i-1})
11 Compute discounts
12 Gibbs Sampler for η, γ

List of Abbreviations

Burrows-Wheeler Transformation (BWT)
Chinese Restaurant Process (CRP)
Context Free Grammar (CFG)
Compressed Suffix Array (CSA)
Compressed Suffix Tree (CST)
Dirichlet Process (DP)
Gibibyte (GiB)
Hidden Markov Model (HMM)
Hierarchical Chinese Restaurant Process (HCRP)
Hierarchical Dirichlet Process (HDP)
Hierarchical Pitman-Yor Process (HPYP)
Kullback-Leibler divergence (KL divergence)
Kneser-Ney (KN)
Language Model/Modelling (LM)
Maximum A Posteriori (MAP)
Maximum Likelihood Estimation (MLE)
Mebibyte (MiB)
Modified Kneser-Ney (MKN)
Markov Chain Monte Carlo (MCMC)
Natural Language Processing (NLP)
Out-of-Vocabulary (OOV)
Part-of-Speech (POS)
Pitman-Yor Process (PYP)

Probabilistic Context Free Grammar (PCFG)
Suffix Array (SA)
Suffix Tree (ST)
Wavelet Tree (WT)

List of Symbols

n: order of the Markov model
σ: alphabet/vocabulary set
w: word
r: grammar rule
w_1^N: a sequence w_1 w_2 ... w_N of length N
T: text in Language Modelling
T: syntax tree in Syntactic parsing
u: overloaded: restaurant/distribution in CRP/PYP, and context in NLP tasks
c(α): frequency of α
ε: null context
d_u: discount parameter of Pitman-Yor Process u
θ_u: concentration parameter of Pitman-Yor Process u
n_w^u: number of customers in restaurant u having dish w
n_·^u: total number of customers in restaurant u
t_w^u: number of tables in restaurant u having dish w
t_·^u: total number of tables in restaurant u
I_w^u: arrangement of n_w^u customers around t_w^u tables in restaurant u
S(n, t): Stirling number of the second kind
S_d(n, t): generalized Stirling number
(a|b)_c: Pochhammer symbol
(a)!: factorial function
Γ(a): Gamma function

CHAPTER 1

Introduction

Generating an utterance to convey a message involves, at least, the coordination between two phenomena: choosing meaningful words and phrases, and placing them in a correct grammatical structure. This is typically orchestrated in an incremental procedure which generates the utterance from left to right. An important property of this incremental procedure is the different layers of dependency on which each step of the generation process relies. For example, in a sentence, choosing a meaningful verb depends on the sequence of previously generated words while satisfying the subject-verb agreement that exists between a verb and its subject. Another example, among many other effects, is the number agreement between a determiner and its noun, which can be separated by an arbitrary number of adjectives. In fact, the notion of dependency is embodied in various aspects of human language, from its recursive grammatical structure (Chomsky, 1959) to selecting the next word of a sentence, which has a strong statistical dependency on the previous words in the utterance (Shannon, 1951). Methods for automatically capturing this notion have attracted a significant amount of attention from the statistical modelling perspective.

A family of statistical models designed to capture language dependencies are Markov models, where the dependency is assumed to only exist within a short span. The notion of order in Markov models is analogous to the range of dependency considered by these models, and typically Markov models are used in a low-order setting. While the locality assumption in low-order Markov models often fails to capture long-range dependencies, these models are

still popular due to their mathematical simplicity in learning and prediction. An example of such a dependency, where a finite-order Markov model fails, is the determiner-noun agreement which can be separated by any number (infinite in theory) of adjectives, e.g., a ball, a red ball, a red bouncy ball, etc. In fact, human language is far more complex than can be captured by low-order Markov models (Good, 1969), and an ideal model should have an infinite memory to completely capture the past, or have a selection mechanism to skip the uninformative segments of the past.

One natural approach would be to use a Markov model with a very high order. However, high-order Markov models can quickly become impractical as the size of the Markov memory n increases: the number of parameters, σ^(n+1), grows exponentially with n in theory (in practice the growth is still unmanageable for orders 7-10 and above), where σ is the number of model parameters when n = 0 (i.e., the size of the vocabulary). An approximation to high-order Markov models are variable-order Markov models (Ron et al., 1994), which were proposed to mitigate this problem by allowing a model with a dynamic range of memories. While these models are more robust than low-order Markov models in capturing long-range dependencies, they still rely on pruning a space of possibilities which grows exponentially with n. Also, the pruning requires careful threshold tuning as well as computing a statistical measure before and after each pruning decision, a process which is prone to overfitting the training data (Mochihashi and Sumita, 2007). Another limitation of these models is that they assume a most-recent bias (the most recent element is more important), which is sensible but may not be universally appropriate.

In the context of large datasets, having scalable models is of central importance. However, all the aforementioned issues with high or variable-order Markov models are exacerbated as the size of the data grows. This is due both to the growth of the number of parameters, and the computational complexity of collecting the sufficient statistics for estimating the model parameters. Therefore, Markov models have only been successfully applied either in the low-order setting on potentially large datasets, or in the high-order setting on small datasets.

In order to capture all dependency ranges, in this thesis we develop techniques around non-Markovian (which we refer to herein as ∞-order Markov) sequential models. The ∞-order in these models highlights their capacity to capture the various dependency ranges each component of an utterance exhibits, without imposing a fixed-range dependency on all its components.
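To make the parameter growth mentioned above concrete, the following back-of-the-envelope calculation (my own illustration, not taken from the thesis) counts the conditional probabilities an order-n model would need to store for a hypothetical vocabulary of 10,000 word types; the table outgrows any realistic memory budget within a handful of orders.

    # Illustration only: an order-n Markov model stores one conditional probability
    # per (length-n context, next word) pair, i.e. sigma**(n+1) parameters.
    sigma = 10_000  # hypothetical vocabulary size

    for n in range(6):
        print(f"order n={n}: {sigma ** (n + 1):.2e} parameters")
    # order n=0: 1.00e+04 parameters
    # ...
    # order n=5: 1.00e+24 parameters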

More formally, the main hypothesis of this thesis is that finite-order Markov models fail to capture the long-range dependencies that exist in human language, and we empirically examine and illustrate this claim. We illustrate that this is the case in various NLP tasks framed over sequences or tree structures. For instance, in the language modelling task we show that higher-order n-gram models result in better predictive accuracy compared to the widely used 5- and 6-gram models, and similarly for syntactic parsing incorporating longer-range context improves the performance. We propose ∞-order modelling frameworks which make no assumption about the range of the dependency that exists in the data and achieve significant performance improvements over low/finite-order models in various NLP tasks. We examine, for different NLP tasks, the effectiveness of various modelling, learning, and inference paradigms, from point estimation of model parameters via maximizing the likelihood (MLE) or the posterior (MAP), to the full Bayesian treatment where the entire distribution over the parameter space is considered.

Given a modest training corpus of only a few GiB, under any of the aforementioned modelling paradigms, representing the structure of an ∞-order model amounts to a significant memory usage which can quickly become impractical. This is due to the exponential growth of the number of model parameters (floats) as a function of n and data size. We propose a framework based on compressed data structures which operates on the compressed representation of the data, and hence keeps the memory usage of the modelling, learning, and inference steps independent of the order of the Markov model, and proportional to the size of the data.

Having lifted the memory requirement of model representation, we now turn to learning challenges. The very large (potentially infinite) space of parameters of ∞-order models must be estimated for use during prediction. This introduces both computational and statistical burdens in the learning (training) phase, which are also a function of the data size. Our proposed compressed framework is compact and supports efficient search for count-based quantities. We utilize the search efficiency of our framework via algorithmic optimizations, and adopt a lazy approach which skips parameter estimation during the training phase. Then, at the inference phase, given a query, various search operations are launched over the compressed representation of the data to extract the required quantities on-the-fly. This avoids the statistical and computational challenges of the learning phase, but demands a careful design of the inference algorithms.
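As a minimal sketch of this lazy, query-time strategy (my own illustration; the thesis uses compressed suffix trees, whereas the plain, uncompressed suffix array below is only a stand-in), the training phase builds nothing but an index of the text, and every count needed to answer a query is located by search at query time:

    import bisect

    def build_suffix_array(tokens):
        # Toy O(N^2 log N) construction; a compressed suffix tree/array is used in practice.
        return sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def frequency(tokens, sa, pattern):
        # All suffixes starting with `pattern` form a contiguous band of the suffix
        # array; binary search locates its boundaries, and the band width is c(pattern).
        pattern = list(pattern)
        prefixes = [tokens[i:i + len(pattern)] for i in sa]  # materialised only for clarity
        return bisect.bisect_right(prefixes, pattern) - bisect.bisect_left(prefixes, pattern)

    text = "a red ball a red bouncy ball a ball".split()
    sa = build_suffix_array(text)               # the only work done at "training" time
    print(frequency(text, sa, ("a", "red")))    # 2
    print(frequency(text, sa, ("ball",)))       # 3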

The final step is prediction, which involves two remaining issues: a statistical issue, namely accurate on-the-fly parameter estimation, and a computational issue, namely the time cost of parameter estimation and subsequent inference. These issues are functions of the data size and the choice of model. For example, the number of required parameters for an ∞-order model can still be tractable if a simple count ratio is required. However, for almost all the models that perform well, a naive on-the-fly estimation of the parameters will be too slow to be practical, and a fast but inexact parameter estimation will be too inaccurate to be useful. While our proposed solutions vary in details for different modelling paradigms, at their core they rely on the same concept. We utilize the search efficiency of our proposed compressed framework at inference time and propose algorithmic optimizations to collect all the required quantities on-the-fly and efficiently. To avoid the statistical issue of estimating an infinitely large parameter space, we limit the space to the smallest subset which includes the required parameters to answer a particular query.

In the following three sections, we provide more details about the contributions of this thesis and the way in which we framed and presented our research. We start, in Section 1.1, by looking at structure and sequence prediction tasks, and compare the performance improvements that ∞-order models offer over finite-order models. At their core, our proposed models are built on top of a sequence model, which allows a unified modelling and inference scheme to be developed for both sequential and non-sequential tasks. The aim is to verify the hypothesis of the thesis in various NLP tasks, namely syntactic parsing, part-of-speech tagging, and morphological parsing. In Section 1.2, we deal with training text corpora ranging from small to large in size and elaborate on the model representation, computational complexity, and statistical issues of the learning and inference phases. Having established in Section 1.1 that a sequence model can be extended for structure prediction, we focus on the inherently sequential task of language modelling, and propose our compressed framework along with the algorithmic optimizations required for fast prediction. While our chosen modelling paradigm in Section 1.2 is MLE, in Section 1.3 we turn to Bayesian modelling, which is extremely powerful but unpopular due to all the aforementioned issues. We adjust the compressed framework proposed in Section 1.2 and develop efficient inference algorithms that are fast and avoid common statistical issues of large Bayesian models.

1.1 Applications of non-Markovian sequence models for structure prediction in NLP

Many natural language processing tasks rely on learning and predicting linguistic structures (Smith, 2011). For instance, in Machine Translation, accurately identifying the language constituents, e.g. noun phrase (NP) and verb phrase (VP), is a key element for identifying the re-ordering needed when translating from one language to another. A prime example of linguistic structures is a syntax tree. In a syntax tree, inner nodes (non-terminals) are syntactic categories such as NP and VP, and leaves (terminals) are words, see Figure 1.1(a). The task of predicting a syntax tree is called syntactic parsing, and can be defined as finding one or more syntactic structures of a given sentence that can be generated with a particular grammar (Manning and Schütze, 1999).

The syntax tree of an utterance can be generated by combining a set of rules from a grammar, such as a context free grammar (CFG). A CFG is a 4-tuple G = (T, N, S, R), where T is a set of terminal symbols, N is a set of non-terminal symbols, S ∈ N is the distinguished root non-terminal, and R is a set of grammar rules. The grammar rules are often in Chomsky Normal Form (CNF), taking either the form A → B C or A → a, where A, B, C are non-terminals and a is a terminal. A Probabilistic CFG (PCFG) assigns a probability to each grammar rule, where Σ_{B,C} P(A → B C | A) = 1 and Σ_a P(A → a | A) = 1. Modelling and parsing are traditionally done via a PCFG and a dynamic programming algorithm (Cocke and Schwartz, 1970), respectively. The PCFG model can be considered as a Markov model, where the selection of the next grammar rule only depends on a single frontier constituent. Hence, PCFG parsing performs poorly: it lacks sensitivity to the context (both lexical and syntactic) in the tree and ignores the long-range dependencies that exist between the constituents. Many context-dependent extensions of PCFGs have been proposed (Johnson, 1998; Collins, 1999; Petrov et al., 2006; Johnson et al., 2007a; Liang et al., 2007; Finkel et al., 2007; Cohn et al., 2010), but they all consider short-range dependencies.[1] We relax the strong local Markov assumptions in the PCFG by increasing the order of the model to ∞, hence capturing phenomena outside of the local Markov context.

[1] Likewise, previous works on applying Markov models to part-of-speech tagging (see Figure 1.1(b)) either considered finite-order Markov models (Brants, 2000) or finite-order HMMs (Thede and Harper, 1999). In this section we focus on syntactic parsing; the details of part-of-speech tagging and morphological parsing are discussed in Chapter 3.
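The definitions above can be made concrete with a small example. The toy CNF grammar below (rules and probabilities are invented for illustration, loosely following the categories of Figure 1.1) checks the two normalization constraints on the rule probabilities:

    # Toy PCFG in Chomsky Normal Form; rules and probabilities are invented for
    # illustration. Each key is a non-terminal A; each value maps a right-hand side
    # (a pair of non-terminals, or a single terminal) to P(A -> rhs | A).
    pcfg = {
        "S":    {("NP", "VP"): 1.0},
        "NP":   {("DT", "NN"): 1.0},
        "VP":   {("VBZ", "ADJP"): 1.0},
        "ADJP": {("JJ", "PP"): 1.0},
        "PP":   {("IN", "NP"): 1.0},
        "DT":   {("The",): 0.6, ("a",): 0.4},
        "NN":   {("Force",): 0.5, ("ball",): 0.5},
        "VBZ":  {("is",): 1.0},
        "JJ":   {("strong",): 0.7, ("red",): 0.3},
        "IN":   {("with",): 1.0},
    }

    # Check sum_{B,C} P(A -> B C | A) = 1 for branching non-terminals and
    # sum_{a} P(A -> a | A) = 1 for pre-terminals.
    for lhs, rules in pcfg.items():
        assert abs(sum(rules.values()) - 1.0) < 1e-9, lhs
    print("all rule distributions sum to one")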

Figure 1.1: (a) Syntax tree and (b) part-of-speech tags for "The Force is strong with you"; (c) predicting the next word in language modelling. In the prediction step, we condition each decision, denoted by gray boxes, on the full chain of ancestors (context), denoted by green dashed lines.

Our model conditions the generation of a rule in a tree on its unbounded history, i.e., its ancestors on the path towards the root of the tree. As illustrated in Figure 1.1, predictions in syntactic parsing and part-of-speech tagging can both be framed as instances of the language modelling problem (sequence prediction), Figure 1.1(c), where a single prediction depends on a chain of previously generated rules or words. For instance, in syntactic parsing (Figure 1.1(a)) the green dashed line marks the chain of ancestor rules which appear before expanding NP to PRP. The same concept is illustrated for part-of-speech tagging in Figure 1.1(b).

Therefore, we frame all these tasks as a sequence prediction problem and propose a non-Markovian model based on ∞-order sequence models (Gasthaus and Teh, 2010; Wood et al., 2011) for predicting latent linguistic structures, such as syntax trees or part-of-speech tags. We show that our sequential modelling approach can be applied to various structure prediction tasks in NLP, and propose effective algorithms to tackle the significant learning and inference challenges posed by the infinite memory. More specifically, we propose an infinite-memory hierarchical Bayesian non-parametric model for the generation of linguistic utterances and their corresponding structure (e.g., the sequence of POS tags or syntax trees). Our model conditions each decision in a tree-generating process on an unbounded context consisting of the vertical chain of its ancestors, in the same

way that infinite sequence models (e.g., ∞-gram language models) condition on an unbounded window of linear context (Mochihashi and Sumita, 2007; Wood et al., 2009). Learning in this model is particularly challenging due to the large space of contexts and the corresponding data sparsity. For this reason, predictive distributions associated with contexts are smoothed using distributions for successively smaller contexts via a hierarchical Bayesian model. The infinite context makes it impossible to directly apply dynamic programming for structure prediction. We present two inference algorithms, based on A* and Markov Chain Monte Carlo (MCMC), for predicting the best structure for a given input utterance, which lead to performance improvements over the finite-order Markov models.

As explained, many other fundamental NLP tasks can be framed as a sequence prediction problem, of which language modelling is a prime example. This motivates our research in developing a scalable ∞-order language modelling framework. In the next two sections, we briefly overview the shortcomings of existing approaches in the ∞-order setup, and propose our compressed framework for ∞-order modelling. We first, in Section 1.2, establish means of efficient ∞-order modelling and prediction under MLE, and then, in Section 1.3, extend our framework to a fully Bayesian paradigm.

1.2 Scalable non-Markovian sequence models

Language models (LMs), as illustrated in Figure 1.1(c), are critical components in many modern NLP systems, including automatic speech recognition (Rabiner and Juang, 1993). The most widely used LMs are non-Bayesian n-gram models (Chen and Goodman, 1999), which follow a Markov assumption and decompose the probability of an utterance into conditional probabilities of words given a finite-size context. Three sources of performance improvements for n-gram models are the use of smoothing techniques (Chen and Goodman, 1999), higher-order models (Wood et al., 2009), and the inclusion of more training data (Buck et al., 2014). The most popular LM toolkits, SRILM (Stolcke et al., 2011) and KenLM (Heafield, 2011), are based on explicit storage of n-grams and their probabilities using various smoothing techniques. However, depending on the order and the training corpus size, a typical model may contain as many as several hundred billion n-grams (Brants et al., 2007), raising challenges of efficient storage and retrieval. In fact, these

toolkits are impractical for learning high-order LMs on large corpora, due to their poor scaling properties in both the training and query phases. There is a trade-off among accuracy, space, and time, with recent papers considering small but approximate lossy LMs (Chazelle et al., 2004; Talbot and Osborne, 2007a; Guthrie and Hepple, 2010; Church et al., 2007), lossless LMs backed by tries (Stolcke et al., 2011) and related compressed structures (Germann et al., 2009; Heafield, 2011; Pauls and Klein, 2011; Sorensen and Allauzen, 2011), or distributed computation (Heafield et al., 2013; Brants et al., 2007). However, none of these approaches scale well to very high orders or very large corpora, due to their high memory and time requirements.

As briefly mentioned in Section 1.1, our proposed models are all based on ∞-order sequence prediction modelling. ∞-order sequence models were proposed for the language modelling task, where they showed significant success but could only deal with very small datasets (Wood et al., 2009; Gasthaus and Teh, 2010). This is, similar to the aforementioned n-gram LMs, due to the statistical challenges of estimating the parameters, and the computational time and memory usage of these models in the representation, training, and test phases as n grows or the data size exceeds a few hundred MiB. We take a fundamentally different approach and skip the estimation of the required parameters at training time. Instead, we make use of recent advances in compressed suffix trees (Csts) (Sadakane, 2007) and build a compact representation of the text with a memory requirement proportional to the size of the data. We then use the compact representation for extracting the required statistics on-the-fly during the test phase. This bypasses the memory demand of representing the model, and the statistical issues of estimating its parameters during the training phase. Then, at test time, frequencies and other required statistics for given n-gram queries are extracted from the Cst. To extract the sufficient statistics efficiently, optimized algorithms for computing the Kneser-Ney (KN) (Kneser and Ney, 1995) and Modified Kneser-Ney (MKN) (Chen and Goodman, 1999) LM probabilities are developed. The proposed approach has favorable scaling properties with n and data size, has only a modest memory requirement, and allows for fast construction and querying.

To make the query speed competitive with the state-of-the-art methods, we precompute counts (i.e., the number of unique contexts to the left/right of a string) that are very expensive to compute at query time. The precomputed quantities are then stored in a compressed data structure,

supporting efficient memory usage and lookup. Also, we reuse Cst nodes within the n-gram probability computation as a sentence is scored left-to-right, thus saving many expensive lookups. The strengths of this method are apparent when applied to very large training datasets (≥ 16 GiB) and for high-order models, n ≥ 5. In this setting, while this approach is more memory efficient than the leading toolkits, both in the construction (training) and querying (testing) phases, it is highly competitive in terms of the runtimes of both phases. When memory is a limiting factor at query time, the proposed approach is orders of magnitude faster than the state of the art. Moreover, our method allows for efficient querying with an unlimited Markov order, n → ∞, without resorting to approximations or heuristics.

We revisit the training procedure of Kneser-Ney and illustrate that the Modified Kneser-Ney language model is in fact trained by maximizing the (leave-one-out) likelihood. We then extend these models by allowing a more elegant way of dealing with out-of-vocabulary words and domain mismatch. We will show in Chapter 2 that Kneser-Ney smoothing (an MLE approach) is an approximation to a much richer hierarchical Bayesian model, the Hierarchical Pitman-Yor Process. Having established means of ∞-order language modelling in an MLE fashion, in the next section we briefly explain our proposed approach for efficient and scalable modelling and inference under the Bayesian paradigm.

1.3 Scalable Bayesian non-Markovian sequence models

Bayesian modelling is a natural fit for describing the uncertainty about model parameters (Bishop, 2007). Hierarchical Bayesian modelling extends this and provides a powerful framework to statistically model the dependence between different phenomena, e.g. the dependence between a word and its topic. A significant advancement for such general-purpose models was the development of the Bayesian nonparametric Hierarchical Dirichlet Process (HDP) (Teh et al., 2006) and the Hierarchical Pitman-Yor Process (HPYP) extension, which was applied to language modelling by Teh (2006a). The HPYP allows the complexity of the model to be learned from the data, and to grow as more data arrives.

The intuition behind KN smoothing is to adjust the original distribution to assign non-zero probability to unseen or rare events. This is achieved by re-allocating the maximum likelihood

estimated probability mass from more frequent events to rare and unseen events in an interpolative procedure via absolute discounting. It turns out that the Bayesian generalization of KN smoothing is the HPYP LM (Teh, 2006a), which was originally developed for finite-order LMs (Teh, 2006b) and was extended as the Sequence Memoizer (SM) (Wood et al., 2011) to model infinite-order LMs. Capturing long-range dependencies via the HPYP improves the estimation of conditional probabilities. These types of models, however, remain impractical due to several computational and learning challenges, namely large model size (the data structure representing the model, and the number of parameters), long training and test times, and poor sampler mixing. We address these issues by building the HPYP model on top of a Cst. In the training step, only the Cst representation of the text is constructed, allowing for fast training, and we propose an efficient approximate inference algorithm for test time. The mixing issue is avoided via heuristic sampler initialization and design.

Our proposed approximation of the HPYP is richer than KN and MKN, and is much more efficient in the learning and inference phases compared to the full HPYP. Compared with 10-gram KN and MKN models, the ∞-gram model consistently improves the perplexity by up to 15%. Using compressed data structures allows us to train on large collections of text, i.e. 100× larger than the largest dataset used in HPYP language models (Wood et al., 2011), while having a memory footprint several orders of magnitude smaller and supporting fast and efficient inference.

1.4 Thesis outline

In this section we provide an outline of the rest of the thesis, a summary of each chapter, and references to published works resulting from each chapter.

Chapter 2: Background

In Chapter 2 we provide a brief overview of the foundations for the research in this thesis, including finite-order Markov language models, infinite-order non-parametric Bayesian language models, and compressed data structures. We discuss related work in detail in each specific chapter.

Chapter 3: Applications of Non-Markovian Sequential Modelling for Structured Prediction

This chapter is based on:
E. Shareghi, G. Haffari, T. Cohn, A. Nicholson, "Structured Prediction of Sequences and Trees using Infinite Contexts", Proceedings of the European Conference on Machine Learning (ECML), 2015, Porto, Portugal.

To demonstrate that ∞-order models achieve better performance in various NLP tasks, we propose a novel Hierarchical Pitman-Yor Process model for structured prediction over sequences and trees, which exploits infinite context by conditioning each generation decision on an unbounded context of previous events. While at its core our model remains a sequence model, at the inference stage we propose novel algorithms to predict structured labels (i.e., grammar trees). We propose prediction algorithms based on A* search and Markov Chain Monte Carlo sampling. Empirical results demonstrate the potential of our infinite-context model compared to baseline finite-context Markov models on morphological and syntactic parsing, and competitive performance on the POS tagging task.

Chapter 4: Scalable Non-Markovian Sequential Modelling

This chapter is based on:
E. Shareghi, M. Petri, G. Haffari, T. Cohn, "Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees", Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), 2015, Lisbon, Portugal.
E. Shareghi, M. Petri, G. Haffari, T. Cohn, "Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees", Transactions of the Association for Computational Linguistics (TACL), 2016.
E. Shareghi, T. Cohn, G. Haffari, "Richer Interpolative Smoothing Based on Modified Kneser-Ney Language Modeling", Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), 2016, Austin, USA.

A prime example of sequence modelling is the language modelling task. In this chapter we consider the Kneser-Ney and Modified Kneser-Ney language models. We propose a new approach based on compressed suffix trees to represent the structure of the KN-based language models. This results in memory usage roughly matching the size of the bzip2-compressed text and fast training, while prediction remains slow. To speed up inference, we precompute and compactly store the expensive quantities, and propose algorithmic optimizations to reuse information while processing queries as they arrive at test time. Our proposed approach for KN and MKN is several orders of magnitude

more memory efficient than the state-of-the-art, in both training and testing. It is highly competitive in terms of the runtimes of both phases. When memory is a limiting factor at query time, our approach is orders of magnitude faster than the state-of-the-art. Finally, as opposed to the current state-of-the-art, our approach scales very efficiently to large data (i.e. 32 GiB) and infinite contexts.

The ability of the proposed approach to represent the models compactly allows us to explore more complex models. Based on the training procedure of KN, we illustrate that MKN is trained by maximizing the leave-one-out likelihood, and present a generalization of the MKN LM for richer smoothing via the introduction of additional discount parameters. The discount parameters are responsible for preserving some mass to allocate non-zero probability to unseen or rare events at test time (a schematic sketch of this discount-and-interpolate computation is given after this outline). We provide the mathematical underpinning for the estimators of the discount bounds and extend them further. We showcase the utility of our rich MKN LM on several languages and further explore the interdependency among training data size, language model order, and the number of discount parameters. Our empirical results illustrate that a larger number of discount parameters, compared to the KN and MKN LMs, allows for better allocation of mass in the smoothing process, particularly in the small-data regime where statistical sparsity is severe, and leads to significant reductions in perplexity, particularly for out-of-domain test sets which introduce a higher ratio of out-of-vocabulary words.

Chapter 5: Scalable Bayesian Non-Markovian Sequential Modelling

This chapter is based on:
E. Shareghi, G. Haffari, T. Cohn, "Compressed Nonparametric Language Modelling", Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2017, Melbourne, Australia.

The KN and MKN language models, trained in an MLE fashion, are approximations to the Hierarchical Pitman-Yor Process (Bayesian) model. In this chapter we turn to the fully Bayesian approach to modelling and inference and propose a compressed framework to compactly represent the structure of an HPYP along with the sufficient statistics needed to recover the state of the sampler. While the compressed data structures allow for a compact representation of the HPYP LM, this is not sufficient for scaling these types of models up to standard corpora. In fact, these models can still remain impractical due to the computational complexity of sampling and the costly inference, e.g. through poor sampler mixing. To address these issues, we develop an efficient,

fast approximate inference scheme with a much lower memory footprint compared to the full HPYP inference of existing models. The experimental results illustrate that our proposed framework and approximate inference scheme can be built on significantly larger datasets compared to previous HPYP models, while being several orders of magnitude smaller, fast in training and inference, and consistently outperforming the MKN LM in terms of predictive perplexity by up to 15%.

Chapter 6: Conclusion

In Chapter 6 we provide concluding remarks on the contributions presented in this thesis, as well as potential avenues for future research.
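As a schematic reference for the discount-and-interpolate computation referred to in the Chapter 4 summary above, the toy bigram model below (my own sketch; proper KN and MKN additionally build the lower-order distribution from continuation counts and use several discounts) shows how a single discount D removes mass from observed counts and hands it to the lower-order distribution:

    from collections import Counter

    def absolute_discount_bigram(tokens, discount=0.75):
        # Toy interpolated bigram model with one absolute discount D; it sketches the
        # general shape of KN-style smoothing only (no continuation counts, single D).
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        total = len(tokens)

        def prob(word, context):
            p_lower = unigrams[word] / total             # lower-order (unigram) estimate
            ctx = unigrams[context]
            if ctx == 0:
                return p_lower                           # unseen context: back off entirely
            observed = max(bigrams[(context, word)] - discount, 0) / ctx
            followers = len({w for (c, w) in bigrams if c == context})
            reserved = discount * followers / ctx        # mass freed by discounting
            return observed + reserved * p_lower         # interpolate with the lower order

        return prob

    p = absolute_discount_bigram("a red ball a red bouncy ball a ball".split())
    print(p("ball", "red"))   # seen bigram: discounted count plus interpolated mass
    print(p("a", "bouncy"))   # unseen bigram: relies entirely on the reserved mass
    # (a true OOV word would still get zero here; full KN/MKN handle that at the lowest level)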

CHAPTER 2

Background

In this chapter we provide a brief overview of the foundations for the research in this thesis, including statistical (n-gram) language models, non-parametric Bayesian language models, and compressed data structures.

We start by showing how the probability of a sequence is estimated under finite-order n-gram language models, explaining two of the widely used approaches, Kneser-Ney (KN) and Modified Kneser-Ney (MKN). Then, we cover various approaches proposed in the literature that aim to improve the scalability of n-gram models. Next, we describe the non-parametric Bayesian language models which are based on the Hierarchical Pitman-Yor Process (HPYP). We explain how learning and inference are done under ∞-order HPYP models and provide the link between the KN and HPYP language models. We extend the ∞-order sequential modelling of the HPYP in Chapter 3, and illustrate its NLP applications beyond the language modelling problem. In the last section we describe the compressed algorithmic framework required to scale the models to the ∞-order setting with large corpora. We explain some of the basic data structures, e.g. suffix arrays and trees, as well as more advanced data structures such as wavelet trees and compressed suffix arrays and trees, and the required operations, e.g. rank, select and backward-search, that allow us to represent the data in a compact form while still supporting various search operations very efficiently. This framework is the basis of Chapters 4 and 5.
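As a pointer to the HPYP machinery developed in this chapter, the following minimal sketch (my own, not code from the thesis) computes the Pitman-Yor predictive probability at a single restaurant u directly from the customer and table counts n_w^u and t_w^u of the List of Symbols; in the hierarchical model the base distribution is the predictive distribution of the parent restaurant (the next-shorter context), which yields the KN-style interpolation discussed in Chapter 1:

    def pyp_predictive(word, n_w, t_w, n_tot, t_tot, d, theta, base_prob):
        # Predictive probability of `word` under one Pitman-Yor restaurant:
        #   P(word) = (n_w - d * t_w) / (theta + n_tot)
        #           + (theta + d * t_tot) / (theta + n_tot) * base_prob(word)
        # with discount d and concentration theta.
        seated = max(n_w.get(word, 0) - d * t_w.get(word, 0), 0.0)
        reserved = (theta + d * t_tot) * base_prob(word)
        return (seated + reserved) / (theta + n_tot)

    # Hypothetical counts for one context, with a uniform base over 5 word types.
    n_w = {"ball": 3, "Force": 1}
    t_w = {"ball": 2, "Force": 1}
    uniform = lambda w: 1.0 / 5
    for w in ("ball", "Force", "red"):
        print(w, pyp_predictive(w, n_w, t_w, 4, 3, d=0.8, theta=0.5, base_prob=uniform))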


More information

10/17/04. Today s Main Points

10/17/04. Today s Main Points Part-of-speech Tagging & Hidden Markov Model Intro Lecture #10 Introduction to Natural Language Processing CMPSCI 585, Fall 2004 University of Massachusetts Amherst Andrew McCallum Today s Main Points

More information

Natural Language Processing : Probabilistic Context Free Grammars. Updated 5/09

Natural Language Processing : Probabilistic Context Free Grammars. Updated 5/09 Natural Language Processing : Probabilistic Context Free Grammars Updated 5/09 Motivation N-gram models and HMM Tagging only allowed us to process sentences linearly. However, even simple sentences require

More information

The Infinite Markov Model

The Infinite Markov Model The Infinite Markov Model Daichi Mochihashi NTT Communication Science Laboratories, Japan daichi@cslab.kecl.ntt.co.jp NIPS 2007 The Infinite Markov Model (NIPS 2007) p.1/20 Overview ɛ ɛ is of will is of

More information

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly

More information

Probabilistic Context-Free Grammars. Michael Collins, Columbia University

Probabilistic Context-Free Grammars. Michael Collins, Columbia University Probabilistic Context-Free Grammars Michael Collins, Columbia University Overview Probabilistic Context-Free Grammars (PCFGs) The CKY Algorithm for parsing with PCFGs A Probabilistic Context-Free Grammar

More information

Expectation Maximization (EM)

Expectation Maximization (EM) Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted

More information

Infinite Hierarchical Hidden Markov Models

Infinite Hierarchical Hidden Markov Models Katherine A. Heller Engineering Department University of Cambridge Cambridge, UK heller@gatsby.ucl.ac.uk Yee Whye Teh and Dilan Görür Gatsby Unit University College London London, UK {ywteh,dilan}@gatsby.ucl.ac.uk

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Hidden Markov Models in Language Processing

Hidden Markov Models in Language Processing Hidden Markov Models in Language Processing Dustin Hillard Lecture notes courtesy of Prof. Mari Ostendorf Outline Review of Markov models What is an HMM? Examples General idea of hidden variables: implications

More information

Advanced Natural Language Processing Syntactic Parsing

Advanced Natural Language Processing Syntactic Parsing Advanced Natural Language Processing Syntactic Parsing Alicia Ageno ageno@cs.upc.edu Universitat Politècnica de Catalunya NLP statistical parsing 1 Parsing Review Statistical Parsing SCFG Inside Algorithm

More information

Probabilistic Context Free Grammars. Many slides from Michael Collins

Probabilistic Context Free Grammars. Many slides from Michael Collins Probabilistic Context Free Grammars Many slides from Michael Collins Overview I Probabilistic Context-Free Grammars (PCFGs) I The CKY Algorithm for parsing with PCFGs A Probabilistic Context-Free Grammar

More information

ANLP Lecture 6 N-gram models and smoothing

ANLP Lecture 6 N-gram models and smoothing ANLP Lecture 6 N-gram models and smoothing Sharon Goldwater (some slides from Philipp Koehn) 27 September 2018 Sharon Goldwater ANLP Lecture 6 27 September 2018 Recap: N-gram models We can model sentence

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Language Model. Introduction to N-grams

Language Model. Introduction to N-grams Language Model Introduction to N-grams Probabilistic Language Model Goal: assign a probability to a sentence Application: Machine Translation P(high winds tonight) > P(large winds tonight) Spelling Correction

More information

The Noisy Channel Model and Markov Models

The Noisy Channel Model and Markov Models 1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013 More on HMMs and other sequence models Intro to NLP - ETHZ - 18/03/2013 Summary Parts of speech tagging HMMs: Unsupervised parameter estimation Forward Backward algorithm Bayesian variants Discriminative

More information

Probabilistic Language Modeling

Probabilistic Language Modeling Predicting String Probabilities Probabilistic Language Modeling Which string is more likely? (Which string is more grammatical?) Grill doctoral candidates. Regina Barzilay EECS Department MIT November

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Nonparametric Bayesian Models --Learning/Reasoning in Open Possible Worlds Eric Xing Lecture 7, August 4, 2009 Reading: Eric Xing Eric Xing @ CMU, 2006-2009 Clustering Eric Xing

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) Latent Dirichlet Allocation (LDA) A review of topic modeling and customer interactions application 3/11/2015 1 Agenda Agenda Items 1 What is topic modeling? Intro Text Mining & Pre-Processing Natural Language

More information

CSC401/2511 Spring CSC401/2511 Natural Language Computing Spring 2019 Lecture 5 Frank Rudzicz and Chloé Pou-Prom University of Toronto

CSC401/2511 Spring CSC401/2511 Natural Language Computing Spring 2019 Lecture 5 Frank Rudzicz and Chloé Pou-Prom University of Toronto CSC401/2511 Natural Language Computing Spring 2019 Lecture 5 Frank Rudzicz and Chloé Pou-Prom University of Toronto Revisiting PoS tagging Will/MD the/dt chair/nn chair/?? the/dt meeting/nn from/in that/dt

More information

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging 10-708: Probabilistic Graphical Models 10-708, Spring 2018 10 : HMM and CRF Lecturer: Kayhan Batmanghelich Scribes: Ben Lengerich, Michael Kleyman 1 Case Study: Supervised Part-of-Speech Tagging We will

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information

Bayesian Nonparametrics

Bayesian Nonparametrics Bayesian Nonparametrics Peter Orbanz Columbia University PARAMETERS AND PATTERNS Parameters P(X θ) = Probability[data pattern] 3 2 1 0 1 2 3 5 0 5 Inference idea data = underlying pattern + independent

More information

Sparse Forward-Backward for Fast Training of Conditional Random Fields

Sparse Forward-Backward for Fast Training of Conditional Random Fields Sparse Forward-Backward for Fast Training of Conditional Random Fields Charles Sutton, Chris Pal and Andrew McCallum University of Massachusetts Amherst Dept. Computer Science Amherst, MA 01003 {casutton,

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Probabilistic Context Free Grammars. Many slides from Michael Collins and Chris Manning

Probabilistic Context Free Grammars. Many slides from Michael Collins and Chris Manning Probabilistic Context Free Grammars Many slides from Michael Collins and Chris Manning Overview I Probabilistic Context-Free Grammars (PCFGs) I The CKY Algorithm for parsing with PCFGs A Probabilistic

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Variational Decoding for Statistical Machine Translation

Variational Decoding for Statistical Machine Translation Variational Decoding for Statistical Machine Translation Zhifei Li, Jason Eisner, and Sanjeev Khudanpur Center for Language and Speech Processing Computer Science Department Johns Hopkins University 1

More information

Inference in Explicit Duration Hidden Markov Models

Inference in Explicit Duration Hidden Markov Models Inference in Explicit Duration Hidden Markov Models Frank Wood Joint work with Chris Wiggins, Mike Dewar Columbia University November, 2011 Wood (Columbia University) EDHMM Inference November, 2011 1 /

More information

Speech Recognition Lecture 5: N-gram Language Models. Eugene Weinstein Google, NYU Courant Institute Slide Credit: Mehryar Mohri

Speech Recognition Lecture 5: N-gram Language Models. Eugene Weinstein Google, NYU Courant Institute Slide Credit: Mehryar Mohri Speech Recognition Lecture 5: N-gram Language Models Eugene Weinstein Google, NYU Courant Institute eugenew@cs.nyu.edu Slide Credit: Mehryar Mohri Components Acoustic and pronunciation model: Pr(o w) =

More information

Spatial Normalized Gamma Process

Spatial Normalized Gamma Process Spatial Normalized Gamma Process Vinayak Rao Yee Whye Teh Presented at NIPS 2009 Discussion and Slides by Eric Wang June 23, 2010 Outline Introduction Motivation The Gamma Process Spatial Normalized Gamma

More information

Word Alignment for Statistical Machine Translation Using Hidden Markov Models

Word Alignment for Statistical Machine Translation Using Hidden Markov Models Word Alignment for Statistical Machine Translation Using Hidden Markov Models by Anahita Mansouri Bigvand A Depth Report Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of

More information

Parsing with Context-Free Grammars

Parsing with Context-Free Grammars Parsing with Context-Free Grammars Berlin Chen 2005 References: 1. Natural Language Understanding, chapter 3 (3.1~3.4, 3.6) 2. Speech and Language Processing, chapters 9, 10 NLP-Berlin Chen 1 Grammars

More information

Penn Treebank Parsing. Advanced Topics in Language Processing Stephen Clark

Penn Treebank Parsing. Advanced Topics in Language Processing Stephen Clark Penn Treebank Parsing Advanced Topics in Language Processing Stephen Clark 1 The Penn Treebank 40,000 sentences of WSJ newspaper text annotated with phrasestructure trees The trees contain some predicate-argument

More information

28 : Approximate Inference - Distributed MCMC

28 : Approximate Inference - Distributed MCMC 10-708: Probabilistic Graphical Models, Spring 2015 28 : Approximate Inference - Distributed MCMC Lecturer: Avinava Dubey Scribes: Hakim Sidahmed, Aman Gupta 1 Introduction For many interesting problems,

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Text Mining. March 3, March 3, / 49

Text Mining. March 3, March 3, / 49 Text Mining March 3, 2017 March 3, 2017 1 / 49 Outline Language Identification Tokenisation Part-Of-Speech (POS) tagging Hidden Markov Models - Sequential Taggers Viterbi Algorithm March 3, 2017 2 / 49

More information

Bayesian Networks BY: MOHAMAD ALSABBAGH

Bayesian Networks BY: MOHAMAD ALSABBAGH Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional

More information

A Hierarchical Bayesian Language Model based on Pitman-Yor Processes

A Hierarchical Bayesian Language Model based on Pitman-Yor Processes A Hierarchical Bayesian Language Model based on Pitman-Yor Processes Yee Whye Teh School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543. tehyw@comp.nus.edu.sg Abstract

More information

Language Modeling. Michael Collins, Columbia University

Language Modeling. Michael Collins, Columbia University Language Modeling Michael Collins, Columbia University Overview The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques: Linear interpolation Discounting

More information

Roger Levy Probabilistic Models in the Study of Language draft, October 2,

Roger Levy Probabilistic Models in the Study of Language draft, October 2, Roger Levy Probabilistic Models in the Study of Language draft, October 2, 2012 224 Chapter 10 Probabilistic Grammars 10.1 Outline HMMs PCFGs ptsgs and ptags Highlight: Zuidema et al., 2008, CogSci; Cohn

More information

Dynamic Data Modeling, Recognition, and Synthesis. Rui Zhao Thesis Defense Advisor: Professor Qiang Ji

Dynamic Data Modeling, Recognition, and Synthesis. Rui Zhao Thesis Defense Advisor: Professor Qiang Ji Dynamic Data Modeling, Recognition, and Synthesis Rui Zhao Thesis Defense Advisor: Professor Qiang Ji Contents Introduction Related Work Dynamic Data Modeling & Analysis Temporal localization Insufficient

More information

An Overview of Nonparametric Bayesian Models and Applications to Natural Language Processing

An Overview of Nonparametric Bayesian Models and Applications to Natural Language Processing An Overview of Nonparametric Bayesian Models and Applications to Natural Language Processing Narges Sharif-Razavian and Andreas Zollmann School of Computer Science Carnegie Mellon University Pittsburgh,

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

A DOP Model for LFG. Rens Bod and Ronald Kaplan. Kathrin Spreyer Data-Oriented Parsing, 14 June 2005

A DOP Model for LFG. Rens Bod and Ronald Kaplan. Kathrin Spreyer Data-Oriented Parsing, 14 June 2005 A DOP Model for LFG Rens Bod and Ronald Kaplan Kathrin Spreyer Data-Oriented Parsing, 14 June 2005 Lexical-Functional Grammar (LFG) Levels of linguistic knowledge represented formally differently (non-monostratal):

More information

Decoding and Inference with Syntactic Translation Models

Decoding and Inference with Syntactic Translation Models Decoding and Inference with Syntactic Translation Models March 5, 2013 CFGs S NP VP VP NP V V NP NP CFGs S NP VP S VP NP V V NP NP CFGs S NP VP S VP NP V NP VP V NP NP CFGs S NP VP S VP NP V NP VP V NP

More information

Chapter 8 PROBABILISTIC MODELS FOR TEXT MINING. Yizhou Sun Department of Computer Science University of Illinois at Urbana-Champaign

Chapter 8 PROBABILISTIC MODELS FOR TEXT MINING. Yizhou Sun Department of Computer Science University of Illinois at Urbana-Champaign Chapter 8 PROBABILISTIC MODELS FOR TEXT MINING Yizhou Sun Department of Computer Science University of Illinois at Urbana-Champaign sun22@illinois.edu Hongbo Deng Department of Computer Science University

More information

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs (based on slides by Sharon Goldwater and Philipp Koehn) 21 February 2018 Nathan Schneider ENLP Lecture 11 21

More information

Forgetting Counts : Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process

Forgetting Counts : Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process : Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process Nicholas Bartlett David Pfau Frank Wood Department of Statistics Center for Theoretical Neuroscience Columbia University, 2960

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

Sequences and Information

Sequences and Information Sequences and Information Rahul Siddharthan The Institute of Mathematical Sciences, Chennai, India http://www.imsc.res.in/ rsidd/ Facets 16, 04/07/2016 This box says something By looking at the symbols

More information

Lecture 13: Structured Prediction

Lecture 13: Structured Prediction Lecture 13: Structured Prediction Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS6501: NLP 1 Quiz 2 v Lectures 9-13 v Lecture 12: before page

More information

N-gram Language Model. Language Models. Outline. Language Model Evaluation. Given a text w = w 1...,w t,...,w w we can compute its probability by:

N-gram Language Model. Language Models. Outline. Language Model Evaluation. Given a text w = w 1...,w t,...,w w we can compute its probability by: N-gram Language Model 2 Given a text w = w 1...,w t,...,w w we can compute its probability by: Language Models Marcello Federico FBK-irst Trento, Italy 2016 w Y Pr(w) =Pr(w 1 ) Pr(w t h t ) (1) t=2 where

More information

Aspects of Tree-Based Statistical Machine Translation

Aspects of Tree-Based Statistical Machine Translation Aspects of Tree-Based Statistical Machine Translation Marcello Federico Human Language Technology FBK 2014 Outline Tree-based translation models: Synchronous context free grammars Hierarchical phrase-based

More information

CMPT-825 Natural Language Processing. Why are parsing algorithms important?

CMPT-825 Natural Language Processing. Why are parsing algorithms important? CMPT-825 Natural Language Processing Anoop Sarkar http://www.cs.sfu.ca/ anoop October 26, 2010 1/34 Why are parsing algorithms important? A linguistic theory is implemented in a formal system to generate

More information