Scalable Non-Markovian Sequential Modelling for Natural Language Processing


Scalable Non-Markovian Sequential Modelling for Natural Language Processing

by

Ehsan Shareghi Nojehdeh

Thesis submitted for the fulfillment of the requirements for the degree of
Doctor of Philosophy

Faculty of Information Technology
Monash University

October, 2017

To Elham

The ancient pond
a frog jumps in
the splash of water

Matsuo Bashō

Copyright by Ehsan Shareghi Nojehdeh 2017

Notice

Except as provided in the Copyright Act 1968, this thesis may not be reproduced in any form without the written permission of the author. I certify that I have made all reasonable efforts to secure copyright permissions for third-party content included in this thesis and have not knowingly added copyright content to my work without the owner's permission.

Scalable Non-Markovian Sequential Modelling for Natural Language Processing

Declaration

I declare that this thesis is my own work and has not been submitted in any form for another degree or diploma at any university or other institute of tertiary education. Information derived from the published and unpublished work of others has been acknowledged in the text and a list of references is given.

Ehsan Shareghi Nojehdeh
October 12, 2017

ABSTRACT

Scalable Non-Markovian Sequential Modelling for Natural Language Processing
Ehsan Shareghi Nojehdeh
Monash University, 2017

Markov models are popular means of modelling the underlying structure of natural language, which is naturally represented as sequences and trees. The locality assumption made in low-order Markov models such as n-gram language models is limiting, because if the data generation process exhibits long-range dependencies, modelling the distribution well requires consideration of long-range context. On the other hand, higher-order Markov, or infinite-order Non-Markovian[1], models pose computational and statistical challenges during learning and inference. In particular, under the large data setting their exponential number of parameters often results in estimation and sampler mixing issues, while representing the structure of the model, its sufficient statistics, or the sampler state can quickly become computationally inefficient and impractical.

In order to exploit global context, we propose a novel Non-Markovian model based on the Hierarchical Nonparametric Bayesian paradigm to incorporate potentially infinite-length context. We demonstrate better performance compared with finite-order Markov models on various structured and sequence prediction tasks. To address the computational complexity issues inherent in the nature of Non-Markovian models, we propose a new modelling framework based on lossless compressed data structures to represent the required sufficient statistics of the model compactly. This allows infinite-depth Hierarchical Nonparametric Bayesian models to be represented in a space proportional to the size of the input data, while enabling an efficient inference mechanism to be developed.

Using our compressed framework to represent the models, we explore its scalability under two Non-Markovian language modelling settings, using large-scale data and infinite context. First, we model the Kneser-Ney family of language models and illustrate that our approach is several orders of magnitude more memory efficient than the state-of-the-art, in training and

testing, while it is highly competitive in terms of the runtimes of both phases. When memory is a limiting factor at query time, our approach is orders of magnitude faster than the state-of-the-art. Second, we consider the full Hierarchical Nonparametric Bayesian language model and propose a fast and memory-efficient approximate inference scheme. Compared with the existing Hierarchical Nonparametric Bayesian language models, our approach has a memory footprint several orders of magnitude lower, allowing us to apply it to data sizes more than one hundred times larger than the largest data used in previous models. This is achieved while avoiding potential mixing issues and consistently outperforming the state-of-the-art count-based Kneser-Ney family of language models by a significant margin.

The results of this work, as well as being significant for sequence and structured prediction tasks in NLP, point to a new direction of developing complex but compact statistical models that can scale up to very large and potentially real-world datasets without the need for compute clusters.

[1] We use the term Non-Markovian to refer to Markov models where the order of the model is unbounded.

Acknowledgments

I will forever be thankful to my advisers Dr. Gholamreza Haffari, Dr. Trevor Cohn, and Prof. Ann Nicholson. I consider myself extraordinarily lucky to have been given a chance to learn and develop under their careful and patient guidance. I thank my panel members Prof. Wray Buntine and Prof. Ingrid Zukerman, and my thesis examiners Dr. Brian Roark (Google) and Dr. Mark Dras (Macquarie University) for their feedback on this thesis. My sincere thanks go to Prof. Buntine for his generosity in sharing his insights regarding nonparametric Bayesian modelling and inference. I would also like to thank Philip Chan for the support to run many resource-intensive experiments on Monash Advanced Research Computing Hybrid (MonARCH) servers, and Dr. Matthias Petri for helping me to use the succinct data structure library (SDSL). I would like to extend my appreciation to Danette Deriane for her unbounded support as the Graduate Research Student coordinator, National ICT Australia (NICTA) for contributing to my scholarship, and IBM Research Australia for giving me the opportunity to work on interesting projects during my internship.

A special feeling of gratitude goes to my wife Elham, who has been with me all these years and has made them the best years of my life. My dear parents Fatemeh and Yusef, and my brother Aydin who have always been supportive and encouraging over many years, I cannot thank you enough. Of course no acknowledgments would be complete without giving thanks to wonderful colleagues and friends. I would like to thank my colleagues at Monash, Mohammad Shamsur Rahman, Bo Chen, Sameen Maruf, Poorya ZareMoodi, Ming Liu, Narjes Askarian, and He Zhao for their moral support and kindness. My dear friends Nader Chmait, James Collier, Parthan Kasarapu, Han Duy Phan, Omid Zanganeh, Quan Hung Tran, Andisheh Partovi, Kai Siong Yow, Milad Chenaghlou, Xuhui Zhang, Ying Yang, and Dinithi Sumanaweera, you have my deepest feeling of gratitude.

Publications

The publications arising from my thesis are:

(Published) E. Shareghi, G. Haffari, T. Cohn, A. Nicholson, "Structured Prediction of Sequences and Trees using Infinite Contexts", Proceedings of the European Conference on Machine Learning (ECML), 2015, Porto, Portugal.

(Published) E. Shareghi, M. Petri, G. Haffari, T. Cohn, "Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees", Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), 2015, Lisbon, Portugal.

(Published) E. Shareghi, M. Petri, G. Haffari, T. Cohn, "Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees", Transactions of the Association for Computational Linguistics (TACL), 2016.

(Published) E. Shareghi, T. Cohn, G. Haffari, "Richer Interpolative Smoothing Based on Modified Kneser-Ney Language Modeling", Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), 2016, Austin, USA.

(Published) E. Shareghi, G. Haffari, T. Cohn, "Compressed Nonparametric Language Modelling", Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2017, Melbourne, Australia.

Contents

Abstract
Acknowledgments
Publications
List of Tables
List of Figures
List of Algorithms
List of Abbreviations
List of Symbols

1 Introduction
    Applications of non-Markovian sequence models for structure prediction in NLP
    Scalable non-Markovian sequence models
    Scalable Bayesian non-Markovian sequence models
    Thesis outline

2 Background
    Statistical Language Modelling
        Finite-order Markov (n-gram) Language Models

        Statistical Sparsity and Smoothing
            Kneser-Ney (KN) smoothing
            Modified Kneser-Ney (MKN) smoothing
        Existing approaches for n-gram language modelling
    Nonparametric Bayesian Language Modelling
        Hierarchical Pitman-Yor Process (HPYP) language model
        The relation between KN and HPYP language models
        Inference in HPYP
    Markovian vs. Non-Markovian Language Models
    Compressed Data Structures
        Basic Concepts
        Compressed Suffix Arrays (CSA) and Compressed Suffix Trees (CST)
        Compressed Vectors
    Summary

3 Applications of Non-Markovian Sequential Modelling for Structured Prediction
    Probabilistic Context Free Grammars and Extensions
    Modelling Structures as Sequences
        Minimal Assumption and MAP Parameter Estimation
    Prediction
        A* search
        MCMC sampling
    Experiments
        Morphological parsing
        Syntactic parsing
        Part-of-Speech tagging
        Analysis
    Summary

4 Scalable Non-Markovian Sequential Modelling
    Overview of KN and MKN language models
    Basic compressed Kneser-Ney language models
        Basic dual Cst approach

        Improved single Cst approach
    Extending KN to MKN and further speed-up
        Efficient precomputation and compressed counts
        Computing MKN probability
    Experiments
        Perplexity
        Benchmarking runtime and memory requirements
        Character-level language modelling
    Richer Modified Kneser-Ney Language Model
        Generalized MKN
        Experiments
    Summary

5 Scalable Bayesian Non-Markovian Sequential Modelling
    Hierarchical Pitman-Yor Process Language Model
    Compressed HPYP LM Representation
    Efficient Approximate Inference
        Joint distribution of table and customer counts n_w^u, t_w^u
        Sampling under the joint distribution of n_w^u, t_w^u
        Sampling concentration θ_u and discount d_u parameters
        Sampling under Cst mechanics
    Experiments
        Perplexity
        Memory and time
        Analysis
    Summary

6 Conclusions and Future Directions
    Future directions for the research presented in this thesis

Appendix A Introducing auxiliary variables into the joint distribution
Appendix B Sampling auxiliary variables

Appendix C Sampling discount parameter

Bibliography

List of Tables

2.1 The required quantities for computing a 4-gram probability under KN
The required quantities for computing a 4-gram probability under MKN
One-to-One mapping of interpolative smoothing under KN and MKN
One-to-One mapping of interpolative smoothing under KN and HPYP
Datasets statistics for parsing and part-of-speech tagging
Morphological parsing results
Syntactic parsing results
Part-of-speech tagging results
Comparison between predicted and gold standard grammar patterns
One-to-One mapping of interpolative smoothing under KN and MKN
Summary of Csa and Cst functions used and their time complexity of inference
Perplexity results on 32 GiB of text for various languages
Perplexity on various data sizes, domains, and n-gram orders
Perplexity results for 1 billion word corpus
Perplexity for various n-gram orders and discount parameters
One-to-One mapping of interpolative smoothing under KN and HPYP
Complexities of computing critical sampling quantities
Perplexity of KN, MKN, HPYP-based smoothing on various data sizes
Perplexity of our approach under various settings

List of Figures

1.1 Examples of syntax tree, part-of-speech tag sequence, and language model
A comparison of various n-gram modelling approaches
Example of the smoothing hierarchy for an interpolative LM of depth
Examples of Suffix Tree, Suffix Array, Burrows-Wheeler Transform, Wavelet Tree
Procedure for generating Burrows-Wheeler Transformation
Example of a Backward-Search procedure
RRR compressed bit vector
Directly Addressable Variable-Length Codes (DAC) compressed integer vector
Structure and sequence prediction tasks
Examples of CFG refinements
Smoothing hierarchy for infinite-order parsing
Directions of Hierarchical Pitman-Yor Process smoothing
Pitman-Yor generated samples vs. data distribution
Example of a hyper-graph representation of the search space
Example of a binarized morphological tree
Presenting a part-of-speech Hidden-Markov-Model via our representation
Quantities required for computing KN using the forward and backward Csts
Examples of character-level data structures
Time breakdown of Cst-based procedures required for KN

4.4 Time breakdown for KN and MKN with/out precomputation
Distribution of precomputed values
Graphical representation of our MKN computation
Memory and time comparison - Cst, KenLM, SRILM on a small set
Memory and time comparison - Cst and KenLM
Percentage of perplexity reduction for various European languages
Examples of perplexity, discount values, and average hit length
Perplexity vs. number of discount parameters
Hierarchy of Chinese Restaurants
Direction of search, interpolation, and sampling for "abc"
Memory and time comparison - Sequence Memoizer and our approach

List of Algorithms

1 Gibbs Sampler Algorithm of Teh (2006a)
2 Compute one-sided occurrence counts, N_{1+}(·α) or N_{1+}(α·), for pattern α
3 Two-sided occurrence counts, N_{1+}(·α·)
4 KN probability P(w_k | w_{k-(n-1)}^{k-1})
5 KN probability P(w_k | w_{k-(n-1)}^{k-1}) using a single Cst
6 N_{1+}(·α·), using forward Cst
7 Compute backward occurrence counts, N_{1+}(·α), using only forward Cst
8 N_{1,2,3+}(α·) or N_{1,2,3+}(·α)
9 Precomputing expensive counts N_{1,2}(α·), N_{1+}(·α·), N_{1+}(·α), N_{1,2}(·α)
10 MKN probability P(w_i | w_{i-(n-1)}^{i-1})
11 Compute discounts
12 Gibbs Sampler for η, γ

List of Abbreviations

Burrows-Wheeler Transformation (BWT)
Chinese Restaurant Process (CRP)
Context Free Grammar (CFG)
Compressed Suffix Array (CSA)
Compressed Suffix Tree (CST)
Dirichlet Process (DP)
Gibibyte (GiB)
Hidden Markov Model (HMM)
Hierarchical Chinese Restaurant Process (HCRP)
Hierarchical Dirichlet Process (HDP)
Hierarchical Pitman-Yor Process (HPYP)
Kullback-Leibler divergence (KL divergence)
Kneser-Ney (KN)
Language Model/Modelling (LM)
Maximum A Posteriori (MAP)
Maximum Likelihood Estimation (MLE)
Mebibyte (MiB)
Modified Kneser-Ney (MKN)
Markov Chain Monte Carlo (MCMC)
Natural Language Processing (NLP)
Out-of-Vocabulary (OOV)
Part-of-Speech (POS)
Pitman-Yor Process (PYP)

Probabilistic Context Free Grammar (PCFG)
Suffix Array (SA)
Suffix Tree (ST)
Wavelet Tree (WT)

List of Symbols

n: order of the Markov model
σ: alphabet/vocabulary set
w: word
r: grammar rule
w_1^N: a sequence w_1 w_2 ... w_N of length N
T: text in Language Modelling
T: syntax tree in Syntactic parsing
u: overloaded: restaurant/distribution in CRP/PYP, and context in NLP tasks
c(α): frequency of α
ε: null context
d_u: discount parameter of Pitman-Yor Process u
θ_u: concentration parameter of Pitman-Yor Process u
n_w^u: number of customers in restaurant u having dish w
n_·^u: total number of customers in restaurant u
t_w^u: number of tables in restaurant u having dish w
t_·^u: total number of tables in restaurant u
I_w^u: arrangement of n_w^u customers around t_w^u tables in restaurant u
S(n, t): Stirling number of the second kind
S_d(n, t): generalized Stirling number
(a|b)_c: Pochhammer symbol
(a)!: factorial function
Γ(a): Gamma function

CHAPTER 1

Introduction

Generating an utterance to convey a message involves, at least, the coordination between two phenomena: choosing meaningful words and phrases, and placing them in a correct grammatical structure. This is typically orchestrated in an incremental procedure which generates the utterance from left to right. An important property of this incremental procedure is the different layers of dependency on which each step of the generation process relies. For example, in a sentence, choosing a meaningful verb depends on the sequence of previously generated words while satisfying the subject-verb agreement that exists between a verb and its subject. Another example, among many other effects, is the number agreement between a determiner and its noun, which can be separated by an arbitrary number of adjectives. In fact, the notion of dependency is embodied in various aspects of human language, from its recursive grammatical structure (Chomsky, 1959) to selecting the next word of a sentence, which has a strong statistical dependency on the previous words in the utterance (Shannon, 1951). Methods for automatically capturing this notion have attracted a significant amount of attention from the statistical modelling perspective.

A family of statistical models designed to capture language dependencies are Markov models, where the dependency is assumed to only exist within a short span. The notion of order in Markov models is analogous to the range of dependency considered by these models, and typically Markov models are used in a low-order setting. While the locality assumption in low-order Markov models often fails to capture long-range dependencies, these models are

still popular due to their mathematical simplicity in learning and prediction. An example of such a dependency, where a finite-order Markov model fails, is the determiner-noun agreement which can be separated by any number (infinite in theory) of adjectives, e.g., a ball, a red ball, a red bouncy ball, etc. In fact, human language is far more complex than can be captured by low-order Markov models (Good, 1969), and an ideal model should have an infinite memory to completely capture the past, or have a selection mechanism to skip the uninformative segments of the past.

One natural approach would be to use a Markov model with a very high order. However, high-order Markov models can quickly become impractical as the size of the Markov memory n increases: the number of parameters, σ^(n+1), grows exponentially with n in theory (in practice the growth is still unmanageable for orders 7-10 and above), where σ is the number of model parameters when n = 0 (i.e., the size of the vocabulary). An approximation to high-order Markov models are variable-order Markov models (Ron et al., 1994), which were proposed to mitigate this problem by allowing a model with a dynamic range of memories. While these models are more robust than low-order Markov models in capturing long-range dependencies, they still rely on pruning a space of possibilities which grows exponentially with n. Also, the pruning requires careful threshold tuning as well as computing a statistical measure before and after each pruning decision, a process which is prone to overfitting the training data (Mochihashi and Sumita, 2007). Another limitation of these models is that they assume a most-recent bias (the most recent element is more important), which is sensible but may not be universally appropriate.

In the context of large datasets, having scalable models is of central importance. However, all the aforementioned issues with high or variable-order Markov models are exacerbated as the size of the data grows. This is due both to the growth of the number of parameters, and the computational complexity of collecting the sufficient statistics for estimating the model parameters. Therefore, Markov models have only been successfully applied either in the low-order setting on potentially large datasets, or in the high-order setting on small datasets.

In order to capture all dependency ranges, in this thesis we develop techniques around non-Markovian (which we refer to herein as ∞-order Markov) sequential models. The ∞-order in these models highlights their capacity to capture the various dependency ranges each component of an utterance exhibits, without imposing a fixed-range dependency on all its components.
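To make the parameter growth mentioned above concrete, the following back-of-the-envelope calculation (my own illustration, not taken from the thesis) counts the conditional probabilities an order-n model would need to store for a hypothetical vocabulary of 10,000 word types; the table outgrows any realistic memory budget within a handful of orders.

    # Illustration only: an order-n Markov model stores one conditional probability
    # per (length-n context, next word) pair, i.e. sigma**(n+1) parameters.
    sigma = 10_000  # hypothetical vocabulary size

    for n in range(6):
        print(f"order n={n}: {sigma ** (n + 1):.2e} parameters")
    # order n=0: 1.00e+04 parameters
    # ...
    # order n=5: 1.00e+24 parameters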

More formally, the main hypothesis of this thesis is that finite-order Markov models fail to capture the long-range dependencies that exist in human language, and we empirically examine and illustrate this claim. We illustrate that this is the case in various NLP tasks framed over sequences or tree structures. For instance, in the language modelling task we show that higher-order n-gram models result in better predictive accuracy compared to the widely used 5- and 6-gram models, and similarly for syntactic parsing incorporating longer-range context improves the performance. We propose ∞-order modelling frameworks which make no assumption about the range of the dependency that exists in the data and achieve significant performance improvements over low/finite-order models in various NLP tasks. We examine, for different NLP tasks, the effectiveness of various modelling, learning, and inference paradigms, from point estimation of model parameters via maximizing the likelihood (MLE) or the posterior (MAP), to the full Bayesian treatment where the entire distribution over the parameter space is considered.

Given a modest training corpus of only a few GiB, under any of the aforementioned modelling paradigms, representing the structure of an ∞-order model amounts to a significant memory usage which can quickly become impractical. This is due to the exponential growth of the number of model parameters (floats) as a function of n and data size. We propose a framework based on compressed data structures which operates on the compressed representation of the data, and hence keeps the memory usage of the modelling, learning, and inference steps independent of the order of the Markov model, and proportional to the size of the data.

Having lifted the memory requirement of model representation, we now turn to learning challenges. The very large (potentially infinite) space of parameters of ∞-order models must be estimated for use during prediction. This introduces both computational and statistical burdens in the learning (training) phase, which are also a function of the data size. Our proposed compressed framework is compact and supports efficient search for count-based quantities. We utilize the search efficiency of our framework via algorithmic optimizations, and adopt a lazy approach which skips parameter estimation during the training phase. Then, at the inference phase, given a query, various search operations are launched over the compressed representation of the data to extract the required quantities on-the-fly. This avoids the statistical and computational challenges of the learning phase, but demands a careful design of the inference algorithms.
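As a minimal sketch of this lazy, query-time strategy (my own illustration; the thesis uses compressed suffix trees, whereas the plain, uncompressed suffix array below is only a stand-in), the training phase builds nothing but an index of the text, and every count needed to answer a query is located by search at query time:

    import bisect

    def build_suffix_array(tokens):
        # Toy O(N^2 log N) construction; a compressed suffix tree/array is used in practice.
        return sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def frequency(tokens, sa, pattern):
        # All suffixes starting with `pattern` form a contiguous band of the suffix
        # array; binary search locates its boundaries, and the band width is c(pattern).
        pattern = list(pattern)
        prefixes = [tokens[i:i + len(pattern)] for i in sa]  # materialised only for clarity
        return bisect.bisect_right(prefixes, pattern) - bisect.bisect_left(prefixes, pattern)

    text = "a red ball a red bouncy ball a ball".split()
    sa = build_suffix_array(text)               # the only work done at "training" time
    print(frequency(text, sa, ("a", "red")))    # 2
    print(frequency(text, sa, ("ball",)))       # 3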

The final step is prediction, which involves two remaining issues: a statistical issue, namely accurate on-the-fly parameter estimation, and a computational issue, namely the time cost of parameter estimation and subsequent inference. These issues are functions of the data size and the choice of model. For example, the number of required parameters for an ∞-order model can still be tractable if a simple count ratio is required. However, for almost all the models that perform well, a naive on-the-fly estimation of the parameters will be too slow to be practical, and a fast but inexact parameter estimation will be too inaccurate to be useful. While our proposed solutions vary in details for different modelling paradigms, at their core they rely on the same concept. We utilize the search efficiency of our proposed compressed framework at inference time and propose algorithmic optimizations to collect all the required quantities on-the-fly and efficiently. To avoid the statistical issue of estimating an infinitely large parameter space, we limit the space to the smallest subset which includes the required parameters to answer a particular query.

In the following three sections, we provide more details about the contributions of this thesis and the way in which we framed and presented our research. We start, in Section 1.1, by looking at structure and sequence prediction tasks, and compare the performance improvements that ∞-order models offer over finite-order models. At their core, our proposed models are built on top of a sequence model, which allows a unified modelling and inference scheme to be developed for both sequential and non-sequential tasks. The aim is to verify the hypothesis of the thesis in various NLP tasks, namely syntactic parsing, part-of-speech tagging, and morphological parsing. In Section 1.2, we deal with training text corpora ranging from small to large in size and elaborate on the model representation, computational complexity, and statistical issues of the learning and inference phases. Having established in Section 1.1 that a sequence model can be extended for structure prediction, we focus on the inherently sequential task of language modelling, and propose our compressed framework along with the algorithmic optimizations required for fast prediction. While our chosen modelling paradigm in Section 1.2 is MLE, in Section 1.3 we turn to Bayesian modelling, which is extremely powerful but unpopular due to all the aforementioned issues. We adjust the compressed framework proposed in Section 1.2 and develop efficient inference algorithms that are fast and avoid common statistical issues of large Bayesian models.

1.1 Applications of non-Markovian sequence models for structure prediction in NLP

Many natural language processing tasks rely on learning and predicting linguistic structures (Smith, 2011). For instance, in Machine Translation, accurately identifying the language constituents, e.g. noun phrase (NP) and verb phrase (VP), is a key element for identifying the re-ordering needed when translating from one language to another. A prime example of linguistic structures is a syntax tree. In a syntax tree, inner nodes (non-terminals) are syntactic categories such as NP and VP, and leaves (terminals) are words, see Figure 1.1(a). The task of predicting a syntax tree is called syntactic parsing, and can be defined as finding one or more syntactic structures of a given sentence that can be generated with a particular grammar (Manning and Schütze, 1999).

The syntax tree of an utterance can be generated by combining a set of rules from a grammar, such as a context free grammar (CFG). A CFG is a 4-tuple G = (T, N, S, R), where T is a set of terminal symbols, N is a set of non-terminal symbols, S ∈ N is the distinguished root non-terminal, and R is a set of grammar rules. The grammar rules are often in Chomsky Normal Form (CNF), taking either the form A → B C or A → a, where A, B, C are non-terminals and a is a terminal. A Probabilistic CFG (PCFG) assigns a probability to each grammar rule, where Σ_{B,C} P(A → B C | A) = 1 and Σ_a P(A → a | A) = 1. Modelling and parsing are traditionally done via a PCFG and a dynamic programming algorithm (Cocke and Schwartz, 1970), respectively. The PCFG model can be considered as a Markov model, where the selection of the next grammar rule only depends on a single frontier constituent. Hence, PCFG parsing performs poorly: it lacks sensitivity to the context (both lexical and syntactic) in the tree and ignores the long-range dependencies that exist between the constituents. Many context-dependent extensions of PCFGs have been proposed (Johnson, 1998; Collins, 1999; Petrov et al., 2006; Johnson et al., 2007a; Liang et al., 2007; Finkel et al., 2007; Cohn et al., 2010), but they all consider short-range dependencies.[1] We relax the strong local Markov assumptions in the PCFG by increasing the order of the model to ∞, hence capturing phenomena outside of the local Markov context.

[1] Likewise, previous works on applying Markov models to part-of-speech tagging (see Figure 1.1(b)) either considered finite-order Markov models (Brants, 2000) or finite-order HMMs (Thede and Harper, 1999). In this section we focus on syntactic parsing; the details of part-of-speech tagging and morphological parsing are discussed in Chapter 3.
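The definitions above can be made concrete with a small example. The toy CNF grammar below (rules and probabilities are invented for illustration, loosely following the categories of Figure 1.1) checks the two normalization constraints on the rule probabilities:

    # Toy PCFG in Chomsky Normal Form; rules and probabilities are invented for
    # illustration. Each key is a non-terminal A; each value maps a right-hand side
    # (a pair of non-terminals, or a single terminal) to P(A -> rhs | A).
    pcfg = {
        "S":    {("NP", "VP"): 1.0},
        "NP":   {("DT", "NN"): 1.0},
        "VP":   {("VBZ", "ADJP"): 1.0},
        "ADJP": {("JJ", "PP"): 1.0},
        "PP":   {("IN", "NP"): 1.0},
        "DT":   {("The",): 0.6, ("a",): 0.4},
        "NN":   {("Force",): 0.5, ("ball",): 0.5},
        "VBZ":  {("is",): 1.0},
        "JJ":   {("strong",): 0.7, ("red",): 0.3},
        "IN":   {("with",): 1.0},
    }

    # Check sum_{B,C} P(A -> B C | A) = 1 for branching non-terminals and
    # sum_{a} P(A -> a | A) = 1 for pre-terminals.
    for lhs, rules in pcfg.items():
        assert abs(sum(rules.values()) - 1.0) < 1e-9, lhs
    print("all rule distributions sum to one")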

Figure 1.1: (a) Syntax tree and (b) part-of-speech tags for "The Force is strong with you"; (c) predicting the next word in language modelling. In the prediction step, we condition each decision, denoted by gray boxes, on the full chain of ancestors (context), denoted by green dashed lines.

Our model conditions the generation of a rule in a tree on its unbounded history, i.e., its ancestors on the path towards the root of the tree. As illustrated in Figure 1.1, predictions in syntactic parsing and part-of-speech tagging can both be framed as instances of the language modelling problem (sequence prediction), Figure 1.1(c), where a single prediction depends on a chain of previously generated rules or words. For instance, in syntactic parsing (Figure 1.1(a)) the green dashed line marks the chain of ancestor rules which appear before expanding NP to PRP. The same concept is illustrated for part-of-speech tagging in Figure 1.1(b).

Therefore, we frame all these tasks as a sequence prediction problem and propose a non-Markovian model based on ∞-order sequence models (Gasthaus and Teh, 2010; Wood et al., 2011) for predicting latent linguistic structures, such as syntax trees or part-of-speech tags. We show that our sequential modelling approach can be applied to various structure prediction tasks in NLP, and propose effective algorithms to tackle the significant learning and inference challenges posed by the infinite memory. More specifically, we propose an infinite-memory hierarchical Bayesian non-parametric model for the generation of linguistic utterances and their corresponding structure (e.g., the sequence of POS tags or syntax trees). Our model conditions each decision in a tree-generating process on an unbounded context consisting of the vertical chain of its ancestors, in the same

way that infinite sequence models (e.g., ∞-gram language models) condition on an unbounded window of linear context (Mochihashi and Sumita, 2007; Wood et al., 2009). Learning in this model is particularly challenging due to the large space of contexts and the corresponding data sparsity. For this reason, predictive distributions associated with contexts are smoothed using distributions for successively smaller contexts via a hierarchical Bayesian model. The infinite context makes it impossible to directly apply dynamic programming for structure prediction. We present two inference algorithms, based on A* and Markov Chain Monte Carlo (MCMC), for predicting the best structure for a given input utterance, which lead to performance improvements over the finite-order Markov models.

As explained, many other fundamental NLP tasks can be framed as a sequence prediction problem, of which language modelling is a prime example. This motivates our research in developing a scalable ∞-order language modelling framework. In the next two sections, we briefly overview the shortcomings of existing approaches in the ∞-order setup, and propose our compressed framework for ∞-order modelling. We first, in Section 1.2, establish means of efficient ∞-order modelling and prediction under MLE, and then, in Section 1.3, extend our framework to a fully Bayesian paradigm.

1.2 Scalable non-Markovian sequence models

Language models (LMs), as illustrated in Figure 1.1(c), are critical components in many modern NLP systems, including automatic speech recognition (Rabiner and Juang, 1993). The most widely used LMs are non-Bayesian n-gram models (Chen and Goodman, 1999), which follow a Markov assumption and decompose the probability of an utterance into conditional probabilities of words given a finite-size context. Three sources of performance improvements for n-gram models are the use of smoothing techniques (Chen and Goodman, 1999), higher-order models (Wood et al., 2009), and the inclusion of more training data (Buck et al., 2014). The most popular LM toolkits, SRILM (Stolcke et al., 2011) and KenLM (Heafield, 2011), are based on explicit storage of n-grams and their probabilities using various smoothing techniques. However, depending on the order and the training corpus size, a typical model may contain as many as several hundred billion n-grams (Brants et al., 2007), raising challenges of efficient storage and retrieval. In fact, these

toolkits are impractical for learning high-order LMs on large corpora, due to their poor scaling properties in both the training and query phases. There is a trade-off among accuracy, space, and time, with recent papers considering small but approximate lossy LMs (Chazelle et al., 2004; Talbot and Osborne, 2007a; Guthrie and Hepple, 2010; Church et al., 2007), lossless LMs backed by tries (Stolcke et al., 2011) and related compressed structures (Germann et al., 2009; Heafield, 2011; Pauls and Klein, 2011; Sorensen and Allauzen, 2011), or distributed computation (Heafield et al., 2013; Brants et al., 2007). However, none of these approaches scale well to very high orders or very large corpora, due to their high memory and time requirements.

As briefly mentioned in Section 1.1, our proposed models are all based on ∞-order sequence prediction modelling. ∞-order sequence models were proposed for the language modelling task, where they showed significant success but could only deal with very small datasets (Wood et al., 2009; Gasthaus and Teh, 2010). This is, similar to the aforementioned n-gram LMs, due to the statistical challenges of estimating the parameters, and the computational time and memory usage of these models in the representation, training, and test phases as n grows or the data size exceeds a few hundred MiB. We take a fundamentally different approach and skip the estimation of the required parameters at training time. Instead, we make use of recent advances in compressed suffix trees (Csts) (Sadakane, 2007) and build a compact representation of the text with a memory requirement proportional to the size of the data. We then use the compact representation for extracting the required statistics on-the-fly during the test phase. This bypasses the memory demand of representing the model, and the statistical issues of estimating its parameters during the training phase. Then, at test time, frequencies and other required statistics for given n-gram queries are extracted from the Cst. To extract the sufficient statistics efficiently, optimized algorithms for computing the Kneser-Ney (KN) (Kneser and Ney, 1995) and Modified Kneser-Ney (MKN) (Chen and Goodman, 1999) LM probabilities are developed. The proposed approach has favorable scaling properties with n and data size, has only a modest memory requirement, and allows for fast construction and querying.

To make the query speed competitive with the state-of-the-art methods, we precompute counts (i.e., the number of unique contexts to the left/right of a string) that are very expensive to compute at query time. The precomputed quantities are then stored in a compressed data structure,

supporting efficient memory usage and lookup. Also, we reuse Cst nodes within the n-gram probability computation as a sentence is scored left-to-right, thus saving many expensive lookups. The strengths of this method are apparent when applied to very large training datasets (≥ 16 GiB) and for high-order models, n ≥ 5. In this setting, while this approach is more memory efficient than the leading toolkits, both in the construction (training) and querying (testing) phases, it is highly competitive in terms of the runtimes of both phases. When memory is a limiting factor at query time, the proposed approach is orders of magnitude faster than the state of the art. Moreover, our method allows for efficient querying with an unlimited Markov order, n → ∞, without resorting to approximations or heuristics.

We revisit the training procedure of Kneser-Ney and illustrate that the Modified Kneser-Ney language model is in fact trained by maximizing the (leave-one-out) likelihood. We then extend these models by allowing a more elegant way of dealing with out-of-vocabulary words and domain mismatch. We will show in Chapter 2 that Kneser-Ney smoothing (an MLE approach) is an approximation to a much richer hierarchical Bayesian model, the Hierarchical Pitman-Yor Process. Having established means of ∞-order language modelling in an MLE fashion, in the next section we briefly explain our proposed approach for efficient and scalable modelling and inference under the Bayesian paradigm.

1.3 Scalable Bayesian non-Markovian sequence models

Bayesian modelling is a natural fit for describing the uncertainty about model parameters (Bishop, 2007). Hierarchical Bayesian modelling extends this and provides a powerful framework to statistically model the dependence between different phenomena, e.g. the dependence between a word and its topic. A significant advancement for such general-purpose models was the development of the Bayesian nonparametric Hierarchical Dirichlet Process (HDP) (Teh et al., 2006) and the Hierarchical Pitman-Yor Process (HPYP) extension, which was applied to language modelling by Teh (2006a). The HPYP allows the complexity of the model to be learned from the data, and to grow as more data arrives.

The intuition behind KN smoothing is to adjust the original distribution to assign non-zero probability to unseen or rare events. This is achieved by re-allocating the maximum likelihood

estimated probability mass from more frequent events to rare and unseen events in an interpolative procedure via absolute discounting. It turns out that the Bayesian generalization of KN smoothing is the HPYP LM (Teh, 2006a), which was originally developed for finite-order LMs (Teh, 2006b) and was extended as the Sequence Memoizer (SM) (Wood et al., 2011) to model infinite-order LMs. Capturing long-range dependencies via the HPYP improves the estimation of conditional probabilities. These types of models, however, remain impractical due to several computational and learning challenges, namely large model size (the data structure representing the model, and the number of parameters), long training and test times, and poor sampler mixing. We address these issues by building the HPYP model on top of a Cst. In the training step, only the Cst representation of the text is constructed, allowing for fast training, and we propose an efficient approximate inference algorithm for test time. The mixing issue is avoided via heuristic sampler initialization and design.

Our proposed approximation of the HPYP is richer than KN and MKN, and is much more efficient in the learning and inference phases compared to the full HPYP. Compared with 10-gram KN and MKN models, the ∞-gram model consistently improves the perplexity by up to 15%. Using compressed data structures allows us to train on large collections of text, i.e. 100× larger than the largest dataset used in HPYP language models (Wood et al., 2011), while having a memory footprint several orders of magnitude smaller and supporting fast and efficient inference.

1.4 Thesis outline

In this section we provide an outline of the rest of the thesis, a summary of each chapter, and references to published works resulting from each chapter.

Chapter 2: Background

In Chapter 2 we provide a brief overview of the foundations for the research in this thesis, including finite-order Markov language models, infinite-order non-parametric Bayesian language models, and compressed data structures. We discuss related work in detail in each specific chapter.

Chapter 3: Applications of Non-Markovian Sequential Modelling for Structured Prediction

This chapter is based on:
E. Shareghi, G. Haffari, T. Cohn, A. Nicholson, "Structured Prediction of Sequences and Trees using Infinite Contexts", Proceedings of the European Conference on Machine Learning (ECML), 2015, Porto, Portugal.

To demonstrate that ∞-order models achieve better performance in various NLP tasks, we propose a novel Hierarchical Pitman-Yor Process model for structured prediction over sequences and trees, which exploits infinite context by conditioning each generation decision on an unbounded context of previous events. While at its core our model remains a sequence model, at the inference stage we propose novel algorithms to predict structured labels (i.e., grammar trees). We propose prediction algorithms based on A* search and Markov Chain Monte Carlo sampling. Empirical results demonstrate the potential of our infinite-context model compared to baseline finite-context Markov models on morphological and syntactic parsing, and competitive performance on the POS tagging task.

Chapter 4: Scalable Non-Markovian Sequential Modelling

This chapter is based on:
E. Shareghi, M. Petri, G. Haffari, T. Cohn, "Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees", Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), 2015, Lisbon, Portugal.
E. Shareghi, M. Petri, G. Haffari, T. Cohn, "Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees", Transactions of the Association for Computational Linguistics (TACL), 2016.
E. Shareghi, T. Cohn, G. Haffari, "Richer Interpolative Smoothing Based on Modified Kneser-Ney Language Modeling", Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), 2016, Austin, USA.

A prime example of sequence modelling is the language modelling task. In this chapter we consider the Kneser-Ney and Modified Kneser-Ney language models. We propose a new approach based on compressed suffix trees to represent the structure of the KN-based language models. This results in memory usage roughly matching the size of the bzip2-compressed text and fast training, while prediction remains slow. To speed up inference, we precompute and compactly store the expensive quantities, and propose algorithmic optimizations to reuse information while processing queries as they arrive at test time. Our proposed approach for KN and MKN is several orders of magnitude

more memory efficient than the state-of-the-art, in both training and testing. It is highly competitive in terms of the runtimes of both phases. When memory is a limiting factor at query time, our approach is orders of magnitude faster than the state-of-the-art. Finally, as opposed to the current state-of-the-art, our approach scales very efficiently to large data (i.e. 32 GiB) and infinite contexts.

The ability of the proposed approach to represent the models compactly allows us to explore more complex models. Based on the training procedure of KN, we illustrate that MKN is trained by maximizing the leave-one-out likelihood, and present a generalization of the MKN LM for richer smoothing via the introduction of additional discount parameters. The discount parameters are responsible for preserving some mass to allocate non-zero probability to unseen or rare events at test time (a schematic sketch of this discount-and-interpolate computation is given after this outline). We provide the mathematical underpinning for the estimators of the discount bounds and extend them further. We showcase the utility of our rich MKN LM on several languages and further explore the interdependency among training data size, language model order, and the number of discount parameters. Our empirical results illustrate that a larger number of discount parameters, compared to the KN and MKN LMs, allows for better allocation of mass in the smoothing process, particularly in the small-data regime where statistical sparsity is severe, and leads to significant reductions in perplexity, particularly for out-of-domain test sets which introduce a higher ratio of out-of-vocabulary words.

Chapter 5: Scalable Bayesian Non-Markovian Sequential Modelling

This chapter is based on:
E. Shareghi, G. Haffari, T. Cohn, "Compressed Nonparametric Language Modelling", Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2017, Melbourne, Australia.

The KN and MKN language models, trained in an MLE fashion, are approximations to the Hierarchical Pitman-Yor Process (Bayesian) model. In this chapter we turn to the fully Bayesian approach to modelling and inference and propose a compressed framework to compactly represent the structure of an HPYP along with the sufficient statistics needed to recover the state of the sampler. While the compressed data structures allow for a compact representation of the HPYP LM, this is not sufficient for scaling these types of models up to standard corpora. In fact, these models can still remain impractical due to the computational complexity of sampling and the costly inference, e.g. through poor sampler mixing. To address these issues, we develop an efficient,

fast approximate inference scheme with a much lower memory footprint compared to the full HPYP inference of existing models. The experimental results illustrate that our proposed framework and approximate inference scheme can be built on significantly larger datasets compared to previous HPYP models, while being several orders of magnitude smaller, fast in training and inference, and consistently outperforming the MKN LM in terms of predictive perplexity by up to 15%.

Chapter 6: Conclusion

In Chapter 6 we provide concluding remarks on the contributions presented in this thesis, as well as potential avenues for future research.
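As a schematic reference for the discount-and-interpolate computation referred to in the Chapter 4 summary above, the toy bigram model below (my own sketch; proper KN and MKN additionally build the lower-order distribution from continuation counts and use several discounts) shows how a single discount D removes mass from observed counts and hands it to the lower-order distribution:

    from collections import Counter

    def absolute_discount_bigram(tokens, discount=0.75):
        # Toy interpolated bigram model with one absolute discount D; it sketches the
        # general shape of KN-style smoothing only (no continuation counts, single D).
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        total = len(tokens)

        def prob(word, context):
            p_lower = unigrams[word] / total             # lower-order (unigram) estimate
            ctx = unigrams[context]
            if ctx == 0:
                return p_lower                           # unseen context: back off entirely
            observed = max(bigrams[(context, word)] - discount, 0) / ctx
            followers = len({w for (c, w) in bigrams if c == context})
            reserved = discount * followers / ctx        # mass freed by discounting
            return observed + reserved * p_lower         # interpolate with the lower order

        return prob

    p = absolute_discount_bigram("a red ball a red bouncy ball a ball".split())
    print(p("ball", "red"))   # seen bigram: discounted count plus interpolated mass
    print(p("a", "bouncy"))   # unseen bigram: relies entirely on the reserved mass
    # (a true OOV word would still get zero here; full KN/MKN handle that at the lowest level)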

CHAPTER 2

Background

In this chapter we provide a brief overview of the foundations for the research in this thesis, including statistical (n-gram) language models, non-parametric Bayesian language models, and compressed data structures.

We start by showing how the probability of a sequence is estimated under finite-order n-gram language models, explaining two of the widely used approaches, Kneser-Ney (KN) and Modified Kneser-Ney (MKN). Then, we cover various approaches proposed in the literature that aim to improve the scalability of n-gram models. Next, we describe the non-parametric Bayesian language models which are based on the Hierarchical Pitman-Yor Process (HPYP). We explain how learning and inference are done under ∞-order HPYP models and provide the link between the KN and HPYP language models. We extend the ∞-order sequential modelling of the HPYP in Chapter 3, and illustrate its NLP applications beyond the language modelling problem. In the last section we describe the compressed algorithmic framework required to scale the models to the ∞-order setting with large corpora. We explain some of the basic data structures, e.g. suffix arrays and trees, as well as more advanced data structures such as wavelet trees and compressed suffix arrays and trees, and the required operations, e.g. rank, select and backward-search, that allow us to represent the data in a compact form while still supporting various search operations very efficiently. This framework is the basis of Chapters 4 and 5.
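As a pointer to the HPYP machinery developed in this chapter, the following minimal sketch (my own, not code from the thesis) computes the Pitman-Yor predictive probability at a single restaurant u directly from the customer and table counts n_w^u and t_w^u of the List of Symbols; in the hierarchical model the base distribution is the predictive distribution of the parent restaurant (the next-shorter context), which yields the KN-style interpolation discussed in Chapter 1:

    def pyp_predictive(word, n_w, t_w, n_tot, t_tot, d, theta, base_prob):
        # Predictive probability of `word` under one Pitman-Yor restaurant:
        #   P(word) = (n_w - d * t_w) / (theta + n_tot)
        #           + (theta + d * t_tot) / (theta + n_tot) * base_prob(word)
        # with discount d and concentration theta.
        seated = max(n_w.get(word, 0) - d * t_w.get(word, 0), 0.0)
        reserved = (theta + d * t_tot) * base_prob(word)
        return (seated + reserved) / (theta + n_tot)

    # Hypothetical counts for one context, with a uniform base over 5 word types.
    n_w = {"ball": 3, "Force": 1}
    t_w = {"ball": 2, "Force": 1}
    uniform = lambda w: 1.0 / 5
    for w in ("ball", "Force", "red"):
        print(w, pyp_predictive(w, n_w, t_w, 4, 3, d=0.8, theta=0.5, base_prob=uniform))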


More information

10/17/04. Today s Main Points

10/17/04. Today s Main Points Part-of-speech Tagging & Hidden Markov Model Intro Lecture #10 Introduction to Natural Language Processing CMPSCI 585, Fall 2004 University of Massachusetts Amherst Andrew McCallum Today s Main Points

More information

Natural Language Processing : Probabilistic Context Free Grammars. Updated 5/09

Natural Language Processing : Probabilistic Context Free Grammars. Updated 5/09 Natural Language Processing : Probabilistic Context Free Grammars Updated 5/09 Motivation N-gram models and HMM Tagging only allowed us to process sentences linearly. However, even simple sentences require

More information

The Infinite Markov Model

The Infinite Markov Model The Infinite Markov Model Daichi Mochihashi NTT Communication Science Laboratories, Japan daichi@cslab.kecl.ntt.co.jp NIPS 2007 The Infinite Markov Model (NIPS 2007) p.1/20 Overview ɛ ɛ is of will is of

More information

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly

More information

Probabilistic Context-Free Grammars. Michael Collins, Columbia University

Probabilistic Context-Free Grammars. Michael Collins, Columbia University Probabilistic Context-Free Grammars Michael Collins, Columbia University Overview Probabilistic Context-Free Grammars (PCFGs) The CKY Algorithm for parsing with PCFGs A Probabilistic Context-Free Grammar

More information

Expectation Maximization (EM)

Expectation Maximization (EM) Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted

More information

Infinite Hierarchical Hidden Markov Models

Infinite Hierarchical Hidden Markov Models Katherine A. Heller Engineering Department University of Cambridge Cambridge, UK heller@gatsby.ucl.ac.uk Yee Whye Teh and Dilan Görür Gatsby Unit University College London London, UK {ywteh,dilan}@gatsby.ucl.ac.uk

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Hidden Markov Models in Language Processing

Hidden Markov Models in Language Processing Hidden Markov Models in Language Processing Dustin Hillard Lecture notes courtesy of Prof. Mari Ostendorf Outline Review of Markov models What is an HMM? Examples General idea of hidden variables: implications

More information

Advanced Natural Language Processing Syntactic Parsing

Advanced Natural Language Processing Syntactic Parsing Advanced Natural Language Processing Syntactic Parsing Alicia Ageno ageno@cs.upc.edu Universitat Politècnica de Catalunya NLP statistical parsing 1 Parsing Review Statistical Parsing SCFG Inside Algorithm

More information

Probabilistic Context Free Grammars. Many slides from Michael Collins

Probabilistic Context Free Grammars. Many slides from Michael Collins Probabilistic Context Free Grammars Many slides from Michael Collins Overview I Probabilistic Context-Free Grammars (PCFGs) I The CKY Algorithm for parsing with PCFGs A Probabilistic Context-Free Grammar

More information

ANLP Lecture 6 N-gram models and smoothing

ANLP Lecture 6 N-gram models and smoothing ANLP Lecture 6 N-gram models and smoothing Sharon Goldwater (some slides from Philipp Koehn) 27 September 2018 Sharon Goldwater ANLP Lecture 6 27 September 2018 Recap: N-gram models We can model sentence

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Language Model. Introduction to N-grams

Language Model. Introduction to N-grams Language Model Introduction to N-grams Probabilistic Language Model Goal: assign a probability to a sentence Application: Machine Translation P(high winds tonight) > P(large winds tonight) Spelling Correction

More information

The Noisy Channel Model and Markov Models

The Noisy Channel Model and Markov Models 1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013 More on HMMs and other sequence models Intro to NLP - ETHZ - 18/03/2013 Summary Parts of speech tagging HMMs: Unsupervised parameter estimation Forward Backward algorithm Bayesian variants Discriminative

More information

Probabilistic Language Modeling

Probabilistic Language Modeling Predicting String Probabilities Probabilistic Language Modeling Which string is more likely? (Which string is more grammatical?) Grill doctoral candidates. Regina Barzilay EECS Department MIT November

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Nonparametric Bayesian Models --Learning/Reasoning in Open Possible Worlds Eric Xing Lecture 7, August 4, 2009 Reading: Eric Xing Eric Xing @ CMU, 2006-2009 Clustering Eric Xing

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) Latent Dirichlet Allocation (LDA) A review of topic modeling and customer interactions application 3/11/2015 1 Agenda Agenda Items 1 What is topic modeling? Intro Text Mining & Pre-Processing Natural Language

More information

CSC401/2511 Spring CSC401/2511 Natural Language Computing Spring 2019 Lecture 5 Frank Rudzicz and Chloé Pou-Prom University of Toronto

CSC401/2511 Spring CSC401/2511 Natural Language Computing Spring 2019 Lecture 5 Frank Rudzicz and Chloé Pou-Prom University of Toronto CSC401/2511 Natural Language Computing Spring 2019 Lecture 5 Frank Rudzicz and Chloé Pou-Prom University of Toronto Revisiting PoS tagging Will/MD the/dt chair/nn chair/?? the/dt meeting/nn from/in that/dt

More information

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging 10-708: Probabilistic Graphical Models 10-708, Spring 2018 10 : HMM and CRF Lecturer: Kayhan Batmanghelich Scribes: Ben Lengerich, Michael Kleyman 1 Case Study: Supervised Part-of-Speech Tagging We will

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information

Bayesian Nonparametrics

Bayesian Nonparametrics Bayesian Nonparametrics Peter Orbanz Columbia University PARAMETERS AND PATTERNS Parameters P(X θ) = Probability[data pattern] 3 2 1 0 1 2 3 5 0 5 Inference idea data = underlying pattern + independent

More information

Sparse Forward-Backward for Fast Training of Conditional Random Fields

Sparse Forward-Backward for Fast Training of Conditional Random Fields Sparse Forward-Backward for Fast Training of Conditional Random Fields Charles Sutton, Chris Pal and Andrew McCallum University of Massachusetts Amherst Dept. Computer Science Amherst, MA 01003 {casutton,

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Probabilistic Context Free Grammars. Many slides from Michael Collins and Chris Manning

Probabilistic Context Free Grammars. Many slides from Michael Collins and Chris Manning Probabilistic Context Free Grammars Many slides from Michael Collins and Chris Manning Overview I Probabilistic Context-Free Grammars (PCFGs) I The CKY Algorithm for parsing with PCFGs A Probabilistic

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Variational Decoding for Statistical Machine Translation

Variational Decoding for Statistical Machine Translation Variational Decoding for Statistical Machine Translation Zhifei Li, Jason Eisner, and Sanjeev Khudanpur Center for Language and Speech Processing Computer Science Department Johns Hopkins University 1

More information

Inference in Explicit Duration Hidden Markov Models

Inference in Explicit Duration Hidden Markov Models Inference in Explicit Duration Hidden Markov Models Frank Wood Joint work with Chris Wiggins, Mike Dewar Columbia University November, 2011 Wood (Columbia University) EDHMM Inference November, 2011 1 /

More information

Speech Recognition Lecture 5: N-gram Language Models. Eugene Weinstein Google, NYU Courant Institute Slide Credit: Mehryar Mohri

Speech Recognition Lecture 5: N-gram Language Models. Eugene Weinstein Google, NYU Courant Institute Slide Credit: Mehryar Mohri Speech Recognition Lecture 5: N-gram Language Models Eugene Weinstein Google, NYU Courant Institute eugenew@cs.nyu.edu Slide Credit: Mehryar Mohri Components Acoustic and pronunciation model: Pr(o w) =

More information

Spatial Normalized Gamma Process

Spatial Normalized Gamma Process Spatial Normalized Gamma Process Vinayak Rao Yee Whye Teh Presented at NIPS 2009 Discussion and Slides by Eric Wang June 23, 2010 Outline Introduction Motivation The Gamma Process Spatial Normalized Gamma

More information

Word Alignment for Statistical Machine Translation Using Hidden Markov Models

Word Alignment for Statistical Machine Translation Using Hidden Markov Models Word Alignment for Statistical Machine Translation Using Hidden Markov Models by Anahita Mansouri Bigvand A Depth Report Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of

More information

Parsing with Context-Free Grammars

Parsing with Context-Free Grammars Parsing with Context-Free Grammars Berlin Chen 2005 References: 1. Natural Language Understanding, chapter 3 (3.1~3.4, 3.6) 2. Speech and Language Processing, chapters 9, 10 NLP-Berlin Chen 1 Grammars

More information

Penn Treebank Parsing. Advanced Topics in Language Processing Stephen Clark

Penn Treebank Parsing. Advanced Topics in Language Processing Stephen Clark Penn Treebank Parsing Advanced Topics in Language Processing Stephen Clark 1 The Penn Treebank 40,000 sentences of WSJ newspaper text annotated with phrasestructure trees The trees contain some predicate-argument

More information

28 : Approximate Inference - Distributed MCMC

28 : Approximate Inference - Distributed MCMC 10-708: Probabilistic Graphical Models, Spring 2015 28 : Approximate Inference - Distributed MCMC Lecturer: Avinava Dubey Scribes: Hakim Sidahmed, Aman Gupta 1 Introduction For many interesting problems,

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Text Mining. March 3, March 3, / 49

Text Mining. March 3, March 3, / 49 Text Mining March 3, 2017 March 3, 2017 1 / 49 Outline Language Identification Tokenisation Part-Of-Speech (POS) tagging Hidden Markov Models - Sequential Taggers Viterbi Algorithm March 3, 2017 2 / 49

More information

Bayesian Networks BY: MOHAMAD ALSABBAGH

Bayesian Networks BY: MOHAMAD ALSABBAGH Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional

More information

A Hierarchical Bayesian Language Model based on Pitman-Yor Processes

A Hierarchical Bayesian Language Model based on Pitman-Yor Processes A Hierarchical Bayesian Language Model based on Pitman-Yor Processes Yee Whye Teh School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543. tehyw@comp.nus.edu.sg Abstract

More information

Language Modeling. Michael Collins, Columbia University

Language Modeling. Michael Collins, Columbia University Language Modeling Michael Collins, Columbia University Overview The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques: Linear interpolation Discounting

More information

Roger Levy Probabilistic Models in the Study of Language draft, October 2,

Roger Levy Probabilistic Models in the Study of Language draft, October 2, Roger Levy Probabilistic Models in the Study of Language draft, October 2, 2012 224 Chapter 10 Probabilistic Grammars 10.1 Outline HMMs PCFGs ptsgs and ptags Highlight: Zuidema et al., 2008, CogSci; Cohn

More information

Dynamic Data Modeling, Recognition, and Synthesis. Rui Zhao Thesis Defense Advisor: Professor Qiang Ji

Dynamic Data Modeling, Recognition, and Synthesis. Rui Zhao Thesis Defense Advisor: Professor Qiang Ji Dynamic Data Modeling, Recognition, and Synthesis Rui Zhao Thesis Defense Advisor: Professor Qiang Ji Contents Introduction Related Work Dynamic Data Modeling & Analysis Temporal localization Insufficient

More information

An Overview of Nonparametric Bayesian Models and Applications to Natural Language Processing

An Overview of Nonparametric Bayesian Models and Applications to Natural Language Processing An Overview of Nonparametric Bayesian Models and Applications to Natural Language Processing Narges Sharif-Razavian and Andreas Zollmann School of Computer Science Carnegie Mellon University Pittsburgh,

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

A DOP Model for LFG. Rens Bod and Ronald Kaplan. Kathrin Spreyer Data-Oriented Parsing, 14 June 2005

A DOP Model for LFG. Rens Bod and Ronald Kaplan. Kathrin Spreyer Data-Oriented Parsing, 14 June 2005 A DOP Model for LFG Rens Bod and Ronald Kaplan Kathrin Spreyer Data-Oriented Parsing, 14 June 2005 Lexical-Functional Grammar (LFG) Levels of linguistic knowledge represented formally differently (non-monostratal):

More information

Decoding and Inference with Syntactic Translation Models

Decoding and Inference with Syntactic Translation Models Decoding and Inference with Syntactic Translation Models March 5, 2013 CFGs S NP VP VP NP V V NP NP CFGs S NP VP S VP NP V V NP NP CFGs S NP VP S VP NP V NP VP V NP NP CFGs S NP VP S VP NP V NP VP V NP

More information

Chapter 8 PROBABILISTIC MODELS FOR TEXT MINING. Yizhou Sun Department of Computer Science University of Illinois at Urbana-Champaign

Chapter 8 PROBABILISTIC MODELS FOR TEXT MINING. Yizhou Sun Department of Computer Science University of Illinois at Urbana-Champaign Chapter 8 PROBABILISTIC MODELS FOR TEXT MINING Yizhou Sun Department of Computer Science University of Illinois at Urbana-Champaign sun22@illinois.edu Hongbo Deng Department of Computer Science University

More information

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs (based on slides by Sharon Goldwater and Philipp Koehn) 21 February 2018 Nathan Schneider ENLP Lecture 11 21

More information

Forgetting Counts : Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process

Forgetting Counts : Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process : Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process Nicholas Bartlett David Pfau Frank Wood Department of Statistics Center for Theoretical Neuroscience Columbia University, 2960

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

Sequences and Information

Sequences and Information Sequences and Information Rahul Siddharthan The Institute of Mathematical Sciences, Chennai, India http://www.imsc.res.in/ rsidd/ Facets 16, 04/07/2016 This box says something By looking at the symbols

More information

Lecture 13: Structured Prediction

Lecture 13: Structured Prediction Lecture 13: Structured Prediction Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS6501: NLP 1 Quiz 2 v Lectures 9-13 v Lecture 12: before page

More information

N-gram Language Model. Language Models. Outline. Language Model Evaluation. Given a text w = w 1...,w t,...,w w we can compute its probability by:

N-gram Language Model. Language Models. Outline. Language Model Evaluation. Given a text w = w 1...,w t,...,w w we can compute its probability by: N-gram Language Model 2 Given a text w = w 1...,w t,...,w w we can compute its probability by: Language Models Marcello Federico FBK-irst Trento, Italy 2016 w Y Pr(w) =Pr(w 1 ) Pr(w t h t ) (1) t=2 where

More information

Aspects of Tree-Based Statistical Machine Translation

Aspects of Tree-Based Statistical Machine Translation Aspects of Tree-Based Statistical Machine Translation Marcello Federico Human Language Technology FBK 2014 Outline Tree-based translation models: Synchronous context free grammars Hierarchical phrase-based

More information

CMPT-825 Natural Language Processing. Why are parsing algorithms important?

CMPT-825 Natural Language Processing. Why are parsing algorithms important? CMPT-825 Natural Language Processing Anoop Sarkar http://www.cs.sfu.ca/ anoop October 26, 2010 1/34 Why are parsing algorithms important? A linguistic theory is implemented in a formal system to generate

More information