Bayesian (Nonparametric) Approaches to Language Modeling

1 Bayesian (Nonparametric) Approaches to Language Modeling. Frank Wood (University of Oxford), "unemployed (and homeless)". January 2013.

2 Message: Whole Distributions as Random Variables; Hierarchical Modeling; Embracing Uncertainty.

3 Challenge Problem: Predict What Comes Next. (The slide shows a long numeric sequence whose digits were lost in transcription.)

4 0.6 < Entropy of English < 1.3 bits/byte [Shannon, 1951]. Figure: bits/byte vs. stream length (log_10) for the sequence memoizer with life-long learning computational characteristics, for several state budgets L = 10^3, 10^4, 10^5, 10^6, ... [Bartlett, Pfau, and Wood, 2010]. Wikipedia: 26.8GB compressed to 4GB (≈ 1.2 bits/byte); gzip: 7.9GB; PAQ: 3.8GB. L is the number of states in the SM; data is the head of the 2010 Wikipedia dump, depth 16. Footnote: Entropy = -(1/T) Σ_{t=1}^{T} log_2 P(x_t | x_{1:t-1}) (lower is better).
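The footnote's average log-loss is what every bits/byte number in this talk reports. Below is a minimal sketch of how it is computed for any sequential predictive model, assuming a hypothetical `model.prob` / `model.update` interface (not the actual SM API):

```python
import math

def average_log_loss(model, data):
    """Average log-loss in bits per symbol: -(1/T) * sum_t log2 P(x_t | x_{1:t-1}).

    `model.prob(context, symbol)` returns the predictive probability of `symbol`
    given the preceding `context`; `model.update(context, symbol)` then
    incorporates the observation. Both names are assumptions for this sketch.
    """
    bits = 0.0
    for t, symbol in enumerate(data):
        context = data[:t]
        bits -= math.log2(model.prob(context, symbol))  # surprise of this symbol
        model.update(context, symbol)                   # learn from it afterwards
    return bits / len(data)
```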

5 How Good Is 1.5 bits/byte? Sample text generated from the model: "the fastest way to make money is... being player. bucketti wish i made it end in the reducers and assemblance smart practices to allow city is something to walk in most of the work of agriculture. i d be able to compete with an earlier goals : words the danger of conduct in southern california, has set of juice of the fights lack of audience that the eclasm beverly hills. she companies or running back down that the book, grass. and the coynes, exaggerate between the pad, a debate on emissions on air political capital crashing that the new obviously program irock price, coach began refugees, much and" Footnote: constant-space, byte-SM, NY Times corpus (50MB), log-loss 1.49 bits/byte, 2 million restaurants.

6 Notation and iid Conditional Distribution Estimation. Let Σ be a set of symbols, let x = x_1, x_2, ..., x_T be an observed sequence with x_t, s, s' ∈ Σ, and let u ∈ W = {w_1, w_2, ..., w_K} with w_k ∈ Σ^+. Then

G_u(s) = N(us) / Σ_{s' ∈ Σ} N(us')

is the maximum likelihood estimator for the conditional distribution P(X = s | u) ≈ G_u(s).

7 Notation and iid Conditional Distribution Estimation. Let Σ be a set of symbols, let x = x_1, x_2, ..., x_T be an observed sequence with x_t, s, s' ∈ Σ, and let u ∈ W = {w_1, w_2, ..., w_K} with w_k ∈ Σ^+. Then

G_u(s) = N(us) / Σ_{s' ∈ Σ} N(us')

is the maximum likelihood estimator for the conditional distribution P(X = s | u) ≈ G_u(s). Estimating a finite set of conditional distributions corresponding to a finite set of contexts u generally yields a poor model. Why?
- Long contexts occur infrequently.
- Maximum likelihood estimates given few observations are often poor.
- A finite set of contexts might be sparse.
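For concreteness, a small count-based sketch (not from the slides) of the quantities above: `context_counts` tabulates N(us) and `ml_conditional` returns the maximum-likelihood estimate. With long contexts most totals are tiny or zero, which is exactly the problem listed above.

```python
from collections import Counter, defaultdict

def context_counts(x, k):
    """N(us): how often symbol s follows each length-k context u in the sequence x."""
    counts = defaultdict(Counter)
    for t in range(k, len(x)):
        u = tuple(x[t - k:t])
        counts[u][x[t]] += 1
    return counts

def ml_conditional(counts, u, s):
    """Maximum-likelihood estimate G_u(s) = N(us) / sum_{s'} N(us')."""
    total = sum(counts[u].values())
    return counts[u][s] / total if total else 0.0
```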

8 Bayesian Smoothing. Use independent symmetric Dirichlet priors, i.e.

G_u ~ Dir(α/|Σ|, ..., α/|Σ|)
x_t | x_{t-k}, ..., x_{t-1} = u ~ G_u

9 Bayesian Smoothing. Use independent symmetric Dirichlet priors, i.e.

G_u ~ Dir(α/|Σ|, ..., α/|Σ|)
x_t | x_{t-k}, ..., x_{t-1} = u ~ G_u

and embrace uncertainty about the G_u's:

P(x_{T+1} = s | x, u) = ∫ P(x_{T+1} = s | G_u) P(G_u | x) dG_u = E[G_u(s) | x],

yielding, for instance,

E[G_u(s) | x] = (N(us) + α/|Σ|) / (Σ_{s' ∈ Σ} N(us') + α)
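A one-line sketch of this posterior-mean (additive-smoothing) estimator, reusing the count structure from the earlier sketch; `alpha` is the total Dirichlet mass and `alphabet` stands in for Σ:

```python
def dirichlet_mean(counts, u, s, alphabet, alpha=1.0):
    """Posterior mean under a symmetric Dirichlet prior with mass alpha/|Sigma| per symbol:
    E[G_u(s) | x] = (N(us) + alpha/|Sigma|) / (sum_{s'} N(us') + alpha)."""
    total = sum(counts[u].values())
    return (counts[u][s] + alpha / len(alphabet)) / (total + alpha)
```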

10 Hierarchical Smoothing. Smooth G_u using a similar distribution G_σ(u), i.e.

G_u ~ Dir(c G_σ(u)) [MacKay and Peto, 1995]   or   G_u ~ DP(c, G_σ(u)) [Teh et al., 2006]

with, for instance, σ(x_1 x_2 x_3 ... x_t) = x_2 x_3 ... x_t (the suffix operator).
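The hierarchy ties each G_u to G_σ(u). Below is a deliberately simplified back-off recursion that uses the posterior mean at each level; it is only meant to convey the structure, not the full hierarchical Dirichlet/DP inference of MacKay and Peto [1995] or Teh et al. [2006]. It assumes a count table covering contexts of every length.

```python
def backoff_mean(counts, u, s, alphabet, c=1.0):
    """Recursive posterior-mean approximation suggested by G_u ~ Dir(c * G_sigma(u)):
    E[G_u(s)] ~= (N(us) + c * E[G_sigma(u)(s)]) / (sum_{s'} N(us') + c).
    Assumes `counts` holds N(us) for contexts of every length up to the maximum,
    e.g. merged from context_counts(x, k) for k = 0, 1, ..., n-1.
    """
    if len(u) == 0:
        parent = 1.0 / len(alphabet)                          # uniform base at the root
    else:
        parent = backoff_mean(counts, u[1:], s, alphabet, c)  # back off to sigma(u)
    ctx = counts.get(u, {})
    total = sum(ctx.values())
    return (ctx.get(s, 0) + c * parent) / (total + c)
```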

11 Natural Question. Is it possible to simultaneously estimate all distributions in such a model, even in the infinite-depth case?

12 Unexpected Answer. Is it possible to simultaneously estimate all distributions in such a model, even in the infinite-depth case? YES!
- Sequence memoizer (SM), Wood et al. [2011]: a compact, O(T), unbounded-depth, hierarchical, Bayesian-nonparametric distribution over discrete sequence data.
- Graphical Pitman-Yor process (GPYP), Wood and Teh [2008]: a hierarchical sharing prior with directed acyclic dependency structure.

13 SM = hierarchical PY process [Teh, 2006] of unbounded depth. Figure: binary sequence memoizer graphical model.

14 Binary Sequence Memoizer Notation.

G_ε | U_Σ, d_0 ~ PY(d_0, 0, U_Σ)
G_u | G_σ(u), d_u ~ PY(d_u, 0, G_σ(u))   for u ∈ Σ^+
x_t | x_1, ..., x_{t-1} = u ~ G_u

Here σ(x_1 x_2 x_3 ... x_t) = x_2 x_3 ... x_t is the suffix operator, Σ = {0, 1}, and U_Σ = [0.5, 0.5]. This encodes natural assumptions:
- A recency assumption, encoded by the way related conditional distributions are coupled ("back-off").
- Power-law effects, encoded by the choice of hierarchical modeling glue: the Pitman-Yor process (PY).
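A tiny illustration (not from the slides) of the suffix operator's role: each context backs off along the chain below until it reaches the empty context ε, whose prior is the uniform base U_Σ.

```python
def context_chain(u):
    """Chain of contexts linking G_u to the root: u, sigma(u), sigma(sigma(u)), ..., eps.
    Each distribution in the binary SM is tied to its parent along this chain."""
    chain = [u]
    while u:
        u = u[1:]          # sigma drops the earliest symbol of the context
        chain.append(u)
    return chain

# context_chain("0110") == ["0110", "110", "10", "0", ""]
```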

15 Pitman-Yor Process. A Pitman-Yor process PY(c, d, H) is a distribution over distributions with three parameters:
- A discount 0 ≤ d < 1 that controls power-law behavior; d = 0 gives the Dirichlet process (DP).
- A concentration c > -d, like that of the DP.
- A base distribution H, also like that of the DP.
Key properties: if G ~ PY(c, d, H) then, a priori,

E[G(s)] = H(s)
Var[G(s)] = ((1 - d) / (1 + c)) H(s)(1 - H(s))
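One way to build intuition for the discount is to simulate draws from a Pitman-Yor process with the two-parameter Chinese restaurant process. The sketch below is the standard textbook construction, not the inference scheme used in the SM; with 0 < d < 1 the table sizes follow a power law, and d = 0 recovers Dirichlet-process behavior.

```python
import random

def pitman_yor_draws(n, d, c, base_draw):
    """Sample n values from G ~ PY(c, d, H) via the Chinese restaurant process.
    `base_draw()` samples from the base distribution H (e.g. a uniform byte)."""
    tables, dishes, draws = [], [], []   # customers per table, dish per table, output
    for _ in range(n):
        if tables:
            # existing table k: weight (n_k - d); new table: weight (c + d * #tables)
            weights = [nk - d for nk in tables] + [c + d * len(tables)]
            k = random.choices(range(len(weights)), weights=weights)[0]
        else:
            k = 0                         # the first customer always opens a table
        if k == len(tables):              # open a new table with a fresh draw from H
            tables.append(1)
            dishes.append(base_draw())
        else:
            tables[k] += 1
        draws.append(dishes[k])
    return draws
```

For example, `pitman_yor_draws(100000, 0.8, 0.0, lambda: random.randrange(256))` produces a heavy-tailed frequency-vs-rank profile of the kind shown on the next slide, whereas setting d = 0 does not.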

16 Text and Pitman-Yor Process Power-Law Properties. Figure: power-law scaling of word frequencies in English text, with Pitman-Yor and Dirichlet fits. Relative word frequency is plotted against each word's rank when ordered according to frequency. There are a few very common words and a large number of relatively rare words. Figure from Wood et al. [2011].

17 Big Problem. Space. And time.

18 Coagulation and Fragmentation. PY properties used to achieve a linear-space sequence memoizer [Pitman, 1999, Ho et al., 2006, Wood et al., 2009].
Coagulation: if G_2 | G_1 ~ PY(d_1, 0, G_1) and G_3 | G_2 ~ PY(d_2, 0, G_2), then G_3 | G_1 ~ PY(d_1 d_2, 0, G_1).
Fragmentation: suppose G_3 | G_1 ~ PY(d_1 d_2, 0, G_1). Operations exist to draw G_2 | G_1 ~ PY(d_1, 0, G_1) and G_3 | G_2 ~ PY(d_2, 0, G_2) with G_2 marginalized out.

19 O(T) Sequence Memoizer. Figure: graphical model for an example sequence x [Wood et al., 2009].

20 O(T) Sequence Memoizer vs. Smoothed n-th-Order Markov Model. Figure: perplexity and number of nodes as functions of context length n, comparing the SM to n-th-order Markov models with hierarchical PYP priors as n varies. In red is computational complexity (number of nodes).

21 Problem. Nonparametric, O(T) space ⇒ doesn't work for life-long learning.

22 Towards Lifelong Learning: Incremental Estimation. Sequential Monte Carlo inference [Gasthaus, Wood, and Teh, 2010]: Wikipedia 100MB compressed to 20.8MB (1.66 bits/byte); compare PAQ, a special-purpose, non-incremental, Hutter-prize leader. Still O(T) storage complexity. Footnote: the average log-loss -(1/T) Σ_{t=1}^{T} log_2 P(x_t | x_{1:t-1}) is the average number of bits needed to transmit one byte under optimal entropy encoding (lower is better).

23 Towards Lifelong Learning: Constant Space Approximation. Forgetting counts [Bartlett, Pfau, and Wood, 2010]. Figure: average log-loss (bits) vs. number of states for naive finite depth, random deletion, greedy deletion, the full sequence memoizer, gzip, and bzip2; the constant-state-count SM approximation. Data: Calgary corpus; "states" = number of states used in 3-10 grams. Improvements to the SM [Gasthaus and Teh, 2011]: constant-size representation of predictive distributions.

24 Life-Long Learning SM [Bartlett, Pfau, and Wood, 2010]. Figure: bits/byte vs. stream length (log_10) for several state budgets L = 10^3, 10^4, 10^5, 10^6, ...; the SM with life-long learning computational characteristics. Wikipedia: 26.8GB compressed to 4GB (≈ 1.2 bits/byte). L is the number of states in the SM; data is the head of the 2010 Wikipedia dump, depth 16.

25 Summary So Far.
- SM = powerful Bayesian nonparametric sequence model. Intuition: the limit of a smoothing n-gram model as n → ∞.
- Byte predictive performance in the human range.
- Bayesian nonparametrics ⇒ incompatible with lifelong learning as-is; non-trivial approximations preserve performance, and the resulting model is compatible with lifelong learning.
- Widely useful: language modeling, lossless compression, a plug-in replacement for the Markov-model component of any complex graphical model, and a direct plug-in replacement for an n-gram language model.
- Software:

26 Can We Do Better? Problems: no long-range coherence; no domain specificity. Solutions: develop models that more efficiently encode long-range dependencies; develop models that use domain-specific context. Graphical Pitman-Yor process (GPYP).

27 GPYP Application: Language Model Domain Adaptation.

28 Doubly Hierarchical Pitman-Yor Process Language Model [Wood and Teh, 2009].

G^D_{} ~ PY(d^D_0, α^D_0, λ^D_0 U + (1 - λ^D_0) G^H_{})
G^D_{w_{t-1}} ~ PY(d^D_1, α^D_1, λ^D_1 G^D_{} + (1 - λ^D_1) G^H_{w_{t-1}})
...
G^D_{w_{t-j}:w_{t-1}} ~ PY(d^D_j, α^D_j, λ^D_j G^D_{w_{t-j+1}:w_{t-1}} + (1 - λ^D_j) G^H_{w_{t-j}:w_{t-1}})
x | w_{t-n+1}:w_{t-1} ~ G^D_{w_{t-n+1}:w_{t-1}}

Here D and H index two corpora (e.g., the target domain and a general domain), and each λ^D_j mixes backing off within D against borrowing strength from H.
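To make the structure concrete, here is a simplified plug-in sketch of the mixing idea: the base measure for each in-domain context combines the in-domain estimate for the shorter context (weight `lam`) with an already-trained general-domain estimator `general_est(u, s)` for the same context (weight `1 - lam`). Both names are assumptions for this sketch, and the actual model in Wood and Teh [2009] places Pitman-Yor priors at every node rather than using this recursion.

```python
def adapted_mean(domain_counts, general_est, u, s, alphabet, c=1.0, lam=0.5):
    """Back-off estimate whose base measure mixes the in-domain estimate for the
    shorter context (weight lam) with the out-of-domain estimate for the same
    context (weight 1 - lam), mirroring lambda_j * G^D + (1 - lambda_j) * G^H."""
    if len(u) == 0:
        base = lam * (1.0 / len(alphabet)) + (1 - lam) * general_est(u, s)
    else:
        base = (lam * adapted_mean(domain_counts, general_est, u[1:], s, alphabet, c, lam)
                + (1 - lam) * general_est(u, s))
    ctx = domain_counts.get(u, {})
    total = sum(ctx.values())
    return (ctx.get(s, 0) + c * base) / (total + c)
```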

29 Domain Adaptation Schematic. Figure: schematic with latent switch priors linking Domain 1 and Domain 2.

30 Domain Adaptation Effect (Copacetic). Figure: test perplexity vs. Brown corpus subset size (×10^5) for Union, DHPYPLM, Mixture, and Baseline. Mixture: [Kneser and Steinbiss, 1993]; Union: [Bellegarda, 2004]. Training corpora: small = State of the Union (SOU) addresses (13k types); big = Brown corpus (50k types). Test corpus: Johnson's SOU speeches.

31 Domain Adaptation Effect (Not). Figure: test perplexity vs. Brown corpus subset size (×10^5) for Union, DHPYPLM, Mixture, and Baseline. Mixture: [Kneser and Steinbiss, 1993]; Union: [Bellegarda, 2004]. Training corpora: small = Augmented Multi-Party Interaction (AMI) corpus (8k types); big = Brown corpus (50k types). Test corpus: AMI excerpt.

32 Domain Adaptation Cost-Benefit Analysis. Figure: test perplexity as a function of Brown corpus size and SOU corpus size (×10^4). Realistic scenario: increasing the Brown corpus size imposes computational cost ($), while increasing the SOU corpus size imposes data acquisition cost ($$$).

33 Summary: Whole Distributions as Random Variables; Hierarchical Modeling; Embracing Uncertainty.

34 The End. Thank you.

35 References I
Nicholas Bartlett, David Pfau, and Frank Wood. Forgetting counts: Constant memory inference for a dependent hierarchical Pitman-Yor process. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 63-70, 2010.
J. R. Bellegarda. Statistical language model adaptation: review and perspectives. Speech Communication, 42:93-108, 2004.
J. Gasthaus and Y. W. Teh. Improvements to the sequence memoizer. In Proceedings of Neural Information Processing Systems, 2011.
J. Gasthaus, F. Wood, and Y. W. Teh. Lossless compression based on the Sequence Memoizer. In Data Compression Conference 2010, 2010.

36 References II
M. W. Ho, L. F. James, and J. W. Lau. Coagulation fragmentation laws induced by general coagulations of two-parameter Poisson-Dirichlet processes. 2006.
R. Kneser and V. Steinbiss. On the dynamic adaptation of stochastic language models. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1993.
D. J. C. MacKay and L. C. Bauman Peto. A hierarchical Dirichlet language model. Natural Language Engineering, 1(2), 1995.
J. Pitman. Coalescents with multiple collisions. Annals of Probability, 27, 1999.
C. E. Shannon. Prediction and entropy of printed English. The Bell System Technical Journal, pages 50-64, 1951.
Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Association for Computational Linguistics, 2006.

37 References III
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.
F. Wood and Y. W. Teh. A hierarchical, hierarchical Pitman-Yor process language model. In ICML/UAI Nonparametric Bayes Workshop, 2008.
F. Wood and Y. W. Teh. A hierarchical nonparametric Bayesian approach to statistical language model domain adaptation. In Artificial Intelligence and Statistics, 2009.
F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y. W. Teh. A stochastic memoizer for sequence data. In Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009.
F. Wood, J. Gasthaus, C. Archambeau, L. James, and Y. W. Teh. The sequence memoizer. Communications of the ACM, 2011.
