Hierarchical Bayesian Models of Language and Text
1 Hierarchical Bayesian Models of Language and Text Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL Joint work with Frank Wood *, Jan Gasthaus *, Cedric Archambeau, Lancelot James
2 Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Computation Conclusions
3 Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Computation Conclusions
4 Sequence Models for Language and Text Probabilistic models for sequences of words and characters, e.g. the word sequence "statistical, machine, learning" or the character sequence "s, t, a, t, i, s, t, i, c, a, l, _, m, a, c, h, i, n, e, _, l, e, a, r, n, i, n, g". Uses: natural language processing (speech recognition, OCR, machine translation), compression, cognitive models of language acquisition. Sequence data arises in many other domains.
5 Probabilistic Modelling Set of potential outcomes/observations X. Set of unobserved latent variables Y. Joint distribution over X and Y: P(x ∈ X, y ∈ Y | θ), where θ are the parameters of the model. Inference: P(y ∈ Y | x ∈ X, θ) = P(y, x | θ) / P(x | θ). Learning: maximize the likelihood P(training data | θ). Bayesian learning: P(θ | training data) = P(training data | θ) P(θ) / Z. (Portrait: Rev. Thomas Bayes.)
6 Communication via Noisy Channel A sentence s is transmitted through a noisy channel as an utterance u; the reconstructed sentence maximizes P(s | u) = P(s) P(u | s) / P(u). (Example sentences: "Mary has a little lamb", "Mary likes little Sam".)
7 Communication via Noisy Channel The same model applies to translation: the sentence s ("Mary has a little lamb") is observed as a foreign sentence u ("María tiene un pequeño cordero"); the reconstructed sentence maximizes P(s | u) = P(s) P(u | s) / P(u).
8 Markov Models for Language and Text Probabilistic models for sequences of words and characters: P(statistical machine learning) = P(statistical) · P(machine | statistical) · P(learning | statistical machine). Usually makes a Markov assumption: P(statistical machine learning) ≈ P(statistical) · P(machine | statistical) · P(learning | machine). The order of the Markov model typically ranges from ~3 to >10. (Portraits: Andrey Markov, George E. P. Box.)
9 Sparsity in Markov Models Consider a high-order Markov model: P(sentence) = ∏_i P(word_i | word_{i-N+1} ... word_{i-1}). A large vocabulary size means that naïvely estimating the parameters of this model from data counts is problematic for N > 2: P_ML(word_i | word_{i-N+1} ... word_{i-1}) = C(word_{i-N+1} ... word_i) / C(word_{i-N+1} ... word_{i-1}). Naïve priors/regularization fail as well: most parameters have no associated data. Remedies: smoothing; hierarchical Bayesian models.
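As a concrete, much-simplified illustration of the sparsity problem, here is a minimal Python sketch of the maximum-likelihood estimator above; all function and variable names are illustrative, not from the talk.

```python
# Minimal sketch (illustrative only): naive maximum-likelihood estimation of
# N-gram conditional probabilities from counts. With a realistic vocabulary and
# N > 2, almost every context is unseen and gets no probability mass at all.
from collections import Counter

def ml_ngram_model(words, n):
    """Return P_ML(word | context) as a dict keyed by (context, word)."""
    ngram_counts = Counter()
    context_counts = Counter()
    for i in range(n - 1, len(words)):
        context = tuple(words[i - n + 1:i])
        ngram_counts[(context, words[i])] += 1
        context_counts[context] += 1
    return {(c, w): k / context_counts[c] for (c, w), k in ngram_counts.items()}

corpus = "statistical machine learning is statistical learning".split()
trigram = ml_ngram_model(corpus, n=3)
# Any (context, word) pair not seen in training silently gets probability 0.
print(trigram.get((("machine", "learning"), "rocks"), 0.0))
```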
10 Smoothing in Language Models Smoothing is a way of dealing with data sparsity by combining large and small models together: P_smooth(word_i | word_{i-N+1}^{i-1}) = Σ_{n=1}^{N} λ(n) q_n(word_i | word_{i-n+1}^{i-1}). Combines the expressive power of large models with the better estimation of small models (cf. the bias-variance trade-off). Example: P_smooth(learning | statistical machine) = λ(3) q_3(learning | statistical machine) + λ(2) q_2(learning | machine) + λ(1) q_1(learning).
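A minimal sketch of this interpolation, reusing the hypothetical ml_ngram_model helper above and assuming fixed interpolation weights λ(n); real smoothers such as Kneser-Ney choose these quantities far more carefully.

```python
# Minimal sketch: interpolated smoothing across Markov orders 1..N.
def smoothed_prob(word, context, models, lambdas):
    """P_smooth(word | context) = sum_n lambda(n) * q_n(word | last n-1 context words)."""
    total = 0.0
    for n, lam in zip(range(1, len(models) + 1), lambdas):
        ctx = tuple(context[len(context) - (n - 1):]) if n > 1 else ()
        total += lam * models[n - 1].get((ctx, word), 0.0)
    return total

models = [ml_ngram_model(corpus, n) for n in (1, 2, 3)]  # unigram, bigram, trigram
print(smoothed_prob("learning", ("statistical", "machine"), models, (0.2, 0.3, 0.5)))
```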
11 Smoothing in Language Models [Chen and Goodman 1998] found that interpolated and modified Kneser-Ney smoothing are best under virtually all circumstances.
12 Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Computation Conclusions
13 Hierarchical Bayesian Models Hierarchical modelling is an important overarching theme in modern statistics [Gelman et al 1995, James & Stein 1961]. In machine learning, hierarchical models have been used for multitask learning, transfer learning, learning-to-learn and domain adaptation. (Graphical model: a shared parameter φ_0 with group-specific parameters φ_1, φ_2, φ_3 generating observations x_{1i}, x_{2i}, x_{3i} for i = 1...n_1, n_2, n_3.)
14 Context Tree The contexts of the conditional probabilities are naturally organized using a tree. Smoothing makes the conditional probabilities of neighbouring contexts more similar: P_smooth(learning | statistical machine) = λ(3) q_3(learning | statistical machine) + λ(2) q_2(learning | machine) + λ(1) q_1(learning). Later words in the context are more important in predicting the next word. (Tree of contexts: [machine], [statistical machine], [a machine], [Bayesian machine], [in statistical machine], [is statistical machine], ...)
15 Hierarchical Bayesian Models on Context Tree Parametrize the conditional probabilities of the Markov model: P(word_i = w | word_{i-N+1}^{i-1} = u) = G_u(w), where G_u = [G_u(w)]_{w ∈ vocabulary} is a probability vector associated with context u [MacKay and Peto 1994]. (Context tree: G_[], G_[machine], G_[statistical machine], G_[a machine], G_[Bayesian machine], G_[in statistical machine], G_[is statistical machine].)
16 Hierarchical Dirichlet Language Models What is P(G_u | G_pa(u))? [MacKay and Peto 1994] proposed using the standard Dirichlet distribution over probability vectors. (Table: perplexities of IKN, MKN and HDLM for various training sizes T and orders N-1.) We will use Pitman-Yor processes instead [Perman, Pitman and Yor 1992], [Pitman and Yor 1997], [Ishwaran and James 2001].
17 Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Computation Conclusions
18 Pitman-Yor Processes (Figure: word frequency against rank (according to frequency) for English text, a Pitman-Yor process and a Dirichlet process.)
19 Pitman-Yor Processes
20 Chinese Restaurant Processes Easiest to understand Pitman-Yor processes using Chinese restaurant processes. (Figure: customers x_1, ..., x_9 seated at tables serving dishes y_1, ..., y_4.)
p(sit at table k) = (c_k - d) / (θ + Σ_{j=1}^{K} c_j)
p(sit at new table) = (θ + dK) / (θ + Σ_{j=1}^{K} c_j)
p(table serves dish y) = H(y)
This defines an exchangeable stochastic process over sequences x_1, x_2, .... The de Finetti measure is the Pitman-Yor process: G ~ PY(θ, d, H), x_i | G ~ G for i = 1, 2, ... [Perman, Pitman & Yor 1992, Pitman & Yor 1997]
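A minimal simulation sketch of these seating rules (illustrative only; names and parameter values are assumptions, not from the talk):

```python
# Minimal sketch of the two-parameter (Pitman-Yor) Chinese restaurant process:
# table k with c_k customers is chosen with probability (c_k - d)/(theta + sum_j c_j),
# and a new table with probability (theta + d*K)/(theta + sum_j c_j).
import random

def crp_seat(table_counts, theta, d):
    """Return the table index for the next customer; len(table_counts) means a new table."""
    total = theta + sum(table_counts)
    r = random.random() * total
    for k, c_k in enumerate(table_counts):
        r -= c_k - d
        if r < 0:
            return k
    return len(table_counts)  # new table (its dish would be a fresh draw from H)

tables = []  # customer count at each table
for _ in range(1000):
    k = crp_seat(tables, theta=1.0, d=0.8)
    if k == len(tables):
        tables.append(1)
    else:
        tables[k] += 1
```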
21 Power Law Properties of Pitman-Yor Processes Chinese restaurant process: p(sit at table k) ∝ c_k - d, p(sit at new table) ∝ θ + dK. Pitman-Yor processes produce distributions over words that follow a power law with index 1 + d. Customers = word instances, tables = dictionary look-ups: a small number of common word types and a large number of rare word types. This is more suitable for languages than Dirichlet distributions. [Goldwater, Griffiths and Johnson 2005] investigated the Pitman-Yor process from this perspective.
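Continuing the simulation sketch above, sorting the table sizes shows the heavy-tailed, Zipf-like behaviour described here (again an illustration, not the talk's experiment):

```python
# With discount d > 0, a few tables are very large and many tables hold only one or
# two customers; plotting log(size) against log(rank) is approximately linear.
ranked = sorted(tables, reverse=True)
print("number of word types (tables):", len(ranked))
print("ten largest type counts:", ranked[:10])
```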
22 Power Law Properties of Pitman-Yor Processes (Figure: word frequency against rank (according to frequency) for English text, a Pitman-Yor process and a Dirichlet process.)
23 Power Law Properties of Pitman-Yor Processes
24 Hierarchical Pitman-Yor Language Models Parametrize the conditional probabilities of the Markov model: P(word_i = w | word_{i-N+1}^{i-1} = u) = G_u(w), where G_u = [G_u(w)]_{w ∈ vocabulary} is a probability vector associated with context u. Place a Pitman-Yor process prior on each G_u. (Context tree: G_[], G_[machine], G_[statistical machine], G_[a machine], G_[Bayesian machine], G_[in statistical machine], G_[is statistical machine].)
25 Hierarchical Pitman-Yor Language Models Significantly improved on the hierarchical Dirichlet language model. Results competitive with Kneser-Ney smoothing, the state-of-the-art language models. (Table: perplexities of IKN, MKN, HDLM and HPYLM for various training sizes T and orders N-1.) The similarity of perplexities is not a surprise: Kneser-Ney can be derived as a particular approximate inference method in this model.
26 Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Computation Conclusions
27 Markov Models for Language and Text Usually makes a Markov assumption to simplify the model: P(south parks road) ≈ P(south) · P(parks | south) · P(road | south parks). Language models: usually Markov models of order 2-4 (3-5-grams). How do we determine the order of our Markov models? Is the Markov assumption a reasonable assumption? Be nonparametric about Markov order...
28 Non-Markov Models for Language and Text Model the conditional probabilities of each possible word occurring after each possible context (of unbounded length). Use a hierarchical Pitman-Yor process prior to share information across all contexts. The hierarchy is infinitely deep. This is the sequence memoizer. (Context tree: G_[], G_[machine], G_[statistical machine], G_[a machine], G_[Bayesian machine], G_[in statistical machine], G_[is statistical machine], G_[this is statistical machine], ...)
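As a much-simplified illustration of conditioning on contexts of unbounded length (this is an ad-hoc back-off sketch, not the sequence memoizer's actual hierarchical Pitman-Yor inference):

```python
# Count next-symbol occurrences for every context that appears in training, then
# predict by recursively interpolating a context's counts with those of its
# longest proper suffix (drop the earliest symbol), down to the empty context.
from collections import Counter, defaultdict

def train(seq):
    counts = defaultdict(Counter)
    for i in range(len(seq)):
        for j in range(i + 1):              # every suffix of seq[:i], incl. the empty one
            counts[seq[j:i]][seq[i]] += 1
    return counts

def predict(counts, context, symbol, lam=0.5):
    """Back-off estimate of P(symbol | context)."""
    if context == "":
        total = sum(counts[""].values())
        return (counts[""][symbol] + 1) / (total + 256)   # add-one over a byte alphabet
    shorter = predict(counts, context[1:], symbol, lam)
    c = counts.get(context)
    if not c:
        return shorter
    return lam * c[symbol] / sum(c.values()) + (1 - lam) * shorter

counts = train("oacac")
print(predict(counts, "oaca", "c"))
```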
29 Overview Probabilistic Models for Language and Text Sequences The Sequence Memoizer Hierarchical Bayesian Modelling on Context Trees Modelling Power Laws with Pitman-Yor Processes Non-Markov Models Efficient Computation Conclusions
30 Model Size: Infinite -> O(T^2) The sequence memoizer model is very large (actually, infinite). Given a training sequence (e.g. o,a,c,a,c), most of the model can be ignored (integrated out), leaving a finite number of nodes in the context tree. But there are still O(T^2) nodes in the context tree... (Figure: context tree rooted at H and G_[], with nodes G_[o], G_[a], G_[c], G_[oa], G_[ca], G_[ac], G_[oac], G_[aca], G_[cac], G_[oaca], G_[acac], G_[oacac].)
31 Model Size: Infinite -> O(T^2) -> 2T Idea: integrate out the non-branching, non-leaf nodes of the context tree. The resulting tree is related to a suffix tree data structure, and has at most 2T nodes. There are linear time construction algorithms [Ukkonen 1995]. (Figure: compressed context tree rooted at H and G_[], with remaining nodes G_[o], G_[a], G_[oa], G_[ac], G_[oac], G_[oaca], G_[oacac] and edges labelled by strings such as "ac" and "oac".)
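A small sketch of the counting argument (illustrative only; it demonstrates the node count, not the compression itself): inserting every observed context into a trie creates one node per distinct substring of the training sequence, which is O(T^2); collapsing non-branching interior nodes, as a suffix tree does, leaves at most 2T nodes.

```python
# Build the uncompressed context trie for a training string: each node is a context
# (read most-recent-symbol first) that occurs in the data.
def context_trie(seq):
    root = {}
    for i in range(1, len(seq) + 1):
        node = root
        for symbol in reversed(seq[:i]):   # context of position i, most recent symbol first
            node = node.setdefault(symbol, {})
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.values())

print(count_nodes(context_trie("oacac")))  # 13 nodes here; O(T^2) in general
```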
32 Closure under Marginalization In marginalizing out non-branching interior nodes, we need to ensure that the resulting conditional distributions are still tractable. (Figure: G_[ca] | G_[a] ~ PY(θ_2, d_2, G_[a]) and G_[aca] | G_[ca] ~ PY(θ_3, d_3, G_[ca]); after marginalizing out G_[ca], what is the conditional of G_[aca] given G_[a]?) E.g. if each conditional is Dirichlet, the resulting conditional is not of known analytic form.
33 Closure under Marginalization In marginalizing out non-branching interior nodes, we need to ensure that the resulting conditional distributions are still tractable. If G_[ca] | G_[a] ~ PY(θ_2, d_2, G_[a]) and G_[aca] | G_[ca] ~ PY(θ_2 d_3, d_3, G_[ca]), then marginally G_[aca] | G_[a] ~ PY(θ_2 d_3, d_2 d_3, G_[a]). For certain parameter settings, Pitman-Yor processes are closed under marginalization! [Pitman 1999, Ho, James & Lau 2006]
34 Coagulation and Fragmentation Operators (Figures, built up over slides 34-37: two partitions π_1 and π_2, with blocks labelled A, B, C, D; the coagulation operator merges blocks to give a coarser partition, and the fragmentation operator splits blocks to recover a finer one.)
38 Coagulation and Fragmentation Operators The following statements are equivalent: (I) π_2 ~ CRP_[n](α d_2, d_2) and π_1 | π_2 ~ CRP_{π_2}(α, d_1); (II) C ~ CRP_[n](α d_2, d_1 d_2) and F_a | C ~ CRP_a(-d_1 d_2, d_2) for each block a of C, where C is the coagulated partition and the F_a are its fragmentations. (Figure: coagulate/fragment diagram with blocks A, B, C, D.)
39 Final Model Specification Probability of a sequence: P(x_{1:T}) = ∏_{i=1}^{T} P(x_i | x_{1:i-1}) = ∏_{i=1}^{T} G_{x_{1:i-1}}(x_i). Prior over conditional probabilities: G_[] ~ PY(θ_[], d_[], H), and G_u | G_σ(u) ~ PY(θ_u, d_u, G_σ(u)) for u ∈ Σ* \ {[]}, where σ(u) is the context u with its earliest symbol dropped. Constraint on parameters: θ_u = θ_[] ∏_{v: v a non-empty suffix of u} d_v, so that the marginalization property above holds.
40 Comparison to Finite Order HPYLM
41 Inference using Gibbs Sampling (Figure: number of tables against number of observations, for up to 10^7 observations.)
42 Inference using Gibbs Sampling
43 Entropic Coding for Compression Encoder: each symbol x_i is coded using the model's predictive probability P(x_i | x_1...x_{i-1}, θ_i), at a cost of -log_2 P(x_i | x_1...x_{i-1}, θ_i) bits; the decoder runs the same model to reconstruct x_i. θ_i is the parameter value estimated from x_1...x_{i-1}. A good probabilistic model = a good compressor. (Portrait: Claude Shannon.)
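A minimal sketch of the ideal code-length computation (the predictive model here is a deliberately poor placeholder; arithmetic coding attains this length to within a couple of bits):

```python
# Total code length in bits = sum_i -log2 P(x_i | x_1...x_{i-1}): better predictions
# mean fewer bits, so a good probabilistic model is a good compressor.
import math

def code_length_bits(seq, predictive_prob):
    """predictive_prob(symbol, history) -> P(symbol | history), supplied by the model."""
    return sum(-math.log2(predictive_prob(s, seq[:i])) for i, s in enumerate(seq))

# A uniform model over 256 byte values costs exactly 8 bits per symbol.
print(code_length_bits(b"abracadabra", lambda s, h: 1.0 / 256))
```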
44 Compression Results (Calgary corpus, average bits/byte): gzip 2.61; bzip (value missing); CTW (Context Tree Weighting) 1.99; PPM (Prediction by Partial Matching) 1.93; Sequence Memoizer 1.89. SM inference: particle filter. Online inference, entropic coding.
45 Related Works Infinite Markov models [Mochihashi & Sumita 2008]. Bayesian nonparametric grammars (Goldwater, Johnson, Blunsom, Cohn, etc.). Text compression: Prediction by Partial Matching [Cleary & Witten 1984], Context Tree Weighting [Willems et al 1995]. Language model smoothing algorithms [Chen & Goodman 1998, Kneser & Ney 1995]. Variable length/order/memory Markov models [Ron et al 1996, Buhlmann & Wyner 1999, Begleiter et al 2004]. Hierarchical Bayesian nonparametric models [Teh & Jordan 2010].
46 Conclusions Probabilistic models of sequences without making Markov assumptions, with efficient construction and inference algorithms. State-of-the-art text compression and language modelling results. Hierarchical Bayesian modelling leads to improved performance. Pitman-Yor processes allow us to encode prior knowledge about power-law properties, leading to improved performance. Hierarchical Pitman-Yor processes have been used successfully for various more linguistically motivated models. Implementations: a Java implementation, a text compression demo, and a C++ implementation on Jan Gasthaus's webpage.
47 Publications
A Hierarchical Bayesian Language Model based on Pitman-Yor Processes. Y. W. Teh. Coling/ACL 2006.
A Bayesian Interpretation of Interpolated Kneser-Ney. Y. W. Teh. Technical Report TRA2/06, School of Computing, NUS, 2006 (revised).
A Stochastic Memoizer for Sequence Data. F. Wood, C. Archambeau, J. Gasthaus, L. F. James and Y. W. Teh. ICML 2009.
Text Compression Using a Hierarchical Pitman-Yor Process Prior. J. Gasthaus, F. Wood and Y. W. Teh. DCC 2010.
Forgetting Counts: Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process. N. Bartlett, D. Pfau and F. Wood. ICML 2010.
Some Improvements to the Sequence Memoizer. J. Gasthaus and Y. W. Teh. NIPS 2010.
The Sequence Memoizer. F. Wood, J. Gasthaus, C. Archambeau, L. F. James and Y. W. Teh. CACM 2011.
Hierarchical Bayesian Nonparametric Models with Applications. Y. W. Teh and M. I. Jordan, in Bayesian Nonparametrics, Cambridge University Press, 2010.
48 Thank You! Acknowledgements: Frank Wood, Jan Gasthaus, Cedric Archambeau, Lancelot James Lee Kuan Yew Foundation Gatsby Charitable Foundation