CSCI 5822 Probabilistic Models of Human and Machine Learning. Mike Mozer, University of Colorado


Topics
Language modeling
Hierarchical processes
Pitman-Yor processes
Based on the work of Teh (2006), "A hierarchical Bayesian language model based on Pitman-Yor processes"

Probabilistic model of language: n-gram model
Utility: P(word_i | word_{i-n+1}, ..., word_{i-1})
Typically a trigram model (n = 3) is used, e.g., in speech and handwriting recognition.
Challenge: if n is too small, the model doesn't capture the regularities of language; if n is too large, there is insufficient data.
Past approaches have used heuristics for choosing an appropriate n, averaged predictions across a range of values of n, or applied smoothing hacks of various sorts. The hierarchical Pitman-Yor probabilistic model addresses this challenge via a principled approach with explicit assumptions.
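To make the data-sparsity problem concrete, here is a minimal count-based trigram model with maximum-likelihood estimates (an illustrative sketch; the function names are invented and this is not part of Teh's model). Any trigram unseen in training gets probability zero, which is exactly what smoothing heuristics and the hierarchical model below are meant to fix.

```python
# Minimal MLE trigram model; illustrates why raw counts are not enough.
from collections import defaultdict

def train_trigram(corpus):
    """corpus: list of sentences, each a list of word tokens."""
    tri, bi = defaultdict(int), defaultdict(int)
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(toks)):
            bi[(toks[i - 2], toks[i - 1])] += 1
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
    return tri, bi

def p_mle(w, context, tri, bi):
    """P(w | context) by maximum likelihood; zero for any unseen trigram."""
    denom = bi[context]
    return tri[context + (w,)] / denom if denom else 0.0

tri, bi = train_trigram([["the", "mean", "dog", "ate", "the", "small", "dog"]])
print(p_mle("dog", ("the", "small"), tri, bi))   # 1.0: seen once in training
print(p_mle("cat", ("the", "small"), tri, bi))   # 0.0: never seen
```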

The Basic Story
Dirichlet process: a distribution over distributions of a countably infinite number of alternatives.
E.g., the Dirichlet process mixture model: a generative model of points in a real-valued space, partitioning the points into a countably infinite number of clusters.
In principle, we could use a Dirichlet process to model language: draws from the DP would represent a distribution over words (not clusters).
In terms of the CRP: each table represents one word type; each diner represents one word token.
Types vs. tokens, e.g., "The_1 mean dog_1 ate the_2 small dog_2" (two tokens each of the types "the" and "dog").

Zipf's Law
Rank word types by frequency and plot log frequency vs. log rank: the relationship is approximately linear.

Dirichlet distribution: a distribution over distributions over a finite set of alternatives, e.g., uncertainty in a distribution over a finite set of words.
Dirichlet process: a distribution over distributions over a countably infinite set of alternatives.
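As a quick illustration of the finite case (a sketch using NumPy; the toy vocabulary is made up), a single draw from a Dirichlet distribution is itself a probability distribution over the vocabulary:

```python
# One draw from a Dirichlet distribution over a 5-word vocabulary.
import numpy as np

vocab = ["the", "dog", "ate", "small", "mean"]
concentration = np.ones(len(vocab))            # symmetric Dirichlet parameters
theta = np.random.dirichlet(concentration)     # a random distribution over vocab
print(dict(zip(vocab, theta.round(3))), "sum =", theta.sum())   # sums to 1
```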

[Diagram, built up across several slides:
Dirichlet distribution, taken to the infinite case, gives the Dirichlet process; a draw from the DP, with the draw collapsed out, gives a sample from the CRP; the DP is defined via the stick-breaking process with β_k ~ Beta(1, α).
Pitman-Yor distribution, taken to the infinite case, gives the Pitman-Yor process; a draw, with the draw collapsed out, gives a sample from a "fancy CRP"; the PY process is defined via a stick-breaking process with β_k ~ Beta(1 − d, θ + dk).
The Dirichlet distribution has a simple analytic form; the Pitman-Yor distribution has no known analytic form.]

Pitman-Yor
Pitman-Yor distribution: a generalization of the Dirichlet distribution; no known analytic form for the density of PY for a finite vocabulary.
Pitman-Yor process: G ~ PY(d, α, G_0)
d: discount parameter, 0 ≤ d < 1
α: concentration parameter, α > −d
Large α: more tables occupied.
Large d: the population is spread across more tables.
The DP is a special case of the PY process: PY(0, α, G_0) = DP(α, G_0); see the stick-breaking construction.
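A minimal sketch of the stick-breaking construction referenced above (the function name and truncation level are mine): with d = 0 it reduces to the familiar DP stick-breaking, while d > 0 produces heavier-tailed weights.

```python
# Truncated stick-breaking weights for PY(d, alpha): beta_k ~ Beta(1-d, alpha + k*d).
import numpy as np

def py_stick_weights(d, alpha, K=1000):
    k = np.arange(1, K + 1)
    betas = np.random.beta(1 - d, alpha + d * k)             # break points
    remaining = np.concatenate(([1.0], np.cumprod(1 - betas)[:-1]))
    return betas * remaining                                  # weights pi_k, sum ~ 1

pi_dp = py_stick_weights(d=0.0, alpha=10.0)   # DP special case: PY(0, alpha)
pi_py = py_stick_weights(d=0.5, alpha=10.0)   # Pitman-Yor: heavier tail of weights
print(pi_dp[:5].round(3), pi_py[:5].round(3))
```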

Understanding G_0
With a Gaussian mixture model, the domain of G_0 is a real-valued vector: G_0(θ), where θ = (µ, Σ) specifies the mean µ and covariance Σ of a Gaussian bump, and G_0(θ) represents the prior distribution over Gaussian parameters.
With natural language, think of the domain of G_0 as all letter strings (potential words): G_0(w) is the prior probability of word w.

Drawing a Sample of Words Directly from the PY Process Prior
Chinese restaurant process:
person 1 comes into the restaurant and sits at table 1;
person c+1 comes into the restaurant and sits at one of the populated tables k, k ∈ {1, ..., t}, with probability ∝ c_k (the number of people at table k), or sits at a new table (t+1) with probability ∝ α.
Modified CRP:
person 1 comes into the restaurant and sits at table 1;
person c+1 comes into the restaurant and sits at one of the populated tables k, k ∈ {1, ..., t}, with probability ∝ c_k − d, or sits at a new table (t+1) with probability ∝ α + d·t.

Pitman-Yor
Why the restrictions 0 ≤ d < 1 and α > −d?
Generalized version of the CRP:
person 1 comes into the restaurant and sits at table 1;
person c+1 sits at one of the populated tables k, k ∈ {1, ..., t}, with probability ∝ c_k − d, or sits at a new table (t+1) with probability ∝ α + d·t.
(The restrictions keep every seating weight positive: c_k − d ≥ 1 − d > 0 for occupied tables, and α + d·t > 0 once at least one table is occupied.)

Pitman-Yor
Like the CRP, there is the notion that a specific meal is served at each table:
θ_k ~ G_0 is the meal for table k;
φ_c = θ_k is the meal instance served to individual c sitting at table k.
With the DP mixture model, the meal served was the parameters of a Gaussian bump; with the DP/PY language model, the meal served is a word.
Types vs. tokens, e.g., "The_1 mean dog_1 ate the_2 small dog_2."

Algorithm for drawing word c+1 according to Pitman-Yor

if c = 0                          /* special case for the first word */
    t ← 1                         /* t: number of tables */
    θ_t ~ G_0                     /* meal associated with table t */
    c_t ← 1                       /* c_t: count of customers at table t */
    φ_c ← θ_t                     /* φ_c: meal associated with customer c */
else
    choose k ∈ {1, ..., t} with probability ∝ c_k − d,
        or k = t+1 with probability ∝ α + d·t
    if k = t + 1
        t ← t + 1
        θ_t ~ G_0
        c_t ← 1
    else
        c_k ← c_k + 1
    endif
    φ_c ← θ_k
endif
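A runnable Python rendering of this algorithm, as a sketch (the function name draw_word_py, the mutable table lists, and the callable base distribution G0 are my choices, not from the slides):

```python
# One customer arrives at a single Pitman-Yor restaurant and is seated.
import random

def draw_word_py(meals, counts, d, alpha, G0):
    """meals[k]: word served at table k; counts[k]: customers at table k.
    Mutates both lists and returns the word served to the new customer."""
    t = len(meals)
    if t == 0:                                     # first customer: open table 1
        meals.append(G0()); counts.append(1)
        return meals[0]
    weights = [counts[k] - d for k in range(t)]    # existing table k: c_k - d
    weights.append(alpha + d * t)                  # new table: alpha + d*t
    k = random.choices(range(t + 1), weights=weights)[0]
    if k == t:                                     # open a new table with a fresh meal
        meals.append(G0()); counts.append(1)
    else:
        counts[k] += 1
    return meals[k]
```

Calling it repeatedly with, say, G0 = lambda: random.choice(vocabulary) simulates drawing a stream of word tokens from the PY prior.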

Pitman-Yor Produces a Power-Law Distribution
[Figure: number of unique words (t) and proportion of unique words as a function of the number of customers c, for α ∈ {1, 10} and d ∈ {0, .5, .9}.]
d is the asymptotic proportion of words appearing only once; note that for d = 0 there are asymptotically no single-token words.
Number of unique words as a function of c:
d > 0: O(α c^d)
d = 0: O(α log c)
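A quick simulation of these growth rates, reusing the draw_word_py sketch above (the integer-generating base distribution is just a convenience so that every new table gets a distinct word):

```python
# Count occupied tables (unique words) as customers arrive, for several d.
import itertools

counter = itertools.count()
G0 = lambda: next(counter)            # every new table gets a brand-new "word"
for d in (0.0, 0.5, 0.9):
    meals, counts = [], []
    for _ in range(10000):
        draw_word_py(meals, counts, d=d, alpha=10.0, G0=G0)
    print(f"d={d}: {len(meals)} tables after 10,000 customers")
# Expect roughly alpha*log(c) tables for d = 0 and alpha*c**d growth for d > 0.
```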

Hierarchical Pitman-Yor (or Dirichlet) Process
Suppose you want to model where people hang out in a town (hot spots).
It is not known in advance how many locations need to be modeled.
Some spots are generally popular, others not so much, but individuals also have preferences that deviate from the population preference; e.g., bars are popular, but not for individuals who don't drink.
We need to model the distribution over hot spots at the level of both the population and the individual.

Hierarchical Pitman-Yor (or Dirichlet) Process
[Diagram: each individual distribution is drawn from a process whose base distribution is the population distribution.]

From the Two-Level Hot-Spot Model to a Multilevel Word Model: Capturing Levels of Generality
Hot-spot model: P(hot spot), P(hot spot | individual)
Word model: P(word_i), P(word_i | word_{i-1}), P(word_i | word_{i-1}, word_{i-2})
Intuition: use the more general distribution as the basis for the more context-specific distribution. P(word_i | word_{i-1}) is a specialized version of P(word_i); P(word_i | word_{i-1}, word_{i-2}) is a specialized version of P(word_i | word_{i-1}).
Formally: define G_u recursively, G_u ~ PY(d_|u|, α_|u|, G_π(u)), anchored at the global base distribution G_0, where u is the word context (the preceding n−1 words) and π(u) is the suffix of u consisting of all but the earliest word, e.g., π("united states supreme") = "states supreme".
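A tiny sketch of the recursion's backbone (the tuple representation and the helper name parent are mine): each context backs off to the suffix that drops its earliest word, all the way down to the empty context.

```python
# Walk the chain of contexts u -> pi(u) -> ... -> () used by the hierarchy.
def parent(u):
    """pi(u): drop the earliest word of the context."""
    return u[1:]

u = ("united", "states", "supreme")
while u:
    print(u, "->", parent(u))
    u = parent(u)
# ('united', 'states', 'supreme') -> ('states', 'supreme')
# ('states', 'supreme') -> ('supreme',)
# ('supreme',) -> ()
```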

Sampling with the Hierarchical CRP
Imagine a different restaurant for every possible context u, e.g., "supreme", "states supreme", "united states supreme", "united states of", ..., each with its own set of tables and its own population of diners.
The meals served at the popular tables in restaurant u will tend to be the same as the meals at the popular tables in restaurant π(u).
Consider restaurants (contexts) u and π(u), e.g., u = "the united" and π(u) = "united", and some past (in the generative model) assignments of tables based on "united states", "united we stand", "united under god", "the united states".

Notation
c_u.k: number of diners in restaurant u assigned to table k (each diner is one token, in context u, of the word served at that table).
But we also need to represent the meal served at a table:
c_uw.: number of diners in restaurant u assigned to a table with meal w, i.e., the number of tokens of word w appearing in context u. NOTE: unlike the DPMM, this count is an observation.
c_uwk: number of diners in restaurant u assigned to table k when table k serves meal w; the number of tokens of word w in context u seated at table k.

t_u..: formerly just t; the number of occupied tables in restaurant u (at least one table per word type appearing in context u).
t_uw.: the number of tables in restaurant u serving meal w (≥ 1 if word w appears in context u, 0 otherwise).
t_uwk: 1 if restaurant u has meal w assigned to table k, 0 otherwise.
The redundant indexing of c and t by both k and w is necessary to allow reference either by table (for the generative model) or by meal (for the probability computation).
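One possible in-memory layout for this bookkeeping (a sketch of my own, not Teh's implementation): store, for each context u, the meal and customer count of every occupied table, and derive c_uw., c_u.., t_uw., t_u.. from them.

```python
# Per-context restaurant state; the summary counts above are derived views.
from collections import defaultdict

class Restaurant:
    def __init__(self):
        self.meals = []     # meals[k]:  word served at table k   (theta_uk)
        self.counts = []    # counts[k]: customers at table k     (c_u.k)

    def c_uw(self, w):      # c_uw. : tokens of word w in this context
        return sum(c for m, c in zip(self.meals, self.counts) if m == w)

    def t_uw(self, w):      # t_uw. : occupied tables serving word w
        return sum(1 for m in self.meals if m == w)

    def c_u(self):          # c_u.. : total customers in this restaurant
        return sum(self.counts)

    def t_u(self):          # t_u.. : total occupied tables
        return len(self.meals)

restaurants = defaultdict(Restaurant)   # one restaurant per context tuple u
```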

Algorithm for drawing word w in context u

Function w = DrawWord(u)
    if u = ∅
        w ~ G_0
    else
        choose k ∈ {1, ..., t_u..} with probability ∝ c_u.k − d_u,
            or k = t_u.. + 1 with probability ∝ α_u + d_u t_u..
        if k = t_u.. + 1                    /* new table drawn, or restaurant is empty */
            θ_uk ← DrawWord(π(u))           /* COOL RECURSION! */
            w ← θ_uk
            c_uwk ← 1
            t_uwk ← 1
        else
            w ← θ_uk
            c_uwk ← c_uwk + 1
        endif
    endif
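A Python sketch of DrawWord built on the Restaurant layout above (my construction, following the slide's pseudocode; d and alpha are per-level sequences indexed by context length, and the recursion bottoms out at the base distribution G0 once even the empty context defers to its parent):

```python
# Recursive generative draw for the hierarchical Pitman-Yor language model.
import random

def DrawWord(u, restaurants, d, alpha, G0):
    """u: context tuple, most recent word last; None means 'below the root'."""
    if u is None:
        return G0()                                  # bottom out at the base distribution
    r, m = restaurants[u], len(u)
    t = r.t_u()
    weights = [r.counts[k] - d[m] for k in range(t)]   # existing table: c_u.k - d_u
    weights.append(alpha[m] + d[m] * t)                # new table: alpha_u + d_u * t_u..
    k = random.choices(range(t + 1), weights=weights)[0]
    parent_u = u[1:] if u else None                    # pi(u); () defers to G0
    if k == t:                                         # new table: ask the parent restaurant
        w = DrawWord(parent_u, restaurants, d, alpha, G0)
        r.meals.append(w); r.counts.append(1)
    else:
        w = r.meals[k]; r.counts[k] += 1
    return w
```

A trigram model would call it with the two preceding words as the context, e.g., DrawWord(("united", "states"), restaurants, d, alpha, G0).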

Inference
Given training data D, compute the predictive posterior over words w for a given context u:
P(w | u, D) = ∫ P(w | u, S, Θ) P(S, Θ | D) dS dΘ
where Θ = {α_m, d_m : 0 ≤ m ≤ n−1} and S is the seating arrangement.
Because of the structured relationship among the contexts, this probability depends on more than the empirical count; e.g., p("of" | "united states") will be influenced by p("of" | "states"), p("of"), p("of" | "altered states"), etc.
Compute P(w | u, S, Θ) with the recursive interpolation
P(w | u, S, Θ) = (c_uw. − d_|u| t_uw.) / (α_|u| + c_u..) + ((α_|u| + d_|u| t_u..) / (α_|u| + c_u..)) P(w | π(u), S, Θ),
which bottoms out at G_0(w).
Estimate P(S, Θ | D) with MCMC.
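A sketch of that recursive interpolation on the Restaurant layout from above (my code; it assumes G0prob(w) returns the base probability of w, and d, alpha are the per-level parameter sequences):

```python
# Predictive probability P(w | u, S, Theta) for the hierarchical PY model.
def p_word(w, u, restaurants, d, alpha, G0prob):
    if u is None:                                   # below the root: base distribution
        return G0prob(w)
    r, m = restaurants[u], len(u)
    parent_u = u[1:] if u else None
    backoff = p_word(w, parent_u, restaurants, d, alpha, G0prob)
    c, t = r.c_u(), r.t_u()
    if c == 0:                                      # empty restaurant: pure backoff
        return backoff
    return ((r.c_uw(w) - d[m] * r.t_uw(w))          # discounted count for w
            + (alpha[m] + d[m] * t) * backoff) / (alpha[m] + c)
```

Summing this quantity over the vocabulary gives 1 for any context, which is what makes it a proper predictive distribution.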

Gibbs Sampling of Seat Assignments
For a given individual (l) in a given restaurant (u), there are only a few possible choices for the seat assignment (k_ul):
if the individual's meal is already assigned to one or more tables, they can sit at one of those tables;
or a new table can be created that serves the individual's meal.
Sampling of Parameters Θ
The parameters are sampled using an auxiliary-variable sampler, as detailed in Teh (2006).
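A sketch of one such Gibbs update for a single customer, on the same Restaurant layout (my construction; for brevity it omits the bookkeeping Teh's sampler also performs in the parent restaurant, where each table of u contributes one customer to π(u)):

```python
# Remove one customer from their table, then reseat them given their word w.
import random

def reseat_customer(u, k_old, restaurants, d, alpha, G0prob):
    r, m = restaurants[u], len(u)
    w = r.meals[k_old]
    r.counts[k_old] -= 1                       # take the customer out
    if r.counts[k_old] == 0:                   # table emptied: remove it
        del r.counts[k_old]; del r.meals[k_old]
    same_meal = [k for k, meal in enumerate(r.meals) if meal == w]
    weights = [r.counts[k] - d[m] for k in same_meal]          # existing tables serving w
    parent_u = u[1:] if u else None
    weights.append((alpha[m] + d[m] * r.t_u())                 # new table, weighted by the
                   * p_word(w, parent_u, restaurants, d, alpha, G0prob))  # parent's P(w)
    choice = random.choices(range(len(weights)), weights=weights)[0]
    if choice == len(same_meal):               # open a fresh table serving w
        r.meals.append(w); r.counts.append(1)
    else:
        r.counts[same_meal[choice]] += 1
```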

Implementation
G_0(w) = 1/V for all w (V = vocabulary size)
d_i ~ Uniform(0, 1)
α_i ~ Gamma(1, 1)
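Tying this to the per-level parameters used in the sketches above (a hedged illustration; the vocabulary size V and n-gram order n here are assumed values, not from the slides):

```python
# Initialize the base distribution and per-level hyperparameters Theta.
import numpy as np

V, n = 10000, 3                                  # assumed vocabulary size and n-gram order
G0prob = lambda w: 1.0 / V                       # G_0(w) = 1/V for all w
d = np.random.uniform(0.0, 1.0, size=n)          # d_m ~ Uniform(0, 1), m = 0..n-1
alpha = np.random.gamma(1.0, 1.0, size=n)        # alpha_m ~ Gamma(1, 1)
```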

Results
IKN, MKN: traditional language-modeling methods (interpolated and modified Kneser-Ney smoothing).
HPYLM: hierarchical Pitman-Yor language model with Gibbs sampling.
HPYCV: hierarchical Pitman-Yor with a cross-validation hack to determine the parameters.
HDLM: hierarchical Dirichlet language model.