Machine Learning Summer School, Austin, TX January 08, 2015


Parametric ...
Department of Information, Risk, and Operations Management / Department of Statistics and Data Sciences, The University of Texas at Austin
Machine Learning Summer School, Austin, TX, January 08, 2015
1 / 45

Outline:
Poisson, gamma, and negative binomial distributions
Bayesian inference for the negative binomial distribution
Regression models for counts
Latent variable models for discrete data: latent Dirichlet allocation and Poisson factor analysis
[Figure: matrix factorization X = ΦΘ, drawn both as images (P x N) = dictionary (P x K) x sparse codes (K x N) and as a words-by-documents count matrix = (words x topics) x (topics x documents)]
2 / 45

Count data is common
Nonnegative and discrete: number of auto insurance claims / highway accidents / crimes; consumer behavior, labor mobility, marketing, voting; photon counting; species sampling; text analysis; infectious diseases, Google Flu Trends; next-generation sequencing (statistical genomics).
Mixture modeling can be viewed as a count-modeling problem:
Number of points in a cluster (in a mixture model, we are modeling a count vector)
Number of words assigned to topic k in document j (we are modeling a K x J latent count matrix in a topic model / mixed-membership model)
3 / 45

Siméon-Denis Poisson (21 June 1781 - 25 April 1840)
"Life is good for only two things: doing mathematics and teaching it." - Poisson
http://en.wikipedia.org
4 / 45

Poisson distribution
x ~ Pois(λ). Probability mass function:
P(x | λ) = λ^x e^{-λ} / x!,  x ∈ {0, 1, ...}
The mean and variance are the same: E[x] = Var[x] = λ.
Restrictive for modeling over-dispersed counts (variance greater than the mean), which are commonly observed in practice.
A basic building block for constructing more flexible count distributions.
Over-dispersed counts are commonly observed due to:
Heterogeneity: differences between individuals
Contagion: dependence between the occurrences of events
5 / 45

Mixed Poisson distribution
x ~ Pois(λ), λ ~ f_Λ(λ)
Mixing the Poisson rate parameter with a positive distribution leads to a mixed Poisson distribution, which is always over-dispersed.
Law of total expectation: E[x] = E[E[x | λ]] = E[λ].
Law of total variance: Var[x] = Var[E[x | λ]] + E[Var[x | λ]] = Var[λ] + E[λ].
Thus Var[x] > E[x] unless λ is a constant.
The gamma distribution is a popular choice as it is conjugate to the Poisson.
6 / 45

Negative binomial distribution
Mixing the gamma distribution with the Poisson as
x ~ Pois(λ),  λ ~ Gamma(r, p / (1 - p)),
where p / (1 - p) is the gamma scale parameter, leads to the negative binomial distribution x ~ NB(r, p) with probability mass function
P(x | r, p) = [Γ(x + r) / (x! Γ(r))] p^x (1 - p)^r,  x ∈ {0, 1, ...}
7 / 45
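
This gamma-Poisson construction is easy to check by simulation. The sketch below is a minimal example assuming NumPy; the values of r and p are arbitrary.

```python
# Draw gamma rates with scale p/(1-p), then Poisson counts given those rates;
# the resulting counts should behave like NB(r, p).
import numpy as np

rng = np.random.default_rng(0)
r, p, n = 2.5, 0.4, 200_000

lam = rng.gamma(shape=r, scale=p / (1 - p), size=n)   # gamma-distributed rates
x = rng.poisson(lam)                                  # Poisson counts given the rates

print(x.mean(), r * p / (1 - p))        # both close to 1.667
print(x.var(), r * p / (1 - p) ** 2)    # both close to 2.778
```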

Compound Poisson distribution
A compound Poisson random variable is the summation of a Poisson random number of i.i.d. random variables: if x = Σ_{i=1}^{n} y_i, where n ~ Pois(λ) and the y_i are i.i.d. random variables, then x is a compound Poisson random variable.
The negative binomial random variable x ~ NB(r, p) can also be generated as a compound Poisson random variable:
x = Σ_{i=1}^{l} u_i,  l ~ Pois[-r ln(1 - p)],  u_i ~ Log(p),
where u ~ Log(p) is the logarithmic distribution with probability mass function
P(u | p) = -p^u / [u ln(1 - p)],  u ∈ {1, 2, ...}
8 / 45
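
The compound Poisson representation can be checked the same way. A sketch assuming NumPy, whose logseries generator matches Log(p) and whose negative_binomial generator is parameterized by the success probability 1 - p:

```python
# Draw l ~ Pois(-r ln(1-p)) "tables" and sum l logarithmic(p) variables; the sum
# should match direct NB(r, p) draws in distribution.
import numpy as np

rng = np.random.default_rng(1)
r, p, n = 2.5, 0.4, 50_000

l = rng.poisson(-r * np.log(1 - p), size=n)
x = np.array([rng.logseries(p, size=k).sum() if k > 0 else 0 for k in l])

x_nb = rng.negative_binomial(r, 1 - p, size=n)   # NumPy parameterizes by 1 - p
print(x.mean(), x_nb.mean())                     # both close to r p / (1 - p) = 1.667
print(x.var(), x_nb.var())                       # both close to r p / (1 - p)^2 = 2.778
```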

Negative binomial distribution m ~ NB(r, p)
r is the dispersion parameter; p is the probability parameter.
Probability mass function:
f_M(m | r, p) = [Γ(r + m) / (m! Γ(r))] p^m (1 - p)^r = (-1)^m (-r choose m) p^m (1 - p)^r
It is a gamma-Poisson mixture.
It is a compound Poisson distribution.
Its mean is E[m] = r p / (1 - p), and its variance r p / (1 - p)^2 is greater than its mean:
Var[m] = E[m] + (E[m])^2 / r
9 / 45

The conjugate prior for the negative binomial probability parameter p is the beta distribution: if m_i ~ NB(r, p), p ~ Beta(a_0, b_0), then
(p | -) ~ Beta(a_0 + Σ_{i=1}^{n} m_i, b_0 + n r)
The conjugate prior for the negative binomial dispersion parameter r is unknown, but a simple data augmentation technique yields closed-form Gibbs sampling update equations for r.
10 / 45

Chinese restaurant table (CRT) distribution
If we assign m customers to tables using a Chinese restaurant process with concentration parameter r, then the random number of occupied tables l follows the Chinese restaurant table (CRT) distribution
f_L(l | m, r) = [Γ(r) / Γ(m + r)] |s(m, l)| r^l,  l = 0, 1, ..., m,
where |s(m, l)| are unsigned Stirling numbers of the first kind.
The joint distribution of the customer count m ~ NB(r, p) and the table count l is the Poisson-logarithmic bivariate count distribution
f_{M,L}(m, l | r, p) = [|s(m, l)| r^l / m!] (1 - p)^r p^m
11 / 45

Poisson-logarithmic bivariate count distribution
Probability mass function:
f_{M,L}(m, l; r, p) = [|s(m, l)| r^l / m!] (1 - p)^r p^m
It is clear that the gamma distribution is a conjugate prior for r under this bivariate count distribution.
The following two ways of generating the joint customer count and table count are equivalent:
(1) Draw NegBino(r, p) customers, then assign the customers to tables using a Chinese restaurant process with concentration parameter r;
(2) Draw Poisson(-r ln(1 - p)) tables, then draw Logarithmic(p) customers at each table.
12 / 45
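
The equivalence of the two generative schemes can also be checked by simulation; this is a sketch assuming NumPy with arbitrary r and p.

```python
import numpy as np

rng = np.random.default_rng(2)
r, p, n = 2.0, 0.5, 50_000

# Scheme 1: draw m ~ NB(r, p) customers, then seat them with a CRP(r), so that
# l | m is a sum of Bernoulli(r / (r + t - 1)) indicators, t = 1, ..., m.
m1 = rng.negative_binomial(r, 1 - p, size=n)
l1 = np.array([(rng.random(m) < r / (r + np.arange(m))).sum() for m in m1])

# Scheme 2: draw l ~ Pois(-r ln(1-p)) tables, then Log(p) customers at each table.
l2 = rng.poisson(-r * np.log(1 - p), size=n)
m2 = np.array([rng.logseries(p, size=k).sum() if k > 0 else 0 for k in l2])

print(m1.mean(), m2.mean())   # both close to r p / (1 - p) = 2.0
print(l1.mean(), l2.mean())   # both close to -r ln(1 - p) ≈ 1.386
```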

Bayesian inference for the negative binomial count model:
m_i ~ NegBino(r, p),  p ~ Beta(a_0, b_0),  r ~ Gamma(e_0, 1/f_0).
Gibbs sampling via data augmentation:
(p | -) ~ Beta(a_0 + Σ_{i=1}^{n} m_i, b_0 + n r);
(l_i | -) = Σ_{t=1}^{m_i} b_t,  b_t ~ Bernoulli(r / (t + r - 1));
(r | -) ~ Gamma(e_0 + Σ_{i=1}^{n} l_i, 1 / (f_0 - n ln(1 - p))).
Expectation-Maximization
Variational Bayes
13 / 45
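
A minimal sketch of this data-augmented Gibbs sampler, assuming NumPy and synthetic counts drawn from NB(r = 1, p = 0.5); the hyperparameter values a0, b0, e0, f0 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m = rng.negative_binomial(1.0, 0.5, size=500)   # NumPy's second argument is 1 - p
n, a0, b0, e0, f0 = len(m), 0.01, 0.01, 0.01, 0.01

r, p = 1.0, 0.5                                 # initialization
samples = []
for it in range(2000):
    # (p | -) ~ Beta(a0 + sum_i m_i, b0 + n r)
    p = rng.beta(a0 + m.sum(), b0 + n * r)
    # (l_i | -) = sum_{t=1}^{m_i} b_t, with b_t ~ Bernoulli(r / (r + t - 1))
    l = np.array([(rng.random(mi) < r / (r + np.arange(mi))).sum() for mi in m])
    # (r | -) ~ Gamma(e0 + sum_i l_i, 1 / (f0 - n ln(1 - p)))
    r = rng.gamma(e0 + l.sum(), 1.0 / (f0 - n * np.log(1 - p)))
    if it >= 1000:                              # keep the second half as posterior draws
        samples.append((r, p))

print(np.mean(samples, axis=0))                 # posterior means of (r, p), near (1, 0.5)
```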

Gibbs sampling: E[r] = 1.076, E[p] = 0.525.
Expectation-Maximization: r = 1.025, p = 0.528.
Variational Bayes: E[r] = 0.999, E[p] = 0.534.
For this example, variational Bayes inference correctly identifies the modes but underestimates the posterior variances of the model parameters.
14 / 45

Gamma chain: NegBino-Gamma-Gamma-...
Augmentation: (CRT, NegBino)-Gamma-Gamma-...
Equivalence: (Log, Poisson)-Gamma-Gamma-...
Marginalization: NegBino-Gamma-...
15 / 45

Poisson and multinomial distributions
Suppose that x_1, ..., x_K are independent Poisson random variables with x_k ~ Pois(λ_k), and let x = Σ_{k=1}^{K} x_k. Set λ = Σ_{k=1}^{K} λ_k, and let (y, y_1, ..., y_K) be random variables such that
y ~ Pois(λ),  (y_1, ..., y_K) | y ~ Mult(y; λ_1/λ, ..., λ_K/λ).
Then the distribution of (x, x_1, ..., x_K) is the same as the distribution of (y, y_1, ..., y_K).
16 / 45
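
This lemma is straightforward to verify by simulation; a sketch assuming NumPy with arbitrary rates:

```python
# Independent Poisson draws versus a Poisson total split multinomially.
import numpy as np

rng = np.random.default_rng(4)
lam = np.array([1.0, 2.0, 3.0])
n = 50_000

x = rng.poisson(lam, size=(n, 3))                    # independent Poisson coordinates
y_tot = rng.poisson(lam.sum(), size=n)               # Poisson total
y = np.array([rng.multinomial(t, lam / lam.sum()) for t in y_tot])

print(x.mean(axis=0), y.mean(axis=0))                # both close to [1, 2, 3]
print(x.sum(axis=1).var(), y_tot.var())              # both close to lam.sum() = 6
```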

Multinomial and Dirichlet distributions
Model:
(x_{i1}, ..., x_{ik}) ~ Multinomial(n_i; p_1, ..., p_k),
(p_1, ..., p_k) ~ Dirichlet(α_1, ..., α_k) = [Γ(Σ_{j=1}^{k} α_j) / Π_{j=1}^{k} Γ(α_j)] Π_{j=1}^{k} p_j^{α_j - 1}
The conditional posterior of (p_1, ..., p_k) is Dirichlet distributed:
(p_1, ..., p_k | -) ~ Dirichlet(α_1 + Σ_i x_{i1}, ..., α_k + Σ_i x_{ik})
17 / 45

Gamma and Dirichlet distributions
Suppose that the random variables y and (y_1, ..., y_K) are independent with
y ~ Gamma(γ, 1/c),  (y_1, ..., y_K) ~ Dir(γ p_1, ..., γ p_K),
where Σ_{k=1}^{K} p_k = 1. Let x_k = y y_k; then {x_k}_{1,K} are independent gamma random variables with x_k ~ Gamma(γ p_k, 1/c).
The proof can be found in arXiv:1209.3442v1.
18 / 45
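
A simulation sketch of this decomposition, assuming NumPy (the values of γ, c, and the probability vector are arbitrary):

```python
# Scaling a Dirichlet vector by an independent Gamma(gamma, 1/c) total gives
# independent Gamma(gamma * p_k, 1/c) coordinates.
import numpy as np

rng = np.random.default_rng(5)
g, c, n = 4.0, 2.0, 100_000
p = np.array([0.2, 0.3, 0.5])

y = rng.gamma(g, 1.0 / c, size=n)                    # the total
w = rng.dirichlet(g * p, size=n)                     # the proportions
x = y[:, None] * w                                   # scaled coordinates

x_direct = rng.gamma(g * p, 1.0 / c, size=(n, 3))    # independent gamma draws
print(x.mean(axis=0), x_direct.mean(axis=0))         # both close to g * p / c = [0.4, 0.6, 1.0]
print(np.corrcoef(x.T)[0, 1])                        # near 0: the coordinates are independent
```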

[Diagram: distributions used for count modeling and mixture modeling and their connections: Gaussian, logit, logarithmic, Polya-Gamma, binomial, Chinese restaurant, beta, Bernoulli, latent Gaussian, Poisson, multinomial, gamma, Dirichlet]
19 / 45

Poisson regression
Model: y_i ~ Pois(λ_i), λ_i = exp(x_i^T β)
Model assumption: Var[y_i | x_i] = E[y_i | x_i] = exp(x_i^T β).
Poisson regression does not model over-dispersion.
20 / 45

Poisson regression with multiplicative random effects
Model: y_i ~ Pois(λ_i), λ_i = exp(x_i^T β) ε_i
Model property: Var[y_i | x_i] = E[y_i | x_i] + (Var[ε_i] / E^2[ε_i]) E^2[y_i | x_i].
Negative binomial regression (gamma random effect): ε_i ~ Gamma(r, 1/r) = [r^r / Γ(r)] ε_i^{r-1} e^{-r ε_i}.
Lognormal-Poisson regression (lognormal random effect): ε_i ~ ln N(0, σ^2).
21 / 45
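
To see the effect of the multiplicative random effect, the sketch below (assuming NumPy, with an arbitrary fixed mean μ at a given covariate value) compares plain Poisson draws with gamma-mixed Poisson draws; the latter show the negative binomial variance μ + μ^2/r.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, r, n = 3.0, 2.0, 200_000

y_pois = rng.poisson(mu, size=n)                # Poisson regression at this x: equi-dispersed
eps = rng.gamma(r, 1.0 / r, size=n)             # gamma random effect with mean 1
y_nb = rng.poisson(mu * eps)                    # gamma-mixed Poisson = negative binomial

print(y_pois.mean(), y_pois.var())              # both close to 3.0
print(y_nb.mean(), y_nb.var(), mu + mu**2 / r)  # close to 3.0, 7.5, 7.5
```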

Lognormal and gamma mixed negative binomial regression
Model (Zhou et al., ICML 2012):
y_i ~ NegBino(r, p_i),  r ~ Gamma(a_0, 1/h),
ψ_i = log(p_i / (1 - p_i)) = x_i^T β + ln ε_i,  ε_i ~ ln N(0, σ^2)
Bayesian inference with the Polya-Gamma distribution.
Model properties:
Var[y_i | x_i] = E[y_i | x_i] + (e^{σ^2} (1 + r^{-1}) - 1) E^2[y_i | x_i].
Special cases:
Negative binomial regression: σ^2 = 0;
Lognormal-Poisson regression: r → ∞;
Poisson regression: σ^2 = 0 and r → ∞.
22 / 45

Example on the NASCAR dataset:

Parameter       Poisson (MLE)   NB (MLE)   LGNB (VB)   LGNB (Gibbs)
σ^2             N/A             N/A        0.1396      0.0289
r               N/A             5.2484     18.5825     6.0420
β_0             -0.4903         -0.5038    -3.5271     -2.1680
β_1 (Laps)      0.0021          0.0017     0.0015      0.0013
β_2 (Drivers)   0.0516          0.0597     0.0674      0.0643
β_3 (TrkLen)    0.6104          0.5153     0.4192      0.4200

Using variational Bayes inference, we can calculate the correlation matrix for (β_1, β_2, β_3)^T as
[ 1.0000  0.4824  0.8933 ]
[ 0.4824  1.0000  0.7171 ]
[ 0.8933  0.7171  1.0000 ]
23 / 45

Latent Dirichlet allocation and Poisson factor analysis

Latent Dirichlet allocation (Blei et al., 2003)
Hierarchical model:
x_ji ~ Mult(φ_{z_ji}),  z_ji ~ Mult(θ_j),
φ_k ~ Dir(η, ..., η),  θ_j ~ Dir(α/K, ..., α/K)
There are K topics {φ_k}_{1,K}, each of which is a distribution over the V words in the vocabulary.
There are N documents in the corpus, and θ_j represents the proportions of the K topics in the jth document.
x_ji is the ith word in the jth document; z_ji is the index of the topic selected by x_ji.
24 / 45

Denote n_vjk = Σ_i δ(x_ji = v) δ(z_ji = k),  n_{v·k} = Σ_j n_vjk,  n_jk = Σ_v n_vjk,  and n_{·k} = Σ_j n_jk.
Blocked Gibbs sampling:
P(z_ji = k | -) ∝ φ_{x_ji k} θ_jk,  k ∈ {1, ..., K}
(φ_k | -) ~ Dir(η + n_{1·k}, ..., η + n_{V·k})
(θ_j | -) ~ Dir(α/K + n_{j1}, ..., α/K + n_{jK})
Variational Bayes inference (Blei et al., 2003).
25 / 45

Collapsed Gibbs sampling (Griffiths and Steyvers, 2004): marginalize out both the topics {φ_k}_{1,K} and the topic proportions {θ_j}_{1,N}, and sample z_ji conditioning on all the other topic assignment indices z^{-ji}:
P(z_ji = k | z^{-ji}) ∝ [(η + n^{-ji}_{x_ji ·k}) / (V η + n^{-ji}_{·k})] (n^{-ji}_{jk} + α/K),  k ∈ {1, ..., K}
This is easy to understand as
P(z_ji = k | φ_k, θ_j) ∝ φ_{x_ji k} θ_jk
P(z_ji = k | z^{-ji}) = ∫ P(z_ji = k | φ_k, θ_j) P(φ_k, θ_j | z^{-ji}) dφ_k dθ_j
P(φ_k | z^{-ji}) = Dir(η + n^{-ji}_{1·k}, ..., η + n^{-ji}_{V·k})
P(θ_j | z^{-ji}) = Dir(α/K + n^{-ji}_{j1}, ..., α/K + n^{-ji}_{jK})
P(φ_k, θ_j | z^{-ji}) = P(φ_k | z^{-ji}) P(θ_j | z^{-ji})
26 / 45
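
A compact implementation of this collapsed Gibbs sampler; the sketch below assumes NumPy, and docs is a hypothetical list of token-id lists with vocabulary size V.

```python
import numpy as np

def lda_collapsed_gibbs(docs, V, K, alpha=1.0, eta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_vk = np.zeros((V, K))            # word-topic counts  n_{v.k}
    n_jk = np.zeros((len(docs), K))    # document-topic counts  n_{jk}
    n_k = np.zeros(K)                  # topic totals  n_{.k}
    z = [rng.integers(K, size=len(d)) for d in docs]
    for j, d in enumerate(docs):       # initialize counts from the random assignments
        for i, v in enumerate(d):
            k = z[j][i]
            n_vk[v, k] += 1; n_jk[j, k] += 1; n_k[k] += 1
    for _ in range(iters):
        for j, d in enumerate(docs):
            for i, v in enumerate(d):
                k = z[j][i]            # remove the current assignment
                n_vk[v, k] -= 1; n_jk[j, k] -= 1; n_k[k] -= 1
                # P(z_ji = k | z^{-ji}) ∝ (eta + n_vk) / (V eta + n_k) * (alpha/K + n_jk)
                prob = (eta + n_vk[v]) / (V * eta + n_k) * (alpha / K + n_jk[j])
                k = rng.choice(K, p=prob / prob.sum())
                z[j][i] = k            # add the word back with its new topic
                n_vk[v, k] += 1; n_jk[j, k] += 1; n_k[k] += 1
    return z, n_vk, n_jk

# Toy usage on a hypothetical 4-word vocabulary:
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2], [0, 1, 2, 3]]
z, n_vk, n_jk = lda_collapsed_gibbs(docs, V=4, K=2)
```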

In latent Dirichlet allocation, the words in a document are assumed to be exchangeable (the bag-of-words assumption).
Below we relate latent Dirichlet allocation to Poisson factor analysis and show that it essentially tries to factorize the term-document word count matrix under the Poisson likelihood.
[Figure: matrix factorization X = ΦΘ, drawn both as images (P x N) = dictionary (P x K) x sparse codes (K x N) and as a words-by-documents count matrix = (words x topics) x (topics x documents)]
27 / 45

Poisson factor analysis
Factorize the term-document word count matrix M ∈ Z_+^{V×N} under the Poisson likelihood as
M ~ Pois(ΦΘ),
where Z_+ = {0, 1, ...} and R_+ = {x : x > 0}.
m_vj is the number of times that term v appears in document j.
Factor loading matrix: Φ = (φ_1, ..., φ_K) ∈ R_+^{V×K}.
Factor score matrix: Θ = (θ_1, ..., θ_N) ∈ R_+^{K×N}.
A large number of discrete latent variable models can be united under the Poisson factor analysis framework, with the main differences being how the priors for φ_k and θ_j are constructed.
28 / 45

Two equivalent augmentations
Poisson factor analysis: m_vj ~ Pois(Σ_{k=1}^{K} φ_vk θ_jk)
Augmentation 1:
m_vj = Σ_{k=1}^{K} n_vjk,  n_vjk ~ Pois(φ_vk θ_jk)
Augmentation 2:
m_vj ~ Pois(Σ_{k=1}^{K} φ_vk θ_jk),  ζ_vjk = φ_vk θ_jk / Σ_{k=1}^{K} φ_vk θ_jk,
[n_vj1, ..., n_vjK] ~ Mult(m_vj; ζ_vj1, ..., ζ_vjK)
29 / 45
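
Augmentation 2 is the step that links Poisson factor analysis to LDA-style latent counts. The sketch below (assuming NumPy, with randomly generated Φ, Θ, and counts) splits each m_vj multinomially across the K factors.

```python
import numpy as np

rng = np.random.default_rng(7)
V, N, K = 5, 3, 2
Phi = rng.dirichlet(np.ones(V), size=K).T     # V x K, each column sums to 1
Theta = rng.gamma(1.0, 1.0, size=(K, N))      # K x N factor scores
M = rng.poisson(Phi @ Theta)                  # V x N term-document count matrix

n = np.zeros((V, N, K), dtype=int)            # latent counts n_vjk
for v in range(V):
    for j in range(N):
        zeta = Phi[v] * Theta[:, j]           # unnormalized zeta_vjk
        n[v, j] = rng.multinomial(M[v, j], zeta / zeta.sum())

assert (n.sum(axis=2) == M).all()             # the splits add back up to m_vj
```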

Nonnegative matrix factorization and the gamma-Poisson factor model
Gamma priors on Φ and Θ:
m_vj ~ Pois(Σ_{k=1}^{K} φ_vk θ_jk),
φ_vk ~ Gamma(a_φ, 1/b_φ),  θ_jk ~ Gamma(a_θ, g_k/a_θ).
Expectation-Maximization (EM) algorithm:
φ_vk = [a_φ - 1 + φ_vk Σ_{j=1}^{N} m_vj θ_jk / (Σ_{k=1}^{K} φ_vk θ_jk)] / (b_φ + θ_{·k})
θ_jk = [a_θ - 1 + θ_jk Σ_{v=1}^{V} m_vj φ_vk / (Σ_{k=1}^{K} φ_vk θ_jk)] / (a_θ/g_k + φ_{·k})
If we set b_φ = 0, a_φ = a_θ = 1, and g_k = ∞, then the EM algorithm is the same as that of nonnegative matrix factorization (Lee and Seung, 2000) with the objective of minimizing the KL divergence D_KL(M || ΦΘ).
30 / 45
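
A sketch of the special case b_φ = 0, a_φ = a_θ = 1, g_k → ∞, i.e., the Lee-Seung multiplicative updates minimizing D_KL(M || ΦΘ), assuming NumPy:

```python
import numpy as np

def kl_nmf(M, K, iters=200, eps=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    V, N = M.shape
    Phi = rng.random((V, K)) + eps
    Theta = rng.random((K, N)) + eps
    for _ in range(iters):
        R = M / (Phi @ Theta + eps)                              # M_vj / (Phi Theta)_vj
        Phi *= (R @ Theta.T) / (Theta.sum(axis=1) + eps)         # update phi_vk
        R = M / (Phi @ Theta + eps)
        Theta *= (Phi.T @ R) / (Phi.sum(axis=0)[:, None] + eps)  # update theta_jk
    return Phi, Theta

M = np.random.default_rng(1).poisson(3.0, size=(20, 30))         # toy count matrix
Phi, Theta = kl_nmf(M, K=5)
```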

Latent Dirichlet allocation and the Dirichlet-Poisson factor model
Dirichlet priors on Φ and Θ:
m_vj ~ Pois(Σ_{k=1}^{K} φ_vk θ_jk),
φ_k ~ Dir(η, ..., η),  θ_j ~ Dir(α/K, ..., α/K).
One may show that both the blocked Gibbs sampling inference and the variational Bayes inference of the Dirichlet-Poisson factor model are the same as those of latent Dirichlet allocation.
31 / 45

Beta-gamma-Poisson factor model
Hierarchical model (Zhou et al., 2012; Zhou and Carin, 2014):
m_vj = Σ_{k=1}^{K} n_vjk,  n_vjk ~ Pois(φ_vk θ_jk),
φ_k ~ Dir(η, ..., η),  θ_jk ~ Gamma[r_j, p_k/(1 - p_k)],
r_j ~ Gamma(e_0, 1/f_0),  p_k ~ Beta[c/K, c(1 - 1/K)].
n_jk = Σ_{v=1}^{V} n_vjk ~ NB(r_j, p_k)
This parametric model becomes a nonparametric model governed by the beta-negative binomial process as K → ∞.
32 / 45

Gamma-gamma-Poisson factor model
Hierarchical model (Zhou and Carin, 2014):
m_vj = Σ_{k=1}^{K} n_vjk,  n_vjk ~ Pois(φ_vk θ_jk),
φ_k ~ Dir(η, ..., η),  θ_jk ~ Gamma[r_k, p_j/(1 - p_j)],
p_j ~ Beta(a_0, b_0),  r_k ~ Gamma(γ_0/K, 1/c).
n_jk ~ NB(r_k, p_j)
This parametric model becomes a nonparametric model governed by the gamma-negative binomial process as K → ∞.
33 / 45

Poisson factor analysis and mixed-membership modeling
We may represent the Poisson factor model
m_vj = Σ_{k=1}^{K} n_vjk,  n_vjk ~ Pois(φ_vk θ_jk)
in terms of a mixed-membership model whose group sizes are randomized:
x_ji ~ Mult(φ_{z_ji}),  z_ji ~ Σ_k (θ_jk / Σ_{k'=1}^{K} θ_jk') δ_k,  m_j ~ Pois(Σ_{k=1}^{K} θ_jk),
where i = 1, ..., m_j in the jth document and n_vjk = Σ_{i=1}^{m_j} δ(x_ji = v) δ(z_ji = k).
The likelihoods of the two representations differ only by a multinomial coefficient (Zhou, 2014).
34 / 45

Connections to previous approaches
Nonnegative matrix factorization with the K-L divergence (NMF)
Latent Dirichlet allocation (LDA)
GaP: the gamma-Poisson factor model (Canny, 2004)
Hierarchical Dirichlet process LDA (HDP-LDA) (Teh et al., 2006)
Priors on θ_jk in Poisson factor analysis and the related algorithms: gamma (NMF); Dirichlet (LDA); beta-gamma (GaP); gamma-gamma (HDP-LDA).
[Table columns in the original slide: priors on θ_jk; infer (p_k, r_j); infer (p_j, r_k); support K → ∞; related algorithms]
35 / 45

Blocked Gibbs sampling
Sample z_ji from its multinomial conditional; n_vjk = Σ_{i=1}^{m_j} δ(x_ji = v) δ(z_ji = k).
Sample φ_k from its Dirichlet conditional.
For the beta-negative binomial model (beta-gamma-Poisson factor model):
Sample l_jk from CRT(n_jk, r_j); sample r_j from its gamma conditional; sample p_k from its beta conditional; sample θ_jk from Gamma(r_j + n_jk, p_k).
For the gamma-negative binomial model (gamma-gamma-Poisson factor model):
Sample l_jk from CRT(n_jk, r_k); sample r_k from its gamma conditional; sample p_j from its beta conditional; sample θ_jk from Gamma(r_k + n_jk, p_j).
Collapsed Gibbs sampling for the beta-negative binomial model can be found in Zhou (2014).
36 / 45

Example application
Example topics of United Nations General Assembly resolutions inferred by the gamma-gamma-Poisson factor model:
Topic 1: trade, world, conference, organization, negotiations
Topic 2: rights, human, united, nations, commission
Topic 3: environment, management, protection, affairs, appropriate
Topic 4: women, gender, equality, including, system
Topic 5: economic, summits, outcomes, conferences, major
The gamma-negative binomial and beta-negative binomial models have distinct mechanisms for controlling the number of inferred factors. They produce state-of-the-art perplexity results when used for topic modeling of a document corpus (Zhou et al., 2012; Zhou and Carin, 2014; Zhou, 2014).
37 / 45

Stochastic blockmodel
A relational network (graph) is commonly used to describe the relationships between nodes, where a node could represent a person, a movie, a protein, etc. Two nodes are connected if there is an edge (link) between them.
An undirected, unweighted relational network with N nodes can be equivalently represented by a symmetric binary affinity matrix B ∈ {0, 1}^{N×N}, where b_ij = b_ji = 1 if an edge exists between nodes i and j, and b_ij = b_ji = 0 otherwise.
38 / 45

Stochastic blockmodel
Each node is assigned to a cluster. The probability for an edge to exist between two nodes is solely decided by the clusters that they are assigned to.
Hierarchical model:
b_ij ~ Bernoulli(p_{z_i z_j}) for i > j,  p_{k1 k2} ~ Beta(a_0, b_0),
z_i ~ Mult(π_1, ..., π_K),  (π_1, ..., π_K) ~ Dir(α/K, ..., α/K)
Blocked Gibbs sampling:
P(z_i = k | -) ∝ π_k Π_{j ≠ i} p_{k z_j}^{b_ij} (1 - p_{k z_j})^{1 - b_ij}
39 / 45
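
A sketch of one blocked Gibbs sweep for this model, assuming NumPy; B is a symmetric binary adjacency matrix with zero diagonal, and the labels z, block probabilities P, and weights π are pre-initialized arrays.

```python
import numpy as np

def sbm_gibbs_sweep(B, z, P, pi, a0=1.0, b0=1.0, alpha=1.0, rng=None):
    rng = rng or np.random.default_rng()
    N, K = B.shape[0], len(pi)
    for i in range(N):
        # log P(z_i = k | -) = log pi_k + sum_{j != i} [b_ij log p_{k z_j} + (1 - b_ij) log(1 - p_{k z_j})]
        logp = np.log(pi).copy()
        for k in range(K):
            pk = P[k, z]                           # p_{k z_j} for every node j
            ll = B[i] * np.log(pk) + (1 - B[i]) * np.log(1 - pk)
            logp[k] += ll.sum() - ll[i]            # exclude the self term j = i
        prob = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=prob / prob.sum())
    iu = np.triu_indices(N, k=1)                   # unordered node pairs i < j
    zi, zj, bij = z[iu[0]], z[iu[1]], B[iu]
    for k1 in range(K):                            # beta updates for the block probabilities
        for k2 in range(k1, K):
            sel = ((zi == k1) & (zj == k2)) | ((zi == k2) & (zj == k1))
            ones = bij[sel].sum()
            P[k1, k2] = P[k2, k1] = rng.beta(a0 + ones, b0 + sel.sum() - ones)
    pi[:] = rng.dirichlet(alpha / K + np.bincount(z, minlength=K))
    return z, P, pi

# Example usage on a small random graph with K = 3 blocks:
rng = np.random.default_rng(0)
B = np.triu((rng.random((30, 30)) < 0.2).astype(int), 1); B = B + B.T
z, P, pi = rng.integers(3, size=30), np.full((3, 3), 0.5), np.full(3, 1 / 3)
for _ in range(50):
    z, P, pi = sbm_gibbs_sweep(B, z, P, pi, rng=rng)
```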

Infinite relational model (Kemp et al., 2006)
As K → ∞, the stochastic blockmodel becomes a nonparametric model governed by the Chinese restaurant process (CRP) with concentration parameter α:
b_ij ~ Bernoulli(p_{z_i z_j}) for i > j,  p_{k1 k2} ~ Beta(a_0, b_0),  (z_1, ..., z_N) ~ CRP(α)
Collapsed Gibbs sampling can be derived by marginalizing out p_{k1 k2} and using the prediction rule of the Chinese restaurant process.
40 / 45

[Figure: the coauthor network of the top 234 NIPS authors, shown as a binary affinity matrix]
41 / 45

[Figure: the affinity matrix reordered using the stochastic blockmodel]
42 / 45

[Figure: the estimated link probabilities within and between blocks]
43 / 45

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 2003.
T. L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 2004.
C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In AAAI, 2006.
D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2000.
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. JASA, 2006.
M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In AISTATS, 2012.
M. Zhou, L. Li, D. Dunson, and L. Carin. Lognormal and gamma mixed negative binomial regression. In ICML, 2012.
44 / 45

M. Zhou and L. Carin. Augment-and-conquer negative binomial processes. In NIPS, 2012.
M. Zhou and L. Carin. Negative binomial process count and mixture modeling. IEEE TPAMI, 2014.
M. Zhou. Beta-negative binomial process and exchangeable random partitions for mixed-membership modeling. In NIPS, 2014.
45 / 45