Machine Learning Summer School, Austin, TX January 08, 2015


Parametric ...
Department of Information, Risk, and Operations Management / Department of Statistics and Data Sciences, The University of Texas at Austin
Machine Learning Summer School, Austin, TX, January 08, 2015
1 / 45

Outline:
Poisson, gamma, and negative binomial distributions
Bayesian inference for the negative binomial distribution
Regression models for counts
Latent variable models for discrete data: latent Dirichlet allocation and Poisson factor analysis
[Figure: matrix factorization X = ΦΘ, drawn both as images (P x N) = dictionary (P x K) x sparse codes (K x N) and as a words-by-documents count matrix = (words x topics) x (topics x documents)]
2 / 45

Count data is common
Nonnegative and discrete: number of auto insurance claims / highway accidents / crimes; consumer behavior, labor mobility, marketing, voting; photon counting; species sampling; text analysis; infectious diseases, Google Flu Trends; next-generation sequencing (statistical genomics).
Mixture modeling can be viewed as a count-modeling problem:
Number of points in a cluster (in a mixture model, we are modeling a count vector)
Number of words assigned to topic k in document j (we are modeling a K x J latent count matrix in a topic model / mixed-membership model)
3 / 45

Siméon-Denis Poisson (21 June 1781 - 25 April 1840)
"Life is good for only two things: doing mathematics and teaching it." - Poisson
http://en.wikipedia.org
4 / 45

Poisson distribution
x ~ Pois(λ). Probability mass function:
P(x | λ) = λ^x e^{-λ} / x!,  x ∈ {0, 1, ...}
The mean and variance are the same: E[x] = Var[x] = λ.
Restrictive for modeling over-dispersed counts (variance greater than the mean), which are commonly observed in practice.
A basic building block for constructing more flexible count distributions.
Over-dispersed counts are commonly observed due to:
Heterogeneity: differences between individuals
Contagion: dependence between the occurrences of events
5 / 45

Mixed Poisson distribution
x ~ Pois(λ), λ ~ f_Λ(λ)
Mixing the Poisson rate parameter with a positive distribution leads to a mixed Poisson distribution, which is always over-dispersed.
Law of total expectation: E[x] = E[E[x | λ]] = E[λ].
Law of total variance: Var[x] = Var[E[x | λ]] + E[Var[x | λ]] = Var[λ] + E[λ].
Thus Var[x] > E[x] unless λ is a constant.
The gamma distribution is a popular choice as it is conjugate to the Poisson.
6 / 45

Negative binomial distribution
Mixing the gamma distribution with the Poisson as
x ~ Pois(λ),  λ ~ Gamma(r, p / (1 - p)),
where p / (1 - p) is the gamma scale parameter, leads to the negative binomial distribution x ~ NB(r, p) with probability mass function
P(x | r, p) = [Γ(x + r) / (x! Γ(r))] p^x (1 - p)^r,  x ∈ {0, 1, ...}
7 / 45
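
This gamma-Poisson construction is easy to check by simulation. The sketch below is a minimal example assuming NumPy; the values of r and p are arbitrary.

```python
# Draw gamma rates with scale p/(1-p), then Poisson counts given those rates;
# the resulting counts should behave like NB(r, p).
import numpy as np

rng = np.random.default_rng(0)
r, p, n = 2.5, 0.4, 200_000

lam = rng.gamma(shape=r, scale=p / (1 - p), size=n)   # gamma-distributed rates
x = rng.poisson(lam)                                  # Poisson counts given the rates

print(x.mean(), r * p / (1 - p))        # both close to 1.667
print(x.var(), r * p / (1 - p) ** 2)    # both close to 2.778
```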

Compound Poisson distribution
A compound Poisson random variable is the summation of a Poisson random number of i.i.d. random variables: if x = Σ_{i=1}^{n} y_i, where n ~ Pois(λ) and the y_i are i.i.d. random variables, then x is a compound Poisson random variable.
The negative binomial random variable x ~ NB(r, p) can also be generated as a compound Poisson random variable:
x = Σ_{i=1}^{l} u_i,  l ~ Pois[-r ln(1 - p)],  u_i ~ Log(p),
where u ~ Log(p) is the logarithmic distribution with probability mass function
P(u | p) = -p^u / [u ln(1 - p)],  u ∈ {1, 2, ...}
8 / 45
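
The compound Poisson representation can be checked the same way. A sketch assuming NumPy, whose logseries generator matches Log(p) and whose negative_binomial generator is parameterized by the success probability 1 - p:

```python
# Draw l ~ Pois(-r ln(1-p)) "tables" and sum l logarithmic(p) variables; the sum
# should match direct NB(r, p) draws in distribution.
import numpy as np

rng = np.random.default_rng(1)
r, p, n = 2.5, 0.4, 50_000

l = rng.poisson(-r * np.log(1 - p), size=n)
x = np.array([rng.logseries(p, size=k).sum() if k > 0 else 0 for k in l])

x_nb = rng.negative_binomial(r, 1 - p, size=n)   # NumPy parameterizes by 1 - p
print(x.mean(), x_nb.mean())                     # both close to r p / (1 - p) = 1.667
print(x.var(), x_nb.var())                       # both close to r p / (1 - p)^2 = 2.778
```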

Negative binomial distribution m ~ NB(r, p)
r is the dispersion parameter; p is the probability parameter.
Probability mass function:
f_M(m | r, p) = [Γ(r + m) / (m! Γ(r))] p^m (1 - p)^r = (-1)^m (-r choose m) p^m (1 - p)^r
It is a gamma-Poisson mixture.
It is a compound Poisson distribution.
Its mean is E[m] = r p / (1 - p), and its variance r p / (1 - p)^2 is greater than its mean:
Var[m] = E[m] + (E[m])^2 / r
9 / 45

The conjugate prior for the negative binomial probability parameter p is the beta distribution: if m_i ~ NB(r, p), p ~ Beta(a_0, b_0), then
(p | -) ~ Beta(a_0 + Σ_{i=1}^{n} m_i, b_0 + n r)
The conjugate prior for the negative binomial dispersion parameter r is unknown, but a simple data augmentation technique yields closed-form Gibbs sampling update equations for r.
10 / 45

Chinese restaurant table (CRT) distribution
If we assign m customers to tables using a Chinese restaurant process with concentration parameter r, then the random number of occupied tables l follows the Chinese restaurant table (CRT) distribution
f_L(l | m, r) = [Γ(r) / Γ(m + r)] |s(m, l)| r^l,  l = 0, 1, ..., m,
where |s(m, l)| are unsigned Stirling numbers of the first kind.
The joint distribution of the customer count m ~ NB(r, p) and the table count l is the Poisson-logarithmic bivariate count distribution
f_{M,L}(m, l | r, p) = [|s(m, l)| r^l / m!] (1 - p)^r p^m
11 / 45

Poisson-logarithmic bivariate count distribution
Probability mass function:
f_{M,L}(m, l; r, p) = [|s(m, l)| r^l / m!] (1 - p)^r p^m
It is clear that the gamma distribution is a conjugate prior for r under this bivariate count distribution.
The following two ways of generating the joint customer count and table count are equivalent:
(1) Draw NegBino(r, p) customers, then assign the customers to tables using a Chinese restaurant process with concentration parameter r;
(2) Draw Poisson(-r ln(1 - p)) tables, then draw Logarithmic(p) customers at each table.
12 / 45
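
The equivalence of the two generative schemes can also be checked by simulation; this is a sketch assuming NumPy with arbitrary r and p.

```python
import numpy as np

rng = np.random.default_rng(2)
r, p, n = 2.0, 0.5, 50_000

# Scheme 1: draw m ~ NB(r, p) customers, then seat them with a CRP(r), so that
# l | m is a sum of Bernoulli(r / (r + t - 1)) indicators, t = 1, ..., m.
m1 = rng.negative_binomial(r, 1 - p, size=n)
l1 = np.array([(rng.random(m) < r / (r + np.arange(m))).sum() for m in m1])

# Scheme 2: draw l ~ Pois(-r ln(1-p)) tables, then Log(p) customers at each table.
l2 = rng.poisson(-r * np.log(1 - p), size=n)
m2 = np.array([rng.logseries(p, size=k).sum() if k > 0 else 0 for k in l2])

print(m1.mean(), m2.mean())   # both close to r p / (1 - p) = 2.0
print(l1.mean(), l2.mean())   # both close to -r ln(1 - p) ≈ 1.386
```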

Bayesian inference for the negative binomial count model:
m_i ~ NegBino(r, p),  p ~ Beta(a_0, b_0),  r ~ Gamma(e_0, 1/f_0).
Gibbs sampling via data augmentation:
(p | -) ~ Beta(a_0 + Σ_{i=1}^{n} m_i, b_0 + n r);
(l_i | -) = Σ_{t=1}^{m_i} b_t,  b_t ~ Bernoulli(r / (t + r - 1));
(r | -) ~ Gamma(e_0 + Σ_{i=1}^{n} l_i, 1 / (f_0 - n ln(1 - p))).
Expectation-Maximization
Variational Bayes
13 / 45
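
A minimal sketch of this data-augmented Gibbs sampler, assuming NumPy and synthetic counts drawn from NB(r = 1, p = 0.5); the hyperparameter values a0, b0, e0, f0 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m = rng.negative_binomial(1.0, 0.5, size=500)   # NumPy's second argument is 1 - p
n, a0, b0, e0, f0 = len(m), 0.01, 0.01, 0.01, 0.01

r, p = 1.0, 0.5                                 # initialization
samples = []
for it in range(2000):
    # (p | -) ~ Beta(a0 + sum_i m_i, b0 + n r)
    p = rng.beta(a0 + m.sum(), b0 + n * r)
    # (l_i | -) = sum_{t=1}^{m_i} b_t, with b_t ~ Bernoulli(r / (r + t - 1))
    l = np.array([(rng.random(mi) < r / (r + np.arange(mi))).sum() for mi in m])
    # (r | -) ~ Gamma(e0 + sum_i l_i, 1 / (f0 - n ln(1 - p)))
    r = rng.gamma(e0 + l.sum(), 1.0 / (f0 - n * np.log(1 - p)))
    if it >= 1000:                              # keep the second half as posterior draws
        samples.append((r, p))

print(np.mean(samples, axis=0))                 # posterior means of (r, p), near (1, 0.5)
```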

Gibbs sampling: E[r] = 1.076, E[p] = 0.525.
Expectation-Maximization: r = 1.025, p = 0.528.
Variational Bayes: E[r] = 0.999, E[p] = 0.534.
For this example, variational Bayes inference correctly identifies the modes but underestimates the posterior variances of the model parameters.
14 / 45

Gamma chain: NegBino-Gamma-Gamma-...
Augmentation: (CRT, NegBino)-Gamma-Gamma-...
Equivalence: (Log, Poisson)-Gamma-Gamma-...
Marginalization: NegBino-Gamma-...
15 / 45

Poisson and multinomial distributions
Suppose that x_1, ..., x_K are independent Poisson random variables with x_k ~ Pois(λ_k), and let x = Σ_{k=1}^{K} x_k. Set λ = Σ_{k=1}^{K} λ_k, and let (y, y_1, ..., y_K) be random variables such that
y ~ Pois(λ),  (y_1, ..., y_K) | y ~ Mult(y; λ_1/λ, ..., λ_K/λ).
Then the distribution of (x, x_1, ..., x_K) is the same as the distribution of (y, y_1, ..., y_K).
16 / 45
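
This lemma is straightforward to verify by simulation; a sketch assuming NumPy with arbitrary rates:

```python
# Independent Poisson draws versus a Poisson total split multinomially.
import numpy as np

rng = np.random.default_rng(4)
lam = np.array([1.0, 2.0, 3.0])
n = 50_000

x = rng.poisson(lam, size=(n, 3))                    # independent Poisson coordinates
y_tot = rng.poisson(lam.sum(), size=n)               # Poisson total
y = np.array([rng.multinomial(t, lam / lam.sum()) for t in y_tot])

print(x.mean(axis=0), y.mean(axis=0))                # both close to [1, 2, 3]
print(x.sum(axis=1).var(), y_tot.var())              # both close to lam.sum() = 6
```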

Multinomial and Dirichlet distributions
Model:
(x_{i1}, ..., x_{ik}) ~ Multinomial(n_i; p_1, ..., p_k),
(p_1, ..., p_k) ~ Dirichlet(α_1, ..., α_k) = [Γ(Σ_{j=1}^{k} α_j) / Π_{j=1}^{k} Γ(α_j)] Π_{j=1}^{k} p_j^{α_j - 1}
The conditional posterior of (p_1, ..., p_k) is Dirichlet distributed:
(p_1, ..., p_k | -) ~ Dirichlet(α_1 + Σ_i x_{i1}, ..., α_k + Σ_i x_{ik})
17 / 45

Gamma and Dirichlet distributions
Suppose that the random variables y and (y_1, ..., y_K) are independent with
y ~ Gamma(γ, 1/c),  (y_1, ..., y_K) ~ Dir(γ p_1, ..., γ p_K),
where Σ_{k=1}^{K} p_k = 1. Let x_k = y y_k; then {x_k}_{1,K} are independent gamma random variables with x_k ~ Gamma(γ p_k, 1/c).
The proof can be found in arXiv:1209.3442v1.
18 / 45
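
A simulation sketch of this decomposition, assuming NumPy (the values of γ, c, and the probability vector are arbitrary):

```python
# Scaling a Dirichlet vector by an independent Gamma(gamma, 1/c) total gives
# independent Gamma(gamma * p_k, 1/c) coordinates.
import numpy as np

rng = np.random.default_rng(5)
g, c, n = 4.0, 2.0, 100_000
p = np.array([0.2, 0.3, 0.5])

y = rng.gamma(g, 1.0 / c, size=n)                    # the total
w = rng.dirichlet(g * p, size=n)                     # the proportions
x = y[:, None] * w                                   # scaled coordinates

x_direct = rng.gamma(g * p, 1.0 / c, size=(n, 3))    # independent gamma draws
print(x.mean(axis=0), x_direct.mean(axis=0))         # both close to g * p / c = [0.4, 0.6, 1.0]
print(np.corrcoef(x.T)[0, 1])                        # near 0: the coordinates are independent
```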

[Diagram: distributions used for count modeling and mixture modeling and their connections: Gaussian, logit, logarithmic, Polya-Gamma, binomial, Chinese restaurant, beta, Bernoulli, latent Gaussian, Poisson, multinomial, gamma, Dirichlet]
19 / 45

Poisson regression
Model: y_i ~ Pois(λ_i), λ_i = exp(x_i^T β)
Model assumption: Var[y_i | x_i] = E[y_i | x_i] = exp(x_i^T β).
Poisson regression does not model over-dispersion.
20 / 45

Poisson regression with multiplicative random effects
Model: y_i ~ Pois(λ_i), λ_i = exp(x_i^T β) ε_i
Model property: Var[y_i | x_i] = E[y_i | x_i] + (Var[ε_i] / E^2[ε_i]) E^2[y_i | x_i].
Negative binomial regression (gamma random effect): ε_i ~ Gamma(r, 1/r) = [r^r / Γ(r)] ε_i^{r-1} e^{-r ε_i}.
Lognormal-Poisson regression (lognormal random effect): ε_i ~ ln N(0, σ^2).
21 / 45
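
To see the effect of the multiplicative random effect, the sketch below (assuming NumPy, with an arbitrary fixed mean μ at a given covariate value) compares plain Poisson draws with gamma-mixed Poisson draws; the latter show the negative binomial variance μ + μ^2/r.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, r, n = 3.0, 2.0, 200_000

y_pois = rng.poisson(mu, size=n)                # Poisson regression at this x: equi-dispersed
eps = rng.gamma(r, 1.0 / r, size=n)             # gamma random effect with mean 1
y_nb = rng.poisson(mu * eps)                    # gamma-mixed Poisson = negative binomial

print(y_pois.mean(), y_pois.var())              # both close to 3.0
print(y_nb.mean(), y_nb.var(), mu + mu**2 / r)  # close to 3.0, 7.5, 7.5
```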

Lognormal and gamma mixed negative binomial regression
Model (Zhou et al., ICML 2012):
y_i ~ NegBino(r, p_i),  r ~ Gamma(a_0, 1/h),
ψ_i = log(p_i / (1 - p_i)) = x_i^T β + ln ε_i,  ε_i ~ ln N(0, σ^2)
Bayesian inference with the Polya-Gamma distribution.
Model properties:
Var[y_i | x_i] = E[y_i | x_i] + (e^{σ^2} (1 + r^{-1}) - 1) E^2[y_i | x_i].
Special cases:
Negative binomial regression: σ^2 = 0;
Lognormal-Poisson regression: r → ∞;
Poisson regression: σ^2 = 0 and r → ∞.
22 / 45

Example on the NASCAR dataset:

Parameter       Poisson (MLE)   NB (MLE)   LGNB (VB)   LGNB (Gibbs)
σ^2             N/A             N/A        0.1396      0.0289
r               N/A             5.2484     18.5825     6.0420
β_0             -0.4903         -0.5038    -3.5271     -2.1680
β_1 (Laps)      0.0021          0.0017     0.0015      0.0013
β_2 (Drivers)   0.0516          0.0597     0.0674      0.0643
β_3 (TrkLen)    0.6104          0.5153     0.4192      0.4200

Using variational Bayes inference, we can calculate the correlation matrix for (β_1, β_2, β_3)^T as
[ 1.0000  0.4824  0.8933 ]
[ 0.4824  1.0000  0.7171 ]
[ 0.8933  0.7171  1.0000 ]
23 / 45

Latent Dirichlet allocation and Poisson factor analysis

Latent Dirichlet allocation (Blei et al., 2003)
Hierarchical model:
x_ji ~ Mult(φ_{z_ji}),  z_ji ~ Mult(θ_j),
φ_k ~ Dir(η, ..., η),  θ_j ~ Dir(α/K, ..., α/K)
There are K topics {φ_k}_{1,K}, each of which is a distribution over the V words in the vocabulary.
There are N documents in the corpus, and θ_j represents the proportions of the K topics in the jth document.
x_ji is the ith word in the jth document; z_ji is the index of the topic selected by x_ji.
24 / 45

Denote n_vjk = Σ_i δ(x_ji = v) δ(z_ji = k),  n_{v·k} = Σ_j n_vjk,  n_jk = Σ_v n_vjk,  and n_{·k} = Σ_j n_jk.
Blocked Gibbs sampling:
P(z_ji = k | -) ∝ φ_{x_ji k} θ_jk,  k ∈ {1, ..., K}
(φ_k | -) ~ Dir(η + n_{1·k}, ..., η + n_{V·k})
(θ_j | -) ~ Dir(α/K + n_{j1}, ..., α/K + n_{jK})
Variational Bayes inference (Blei et al., 2003).
25 / 45

Collapsed Gibbs sampling (Griffiths and Steyvers, 2004): marginalize out both the topics {φ_k}_{1,K} and the topic proportions {θ_j}_{1,N}, and sample z_ji conditioning on all the other topic assignment indices z^{-ji}:
P(z_ji = k | z^{-ji}) ∝ [(η + n^{-ji}_{x_ji ·k}) / (V η + n^{-ji}_{·k})] (n^{-ji}_{jk} + α/K),  k ∈ {1, ..., K}
This is easy to understand as
P(z_ji = k | φ_k, θ_j) ∝ φ_{x_ji k} θ_jk
P(z_ji = k | z^{-ji}) = ∫ P(z_ji = k | φ_k, θ_j) P(φ_k, θ_j | z^{-ji}) dφ_k dθ_j
P(φ_k | z^{-ji}) = Dir(η + n^{-ji}_{1·k}, ..., η + n^{-ji}_{V·k})
P(θ_j | z^{-ji}) = Dir(α/K + n^{-ji}_{j1}, ..., α/K + n^{-ji}_{jK})
P(φ_k, θ_j | z^{-ji}) = P(φ_k | z^{-ji}) P(θ_j | z^{-ji})
26 / 45
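
A compact implementation of this collapsed Gibbs sampler; the sketch below assumes NumPy, and docs is a hypothetical list of token-id lists with vocabulary size V.

```python
import numpy as np

def lda_collapsed_gibbs(docs, V, K, alpha=1.0, eta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_vk = np.zeros((V, K))            # word-topic counts  n_{v.k}
    n_jk = np.zeros((len(docs), K))    # document-topic counts  n_{jk}
    n_k = np.zeros(K)                  # topic totals  n_{.k}
    z = [rng.integers(K, size=len(d)) for d in docs]
    for j, d in enumerate(docs):       # initialize counts from the random assignments
        for i, v in enumerate(d):
            k = z[j][i]
            n_vk[v, k] += 1; n_jk[j, k] += 1; n_k[k] += 1
    for _ in range(iters):
        for j, d in enumerate(docs):
            for i, v in enumerate(d):
                k = z[j][i]            # remove the current assignment
                n_vk[v, k] -= 1; n_jk[j, k] -= 1; n_k[k] -= 1
                # P(z_ji = k | z^{-ji}) ∝ (eta + n_vk) / (V eta + n_k) * (alpha/K + n_jk)
                prob = (eta + n_vk[v]) / (V * eta + n_k) * (alpha / K + n_jk[j])
                k = rng.choice(K, p=prob / prob.sum())
                z[j][i] = k            # add the word back with its new topic
                n_vk[v, k] += 1; n_jk[j, k] += 1; n_k[k] += 1
    return z, n_vk, n_jk

# Toy usage on a hypothetical 4-word vocabulary:
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2], [0, 1, 2, 3]]
z, n_vk, n_jk = lda_collapsed_gibbs(docs, V=4, K=2)
```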

In latent Dirichlet allocation, the words in a document are assumed to be exchangeable (the bag-of-words assumption).
Below we relate latent Dirichlet allocation to Poisson factor analysis and show that it essentially tries to factorize the term-document word count matrix under the Poisson likelihood.
[Figure: matrix factorization X = ΦΘ, drawn both as images (P x N) = dictionary (P x K) x sparse codes (K x N) and as a words-by-documents count matrix = (words x topics) x (topics x documents)]
27 / 45

Poisson factor analysis
Factorize the term-document word count matrix M ∈ Z_+^{V×N} under the Poisson likelihood as
M ~ Pois(ΦΘ),
where Z_+ = {0, 1, ...} and R_+ = {x : x > 0}.
m_vj is the number of times that term v appears in document j.
Factor loading matrix: Φ = (φ_1, ..., φ_K) ∈ R_+^{V×K}.
Factor score matrix: Θ = (θ_1, ..., θ_N) ∈ R_+^{K×N}.
A large number of discrete latent variable models can be united under the Poisson factor analysis framework, with the main differences being how the priors for φ_k and θ_j are constructed.
28 / 45

Two equivalent augmentations
Poisson factor analysis: m_vj ~ Pois(Σ_{k=1}^{K} φ_vk θ_jk)
Augmentation 1:
m_vj = Σ_{k=1}^{K} n_vjk,  n_vjk ~ Pois(φ_vk θ_jk)
Augmentation 2:
m_vj ~ Pois(Σ_{k=1}^{K} φ_vk θ_jk),  ζ_vjk = φ_vk θ_jk / Σ_{k=1}^{K} φ_vk θ_jk,
[n_vj1, ..., n_vjK] ~ Mult(m_vj; ζ_vj1, ..., ζ_vjK)
29 / 45
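
Augmentation 2 is the step that links Poisson factor analysis to LDA-style latent counts. The sketch below (assuming NumPy, with randomly generated Φ, Θ, and counts) splits each m_vj multinomially across the K factors.

```python
import numpy as np

rng = np.random.default_rng(7)
V, N, K = 5, 3, 2
Phi = rng.dirichlet(np.ones(V), size=K).T     # V x K, each column sums to 1
Theta = rng.gamma(1.0, 1.0, size=(K, N))      # K x N factor scores
M = rng.poisson(Phi @ Theta)                  # V x N term-document count matrix

n = np.zeros((V, N, K), dtype=int)            # latent counts n_vjk
for v in range(V):
    for j in range(N):
        zeta = Phi[v] * Theta[:, j]           # unnormalized zeta_vjk
        n[v, j] = rng.multinomial(M[v, j], zeta / zeta.sum())

assert (n.sum(axis=2) == M).all()             # the splits add back up to m_vj
```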

Nonnegative matrix factorization and the gamma-Poisson factor model
Gamma priors on Φ and Θ:
m_vj ~ Pois(Σ_{k=1}^{K} φ_vk θ_jk),
φ_vk ~ Gamma(a_φ, 1/b_φ),  θ_jk ~ Gamma(a_θ, g_k/a_θ).
Expectation-Maximization (EM) algorithm:
φ_vk = [a_φ - 1 + φ_vk Σ_{j=1}^{N} m_vj θ_jk / (Σ_{k=1}^{K} φ_vk θ_jk)] / (b_φ + θ_{·k})
θ_jk = [a_θ - 1 + θ_jk Σ_{v=1}^{V} m_vj φ_vk / (Σ_{k=1}^{K} φ_vk θ_jk)] / (a_θ/g_k + φ_{·k})
If we set b_φ = 0, a_φ = a_θ = 1, and g_k = ∞, then the EM algorithm is the same as that of nonnegative matrix factorization (Lee and Seung, 2000) with the objective of minimizing the KL divergence D_KL(M || ΦΘ).
30 / 45
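
A sketch of the special case b_φ = 0, a_φ = a_θ = 1, g_k → ∞, i.e., the Lee-Seung multiplicative updates minimizing D_KL(M || ΦΘ), assuming NumPy:

```python
import numpy as np

def kl_nmf(M, K, iters=200, eps=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    V, N = M.shape
    Phi = rng.random((V, K)) + eps
    Theta = rng.random((K, N)) + eps
    for _ in range(iters):
        R = M / (Phi @ Theta + eps)                              # M_vj / (Phi Theta)_vj
        Phi *= (R @ Theta.T) / (Theta.sum(axis=1) + eps)         # update phi_vk
        R = M / (Phi @ Theta + eps)
        Theta *= (Phi.T @ R) / (Phi.sum(axis=0)[:, None] + eps)  # update theta_jk
    return Phi, Theta

M = np.random.default_rng(1).poisson(3.0, size=(20, 30))         # toy count matrix
Phi, Theta = kl_nmf(M, K=5)
```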

Latent Dirichlet allocation and the Dirichlet-Poisson factor model
Dirichlet priors on Φ and Θ:
m_vj ~ Pois(Σ_{k=1}^{K} φ_vk θ_jk),
φ_k ~ Dir(η, ..., η),  θ_j ~ Dir(α/K, ..., α/K).
One may show that both the blocked Gibbs sampling inference and the variational Bayes inference of the Dirichlet-Poisson factor model are the same as those of latent Dirichlet allocation.
31 / 45

Beta-gamma-Poisson factor model
Hierarchical model (Zhou et al., 2012; Zhou and Carin, 2014):
m_vj = Σ_{k=1}^{K} n_vjk,  n_vjk ~ Pois(φ_vk θ_jk),
φ_k ~ Dir(η, ..., η),  θ_jk ~ Gamma[r_j, p_k/(1 - p_k)],
r_j ~ Gamma(e_0, 1/f_0),  p_k ~ Beta[c/K, c(1 - 1/K)].
n_jk = Σ_{v=1}^{V} n_vjk ~ NB(r_j, p_k)
This parametric model becomes a nonparametric model governed by the beta-negative binomial process as K → ∞.
32 / 45

Gamma-gamma-Poisson factor model
Hierarchical model (Zhou and Carin, 2014):
m_vj = Σ_{k=1}^{K} n_vjk,  n_vjk ~ Pois(φ_vk θ_jk),
φ_k ~ Dir(η, ..., η),  θ_jk ~ Gamma[r_k, p_j/(1 - p_j)],
p_j ~ Beta(a_0, b_0),  r_k ~ Gamma(γ_0/K, 1/c).
n_jk ~ NB(r_k, p_j)
This parametric model becomes a nonparametric model governed by the gamma-negative binomial process as K → ∞.
33 / 45

Poisson factor analysis and mixed-membership modeling
We may represent the Poisson factor model
m_vj = Σ_{k=1}^{K} n_vjk,  n_vjk ~ Pois(φ_vk θ_jk)
in terms of a mixed-membership model whose group sizes are randomized:
x_ji ~ Mult(φ_{z_ji}),  z_ji ~ Σ_k (θ_jk / Σ_{k'=1}^{K} θ_jk') δ_k,  m_j ~ Pois(Σ_{k=1}^{K} θ_jk),
where i = 1, ..., m_j in the jth document and n_vjk = Σ_{i=1}^{m_j} δ(x_ji = v) δ(z_ji = k).
The likelihoods of the two representations differ only by a multinomial coefficient (Zhou, 2014).
34 / 45

Connections to previous approaches
Nonnegative matrix factorization with the K-L divergence (NMF)
Latent Dirichlet allocation (LDA)
GaP: the gamma-Poisson factor model (Canny, 2004)
Hierarchical Dirichlet process LDA (HDP-LDA) (Teh et al., 2006)
Priors on θ_jk in Poisson factor analysis and the related algorithms: gamma (NMF); Dirichlet (LDA); beta-gamma (GaP); gamma-gamma (HDP-LDA).
[Table columns in the original slide: priors on θ_jk; infer (p_k, r_j); infer (p_j, r_k); support K → ∞; related algorithms]
35 / 45

Blocked Gibbs sampling
Sample z_ji from its multinomial conditional; n_vjk = Σ_{i=1}^{m_j} δ(x_ji = v) δ(z_ji = k).
Sample φ_k from its Dirichlet conditional.
For the beta-negative binomial model (beta-gamma-Poisson factor model):
Sample l_jk from CRT(n_jk, r_j); sample r_j from its gamma conditional; sample p_k from its beta conditional; sample θ_jk from Gamma(r_j + n_jk, p_k).
For the gamma-negative binomial model (gamma-gamma-Poisson factor model):
Sample l_jk from CRT(n_jk, r_k); sample r_k from its gamma conditional; sample p_j from its beta conditional; sample θ_jk from Gamma(r_k + n_jk, p_j).
Collapsed Gibbs sampling for the beta-negative binomial model can be found in Zhou (2014).
36 / 45

Example application
Example topics of United Nations General Assembly resolutions inferred by the gamma-gamma-Poisson factor model:
Topic 1: trade, world, conference, organization, negotiations
Topic 2: rights, human, united, nations, commission
Topic 3: environment, management, protection, affairs, appropriate
Topic 4: women, gender, equality, including, system
Topic 5: economic, summits, outcomes, conferences, major
The gamma-negative binomial and beta-negative binomial models have distinct mechanisms for controlling the number of inferred factors. They produce state-of-the-art perplexity results when used for topic modeling of a document corpus (Zhou et al., 2012; Zhou and Carin, 2014; Zhou, 2014).
37 / 45

Stochastic blockmodel
A relational network (graph) is commonly used to describe the relationships between nodes, where a node could represent a person, a movie, a protein, etc. Two nodes are connected if there is an edge (link) between them.
An undirected, unweighted relational network with N nodes can be equivalently represented by a symmetric binary affinity matrix B ∈ {0, 1}^{N×N}, where b_ij = b_ji = 1 if an edge exists between nodes i and j, and b_ij = b_ji = 0 otherwise.
38 / 45

Stochastic blockmodel
Each node is assigned to a cluster. The probability for an edge to exist between two nodes is solely decided by the clusters that they are assigned to.
Hierarchical model:
b_ij ~ Bernoulli(p_{z_i z_j}) for i > j,  p_{k1 k2} ~ Beta(a_0, b_0),
z_i ~ Mult(π_1, ..., π_K),  (π_1, ..., π_K) ~ Dir(α/K, ..., α/K)
Blocked Gibbs sampling:
P(z_i = k | -) ∝ π_k Π_{j ≠ i} p_{k z_j}^{b_ij} (1 - p_{k z_j})^{1 - b_ij}
39 / 45
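
A sketch of one blocked Gibbs sweep for this model, assuming NumPy; B is a symmetric binary adjacency matrix with zero diagonal, and the labels z, block probabilities P, and weights π are pre-initialized arrays.

```python
import numpy as np

def sbm_gibbs_sweep(B, z, P, pi, a0=1.0, b0=1.0, alpha=1.0, rng=None):
    rng = rng or np.random.default_rng()
    N, K = B.shape[0], len(pi)
    for i in range(N):
        # log P(z_i = k | -) = log pi_k + sum_{j != i} [b_ij log p_{k z_j} + (1 - b_ij) log(1 - p_{k z_j})]
        logp = np.log(pi).copy()
        for k in range(K):
            pk = P[k, z]                           # p_{k z_j} for every node j
            ll = B[i] * np.log(pk) + (1 - B[i]) * np.log(1 - pk)
            logp[k] += ll.sum() - ll[i]            # exclude the self term j = i
        prob = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=prob / prob.sum())
    iu = np.triu_indices(N, k=1)                   # unordered node pairs i < j
    zi, zj, bij = z[iu[0]], z[iu[1]], B[iu]
    for k1 in range(K):                            # beta updates for the block probabilities
        for k2 in range(k1, K):
            sel = ((zi == k1) & (zj == k2)) | ((zi == k2) & (zj == k1))
            ones = bij[sel].sum()
            P[k1, k2] = P[k2, k1] = rng.beta(a0 + ones, b0 + sel.sum() - ones)
    pi[:] = rng.dirichlet(alpha / K + np.bincount(z, minlength=K))
    return z, P, pi

# Example usage on a small random graph with K = 3 blocks:
rng = np.random.default_rng(0)
B = np.triu((rng.random((30, 30)) < 0.2).astype(int), 1); B = B + B.T
z, P, pi = rng.integers(3, size=30), np.full((3, 3), 0.5), np.full(3, 1 / 3)
for _ in range(50):
    z, P, pi = sbm_gibbs_sweep(B, z, P, pi, rng=rng)
```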

Infinite relational model (Kemp et al., 2006)
As K → ∞, the stochastic blockmodel becomes a nonparametric model governed by the Chinese restaurant process (CRP) with concentration parameter α:
b_ij ~ Bernoulli(p_{z_i z_j}) for i > j,  p_{k1 k2} ~ Beta(a_0, b_0),  (z_1, ..., z_N) ~ CRP(α)
Collapsed Gibbs sampling can be derived by marginalizing out p_{k1 k2} and using the prediction rule of the Chinese restaurant process.
40 / 45

[Figure: the coauthor network of the top 234 NIPS authors, shown as a binary affinity matrix]
41 / 45

[Figure: the affinity matrix reordered using the stochastic blockmodel]
42 / 45

[Figure: the estimated link probabilities within and between blocks]
43 / 45

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 2003.
T. L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 2004.
C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In AAAI, 2006.
D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2000.
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. JASA, 2006.
M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In AISTATS, 2012.
M. Zhou, L. Li, D. Dunson, and L. Carin. Lognormal and gamma mixed negative binomial regression. In ICML, 2012.
44 / 45

M. Zhou and L. Carin. Augment-and-conquer negative binomial processes. In NIPS, 2012.
M. Zhou and L. Carin. Negative binomial process count and mixture modeling. IEEE TPAMI, 2014.
M. Zhou. Beta-negative binomial process and exchangeable random partitions for mixed-membership modeling. In NIPS, 2014.
45 / 45