On the posterior structure of NRMI
Igor Prünster, University of Turin, Collegio Carlo Alberto and ICER
Joint work with L.F. James and A. Lijoi
Isaac Newton Institute, BNR Programme, 8th August 2007
Outline

CRM and NRMI
- Completely random measures (CRM)
- NRMIs
- Relation to other random probability measures

Posterior structure
- Conjugacy
- Posterior characterization
- Predictive distributions
- Generalized Pólya urn scheme
- The two-parameter Poisson-Dirichlet process

Hierarchical mixture models
- NRMI mixture model
- The posterior distribution of the mixture model

Some concluding remarks
Completely random measures

DEFINITION (Kingman, 1967). $\mu$ is a completely random measure (CRM) on $(X, \mathcal{X})$ if
(i) $\mu(\emptyset) = 0$;
(ii) for any collection of disjoint sets $B_1, B_2, \ldots$ in $\mathcal{X}$, the random variables $\mu(B_1), \mu(B_2), \ldots$ are mutually independent and $\mu\big(\cup_{j \ge 1} B_j\big) = \sum_{j \ge 1} \mu(B_j)$.

Let $G_\nu = \{g : \int_X |g(x)|\,\mu(dx) < \infty\}$. Then $\mu$ is characterized by its Laplace functional
$$\mathbb{E}\Big[e^{-\int_X g(x)\,\mu(dx)}\Big] = \exp\Big\{-\int_{\mathbb{R}^+ \times X} \big[1 - e^{-v\,g(x)}\big]\,\nu(dv, dx)\Big\}$$
for any $g \in G_\nu$. In the following, denote by $\psi(\lambda)$ the Laplace exponent $\int_{\mathbb{R}^+ \times X} [1 - e^{-\lambda v}]\,\nu(dv, dx)$.

⇒ $\mu$ is identified by the intensity $\nu$ (which represents the intensity of the underlying Poisson random measure).
Let $\alpha$ be a non-atomic and $\sigma$-finite measure on $X$. According to the decomposition of $\nu$ we distinguish two classes of CRM:
(a) if $\nu(dv, dx) = \rho(dv)\,\alpha(dx)$, we say that $\mu$ is homogeneous;
(b) if $\nu(dv, dx) = \rho(dv \mid x)\,\alpha(dx)$, we say that $\mu$ is non-homogeneous.

Necessary assumptions for the normalization to be well defined:
(A) $\mu$ is almost surely finite ⟺ $\int_{\mathbb{R}^+ \times X} [1 - e^{-\lambda v}]\,\nu(dv, dx) < \infty$ for every $\lambda > 0$ ⟺ (if $\mu$ is a homogeneous CRM) $\alpha$ being a finite measure
(B) $\mu$ is almost surely strictly positive ⟺ $\nu(\mathbb{R}^+, X) = \infty$, i.e. infinite activity of $\mu$
Normalized random measures with independent increments (NRMIs)

DEFINITION. Let $\mu$ be a CRM on $(X, \mathcal{X})$ satisfying (A) and (B). Then the random probability measure on $(X, \mathcal{X})$ given by
$$P(\cdot) = \frac{\mu(\cdot)}{\mu(X)}$$
is well defined and termed a normalized random measure with independent increments (NRMI).

An NRMI is uniquely characterized by the intensity $\nu$ of the corresponding CRM $\mu$: according to the structure of $\nu$ we distinguish homogeneous and non-homogeneous NRMIs.
Special cases of NRMI

1. Dirichlet process: let $\mu$ be a gamma CRM with $\alpha$ a finite measure on $X$ and
$$\nu(dv, dx) = \frac{e^{-v}}{v}\,dv\,\alpha(dx)$$
⇒ the NRMI is a Dirichlet process with parameter measure $\alpha$.

2. Normalized generalized gamma (GG) process: let $\mu$ be a GG CRM (Brix, 1999) with $\alpha$ a finite measure on $X$ and
$$\nu(ds, dx) = \frac{\sigma}{\Gamma(1 - \sigma)}\,s^{-1-\sigma}\,e^{-\tau s}\,ds\,\alpha(dx)$$
⇒ the NRMI is a normalized GG process with parameters $(\sigma, \tau, \alpha)$.

3. Normalized extended gamma process: let $\mu$ be an extended gamma CRM (Dykstra & Laud, 1981) with
$$\nu(dv, dx) = \frac{e^{-b(x)v}}{v}\,dv\,\alpha(dx),$$
with $b$ a strictly positive function and $\alpha$ such that $\mu(X) < \infty$ a.s. ⇒ the NRMI is a normalized extended gamma process with parameters $\alpha$ and $b$.
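As a quick numerical illustration (not part of the original slides), the sketch below evaluates the Laplace exponent $\psi(\lambda)$ for the gamma and GG intensities above, comparing the closed forms $a\log(1+\lambda)$ and $a[(\lambda+\tau)^\sigma - \tau^\sigma]$ against direct quadrature; all parameter values are arbitrary assumptions.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as G

a = 2.0                  # assumed total mass a = alpha(X)
sigma, tau = 0.5, 1.0    # assumed GG parameters

rho_gamma = lambda s: np.exp(-s) / s                                          # gamma CRM
rho_gg = lambda s: sigma / G(1 - sigma) * s**(-1 - sigma) * np.exp(-tau * s)  # GG CRM

def psi_numeric(lam, rho):
    # direct quadrature of the Laplace exponent for a homogeneous intensity
    val, _ = quad(lambda s: (1 - np.exp(-lam * s)) * rho(s), 0, np.inf)
    return a * val

for lam in (0.5, 1.0, 3.0):
    print(a * np.log1p(lam), psi_numeric(lam, rho_gamma))                   # Dirichlet case
    print(a * ((lam + tau)**sigma - tau**sigma), psi_numeric(lam, rho_gg))  # GG case
```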
Relation to other random probability measures

Homogeneous NRMIs are members of the following families of random probability measures:

(i) Species sampling models (Pitman, 1996), defined as
$$P(\cdot) = \sum_{i \ge 1} P_i\,\delta_{X_i}(\cdot) + \Big(1 - \sum_{i \ge 1} P_i\Big) H(\cdot)$$
where the $0 < P_i < 1$ are random weights such that $\sum_{i \ge 1} P_i \le 1$, independent of the locations $X_i$, which are i.i.d. with some non-atomic distribution $H$.
Problem: concrete assignment of the random weights $P_i$; stick-breaking procedure (Ishwaran and James, 2001).
Remark: a non-homogeneous NRMI is not a species sampling model, since its weights and locations are not independent.

(ii) Poisson-Kingman models (Pitman, 2003): more tractable than general species sampling models, but it is still difficult to derive expressions for posterior quantities.
Characterization of the Dirichlet process

$(X_n)_{n \ge 1}$ is a sequence of exchangeable observations with values in $X$ governed by an NRMI. A sample $X^{(n)} = (X_1, \ldots, X_n)$ will contain:
- $X_1^*, \ldots, X_k^*$, the $k$ distinct observations in $X^{(n)}$;
- $n_j > 0$, the number of observations equal to $X_j^*$ ($j = 1, \ldots, k$).

Let $\mathscr{P}$ be the set of all NRMIs and let $P \in \mathscr{P}$. The posterior distribution of $P$, given $X^{(n)}$, is still in $\mathscr{P}$ if and only if $P$ is a Dirichlet process.

⇒ CONJUGACY is a distinctive feature of the Dirichlet process.

Nonetheless, conditionally on a latent variable $U$ and the data $X^{(n)}$, the (posterior) NRMI $P \mid X^{(n)}, U$ is still an NRMI.
The latent variable U

$U$ is not an auxiliary variable: it has a precise meaning, summarizing the normalization procedure and the distribution of $\mu(X)$. Define
$$\tau_m(u \mid x) = \int_{\mathbb{R}^+} s^m\,e^{-us}\,\rho_x(ds) \quad \text{for any } m \ge 1 \text{ and } x \in X.$$

- $U_0$ is a positive random variable with density $f_0(u) \propto e^{-\psi(u)} \int_X \tau_1(u \mid x)\,\alpha(dx)$.
- $U_n$ is a positive random variable whose density, conditional on the data $X^{(n)}$, is, for any $n \ge 1$,
$$f(u \mid X^{(n)}) \propto u^{n-1}\,e^{-\psi(u)}\,\prod_{j=1}^k \tau_{n_j}(u \mid X_j^*).$$

Remark: the distribution of $(U_n \mid X^{(n)})$ is a mixture of gamma distributions, with mixing measure the posterior total mass $(\mu(X) \mid X^{(n)})$:
$$f_{U_n \mid X^{(n)}}(u) = \int_{(0, +\infty)} \frac{y^n}{\Gamma(n)}\,u^{n-1}\,e^{-yu}\,Q\big(dy \mid X^{(n)}\big),$$
where $Q(\cdot \mid X^{(n)})$ denotes the posterior distribution of $\mu(X)$.
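In the gamma CRM (Dirichlet) case everything is explicit: $\tau_m(u) = \Gamma(m)(1+u)^{-m}$ and $\psi(u) = a\log(1+u)$ with $a = \alpha(X)$, so $f(u \mid X^{(n)}) \propto u^{n-1}/(1+u)^{n+a}$; equivalently $U_n/(1+U_n) \sim \mathrm{Beta}(n, a)$. The sketch below (all numerical values assumed) checks this by simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
a, n = 5.0, 10                      # assumed total mass a = alpha(X) and sample size

# draw U_n through the Beta(n, a) representation of U_n / (1 + U_n)
b = rng.beta(n, a, size=200_000)
u = b / (1 - b)

# compare the simulated mean with the mean of the grid-normalized density
grid = np.linspace(1e-4, 200, 200_000)
dens = grid**(n - 1) / (1 + grid)**(n + a)
dens /= np.trapz(dens, grid)
print(u.mean(), np.trapz(grid * dens, grid))   # the two should roughly agree
```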
The posterior distribution of the CRM µ

The posterior distribution of $\mu$, given $X^{(n)}$, is a mixture with respect to $f(u \mid X^{(n)})$. Given $U_n = u$ and $X^{(n)}$,
$$\mu \overset{d}{=} \mu_u + \sum_{i=1}^k J_i^{(u)}\,\delta_{X_i^*}$$
where
(i) the jump $J_i^{(u)}$ at $X_i^*$ has density $f_i(s) \propto s^{n_i}\,e^{-us}\,\rho_{X_i^*}(ds)$;
(ii) $\mu_u$ is a CRM with intensity
$$\nu^{(u)}(ds, dx) = e^{-us}\,\rho_x(ds)\,\alpha(dx);$$
(iii) $\mu_u$ and the $J_i^{(u)}$ ($i = 1, \ldots, k$) are independent.
The posterior distribution of the NRMI P

It now follows easily that, given $U_n$ and $X^{(n)}$, the (posterior) distribution of $P$ is again an NRMI:
$$P \mid U_n, X^{(n)} \overset{d}{=} \frac{\mu_u + \sum_{i=1}^k J_i^{(u)}\,\delta_{X_i^*}}{\mu_u(X) + \sum_{i=1}^k J_i^{(u)}} \overset{d}{=} w\,\frac{\mu_u}{\mu_u(X)} + (1 - w)\,\frac{\sum_{i=1}^k J_i^{(u)}\,\delta_{X_i^*}}{\sum_{r=1}^k J_r^{(u)}}$$
with $w = \mu_u(X)\big(\mu_u(X) + \sum_{i=1}^k J_i^{(u)}\big)^{-1}$.
The posterior distribution of the normalized GG process

Let $P$ be a normalized GG process. Then, given $U_n = u$ and $X^{(n)}$, the (posterior) distribution of $\mu$ can be represented as
$$\mu_u + \sum_{i=1}^k J_i^{(u)}\,\delta_{X_i^*}$$
where
(i) $\mu_u$ is a GG CRM with intensity measure
$$\nu^{(u)}(ds, dx) = \frac{\sigma}{\Gamma(1 - \sigma)}\,s^{-1-\sigma}\,e^{-(u+1)s}\,ds\,\alpha(dx);$$
(ii) the fixed points of discontinuity coincide with the distinct observations $X_i^*$, with jumps $J_i^{(u)}$ gamma distributed with shape $n_i - \sigma$ and rate $u + 1$, for $i = 1, \ldots, k$;
(iii) $\mu_u$ and the $J_i^{(u)}$ ($i = 1, \ldots, k$) are independent.
Moreover, the distribution of $U_n$, conditional on $X^{(n)}$, is
$$f(u \mid X^{(n)}) \propto u^{n-1}\,e^{-\alpha(X)(1+u)^\sigma}\,(u + 1)^{-(n - k\sigma)}.$$
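A minimal sketch of one posterior draw under this characterization, with assumed values for $\sigma$, $a = \alpha(X)$ and the multiplicities $n_j$; $U_n$ is drawn by inverse-cdf on a grid.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, a = 0.4, 1.0            # assumed GG parameter and total mass a = alpha(X)
n_j = np.array([5, 3, 2])      # assumed multiplicities of the k = 3 distinct values
n, k = n_j.sum(), len(n_j)

# inverse-cdf draw of U_n from f(u | X^(n)) on a grid
grid = np.linspace(1e-4, 50, 20_000)
logf = (n - 1) * np.log(grid) - a * (1 + grid)**sigma + (k * sigma - n) * np.log1p(grid)
f = np.exp(logf - logf.max())
cdf = np.cumsum(f); cdf /= cdf[-1]
u = np.interp(rng.uniform(), cdf, grid)

# fixed-point jumps: J_i gamma with shape n_i - sigma and rate u + 1
J = rng.gamma(shape=n_j - sigma, scale=1.0 / (u + 1.0))
print(u, J)
```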
Predictive distributions

The (predictive) distribution of $X_{n+1}$, given $X^{(n)}$, coincides with
$$\mathbb{P}[X_{n+1} \in dx_{n+1} \mid X_1, \ldots, X_n] = w^{(n)}\,\alpha(dx_{n+1}) + \frac{1}{n}\sum_{j=1}^k w_j^{(n)}\,\delta_{X_j^*}(dx_{n+1})$$
where
$$w^{(n)} = \frac{1}{n}\int_0^{+\infty} u\,\tau_1(u \mid x_{n+1})\,f(u \mid X^{(n)})\,du, \qquad w_j^{(n)} = \int_0^{+\infty} u\,\frac{\tau_{n_j+1}(u \mid X_j^*)}{\tau_{n_j}(u \mid X_j^*)}\,f(u \mid X^{(n)})\,du.$$
In the homogeneous case one obtains the predictive distributions of Pitman (2003).
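For the homogeneous normalized GG process with $\tau = 1$ these weights are computable by one-dimensional quadrature, since $\tau_1(u) = \sigma(1+u)^{\sigma-1}$ and $\tau_{n_j+1}(u)/\tau_{n_j}(u) = (n_j - \sigma)/(1+u)$. The sketch below (assumed parameter values) also checks that the total predictive mass is one.

```python
import numpy as np

sigma, a = 0.4, 1.0                            # assumed GG parameter and total mass
n_j = np.array([5, 3, 2]); n, k = n_j.sum(), len(n_j)

grid = np.linspace(1e-4, 100, 100_000)
f = grid**(n - 1) * np.exp(-a * (1 + grid)**sigma) * (1 + grid)**(k * sigma - n)
f /= np.trapz(f, grid)                         # grid-normalized f(u | X^(n))

w_new = np.trapz(grid * sigma * (1 + grid)**(sigma - 1) * f, grid) / n
w_old = np.array([np.trapz(grid * (m - sigma) / (1 + grid) * f, grid) for m in n_j])
print(a * w_new + w_old.sum() / n)             # total predictive mass, approx 1
```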
Sampling from the marginal distribution of the $X_i$'s

Note that, conditionally on $U_n = u$, the predictive distribution is
$$\mathbb{P}[X_{n+1} \in dx_{n+1} \mid X^{(n)}, U_n = u] \propto \tau_1(u \mid x_{n+1})\,\alpha(dx_{n+1}) + \sum_{j=1}^k \frac{\tau_{n_j+1}(u \mid X_j^*)}{\tau_{n_j}(u \mid X_j^*)}\,\delta_{X_j^*}(dx_{n+1}),$$
where the total mass of the non-atomic part is $\kappa_1(u) = \int_X \tau_1(u \mid x)\,\alpha(dx)$.

From this one can implement an analogue of the Pólya urn scheme in order to draw a sample $X^{(n)}$ from $P$. Let
$$m(dx \mid u) \propto \tau_1(u \mid x)\,\alpha(dx)$$
and, for any $i \ge 2$, set $m(dx_i \mid x^{(i-1)}, u) = \mathbb{P}[X_i \in dx_i \mid X^{(i-1)}, U_{i-1} = u]$.
Generalization of the Pólya urn scheme

1) Sample $U_0$ from $f_0(u)$.
2) Sample $X_1$ from $m(dx \mid U_0)$.
3) At step $i$:
   - sample $U_{i-1}$ from $f(u \mid X^{(i-1)})$;
   - generate $\xi_i$ from $m(d\xi \mid U_{i-1})$ and set
$$X_i = \begin{cases} \xi_i & \text{with prob.} \propto \kappa_1(U_{i-1}) \\[2pt] X_{j,i-1}^* & \text{with prob.} \propto \tau_{n_{j,i-1}+1}(U_{i-1} \mid X_{j,i-1}^*)\,\big/\,\tau_{n_{j,i-1}}(U_{i-1} \mid X_{j,i-1}^*) \end{cases}$$
where $X_{j,i-1}^*$ is the $j$-th distinct value among $X_1, \ldots, X_{i-1}$ and $n_{j,i-1} = \mathrm{card}\{s \le i - 1 : X_s = X_{j,i-1}^*\}$.
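A runnable sketch of this urn for the homogeneous normalized GG process with $\tau = 1$ (assumed $\sigma$, $a = \alpha(X)$, and a standard normal choice for the normalized base measure $H$): here $\kappa_1(u) = a\sigma(1+u)^{\sigma-1}$, the atom weights reduce to $(n_j - \sigma)/(1+u)$, and in the homogeneous case $f_0$ coincides with $f(u \mid X^{(1)})$.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, a, n = 0.4, 1.0, 50          # assumed GG parameter, total mass, sample size
grid = np.linspace(1e-3, 5000, 200_000)

def draw_u(m, k):
    # inverse-cdf draw from f(u | X^(m)) on a grid; in the homogeneous case
    # f_0 coincides with this density at (m, k) = (1, 1)
    logf = (m - 1) * np.log(grid) - a * (1 + grid)**sigma \
           + (k * sigma - m) * np.log1p(grid)
    f = np.exp(logf - logf.max())
    cdf = np.cumsum(f); cdf /= cdf[-1]
    return np.interp(rng.uniform(), cdf, grid)

values, counts = [], []
for i in range(1, n + 1):
    u = draw_u(max(i - 1, 1), max(len(counts), 1))
    w_new = a * sigma * (1 + u)**(sigma - 1.0)       # kappa_1(u): start a new value
    w_old = (np.array(counts) - sigma) / (1 + u)     # tau_{n_j+1}/tau_{n_j}: reuse X*_j
    w = np.append(w_old, w_new); w /= w.sum()
    j = rng.choice(len(w), p=w)
    if j == len(counts):                             # new distinct value, drawn from H
        values.append(rng.normal()); counts.append(1)
    else:
        counts[j] += 1
print(len(counts), counts)                           # number of clusters and their sizes
```

As a sanity check on the design: in the gamma (Dirichlet) case the analogous ratios become $a/(1+u)$ and $n_j/(1+u)$, so $u$ cancels and the scheme reduces to the classical Blackwell-MacQueen urn.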
Sampling the posterior random measure

Recall that, given $U_n = u$ and $X^{(n)}$, $\mu \overset{d}{=} \mu_u + \sum_{i=1}^k J_i^{(u)}\,\delta_{X_i^*}$.

Algorithm:
(1) Sample $U_n$ from $f(u \mid X^{(n)})$.
(2) Sample $J_i^{(U_n)}$ from the density $f_i(s)\,ds \propto s^{n_i}\,e^{-U_n s}\,\rho_{X_i^*}(ds)$.
(3) Simulate a realization of the completely random measure $\mu^{(U_n)}$ with intensity measure $\nu^{(U_n)}(ds, dx) = e^{-U_n s}\,\rho_x(ds)\,\alpha(dx)$ via the Ferguson and Klass algorithm.
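Step (3) rests on the Ferguson and Klass device: jumps are obtained by inverting the decreasing tail mass $N(v) = \int_v^\infty \int_X \nu^{(U_n)}(ds, dx)$ at the epochs of a unit-rate Poisson process. Below is a hedged sketch for the homogeneous GG posterior part, with assumed parameter values, numerical tail inversion, and a standard normal base measure for the locations.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.special import gamma as G

rng = np.random.default_rng(3)
sigma, a, u = 0.4, 1.0, 2.0      # assumed GG parameter, total mass, latent U_n value
M = 25                           # truncation level: number of simulated jumps

rho = lambda s: sigma / G(1 - sigma) * s**(-1 - sigma) * np.exp(-(u + 1) * s)

def tail(v):
    # N(v) = a * integral_v^inf rho(s) ds, strictly decreasing in v
    val, _ = quad(rho, v, np.inf)
    return a * val

epochs = np.cumsum(rng.exponential(size=M))          # Gamma_1 < Gamma_2 < ...
jumps = np.array([brentq(lambda v: tail(v) - g, 1e-12, 1e3) for g in epochs])
locs = rng.normal(size=M)                            # iid locations from H
print(jumps[:5])                                     # decreasing jump sizes
```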
The two-parameter Poisson-Dirichlet process

The PD(σ, θ) process can be represented (Pitman, 1996) as a species sampling model $\sum_{i \ge 1} p_i\,\delta_{X_i}$ with stick-breaking weights
$$p_i = V_i \prod_{j=1}^{i-1} (1 - V_j), \qquad V_i \overset{ind}{\sim} \mathrm{Beta}(1 - \sigma,\, \theta + i\sigma), \qquad X_i \overset{iid}{\sim} H.$$

Using this representation, Pitman (1996) shows that
$$P \mid X^{(n)} \overset{d}{=} \Big(1 - \sum_{i=1}^k p_i\Big)\,P^{(k)} + \sum_{j=1}^k p_j\,\delta_{X_j^*}$$
where $P^{(k)} \sim \mathrm{PD}(\sigma, \theta + k\sigma)$ and $(p_1, \ldots, p_k) \sim \mathrm{Dir}(n_1 - \sigma, \ldots, n_k - \sigma, \theta + k\sigma)$.

The PD(σ, θ) process is also representable as a normalized measure
$$P(\cdot) = \frac{\phi(\cdot)}{\phi(X)},$$
but $\phi$ does not have independent increments (Pitman and Yor, 1997). Indeed, the Laplace functional of $\phi$ is of the form
$$\mathbb{E}\Big[e^{-\int f(x)\,\phi(dx)}\Big] = \frac{1}{\Gamma(\theta)} \int_0^\infty u^{\theta - 1}\,e^{-\int_X (u + f(x))^\sigma\,P_0(dx)}\,du.$$
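A minimal truncated stick-breaking sketch of this representation (assumed parameter values and truncation level; $H$ taken standard normal for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, theta, M = 0.3, 1.0, 500      # assumed PD parameters and truncation level

i = np.arange(1, M + 1)
V = rng.beta(1 - sigma, theta + i * sigma)                 # V_i ~ Beta(1-sigma, theta+i*sigma)
p = V * np.concatenate(([1.0], np.cumprod(1 - V[:-1])))    # stick-breaking weights
X = rng.normal(size=M)                                     # iid locations from H
print(p.sum())                                             # close to 1 for large M
```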
Identify a latent variable $U_n$ such that $U_n \mid X^{(n)}$ has density
$$f(u \mid X^{(n)}) = \frac{\sigma}{\Gamma(k + \theta/\sigma)}\,u^{\theta + k\sigma - 1}\,e^{-u^\sigma}.$$
Then, given $U_n$ and $X^{(n)}$, the (posterior) distribution of $\phi$ coincides with the distribution of the random measure
$$\mu_u + \sum_{i=1}^k J_i^{(u)}\,\delta_{X_i^*}$$
where $\mu_u$ is a GG CRM with intensity
$$\nu^{(u)}(s) = \frac{\sigma}{\Gamma(1 - \sigma)}\,s^{-1-\sigma}\,e^{-us}.$$
The jumps $J_i^{(u)}$ are gamma distributed with shape $n_i - \sigma$ and rate $u$. Finally, the jumps $J_i^{(u)}$ ($i = 1, \ldots, k$) and $\mu_u$ are, conditional on $U_n$, independent.
Hierarchical mixture models
$$Y_i \mid X_i \overset{ind}{\sim} f(\cdot \mid X_i), \qquad X_i \mid P \overset{iid}{\sim} P, \qquad P \sim \mathrm{NRMI}.$$
Equivalently, $Y^{(n)} = (Y_1, \ldots, Y_n)$ are exchangeable draws from the random density
$$\tilde f(\cdot) = \int_X f(\cdot \mid x)\,P(dx).$$
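As an illustration of the generative side of the model, the sketch below simulates $Y^{(n)}$ taking the NRMI to be a Dirichlet process (via truncated stick-breaking) and a normal kernel; all concrete choices are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
a, M, n = 2.0, 500, 100                   # assumed DP total mass, truncation, sample size

V = rng.beta(1.0, a, size=M)              # Dirichlet-process stick-breaking
p = V * np.concatenate(([1.0], np.cumprod(1 - V[:-1])))
p /= p.sum()                              # renormalize the truncated weights
atoms = rng.normal(size=M)                # iid locations from H = N(0, 1)

X = rng.choice(atoms, size=n, p=p)        # X_i | P iid P
Y = rng.normal(loc=X, scale=0.5)          # Y_i | X_i ~ f(. | X_i) = N(X_i, 0.5^2)
```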
The posterior distribution of the mixture model

The posterior density $\tilde f$, given the observations $Y^{(n)}$, is
$$\int_X f(\cdot \mid x)\,P(dx \mid Y^{(n)})$$
where $P(dx \mid Y^{(n)})$ is the (posterior) random probability measure whose distribution is
$$\mathscr{P}(dp \mid Y^{(n)}) = \int \mathscr{P}(dp \mid X^{(n)})\,\mathbb{P}(dX^{(n)} \mid Y^{(n)}),$$
with:
- $\mathscr{P}(dp \mid X^{(n)})$ the (posterior) distribution of the NRMI $P$, given $X^{(n)}$;
- $\mathbb{P}(dX^{(n)} \mid Y^{(n)})$ determined via Bayes' theorem as
$$\frac{\big\{\prod_{i=1}^n f(Y_i \mid X_i)\big\}\,m(dX^{(n)})}{\int \big\{\prod_{i=1}^n f(Y_i \mid X_i)\big\}\,m(dX^{(n)})},$$
where $m(dX^{(n)})$ is the marginal distribution of the latent variables.

Remark: in any mixture model, the crucial point is the determination of a tractable expression for $\mathscr{P}(dp \mid X^{(n)})$: once this is available, deriving a simulation algorithm along the lines of Escobar and West (1995) is straightforward.
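For the Dirichlet process special case with a conjugate normal kernel and base measure, a compact marginal Gibbs sampler in the spirit of Escobar and West (1995) might look as follows; every concrete choice below is an assumption, and the scheme resamples each latent $X_i$ from its full conditional given the others.

```python
import numpy as np

rng = np.random.default_rng(6)
a, s, t0 = 1.0, 0.5, 2.0                       # assumed DP mass, kernel scale, H = N(0, t0^2)
Y = np.concatenate([rng.normal(-2, 0.5, 40), rng.normal(2, 0.5, 40)])  # toy data
n = len(Y)
X = Y.copy()                                   # initialize the latents at the data

def norm_pdf(y, m, sd):
    return np.exp(-0.5 * ((y - m) / sd)**2) / (sd * np.sqrt(2 * np.pi))

v = 1.0 / (1.0 / t0**2 + 1.0 / s**2)           # conjugate posterior variance under H
for sweep in range(200):
    for i in range(n):
        others = np.delete(X, i)
        w = norm_pdf(Y[i], others, s)          # weight for reusing an existing latent
        w_new = a * norm_pdf(Y[i], 0.0, np.sqrt(s**2 + t0**2))  # marginal likelihood of "new"
        w = np.append(w, w_new); w /= w.sum()
        j = rng.choice(n, p=w)                 # n-1 existing values plus one "new" slot
        if j == n - 1:
            X[i] = rng.normal(v * Y[i] / s**2, np.sqrt(v))      # draw from H given Y_i
        else:
            X[i] = others[j]
print(np.unique(X).size)                       # number of occupied clusters
```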
Some concluding remarks

Question 1: is it preferable to specify a normalized GG prior as mixing measure (which includes the Dirichlet process as a special case), or to stick with the Dirichlet process and enrich it with hyperpriors? What about parsimony in model specification?

Question 2: do we need applied statistical motivations for the introduction of new classes of priors? E.g., the beta process (Hjort, 1990) was introduced for survival analysis, but turned out to be also the de Finetti measure of the Indian buffet process. Random probability measures are objects of interest in their own right, well beyond what we may think: e.g., the distribution of a mean functional of the two-parameter PD process is relevant for the study of phylogenetic trees.
The mixture model is not the only use one can make of discrete nonparametric priors: if the data come from a discrete distribution, then it is reasonable to model the data with a discrete nonparametric prior (see Ramses' talk). This is a simpler context, and there one gets a real feeling for the limitations of the Dirichlet process: prediction is not monotone in the number of observed species.