A marginal sampler for σ-stable Poisson-Kingman mixture models


1 A marginal sampler for σ-stable Poisson-Kingman mixture models. Joint work with Yee Whye Teh and Stefano Favaro. María Lomelí, Gatsby Unit, University College London. Talk at BNP 10, Raleigh, North Carolina, June 22, 2015.

2 Table of Contents: latent variable models; infinite mixture models; Poisson-Kingman processes; marginal and conditional samplers; conclusions.

3 Latent variable models. The world is complex, so we need compositional models, built by stacking together blocks of simple models. We would like to expand the grammar of models that we can do inference with. [Picture borrowed from Zoubin Ghahramani's Unsupervised Learning notes]

4 Mixture models. Let $\{Y_i\}_{i=1}^n$ be our data. A mixture model is an example of a latent variable model which has a single discrete latent variable per observation:
$$X_i \sim \text{Discrete}(\pi), \qquad Y_i \mid X_i \sim F(\cdot \mid \theta_{X_i}).$$
Under the discrete distribution, $P(X_i = m) = \pi_m$ with $\pi_m \geq 0$ and $\sum_{m=1}^{k} \pi_m = 1$, and
$$P(Y_i \in dy_i) = \sum_{m=1}^{k} P(Y_i \in dy_i \mid X_i = m)\, P(X_i = m) = \sum_{m=1}^{k} \pi_m\, F(dy_i \mid \theta_m).$$
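As a concrete illustration of this two-stage generative process, here is a minimal sketch (not from the slides; the weights and component means are made-up toy values) that samples from a k-component Gaussian mixture and evaluates the marginal density:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])        # mixture weights, sum to 1
theta = np.array([-2.0, 0.0, 3.0])    # component means (toy values)

# Generative process: X_i ~ Discrete(pi), Y_i | X_i ~ N(theta[X_i], 1)
n = 1000
x = rng.choice(len(pi), size=n, p=pi)   # latent component labels
y = rng.normal(theta[x], 1.0)           # observations

# Marginal density of a point: sum_m pi_m * F(y | theta_m)
pdf = lambda t, m: np.exp(-0.5 * (t - m) ** 2) / np.sqrt(2 * np.pi)
density = lambda t: np.sum(pi * pdf(t, theta))
```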

5 Mixture model. For density estimation and/or clustering tasks, we could do the following: 1. Choose the number of mixture components k. 2. Learn the model parameters $\{\theta_m\}_{m=1}^k$. We could repeat these steps for several values of k, say k ∈ {1, 2, 3}, and then carry out a model selection procedure to pick the best model.

6 Infinite mixture models. An infinite mixture model is a mixture model with potentially infinitely many mixture components. It can be formulated with the following hierarchical specification:
$$P \sim \text{Random probability measure (RPM)}, \qquad X_i \mid P \sim P, \qquad Y_i \mid X_i \sim F_{X_i}.$$
Which RPMs can we choose?
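For intuition, under the best-known RPM choice (the Dirichlet process, first node of the family tree that follows), the latent labels can be simulated directly via the Chinese restaurant process predictive rule. A toy sketch of that special case, assuming concentration parameter theta (our illustration, not from the slides):

```python
import numpy as np

def crp_partition(n, theta, rng=None):
    """Simulate cluster labels from a Chinese restaurant process with
    concentration theta: a new table opens with probability theta/(i+theta)."""
    rng = rng or np.random.default_rng()
    counts = []                                   # customers per table
    labels = []
    for i in range(n):
        probs = np.array(counts + [theta], dtype=float)
        probs /= i + theta                        # normalise: sum(counts) = i
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                      # open a new table
        else:
            counts[k] += 1
        labels.append(k)
    return labels

print(crp_partition(20, theta=1.0))  # number of tables grows like theta*log(n)
```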

7 Dirichlet process (DP). [Family-tree diagram of Poisson-Kingman processes: the PKP class contains the NRMs, the σ-PK(h_T) class and the Gibbs-type priors (σ ∈ (0, 1), σ = 0, σ < 0); nested inside are NGG(σ, τ), PY(θ, σ), NIG(τ), NS(σ) and DP(θ), connected by limiting cases such as σ → 0, τ = 0, σ = 0.5 and θ = 0. Here the DP(θ) node is highlighted.] [Ferguson, 1973]

8-9 Pitman-Yor process (PY). [Same diagram with the PY(θ, σ) node highlighted.] [Perman et al., 1992, Pitman and Yor, 1997]

10-11 Normalized Stable process (NS). [Same diagram with the NS(σ) node highlighted.] [Kingman, 1975]

12 Normalized Inverse Gaussian process (NIG). [Same diagram with the NIG(τ) node highlighted.] [Lijoi et al., 2005]

13 Normalized Random Measures (NRM). [Same diagram with the NRM branch highlighted.] [Regazzini et al., 2003, James et al., 2009]

14 Normalized Generalized Gamma (NGG). [Same diagram with the NGG(σ, τ) node highlighted.] [Brix, 1999]

15 Q class. [Same diagram with the Q_σ(F_τ) class highlighted: F_τ = δ_τ recovers NGG(σ, τ), and F_τ = F^{σ,θ}_τ (with τ = 0, θ > −σ) recovers PY(θ, σ).] [James, 2013] and our paper [Favaro et al., 2014]

16-17 σ-stable Poisson-Kingman processes (σ-PK). [Same diagram with the σ-PK(h_T) class highlighted, together with its tilted subfamilies (Gamma tilted, Lamperti tilted and the tilting functions h_{1,T}, h_{2,T}, h_{3,T}).] [Pitman, 2003]

18 Poisson-Kingman processes (PKP). [Same diagram with the outermost PKP class highlighted.] [Pitman, 2003]

19 σ-stable Poisson-Kingman processes. [Same diagram, recapped.] [Pitman, 2003]

20 Completely random measures. Let $\mathbb{X}$ be a Polish space with its Borel σ-algebra $\mathcal{B}(\mathbb{X})$. A completely random measure (CRM) µ is a random measure on $\mathbb{X}$ such that, for all disjoint measurable sets A, B, the random variables µ(A) and µ(B) are independent. µ is an a.s. discrete measure of the form
$$\mu = \sum_{i=1}^{\infty} W_i\, \delta_{X_i}.$$
The distribution of the random weights and locations is characterised by an intensity measure ν: they form a Poisson process over $\mathbb{R}^{+} \times \mathbb{X}$. If ν factorizes as ν = ρ × H_0, then the CRM(ρ, H_0) is homogeneous. [Kingman, 1967]

21 Normalization of completely random measures. If we take µ and normalize it by its total mass $T = \mu(\mathbb{X}) = \sum_{i=1}^{\infty} W_i$, we obtain a normalized random measure
$$P = \sum_{i=1}^{\infty} \frac{W_i}{T}\, \delta_{X_i},$$
denoted by P ∼ NRM(ρ, H_0). In short: µ ∼ CRM(ρ, H_0) and P = µ/T.
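The most familiar NRM is the DP, whose normalized weights also admit Sethuraman's stick-breaking representation. A truncated sketch of that special case (our illustration; the truncation level K is an assumption, not part of the slides):

```python
import numpy as np

def dp_stick_breaking(theta, K, rng=None):
    """Truncated stick-breaking weights of a DP(theta):
    W_i = beta_i * prod_{j<i}(1 - beta_j), with beta_i ~ Beta(1, theta)."""
    rng = rng or np.random.default_rng()
    beta = rng.beta(1.0, theta, size=K)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
    return beta * remaining          # sums to ~1 for large K

w = dp_stick_breaking(theta=2.0, K=500)
atoms = np.random.default_rng().normal(size=500)   # X_i ~ H_0 = N(0, 1)
# P ~ sum_i w[i] * delta_{atoms[i]} approximates a draw from DP(theta, H_0)
```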

22 Poisson-Kingman processes. As a generalisation of a CRM, a Poisson-Kingman process is given by the following hierarchical specification:
$$T \sim \gamma, \quad \gamma(dt) \propto h(t)\, f_T(dt), \qquad G \mid T = t \sim \text{PK}(\rho, H_0, \delta_t).$$
We call the random probability measure G ∼ PK(ρ, H_0, γ) a Poisson-Kingman RPM with Lévy measure ρ, base distribution H_0 and mixing distribution γ. In particular, taking γ = f_T (no tilting) recovers the NRM(ρ, H_0) class. [Pitman, 2003]

23 Poisson-Kingman mixture models. We can define a hierarchical model such that
$$T \sim \gamma \propto h \cdot f_T, \quad G \mid T = t \sim \text{PK}(\rho_\sigma, H_0, \delta_t), \quad X_i \mid G \sim G, \quad Y_i \mid X_i \sim F(\cdot \mid X_i). \quad (1)$$
How can we perform inference with this class of infinite mixture models?

24-36 Illustration: size-biased sampling from a PKP. [Sequence of diagrams.] Draw the total mass T ∼ γ, then alternate two steps: sample a size-biased weight from the surplus mass, $V_k \sim \text{SBS}_\sigma\big(T - \sum_{j<k} V_j\big)$, and sample its location $X^*_k \sim H_0$. After k steps the represented mass is $\sum_{j=1}^{k} V_j$ and the surplus mass is $T - \sum_{j=1}^{k} V_j$. Each new observation either joins an already represented atom (in the diagrams, $X_4 = x_1$, $X_7 = x_5$, $X_8 = x_4$), creating ties, or discovers a new atom, triggering a fresh size-biased weight and location. A toy sketch of this bookkeeping follows below.
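To make the represented/surplus bookkeeping concrete, here is a toy sketch (our illustration, not the actual SBS_σ law, which depends on the σ-stable density): it size-biase-samples atoms without replacement from a finite weight vector and tracks both masses.

```python
import numpy as np

def size_biased_order(weights, rng=None):
    """Pick atoms one at a time with probability proportional to their
    weight (size-biased sampling without replacement), tracking the
    represented and surplus mass at every step."""
    rng = rng or np.random.default_rng()
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    remaining = list(range(len(weights)))
    represented = 0.0
    order = []
    while remaining:
        w = weights[remaining]
        j = int(rng.choice(len(remaining), p=w / w.sum()))
        idx = remaining.pop(j)
        represented += weights[idx]
        order.append(idx)
        print(f"picked atom {idx}: represented={represented:.3f}, "
              f"surplus={total - represented:.3f}")
    return order

size_biased_order([5.0, 2.0, 1.0, 0.5])
```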

37 Joint distribution. We can size-biased sample n times and there will be ties: this induces a random partition $\Pi_n$ of [n] with $K_n = k$ blocks of frequencies $n_i$, i = 1, …, k. The corresponding joint distribution is
$$\mathbb{P}\big(\Pi_n = \pi, \{V_j \in dv_j\}_{j=1}^{k}, \{X^*_j \in dx^*_j\}_{j=1}^{k} \mid t\big)\, \gamma(dt). \quad (2)$$
Can we build an exact MCMC sampler using this construction?

38 1. Marginal samplers. Integrate out the infinite-dimensional component and simulate only from the partition it induces. [Diagram: data points y_1, …, y_8 clustered around parameters x*_1, …, x*_4.] [MacEachern and Müller, 1998] and [Neal, 2000] for DP mixture models, [Favaro and Teh, 2013] for NRM mixture models.

39 Marginal sampler's intractabilities. After integrating out the size-biased weights, we obtain
$$\mathbb{P}_\sigma\big(\Pi_n = \pi, \{X^*_j \in dx^*_j\}_{j=1}^{k} \mid t\big) = \frac{[f_\sigma(t)]^{-1}\, \sigma^k}{\Gamma(n - \sigma k)\,(t^\sigma)^k} \int_0^1 p^{\,n - \sigma k - 1} f_\sigma\big((1 - p)t\big)\, dp\; \prod_{i=1}^{k} [1 - \sigma]_{n_i - 1}\, H_0(dx^*_i).$$
The σ-stable density $f_\sigma$ has no closed form; we handle it with [Kanter, 1975]'s integral representation of the σ-stable density and [Tanner and Wong, 1987]'s data augmentation scheme.
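Kanter's representation is also a sampler: conditionally on an auxiliary variable Z uniform on (0, π), the transformed stable variable is exponentially distributed with rate A(Z), Zolotarev's function. A minimal sketch (our own illustration, not the paper's code):

```python
import numpy as np

def zolotarev_A(z, sigma):
    """Zolotarev's function A(z) for 0 < z < pi, 0 < sigma < 1."""
    return (np.sin(sigma * z) ** (sigma / (1.0 - sigma))
            * np.sin((1.0 - sigma) * z)
            / np.sin(z) ** (1.0 / (1.0 - sigma)))

def sample_positive_stable(sigma, size, rng=None):
    """Kanter's method: T = (A(Z)/E)^{(1-sigma)/sigma}, with Z ~ U(0, pi)
    and E ~ Exp(1), is a positive sigma-stable random variable."""
    rng = rng or np.random.default_rng()
    z = rng.uniform(0.0, np.pi, size)    # the auxiliary variable Z above
    e = rng.exponential(1.0, size)
    return (zolotarev_A(z, sigma) / e) ** ((1.0 - sigma) / sigma)

t = sample_positive_stable(sigma=0.5, size=10_000)
```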

40 σ-stable Poisson-Kingman augmented mixture model. Let $\{Y_i\}_{i \in [n]}$ be n data points and $\{X^*_c\}_{c \in \pi}$ the cluster parameters. The joint distribution of $\Pi_n$, P, $W = T^{-\sigma/(1-\sigma)}$, Z and of $\{X^*_c\}_{c \in \pi}$, $\{Y_i\}_{i \in [n]}$ is
$$\mathbb{P}_\sigma(\Pi_n = \pi, P \in dp, W \in dw, Z \in dz)\; \mathbb{P}\big(\{X^*_c \in dx^*_c\}_{c \in \pi}, \{Y_i \in dy_i\}_{i \in [n]} \mid \pi\big)$$
$$= \frac{\sigma^k}{\pi\, \Gamma(n - \sigma k)}\, w^{(1-\sigma)k - 2}\, p^{\,n - \sigma k - 1}\, (1 - p)^{-\frac{1}{1-\sigma}}\, A(z)\, \exp\!\Big[-(1 - p)^{-\frac{\sigma}{1-\sigma}} A(z)\, w\Big]\, h\big(w^{-\frac{1-\sigma}{\sigma}}\big) \prod_{c \in \pi} [1 - \sigma]_{|c| - 1}\, H_0(dx^*_c) \prod_{i \in c} F(dy_i \mid x^*_c).$$

41 σ-stable Poisson-Kingman augmented mixture model. The same joint factorizes as Prior × Likelihood: the first factor is the prior over $(\Pi_n, P, W, Z)$, the second is the likelihood of the cluster parameters and the data, and the posterior is proportional to their product.

42 σ-stable Poisson-Kingman augmented mixture model. Figure: graphical model for the augmented mixture model, with nodes h(t), σ, T, Z, P, $\Pi_n$, $H_0$, $\{X^*_c\}_{c \in \pi}$ and $\{Y_i\}_{i \in [n]}$.

43 Chinese restaurant-like component-wise assignment.
$$\mathbb{P}_\sigma(\ell \mid \text{rest}) \propto \begin{cases} \dfrac{|c| - \sigma}{\Gamma(n - \sigma|\pi|)}\, f\big(y_i \mid x^*_c, \{y_j\}_{j \in c}\big) & \ell = c \in \pi_{-i} \\[2mm] \dfrac{1}{M}\, \dfrac{\sigma\, t^{-\sigma} p^{-\sigma}}{\Gamma\big(n - \sigma(|\pi| + 1)\big)}\, f\big(y_i \mid x^*_\ell\big) & \ell \in [M] \end{cases}$$

Algorithm 1 σ-PK-MarginalSampler(h_T(t), σ, M, H_0, {y_i}_{i=1}^n)
for t = 2, …, T do
  Update z: slice sample P(Z ∈ dz | rest)
  Update p: slice sample P(P ∈ dp | rest)
  Update t: slice sample P(T ∈ dt | rest)
  Update π, {x*_c}_{c∈π}: ReUse(Π_n, M, {X*_c}_{c∈Π_n}, {y_i}_{i=1}^n, rest)
end for

This extends the ReUse algorithm of [Favaro and Teh, 2013]; M is the number of empty clusters.
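The three continuous updates only need a generic univariate slice-sampling primitive. A self-contained sketch (stepping-out and shrinkage in the style of Neal's slice sampler; our illustration rather than the paper's code, and log_cond_T in the usage comment is a hypothetical name for the conditional of T):

```python
import numpy as np

def slice_sample(x0, log_density, width=1.0, rng=None):
    """One slice-sampling update of a univariate variable (stepping-out
    then shrinkage). log_density is the unnormalised log conditional."""
    rng = rng or np.random.default_rng()
    log_y = log_density(x0) + np.log(rng.uniform())   # slice height
    left = x0 - width * rng.uniform()                 # randomly placed window
    right = left + width
    while log_density(left) > log_y:                  # step out
        left -= width
    while log_density(right) > log_y:
        right += width
    while True:                                       # shrink until accepted
        x1 = rng.uniform(left, right)
        if log_density(x1) > log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1

# e.g. the T-update would call slice_sample(t, lambda t: log_cond_T(t, rest))
```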

44 Figure: one iteration of the component-wise cluster assignment with M = 2 empty clusters. [Diagram: cluster parameters X*_1, …, X*_6 reassigned across occupied and empty clusters over the sweep.]

45 Conditional MCMC samplers. 2. Conditional samplers: simulate from the joint posterior without integrating out the weights. These schemes rely on finding a finite representation of the infinite-dimensional probability measure. Figure: slice conditional sampler. [Ishwaran and James, 2001], [Papaspiliopoulos and Roberts, 2008], [Walker, 2007], [Kalli et al., 2011] for DP mixture models, [Favaro and Walker, 2012] for σ-PK mixture models.

46 Marginal sampler's performance assessment. [Table: running times and effective sample sizes, ESS (±std), of the marginal sampler versus our implementation of [Favaro and Walker, 2012]'s slice conditional sampler, for the Pitman-Yor process (θ = 10), the Normalized Stable process, and the Normalized Generalized Gamma process (τ = 1), each with σ ∈ {0.3, 0.5, 0.7}. The conditional sampler's entries are NA for σ = 0.3 and σ = 0.7.]

47 Multidimensional experiment: olive oil dataset. The dataset from [de la Mata-Espinosa et al., 2011] consists of the triacylglyceride profiles of n = 120 oil samples.

48 σ-stable Poisson-Kingman mixture of multivariate Normals. 1. PCA on the original dataset (n = 120 vectors of dimension d = 4000), keeping the first 8 principal components, which explain 97% of the variance. 2. Prior specification: σ-stable Poisson-Kingman mixture model with parameters (σ, h_T(t)) for different h-tilting functions. 3. Likelihood specification: let X be the projected dataset, $x_i \in \mathbb{R}^d$, with
$$F(x_i \mid \phi_c) = \text{Multivariate Normal}(x_i \mid \mu_{c_i}, R_{c_i}), \qquad H_0 = \text{Multivariate Normal}(\mu \mid \mu_0, R/r_0)\; \text{Inverse Wishart}(R \mid \nu, S).$$
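Step 1 is standard; a minimal numpy sketch of the projection (our illustration, with random stand-in data; the 8-component cutoff comes from the 97%-variance criterion above):

```python
import numpy as np

def pca_project(X, n_components):
    """Project rows of X (n x d) onto the top principal components."""
    Xc = X - X.mean(axis=0)                          # centre the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_explained = (s**2)[:n_components].sum() / (s**2).sum()
    return Xc @ Vt[:n_components].T, var_explained

X = np.random.default_rng(0).normal(size=(120, 4000))  # stand-in for the data
X_proj, frac = pca_project(X, 8)                        # 120 x 8 projection
```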

49 Results. [Table: average predictive probabilities (±std), all of order 10^{-12}, for the DP, PY, NGG, Gamma-tilted, NS and Lamperti-tilted priors, together with the recovered cluster sizes and a dendrogram computed with Matlab's linkage routine.]

50 Conclusions. A general-purpose MCMC algorithm for a wide class of infinite mixture models. This class is natural in the sense that it is part of the Gibbs-type priors with σ ∈ (0, 1) [De Blasi et al., 2015]. A sequential version of our sampler uses Sequential Monte Carlo (ongoing work with Pierre Jacob). We aim to combine the perks of conditional and marginal samplers and exploit the generative model's description (ongoing work).

51 Main References. Lomeli, M., Favaro, S. and Teh, Y. W. A marginal sampler for σ-stable Poisson-Kingman mixture models. ArXiv. Favaro, S., Lomeli, M. and Teh, Y. W. A tractable class of σ-stable Poisson-Kingman processes and an effective marginalized sampler. Statistics and Computing. Many thanks to the Gatsby Charitable Foundation and to the BNP-ISBA travel award for funding.

52 References I. Brix, A. (1999). Generalized gamma measures and shot-noise Cox processes. Advances in Applied Probability, 31. De Blasi, P., Favaro, S., Lijoi, A., Mena, R. H., Prünster, I., and Ruggiero, M. (2015). Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Transactions on Pattern Analysis and Machine Intelligence, 37. de la Mata-Espinosa, P., Bosque-Sendra, J., Bro, R., and Cuadros-Rodriguez, L. (2011). Olive oil quantification of edible vegetable oil blends using triacylglycerols chromatographic fingerprints and chemometric tools. Talanta, 85. Favaro, S., Lomeli, M., and Teh, Y. W. (2014). A tractable class of σ-stable Poisson-Kingman processes and an effective marginalized sampler. Statistics and Computing, in press.

53 References II. Favaro, S. and Teh, Y. W. (2013). MCMC for normalized random measure mixture models. Statistical Science, 28(3). Favaro, S. and Walker, S. G. (2012). Slice sampling σ-stable Poisson-Kingman mixture models. Journal of Computational and Graphical Statistics, to appear. Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2). Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96. James, L. F. (2013). Stick-breaking PG(α, ψ)-generalized Gamma processes. ArXiv. James, L. F., Lijoi, A., and Prünster, I. (2009). Posterior analysis for normalized random measures with independent increments. Scandinavian Journal of Statistics, 36:76-97.

54 References III. Kalli, M., Griffin, J. E., and Walker, S. G. (2011). Slice sampling mixture models. Statistics and Computing, 21. Kanter, M. (1975). Stable densities under change of scale and total variation inequalities. The Annals of Probability, 3. Kingman, J. F. C. (1967). Completely random measures. Pacific Journal of Mathematics, 21. Kingman, J. F. C. (1975). Random discrete distributions. Journal of the Royal Statistical Society, 37:1-22. Lijoi, A., Mena, R. H., and Prünster, I. (2005). Hierarchical mixture modelling with normalized inverse-Gaussian priors. Journal of the American Statistical Association, 100. MacEachern, S. N. and Müller, P. (1998). Estimating mixtures of Dirichlet process models. Journal of Computational and Graphical Statistics, 7.

55 References IV. Neal, R. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249-265. Papaspiliopoulos, O. and Roberts, G. (2008). Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika, 95. Perman, M., Pitman, J., and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 92(1). Pitman, J. (2003). Poisson-Kingman partitions. In Goldstein, D. R., editor, Statistics and Science: a Festschrift for Terry Speed. Institute of Mathematical Statistics. Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25.

56 References V. Regazzini, E., Lijoi, A., and Prünster, I. (2003). Distributional results for means of random measures with independent increments. Annals of Statistics, 31. Tanner, M. and Wong, W. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82. Walker, S. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics: Simulation and Computation, 36:45-54.

57 Appendix: conditions on the CRM's Lévy measure. We want the CRM to have infinitely many atoms:
$$\int_0^\infty \rho(w)\, dw = \infty,$$
and we want it to have finite total mass:
$$\int_0^\infty \big[1 - \exp(-w)\big]\, \rho(w)\, dw < \infty.$$
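As a check (our addition, using the standard σ-stable intensity $\rho_\sigma(w) = \frac{\sigma}{\Gamma(1-\sigma)} w^{-1-\sigma}$): the first integral diverges because $w^{-1-\sigma}$ is not integrable at 0, while integration by parts gives
$$\int_0^\infty (1 - e^{-w})\, w^{-1-\sigma}\, dw = \frac{1}{\sigma} \int_0^\infty e^{-w} w^{-\sigma}\, dw = \frac{\Gamma(1-\sigma)}{\sigma} < \infty,$$
so both conditions hold for every σ ∈ (0, 1).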

58 Algorithm 2 ReUse(Π_n, M, {X*_c}_{c∈Π_n}, {y_i}_{i=1}^n, rest)
Draw X^e_1, …, X^e_M i.i.d. from H_0
for i = 1, …, n do
  Let c ∈ Π_n be such that i ∈ c; set c ← c \ {i}
  if c = ∅ then
    Draw k ∼ DiscreteUniform({1, …, M}); set X^e_k ← X*_c; Π_n ← Π_n \ {c}
  end if
  Set c according to Pr[assign i to cluster c | rest]
  if c ∈ [M] then
    Π_n ← Π_n ∪ {{i}}; X*_{{i}} ← X^e_c; draw X^e_c ∼ H_0
  else
    c ← c ∪ {i}
  end if
end for
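A runnable transliteration of this bookkeeping (a sketch under assumptions: the names reuse_sweep, assign_logprob and sample_H0 are ours; assign_logprob stands in for the Chinese-restaurant-like weights of slide 43, and the empty-cluster parameters are assumed already drawn from H_0 by the caller):

```python
import numpy as np

def reuse_sweep(clusters, params, empty_params, assign_logprob, sample_H0, rng):
    """One ReUse sweep. clusters: list of sets of point indices; params:
    cluster parameters aligned with clusters; empty_params: M parameters
    for the empty clusters; assign_logprob(i, members, param) returns the
    unnormalised log-probability of assigning point i to that cluster."""
    n = sum(len(c) for c in clusters)
    M = len(empty_params)
    for i in range(n):
        j = next(k for k, c in enumerate(clusters) if i in c)
        clusters[j].discard(i)
        if not clusters[j]:                  # emptied: recycle its parameter
            k = int(rng.integers(M))         # into a uniformly chosen slot
            empty_params[k] = params[j]
            del clusters[j], params[j]
        logw = [assign_logprob(i, c, p) for c, p in zip(clusters, params)]
        logw += [assign_logprob(i, set(), p) - np.log(M) for p in empty_params]
        logw = np.array(logw)
        w = np.exp(logw - logw.max())
        c = int(rng.choice(len(w), p=w / w.sum()))
        if c < len(clusters):                # join an existing cluster
            clusters[c].add(i)
        else:                                # open a new cluster from slot e
            e = c - len(clusters)
            clusters.append({i})
            params.append(empty_params[e])
            empty_params[e] = sample_H0(rng)
    return clusters, params, empty_params
```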
