Dependent Random Measures and Prediction


1 Dependent Random Measures and Prediction
Igor Prünster
University of Torino & Collegio Carlo Alberto
10th Bayesian Nonparametric Conference, Raleigh, June 26, 2015
Joint work with: Federico Camerlenghi, Antonio Lijoi, Bernardo Nipoti and Peter Orbanz

2 Outline
- Exchangeability & Partial Exchangeability
- Dependence and its extreme cases
- Dependent Random Measures with Additive Structure
- Dependent Random Measures with Nested Structure
- Dependent Random Measures with Hierarchical Structure

3 Exchangeability & Partial Exchangeability: Exchangeability

Exchangeability:
- Probabilistic statement concerning homogeneity (symmetry, analogy) between past and future; justifies induction
  ⇒ inferences are not affected by the order of the observations.
- Suitable mathematical framework for inference, alternative to the classical approach with a fixed and unknown probability distribution P.

A sequence of observations (X_n)_{n≥1} is exchangeable if
  (X_1, X_2, ..., X_n) =_d (X_{π(1)}, X_{π(2)}, ..., X_{π(n)})
for any n ≥ 1 and any permutation π of (1, ..., n).

de Finetti's representation theorem: (X_n)_{n≥1} is exchangeable if and only if
  P[X_1 ∈ A_1, ..., X_n ∈ A_n] = \int_P \prod_{i=1}^n P(A_i) Q(dP),
where P is the space of probability measures on X.
⇒ Q is the de Finetti measure of (X_n)_{n≥1} and acts as a prior distribution for Bayesian inference, being the law of a random probability measure P̃.
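A minimal simulation sketch of this setup, assuming one concrete choice of Q, namely a Dirichlet process prior with total mass θ: the resulting exchangeable sequence can then be generated marginally through the Blackwell-MacQueen Pólya urn. Function and parameter names are illustrative.

```python
import numpy as np

def polya_urn(n, theta, base_sampler, rng):
    # Blackwell-MacQueen urn: X_{i+1} is a fresh draw from P0 with
    # probability theta / (theta + i), otherwise a uniformly chosen past value.
    X = []
    for i in range(n):
        if rng.random() < theta / (theta + i):
            X.append(base_sampler(rng))      # new atom drawn from P0
        else:
            X.append(X[rng.integers(i)])     # tie with a past observation
    return np.array(X)

rng = np.random.default_rng(0)
x = polya_urn(100, theta=2.0, base_sampler=lambda r: r.normal(), rng=rng)
print(np.unique(x).size, "distinct values among", x.size, "draws")
```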

4 Exchangeability & Partial Exchangeability: Beyond Exchangeability

In many applications with real data, dependence structures more general than exchangeability are required. Indeed, in de Finetti (1938):

"But the case of exchangeability can only be considered as a limiting case: the case in which this analogy is, in a certain sense, absolute for all events under consideration. [..] To get from the case of exchangeability to other cases which are more general but still tractable, we must take up the case where we still encounter analogies among the events under consideration, but without attaining the limiting case of exchangeability."

- nonparametric regression (covariate-indexed observations)
- time-dependent data
- change-point problems
- data from different related studies or experiments
- ...

5 Exchangeability & Partial Exchangeability: Partial exchangeability

Partial exchangeability: a sequence (X, Y) = (X_i, Y_j)_{i,j≥1} is partially exchangeable if
  (X_1, ..., X_{n_1}, Y_1, ..., Y_{n_2}) =_d (X_{π(1)}, ..., X_{π(n_1)}, Y_{φ(1)}, ..., Y_{φ(n_2)})
for any n_1, n_2 ≥ 1 and any permutations π and φ of (1, ..., n_1) and (1, ..., n_2).

de Finetti's representation theorem: (X, Y) = (X_i, Y_j)_{i,j≥1} is partially exchangeable if and only if
  P[X_1 ∈ A_1, ..., X_{n_1} ∈ A_{n_1}, Y_1 ∈ B_1, ..., Y_{n_2} ∈ B_{n_2}] = \int_{P^2} \prod_{i=1}^{n_1} P_1(A_i) \prod_{j=1}^{n_2} P_2(B_j) Q(dP_1, dP_2).

⇒ This is the same as saying that
  (X_i, Y_j) | (P̃_1, P̃_2) iid∼ P̃_1 ⊗ P̃_2,  i, j ≥ 1,
  (P̃_1, P̃_2) ∼ Q,
with (P̃_1, P̃_2) a vector of dependent random probabilities and Q the prior.

6 Exchangeability & Partial Exchangeability: Discrete partially exchangeable priors

Induced partition structure. Two samples X^{(n_1)} = (X_1, ..., X_{n_1}) and Y^{(n_2)} = (Y_1, ..., Y_{n_2}) from partially exchangeable sequences (X_j)_{j≥1} and (Y_i)_{i≥1}.
The prior Q selects almost surely discrete random probabilities (P̃_1, P̃_2) ⇒ ties within each sample and possibly also across different samples.

Partition of [N] = {1, ..., n_1 + n_2} induced by X^{(n_1)} and Y^{(n_2)} into:
- k_1 distinct values specific to X^{(n_1)}, with frequencies n^X_j (j = 1, ..., k_1);
- k_2 distinct values specific to Y^{(n_2)}, with frequencies n^Y_j (j = 1, ..., k_2);
- k_0 distinct values shared by the two samples, with frequencies q^X_r and q^Y_r (r = 1, ..., k_0).

partially Exchangeable Partition Probability Function (peppf):
  \Pi^{(N)}(n^X, n^Y, q^X, q^Y) = E \int \prod_{j=1}^{k_1} P̃_1^{n^X_j}(dx^*_j) \prod_{l=1}^{k_2} P̃_2^{n^Y_l}(dy^*_l) \prod_{r=1}^{k_0} P̃_1^{q^X_r}(dz^*_r) P̃_2^{q^Y_r}(dz^*_r),
where N = n_1 + n_2 and k = k_0 + k_1 + k_2.

The expression of \Pi^{(n_1+n_2)} depends on the choice of Q and is key for prediction.

7 Exchangeability & Partial Exchangeability: Prediction over additional samples

A structural predictive property. Goal: conditional on (X^{(n_1)}, Y^{(n_2)}), predict features of an additional sample X^{(m|n_1)} = (X_{n_1+1}, ..., X_{n_1+m}).

Interest in the probability of sampling values appearing in Y^{(n_2)} but not in X^{(n_1)}:
  A_{new,old} = Y^{(n_2)} \setminus X^{(n_1)},
  #(A_{new,old} ∩ X^{(m|n_1)}) = number of observations in X^{(m|n_1)} belonging to A_{new,old}.
The corresponding estimate is linear in the additional sample size m:
  E[#(A_{new,old} ∩ X^{(m|n_1)}) | X^{(n_1)}, Y^{(n_2)}] = m P[X_{n_1+1} ∈ A_{new,old} | X^{(n_1)}, Y^{(n_2)}].

Remarks:
- Such a property holds true whatever the prior Q on P^2.
- The same property holds true for the second sample, and even when considering more than two samples.
- Linearity is due to the identity in distribution of the future observations, conditional on the observed samples (X^{(n_1)}, Y^{(n_2)}) ⇒ structural property of partial exchangeability.

8 Exchangeability & Partial Exchangeability: Prediction over additional samples

The same conclusions hold true when estimating
- #(A_{old,new} ∩ X^{(m|n_1)}) with A_{old,new} = X^{(n_1)} \setminus Y^{(n_2)},
- #(A_{old,old} ∩ X^{(m|n_1)}) with A_{old,old} = X^{(n_1)} ∩ Y^{(n_2)},
- #(A_{new,new} ∩ X^{(m|n_1)}) with A_{new,new} = {X^{(n_1)} ∪ Y^{(n_2)}}^c,
leading to
  E[#(A_{•,•} ∩ X^{(m|n_1)}) | X^{(n_1)}, Y^{(n_2)}] = m P[X_{n_1+1} ∈ A_{•,•} | X^{(n_1)}, Y^{(n_2)}]
over all 4 sets that X_{n_1+i}, i = 1, ..., m, can belong to.

⇒ The expected proportion of observations from an additional sample that coincide with either new, shared, or sample-specific distinct values is the same for any m, and equals P[X_{n_1+1} ∈ A_{•,•} | X^{(n_1)}, Y^{(n_2)}].

9 Dependence and its extreme cases: Measuring dependence

Extreme cases of dependence induced by the prior Q:
- Maximal dependence = full exchangeability, i.e. Q is degenerate on the diagonal {(P_1, P_2) ∈ P^2 : P_1 = P_2}, namely P̃_1 = P̃_2 (a.s.).
- Independence: P̃_1 and P̃_2 are (unconditionally) independent with respect to Q ⇒ corresponds to maximal heterogeneity: inference on each sample is not influenced by the observations from the other sample.

Example: correlation is a popular measure of dependence. Since Corr(P̃_1(A), P̃_2(A)) typically does not depend on A, it is taken as a measure of overall dependence. With respect to the extreme cases:
  P̃_1 = P̃_2 (a.s.) ⇒ Corr(P̃_1(A), P̃_2(A)) = 1,
  P̃_1 ⊥ P̃_2 ⇒ Corr(P̃_1(A), P̃_2(A)) = 0,
but still it is not an ideal measure of dependence for (P̃_1, P̃_2).

10 Dependence and its extreme cases: Dependence via stick-breaking

Strategies for constructing dependent nonparametric priors. The most popular approach to the definition of dependent random probability measures is via dependent stick-breaking constructions, introduced by MacEachern (1999, 2000): let
  P̃_1 = \sum_{i≥1} p_{i,1} δ_{Z_{i,1}},  P̃_2 = \sum_{j≥1} p_{j,2} δ_{Z_{j,2}},
with
  p_{i,1} = W_{i,1} \prod_{l=1}^{i-1} (1 - W_{l,1}),  p_{j,2} = W_{j,2} \prod_{l=1}^{j-1} (1 - W_{l,2}),
and dependence between P̃_1 and P̃_2 is induced by specifying
- dependent stick-breaking weights (W_{i,1}, W_{j,2}),
- dependent locations Z_{j,1} and Z_{j,2}.

Pros: quite straightforward implementation of conditional simulation algorithms such as, e.g., the slice sampler (Walker, 2007) or the retrospective sampler (Papaspiliopoulos & Roberts, 2008).
Cons: it is almost impossible to derive analytic expressions for quantities of interest, both marginal and conditional.
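A toy truncated sketch of the stick-breaking idea (a simple coupling device for illustration, not MacEachern's exact construction): each Beta(1, θ) stick of the second measure copies the corresponding stick of the first with probability ρ and is redrawn independently otherwise, so both marginals remain stick-breaking while the weights become dependent.

```python
import numpy as np

def dependent_sticks(K, theta, rho, rng):
    # Two truncated stick-breaking weight vectors with Beta(1, theta) sticks;
    # each stick of measure 2 equals the matching stick of measure 1 with
    # probability rho, otherwise it is an independent fresh draw.
    V1 = rng.beta(1.0, theta, K)
    V2 = np.where(rng.random(K) < rho, V1, rng.beta(1.0, theta, K))
    to_weights = lambda V: V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    return to_weights(V1), to_weights(V2)

rng = np.random.default_rng(1)
p1, p2 = dependent_sticks(K=50, theta=1.0, rho=0.7, rng=rng)
Z = rng.normal(size=50)   # shared locations: p1, p2 on the atoms Z give dependent measures
```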

11 Dependence and its extreme cases: Dependence via random measures

Dependent priors via transformed random measures. An alternative approach is based on general random measures, with 3 possible strategies (and combinations thereof) for creating dependence:

1. Additive structures. Let M̃_i be random measures (i = 0, 1, 2) and define
   P̃_1 = T(M̃_1 + M̃_0),  P̃_2 = T(M̃_2 + M̃_0),
   where T is a suitable transformation (e.g. normalization).

2. Hierarchical structures. Let P̃_0 = T(M̃_0) and P̃_i = T(M̃_i) be such that
   P̃_i | P̃_0 ∼ Q_i(P̃_0) with E[P̃_i | P̃_0] = P̃_0, for i = 1, 2,
   P̃_0 ∼ Q_0(P_0) with P_0 = E[P̃_0].

3. Nested structures. Let R̃ = T(M̃) be a random probability measure on P and
   (P̃_1, P̃_2) | R̃ iid∼ R̃,  R̃ ∼ L(Q_0),
   with Q_0 = E[R̃] the law of a random probability P̃_0 = T(M̃_0) on X.

⇒ Strategies already used in, respectively, Müller, Quintana & Rosner (2004), Teh, Jordan, Beal & Blei (2006), and Rodríguez, Dunson & Gelfand (2008), relying on stick-breaking constructions instead of the general theory of random measures.

12 Dependence and its extreme cases: Completely random measures

Let M be the space of boundedly finite measures on X.

Completely random measures (Kingman, 1967): a random element µ̃ taking values in M such that µ̃(A_1), µ̃(A_2), ..., µ̃(A_d) are mutually independent, for any d ≥ 1 and for any collection of pairwise disjoint sets A_1, ..., A_d, is said to be a completely random measure (CRM).

The realizations of a CRM are a.s. discrete and, assuming there are no fixed points of discontinuity, a CRM can be represented as
  µ̃(·) = \sum_{i=1}^\infty J_i δ_{Z_i}(·),
with both the locations Z_i and the positive jumps J_i random.

13 Dependence and its extreme cases: Completely random measures

A CRM µ̃ is uniquely characterized by its Laplace functional
  E[e^{-\int_X g(x) µ̃(dx)}] = e^{-\int_{R^+ \times X} [1 - e^{-v g(x)}] ν(dv, dx)},
with ν indicating the intensity measure, which characterizes the CRM µ̃.

Two cases:
- Homogeneous CRM µ̃: jumps (J_i)_{i≥1} & locations (Z_i)_{i≥1} independent: ν(ds, dx) = α(dx) ρ(s) ds.
- Non-homogeneous CRM µ̃: (J_i)_{i≥1} & (Z_i)_{i≥1} dependent: ν(ds, dx) = α(dx) ρ_x(s) ds.

Noteworthy examples:
1. gamma CRM: ν(dv, dx) = \frac{e^{-τv}}{v} dv α(dx)
2. σ-stable CRM: ν(dv, dx) = \frac{σ}{Γ(1-σ)} \frac{dv}{v^{1+σ}} α(dx)
3. generalized gamma CRM: ν(dv, dx) = \frac{σ e^{-τv}}{Γ(1-σ)} \frac{dv}{v^{1+σ}} α(dx)
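A sketch of how such a CRM can be simulated, assuming the gamma intensity above with α = θP_0: the Ferguson-Klass device inverts the Lévy tail integral N(x) = θ E1(τx) at the arrival times of a unit-rate Poisson process, producing the jumps in decreasing order. Names and truncation level are illustrative.

```python
import numpy as np
from scipy.special import exp1
from scipy.optimize import brentq

def gamma_crm_jumps(theta, tau, n_jumps, rng):
    # Ferguson-Klass: the decreasing jumps J_1 > J_2 > ... of a gamma CRM with
    # intensity nu(dv, dx) = theta exp(-tau v)/v dv P0(dx) solve N(J_i) = G_i,
    # where N(x) = theta * E1(tau * x) is the Levy tail integral and the G_i
    # are the arrival times of a unit-rate Poisson process.
    arrivals = np.cumsum(rng.exponential(size=n_jumps))
    return np.array([brentq(lambda x: theta * exp1(tau * x) - g, 1e-300, 1e3)
                     for g in arrivals])

rng = np.random.default_rng(2)
J = gamma_crm_jumps(theta=1.0, tau=1.0, n_jumps=200, rng=rng)
Z = rng.normal(size=J.size)   # iid locations from P0 (homogeneous case)
# sum_i J_i * delta_{Z_i} approximates the gamma CRM; J / J.sum() gives the
# weights of the corresponding normalized measure (a Dirichlet process here).
```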

14 Dependence and its extreme cases: Transformations of CRMs

CRM-based priors.

Normalized completely random measures (Regazzini, Lijoi & P., 2003): let µ̃ be a CRM on X such that 0 < µ̃(X) < ∞ a.s. Then
  P̃(·) = µ̃(·) / µ̃(X)
is a normalized completely random measure (NRMI = Normalized Random Measure with Independent increments).
⇒ Characterized by the Lévy intensity ν associated to µ̃.

Hazard rate mixture models (Dykstra & Laud, 1981; James, 2005): given a CRM µ̃ on Y and a kernel k(t, y) on R^+ × Y,
  h̃(t) = \int_Y k(t, y) µ̃(dy)  ⇒  survival function P̃((t, ∞)) = \exp\{-\int_0^t h̃(s) ds\}.

Other examples (see Lijoi & P., 2010): NTR priors, Pitman-Yor process, cumulative hazard priors, ...

15 Dependence and its extreme cases: EPPF

Partition structure in the exchangeable case. With CRM-based priors, i.e. P̃ = T(µ̃), in the exchangeable case one has:
- marginal and predictive distributions of the data;
- characterization of the posterior of P̃ | X_1, ..., X_n;
- MCMC sampling schemes;
- applications to density estimation, clustering, species sampling problems, inference on functionals of P̃, ...

Most practical implementations with discrete P̃ rely on the exchangeable partition probability function (EPPF) (Pitman, 1995) induced by P̃:
  \Pi^{(n)}(n_1, ..., n_k) = E \int_{X^k} \prod_{j=1}^k P̃^{n_j}(dx^*_j).

If P̃ is an NRMI, one has an explicit EPPF in terms of the intensity ν of µ̃. E.g., if P̃ is a Dirichlet process (Ferguson, 1973) with base measure θP_0, then
  \Pi^{(n)}(n_1, ..., n_k) = \frac{\theta^k}{(\theta)_n} \prod_{j=1}^k \Gamma(n_j),
which corresponds to the celebrated Ewens sampling formula (Ewens, 1972).
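The Dirichlet-process EPPF is simple enough to evaluate directly; a small sketch of the Ewens sampling formula above (illustrative code):

```python
import math

def dirichlet_eppf(theta, freqs):
    # Ewens sampling formula: theta^k / (theta)_n * prod_j (n_j - 1)!,
    # with (theta)_n = theta (theta + 1) ... (theta + n - 1) the rising factorial.
    n, k = sum(freqs), len(freqs)
    rising = math.prod(theta + i for i in range(n))
    return theta**k * math.prod(math.factorial(nj - 1) for nj in freqs) / rising

print(dirichlet_eppf(1.0, (3, 2, 1)))   # one partition of n = 6 into blocks of sizes 3, 2, 1
```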

16 Dependent Random Measures with Additive Structure: GM-dependent CRMs

Additive structure: construct dependent (homogeneous and finite) CRMs (µ̃_1, µ̃_2) by means of an additive structure with a random element in common:
  µ̃_1 = µ_1 + µ_0,  µ̃_2 = µ_2 + µ_0,
where µ_0, µ_1, µ_2 are mutually independent CRMs with random Lévy intensities
  ν(dv, dx) = θ Z P_0(dx) ρ(v) dv  for µ_1, µ_2,
  ν_0(dv, dx) = θ (1 - Z) P_0(dx) ρ(v) dv  for µ_0,
and Z is some [0, 1]-valued random variable.
⇒ Griffiths-Milne dependent CRMs: (µ̃_1, µ̃_2) ∼ GM-CRM.

Remarks:
- Marginally, µ̃_1 and µ̃_2 are CRMs with ν(dv, dx) = θ P_0(dx) ρ(v) dv.
- If Z = 1 a.s., then µ̃_1 ⊥ µ̃_2.
- If Z = 0 a.s., then µ̃_1 = µ̃_2 a.s.

17 Dependent Random Measures with Additive Structure: GM-dependent NRMIs

GM-dependent NRMI (Lijoi, Nipoti & P., 2014): if (µ̃_1, µ̃_2) ∼ GM-CRM with ρ(R^+) = ∞, then
  P̃_1 = µ̃_1 / µ̃_1(X) = (µ_1 + µ_0) / (µ_1(X) + µ_0(X)),  P̃_2 = µ̃_2 / µ̃_2(X) = (µ_2 + µ_0) / (µ_2(X) + µ_0(X))
define a vector of GM-dependent NRMIs, (P̃_1, P̃_2) ∼ GM-NRMI.

- The distribution Q of (P̃_1, P̃_2) attains the two extreme cases:
  Z = 1 ⇒ P̃_1 ⊥ P̃_2 (unconditional independence),
  Z = 0 ⇒ P̃_1 = P̃_2 (a.s.) (full exchangeability).
- Non-negative correlation between P̃_1(A) and P̃_2(A), for any A, i.e.
  Corr_z(P̃_1(A), P̃_2(A)) = (1 - z) I(θ, z),
  where I(θ, z) > 0 is expressed in terms of ν.
- Discreteness of (P̃_1, P̃_2) ⇒ ties within and across the two samples:
  P[X_i = Y_l] = θ \int_0^\infty u e^{-θψ(u)} \tau_2(u) du > 0,
  with \tau_q(u) = \int_0^\infty s^q e^{-us} ρ(s) ds and ψ(u) = \int_0^\infty [1 - e^{-us}] ρ(s) ds.
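A toy numerical illustration of the GM construction, assuming the gamma case with Z fixed at z and a crude finite-dimensional Gamma approximation standing in for the three CRMs (in the actual model the CRMs have almost surely distinct atoms; a common grid of atoms is used here only to visualize how the shared component drives the dependence):

```python
import numpy as np

def gamma_crm_approx(total_mass, K, rng):
    # Finite-dimensional stand-in for a gamma CRM of total mass `total_mass`:
    # independent Gamma(total_mass / K, 1) jumps on K fixed atoms.
    return rng.gamma(total_mass / K, 1.0, size=K)

rng = np.random.default_rng(3)
theta, z, K = 2.0, 0.4, 2000
mu1 = gamma_crm_approx(theta * z, K, rng)        # idiosyncratic component of sample 1
mu2 = gamma_crm_approx(theta * z, K, rng)        # idiosyncratic component of sample 2
mu0 = gamma_crm_approx(theta * (1 - z), K, rng)  # shared component
p1 = (mu1 + mu0) / (mu1 + mu0).sum()             # normalized GM-dependent measures
p2 = (mu2 + mu0) / (mu2 + mu0).sum()
# empirical correlation of the weight vectors: a rough proxy for the dependence,
# shrinking to 0 as z -> 1 and growing as z -> 0
print(np.corrcoef(p1, p2)[0, 1])
```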

18 Dependent Random Measures with Additive Structure: GM-dependent NRMIs

peppf induced by X^{(n_1)} and Y^{(n_2)}:
  \Pi^{(n_1+n_2)}(n^X, n^Y, q^X, q^Y | θ, z) = \frac{\theta^k}{\Gamma(n_1)\Gamma(n_2)} \sum_{(i,l)} (1-z)^{k_0+|i|+|l|} z^{k_1+k_2-|i|-|l|} \int_0^\infty \int_0^\infty u^{n_1-1} v^{n_2-1} e^{-θz[ψ(u)+ψ(v)] - θ(1-z)ψ(u+v)} \prod_{j=1}^{k_1} \tau_{n^X_j}(u + i_j v) \prod_{j=1}^{k_2} \tau_{n^Y_j}(l_j u + v) \prod_{r=1}^{k_0} \tau_{q^X_r+q^Y_r}(u+v) du dv,
where the sum runs over the set of all vectors i = (i_1, ..., i_{k_1}) ∈ {0,1}^{k_1} and l = (l_1, ..., l_{k_2}) ∈ {0,1}^{k_2}, with |i| = \sum_{j=1}^{k_1} i_j, |l| = \sum_{j=1}^{k_2} l_j and k = k_0 + k_1 + k_2.

Degenerate cases:
- z = 1 ⇒ k_0 = 0 ⇒ product of marginal EPPFs:
  \Pi^{(n_1+n_2)}(n^X, n^Y, q^X, q^Y | θ, 1) = \Pi_1^{(n_1)}(n^X_1, ..., n^X_{k_1}) \Pi_2^{(n_2)}(n^Y_1, ..., n^Y_{k_2});
- z = 0 ⇒ |i| = k_1 & |l| = k_2 ⇒ global EPPF:
  \Pi^{(n_1+n_2)}(n^X, n^Y, q^X, q^Y | θ, 0) = \frac{\theta^k}{\Gamma(n_1+n_2)} \int_0^\infty s^{n_1+n_2-1} e^{-θψ(s)} \prod_{j=1}^{k_1} \tau_{n^X_j}(s) \prod_{i=1}^{k_2} \tau_{n^Y_i}(s) \prod_{r=1}^{k_0} \tau_{q^X_r+q^Y_r}(s) ds.

19 Dependent Random Measures with Additive Structure: GM-dependent NRMIs

Invariance condition: if λ_i is any permutation of the vector of integers (1, ..., k_i), with i = 0, 1, 2, and λ_i w = (w_{λ_i(1)}, ..., w_{λ_i(k_i)}), then
  \Pi^{(n_1+n_2)}(n^X, n^Y, q) = \Pi^{(n_1+n_2)}(λ_1 n^X, λ_2 n^Y, λ_0 q)
⇒ exchangeability within three separate groups of clusters: the clusters with non-shared values specific to each of the two samples, and those shared by the two samples.

- The peppf for GM-dependent Dirichlet processes is expressed in terms of hypergeometric functions ₃F₂, with some numerical issues.
- The peppf for GM-dependent σ-stable processes is simpler and easier to evaluate.
- For MCMC sampling: augmentation with suitable auxiliary random variables.
- Application to dependent mixtures f̃_l(x) = \int_Θ k(x; θ) p̃_l(dθ), l = 1, 2, for density estimation and clustering.

20 Dependent Random Measures with Nested Structure: The Model

Nested models. The nested Dirichlet process introduced in Rodríguez, Dunson and Gelfand (2008) can be generalized to nested homogeneous NRMIs.

Nested NRMI:
  (X_i, Y_j) | (P̃_1, P̃_2) ind∼ P̃_1 ⊗ P̃_2,
  (P̃_1, P̃_2) | R̃ iid∼ R̃,
  R̃ ∼ L(Q_0),  Q_0(·) = P[P̃_0 ∈ ·],  P̃_0 ∼ NRMI.

In other terms, R̃ is a nested random probability measure such that
  R̃ = \sum_{j≥1} ω_j δ_{G_j},  G_j iid∼ Q_0,  P̃_0 = \sum_{i≥1} γ_i δ_{θ_i}.

Since R̃ is discrete,
  π_1 = P[P̃_1 = P̃_2] = θ \int_0^\infty u e^{-θψ(u)} \tau_2(u) du > 0.
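A quick numerical check of π_1, assuming the gamma intensity ρ(s) = e^{-s}/s: then ψ(u) = log(1+u) and τ_2(u) = (1+u)^{-2}, so the integral reduces to the closed form 1/(1+θ), the familiar value for the nested Dirichlet process.

```python
import numpy as np
from scipy.integrate import quad

def pi1_nested_gamma(theta):
    # pi_1 = theta * int_0^inf u exp(-theta psi(u)) tau_2(u) du with, in the
    # gamma case, exp(-theta psi(u)) * tau_2(u) = (1 + u)^(-theta - 2).
    val, _ = quad(lambda u: u * (1.0 + u) ** (-theta - 2.0), 0.0, np.inf)
    return theta * val

for theta in (0.5, 1.0, 5.0):
    print(theta, pi1_nested_gamma(theta), 1.0 / (1.0 + theta))   # should agree
```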

21 Dependent Random Measures with Nested Structure: Partition structure

Nested NRMI: the peppf associated to a nested NRMI is
  \Pi^{(N)}(n^X, n^Y, q^X, q^Y) = π_1 \Pi^{(n_1+n_2)}(n^X, n^Y, q^X + q^Y) + (1 - π_1) \Pi_1^{(n_1)}(n^X) \Pi_2^{(n_2)}(n^Y) I_{\{0\}}(k_0),
with the first term accounting for joint exchangeability and the second for (unconditional) independence.
⇒ Linear combination of exchangeability and (unconditional) independence.

EPPF with exchangeability of all N = n_1 + n_2 observations in both samples:
  \Pi^{(N)}(n^X, n^Y, q^X + q^Y) = \frac{\theta_0^k}{\Gamma(N)} \int_0^\infty u^{N-1} e^{-θ_0 ψ_0(u)} \prod_{j=1}^{k_1} \tau^{(0)}_{n^X_j}(u) \prod_{i=1}^{k_2} \tau^{(0)}_{n^Y_i}(u) \prod_{r=1}^{k_0} \tau^{(0)}_{q^X_r+q^Y_r}(u) du.

Marginal EPPF in the independence case:
  \Pi^{(n)}(n) = \frac{\theta_0^k}{\Gamma(n)} \int_0^\infty u^{n-1} e^{-θ_0 ψ_0(u)} \prod_{j=1}^k \tau^{(0)}_{n_j}(u) du,
with ψ_0(u) = \int_0^\infty [1 - e^{-us}] ρ_0(s) ds and \tau^{(0)}_m(u) = \int_0^\infty s^m e^{-us} ρ_0(s) ds.

22 Dependent Random Measures with Nested Structure: Partition structure

Recall the peppf decomposition
  \Pi^{(N)}(n^X, n^Y, q^X, q^Y) = π_1 \Pi^{(n_1+n_2)}(n^X, n^Y, q^X + q^Y) + (1 - π_1) \Pi_1^{(n_1)}(n^X) \Pi_2^{(n_2)}(n^Y) I_{\{0\}}(k_0),
whose terms account for joint exchangeability and for (unconditional) independence. The extreme cases of full exchangeability and (unconditional) independence are thus attained.

If X^{(n_1)} ∩ Y^{(n_2)} ≠ ∅, i.e. the two samples share at least one distinct value, then
  P[P̃_1 = P̃_2 | X^{(n_1)}, Y^{(n_2)}] = 1.
A single distinct value in common between the two samples, i.e. k_0 ≥ 1, implies that, conditionally on the data, P̃_1 = P̃_2 with probability 1, and one has full exchangeability (no heterogeneity). Restrictive!

Need a model that still accommodates heterogeneity despite the presence of shared observations across samples.
⇒ Additive structure: let both P̃_1 and P̃_2 depend on an idiosyncratic component and a common random probability measure P̃_0, which explains ties with positive probability and avoids the degeneracy.

23 Dependent Random Measures with Nested Structure: Alternative Model

Nested NRMI with additive structure. Nested NRMI with additive homogeneous component:
  (X_i, Y_j) | (P̃_1, P̃_2) iid∼ P̃_1 ⊗ P̃_2,
  P̃_l =_d ξ p̃_l + (1 - ξ) p̃_0,  l = 1, 2,
where ξ is a [0, 1]-valued random variable and
  (p̃_1, p̃_2, p̃_0) | (R̃, R̃*) ∼ R̃^2 ⊗ R̃*,
with R̃ a nested random probability measure, R̃* independent of R̃ with R̃* = δ_{P̃_0}, so that p̃_0 =_d P̃_0.

Conditionally on latent allocation variables ζ, the peppf is
  \Pi^{(N)}(n^X, n^Y, q^X, q^Y; ζ) = \Pi^{(N-v)} \{ π_1 \Pi^{(v)} + (1 - π_1) \Pi_1^{(v_1)} \Pi_2^{(v_2)} \}.
⇒ Even with shared observations, the component of the peppf that accounts for heterogeneity does not vanish: \Pi_1^{(v_1)} \Pi_2^{(v_2)} > 0.

24 Dependent Random Measures with Nested Structure: Illustration

Illustrative example: density estimation.
  (X_i, Y_j) | (θ_{1,i}, θ_{2,j}) ind∼ h(·; θ_{1,i}) ⊗ h(·; θ_{2,j}),
  (θ_{1,i}, θ_{2,j}) | (P̃_1, P̃_2) iid∼ P̃_1 ⊗ P̃_2,
with Gaussian kernel h(·; (M, V)) and (M, V) ∈ R × R^+.

Nested prior with additive structure based on the stable NRMI:
  P̃_1 = ξ p̃_1 + (1 - ξ) p̃_0,  P̃_2 = ξ p̃_2 + (1 - ξ) p̃_0,
with standard choices of (conjugate) base measure and hyperpriors.

⇒ Comparison: standard nested vs. nested with additive structure, i.e. ξ = 1 (a.s.) vs ξ ∼ Unif(0, 1).

Simulate n_1 = n_2 = 100 iid values from
  X ∼ (1/2) N(5, 0.6) + (1/2) N(10, 0.6),  Y ∼ (1/2) N(0, 0.6) + (1/2) N(5, 0.6),
and estimate the marginal densities f_1 and f_2 (clustering is also possible).
A marginal MCMC algorithm is readily derived from the peppf.

25 Dependent Random Measures with Nested Structure: Illustration

True and estimated densities.
Figure: model with ξ = 1.
Figure: model with ξ ∼ Unif(0, 1).

26 Dependent Random Measures with Hierarchical Structure: Hierarchical NRMI models

Hierarchical processes. Hierarchical (homogeneous) NRMI for partially exchangeable data:
  (X_{1,j_1}, X_{2,j_2}) | (P̃_1, P̃_2) iid∼ P̃_1 ⊗ P̃_2,  (j_1, j_2) ∈ N^2,
  P̃_i | P̃_0 ∼ NRMI(P̃_0),  i = 1, 2,
  P̃_0 ∼ NRMI(P_0),
with P_0 a non-atomic measure on X.
⇒ If the NRMI coincides with the Dirichlet process, one obtains the hierarchical Dirichlet process (HDP) introduced in Teh, Jordan, Beal and Blei (2006). See also Teh & Jordan (2010).

Data generated by the model: X^{(n_i)} = (X_{i,1}, ..., X_{i,n_i}) for i = 1, 2. The data will contain ties with positive probability, within and across samples, leading to X^{(n_i)} having k_i distinct values specific to sample i with frequencies (n_{i,1}, ..., n_{i,k_i}), for i = 1, 2, and k_0 values shared by the two samples with frequencies (q_{i,1}, ..., q_{i,k_0}).

Goal: derive the distribution of the partition structure.

27 Dependent Random Measures with Hierarchical Structure: Chinese restaurant franchise metaphor

Observable level: customers and dishes.
- There are d = 2 restaurants sharing the same menu.
- X_{i,j}: label of the dish served at restaurant i to customer j.
- Each table is served the same dish, chosen by the first customer seated at that table.
- The same dish can be served at different tables within the same restaurant and across different restaurants.
- The sample information is best recorded by: k = number of distinct dishes served in either restaurant to the N = n_1 + n_2 customers, with corresponding frequency vectors n_i = (n_{i,1}, ..., n_{i,k}), where n_{i,j} ≥ 0, and n_1 and n_2 will contain k_2 and k_1 zeros, respectively.

Latent level: tables (governed by P̃_0).
- l_{i,j}: number of tables in restaurant i serving dish j, whose range is {1, ..., n_{i,j}} if dish j is served at restaurant i, and 0 otherwise.
- l̄_j = \sum_{i=1}^2 l_{i,j}: total number of tables serving dish j, for j = 1, ..., k.
- |l|: total number of tables in the two restaurants.
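A forward-sampling sketch of the franchise metaphor for d = 2 restaurants, assuming Dirichlet-process levels with masses θ (restaurants) and θ_0 (franchise); names and parameters are illustrative.

```python
import numpy as np

def crf_sample(n_per_rest, theta, theta0, base_sampler, rng):
    # Franchise-level state: dish values and the number of tables serving each dish.
    global_dishes, global_counts = [], []
    data = []
    for n in n_per_rest:
        tables, table_dish, obs = [], [], []   # per-restaurant state
        for _ in range(n):
            # join an existing table w.p. prop. to its occupancy,
            # or open a new table w.p. prop. to theta
            probs = np.array(tables + [theta], dtype=float)
            j = rng.choice(len(probs), p=probs / probs.sum())
            if j == len(tables):
                # new table: its dish comes from the franchise-level CRP(theta0)
                g = np.array(global_counts + [theta0], dtype=float)
                k = rng.choice(len(g), p=g / g.sum())
                if k == len(global_counts):    # brand-new dish drawn from P0
                    global_dishes.append(base_sampler(rng))
                    global_counts.append(0)
                global_counts[k] += 1
                tables.append(0)
                table_dish.append(k)
            tables[j] += 1
            obs.append(global_dishes[table_dish[j]])
        data.append(np.array(obs))
    return data

rng = np.random.default_rng(4)
X, Y = crf_sample([100, 100], theta=3.0, theta0=1.0,
                  base_sampler=lambda r: r.normal(), rng=rng)
print(len(np.intersect1d(X, Y)), "dishes shared by the two restaurants")
```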

28 Dependent Random Measures with Hierarchical Structure: Augmented partition structure

A suitable augmentation provides a more refined partition structure involving also:
- q_{i,j,t}: frequency of customers at restaurant i eating dish j and sitting at table t;
- q_{i,j} = (q_{i,j,1}, ..., q_{i,j,l_{i,j}}): frequency vector of customers in restaurant i eating dish j at each of the l_{i,j} tables.
By marginalizing over the tables one recovers the observed frequencies: n_{i,j} = |q_{i,j}| = \sum_{t=1}^{l_{i,j}} q_{i,j,t}.

First target: characterize the joint distribution of the partitions of
- the N = n_1 + n_2 customers into the k distinct dishes being served in the 2 restaurants,
- the q_{i,j} customers among the l_{i,j} tables in restaurant i serving dish j, for i = 1, 2 and j = 1, ..., k:
  \Phi_{k,0}^{(|l|)}(l̄_1, ..., l̄_k) \prod_{i=1}^2 \Phi_{l_i,i}^{(n_i)}(q_{i,1}, ..., q_{i,k}),
i.e. an EPPF on the number of tables times a multivariate partition distribution, where \Phi_{k,0}^{(n)} is the EPPF associated to P̃_0 and \Phi_{k,i}^{(n)} is deduced from (P̃_1, P̃_2).

29 Dependent Random Measures with Hierarchical Structure: peppf for the general case

Partially exchangeable partition probability function. peppf of a hierarchical NRMI:
  \Pi^{(n)}(n_1, ..., n_k) = \sum_l \sum_q \Phi_{k,0}^{(|l|)}(l̄_1, ..., l̄_k) \prod_{i=1}^2 \Phi_{l_i,i}^{(n_i)}(q_{i,1}, ..., q_{i,k}),
where \sum_q is a sum over all partitions q and \sum_l is a sum over all compatible table configurations, i.e. over l_{i,j} ∈ {1, ..., n_{i,j}} with l_{i,j} = 0 if n_{i,j} = 0.

Partition probability function with the constraint |q_{i,j}| = n_{i,j}:
  \Phi_{l_i,i}^{(n_i)}(q_{i,1}, ..., q_{i,k}) = \frac{\theta^{|l_i|}}{\Gamma(n_i)} \int_0^\infty u^{n_i-1} e^{-θψ(u)} \prod_{j=1}^k \prod_{t=1}^{l_{i,j}} \tau_{q_{i,j,t}}(u) du.

\Phi_{k,0}^{(|l|)}(l̄_1, ..., l̄_k) is the EPPF associated to P̃_0 ∼ NRMI(P_0).

⇒ Unlike GM-dependent and nested processes, hierarchical processes do not achieve the limiting cases of joint exchangeability or (unconditional) independence.

30 Dependent Random Measures with Hierarchical Structure: peppf for the HDP & hierarchical σ-stable process

peppf of the HDP:
  \Pi^{(n)}(n_1, ..., n_k) = \frac{\theta_0^k}{\prod_{i=1}^2 (\theta)_{n_i}} \sum_l \frac{\theta^{|l|}}{(\theta_0)_{|l|}} \prod_{j=1}^k (l̄_j - 1)! \prod_{i=1}^2 \prod_{j=1}^k |s(n_{i,j}, l_{i,j})|,
where |s(r, s)| is the absolute value of the Stirling number of the first kind.

Probabilistic interpretation:
- \Phi_{k,0}^{(|l|)}(l̄_1, ..., l̄_k): EPPF of a Dirichlet process with base measure θ_0 P_0;
- K_{n_{i,j}}: number of distinct tables at which the n_{i,j} customers are seated, according to a Dirichlet process with base measure θP_0:
  \Pi^{(n)}(n_1, ..., n_k) = \prod_{i=1}^2 \frac{\prod_{j=1}^k (\theta)_{n_{i,j}}}{(\theta)_{n_i}} \sum_l \Phi_{k,0}^{(|l|)}(l̄_1, ..., l̄_k) \prod_{i=1}^2 \prod_{j=1}^k P[K_{n_{i,j}} = l_{i,j}].

peppf of the hierarchical normalized σ-stable process:
  \Pi^{(n)}(n_1, ..., n_k) = \frac{\sigma_0^{k-1} \Gamma(k)}{\prod_{i=1}^2 \Gamma(n_i)} \sum_l \sigma^{|l|-2} \frac{\prod_{i=1}^2 \Gamma(|l_i|)}{\Gamma(|l|)} \prod_{j=1}^k (1-\sigma_0)_{l̄_j-1} \prod_{i=1}^2 \prod_{j=1}^k \frac{C(n_{i,j}, l_{i,j}; \sigma)}{\sigma^{l_{i,j}}},
where C(n, l; σ) is the generalized factorial coefficient.
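The HDP peppf only requires the unsigned Stirling numbers |s(n, l)|, which are cheap to tabulate via the standard recurrence; a minimal sketch:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling1_abs(n, l):
    # Unsigned Stirling numbers of the first kind, via the recurrence
    # |s(n, l)| = (n - 1) |s(n - 1, l)| + |s(n - 1, l - 1)|.
    if n == l == 0:
        return 1
    if n == 0 or l == 0 or l > n:
        return 0
    return (n - 1) * stirling1_abs(n - 1, l) + stirling1_abs(n - 1, l - 1)

print([stirling1_abs(5, l) for l in range(6)])   # [0, 24, 50, 35, 10, 1]
```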

31 Dependent Random Measures with Hierarchical Structure: peppf for the hierarchical PY model

With σ, σ_0 ∈ (0, 1) and θ, θ_0 > 0, the hierarchy of random probabilities is
  (P̃_1, P̃_2) | P̃_0 iid∼ PY(σ, θ; P̃_0),  P̃_0 ∼ PY(σ_0, θ_0; P_0).
Possible construction of the Pitman-Yor (PY) process: randomization of the total mass c of a generalized gamma NRMI (Pitman & Yor, 1997).

peppf of HPY processes:
  \Pi^{(n)}(n_1, ..., n_k) = \sum_l \frac{\prod_{r=1}^{k-1}(\theta_0 + r\sigma_0)}{(\theta_0 + 1)_{|l|-1}} \prod_{j=1}^k (1-\sigma_0)_{l̄_j-1} \prod_{i=1}^2 \frac{\prod_{r=1}^{|l_i|-1}(\theta + r\sigma)}{(\theta + 1)_{n_i-1}} \prod_{j=1}^k \frac{C(n_{i,j}, l_{i,j}; \sigma)}{\sigma^{l_{i,j}}}.

- The n_{i,j} customers eating dish j in restaurant i are allocated to K_{n_{i,j}} tables generated by a PY(σ, θ; P_0) process.
- The global table configuration is weighted by the EPPF of the higher-level PY(σ_0, θ_0; P_0) process.
- The peppf is obtained by marginalizing with respect to the table configuration.
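Similarly, the hierarchical stable and HPY peppfs require the generalized factorial coefficients C(n, l; σ); a minimal sketch using the standard recurrence C(n, l; σ) = σ C(n-1, l-1; σ) + (n - 1 - lσ) C(n-1, l; σ):

```python
from functools import lru_cache

def gen_factorial_coeff(sigma):
    # C(n, l; sigma) is defined by (sigma * t)_n = sum_l C(n, l; sigma) (t)_l;
    # it is tabulated here via the triangular recurrence above.
    @lru_cache(maxsize=None)
    def C(n, l):
        if n == l == 0:
            return 1.0
        if l == 0 or l > n:
            return 0.0
        return sigma * C(n - 1, l - 1) + (n - 1 - l * sigma) * C(n - 1, l)
    return C

C = gen_factorial_coeff(0.5)
print([C(4, l) for l in range(5)])   # row n = 4 of the coefficient triangle
```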

32 Dependent Random Measures with Hierarchical Structure: Exchangeable case

EPPF of a hierarchical NRMI prior (d = 1). The case d = 1 is the exchangeable case with de Finetti measure a hierarchical process:
  \Pi^{(n)}(n_1, ..., n_k) = \sum_l \sum_q \Phi_{k,0}^{(|l|)}(l_1, ..., l_k) \Phi_{l,k}^{(n)}(q_1, ..., q_k).
⇒ A similar formula can be given for hierarchical Pitman-Yor processes.

From this result one recovers two coagulation properties (Pitman, 1999):
- If P̃ is a hierarchical stable NRMI,
  \Pi^{(n)}(n_1, ..., n_k) = \frac{(\sigma\sigma_0)^{k-1} \Gamma(k)}{\Gamma(n)} \prod_{j=1}^k (1-\sigma\sigma_0)_{n_j-1},
  i.e. P̃ is itself a stable NRMI with parameter σσ_0 ∈ (0, 1).
- If P̃ is a HPY process with θ = θ_0 σ, then P̃ ∼ PY(σσ_0, θ; P_0).

33 Dependent Random Measures with Hierarchical Structure: Prediction in Genomics

Prediction with hierarchical processes. Conditional on the observed samples X^{(n_1)} and Y^{(n_2)}, interest is in predicting features of additional future samples
  X_{n_1+1}, ..., X_{n_1+m_1}  and  Y_{n_2+1}, ..., Y_{n_2+m_2}.

Species sampling: the X_i's and Y_j's are species labels, with species shared within and between samples, and the goal is to estimate, e.g.:
- the number of new distinct species that will be observed;
- the probability that the (n_i + m_i + 1)-th observation will be a new species.

Exchangeable case: closed-form estimators for Gibbs-type priors (reviewed in De Blasi et al., 2015).
Partially exchangeable case: inferences cannot be evaluated exactly ⇒ simulation algorithm based on the peppf.

Illustration: EST analyses. Expressed sequence tags (ESTs) are a useful tool for gene identification in organisms. ESTs are generated by randomly sequencing genes from a cDNA library, which consists of a large and unknown number of differentially expressed genes (typically millions) ⇒ potentially infinite. Only a sample corresponding to a small portion of the library is typically available.

34 Dependent Random Measures with Hierarchical Structure: Prediction in Genomics

Citrus clementina libraries. Samples from two EST libraries of Citrus clementina fruits:
- FRUIT1 ("FlavFr1"): n_1 = 1593 ESTs with k_1 = 806 distinct genes (k_1/n_1 ≈ 0.51);
- FRUIT2 ("RindPdig24"): n_2 = 900 ESTs with k_2 = 687 distinct genes (k_2/n_2 ≈ 0.76).
The two samples share 183 genes, with total frequency 520 in the FRUIT1 sample and 317 in the FRUIT2 sample (about 1/3 of each sample).

Table: numbers of ESTs n and of distinct genes K_n by expression level, for FRUIT1, FRUIT2 and the combined FRUITS data.

35 Dependent Random Measures with Hierarchical Structure: Prediction in Genomics

Models: P̃_1 ⊥ P̃_2 independent PY processes vs (P̃_1, P̃_2) hierarchical PY process.

Discovery probability decay: probability that the (n_i + m_i + 1)-th observation is new, as the size m_i of the additional sample varies.

Figure: estimated discovery probabilities; (a) (separately) exchangeable, (b) partially exchangeable.

⇒ Borrowing of information and narrower HPD bands.

36 Dependent Random Measures with Hierarchical Structure: Prediction in Genomics

Other quantities of interest:
- number of new distinct genes identified by additional sequencing: K^X_{m|n_1} & K^Y_{m|n_2};
- number of additionally sequenced genes coinciding with new ones: L^X_{m|n_1} & L^Y_{m|n_2}.

Table: estimates K̂^X_{m|n_1}, L̂^X_{m|n_1} (FRUIT1) and K̂^Y_{m|n_2}, L̂^Y_{m|n_2} (FRUIT2) for increasing m, under independent PY processes P̃_1 ⊥ P̃_2 and under the HPY process (P̃_1, P̃_2).

In the dependent case, finer prediction is possible. For instance, considering additional sequencing for FRUIT1:
- for m = 600, K̂^X_{m|n_1} = 224.5, which is the sum of the predicted 37.6 genes new to FRUIT1 but already observed in FRUIT2 and 186.9 genes new overall;
- for m = 2000, it is predicted that 96.6 of the 504 genes originally observed only in FRUIT2 will be detected also in the FRUIT1 library.

37 Dependent Random Measures with Hierarchical Structure: Hierarchical random mixture hazards

Hierarchical hazard rate mixtures & dependent mixture hazards.

Hierarchical CRMs on Y:
  (µ̃_1, µ̃_2) | µ̃_0 =_d CRM(ρ_1, µ̃_0) ⊗ CRM(ρ_2, µ̃_0),
  µ̃_0 =_d CRM(ρ_0, c_0 P_0).
Construct dependent hazard rate mixture models:
  h̃_l(t) = \int_Y k(t, y) µ̃_l(dy),  l = 1, 2.
⇒ The corresponding partially exchangeable sequences (X_{i,1})_{i≥1} and (X_{i,2})_{i≥1} imply survival-time distributions of the form
  P[X_{i,1} > t_1, X_{j,2} > t_2 | (µ̃_1, µ̃_2)] = \prod_{l=1}^2 \exp\{-\int_0^{t_l} h̃_l(s) ds\}.
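A small single-sample sketch of the hazard-rate mixture mechanics, assuming a Dykstra-Laud-type kernel k(t, y) = 1{y ≤ t} and i.i.d. gamma jumps as a crude stand-in for CRM jumps; the survival function S(t) = exp{-∫_0^t h(s) ds} is approximated on a grid.

```python
import numpy as np

def survival_from_jumps(t_grid, jumps, locs, kern):
    # h(t) = sum_j J_j * k(t, Y_j); S(t) = exp(- cumulative integral of h).
    h = np.array([np.sum(jumps * kern(t, locs)) for t in t_grid])
    dt = np.diff(t_grid, prepend=0.0)
    return np.exp(-np.cumsum(h * dt))

kern = lambda t, y: (y <= t).astype(float)    # Dykstra-Laud kernel: increasing hazard
rng = np.random.default_rng(5)
jumps = rng.gamma(0.1, 1.0, size=50)          # crude i.i.d. stand-in for CRM jumps
locs = rng.uniform(0.0, 10.0, size=50)
t_grid = np.linspace(0.0, 10.0, 201)
S = survival_from_jumps(t_grid, jumps, locs, kern)
print(S[0], S[-1])                            # S decreases from 1 towards 0
```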

38 Dependent Random Measures with Hierarchical Structure: Posterior characterization

Likelihood:
  L(µ_1, µ_2; X) = \prod_{l=1}^2 \exp\{-\int_Y K_l(y) µ_l(dy)\} \prod_{i=1}^{n_l} \int_Y k(X_{i,l}; y) µ_l(dy),
with K_l(y) = \sum_{i=1}^{n_l} \int_0^{X_{i,l}} k(t, y) dt.

Augmented likelihood:
  L(µ_1, µ_2; X, Y) = \prod_{l=1}^2 \exp\{-\int_Y K_l(y) µ_l(dy)\} \prod_{i=1}^{n_l} k(X_{i,l}; Y_{i,l}) µ_l(dY_{i,l}),
and the latent variables Y_{i,l} are generated by a discrete random probability.

Posterior characterization: the posterior of (µ̃_1, µ̃_2), given the data X and the latents Y, equals the distribution of the random measure vector
  (µ_1^*, µ_2^*) + (\sum_{j=1}^{k_1} J_{j,1} δ_{Y^*_{j,1}}, \sum_{j=1}^{k_2} J_{j,2} δ_{Y^*_{j,2}}),
where
- (µ_1^*, µ_2^*) is a hierarchical CRM with updated marginal Lévy intensities;
- the jumps J_{j,l} are independent and with known density.
⇒ The model can be modified so as to include censored observations and also other covariates, i.e. a semiparametric Cox proportional hazards type of specification.

39 Dependent Random Measures with Hierarchical Structure: Illustration

Illustration: estimation of the survival functions S_1 & S_2.

Simulation study: 2 samples of size n_1 = n_2 = 100 generated from Weibull distributions with different parameters, chosen so that the survival functions cross and do not satisfy the proportional hazards assumption.
⇒ The dependent hierarchical model is able to distinguish the two survival functions.

Figure: estimated and true survival functions.

40 Concluding remarks

- Inference with non-exchangeable data is analytically challenging.
- CRM-based dependent priors look promising: conditionally on a suitable latent structure, they typically display distributional properties reminiscent of those available in the exchangeable case.
- The technique for deriving moment measures (and sometimes even complete posteriors) is general and needs only some adaptation depending on the specific transformations of the CRMs.
- In the exchangeable case (d = 1), a hierarchical random probability yields a new class of nonparametric priors: for all of them, except the two cases where coagulation is achieved, one has
  P[X_{n+1} = "new" | X^{(n)}] = f(n, K_n, (N_1, ..., N_{K_n})).
  Hence, they are not of Gibbs type, though one can still identify the associated prediction rule.

41 Some References

Camerlenghi, Lijoi, Orbanz & Prünster (2015). Hierarchical processes and prediction. Submitted.
Camerlenghi, Lijoi & Prünster (2015). Hierarchical structures in survival analysis. In preparation.
Camerlenghi, Dunson, Lijoi, Prünster & Rodríguez (2015). Generalized nested nonparametric priors for modeling heterogeneous data. In preparation.
De Blasi, Favaro, Lijoi, Mena, Prünster & Ruggiero (2015). Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE TPAMI 37.
Dykstra & Laud (1981). A Bayesian nonparametric approach to reliability. Ann. Statist. 9.
Ewens (1972). The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3.
Ferguson (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1.
de Finetti (1938). Sur la condition d'équivalence partielle. Act. Sci. Ind. 739.
Griffiths & Milne (1978). A class of bivariate Poisson processes. J. Mult. Anal. 8.
James (2005). Bayesian Poisson process partition calculus with an application to Bayesian Lévy moving averages. Ann. Statist. 33.
Kingman (1967). Completely random measures. Pacific J. Math. 21.
Lijoi, Nipoti & Prünster (2014). Bayesian inference with dependent normalized completely random measures. Bernoulli 20.

42 Some References (continued)

Lijoi, Nipoti & Prünster (2014). Dependent mixture models: clustering and borrowing information. Comput. Statist. & Data Anal. 71.
Lijoi & Prünster (2010). Models beyond the Dirichlet process. In Bayesian Nonparametrics, Cambridge University Press.
MacEachern (1999). Dependent nonparametric processes. In ASA Proceedings.
MacEachern (2000). Dependent Dirichlet processes. OSU Tech. Report.
Müller, Quintana & Rosner (2004). A method for combining inference across related nonparametric Bayesian models. J. Roy. Statist. Soc. Ser. B 66.
Papaspiliopoulos & Roberts (2008). Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika 95.
Pitman (1995). Exchangeable and partially exchangeable random partitions. Probab. Theory Related Fields 102.
Pitman (1999). Coalescents with multiple collisions. Ann. Probab. 27.
Pitman & Yor (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Ann. Probab. 25.
Regazzini, Lijoi & Prünster (2003). Distributional results for means of random measures with independent increments. Ann. Statist. 31.
Rodríguez, Dunson & Gelfand (2008). The nested Dirichlet process. J. Amer. Statist. Assoc. 103.
Teh & Jordan (2010). Hierarchical Bayesian nonparametric models with applications. In Bayesian Nonparametrics, Cambridge University Press.
Teh, Jordan, Beal & Blei (2006). Hierarchical Dirichlet processes. J. Amer. Statist. Assoc. 101.
Walker (2007). Sampling the Dirichlet mixture model with slices. Comm. Statist. Simul. Computat. 36.


More information

Bayesian nonparametric models of sparse and exchangeable random graphs

Bayesian nonparametric models of sparse and exchangeable random graphs Bayesian nonparametric models of sparse and exchangeable random graphs F. Caron & E. Fox Technical Report Discussion led by Esther Salazar Duke University May 16, 2014 (Reading group) May 16, 2014 1 /

More information

Applied Nonparametric Bayes

Applied Nonparametric Bayes Applied Nonparametric Bayes Michael I. Jordan Department of Electrical Engineering and Computer Science Department of Statistics University of California, Berkeley http://www.cs.berkeley.edu/ jordan Acknowledgments:

More information

Nonparametric Bayesian Methods - Lecture I

Nonparametric Bayesian Methods - Lecture I Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics

More information

Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation Dahua Lin Toyota Technological Institute at Chicago dhlin@ttic.edu Abstract Reliance on computationally expensive

More information

Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation

Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation Ian Porteous, Alex Ihler, Padhraic Smyth, Max Welling Department of Computer Science UC Irvine, Irvine CA 92697-3425

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

Clustering using Mixture Models

Clustering using Mixture Models Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior

More information

On the Fisher Bingham Distribution

On the Fisher Bingham Distribution On the Fisher Bingham Distribution BY A. Kume and S.G Walker Institute of Mathematics, Statistics and Actuarial Science, University of Kent Canterbury, CT2 7NF,UK A.Kume@kent.ac.uk and S.G.Walker@kent.ac.uk

More information

Bayesian Nonparametric Modelling of Genetic Variations using Fragmentation-Coagulation Processes

Bayesian Nonparametric Modelling of Genetic Variations using Fragmentation-Coagulation Processes Journal of Machine Learning Research 1 (2) 1-48 Submitted 4/; Published 1/ Bayesian Nonparametric Modelling of Genetic Variations using Fragmentation-Coagulation Processes Yee Whye Teh y.w.teh@stats.ox.ac.uk

More information

Some Developments of the Normalized Random Measures with Independent Increments

Some Developments of the Normalized Random Measures with Independent Increments Sankhyā : The Indian Journal of Statistics 2006, Volume 68, Part 3, pp. 46-487 c 2006, Indian Statistical Institute Some Developments of the Normalized Random Measures with Independent Increments Laura

More information

Formulas for probability theory and linear models SF2941

Formulas for probability theory and linear models SF2941 Formulas for probability theory and linear models SF2941 These pages + Appendix 2 of Gut) are permitted as assistance at the exam. 11 maj 2008 Selected formulae of probability Bivariate probability Transforms

More information

Bayesian estimation of the discrepancy with misspecified parametric models

Bayesian estimation of the discrepancy with misspecified parametric models Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012

More information

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I X i Ν CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr 2004 Dirichlet Process I Lecturer: Prof. Michael Jordan Scribe: Daniel Schonberg dschonbe@eecs.berkeley.edu 22.1 Dirichlet

More information

Hierarchical Modeling for Univariate Spatial Data

Hierarchical Modeling for Univariate Spatial Data Hierarchical Modeling for Univariate Spatial Data Geography 890, Hierarchical Bayesian Models for Environmental Spatial Data Analysis February 15, 2011 1 Spatial Domain 2 Geography 890 Spatial Domain This

More information

1 EM algorithm: updating the mixing proportions {π k } ik are the posterior probabilities at the qth iteration of EM.

1 EM algorithm: updating the mixing proportions {π k } ik are the posterior probabilities at the qth iteration of EM. Université du Sud Toulon - Var Master Informatique Probabilistic Learning and Data Analysis TD: Model-based clustering by Faicel CHAMROUKHI Solution The aim of this practical wor is to show how the Classification

More information

A Fully Nonparametric Modeling Approach to. BNP Binary Regression

A Fully Nonparametric Modeling Approach to. BNP Binary Regression A Fully Nonparametric Modeling Approach to Binary Regression Maria Department of Applied Mathematics and Statistics University of California, Santa Cruz SBIES, April 27-28, 2012 Outline 1 2 3 Simulation

More information

Spatial Bayesian Nonparametrics for Natural Image Segmentation

Spatial Bayesian Nonparametrics for Natural Image Segmentation Spatial Bayesian Nonparametrics for Natural Image Segmentation Erik Sudderth Brown University Joint work with Michael Jordan University of California Soumya Ghosh Brown University Parsing Visual Scenes

More information

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process 10-708: Probabilistic Graphical Models, Spring 2015 19 : Bayesian Nonparametrics: The Indian Buffet Process Lecturer: Avinava Dubey Scribes: Rishav Das, Adam Brodie, and Hemank Lamba 1 Latent Variable

More information