Multivariate kernel partition processes


David B. Dunson
Department of Statistical Science, Box 90251, Duke University, Durham, NC
dunson@stat.duke.edu

SUMMARY

This article considers the problem of accounting for unknown multivariate mixture distributions within Bayesian hierarchical models motivated by functional data analysis. Most nonparametric Bayes methods rely on global partitioning, with subjects assigned to a single cluster index for all their random effects. We propose a multivariate kernel partition process (KPP) that instead allows the cluster indices to vary. The KPP is shown to be the driving measure for a multivariate ordered Chinese restaurant process, which allows a highly flexible dependence structure in local clustering. This structure allows the relative locations of the random effects to inform the clustering process, with spatially proximal random effects likely to be assigned the same cluster index. An efficient exact block Gibbs sampler is developed for posterior computation, avoiding the need to truncate the infinite measure. The methods are applied to functional data analysis, illustrating improvements in model fit, parsimony and predictive performance relative to current methods.

Key words: Basis functions; Chinese restaurant process; Dirichlet process; Local clustering; Longitudinal data; Nonparametric Bayes; Random effects; Sparsity.

1. Introduction

Functional data analysis considers a random function as the basic unit of analysis for a subject, with the data consisting of error-prone measurements at a finite number of locations. Some examples of functional data include longitudinal trajectories in a biomarker and brain images. Our particular interest is in sparse functional data measured at unequally-spaced and varying locations for the different subjects. In such settings, a variety of modeling strategies have been proposed, including hierarchical Gaussian processes (Behseta, Kass and Wallstrom, 2005; Kaufman and Sain, 2007), random effects models relying on basis function representations (Thompson and Rosen, 2008; Bigelow and Dunson, 2007), and wavelet-based functional mixed models (Morris and Carroll, 2006). For Gaussian process (GP) models, the choice of mean and covariance function plays a critical role, while for random effects models the choice of basis and parametric form for the random effects distribution is important.

Let $f_i : \mathcal{X} \to \Re$ denote the function for subject $i$, for $i = 1, \ldots, n$. As a more flexible alternative to the GP, one can use a functional Dirichlet process (FDP) in which $f_i \sim P$ and $P \sim \mathrm{DP}(\alpha P_0)$, with $\alpha$ a scalar precision parameter and $P_0$ a base probability measure obeying a GP law. Under this approach, subjects are allocated to functional clusters, with $f_i = \Theta_h$ for all subjects in cluster $h$, where $\Theta_h$ is a realization from a GP. By clustering subjects, one obtains replicates, so that through posterior updating the function estimates become less sensitive to the choice of GP covariance function. However, as noted by Petrone, Guindani and Gelfand (2008), the FDP has the disadvantage of inducing global functional clusters, which does not allow local clustering of functions that differ only in local regions. They proposed a hybrid DP (hDP), which instead allows individual functions to be formed as a patchwork of segments locally selected from a collection of global GP realizations. Nguyen and Gelfand (2008) consider properties of the hDP and propose approximations.

An alternative to relying on generalizations of GP priors is to use a basis representation,

$$f_i(x) = \sum_{j=1}^{p} \theta_{ij} b_j(x), \qquad \theta_i = (\theta_{i1}, \ldots, \theta_{ip})' \sim P, \qquad (1)$$

where $b = \{b_j\}_{j=1}^{p}$ is a collection of basis functions, such as splines or kernels, $\theta_i$ is a random effects vector, and $P$ is the distribution of the random effects. The two primary questions that arise in using (1) are how to choose $b$ and $P$. Addressing these questions is challenging, since the appropriate basis to use is typically not known in advance and one may need to allow varying basis functions for different subjects for sufficient flexibility. In addition, $p$ is often large, so that $P$ has many parameters, with simple parametric choices (e.g., Gaussian with diagonal covariance) providing a poor representation of the data.

A common strategy for accommodating uncertainty in basis function selection is to pre-specify a large number of potential basis functions, and then choose a shrinkage prior distribution for the basis coefficients. For example, the relevance vector machine (Tipping, 2001) specifies a t-prior with zero degrees of freedom, and then obtains maximum a posteriori (MAP) estimates. Unfortunately, this choice of prior does not result in a proper posterior. An alternative is to choose a Cauchy prior, which following West (1987) can be expressed as $\theta_{ij} \sim N(0, \kappa_{ij}^{-1})$, with $\kappa_{ij} \sim \mathrm{gamma}(1/2, 1/2)$. This prior is concentrated at zero with very heavy tails, so that coefficients for unnecessary basis functions are shrunk to zero, while coefficients for important basis functions fall in the tails. To borrow information across coefficients about the degree of shrinkage, let $\kappa_{ij} = \kappa$. This approach can be generalized to the functional data analysis setting, while allowing $P$ to be unknown in (1), by letting

$$P \sim \mathrm{DP}(\alpha P_0), \qquad P_0 = \bigotimes_{j=1}^{p} P_{0j}, \qquad P_{0j} \equiv \mathrm{Cauchy}(\kappa), \qquad (2)$$

where $\bigotimes_{j=1}^{p} P_{0j}$ denotes the product measure. A related specification to (1)-(2), but without the Cauchy shrinkage, was proposed by Ray and Mallick (2006) for wavelet-based functional clustering.
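To make the shrinkage construction concrete, the following minimal numpy sketch (not from the paper; the Gaussian-kernel basis, grid, bandwidth and seed are illustrative assumptions) evaluates the basis representation (1) with coefficients drawn from the Cauchy prior via the West (1987) normal scale mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel_basis(x, knots, psi_b=5.0):
    """Gaussian kernel basis b_j(x) = exp{-psi_b (x - z_j)^2}; bandwidth is illustrative."""
    x = np.atleast_1d(x)
    return np.exp(-psi_b * (x[:, None] - knots[None, :]) ** 2)

# Cauchy shrinkage via the scale mixture of normals:
#   theta_j | kappa_j ~ N(0, 1/kappa_j),  kappa_j ~ gamma(1/2, rate 1/2)  =>  theta_j ~ Cauchy.
p = 10
kappa = rng.gamma(shape=0.5, scale=2.0, size=p)        # scale 2 corresponds to rate 1/2
theta = rng.normal(0.0, 1.0 / np.sqrt(kappa))          # heavy-tailed coefficients, most near zero

knots = np.linspace(0.0, 1.0, p)
xgrid = np.linspace(0.0, 1.0, 101)
f = kernel_basis(xgrid, knots) @ theta                 # f(x) = sum_j theta_j b_j(x), as in (1)
```

Drawing many such curves illustrates how the heavy Cauchy tails allow a few coefficients to dominate while the remainder are shrunk toward zero.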

De la Cruz-Mesia, Quintana and Müller (2007) developed a discriminant analysis approach for classification using a dependent Dirichlet process (MacEachern, 1999; De Iorio et al., 2004) for the distribution of random effects underlying longitudinal markers in different groups.

Specification (1)-(2) leads to global functional clustering, with $f_i(x) = b(x)'\Theta_h$ for subjects in cluster $h$, where one allows for differential basis function selection automatically through letting the elements of the basis coefficient vector $\Theta_h = (\Theta_{h1}, \ldots, \Theta_{hp})'$ that are set close to zero vary across the clusters. However, the performance of the approach in accurately and sparsely characterizing complex functional data is critically dependent on global clustering.

Dunson, Xue and Carin (2008) and Dunson (2008) proposed local generalizations of the Dirichlet process, which allow for more flexible borrowing of information through dependent local clustering. In particular, one allows $\theta_{ij} = \theta_{i'j}$ without restricting $\theta_{ij'} = \theta_{i'j'}$ for $j' \neq j$. Both of these approaches assume an exchangeable dependence structure in local clustering, so that the probability of clustering subjects $i$ and $i'$ for the $j$th random effect increases by a constant amount given information that these subjects are clustered for the $j'$th random effect. When random effects are indexed by time, space or predictors, this exchangeability assumption is overly simplistic; instead, one should allow the relative locations of the random effects to inform the local clustering process. The richer dependence structure is needed for accurate interpolations and predictions of $f_i(x)$ across regions of $\mathcal{X}$ for which data are not directly available for subject $i$.

Motivated by this problem, this article proposes a class of kernel partition process (KPP) priors for unknown mixture distributions in Bayesian hierarchical models. There is a rich literature on nonparametric Bayes modeling of random effects distributions using the Dirichlet process (Bush and MacEachern, 1996; Müller and Rosner, 1997; Kleinman and Ibrahim, 1998) and more flexible predictor-dependent global partition processes, such as the kernel stick-breaking process (Dunson and Park, 2008). The implicit assumption of global clustering underlying almost all of the literature on mixture modeling leads to major limitations in terms of sparsely characterizing moderate- to high-dimensional distributions.

In particular, as the dimension increases, it is increasingly unlikely that two subjects are similar with respect to all the elements of their random effects vectors. Hence, global partition models will either allocate subjects to many clusters, leading to an inefficient characterization of the data, or will inappropriately cluster subjects that are similar at most locations, obscuring local differences. In practice, both of these problems occur and one can obtain a simultaneous improvement in goodness of fit and reduction in model complexity using a carefully-specified local partition model. Section 2 proposes the KPP formulation and discusses properties. Section 3 develops an MCMC algorithm for posterior computation. Section 4 considers a simulation example, Section 5 applies the methods to hormone curve data, and Section 6 discusses the results.

2. Kernel Partition Processes

2.1 Preliminaries

Let $\theta_i \sim P$, with $P$ a probability measure over $\{\Omega, \mathcal{B}(\Omega)\}$, where $\Omega$ is a measurable Polish space and $\mathcal{B}(\Omega)$ is the Borel $\sigma$-algebra of subsets of $\Omega$. Our goal is to choose a prior $P \sim \mathcal{P}$, where $\mathcal{P}$ is a probability measure on the space of probability measures over $\{\Omega, \mathcal{B}(\Omega)\}$. If we choose $P \sim \mathrm{DP}(\alpha P_0)$ as in (2), then it follows from Sethuraman (1994) that

$$P = \sum_{\gamma=1}^{\infty} \pi_{\gamma}\, \delta_{\Theta_{\gamma}}, \qquad \Theta_{\gamma} \sim P_0, \qquad (3)$$

where $\gamma \in \{1, \ldots, \infty\}$ is a global cluster index, $\Pr(\gamma = h) = \pi_h = \pi^*_h \prod_{l<h}(1 - \pi^*_l)$, $\pi^*_h \sim \mathrm{beta}(1, \alpha)$, and the unique coefficient vectors, $\Theta_h = (\Theta_{h1}, \ldots, \Theta_{hp})'$ for $h = 1, \ldots, \infty$, are generated independently according to $P_0$. As a local generalization of (3), let

$$P = \sum_{\gamma_1=1}^{\infty} \cdots \sum_{\gamma_p=1}^{\infty} \pi_{\gamma_1 \cdots \gamma_p}\, \delta_{\Theta_{\gamma}}, \qquad \Theta_{\gamma} = (\Theta_{\gamma_1 1}, \ldots, \Theta_{\gamma_p p})', \qquad (4)$$

where $\gamma = (\gamma_1, \ldots, \gamma_p)' \in \{1, \ldots, \infty\}^p$ is a p-dimensional vector of local cluster indices, $\Pr(\gamma_1 = h_1, \ldots, \gamma_p = h_p) = \pi_{h_1 \cdots h_p}$ for $h_j = 1, \ldots, \infty$, $j = 1, \ldots, p$, and $\Theta_h \sim P_0$ independently for $h = 1, \ldots, \infty$.
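The contrast between (3) and (4) can be seen in a small truncated stick-breaking sketch. This is an illustration under assumed settings, not the paper's sampler; in particular, the local indices are drawn independently here purely for illustration, whereas the KPP developed below induces dependence among them.

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking(alpha, H):
    """Truncated stick-breaking weights pi_h = V_h prod_{l<h}(1 - V_l), V_h ~ Beta(1, alpha)."""
    V = rng.beta(1.0, alpha, size=H)
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    return pi / pi.sum()                 # renormalize the truncation

alpha, H, p, n = 1.0, 50, 10, 5
pi = stick_breaking(alpha, H)
Theta = rng.normal(size=(H, p))          # unique coefficient vectors Theta_h ~ P0 (illustrative P0)

# Global clustering as in (3): one index per subject, shared by all p coefficients.
gamma_global = rng.choice(H, size=n, p=pi)
theta_global = Theta[gamma_global]                     # theta_i = Theta_{gamma_i}

# Local clustering as in (4): a separate index for every coefficient.
gamma_local = rng.choice(H, size=(n, p), p=pi)         # iid indices, for illustration only
theta_local = Theta[gamma_local, np.arange(p)]         # theta_ij = Theta_{gamma_ij, j}
```

In the global draw every coordinate of $\theta_i$ comes from a single row of $\Theta$, while in the local draw each coordinate can borrow a different row.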

The class of priors in (4) is similar to (3) in incorporating independent unique coefficient vectors sampled according to $P_0$. However, in drawing $\theta \sim P$, the DP prior in (3) allocates all the elements of $\theta$ to the same unique coefficient vector, while the local generalization in (4) allows the allocation to vary across the elements of $\theta$. The allowance for local allocation should lead to a greater degree of borrowing of information across subjects and a sparser formulation of high-dimensional random effects distributions.

To clarify, suppose that subjects $i$ and $i'$ have similar random effects in elements $1, \ldots, 5$ but then deviate substantially in the remaining elements. The DP prior in (3) should allocate these subjects to different unique coefficient vectors, so that borrowing of information in estimation of their random effects vectors relies entirely on the base parametric model $P_0$. In contrast, the prior in (4) allows the subjects to be allocated to the same component for elements $1, \ldots, 5$ but different components for the remaining elements. The clustering in elements $1, \ldots, 5$ should substantially improve efficiency in estimating these random effects. In addition, by not forcing allocation to new unique coefficients for each new subject that deviates from previous subjects in any of their random effects, (4) favors allocation of subjects to fewer unique coefficient vectors. If the $n$ subjects in a data set occupy $k \ll n$ components, then we obtain a sparse representation of the data, with the occupied unique coefficient vectors providing a summary of important characteristics of the data.

To complete a specification of (4), it remains to choose a prior for $\pi = \{\pi_{h_1 \cdots h_p},\ h_j = 1, \ldots, \infty,\ j = 1, \ldots, p\}$. Petrone et al. (2008) proposed using a Dirichlet prior in the finite case in which $h_j = 1, \ldots, k$, and then provided theory on the limiting case as $k \to \infty$. In applications, they focused on the finite case and proposed a Gaussian process copula model, which results in substantial computational challenges. We propose a fundamentally different specification, which avoids truncation of the infinite-dimensional prior in (4) and relies on an innovative kernel specification to sparsely and flexibly induce a prior for $\pi$.

2.2 Chinese Restaurants

The proposed kernel partition process (KPP) prior is motivated by a generalization of the Chinese restaurant process (CRP) often discussed in considering clustering properties of the Dirichlet process (Aldous, 1985; Ishwaran and James, 2003). The typical CRP is obtained in marginalizing out the infinitely-many atoms and stick-breaking weights in (3) to obtain the following predictive probability function (PPF),

$$(\theta_{n+1} \mid \tilde{\gamma}^{(n)}, \theta^{(n)}) \sim \left(\frac{\alpha}{\alpha + n}\right) P_0 + \sum_{h=1}^{k^{(n)}} \left(\frac{\tilde{n}^{(n)}_h}{\alpha + n}\right) \delta_{\tilde{\theta}_h}, \qquad n = 1, 2, \ldots, \qquad (5)$$

where $\tilde{\gamma}_i = h$ denotes that customer $i$ is seated at table $h$, $\tilde{\gamma}^{(n)} = (\tilde{\gamma}_1, \ldots, \tilde{\gamma}_n)'$, tables are indexed in the order they are occupied as subjects $i = 1, \ldots, n$ are seated, all customers at table $h$ are served dish $\tilde{\theta}_h \sim P_0$, $\tilde{n}^{(n)}_h = \sum_{i=1}^{n} 1(\tilde{\gamma}_i = h)$ is the number of individuals at table $h$, and $k^{(n)}$ is the number of occupied tables after $n$ customers are seated.

In order for the CRP to be more closely linked to the stick-breaking process in (3), it is convenient to define a modification, which we refer to as the ordered CRP (ocrp). The distinguishing characteristic of the ocrp is that the tables are indexed not by the order in which they are occupied but by the order of the components in the stick-breaking representation. Because the stick-breaking weights are stochastically decreasing, one can modify the metaphor so that the restaurant has infinitely-many tables ordered in terms of desirability, with table $h = 1$ the most desirable. Lemma 1 provides the predictive probability of allocation to table $m$ for individual $i = n+1$ under the ocrp.

Lemma 1. Assuming that $\theta_i \sim P$ for $i = 1, 2, \ldots, n+1$, with $P \sim \mathrm{DP}(\alpha P_0)$ following (3), $\gamma_i = h$ if $\theta_i = \Theta_h$, and $\gamma^{(n)} = (\gamma_1, \ldots, \gamma_n)'$, we have

$$\Pr(\gamma_{n+1} = m \mid \gamma^{(n)}) = \left(\frac{1 + n^{(n)}_m}{\alpha + 1 + \sum_{l=m}^{k^{(n)}} n^{(n)}_l}\right) \prod_{h=1}^{m-1} \left(\frac{\alpha + \sum_{l=h+1}^{k^{(n)}} n^{(n)}_l}{\alpha + 1 + \sum_{l=h}^{k^{(n)}} n^{(n)}_l}\right), \qquad m = 1, \ldots, \infty,$$

where $n^{(n)}_h = \sum_{i=1}^{n} 1(\gamma_i = h)$ and $k^{(n)} = \max\{\gamma^{(n)}\}$.

Teh et al. (2006) proposed a hierarchical DP (HDP) as a prior for a collection of random probability measures, $\{P_l\}_{l=1}^{r}$. The HDP leads to a generalization of the CRP having multiple restaurants with a shared collection of dishes.
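As a numerical illustration of Lemma 1 (a sketch, not code from the paper; the function name and the example table counts are hypothetical), the allocation probabilities can be computed directly from the current table counts.

```python
import numpy as np

def ocrp_probs(counts, alpha, extra=1):
    """Pr(gamma_{n+1} = m | gamma^{(n)}) under the ordered CRP of Lemma 1.

    counts[h] = n_h^{(n)}, the number of previous customers at table h+1 (0-indexed);
    tables are ordered by the stick-breaking index, not by occupancy order.
    `extra` empty tables beyond k^{(n)} are included to show where leftover mass goes."""
    counts = np.asarray(counts, dtype=float)
    counts = np.concatenate([counts, np.zeros(extra)])
    tails = np.concatenate([np.cumsum(counts[::-1])[::-1], [0.0]])   # tails[h] = sum_{l >= h} n_l
    probs = np.empty(len(counts))
    surv = 1.0
    for m in range(len(counts)):
        denom = alpha + 1.0 + tails[m]
        probs[m] = surv * (1.0 + counts[m]) / denom
        surv *= (alpha + tails[m + 1]) / denom        # product term over h < m in Lemma 1
    return probs

print(ocrp_probs([3, 1, 0, 2], alpha=1.0))
```

The probabilities do not sum to one because the remaining mass corresponds to tables beyond those displayed; when $\alpha$ is small, most of the mass concentrates on the low-numbered tables.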

As a simpler hierarchical generalization of the DP that also includes shared atoms (dishes) within each $P_l$, let $P_l = \sum_{h=1}^{\infty} \pi_{lh} \delta_{\Theta_h}$, with $\Theta_h \sim P_0$, $\pi_{lh} = \pi^*_{lh} \prod_{m<h}(1 - \pi^*_{lm})$, and $\pi^*_{lh} \sim \mathrm{beta}(1, \alpha)$ independently for $l = 1, \ldots, r$. In this specification, dish $\Theta_h$ is served at the $h$th most popular table at each of the restaurants, with the seating process at each restaurant proceeding independently via Lemma 1. Dependence arises across the restaurants because of the incorporation of a common set of dishes with the same ordering in terms of popularity. However, neither this hierarchical ocrp nor the Teh et al. (2006) HDP is designed for local partitioning.

2.3 Multivariate Ordered Chinese Restaurant Process

To accommodate model (4), we propose a modification of the hierarchical ocrp in which each household in a city with $r$ restaurants contains $p$ types of individuals. Each restaurant serves dishes $\Theta_h \sim P_0 = \bigotimes_{j=1}^{p} P_{0j}$ to its $h$th most popular table, with individuals of type $j$ at table $h$ dining on $\Theta_{hj}$. Individuals in a household select restaurants separately, with an individual of type $j$ selecting restaurant $l$ with probability proportional to the popularity score of the restaurant, $\lambda_l \sim \mathrm{gamma}(\beta/r, 1)$, multiplied by the closeness of the restaurant to the individual's preference, quantified as $K_{\psi}(z_j, z^*_l)$. Here, $K_{\psi} : \mathcal{Z} \times \mathcal{Z} \to [0, 1]$ is a kernel, $z_j \in \mathcal{Z}$ are features of type $j$ individuals, and $z^*_l$ are features of restaurant $l$. Individuals in a household that select the same restaurant will be seated together when they arrive, with the seating following an ocrp. Theorem 1 shows the resulting multivariate PPF.

Theorem 1. For individual $j$ in household $i$, let $S_{ij} = l$ denote choice of restaurant $l$ and let $\gamma_{ij} = h$ denote seating at table $h$. Letting $S_i = (S_{i1}, \ldots, S_{ip})'$, $S^{(n)} = (S_1, \ldots, S_n)'$, $\gamma_i = (\gamma_{i1}, \ldots, \gamma_{ip})'$, $\gamma^{(n)} = (\gamma_1, \ldots, \gamma_n)'$, and $\theta^{(n)} = (\theta_1, \ldots, \theta_n)'$,

$$(\theta_{n+1} \mid S^{(n)}, \gamma^{(n)}, \theta^{(n)}) \sim \sum_{l_1=1}^{r} \cdots \sum_{l_p=1}^{r} \left\{\prod_{j=1}^{p} \omega_{j l_j}\right\} \prod_{l=1}^{r} \sum_{m=1}^{\infty} \pi^{(n)}_{lm} \prod_{j:\, l_j = l} \left\{1(w^{(n)}_{mj} = 0)\, P_{0j}(\theta_{n+1,j}) + 1(w^{(n)}_{mj} > 0)\, \delta_{\Theta_{mj}}(\theta_{n+1,j})\right\},$$

where $w^{(n)}_{mj} = \sum_{i=1}^{n} 1(\gamma_{ij} = m)$ is the number of individuals of type $j$ at the $m$th table,

$$\omega_{jl} = \frac{\lambda_l K_{\psi}(z_j, z^*_l)}{\sum_{s=1}^{r} \lambda_s K_{\psi}(z_j, z^*_s)}, \qquad l = 1, \ldots, r, \quad j = 1, \ldots, p,$$

$$\pi^{(n)}_{lm} = \left(\frac{1 + n^{(n)}_{lm}}{\alpha + 1 + \sum_{t=m}^{k^{(n)}_l} n^{(n)}_{lt}}\right) \prod_{h=1}^{m-1} \left(\frac{\alpha + \sum_{t=h+1}^{k^{(n)}_l} n^{(n)}_{lt}}{\alpha + 1 + \sum_{t=h}^{k^{(n)}_l} n^{(n)}_{lt}}\right), \qquad m = 1, \ldots, \infty,$$

with $n^{(n)}_{lm} = \sum_{i=1}^{n} 1(m \in \{\gamma_{ij} : S_{ij} = l\})$ the number of households seated at table $m$ in restaurant $l$.

In Theorem 1, $l_1, \ldots, l_p$ index the restaurant allocations for individuals of types $1, \ldots, p$, respectively, $\omega_{jl}$ is the probability that an individual of type $j$ chooses restaurant $l$, and $m$ indexes the table assignment of household $n+1$ within restaurant $l$. If an individual of type $j$ is seated at a table number that has not had an individual of that type previously, they will be given a new dish sampled from $P_{0j}$, with this dish served to all future individuals of type $j$ seated at the same numbered table. It is clear from Theorem 1 that most of the probability will be allocated to the first few tables in each restaurant when $\alpha$ is small, leading to a sparse formulation in which all individuals occupy a few clusters. In addition, when $\beta$ is small, the popularity of the restaurants will vary greatly, leading to a small number of dominant restaurants that attract most of the customers unless the kernel is very tight.

Proposition 1. Under Theorem 1, with $\omega_j = (\omega_{j1}, \ldots, \omega_{jr})'$,

(i) $\Pr(\gamma_{ij} = \gamma_{ij'}) = \omega_j'\omega_{j'} + \dfrac{1 - \omega_j'\omega_{j'}}{1 + 2\alpha} = \rho_{jj'}$, for $j \neq j'$,

(ii) $\Pr(\gamma_{ij} = \gamma_{i'j'}) = \dfrac{\omega_j'\omega_{j'}}{1 + \alpha} + \dfrac{1 - \omega_j'\omega_{j'}}{1 + 2\alpha} = \kappa_{jj'}$, for any $j, j'$ and $i \neq i'$.

Proposition 1(i) shows the probability of allocating two individuals in a household of types $j$ and $j'$ to the same numbered table and hence the same cluster index, while Proposition 1(ii) shows the probability of assigning individuals from two different households to the same cluster index.
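A short sketch (with illustrative settings and function names that are not from the paper) shows how the kernel and the popularity scores combine into the selection probabilities $\omega_{jl}$ and then into the clustering probabilities of Proposition 1.

```python
import numpy as np

rng = np.random.default_rng(2)

def restaurant_weights(z, z_star, lam, psi):
    """omega_{jl} proportional to lambda_l * K_psi(z_j, z*_l), squared-exponential kernel."""
    K = np.exp(-psi * (z[:, None] - z_star[None, :]) ** 2)
    W = lam[None, :] * K
    return W / W.sum(axis=1, keepdims=True)

p = r = 10
z = z_star = np.linspace(0.0, 1.0, p)
beta, psi, alpha = 1.0, 25.0, 1.0                       # illustrative hyperparameters
lam = rng.gamma(beta / r, 1.0, size=r)                  # restaurant popularity scores
omega = restaurant_weights(z, z_star, lam, psi)         # (p, r) selection probabilities

inner = omega @ omega.T                                 # omega_j' omega_j' for every pair (j, j')
rho   = inner + (1.0 - inner) / (1.0 + 2.0 * alpha)     # within-household, Proposition 1(i)
kappa = inner / (1.0 + alpha) + (1.0 - inner) / (1.0 + 2.0 * alpha)   # across households, 1(ii)
```

Plotting a row of `rho` against the locations `z` gives the kind of distance-decaying within-subject dependence illustrated in Figure 1.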

For concreteness, consider a longitudinal data application in which $f_i(x) = b(x)'\theta_i$, with $b_j(x) = \exp\{-\psi_b (x - z_j)^2\}$ a kernel basis function located at time $z_j$ for $j = 1, \ldots, p = r$. The within-subject pairwise dependence structure in the vector of cluster indices $\gamma_i$ is characterized by Proposition 1(i), while the cross-subject dependence is characterized by Proposition 1(ii). Both within- and across-subjects, it is more likely for random effects that are located close together to be allocated to the same cluster index. The $\omega_j'\omega_{j'}$ term is a cross-product of weighted and normalized kernel functions, so induces a highly flexible dependence structure, as is partially illustrated in Figure 1. The hyperparameter $\alpha$ provides a global control on clustering, with small values implying a high probability of being allocated to the same cluster index even for random effects located far apart. Indeed, in the limiting case $\lim_{\alpha \to 0} \rho_{jj'} = \kappa_{jj'} = 1$, and $\theta_i \sim \delta_{\Theta_1}$, so that all subjects are allocated to a single global cluster. A small but non-zero $\alpha$ instead implies a sparse formulation with a combination of short and long range dependence in clustering. Figure 2 provides some representative realizations from the prior, illustrating the impact of local clustering on the functions.

2.4 Driving Measure

The definition of the multivariate ocrp bypasses the explicit specification of a prior for the random measure $P$. However, it can be shown that the multivariate ocrp in Theorem 1 has a driving measure $P \sim \mathrm{KPP}(\alpha, \beta, \psi, P_0)$, with the kernel partition process (KPP) prior specified as in (4) with

$$\pi_{h_1 \cdots h_p} = \sum_{T \in \mathcal{T}} \prod_{l:\, T_l \neq \emptyset} \Pr(\phi_l = \{h_j,\ j \in T_l\} \mid \nu_l) \prod_{j \in T_l} \omega_{jl}, \qquad (6)$$

where $T = (T_1, \ldots, T_r)$ is a partition of the index set $\{1, \ldots, p\}$ into $r$ locations, with $T_l \subset \{1, \ldots, p\}$ the indices at location $l$ for $l = 1, \ldots, r$, $\mathcal{T}$ is the set of all such partitions, $\{h_j,\ j \in T_l\}$ are the cluster indices for random effects allocated to location $l$, $\phi_l$ is a random cluster index specific to location $l$, $\Pr(\phi_l = h) = 1(\#h = 1)\, \nu_{lh}$, $\#h$ is the cardinality of the set $h$, $\nu_{lh} = \nu^*_{lh} \prod_{m<h}(1 - \nu^*_{lm})$, and $\nu^*_{lh} \sim \mathrm{beta}(1, \alpha)$ for $h = 1, \ldots, \infty$.

Letting $\theta_i \sim P$, with $P \sim \mathrm{KPP}(\alpha, \beta, \psi, P_0)$, we obtain the hierarchical model

$$\theta_i = \Theta_{\gamma_i}, \qquad \Theta_{\gamma_i} = (\Theta_{\gamma_{i1} 1}, \ldots, \Theta_{\gamma_{ip} p})', \qquad \Theta_h \sim P_0, \qquad \gamma_{ij} \sim \sum_{l=1}^{r} \omega_{jl}\, \delta_{\phi_{il}}, \qquad \phi_{il} \sim \sum_{h=1}^{\infty} \nu_{lh}\, \delta_h, \qquad (7)$$

with $\gamma_i = (\gamma_{i1}, \ldots, \gamma_{ip})' \in \{1, \ldots, \infty\}^p$ a multivariate cluster index for subject $i$, and $\phi_i = (\phi_{i1}, \ldots, \phi_{ir})'$ a vector of latent cluster indices assigned to locations $1, \ldots, r$. In the restaurant terminology, $\phi_{il}$ denotes the latent table assignment of any individuals from household $i$ choosing restaurant $l$, while $\gamma_{ij}$ denotes the actual table assignment for the $j$th individual from household $i$. Formulation (7) will form the basis for the computational algorithm proposed in Section 3.
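A minimal forward sampler for the hierarchical form (7) is sketched below; the truncation at H stick-breaking components is purely for illustration (the posterior algorithm in Section 3 avoids truncation), and the function names and settings are assumptions rather than part of the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_kpp_coefficients(n, omega, alpha, P0_draw, H=100):
    """Forward-simulate theta_1,...,theta_n from the hierarchical form (7) of the KPP."""
    p, r = omega.shape
    V = rng.beta(1.0, alpha, size=(r, H))
    nu = V * np.concatenate([np.ones((r, 1)), np.cumprod(1.0 - V, axis=1)[:, :-1]], axis=1)
    nu /= nu.sum(axis=1, keepdims=True)             # renormalize the truncated sticks
    Theta = P0_draw(H, p)                           # unique coefficient vectors Theta_h ~ P0
    theta = np.empty((n, p))
    for i in range(n):
        phi = np.array([rng.choice(H, p=nu[l]) for l in range(r)])    # phi_il ~ sum_h nu_lh delta_h
        S = np.array([rng.choice(r, p=omega[j]) for j in range(p)])   # restaurant choices S_ij
        gamma = phi[S]                                                # gamma_ij = phi_{i, S_ij}
        theta[i] = Theta[gamma, np.arange(p)]                         # theta_ij = Theta_{gamma_ij, j}
    return theta

p = r = 10
omega = np.full((p, r), 1.0 / r)                    # uniform restaurant choice, for illustration only
theta = sample_kpp_coefficients(5, omega, alpha=1.0,
                                P0_draw=lambda H, q: rng.standard_cauchy((H, q)))
```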

2.5 Some Special Cases

It is interesting to consider some special cases of the KPP. First, suppose $K_{\psi}(z_j, z^*_l) = 1(j = l)$, with $z^*_j = z_j$ for $j = 1, \ldots, p = r$. In this case, there is a distinct restaurant catering to each type of individual, so that all individuals of type $j$ go to restaurant $j$ for $j = 1, \ldots, p$. If $P_0 = \bigotimes_{j=1}^{p} P_{0j}$, then it is straightforward to show that we obtain $\theta_{ij} \sim P_j$ with $P_j \sim \mathrm{DP}(\alpha P_{0j})$ independently for $j = 1, \ldots, p$. Hence, for a particular choice of kernel, the KPP reduces to independent Dirichlet process priors on the distribution for each random effect. These independent DP priors will induce independent clustering of subjects with respect to each element of the random effects vector. By modifying the stick-breaking prior on the distribution of the $\phi_{il}$s in (7), one can induce independent species sampling priors of other types for each $P_j$.

As another extreme case, suppose that $r = 1$ so that there is a single restaurant visited by all individuals. In this case, it is straightforward to show that $\theta_i \sim P$, with $P \sim \mathrm{DP}(\alpha P_0)$. This same global DP prior can also be induced by letting $K_{\psi}(z_j, z^*_l) = 1$ for all $j, l$ and $\beta \to 0$. To illustrate this, note that when the kernel is a constant equal to one, $\omega_j = \omega = (\omega_1, \ldots, \omega_r)'$, with $\omega_l = \lambda_l / \sum_{m=1}^{r} \lambda_m$. Since $\lambda_l \sim \mathrm{gamma}(\beta/r, 1)$, we obtain $\omega \sim \mathrm{Dirichlet}(\beta/r, \ldots, \beta/r)$. As $\beta$ decreases, $\omega$ converges in distribution to $\delta_{v_{\xi}}$, with $v_{\xi}$ an $r \times 1$ vector of zeros with a single one in position $\xi \sim \sum_{l=1}^{r} (1/r)\, \delta_l$. Hence, from (7) it is clear we obtain $\gamma_{i1} = \cdots = \gamma_{ip} = \phi_{i\xi}$ for all $i$, leading to $P \sim \mathrm{DP}(\alpha P_0)$. By replacing the stick-breaking priors on the $\phi_{il}$s in (7), we can induce an arbitrary species sampling prior on $P$.

In addition to independent and global DP priors, we can induce an exchangeable within-subject dependence structure by letting $K_{\psi}(z_j, z^*_l) = 1$ with $\beta > 0$. In this case, it follows from Proposition 1 that

$$\Pr(\gamma_{ij} = \gamma_{ij'}) = \left(\frac{2\alpha}{1 + 2\alpha}\right) \frac{\lambda'\lambda}{(\lambda' 1_r)^2} + \frac{1}{1 + 2\alpha}, \qquad \text{for all } i, j, j',$$

with $1_r$ an $r \times 1$ vector of ones. As $\beta$ decreases, the $\lambda'\lambda/(\lambda' 1_r)^2$ term converges in distribution to $\delta_1$, so that $\Pr(\gamma_{ij} = \gamma_{ij'})$ converges to one.

As a sparse version of the KPP, which has computational advantages in avoiding the introduction of a separate stick-breaking process for each random effect, we propose to let $\nu_{lh} = \nu_h = \nu^*_h \prod_{m<h}(1 - \nu^*_m)$ with $\nu^*_h \sim \mathrm{beta}(1, \alpha)$. Because the seating processes for the different restaurants are then coupled, Theorem 1 can be modified by simply replacing $n^{(n)}_{lm}$ in the expression for $\pi^{(n)}_{lm}$ with $n^{(n)}_m = \sum_{l=1}^{r} n^{(n)}_{lm}$, so that $\pi^{(n)}_{lm} = \pi^{(n)}_m$ for all restaurants. In addition, Proposition 1(i)-(ii) is modified by replacing the $1/(1 + 2\alpha)$ multipliers with $1/(1 + \alpha)$, so that $\Pr(\gamma_{ij} = \gamma_{i'j'}) = 1/(1 + \alpha)$.

Proposition 2. Let $A : S_{ij} = S_{ij'}$ denote allocation of individuals $j$ and $j'$ from household $i$ to the same restaurant, let $B : S_{i'j} = S_{i'j'}$ denote allocation of individuals $j$ and $j'$ from household $i'$ to the same restaurant, and let $\bar{A}$ and $\bar{B}$ denote the complements of $A$ and $B$, respectively. Then, under the sparse KPP,

$$\Pr(\theta_{ij} = \theta_{i'j} \mid \theta_{ij'} = \theta_{i'j'}) = \frac{\tilde{\pi}_1 (2 + \alpha)(3 + \alpha) + 2\tilde{\pi}_2 (3 + \alpha) + \tilde{\pi}_3 (6 + \alpha)}{(2 + \alpha)(3 + \alpha)},$$

where $\tilde{\pi} = (\tilde{\pi}_1, \tilde{\pi}_2, \tilde{\pi}_3)'$ is a probability vector, with $\tilde{\pi}_1 = \Pr(A \cap B) = (\omega_j'\omega_{j'})^2$, $\tilde{\pi}_2 = \Pr\{(A \cap \bar{B}) \cup (\bar{A} \cap B)\} = 2\omega_j'\omega_{j'}\{1 - \omega_j'\omega_{j'}\}$ and $\tilde{\pi}_3 = \Pr(\bar{A} \cap \bar{B}) = \{1 - \omega_j'\omega_{j'}\}^2$.

As $\Pr(\theta_{ij} = \theta_{i'j}) = 1/(1 + \alpha)$, it is clear from Proposition 2 that the probability of clustering subjects $i$ and $i'$ for their $j$th random effect increases given information that these subjects are clustered for their $j'$th random effect. This is an appealing property in allowing borrowing of information in local clustering. The magnitude of the multiplicative increase tends to decrease as the distance between the $j$th and $j'$th random effects increases, though the function can take an extremely wide variety of shapes and is not restricted to be monotone due to the incorporation of the adaptive weights $\lambda$.

3. Posterior Computation

For posterior computation, we propose a Markov chain Monte Carlo (MCMC) algorithm, which is a hybrid of data augmentation, the exact block Gibbs sampler of Papaspiliopoulos (2008), and Metropolis sampling. Papaspiliopoulos (2008) proposed the exact block Gibbs sampler as an efficient approach to posterior computation in Dirichlet process mixture models, modifying the block Gibbs sampler of Ishwaran and James (2001) to avoid truncation approximations. The exact block Gibbs sampler combines characteristics of the retrospective sampler (Papaspiliopoulos and Roberts, 2008) and the slice sampler (Walker, 2007).

For concreteness, we focus on a functional data analysis model with $y_{it} \sim t_{\upsilon}(f_i(x_{it}), \tau)$, where $t_{\upsilon}(\mu, \tau)$ denotes the t-density centered on $\mu$, with $\upsilon$ degrees of freedom and scale parameter $\tau$. In addition, $f_i(x)$ follows (1) with $\theta_i \sim P$ and $P \sim \mathrm{KPP}(\alpha, \beta, \psi, P_0)$, focusing on the sparse formulation and letting $P_0 = \bigotimes_{j=1}^{p} P_{0j}$, where $P_{0j}$ denotes a Cauchy prior centered on zero. The Cauchy prior is an appealing choice for robust shrinkage of the basis coefficients, as small coefficients will tend to be shrunk very close to zero, while larger coefficients fall in the tails (Bhuiyan et al., 2007). To complete a Bayes specification, let $\alpha \sim \mathrm{gamma}(a_{\alpha}, b_{\alpha})$, $\beta \sim \mathrm{gamma}(a_{\beta}, b_{\beta})$, $\psi \sim \mathrm{gamma}(a_{\psi}, b_{\psi})$, $\upsilon \sim \mathrm{gamma}(a_{\upsilon}, b_{\upsilon})$, and $\tau \sim \mathrm{gamma}(a_{\tau}, b_{\tau})$. From West (1987), the t-density can be expressed as a scale mixture of Gaussians, which results in $y_{it} \sim N(b_{it}'\theta_i,\ \tau^{-1}\phi_{it}^{-1})$, with $b_{it} = b(x_{it})$, $\phi_{it} \sim \mathrm{gamma}(\upsilon/2, \upsilon/2)$, $P_{0j} \equiv N(0, \kappa^{-1})$ and $\kappa \sim \mathrm{gamma}(1/2, 1/2)$.
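The scale-mixture representation used above is easy to check by simulation. The sketch below (illustrative, with assumed parameter values and function name) generates t-distributed errors exactly through the latent gamma precisions.

```python
import numpy as np

rng = np.random.default_rng(4)

# t_upsilon(mu, tau) errors via the West (1987) scale mixture of normals:
#   y | phi ~ N(mu, 1/(tau * phi)),  phi ~ gamma(upsilon/2, rate upsilon/2).
def rt_scale_mixture(mu, tau, upsilon, size):
    phi = rng.gamma(upsilon / 2.0, 2.0 / upsilon, size=size)   # scale 2/upsilon corresponds to rate upsilon/2
    return rng.normal(mu, 1.0 / np.sqrt(tau * phi))

y = rt_scale_mixture(mu=0.0, tau=1.0, upsilon=4.0, size=10_000)
```

Conditioning on the latent $\phi_{it}$ turns the t likelihood into a Gaussian one, which is what makes the conjugate updates in the algorithm below available.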

Letting $S_{ij} \in \{1, \ldots, r\}$ denote the location $\gamma_{ij}$ is allocated to, the algorithm proceeds as follows:

1. Update $S_{ij}$ for all $i, j$ from the multinomial conditional posterior, with

$$\Pr(S_{ij} = h \mid -) = \frac{\omega_{jh} \prod_{t=1}^{n_i} N\big(y_{it};\ b_{it}'\Theta_{\gamma_i(S_{ij}=h)},\ \tau^{-1}\phi_{it}^{-1}\big)}{\sum_{l=1}^{r} \omega_{jl} \prod_{t=1}^{n_i} N\big(y_{it};\ b_{it}'\Theta_{\gamma_i(S_{ij}=l)},\ \tau^{-1}\phi_{it}^{-1}\big)}, \qquad h = 1, \ldots, r, \qquad (8)$$

where $\Theta_{\gamma_i(S_{ij}=h)}$ equals $(\Theta_{\gamma_{i1} 1}, \ldots, \Theta_{\gamma_{ip} p})'$ but with the current value of $\gamma_{ij}$ set to $\phi_{ih}$.

2. To update $\lambda_h$, for $h = 1, \ldots, r$, use a data augmentation approach related to Holmes and Held (2006) and Dunson, Pillai and Park (2007). Letting $K_{jh} = K_{\psi}(z_j, z^*_h)$ and $\tilde{K}_{jh} = K_{jh} / (\sum_{l \neq h} \lambda_l K_{jl})$, the conditional likelihood for $\lambda_h$ is

$$L(\lambda_h) = \prod_{j=1}^{p} \left(\frac{\lambda_h \tilde{K}_{jh}}{1 + \lambda_h \tilde{K}_{jh}}\right)^{\sum_i 1(S_{ij} = h)} \left(\frac{1}{1 + \lambda_h \tilde{K}_{jh}}\right)^{\sum_i 1(S_{ij} \neq h)},$$

which can be obtained via $1(S_{ij} = h) = 1(Z^*_{ijh} > 0)$, with $Z^*_{ijh} \sim \mathrm{Poisson}(\lambda_h \xi_{ijh} \tilde{K}_{jh})$ and $\xi_{ijh} \sim \exp(1)$. Update $\{Z^*_{ijh}, \xi_{ijh}\}$ and $\{\lambda_h\}$ in Gibbs steps (a code sketch of this update is given at the end of this section):

(a) Let $Z^*_{ijh} = 0$ if $S_{ij} \neq h$ and otherwise $Z^*_{ijh} \sim \mathrm{Poisson}(\lambda_h \xi_{ijh} \tilde{K}_{jh})\, 1(Z^*_{ijh} > 0)$.

(b) $\xi_{ijh} \sim \mathrm{gamma}(1 + Z^*_{ijh},\ 1 + \lambda_h \tilde{K}_{jh})$.

(c) $\lambda_h \sim \mathrm{gamma}\big(\beta/r + \sum_{i,j} Z^*_{ijh},\ 1 + \sum_{i,j} \xi_{ijh} \tilde{K}_{jh}\big)$.

3. Update $\psi, \beta, \upsilon$ using Metropolis steps.

4. Update $\phi_{it}$, $i = 1, \ldots, n$, $t = 1, \ldots, n_i$, and $\tau$ by sampling from gamma full conditionals

$$(\phi_{it} \mid -) \sim \mathrm{gamma}\left(\frac{\upsilon + 1}{2},\ \frac{\upsilon + \tau (y_{it} - b_{it}'\theta_i)^2}{2}\right), \qquad (\tau \mid -) \sim \mathrm{gamma}\left(a_{\tau} + \frac{1}{2}\sum_{i=1}^{n} n_i,\ b_{\tau} + \frac{1}{2}\sum_{i=1}^{n}\sum_{t=1}^{n_i} \phi_{it}(y_{it} - b_{it}'\theta_i)^2\right).$$

5. Implement exact block Gibbs sampler steps:

(a) Sample $u_{ih} \sim \mathrm{uniform}(0, \nu_{\phi_{ih}})$, for $i = 1, \ldots, n$ and $h = 1, \ldots, r$, with $\nu_l = \nu^*_l \prod_{m<l}(1 - \nu^*_m)$.

(b) Sample the stick-breaking random variables,

$$\nu^*_l \sim \mathrm{beta}\left(1 + \sum_{h=1}^{r}\sum_{i=1}^{n} 1(\phi_{ih} = l),\ \alpha + \sum_{h=1}^{r}\sum_{i=1}^{n} 1(\phi_{ih} > l)\right),$$

for $l = 1, \ldots, \phi^*$, with $\phi^*$ the minimum value satisfying $\nu_1 + \cdots + \nu_{\phi^*} > 1 - \min\{u_{ih}\}$.

(c) Update $\Theta_l$, for $l = 1, \ldots, \phi^*$, from $N_p(\hat{\Theta}_l, \Sigma_{\Theta_l})$ with

$$\hat{\Theta}_l = \Sigma_{\Theta_l} \sum_{i=1}^{n}\sum_{t=1}^{n_i} \tau \phi_{it} \Gamma_{il}\, b_{it} (y_{it} - b_{it}'\Gamma_{i(-l)}\theta_i), \qquad \Sigma_{\Theta_l} = \left(\kappa I_p + \sum_{i=1}^{n}\sum_{t=1}^{n_i} \tau \phi_{it} \Gamma_{il}\, b_{it} b_{it}' \Gamma_{il}\right)^{-1},$$

where $I_p$ is the $p \times p$ identity matrix, $\Gamma_{il} = \mathrm{diag}\{1(\gamma_{i1} = l), \ldots, 1(\gamma_{ip} = l)\}$ and $\Gamma_{i(-l)} = \mathrm{diag}\{1(\gamma_{i1} \neq l), \ldots, 1(\gamma_{ip} \neq l)\}$.

(d) Update $\phi_{ih}$, $i = 1, \ldots, n$, $h = 1, \ldots, r$, from the multinomial conditional with

$$\Pr(\phi_{ih} = l \mid -) \propto 1(u_{ih} < \nu_l) \prod_{t=1}^{n_i} N\left(y^{(h)}_{it};\ \sum_{j=1}^{p} 1(S_{ij} = h)\, b_{itj}\, \Theta_{lj},\ \tau^{-1}\phi_{it}^{-1}\right), \qquad l = 1, \ldots, \phi^*,$$

where $y^{(h)}_{it} = y_{it} - \sum_{j=1}^{p} 1(S_{ij} \neq h)\, b_{itj}\, \theta_{ij}$.

6. Update $\alpha$ by sampling from the conditional posterior

$$(\alpha \mid -) \sim \mathrm{gamma}\left(a_{\alpha} + \phi^*,\ b_{\alpha} - \sum_{l=1}^{\phi^*} \log(1 - \nu^*_l)\right).$$

7. Update $\kappa$ by sampling from the conditional posterior

$$(\kappa \mid -) \sim \mathrm{gamma}\left(\frac{1}{2} + \frac{p\phi^*}{2},\ \frac{1}{2} + \frac{1}{2}\sum_{l=1}^{\phi^*}\sum_{j=1}^{p} \Theta_{lj}^2\right).$$

This algorithm is straightforward to program and has exhibited good rates of convergence and mixing in simulated and real data applications we have considered. We also considered using the slice sampler of Walker (2007) in place of the exact block Gibbs sampling steps, but this resulted in considerably slower rates of convergence and mixing.
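As one example of how the steps translate into code, the following sketch implements the data-augmentation update of Step 2 for the popularity scores $\lambda_h$. This is a sketch under assumed inputs, not the authors' implementation; the kernel matrix, dimensions, seed and the rejection sampler for the zero-truncated Poisson are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)

def rpois_positive(rate):
    """Poisson(rate) draw conditioned to be > 0 (simple rejection, adequate for a sketch)."""
    while True:
        z = rng.poisson(rate)
        if z > 0:
            return z

def update_lambda(S, K, lam, beta, xi=None):
    """One sweep of Step 2: latent (Z*, xi) augmentation and conjugate gamma update of lambda_h.

    S is the (n, p) matrix of current restaurant allocations S_ij in {0,...,r-1};
    K[j, h] = K_psi(z_j, z*_h); lam is the current (r,) vector of popularity scores."""
    n, p = S.shape
    r = len(lam)
    if xi is None:
        xi = rng.exponential(1.0, size=(n, p, r))        # latent xi_ijh, carried across iterations
    for h in range(r):
        denom = K @ lam - lam[h] * K[:, h]               # sum_{l != h} lambda_l K_{jl}
        Ktil = K[:, h] / denom                           # tilde K_{jh}
        rate = lam[h] * xi[:, :, h] * Ktil[None, :]
        Z = np.zeros((n, p))
        for i, j in zip(*np.where(S == h)):              # (a): Z*_ijh > 0 only where S_ij = h
            Z[i, j] = rpois_positive(rate[i, j])
        xi[:, :, h] = rng.gamma(1.0 + Z, 1.0 / (1.0 + lam[h] * Ktil[None, :]))     # (b)
        lam[h] = rng.gamma(beta / r + Z.sum(),
                           1.0 / (1.0 + (xi[:, :, h] * Ktil[None, :]).sum()))      # (c)
    return lam, xi

n, p, r, beta = 20, 10, 10, 1.0
z = np.linspace(0.0, 1.0, p)
K = np.exp(-25.0 * (z[:, None] - z[None, :]) ** 2)       # K_psi(z_j, z*_h), illustrative bandwidth
lam = rng.gamma(beta / r, 1.0, size=r)
S = rng.integers(0, r, size=(n, p))
lam, xi = update_lambda(S, K, lam, beta)
```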

Adaptations for more complex random effects models, which include fixed and random effects and other complications, are trivial by embedding steps similar to those outlined above in an MCMC algorithm that includes additional steps to update fixed effects and any additional unknowns involved in the more complex model.

4. Simulation Example

To assess the performance of the approach, we start by considering a simulation example. We simulated data for $n = 100$ subjects under the functional data analysis model described in Section 3, with fixed values of $\upsilon$ and $\tau$, $\mathcal{X} = (0, 1)$, $b_1(x) = 1$, $b_{j+1}(x) = \exp\{-5(x - z_j)^2\}$ and $z_j = (j-1)/9$, for $j = 1, \ldots, 9$. In addition, we let $n_i$ equal a fixed constant plus a discrete uniform random variable, for $i = 1, \ldots, n$, and simulated $t_{ij} \sim \mathrm{Uniform}(0, C_i)$, with $C_i = 1$ for the first 50 subjects and $C_i$ equal to a smaller cutoff for the second 50 subjects, so that the second group has no observations in the right-hand portion of $[0, 1]$. We let $\Theta_h = (\Theta_{h1}, \ldots, \Theta_{hp})'$, for $h = 1, 2, 3$, with the elements $\Theta_{hl}$ drawn independently from a mixture placing probability 0.8 on zero and the remaining probability 0.2 on a normal distribution. Then, letting $\theta_{ij} = \Theta_{\gamma_{ij} j}$, for $j = 1, \ldots, p$, we simulated the elements of $\gamma_i = (\gamma_{i1}, \ldots, \gamma_{ip})'$ from a Markov chain, with $\gamma_{i1}$ set to a random element of $\{1, 2, 3\}$, $\gamma_{ij} = \gamma_{i,j-1}$ with probability 0.9 and $\gamma_{ij}$ otherwise set to a random element of $\{1, 2, 3\}$ (a code sketch of this design is given below).

To implement our Bayesian analysis, the approach of Section 3 was applied after choosing gamma priors for each of the parameters, with default hyperparameter values including $a_{\psi} = 5$. These were considered reasonable default values when $\mathcal{X} = (0, 1)$ and the overall variance of $y$ is within a modest factor of one. In the absence of prior knowledge of the scale of $y$, one can standardize the data. In addition, we let $K_{\psi}(z, z') = \exp\{-\psi(z - z')^2\}$ and $z^*_j = z_j$, for $j = 1, \ldots, 9$. Note that these values of $z_j$, $z^*_j$, and $p$ provide a reasonable default for smooth curves, as there are sufficient numbers of equally-spaced kernels to capture a very wide variety of smooth curve shapes. The results are robust to increases in the number of kernels due to the adaptive shrinkage resulting from allowing the data to inform about $\kappa$ and $\beta$.
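The design above can be sketched as follows. This is a reconstruction for illustration only: the per-subject sample sizes, error scale and degrees of freedom, kernel bandwidth, truncation point for the second group and the transition probability are assumed values chosen to mimic the description.

```python
import numpy as np

rng = np.random.default_rng(7)

n, p, n_clust = 100, 10, 3                        # illustrative sizes
z = np.linspace(0.0, 1.0, p - 1)                  # kernel locations for b_2,...,b_p

def basis(x):                                     # b_1(x) = 1, b_{j+1}(x) = exp{-5 (x - z_j)^2}
    x = np.atleast_1d(x)
    return np.hstack([np.ones((len(x), 1)), np.exp(-5.0 * (x[:, None] - z[None, :]) ** 2)])

# sparse unique coefficient vectors and a sticky Markov chain over local cluster indices
Theta = np.where(rng.random((n_clust, p)) < 0.8, 0.0, rng.normal(size=(n_clust, p)))
gamma = np.empty((n, p), dtype=int)
gamma[:, 0] = rng.integers(0, n_clust, size=n)
for j in range(1, p):
    stay = rng.random(n) < 0.9
    gamma[:, j] = np.where(stay, gamma[:, j - 1], rng.integers(0, n_clust, size=n))
theta = Theta[gamma, np.arange(p)]                # theta_ij = Theta_{gamma_ij, j}

# unequally spaced observation times; the second half of subjects observed only on [0, C_i]
C = np.where(np.arange(n) < n // 2, 1.0, 0.5)     # truncation point for the second group (illustrative)
data = []
for i in range(n):
    n_i = 5 + rng.integers(0, 11)                 # number of observations per subject (illustrative)
    t = rng.uniform(0.0, C[i], size=n_i)
    y = basis(t) @ theta[i] + rng.standard_t(4, size=n_i) / np.sqrt(10.0)   # t errors, assumed scale
    data.append((t, y))
```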

The MCMC algorithm was run for several thousand iterations, including a burn-in, with the chain thinned due to storage constraints. Based on examination of trace plots of the parameters and function estimates at a variety of points for different subjects, we observed rapid rates of apparent convergence and mixing. Note that it is important to avoid diagnosing convergence based on examination of the unique coefficient vectors, $\Theta_h$, due to the well-known label switching issue, which is omnipresent in mixture models (Jasra, Holmes and Stephens, 2005). This label switching does not create problems as long as the focus of inference is not on mixture component-specific quantities and one obtains good rates of convergence and mixing in quantities of interest, such as individual-specific function estimates. The focus of this article is on using local partitioning as a highly flexible approach for sparse modeling of unknown random effects distributions underlying functional data and not on clustering as a focus of the analysis.

Figure 3 shows the data, estimated posterior mean curves (solid lines), 95% pointwise credible intervals (dashed lines) and true curves (dotted lines) for subjects 1, ..., 8 (top 8 panels) and subjects 51, ..., 58 (bottom 8 panels). Recall that subjects 1, ..., 50 have observations throughout the [0, 1] interval, while subjects 51, ..., 100 have no observations in the $[C_i, 1]$ interval. For subjects 1, ..., 50 the estimates are very close to the true curves, while there is some deviation beyond the point $C_i$ for some of subjects 51, ..., 100. However, in general, predictions across this interval are quite good, with the true curves enclosed in the credible bounds. The average mean square error across a dense grid of times between 0 and 1 is .56 and the mean width of 95% credible intervals is .96.

Repeating the analysis for the LPP prior of Dunson (2008), the results are very good at locations close to data points. However, interpolations and predictions are not as good as for the KPP. For example, for one of the displayed subjects there is a substantial gap in the observations; across this gap, the 95% credible intervals are much wider in the LPP analysis. In addition, across the $[C_i, 1]$ region for many of subjects 51, ..., 100, the LPP estimates are substantially farther from the truth than the KPP-based estimates. The corresponding mean square error for the LPP is larger.

5. Application to Hormone Curve Data

We consider progesterone curve data in early pregnancy previously analyzed by Dunson (2008) using a functional data analysis model with kernel basis functions and t-distributed measurement errors. That article demonstrated improved performance for a local partition process (LPP) prior on the random effects distribution relative to a DP. Data consisted of daily urinary measurements of PdG, a metabolite of progesterone, starting on the identified day of ovulation and continuing for a number of additional days. There were 65 women, with varying numbers of measurements per woman.

To analyze these data using the KPP prior for the random effects distribution, we implemented the approach used in Section 4, with the same prior specification. As in the simulation example, rates of apparent convergence and mixing were good. In examining plots of the individual curve estimates for each of the 65 women, it is apparent that the estimates fit the data very well. Figure 4 shows the estimated curves and 95% pointwise credible intervals for 6 randomly selected women. There is a small but notable improvement in fit in using the KPP prior for the random effects distribution instead of the LPP prior. The improvement in fit is not attributable to over-fitting, as it is clear that a very sparse representation of the data is obtained in examining the estimated parameters in Table 1. In particular, the small $\alpha$ value suggests that a small number of unique coefficient vectors are sufficient to characterize the data. Indeed, the estimate of $\phi^*$ was small, implying that the basis coefficient vectors for all 65 women are constituted of elements selected from only a few unique coefficient vectors. In addition, the small value of $\beta$ suggests the occurrence of a few dominant locations with much higher $\lambda_h$ values, again leading to a sparse characterization.

A primary motivation for the KPP over the LPP is that the incorporation of information on the relative locations of the basis functions should allow substantial improvements in prediction. Although the LPP is flexible enough to provide a good fit to complex functional data, one may expect diminished predictive performance when there is no observed data available for a subject at times close to the time of interest. To assess predictive performance, we repeated the analysis holding out the last 5 observations for 5 women randomly selected from among the women having a sufficient number of observations. Figure 5 shows the true values of $y_{it}$ for these held-out observations versus the predicted values, with 95% predictive intervals shown with light dotted lines. The correlation between the true and predicted values was .866 and the mean square predictive error was .5. Given that hormone trajectories are difficult to predict more than a few days out, this is very good performance. For the first held-out observation the correlation was very high at .99, while for the second to fifth observations the correlations were .898, .765, .7 and .685, respectively.

For comparison, we repeated the analysis using the LPP prior with the same held-out observations. Figure 6 shows the true values of $y_{it}$ versus the predicted values. The results are clearly not as good as those shown in Figure 5 for the KPP, with a substantial subset of the points moving much further away from the line. The correlation between $y_{it}$ and $\hat{y}_{it}$ diminished to .795, the mean square predictive error increased, and the correlations for observations 1-5 were .99, .85, .58, .5, and .89, respectively. As expected, the LPP has good predictive performance when the subject has data available close to the time of interest, but the performance decays rapidly with an increasing time gap. Repeating the analysis also for a DP prior on the random effects, the mean square predictive error was .9 and the correlation between $y_{it}$ and $\hat{y}_{it}$ was .68, suggesting substantially worse predictive performance than for either the KPP or LPP.

6. Discussion

This article has proposed a new nonparametric Bayes prior for unknown random effects distributions, allowing for flexible local borrowing of information across subjects through dependent local partitioning. The KPP has clear advantages over previous nonparametric Bayes methods based on global partitioning, such as the Dirichlet process. In particular, the KPP favors a sparser representation of the data in allowing subjects to have identical basis coefficients without forcing their functions to be identical. Although functional clustering is a useful exploratory tool, it seems unlikely that functions for two different subjects are exactly identical, though the functions may be locally similar within particular regions. Our goal is not to obtain functional clusters but to use local partitioning as a tool for sparsely characterizing complex functional data, while facilitating borrowing of information in a flexible manner.

In addition to sparse characterization of unknown random effects distributions, there are clear applications of the KPP to multiple changepoint detection and image segmentation. In such settings, it is useful to consider a minor modification of the formulation to replace the vector $\Theta_h = (\Theta_{h1}, \ldots, \Theta_{hp})'$ with a scalar $\Theta_h$, letting $\theta_{ij} = \Theta_{\gamma_{ij}}$, for $j = 1, \ldots, p$. Then, using piecewise constant or linear basis functions, so that $z = (z_1, \ldots, z_p)'$ is a vector of potential knot locations, a changepoint in $f_i$ occurs at $z_j$ if $\gamma_{ij} \neq \gamma_{i,j-1}$. The KPP will automatically borrow information on changepoint locations within- and across-subjects using a more flexible dependence structure than commonly-used Markov models. In addition, computation is straightforward using the proposed MCMC algorithm.
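For the changepoint use just described, the mapping from local cluster indices to changepoint locations is immediate; a tiny sketch with hypothetical inputs:

```python
import numpy as np

def changepoints(gamma_i, knots):
    """Changepoint locations implied by a subject's local cluster indices:
    a changepoint occurs at z_j whenever gamma_ij differs from gamma_{i,j-1}."""
    gamma_i = np.asarray(gamma_i)
    return knots[1:][gamma_i[1:] != gamma_i[:-1]]

knots = np.linspace(0.0, 1.0, 10)
print(changepoints([0, 0, 0, 2, 2, 1, 1, 1, 1, 1], knots))   # -> [0.333..., 0.555...]
```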

Appendix

Proof of Theorem 1. Theorem 1 follows in a straightforward manner from Lemma 1, so it suffices to prove Lemma 1. Letting $k^{(n)} = \max\{\gamma_i\}_{i=1}^{n}$ and $n_h = \sum_{i=1}^{n} 1(\gamma_i = h)$, we have

$$\Pr(\gamma_1, \ldots, \gamma_n) = E\left[\prod_{h=1}^{k^{(n)}} \left\{\pi^*_h \prod_{l<h}(1 - \pi^*_l)\right\}^{n_h}\right] = \prod_{h=1}^{k^{(n)}} E\left\{(\pi^*_h)^{n_h} (1 - \pi^*_h)^{\sum_{l=h+1}^{k^{(n)}} n_l}\right\} = \prod_{h=1}^{k^{(n)}} \frac{\mathrm{Be}\big(n_h + 1,\ \sum_{l=h+1}^{k^{(n)}} n_l + \alpha\big)}{\mathrm{Be}(1, \alpha)},$$

where $\mathrm{Be}(\cdot)$ is the beta function. Letting $k^{(n+1)} = \max\{k^{(n)}, m\}$, $a_h = n_h + 1$, and $b_h = \sum_{l=h+1}^{k^{(n)}} n_l + \alpha$, we have

$$\Pr(\gamma_1, \ldots, \gamma_n, \gamma_{n+1} = m) = \prod_{h=1}^{k^{(n+1)}} \frac{\mathrm{Be}\big(a_h + 1(m = h),\ b_h + 1(m > h)\big)}{\mathrm{Be}(1, \alpha)}.$$

Noting that $\mathrm{Be}(a+1, b)/\mathrm{Be}(a, b) = a/(a+b)$ and $\mathrm{Be}(a, b+1)/\mathrm{Be}(a, b) = b/(a+b)$ from properties of the beta function, Lemma 1 is obtained as the ratio of the previous two expressions,

$$\Pr(\gamma_{n+1} = m \mid \gamma^{(n)}) = \left\{\prod_{h=1}^{m-1} \left(\frac{b_h}{a_h + b_h}\right)\right\} \frac{a_m}{a_m + b_m}, \qquad m = 1, \ldots, \infty.$$

Proof of Proposition 1. Part (i) is the probability that individuals $(i, j)$ and $(i, j')$ choose the same restaurant, which is trivially calculated as $\omega_j'\omega_{j'}$, plus the probability that these individuals choose different restaurants, $1 - \omega_j'\omega_{j'}$, multiplied by the probability of being assigned to the same numbered table at different restaurants,

$$\sum_{m=1}^{\infty} \left\{\left(\frac{1}{1+\alpha}\right) \prod_{t<m}\left(\frac{\alpha}{1+\alpha}\right)\right\}^2 = \left(\frac{1}{1+\alpha}\right)^2 \sum_{m=1}^{\infty} \left\{\left(\frac{\alpha}{1+\alpha}\right)^2\right\}^{m-1} = \frac{1}{1 + 2\alpha},$$

where the last identity is a consequence of the infinite geometric series identity. For part (ii), $\Pr(\gamma_{ij} = \gamma_{i'j'})$ equals the probability that the individuals in households $i$ and $i'$ choose the same restaurant and are seated at the same table, $\omega_j'\omega_{j'}/(1 + \alpha)$, plus the probability that they choose different restaurants and are assigned the same numbered table, $(1 - \omega_j'\omega_{j'})/(1 + 2\alpha)$.
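The matching probability $1/(1 + 2\alpha)$ used above can be verified with a quick Monte Carlo check (a sketch, not part of the paper): two customers seated independently at two restaurants with independent stick-breaking weights share a table number with about this frequency.

```python
import numpy as np

rng = np.random.default_rng(8)

def table_from_sticks(alpha):
    """Draw a table index from a fresh stick-breaking ocrp seating (lazily generated sticks)."""
    m, remaining = 0, 1.0
    u = rng.uniform()
    while True:
        v = rng.beta(1.0, alpha)
        if u < remaining * v:           # u falls in the mass of table m
            return m
        u -= remaining * v
        remaining *= 1.0 - v
        m += 1

alpha, reps = 1.0, 200_000
match = sum(table_from_sticks(alpha) == table_from_sticks(alpha) for _ in range(reps)) / reps
print(match, 1.0 / (1.0 + 2.0 * alpha))   # both approximately 1/3 when alpha = 1
```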

References

Aldous, D.J. (1985). Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XII (Edited by P.L. Hennequin). Springer Lecture Notes in Mathematics, Vol. 7.

Behseta, S., Kass, R.E. and Wallstrom, G.L. (2005). Hierarchical models for assessing variability among functions. Biometrika, 92, 419-434.

Bigelow, J.L. and Dunson, D.B. (2007). Bayesian adaptive regression splines for hierarchical data. Biometrics, 63, 724-732.

Bhuiyan, M.I.H., Ahmad, M.O. and Swamy, M.N.S. (2007). Spatially adaptive wavelet-based method using the Cauchy prior for denoising SAR images. IEEE Transactions on Circuits and Systems for Video Technology, 17.

Bush, C.A. and MacEachern, S.N. (1996). A semiparametric Bayesian model for randomised block designs. Biometrika, 83, 275-285.

De Iorio, M., Müller, P., Rosner, G.L. and MacEachern, S.N. (2004). An ANOVA model for dependent random measures. Journal of the American Statistical Association, 99, 205-215.

De la Cruz-Mesia, R., Quintana, F.A. and Müller, P. (2007). Semiparametric Bayesian classification with longitudinal markers. Applied Statistics, 56, 119-137.

Dunson, D.B. (2008). Nonparametric Bayes local partition models for random effects. Biometrika, to appear.

Dunson, D.B. and Park, J.-H. (2008). Kernel stick-breaking processes. Biometrika, 95, 307-323.

Dunson, D.B., Pillai, N. and Park, J.-H. (2007). Bayesian density regression. Journal of the Royal Statistical Society B, 69, 163-183.

Dunson, D.B., Xue, Y. and Carin, L. (2008). The matrix stick-breaking process: Flexible Bayes meta-analysis. Journal of the American Statistical Association, 103, 317-327.

Holmes, C.C. and Held, L. (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1.

Ishwaran, H. and James, L.F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96, 161-173.

Ishwaran, H. and James, L.F. (2003). Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, 13, 1211-1235.

Jasra, A., Holmes, C.C. and Stephens, D.A. (2005). Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, 20.

Kaufman, C. and Sain, S. (2007). Bayesian functional ANOVA modeling using Gaussian process prior distributions. Journal of Computational and Graphical Statistics, submitted.

Kleinman, K.P. and Ibrahim, J.G. (1998). A semiparametric Bayesian approach to the random effects model. Biometrics, 54.

MacEachern, S.N. (1999). Dependent nonparametric processes. In ASA Proceedings of the Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association.

Morris, J.S. and Carroll, R.J. (2006). Wavelet-based functional mixed models. Journal of the Royal Statistical Society B, 68.

Müller, P. and Rosner, G.L. (1997). A Bayesian population model with hierarchical mixture priors applied to blood count data. Journal of the American Statistical Association, 92.

Nguyen, X. and Gelfand, A.E. (2008). The Dirichlet labeling process for functional data analysis. Technical Report, Department of Statistical Science, Duke University.

Papaspiliopoulos, O. (2008). A note on posterior sampling from Dirichlet process mixture models. Working Paper, Centre for Research in Statistical Methodology, University of Warwick, Coventry, UK.

Papaspiliopoulos, O. and Roberts, G. (2008). Retrospective MCMC for Dirichlet process hierarchical models. Biometrika, 95.

Petrone, S., Guindani, M. and Gelfand, A.E. (2008). Hybrid Dirichlet processes for functional data. Journal of the Royal Statistical Society B, revision invited.

Ray, S. and Mallick, B. (2006). Functional clustering by Bayesian wavelet methods. Journal of the Royal Statistical Society B, 68, 305-332.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639-650.

Tipping, M.E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211-244.

Thompson, W. and Rosen, O. (2008). A Bayesian model for sparse functional data. Biometrics, 64, 54-63.

Walker, S.G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics: Simulation and Computation, 36, 45-54.

West, M. (1987). On scale mixtures of normal distributions. Biometrika, 74, 646-648.

Table 1. Posterior summaries of parameters in the KPP analysis of the hormone curve data.

Parameter   Mean   Median   SD    95% CI
α           .      .8       .6    [.8, .]
β           .      .        .7    [.8, .5]
τ                                 [.7, .7]
ψ           .      .        .9    [.8, 6.56]
υ           .6     .6       .     [., .]
κ                                 [.6, 78.6]
φ*                                [., 9.5]

Fig. 1. Samples of $\Pr(\gamma_{ij} = \gamma_{ij'})$ under the kernel partition process with various choices of the hyperparameters (four panels plotting $\rho$ against time for different settings of $\alpha$, $\beta$ and $\psi$). We let $z^*_j = z_j$, for $j = 1, \ldots, p$, correspond to an equally-spaced grid of times between 0 and 1, with $K(x, z_j) = \exp\{-\psi(x - z_j)^2\}$.

Fig. 2. Samples from the prior for $f_i$ under the functional data model (1) with a KPP prior on the distribution of the basis coefficients across subjects, for different choices of the hyperparameter values (four panels plotting $f_i(t)$ against time $t$ for different settings of $\alpha$, $\beta$ and $\psi$).

Fig. 3. Results for the simulation example. The upper 8 plots are for subjects having observations distributed randomly in [0, 1], while the lower 8 plots are for subjects having observations only in an initial subinterval of [0, 1]. The solid lines are posterior mean curves, dashed lines are 95% pointwise credible intervals, and dotted lines are true curves.

Fig. 4. Log(PdG) data and KPP-based function estimates for 6 randomly selected women. The data points are plotted individually, the posterior means are solid lines, and 95% credible intervals are dashed lines.

Fig. 5. Out-of-sample predictive performance for the KPP (predicted values plotted against the true values of $y_{it}$). The last 5 log(PdG) observations for 5 women randomly selected from those with a sufficient number of observations were held out.

Fig. 6. Out-of-sample predictive performance for the LPP (Dunson, 2008), shown as predicted values plotted against the true values of $y_{it}$. The same observations were held out as in Figure 5.


More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

STA 216, GLM, Lecture 16. October 29, 2007

STA 216, GLM, Lecture 16. October 29, 2007 STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Nonparametric Bayesian Models --Learning/Reasoning in Open Possible Worlds Eric Xing Lecture 7, August 4, 2009 Reading: Eric Xing Eric Xing @ CMU, 2006-2009 Clustering Eric Xing

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles

Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles Peter Green and John Lau University of Bristol P.J.Green@bristol.ac.uk Isaac Newton Institute, 11 December

More information

Likelihood-free MCMC

Likelihood-free MCMC Bayesian inference for stable distributions with applications in finance Department of Mathematics University of Leicester September 2, 2011 MSc project final presentation Outline 1 2 3 4 Classical Monte

More information

Stochastic Processes, Kernel Regression, Infinite Mixture Models

Stochastic Processes, Kernel Regression, Infinite Mixture Models Stochastic Processes, Kernel Regression, Infinite Mixture Models Gabriel Huang (TA for Simon Lacoste-Julien) IFT 6269 : Probabilistic Graphical Models - Fall 2018 Stochastic Process = Random Function 2

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

A Nonparametric Approach Using Dirichlet Process for Hierarchical Generalized Linear Mixed Models

A Nonparametric Approach Using Dirichlet Process for Hierarchical Generalized Linear Mixed Models Journal of Data Science 8(2010), 43-59 A Nonparametric Approach Using Dirichlet Process for Hierarchical Generalized Linear Mixed Models Jing Wang Louisiana State University Abstract: In this paper, we

More information

Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation

Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation Ian Porteous, Alex Ihler, Padhraic Smyth, Max Welling Department of Computer Science UC Irvine, Irvine CA 92697-3425

More information

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January

More information

Spatial Bayesian Nonparametrics for Natural Image Segmentation

Spatial Bayesian Nonparametrics for Natural Image Segmentation Spatial Bayesian Nonparametrics for Natural Image Segmentation Erik Sudderth Brown University Joint work with Michael Jordan University of California Soumya Ghosh Brown University Parsing Visual Scenes

More information

Flexible Regression Modeling using Bayesian Nonparametric Mixtures

Flexible Regression Modeling using Bayesian Nonparametric Mixtures Flexible Regression Modeling using Bayesian Nonparametric Mixtures Athanasios Kottas Department of Applied Mathematics and Statistics University of California, Santa Cruz Department of Statistics Brigham

More information

Non-parametric Clustering with Dirichlet Processes

Non-parametric Clustering with Dirichlet Processes Non-parametric Clustering with Dirichlet Processes Timothy Burns SUNY at Buffalo Mar. 31 2009 T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 1 / 24 Introduction

More information

A nonparametric Bayesian approach to inference for non-homogeneous. Poisson processes. Athanasios Kottas 1. (REVISED VERSION August 23, 2006)

A nonparametric Bayesian approach to inference for non-homogeneous. Poisson processes. Athanasios Kottas 1. (REVISED VERSION August 23, 2006) A nonparametric Bayesian approach to inference for non-homogeneous Poisson processes Athanasios Kottas 1 Department of Applied Mathematics and Statistics, Baskin School of Engineering, University of California,

More information

Bayesian semiparametric modeling and inference with mixtures of symmetric distributions

Bayesian semiparametric modeling and inference with mixtures of symmetric distributions Bayesian semiparametric modeling and inference with mixtures of symmetric distributions Athanasios Kottas 1 and Gilbert W. Fellingham 2 1 Department of Applied Mathematics and Statistics, University of

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Nonparametric Bayes tensor factorizations for big data

Nonparametric Bayes tensor factorizations for big data Nonparametric Bayes tensor factorizations for big data David Dunson Department of Statistical Science, Duke University Funded from NIH R01-ES017240, R01-ES017436 & DARPA N66001-09-C-2082 Motivation Conditional

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Modeling and Predicting Healthcare Claims

Modeling and Predicting Healthcare Claims Bayesian Nonparametric Regression Models for Modeling and Predicting Healthcare Claims Robert Richardson Department of Statistics, Brigham Young University Brian Hartman Department of Statistics, Brigham

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

arxiv: v1 [stat.me] 3 Mar 2013

arxiv: v1 [stat.me] 3 Mar 2013 Anjishnu Banerjee Jared Murray David B. Dunson Statistical Science, Duke University arxiv:133.449v1 [stat.me] 3 Mar 213 Abstract There is increasing interest in broad application areas in defining flexible

More information

Non-parametric Bayesian Modeling and Fusion of Spatio-temporal Information Sources

Non-parametric Bayesian Modeling and Fusion of Spatio-temporal Information Sources th International Conference on Information Fusion Chicago, Illinois, USA, July -8, Non-parametric Bayesian Modeling and Fusion of Spatio-temporal Information Sources Priyadip Ray Department of Electrical

More information

Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model

Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model David B Dahl 1, Qianxing Mo 2 and Marina Vannucci 3 1 Texas A&M University, US 2 Memorial Sloan-Kettering

More information

Bayesian Multivariate Logistic Regression

Bayesian Multivariate Logistic Regression Bayesian Multivariate Logistic Regression Sean M. O Brien and David B. Dunson Biostatistics Branch National Institute of Environmental Health Sciences Research Triangle Park, NC 1 Goals Brief review of

More information

Nonparametric Drift Estimation for Stochastic Differential Equations

Nonparametric Drift Estimation for Stochastic Differential Equations Nonparametric Drift Estimation for Stochastic Differential Equations Gareth Roberts 1 Department of Statistics University of Warwick Brazilian Bayesian meeting, March 2010 Joint work with O. Papaspiliopoulos,

More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance

More information

November 2002 STA Random Effects Selection in Linear Mixed Models

November 2002 STA Random Effects Selection in Linear Mixed Models November 2002 STA216 1 Random Effects Selection in Linear Mixed Models November 2002 STA216 2 Introduction It is common practice in many applications to collect multiple measurements on a subject. Linear

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information

Lecture 16-17: Bayesian Nonparametrics I. STAT 6474 Instructor: Hongxiao Zhu

Lecture 16-17: Bayesian Nonparametrics I. STAT 6474 Instructor: Hongxiao Zhu Lecture 16-17: Bayesian Nonparametrics I STAT 6474 Instructor: Hongxiao Zhu Plan for today Why Bayesian Nonparametrics? Dirichlet Distribution and Dirichlet Processes. 2 Parameter and Patterns Reference:

More information

Nonparametric Mixed Membership Models

Nonparametric Mixed Membership Models 5 Nonparametric Mixed Membership Models Daniel Heinz Department of Mathematics and Statistics, Loyola University of Maryland, Baltimore, MD 21210, USA CONTENTS 5.1 Introduction................................................................................

More information

A Nonparametric Bayesian Model for Multivariate Ordinal Data

A Nonparametric Bayesian Model for Multivariate Ordinal Data A Nonparametric Bayesian Model for Multivariate Ordinal Data Athanasios Kottas, University of California at Santa Cruz Peter Müller, The University of Texas M. D. Anderson Cancer Center Fernando A. Quintana,

More information

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures 17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Hybrid Dirichlet processes for functional data

Hybrid Dirichlet processes for functional data Hybrid Dirichlet processes for functional data Sonia Petrone Università Bocconi, Milano Joint work with Michele Guindani - U.T. MD Anderson Cancer Center, Houston and Alan Gelfand - Duke University, USA

More information

STAT 518 Intro Student Presentation

STAT 518 Intro Student Presentation STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible

More information

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model Thai Journal of Mathematics : 45 58 Special Issue: Annual Meeting in Mathematics 207 http://thaijmath.in.cmu.ac.th ISSN 686-0209 The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for

More information

Gentle Introduction to Infinite Gaussian Mixture Modeling

Gentle Introduction to Infinite Gaussian Mixture Modeling Gentle Introduction to Infinite Gaussian Mixture Modeling with an application in neuroscience By Frank Wood Rasmussen, NIPS 1999 Neuroscience Application: Spike Sorting Important in neuroscience and for

More information

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I X i Ν CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr 2004 Dirichlet Process I Lecturer: Prof. Michael Jordan Scribe: Daniel Schonberg dschonbe@eecs.berkeley.edu 22.1 Dirichlet

More information

Bayesian Modeling of Conditional Distributions

Bayesian Modeling of Conditional Distributions Bayesian Modeling of Conditional Distributions John Geweke University of Iowa Indiana University Department of Economics February 27, 2007 Outline Motivation Model description Methods of inference Earnings

More information

Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes

Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes Yee Whye Teh (1), Michael I. Jordan (1,2), Matthew J. Beal (3) and David M. Blei (1) (1) Computer Science Div., (2) Dept. of Statistics

More information

Normalized kernel-weighted random measures

Normalized kernel-weighted random measures Normalized kernel-weighted random measures Jim Griffin University of Kent 1 August 27 Outline 1 Introduction 2 Ornstein-Uhlenbeck DP 3 Generalisations Bayesian Density Regression We observe data (x 1,

More information

Efficient adaptive covariate modelling for extremes

Efficient adaptive covariate modelling for extremes Efficient adaptive covariate modelling for extremes Slides at www.lancs.ac.uk/ jonathan Matthew Jones, David Randell, Emma Ross, Elena Zanini, Philip Jonathan Copyright of Shell December 218 1 / 23 Structural

More information

Dependent hierarchical processes for multi armed bandits

Dependent hierarchical processes for multi armed bandits Dependent hierarchical processes for multi armed bandits Federico Camerlenghi University of Bologna, BIDSA & Collegio Carlo Alberto First Italian meeting on Probability and Mathematical Statistics, Torino

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

More information

Gibbs Sampling in Linear Models #2

Gibbs Sampling in Linear Models #2 Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

Bayesian Nonparametrics

Bayesian Nonparametrics Bayesian Nonparametrics Peter Orbanz Columbia University PARAMETERS AND PATTERNS Parameters P(X θ) = Probability[data pattern] 3 2 1 0 1 2 3 5 0 5 Inference idea data = underlying pattern + independent

More information

Infinite Latent Feature Models and the Indian Buffet Process

Infinite Latent Feature Models and the Indian Buffet Process Infinite Latent Feature Models and the Indian Buffet Process Thomas L. Griffiths Cognitive and Linguistic Sciences Brown University, Providence RI 292 tom griffiths@brown.edu Zoubin Ghahramani Gatsby Computational

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

A Brief Overview of Nonparametric Bayesian Models

A Brief Overview of Nonparametric Bayesian Models A Brief Overview of Nonparametric Bayesian Models Eurandom Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin Also at Machine

More information

Kernel adaptive Sequential Monte Carlo

Kernel adaptive Sequential Monte Carlo Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline

More information

Hierarchical Dirichlet Processes with Random Effects

Hierarchical Dirichlet Processes with Random Effects Hierarchical Dirichlet Processes with Random Effects Seyoung Kim Department of Computer Science University of California, Irvine Irvine, CA 92697-34 sykim@ics.uci.edu Padhraic Smyth Department of Computer

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

BAYESIAN NONPARAMETRIC MODELLING WITH THE DIRICHLET PROCESS REGRESSION SMOOTHER

BAYESIAN NONPARAMETRIC MODELLING WITH THE DIRICHLET PROCESS REGRESSION SMOOTHER Statistica Sinica 20 (2010), 1507-1527 BAYESIAN NONPARAMETRIC MODELLING WITH THE DIRICHLET PROCESS REGRESSION SMOOTHER J. E. Griffin and M. F. J. Steel University of Kent and University of Warwick Abstract:

More information

Multivariate Bayes Wavelet Shrinkage and Applications

Multivariate Bayes Wavelet Shrinkage and Applications Journal of Applied Statistics Vol. 32, No. 5, 529 542, July 2005 Multivariate Bayes Wavelet Shrinkage and Applications GABRIEL HUERTA Department of Mathematics and Statistics, University of New Mexico

More information