Approximate Marginal Posterior for Log Gaussian Cox Processes

Shinichiro Shirota and Alan E. Gelfand

Department of Statistical Science, Duke University, US. ss571@stat.duke.edu; alan@stat.duke.edu.

June 12, 2018

Abstract

The log Gaussian Cox process is a flexible class of Cox processes, whose intensity surface is stochastic, for incorporating complex spatial and temporal structure of point patterns. Straightforward inference based on Markov chain Monte Carlo is computationally heavy because the cost of the inverse or Cholesky decomposition of the high dimensional covariance matrices of the Gaussian latent variables is cubic in their dimension. Furthermore, since the hyperparameters of the Gaussian latent variables are highly correlated with the sampled Gaussian latent processes themselves, standard Markov chain Monte Carlo strategies are inefficient. In this paper, we propose an efficient and scalable computational strategy for spatial log Gaussian Cox processes. The proposed algorithm is based on the pseudo-marginal Markov chain Monte Carlo approach. With it, we propose estimation of an approximate marginal posterior for the parameters together with comprehensive model validation strategies. We provide details for all of the above along with simulation investigations for univariate and multivariate settings and the analysis of a point pattern of tree data exhibiting positive and negative interaction between different species.

Keywords: pseudo-marginal Markov chain Monte Carlo approach; kernel mixture marginalization; multivariate Poisson log normal; log Gaussian Cox processes; importance sampling; Laplace approximation

1 Introduction

There is increasing interest in analyzing spatial point process data. In the literature, the most widely adopted classes of models are nonhomogeneous Poisson processes (NHPP) or, more generally, log Gaussian Cox processes (LGCP) (see Møller

and Waagepetersen (2004) and references therein). The intensity surface of a Cox process is stochastic, so, given the intensity surface, Cox processes are Poisson processes. Among Cox processes, the LGCP is a flexible class of point processes for incorporating complex structure of spatial or temporal point patterns. The process was originally proposed by Møller et al. (1998) and extended to the space-time case by Brix and Diggle (2001). As the name suggests, the intensity function of this process is driven by the exponential of a Gaussian process (GP). Since the LGCP is a type of latent Gaussian model, sampling of the GP is required. However, the likelihood of the LGCP includes an infinite dimensional stochastic integral over the study region, which is analytically intractable. Hence, some approximation is required. One simple strategy is to lay a grid over the study region, approximate the intractable integral with a Riemann sum, and plug this estimator into the likelihood (Møller et al. (1998) and Møller and Waagepetersen (2004)). Then, a standard Markov chain Monte Carlo (MCMC) scheme is available even though conditional sampling of high dimensional Gaussian latent variables is required. The convergence of posterior samples based on this approximated likelihood to the exact posterior distribution is guaranteed by Waagepetersen (2004). Sampling of high dimensional GPs then becomes the main computational task. An alternative approach is the sigmoidal Gaussian Cox process (SGCP, Adams et al. (2009)). This approach utilizes the thinning property of Cox processes: it avoids the grid approximation and obtains exact inference by introducing latent thinned points in addition to the observed points. Although this approach is potentially attractive in the sense of exact inference, it requires larger dimensional GP outputs. Hence, it is computationally infeasible for large datasets, as the authors note (see Adams et al. (2009)). So, more or less, the sampling of high dimensional GPs is a fundamental problem, as is often observed in Gaussian latent process models. Calculating the inverse or Cholesky decomposition of an n-dimensional covariance matrix requires O(n^3) computational time and O(n^2) memory for storage. In the MCMC based Bayesian inference literature, Møller et al. (1998) implement the Metropolis adjusted Langevin algorithm (MALA, Roberts and Rosenthal (1998)). This algorithm achieves a higher asymptotic acceptance rate than the random walk Metropolis-Hastings algorithm (RWMH, e.g., Robert and Casella (2004)) by introducing a transition density induced by the Langevin diffusion of the target distribution. Chakraborty et al. (2011) discuss various ad hoc approaches used by ecologists in the context of presence-only datasets; they implemented the LGCP with the Gaussian predictive process approximation (GPP, Banerjee et al. (2008)). Diggle et al. (2013) survey space-time and multivariate LGCP models and implement manifold MALA (MMALA, Girolami and Calderhead (2013)), which is a manifold extension of MALA that incorporates the geometric structure of the target density

within the Langevin dynamics. Although these algorithms are potentially efficient, they require careful tuning of some parameters in the transition density. Incorporating the target density information also requires further computational time. More recently, Leininger and Gelfand (2016) implemented elliptical slice sampling, proposed by Murray et al. (2010), for sampling the high dimensional Gaussian latent variables of the spatial LGCP. Although this algorithm might be relatively slow, it does not require fine tuning or further computation of target density information. As an alternative stream of Bayesian inference, the integrated nested Laplace approximation (INLA, Rue et al. (2009)) was proposed, which is a highly efficient approximate Bayesian inference scheme for structured latent Gaussian process models. This approach takes a different path from MCMC, i.e., it calculates the marginal posterior distribution of parameters by implementing the Laplace approximation (Tierney and Kadane (1986)). By approximating the Gaussian random field (GRF) with a Gaussian Markov random field (GMRF, Rue and Held (2005)) structure for the precision matrix of the Gaussian prior, it accomplishes computationally efficient sampling of the posterior marginal distributions of parameters and latent Gaussian variables. Although the approach is based on the GMRF approximation, Lindgren et al. (2011) show the connection of the GMRF to the GRF through a stochastic partial differential equation (SPDE). Hence, the parameters of the GRF can be estimated through this connection while utilizing the computationally efficient structure of the GMRF. Illian et al. (2012) implemented INLA for the inference of the LGCP and demonstrated its applicability. More recently, Simpson et al. (2016) implemented INLA with Lindgren et al. (2011) for the inference of the LGCP and discussed its convergence properties. Although the INLA approach has been successfully implemented for latent GP models, its fundamental assumption is that the dimension of the parameter is low (basically 2 to 5, and not exceeding 20, see Rue et al. (2016)). However, in practice, we face situations where the estimation of higher dimensional parameters is our main concern, e.g., when relatively high dimensional covariate information is available (e.g., Waagepetersen et al. (2016)). Recently, Bayesian inference for the marginal posterior from the MCMC perspective, called the pseudo-marginal MCMC approach, was proposed by Andrieu and Roberts (2009). The algorithm simply plugs an unbiased estimator of the marginal likelihood, integrated over the latent variables, into the acceptance ratio instead of the likelihood itself. A remarkable property of the method is that convergence to the exact marginal posterior distribution is guaranteed whenever an unbiased estimator of the marginal likelihood is used. The efficiency of the algorithm depends on the variance of the estimator. Hence, the primary task for this algorithm is to construct an unbiased estimator while keeping its variance as small as possible. A straightforward construction of this estimator is importance sampling.

However, the direct implementation of PM-MCMC for the LGCP faces a serious computational problem: it must implement high dimensional importance sampling to construct the unbiased estimate. Although the accuracy of the inference depends on the resolution of the approximation, a large grid approximation increases the variance of the unbiased estimator. In this paper, we propose a comprehensive Bayesian inference scheme for the LGCP, with estimation of an approximate marginal posterior by the PM-MCMC approach. This computational scheme is composed of two steps. At the first step, we estimate the approximate marginal posterior of the parameters by PM-MCMC. Since direct implementation is difficult due to the high dimensionality of the Gaussian latent variables, we take a grid approximation over the study region and convert the likelihood into a multivariate Poisson log normal (mPLN, Aitchison and Ho (1989)). At the next step, we calculate the marginal posterior of the Gaussian latent variables. Hence, our computational scheme is similar in spirit to INLA. Based on these approximate posterior samples, we suggest two types of model validation strategies for the LGCP. We also investigate our methodology through simulation studies and real data applications.

2 Log Gaussian Cox Processes and Their Bayesian Inference

Let S = {s_1, ..., s_n} be the observed point pattern and D the study region. The likelihood of the LGCP is defined as

L(S \mid \theta) = \exp\left( -\int_D \lambda(u \mid \theta, z(u))\, du \right) \prod_{s \in S} \lambda(s \mid \theta, z(s))    (1)

\log \lambda(s \mid \theta, z(s)) = X(s)'\beta + z(s), \quad z \sim GP(0, C_\zeta(\cdot, \cdot))    (2)

where X(s) is a covariate vector, θ = (β, ζ) is a parameter vector in which β is a coefficient vector and ζ is a hyperparameter vector for the Gaussian covariance function, and z(s) is a Gaussian process at location s. The likelihood contains an infinite dimensional stochastic integral inside the exponential, which is analytically intractable. Hence, in practice, some approximation of this integral is required. The straightforward approach is a finite grid approximation, i.e., \int_D \lambda(u \mid \theta, z(u))\, du \approx \sum_{k=1}^{K} \lambda(u_k \mid \theta, z(u_k)) \Delta_k (Møller et al. (1998) and Illian et al. (2012)). Since the convergence of posterior samples based on the approximated likelihood to the exact posterior distribution of the parameters of interest is guaranteed as K goes to infinity (Waagepetersen (2004)), we can implement MCMC given the approximated likelihood. For this approximation, n + K GP outputs have to be sampled, so we need to calculate the inverse of an (n + K)-dimensional GP covariance matrix within each MCMC iteration.
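To fix ideas, the sketch below evaluates this grid-approximated log-likelihood, folding the observed points into per-cell counts; the equal-area regular grid and the covariate matrix X evaluated at cell centroids are illustrative assumptions rather than the exact implementation used in the paper.

import numpy as np

def lgcp_grid_loglik(counts, cell_area, X, beta, z):
    """Riemann-sum approximation of the LGCP log-likelihood in (1).

    counts: (K,) number of observed points in each grid cell
    cell_area: area of one grid cell (Delta_k, assumed equal across cells)
    X: (K, p) covariates at the cell centroids
    beta: (p,) regression coefficients
    z: (K,) Gaussian latent field at the cell centroids
    """
    log_lam = X @ beta + z                                   # log intensity per cell
    # exp(-integral of lambda) via a Riemann sum, product over points done cell-wise
    return -cell_area * np.sum(np.exp(log_lam)) + np.sum(counts * log_lam)

This function is reused conceptually throughout: given a covariance for z, one MCMC scheme or another must repeatedly evaluate it while updating the latent field.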

Due to this inversion cost, some variations of the Gaussian predictive process (GPP) approach proposed recently in the geostatistics literature, e.g., the nearest-neighbor GP (NNGP, Datta et al. (2015)) and the multi-resolution GP (MRGP, Katzfuss (2016)), are recommended for MCMC based inference of the LGCP. An alternative approach is the sigmoidal Gaussian Cox process (SGCP) of Adams et al. (2009). This approach specifies the intensity as λ(s) = λ* / (1 + exp(−z(s))), where λ* is the maximum of the intensity surface over the study region. By introducing and sampling thinned events {s_thin} as latent variables, this approach avoids the numerical integration inside the exponential. Although this approach might be applicable in some cases, e.g., when the number of points is small or a smooth intensity surface is expected, we still need to sample high dimensional latent variables, i.e., GPs at the observed and thinned locations. Hence, in the discussion below, we take MCMC based on the grid approximated likelihood as the default Bayesian inference strategy for the LGCP. These MCMC based inferences for the LGCP require the calculation of the inverse or Cholesky decomposition of covariance matrices of high dimensional GPs within each MCMC iteration. The computational cost is O(n^3) time and O(n^2) memory for storage. In addition to the computational cost of the inverse calculation for the LGCP, sampling inefficiency between the Gaussian latent variables and the parameters is observed (Filippone and Girolami (2014)). The hyperparameters of the covariance function depend on the sampled Gaussian latent variables; when the sampling of the GPs is inefficient, the sampling of the hyperparameters is also inefficient. Furthermore, the estimates of the coefficients of spatial covariates are also highly correlated with the Gaussian latent variables. To facilitate the sampling of Gaussian latent variables, sophisticated sampling schemes have been developed, e.g., Hamiltonian Monte Carlo (Neal (2010)), the Metropolis adjusted Langevin algorithm (Roberts and Rosenthal (1998), Møller et al. (1998)) and their manifold extensions (Girolami and Calderhead (2013)). However, the performance of these algorithms depends on careful tuning of some parameters, or requires the calculation of a computationally heavy pre-conditioning matrix (Girolami and Calderhead (2013)). In order to avoid detailed tuning of algorithms, Leininger and Gelfand (2016) implemented elliptical slice sampling (Murray et al. (2010)) for the inference of the LGCP. Hence, the main computational burdens for the Bayesian inference of the LGCP are (1) the evaluation of the inverse or Cholesky decomposition of covariance matrices for sampling high dimensional Gaussian latent variables within each MCMC iteration and (2) the inefficiency caused by the correlation between the Gaussian latent processes and the related parameters. In the subsections below, we describe some computational schemes for latent Gaussian process models, especially from the LGCP perspective, and their pros and cons.

2.1 Metropolis Adjusted Langevin Algorithm

The Metropolis adjusted Langevin algorithm (MALA) was originally proposed by Roberts and Rosenthal (1998). This approach is an MH algorithm with a transition density driven by the Langevin diffusion (see Roberts and Tweedie (1996)), i.e.,

z^{*} = z^{(i-1)} + \sigma_n v^{(i-1)} + \frac{\sigma_n^2}{2} \nabla \log \pi_n(z^{(i-1)})    (3)

where the random variables v^{(i-1)} are independent standard normal and \sigma_n^2 is the step variance. In contrast to traditional random walk MH algorithms, the Langevin algorithm utilizes local information about the target density. Since it exploits the structure of the target density, it achieves a higher acceptance rate than random walk MH (Roberts and Rosenthal (1998)). Girolami and Calderhead (2013) proposed manifold MALA (MMALA), which incorporates the information matrix of the target density as a pre-conditioning matrix M, i.e.,

z^{*} = z^{(i-1)} + \sigma_n M^{-1/2} v^{(i-1)} + \frac{\sigma_n^2}{2} M^{-1} \nabla \log \pi_n(z^{(i-1)})    (4)

where M is the expected information matrix. They implemented the algorithm for various applications: logistic regression, stochastic volatility models, log Gaussian Cox processes, and dynamic systems driven by non-linear differential equations. Although MALA and its extensions are efficient algorithms, they require careful tuning of parameters, and calculating the information matrix is computationally costly.
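For concreteness, a minimal sketch of one MALA update under a generic differentiable log target is given below; the user-supplied functions log_target and grad_log_target, and the step size sigma, are assumptions of this sketch, not a specification taken from the paper.

import numpy as np

def mala_step(z, log_target, grad_log_target, sigma, rng):
    """One Metropolis adjusted Langevin update of a latent vector z, as in (3)."""
    prop = z + 0.5 * sigma**2 * grad_log_target(z) + sigma * rng.standard_normal(z.shape)

    # log proposal densities q(a | b), up to a common additive constant
    def log_q(a, b):
        diff = a - (b + 0.5 * sigma**2 * grad_log_target(b))
        return -0.5 * np.sum(diff**2) / sigma**2

    log_alpha = (log_target(prop) - log_target(z)
                 + log_q(z, prop) - log_q(prop, z))
    if np.log(rng.uniform()) < log_alpha:
        return prop, True                                 # accepted
    return z, False                                       # rejected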

2.2 Integrated Nested Laplace Approximation

The integrated nested Laplace approximation (INLA), proposed by Rue et al. (2009), is a highly efficient approximate Bayesian inference scheme for structured latent Gaussian models; a more recent review is Rue et al. (2016). Let y = {y_1, ..., y_n} be the observed dataset. The goal of the algorithm is to estimate the approximate marginal posterior of the Gaussian latent variables, π(z | y). The approach implements the Laplace approximation (Tierney and Kadane (1986) and Barndorff-Nielsen and Cox (1989)) for calculating both the marginal posterior for θ and that for z. The marginal posterior of z is evaluated as

\pi(z \mid y) \approx \sum_j \pi(z \mid \theta_j, y)\, \pi(\theta_j \mid y)\, \Delta_{\theta_j}    (5)

where the sum is over gridded values θ_j with associated area Δ_{θ_j}. The calculation of π(θ | y) is as follows:

\pi(\theta \mid y) \propto \left. \frac{\pi(z, y, \theta)}{\pi_G(z \mid \theta, y)} \right|_{z = z^{*}(\theta)}    (6)

where π_G(z | θ, y) is the Gaussian approximation to the full conditional of z, and z*(θ) is the mode of the full conditional for z given θ (Rue et al. (2009) and Martins et al. (2013)). When π_G(z | θ, y) coincides with the exact full conditional π(z | θ, y), this expression recovers π(θ | y) exactly, for arbitrary z. For the calculation of π(z | θ, y), the Laplace approximation is combined with a Gaussian Markov random field (GMRF, Rue and Held (2005)) approximation for the precision matrix of the Gaussian process to obtain fast computation. In the Gaussian case, a zero entry in the precision matrix for a pair of variables is equivalent to their conditional independence. Due to this property, the computational cost in the spatial GMRF case is O(n^{3/2}) time with O(n log(n)) memory for storage (see Rue and Held (2005)). Illian et al. (2012) and Simpson et al. (2016) implement the INLA approach for the LGCP and investigate its convergence properties by utilizing the theoretical results of Lindgren et al. (2011) on the connection between the GMRF and the GRF.

Although INLA based inference for the LGCP is highly efficient, there are some unsolved problems. Firstly, the INLA approach has to evaluate π(θ_j | y) for each gridded θ_j and then integrate over π(θ | y) to calculate π(z | y). For example, if we take 3 integration points in each dimension, the cost is 3^p to cover all combinations in p dimensions, which is 729 for p = 6 and 59,049 for p = 10. Hence, INLA based inference is available only for low dimensional θ with coarse grids (see Rue et al. (2009), Illian et al. (2012), and Simpson et al. (2016)). In practice, since the number of hyperparameters of the covariance function of a spatial Gaussian process and of the coefficients of the covariates of interest is often not large, e.g., less than 10, this approach still works. However, many point pattern applications involve more complex covariance specifications, e.g., nonstationary, nonseparable space-time, or multivariate spatial structure, as well as a larger number of spatial covariates of interest. Hence, the low dimensionality assumption can be a serious bottleneck for the INLA approach. Furthermore, although some sophisticated variations of the Gaussian approximation have been investigated (Martins et al. (2013)), the Gaussian approximation to π(z | θ, y) might not be accurate enough. For the LGCP, the likelihood is a point process likelihood, approximated by a Poisson distribution on each grid cell; when the number of counts in a grid cell is very small, the Gaussian approximation can be inaccurate.

3 Approximate Marginal Posterior for Log Gaussian Cox Processes

Both the INLA and MALA approaches have limitations, especially for high dimensional Gaussian latent variables or a large number of parameters, e.g., over 10 dimensions. In this paper, we propose a different computational scheme based on the pseudo-marginal MCMC of Andrieu and Roberts (2009). Since this scheme is specific to the LGCP, we assume y = S in the discussion below, i.e., the observed dataset is a point pattern. Our basic strategy is similar to the INLA approach, i.e., we consider efficient and accurate Bayesian inference for an approximate marginal posterior distribution of θ, π(θ | S). This approximate marginal posterior distribution is constructed in a different way from INLA and is estimated within an MCMC framework. After obtaining this approximate marginal posterior, we can calculate the marginal posterior of the Gaussian latent processes as

\pi(z \mid S) \approx \sum_{i=i_0+1}^{I+i_0} \pi(z \mid \theta^{(i)}, S)\, \pi(\theta^{(i)} \mid S)    (7)

where I is the number of approximate marginal posterior samples and i_0 is the end point of the burn-in period. Given the preserved {θ^(i)}, i = i_0 + 1, ..., I + i_0, we can estimate π(z | θ^(i), S) at each θ^(i). Since we can sample from π(z | θ^(i), S) at fixed θ^(i), the calculation of the inverse or Cholesky decomposition of the covariance matrices can be parallelized across the different θ^(i). Hence, the computationally heavy iteration of these calculations through MCMC is not required. Furthermore, since θ^(i) is fixed, z converges fast to π(z | θ^(i), S). The posterior samples of z are obtained by elliptical slice sampling for the LGCP (Leininger and Gelfand (2016)) or MALA (Møller et al. (1998) and Diggle et al. (2013)). This step can be implemented with parallel computation schemes, since we only need to sample z | S, θ^(i) at fixed θ^(i). Hence, the question becomes: how can we accurately estimate π(θ | S)? We consider implementing the pseudo-marginal approach proposed by Andrieu and Roberts (2009) for the LGCP.

3.1 Pseudo-Marginal for Exact MCMC

Andrieu and Roberts (2009) propose the pseudo-marginal approach. This approach enables us to sample the posterior marginal of parameters efficiently when latent variables exist. The key point is to construct an unbiased estimate of the marginal likelihood, with the latent variables integrated out, and to put this estimate into the acceptance ratio,

i.e.,

\alpha = \min\left\{ 1, \frac{\pi(\theta^{*})\, \hat{\pi}(S \mid \theta^{*})\, q(\theta \mid \theta^{*})}{\pi(\theta)\, \hat{\pi}(S \mid \theta)\, q(\theta^{*} \mid \theta)} \right\}    (8)

If u < α, with u drawn from U(0, 1), we preserve θ* and the estimate of π(S | θ*). Surprisingly, convergence to π(θ | S) is guaranteed as long as the estimate is unbiased for π(S | θ). Andrieu and Roberts (2009) verify the uniform ergodicity of the algorithm. The efficiency depends on the variance of the estimate: when it is noisy, the samples produced with the above acceptance ratio will be highly correlated. So, the primary task for the pseudo-marginal approach is to construct an unbiased estimate with small variance. The straightforward approach is importance sampling. By the theory of importance sampling (e.g., Robert and Casella (2004)), a well-chosen importance density yields an unbiased estimate with smaller variance than the direct Monte Carlo estimate. Filippone and Girolami (2014) implemented the pseudo-marginal approach for estimating hyperparameters of GPs. They utilize the Laplace approximation (Tierney and Kadane (1986) and Barndorff-Nielsen and Cox (1989)) and expectation propagation (Minka (2001)) as importance densities to construct the unbiased estimate. For the LGCP, we need to take a grid approximation over the study region. For an accurate implementation, sufficiently fine grids are required to approximate the infinite dimensional stochastic integral. However, importance sampling in high dimensions is not promising. Let B = (B_1, ..., B_M) be M disjoint subregions of D, and let T(S) = (T_1(S), ..., T_M(S)) and δ = (δ_1, ..., δ_M) be the counts and intensities on the subregions B_m for m = 1, ..., M. The basic strategy is (1) divide the study region into the subregions B, (2) take the counts on the subregions as count summary statistics T, and (3) construct a likelihood for the vector of count summary statistics given θ. For the third step, we can utilize the multivariate Poisson log normal (mPLN) kernel function (Aitchison and Ho (1989)). Thus, "approximate" marginal posterior refers to the grid approximation of the study region; the quantity we target is

\tilde{\pi}(\theta \mid S) = \pi(\theta \mid T(S))    (9)

A direct implementation of the pseudo-marginal approach would require a high dimensional grid approximation over the study region D, i.e., integrating out a large M-dimensional vector of Gaussian latent variables. The estimator is given by

\hat{\pi}(T(S) \mid \theta) = \frac{1}{N_{imp}} \sum_{j=1}^{N_{imp}} \frac{\pi(T(S) \mid \delta_j)\, \pi(\delta_j \mid \theta)}{g(\delta_j \mid T(S), \theta)}, \quad \delta_j \sim g(\delta \mid T(S), \theta)    (10)

where N_imp is the number of samples from the importance density. When M is large, obtaining a low variance estimate is computationally demanding because a very large N_imp is needed.
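The generic pseudo-marginal Metropolis-Hastings step in (8) is sketched below; loglik_hat stands for the log of any unbiased estimate such as (10), and the symmetric random-walk proposal on (possibly transformed) parameters is a simplifying assumption of this sketch, so the q terms cancel.

import numpy as np

def pm_mcmc(theta0, log_prior, loglik_hat, n_iter, step, rng):
    """Pseudo-marginal random walk Metropolis.

    loglik_hat(theta) returns the log of an unbiased estimate of pi(T(S) | theta);
    the current estimate is recycled until the next acceptance, as required by PM-MCMC.
    """
    theta = np.asarray(theta0, dtype=float)
    cur_ll = loglik_hat(theta)
    draws = []
    for _ in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.shape)
        prop_ll = loglik_hat(prop)                         # fresh unbiased estimate
        log_alpha = (log_prior(prop) + prop_ll) - (log_prior(theta) + cur_ll)
        if np.log(rng.uniform()) < log_alpha:              # accept: keep the estimate too
            theta, cur_ll = prop, prop_ll
        draws.append(theta.copy())
    return np.array(draws)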

The straightforward implementation requires a large M because it assumes a homogeneous Poisson process in each grid cell, as in standard MCMC and INLA approaches, with the first and second order moments evaluated at and between representative points of the grid cells. In our setting, on the other hand, we can avoid the high dimensional integration by utilizing the first and second moment equations induced by the general moment formulas for Cox processes. The key point is that even if we keep M relatively low dimensional, we can calculate the exact first and second order moments from the general moment formulas for Cox processes. These moment formulas are based on the integration of the intensity, and of its product with the pair correlation function, over the subregions. So, although we take a grid approximation as in the straightforward implementation, the first and second order moments of T are induced by the exact moments of the LGCP. These exact moments eliminate the biases caused by the grid approximation (homogeneous Poisson on each grid cell) in INLA and MCMC based algorithms.

3.2 Kernel Mixture Marginalization

We consider a kernel mixture marginalization for the density of the summary statistics vector, i.e.,

\pi(T(S) \mid \theta) = \int \prod_{m=1}^{M} P(T_m(S) \mid \delta_m)\, \pi(\delta \mid \theta)\, d\delta    (11)

where π(δ | θ) is the prior distribution of the intensity vector. This mixture representation incorporates the correlation structure of the counts into the intensity distribution; given the intensities, the counts in different cells are independent. From the moment formulas for general Cox processes, we can obtain the first and second moments of the marginal count summary vector given θ, i.e., α_{θ,m} = E[T_m(S) | θ] and β_{θ,mn} = Cov[T_m(S), T_n(S) | θ] for m, n = 1, ..., M:

\alpha_{\theta,m} = \int_{B_m} \lambda_\theta(u)\, du    (12)

\beta_{\theta,mn} = \int_{B_m \cap B_n} \lambda_\theta(u)\, du + \int_{B_m} \int_{B_n} \lambda_\theta(u)\, \lambda_\theta(v)\, \{g_\theta(u, v) - 1\}\, du\, dv    (13)

\lambda_\theta(s) = E_z[\lambda(s \mid \theta, z(s))]    (14)

where g_θ(u, v) is the pair correlation function of the latent process, which can have an anisotropic and nonstationary expression. Since B_m and B_n are disjoint, the first integral in (13) equals the integral of λ_θ over B_m when m = n and is 0 otherwise. λ_θ(s) = E_z[λ(s | θ, z(s))] is the intensity expected with respect to z.

Although the marginal likelihood π(T | θ) is not analytically available, the first and second order moments can be calculated from the above formulas under any latent process assumption. In practice, these moments can be accurately calculated by grid approximation, i.e.,

\hat{\alpha}_{\theta,m} = \sum_{b=1}^{N_{B_m}} \lambda_\theta(u_b)\, \Delta_{B_m}    (15)

\hat{\beta}_{\theta,mn} = \sum_{b=1}^{N_{B_m \cap B_n}} \lambda_\theta(u_b)\, \Delta_{B_m \cap B_n} + \sum_{b=1}^{N_{B_m}} \sum_{b'=1}^{N_{B_n}} \lambda_\theta(u_b)\, \lambda_\theta(v_{b'})\, \{g_\theta(u_b, v_{b'}) - 1\}\, \Delta_{B_m}\, \Delta_{B_n}    (16)

where N_{B_m} and Δ_{B_m} are the number of grid cells and the area of a unit grid cell within B_m; both can differ among the B_m. For simplicity, we assume N_{B_m} = N_B and Δ_{B_m} = Δ_B for all m. Importantly, while B is the M-dimensional collection of disjoint subregions over the study region used to construct T, the moment calculation above is implemented by taking a further, finer grid within each B_m. That is, we take the M-dimensional count summary statistics vector over the M disjoint subregions at the first stage; then we calculate the first and second order moments of T | θ on B directly through the moment formulas above. Although an analytical expression for π(T(S) | θ) is not available, the first and second moments of T(S) | θ are available. We utilize this moment information to construct the unbiased estimator of π(T(S) | θ). In the discussion below, we introduce the mPLN kernel, which is an appropriate kernel function for the LGCP for constructing the unbiased estimator.
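Before turning to that kernel, the sketch below computes the grid-approximated moments (15) and (16) for a stationary LGCP, assuming an exponential covariance so that g_θ(u, v) = exp(σ² e^{−φ||u−v||}) and λ_θ(s) = exp(X(s)β + σ²/2); the regular fine grid, the covariance choice, and the function name lgcp_count_moments are illustrative assumptions of this sketch.

import numpy as np

def lgcp_count_moments(centroids, subregion_id, cell_area, log_lam_bar, sigma2, phi):
    """Grid approximation of E[T_m | theta] and Cov[T_m, T_n | theta], eqs. (15)-(16).

    centroids: (K, 2) fine-grid cell centroids covering the study region
    subregion_id: (K,) integer index in {0, ..., M-1} mapping each fine cell to its B_m
    cell_area: area of one fine grid cell (Delta_B)
    log_lam_bar: (K,) log expected intensity, e.g. X @ beta + 0.5 * sigma2
    sigma2, phi: variance and decay of the exponential covariance
    """
    K = len(log_lam_bar)
    M = int(subregion_id.max()) + 1
    lam = np.exp(log_lam_bar) * cell_area                  # lambda_theta(u_b) * Delta_B
    alpha = np.bincount(subregion_id, weights=lam, minlength=M)   # eq. (15)

    # pair correlation minus one, g(u, v) - 1 = exp(C(u, v)) - 1
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    g_minus_1 = np.exp(sigma2 * np.exp(-phi * d)) - 1.0

    # aggregate lam_b * lam_b' * (g - 1) over fine cells b in B_m, b' in B_n (eq. (16))
    A = np.zeros((M, K))
    A[subregion_id, np.arange(K)] = 1.0                    # subregion indicator matrix
    beta = A @ (np.outer(lam, lam) * g_minus_1) @ A.T
    beta[np.diag_indices(M)] += alpha                      # first (diagonal) term of (13)
    return alpha, beta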

Multivariate Poisson log normal kernel

Using the first and second moments of the count summary statistics, we introduce the mPLN kernel function defined as

\pi(T(S) \mid \theta) = \int \prod_{m=1}^{M} P(T_m(S) \mid \delta_m)\, LN(\delta \mid \mu_\theta, \Sigma_\theta)\, d\delta    (17)

where μ_θ and Σ_θ are the mean vector and covariance matrix of log(δ). In this approach, the latent intensity parameter δ is introduced and the marginal correlation structure is incorporated through the log normal distribution of δ. The marginal means and covariances of T(S) | θ (see Aitchison and Ho (1989)) are

\alpha_{\theta,m} = \exp\left( \mu_{\theta,m} + \frac{\sigma_{\theta,mm}}{2} \right), \quad m = 1, ..., M    (18)

\beta_{\theta,mm} = \alpha_{\theta,m} + \alpha_{\theta,m}^2 \{\exp(\sigma_{\theta,mm}) - 1\}, \quad m = 1, ..., M    (19)

\beta_{\theta,mn} = \alpha_{\theta,m}\, \alpha_{\theta,n} \{\exp(\sigma_{\theta,mn}) - 1\}, \quad m \neq n = 1, ..., M    (20)

where σ_{θ,mn} is the (m, n) element of Σ_θ. Since, through the moment formulas, we can calculate the marginal mean and covariance E[T_m(S) | θ] and Cov[T_m(S), T_n(S) | θ] for m, n = 1, ..., M given θ, we can also calculate μ_{θ,m} and σ_{θ,mn} for m, n = 1, ..., M. Hence, μ_θ and Σ_θ are induced from α_θ and β_θ as

\mu_{\theta,m} = \log(\alpha_{\theta,m}) - \frac{\sigma_{\theta,mm}}{2}    (21)

\sigma_{\theta,mm} = \log\left( 1 + \frac{\beta_{\theta,mm} - \alpha_{\theta,m}}{\alpha_{\theta,m}^2} \right)    (22)

\sigma_{\theta,mn} = \log\left( 1 + \frac{\beta_{\theta,mn}}{\alpha_{\theta,m}\, \alpha_{\theta,n}} \right)    (23)

Since β_{θ,mn} for n ≠ m can be positive or negative, we can incorporate both positive and negative correlation among the counts. However, this specification expresses only overdispersion, because the marginal variance β_{θ,mm} has to be larger than α_{θ,m} to satisfy σ_{θ,mm} > 0. The total number of parameters in μ_θ and Σ_θ is M(M + 3)/2, which is the same number as in α_θ and β_θ. Hence, the matching between (α_θ, β_θ) and (μ_θ, Σ_θ) is one to one. We also note that the λ_θ introduced above is not the intensity including the Gaussian latent variables z, because the Gaussian latent variables are integrated out; this expectation is analytically available since E[exp(z(s))] = exp(σ²/2) (Møller et al. (1998)). The pair correlation function for the LGCP is g_θ(u, v) = exp(C_ζ(u, v)), where C_ζ(u, v) is the covariance function of the GP between u and v, which can be anisotropic and nonstationary. Since the expressions for α_θ and β_θ are exact, the induced moments μ_θ and Σ_θ are also the exact mean and covariance of log δ. The main purpose above is to induce μ_θ and Σ_θ from α_θ and β_θ: we first calculate α_θ and β_θ, which depend on θ, and then transform these values into μ_θ and Σ_θ.

3.3 Algorithm

The algorithm is composed of two main steps. We call the algorithm below the approximate marginal posterior (AMP) approach. If the marginal posterior distribution of z is not of interest, the second step can be skipped.

Estimate π(θ | S)

1. Let i = 1 and set an initial value θ^(0).

2. Generate θ* ~ q(θ* | θ^(i−1)), calculate the moments E[T(S) | θ*] and Cov[T(S) | θ*], and convert these moment vectors into μ_{θ*} and Σ_{θ*}.

3. Calculate the Laplace approximation g(δ | T(S), θ*) of π(T(S) | δ) π(δ | θ*).

4. Estimate π(T(S) | θ*) as

\hat{\pi}(T(S) \mid \theta^{*}) = \frac{1}{N_{imp}} \sum_{j=1}^{N_{imp}} \frac{\pi(T(S) \mid \delta_j)\, \pi(\delta_j \mid \theta^{*})}{g(\delta_j \mid T(S), \theta^{*})}    (24)

where g(δ_j | T(S), θ*) is the Laplace approximation of π(T(S) | δ) π(δ | θ*) evaluated at δ_j, and δ_j ~ g(δ | T(S), θ*) for j = 1, ..., N_imp.

5. Evaluate the acceptance ratio

\alpha = \min\left\{ 1, \frac{\pi(\theta^{*})\, \hat{\pi}(T(S) \mid \theta^{*})\, q(\theta^{(i-1)} \mid \theta^{*})}{\pi(\theta^{(i-1)})\, \hat{\pi}(T(S) \mid \theta^{(i-1)})\, q(\theta^{*} \mid \theta^{(i-1)})} \right\}    (25)

and preserve θ^(i) = θ* and the corresponding estimate of π(T(S) | θ^(i)) if u < α, where u ~ U(0, 1); otherwise set θ^(i) = θ^(i−1) and keep the previous estimate. Return to step 2 with i ← i + 1.

Importantly, although our algorithm estimates π(θ | S) only approximately, in the sense that it targets π(θ | T(S)), pseudo-marginal MCMC enables us to estimate π(θ | T(S)) exactly. Although the posterior variance depends on the dimension of T(S), the algorithm recovers the marginal posterior mode even with moderate dimensional M. This is a difference between the AMP approach and grid approximation based approaches such as INLA and MALA: those algorithms assume a homogeneous Poisson process on each grid cell, and so require high dimensional (sufficiently fine) grids for accurate inference. On the other hand, since our approach calculates the exact moments accurately enough for each subregion (α_θ and β_θ), the marginal posterior modes of the parameters should be contained in the approximate marginal posterior credible intervals. When the estimator is heavy-tailed, it is difficult to accept a move away from a large value of the estimate, and the Markov chain can stop moving for a long time. The efficiency of the algorithm depends on the variance of the estimate; we discuss its construction in more detail later.

Estimate π(z | θ, S) and π(z | S) (optional step)

1. Given θ^(i) for i = i_0 + 1, ..., I + i_0, where i_0 is the end point of the burn-in period and I is the number of preserved approximate marginal posterior samples, estimate π(z | θ^(i), S).

2. Calculate π(z | S) as

\pi(z \mid S) \approx \sum_{i=i_0+1}^{I+i_0} \pi(z \mid \theta^{(i)}, S)\, \pi(\theta^{(i)} \mid S)    (26)
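Returning to the first stage, a minimal sketch of steps 2-4 for a single proposed θ is given below. It assumes the counts T_m given δ_m are Poisson with mean δ_m, takes the Laplace approximation on the scale η = log δ (where the change-of-variables Jacobians cancel in the importance ratio), and uses a plain Newton iteration for the mode; the function names mpln_moment_match and amp_loglik_hat and these implementation details are ours, not the paper's.

import numpy as np
from scipy.stats import multivariate_normal, poisson

def mpln_moment_match(alpha, beta):
    """Convert count moments (alpha, beta) into (mu, Sigma) of log(delta), eqs. (21)-(23)."""
    sig_diag = np.log(1.0 + (np.diag(beta) - alpha) / alpha**2)
    Sigma = np.log(1.0 + beta / np.outer(alpha, alpha))
    np.fill_diagonal(Sigma, sig_diag)
    mu = np.log(alpha) - 0.5 * sig_diag
    return mu, Sigma

def amp_loglik_hat(counts, mu, Sigma, n_imp, rng):
    """Log of the importance-sampling estimate (24) with a Laplace importance density."""
    Q = np.linalg.inv(Sigma)
    eta = mu.copy()
    for _ in range(50):                                    # Newton iterations for the mode
        grad = counts - np.exp(eta) - Q @ (eta - mu)
        hess = -np.diag(np.exp(eta)) - Q
        step = np.linalg.solve(hess, grad)
        eta = eta - step
        if np.max(np.abs(step)) < 1e-8:
            break
    cov_g = np.linalg.inv(-hess)                           # Gaussian importance covariance

    etas = rng.multivariate_normal(eta, cov_g, size=n_imp)
    log_pois = poisson.logpmf(counts, np.exp(etas)).sum(axis=1)
    log_prior = multivariate_normal.logpdf(etas, mean=mu, cov=Sigma)
    log_g = multivariate_normal.logpdf(etas, mean=eta, cov=cov_g)
    log_w = log_pois + log_prior - log_g
    m = log_w.max()                                        # log-sum-exp for stability
    return m + np.log(np.mean(np.exp(log_w - m)))

Given the moment calculation sketched earlier, this estimate can be plugged directly into a pseudo-marginal step such as the pm_mcmc sketch above to complete the first stage.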

There are three important points here. First, since we preserve {θ^(i)}, i = i_0 + 1, ..., I + i_0, through the first step, we can calculate π(z | θ^(i), S) separately for each θ^(i). The computational bottleneck for sampling the LGCP is the repeated Cholesky decomposition or inverse calculation, within the MCMC iterations, of a large covariance matrix that depends on θ. However, since we preserve the approximate marginal posterior samples {θ^(i)} through the first step, we need only a single Cholesky decomposition or inverse calculation of the large covariance matrix for each θ^(i). These calculations can be parallelized without passing information. Second, π(z | θ^(i), S) does not necessarily need an approximation for sampling the GPs, e.g., nearest-neighbor GPs (Datta et al. (2015)) or multi-resolution GPs (Katzfuss (2016)). Since repeated calculation of the inverse or Cholesky decomposition is not required, we can handle relatively large covariance matrices without approximation. Given fixed θ^(i), convergence of π(z | θ^(i), S) is dramatically fast, e.g., with elliptical slice sampling (see Murray et al. (2010) and Leininger and Gelfand (2016)), even without the fine tuning required by Hamiltonian Monte Carlo or MALA. Finally, the AMP approach does not require a grid approximation over θ; this is a serious bottleneck when extending the INLA approach to relatively large dimensional θ, e.g., over 10 dimensions. Although the approximate marginal posterior is still an approximation of π(θ | S), in the sense that it equals π(θ | T(S)), the marginal posterior mode is well estimated, as shown in the simulation studies. The AMP approach is potentially advantageous for larger dimensional θ than the INLA approach.
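For the second stage, a minimal sketch of elliptical slice sampling (Murray et al. (2010)) for z given a preserved θ^(i) is shown below; the Cholesky factor of the prior covariance is computed once per θ^(i), and the particular log-likelihood function supplied is an assumption of this sketch (for example, the earlier lgcp_grid_loglik wrapped in a closure over the gridded counts and covariates).

import numpy as np

def ess_update(z, chol_prior, log_lik, rng):
    """One elliptical slice sampling update of z ~ N(0, Sigma) given the data.

    chol_prior: lower-triangular Cholesky factor of Sigma (computed once per theta)
    log_lik(z): log-likelihood of the point pattern given z (e.g., the gridded LGCP likelihood)
    """
    nu = chol_prior @ rng.standard_normal(z.shape)         # auxiliary draw from the prior
    log_y = log_lik(z) + np.log(rng.uniform())             # slice threshold
    angle = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = angle - 2.0 * np.pi, angle
    while True:
        z_new = z * np.cos(angle) + nu * np.sin(angle)     # point on the ellipse
        if log_lik(z_new) > log_y:
            return z_new
        if angle < 0.0:                                    # shrink the bracket and retry
            lo = angle
        else:
            hi = angle
        angle = rng.uniform(lo, hi)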

3.4 Construction of Unbiased Estimator

The efficiency and accuracy of our algorithm depend on the construction of the unbiased estimator: when its variance is small, the PM-MCMC converges efficiently. By the theory of importance sampling (Robert and Casella (2004)), importance sampling provides an unbiased estimator with smaller variance than the plain Monte Carlo estimator. Hence, we take importance sampling as the default choice for constructing the unbiased estimator. One question is how large the number of importance particles, N_imp, should be. Doucet et al. (2015) suggest that N_imp be chosen so that the variance of the estimator of the log likelihood is 1. An alternative approach is expectation propagation (EP, Minka (2001)). Filippone and Girolami (2014) implemented EP in addition to the Laplace approximation and suggested that EP might be more robust to the choice of N_imp. However, in this paper we take the Laplace approximation as the importance density because of its connection to the INLA discussion; EP is also directly applicable. Although the INLA approach requires fine grids over the study region, some grid cells contain only a small number of points. Since the Laplace approximation for small counts deviates from a Gaussian distribution, a skewness corrected method is considered in the INLA approach (Martins et al. (2013)). On the other hand, the AMP approach can have a relatively large number of counts in each subregion because M is lower dimensional than in the INLA approach. The posterior distribution of log δ given T(S) is then closer to a Gaussian distribution, so the Laplace approximation of the posterior distribution of log δ is a promising option, especially when a large number of points is observed.

3.5 Computational Costs and Tuning Parameters

There are three main computational costs in implementing the AMP algorithm. Firstly, the first and second order moments of T(S) | θ, i.e., α_θ and β_θ, have to be calculated for each proposed θ. In particular, the calculation of β_θ can be time consuming because the number of components is M(M + 1)/2, so the straightforward computational cost of the moment calculation is O(N_B M^2). On the other hand, these moments are deterministic functions of the parameters. For calculating them, modern distributed computational tools, e.g., graphical processing units (GPU), are available. Given θ, each first and second order moment is obtained independently, without passing information. Hence, this computation can be reduced to O(N_B) and need not be a bottleneck. Secondly, we need to generate N_imp samples from the importance density, and so decide N_imp. Doucet et al. (2015) suggest that N_imp be chosen so that the variance of the log likelihood is 1, under the assumption that the additive noise of the log likelihood estimator is Gaussian with variance inversely proportional to the number of samples and independent of the parameter value at which it is evaluated. Since the AMP algorithm does not require a large M, N_imp need not be very large. When M is relatively small and a large number of points is observed, the Laplace approximation of the posterior distribution of log δ is close to a Gaussian distribution; then the importance density is close to the posterior density of log δ, and N_imp is not required to be large. Finally, the inverse and Cholesky decomposition of an M × M matrix are necessary for generating samples from, and evaluating, the importance density. Although our approach can keep M relatively low dimensional, larger M provides more information about π(θ | S). Hence, the main computational cost of the AMP algorithm is O(M^3). In practice, we suggest that M be kept moderate, with N_B adjusted so that N_B × M is close to the number of grid cells needed in the INLA approach or standard MCMC approaches, e.g., MALA, elliptical slice sampling, and Hamiltonian Monte Carlo.

3.6 Some Extensions

Nonstationary

A variety of approaches have been developed for nonstationary spatial processes. One parametric family of nonstationary covariance functions was proposed by Paciorek and Schervish (2006). An alternative direction is the deformation approach proposed by Sampson and Guttorp (1992), which transforms a stationary random field into a nonstationary one by deforming the space; Bayesian extensions have also been proposed (Damian et al. (2001), Schmidt and O'Hagan (2003), Schmidt et al. (2011) and Bornn et al. (2012)). Another approach is kernel convolution (Higdon (1998)). The AMP approach is easily extended to the nonstationary covariance case as long as an analytical expression of the covariance function is available; the pair correlation function, which is the exponential of the covariance function, is then also analytically available. The moment formulas themselves apply generally, even in anisotropic and nonstationary cases. Hence, as long as the pair correlation function is analytically available, the AMP approach extends to the nonstationary case. When an analytical expression is not available, we approximate the pair correlation by a further grid approximation. For example, in the kernel convolution approach the covariance function is expressed as a convolution of kernel functions; by taking a grid approximation of the convolution, we can still implement the AMP approach. Fortunately, the calculation of the moments and of the pair correlation function can be parallelized without passing information. The AMP approach is thus flexibly available for the nonstationary case.

Multivariate

Møller et al. (1998) suggest the multivariate extension of the LGCP. Brix and Møller (2001) consider common latent process and independent process specifications for the bivariate LGCP. Waagepetersen et al. (2016) propose a factor type specification for the common latent process, which is a promising direction for higher dimensional LGCPs; their estimation strategies are based on minimum contrast estimators with respect to the pair correlation function of the multivariate LGCP. Since the Bayesian computational cost for these processes is huge, we rarely find preceding literature on Bayesian inference for the multivariate LGCP except in the bivariate case. For the multivariate extension, specification of a cross covariance function for the LGCP is required (see Genton and Kleiber (2015)). Let L be the dimension of the point pattern. The simplest form of cross covariance function is the separable form

C_{ll'}(s_1, s_2) = \rho(s_1, s_2)\, a_{ll'}, \quad s_1, s_2 \in \mathbb{R}^2,    (27)

for all l, l' = 1, ..., L, where ρ(s_1, s_2) is a valid stationary or nonstationary correlation function and a_{ll'} = cov(Z_l, Z_{l'}) is the nonspatial covariance between variables

l and l'. An alternative choice is the linear model of coregionalization (LMC, see Schmidt and Gelfand (2003) and Banerjee et al. (2014)). It represents a multivariate random field as a linear combination of H < L independent univariate random fields. The resulting cross covariance functions take the form

C_{ll'}(u, v) = \sum_{h=1}^{H} \rho_h(u, v)\, a_{lh}\, a_{l'h}    (28)

where A is the L × H matrix whose (i, j) component is a_{ij}. Then, we can define the l-th Gaussian component inside the intensity function at location s as z_l(s) = \sum_{h=1}^{H} a_{lh} v_h(s) for l = 1, ..., L, where v_1, ..., v_H are mean 0, variance 1 Gaussian processes with spatial correlations ρ_h(·). The cross pair correlation function between the l-th and l'-th components is g_θ^{ll'}(u, v) = exp( \sum_{h=1}^{H} ρ_h(u, v) a_{lh} a_{l'h} ). Furthermore, β_{θ,mn}^{ll'} = Cov[T_m^l, T_n^{l'} | θ], i.e., the covariance between the counts of the l-th and l'-th components on B_m and B_n, is

\beta_{\theta,mn}^{ll'} = 1(l = l') \int_{B_m \cap B_n} \lambda_\theta^l(u)\, du + \int_{B_m} \int_{B_n} \lambda_\theta^l(u)\, \lambda_\theta^{l'}(v)\, \{g_\theta^{ll'}(u, v) - 1\}\, du\, dv,    (29)

where λ_θ^l(·) and λ_θ^{l'}(·) are the intensities of the l-th and l'-th components. Waagepetersen et al. (2016) consider a factor type specification, a special case of the LMC, as the cross covariance specification for the multivariate LGCP. Importantly, the number of parameters would be larger than 20 in their multivariate LGCP; for such approaches, the INLA approach is difficult to implement. The computational cost required for the multivariate extension is O((M L)^3). Although our algorithm does not require a large M, its computational cost would be huge when L is large, e.g., over 30. On the other hand, if we assume independent LGCPs for each component, the computational cost remains O(M^3). As we discussed, there are two main computational tasks: (1) the moment calculation and (2) the Laplace approximation. When we consider convolution based cross covariances (Ver Hoef and Barry (1998), Ver Hoef et al. (2004) and Majumdar and Gelfand (2007)), the pair correlation function is not analytically available; an additional grid approximation of the convolutional expression of the pair correlation function is then required. However, this step mainly concerns the moment calculation part, which can be parallelized without passing information. Hence, our approach is potentially available for these convolution based cross covariance cases, too.
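As a small illustration of the LMC specification in (28), the sketch below evaluates the cross pair correlation matrix g_θ^{ll'}(u, v) for a pair of locations; the exponential choice for the ρ_h and the coregionalization matrix A are illustrative assumptions.

import numpy as np

def lmc_cross_pair_correlation(u, v, A, phis):
    """Cross pair correlation matrix {g^{ll'}(u, v)} under an LMC, eq. (28).

    u, v: two locations, arrays of shape (2,)
    A: (L, H) coregionalization matrix
    phis: (H,) decay parameters of the exponential correlations rho_h
    """
    d = np.linalg.norm(u - v)
    rho = np.exp(-phis * d)                    # rho_h(u, v) for each latent factor h
    cross_cov = (A * rho) @ A.T                # sum_h rho_h * a_{lh} * a_{l'h}
    return np.exp(cross_cov)                   # g^{ll'}(u, v) = exp(C_{ll'}(u, v))

The cross-count covariances in (29) then follow by the same fine-grid aggregation used for the univariate moments, with this matrix replacing the univariate pair correlation.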

Space and time

Space and time LGCP processes, proposed by Brix and Diggle (2001), have arguably attracted more interest than the multivariate LGCP. Shirota and Gelfand (2015) applied separable and nonseparable space and circular time LGCPs to crime event datasets. Let M_s and M_t be the numbers of grid cells for space and time; the computational cost for a nonseparable space-time covariance function is O((M_s M_t)^3). Hence, we need to keep both M_s and M_t small for fast computation. A more practical strategy is to consider a sparse covariance specification with respect to time, which would reduce the computational time.

4 Model Validation for Log Gaussian Cox Processes

Our specification enables us to estimate π(θ | T(S)). Given approximate marginal posterior samples of θ | T(S), we can implement model comparison. Since sampling the posterior marginal of z is an optional step in our algorithm, we propose two model comparison strategies.

p-thinning cross validation

The first is to implement the p-thinning cross validation approach proposed by Leininger and Gelfand (2016). Due to the conditional independence property of Cox processes, we can obtain the posterior predictive intensity surface for the test dataset from the training dataset. For this approach, we sample z from the marginal posterior in the second step of the algorithm. Let p denote the retention probability, i.e., we delete s_i ∈ S with probability 1 − p. This produces a training point pattern S_train and a test point pattern S_test which are independent conditional on λ(s). In particular, S_train has intensity λ_train(s) = pλ(s). We set p = 0.5 and estimate λ_train(s) for s ∈ S_train. Then, we convert the posterior draws of λ_train(s) into λ_test(s) using λ_test(s) = ((1 − p)/p) λ_train(s). Let {Q_k} be a collection of subsets of D. After fitting the model to obtain λ_test, the posterior predictive distribution of N(Q_k) is available. For the choice of {Q_k}, Leininger and Gelfand (2016) suggest drawing random subsets of the same size uniformly over D, i.e., if the area of each Q_k is q|D|, then q gives the relative size of each Q_k. Next, we calculate the predictive residuals in each subset; they argue that making the subsets disjoint is time consuming and unnecessary. Based on p-thinning cross validation, we consider two model performance criteria: (1) predictive interval coverage (PIC) and (2) the ranked probability score (RPS, Gneiting and Raftery (2007)). PIC offers an assessment of model adequacy; RPS enables model comparison. After the model is fitted to S_train, the posterior predictive intensity function can supply posterior predictive point patterns and therefore samples from the posterior predictive distribution of N(Q_k).
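Concretely, the p-thinning split described at the start of this subsection can be implemented as follows; the point pattern is assumed to be stored as an (n, 2) array of coordinates.

import numpy as np

def p_thin(points, p, rng):
    """Split a point pattern into training and test patterns by p-thinning.

    Each point is retained in the training set independently with probability p,
    so S_train and S_test are independent given the intensity, with
    lambda_train = p * lambda and lambda_test = (1 - p) * lambda.
    """
    keep = rng.uniform(size=len(points)) < p
    return points[keep], points[~keep]

# Example with p = 0.5, followed by rescaling a training-intensity draw to the test scale:
# rng = np.random.default_rng(0)
# S_train, S_test = p_thin(S, p=0.5, rng=rng)
# lam_test = (1 - 0.5) / 0.5 * lam_train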

For the i-th posterior sample, i = i_0 + 1, ..., I + i_0, the associated predictive residual is defined as R^pred_(i)(Q_k) = N_test(Q_k) − N^(i)(Q_k), where N_test(Q_k) is the number of points of the test data in Q_k. If the model is adequate, the empirical predictive interval coverage rate, i.e., the proportion of intervals which contain 0, is expected to be roughly the nominal level of coverage; below, we choose 90% nominal coverage. Empirical coverage much less than the nominal level suggests model inadequacy: the predictive intervals are too optimistic. Empirical coverage much above, for example 100%, is also undesirable; it suggests that the model is overfitting, introducing more uncertainty than needed. Gneiting and Raftery (2007) propose the ranked probability score (RPS). This score is derived as a proper scoring rule and provides a criterion for assessing the precision of a predictive distribution. That is, we seek to compare a predictive distribution to an observed count; intuitively, a good model will provide a predictive distribution that is concentrated around the observed count. While the RPS has a challenging formal computational form, it is directly amenable to Monte Carlo integration. In particular, for a given Q_k, we calculate the RPS as

RPS(F, N_{test}(Q_k)) = \frac{1}{I} \sum_{i=i_0+1}^{I+i_0} \left| N^{(i)}(Q_k) - N_{test}(Q_k) \right| - \frac{1}{2I^2} \sum_{i=i_0+1}^{I+i_0} \sum_{i'=i_0+1}^{I+i_0} \left| N^{(i)}(Q_k) - N^{(i')}(Q_k) \right|

Summing over the collection of Q_k gives a model comparison criterion; smaller values of the sum are preferred.

Posterior functional summary statistics

An alternative approach is to compare posterior functional summary statistics. This approach does not require sampling z | S, θ, but it does require simulating a point pattern S^(i) for each θ^(i). Although a direct comparison of counts on subregions between the observed point pattern S_obs and the simulated point patterns S^(i) would be the straightforward approach, it is not available because point patterns from the LGCP depend heavily on the realization of the Gaussian process. Comparing functional summary statistics is therefore a more promising route. Leininger (2014) discusses Bayesian alternatives of the functional summary statistics.
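The Monte Carlo forms of the two criteria translate directly into code; here pred_counts holds posterior predictive counts N^(i)(Q_k) for a single subset Q_k and is assumed to be precomputed from posterior predictive point patterns.

import numpy as np

def rps(pred_counts, observed):
    """Monte Carlo estimate of the ranked probability score for one subset Q_k."""
    pred_counts = np.asarray(pred_counts, dtype=float)
    term1 = np.mean(np.abs(pred_counts - observed))
    term2 = 0.5 * np.mean(np.abs(pred_counts[:, None] - pred_counts[None, :]))
    return term1 - term2

def pic(pred_counts, observed, level=0.90):
    """Whether the central predictive interval at the given level covers the observed count."""
    lo, hi = np.quantile(pred_counts, [(1 - level) / 2, (1 + level) / 2])
    return lo <= observed <= hi

Averaging pic over the collection {Q_k} gives the empirical coverage rate, and summing rps over {Q_k} gives the model comparison criterion described above.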

5 Simulation Studies

In this section, we investigate two simulation examples: (1) a univariate LGCP and (2) a three dimensional LGCP. The study region is defined as D = [0, 1] × [0, 1].

5.1 Example 1: univariate LGCP

In this first example, we consider the univariate LGCP and investigate the influence of the choice of some tuning parameters. The model we assume is

\lambda(s) = \lambda_0 \exp(\beta_1 s_x + \beta_2 s_y + z(s)), \quad z \sim GP(0, C_\zeta)    (30)

where s = (s_x, s_y), C_ζ(s_1, s_2) = σ² exp(−φ ||s_1 − s_2||) and ζ = (σ², φ). The true parameter values are (λ_0, β_1, β_2, σ²) = (400, 3, 3, 1). We set the decay parameter at three different smoothness levels: (1) φ = 1 (smooth), (2) φ = 3 (moderate) and (3) φ = 5 (rough). We discard the first i_0 = 1,000 samples as the burn-in period and preserve the subsequent I = 5,000 samples as posterior samples. Figure 1 plots the univariate LGCP realizations for the three smoothness levels. The number of simulated points in each case is (1) n = 3474, (2) n = 3744 and (3) n = . We set N_imp = . As one benchmark case, we implement the elliptical slice sampling algorithm with the grid approximated likelihood, i.e.,

L(S \mid \theta) = \exp\left( -\sum_{k=1}^{K} \lambda(u_k \mid \theta, z(u_k))\, \Delta_k \right) \prod_{k=1}^{K} \lambda(u_k \mid \theta, z(u_k))^{n_k}    (31)

where n_k is the number of points in the k-th grid cell and \sum_{k=1}^{K} n_k = n. As our prior choice, we assume λ_0 ~ G(2, 0.01), σ² ~ G(2, 1), β_1, β_2 ~ N(0, 100) and a flat prior with sufficiently large support for φ. We discard the first i_0 = 20,000 samples as the burn-in period and preserve the subsequent I = 10,000 samples as posterior samples. For the sampling of parameters, we implement adaptive MCMC in both cases (see, e.g., Andrieu and Thoms (2008)). Tables 1-3 report the estimation results. We consider different settings for (M, N_B): (a) (100, 36), (b) (100, 1), (c) (25, 144) and (d) (25, 4). For (a) and (c), (M, N_B) are tuned so that M × N_B = 60² = 3600; for (b) and (d), so that M × N_B = 10² = 100. Case (a) represents the setting where M is relatively large and N_B is also large enough for accurate moment evaluation. Case (b) corresponds to the setting where M is relatively large but N_B is too small for accurate moment evaluation. Likewise, (c) corresponds to relatively few subregions with fine moment evaluation, and case (d) represents the setting where both approximations are coarse. The results suggest that the true values are recovered even when M is low dimensional, as long as N_B is sufficiently large.
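For reference, a minimal sketch of simulating a point pattern from the model in (30) on a fine grid is given below, drawing the Gaussian field at cell centroids and then Poisson counts per cell; the grid resolution and the dense Cholesky simulation are illustrative simplifications, not the simulation procedure used for the tables.

import numpy as np

def simulate_lgcp(lam0, beta1, beta2, sigma2, phi, n_grid=40, rng=None):
    """Simulate an LGCP point pattern on [0,1]^2 via a gridded Gaussian field."""
    rng = rng or np.random.default_rng()
    xs = (np.arange(n_grid) + 0.5) / n_grid
    gx, gy = np.meshgrid(xs, xs)
    cells = np.column_stack([gx.ravel(), gy.ravel()])
    area = 1.0 / n_grid**2

    # exponential covariance C(s1, s2) = sigma2 * exp(-phi * ||s1 - s2||)
    d = np.linalg.norm(cells[:, None, :] - cells[None, :, :], axis=-1)
    cov = sigma2 * np.exp(-phi * d) + 1e-10 * np.eye(len(cells))
    z = np.linalg.cholesky(cov) @ rng.standard_normal(len(cells))

    lam = lam0 * np.exp(beta1 * cells[:, 0] + beta2 * cells[:, 1] + z)
    counts = rng.poisson(lam * area)                        # Poisson counts per cell
    # scatter the points uniformly within their cells
    jitter = (rng.uniform(size=(counts.sum(), 2)) - 0.5) / n_grid
    pts = np.repeat(cells, counts, axis=0) + jitter
    return pts, counts.reshape(n_grid, n_grid)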

Table 1: Estimation results for φ = 1

Parameter        True    Mean    Stdev    95% Int            Inef
Elliptical, K = 2500
  λ_0            400                      [91.46, 296.6]     555
  β_1            3                        [3.113, 4.376]     598
  β_2            3                        [2.483, 3.678]     598
  σ²             1                        [0.454, 1.947]     562
  φ              1                        [0.486, 3.024]     653
  σ²φ            1                        [0.891, 1.574]     402
AMP, M = 100, N_B = 36
  λ_0            400                      [57.67, 559.8]     18.0
  β_1            3                        [1.407, 4.459]     14.1
  β_2            3                        [0.769, 3.878]     23.6
  σ²             1                        [0.300, 4.148]     20.9
  φ              1                        [0.179, 3.089]     15.6
  σ²φ            1                        [0.515, 1.441]     15.8
AMP, M = 100, N_B = 1
  λ_0            400                      [62.32, 528.8]     11.7
  β_1            3                        [1.627, 4.256]     11.6
  β_2            3                        [0.712, 3.551]     13.8
  σ²             1                        [0.296, 4.752]     18.9
  φ              1                        [0.098, 1.944]     9.8
  σ²φ            1                        [0.348, 0.801]     12.1
AMP, M = 25, N_B = 144
  λ_0            400                      [50.45, 554.5]     22.3
  β_1            3                        [1.248, 5.015]     26.2
  β_2            3                        [0.620, 4.344]     14.6
  σ²             1                        [0.330, 4.090]     14.0
  φ              1                        [0.228, 4.532]     26.4
  σ²φ            1                        [0.538, 2.879]     31.5
AMP, M = 25, N_B = 4
  λ_0            400                      [52.01, 575.2]     19.8
  β_1            3                        [1.200, 4.922]     20.8
  β_2            3                        [0.560, 4.092]     17.6
  σ²             1                        [0.329, 4.628]     45.0
  φ              1                        [0.160, 3.598]     38.8
  σ²φ            1                        [0.439, 2.034]


More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

On Gaussian Process Models for High-Dimensional Geostatistical Datasets

On Gaussian Process Models for High-Dimensional Geostatistical Datasets On Gaussian Process Models for High-Dimensional Geostatistical Datasets Sudipto Banerjee Joint work with Abhirup Datta, Andrew O. Finley and Alan E. Gelfand University of California, Los Angeles, USA May

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

Estimating the marginal likelihood with Integrated nested Laplace approximation (INLA)

Estimating the marginal likelihood with Integrated nested Laplace approximation (INLA) Estimating the marginal likelihood with Integrated nested Laplace approximation (INLA) arxiv:1611.01450v1 [stat.co] 4 Nov 2016 Aliaksandr Hubin Department of Mathematics, University of Oslo and Geir Storvik

More information

Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox Richard A. Norton, J.

Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox Richard A. Norton, J. Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox fox@physics.otago.ac.nz Richard A. Norton, J. Andrés Christen Topics... Backstory (?) Sampling in linear-gaussian hierarchical

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Kernel Sequential Monte Carlo

Kernel Sequential Monte Carlo Kernel Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) * equal contribution April 25, 2016 1 / 37 Section

More information

Comparing Non-informative Priors for Estimation and Prediction in Spatial Models

Comparing Non-informative Priors for Estimation and Prediction in Spatial Models Environmentrics 00, 1 12 DOI: 10.1002/env.XXXX Comparing Non-informative Priors for Estimation and Prediction in Spatial Models Regina Wu a and Cari G. Kaufman a Summary: Fitting a Bayesian model to spatial

More information

Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models

Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models Christopher Paciorek, Department of Statistics, University

More information

On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

A Note on the comparison of Nearest Neighbor Gaussian Process (NNGP) based models

A Note on the comparison of Nearest Neighbor Gaussian Process (NNGP) based models A Note on the comparison of Nearest Neighbor Gaussian Process (NNGP) based models arxiv:1811.03735v1 [math.st] 9 Nov 2018 Lu Zhang UCLA Department of Biostatistics Lu.Zhang@ucla.edu Sudipto Banerjee UCLA

More information

arxiv: v4 [stat.me] 14 Sep 2015

arxiv: v4 [stat.me] 14 Sep 2015 Does non-stationary spatial data always require non-stationary random fields? Geir-Arne Fuglstad 1, Daniel Simpson 1, Finn Lindgren 2, and Håvard Rue 1 1 Department of Mathematical Sciences, NTNU, Norway

More information

Likelihood-Based Methods

Likelihood-Based Methods Likelihood-Based Methods Handbook of Spatial Statistics, Chapter 4 Susheela Singh September 22, 2016 OVERVIEW INTRODUCTION MAXIMUM LIKELIHOOD ESTIMATION (ML) RESTRICTED MAXIMUM LIKELIHOOD ESTIMATION (REML)

More information

An Additive Gaussian Process Approximation for Large Spatio-Temporal Data

An Additive Gaussian Process Approximation for Large Spatio-Temporal Data An Additive Gaussian Process Approximation for Large Spatio-Temporal Data arxiv:1801.00319v2 [stat.me] 31 Oct 2018 Pulong Ma Statistical and Applied Mathematical Sciences Institute and Duke University

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm 1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable

More information

Riemann Manifold Methods in Bayesian Statistics

Riemann Manifold Methods in Bayesian Statistics Ricardo Ehlers ehlers@icmc.usp.br Applied Maths and Stats University of São Paulo, Brazil Working Group in Statistical Learning University College Dublin September 2015 Bayesian inference is based on Bayes

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 10 Alternatives to Monte Carlo Computation Since about 1990, Markov chain Monte Carlo has been the dominant

More information

Computation fundamentals of discrete GMRF representations of continuous domain spatial models

Computation fundamentals of discrete GMRF representations of continuous domain spatial models Computation fundamentals of discrete GMRF representations of continuous domain spatial models Finn Lindgren September 23 2015 v0.2.2 Abstract The fundamental formulas and algorithms for Bayesian spatial

More information

Integrated Non-Factorized Variational Inference

Integrated Non-Factorized Variational Inference Integrated Non-Factorized Variational Inference Shaobo Han, Xuejun Liao and Lawrence Carin Duke University February 27, 2014 S. Han et al. Integrated Non-Factorized Variational Inference February 27, 2014

More information

Inexact approximations for doubly and triply intractable problems

Inexact approximations for doubly and triply intractable problems Inexact approximations for doubly and triply intractable problems March 27th, 2014 Markov random fields Interacting objects Markov random fields (MRFs) are used for modelling (often large numbers of) interacting

More information

Basic Sampling Methods

Basic Sampling Methods Basic Sampling Methods Sargur Srihari srihari@cedar.buffalo.edu 1 1. Motivation Topics Intractability in ML How sampling can help 2. Ancestral Sampling Using BNs 3. Transforming a Uniform Distribution

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Disease mapping with Gaussian processes

Disease mapping with Gaussian processes EUROHEIS2 Kuopio, Finland 17-18 August 2010 Aki Vehtari (former Helsinki University of Technology) Department of Biomedical Engineering and Computational Science (BECS) Acknowledgments Researchers - Jarno

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos Contents Markov Chain Monte Carlo Methods Sampling Rejection Importance Hastings-Metropolis Gibbs Markov Chains

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference

Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference Osnat Stramer 1 and Matthew Bognar 1 Department of Statistics and Actuarial Science, University of

More information

Multimodal Nested Sampling

Multimodal Nested Sampling Multimodal Nested Sampling Farhan Feroz Astrophysics Group, Cavendish Lab, Cambridge Inverse Problems & Cosmology Most obvious example: standard CMB data analysis pipeline But many others: object detection,

More information

Approximate Inference using MCMC

Approximate Inference using MCMC Approximate Inference using MCMC 9.520 Class 22 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Basic Sampling Algorithms. 4. Markov

More information

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm On the Optimal Scaling of the Modified Metropolis-Hastings algorithm K. M. Zuev & J. L. Beck Division of Engineering and Applied Science California Institute of Technology, MC 4-44, Pasadena, CA 925, USA

More information

Brief introduction to Markov Chain Monte Carlo

Brief introduction to Markov Chain Monte Carlo Brief introduction to Department of Probability and Mathematical Statistics seminar Stochastic modeling in economics and finance November 7, 2011 Brief introduction to Content 1 and motivation Classical

More information

20: Gaussian Processes

20: Gaussian Processes 10-708: Probabilistic Graphical Models 10-708, Spring 2016 20: Gaussian Processes Lecturer: Andrew Gordon Wilson Scribes: Sai Ganesh Bandiatmakuri 1 Discussion about ML Here we discuss an introduction

More information

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu Lecture: Gaussian Process Regression STAT 6474 Instructor: Hongxiao Zhu Motivation Reference: Marc Deisenroth s tutorial on Robot Learning. 2 Fast Learning for Autonomous Robots with Gaussian Processes

More information

Sub-kilometer-scale space-time stochastic rainfall simulation

Sub-kilometer-scale space-time stochastic rainfall simulation Picture: Huw Alexander Ogilvie Sub-kilometer-scale space-time stochastic rainfall simulation Lionel Benoit (University of Lausanne) Gregoire Mariethoz (University of Lausanne) Denis Allard (INRA Avignon)

More information

MCMC and Gibbs Sampling. Kayhan Batmanghelich

MCMC and Gibbs Sampling. Kayhan Batmanghelich MCMC and Gibbs Sampling Kayhan Batmanghelich 1 Approaches to inference l Exact inference algorithms l l l The elimination algorithm Message-passing algorithm (sum-product, belief propagation) The junction

More information

Scaling up Bayesian Inference

Scaling up Bayesian Inference Scaling up Bayesian Inference David Dunson Departments of Statistical Science, Mathematics & ECE, Duke University May 1, 2017 Outline Motivation & background EP-MCMC amcmc Discussion Motivation & background

More information

NORGES TEKNISK-NATURVITENSKAPELIGE UNIVERSITET

NORGES TEKNISK-NATURVITENSKAPELIGE UNIVERSITET NORGES TEKNISK-NATURVITENSKAPELIGE UNIVERSITET Investigating posterior contour probabilities using INLA: A case study on recurrence of bladder tumours by Rupali Akerkar PREPRINT STATISTICS NO. 4/2012 NORWEGIAN

More information

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals

More information

The Bayesian approach to inverse problems

The Bayesian approach to inverse problems The Bayesian approach to inverse problems Youssef Marzouk Department of Aeronautics and Astronautics Center for Computational Engineering Massachusetts Institute of Technology ymarz@mit.edu, http://uqgroup.mit.edu

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

Marginal Specifications and a Gaussian Copula Estimation

Marginal Specifications and a Gaussian Copula Estimation Marginal Specifications and a Gaussian Copula Estimation Kazim Azam Abstract Multivariate analysis involving random variables of different type like count, continuous or mixture of both is frequently required

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee 1 and Andrew O. Finley 2 1 Department of Forestry & Department of Geography, Michigan State University, Lansing Michigan, U.S.A. 2 Biostatistics, School of Public

More information

Control Variates for Markov Chain Monte Carlo

Control Variates for Markov Chain Monte Carlo Control Variates for Markov Chain Monte Carlo Dellaportas, P., Kontoyiannis, I., and Tsourti, Z. Dept of Statistics, AUEB Dept of Informatics, AUEB 1st Greek Stochastics Meeting Monte Carlo: Probability

More information

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Alan Gelfand 1 and Andrew O. Finley 2 1 Department of Statistical Science, Duke University, Durham, North

More information

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota,

More information

An ABC interpretation of the multiple auxiliary variable method

An ABC interpretation of the multiple auxiliary variable method School of Mathematical and Physical Sciences Department of Mathematics and Statistics Preprint MPS-2016-07 27 April 2016 An ABC interpretation of the multiple auxiliary variable method by Dennis Prangle

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Multivariate Gaussian Random Fields with SPDEs

Multivariate Gaussian Random Fields with SPDEs Multivariate Gaussian Random Fields with SPDEs Xiangping Hu Daniel Simpson, Finn Lindgren and Håvard Rue Department of Mathematics, University of Oslo PASI, 214 Outline The Matérn covariance function and

More information

Likelihood-free MCMC

Likelihood-free MCMC Bayesian inference for stable distributions with applications in finance Department of Mathematics University of Leicester September 2, 2011 MSc project final presentation Outline 1 2 3 4 Classical Monte

More information

Kernel Adaptive Metropolis-Hastings

Kernel Adaptive Metropolis-Hastings Kernel Adaptive Metropolis-Hastings Arthur Gretton,?? Gatsby Unit, CSML, University College London NIPS, December 2015 Arthur Gretton (Gatsby Unit, UCL) Kernel Adaptive Metropolis-Hastings 12/12/2015 1

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

Adaptive HMC via the Infinite Exponential Family

Adaptive HMC via the Infinite Exponential Family Adaptive HMC via the Infinite Exponential Family Arthur Gretton Gatsby Unit, CSML, University College London RegML, 2017 Arthur Gretton (Gatsby Unit, UCL) Adaptive HMC via the Infinite Exponential Family

More information

The Bias-Variance dilemma of the Monte Carlo. method. Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel

The Bias-Variance dilemma of the Monte Carlo. method. Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel The Bias-Variance dilemma of the Monte Carlo method Zlochin Mark 1 and Yoram Baram 1 Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel fzmark,baramg@cs.technion.ac.il Abstract.

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Introduction to Gaussian Process

Introduction to Gaussian Process Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters Exercises Tutorial at ICASSP 216 Learning Nonlinear Dynamical Models Using Particle Filters Andreas Svensson, Johan Dahlin and Thomas B. Schön March 18, 216 Good luck! 1 [Bootstrap particle filter for

More information

An introduction to Sequential Monte Carlo

An introduction to Sequential Monte Carlo An introduction to Sequential Monte Carlo Thang Bui Jes Frellsen Department of Engineering University of Cambridge Research and Communication Club 6 February 2014 1 Sequential Monte Carlo (SMC) methods

More information

Monte Carlo Methods. Leon Gu CSD, CMU

Monte Carlo Methods. Leon Gu CSD, CMU Monte Carlo Methods Leon Gu CSD, CMU Approximate Inference EM: y-observed variables; x-hidden variables; θ-parameters; E-step: q(x) = p(x y, θ t 1 ) M-step: θ t = arg max E q(x) [log p(y, x θ)] θ Monte

More information

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Andrew O. Finley Department of Forestry & Department of Geography, Michigan State University, Lansing

More information