
Efficient Bayesian Inference for Conditionally Autoregressive Models

by

Justin Angevaare

A Thesis presented to The University of Guelph in partial fulfilment of requirements for the degree of Master of Science in Mathematics and Statistics

Guelph, Ontario, Canada
© Justin Angevaare, April, 2014

ABSTRACT

EFFICIENT BAYESIAN INFERENCE FOR CONDITIONALLY AUTOREGRESSIVE MODELS

Justin Angevaare, University of Guelph, 2014
Advisors: Dr. D. Gillis, Dr. G. Darlington

We compare the performance of Metropolis-Hastings (MH) and Hamiltonian Monte Carlo (HMC) methods for Bayesian inference, with specific application to conditionally autoregressive (CAR) models. A simulation study is performed which investigates the efficiency of MH and HMC in estimation of the spatial correlation strength parameter of the CAR model. For this, data are simulated at various resolutions and spatial correlation strengths. An application to the relative abundance of Lake Whitefish in Lake Huron is also presented. Many new HMC-based methods have been recently developed, some of which offer significant benefit in performing inference for CAR models.

Acknowledgments

I have found statistics to be an incredibly rewarding research area. Undoubtedly, I owe much of this to my wonderful thesis advisors, Dr. Dan Gillis and Dr. Gerarda Darlington, who have been supportive during my many challenges and celebratory whenever things turned around. Without Dan's mentorship over the years, it is unlikely I would have given consideration to performing research in statistics. I am glad that I did. Thank you. Communicating ideas or problems in statistics to people outside the discipline can be difficult. To those in my life without a background in statistics who have shown the interest and patience in understanding what I spend my time doing (and occasionally being frustrated by), thank you sincerely. This research has received support through the Mitacs Accelerate program, which covered research equipment costs and provided a valuable internship opportunity.

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Bayesian Inference
  1.3 Conditionally Autoregressive Models
  1.4 Metropolis-Hastings
  1.5 Hamiltonian Monte Carlo
  1.6 Goal and Objectives
2 Simulation Study
  2.1 Methodology
    Design
    Analysis
    Metrics
  2.2 Results
    Trace plots
    Computational time
    Effective sample size
    Effective samples per second
  2.3 Discussion
3 Application
  3.1 Methodology
    Data Description
    Analysis
  3.2 Results
    Trace plots
    Performance metrics
    Model predictions
  3.3 Discussion
4 Conclusions
A Appendices
  A.1 Tables
    A.1.1 Kruskal-Wallis tests
    A.1.2 Wilcoxon rank-sum tests

  A.2 Code
    A.2.1 Simulation Study
      Neighbourhood matrix generation
      PyMC
      Data organization
    A.2.2 Application
      Data organization and visualization
      PyMC
    A.2.3 Graphics
      Level Plots
      Line Graphs
      Stacked Histograms
    A.2.4 Table Production
      Metric summaries
      Kruskal-Wallis tests
      Wilcoxon rank-sum tests

Chapter 1
Introduction

1.1 Motivation

When exploring and analyzing spatially labelled data, it is important to consider relationships which may occur due to spatial proximity. Many statistical methods are available to explore these relationships for point (spatially continuous) and areal (spatially discrete) data alike. When working with areal data, one of the most widely used methods is the conditionally autoregressive model. This model considers spatial random effects, which can be used within simple regression or more complicated hierarchical models. In the case of hierarchical models, the Bayesian framework, and the computational methods available for Bayesian inference, are especially useful. When conditionally autoregressive models are used for high dimensional data, it becomes increasingly important to consider the efficiency of these computational methods. Two prominent computational methods in Bayesian inference are Metropolis-Hastings and Hamiltonian Monte Carlo. The relative efficiency of these methods is dependent on the shape of a model's parameter space. This thesis addresses the relative efficiency of Metropolis-Hastings and Hamiltonian Monte Carlo for conditionally autoregressive models.

1.2 Bayesian Inference

Bayesian inference is a framework for combining prior assumptions about model parameters with the likelihood of observed data given specific values of those parameters, into a posterior (the probability of parameter values given the observed data) (Neal, 1993). The shape of the posterior distribution is determined by how informative the observed data are versus how informative the chosen prior is. Calculation of the posterior distribution follows the rule of conditional probabilities, which states that for any two events A and B (Miller and Miller, 2004),

P(A|B)P(B) = P(B|A)P(A),

and therefore,

P(A|B) = P(B|A)P(A) / P(B).

By substituting a parameter vector, θ, for event A, and observed data, D, for event B, it follows that

P(θ|D) = P(D|θ)P(θ) / P(D).

The prior distribution of θ is represented by P(θ), the likelihood of D for specific values of θ by P(D|θ), and the posterior distribution of θ by P(θ|D). P(D) ensures that P(θ|D) is a true density (i.e. integrates to 1), and can be considered a normalization constant. P(D) can be found through an integration over the parameter space (Neal, 1993), such that

P(θ|D) = P(D|θ)P(θ) / ∫θ P(D|θ)P(θ) dθ.

However, the integration required to find P(D) is difficult or impossible in many applications. Markov chain Monte Carlo (MCMC) allows us to sample from and approximate the posterior distribution without performing this integration, allowing the posterior to be defined up to proportionality as (Besag et al., 1995)

P(θ|D) ∝ P(D|θ)P(θ).

MCMC is a widely used method for performing Bayesian inference (Cappé and Robert, 2000).
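As a concrete numerical illustration of the role of P(D) (not part of the thesis analysis, and using an assumed beta-binomial toy model), the following Python sketch evaluates the normalizing constant by direct integration over a grid. This is only feasible because the parameter space here is one-dimensional; MCMC methods instead work with the unnormalized product of likelihood and prior.

import numpy as np
from scipy import stats

y, n = 7, 10                                   # assumed toy data: 7 successes in 10 trials
theta = np.linspace(0.001, 0.999, 999)         # grid over the parameter space

prior = stats.beta.pdf(theta, a=2, b=2)        # P(theta)
likelihood = stats.binom.pmf(y, n, theta)      # P(D | theta)
unnormalized = likelihood * prior              # proportional to the posterior

p_data = np.trapz(unnormalized, theta)         # P(D): integral over the parameter space
posterior = unnormalized / p_data              # a true density (integrates to 1)

print("P(D) =", round(p_data, 4))
print("posterior integrates to", round(np.trapz(posterior, theta), 3))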

1.3 Conditionally Autoregressive Models

Conditionally autoregressive (CAR) models, first defined by Besag (1974), and later generalized by Besag et al. (1991), describe the spatial relationship between regions or areal units. For a supplied neighbour kernel, the CAR prior describes spatial random effects based upon the strength of correlation between neighbours and the scale of regional variability. Often this neighbour kernel is binary, indicating whether regions are adjacent or not (for examples of use see Fuentes et al. (2008) or Yu et al. (2008)). Other definitions or weighting schemes for determining neighbour relationships are possible. For instance, Kyung and Ghosh (2009) describe a directional CAR model where neighbours in one direction may be weighted differently from neighbours in another. This approach makes sense in many applications where factors such as prevailing wind patterns or other directional processes are expected to influence observations. The spatial random effects described by the CAR model are defined through a multivariate normal distribution, centred on zero, with covariance matrix (τ(D − ρW))⁻¹, where τ, ρ, D, and W are defined following Besag et al. (1991):

W, an r × r symmetric matrix, defines the neighbour relationships/weights amongst r areal units. The diagonal elements of W, (w_{1,1}, w_{2,2}, ..., w_{r,r}), are necessarily all zeros:

W = | 0        w_{1,2}  ...  w_{1,r} |
    | w_{2,1}  0        ...  w_{2,r} |
    | ...      ...      ...  ...     |
    | w_{r,1}  w_{r,2}  ...  0       |

For a binary neighbour weighting scheme,

w_{i,j} = 1 if i and j are neighbours, 0 otherwise.

D is an r × r diagonal matrix of neighbour counts, or total neighbour weights for the k-th areal unit, such that

D = | d_1  0    ...  0   |
    | 0    d_2  ...  0   |
    | ...  ...  ...  ... |
    | 0    0    ...  d_r |

where

d_k = Σ_{i=1}^{r} w_{k,i}.

The parameter ρ describes the strength of spatial correlation, or spatial dependency. The range of ρ must be restricted to (−1, 1) to ensure that the covariance matrix is positive definite. It is possible to further restrict ρ to only positive or negative values if we wish to place a stronger prior on the type of spatial correlation that may exist between areal units. For our purposes, an uninformative prior that allows for a positive or negative correlation amongst neighbouring areal units is selected, hence we allow

ρ ~ Uniform(−1, 1).

The parameter τ serves as a scaling factor for the inverse of the covariance matrix, and can be thought of as the overall variation amongst regions. τ must be greater than 0 to ensure the covariance matrix is positive definite. A gamma prior is selected, which ensures that τ is positive, and is flexible in terms of shape and location, hence

τ ~ Gamma(α, β),

where α is the shape parameter and β the rate parameter, the hyperparameters for τ. The CAR model described in full is of the form

CAR(τ, ρ, W) = MvNormal(μ = 0, Σ = (τ(D − ρW))⁻¹).

The standard method of performing inference for CAR models is with Metropolis-Hastings (MH) based algorithms. Typically the use of MH for these models will result in poor convergence and mixing properties (Haran et al., 2001).

1.4 Metropolis-Hastings

Metropolis-Hastings (MH) sampling is an MCMC method available for Bayesian inference (Tierney, 1993). Each sample of the joint posterior distribution exists as a position in an MCMC chain. These chains are Markovian in that a sample at location t in the chain only depends mathematically on the value of the previous sample at location t − 1. The MH algorithm is first initialized, then iterated. The iteration scheme is as follows (Chib and Greenberg, 1995).

A new value (or vector), x*_{t+1}, for the t + 1 position of the chain is proposed. This value is probabilistically generated from a transition kernel, based on the current value of the chain, x_t. The chosen transition kernel must be reversible, that is

P(x_t | x*_{t+1}) = P(x*_{t+1} | x_t).

It is also required that the chain that results from this transition kernel is aperiodic, meaning that movement through areas of the target density is not restricted to a multiple of an integer number of steps (Chib and Greenberg, 1995). The proposed value x*_{t+1} is accepted with probability

P(x_{t+1} = x*_{t+1}) = min(1, [π(x*_{t+1}) P(x_t | x*_{t+1})] / [π(x_t) P(x*_{t+1} | x_t)]).

Since the transition kernel is reversible, this simplifies to

P(x_{t+1} = x*_{t+1}) = min(1, π(x*_{t+1}) / π(x_t)),

where π(x) is some target density. If x*_{t+1} fails to be accepted, the chain remains at the same value, that is, x_{t+1} = x_t. In Bayesian inference, the values of the Markov chain, x, would be values of model parameters, θ. The probability of these values, π(x), would correspond to the posterior distribution, P(θ|D). The posterior need only be defined up to proportionality, i.e. the product of the likelihood and the prior. Normalization constants will cancel when calculating the probability that a proposed value is accepted:

P(θ_{t+1} = θ*_{t+1}) = min(1, [P(D|θ*_{t+1})P(θ*_{t+1}) / ∫ P(D|θ)P(θ)dθ] / [P(D|θ_t)P(θ_t) / ∫ P(D|θ)P(θ)dθ])
                      = min(1, [P(D|θ*_{t+1})P(θ*_{t+1})] / [P(D|θ_t)P(θ_t)]).
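The scheme above can be written compactly in code. The following is a minimal random-walk MH sketch, for illustration only (the thesis itself relies on PyMC's implementation); the target is supplied as an unnormalized log-posterior so that the normalization constant never needs to be evaluated, and the standard normal example target is an assumption made for the demonstration.

import numpy as np

def metropolis_hastings(log_post, theta0, n_samples=10000, proposal_sd=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    log_p = log_post(theta)
    chain = np.empty((n_samples, theta.size))
    accepted = 0
    for t in range(n_samples):
        # Symmetric (reversible) normal transition kernel centred on the current value.
        proposal = theta + rng.normal(scale=proposal_sd, size=theta.size)
        log_p_prop = log_post(proposal)
        # Accept with probability min(1, pi(proposal) / pi(current)).
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = proposal, log_p_prop
            accepted += 1
        chain[t] = theta
    return chain, accepted / n_samples

# Example: sample a standard normal target, supplied as a log-density up to a constant.
chain, acceptance_rate = metropolis_hastings(lambda th: -0.5 * np.sum(th ** 2), theta0=[0.0])
print("acceptance rate:", round(acceptance_rate, 2))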

The efficiency of this algorithm will be greatly dependent on the transition kernel used. Often a normal transition kernel is selected, and the variance of this kernel with respect to the parameter space determines efficiency. In modern uses of the MH algorithm, this variance is typically automatically tuned as samples are generated such that a desired rejection rate is achieved. That is, the transition kernel variance will be increased if the chain is remaining in a high acceptance region, and decreased if proposals are consistently rejected. With MH, obtaining an optimal rejection rate is the primary consideration in ensuring efficient exploration of the parameter space (Roberts et al., 1997; Chib and Greenberg, 1995). The optimal rejection rate will depend on the dimensionality of the parameter space (increasing with dimensionality), ranging from around 0.45 for a one-dimensional problem up to a maximum of about 0.77 for higher dimensions (Chib and Greenberg, 1995).

1.5 Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC), first described by Duane et al. (1987), is another probabilistic sampling method. Like MH, we can utilize HMC for Bayesian inference by sampling from a model's posterior distribution. HMC is known to perform comparatively well when the parameter space of a model is particularly difficult to explore, such as when model parameters are highly correlated (Brooks et al., 2011). The efficiency of HMC is based on its ability to generate distant, but high acceptance

proposals. These proposals are generated through a discretization of Hamiltonian dynamics, which requires gradient information with respect to the model parameters. The cost of computing this gradient, weighed against HMC's ability to quickly explore difficult parameter spaces, is the principal consideration in its use. With HMC, a realization from the posterior distribution of d model parameters is analogous to the position vector of a particle in a d-dimensional space. This particle exists in a Hamiltonian system, and as such its movements follow Hamiltonian dynamics. Describing a physical system in terms of Hamiltonian dynamics is an alternative to the Newtonian interpretation. Hamiltonian dynamics are preferred especially when analyzing or simulating complex systems. In general, Hamiltonian dynamics describe an object's state in terms of its energy, mass, and position. This state can be determined through the appropriate differentiation of the Hamiltonian. The Hamiltonian describes the total energy of the system (Neal, 1993). Energy in Hamiltonian systems is continuously converted through time, from potential, to kinetic, and back again. An object may exhibit convergent, periodic, or chaotic behaviour in this respect. Our description of Hamiltonian systems follows details presented by Hairer et al. (2006) and Greiner (2009).

The location of a particle in a Hamiltonian system with d dimensions is described by a vector q, of length d. The k-th element of q represents the location of the particle in the k-th dimension. The velocity of this particle in each of the d dimensions is described by the vector q̇, also of length d. The function T(q, q̇) describes the kinetic energy of the particle, typically defined as

T(q, q̇) = (1/2) q̇ᵀ M(q) q̇,

where M(q) represents a symmetric, positive definite, square, (possibly) position-dependent mass matrix of size d × d. The potential energy of the particle is described by the function U(q), which only depends on the particle's location. The Lagrangian function, L(q, q̇), is the difference between these energy functions, such that

L(q, q̇) = T(q, q̇) − U(q).

The Lagrangian follows a known relationship between differentials involving the velocity, time, and location:

d/dt (∂L(q, q̇)/∂q̇) = ∂L(q, q̇)/∂q.

The momentum of a particle in a Hamiltonian system is represented by a vector p of length d. Momentum is defined in the k-th dimension with respect to the Lagrangian and the velocity as

p_k = ∂L(q, q̇)/∂q̇_k.    (1.1)

With this momentum, we can finally define the Hamiltonian function, H(q, q̇, p), which depends on a particle's position, velocity, and momentum as

H(q, q̇, p) = pᵀq̇ − L(q, q̇).

The Hamiltonian can be shown to be the total energy in a system, and thus can also be defined as

H(q, q̇, p) = T(q, q̇) + U(q).

For the Hamiltonian to only depend on the current position and momentum, the position and the velocity must have one-to-one correspondence with the momentum, via equation 1.1, which must be continuously differentiable. Velocity, q̇_k, and change in momentum, ṗ_k, in the k-th dimension can then be described as:

q̇_k = ∂H(p, q)/∂p_k,

and

ṗ_k = −∂H(p, q)/∂q_k.

The state of a mass point following Hamiltonian dynamics through time can be approximated with second order accuracy through a leapfrog integration scheme. This integration method is used to approximate many systems described by differential equations. In the case of Hamiltonian dynamics, integration will find the state of the system at time intervals dictated by a step size, ε (Brooks et al., 2011). The step size will determine the resolution at which the dynamics are described: too large and the resolution may gloss over the features of interest; too small and the computation may be needlessly intensive (Hoffman and Gelman, 2011). Other discretization methods are possible (Neal, 1993), but the leapfrog integration method has proved to be the most practical and widely used for HMC. Description of the scheme in which a particle's momentum and position are updated follows Neal (1993):

p_k(t + ε/2) = p_k(t) − (ε/2) ∂U(q(t))/∂q_k,
q_k(t + ε) = q_k(t) + ε p_k(t + ε/2),
p_k(t + ε) = p_k(t + ε/2) − (ε/2) ∂U(q(t + ε))/∂q_k.
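A minimal implementation of these leapfrog updates is sketched below, assuming a unit mass matrix (so that velocity and momentum coincide) and a simple quadratic potential chosen only for the illustration; the near-conservation of the Hamiltonian over the trajectory is what makes distant, high-acceptance proposals possible.

import numpy as np

def leapfrog(q, p, grad_U, eps, n_steps):
    """Leapfrog integration of Hamiltonian dynamics with a unit mass matrix."""
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_U(q)            # initial half-step momentum update
    for _ in range(n_steps - 1):
        q += eps * p                      # full-step position update
        p -= eps * grad_U(q)              # full-step momentum update
    q += eps * p
    p -= 0.5 * eps * grad_U(q)            # final half-step momentum update
    return q, p

# Example: a quadratic potential U(q) = 0.5 * q^2 (a harmonic oscillator).
U = lambda q: 0.5 * np.sum(q ** 2)
grad_U = lambda q: q
q0, p0 = np.array([1.0]), np.array([0.0])
q1, p1 = leapfrog(q0, p0, grad_U, eps=0.1, n_steps=100)

# The Hamiltonian H = U(q) + 0.5 * p'p is approximately conserved by the scheme.
print(U(q0) + 0.5 * p0 @ p0, U(q1) + 0.5 * p1 @ p1)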

To sample from a posterior distribution using HMC, the position vector q now represents a vector of proposed parameters. We define our potential energy function U(q) with special reference to our likelihood and prior such that

U(q) = −log(P(D|q)P(q)).

Samples that are generated from this algorithm are then subject to a regular MH acceptance scheme (a minimal code sketch of a single HMC transition is given at the end of this chapter). In relation to MH, HMC produces samples with lower autocorrelation, but does so at a computational cost (Hoffman and Gelman, 2011).

1.6 Goal and Objectives

The goal of this study is to demonstrate and compare the merits of MH and HMC in performing Bayesian inference for CAR models. Specifically, we investigate the hypothesis that HMC is unequivocally superior in efficiency to MH in performing inference for CAR models. In order to achieve this goal, the following objectives must be met:

- Spatially correlated data must be simulated with a variety of resolutions and parameter values;
- Joint posterior densities of CAR model parameters from each simulated dataset must be sampled using MH and HMC methods;

- Efficiency must be measured and compared between MH and HMC;
- An application must be presented that demonstrates the abilities of MH and HMC in performing inference for a CAR model.
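The following sketch ties sections 1.4 and 1.5 together by implementing a single HMC transition: a leapfrog trajectory generated from the negative log of the unnormalized posterior, followed by the usual Metropolis acceptance step on the change in total energy. The correlated bivariate normal target, unit mass matrix, and fixed step size and trajectory length are assumptions made for this illustration; they are not the PyMC implementation used in the remainder of this thesis.

import numpy as np

rng = np.random.default_rng(1)

# Toy target: a strongly correlated bivariate normal posterior (assumed for illustration).
Sigma = np.array([[1.0, 0.95], [0.95, 1.0]])
Prec = np.linalg.inv(Sigma)
U = lambda q: 0.5 * q @ Prec @ q               # negative log of the unnormalized posterior
grad_U = lambda q: Prec @ q

def hmc_step(q, eps=0.1, n_steps=25):
    """One HMC transition: leapfrog trajectory plus Metropolis acceptance."""
    p0 = rng.standard_normal(q.size)           # momentum drawn from N(0, I)
    q_new, p = q.copy(), p0.copy()
    p -= 0.5 * eps * grad_U(q_new)
    for _ in range(n_steps - 1):
        q_new += eps * p
        p -= eps * grad_U(q_new)
    q_new += eps * p
    p -= 0.5 * eps * grad_U(q_new)
    # Accept or reject based on the change in total energy H = U(q) + K(p).
    h_current = U(q) + 0.5 * p0 @ p0
    h_proposed = U(q_new) + 0.5 * p @ p
    if np.log(rng.uniform()) < h_current - h_proposed:
        return q_new, True
    return q, False

q = np.zeros(2)
draws = np.empty((5000, 2))
n_accepted = 0
for t in range(5000):
    q, accepted = hmc_step(q)
    n_accepted += accepted
    draws[t] = q

print("acceptance rate:", n_accepted / 5000)
print("sample covariance:\n", np.cov(draws.T))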

Chapter 2
Simulation Study

2.1 Methodology

Design

A simulation study was designed to investigate the relative performance of HMC and MH in inferring the value of the spatial correlation strength parameter of the CAR model, ρ. Spatially correlated data were simulated for one of nine levels of correlation: high (ρ = ±0.95), medium-high (ρ = ±0.75), medium (ρ = ±0.5), medium-low (ρ = ±0.25), or zero (ρ = 0). In this simulation, an areal unit represented a cell from a regular, finite lattice. The resolution of the simulated spatial data, in other words the lattice dimensions, was one of four levels: 5 × 5, 10 × 10, 15 × 15, or 20 × 20. Each combination of spatial correlation strength and resolution was simulated ten times, yielding a total of 360 simulated datasets. Visualizations of three such datasets are shown in figure 2.1. The experiment follows a full factorial arrangement with three factors: MCMC method, spatial correlation strength, and resolution, with 10 replicates.
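A sketch of how a single such dataset could be generated is given below. The rook (four nearest neighbour) adjacency on the lattice and τ = 1 are assumptions made for this example; the thesis's own neighbourhood matrix generation and simulation code appear in appendix A.2.1. Data are drawn from MvNormal(0, (τ(D − ρW))⁻¹) by solving against the Cholesky factor of the precision matrix.

import numpy as np

def lattice_W(d):
    """Binary first-order (rook) neighbour matrix for a regular d x d lattice."""
    r = d * d
    W = np.zeros((r, r))
    for i in range(d):
        for j in range(d):
            k = i * d + j
            if i + 1 < d:                       # neighbour below
                W[k, k + d] = W[k + d, k] = 1
            if j + 1 < d:                       # neighbour to the right
                W[k, k + 1] = W[k + 1, k] = 1
    return W

def simulate_car(d, rho, tau=1.0, seed=0):
    """Draw one realization from the CAR model on a d x d lattice."""
    rng = np.random.default_rng(seed)
    W = lattice_W(d)
    D = np.diag(W.sum(axis=1))                  # diagonal matrix of neighbour counts
    Q = tau * (D - rho * W)                     # CAR precision matrix
    L = np.linalg.cholesky(Q)                   # Q = L L'
    z = rng.standard_normal(d * d)
    return np.linalg.solve(L.T, z).reshape(d, d)   # x ~ MvNormal(0, Q^{-1})

field = simulate_car(d=10, rho=0.95)
print(field.shape)                              # (10, 10)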

Figure 2.1: Three examples of simulated data for three different spatial correlation strengths (ρ = 0.95, 0.5, and −0.5), and three different lattice sizes (20 × 20, 15 × 15, and 10 × 10).

Analysis

Each simulated dataset was fit to a simple CAR model in Python using PyMC 3.0 (Patil et al., 2010). These models consisted only of a CAR component, i.e. observations were assumed to be direct observations from CAR(τ, ρ). Posterior distributions of τ and ρ were sampled using MH and HMC methods. Both algorithms were initiated at maximum a posteriori (MAP) points, and iterated for 10,000 steps each. Complete Python code for this simulation study is included in appendix A.2.1.

Metrics

The purpose of this simulation was to study the relative efficiency of MH and HMC when performing Bayesian inference of CAR models. Here, efficiency has two main components: (1) the speed with which samples can be generated from the joint posterior distribution, and (2) the level of temporal autocorrelation amongst these samples (i.e. how many independent samples do we actually have for the purpose of estimating the joint posterior distribution). The first component is measured simply with computation time. The second component can be measured through an effective sample size (ESS) calculation. A higher ESS indicates a better exploration of a parameter space. ESS is defined as (Kass et al., 1998; Pakman and Paninski, 2013)

ESS = n / δ,

where n is the number of posterior samples generated by an MCMC method (n = 10,000 in the present simulation study), and δ is the autocorrelation time, such that

δ = 1 + 2 Σ_k ψ(k),

with ψ(k), the autocorrelation at lag k for parameter θ_i, defined as

ψ(k) = corr(θ_i^(t), θ_i^(t+k)).

Girolami and Calderhead (2011) and Betancourt (2012) use these same metrics in the comparison of a variety of MCMC methods. Pakman and Paninski (2013) used ESS and CPU runtime in this situation, and Wang et al. (2013) and Hoffman and Gelman (2011) use ESS and the number of leapfrog steps when comparing HMC methods. Here, the number of leapfrog steps (i.e. gradient calculations) is a measure of computational cost in performing some form of HMC. ESS calculations were performed using R software for statistical computing (R Core Team, 2013), with the LaplacesDemon package (Statisticat, LLC., 2013). The LaplacesDemon ESS calculation was interfaced with Python using rpy2 (Gautier, 2013). The time (in seconds) required to generate the samples using MH and HMC was recorded directly within Python. Combined, ESS and time allow for the calculation of effective samples per second.
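A pure-Python sketch of this calculation is given below; the truncation of the lag sum at the first non-positive autocorrelation is one common convention and may differ in detail from the LaplacesDemon implementation used in the thesis. Effective samples per second is then simply the ESS divided by the recorded computation time.

import numpy as np

def autocorrelation(x, k):
    """Sample autocorrelation of a chain x at lag k."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return 1.0 if k == 0 else float(np.dot(x[:-k], x[k:]) / np.dot(x, x))

def effective_sample_size(x):
    """ESS = n / delta with delta = 1 + 2 * sum_k psi(k), truncating the sum
    once the autocorrelation is no longer positive."""
    n = len(x)
    delta = 1.0
    for k in range(1, n // 2):
        psi = autocorrelation(x, k)
        if psi <= 0:
            break
        delta += 2.0 * psi
    return n / delta

# Example: an AR(1) chain with strong autocorrelation has an ESS well below n.
rng = np.random.default_rng(0)
chain = np.zeros(10000)
for t in range(1, chain.size):
    chain[t] = 0.9 * chain[t - 1] + rng.standard_normal()
print("ESS for 10,000 autocorrelated samples:", round(effective_sample_size(chain)))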

2.2 Results

Trace plots

The performance of an MCMC algorithm can be quickly assessed through the visual examination of trace plots. Typical trace plots are included in figure 2.2, which correspond to data simulated for a lattice with ρ = 0.95. Through visual examination of trace plots from the simulation study, it appears that consecutive samples in trace plots from HMC are generally more distant than those in the trace plots from MH. Relatively distant consecutive samples in the trace plots of HMC suggest that there is less autocorrelation present in comparison to MH. In trace plots for both HMC and MH, convergence appears to be immediate. Immediate convergence suggests that the use of the maximum a posteriori point as an initial value for each algorithm has been effective in eliminating the need for a burn-in period.

Computational time

Computational time was found to be largely related to the number of regions for which data were simulated. Figure 2.3 illustrates how computational time differs between HMC and MH as a function of lattice size. We see in this figure that HMC consistently requires more time for computation for a given lattice size in comparison to MH. We also see both methods require more time for computation as lattice size increases. The rate of this increase is much higher for HMC. For instance, for 5 × 5 lattices, MH requires 3.05 seconds for computation on average, and HMC requires roughly 7 times longer than MH for

Figure 2.2: Trace plots for posterior samples of CAR model parameters as generated by MH (top right plots) and HMC (bottom right plots). Posterior densities corresponding to these trace plots are included on the left. The data were simulated for a lattice with ρ = 0.95.

computation at this lattice size. For 20 × 20 lattices, HMC requires roughly 21 times longer for computation than MH. Kruskal-Wallis tests were performed to determine if there was evidence that computation time differed amongst simulations with different spatial correlation strengths, for each method and lattice dimension. There was evidence (p-value < 0.05) that differences occur in computation time amongst spatial correlation strengths for both methods at every lattice dimension except one. This was followed by a Wilcoxon rank-sum post hoc analysis to determine if there were any patterns to when these differences in computation time with spatial correlation strength occurred. There was evidence (p-value < 0.05 / (9 choose 2)) that significant differences in computation time occurred amongst some, but not all, spatial correlation strengths with MH and HMC, but without any obvious patterns. The Kruskal-Wallis test results are included in appendix A.1.1 in table A.1. The complete set of Wilcoxon rank-sum test results are included in appendix A.1.2, in tables A.2 and A.3. As there was evidence computation time differs amongst spatial correlation strengths, separate lines for each spatial correlation strength and method are presented in figure 2.4 to show the relationship of computation time and lattice dimension. For individual simulation runs, the relative computation time of HMC and MH is presented in figure 2.5. This figure shows that HMC requires more computation time for all grid sizes, and for all spatial correlation strengths. Stacked histograms allow for examination of whether relative computational time depends on the spatial correlation strength. Based on visual examination of these figures, relative compu-

tation time appears to be consistent across all spatial correlation strengths. Median computation time is reported in table 2.1 for each lattice dimension. In table 2.2, median computation time is reported for each combination of lattice dimension and ρ. Computation time is bold in these tables when it is significantly lower for MH, as determined by one-sided Wilcoxon rank-sum tests (p-value < 0.05). The results of these tests are found in appendix A.1.2 in tables A.8 and A.9.

Effective sample size

The main feature of HMC is its ability to produce samples with limited autocorrelation. The degree of autocorrelation present in an MCMC chain can be measured with ESS. Figure 2.6 shows the relative ESS of the spatial correlation strength parameter, ρ, for each individual simulation run. Again, the use of stacked histograms allows us to see how differences in ESS occur with respect to lattice dimension and true spatial correlation strength. Values greater than 1 in this figure indicate that HMC has produced more effective samples than MH for a given simulation run. We see that this is nearly always the case. The difference in ESS is commonly more than 7500, which can be seen in tables 2.1 and 2.2. There are instances, however, in which HMC and MH produce comparable ESS. This appears to be more likely when the spatial correlation strength has been set at extreme values (i.e. ρ = 0.95 and ρ = −0.95). A difference which is close to zero is not due to MH producing more independent samples in these situations, but due to poorer performance of HMC. It seems this performance issue with HMC is reduced as lattice dimensionality increases. Kruskal-Wallis tests were performed to detect whether ESS differed signifi-

Figure 2.3: Median computation time for HMC and MH with respect to lattice dimension (d × d) is shown. The 95th percentile range of computation time is indicated with whiskers.
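The Kruskal-Wallis and post hoc Wilcoxon rank-sum comparisons reported in this section can be reproduced in Python with scipy, as sketched below on made-up computation times for three spatial correlation strengths; the thesis's own test code appears in appendix A.2.4.

import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(0)
times = {                                      # made-up replicate run times per rho level
    -0.95: rng.normal(21.0, 0.5, 10),
    0.0: rng.normal(21.5, 0.5, 10),
    0.95: rng.normal(20.8, 0.5, 10),
}

# Kruskal-Wallis: is there evidence computation time differs amongst correlation strengths?
h_stat, p_value = stats.kruskal(*times.values())
print("Kruskal-Wallis: H = %.2f, p = %.3f" % (h_stat, p_value))

# Post hoc pairwise Wilcoxon rank-sum tests, with the significance threshold
# divided by the number of pairwise comparisons.
pairs = list(combinations(times, 2))
threshold = 0.05 / len(pairs)
for a, b in pairs:
    stat, p = stats.ranksums(times[a], times[b])
    flag = "significant" if p < threshold else "not significant"
    print("rho = %s vs rho = %s: p = %.3f (%s)" % (a, b, p, flag))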

Figure 2.4: Median computation time for HMC and MH with respect to lattice dimension (d × d) and spatial correlation strength (ρ) is shown.

Figure 2.5: The above histograms depict the relative computation time required for HMC in comparison to MH for each simulation run, by lattice dimension.

cantly between different spatial correlation strengths within a method and lattice size. Significant differences in ESS were detected amongst spatial correlation strengths within each lattice size and method. Results from these tests are in appendix A.1.1 in table A.1. The Kruskal-Wallis tests were followed by a Wilcoxon rank-sum post hoc test to determine if there were any patterns for when significant differences in ESS occurred. For HMC, ESS for simulations with extreme spatial correlation strengths (ρ = ±0.95) was commonly significantly different from that for every other spatial correlation strength. For a specific spatial correlation strength, there were no significant differences detected in ESS between negative and positive values (e.g. between ρ = −0.5 and ρ = 0.5). Complete results from the Wilcoxon rank-sum tests are found in appendix A.1.2 in tables A.4 and A.5. Median ESS is reported in table 2.1 for each lattice dimension. In table 2.2, median ESS is reported for each combination of lattice dimension and ρ. ESS is bold in these tables when it is significantly higher for HMC, as determined by one-sided Wilcoxon rank-sum tests (p-value < 0.05). The results of these tests are found in appendix A.1.2 in tables A.8 and A.9.

Effective samples per second

Combining the previously mentioned metrics, computation time and ESS, we arrive at effective samples per second. This metric speaks to the overall efficiency of the MCMC algorithms. Computation time and ESS mean very little when not put in the context of one another. Effective samples per second for both methods decreases as lattice dimension increases. This decrease occurs more rapidly for HMC

Figure 2.6: The above histograms depict the relative effective sample size for HMC in comparison to MH for each simulation run, by lattice dimension.

in comparison to MH, as shown in figure 2.8. For a 20 × 20 lattice, HMC and MH generate very similar effective samples per second. We use stacked histograms to visualize the relative effective samples per second of HMC in comparison to MH for each individual simulation run, in figure 2.7. If relative effective samples per second displayed a pattern with respect to spatial correlation strength, that would suggest that the relative efficiency of the methods is situational. In general, HMC is able to generate more effective samples per unit time in comparison to MH. There are exceptions, however. Since HMC consistently requires more computation time, when HMC and MH produce comparable ESS, the effective samples per second is greater for MH than HMC. In our simulation study, this occurs when extreme values have been selected for the spatial correlation strength. Kruskal-Wallis tests were used to determine if significant differences occurred amongst effective samples per second due to spatial correlation strength, within each method and lattice size. Significant differences were detected by these tests for every method and lattice dimension. The results of these tests can be found in appendix A.1.1 in table A.1. These tests were followed by a Wilcoxon rank-sum post hoc analysis to determine if there were any patterns to when differences in effective samples per second occurred amongst spatial correlation strengths. Significant differences were found to occur amongst some, but not all, spatial correlation strengths for both methods without any obvious patterns. The results of these tests can be found in appendix A.1.2 in tables A.6 and A.7. As there was evidence of differences amongst spatial correlation strengths for both methods, figure 2.9 presents the median effective samples per second for each spatial correlation strength and method as

a function of lattice dimension. Median effective samples per second is reported in table 2.1 for each lattice dimension. In table 2.2, median effective samples per second is reported for each combination of lattice dimension and ρ. Effective samples per second is bold in these tables when it was significantly higher for a specific method, as determined by two-sided Wilcoxon rank-sum tests (p-value < 0.05). The results of these tests are found in appendix A.1.2 in tables A.8 and A.9.

2.3 Discussion

The purpose of this research has been to investigate the hypothesis that HMC is unequivocally superior in efficiency to MH in performing inference for CAR models. Here, efficiency has been defined as the number of effective samples generated per second for each method, which has been measured for a variety of scenarios through simulation. If this hypothesis were true, we would predict that HMC would have more effective samples per second for each of these scenarios. This was found to be the case the majority of the time, but with some interesting exceptions. HMC dropped in performance when regions had very strong positive or negative correlation with their neighbours. The true values in these cases, 0.95 and -0.95, approach the boundaries of the uniform prior set on the spatial correlation strength, 1 and -1. It seems natural that the effective sample size of an MCMC chain for a parameter, when its true value is near a hard boundary, may be lower than when this is not the case. This is due to the fact that any value proposed beyond this boundary will have an acceptance probability

Figure 2.7: The above histograms depict the difference in effective samples per second between HMC and MH methods for each simulation run, by grid size.

Figure 2.8: Median effective samples per second for HMC and MH with respect to lattice dimension (d × d) is shown. The 95th percentile range of effective samples per second is indicated with whiskers.

Figure 2.9: Median effective samples per second for HMC and MH with respect to lattice dimension (d × d) and spatial correlation strength (ρ) are shown.

of zero, and proposals in this region will be more common when near high density areas of the posterior distribution. Additionally, the gradient calculation involved in the generation of HMC proposals is not able to guide an MCMC chain away from the boundaries of a uniform prior. Pakman and Paninski (2013) have presented an interesting HMC-based MCMC algorithm to deal with truncated distributions, such as the one required for CAR models. In their algorithm, they account for boundaries by having particles (posterior samples) bounce off of them. When this occurs, an inversion of velocity occurs, which is shown to continue to satisfy the conditions of Hamiltonian dynamics. An approach such as this could mitigate the efficiency issues observed at boundaries in this simulation study. A second interesting result was that the efficiency advantages of HMC diminished as the number of regions for which data were simulated increased. For a 20 × 20 lattice, the efficiency differences between HMC and MH were slight. If this trend continued, for higher spatial resolutions it may actually be advantageous to use MH over HMC. The high computational cost of HMC is related to the required gradient calculation of the potential energy function. As the dimensionality of a model increases, so does the cost of computing this gradient. For MH, the cost of increased dimensionality is due to a higher rejection rate, and more complex posterior density calculations, something which HMC is also subject to. One option to lessen the burden of gradient calculations is to use a stochastic gradient approach with HMC. Chen et al. (2014) have implemented stochastic gradient HMC. In order to account for the noise generated from the use of stochastic gradients, they found that a friction term was necessary to maintain Hamiltonian dynamics. This friction term

itself is based on second-order Langevin dynamics. Chen et al. (2014) conclude that stochastic gradient HMC, with their simple friction term, presents a promising avenue for scaling HMC for practical use with high dimensional Bayesian models. Beyond the two HMC variants briefly described here, many others have been developed that seek to improve the efficiency and usability of HMC. Figure 2.10 lists some of these developments. Excitingly, there are opportunities for some of these new HMC-based methods to borrow from one another. The popularity of HMC for performing Bayesian inference will continue to increase as these advances are made.

Figure 2.10: Some of the recent HMC-based MCMC methods which have been developed: Hamiltonian (Duane et al., 1987; Neal, 1993); Stochastic Gradient (Chen et al., 2014); Exact (Pakman and Paninski, 2013); Advanced (Beskos et al., 2013); Split (Lan and Shahbaba, 2012); Rasmussen (Fielding and Liong, 2011); Parallel Tempering (Fielding and Liong, 2011); NUTS (Hoffman and Gelman, 2011); Riemann Manifold (Girolami and Calderhead, 2011); Adaptive (Wang et al., 2013).

d     Time HMC            Time MH              ESS HMC            ESS MH             ES/sec HMC        ES/sec MH
5     — (20, 25.7)        3.1 (2.9, 3.2)       — (7.4, 10000)     — (86.3, 817.7)    388 (0.4, 481)    76.6 (28.1, 282.3)
10    — (50.9, 62.1)      5.4 (5.3, 6.4)       — (1.9, 10000)     — (64, 862.6)      — (0, 196.3)      31.1 (11.7, 159.9)
15    — (232.1, 267.6)    14.8 (14.4, 17.5)    — (16.4, 10000)    — (47, 879.9)      39.9 (0.1, 42.9)  11.6 (3.2, 58.8)
20    — (1051.9, —)       55.2 (50.1, 151.7)   — (48.3, 10000)    — (27.6, 879.4)    8.3 (0, 9.5)      2.7 (0.5, 16.4)

Table 2.1: Median values of the metrics considered in this simulation study are presented by lattice dimension (d) and method, over all spatial correlation strengths. The range of each metric is included in brackets. Bold values indicate significantly better performance of a method for a specific metric. Relatively lower computation time, higher ESS, and higher effective samples per second are preferred in performance.

Table 2.2: Median values of the metrics considered in this simulation study (computation time, ESS, and effective samples per second) are presented by lattice dimension (d), spatial correlation strength (ρ), and method. The range of each metric is included in brackets. Bold values indicate significantly better performance of a method for a specific metric.

Chapter 3
Application

3.1 Methodology

Data Description

A CAR model was used to detect the spatial structure of, and provide spatially smoothed estimates of, catch per unit effort (CPUE) for the lake whitefish (Coregonus clupeaformis) fishery of the North Channel of Lake Huron. CPUE is considered a measure of relative abundance. The abundance of lake whitefish may vary spatially according to the degree to which local environmental conditions are in line with the species' habitat preferences. Local abundance may also be affected by commercial harvest intensity, and aggregation and dispersion behaviours of lake whitefish. It is likely that several more such spatial processes exist, combining to result in the spatial correlation of lake whitefish abundance. The raw CPUE data were calculated as the total harvest of lake whitefish in round kilograms divided by the total effort in terms of meters of gillnet for each actively fished 5 minute × 5 minute grid in the North Channel of Lake Huron. Only gillnet harvest was considered for simplicity, as it accounted for the vast majority of

commercial harvest (> 98% of all harvest). Harvest and effort were totalled across 34 years of commercial fisheries data (1979-2012 inclusive) for the calculation of CPUE. In total, 85 5 minute × 5 minute grid cells were considered for the CAR model. In other words, the harvest weight and effort from each gillnet harvest event (h) were aggregated over all years for each grid cell (j), and CPUE was calculated as

CPUE_j = ( Σ_i Σ_{h ∈ j} Harvest_h ) / ( Σ_i Σ_{h ∈ j} Effort_h ), for i = 1979, ..., 2012, and j = 1, ..., 85.

Analysis

A simple CAR model was assumed for the CPUE data, where

y = β_0 + u,

where y is a vector of CPUE observations for grid cells 1, ..., 85, β_0 is an intercept term, and u are the spatial random effects as described by the CAR model. The following priors were assumed for the model:

β_0 ~ Normal(μ = 0, σ² = 1),
u ~ MvNormal(μ = 0, Σ = (τ(D − ρW))⁻¹),
τ ~ Gamma(α = 1, β = 4),
ρ ~ Uniform(A = −1, B = 1).

The normal prior for β_0 allows for a positive or negative intercept. The use of a normal prior here is standard for Bayesian regression. The gamma prior on τ ensures positivity, which is necessary for its use in the covariance matrix associated with the CAR model. The uniform prior on ρ is uninformative, and allows for positive or negative correlation amongst the regions. Similar to the simulation study, samples from the joint posterior distribution were generated with MH and HMC, with sampling initiated at the MAP point. Time required for computation and the effective sample size were calculated for each MCMC method.
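For illustration, the model above can be expressed in PyMC3-style syntax as sketched below. The placeholder neighbour structure (a simple chain adjacency) and zeroed CPUE vector exist only so that the sketch is self-contained, the precision parameterization of MvNormal and the sampler calls reflect the PyMC3 API rather than the early PyMC 3.0 release cited in this thesis, and the thesis's actual application code appears in appendix A.2.2.

import numpy as np
import pymc3 as pm

# Placeholder data: in the application, W, D, and y come from the 85 grid cells
# of aggregated fisheries records. A chain adjacency is assumed here only so
# that the example runs end to end.
n = 85
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1
D = np.diag(W.sum(axis=1))
y = np.zeros(n)

with pm.Model() as cpue_model:
    rho = pm.Uniform("rho", lower=-1, upper=1)
    tau = pm.Gamma("tau", alpha=1, beta=4)
    beta0 = pm.Normal("beta0", mu=0, sd=1)
    # y = beta0 + u with u ~ CAR(tau, rho, W): equivalently, y is multivariate
    # normal with mean beta0 and precision tau * (D - rho * W).
    precision = tau * (D - rho * W)
    pm.MvNormal("y_obs", mu=beta0 * np.ones(n), tau=precision, observed=y)

    start = pm.find_MAP()                      # initialize both samplers at the MAP point
    mh_trace = pm.sample(10000, step=pm.Metropolis(), start=start, chains=1)
    hmc_trace = pm.sample(10000, step=pm.HamiltonianMC(), start=start, chains=1)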

3.2 Results

Trace plots

Trace plots associated with the fitting of the Lake Huron lake whitefish CPUE data to a CAR model, using both HMC and MH, can be found in figure 3.1. Substantial differences can be seen in the behaviour of these trace plots. MH seems to have performed especially poorly in the production of region estimates in comparison to HMC. HMC appears to produce samples for ρ and τ with very low autocorrelation. There seems to be slightly more autocorrelation in the samples generated by MH for these parameters, but their trace plots still show reasonable mixing.

Performance metrics

MH required 6.2 seconds for its iterations, whereas HMC required substantially longer. In these times, MH produced effective sample sizes of 109.9, 612.0, and 2.1 for ρ, τ, and β_0, respectively. For ρ, HMC produced an effective sample size of 538.9. In terms of effective samples per second, MH was found to have rates of 17.72, 98.70, and 0.36 for ρ, τ, and β_0, in comparison to HMC, which was found to have rates of 4.23, 48.67, and 2.15, respectively. These results are also presented in table 3.1. These results differ from the relative efficiency that the simulation study has led us to expect. In comparison, in the simulation study, the median effective samples per second for ρ for a 10 × 10 lattice was 31.1 for MH, with a higher median for HMC. The irregular spatial structure of the CPUE data, the differences in spatial correlation strengths, and/or

Figure 3.1: Comparison of trace plots and posterior densities resulting from MH (left) and HMC (right) sampling methods.

       ESS MH    ESS HMC    ESS/sec MH    ESS/sec HMC
ρ      109.9     538.9      17.72         4.23
τ      612.0     —          98.70         48.67
β_0    2.1       —          0.36          2.15

Table 3.1: ESS and effective samples per second (ESS/sec) for each parameter of the CPUE CAR model from MH and HMC.

the use of the CAR model in a predictive capacity in the application may explain these differences.

Model predictions

In figure 3.2 the observed CPUE data are compared side by side with the predicted data from the CAR model using MH and HMC. The relative CPUE as predicted with HMC in comparison to the observed CPUE is shown in figure 3.3. The largest differences occur where very high or very low values of CPUE had been observed for a region.

3.3 Discussion

While fitting the CAR model for Lake Huron lake whitefish relative abundance, it was observed that the shape of an area has implications for the CAR model. Indeed, it is noted by Wall (2004) that the original use of the CAR model was for doubly infinite regular lattices, and when it is applied to finite, irregular lattices, the implied spatial correlations are not well understood. In general, Wall (2004) showed that the implied spatial correlations of CAR models for irregular lattices are unin-

Figure 3.2: Observed CPUE (left), beside CPUE as predicted by the CAR model from MH (center) and HMC (right) sampling.

Figure 3.3: A plot of the observed CPUE relative to CPUE as predicted by the CAR model using HMC sampling. Brighter colours (e.g. yellow vs. red) indicate that the predicted CPUE is relatively higher in comparison to the observed CPUE.

tuitive. This implies that when there is an emphasis on understanding the spatial structure of data, rather than other model coefficients, alternative methods should be used. As HMC is particularly effective in exploring correlated parameter spaces, it excels when used in a predictive capacity for CAR models. That is, when measures from spatially correlated regions are treated as model parameters, as CPUE is here, HMC is able to effectively explore that parameter space. The trace plots in figure 3.1 show poor mixing of the regional estimates for MH, and ideal mixing of these same estimates when using HMC. We can represent the performance of MH and HMC for generating regional estimates with a single parameter, β_0. Over the same number of iterations, MH had an effective sample size of only 2.1 for β_0, where HMC generated far more effective samples. The consequences of MH's poor mixing can be seen in the regional estimates in figure 3.2. In this figure it is clear that the regional estimates produced by MH have excessive noise in comparison to those produced by HMC. In order to use MH for this type of application, it would need to be run for many more iterations.

Chapter 4
Conclusions

CAR models are a popular choice for spatially correlated areal data. When these data are high resolution, inference on the parameters of the CAR model becomes increasingly computationally intensive. In these situations, the relative efficiency of MCMC methods becomes of critical importance. Our simulation study compares two broad categories of MCMC methods, MH and HMC. No research has previously been conducted on the relative merits of these different MCMC methods for specific use with CAR models. Our simulation study suggested that HMC is generally the preferred MCMC method for these types of models. HMC was more efficient in the majority of simulation runs, but less so under extreme scenarios, and with a declining margin for increasing resolutions. However, two HMC-based algorithms recently described by Chen et al. (2014) and Pakman and Paninski (2013) present promising strategies for each of these concerns. The application that accompanied this simulation study found that MH had greater efficiency in performing inference for the CAR model parameters, τ and ρ, in comparison to HMC. However, in predicting regional CPUE, HMC outperformed MH. This raises further questions regarding the impact of fitting CAR models to irregular lattices on computational efficiency. In their standard forms, we cannot say that HMC is unequivocally superior in efficiency to MH in performing inference for CAR models.


More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee September 03 05, 2017 Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles Linear Regression Linear regression is,

More information

Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

eqr094: Hierarchical MCMC for Bayesian System Reliability

eqr094: Hierarchical MCMC for Bayesian System Reliability eqr094: Hierarchical MCMC for Bayesian System Reliability Alyson G. Wilson Statistical Sciences Group, Los Alamos National Laboratory P.O. Box 1663, MS F600 Los Alamos, NM 87545 USA Phone: 505-667-9167

More information

MARKOV CHAIN MONTE CARLO

MARKOV CHAIN MONTE CARLO MARKOV CHAIN MONTE CARLO RYAN WANG Abstract. This paper gives a brief introduction to Markov Chain Monte Carlo methods, which offer a general framework for calculating difficult integrals. We start with

More information

Delayed Rejection Algorithm to Estimate Bayesian Social Networks

Delayed Rejection Algorithm to Estimate Bayesian Social Networks Dublin Institute of Technology ARROW@DIT Articles School of Mathematics 2014 Delayed Rejection Algorithm to Estimate Bayesian Social Networks Alberto Caimo Dublin Institute of Technology, alberto.caimo@dit.ie

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

A Level-Set Hit-And-Run Sampler for Quasi- Concave Distributions

A Level-Set Hit-And-Run Sampler for Quasi- Concave Distributions University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2014 A Level-Set Hit-And-Run Sampler for Quasi- Concave Distributions Shane T. Jensen University of Pennsylvania Dean

More information

MCMC Methods: Gibbs and Metropolis

MCMC Methods: Gibbs and Metropolis MCMC Methods: Gibbs and Metropolis Patrick Breheny February 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/30 Introduction As we have seen, the ability to sample from the posterior distribution

More information

arxiv: v4 [stat.co] 4 May 2016

arxiv: v4 [stat.co] 4 May 2016 Hamiltonian Monte Carlo Acceleration using Surrogate Functions with Random Bases arxiv:56.5555v4 [stat.co] 4 May 26 Cheng Zhang Department of Mathematics University of California, Irvine Irvine, CA 92697

More information

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC Stat 451 Lecture Notes 07 12 Markov Chain Monte Carlo Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapters 8 9 in Givens & Hoeting, Chapters 25 27 in Lange 2 Updated: April 4, 2016 1 / 42 Outline

More information

Bayesian Networks in Educational Assessment

Bayesian Networks in Educational Assessment Bayesian Networks in Educational Assessment Estimating Parameters with MCMC Bayesian Inference: Expanding Our Context Roy Levy Arizona State University Roy.Levy@asu.edu 2017 Roy Levy MCMC 1 MCMC 2 Posterior

More information

Bayesian Phylogenetics:

Bayesian Phylogenetics: Bayesian Phylogenetics: an introduction Marc A. Suchard msuchard@ucla.edu UCLA Who is this man? How sure are you? The one true tree? Methods we ve learned so far try to find a single tree that best describes

More information

Areal data models. Spatial smoothers. Brook s Lemma and Gibbs distribution. CAR models Gaussian case Non-Gaussian case

Areal data models. Spatial smoothers. Brook s Lemma and Gibbs distribution. CAR models Gaussian case Non-Gaussian case Areal data models Spatial smoothers Brook s Lemma and Gibbs distribution CAR models Gaussian case Non-Gaussian case SAR models Gaussian case Non-Gaussian case CAR vs. SAR STAR models Inference for areal

More information

The Metropolis-Hastings Algorithm. June 8, 2012

The Metropolis-Hastings Algorithm. June 8, 2012 The Metropolis-Hastings Algorithm June 8, 22 The Plan. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings

More information

Lecture 6: Markov Chain Monte Carlo

Lecture 6: Markov Chain Monte Carlo Lecture 6: Markov Chain Monte Carlo D. Jason Koskinen koskinen@nbi.ku.dk Photo by Howard Jackman University of Copenhagen Advanced Methods in Applied Statistics Feb - Apr 2016 Niels Bohr Institute 2 Outline

More information

A Bayesian Approach to Phylogenetics

A Bayesian Approach to Phylogenetics A Bayesian Approach to Phylogenetics Niklas Wahlberg Based largely on slides by Paul Lewis (www.eeb.uconn.edu) An Introduction to Bayesian Phylogenetics Bayesian inference in general Markov chain Monte

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach. Radford M. Neal, 28 February 2005

Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach. Radford M. Neal, 28 February 2005 Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach Radford M. Neal, 28 February 2005 A Very Brief Review of Gaussian Processes A Gaussian process is a distribution over

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational

More information

Web Appendices: Hierarchical and Joint Site-Edge Methods for Medicare Hospice Service Region Boundary Analysis

Web Appendices: Hierarchical and Joint Site-Edge Methods for Medicare Hospice Service Region Boundary Analysis Web Appendices: Hierarchical and Joint Site-Edge Methods for Medicare Hospice Service Region Boundary Analysis Haijun Ma, Bradley P. Carlin and Sudipto Banerjee December 8, 2008 Web Appendix A: Selecting

More information

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Local Likelihood Bayesian Cluster Modeling for small area health data. Andrew Lawson Arnold School of Public Health University of South Carolina

Local Likelihood Bayesian Cluster Modeling for small area health data. Andrew Lawson Arnold School of Public Health University of South Carolina Local Likelihood Bayesian Cluster Modeling for small area health data Andrew Lawson Arnold School of Public Health University of South Carolina Local Likelihood Bayesian Cluster Modelling for Small Area

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Jon Wakefield Departments of Statistics and Biostatistics University of Washington 1 / 37 Lecture Content Motivation

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

Marginal Specifications and a Gaussian Copula Estimation

Marginal Specifications and a Gaussian Copula Estimation Marginal Specifications and a Gaussian Copula Estimation Kazim Azam Abstract Multivariate analysis involving random variables of different type like count, continuous or mixture of both is frequently required

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

Lecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH

Lecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH Lecture 5: Spatial probit models James P. LeSage University of Toledo Department of Economics Toledo, OH 43606 jlesage@spatial-econometrics.com March 2004 1 A Bayesian spatial probit model with individual

More information

A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling

A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling G. B. Kingston, H. R. Maier and M. F. Lambert Centre for Applied Modelling in Water Engineering, School

More information

MIT /30 Gelman, Carpenter, Hoffman, Guo, Goodrich, Lee,... Stan for Bayesian data analysis

MIT /30 Gelman, Carpenter, Hoffman, Guo, Goodrich, Lee,... Stan for Bayesian data analysis MIT 1985 1/30 Stan: a program for Bayesian data analysis with complex models Andrew Gelman, Bob Carpenter, and Matt Hoffman, Jiqiang Guo, Ben Goodrich, and Daniel Lee Department of Statistics, Columbia

More information

Calibrating Environmental Engineering Models and Uncertainty Analysis

Calibrating Environmental Engineering Models and Uncertainty Analysis Models and Cornell University Oct 14, 2008 Project Team Christine Shoemaker, co-pi, Professor of Civil and works in applied optimization, co-pi Nikolai Blizniouk, PhD student in Operations Research now

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Machine Learning. Probabilistic KNN.

Machine Learning. Probabilistic KNN. Machine Learning. Mark Girolami girolami@dcs.gla.ac.uk Department of Computing Science University of Glasgow June 21, 2007 p. 1/3 KNN is a remarkably simple algorithm with proven error-rates June 21, 2007

More information

The Ising model and Markov chain Monte Carlo

The Ising model and Markov chain Monte Carlo The Ising model and Markov chain Monte Carlo Ramesh Sridharan These notes give a short description of the Ising model for images and an introduction to Metropolis-Hastings and Gibbs Markov Chain Monte

More information

PSEUDO-MARGINAL METROPOLIS-HASTINGS APPROACH AND ITS APPLICATION TO BAYESIAN COPULA MODEL

PSEUDO-MARGINAL METROPOLIS-HASTINGS APPROACH AND ITS APPLICATION TO BAYESIAN COPULA MODEL PSEUDO-MARGINAL METROPOLIS-HASTINGS APPROACH AND ITS APPLICATION TO BAYESIAN COPULA MODEL Xuebin Zheng Supervisor: Associate Professor Josef Dick Co-Supervisor: Dr. David Gunawan School of Mathematics

More information

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017 Markov Chain Monte Carlo (MCMC) and Model Evaluation August 15, 2017 Frequentist Linking Frequentist and Bayesian Statistics How can we estimate model parameters and what does it imply? Want to find the

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Convergence Diagnostics For Markov chain Monte Carlo. Eric B. Ford (Penn State) Bayesian Computing for Astronomical Data Analysis June 9, 2017

Convergence Diagnostics For Markov chain Monte Carlo. Eric B. Ford (Penn State) Bayesian Computing for Astronomical Data Analysis June 9, 2017 Convergence Diagnostics For Markov chain Monte Carlo Eric B. Ford (Penn State) Bayesian Computing for Astronomical Data Analysis June 9, 2017 MCMC: A Science & an Art Science: If your algorithm is designed

More information

Infinite Mixtures of Gaussian Process Experts

Infinite Mixtures of Gaussian Process Experts in Advances in Neural Information Processing Systems 14, MIT Press (22). Infinite Mixtures of Gaussian Process Experts Carl Edward Rasmussen and Zoubin Ghahramani Gatsby Computational Neuroscience Unit

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

Quasi-Newton Methods for Markov Chain Monte Carlo

Quasi-Newton Methods for Markov Chain Monte Carlo Quasi-Newton Methods for Markov Chain Monte Carlo Yichuan Zhang and Charles Sutton School of Informatics University of Edinburgh Y.Zhang-60@sms.ed.ac.uk, csutton@inf.ed.ac.uk Abstract The performance of

More information

Bayesian philosophy Bayesian computation Bayesian software. Bayesian Statistics. Petter Mostad. Chalmers. April 6, 2017

Bayesian philosophy Bayesian computation Bayesian software. Bayesian Statistics. Petter Mostad. Chalmers. April 6, 2017 Chalmers April 6, 2017 Bayesian philosophy Bayesian philosophy Bayesian statistics versus classical statistics: War or co-existence? Classical statistics: Models have variables and parameters; these are

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

DAG models and Markov Chain Monte Carlo methods a short overview

DAG models and Markov Chain Monte Carlo methods a short overview DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Hierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31

Hierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31 Hierarchical models Dr. Jarad Niemi Iowa State University August 31, 2017 Jarad Niemi (Iowa State) Hierarchical models August 31, 2017 1 / 31 Normal hierarchical model Let Y ig N(θ g, σ 2 ) for i = 1,...,

More information

Markov Chain Monte Carlo Using the Ratio-of-Uniforms Transformation. Luke Tierney Department of Statistics & Actuarial Science University of Iowa

Markov Chain Monte Carlo Using the Ratio-of-Uniforms Transformation. Luke Tierney Department of Statistics & Actuarial Science University of Iowa Markov Chain Monte Carlo Using the Ratio-of-Uniforms Transformation Luke Tierney Department of Statistics & Actuarial Science University of Iowa Basic Ratio of Uniforms Method Introduced by Kinderman and

More information

Kernel adaptive Sequential Monte Carlo

Kernel adaptive Sequential Monte Carlo Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline

More information

SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS. In general. Inference in Bayesian models SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

More information

Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Mela. P.

Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Mela. P. Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Melanie M. Wall, Bradley P. Carlin November 24, 2014 Outlines of the talk

More information

Spatial Statistics with Image Analysis. Outline. A Statistical Approach. Johan Lindström 1. Lund October 6, 2016

Spatial Statistics with Image Analysis. Outline. A Statistical Approach. Johan Lindström 1. Lund October 6, 2016 Spatial Statistics Spatial Examples More Spatial Statistics with Image Analysis Johan Lindström 1 1 Mathematical Statistics Centre for Mathematical Sciences Lund University Lund October 6, 2016 Johan Lindström

More information

Adaptive Monte Carlo methods

Adaptive Monte Carlo methods Adaptive Monte Carlo methods Jean-Michel Marin Projet Select, INRIA Futurs, Université Paris-Sud joint with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Large Scale Bayesian Inference

Large Scale Bayesian Inference Large Scale Bayesian I in Cosmology Jens Jasche Garching, 11 September 2012 Introduction Cosmography 3D density and velocity fields Power-spectra, bi-spectra Dark Energy, Dark Matter, Gravity Cosmological

More information

Efficiency and Reliability of Bayesian Calibration of Energy Supply System Models

Efficiency and Reliability of Bayesian Calibration of Energy Supply System Models Efficiency and Reliability of Bayesian Calibration of Energy Supply System Models Kathrin Menberg 1,2, Yeonsook Heo 2, Ruchi Choudhary 1 1 University of Cambridge, Department of Engineering, Cambridge,

More information

Discontinuous Hamiltonian Monte Carlo for discrete parameters and discontinuous likelihoods

Discontinuous Hamiltonian Monte Carlo for discrete parameters and discontinuous likelihoods Discontinuous Hamiltonian Monte Carlo for discrete parameters and discontinuous likelihoods arxiv:1705.08510v3 [stat.co] 7 Sep 2018 Akihiko Nishimura Department of Biomathematics, University of California

More information

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals

More information

Bayesian Areal Wombling for Geographic Boundary Analysis

Bayesian Areal Wombling for Geographic Boundary Analysis Bayesian Areal Wombling for Geographic Boundary Analysis Haolan Lu, Haijun Ma, and Bradley P. Carlin haolanl@biostat.umn.edu, haijunma@biostat.umn.edu, and brad@biostat.umn.edu Division of Biostatistics

More information

Comparison of Three Calculation Methods for a Bayesian Inference of Two Poisson Parameters

Comparison of Three Calculation Methods for a Bayesian Inference of Two Poisson Parameters Journal of Modern Applied Statistical Methods Volume 13 Issue 1 Article 26 5-1-2014 Comparison of Three Calculation Methods for a Bayesian Inference of Two Poisson Parameters Yohei Kawasaki Tokyo University

More information