CONDUCTING INFERENCE ON RIPLEY'S K-FUNCTION OF SPATIAL POINT PROCESSES WITH APPLICATIONS


CONDUCTING INFERENCE ON RIPLEY'S K-FUNCTION OF SPATIAL POINT PROCESSES WITH APPLICATIONS

By

MICHAEL ALLEN HYMAN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2013

© 2013 Michael Allen Hyman

To my loving family and friends

"The first law of ecology is that everything is related to everything else." - Barry Commoner

ACKNOWLEDGMENTS

First and foremost, I would like to acknowledge my advisor, Dr. Linda J. Young. Dr. Young, you were the person to convince me to pursue this degree four years ago and the person to push me every step of the way to see it finished. Throughout the past four years, you have devoted an enormous amount of your free time to helping and teaching me. You have also granted me many incredible opportunities and I thank you for every one of them. You have gone above and beyond the role of an advisor and I am forever grateful. I have learned so much about statistics and life and also have had a lot of fun working with you. You have taught me to stay calm when things don't work out and how to identify the problem, whether it is big or small. Thank you for everything you have done.

I would also like to acknowledge all of my professors at the University of Florida. Specifically, I would like to thank Dr. George Casella and Dr. Nikolay Blitznuk for going out of their way to help students in IFAS statistics. I would also like to acknowledge Dr. Christina Staudhammer for introducing me to applied statistical research and Dr. Mihai Giurcanu for always finding the time to help me when I needed it. Nikolay, thank you for taking the initiative to start the R workshop this past year. Not only have I become a better programmer, but I would not have finished the simulations for this dissertation without learning to use the HPC. Christie, before becoming one of your students I had spent two years in classrooms learning the theoretical components of statistics. Working on the first applications inspired a love for the subject that before I had only appreciated and gave me the motivation to continue learning. Thank you for sticking by as my co-advisor despite the distance. Dr. Casella, you have inspired so many people with your passion for statistics and love of life. Despite being incredibly busy, you always found time to help students, no matter how small the problem. You are a true inspiration and I am so happy to have had the opportunity to know you and learn from you.

Finally, I would like to acknowledge my family and friends. To my Mom and Dad, thank you for convincing me (or making me) to go to graduate school. You have shown your love and support in so many ways. Thank you for always being there when I needed you. To my sister Casey and brother-in-law Elliot, thank you for the endless support and encouragement. To my girlfriend, Whitney, thank you for your love and support over the past four years. I would not have been able to do this without you working by my side. Finally, I'd like to acknowledge all of my friends in the SNRE and statistics department. Thank you, Emily and Dan, for always being there when a happy hour was needed. Nate and Kenny, thank you for being such good friends and making these past few years so much fun. I would not have been able to complete this degree without you guys walking the road ahead of me. Thank you for all your help.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 BACKGROUND AND MOTIVATION
  Introduction
  Point Process Summary
  Inference on Point Processes - Confidence Intervals
  Inference on Point Processes - Hypothesis Testing
  Objectives

2 CONFIDENCE INTERVALS FOR RIPLEY'S K-FUNCTION
  Introduction
  Bootstrapping a Spatial Point Pattern
  Network Resampling to Construct Confidence Intervals of K(r)
  Unbiasedness of Bootstrapped Estimator of K(r)
  Simulation Study
  Results
  Discussion

3 HYPOTHESIS TESTING FOR RIPLEY'S K-FUNCTION
  Introduction
  Hahn's (2012) Studentized Permutation Test
  Proposed Test Statistic
  Proposed Test Statistic
  Simulation Study
  Results
  Discussion

4 APPLICATION OF METHODS WITH JOSEPH W. JONES ECOLOGICAL RESEARCH CENTER DATA
  Introduction
  Joseph W. Jones Ecological Research Center
  Exploratory Data Analysis
  Estimation of Confidence Intervals for the K-Function
  Hypothesis Testing of the K-Function
    4.5.1 Hypothesis
    Hypothesis
    Hypothesis
  Discussion

5 FUTURE WORK
  Estimating Confidence Intervals for the K-function
  Appropriate Number of Networks to Resample
  Extension to Inhomogeneous Spatial Point Processes
  Bayesian Methods of Inference
  Hypothesis Testing for the K-function

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
2-1 The processes and their respective parameters used in the simulation study
Number of each tree classification observed in each plot of the Joseph W. Jones Ecological Research Center
Estimated deviation from stationarity for each pattern observed at each plot. The unit of measurement for each plot is 1 meter x 1 meter
P-values for tests of Hypothesis 1 using adult and juvenile trees in all plots
P-values for tests of Hypothesis 2 using adult trees in Plot 1 (wiregrass understory) and Plot 2 (old-field plot)
P-values for tests of Hypothesis 2 using juvenile pine trees in Plot 1 (wiregrass understory) and Plot 2 (old-field plot)
P-values for tests of Hypothesis 3 using juvenile pine trees in Plot 1 (single tree harvesting) and Plot 3 (control - no harvesting)

LIST OF FIGURES

Figure
1-1 Sample patterns and their resulting K and L functions
Example of tiling method
Example of Loh and Stein's marked point method
Example of toroidal wrapping
Example of Loh and Stein's marked point method
Dendrogram of networking method
Realizations of point patterns for confidence interval simulation
Percent coverages for Poisson patterns with intensity=
Percent coverages for Poisson patterns with intensity=
Percent coverages for Poisson patterns with intensity=
Percent coverages for softcore patterns with intensity=
Percent coverages for softcore patterns with intensity=
Percent coverages for softcore patterns with intensity=
Percent coverages for Matern clustered patterns with intensity=
Percent coverages for Matern clustered patterns with intensity=
Percent coverages for Matern clustered patterns with intensity=
Percent coverages for Matern clustered patterns with intensity=
Confidence interval widths for Poisson patterns with intensity=
Confidence interval widths for Poisson patterns with intensity=
Confidence interval widths for Poisson patterns with intensity=
Confidence interval widths for softcore patterns with intensity=
Confidence interval widths for softcore patterns with intensity=
Confidence interval widths for softcore patterns with intensity=
Confidence interval widths for Matern clustered patterns with intensity=
Confidence interval widths for Matern clustered patterns with intensity=
2-23 Confidence interval widths for Matern clustered patterns with intensity=
Confidence interval widths for Matern clustered patterns with intensity=
Sizes of tests for Poisson point patterns of varying intensities
Sizes of tests for softcore point patterns of varying intensities
Sizes of tests for Matern point patterns 1 of varying intensities
Sizes of tests for Matern point patterns 2 of varying intensities
Sizes of tests for hardcore point patterns of varying intensities
Sizes of tests for Poisson point patterns when patterns have different intensities
Powers of tests comparing Matern 1 patterns and Poisson patterns of varying intensities
Powers of tests comparing Matern 2 patterns and Poisson patterns of varying intensities
Powers of tests comparing softcore patterns and Poisson patterns of varying intensities
Powers of tests comparing hardcore patterns and Poisson patterns of varying intensities
Sizes of tests for Poisson point patterns using different numbers of quadrats
Sizes of tests for Matern 1 point patterns using different numbers of quadrats
Sizes of tests for Matern 2 point patterns using different numbers of quadrats
Sizes of tests for softcore point patterns using different numbers of quadrats
Sizes of tests for hardcore point patterns using different numbers of quadrats
Sizes of tests for Poisson point patterns with different intensities and different numbers of quadrats
Powers of tests comparing Matern 1 patterns and Poisson patterns using different numbers of quadrats
Powers of tests comparing Matern 2 patterns and Poisson patterns using different numbers of quadrats
Powers of tests comparing softcore patterns and Poisson patterns using different numbers of quadrats
Powers of tests comparing hardcore patterns and Poisson patterns using different numbers of quadrats
4-1 Locations of trees in three plots of the Joseph W. Jones Research Center
Estimated K-functions from patterns of trees in three plots
Cross-K functions of adult and juvenile pine trees
Confidence intervals for adult trees, Plot
Confidence intervals for juvenile pine trees, Plot
Confidence intervals for adult trees, Plot
Confidence intervals for juvenile pine trees, Plot
Confidence intervals for adult trees, Plot
Confidence intervals for juvenile pine trees, Plot

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

CONDUCTING INFERENCE ON RIPLEY'S K-FUNCTION OF SPATIAL POINT PROCESSES WITH APPLICATIONS

By

Michael Allen Hyman

August 2013

Chair: Linda J. Young
Major: Interdisciplinary Ecology

In many sciences, spatially-referenced data are collected and analyzed. In some cases, these data represent the locations of a set of events recorded over an area. Collections of these data are referred to as point patterns and the underlying distributions determining these data are point processes. Specifically, when data are collected in 2-dimensional space, the underlying distributions for these data are referred to as spatial point processes. In these cases, analysis is typically focused on the number of points observed and the locations of these events relative to one another. Many summary functions have been developed to describe the interaction among points of spatial point patterns. Ripley's K-function is a commonly used function to describe the spatial interaction of an observed set of points at a range of distances. The statistical properties of this function and its estimators are unknown in most cases. Thus, empirical methods are typically used to conduct inference on the K-function of an observed point pattern. In this work, we propose several new methods for conducting inference on Ripley's K-function.

A new method of setting confidence intervals for Ripley's K-function is proposed using a bootstrap technique for spatial point patterns. The proposed method accounts for the intensity and interaction among points in the pattern and adjusts the bootstrap sample accordingly. Confidence intervals are estimated using the quantiles from bootstrap estimates of the K-function. The variance of the proposed bootstrap estimator more closely approximates the variance of the estimator of the K-function than do many current methods of bootstrapping spatial point patterns. A simulation study is conducted to compare this new method to current methods of interval estimation. The percent coverage of the resulting confidence intervals for the K-function and confidence interval widths are determined for the proposed method and current bootstrap methods using point processes with different intensities and interactions among points.

A hypothesis test used to compare the K-function across multiple observed patterns is proposed. The purpose of the proposed test is to compare the K-function from single realizations of spatial point processes. Here, two test statistics are proposed using different methods to account for the heteroskedasticity of the K-function at larger distances. A permutation test is used to calculate p-values for the tests. A simulation study compares the proposed test to an existing permutation test using processes with varying intensities and interactions among points. The sizes of the proposed tests are better controlled when testing patterns with different intensities.

The proposed methods for conducting inference on Ripley's K-function are applied to several point patterns recorded at the Joseph W. Jones Ecological Research Center. These patterns represent the locations of longleaf pine trees on plots with different understory composition and different harvesting schemes. The interaction of adult trees and juvenile pine trees is assessed for plots with different treatments and/or forest characteristics. Conclusions drawn from this analysis are helpful for management and conservation efforts of longleaf pine forests.

CHAPTER 1
BACKGROUND AND MOTIVATION

1.1 Introduction

Spatially-referenced data are commonly observed in ecology as well as other disciplines. In general, spatial data present challenges in analysis because of spatial correlation among the observations. The two basic types of spatial data are geostatistical data and point process data. In geostatistics, geographically-referenced, quantitative random variables are observed at a set of locations. Inference is conducted on the distribution of this quantitative variable, adjusting for the spatial covariance structure. A point process is an underlying process giving rise to spatially-referenced events, and a spatial point pattern is a realization of that process. The distribution of all possible realizations is a point process distribution. For point processes, the events themselves are the observations, and interest is in the number of points located in an area and the position of the points relative to one another. The focus of this work is inferential procedures for spatial point processes.

A spatial point pattern may be recorded as a marked point process, in which a categorical or quantitative mark is associated with each of the points. This mark might represent the species or diameter of a tree observed at a particular location. Marked point processes can help assess the interaction among several classifications of points.

A point pattern is observed in d-dimensional space, where d is typically 1, 2, or 3 in most applications. The locations of cellular phone towers in a state are an example of a 2-dimensional spatial point pattern. Similarly, the locations of cell nuclei on a piece of brain tissue or the locations of galaxy clusters in the observable universe are examples of 2-dimensional and 3-dimensional point patterns, respectively, observed at vastly different scales.
In ecology, point processes can represent the locations of some species of interest or the locations at which an event that affects the ecosystem occurs. Examples include the locations of nesting sites of endangered species or the locations of a particular invasive plant species recorded over an area. A specific example is given in Chapter 4, where the marked point patterns contain the locations of trees in several plots of land in the Joseph W. Jones Ecological Research Center. The marks are an age classification, labeling each tree as either adult or juvenile. As with these data, spatially-referenced ecological data are recorded in d = 2 dimensions. For the remainder of this dissertation, a point process will represent the general process in some d-dimensional space, and a spatial point process will specifically refer to a point process in 2 dimensions.

Recently, ecological modeling methods have been proposed that utilize point process distributions (Illian and Burslem, 2007; Illian et al., 2009; Warton and Shephard, 2010). However, many of the statistical properties of the distributions themselves are unknown, and conducting inference is challenging. Summary statistics have been developed to assess the quantities of interest for point processes (e.g., the mean number of points per unit area or the interaction among points). The statistical distributions of these statistics are generally unknown, except for the simplest cases. Therefore, empirical methods of conducting inference have been proposed to analyze observed point patterns. The purpose of this dissertation is to expand upon these methods and to propose new methods for conducting inference on spatial point processes. In Chapter 4, we apply these new methods and some existing methods to several recorded spatial point patterns to demonstrate how they can be used to inform management decisions in forestry.

1.2 Point Process Summary

A spatial process is a set of events $Z$ occurring at locations $s$ in a set $X$, a subset of $\mathbb{R}^d$; that is, $\{Z(s) : s \in X \subset \mathbb{R}^d\}$. For spatial point processes, $X$ is a subset of $\mathbb{R}^2$. In point process theory, the exact locations of the points $\{s_1, s_2, \ldots, s_n\}$ are typically set aside in favor of more convenient notation.
Let $N$ be a counting measure on $X$. For each Borel set $A$, $N(A)$ represents the number of events in $A$. Thus $N(A) \in \{0, 1, 2, \ldots\}$ for all $A \in \mathcal{X}$, where $\mathcal{X}$ represents the Borel σ-field of $X \subset \mathbb{R}^2$. If $N(A)$ is known for every set $A \in \mathcal{X}$, then this is equivalent to knowing all locations of events $\{s_1, \ldots, s_n\}$. If the counting measure $N$ is locally finite, then $N(A) < \infty$ for all bounded sets $A \in \mathcal{X}$ (Cressie, 1991). Here we also make the assumption that point processes are simple. That is, the probability of observing more than one point at a given location is 0. This assumption often holds in ecological applications, and violations can make interpretation and some inferential methods difficult.

In the above notation, $X$ can be thought of as a surface over which the statistical distribution is found. This statistical distribution is defined by the counting measure $N$. For any observed, bounded region $A$ of this surface, the number of points $N(A)$ observed in $A$ is a random variable with a probability assigned to each possible value ($N(A) = 0, 1, 2, \ldots$). Notating a point process in this way allows us to assign probability distributions to a process observed on a bounded, closed region and to conduct basic inference on point patterns.

Similar to quantitative distributions, point processes can be defined by their moments. The first-order moment of a point process is referred to as the first-order intensity of the process (often simply the intensity) and is analogous to the mean of a quantitative distribution. Formally, if $\nu_d$ represents Lebesgue measure on $D \subset \mathbb{R}^d$ and $N(ds)$ represents the counting measure on the infinitesimally small Borel set $ds \subset D$ centered at point $s$, then

$$\lambda(s) = \lim_{|ds| \to 0} \frac{E[N(ds)]}{\nu_d(ds)}. \tag{1-1}$$

In the case where $d = 2$, $\nu_d(A)$ represents the area of the Borel set $A \subset D$, and if $d = 3$, $\nu_d(A)$ represents the volume of $A \subset D$. If the process is stationary, then $\lambda(s)$ is constant for all locations $s$ and the intensity ($\lambda$) of a point process is the expected number of points per unit area,

$$\lambda = E[\text{number of points per unit area}]. \tag{1-2}$$
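The counting-measure notation and the intensity in (1-1) and (1-2) can be illustrated numerically. The following is a minimal sketch, not part of the original analysis: it assumes a homogeneous Poisson process on the unit square as a convenient stationary example, and the function names `simulate_csr` and `count_in_box`, the intensity value 50, and the chosen box are all illustrative choices. It evaluates $N(A)$ for a fixed box $A$ over many simulated patterns and divides the average count by the area of $A$.

```python
import numpy as np


def simulate_csr(lam, width, height, rng):
    """Simulate a homogeneous Poisson pattern on [0, width) x [0, height)."""
    n = rng.poisson(lam * width * height)  # random total number of points
    return rng.uniform([0.0, 0.0], [width, height], size=(n, 2))


def count_in_box(points, x0, x1, y0, y1):
    """Counting measure N(A) for the box A = [x0, x1) x [y0, y1)."""
    inside = (
        (points[:, 0] >= x0) & (points[:, 0] < x1)
        & (points[:, 1] >= y0) & (points[:, 1] < y1)
    )
    return int(inside.sum())


rng = np.random.default_rng(0)
lam = 50.0                      # assumed true intensity (points per unit area)
box = (0.2, 0.4, 0.5, 0.9)      # A = [0.2, 0.4) x [0.5, 0.9)
area = (box[1] - box[0]) * (box[3] - box[2])

counts = [count_in_box(simulate_csr(lam, 1.0, 1.0, rng), *box)
          for _ in range(2000)]

# E[N(A)] / nu(A) approximates the constant intensity of a stationary process.
intensity_estimate = float(np.mean(counts)) / area
print(intensity_estimate)
```

Because the simulated process is stationary, the same ratio should be close to the assumed intensity for any box, which is the sense in which $\lambda$ is "points per unit area".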

If the process is heterogeneous, then $\lambda(s)$ represents the intensity of the process at location $s \in X$. The intensity may be a function of a set of covariates whose values are recorded at each location $s$. In this case, the average intensity over an area $A$ can be determined by

$$\lambda_A = \frac{1}{|A|} \int_A \lambda(s)\,ds. \tag{1-3}$$

Under the assumption of stationarity, an unbiased estimator of the intensity of a process is

$$\hat{\lambda} = \frac{n}{|A|}, \tag{1-4}$$

where $n$ is the observed number of points in region $A$. Here we let $|A|$ refer to the area of a bounded region $A$ in 2-dimensional space.

Stationary, isotropic point processes can be categorized as one of three classes: completely random, spatially clustered, or regular. A point process is completely random, and is said to exhibit complete spatial randomness (CSR), if 1) the average number of points per unit area is constant over the entire area $A$ and 2) the numbers of events in two non-overlapping Borel sets, $A_1$ and $A_2$, are independent of one another. In this case, the number of events in $A$ is Poisson distributed (Schabenberger and Gotway, 2005). A completely spatially random (CSR) process is also known as a homogeneous Poisson point process. Under CSR, each event is equally likely to occur at any location in the region (Cressie, 1991; Illian et al., 2008). More formally, the counting measure defined on a set $A$ is distributed as

$$P(N(A) = n) = \frac{e^{-\lambda \nu(A)} (\lambda \nu(A))^n}{n!}, \tag{1-5}$$

where $\nu(A)$ represents the volume of $A$ in d-dimensional space (the area if $d = 2$) and $\lambda$ represents the mean number of points per unit area (constant at all locations).

For spatial point processes, the first null hypothesis tested is generally that a pattern exhibits CSR. If a pattern exhibits CSR, points are dispersed randomly and a researcher can do little more to summarize or explain them. If this null hypothesis is rejected, a researcher can conclude that there is heterogeneity or interaction among events and may choose to investigate the intensity function or interpoint dependence in the pattern (Schabenberger and Gotway, 2005).

A binomial point process is closely related to a homogeneous Poisson process and results from a Poisson process conditional on $n$ points being observed. In this case, if a pattern is a realization of a homogeneous Poisson process with $n$ points observed on $A$, then the number of points in any subset $A_{sub} \subset A$ is distributed as a binomial random variable with mean $n|A_{sub}|/|A|$.

Although the homogeneous Poisson process is a common benchmark distribution against which observed realizations are tested, it is rarely seen in practice. Points recorded in disjoint regions might indeed be independent of one another, yet the density at which the points occur may not be homogeneous. In other cases, a constant density might be observed across a sample area, but the events are not independent of one another. Interactions might exist among points, making it more or less probable that another point is located nearby relative to CSR.

General classifications for stationary patterns that depart from CSR are clustered and regular. In regular patterns, the probability of observing a point near an arbitrary point in the pattern is smaller, and thus the expected number of points within a given radius of an arbitrary point is also smaller, than it is under the assumption of CSR. An example might be the locations of nesting sites of a species whose members tend to avoid one another due to competition. For clustered patterns, the probability of observing points that are near one another is greater than under CSR. Thus the expected number of points within a given radius of a particular point is greater than under CSR. Trees may be clustered at early stages of forest development because parent trees drop seeds from which seedlings grow.
However, trees compete for the same resources and, after some time, may have a regular pattern.

The second-order intensity of a point process describes the interaction or relative position among points. If $s_i$ and $s_j$ are two points in d-dimensional space and $ds_i$ and $ds_j$ are the infinitesimally small balls centered at these points, then the second-order intensity of a point process is

$$\lambda_2(s_i, s_j) = \lim_{\nu_d(ds_i) \to 0,\ \nu_d(ds_j) \to 0} \frac{E[N(ds_i)N(ds_j)]}{\nu_d(ds_i)\,\nu_d(ds_j)}. \tag{1-6}$$

Interpretation of the second-order intensity is difficult, and simpler measures to assess the dependency among points are desired. Thus, Ripley introduced the use of the K-function (also known as the reduced second moment measure) to assess the second-order characteristics of stationary, isotropic processes (Ripley, 1976). The K-function is a function of distance $r$:

$$K(r) = \frac{2\pi}{\lambda^2} \int_0^r x\,\lambda_2(x)\,dx. \tag{1-7}$$

For a simple process, $\lambda K(r)$ represents the expected number of further points within distance $r$ of an arbitrary point in the process, and thus

$$K(r) = \frac{E[N(s_0, r) \mid \text{point at } s_0]}{\lambda}, \tag{1-8}$$

where $N(s_0, r)$ counts the other points within distance $r$ of $s_0$.

The moments of point processes can be used to define some of the typical assumptions that are made when conducting inference on point processes. A point process is considered homogeneous if the first-order moment (the intensity) is constant over space. A process is considered stationary if the process is invariant to translations (Schabenberger and Gotway, 2005). A process is called isotropic if it is invariant to rotations around a point (the second-order moment depends only on the distance between two events) (Schabenberger and Gotway, 2005). Together, $K(r)$ and $\lambda$ define the first and second moments of a stationary and isotropic process (Stoyan et al., 1995). However, just as the mean and covariance of two random variables do not provide a complete description of their bivariate distribution, the first- and second-order intensity measures give an incomplete description of a point process (Baddeley and Silverman, 1984).
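The interpretation of $\lambda K(r)$ in (1-8) as an expected neighbour count can be checked by simulation. The sketch below is illustrative only and rests on stated assumptions: points are simulated from a homogeneous Poisson process on the unit square, distances are measured with periodic (wrapped) boundaries so that every point has a full disc of radius $r$ around it (valid only for $r$ less than half the side length), and the helper names `torus_pair_distances` and `k_by_neighbour_count` are hypothetical. For a homogeneous Poisson process the result should be close to the area of a disc of radius $r$.

```python
import numpy as np


def torus_pair_distances(points, side=1.0):
    """Pairwise distances on a square with periodic (wrapped) boundaries."""
    diff = np.abs(points[:, None, :] - points[None, :, :])
    diff = np.minimum(diff, side - diff)   # shortest way around the torus
    return np.sqrt((diff ** 2).sum(axis=-1))


def k_by_neighbour_count(points, r, side=1.0):
    """Average number of other points within r, divided by estimated intensity."""
    n = len(points)
    lam_hat = n / side ** 2
    d = torus_pair_distances(points, side)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbour
    mean_neighbours = (d < r).sum() / n
    return mean_neighbours / lam_hat


rng = np.random.default_rng(1)
r = 0.05
estimates = []
for _ in range(200):
    n = rng.poisson(400)                   # assumed intensity of 400 per unit area
    pts = rng.uniform(0.0, 1.0, size=(n, 2))
    estimates.append(k_by_neighbour_count(pts, r))

k_mean = float(np.mean(estimates))
k_disc = np.pi * r ** 2                    # area of a disc of radius r
print(k_mean, k_disc)
```

The wrapped distance is used here only to sidestep edge effects in a toy setting; it is not a substitute for the edge corrections used with real, bounded observation windows.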

The K-function has several properties that make it the most common function for assessing the second-order properties of an observed pattern. Because $\lambda K(r)$ represents the average number of points within distance $r$ of an arbitrary point in the pattern, it is easily interpretable. The second-order intensity for a process can be derived if its K-function is known (Schabenberger and Gotway, 2005). The K-function is invariant to the intensity of a process, so the second-order characteristics of patterns can be compared despite differences in the numbers of points (Baddeley et al., 2000). The K-function can be used to observe the interaction among points at a range of distances (Schabenberger and Gotway, 2005). This can be useful for processes that exhibit one particular type of interaction at small distances and a different form of interaction at larger distances. Finally, the K-function is invariant to observations missing completely at random (Schabenberger and Gotway, 2005).

The theoretical value for $K(r)$ under CSR is a function of $r$, $K(r) = \pi r^2$, the area of the circle of radius $r$:

$$K(r) = \frac{E[\#\text{ of points} < r \text{ from an arbitrary point in the observed region}]}{\lambda} = \frac{E\left[\dfrac{e^{-\lambda a}(\lambda a)^n}{n!}\right]}{\lambda} = \frac{\lambda \pi r^2}{\lambda} = \pi r^2. \tag{1-9}$$

Here the middle term denotes the mean of a Poisson count with parameter $\lambda a$, where $a = \pi r^2$ is the area of the circle.

Several transformations of the K-function have also been proposed. The most common of these is the L-function, which is used to correct for the heteroscedasticity of the K-function (Besag, 1977):

$$L(r) = \sqrt{\frac{K(r)}{\pi}}. \tag{1-10}$$
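As a cross-check on the CSR value, the same result can be obtained directly from definition (1-7). The short derivation below is a sketch that relies on one standard fact not stated above: for a homogeneous Poisson process, counts in disjoint sets are independent, so the second-order intensity is constant, $\lambda_2(x) = \lambda^2$.

```latex
K(r) = \frac{2\pi}{\lambda^{2}} \int_{0}^{r} x \, \lambda_{2}(x) \, dx
     = \frac{2\pi}{\lambda^{2}} \int_{0}^{r} x \, \lambda^{2} \, dx
     = 2\pi \cdot \frac{r^{2}}{2}
     = \pi r^{2}
```

The intensity cancels, which is consistent with the earlier remark that the K-function does not depend on $\lambda$.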

Thus, in the case of a homogeneous Poisson process, where $K(r) = \pi r^2$, $L(r) = r$. Figure 1-1 shows realizations of three spatial point processes: a CSR pattern, a clustered pattern, and a regular pattern. The corresponding K and L functions are also shown for each of the patterns.

To estimate $K(r)$, the term $\lambda^2 \nu_d(A) K(r)$ is typically estimated and divided by an estimator of $\lambda^2 \nu_d(A)$. A naive estimator of $\lambda^2 \nu_d(A) K(r)$ is

$$\sum_{x \in A} \sum_{y \neq x} I\{\|y - x\| \le r\},$$

where $I$ represents the indicator function and $\|\cdot\|$ represents Euclidean distance. However, this estimator is biased low. The bias is due to points lying outside the boundaries of region $A$ that are not observed, but are still within distance $r$ of a point in the pattern. That is, if point $z$ is unobserved because it lies outside of $A$ but $\|x - z\| \le r$, then this point pair is not included in the summation. This can make a substantial difference for values of $r$ that are large relative to $\nu_d(A)$.

Multiple methods can be used to obtain more accurate or unbiased estimators of $K(r)$ despite being unable to observe points outside of the region's boundary. Most methods assign a weight $w(x, y)$ to each pair of points $(x, y)$. Ripley's isotropic edge correction (Ripley, 1988) is a common edge correction weight that is useful under the assumptions of stationarity and isotropy. Let $A$ be the 2-dimensional region in which a realization of a process is observed. Let $x, y \in A$ be points observed in $A$. If $C$ represents the circumference of the circle centered at point $x$ and passing directly through point $y$, and $C_A$ represents the length of $C$ that lies inside $A$, then the weight is $w(x, y) = (C_A / C)^{-1}$. That is, $w(x, y)$ is the reciprocal of the proportion of the circumference of the circle centered at $x$ and passing through $y$ that lies inside of the region $A$. This implies that, if a point fell directly on a straight boundary of the region, any circle with an arbitrary radius $r$ centered at this point would fall half inside the region.
Therefore, any point falling within distance $r$ of this point would be weighted by two, because it would be equally likely to have observed another point outside of the boundary. Calculating this weight for each point pair observed in $A$, an estimator of $K(r)$ is

$$\hat{K}(r) = \frac{|A|}{n^2} \sum_{x \in A} \sum_{y \neq x} w(x, y)\, I(\|x - y\| < r). \tag{1-11}$$

This estimator is slightly biased but is the most common because the bias is low and it is relatively easy to calculate, especially for rectangular or circular windows (Ohser, 1983). $\hat{K}(r)$ has been shown to be asymptotically normal as the number of points approaches infinity (Ripley, 1981). It is also consistent for $K(r)$ for ergodic point processes as the size of the study area increases in $\mathbb{R}^2$ (Nguyen and Zessin, 1979). Ripley's isotropic edge correction can easily be extended to patterns in d dimensions where $d > 2$. In addition, other estimators and edge correction weights have been suggested that result in an unbiased estimate of $K(r)$ (Cressie, 1991; Ohser, 1983). Ripley's weights are easily calculated for a rectangular or circular window; however, other weights may be more appropriate for more complex windows (Ohser, 1983).

1.3 Inference on Point Processes - Confidence Intervals

For spatial point patterns, inference must be conducted on the information provided by the points themselves. Because interpoint distances are some of the most informative and defining characteristics of a particular point process, the summary statistics used to describe a pattern, such as the K-function, tend to be functions of distance. Second-order analysis of point processes is a relatively unexplored area of statistics, so many of the original methods remain the most commonly used techniques today. Exact distributions of the common estimators and the statistical properties of the processes are unknown, except under CSR or other simple processes. Thus, work on spatial point processes tends to be empirical in nature. Here we review the previous work in the area of interval estimation and hypothesis testing for point processes or for a specific parameter of a point process. Note that, although these works are related, they may have different objectives.
For instance, the K-function can be identical for processes having different second-order structures (Baddeley et al., 2000; Baddeley and Silverman, 1984). Thus testing the equivalence of $K(r)$ for multiple processes is not equivalent to testing whether the second-order intensities of the processes are equal. However, testing the equivalence of the K-function can result in valuable information about the second-order structures of the processes. The motivations for the following works range from confidence interval calculation and hypothesis testing under model specification to testing process equivalence.

Ripley's development of the K-function (Ripley, 1976) allowed researchers to interpret the spatial dependence inherent in a point pattern. Monte Carlo approaches are the foundation for the most widely applied inferential methods for $K(r)$ and other summary statistics for point processes (Barnard, 1963; Besag and Diggle, 1977; Chiu, 2007; Hope, 1968; Koen, 1991). Most commonly, an observed statistic is compared to the distribution of the statistic under the assumption of CSR. Suppose an observed point pattern contains $n$ points. To determine a simulation envelope for a parameter under the assumption of CSR, $n$ points are simulated randomly in a finite region $A$, and the estimate of that parameter is calculated. This is replicated $B$ times, resulting in a simulation envelope. For example, to find an interval for the K-function under CSR, $B$ homogeneous Poisson patterns are simulated and $\hat{K}_i(r)$ for $i = 1, \ldots, B$ calculated. Then the 100% simulation envelope is given by

$$\hat{K}_l(r) = \min_{i=1,\ldots,B} \{\hat{K}_i(r)\} \quad \text{and} \quad \hat{K}_u(r) = \max_{i=1,\ldots,B} \{\hat{K}_i(r)\}. \tag{1-12}$$

To test whether the observed pattern is from a CSR process, the observed K-function is compared to this simulated envelope. Many methods of comparison can be used, and Cressie (1991) suggests a test statistic of the following form, integrated up to a chosen maximum distance $r_0$,

$$TS = \int_0^{r_0} \left\{ \left(\hat{K}(r)\right)^{1/2} - \pi^{1/2} r \right\}^2 dr, \tag{1-13}$$

where this statistic is calculated for the observed pattern and for each of the simulated patterns from a CSR process. An observed test statistic that is greater than any calculated under CSR would imply deviation from CSR, whether from clustering, regularity, or a combination of the two. Tests of this nature can be used in ecology to test for pattern transference (such as the observed locations of migratory animals over different regions) and space-time interactions (such as testing for contagion in the observed locations of a particular event) by specifying a null model exhibiting the absence of these interactions and comparing the observed K-function with K-functions simulated from the null model (Besag and Diggle, 1977).

Monte Carlo methods for testing a null hypothesis have several advantages. An approximation of the distribution of the test statistic is not necessary, and thus the p-values are exact in that sense (Schabenberger and Gotway, 2005). They are also flexible in that the null hypothesis can easily be adapted to test complex point process distributions. However, many critical choices are left to the researcher, such as the number of simulations to perform and the distance up to which the test statistic is evaluated. Diggle and others (Diggle, 1977, 1979; Ho and Chiu, 2006) explored these choices as well as additional test statistics.

Monte Carlo methods are beneficial under the assumption of a specified null model. Methods of model fitting and parameter estimation exist for point process data. However, variation in realizations from a parametric point process model can be great, and the power and confidence of the respective hypothesis tests and confidence intervals are contingent on the accuracy of the fitted models (Loh and Stein, 2004).
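The Monte Carlo procedure described above (the envelope in Equation 1-12 and a test based on the statistic in Equation 1-13) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited works: the window is taken to be the unit square, no edge correction is applied (all weights equal 1), the integral is approximated by a Riemann sum up to an assumed upper limit of r = 0.25, and the rank-based p-value is the usual Monte Carlo convention. The names `k_curve`, `cressie_statistic`, and `csr_monte_carlo` are hypothetical.

```python
import math
import random

def k_curve(points, r_grid, area=1.0):
    """Uncorrected K estimate, |A| / n^2 * #{ordered pairs closer than r}, for each r."""
    n = len(points)
    dists = [math.dist(points[i], points[j])
             for i in range(n) for j in range(n) if i != j]
    return [area * sum(d < r for d in dists) / n ** 2 for r in r_grid]

def cressie_statistic(k_values, r_grid):
    """Riemann-sum approximation of the integral of (sqrt(K_hat(r)) - sqrt(pi) * r)^2."""
    step = r_grid[1] - r_grid[0]
    return step * sum((math.sqrt(k) - math.sqrt(math.pi) * r) ** 2
                      for k, r in zip(k_values, r_grid))

def csr_monte_carlo(observed, r_grid, n_sim=99, seed=0):
    """Pointwise min/max envelope and Monte Carlo p-value under CSR on the unit square."""
    rng = random.Random(seed)
    n = len(observed)
    # Each simulated pattern has the same number of points as the observed one.
    sim_curves = [k_curve([(rng.random(), rng.random()) for _ in range(n)], r_grid)
                  for _ in range(n_sim)]
    lower = [min(c[m] for c in sim_curves) for m in range(len(r_grid))]
    upper = [max(c[m] for c in sim_curves) for m in range(len(r_grid))]
    ts_obs = cressie_statistic(k_curve(observed, r_grid), r_grid)
    ts_sim = [cressie_statistic(c, r_grid) for c in sim_curves]
    # The observed statistic is ranked among the n_sim + 1 available values.
    p_value = (1 + sum(t >= ts_obs for t in ts_sim)) / (n_sim + 1)
    return lower, upper, p_value

# Assumed evaluation grid: r = 0.025, 0.05, ..., 0.25.
r_grid = [0.025 * m for m in range(1, 11)]
# A tightly clustered 5 x 5 lattice, which should be flagged as non-CSR.
clustered = [(0.5 + 0.01 * i, 0.5 + 0.01 * j) for i in range(5) for j in range(5)]
lower, upper, p_value = csr_monte_carlo(clustered, r_grid)
```

The number of simulations and the evaluation grid are exactly the kind of researcher choices noted above, so they are exposed as arguments rather than fixed.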
Model fitting typically involves the assumption of a specific type of process, and incorrect model specification can result in poor performance of simulation envelopes for both confidence interval estimation and hypothesis testing.

Often, confidence intervals for the K-function are desired for an observed point pattern without specification of a null process. Replicated patterns from a process are typically not available to estimate the standard error of $\hat{K}(r)$. As an alternative, bootstrap methods for dependent data have been applied to point patterns (Davison and Hinkley, 1997; Loh and Stein, 2004) to estimate confidence intervals for the K-function, as well as other parameters.

The simplest way to obtain confidence intervals for the K-function of a point pattern observed over a square region, called the splitting method, is to divide the region into $N$ congruent subregions and to calculate $N$ separate estimates $\hat{K}_1(r), \hat{K}_2(r), \dots, \hat{K}_N(r)$ (Loh and Stein, 2004). Assuming that the $N$ estimates are independent and approximately normally distributed, an estimate of the variance of $\hat{K}(r)$ can be obtained and the $100(1 - \alpha)\%$ confidence interval can be computed by

\[
\hat{K}(r) \pm t_{N-1,\,\alpha/2} \sqrt{\frac{\widehat{\mathrm{Var}}\{\hat{K}_i(r)\}}{N}} \tag{1-14}
\]

where $\widehat{\mathrm{Var}}\{\hat{K}_i(r)\}$ is the sample variance of $\hat{K}_1(r), \hat{K}_2(r), \dots, \hat{K}_N(r)$, and $\hat{K}(r)$ is the overall estimate of $K(r)$ (Loh and Stein, 2004). For patterns with sufficient numbers of points, this method of confidence interval calculation creates fairly accurate intervals. However, these intervals can be wide because they are calculated from a small number of samples (i.e., the quadrats). An obvious limitation is the range of distances over which intervals can be determined. For a pattern on the unit square divided into quadrats of area 0.25, the length of each quadrat edge is 0.5. Estimates of $K(r)$ for distances greater than 0.25 rely heavily on the edge correction because a high proportion of any circle of that radius would fall outside of each quadrat. At these distances, the type of edge correction used to estimate $K(r)$ has a much larger influence on the estimates of the K-function than at shorter distances.

Issues with small sample sizes, dependence among quadrats, and non-normal samples from quadrats, encountered when calculating confidence intervals by dividing a pattern into multiple independent samples, led to the development of resampling methods for point patterns.
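The splitting interval in Equation 1-14 can be illustrated with a short sketch. It assumes a pattern on the unit square, an uncorrected estimate of K(r) (all weights equal to 1, which is a simplification of the edge-corrected estimator in the text), and a critical value `t_crit` supplied by the caller (for example, roughly 3.182 for N = 4 quadrats at the 95% level). The function names are hypothetical.

```python
import math
import statistics

def k_hat(points, r, area):
    """Uncorrected K estimate: |A| / n^2 times the number of ordered pairs closer than r."""
    n = len(points)
    close = sum(1 for i in range(n) for j in range(n)
                if i != j and math.dist(points[i], points[j]) < r)
    return area * close / n ** 2

def splitting_interval(points, r, t_crit, splits=2):
    """Interval for K(r) from splits x splits congruent quadrats of the unit square."""
    side = 1.0 / splits
    quadrats = {(row, col): [] for row in range(splits) for col in range(splits)}
    for px, py in points:
        # Points on the upper boundary are assigned to the last quadrat.
        col = min(int(px / side), splits - 1)
        row = min(int(py / side), splits - 1)
        quadrats[(row, col)].append((px, py))
    if any(len(q) < 2 for q in quadrats.values()):
        raise ValueError("every quadrat needs at least two points")
    # One estimate per quadrat, each computed with the quadrat's own area.
    estimates = [k_hat(q, r, side * side) for q in quadrats.values()]
    half_width = t_crit * math.sqrt(statistics.variance(estimates) / len(estimates))
    centre = k_hat(points, r, 1.0)  # overall estimate from the full pattern
    return centre - half_width, centre + half_width
```

With only four quadrats the t critical value is large, which is one concrete reason the resulting intervals tend to be wide.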
Hall (1985) extended block bootstrapping methods, used to estimate the statistical characteristics of time series data, to spatial Boolean models in two dimensions. Davison and Hinkley (1997) gave a more general explanation of this method, referring to it as tile resampling; it is referred to here as tiling. In this method, the goal is to create new patterns that maintain the spatial dependence of the observed pattern. If the observed region $A \subset \mathbb{R}^2$ is partitioned into $N$ disjoint tiles, $A_1, A_2, \dots, A_N$, the statistic of interest can be defined as $T = t(A_1, A_2, \dots, A_N)$. Then a resampled pattern is created by taking a random sample, with replacement, of the $N$ disjoint tiles, $A_1^*, A_2^*, \dots, A_N^*$, and a bootstrapped estimate of the statistic of interest is calculated from this newly created pattern, $T^* = t(A_1^*, A_2^*, \dots, A_N^*)$. A common variation is to use moving, overlapping tiles by setting $A_j^* = U_j + A_j$, where $U_j$ is a random vector (Davison and Hinkley, 1997).

Politis and Romano (1992) suggest using toroidal wrapping before resampling, so that tiles are allowed to fall outside of the boundary and, in this case, are wrapped around to the opposite side of the window. This can be accomplished by creating a new region that is a 3 x 3 grid of copies of the observed region (assuming that the observed region is rectangular). The tiles then contain the same points as if toroidal wrapping were used. This helps avoid bias created by undersampling points near the boundaries of the observed region (Davison and Hinkley, 1997).

By performing $B$ resamples, calculating $\hat{K}_1^*(r), \hat{K}_2^*(r), \dots, \hat{K}_B^*(r)$, and ordering these bootstrap estimates so that $\hat{K}^*_{(j)}(r)$ denotes the $j$th smallest, a $100(1 - \alpha)\%$ confidence interval can be created for $K(r)$ using the formula

\[
\left[\, 2\hat{K}(r) - \hat{K}^*_{((B+1)(1-\alpha/2))}(r),\;\; 2\hat{K}(r) - \hat{K}^*_{((B+1)(\alpha/2))}(r) \,\right] \tag{1-15}
\]

(Davison and Hinkley, 1997). This creates an equal-tailed confidence interval for $K(r)$. Other methods of confidence interval estimation might be more appropriate; however, interest here is in the method of bootstrapping from the pattern.
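A minimal sketch of tiling with the interval in Equation 1-15 is given below. It makes several simplifying assumptions that are not part of the cited methods: the window is the unit square, tiles are fixed and non-overlapping (no moving tiles and no toroidal wrapping), and K(r) is estimated without edge correction. The names `tile_resample` and `basic_bootstrap_interval` are hypothetical, and `n_boot` and `alpha` should be chosen so that (B + 1)(alpha / 2) is an integer.

```python
import math
import random

def k_hat(points, r, area=1.0):
    """Uncorrected K estimate; returns 0.0 when fewer than two points are present."""
    n = len(points)
    if n < 2:
        return 0.0
    close = sum(1 for i in range(n) for j in range(n)
                if i != j and math.dist(points[i], points[j]) < r)
    return area * close / n ** 2

def tile_resample(points, splits, rng):
    """One bootstrap pattern built from splits x splits tiles drawn with replacement."""
    side = 1.0 / splits
    tiles = {(row, col): [] for row in range(splits) for col in range(splits)}
    for px, py in points:
        col = min(int(px / side), splits - 1)
        row = min(int(py / side), splits - 1)
        # Store each point as an offset from its tile's lower-left corner.
        tiles[(row, col)].append((px - col * side, py - row * side))
    keys = list(tiles)
    new_points = []
    for row in range(splits):
        for col in range(splits):
            source = tiles[rng.choice(keys)]  # tile chosen at random, with replacement
            new_points.extend((ox + col * side, oy + row * side) for ox, oy in source)
    return new_points

def basic_bootstrap_interval(points, r, splits=2, n_boot=199, alpha=0.05, seed=0):
    """Equal-tailed interval [2K - K*_(upper), 2K - K*_(lower)] from ordered estimates."""
    rng = random.Random(seed)
    k_obs = k_hat(points, r)
    boots = sorted(k_hat(tile_resample(points, splits, rng), r) for _ in range(n_boot))
    upper_idx = round((n_boot + 1) * (1 - alpha / 2)) - 1  # zero-based order statistic
    lower_idx = round((n_boot + 1) * (alpha / 2)) - 1
    return 2 * k_obs - boots[upper_idx], 2 * k_obs - boots[lower_idx]
```

Storing points as within-tile offsets keeps the local configuration of each tile intact when it is moved to a new position, which is the property the tiling method relies on.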
Figure 1-2 shows an example of the creation of an artificial realization from the process using the tile method. For further information on the statistical properties of this resampling method, see Hall (1985) for a 2-dimensional spatial case and Künsch (1989) for a 1-dimensional time series analysis.

The primary problem with creating bootstrapped patterns using tiling is that when tiles are rearranged to produce the new pattern, points that are not originally positioned together can be placed in close proximity. Under the assumption of CSR, the placement of points within the bootstrap samples is random and the resulting patterns should still follow the same distribution. However, if the process has obvious spatial dependence, construction of a new pattern might violate the interpoint dependence structure of the process. If the spatial dependence between points is relatively short-ranged, and the blocks are large enough to adequately capture the spatial dependence between points, consistent results can be obtained (Davison and Hinkley, 1997). However, if the spatial dependence is longer-ranged, this method fails to maintain the original distribution of the observed pattern. Lahiri (1993) shows that putting independent resampled blocks together destroys the long-range dependence of the original observations.

A simple example is that of a hardcore inhibited (regular) process. In this case, the probability of points falling within a certain radius of interaction $r_0$ of one another is 0. Thus, $K(r) = 0$ for $r < r_0$. However, when a block bootstrapping approach is applied and the blocks taken from the original pattern are repositioned to create a new pattern, the placement of these blocks might violate the hardcore assumption of the process (i.e., points might fall within $r_0$ of one another in the bootstrapped patterns) (Loh and Stein, 2004).

Politis and Romano (1994) develop a similar method of resampling data based on using subsets, called the subsetting method.
They extend this to cases of 1-dimensional time series and multi-dimensional random fields of dependent data (Politis and Romano, 1994). When applied to point pattern data, subsets of the original pattern are used to estimate the value of interest. For this method, suppose a pattern is observed over region $A$ with area $|A|$. Then $N$ square subregions $A_i$, $i = 1, \dots, N$, are formed with tiles of size $|A|/N$, allowing for overlap between tiles and toroidal wrapping around $A$. Let $x_{ij}$ and $x_{ik}$, $j, k = 1, \dots, n_i$, represent two distinct points in subregion $A_i$. Then the estimate of $K(r)$ can be calculated using

\[
\hat{K}(r) = \frac{|A|}{\left( \sum_{i=1}^{N} n_i \right)^{2}} \sum_{i=1}^{N} \sum_{j=1}^{n_i} \sum_{k \neq j} w_{A_i}(x_{ij}, x_{ik})\, I(\|x_{ij} - x_{ik}\| < r), \tag{1-16}
\]

where $w_{A_i}(x_{ij}, x_{ik})$ is the new weight for point pair $x_{ij}$ and $x_{ik}$ calculated over subregion $A_i$ and $I(\cdot)$ represents the indicator function (Loh and Stein, 2004). Multiple bootstrap resamples are conducted and confidence intervals are created as in Equation 1-15.

Considering only pairs of points in the same subregion to estimate $K(r)$ reduces the problem, encountered in tiling, of producing new point pairs that violate process assumptions. If wrapping is used, some new point pairs will still exist in the resampled points, but far fewer than with tiling. Fewer point pairs are used in each region, which is accounted for in the estimation of $K(r)$ by larger edge correction weights. The larger weights result from the smaller subregions being used to calculate the weights in Equation 1-16. This means that the estimates of $K(r)$ are more strongly influenced by the edge correction used, and limitations exist as to the largest value of $r$ for which $K(r)$ can accurately be estimated (Loh and Stein, 2004).

Though several bootstrapping methods have been suggested to resample point processes, asymptotic results for the bootstrap techniques have rarely been provided. One exception is a method developed by Braun and Kulperger (1998), which they refer to as the marked point method and which is referred to here as marking or the marked method. The results they provide are for point processes in one dimension, and it is not known whether the theoretical results apply to processes with dimensions greater than one.
Because the authors work in one dimension and present their results using the second-order intensity instead of the K-function, we use their notation and definitions.

The first two moments of the process, $p_1(\tau)$ and $p_2(\tau_1, \tau_2)$, are defined as

\[
p_1(\tau) = \lim_{h \to 0} \frac{P(X((\tau, \tau + h]) > 0)}{h} \tag{1-17}
\]

and

\[
p_2(\tau_1, \tau_2) = \lim_{h_1 \to 0,\, h_2 \to 0} \frac{P(X((\tau_1, \tau_1 + h_1]) > 0,\; X((\tau_2, \tau_2 + h_2]) > 0)}{h_1 h_2}, \tag{1-18}
\]

respectively. Similar to the estimators of the first- and second-order intensities, the first two moments are estimated from an observed pattern on the interval $(0, T]$ using the following equations (Brillinger, 1975):

\[
\hat{p}_1 = \frac{1}{T} X((0, T]) \tag{1-19}
\]

\[
\hat{p}_2(\tau) = \frac{1}{hT} \sum_{x_i} \{\#\text{ of points in } (x_i + \tau,\, x_i + \tau + h]\} \tag{1-20}
\]

where $\tau = \tau_2 - \tau_1$ and $h$ is a window, or bin width, parameter. Under several assumptions regarding higher-order moments of the process defined by Brillinger (1975, 1978), these estimators are approximately asymptotically normal as $T$ increases to infinity. However, the asymptotic variance of $\hat{p}_1$ is difficult to estimate, and the asymptotic variance of $\hat{p}_2$ is a poor approximation to the true variance of the estimator, even for large intervals (large values of $T$) (Braun and Kulperger, 1998). Thus, Braun and Kulperger (1998) propose a block bootstrap approach intended to provide more accurate confidence intervals.

Braun and Kulperger first describe an approach similar to the tile method described in Davison and Hinkley (1997). Let $X$ represent a point process of which a pattern $x$ is observed on an interval $(0, T]$. Let $A$ represent the set of points (locations) $t$ on $(0, T]$ such that $x(t) = 1$. Then the point pattern $x$ can be bootstrapped using the following method:

1. Take $b$ to be some positive integer. For each integer $i = 1, \dots, b$, generate a uniform variate $U_i$ on the interval $(0, T - T/b]$.
2. For $j = 1, 2, \dots, b$, set each of the following:
   a) $A_j = (U_j, U_j + T/b] \cap A$
   b) $A_j^* = A_j - U_j + (j - 1)T/b$
   c) $X_j^*(\cdot) = |A_j^* \cap \cdot|$
3. Set $X^* = \sum_{j=1}^{b} X_j^*$.

Essentially, $b$ randomly selected blocks of length $T/b$ are taken from the observed pattern. The points occurring in the selected blocks are repositioned to create an artificial realization of $X$. The authors provide some asymptotic results for the distribution of the estimator of the first moment from Equation 1-17 (Braun and Kulperger, 1998). As with the Hall and the Davison and Hinkley methods, the Braun and Kulperger block bootstrap method fails to capture the second-order characteristics of the process.

In the same paper, the authors suggest another bootstrap method for estimating the second-order properties of a point process, which they label the marked point process method. Again using their notation for the 1-dimensional case, the second-order properties of the point process, $\rho(\tau)$ (analogous to $K(r)$), are estimated for the 1-dimensional region $(0, T]$. For each observed point $x \in (0, T]$, a mark is set equal to the number of points in the interval $(x + \tau, x + \tau + h]$ for a fixed value $h$. The estimate of the second-order intensity $\rho(\tau)$ is given by the equation

\[
\hat{\rho}(\tau) = \frac{1}{hT} \sum_{x_i} \{\#\text{ of points in } (x_i + \tau,\, x_i + \tau + h]\}. \tag{1-21}
\]

The following theorem is offered by the authors.
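Before turning to the theorem, the block resampling steps listed above and the estimator in Equation 1-21 can be sketched in a few lines. This is an illustrative sketch only, assuming the pattern is given as a list of locations on (0, T]. The names `block_bootstrap_1d` and `rho_hat` are hypothetical, and `random.Random.uniform` draws from a closed interval, which differs from the half-open interval in step 1 only on a set of probability zero.

```python
import random

def block_bootstrap_1d(points, T, b, rng):
    """One bootstrap pattern on (0, T] built from b random blocks of length T / b."""
    length = T / b
    new_points = []
    for j in range(1, b + 1):
        u = rng.uniform(0.0, T - length)                     # step 1: block start U_j
        block = [t for t in points if u < t <= u + length]   # step 2a: points in the block
        # Step 2b: shift the block so that it occupies ((j - 1) T / b, j T / b].
        new_points.extend(t - u + (j - 1) * length for t in block)
    return sorted(new_points)                                # step 3: superpose the blocks

def rho_hat(points, tau, h, T):
    """1 / (h T) times the sum over x_i of the count of points in (x_i + tau, x_i + tau + h]."""
    count = sum(1 for x in points for y in points if x + tau < y <= x + tau + h)
    return count / (h * T)
```

For example, `rho_hat(block_bootstrap_1d(points, T, b, random.Random(0)), tau, h, T)` gives one bootstrap replicate of the estimate from a block-resampled pattern.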

THEOREM: Suppose the point process $X$ has finite and integrable fourth moment densities in the sense of Brillinger (1975). For a given $h$, conditional on the observed point process $X$ on $[0, T]$,

\[
\sqrt{hT} \left( \hat{\rho}_2^{*}(\tau) - E\left( \hat{\rho}_2^{*}(\tau) \mid X \right) \right) \to N\left( 0, \sigma_2^{2} \right) \tag{1-22}
\]

as $T \to \infty$, and $\sigma_2^{2} = \rho_2(\tau) + O(h)$. For small $h$, the limiting variance $\sigma_2^{2}$ is approximately the limiting variance of $\hat{\rho}_2(\tau)$ (Braun and Kulperger, 1998).

Thus, Braun and Kulperger (1998) show that, for small values of $h$, the bootstrap estimates of the second-order moment give approximately the correct distribution asymptotically. It is still unclear how closely the bootstrap estimators approximate the true second-order moment. In a simulation study using a Matérn clustered process, use of the marked bootstrap approach led to confidence intervals with coverage closer to the nominal level than their block bootstrap approach. Results improved for smaller values of $h$. It is also unknown whether these results hold in dimensions greater than 1.

Loh and Stein (2004) provide an adaptation of the Braun and Kulperger (1998) marked point method for 2-dimensional point patterns, referred to here as the marked point method or marking. For each point $x$ in an observed pattern $X$ in region $A$, a mark $m_x(r)$ is assigned for distance $r$. The mark equals the sum of all weights $w_A(x, y)$ for points $y$ within distance $r$ of point $x$. Thus,

\[
m_x(r) = \sum_{y : y \neq x} w_A(x, y)\, I\{\|y - x\| < r\} \tag{1-23}
\]

is the total contribution by all points within distance $r$ of point $x$ to the estimate of $\lambda^{2} |A| K(r)$, where $\lambda$ is the intensity of the process and $|A|$ is the area of the observed region. Once marks are assigned, points are resampled using one of the various resampling strategies. For the example shown in Figure 1-3, information regarding the spatial locations of points $y$ and $z$ is recorded in the mark given to $x$. Here, the
