Overview of Statistical Analysis of Spatial Data

Geog 2C Introduction to Spatial Data Analysis

Phaedon C. Kyriakidis
www.geog.ucsb.edu/ phaedon
Department of Geography
University of California Santa Barbara
Santa Barbara, CA 936-6
phaedon@geog.ucsb.edu
Spring Quarter 9

Outline
  - Preliminaries
  - Types of Spatial Data
  - Why Spatial Statistics?
  - Points to Remember

Ph. Kyriakidis (UCSB) Geog 2C Spring 9

Preliminaries
Introduction & Objectives

Spatial data
  - geo-referenced attribute measurements; each measurement is associated with a location (point) or an entity (region or object) in geographical (or other) space
  - attribute measurement scale can be continuous or discrete, e.g., chemical concentration, soil types, disease occurrences
  - sample locations can have a regular or irregular spatial arrangement, i.e., data locations on a raster (regular lattice) or scattered in space
  - the domain informed by a measurement is called the sample unit or support, e.g., points, pixels, polygons
  - spatial data often have an additional temporal component: dynamic attribute evolution in space and time, spatiotemporal support

Objectives of this handout
  - to provide a brief overview of types of spatial data
  - to highlight the role of spatial statistics in analyzing data of each type

Preliminaries
Stages in Spatial Data Analysis

Exploratory analysis
  - explore spatial data using cartographic (or other visual) representations
  - statistical analysis for detecting possible sub-populations, outliers, trends, relationships with neighboring values or other spatial variables

Modeling or confirmatory analysis
  - establish parametric or non-parametric model(s) characterizing the attribute's spatial distribution
  - estimate model parameters from data; evaluate their statistical significance; predict attribute values at other locations and/or future time instants

Notes
  - any processing of spatial data, e.g., filtering or interpolation, affects any inference made from them
  - boundaries between the above stages are not always clear-cut

Types of Spatial Data
Attributes Varying Continuously in Space

Characteristics
  - also known (unfortunately) as geostatistical data, e.g., temperature, rainfall, elevation, population density
  - measurements of nominal scale, e.g., land cover types, or interval/ratio scale, e.g., sea floor depth
  - often, sparse samples are available only at a fixed set of locations

[Figure: Bay Area rain gauge precipitation (mm/day), 1981-82 NDJ average]

Types of Spatial Data
Area or Lattice Data

Characteristics
  - attributes take values only at a fixed set of areas or zones, e.g., administrative districts, pixels of satellite images
  - typically, all possible locations have been sampled; no attribute values between sampling units (unless there are missing values)

[Figure: SIDS cases in North Carolina, 1979 to 1984]

The distinction between spatially continuous and area (lattice) data is not always clear-cut, particularly when the latter are derived via aggregation from the former.

Types of Spatial Data
Point Pattern Data

Characteristics
  - series of point locations with recorded events, e.g., locations of trees, disease or crime incidents
  - point locations correspond to all possible events (mapped point pattern), or to a subset (sampled point pattern)
  - attribute values also possible at the same locations, e.g., tree diameter, magnitude of earthquakes (marked point pattern)

[Figure: Lansing Woods tree locations (maple, hickory)]
[Figure: Bay Area earthquake magnitudes, 1962-1981]

Types of Spatial Data
Spatial Interaction or Network Data

Characteristics
  - attributes relate to pairs of points or areas: flows from origins to destinations, e.g., patient flows from residences to hospitals
  - less tangible flows, e.g., information, could also be defined

Analysis objectives
  - modeling of flow patterns = finding relationships between observed flows and explanatory variables, e.g., number of trips from origins to destinations as a function of income
  - classical analysis methods focus on patterns of aggregate interaction, rather than on individuals themselves; more recent focus is placed on understanding individual preferences and choice modeling
  - spatial location/allocation problems, and more generally spatial optimization problems, typically involve network data

Methods for analyzing spatial interaction data are not covered in this course.

Why Spatial Statistics?
Univariate Statistics and Spatial Pattern?

Two 1D attribute profiles with the same histogram:

[Figure: two panels, each titled "1D population", showing value versus x]

Shortcomings of univariate statistics
  - univariate statistics, e.g., average, variance, histogram, do not suffice to describe spatial pattern; the spatial arrangement of attribute values matters, too

Spatial auto-correlation: an aspect of spatial pattern
  - attribute values measured at nearby supports tend to be more similar than those measured at distant supports; Tobler's 1st law(?) of Geography

Why Spatial Statistics?
Role of Spatial Statistics in Spatial Data Analysis

Spatially continuous data
  - model attribute spatial variation over the study area from sampled point values
  - predict attribute values at non-sampled locations (accounting for covariates)

Area (lattice) data
  - detect and model spatial patterns or trends in area values; no prediction at non-sampled locations, unless smoothing of existing values or imputation of missing values is required
  - use covariates or relationships with adjacent attribute values for inference, e.g., disease rates in light of socioeconomic variables

Point patterns
  - detect clustering or regularity, as opposed to complete randomness, of event locations in space and/or time
  - if clustering is detected, investigate possible relations between clusters and nearby sources or pertinent covariates
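
The point made on the "Univariate Statistics and Spatial Pattern?" slide, that two profiles can share a histogram yet differ in spatial pattern, can be sketched numerically. The example below is an illustration only and is not taken from the handout: the attribute values are synthetic, and the helper `lag1_autocorr` is a hypothetical name for a simple correlation between each value and its right-hand neighbor.

```python
import numpy as np

def lag1_autocorr(profile):
    """Correlation between each value and its immediate neighbor along x."""
    z = np.asarray(profile, dtype=float)
    return float(np.corrcoef(z[:-1], z[1:])[0, 1])

rng = np.random.default_rng(0)      # fixed seed, for reproducibility only
values = rng.normal(size=100)       # synthetic attribute values (assumption)

# Two arrangements of exactly the same values along the x axis.
smooth_profile = np.sort(values)               # values increase steadily with x
shuffled_profile = rng.permutation(values)     # same values, random arrangement

# Same values, hence the same histogram, mean and variance.
same_histogram = np.array_equal(np.sort(smooth_profile), np.sort(shuffled_profile))
print("same histogram:", same_histogram)
print("means:", smooth_profile.mean(), shuffled_profile.mean())
print("variances:", smooth_profile.var(), shuffled_profile.var())

# Very different spatial pattern: neighbor similarity differs markedly.
print("lag-1 auto-correlation (smooth):  ", lag1_autocorr(smooth_profile))
print("lag-1 auto-correlation (shuffled):", lag1_autocorr(shuffled_profile))
```

Only the last two printed numbers distinguish the profiles, which is why a measure of spatial auto-correlation is needed in addition to univariate summaries.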

Why Spatial Statistics?
Spatial Versus Non-Spatial Statistics

Classical statistics
  - samples are assumed to be realizations of independent and identically distributed (iid) random variables
  - most hypothesis testing procedures call for samples from iid random variables
  - hence, problems with inference and hypothesis testing in a spatial setting

Spatial statistics
  - multivariate statistics in a spatial/temporal context: each observation is viewed as a realization from a different random variable, but such random variables are auto-correlated in space and/or time
  - each sample is not an independent piece of information, precisely because it is redundant with other samples (due to the corresponding random variables being auto-correlated)
  - auto- and cross-correlation (in space and/or time) is explicitly accounted for to establish confidence intervals for hypothesis testing

One can always choose to analyze spatial data with non-spatial statistics; problems arise when confidence intervals need to be reported...

Why Spatial Statistics?
Software for Statistical Analysis of Spatial Data

GIS-based
  - ESRI's Spatial Analyst, Geostatistical Analyst...
  - opt for close or loose coupling with specialized external packages when specific functionalities are missing from a GIS

Statistical packages
  - extremely versatile in modeling; recent improvements in visualization
  - R and SpaceStat/GeoDa most popular in Geography

Image processing packages
  - mature technology, lots of new developments
  - IDL and Matlab most popular in Remote Sensing and Electrical Engineering

Access to source code written in a straightforward programming language is critical for research development in an academic environment...
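
To make the remark on the "Spatial Versus Non-Spatial Statistics" slide about confidence intervals more concrete, here is a minimal sketch of how redundancy among auto-correlated samples reduces the amount of independent information. It rests on assumptions that are not in the handout: samples are equally spaced along a line, their correlation decays as `rho**|i - j|`, and `effective_sample_size` is a hypothetical helper name.

```python
import numpy as np

def effective_sample_size(n, rho):
    """Number of independent samples that would give the same variance of the
    mean as n equally spaced samples with correlation rho**|i - j| (assumed)."""
    idx = np.arange(n)
    corr = rho ** np.abs(idx[:, None] - idx[None, :])   # n x n correlation matrix
    # Var(mean) = sigma^2 * sum(corr) / n^2, to be compared with sigma^2 / n_eff.
    return n ** 2 / corr.sum()

n = 100
for rho in (0.0, 0.5, 0.9):
    n_eff = effective_sample_size(n, rho)
    # Factor by which an iid-based confidence interval for the mean is too narrow.
    widening = np.sqrt(n / n_eff)
    print(f"rho = {rho:.1f}: n_eff = {n_eff:6.1f}, CI should be {widening:.2f}x wider")
```

With `rho = 0` the samples are independent and all 100 count; as `rho` grows, the same 100 samples carry less information, so an interval computed under the iid assumption understates the uncertainty.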

Some Issues Specific to Spatial Data Analysis
A first look

  - differences from time series analysis:
      1. irregular sampling
      2. lack of clear indexing; no notion of past-present-future
      3. auto- and cross-correlation in multiple directions
  - multi-source data associated with different spatial/temporal resolutions
  - data often reported as aggregates over arbitrarily defined zones/areas; statistics of aggregates are not the same as those of individuals:
      1. Modifiable Area Unit Problem (MAUP)
      2. Ecological Fallacy or Inference Problem (EIP)
  - edge/boundary effects: samples near the edges of a study region have fewer neighbors than samples in the interior; near-edge samples might bear the effects of different spatial processes
  - spatial process models typically distinguish between first- and second-order effects, i.e., between environmental controls and interactions (the distinction between the two is not always clear-cut)

Modifiable Area-Unit Problem: Aggregation Effect

Two spatial variables and their univariate/bivariate statistics

[Figure: gridded values of Spatial Variable #1 and Spatial Variable #2 at the original resolution and under Aggregation Scheme #1]

  Original resolution
    - Spatial Variable #1: m = 43.14, s =.17
    - Spatial Variable #2: m = 42.92, s = 18.32
    - ρ12 = .83
  Aggregation Scheme #1
    - Spatial Variable #1: m = 43.14, s = 16.79
    - Spatial Variable #2: m = 42.92, s = 12.65
    - ρ12 = .9

Statistics and relationships between spatial attributes depend on aggregation extent.
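
The aggregation effect above can be reproduced in a small sketch. The handout's grid values did not survive extraction, so the two variables below are synthetic stand-ins (an assumption), and `block_mean` is a hypothetical helper that averages over non-overlapping square blocks of increasing extent.

```python
import numpy as np

def block_mean(grid, k):
    """Average a grid over non-overlapping k x k blocks (sides divisible by k)."""
    g = np.asarray(grid, dtype=float)
    n_rows, n_cols = g.shape
    return g.reshape(n_rows // k, k, n_cols // k, k).mean(axis=(1, 3))

rng = np.random.default_rng(1)              # fixed seed, for reproducibility only
shared = rng.normal(size=(8, 8))            # synthetic common signal (assumption)
var1 = 43 + 15 * shared + 10 * rng.normal(size=(8, 8))
var2 = 43 + 12 * shared + 10 * rng.normal(size=(8, 8))

for k in (1, 2, 4):                         # aggregation extent: k x k cells
    a1, a2 = block_mean(var1, k), block_mean(var2, k)
    rho = np.corrcoef(a1.ravel(), a2.ravel())[0, 1]
    print(f"{k} x {k} blocks: "
          f"m1 = {a1.mean():.2f}, s1 = {a1.std():.2f}, "
          f"m2 = {a2.mean():.2f}, s2 = {a2.std():.2f}, rho12 = {rho:.2f}")
```

As in the slide, the means are unchanged by averaging over equal-sized blocks, while the standard deviations can only stay equal or shrink and the correlation changes with the block size.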

Modifiable Area-Unit Problem: Zonation Effect

Upscaling spatial variables using two different aggregation schemes

[Figure: gridded values of the two spatial variables under Aggregation Scheme #1 and Aggregation Scheme #2]

  Aggregation Scheme #1
    - Spatial Variable #1: m = 43.14, s = 16.79
    - Spatial Variable #2: m = 42.92, s = 12.65
    - ρ12 = .9
  Aggregation Scheme #2
    - Spatial Variable #1: m = 43.14, s = 15.23
    - Spatial Variable #2: m = 42.92, s = 15.59
    - ρ12 = .94

For a given aggregation extent, statistics and relationships between spatial attributes depend on which individual values are aggregated and how.

Ecological Inference Problem I

Downscaling spatial variables

[Figure: observed (aggregated) variables next to Spatial Variable #1 and Spatial Variable #2 at the fine resolution]

  Observed variables (Aggregation Scheme #1)
    - Spatial Variable #1: m = 43.14, s = 16.79
    - Spatial Variable #2: m = 42.92, s = 12.65
    - ρ12 = .9
  Fine resolution
    - Spatial Variable #1: m = 43.14, s =.17
    - Spatial Variable #2: m = 42.92, s = 18.32
    - ρ12 = .83

Statistics and relationships between spatial variables at a finer spatial resolution are different from those derived at the original coarse resolution.
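
The zonation effect, and the fine-versus-coarse discrepancy behind the ecological inference problem, can be sketched as follows. The data are again synthetic stand-ins for the handout's grids (an assumption), and `zone_means` is a hypothetical helper. Both schemes merge two cells per zone, so the aggregation extent is the same; only the orientation of the zones differs.

```python
import numpy as np

def zone_means(grid, zone_shape):
    """Average a grid over non-overlapping zones of shape (rows, cols)."""
    g = np.asarray(grid, dtype=float)
    zr, zc = zone_shape
    return g.reshape(g.shape[0] // zr, zr, g.shape[1] // zc, zc).mean(axis=(1, 3))

def corr(a, b):
    """Correlation coefficient between two grids of equal shape."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

rng = np.random.default_rng(7)              # fixed seed, for reproducibility only
shared = rng.normal(size=(6, 6))            # synthetic common signal (assumption)
var1 = 43 + 15 * shared + 10 * rng.normal(size=(6, 6))
var2 = 43 + 12 * shared + 10 * rng.normal(size=(6, 6))

schemes = {
    "fine resolution (1 x 1)": (1, 1),
    "scheme A, zones of 1 x 2": (1, 2),     # merge horizontal neighbors
    "scheme B, zones of 2 x 1": (2, 1),     # merge vertical neighbors
}
results = {}
for name, shape in schemes.items():
    a1, a2 = zone_means(var1, shape), zone_means(var2, shape)
    results[name] = (a1.mean(), a1.std(), a2.mean(), a2.std(), corr(a1, a2))
    m1, s1, m2, s2, rho = results[name]
    print(f"{name}: m1 = {m1:.2f}, s1 = {s1:.2f}, "
          f"m2 = {m2:.2f}, s2 = {s2:.2f}, rho12 = {rho:.2f}")
```

Comparing the two scheme rows shows the zonation effect; comparing either of them with the fine-resolution row shows why statistics computed on aggregates should not be read as statistics of the underlying individual values.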

Ecological Inference Problem II

Under-determined inverse problem

[Figure: observed (aggregated) variables next to an alternative set of fine-resolution values for Spatial Variable #1 and Spatial Variable #2]

  Observed variables (Aggregation Scheme #1)
    - Spatial Variable #1: m = 43.14, s = 16.79
    - Spatial Variable #2: m = 42.92, s = 12.65
    - ρ12 = .9
  Alternative fine-resolution values
    - Spatial Variable #1: m = 43.14, s =.17
    - Spatial Variable #2: m = 42.92, s = 18.32
    - ρ12 = .21

Multiple combinations of fine spatial resolution attribute values can lead to the same aggregate values at a coarser resolution (equi-finality).

First- Versus Second-Order Effects

[Figure: "1D population" profile, value versus x]

First-order effects
  - spatial pattern explained by environmental (or extrinsic) factors, e.g., attribute value y(x) is high at location x due to another attribute value y'(x) at the same location x, or another attribute value y'(x') at a nearby location x'

Second-order effects
  - spatial pattern explained by interaction (or intrinsic) factors, e.g., attribute value y(x) is low at location x due to another (same-attribute) value y(x') at a nearby location x', provided both locations x and x' lie in the same environment
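
The distinction between first- and second-order effects can be illustrated with a simulated 1D profile. This is a sketch under stated assumptions, none of which come from the handout: the first-order part is a linear response to a synthetic covariate, and the second-order part is a first-order autoregressive series, chosen only because it is a simple way to make neighboring values interact.

```python
import numpy as np

def lag1_autocorr(z):
    """Correlation between each value and its immediate neighbor along x."""
    z = np.asarray(z, dtype=float)
    return float(np.corrcoef(z[:-1], z[1:])[0, 1])

rng = np.random.default_rng(3)              # fixed seed, for reproducibility only
n = 500
x = np.arange(n)

# First-order effect: y responds to another attribute at the same location.
covariate = np.sin(2 * np.pi * x / n)       # synthetic environmental control
first_order = 5.0 * covariate               # assumed linear response

# Second-order effect: each value depends on the same attribute next to it.
phi = 0.8                                   # assumed interaction strength
noise = rng.normal(size=n)
second_order = np.empty(n)
second_order[0] = noise[0]
for i in range(1, n):
    second_order[i] = phi * second_order[i - 1] + noise[i]

y = first_order + second_order

# Remove the first-order part by regressing y on the covariate (least squares).
slope, intercept = np.polyfit(covariate, y, deg=1)
residual = y - (slope * covariate + intercept)

print("lag-1 auto-correlation of the input noise:", lag1_autocorr(noise))
print("lag-1 auto-correlation of the residual:   ", lag1_autocorr(residual))
```

The residual remains strongly auto-correlated after the covariate is accounted for, which is the signature of a second-order effect. The two components are separable here only because the generating model is known; from `y` alone the split would not be unique.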

Points to Remember
Recap I

Spatial data
  - set of geo-referenced measurements with attribute values and coordinates (topology & context also important)
  - data types:
      1. spatial point patterns (events)
      2. data continuously varying in space (fields)
      3. area or lattice data (objects)
      4. spatial interaction data (flows)

Spatial data analysis objectives
  - exploratory analysis: looking for patterns/relationships
  - confirmatory analysis: establishing spatial process models from spatial patterns + model parameter estimation

Points to Remember
Recap II

Spatial statistics
  - statistical framework for analysis and modeling of spatial data: accounts for spatial auto-correlation and scale effects; allows assessing uncertainty in spatial analysis results
  - multivariate statistics tailored to the analysis of spatial data

Issues to be aware of
  - any spatial analysis result is tied to a particular observation scale, i.e., to the particular sample support(s); the Modifiable Area Unit Problem (MAUP) and the Ecological Inference Problem (EIP) are consequences of this
  - spatial process models typically distinguish between:
      - first-order effects or environmental controls
      - second-order effects or interactions (spatial auto-correlation)
    this dichotomy does not apply to actual data, only to data generating models...