OPEN GEODA WORKSHOP / CRASH COURSE FACILITATED BY M. KOLAK
WHAT IS GEODA? A software program that serves as an introduction to spatial data analysis. Free and open source: source code is available under the GNU license. As of the final version, it runs on Windows, Mac OS, and Linux. It can open shapefiles or tables.
WHAT IS GEODA? Developed by Dr. Luc Anselin's team, with spatial econometrics and epidemiology applications. Supported by the National Science Foundation and the Center for Spatially Integrated Social Science. Flagship of the GeoDa Center at Arizona State University: geodacenter.asu.edu/projects/opengeoda
PART I Open a file in GeoDa. Make different choropleth maps. Open a table in GeoDa. Link between table and maps. Navigate, sort, select, and query data in the table. Create a new variable. Calculate the raw rate for the new variable. Save as a new shapefile.
OPENING A FILE IN GEODA Open GeoDa from the desktop. Choose File/Open Shapefile and open SIDS.shp. There are several ways to change the map shown in the view: right-click the display and change the Category, or go to Map/ in the navigation menu.
CHOROPLETH MAPS "A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map." (Wikipedia) Quantile Map: create a quantile map for NWBIR74 and SID74 (using defaults).
CHOROPLETH MAPS Percentile Map: create a percentile map for NWBIR74 and SID74 (using defaults).
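To make the difference between the two map types concrete, here is a minimal sketch of how values could be binned for a quantile map and for a percentile map. The data are hypothetical, and the percentile break points (1, 10, 50, 90, 99) are an assumption about GeoDa's defaults rather than something stated in these slides.

```python
# Hypothetical values standing in for a column such as NWBIR74.
values = [12, 7, 31, 4, 19, 25, 9, 15, 40, 2, 22, 17]


def quantile_classes(data, k):
    """Rank the values and split the ranks into k equal-count classes."""
    order = sorted(range(len(data)), key=lambda i: data[i])
    classes = [0] * len(data)
    for rank, i in enumerate(order):
        classes[i] = rank * k // len(data)
    return classes


# Assumed percentile-map break points, in percent.
PERCENTILE_BREAKS = [1, 10, 50, 90, 99]


def percentile_classes(data):
    """Class = number of break points below the value's mid-rank percentile."""
    n = len(data)
    order = sorted(range(n), key=lambda i: data[i])
    classes = [0] * n
    for rank, i in enumerate(order):
        pct = 100.0 * (rank + 0.5) / n
        classes[i] = sum(pct > b for b in PERCENTILE_BREAKS)
    return classes


quartiles = quantile_classes(values, 4)
print([quartiles.count(c) for c in range(4)])  # observations per quartile class
print(percentile_classes(values))
```

A quantile map puts roughly the same number of areas in every class, while a percentile map deliberately uses unequal classes so that the extremes stand out.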
WORKING WITH DATA IN TABLE Navigation, selection, and sorting (live demos): linking between the data table and the map; moving the selection to the top. Queries: use the Selection Dialog to select something specific. The result can be added as a variable; for example, assign the value 1 where the query is true. The selection can then be moved to the top.
WORKING WITH DATA IN TABLE Creating a Variable. Right-click the table and choose Add Variable. Name the new variable SIDR74; it will record the raw rate of SIDS occurrence in 1974 for the specified population.
WORKING WITH DATA IN TABLE Raw Rate. The raw rate is the plain rate or proportion: an Event variable (numerator) divided by a Base variable (denominator). The Event field can be thought of as a count field, since it refers to variables such as counts, dollar values, or indices. The Base field holds the reference universe for the Event variable and cannot contain any zero values. For instance, in the St. Louis homicide dataset, an Event variable is HC7984 (homicide count, 1979-84), while a Base variable is PO7984 (population total, 1979-84).
WORKING WITH DATA IN TABLE Creating a Variable. In the Variable Calculation tool, set the new SIDR74 variable equal to the Raw Rate, with SID74 as the event and BIR74 as the base.
WORKING WITH DATA IN TABLE Rescale the rate to SIDS cases per 100,000 births by multiplying by 100,000.
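The raw rate calculation and rescaling can be sketched outside GeoDa as follows. The three county records are hypothetical; only the field names SID74, BIR74, and SIDR74 come from the workshop.

```python
# Hypothetical county records; field names follow the workshop's SIDS table.
counties = [
    {"NAME": "A", "SID74": 1, "BIR74": 1091},
    {"NAME": "B", "SID74": 0, "BIR74": 487},
    {"NAME": "C", "SID74": 5, "BIR74": 3188},
]

SCALE = 100_000  # express the rate per 100,000 births

for row in counties:
    # The base (denominator) must never be zero.
    if row["BIR74"] == 0:
        raise ValueError("base variable must not contain zeros")
    row["SIDR74"] = row["SID74"] / row["BIR74"] * SCALE

for row in counties:
    print(row["NAME"], round(row["SIDR74"], 1))
```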
WORKING WITH DATA IN TABLE Confirm changes in Table
WORKING WITH DATA IN TABLE Save as a new shapefile (with a new name) from the File menu.
PRACTICE Create a SID raw rate variable for 1979. Save the changes as a new shapefile. Try out other map options using the Category options.
PART II Intro to exploratory data analysis. Make a histogram and box plot from data. Investigate outliers. Make a rate map (raw and excess). Make an EB smoothed map. Make a spatial weights file for your data.
EDA BASICS - HISTOGRAM Create a histogram for a variable. Click the histogram icon in the navigation toolbar. Select a variable (e.g., the calculated SIDS rate). Right-click the histogram to adjust the display and change the number of intervals. Link the histogram to the map by clicking the bars of interest.
EDA BASICS - BOX PLOT Create a box plot for a variable. Click the box plot icon in the navigation toolbar. Select a variable (e.g., the calculated SIDS rate). Right-click the box plot to adjust the display; the hinge can be set to 1.5 or 3. Create a map from the box plot data.
EDA BASICS - BOX PLOT Depicts the non-spatial distribution of a variable. Represents the cumulative distribution of the variable, sorted by value. The value in parentheses in the upper right corner is the number of observations. Shows the median and the first and third quartiles of the distribution (50%, 25%, 75%), plus any outliers. Outliers lie more than a given multiple of the interquartile range (the difference in value between the 75% and 25% observations) above the third quartile or below the first quartile. The standard multiples are 1.5 and 3 times the interquartile range.
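The outlier rule can be sketched in a few lines. The rates below are hypothetical, and the quartile convention used here (Python's "inclusive" method) is an assumption that may differ slightly from the one GeoDa applies.

```python
import statistics


def box_plot_fences(data, hinge=1.5):
    """Return the lower and upper fences: quartiles -/+ hinge * IQR."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - hinge * iqr, q3 + hinge * iqr


def outliers(data, hinge=1.5):
    """Values falling outside the fences for the chosen hinge."""
    lower, upper = box_plot_fences(data, hinge)
    return [x for x in data if x < lower or x > upper]


# Hypothetical rates with two unusually high values.
rates = [1.2, 1.9, 2.1, 2.4, 2.6, 2.8, 3.0, 3.3, 3.9, 6.5, 9.5]

print(outliers(rates, hinge=1.5))  # → [6.5, 9.5]
print(outliers(rates, hinge=3.0))  # → [9.5]
```

Raising the hinge from 1.5 to 3 widens the fences, so only the most extreme observations remain flagged.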
EDA BASICS Explore the data further by clicking on interesting areas, outliers, etc. Change the hinge and explore again.
BASIC RATE MAPPING Raw Rate Map. Keep your box plot map (hinge 1.5) open. Create a new, themeless map. Right-click the map and select Rates/Raw Rate. Choose SID74 as the event variable and BIR74 as the base. Right-click the map and select Save Rate to write the rate as a new variable (named R_RAWRATE by default). Drag and drop the new column next to the previously calculated rate; the two should differ only by our multiplying factor.
BASIC RATE MAPPING Excess Rate Map. The standardized mortality rate (SMR) is a commonly used notion for comparing an observed rate to a standard. In GeoDa, the Excess Ratio is the ratio of the observed rate to the average rate computed for all the data. This average is NOT the average of all the individual rates; it is calculated as the sum of all events divided by the sum of all populations at risk.
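A small worked sketch of the excess ratio, using three hypothetical areas, shows why the reference rate is not the simple average of the individual rates.

```python
# Hypothetical event counts and populations at risk for three areas.
events = [2, 10, 30]
base = [1000, 2000, 3000]

# Reference rate: total events over total population at risk.
overall_rate = sum(events) / sum(base)

# For contrast only: the unweighted average of the individual rates.
mean_of_rates = sum(e / b for e, b in zip(events, base)) / len(events)

# Excess ratio: observed rate relative to the reference rate.
excess = [(e / b) / overall_rate for e, b in zip(events, base)]

print(round(overall_rate, 4), round(mean_of_rates, 4))
print([round(x, 2) for x in excess])
```

Values below 1 indicate fewer events than expected under the overall rate, and values above 1 indicate more.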
BASIC RATE MAPPING Excess Rate Map. Right-click the map and choose Rates/Excess Rates. Choose the appropriate event and base variables. Right-click the map again to Save Rates and add them to the table.
BASIC RATE MAPPING Excess Rate Map. Areas with less risk are blue (<1.00); areas with more risk are red (>1.00). The legend categories are hard-coded. To do further analysis or visualization, you must add the rates to the table (done in the previous slide). Drag and drop the column to the appropriate place in the table.
PRACTICE Create Histogram, Box Plots, and Rate Maps for the Ohio lung cancer sample data
RATE SMOOTHING Rate smoothing techniques correct for the inherent variance instability of rates. Empirical Bayes smoothing (according to L. Anselin) computes a weighted average between the raw rate for each county and the state average, with weights proportional to the underlying population at risk. That is, small counties, with small populations at risk, will tend to have their rates adjusted considerably, whereas rates for large counties will barely change.
RATE SMOOTHING Empirical Bayes (EB) Smoothed Rates. Right-click the map and select Rates/Empirical Bayes. Choose your event and base variables. Use a 1.5-hinge box plot map; a percentile map can be used if appropriate, but use the box plot if there are fewer than 100 observations. Right-click to Save Rates and add them to the table. Compare the EB-smoothed map with the previous rate maps. How are outliers affected?
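The shrinkage idea can be illustrated with a common method-of-moments Empirical Bayes estimator. This is a sketch under assumptions: the counts are hypothetical, and whether this estimator matches GeoDa's internal calculation exactly is not confirmed by these slides.

```python
def eb_smooth(events, base):
    """Shrink each raw rate toward the overall rate (method of moments)."""
    n = len(events)
    total_base = sum(base)
    theta = sum(events) / total_base  # prior mean: the overall rate
    rates = [e / b for e, b in zip(events, base)]
    # Prior variance estimate; clipped at zero if it comes out negative.
    phi = sum(b * (r - theta) ** 2 for b, r in zip(base, rates)) / total_base
    phi = max(phi - theta / (total_base / n), 0.0)
    smoothed = []
    for r, b in zip(rates, base):
        # Weight on the raw rate grows with the population at risk.
        w = phi / (phi + theta / b) if phi > 0 else 0.0
        smoothed.append(w * r + (1 - w) * theta)
    return smoothed


# Hypothetical counties: two small, two large.
events = [0, 6, 50, 260]
base = [100, 200, 10000, 40000]

raw = [e / b for e, b in zip(events, base)]
theta = sum(events) / sum(base)
smoothed = eb_smooth(events, base)

for r, s, b in zip(raw, smoothed, base):
    print(b, round(r, 5), round(s, 5))
```

In this sketch the small counties move most of the way toward the overall rate, while the largest county's rate barely changes.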
RATE SMOOTHING Spatial Weight Smoothing. Does proximity to neighbors affect the results? In GeoDa, neighbors are defined in a spatial weights file. Create a simple spatial weights file with the 8 nearest neighbors for each county: go to the menu Tools/Weights/Create. Choose FIPSNO as the ID variable; each county (or tract or block) has a unique ID number. Leave the defaults in the Distance Weights section. Click the k-Nearest Neighbors radio button and set it to 8 neighbors. Save as a .gwt file in your folder.
RATE SMOOTHING Spatial Weight Smoothing. Load the spatial weights file you just created: go to the menu Tools/Weights/Open. The spatial weights will now be loaded for the next maps. Create a new map with spatial rate smoothing: right-click and choose Rates/Spatial Rates. Use the same Base and Event variables, and use the box plot map with hinge 1.5. Compare to the previous box plot maps!
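What a k-nearest-neighbor weights file encodes can be sketched as follows. The IDs and centroid coordinates are hypothetical, and the sketch uses k = 2 on five points purely for readability; it does not write the .gwt format.

```python
import math

# Hypothetical unit IDs (in the role of FIPSNO) mapped to centroid coordinates.
centroids = {
    37001: (0, 0),
    37003: (1, 0),
    37005: (0, 1),
    37007: (3, 3),
    37009: (4, 3),
}


def knn_weights(points, k):
    """For every unit, list the IDs of its k nearest other units."""
    neighbors = {}
    for i, pi in points.items():
        dists = sorted(
            (math.dist(pi, pj), j) for j, pj in points.items() if j != i
        )
        neighbors[i] = [j for _, j in dists[:k]]
    return neighbors


w = knn_weights(centroids, k=2)
for unit, nbrs in w.items():
    print(unit, nbrs)
```

Every unit gets exactly k neighbors, so there are no islands, but the relation need not be symmetric: one unit can be among another's nearest neighbors without the reverse being true.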
RATE SMOOTHING Spatial Weight Smoothing Spatially smoothed maps emphasize broad regional patterns. What happened to the outliers?
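A spatially smoothed rate can be sketched as the pooled rate over a window made of each unit and its neighbors. The data are hypothetical, and including the unit itself in the window is an assumption about how GeoDa defines the smoother.

```python
def spatial_rate(events, base, neighbors):
    """Pooled rate over each unit plus its neighbors."""
    smoothed = {}
    for i, nbrs in neighbors.items():
        window = [i] + list(nbrs)
        smoothed[i] = sum(events[j] for j in window) / sum(base[j] for j in window)
    return smoothed


# Hypothetical units along a line; "b" is a small-population outlier.
events = {"a": 2, "b": 3, "c": 4, "d": 6}
base = {"a": 1000, "b": 100, "c": 2000, "d": 3000}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

smoothed = spatial_rate(events, base, neighbors)
for unit in events:
    print(unit, round(events[unit] / base[unit], 4), round(smoothed[unit], 4))
```

The raw rate for "b" is ten times that of its neighbors, but after pooling with them it falls in line with the regional level, which is how outliers tend to disappear from spatially smoothed maps.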
SPATIAL WEIGHTS Contiguity-Based Spatial Weights. The definition of a neighbor is based on sharing a common boundary. Connectivity Histogram (according to L. Anselin): the histogram reflects the connectivity distribution in the data set. It helps detect strange features in the distribution which could affect spatial autocorrelation and spatial regression specifications. Beware of 1) islands, or unconnected observations, and 2) a bimodal distribution of locations.
SPATIAL WEIGHTS Rook-Based Contiguity. Go to Tools/Weights/Create, create a rook-based weights file, and use the Key variable. Go to Tools/Weights/Connectivity Histogram to see the results.
SPATIAL WEIGHTS Queen-Based Contiguity. Go to Tools/Weights/Create, create a queen-based weights file, and use the Key variable. Go to Tools/Weights/Connectivity Histogram to see the results.
SPATIAL WEIGHTS How are neighboring units determined? The rook criterion counts only units that share a common boundary. The queen criterion determines neighboring units as those that have any point in common, including both common boundaries and common corners. The number of neighbors for any given unit under the queen criterion will therefore be equal to or greater than under the rook criterion.
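The two criteria, and the connectivity histogram they produce, can be illustrated on a regular 3 x 3 grid of cells. The grid is a hypothetical stand-in for real polygons, where contiguity would be derived from shared edges and vertices.

```python
from collections import Counter


def grid_neighbors(rows, cols, criterion):
    """Neighbors of each cell on a regular grid under rook or queen contiguity."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # shared edges
    if criterion == "queen":
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]  # shared corners
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            neighbors[(r, c)] = [
                (r + dr, c + dc)
                for dr, dc in steps
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
    return neighbors


rook = grid_neighbors(3, 3, "rook")
queen = grid_neighbors(3, 3, "queen")

# Connectivity histogram: how many cells have each number of neighbors.
rook_hist = Counter(len(n) for n in rook.values())
queen_hist = Counter(len(n) for n in queen.values())
print(sorted(rook_hist.items()))   # → [(2, 4), (3, 4), (4, 1)]
print(sorted(queen_hist.items()))  # → [(3, 4), (5, 4), (8, 1)]
```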
SPATIAL WEIGHTS Higher Order Contiguity. There are two definitions of higher order contiguity. Pure: does not include locations that were also contiguous at a lower order. Cumulative: includes all lower order neighbors.
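The pure and cumulative definitions can be sketched with a breadth-first search over first-order neighbors. The five-unit chain is hypothetical, and the function name is chosen for illustration only.

```python
def higher_order(neighbors, order, cumulative=False):
    """Neighbors exactly `order` steps away (pure) or within `order` steps."""
    result = {}
    for start in neighbors:
        steps_to = {start: 0}  # fewest steps needed to reach each unit
        frontier = [start]
        for step in range(1, order + 1):
            next_frontier = []
            for u in frontier:
                for v in neighbors[u]:
                    if v not in steps_to:
                        steps_to[v] = step
                        next_frontier.append(v)
            frontier = next_frontier
        if cumulative:
            result[start] = sorted(v for v, d in steps_to.items() if 1 <= d <= order)
        else:
            result[start] = sorted(v for v, d in steps_to.items() if d == order)
    return result


# Hypothetical chain of five units: 0 - 1 - 2 - 3 - 4.
first_order = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

pure2 = higher_order(first_order, order=2)
cumulative2 = higher_order(first_order, order=2, cumulative=True)
print(pure2[2])        # → [0, 4]
print(cumulative2[2])  # → [0, 1, 3, 4]
```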
SPATIAL LAG CONSTRUCTION Spatially Lagged Variables. Load a weights file. Open the table, right-click, and select Variable Calculation. Choose Spatial Lag construction. Add a variable with a new name (W_INC). The spatial weights file will already be loaded. Choose the variable to be spatially lagged (HH_INC). The new variable is calculated and added to the table. For a contiguity weights file, the spatially lagged variable is the simple average of the values for the neighboring units.
SPATIAL LAG CONSTRUCTION The spatial lag for one unit is the average of the variable's values in its neighboring units.
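As a minimal sketch of the calculation, the spatial lag below is the plain average over each unit's neighbors. The income values and neighbor lists are hypothetical; only the names HH_INC and W_INC come from the slides.

```python
def spatial_lag(values, neighbors):
    """Average of the neighbors' values for every unit that has neighbors."""
    return {
        i: sum(values[j] for j in nbrs) / len(nbrs)
        for i, nbrs in neighbors.items()
        if nbrs  # islands have no neighbors and get no lag
    }


# Hypothetical household incomes and contiguity neighbors.
hh_inc = {"a": 30, "b": 50, "c": 40, "d": 80}
neighbors = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}

w_inc = spatial_lag(hh_inc, neighbors)
print(w_inc)  # → {'a': 45.0, 'b': 50.0, 'c': 40.0, 'd': 50.0}
```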
SPATIAL AUTOCORRELATION Moran Scatter Plot. A plot with the variable of interest on the x-axis and its spatial lag on the y-axis. Use the Scatter Plot icon to create a Moran scatter plot manually, with W_INC on the left side and HH_INC on the right side. The slope of the regression line is the Moran's I statistic for HH_INC, here using a rook contiguity weights definition.
SPATIAL AUTOCORRELATION Global Spatial Autocorrelation. We will work with the univariate case and the Moran scatter plot. Scottish Lip Cancer Data: choose Map/Raw Rate, with Cancer as the Event and Pop as the Base variable. Set the map to the box map type with hinge 1.5. Save Rates (R_RAWRATE is the default name). Create a weights file with 5 nearest neighbors (try other values of k).
SPATIAL AUTOCORRELATION Moran's I Plot and Statistic. Go to the menu and select Space/Univariate Moran I. Select R_RAWRATE as the variable and select your weights file. Notice that the x and y axes are set up accordingly: the spatial lag variable is constructed for the y-axis, and R_RAWRATE on the x-axis has been standardized so that units correspond to standard deviations (values beyond 2 SD are treated as outliers). The plot is centered on the mean, with axes dividing it into 4 quadrants.
SPATIAL AUTOCORRELATION Moran's I Plot and Statistic. The 4 quadrants correspond to different types of spatial autocorrelation: high-high and low-low for positive spatial autocorrelation; low-high and high-low for negative spatial autocorrelation. The value listed at the top is the Moran's I statistic. Exclude Selected is available as an option. Intermediate calculations can be saved to the data table: right-click the graph and select Save Results.
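The link between the scatter plot and the statistic can be sketched directly: with row-standardized weights (the lag is an average over neighbors), Moran's I equals the slope of the lagged values on the mean-centered variable. The six-unit chain and its values are hypothetical, and the sketch centers the variable without rescaling to standard deviations, which leaves the slope unchanged.

```python
def moran_scatter(values, neighbors):
    """Return mean-centered values and their spatial lags (neighbor averages)."""
    mean = sum(values.values()) / len(values)
    z = {i: v - mean for i, v in values.items()}
    lag = {i: sum(z[j] for j in nbrs) / len(nbrs) for i, nbrs in neighbors.items()}
    return z, lag


def morans_i(values, neighbors):
    """Slope of the lag on the centered variable, i.e. Moran's I."""
    z, lag = moran_scatter(values, neighbors)
    return sum(z[i] * lag[i] for i in z) / sum(d * d for d in z.values())


def quadrant(z_i, lag_i):
    """Label a point by its own value (first) and its neighbors' average (second)."""
    own = "high" if z_i >= 0 else "low"
    around = "high" if lag_i >= 0 else "low"
    return own + "-" + around


# Hypothetical chain of six units with steadily increasing values.
values = {i: float(i + 1) for i in range(6)}
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

z, lag = moran_scatter(values, neighbors)
print(round(morans_i(values, neighbors), 4))
print([quadrant(z[i], lag[i]) for i in values])
```

Because similar values sit next to each other in this example, every point falls in the high-high or low-low quadrant and the statistic is positive.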
SPATIAL AUTOCORRELATION Inference. Inference for Moran's I is based on a random permutation procedure, which recalculates the statistic many times to generate a reference distribution. The obtained statistic is compared to this reference distribution to compute a pseudo significance level. Right-click the plot and select Randomization > 999 permutations. Click Run several times to assess the sensitivity of the results. The smallest attainable pseudo p-value depends directly on the number of permutations (for example, 0.001 with 999 permutations).
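The permutation procedure can be sketched as follows. This is an illustration under assumptions: the data are a hypothetical chain of eight units, the test is one-sided (counting permutations at least as large as the observed value), and the pseudo p-value formula (R + 1) / (M + 1) is a common convention rather than something stated in the slides.

```python
import random


def morans_i(values, neighbors):
    """Moran's I with row-standardized weights (lag = neighbor average)."""
    mean = sum(values.values()) / len(values)
    z = {i: v - mean for i, v in values.items()}
    num = sum(
        z[i] * sum(z[j] for j in nbrs) / len(nbrs) for i, nbrs in neighbors.items()
    )
    return num / sum(d * d for d in z.values())


def permutation_test(values, neighbors, permutations=999, seed=0):
    """Shuffle values across locations to build a reference distribution."""
    rng = random.Random(seed)
    ids = list(values)
    observed = morans_i(values, neighbors)
    data = [values[i] for i in ids]
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(data)
        if morans_i(dict(zip(ids, data)), neighbors) >= observed:
            at_least_as_large += 1
    # Pseudo p-value: the observed value counts as one of the M + 1 outcomes.
    return observed, (at_least_as_large + 1) / (permutations + 1)


# Hypothetical chain of eight units with steadily increasing values.
values = {i: float(i + 1) for i in range(8)}
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}

observed, p_value = permutation_test(values, neighbors, permutations=999)
print(round(observed, 4), p_value)
```

With M permutations the pseudo p-value can never fall below 1 / (M + 1), so running more permutations is the only way to report a smaller value.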