
DM3
Calculation of Health Disparity Indices Using Data Mining and the SAS Bridge to ESRI
Mussie Tesfamicael, University of Louisville, Louisville, KY

ABSTRACT
Socioeconomic indices are strongly believed to be associated with the risk of disease. However, no consensus exists in the US regarding which area-based measure should be used to measure or monitor socioeconomic inequalities in health. The purpose of this paper is to determine which area-based socioeconomic measures are most appropriate for US public health surveillance when investigating the incidence of disease. Geographic information systems (GIS) manage, analyze, and disseminate spatial data. ArcMap is used to display the results of the analysis in a variety of formats, such as maps, reports, and graphs. The SAS Bridge to ESRI is used to transfer the spatial information directly into SAS datasets. The specific example here examines the relationship between the rate of cancer and various indices of socioeconomic status (SES) for a study area consisting of Kentucky, Tennessee, North Carolina, Virginia, and West Virginia. Linear models and cluster analysis in SAS are used to investigate the spatial data from ArcMap and to optimize the definition of a socioeconomic health index.

INTRODUCTION
The study data consist of 522 counties in five states. We use three different methods to predict the rate of cancer from the socioeconomic indices. The SAS Bridge to ESRI transforms the spatial data into SAS datasets so that inferential statistics in SAS can be used to investigate how well the different independent variables predict the rate of cancer. Although linear models and cluster analysis are widely used in data analysis, they have not been used regularly with ArcMap. To enhance the use of SAS with ArcMap, SAS has developed the Bridge to ESRI.
In this paper we demonstrate how the SAS Bridge to ESRI is used to transfer the spatial information directly into SAS datasets. The primary task of this process is to locate the predictors of cancer rates among different categorical variables.

The use of linear models enables us to perform analysis of variance when we have a continuous dependent variable with independent classification variables, quantitative variables, or both. Besides the usual estimators and test statistics produced for a regression, a fit analysis can produce many diagnostic statistics. Collinearity diagnostics measure the strength of the linear relationship among explanatory variables and how this collinearity affects the stability of the estimates. Influence diagnostics measure how each individual observation contributes to determining the parameter estimates and the fitted values.

In matrix algebra notation, a linear model is written as

$$ y = X\beta + \varepsilon $$

where $y$ is the $n \times 1$ vector of responses, $X$ is the $n \times p$ design matrix, $\beta$ is the $p \times 1$ vector of unknown parameters, and $\varepsilon$ is the $n \times 1$ vector of unknown errors.

Factor analysis selects which variables in the data set are explanatory variables. The main applications of factor analytic techniques are (1) to reduce the number of variables and (2) to detect structure in the relationships between variables, that is, to classify variables. Factor analysis is therefore applied as a data reduction or structure detection method.

With the SAS Bridge to ESRI, we can export data to a SAS data set and use SAS to perform any analysis that is needed. The SAS Bridge to ESRI adds the analytic intelligence of SAS to the easy-to-use mapping capabilities of ArcGIS. The result is a geographic information system unmatched in its ability to inform, persuade, and motivate. We can also join SAS data to ArcMap layers to uncover new relationships in the existing data, to find answers, and to solve problems [1].
BACKGROUND
Cervical cancer is the number one killer of women in many developing countries. More than 39 women die in the United States each year from this disease. A woman who is not screened on a regular basis significantly increases her chances of developing cervical cancer. Only 11% of women report that they do not have regular cervical cancer screenings [5].

Three different methods are used to investigate the rate of cervical cancer in white females and to demonstrate the use of the SAS Bridge. The cancer data were obtained from www.cancer.gov. Geographical data for the study states of Kentucky, Tennessee, Virginia, West Virginia, and North Carolina were obtained from www.census.gov. The main objective is to predict the rate of cancer based on the level of the predictor variables (Table 1).

Table 1. Predictor Variables for Cervical Cancer

Name                  Variable Name  Description
Low Education         LOWED          Percentage of persons aged >=25 with less than a high school education
High Education        HIGHED         Percentage of persons aged >=25 with at least 4 years of college
High Occupation       HIGHOCC        Percentage of persons employed in predominantly working class
Low Occupation        LOWOCC         Low Occupation
Low Income            LOWINC         Percentage of households with an income <$15,
High Income           HIGHINC        Percentage of population with an income >$15,
Poverty               POVERTY        Below federally defined line: for example, income below $12,647 for a family of four
Crowding              CROWD          Households with more than one person per room
No Vehicle Available  NOCAR          Percentage of no car ownership
High House Value      HIGHVAL        Homes worth >=$3,

First, we downloaded the geographic files stgeo.uf3 from www.census.gov, followed by the files for the variables of interest: Educational Attainment (P37), Occupation (P5), Income (P52), Poverty (P87), Poverty Ratio (P88), Tenure by Persons per Room (H22), Tenure by Vehicles (H44), and Value of Housing (H84). These were downloaded for each of the five states. Cervical cancer data were also downloaded for each county of the five states.

The next step was to divide each variable into quintiles (top 20%, next 20%, middle 20%, next lowest 20%, and bottom 20%), in which the top is given a value of 1 and the bottom a value of 5, so that 1 represents the best one fifth of cases and 5 the worst. By default, SAS starts calculating percentiles at the low end, which is the reverse of the natural order of the quintiles. If a variable is reversed, that is, a high value does not equate to a good situation (for instance, a high percentage of poverty is not a good situation), the 1st percentile group should be coded as 5, not 1.

I General Linear Model
The first method we used to predict the rate of cancer was to assign a level based on percentile for each predictor variable.
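The quintile coding described above can be sketched with PROC RANK. This is an illustration under stated assumptions, not the code used in the paper: the dataset name work.counties and the output variable names are hypothetical, and the reading that 1 should always mark the most favorable fifth is an interpretation of the text. GROUPS=5 numbers the groups 0 to 4, so 1 is added afterwards.

```sas
/* Sketch only: work.counties and the *_grp / *level names are assumed. */

/* High value = good (e.g. HIGHED): DESCENDING puts the highest fifth in group 0. */
proc rank data=work.counties out=work.step1 groups=5 descending;
   var HIGHED;
   ranks HIGHED_grp;
run;

/* High value = bad (e.g. POVERTY): the default ascending order puts the
   lowest fifth in group 0, so the most favorable counties still come first. */
proc rank data=work.step1 out=work.step2 groups=5;
   var POVERTY;
   ranks POVERTY_grp;
run;

/* PROC RANK numbers groups 0-4; shift to the 1-5 coding used in the text. */
data work.levels;
   set work.step2;
   HIGHEDlevel  = HIGHED_grp + 1;
   POVERTYlevel = POVERTY_grp + 1;
   drop HIGHED_grp POVERTY_grp;
run;
```

Handling the direction inside PROC RANK keeps the recoding in one place, rather than flipping values by hand in a later DATA step.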
We then classified people in public health databases by the socioeconomic characteristics of their residential neighborhood (here, rate of cancer by county). The index for the cancer rate was calculated from all the variables, and a level was assigned to the index based on its percentile. Indexlevel1 was used to predict the rate of cancer in white females across the study states: Kentucky, Tennessee, Virginia, West Virginia, and North Carolina. The following SAS procedure was used for this task.

    /* Fit the cancer-rate level on the five-level index, treated as a class variable */
    proc GLM data=sasuser.sepindexlevel1;
       class indexlevel1;
       model RATEWFLEVEL_NUM = indexlevel1 / solution;
       output out=indexlevel1 p=_pred;   /* save predicted values */
    run;

    /* Round each prediction to the nearest whole level */
    data sasuser.predratewfind1;
       set indexlevel1;
       PredValue1 = round(_pred);
    run;

SAS output

R-Square   Coeff Var   P-value
.7529      45.6143     <.1

Parameter     Estimate   SE     Pr > |t|
Intercept     3.48       .132   <.1
indlevel1 1   -1.6       .18    <.1
indlevel1 2   -.6        .191   .18
indlevel1 3   -.18       .19    .3532
indlevel1 4   -.44       .195   .232
indlevel1 5   .          .      .

Result 1
R² for indexlevel1 was .7529 and P<.1. This tells us that something is wrong. The problem is multicollinearity. The variables that produced the socioeconomic indexlevel1 were highly correlated, but this was not taken into consideration in calculating indexlevel1. Even though the overall P value (.1) is very low, most of the individual P values are high. This suggests that the model does not fit well: the index as a whole is associated with the rate of cancer, but most of its individual levels do not have a statistically significant effect.

In Figure 1, the scatter plot of low-income level versus poverty shows that as poverty increases, the number of people with a low income level increases as well. In the same way, as the number of people with high education increases, the number of people with a low income level decreases. The other scatter plots can be explained in a similar way.
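One way to check the collinearity described above before building an index is to look at pairwise correlations and variance inflation factors. The following is a minimal sketch, not part of the original analysis: work.levels is a hypothetical dataset assumed to hold the level variables, and the five predictors listed are an arbitrary subset chosen for illustration.

```sas
/* Sketch only: work.levels is an assumed dataset name. */

/* Pairwise correlations among the predictor levels */
proc corr data=work.levels;
   var POVERTYlev_num LowINCleve_num NOCARlevel_num HighINClev_num HighEdleve_num;
run;

/* VIF, tolerance, and eigenvalue-based collinearity diagnostics */
proc reg data=work.levels;
   model RATEWFLEVEL_NUM = POVERTYlev_num LowINCleve_num NOCARlevel_num
                           HighINClev_num HighEdleve_num / vif tol collin;
run;
quit;
```

A commonly quoted rule of thumb treats a variance inflation factor well above 10 as a sign that a predictor is nearly a linear combination of the others; the threshold is a convention, not a test.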

Figure 2. Distribution of Cervical Cancer, Higher Education, Poverty and Ownership of Cars
[Map panels: Distribution of Cervical Cancer; Distribution of Higher Education; Distribution of Poverty; Distribution of Ownership of Cars]

These maps show that the eastern part of Kentucky and the northern part of Tennessee have very high rates of cancer and very low rates of higher education. In contrast, Virginia has a very high rate of education and lower rates of cancer. Mapping the distribution of cervical cancer rates in the general white female population against cervical cancer rates for white females in poverty showed that eastern Kentucky and northern Tennessee have high rates of poverty, which is not the case in the eastern portion of Virginia.

II Factor Analysis
Since the GLM method did not give a good fit to the data, a factor analysis was used. The factor analysis gave two factors. The variables related to economic resources are grouped together as Factor 1, and those related to employment and education are grouped in Factor 2. The SAS code gave the following standardized scoring coefficients.

SAS output

Variable          Factor1    Factor2
POVERTYlev_num    .39        -.12257
LowINCleve_num    .2676      -.484
NOCARlevel_num    .32712     -.24129
HighINClev_num    .9468      .17643
HIGHVALlev_num    .13587     -.77
HighEdleve_num    .512       .29817
HighOcclev_num    -.22325    .38763
LowEdlevel_num    .5249      .22684
CROWDlevel_num    .6763      -.17714
LowOccleve_num    -.1535     -.266

We now construct an equation for indexlevel2 by assigning each predictor variable to the factor on which its coefficient has the highest absolute value:

    Factor1 = POVERTYlev_num*.39 + LowINCleve_num*.2676 + NOCARlevel_num*.32712
            + HIGHVALlev_num*.13587

    Factor2 = HighINClev_num*.17643 + HighEdleve_num*.29817 + HighOcclev_num*.38763
            + LowEdlevel_num*.22684 - CROWDlevel_num*.17714 - LowOccleve_num*.266

The standardized scoring coefficient gave a coefficient of .59762 for Factor1 and Factor2. Based on these results, the indexlevel2 of the rate of cancer was calculated. Using this index in the model gave the following result.

SAS output

R-Square   Coeff Var   P-value
.11648     44.727      <.1

Parameter     Estimate   SE     Pr > |t|
Intercept     3.59       .131   <.1
indlevel2 1   -1.41      .185   <.1
indlevel2 2   -.69       .185   .2
indlevel2 3   -.52       .185   .52
indlevel2 4   -.33       .186   .791
indlevel2 5   .          .      .

Result 2
R² for indexlevel2 was .11648 and P<.1. The overall P value is significant, and only one of the index levels is marginally significant (.791). The factor analysis therefore improves upon the first model. When cluster analysis is performed, the five index levels are clustered into three different classes.

Indexlevel2   Class 2   Class 3   Class 4   Total
1             44        49        12        15
2             22        74        8         14
3             18        65        22        15
4             16        57        31        14
5             4         69        31        14
Total         14        314       14        522

Table 2. Predicted cervical cancer classes by indexlevel2, where only three classes are observed.
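The paper does not show the code behind the scoring coefficients. A minimal sketch of how a two-factor solution and its scores could be obtained is given below. Everything beyond the variable names is an assumption: the dataset work.levels is hypothetical, and the extraction method (principal components) and rotation (varimax) are illustrative choices that the source does not state.

```sas
/* Sketch only: dataset names, METHOD= and ROTATE= are assumptions. */

/* Extract two factors; SCORE prints the standardized scoring coefficients
   and OUTSTAT= saves them for PROC SCORE. */
proc factor data=work.levels method=principal nfactors=2 rotate=varimax
            score outstat=work.factstat;
   var POVERTYlev_num LowINCleve_num NOCARlevel_num HighINClev_num HIGHVALlev_num
       HighEdleve_num HighOcclev_num LowEdlevel_num CROWDlevel_num LowOccleve_num;
run;

/* Apply the scoring coefficients to each county; the output dataset gains
   the variables Factor1 and Factor2. */
proc score data=work.levels score=work.factstat out=work.factscores;
   var POVERTYlev_num LowINCleve_num NOCARlevel_num HighINClev_num HIGHVALlev_num
       HighEdleve_num HighOcclev_num LowEdlevel_num CROWDlevel_num LowOccleve_num;
run;
```

Note that PROC SCORE uses every coefficient for every variable, whereas the index in the text keeps each variable only in the factor where its coefficient is largest in absolute value, so the two sets of scores would differ.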

III Interaction Effect
The previous two methods did not give good results, so we sought another method. We introduced interaction effects within two groups of variables, which can be called the low social class variables and the high social class variables. This gave a better result. The following SAS code was used.

    /* Two high-order interaction terms: one among the "high" variables,
       one among the "low" variables */
    proc glm data=sasuser.interaction;
       class HIGHVALLEV LOWEDLEVEL HIGHEDLEVE LOWOCCLEVE HIGHOCCLEV
             LOWINCLEVE HIGHINCLEV POVERTYLEV CROWDLEVEL NOCARLEVEL;
       model RATEWFlevel_num =
             HIGHVALlev*HIGHEDleve*HIGHOCClev*HIGHINClev
             LOWOCCleve*LOWINCleve*POVERTYlev*CROWDlevel*NOCARlevel*LOWEDlevel
             / solution;
       output out=sasuser.method3data p=_pred;   /* save predicted values */
    run;

    /* Round each prediction to the nearest whole level */
    data sasuser.allmethod;
       set sasuser.method3data;
       predvalue3 = round(_pred);
    run;

Result 3
The prediction table obtained by using the interaction effect gave five clusters, as desired. As long as many interaction effects are included, the model is going to fit the data. However, one must consider carefully that the model may over-fit the data as more interactions are included. Each class variable has five levels, so the four-way and six-way interaction terms in this model can define up to 5^4 = 625 and 5^6 = 15,625 cells, far more than the 522 counties available.

Frequency / Col Pct, by predicted level (columns 1 2 3 4 5 Total)
1       14 1 15 1.91
2       12 2 14 99.3 1.82
3       1 12 2 15 .97 92.73 1.94
4       5 97 2 14 4.55 94.17 1.96
5       4 1 14 3.88 98.4
Total   14 13 11 13 12 522

Table 3. Prediction table of the cancer rate based on the interaction method.

The interaction method of predicting the rate of cancer provides an almost perfect clustering of the counties, with very few misclassifications.

The Local Moran is related to the interaction model since it controls for the other variables (spatial regression). The Local Moran test (Anselin 1995) detects local spatial autocorrelation for the general linear model and the factor analysis. It can be used to identify local clusters (regions where adjacent areas have similar values) or spatial outliers (areas distinct from their neighbors). The Local Moran can be used to investigate local spatial clusters and as a diagnostic for outliers with respect to the measure of global association (local instability).
The Local Moran value for each observation gives an indication of the extent of significant spatial clustering of similar values around that observation. The Local Moran statistics are used to identify regions that differ significantly from those expected under the null hypothesis [3].

$$ I_{i,t} = m_{i,t} \sum_{j} w_{i,j} \, m_{j,t} $$

Here $m_{i,t}$ is the z-score standardized value being tested for region $i$ at time $t$, $m_{j,t}$ is the z-score standardized value for region $j$ at time $t$, and $w_{i,j}$ is a spatial weight denoting the strength of connection between areas $i$ and $j$. The Local Moran statistic $I_{i,t}$ will be positive when values at neighboring locations are similar, and negative if they are dissimilar. STIS (Space Time Information System) evaluates the significance of Local Moran statistic values with Monte Carlo randomizations, using conditional randomization.

GeoDa was used to investigate the spatial autocorrelation of the predictor variables. The resulting map shows the significant locations by type of association, and the significance map shows the locations in different shades of green, depending on the degree of significance. The maps support visualization, explanation, and exploration of interesting patterns in geographic data.

Figure 3. Moran scatter plot matrix and serial correlation for indexlevel1 and indexlevel2

Figure 3 assesses the relationship between the variable value for the unit of origin (x-axis) and the average of the values of its neighbors (y-axis).

Figure 4. LISA cluster map for p<.1
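To make the Local Moran calculation concrete (each region's own z-score multiplied by the weighted sum of its neighbors' z-scores), the following SAS/IML sketch computes the statistic for a toy example. It is an illustration only, not the STIS or GeoDa implementation: the four rates, the chain-shaped contiguity matrix, the row standardization of the weights, and the use of the population standard deviation are all assumptions, and the Monte Carlo significance test is omitted.

```sas
proc iml;
   /* Hypothetical rates for four regions arranged in a line */
   x = {10, 12, 30, 33};

   /* Assumed binary contiguity: each region neighbors the next one */
   W = {0 1 0 0,
        1 0 1 0,
        0 1 0 1,
        0 0 1 0};

   /* Row-standardize so each region's neighbor weights sum to 1 */
   W = W / repeat(W[ ,+], 1, ncol(W));

   /* z-score standardization using the population standard deviation */
   n    = nrow(x);
   xbar = x[:];
   sd   = sqrt(ssq(x - xbar) / n);
   m    = (x - xbar) / sd;

   /* Local Moran: own z-score times the weighted sum of neighbors' z-scores */
   wm     = W * m;
   localI = m # wm;

   print x m wm localI;
quit;
```

The columns m and wm are the two axes of a Moran scatter plot, so the sign pattern of each pair indicates whether a region resembles its neighbors or stands apart from them.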

High-High can be interpreted as "I'm high and my neighbors are high", High-Low as "I'm a high outlier among low neighbors", Low-Low as "I'm low and my neighbors are low", and Low-High as "I'm a low outlier among high neighbors". The map contains information only on those locations that have a significant Local Moran statistic. While every region in the dataset is represented in the Moran scatter plot, only those with Local Moran statistic p-values below .5 are colored red or blue on the example map above. Regions with non-significant Local Moran statistics are colored gray [4].

Figure 5. Cluster map for Index1

CONCLUSION
No consensus exists in the US regarding which area-based measure should be used to measure or monitor socioeconomic inequalities in health. The populations of the study states are highly affected by cancer. We set out to determine whether the risk of cancer is related to socioeconomic status. The general linear model and the factor analysis did not produce a satisfactory prediction, so we continued the search for a better method. The interaction method did well compared to the other two methods, as it classified the rate of cancer into five different levels based on the socioeconomic indices. We conclude that the interaction method gives a good prediction of the rate of cancer, with a small improvement over the other two methods. However, one must consider carefully that the model may over-fit the data as more interactions are included.

REFERENCES
[1] ESRI and the ESRI globe logo are trademarks of Environmental Systems Research Institute, Inc.
[2] Copyright StatSoft, Inc., 1984-23. STATISTICA is a trademark of StatSoft, Inc.
[3] Anselin, L. Local indicators of spatial association - LISA, 1995. Geographical Analysis, 27:93-115.
[4] http://www.terraseer.com/products/stis/help/Statistics/LM/Results/Interpreting_univariate_Local_Moran_statistics.htm
[5] National Cervical Cancer Coalition, http://www.nccc-online.org/.

CONTACT INFORMATION
Mussie Tesfamicael
Department of Mathematics
University of Louisville
Louisville, KY 4292
Work Phone: 52-852-712, 52-298-824
Fax: 52-852-7132
Email: matesf1@louisville.edu