Big-Geo-Data EHR Infrastructure Development for On-Demand Analytics
Sohayla Pruitt, MA
Senior Geospatial Scientist
Duke Medicine DUHS DHTS EIM HIRS
The Institute of Medicine, the World Health Organization, and others recognize that clinical care may contribute only 10% to the health of a population.
Duke Medicine has spent MILLIONS on the electronic capture of clinical data.
How do we make a health system more aware of the other determinants of health?
GeoMedicine
Typical Geospatial Workflow Requirements
Personnel Software Data Hardware GIS Methods
Requires specialized software and highly trained personnel.
Data comes from a variety of sources (both free and costly) and is often obtained in both nonspatial and spatial formats.
Several specialized methods are used to prepare the data for geospatial visualization and analysis.
Often involves ad hoc analyses that all too often get funded as a small piece of a larger research project.
Typical Geospatial Workflow Disadvantages
Work is Not Easily Shared
MANY Possible Data Sources
Data Can Be Expensive
Work is Not Easily Scalable
Data Can Easily Become Stagnant
Significant Bias Introduced
Data Requires Significant Preparation
Not Enough Time Spent on Analysis
Too Much Time Spent on Data Prep
The Changing Paradigm: On-Demand Geospatial Analytics
We have become an on-demand society.
With easy access to online mapping applications and GPS-enabled smartphones and cars, we have become reliant on geospatial information.
Successful applications deliver immediate answers to user questions, without the user ever having to manually collect, process, or analyze the data.
Duke Medicine's EDW Geospatial Strategic Vision
Develop an enterprise geospatial infrastructure within Duke's Enterprise Data Warehouse (EDW), where automated methodologies download, update, and process geospatial data layers and then link them to each patient's geocoded address.
Ensure the end user has access to thousands of up-to-date geospatial variables.
Develop sophisticated geospatial visualization and analytics tools that help transform data into information on demand, in a project-agnostic way.
Spur geospatial health-care research efforts and help eliminate some of the bias from the analysis.
Duke Medicine's EDW Geospatial Infrastructure Development
[Infrastructure diagram: Automated Address Standardization and Geocoding]
Automated Address Standardization and Geocoding
Status: Completed August 2012
Methodology: Used SAS Data Management Studio, the USPS knowledge pack, and the Tom-Tom Rooftop +6 Geocoding Data Pack to deploy an automated process that runs nightly on patient address records.
Results:
~5.9 million address records evaluated for USPS verification/standardization and for rooftop or street-level geocoding accuracy.
~90% of all patients seen in the past 10 years have a current address that has been USPS verified and standardized.
~83% of all patients seen in the past 10 years have a current address that has been geocoded to rooftop or street-level accuracy.
The 7% of addresses that were standardized but not geocoded are non-physical addresses (e.g. P.O. boxes, military addresses).
[Figure: BEFORE and AFTER panels]
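The gap between standardized and geocoded addresses (90% vs. 83%) comes from non-physical addresses. A minimal sketch of how such addresses might be flagged before geocoding is shown here. The regular expressions and the function names `classify_address` and `geocodable` are illustrative assumptions, not part of the SAS-based process described above, which relies on the USPS knowledge pack.

```python
import re

# Hypothetical patterns for non-physical addresses (assumptions for
# illustration; the production process uses USPS-based standardization).
PO_BOX = re.compile(r"\bP\.?\s*O\.?\s*BOX\b", re.IGNORECASE)
MILITARY = re.compile(r"\b(APO|FPO|DPO)\b", re.IGNORECASE)


def classify_address(line: str) -> str:
    """Return 'po_box', 'military', or 'physical' for one address line."""
    if PO_BOX.search(line):
        return "po_box"
    if MILITARY.search(line):
        return "military"
    return "physical"


def geocodable(addresses):
    """Keep only addresses that point at a physical location."""
    return [a for a in addresses if classify_address(a) == "physical"]


if __name__ == "__main__":
    sample = [
        "2301 Erwin Rd, Durham, NC 27710",
        "P.O. Box 123, Durham, NC 27701",
        "PSC 1234 Box 5, APO AE 09012",
    ]
    for address in sample:
        print(classify_address(address), "|", address)
```

Only the addresses classified as physical would be passed on to the geocoder; the rest can still be standardized.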
[Infrastructure diagram: On-Demand Geospatial Visualization; Automated Address Standardization and Geocoding]
On-Demand Geospatial Visualization
Status: Completed November 2012
Results:
Filter data in the EDW to create patient cohorts and visualize them on a map.
Different map types are available depending on PHI authorization: dot distribution maps, or thematic maps using several geographic boundaries (e.g. counties, ZIP Codes, census tracts, block groups).
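As a rough sketch of how map type could depend on PHI authorization, the following assumes that authorized users receive point-level dot distribution data while other users receive only counts aggregated to a geographic boundary. That mapping, the function name `map_payload`, and the record fields (`lat`, `lon`, `county`) are assumptions for illustration; the slides do not specify which authorization level sees which map type.

```python
from collections import Counter


def map_payload(cohort, phi_authorized, boundary_field="county"):
    """Build the data a map layer would need for a patient cohort.

    cohort: list of dicts with 'lat', 'lon', and boundary fields
            (hypothetical record layout).
    """
    if phi_authorized:
        # Point-level data: one dot per patient address.
        return {
            "type": "dot_distribution",
            "points": [(p["lat"], p["lon"]) for p in cohort],
        }
    # Aggregated data: patient counts per geographic unit, no coordinates.
    counts = Counter(p[boundary_field] for p in cohort)
    return {"type": "thematic", "boundary": boundary_field, "counts": dict(counts)}


if __name__ == "__main__":
    cohort = [
        {"lat": 35.99, "lon": -78.90, "county": "Durham"},
        {"lat": 36.01, "lon": -78.94, "county": "Durham"},
        {"lat": 35.78, "lon": -78.64, "county": "Wake"},
    ]
    print(map_payload(cohort, phi_authorized=False))
```

The same function could aggregate to ZIP Codes, census tracts, or block groups by passing a different `boundary_field`.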
[Infrastructure diagram: On-Demand Geospatial Visualization; Automated Address Standardization and Geocoding; Automated Geospatial Data Collection, Transformation, Loading]
Automated Geospatial Data Collection, Transformation, and Loading
Status: In Progress / Ongoing
Results:
Acquired the following data sets to date:
  ESRI Infrastructure Data, resulting in ~30 feature types (e.g. interstates, roads, parks).
  MapInfo Business Points, resulting in ~7000 business feature types at multiple SIC code grouping levels (e.g. eating and drinking establishments vs. restaurants vs. fast food locations).
  Census 2010 Summary File 1 Demographic Data at Block Group Level, resulting in ~6000 statistical measures expressed as raw counts, percentages, medians, or averages.
  American Community Survey 5-year Estimate Demographic and Socio-economic Data at Block Group Level (2006-2011), resulting in ~6000 statistical measures expressed as raw counts, percentages, medians, or averages.
Developed automated routines to prepare the data for geospatial analysis (e.g. geo-location, clipping to NC, spatial re-projection, filtering, spatial joins) and load it into a geodatabase within the EDW.
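Two of the preparation steps named above, clipping to NC and filtering, can be sketched in a few lines. This is a simplified illustration: it clips with an approximate bounding box for North Carolina rather than the true state boundary, and the record layout and function names are assumptions. A production routine would clip against the state polygon in GIS software.

```python
# Approximate North Carolina extent (min_lon, min_lat, max_lon, max_lat).
# These values are rounded assumptions, used only as a coarse first pass.
NC_BBOX = (-84.33, 33.84, -75.46, 36.59)


def in_bbox(lon, lat, bbox=NC_BBOX):
    """True if the point falls inside the bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def prepare_features(features, keep_types=None, bbox=NC_BBOX):
    """Clip point features to the bounding box and keep selected types.

    features: iterable of dicts with 'type', 'lon', 'lat' (hypothetical layout).
    keep_types: optional set of feature types to retain; None keeps all.
    """
    prepared = []
    for feature in features:
        if not in_bbox(feature["lon"], feature["lat"], bbox):
            continue  # outside the study area
        if keep_types is not None and feature["type"] not in keep_types:
            continue  # filtered out by type
        prepared.append(feature)
    return prepared


if __name__ == "__main__":
    raw = [
        {"type": "park", "lon": -78.90, "lat": 35.99},      # Durham, NC
        {"type": "park", "lon": -84.39, "lat": 33.75},      # Atlanta, GA
        {"type": "fast_food", "lon": -78.64, "lat": 35.78},  # Raleigh, NC
    ]
    print(prepare_features(raw, keep_types={"park"}))
```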
[Infrastructure diagram: On-Demand Geospatial Visualization; Automated Address Association with Geospatial Data Features; Automated Address Standardization and Geocoding; Automated Geospatial Data Collection, Transformation, Loading]
Automated Address Association with Geospatial Data Features
Status: In Progress / Ongoing
Results:
Developed automated routines to calculate the relationship from each patient's address in the EDW to the nearest geospatial feature.
Resulted in ~30 new variables characterizing each patient's distance to infrastructure-related features (e.g. distance to interstates, distance to roads, distance to parks).
Resulted in ~7000 new variables characterizing each patient's distance to business features at varying degrees of categorization (e.g. distance to eating and drinking establishments, distance to restaurants, distance to fast food locations).
Resulted in ~12,000 demographic and socioeconomic block group variables expressed as raw counts, percentages, medians, or averages.
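The distance variables described above amount to a nearest-feature search per feature type. The sketch below uses great-circle (haversine) distance over point features, which is an assumption: the slides do not say which distance measure or spatial index the automated routines use, and line features such as roads would need point-to-line distance instead. All names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def nearest_feature_distances(patient, features):
    """Return {feature_type: distance in km to the nearest feature of that type}.

    patient: dict with 'lat' and 'lon' (the geocoded address).
    features: iterable of dicts with 'type', 'lat', 'lon'.
    """
    nearest = {}
    for feature in features:
        d = haversine_km(patient["lat"], patient["lon"], feature["lat"], feature["lon"])
        if feature["type"] not in nearest or d < nearest[feature["type"]]:
            nearest[feature["type"]] = d
    return nearest


if __name__ == "__main__":
    patient = {"lat": 36.0, "lon": -78.9}
    features = [
        {"type": "park", "lat": 36.05, "lon": -78.9},
        {"type": "park", "lat": 37.0, "lon": -78.9},
        {"type": "fast_food", "lat": 36.1, "lon": -78.9},
    ]
    for feature_type, km in nearest_feature_distances(patient, features).items():
        print(f"distance to nearest {feature_type}: {km:.2f} km")
```

A brute-force loop like this scales poorly to millions of addresses and thousands of feature types; a spatial index would be the usual next step.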
[Infrastructure diagram: On-Demand Geospatial Visualization; Automated Address Association with Geospatial Data Features; Automated Address Standardization and Geocoding; Automated Geospatial Data Collection, Transformation, Loading; On-Demand Patient Geospatial Variable Filtering/Export]
On-Demand Patient Geospatial Variable Filtering/Export
Status: In Progress / Ongoing
Results:
Each patient's socioeconomic and demographic block group level variables are available for on-demand filtering, visualization, and export.
The new geospatial data elements can be exported and used in advanced statistical models.
The distance variables for infrastructure features and business establishments are not yet available on demand; this work is in progress.
[Figure: New Geo Data Element Filters]
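To illustrate the filter-then-export pattern, here is a small sketch that filters patient records on one block group variable and writes the selected columns to CSV for use in a statistical package. The variable names (`bg_median_income`, `bg_pct_bachelors`) and function names are hypothetical; the actual EDW filters and export format are not described in the slides.

```python
import csv
import io


def filter_patients(rows, variable, minimum=None, maximum=None):
    """Keep rows whose value for `variable` lies within [minimum, maximum]."""
    kept = []
    for row in rows:
        value = row.get(variable)
        if value is None:
            continue  # no block group value linked to this patient
        if minimum is not None and value < minimum:
            continue
        if maximum is not None and value > maximum:
            continue
        kept.append(row)
    return kept


def export_csv(rows, columns):
    """Serialize the selected columns of each row as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    patients = [
        {"patient_id": 1, "bg_median_income": 32000, "bg_pct_bachelors": 18.5},
        {"patient_id": 2, "bg_median_income": 71000, "bg_pct_bachelors": 52.0},
        {"patient_id": 3, "bg_median_income": 39500, "bg_pct_bachelors": 24.1},
    ]
    cohort = filter_patients(patients, "bg_median_income", maximum=40000)
    print(export_csv(cohort, ["patient_id", "bg_median_income"]))
```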
On-Demand Patient Geospatial Variable Filtering/Export
[Figure: New Geo Data Elements for Export]
[Infrastructure diagram: On-Demand Geospatial Visualization; Automated Address Association with Geospatial Data Features; On-Demand Patient Geospatial Variable Visualization; Automated Address Standardization and Geocoding; Automated Geospatial Data Collection, Transformation, Loading; On-Demand Patient Geospatial Variable Filtering/Export]
On-Demand Patient Geospatial Variable Visualization
Status: In Progress / Ongoing
Results:
Each cohort's socioeconomic and demographic block group level variables are available for on-demand charting.
Common block group socioeconomic and demographic status variables can also be mapped thematically and visualized alongside a cohort.
The distance variables for infrastructure features and business establishments are not yet available on demand; this work is in progress.
On-Demand Patient Geospatial Variable Visualization
Results Continued:
We are exploring the use of BI dashboards to provide on-demand visualizations that help answer "who?" and "where?"
On-Demand Patient Geospatial Variable Visualization
Results Continued:
We are working with SAS partners to explore the use of SAS Visual Analytics to provide on-demand analytic functionality.
This will allow us to move toward a more complete on-demand environment, decreasing the need for researchers to extract data from the EDW onto their own machines in order to analyze it statistically.
[Infrastructure diagram: On-Demand Geospatial Visualization; Automated Address Association with Geospatial Data Features; On-Demand Patient Geospatial Variable Visualization; Automated Address Standardization and Geocoding; Automated Geospatial Data Collection, Transformation, Loading; On-Demand Patient Geospatial Variable Filtering/Export; On-Demand Geospatial Predictive Analytics]
On-Demand Geospatial Predictive Analytics
Innovation: Developed the first-ever proof of concept of an mHealth technology capable of learning a user's behavior through time and space, learning the socio-geographic factors that influence that behavior, and delivering real-time intervention, just in time and just in place.
Overview: Supported the Community Health and Resource Mapping (CHARM) team in building the big-geo-data infrastructure on top of mHealth data collected on smokers. Demonstrated how 5000+ geospatial variables could be considered within a logistic regression model to: (a) identify the geospatial characteristics that are common and statistically significant across the locations where participants reported smoking within their mHealth app, and (b) identify other areas with similar geospatial characteristics where participants have a high statistical probability of smoking in the future, based on their past behavior.
INPUT: Modeled x,y mHealth smoking logs of 17 smokers across NC against 5000 geospatial data variables.
OUTPUT: Generated Probability Hotspot Maps (PHMs)*, where values range from 0 to 1 and specify the likelihood that participants will engage in smoking behavior in a given location.
* PHMs differ from density hotspot maps: they do not only summarize the density of where behavior occurred in the past, but also identify NEW areas where the behavior is likely to occur in the future.
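The idea behind a probability hotspot map can be sketched with a tiny logistic regression: fit weights on locations labeled by whether smoking was reported, then score other locations to obtain values between 0 and 1. Everything concrete below is an assumption for illustration: the two feature names, the toy numbers, the use of non-smoking control locations as the negative class, and the plain gradient-descent fit. The CHARM model itself used thousands of variables and is not reproduced here.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, learning_rate=0.1, epochs=2000):
    """Fit logistic regression weights by batch gradient descent."""
    n, k = len(X), len(X[0])
    weights, bias = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(bias + sum(w * x for w, x in zip(weights, xi))) - yi
            for j in range(k):
                grad_w[j] += error * xi[j]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def probability_surface(cells, weights, bias):
    """Score each map cell; values lie in [0, 1] like a PHM."""
    return {
        cell_id: sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
        for cell_id, features in cells.items()
    }


if __name__ == "__main__":
    # Toy features per location (hypothetical):
    # [km to nearest tobacco retailer, km to nearest park]
    X = [[0.2, 1.5], [0.3, 2.0], [0.1, 0.8], [2.5, 0.4], [3.0, 1.0], [2.8, 0.2]]
    y = [1, 1, 1, 0, 0, 0]  # 1 = smoking reported, 0 = control location
    weights, bias = fit_logistic(X, y)
    cells = {"cell_near_retailer": [0.2, 1.0], "cell_far_from_retailer": [3.0, 1.0]}
    for cell_id, p in probability_surface(cells, weights, bias).items():
        print(f"{cell_id}: {p:.3f}")
```

Because the surface is computed from location characteristics rather than from past event density, it can assign a high probability to cells where no smoking has been logged yet, which is the distinction the footnote draws.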
Summary of Benefits
The geospatial data being integrated with the EHR contributes a valuable set of data elements that are not collected during patient interactions (e.g. educational attainment, income, primary mode of transportation, distance from primary care clinics, distance from fitness facilities).
This integration will give our community on-demand access to geospatial visualization and analytics without requiring users to be expert geospatial modelers who know where to acquire geospatial data, how to process it, and how to work with advanced geospatial software to get the information and analysis they need.
This approach to research and management can transform how our organization examines the geographic and environmental determinants in the Population Medicine equation.