How to make R, PostGIS and QGis cooperate for statistical modelling duties: a case study on hedonic regressions

Similar documents
How to make R, PostGIS and QGis cooperate for statistical modelling duties: a case study on hedonic regressions

How to make R, PostGIS and QGis cooperate for statistical modelling duties

Among various open-source GIS programs, QGIS can be the best suitable option which can be used across partners for reasons outlined below.

Karsten Vennemann, Seattle. QGIS Workshop CUGOS Spring Fling 2015

Development of a server to manage a customised local version of OpenStreetMap in Ireland

Are You Maximizing The Value Of All Your Data?

QGIS FLO-2D Integration

Valdosta State University Strategic Research & Analysis

Leveraging Your Geo-spatial Data Investments with Quantum GIS: an Open Source Geographic Information System

A Distributed GIS Architecture for Research in Baalbek Based on CISAR

Welcome! Power BI User Group (PUG) Copenhagen

Write a report (6-7 pages, double space) on some examples of Internet Applications. You can choose only ONE of the following application areas:

GIS-based Smart Campus System using 3D Modeling

Transactions on Information and Communications Technologies vol 18, 1998 WIT Press, ISSN

OSGIS Platform. Storing and distributing PostGIS, Deegree, UMN Map Server Desktop visualization JUMP, QGIS, Thuban, udig, gvsig

GIS Software. Evolution of GIS Software

GIS = Geographic Information Systems;

Practical teaching of GIS at University of Liège

GIS Geographical Information Systems. GIS Management

GeoPostcodes. Trinidad & Tobago

Salisbury University: Eric Flint, John O Brien, & Alex Nohe

GEOGRAPHICAL INFORMATION SYSTEMS. GIS Foundation Capacity Building Course. Introduction

GeoPostcodes. Luxembourg

Newcastle City Council - Migration to QGIS and Open Source GIS

Overview. Everywhere. Over everything.

One platform for desktop, web and mobile

ST-Links. SpatialKit. Version 3.0.x. For ArcMap. ArcMap Extension for Directly Connecting to Spatial Databases. ST-Links Corporation.

GeoPostcodes. Grecia

GIS for ChEs Introduction to Geographic Information Systems

GeoPostcodes. Litauen

ENV208/ENV508 Applied GIS. Week 1: What is GIS?

How do I do that in Quantum GIS: illustrating classic GIS tasks Edited by: Arthur J. Lembo, Jr.; Salisbury University

TOWARDS THE DEVELOPMENT OF A MONITORING SYSTEM FOR PLANNING POLICY Residential Land Uses Case study of Brisbane, Melbourne, Chicago and London

A Review: Geographic Information Systems & ArcGIS Basics

GeoPostcodes. Denmark

a system for input, storage, manipulation, and output of geographic information. GIS combines software with hardware,

Why GIS & Why Internet GIS?

Geometric Algorithms in GIS

Understanding Geographic Information System GIS

Geographical Information Systems Energy Database Report. WP1 T1.3- Deliverable 1.9

Census Mapping with ArcGIS

PC ARC/INFO and Data Automation Kit GIS Tools for Your PC

Existing Open Source Tools and Possibilities for Cadastre Systems

CES Seminar 2016: Geospatial information services based on official statistics Key issues from the session II papers

STATE GEOGRAPHIC INFORMATION DATABASE

A BASE SYSTEM FOR MICRO TRAFFIC SIMULATION USING THE GEOGRAPHICAL INFORMATION DATABASE

Understanding China Census Data with GIS By Shuming Bao and Susan Haynie China Data Center, University of Michigan

Toward a coupling between GIS and agent simulation: USM, an OrbisGIS extension to model urban evolution at a large scale

[N492] Using GIS in Noise exposure analysis

MapOSMatic: city maps for the masses

Data Aggregation with InfraWorks and ArcGIS for Visualization, Analysis, and Planning

Joint MISTRAL/CESI lunch workshop 15 th November 2017

UTAH S STATEWIDE GEOGRAPHIC INFORMATION DATABASE

EXPLANATION OF G.I.S. PROJECT ALAMEIN FOR WEB PUBLISHING

AUTOMATED METERED WATER CONSUMPTION ANALYSIS

Geometric Algorithms in GIS

INTRODUCTION TO GEOGRAPHIC INFORMATION SYSTEM By Reshma H. Patil

Digital Tax Maps Westport Island Project Summary

Learning Computer-Assisted Map Analysis

Multiscale Spatio-Temporal Data Aggregation and Mapping for Urban Data Exploration

The E. I. E. L Project: An Experience of a GIS Development

gvsig: Open Source Solutions in spatial technologies

DEVELOPMENT OF TRAFFIC ACCIDENT ANALYSIS SYSTEM USING GIS

PostGIS Cookbook. open source I community experience distilled

Development of Univ. of San Agustin Geographic Information System (USAGIS)

CENSUS MAPPING WITH GIS IN NAMIBIA. BY Mrs. Ottilie Mwazi Central Bureau of Statistics Tel: October 2007

YYT-C3002 Application Programming in Engineering GIS I. Anas Altartouri Otaniemi

Open Source Technologies and Remotely Sensed Data in Protecting Elephants. Rosemary Alles Dr. Justine Blanford Penn State World Campus July 2015

INTRODUCTION TO GIS. Dr. Ori Gudes

UNIT 4: USING ArcGIS. Instructor: Emmanuel K. Appiah-Adjei (PhD) Department of Geological Engineering KNUST, Kumasi

Introduction to ArcGIS Server - Creating and Using GIS Services. Mark Ho Instructor Washington, DC

IndiFrag v2.1: An Object-based Fragmentation Analysis Software Tool

GIS Visualization: A Library s Pursuit Towards Creative and Innovative Research

Introduction-Overview. Why use a GIS? What can a GIS do? Spatial (coordinate) data model Relational (tabular) data model

ArcGIS 9 ArcGIS StreetMap Tutorial

FIRE DEPARMENT SANTA CLARA COUNTY

OPEN SOURCE TECHNOLOGIES IN GEOGRAPHIC INFORMATION SYSTEMS

NATO Headquarters The Situation Center GIS experience.

Merging statistics and geospatial information

The Application of 3D Web GIS In Land Administration - 3D Building Model System

Assessment and management of at risk populations using a novel GIS based UK population database tool

Geographical Databases: PostGIS. Introduction. Creating a new database. References

Introducing GIS analysis

An online data and consulting resource of THE UNIVERSITY OF TOLEDO THE JACK FORD URBAN AFFAIRS CENTER

Demographic Data in ArcGIS. Harry J. Moore IV

GIS ADMINISTRATOR / WEB DEVELOPER EVANSVILLE-VANDERBURGH COUNTY AREA PLAN COMMISSION

Transcription:

202 RESEARCH CONFERENCES How to make R, PostGIS and QGis cooperate for statistical modelling duties: a case study on hedonic regressions Author Olivier Bonin, Université Paris Est - IFSTTAR - LVMT, France KEYWORDS : return on experience, statistical modelling, visualisation, R, PostGis, QGis Hedonic regressions used to model housing prices are a good example of statistical modelling that is demanding towards software: housing databases are generally large (typically several hundred thousands of records), and many spatial characteristics must be computed for each housing location. An hedonic model for housing prices [3, 8] is generally written as : p i = Σ α j x ij + Σ β j y ij +Σ γ j z ij + e i where x ij denotes structural characteristics of housing i (e.g. size, number of rooms, presence of an elevator, etc.), y ij the neighbourhood characteristics (e.g. presence of rapid transits in the vicinity), z ij the market position characteristics (e.g. distance to employment, distance to the city centre), p i the housing prices (or more generally Box-Cox transforms of these prices), and e i a Gaussian error term. The structural characteristics are found in housing price databases. The location of each housing consists in a postal addresses or in geographical coordinates. The spatial characteristics and are computed with the help of spatial queries as well as with network analyses. Thus it is necessary to be able to load road networks and public transportation networks and perform classical network analyses such as shortest path computation. It is also necessary to manage several layers of spatial information such as the location of employment centres and of amenities.

RESEARCH CONFERENCES 203 To select pieces of software to perform our modelling duties, we have to keep in mind that we want to navigate within two distinct worlds: the world of statistics and the world of geomatics. On one hand, a statistical software is mandatory to perform model estimation. R [5] is probably the best candidate for this task, even compared to commercial systems, given that our modelling might include spatial regression techniques or multi-level modelling [1]. On the other hand, a geographical information system (or a spatially-aware database system) is useful to perform spatial analysis (though R can perform part of these tasks thanks to some of its libraries). Here, the number of mature open source geographical information systems is relatively high, and several projects have reached the state to be able to cope with the hundred thousands of points associated to the location of dwellings. And globally, as a large amount of data is involved, it may prove useful to store and access it through a database management system. PostGIS [6] is a very good and easy to use spatially-aware RDBMS. Moreover, the spatial index in PostGIS will be helpful to perform spatial queries on hundred thousands of points. While R and PostGIS are obvious choices, the choice of the candidate geographical information system is very open. QGis [4] has been selected in the present study because it has the ability to connect to both PostGIS and R natively or with the help of readily available plugins, and has decent mapping capabilities (Figure1). FIGURE 1 Rapid transit network and housing locations in the Ile-de-France region (map drawn with QGis, data from the BIEN database)

204 RESEARCH CONFERENCES We wanted our system to be able to run on Windows, GNU/Linux and Mac OS X, which is the case for the three softwares. However, in our setup with a PostGIS server on GNU/Linux and R and QGis on Mac OS X, we have discovered that some of the required libraries required to make software communicate could be a bit tricky to install and configure on Mac OS X. We describe in this communication how we managed to make R, PostGIS and QGis work together to assist us in our econometric modelling of housing prices. We have evaluated different combinations of these three pieces of software that we have used extensively during the course of this research, to finally converge to the system that we are currently using. We were concerned mainly with performance (as large amounts of data are involved), spatial query capabilities, easiness to compute new spatial indicators and to take them into account in the statistical modelling, and finally spatial visualisation of statistical results and map production. As it is often the case with open source software, several possibilities exist to connect the pieces of software two by two. We tried some of them and give an insight on the results of our experiments. We chose PostGIS as a central repository for all our tabular and spatial data, and R as statistical system. To our knowledge, R can be connected to PostGIS either by RODBC [7] or by Rdbi and RdbiPgSQL. Rdbi (hosted on BioConductor [2] and thus accessible through this repository) proved to be much faster than RODBC in our case, so that we have settled on it. Moreover, having ODBC work on Mac OS X requires to compile the PostGIS driver and to configure the connexion by hand. To connect QGis to PostGIS was performed with QGis native driver. We have also connected QGis directly to R for testing purposes. This connexion can be established by the manager plugin (and probably other plugins). It did work, but did not proved to be very useful: because of the size of the housing price database, loading data from files in QGis exhibited horrible performance compared to PostGIS. The performance of the final system is quite good. Data can be exchanged between R and PostGIS in both directions with decent times, and from PostGIS to QGis for visualisation purposes. We almost never had to exchange data from Qgis to PostGIS, as all spatial queries were directly performed in PostGIS by SQL queries.

RESEARCH CONFERENCES 205 Our return on experience on these three pieces of open source software deals with statistical analysis, and more precisely the modelling of the influence of accessibility on housing prices. These kinds of models can require somewhat sophisticated statistical methods when spatial auto-correlation of residuals must be controlled and taken care of. Of utmost importance to us was the possibility to visualise our results on maps, so that we focus here on the visualisation duties. The typical modelling process is to import data from PostGIS into R, estimate models, add the model error term into the data frame used for modelling and then update the PostGIS database with the new values. It is clear from Figure 1 that mapping the error term at the point level leads to hard to read maps, because of point density and because several dwellings share the same coordinates as the result of the (documented) geocoding process of dwellings in the BIEN database. Thus we choose to work on aggregate error terms, at the IRIS level or at the town level. Figure 2 is an example of such a map at the IRIS level on Ile-de-France. FIGURE 2 Residuals of one of our hedonic models on housings aggregated at the IRIS level in Ile-de-France (map done with QGis, data from the BIEN database)

206 RESEARCH CONFERENCES However, with our progressive discovering of R spatial handling facilities, we use more and more R and PostGIS without QGis. Indeed, we are able to load shapefiles, perform analyses and thematic maps with the help of the sp and the maptools libraries. The shapefiles describing public transit networks, employment centres or administrative boundaries are small enough to be efficiently handled by R. Thus, we obtain quicker maps of error terms, probably cruder than the ones we can elaborate with QGis, but sufficient for analysis and even for publication. FIGURE 3 Residuals of one of our hedonic models on housings aggregated at the IRIS level (map done in R with the spplot() method, data from the BIEN database) The most striking conclusion of our experiment is that, despite the strong requirements of our research in spatial analysis, the geographical information system appeared to be the most dispensable piece of software. QGis showed irritating limitations when working on PostGIS tables (by comparison to working on shapefiles), and its visualisation module worked decently but showed real ergonomics problem when tuning cartographic representation. Of course there might have been better choices than QGis, but our experiments

RESEARCH CONFERENCES 207 with other geographical information systems were not better. We were able to progressively put aside the geographical information system in our setup because of the excellent spatial processing capabilities of PostGIS, and to the development of a large number of spatial functions inside R, including the ability to load shapefiles and compute spatial queries. Our conclusion on working on the interface between statistics and geomatics is that statistical systems such as R are bridging the gap between both worlds. In our opinion, geographical information systems have to work a great deal towards statistical analysis to regain the interest of researchers performing quantitative analyses in social sciences. [1] BONIN, O. Urban location and multi-level hedonic models for the lle-de-france region, ISA International Conference on Housing, Glasgow, 2009 [2] BIOCONDUCTOR. Open source software for bioinformatics, retrieved September 6, 2012 from http://bioconductor.org/ [3] KAIN, J. F., AND QUIGLEY, J. M. Measuring the value of house quality. Journal of the American Statistical Association 65(330): 532-548, 1970 [4] QUANTUM GIS PROJECT. Welcome to the Quantum GIS Project, retrieved September 6, 2012 from http://www.qgis.org [5] R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, 2009 [6] REFRACTION RESEARCH. What is Postgis, Postgis : Home, retrieved September 6, 2012 from http://postgis.refractions.net/ [7] RIPLEY B., AND LAPSLEY, M. RODBC: ODBC Database Access, CRAN - Package RODBC, retrieved September 6, 2012 from http://cran.r-project.org/web/ packages/rodbc/index.html [8] ROSEN, S. Hedonic prices and implicit markets: Product differentiation in pure competition. Journal of Political Economy 82(1): 34-55, 1974