Finding Outliers in Models of Spatial Data

David W. Scott
Dept of Statistics, MS-138
Rice University
Houston, TX 77005-1892
scottdw@stat.rice.edu
www.stat.rice.edu/~scottdw

J. Blair Christian
Dept of Statistics, MS-138
Rice University
Houston, TX 77005-1892
blairc@stat.rice.edu
www.stat.rice.edu/~blairc

Abstract

Statistical models fit to data often require extensive and challenging re-estimation before achieving final form. For example, outliers can adversely affect fits. In other cases involving spatial data, a cluster may exist for which the model is incorrect, also adversely affecting the fit to the good data. In both cases, the estimated residuals must be checked and rechecked until the data are cleaned and the appropriate model found. In this article, we demonstrate an algorithm that fits models to the largest subset of the data that is appropriate. Specifically, if a hypothesized linear regression model fits ninety percent of the data, our algorithm can not only find an excellent fit as if only that good data were presented, but will also highlight the ten percent of bad data that the model does not fit. Our work in digital government has focused on mapping data. Thus we illustrate how models fit to census tract data work, and how the data in the bad set can be viewed spatially through ArcView or other tools. This approach greatly simplifies the task of modeling spatial data, and makes use of advanced map visualization tools to understand the nature of subsets of the data for which the model is not appropriate.

1 Introduction

The study of spatial data has an important role in the mission of governmental entities. The topic has drawn the interest of statisticians (Cressie, 1991) and geographers (MacEachren, 1995), among others. The main presentation tool is the choropleth map, which displays small geographical units by means of color shading (Tobler, 1978; Tufte, 1990). Two outstanding collections of such maps may be found in Pickle et al. (1996) and Brewer and Suchan (2001). Color choice is critical to successful presentation, and Brewer (1999) has described a theory of successful color mapping.

Choropleth maps are inherently discontinuous. Smoothing methodologies such as kriging and nonparametric regression can aid visual understanding (Scott, 1992, 2000), but at the cost of some statistical bias. Whittaker and Scott (1994, 1999) have argued that the needs of many decision makers are served by such a big picture, without insisting on complete fidelity to the raw data.

One of the many challenges in federal mapping is a better understanding of how many variables are related spatially to variables of interest. One classical dataset much studied is the Boston housing data (Harrison and Rubinfeld, 1978); the goal is to predict median housing prices, but much of the data does not fit a single multivariate model.

We have introduced an alternative multivariate regression fitting algorithm, called L2E (Scott, 2001). Robust methods are often applied to such data, but are iterative and sometimes difficult to interpret. In place of a maximum likelihood or least-squares criterion, L2E attempts to estimate parameters so that the shape of the residuals is as close to Normal as possible. Why does this alternative criterion help? If the data contain a cluster of bad data fitted by least-squares, the residuals will not be Normal and may be skewed, for example. Regression diagnostic plots must be carefully screened and interpreted in order to identify and then correct these practical problems. Making the residuals look Normal avoids such difficult diagnostic steps. Instead, 90% of the residuals will look Normal (with a mean of 0), and the 10% of residuals corresponding to the bad data will be clearly shown. In the case of spatial data, another diagnostic is available: the values of the residuals may be plotted spatially in order to better understand the nature of the bad data. Of course, the bad data may in fact be the most interesting data. A simple estimation example is presented below.

2 Fitting Multivariate Regressions by L2E

Given a variable of interest, $y$, and a number of covariates, $(x_1, x_2, \ldots, x_p)$, collected over a number of spatial units (census tracts, for example), we seek to model $y$ as a simple function of the predictors, $\{x_k\}$:

$$\hat{y}(x_1, x_2, \ldots, x_p) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j .$$

Given estimates of the $\beta_j$'s, a residual for the $i$-th case would be computed according to the formula

$$\epsilon_i = y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} .$$

If there were a cluster of bad data, then the residuals would not look like a Normal density centered at 0.

With L2E, least-squares is replaced by another criterion, which can be optimized using standard software, for example, nlmin in the S-Plus package. The criterion to minimize is simply

$$\frac{1}{2\sqrt{\pi}\,\sigma_\epsilon} - \frac{2}{n} \sum_{i=1}^{n} \phi(\epsilon_i \mid 0, \sigma_\epsilon),$$

where $\sigma_\epsilon^2$ is the variance of the residuals and $\phi(x \mid \mu, \sigma^2)$ is the density of a Normal random variable.
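
To make the criterion concrete, the following is a minimal sketch (not the authors' code) of how the plain L2E fit could be carried out in Python with numpy and scipy, in place of the nlmin call in S-Plus mentioned above. The function names (l2e_criterion, fit_l2e), the log parameterization of sigma, and the Nelder-Mead optimizer are illustrative assumptions, not part of the paper; the design matrix X is assumed to contain a leading column of ones for the intercept.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def l2e_criterion(theta, X, y):
    # L2E objective: 1/(2*sqrt(pi)*sigma) - (2/n) * sum_i phi(eps_i | 0, sigma)
    beta, sigma = theta[:-1], np.exp(theta[-1])   # log-sigma keeps sigma positive
    eps = y - X @ beta                            # residuals; X includes an intercept column
    n = len(y)
    return 1.0 / (2.0 * np.sqrt(np.pi) * sigma) - (2.0 / n) * norm.pdf(eps, loc=0.0, scale=sigma).sum()

def fit_l2e(X, y):
    # Start from the least-squares fit, then minimize the L2E criterion over (beta, log sigma).
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma0 = np.std(y - X @ beta0)
    theta0 = np.concatenate([beta0, [np.log(sigma0)]])
    res = minimize(l2e_criterion, theta0, args=(X, y), method="Nelder-Mead",
                   options={"maxiter": 20000, "xatol": 1e-8, "fatol": 1e-8})
    return res.x[:-1], np.exp(res.x[-1])          # (beta_hat, sigma_hat)

For well-behaved data this recovers essentially the least-squares fit; its value lies in how it behaves when a fraction of the data does not follow the model, which is where the weighted extension described next comes in.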

The new and exciting extension that we illustrate below was introduced by Scott and Szewczyk (2003). Specifically, instead of modeling the residuals as Normal, $N(0, \sigma_\epsilon^2)$, we add an additional weight parameter, $w$, and use the residual model $w\,N(0, \sigma_\epsilon^2)$. What this very unusual (and unexpected) model is really doing can be made clearer by the following observation. If we must deal with a bad data cluster, then the residual plot will have a separate component for that cluster. In other words, we believe the residual density is actually a mixture of two components, one centered at 0 (the good data) and a second elsewhere (and of unknown shape).

The magic of the L2E algorithm, as described in Scott and Szewczyk (2003), is that we can use L2E to estimate only the major mixture component. The criterion given above is modified only slightly to

$$\frac{w^2}{2\sqrt{\pi}\,\sigma_\epsilon} - \frac{2w}{n} \sum_{i=1}^{n} \phi(\epsilon_i \mid 0, \sigma_\epsilon).$$

The optimization is over the parameters $w$ and $\sigma_\epsilon$, as well as the parameters $\{\beta_0, \beta_1, \ldots, \beta_p\}$, which are used implicitly to recompute the estimated residuals, $\{\epsilon_i\}$. The estimated value of the weight, $w$, indicates the amount of the data that the L2E algorithm believes is being fitted by the multivariate regression model. Thus, when all works well, the investigator simultaneously obtains a model fit that is very good, as though only the good data had been fitted, as well as an explicit estimate of the fraction of bad data. By sorting on the magnitude of the fitted residuals, the suspect data can clearly be identified. The output of such models can be used in our previous digital government work fitting smooth slices of regression functions (Whittaker and Scott, 1994; Scott and Whittaker, 1996; Whittaker and Scott, 1999; Scott and Wojciechowski, 2001; Christian and Scott, 2002).

3 Application to Boston Housing Data

The Boston housing data were originally modeled to determine whether levels of air pollution appeared to be correlated with housing prices. Data were assembled for the 506 census tracts in greater Boston in the early 1970s. The 13 predictor variables included information on per capita crime rates, the proportion of residential land zoned for large lots, the proportion of non-retail business acres, the average age and number of rooms per dwelling, full-value property-tax rates, and pupil-teacher ratios, in addition to the nitric oxides concentration (the pollution variable). Dummy variables were added to account for spatial correlation, for example, adjacency to the Charles River, distances to five Boston employment centers, and accessibility to radial highways.

We begin by examining the residuals of a least-squares fit to these data; see Figure 1a. We can easily see that the shape is far from the assumed shape $N(0, \sigma_\epsilon^2)$. The residuals seem skewed to the right, with a number of outliers on the right (and a couple on the left, too).

We next fit these data by the L2E method, including the weight parameter, $w$. The fitted value of $w$ was 84.5%. Thus from the beginning we have an indication that the assumed multivariate regression model will not fit all of these data. In fact, 15.5% of the data are not well fit by the model. (However, 84.5% are well fit, which is not bad in many situations.) When we examine the residuals of the L2E fit to these data in Figure 1b, we can easily see that the majority of the data are well fitted by the assumed shape $\hat{w}\,N(0, \sigma_\epsilon^2)$; the outliers on the right (and the couple on the left) remain clearly visible. Note that the Normal curve depicted in this figure has area 84.5%, rather than a full 100%, since we are fitting the partial mixture model.

Now that we have the standardized residuals from the L2E fit, we can examine the locations of the census tracts that differ from the vast majority of the data. We expect most Normal residuals to fall in the range $(-3, 3)$. Here, of course, only 84.5% fall in that range.
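
The following is a hedged sketch of the partial-mixture fit with the weight parameter $w$, extending the plain L2E sketch from Section 2; again the helper names, the logit parameterization of $w$, and the Nelder-Mead optimizer are illustrative assumptions rather than the authors' implementation. Applied to a design matrix of the 13 Boston predictors plus an intercept, the fitted weight plays the role of the 84.5% figure above, and sorting the standardized residuals flags the suspect tracts.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def partial_l2e_criterion(theta, X, y):
    # Partial-mixture L2E objective:
    #   w^2 / (2*sqrt(pi)*sigma) - (2*w/n) * sum_i phi(eps_i | 0, sigma)
    beta = theta[:-2]
    sigma = np.exp(theta[-2])                 # sigma > 0
    w = 1.0 / (1.0 + np.exp(-theta[-1]))      # logit keeps the weight in (0, 1)
    eps = y - X @ beta
    n = len(y)
    return w**2 / (2.0 * np.sqrt(np.pi) * sigma) - (2.0 * w / n) * norm.pdf(eps, loc=0.0, scale=sigma).sum()

def fit_partial_l2e(X, y):
    # Jointly estimate beta, sigma, and w; then rank cases by |standardized residual|.
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma0 = np.std(y - X @ beta0)
    theta0 = np.concatenate([beta0, [np.log(sigma0), 0.0]])   # weight starts at 0.5
    res = minimize(partial_l2e_criterion, theta0, args=(X, y), method="Nelder-Mead",
                   options={"maxiter": 50000, "xatol": 1e-8, "fatol": 1e-8})
    beta_hat = res.x[:-2]
    sigma_hat, w_hat = np.exp(res.x[-2]), 1.0 / (1.0 + np.exp(-res.x[-1]))
    std_resid = (y - X @ beta_hat) / sigma_hat
    suspects = np.argsort(-np.abs(std_resid))  # most suspect observations first
    return beta_hat, sigma_hat, w_hat, std_resid, suspects

The returned standardized residuals can then be joined to the census-tract geography and shaded on a map (for example, through an attribute join in ArcView or a geopandas plot), giving the kind of residual map shown in Figures 2 and 3.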

Figure 1: Kernel smooth of residual density (solid line), with assumed Normal fit (dotted line). The least-squares case ("LS residuals") is shown on the left, and the L2E case ("robust residuals with wt=0.845") on the right.

In Figure 2, we show a choropleth map of the L2E residuals for all 506 census tracts. The dark blue residuals (corresponding to predictions that are much too low) are indeed clustered in the western region. Armed with this knowledge, the investigator could attempt to add an additional indicator variable so that the model would fit a larger fraction of the data. The dark red residuals (corresponding to predictions that are too high) cannot be seen very well in Figure 2, since they are all located in the downtown region. In Figure 3, we zoom in on that region. We see that these are very special tracts, and probably should be analyzed separately from the rest of the data.

4 Discussion and Future Directions

The digital government work of our research team has focused on a number of statistical tools and techniques that can handle many challenging real situations. In this paper, we have examined the all-too-common problem of statistical models that fit a large fraction of the data, but not all of the data. The common practice of iteratively trying to figure out which data points are the problematic ones is very difficult and, in our experience, often leads to unnecessary failure. This is not too surprising, as the number of possible remedies with 13 predictor variables is almost unlimited, and both novice and experienced investigators may fail to find the correct combination of fixes. In contrast, our new fitting technique goes straight to the heart of the issue, finding the best fit for the largest fraction of good data that can be identified. Diagnostics are much easier with this approach. The generality of our approach will become clearer, as regression is the fundamental model for many if not most data analyses.

Acknowledgments: Research was supported in part by the National Science Foundation grants NSF EIA-9983459 (digital government) and DMS 99-71797 (non-parametric methodology). The authors would like to thank Linda Pickle and Sue Bell for their collaboration, as well as our DG collaborators Drs. Carr, MacEachren, and Brewer.

Figure 2: Color-coded standardized residuals by census tract.

Figure 3: Color-coded standardized residuals by census tract (zoomed to the downtown region).

References

Brewer, Cynthia A. (1999), Color Use Guidelines for Data Representation, Proceedings of the Section on Statistical Graphics, American Statistical Association, Baltimore, pp. 55-60.
Brewer, Cynthia A., and Trudy A. Suchan (2001), Mapping Census 2000: The Geography of U.S. Diversity, CENSR/01-1, U.S. Government Printing Office, Washington, DC.
Cressie, Noel A. C. (1991), Statistics for Spatial Data, Wiley-Interscience, New York.
Harrison, D. and Rubinfeld, D.L. (1978), Hedonic Housing Prices and the Demand for Clean Air, Journal of Environmental Economics and Management, 5, pp. 81-102.
MacEachren, Alan M. (1995), How Maps Work, Guilford Publications.
Pickle, L. W., Mungiole, M., Jones, G. K., and White, A. A. (1996), Atlas of United States Mortality, HHS-CDC, Hyattsville, MD.
Scott, D.W. (1992), Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley, New York.

Scott, D.W. (2000), Multidimensional Smoothing and Visualization, in Smoothing and Regression: Approaches, Computation and Application, M. G. Schimek, Ed., John Wiley, New York, pp. 451-470 (with 5 color plates).
Scott, D.W. (2001), Parametric Statistical Modeling by Minimum Integrated Square Error, Technometrics, 43, pp. 274-285.
Scott, D.W. and Szewczyk, W.F. (2003), The Stochastic Mode Tree and Clustering, Journal of Computational and Graphical Statistics, to appear.
Scott, D.W. and Whittaker, G. (1996), Multivariate Applications of the ASH in Regression, Communications in Statistics, 25, pp. 2521-2530.
Tobler, W. R. (1978), Data Structures for Cartographic Analysis and Display, Proceedings of the Computer Science and Statistics 11th Annual Symposium on the Interface, pp. 134-139.
Tufte, Edward R. (1990), Envisioning Information, Graphics Press, Cheshire, CT.
Whittaker, G. and Scott, D.W. (1994), Spatial Estimation and Presentation of Regression Surfaces in Several Variables Via the Averaged Shifted Histogram, Computing Science and Statistics, 26, pp. 8-17.
Whittaker, G. and Scott, D.W. (1999), Nonparametric Regression for Analysis of Complex Surveys and Geographic Visualization, Sankhya, Series B, special issue on small-area sampling, P. Lahiri and M. Ghosh, Eds., 61, pp. 202-227.