A General Framework for Conflation

Benjamin Adams, Linna Li, Martin Raubal, Michael F. Goodchild
University of California, Santa Barbara, CA, USA
Email: badams@cs.ucsb.edu, linna@geog.ucsb.edu, raubal@geog.ucsb.edu, good@geog.ucsb.edu

1. Introduction

To date, GIS-based conflation research has primarily been concerned with specific algorithms and tools for performing conflation on specific types of datasets. Most of these techniques focus on matching the geometry of geospatial features represented as points, polylines, and polygons. More recently, the matching of the semantics of geospatial features has been identified as a key component of the conflation problem. In addition, conflation was conceived in terms of traditional GIS, but with the advent of neogeography, and in particular web-based maps with community-generated content, the representation of geospatial features on the Earth has become increasingly heterogeneous in terms of geometry, semantics, data provenance, and error (Goodchild 2007). In light of these complexities we present a general framework for conflation that aims to handle this diversity of representation.

To further motivate the need for a general framework, we propose that the conflation operation is an important part of answering the question "Where am I?" The "Where am I?" question is a perennial favorite of spatial cognition, navigation, and location services researchers. For our purposes we take a very specific interpretation of this question. Namely, the answer to this question is the identification of one's location as a point on the Earth. Any location on the Earth can be represented by a set of properties. We can identify our location if and only if the set of properties used to represent it is unique. If we have the latitude/longitude property, or a unique identifier of a discrete object at our location such as an address, we can establish the uniqueness of our location.
In the absence of these singleton sets we can only answer this question by assembling a set of properties that together are unique to our location. As these properties will often come from separate information sources, a conflation operation is required. Given that conflation merges representations based on similarity thresholds and that errors exist in all geographic data sets, there is a degree of uncertainty as to the uniqueness of a location identified from conflated data. We make no claims as to the sources of the representations being conflated, nor as to whether they are internal mental representations or external representations stored in an information system.

For an example of the "Where am I?" question, take the following natural language description of a person's location, which may have been sourced from a number of different places (see Figure 1 for a corresponding map):

"I am in Southern California standing on a two-lane road facing north. Directly behind me is a parking lot about 100 meters square and further back are a number of buildings. Directly in front of me I see a small airport and behind that a town and a mountain range rising above it. A highway intersects the town. I know the ocean is south of me."

Here we have a number of geospatial feature types (e.g., buildings, airport, mountains), topological and spatial relationships (e.g., intersects, south of), attributes (e.g., two-lane), and geometry (100 m²), which can all be used to uniquely identify the location of the questioner. Although conflation research has focused on the geometric representation of features, this scenario illustrates that for unstructured and heterogeneous data both geometry and semantics must be considered. Such unstructured geographic data is ubiquitous. For example, many neogeographic data sources, such as Geonames [1], define the semantics of feature type terms using natural language descriptions. Furthermore, an analogy can be drawn between the distortion in a person's (or ontology's) representation of space and location (including both geometry and semantics) and the geometric distortion in GIS data resulting from different measurement techniques.

[1] http://www.geonames.org/

Figure 1. Map of location in scenario.

2. Geospatial Feature

Every geospatial feature on Earth can be represented by a 4-tuple ⟨G, F, A, T⟩, where:

G: Geometry
F: Feature type
A: Set of attributes
T: Set of topological relationships to other features

Arguably, F, A, and T (and perhaps even G) could all be subsumed under the label "semantics" (Kuhn 2003), but we find it helpful to make this separation because (1) feature types are especially salient in our conceptualizations of geospatial features (Frank 2003) and (2) topology differs from other attributes in that it encodes relationships with other features. One might be tempted to also include a unique identifier for each feature, along the lines of Ordnance Survey's Topographic Identifiers (TOIDs) [2], but epistemologically this kind of assignment is suspect. For example, is Lake Tahoe the same geospatial entity that it was before European settlers named it as such? This question also points to the fact that all geospatial features are dynamic on some time scale. Thus, we can designate different representations for the same feature at different points or intervals in time. For example, the geometric representation of the Aral Sea today looks very different from what it did in 1960 [3].

Every location can be generalized as an area containing some collection of the above tuples. A characteristic of the conflation operation is that it is invariant to the extent of the area examined. In other words, our proposed framework is designed to be universally applicable regardless of whether one is conflating feature sets with large or small footprints on the Earth.

3. Conflation is Similarity + Error Estimation

Given two collections of geospatial features, the process of conflating them generates a new collection of features along with estimates of the error propagated. The key component of any conflation algorithm is the measurement of similarity between features (Saalfeld 1988).
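
As a minimal sketch of how the 4-tuple from Section 2 and the matching step described above might fit together, the following Python fragment may help. It is not the authors' method: the class and field names, the reduction of geometry to a single point in a local metric frame, the nearest-candidate matching rule, and the threshold value are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Feature:
    """One geospatial feature as the 4-tuple (G, F, A, T)."""
    geometry: Tuple[float, float]                        # G: simplified to one (x, y) point, in meters
    feature_type: str                                    # F: e.g. "airport"
    attributes: FrozenSet[Tuple[str, str]] = frozenset() # A: (name, value) pairs
    topology: FrozenSet[Tuple[str, str]] = frozenset()   # T: (relation, other feature type) pairs


def euclidean(a, b):
    """Purely geometric distance between two point features."""
    (ax, ay), (bx, by) = a.geometry, b.geometry
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5


def conflate(source, target, distance, threshold):
    """Pair each source feature with its closest target feature.

    Returns (source, target, distance) triples. The distance is kept with
    each match as a crude stand-in for an error estimate.
    """
    matches = []
    for s in source:
        if not target:
            break
        best = min(target, key=lambda t: distance(s, t))
        d = distance(s, best)
        if d <= threshold:  # pairs above the threshold are left unmatched
            matches.append((s, best, d))
    return matches


# Two hypothetical datasets describing roughly the same area.
road_a = Feature((0.0, 0.0), "road", frozenset({("lanes", "2")}))
airport_a = Feature((0.0, 100.0), "airport")
road_b = Feature((3.0, 4.0), "road")
town_b = Feature((0.0, 500.0), "town")

matches = conflate([road_a, airport_a], [road_b, town_b], euclidean, threshold=10.0)
```

Here only the two road representations are paired, because the airport has no counterpart within the threshold. Any distance function with the same signature could be passed in place of `euclidean`.
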
We define feature similarity as a linear weighted sum of semantic and geometric similarities:

\[ \omega_{ij} = \alpha S_{ij} + \beta D_{ij} \tag{1} \]

where S is semantic distance and D is geometric distance (distance and similarity are inversely related), and i and j are feature indices in the source datasets. The weights α and β provide a way to describe a context for conflation. We make the claim that in most cases geometry trumps semantics, because regardless of semantic similarity, two features that are far apart in space should never be conflated. Although semantic and geometric similarity measures are incommensurable, when both are normalized to a [0, 1] scale this will tend to result in β >> α. However, there may be exceptions, especially when conflating highly unstructured data such as that found in the "Where am I?" scenario described in the introduction.

Expanding the semantic part of Equation 1 to differentiate feature type, attributes, and topology, we get:

\[ \omega_{ij} = \alpha_f S^{f}_{ij} + \alpha_a S^{a}_{ij} + \alpha_t S^{t}_{ij} + \beta D_{ij} \tag{2} \]

[2] http://www.ordnancesurvey.co.uk/oswebsite/freefun/geofacts/geo1201.html
[3] In fact, it might be represented as three different features today: the North Aral Sea and the western and eastern South Aral Sea.
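
The two equations can be illustrated with a short Python sketch. The weight values below are assumptions chosen only to reflect β >> α; the paper does not prescribe specific weights, nor does it require them to sum to 1. Because S and D are distances, a smaller ω here means a closer match.

```python
def feature_distance(s, d, alpha=0.15, beta=0.85):
    """Equation 1: weighted sum of semantic distance s and geometric distance d."""
    return alpha * s + beta * d


def feature_distance_expanded(s_type, s_attr, s_topo, d,
                              alpha_f=0.05, alpha_a=0.05, alpha_t=0.05, beta=0.85):
    """Equation 2: the semantic term split into feature type, attributes, and topology.

    All four distances are assumed to be normalized to [0, 1].
    """
    for value in (s_type, s_attr, s_topo, d):
        if not 0.0 <= value <= 1.0:
            raise ValueError("distances must be normalized to [0, 1]")
    return alpha_f * s_type + alpha_a * s_attr + alpha_t * s_topo + beta * d


# Same place, completely different semantics.
near_but_different = feature_distance_expanded(1.0, 1.0, 1.0, 0.0)

# Identical semantics, maximally far apart.
far_but_same = feature_distance_expanded(0.0, 0.0, 0.0, 1.0)
```

With these illustrative weights, the co-located pair scores a much smaller combined distance than the semantically identical but distant pair, which is one way to express the claim that geometry usually trumps semantics.
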

Note that in most cases the distance measures of the semantic components will be strongly correlated, which should be considered when assigning the α weights.

3.1 Geometric and Semantic Distortion

Uncertainty in GIS data may come from many sources. The real world is infinitely complex, so the generalization or simplification involved in creating a GIS database introduces uncertainty. Different people may adopt different conceptualization schemes when they create datasets about the same features. In addition, uncertainty depends on the measurement method and the scale of measurement. Finally, uncertainty may be propagated in data processing and analysis after data are captured. Therefore, one of the key research questions in conflation is how to identify the same features in spite of the inherent uncertainty in various geospatial databases.

Analogously, when answering the question "Where am I?", humans conflate information presented on a map with the incomplete and imperfect information in their internal representations, along with implicit knowledge of how far they trust their own information, to generate a better understanding of their location.

Semantic similarity measurement is an active research topic in GIScience, and several techniques have been developed (Schwering 2008). Every dataset commits either to an implicit ontology that exists in the head(s) of the schema designer(s) or to an explicit ontology defined with some degree of formalization. Just as two representations of a feature will be geometrically distorted, they will also be semantically distorted by the ontology. In some cases there is no ground-truth to identify which of the features is more distorted from reality; there is simply a difference in definition. An example of this is the Río de la Plata in South America, which may or may not be classified as a river. However, there are cases where something close to ground-truth can be identified.
For example, if the land use of an area is represented as farmland in one database, but a raster image of the same area with the same timestamp shows concrete everywhere, then it is fair to say that the first representation is highly semantically distorted from reality. In general, two representations will be semantically distorted from each other, quantified in terms of the results of semantic similarity measurement.

3.2 Temporal Effects on Similarity Measurement

The role that time plays in conflation is a largely unexamined problem. Even under perfect measurement conditions, all of the components G, F, A, and T of a representation will change over time. The rate of change of a geospatial representation, which we designate its temporal malleability, varies considerably depending on what is being represented. However, the similarity threshold required to identify two representations as referring to the same feature will always be inversely related to the magnitude of their temporal difference (i.e., representations of the same feature will become less similar the further apart they are in time). Historical GIS is one application where temporal malleability is of particular importance. Furthermore, there is the issue that features can come into existence, cease to exist, join together, or separate into multiple features at different moments in time (Hornsby and Egenhofer 2000).

4. Conclusion

Today geographic data is found not only in GIS databases but also on the Web in a variety of formats. A framework for conflation must be sufficiently general to be applicable to these heterogeneous data. Here we presented a framework for conflation that defines geospatial features as 4-tuples and the process of conflation as a linear combination of geometric and semantic similarity measures with corresponding error estimation. This lays the groundwork for developing models that incorporate both geometric and semantic distortion for matching heterogeneous data. We also examined issues related to temporality in conflation. An important next step is the formalization of semantic distortion and how it relates to existing models of geometric distortion. In addition, more work is needed to develop adequate models for estimating and recording error in both geometric and semantic matching.

Acknowledgements

This work is supported by NGA-NURI grant HM1582-10-1-0007.

References

Frank A, 2003, Ontology for spatio-temporal databases. In: Koubarakis M et al. (Eds.), Spatiotemporal Databases: The Chorochronos Approach. Lecture Notes in Computer Science 2520, pp. 9-77, Springer, Berlin.
Goodchild M, 2007, Citizens as sensors: the world of volunteered geography. GeoJournal, 69: 211-221.
Hornsby K and Egenhofer M, 2000, Identity-based change: a foundation for spatio-temporal knowledge representation. International Journal of Geographical Information Science, 14(3): 207-224.
Kuhn W, 2003, Semantic reference systems. International Journal of Geographical Information Science, 17(5): 405-409.
Saalfeld A, 1988, Conflation: automated map compilation. International Journal of Geographical Information Systems, 2(3): 217-228.
Schwering A, 2008, Approaches to semantic similarity measurement for geo-spatial data: a survey. Transactions in GIS, 12(1): 5-29.