Mapping geospatial events based on extracted spatial information from web documents


University of Iowa
Iowa Research Online
Theses and Dissertations
Spring 2011

Mapping geospatial events based on extracted spatial information from web documents
Nathaniel Robert Rock
University of Iowa

Copyright 2011 Nathaniel Robert Rock

This thesis is available at Iowa Research Online:

Recommended Citation
Rock, Nathaniel Robert. "Mapping geospatial events based on extracted spatial information from web documents." MA (Master of Arts) thesis, University of Iowa,

Part of the Geography Commons

MAPPING GEOSPATIAL EVENTS BASED ON EXTRACTED SPATIAL INFORMATION FROM WEB DOCUMENTS

by Nathaniel Robert Rock

A thesis submitted in partial fulfillment of the requirements for the Master of Arts degree in Geography in the Graduate College of The University of Iowa

May 2011

Thesis Supervisor: Associate Professor Kathleen Stewart

Graduate College
The University of Iowa
Iowa City, Iowa

CERTIFICATE OF APPROVAL

MASTER'S THESIS

This is to certify that the Master's thesis of Nathaniel Robert Rock has been approved by the Examining Committee for the thesis requirement for the Master of Arts degree in Geography at the May 2011 graduation.

Thesis Committee:
Kathleen Stewart, Thesis Supervisor
David Bennett
Marc Armstrong

To Rocky and Nora

And this I believe: that the free, exploring mind of the individual human is the most valuable thing in the world.
John Steinbeck, East of Eden

ACKNOWLEDGEMENTS

I would like to thank the individuals who helped guide me through the process of this thesis. In particular, I express my sincere gratitude to Dr. Kathleen Stewart for all of the guidance, expertise, and patience she provided me throughout the research process. I would also like to thank Dr. Marc Armstrong and Dr. David Bennett, who served on my thesis committee, for their time and help. Finally, I would like to thank my friends and family for all of their love and support throughout my graduate school journey.

ABSTRACT

Web documents such as news articles, social feeds, and blogs provide an abundant and readily available source of spatial information relating to dynamic events such as wildfires, storms, and chemical spills. Research in the fields of geographic information retrieval and natural language processing uses methods to extract place names from web documents that can be used to geocode these events. However, much of the spatial information in these articles is difficult to use because of the inherent vagueness of natural language. This thesis aims to develop methods to handle the vagueness of natural language descriptions of events by integrating precise spatial information (landmarks and geographic coordinates) with imprecise spatial information to provide a map-based visualization of the likely spatial extent and location of events described in web documents.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1. INTRODUCTION
    Motivation and Background for Research
    Scope of Thesis
    Thesis Structure

CHAPTER 2. RELATED WORK IN GEOGRAPHIC INFORMATION RETRIEVAL
    Text-based Retrieval and Extraction of Spatial Information
    Natural Language Processing
    Geographic Information Retrieval
    GIR Methods for Representing Vagueness
    Extracting and Mapping Events from Text

CHAPTER 3. CREATING EVENT ZONES USING TEXT-BASED INFORMATION
    Introduction
    Defining spatial anchors
    Point Generation for Spatial Anchor Polygons
    Weighting
    Point Density Estimation

CHAPTER 4. MAPPING VAGUE SPATIAL EVENT REFERENCES USING ANNOTATED INFORMATION
    Annotation of Vague Spatial References
    Mapping Vague Spatial References
    Mapping Directional Modifiers
    Mapping Distance Modifiers
    Mapping Topological Relations
    Containment
    Extended Connection
    Partial Overlap

CHAPTER 5. APPLICATION OF EVENT ZONE METHODS FOR A WEB DOCUMENT
    Creating an Event Zone for a Tornado News Article
    Annotation of the Tornado Article
    Weighting and Event Zone Creation
    Creating an Event Zone for a Forest Fire Article
    Event Zone Validation

CHAPTER 6. RESULTS AND FUTURE WORK
    Results
    Implications for GIR and GIScience
    Future Work

REFERENCES

LIST OF TABLES

Table 1. Features included in the ADL Feature Type Thesaurus. The number of asterisks in front of each feature denotes the weight given

LIST OF FIGURES

Figure
1. The query interface for the GeoNames feature search
2. Query results from the search on Sears Tower shown in Figure
3. A news article describing a tornado that occurred in Iowa City, IA annotated in GATE
4. Spatial anchors (e.g., Menard's, Iowa Memorial Union, and St. Patrick's Cathedral) represented as point features in ArcMap
5. The Regular Point Generation tool within the Hawth's Tools toolbox
6. (a) Spatial anchors extracted from a news article represented in a GIS. (b) Output after RPG
7. (a) A spatial anchor extracted from a news article represented in a GIS. (b) Output after RPG
8. Event zone resulting from PDE of point layers in Figures 4, 6 (b), and 7 (b)
9. Map representation of a directional modifier
10. (a) Polygon representations of spatial anchors in a GIS. (b) Map representation of a distance modifier
11. Map representation of distance and directional modifiers attributed to source and target toponyms
12. Select by Location is used to isolate the target toponym within the source toponym
13. Map representation of an extended connection modifier
14. News article obtained from the New York Times website describing a tornado that occurred in College Park, Md ( gst/abstract.html?res=f60b14ff385e0c758edda00894d )
15. Spatial anchors and modifiers annotated in GATE
16. Extracted point locations for spatial anchors used in the computation of an event zone
17. Map representations of spatial anchors
18. Event zone depicting the tornado described in the news article
19. News article obtained from the CBS News website describing a forest fire that occurred in California
20. Event zone depicting the forest fire in the news article
21. NOAA tornado track overlaid with the computed event zone

CHAPTER 1
INTRODUCTION

1.1 Motivation and Background for Research

Web documents such as news articles, social feeds, and blogs provide an abundant and readily available source of spatial information related to dynamic geographic events such as wildfires, storms, and riots. Research in the fields of geographic information retrieval (GIR) and natural language processing (NLP) uses methods to extract place names and other spatial references from web documents that can be used to display event locations in a GIS. Vagueness common in spatial descriptions, such as "north of the city" or "near the border", is a problem that must be addressed in geographic information systems in order to successfully represent text-based events. In this thesis, I investigate vagueness associated with natural language descriptions of events and present a set of methods for representing such events even when only vague spatial information is available.

The automatic extraction and representation of spatial information from text has been a topic of interest to researchers in the field of GIR. For example, studies have focused on the retrieval and extraction of geospatial information from web documents (Buyukkokten et al., 1999), as well as modeling vague spatial knowledge of places using precise and imprecise regions (Jones, 2008) and creating tools that identify an article's geographic focus using a digital map (Teitler et al., 2008). These works, however, focus on modeling static places given vague spatial information, whereas modeling events that change their location over space, and are associated with spatial trajectories extracted from web documents, has not been given as much attention.

In this study, an event refers to a dynamic happening that is heterogeneous and has periodic or episodic occurrences (Galton, 2008). The GIScience community has defined events as being different from processes. Events occur at a point in time, or over an interval, and show a change that has occurred between one or more objects, although an event does not itself change over its duration (Galton, 2008). This research builds on the work of geographic information retrieval, a subfield of GIScience, where natural language descriptions of events, such as a storm passing through an area, are often vague and lack the information needed to locate them in geographic space using precise coordinates. To represent these natural language occurrences, an information system such as a GIS must be able to handle the inherent spatiotemporal vagueness expressed in these descriptions. Additionally, information systems designed to support events, such as a mobile device application that provides an event's current location, also need to be able to handle information relating to directional, distance, or topological relationships between places described in documents.

1.2 Scope of Thesis

The aim of this thesis is to add to the growing body of geographic information retrieval research and contribute a set of methods for representing event extents, i.e., the event zone. An event zone refers to a surface in a GIS that represents an event's spatial extent as described in, and extracted from, a web document.

Methods are introduced that go beyond the current ability to locate clearly described events in text and that also provide a means to capture vaguely described events as event zones. The United States is the study area: the events modeled in this work occurred in the United States. Digital gazetteers (ADL Gazetteer, USGS GNIS, GeoNames) and base data (ESRI base data for the United States) are used to geocode features within the U.S.

One aim of this research is to incorporate the methods presented in this thesis into the growing body of GIR work that supports geographic information systems in integrating vague spatial information with existing spatial data. Geographic information systems will be enhanced if they can integrate vague spatial queries (Montello, 2003) and model events using boundaries designated by a text description. This work can be used to model events in the absence of recorded data from monitoring stations or sensor networks, presenting a possible extent from reported accounts. For example, events in web documents could be eyewitness accounts in news articles or blogs. This work can also be used to further incorporate vague natural language into geographic information systems, which allows for more flexibility when identifying the extent of an event.

The techniques involve extracting the spatial information from a web document that describes where an event has occurred and modeling this in a GIS. Text engineering software is used to annotate the spatial information found in the web document, including attributes for any information relating to directional, distance, or topological relationships between places. Once all spatial information is annotated, these text strings are queried against a geographic names information system and imported into a GIS, where they are used to create an event extent, referred to as an event zone.
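The annotation step described above can be pictured with a small data structure. The following is only a minimal sketch in Python, not the implementation used in this thesis: the class name `SpatialReference`, its fields, and the helper `split_geocoded` are hypothetical names introduced for illustration, and the coordinates are approximate placeholder values.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SpatialReference:
    """One annotated spatial reference from a web document (hypothetical schema)."""
    text: str                                     # text string as it appears in the document
    direction: Optional[str] = None               # directional modifier, e.g. "north"
    distance_km: Optional[float] = None           # distance modifier, if one is stated
    topology: Optional[str] = None                # topological relation, e.g. "contains"
    coords: Optional[Tuple[float, float]] = None  # (lat, lon) once a gazetteer match is found


def split_geocoded(refs: List[SpatialReference]):
    """Separate references that already have coordinates from those that do not."""
    geocoded = [r for r in refs if r.coords is not None]
    unresolved = [r for r in refs if r.coords is None]
    return geocoded, unresolved


# Illustrative input: one reference with approximate coordinates and one vague
# reference that carries only a directional modifier.
refs = [
    SpatialReference("Iowa Memorial Union", coords=(41.663, -91.538)),
    SpatialReference("the city", direction="north"),
]
geocoded, unresolved = split_geocoded(refs)
```

Keeping the modifiers as optional attributes on the same record means a vague reference can travel through the workflow alongside precisely located ones until a method for representing it is applied.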

In this work, spatial information that is unique, or that is described with locational information that can be used for geocoding (such as an address or geographic coordinates), is referred to as a spatial anchor. Spatial anchors provide the foundation on which additional annotated information is used in the creation of the event zone. Techniques are also presented that allow for the incorporation of vague spatial information into a GIS, as well as directional, distance, and topological relationships that help to identify the spatial extent of the event's occurrence as described in the web document.

1.3 Thesis Structure

The rest of this thesis is structured as follows. The next chapter discusses related research on visualizing events in geographic information systems, research on the vagueness of natural language descriptions of geographic space in the field of GIR, and techniques used in geographic information retrieval. The following chapter introduces the method used to annotate web documents using the text engineering software GATE and the guidelines put forth by the creators of the markup language SpatialML. GATE serves as the environment in which web documents are annotated according to the SpatialML guidelines. Methods are introduced for the annotation of spatial anchors and additional spatial information. This information is used to create an event extent that represents what the web document describes as the space of the event. Spatial information that is unique, such as landmarks or stadiums, or that includes an identifier that can be linked to a single location (a geographic coordinate or an address), forms the foundation of this analysis, since this information can be used to establish specific locations where the event was witnessed. To identify the spatial extent of the extracted information, GeoNames, a searchable database of standardized geographic names, is used to identify the geographic coordinates of an extracted text string, as well as to find information relating to the feature type it belongs to or other names it may commonly go by. Once all of the spatial references have been geocoded, weighting and point density estimation are used to produce a surface representing the geographic area of the event as described in the web document.

Chapter Four explains how vague spatial references are annotated and represented in the formation of an event's spatial extent. The vague spatial information incorporated into the methods of this thesis relates to directional, distance, and topological relationships between features. Automatic methods of extraction are unable to identify all instances of vague spatial references, so manual annotation is done to obtain this information. Annotation is done within GATE using guidelines established by the markup language SpatialML, which is equipped to handle vague information (Anderson et al., 2008). Vague references cannot be geocoded into a GIS directly, so methods for the representation of these references are presented that make use of various selection techniques.

Chapter Five highlights a case study in which the guidelines and techniques established in the previous chapters are used to create event zones using the spatial information extracted from a web document. The results are validated using spatial data obtained from a different source responsible for monitoring the same event. The event extent that is created using these methods is compared to the obtained layer in order to confirm how closely the solutions provided in this study correspond to actual event locations. The final chapter highlights what has been accomplished in this work and looks ahead to show how this work can be extended to incorporate more information that can be used in the refinement of the event's extent.

The efforts of this thesis build on the work put forth by the GIR and NLP communities. The retrieval, extraction, and representation of text descriptions constitute a new way in which spatial data can be created and used, and this study contributes new ways to achieve this goal.

CHAPTER 2
RELATED WORK IN GEOGRAPHIC INFORMATION RETRIEVAL

2.1 Text-based Retrieval and Extraction of Spatial Information

The volume of information available to users in tangible formats, such as books, newspapers, and microfilm, and in digital formats, such as databases and web documents, is accessible only given a method for retrieving the information that is needed. Of the various methods of retrieving information, the World Wide Web (WWW) has become valuable in that global knowledge and media can be accessed instantly (Jones et al., 2008). Work on the retrieval of information from the Web has been done in library science, computer science, and database systems and deals with the representation, storage, organization, and access to information items (Baeza-Yates and Ribeiro-Neto, 1999). The field of study known as information retrieval (IR) was developed to help solve the problems of selecting appropriate text and representing the results using queries (Belkin, 1993). The broadness of IR facilitated its growth for specific purposes, notably the retrieval of spatial information from the Web.

The retrieval of spatial information from text, known as geographic information retrieval, is one of the most quickly growing areas in IR research (Overell and Ruger, 2004). The categorization of information by place is often used to narrow a query. To test this, Sanderson and Kohler (2004) analyzed a 2001 Excite query log to determine how geographic queries were constructed and in what cases they were used. From a random sample of 25,000 queries they found that 18.6% contained a geographic term and 14.8% contained a place name. This implies that geographic information is commonly used to formulate web retrievals. However, place names, such as toponyms listed in gazetteers, do not account for all of the geographic terms found in queries, meaning that natural language descriptions of geographic space are also found in retrieval queries and web documents.

2.2 Natural Language Processing

The success of web-based geographic information retrieval is dependent on its ability to search through web documents constructed using natural language to retrieve results. Natural Language Processing (NLP) is an area of study that uses methods of linguistic analysis to transform text characters into words that are given semantic meaning to better represent text (Voorhees, 1999). NLP techniques are used in a geographic context for the detection and resolution of toponyms (Stokes et al., 2008). The detection of toponyms involves identifying each instance of a toponym in a document using a source of geographic information, such as a gazetteer database (Hill, 2006). This technique is straightforward and successful as long as the gazetteer database is updated with current administrative names, or names used previously to describe the same location (Goodchild and Hill, 2008). However, the task of toponym resolution is more complicated and requires an NLP system capable of disambiguating spatial references.

Toponym resolution is a specialized form of the NLP task known as word sense disambiguation (WSD), which determines which sense of an ambiguous word is invoked in a given context using syntactic and semantic cues in the text (Stokes et al., 2008). Methods for the toponym resolution problem have been studied extensively by Leidner (2007), who proposed two minimality heuristics. The one-referent-per-discourse heuristic assumes that a toponym mentioned in a text refers to the same location throughout the discourse, e.g., once Springfield, IL is identified, all other instances of Springfield would refer to the one in Illinois. The spatial minimality heuristic assumes that for all toponyms mentioned in a text, the smallest region that is able to bound the whole set is the one that gives their interpretation (Leidner, 2007).

A number of text processing software applications are available to carry out NLP tasks. A commonly used text engineering program is the General Architecture for Text Engineering (GATE) (Bontcheva et al., 2002). GATE is equipped for annotation, machine learning, gazetteer matching, and a variety of NLP tasks including tokenizing, sentence splitting, part-of-speech tagging, and chunking, and it provides a graphical user interface (GUI) that allows first-time users to interactively process text (Bontcheva et al., 2002). Other text engineering tools similar to GATE include MontyLingua (Carillo et al., 2009), Unstructured Information Management Architecture (UIMA) (used by IBM to develop the Watson computer (Ferrucci and Lally, 2004)), and Weka (Hu and Ge, 2007). There is also a large community of researchers developing open-source NLP toolkits that can be integrated into text processing software or models; these include OpenNLP, the Natural Language Toolkit (NLTK), and LingPipe. While these software tools are effective in executing a number of NLP tasks, there are specific techniques used to support the retrieval of geographic information.
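The spatial minimality heuristic discussed earlier in this section can be illustrated with a short brute-force sketch. This is not Leidner's (2007) implementation: the function names are hypothetical, the bounding region is simplified to a latitude/longitude bounding box measured in squared degrees, and the candidate coordinates are approximate values chosen for illustration.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def bounding_box_area(points: Sequence[Coord]) -> float:
    """Area of the lat/lon bounding box in squared degrees (a crude proxy for extent)."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return (max(lats) - min(lats)) * (max(lons) - min(lons))


def spatial_minimality(candidates: Dict[str, List[Coord]]) -> Dict[str, Coord]:
    """Choose one referent per toponym so that the chosen set fits in the smallest box."""
    names = list(candidates)
    # Try every combination of candidate referents and keep the most compact one.
    best = min(product(*(candidates[n] for n in names)), key=bounding_box_area)
    return dict(zip(names, best))


# Two candidate referents for "Springfield" (Illinois, Missouri) and one for "Chicago".
candidates = {
    "Springfield": [(39.80, -89.64), (37.21, -93.29)],
    "Chicago": [(41.88, -87.63)],
}
resolved = spatial_minimality(candidates)
```

Because each toponym receives exactly one referent, the sketch also behaves in the spirit of the one-referent-per-discourse heuristic. Enumerating every combination grows exponentially with the number of ambiguous toponyms, so this form is only suitable for short documents.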

2.3 Geographic Information Retrieval

Research in geographic information retrieval addresses a number of issues that hinder the ability of normal IR techniques to retrieve geographic terms. GIR is used to detect geographic references, disambiguate place names, rank and index spatial information, and work with vague geographic terms (Jones and Purves, 2008). The detection of place names in text has been used in several applications that include the retrieval of web documents (Teitler et al., 2008), extracting spatial information from photograph captions (Rattenbury et al., 2007) and social media websites (Sankaranarayanan et al., 2009), and retrieving geographically based information for mobile users (Mountain and MacFarlane, 2006). Place name detection is often accomplished by matching parsed terms from a document to gazetteer entries to determine if the text string refers to a geographic location (Jones and Purves, 2008).

Gazetteers are important to verify the geographic references found in the retrieval process and to provide geographic coordinates that can be used for representation purposes. The overarching goal of gazetteers is to establish standardized place names, a goal pursued in the United States through the creation of the Board on Geographic Names in 1890 (Goodchild and Hill, 2008). The recent trend towards digital gazetteers involves databases of geographic information that in most cases contain toponyms, geographic coordinates (or footprints), and feature types (Keßler et al., 2009). Examples of large geographic names databases and digital gazetteers for U.S. toponyms include the USGS Geographic Names Information System (GNIS), the NGA GEOnet Names Server (GNS), the Getty Thesaurus of Geographic Names (TGN), and the Alexandria Digital Library (ADL) Gazetteer. Though digital gazetteers contain large collections of geographic information, there is no standardization of record formatting, which makes sharing information among them difficult (Hill, 2010). One source that addresses this problem is GeoNames, a geographical database of over 10 million toponyms that incorporates many of the gazetteers listed above and allows users to add new entries or edit poorly transferred entries using a wiki-style interface. Gazetteers are helpful in the detection of toponyms; however, the ambiguous use of toponyms in natural language requires further processing.

The disambiguation of place names involves identifying which place, if there are other geographic locations with the same name, is the one to which the document refers (Jones and Purves, 2008). One method of place name disambiguation uses a heuristic to calculate the likely toponym using the centroids of the places extracted from the document (Buscaldi and Rosso, 2008; Rauch et al., 2003). Buscaldi and Rosso (2008) compare this map-based method to a knowledge-based method that uses GeoWordNet to identify context for each place extracted from the text, and found that the knowledge-based method was better at determining the context of a place name. Rauch et al. (2003) use a similar map-based method, but include a heuristic that uses a place's population as part of the disambiguation process: a document is more likely to refer to a highly populated area. Garbin and Mani (2005) use a statistical classifier to organize toponyms into classes based on the classes of other toponyms in the document and terms found in the same sentence.

When GIR tasks are carried out on a corpus of documents, the most relevant documents should be used, which requires methods of indexing and ranking documents. The indexing of a corpus of web documents for GIR is commonly done using multi-dimensional indexes that are capable of handling spatial data; these include grid indexes, quad-trees, R-trees, and kd-trees (Martins et al., 2005). Martins et al. (2005) explain that R-trees are a popular spatial indexing method that can be used for simple indexing of documents by dividing feature types into nested hierarchies. Li et al. (2006) demonstrate a different method that also indexes implicit locations (ancestors of explicit locations mentioned in text): a focus-index is used to index explicit locations according to a hierarchy, and a grid-index, which divides the Earth into 1000 x 2000 cells, is used for indexing the implicit locations to which the explicit locations relate.

The ranking of documents in a corpus or items in a digital library ensures that the results that most closely match the query of the GIR task are returned first. Examples of methods for ranking documents include Yu and Cai's (2007) relevance ranking, which uses geographic as well as thematic relevance scores, and the van Kreveld et al. (2005) method of multidimensional scattering, which uses multiple score criteria. The ranking of geographic objects in a digital library is done using geographic objects and metadata by way of topological relationships held between the query selection box and geographic objects, such as overlaps and contains, as well as probabilistic spatial ranking using logistic regression with metadata entries and query terms (Larson and Frontiera, 2004). After each of these GIR tasks is complete, the final step is to represent results in a user-friendly way.

A recent application of these GIR tasks for news acquisition is NewsStand (Teitler et al., 2008). NewsStand is a system that automatically associates news articles with the geographic references mentioned within them and clusters these articles based on their textual and geographic content (Teitler et al., 2008). News articles are collected using Really Simple Syndication (RSS) feeds to extract the metadata from each article, which avoids problems relating to the encoding and formatting of the web article. Gazetteer referencing, toponym disambiguation, and tagging an article based on its focus are done prior to clustering to ensure the geographic references and focus of each article can be used to place it in the proper cluster (Teitler et al., 2008). Clusters are also determined by how recently the article was published. Document clusters are displayed on a map using markers that are thematically displayed to indicate the type of story (general, business, and sports) (Teitler et al., 2008). While NewsStand introduces a powerful method for viewing the geographic space in which news stories occur, its method of display is hindered by only placing markers at locations where the article was published, not where the event actually took place.

2.4 GIR Methods for Representing Vagueness

The communication of vernacular spatial language to a computer is particularly challenging because humans use spatial language that is qualitative and vague to allow the meaning to vary with the context of use, while a computer represents spatial concepts as quantitative and precise (Cai et al., 2003). In order to bridge this communication gap, there is a body of research dedicated to incorporating vague spatial language used in text into GIR and GIS applications. Vague spatial regions, e.g., "downtown" or "Midwest", and vague spatial relations, e.g., "near", "far", or "outside of", have indeterminate boundaries when modeled in a GIS (Burrough and Frank, 1996). To approximate a vague region, Montello et al. (2003) constructed maps of the vague region downtown Santa Barbara based on boundary lines drawn by human subjects at different confidence levels. While human subject collaboration can help clarify specific cases of vague spatial language, computational methods are needed to handle any case, including those with vague spatial references.

One example of a computational solution for modeling vague spatial language involves the use of fuzzy or probabilistic methods. Alani et al. (2001) use a Dynamic Spatial Approximation Method (DSAM) that employs Voronoi diagrams as well as ontological information to determine which location is described in the text, compared to ones with the same name around it, using fuzzy approximation. Duckham and Worboys (2001) use decision trees and landmarks in an environment to determine a qualitative vector for the nearness of 22 places on the Keele University campus, UK. Hall and Jones (2008) use field crisping methods that incorporate contour lines and evaluation using F-scores to determine locations that are near or not near a reference location. To georeference location descriptions and calculate the uncertainty present afterwards, Guo et al. (2008) use a probabilistic approach that calculates a contribution of locations within a reference object to a target object to create an uncertainty field. Other methods of representing vague spatial language involve the use of place names present in a document (Purves et al., 2007; Jones and Purves, 2008).

The representation of vague spatial regions has been a topic of interest for the SPIRIT community. Spatially-Aware Information Retrieval on the Internet (SPIRIT) is a research project designed to create a search engine capable of retrieving and representing places or regions referred to in documents based on a user's query ( spirit.org). The SPIRIT project relates to the semantic geospatial web proposed by Egenhofer (2002), incorporates a number of GIR disciplines, and has resulted in many publications. The SPIRIT interface takes a user's natural language query and displays the results using georeferencing, disambiguation, and gazetteer lookup techniques (Purves et al., 2007). To represent vague spatial queries in SPIRIT, grounded geographic locations extracted from web documents relating to the original query are used as the inputs in kernel density estimation to generate a surface that bounds the vague region on a map (Jones et al., 2008). The SPIRIT project has been influential in the field of GIR (Tobin et al., 2010), but does not extend to modeling geospatial events in text.

2.5 Extracting and Mapping Events from Text

The modeling of events in GIScience has been studied extensively (Peuquet and Duan, 1995; Worboys, 2005; Cole and Hornsby, 2005; Galton and Augusto, 2002; Worboys and Hornsby, 2004); however, the extraction and representation of events using GIR techniques has been given less attention. While place names and geographic references are spatial in nature, events occur in space and time and therefore require additional GIR methods to incorporate temporal information. Strötgen et al. (2010) explain that temporal information retrieval (TIR) and GIR are two fields that rarely incorporate each other's techniques; however, temporal information in a document can be used to distinguish the temporal context for geographic terms and vice versa. An example of an application of event information extraction using GIR techniques is work done by Rattenbury et al. (2007) that uses the spatial and temporal references in photograph captions on the website Flickr to tag events. Techniques for the extraction and representation of events using web documents combine TIR and GIR techniques. However, some research suggests that using gazetteers specific to events will help in matching a user's query.

Event gazetteers are of interest in the information science community to categorize the passage of events in space and time. Allen (2004) developed an event gazetteer due to the lack of information available in spatial gazetteers relating to events and the temporal information associated with them. His event gazetteer organizes Civil War events into a database that includes information relating to the event's name, location, time, and part-of relations that indicate if an event is linked to other events or a larger event, e.g., a battle in a war (Allen, 2004). The interface for the gazetteer allows users to search based on any of these criteria. To build on this work, Shaw (2008) developed an event gazetteer capable of extracting event information from the Library of Congress, Wikipedia, and document corpora. While the use of an event gazetteer would allow simpler referencing of events, there is not a viable event gazetteer that is equipped to reference every event. Other methods are used to reference geospatial events in space and time.

The extraction and representation of events from web documents is a topic that lacks a large body of literature. The GIR techniques summarized earlier can be used to establish the location of events in space based on the spatial information available in the document. In the next chapter the determination of an event's spatial extent is explained using information obtained from a web document. The techniques used in this study to establish an event's spatial extent combine research done in the GIR and GIScience communities.
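The kind of record an event gazetteer holds, as described for Allen (2004) in Section 2.5, can be sketched as follows. This is an illustrative structure only and is not taken from Allen's system: the class `EventEntry`, its field names, and the function `ancestors` are hypothetical, and the three sample entries are well-known Civil War events entered by hand for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EventEntry:
    """A single event gazetteer record (hypothetical schema)."""
    name: str
    location: str
    time: str                      # free-text date or date range
    part_of: Optional[str] = None  # name of the larger event, if any


def ancestors(name: str, gazetteer: Dict[str, EventEntry]) -> List[str]:
    """Follow part-of links upward from an event to the larger events containing it."""
    chain: List[str] = []
    seen = {name}  # guards against accidental cycles in the part-of links
    parent = gazetteer[name].part_of
    while parent is not None and parent not in seen:
        chain.append(parent)
        seen.add(parent)
        parent = gazetteer[parent].part_of if parent in gazetteer else None
    return chain


entries = [
    EventEntry("American Civil War", "United States", "1861-1865"),
    EventEntry("Gettysburg Campaign", "Pennsylvania", "1863", part_of="American Civil War"),
    EventEntry("Battle of Gettysburg", "Gettysburg, Pennsylvania", "July 1-3, 1863",
               part_of="Gettysburg Campaign"),
]
gazetteer = {e.name: e for e in entries}
chain = ancestors("Battle of Gettysburg", gazetteer)
```

Storing only the immediate parent in each record keeps entries simple, while a traversal such as `ancestors` recovers the full chain from a battle to the war it belongs to.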

CHAPTER 3 CREATING EVENT ZONES USING TEXT-BASED INFORMATION

3.1 Introduction

Web documents provide a readily available source of information about an event's spatial extent. Event descriptions often include some degree of spatial information that explains locations where an event has occurred. These descriptions may include starting and ending locations of events, as well as any intermediate locations that events may have affected on their path. In order to link the information about events in web documents with mapped information useful in a GIS, steps must be taken to identify relevant spatial information and translate this into locations on a map. It is rare, however, to find a web document, such as a news article or blog, that describes an event's spatial extent with exact addresses or coordinates. In this work we refer to the geographic extent of an event that is depicted using natural language in a web document as an event zone. The event zone is modeled as a surface delimited by a set of georeferenced locations found in a document. To represent an event zone on a digital map, a number of steps must be undertaken to exploit all available spatiotemporal information from a document relating to the events.

When creating an event zone, all relevant spatial information must be extracted from the text. This includes identifiable spatial references as well as vague and imprecise spatial references. One method for extraction is for the user to manually parse a web document and identify all relevant terms that can be used for further processing. When working with a single web document this can be effective since most news articles, blogs,

or web postings are normally no longer than a few pages. Efforts in GIR and NLP have resulted in a set of methods that allow for extraction of spatial information using automated language processing of documents. For this work it is assumed that text engineering software such as the open source software GATE is available for processing the documents. The focus of this work is on the expected output of such a system, i.e., the spatiotemporal terms extracted as a result of processing, and the methods applied for understanding and representing event locations. In particular, while these techniques provide methods for automated extraction, there is little support for the extraction of vague spatial locations that are necessary to represent an event's spatial extent in a GIS.

The spatial information that forms the event zone is taken from toponyms extracted from a document. Toponyms include features at various geographic scales. They can include local places such as a park or business, but they may also be defined at coarser levels of detail, such as cities, states, regions, or continents (Leidner, 2007). If a web document includes toponyms at varying scales it is more challenging to identify where the event has occurred. In this work, toponyms extracted from a web document that have a well-defined spatial location, i.e., they can be geocoded in a GIS, are referred to as spatial anchors. These include street addresses, landmarks, or any location that can be identified in a GIS using a gazetteer. A gazetteer provides a large collection of agreed upon place names and their associated spatial extents (Alani et al., 2001). There are numerous digital gazetteers that have been created by government agencies (USGS GNIS, NGA GEOnet) and research groups (ADL Gazetteer) for public use, yet these gazetteers offer little or no support for

many of the vague or imprecise references to geographic space present in natural language (Hill, 2006). To overcome this restriction, spatial anchors are used in this work to provide guidance as to the locations where events may have occurred or are occurring.

A set of spatial anchors provides the basis for a methodology developed for this research to represent the event zone. The process of selecting spatial anchors and a weighting scheme for anchors supports their use for determining how events that are only vaguely referenced in text may be represented in a GIS. The principal steps include: identifying spatial anchors for bounding event zones; representing text information with existing data types in a GIS; generating sample points within polygons; weighting sample points based on amount of detail; and creating an event zone using point density estimation.

Many events in text, for example, natural disasters (blizzards or hurricanes) and human-induced disasters (riots or protest marches), are described using vague descriptors of areas such as: "The storm left damage across three counties in Iowa." Since events may not be described in specific detail, a set of methods is introduced for using spatial anchors in addition to vague geographic references in order to portray the events described in web documents in a GIS. For cases where no spatial anchors exist, vague spatial references extracted from a document are used to compute the event zone.

3.2 Defining spatial anchors

Spatial anchors are important for delineating an event zone because they provide identifiable locations that can be geocoded in a GIS. When used in natural language,

spatial anchors often take the form of salient landmarks or places. Spatial anchors also provide information relating to where the event started or ended, or to other locations at which the event may have been witnessed or had an impact. To extract spatial anchors from an article, the first step is to use text-processing software capable of parsing and annotation.

GATE is a powerful text processing and language engineering tool that can process a document and extract geographic information when it is set up accordingly. GATE is primarily intended for text processing and language engineering, but it also includes components for more complex tasks such as parsing, morphology, tagging, information retrieval, and information extraction (Bontcheva, 2002). While other text processing programs can perform similar tasks, GATE offers a graphical user interface (GUI) and the freedom to create knowledge rules, both of which have contributed to its popularity. GATE's GUI allows users to work with documents individually or in related groups, called a corpus, before processing. From here the user can designate which applications will be used to process the text found in documents. These applications perform a number of common natural language processing tasks, e.g., tokenizing, sentence splitting, and chunking.

For this work we use an application package created by OpenNLP to do the initial text processing. The OpenNLP package designed for GATE is an open-source collection of NLP tools designed to process text and annotate it based mainly on locations, time, people, and organizations ( /projects.html). Our method uses the OpenNLP package for initial text processing; further annotation is then done using standards set by the markup language known as SpatialML (Mani et al., 2008; Anderson et al., 2008).

SpatialML provides a framework for identifying spatial information within a web document. This markup language is used to annotate documents with identifiers relating to geographic space. Spatial anchors, as well as vague spatial information, are annotated using this framework so they can be more easily handled in further processing. Geographic coordinates found in a web document can be annotated along with the location to which they refer. After processing using OpenNLP, any spatial anchors not annotated are processed manually until a rule-base has been established. This rule-base will contain rules for deconstructing a sentence and identifying spatial anchors using a digital gazetteer within GATE, as well as for relating geographic coordinates or other location-based information. In time this step is anticipated to be automated, once the gazetteer contains a complete list of spatial anchors across the United States.

The diversity of natural language means that spatial anchors are not always described using the same name. GATE allows users to annotate text manually and establish new classifications that can be used to further organize spatial information found in web documents. Annotations are made within GATE by highlighting a word in the text and creating a new annotation category for it. Once this annotation category is created, any number of strings in the document can be annotated with that category. SpatialML provides guidelines for annotating spatial information found in text, so new categories are based on the guidelines given by SpatialML. In the case of spatial anchors, a category is created for spatial anchors and any matching text strings found in a document are annotated accordingly. For spatial anchors that include an address or geographic coordinates the annotation involves referencing the address or coordinates as attributes of the location to

which this information refers. Spatial anchors that do not include an address or geographic coordinates and have not been annotated with OpenNLP must be annotated manually. Instead of creating a new category for these spatial anchors, they are added to the gazetteer used in the initial processing so that future processing is equipped with a larger information base.

One common type of spatial anchor found in web documents is manmade structures such as buildings. Some of these structures will have a unique name that distinguishes them from all other structures, such as a landmark, a city's post office, or a cultural attraction. However, other structures may have an identifiable name but may be found elsewhere in the area (e.g., fast food restaurants, gas stations, and grocery stores). To incorporate a spatial anchor that is found frequently in an area (e.g., McDonald's, Home Depot, or Walmart), the sentence from which the spatial anchor was extracted must be annotated properly to include any other information that may indicate to which structure the sentence refers.

To bridge the gap between GATE's text processing results and a GIS, each text string is geocoded in a GIS to a point location. In the future, we anticipate that digital gazetteers will provide true spatial extents for geographic features, rather than coordinates for a point that lies inside an area (e.g., at its centroid). Until this time, a method is proposed for geocoding text strings extracted from web documents using the coordinates provided by digital gazetteers to base layers available for use in a GIS. The geographic database GeoNames is used in this thesis to provide geographic coordinates of features across the world, but more specifically for the United States. GeoNames combines toponym, feature type, and

geographic coordinate information from a variety of digital gazetteers, with two of the largest being the USGS Geographic Names Information System (GNIS) and the NGA GEOnet Names Server (GNS) (Roberts et al., 2010). GeoNames also includes attributes for names and spellings of features that differ from the official name, which allows for greater success in geographic information extraction from articles that describe features that have changed names over time. For example, the Willis Tower in Chicago was previously named the Sears Tower, and GeoNames is equipped to handle both names for the same location. Figure 1 shows a GeoNames query of Sears Tower to determine the coordinates of this spatial anchor. Figure 2 shows the feature type, geographic coordinates, and toponym retrieved by the query in Figure 1, as well as the other name this particular spatial anchor goes by (Willis Tower).

The results of feature searches can be exported to tables and brought into ArcMap. From here each feature's coordinates found in the table are geocoded as points by using the Add XY Data tool. As an example, spatial anchors obtained from a news article describing a tornado that occurred in Iowa City, IA (Figure 3) were given point locations with the Add XY Data tool (Figure 4) ( ontentmain;contentbody). Once these references have been geocoded to point data they can be merged with the other geographic references extracted from the text for further processing.
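The table-to-points step described above can be illustrated outside ArcMap with a minimal Python sketch. It is not the Add XY Data tool itself; the function name `rows_to_points`, the column names, and the sample rows are all assumptions made for illustration, and the coordinates are invented rather than taken from GeoNames.

```python
import csv
import io


def rows_to_points(table_text, name_col="name", lat_col="latitude", lon_col="longitude"):
    """Convert exported gazetteer rows (CSV text) into simple point records.

    The column names are hypothetical; a real export may label them differently.
    """
    points = []
    for row in csv.DictReader(io.StringIO(table_text)):
        try:
            lat = float(row[lat_col])
            lon = float(row[lon_col])
        except (KeyError, TypeError, ValueError):
            continue  # no usable coordinates, so the row cannot be geocoded
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            continue  # coordinates outside the valid geographic range
        points.append({"name": row.get(name_col, ""), "x": lon, "y": lat})
    return points


# Made-up rows for illustration only; these are not gazetteer results.
SAMPLE_TABLE = """name,latitude,longitude
Example Landmark A,41.66,-91.53
Example Landmark B,unknown,-91.54
"""

example_points = rows_to_points(SAMPLE_TABLE)
```

Rows without numeric coordinates are skipped rather than guessed, which mirrors the idea that only text strings with a well-defined location qualify as point-based spatial anchors.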

Figure 1. The query interface for the GeoNames feature search.

Figure 2. Query results from the search on Sears Tower shown in Figure 1.

When representing spatial anchors there are features that are better represented as points, such as a building, a landmark structure, or an address, while other features are better represented as polygons, such as a city, township, or state. In a GIS, representing features as a particular data type will partly depend on the scale at which the user will be reading the map. Yet when creating an event zone the data type used for representing extracted geographic references is based on the granularity used in describing the location of an event's occurrence in a web document. Granularity refers to the amount of detail needed for modeling a task (Hornsby and Egenhofer, 2002). Granularity as it relates to geographic events has been studied extensively, which makes it important to understand the level at which an event is being described when representing it on a map (Worboys and Hornsby, 2004). To represent spatial anchors that define an event's occurrence at different granularities, it is imperative that the user understands how each level is represented to better convey what spatial information the text provides (Howald, 2010). Some examples of how this has been accomplished include annotating articles based on part-whole and causal relationships (Mulkar-Mehta et al., 2011), modeling multiple granularities (which may show single or composite events) with a geospatial lifeline (Hornsby and Egenhofer, 2002), and implementing ontologies (Fonseca et al., 2002).

Figure 3. A news article describing a tornado that occurred in Iowa City, IA annotated in GATE.

The representation of spatial anchors as polygons in a GIS requires more intervention than is needed to represent spatial anchors as points. Base data of the United States provided by ESRI, consisting of layers relating to cities, states, landmarks, roadways, railroads, airports, and water, is used for selection and export to a new shapefile (USA Base Map layer). This base data is collected from government agencies (U.S. Census, National Atlas) and commercial firms (ESRI, Tom Tom, Michael Bauer Research) and integrated into an organized collection of layers. While there may be multiple layers for a single feature type, such as a point and a polygon layer representing cities, each layer is linked so it is displayed at a scale appropriate for its detail. Each text string is queried against the ESRI base data. Using a selection of spatial anchors, a polygon layer is generated because this information does not provide the same level of

detail as point-based spatial anchors (Figure 4). These geographic references include features at varying granularities that do not give identifiable locations where an event occurred, but instead give insight into the area impacted by the event. In the next section an explanation of vague geographic descriptions is given, as well as how they are handled to generate an event zone.

Figure 4. Spatial anchors (e.g., Menard's, Iowa Memorial Union, and St. Patrick's Cathedral) represented as point features in ArcMap.

3.3 Point Generation for Spatial Anchor Polygons

After representing spatial anchors as either points or polygons in a GIS, the next step in creating an event zone is to generate sample points within the polygons created from the parsing process. Points are generated within polygons in order to include them in point density estimation. While the location of points generated within spatial anchor polygons is not as important as the location of geocoded points, their location within the boundaries of a known feature makes them important to include in the computation of an event zone.

Regular sample points are generated using the Regular Point Generation (RPG) tool found in the Hawth's Tools toolbox available for ArcGIS 9.3. This tool allows for the generation of regularly spaced sample points within a polygon (Figure 5). Regular point generation is favored over random sampling to obtain better coverage across a polygon and to avoid clusters that may affect the computation of the event zone. The resulting output may include points outside of the polygon, so extraneous points are clipped to the original polygon. The parameters for the regular point generation must remain consistent for each polygon layer representing the spatial anchors extracted from the document to retain integrity when processing the output result. The input for this process is the polygon layer where the sampling will take place. The spacing used in this analysis is found by taking the default output raster cell size of a feature and multiplying it by 10, which typically results in a suitable number of points distributed throughout a polygon. The output cell size of a polygon is determined by taking the shorter of the height or width of the polygon's extent, in the output's spatial reference, and dividing it by 250 (the number

of cells created) (http://help.arcgis.com). A grid alignment with a 1:1 aspect ratio also contributes to equal coverage across input polygons.

The polygons representing spatial anchors from the Iowa tornado article (Figure 6 (a) and Figure 7 (a)) are used as the inputs to the Regular Point Generation tool to create output point layers (Figure 6 (b) and Figure 7 (b)) for use in creating an event zone for the Iowa tornado article described in section 3.2 ( page2.shtml?tag=contentmain;contentbody).

Figure 5. The Regular Point Generation tool within the Hawth's Tools toolbox.
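The sampling procedure in this section can be sketched in plain Python. This is an illustration under stated assumptions, not the Hawth's Tools implementation: the function names are hypothetical, the polygon is a simple ring of (x, y) vertices in projected units, the cell size is taken as the shorter side of the polygon's bounding box divided by 250, and the spacing is ten times that cell size. Grid points falling outside the polygon are dropped, which stands in for the clipping step.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices of a simple polygon."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def regular_points(ring, cells=250, factor=10):
    """Generate a square grid of sample points and keep those inside the ring."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    cell_size = min(max_x - min_x, max_y - min_y) / cells
    spacing = cell_size * factor  # same spacing in x and y (1:1 aspect ratio)
    cols = int((max_x - min_x) // spacing) + 1
    rows = int((max_y - min_y) // spacing) + 1
    points = []
    for r in range(rows):
        for c in range(cols):
            # Offset by half a spacing so points sit at grid-cell centers.
            x = min_x + (c + 0.5) * spacing
            y = min_y + (r + 0.5) * spacing
            if point_in_polygon(x, y, ring):
                points.append((x, y))
    return points


# A 1000 x 1000 unit square: cell size 4, spacing 40.
square = [(0, 0), (1000, 0), (1000, 1000), (0, 1000)]
square_points = regular_points(square)
```

Because the spacing is derived from each polygon's own extent, every polygon receives a comparable number of points along its shorter side regardless of its size, which is one reason the parameters need to stay fixed across layers.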

Figure 6. (a) Spatial anchors extracted from a news article represented in a GIS. (b) Output after RPG.

3.4 Weighting

The spatial anchors geocoded as points are weighted based on their place in the hierarchy of feature types established by the ADL Feature Type Thesaurus in order to distinguish the relative detail of each spatial anchor included in the point density estimation (PDE) calculation. Weights are assigned to spatial anchors by adding a value field to the attribute table of the spatial anchor layer and populating it with weights. When a value field is included in point density estimation the value at a point determines how many times that point is counted in the calculation of the event zone. In this work, the points representing spatial anchors at a finer level of detail, e.g., buildings, towns, ponds, are counted more in the PDE calculation (weighted higher) than coarse spatial

anchors, e.g., counties, cities, states. Features are given weights before point density estimation in order to create an event zone that displays point concentrations that relate to the level of detail of the spatial information found in the web document. In the context of the Iowa City tornado article ( ontentmain;contentbody), references to specific buildings and locations that the tornado destroyed are more informative than the county or state in which the tornado occurred. Spatial anchors are always weighted based on the level of detail given in the web document.

Figure 7. (a) A spatial anchor extracted from a news article represented in a GIS. (b) Output after RPG.
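The weighting idea can be sketched as a small lookup in Python. The values follow the scheme this section assigns (2, 4, and 8 for first-, second-, and third-order feature types, with vague spatial references counted as 1); the function names, the record layout, and the `order` key are assumptions made for illustration rather than part of the thesis workflow in ArcMap.

```python
# Weights per hierarchy order, as used in this section.
ORDER_WEIGHTS = {1: 2, 2: 4, 3: 8}
VAGUE_WEIGHT = 1  # vague references contribute to density without extra weight


def weight_for(order):
    """Return the weight for a feature-type order (1, 2, 3) or None if vague."""
    if order is None:
        return VAGUE_WEIGHT
    return ORDER_WEIGHTS[order]


def add_weights(points):
    """Return copies of the point records with a 'weight' field added."""
    return [dict(p, weight=weight_for(p.get("order"))) for p in points]


# Hypothetical records; orders follow the asterisks shown in Table 1.
records = [
    {"name": "a county", "order": 1},
    {"name": "a city", "order": 2},
    {"name": "a building", "order": 3},
    {"name": "a vague reference", "order": None},
]
weighted = add_weights(records)
```

Doubling the weight at each order means a single building counts as much as four county sample points, which is how detailed references come to dominate the resulting surface.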

The feature type of an annotated text string is weighted according to a hierarchy of feature types created by the community that developed the ADL Gazetteer. The weighting is based on a bottom-up approach in which the coarse spatial features at the broadest level of the hierarchy are weighted less and features that are at a higher resolution are weighted more. At the top of the hierarchy are the broadest identifications of geographic features, which are physiographic features, regions, manmade features, hydrographic features, and administrative areas.

Within each of these categories there are first, second, and third order types that are used to determine the level of detail in question. The feature types that belong to each of the features at the broadest level of the hierarchy, as derived from the Alexandria Digital Library (ADL) Feature Type Thesaurus (FTT), are shown in Table 1. The number of asterisks before each of the feature types indicates its order in the hierarchy (first order, second order, and third order). Third order types receive the highest weights in this hierarchy, while first order types receive the least weight. Weights are generated inside the attribute table of each spatial anchor point layer with values of 2 (first order), 4 (second order), and 8 (third order). Third order features are given significantly higher weight in this scheme because they are the most detailed locations within a web document. Weights are not given to vague spatial references (valued as 1) used in the calculation of an event zone, which are explained in greater

detail in the next chapter, because they cannot be referenced against the ADL FTT. Instead, vague spatial references are incorporated in the event zone calculation as points contributing to density without weight. Once all points have been weighted the layers are merged into a single weighted point layer that is used for point density estimation.

3.5 Point Density Estimation

After parsing a document and obtaining a weighted point layer for the spatial anchors and vague geographic references, the event zone is created using Point Density Estimation. Point Density Estimation (PDE), a tool within the ArcGIS Spatial Analyst extension, is used to create the event zone because it creates a surface that is based on the weights and concentration of points in an area (Jones et al., 2008). The event zone diverges from the common modeling practice of identifying a dynamic event's occurrence as a single point found somewhere along its path, in favor of a continuous surface of the event's likely occurrence over a geographic area. The event zone is used as a representation of what is stated in the web document.

Point density estimation (PDE) is a method of determining the co-occurrence of points within a neighborhood of each point. This technique interpolates a surface based on a set of points by using the density of points within a search radius, as well as the weighted value of points, to determine the value of each output raster cell:

$$\delta_q = \frac{\sum_{i \in C(q, r)} v_i}{\pi r^2}$$

where $\delta_q$ is the density at a location $q$, $C(q, r)$ is a circular search area centered on $q$ with a radius of $r$, and $v_i$ are the values of points contained within the search area (Jones

and Purves 2008). The size of the search radius determines the resolution of the resulting surface. In this thesis the search radius is determined by taking the diameter of the spatial anchor with the lowest area, because the results were considered reasonable after experimenting with different kernel sizes. The values of points are determined by the feature's place in the ADL Feature Type Thesaurus hierarchy. Specific features, such as buildings and landmark sites, are given higher values than coarse features, such as a county or a state.

Point density estimation is used to generate event zones based on the density of sampling points as well as their values in relation to each other. The resulting output shows a graduated surface in which darker shading equates with greater certainty of an event's occurrence in an area. The resulting event zone (Figure 8) was generated by point density estimation from the merged point layer that included the spatial anchors represented as a point layer (Figure 4) and two polygon layers (Figures 6 and 7). The event zone shown in this figure highlights the detailed spatial anchors in the Iowa City tornado article (darker shades), as well as the vague spatial information and coarse references (lighter shades). The values for each class refer to the calculated density of weighted points around each output raster cell. For example, a neighborhood of 10 pixels containing points weighted as 8 would be calculated by counting each point 8 times to obtain the value for that neighborhood. Five classes were chosen in the visualization of the event zone because this best displayed the spatial references relating to the event from the text.
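The density calculation described in this section can be sketched in Python as follows. This is a simplified illustration rather than the ArcGIS Spatial Analyst tool: the function name is hypothetical, points are (x, y, weight) tuples, and the sketch assumes the common form in which the summed weights of the points inside the circular neighborhood are divided by the neighborhood's area.

```python
import math


def point_density(points, cell_centers, radius):
    """Weighted point density at each cell center.

    points: iterable of (x, y, weight); cell_centers: iterable of (x, y).
    Each point is counted 'weight' times if it lies within 'radius' of the cell.
    """
    area = math.pi * radius ** 2
    densities = []
    for qx, qy in cell_centers:
        total = sum(
            w for x, y, w in points
            if math.hypot(x - qx, y - qy) <= radius
        )
        densities.append(total / area)
    return densities


# One third-order point (weight 8) and one first-order point (weight 2) fall
# within radius 2 of the origin; the second-order point at (10, 10) does not.
sample_points = [(0.0, 0.0, 8), (1.0, 0.0, 2), (10.0, 10.0, 4)]
densities = point_density(sample_points, [(0.0, 0.0), (10.0, 10.0)], radius=2.0)
```

In this toy case the first cell receives a weighted count of 10 and the second a count of 4, so the cell near the detailed anchor ends up with the higher density, matching the darker shading described for detailed spatial anchors.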

Table 1. Features included in the ADL Feature Type Thesaurus. The number of asterisks in front of each feature denotes the weight given.

*agricultural sites          ***library                    **island
*cemeteries                  ***medical facility           **lakes
*climatic region             ***mobile home                **mobile home parks
*continents                  ***offshore platforms         **paleontological sites
*counties                    ***parks                      **parishes
*disposal sites              ***recreational facilities    **parking sites
*fault zones                 ***religious facilities       **performance sites
*launch facilities           ***seaplane bases             **residential sites
*mine sites                  ***sports facilities          **storage structures
*monuments                   **aqueducts                   **telecommunication features
*mountains                   **archaeological sites        **towers
*oil fields                  **bridges                     **townships
*parks                       **boroughs                    **university campus
*reference locations         **camps                       **windmills
*research areas              **canals                      **volcanoes
*states                      **capitol buildings           ***amusement parks
*transportation features     **caves                       ***buildings
***airport                   **cities                      ***educational facilities
***amusement parks           **commercial sites            ***historical sites
***buildings                 **dam sites                   ***medical facilities
***educational facilities    **ecological research sites   ***religious facilities
***heliports                 **forest                      ***parks
***historical sites          **harbors                     ***recreational facilities
***housing areas             **industrial sites            ***sports facilities

An event zone provides a visual description of the spatial references found in a web document. The amount of detail described in a document can be represented with spatial anchors, which means that more spatial anchors will give a more detailed event zone. However, if a document only includes a few spatial anchors and a majority of vague spatial references, this will result in an event zone that is not very useful for

interpretation. To address this, the method is extended to incorporate text strings depicting directions, distances, and topological relations between places.

Figure 8. Event zone resulting from PDE of point layers in Figures 4, 6 (b), and 7 (b).

CHAPTER 4 MAPPING VAGUE SPATIAL EVENT REFERENCES USING ANNOTATED INFORMATION

4.1 Annotation of Vague Spatial References

The annotation of spatial anchors in text can be straightforward given toponyms that are referenced with indivisible addresses or geographic coordinates. However, not all spatial information in natural language is described this precisely, so the annotation of event descriptions must accommodate vague spatial information. SpatialML provides guidelines for the annotation of spatial information in text, and in this study these guidelines are used to incorporate the visualization of ambiguous information on a map, so that the event zone portrays the information given in the web document. This section discusses the annotation of words that modify a user's interpretation of spatial information, including directions, distances, and topological relations.

SpatialML provides guidelines for annotation of spatial information using gazetteers and modifiers found in the same sentence. In this thesis all spatial information extracted from a document is referenced against the GeoNames database to determine whether each text string can be displayed on a map and, if it can, whether the text string describes a spatial anchor or not. This is established in GATE by organizing the extracted text strings into the broadest categories of spatial anchor and spatial information once they have been queried against the GeoNames database. The spatial anchors will be annotated along with the geographic

coordinates given by the result of the GeoNames query, while the other vague spatial information will be annotated without coordinates. Any spatial reference that cannot be given a point location must be processed in a way that retains the context expressed in the web document, while providing useful information about the spatial extent of the event. Using the guidelines provided by SpatialML, a text string is annotated with information from GeoNames query results that can be used later to represent the text on a map. The typology established by the SpatialML guidelines is taken from classifications created by the ADL Thesaurus, which makes weighting SpatialML annotations using the ADL Feature Type Thesaurus hierarchy simpler ( /index_tt.htm). Feature types provide a basic classification of spatial information; however, SpatialML allows for the annotation of more complex spatial information.

Descriptions of an event's occurrence, especially an ongoing event, will often use vague natural language to detail its spatial extent. In order to annotate this information for inclusion in the event zone, any words that help clarify where the extracted vague spatial information is located should also be annotated. In this thesis, these spatial attributes of text strings are referred to as modifiers, which are also supported by the SpatialML guidelines. The most commonly used modifiers include directional information, distance measurement, and topological relations between locations. In order to properly represent these modifiers on a map the annotation needs to be done carefully to ensure that the relationship between the modifier and the spatial information it is modifying remains linked.

4.2 Mapping Vague Spatial References

Spatial anchors, as discussed in section 3.2, provide a stable grounding for visualizing geospatial events described in text documents; however, when all spatial information found in the document is used, the resulting event zone will better portray the footprint of the event described in the text. Spatial references extracted from a document that are not categorized as a spatial anchor are considered vague, which means that interpretations of where the boundaries lie may not always be agreed upon (Montello, 2003). Therefore a collection of base layers can be used to narrow down the query of the region to which the text strings are referring. Vague spatial references are represented as polygons since they do not refer to locations that can be represented in their entirety by a single set of geographic coordinates. To ensure that layers used for text string representation overlay properly, a collection of layers is used that is both created and distributed by the same source.

The vague spatial information extracted from the web document is queried against a collection of layers to find the best map representation for each text string. Geospatial data holdings for the United States are numerous, e.g., USGS, NGA, TIGER/Line, and state and local government geospatial datasets. However, for this work, we use a collection of base map layers distributed by ESRI. These base layers, known as USA Base Map, include states, counties, cities, water bodies, roads, railways, and airports. What makes this collection of layers useful, in comparison to other data collections, is that it is not stored in separate files for download. The layers are linked together so they become visible at different scales, and geometries are updated. The updating of layers reduces the

chance of false overlapping or misplaced features when representing spatial anchors, modifiers, and vague spatial references.

The map representation of spatial information extracted from text begins by organizing the text strings into categories. The feature types established by the ADL Gazetteer, as described in section 3.4, are used to form categories when annotation occurs in GATE. These are the same feature types used for weighting sampling points for point density estimation, which makes map representation and weighting more organized (Table 1). These categories are also used when structuring a query to select a feature in a GIS. The process of representing text strings with the base layers involves creating an SQL query using the Select by Attributes tool in ArcMap. The SELECT statement will always use the all operator *, the FROM statement will have the polygon base layer that matches the category of the text string's annotation, and the WHERE statement will be the text string. Once a query is executed, the feature that matches the text string will be selected and ready for export to a new shapefile. Toponyms with administered boundaries have the highest likelihood of returning successful results because these areas are easily represented on a map.

In cases where directional, distance, or topological relationships are described in text, the selection of sample points generated within polygons is used to represent these relationships. However, natural language descriptions of space will include modifiers to toponyms that can be used to narrow the vague reference's extent. The following methods describe how to represent these modifiers when generating an event zone.
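The query structure described above (select everything, from the layer matching the annotation category, where the name equals the text string) can be illustrated with a short Python helper that assembles the statement as text. The function name, the layer name `counties`, and the field name `NAME` are hypothetical; the actual layer and field names depend on the base data, and in ArcMap the query is entered through the Select by Attributes dialog rather than built this way.

```python
def build_select_query(layer, field, text_string):
    """Assemble a SELECT * ... WHERE statement matching a toponym exactly.

    Single quotes in the toponym are doubled so names such as "St. Mary's"
    do not break the quoted literal.
    """
    escaped = text_string.replace("'", "''")
    return f"SELECT * FROM {layer} WHERE \"{field}\" = '{escaped}'"


# Hypothetical layer and field names, for illustration only.
county_query = build_select_query("counties", "NAME", "Johnson")
apostrophe_query = build_select_query("cities", "NAME", "St. Mary's")
```

An exact-match condition like this works best for toponyms with administered boundaries, since their names tend to appear in the base layers in a single standard form.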

Mapping Directional Modifiers

Directional attributes included in a text description of an event's extent are useful when narrowing down an event zone. We annotate directional information using the SpatialML attribute known as a mod, which is referred to as a modifier in this thesis. A mod can be used for cardinal directions and relations that include bottom, top, and border. The following is an example showing how to annotate a sentence that indicates a restriction to an area using SpatialML guidelines: The infection spread to <PLACE state="IL" country="US" type="STATE" mod="N" form="NAM">northern Illinois</PLACE>. The directional information is incorporated into the annotation of the toponym (Illinois) using the mod type within the SpatialML guidelines (Anderson et al., 2008). The annotation of mods using GATE is accomplished by highlighting the direction that modifies a toponym and entering the text string into the annotation list as a direction attribute of the toponym. At present, cases of bottom and top are not included, although future work could map these directions to south and north respectively. As discussed in section 4.2, direction is a toponym modifier that is used often in natural language. Cardinal directions used to modify a toponym can be used to isolate a portion of an area for further analysis. Consider the following example that describes a car accident in Atlanta, GA. When comparing the sentences "Two robbery suspects caused a major accident in Atlanta." and "Two robbery suspects caused a major accident in southwest Atlanta" (taken from the article), the directional information in the latter example could be effectively used when creating an event zone for this event (money-safe-and-ak-47s). The representation of directional information requires extra data manipulation before the generated layer can be used in point density estimation. Incorporating directional information into a map representation first requires the creation of a polygon for the toponym it modifies. After a polygon is generated from the selection (Figure 9 (a)), the next step is to identify the centroid of the polygon using the Feature to Point tool in ArcMap. The Feature to Point tool uses a polygon input to generate a point with the same attribute information at the centroid of the polygon (Figure 9 (b)). Once the centroid point has been created, the sample points are generated within the polygon (Figure 9 (c)). Before any weighting of points occurs, however, the directional information is used to isolate the sample points that lie in the stated direction relative to the centroid. Once these points have been selected and a new shapefile has been generated, the points can be weighted appropriately and used for point density estimation (Figure 9 (d)).

Mapping Distance Modifiers

If the exact location where an event occurred is not given, it is common to refer to distances from a known location. Annotation of distance measurements in GATE is done by treating the distance mentioned in the text as a modifier to the toponym to which it relates. As with directional modifiers, any distance modifiers are highlighted and entered as attributes of the toponym they modify within the annotation list. There is also the possibility of using distance modifiers to describe a location that is between two known locations, such as The fire was set in an area 30 miles north of Bloomington, IL


and 15 miles west of Pontiac, IL. In order to annotate this case, the two locations and the distance modifier must be annotated and linked.

Figure 9. Map representation of a directional modifier.

We identify the location from which the distance is measured as the source toponym, while the location to which the distance is being measured is the target toponym. The source and target toponyms must be linked in their annotation in order for the distance modifier to be represented in the GIS. To annotate a distance modifier case, the source toponym and distance measurement are annotated first so the distance can be recorded as an attribute of the source toponym. Once this is done the target toponym can

be annotated with the source toponym as an attribute to link both of these locations together. If the target toponym is not a spatial anchor, the distance modifier description may make it possible to identify a common area shared by the area the distance encompasses (in all directions from the source toponym) and the target toponym. This is determined by the particular topological relationship that is described in the text. If a common area is identified and annotated properly, the polygon created to represent the distance modifier is narrowed down to a smaller area, which equates to more confidence in the resulting event zone. Vague natural language descriptions of geographic space may include distance measurements as a way of estimating an event's location from a certain vantage point. These distance measurements can span small distances (feet, yards, or meters) or larger distances (miles and kilometers). Distance modifiers are represented in a GIS using the Buffer tool. The Buffer tool generates a buffered polygon that extends a specified distance around the input polygon. To include a distance modifier with an extracted toponym, the polygon for the toponym must already be generated. The Buffer tool is then run with the polygon as the input and the buffer distance set to the distance that the modifier states. The exported polygon is used as the input for sample point generation in further analysis. The description represented in Figure 10 (a) suggests the event occurred in a location 25 miles away from Cape Girardeau, indicating that the event did not occur that same distance around the entire area. While the result of the buffering process does introduce additional points into the

calculation, these erroneous points will likely not be part of the denser collection of points in which the event zone will lie (Figure 10 (b)). In some cases there may be directional and distance modifiers that can be combined to better isolate sample points in the final event zone calculation. For example, in an article describing an earthquake, one of the extracted text strings that describe the location where the earthquake originated states it was "centered 20 miles east of Willits". The modifiers for this toponym can be parsed into a directional modifier (east) and a distance modifier (20 miles) (Figure 11). To properly represent these modifiers on a map, the distance modifier is handled first. The Buffer tool is used to generate a polygon that extends the distance specified in the text all around the original polygon (Figure 11 (a)). With the buffered polygon generated, the next step is to use the Feature to Point tool to generate a centroid point for the buffered polygon (Figure 11 (b)). Once a centroid point has been obtained, the directional modifier can be incorporated into the visualization. Sample points are generated inside the buffered polygon and, using the centroid and the directional modifier from the text (Figure 11 (c)), the points in the direction of the description are selected and written to a new shapefile for weighting and further processing (Figure 11 (d)). The addition of a direction to a distance description can help reduce the number of sample points used in the point density estimation, and better isolate the area of the event's occurrence.
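The combined distance and directional procedure can be sketched in a few lines of Python. This is a simplified illustration rather than the ArcMap workflow itself: it assumes planar coordinates, buffers a single point instead of a polygon (so the buffer is a circle and its centroid is that point), and uses a regular grid of sample points. All function names and the spacing parameter are hypothetical.

```python
import math


def regular_points(xmin, ymin, xmax, ymax, spacing):
    """Yield a regular grid of (x, y) sample points covering a bounding box."""
    nx = int((xmax - xmin) / spacing) + 1
    ny = int((ymax - ymin) / spacing) + 1
    for i in range(nx):
        for j in range(ny):
            yield (xmin + i * spacing, ymin + j * spacing)


def buffer_points(center, radius, spacing):
    """Sample points inside a circular buffer of `radius` around `center`."""
    cx, cy = center
    box = (cx - radius, cy - radius, cx + radius, cy + radius)
    return [p for p in regular_points(*box, spacing)
            if math.hypot(p[0] - cx, p[1] - cy) <= radius]


# Sign of the x (east) and y (north) offset required for each direction.
DIRECTIONS = {
    "N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0),
    "NE": (1, 1), "NW": (-1, 1), "SE": (1, -1), "SW": (-1, -1),
}


def select_direction(points, centroid, direction):
    """Keep only the points lying on the `direction` side of the centroid."""
    dx, dy = DIRECTIONS[direction]
    cx, cy = centroid
    return [(x, y) for x, y in points
            if (dx == 0 or (x - cx) * dx > 0) and (dy == 0 or (y - cy) * dy > 0)]
```

For a description such as "20 miles east", `buffer_points((0, 0), 20, 5)` followed by `select_direction(points, (0, 0), "E")` keeps only the sample points in the eastern half of the buffer, which mirrors the reduction in sample points described above.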

Figure 10. (a) Polygon representations of spatial anchors in a GIS. (b) Map representation of a distance modifier.

4.3 Mapping Topological Relations

SpatialML also supports a modifier for topological relations used in natural language, known as LINK. The LINK types include relations for containment, connection, overlap, and nearness. Topological relations found in event descriptions may provide additional information that can be used when generating the event neighborhood, but they can also be vague. LINKs are used to express topological relations, which can

be used later for making selections based on location while processing expressions in a GIS. The annotation of topological relations using SpatialML guidelines in GATE involves identifying a source toponym and a target toponym, as well as the text string that identifies the particular LINK being expressed. Each LINK type has its own sentence structure or words that indicate it belongs to a certain topological relation. When presented in natural language, containment can either be explicit, e.g., "The town of Crestwood is in Cook county", or implicit, e.g., "Crestwood, Illinois". For the topological relation of nearness, words such as near, close, and next to are used to connect the source toponym to the target toponym during annotation. Ambiguity in natural language can arise from the connections made between locations. These connections describe topological relationships held over geographic space, and have been extensively studied in GIScience (Egenhofer, 1991, 1994; Winter, 1994). Topological relationships found in natural language will often describe notions of containment, connection, and equality. The annotation of these relationships is well documented in the SpatialML guidelines, but representing them on a map involves the use of different techniques.
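As a rough illustration of how connecting words can signal a relation, the sketch below maps a connector phrase to a relation label. It is not part of SpatialML or GATE, where this annotation is done manually; the cue lists and labels are assumptions, with the nearness cues taken from the words listed above and the containment cue from the "is in" example.

```python
# Hypothetical cue phrases for two of the relations discussed above.
LINK_CUES = {
    "nearness": ("next to", "near", "close"),
    "containment": ("is in",),
}


def classify_link(connector):
    """Return the relation whose cue phrase occurs in `connector`, else None."""
    text = connector.lower()
    for relation, cues in LINK_CUES.items():
        if any(cue in text for cue in cues):
            return relation
    return None
```

Simple substring matching of this kind will miss implicit containment such as "Crestwood, Illinois", which is one reason the annotation remains a manual step.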


Figure 11. Map representation of distance and directional modifiers attributed to source and target toponyms.

Containment

The topological relationship of containment relates to one location being within another location. Once containment has been properly annotated with a target toponym being inside a source toponym, each of these toponyms is used to identify the appropriate map feature. Containment is easy to represent on a map if the place name is not commonly found in the United States. This is done using a selection with an SQL query structured with the target toponym in the WHERE statement and the layer

relating to the appropriate annotation type of the text string in the FROM statement. The selected polygon should match the target toponym and can be exported to a new shapefile for further processing. However, if the target toponym is a place name that can be found at other locations across the United States, further data mining must be done to identify the proper location. If multiple entries are found after the selection is made, the source toponym can be used to filter out the other entries, provided the source toponym is found in another field of the target toponym's attribute table. For example, a query for the city of Washington will retrieve multiple cities across the United States. To identify which Washington entry is correct, the state (source toponym) extracted from the article is used to identify which Washington is located in that state. If the source toponym cannot be found in another field, the Select by Location tool must be used to identify the correct feature. Select by Location is a selection tool that allows features to be selected based on their location relative to other features. To use the Select by Location tool for this process, the source toponym must be selected and exported to a new shapefile using the Select by Attributes tool. Once added to the map, this new layer is used as the input layer for the Select by Location operation. With the target toponyms still selected, Select by Location is set up to select only the one target toponym that is within the source toponym. In a news article describing a wildfire in California, an affected area is described as being in a town, so Select by Location can be used to determine which feature the article is describing (Figure 12). Once this selection


is made, a new shapefile can be created for the target toponym and used to create an event zone.

Figure 12. Select by Location is used to isolate the target toponym within the source toponym.

Extended Connection

An extended connection between two locations denotes a physical or administrative meeting of two features in space, e.g., borders and coastlines. When representing this topological relationship on a map, each of the connected toponyms that were extracted is used for the visualization (Figure 13 (a)). Once each of these toponyms has been transformed to a feature, they are used as inputs for regular point generation


(Figure 13 (b)). From here the sample points that lie closest to the connection established in the text description are selected and exported to a new shapefile (Figure 13 (c)). This selection ensures that only the sample points in the area of the described connection are used for point density estimation, rather than the other sample points that have been generated throughout the features.

Figure 13. Map representation of an extended connection modifier.
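One way to make "closest to the connection" concrete is to keep the sample points that fall within a tolerance of the shared border. The sketch below assumes planar coordinates and a border supplied as a list of vertices; the tolerance value is a hypothetical parameter, since the thesis procedure selects the points interactively in a GIS.

```python
import math


def distance_to_segment(p, a, b):
    """Shortest planar distance from point `p` to the segment from `a` to `b`."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its end points.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def points_near_border(points, border, tolerance):
    """Select the sample points within `tolerance` of a border polyline."""
    segments = list(zip(border, border[1:]))
    return [p for p in points
            if min(distance_to_segment(p, a, b) for a, b in segments) <= tolerance]
```

Applied to the merged sample points of two adjacent features, this leaves a band of points along their common boundary and discards the points in the interior of each feature.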

Partial Overlap

There may be instances in text descriptions of geographic events where a location is described as being in a place that is found in two or more larger areas. For example, a forest fire may be taking place in a forested area that is described as being in a county and in a national park that spans two counties, so both of these places would need to be used to determine the area. Each of these places is selected and exported to a new layer in a GIS to begin the process. Next, sampling points are generated within these two polygons using the Regular Point Generation tool. Once the point layers for these extracted locations have been created, there will be a noticeably higher point density where the two areas overlap in geographic space. The next step is to select the points that the two layers hold in common. The Select by Location tool is used to select points in each layer, so that the selection of points from the second layer is added to the first selection. The selection is structured with one point layer as the input, and points are selected if they fall within the polygon of the other layer. This process is repeated for the other point layer until the densest collection of points from both layers is isolated and can be exported to a new layer for computation of the event zone. The ambiguity produced by natural language is the basis for event zone computations. While these methods for computing event zones may not be appropriate for all cases of ambiguous spatial information, they do cover the most common examples found in web documents that describe an event's extent. To ensure that extracted spatial information can be represented correctly in a GIS, it is important to use a model that can employ each of these methods when necessary.
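The two-way selection for partial overlap can be sketched as follows. This is an illustration under simplifying assumptions: polygons are plain lists of planar vertices, and a standard ray-casting test stands in for the Select by Location tool; the function names are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test; `polygon` is a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edges that a horizontal ray from the point crosses on its right.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def overlap_points(points_a, polygon_a, points_b, polygon_b):
    """Points of layer A inside polygon B, plus points of layer B inside polygon A."""
    in_b = [p for p in points_a if point_in_polygon(p, polygon_b)]
    in_a = [p for p in points_b if point_in_polygon(p, polygon_a)]
    return in_b + in_a
```

The returned list corresponds to the denser collection of points in the shared area, which would then be exported for the event zone computation.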

CHAPTER 5
APPLICATION OF EVENT ZONE METHODS FOR A WEB DOCUMENT

As presented in the previous chapters, the development of event zones using extracted text-based information is a multi-step process. This chapter applies these techniques to two news articles obtained from the web, and shows the steps that are necessary in order to apply these techniques to web documents. News articles on topics reported from the local to the international level are common on the web and are provided by news outlets worldwide.

5.1 Creating an Event Zone for a Tornado News Article

The first news article used in this chapter was obtained from the New York Times website and is used to demonstrate how an event zone is created from start to finish. The article describes a tornado that occurred in College Park, Maryland in September, and recounts the tornado's course of destruction with witness descriptions (Figure 14). This article was chosen because it provides examples that cover a variety of the topics presented in previous chapters. Spatial information and modifiers are annotated and represented in a GIS, and spatial information not associated with the extent of the event of interest is excluded.

Maryland Campus Reels From Tornado That Killed Sisters - NYTimes.com

COLLEGE PARK, Md., Sept. 25 The University of Maryland began to recover today from a tornado that ripped across campus on Monday evening, killing two students, damaging several buildings and tearing up hundreds of trees. The students, Colleen P. Marlatt, 23, and Erin P. Marlatt, 20, sisters from nearby Clarksville, had stopped after class to visit their father, F. Patrick Marlatt, at the university's Fire and Rescue Institute, where he is deputy director. Clifford H. Turen, a family friend and doctor who lives near the Marlatts, said Mr. Marlatt had warned his daughters to head home as the skies grew menacing. "He put them in the car and said, 'Why don't you get home before the storm comes?' " Dr. Turen said, "and he walked back into his office, and -- boom." The tornado struck, tossing the women's Mercury Sable a quarter of a mile and flattening the institute's building. Their father and several other workers were trapped, and some were slightly hurt. A volunteer firefighter, Clarence Kretizer, 78, of Bowie, Md., who responded to the emergency call, collapsed and later died. The tornado tore through the northwestern edge of this sprawling, 34,000-student campus about 5:30 p.m. Along with the institute, a dining hall, a campus athletic complex and a day care center were damaged. The tornado displaced about 700 students from an apartment complex. Classes were to resume on Wednesday, university officials said. About 25 students suffered cuts and bruises; another 25 people in surrounding Prince George's County had minor injuries. The authorities said the tornado heavily damaged about 40 houses in Howard County and ripped the roof off a Home Depot store there. Gov. Parris N. Glendening declared a state of emergency in Prince George's and Howard Counties. The tornado tore a roof off a building at Laurel High School, damaged houses and knocked out power to 20,000 houses.
In College Park, students today surveyed the destruction, including a parking lot with scores of mangled cars, and recalled the horror of the day before. Alison Bazala, 29, a doctoral student who plays the cello, was on the third floor of the Clarice Smith Performing Arts Center on Monday rehearsing for a concert to celebrate the building's grand opening. "Somebody said, 'Hey, look, there's a tornado,' and I thought I'd see something in the distance," Ms. Bazala said. "When I looked out the window I saw this huge black cloud coming right toward us," about 50 feet from the window. Rebecca Burdette, a freshman from Germantown, Md., said she had heard tornado warnings on Monday but told roommates to ignore them. "I just told them, 'Don't worry about it, we never have tornadoes in Maryland,' " Ms. Burdette said. "I would never tell anybody that again." Robert Chartuk, a spokesman for the National Weather Service, called such a powerful tornado -- with winds estimated at 150 to 200 miles per hour -- extremely rare in Maryland and nearby.

Figure 14. News article obtained from the New York Times website describing a tornado that occurred in College Park, Md ( 14FF385E0C758EDDA00894D ).

Annotation of the Tornado Article

The extraction and annotation of spatial information found in this news article is carried out in GATE. Text processing begins using the OpenNLP set of applications. The OpenNLP tools include a sentence splitter, tokenizer, part of speech (POS) tagger, chunker, and name finder (/index.php?title=main_page). These applications work together to deconstruct the news article and extract the spatial anchors and vague geographic references that can be used in the annotation process. After the OpenNLP application identifies toponyms found in the article, further annotation is accomplished using the guidelines of SpatialML to find the remaining spatial references. The OpenNLP application provides a useful first step in extracting spatial information automatically because it identifies specific places and times mentioned in the article. However, these NLP tools are not fully equipped to identify all toponyms and references to geographic locations found in natural language. For this reason we use guidelines established by the SpatialML community to annotate spatial anchors, vague references, and modifiers that the OpenNLP application is unable to identify (Anderson et al., 2008). At this time SpatialML remains an annotation scheme that is used to manually annotate text. To capture all of the spatial information presented in this article, the text strings have been manually annotated using GATE's Annotation Editor. The user interface is designed to annotate text strings that are highlighted by the user; additional information, such as modifiers, source or target toponym identification, and place type, is then entered using the editor's options (Figure 15). When the tornado article is processed using only the OpenNLP application, the spatial information is not complete.

Spatial information in the document is annotated as a location, spatial anchor, or modifier. The spatial anchor type is used to annotate any spatial anchors found in the article (e.g., Howard County and Clarksville). Any text strings annotated as a location or spatial anchor are updated with information relating to their place type, their status as a source or target toponym, and an identification of the modifier to which they are linked. Text strings annotated as modifiers are updated with information relating to the modifier's type as well as the source and target toponyms they modify in the text. After annotation, twelve spatial anchors, two vague references, and two modifiers were extracted from the article. These annotated text strings are shown below as they appear in GATE, where each shade identifies the text string as a spatial anchor or modifier (Figure 15). Two of the spatial anchors found in the article, Laurel High School and Clarice Smith Performing Arts Center, were specific locations where the tornado was witnessed. The other spatial anchors annotated were regions used to describe where the tornado's damage was witnessed during and after the event. The modifiers found in this document are directional ("northwest of campus") and distance ("about 50 feet from the Clarice Smith Performing Arts Center"). With the annotation complete, we then query the annotated text strings against the GeoNames database to obtain geographic coordinates that can be used in the representation process. The GeoNames database was also used to identify the standardized name for each annotated text string. The results returned for each query include the feature type, county, state, and geographic coordinates. The resulting entries are exported to a table that can then be added to a GIS for further analysis (Figure 16). These geographic coordinates are used to model the text strings as geographic extents; however, as point

locations they do not provide sufficient information to understand the spatial coverage of an area described in a web text. For example, the geographic coordinates given for Howard County would only give one location within Howard County, while the tornado may not have occurred at this location. It is for this reason we model the tornado's extent using the information extracted from the article rather than placing a point on a map representing the affected geographic area.

Figure 15. Spatial anchors and modifiers annotated in GATE.

The polygon layer chosen matches the feature type established in the point's attribute table. Once a selection is made, the polygons associated with the text strings are exported to separate layers. For example, a Select by Location query for Prince


George's County in Maryland would select a feature within the Counties layer that contains the point location established by the GeoNames database. This is repeated for each point until there are polygon representations for the spatial anchors taken from the text. One of the data issues encountered was finding a polygon representation for Clarksville. This municipality was not included in the base layers used in this analysis. When this happens the feature must be represented as a point, but it is still given the appropriate weight based on the feature type's place in the ADL Feature Type Thesaurus hierarchy.

Figure 16. Extracted point locations for spatial anchors used in the computation of an event zone.

Another challenge encountered is the existence of different representations of a text string on the map. When the annotated text strings were queried against the

GeoNames database, the text string Germantown returned four entries for cities within Maryland. Since the news article only makes reference to a single Germantown, the other three polygons must be removed to isolate the event's extent as described in the article. The representation of all spatial information extracted from this news article may show a tornado path that covers a much larger area than is naturally possible (including descriptions of different events). To filter out the spatial information that refers to locations not associated with the tornado's extent, e.g., the sisters' hometown of Clarksville and the witness's hometown of Germantown, the relationships between the map representations must be examined. The exclusion of information not relating to the event's extent is done using the topological relationships of containment and extended connection. Once all spatial information has been georeferenced, features from the article that neither touch the boundary of another feature nor are contained within one are excluded from further analysis and are not included in the creation of the event zone. The assessment of topological relationships is accomplished using the Select by Location tool to identify which layers are neither contained by nor touch the boundary of another feature. These layers are removed and the remaining layers are used for further analysis. Once the spatial anchors, vague references, and modifiers have been represented appropriately (Figure 17), points can be generated within polygon features.
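The exclusion step can be sketched with a simple filter. For brevity this illustration approximates each feature by an axis-aligned bounding box, which is an assumption made here and not how Select by Location evaluates geometry; note that with boxes, partially overlapping features are also kept. The feature names and coordinates are hypothetical.

```python
def boxes_touch(a, b):
    """True if boxes (xmin, ymin, xmax, ymax) share a boundary, overlap, or nest."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def related_features(features):
    """Keep features that touch, or are contained by, at least one other feature."""
    kept = {}
    for name, box in features.items():
        others = (b for n, b in features.items() if n != name)
        if any(boxes_touch(box, other) for other in others):
            kept[name] = box
    return kept
```

With two adjacent counties, a town inside one of them, and a distant town, only the distant town is dropped, which mirrors the removal of locations that are unrelated to the event's extent.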


Figure 17. Map representations of spatial anchors.

5.1.2 Weighting and Event Zone Creation

The next step is to generate sample points within the polygons and weight all point layers. The four spatial anchors that were used as inputs for the point sampling were representations for College Park, the University of Maryland campus, Prince George's County, and Howard County. Once point layers were generated from each of the original polygon layers, the annotated modifiers were represented using the techniques described in Chapter 4. The directional modifier northwestern for the University of Maryland is represented using the centroid location as a basis for excluding the other

points in the layer, which leaves only points in the northwestern part of the layer. The distance modifier 50 feet from the Clarice Smith Performing Arts Center is represented using the Buffer tool. The point layer for the Clarice Smith Performing Arts Center is used as the input and the buffer distance is set to 50 feet. Sample points are then generated inside this buffer area and used in further analysis. With all point layers created, each is weighted using the ADL FTT hierarchy described in Chapter 3. Two spatial anchors extracted from the article, Laurel High School and Clarice Smith Performing Arts Center, are given the greatest weight (10). College Park, Clarksville, and the University of Maryland are weighted lower (3), while Prince George's County and Howard County are given the lowest weight (2) since they are at the coarsest level of granularity. After all point layers have been weighted they are merged into one weighted point layer and are ready to undergo point density estimation. The resulting event zone shows the area where the tornado likely occurred as described in the news article. The extent of the event zone shows the full range of locations where the tornado may have occurred as extracted from the article (Figure 18). The range of values computed for the event zone is divided into five classes to best display the spatial anchors and vague spatial references. The concentration of high-valued cells in the center of the event zone reflects the most detailed locational information relating to the tornado event, which indicates features that were directly affected by the tornado and locations of witness sightings.
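The weighting and merging steps, followed by a density calculation, can be sketched as below. This is a simplified stand-in for the GIS point density tool, assuming planar coordinates and a circular neighborhood around each cell center; the weights 10, 3, and 2 come from the text, while the function names, the neighborhood radius, and the grid parameters are hypothetical.

```python
import math


def merge_layers(layers):
    """Flatten [(weight, [(x, y), ...]), ...] into one list of (x, y, weight)."""
    return [(x, y, w) for w, pts in layers for x, y in pts]


def point_density(points, cell_center, radius):
    """Sum of weights within `radius` of a cell center, per unit neighborhood area."""
    cx, cy = cell_center
    total = sum(w for x, y, w in points if math.hypot(x - cx, y - cy) <= radius)
    return total / (math.pi * radius ** 2)


def density_grid(points, xmin, ymin, ncols, nrows, cell_size, radius):
    """Evaluate the weighted density at the center of every cell of a raster grid."""
    return [[point_density(points,
                           (xmin + (c + 0.5) * cell_size, ymin + (r + 0.5) * cell_size),
                           radius)
             for c in range(ncols)]
            for r in range(nrows)]
```

Because each point contributes its weight rather than a count of one, a single third order spatial anchor (weight 10) raises the local density as much as five county-level sample points (weight 2), which is why the highest values cluster around the most specific locations.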

Figure 18. Event zone depicting the tornado described in the news article.

Creating an Event Zone for a Forest Fire Article

The next example presented is a news article depicting a forest fire that occurred in the San Bernardino National Forest in California. This article was obtained from the CBS News website and presents clear examples of modifiers and spatial anchors (Figure 19). This text information is represented in a GIS and transformed into an event zone. The extraction of spatial information from this article resulted in 7 spatial anchors and 3 modifiers. One of the modifiers ("15 miles northeast of San Bernardino") was excluded from the computation of the event zone because the target toponym that the modifier described ("Lytle Creek") was georeferenced with the GeoNames database and coordinates were identified. Each of the spatial anchors was represented as a point in a GIS using GeoNames coordinates. The spatial anchors that were categorized as feature types best represented as polygons, i.e., California, Las Vegas, Los Angeles, San Bernardino National Forest, and Lytle Creek, were converted to a polygon data type using the Select by Attributes tool and the base data layers provided by ESRI. Lytle Creek was the only spatial anchor that was not found during the selection process, which resulted in the representation of this spatial anchor as a point. The Las Vegas polygon was removed from the calculation of the event zone because it was not contained by, and its boundaries did not touch, any other represented feature. The modifiers used in the calculation of this event zone were "an area 60 miles east of Los Angeles" and "Southern California". The distance and directional modifier ("60 miles east of Los Angeles") was represented using the Buffer tool to generate a 60 mile buffer around the city of Los Angeles, and then sample points were

generated within this buffered area. To represent the directional component of this description, a centroid was generated within the polygon representing the city of Los Angeles and the sample points east of the centroid were selected and exported for use in the computation of the event zone. The directional modifier Southern in the extracted reference Southern California was represented using the same method described previously. Once the sample points south of California's centroid point were exported to a new layer, this new layer was used in the computation of the event zone.

Southern California Forest Fire Destroys 3 Homes

Fire In Southern California's San Bernardino National Forest Torches 3 Homes, Threatens Others

(AP) A fire driven by winds of 40 mph destroyed three homes and threatened dozens of others in a rugged warren of mountains and canyons northeast of San Bernardino on Saturday. As a huge wall of flames chewed through thick timber and brush in the Lytle Creek area 15 miles northeast of San Bernardino, residents of some 50 homes in its path fled, taking horses and pets with them. No injuries were reported. "We do have structures lost, three homes in Swarthout Canyon," said Norma Bailey, a fire information officer with the U.S. Forest Service. The area is not far from Interstate 15, a major route connecting Las Vegas with Southern California. About 50 homes are located in the canyon, and Bailey said evacuation centers had been opened for people, large animals like horses and smaller animals like dogs and cats. People were being housed at Eisenhower High School in nearby Rialto, while horses were being boarded at the Glen Helen Regional Park rodeo grounds in Devore and smaller animals were being taken to a local animal shelter. The area is about 60 miles east of Los Angeles. The blaze, named the Sheep Fire, broke out about 2 p.m. in the northwest corner of Lytle Creek, a small community surrounded by the San Bernardino National Forest.
Fueled by thick timber and brush, and pushed over hills and canyons by the wind, it quickly burned across 1,500 acres. It was only 5 percent contained Saturday night, and its cause was under investigation. In addition to the homes, campgrounds and an RV park in the area were also evacuated. The interstate, from which huge plumes of smoke and walls of flames were visible, remained open. The blaze was being fought by more than 500 firefighters from the U.S. Forest Service and the San Bernardino County Fire Department. Bailey said they used air tankers and water-dropping helicopters until darkness grounded the aircraft.

Figure 19. News article obtained from the CBS News website describing a forest fire that occurred in California.
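
The buffer-and-direction procedure used for "60 miles east of Los Angeles" can be illustrated with a minimal Python sketch. It is not the ArcGIS workflow used in this research: it assumes planar coordinates in miles, buffers the city centroid rather than the city polygon, and draws the sample points at random. The function names, the centroid coordinates, and the sample size are all hypothetical.

```python
import math
import random


def sample_points_in_buffer(center, radius, n, seed=0):
    """Draw n random points inside a circular buffer around `center`."""
    rng = random.Random(seed)
    cx, cy = center
    points = []
    while len(points) < n:
        # Rejection sampling: draw from the bounding square and
        # keep only the points that fall inside the circle.
        x = rng.uniform(cx - radius, cx + radius)
        y = rng.uniform(cy - radius, cy + radius)
        if math.hypot(x - cx, y - cy) <= radius:
            points.append((x, y))
    return points


def select_by_direction(points, center, direction):
    """Keep the points lying in the given cardinal direction from `center`."""
    cx, cy = center
    tests = {
        "east": lambda p: p[0] > cx,
        "west": lambda p: p[0] < cx,
        "north": lambda p: p[1] > cy,
        "south": lambda p: p[1] < cy,
    }
    keep = tests[direction]
    return [p for p in points if keep(p)]


# Hypothetical planar centroid for Los Angeles (assumption, units in miles).
los_angeles_centroid = (0.0, 0.0)
samples = sample_points_in_buffer(los_angeles_centroid, radius=60.0, n=500)
east_points = select_by_direction(samples, los_angeles_centroid, "east")
```

The same two steps apply to "Southern California" by sampling within the state extent and passing "south" as the direction.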

The weighting of this georeferenced information resulted in 2 third order spatial anchors ("Eisenhower Senior High School" and "Glen Helen Regional Park"), 3 second order spatial anchors ("Los Angeles", "Lytle Creek", and "San Bernardino National Forest"), and 2 modifiers that referenced vague locations ("Southern California" and "60 miles east of Los Angeles"). Once all point layers were weighted, they were merged to form a single weighted point layer that was used in the calculation of the event zone. Point density estimation was used to calculate the event zone (Figure 20).

The event zone generated for the forest fire article is displayed using five classes. This proved to be a better display method for interpreting point densities than using more classes, which draw users to areas of the event zone where detailed information (affected buildings and homes) was not provided. Unlike the event zone computed for the tornado article, the event zone computed from the forest fire article has two distinct areas of high point density values. This is attributed to how frequently the spatial anchors "Los Angeles" and "San Bernardino" are used in the news article to describe the spatial extent of the forest fire. The 2 third order spatial anchors are located within the high density cluster on the right side of the event zone, in the vicinity of the San Bernardino National Forest.

The event zone provides a vague representation of an event's extent based on the description in a web document. To determine whether the event zone corresponds to the spatial extent of the actual event, the event zone must be compared with known data depicting the event's true extent.
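
The merging of weighted point layers and the density calculation can be sketched as follows. This is a simplified, neighborhood-based density written in plain Python, not the GIS tool used in this research. The weight values 3, 2, and 1, the coordinates, the cell size, and the search radius are placeholders chosen for illustration only.

```python
import math


def merge_weighted_layers(layers):
    """Flatten [(weight, [(x, y), ...]), ...] into one list of (x, y, weight)."""
    return [(x, y, w) for w, pts in layers for (x, y) in pts]


def point_density(points, cell_center, radius):
    """Sum the weights within `radius` of a cell center, per unit area."""
    cx, cy = cell_center
    total = sum(w for x, y, w in points
                if math.hypot(x - cx, y - cy) <= radius)
    return total / (math.pi * radius ** 2)


def density_grid(points, xmin, ymin, cell_size, ncols, nrows, radius):
    """Evaluate the weighted density at the center of every grid cell."""
    grid = []
    for r in range(nrows):
        row = []
        for c in range(ncols):
            center = (xmin + (c + 0.5) * cell_size,
                      ymin + (r + 0.5) * cell_size)
            row.append(point_density(points, center, radius))
        grid.append(row)
    return grid


# Hypothetical layers: (assumed weight, assumed planar coordinates).
layers = [
    (3, [(50.0, 10.0), (52.0, 8.0)]),               # third order anchors
    (2, [(0.0, 0.0), (48.0, 12.0), (55.0, 15.0)]),  # second order anchors
    (1, [(20.0, -5.0), (30.0, 5.0)]),               # modifier sample points
]
points = merge_weighted_layers(layers)
grid = density_grid(points, xmin=-10.0, ymin=-20.0, cell_size=10.0,
                    ncols=8, nrows=5, radius=10.0)
```

Cells near several highly weighted points receive the largest values, which is why specific anchors such as schools and parks dominate the densest part of an event zone.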

Figure 20. Event zone depicting the forest fire in the news article.

Event Zone Validation

To examine how well the methods applied to the news article model and represent the spatial extent of the tornado event, a known track of the tornado, obtained from the National Oceanic and Atmospheric Administration's (NOAA) data repository website ( /shapepage.htm), is compared with the event zone to determine the degree of match between them. NOAA provides freely available spatial data for weather events and forecasting across the United States. The dataset used for validation is a set of tornado tracks that have been recorded between 1950 and The attributes for these tornado tracks include the location of the tornado's occurrence, the magnitude (Fujita scale), and the resulting impact measured by property destruction and deaths.

The tornado track obtained from NOAA shows a 22 mile long path across the state of Maryland, and 100% of the track is contained within the event zone. About 14% of the full 22 mile track matched a high concentration of points in the event zone. Overlaying the NOAA tornado track on the event zone shows that the track falls fully within the extent of the event zone (Figure 21). This shows that the area of points representing the most specific spatial information extracted from the news article corresponds to the actual trajectory of the tornado. The full extent of the event zone represents the area of impact described in the article. The areas of higher density can be used to indicate highlights of the article and areas that may have received significant damage.

While text descriptions will not always match measured occurrences of natural phenomena exactly, a close correspondence is expected, and from this analysis the methods developed for this research appear to employ and represent locations appropriately to capture events. The event zone would prove inadequate in representing the spatial extent of the text event if the event zone and the dataset used for comparison (the tornado track) were completely disjoint, or if the comparison dataset did not overlap any of the densely clustered areas within the event zone (the areas of most detail extracted from the document).

The creation of an event zone as a representation of an event depicted in a web document is useful for summarizing a document's contents in geographic space and for displaying the significant locations described in the article. The annotation and representation of this information can be used to help refine query methods and integrate vague spatial information into GIS applications.
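
The overlay comparison and the two inadequacy conditions can be expressed as a small check. The sketch below is illustrative only: it assumes the event zone is stored as a grid of density values in planar coordinates, approximates track length shares by sampling points along the track, and uses a hypothetical threshold to define "high density". None of the names or values come from the NOAA dataset or the GIS workflow used in this research.

```python
import math


def densify(track, step):
    """Return points spaced roughly `step` apart along a polyline."""
    pts = []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        n = max(1, int(round(math.hypot(x1 - x0, y1 - y0) / step)))
        for i in range(n):
            t = i / n
            pts.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    pts.append(track[-1])
    return pts


def cell_value(grid, xmin, ymin, cell_size, point):
    """Look up the density under a point; 0.0 outside the grid."""
    col = int((point[0] - xmin) // cell_size)
    row = int((point[1] - ymin) // cell_size)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return 0.0


def track_match(grid, xmin, ymin, cell_size, track, high_threshold, step=0.1):
    """Approximate the share of the track inside the zone and inside dense cells."""
    values = [cell_value(grid, xmin, ymin, cell_size, p)
              for p in densify(track, step)]
    in_zone = sum(v > 0 for v in values) / len(values)
    in_high = sum(v >= high_threshold for v in values) / len(values)
    return in_zone, in_high


def zone_is_adequate(in_zone, in_high):
    """Inadequate if the track is disjoint from the zone or misses all dense areas."""
    return in_zone > 0 and in_high > 0


# Hypothetical 3 x 4 event zone grid with 10-unit cells (assumed values).
zone = [
    [0.0, 0.1, 0.1, 0.0],
    [0.1, 0.5, 0.2, 0.1],
    [0.0, 0.1, 0.1, 0.0],
]
track = [(2.0, 15.0), (38.0, 15.0)]
in_zone, in_high = track_match(zone, 0.0, 0.0, 10.0, track, high_threshold=0.5)
adequate = zone_is_adequate(in_zone, in_high)

# A track far from the zone fails the first condition.
far_track = [(50.0, 50.0), (60.0, 60.0)]
far_adequate = zone_is_adequate(
    *track_match(zone, 0.0, 0.0, 10.0, far_track, high_threshold=0.5))
```

Here `in_zone` plays the role of the 100% containment figure and `in_high` the role of the 14% high-concentration figure reported for the tornado track.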

Figure 21. NOAA tornado track overlaid with the computed event zone.
