A Comparison of Approaches for Geospatial Entity Extraction from Wikipedia

Daryl Woodward, Jeremy Witmer, and Jugal Kalita
University of Colorado, Colorado Springs
Computer Science Department
1420 Austin Bluffs Pkwy, Colorado Springs, CO 80918
daryl.woodward@gmail.com, jeremy@geograkos.net, kalita@eas.uccs.edu

Abstract

In this paper we target the challenge of extracting geospatial data from the article text of the English Wikipedia. We present the results of a Hidden Markov Model (HMM) based approach to identifying location-related named entities in our corpus of Wikipedia articles, which are primarily about battles and wars due to their high geospatial content. The HMM NER process drives a geocoding and resolution process whose goal is to determine the correct coordinates for each place name (a task often referred to as grounding). We compare our results to a previously developed data structure and algorithm for disambiguating place names that can have multiple coordinates. We demonstrate an overall f-measure of 79.63% for identifying and geocoding place names. Finally, we compare the results of the HMM-driven process to earlier work using a Support Vector Machine.

I. Introduction

The amount of user-generated, unstructured content on the Internet increases significantly every day. Consequently, techniques to automatically extract information from unstructured text become increasingly important. A significant number of queries on the Internet target geospatial data, making this a productive area of study. This paper therefore focuses on approaches to extracting geospatial information from unstructured text. While Named Entity Recognition (NER) has seen much progress in recent years, NER for location names is only the first step in extracting geospatial information. The extracted place names must then be geocoded to at least a latitude and longitude coordinate pair to allow visualization, geospatial search, and information retrieval of text based on locations.
We refer to this process as geocoding and disambiguation; it is also often referred to as grounding. Our work in this paper has a very specific focus: to maximize the effectiveness of the initial geospatial NER that feeds our geocoding and disambiguation process. Increasing the accuracy of the raw NE data will improve our final results. Specifically, a more refined list of geospatial named entities (place names) extracted from unstructured text documents such as Wikipedia articles will aid in geocoding ambiguous names to the correct geospatial coordinates.

Our research in this area is motivated by a number of factors. The ongoing development of the Geografikos package aims at a software system that creates a database of Wikipedia articles in which each article has an associated structured set of geospatial entities extracted from its text. This database will allow geospatial querying, information retrieval, and geovisualization to be applied to the Wikipedia articles. Further, we wish to open-source the Geografikos software on completion.

Section 2 of the paper discusses relevant background research and information on the LingPipe library. Section 3 describes our process for choosing training data. Section 4 summarizes our method for generating the most accurate results. Section 5 briefly recaps the geocoding and disambiguation process. Section 6 discusses our results and compares them to the SVM used for the same process by Witmer and Kalita [1]. Finally, we conclude with a discussion of possibilities for future research.

II. Background Research

Named entity recognition refers to the extraction of words and strings of text within documents that represent discrete concepts, such as names and locations. The term Named Entity Recognition describes the operations in natural language processing that seek to extract the names of persons, organizations, locations, other proper nouns, and numeric terms such as percentages and dollar amounts. The term Named Entity was defined at MUC-6, sponsored by DARPA in 1996 [2]. The NE recognition task was further defined, and expanded to language independence, by the Conference on Natural Language Learning (CoNLL) shared tasks of 2002 and 2003.

Numerous approaches have been tried since MUC-6 to increase performance in NER, including Hidden Markov Models, Conditional Random Fields, Maximum Entropy models, Neural Networks, and Support Vector Machines (SVM). Dakka and Cucerzan demonstrated an SVM that achieves an f-measure of 0.954 for LOC entities in Wikipedia articles, and an f-measure of 0.884 across all NE classes [3]. Although research into text classification and NER has found that SVMs provide good performance on NER tasks, HMMs can produce similar results with minimal training. Hidden Markov Models (HMMs) have also shown excellent results. Klein et al. demonstrated that a character-level HMM can identify English and German named entities with f-measures of 0.899 and 0.735, respectively, for LOC entities in testing data [4]. In [5], Zhou and Su evaluated an HMM and HMM-based chunk tagger on the MUC-6 and MUC-7 English NE tasks, achieving f-measures of 0.966 and 0.941, respectively.

To compare against the SVM-based approach used by Witmer in [1], we chose the HMM implemented by the LingPipe library for NER, which, had it participated in the CoNLL 2002 NER task, would have tied for fourth place with an f-measure of 0.766 (http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html). LingPipe was also convenient: it is a fully developed Java package that integrated easily into our existing code. LingPipe identifies itself as a suite of Java libraries for the linguistic analysis of human language, providing tools for information extraction and data mining (http://alias-i.com/lingpipe/).

III. Training and Test Corpus Generation

A. Training

We downloaded a number of previously tagged data sets to provide the training material for the HMM, along with other resources that required hand-tagging. We narrowed our training data to:

- the CoNLL 2003 shared task dataset on multi-language NE tagging (http://www.cnts.ua.ac.be/conll2003/ner/), containing tagged named entities for PER, LOC, and ORG in English;
- the CoNLL 2004 shared task dataset on Semantic Role Labeling, tagged for English LOC NEs;
- hand-tagged articles from the English Wikipedia, downloaded June 18, 2008.

The CoNLL datasets were chosen for their high quality and because they had been previously tagged for all NEs. We combined the CoNLL data with articles from the English Wikipedia in which all LOC NEs were hand-tagged. The articles focused on battles and wars, with high frequencies of geospatial entities. An HMM was then generated for various combinations of the listed corpora, and short simulations were conducted for each of the models. The inclusion of the hand-tagged data proved to have the greatest effect on the results: the average difference in f-measure between training corpora that did and did not include the hand-tagged data was about 1.7%, in favor of including it. Our final corpus comprised the CoNLL 2003 and 2004 datasets and the hand-tagged data from Wikipedia. All named entity tags were preserved from the CoNLL datasets, but only locations were tagged in the Wikipedia articles.

To put into perspective the degree of accuracy required in this NER task, simple statistics were gathered from the nine hand-tagged articles used for training. Only 2,191 of 61,708 total words were part of a location string (3.55%). Given the sparse occurrence of these geospatial entities, the addition of these location-weighted and subject-similar articles significantly improved results.
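The location-density statistic above is easy to reproduce. The following is a minimal Python sketch (our package is in Java; the two-column BIO format and tag names follow the CoNLL 2003 convention, and the sample text is illustrative, not from the corpus):

```python
# Sketch: measuring location-token density in a CoNLL-style corpus.
# Assumes two-column "token TAG" lines with BIO tags (B-LOC / I-LOC).
sample = """\
Lee B-PER
fought O
near O
Fredericksburg B-LOC
, O
Virginia B-LOC
. O
"""

def location_density(conll_text: str) -> float:
    """Return the fraction of tokens tagged as part of a LOC entity."""
    total = loc = 0
    for line in conll_text.splitlines():
        if not line.strip():
            continue  # blank line = sentence boundary
        token, tag = line.rsplit(None, 1)
        total += 1
        if tag.endswith("-LOC"):
            loc += 1
    return loc / total

print(round(location_density(sample), 3))  # 2 of 7 tokens

# The paper's corpus-level figure: 2,191 location words of 61,708 total
print(round(2191 / 61708 * 100, 2))  # 3.55%
```

Run over the nine hand-tagged articles, this kind of count yields the 3.55% density quoted above.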
Based on our analysis of the CoNLL 2003 dataset, 7,893 of its 32,588 NEs (169,032 total words in 947 articles) were locations. The CoNLL 2004 data did not contain article separators, but 3,347 of its 16,308 NEs (176,920 total words) were locations. Only about half of the locations in the hand-tagged articles were absent from the CoNLL datasets, so only 1,168 genuinely new locations were added by including the nine hand-tagged articles.

B. Testing

For primary testing, 21 Wikipedia articles (171,232 total words) were selected from the list of 90 articles processed by the SVM in [1]. These specific articles were chosen because they were used as primary examples in Witmer's previous work. The articles were preprocessed and found to have a variety of lengths and location frequencies while remaining suitable for statistical analysis; in particular, none had so few locations that including them in the statistics without some form of normalization would skew the results. The test articles shared the topic of historical battles and wars with the training articles. The articles used for training made up only about 15% of the final training set. Currently, this set of Wikipedia articles is the only corpus chosen for testing. In the future, we may expand our corpus to include news articles or other informational online resources that are also expected to contain geospatial content.

IV. Method

LingPipe offers various output formats along with different named entity recognizers, which vary in accuracy and efficiency. We chose the CharLMRescoringChunker, which is described as the most accurate, but also the slowest, chunker (http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html). This suited the Geografikos package, since the geospatial information associated with each article is processed only once per article. The NER process is also significantly faster than the geocoding and disambiguation process, so NER speed was not an issue.

Table I is an example of LingPipe's confidence named entity chunking, which returns a list of the most confident results, including each string, where it can be found, what type of chunk it may be, and the confidence that the string is correctly typed. The four types of chunks the model can be trained to identify are PER, ORG, LOC, and MISC; text not identified as one of these chunks is labeled O. Based on a manual review of sentences such as these, we predicted that a confidence threshold of 1.1 would provide the best balance between false positives and correct identification. This threshold was set as a parameter within the Geografikos package as it processes results returned by the HMM. Tests were conducted with thresholds from 1.0 to 1.5 in 0.1 increments. A direct correlation emerged between the threshold value and the precision of the final results, and an inverse correlation between precision and recall. The highest f-measure was achieved with a threshold of 1.1, which coincided with our initial prediction.

V. Geospatial Named Entity Resolution

The HMM extracted a set of candidate geospatial NEs from the article text.
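The confidence filtering that produces this candidate set (Section IV) can be sketched as follows. LingPipe itself is a Java library; this Python sketch only illustrates the thresholding logic, using the scores from Table I as sample data:

```python
# Sketch of confidence-threshold filtering over candidate chunks.
# (confidence, type, phrase) triples taken from Table I; 1.1 is the
# threshold the paper settles on.
candidates = [
    (2.0000, "LOC", "Fredericksburg"),
    (1.4661, "LOC", "North Anna River."),
    (1.2874, "LOC", "Anna River."),
    (1.0745, "LOC", "North"),
    (1.0560, "LOC", "River."),
    (1.0134, "ORG", "North Anna"),
    (1.0074, "LOC", "North Anna"),
    (1.0053, "PER", "Lee"),
]

def filter_candidates(chunks, threshold=1.1, keep_type="LOC"):
    """Keep LOC chunks whose confidence meets the threshold."""
    return [phrase for conf, ctype, phrase in chunks
            if ctype == keep_type and conf >= threshold]

print(filter_candidates(candidates))
# ['Fredericksburg', 'North Anna River.', 'Anna River.']
```

Lowering the threshold toward 1.0 admits more LOC chunks (higher recall, lower precision), which reproduces the direct and inverse correlations observed in the 1.0 to 1.5 sweep.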
For each candidate string, the second objective was to decide whether it was truly a geospatial NE, and to determine the correct (latitude, longitude), or (φ, λ), coordinate pair for the place name in the context of the article. To resolve a candidate NE, a lookup was made using Google Geocoder (http://code.google.com/apis/maps/documentation/geocoding/index.html). If the entity reference resolved to a single geospatial location, no further action was required. Otherwise, the context of the place name in the article, a data structure, and a rule-driven algorithm were used to decide the correct spatial location for the place name.

Our disambiguation task is close to word sense disambiguation as defined by Cucerzan [6], except that we consider the geospatial context and domain instead of the lexical context and domain. We refer to this as geospatial entity resolution; it has also been called grounding a place name [7]. Sehgal et al. demonstrate good results for geospatial entity resolution using both spatial (coordinate) and non-spatial (lexical) features of the geospatial entities [8]. Zong et al. demonstrated a rule-based method for place name assignment, achieving a precision of 88.6% when disambiguating place names in the United States from the Digital Library for Earth System Education (DLESE) metadata [9]. While our approach is related to the work of Martins et al. [10], which uses an HMM for NER and then resolves geospatial coordinates through a geocoder as a second step, we drew on the work done by Witmer in [1], using a novel location tree data structure and algorithm to disambiguate and geocode the place names.

A. Google Geocoder

We used Google Geocoder as the gazetteer and geocoder for simplicity, as much research has already been done in this area. [11] and [12] provide an excellent overview of existing technology.
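Before turning to the geocoder details, the context-driven resolution idea can be illustrated with a simplified sketch. This is not the location tree of [1], only a toy stand-in: each candidate interpretation carries an administrative hierarchy path, and the interpretation whose ancestors are mentioned elsewhere in the article wins. All place data here is made up for illustration:

```python
# Toy sketch of context-based place-name disambiguation (NOT the actual
# Geografikos location-tree algorithm; the hierarchy data is illustrative).
from typing import List, Tuple

# Each interpretation: (hierarchy path from country down, (lat, lon))
CANDIDATES = {
    "Springfield": [
        (("United States", "Illinois", "Springfield"), (39.80, -89.64)),
        (("United States", "Massachusetts", "Springfield"), (42.10, -72.59)),
    ],
}

def resolve(name: str, article_places: List[str]) -> Tuple[float, float]:
    """Pick the interpretation whose ancestors co-occur most in the article."""
    context = set(article_places)
    def score(interp):
        path, _coords = interp
        return sum(1 for ancestor in path[:-1] if ancestor in context)
    best = max(CANDIDATES[name], key=score)
    return best[1]

# An article that also mentions "Illinois" grounds Springfield there.
print(resolve("Springfield", ["Illinois", "Chicago"]))  # (39.8, -89.64)
```

The real algorithm additionally uses the geocoder's placemark results and rule-driven tie-breaking, but the principle is the same: sibling place names in the article vote for one interpretation over another.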
Google Geocoder provides a simple REST-based interface that can be queried over HTTP, returning data in a variety of formats. This architecture allows developers to manage their user-facing interfaces however they wish; only the client-server interaction must stay consistent. It also enables adding layers to a process, allowing many lower-level layers to share caches and so reduce server interaction. For each geospatial NE string submitted as a query, Google Geocoder returns zero or more placemarks.

VI. Results

In this section, we compare the overall results of the HMM-driven NER and disambiguation with those of the SVM-driven NER and disambiguation presented in [13] and [1]. Table II compares the final results of the HMM with those of the SVM, based on the processing of 21 articles from Wikipedia. The Resolved Results in Table II show the performance of our package in correctly identifying location strings and geocoding the locations; the NER Results show the accuracy of the two NERs before any further processing. A string correctly identified by the NER process is one that exactly matched a hand-tagged ground truth named entity. Geocoding success means taking the string and correctly resolving it to a single location in the context of the document.

Table I. LingPipe output for the sentence: "Lee at first anticipated that he would fight Burnside northwest of Fredericksburg and that it might be necessary to drop back behind the North Anna River."

Rank  Conf    Span        Type  Phrase
0     2.0000  (67, 81)    LOC   Fredericksburg
1     1.4661  (137, 154)  LOC   North Anna River.
2     1.2874  (143, 154)  LOC   Anna River.
3     1.0745  (137, 142)  LOC   North
4     1.0560  (148, 154)  LOC   River.
5     1.0134  (137, 147)  ORG   North Anna
6     1.0074  (137, 147)  LOC   North Anna
7     1.0053  (0, 3)      PER   Lee

Table II. Resolved Geospatial NE Results

                      Precision  Recall  F-Measure
HMM NER Results       0.489      0.615   0.523
SVM NER Results       0.510      0.998   0.675
HMM Resolved Results  0.878      0.734   0.796
SVM Resolved Results  0.869      0.789   0.820

Although the NER results look significantly lower for the HMM, it should be noted that the median f-measure over the collected data was 0.646. A handful of articles with foreign names, such as the article on the Korean War, brought down the average, with f-measures around only 10%. This is most likely because our training data contained a limited number of foreign names, so the HMM had trouble recognizing these strings as LOC named entities. Figure 1 shows a more detailed breakdown of the precision, recall, and f-measure for a subset of the articles processed. The results in this chart show the same trend as Table III: the Geografikos package was generally able to extract locations with higher performance from articles containing a higher percentage of unambiguous locations. The lowest scoring articles, the leftmost three in Figure 1, all describe engagements in the American Civil War.
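The f-measure reported throughout is the harmonic mean of precision and recall. A quick sketch (the corpus-level SVM NER row of Table II is reproduced exactly; other rows may be averaged per article, so they differ slightly):

```python
# F-measure as the harmonic mean of precision and recall.
def f_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# SVM NER row of Table II: P = 0.510, R = 0.998
print(round(f_measure(0.510, 0.998), 3))  # 0.675, matching Table II
```

The harmonic mean penalizes imbalance, which is why the SVM's near-perfect recall still yields a modest 0.675 against its 0.510 precision.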
Table III, from [14], shows that North and Central America have a much larger percentage of ambiguous place names than other parts of the world. The highest scoring articles (the rightmost three in Figure 1) focus on engagements on other continents.

Table III. Places With Multiple Names and Names Applied to More Than One Place in the Getty Thesaurus of Geographic Names

Continent                % places with      % names with
                         multiple names     multiple places
North & Central America  11.5               57.1
Oceania                  6.9                29.2
South America            11.6               25.0
Asia                     32.7               20.3
Africa                   27.0               18.2
Europe                   18.2               16.6

A. Analysis

The SVM used in [1] performed independently of article length and place-name frequency. This independence did not carry over to the HMM. Although the two methods share the same disambiguation process, the initial phase of named entity extraction is the main influence on final results and the focus of this paper. The SVM was tuned to produce very high recall by extracting a large number of potential NEs. Ultimately, all of these names were fed into Google Geocoder, which identified actual places. The geocoding process counteracted the large amount of extra extraction from the initial phase and protected precision by accepting only potential NEs that successfully geocoded. However, this generated significantly more traffic to the geocoder than the HMM-based NER process. The HMM focused on decreasing the number of geocoder queries while maintaining overall performance.

Table IV shows the decrease in the number of candidate location NEs extracted by the HMM compared to the SVM for some of the articles in the test corpus. It also shows the number of these NEs that successfully geocoded and were disambiguated. The HMM identified significantly fewer potential NEs in the initial phase, resulting in generally lower recall but higher precision in the final geocoding and disambiguation process.

Figure 1. Lowest 3 and Highest 3 Scoring Articles of HMM

The results for a selection of articles are shown in Table IV. Although the HMM often extracted fewer than half as many potential NEs as the SVM, the final results came out similar. The HMM demonstrates better performance than the SVM on longer articles, and worse performance on shorter articles. The f-measures for some of these articles are pictured in Figure 1, in which the three lowest and three highest scoring articles of the HMM's results are shown side by side. For the HMM, the lowest scoring articles were those with the fewest potential NEs in the text, and the highest scoring had the most.

Overall, the HMM-driven process showed an f-measure 2.4% lower than the SVM-driven process on the same testing corpus of Wikipedia articles. However, the HMM-driven process demonstrated various overall improvements that balance out these results. First, the time required to generate the model for the HMM was under four minutes, while it took about 36 hours to train the SVM on a similar system. This vast decrease in training time allows much greater flexibility in changing and expanding the training corpus to adjust the model for greater performance. Second, the HMM-driven process reduced the number of candidate NEs by over 50% in most cases, reducing the time spent in the geocoding and disambiguation phase. For both the SVM- and HMM-driven approaches, most processing time is spent in that phase, so streamlining the NER phase multiplies the decrease in time spent processing each article.

VII. Conclusions and Future Work

By continuing to enhance the efficiency of the Geografikos package, we both increase the value of the output results and make the package more suitable for heavy public use over the Internet. It also supports our ultimate goal of making this code open source.
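The geocoder-traffic reduction claimed in the analysis above can be checked directly against the extraction counts reported in Table IV:

```python
# Candidate NEs submitted to the geocoder per article (from Table IV).
extracted = {
    "Chancellorsville": (119, 566),   # (HMM, SVM)
    "Gettysburg":       (327, 1209),
    "Korean War":       (625, 2167),
    "War of 1812":      (752, 2173),
    "World War 2":      (668, 1641),
}

for article, (hmm, svm) in extracted.items():
    reduction = 1 - hmm / svm
    print(f"{article}: {reduction:.0%} fewer geocoder queries")
# Every listed article shows a reduction of well over 50%, consistent
# with the claim in the text.
```

Since geocoding and disambiguation dominate per-article processing time, these reductions translate almost directly into end-to-end speedups.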
We envision a number of uses for this package in the search and visualization of Wikipedia articles. With the geospatial-specific information, searches for Wikipedia articles could be filtered by geographic area through a two-step search: first, a standard free-text search on Wikipedia for articles about the topic; then, filtering that list to the articles with appropriate locations. Reversing this paradigm, the location data provided by the Geografikos package could also allow location-centric search. If users want information on a particular region of the world, they could select that location on a map interface and be shown articles that reference it, along with excerpts of the text concerning the region.

Furthermore, this database of locations could enable the visualization of Wikipedia articles through a geospatial interface. For instance, consider an interface that lets a user select a Wikipedia article and then presents a map of all the locations from the article. Each location on the map would be clickable, providing the sentences or paragraph around that NE from the text. Imagine putting World War II into this interface and being presented with a map of Europe, Africa, and the Pacific theater, with all the locations from the article marked and clickable. This kind of visualization would be an excellent teaching tool, and could reveal implicit information and relationships that are not apparent from the text of the article.

Table IV. Hand-Tagged Articles: Potential Location NEs

Article           HMM Extracted    SVM Extracted    HMM        SVM        HMM      SVM
                  Potential NEs /  Potential NEs /  Precision  Precision  Recall   Recall
                  Grounded NEs     Grounded NEs
Chancellorsville  119/47           566/75           0.7015     0.8621     0.4947   0.7895
Gettysburg        327/115          1209/117         0.7718     0.7267     0.6319   0.6429
Korean War        625/328          2167/331         0.9371     0.6910     0.8700   0.8780
War of 1812       752/384          2173/408         0.9165     0.8518     0.7370   0.7831
World War 2       668/464          1641/448         0.9915     0.9124     0.8609   0.8312

Applied to other corpora, this kind of information could also be very useful in finding geospatial trends and relationships. For instance, consider a database of textual disease outbreak reports or world news articles: the Geografikos package could extract all the locations, allowing graphical presentation on a map and making trends much easier to find. With additional work, the geospatial data extracted by the Geografikos package could be combined with temporal information to allow geographic and temporal refinement. While many such visualization tools already exist, they are driven by structured databases, not by free-text document sets.

Our contribution in this paper is demonstrating improvements to the process originally laid out in [1], which extracts location names from a given text and grounds them to a single, disambiguated geospatial entity. By applying an HMM to this process, we increase the flexibility and speed of its NER phase. Together with the data structure and algorithm for resolving ambiguous geospatial NEs based on article context, this opens up possibilities for increased capability in geospatial information retrieval, provided by associating a structured list of geospatial entities with a free-text corpus.
Credits

The work reported in this paper is partially supported by the NSF Research Experience for Undergraduates Grant ARRA::CNS 0851783.

References

[1] J. Witmer and J. Kalita, "Extracting geospatial entities from Wikipedia," IEEE International Conference on Semantic Computing, pp. 450-457, 2009.
[2] R. Grishman and B. Sundheim, "Message Understanding Conference-6: A brief history," in ICCL. ACL, 1996, pp. 466-471.
[3] W. Dakka and S. Cucerzan, "Augmenting Wikipedia with named entity tags," IJCNLP, 2008.
[4] D. Klein, J. Smarr, H. Nguyen, and C. D. Manning, "Named entity recognition with character-level models," in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. Morristown, NJ, USA: Association for Computational Linguistics, 2003, pp. 180-183.
[5] G. Zhou and J. Su, "Named entity recognition using an HMM-based chunk tagger," in Proc. 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), 2002.
[6] S. Cucerzan, "Large-scale named entity disambiguation based on Wikipedia data," EMNLP, 2007.
[7] J. Leidner, G. Sinclair, and B. Webber, "Grounding spatial named entities for information extraction and question answering," in Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References, 2003, pp. 31-38.
[8] V. Sehgal, L. Getoor, and P. Viechnicki, "Entity resolution in geospatial data integration," in ACM Int. Symp. on Advances in GIS. ACM, 2006, pp. 83-90.
[9] W. Zong, D. Wu, A. Sun, E. Lim, and D. Goh, "On assigning place names to geography related web pages," in ACM/IEEE-CS Joint Conf. on Digital Libraries. ACM, 2005, pp. 354-362.
[10] B. Martins, H. Manguinhas, and J. Borbinha, "Extracting and exploring the geo-temporal semantics of textual resources," in IEEE ICSC, 2008, pp. 1-9.
[11] Ø. Vestavik, "Geographic Information Retrieval: An Overview," 2003.
[12] T. D'Roza and G. Bilchev, "An overview of location-based services," BT Technology Journal, vol. 21, no. 1, pp. 20-27, 2003.
[13] J. Witmer and J. Kalita, "Mining Wikipedia article clusters for geospatial entities and relationships," Papers from the AAAI Spring Symposium, Technical Report SS-09-08, 2009.
[14] D. Smith and G. Crane, "Disambiguating geographic names in a historical digital library," Lecture Notes in Computer Science, pp. 127-136, 2001.