A Comparison of Approaches for Geospatial Entity Extraction from Wikipedia

Daryl Woodward, Jeremy Witmer, and Jugal Kalita
University of Colorado, Colorado Springs
Computer Science Department
1420 Austin Bluffs Pkwy, Colorado Springs, CO 80918
daryl.woodward@gmail.com, jeremy@geograkos.net, kalita@eas.uccs.edu

Abstract

In this paper we target the challenge of extracting geospatial data from the article text of the English Wikipedia. We present the results of a Hidden Markov Model (HMM) based approach to identifying location-related named entities in our corpus of Wikipedia articles, which are primarily about battles and wars due to their high geospatial content. The HMM NER process drives a geocoding and resolution process whose goal is to determine the correct coordinates for each place name (a task often referred to as grounding). We compare our results to a previously developed data structure and algorithm for disambiguating place names that can have multiple coordinates. We demonstrate an overall f-measure of 79.63% for identifying and geocoding place names. Finally, we compare the results of the HMM-driven process to earlier work using a Support Vector Machine.

I. Introduction

The amount of user-generated, unstructured content on the Internet increases significantly every day. Consequently, techniques to automatically extract information from unstructured text become increasingly important. A significant number of queries on the Internet target geospatial data, making this a productive area of study. This paper therefore focuses on approaches to extracting geospatial information from unstructured text. While Named Entity Recognition (NER) has seen much progress in recent years, NER for location names is only the first step in extracting geospatial information. The extracted place names must then be geocoded to at least a latitude and longitude coordinate pair to allow visualization, geospatial search, and information retrieval of text based on locations.
We refer to this process as geocoding and disambiguation; it is also often referred to as grounding. Our work in this paper has a very specific focus: to maximize the effectiveness of the initial geospatial NER that feeds our geocoding and disambiguation process. Increasing the accuracy of the raw NE data will improve our final results. Specifically, a more refined list of geospatial named entities (place names) extracted from unstructured text documents such as Wikipedia articles will aid in geocoding ambiguous names to the correct geospatial coordinates.

Our research in this area is motivated by a number of factors. The ongoing development of the Geografikos package aims at a software system that creates a database of Wikipedia articles in which each article has an associated structured set of geospatial entities extracted from its text. This database will allow geospatial querying, information retrieval, and geovisualization to be applied to the Wikipedia articles. Further, we wish to open-source the Geografikos software on completion.

Section 2 of the paper discusses relevant background research and information on the LingPipe library. Section 3 describes our process for choosing training data. Section 4 summarizes our method for generating the most accurate results. Section 5 briefly recaps the geocoding and disambiguation process. Section 6 discusses our results and compares them to the SVM used for the same process by Witmer and Kalita [1]. Finally, we conclude with a discussion of possibilities for future research.

II. Background Research

Named entity recognition refers to the extraction of words and strings of text within documents that represent discrete concepts, such as names and locations. The term Named Entity Recognition describes the operations in natural language processing that seek to extract the names of persons, organizations, locations, other proper nouns, and numeric terms such as percentages and dollar amounts. The term Named Entity was defined at MUC-6, sponsored by DARPA in 1996 [2]. The NE recognition task was further defined, and expanded to language independence, by the Conference on Natural Language Learning (CoNLL) shared tasks of 2002 and 2003.

Numerous approaches have been tried since MUC-6 to increase performance in NER, including Hidden Markov Models, Conditional Random Fields, Maximum Entropy models, Neural Networks, and Support Vector Machines (SVM). Dakka and Cucerzan demonstrated an SVM that achieves an f-measure of 0.954 for LOC entities in Wikipedia articles, and an f-measure of 0.884 across all NE classes [3]. Although research into text classification and NER has found that SVMs provide good performance on NER tasks, HMMs can produce similar results with minimal training. Hidden Markov Models (HMMs) have also shown excellent results. Klein et al. demonstrated that a character-level HMM can identify English and German named entities with f-measures of 0.899 and 0.735, respectively, for LOC entities in testing data [4]. In [5], Zhou and Su evaluated an HMM and HMM-based chunk tagger on the MUC-6 and MUC-7 English NE tasks, achieving f-measures of 0.966 and 0.941, respectively.

To compare against the SVM-based approach used by Witmer in [1], we chose the HMM implemented by the LingPipe library for NER, which, had it participated in the CoNLL 2002 NER task, would have tied for fourth place with an f-measure of 0.766 (http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html). LingPipe was also convenient: it is a fully developed Java package that integrated easily into our existing code. LingPipe identifies itself as a suite of Java libraries for the linguistic analysis of human language, providing tools for information extraction and data mining (http://alias-i.com/lingpipe/).

III. Training and Test Corpus Generation

A. Training

We downloaded a number of previously tagged data sets to provide the training material for the HMM, along with other resources that required hand-tagging. We narrowed our training data to:

- the CoNLL 2003 shared task dataset on multi-language NE tagging (http://www.cnts.ua.ac.be/conll2003/ner/), containing tagged named entities for PER, LOC, and ORG in English;
- the CoNLL 2004 shared task dataset on Semantic Role Labeling, tagged for English LOC NEs;
- hand-tagged articles from the English Wikipedia, downloaded June 18, 2008.

The CoNLL datasets were chosen for their high quality and because they had been previously tagged for all NEs. We combined the CoNLL data with articles from the English Wikipedia in which all LOC NEs were hand-tagged. The articles focused on battles and wars, with high frequencies of geospatial entities. An HMM was then generated for various combinations of the listed corpora, and short simulations were conducted for each of the models. The inclusion of the hand-tagged data proved to have the greatest effect on the results: the average difference in f-measure between training corpora that did and did not include the hand-tagged data was about 1.7%, in favor of including it. Our final corpus comprised the CoNLL 2003 and 2004 datasets and the hand-tagged data from Wikipedia. All named entity tags were preserved from the CoNLL datasets, but only locations were tagged in the Wikipedia articles.

To put into perspective the degree of accuracy required in this NER task, simple statistics were gathered from the nine hand-tagged articles used for training. Only 2,191 of 61,708 total words were part of a location string (3.55%). Given the sparse occurrence of these geospatial entities, the addition of these location-weighted and subject-similar articles significantly improved results.
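The location-density statistic above is easy to reproduce. The following is a minimal Python sketch (our package is in Java; the two-column BIO format and tag names follow the CoNLL 2003 convention, and the sample text is illustrative, not from the corpus):

```python
# Sketch: measuring location-token density in a CoNLL-style corpus.
# Assumes two-column "token TAG" lines with BIO tags (B-LOC / I-LOC).
sample = """\
Lee B-PER
fought O
near O
Fredericksburg B-LOC
, O
Virginia B-LOC
. O
"""

def location_density(conll_text: str) -> float:
    """Return the fraction of tokens tagged as part of a LOC entity."""
    total = loc = 0
    for line in conll_text.splitlines():
        if not line.strip():
            continue  # blank line = sentence boundary
        token, tag = line.rsplit(None, 1)
        total += 1
        if tag.endswith("-LOC"):
            loc += 1
    return loc / total

print(round(location_density(sample), 3))  # 2 of 7 tokens

# The paper's corpus-level figure: 2,191 location words of 61,708 total
print(round(2191 / 61708 * 100, 2))  # 3.55%
```

Run over the nine hand-tagged articles, this kind of count yields the 3.55% density quoted above.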
Based on our analysis of the CoNLL 2003 dataset, 7,893 of its 32,588 NEs (169,032 total words in 947 articles) were locations. The CoNLL 2004 data did not contain article separators, but 3,347 of its 16,308 NEs (176,920 total words) were locations. Only about half of the locations in the hand-tagged articles were absent from the CoNLL datasets, so only 1,168 genuinely new locations were added by including the nine hand-tagged articles.

B. Testing

For primary testing, 21 Wikipedia articles (171,232 total words) were selected from the list of 90 articles processed by the SVM in [1]. These specific articles were chosen because they were used as primary examples in Witmer's previous work. The articles were preprocessed and found to have a variety of lengths and location frequencies while remaining suitable for statistical analysis; in particular, none had so few locations that including them in the statistics without some form of normalization would skew the results. The test articles shared the topic of historical battles and wars with the training articles. The articles used for training made up only about 15% of the final training set. Currently, this set of Wikipedia articles is the only corpus chosen for testing. In the future, we may expand our corpus to include news articles or other informational online resources that are also expected to contain geospatial content.

IV. Method

LingPipe offers various output formats along with different named entity recognizers, which vary in accuracy and efficiency. We chose the CharLMRescoringChunker, which is described as the most accurate, but also the slowest, chunker (http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html). This suited the Geografikos package, since the geospatial information associated with each article is processed only once per article. The NER process is also significantly faster than the geocoding and disambiguation process, so NER speed was not an issue.

Table I is an example of LingPipe's confidence named entity chunking, which returns a list of the most confident results, including each string, where it can be found, what type of chunk it may be, and the confidence that the string is correctly typed. The four types of chunks the model can be trained to identify are PER, ORG, LOC, and MISC; text not identified as one of these chunks is labeled O. Based on a manual review of sentences such as these, we predicted that a confidence threshold of 1.1 would provide the best balance between false positives and correct identification. This threshold was set as a parameter within the Geografikos package as it processes results returned by the HMM. Tests were conducted with thresholds from 1.0 to 1.5 in 0.1 increments. A direct correlation emerged between the threshold value and the precision of the final results, and an inverse correlation between precision and recall. The highest f-measure was achieved with a threshold of 1.1, which coincided with our initial prediction.

V. Geospatial Named Entity Resolution

The HMM extracted a set of candidate geospatial NEs from the article text.
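The confidence filtering that produces this candidate set (Section IV) can be sketched as follows. LingPipe itself is a Java library; this Python sketch only illustrates the thresholding logic, using the scores from Table I as sample data:

```python
# Sketch of confidence-threshold filtering over candidate chunks.
# (confidence, type, phrase) triples taken from Table I; 1.1 is the
# threshold the paper settles on.
candidates = [
    (2.0000, "LOC", "Fredericksburg"),
    (1.4661, "LOC", "North Anna River."),
    (1.2874, "LOC", "Anna River."),
    (1.0745, "LOC", "North"),
    (1.0560, "LOC", "River."),
    (1.0134, "ORG", "North Anna"),
    (1.0074, "LOC", "North Anna"),
    (1.0053, "PER", "Lee"),
]

def filter_candidates(chunks, threshold=1.1, keep_type="LOC"):
    """Keep LOC chunks whose confidence meets the threshold."""
    return [phrase for conf, ctype, phrase in chunks
            if ctype == keep_type and conf >= threshold]

print(filter_candidates(candidates))
# ['Fredericksburg', 'North Anna River.', 'Anna River.']
```

Lowering the threshold toward 1.0 admits more LOC chunks (higher recall, lower precision), which reproduces the direct and inverse correlations observed in the 1.0 to 1.5 sweep.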
For each candidate string, the second objective was to decide whether it was truly a geospatial NE, and to determine the correct (latitude, longitude), or (φ, λ), coordinate pair for the place name in the context of the article. To resolve a candidate NE, a lookup was made using Google Geocoder (http://code.google.com/apis/maps/documentation/geocoding/index.html). If the entity reference resolved to a single geospatial location, no further action was required. Otherwise, the context of the place name in the article, a data structure, and a rule-driven algorithm were used to decide the correct spatial location for the place name.

Our disambiguation task is close to word sense disambiguation as defined by Cucerzan [6], except that we consider the geospatial context and domain instead of the lexical context and domain. We refer to this as geospatial entity resolution; it has also been called grounding a place name [7]. Sehgal et al. demonstrate good results for geospatial entity resolution using both spatial (coordinate) and non-spatial (lexical) features of the geospatial entities [8]. Zong et al. demonstrated a rule-based method for place name assignment, achieving a precision of 88.6% when disambiguating place names in the United States from the Digital Library for Earth System Education (DLESE) metadata [9]. While our approach is related to the work of Martins et al. [10], which uses an HMM for NER and then resolves geospatial coordinates through a geocoder as a second step, we drew on the work done by Witmer in [1], using a novel location tree data structure and algorithm to disambiguate and geocode the place names.

A. Google Geocoder

We used Google Geocoder as the gazetteer and geocoder for simplicity, as much research has already been done in this area. [11] and [12] provide an excellent overview of existing technology.
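Before turning to the geocoder details, the context-driven resolution idea can be illustrated with a simplified sketch. This is not the location tree of [1], only a toy stand-in: each candidate interpretation carries an administrative hierarchy path, and the interpretation whose ancestors are mentioned elsewhere in the article wins. All place data here is made up for illustration:

```python
# Toy sketch of context-based place-name disambiguation (NOT the actual
# Geografikos location-tree algorithm; the hierarchy data is illustrative).
from typing import List, Tuple

# Each interpretation: (hierarchy path from country down, (lat, lon))
CANDIDATES = {
    "Springfield": [
        (("United States", "Illinois", "Springfield"), (39.80, -89.64)),
        (("United States", "Massachusetts", "Springfield"), (42.10, -72.59)),
    ],
}

def resolve(name: str, article_places: List[str]) -> Tuple[float, float]:
    """Pick the interpretation whose ancestors co-occur most in the article."""
    context = set(article_places)
    def score(interp):
        path, _coords = interp
        return sum(1 for ancestor in path[:-1] if ancestor in context)
    best = max(CANDIDATES[name], key=score)
    return best[1]

# An article that also mentions "Illinois" grounds Springfield there.
print(resolve("Springfield", ["Illinois", "Chicago"]))  # (39.8, -89.64)
```

The real algorithm additionally uses the geocoder's placemark results and rule-driven tie-breaking, but the principle is the same: sibling place names in the article vote for one interpretation over another.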
Google Geocoder provides a simple REST-based interface that can be queried over HTTP, returning data in a variety of formats. This architecture allows developers to manage their user-facing interfaces however they wish; only the client-server interaction must stay consistent. It also enables adding layers to a process, allowing many lower-level layers to share caches and so reduce server interaction. For each geospatial NE string submitted as a query, Google Geocoder returns zero or more placemarks.

VI. Results

In this section, we compare the overall results of the HMM-driven NER and disambiguation with those of the SVM-driven NER and disambiguation presented in [13] and [1]. Table II compares the final results of the HMM with those of the SVM, based on the processing of 21 articles from Wikipedia. The Resolved Results in Table II show the performance of our package in correctly identifying location strings and geocoding the locations; the NER Results show the accuracy of the two NERs before any further processing. A string correctly identified by the NER process is one that exactly matched a hand-tagged ground truth named entity. Geocoding success means taking the string and correctly resolving it to a single location in the context of the document.

Table I. LingPipe output for the sentence: "Lee at first anticipated that he would fight Burnside northwest of Fredericksburg and that it might be necessary to drop back behind the North Anna River."

Rank  Conf    Span        Type  Phrase
0     2.0000  (67, 81)    LOC   Fredericksburg
1     1.4661  (137, 154)  LOC   North Anna River.
2     1.2874  (143, 154)  LOC   Anna River.
3     1.0745  (137, 142)  LOC   North
4     1.0560  (148, 154)  LOC   River.
5     1.0134  (137, 147)  ORG   North Anna
6     1.0074  (137, 147)  LOC   North Anna
7     1.0053  (0, 3)      PER   Lee

Table II. Resolved Geospatial NE Results

                      Precision  Recall  F-Measure
HMM NER Results       0.489      0.615   0.523
SVM NER Results       0.510      0.998   0.675
HMM Resolved Results  0.878      0.734   0.796
SVM Resolved Results  0.869      0.789   0.820

Although the NER results look significantly lower for the HMM, it should be noted that the median f-measure over the collected data was 0.646. A handful of articles with foreign names, such as the article on the Korean War, brought down the average, with f-measures around only 10%. This is most likely because our training data contained a limited number of foreign names, so the HMM had trouble recognizing these strings as LOC named entities. Figure 1 shows a more detailed breakdown of the precision, recall, and f-measure for a subset of the articles processed. The results in this chart show the same trend as Table III: the Geografikos package was generally able to extract locations with higher performance from articles containing a higher percentage of unambiguous locations. The lowest scoring articles, the leftmost three in Figure 1, all describe engagements in the American Civil War.
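The f-measure reported throughout is the harmonic mean of precision and recall. A quick sketch (the corpus-level SVM NER row of Table II is reproduced exactly; other rows may be averaged per article, so they differ slightly):

```python
# F-measure as the harmonic mean of precision and recall.
def f_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# SVM NER row of Table II: P = 0.510, R = 0.998
print(round(f_measure(0.510, 0.998), 3))  # 0.675, matching Table II
```

The harmonic mean penalizes imbalance, which is why the SVM's near-perfect recall still yields a modest 0.675 against its 0.510 precision.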
Table III, from [14], shows that North and Central America have a much larger percentage of ambiguous place names than other parts of the world. The highest scoring articles (the rightmost three in Figure 1) focus on engagements on other continents.

Table III. Places With Multiple Names and Names Applied to More Than One Place in the Getty Thesaurus of Geographic Names

Continent                % places with      % names with
                         multiple names     multiple places
North & Central America  11.5               57.1
Oceania                  6.9                29.2
South America            11.6               25.0
Asia                     32.7               20.3
Africa                   27.0               18.2
Europe                   18.2               16.6

A. Analysis

The SVM used in [1] performed independently of article length and place-name frequency. This independence did not carry over to the HMM. Although the two methods share the same disambiguation process, the initial phase of named entity extraction is the main influence on final results and the focus of this paper. The SVM was tuned to produce very high recall by extracting a large number of potential NEs. Ultimately, all of these names were fed into Google Geocoder, which identified actual places. The geocoding process counteracted the large amount of extra extraction from the initial phase and protected precision by accepting only potential NEs that successfully geocoded. However, this generated significantly more traffic to the geocoder than the HMM-based NER process. The HMM focused on decreasing the number of geocoder queries while maintaining overall performance.

Table IV shows the decrease in the number of candidate location NEs extracted by the HMM compared to the SVM for some of the articles in the test corpus. It also shows the number of these NEs that successfully geocoded and were disambiguated. The HMM identified significantly fewer potential NEs in the initial phase, resulting in generally lower recall but higher precision in the final geocoding and disambiguation process.

Figure 1. Lowest 3 and Highest 3 Scoring Articles of HMM

The results for a selection of articles are shown in Table IV. Although the HMM often extracted fewer than half as many potential NEs as the SVM, the final results came out similar. The HMM demonstrates better performance than the SVM on longer articles, and worse performance on shorter articles. The f-measures for some of these articles are pictured in Figure 1, in which the three lowest and three highest scoring articles of the HMM's results are shown side by side. For the HMM, the lowest scoring articles were those with the fewest potential NEs in the text, and the highest scoring had the most.

Overall, the HMM-driven process showed an f-measure 2.4% lower than the SVM-driven process on the same testing corpus of Wikipedia articles. However, the HMM-driven process demonstrated various overall improvements that balance out these results. First, the time required to generate the model for the HMM was under four minutes, while it took about 36 hours to train the SVM on a similar system. This vast decrease in training time allows much greater flexibility in changing and expanding the training corpus to adjust the model for greater performance. Second, the HMM-driven process reduced the number of candidate NEs by over 50% in most cases, reducing the time spent in the geocoding and disambiguation phase. For both the SVM- and HMM-driven approaches, most processing time is spent in that phase, so streamlining the NER phase multiplies the decrease in time spent processing each article.

VII. Conclusions and Future Work

By continuing to enhance the efficiency of the Geografikos package, we both increase the value of the output results and make the package more suitable for heavy public use over the Internet. It also supports our ultimate goal of making this code open source.
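The geocoder-traffic reduction claimed in the analysis above can be checked directly against the extraction counts reported in Table IV:

```python
# Candidate NEs submitted to the geocoder per article (from Table IV).
extracted = {
    "Chancellorsville": (119, 566),   # (HMM, SVM)
    "Gettysburg":       (327, 1209),
    "Korean War":       (625, 2167),
    "War of 1812":      (752, 2173),
    "World War 2":      (668, 1641),
}

for article, (hmm, svm) in extracted.items():
    reduction = 1 - hmm / svm
    print(f"{article}: {reduction:.0%} fewer geocoder queries")
# Every listed article shows a reduction of well over 50%, consistent
# with the claim in the text.
```

Since geocoding and disambiguation dominate per-article processing time, these reductions translate almost directly into end-to-end speedups.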
We envision a number of uses for this package in the search and visualization of Wikipedia articles. With the geospatial-specific information, searches for Wikipedia articles could be filtered by geographic area through a two-step search: first, a standard free-text search on Wikipedia for articles about the topic; then, filtering that list to the articles with appropriate locations. Reversing this paradigm, the location data provided by the Geografikos package could also allow location-centric search. If users want information on a particular region of the world, they could select that location on a map interface and be shown articles that reference it, along with excerpts of the text concerning the region.

Furthermore, this database of locations could enable the visualization of Wikipedia articles through a geospatial interface. For instance, consider an interface that lets a user select a Wikipedia article and then presents a map of all the locations from the article. Each location on the map would be clickable, providing the sentences or paragraph around that NE from the text. Imagine putting World War II into this interface and being presented with a map of Europe, Africa, and the Pacific theater, with all the locations from the article marked and clickable. This kind of visualization would be an excellent teaching tool, and could reveal implicit information and relationships that are not apparent from the text of the article.

Table IV. Hand-Tagged Articles: Potential Location NEs

Article           HMM Extracted    SVM Extracted    HMM        SVM        HMM      SVM
                  Potential NEs /  Potential NEs /  Precision  Precision  Recall   Recall
                  Grounded NEs     Grounded NEs
Chancellorsville  119/47           566/75           0.7015     0.8621     0.4947   0.7895
Gettysburg        327/115          1209/117         0.7718     0.7267     0.6319   0.6429
Korean War        625/328          2167/331         0.9371     0.6910     0.8700   0.8780
War of 1812       752/384          2173/408         0.9165     0.8518     0.7370   0.7831
World War 2       668/464          1641/448         0.9915     0.9124     0.8609   0.8312

Applied to other corpora, this kind of information could also be very useful in finding geospatial trends and relationships. For instance, consider a database of textual disease outbreak reports or world news articles: the Geografikos package could extract all the locations, allowing graphical presentation on a map and making trends much easier to find. With additional work, the geospatial data extracted by the Geografikos package could be combined with temporal information to allow geographic and temporal refinement. While many such visualization tools already exist, they are driven by structured databases, not by free-text document sets.

Our contribution in this paper is demonstrating improvements to the process originally laid out in [1], which extracts location names from a given text and grounds them to a single, disambiguated geospatial entity. By applying an HMM to this process, we increase the flexibility and speed of its NER phase. Together with the data structure and algorithm for resolving ambiguous geospatial NEs based on article context, this opens up possibilities for increased capability in geospatial information retrieval, provided by associating a structured list of geospatial entities with a free-text corpus.
Credits

The work reported in this paper is partially supported by the NSF Research Experience for Undergraduates Grant ARRA::CNS 0851783.

References

[1] J. Witmer and J. Kalita, "Extracting geospatial entities from Wikipedia," IEEE International Conference on Semantic Computing, pp. 450-457, 2009.
[2] R. Grishman and B. Sundheim, "Message Understanding Conference-6: A brief history," in ICCL. ACL, 1996, pp. 466-471.
[3] W. Dakka and S. Cucerzan, "Augmenting Wikipedia with named entity tags," IJCNLP, 2008.
[4] D. Klein, J. Smarr, H. Nguyen, and C. D. Manning, "Named entity recognition with character-level models," in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. Morristown, NJ, USA: Association for Computational Linguistics, 2003, pp. 180-183.
[5] G. Zhou and J. Su, "Named entity recognition using an HMM-based chunk tagger," in Proc. 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), 2002.
[6] S. Cucerzan, "Large-scale named entity disambiguation based on Wikipedia data," EMNLP, 2007.
[7] J. Leidner, G. Sinclair, and B. Webber, "Grounding spatial named entities for information extraction and question answering," in Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References, 2003, pp. 31-38.
[8] V. Sehgal, L. Getoor, and P. Viechnicki, "Entity resolution in geospatial data integration," in ACM Int. Symp. on Advances in GIS. ACM, 2006, pp. 83-90.
[9] W. Zong, D. Wu, A. Sun, E. Lim, and D. Goh, "On assigning place names to geography related web pages," in ACM/IEEE-CS Joint Conf. on Digital Libraries. ACM, 2005, pp. 354-362.
[10] B. Martins, H. Manguinhas, and J. Borbinha, "Extracting and exploring the geo-temporal semantics of textual resources," in IEEE ICSC, 2008, pp. 1-9.
[11] Ø. Vestavik, "Geographic Information Retrieval: An Overview," 2003.
[12] T. D'Roza and G. Bilchev, "An overview of location-based services," BT Technology Journal, vol. 21, no. 1, pp. 20-27, 2003.
[13] J. Witmer and J. Kalita, "Mining Wikipedia article clusters for geospatial entities and relationships," Papers from the AAAI Spring Symposium, Technical Report SS-09-08, 2009.
[14] D. Smith and G. Crane, "Disambiguating geographic names in a historical digital library," Lecture Notes in Computer Science, pp. 127-136, 2001.