Inferring Location Names for Geographic Information Retrieval

Johannes Leveling and Sven Hartrumpf
Intelligent Information and Communication Systems (IICS), University of Hagen (FernUniversität in Hagen), 58084 Hagen, Germany
firstname.lastname@fernuni-hagen.de

Abstract. For the participation of GIRSA at the GeoCLEF 2007 task, two innovative features were introduced to the geographic information retrieval (GIR) system: identification and normalization of location indicators, i.e. text segments from which a geographic scope can be inferred, and the application of techniques from question answering. In an extension of a previously performed experiment, the latter approach was combined with an approach using semantic networks for geographic retrieval. When using the topic title and description, the best performance was achieved by the combination of approaches (0.196 mean average precision, MAP); adding location names from the narrative part increased MAP to 0.258. Results indicate that 1) employing normalized location indicators improves MAP significantly and increases the number of relevant documents found; 2) additional location names from the narrative increase MAP and recall; and 3) the semantic network approach has a high initial precision and even adds some relevant documents which were previously not found. For the bilingual experiments, English queries were translated into German by the Promt machine translation web service. Performance for these experiments is generally lower. The baseline experiment (0.114 MAP) is clearly outperformed; the best performance is achieved by a setup using title, description, and narrative (0.209 MAP).

1 Introduction

In geographic information retrieval (GIR) on textual information, named entity recognition and classification play an important role in identifying location names.
GIR is concerned with facilitating geographically-aware retrieval of information, which typically results from identifying location names in the text and classifying them into geographic and non-geographic names. The main goal of this paper is to investigate whether GIR benefits from an approach which is not solely based on identifying proper nouns corresponding to location names. To this end, the system GIRSA (Geographic Information Retrieval by Semantic Annotation)¹ was developed. GIRSA introduces the notion of location indicators and the application of question answering (QA) techniques to GIR. The system is evaluated on documents and topics for GeoCLEF 2007, the GIR task at CLEF 2007.

¹ The research described is part of the IRSAW project (Intelligent Information Retrieval on the Basis of a Semantically Annotated Web; LIS 4 554975(2) Hagen, BIB 48 HGfu 02-01), which is funded by the DFG (Deutsche Forschungsgemeinschaft).

C. Peters et al. (Eds.): CLEF 2007, LNCS 5152, pp. 773–780, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Location Indicators

Location indicators are text segments from which the geographic scope of a document can be inferred. Important location indicator classes are shown in Table 1.²

Table 1. Definition of location indicator classes

Class | Definition; Example
location adjective | adjective derived from a location name; "irisch"/'Irish' for "Irland"/'Ireland'
demonym | name for inhabitants originating from a location; "Franzose"/'Frenchman' for "Frankreich"/'France'
location code | code for a location, including ISO code, postal and zip code; 'HU21' as the FIPS region code for 'Tolna County, Hungary'
location abbreviation | abbreviation or acronym for a location; "franz." for "französisch"/'French' (mapped to "Frankreich"/'France')
name variant | orthographic variant, exonym, or historic name; 'Cologne' for "Köln"
language | language name in a text; 'Portuguese' for Portuguese-speaking countries (mapped to 'Portugal', 'Angola', 'Cape Verde', 'East Timor', 'Mozambique', 'Brazil')
meta-information | document language, place of publication, place of birth for the author; such attributes can be explicitly given by Dublin Core elements or similar means or can be inferred from the document
unique entity | entity associated with a geographic location, including headquarters of an organization, persons, and buildings; 'Boeing' for 'Seattle, Washington'; 'Eiffel Tower' for 'Paris'
location name | name of a location, including full name and short form; "Republik Korea"/'Republic of Korea' for "Südkorea"/'South Korea'

Typically, location indicators are not part of gazetteers, e.g. the morphological and lexical knowledge for adjectives is missing completely. Distinct classes of location indicators contribute differently in assigning a geographic scope to a document; their importance depends on their usage and frequency in the corpus (e.g.
adjectives are generally frequent) and on how reliably they can be identified, because new ambiguities may be introduced (e.g. the ISO 3166-1 code for Tuvalu (TV) is also the abbreviation for television).

² German examples are double-quoted, while English examples are single-quoted.

For identification and normalization of location indicators, tokens are mapped to base forms and looked up in a knowledge base. The knowledge base contains pairs of a location indicator and a normalized location name. This knowledge base was created by collecting raw material from web sources and dictionaries (including Wikipedia and an official list of state names³), which was then transformed into a machine-readable form, manually extended, and checked.

Location indicators are normalized to location names on different levels of linguistic analysis in GIRSA. Normalization consists of several stages. First, morphological variations are identified and inflectional endings are removed, reducing location indicators to their base form. In addition, multi-word names are recognized and represented as a single term ("Roten Meer(e)s"/'Red Sea's' → "Rote Meer"/'Red Sea'). In the next step, location indicators are normalized: abbreviations and acronyms are expanded and then mapped to a synset representative, e.g. equivalent location names containing diacritical marks or their non-accented equivalents are represented by one element of the name synset (e.g. 'Québec' → 'Quebec'). Finally, prefixes indicating compass directions are separated from the name, which allows retrieving documents with more specific location names if a more general one was used in the query. Thus, a search for "Deutschland"/'Germany' will also return documents containing the phrase "Norddeutschland"/'Northern Germany' (exception: "Südafrika"/'South Africa').

We performed first experiments with semantic representation matching for GIR at GeoCLEF 2005 [1]. GIR-InSicht is derived from the deep QA system InSicht [2] and matches reduced semantic networks (SNs) of the topic description (or topic title) to the SNs of sentences from the document collection. This process is quite strict and proceeds sentence by sentence.⁴ Before matching starts, the query SN may be split into parts at specific semantic relations, e.g. at a loc relation (location of a situation or object) of the MultiNet formalism (multilayered extended semantic networks; [3]), to increase recall while not losing too much precision.
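To make the normalization stages for location indicators described earlier in this section more concrete, the following Python sketch walks a single token through base-form reduction, knowledge-base lookup, and compass-prefix separation. It is a minimal illustration only: the tiny knowledge base, the list of inflectional endings, and the function names `base_form` and `normalize` are assumptions made for this example and are not taken from GIRSA. Multi-word name recognition is omitted.

```python
# Hypothetical miniature knowledge base (assumption, not GIRSA's data):
# base form of a location indicator -> normalized location name.
KNOWLEDGE_BASE = {
    "irisch": "Irland",         # location adjective
    "franzose": "Frankreich",   # demonym
    "franz.": "Frankreich",     # location abbreviation
    "cologne": "Köln",          # name variant
    "québec": "Quebec",         # variants with and without diacritics
    "quebec": "Quebec",         # share one synset representative
    "deutschland": "Deutschland",
    "afrika": "Afrika",
}
# Toy list of German inflectional endings (assumption; real morphology is richer).
INFLECTIONAL_ENDINGS = ("es", "en", "er", "em", "e", "s", "n")
COMPASS_PREFIXES = ("nord", "süd", "ost", "west")
# Names whose compass prefix must not be separated.
PREFIX_EXCEPTIONS = {"südafrika"}


def base_form(token: str) -> str:
    """Lower-case the token and strip one inflectional ending if that
    yields an entry of the knowledge base."""
    lowered = token.lower()
    if lowered in KNOWLEDGE_BASE:
        return lowered
    for ending in INFLECTIONAL_ENDINGS:
        stem = lowered[: -len(ending)]
        if lowered.endswith(ending) and stem in KNOWLEDGE_BASE:
            return stem
    return lowered


def normalize(token: str) -> list[str]:
    """Return the index terms for one token."""
    form = base_form(token)
    if form in KNOWLEDGE_BASE:
        return [KNOWLEDGE_BASE[form]]
    if form not in PREFIX_EXCEPTIONS:
        for prefix in COMPASS_PREFIXES:
            if form.startswith(prefix):
                rest = base_form(form[len(prefix):])
                if rest in KNOWLEDGE_BASE:
                    # Index the prefix and the general name separately, so a
                    # query for the general name also matches this token.
                    return [prefix, KNOWLEDGE_BASE[rest]]
    return [token]  # not a known location indicator: leave unchanged


print(normalize("irische"))          # ['Irland']
print(normalize("Norddeutschland"))  # ['nord', 'Deutschland']
print(normalize("Südafrika"))        # ['Südafrika']
```

Under these assumptions, a query term "Deutschland" and a document token "Norddeutschland" share the index term "Deutschland", while "Südafrika" stays intact because it is listed as an exception.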
For GeoCLEF 2007, query decomposition was implemented, i.e. a query can be decomposed into two queries. First, a geographic subquery about the geographic part of the original query is derived and answered by the QA system InSicht. These geographic answers are integrated into the original query on the SN level (thereby avoiding the complicated or problematic integration on the surface level), yielding one or more revised queries. For example, the query 'Whiskey production on the Scottish Islands' (57-GC) leads to the geographic subquery 'Name Scottish islands'. GIR-InSicht also decomposes the alternative query SNs derived by inferential query expansion. In the above example, this results in the subquery 'Name islands in Scotland'. InSicht answers the subqueries on the SNs of the GeoCLEF document collection and the German Wikipedia. For the above subqueries, it correctly delivered islands like Iona and Islay, which in turn lead to revised query SNs which can be paraphrased as 'Whiskey production on Iona' and 'Whiskey production on Islay'. Note that the revised queries are processed only as alternatives to the original query.

³ http://www.auswaertiges-amt.de/diplo/de/infoservice/terminologie/Staatennamen.pdf
⁴ But documents can also be found if the information is distributed across several sentences because a coreference resolver processed the SN representation for all documents.

Another decomposition strategy produces questions aiming at meronymy knowledge based on the geographic type of a location, e.g. for a country C in the original query a subquery like 'Name cities in C' is generated, whose results are integrated into the original query SN, yielding several revised queries. This strategy led to interesting questions like 'Which country/region/city is located in the Himalaya?' (69-GC). In total, both decomposition strategies led to 80 different subqueries for the 25 topics.

After the title and description of a topic have been processed independently, GIR-InSicht combines the results. If a document occurs in both the title results and the description results, the higher of the two scores is taken for the combination.

The semantic matching approach is completely independent of the main approach in GIRSA. Some of the functionality of the main approach is also realized in the matching approach, e.g. some of the location indicator classes described above are also exploited in GIR-InSicht (adjectives; demonyms for regions and countries). These location indicators are not normalized, but the query SN is extended by many alternative SNs that are in part derived by symbolic inference rules using the semantic knowledge about location indicators. In contrast, the main approach exploits this information on the level of terms.

There has been little research on the role of normalization of location names, inferring locations from textual clues, and applying QA to GIR. Nagel [4] describes the manual construction of a place name ontology containing 17,000 geographic entities as a prerequisite for analyzing German sentences. He states that in German, toponyms have a simple inflectional morphology, but a complex (idiosyncratic) derivational morphology. Buscaldi, Rosso et al.
[5] investigate the semi-automatic creation of a geographic ontology, using resources like Wikipedia, WordNet, and gazetteers. Li, Wang et al. [6] introduce the concept of implicit locations, i.e. locations which are not explicitly mentioned in a text. The only case explored is that of locations which are closely related to other locations. Our own previous work on GIR includes experiments with documents and queries represented as SNs [1], and experiments dealing with linguistic phenomena, such as identifying metonymic location names to increase precision in GIR [7]. Metonymy recognition was not included in GIRSA because we focused on investigating means to increase recall.

3 Experimental Setup

GIRSA is evaluated on the data from GeoCLEF 2007, containing 25 topics with a title, a short description, and a narrative part. As in previous GIR experiments on GeoCLEF data [1], documents were indexed with a database management system supporting standard relevance ranking (tf-idf IR model). Documents are preprocessed as follows to produce different indexes:

1. S: As in traditional IR, all words in the document text (including location names) are stemmed, using a snowball stemmer for German.
2. SL: Location indicators are identified and normalized to a base form of a location name.
3. SLD: In addition, document words are decompounded. German decompounding follows the frequency-based approach described in [8].
4. O: Documents and queries are represented as SNs and GIR is seen as a form of QA.

Typical location indicator classes were selected for normalization in documents and queries. Their frequencies are shown in Table 2.

Table 2. Frequencies of selected location indicator classes

Class | # Documents | # Locations | # Unique locations
demonym | 23379 | 39508 | 354
location abbreviation | 33697 | 63223 | 248
location adjective | 211013 | 751475 | 2100
location name | 274218 | 2168988 | 16840
all | 284058 | 3023194 | 17935

Table 3. Results for different retrieval experiments on German GeoCLEF 2007 data (parameters: query language, index, fields; results: rel ret, MAP, P@5, P@10, P@20)

Run ID | query language | index | fields | rel ret | MAP | P@5 | P@10 | P@20
FUHtd1de | DE | S | TD | 597 | 0.119 | 0.280 | 0.256 | 0.194
FUHtd2de | DE | SL | TD | 707 | 0.191 | 0.288 | 0.264 | 0.254
FUHtd3de | DE | SLD | TD | 677 | 0.190 | 0.272 | 0.276 | 0.260
FUHtdn4de | DE | SL | TDN | 722 | 0.236 | 0.328 | 0.288 | 0.272
FUHtdn5de | DE | SLD | TDN | 717 | 0.258 | 0.336 | 0.328 | 0.288
FUHtd6de | DE | SLD/O | TD | 680 | 0.196 | 0.280 | 0.280 | 0.260
GIR-InSicht | DE | O | TD | 52 | 0.067 | 0.104 | 0.096 | 0.080
FUHtd1en | EN | S | TD | 490 | 0.114 | 0.216 | 0.188 | 0.162
FUHtd2en | EN | SL | TD | 588 | 0.146 | 0.272 | 0.220 | 0.196
FUHtd3en | EN | SLD | TD | 580 | 0.145 | 0.224 | 0.180 | 0.156
FUHtdn4en | EN | SL | TDN | 622 | 0.209 | 0.352 | 0.284 | 0.246
FUHtdn5en | EN | SLD | TDN | 619 | 0.188 | 0.272 | 0.256 | 0.208

Queries and documents are processed in the same way. The title and short description were used for creating a query. GeoCLEF topics contain a narrative part describing documents which are to be assessed as relevant. Instead of employing a large gazetteer containing location names as a knowledge base for query expansion, additional location names were extracted from the narrative part of the topic.
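The SLD index depends on splitting German compounds using word frequencies. As an illustration of the general idea, the Python sketch below splits a word into known base words, preferring the split with the fewest parts and, among those, the one with the highest product of relative frequencies. The selection rule, the frequency values, and the function name `decompound` are assumptions made for this example. The sketch is not a reproduction of the method in [8] or of GIRSA's implementation, and it ignores linking elements such as the German "s".

```python
from functools import lru_cache
from math import prod

# Hypothetical corpus frequencies of non-compound base words (illustrative only).
BASE_FREQ = {
    "anden": 120,
    "gemeinschaft": 900,
    "gemein": 50,
    "schaft": 30,
    "regen": 700,
}
TOTAL = sum(BASE_FREQ.values())
MIN_PART_LEN = 3  # ignore very short constituents


def decompound(word: str) -> list[str]:
    """Split `word` into known base words; fewest parts wins, ties are
    broken by the highest product of relative frequencies."""
    w = word.lower()

    @lru_cache(maxsize=None)
    def splits(start: int):
        # All ways to cover w[start:] completely with base words.
        if start == len(w):
            return [()]
        result = []
        for end in range(start + MIN_PART_LEN, len(w) + 1):
            part = w[start:end]
            if part in BASE_FREQ:
                result.extend((part,) + rest for rest in splits(end))
        return result

    candidates = splits(0)
    if not candidates:
        return [w]  # unknown word: leave it unsplit

    def rank(parts):
        probability = prod(BASE_FREQ[p] / TOTAL for p in parts)
        return (len(parts), -probability)

    return list(min(candidates, key=rank))


print(decompound("Andengemeinschaft"))  # ['anden', 'gemeinschaft']
print(decompound("Regen"))              # ['regen']
```

In a setup like this, the constituents would still have to pass through location indicator normalization for "anden" to count as a location name.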
For the bilingual (English-German) experiments, the queries were translated using the Promt web service for machine translation.⁵ Query processing then follows the setup for monolingual German experiments.

⁵ http://www.e-promt.com/

Values of three parameters were changed in the experiments, namely the query language (German: DE; English: EN), the index type (stemming only: S; identification of locations, not stemmed: SL; decomposition of German compounds: SLD; based on SNs: O; hybrid: SLD/O), and the query fields used (combinations of title T, description D, and locations from narrative N). Parameters and results for the GIR experiments are shown in Table 3. The table shows the number of relevant documents retrieved (rel ret), MAP, and precision at five, ten, and twenty documents. In total, 904 documents were assessed as relevant for the 25 topics. For the run FUHtd6de, results from GIR-InSicht were merged with results from the experiment FUHtd3de in a straightforward way, using the maximum score. (Run IDs indicate which parameters and topic language were used.)

4 Results and Discussion

Identifying and indexing normalized location indicators, decompounding, and adding location names from the narrative part improves performance significantly (paired Student's t-test, P=0.0008), i.e. another 120 relevant documents are found and MAP is increased from 0.119 (FUHtd1de) to 0.258 (FUHtdn5de). Decompounding German nouns seems to have different effects on precision and recall (FUHtd2de vs. FUHtd3de and FUHtdn4de vs. FUHtdn5de). More relevant documents are retrieved without decompounding, but initial precision is higher with decompounding.

The topic 'Deaths caused by avalanches occurring in Europe, but not in the Alps' (55-GC) contains a negation in the topic title and description. However, adding location names from the narrative part of the topic ('Scotland', 'Norway', 'Iceland') did not notably improve precision for this topic (0.005 MAP in FUHtd3de vs. 0.013 MAP in FUHtdn5de).

A small analysis of results found by GIR-InSicht in comparison with the main GIR system revealed that GIR-InSicht retrieved documents for ten topics and returned relevant documents for seven topics.
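The hybrid run FUHtd6de combines the GIR-InSicht results with those of FUHtd3de by taking the maximum score per document. A minimal Python sketch of such a merge is given below. The function name `merge_max`, the document IDs, and the scores are hypothetical, and the sketch assumes that the scores of both systems are directly comparable, which the description of the experiments does not detail.

```python
def merge_max(*runs: dict[str, float]) -> list[tuple[str, float]]:
    """Merge result lists (document ID -> score), keeping the maximum
    score for documents that occur in more than one run."""
    merged: dict[str, float] = {}
    for run in runs:
        for doc_id, score in run.items():
            if doc_id not in merged or score > merged[doc_id]:
                merged[doc_id] = score
    # Rank by descending score; ties are broken by document ID.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores for one topic (not taken from the experiments).
ir_results = {"doc-1": 0.42, "doc-2": 0.30}
qa_results = {"doc-2": 0.90, "doc-3": 0.55}
print(merge_max(ir_results, qa_results))
# [('doc-2', 0.9), ('doc-3', 0.55), ('doc-1', 0.42)]
```

With a maximum-score merge, a document found only by the semantic matching approach enters the combined ranking at its own score, which is how a small high-precision result set can add relevant documents to a larger one.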
This approach contributes three additional relevant documents to the combination (FUHtd6de).

For the topic 'Crime near St. Andrews' (52-GC), zero relevant documents were retrieved in all experiments. Several topics scored well below the median average precision. These topics include "Schäden durch sauren Regen in Nordeuropa" ('Damage from acid rain in northern Europe', 54-GC), "Beratungen der Andengemeinschaft" ('Meetings of the Andean Community of Nations', 59-GC), and "Todesfälle im Himalaya" ('Death on the Himalaya', 69-GC). The following causes for the comparatively low performance were identified:

- The German decompounding was problematic with respect to location indicators, i.e. location indicator normalization was not applied to the constituents of German compounds (e.g. "Andengemeinschaft" is correctly split into "Anden"/'Andes' and "Gemeinschaft"/'community', but "Anden" is not identified as a location name for topic 59-GC).

- Several terms were incorrectly stemmed, although they were base forms or proper nouns (e.g. "Regen"/'rain' → "reg" and "Anden"/'Andes' → "and" for topics 54-GC and 59-GC, respectively).
- Decompounding led in some cases to terms with a very high frequency, causing a thematic shift in the retrieved documents (e.g. "Todesfälle"/'cases of death' was split into "Tod"/'death' and "Fall"/'case' for topics 55-GC and 69-GC).
- In several cases, a focused query expansion might have improved performance, e.g. 'Scandinavia' might have been a good term for query expansion in topic 54-GC, but GIRSA's main approach did not use query expansion for GeoCLEF 2007.

Results for the bilingual (English-German) experiments are generally lower. As for German, all other experiments outperform the baseline (0.114 MAP). The best performance is achieved by an experiment using topic title, description, and location names from the narrative (0.209 MAP). In comparison with results for the monolingual German experiments, the performance drop lies between 4.2% (first experiment) and 27.1% (fifth experiment).

The narrative part of a topic contains a detailed description of which documents are to be assessed as relevant (and which not), including additional location names. Extracting location names from the narrative (instead of looking up additional location names in large gazetteers) and adding them to the query notably improves performance. This result is seemingly in contrast to some results from GeoCLEF 2006, where it was found that additional query terms (from gazetteers) degrade performance. A possible explanation is that in this experiment, only a few location names were added (3.16 location names on average for 15 of the 25 topics with a maximum of 13 additional location names). When using a gazetteer, one has to decide which terms are the most useful ones in query expansion.
If this decision is based on the importance of a location, a semantic shift in the results may occur, which degrades performance. In contrast, selecting terms from the narrative part increases the chance of expanding a query with relevant terms only.

5 Conclusion and Outlook

In GIRSA, location indicators were introduced as text segments from which location names can be inferred. Results of the GIR experiments show that MAP is higher when using location indicators instead of geographic proper nouns to represent the geographic scope of a document. This broader approach to identifying the geographic scope of a document benefits system performance because proper nouns or location names alone do not determine the geographic scope of a document.

The hybrid approach for GIR proved successful, and even a few additional relevant documents were found in the combined run. As GIR-InSicht originates from a deep (read: semantic) QA approach, it returns documents with a high initial precision, which may prove useful in combination with a geographic blind feedback strategy. GIR-InSicht performs worse than the IR baseline, because only 102 documents were retrieved for 10 of the 25 topics. However, more than half (56 documents) turned out to be relevant.

Several improvements are planned for GIRSA. These include using estimates for the importance (weight) of different location indicators, possibly depending on the context (e.g. 'Danish coast' → 'Denmark', but 'German shepherd' ↛ 'Germany'), and augmenting the location name identification with a part-of-speech tagger and a named entity recognizer. Furthermore, the QA methods provide a useful mapping from natural language questions to gazetteer entry points. For example, the expression 'Scottish Islands' is typically not a name of a gazetteer entry, while the geographic subquery answers 'Iona' and 'Islay' typically are. In the future, a tighter coupling between the QA and IR components is planned, exploiting these subquery answers in the IR methods of GIRSA. (Note that this reverses the standard order of processing known from QA: In GIRSA, QA methods are employed to improve performance before subsequent IR phases.) Finally, we plan to investigate the combination of means to increase precision (e.g. recognizing metonymic location names) with means to increase recall (e.g. recognizing and normalizing location indicators).

References

1. Leveling, J., Hartrumpf, S., Veiel, D.: Using semantic networks for geographic information retrieval. In: Peters, C., Gey, F.C., Gonzalo, J., Müller, H., Jones, G.J.F., Kluck, M., Magnini, B., de Rijke, M., Giampiccolo, D. (eds.) CLEF 2005. LNCS, vol. 4022, pp. 977–986. Springer, Heidelberg (2006)
2. Hartrumpf, S., Leveling, J.: Interpretation and normalization of temporal expressions for question answering. In: Peters, C., Clough, P., Gey, F.C., Karlgren, J., Magnini, B., Oard, D.W., de Rijke, M., Stempfhuber, M. (eds.) CLEF 2006. LNCS, vol. 4730, pp. 432–439. Springer, Heidelberg (2007)
3. Helbig, H.: Knowledge Representation and the Semantics of Natural Language. Springer, Berlin (2006)
4. Nagel, S.: An ontology of German place names. Corela – Cognition, Représentation, Langage. Le traitement lexicographique des noms propres (2005)
5. Buscaldi, D., Rosso, P., Garcia, P.P.: Inferring geographical ontologies from multiple resources for geographical information retrieval. In: Proceedings of GIR 2006, Seattle, USA, pp. 52–55 (2006)
6. Li, Z., Wang, C., Xie, X., Wang, X., Ma, W.Y.: Indexing implicit locations for geographical information retrieval. In: Proceedings of GIR 2006, Seattle, USA, pp. 68–70 (2006)
7. Leveling, J., Hartrumpf, S.: On metonymy recognition for GIR. In: Proceedings of GIR 2006, Seattle, USA, pp. 9–13 (2006)
8. Chen, A.: Cross-language retrieval experiments at CLEF 2002. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds.) CLEF 2002. LNCS, vol. 2785, pp. 28–48. Springer, Heidelberg (2003)