Geographic Information Retrieval: Are we making progress?
Ross Purves, University of Zurich

Outline
- Where I'm coming from: definitions, experiences and requirements for GIR
- Brief listing of one (of many possible) sets of challenges for GIR
- Progress and opportunities with focus on why Geographic IR SomeDimension IR
- Interspersed with a personal selection of relevant, but perhaps less well known, papers

Starting points

Ray Larson: seminal work in GIR
"an applied research area that combines aspects of DBMS research, User Interface Research, GIS research, and Information Retrieval research,... concerned with indexing, searching, retrieving and browsing of geo-referenced information sources, and the design of systems to accomplish these tasks effectively and efficiently." Larson et al. (1996)

Refining the definition
"GIR is therefore concerned with improving the quality of geographically specific information retrieval with a focus on access to unstructured documents such as those found on the Web". (Jones and Purves, 2008)

My perspectives on spatial search
- Work dating back to 2002 with Chris Jones, Mark Sanderson, Alistair Edwardes, Paul Clough, Curdin Derungs and others
- Two European projects and related research
  - SPIRIT: Enabling spatial search on internet documents
  - Tripod: Indexing images based on locations and associated geographies
- Co-chair (with Chris) of Workshop on Geographic Information Retrieval (8 editions so far)
- Working with linguists on language and space

SPIRIT: Spatially Aware Information Retrieval on the Internet
- Handled queries of the form <theme> <spatial relationship> <location>
- One of several early examples of complete systems (e.g. Jones et al., 2002; Chen et al., 2006; Lieberman et al., 2007)
- Not based on Local Directory data (c.f. early examples of Local Search)
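
The query form <theme> <spatial relationship> <location> can be illustrated with a small parser. This is a minimal sketch, not SPIRIT's actual implementation: the list of spatial relationship phrases, the `SpatialQuery` type and the `parse_query` function are all hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical, deliberately small list of spatial relationship phrases.
SPATIAL_RELATIONS = ["north of", "south of", "east of", "west of", "close to", "near", "in"]


class SpatialQuery(NamedTuple):
    theme: str
    relation: str
    location: str


# Longer phrases are listed first so that "north of" is tried before "in".
_RELATION_ALTERNATIVES = "|".join(
    re.escape(phrase) for phrase in sorted(SPATIAL_RELATIONS, key=len, reverse=True)
)
_PATTERN = re.compile(
    r"^(?P<theme>.+?)\s+(?P<relation>" + _RELATION_ALTERNATIVES + r")\s+(?P<location>.+)$",
    re.IGNORECASE,
)


def parse_query(text: str) -> Optional[SpatialQuery]:
    """Split a query into theme, spatial relationship and location, or return None."""
    match = _PATTERN.match(text.strip())
    if match is None:
        return None
    return SpatialQuery(
        theme=match["theme"].strip(),
        relation=match["relation"].lower(),
        location=match["location"].strip(),
    )


print(parse_query("castles near Edinburgh"))
print(parse_query("hotels north of Zurich"))
```

A query with no recognised relationship (for example a bare theme) returns None, which a real system would presumably fall back to treating as a pure text query.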

Basic conceptual model of a SPIRIT-like system
Often forgotten!!

Basic preconditions for GIR
GIR (in my view) becomes interesting when a few preconditions are met:
- Information needs complex, varied and often underspecified
- Large collections of unstructured documents (may or may not be thematically related)
- Simple binary (DBMS type) retrieval not effective
- Multiple geographic granularities

and so to challenges for GIR based on an editorial in IJGIS (Jones and Purves, 2008)

The challenges
- Detecting geographical references in the form of place names and associated spatial natural language qualifiers within text documents and in users' queries;
- disambiguating place names to determine which particular instance of a name is intended;
- geometric interpretation of the meaning of vague place names, such as "the Midlands", and of vague spatial language such as "near";
- indexing documents with respect to their geographic context as well as their non-spatial thematic content;
- ranking the relevance of documents with respect to geography as well as theme;
- developing effective user interfaces that help users to find what they want; and
- developing methods to evaluate the success of GIR.

Detecting geographical references
- Basic task underlying GIR: identifying candidate spatial referents in text
- Underpinned by NER methods but often simple gazetteer lookup is key
- Queries (and social media) have very different properties to text documents
- Referents typically treated as part of a bag of words/points model
- Usually predicated on specific placenames (Santa Barbara) (as opposed to types (a beach)) referents
- Language modelling approaches (c.f. Vanessa's talk) overcome some of these problems but bring others
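
Simple gazetteer lookup can be sketched as a greedy longest-match scan over tokens. This is an illustrative sketch under stated assumptions: the three-entry `GAZETTEER`, its approximate coordinates and the `find_toponyms` function are hypothetical, and a real system would add NER and far larger resources.

```python
import re
from typing import Dict, List, Tuple

# Toy gazetteer: lower-cased place name -> (latitude, longitude).
# Coordinates are approximate and for illustration only.
GAZETTEER: Dict[str, Tuple[float, float]] = {
    "santa barbara": (34.42, -119.70),
    "zurich": (47.37, 8.54),
    "dallas": (32.78, -96.80),
}

# Longest gazetteer name, measured in tokens, bounds the lookup window.
MAX_NAME_TOKENS = max(len(name.split()) for name in GAZETTEER)


def find_toponyms(text: str) -> List[Tuple[str, int]]:
    """Return (matched name, token index) pairs using greedy longest-match lookup."""
    tokens = re.findall(r"\w+", text.lower())
    matches: List[Tuple[str, int]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "santa barbara" wins over any shorter name.
        for n in range(min(MAX_NAME_TOKENS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in GAZETTEER:
                matches.append((candidate, i))
                i += n
                break
        else:
            i += 1  # No name starts at this token.
    return matches


print(find_toponyms("We drove from Santa Barbara to a beach near Dallas."))
```

The example sentence also shows the limitation noted above: the named place is found, while the type referent "a beach" is invisible to a lookup of this kind.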

Wolf, S.J., Henrich, A. and Blank, D. (2014) Characterization of Toponym Usages in Texts. 8th Workshop on Geographic Information Retrieval, Dallas, Texas.

and disambiguating place names
- Very large proportions of candidate referents are ambiguous
- Humans deal with this very well
- Very simple methods (default sense typically) achieve very high precision, especially at city level granularities
- Often assume random toponym distribution and focus on coarse granularities
- Sources such as Wikipedia (for co-occurrence) may increase unevenness of coverage (c.f. Mark Graham)
- Gazetteer properties often unquestioningly accepted
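
A default-sense strategy can be sketched in a few lines. The version below assumes that the default sense is the most populous candidate, which is one common choice rather than the only one; the `Candidate` type, the `CANDIDATES` table and its rough population figures are hypothetical illustration data.

```python
from typing import Dict, List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    country: str
    population: int


# Toy gazetteer entries; population figures are rough, illustrative values.
CANDIDATES: Dict[str, List[Candidate]] = {
    "london": [
        Candidate("London", "United Kingdom", 8_900_000),
        Candidate("London", "Canada", 400_000),
    ],
    "paris": [
        Candidate("Paris", "France", 2_100_000),
        Candidate("Paris", "United States", 25_000),
    ],
}


def default_sense(name: str) -> Optional[Candidate]:
    """Pick the most populous candidate for a name (assumed default sense)."""
    options = CANDIDATES.get(name.lower(), [])
    if not options:
        return None
    return max(options, key=lambda candidate: candidate.population)


print(default_sense("London"))
print(default_sense("Atlantis"))
```

Because the choice ignores document context entirely, the result depends wholly on what the gazetteer contains, which is why its properties deserve scrutiny.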

Moncla, L., Renteria-Agualimpia, W., Nogueras-Iso, J. & Gaio, M. (2014) Geocoding for texts with fine-grain toponyms: an experiment on a geoparsed hiking descriptions corpus. In Proceedings of the ACM SIGSPATIAL GIS 2014, Dallas, Texas.

Vague place names and spatial language
- Recognises importance of vague spatial language and incompleteness of gazetteers
- Many studies have demonstrated possibilities of delineating (and more rarely identifying) vague place names (through co-occurrence and georeferenced spatial media)
- Vagueness and its implications vary with granularity and user need (often ignored)
- Reasoning in search typically discards vagueness (other than as a distance ranking measure)
- Many official, crisply bounded geometries are also used vaguely in natural language

Davies, C., Holt, I., Green, J., Harding, J. and Diamond, L. (2009) User needs and implications for modelling vague named places. Spatial Cognition & Computation 9 (3), 174-94.

Indexing
- Indexes fundamental to efficient search of both text and space
- In GIR early experiments showed that simple approaches (e.g. separate thematic and spatial indexes) were adequate
- Recent work uses more complex ideas to combine dimensions but advantages still unclear
- Index efficiency is possibly of less interest to most participants here but effectiveness is also key:
  1. How should we represent documents for indexing?
  2. Should indexes be space (e.g. Vanessa's language models) or object primary (e.g. POI data)?
  3. How should query vs. document footprints be represented?
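
The simple approach of separate thematic and spatial indexes can be sketched as an inverted index plus a regular grid, with results obtained by intersecting the two. This is a sketch under strong assumptions: each document is reduced to a single point footprint, the grid uses one-degree cells, and the `SeparateIndexes` class is a hypothetical name rather than any system's actual design.

```python
import math
from collections import defaultdict
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]


class SeparateIndexes:
    """An inverted text index and a grid spatial index, queried independently."""

    def __init__(self, cell_size_deg: float = 1.0) -> None:
        self.cell_size = cell_size_deg
        self.text_index: Dict[str, Set[int]] = defaultdict(set)
        self.grid_index: Dict[Cell, Set[int]] = defaultdict(set)

    def _cell(self, lat: float, lon: float) -> Cell:
        return (math.floor(lat / self.cell_size), math.floor(lon / self.cell_size))

    def add(self, doc_id: int, text: str, lat: float, lon: float) -> None:
        for term in text.lower().split():
            self.text_index[term].add(doc_id)
        self.grid_index[self._cell(lat, lon)].add(doc_id)

    def search(self, term: str, min_lat: float, min_lon: float,
               max_lat: float, max_lon: float) -> Set[int]:
        """Return candidate documents matching the term in cells touching the box."""
        thematic = self.text_index.get(term.lower(), set())
        low = self._cell(min_lat, min_lon)
        high = self._cell(max_lat, max_lon)
        spatial: Set[int] = set()
        for row in range(low[0], high[0] + 1):
            for col in range(low[1], high[1] + 1):
                spatial |= self.grid_index.get((row, col), set())
        return thematic & spatial


index = SeparateIndexes()
index.add(1, "castle walks", 55.95, -3.19)
index.add(2, "castle tours", 51.48, -3.18)
index.add(3, "harbour walks", 55.97, -3.17)
print(index.search("castle", 55.0, -4.0, 56.0, -3.0))
```

The spatial step returns every document in a cell that touches the query box, so the result is a candidate set; a fuller implementation would refine it against the exact query geometry.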

Chen, L., Cong, G., Jensen, C. S., & Wu, D. (2013). Spatial keyword query processing: an experimental evaluation. Proceedings of the VLDB Endowment, 6(3), 217-228.

Ranking (and relevance)
- Relevance is often reduced to a binary or ordinal quality in IR/GIR
- Notions of relevance are fundamental to similarity measures used in ranking
- Most approaches use relatively simple models: note that measures must also be tractable on large document collections
- First (simple) experiments suggest spatial diversity also important
- Relevance literature is, in my view, confused/confusing: very little is directly related to text documents (often focuses on POI search)
- Many ranking approaches use Euclidean distance w.r.t. some notional point
- Underlying geographic distribution of theme of interest often ignored
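
A relatively simple ranking model of the kind described can be sketched as a weighted sum of a thematic score and a distance-decayed spatial score. Everything specific here is an assumption for illustration: the exponential decay, the `weight` and `scale` parameters, the function names, and the use of planar coordinates (for example kilometres in a projected system) measured from a single notional query point.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def euclidean_distance(a: Point, b: Point) -> float:
    """Straight-line distance between two planar points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def combined_score(text_score: float, doc_point: Point, query_point: Point,
                   weight: float = 0.5, scale: float = 10.0) -> float:
    """Weighted sum of a thematic score and an exponentially decaying spatial score."""
    spatial_score = math.exp(-euclidean_distance(doc_point, query_point) / scale)
    return weight * text_score + (1.0 - weight) * spatial_score


def rank(docs: List[Tuple[str, float, Point]], query_point: Point) -> List[str]:
    """Return document ids ordered by descending combined score."""
    scored = [(combined_score(score, point, query_point), doc_id)
              for doc_id, score, point in docs]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


documents = [
    ("a", 0.9, (30.0, 40.0)),  # strong thematic match, far from the query point
    ("b", 0.6, (0.0, 0.0)),    # weaker thematic match, at the query point
    ("c", 0.9, (3.0, 4.0)),    # strong thematic match, close by
]
print(rank(documents, query_point=(0.0, 0.0)))
```

The sketch shares the weaknesses listed above: it scores each document in isolation, so it says nothing about spatial diversity or about how the theme is distributed geographically.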

J. Tang, M. Sanderson. (2010) Evaluation and User Preference Study on Spatial Diversity. In Proceedings of the 32nd European Conference on IR Research on Advances in Information Retrieval (ECIR 2010)

User interfaces
- User interfaces central to query (re)formulation and results display
- Most mainstream examples still focus on points on maps, and basic cartographic issues (e.g. overplotting) are ignored
- Surprisingly limited crossover from the geovisualisation community (especially for search rather than exploration)
- Search strategies known to be important in search (c.f. information foraging model from Stuart Card) yet typically we still adopt a one size fits all approach
- Large, complex corpora, crying out for effective visualization approaches which capture variability and richness

H. Samet, M. D. Adelfio, B. C. Fruin, M. D. Lieberman, J. Sankaranarayanan. 2013. PhotoStand: A Map Query Interface for a Database of News Photos. PVLDB, 6(12):1350-1353.

Evaluation
- Much evaluation in IR has focussed on system-centred comparative approaches (e.g. GeoCLEF)
- Pure text baselines often hard to beat: function of corpora, query type, granularity and relevance judgements
- Limited, controlled access to query logs (c.f. AOL debacle); however, work based on these has great potential for better understanding user needs (realistic evaluation)
- Increasing use of approaches based on crowd sourcing (e.g. CrowdFlower) but literature suggests evaluating specific geographic relevance is challenging, even with local knowledge
- Long tail queries are important: user-centred, qualitative approaches have great potential
- Still a need for community-wide cooperation and evaluation to allow meaningful comparison

Mandl, T., Carvalho, P., Di Nunzio, G. M., Gey, F., Larson, R. R., Santos, D., & Womser-Hacker, C. (2009). GeoCLEF 2008: the CLEF 2008 cross-language geographic information retrieval track overview. In Evaluating Systems for Multilingual and Multimodal Information Access (pp. 808-821). Springer Berlin Heidelberg.

Some closing remarks
- Unstructured text documents very rich, great potential for answering geographic questions (c.f. Chris Jones)
- Balance of research attention (social media vs. more traditional text) perhaps uneven
- Developing complete systems complex and often neglected: need for more component sharing, reproducible research?
- GIR is necessarily interdisciplinary: great potential for more effective collaborations