Spatial Role Labeling: Towards Extraction of Spatial Relations from Natural Language

PARISA KORDJAMSHIDI, MARTIJN VAN OTTERLO and MARIE-FRANCINE MOENS, Katholieke Universiteit Leuven

This article reports on the novel task of spatial role labeling in natural language text. It proposes machine learning methods to extract spatial roles and their relations. This work experiments with both a step-wise approach, where spatial prepositions are found and the related trajectors and landmarks are then extracted, and a joint learning approach, where a spatial relation and its composing indicator, trajector and landmark are classified collectively. Context-dependent learning techniques, such as a skip-chain conditional random field, yield good results on the GUM (Maptask) evaluation data and the CLEF IAPR TC-12 Image Benchmark. An extensive error analysis, including feature assessment, and a cross-domain evaluation pinpoint the main bottlenecks and avenues for future research.

Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing: Language parsing and understanding; Text analysis; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing: Linguistic processing

General Terms: Experimentation, Languages

Additional Key Words and Phrases: Semantic labeling, spatial relations, spatial information extraction

1. INTRODUCTION

An essential function of language is to convey spatial relationships between objects and their relative or absolute location in a space. The sentence "Give me the gray book on the large table" expresses information about the spatial configuration of two objects (book, table) in some space. Understanding such spatial utterances is a problem in many areas, including robotics, navigation, traffic management, and query answering systems [Tappan 2004]. Although the current work focuses on natural language processing, our long-term research considers spatial information extraction in a multimodal environment and aims to obtain and represent spatial relations using formal representations, allowing further spatial reasoning. For example, an interesting multimodal environment is the navigation domain, where we expect a robot to follow navigation instructions [Kollar et al. 2010]. When a camera is placed on the robot, it should be able to both recognize objects and their location and search for particular items based on verbal instruction. Another example is answering queries about objects' locations using both textual descriptions and visual data; combining the evidence provided by recognizing objects in the texts and images could generate answers that are more reliable. Spatial information extraction from language could also play an important role in semantic search, i.e., extracting information based on

meaningful categories. We recently introduced the spatial role labeling problem as the extraction of generic spatial semantics from natural language [Kordjamshidi et al. 2010b]. We defined a semantic labeling scheme to annotate spatial information. It tags natural language with the spatial roles carried by words according to the holistic spatial semantics (HSS) theory [Zlatev 2007]. The core problem of spatial role labeling is assigning specific tags to words or phrases in natural language sentences to express their roles in terms of spatial semantics. For example, in "John is sitting on the ground", the preposition "on" is an indicator of a spatial relation between John and the ground. Many prepositions never carry a spatial meaning, whereas others have a spatial sense that depends on the context. The preposition "on" has a spatial sense in this sentence, but no such sense in the sentence "I can count on him". "John" is the first argument of the on-relation and is a trajector. The phrase "the ground" is the second argument of the on-relation and is a landmark.

Related research in this domain extracts very specific, application-dependent relations from restricted language [Kelleher 2003; Tappan 2004; Li et al. 2007]. Previous research has not systematically covered spatial relation and role extraction from unrestricted natural language with machine learning methods, but this paper aims to do so. Statistical machine learning models are promising approaches to address the intrinsically ambiguous nature of spatial information in natural language. A major obstacle when dealing with unrestricted language is the scarcity of annotated data available for training machine learning models. We therefore start with the available resources. In our leading experiments, we learn prepositions' spatial senses by exploiting annotated data from The Preposition Project (TPP) employed in SemEval-2007 [Litkowski and Hargraves 2007] and then use the results of preposition disambiguation in a spatial role labeler that identifies trajector and landmark roles. We use linguistically motivated features and evaluate several context-dependent classification algorithms. We successfully evaluate spatial role labeling on texts from the GUM (General Upper Model spatial ontology) evaluation data [Bateman et al. 2007] and the CLEF IAPR TC-12 Image Benchmark data [Grubinger et al. 2006].1

One advantage of our pipelining approach is that knowledge from another linguistic resource is injected into the learning system. The TPP data are exploited here to solve the first part of our relation extraction algorithm, i.e., finding prepositions that have a spatial sense. We use annotated data from a larger source outside our training and test data in the extraction task, potentially increasing generalization possibilities. Errors in recognizing prepositions' spatial meaning can propagate and lead to incorrect recognition of spatial roles and relationships. Thus, the pipelined approach has difficulties competing with models that jointly learn the spatial meaning of a preposition and the corresponding spatial roles of its arguments. Analyzing and comparing these settings provides inspiration for utilizing (other) resources for our task.

We present the first experimental study on learning to extract spatial information from unrestricted natural language. Our main contributions include the following:

- We introduce the novel spatial role labeling task, which extracts spatial relations from natural language.
1 See also http://imageclef.org/photodata

- We present the first domain-independent English dataset with labeled data for spatial expressions, specifically designed for machine learning solutions.
- Based on linguistically oriented features, we evaluate conditional random field (CRF) algorithms and compare their suitability for the task.
- We demonstrate the injection of external data resources into the spatial role labeling task by exploiting sense-annotated prepositions from TPP and compare it to a one-step approach limited to only using spatially annotated data.
- We provide extensive experiments to show that our approach produces good results for the spatial role labeling task.
- We extensively survey related approaches for spatial language understanding in cognitive science, linguistics and computer science.
- We pinpoint bottlenecks and outline future research directions.

Main structure of this article. This paper is structured as follows. In Section 2, we describe the spatial role labeling task and formally define it in Section 3. In Section 4, we describe our approach, based on machine learning techniques, to learn the spatial role labeling task from an annotated dataset. This approach solves two main subproblems, for which solutions are described subsequently. The first subproblem is identifying the pivot of spatial relations, for which we learn to predict the roles of prepositions, as described in Section 4.1. The second subproblem is identifying possible arguments of the spatial relations, for which we learn to predict whether parts of a sentence can be classified as so-called trajectors or landmarks, as described in Section 4.2. Both subproblems tackle the overall goal of extracting spatial relations from text. In Section 4.3, we investigate another setting in which we classify all roles jointly, i.e., without separate classifications for spatial indicators and trajectors/landmarks. Section 4.4 reports which algorithms, based on probabilistic graphical models, are employed for both subproblems. In Section 5, we present and discuss a series of experiments. After introducing the main structure and rationale of the experiments, we show results for several datasets and perform an additional feature analysis. We give results in quantitative form but also present a qualitative analysis to show the effectiveness of the approach. To complement the error analysis and see how well the learned classifiers generalize to new data, we evaluate them on several texts from subject domains different from the training domain. After the experiments, in Section 6, we discuss related lines of research on spatial information representation and extraction in cognitive science, linguistics and machine learning. Section 7 concludes this article and outlines prominent research directions in spatial language processing.

2. THE SPATIAL ROLE LABELING TASK

As discussed above, spatial information plays an important role in many applications [Galton 2009]. However, its automatic recognition in natural language expressions is undeveloped or, when addressed, limited to recognizing coarse-grained and brittle information added to predicates and mainly expressed by verbs. To highlight some general aspects of spatial semantics, consider the following two sentences (taken from [Bateman et al. 2010]):

(1) He left the institute an hour ago.
(2) He left the institute a year ago.

In the first example the sentence semantics indicate that the person is no longer in the

building, and the sentence is about physically leaving the building and going somewhere else. This change directly amounts to a physical and spatial relocation. The second sentence expresses a more fundamental change: the person has apparently quit his job at the institute. The second type of spatial change is more involved and less material. Another set of examples is as follows:

(3) The computer is on the table and the mouse is to the left of it.
(4) The party leader could be considered at the far left of the political spectrum.

The first sentence expresses two explicit physical relations about objects on a table. The second sentence uses a similar relation, "at the far left of", but its meaning is more conceptual. Only by drawing this political spectrum on a piece of paper could one put the party leader on its left side. These examples illustrate some of the challenges in spatial language understanding. Similar lexical items can provide different spatial meanings. Conversely, two different descriptions may have a similar semantic interpretation:

(5) Looking over his right shoulder, he saw his dog sitting quietly.
(6) The dog sat quietly on the floor to his right.

In sentences (1) and (2), the spatial information is mainly expressed through a verb, whereas the other examples primarily use prepositions. Furthermore, some information is not explicitly represented in the words but can be inferred from common sense. For example, one can infer that the mouse is on the table in sentence (3). This sentence also includes a related inference step resolvable at the linguistic level. An anaphora resolution step attaches "it" to "the computer" before determining the spatial semantics. It could also refer to "the table", in which case the spatial semantics differ.

Despite the variations in spatial information in natural language expressions, a sentence can essentially express spatial relations between objects. For example, the third sentence contains an on-relation between the computer and the table. Another relation is that the mouse is to the left of the computer. Such relations, denoted on(computer, table) and totheleftof(mouse, computer), form the starting point of any system that processes spatial information in natural language. In on(computer, table), we can distinguish the different spatial roles of phrases in a sentence: "on" expresses a predicate (or relation), and "computer" and "table" are arguments with their own roles. Our main concern in this article is extracting such spatial relations.

We define spatial role labeling as the automatic labeling of words or phrases in sentences with a set of spatial roles. The roles take part in one or more spatial relations expressed by the sentence. The sentence-level spatial analysis of texts characterizes spatial descriptions, such as determining the objects' spatial properties and locations to answer what/who and where questions. The spatial indicator (typically a preposition) establishes the type of spatial relation, and other constituents express the participants of the spatial relation (e.g., entities' locations). The following sentence is an example:

Give me the [gray book]_tr [on]_si [the big table]_lm.

Our spatial role set consists of trajector (tr), landmark (lm) and spatial indicator (si) (and none otherwise) [Kelleher 2003; Zlatev 2007; Kordjamshidi et al. 2010b]. The above sentence contains several subsequences labeled with these roles. They are as follows:

Trajector: the entity whose (trans)location is of relevance. The book is the main entity

whose location is specified in the sentence. The trajector can be static or dynamic, a person or an object, or even a whole event. Alternative terms used in the literature are local/figure object, locatum, referent or target.

Landmark: the reference entity in relation to which the location or the trajectory of motion of the trajector is specified. In the example, the location of the main entity (the trajector, the book) is specified relative to the table. Other terms for landmarks are reference object, ground, or relatum.

Spatial indicator: the tokens that define constraints on the spatial properties, such as the trajector's location with respect to the landmark (e.g., "in", "on"). A spatial indicator expresses a relation (or predicate) with the landmark and trajector as its arguments. Spatial indicators explain the types of spatial relations; they are often prepositions but can also be verbs, nouns, and other parts of speech. These indicators are the pivot of spatial relations.

Other conceptual aspects, such as motion indicators, indicate specific spatial motion information (usually specified in terms of verbs); the frame of reference and the path of a motion are influencing concepts for spatial semantics and roles [Zlatev 2007]. However, we restrict our focus to prepositions conveying spatial information.

Spatial role labeling is a special type of semantic role labeling, and, as with semantic roles, the spatial relations supported by the roles contribute to the recognition of a sentence's semantic frame [Màrquez et al. 2008]. In semantic frame labeling, a predicate is identified and disambiguated, and its role arguments are recognized. In spatial role labeling, the spatial indicator is identified (instead of the verb predicate) and disambiguated, and its semantic role arguments, including the trajector and landmark, are found. However, differences between these two tasks exist. In spatial role labeling, the roles are more specific regarding their semantics; there is no direct correspondence between the sentence's semantic structure based on traditional semantic frames (agent, patient) and the spatial semantic structure. In the above example, FrameNet's Giving frame provides the semantic type "Locative relation: the Place where the Donor gives the Theme to the Recipient". The location refers to the place where the giving is performed, not the location of the book mentioned in the prepositional phrase. Moreover, both the formal and informal (pragmatic) meanings of spatial expressions in natural language are highly dependent on lexical details, the ontological structure of spatial information spaces, and the embedding of extracted information into existing spatial knowledge.

[Fig. 1. Parse tree labeled with spatial roles.]

Another difference between spatial role labeling and semantic role labeling is that no large annotated corpora were available from which spatial roles could be learned directly. New data resources were needed to apply machine learning techniques. In this respect, breaking the problem into parts and utilizing existing linguistic resources have the advantage

of limiting the number of training examples that must be labeled. These external resources could improve the performance of the spatial role labeling task, which is evaluated in this paper.

General spatial relation extraction presents many challenges concerning task-specific ambiguities and difficulties. In particular, there is not always a direct mapping between a sentence's grammatical structure and its spatial semantic structure. This issue is more challenging in complex spatial expressions that convey several spatial relations. The simple example below shows that grammatical dependencies cannot always identify spatial dependencies and connections:

The vase is on the ground on your left.

The dependency tree relates the first appearance of "on" to the words "vase" and "ground". This produces a valid spatial relation connecting the right trajector to the right landmark. If we systematically follow the grammatical clues and information, however, the second appearance of "on" connects "the ground" and "your left", producing a less meaningful spatial relation in terms of trajector, landmark and spatial indicator ("ground on your left"). Figure 1 shows the related parse tree. When confronted with more complex relations and nested noun phrases, deriving spatially valid relations is not straightforward and is highly dependent on the lexical meaning of words. However, recognizing the right prepositional phrase (PP) attachment during syntactic parsing can improve the identification of spatial arguments. Other linguistic phenomena, such as spatial-focus-shift and ellipsis of the trajector and landmark [Li et al. 2007], make extraction more difficult. Spatial motion detection and recognition of the frame of reference are additional challenges that are not treated here.

3. PROBLEM DEFINITION

The spatial role labeling task finds spatial relations in natural language sentences, each of which includes a spatial indicator and its arguments. We assume that the sentence is a priori partitioned into a number of segments. The segments could be words, phrases or arbitrary subsequences of the sentence. More formally, let S be a sentence defined as a sequence of N segments:

S = w_1, w_2, ..., w_N

We define a set of roles: roles = {trajector, landmark, spatial indicator, none}, and each segment in the sentence can be assigned one or more of these roles. Each spatial relation in sentence S is a triple

⟨w_spatial-indicator, w_trajector, w_landmark⟩

where w_spatial-indicator, w_trajector and w_landmark are three distinct segments of S, denoting the parts of S that represent the spatial indicator and its trajector and landmark arguments, respectively. For any spatial relation, the value of the trajector (or landmark) can be undefined, meaning that no segment in S represents the trajector (or landmark). In those cases, we call the trajector (or landmark) implicit, as in the sentence "Come over here", where the trajector "you" is only implicitly present.

Given a sentence S, the set of all spatial indicators of S is denoted I. It is induced by

the indicator function I defined over all segments w of S:

I(w) = 1, if w is a spatial indicator; 0, otherwise

We assume that spatial indicators overlap with neither each other nor trajectors and landmarks. In other words, for any sentence S, if w and w' are two segments of S, then I(w) = 1 and I(w') = 1 imply that w ∩ w' = ∅. Because trajectors and landmarks are spatial indicator arguments, we define two indicator functions relative to a given spatial indicator s in sentence S. The set of trajectors (landmarks) with respect to spatial indicator s is denoted T_s (L_s), induced by indicator functions T_s and L_s defined over all segments in S. For a spatial indicator s, its trajector and landmark cannot overlap with each other or with s itself (though they can be undefined, as mentioned earlier).

Although we have defined spatial indicators, trajectors and landmarks as arbitrary segments of a sentence, we focus on single words, each as one segment. However, a phrase in the sentence commonly plays a role, and we thus assume that the head word of the phrase is the role-holder. A head word determines its phrase's syntactic type; analogously, it is the stem that determines the semantic category of a compound. The other elements of a phrase modify the head. For example, in "the huge blue book", "book" is the head word, and "huge" and "blue" are modifiers. In our data, the labeling scheme reflects this fact and only assigns roles to head words, labeling the remaining words (e.g., modifiers) as none. Hence, a sentence is hereafter assumed to be a sequence of words.

Our ground-truth data include sequences, each of which contains exactly one (labeled) spatial indicator with all possible trajectors and landmarks. A sentence can thus provide multiple examples, up to the number of its contained spatial indicators. We formally define each sentence in the corpus as a sequence of words w_1, ..., w_n. Let k be the number of prepositions in a sentence s; s then induces k examples e_1, ..., e_k, where no two examples e_i and e_j have the same spatial indicator. Each e_i (i = 1, ..., k) is a sequence (w_1, l_1), ..., (w_n, l_n) in which each word is tagged such that (i) at most one w_j gets the label l_j = spatial indicator; (ii) some words get the label trajector or landmark if they are a trajector or landmark of the spatial indicator w_j; and (iii) the remaining words get the label none. If a preposition is not spatial, all words in the example are tagged with none. As an illustration, consider the following sentence, which gives two examples:

word:      A    girl      and  a    boy       are  sitting  at           the  desk       in           the  classroom.
example 1: none trajector none none trajector none none     sp.indicator none landmark   none         none none
example 2: none none      none none none      none none     none         none trajector  sp.indicator none landmark

The sentence is labeled twice, each time with a different indicator. Using our indicator functions, we have

I = {at, in}
T_at = {girl, boy} and L_at = {desk}
T_in = {desk} and L_in = {classroom}

The spatial relations for this sentence are the triples produced by the following (we only account for head words in the role-playing phrases):

{at} × {girl, boy} × {desk} = { ⟨at, girl, desk⟩, ⟨at, boy, desk⟩ }
{in} × {desk} × {classroom} = { ⟨in, desk, classroom⟩ }

An example with an implicit trajector is the following sentence:

Go    under              the   bridge
none  spatial indicator  none  landmark

In this case, we derive the spatial relation using

I = {under} and T_under = ∅ and L_under = {bridge}

which results in ⟨under, undefined, bridge⟩ as the corresponding spatial relation. This article takes a given corpus of sentences tagged with spatial indicators, trajectors and landmarks, giving a multitude of sequence examples, and constructs (i.e., learns) an automated spatial relation extraction method that can be employed successfully on unseen data.
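To make the formalization concrete, the following is a minimal sketch (our illustration, not the authors' code) of how one labeled example can be represented and how the relation triples, including the undefined place-holder, are induced from it; the tag strings and data layout are assumptions.

```python
from itertools import product

# One training example: the sentence tagged with respect to the pivot "at".
example = [("A", "none"), ("girl", "trajector"), ("and", "none"),
           ("a", "none"), ("boy", "trajector"), ("are", "none"),
           ("sitting", "none"), ("at", "sp.indicator"), ("the", "none"),
           ("desk", "landmark"), ("in", "none"), ("the", "none"),
           ("classroom", "none")]

def relations(example):
    """Induce T_s and L_s for the example's pivot and return all spatial
    relation triples <indicator, trajector, landmark>."""
    indicators = [w for w, tag in example if tag == "sp.indicator"]
    # An empty trajector/landmark set becomes the place-holder "undefined".
    trajectors = [w for w, tag in example if tag == "trajector"] or ["undefined"]
    landmarks = [w for w, tag in example if tag == "landmark"] or ["undefined"]
    return [(s, t, l) for s in indicators
            for t, l in product(trajectors, landmarks)]

print(relations(example))
# -> [('at', 'girl', 'desk'), ('at', 'boy', 'desk')]
```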

4. APPROACH

The problem definition leads to a similar problem as semantic role labeling (SRL), where words are classified based on a known predicate (a verb). In spatial role labeling, the spatial indicator is the pivot (i.e., predicate) of the spatial relation. A spatial indicator can be from various lexical word classes, although the most dominant form is the preposition. In SRL, one can start from a verb and find roles related to it, but in spatial role labeling, one must first find the sense of the pivot (i.e., the preposition). Sometimes a preposition has a spatial sense, while the same preposition might not have a spatial sense in a different context. In our approach, the set of roles is {trajector, landmark, spatial indicator, none}, and we use an additional term undefined to highlight the existence of implicit trajectors or landmarks; undefined does not appear in the annotated data, nor is it learned or predicted by our classifiers. It solely serves as a place-holder for missing elements if the three components of a spatial relation cannot be explicitly found in a sentence (Algorithm 1 provides further explanation). The set of all spatial relations in a sentence S, denoted SR, is defined thus (where s, t, l are head words in S):

SR = { ⟨s, t, l⟩ | s ∈ I, t ∈ T_s, l ∈ L_s }

In this definition, three functions should be estimated. First, the function I is needed; it takes a word in the sentence as an input and estimates whether it is a spatial indicator. We employ a general probabilistic classifier; for spatial indicators, we learn a function Î representing the probability that a word is spatial, given some features about sentence S. To get the (deterministic) indicator function I, we compute (using r = {spatial, nonspatial})

I(w) = 1, if spatial = argmax_{x ∈ r} Î(x | w, f(w, S)); 0, otherwise    (1)

with Î optimized over training data, where f(w, S) denotes a set of features derived from sentence S and word w.

Indicating which words in the sentence have the trajector or landmark role requires two other functions, given that we know that some word s is a spatial indicator. Because the parameters for both trajectors and landmarks are the same (i.e., the spatial indicator), we can combine them into a multi-class classification problem that classifies words in a sentence (i.e., head words) into r = {trajector, landmark, none}. We call this function R̂, and it takes a spatial indicator and tags words with these roles. We use a probabilistic classifier here, and to obtain deterministic classifications for landmarks and trajectors, we first compute

r_{w,s} = argmax_{x ∈ r} R̂(x | w, s, f(w, s, S))    (2)

where w is a word in sentence S, s is a spatial indicator, and f(w, s, S) denotes a set of features defined over the word w, the spatial indicator s, and the sentence S. This maximizes a probability function given a set of features. The details of this function are described in the next section. We continue with

T_s(w) = 1, if r_{w,s} = trajector; 0, otherwise
L_s(w) = 1, if r_{w,s} = landmark; 0, otherwise

From Equations 1 and 2, we see that a natural pipelined task decomposition presents itself. We can first find words that potentially carry a spatial sense (I(s) = 1), and we then find the corresponding trajectors and landmarks for each pivot. The general structure of our pipeline approach consists of the following steps, outlined in subsequent sections:

Finding spatial indicators: The first task consists of labeling parts of an input sentence S that play the spatial pivot role, or finding the preposition with spatial sense. Section 4.1 describes this step, which utilizes TPP data to learn the labeling task. As we see below, we reduce this step to finding potential spatial indicators by only considering a sentence's prepositions.

Finding spatial arguments: The second task consists of classifying parts of an input sentence S that play the landmark or trajector roles, given a (spatial) pivot. We employ two annotated datasets (CLEF and GUM (Maptask)) and describe this step in Section 4.2.

In an additional relation extraction phase, we assemble the results of the previous two steps to form spatial relation triplets with spatial indicators and their trajector and landmark arguments (see also Algorithm 1). This step is straightforward and involves no learning. We also investigate an alternative approach in which we tackle both steps jointly:

Finding spatial indicators and their arguments jointly: In this task, we do not use a separate preposition disambiguation step but instead learn to tag all words in a sentence jointly. The examples in the dataset are used to train a single classifier that assigns the spatial indicator, trajector, and landmark roles simultaneously. Classifications can therefore correlate without using additional data resources (e.g., TPP). Section 4.3 describes this approach.

The remainder of this section describes the features and algorithms we designed and implemented for the spatial relation recognition task.
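As a sketch of how Equations (1) and (2) could be operationalized, assume two trained probabilistic classifiers behind a predict_proba-style interface that returns a dict from label to probability; the classifier objects and feature functions below are hypothetical place-holders, not the authors' implementation.

```python
def spatial_indicator(w, S, indicator_clf, f):
    # Equation (1): deterministic decision from the probabilistic classifier.
    probs = indicator_clf.predict_proba(f(w, S))   # over {spatial, nonspatial}
    return 1 if max(probs, key=probs.get) == "spatial" else 0

def role(w, s, S, role_clf, f):
    # Equation (2): most probable role of w given spatial indicator s.
    probs = role_clf.predict_proba(f(w, s, S))     # over {trajector, landmark, none}
    return max(probs, key=probs.get)

def pipeline(S, indicator_clf, role_clf, f1, f2):
    """Find the spatial indicators first, then classify the remaining
    head words with respect to each pivot (the pipeline decomposition)."""
    pivots = [w for w in S if spatial_indicator(w, S, indicator_clf, f1)]
    return {s: {w: role(w, s, S, role_clf, f2) for w in S if w != s}
            for s in pivots}
```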

4.1 Learning Spatial Indicators

Various lexical categories (e.g., verbs, adjectives) can express spatial information, but prepositions primarily do so [Baldwin et al. 2009]. However, because prepositions often have different senses [Tratz and Hovy 2009; Litkowski and Hargraves 2007], we wish to recognize whether they convey a spatial sense. The sense of prepositions can be disambiguated by machine learning methods, as a large corpus exists for this purpose. We consider prepositions because of their importance and the feasibility of the disambiguation task. According to the aforementioned formalization, the set I contains only prepositions, and I(w) = 1 holds only for prepositions with spatial sense. We promote the use of a dedicated training scheme for preposition sense disambiguation rather than relying on other linguistic techniques to recognize spatial senses. The locatives recognized by SRL might seem to provide a solution, but this often does not hold. The following two examples stem from the preposition disambiguation dataset (TPP) [Litkowski and Hargraves 2007]:

(i) He saw Owen redden with pleasure and laughed flinging an arm about his shoulders...
(ii) This project compares assumptions incorporated into social policies about these obligations...

Table I. Labels assigned by a POS tagger, dependency parser and SRL to "about" in two senses.

Prep         POS   DepRel   SRL    sense
about (i)    IN    NMOD     Arg1   spatial
about (ii)   IN    NMOD     Arg1   topic

Table I shows the labels assigned by a part-of-speech (POS) tagger, a dependency parser, and SRL to the preposition "about". The parse tree, the dependency tree and even the semantic role labeler could not distinguish between the two senses of the preposition "about". We therefore propose to learn these senses from a corpus labeled with senses (TPP) provided for the preposition disambiguation task (SemEval-2007) [Litkowski and Hargraves 2007], featuring the category SpatialSense among others. More specifically, the component Î performs this preposition disambiguation task in Equation 1. It uses the following linguistically motivated, contextual features of the preposition that we aim to classify:

- The preposition itself
- By exploiting the dependency parser:
  - The words directly dependent on the preposition (head1)
  - The words on which the preposition is directly dependent (head2)
- For the predicates that have a dependency relation with the preposition: all words that are arguments of the predicate other than the preposition are added using a semantic role labeler
- For all extracted words satisfying the above conditions, the following features are also included:
  - The lemma
  - The part-of-speech tag (POS)
  - The type of dependency relation (DPRL)
  - The semantic role labels and, for predicates, the sense of the predicate (if assigned)

As an example, we present a sentence containing a preposition and the extracted features.

He saw Owen redden with pleasure, and laughed, flinging an arm about his shoulders...

{Preposition("about"), Preposition_DPRL("NMOD"), head1("arm"), head1_POS("NN"), head1_sense("arm.01"), head2("shoulders"), head2_POS("NNS"), head2_isarg("shoulders.01"), Preposition_POS("IN"), Preposition_isarg("A1 arm.01"), head1_lemma("arm"), head1_DPRL("OBJ"), head1_isarg("A1 flinging.01"), head2_lemma("shoulder"), head2_DPRL("PMOD"), head2_sense("shoulders.01")}

To identify the spatial prepositions, we use the TPP data provided for the preposition disambiguation task of SemEval-2007 [Litkowski and Hargraves 2007]. We extract the features from the training and test data and use a maximum entropy and a Naive Bayes classifier to disambiguate the prepositions' sense. This results in a binary classification of a preposition's spatial or nonspatial sense.
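A rough sketch of assembling such a feature set from parser output follows; the parse-object interface (dependents, head, and the token attributes) is hypothetical and stands in for a real dependency parser and semantic role labeler.

```python
def preposition_features(prep, parse):
    """Collect contextual features of a preposition as a feature dict,
    mirroring the feature list above (illustrative only; for brevity the
    sketch keeps one value per feature slot)."""
    feats = {"Preposition": prep.form,
             "Preposition_POS": prep.pos,
             "Preposition_DPRL": prep.deprel}
    related = {"head1": parse.dependents(prep),  # words depending on the preposition
               "head2": [parse.head(prep)]}      # word the preposition depends on
    for name, words in related.items():
        for w in words:
            feats[name] = w.form
            feats[name + "_lemma"] = w.lemma
            feats[name + "_POS"] = w.pos
            feats[name + "_DPRL"] = w.deprel
            if w.sense is not None:              # predicate sense, if assigned
                feats[name + "_sense"] = w.sense
    return feats
```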

4.2 Trajector and Landmark Classification

As explained in Section 4, a multi-class classifier R̂ must be trained to map each word w onto a class label from the set {trajector, landmark, none}, given a spatial indicator s. Because spatial indicator features are used to classify the roles of words in the sentence, the spatial indicator must be known before classifying trajectors and landmarks. Hence, we utilize the first step of preposition sense disambiguation, described in the previous section, to recognize the spatial indicators first, after which their arguments (trajectors and landmarks) can be classified.

The generic feature set used in Equation 2 can now be defined in more detail using three different sorts. The first set of features relates to the word that we aim to classify (f_1(w)), the second includes the features of the spatial indicator of which the word may be an argument (f_2(s)), and the third contains the features that relate the word to the sentence's indicator (f_3(w, s)). SRL inspired these features, but they center on the spatial indicator. As mentioned, features are defined for head words.

Features of a word w, f_1(w):
- The word (form) of w.
- The part-of-speech tag.
- The dependency relation to the syntactic head in the dependency tree.
- The semantic role.
- The subcategorization of the word (the sister nodes of its parent node in the tree).

Features of the spatial indicator s, f_2(s):
- The spatial indicator word (form).
- The subcategorization of s.

Relational features of w with respect to s, f_3(w, s):
- The path in the parse tree from w to s.
- The binary linear position of w with respect to s (e.g., before or not).
- The number of nodes on the path between s and w, normalized by dividing by the number of all nodes in the parse tree (to obtain an integer value, the ratio is inverted and rounded):

distance = (#nodes on the path between s and w) / (#nodes in the parse tree)

Take the following sentence as an example:

The vase is on the ground on your left.

Here, the input features for classification of "vase" with respect to the first "on" are:

f_1(w) = (vase, NN, SBJ, A0, NP-VP)
f_2(s) = (on, NP)
f_3(w, s) = (NN-NP-S-VP-PP-IN, true, 3)
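For instance, the distance feature could be computed as below; this is a small sketch under our reading of the inverted, rounded ratio, and the 17-node tree size is an assumed value chosen so the example yields the paper's value of 3.

```python
def distance_feature(nodes_on_path, nodes_in_tree):
    """Inverted, rounded version of (#nodes on path) / (#nodes in tree),
    yielding an integer-valued distance feature."""
    return round(nodes_in_tree / nodes_on_path)

# E.g., a path of 6 nodes (NN-NP-S-VP-PP-IN) in an assumed 17-node parse tree:
print(distance_feature(6, 17))  # -> 3
```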

A semantic role labeler is typically trained on a large external dataset. Using the assigned semantic roles as features brings in additional knowledge, which may not be present in the dataset used to train the spatial role labeler. This encourages the use of the semantic roles as features.

The task is now a multi-class classification problem in which each word, represented by a feature vector, is separately classified, assuming that these classifications are independent. We use such a model in our initial experiments. In subsequent models, words are also described by their features, but the class to which they are assigned depends not only on their own feature values but also on the other feature vectors and on the relations among the various classes. The obtained class of a word may constrain the class of the next word. We therefore employ several conditional random field (CRF) models. In these models, a sentence is a sequence of observations (i.e., words), w_1, ..., w_N, which can be represented using a probabilistic graphical model. Each observation can be described in terms of the described feature vectors, and the model outputs a label for each word in the sequence.

After recognizing the trajector and landmark given a spatial indicator, we have all the relation elements. Relation extraction is performed in a straightforward way, by assembling all extracted spatial indicators, trajectors, and landmarks and combining them into spatial relation triplets. Algorithm 1 shows the entire process, based on preposition disambiguation and trajector/landmark classification.

Algorithm 1 Spatial-Relation-Extraction(S : sentence) returns relations SR
 1: {preposition disambiguation}
 2: for all w ∈ S do
 3:   estimate Î(w) using the trained probabilistic classifier and
 4:   construct the set I of all spatial indicators of the sentence S
 5: for all s ∈ I do
 6:   {trajector and landmark classification}
 7:   for all w ∈ S do
 8:     apply the probabilistic multi-class classifier R̂ and
 9:     construct the sets T_s and L_s according to the assigned labels
10:   if T_s = ∅ then T_s ← {undefined}
11:   if L_s = ∅ then L_s ← {undefined}
12:   {relation extraction}
13:   SR ← SR ∪ { ⟨s, t, l⟩ | t ∈ T_s, l ∈ L_s }
14: return SR
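Rendered as Python, Algorithm 1 might look as follows; is_spatial and classify_role are hypothetical stand-ins for the trained classifiers Î and R̂.

```python
def spatial_relation_extraction(sentence, is_spatial, classify_role):
    """Sketch of Algorithm 1: preposition disambiguation, then
    trajector/landmark classification, then relation assembly."""
    SR = []
    # Preposition disambiguation: build the indicator set I.
    indicators = [w for w in sentence if is_spatial(w, sentence)]
    for s in indicators:
        # Trajector and landmark classification with respect to pivot s.
        roles = {w: classify_role(w, s, sentence) for w in sentence if w != s}
        T = [w for w, r in roles.items() if r == "trajector"] or ["undefined"]
        L = [w for w, r in roles.items() if r == "landmark"] or ["undefined"]
        # Relation extraction: all <s, t, l> combinations.
        SR.extend((s, t, l) for t in T for l in L)
    return SR
```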

4.3 Learning Spatial Relations without a Priori Spatial Indicator Classification

The spatial role labeling task can be seen as a joint classification task: to predict each triplet of segments as being in the indicator-trajector-landmark relation or not. In the previous section, we outlined a pipelining method for spatial role labeling, where a preposition (i.e., spatial indicator) is classified as spatial or nonspatial and the trajector and landmark are then sought for the obtained spatial indicators. Our focus on prepositions added one constraint to this task: the indicator should be a preposition (a realistic bias in English). The main purpose of this pipeline approach is to exploit a large external data source (TPP) for spatial sense disambiguation.

Combining the two steps of the pipeline provides another option for learning spatial relations. We could omit the first step of using a dedicated classifier for spatial sense recognition, and learn to assign all spatial roles jointly, i.e., tagging words with trajector, landmark, spatial indicator or none, based on a training dataset. To train the classifier, we can employ a procedure and examples as in the pipeline setting, but the classifier must then learn one more label (spatial indicator). When testing and evaluating the classifier on a new (unlabeled) sentence S, S can contain several prepositions with spatial sense and many trajectors and landmarks, whereas the classifier can only assign a single label to each word. The solution we use here is to, again, generate multiple examples from S, where each example contains a designated pivot with specific features extracted for that word (e.g., path features from words to the pivot). For each example, the words are classified using these features. One must theoretically generate as many examples as there are words in S; in practice, it suffices to do this only for pivots that are prepositions. The main advantage of this setting is that the learning algorithm gets the freedom to classify trajectors, landmarks and indicators in the context of one another.

In the relation extraction step, we perform the same general steps as in Algorithm 1, differing primarily in that we take all prepositions as possible spatial indicators in the preposition disambiguation phase (lines 1-4) and that the classifier R̂ now uses all roles, including spatial indicator. This allows multiple words to be classified as spatial indicators in one sequence and could in principle allow the extraction of spurious relations. However, due to the learning bias (i.e., each example contains only one targeted preposition), we discovered that spurious relations are rarely extracted. While, on the one hand, the joint setting enables a learning algorithm to use the information in the data without depending on external data resources, on the other hand, there is a hazard of becoming specialized to the spatial preposition distribution in the available data. The experimental results section empirically investigates this trade-off.

4.4 Algorithms

A conditional random field (CRF) is a state-of-the-art model for context-dependent classification. A CRF is an undirected graphical model, or Markov random field, conditioned on a set of observations X to predict a set of output variables Y. We define G = (V, E) as an undirected graph (with vertices V and edges E) such that a node v ∈ V corresponds to each random variable, and V = X ∪ Y. We denote an assignment to X by x, an assign-

ment to a set A ⊆ X by x_A, and similarly for Y. If each random variable y ∈ Y obeys the Markov property with respect to G, then (Y, X) is a conditional random field. This model represents a probability distribution over a large number of random variables by a product of local functions that each depend on a small subset of variables. This factorization of the global probability distribution makes learning and inference feasible. A CRF generally defines a probability distribution p(y|x) as follows:

p(y|x) = (1/Z(x)) ∏_A Ψ_A(x_A, y_A)

in which Ψ_A(x_A, y_A) is a potential function, where Ψ_A : V^n → R+, and Z(x) is the normalization factor:

Z(x) = Σ_y ∏_A Ψ_A(x_A, y_A)

The potentials factorize according to feature functions:

Ψ_A(x_A, y_A) = exp( Σ_{k=1}^{K(A)} λ_{Ak} f_{Ak}(x_A, y_A) )

Finally, the conditional probability is the following:

p(y|x) = (1/Z(x)) ∏_{A∈G} exp( Σ_{k=1}^{K(A)} λ_{Ak} f_{Ak}(x_A, y_A) )

For the CRF experiments we use Mallet2 and GRMM.3

Linear-chain CRF. The structure of graph G is theoretically arbitrary; however, when modeling sequences (in our case, the words of a sentence), the simplest graph is a linear-chain CRF in the form of an (often first-order) Markov chain [Lafferty et al. 2001; Sutton and McCallum 2006]. In this setting, the spatial role label of a word in the sentence depends on the label of the word in the previous position. Considering sequential relationships can increase the learning model's accuracy. The conditional probability p(y|x) is

p(y|x) = (1/Z(x)) ∏_{t=1}^{N} Ψ_t(y_{t-1}, y_t, x)

where X = (x_1, ..., x_N) is a sequence or other structured set of observations and Y = (y_1, ..., y_N) is the corresponding set of labels assigned to X. In the spatial role labeling task, X ranges over the words of a sentence, while Y ranges over the classes trajector (tr), landmark (lm), spatial indicator (si, in the joint setting) or none of these (none). Ψ_t(y_{t-1}, y_t, x) is a potential function, a real-valued function that captures the degree to which the assignment y_t to the output variable fits the transition from y_{t-1} given X. The potentials typically factorize according to a set of features F = {f_k} such that Ψ_t(y_{t-1}, y_t, x) = exp( Σ_{k=1}^{K} λ_k f_k(y_{t-1}, y_t, x) ). The linear-chain CRF setting of Mallet uses a forward-backward algorithm to compute the marginal distributions and the Viterbi algorithm to compute the most probable sequence label assignment. For our task, allowing transitions unobserved in the training data during the inference and prediction phases adds more flexibility to the model, particularly when there are few training examples. This setting is called fully-connected in the Mallet tool, and we use it in our experiments. We refer to this setting as linear-chain CRF (FC).

2 http://mallet.cs.umass.edu/download.php
3 http://mallet.cs.umass.edu/grmm/index.php
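The paper runs its CRFs in Mallet and GRMM; purely as an illustration of the same model family, a comparable linear-chain CRF can be set up with the sklearn-crfsuite package, where all_possible_transitions=True plays a role similar to Mallet's fully-connected setting. This is an analogy we draw, not the authors' setup, and the toy features below stand in for the full feature set of Section 4.2.

```python
import sklearn_crfsuite

# X: per-sentence lists of per-word feature dicts; y: per-word role labels.
X_train = [[{"word": "The"}, {"word": "book"}, {"word": "is"},
            {"word": "on"}, {"word": "the"}, {"word": "table"}]]
y_train = [["none", "trajector", "none", "sp.indicator", "none", "landmark"]]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    max_iterations=100,
    all_possible_transitions=True,  # allow label transitions unseen in training
)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # per-word role predictions for each sentence
```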

[Fig. 2. Graphical representation of the CRF with the preposition template for "The book is on the table." Prepositions are connected to the candidate trajectors and candidate landmarks, i.e., noun phrases. Factors occur as black squares.]

General CRF with preposition template. In many relation extraction tasks, certain long-distance dependencies between entities play an important role. In our task, prepositions primarily play a spatial indicator role, while trajectors and landmarks are noun phrases. There can be many words in between the roles in the sentence that have no particular role and are assigned the none label. In light of this, we apply a version of a skip-chain CRF [Sutton and McCallum 2006] to account for the probabilistic dependencies between distant labels. These dependencies are represented by augmenting the linear-chain CRF with factors that depend on the labels of the sentence's pivot preposition and noun phrases. The features on skip edges can incorporate information from the context of both endpoints, so strong evidence at one endpoint can influence the label at the other endpoint. In our skip-chain CRF model, we exploit two clique templates: one is the normal sequential part (connecting neighboring words) and the other connects pivot prepositions to candidate trajectors and landmarks. Following the related work [Sutton and McCallum 2006], the set of all pairs of positions for which there are skip edges (i.e., between prepositions and nouns) is represented as PN = {(u, v)}; the probability of label sequence y given input x is

p_θ(y|x) = (1/Z(x)) ∏_{t=1}^{N} Ψ_t(y_t, y_{t-1}, x) ∏_{(u,v)∈PN} Ψ_uv(y_u, y_v, x)

where the Ψ_t are factors for sequential relations and the Ψ_uv are factors over skip edges. We define the factors as

Ψ_t(y_t, y_{t-1}, x) = exp( Σ_k λ_{1k} f_{1k}(y_t, y_{t-1}, x, t) )
Ψ_uv(y_u, y_v, x) = exp( Σ_k λ_{2k} f_{2k}(y_u, y_v, x, u, v) )

where θ_1 = {λ_{1k}}_{k=1}^{K1} are the parameters of the linear-chain template and {f_{1k}} is the related set of feature functions or sufficient statistics. Similarly, θ_2 = {λ_{2k}}_{k=1}^{K2} are the parameters of the preposition template, and {f_{2k}} is its related set of feature functions or sufficient statistics. The full set of model parameters is θ = {θ_1, θ_2}. We use loopy belief propagation as the approximate inference algorithm in our experiments.

We compare the results of the CRFs with two baseline approaches:

MaxEnt (baseline) model. As a baseline learning model, we classify the words of a sentence independently using a standard maximum entropy classifier.

Simple baseline. To encourage the use of machine learning, a simple baseline is employed: given a spatial preposition, the first head word before the preposition is taken as the trajector and the first head word after the preposition as the landmark. There is no learning from data in this setting, but the dependency tree is exploited to discover dependent head words.
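The simple baseline is easy to state in code; a small sketch under the assumption that head words and their sentence positions have already been identified from the dependency tree.

```python
def simple_baseline(head_words, prep_index):
    """Nearest head word before the spatial preposition -> trajector;
    nearest head word after it -> landmark (no learning involved)."""
    before = [w for i, w in head_words if i < prep_index]
    after = [w for i, w in head_words if i > prep_index]
    trajector = before[-1] if before else "undefined"
    landmark = after[0] if after else "undefined"
    return trajector, landmark

# "The vase is on the ground ...": head words "vase" (1) and "ground" (5),
# spatial preposition "on" at position 3.
print(simple_baseline([(1, "vase"), (5, "ground")], 3))  # ('vase', 'ground')
```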

5. EXPERIMENTAL STUDY

In this section, we report on a series of experiments to evaluate the various components of the spatial role labeling and relation extraction tasks.

5.1 Structure and Goals of the Experimental Setup

We present our leading research questions and identify the sections where we experimentally answer them.

Which data resources are available, or can be generated, to learn the spatial role labeling task from data? We answer this question in Section 5.2. In our experiments, we clearly want to solve the spatial role labeling task for unrestricted natural language input. However, we are limited by the amount of available data for machine learning. We describe our novel data resources and summarize their statistics.

How can we detect the spatial sense of prepositions using available resources? We answer this question in Section 5.3. We first investigate whether other resources (e.g., locatives obtained from SRL) can help and what benefits lie in directly learning the spatial sense from a large, external, available data source (TPP).

If we assume that the spatial sense of a preposition is known or learned beforehand, how can we learn its corresponding trajectors and landmarks from data? In Section 5.4, we present various classifiers that take a given (spatial) preposition as input and label the corresponding arguments (landmark and trajector) of the predicate the preposition represents.

What benefits lie in the sequential nature of finding the spatial sense of a preposition and then finding trajectors and landmarks (the so-called pipeline technique)? In Section 5.4.1, we first decouple the two problems and focus solely on the situation in which the spatial sense is known perfectly. In Section 5.4.2, we investigate two different situations where we fully automate the task, i.e., we use the preposition disambiguation output as input for spatial relation recognition. Without ground-truth data on the spatial sense of the prepositions, some landmarks or trajectors cannot be found because this spatial sense is classified incorrectly. We investigate a setting in which unknown prepositions are classified as spatial by default and another in which they are nonspatial by default.

What benefits lie in jointly recognizing spatial indicators, trajectors and landmarks, and how can long-distance dependencies help in this setting? In Section 5.4.3, we investigate an approach in which we learn to tag words with an extended label set that includes spatial indicators. This side-steps preposition disambiguation as a separate phase; thus, classifications depend only on the information in one training dataset.

How do different pipelining methods affect the accuracy of whole-relation extraction? In Section 5.5, we perform experiments in which we measure the accuracy of different pipelining techniques on whole-relation extraction (thus finding the correct spatial indicators, i.e., prepositions, and their correct landmarks and trajectors).

What is the effect of the used features on the extraction task? Section 5.6 discusses the effects of a leave-one-out feature analysis.

What is the cross-domain performance of the approach on an unrestricted natural language text that contains both spatial and nonspatial information? In Section 5.7, we apply our system to several small, general, unrestricted natural language texts to evaluate performance on data outside the training domain.

What are the main sources of errors in our approach? In Section 5.8, we investigate the errors made in 50 sentences of our dataset. We distinguish five general categories of errors, including nested spatial relations and spatial focus shift. The errors caused by different model characteristics and by different data domain characteristics are investigated in two separate subsections.

5.2 Dataset Description

For our experimental analysis, we use several manually annotated datasets. We describe their characteristics and usefulness for our study in this section. Statistics for the corpora are presented in Table II.

TPP dataset. For the preposition disambiguation task, we employ the standard test and training data provided by the SemEval-2007 challenge [Litkowski and Hargraves 2007]. It contains 34 separate XML files, one for each preposition, totaling over 25,000 instances with 16,557 training and 8,096 test example sentences; each sentence contains one example of the respective preposition.

GUM (Maptask) dataset. Because the spatial role labeling task is newly defined, there is no annotated English corpus available. However, the GUM (General Upper Model) evaluation data [Bateman et al. 2007], comprising a subset of a well-known corpus for spatial language, is a useful dataset. It has been used to validate the expressivity of spatial relations in the GUM ontology. Currently, the dataset contains more than 300 English examples and 300 German examples. We used 100 English samples in this corpus that are originally from the Maptask corpus. An example is the GUM annotation for the sentence "The destination is beneath the start":

SpatialLocating(locatum: destination, process: being, placement: GL1(relatum: start, hasSpatialModality: UnderProjectionExternal))

Here, relatum and locatum are alternative terms for landmark and trajector. The spatial modality is the spatial relation mentioned in the specific spatial ontology. The corpus contains 65 trajectors and 69 landmarks appearing in 112 spatial relations. Each sentence produces as many spatially labeled sequences as it has prepositions: 122 sequences for GUM (Maptask). Although complete phrases are annotated in this dataset, we only use a phrase's head word with trajector (tr) and landmark (lm) labels and their