Passage Retrieval

A thesis submitted by Martin Choquette, Fitzwilliam College, Cambridge, as part of the MPhil course in Computer Speech and Language Processing, University of Cambridge, 30 August 1996

Abstract

This work describes a syntactically-based approach to passage retrieval and how it is embedded in a document retrieval environment to provide users with useful summaries of the retrieved documents in connection with their information need. Of particular interest was the possibility of using syntactic analysis to bring various syntactic constructs into equivalent forms, thus making the comparison of passages more precise. Preliminary results suggest that, in the context of large scale retrieval systems, and leaving aside the problem of semantic equivalence of terms, the derivation of valuable constructs cannot realistically be based on syntactic considerations alone. In particular, the use of very stringent search expressions defined from syntactic constructs can give rise to undesirable side effects unless carefully controlled.

Contents

1 Introduction
2 Passage Retrieval
  2.1 Passage Retrieval and its Applications
  2.2 Passage Retrieval and its Implications
  2.3 Syntactic Analysis and Compound Terms
  2.4 Approach
3 Implementation
  3.1 Tagging
    3.1.1 Structural Tagging
    3.1.2 Part-of-Speech Tagging
  3.2 Document Retrieval
    3.2.1 Indexing
    3.2.2 Matching
  3.3 Passage Retrieval
    3.3.1 Implementing Compound Terms
    3.3.2 Deriving a List of Compound Terms
    3.3.3 Parsing
    3.3.4 Dealing with Ambiguities
    3.3.5 Paragraph Scoring Function
4 Evaluation
  4.1 Grammars Tested
  4.2 Testing
  4.3 Results and Discussion
5 Conclusion
A An Example of Output
B Requests
C Grammars

D Different queries for the same request
E Tables of Retrievals

Chapter 1 Introduction

The increasing availability of full texts in electronic form induces the need for good retrieval tools. In cases where users must deal with a large number of possibly relevant documents, they might find it helpful to be presented with short extracts of the retrieved documents, especially if these also turn out to be large documents. The extracts, when properly chosen (i.e., relevant), may thus provide users with indicators that the information they are looking for can be found in a given document. Ideally, they could provide summaries of the retrieved documents in connection with the information need of the users, which they express through requests.

The present work deals with passage retrieval, the process of attempting to retrieve the most relevant passages inside a text by computational means. In particular, we will be most concerned with the use of syntactic analysis in the context of passage retrieval; the main motivation being that syntactic analysis can help in coping with concepts expressed with more than one word, such as "oil spill", "ethical conduct", "plan to cut taxes", "charges of obstruction of justice" and so on. A program has been written to demonstrate the practicability of this approach in a large scale retrieval environment and to test the effect of representing concepts with various degrees of precision. Given a request, the program retrieves the most likely relevant documents available and displays the most likely relevant passages within each document in connection with the request.

The next chapter presents a number of actual applications related to passage retrieval and some requirements brought about by passage retrieval. It then proceeds with a brief survey of recent work on syntactic analysis applied on a large scale and ends with a description of the approach adopted in this work. Chapter 3 goes into further detail by describing the system from an implementation point of view, while Chapter 4 reports on the performance of the system in terms of efficiency and the effects of using increasingly stringent concept descriptions on retrieval performance. Finally, conclusions on the whole project are drawn.

Chapter 2 Passage Retrieval

Passage retrieval is closely related to information retrieval, the meaning of which, as a field of research in library and computer sciences, seems to broaden as technology evolves [1]. In the beginning, the purpose of information retrieval systems was to provide users with references to off-line documents which were likely to be relevant to their information need. To characterise this first meaning of information retrieval, many authors use the term document retrieval. As the storage capacity of computer systems evolved, it then became possible for retrieval systems to present the user with the complete text of documents, mainly articles and reports. To denote this more specific form of document retrieval, in which the full text of a document is made available to the user as opposed to a brief summary in the form of a title or an abstract, the term text retrieval was introduced. But, as full text documents of all sorts become increasingly available in electronic form, being able to extract only parts of them appears to be one of the most appealing solutions to the problem posed by their diversity, both in terms of document content and document length. Passage retrieval addresses this particular issue of retrieving extracts of documents which might immediately satisfy the user's information need.

This chapter presents various ways of dealing with passages and their implications, one of which is the need for detailed analyses of full text. Accordingly, Section 2.3 then proceeds with a description of recent work involving syntactic analysis in the context of large scale text processing, as in the TREC [2] and MUC [3] initiatives. The chapter ends with an overview of the retrieval system developed as part of this MPhil project.

2.1 Passage Retrieval and its Applications

The idea of retrieving passages can be exploited in various ways. I will mention here only a few (see Salton et al. [4, 5, 6] for further applications). In this work, passage retrieval is mainly viewed as an indication tool in a document

retrieval system. Given a document, the passage retrieval system extracts and displays the passages in the text which best match the corresponding query (i.e., the representation of the request used to assess its resemblance to other texts). Thus, regardless of the length of the document, users can tell at a glance whether a given document is likely to answer their information need. Moreover, if the document turns out to be of interest, the excerpts submitted by the passage retrieval system can lead users directly to the relevant parts of the text. This, in fact, turns out to be very helpful, if not essential, when dealing with documents comprising up to 400,000 words, and has been acknowledged by many [7].

Another usage of passage retrieval is as answer-reporter and answer-indicator [8]. If passages are retrieved and displayed, regardless of the documents they are in, according to their sole resemblance to, say, a question, they are likely to provide users with a direct answer (answer-reporter) or with an indication that the answer can be found in a particular document (answer-indicator). The passage retrieval system can thus provide specific facts to users on a particular subject. This application can be viewed as an alternative to more elaborate, but also more domain-specific, question-answering systems.

Many researchers in information retrieval suggest that passage retrieval techniques can help to improve the performance of document retrieval systems (see for instance [9, 5, 7, 10]). This is justified by the fact that document retrieval systems rely on the concentration of terms related to the request to assess the likelihood of relevance of a whole document. But since document collections can be greatly heterogeneous, an unbiased measure of this concentration cannot be obtained by looking only at the global properties of a document, such as its length or the number of occurrences in it of a particular term. Passage retrieval, by providing a more detailed view of candidate retrieved documents, can therefore give a more accurate picture of the co-occurrence of terms within documents and thus give rise, by re-ranking documents, to better document retrieval performance. This approach is often referred to as global-local match. To illustrate it more concretely, suppose that a query consists of two terms and that these terms occur exactly the same number of times in documents, say, A and B. This is the global match. From a global retrieval point of view, A and B are equivalent. But suppose further that the two query terms always appear together in document A but never in B. Then, from a local point of view, A is likely to be more relevant than B since, according to the query, the two terms seem to be related.

2.2 Passage Retrieval and its Implications

In many respects, passage retrieval is similar to document retrieval: we try to match a query with some representation of a passage. In fact, a crude form of passage retrieval would be to apply to passages the same retrieval techniques used for documents. Why can't passages be considered as documents after all?

The answer lies in the very nature of documents and passages, and looking at some properties of the two will show that the similarity between document and passage retrieval can be an asset, but certainly not an end. Unlike documents, passages do not have well-defined boundaries a priori. What is a passage? Callan [9] distinguishes three types of passages:

discourse passages: discourse units like sections, paragraphs, sentences and so on, which appear as such in the text;

semantic passages: passages based upon the content of the text. O'Connor [11], for instance, suggests using cue phrases, such as "however", "on the other hand" and so on, together with query terms in order to delimit a passage; and

window passages: passages defined as (possibly overlapping) windows comprising approximately the same number of words.

These types of passages have different properties for a given application. The choice of a particular type of passage therefore merits some consideration.

A document treating of a particular subject is likely to comprise many passages in which single query terms occur, making a discriminating assessment of the likelihood of relevance of these passages hardly practicable on the basis of single terms alone. Passage retrieval therefore requires a deeper analysis taking into account the relationships between the terms as they occur in the original texts. Deeper statistical analysis techniques rely on characteristics directly derivable from the text itself, such as proximity of the terms [12], or from a collection of documents, such as the number of documents in which a term occurs [13]. Other analysis techniques, on the other hand, rely on various external sources of knowledge about the terms, or groups of terms, to assert the relations holding between them. In the context of information retrieval, these techniques must show a good balance between efficiency and effectiveness [14], where efficiency has to do with speed and economy of means and effectiveness is defined in terms of precision (the proportion of retrieved passages which are relevant) and recall (the proportion of relevant passages which are retrieved). Furthermore, the techniques must be robust enough to deal with virtually any kind of input.

2.3 Syntactic Analysis and Compound Terms

The core of the work described in this thesis relies on syntactic knowledge. It aims at using linguistically motivated techniques to derive compound terms, i.e., combinations of single terms. The use of compound terms (or compounds for short) has proved effective to a certain extent in document and text retrieval [15,

16]. However, there seems to be no gain in using syntactic instead of statistical methods to derive compounds. This result is rather counterintuitive if we consider the various legitimate syntactic variants of an expression, the equivalence of which cannot be precisely recognised by statistical means alone, and calls for further investigation. What follows is a brief presentation of recent related work in which syntactic analysis is used to derive compounds.

Much effort has been made in recent years to develop syntactically-based systems which can operate on large sets of texts. This implies both robustness and efficiency. Appelt et al. [17] have shown with FASTUS that shallow parsing can be perfectly appropriate in such circumstances for particular tasks (in their case, information extraction). They use a series of finite-state machines to detect syntactic constructs which can be reliably identified, i.e., noun and verb groups, along with single words which were critical for their purposes, such as prepositions, conjunctions and so on. In the domain of information retrieval, Evans et al. (CLARIT-TREC) apply basic syntactic processing to generate what they call "simplex noun phrases", consisting of the head noun and all pre-nominal modifiers. The corresponding compounds are the same simplex noun phrases in which each single term is normalised to its root form.

Strzalkowski [18] proposes a different and more comprehensive approach to extracting compound terms. The parser he uses attempts first to generate a complete, and syntactically sound, analysis for each sentence. If, after a fixed period of time, a complete parse has not been found, the parser enters a "skip-and-fit" mode in which some portions of text may be skipped in order to "fit" the parse. A complete analysis makes possible the identification of the main verb of a sentence as well as its subject, object and subordinate clauses. This additional information is then used to build compounds in which terms are chosen and brought together not only according to their syntactic function but also according to their role within the sentence. For instance, compound terms in Strzalkowski's system consist of "head-modifier" pairs of terms of the following types: (1) head noun and its left modifier; (2) head noun and head of its right modifier; (3) main verb of a clause and head of its object phrase; and (4) head of the subject phrase and the main verb. Again, here, terms are normalised to their root form, but this normalisation is performed by a dictionary-assisted suffix trimmer, as opposed to traditional morphological stemmers.

2.4 Approach

Taking into consideration the requirements brought about by both document and passage retrieval (Sections 2.1 and 2.2), as well as the warning at the beginning of the previous section about syntactically-derived compounds, the present work has the following specific objectives:

1. devising a general, robust and reasonably efficient way of deriving

compound terms from full text by syntactic means;

2. in particular, regarding objective 1, being able to deal with parsing ambiguities;

3. identifying various forms of compounds which can be extracted from single syntactic constructs, studying their behaviour in the context of passage retrieval and trying to find the most promising ones for passage selection (as opposed to ranking);

4. determining various forms of syntactic constructs which should be conflated into equivalent compound forms so that they can match (in other words, allowing for syntactic variants);

5. integrating the whole in a document retrieval framework; by

6. conceiving a passage-query matching function capable of exploiting the information made available by the document retrieval module while satisfying the syntactic requirements.

This section gives an overview of what has been done in this project as an attempt to meet these objectives. Since the matching of passages relies in turn on the matching of the syntactic expressions they comprise, we will be mainly concerned with the problem of matching these expressions. The problem of equivalence of variants is central to our approach. The following discussion will be based on the need for matching the variants below:

Wall Street investment banking sources involved (2.1)
Wall Street investment banking sources involvement (2.2)
involvement of Wall Street investment banking sources (2.3)
involving Wall Street banking sources (2.4)

Our goal is to reduce all these expressions to some representations that will be considered as equivalent by the matching function. The solution adopted to match expressions (2.2) and (2.1) is indeed a very common one, namely, stemming. Thus, by reducing the words to their root form, morphological variants become equivalent. For instance, "involved" and "involvement" are both reduced to "involv".

Parsing comes into play when it comes to making expressions such as (2.2) and (2.3) equivalent. In this case, after stemming, the words need to be reordered. This is done, in our framework, by defining grammar rules capable of recognising the two different forms and, once these forms have been recognised, by applying a number of corresponding transformation rules to them. We could, for instance,

define a rule stating that whenever the form "NG1 of NG2" is recognised, it must be transformed into "NG2 NG1". This rule would suffice to make expressions (2.1), (2.2) and (2.3) not only equivalent but identical, all being represented by the single compound:

wall street invest bank sourc involv (2.5)

The same compound extraction scheme is applied to expressions occurring in requests as well as in candidate passages. As this latter example shows, partial parsing can prove to be sufficient for our purposes. This work, in particular, favours shallow parsing and gives much attention to noun groups and noun phrases. Although noun phrases do not fall into the category of constituents which can be reliably identified (because of the ambiguities due to the problem of attachment of prepositional phrases), shallow parsing was still applied in our work to derive compounds from noun phrases. It was felt, as the above example shows, that they could help in obtaining valuable compounds. Although we favoured shallow parsing, in order to meet our experimental objectives, our program has no pre-defined grammar. It is up to the experimenters to provide the system with grammars which suit their needs.

Back to the variants problem, it remains to match expression (2.4) with all the others. This is done by allowing partial matches, as opposed to strict matches in which all the terms must be in a one-to-one correspondence. Thus, if the terms in expression (2.4) are properly reordered and if we allow two compounds to match provided that they can be made identical only by deleting some words (as the matching function I adopted requires), the compounds:

wall street invest bank sourc involv (2.6)
wall street bank sourc involv (2.7)

will be considered as equivalent. The matching of semantic variants, such as "gains" and "profits", though desirable, is not addressed in this work. (A toy illustration of the stemming and reordering steps is given at the end of this discussion.)

In the context of this work, passages correspond to paragraphs. Paragraphs usually bear enough context to constitute short, autonomous and meaningful units, so that they lend themselves readily to our summarisation purposes. Semantic and window passages have not been considered, since the former usually imply a deeper linguistic analysis than what we aim at and the latter would result in rather unnatural summarisation units. Although this work does not address the issue of interaction with users, this is an example of a decision in which they were (and had to be) taken into consideration at the time of designing the system.
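To make the stemming and reordering steps concrete, the fragment below is a toy illustration in C (the language of the actual system, although none of this code comes from it). The suffix list is a crude stand-in for Porter's algorithm, and only the single transformation rule "NG1 of NG2" -> "NG2 NG1" is applied, on whole phrases rather than on parsed noun groups:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

static const char *SUFFIXES[] = { "ement", "ment", "ing", "ed", "es", "s" };

/* Crude stand-in for the Porter stemmer used by the real system. */
static void stem(char *w) {
    size_t n = strlen(w), i;
    for (i = 0; i < n; i++) w[i] = (char)tolower((unsigned char)w[i]);
    for (i = 0; i < sizeof SUFFIXES / sizeof *SUFFIXES; i++) {
        size_t s = strlen(SUFFIXES[i]);
        if (n > s + 2 && strcmp(w + n - s, SUFFIXES[i]) == 0) {
            w[n - s] = '\0';
            return;
        }
    }
}

/* Emit the stems of a whitespace-separated word list, in order. */
static void emit_stems(const char *words) {
    char buf[256], *tok;
    strncpy(buf, words, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (tok = strtok(buf, " "); tok; tok = strtok(NULL, " ")) {
        stem(tok);
        printf("%s ", tok);
    }
}

/* Reduce a noun phrase to a compound, applying the rule
   "NG1 of NG2" -> "NG2 NG1" before stemming. */
static void compound(const char *phrase) {
    const char *of = strstr(phrase, " of ");
    if (of) {
        char left[256];
        size_t n = (size_t)(of - phrase);
        memcpy(left, phrase, n);
        left[n] = '\0';
        emit_stems(of + 4);   /* NG2 first */
        emit_stems(left);     /* then NG1 */
    } else {
        emit_stems(phrase);
    }
    printf("\n");
}

int main(void) {
    compound("Wall Street investment banking sources involved");
    compound("involvement of Wall Street investment banking sources");
    return 0;
}

Run on expressions (2.1) and (2.3), both calls print the same compound (2.5), wall street invest bank sourc involv, which is the sort of conflation the grammar and transformation rules of Chapter 3 achieve in a principled way.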

As is now well known, the basic matching units in this work are compound terms, and both requests and paragraphs are consequently transformed into sets of compounds. But the choice of compounds as units of representation may rightly be questioned. Why not use trees or any more elaborate representation which can more adequately retain the richness of syntactic analysis? It is primarily a matter of robustness and convenience. Defining a robust and comprehensive grammar for English clearly falls outside the scope of this work. Furthermore, trying to match two texts on the basis of syntactic trees or graphs is a very complex task [19]. Compounds, on the other hand, as far as matching is concerned, can be treated, as a first approximation, just like single terms. Unlike single terms, however, they can express relationships between words. The next chapter describes at greater length the solutions presented in this section.

Chapter 3 Implementation

The passage retrieval system described in this chapter consists of a conventional document retrieval system, using only statistics on single terms, supplemented with a syntactically-based passage retrieval module. As figure 3.1 outlines, the document retrieval system first finds, for a given request, the N_D best matching documents, which are then read and processed by the passage retrieval module in order to find the N_P best matching paragraphs within each document. The data file, i.e., the file comprising all the retrievable documents, is a collection of stories from the Wall Street Journal. An example of output is given in appendix A.

The parser developed in this project requires that the words in the original text be unambiguously tagged beforehand. A tagging step therefore precedes the use of any text by our system, and tags, rather than words or stems, are used at the time of parsing.

The bulk of the system described here has been implemented in C. Only the part-of-speech tagger, kindly made available by Dr Stephen Pulman, and the grammar compiler are written in Prolog. The grammar compiler generates code, encoding the grammar rules in C, which can then be compiled and linked with the other modules. The document retrieval module makes use of the sort UNIX utility, and two sets of publicly available programs are used to stem words [20] and to find minimal perfect hash functions [21].

The following discussion is in step with the outline given in figure 3.1. First, the tagging process is described, then the document retrieval one, and a more detailed presentation of the passage retrieval process ends this chapter.

3.1 Tagging

The retrieval of parts of a document along with the syntactic analysis of its content implies the use of a significant amount of information which must be derived by processing raw material. Since the required information does not

/* Pre-processing: needed only when the document collection is created or updated */
tag the data file to obtain the tagged data file TDF
index the content of TDF to obtain the collection representation CR

/* Retrieval process */
tag the request file to obtain the tagged request file TRF
for each request R in TRF do
    /* Document retrieval */
    derive query Q from R
    build DL(Q), the list of the N_D best matching documents for query Q
    /* Passage retrieval */
    build the list of query terms QTL from request R
    for each document D in DL(Q), in decreasing order of matching score do
        reset PL(Q, D), the list of the N_P best matching paragraphs in D for query Q
        for each paragraph P in D do
            build PTL, the list of terms in P
            let score(Q, P) = match(QTL, PTL)
            if score(Q, P) is amongst the best scores then
                add P and score(Q, P) to PL(Q, D)
        print the N_P best matching paragraphs of D according to PL(Q, D)

Figure 3.1: Overview of the retrieval process.

depend on subsequent operations, it only makes sense to encode both raw data and information into a convenient file format. The file thus obtained, which will be referred to as the tagged file, is then used in place of the source file. In addition to avoiding redundant computation of the same information whenever it is needed, the use of the tagged file simplified the conception of other parts of the program. Moreover, the tagging step helps considerably in reducing the number of parsing ambiguities.

The tagging step can be seen as consisting of two types of more specific tagging. First, a structural tagging is performed in order to identify all the constituents comprised in a file. Within the context of our work, the constituents vary in size from a whole Wall Street Journal story down to sentence constituents, i.e., words, numbers, punctuation marks and so on. Structural tagging is then followed by a part-of-speech tagging, in which each sentence constituent is assigned a syntactic category. The whole process is illustrated in figure 3.2.

[Figure 3.2: The tagging process. The raw data (request) file is tokenised, with the help of an abbreviation file, into a token file (structural tagging); the token file then passes through the statistical tagger and, finally, the morphological tagger to yield the tagged data (request) file.]

3.1.1 Structural Tagging

A text source file consists of a sequence of stories from the Wall Street Journal. As figure 3.3 shows, some of the constituents are already delimited by SGML-like markers. The <DOC> and </DOC> markers, for instance, denote respectively the start and the end of a story. This story in turn is divided into various constituents. Of all these constituents, only the document identifier (DOCNO), the headline (HL) and the actual text of the story (TEXT) are retained. Other constituents, which do not seem to bear any useful information for our purpose, are discarded.

The constituents which are explicitly marked up at the tokenisation step are paragraphs, sentences and sentence constituents. Paragraphs can be easily spotted in the source file by using the fact that they are separated by at least one blank line. The task of correctly identifying sentences and their constituents, however, requires more attention.

As far as sentence identification is concerned, the presence of a full stop cannot be readily interpreted as the end of a sentence: full stops also occur in ellipses ("...") and are used as separators (e.g., in decimal numbers and addresses) and as abbreviation endings. Accordingly, a full stop is interpreted as a sentence ending if it does not fall into one of these categories. A full stop is considered to be a separator if the characters immediately preceding and following it belong to the set of permissible characters consisting of the hyphen, letters and digits. An abbreviation dictionary is used to determine whether a sequence of letters, the first of which may be a capital while the others must be in lower case, ending with a full stop qualifies as an abbreviation.

For reasons that will become clear in Section 3.1.2, the tokenisation of a sentence into its constituents attempts to reproduce the sort of segmentation that appears in the tagged corpus of the Penn Treebank. Apart from the difficulties introduced by full stops, which were treated as described above, some reasonable way of dealing with single quotes had to be devised.

<DOC>
<DOCNO> WSJ </DOCNO>
<DD> = </DD>
<AN> </AN>
<HL> Comerica Unit Sets Acquisition </HL>
<DD> 07/28/89 </DD>
<SO> WALL STREET JOURNAL (J) </SO>
<CO> CMCA GOVMT </CO>
<IN> BANKS (BNK) TENDER OFFERS, MERGERS, ACQUISITIONS (TNM) </IN>
<DATELINE> DETROIT </DATELINE>
<TEXT>
Comerica Inc. said its Comerica Bank-Texas unit acquired Forestwood National Bank of Dallas in a federally assisted takeover. Forestwood National, which had $56 million in assets, was closed by the Federal Deposit Insurance Corp. yesterday. Comerica didn't disclose the price of the transaction. Under terms of the transaction, Comerica will take control of Forestwood's $53 million in deposits and some of the Dallas bank's small loans. Comerica will have the option to acquire additional loans from Forestwood's portfolio.
</TEXT>
</DOC>

Figure 3.3: A story from the Wall Street Journal.

Single quotes can either delimit a quotation, be part of a word (e.g., o'clock, D'Amato, O'Connor), denote a contraction (e.g., she's, we're, won't, '70s) or denote a possessive form (e.g., Cambridge's, companies'). In the case of possessive forms and contractions, tokenisation should produce two tokens (except for cases like '70s). For instance, "can't" must be divided into "ca" and "n't".¹ In order to recognise each of the above uses of single quotes, the following rules were applied:

A word, in a broad sense, is a sequence of letters, digits and hyphens that begins and ends with letters or digits. One single quote is allowed to occur after any character of the sequence but the last three.

Contractions and singular possessive forms consist of a single quote followed by at least one and at most two letters or digits and an optional "s". "n't" is also a contraction.

If a single quote is neither part of a word nor part of a contraction or a singular possessive form, it forms a token by itself, just as other punctuation marks do. It is then up to the part-of-speech tagger to determine whether a single quote occurring at the end of a plural noun denotes a possessive form.

¹The README file accompanying the Penn corpus specifies that "can't" is tokenised as "can" and "n't". Looking at the tagged texts, however, reveals that it was tokenised as "ca" and "n't".
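The quote-handling rules above lend themselves to a compact implementation. The following C sketch is an illustration only, not the thesis tokeniser; the letter/digit check on the contraction suffix is omitted for brevity:

#include <stdio.h>
#include <string.h>

/* Split one raw token at a single quote according to the rules above.
   "n't" is cut off as a unit; otherwise a quote followed by one or two
   characters (with an optional final 's') starts a contraction or
   possessive token; a bare trailing quote becomes a token of its own. */
static void split_token(const char *t) {
    size_t n = strlen(t);
    const char *q = strrchr(t, '\'');
    size_t k = q ? strlen(q + 1) : 0;

    if (n >= 3 && strcmp(t + n - 3, "n't") == 0)       /* can't -> ca | n't */
        printf("%.*s | n't\n", (int)(n - 3), t);
    else if (q && q != t && k >= 1 && k <= 3 && (k <= 2 || q[3] == 's'))
        printf("%.*s | %s\n", (int)(q - t), t, q);     /* she's -> she | 's */
    else if (q && q != t && k == 0)
        printf("%.*s | '\n", (int)(n - 1), t);         /* companies' -> companies | ' */
    else
        printf("%s\n", t);                             /* o'clock, '70s unchanged */
}

int main(void) {
    const char *toks[] = { "can't", "won't", "she's", "we're",
                           "companies'", "o'clock", "'70s", "D'Amato" };
    size_t i;
    for (i = 0; i < sizeof toks / sizeof *toks; i++)
        split_token(toks[i]);
    return 0;
}

Note how the word rule wins for o'clock and D'Amato (the quote sits more than three characters from the end), while '70s is left whole because nothing precedes the quote.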

These rules work properly on a wide range of instances, failing mainly to recognise foreign words (e.g., Aral'sk) and some poetic or rather informal contractions (e.g., ne'er, y'all, 'til).

3.1.2 Part-of-Speech Tagging

The part-of-speech tagger, developed by Dr Stephen Pulman, is inspired by the tagger described by DeRose [22], but it has since undergone a number of changes. Had words only one grammatical function when considered individually, part-of-speech tagging would be straightforward. But since that is not the case, some means of disambiguation must be used to determine their function in a specific context. The approach adopted relies entirely on probabilities. The aim of the tagger is to find the sequence of tags T_1, T_2, ..., T_n which best describes the syntactic role of each word within a given sequence of words w_1, w_2, ..., w_n, usually a sentence. In this particular case, this is done by searching, through the space of possible sequences of tags, for the sequence for which some approximation of P(T_1, T_2, ..., T_n | w_1, w_2, ..., w_n) is maximal. Thanks to dynamic programming, this can be achieved in linear time.

All the statistics used by the searching procedure are kept in three rather large databases. The first one, the lexicon, gives the probability that a word w occurs given that it has tag T, i.e., P(w | T). The trigram matrix then gives the probability that a tag T_i occurs given that it is preceded by tags T_{i-2} and T_{i-1}, i.e., P(T_i | T_{i-1}, T_{i-2}). Finally, a bigram matrix, giving P(T_i | T_{i-1}), supplements the trigram matrix when statistics on a particular trigram are not available. All of the above probabilities are derived from a large sample of manually tagged text. In the context of our work, the tagger has been trained on stories from the Wall Street Journal which were manually tagged as part of the Penn Treebank corpus. This explains why the tokenisation procedure must be as faithful as possible to what appears in the Penn Treebank corpus.

The technique sketched above is appealing in many ways: it is fast, robust and, with an accuracy of at least 95%, amongst the most accurate. However, the part-of-speech tagger, as used in this project, does not attempt to tag words which do not appear in the lexicon and therefore in the training corpus. A crude morphological tagging step then follows to tag untagged numbers and some special combinations of punctuation marks. Words which are still untagged at the end of this step remain untagged (more precisely, they keep the "??" tag). At a later stage, it is then up to the grammar designer to decide what the parser should do with such words. This last morphological tagging step is in fact defined as a simple search-and-replace procedure on various patterns. However simple this solution is, it turned out to be profitable in helping to cut down the number of possible parses at the time of passage scoring.
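The search itself is a standard Viterbi dynamic program. The toy C fragment below shows it for a bigram model over three tags and a three-word sentence; all probabilities are invented, and the real tagger works with trigrams, bigram back-off and databases trained on the Penn Treebank:

#include <stdio.h>
#include <math.h>

#define NT 3                      /* tags: DT, NN, VB */
#define NW 3                      /* toy sentence length */

static const char *TAGS[NT] = { "DT", "NN", "VB" };

/* P(word | tag) for word ids 0="the", 1="dog", 2="barks" -- invented */
static const double LEX[NT][NW] = {
    { 0.9, 0.0, 0.0 },            /* DT */
    { 0.0, 0.5, 0.1 },            /* NN ("barks" can be a noun) */
    { 0.0, 0.1, 0.5 },            /* VB */
};
/* P(tag_i | tag_{i-1}) -- invented bigram matrix */
static const double TRANS[NT][NT] = {
    { 0.1, 0.8, 0.1 },            /* after DT */
    { 0.2, 0.3, 0.5 },            /* after NN */
    { 0.4, 0.4, 0.2 },            /* after VB */
};
static const double INIT[NT] = { 0.6, 0.3, 0.1 };

int main(void) {
    int sent[NW] = { 0, 1, 2 };   /* "the dog barks" */
    double delta[NW][NT];         /* best log-probability ending in tag t */
    int back[NW][NT], path[NW], i, t, p, best;

    for (t = 0; t < NT; t++)
        delta[0][t] = log(INIT[t] * LEX[t][sent[0]] + 1e-12);

    for (i = 1; i < NW; i++)      /* one left-to-right pass: linear time */
        for (t = 0; t < NT; t++) {
            delta[i][t] = -1e30;
            for (p = 0; p < NT; p++) {
                double s = delta[i-1][p]
                         + log(TRANS[p][t] * LEX[t][sent[i]] + 1e-12);
                if (s > delta[i][t]) { delta[i][t] = s; back[i][t] = p; }
            }
        }

    best = 0;                     /* best final tag, then follow back-pointers */
    for (t = 1; t < NT; t++)
        if (delta[NW-1][t] > delta[NW-1][best]) best = t;
    path[NW-1] = best;
    for (i = NW - 1; i > 0; i--) path[i-1] = back[i][path[i]];

    for (i = 0; i < NW; i++) printf("%s ", TAGS[path[i]]);
    printf("\n");                 /* prints: DT NN VB */
    return 0;
}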

3.2 Document Retrieval

From a user's point of view, text retrieval consists in typing a request reflecting an information need, submitting the request to the text retrieval system, waiting a few seconds and looking at the texts proposed by the system. The user, if not satisfied, can then refine the request and submit it again, until satisfaction or the belief that no further gain can be made. From the perspective of, let's say, a modern (as opposed to Boolean) text retrieval system, this translates into:

1. dividing the request into single words,
2. eliminating those that bear no content,
3. finding the stems of the remaining words so that variants having a common stem become equivalent,
4. obtaining the set of texts which contain one or more of the stems,
5. ranking the texts thus obtained according to some measure of importance of the stems occurring in them,
6. returning the result to the user.

Clearly, step 4 cannot be performed in reasonable time without recourse to some information pre-compiled at the time of indexing. As far as step 4 is concerned, the aid needed corresponds to an inverted file (a minimal sketch of such a structure is given below). But it turns out that more information about the texts and their content can be gathered during the indexing process. This extra information plays a major role at step 5. Section 3.2.1 presents the indexing process as defined in our system, while Section 3.2.2 describes the matching process, in which a score, indicating how well the representation of a document matches up with the query, is computed according to some criteria.

3.2.1 Indexing

The purpose of indexing is twofold: first, it reads tokens from the data file, finds their stems, and stores the latter in an inverted file; second, it compiles statistics both about documents (or stories, in our case) and about stems. These statistics are also stored along with the inverted file to form what we call the collection representation.
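As a rough sketch of what the inverted-file part of the collection representation might look like (the actual file layout used by the system is not specified at this level of detail, so these types and figures are hypothetical):

#include <stdio.h>

/* One posting: a document containing the stem, with TF(t, D). */
struct posting { int doc_id; int tf; };

/* One inverted-file entry: a stem, n_t (the number of documents it
   occurs in, needed later for the CFW weight) and its postings. */
struct entry { const char *stem; int n_t; const struct posting *post; };

static const struct posting P_BANK[] = { { 1, 3 }, { 4, 1 } };
static const struct posting P_MERG[] = { { 1, 1 }, { 2, 2 }, { 4, 5 } };

static const struct entry INDEX[] = {
    { "bank", 2, P_BANK },
    { "merg", 3, P_MERG },
};

int main(void) {
    /* Step 4 above: the documents containing a query stem are read off
       the postings list directly -- no scan of the data file is needed. */
    size_t i;
    int j;
    for (i = 0; i < sizeof INDEX / sizeof *INDEX; i++) {
        printf("%s (n_t = %d):", INDEX[i].stem, INDEX[i].n_t);
        for (j = 0; j < INDEX[i].n_t; j++)
            printf(" doc %d (tf = %d)",
                   INDEX[i].post[j].doc_id, INDEX[i].post[j].tf);
        printf("\n");
    }
    return 0;
}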

Figure 3.4 shows the various steps involved.

[Figure 3.4: The indexation process. Tokens from the tagged data file pass through a filter (using a stop list); the resulting content words are stemmed, and the stems, together with statistics kept by various counters, form the collection representation.]

A token, i.e., a term and its associated tag, is read from the data file and then passed on to a filtering module. If the term is itself a tag, denoting for instance the start of a new story, the token goes directly to the accounting module. If the term is a word, capital letters are changed into lower case letters, digits are left untouched and any other character is removed. The resulting term is then looked up in a list of stop words, i.e., words like "after", "next" and "would" which bear no meaning on their own, to see whether it is a content word. Stop words are discarded, while content words go through the stemming module. The stemming module then applies Porter's algorithm [23, 20] to remove the suffix of the term according to a series of transformation rules. The resulting stem is then passed on to the accounting module.

Along with a few housekeeping functions, such as keeping track of the position of stories in the data file, the accounting module is responsible for gathering information on the features which are most critical to assess the importance of a term in a collection of documents. There are numerous ways of assessing the importance of, or weighting, a term. We adopted the approach presented by Robertson and Sparck Jones [24]. This approach has been well tried and has the good grace to be simple. It relies on three features:

Collection Frequency Weight (CFW) This feature is a function of the number of documents in which a term appears, denoted here as n_t. It is defined as:

    CFW(t) = log(N / n_t)    (3.1)

where N is the number of documents in the collection. This measure is motivated by the fact that terms which appear in fewer documents are better at discriminating between documents. Note that when n_t = N, CFW(t) = 0.

Term Frequency (TF) This is simply the number of times a term t appears in a document D, denoted here as TF(t, D). This measure is useful when trying to determine which is the more important of two documents containing the term t. Intuitively, the one in which t occurs more often is the more important.

Normalised Document Length (NDL) Suppose that a term occurs exactly the same number of times in two documents. Which document, in this case, is more important? A reasonable answer is: the shorter one, since the shorter document seems to concentrate more on term t than the longer one.² The normalised document length is defined as:

    NDL(D) = DL(D) / (average DL over all documents)

where DL(D) is the length of the document given in some unit. Be it sentences, terms or characters, normalisation makes NDL insensitive to the unit used to measure DL. Our system uses stems as units. Stop words can be omitted since they are evenly distributed.

The collection representation thus compiled can then be put to good use in order to process any number of subsequent queries. It only needs to be compiled again whenever the data file undergoes some change.

3.2.2 Matching

Matching, in fact, is the core retrieval process. Figure 3.5 illustrates how it is performed by our program. In this figure, the conflation module processes the request file in almost the same way as the indexing module processes the data file. The result, however, is, for each request, a list of term records. A term record gives access to the term along with its stem, its tag, its frequency within the query (i.e., TF(t, Q)), its collection frequency and other useful information. Tag terms (e.g., <DOC>), stop words and punctuation marks are also kept in term records and are identified as such.

The term records of a given query are then fed into the scoring module. At this stage, the various sources of information contained in the collection representation

²In the context of document retrieval, that is. As mentioned earlier, if the longer document devotes an entire section to t and the concentration of t is greater in that section than in the shorter document, the answer is not so clear any more. This is an area where passage retrieval can help significantly.

are combined to yield an overall score for each document having at least one term in common with the query (other documents are given a score of zero). The overall score for a document D relative to a query Q is computed as:

    OS(D, Q) = Σ_{t ∈ Q} TF(t, Q) · CFW(t) · TF(t, D) · (K + 1) / (K · NDL(D) + TF(t, D))    (3.2)

where K is a constant controlling the effect of TF(t, D). Robertson and Sparck Jones suggest using K = 2 as a safe value. It was adopted without further ado, the purpose of this work not being to find an optimal weighting scheme for the stories we used. The result of the process is a list of document identifiers in decreasing order of document score OS. As figure 3.5 suggests, the query term records are updated during the scoring process (more precisely, the collection frequency weights are set).

[Figure 3.5: The document matching process. The tagged request file is conflated, using the stop list, into a list of term records; the query term records, together with the collection representation, are fed to the scoring module, which produces the list of matching documents.]
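Equation (3.2) transcribes directly into code. A minimal C sketch, with K = 2 and made-up term statistics:

#include <stdio.h>
#include <math.h>

#define K 2.0   /* Robertson and Sparck Jones's "safe" value */

/* Contribution of one query term t to OS(D, Q), equation (3.2). */
static double term_score(double tf_q, double tf_d, double cfw, double ndl) {
    if (tf_d == 0.0) return 0.0;
    return tf_q * tf_d * cfw * (K + 1.0) / (K * ndl + tf_d);
}

int main(void) {
    double N = 100000.0;          /* documents in the collection (made up) */
    double ndl = 1.2;             /* D is 1.2 times the average length */
    double os = 0.0;

    /* a two-term query: term 1 has TF(t,Q)=1, TF(t,D)=4, n_t=500;
       term 2 has TF(t,Q)=2, TF(t,D)=1, n_t=20000 -- all invented */
    os += term_score(1.0, 4.0, log(N / 500.0), ndl);
    os += term_score(2.0, 1.0, log(N / 20000.0), ndl);

    printf("OS(D, Q) = %.3f\n", os);
    return 0;
}

The saturation in the TF factor is worth noting: as TF(t, D) grows, the factor TF(t, D) · (K + 1) / (K · NDL(D) + TF(t, D)) tends to K + 1, so repeated occurrences of a term yield diminishing returns.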

3.3 Passage Retrieval

Following the discussion in the preceding chapter, passage retrieval hardly needs further introduction. In the sections below, we first define compound terms from an implementation point of view, then describe how lists of compounds are derived from text excerpts and how these lists are used to compute the similarity between two excerpts.

3.3.1 Implementing Compound Terms

In order to meet our objective of similarity, no restriction is imposed on compound terms. They are defined, accordingly, as ordered lists of terms. There is no restriction as to the number of terms they can comprise. The ordering requirement is not restrictive in any way, since terms can always be put together in canonical (e.g., alphabetical) order at the time of building the compounds. For instance, if we adopt an alphabetical ordering, "permanent neurological" (as in "permanent neurological damage") would be transformed into, and equivalent to, "neurological permanent". The importance of ordering seems to depend on the type of constituents we are dealing with. As the above example shows, it should not be taken into account with adjectives, since their order of occurrence has no impact on the meaning of an expression. This is not the case with nouns, however: different orderings yield different meanings. Consider, for instance, "passenger jetliner" and "jetliner passenger". Hence the need for being able to order terms within a compound.

3.3.2 Deriving a List of Compound Terms

The list of compound terms associated with a paragraph is obtained by first parsing separately each sentence it contains and then building, from the resulting parses, smaller lists of compound terms which are ultimately merged. The sort of list produced depends entirely on the grammar in force at the time of retrieval. It is indeed possible to define various grammars and experiment with them.

A grammar is essentially a context-free grammar with a view to building a list of compounds. Rules are of the form:

    Mother => Children : Fields.

Mother and Children have exactly the same role as in context-free grammars. We define, in the Fields section, a series of values that the mother category can take. These fields specify the transformation rules mentioned earlier, which are applied once a parse has been obtained (see figure 3.6).

[Figure 3.6: The derivation of lists of compound terms. The list of term records for a sentence is parsed, under the control of the grammar, into a chart; the transformation rules are then applied to the resulting parse and the compound list builder produces the list of compounds.]

For instance, the grammar rule:

    np => [np, pos, np] :
         [head := 3:head, compound := 1:head ++ 3:head].

says that a noun phrase (np) can consist of two noun phrases separated by a possessive form (pos) and that the mother category can take two values, namely the values associated with head and compound. The field values are defined in terms of the values of the children: compound, in this example, is the concatenation of the heads of the first and third children, i.e., the noun phrases. Fields are in many ways similar to features, the main difference being that fields cannot be unified.

A complete grammar is shown in figure 3.7, with examples. The eval functor indicates which category, and which field of this category, must be used to construct the list of compounds after a sentence is parsed. The list of compound terms of a paragraph is built by merging the lists of all the sentences. The .+ operator indicates that the category to which it is applied can occur more than once. For example, the rule:

    s => [s_elm.+] : [cmpd := union(1, '$', cmpd)].

says that a sentence s is a sequence of sentence elements s_elm and that the value of cmpd is the union of the cmpds of the children, '$' denoting the last child. As can be seen from this grammar, although the parse must be complete, the grammar does not have to be very precise. In this particular case, the complete parse is just a concatenation of partial parses.

3.3.3 Parsing

The parsing technique we adopted is bottom-up chart parsing. This technique has the advantage of being robust in many ways. It not only allows us to impose very few restrictions on the type of context-free grammars that our system can handle, but it can also cope with any input text, producing, at worst, all possible parses, i.e., a large number of ambiguous partial parses. Furthermore, it is reasonably easy to implement and it can be extended naturally to cope with fields. We will describe, in Section 3.3.4, how, for our purpose, the number of ambiguous parses can be limited by adding simple constraints to the general procedure.

Figure 3.8 shows the general algorithm we used. It is adapted from the one proposed by Allen [25, p. 56]. It uses arcs to store what, in a rule, has been recognised, and which portion of the sentence the recognised categories cover. An arc gives, for a particular rule, the status of a parse on a specific portion of text. It is a triple of the form:

% Main category and field.
eval(s, cmpd).

% A sentence is a sequence of sentence elements.
s => [s_elm.+] :
     [cmpd := union(1, '$', cmpd)].

% A sentence element can be...
s_elm => [ng] :                  % a noun phrase,
     [cmpd := 1:cmpd].
s_elm => [prt] :                 % a gerund or present participle,
     [cmpd := 1:cmpd].
s_elm => [pos] :                 % a possessive ending,
     [].
s_elm => [punct] :               % a punctuation mark,
     [].
s_elm => [wd] :                  % or any other form of word.
     [cmpd := 1:cmpd].

% Noun groups (including possessive forms and present participles).
ng => [n_adj_list, pos, n_adj_list] :            % Emma's restaurant
     [cmpd := 1:adj u 3:adj u 1:n ++ 3:n].
ng => [n_adj_list] :                             % West German catch-up measure
     [cmpd := 1:adj u 1:n].
ng => [prt, n_adj_list] :                        % violating antitrust laws
     [cmpd := 2:adj u 2:n ++ 1:wd].

n_adj_list => [n_adj.+] :                        % West German catch-up measure
     [n := concat(1, '$', n), adj := union(1, '$', adj)].

n_adj => [adj] : [adj := 1:wd].                  % modern
n_adj => [n]   : [n := 1:wd].                    % packaging

Figure 3.7: A grammar recognising noun groups.

    <X -> X_1 X_2 ... X_i • X_{i+1} ... X_n, start, end>

where the dot • indicates that categories X_1 to X_i have been recognised so far, and start and end are the starting point and the endpoint of the arc respectively. In general, if the arc covers terms i to j, start = i - 1 and end = j (the extremities of the arcs lie between the terms). The mother category X has been recognised when all its children have been recognised. All the arcs in turn are stored in a chart, and an agenda keeps track of the arcs whose mother category has been recognised but which still have to be inserted into the chart. The idea is, at the moment of inserting a recognised constituent C into the chart, to create new arcs for: first, all the rules which have C as their first child and, second, all the arcs already in the chart which require C in order to be extended. Complete parses are given by the arcs covering the whole sentence S, i.e., arcs for which start = 0 and end = |S|. In the algorithm presented here, the agenda is implemented as a stack in order to ensure that all the arcs lying between positions 0 and, say, j are in the chart, so that they can be extended whenever an arc from i (< j) to k (> j) is inserted.

/* Initialisation */
for i := n downto 1 do
    stack <C_i -> t_i •, i-1, i> onto agenda

/* Main loop */
while agenda is not empty do
    pop arc <C -> X_1 ... X_n •, p_s', p_e'> from agenda
    /* Adding arcs */
    for each rule of the form X -> C do
        stack <X -> C •, p_s', p_e'> onto agenda
    for each rule of the form X -> C X_2 ... X_n do
        add <X -> C • X_2 ... X_n, p_s', p_e'> to chart
    /* Extending arcs */
    for each <X -> X_1 ... • C ... X_n, p_s, p_s'> in chart do
        add <X -> X_1 ... C • ... X_n, p_s, p_e'> to chart
    for each <X -> X_1 ... X_{n-1} • C, p_s, p_s'> in chart do
        stack <X -> X_1 ... X_{n-1} C •, p_s, p_e'> onto agenda

Figure 3.8: The bottom-up chart parsing algorithm.
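The following self-contained C program is a toy transcription of figure 3.8. It is not the thesis parser: the grammar (binary rules only) and the pre-tagged sentence are hardcoded, and fields, transformations and the constraints of Section 3.3.4 are all ignored. Recognised constituents are printed as they are popped from the agenda; the final line, S [0,5], is the complete parse:

#include <stdio.h>

enum { DT, N, V, NP, VP, S, NCAT };
static const char *CAT[NCAT] = { "DT", "N", "V", "NP", "VP", "S" };

struct rule { int lhs; int rhs[2]; int len; };
static const struct rule R[] = {
    { NP, { DT, N  }, 2 },   /* NP -> DT N  */
    { VP, { V,  NP }, 2 },   /* VP -> V  NP */
    { S,  { NP, VP }, 2 },   /* S  -> NP VP */
};
#define NR (int)(sizeof R / sizeof *R)
#define MAX 128

struct active { int rule, dot, start, end; };  /* incomplete arc in the chart */
struct found  { int cat, start, end; };        /* a recognised constituent */

int main(void) {
    int input[] = { DT, N, V, DT, N };         /* "the dog bit the man", pre-tagged */
    int n = 5, i, r, a;
    struct active chart[MAX]; int nchart = 0;
    struct found agenda[MAX]; int top = 0;

    for (i = n - 1; i >= 0; i--) {             /* initialisation: stack t_n .. t_1 */
        struct found f = { input[i], i, i + 1 };
        agenda[top++] = f;
    }
    while (top > 0) {                          /* main loop */
        struct found c = agenda[--top];
        printf("recognised %s [%d,%d]\n", CAT[c.cat], c.start, c.end);
        for (r = 0; r < NR; r++)               /* adding arcs: rules X -> C ... */
            if (R[r].rhs[0] == c.cat) {        /* (binary rules: never complete yet) */
                struct active x = { r, 1, c.start, c.end };
                chart[nchart++] = x;
            }
        for (a = 0; a < nchart; a++)           /* extending arcs ending where C starts */
            if (chart[a].end == c.start &&
                R[chart[a].rule].rhs[chart[a].dot] == c.cat) {
                if (chart[a].dot + 1 == R[chart[a].rule].len) {   /* arc complete */
                    struct found f = { R[chart[a].rule].lhs, chart[a].start, c.end };
                    agenda[top++] = f;
                } else {                                          /* arc extended */
                    struct active x = { chart[a].rule, chart[a].dot + 1,
                                        chart[a].start, c.end };
                    chart[nchart++] = x;
                }
            }
    }
    return 0;
}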

3.3.4 Dealing with Ambiguities

Once a sentence has been parsed, one of the possible complete parses must be chosen in order to obtain a list of compound terms. One of the key issues when building a compound list is to choose a "good" parse, in the sense that it will help in obtaining sensible compounds according to the grammar. We present here how our system deals with ambiguities and derives promising parses, the first of which is used to build the compound list. An alternative to choosing only one interpretation to obtain compound terms might be to use all the possible parses. This approach, however, entails problems that do not, in our view, have straightforward solutions. It would considerably increase the computational load placed on the system and introduce delicate questions about weighting.

As suggested by Allen in [25, p. 176], the use of the Kleene + operator (henceforth, the list operator) can significantly help to reduce the number of parsing ambiguities. For instance, a sequence of words consisting of four nouns (N N N N) has five different interpretations ((N (N (N N))), (N ((N N) N)), ((N N) (N N)) and so on) if we parse it with the grammar

    NP -> N2
    N2 -> N2 N2
    N2 -> N

whereas it has only one interpretation if we parse it according to

    NP -> N+    (3.3)

The inclusion of the list operator is therefore very appealing.³ It is all the more so if we consider the combinatorial effect of putting the various interpretations of different parts of a sentence together. But the simple inclusion of the list operator turns out to be insufficient. Following up the above example, suppose, as in figure 3.7, we add the rules

    S -> C+    (3.4)
    C -> NP
    C -> Adj
    C -> Punct
    ...

to rule (3.3). While NP has only one interpretation for N N N N, S has eight different interpretations, from

³One can argue that the list operator does not retain the bracketing information but, as Sparck Jones points out about compound nouns in [26], not much can be done regarding bracketing at the syntactic level.

    S(C(NP(N)) C(NP(N)) C(NP(N)) C(NP(N)))    (3.5)

to

    S(C(NP(N N N N)))    (3.6)

Now, if interpretation (3.5) is to be the preferred one, why should rule (3.3), with its list operator, be defined at all? Taking this observation into account, we can infer that the intended meaning of rule (3.4) is that a sentence is a sequence of noun phrases (NP) separated from each other by some other constituents. Then, in the light of the intended meaning, it appears that only interpretation (3.6) is valid and that all the others are, in some sense, accidental, resulting from an underspecification of the grammar. Since trying to write a grammar which faithfully reflects the intended meaning would be rather cumbersome, parsing constraints have been defined such that grammars similar to the one considered here lead to the "intended" interpretations.

The purpose of the constraints is to favour the "greediest" interpretations of the list operator, i.e., the interpretations which include all that the list operator can possibly include. For instance, when presented with a sequence of four nouns, the favoured interpretation of a list of nouns (N+) is (N N N N) rather than any other interpretation which would cover only a part of this sequence.

In the framework of chart parsing, the preference for greedy interpretations can be brought in by working with only a subset of all the possible arcs. Suppose that two arcs representing a recognised constituent (i.e., coming from the agenda) cover the same portion and that they are both subject to the same rule; in other words, that these two arcs differ from each other only by their sub-constituents. These two arcs not only create an ambiguity by themselves, but this ambiguity is then propagated to any arcs which comprise them as constituents. This situation can be remedied in large part by establishing preference rules when it comes to inserting arcs into the chart. These preference rules are stated as follows:

Preference rules An arc A, coming from the agenda, can be added to the chart only if:

1. no other arc of the same form (i.e., same coverage, same rule) having fewer sub-constituents than A is already in the chart; and

2. the parsing tree of any other arc of the same form, with the same number of sub-constituents, has at least the same number of constituents as the parsing tree of A.

3. If A is added to the chart, any other arc which has the same form as A is removed if it does not comply with rules (1) and (2).

Rule (1) applies specifically to grammar rules with list operators. By taking the arcs which have fewer sub-constituents, we favour the greediest interpretations.
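Hypothetically, rules (1) and (2) reduce to an ordering on competing arcs of the same form, which might be sketched in C as follows (the field names are invented for this illustration):

#include <stdio.h>

/* Summary of a competing arc: its number of direct sub-constituents
   and the total number of constituents in its parse tree. */
struct arc_info { int n_sub; int n_nodes; };

/* Compare two arcs of the same form (same rule, same coverage).
   Negative: a is preferred; positive: b is preferred; zero: tie. */
static int arc_cmp(const struct arc_info *a, const struct arc_info *b) {
    if (a->n_sub != b->n_sub)
        return a->n_sub - b->n_sub;     /* rule 1: fewer children = greedier */
    return a->n_nodes - b->n_nodes;     /* rule 2: smaller parse tree wins */
}

int main(void) {
    /* S(C(NP(N N N N))):          1 child,  7 constituents in total;
       S(C(NP(N)) x 4), i.e. (3.5): 4 children, 13 constituents in total */
    struct arc_info greedy = { 1, 7 }, flat = { 4, 13 };
    printf("%s\n", arc_cmp(&greedy, &flat) < 0 ? "greedy wins" : "flat wins");
    return 0;
}

On the four-noun example, the greedy interpretation (3.6) is preferred at the first test already, which is exactly the behaviour the preference rules are designed to enforce.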


S NP VP 0.9 S VP 0.1 VP V NP 0.5 VP V 0.1 VP V PP 0.1 NP NP NP 0.1 NP NP PP 0.2 NP N 0.7 PP P NP 1.0 VP  NP PP 1.0. N people 0. /6/7 CS 6/CS: Natural Language Processing Instructor: Prof. Lu Wang College of Computer and Information Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang The grammar: Binary, no epsilons,.9..5

More information

NATURAL LANGUAGE PROCESSING. Dr. G. Bharadwaja Kumar

NATURAL LANGUAGE PROCESSING. Dr. G. Bharadwaja Kumar NATURAL LANGUAGE PROCESSING Dr. G. Bharadwaja Kumar Sentence Boundary Markers Many natural language processing (NLP) systems generally take a sentence as an input unit part of speech (POS) tagging, chunking,

More information

Lecture Notes on Inductive Definitions

Lecture Notes on Inductive Definitions Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 August 28, 2003 These supplementary notes review the notion of an inductive definition and give

More information

Parsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford)

Parsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) Parsing Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) S N VP V NP D N John hit the ball Levels of analysis Level Morphology/Lexical POS (morpho-synactic), WSD Elements

More information

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs (based on slides by Sharon Goldwater and Philipp Koehn) 21 February 2018 Nathan Schneider ENLP Lecture 11 21

More information

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) Contents 1 Vector Spaces 1 1.1 The Formal Denition of a Vector Space.................................. 1 1.2 Subspaces...................................................

More information

with the ability to perform a restricted set of operations on quantum registers. These operations consist of state preparation, some unitary operation

with the ability to perform a restricted set of operations on quantum registers. These operations consist of state preparation, some unitary operation Conventions for Quantum Pseudocode LANL report LAUR-96-2724 E. Knill knill@lanl.gov, Mail Stop B265 Los Alamos National Laboratory Los Alamos, NM 87545 June 1996 Abstract A few conventions for thinking

More information

Finding Succinct. Ordered Minimal Perfect. Hash Functions. Steven S. Seiden 3 Daniel S. Hirschberg 3. September 22, Abstract

Finding Succinct. Ordered Minimal Perfect. Hash Functions. Steven S. Seiden 3 Daniel S. Hirschberg 3. September 22, Abstract Finding Succinct Ordered Minimal Perfect Hash Functions Steven S. Seiden 3 Daniel S. Hirschberg 3 September 22, 1994 Abstract An ordered minimal perfect hash table is one in which no collisions occur among

More information

How to Pop a Deep PDA Matters

How to Pop a Deep PDA Matters How to Pop a Deep PDA Matters Peter Leupold Department of Mathematics, Faculty of Science Kyoto Sangyo University Kyoto 603-8555, Japan email:leupold@cc.kyoto-su.ac.jp Abstract Deep PDA are push-down automata

More information

FORMALIZATION AND VERIFICATION OF PROPERTY SPECIFICATION PATTERNS. Dmitriy Bryndin

FORMALIZATION AND VERIFICATION OF PROPERTY SPECIFICATION PATTERNS. Dmitriy Bryndin FORMALIZATION AND VERIFICATION OF PROPERTY SPECIFICATION PATTERNS by Dmitriy Bryndin A THESIS Submitted to Michigan State University in partial fulllment of the requirements for the degree of MASTER OF

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other

More information

TnT Part of Speech Tagger

TnT Part of Speech Tagger TnT Part of Speech Tagger By Thorsten Brants Presented By Arghya Roy Chaudhuri Kevin Patel Satyam July 29, 2014 1 / 31 Outline 1 Why Then? Why Now? 2 Underlying Model Other technicalities 3 Evaluation

More information

Lecture Notes on Inductive Definitions

Lecture Notes on Inductive Definitions Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 September 2, 2004 These supplementary notes review the notion of an inductive definition and

More information

Handout 8: Computation & Hierarchical parsing II. Compute initial state set S 0 Compute initial state set S 0

Handout 8: Computation & Hierarchical parsing II. Compute initial state set S 0 Compute initial state set S 0 Massachusetts Institute of Technology 6.863J/9.611J, Natural Language Processing, Spring, 2001 Department of Electrical Engineering and Computer Science Department of Brain and Cognitive Sciences Handout

More information

Abstract parsing: static analysis of dynamically generated string output using LR-parsing technology

Abstract parsing: static analysis of dynamically generated string output using LR-parsing technology Abstract parsing: static analysis of dynamically generated string output using LR-parsing technology Kyung-Goo Doh 1, Hyunha Kim 1, David A. Schmidt 2 1. Hanyang University, Ansan, South Korea 2. Kansas

More information

Computation Theory Finite Automata

Computation Theory Finite Automata Computation Theory Dept. of Computing ITT Dublin October 14, 2010 Computation Theory I 1 We would like a model that captures the general nature of computation Consider two simple problems: 2 Design a program

More information

Review. Earley Algorithm Chapter Left Recursion. Left-Recursion. Rule Ordering. Rule Ordering

Review. Earley Algorithm Chapter Left Recursion. Left-Recursion. Rule Ordering. Rule Ordering Review Earley Algorithm Chapter 13.4 Lecture #9 October 2009 Top-Down vs. Bottom-Up Parsers Both generate too many useless trees Combine the two to avoid over-generation: Top-Down Parsing with Bottom-Up

More information

CISC4090: Theory of Computation

CISC4090: Theory of Computation CISC4090: Theory of Computation Chapter 2 Context-Free Languages Courtesy of Prof. Arthur G. Werschulz Fordham University Department of Computer and Information Sciences Spring, 2014 Overview In Chapter

More information

Math 42, Discrete Mathematics

Math 42, Discrete Mathematics c Fall 2018 last updated 10/10/2018 at 23:28:03 For use by students in this class only; all rights reserved. Note: some prose & some tables are taken directly from Kenneth R. Rosen, and Its Applications,

More information

Information Extraction and GATE. Valentin Tablan University of Sheffield Department of Computer Science NLP Group

Information Extraction and GATE. Valentin Tablan University of Sheffield Department of Computer Science NLP Group Information Extraction and GATE Valentin Tablan University of Sheffield Department of Computer Science NLP Group Information Extraction Information Extraction (IE) pulls facts and structured information

More information

A Context-Free Grammar

A Context-Free Grammar Statistical Parsing A Context-Free Grammar S VP VP Vi VP Vt VP VP PP DT NN PP PP P Vi sleeps Vt saw NN man NN dog NN telescope DT the IN with IN in Ambiguity A sentence of reasonable length can easily

More information

Computing the acceptability semantics. London SW7 2BZ, UK, Nicosia P.O. Box 537, Cyprus,

Computing the acceptability semantics. London SW7 2BZ, UK, Nicosia P.O. Box 537, Cyprus, Computing the acceptability semantics Francesca Toni 1 and Antonios C. Kakas 2 1 Department of Computing, Imperial College, 180 Queen's Gate, London SW7 2BZ, UK, ft@doc.ic.ac.uk 2 Department of Computer

More information

Computational Tasks and Models

Computational Tasks and Models 1 Computational Tasks and Models Overview: We assume that the reader is familiar with computing devices but may associate the notion of computation with specific incarnations of it. Our first goal is to

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Upper and Lower Bounds on the Number of Faults. a System Can Withstand Without Repairs. Cambridge, MA 02139

Upper and Lower Bounds on the Number of Faults. a System Can Withstand Without Repairs. Cambridge, MA 02139 Upper and Lower Bounds on the Number of Faults a System Can Withstand Without Repairs Michel Goemans y Nancy Lynch z Isaac Saias x Laboratory for Computer Science Massachusetts Institute of Technology

More information

The Lambek-Grishin calculus for unary connectives

The Lambek-Grishin calculus for unary connectives The Lambek-Grishin calculus for unary connectives Anna Chernilovskaya Utrecht Institute of Linguistics OTS, Utrecht University, the Netherlands anna.chernilovskaya@let.uu.nl Introduction In traditional

More information

CHAPTER THREE: RELATIONS AND FUNCTIONS

CHAPTER THREE: RELATIONS AND FUNCTIONS CHAPTER THREE: RELATIONS AND FUNCTIONS 1 Relations Intuitively, a relation is the sort of thing that either does or does not hold between certain things, e.g. the love relation holds between Kim and Sandy

More information

Quasi-Synchronous Phrase Dependency Grammars for Machine Translation. lti

Quasi-Synchronous Phrase Dependency Grammars for Machine Translation. lti Quasi-Synchronous Phrase Dependency Grammars for Machine Translation Kevin Gimpel Noah A. Smith 1 Introduction MT using dependency grammars on phrases Phrases capture local reordering and idiomatic translations

More information

Discovery of Frequent Word Sequences in Text. Helena Ahonen-Myka. University of Helsinki

Discovery of Frequent Word Sequences in Text. Helena Ahonen-Myka. University of Helsinki Discovery of Frequent Word Sequences in Text Helena Ahonen-Myka University of Helsinki Department of Computer Science P.O. Box 26 (Teollisuuskatu 23) FIN{00014 University of Helsinki, Finland, helena.ahonen-myka@cs.helsinki.fi

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 3 Modeling Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Retrieval Evaluation, Modern Information Retrieval,

More information

Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events

Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events Massimo Franceschet Angelo Montanari Dipartimento di Matematica e Informatica, Università di Udine Via delle

More information

Chap 2: Classical models for information retrieval

Chap 2: Classical models for information retrieval Chap 2: Classical models for information retrieval Jean-Pierre Chevallet & Philippe Mulhem LIG-MRIM Sept 2016 Jean-Pierre Chevallet & Philippe Mulhem Models of IR 1 / 81 Outline Basic IR Models 1 Basic

More information

Text Mining. March 3, March 3, / 49

Text Mining. March 3, March 3, / 49 Text Mining March 3, 2017 March 3, 2017 1 / 49 Outline Language Identification Tokenisation Part-Of-Speech (POS) tagging Hidden Markov Models - Sequential Taggers Viterbi Algorithm March 3, 2017 2 / 49

More information

CLRG Biocreative V

CLRG Biocreative V CLRG ChemTMiner @ Biocreative V Sobha Lalitha Devi., Sindhuja Gopalan., Vijay Sundar Ram R., Malarkodi C.S., Lakshmi S., Pattabhi RK Rao Computational Linguistics Research Group, AU-KBC Research Centre

More information

On Using Selectional Restriction in Language Models for Speech Recognition

On Using Selectional Restriction in Language Models for Speech Recognition On Using Selectional Restriction in Language Models for Speech Recognition arxiv:cmp-lg/9408010v1 19 Aug 1994 Joerg P. Ueberla CMPT TR 94-03 School of Computing Science, Simon Fraser University, Burnaby,

More information

Introduction to Languages and Computation

Introduction to Languages and Computation Introduction to Languages and Computation George Voutsadakis 1 1 Mathematics and Computer Science Lake Superior State University LSSU Math 400 George Voutsadakis (LSSU) Languages and Computation July 2014

More information

The Noisy Channel Model and Markov Models

The Noisy Channel Model and Markov Models 1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle

More information

Exhaustive Classication of Finite Classical Probability Spaces with Regard to the Notion of Causal Up-to-n-closedness

Exhaustive Classication of Finite Classical Probability Spaces with Regard to the Notion of Causal Up-to-n-closedness Exhaustive Classication of Finite Classical Probability Spaces with Regard to the Notion of Causal Up-to-n-closedness Michaª Marczyk, Leszek Wro«ski Jagiellonian University, Kraków 16 June 2009 Abstract

More information

Language Processing with Perl and Prolog

Language Processing with Perl and Prolog Language Processing with Perl and Prolog Chapter 5: Counting Words Pierre Nugues Lund University Pierre.Nugues@cs.lth.se http://cs.lth.se/pierre_nugues/ Pierre Nugues Language Processing with Perl and

More information

Stochastic dominance with imprecise information

Stochastic dominance with imprecise information Stochastic dominance with imprecise information Ignacio Montes, Enrique Miranda, Susana Montes University of Oviedo, Dep. of Statistics and Operations Research. Abstract Stochastic dominance, which is

More information

Lab 12: Structured Prediction

Lab 12: Structured Prediction December 4, 2014 Lecture plan structured perceptron application: confused messages application: dependency parsing structured SVM Class review: from modelization to classification What does learning mean?

More information

NLP Homework: Dependency Parsing with Feed-Forward Neural Network

NLP Homework: Dependency Parsing with Feed-Forward Neural Network NLP Homework: Dependency Parsing with Feed-Forward Neural Network Submission Deadline: Monday Dec. 11th, 5 pm 1 Background on Dependency Parsing Dependency trees are one of the main representations used

More information

HMM and Part of Speech Tagging. Adam Meyers New York University

HMM and Part of Speech Tagging. Adam Meyers New York University HMM and Part of Speech Tagging Adam Meyers New York University Outline Parts of Speech Tagsets Rule-based POS Tagging HMM POS Tagging Transformation-based POS Tagging Part of Speech Tags Standards There

More information

Model-Theory of Property Grammars with Features

Model-Theory of Property Grammars with Features Model-Theory of Property Grammars with Features Denys Duchier Thi-Bich-Hanh Dao firstname.lastname@univ-orleans.fr Yannick Parmentier Abstract In this paper, we present a model-theoretic description of

More information

Computational Models - Lecture 4

Computational Models - Lecture 4 Computational Models - Lecture 4 Regular languages: The Myhill-Nerode Theorem Context-free Grammars Chomsky Normal Form Pumping Lemma for context free languages Non context-free languages: Examples Push

More information

SYNTHER A NEW M-GRAM POS TAGGER

SYNTHER A NEW M-GRAM POS TAGGER SYNTHER A NEW M-GRAM POS TAGGER David Sündermann and Hermann Ney RWTH Aachen University of Technology, Computer Science Department Ahornstr. 55, 52056 Aachen, Germany {suendermann,ney}@cs.rwth-aachen.de

More information

Ranked Retrieval (2)

Ranked Retrieval (2) Text Technologies for Data Science INFR11145 Ranked Retrieval (2) Instructor: Walid Magdy 31-Oct-2017 Lecture Objectives Learn about Probabilistic models BM25 Learn about LM for IR 2 1 Recall: VSM & TFIDF

More information

A version of for which ZFC can not predict a single bit Robert M. Solovay May 16, Introduction In [2], Chaitin introd

A version of for which ZFC can not predict a single bit Robert M. Solovay May 16, Introduction In [2], Chaitin introd CDMTCS Research Report Series A Version of for which ZFC can not Predict a Single Bit Robert M. Solovay University of California at Berkeley CDMTCS-104 May 1999 Centre for Discrete Mathematics and Theoretical

More information

Safety Analysis versus Type Inference

Safety Analysis versus Type Inference Information and Computation, 118(1):128 141, 1995. Safety Analysis versus Type Inference Jens Palsberg palsberg@daimi.aau.dk Michael I. Schwartzbach mis@daimi.aau.dk Computer Science Department, Aarhus

More information

Count-Min Tree Sketch: Approximate counting for NLP

Count-Min Tree Sketch: Approximate counting for NLP Count-Min Tree Sketch: Approximate counting for NLP Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand and Abdul Mouhamadsultane exensa firstname.lastname@exensa.com arxiv:64.5492v [cs.ir] 9 Apr 26

More information

Document Title. Estimating the Value of Partner Contributions to Flood Mapping Projects. Blue Book

Document Title. Estimating the Value of Partner Contributions to Flood Mapping Projects. Blue Book Document Title Estimating the Value of Partner Contributions to Flood Mapping Projects Blue Book Version 1.1 November 2006 Table of Contents 1. Background...1 2. Purpose...1 3. Overview of Approach...2

More information

Advanced Undecidability Proofs

Advanced Undecidability Proofs 17 Advanced Undecidability Proofs In this chapter, we will discuss Rice s Theorem in Section 17.1, and the computational history method in Section 17.3. As discussed in Chapter 16, these are two additional

More information

Dependency grammar. Recurrent neural networks. Transition-based neural parsing. Word representations. Informs Models

Dependency grammar. Recurrent neural networks. Transition-based neural parsing. Word representations. Informs Models Dependency grammar Morphology Word order Transition-based neural parsing Word representations Recurrent neural networks Informs Models Dependency grammar Morphology Word order Transition-based neural parsing

More information

The same definition may henceforth be expressed as follows:

The same definition may henceforth be expressed as follows: 34 Executing the Fregean Program The extension of "lsit under this scheme of abbreviation is the following set X of ordered triples: X := { E D x D x D : x introduces y to z}. How is this extension

More information

Written Qualifying Exam. Spring, Friday, May 22, This is nominally a three hour examination, however you will be

Written Qualifying Exam. Spring, Friday, May 22, This is nominally a three hour examination, however you will be Written Qualifying Exam Theory of Computation Spring, 1998 Friday, May 22, 1998 This is nominally a three hour examination, however you will be allowed up to four hours. All questions carry the same weight.

More information

Features of Statistical Parsers

Features of Statistical Parsers Features of tatistical Parsers Preliminary results Mark Johnson Brown University TTI, October 2003 Joint work with Michael Collins (MIT) upported by NF grants LI 9720368 and II0095940 1 Talk outline tatistical

More information

Cover Page. The handle holds various files of this Leiden University dissertation

Cover Page. The handle  holds various files of this Leiden University dissertation Cover Page The handle http://hdl.handle.net/1887/39637 holds various files of this Leiden University dissertation Author: Smit, Laurens Title: Steady-state analysis of large scale systems : the successive

More information

Long-Short Term Memory and Other Gated RNNs

Long-Short Term Memory and Other Gated RNNs Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling

More information

Fun with weighted FSTs

Fun with weighted FSTs Fun with weighted FSTs Informatics 2A: Lecture 18 Shay Cohen School of Informatics University of Edinburgh 29 October 2018 1 / 35 Kedzie et al. (2018) - Content Selection in Deep Learning Models of Summarization

More information

B u i l d i n g a n d E x p l o r i n g

B u i l d i n g a n d E x p l o r i n g B u i l d i n g a n d E x p l o r i n g ( Web) Corpora EMLS 2008, Stuttgart 23-25 July 2008 Pavel Rychlý pary@fi.muni.cz NLPlab, Masaryk University, Brno O u t l i n e (1)Introduction to text/web corpora

More information

A Polynomial Time Algorithm for Parsing with the Bounded Order Lambek Calculus

A Polynomial Time Algorithm for Parsing with the Bounded Order Lambek Calculus A Polynomial Time Algorithm for Parsing with the Bounded Order Lambek Calculus Timothy A. D. Fowler Department of Computer Science University of Toronto 10 King s College Rd., Toronto, ON, M5S 3G4, Canada

More information

Three-dimensional Stable Matching Problems. Cheng Ng and Daniel S. Hirschberg. Department of Information and Computer Science

Three-dimensional Stable Matching Problems. Cheng Ng and Daniel S. Hirschberg. Department of Information and Computer Science Three-dimensional Stable Matching Problems Cheng Ng and Daniel S Hirschberg Department of Information and Computer Science University of California, Irvine Irvine, CA 92717 Abstract The stable marriage

More information

0 o 1 i B C D 0/1 0/ /1

0 o 1 i B C D 0/1 0/ /1 A Comparison of Dominance Mechanisms and Simple Mutation on Non-Stationary Problems Jonathan Lewis,? Emma Hart, Graeme Ritchie Department of Articial Intelligence, University of Edinburgh, Edinburgh EH

More information

Designing and Evaluating Generic Ontologies

Designing and Evaluating Generic Ontologies Designing and Evaluating Generic Ontologies Michael Grüninger Department of Industrial Engineering University of Toronto gruninger@ie.utoronto.ca August 28, 2007 1 Introduction One of the many uses of

More information

Even More Complex Search. Multi-Level vs Hierarchical Search. Lecture 11: Search 10. This Lecture. Multi-Level Search. Victor R.

Even More Complex Search. Multi-Level vs Hierarchical Search. Lecture 11: Search 10. This Lecture. Multi-Level Search. Victor R. Lecture 11: Search 10 This Lecture Victor R. Lesser CMPSCI 683 Fall 2010 Multi-Level Search BlackBoard Based Problem Solving Hearsay-II Speech Understanding System Multi-Level vs Hierarchical Search Even

More information

\It is important that there be options in explication, and equally important that. Joseph Retzer, Market Probe, Inc., Milwaukee WI

\It is important that there be options in explication, and equally important that. Joseph Retzer, Market Probe, Inc., Milwaukee WI Measuring The Information Content Of Regressors In The Linear Model Using Proc Reg and SAS IML \It is important that there be options in explication, and equally important that the candidates have clear

More information

KRIPKE S THEORY OF TRUTH 1. INTRODUCTION

KRIPKE S THEORY OF TRUTH 1. INTRODUCTION KRIPKE S THEORY OF TRUTH RICHARD G HECK, JR 1. INTRODUCTION The purpose of this note is to give a simple, easily accessible proof of the existence of the minimal fixed point, and of various maximal fixed

More information

Extensions to the Logic of All x are y: Verbs, Relative Clauses, and Only

Extensions to the Logic of All x are y: Verbs, Relative Clauses, and Only 1/53 Extensions to the Logic of All x are y: Verbs, Relative Clauses, and Only Larry Moss Indiana University Nordic Logic School August 7-11, 2017 2/53 An example that we ll see a few times Consider the

More information

Languages, regular languages, finite automata

Languages, regular languages, finite automata Notes on Computer Theory Last updated: January, 2018 Languages, regular languages, finite automata Content largely taken from Richards [1] and Sipser [2] 1 Languages An alphabet is a finite set of characters,

More information

Introduction to Semantics. The Formalization of Meaning 1

Introduction to Semantics. The Formalization of Meaning 1 The Formalization of Meaning 1 1. Obtaining a System That Derives Truth Conditions (1) The Goal of Our Enterprise To develop a system that, for every sentence S of English, derives the truth-conditions

More information

Event Operators: Formalization, Algorithms, and Implementation Using Interval- Based Semantics

Event Operators: Formalization, Algorithms, and Implementation Using Interval- Based Semantics Department of Computer Science and Engineering University of Texas at Arlington Arlington, TX 76019 Event Operators: Formalization, Algorithms, and Implementation Using Interval- Based Semantics Raman

More information

LECTURER: BURCU CAN Spring

LECTURER: BURCU CAN Spring LECTURER: BURCU CAN 2017-2018 Spring Regular Language Hidden Markov Model (HMM) Context Free Language Context Sensitive Language Probabilistic Context Free Grammar (PCFG) Unrestricted Language PCFGs can

More information

CMPT-825 Natural Language Processing. Why are parsing algorithms important?

CMPT-825 Natural Language Processing. Why are parsing algorithms important? CMPT-825 Natural Language Processing Anoop Sarkar http://www.cs.sfu.ca/ anoop October 26, 2010 1/34 Why are parsing algorithms important? A linguistic theory is implemented in a formal system to generate

More information

FROM QUERIES TO TOP-K RESULTS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

FROM QUERIES TO TOP-K RESULTS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS FROM QUERIES TO TOP-K RESULTS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Retrieval evaluation Link

More information

Learning with Ensembles: How. over-tting can be useful. Anders Krogh Copenhagen, Denmark. Abstract

Learning with Ensembles: How. over-tting can be useful. Anders Krogh Copenhagen, Denmark. Abstract Published in: Advances in Neural Information Processing Systems 8, D S Touretzky, M C Mozer, and M E Hasselmo (eds.), MIT Press, Cambridge, MA, pages 190-196, 1996. Learning with Ensembles: How over-tting

More information