Passage Retrieval

A thesis submitted by Martin Choquette, Fitzwilliam College, Cambridge, as part of the MPhil course in Computer Speech and Language Processing, University of Cambridge, 30 August 1996

Abstract

This work describes a syntactically-based approach to passage retrieval and how it is embedded in a document retrieval environment to provide users with useful summaries of the retrieved documents in connection with their information need. Of particular interest was the possibility of using syntactic analysis to bring various syntactic constructs into equivalent forms, thus making the comparison of passages more precise. Preliminary results suggest that, in the context of large scale retrieval systems, and leaving aside the problem of semantic equivalence of terms, the derivation of valuable constructs cannot realistically be based on syntactic considerations alone. In particular, the use of very stringent search expressions defined from syntactic constructs can give rise to undesirable side effects unless carefully controlled.

Contents

1 Introduction
2 Passage Retrieval
  2.1 Passage Retrieval and its Applications
  2.2 Passage Retrieval and its Implications
  2.3 Syntactic Analysis and Compound Terms
  2.4 Approach
3 Implementation
  3.1 Tagging
    3.1.1 Structural Tagging
    3.1.2 Part-of-Speech Tagging
  3.2 Document Retrieval
    3.2.1 Indexing
    3.2.2 Matching
  3.3 Passage Retrieval
    3.3.1 Implementing Compound Terms
    3.3.2 Deriving a List of Compound Terms
    3.3.3 Parsing
    3.3.4 Dealing with Ambiguities
    3.3.5 Paragraph Scoring Function
4 Evaluation
  4.1 Grammars Tested
  4.2 Testing
  4.3 Results and Discussion
5 Conclusion
A An Example of Output
B Requests
C Grammars

D Different queries for the same request
E Tables of Retrievals

Chapter 1 Introduction

The increasing availability of full texts in electronic form induces the need for good retrieval tools. In cases where users must deal with a large number of possibly relevant documents, they might find it helpful to be presented with short extracts of the retrieved documents, especially if these also turn out to be large documents. The extracts, when properly chosen (i.e., relevant), may thus provide users with indicators that the information they are looking for can be found in a given document. Ideally, they could provide summaries of the retrieved documents in connection with the information need of the users, which they express through requests.

The present work deals with passage retrieval, the process of attempting to retrieve the most relevant passages inside a text by computational means. In particular, we will be most concerned with the use of syntactic analysis in the context of passage retrieval; the main motivation being that syntactic analysis can help in coping with concepts expressed with more than one word, such as "oil spill", "ethical conduct", "plan to cut taxes", "charges of obstruction of justice" and so on. A program has been written to demonstrate the practicability of this approach in a large scale retrieval environment and to test the effect of representing concepts with various degrees of precision. Given a request, the program retrieves the most likely relevant documents available and displays the most likely relevant passages within each document in connection with the request.

The next chapter presents a number of actual applications related to passage retrieval and some requirements brought about by passage retrieval. It then proceeds with a brief survey of recent work on syntactic analysis applied on a large scale and ends with a description of the approach adopted in this work. Chapter 3 goes into further detail by describing the system from an implementation point of view, while Chapter 4 reports on the performance of the system in terms of efficiency and the effects of using increasingly stringent concept descriptions on retrieval performance. Finally, conclusions on the whole project are drawn.

Chapter 2 Passage Retrieval

Passage retrieval is closely related to information retrieval, the meaning of which, as a field of research in library and computer sciences, seems to broaden as technology evolves [1]. In the beginning, the purpose of information retrieval systems was to provide users with references to off-line documents which were likely to be relevant to their information need. To characterise this first meaning of information retrieval, many authors use the term document retrieval. As the storage capacity of computer systems evolved, it then became possible for retrieval systems to present the user with the complete text of documents, mainly articles and reports. To denote this more specific form of document retrieval, in which the full text of a document is made available to the user as opposed to a brief summary in the form of a title or an abstract, the term text retrieval was introduced. But, as full text documents of all sorts become increasingly available in electronic form, being able to extract only parts of them appears to be one of the most appealing solutions to the problem posed by their diversity, both in terms of document content and document length. Passage retrieval addresses this particular issue of retrieving extracts of documents which might immediately satisfy the user's information need.

This chapter presents various ways of dealing with passages and their implications, one of which is the need for detailed analyses of full text. Accordingly, Section 2.3 then proceeds with a description of recent work involving syntactic analysis in the context of large scale text processing, as in the TREC [2] and MUC [3] initiatives. The chapter ends with an overview of the retrieval system developed as part of this MPhil project.

2.1 Passage Retrieval and its Applications

The idea of retrieving passages can be exploited in various ways. I will mention here only a few (see Salton et al. [4, 5, 6] for further applications). In this work, passage retrieval is mainly viewed as an indication tool in a document

retrieval system. Given a document, the passage retrieval system extracts and displays the passages in the text which best match the corresponding query (i.e., the representation of the request used to assess its resemblance to other texts). Thus, regardless of the length of the document, users can tell at a glance whether a given document is likely to answer their information need. Moreover, if the document turns out to be of interest, the excerpts submitted by the passage retrieval system can lead users directly to the relevant parts of the text. This, in fact, turns out to be very helpful, if not essential, when dealing with documents comprising up to 400,000 words, and has been acknowledged by many [7].

Another usage of passage retrieval is as answer-reporter and answer-indicator [8]. If passages are retrieved and displayed, regardless of the documents they are in, according to their sole resemblance to, say, a question, they are likely to provide users with a direct answer (answer-reporter) or with an indication that the answer can be found in a particular document (answer-indicator). The passage retrieval system can thus provide specific facts to users on a particular subject. This application can be viewed as an alternative to more elaborate, but also more domain-specific, question-answering systems.

Many researchers in information retrieval suggest that passage retrieval techniques can help to improve the performance of document retrieval systems (see for instance [9, 5, 7, 10]). This is justified by the fact that document retrieval systems rely on the concentration of terms related to the request to assess the likelihood of relevance of a whole document. But since document collections can be greatly heterogeneous, an unbiased measure of this concentration cannot be obtained by looking only at the global properties of a document, such as its length or the number of occurrences in it of a particular term. Passage retrieval, by providing a more detailed view of candidate retrieved documents, can therefore give a more accurate picture of the co-occurrence of terms within documents and thus give rise, by re-ranking documents, to better document retrieval performance. This approach is often referred to as global-local match. To illustrate it more concretely, suppose that a query consists of two terms and that these terms occur exactly the same number of times in documents, say, A and B. This is the global match. From a global retrieval point of view, A and B are equivalent. But suppose further that the two query terms always appear together in document A but never in B. Then, from a local point of view, A is likely to be more relevant than B since, according to the query, the two terms seem to be related.

2.2 Passage Retrieval and its Implications

In many respects, passage retrieval is similar to document retrieval: we try to match a query with some representation of a passage. In fact, a crude form of passage retrieval would be to apply to passages the same retrieval techniques used for documents. Why can't passages be considered as documents after all?

The answer lies in the very nature of documents and passages, and looking at some properties of the two will show that the similarity between document and passage retrieval can be an asset, but certainly not an end. Unlike documents, passages do not have well-defined boundaries a priori. What is a passage? Callan [9] distinguishes three types of passages:

discourse passages: discourse units like sections, paragraphs, sentences and so on, which appear as such in the text;

semantic passages: passages based upon the content of the text. O'Connor [11], for instance, suggests using cue phrases, such as "however", "on the other hand" and so on, together with query terms in order to delimit a passage; and

window passages: passages defined as (possibly overlapping) windows comprising approximately the same number of words.

These types of passages have different properties for a given application. The choice of a particular type of passage therefore merits some consideration.

A document treating of a particular subject is likely to comprise many passages in which single query terms occur, making a discriminating assessment of the likelihood of relevance of these passages hardly practicable on the basis of single terms alone. Passage retrieval therefore requires a deeper analysis taking into account the relationships between the terms as they occur in the original texts. Deeper statistical analysis techniques rely on characteristics directly derivable from the text itself, such as proximity of the terms [12], or from a collection of documents, such as the number of documents in which a term occurs [13]. Other analysis techniques, on the other hand, rely on various external sources of knowledge about the terms, or groups of terms, to assert the relations holding between them. In the context of information retrieval, these techniques must show a good balance between efficiency and effectiveness [14], where efficiency has to do with speed and economy of means and effectiveness is defined in terms of precision (the proportion of retrieved passages which are relevant) and recall (the proportion of relevant passages which are retrieved). Furthermore, the techniques must be robust enough to deal with virtually any kind of input.

2.3 Syntactic Analysis and Compound Terms

The core of the work described in this thesis relies on syntactic knowledge. It aims at using linguistically motivated techniques to derive compound terms, i.e., combinations of single terms. The use of compound terms (or compounds for short) has proved effective to a certain extent in document and text retrieval [15,

16]. However, there seems to be no gain in using syntactic instead of statistical methods to derive compounds. This result is rather counterintuitive if we consider the various legitimate syntactic variants of an expression, the equivalence of which cannot be precisely recognised by statistical means alone, and calls for further investigation. What follows is a brief presentation of recent related work in which syntactic analysis is used to derive compounds.

Much effort has been made in recent years to develop syntactically-based systems which can operate on large sets of texts. This implies both robustness and efficiency. Appelt et al. [17] have shown with FASTUS that shallow parsing can be perfectly appropriate in such circumstances for particular tasks (in their case, information extraction). They use a series of finite-state machines to detect syntactic constructs which can be reliably identified, i.e., noun and verb groups, along with single words which were critical for their purposes, such as prepositions, conjunctions and so on. In the domain of information retrieval, Evans et al. (CLARIT-TREC) apply basic syntactic processing to generate what they call "simplex noun phrases", consisting of the head noun and all pre-nominal modifiers. The corresponding compounds are the same simplex noun phrases in which each single term is normalised to its root form.

Strzalkowski [18] proposes a different and more comprehensive approach to extracting compound terms. The parser he uses attempts first to generate a complete, and syntactically sound, analysis for each sentence. If, after a fixed period of time, a complete parse has not been found, the parser enters a "skip-and-fit" mode in which some portions of text may be skipped in order to "fit" the parse. A complete analysis makes possible the identification of the main verb of a sentence as well as its subject, object and subordinate clauses. This additional information is then used to build compounds in which terms are chosen and brought together not only according to their syntactic function but also according to their role within the sentence. For instance, compound terms in Strzalkowski's system consist of "head-modifier" pairs of terms of the following types: (1) head noun and its left modifier; (2) head noun and head of its right modifier; (3) main verb of a clause and head of its object phrase; and (4) head of the subject phrase and the main verb. Again, here, terms are normalised to their root form, but this normalisation is performed by a dictionary-assisted suffix trimmer, as opposed to traditional morphological stemmers.

2.4 Approach

Taking into consideration the requirements brought about by both document and passage retrieval (Sections 2.1 and 2.2), as well as the warning at the beginning of the previous section about syntactically-derived compounds, the present work has the following specific objectives:

1. devising a general, robust and reasonably efficient way of deriving

compound terms from full text by syntactic means;

2. in particular, regarding objective 1, being able to deal with parsing ambiguities;

3. identifying various forms of compounds which can be extracted from single syntactic constructs, studying their behaviour in the context of passage retrieval and trying to find the most promising ones for passage selection (as opposed to ranking);

4. determining various forms of syntactic constructs which should be conflated into equivalent compound forms so that they can match (in other words, allowing for syntactic variants);

5. integrating the whole in a document retrieval framework; by

6. conceiving a passage-query matching function capable of exploiting the information made available by the document retrieval module while satisfying the syntactic requirements.

This section gives an overview of what has been done in this project as an attempt to meet these objectives. Since the matching of passages relies in turn on the matching of the syntactic expressions they comprise, we will be mainly concerned with the problem of matching these expressions. The problem of equivalence of variants is central to our approach. The following discussion will be based on the need for matching the variants below:

Wall Street investment banking sources involved (2.1)
Wall Street investment banking sources involvement (2.2)
involvement of Wall Street investment banking sources (2.3)
involving Wall Street banking sources (2.4)

Our goal is to reduce all these expressions to some representations that will be considered as equivalent by the matching function. The solution adopted to match expressions (2.2) and (2.1) is indeed a very common one, namely, stemming. Thus, by reducing the words to their root form, morphological variants become equivalent. For instance, "involved" and "involvement" are both reduced to "involv".

Parsing comes into play when it comes to making expressions such as (2.2) and (2.3) equivalent. In this case, after stemming, the words need to be reordered. This is done, in our framework, by defining grammar rules capable of recognising the two different forms and, once these forms have been recognised, by applying a number of corresponding transformation rules to them. We could, for instance,

define a rule stating that whenever the form "NG1 of NG2" is recognised, it must be transformed into "NG2 NG1". This rule would suffice to make expressions (2.1), (2.2) and (2.3) not only equivalent but identical, all being represented by the single compound:

wall street invest bank sourc involv (2.5)

The same compound extraction scheme is applied to expressions occurring in requests as well as in candidate passages. As this latter example shows, partial parsing can prove to be sufficient for our purposes. This work, in particular, favours shallow parsing and gives much attention to noun groups and noun phrases. Although noun phrases do not fall into the category of constituents which can be reliably identified (because of the ambiguities due to the problem of attachment of prepositional phrases), shallow parsing was still applied in our work to derive compounds from noun phrases. It was felt, as the above example shows, that they could help in obtaining valuable compounds. Although we favoured shallow parsing, in order to meet our experimental objectives, our program has no pre-defined grammar. It is up to the experimenters to provide the system with grammars which suit their needs.

Back to the variants problem, it remains to match expression (2.4) with all the others. This is done by allowing partial matches, as opposed to strict matches in which all the terms must be in a one-to-one correspondence. Thus, if the terms in expression (2.4) are properly reordered and if we allow two compounds to match provided that they can be made identical only by deleting some words (as the matching function I adopted requires), the compounds:

wall street invest bank sourc involv (2.6)
wall street bank sourc involv (2.7)

will be considered as equivalent. The matching of semantic variants, such as "gains" and "profits", though desirable, is not addressed in this work. (A toy illustration of the stemming and reordering steps is given at the end of this discussion.)

In the context of this work, passages correspond to paragraphs. Paragraphs usually bear enough context to constitute short, autonomous and meaningful units, so that they lend themselves readily to our summarisation purposes. Semantic and window passages have not been considered, since the former usually imply a deeper linguistic analysis than what we aim at and the latter would result in rather unnatural summarisation units. Although this work does not address the issue of interaction with users, this is an example of a decision in which they were (and had to be) taken into consideration at the time of designing the system.
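To make the stemming and reordering steps concrete, the fragment below is a toy illustration in C (the language of the actual system, although none of this code comes from it). The suffix list is a crude stand-in for Porter's algorithm, and only the single transformation rule "NG1 of NG2" -> "NG2 NG1" is applied, on whole phrases rather than on parsed noun groups:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

static const char *SUFFIXES[] = { "ement", "ment", "ing", "ed", "es", "s" };

/* Crude stand-in for the Porter stemmer used by the real system. */
static void stem(char *w) {
    size_t n = strlen(w), i;
    for (i = 0; i < n; i++) w[i] = (char)tolower((unsigned char)w[i]);
    for (i = 0; i < sizeof SUFFIXES / sizeof *SUFFIXES; i++) {
        size_t s = strlen(SUFFIXES[i]);
        if (n > s + 2 && strcmp(w + n - s, SUFFIXES[i]) == 0) {
            w[n - s] = '\0';
            return;
        }
    }
}

/* Emit the stems of a whitespace-separated word list, in order. */
static void emit_stems(const char *words) {
    char buf[256], *tok;
    strncpy(buf, words, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (tok = strtok(buf, " "); tok; tok = strtok(NULL, " ")) {
        stem(tok);
        printf("%s ", tok);
    }
}

/* Reduce a noun phrase to a compound, applying the rule
   "NG1 of NG2" -> "NG2 NG1" before stemming. */
static void compound(const char *phrase) {
    const char *of = strstr(phrase, " of ");
    if (of) {
        char left[256];
        size_t n = (size_t)(of - phrase);
        memcpy(left, phrase, n);
        left[n] = '\0';
        emit_stems(of + 4);   /* NG2 first */
        emit_stems(left);     /* then NG1 */
    } else {
        emit_stems(phrase);
    }
    printf("\n");
}

int main(void) {
    compound("Wall Street investment banking sources involved");
    compound("involvement of Wall Street investment banking sources");
    return 0;
}

Run on expressions (2.1) and (2.3), both calls print the same compound (2.5), wall street invest bank sourc involv, which is the sort of conflation the grammar and transformation rules of Chapter 3 achieve in a principled way.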

As is now well known, the basic matching units in this work are compound terms, and both requests and paragraphs are consequently transformed into sets of compounds. But the choice of compounds as units of representation may rightly be questioned. Why not use trees or any more elaborate representation which can more adequately retain the richness of syntactic analysis? It is primarily a matter of robustness and convenience. Defining a robust and comprehensive grammar for English clearly falls outside the scope of this work. Furthermore, trying to match two texts on the basis of syntactic trees or graphs is a very complex task [19]. Compounds, on the other hand, as far as matching is concerned, can be treated, as a first approximation, just like single terms. Unlike single terms, however, they can express relationships between words. The next chapter describes at greater length the solutions presented in this section.

Chapter 3 Implementation

The passage retrieval system described in this chapter consists of a conventional document retrieval system, using only statistics on single terms, supplemented with a syntactically-based passage retrieval module. As figure 3.1 outlines, the document retrieval system first finds, for a given request, the N_D best matching documents, which are then read and processed by the passage retrieval module in order to find the N_P best matching paragraphs within each document. The data file, i.e., the file comprising all the retrievable documents, is a collection of stories from the Wall Street Journal. An example of output is given in appendix A.

The parser developed in this project requires that the words in the original text be unambiguously tagged beforehand. A tagging step therefore precedes the use of any text by our system, and tags, rather than words or stems, are used at the time of parsing.

The bulk of the system described here has been implemented in C. Only the part-of-speech tagger, kindly made available by Dr Stephen Pulman, and the grammar compiler are written in Prolog. The grammar compiler generates code, encoding the grammar rules in C, which can then be compiled and linked with the other modules. The document retrieval module makes use of the sort UNIX utility, and two sets of publicly available programs are used to stem words [20] and to find minimal perfect hash functions [21].

The following discussion is in step with the outline given in figure 3.1. First, the tagging process is described, then the document retrieval one, and a more detailed presentation of the passage retrieval process ends this chapter.

3.1 Tagging

The retrieval of parts of a document along with the syntactic analysis of its content implies the use of a significant amount of information which must be derived by processing raw material. Since the required information does not

/* Pre-processing: needed only when the document collection is created or updated */
tag the data file to obtain the tagged data file TDF
index the content of TDF to obtain the collection representation CR

/* Retrieval process */
tag the request file to obtain the tagged request file TRF
for each request R in TRF do
    /* Document retrieval */
    derive query Q from R
    build DL(Q), the list of the N_D best matching documents for query Q
    /* Passage retrieval */
    build the list of query terms QTL from request R
    for each document D in DL(Q), in decreasing order of matching score do
        reset PL(Q, D), the list of the N_P best matching paragraphs in D for query Q
        for each paragraph P in D do
            build PTL, the list of terms in P
            let score(Q, P) = match(QTL, PTL)
            if score(Q, P) is amongst the best scores then
                add P and score(Q, P) to PL(Q, D)
        print the N_P best matching paragraphs of D according to PL(Q, D)

Figure 3.1: Overview of the retrieval process.

depend on subsequent operations, it only makes sense to encode both raw data and information into a convenient file format. The file thus obtained, which will be referred to as the tagged file, is then used in place of the source file. In addition to avoiding redundant computation of the same information whenever it is needed, the use of the tagged file simplified the conception of other parts of the program. Moreover, the tagging step helps considerably in reducing the number of parsing ambiguities.

The tagging step can be seen as consisting of two types of more specific tagging. First, a structural tagging is performed in order to identify all the constituents comprised in a file. Within the context of our work, the constituents vary in size from a whole Wall Street Journal story down to sentence constituents, i.e., words, numbers, punctuation marks and so on. Structural tagging is then followed by a part-of-speech tagging, in which each sentence constituent is assigned a syntactic category. The whole process is illustrated in figure 3.2.

[Figure 3.2: The tagging process. The raw data (request) file is tokenised, with the help of an abbreviation file, into a token file (structural tagging); the token file then passes through the statistical tagger and, finally, the morphological tagger to yield the tagged data (request) file.]

3.1.1 Structural Tagging

A text source file consists of a sequence of stories from the Wall Street Journal. As figure 3.3 shows, some of the constituents are already delimited by SGML-like markers. The <DOC> and </DOC> markers, for instance, denote respectively the start and the end of a story. This story in turn is divided into various constituents. Of all these constituents, only the document identifier (DOCNO), the headline (HL) and the actual text of the story (TEXT) are retained. Other constituents, which do not seem to bear any useful information for our purpose, are discarded.

The constituents which are explicitly marked up at the tokenisation step are paragraphs, sentences and sentence constituents. Paragraphs can be easily spotted in the source file by using the fact that they are separated by at least one blank line. The task of correctly identifying sentences and their constituents, however, requires more attention.

As far as sentence identification is concerned, the presence of a full stop cannot be readily interpreted as the end of a sentence: full stops also occur in ellipses ("...") and are used as separators (e.g., in decimal numbers and addresses) and as abbreviation endings. Accordingly, a full stop is interpreted as a sentence ending if it does not fall into one of these categories. A full stop is considered to be a separator if the characters immediately preceding and following it belong to the set of permissible characters consisting of the hyphen, letters and digits. An abbreviation dictionary is used to determine whether a sequence of letters, the first of which may be a capital while the others must be in lower case, ending with a full stop qualifies as an abbreviation.

For reasons that will become clear in Section 3.1.2, the tokenisation of a sentence into its constituents attempts to reproduce the sort of segmentation that appears in the tagged corpus of the Penn Treebank. Apart from the difficulties introduced by full stops, which were treated as described above, some reasonable way of dealing with single quotes had to be devised.

<DOC>
<DOCNO> WSJ </DOCNO>
<DD> = </DD>
<AN> </AN>
<HL> Comerica Unit Sets Acquisition </HL>
<DD> 07/28/89 </DD>
<SO> WALL STREET JOURNAL (J) </SO>
<CO> CMCA GOVMT </CO>
<IN> BANKS (BNK) TENDER OFFERS, MERGERS, ACQUISITIONS (TNM) </IN>
<DATELINE> DETROIT </DATELINE>
<TEXT>
Comerica Inc. said its Comerica Bank-Texas unit acquired Forestwood National Bank of Dallas in a federally assisted takeover. Forestwood National, which had $56 million in assets, was closed by the Federal Deposit Insurance Corp. yesterday. Comerica didn't disclose the price of the transaction. Under terms of the transaction, Comerica will take control of Forestwood's $53 million in deposits and some of the Dallas bank's small loans. Comerica will have the option to acquire additional loans from Forestwood's portfolio.
</TEXT>
</DOC>

Figure 3.3: A story from the Wall Street Journal.

Single quotes can either delimit a quotation, be part of a word (e.g., o'clock, D'Amato, O'Connor), denote a contraction (e.g., she's, we're, won't, '70s) or denote a possessive form (e.g., Cambridge's, companies'). In the case of possessive forms and contractions, tokenisation should produce two tokens (except for cases like '70s). For instance, "can't" must be divided into "ca" and "n't".¹ In order to recognise each of the above uses of single quotes, the following rules were applied:

A word, in a broad sense, is a sequence of letters, digits and hyphens that begins and ends with letters or digits. One single quote is allowed to occur after any character of the sequence but the last three.

Contractions and singular possessive forms consist of a single quote followed by at least one and at most two letters or digits and an optional "s". "n't" is also a contraction.

If a single quote is neither part of a word nor part of a contraction or a singular possessive form, it forms a token by itself, just as other punctuation marks do. It is then up to the part-of-speech tagger to determine whether a single quote occurring at the end of a plural noun denotes a possessive form.

¹The README file accompanying the Penn corpus specifies that "can't" is tokenised as "can" and "n't". Looking at the tagged texts, however, reveals that it was tokenised as "ca" and "n't".
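The quote-handling rules above lend themselves to a compact implementation. The following C sketch is an illustration only, not the thesis tokeniser; the letter/digit check on the contraction suffix is omitted for brevity:

#include <stdio.h>
#include <string.h>

/* Split one raw token at a single quote according to the rules above.
   "n't" is cut off as a unit; otherwise a quote followed by one or two
   characters (with an optional final 's') starts a contraction or
   possessive token; a bare trailing quote becomes a token of its own. */
static void split_token(const char *t) {
    size_t n = strlen(t);
    const char *q = strrchr(t, '\'');
    size_t k = q ? strlen(q + 1) : 0;

    if (n >= 3 && strcmp(t + n - 3, "n't") == 0)       /* can't -> ca | n't */
        printf("%.*s | n't\n", (int)(n - 3), t);
    else if (q && q != t && k >= 1 && k <= 3 && (k <= 2 || q[3] == 's'))
        printf("%.*s | %s\n", (int)(q - t), t, q);     /* she's -> she | 's */
    else if (q && q != t && k == 0)
        printf("%.*s | '\n", (int)(n - 1), t);         /* companies' -> companies | ' */
    else
        printf("%s\n", t);                             /* o'clock, '70s unchanged */
}

int main(void) {
    const char *toks[] = { "can't", "won't", "she's", "we're",
                           "companies'", "o'clock", "'70s", "D'Amato" };
    size_t i;
    for (i = 0; i < sizeof toks / sizeof *toks; i++)
        split_token(toks[i]);
    return 0;
}

Note how the word rule wins for o'clock and D'Amato (the quote sits more than three characters from the end), while '70s is left whole because nothing precedes the quote.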

These rules work properly on a wide range of instances, failing mainly to recognise foreign words (e.g., Aral'sk) and some poetic or rather informal contractions (e.g., ne'er, y'all, 'til).

3.1.2 Part-of-Speech Tagging

The part-of-speech tagger, developed by Dr Stephen Pulman, is inspired by the tagger described by DeRose [22], but it has since undergone a number of changes. Had words only one grammatical function when considered individually, part-of-speech tagging would be straightforward. But since that is not the case, some means of disambiguation must be used to determine their function in a specific context. The approach adopted relies entirely on probabilities. The aim of the tagger is to find the sequence of tags T_1, T_2, ..., T_n which best describes the syntactic role of each word within a given sequence of words w_1, w_2, ..., w_n, usually a sentence. In this particular case, this is done by searching, through the space of possible sequences of tags, for the sequence for which some approximation of P(T_1, T_2, ..., T_n | w_1, w_2, ..., w_n) is maximal. Thanks to dynamic programming, this can be achieved in linear time.

All the statistics used by the searching procedure are kept in three rather large databases. The first one, the lexicon, gives the probability that a word w occurs given that it has tag T, i.e., P(w | T). The trigram matrix then gives the probability that a tag T_i occurs given that it is preceded by tags T_{i-2} and T_{i-1}, i.e., P(T_i | T_{i-1}, T_{i-2}). Finally, a bigram matrix, giving P(T_i | T_{i-1}), supplements the trigram matrix when statistics on a particular trigram are not available. All of the above probabilities are derived from a large sample of manually tagged text. In the context of our work, the tagger has been trained on stories from the Wall Street Journal which were manually tagged as part of the Penn Treebank corpus. This explains why the tokenisation procedure must be as faithful as possible to what appears in the Penn Treebank corpus.

The technique sketched above is appealing in many ways: it is fast, robust and, with an accuracy of at least 95%, amongst the most accurate. However, the part-of-speech tagger, as used in this project, does not attempt to tag words which do not appear in the lexicon and therefore in the training corpus. A crude morphological tagging step then follows to tag untagged numbers and some special combinations of punctuation marks. Words which are still untagged at the end of this step remain untagged (more precisely, they keep the "??" tag). At a later stage, it is then up to the grammar designer to decide what the parser should do with such words. This last morphological tagging step is in fact defined as a simple search-and-replace procedure on various patterns. However simple this solution is, it turned out to be profitable in helping to cut down the number of possible parses at the time of passage scoring.
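The search itself is a standard Viterbi dynamic program. The toy C fragment below shows it for a bigram model over three tags and a three-word sentence; all probabilities are invented, and the real tagger works with trigrams, bigram back-off and databases trained on the Penn Treebank:

#include <stdio.h>
#include <math.h>

#define NT 3                      /* tags: DT, NN, VB */
#define NW 3                      /* toy sentence length */

static const char *TAGS[NT] = { "DT", "NN", "VB" };

/* P(word | tag) for word ids 0="the", 1="dog", 2="barks" -- invented */
static const double LEX[NT][NW] = {
    { 0.9, 0.0, 0.0 },            /* DT */
    { 0.0, 0.5, 0.1 },            /* NN ("barks" can be a noun) */
    { 0.0, 0.1, 0.5 },            /* VB */
};
/* P(tag_i | tag_{i-1}) -- invented bigram matrix */
static const double TRANS[NT][NT] = {
    { 0.1, 0.8, 0.1 },            /* after DT */
    { 0.2, 0.3, 0.5 },            /* after NN */
    { 0.4, 0.4, 0.2 },            /* after VB */
};
static const double INIT[NT] = { 0.6, 0.3, 0.1 };

int main(void) {
    int sent[NW] = { 0, 1, 2 };   /* "the dog barks" */
    double delta[NW][NT];         /* best log-probability ending in tag t */
    int back[NW][NT], path[NW], i, t, p, best;

    for (t = 0; t < NT; t++)
        delta[0][t] = log(INIT[t] * LEX[t][sent[0]] + 1e-12);

    for (i = 1; i < NW; i++)      /* one left-to-right pass: linear time */
        for (t = 0; t < NT; t++) {
            delta[i][t] = -1e30;
            for (p = 0; p < NT; p++) {
                double s = delta[i-1][p]
                         + log(TRANS[p][t] * LEX[t][sent[i]] + 1e-12);
                if (s > delta[i][t]) { delta[i][t] = s; back[i][t] = p; }
            }
        }

    best = 0;                     /* best final tag, then follow back-pointers */
    for (t = 1; t < NT; t++)
        if (delta[NW-1][t] > delta[NW-1][best]) best = t;
    path[NW-1] = best;
    for (i = NW - 1; i > 0; i--) path[i-1] = back[i][path[i]];

    for (i = 0; i < NW; i++) printf("%s ", TAGS[path[i]]);
    printf("\n");                 /* prints: DT NN VB */
    return 0;
}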

3.2 Document Retrieval

From a user's point of view, text retrieval consists in typing a request reflecting an information need, submitting the request to the text retrieval system, waiting a few seconds and looking at the texts proposed by the system. The user, if not satisfied, can then refine the request and submit it again, until satisfaction or the belief that no further gain can be made. From the perspective of, let's say, a modern (as opposed to Boolean) text retrieval system, this translates into:

1. dividing the request into single words,
2. eliminating those that bear no content,
3. finding the stems of the remaining words so that variants having a common stem become equivalent,
4. obtaining the set of texts which contain one or more of the stems,
5. ranking the texts thus obtained according to some measure of importance of the stems occurring in them,
6. returning the result to the user.

Clearly, step 4 cannot be performed in reasonable time without recourse to some information pre-compiled at the time of indexing. As far as step 4 is concerned, the aid needed corresponds to an inverted file (a minimal sketch of such a structure is given below). But it turns out that more information about the texts and their content can be gathered during the indexing process. This extra information plays a major role at step 5. Section 3.2.1 presents the indexing process as defined in our system, while Section 3.2.2 describes the matching process, in which a score, indicating how well the representation of a document matches up with the query, is computed according to some criteria.

3.2.1 Indexing

The purpose of indexing is twofold: first, it reads tokens from the data file, finds their stems, and stores the latter in an inverted file; second, it compiles statistics both about documents (or stories, in our case) and about stems. These statistics are also stored along with the inverted file to form what we call the collection representation.
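As a rough sketch of what the inverted-file part of the collection representation might look like (the actual file layout used by the system is not specified at this level of detail, so these types and figures are hypothetical):

#include <stdio.h>

/* One posting: a document containing the stem, with TF(t, D). */
struct posting { int doc_id; int tf; };

/* One inverted-file entry: a stem, n_t (the number of documents it
   occurs in, needed later for the CFW weight) and its postings. */
struct entry { const char *stem; int n_t; const struct posting *post; };

static const struct posting P_BANK[] = { { 1, 3 }, { 4, 1 } };
static const struct posting P_MERG[] = { { 1, 1 }, { 2, 2 }, { 4, 5 } };

static const struct entry INDEX[] = {
    { "bank", 2, P_BANK },
    { "merg", 3, P_MERG },
};

int main(void) {
    /* Step 4 above: the documents containing a query stem are read off
       the postings list directly -- no scan of the data file is needed. */
    size_t i;
    int j;
    for (i = 0; i < sizeof INDEX / sizeof *INDEX; i++) {
        printf("%s (n_t = %d):", INDEX[i].stem, INDEX[i].n_t);
        for (j = 0; j < INDEX[i].n_t; j++)
            printf(" doc %d (tf = %d)",
                   INDEX[i].post[j].doc_id, INDEX[i].post[j].tf);
        printf("\n");
    }
    return 0;
}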

Figure 3.4 shows the various steps involved.

[Figure 3.4: The indexation process. Tokens from the tagged data file pass through a filter (using a stop list); the resulting content words are stemmed, and the stems, together with statistics kept by various counters, form the collection representation.]

A token, i.e., a term and its associated tag, is read from the data file and then passed on to a filtering module. If the term is itself a tag, denoting for instance the start of a new story, the token goes directly to the accounting module. If the term is a word, capital letters are changed into lower case letters, digits are left untouched and any other character is removed. The resulting term is then looked up in a list of stop words, i.e., words like "after", "next" and "would" which bear no meaning on their own, to see whether it is a content word. Stop words are discarded, while content words go through the stemming module. The stemming module then applies Porter's algorithm [23, 20] to remove the suffix of the term according to a series of transformation rules. The resulting stem is then passed on to the accounting module.

Along with a few housekeeping functions, such as keeping track of the position of stories in the data file, the accounting module is responsible for gathering information on the features which are most critical to assess the importance of a term in a collection of documents. There are numerous ways of assessing the importance of, or weighting, a term. We adopted the approach presented by Robertson and Sparck Jones [24]. This approach has been well tried and has the good grace to be simple. It relies on three features:

Collection Frequency Weight (CFW) This feature is a function of the number of documents in which a term appears, denoted here as n_t. It is defined as:

    CFW(t) = log(N / n_t)    (3.1)

where N is the number of documents in the collection. This measure is motivated by the fact that terms which appear in fewer documents are better at discriminating between documents. Note that when n_t = N, CFW(t) = 0.

Term Frequency (TF) This is simply the number of times a term t appears in a document D, denoted here as TF(t, D). This measure is useful when trying to determine which is the more important of two documents containing the term t. Intuitively, the one in which t occurs more often is the more important.

Normalised Document Length (NDL) Suppose that a term occurs exactly the same number of times in two documents. Which document, in this case, is more important? A reasonable answer is: the shorter one, since the shorter document seems to concentrate more on term t than the longer one.² The normalised document length is defined as:

    NDL(D) = DL(D) / (average DL over all documents)

where DL(D) is the length of the document given in some unit. Be it sentences, terms or characters, normalisation makes NDL insensitive to the unit used to measure DL. Our system uses stems as units. Stop words can be omitted since they are evenly distributed.

The collection representation thus compiled can then be put to good use in order to process any number of subsequent queries. It only needs to be compiled again whenever the data file undergoes some change.

3.2.2 Matching

Matching, in fact, is the core retrieval process. Figure 3.5 illustrates how it is performed by our program. In this figure, the conflation module processes the request file in almost the same way as the indexing module processes the data file. The result, however, is, for each request, a list of term records. A term record gives access to the term along with its stem, its tag, its frequency within the query (i.e., TF(t, Q)), its collection frequency and other useful information. Tag terms (e.g., <DOC>), stop words and punctuation marks are also kept in term records and are identified as such.

The term records of a given query are then fed into the scoring module. At this stage, the various sources of information contained in the collection representation

²In the context of document retrieval, that is. As mentioned earlier, if the longer document devotes an entire section to t and the concentration of t is greater in that section than in the shorter document, the answer is not so clear any more. This is an area where passage retrieval can help significantly.

are combined to yield an overall score for each document having at least one term in common with the query (other documents are given a score of zero). The overall score for a document D relative to a query Q is computed as:

    OS(D, Q) = Σ_{t ∈ Q} TF(t, Q) · CFW(t) · TF(t, D) · (K + 1) / (K · NDL(D) + TF(t, D))    (3.2)

where K is a constant controlling the effect of TF(t, D). Robertson and Sparck Jones suggest using K = 2 as a safe value. It was adopted without further ado, the purpose of this work not being to find an optimal weighting scheme for the stories we used. The result of the process is a list of document identifiers in decreasing order of document score OS. As figure 3.5 suggests, the query term records are updated during the scoring process (more precisely, the collection frequency weights are set).

[Figure 3.5: The document matching process. The tagged request file is conflated, using the stop list, into a list of term records; the query term records, together with the collection representation, are fed to the scoring module, which produces the list of matching documents.]
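Equation (3.2) transcribes directly into code. A minimal C sketch, with K = 2 and made-up term statistics:

#include <stdio.h>
#include <math.h>

#define K 2.0   /* Robertson and Sparck Jones's "safe" value */

/* Contribution of one query term t to OS(D, Q), equation (3.2). */
static double term_score(double tf_q, double tf_d, double cfw, double ndl) {
    if (tf_d == 0.0) return 0.0;
    return tf_q * tf_d * cfw * (K + 1.0) / (K * ndl + tf_d);
}

int main(void) {
    double N = 100000.0;          /* documents in the collection (made up) */
    double ndl = 1.2;             /* D is 1.2 times the average length */
    double os = 0.0;

    /* a two-term query: term 1 has TF(t,Q)=1, TF(t,D)=4, n_t=500;
       term 2 has TF(t,Q)=2, TF(t,D)=1, n_t=20000 -- all invented */
    os += term_score(1.0, 4.0, log(N / 500.0), ndl);
    os += term_score(2.0, 1.0, log(N / 20000.0), ndl);

    printf("OS(D, Q) = %.3f\n", os);
    return 0;
}

The saturation in the TF factor is worth noting: as TF(t, D) grows, the factor TF(t, D) · (K + 1) / (K · NDL(D) + TF(t, D)) tends to K + 1, so repeated occurrences of a term yield diminishing returns.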

3.3 Passage Retrieval

Following the discussion in the preceding chapter, passage retrieval hardly needs further introduction. In the sections below, we first define compound terms from an implementation point of view, then describe how lists of compounds are derived from text excerpts and how these lists are used to compute the similarity between two excerpts.

3.3.1 Implementing Compound Terms

In order to meet our objective of similarity, no restriction is imposed on compound terms. They are defined, accordingly, as ordered lists of terms. There is no restriction as to the number of terms they can comprise. The ordering requirement is not restrictive in any way, since terms can always be put together in canonical (e.g., alphabetical) order at the time of building the compounds. For instance, if we adopt an alphabetical ordering, "permanent neurological" (as in "permanent neurological damage") would be transformed into, and equivalent to, "neurological permanent". The importance of ordering seems to depend on the type of constituents we are dealing with. As the above example shows, it should not be taken into account with adjectives, since their order of occurrence has no impact on the meaning of an expression. This is not the case with nouns, however: different orderings yield different meanings. Consider, for instance, "passenger jetliner" and "jetliner passenger". Hence the need for being able to order terms within a compound.

3.3.2 Deriving a List of Compound Terms

The list of compound terms associated with a paragraph is obtained by first parsing separately each sentence it contains and then building, from the resulting parses, smaller lists of compound terms which are ultimately merged. The sort of list produced depends entirely on the grammar in force at the time of retrieval. It is indeed possible to define various grammars and experiment with them.

A grammar is essentially a context-free grammar with a view to building a list of compounds. Rules are of the form:

    Mother => Children : Fields.

Mother and Children have exactly the same role as in context-free grammars. We define, in the Fields section, a series of values that the mother category can take. These fields specify the transformation rules mentioned earlier, which are applied once a parse has been obtained (see figure 3.6).

[Figure 3.6: The derivation of lists of compound terms. The list of term records for a sentence is parsed, under the control of the grammar, into a chart; the transformation rules are then applied to the resulting parse and the compound list builder produces the list of compounds.]

For instance, the grammar rule:

    np => [np, pos, np] :
         [head := 3:head, compound := 1:head ++ 3:head].

says that a noun phrase (np) can consist of two noun phrases separated by a possessive form (pos) and that the mother category can take two values, namely the values associated with head and compound. The field values are defined in terms of the values of the children: compound, in this example, is the concatenation of the heads of the first and third children, i.e., the noun phrases. Fields are in many ways similar to features, the main difference being that fields cannot be unified.

A complete grammar is shown in figure 3.7, with examples. The eval functor indicates which category, and which field of this category, must be used to construct the list of compounds after a sentence is parsed. The list of compound terms of a paragraph is built by merging the lists of all the sentences. The .+ operator indicates that the category to which it is applied can occur more than once. For example, the rule:

    s => [s_elm.+] : [cmpd := union(1, '$', cmpd)].

says that a sentence s is a sequence of sentence elements s_elm and that the value of cmpd is the union of the cmpds of the children, '$' denoting the last child. As can be seen from this grammar, although the parse must be complete, the grammar does not have to be very precise. In this particular case, the complete parse is just a concatenation of partial parses.

3.3.3 Parsing

The parsing technique we adopted is bottom-up chart parsing. This technique has the advantage of being robust in many ways. It not only allows us to impose very few restrictions on the type of context-free grammars that our system can handle, but it can also cope with any input text, producing, at worst, all possible parses, i.e., a large number of ambiguous partial parses. Furthermore, it is reasonably easy to implement and it can be extended naturally to cope with fields. We will describe, in Section 3.3.4, how, for our purpose, the number of ambiguous parses can be limited by adding simple constraints to the general procedure.

Figure 3.8 shows the general algorithm we used. It is adapted from the one proposed by Allen [25, p. 56]. It uses arcs to store what, in a rule, has been recognised, and which portion of the sentence the recognised categories cover. An arc gives, for a particular rule, the status of a parse on a specific portion of text. It is a triple of the form:

% Main category and field.
eval(s, cmpd).

% A sentence is a sequence of sentence elements.
s => [s_elm.+] :
     [cmpd := union(1, '$', cmpd)].

% A sentence element can be...
s_elm => [ng] :                  % a noun phrase,
     [cmpd := 1:cmpd].
s_elm => [prt] :                 % a gerund or present participle,
     [cmpd := 1:cmpd].
s_elm => [pos] :                 % a possessive ending,
     [].
s_elm => [punct] :               % a punctuation mark,
     [].
s_elm => [wd] :                  % or any other form of word.
     [cmpd := 1:cmpd].

% Noun groups (including possessive forms and present participles).
ng => [n_adj_list, pos, n_adj_list] :            % Emma's restaurant
     [cmpd := 1:adj u 3:adj u 1:n ++ 3:n].
ng => [n_adj_list] :                             % West German catch-up measure
     [cmpd := 1:adj u 1:n].
ng => [prt, n_adj_list] :                        % violating antitrust laws
     [cmpd := 2:adj u 2:n ++ 1:wd].

n_adj_list => [n_adj.+] :                        % West German catch-up measure
     [n := concat(1, '$', n), adj := union(1, '$', adj)].

n_adj => [adj] : [adj := 1:wd].                  % modern
n_adj => [n]   : [n := 1:wd].                    % packaging

Figure 3.7: A grammar recognising noun groups.

    <X -> X_1 X_2 ... X_i • X_{i+1} ... X_n, start, end>

where the dot • indicates that categories X_1 to X_i have been recognised so far, and start and end are the starting point and the endpoint of the arc respectively. In general, if the arc covers terms i to j, start = i - 1 and end = j (the extremities of the arcs lie between the terms). The mother category X has been recognised when all its children have been recognised. All the arcs in turn are stored in a chart, and an agenda keeps track of the arcs whose mother category has been recognised but which still have to be inserted into the chart. The idea is, at the moment of inserting a recognised constituent C into the chart, to create new arcs for: first, all the rules which have C as their first child and, second, all the arcs already in the chart which require C in order to be extended. Complete parses are given by the arcs covering the whole sentence S, i.e., arcs for which start = 0 and end = |S|. In the algorithm presented here, the agenda is implemented as a stack in order to ensure that all the arcs lying between positions 0 and, say, j are in the chart, so that they can be extended whenever an arc from i (< j) to k (> j) is inserted.

/* Initialisation */
for i := n downto 1 do
    stack <C_i -> t_i •, i-1, i> onto agenda

/* Main loop */
while agenda is not empty do
    pop arc <C -> X_1 ... X_n •, p_s', p_e'> from agenda
    /* Adding arcs */
    for each rule of the form X -> C do
        stack <X -> C •, p_s', p_e'> onto agenda
    for each rule of the form X -> C X_2 ... X_n do
        add <X -> C • X_2 ... X_n, p_s', p_e'> to chart
    /* Extending arcs */
    for each <X -> X_1 ... • C ... X_n, p_s, p_s'> in chart do
        add <X -> X_1 ... C • ... X_n, p_s, p_e'> to chart
    for each <X -> X_1 ... X_{n-1} • C, p_s, p_s'> in chart do
        stack <X -> X_1 ... X_{n-1} C •, p_s, p_e'> onto agenda

Figure 3.8: The bottom-up chart parsing algorithm.
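The following self-contained C program is a toy transcription of figure 3.8. It is not the thesis parser: the grammar (binary rules only) and the pre-tagged sentence are hardcoded, and fields, transformations and the constraints of Section 3.3.4 are all ignored. Recognised constituents are printed as they are popped from the agenda; the final line, S [0,5], is the complete parse:

#include <stdio.h>

enum { DT, N, V, NP, VP, S, NCAT };
static const char *CAT[NCAT] = { "DT", "N", "V", "NP", "VP", "S" };

struct rule { int lhs; int rhs[2]; int len; };
static const struct rule R[] = {
    { NP, { DT, N  }, 2 },   /* NP -> DT N  */
    { VP, { V,  NP }, 2 },   /* VP -> V  NP */
    { S,  { NP, VP }, 2 },   /* S  -> NP VP */
};
#define NR (int)(sizeof R / sizeof *R)
#define MAX 128

struct active { int rule, dot, start, end; };  /* incomplete arc in the chart */
struct found  { int cat, start, end; };        /* a recognised constituent */

int main(void) {
    int input[] = { DT, N, V, DT, N };         /* "the dog bit the man", pre-tagged */
    int n = 5, i, r, a;
    struct active chart[MAX]; int nchart = 0;
    struct found agenda[MAX]; int top = 0;

    for (i = n - 1; i >= 0; i--) {             /* initialisation: stack t_n .. t_1 */
        struct found f = { input[i], i, i + 1 };
        agenda[top++] = f;
    }
    while (top > 0) {                          /* main loop */
        struct found c = agenda[--top];
        printf("recognised %s [%d,%d]\n", CAT[c.cat], c.start, c.end);
        for (r = 0; r < NR; r++)               /* adding arcs: rules X -> C ... */
            if (R[r].rhs[0] == c.cat) {        /* (binary rules: never complete yet) */
                struct active x = { r, 1, c.start, c.end };
                chart[nchart++] = x;
            }
        for (a = 0; a < nchart; a++)           /* extending arcs ending where C starts */
            if (chart[a].end == c.start &&
                R[chart[a].rule].rhs[chart[a].dot] == c.cat) {
                if (chart[a].dot + 1 == R[chart[a].rule].len) {   /* arc complete */
                    struct found f = { R[chart[a].rule].lhs, chart[a].start, c.end };
                    agenda[top++] = f;
                } else {                                          /* arc extended */
                    struct active x = { chart[a].rule, chart[a].dot + 1,
                                        chart[a].start, c.end };
                    chart[nchart++] = x;
                }
            }
    }
    return 0;
}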

3.3.4 Dealing with Ambiguities

Once a sentence has been parsed, one of the possible complete parses must be chosen in order to obtain a list of compound terms. One of the key issues when building a compound list is to choose a "good" parse, in the sense that it will help in obtaining sensible compounds according to the grammar. We present here how our system deals with ambiguities and derives promising parses, the first of which is used to build the compound list. An alternative to choosing only one interpretation to obtain compound terms might be to use all the possible parses. This approach, however, entails problems that do not, in our view, have straightforward solutions. It would considerably increase the computational load placed on the system and introduce delicate questions about weighting.

As suggested by Allen in [25, p. 176], the use of the Kleene + operator (henceforth, the list operator) can significantly help to reduce the number of parsing ambiguities. For instance, a sequence of words consisting of four nouns (N N N N) has five different interpretations ((N (N (N N))), (N ((N N) N)), ((N N) (N N)) and so on) if we parse it with the grammar

    NP -> N2
    N2 -> N2 N2
    N2 -> N

whereas it has only one interpretation if we parse it according to

    NP -> N+    (3.3)

The inclusion of the list operator is therefore very appealing.³ It is all the more so if we consider the combinatorial effect of putting the various interpretations of different parts of a sentence together. But the simple inclusion of the list operator turns out to be insufficient. Following up the above example, suppose, as in figure 3.7, we add the rules

    S -> C+    (3.4)
    C -> NP
    C -> Adj
    C -> Punct
    ...

to rule (3.3). While NP has only one interpretation for N N N N, S has eight different interpretations, from

³One can argue that the list operator does not retain the bracketing information but, as Sparck Jones points out about compound nouns in [26], not much can be done regarding bracketing at the syntactic level.

    S(C(NP(N)) C(NP(N)) C(NP(N)) C(NP(N)))    (3.5)

to

    S(C(NP(N N N N)))    (3.6)

Now, if interpretation (3.5) is to be the preferred one, why should rule (3.3), with its list operator, be defined at all? Taking this observation into account, we can infer that the intended meaning of rule (3.4) is that a sentence is a sequence of noun phrases (NP) separated from each other by some other constituents. Then, in the light of the intended meaning, it appears that only interpretation (3.6) is valid and that all the others are, in some sense, accidental, resulting from an underspecification of the grammar. Since trying to write a grammar which faithfully reflects the intended meaning would be rather cumbersome, parsing constraints have been defined such that grammars similar to the one considered here lead to the "intended" interpretations.

The purpose of the constraints is to favour the "greediest" interpretations of the list operator, i.e., the interpretations which include all that the list operator can possibly include. For instance, when presented with a sequence of four nouns, the favoured interpretation of a list of nouns (N+) is (N N N N) rather than any other interpretation which would cover only a part of this sequence.

In the framework of chart parsing, the preference for greedy interpretations can be brought in by working with only a subset of all the possible arcs. Suppose that two arcs representing a recognised constituent (i.e., coming from the agenda) cover the same portion and that they are both subject to the same rule; in other words, that these two arcs differ from each other only by their sub-constituents. These two arcs not only create an ambiguity by themselves, but this ambiguity is then propagated to any arcs which comprise them as constituents. This situation can be remedied in large part by establishing preference rules when it comes to inserting arcs into the chart. These preference rules are stated as follows:

Preference rules An arc A, coming from the agenda, can be added to the chart only if:

1. no other arc of the same form (i.e., same coverage, same rule) having fewer sub-constituents than A is already in the chart; and

2. the parsing tree of any other arc of the same form, with the same number of sub-constituents, has at least the same number of constituents as the parsing tree of A.

3. If A is added to the chart, any other arc which has the same form as A is removed if it does not comply with rules (1) and (2).

Rule (1) applies specifically to grammar rules with list operators. By taking the arcs which have fewer sub-constituents, we favour the greediest interpretations.
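Hypothetically, rules (1) and (2) reduce to an ordering on competing arcs of the same form, which might be sketched in C as follows (the field names are invented for this illustration):

#include <stdio.h>

/* Summary of a competing arc: its number of direct sub-constituents
   and the total number of constituents in its parse tree. */
struct arc_info { int n_sub; int n_nodes; };

/* Compare two arcs of the same form (same rule, same coverage).
   Negative: a is preferred; positive: b is preferred; zero: tie. */
static int arc_cmp(const struct arc_info *a, const struct arc_info *b) {
    if (a->n_sub != b->n_sub)
        return a->n_sub - b->n_sub;     /* rule 1: fewer children = greedier */
    return a->n_nodes - b->n_nodes;     /* rule 2: smaller parse tree wins */
}

int main(void) {
    /* S(C(NP(N N N N))):          1 child,  7 constituents in total;
       S(C(NP(N)) x 4), i.e. (3.5): 4 children, 13 constituents in total */
    struct arc_info greedy = { 1, 7 }, flat = { 4, 13 };
    printf("%s\n", arc_cmp(&greedy, &flat) < 0 ? "greedy wins" : "flat wins");
    return 0;
}

On the four-noun example, the greedy interpretation (3.6) is preferred at the first test already, which is exactly the behaviour the preference rules are designed to enforce.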


S NP VP 0.9 S VP 0.1 VP V NP 0.5 VP V 0.1 VP V PP 0.1 NP NP NP 0.1 NP NP PP 0.2 NP N 0.7 PP P NP 1.0 VP  NP PP 1.0. N people 0. /6/7 CS 6/CS: Natural Language Processing Instructor: Prof. Lu Wang College of Computer and Information Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang The grammar: Binary, no epsilons,.9..5

More information

NATURAL LANGUAGE PROCESSING. Dr. G. Bharadwaja Kumar

NATURAL LANGUAGE PROCESSING. Dr. G. Bharadwaja Kumar NATURAL LANGUAGE PROCESSING Dr. G. Bharadwaja Kumar Sentence Boundary Markers Many natural language processing (NLP) systems generally take a sentence as an input unit part of speech (POS) tagging, chunking,

More information

Lecture Notes on Inductive Definitions

Lecture Notes on Inductive Definitions Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 August 28, 2003 These supplementary notes review the notion of an inductive definition and give

More information

Parsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford)

Parsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) Parsing Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) S N VP V NP D N John hit the ball Levels of analysis Level Morphology/Lexical POS (morpho-synactic), WSD Elements

More information

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs (based on slides by Sharon Goldwater and Philipp Koehn) 21 February 2018 Nathan Schneider ENLP Lecture 11 21

More information

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) Contents 1 Vector Spaces 1 1.1 The Formal Denition of a Vector Space.................................. 1 1.2 Subspaces...................................................

More information

with the ability to perform a restricted set of operations on quantum registers. These operations consist of state preparation, some unitary operation

with the ability to perform a restricted set of operations on quantum registers. These operations consist of state preparation, some unitary operation Conventions for Quantum Pseudocode LANL report LAUR-96-2724 E. Knill knill@lanl.gov, Mail Stop B265 Los Alamos National Laboratory Los Alamos, NM 87545 June 1996 Abstract A few conventions for thinking

More information

Finding Succinct. Ordered Minimal Perfect. Hash Functions. Steven S. Seiden 3 Daniel S. Hirschberg 3. September 22, Abstract

Finding Succinct. Ordered Minimal Perfect. Hash Functions. Steven S. Seiden 3 Daniel S. Hirschberg 3. September 22, Abstract Finding Succinct Ordered Minimal Perfect Hash Functions Steven S. Seiden 3 Daniel S. Hirschberg 3 September 22, 1994 Abstract An ordered minimal perfect hash table is one in which no collisions occur among

More information

How to Pop a Deep PDA Matters

How to Pop a Deep PDA Matters How to Pop a Deep PDA Matters Peter Leupold Department of Mathematics, Faculty of Science Kyoto Sangyo University Kyoto 603-8555, Japan email:leupold@cc.kyoto-su.ac.jp Abstract Deep PDA are push-down automata

More information

FORMALIZATION AND VERIFICATION OF PROPERTY SPECIFICATION PATTERNS. Dmitriy Bryndin

FORMALIZATION AND VERIFICATION OF PROPERTY SPECIFICATION PATTERNS. Dmitriy Bryndin FORMALIZATION AND VERIFICATION OF PROPERTY SPECIFICATION PATTERNS by Dmitriy Bryndin A THESIS Submitted to Michigan State University in partial fulllment of the requirements for the degree of MASTER OF

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other

More information

TnT Part of Speech Tagger

TnT Part of Speech Tagger TnT Part of Speech Tagger By Thorsten Brants Presented By Arghya Roy Chaudhuri Kevin Patel Satyam July 29, 2014 1 / 31 Outline 1 Why Then? Why Now? 2 Underlying Model Other technicalities 3 Evaluation

More information

Lecture Notes on Inductive Definitions

Lecture Notes on Inductive Definitions Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 September 2, 2004 These supplementary notes review the notion of an inductive definition and

More information

Handout 8: Computation & Hierarchical parsing II. Compute initial state set S 0 Compute initial state set S 0

Handout 8: Computation & Hierarchical parsing II. Compute initial state set S 0 Compute initial state set S 0 Massachusetts Institute of Technology 6.863J/9.611J, Natural Language Processing, Spring, 2001 Department of Electrical Engineering and Computer Science Department of Brain and Cognitive Sciences Handout

More information

Abstract parsing: static analysis of dynamically generated string output using LR-parsing technology

Abstract parsing: static analysis of dynamically generated string output using LR-parsing technology Abstract parsing: static analysis of dynamically generated string output using LR-parsing technology Kyung-Goo Doh 1, Hyunha Kim 1, David A. Schmidt 2 1. Hanyang University, Ansan, South Korea 2. Kansas

More information

Computation Theory Finite Automata

Computation Theory Finite Automata Computation Theory Dept. of Computing ITT Dublin October 14, 2010 Computation Theory I 1 We would like a model that captures the general nature of computation Consider two simple problems: 2 Design a program

More information

Review. Earley Algorithm Chapter Left Recursion. Left-Recursion. Rule Ordering. Rule Ordering

Review. Earley Algorithm Chapter Left Recursion. Left-Recursion. Rule Ordering. Rule Ordering Review Earley Algorithm Chapter 13.4 Lecture #9 October 2009 Top-Down vs. Bottom-Up Parsers Both generate too many useless trees Combine the two to avoid over-generation: Top-Down Parsing with Bottom-Up

More information

CISC4090: Theory of Computation

CISC4090: Theory of Computation CISC4090: Theory of Computation Chapter 2 Context-Free Languages Courtesy of Prof. Arthur G. Werschulz Fordham University Department of Computer and Information Sciences Spring, 2014 Overview In Chapter

More information

Math 42, Discrete Mathematics

Math 42, Discrete Mathematics c Fall 2018 last updated 10/10/2018 at 23:28:03 For use by students in this class only; all rights reserved. Note: some prose & some tables are taken directly from Kenneth R. Rosen, and Its Applications,

More information

Information Extraction and GATE. Valentin Tablan University of Sheffield Department of Computer Science NLP Group

Information Extraction and GATE. Valentin Tablan University of Sheffield Department of Computer Science NLP Group Information Extraction and GATE Valentin Tablan University of Sheffield Department of Computer Science NLP Group Information Extraction Information Extraction (IE) pulls facts and structured information

More information

A Context-Free Grammar

A Context-Free Grammar Statistical Parsing A Context-Free Grammar S VP VP Vi VP Vt VP VP PP DT NN PP PP P Vi sleeps Vt saw NN man NN dog NN telescope DT the IN with IN in Ambiguity A sentence of reasonable length can easily

More information

Computing the acceptability semantics. London SW7 2BZ, UK, Nicosia P.O. Box 537, Cyprus,

Computing the acceptability semantics. London SW7 2BZ, UK, Nicosia P.O. Box 537, Cyprus, Computing the acceptability semantics Francesca Toni 1 and Antonios C. Kakas 2 1 Department of Computing, Imperial College, 180 Queen's Gate, London SW7 2BZ, UK, ft@doc.ic.ac.uk 2 Department of Computer

More information

Computational Tasks and Models

Computational Tasks and Models 1 Computational Tasks and Models Overview: We assume that the reader is familiar with computing devices but may associate the notion of computation with specific incarnations of it. Our first goal is to

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Upper and Lower Bounds on the Number of Faults. a System Can Withstand Without Repairs. Cambridge, MA 02139

Upper and Lower Bounds on the Number of Faults. a System Can Withstand Without Repairs. Cambridge, MA 02139 Upper and Lower Bounds on the Number of Faults a System Can Withstand Without Repairs Michel Goemans y Nancy Lynch z Isaac Saias x Laboratory for Computer Science Massachusetts Institute of Technology

More information

The Lambek-Grishin calculus for unary connectives

The Lambek-Grishin calculus for unary connectives The Lambek-Grishin calculus for unary connectives Anna Chernilovskaya Utrecht Institute of Linguistics OTS, Utrecht University, the Netherlands anna.chernilovskaya@let.uu.nl Introduction In traditional

More information

CHAPTER THREE: RELATIONS AND FUNCTIONS

CHAPTER THREE: RELATIONS AND FUNCTIONS CHAPTER THREE: RELATIONS AND FUNCTIONS 1 Relations Intuitively, a relation is the sort of thing that either does or does not hold between certain things, e.g. the love relation holds between Kim and Sandy

More information

Quasi-Synchronous Phrase Dependency Grammars for Machine Translation. lti

Quasi-Synchronous Phrase Dependency Grammars for Machine Translation. lti Quasi-Synchronous Phrase Dependency Grammars for Machine Translation Kevin Gimpel Noah A. Smith 1 Introduction MT using dependency grammars on phrases Phrases capture local reordering and idiomatic translations

More information

Discovery of Frequent Word Sequences in Text. Helena Ahonen-Myka. University of Helsinki

Discovery of Frequent Word Sequences in Text. Helena Ahonen-Myka. University of Helsinki Discovery of Frequent Word Sequences in Text Helena Ahonen-Myka University of Helsinki Department of Computer Science P.O. Box 26 (Teollisuuskatu 23) FIN{00014 University of Helsinki, Finland, helena.ahonen-myka@cs.helsinki.fi

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 3 Modeling Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Retrieval Evaluation, Modern Information Retrieval,

More information

Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events

Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events Massimo Franceschet Angelo Montanari Dipartimento di Matematica e Informatica, Università di Udine Via delle

More information

Chap 2: Classical models for information retrieval

Chap 2: Classical models for information retrieval Chap 2: Classical models for information retrieval Jean-Pierre Chevallet & Philippe Mulhem LIG-MRIM Sept 2016 Jean-Pierre Chevallet & Philippe Mulhem Models of IR 1 / 81 Outline Basic IR Models 1 Basic

More information

Text Mining. March 3, March 3, / 49

Text Mining. March 3, March 3, / 49 Text Mining March 3, 2017 March 3, 2017 1 / 49 Outline Language Identification Tokenisation Part-Of-Speech (POS) tagging Hidden Markov Models - Sequential Taggers Viterbi Algorithm March 3, 2017 2 / 49

More information

CLRG Biocreative V

CLRG Biocreative V CLRG ChemTMiner @ Biocreative V Sobha Lalitha Devi., Sindhuja Gopalan., Vijay Sundar Ram R., Malarkodi C.S., Lakshmi S., Pattabhi RK Rao Computational Linguistics Research Group, AU-KBC Research Centre

More information

On Using Selectional Restriction in Language Models for Speech Recognition

On Using Selectional Restriction in Language Models for Speech Recognition On Using Selectional Restriction in Language Models for Speech Recognition arxiv:cmp-lg/9408010v1 19 Aug 1994 Joerg P. Ueberla CMPT TR 94-03 School of Computing Science, Simon Fraser University, Burnaby,

More information

Introduction to Languages and Computation

Introduction to Languages and Computation Introduction to Languages and Computation George Voutsadakis 1 1 Mathematics and Computer Science Lake Superior State University LSSU Math 400 George Voutsadakis (LSSU) Languages and Computation July 2014

More information

The Noisy Channel Model and Markov Models

The Noisy Channel Model and Markov Models 1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle

More information

Exhaustive Classication of Finite Classical Probability Spaces with Regard to the Notion of Causal Up-to-n-closedness

Exhaustive Classication of Finite Classical Probability Spaces with Regard to the Notion of Causal Up-to-n-closedness Exhaustive Classication of Finite Classical Probability Spaces with Regard to the Notion of Causal Up-to-n-closedness Michaª Marczyk, Leszek Wro«ski Jagiellonian University, Kraków 16 June 2009 Abstract

More information

Language Processing with Perl and Prolog

Language Processing with Perl and Prolog Language Processing with Perl and Prolog Chapter 5: Counting Words Pierre Nugues Lund University Pierre.Nugues@cs.lth.se http://cs.lth.se/pierre_nugues/ Pierre Nugues Language Processing with Perl and

More information

Stochastic dominance with imprecise information

Stochastic dominance with imprecise information Stochastic dominance with imprecise information Ignacio Montes, Enrique Miranda, Susana Montes University of Oviedo, Dep. of Statistics and Operations Research. Abstract Stochastic dominance, which is

More information

Lab 12: Structured Prediction

Lab 12: Structured Prediction December 4, 2014 Lecture plan structured perceptron application: confused messages application: dependency parsing structured SVM Class review: from modelization to classification What does learning mean?

More information

NLP Homework: Dependency Parsing with Feed-Forward Neural Network

NLP Homework: Dependency Parsing with Feed-Forward Neural Network NLP Homework: Dependency Parsing with Feed-Forward Neural Network Submission Deadline: Monday Dec. 11th, 5 pm 1 Background on Dependency Parsing Dependency trees are one of the main representations used

More information

HMM and Part of Speech Tagging. Adam Meyers New York University

HMM and Part of Speech Tagging. Adam Meyers New York University HMM and Part of Speech Tagging Adam Meyers New York University Outline Parts of Speech Tagsets Rule-based POS Tagging HMM POS Tagging Transformation-based POS Tagging Part of Speech Tags Standards There

More information

Model-Theory of Property Grammars with Features

Model-Theory of Property Grammars with Features Model-Theory of Property Grammars with Features Denys Duchier Thi-Bich-Hanh Dao firstname.lastname@univ-orleans.fr Yannick Parmentier Abstract In this paper, we present a model-theoretic description of

More information

Computational Models - Lecture 4

Computational Models - Lecture 4 Computational Models - Lecture 4 Regular languages: The Myhill-Nerode Theorem Context-free Grammars Chomsky Normal Form Pumping Lemma for context free languages Non context-free languages: Examples Push

More information

SYNTHER A NEW M-GRAM POS TAGGER

SYNTHER A NEW M-GRAM POS TAGGER SYNTHER A NEW M-GRAM POS TAGGER David Sündermann and Hermann Ney RWTH Aachen University of Technology, Computer Science Department Ahornstr. 55, 52056 Aachen, Germany {suendermann,ney}@cs.rwth-aachen.de

More information

Ranked Retrieval (2)

Ranked Retrieval (2) Text Technologies for Data Science INFR11145 Ranked Retrieval (2) Instructor: Walid Magdy 31-Oct-2017 Lecture Objectives Learn about Probabilistic models BM25 Learn about LM for IR 2 1 Recall: VSM & TFIDF

More information

A version of for which ZFC can not predict a single bit Robert M. Solovay May 16, Introduction In [2], Chaitin introd

A version of for which ZFC can not predict a single bit Robert M. Solovay May 16, Introduction In [2], Chaitin introd CDMTCS Research Report Series A Version of for which ZFC can not Predict a Single Bit Robert M. Solovay University of California at Berkeley CDMTCS-104 May 1999 Centre for Discrete Mathematics and Theoretical

More information

Safety Analysis versus Type Inference

Safety Analysis versus Type Inference Information and Computation, 118(1):128 141, 1995. Safety Analysis versus Type Inference Jens Palsberg palsberg@daimi.aau.dk Michael I. Schwartzbach mis@daimi.aau.dk Computer Science Department, Aarhus

More information

Count-Min Tree Sketch: Approximate counting for NLP

Count-Min Tree Sketch: Approximate counting for NLP Count-Min Tree Sketch: Approximate counting for NLP Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand and Abdul Mouhamadsultane exensa firstname.lastname@exensa.com arxiv:64.5492v [cs.ir] 9 Apr 26

More information

Document Title. Estimating the Value of Partner Contributions to Flood Mapping Projects. Blue Book

Document Title. Estimating the Value of Partner Contributions to Flood Mapping Projects. Blue Book Document Title Estimating the Value of Partner Contributions to Flood Mapping Projects Blue Book Version 1.1 November 2006 Table of Contents 1. Background...1 2. Purpose...1 3. Overview of Approach...2

More information

Advanced Undecidability Proofs

Advanced Undecidability Proofs 17 Advanced Undecidability Proofs In this chapter, we will discuss Rice s Theorem in Section 17.1, and the computational history method in Section 17.3. As discussed in Chapter 16, these are two additional

More information

Dependency grammar. Recurrent neural networks. Transition-based neural parsing. Word representations. Informs Models

Dependency grammar. Recurrent neural networks. Transition-based neural parsing. Word representations. Informs Models Dependency grammar Morphology Word order Transition-based neural parsing Word representations Recurrent neural networks Informs Models Dependency grammar Morphology Word order Transition-based neural parsing

More information

The same definition may henceforth be expressed as follows:

The same definition may henceforth be expressed as follows: 34 Executing the Fregean Program The extension of "lsit under this scheme of abbreviation is the following set X of ordered triples: X := { E D x D x D : x introduces y to z}. How is this extension

More information

Written Qualifying Exam. Spring, Friday, May 22, This is nominally a three hour examination, however you will be

Written Qualifying Exam. Spring, Friday, May 22, This is nominally a three hour examination, however you will be Written Qualifying Exam Theory of Computation Spring, 1998 Friday, May 22, 1998 This is nominally a three hour examination, however you will be allowed up to four hours. All questions carry the same weight.

More information

Features of Statistical Parsers

Features of Statistical Parsers Features of tatistical Parsers Preliminary results Mark Johnson Brown University TTI, October 2003 Joint work with Michael Collins (MIT) upported by NF grants LI 9720368 and II0095940 1 Talk outline tatistical

More information

Cover Page. The handle holds various files of this Leiden University dissertation

Cover Page. The handle  holds various files of this Leiden University dissertation Cover Page The handle http://hdl.handle.net/1887/39637 holds various files of this Leiden University dissertation Author: Smit, Laurens Title: Steady-state analysis of large scale systems : the successive

More information

Long-Short Term Memory and Other Gated RNNs

Long-Short Term Memory and Other Gated RNNs Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling

More information

Fun with weighted FSTs

Fun with weighted FSTs Fun with weighted FSTs Informatics 2A: Lecture 18 Shay Cohen School of Informatics University of Edinburgh 29 October 2018 1 / 35 Kedzie et al. (2018) - Content Selection in Deep Learning Models of Summarization

More information

B u i l d i n g a n d E x p l o r i n g

B u i l d i n g a n d E x p l o r i n g B u i l d i n g a n d E x p l o r i n g ( Web) Corpora EMLS 2008, Stuttgart 23-25 July 2008 Pavel Rychlý pary@fi.muni.cz NLPlab, Masaryk University, Brno O u t l i n e (1)Introduction to text/web corpora

More information

A Polynomial Time Algorithm for Parsing with the Bounded Order Lambek Calculus

A Polynomial Time Algorithm for Parsing with the Bounded Order Lambek Calculus A Polynomial Time Algorithm for Parsing with the Bounded Order Lambek Calculus Timothy A. D. Fowler Department of Computer Science University of Toronto 10 King s College Rd., Toronto, ON, M5S 3G4, Canada

More information

Three-dimensional Stable Matching Problems. Cheng Ng and Daniel S. Hirschberg. Department of Information and Computer Science

Three-dimensional Stable Matching Problems. Cheng Ng and Daniel S. Hirschberg. Department of Information and Computer Science Three-dimensional Stable Matching Problems Cheng Ng and Daniel S Hirschberg Department of Information and Computer Science University of California, Irvine Irvine, CA 92717 Abstract The stable marriage

More information

0 o 1 i B C D 0/1 0/ /1

0 o 1 i B C D 0/1 0/ /1 A Comparison of Dominance Mechanisms and Simple Mutation on Non-Stationary Problems Jonathan Lewis,? Emma Hart, Graeme Ritchie Department of Articial Intelligence, University of Edinburgh, Edinburgh EH

More information

Designing and Evaluating Generic Ontologies

Designing and Evaluating Generic Ontologies Designing and Evaluating Generic Ontologies Michael Grüninger Department of Industrial Engineering University of Toronto gruninger@ie.utoronto.ca August 28, 2007 1 Introduction One of the many uses of

More information

Even More Complex Search. Multi-Level vs Hierarchical Search. Lecture 11: Search 10. This Lecture. Multi-Level Search. Victor R.

Even More Complex Search. Multi-Level vs Hierarchical Search. Lecture 11: Search 10. This Lecture. Multi-Level Search. Victor R. Lecture 11: Search 10 This Lecture Victor R. Lesser CMPSCI 683 Fall 2010 Multi-Level Search BlackBoard Based Problem Solving Hearsay-II Speech Understanding System Multi-Level vs Hierarchical Search Even

More information

\It is important that there be options in explication, and equally important that. Joseph Retzer, Market Probe, Inc., Milwaukee WI

\It is important that there be options in explication, and equally important that. Joseph Retzer, Market Probe, Inc., Milwaukee WI Measuring The Information Content Of Regressors In The Linear Model Using Proc Reg and SAS IML \It is important that there be options in explication, and equally important that the candidates have clear

More information

KRIPKE S THEORY OF TRUTH 1. INTRODUCTION

KRIPKE S THEORY OF TRUTH 1. INTRODUCTION KRIPKE S THEORY OF TRUTH RICHARD G HECK, JR 1. INTRODUCTION The purpose of this note is to give a simple, easily accessible proof of the existence of the minimal fixed point, and of various maximal fixed

More information

Extensions to the Logic of All x are y: Verbs, Relative Clauses, and Only

Extensions to the Logic of All x are y: Verbs, Relative Clauses, and Only 1/53 Extensions to the Logic of All x are y: Verbs, Relative Clauses, and Only Larry Moss Indiana University Nordic Logic School August 7-11, 2017 2/53 An example that we ll see a few times Consider the

More information

Languages, regular languages, finite automata

Languages, regular languages, finite automata Notes on Computer Theory Last updated: January, 2018 Languages, regular languages, finite automata Content largely taken from Richards [1] and Sipser [2] 1 Languages An alphabet is a finite set of characters,

More information

Introduction to Semantics. The Formalization of Meaning 1

Introduction to Semantics. The Formalization of Meaning 1 The Formalization of Meaning 1 1. Obtaining a System That Derives Truth Conditions (1) The Goal of Our Enterprise To develop a system that, for every sentence S of English, derives the truth-conditions

More information

Event Operators: Formalization, Algorithms, and Implementation Using Interval- Based Semantics

Event Operators: Formalization, Algorithms, and Implementation Using Interval- Based Semantics Department of Computer Science and Engineering University of Texas at Arlington Arlington, TX 76019 Event Operators: Formalization, Algorithms, and Implementation Using Interval- Based Semantics Raman

More information

LECTURER: BURCU CAN Spring

LECTURER: BURCU CAN Spring LECTURER: BURCU CAN 2017-2018 Spring Regular Language Hidden Markov Model (HMM) Context Free Language Context Sensitive Language Probabilistic Context Free Grammar (PCFG) Unrestricted Language PCFGs can

More information

CMPT-825 Natural Language Processing. Why are parsing algorithms important?

CMPT-825 Natural Language Processing. Why are parsing algorithms important? CMPT-825 Natural Language Processing Anoop Sarkar http://www.cs.sfu.ca/ anoop October 26, 2010 1/34 Why are parsing algorithms important? A linguistic theory is implemented in a formal system to generate

More information

FROM QUERIES TO TOP-K RESULTS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

FROM QUERIES TO TOP-K RESULTS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS FROM QUERIES TO TOP-K RESULTS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Retrieval evaluation Link

More information

Learning with Ensembles: How. over-tting can be useful. Anders Krogh Copenhagen, Denmark. Abstract

Learning with Ensembles: How. over-tting can be useful. Anders Krogh Copenhagen, Denmark. Abstract Published in: Advances in Neural Information Processing Systems 8, D S Touretzky, M C Mozer, and M E Hasselmo (eds.), MIT Press, Cambridge, MA, pages 190-196, 1996. Learning with Ensembles: How over-tting

More information