Towards Collaborative Information Retrieval


Markus Junker, Armin Hust, and Stefan Klink
German Research Center for Artificial Intelligence (DFKI GmbH), P.O. Box 2080, 67608 Kaiserslautern, Germany
{markus.junker, armin.hust, stefan.klink}@dfki.de

Abstract. Information retrieval systems have been studied in computer science for decades. The traditional ad-hoc task in information retrieval is to find all documents relevant to a given query. Much effort has been devoted to improving on this task, in particular in the Text REtrieval Conference (TREC) series. In 1998 it was decided at TREC-8 that this task should no longer be pursued within TREC, in particular because accuracy had plateaued in recent years. At DFKI, we are working on approaches for Collaborative Information Retrieval (CIR) which learn to improve retrieval effectiveness from the interaction of different users with the retrieval engine. Such systems may have the potential to overcome the current plateau in ad-hoc retrieval. In our presentation we motivate CIR and discuss its relation to personalization issues in information retrieval, to collaborative filtering, and to traditional relevance feedback. We then introduce the assumptions under which CIR may prove beneficial; as we will argue, these assumptions are likely to be met in many practical scenarios. As a first step towards concrete techniques, we focus on a restricted setting of CIR in which only old queries and the correct answer documents to these queries are available for improving on a new query. Three technical approaches designed to meet this goal are motivated, experimentally evaluated, and analysed. Although the data they are evaluated on is rather sub-optimal with respect to the assumptions, the results are very encouraging and point to an attractive research direction.
1 Introduction

1.1 Traditional Document Search
1.2 Collaborative Information Retrieval
1.3 Relation to Other Feedback-Based Query Expansion Techniques
1.4 Collaborative Information Retrieval vs. Collaborative Filtering

2 Document Retrieval

The task of document retrieval is to retrieve documents relevant to a given query from a fixed set of documents. Documents as well as queries are represented in a common way using a set of index terms (called terms from now

on). Terms are determined from the words of the documents, usually during a preprocessing phase in which noise reduction procedures such as stemming and stopword elimination are applied. In the following, a term is represented by $t_i$, $1 \le i \le M$, where $M$ is the number of terms in the document collection.

2.1 Vector Space Model

One of the simplest but most popular models used in IR is the vector space model (VSM) [1], [4]. A document in the VSM is represented as an $M$-dimensional vector

    $d_j = (w_{1j}, w_{2j}, \ldots, w_{Mj})^T$,  $1 \le j \le N$,   (1)

where $T$ indicates the transpose of the vector, $w_{ij}$ is the weight of term $t_i$ in document $d_j$, and $N$ is the number of documents in the document collection. A query in the VSM is likewise represented as an $M$-dimensional vector

    $q_k = (w_{1k}, w_{2k}, \ldots, w_{Mk})^T$,  $1 \le k \le L$,   (2)

where $w_{ik}$ is the weight of term $t_i$ in query $q_k$ and $L$ is the number of queries associated with the document collection. The result of executing a query is a list of documents ranked according to their similarity to the given query. The similarity $\mathrm{sim}(d_j, q_k)$ between a document $d_j$ and a query $q_k$ is measured by the cosine of the angle between these two $M$-dimensional vectors:

    $\mathrm{sim}(d_j, q_k) = \dfrac{d_j^T q_k}{\|d_j\| \, \|q_k\|}$,   (3)

where $\|\cdot\|$ is the Euclidean norm of a vector. If the vectors are already normalized (and hence have unit length), the similarity is simply the dot product of the two vectors:

    $\mathrm{sim}(d_j, q_k) = d_j^T q_k = \sum_{i=1}^{M} w_{ij} w_{ik}$.   (4)

The VSM is one of the models we use in this work for comparison with our new query expansion methods.

2.2 Query Expansion

The use of short queries in IR limits the number of documents that can be ranked according to their similarity to the query. The number of ranked documents is related to the number of appropriate query terms: the more query terms, the more documents are retrieved and ranked according to their

similarity to the query [5]. Several methods, called query expansion methods, have been proposed to cope with this problem [1], [4]. These methods fall into three categories: use of feedback information from the user, use of information derived locally from the set of initially retrieved documents, and use of information derived globally from the document collection.

Pseudo Relevance Feedback. The method called pseudo feedback (also called pseudo relevance feedback, PRF) works in three stages. First, documents are ranked according to their similarity to the original query. Then highly ranked documents are assumed to be relevant, and their terms (all of them, or some highly weighted terms) are used to expand the original query. Finally, documents are ranked again according to their similarity to the expanded query. In this work we employ a simple variant of pseudo relevance feedback [3]. Let $E$ be the set of document vectors given by

    $E = \left\{ d_j \;\middle|\; \dfrac{\mathrm{sim}(d_j, q_k)}{\max_{1 \le i \le N} \mathrm{sim}(d_i, q_k)} \ge \theta,\ 1 \le j \le N \right\}$,   (5)

where $q_k$ is the original query and $\theta$ is a similarity threshold. Then the sum $d_s$ of the document vectors in $E$,

    $d_s = \sum_{d_j \in E} d_j$,   (6)

is used as expansion terms for the original query. The expanded query vector $q'_k$ is obtained by

    $q'_k = q_k + \alpha \dfrac{d_s}{\|d_s\|}$,   (7)

where $\alpha$ is a parameter weighting the expansion terms. The documents are then ranked again according to their similarity $\mathrm{sim}(d_j, q'_k)$. PRF is one of the models we use in this work for comparison with our new query expansion methods.

3 Query Similarity and Relevant Documents

In this paper we employ a query expansion method based on query similarities and relevant documents (QSD). Our method uses feedback information and information globally available from previous queries. Feedback information in our experimental environment is available in the ground truth data provided by the test document collections. The ground truth provides relevance information, i.e.
for each query there exists a list of relevant documents. Relevance information for one query is represented by an $N$-dimensional vector

    $r_k = (r_{1k}, r_{2k}, \ldots, r_{Nk})^T$,  $1 \le k \le L$,   (8)

where

    $r_{ik} = \begin{cases} 1 & \text{if document } i \text{ is relevant to query } k \\ 0 & \text{if document } i \text{ is not relevant to query } k. \end{cases}$   (9)

3.1 Query Expansion Method

Our query expansion method is based on the similarity between the new query to be issued and the existing old queries. We first give a high-level description of the method; the detailed mathematical description follows. Query expansion works as follows:

- compute the similarities between the new query and each of the existing old queries;
- select the old queries having a similarity to the new query greater than or equal to a given threshold;
- for these selected old queries, get the sets of relevant documents from the ground truth data;
- from each set of relevant documents, compute a new document vector;
- use these document vectors and a weighting scheme to enrich the new query.

The formal description is as follows. The similarity $\mathrm{sim}(q_k, q_l)$ between a query $q_k$ and a query $q_l$ is measured by the cosine of the angle between these two $M$-dimensional vectors:

    $\mathrm{sim}(q_k, q_l) = \dfrac{q_k^T q_l}{\|q_k\| \, \|q_l\|}$,   (10)

or simply by the dot product of the two vectors,

    $\mathrm{sim}(q_k, q_l) = q_k^T q_l = \sum_{i=1}^{M} w_{ik} w_{il}$,   (11)

in the case that the vectors are already normalized (and hence have unit length). From now on, denote $\sigma_k := \mathrm{sim}(q_k, q_l)$ for a fixed query $q_l$. Let $S$ be the set

    $S = \{ q_k \mid \sigma_k \ge \sigma,\ 1 \le k \le L \}$   (12)

of existing old queries $q_k$ having a similarity greater than or equal to a threshold $\sigma$ to the new query $q_l$, and let $T_k$ be the sets

    $T_k = \{ d_j \mid q_k \in S \wedge r_{jk} = 1,\ 1 \le j \le N \}$,  $1 \le k \le L$,   (13)

of all documents relevant to the queries $q_k$ in $S$. Then the sums $\bar{d}_k$ of the document vectors in each $T_k$,

    $\bar{d}_k = \sum_{d_j \in T_k} d_j$,   (14)

are used as expansion terms for the original query. The expanded query vector $q'_l$ is obtained by

    $q'_l = q_l + \sum_{k=1}^{L} \lambda_k \dfrac{\bar{d}_k}{\|\bar{d}_k\|}$,   (15)

where the $\lambda_k$ are parameters weighting the expansion terms.

Notes:

- If $\sigma$ in (12) is chosen too high, the set $S$ may be empty. Then the sets $T_k$ will be empty and the document vectors $\bar{d}_k$ will be $(0, \ldots, 0)^T$. In this case the new query is not expanded.
- Even if a query $q_k$ is in the set $S$, the corresponding set $T_k$ may be empty (in case no relevance judgements are contained in the ground truth data for query $q_k$). Then the corresponding document vector $\bar{d}_k$ will be $(0, \ldots, 0)^T$.
- The parameters $\sigma$ in (12) and $\lambda_k$ in (15) are the tuning parameters of method QSD.

4 Query Linear Combination and Relevant Documents

In this paper we also employ a query expansion method based on linear combinations of similar queries and their relevant documents (QLD). As before, the method uses feedback information and information globally available from previous queries; the feedback information is available in the ground truth data provided by the test document collections, which lists the relevant documents for each query. Relevance information for one query is again represented by an $N$-dimensional vector

    $r_k = (r_{1k}, r_{2k}, \ldots, r_{Nk})^T$,  $1 \le k \le L$,   (16)

where

    $r_{ik} = \begin{cases} 1 & \text{if document } i \text{ is relevant to query } k \\ 0 & \text{if document } i \text{ is not relevant to query } k. \end{cases}$   (17)

4.1 Query Expansion Method

This query expansion method is based on constructing the new query as a linear combination of existing old queries. We first give a high-level description of the method; the detailed mathematical description follows. Query expansion works as follows:

- compute the similarities between the new query and each of the existing old queries;
- select the old queries having a similarity to the new query greater than or equal to a given threshold;
- use these selected old queries to approximately rebuild the new query as a linear combination of the old queries according to the least squares method, and obtain the coefficients of the linear combination;
- keep those old queries whose coefficient in the linear combination is greater than or equal to a given threshold;
- for these selected old queries, get the sets of relevant documents from the ground truth data;
- from each set of relevant documents, compute a new document vector;
- use these document vectors and a weighting scheme to enrich the new query, where the coefficients of the linear combination are used as weighting parameters.

The formal description is as follows. The similarity $\mathrm{sim}(q_k, q_l)$ between a query $q_k$ and a query $q_l$ is again measured by the cosine of the angle between these two $M$-dimensional vectors:

    $\mathrm{sim}(q_k, q_l) = \dfrac{q_k^T q_l}{\|q_k\| \, \|q_l\|}$,   (18)

or simply by the dot product of the two vectors,

    $\mathrm{sim}(q_k, q_l) = q_k^T q_l = \sum_{i=1}^{M} w_{ik} w_{il}$,   (19)

in the case that the vectors are already normalized (and hence have unit length). From now on, denote $\sigma_k := \mathrm{sim}(q_k, q_l)$ for a fixed query $q_l$. Let $S$ be the set

    $S = \{ q_k \mid \sigma_k \ge \sigma,\ 1 \le k \le L \}$   (20)

of existing old queries $q_k$ having a similarity greater than or equal to a threshold $\sigma$ to the new query $q_l$, let $|S|$ be the number of queries in $S$, and let $T_k$ be the sets

    $T_k = \{ d_j \mid q_k \in S \wedge r_{jk} = 1,\ 1 \le j \le N \}$,  $1 \le k \le L$,   (21)

of all documents relevant to the queries $q_k$ in $S$. Then the sums $\bar{d}_k$ of the document vectors in each $T_k$,

    $\bar{d}_k = \sum_{d_j \in T_k} d_j$,   (22)

are used as expansion terms for the original query.
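The two expansion schemes introduced so far, PRF (eqs. 5-7) and QSD (eqs. 12-15), can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the parameter defaults are arbitrary rather than the tuned values of Section 7, and a single lambda is used for all selected queries, which is one of the settings evaluated in the text.

```python
import numpy as np

def _unit(v):
    """Normalize a vector to unit length (zero vectors are returned unchanged)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def prf_expand(D, q, theta=0.8, alpha=0.5):
    """Pseudo relevance feedback (eqs. 5-7): expand q with top-ranked documents.

    D is a document-term matrix with one document vector per row; theta is the
    similarity threshold relative to the best hit, alpha the expansion weight.
    """
    sims = np.array([_unit(d) @ _unit(q) for d in D])
    E = D[sims / sims.max() >= theta]     # eq. (5): documents close to the best hit
    d_s = E.sum(axis=0)                   # eq. (6): sum of selected document vectors
    return q + alpha * _unit(d_s)         # eq. (7)

def qsd_expand(q_new, old_queries, relevance, D, sigma=0.5, lam=0.5):
    """QSD (eqs. 12-15): expand q_new with relevant documents of similar old queries.

    relevance[k, j] = 1 iff document j is relevant to old query k (the vectors r_k).
    """
    q_exp = q_new.astype(float).copy()
    for k, q_k in enumerate(old_queries):
        if _unit(q_k) @ _unit(q_new) < sigma:
            continue                              # eq. (12): keep similar queries only
        d_k = D[relevance[k] == 1].sum(axis=0)    # eqs. (13)-(14)
        if np.linalg.norm(d_k) > 0:
            q_exp += lam * _unit(d_k)             # eq. (15); empty T_k adds nothing
    return q_exp
```

Note how the empty-set cases from the notes above fall out naturally: an old query below the threshold, or one without relevance judgements, contributes nothing to the expanded query.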

The expanded query vector $q'_l$ is obtained by

    $q'_l = q_l + \sum_{k=1}^{|S|} \lambda_k \dfrac{\bar{d}_k}{\|\bar{d}_k\|}$,   (23)

where the $\lambda_k$ are parameters weighting the expansion terms. The $\lambda_k$ are computed as follows. In most cases we cannot represent the new query $q_l$ exactly as a linear combination of the old queries $q_k \in S$, i.e.

    $q_l = \sum_{k=1}^{|S|} \lambda_k q_k$,  $q_k \in S$,   (24)

will not have a solution for the coefficients $\lambda_k$. Equation (24) is equivalent to a system of linear equations

    $Q \lambda = q_l$,   (25)

where $Q = (q_1, q_2, \ldots, q_{|S|}) \in \mathbb{R}^{M \times |S|}$ is a matrix of $M$ rows and $|S|$ columns and $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_{|S|})^T$ is a column vector of $|S|$ elements. In our case the system is normally overdetermined ($M \gg |S|$) and has no exact solution. In this situation we find a vector $\hat{\lambda}$ that provides a closest fit to the equation in some sense. Our approach is to minimize the Euclidean norm of the vector $Q\lambda - q_l$, i.e. we solve

    $\hat{\lambda} = \mathrm{argmin}_{\lambda} \|Q\lambda - q_l\|$,   (26)

where $\hat{\lambda} = (\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_{|S|})^T$ is called the least squares solution of the system $Q\lambda = q_l$. From this set of $|S|$ coefficients $\hat{\lambda}_k$ we then keep only those that are greater than or equal to a predefined parameter $\beta$, i.e.

    $\lambda_k = \begin{cases} \hat{\lambda}_k & \text{if } \hat{\lambda}_k \ge \beta \\ 0 & \text{if } \hat{\lambda}_k < \beta. \end{cases}$   (27)

Notes:

- If $\sigma$ in (20) is chosen too high, the set $S$ may be empty. Then the sets $T_k$ will be empty and the document vectors $\bar{d}_k$ will be $(0, \ldots, 0)^T$. In this case the new query is not expanded.
- Even if a query $q_k$ is in the set $S$, the corresponding set $T_k$ may be empty (in case no relevance judgements are contained in the ground truth data for query $q_k$). Then the corresponding document vector $\bar{d}_k$ will be $(0, \ldots, 0)^T$.
- The parameters $\sigma$ in (20) and $\beta$ in (27) are the tuning parameters of method QLD.
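The least-squares step (26) and the coefficient thresholding (27) map directly onto NumPy's least-squares solver. The function name and the default beta below are illustrative only.

```python
import numpy as np

def qld_coefficients(q_new, old_queries, beta=0.1):
    """Fit q_new as a linear combination of old queries (eqs. 24-27).

    old_queries holds one (already selected) old query vector per row, so its
    transpose is the matrix Q of eq. (25). Coefficients below beta are zeroed
    as in eq. (27).
    """
    Q = old_queries.T                                     # M x |S| query columns
    lam_hat, *_ = np.linalg.lstsq(Q, q_new, rcond=None)   # eq. (26)
    return np.where(lam_hat >= beta, lam_hat, 0.0)        # eq. (27)
```

The surviving coefficients are then used as the weights $\lambda_k$ in eq. (23).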

5 Term-Based Concept Learning

In this paper we further employ a query expansion method based on learning a concept for each query term (TCL). As before, the method uses feedback information and information globally available from previous queries: the ground truth data of the test document collections provides, for each query, the list of its relevant documents. Relevance information for one query is represented by an $N$-dimensional vector

    $r_k = (r_{1k}, r_{2k}, \ldots, r_{Nk})^T$,  $1 \le k \le L$,   (28)

where

    $r_{ik} = \begin{cases} 1 & \text{if document } i \text{ is relevant to query } k \\ 0 & \text{if document } i \text{ is not relevant to query } k. \end{cases}$   (29)

5.1 Query Expansion Method

Our query expansion method is based on learning a term-based concept for each term occurring in the current query. These concepts are learned from the ground-truth documents of old queries. We first give a high-level description of the method; the detailed mathematical description follows. The TCL method is divided into two phases, a learning phase and an expansion phase. The learning phase works as follows for each term:

- select the old queries in which the specific query term occurs;
- for these selected old queries, get the sets of relevant documents from the ground truth data;
- from each set of relevant documents, compute a new document vector;
- use these document vectors to build the term concept.

The application phase for each term is simple:

- select the appropriate concept of the current term;
- use a weighting scheme to enrich the new query with the concept.

The formal description is as follows. Let $\tilde{q} = (\tilde{w}_1, \ldots, \tilde{w}_i, \ldots, \tilde{w}_M)^T$ be the user query represented within the vector space model (see Section 2.1). For each term of the query, the appropriate weight $\tilde{w}_i$ is between 0 and 1. Let $q_k = (w_{1k}, \ldots, w_{ik}, \ldots, w_{Mk})^T$, $1 \le k \le L$, be the queries given in the document collection.
Let $\mathcal{Q}$ be the set of all queries $q_k$ (note that $\tilde{q} \notin \mathcal{Q}$) and $D$ be the set of all documents $d$. Let $R^+(q_k)$ be the set of all documents which are relevant to $q_k$:

    $R^+(q_k) = \{ d_j \in D \mid r_{jk} = 1 \}$.   (30)

Now, the first step of the learning phase collects all (previous) queries which have the $i$-th term in common:

    $Q_i = \{ q_k \in \mathcal{Q} \mid w_{ik} > 0 \}$.   (31)

If the $i$-th term does not occur in any query $q_k$, then $Q_i$ is empty. The second step collects all documents which are relevant to these collected queries:

    $D_i = \{ d_j \mid d_j \in R^+(q_k) \wedge q_k \in Q_i \}$.   (32)

Finally, in the last step of the learning phase, the concept of each term $t_i$ is built as the sum of all documents which are relevant to the (previous) queries that have the term in common:

    $C(t_i) = \sum_{d_j \in D_i} d_j$.   (33)

Analogously to queries and documents, a term-based concept is represented as a vector. In the case that the term $t_i$ occurs in no query $q_k$, the concept is empty; it is represented as $(0, \ldots, 0)^T$. Once the term-based concepts are learned, the user query $\tilde{q}$ can be expanded. Analogously to (15), the expanded query vector $\tilde{q}'$ is obtained by

    $\tilde{q}' = \tilde{q} + \sum_{i=1}^{M} \omega_i C(t_i)$,   (34)

where the $\omega_i$ are parameters weighting the concepts. In the experiments described below, $\omega_i$ is set to 1 and can be ignored. Before the expanded query is applied, it is normalized:

    $\tilde{q}' \leftarrow \dfrac{\tilde{q}'}{\|\tilde{q}'\|}$.   (35)
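The learning and application phases above can be sketched as follows. The function names and the dense matrix representation (one concept vector per term, stacked into a matrix) are illustrative choices, not part of the paper's formulation; note that $D_i$ is a set, so a document relevant to several queries in $Q_i$ is summed only once.

```python
import numpy as np

def learn_concepts(queries, relevance, D):
    """Learn one concept vector per term (eqs. 30-33).

    queries:   matrix with one old query vector per row.
    relevance: binary matrix, relevance[k, j] = 1 iff document j is relevant
               to old query k.
    D:         document-term matrix with one document vector per row.
    Returns a matrix C whose row i is the concept C(t_i).
    """
    M = D.shape[1]
    C = np.zeros((M, M))
    for i in range(M):
        Q_i = np.nonzero(queries[:, i] > 0)[0]     # eq. (31): queries with term i
        if len(Q_i) == 0:
            continue                               # term in no query: empty concept
        D_i = relevance[Q_i].max(axis=0) == 1      # eq. (32): union of relevant docs
        C[i] = D[D_i].sum(axis=0)                  # eq. (33)
    return C

def tcl_expand(q, C):
    """Expand q by the concepts of its terms (eq. 34, all weights set to 1)
    and normalize the result (eq. 35)."""
    q_exp = q + C[q > 0].sum(axis=0)
    return q_exp / np.linalg.norm(q_exp)
```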

6 Experimental Design

In this section we describe the test collections used in the experimental comparison. We use standard document test collections with standard queries and questions provided by [6] and [7]. On the one hand, by using these collections we take advantage of the ground truth data for performance evaluation. On the other hand, we do not expect them to contain queries with highly correlated similarities, as we would expect in a real-world application; it is therefore a challenging task to show performance improvements for our methods. In our experiments we used the following eight collections:

- The CACM (titles and abstracts from the journal Communications of the ACM), CISI (Institute of Scientific Information) and CRAN (aeronautics abstracts) collections, available at [6]. All three collections are provided with queries and their ground truth.
- The CR (Congressional Record) collection, contained in the TREC test collections disk 4 [7] and accompanied by the ground truth for 34 selected queries out of the TREC standard queries. We created three test cases for the CR collection, using the TREC queries of different lengths in order to investigate the influence of query length: CR-title contains the title queries (the shortest query representation), CR-desc the description queries (the medium-length representation), and CR-narr the narrative queries (the longest representation).
- The FR (Federal Register) collection, contained in the TREC test collections disk 2 and accompanied by the ground truth for 112 selected queries out of the TREC standard queries.
- The AP90 (Associated Press articles) collection, contained in the TREC test collections disk 3. From the TREC-9 Question Answering track (QA) we selected a question set; questions 201-700 were created without reference to the document set.
Then, in a separate pass, equivalent but re-worded questions were created from a subset of these 500 questions [8]. Because of the method used for constructing this set (especially the re-worded questions), we expected to obtain higher similarities between different questions. From the ground truth data provided with the QA track we selected only those questions having a relevant answer document in the AP90 document collection. This reduced our test data to 723 documents and 353 questions. Terms used for document and query representation were obtained by stemming and eliminating stopwords. Table 1 lists statistics about the collections after stemming and stopword elimination; statistics about these collections before stemming and stopword elimination can be found in [1] and [3].
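The preprocessing mentioned above (stemming and stopword elimination) can be illustrated with a toy sketch. A real system would use a proper stemming algorithm such as the Porter stemmer; the crude suffix stripping below is illustrative only.

```python
def extract_terms(text, stopwords):
    """Toy preprocessing: lowercase, tokenize, drop stopwords, strip a few
    common suffixes. Not a real stemmer; for illustration only."""
    terms = []
    for w in text.lower().split():
        w = w.strip(".,;:!?()\"'")
        if not w or w in stopwords:
            continue                      # stopword elimination
        for suffix in ("ing", "ed", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[: -len(suffix)]     # crude stand-in for stemming
                break
        terms.append(w)
    return terms
```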

The best-known term weighting schemes use weights according to the so-called tf-idf scheme. In our experiments we employ a standard scheme as follows. For document vectors, the weights $w_{ij}$ are calculated as

    $w_{ij} = tf_{ij} \cdot idf_i$,   (36)

where $tf_{ij}$ is a weight computed from the raw frequency $f_{ij}$ of a term $t_i$ (the number of occurrences of term $t_i$ in document $d_j$),

    $tf_{ij} = f_{ij}$,   (37)

and $idf_i$ is the inverse document frequency of term $t_i$, given by

    $idf_i = \log \dfrac{N}{n_i}$,   (38)

where $n_i$ is the number of documents containing term $t_i$. For query vectors, the weights $w_{ik}$ are calculated as

    $w_{ik} = f_{ik}$,   (39)

where $f_{ik}$ is the raw frequency of term $t_i$ in query $q_k$ (the number of occurrences of term $t_i$ in query $q_k$).

Table 1. Statistics about test collections

                                CACM   CISI   CRAN   CR-desc  CR-narr  CR-title  FR     AP90
mean document length            short  med    med    long     long     long      long   long
mean query length               med    long   med    med      long     short     med    short
mean number of relevant
documents per query             med    high   med    high     high     high      med    low

7 Experimental Results

In this section the results of the experiments are presented. Results were evaluated using the average precision over all queries. Recall/precision graphs were generated according to [1]. Significance tests were then applied to the results.
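The weighting scheme of equations (36)-(39) can be sketched directly; the function name is an illustrative assumption.

```python
import numpy as np

def document_weights(F):
    """tf-idf document weights (eqs. 36-38).

    F: raw term-frequency matrix, F[j, i] = occurrences of term i in document j.
    """
    N = F.shape[0]
    n = np.count_nonzero(F, axis=0)        # documents containing each term
    idf = np.log(N / np.maximum(n, 1))     # eq. (38); guard terms occurring nowhere
    return F * idf                         # eqs. (36)-(37): tf_ij = f_ij

# Query weights are simply the raw term frequencies f_ik (eq. 39).
```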

7.1 Results for the QSD Method

The methods VSM (vector space model), PRF (pseudo relevance feedback) and QSD (query similarity and relevant documents) were applied. The best settings of the parameters $\alpha$ and $\theta$ for method PRF were obtained by experiment; $\alpha$ and $\theta$ are chosen such that average precision is highest (see Table 2 and [3]).

Table 2. Best parameter values for method PRF ($\alpha$, $\theta$) and method QSD ($\sigma$, $\lambda_k$) per collection. For QSD, the setting $\lambda_k = \sigma_k$ performed best on all eight collections.

Method QSD was evaluated using different settings of the parameters $\sigma$ and $\lambda_k$. From the set of queries contained in each collection we selected each query in turn and treated it as a new query $q_l$, $1 \le l \le L$. For each fixed query $q_l$ we computed the similarity $\sigma_k := \mathrm{sim}(q_k, q_l)$ for all queries $q_k$, $1 \le k \le L$, $k \ne l$, according to equations (10) and (11). Then $\sigma$ in equation (12) was varied from 0.0 up to 1.0 in steps of 0.01, and all $\lambda_k$ in equation (15) were varied simultaneously from 0.0 up to 5.0 in steps of 0.1. An additional run set each $\lambda_k$ to the value of $\sigma_k$ (the similarity between the new query $q_l$ and the existing query $q_k$). Finally we obtained the new query vector according to equation (15) and issued the query. The best parameter values for $\sigma$ and $\lambda_k$ are reported in Table 2.

Table 3. Best parameter values for methods QSDPRF and PRFQSD. For QSDPRF, $\sigma$ and $\lambda_k$ are the same as in Table 2 and $\alpha$ and $\theta$ were tuned on top; for PRFQSD, $\alpha$ and $\theta$ are the same as in Table 2 and $\sigma$ was tuned on top, with $\lambda_k = \sigma_k$ again performing best on all collections.

In the next steps we combined two query expansion methods, in two ways. First, after expanding the new query with the QSD method, we applied the PRF method to the expanded query; this combination is reported as the QSDPRF method. Second, after expanding the new query with the PRF method, we applied the QSD method to the expanded query; this combination is reported as the PRFQSD method. The best parameter settings were again obtained by experiment and chosen such that average precision is highest; they are reported in Table 3.

Table 4. Average precision obtained by the different methods (VSM, PRF, QSD, QSDPRF, PRFQSD) per collection. For each collection the best value of average precision is indicated by bold font, the second best by italic font.

Table 4 shows the average precision obtained by using the best parameter values for the different methods. Table 5 compares the average precision obtained by the different methods. The ratio of difference is calculated as follows: let X be the average precision obtained by one method and Y the average precision obtained by another method. Then the ratio is calculated as (X - Y)/Y. A positive value indicates an improvement, a negative value a degradation in average precision. Note that average precision values in Table 4 are rounded to three digits, while unrounded values are used when computing the ratios of differences.

Table 5. Average precision comparison

X       Y       CACM    CISI    CRAN    CR-desc  CR-narr  CR-title  FR      AP90
PRF     VSM     +53.2%  +7.3%   +13.4%  +16.6%   +10.9%   +25.5%    +33.3%  +1.6%
QSD     VSM     +82.5%  +17.8%  +11.4%  -2.0%    0.0%     +12.9%    +28.9%  +9.2%
QSD     PRF     +19.1%  +9.8%   -1.7%   -15.9%   -9.8%    -1.1%     -3.3%   +7.4%
QSDPRF  VSM     +98.5%  +20.4%  +17.6%  +11.2%   +10.8%   +31.4%    +92.1%  +9.5%
QSDPRF  PRF     +29.6%  +12.2%  +3.7%   -4.6%    0.0%     +4.7%     +44.1%  +7.8%
QSDPRF  QSD     +8.8%   +2.3%   +5.5%   +13.5%   +10.7%   +16.4%    +49.0%  +0.4%
PRFQSD  VSM     +96.9%  +25.6%  +20.6%  +12.1%   +10.8%   +33.4%    +64.3%  +9.6%
PRFQSD  PRF     +28.5%  +17.0%  +6.4%   -3.9%    0.0%     +6.3%     +23.2%  +7.8%
PRFQSD  QSD     +7.8%   +6.6%   +8.3%   +14.4%   +10.8%   +18.2%    +27.4%  +0.4%
PRFQSD  QSDPRF  -0.8%   +4.3%   +2.6%   +0.8%    0.0%     +1.5%     -14.5%  0.0%

Figures 1 to 4 show the recall/precision graphs for the test collections. Each figure contains the graphs for methods VSM, PRF, QSD, QSDPRF and PRFQSD.
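The exhaustive parameter sweeps described above amount to a grid search over the tuning parameters. A minimal sketch, assuming an `evaluate` callback (not part of the paper) that returns the average precision for one parameter setting:

```python
def best_parameters(evaluate, sigmas, lams):
    """Grid search over (sigma, lambda) settings, returning the pair with the
    highest score according to the supplied evaluate(sigma, lam) callback."""
    best = max((evaluate(s, l), s, l) for s in sigmas for l in lams)
    return best[1], best[2]
```

In the experiments, `evaluate` would run all queries of a collection with the given setting and average the precision over the queries.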

Fig. 1. Recall/precision graphs for the CACM and CISI test collections. $\sigma_1$ is the tuning parameter of the QSD method, $\sigma_2$ that of the PRFQSD method. CACM ($\sigma_1$ = 0.4, $\sigma_2$ = 0.58): average precision 0.13 (VSM), 0.199 (PRF), 0.237 (QSD), 0.257 (QSDPRF), 0.255 (PRFQSD). CISI ($\sigma_1$ = 0.41): average precision 0.12 (VSM), 0.129 (PRF), 0.142 (QSD), 0.145 (QSDPRF), 0.151 (PRFQSD).

Fig. 2. Recall/precision graphs for the CR-desc and CR-narr test collections. CR-desc ($\sigma_1$ = 0.44, $\sigma_2$ = 0.6): average precision 0.175 (VSM), 0.204 (PRF), 0.172 (QSD), 0.195 (QSDPRF), 0.196 (PRFQSD). CR-narr ($\sigma_1$ = 0.3): average precision 0.173 (VSM), 0.192 (PRF), 0.173 (QSD), 0.191 (QSDPRF), 0.192 (PRFQSD).

Fig. 3. Recall/precision graphs for the CR-title and CRAN test collections. CR-title ($\sigma_1$ = 0.42, $\sigma_2$ = 0.46): average precision 0.135 (VSM), 0.169 (PRF), 0.152 (QSD), 0.177 (QSDPRF), 0.18 (PRFQSD). CRAN ($\sigma_1$ = 0.49): average precision 0.384 (VSM), 0.435 (PRF), 0.428 (QSD), 0.451 (QSDPRF), 0.463 (PRFQSD).

Fig. 4. Recall/precision graphs for the FR and AP90 test collections. FR ($\sigma_1$ = 0.6, $\sigma_2$ = 0.44): average precision 0.085 (VSM), 0.113 (PRF), 0.109 (QSD), 0.163 (QSDPRF), 0.139 (PRFQSD). AP90 ($\sigma_1$ = 0.68): average precision 0.743 (VSM), 0.755 (PRF), 0.811 (QSD), 0.814 (QSDPRF), 0.814 (PRFQSD).

7.2 Results for the QLD Method

The methods VSM (vector space model), PRF (pseudo relevance feedback) and QLD (query linear combination and relevant documents) were applied. The best settings of the parameters $\alpha$ and $\theta$ for method PRF were obtained by experiment; $\alpha$ and $\theta$ are chosen such that average precision is highest (see Table 6 and [3]).

Table 6. Best parameter values for method PRF ($\alpha$, $\theta$) and method QLD ($\sigma$, $\beta$) per collection.

Method QLD was evaluated using different settings of the parameters $\sigma$ and $\beta$. From the set of queries contained in each collection we selected each query in turn and treated it as a new query $q_l$, $1 \le l \le L$. For each fixed query $q_l$ we computed the similarity $\sigma_k := \mathrm{sim}(q_k, q_l)$ for all queries $q_k$, $1 \le k \le L$, $k \ne l$, according to equations (18) and (19). Then $\sigma$ in equation (20) was varied from 0.0 up to 1.0 in steps of 0.01, and $\beta$ in equation (27) was varied from 0.0 up to the maximum value $\hat{\lambda}_k$ computed by equation (26) in steps of 0.01. Finally we obtained the new query vector according to equation (23) and issued the query. The best parameter values for $\sigma$ and $\beta$ are reported in Table 6.

Table 7. Best parameter values for methods QLDPRF and PRFQLD. For QLDPRF, $\sigma$ and $\beta$ are the same as in Table 6 and $\alpha$ and $\theta$ were tuned on top; for PRFQLD, $\alpha$ and $\theta$ are the same as in Table 6 and $\sigma$ and $\beta$ were tuned on top.

In the next steps we combined two query expansion methods, in two ways. First, after expanding the new query with the QLD method, we applied the PRF method to the expanded query; this combination is reported as the QLDPRF method. Second, after expanding the new query with the PRF method, we applied the QLD method to the expanded query; this combination is reported as the PRFQLD method. The best parameter settings were again obtained by experiment and chosen such that average precision is highest; they are reported in Table 7.

Table 8. Average precision obtained by the different methods (VSM, PRF, QLD, QLDPRF, PRFQLD) per collection. For each collection the best value of average precision is indicated by bold font, the second best by italic font.

Table 8 shows the average precision obtained by using the best parameter values for the different methods. Table 9 compares the average precision obtained by the different methods. The ratio of difference is calculated as before: let X and Y be the average precision values of two methods; then the ratio is (X - Y)/Y, with positive values indicating an improvement of X over Y. Average precision values in Table 8 are rounded to three digits, while unrounded values are used when computing the ratios.

Table 9. Average precision comparison

X       Y       CACM     CISI    CRAN    CR-desc  CR-narr  CR-title  FR      AP90
PRF     VSM     +53.2%   +7.3%   +13.4%  +16.6%   +10.9%   +25.5%    +33.3%  +1.6%
QLD     VSM     +75.3%   +41.9%  +13.5%  -0.0%    +1.0%    +21.7%    +27.0%  +9.1%
QLD     PRF     +14.4%   +32.3%  +0.0%   -14.4%   -8.9%    -3.0%     -4.7%   +7.4%
QLDPRF  VSM     +110.0%  +43.7%  +18.2%  +16.4%   +10.9%   +36.4%    +89.7%  +9.6%
QLDPRF  PRF     +37.2%   +34.0%  +4.2%   -0.0%    0.0%     +8.7%     +42.3%  +7.8%
QLDPRF  QLD     +19.9%   +1.3%   +4.1%   +16.7%   +9.8%    +12.1%    +49.4%  +0.4%
PRFQLD  VSM              +40.5%  +22.5%  +18.5%   +11.7%   +40.6%    +69.6%  +9.5%
PRFQLD  PRF     +38.2%   +30.9%  +8.0%   +1.6%    +0.8%    +12.0%    +27.2%  +7.8%
PRFQLD  QLD     +20.8%   -1.0%   +7.9%   +18.7%   +10.6%   +15.5%    +33.5%  +0.4%
PRFQLD  QLDPRF  +0.7%    -2.3%   +3.7%   +1.8%    +0.7%    +3.0%     -10.6%  0.0%

Figures 5 to 8 show the recall/precision graphs for the test collections. Each figure contains the graphs for methods VSM, PRF, QLD, QLDPRF and PRFQLD.
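The ratio of difference used in the comparison tables is a one-line computation; the percentage formatting below mirrors the way the tables report it.

```python
def improvement_ratio(x, y):
    """Ratio of difference (X - Y) / Y; positive means X improves over Y."""
    return (x - y) / y

def as_percent(r):
    """Format a ratio the way the comparison tables report it, e.g. '+53.0%'."""
    return f"{r * 100:+.1f}%"
```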

Fig. 5. Recall/precision graphs for the CACM and CISI test collections. $\sigma_1$ and $\beta_1$ are the tuning parameters of the QLD method, $\sigma_2$ and $\beta_2$ those of the PRFQLD method. CACM ($\sigma_1$ = 0.2, $\beta_1$ = 0.16, $\sigma_2$ = 0.14, $\beta_2$ = 0.5): average precision 0.13 (VSM), 0.199 (PRF), 0.227 (QLD), 0.273 (QLDPRF), 0.275 (PRFQLD). CISI ($\sigma_1$ = 0.5, $\beta_1$ = 0.3, $\sigma_2$ = 0.4): average precision 0.12 (VSM), 0.129 (PRF), 0.171 (QLD), 0.173 (QLDPRF), 0.169 (PRFQLD).

Fig. 6. Recall/precision graphs for the CR-desc and CR-narr test collections. CR-desc ($\sigma_1$ = 0.6, $\beta_1$ = 0.7): average precision 0.175 (VSM), 0.204 (PRF), 0.175 (QLD), 0.204 (QLDPRF), 0.208 (PRFQLD). CR-narr ($\sigma_1$ = 0.17, $\sigma_2$ = 0.16): average precision 0.173 (VSM), 0.192 (PRF), 0.175 (QLD), 0.192 (QLDPRF), 0.193 (PRFQLD).

Fig. 7. Recall/precision graphs for the CR-title and CRAN test collections. CR-title ($\sigma_1$ = 0.41, $\beta_1$ = 0.43, $\sigma_2$ = 0.46, $\beta_2$ = 0.47): average precision 0.135 (VSM), 0.169 (PRF), 0.164 (QLD), 0.184 (QLDPRF), 0.19 (PRFQLD). CRAN ($\sigma_1$ = 0.7, $\beta_1$ = 0.41, $\sigma_2$ = 0.43): average precision 0.384 (VSM), 0.435 (PRF), 0.436 (QLD), 0.453 (QLDPRF), 0.47 (PRFQLD).

Fig. 8. Recall/precision graphs for the FR and AP90 test collections. FR ($\sigma_1$ = 0.7, $\beta_1$ = 0.13, $\sigma_2$ = 0.4, $\beta_2$ = 0.4): average precision 0.085 (VSM), 0.113 (PRF), 0.108 (QLD), 0.161 (QLDPRF), 0.144 (PRFQLD). AP90 ($\sigma_1$ = 0.68, $\beta_1$ = 0.48, $\sigma_2$ = 0.69): average precision 0.743 (VSM), 0.755 (PRF), 0.81 (QLD), 0.814 (QLDPRF), 0.813 (PRFQLD).

7.3 Results for the TCL Method

The methods VSM (vector space model), PRF (pseudo relevance feedback) and TCL (term-based concept learning) were applied. The best settings of the parameters $\alpha$ and $\theta$ for method PRF were obtained by experiment; $\alpha$ and $\theta$ are chosen such that average precision is highest (see Table 2). Method TCL itself was evaluated without tuning parameters, i.e. all weighting parameters are set to 1. From the set of queries contained in each collection we selected each query in turn and treated it as a new query $q_l$, $1 \le l \le L$. For each fixed query $q_l$ we learned a concept for each query term according to equations (30)-(33). The new query $q_l$ was then enriched with the learned term-based concepts, and we obtained the new query vector according to equations (34) and (35) and issued the query.

Table 10. Best parameter values for methods PRF(Concepts) and Concept+PRF. For Concept+PRF, $\alpha$ and $\theta$ are the same as in Table 2, with the concept weight $\omega$ tuned per collection.

In the next step we combined the TCL method with PRF in two ways. First, after selecting the appropriate old queries, we applied the PRF method to the ground-truth documents of these queries, so that not all ground-truth documents are used to learn the term-based concepts, but only the best-ranked ones. This method is reported as the PRF(Concepts) method. The best parameter settings were again obtained by experiment and chosen such that average precision is highest; they are reported in Table 10. Second, we applied the PRF method in parallel with the TCL method: for each query (PRF) and each query term (TCL), the PRF documents are summed into a document vector, and the term-based concept is added with a weight $\omega$ to the final document vector, which is then used to expand the query. This method is reported as the Concept+PRF method.
Table 11 shows the average precision obtained by using the best parameter values for the different methods. For each collection the best value of average precision is indicated in bold font, the second best in italic font. Figures 9 to 12 show the recall/precision graphs for the test collections. Each figure contains the graphs for the methods VSM, PRF, Concept, PRF(Concept) and Concept+PRF.
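The evaluation measure underlying table 11 and all subsequent significance tests is average precision; a minimal sketch of the common non-interpolated variant (the chapter does not spell out which variant is used, so this is an assumption):

```python
def average_precision(ranking, relevant):
    """Non-interpolated average precision: the mean of precision@k over the
    ranks k at which a relevant document is retrieved, divided over all
    relevant documents (unretrieved relevant documents contribute zero)."""
    hits = 0
    precision_sum = 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k   # precision at rank k
    return precision_sum / len(relevant) if relevant else 0.0
```

The per-query values produced this way are averaged over all queries of a collection to give one number per method, and the per-query values themselves are the paired samples used in the t-tests below.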

Table 11. Average precision obtained by the different methods (rows: VSM, PRF, Concepts, PRF(Concepts), Concept+PRF; columns: CACM, CISI, CRAN, CR-desc, CR-narr, CR-title, FR, AP9).

Fig. 9. Recall/precision graphs for the CACM and CISI test collections.
Fig. 10. Recall/precision graphs for the CRAN and CR-desc test collections.
Fig. 11. Recall/precision graphs for the CR-narr and CR-title test collections.
Fig. 12. Recall/precision graphs for the FR and AP9 test collections.

7.4 Significance Testing The next step in the evaluation is the comparison of the average precision values obtained by the different methods. Statistical tests provide information about whether observed differences between methods are really significant or merely due to chance. Several statistical tests have been used in IR [2], [9]. We employ the paired t-test described in [2]: Let x_l and y_l be the scores of retrieval methods X and Y for query q_l, 1 ≤ l ≤ L, and define d_l = x_l − y_l. The assumption is that the model is additive, i.e. d_l = µ + ε_l, where µ is the mean value and the errors ε_l are independent and normally distributed. The null hypothesis H_0 is that µ = 0, i.e. method X performs as well as method Y. The alternative hypothesis H_1 is that µ > 0, i.e. method X performs better than method Y in terms of average precision. The t statistic

    t = d̄ / (s / √n),                          (40)

where

    d̄ = (1/n) Σ_{i=1}^{n} d_i                  (41)

and

    s² = (1/(n−1)) Σ_{i=1}^{n} (d_i − d̄)²      (42)

follows the t-distribution with n − 1 degrees of freedom, where n is the number of samples, d̄ is the sample mean and s² is the sample variance.
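Equations (40)-(42) translate directly into code; the following sketch computes the t statistic from two lists of per-query scores (the resulting value is then compared against the t-distribution with n − 1 degrees of freedom, e.g. via a table of critical values):

```python
import math

def paired_t_test(x, y):
    """Paired t statistic for per-query scores of methods X and Y,
    following equations (40)-(42)."""
    n = len(x)
    d = [xi - yi for xi, yi in zip(x, y)]          # d_l = x_l - y_l
    d_bar = sum(d) / n                             # sample mean, eq. (41)
    s2 = sum((di - d_bar) ** 2 for di in d) / (n - 1)  # sample variance, eq. (42)
    return d_bar / math.sqrt(s2 / n)               # t statistic, eq. (40)
```

A positive t larger than the one-sided critical value at the chosen significance level α rejects H_0 in favour of method X.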

Given the value of t we obtain the p-value, i.e. the probability of observing the sample results d_l under the assumption that H_0 is true. Comparing the p-value to a given significance level α, we can decide whether the null hypothesis H_0 should be rejected or not. 7.5 Significance Testing for QSD method The results are shown in table 12. Each row contains the results of two tests: method X against method Y, and vice versa. We tested each method X against each other method Y and vice versa, using significance levels α = 0.05 and α = 0.01. If the two significance levels lead to different results, this is indicated in parentheses: the value without parentheses refers to α = 0.05, the value in parentheses to α = 0.01. An entry of + in a table cell indicates that the null hypothesis is rejected when testing X against Y; at significance level α = 0.05 this means that method X is likely to perform better than method Y, and at α = 0.01 that method X is almost guaranteed to perform better than method Y. An entry of − in a table cell indicates that the null hypothesis is rejected when testing Y against X, with the analogous interpretation in favour of method Y. An entry of o in a table cell indicates that the null hypothesis cannot be rejected in either test, i.e. there is low probability that one of the methods performs better than the other.

Table 12. Paired t-test results for α = 0.05 (α = 0.01)

X Y CACM CISI CRAN CR-desc CR-narr CR-title FR AP9
PRF VSM (o) + (o) + (o) + (o)
QSD VSM + + (o) + o o o + (o) +
QSD PRF o o o (o) o o +
QSDPRF VSM + + (o) + + (o) o + (o) + +
QSDPRF PRF o o o o (o) o + (o) +
QSDPRF QSD o + (o) + + (o) o + (o) + o
PRFQSD VSM o + (o) + (o) + +
PRFQSD PRF + + (o) + o o o + (o) +
PRFQSD QSD o o + o o (o) + (o) + (o) o
PRFQSD QSDPRF o o o o o o o o

For significance level α = 0.05 it can be seen from table 12 that method QSD performs better than method VSM with a high probability for all collections except the three CR collections. Improvements of method QSD over

method PRF are significant only for the AP9 test collection, and PRF performs significantly better than QSD for the CR-desc and CR-narr collections. For all other collections (CACM, CISI, CRAN, CR-title and FR) there is no indication for rejecting the null hypothesis when testing QSD vs. PRF or vice versa. Hence, for these collections we cannot conclude from significance testing that one method performs better than the other. For significance level α = 0.01 it can be seen from table 12 that method QSD is almost guaranteed to perform better than method VSM only for the CACM, CRAN and AP9 collections. Improvements of method QSD over method PRF are significant only for the AP9 test collection, and PRF performs significantly better than QSD only for the CR-desc collection. For all other collections (CACM, CISI, CRAN, CR-narr, CR-title and FR) there is no indication for rejecting the null hypothesis when testing QSD vs. PRF or vice versa. Additionally, it can be seen that the combined methods QSDPRF and PRFQSD outperform the VSM, PRF and QSD methods in most cases, with a strong indication (at significance level α = 0.01) for the CRAN and FR collections. Only for the AP9 collection is there no indication that one of the combined methods QSDPRF or PRFQSD outperforms the QSD method. This seems to be a result of the construction of the queries in this collection (see section 6). 7.6 Significance Testing for QLD method The results are shown in table 13. Each row contains the results of two tests: method X against method Y, and vice versa. We tested each method X against each other method Y and vice versa, using significance levels α = 0.05 and α = 0.01.
The notation is the same as in table 12: values without parentheses refer to α = 0.05, values in parentheses to α = 0.01, and the entries +, − and o indicate that X performs significantly better, that Y performs significantly better, and that the null hypothesis cannot be rejected in either test, respectively.

Table 13. Paired t-test results for α = 0.05 (α = 0.01)

X Y CACM CISI CRAN CR-desc CR-narr CR-title FR AP9
PRF VSM (o) + (o) + (o) + (o)
QLD VSM o o o + (o) +
QLD PRF o + o o o o +
QLDPRF VSM (o)
QLDPRF PRF + (o) + o o o o + (o) +
QLDPRF QLD + + (o) + + o o + o
PRFQLD VSM (o)
PRFQLD PRF o o o + (o) +
PRFQLD QLD + (o) o + + o o + o
PRFQLD QLDPRF o o o o o o o o

For significance level α = 0.05 it can be seen from table 13 that method QLD performs better than method VSM with a high probability for all collections except for the three CR collections. Improvements of method QLD over method PRF are significant for the CISI and AP9 test collections, and PRF performs significantly better than QLD for the CR-desc collection. For all other collections (CACM, CRAN, CR-narr, CR-title and FR) there is no indication for rejecting the null hypothesis when testing QLD vs. PRF or vice versa. Hence, for these collections we cannot conclude from significance testing that one method performs better than the other. For significance level α = 0.01 it can be seen from table 13 that method QLD is almost guaranteed to perform better than method VSM for four collections. Improvements of method QLD over method PRF are significant only for the CISI and AP9 test collections, and PRF performs significantly better than QLD only for the CR-desc collection. For all other collections (CACM, CRAN, CR-narr, CR-title and FR) there is no indication for rejecting the null hypothesis when testing QLD vs. PRF or vice versa.
Additionally, it can be seen that the combined methods QLDPRF and PRFQLD outperform the VSM, PRF and QLD methods in most cases, with a strong indication (at significance level α = 0.01) for the CRAN, CR-desc and FR collections. Only for the AP9 collection is there no indication that one of the combined methods QLDPRF or PRFQLD outperforms the QLD method. This seems to be a result of the construction of the queries in this collection (see section 6). 7.7 Significance Testing for TCL method The results for the TCL method are shown in table 14, which is constructed analogously to table 12.

Table 14. Paired t-test results for α = 0.05 (α = 0.01)

X Y CACM CISI CRAN CR-desc CR-narr CR-title FR AP9
PRF VSM (o) + (o) + (o) + (o)
Concepts VSM + o + +
Concepts PRF + o o o
PRF(Concepts) VSM o + (o) + +
PRF(Concepts) PRF + o o o o + + (o)
PRF(Concepts) Concepts o (o) + o + +
Concept+PRF VSM + o + o + + (o) o
Concept+PRF PRF + o o o (o) o o
Concept+PRF Concepts + (o) (o) o + (o) +

For significance level α = 0.05 it can be seen that method TCL performs better than method VSM on half of the collections and at least as well as method PRF on half of the collections. The improvement of method TCL over method PRF is significant for the CACM test collection, and PRF performs significantly better than TCL for four collections. For the collections CR-desc, CR-title and FR there is no indication for rejecting the null hypothesis when testing TCL vs. PRF or vice versa. Hence, for these collections we cannot conclude from significance testing that one method performs better than the other. For significance level α = 0.01 it can be seen that method TCL is almost guaranteed to perform better than method VSM for three collections. The improvement of method TCL over method PRF is significant only for the CACM test collection, and PRF performs significantly better than TCL on four collections. For all other collections there is no indication for rejecting the null hypothesis when testing TCL vs. PRF or vice versa. Additionally, it can be seen that the combined methods PRF(Concepts) and Concept+PRF outperform the VSM, PRF and especially the TCL methods in most cases, with a strong indication (α = 0.01) for the CACM, CRAN and FR collections.
For the CR-narr collection it can be seen that only PRF performs better than VSM at the level α = 0.05. Neither TCL nor the combinations can improve on the VSM baseline. This is an indication of badly learned concepts: the long narrative queries contain terms which are not representative or which behave like stop words. Many terms occur in many queries, and the query is then expanded with concepts which carry no meaning. 8 Conclusions and Future Work 8.1 Conclusions We have experimentally compared new query expansion methods based on query similarities and document relevance with two conventional information retrieval methods. From the results gathered on eight static test collections we have only one clear indication that the QSD and QLD methods are superior to the conventional PRF method. Conversely, we also have only one clear indication that the conventional PRF method is superior to the QSD and QLD methods. From our results we conclude that the new methods can be combined with the conventional PRF method: no performance degradation has been observed for this combination, and the results obtained by combining the new QSD and QLD methods with the conventional PRF method are promising. Given the construction method for the queries in the AP9 test collection, where the QSD and QLD methods perform significantly better than the other methods, we expect the new methods to be useful in cases where old queries and their corresponding relevance information have been collected previously and where new queries are highly similar to existing old queries. 8.2 Future Work Experimental results on the static document collections are promising. In future work we will integrate the query expansion method into real-life IR systems. The first step will be the integration into search engines which deal with mostly static document collections. These search engines ought to provide learning methods for some form of relevance feedback.
Users issuing new queries will then profit from knowledge which has been learned previously from other users who issued the same or similar queries and judged the relevance of the documents retrieved by those queries. The next steps will be the formulation of a related query expansion method which considers the more dynamic nature of the underlying document collections, and the learning of the tuning parameters σ and β for the QSD and QLD methods.
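The collaborative expansion idea described above can be sketched as follows. The similarity threshold σ and the expansion weight β mirror the tuning parameters of the QSD method, but the concrete data structures and the simple additive combination are illustrative assumptions, not the exact formulation used in the experiments:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    num = sum(u.get(t, 0.0) * w for t, w in v.items())
    den = math.sqrt(sum(w * w for w in u.values())) * \
          math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

def qsd_expand(new_query, old_queries, relevant_docs, doc_vectors,
               sigma=0.5, beta=1.0):
    """Sketch of similarity-based expansion: enrich a new query with the
    relevant documents of sufficiently similar old queries (similarity
    threshold sigma, expansion weight beta)."""
    expanded = dict(new_query)
    for q_id, q_vec in old_queries.items():
        if cosine(new_query, q_vec) >= sigma:
            for d in relevant_docs[q_id]:
                for t, w in doc_vectors[d].items():
                    expanded[t] = expanded.get(t, 0.0) + beta * w
    return expanded
```

In a live system the stores of old queries and judged relevant documents would grow as users interact with the engine, which is exactly the collaborative knowledge referred to above.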

9 Acknowledgements This work was supported by the German Ministry for Education and Research, bmb+f (Grant: 1 IN 92 B8).

References

1. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Publishing Company.
2. D. Hull. Using statistical testing in the evaluation of retrieval experiments. In Proceedings of SIGIR-93, 1993.
3. K. Kise, M. Junker, A. Dengel, and K. Matsumoto. Experimental evaluation of passage-based document retrieval. In Proceedings of the Sixth International Conference on Document Analysis and Recognition (ICDAR'01), 2001.
4. C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press.
5. Y. Qiu and H.-P. Frei. Concept-based query expansion. In Proceedings of SIGIR-93, 16th ACM International Conference on Research and Development in Information Retrieval, Pittsburgh, US, 1993.
6. ftp://ftp.cs.cornell.edu/pub/smart
7. E. M. Voorhees and D. Harman. Overview of the ninth text retrieval conference (TREC-9). In Proceedings of the Ninth Text Retrieval Conference (TREC-9).
8. Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of SIGIR-99, pages 42-49, 1999.


More information

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang Matrix Factorization & Latent Semantic Analysis Review Yize Li, Lanbo Zhang Overview SVD in Latent Semantic Indexing Non-negative Matrix Factorization Probabilistic Latent Semantic Indexing Vector Space

More information

Applying the Divergence From Randomness Approach for Content-Only Search in XML Documents

Applying the Divergence From Randomness Approach for Content-Only Search in XML Documents Applying the Divergence From Randomness Approach for Content-Only Search in XML Documents Mohammad Abolhassani and Norbert Fuhr Institute of Informatics and Interactive Systems, University of Duisburg-Essen,

More information

PV211: Introduction to Information Retrieval

PV211: Introduction to Information Retrieval PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 11: Probabilistic Information Retrieval Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk

More information

IR Models: The Probabilistic Model. Lecture 8

IR Models: The Probabilistic Model. Lecture 8 IR Models: The Probabilistic Model Lecture 8 ' * ) ( % $ $ +#! "#! '& & Probability of Relevance? ' ', IR is an uncertain process Information need to query Documents to index terms Query terms and index

More information

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD DATA MINING LECTURE 8 Dimensionality Reduction PCA -- SVD The curse of dimensionality Real data usually have thousands, or millions of dimensions E.g., web documents, where the dimensionality is the vocabulary

More information

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan Term Weighting and the Vector Space Model borrowing from: Pandu Nayak and Prabhakar Raghavan IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes

More information

Integrating Logical Operators in Query Expansion in Vector Space Model

Integrating Logical Operators in Query Expansion in Vector Space Model Integrating Logical Operators in Query Expansion in Vector Space Model Jian-Yun Nie, Fuman Jin DIRO, Université de Montréal C.P. 6128, succursale Centre-ville, Montreal Quebec, H3C 3J7 Canada {nie, jinf}@iro.umontreal.ca

More information

a Short Introduction

a Short Introduction Collaborative Filtering in Recommender Systems: a Short Introduction Norm Matloff Dept. of Computer Science University of California, Davis matloff@cs.ucdavis.edu December 3, 2016 Abstract There is a strong

More information

DISTRIBUTIONAL SEMANTICS

DISTRIBUTIONAL SEMANTICS COMP90042 LECTURE 4 DISTRIBUTIONAL SEMANTICS LEXICAL DATABASES - PROBLEMS Manually constructed Expensive Human annotation can be biased and noisy Language is dynamic New words: slangs, terminology, etc.

More information

9 Searching the Internet with the SVD

9 Searching the Internet with the SVD 9 Searching the Internet with the SVD 9.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Retrieval Models and Implementation Ulf Leser Content of this Lecture Information Retrieval Models Boolean Model Vector Space Model Inverted Files Ulf Leser: Maschinelle

More information

Fall CS646: Information Retrieval. Lecture 6 Boolean Search and Vector Space Model. Jiepu Jiang University of Massachusetts Amherst 2016/09/26

Fall CS646: Information Retrieval. Lecture 6 Boolean Search and Vector Space Model. Jiepu Jiang University of Massachusetts Amherst 2016/09/26 Fall 2016 CS646: Information Retrieval Lecture 6 Boolean Search and Vector Space Model Jiepu Jiang University of Massachusetts Amherst 2016/09/26 Outline Today Boolean Retrieval Vector Space Model Latent

More information

Collaborative Topic Modeling for Recommending Scientific Articles

Collaborative Topic Modeling for Recommending Scientific Articles Collaborative Topic Modeling for Recommending Scientific Articles Chong Wang and David M. Blei Best student paper award at KDD 2011 Computer Science Department, Princeton University Presented by Tian Cao

More information

Chapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze

Chapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze Chapter 10: Information Retrieval See corresponding chapter in Manning&Schütze Evaluation Metrics in IR 2 Goal In IR there is a much larger variety of possible metrics For different tasks, different metrics

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

Chapter 10. Regression. Understandable Statistics Ninth Edition By Brase and Brase Prepared by Yixun Shi Bloomsburg University of Pennsylvania

Chapter 10. Regression. Understandable Statistics Ninth Edition By Brase and Brase Prepared by Yixun Shi Bloomsburg University of Pennsylvania Chapter 10 Regression Understandable Statistics Ninth Edition By Brase and Brase Prepared by Yixun Shi Bloomsburg University of Pennsylvania Scatter Diagrams A graph in which pairs of points, (x, y), are

More information

Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata

Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata ABSTRACT Wei Wu Microsoft Research Asia No 5, Danling Street, Haidian District Beiing, China, 100080 wuwei@microsoft.com

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

Online Passive-Aggressive Algorithms. Tirgul 11

Online Passive-Aggressive Algorithms. Tirgul 11 Online Passive-Aggressive Algorithms Tirgul 11 Multi-Label Classification 2 Multilabel Problem: Example Mapping Apps to smart folders: Assign an installed app to one or more folders Candy Crush Saga 3

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 18: Latent Semantic Indexing Hinrich Schütze Center for Information and Language Processing, University of Munich 2013-07-10 1/43

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Basic Linear Algebra in MATLAB

Basic Linear Algebra in MATLAB Basic Linear Algebra in MATLAB 9.29 Optional Lecture 2 In the last optional lecture we learned the the basic type in MATLAB is a matrix of double precision floating point numbers. You learned a number

More information

Lecture 5: Web Searching using the SVD

Lecture 5: Web Searching using the SVD Lecture 5: Web Searching using the SVD Information Retrieval Over the last 2 years the number of internet users has grown exponentially with time; see Figure. Trying to extract information from this exponentially

More information

Updating the Partial Singular Value Decomposition in Latent Semantic Indexing

Updating the Partial Singular Value Decomposition in Latent Semantic Indexing Updating the Partial Singular Value Decomposition in Latent Semantic Indexing Jane E. Tougas a,1, a Faculty of Computer Science, Dalhousie University, Halifax, NS, B3H 1W5, Canada Raymond J. Spiteri b,2

More information

An Extended Frank-Wolfe Method, with Application to Low-Rank Matrix Completion

An Extended Frank-Wolfe Method, with Application to Low-Rank Matrix Completion An Extended Frank-Wolfe Method, with Application to Low-Rank Matrix Completion Robert M. Freund, MIT joint with Paul Grigas (UC Berkeley) and Rahul Mazumder (MIT) CDC, December 2016 1 Outline of Topics

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Quick Introduction to Nonnegative Matrix Factorization

Quick Introduction to Nonnegative Matrix Factorization Quick Introduction to Nonnegative Matrix Factorization Norm Matloff University of California at Davis 1 The Goal Given an u v matrix A with nonnegative elements, we wish to find nonnegative, rank-k matrices

More information

CS 572: Information Retrieval

CS 572: Information Retrieval CS 572: Information Retrieval Lecture 11: Topic Models Acknowledgments: Some slides were adapted from Chris Manning, and from Thomas Hoffman 1 Plan for next few weeks Project 1: done (submit by Friday).

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

Information Retrieval

Information Retrieval Introduction to Information CS276: Information and Web Search Christopher Manning and Pandu Nayak Lecture 15: Learning to Rank Sec. 15.4 Machine learning for IR ranking? We ve looked at methods for ranking

More information

PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211

PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211 PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211 IIR 18: Latent Semantic Indexing Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,

More information

Dealing with Text Databases

Dealing with Text Databases Dealing with Text Databases Unstructured data Boolean queries Sparse matrix representation Inverted index Counts vs. frequencies Term frequency tf x idf term weights Documents as vectors Cosine similarity

More information

13 Searching the Web with the SVD

13 Searching the Web with the SVD 13 Searching the Web with the SVD 13.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this

More information

Large-Scale Behavioral Targeting

Large-Scale Behavioral Targeting Large-Scale Behavioral Targeting Ye Chen, Dmitry Pavlov, John Canny ebay, Yandex, UC Berkeley (This work was conducted at Yahoo! Labs.) June 30, 2009 Chen et al. (KDD 09) Large-Scale Behavioral Targeting

More information

Generic Text Summarization

Generic Text Summarization June 27, 2012 Outline Introduction 1 Introduction Notation and Terminology 2 3 4 5 6 Text Summarization Introduction Notation and Terminology Two Types of Text Summarization Query-Relevant Summarization:

More information

On Two Class-Constrained Versions of the Multiple Knapsack Problem

On Two Class-Constrained Versions of the Multiple Knapsack Problem On Two Class-Constrained Versions of the Multiple Knapsack Problem Hadas Shachnai Tami Tamir Department of Computer Science The Technion, Haifa 32000, Israel Abstract We study two variants of the classic

More information

5 10 12 32 48 5 10 12 32 48 4 8 16 32 64 128 4 8 16 32 64 128 2 3 5 16 2 3 5 16 5 10 12 32 48 4 8 16 32 64 128 2 3 5 16 docid score 5 10 12 32 48 O'Neal averaged 15.2 points 9.2 rebounds and 1.0 assists

More information

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012 Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Principal Components Analysis Le Song Lecture 22, Nov 13, 2012 Based on slides from Eric Xing, CMU Reading: Chap 12.1, CB book 1 2 Factor or Component

More information

Predicting Neighbor Goodness in Collaborative Filtering

Predicting Neighbor Goodness in Collaborative Filtering Predicting Neighbor Goodness in Collaborative Filtering Alejandro Bellogín and Pablo Castells {alejandro.bellogin, pablo.castells}@uam.es Universidad Autónoma de Madrid Escuela Politécnica Superior Introduction:

More information

Lecture 4: An FPTAS for Knapsack, and K-Center

Lecture 4: An FPTAS for Knapsack, and K-Center Comp 260: Advanced Algorithms Tufts University, Spring 2016 Prof. Lenore Cowen Scribe: Eric Bailey Lecture 4: An FPTAS for Knapsack, and K-Center 1 Introduction Definition 1.0.1. The Knapsack problem (restated)

More information

Fast LSI-based techniques for query expansion in text retrieval systems

Fast LSI-based techniques for query expansion in text retrieval systems Fast LSI-based techniques for query expansion in text retrieval systems L. Laura U. Nanni F. Sarracco Department of Computer and System Science University of Rome La Sapienza 2nd Workshop on Text-based

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 9: Collaborative Filtering, SVD, and Linear Algebra Review Paul Ginsparg

More information

Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson

Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics

More information

psychological statistics

psychological statistics psychological statistics B Sc. Counselling Psychology 011 Admission onwards III SEMESTER COMPLEMENTARY COURSE UNIVERSITY OF CALICUT SCHOOL OF DISTANCE EDUCATION CALICUT UNIVERSITY.P.O., MALAPPURAM, KERALA,

More information

NewPR-Combining TFIDF with Pagerank

NewPR-Combining TFIDF with Pagerank NewPR-Combining TFIDF with Pagerank Hao-ming Wang 1,2, Martin Rajman 2,YeGuo 3, and Bo-qin Feng 1 1 School of Electronic and Information Engineering, Xi an Jiaotong University, Xi an, Shaanxi 710049, P.R.

More information

Information Retrieval

Information Retrieval Introduction to Information CS276: Information and Web Search Christopher Manning and Pandu Nayak Lecture 13: Latent Semantic Indexing Ch. 18 Today s topic Latent Semantic Indexing Term-document matrices

More information