Towards Collaborative Information Retrieval


Markus Junker, Armin Hust, and Stefan Klink
German Research Center for Artificial Intelligence (DFKI GmbH), P.O. Box 2080, 67608 Kaiserslautern, Germany
{markus.junker, armin.hust, stefan.klink}@dfki.de

Abstract. Information retrieval systems have been studied in computer science for decades. The traditional ad-hoc task in information retrieval is to find all documents relevant to a given query. Much effort has been devoted to improving on this task, in particular in the Text REtrieval Conference (TREC) series. In 1998 it was decided at TREC-8 that this task should no longer be pursued within TREC, in particular because accuracy had plateaued in recent years. At DFKI, we are working on approaches for Collaborative Information Retrieval (CIR) which learn to improve retrieval effectiveness from the interaction of different users with the retrieval engine. Such systems may have the potential to overcome the current plateau in ad-hoc retrieval. In our presentation we motivate CIR and discuss its relation to personalization issues in information retrieval, to collaborative filtering, and to traditional relevance feedback. We then introduce the assumptions under which CIR may prove beneficial; as we will argue, these assumptions are likely to be met in many practical scenarios. As a first step towards concrete techniques, we focus on a restricted setting of CIR in which only old queries and the correct answer documents to these queries are available for improving on a new query. Three technical approaches designed to meet this goal are motivated, experimentally evaluated, and analysed. Although the data they are evaluated on is rather sub-optimal with respect to the assumptions, the results are very encouraging and point to an attractive research direction.
1 Introduction

1.1 Traditional Document Search
1.2 Collaborative Information Retrieval
1.3 Relation to Other Feedback-Based Query Expansion Techniques
1.4 Collaborative Information Retrieval vs. Collaborative Filtering

2 Document Retrieval

The task of document retrieval is to retrieve documents relevant to a given query from a fixed set of documents. Documents as well as queries are represented in a common way using a set of index terms (called terms from now

on). Terms are determined from the words of the documents, usually during a preprocessing phase in which noise reduction procedures such as stemming and stopword elimination are applied. In the following, a term is represented by $t_i$, $1 \le i \le M$, where $M$ is the number of terms in the document collection.

2.1 Vector Space Model

One of the simplest but most popular models used in IR is the vector space model (VSM) [1], [4]. A document in the VSM is represented as an $M$-dimensional vector

    $d_j = (w_{1j}, w_{2j}, \ldots, w_{Mj})^T$,  $1 \le j \le N$,   (1)

where $T$ indicates the transpose of the vector, $w_{ij}$ is the weight of term $t_i$ in document $d_j$, and $N$ is the number of documents in the document collection. A query in the VSM is likewise represented as an $M$-dimensional vector

    $q_k = (w_{1k}, w_{2k}, \ldots, w_{Mk})^T$,  $1 \le k \le L$,   (2)

where $w_{ik}$ is the weight of term $t_i$ in query $q_k$ and $L$ is the number of queries associated with the document collection. The result of executing a query is a list of documents ranked according to their similarity to the given query. The similarity $\mathrm{sim}(d_j, q_k)$ between a document $d_j$ and a query $q_k$ is measured by the cosine of the angle between these two $M$-dimensional vectors:

    $\mathrm{sim}(d_j, q_k) = \dfrac{d_j^T q_k}{\|d_j\| \, \|q_k\|}$,   (3)

where $\|\cdot\|$ is the Euclidean norm of a vector. If the vectors are already normalized (and hence have unit length), the similarity is simply the dot product of the two vectors:

    $\mathrm{sim}(d_j, q_k) = d_j^T q_k = \sum_{i=1}^{M} w_{ij} w_{ik}$.   (4)

The VSM is one of the models we use in this work for comparison with our new query expansion methods.

2.2 Query Expansion

The use of short queries in IR limits the number of documents that can be ranked according to their similarity to the query. The number of ranked documents is related to the number of appropriate query terms: the more query terms, the more documents are retrieved and ranked according to their

similarity to the query [5]. Several methods, called query expansion methods, have been proposed to cope with this problem [1], [4]. These methods fall into three categories: use of feedback information from the user, use of information derived locally from the set of initially retrieved documents, and use of information derived globally from the document collection.

Pseudo Relevance Feedback. The method called pseudo feedback (also called pseudo relevance feedback, PRF) works in three stages. First, documents are ranked according to their similarity to the original query. Then highly ranked documents are assumed to be relevant, and their terms (all of them, or some highly weighted terms) are used to expand the original query. Finally, documents are ranked again according to their similarity to the expanded query. In this work we employ a simple variant of pseudo relevance feedback [3]. Let $E$ be the set of document vectors given by

    $E = \left\{ d_j \;\middle|\; \dfrac{\mathrm{sim}(d_j, q_k)}{\max_{1 \le i \le N} \mathrm{sim}(d_i, q_k)} \ge \theta,\ 1 \le j \le N \right\}$,   (5)

where $q_k$ is the original query and $\theta$ is a similarity threshold. Then the sum $d_s$ of the document vectors in $E$,

    $d_s = \sum_{d_j \in E} d_j$,   (6)

is used as expansion terms for the original query. The expanded query vector $q'_k$ is obtained by

    $q'_k = q_k + \alpha \dfrac{d_s}{\|d_s\|}$,   (7)

where $\alpha$ is a parameter weighting the expansion terms. The documents are then ranked again according to their similarity $\mathrm{sim}(d_j, q'_k)$. PRF is one of the models we use in this work for comparison with our new query expansion methods.

3 Query Similarity and Relevant Documents

In this paper we employ a query expansion method based on query similarities and relevant documents (QSD). Our method uses feedback information and information globally available from previous queries. Feedback information in our experimental environment is available in the ground truth data provided by the test document collections. The ground truth provides relevance information, i.e.
for each query there exists a list of relevant documents. Relevance information for one query is represented by an $N$-dimensional vector

    $r_k = (r_{1k}, r_{2k}, \ldots, r_{Nk})^T$,  $1 \le k \le L$,   (8)

where

    $r_{ik} = \begin{cases} 1 & \text{if document } i \text{ is relevant to query } k \\ 0 & \text{if document } i \text{ is not relevant to query } k. \end{cases}$   (9)

3.1 Query Expansion Method

Our query expansion method is based on the similarity between the new query to be issued and the existing old queries. We first give a high-level description of the method; the detailed mathematical description follows. Query expansion works as follows:

- compute the similarities between the new query and each of the existing old queries;
- select the old queries having a similarity to the new query greater than or equal to a given threshold;
- for these selected old queries, get the sets of relevant documents from the ground truth data;
- from each set of relevant documents, compute a new document vector;
- use these document vectors and a weighting scheme to enrich the new query.

The formal description is as follows. The similarity $\mathrm{sim}(q_k, q_l)$ between a query $q_k$ and a query $q_l$ is measured by the cosine of the angle between these two $M$-dimensional vectors:

    $\mathrm{sim}(q_k, q_l) = \dfrac{q_k^T q_l}{\|q_k\| \, \|q_l\|}$,   (10)

or simply by the dot product of the two vectors,

    $\mathrm{sim}(q_k, q_l) = q_k^T q_l = \sum_{i=1}^{M} w_{ik} w_{il}$,   (11)

in the case that the vectors are already normalized (and hence have unit length). From now on, denote $\sigma_k := \mathrm{sim}(q_k, q_l)$ for a fixed query $q_l$. Let $S$ be the set

    $S = \{ q_k \mid \sigma_k \ge \sigma,\ 1 \le k \le L \}$   (12)

of existing old queries $q_k$ having a similarity greater than or equal to a threshold $\sigma$ to the new query $q_l$, and let $T_k$ be the sets

    $T_k = \{ d_j \mid q_k \in S \wedge r_{jk} = 1,\ 1 \le j \le N \}$,  $1 \le k \le L$,   (13)

of all documents relevant to the queries $q_k$ in $S$. Then the sums $\bar{d}_k$ of the document vectors in each $T_k$,

    $\bar{d}_k = \sum_{d_j \in T_k} d_j$,   (14)

are used as expansion terms for the original query. The expanded query vector $q'_l$ is obtained by

    $q'_l = q_l + \sum_{k=1}^{L} \lambda_k \dfrac{\bar{d}_k}{\|\bar{d}_k\|}$,   (15)

where the $\lambda_k$ are parameters weighting the expansion terms.

Notes:

- If $\sigma$ in (12) is chosen too high, the set $S$ may be empty. Then the sets $T_k$ will be empty and the document vectors $\bar{d}_k$ will be $(0, \ldots, 0)^T$. In this case the new query is not expanded.
- Even if a query $q_k$ is in the set $S$, the corresponding set $T_k$ may be empty (in case no relevance judgements are contained in the ground truth data for query $q_k$). Then the corresponding document vector $\bar{d}_k$ will be $(0, \ldots, 0)^T$.
- The parameters $\sigma$ in (12) and $\lambda_k$ in (15) are the tuning parameters of method QSD.

4 Query Linear Combination and Relevant Documents

In this paper we also employ a query expansion method based on linear combinations of similar queries and their relevant documents (QLD). As before, the method uses feedback information and information globally available from previous queries; the feedback information is available in the ground truth data provided by the test document collections, which lists the relevant documents for each query. Relevance information for one query is again represented by an $N$-dimensional vector

    $r_k = (r_{1k}, r_{2k}, \ldots, r_{Nk})^T$,  $1 \le k \le L$,   (16)

where

    $r_{ik} = \begin{cases} 1 & \text{if document } i \text{ is relevant to query } k \\ 0 & \text{if document } i \text{ is not relevant to query } k. \end{cases}$   (17)

4.1 Query Expansion Method

This query expansion method is based on constructing the new query as a linear combination of existing old queries. We first give a high-level description of the method; the detailed mathematical description follows. Query expansion works as follows:

- compute the similarities between the new query and each of the existing old queries;
- select the old queries having a similarity to the new query greater than or equal to a given threshold;
- use these selected old queries to approximately rebuild the new query as a linear combination of the old queries according to the least squares method, and obtain the coefficients of the linear combination;
- keep those old queries whose coefficient in the linear combination is greater than or equal to a given threshold;
- for these selected old queries, get the sets of relevant documents from the ground truth data;
- from each set of relevant documents, compute a new document vector;
- use these document vectors and a weighting scheme to enrich the new query, where the coefficients of the linear combination are used as weighting parameters.

The formal description is as follows. The similarity $\mathrm{sim}(q_k, q_l)$ between a query $q_k$ and a query $q_l$ is again measured by the cosine of the angle between these two $M$-dimensional vectors:

    $\mathrm{sim}(q_k, q_l) = \dfrac{q_k^T q_l}{\|q_k\| \, \|q_l\|}$,   (18)

or simply by the dot product of the two vectors,

    $\mathrm{sim}(q_k, q_l) = q_k^T q_l = \sum_{i=1}^{M} w_{ik} w_{il}$,   (19)

in the case that the vectors are already normalized (and hence have unit length). From now on, denote $\sigma_k := \mathrm{sim}(q_k, q_l)$ for a fixed query $q_l$. Let $S$ be the set

    $S = \{ q_k \mid \sigma_k \ge \sigma,\ 1 \le k \le L \}$   (20)

of existing old queries $q_k$ having a similarity greater than or equal to a threshold $\sigma$ to the new query $q_l$, let $|S|$ be the number of queries in $S$, and let $T_k$ be the sets

    $T_k = \{ d_j \mid q_k \in S \wedge r_{jk} = 1,\ 1 \le j \le N \}$,  $1 \le k \le L$,   (21)

of all documents relevant to the queries $q_k$ in $S$. Then the sums $\bar{d}_k$ of the document vectors in each $T_k$,

    $\bar{d}_k = \sum_{d_j \in T_k} d_j$,   (22)

are used as expansion terms for the original query.
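The two expansion schemes introduced so far, PRF (eqs. 5-7) and QSD (eqs. 12-15), can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the parameter defaults are arbitrary rather than the tuned values of Section 7, and a single lambda is used for all selected queries, which is one of the settings evaluated in the text.

```python
import numpy as np

def _unit(v):
    """Normalize a vector to unit length (zero vectors are returned unchanged)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def prf_expand(D, q, theta=0.8, alpha=0.5):
    """Pseudo relevance feedback (eqs. 5-7): expand q with top-ranked documents.

    D is a document-term matrix with one document vector per row; theta is the
    similarity threshold relative to the best hit, alpha the expansion weight.
    """
    sims = np.array([_unit(d) @ _unit(q) for d in D])
    E = D[sims / sims.max() >= theta]     # eq. (5): documents close to the best hit
    d_s = E.sum(axis=0)                   # eq. (6): sum of selected document vectors
    return q + alpha * _unit(d_s)         # eq. (7)

def qsd_expand(q_new, old_queries, relevance, D, sigma=0.5, lam=0.5):
    """QSD (eqs. 12-15): expand q_new with relevant documents of similar old queries.

    relevance[k, j] = 1 iff document j is relevant to old query k (the vectors r_k).
    """
    q_exp = q_new.astype(float).copy()
    for k, q_k in enumerate(old_queries):
        if _unit(q_k) @ _unit(q_new) < sigma:
            continue                              # eq. (12): keep similar queries only
        d_k = D[relevance[k] == 1].sum(axis=0)    # eqs. (13)-(14)
        if np.linalg.norm(d_k) > 0:
            q_exp += lam * _unit(d_k)             # eq. (15); empty T_k adds nothing
    return q_exp
```

Note how the empty-set cases from the notes above fall out naturally: an old query below the threshold, or one without relevance judgements, contributes nothing to the expanded query.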

The expanded query vector $q'_l$ is obtained by

    $q'_l = q_l + \sum_{k=1}^{|S|} \lambda_k \dfrac{\bar{d}_k}{\|\bar{d}_k\|}$,   (23)

where the $\lambda_k$ are parameters weighting the expansion terms. The $\lambda_k$ are computed as follows. In most cases we cannot represent the new query $q_l$ exactly as a linear combination of the old queries $q_k \in S$, i.e.

    $q_l = \sum_{k=1}^{|S|} \lambda_k q_k$,  $q_k \in S$,   (24)

will not have a solution for the coefficients $\lambda_k$. Equation (24) is equivalent to a system of linear equations

    $Q \lambda = q_l$,   (25)

where $Q = (q_1, q_2, \ldots, q_{|S|}) \in \mathbb{R}^{M \times |S|}$ is a matrix of $M$ rows and $|S|$ columns and $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_{|S|})^T$ is a column vector of $|S|$ elements. In our case the system is normally overdetermined ($M \gg |S|$) and has no exact solution. In this situation we find a vector $\hat{\lambda}$ that provides a closest fit to the equation in some sense. Our approach is to minimize the Euclidean norm of the vector $Q\lambda - q_l$, i.e. we solve

    $\hat{\lambda} = \mathrm{argmin}_{\lambda} \|Q\lambda - q_l\|$,   (26)

where $\hat{\lambda} = (\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_{|S|})^T$ is called the least squares solution of the system $Q\lambda = q_l$. From this set of $|S|$ coefficients $\hat{\lambda}_k$ we then keep only those that are greater than or equal to a predefined parameter $\beta$, i.e.

    $\lambda_k = \begin{cases} \hat{\lambda}_k & \text{if } \hat{\lambda}_k \ge \beta \\ 0 & \text{if } \hat{\lambda}_k < \beta. \end{cases}$   (27)

Notes:

- If $\sigma$ in (20) is chosen too high, the set $S$ may be empty. Then the sets $T_k$ will be empty and the document vectors $\bar{d}_k$ will be $(0, \ldots, 0)^T$. In this case the new query is not expanded.
- Even if a query $q_k$ is in the set $S$, the corresponding set $T_k$ may be empty (in case no relevance judgements are contained in the ground truth data for query $q_k$). Then the corresponding document vector $\bar{d}_k$ will be $(0, \ldots, 0)^T$.
- The parameters $\sigma$ in (20) and $\beta$ in (27) are the tuning parameters of method QLD.
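The least-squares step (26) and the coefficient thresholding (27) map directly onto NumPy's least-squares solver. The function name and the default beta below are illustrative only.

```python
import numpy as np

def qld_coefficients(q_new, old_queries, beta=0.1):
    """Fit q_new as a linear combination of old queries (eqs. 24-27).

    old_queries holds one (already selected) old query vector per row, so its
    transpose is the matrix Q of eq. (25). Coefficients below beta are zeroed
    as in eq. (27).
    """
    Q = old_queries.T                                     # M x |S| query columns
    lam_hat, *_ = np.linalg.lstsq(Q, q_new, rcond=None)   # eq. (26)
    return np.where(lam_hat >= beta, lam_hat, 0.0)        # eq. (27)
```

The surviving coefficients are then used as the weights $\lambda_k$ in eq. (23).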

5 Term-Based Concept Learning

In this paper we further employ a query expansion method based on learning a concept for each query term (TCL). As before, the method uses feedback information and information globally available from previous queries: the ground truth data of the test document collections provides, for each query, the list of its relevant documents. Relevance information for one query is represented by an $N$-dimensional vector

    $r_k = (r_{1k}, r_{2k}, \ldots, r_{Nk})^T$,  $1 \le k \le L$,   (28)

where

    $r_{ik} = \begin{cases} 1 & \text{if document } i \text{ is relevant to query } k \\ 0 & \text{if document } i \text{ is not relevant to query } k. \end{cases}$   (29)

5.1 Query Expansion Method

Our query expansion method is based on learning a term-based concept for each term occurring in the current query. These concepts are learned from the ground-truth documents of old queries. We first give a high-level description of the method; the detailed mathematical description follows. The TCL method is divided into two phases, a learning phase and an expansion phase. The learning phase works as follows for each term:

- select the old queries in which the specific query term occurs;
- for these selected old queries, get the sets of relevant documents from the ground truth data;
- from each set of relevant documents, compute a new document vector;
- use these document vectors to build the term concept.

The application phase for each term is simple:

- select the appropriate concept of the current term;
- use a weighting scheme to enrich the new query with the concept.

The formal description is as follows. Let $\tilde{q} = (\tilde{w}_1, \ldots, \tilde{w}_i, \ldots, \tilde{w}_M)^T$ be the user query represented within the vector space model (see Section 2.1). For each term of the query, the appropriate weight $\tilde{w}_i$ is between 0 and 1. Let $q_k = (w_{1k}, \ldots, w_{ik}, \ldots, w_{Mk})^T$, $1 \le k \le L$, be the queries given in the document collection.
Let $\mathcal{Q}$ be the set of all queries $q_k$ (note that $\tilde{q} \notin \mathcal{Q}$) and $D$ be the set of all documents $d$. Let $R^+(q_k)$ be the set of all documents which are relevant to $q_k$:

    $R^+(q_k) = \{ d_j \in D \mid r_{jk} = 1 \}$.   (30)

Now, the first step of the learning phase collects all (previous) queries which have the $i$-th term in common:

    $Q_i = \{ q_k \in \mathcal{Q} \mid w_{ik} > 0 \}$.   (31)

If the $i$-th term does not occur in any query $q_k$, then $Q_i$ is empty. The second step collects all documents which are relevant to these collected queries:

    $D_i = \{ d_j \mid d_j \in R^+(q_k) \wedge q_k \in Q_i \}$.   (32)

Finally, in the last step of the learning phase, the concept of each term $t_i$ is built as the sum of all documents which are relevant to the (previous) queries that have the term in common:

    $C(t_i) = \sum_{d_j \in D_i} d_j$.   (33)

Analogously to queries and documents, a term-based concept is represented as a vector. In the case that the term $t_i$ occurs in no query $q_k$, the concept is empty; it is represented as $(0, \ldots, 0)^T$. Once the term-based concepts are learned, the user query $\tilde{q}$ can be expanded. Analogously to (15), the expanded query vector $\tilde{q}'$ is obtained by

    $\tilde{q}' = \tilde{q} + \sum_{i=1}^{M} \omega_i C(t_i)$,   (34)

where the $\omega_i$ are parameters weighting the concepts. In the experiments described below, $\omega_i$ is set to 1 and can be ignored. Before the expanded query is applied, it is normalized:

    $\tilde{q}' \leftarrow \dfrac{\tilde{q}'}{\|\tilde{q}'\|}$.   (35)
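The learning and application phases above can be sketched as follows. The function names and the dense matrix representation (one concept vector per term, stacked into a matrix) are illustrative choices, not part of the paper's formulation; note that $D_i$ is a set, so a document relevant to several queries in $Q_i$ is summed only once.

```python
import numpy as np

def learn_concepts(queries, relevance, D):
    """Learn one concept vector per term (eqs. 30-33).

    queries:   matrix with one old query vector per row.
    relevance: binary matrix, relevance[k, j] = 1 iff document j is relevant
               to old query k.
    D:         document-term matrix with one document vector per row.
    Returns a matrix C whose row i is the concept C(t_i).
    """
    M = D.shape[1]
    C = np.zeros((M, M))
    for i in range(M):
        Q_i = np.nonzero(queries[:, i] > 0)[0]     # eq. (31): queries with term i
        if len(Q_i) == 0:
            continue                               # term in no query: empty concept
        D_i = relevance[Q_i].max(axis=0) == 1      # eq. (32): union of relevant docs
        C[i] = D[D_i].sum(axis=0)                  # eq. (33)
    return C

def tcl_expand(q, C):
    """Expand q by the concepts of its terms (eq. 34, all weights set to 1)
    and normalize the result (eq. 35)."""
    q_exp = q + C[q > 0].sum(axis=0)
    return q_exp / np.linalg.norm(q_exp)
```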

6 Experimental Design

In this section we describe the test collections used in the experimental comparison. We use standard document test collections with standard queries and questions provided by [6] and [7]. On the one hand, by using these collections we take advantage of the ground truth data for performance evaluation. On the other hand, we do not expect them to contain queries with highly correlated similarities, as we would expect in a real-world application; it is therefore a challenging task to show performance improvements for our methods. In our experiments we used the following eight collections:

- The CACM (titles and abstracts from the journal Communications of the ACM), CISI (Institute of Scientific Information) and CRAN (aeronautics abstracts) collections, available at [6]. All three collections are provided with queries and their ground truth.
- The CR (Congressional Record) collection, contained in the TREC test collections disk 4 [7] and accompanied by the ground truth for 34 selected queries out of the TREC standard queries. We created three test cases for the CR collection, using the TREC queries of different lengths in order to investigate the influence of query length: CR-title contains the title queries (the shortest query representation), CR-desc the description queries (the medium-length representation), and CR-narr the narrative queries (the longest representation).
- The FR (Federal Register) collection, contained in the TREC test collections disk 2 and accompanied by the ground truth for 112 selected queries out of the TREC standard queries.
- The AP90 (Associated Press articles) collection, contained in the TREC test collections disk 3. From the TREC-9 Question Answering track (QA) we selected a question set; questions 201-700 were created without reference to the document set.
Then, in a separate pass, equivalent but re-worded questions were created from a subset of these 500 questions [8]. Because of the method used for constructing this set (especially the re-worded questions), we expected to obtain higher similarities between different questions. From the ground truth data provided with the QA track we selected only those questions having a relevant answer document in the AP90 document collection. This reduced our test data to 723 documents and 353 questions. Terms used for document and query representation were obtained by stemming and eliminating stopwords. Table 1 lists statistics about the collections after stemming and stopword elimination; statistics about these collections before stemming and stopword elimination can be found in [1] and [3].
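The preprocessing mentioned above (stemming and stopword elimination) can be illustrated with a toy sketch. A real system would use a proper stemming algorithm such as the Porter stemmer; the crude suffix stripping below is illustrative only.

```python
def extract_terms(text, stopwords):
    """Toy preprocessing: lowercase, tokenize, drop stopwords, strip a few
    common suffixes. Not a real stemmer; for illustration only."""
    terms = []
    for w in text.lower().split():
        w = w.strip(".,;:!?()\"'")
        if not w or w in stopwords:
            continue                      # stopword elimination
        for suffix in ("ing", "ed", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[: -len(suffix)]     # crude stand-in for stemming
                break
        terms.append(w)
    return terms
```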

The best-known term weighting schemes use weights according to the so-called tf-idf scheme. In our experiments we employ a standard scheme as follows. For document vectors, the weights $w_{ij}$ are calculated as

    $w_{ij} = tf_{ij} \cdot idf_i$,   (36)

where $tf_{ij}$ is a weight computed from the raw frequency $f_{ij}$ of a term $t_i$ (the number of occurrences of term $t_i$ in document $d_j$),

    $tf_{ij} = f_{ij}$,   (37)

and $idf_i$ is the inverse document frequency of term $t_i$, given by

    $idf_i = \log \dfrac{N}{n_i}$,   (38)

where $n_i$ is the number of documents containing term $t_i$. For query vectors, the weights $w_{ik}$ are calculated as

    $w_{ik} = f_{ik}$,   (39)

where $f_{ik}$ is the raw frequency of term $t_i$ in query $q_k$ (the number of occurrences of term $t_i$ in query $q_k$).

Table 1. Statistics about test collections

                                CACM   CISI   CRAN   CR-desc  CR-narr  CR-title  FR     AP90
mean document length            short  med    med    long     long     long      long   long
mean query length               med    long   med    med      long     short     med    short
mean number of relevant
documents per query             med    high   med    high     high     high      med    low

7 Experimental Results

In this section the results of the experiments are presented. Results were evaluated using the average precision over all queries. Recall/precision graphs were generated according to [1]. Significance tests were then applied to the results.
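The weighting scheme of equations (36)-(39) can be sketched directly; the function name is an illustrative assumption.

```python
import numpy as np

def document_weights(F):
    """tf-idf document weights (eqs. 36-38).

    F: raw term-frequency matrix, F[j, i] = occurrences of term i in document j.
    """
    N = F.shape[0]
    n = np.count_nonzero(F, axis=0)        # documents containing each term
    idf = np.log(N / np.maximum(n, 1))     # eq. (38); guard terms occurring nowhere
    return F * idf                         # eqs. (36)-(37): tf_ij = f_ij

# Query weights are simply the raw term frequencies f_ik (eq. 39).
```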

7.1 Results for the QSD Method

The methods VSM (vector space model), PRF (pseudo relevance feedback) and QSD (query similarity and relevant documents) were applied. The best settings of the parameters $\alpha$ and $\theta$ for method PRF were obtained by experiment; $\alpha$ and $\theta$ are chosen such that average precision is highest (see Table 2 and [3]).

Table 2. Best parameter values for method PRF ($\alpha$, $\theta$) and method QSD ($\sigma$, $\lambda_k$) per collection. For QSD, the setting $\lambda_k = \sigma_k$ performed best on all eight collections.

Method QSD was evaluated using different settings of the parameters $\sigma$ and $\lambda_k$. From the set of queries contained in each collection we selected each query in turn and treated it as a new query $q_l$, $1 \le l \le L$. For each fixed query $q_l$ we computed the similarity $\sigma_k := \mathrm{sim}(q_k, q_l)$ for all queries $q_k$, $1 \le k \le L$, $k \ne l$, according to equations (10) and (11). Then $\sigma$ in equation (12) was varied from 0.0 up to 1.0 in steps of 0.01, and all $\lambda_k$ in equation (15) were varied simultaneously from 0.0 up to 5.0 in steps of 0.1. An additional run set each $\lambda_k$ to the value of $\sigma_k$ (the similarity between the new query $q_l$ and the existing query $q_k$). Finally we obtained the new query vector according to equation (15) and issued the query. The best parameter values for $\sigma$ and $\lambda_k$ are reported in Table 2.

Table 3. Best parameter values for methods QSDPRF and PRFQSD. For QSDPRF, $\sigma$ and $\lambda_k$ are the same as in Table 2 and $\alpha$ and $\theta$ were tuned on top; for PRFQSD, $\alpha$ and $\theta$ are the same as in Table 2 and $\sigma$ was tuned on top, with $\lambda_k = \sigma_k$ again performing best on all collections.

In the next steps we combined two query expansion methods, in two ways. First, after expanding the new query with the QSD method, we applied the PRF method to the expanded query; this combination is reported as the QSDPRF method. Second, after expanding the new query with the PRF method, we applied the QSD method to the expanded query; this combination is reported as the PRFQSD method. The best parameter settings were again obtained by experiment and chosen such that average precision is highest; they are reported in Table 3.

Table 4. Average precision obtained by the different methods (VSM, PRF, QSD, QSDPRF, PRFQSD) per collection. For each collection the best value of average precision is indicated by bold font, the second best by italic font.

Table 4 shows the average precision obtained by using the best parameter values for the different methods. Table 5 compares the average precision obtained by the different methods. The ratio of difference is calculated as follows: let X be the average precision obtained by one method and Y the average precision obtained by another method. Then the ratio is calculated as (X - Y)/Y. A positive value indicates an improvement, a negative value a degradation in average precision. Note that average precision values in Table 4 are rounded to three digits, while unrounded values are used when computing the ratios of differences.

Table 5. Average precision comparison

X       Y       CACM    CISI    CRAN    CR-desc  CR-narr  CR-title  FR      AP90
PRF     VSM     +53.2%  +7.3%   +13.4%  +16.6%   +10.9%   +25.5%    +33.3%  +1.6%
QSD     VSM     +82.5%  +17.8%  +11.4%  -2.0%    0.0%     +12.9%    +28.9%  +9.2%
QSD     PRF     +19.1%  +9.8%   -1.7%   -15.9%   -9.8%    -1.1%     -3.3%   +7.4%
QSDPRF  VSM     +98.5%  +20.4%  +17.6%  +11.2%   +10.8%   +31.4%    +92.1%  +9.5%
QSDPRF  PRF     +29.6%  +12.2%  +3.7%   -4.6%    0.0%     +4.7%     +44.1%  +7.8%
QSDPRF  QSD     +8.8%   +2.3%   +5.5%   +13.5%   +10.7%   +16.4%    +49.0%  +0.4%
PRFQSD  VSM     +96.9%  +25.6%  +20.6%  +12.1%   +10.8%   +33.4%    +64.3%  +9.6%
PRFQSD  PRF     +28.5%  +17.0%  +6.4%   -3.9%    0.0%     +6.3%     +23.2%  +7.8%
PRFQSD  QSD     +7.8%   +6.6%   +8.3%   +14.4%   +10.8%   +18.2%    +27.4%  +0.4%
PRFQSD  QSDPRF  -0.8%   +4.3%   +2.6%   +0.8%    0.0%     +1.5%     -14.5%  0.0%

Figures 1 to 4 show the recall/precision graphs for the test collections. Each figure contains the graphs for methods VSM, PRF, QSD, QSDPRF and PRFQSD.
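The exhaustive parameter sweeps described above amount to a grid search over the tuning parameters. A minimal sketch, assuming an `evaluate` callback (not part of the paper) that returns the average precision for one parameter setting:

```python
def best_parameters(evaluate, sigmas, lams):
    """Grid search over (sigma, lambda) settings, returning the pair with the
    highest score according to the supplied evaluate(sigma, lam) callback."""
    best = max((evaluate(s, l), s, l) for s in sigmas for l in lams)
    return best[1], best[2]
```

In the experiments, `evaluate` would run all queries of a collection with the given setting and average the precision over the queries.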

Fig. 1. Recall/precision graphs for the CACM and CISI test collections. $\sigma_1$ is the tuning parameter of the QSD method, $\sigma_2$ that of the PRFQSD method. CACM ($\sigma_1$ = 0.4, $\sigma_2$ = 0.58): average precision 0.13 (VSM), 0.199 (PRF), 0.237 (QSD), 0.257 (QSDPRF), 0.255 (PRFQSD). CISI ($\sigma_1$ = 0.41): average precision 0.12 (VSM), 0.129 (PRF), 0.142 (QSD), 0.145 (QSDPRF), 0.151 (PRFQSD).

Fig. 2. Recall/precision graphs for the CR-desc and CR-narr test collections. CR-desc ($\sigma_1$ = 0.44, $\sigma_2$ = 0.6): average precision 0.175 (VSM), 0.204 (PRF), 0.172 (QSD), 0.195 (QSDPRF), 0.196 (PRFQSD). CR-narr ($\sigma_1$ = 0.3): average precision 0.173 (VSM), 0.192 (PRF), 0.173 (QSD), 0.191 (QSDPRF), 0.192 (PRFQSD).

Fig. 3. Recall/precision graphs for the CR-title and CRAN test collections. CR-title ($\sigma_1$ = 0.42, $\sigma_2$ = 0.46): average precision 0.135 (VSM), 0.169 (PRF), 0.152 (QSD), 0.177 (QSDPRF), 0.18 (PRFQSD). CRAN ($\sigma_1$ = 0.49): average precision 0.384 (VSM), 0.435 (PRF), 0.428 (QSD), 0.451 (QSDPRF), 0.463 (PRFQSD).

Fig. 4. Recall/precision graphs for the FR and AP90 test collections. FR ($\sigma_1$ = 0.6, $\sigma_2$ = 0.44): average precision 0.085 (VSM), 0.113 (PRF), 0.109 (QSD), 0.163 (QSDPRF), 0.139 (PRFQSD). AP90 ($\sigma_1$ = 0.68): average precision 0.743 (VSM), 0.755 (PRF), 0.811 (QSD), 0.814 (QSDPRF), 0.814 (PRFQSD).

7.2 Results for the QLD Method

The methods VSM (vector space model), PRF (pseudo relevance feedback) and QLD (query linear combination and relevant documents) were applied. The best settings of the parameters $\alpha$ and $\theta$ for method PRF were obtained by experiment; $\alpha$ and $\theta$ are chosen such that average precision is highest (see Table 6 and [3]).

Table 6. Best parameter values for method PRF ($\alpha$, $\theta$) and method QLD ($\sigma$, $\beta$) per collection.

Method QLD was evaluated using different settings of the parameters $\sigma$ and $\beta$. From the set of queries contained in each collection we selected each query in turn and treated it as a new query $q_l$, $1 \le l \le L$. For each fixed query $q_l$ we computed the similarity $\sigma_k := \mathrm{sim}(q_k, q_l)$ for all queries $q_k$, $1 \le k \le L$, $k \ne l$, according to equations (18) and (19). Then $\sigma$ in equation (20) was varied from 0.0 up to 1.0 in steps of 0.01, and $\beta$ in equation (27) was varied from 0.0 up to the maximum value $\hat{\lambda}_k$ computed by equation (26) in steps of 0.01. Finally we obtained the new query vector according to equation (23) and issued the query. The best parameter values for $\sigma$ and $\beta$ are reported in Table 6.

Table 7. Best parameter values for methods QLDPRF and PRFQLD. For QLDPRF, $\sigma$ and $\beta$ are the same as in Table 6 and $\alpha$ and $\theta$ were tuned on top; for PRFQLD, $\alpha$ and $\theta$ are the same as in Table 6 and $\sigma$ and $\beta$ were tuned on top.

In the next steps we combined two query expansion methods, in two ways. First, after expanding the new query with the QLD method, we applied the PRF method to the expanded query; this combination is reported as the QLDPRF method. Second, after expanding the new query with the PRF method, we applied the QLD method to the expanded query; this combination is reported as the PRFQLD method. The best parameter settings were again obtained by experiment and chosen such that average precision is highest; they are reported in Table 7.

Table 8. Average precision obtained by the different methods (VSM, PRF, QLD, QLDPRF, PRFQLD) per collection. For each collection the best value of average precision is indicated by bold font, the second best by italic font.

Table 8 shows the average precision obtained by using the best parameter values for the different methods. Table 9 compares the average precision obtained by the different methods. The ratio of difference is calculated as before: let X and Y be the average precision values of two methods; then the ratio is (X - Y)/Y, with positive values indicating an improvement of X over Y. Average precision values in Table 8 are rounded to three digits, while unrounded values are used when computing the ratios.

Table 9. Average precision comparison

X       Y       CACM     CISI    CRAN    CR-desc  CR-narr  CR-title  FR      AP90
PRF     VSM     +53.2%   +7.3%   +13.4%  +16.6%   +10.9%   +25.5%    +33.3%  +1.6%
QLD     VSM     +75.3%   +41.9%  +13.5%  -0.0%    +1.0%    +21.7%    +27.0%  +9.1%
QLD     PRF     +14.4%   +32.3%  +0.0%   -14.4%   -8.9%    -3.0%     -4.7%   +7.4%
QLDPRF  VSM     +110.0%  +43.7%  +18.2%  +16.4%   +10.9%   +36.4%    +89.7%  +9.6%
QLDPRF  PRF     +37.2%   +34.0%  +4.2%   -0.0%    0.0%     +8.7%     +42.3%  +7.8%
QLDPRF  QLD     +19.9%   +1.3%   +4.1%   +16.7%   +9.8%    +12.1%    +49.4%  +0.4%
PRFQLD  VSM              +40.5%  +22.5%  +18.5%   +11.7%   +40.6%    +69.6%  +9.5%
PRFQLD  PRF     +38.2%   +30.9%  +8.0%   +1.6%    +0.8%    +12.0%    +27.2%  +7.8%
PRFQLD  QLD     +20.8%   -1.0%   +7.9%   +18.7%   +10.6%   +15.5%    +33.5%  +0.4%
PRFQLD  QLDPRF  +0.7%    -2.3%   +3.7%   +1.8%    +0.7%    +3.0%     -10.6%  0.0%

Figures 5 to 8 show the recall/precision graphs for the test collections. Each figure contains the graphs for methods VSM, PRF, QLD, QLDPRF and PRFQLD.
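The ratio of difference used in the comparison tables is a one-line computation; the percentage formatting below mirrors the way the tables report it.

```python
def improvement_ratio(x, y):
    """Ratio of difference (X - Y) / Y; positive means X improves over Y."""
    return (x - y) / y

def as_percent(r):
    """Format a ratio the way the comparison tables report it, e.g. '+53.0%'."""
    return f"{r * 100:+.1f}%"
```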

Fig. 5. Recall/precision graphs for the CACM and CISI test collections. $\sigma_1$ and $\beta_1$ are the tuning parameters of the QLD method, $\sigma_2$ and $\beta_2$ those of the PRFQLD method. CACM ($\sigma_1$ = 0.2, $\beta_1$ = 0.16, $\sigma_2$ = 0.14, $\beta_2$ = 0.5): average precision 0.13 (VSM), 0.199 (PRF), 0.227 (QLD), 0.273 (QLDPRF), 0.275 (PRFQLD). CISI ($\sigma_1$ = 0.5, $\beta_1$ = 0.3, $\sigma_2$ = 0.4): average precision 0.12 (VSM), 0.129 (PRF), 0.171 (QLD), 0.173 (QLDPRF), 0.169 (PRFQLD).

Fig. 6. Recall/precision graphs for the CR-desc and CR-narr test collections. CR-desc ($\sigma_1$ = 0.6, $\beta_1$ = 0.7): average precision 0.175 (VSM), 0.204 (PRF), 0.175 (QLD), 0.204 (QLDPRF), 0.208 (PRFQLD). CR-narr ($\sigma_1$ = 0.17, $\sigma_2$ = 0.16): average precision 0.173 (VSM), 0.192 (PRF), 0.175 (QLD), 0.192 (QLDPRF), 0.193 (PRFQLD).

Fig. 7. Recall/precision graphs for the CR-title and CRAN test collections. CR-title ($\sigma_1$ = 0.41, $\beta_1$ = 0.43, $\sigma_2$ = 0.46, $\beta_2$ = 0.47): average precision 0.135 (VSM), 0.169 (PRF), 0.164 (QLD), 0.184 (QLDPRF), 0.19 (PRFQLD). CRAN ($\sigma_1$ = 0.7, $\beta_1$ = 0.41, $\sigma_2$ = 0.43): average precision 0.384 (VSM), 0.435 (PRF), 0.436 (QLD), 0.453 (QLDPRF), 0.47 (PRFQLD).

Fig. 8. Recall/precision graphs for the FR and AP90 test collections. FR ($\sigma_1$ = 0.7, $\beta_1$ = 0.13, $\sigma_2$ = 0.4, $\beta_2$ = 0.4): average precision 0.085 (VSM), 0.113 (PRF), 0.108 (QLD), 0.161 (QLDPRF), 0.144 (PRFQLD). AP90 ($\sigma_1$ = 0.68, $\beta_1$ = 0.48, $\sigma_2$ = 0.69): average precision 0.743 (VSM), 0.755 (PRF), 0.81 (QLD), 0.814 (QLDPRF), 0.813 (PRFQLD).

7.3 Results for the TCL Method

The methods VSM (vector space model), PRF (pseudo relevance feedback) and TCL (term-based concept learning) were applied. The best settings of the parameters $\alpha$ and $\theta$ for method PRF were obtained by experiment; $\alpha$ and $\theta$ are chosen such that average precision is highest (see Table 2). Method TCL itself was evaluated without tuning parameters, i.e. all weighting parameters are set to 1. From the set of queries contained in each collection we selected each query in turn and treated it as a new query $q_l$, $1 \le l \le L$. For each fixed query $q_l$ we learned a concept for each query term according to equations (30)-(33). The new query $q_l$ was then enriched with the learned term-based concepts, and we obtained the new query vector according to equations (34) and (35) and issued the query.

Table 10. Best parameter values for methods PRF(Concepts) and Concept+PRF. For Concept+PRF, $\alpha$ and $\theta$ are the same as in Table 2, with the concept weight $\omega$ tuned per collection.

In the next step we combined the TCL method with PRF in two ways. First, after selecting the appropriate old queries, we applied the PRF method to the ground-truth documents of these queries, so that not all ground-truth documents are used to learn the term-based concepts, but only the best-ranked ones. This method is reported as the PRF(Concepts) method. The best parameter settings were again obtained by experiment and chosen such that average precision is highest; they are reported in Table 10. Second, we applied the PRF method in parallel with the TCL method: for each query (PRF) and each query term (TCL), the PRF documents are summed into a document vector, and the term-based concept is added with a weight $\omega$ to the final document vector, which is then used to expand the query. This method is reported as the Concept+PRF method.
Table 11 shows the average precision obtained by using the best parameter values for the different methods. For each collection the best value of average precision is indicated in bold font, the second best in italic font. Figures 9 to 12 show the recall/precision graphs for the test collections. Each figure contains the graphs for the methods VSM, PRF, Concept, PRF(Concept) and Concept+PRF.
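The evaluation measure underlying table 11 and all subsequent significance tests is average precision; a minimal sketch of the common non-interpolated variant (the chapter does not spell out which variant is used, so this is an assumption):

```python
def average_precision(ranking, relevant):
    """Non-interpolated average precision: the mean of precision@k over the
    ranks k at which a relevant document is retrieved, divided over all
    relevant documents (unretrieved relevant documents contribute zero)."""
    hits = 0
    precision_sum = 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k   # precision at rank k
    return precision_sum / len(relevant) if relevant else 0.0
```

The per-query values produced this way are averaged over all queries of a collection to give one number per method, and the per-query values themselves are the paired samples used in the t-tests below.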

Table 11. Average precision obtained by the different methods (rows: VSM, PRF, Concepts, PRF(Concepts), Concept+PRF; columns: CACM, CISI, CRAN, CR-desc, CR-narr, CR-title, FR, AP9).

Fig. 9. Recall/precision graphs for the CACM and CISI test collections.
Fig. 10. Recall/precision graphs for the CRAN and CR-desc test collections.
Fig. 11. Recall/precision graphs for the CR-narr and CR-title test collections.
Fig. 12. Recall/precision graphs for the FR and AP9 test collections.

7.4 Significance Testing The next step in the evaluation is the comparison of the average precision values obtained by the different methods. Statistical tests provide information about whether observed differences between methods are really significant or merely due to chance. Several statistical tests have been used in IR [2], [9]. We employ the paired t-test described in [2]: Let x_l and y_l be the scores of retrieval methods X and Y for query q_l, 1 ≤ l ≤ L, and define d_l = x_l − y_l. The assumption is that the model is additive, i.e. d_l = µ + ε_l, where µ is the mean value and the errors ε_l are independent and normally distributed. The null hypothesis H_0 is that µ = 0, i.e. method X performs as well as method Y. The alternative hypothesis H_1 is that µ > 0, i.e. method X performs better than method Y in terms of average precision. The t statistic

    t = d̄ / (s / √n),                          (40)

where

    d̄ = (1/n) Σ_{i=1}^{n} d_i                  (41)

and

    s² = (1/(n−1)) Σ_{i=1}^{n} (d_i − d̄)²      (42)

follows the t-distribution with n − 1 degrees of freedom, where n is the number of samples, d̄ is the sample mean and s² is the sample variance.
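Equations (40)-(42) translate directly into code; the following sketch computes the t statistic from two lists of per-query scores (the resulting value is then compared against the t-distribution with n − 1 degrees of freedom, e.g. via a table of critical values):

```python
import math

def paired_t_test(x, y):
    """Paired t statistic for per-query scores of methods X and Y,
    following equations (40)-(42)."""
    n = len(x)
    d = [xi - yi for xi, yi in zip(x, y)]          # d_l = x_l - y_l
    d_bar = sum(d) / n                             # sample mean, eq. (41)
    s2 = sum((di - d_bar) ** 2 for di in d) / (n - 1)  # sample variance, eq. (42)
    return d_bar / math.sqrt(s2 / n)               # t statistic, eq. (40)
```

A positive t larger than the one-sided critical value at the chosen significance level α rejects H_0 in favour of method X.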

Given the value of t we obtain the p-value, i.e. the probability of observing the sample results d_l under the assumption that H_0 is true. Comparing the p-value to a given significance level α, we can decide whether the null hypothesis H_0 should be rejected or not. 7.5 Significance Testing for QSD method The results are shown in table 12. Each row contains the results of two tests: method X against method Y, and vice versa. We tested each method X against each other method Y and vice versa, using significance levels α = 0.05 and α = 0.01. If the two significance levels lead to different results, this is indicated in parentheses: the value without parentheses refers to α = 0.05, the value in parentheses to α = 0.01. An entry of + in a table cell indicates that the null hypothesis is rejected when testing X against Y; at significance level α = 0.05 this means that method X is likely to perform better than method Y, and at α = 0.01 that method X is almost guaranteed to perform better than method Y. An entry of − in a table cell indicates that the null hypothesis is rejected when testing Y against X, with the analogous interpretation in favour of method Y. An entry of o in a table cell indicates that the null hypothesis cannot be rejected in either test, i.e. there is low probability that one of the methods performs better than the other.

Table 12. Paired t-test results for α = 0.05 (α = 0.01)

X Y CACM CISI CRAN CR-desc CR-narr CR-title FR AP9
PRF VSM (o) + (o) + (o) + (o)
QSD VSM + + (o) + o o o + (o) +
QSD PRF o o o (o) o o +
QSDPRF VSM + + (o) + + (o) o + (o) + +
QSDPRF PRF o o o o (o) o + (o) +
QSDPRF QSD o + (o) + + (o) o + (o) + o
PRFQSD VSM o + (o) + (o) + +
PRFQSD PRF + + (o) + o o o + (o) +
PRFQSD QSD o o + o o (o) + (o) + (o) o
PRFQSD QSDPRF o o o o o o o o

For significance level α = 0.05 it can be seen from table 12 that method QSD performs better than method VSM with a high probability for all collections except the three CR collections. Improvements of method QSD over

method PRF are significant only for the AP9 test collection, and PRF performs significantly better than QSD for the CR-desc and CR-narr collections. For all other collections (CACM, CISI, CRAN, CR-title and FR) there is no indication for rejecting the null hypothesis when testing QSD vs. PRF or vice versa. Hence, for these collections we cannot conclude from significance testing that one method performs better than the other. For significance level α = 0.01 it can be seen from table 12 that method QSD is almost guaranteed to perform better than method VSM only for the CACM, CRAN and AP9 collections. Improvements of method QSD over method PRF are significant only for the AP9 test collection, and PRF performs significantly better than QSD only for the CR-desc collection. For all other collections (CACM, CISI, CRAN, CR-narr, CR-title and FR) there is no indication for rejecting the null hypothesis when testing QSD vs. PRF or vice versa. Additionally, it can be seen that the combined methods QSDPRF and PRFQSD outperform the VSM, PRF and QSD methods in most cases, with a strong indication (at significance level α = 0.01) for the CRAN and FR collections. Only for the AP9 collection is there no indication that one of the combined methods QSDPRF or PRFQSD outperforms the QSD method. This seems to be a result of the construction of the queries in this collection (see section 6). 7.6 Significance Testing for QLD method The results are shown in table 13. Each row contains the results of two tests: method X against method Y, and vice versa. We tested each method X against each other method Y and vice versa, using significance levels α = 0.05 and α = 0.01.
The notation is the same as in table 12: values without parentheses refer to α = 0.05, values in parentheses to α = 0.01, and the entries +, − and o indicate that X performs significantly better, that Y performs significantly better, and that the null hypothesis cannot be rejected in either test, respectively.

Table 13. Paired t-test results for α = 0.05 (α = 0.01)

X Y CACM CISI CRAN CR-desc CR-narr CR-title FR AP9
PRF VSM (o) + (o) + (o) + (o)
QLD VSM o o o + (o) +
QLD PRF o + o o o o +
QLDPRF VSM (o)
QLDPRF PRF + (o) + o o o o + (o) +
QLDPRF QLD + + (o) + + o o + o
PRFQLD VSM (o)
PRFQLD PRF o o o + (o) +
PRFQLD QLD + (o) o + + o o + o
PRFQLD QLDPRF o o o o o o o o

For significance level α = 0.05 it can be seen from table 13 that method QLD performs better than method VSM with a high probability for all collections except for the three CR collections. Improvements of method QLD over method PRF are significant for the CISI and AP9 test collections, and PRF performs significantly better than QLD for the CR-desc collection. For all other collections (CACM, CRAN, CR-narr, CR-title and FR) there is no indication for rejecting the null hypothesis when testing QLD vs. PRF or vice versa. Hence, for these collections we cannot conclude from significance testing that one method performs better than the other. For significance level α = 0.01 it can be seen from table 13 that method QLD is almost guaranteed to perform better than method VSM for four collections. Improvements of method QLD over method PRF are significant only for the CISI and AP9 test collections, and PRF performs significantly better than QLD only for the CR-desc collection. For all other collections (CACM, CRAN, CR-narr, CR-title and FR) there is no indication for rejecting the null hypothesis when testing QLD vs. PRF or vice versa.
Additionally, it can be seen that the combined methods QLDPRF and PRFQLD outperform the VSM, PRF and QLD methods in most cases, with a strong indication (at significance level α = 0.01) for the CRAN, CR-desc and FR collections. Only for the AP9 collection is there no indication that one of the combined methods QLDPRF or PRFQLD outperforms the QLD method. This seems to be a result of the construction of the queries in this collection (see section 6). 7.7 Significance Testing for TCL method The results for the TCL method are shown in table 14, which is constructed analogously to table 12.

Table 14. Paired t-test results for α = 0.05 (α = 0.01)

X Y CACM CISI CRAN CR-desc CR-narr CR-title FR AP9
PRF VSM (o) + (o) + (o) + (o)
Concepts VSM + o + +
Concepts PRF + o o o
PRF(Concepts) VSM o + (o) + +
PRF(Concepts) PRF + o o o o + + (o)
PRF(Concepts) Concepts o (o) + o + +
Concept+PRF VSM + o + o + + (o) o
Concept+PRF PRF + o o o (o) o o
Concept+PRF Concepts + (o) (o) o + (o) +

For significance level α = 0.05 it can be seen that method TCL performs better than method VSM on half of the collections and at least as well as method PRF on half of the collections. The improvement of method TCL over method PRF is significant for the CACM test collection, and PRF performs significantly better than TCL for four collections. For the collections CR-desc, CR-title and FR there is no indication for rejecting the null hypothesis when testing TCL vs. PRF or vice versa. Hence, for these collections we cannot conclude from significance testing that one method performs better than the other. For significance level α = 0.01 it can be seen that method TCL is almost guaranteed to perform better than method VSM for three collections. The improvement of method TCL over method PRF is significant only for the CACM test collection, and PRF performs significantly better than TCL on four collections. For all other collections there is no indication for rejecting the null hypothesis when testing TCL vs. PRF or vice versa. Additionally, it can be seen that the combined methods PRF(Concepts) and Concept+PRF outperform the VSM, PRF and especially the TCL methods in most cases, with a strong indication (α = 0.01) for the CACM, CRAN and FR collections.
For the CR-narr collection it can be seen that only PRF performs better than VSM at the level α = 0.05. Neither TCL nor the combinations can improve on the VSM baseline. This is an indication of badly learned concepts: the long narrative queries contain terms which are not representative or which behave like stop words. Many terms occur in many queries, and the query is then expanded with concepts which carry no meaning. 8 Conclusions and Future Work 8.1 Conclusions We have experimentally compared new query expansion methods based on query similarities and document relevance with two conventional information retrieval methods. From the results gathered on eight static test collections we have only one clear indication that the QSD and QLD methods are superior to the conventional PRF method. Conversely, we also have only one clear indication that the conventional PRF method is superior to the QSD and QLD methods. From our results we conclude that the new methods can be combined with the conventional PRF method: no performance degradation has been observed for this combination, and the results obtained by combining the new QSD and QLD methods with the conventional PRF method are promising. Given the construction method for the queries in the AP9 test collection, where the QSD and QLD methods perform significantly better than the other methods, we expect the new methods to be useful in cases where old queries and their corresponding relevance information have been collected previously and where new queries are highly similar to existing old queries. 8.2 Future Work Experimental results on the static document collections are promising. In future work we will integrate the query expansion method into real-life IR systems. The first step will be the integration into search engines which deal with mostly static document collections. These search engines ought to provide learning methods for some form of relevance feedback.
Users issuing new queries will then profit from knowledge which has been learned previously from other users who issued the same or similar queries and judged the relevance of the documents retrieved by those queries. The next steps will be the formulation of a related query expansion method which considers the more dynamic nature of the underlying document collections, and the learning of the tuning parameters σ and β for the QSD and QLD methods.
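The collaborative expansion idea described above can be sketched as follows. The similarity threshold σ and the expansion weight β mirror the tuning parameters of the QSD method, but the concrete data structures and the simple additive combination are illustrative assumptions, not the exact formulation used in the experiments:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    num = sum(u.get(t, 0.0) * w for t, w in v.items())
    den = math.sqrt(sum(w * w for w in u.values())) * \
          math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

def qsd_expand(new_query, old_queries, relevant_docs, doc_vectors,
               sigma=0.5, beta=1.0):
    """Sketch of similarity-based expansion: enrich a new query with the
    relevant documents of sufficiently similar old queries (similarity
    threshold sigma, expansion weight beta)."""
    expanded = dict(new_query)
    for q_id, q_vec in old_queries.items():
        if cosine(new_query, q_vec) >= sigma:
            for d in relevant_docs[q_id]:
                for t, w in doc_vectors[d].items():
                    expanded[t] = expanded.get(t, 0.0) + beta * w
    return expanded
```

In a live system the stores of old queries and judged relevant documents would grow as users interact with the engine, which is exactly the collaborative knowledge referred to above.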

9 Acknowledgements This work was supported by the German Ministry for Education and Research, bmb+f (Grant: 1 IN 92 B8).

References

1. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Publishing Company.
2. D. Hull. Using statistical testing in the evaluation of retrieval experiments. In Proceedings of SIGIR-93, 1993.
3. K. Kise, M. Junker, A. Dengel, and K. Matsumoto. Experimental evaluation of passage-based document retrieval. In Proceedings of the Sixth International Conference on Document Analysis and Recognition (ICDAR'01), 2001.
4. C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press.
5. Y. Qiu and H.-P. Frei. Concept-based query expansion. In Proceedings of SIGIR-93, 16th ACM International Conference on Research and Development in Information Retrieval, Pittsburgh, US, 1993.
6. ftp://ftp.cs.cornell.edu/pub/smart
7. E. M. Voorhees and D. Harman. Overview of the ninth text retrieval conference (TREC-9). In Proceedings of the Ninth Text Retrieval Conference (TREC-9).
8. Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of SIGIR-99, pages 42-49, 1999.


More information

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang Matrix Factorization & Latent Semantic Analysis Review Yize Li, Lanbo Zhang Overview SVD in Latent Semantic Indexing Non-negative Matrix Factorization Probabilistic Latent Semantic Indexing Vector Space

More information

Applying the Divergence From Randomness Approach for Content-Only Search in XML Documents

Applying the Divergence From Randomness Approach for Content-Only Search in XML Documents Applying the Divergence From Randomness Approach for Content-Only Search in XML Documents Mohammad Abolhassani and Norbert Fuhr Institute of Informatics and Interactive Systems, University of Duisburg-Essen,

More information

PV211: Introduction to Information Retrieval

PV211: Introduction to Information Retrieval PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 11: Probabilistic Information Retrieval Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk

More information

IR Models: The Probabilistic Model. Lecture 8

IR Models: The Probabilistic Model. Lecture 8 IR Models: The Probabilistic Model Lecture 8 ' * ) ( % $ $ +#! "#! '& & Probability of Relevance? ' ', IR is an uncertain process Information need to query Documents to index terms Query terms and index

More information

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD DATA MINING LECTURE 8 Dimensionality Reduction PCA -- SVD The curse of dimensionality Real data usually have thousands, or millions of dimensions E.g., web documents, where the dimensionality is the vocabulary

More information

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan Term Weighting and the Vector Space Model borrowing from: Pandu Nayak and Prabhakar Raghavan IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes

More information

Integrating Logical Operators in Query Expansion in Vector Space Model

Integrating Logical Operators in Query Expansion in Vector Space Model Integrating Logical Operators in Query Expansion in Vector Space Model Jian-Yun Nie, Fuman Jin DIRO, Université de Montréal C.P. 6128, succursale Centre-ville, Montreal Quebec, H3C 3J7 Canada {nie, jinf}@iro.umontreal.ca

More information

a Short Introduction

a Short Introduction Collaborative Filtering in Recommender Systems: a Short Introduction Norm Matloff Dept. of Computer Science University of California, Davis matloff@cs.ucdavis.edu December 3, 2016 Abstract There is a strong

More information

DISTRIBUTIONAL SEMANTICS

DISTRIBUTIONAL SEMANTICS COMP90042 LECTURE 4 DISTRIBUTIONAL SEMANTICS LEXICAL DATABASES - PROBLEMS Manually constructed Expensive Human annotation can be biased and noisy Language is dynamic New words: slangs, terminology, etc.

More information

9 Searching the Internet with the SVD

9 Searching the Internet with the SVD 9 Searching the Internet with the SVD 9.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Retrieval Models and Implementation Ulf Leser Content of this Lecture Information Retrieval Models Boolean Model Vector Space Model Inverted Files Ulf Leser: Maschinelle

More information

Fall CS646: Information Retrieval. Lecture 6 Boolean Search and Vector Space Model. Jiepu Jiang University of Massachusetts Amherst 2016/09/26

Fall CS646: Information Retrieval. Lecture 6 Boolean Search and Vector Space Model. Jiepu Jiang University of Massachusetts Amherst 2016/09/26 Fall 2016 CS646: Information Retrieval Lecture 6 Boolean Search and Vector Space Model Jiepu Jiang University of Massachusetts Amherst 2016/09/26 Outline Today Boolean Retrieval Vector Space Model Latent

More information

Collaborative Topic Modeling for Recommending Scientific Articles

Collaborative Topic Modeling for Recommending Scientific Articles Collaborative Topic Modeling for Recommending Scientific Articles Chong Wang and David M. Blei Best student paper award at KDD 2011 Computer Science Department, Princeton University Presented by Tian Cao

More information

Chapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze

Chapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze Chapter 10: Information Retrieval See corresponding chapter in Manning&Schütze Evaluation Metrics in IR 2 Goal In IR there is a much larger variety of possible metrics For different tasks, different metrics

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

Chapter 10. Regression. Understandable Statistics Ninth Edition By Brase and Brase Prepared by Yixun Shi Bloomsburg University of Pennsylvania

Chapter 10. Regression. Understandable Statistics Ninth Edition By Brase and Brase Prepared by Yixun Shi Bloomsburg University of Pennsylvania Chapter 10 Regression Understandable Statistics Ninth Edition By Brase and Brase Prepared by Yixun Shi Bloomsburg University of Pennsylvania Scatter Diagrams A graph in which pairs of points, (x, y), are

More information

Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata

Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata ABSTRACT Wei Wu Microsoft Research Asia No 5, Danling Street, Haidian District Beiing, China, 100080 wuwei@microsoft.com

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

Online Passive-Aggressive Algorithms. Tirgul 11

Online Passive-Aggressive Algorithms. Tirgul 11 Online Passive-Aggressive Algorithms Tirgul 11 Multi-Label Classification 2 Multilabel Problem: Example Mapping Apps to smart folders: Assign an installed app to one or more folders Candy Crush Saga 3

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 18: Latent Semantic Indexing Hinrich Schütze Center for Information and Language Processing, University of Munich 2013-07-10 1/43

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Basic Linear Algebra in MATLAB

Basic Linear Algebra in MATLAB Basic Linear Algebra in MATLAB 9.29 Optional Lecture 2 In the last optional lecture we learned the the basic type in MATLAB is a matrix of double precision floating point numbers. You learned a number

More information

Lecture 5: Web Searching using the SVD

Lecture 5: Web Searching using the SVD Lecture 5: Web Searching using the SVD Information Retrieval Over the last 2 years the number of internet users has grown exponentially with time; see Figure. Trying to extract information from this exponentially

More information

Updating the Partial Singular Value Decomposition in Latent Semantic Indexing

Updating the Partial Singular Value Decomposition in Latent Semantic Indexing Updating the Partial Singular Value Decomposition in Latent Semantic Indexing Jane E. Tougas a,1, a Faculty of Computer Science, Dalhousie University, Halifax, NS, B3H 1W5, Canada Raymond J. Spiteri b,2

More information

An Extended Frank-Wolfe Method, with Application to Low-Rank Matrix Completion

An Extended Frank-Wolfe Method, with Application to Low-Rank Matrix Completion An Extended Frank-Wolfe Method, with Application to Low-Rank Matrix Completion Robert M. Freund, MIT joint with Paul Grigas (UC Berkeley) and Rahul Mazumder (MIT) CDC, December 2016 1 Outline of Topics

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Quick Introduction to Nonnegative Matrix Factorization

Quick Introduction to Nonnegative Matrix Factorization Quick Introduction to Nonnegative Matrix Factorization Norm Matloff University of California at Davis 1 The Goal Given an u v matrix A with nonnegative elements, we wish to find nonnegative, rank-k matrices

More information

CS 572: Information Retrieval

CS 572: Information Retrieval CS 572: Information Retrieval Lecture 11: Topic Models Acknowledgments: Some slides were adapted from Chris Manning, and from Thomas Hoffman 1 Plan for next few weeks Project 1: done (submit by Friday).

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

Information Retrieval

Information Retrieval Introduction to Information CS276: Information and Web Search Christopher Manning and Pandu Nayak Lecture 15: Learning to Rank Sec. 15.4 Machine learning for IR ranking? We ve looked at methods for ranking

More information

PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211

PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211 PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211 IIR 18: Latent Semantic Indexing Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,

More information

Dealing with Text Databases

Dealing with Text Databases Dealing with Text Databases Unstructured data Boolean queries Sparse matrix representation Inverted index Counts vs. frequencies Term frequency tf x idf term weights Documents as vectors Cosine similarity

More information

13 Searching the Web with the SVD

13 Searching the Web with the SVD 13 Searching the Web with the SVD 13.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this

More information

Large-Scale Behavioral Targeting

Large-Scale Behavioral Targeting Large-Scale Behavioral Targeting Ye Chen, Dmitry Pavlov, John Canny ebay, Yandex, UC Berkeley (This work was conducted at Yahoo! Labs.) June 30, 2009 Chen et al. (KDD 09) Large-Scale Behavioral Targeting

More information

Generic Text Summarization

Generic Text Summarization June 27, 2012 Outline Introduction 1 Introduction Notation and Terminology 2 3 4 5 6 Text Summarization Introduction Notation and Terminology Two Types of Text Summarization Query-Relevant Summarization:

More information

On Two Class-Constrained Versions of the Multiple Knapsack Problem

On Two Class-Constrained Versions of the Multiple Knapsack Problem On Two Class-Constrained Versions of the Multiple Knapsack Problem Hadas Shachnai Tami Tamir Department of Computer Science The Technion, Haifa 32000, Israel Abstract We study two variants of the classic

More information

5 10 12 32 48 5 10 12 32 48 4 8 16 32 64 128 4 8 16 32 64 128 2 3 5 16 2 3 5 16 5 10 12 32 48 4 8 16 32 64 128 2 3 5 16 docid score 5 10 12 32 48 O'Neal averaged 15.2 points 9.2 rebounds and 1.0 assists

More information

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012 Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Principal Components Analysis Le Song Lecture 22, Nov 13, 2012 Based on slides from Eric Xing, CMU Reading: Chap 12.1, CB book 1 2 Factor or Component

More information

Predicting Neighbor Goodness in Collaborative Filtering

Predicting Neighbor Goodness in Collaborative Filtering Predicting Neighbor Goodness in Collaborative Filtering Alejandro Bellogín and Pablo Castells {alejandro.bellogin, pablo.castells}@uam.es Universidad Autónoma de Madrid Escuela Politécnica Superior Introduction:

More information

Lecture 4: An FPTAS for Knapsack, and K-Center

Lecture 4: An FPTAS for Knapsack, and K-Center Comp 260: Advanced Algorithms Tufts University, Spring 2016 Prof. Lenore Cowen Scribe: Eric Bailey Lecture 4: An FPTAS for Knapsack, and K-Center 1 Introduction Definition 1.0.1. The Knapsack problem (restated)

More information

Fast LSI-based techniques for query expansion in text retrieval systems

Fast LSI-based techniques for query expansion in text retrieval systems Fast LSI-based techniques for query expansion in text retrieval systems L. Laura U. Nanni F. Sarracco Department of Computer and System Science University of Rome La Sapienza 2nd Workshop on Text-based

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 9: Collaborative Filtering, SVD, and Linear Algebra Review Paul Ginsparg

More information

Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson

Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics

More information

psychological statistics

psychological statistics psychological statistics B Sc. Counselling Psychology 011 Admission onwards III SEMESTER COMPLEMENTARY COURSE UNIVERSITY OF CALICUT SCHOOL OF DISTANCE EDUCATION CALICUT UNIVERSITY.P.O., MALAPPURAM, KERALA,

More information

NewPR-Combining TFIDF with Pagerank

NewPR-Combining TFIDF with Pagerank NewPR-Combining TFIDF with Pagerank Hao-ming Wang 1,2, Martin Rajman 2,YeGuo 3, and Bo-qin Feng 1 1 School of Electronic and Information Engineering, Xi an Jiaotong University, Xi an, Shaanxi 710049, P.R.

More information

Information Retrieval

Information Retrieval Introduction to Information CS276: Information and Web Search Christopher Manning and Pandu Nayak Lecture 13: Latent Semantic Indexing Ch. 18 Today s topic Latent Semantic Indexing Term-document matrices

More information