Extending Relevance Model for Relevance Feedback


Le Zhao, Chenmin Liang and Jamie Callan
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
{lezhao, chenmnl,

Abstract

Relevance feedback is the retrieval task in which the system is given not only an information need but also some relevance judgments, usually from users' feedback on an initial result list produced by the system. With different amounts of feedback information available, the optimal feedback strategy may be very different. In the TREC Relevance Feedback task, the system is given different sets of feedback information, ranging from 1 relevant document to over 40 judgments with at least 3 relevant. In this work, we therefore try to develop a feedback algorithm that works well at all levels of feedback, by extending the relevance model for pseudo relevance feedback to include judged relevant documents when scoring feedback terms. Among these levels of feedback, it is more difficult for a feedback algorithm to perform well when given a minimal amount of feedback. Experiments show that our algorithm performs robustly in those difficult cases.

1 Introduction

Relevance feedback is the retrieval task in which the system is given not only a user query but also user feedback on some of the top ranked results. Feedback gives the retrieval system a chance to improve its results by exploiting the extra information through more elaborate techniques. This can be helpful in cases where users want as many relevant results as possible. Experimenting with feedback algorithms, and analyzing their failures, also gives us a chance to understand what is missing in users' keyword queries and how to improve these queries.

In the following, we first introduce the relevance feedback task defined by this year's track and the ad hoc retrieval algorithm we use as a basis. In Section 2, we lay out the basic model for combining pseudo relevance feedback and true relevance feedback in query expansion. One difficulty in doing so is properly and effectively balancing term weights from the two sources. We give our solution.
We show experiments in Section 3 and conclude.

1.1 Task definition

In this year's relevance feedback experiments, the retrieval system is evaluated with different numbers of feedback documents. Specifically, for each query, set A is the baseline without any feedback information. Set B allows the system to use 1 relevant document as positive feedback; set C allows 3 relevant and 3 non-relevant documents; set D uses 10 judged documents (a superset of set C); and set E provides all judged documents to the retrieval algorithm, varying from 40 to hundreds of judged documents.

1.2 Pseudo relevance feedback as basis algorithm

Since different numbers of feedback documents are possible, and especially when only a small number of judged documents is given, any relevance feedback algorithm that looks only at the few judged documents can have difficulty providing reliable expansion queries. This is because fewer feedback documents mean larger variation in term occurrences and expansion weights. The approach we propose here is to supplement the small amount of feedback information with pseudo relevance feedback, so that the larger number of feedback documents provides lower-variance estimates of the feedback terms, stabilizing the performance of the feedback algorithm.

The motivation for using the particular pseudo relevance feedback method proposed in (Lavrenko and Croft 2001) is that the best runs in the TREC 2005 and 2006 Terabyte tracks, (Metzler et al. 2005) and (Li and Yan 2006), both used this method. Since the relevance feedback task uses the same collection and a similar set of queries, we set the baseline retrieval method to be the same as in those two top runs: a dependency model to achieve higher top precision, followed by a pseudo relevance feedback step. Because of its relevance to this work, we show the pseudo relevance feedback model as implemented in the Indri search engine:

\[ P(r \mid I) = \sum_{D \in \mathrm{topN}} \frac{P(r \mid D)\, P(I \mid D)\, P(D)}{P(I)} \tag{1} \]

Relevance_model_query: #weight( w_1 r_1 w_2 r_2 ... w_n r_n ), where w_i = P(r_i | I).
Final_query: #weight( (1 - w) Relevance_model_query w Original_query ), where w is the weight of the original query.

Here, r is a concept term, which could refer to any term or phrase (we restrict ourselves to keywords only), D refers to a feedback document, and I is the information need expressed by the query. P(r | I) is a multinomial distribution over all concepts, describing what the relevant class should ideally look like. P(I | D) is approximated by the probability that the document generates the query, which is just the language modeling version of the relevance score. As P(I) is constant with respect to r, P(r | I) is easily calculated if we assume P(D) is uniform. The top weighted terms are selected for query expansion and weighted exactly by their P(r | I) values, so that the expansion query ranks documents according to their KL-divergence to the relevance model P(r | I). The final query simply interpolates the expansion query with the original query. The original queries can be the keyword query titles provided by TREC or the dependency model phrase queries, which we used as the basis algorithm. Now we introduce our relevance feedback algorithm.
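The computation in equation (1) and the two query templates can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, document models are unsmoothed maximum-likelihood estimates (Indri itself smooths them), P(D) is uniform, and the query likelihoods P(I | D) are supplied by the caller.

```python
from collections import Counter


def doc_language_model(tokens):
    """Unsmoothed maximum-likelihood estimate of P(r | D) (an assumption)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}


def relevance_model(feedback_docs, query_likelihoods, num_terms=50):
    """Estimate P(r | I) from the top-N documents, following equation (1).

    feedback_docs: list of token lists (the top-N retrieved documents).
    query_likelihoods: P(I | D) for each document, in the same order.
    With uniform P(D), the factor P(D) / P(I) is constant in r, so it
    vanishes once the weights are normalized to sum to 1.
    """
    weights = Counter()
    for tokens, p_i_given_d in zip(feedback_docs, query_likelihoods):
        for term, p_r_given_d in doc_language_model(tokens).items():
            weights[term] += p_r_given_d * p_i_given_d
    norm = sum(weights.values())
    if norm == 0:
        return []
    # Keep the highest-weighted terms as expansion terms.
    return [(term, w / norm) for term, w in weights.most_common(num_terms)]


def indri_weight_query(expansion_terms, original_query, orig_weight=0.7):
    """Fill in the Final_query template with an interpolation weight."""
    rm = " ".join(f"{w:.4f} {term}" for term, w in expansion_terms)
    return (f"#weight( {1 - orig_weight:.2f} #weight( {rm} ) "
            f"{orig_weight:.2f} {original_query} )")
```

For example, `indri_weight_query(relevance_model(docs, scores), "#combine( relevance feedback )")` yields a single query string in which the original query keeps weight 0.7 and the expansion terms share the remaining 0.3.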

2 Relevance feedback algorithm

The design of a feedback algorithm would be easy if the system were given complete judgments for the top N results: just use the judged documents as training samples, use simple term features, and train a binary classifier for relevant versus non-relevant; see for example (Zhu, Zhao and Callan 2007). This is especially effective if N is large. However, in many cases the system does not have such a luxury. In the current track, 3 out of the 4 feedback sets have N as low as 1, 6 or 10. Thus, in this section we propose a unified pseudo relevance and true relevance feedback method based on the relevance model formalism (Lavrenko and Croft 2001).

Applying Bayes' rule to equation (1), we recover its original form,

\[ P(r \mid I) = \sum_{D \in \mathrm{feedback}} P(r \mid D)\, P(D \mid I) \tag{2} \]

For the feedback documents that have relevance judgments, i.e. D ∈ D_T, we trust the judgments and set

\[ P(D \mid I) = \frac{\mathrm{Rel}(D, I)}{\sum_{D' \in R(I)} \mathrm{Rel}(D', I)} \]

where Rel(D, I) is the judged relevance and R(I) is the set of all relevant documents for topic I. (This year, for some of the input judgments, three levels of relevance {0, 1, 2} are used.) Because judged non-relevant documents have relevance score 0, they have no effect in the model.
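As a small worked illustration of this normalization, assume (hypothetically) a topic whose only relevant documents are two judged documents D_1 and D_2 with grades 2 and 1, plus one judged non-relevant document D_3 with grade 0:

```latex
\begin{align*}
\sum_{D' \in R(I)} \mathrm{Rel}(D', I) &= 2 + 1 = 3, \\
P(D_1 \mid I) &= \tfrac{2}{3}, \quad
P(D_2 \mid I) = \tfrac{1}{3}, \quad
P(D_3 \mid I) = \tfrac{0}{3} = 0 .
\end{align*}
```

Under that assumption the non-relevant document contributes nothing to P(r | I), and the grade-2 document counts twice as much as the grade-1 document.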

For the pseudo relevant documents, D_j ∈ D_P, we have to use Bayes' rule to expand the P(D | I) term just as in (1), arriving at the following decomposition:

\begin{align}
P(r \mid I) &= \sum_{D \in D_T} P(r \mid D)\, P(D \mid I) + \sum_{D_j \in D_P} P(r \mid D_j)\, \frac{P(I \mid D_j)\, P(D_j)}{P(I)} \nonumber \\
            &= \sum_{D \in D_T} P(r \mid D)\, P(D \mid I) + \frac{1}{P(I)} \sum_{D_j \in D_P} P(r \mid D_j)\, P(I \mid D_j)\, P(D_j) \tag{3}
\end{align}

This final score P(r | I) is the weight of the expansion term r in the relevance model. The above derivation is exact, except that the result contains some unknowns: P(I), P(D_j), and the total collection relevance \sum_{D \in C} \mathrm{Rel}(D, I). Notice that these quantities may depend on the topic, and special care must be taken in dealing with them, as they directly affect the relative importance of true relevant documents with respect to pseudo relevant documents. We propose a justified and effective way of dealing with this balancing.

2.1 Balancing weights from pseudo and true relevant documents

There are several key problems to solve in order to balance the weights properly. Firstly, since some of the probabilities will never be known, we want to use a hyper-parameter to weight the two sources, and tune that weight on training data. Secondly, the numbers of (true and pseudo) feedback documents can differ across topics; a larger D_T set would probably bias the model toward terms from the judged set, so the optimal hyper-parameter would vary. This is why a normalization step is needed, so that the numbers of true versus pseudo feedback documents do not affect the relative weights of the terms from these two sources. Thirdly, note that P(I | D_j), the query generation probability, will vary widely across topics, as a query can contain different numbers of query terms, which can be popular or rare. Thus, we need to reduce this variance so that the hyper-parameter is less variable across topics and it is easier to learn an effective value that can be applied to all topics.
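One way the three requirements above could be met is sketched below in Python. It is a sketch under assumptions, not a definitive implementation: the function name and the argument layout are hypothetical, the mixing weight is called `alpha`, each source is normalized by its own document set, and the per-topic normalizer for the query likelihood is taken to be the largest P(I | D_j) among the pseudo relevant documents.

```python
from collections import Counter


def combined_term_weights(judged, pseudo, alpha=0.7):
    """Mix term weights from judged and pseudo relevant documents.

    judged: list of (term_probs, rel) pairs, where term_probs maps
        term -> P(r | D) and rel is the judged relevance grade (0, 1, 2).
    pseudo: list of (term_probs, query_likelihood) pairs, where
        query_likelihood is P(I | D_j).
    alpha: hypothetical mixing weight; 1 uses judged documents only,
        0 uses pseudo relevant documents only.
    """
    weights = Counter()

    # Source 1: judged documents. Relevance is normalized over the judged
    # set, so the total mass of this source does not grow with |D_T|.
    total_rel = sum(rel for _, rel in judged)
    if total_rel > 0:
        for term_probs, rel in judged:
            for term, p in term_probs.items():
                weights[term] += alpha * p * rel / total_rel

    # Source 2: pseudo relevant documents with a uniform prior 1/|D_P|.
    # Dividing by a per-topic normalizer damps the topic-to-topic variance
    # of P(I | D_j); using the max over the pseudo set is an assumption.
    if pseudo:
        prior = 1.0 / len(pseudo)
        p_i = max(ql for _, ql in pseudo)
        if p_i > 0:
            for term_probs, ql in pseudo:
                for term, p in term_probs.items():
                    weights[term] += (1 - alpha) * p * (ql / p_i) * prior

    return dict(weights)
```

With this layout, duplicating every judged document leaves the judged contribution unchanged, which is the property the normalization step is meant to guarantee.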
From equation (3), since we do not know the judgments of all the documents, we estimate P(D | I) on the given samples D_T. This gives us the term

\[ \hat{P}(D \mid I) = \frac{\mathrm{Rel}(D, I)}{\sum_{D' \in D_T} \mathrm{Rel}(D', I)}, \]

which we call the empirical relevance distribution, and the estimate

\[ \hat{P}(r \mid I) = \sum_{D \in D_T} P(r \mid D)\, \hat{P}(D \mid I) + \sum_{D_j \in D_P} P(r \mid D_j)\, \frac{P(I \mid D_j)\, P(D_j)}{P(I)} \tag{4} \]

We denote the hyper-parameter by α, and parameterize P(r | I) in α as follows:

\[ P_\alpha(r \mid I) = \alpha \sum_{D \in D_T} P(r \mid D)\, \frac{\mathrm{Rel}(D, I)}{\sum_{D' \in D_T} \mathrm{Rel}(D', I)} + (1 - \alpha) \sum_{D_j \in D_P} P(r \mid D_j)\, \frac{P(I \mid D_j)}{P(I)}\, P(D_j) \tag{5} \]

P(D_j) is the document prior. If we assume it is uniform, its exact value does not matter, as tuning α effectively cancels this constant term out. By analogy with the empirical relevance distribution, we use the empirical prior 1/|D_P|. These two empirical distributions act as normalizers for the two sources, ensuring that the weights are normalized properly with respect to the numbers of samples in both sources. This solves the first and second problems outlined at the beginning of the subsection.

For the last problem, we do not know exactly how P(I) should be determined, but we do know that it should depend on the collection and on the topic, and we expect it to provide a proper balancing effect for the term P(I | D_j), which varies a lot across topics. Given these requirements, there are several alternative assignments for P(I): (a) max_{D ∈ C} P(I | D), and (b) avg_{D_j ∈ TOPN} P(I | D_j). Intuitively, if we neglect α, assignment (a) means we assume that a relevant document is as good as the best scoring document in the collection, while assignment (b) means it is as good as the average score of the top N results. We avoided using the simplest choice, P(I | C), the query generation probability of the collection model, as an approximation for P(I), because P(I | D_j) would vary more wildly than P(I | C) depending on how well the documents in the collection match the topic, while P(I | C) is ignorant of the documents.

\[ P(I) = \max_{D \in C} P(I \mid D) \tag{6} \]

\[ P(I) = \operatorname*{avg}_{D \in \mathrm{TOPN}} P(I \mid D) \tag{7} \]

Without experiments, it is difficult to see how the two assignments would behave, because the regulating parameter α, which lies outside the scope of the assignments, determines how much weight to put on the two sources (true and pseudo relevance feedback). In general, we should select the assignment that produces the most stable optimal α across all topics, or more directly select the best assignment according to the final retrieval effectiveness measure (such as overall MAP).

3 Experiments

3.1 Training and test datasets

We train all the parameters in the model on the training set by parameter sweeps. To create the training set, topics and assessments from the following tracks were used: the TREC Terabyte 2004, 2005 and 2006 tracks and the Million Query track. We require each training topic to contain at least 40 judgments, and use 2/3 of the judgments as feedback documents and the remaining 1/3 as evaluation data. Among the feedback judgments, at least 4 should be relevant and 5 non-relevant. Other topics are filtered out. We keep a ratio of 3:1 between MQ topics and Terabyte topics to mimic the testing conditions, and evaluate on the whole set using MAP. The total number of topics is 296, of which 74 are Terabyte topics, although for training efficiency we sampled a smaller subset of 113 training topics from this whole set. The selection of the feedback documents is crucial for creating a proper training set.
However, since we do not have the same results data from previous tracks as the relevance feedback track organizers do, we can only try to be as close to the test data as possible. Since the TREC judgment pool is biased toward top ranked documents, although maybe not as strongly as the feedback set for the test data, we simply used a 2/3 random sample of the judgments from the pool as feedback and the rest as evaluation data. We feared that directly using top ranked results from the baseline run would bias the results too much, but we did not try it. With the official test data now available, we could report the rankings of the feedback documents for candidate training sets and compare them to those of the test set.

3.2 Parameter tuning and performance analysis

Submitted results are shown in Table 1. All parameters were tuned on the training set. Because of the large number of parameters to tune (especially when conditioned on the different setups), we fix all the other parameters and tune only one parameter at a time. We first fix the smoothing parameters for the baseline Set A algorithm, and then keep the same parameters for the feedback runs. When tuning feedback runs, we vary only the number of feedback terms, the number of feedback documents, α, and the weight for the original query. From the results, we can see that the optimal α is quite stable across different feedback settings, which suggests that the balancing is done properly. We also did a more detailed per-topic analysis of the optimal α on Set C and Set D. The results show that although the optimal α varies across topics, for 1/3 of the topics α can take any value between 0 and 1 without changing performance; most of these topics are either too hard (MAP < 0.05) or too easy (MAP > 0.5). In 3/4 of the other topics, where α makes a difference, the optimal α is greater than 0.5, meaning true relevance feedback is emphasized more than the top ranking results.
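The one-parameter-at-a-time tuning described above can be sketched as a coordinate-wise sweep. The sketch below is illustrative only: `one_at_a_time_sweep`, the parameter names, and `toy_map` (a stand-in for running retrieval and computing MAP on the training topics) are all hypothetical.

```python
def one_at_a_time_sweep(evaluate, grid, start):
    """Tune each parameter in turn while holding the others fixed.

    evaluate: function mapping a parameter dict to a score such as MAP
        (higher is better).
    grid: dict mapping parameter name -> list of candidate values.
    start: dict with an initial value for every parameter in grid.
    """
    best = dict(start)
    best_score = evaluate(best)
    for name, candidates in grid.items():
        for value in candidates:
            trial = dict(best, **{name: value})
            score = evaluate(trial)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score


def toy_map(params):
    """Toy objective standing in for MAP on the training topics."""
    return (1.0
            - (params["alpha"] - 0.7) ** 2
            - ((params["fb_terms"] - 50) / 100) ** 2)


grid = {
    "fb_terms": [10, 20, 50, 100],
    "alpha": [i / 10 for i in range(11)],
}
best_params, best_score = one_at_a_time_sweep(
    toy_map, grid, start={"fb_terms": 10, "alpha": 0.0}
)
```

Such a sweep costs only the sum of the grid sizes in evaluations rather than their product, at the price of possibly missing interactions between parameters.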
If we take the optimal MAP value for each topic, we get an upper bound on the performance obtainable by varying α. On the training set, this upper bound turns out to be MAP = for α in the set {0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, with all other parameters set as in runs CMURF08.C1 and D1; in particular, the original query weight is set to 0.7, which restricts the effect of the expansion query.

When given only one (the top) relevant document for feedback, on the test set our model (α = 0.7) performs significantly better than relevance feedback alone (α = 1), MAP vs. , with significance level p < by a paired sign test, although it is not significantly better than pseudo relevance feedback alone (α = 0, MAP = ).

We also show per topic how our results compare to the median system in Figure 1. The results are worse than the median on average, and a comparison with other systems shows that our baseline is much worse than theirs. However, comparing the relative gains in MAP from using more relevance data (Table 1), the constructed training set and the official test data show a similar trend, and the largest performance gain is in the 10% range. On the training data, this gain is achieved right at set B, while on the test data the best performance is at set E. This suggests that a randomly sampled feedback document (as used for the training topics) is more informative than a top ranked relevant document (as in the test case). This relative gain is also smaller than that of the best systems, which might be because of the worse baseline. More experiments need to be done to re-establish a baseline and confirm this.

Table 1. Submitted results with common parameters: dependency model initial queries, phrase smoothing Dirichlet with mu=4000, term smoothing mu=1700, #feedback terms = 50, original query weight = 0.7, max assignment for P(I). Other parameters are shown below.

Columns: Runtag, Stop-words, #fb Docs, α, Training (MAP, RR, P@10), Official Test* (MAP, RR, P@10, mtc, statap)

CMURF08.A1   Included   10 <
CMURF08.B1   Excluded
CMURF08.C1   Excluded
CMURF08.D1   Excluded
CMURF08.E1   Excluded
CMURF08.B2   Included
CMURF08.C2   Excluded
CMURF08.D2   Excluded

* Reported test metrics MAP, RR and P@10 are evaluated on the 31 Terabyte track topics; the mtc and statap measures are evaluated on MQ topics, with 237 and 208 valid topics respectively.

4 Conclusions

We presented a coherent model for incorporating pseudo relevance feedback together with true relevance feedback. We showed that the difficulty lies mainly in balancing term weights coming from the two different sources. Several steps have been taken to deal with weight balancing, and reasonable alternative solutions exist. Results show that merging relevance feedback and pseudo relevance feedback is significantly better than relevance feedback alone when feedback documents are few. In terms of feedback performance, a randomly sampled relevant document carries more information than a top ranked relevant document.

For future work, we would like to explore ways of modeling negative feedback and more finely controlled term extraction, such as restricting extraction to a window around each query term, to get more precise feedback terms. We are also interested in investigating how perturbing the feedback documents would affect results.

Figure 1. Per-topic AP difference from the median system, comparing CMURF08.E1 with the median system.

References

Donald Metzler, Trevor Strohman, Yun Zhou and W.B. Croft, 2005. Indri at TREC 2005: Terabyte Track. In Proceedings of the 14th Text REtrieval Conference (TREC).

Jingjing Li and Hongfei Yan, 2006. Peking University at the TREC 2006 Terabyte Track. In Proceedings of the 15th Text REtrieval Conference (TREC).

J. J. Rocchio, 1971. Relevance Feedback in Information Retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing.

Xuanhui Wang, Hui Fang and ChengXiang Zhai, 2008. A Study of Methods for Negative Relevance Feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '08).

Victor Lavrenko and W.B. Croft, 2001. Relevance-Based Language Models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '01).

Yangbo Zhu, Le Zhao and Jamie Callan, 2007. Structured Queries for Legal Search. In Proceedings of the 2007 Text REtrieval Conference (TREC 2007).
