Selecting Good Expansion Terms for Pseudo-Relevance Feedback


Guihong Cao, Jian-Yun Nie
Department of Computer Science and Operations Research
University of Montreal, Canada
{caogui, ...}

Jianfeng Gao
Microsoft Research, Redmond, USA
jfgao@microsoft.com

Stephen Robertson
Microsoft Research at Cambridge, Cambridge, UK
ser@microsoft.com

ABSTRACT
Pseudo-relevance feedback assumes that the most frequent terms in the pseudo-feedback documents are useful for retrieval. In this study, we re-examine this assumption and show that it does not hold in reality: many expansion terms identified by traditional approaches are in fact unrelated to the query and harmful to retrieval. We also show that good expansion terms cannot be distinguished from bad ones merely on their distributions in the feedback documents and in the whole collection. We therefore propose to integrate a term classification process to predict the usefulness of expansion terms, into which multiple additional features can be integrated. Our experiments on three TREC collections show that retrieval effectiveness can be much improved when term classification is used. In addition, we demonstrate that good terms should be identified directly according to their possible impact on retrieval effectiveness, i.e. using supervised learning, instead of unsupervised learning.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Retrieval models

General Terms
Design, Algorithm, Theory, Experimentation

Keywords
Pseudo-relevance feedback, Expansion Term Classification, SVM, Language Models

1. INTRODUCTION
User queries are usually too short to describe the information need accurately. Many important terms can be absent from the query, leading to poor coverage of the relevant documents. To solve this problem, query expansion has been widely used [9], [15], [21], [22]. Among all the approaches, pseudo-relevance feedback (PRF), which exploits the initial retrieval result, has been the most effective [21]. The basic assumption of PRF is that the top-ranked documents in the first retrieval result contain many useful terms that can help discriminate relevant documents from irrelevant ones. In general, expansion terms are extracted either according to the term distributions in the feedback documents (i.e. one tries to extract the most frequent terms), or according to a comparison between the term distributions in the feedback documents and in the whole document collection (i.e. to extract the most specific terms in the feedback documents). Several additional criteria have been proposed. For example, idf is widely used in the vector space model [15]; query length has been considered in [7] for the weighting of expansion terms; and some linguistic features have been tested in [16]. However, few studies have directly examined whether the expansion terms extracted from pseudo-feedback documents by the existing methods do indeed help retrieval. In general, one has been concerned only with the global impact of a set of expansion terms on retrieval effectiveness. A fundamental question that is often overlooked is whether the extracted expansion terms are truly related to the query and are useful for IR.
In fact, as we will show in this paper, the assumption that most expansion terms extracted from the feedback documents are useful does not hold, even when the global retrieval effectiveness is improved. Among the extracted terms, a non-negligible part is either unrelated to the query or harmful, rather than helpful, to retrieval effectiveness. So a crucial question is: how can we better select useful expansion terms from pseudo-feedback documents?

In this study, we propose to use a supervised learning method for term selection. The term selection problem can be cast as a term classification problem: we try to separate good expansion terms from the others directly according to their potential impact on retrieval effectiveness. This method differs from the existing ones, which can typically be considered unsupervised learning. SVM [6], [20] is used for term classification, exploiting not only the term distribution criteria used in previous studies, but also several additional criteria such as term proximity. The proposed approach has at least the following advantages: 1) expansion terms are no longer selected merely on term distributions and other criteria only indirectly related to retrieval effectiveness, but directly according to their possible impact on it, so we can expect the selected terms to have a higher impact on effectiveness; 2) the term classification process can naturally integrate various criteria, and thus provides a framework for incorporating different sources of evidence.

We evaluate our method on three TREC collections and compare it to the traditional approaches. The experimental results show that retrieval effectiveness can be improved significantly when term classification is integrated. To our knowledge, this is the first attempt to investigate the direct impact of individual expansion terms on retrieval effectiveness in pseudo-relevance feedback.

The remainder of the paper is organized as follows. Section 2 reviews related work and the state-of-the-art approaches to query expansion. In Section 3, we examine the PRF assumption used in previous studies and show that it does not hold in reality. Section 4 presents experiments investigating the potential usefulness of selecting good terms for expansion. Section 5 describes our term classification method and reports an evaluation of the classification process. The integration of the classification results into the PRF methods is described in Section 6. In Section 7, we evaluate the resulting retrieval method on three TREC collections. Section 8 concludes the paper and suggests some avenues for future work.

2. Related Work
Pseudo-relevance feedback has been widely used in IR. It has been implemented in different retrieval models: the vector space model [15], the probabilistic model [13], and so on. Recently, the PRF principle has also been implemented within the language modeling framework. Since our work is also carried out in language modeling, we review the related studies in this framework in more detail.

The basic ranking function in language modeling uses KL-divergence as follows:

Score(d, q) = \sum_{w \in V} P(w|\theta_q) \log P(w|\theta_d)    (1)

where V is the vocabulary of the whole collection, and \theta_q and \theta_d are respectively the query model and the document model. The document model has to be smoothed to solve the zero-probability problem. A commonly used smoothing method is Dirichlet smoothing [23]:

P(w|\theta_d) = ( tf(w,d) + u P(w|C) ) / ( |d| + u )    (2)

where |d| is the length of the document, tf(w,d) the term frequency of w within d, P(w|C) the probability of w in the whole collection C estimated with MLE (Maximum Likelihood Estimation), and u the Dirichlet prior (set at 1,500 in our experiments).

The query model describes the user's information need. In most traditional approaches using language modeling, this model is estimated with MLE without smoothing; we denote it by P(w|\theta_o). In general, this query model has poor coverage of the relevant and useful terms, especially for short queries: many terms related to the query's topic are absent from (or have a zero probability in) the model. Pseudo-relevance feedback is often used to improve the query model. We mention two representative approaches here: the relevance model and the mixture model.

The relevance model [8] assumes that a query term is generated by a relevance model P(w|\theta_R). However, it is impossible to define the relevance model without any relevance information; [8] thus exploits the top-ranked feedback documents by assuming them to be samples from the relevance model. The relevance model is then estimated as follows:

P(w|\theta_R) \approx \sum_{D \in F} P(w|D) P(D|\theta_R)

where F denotes the feedback documents. On the right side, the relevance model \theta_R is approximated by the original query Q. Applying Bayes' rule and making some simplifications, we obtain:

P(w|\theta_R) = \sum_{D \in F} P(w|D) P(Q|D) P(D) / P(Q)    (3)

That is, the probability of a term w in the relevance model is determined by its probability in the feedback documents (i.e. P(w|D)) as well as by the correspondence of the latter to the query (i.e. P(Q|D)). The relevance model is then used to enhance the original query model by the following interpolation:

P(w|\theta_q) = (1 - \lambda) P(w|\theta_o) + \lambda P(w|\theta_R)    (4)

where \lambda is the interpolation weight (set at 0.5 in our experiments). Notice that the above interpolation can also be implemented as document re-ranking in practice, in which only the top-ranked documents are re-ranked according to the relevance model.

The mixture model [22] also tries to build a language model for the query topic from the feedback documents, but in a different way. It assumes that the query topic model P(w|\theta_T) to be extracted corresponds to the part of the feedback documents that is most distinctive from the whole document collection. This distinctive part is extracted as follows: each feedback document is assumed to be generated by a mixture of the topic model and the collection model, and the EM algorithm [3] is used to extract the topic model so as to maximize the likelihood of the feedback documents. The topic model is then combined with the original query model by an interpolation similar to that of the relevance model.
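To make the components above concrete, here is a minimal sketch (not the authors' implementation) of the Dirichlet-smoothed document model of equation (2), the KL-divergence score of equation (1), and the interpolation of equation (4); the relevance-model estimate follows equation (3) under the simplifying assumption that P(D) is uniform over the feedback set.

```python
import math
from collections import Counter

def dirichlet_lm(doc_tokens, p_coll, mu=1500):
    """Equation (2): P(w|theta_d) with Dirichlet smoothing."""
    tf, dlen = Counter(doc_tokens), len(doc_tokens)
    return lambda w: (tf[w] + mu * p_coll(w)) / (dlen + mu)

def kl_score(query_model, doc_lm):
    """Equation (1): Score(d,q) = sum_w P(w|theta_q) log P(w|theta_d)."""
    return sum(p * math.log(doc_lm(w)) for w, p in query_model.items() if p > 0)

def relevance_model(feedback_docs, p_query_given_doc, p_coll):
    """Equation (3) with P(D) assumed uniform: P(w|theta_R) ~ sum_D P(w|D) P(Q|D)."""
    rm = Counter()
    for doc, p_q in zip(feedback_docs, p_query_given_doc):
        lm = dirichlet_lm(doc, p_coll)
        for w in set(doc):
            rm[w] += lm(w) * p_q
    z = sum(rm.values())
    return {w: v / z for w, v in rm.items()}

def interpolate(original, feedback, lam=0.5):
    """Equation (4): (1-lam) P(w|theta_o) + lam P(w|theta_R)."""
    vocab = set(original) | set(feedback)
    return {w: (1 - lam) * original.get(w, 0.0) + lam * feedback.get(w, 0.0)
            for w in vocab}
```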
Although the specific techniques used in the above two approaches differ, both assume that the strong terms contained in the feedback documents are related to the query and useful for improving retrieval effectiveness. In both cases, the strong terms are determined according to their distributions. The only difference is that the relevance model tries to extract the most frequent terms from the feedback documents (i.e. those with a strong P(w|D)), while the mixture model tries to extract those that are the most distinctive between the feedback documents and the general collection. These criteria have also been used in other PRF approaches (e.g. [21]).

Several additional criteria have been used to select terms related to the query. For example, [14] proposed the principle that the selected terms should have a higher probability in the relevant documents than in the irrelevant documents. For document filtering, term selection is more widely used in order to update the topic profile; for example, [24] extracted terms from true relevant and irrelevant documents to update the user profile (i.e. the query) using the Rocchio method. Kwok et al. [7] also made use of the query length as well as the size of the vocabulary. Smeaton and Van Rijsbergen [16] examined the impact of determining expansion terms using a minimal spanning tree and some simple linguistic analysis.

Despite the large number of studies, a crucial question that has not been directly examined is whether the expansion terms selected in one way or another are truly useful for retrieval. One was usually concerned with the global impact of a set of expansion terms. Indeed, in many experiments, improvements in retrieval effectiveness have been observed with PRF [8], [9], [19], [22]. This might suggest that most expansion terms are useful. Is it really so? We examine this question in the next section. Notice that some studies (e.g. [11]) have tried to understand the effect of query expansion; however, those studies examined terms extracted from the whole collection instead of from the feedback documents, and they also focused on the term distribution aspects.

3. A Re-examination of the PRF Assumption
The general assumption behind PRF can be formulated as follows: the most frequent or distinctive terms in pseudo-relevance feedback documents are useful and can improve retrieval effectiveness when added to the query. To test this assumption, we consider all the terms extracted from the feedback documents using the mixture model, and test each of these terms in turn to see its impact on retrieval effectiveness.

The following score function is used to integrate an expansion term e:

Score(d, q) = \sum_{t \in q} P(t|\theta_o) \log P(t|\theta_d) + w \log P(e|\theta_d)    (5)

where t is a query term, P(t|\theta_o) is the original query model described in Section 2, e is the expansion term under consideration, and w is its weight. The above expression is a simplified form of query expansion with a single term. To keep the test simple, we make the following simplifications: 1) an expansion term is assumed to act on the query independently of the other expansion terms; 2) each expansion term is added to the query with an equal weight, w being set at 0.01 or -0.01. In practice, an expansion term may act on the query in dependence with other terms, and the weights may differ. Despite these simplifications, the test can still reflect the main characteristics of the expansion terms. Good expansion terms are those that improve effectiveness when w is 0.01 and hurt effectiveness when w is -0.01; bad expansion terms produce the opposite effect; neutral expansion terms produce a similar effect whether w is 0.01 or -0.01. We can therefore generate three groups of expansion terms: good, bad and neutral. Ideally, we would like to expand queries with good expansion terms only.

Let us describe the identification of the three groups of terms in more detail. Suppose MAP(q) and MAP(q \cup e) are respectively the MAP of the original query and of the expanded query (expanded with e). We measure the performance change due to e by the ratio

chg(e) = [ MAP(q \cup e) - MAP(q) ] / MAP(q)

We set a threshold on this ratio: good and bad expansion terms must produce a performance change whose magnitude exceeds the threshold. In addition to the performance change, we also assume that a term appearing fewer than 3 times in the feedback documents is not an important expansion term; this allows us to filter out some noise. The above identification produces the desired (reference) labels for term classification.
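The labeling protocol above is straightforward to operationalize. The sketch below assumes a hypothetical helper eval_map(query, term, weight) that runs retrieval with equation (5) and returns the MAP; the threshold default is an arbitrary stand-in, since the exact value is a tuning choice.

```python
def chg(q_map, base_map):
    """Relative MAP change due to one expansion term: [MAP(q u e) - MAP(q)] / MAP(q)."""
    return (q_map - base_map) / base_map

def label_term(e, query, eval_map, w=0.01, threshold=0.005):
    # eval_map is a hypothetical evaluation helper; threshold is a tuning choice.
    base = eval_map(query, None, 0.0)
    up = chg(eval_map(query, e, +w), base)    # add e with weight +0.01
    down = chg(eval_map(query, e, -w), base)  # add e with weight -0.01
    if up > threshold and down < -threshold:
        return "good"
    if up < -threshold and down > threshold:
        return "bad"
    return "neutral"
```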
Now, we examine whether the candidate expansion terms proposed by the mixture model are good terms. Our verification is made on three TREC collections, AP, WSJ and Disk4&5, whose characteristics are described in Section 7.1. We consider 150 queries for each collection and the 80 expansion terms with the largest probabilities for each query. Table 1 shows the proportions of good, bad and neutral terms over all the queries in each collection.

Collection   Good Terms   Neutral Terms   Bad Terms
AP           17.52%       47.59%          36.69%
WSJ          17.41%       49.89%          32.69%
Disk4&5      17.64%       56.46%          25.88%

Table 1. Proportions of each group of expansion terms selected by the mixture model

As we can see, fewer than 18% of the expansion terms used in the mixture model are good terms, on all three collections. The proportion of bad terms is higher. This shows that the expansion process indeed added more bad terms than good ones. We also notice from Table 1 that a large proportion of the expansion terms are neutral, with little impact on retrieval effectiveness. Although these terms do not necessarily hurt retrieval, adding them to the query produces a long query and thus heavier query traffic (longer evaluation time). It is therefore desirable to remove them, too.

Figure 1. Distribution of the expansion terms for "airbus subsidies" in the feedback documents and in the collection

The above analysis clearly shows that the term selection process used in the mixture model is insufficient. A similar phenomenon is observed with the relevance model, and the observation generalizes to all the methods exploiting the same criteria. This suggests that the term selection criteria used, namely the term distributions in the feedback documents and in the whole document collection, are insufficient. It also indicates that good and bad expansion terms may have similar distributions, since the mixture model, which exploits the difference in term distribution between the feedback documents and the collection, fails to distinguish them.

To illustrate the last point, let us look at the distribution of the expansion terms selected with the mixture model for TREC query #51, "airbus subsidies". In Figure 1, we place the top 80 expansion terms with the largest probabilities in a two-dimensional space: one dimension represents the logarithm of a term's probability in the pseudo-relevant documents, and the other represents that in the whole collection. To make the illustration easier, a simple normalization is applied so that the final values lie in the range [0, 1]. Figure 1 shows the distribution of the three groups of expansion terms. We can observe that the neutral terms are somewhat isolated from the good and bad terms (in the lower-right corner), but the good expansion terms are intertwined with the bad ones. This figure illustrates the difficulty of separating good from bad expansion terms according to term distributions alone. It is then desirable to use additional criteria to better select useful expansion terms.
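For reference, the two coordinates plotted in Figure 1 can be computed as below. The paper specifies only "a simple normalization" to [0, 1], so the min-max scaling and the 0.5 smoothing count are assumptions.

```python
import math
from collections import Counter

def figure1_coords(terms, feedback_tokens, collection_tokens):
    """(x, y) per term: scaled log P(w|F) and scaled log P(w|C)."""
    def scaled_log_probs(tokens):
        tf, n = Counter(tokens), len(tokens)
        raw = {w: math.log((tf[w] + 0.5) / n) for w in terms}  # 0.5 smoothing: an assumption
        lo, hi = min(raw.values()), max(raw.values())
        return {w: (v - lo) / ((hi - lo) or 1.0) for w, v in raw.items()}
    x = scaled_log_probs(feedback_tokens)
    y = scaled_log_probs(collection_tokens)
    return {w: (x[w], y[w]) for w in terms}
```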

4. Usefulness of Selecting Good Terms
Before proposing an approach to select good terms, let us first examine the possible impact of a good term selection process. Assume an oracle classifier that correctly separates good, bad and neutral expansion terms as determined in Section 3. In this experiment, we keep only the good expansion terms for each query. All the good terms are integrated into the new query model in the same way as in either the relevance model or the mixture model. Table 2 shows the MAP (Mean Average Precision) over the top 1000 results with the original query model (LM), the query models expanded by the relevance model (REL) and by the mixture model (MIX), as well as by the oracle expansion terms (REL+Oracle and MIX+Oracle). The superscripts L, R and M indicate that the improvement over LM, REL and MIX respectively is statistically significant at p<0.05.

Models        AP         WSJ        Disk4&5
LM            ...        ...        ...
REL           ...(L)     ...(L)     ...(L)
REL+Oracle    ...(R,L)   ...(R,L)   ...(R,L)
MIX           ...(L)     ...(L)     ...(L)
MIX+Oracle    ...(M,L)   ...(M,L)   ...(M,L)

Table 2. The impact of the oracle expansion classifier

We can see that retrieval effectiveness can be much improved if term classification is done perfectly: the oracle expansion terms generally improve the MAP of the relevance model and the mixture model by 8-30%. This shows the usefulness of correctly classifying the expansion terms and the high potential of improving retrieval effectiveness through good term classification. The MAP obtained with the oracle expansion terms represents the upper bound of the retrieval effectiveness we can expect from pseudo-relevance feedback. Our problem now is to develop an effective method to correctly classify the expansion terms.

5. Classification of Expansion Terms
5.1 Classifier
Any classifier can be used for term classification; here we use SVM. More specifically, we use the SVM variant C-SVM [2] because of its effectiveness and simplicity [20]. Several kernel functions can be used in SVM. We use the radial basis function (RBF) kernel because it has relatively few hyper-parameters and has been shown to be effective in previous studies [2], [5]. This function is defined as follows:

K(x_i, x_j) = \exp( -||x_i - x_j||^2 / (2 \sigma^2) )    (6)

where \sigma is a parameter controlling the shape of the RBF function: the function gets flatter when \sigma is larger. Another parameter C>0 in C-SVM controls the trade-off between the slack variable penalty and the margin [2]. Both parameters are estimated with 5-fold cross-validation so as to maximize the classification accuracy on the training data (see Table 4).

In our term classification, we are interested in knowing not only whether a term is good, but also the extent to which it is good. This latter value is useful for measuring the importance of an expansion term and for weighting it in the new query. Therefore, once we obtain a classification score, we use the method described in [12] to transform it into a posterior probability. Suppose the classification score calculated with the SVM is s(x). Then the probability of x belonging to the class of good terms (denoted by +) is defined by:

P(+|x) = 1 / ( 1 + \exp( A s(x) + B ) )    (7)

where A and B are parameters estimated by minimizing the cross-entropy on a portion of the training data, namely the development data. This process is automated in LIBSVM [5]. We have P(+|x)>0.5 if and only if the term x is classified as a good term. More details about this model can be found in [12]. Note that the above probabilistic SVM may produce classification results different from those of the simple SVM, which classifies instances according to sign(s(x)). In our experiments, we tested both probabilistic and simple SVMs, and found that the former performs better. We use the SVM implementation LIBSVM [5] in our experiments.
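This classifier can be reproduced with any LIBSVM front end. The sketch below uses scikit-learn, whose SVC wraps LIBSVM: kernel="rbf" gives the kernel of equation (6) (gamma plays the role of 1/(2 sigma^2)), probability=True fits the sigmoid of equation (7) internally, and the 5-fold grid search mirrors the cross-validation described above. The data and the parameter grids are stand-ins, not the authors' setup.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X: feature vectors of candidate expansion terms, y: 1 = good, 0 = not good.
X = np.random.rand(200, 8)                      # stand-in data for illustration
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)       # stand-in labels

# C-SVM with an RBF kernel; probability=True makes LIBSVM fit eq. (7) internally.
grid = GridSearchCV(SVC(kernel="rbf", probability=True),
                    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
p_good = grid.best_estimator_.predict_proba(X)[:, 1]  # P(+|x) for each term
```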
5.2 Features Used for Term Classification
Each expansion term is represented by a feature vector F(e) = [f_1(e), f_2(e), ..., f_N(e)]^T in R^N, where ^T denotes the transpose of a vector. Useful features include those already used in traditional approaches, such as the term distribution in the feedback documents and in the whole collection. As mentioned, these features are insufficient. We therefore consider the following additional features:
- co-occurrences of the expansion term with the original query terms;
- proximity of the expansion term to the query terms.
We explain several groups of features below. Our assumption is that the most useful features are those that make the largest difference between the feedback documents and the whole collection (similar to the principle used in the mixture model). So we define two sets of features, one for the feedback documents and another for the whole collection. Technically, both sets can be obtained in a similar way, so we only describe the features for the feedback documents; the others are defined similarly.

Term distributions. The first features are the term distributions in the pseudo-relevant documents and in the collection. The feature for the feedback documents is defined as follows:

f_1(e) = \log [ \sum_{D \in F} tf(e, D) / \sum_t \sum_{D \in F} tf(t, D) ]

where F is the set of feedback documents. f_2(e) is defined similarly on the whole collection. These features are the traditional ones used in the relevance model and the mixture model.

Co-occurrence with single query terms. Many studies have found that terms that frequently co-occur with the query terms are often related to the query [1]. We therefore define the following feature to capture this:

f_3(e) = \log [ (1/n) \sum_{i=1}^{n} C(t_i, e | D) / tf(t_i, D) ]

where C(t_i, e | D) is the frequency of co-occurrences of query term t_i and the expansion term e within text windows in document D, and n is the number of query terms. The window size is empirically set to 12 words.

Co-occurrence with pairs of query terms. A stronger co-occurrence relation for an expansion term is with two query terms together. [1] has shown that this type of co-occurrence relation is much better than the previous one because it can take some query context into account. The text window size used here is 15 words. Given \Omega, the set of possible query-term pairs, we define the following feature, slightly extended from the previous one:

f_5(e) = \log [ (1/|\Omega|) \sum_{(t_i, t_j) \in \Omega} C(t_i, t_j, e | D) / tf(t_i, D) ]

Weighted term proximity. The idea of using term proximity has appeared in several studies [18]. Here we also assume that two terms co-occurring at a smaller distance are more closely related. There are several ways to define the distance between two terms in a set of documents [18]; we define it as the minimum number of words between the two terms among all their co-occurrences in the documents, denoting the distance between t_i and t_j in the set B of documents by dist(t_i, t_j | B). For a query of multiple words, we have to aggregate the distances between the expansion term and all query terms. The simplest method is the average distance, similar to the average distance defined in [18]; however, it does not produce good results in our experiments. Instead, the weighted average distance works better: each distance is weighted by the frequency of the corresponding co-occurrences. We then have the following feature:

f_7(e) = \log [ \sum_{i=1}^{n} C(t_i, e) dist(t_i, e | F) / \sum_{i=1}^{n} C(t_i, e) ]

where C(t_i, e) is the frequency of co-occurrences of t_i and e within text windows in the collection. The window size is set to 12 words as before.

Document frequency of the query terms and the expansion term together. The feature in this group counts the documents in which the expansion term co-occurs with all query terms:

f_9(e) = \log [ \sum_{D \in F} I( (\forall t \in q: t \in D) \wedge e \in D ) + 0.5 ]

where I(x) is the indicator function, whose value is 1 when the Boolean expression x is true, and 0 otherwise. The constant 0.5 acts as a smoothing factor to avoid a zero value.

To prevent a feature whose values vary over a large numeric range from dominating those varying over smaller ranges, feature scaling is necessary [5]. The scaling is done in a query-by-query manner. Let e in GEN(q) be an expansion term of query q, and f_i(e) one feature value of e. We scale f_i(e) as follows:

f'_i(e) = ( f_i(e) - min_i ) / ( max_i - min_i )

where min_i = min_{e in GEN(q)} f_i(e) and max_i = max_{e in GEN(q)} f_i(e). With this transformation, each feature becomes a real number in [0, 1].

In our experiments, only the above features are used, but the general method is not limited to them; other features can be added. The possibility of integrating arbitrary features for the selection of expansion terms is indeed an advantage of our method.
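As an illustration, here is a small sketch of f_1, f_9 and the query-wise scaling, assuming feedback documents are given as token lists; the window-based co-occurrence counts needed for f_3, f_5 and f_7 are omitted for brevity.

```python
import math

def f1(e, feedback_docs):
    """f1(e): log relative frequency of e in the feedback documents.
    Candidates occur at least 3 times in F (Section 3), so the count is positive."""
    tf_e = sum(doc.count(e) for doc in feedback_docs)
    total = sum(len(doc) for doc in feedback_docs)
    return math.log(tf_e / total)

def f9(e, query_terms, feedback_docs):
    """f9(e): smoothed log count of feedback docs containing e and all query terms."""
    n = sum(1 for doc in feedback_docs
            if e in doc and all(t in doc for t in query_terms))
    return math.log(n + 0.5)

def minmax_scale(values):
    """Query-by-query scaling of one feature over GEN(q) to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
```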

5.3 Classification Experiments
Let us now examine the quality of our classification. We use three test collections (see Table 4 in Section 7.1), with 150 queries for each collection. We divide these queries into three groups of 50 queries and perform leave-one-out cross-validation over the groups to evaluate the classification accuracy. To generate training and test data, we use the method described in Section 3 to label the possible expansion terms of each query as good terms or non-good terms (including bad and neutral terms), and then represent each expansion term with the features described in Section 5.2. The candidate expansion terms are those that occur no fewer than three times in the feedback documents (the top 20 documents of the initial retrieval).

Coll.     Percentage of good terms   Accuracy   Rec.   Prec.
AP        ...                        ...        ...    ...
WSJ       ...                        ...        ...    ...
Disk4&5   ...                        ...        ...    ...

Table 3. Classification results of SVM

Table 3 shows the classification results. In this table, we show the percentage of good expansion terms for all the queries in each collection: around 1/3. Using the SVM classifier, we obtain a classification accuracy of about 69%. This number is not high: a naive classifier that always assigns the non-good class would have an only slightly lower accuracy (i.e. one minus the percentage of good terms). However, such a classifier is useless for our purpose because no expansion term is classified as good. Better indicators are recall and, more particularly, precision. Although the classifier identifies only about 1/3 of the good terms (recall), around 60% of the identified terms are truly good terms (precision). Compared with Table 1, which concerns the expansion terms selected by the mixture model, the terms selected by the SVM classifier are of much higher quality. This shows that the additional features we use in the classification are useful, although they could be further improved in the future. In the next section, we describe how the selected expansion terms are integrated into our retrieval model.
6. Re-weighting Expansion Terms with Term Classification
The classification process performs a further selection among the expansion terms proposed by the relevance model and the mixture model respectively. The selected terms can be integrated into these models in two different ways: hard filtering, i.e. we keep only the expansion terms classified as good; or soft filtering, i.e. we use the classification score to increase the weight of good terms in the final query model. Our experiments show that the second method performs better; we compare the two methods in Section 7.4. In this section, we focus on the second method, which amounts to redefining the models P(w|\theta_R) for the relevance model and P(w|\theta_T) for the mixture model. For a term e such that P(+|e) > 0.5, these models are redefined as follows:

P_new(e|\theta_R) = (1/Z) P_old(e|\theta_R) ( \alpha + P(+|e) )
P_new(e|\theta_T) = (1/Z) P_old(e|\theta_T) ( \alpha + P(+|e) )    (8)

where Z is the normalization factor, and \alpha is a coefficient estimated on development data in our experiments using line search [4]. Once the expansion terms are re-weighted, we retain the top 80 terms with the highest probabilities for expansion. Their weights are normalized before being interpolated with the original query model. The number 80 is used for a fair comparison with the relevance model and the mixture model.
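A small sketch of the soft-filtering re-weighting of equation (8) and the top-80 truncation follows; term models are plain dictionaries, and the alpha value is an arbitrary stand-in for the one tuned on development data.

```python
def soft_reweight(term_probs, p_good, alpha=0.6):
    """Equation (8): boost terms the classifier labels good, then renormalize.
    term_probs: {term: P_old(term|theta)}; p_good: {term: P(+|term)}.
    alpha=0.6 is an arbitrary stand-in; the paper tunes it by line search."""
    boosted = {t: p * (alpha + p_good[t]) if p_good.get(t, 0.0) > 0.5 else p
               for t, p in term_probs.items()}
    z = sum(boosted.values())                 # normalization factor Z
    return {t: v / z for t, v in boosted.items()}

def top_k(model, k=80):
    """Keep the k highest-probability expansion terms and renormalize."""
    best = dict(sorted(model.items(), key=lambda kv: -kv[1])[:k])
    z = sum(best.values())
    return {t: v / z for t, v in best.items()}
```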

Name      Description                  #Docs     Train topics   Dev. topics   Test topics
AP        Assoc. Press, 1988-90        ...       ...            ...           ...
WSJ       Wall St. Journal, 1987-92    173,...   ...            ...           ...
Disk4&5   TREC disks 4&5               556,...   ...            ...           ...

Table 4. Statistics of evaluation data sets

7. IR Experiments
7.1 Experimental Settings
We evaluate our method on three TREC collections: AP88-90, WSJ87-92 and all documents on TREC disks 4&5. Table 4 shows the statistics of the three collections. For each dataset, we split the available topics into three parts: training data to train the SVM classifier, development data to estimate the parameter \alpha in equation (8), and test data. We use only the title of each TREC topic as the query. Both documents and queries are stemmed with the Porter stemmer, and stop words are removed. The main evaluation metric is Mean Average Precision (MAP) over the top 1000 documents. Since some previous studies showed that PRF improves recall but may hurt precision, we also report precision over the top 30 and 100 documents (P@30 and P@100), as well as recall as a supplementary measure. We perform a query-by-query analysis and conduct a t-test to determine whether the improvement in MAP is statistically significant.

The Indri 2.6 search engine [17] is used as our basic retrieval system. We use the relevance model implemented in Indri, but implemented the mixture model following [22], since Indri does not include this model.

7.2 Ad-hoc Retrieval Results
In the experiments, the following methods are compared:
- LM: the KL-divergence retrieval model with the original queries;
- REL: the relevance model;
- REL+SVM: the relevance model with term classification;
- MIX: the mixture model;
- MIX+SVM: the mixture model with term classification.
These models require some parameters, such as the weight of the original model when forming the final query representation, the Dirichlet prior for document model smoothing, and so on. Since the purpose of this paper is not to optimize these parameters, we set them to the same values for all models.

Tables 5, 6 and 7 show the results obtained with the different models on the three collections. In the tables, Imp. denotes the improvement rate over the LM model; * indicates that the improvement is statistically significant at p<0.05, and ** at p<0.01. The superscripts R and M indicate that the result is statistically better than the relevance model and the mixture model respectively at p<0.05.

Model      P@30   P@100   MAP        Imp.       Recall
LM         ...    ...     ...
REL        ...    ...     ...        ...%**     ...
REL+SVM    ...    ...     ...(R)     22.93%**   ...
MIX        ...    ...     ...        ...%**     ...
MIX+SVM    ...    ...     ...(M,R)   28.36%**   ...

Table 5. Ad-hoc retrieval results on AP data

Model      P@30   P@100   MAP      Imp.       Recall
LM         ...    ...     ...
REL        ...    ...     ...      ...%**     ...
REL+SVM    ...    ...     ...      ...%**     ...
MIX        ...    ...     ...      ...%**     ...
MIX+SVM    ...    ...     ...(R)   14.82%**   0.70

Table 6. Ad-hoc retrieval results on WSJ data

Model      P@30   P@100   MAP        Imp.       Recall
LM         ...    ...     ...
REL        ...    ...     ...        ...%*      ...
REL+SVM    ...    ...     ...(R)     14.20%**   ...
MIX        ...    ...     ...        ...%**     ...
MIX+SVM    ...    ...     ...(M,R)   25.96%**   ...

Table 7. Ad-hoc retrieval results on Disk4&5 data

From the tables, we observe that both the relevance model and the mixture model, which exploit a form of PRF, improve the retrieval effectiveness of LM significantly. This observation is consistent with previous studies, and the MAP obtained with these two models represents state-of-the-art effectiveness on these test collections. Comparing the relevance model and the mixture model, we see that the latter performs better. The reason may be the following: the mixture model relies more than the relevance model on the difference between the feedback documents and the whole collection to select expansion terms, and in doing so it filters out more bad or neutral expansion terms.

On all three collections, the model integrating term classification performs very well. When the classification model is used together with a PRF model, effectiveness is always improved. On the AP and Disk4&5 collections, the improvements are more than 7.5% and statistically significant. The improvements on the WSJ collection are smaller (about 3.5%) and not statistically significant. Regarding precision, term classification also improves precision over the top-ranked documents, except on Disk4&5 when SVM is added to REL. This shows that in most cases, adding the expansion terms does not hurt, but improves, precision.

Table 8 shows the expansion terms for the queries "machine translation" and "natural language processing"; the stemmed words have been restored for better readability. All the terms in the table were suggested by the mixture model, but only part of them (in italics) are useful expansion terms. Many of them are general terms that are not useful, for example "food", "make", "year", "50", and so on.

machine translation: compute, year, soviet, work, company, make, typewriter, english, busy, ibm, increase, people, ...
natural language processing: english, publish, word, nation, french, develop, food, russian, make, program, world, dictionary, gorilla, ...

Table 8. Expansion terms of two queries. The terms in italics are real good expansion terms, and those in bold are classified as good terms
The classification process helps identify the useful expansion terms well (in bold): although not all the useful expansion terms are identified, those that are (e.g. "program", "dictionary") are highly related and useful. As the weight of these terms is increased, the relative weight of the other terms is decreased, making their weights in the final query model smaller. These examples illustrate why the term classification process can improve retrieval effectiveness.

7.3 Supervised vs. Unsupervised Learning
Compared to the relevance model and the mixture model, the approach with term classification makes two changes: it uses supervised instead of unsupervised learning, and it uses several additional features. It is then important to see which of these changes contributes the most to the increase in retrieval effectiveness. To do so, we design a method using unsupervised learning, but with the same additional features. This unsupervised method extends the mixture model in the following way.

Each feedback document is still considered to be generated from the topic model (to be extracted) and the collection model, and we try to extract the topic model so as to maximize the likelihood of the feedback documents, as in the mixture model. The difference is that, instead of defining the topic model P(w|\theta_T) as a multinomial model, we define it as a log-linear model that combines all the features:

P(w|\theta_T) = (1/Z) \exp( \lambda^T F(w) )    (9)

where F(w) is the feature vector defined in Section 5.2, \lambda is the weight vector and Z is the normalization factor that makes P(w|\theta_T) a proper probability distribution. \lambda is estimated by maximizing the likelihood of the feedback documents. To avoid overfitting, we regularize \lambda by assuming a zero-mean Gaussian prior distribution [2]. The objective function to be maximized then becomes:

L(F) = \sum_{D \in F} \sum_{w \in V} tf(w,D) \log( (1-\alpha) P(w|\theta_C) + \alpha P(w|\theta_T) ) - \beta \lambda^T \lambda    (10)

where \beta is the regularization factor, set to 0.01 in our experiments, and \alpha is the probability of using the topic model to generate a pseudo-relevant document, set to a fixed value as in [22] (0.5 in our case). Since L(F) is a concave function w.r.t. \lambda, it has a unique maximum. We solve this unconstrained optimization problem with the L-BFGS algorithm [10].

Model        AP         WSJ    Disk4&5
MIX          ...        ...    ...
Log-linear   ...        ...    ...
MIX+SVM      ...(M,L)   ...    ...(M,L)

Table 9. Supervised learning vs. unsupervised learning

Table 9 shows the results measured by MAP. Again, the superscripts M and L indicate that the improvement over MIX and over the log-linear model respectively is statistically significant at p<0.05. From this table, we observe that the log-linear model outperforms the mixture model only slightly. This shows that an unsupervised learning method, even with additional features, cannot improve retrieval effectiveness by a large margin. The possible reason is that the objective function L(F) does not correlate well with MAP: the parameters maximizing L(F) do not necessarily produce a good MAP. In comparison, the MIX+SVM model largely outperforms the log-linear model on all three collections, and the improvements on AP and Disk4&5 are statistically significant. This result shows that a supervised learning method can capture the characteristics of genuinely good expansion terms more effectively than an unsupervised method.
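For concreteness, here is a compact sketch of this baseline: equation (9) as a softmax over feature scores and the regularized likelihood of equation (10), maximized with SciPy's L-BFGS by negating the objective; the gradient is left to numerical differentiation for brevity.

```python
import numpy as np
from scipy.optimize import minimize

def fit_log_linear(Feat, tf, p_coll, alpha=0.5, beta=0.01):
    """Feat: |V| x N feature matrix F(w); tf: counts of each word summed over the
    feedback documents (the mixture in eq. (10) does not depend on D, so the
    double sum collapses); p_coll: P(w|theta_C) as an array."""
    def neg_objective(lam):
        logits = Feat @ lam
        p_topic = np.exp(logits - logits.max())
        p_topic /= p_topic.sum()                       # equation (9)
        mix = (1 - alpha) * p_coll + alpha * p_topic
        return -(tf * np.log(mix)).sum() + beta * lam @ lam
    res = minimize(neg_objective, np.zeros(Feat.shape[1]), method="L-BFGS-B")
    logits = Feat @ res.x
    p = np.exp(logits - logits.max())
    return p / p.sum()                                 # P(w|theta_T)
```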
7.4 Soft Filtering vs. Hard Filtering
We mentioned two possible ways to use the classification results: hard filtering, which retains only the expansion terms classified as good, and soft filtering, which increases the weight of the good terms. In this section, we compare the two methods. Table 10 shows the results obtained with both. In the table, the superscripts M, R and H indicate that the improvement over MIX, REL and HARD respectively is statistically significant at p<0.05.

Model       AP         WSJ     Disk4&5
MIX         ...        ...     ...
MIX+HARD    ...(M)     ...     ...(M)
MIX+SOFT    ...(M,H)   ...     ...(M,H)
REL         ...        ...     ...
REL+HARD    ...        ...     ...
REL+SOFT    ...(R,H)   ...     ...(R)

Table 10. Soft filtering vs. hard filtering

From this table, we see that both hard and soft filtering improve effectiveness. The improvements with hard filtering are smaller but steady across the three collections; however, only its improvement over the MIX model on the AP and Disk4&5 data is statistically significant. In comparison, the soft filtering method performs much better. Our explanation is that, since the classification accuracy is far from perfect (less than 70%, as shown in Table 3), some top-ranked good expansion terms, which could improve performance significantly, may be removed by hard filtering. With soft filtering, even if top-ranked good terms are misclassified, we only reduce their relative weight in the final query model rather than removing them, so these expansion terms can still contribute to performance. In other words, soft filtering is less affected by classification errors.

7.5 Reducing Query Traffic
A critical aspect of query expansion is that, as more terms are added to the query, the query traffic, i.e. the time needed to evaluate the query, grows. In the previous sections, for comparison with previous methods, we used 80 expansion terms. In practice, this number can be too large. In this section, we examine the possibility of further reducing the number of expansion terms. In this experiment, after re-weighting with soft filtering, instead of keeping 80 expansion terms we select only the top 10. These terms are used to construct a small query topic model P(w|\theta_T), which is interpolated with the original query model as before. Table 11 shows the results using the mixture model.

Model      AP     WSJ    Disk4&5
MIX+SOFT   ...    ...    ...

Table 11. Soft filtering with 10 terms

As expected, the effectiveness with 10 expansion terms is lower than with 80. However, we still obtain much higher effectiveness than the traditional language model LM, and all the improvements are statistically significant. The results with 10 expansion terms also compare favorably to the mixture model with 80 expansion terms: on both the AP and Disk4&5 collections, the effectiveness is higher than that of the mixture model, and on WSJ it is very close. This experiment shows that we can reduce the number of expansion terms and still greatly increase retrieval effectiveness even with a reasonably small number. This observation allows us to keep query traffic within an acceptable range, making the method more feasible for search engines.

8. Conclusion
Pseudo-relevance feedback, which adds terms extracted from the feedback documents to the query, is an effective method for improving the query representation and the retrieval effectiveness. Its basic assumption is that most strong terms in the feedback documents are useful for IR. In this paper, we re-examined this hypothesis on three test collections and showed that the expansion terms determined in the traditional ways are not all useful. In reality, only a small proportion of the suggested expansion terms are useful, and many others are either harmful or useless.

In addition, we showed that the traditional criteria for the selection of expansion terms, based on term distributions, are insufficient: good and bad expansion terms are not distinguishable by these distributions. Motivated by these observations, we proposed to further classify expansion terms using additional features, and to select the expansion terms directly according to their possible impact on retrieval effectiveness. This method differs from the existing ones, which often rely on criteria that do not always correlate with retrieval effectiveness. Our experiments on three TREC collections showed that the expansion terms selected with our method are significantly better than the traditional expansion terms. We also showed that it is possible to limit the query traffic by controlling the number of expansion terms, and that this still leads to quite large improvements in retrieval effectiveness.

This study shows the importance of examining the crucial question of the usefulness of expansion terms before the terms are used. The method we propose also provides a general framework for integrating multiple sources of evidence. It suggests several interesting research avenues for future investigation. The results we obtained with term classification are much lower than those with the oracle expansion terms, which means there is still much room for improvement; in particular, improvements in classification quality could directly translate into improvements in retrieval effectiveness. Such improvements could be obtained by integrating more useful features: in this paper, we limited our investigation to a few often-used features, and more discriminative features can be investigated in the future.

REFERENCES
[1] Bai, J., Nie, J., Bouchard, H. and Cao, G. Using query contexts in information retrieval. In Proceedings of SIGIR 2007, Amsterdam, Netherlands, 2007.
[2] Bishop, C. Pattern recognition and machine learning. Springer, 2006.
[3] Dempster, A., Laird, N. and Rubin, D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[4] Gao, J., Qi, H., Xia, X. and Nie, J. Linear discriminant model for information retrieval. In Proceedings of SIGIR 2005, 2005.
[5] Hsu, C., Chang, C. and Lin, C. A practical guide to support vector classification. Technical Report, National Taiwan University.
[6] Joachims, T. Text categorization with support vector machines: learning with many relevant features. In ECML, pp. 137-142, 1998.
[7] Kwok, K.L., Grunfeld, L. and Chan, K. TREC-8 ad-hoc, query and filtering track experiments using PIRCS. In Proceedings of TREC-8.
[8] Lavrenko, V. and Croft, B. Relevance-based language models. In Proceedings of SIGIR 2001, pp. 120-128, 2001.
[9] Metzler, D. and Croft, B. Latent concept expansion using Markov random fields. In Proceedings of SIGIR 2007.
[10] Nocedal, J. and Wright, S. Numerical optimization. Springer.
[11] Peat, H.J. and Willett, P. The limitations of term co-occurrence data for query expansion in document retrieval systems. JASIS, 42(5), 1991.
[12] Platt, J. Probabilities for SV machines. In Advances in Large Margin Classifiers, pp. 61-74, MIT Press, Cambridge, MA.
[13] Robertson, S. and Sparck Jones, K. Relevance weighting of search terms. JASIS, 27:129-146, 1976.
[14] Robertson, S.E. On term selection for query expansion. Journal of Documentation, 46(4), 1990.
[15] Rocchio, J. Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, 1971.
[16] Smeaton, A.F. and Van Rijsbergen, C.J. The retrieval effects of query expansion on a feedback document retrieval system. Computer Journal, 26(3), 1983.
[17] Strohman, T., Metzler, D., Turtle, H. and Croft, B. Indri: a language-model based search engine for complex queries. In Proceedings of the International Conference on Intelligence Analysis, 2004.
[18] Tao, T. and Zhai, C. An exploration of proximity measures in information retrieval. In Proceedings of SIGIR 2007.
[19] Tao, T. and Zhai, C. Regularized estimation of mixture models for robust pseudo-relevance feedback. In Proceedings of SIGIR 2006.
[20] Vapnik, V. Statistical Learning Theory. New York: Wiley, 1998.
[21] Xu, J. and Croft, B. Query expansion using local and global document analysis. In Proceedings of SIGIR 1996, pp. 4-11, 1996.
[22] Zhai, C. and Lafferty, J. Model-based feedback in the KL-divergence retrieval model. In CIKM, 2001a.
[23] Zhai, C. and Lafferty, J. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR 2001, 2001b.
[24] Zhang, Y. and Callan, J. The bias problem and language models in adaptive filtering. In Proceedings of TREC, pp. 78-83, 2001.


More information

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015 CS 3710: Vsual Recognton Classfcaton and Detecton Adrana Kovashka Department of Computer Scence January 13, 2015 Plan for Today Vsual recognton bascs part 2: Classfcaton and detecton Adrana s research

More information

BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS. M. Krishna Reddy, B. Naveen Kumar and Y. Ramu

BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS. M. Krishna Reddy, B. Naveen Kumar and Y. Ramu BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS M. Krshna Reddy, B. Naveen Kumar and Y. Ramu Department of Statstcs, Osmana Unversty, Hyderabad -500 007, Inda. nanbyrozu@gmal.com, ramu0@gmal.com

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

Chapter 6. Supplemental Text Material

Chapter 6. Supplemental Text Material Chapter 6. Supplemental Text Materal S6-. actor Effect Estmates are Least Squares Estmates We have gven heurstc or ntutve explanatons of how the estmates of the factor effects are obtaned n the textboo.

More information

Pattern Classification

Pattern Classification Pattern Classfcaton All materals n these sldes ere taken from Pattern Classfcaton (nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wley & Sons, 000 th the permsson of the authors and the publsher

More information

Numerical Heat and Mass Transfer

Numerical Heat and Mass Transfer Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and

More information

Support Vector Machines

Support Vector Machines /14/018 Separatng boundary, defned by w Support Vector Machnes CISC 5800 Professor Danel Leeds Separatng hyperplane splts class 0 and class 1 Plane s defned by lne w perpendcular to plan Is data pont x

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan Kernels n Support Vector Machnes Based on lectures of Martn Law, Unversty of Mchgan Non Lnear separable problems AND OR NOT() The XOR problem cannot be solved wth a perceptron. XOR Per Lug Martell - Systems

More information

Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement

Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement Markov Chan Monte Carlo MCMC, Gbbs Samplng, Metropols Algorthms, and Smulated Annealng 2001 Bonformatcs Course Supplement SNU Bontellgence Lab http://bsnuackr/ Outlne! Markov Chan Monte Carlo MCMC! Metropols-Hastngs

More information

Uncertainty in measurements of power and energy on power networks

Uncertainty in measurements of power and energy on power networks Uncertanty n measurements of power and energy on power networks E. Manov, N. Kolev Department of Measurement and Instrumentaton, Techncal Unversty Sofa, bul. Klment Ohrdsk No8, bl., 000 Sofa, Bulgara Tel./fax:

More information

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin Fnte Mxture Models and Expectaton Maxmzaton Most sldes are from: Dr. Maro Fgueredo, Dr. Anl Jan and Dr. Rong Jn Recall: The Supervsed Learnng Problem Gven a set of n samples X {(x, y )},,,n Chapter 3 of

More information

Outline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline

Outline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline Outlne Bayesan Networks: Maxmum Lkelhood Estmaton and Tree Structure Learnng Huzhen Yu janey.yu@cs.helsnk.f Dept. Computer Scence, Unv. of Helsnk Probablstc Models, Sprng, 200 Notces: I corrected a number

More information

Department of Quantitative Methods & Information Systems. Time Series and Their Components QMIS 320. Chapter 6

Department of Quantitative Methods & Information Systems. Time Series and Their Components QMIS 320. Chapter 6 Department of Quanttatve Methods & Informaton Systems Tme Seres and Ther Components QMIS 30 Chapter 6 Fall 00 Dr. Mohammad Zanal These sldes were modfed from ther orgnal source for educatonal purpose only.

More information

MAXIMUM A POSTERIORI TRANSDUCTION

MAXIMUM A POSTERIORI TRANSDUCTION MAXIMUM A POSTERIORI TRANSDUCTION LI-WEI WANG, JU-FU FENG School of Mathematcal Scences, Peng Unversty, Bejng, 0087, Chna Center for Informaton Scences, Peng Unversty, Bejng, 0087, Chna E-MIAL: {wanglw,

More information

Learning from Data 1 Naive Bayes

Learning from Data 1 Naive Bayes Learnng from Data 1 Nave Bayes Davd Barber dbarber@anc.ed.ac.uk course page : http://anc.ed.ac.uk/ dbarber/lfd1/lfd1.html c Davd Barber 2001, 2002 1 Learnng from Data 1 : c Davd Barber 2001,2002 2 1 Why

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Limited Dependent Variables

Limited Dependent Variables Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages

More information

On the correction of the h-index for career length

On the correction of the h-index for career length 1 On the correcton of the h-ndex for career length by L. Egghe Unverstet Hasselt (UHasselt), Campus Depenbeek, Agoralaan, B-3590 Depenbeek, Belgum 1 and Unverstet Antwerpen (UA), IBW, Stadscampus, Venusstraat

More information

9.913 Pattern Recognition for Vision. Class IV Part I Bayesian Decision Theory Yuri Ivanov

9.913 Pattern Recognition for Vision. Class IV Part I Bayesian Decision Theory Yuri Ivanov 9.93 Class IV Part I Bayesan Decson Theory Yur Ivanov TOC Roadmap to Machne Learnng Bayesan Decson Makng Mnmum Error Rate Decsons Mnmum Rsk Decsons Mnmax Crteron Operatng Characterstcs Notaton x - scalar

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

MATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2)

MATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2) 1/16 MATH 829: Introducton to Data Mnng and Analyss The EM algorthm (part 2) Domnque Gullot Departments of Mathematcal Scences Unversty of Delaware Aprl 20, 2016 Recall 2/16 We are gven ndependent observatons

More information

Unified Subspace Analysis for Face Recognition

Unified Subspace Analysis for Face Recognition Unfed Subspace Analyss for Face Recognton Xaogang Wang and Xaoou Tang Department of Informaton Engneerng The Chnese Unversty of Hong Kong Shatn, Hong Kong {xgwang, xtang}@e.cuhk.edu.hk Abstract PCA, LDA

More information

Basically, if you have a dummy dependent variable you will be estimating a probability.

Basically, if you have a dummy dependent variable you will be estimating a probability. ECON 497: Lecture Notes 13 Page 1 of 1 Metropoltan State Unversty ECON 497: Research and Forecastng Lecture Notes 13 Dummy Dependent Varable Technques Studenmund Chapter 13 Bascally, f you have a dummy

More information

DETERMINATION OF UNCERTAINTY ASSOCIATED WITH QUANTIZATION ERRORS USING THE BAYESIAN APPROACH

DETERMINATION OF UNCERTAINTY ASSOCIATED WITH QUANTIZATION ERRORS USING THE BAYESIAN APPROACH Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata TC XVII IMEKO World Congress Metrology n the 3rd Mllennum June 7, 3,

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 30 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 2 Remedes for multcollnearty Varous technques have

More information

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009 College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:

More information

Pop-Click Noise Detection Using Inter-Frame Correlation for Improved Portable Auditory Sensing

Pop-Click Noise Detection Using Inter-Frame Correlation for Improved Portable Auditory Sensing Advanced Scence and Technology Letters, pp.164-168 http://dx.do.org/10.14257/astl.2013 Pop-Clc Nose Detecton Usng Inter-Frame Correlaton for Improved Portable Audtory Sensng Dong Yun Lee, Kwang Myung Jeon,

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science. December 2005 Examinations STA437H1F/STA1005HF. Duration - 3 hours

UNIVERSITY OF TORONTO Faculty of Arts and Science. December 2005 Examinations STA437H1F/STA1005HF. Duration - 3 hours UNIVERSITY OF TORONTO Faculty of Arts and Scence December 005 Examnatons STA47HF/STA005HF Duraton - hours AIDS ALLOWED: (to be suppled by the student) Non-programmable calculator One handwrtten 8.5'' x

More information

Support Vector Machines

Support Vector Machines Separatng boundary, defned by w Support Vector Machnes CISC 5800 Professor Danel Leeds Separatng hyperplane splts class 0 and class 1 Plane s defned by lne w perpendcular to plan Is data pont x n class

More information

Online Appendix to: Axiomatization and measurement of Quasi-hyperbolic Discounting

Online Appendix to: Axiomatization and measurement of Quasi-hyperbolic Discounting Onlne Appendx to: Axomatzaton and measurement of Quas-hyperbolc Dscountng José Lus Montel Olea Tomasz Strzaleck 1 Sample Selecton As dscussed before our ntal sample conssts of two groups of subjects. Group

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

Lecture 4. Instructor: Haipeng Luo

Lecture 4. Instructor: Haipeng Luo Lecture 4 Instructor: Hapeng Luo In the followng lectures, we focus on the expert problem and study more adaptve algorthms. Although Hedge s proven to be worst-case optmal, one may wonder how well t would

More information

4DVAR, according to the name, is a four-dimensional variational method.

4DVAR, according to the name, is a four-dimensional variational method. 4D-Varatonal Data Assmlaton (4D-Var) 4DVAR, accordng to the name, s a four-dmensonal varatonal method. 4D-Var s actually a drect generalzaton of 3D-Var to handle observatons that are dstrbuted n tme. The

More information

Gaussian Mixture Models

Gaussian Mixture Models Lab Gaussan Mxture Models Lab Objectve: Understand the formulaton of Gaussan Mxture Models (GMMs) and how to estmate GMM parameters. You ve already seen GMMs as the observaton dstrbuton n certan contnuous

More information

Natural Images, Gaussian Mixtures and Dead Leaves Supplementary Material

Natural Images, Gaussian Mixtures and Dead Leaves Supplementary Material Natural Images, Gaussan Mxtures and Dead Leaves Supplementary Materal Danel Zoran Interdscplnary Center for Neural Computaton Hebrew Unversty of Jerusalem Israel http://www.cs.huj.ac.l/ danez Yar Wess

More information

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models Computaton of Hgher Order Moments from Two Multnomal Overdsperson Lkelhood Models BY J. T. NEWCOMER, N. K. NEERCHAL Department of Mathematcs and Statstcs, Unversty of Maryland, Baltmore County, Baltmore,

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information

FREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced,

FREQUENCY DISTRIBUTIONS Page 1 of The idea of a frequency distribution for sets of observations will be introduced, FREQUENCY DISTRIBUTIONS Page 1 of 6 I. Introducton 1. The dea of a frequency dstrbuton for sets of observatons wll be ntroduced, together wth some of the mechancs for constructng dstrbutons of data. Then

More information

Large-Margin HMM Estimation for Speech Recognition

Large-Margin HMM Estimation for Speech Recognition Large-Margn HMM Estmaton for Speech Recognton Prof. Hu Jang Department of Computer Scence and Engneerng York Unversty, Toronto, Ont. M3J 1P3, CANADA Emal: hj@cs.yorku.ca Ths s a jont work wth Chao-Jun

More information

A LINEAR PROGRAM TO COMPARE MULTIPLE GROSS CREDIT LOSS FORECASTS. Dr. Derald E. Wentzien, Wesley College, (302) ,

A LINEAR PROGRAM TO COMPARE MULTIPLE GROSS CREDIT LOSS FORECASTS. Dr. Derald E. Wentzien, Wesley College, (302) , A LINEAR PROGRAM TO COMPARE MULTIPLE GROSS CREDIT LOSS FORECASTS Dr. Derald E. Wentzen, Wesley College, (302) 736-2574, wentzde@wesley.edu ABSTRACT A lnear programmng model s developed and used to compare

More information

MULTISPECTRAL IMAGE CLASSIFICATION USING BACK-PROPAGATION NEURAL NETWORK IN PCA DOMAIN

MULTISPECTRAL IMAGE CLASSIFICATION USING BACK-PROPAGATION NEURAL NETWORK IN PCA DOMAIN MULTISPECTRAL IMAGE CLASSIFICATION USING BACK-PROPAGATION NEURAL NETWORK IN PCA DOMAIN S. Chtwong, S. Wtthayapradt, S. Intajag, and F. Cheevasuvt Faculty of Engneerng, Kng Mongkut s Insttute of Technology

More information

Support Vector Machines CS434

Support Vector Machines CS434 Support Vector Machnes CS434 Lnear Separators Many lnear separators exst that perfectly classfy all tranng examples Whch of the lnear separators s the best? Intuton of Margn Consder ponts A, B, and C We

More information

2016 Wiley. Study Session 2: Ethical and Professional Standards Application

2016 Wiley. Study Session 2: Ethical and Professional Standards Application 6 Wley Study Sesson : Ethcal and Professonal Standards Applcaton LESSON : CORRECTION ANALYSIS Readng 9: Correlaton and Regresson LOS 9a: Calculate and nterpret a sample covarance and a sample correlaton

More information

Statistics for Economics & Business

Statistics for Economics & Business Statstcs for Economcs & Busness Smple Lnear Regresson Learnng Objectves In ths chapter, you learn: How to use regresson analyss to predct the value of a dependent varable based on an ndependent varable

More information

STATISTICS QUESTIONS. Step by Step Solutions.

STATISTICS QUESTIONS. Step by Step Solutions. STATISTICS QUESTIONS Step by Step Solutons www.mathcracker.com 9//016 Problem 1: A researcher s nterested n the effects of famly sze on delnquency for a group of offenders and examnes famles wth one to

More information

Lecture 2: Prelude to the big shrink

Lecture 2: Prelude to the big shrink Lecture 2: Prelude to the bg shrnk Last tme A slght detour wth vsualzaton tools (hey, t was the frst day... why not start out wth somethng pretty to look at?) Then, we consdered a smple 120a-style regresson

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

Bayesian Learning. Smart Home Health Analytics Spring Nirmalya Roy Department of Information Systems University of Maryland Baltimore County

Bayesian Learning. Smart Home Health Analytics Spring Nirmalya Roy Department of Information Systems University of Maryland Baltimore County Smart Home Health Analytcs Sprng 2018 Bayesan Learnng Nrmalya Roy Department of Informaton Systems Unversty of Maryland Baltmore ounty www.umbc.edu Bayesan Learnng ombnes pror knowledge wth evdence to

More information