Unified Utility Maximization Framework for Resource Selection

Luo Si
Language Technology Inst.
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
lsi@cs.cmu.edu

Jamie Callan
Language Technology Inst.
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
callan@cs.cmu.edu

ABSTRACT
This paper presents a unified utility framework for resource selection of distributed text information retrieval. This new framework shows an efficient and effective way to infer the probabilities of relevance of all the documents across the text databases. With the estimated relevance information, resource selection can be made by explicitly optimizing the goals of different applications. Specifically, when used for database recommendation, the selection is optimized for the goal of high-recall (include as many relevant documents as possible in the selected databases); when used for distributed document retrieval, the selection targets the high-precision goal (high precision in the final merged list of documents). This new model provides a more solid framework for distributed information retrieval. Empirical studies show that it is at least as effective as other state-of-the-art algorithms.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]:

General Terms
Algorithms

Keywords
distributed information retrieval, resource selection

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM'04, November 8-13, 2004, Washington, DC, USA. Copyright 2004 ACM //04/00 $5.00.

1. INTRODUCTION
Conventional search engines such as Google or AltaVista use an ad-hoc information retrieval solution by assuming all the searchable documents can be copied into a single centralized database for the purpose of indexing. Distributed information retrieval, also known as federated search [,4,7,,4,22], is different from ad-hoc information retrieval as it addresses the cases when documents cannot be acquired and stored in a single database. For example, "Hidden Web" contents (also called "invisible" or "deep" Web contents) are information on the Web that cannot be accessed by the conventional search engines. Hidden web contents have been estimated to be 2-50 [9] times larger than the contents that can be searched by conventional search engines. Therefore, it is very important to search this type of valuable information.

The architecture of distributed search solution is highly influenced by different environmental characteristics. In a small local area network such as small company environments, the information providers may cooperate to provide corpus statistics or use the same type of search engines. Early distributed information retrieval research focused on this type of cooperative environments [,8]. On the other hand, in a wide area network such as very large corporate environments or on the Web there are many types of search engines and it is difficult to assume that all the information providers can cooperate as they are required. Even if they are willing to cooperate in these environments, it may be hard to enforce a single solution for all the information providers or to detect whether information sources provide the correct information as they are required. Many applications fall into the latter type of uncooperative environments such as the Mind project [6] which integrates non-cooperating digital libraries or the QProber system [9] which supports browsing and searching of uncooperative hidden Web databases. In this paper, we focus mainly on uncooperative environments that contain multiple types of independent search engines.

There are three important sub-problems in distributed information retrieval. First, information about the contents of each individual database must be acquired (resource representation) [,8,2]. Second, given a query, a set of resources must be selected to do the search (resource selection) [5,7,2]. Third, the results retrieved from all the selected resources have to be merged into a single final list before it can be presented to the end user (retrieval and results merging) [,5,20,22].

Many types of solutions exist for distributed information retrieval.
Invisible-web.net provides guided browsing of hidden Web databases by collecting the resource descriptions of these databases and building hierarchies of classes that group them by similar topics. A database recommendation system goes a step further than a browsing system like Invisible-web.net by recommending most relevant information sources to users' queries. It is composed of the resource description and the resource selection components. This solution is useful when the users want to browse the selected databases by themselves instead of asking the system to retrieve relevant documents automatically. Distributed document retrieval is a more sophisticated task. It selects relevant information sources for users' queries as the database recommendation system does. Furthermore, users' queries are forwarded to the corresponding selected databases and the returned individual ranked lists are merged into a single list to present to the users.

The goal of a database recommendation system is to select a small set of resources that contain as many relevant documents as possible, which we call a high-recall goal. On the other hand, the effectiveness of distributed document retrieval is often measured by the Precision of the final merged document result list, which we call a high-precision goal. Prior research indicated that these two goals are related but not identical [4,2]. However, most previous solutions simply use an effective resource selection algorithm of a database recommendation system for a distributed document retrieval system, or solve the inconsistency with heuristic methods [,4,2].

This paper presents a unified utility maximization framework to integrate the resource selection problem of both database recommendation and distributed document retrieval together by treating them as different optimization goals.

First, a centralized sample database is built by randomly sampling a small amount of documents from each database with query-based sampling []; database size statistics are also estimated [2]. A logistic transformation model is learned off line with a small amount of training queries to map the centralized document scores in the centralized sample database to the corresponding probabilities of relevance.

Second, after a new query is submitted, the query can be used to search the centralized sample database, which produces a score for each sampled document. The probability of relevance for each document in the centralized sample database can be estimated by applying the logistic model to each document's score.
Then, the probabilities of relevance of all the (mostly unseen) documents among the available databases can be estimated using the probabilities of relevance of the documents in the centralized sample database and the database size estimates.

For the task of resource selection for a database recommendation system, the databases can be ranked by the expected number of relevant documents to meet the high-recall goal. For resource selection for a distributed document retrieval system, databases containing a small number of documents with large probabilities of relevance are favored over databases containing many documents with small probabilities of relevance. This selection criterion meets the high-precision goal of the distributed document retrieval application. Furthermore, the Semi-supervised learning (SSL) [20,22] algorithm is applied to merge the returned documents into a final ranked list.

The unified utility framework makes very few assumptions and works in uncooperative environments. Two key features make it a more solid model for distributed information retrieval: i) It formalizes the resource selection problems of different applications as various utility functions, and optimizes the utility functions to achieve the optimal results accordingly; and ii) It shows an effective and efficient way to estimate the probabilities of relevance of all documents across databases. Specifically, the framework builds logistic models on the centralized sample database to transform centralized retrieval scores to the corresponding probabilities of relevance and uses the centralized sample database as the bridge between individual databases and the logistic model. The human effort (relevance judgment) required to train the single centralized logistic model does not scale with the number of databases. This is a large advantage over previous research, which required the amount of human effort to be linear with the number of databases [7,5].

The unified utility framework is not only more theoretically solid but also very effective. Empirical studies show the new model to be at least as accurate as the state-of-the-art algorithms in a variety of configurations.

The next section discusses related work. Section 3 describes the new unified utility maximization model.
Section 4 explains our experimental methodology. Sections 5 and 6 present our experimental results for resource selection and document retrieval. Section 7 concludes.

2. PRIOR RESEARCH
There has been considerable research on all the sub-problems of distributed information retrieval. We survey the most related work in this section.

The first problem of distributed information retrieval is resource representation. The STARTS protocol is one solution for acquiring resource descriptions in cooperative environments [8]. However, in uncooperative environments, even if the databases are willing to share their information, it is not easy to judge whether the information they provide is accurate or not. Furthermore, it is not easy to coordinate the databases to provide resource representations that are compatible with each other. Thus, in uncooperative environments, one common choice is query-based sampling, which randomly generates and sends queries to individual search engines and retrieves some documents to build the descriptions. As the sampled documents are selected by random queries, query-based sampling is not easily fooled by any adversarial spammer that is interested in attracting more traffic. Experiments have shown that rather accurate resource descriptions can be built by sending about 80 queries and downloading about 300 documents [].

Many resource selection algorithms such as gGlOSS/vGlOSS [8] and CORI [] have been proposed in the last decade. The CORI algorithm represents each database by its terms, the document frequencies and a small number of corpus statistics (details in []). As prior research on different datasets has shown the CORI algorithm to be the most stable and effective of the three algorithms [,7,8], we use it as a baseline algorithm in this work. The relevant document distribution estimation (ReDDE) [2] resource selection algorithm is a recent algorithm that tries to estimate the distribution of relevant documents across the available databases and ranks the databases accordingly. Although the ReDDE algorithm has been shown to be effective, it relies on heuristic constants that are set empirically [2].
The last step of the document retrieval sub-problem is results merging, which is the process of transforming database-specific document scores into comparable database-independent document scores. The semi-supervised learning (SSL) [20,22] result merging algorithm uses the documents acquired by query-based sampling as training data and linear regression to learn the database-specific, query-specific merging models. These linear models are used to convert the database-specific document scores into the approximated centralized document scores. The SSL algorithm has been shown to be effective [22]. It serves as an important component of our unified utility maximization framework (Section 3).

In order to achieve accurate document retrieval results, many previous methods simply use resource selection algorithms that are effective for database recommendation systems. But as pointed out above, a good resource selection algorithm optimized for high-recall may not work well for document retrieval, which targets the high-precision goal. This type of inconsistency has been observed in previous research [4,2]. The research in [2] tried to solve the problem with a heuristic method.

The research most similar to what we propose here is the decision-theoretic framework (DTF) [7,5]. This framework computes a selection that minimizes the overall costs (e.g., retrieval quality, time) of a document retrieval system, and several methods [5] have been proposed to estimate the retrieval quality. However, two points distinguish our research from the DTF model. First, the DTF is a framework designed specifically for document retrieval, but our new model integrates two distinct applications with different requirements (database recommendation and distributed document retrieval) into the same unified framework. Second, the DTF builds a model for each database to calculate the probabilities of relevance. This requires human relevance judgments for the results retrieved from each database. In contrast, our approach only builds one logistic model for the centralized sample database. The centralized sample database can serve as a bridge to connect the individual databases with the centralized logistic model, thus the probabilities of relevance of documents in different databases can be estimated.
This strategy can save a large amount of human judgment effort and is a big advantage of the unified utility maximization framework over the DTF, especially when there are a large number of databases.

3. UNIFIED UTILITY MAXIMIZATION FRAMEWORK
The Unified Utility Maximization (UUM) framework is based on estimating the probabilities of relevance of the (mostly unseen) documents available in the distributed search environment. In this section we describe how the probabilities of relevance are estimated and how they are used by the Unified Utility Maximization model. We also describe how the model can be optimized for the high-recall goal of a database recommendation system and the high-precision goal of a distributed document retrieval system.

3.1 Estimating Probabilities of Relevance
As pointed out above, the purpose of resource selection is high-recall and the purpose of document retrieval is high-precision. In order to meet these diverse goals, the key issue is to estimate the probabilities of relevance of the documents in various databases. This is a difficult problem because we can only observe a sample of the contents of each database using query-based sampling. Our strategy is to make full use of all the available information to calculate the probability estimates.

3.1.1 Learning Probabilities of Relevance
In the resource description step, the centralized sample database is built by query-based sampling and the database sizes are estimated using the sample-resample method [2]. At the same time, an effective retrieval algorithm (Inquery [2]) is applied on the centralized sample database with a small number (e.g., 50) of training queries. For each training query, the CORI resource selection algorithm [] is applied to select some number (e.g., 10) of databases and retrieve 50 document ids from each database. The SSL results merging algorithm [20,22] is used to merge the results. Then, we can download the top 50 documents in the final merged list and calculate their corresponding centralized scores using Inquery and the corpus statistics of the centralized sample database.
The centralized scores are further normalized (divided by the maximum centralized score for each query), as this method has been suggested to improve estimation accuracy in previous research [5]. Human judgment is acquired for those documents and a logistic model is built to transform the normalized centralized document scores to probabilities of relevance as follows:

$$R(d) = P(rel \mid d) = \frac{\exp(a + b\, S_c(d))}{1 + \exp(a + b\, S_c(d))} \qquad (1)$$

where $S_c(d)$ is the normalized centralized document score and $a$ and $b$ are the two parameters of the logistic model. These two parameters are estimated by maximizing the probabilities of relevance of the training queries. The logistic model provides us the tool to calculate the probabilities of relevance from centralized document scores.

3.1.2 Estimating Centralized Document Scores
When the user submits a new query, the centralized document scores of the documents in the centralized sample database are calculated. However, in order to calculate the probabilities of relevance, we need to estimate centralized document scores for all documents across the databases instead of only the sampled documents. This goal is accomplished using: the centralized scores of the documents in the centralized sample database, and the database size statistics.

We define the database scale factor for the $i$-th database as the ratio of the estimated database size and the number of documents sampled from this database as follows:

$$SF_{db_i} = \frac{\hat{N}_{db_i}}{N_{db_i\_samp}} \qquad (2)$$

where $\hat{N}_{db_i}$ is the estimated database size and $N_{db_i\_samp}$ is the number of documents from the $i$-th database in the centralized sample database. The intuition behind the database scale factor is that, for a database whose scale factor is 50, if one document from this database in the centralized sample database has a centralized document score of 0.5, we may guess that there are about 50 documents in that database which have scores of about 0.5. Actually, we can apply a finer non-parametric linear interpolation method to estimate the centralized document score curve for each database. Formally, we rank all the sampled documents from the $i$-th database by their centralized document scores to get the sampled centralized document score list $\{S_c(ds_{i1}), S_c(ds_{i2}), S_c(ds_{i3}), \ldots\}$ for the $i$-th database; we assume that if we could calculate the centralized document scores for all the documents in this database and get the complete centralized document score list, the top document in the sampled list would have rank $SF_{db_i}/2$, the second document in the sampled list would rank $3\,SF_{db_i}/2$, and so on. Therefore, the data points of sampled documents in the complete list are: $\{(SF_{db_i}/2,\ S_c(ds_{i1})),\ (3\,SF_{db_i}/2,\ S_c(ds_{i2})),\ (5\,SF_{db_i}/2,\ S_c(ds_{i3})),\ \ldots\}$. Piecewise linear interpolation is applied to estimate the centralized document score curve, as illustrated in Figure 1. The complete centralized document score list can be estimated by calculating the values of different ranks on the centralized document curve as: $S_c(d_{ij}),\ j \in [1, \hat{N}_{db_i}]$.

It can be seen from Figure 1 that more sample data points produce more accurate estimates of the centralized document score curves. However, for databases with large database scale ratios, this kind of linear interpolation may be rather inaccurate, especially for the top ranked (e.g., $[1, SF_{db_i}/2]$) documents. Therefore, an alternative solution is proposed to estimate the centralized document scores of the top ranked documents for databases with large scale ratios (e.g., larger than 100). Specifically, a logistic model is built for each of these databases. The logistic model is used to estimate the centralized document score of the top 1 document in the corresponding database by using the two sampled documents from that database with highest centralized scores:

$$S_c(d_{i1}) = \frac{\exp(\alpha_{i0} + \alpha_{i1} S_c(ds_{i1}) + \alpha_{i2} S_c(ds_{i2}))}{1 + \exp(\alpha_{i0} + \alpha_{i1} S_c(ds_{i1}) + \alpha_{i2} S_c(ds_{i2}))} \qquad (3)$$

$\alpha_{i0}$, $\alpha_{i1}$ and $\alpha_{i2}$ are the parameters of the logistic model. For each training query, the top retrieved document of each database is downloaded and the corresponding centralized document score is calculated. Together with the scores of the top two sampled documents, these parameters can be estimated.
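The interpolation step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `estimate_score_curve` and its arguments are hypothetical, and ranks that fall outside the range covered by the sampled points are simply held at the nearest sampled score (the text adjusts the top part separately for databases with large scale factors).

```python
def estimate_score_curve(sample_scores, db_size):
    """Estimate the complete centralized score list of one database.

    sample_scores: centralized scores of the sampled documents (any order).
    db_size: estimated number of documents in the database (an integer).
    Returns a list with one estimated score per rank 1..db_size.
    """
    scores = sorted(sample_scores, reverse=True)
    scale = db_size / len(scores)  # database scale factor (Equation 2)
    # The k-th sampled document (k = 0, 1, ...) is assumed to sit at
    # rank (2k + 1) * SF / 2 in the complete list.
    xs = [(2 * k + 1) * scale / 2 for k in range(len(scores))]

    curve = []
    seg = 0  # index of the interpolation segment currently in use
    for rank in range(1, db_size + 1):
        if rank <= xs[0]:
            # Assumption: hold the first sampled score above the first point.
            curve.append(scores[0])
        elif rank >= xs[-1]:
            # Assumption: hold the last sampled score below the last point.
            curve.append(scores[-1])
        else:
            while xs[seg + 1] < rank:
                seg += 1
            t = (rank - xs[seg]) / (xs[seg + 1] - xs[seg])
            curve.append(scores[seg] + t * (scores[seg + 1] - scores[seg]))
    return curve
```

For example, with two sampled scores 0.5 and 0.3 and an estimated size of 100 documents, the scale factor is 50, so the sampled documents are placed at ranks 25 and 75 and the ranks in between are interpolated linearly.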
After the centralized score of the top document is estimated, an exponential function is fitted for the top part ($[1, SF_{db_i}/2]$) of the centralized document score curve as:

$$S_c(d_{ij}) = \exp(\beta_{i0} + \beta_{i1}\, j), \quad j \in [1, SF_{db_i}/2] \qquad (4)$$

$$\beta_{i0} = \log(S_c(d_{i1})) - \beta_{i1} \qquad (5)$$

$$\beta_{i1} = \frac{\log(S_c(ds_{i1})) - \log(S_c(d_{i1}))}{SF_{db_i}/2 - 1} \qquad (6)$$

The two parameters $\beta_{i0}$ and $\beta_{i1}$ are fitted to make sure the exponential function passes through the two points $(1, S_c(d_{i1}))$ and $(SF_{db_i}/2, S_c(ds_{i1}))$. The exponential function is only used to adjust the top part of the centralized document score curve and the lower part of the curve is still fitted with the linear interpolation method described above. The adjustment by fitting an exponential function of the top ranked documents has been shown empirically to produce more accurate results.

Figure 1. Linear interpolation construction of the complete centralized document score list (database scale factor is 50).

From the centralized document score curves, we can estimate the complete centralized document score lists accordingly for all the available databases. After the estimated centralized document scores are normalized, the complete lists of probabilities of relevance can be constructed out of the complete centralized document score lists by Equation 1. Formally for the $i$-th database, the complete list of probabilities of relevance is: $R(d_{ij}),\ j \in [1, \hat{N}_{db_i}]$.

3.2 The Unified Utility Maximization Model
In this section, we formally define the new unified utility maximization model, which optimizes the resource selection problems for two goals of high-recall (database recommendation) and high-precision (distributed document retrieval) in the same framework.

In the task of database recommendation, the system needs to decide how to rank databases. In the task of document retrieval, the system not only needs to select the databases but also needs to decide how many documents to retrieve from each selected database. We generalize the database recommendation selection process, which implicitly recommends all documents in every selected database, as a special case of the selection decision for the document retrieval task. Formally, we denote $d_i$ as the number of documents we would like to retrieve from the $i$-th database and $\vec{d} = \{d_1, d_2, \ldots\}$ as a selection action for all the databases.
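Looking back at the end of Section 3.1.2, the top-part adjustment of Equations 4-6 and the score-to-probability mapping of Equation 1 can be sketched as below. This is only an illustration under assumptions: the function names are hypothetical, scores are assumed to be positive (so that logarithms exist), the scale factor is assumed to be larger than 2, and any values passed for `a` and `b` are placeholders rather than fitted parameters.

```python
import math


def fit_top_exponential(top_score, first_sample_score, scale_factor):
    """Fit exp(beta0 + beta1 * j) through (1, top_score) and
    (scale_factor / 2, first_sample_score), as in Equations 5 and 6."""
    half = scale_factor / 2
    beta1 = (math.log(first_sample_score) - math.log(top_score)) / (half - 1)
    beta0 = math.log(top_score) - beta1
    return beta0, beta1


def top_scores(beta0, beta1, scale_factor):
    """Estimated centralized scores for ranks 1..scale_factor/2 (Equation 4)."""
    return [math.exp(beta0 + beta1 * j)
            for j in range(1, int(scale_factor / 2) + 1)]


def relevance_probability(score, a, b):
    """Logistic mapping of Equation 1, written in the equivalent form
    1 / (1 + exp(-(a + b * score)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * score)))
```

With an estimated top score of 0.8, a best sampled score of 0.5 and a scale factor of 200, the fitted curve starts at 0.8 for rank 1 and decays to 0.5 at rank 100, where the piecewise linear part takes over.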
The database selection decision is made based on the complete lists of probabilities of relevance for all the databases. The complete lists of probabilities of relevance are inferred from all the available information, specifically $R_s$, which stands for the resource descriptions acquired by query-based sampling and the database size estimates acquired by sample-resample; $S_c$ stands for the centralized document scores of the documents in the centralized sample database.

If the method of estimating centralized document scores and probabilities of relevance in Section 3.1 is acceptable, then the most probable complete lists of probabilities of relevance can be derived and we denote them as $\theta^* = \{(R(d_{1j}),\ j \in [1, \hat{N}_{db_1}]),\ (R(d_{2j}),\ j \in [1, \hat{N}_{db_2}]),\ \ldots\}$. Random vector $\theta$ denotes an arbitrary set of complete lists of probabilities of relevance and $P(\theta \mid R_s, S_c)$ is the probability of generating this set of lists. Finally, to each selection action $\vec{d}$ and a set of complete lists of probabilities of relevance $\theta$, we associate a utility function $U(\vec{d}, \theta)$, which indicates the benefit from making the selection $\vec{d}$ when the true complete lists of probabilities of relevance are $\theta$.

Therefore, the selection decision defined by the Bayesian framework is:

$$\vec{d}^* = \arg\max_{\vec{d}} \int_{\theta} U(\vec{d}, \theta)\, P(\theta \mid R_s, S_c)\, d\theta \qquad (7)$$

One common approach to simplify the computation in the Bayesian framework is to only calculate the utility function at the most probable parameter values instead of calculating the whole expectation. In other words, we only need to calculate $U(\vec{d}, \theta^*)$, and Equation 7 is simplified as follows:

$$\vec{d}^* = \arg\max_{\vec{d}} U(\vec{d}, \theta^*) \qquad (8)$$

This equation serves as the basic model for both the database recommendation system and the document retrieval system.

3.3 Resource Selection for High-Recall
High-recall is the goal of the resource selection algorithm in federated search tasks such as database recommendation. The goal is to select a small set of resources (e.g., less than $N_{sdb}$ databases) that contain as many relevant documents as possible, which can be formally defined as:

$$U(\vec{d}, \theta^*) = \sum_i I(d_i) \sum_{j=1}^{\hat{N}_{db_i}} R(d_{ij}) \qquad (9)$$

$I(d_i)$ is the indicator function, which is 1 when the $i$-th database is selected and 0 otherwise. Plug this equation into the basic model in Equation 8 and associate the selected database number constraint to obtain the following:

$$\vec{d}^* = \arg\max_{\vec{d}} \sum_i I(d_i) \sum_{j=1}^{\hat{N}_{db_i}} R(d_{ij}) \quad \text{subject to: } \sum_i I(d_i) = N_{sdb} \qquad (10)$$

The solution of this optimization problem is very simple. We can calculate the expected number of relevant documents for each database as follows:

$$\hat{N}_{Rd_i} = \sum_{j=1}^{\hat{N}_{db_i}} R(d_{ij}) \qquad (11)$$

The $N_{sdb}$ databases with the largest expected number of relevant documents can be selected to meet the high-recall goal. We call this the UUM/HR algorithm (Unified Utility Maximization for High-Recall).

3.4 Resource Selection for High-Precision
High-Precision is the goal of the resource selection algorithm in federated search tasks such as distributed document retrieval. It is measured by the Precision at the top part of the final merged document list. This high-precision criterion is realized by the following utility function, which measures the Precision of retrieved documents from the selected databases.
$$U(\vec{d}, \theta^*) = \sum_i I(d_i) \sum_{j=1}^{d_i} R(d_{ij}) \qquad (12)$$

Note that the key difference between Equation 12 and Equation 9 is that Equation 9 sums up the probabilities of relevance of all the documents in a database, while Equation 12 only considers a much smaller part of the ranking. Specifically, we can calculate the optimal selection decision by:

$$\vec{d}^* = \arg\max_{\vec{d}} \sum_i I(d_i) \sum_{j=1}^{d_i} R(d_{ij}) \qquad (13)$$

Different kinds of constraints caused by different characteristics of the document retrieval tasks can be associated with the above optimization problem. The most common one is to select a fixed number ($N_{sdb}$) of databases and retrieve a fixed number ($N_{rdoc}$) of documents from each selected database, formally defined as:

$$\vec{d}^* = \arg\max_{\vec{d}} \sum_i I(d_i) \sum_{j=1}^{d_i} R(d_{ij}) \quad \text{subject to: } \sum_i I(d_i) = N_{sdb};\quad d_i = N_{rdoc} \text{ if } d_i \neq 0 \qquad (14)$$

This optimization problem can be solved easily by calculating the number of expected relevant documents in the top part of each database's complete list of probabilities of relevance:

$$\hat{N}_{Top\_Rd_i} = \sum_{j=1}^{N_{rdoc}} R(d_{ij}) \qquad (15)$$

Then the databases can be ranked by these values and selected. We call this the UUM/HP-FL algorithm (Unified Utility Maximization for High-Precision with Fixed Length document rankings from each selected database).

A more complex situation is to vary the number of retrieved documents from each selected database. More specifically, we allow different selected databases to return different numbers of documents. For simplification, the result list lengths are required to be multiples of a baseline number 10. (This value can also be varied, but for simplification it is set to 10 in this paper.) This restriction is set to simulate the behavior of commercial search engines on the Web. (Search engines such as Google and AltaVista return only 10 or 20 document ids for every result page.) This procedure saves the computation time of calculating the optimal database selection by allowing the step of dynamic programming to be 10 instead of 1 (more detail is discussed later). For further simplification, we restrict to select at most 100 documents from each database ($d_i \le 100$). Then, the selection optimization problem is formalized as follows:

$$\vec{d}^* = \arg\max_{\vec{d}} \sum_i I(d_i) \sum_{j=1}^{d_i} R(d_{ij}) \quad \text{subject to: } \sum_i I(d_i) = N_{sdb};\quad \sum_i d_i = N_{Total\_rdoc};\quad d_i = 10\,k,\ k \in [0, 1, 2, \ldots, 10] \qquad (16)$$

$N_{Total\_rdoc}$ is the total number of documents to be retrieved.
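The three selection rules of Sections 3.3 and 3.4 can be sketched together in Python. This is a hedged illustration, not the authors' code: `prob_lists` is a hypothetical input holding, for each database, its complete list of probabilities of relevance in decreasing order, and all function names are made up for this example. The variable-length case of Equation 16 is solved here with a dynamic program over (databases selected, blocks of 10 documents retrieved) that reuses a single table across databases; that table layout is an implementation choice of this sketch.

```python
from itertools import accumulate


def uum_hr(prob_lists, n_sdb):
    """High-recall rule: rank by expected relevant documents (Equation 11)."""
    expected = [sum(p) for p in prob_lists]
    order = sorted(range(len(prob_lists)), key=lambda i: expected[i], reverse=True)
    return order[:n_sdb]


def uum_hp_fl(prob_lists, n_sdb, n_rdoc):
    """Fixed-length high-precision rule: rank by the expected relevant
    documents among the top n_rdoc of each database (Equation 15)."""
    top = [sum(p[:n_rdoc]) for p in prob_lists]
    order = sorted(range(len(prob_lists)), key=lambda i: top[i], reverse=True)
    return order[:n_sdb]


def uum_hp_vl(prob_lists, n_sdb, n_total, step=10, max_docs=100):
    """Variable-length rule (Equation 16) by dynamic programming.

    Returns (utility, {database index: documents to retrieve}), or
    (-inf, None) when the constraints cannot be met.
    """
    budget = n_total // step      # number of step-sized blocks to hand out
    max_k = max_docs // step      # at most max_docs documents per database
    neg_inf = float("-inf")
    # best[z][y]: best (utility, allocation) with exactly z databases
    # selected and y blocks retrieved, over the databases seen so far.
    best = [[(neg_inf, None)] * (budget + 1) for _ in range(n_sdb + 1)]
    best[0][0] = (0.0, {})
    for i, probs in enumerate(prob_lists):
        prefix = [0.0] + list(accumulate(probs))  # prefix[m] = sum of top m
        # z runs downward so row z - 1 still excludes database i.
        for z in range(n_sdb, 0, -1):
            for y in range(budget, 0, -1):
                for k in range(1, min(y, max_k) + 1):
                    if k * step > len(probs):
                        break
                    prev_utility, prev_alloc = best[z - 1][y - k]
                    if prev_alloc is None:
                        continue
                    candidate = prev_utility + prefix[k * step]
                    if candidate > best[z][y][0]:
                        best[z][y] = (candidate, {**prev_alloc, i: k * step})
    return best[n_sdb][budget]
```

A small usage example shows how the goals differ. With three databases whose probability lists are `[0.9] * 10 + [0.1] * 10`, `[0.6] * 20` and `[0.2] * 20`, the high-recall rule prefers the second database (12 expected relevant documents in total), while the fixed-length high-precision rule with 10 documents per database prefers the first (9 expected relevant documents in its top 10).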
Unfortunately, there is no simple solution for the optimization problem in Equation 16 as there are for Equations 10 and 14. However, a dynamic programming algorithm can be applied to calculate the optimal solution. The basic steps of this dynamic programming method are described in Figure 2. As this algorithm allows retrieving result lists of varying lengths from each selected database, it is called the UUM/HP-VL algorithm.

Input: Complete lists of probabilities of relevance for all the |DB| databases.
Output: Optimal selection solution for Equation 16.
i) Create the three-dimensional array: Sel(1..|DB|, 1..N_Total_rdoc/10, 1..N_sdb). Each Sel(x, y, z) is associated with a selection decision d_xyz, which represents the best selection decision in the condition: only databases from number 1 to number x are considered for selection; totally y*10 documents will be retrieved; only z databases are selected out of the x database candidates. And Sel(x, y, z) is the corresponding utility value by choosing the best selection.
ii) Initialize Sel(1, 1..N_Total_rdoc/10, 1..N_sdb) with only the estimated relevance information of the 1st database.
iii) Iterate the current database candidate i from 2 to |DB|. For each entry Sel(i, y, z):
   Find k* such that: $k^* = \arg\max_k \left( Sel(i-1, y-k, z-1) + \sum_{j \le 10k} R(d_{ij}) \right)$ subject to: $1 \le k \le \min(y, 10)$.
   If $Sel(i-1, y-k^*, z-1) + \sum_{j \le 10k^*} R(d_{ij}) > Sel(i-1, y, z)$: this means that we should retrieve 10*k* documents from the i-th database; otherwise we should not select this database and the previous best solution Sel(i-1, y, z) should be kept.
   Then set the value of d_iyz and Sel(i, y, z) accordingly.
iv) The best selection solution is given by d_{|DB|, N_Total_rdoc/10, N_sdb} and the corresponding utility value is Sel(|DB|, N_Total_rdoc/10, N_sdb).

Figure 2. The dynamic programming optimization procedure for Equation 16.

After the selection decisions are made, the selected databases are searched and the corresponding document ids are retrieved from each database. The final step of document retrieval is to merge the returned results into a single ranked list with the semi-supervised learning algorithm. It was pointed out before that the SSL algorithm maps the database-specific scores into the centralized document scores and builds the final ranked list accordingly, which is consistent with all our selection procedures where documents with higher probabilities of relevance (thus higher centralized document scores) are selected.

4. EXPERIMENTAL METHODOLOGY
4.1 Testbeds
It is desirable to evaluate distributed information retrieval algorithms with testbeds that closely simulate the real world applications.

The TREC Web collections WT2g or WT10g [4,3] provide a way to partition documents by different Web servers. In this way, a large number (O(1000)) of databases with rather diverse contents could be created, which may make this testbed a good candidate to simulate the operational environments such as open domain hidden Web. However, two weaknesses of this testbed are: i) Each database contains only a small amount of documents (259 documents on average for WT2g [4]); and ii) The contents of WT2g or WT10g are arbitrarily crawled from the Web. It is not likely for a hidden Web database to provide personal homepages or web pages indicating that the pages are under construction and there is no useful information at all. These types of web pages are contained in the WT2g/WT10g datasets. Therefore, the noisy Web data is not similar to that of high-quality hidden Web database contents, which are usually organized by domain experts.

Another choice is the TREC news/government data [,5,7,8,2]. TREC news/government data is concentrated on relatively narrow topics. Compared with TREC Web data: i) The news/government documents are much more similar to the contents provided by a topic-oriented database than an arbitrary web page; ii) A database in this testbed is larger than that of TREC Web data. On average a database contains thousands of documents, which is more realistic than a database of TREC Web data with about 250 documents.

Table 1: Testbed statistics.
Testbed | Size (GB) | Number of documents (Min, Avg, Max) | Size (MB) (Min, Avg, Max)
Trec123 | | |

Table 2: Query set statistics.
Name | TREC Topic Set | TREC Topic Field | Average Length (Words)
Trec123 | | Title | 3.
As the contents and sizes of the databases in the TREC news/government testbed are more similar to that of a topic-oriented database, it is a good candidate to simulate the distributed information retrieval environments of large organizations (companies) or domain-specific hidden Web sites, such as West that provides access to legal, financial and news text databases [3]. As most current distributed information retrieval systems are developed for the environments of large organizations (companies) or domain-specific hidden Web other than open domain hidden Web, the TREC news/government testbed was chosen in this work. The Trec123-100col-bysource testbed is one of the most used TREC news/government testbeds [,5,7,2]. It was chosen in this work. Three testbeds in [2] with skewed database size distributions and different types of relevant document distributions were also used to give a more thorough simulation for real environments.

Trec123-100col-bysource: 100 databases were created from TREC CDs 1, 2 and 3. They were organized by source and publication date []. The sizes of the databases are not skewed. Details are in Table 1.

Three testbeds built in [2] were based on the trec123-100col-bysource testbed. Each testbed contains many "small" databases and two large databases created by merging about 10-20 small databases together.

Trec123-2ldb-60col ("representative"): The databases in the trec123-100col-bysource were sorted in alphabetical order. Two large databases were created by merging 20 small databases with the round-robin method. Thus, the two large databases have more relevant documents due to their large sizes, even though the densities of relevant documents are roughly the same as the small databases.

Trec123-AP-WSJ-60col ("relevant"): The 24 Associated Press collections and the 16 Wall Street Journal collections in the trec123-100col-bysource testbed were collapsed into two large databases APall and WSJall. The other 60 collections were left unchanged. The APall and WSJall databases have higher densities of documents relevant to TREC queries than the small databases. Thus, the two large databases have many more relevant documents than the small databases.

Trec123-FR-DOE-81col ("nonrelevant"): The 13 Federal Register collections and the 6 Department of Energy collections in the trec123-100col-bysource testbed were collapsed into two large databases FRall and DOEall. The other 80 collections were left unchanged. The FRall and DOEall databases have lower densities of documents relevant to TREC queries than the small databases, even though they are much larger.

100 queries were created from the title fields of TREC topics. The queries 101-150 were used as training queries and the queries 51-100 were used as test queries (details in Table 2).

4.2 Search Engines
In the uncooperative distributed information retrieval environments of large organizations (companies) or domain-specific hidden Web, different databases may use different types of search engine. To simulate the multiple type-engine environment, three different types of search engines were used in the experiments: INQUERY [2], a unigram statistical language model with linear smoothing [2,20] and a TFIDF retrieval algorithm with "ltc" weight [2,20]. All these algorithms were implemented with the Lemur toolkit [2]. These three kinds of search engines were assigned to the databases among the four testbeds in a round-robin manner.

5. RESULTS: RESOURCE SELECTION OF DATABASE RECOMMENDATION
All four testbeds described in Section 4 were used in the experiments to evaluate the resource selection effectiveness of the database recommendation system.

The resource descriptions were created using query-based sampling. About 80 queries were sent to each database to download 300 unique documents. The database size statistics were estimated by the sample-resample method [2]. Fifty queries (101-150) were used as training queries to build the relevant logistic model and to fit the exponential functions of the centralized document score curves for large ratio databases (details in Section 3.1). Another 50 queries (51-100) were used as test data.

Resource selection algorithms of database recommendation systems are typically compared using the recall metric $R_n$ [,7,8,2]. Let B denote a baseline ranking, which is often the RBR (relevance based ranking), and E a ranking provided by a resource selection algorithm. And let $B_i$ and $E_i$ denote the number of relevant documents in the $i$-th ranked database of B or E. Then $R_n$ is defined as follows:

$$R_n = \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n} B_i} \qquad (17)$$

Usually the goal is to search only a few databases, so our figures only show results for selecting up to 20 databases.

Figure 3. Resource selection experiments on the four testbeds. (Panels: Trec123-100Col testbed, Representative testbed, Relevant testbed, Nonrelevant testbed; horizontal axis: collections selected.)

The experiments summarized in Figure 3 compare the effectiveness of the three resource selection algorithms, namely the CORI, ReDDE and UUM/HR. The UUM/HR algorithm is described in Section 3.3. It can be seen from Figure 3 that the ReDDE and UUM/HR algorithms are more effective than (on the representative, relevant and nonrelevant testbeds) or as good as (on the Trec123-100Col testbed) the CORI resource selection algorithm. The UUM/HR algorithm is more effective than the ReDDE algorithm on the representative and relevant testbeds and is about the same as the ReDDE algorithm on the Trec123-100Col and the nonrelevant testbeds. This suggests that the UUM/HR algorithm is more robust than the ReDDE algorithm.
It an be note that when seletng only a few atabases on the Tre23-00Col or the nonrelevant testbes, the ReDEE algorthm has a small avantage over the UUM/HR algorthm. We attrbute ths to two auses: The ReDDE algorthm was tune on the Tre23-00Col testbe; an Although the fferene s small, ths may suggest that our logst moel of estmatng probabltes of relevane s not aurate enough. More tranng ata or a more sophstate moel may help to solve ths mnor puzzle.
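The recall metric R_n used in this section can be illustrated with a short sketch. The function name `recall_at_n`, the database names, and the per-database counts of relevant documents below are hypothetical and chosen only for illustration; they are not data from the experiments.

```python
from typing import Sequence


def recall_at_n(e_counts: Sequence[int], b_counts: Sequence[int], n: int) -> float:
    """R_n: relevant documents in the top n databases of ranking E,
    divided by those in the top n databases of the baseline ranking B."""
    if n < 1:
        raise ValueError("n must be at least 1")
    baseline = sum(b_counts[:n])
    if baseline == 0:
        raise ValueError("baseline ranking has no relevant documents in its top n")
    return sum(e_counts[:n]) / baseline


# Hypothetical counts of relevant documents per database (illustration only).
relevant = {"db_a": 40, "db_b": 25, "db_c": 10, "db_d": 5}

# Relevance based ranking (RBR): databases sorted by number of relevant documents.
rbr = sorted(relevant, key=lambda name: relevant[name], reverse=True)
# A hypothetical ranking produced by some resource selection algorithm.
estimated = ["db_b", "db_a", "db_d", "db_c"]

b_counts = [relevant[name] for name in rbr]
e_counts = [relevant[name] for name in estimated]

# R_n for n = 1..4 under the hypothetical counts above.
r_values = [recall_at_n(e_counts, b_counts, n) for n in range(1, 5)]
```

With RBR as the baseline, R_n cannot exceed 1, and it equals 1 whenever the top n databases of E hold as many relevant documents as the best possible n databases.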

Table 3. Precision on the trec123-100col-bysource testbed when 3 databases were selected. (The first baseline is CORI; the second baseline for UUM/HP methods is UUM/HR.)

Precision at Doc Rank   CORI   ReDDE      UUM/HR     UUM/HP-FL            UUM/HP-VL
5 docs                         (-4.4%)    (+8.8%)    (+28.6%)(+8.%)       (+27.5%)(+7.2%)
10 docs                        (-4.8%)    (+4.8%)    (+26.2%)(+20.5%)     (+25.6%)(+9.9%)
15 docs                        (-2.0%)    (+2.9%)    (+22.2%)(+5.7%)      (+20.5%)(+7.%)
20 docs                        (-5.%)     (+4.%)     (+8.5%)(+3.8%)       (+7.8%)(+3.2%)
30 docs                        (-4.3%)    (+6.9%)    (+22.8%)(+4.8%)      (+22.3%)(+4.4%)

Table 4. Precision on the trec123-100col-bysource testbed when 5 databases were selected. (The first baseline is CORI; the second baseline for UUM/HP methods is UUM/HR.)

Precision at Doc Rank   CORI   ReDDE      UUM/HR     UUM/HP-FL            UUM/HP-VL
5 docs                         (-2.0%)    (+7.0%)    (+7.0%)(+9.4%)       (+5.0%)(+7.5%)
10 docs                        (-.%)      (+0.0%)    (+0.0%)(+0.0%)       (+3.7%)(+3.7%)
15 docs                        (+0.0%)    (+4.5%)    (+0.%)(+5.4%)        (+4.6%)(+9.7%)
20 docs                        (-.2%)     (+3.5%)    (+8.2%)(+4.5%)       (+.7%)(+7.9%)
30 docs                        (-3.%)     (+2.3%)    (+8.0%)(+5.6%)       (+7.6%)(+5.3%)

Table 5. Precision on the representative testbed when 3 databases were selected. (The first baseline is CORI; the second baseline for UUM/HP methods is UUM/HR.)

Precision at Doc Rank   CORI   ReDDE      UUM/HR     UUM/HP-FL            UUM/HP-VL
5 docs                         (+9.7%)    (+24.7%)   (+23.7%)(-0.9%)      (+34.4%)(+7.8%)
10 docs                        (+9.4%)    (+35.3%)   (+33.5%)(-.3%)       (+36.5%)(+0.9%)
15 docs                        (+24.4%)   (+38.5%)   (+35.9%)(-.9%)       (+4.4%)( )
20 docs                        (+25.0%)   (+36.0%)   (+34.7%)(-.0%)       (+4.3%)(+4.0%)
30 docs                        (+35.8%)   (+5.9%)    (+47.9%)(-2.6%)      (+53.5%)(+.0%)

Table 6. Precision on the representative testbed when 5 databases were selected. (The first baseline is CORI; the second baseline for UUM/HP methods is UUM/HR.)

Precision at Doc Rank   CORI   ReDDE      UUM/HR     UUM/HP-FL            UUM/HP-VL
5 docs                         (+3.0%)    (+5.2%)    (+8.%)(-6.%)         (+4.%)(-0.9%)
10 docs                        (+4.6%)    (+0.3%)    (+5.0%)(+4.2%)       (+7.5%)(+6.5%)
15 docs                        (+2.9%)    (+9.6%)    (+25.7%)(+5.0%)      (+26.0%)(+5.4%)
20 docs                        (+8.9%)    (+24.3%)   (+28.8%)(+3.6%)      (+30.6%)(+5.%)
30 docs                        (+26.%)    (+35.3%)   (+34.4%)(-0.7%)      (+36.8%)(+.2%)

6. RESULTS: DOCUMENT RETRIEVAL EFFECTIVENESS

For document retrieval, the selected databases are searched and the returned results are merged into a single final list.
In all of the experiments discussed in this section the results retrieved from individual databases were combined by the semi-supervised learning (SSL) results merging algorithm. This version of the SSL algorithm [22] is allowed to download a small number of returned document texts "on the fly" to create additional training data in the process of learning the linear models which map database-specific document scores into estimated centralized document scores. It has been shown to be very effective in environments where only short result-lists are retrieved from each selected database [22]. This is a common scenario in operational environments and was the case for our experiments.

Document retrieval effectiveness was measured by Precision at the top part of the final document list. The experiments in this section were conducted to study the document retrieval effectiveness of five selection algorithms, namely the CORI, ReDDE, UUM/HR, UUM/HP-FL and UUM/HP-VL algorithms. The last three algorithms were proposed in Section 3. All the first four algorithms selected 3 or 5 databases, and 50 documents were retrieved from each selected database. The UUM/HP-VL algorithm also selected 3 or 5 databases, but it was allowed to adjust the number of documents to retrieve from each selected database; the number retrieved was constrained to be from 10 to 100, and a multiple of 10.

The Trec123-100Col and representative testbeds were selected for document retrieval as they represent two extreme cases of resource selection effectiveness; in one case the CORI algorithm is as good as the other algorithms and in the other case it is quite a lot worse than the other algorithms. Tables 3 and 4 show the results on the Trec123-100Col testbed, and Tables 5 and 6 show the results on the representative testbed.

On the Trec123-100Col testbed, the document retrieval effectiveness of the CORI selection algorithm is roughly the same or a little bit better than the ReDDE algorithm but both of them are worse than the other three algorithms (Tables 3 and 4). The UUM/HR algorithm has a small advantage over the CORI and ReDDE algorithms. One main difference between the UUM/HR algorithm and the ReDDE algorithm was pointed out before: The UUM/HR uses training data and linear interpolation to estimate the centralized document score curves, while the ReDDE algorithm [21] uses a heuristic method, assumes the centralized document score curves are step functions and makes no distinction among the top part of the curves. This difference makes UUM/HR better than the ReDDE algorithm at distinguishing documents with high probabilities of relevance from low probabilities of relevance. Therefore, the UUM/HR reflects the high-precision retrieval goal better than the ReDDE algorithm and thus is more effective for document retrieval. The UUM/HR algorithm does not explicitly optimize the selection decision with respect to the high-precision goal as the UUM/HP-FL and UUM/HP-VL algorithms are designed to do. It can be seen that on this testbed, the UUM/HP-FL and UUM/HP-VL algorithms are much more effective than all the other algorithms. This indicates that their power comes from explicitly optimizing the high-precision goal of document retrieval in Equations 4 and 6.

Table 7. Precision on the trec123-100col-bysource testbed when 3 databases were selected. (The first baseline is CORI; the second baseline for UUM/HP methods is UUM/HR.) (Search engines do not return document scores.)

Precision at Doc Rank   CORI   ReDDE      UUM/HR     UUM/HP-FL            UUM/HP-VL
5 docs                         (-8.0%)    (+4.6%)    (+28.4%)(+22.8%)     (+28.4%)( )
10 docs                        (-5.4%)    (+0.6%)    (+24.%)(+23.4%)      (+2.%)(+20.4%)
15 docs                        (-7.4%)    (+.6%)     (+2.5%)(+9.5%)       (+5.7%)(+3.8%)
20 docs                        (-5.6%)    (+3.3%)    (+2.2%)(+7.3%)       (+8.5%)(+4.7%)
30 docs                        (-3.2%)    (+6.3%)    (+20.0%)(+2.9%)      (+20.0%)(+2.9%)
On the representative testbed, CORI is much less effective than other algorithms for distributed document retrieval (Tables 5 and 6). The document retrieval results of the ReDDE algorithm are better than those of the CORI algorithm but still worse than the results of the UUM/HR algorithm. On this testbed the three UUM algorithms are about equally effective. Detailed analysis shows that the overlap of the selected databases between the UUM/HR, UUM/HP-FL and UUM/HP-VL algorithms is much larger than in the experiments on the Trec123-100Col testbed, since all of them tend to select the two large databases. This explains why they are about equally effective for document retrieval.

In real operational environments, databases may return no document scores and report only ranked lists of results. As the unified utility maximization model only utilizes retrieval scores of sampled documents with a centralized retrieval algorithm to calculate the probabilities of relevance, it makes database selection decisions without referring to the document scores from individual databases and can be easily generalized to this case of rank lists without document scores. The only adjustment is that the SSL algorithm merges ranked lists without document scores by assigning the documents pseudo-document scores normalized for their ranks (in a ranked list of 50 documents, the first one has a score of 1, the second has a score of 0.98, etc.), which has been studied in [22]. The experiment results on the trec123-100col-bysource testbed with 3 selected databases are shown in Table 7. The experiment setting was the same as before except that the document scores were eliminated intentionally and the selected databases only return ranked lists of document ids. It can be seen from the results that the UUM/HP-FL and UUM/HP-VL work well with databases returning no document scores and are still more effective than other alternatives. Other experiments with databases that return no document scores are not reported but they show similar results to prove the effectiveness of the UUM/HP-FL and UUM/HP-VL algorithms.

The above experiments suggest that it is very important to optimize the high-precision goal explicitly in document retrieval.
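The rank-based pseudo-document scores mentioned in this section (1 for the first of 50 documents, 0.98 for the second) are consistent with a simple linear normalization. The sketch below assumes the form score = 1 - (rank - 1) / length; the text does not state the exact formula used by the SSL variant in [22], so this linear form and the function name `pseudo_scores` are assumptions made for illustration.

```python
from typing import List


def pseudo_scores(list_length: int) -> List[float]:
    """Assign rank-normalized pseudo scores to a ranked list.

    Assumed linear form: the document at 1-based rank r in a list of
    length L gets 1 - (r - 1) / L, so for L = 50 the first two
    documents get 1.0 and 0.98.
    """
    if list_length < 1:
        raise ValueError("list_length must be at least 1")
    return [1.0 - (rank - 1) / list_length for rank in range(1, list_length + 1)]


# Pseudo scores for a ranked list of 50 documents returned without scores.
scores = pseudo_scores(50)
top_two = [round(s, 2) for s in scores[:2]]
```

Scores produced this way depend only on rank and list length, so they can stand in for database-specific document scores when a search engine reports only a ranked list.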
The new algorithms based on this principle achieve better or at least as good results as the prior state-of-the-art algorithms in several environments.

7. CONCLUSION

Distributed information retrieval solves the problem of finding information that is scattered among many text databases on local area networks and Internets. Most previous research used an effective resource selection algorithm from database recommendation systems for the distributed document retrieval application. We argue that the high-recall resource selection goal of database recommendation and the high-precision goal of document retrieval are related but not identical. This kind of inconsistency has also been observed in previous work, but the prior solutions either used heuristic methods or assumed cooperation by individual databases (e.g., all the databases used the same kind of search engines), which is frequently not true in the uncooperative environment. In this work we propose a unified utility maximization model to integrate the resource selection of database recommendation and document retrieval tasks into a single unified framework. In this framework, the selection decisions are obtained by optimizing different objective functions. As far as we know, this is the first work that tries to view and theoretically model the distributed information retrieval task in an integrated manner.

The new framework continues a recent research trend studying the use of query-based sampling and a centralized sample database. A single logistic model was trained on the centralized sample database to estimate the probabilities of relevance of documents by their centralized retrieval scores, while the centralized sample database serves as a bridge to connect the individual databases with the centralized logistic model. Therefore, the probabilities of relevance for all the documents across the databases can be estimated with a very small amount of human relevance judgment, which is much more efficient than previous methods that build a separate model for each database. This framework is not only more theoretically solid but also very effective. One algorithm for resource selection (UUM/HR) and two algorithms for document retrieval (UUM/HP-FL and UUM/HP-VL) are derived from this framework. Empirical studies have been conducted on testbeds to simulate the distributed search solutions of large organizations (companies) or domain-specific hidden Web. Furthermore, the UUM/HP-FL and UUM/HP-VL resource selection algorithms are extended with a variant of the SSL results merging algorithm to address the distributed document retrieval task when selected databases do not return document scores. Experiments have shown that these algorithms achieve results that are at least as good as the prior state-of-the-art, and sometimes considerably better. Detailed analysis indicates that the advantage of these algorithms comes from explicitly optimizing the goals of the specific tasks.

The unified utility maximization framework is open for different extensions. When cost is associated with searching the online databases, the utility framework can be adjusted to automatically estimate the best number of databases to search so that a large amount of relevant documents can be retrieved with relatively small costs. Another extension of the framework is to consider the retrieval effectiveness of the online databases, which is an important issue in the operational environments. All of these are the directions of future research.

ACKNOWLEDGEMENT

This research was supported by NSF grants EIA and IIS. Any opinions, findings, conclusions, or recommendations expressed in this paper are the authors', and do not necessarily reflect those of the sponsor.

REFERENCES

[1] J. Callan. (2000). Distributed information retrieval. In W.B.
Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers.
[2] J. Callan, W.B. Croft, and J. Broglio. (1995). TREC and TIPSTER experiments with INQUERY. Information Processing and Management, 31(3).
[3] J. G. Conrad, X. S. Guo, P. Jackson and M. Meziou. (2002). Database selection using actual physical and acquired logical collection resources in a massive domain-specific operational environment. In Proceedings of the 28th International Conference on Very Large Databases (VLDB).
[4] N. Craswell. (2000). Methods for distributed information retrieval. Ph.D. thesis, The Australian National University.
[5] N. Craswell, D. Hawking, and P. Thistlewaite. (1999). Merging results from isolated search engines. In Proceedings of the 10th Australasian Database Conference.
[6] D. D'Souza, J. Thom, and J. Zobel. (2000). A comparison of techniques for selecting text collections. In Proceedings of the 11th Australasian Database Conference.
[7] N. Fuhr. (1999). A Decision-Theoretic approach to database selection in networked IR. ACM Transactions on Information Systems, 17(3).
[8] L. Gravano, C. Chang, H. Garcia-Molina, and A. Paepcke. (1997). STARTS: Stanford proposal for internet metasearching. In Proceedings of the 20th ACM-SIGMOD International Conference on Management of Data.
[9] L. Gravano, P. Ipeirotis and M. Sahami. (2003). QProber: A System for Automatic Classification of Hidden-Web Databases. ACM Transactions on Information Systems, 21(1).
[10] P. Ipeirotis and L. Gravano. (2002). Distributed search over the hidden web: Hierarchical database sampling and selection. In Proceedings of the 28th International Conference on Very Large Databases (VLDB).
[11] InvisibleWeb.com.
[12] The Lemur toolkit.
[13] J. Lu and J. Callan. (2003). Content-based information retrieval in peer-to-peer networks. In Proceedings of the 12th International Conference on Information and Knowledge Management.
[14] W. Meng, C.T. Yu and K.L. Liu. (2002). Building efficient and effective metasearch engines. ACM Comput. Surv. 34(1).
[15] H. Nottelmann and N. Fuhr. (2003). Evaluating different methods of estimating retrieval quality for resource selection. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[16] H. Nottelmann and N. Fuhr. (2003). The MIND architecture for heterogeneous multimedia federated digital libraries. ACM SIGIR 2003 Workshop on Distributed Information Retrieval.
[17] A.L. Powell, J.C. French, J. Callan, M. Connell, and C.L. Viles. (2000). The impact of database selection on distributed searching. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[18] A.L. Powell and J.C. French. (2003). Comparing the performance of database selection algorithms. ACM Transactions on Information Systems, 21(4).
[19] C. Sherman. (2001). Search for the invisible web. Guardian Unlimited.
[20] L. Si and J. Callan. (2002). Using sampled data and regression to merge search engine results. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[21] L. Si and J. Callan. (2003). Relevant document distribution estimation method for resource selection. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[22] L. Si and J. Callan. (2003). A Semi-Supervised learning method to merge search engine results. ACM Transactions on Information Systems, 21(4).
