DPro: A Probabilistic Approach for Hidden Web Database Selection Using Dynamic Probing

DPro: A Probabilistic Approach for Hidden Web Database Selection Using Dynamic Probing

Victor Z. Liu, Richard C. Luo, Junghoo Cho, Wesley W. Chu
UCLA Computer Science Department, Los Angeles, CA
{vicliu, lc, cho, wwc}@cs.ucla.edu

Abstract

An ever increasing amount of valuable information is stored in Web databases, hidden behind search interfaces. To save the user's effort in manually exploring each database, metasearchers automatically select the most relevant databases to a user's query [, 5, 6,, 6]. Existing methods use a pre-collected summary of each database to estimate its relevancy to the query, and return the databases with the highest estimation. While this is a great starting point, the existing methods suffer from two drawbacks. First, because the estimation can be inaccurate, the returned databases are often wrong. Second, the system does not try to improve the quality of its answer by contacting some databases on-the-fly (to collect more information about the databases and select databases more accurately), even if the user is willing to wait for some time to obtain a better answer. In this paper, we introduce the notion of dynamic probing and study its effectiveness under a probabilistic framework: under our framework, a user can specify how correct the selected databases should be, and our system automatically contacts a few databases to satisfy the user-specified correctness. Our experiments on 20 real hidden Web databases indicate that our approach significantly improves the correctness of the returned databases at the cost of a small number of database probings.

1 Introduction

An ever increasing amount of information on the Web is available through search interfaces. This information is often called the Hidden Web or Deep Web [1] because traditional search engines cannot index it using existing technologies [10, 1]. Since the majority of Web users rely on traditional search engines to discover and access information on the Web, the Hidden Web is practically inaccessible to most users and hidden from them. Even if users are aware of a certain part of the Hidden Web, they need to go through the painful process of issuing queries to all potentially relevant Hidden Web databases and investigating the results manually. On the other hand, the information in the Hidden Web is estimated to be significantly larger and of higher quality than the Surface Web indexed by search engines [1].

(Footnote 1: We call a collection of documents accessible through a Web search interface a Hidden-Web database. PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi) is one example.)

keyword | document frequency in db1 | document frequency in db2
cancer  | 10,000 | 5,000
kidney  | 7,000  | 10,000
breast  | 2,000  | 3,500
liver   | 00     | 4,000

Figure 1: (Keyword, document frequency) table. The document frequency of a keyword in db_i is the number of documents in db_i that use the keyword.

In order to assist users in accessing the information in the Hidden Web, recent efforts have focused on building a metasearcher or a mediator that automatically selects the most relevant databases to a user's query [, 5, 4, 5, 6, 8,, 4, 5, 6]. In this framework, the metasearcher maintains a summary or statistics on each database, and consults the summary to estimate the relevancy of each database to a query. For example, Gravano et al. [14, 16] maintain (keyword, document frequency) pairs to estimate the databases with the largest number of matching documents. We illustrate the basic idea of the existing approaches using the following example.

Example 1. A metasearcher mediates two Hidden-Web databases, db1 and db2. Given a user's query q, the goal of the metasearcher is to return the database with the largest number of matching documents. The metasearcher maintains the (keyword, document frequency) table shown in Figure 1.
For example, the first row shows that 10,000 documents in db1 contain the word cancer while 5,000 documents in db2 contain cancer. We assume that each of db1 and db2 has a total of 20,000 documents. Given a user query breast cancer, the metasearcher may select the database with more matching documents in the following way: From the summary we know that 2,000/20,000 of the documents in db1 contain the word breast and 10,000/20,000 of them contain the word cancer. Then, assuming that the words breast and cancer are independently distributed, db1 will have 20,000 × (2,000/20,000) × (10,000/20,000) = 1,000 documents with both the words breast and cancer. Similarly, db2 will have 20,000 × (3,500/20,000) × (5,000/20,000) = 875 matching documents. Based on this estimation, the metasearcher returns db1 to the user. Reference [18] explains in detail how we may construct this table from hidden Web databases.
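To make the arithmetic in Example 1 concrete, the following Python sketch applies the same independence assumption to a (keyword, document frequency) summary. The summary values mirror Figure 1; the function name and data layout are illustrative assumptions, not part of any released system.

# A minimal sketch of the independence estimator used in Example 1:
# the estimated number of matching documents is |db| * prod(df(t)/|db|)
# over the query terms t. The summary below mirrors Figure 1.

summary = {
    "db1": {"size": 20000, "df": {"cancer": 10000, "kidney": 7000, "breast": 2000}},
    "db2": {"size": 20000, "df": {"cancer": 5000, "kidney": 10000, "breast": 3500}},
}

def estimate_relevancy(db, query_terms):
    """Estimated document frequency of the conjunctive query, assuming
    the query terms are independently distributed in the database."""
    size = summary[db]["size"]
    estimate = size
    for term in query_terms:
        estimate *= summary[db]["df"].get(term, 0) / size
    return estimate

if __name__ == "__main__":
    for db in ("db1", "db2"):
        print(db, estimate_relevancy(db, ["breast", "cancer"]))
    # db1 -> 1000.0, db2 -> 875.0, so db1 is returned, as in Example 1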

In this paper, we improve upon this existing framework by introducing the concepts of probabilistic correctness and dynamic probing for Hidden Web database selection. One of the main weaknesses of the existing method is that the selected databases are often not the most relevant to the user's query, because the relevancy of a database is estimated based on a pre-collected summary. For instance, in the above example, the words breast and cancer may not be independently distributed, and db2 may actually contain more matching documents than db1. Given a wrong answer, the user ends up wasting a significant amount of time on the irrelevant databases. A recent study shows that investigating irrelevant Web pages is a major cause of users' wasted time on the Web [3].

One way of addressing this weakness is to issue the user's query q to all the databases that the metasearcher mediates, and select the best ones based on the actual result returned by each database. For instance, the metasearcher may issue the query breast cancer to both db1 and db2 in the above example, obtain the number of matching documents reported by them, and select the one with more matches. While this approach can improve the correctness of database selection, its huge network and time overhead makes it impractical when metasearchers mediate a large number of Hidden-Web databases (often thousands of them [1]).

In this paper, we develop a probabilistic approach to use dynamic probing (issuing the user query to the databases on the fly) in a systematic way, so that the correctness of database selection is significantly improved while the metasearcher contacts the minimum number of databases. In our approach, the user can specify the desired correctness of database selection (e.g., "more than 9 out of the 10 selected databases should be the actual top 10 databases"), and the metasearcher decides how many and which databases to contact based on the user's specification. Informally, we may consider the user-specified correctness as a knob: When the user does not care about the answer's correctness, our approach becomes identical to the existing ones (no dynamic probing). As the user desires higher correctness, our approach will contact more databases. Our experimental results reveal that dynamic probing often returns the best databases with a small number of probings.

Dynamic probing of Web databases introduces many interesting challenges. For example, how can we guarantee a certain level of correctness? How can we maximize the correctness with the minimal number of dynamic probings? Which databases should we probe? This paper studies these problems using a probabilistic approach. Our solution is based on the following observations: Although the actual relevancy of a Web database may deviate from an initial inaccurate estimation, the way it deviates follows a probabilistic distribution that can be observed. Such a distribution usually centers around the estimated value. If we roughly know this actual relevancy distribution for each database, then we can guess how likely we have selected the actual top-k databases using these distributions. Furthermore, by probing a few databases, we can obtain their actual relevancy values and select the top-k databases with higher confidence. Our task of dynamic probing thus becomes using the minimum number of probings to accomplish the user-specified correctness level. We will formalize these notions, e.g. the probabilistic distribution and the correctness of an answer, in Section 2.

Overall, we believe our paper makes the following contributions:

1. A probabilistic model for relevancy estimation: With the probabilistic model, we can quantify the correctness of database selection. (Section 2)
2. Using dynamic probing to increase the correctness of database selection: We keep on probing till the certainty exceeds a user-specified level. (Section 3)
3. Probing strategies: Our optimal strategy uses the minimum number of database probings to reach the required level of certainty. (Section 3.1) We also present a greedy strategy that can identify top-k databases at reasonable computational complexity. (Section 3.2)
4. Experimental validation: We validate our algorithms using real Hidden Web databases, under various experimental settings. (Section 5) The results reveal that dynamic probing significantly improves the correctness of database selection with a reasonably small number of probings. For example, with a single probing, we can improve the correctness of an answer by 70% in certain cases.

2 A Probabilistic Approach for Dynamic Probing

To select the most relevant databases for a query and make our selection as correct as possible, we need to fully understand the relevancy of a database to a query, and the correctness of a set of selected databases. In this section, we first define the relevancy metric of a database. We then introduce the notion of expected correctness for a top-k answer set. Finally, we explain the cost model for dynamic probing.

2.1 Database relevancy and probing

Relevancy of a database. Intuitively, we consider a database relevant to a query if the database contains enough documents pertinent to the query topic. The following are two possible definitions that reflect this notion of relevancy.

Document-frequency-based: A database is considered the most relevant if it contains the highest number of matching documents [14, 16]. This number of matching documents is referred to as the document frequency of the query in the database.

Document-similarity-based: A database is considered the most relevant if it contains the most similar document(s) to the query [5,, 5]. Query-document similarity is often computed using the standard Cosine function [].

Relevancy estimation. A metasearcher has to estimate the approximate relevancy of a database to a query using a pre-collected summary. Note that this estimate may or may not be the same as the actual relevancy of the database. We refer to the estimated relevancy of a database db to a query q as r̂(db, q). To make our later discussion concrete, we now briefly illustrate how we may estimate the relevancy of a database under the document-frequency-based metric [14]. Note, however, that our framework is independent of the particular relevancy metric and the estimator used by a metasearcher. Our approach can be used for any metric and estimator combination. In [14, 16], Gravano et al. compute r̂(db, q) by assuming that the query terms q = {t_1, ..., t_m} are independently distributed in db. Using their independence estimator, r̂(db, q) can be computed as follows:

r̂(db, q) = |db| · ∏_{t_i ∈ q} ( r̂(db, t_i) / |db| )    (1)

where |db| is the size of db and r̂(db, t_i) is the number of documents in db that use t_i. Note that Eq. (1) assumes that r̂(db, t_i) is available to the metasearcher for every term t_i and every database db. In practice, however, a hidden Web database seldom exports such an exhaustive content summary to the metasearcher. Reference [18] proposes an approximation method to guess the r̂(db, t_i) values for all query terms.

Database probing. We define probing a database as the operation of issuing a particular query to the database and gathering the necessary information to evaluate its exact relevancy to the query. Depending on the relevancy metric, we need to collect different information during probing. For example, under the document-frequency-based metric, we need to collect the number of matching documents from the probed database, while under the document-similarity-based metric, we need to collect the similarity value of the most similar document(s) in the probed database. For most existing Hidden Web databases, we note that it is possible to get their exact relevancy through simple operations. For instance, many Hidden Web databases report the number of matching documents in their answer page to a query, so we can easily compute their exact document-frequency-based relevancy. Also, under the document-similarity-based metric, we may download the first document that a Hidden Web database returns, and then analyze its content to compute its cosine similarity. In the remainder of this paper, we refer to the exact relevancy of a database db to a query q as r(db, q). Thus, after probing db, its estimated relevancy r̂(db, q) becomes r(db, q).

2.2 Correctness metric for the top-k databases

Our goal is to find the k databases that are most relevant to a query. We represent this set of correct top-k answers as DB_k^top. We refer to the set of k databases selected by a particular selection algorithm as DB_k. We may define the correctness of DB_k compared to DB_k^top in one of the following ways.

Absolute correctness: We consider DB_k correct only when it contains all of DB_k^top.

Definition 1. The absolute correctness of DB_k compared to DB_k^top is
Cor_a(DB_k) = 1 if DB_k = DB_k^top, and 0 otherwise.

Partial correctness: We give partial credit to DB_k if it contains some of DB_k^top.

Definition 2. The partial correctness of DB_k compared to DB_k^top is
Cor_p(DB_k) = |DB_k ∩ DB_k^top| / k

Under this definition, the correctness value of a top-5 answer set is 0.4 if it contains 2 of the actual top 5 databases. We study both of these metrics in this paper. For the reader's convenience, we summarize the notation that we have introduced in Figure 2. Some of the symbols will be discussed later.

Symbol: Meaning
DB: {db_1, ..., db_n}, the total set of databases
q: The user's query
k: The number of top databases asked for by the user
r(db, q): The actual relevancy of db for q
r̂(db, q): The estimated relevancy of db for q
DB_k: A set of k databases selected by a particular algorithm, DB_k ⊆ DB
DB_k^top: The set of correct top-k databases
Cor_a(DB_k): Absolute correctness metric for DB_k
Cor_p(DB_k): Partial correctness metric for DB_k
DB_P: The set of databases that have already been probed
DB_U: The set of databases that have not been probed, i.e. DB − DB_P
PRD: Probabilistic Relevancy Distribution
P(r(db, q) ≤ α | r̂(db, q) = β): The probability of r(db, q) being lower than α, given the estimation r̂(db, q) = β. This probability is given by the PRD. α and β are specific values
E[Cor(DB_k)]: The expected correctness of DB_k, where Cor can be Cor_a or Cor_p
t: The user-specified threshold for the answer's expected correctness
c: The cost of probing a database
ECost(DB_U): The expected probing cost on the set of unprobed databases, DB_U
err(r̂, r): The error function computing the difference between r̂(db, q) and r(db, q)

Figure 2: Notation used throughout the paper.
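As a quick illustration of Definitions 1 and 2, the short Python sketch below computes both correctness values for a hypothetical selected set against a hypothetical true top-k set; the sets are made up for illustration only.

# Sketch of the two correctness metrics from Definitions 1 and 2.

def cor_a(selected, top_k):
    """Absolute correctness: 1 only if the selected set equals the true top-k."""
    return 1.0 if set(selected) == set(top_k) else 0.0

def cor_p(selected, top_k):
    """Partial correctness: fraction of the true top-k that was selected."""
    return len(set(selected) & set(top_k)) / len(top_k)

# Hypothetical example: 2 of the actual top 5 were selected.
selected = ["db1", "db2", "db3", "db4", "db5"]
true_top = ["db1", "db2", "db6", "db7", "db8"]
print(cor_a(selected, true_top))  # 0.0
print(cor_p(selected, true_top))  # 0.4, matching the example after Definition 2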
2.3 Probabilistic distribution and expected correctness

While we may estimate the relevancy of a database db to a query q, r̂(db, q), using existing estimators, we do not know the exact r(db, q) value until we actually probe db. Therefore, we may model r(db, q) to follow a probabilistic distribution that (hopefully) centers around the r̂(db, q) value. We refer to this distribution as a Probabilistic Relevancy Distribution, or PRD. In Figure 3(a), we show example PRDs for four databases, db1, ..., db4. The horizontal axis in the figure represents the actual relevancy value of a database and the vertical axis shows the probability density that the actual relevancy is at the given value. For instance, for db3, the estimated relevancy is 0.5, and the actual relevancy lies in a range around this estimate. (We explain the impulses for db1 and db2 shortly.) Formally, a PRD tells us the probability that r(db, q) is lower than a certain value α given that the estimate r̂(db, q) equals β: P(r(db, q) ≤ α | r̂(db, q) = β). In Section 4 we explain how we can obtain a PRD by issuing a small number of sample queries to a database. For now we assume that the metasearcher knows the PRD of every database.

Note that after probing db, the r(db, q) value is known. Thus the PRD for r(db, q) changes from a broad distribution to an impulse function. For example, in Figure 3(a), we assume db1 and db2 have already been probed, so their PRDs have become impulses at their correct relevancy values, 0.8 and 0.6, respectively. In the middle of a dynamic-probing process, therefore, we have impulse PRDs for the probed databases, and regular PRDs for the rest.

We now illustrate how we can use the PRDs to estimate the probability that a top-k answer DB_k is correct.

Example 2. We assume the situation shown in Figure 3(a): db1 and db2 have already been probed and their relevancy values

Figure 3: The Probabilistic Relevancy Distributions of different databases at various stages of probing: (a) after probing db1 and db2; (b) after probing db1, db2 and db3.

are 0.8 and 0.6, respectively. We do not know the exact relevancy values for db3 and db4, but the PRD of db3 indicates that r(db3, q) ≥ r(db2, q) = 0.6 with 20% probability. We use the absolute correctness metric for an answer set. Now suppose the user wants the metasearcher to return the top-2 databases. In this scenario, if we return {db1, db2}, our answer is correct (Cor_a({db1, db2}) = 1) with 80% probability, because r(db3, q) < r(db2, q) with 80% probability (in which case {db1, db2} are the actual top-2 databases). With the remaining 20% probability, r(db3, q) may be larger than r(db2, q), so our answer {db1, db2} is wrong (Cor_a({db1, db2}) = 0) with 20% probability. Therefore, the expected correctness of the answer {db1, db2} is 1 · 0.8 + 0 · 0.2 = 0.8. (Note that r(db4, q) is always smaller than r(db2, q) and r(db3, q).)

The expected correctness in Example 2 can be better understood on a statistical basis. For example, if the user issues 1,000 queries to a metasearcher, and the metasearcher returns DB_k such that its expected correctness is greater than 0.8 for every query, then the user gets correct answers for at least 800 queries. We now illustrate how a user may use the expected correctness to specify the quality of the answer and how the metasearcher can use dynamic probing to meet the user's specification.

Example 3. Still consider the situation in Figure 3(a). After probing db1 and db2, the metasearcher knows that the expected correctness of {db1, db2} is 0.8. If the user only requires 0.7 expected correctness, the metasearcher can stop probing and return {db1, db2}. If the user's threshold is 0.9, the metasearcher has to probe more databases. Suppose the metasearcher picks db3 for probing. The resulting PRDs are shown in Figure 3(b). Now the metasearcher knows that db3 and db4 are definitely smaller than db2, and {db1, db2} must be the correct answer. Therefore the expected correctness of {db1, db2} is 1 (which exceeds the user's threshold, 0.9). As a result, the metasearcher can stop probing and return {db1, db2}.

The above example shows that we can consider the expected correctness as the knob that the user can turn so as to control the result quality. Given the user's expected correctness specification, the metasearcher keeps on probing databases till it finds a DB_k that exceeds the user-specified threshold. To help our discussion, we refer to the set of databases that have been probed during this process as DB_P and the set of unprobed databases as DB_U. Note that the returned databases DB_k may or may not be the same as the probed databases DB_P. In particular, DB_k may contain a database db that has not been probed (db ∉ DB_P). As long as the metasearcher is confident that r(db, q) is higher than those of others, it is safe to return db as part of DB_k.

From the example, it is clear that we should be able to compute the expected correctness for DB_k given the PRDs of the databases. We use the notation E[Cor_a(DB_k)] and E[Cor_p(DB_k)] to refer to the expected correctness of DB_k under the absolute and partial correctness metric, respectively. When we do not care about a particular correctness metric, we use the notation E[Cor(DB_k)]. According to our Cor_a and Cor_p definitions, the expected correctness can be computed as:

E[Cor_a(DB_k)] = 1 · P(DB_k = DB_k^top) + 0 · P(DB_k ≠ DB_k^top) = P(|DB_k ∩ DB_k^top| = k)    (2)

E[Cor_p(DB_k)] = Σ_i (i/k) · P(|DB_k ∩ DB_k^top| = i)    (3)

The following theorems tell us how to compute the expected absolute correctness, E[Cor_a(DB_k)], and the expected partial correctness, E[Cor_p(DB_k)], using the PRD of each database. We label the databases in DB_k as db_1, db_2, ..., db_k, and label the databases in DB − DB_k as db_{k+1}, ..., db_n. Let f_j(x_j) be the probability density function derived from db_j's PRD (1 ≤ j ≤ n), and x_j be one possible value of r(db_j, q).

Theorem 1. Assuming that all databases operate independently,

E[Cor_a(DB_k)] = ∫ ··· ∫ [ ∏_{db ∈ DB − DB_k} P(r(db, q) < min(x_1, ..., x_k)) ] · ∏_{j=1}^{k} f_j(x_j) dx_1 ··· dx_k

where min(x_1, ..., x_k) is the minimum value among all the x_j of db_j ∈ DB_k.

Proof. See Appendix.
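Theorem 1's integral can be awkward to evaluate in closed form for arbitrary PRDs. As a sanity check of the definition (this is not the paper's own implementation), the following Python sketch estimates E[Cor_a(DB_k)] by Monte Carlo sampling from each database's PRD; the sampler functions are hypothetical stand-ins for whatever PRD representation a metasearcher maintains.

import random

# Hypothetical PRD samplers: each callable returns one draw of the actual
# relevancy r(db, q). Probed databases are represented by impulse PRDs
# (constant functions). Mirrors the situation of Figure 3(a).
prd_samplers = {
    "db1": lambda: 0.8,                          # probed: impulse at 0.8
    "db2": lambda: 0.6,                          # probed: impulse at 0.6
    "db3": lambda: random.uniform(0.3, 0.65),    # unprobed: assumed spread
    "db4": lambda: random.uniform(0.05, 0.25),   # unprobed: assumed spread
}

def expected_cor_a(selected, samplers, trials=100_000):
    """Monte Carlo estimate of E[Cor_a(DB_k)]: the probability that the
    selected databases are exactly the k databases with highest relevancy."""
    k, hits = len(selected), 0
    for _ in range(trials):
        draws = {db: sample() for db, sample in samplers.items()}
        top_k = sorted(draws, key=draws.get, reverse=True)[:k]
        hits += set(top_k) == set(selected)
    return hits / trials

print(expected_cor_a(["db1", "db2"], prd_samplers))
# roughly P(r(db3, q) < 0.6) under the assumed db3 distribution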

Theorem 2. Assuming that all databases operate independently,

E[Cor_p(DB_k)] = Σ_i (i/k) Σ_{DB_i ⊆ DB − DB_k, |DB_i| = k − i} ∫ ··· ∫ [ ∏_{db ∈ DB_i} P(r(db, q) > i-highest(x_1, ..., x_k)) ] · [ ∏_{db ∈ (DB − DB_k − DB_i)} P(r(db, q) < i-highest(x_1, ..., x_k)) ] · ∏_{j=1}^{k} f_j(x_j) dx_1 ··· dx_k

where i-highest(x_1, ..., x_k) is a function that computes the i-th highest value among all the x_j of db_j ∈ DB_k.

Proof. See Appendix.

2.4 Two-stage cost models

When a user interacts with a metasearcher, his eventual goal is to retrieve a set of relevant documents. Therefore, the overall metasearching process can be separated into two stages as shown in Figure 4. In the first stage, the dynamic prober finds an answer set DB_k by probing a few databases. In the second stage, the document retriever contacts each selected database, retrieves the relevant documents and returns them to the user.

Figure 4: Two-stage cost models.

In measuring the cost of our metasearching framework, we may use one of the following metrics:

Probing cost model: We only consider the cost for the probing stage, ignoring the cost for the document-retrieval stage. We assume that the probing cost of a single database is c and is identical for every database. (It is straightforward to extend our model to the case where the probing cost for each database is different.) Since the dynamic prober does |DB_P| probings in the first stage, |DB_P| · c is the cost under this model.

Probing-and-Retrieval (PR) cost model: We also consider the cost for the document retrieval stage. The cost for the retrieval stage may depend on whether a selected database was probed or not in the first stage. For example, suppose the dynamic prober returns {db1, db2} as the top-2 databases after probing db1 (but not db2). If the dynamic prober has retrieved the top ranking documents of db1 during its probing of db1 (which will be necessary if our relevancy definition is document-similarity-based; see Section 2.1), then the metasearcher may have cached the retrieved documents, so that it will not contact db1 again in the second stage. In this case, the metasearcher only contacts db2 in the second stage to retrieve its top ranking documents. Let the retrieval cost for an unprobed database be d2 and the cost for a probed database be d1 (d1 ≤ d2). Then the probing and retrieval cost in the overall metasearching process is

|DB_P| · c + |DB_k ∩ DB_P| · d1 + |DB_k − DB_P| · d2

In this paper, we mainly use the probing cost model as our cost metric. Note that an optimal algorithm for the probing cost model may not be optimal for the PR cost model: Even if an algorithm does fewer probings during the first stage, the algorithm may incur a significant cost during the second stage if none of the returned databases was probed. However, the following theorem shows that under a certain condition, an optimal probing strategy for the probing cost model is also optimal for the PR cost model.

Theorem 3. Under the condition that DB_k ⊆ DB_P (i.e. all the returned databases have been probed), the optimal probing algorithm under the probing-only cost model is also optimal for the PR cost model.

Proof. See Appendix.

In our experiments, we observed that the condition in the above theorem is valid in most cases. That is, DB_k ⊆ DB_P in the majority of cases, which means our algorithm is optimal also for the PR cost model.

3 The Dynamic Probing Algorithm

Given a query q, n databases and a threshold t, our goal is to use a minimum number of probings to find a k-subset DB_k whose expected correctness exceeds t. Figure 5 roughly illustrates our dynamic probing process to achieve this goal.

Figure 5: The dynamic probing process.

At any intermediate step of the dynamic probing, the entire set of databases DB is divided into two subsets: the set of probed databases DB_P and the set of unprobed databases DB_U. Based on the impulse and regular PRDs of DB_P and DB_U, we compute the expected correctness of every k-subset DB_k (using Theorem 1 for the absolute correctness, for example). If there is a DB_k such that E[Cor(DB_k)] ≥ t, the dynamic probing halts and returns this DB_k; otherwise it continues to probe one more database in DB_U,

moves it from DB_U to DB_P, and recomputes the expected correctness for every DB_k. Figure 6 provides the algorithm of our dynamic probing process. At each iteration, we try to find a k-subset DB_k that has the desired level of expected correctness and return it (Step [2]). If no such DB_k exists, we pick a database from the unprobed set (Step [3]), probe it (Step [4]) and recompute the expected correctness (go to Step [2]). Note that one key issue in this algorithm is how SelectDb(DB_U) should pick the next best database to probe in order to minimize the probing cost. In the next subsection, we derive the answer to this question.

Algorithm 3.1 DPro(DB, q, k, t)
Input:
  DB: the entire set of given databases, {db_1, ..., db_n}
  q: a given query
  k: the number of databases to return
  t: the user's threshold for E[Cor(DB_k)]
  PRDs of the probed and unprobed databases
Output: DB_k with E[Cor(DB_k)] ≥ t
Procedure
  [1] DB_P ← ∅, DB_U ← DB
  [2] If (E[Cor(DB_k)] ≥ t) for some DB_k ⊆ DB
        Return DB_k
  [3] db_i ← SelectDb(DB_U)
  [4] Probe db_i
  [5] Change the PRD of db_i from regular to an impulse
  [6] DB_P ← DB_P ∪ {db_i}, DB_U ← DB_U − {db_i}
  [7] Go to [2]

Figure 6: The dynamic probing algorithm DPro.

Figure 7: Selecting the top-2 databases from {db1, db2, db3}: (a) original PRDs of db1, db2 and db3; (b) the outcome of probing db2; (c) the first possible outcome of probing db3; (d) the second possible outcome of probing db3.

3.1 Selecting the optimal candidate database for probing

In SelectDb(DB_U), we need to select the next database candidate that will lead to the earliest termination of the probing process, thus minimizing the probing cost. This database often should not be the one with the largest expected relevancy. Consider the following example.

Example 4. We want to return the top-2 databases from {db1, db2, db3}. We have not probed any of them. Figure 7(a) shows their PRDs. We assume that E[Cor(DB_2)] is smaller than the user threshold t for any DB_2 ⊆ {db1, db2, db3} yet. We need to pick the next database to probe. Note that we do not need to probe db1, because its relevancy is the highest among all three, and it will always be returned as part of DB_2. Probing db1 does not increase answer correctness at all. Similarly, note that probing db2 is not very helpful, either. Because r(db2, q) lies between the two peaks of r(db3, q), even after we probe db2 (Figure 7(b)), it is still uncertain which one (between db2 and db3) will have higher relevancy. In contrast, probing db3 is very likely to improve the certainty of our answer. Given the PRD of db3, r(db3, q) will be either on the left side of r(db2, q) (Figure 7(c)) or on the right side (Figure 7(d)). If it is on the left side (Figure 7(c)), we can return {db1, db2} as the top-2 databases. If it is on the right side (Figure 7(d)), we can return {db1, db3} as the top-2 databases. In either case, we can return the top-2 databases with high confidence. Therefore, SelectDb(DB_U) should pick db3 as the next database to probe, because we can finish the probing process after only one probing. Otherwise our algorithm needs at least two probings to halt. Notice that the expected relevancy of db3 is the lowest among the three databases. The next database to probe is not the one with the highest expected relevancy.

From this example, we can see that the function SelectDb(DB_U) should pick the db_i ∈ DB_U that yields the smallest number of expected probings. To formalize this idea, we introduce the notation ECost(DB_U) to represent the expected amount of additional probing on DB_U after we have probed DB − DB_U (= DB_P). Now we analyze the expected probing cost if we pick db_i ∈ DB_U as the next database to probe. The cost for probing db_i itself is c. The expected cost after probing db_i is ECost(DB_U − {db_i}) under our notation. Therefore, by probing db_i next, we are expected to incur c + ECost(DB_U − {db_i}) additional probing cost. Based on this understanding, we now describe the function SelectDb(DB_U) in Figure 8.

Algorithm 3.2 SelectDb(DB_U)
Input: DB_U: the set of unprobed databases
Output: db_i: the next database to probe
Procedure
  [1] For every db_i ∈ DB_U:
  [2]   cost_i = c + ECost(DB_U − {db_i})
  [3] Return the db_i with the smallest cost_i

Figure 8: The optimal SelectDb(DB_U) function.

In Steps [1] and [2], the algorithm first computes the expected additional probing cost for every db_i ∈ DB_U. Then Step [3] returns the one with the smallest cost.
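The following Python sketch shows how the DPro loop of Figure 6 and the cost-based selection of Figure 8 fit together. The expected-correctness and expected-cost routines are left as injected callables, since their concrete implementations (Theorems 1 and 2, and the ECost recursion described next) depend on how the PRDs are represented; the callable names and signatures are assumptions for illustration, not a published API.

# Sketch of the DPro loop (Figure 6) with the cost-based SelectDb (Figure 8).
# `expected_cor`, `expected_cost`, and `probe` are assumed callables supplied
# by the metasearcher (they are expected to capture the PRDs internally).

from itertools import combinations

def dpro(databases, query, k, threshold, expected_cor, expected_cost, probe, c=1.0):
    probed, unprobed = {}, set(databases)        # DB_P (with actual relevancies), DB_U
    while True:
        # Step [2]: return any k-subset whose expected correctness reaches t.
        for subset in combinations(databases, k):
            if expected_cor(subset, probed, query) >= threshold:
                return set(subset)
        # Step [3]: pick the unprobed database with smallest c + ECost(DB_U - {db_i}).
        best = min(unprobed,
                   key=lambda db: c + expected_cost(unprobed - {db}, probed, query))
        # Steps [4]-[6]: probe it and turn its PRD into an impulse.
        probed[best] = probe(best, query)
        unprobed.remove(best)
        # Once every database is probed, the true top-k has expected correctness 1,
        # so the loop is guaranteed to terminate.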

Algorithm 3.3 ECost(DB_U)
Input: DB_U: the set of unprobed databases
Output: cost: the expected probing cost for DB_U
Procedure
  [1] If (E[Cor(DB_k)] ≥ t) for some DB_k ⊆ DB
        Return 0
  [2] For every db_i ∈ DB_U:
  [3]   cost_i = c + ECost(DB_U − {db_i})
  [4] Return min_i(cost_i)

Figure 9: Algorithm ECost(DB_U).

We now explain how we can compute ECost(DB_U) using recursion. We assume that we have probed DB_P so far, and DB_U (= DB − DB_P) has not been probed yet. There are two possible scenarios at this point:

Case 1 (Stopping condition): With the databases DB_P probed, we can find a DB_k ⊆ DB such that E[Cor(DB_k)] ≥ t. In this case, we can simply return DB_k as the top-k databases. We do not need any further probing. Thus,

ECost(DB_U) = 0    (4)

Note that when all databases have been probed (DB_U = ∅), we know the exact relevancy of all databases, so ECost(DB_U) = 0.

Case 2 (Recursion): There is no DB_k ⊆ DB whose expected correctness exceeds t. Therefore, we need to probe more databases to improve the expected correctness. Assume we probe db_i ∈ DB_U next. Then the expected probing cost is c + ECost(DB_U − {db_i}). Remember that SelectDb(DB_U) always picks the db_i with the minimal expected cost. Therefore, the expected cost at this point is

ECost(DB_U) = min_{db_i ∈ DB_U} ( c + ECost(DB_U − {db_i}) )    (5)

Figure 9 shows the algorithm to compute ECost(DB_U). In Step [1], we first check whether we have reached the stopping condition. If not, we compute the expected probing cost for every db_i ∈ DB_U (Steps [2] and [3]), and return the minimum expected cost (Step [4]). The following theorem shows the optimality of our algorithm SelectDb(DB_U).

Theorem 4. SelectDb(DB_U) returns the database that leads to the minimum expected probing cost, ECost(DB_U), on the set of unprobed databases DB_U.

Proof. See Appendix.

Note that the computation of ECost(DB_U) is recursive and can be very expensive. For example, assume that DB_U = {db_1, ..., db_n} as we show in Figure 10. To compute ECost(DB_U), we have to compute ECost(DB_U − {db_i}) for every 1 ≤ i ≤ n (first-level branching in Figure 10). Then to compute ECost(DB_U − {db_i}), we need to compute ECost(DB_U − {db_i, db_j}) for every j ≠ i (second-level branching in Figure 10). Therefore, the cost for computing ECost(DB_U) is O(n!) if |DB_U| = n. Clearly this is too expensive when we mediate a large number of databases. In the next subsection, we propose a greedy algorithm that reduces the computational complexity of selecting the next database to O(n).

Figure 10: Exploring a search tree to compute ECost(DB_U).

3.2 A greedy choice

The goal of the DPro algorithm is to find a DB_k with E[Cor(DB_k)] ≥ t using a minimum number of probings. Thus, the optimal DPro computes the expected probing cost for all possible probing scenarios and picks the one with the minimum cost. Informally, we may consider that the optimal DPro looks all steps ahead and picks the best one. Our new greedy algorithm, instead, looks only one step ahead and picks the best one. The basic idea of our greedy algorithm is the following: Since we can finish our probing process when E[Cor(DB_k)] exceeds t for some DB_k, the next database that we probe should be the one that leads to the highest E[Cor(DB_k)] after probing (thus most likely to exceed t early). Notice the subtle difference between the optimal algorithm and the greedy algorithm. The optimal algorithm computes ECost(DB_U) for all possible scenarios, while our greedy algorithm computes E[Cor(DB_k)] after we probe only one more database db_i.

Using Theorem 1 we can compute E[Cor_a(DB_k)] after we probe db_i if we know the PRD of each database.⁵ In Figure 11, we show a new SelectDb(DB_U) function that implements this greedy idea. In Steps [1] and [2], the algorithm computes the expected correctness value after we probe db_i. Then in Step [3] it returns the db_i that leads to the highest expected correctness.

Algorithm 3.4 greedySelectDb(DB_U)
Input: DB_U: the set of unprobed databases
Output: db_i: the next database to probe
Procedure
  [1] For every db_i ∈ DB_U:
  [2]   ECor_i = max_{DB_k ⊆ DB} (E[Cor(DB_k)] after probing db_i)
  [3] Return the db_i with the highest ECor_i

Figure 11: The greedy SelectDb(DB_U) function.

(Footnote 5: Since we do not know the outcome of probing db_i, we need to use Theorem 1 to compute an expected E[Cor(DB_k)] value, based on db_i's PRD. The detailed formula is provided in the appendix.)
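A minimal sketch of the greedy selection of Figure 11 is shown below. The helper expected_cor_after_probe is an assumed placeholder for the averaging over db_i's PRD mentioned in footnote 5; it is not code from the paper.

# Sketch of the greedy selection of Figure 11: probe the database whose
# probing is expected to raise the best achievable E[Cor(DB_k)] the most.

from itertools import combinations

def greedy_select_db(databases, unprobed, probed, query, k, expected_cor_after_probe):
    def score(db_i):
        # Best expected correctness over all k-subsets, assuming db_i is probed next.
        return max(expected_cor_after_probe(subset, db_i, probed, query)
                   for subset in combinations(databases, k))
    # One level of look-ahead: O(n) candidate databases instead of O(n!) scenarios.
    return max(unprobed, key=score)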

Figure 12: Distribution of the absolute-error function r̂ − r.

4 Probabilistic Relevancy Distribution

In Section 2, we assumed that the PRD for a database db was already given. We now discuss how we obtain the PRD that gives us P(r(db, q) ≤ α | r̂(db, q) = β), where α and β are specific values. For simplicity, we use r and r̂ to represent r(db, q) and r̂(db, q).

Our basic idea is to use sampling to estimate the PRD. That is, we issue a small number of sampling queries, say 1,000, to db and observe how the actual r values are distributed. From this result, we can compute the difference of r from r̂ and obtain the distribution. Note that the PRD P(r ≤ α | r̂ = β) is conditional on r̂. Therefore, the exact shape of the PRD may be very different for different r̂ values. Ideally, we have to issue a number of sampling queries for each r̂ value, in order to obtain the correct PRD shape for each r̂. However, issuing a set of queries for each r̂ is too expensive given that there are an infinite number of r̂ values. To reduce the cost of PRD estimation, we assume that the distribution we observe is independent of what r̂ may be. More precisely, we may consider one of the following independence assumptions:

Absolute-error independence: We assume that the absolute error of our estimate, r̂ − r (the difference between our estimate and the actual relevancy), is independent of the r̂ value. Therefore, from our sampling queries, we obtain a single distribution for the (r̂ − r) values (even if the r̂ values for the queries are different), and use the distribution to derive the PRD.

Relative-error independence: We assume that the relative error of our estimation, (r̂ − r)/r, is independent of the r̂ value. Therefore, from sampling queries, we obtain a single distribution for the (r̂ − r)/r values (even if the r̂ values for the queries are different) and use the distribution to derive the PRD.

In general, if there is an error function err(r̂, r) (e.g., err(r̂, r) = r̂ − r for the first case) whose distribution is independent of r̂, then we can use just one set of queries (regardless of their r̂ values) to estimate the err(r̂, r) distribution. Then using this distribution, we can obtain the correct PRD for every r̂ value. This can be illustrated through the following example:

Example 5. Suppose from 1,000 sampling queries, we are able to obtain a probability distribution for the absolute-error function err(r̂, r) = r̂ − r, as shown in Figure 12. Assume that r̂ − r is independent of r̂. Let us derive the probability P(r ≤ 50 | r̂ = 100) using this distribution.

P(r ≤ 50 | r̂ = 100) = P(r̂ − r ≥ 50 | r̂ = 100) = P(r̂ − r ≥ 50)   (independence of r̂ − r and r̂)

This probability, P(r̂ − r ≥ 50), as shown in Figure 12, is 0.8.

More formally, we observe that the error function err(r̂, r) should satisfy the following properties to derive a PRD:

I. Independence: err(r̂, r) is probabilistically independent of r̂.
II. Monotonicity: err(r̂, r1) ≥ err(r̂, r2) for any r1 ≤ r2.

The following theorem shows that the probability of a relevancy value can be obtained through the probability of the error function, via a variable transformation from r to err(r̂, r).

Theorem 5. If err(r̂, r) is independent and monotonic, then

P(r ≤ α | r̂ = β) = P(err(r̂, r) ≥ err(β, α))    (6)

Proof. See Appendix.

In Section 5, we compare the absolute-error function, err_a(r̂, r) = r̂ − r, and the relative-error function, err_r(r̂, r) = (r̂ − r)/r, experimentally. Our result shows that the relative-error function works well in practice and roughly satisfies the two properties in Theorem 5.
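A minimal Python sketch of this sampling procedure, under the relative-error independence assumption the paper ends up using: collect (r̂ − r)/r samples from training queries, then answer PRD queries P(r ≤ α | r̂ = β) via Theorem 5. The sample values below are fabricated placeholders for illustration.

# Sketch of deriving a PRD from sampled relative errors (Section 4, Theorem 5),
# assuming err_r(r_hat, r) = (r_hat - r) / r is independent of r_hat.

def relative_errors(samples):
    """samples: list of (estimated, actual) relevancy pairs from sampling queries."""
    return [(r_hat - r) / r for r_hat, r in samples if r > 0]

def prd_probability(errors, alpha, beta):
    """P(r <= alpha | r_hat = beta) = P(err_r >= err_r(beta, alpha)) by Theorem 5,
    estimated empirically from the sampled error values."""
    threshold = (beta - alpha) / alpha
    return sum(e >= threshold for e in errors) / len(errors)

# Hypothetical sampling-query outcomes (estimate, actual):
samples = [(120, 100), (80, 100), (300, 250), (40, 60), (500, 480)]
errors = relative_errors(samples)
print(prd_probability(errors, alpha=50, beta=100))  # P(r <= 50 | r_hat = 100)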
5 Experiments

This section reports our experimental results that testify to the effectiveness of the dynamic probing approach. Section 5.1 describes the experimental setup and the dataset we use. Section 5.2 experimentally compares the error functions to derive a PRD. Sections 5.3 and 5.4 show the improvement of our dynamic probing compared to the existing methods.

5.1 Experimental setup

In our experiments, we simulate a real metasearching application by mediating 20 real Hidden-Web databases and using 4,000 real Web query traces. The databases for our experiments are mainly related to health. Thus, we may consider that the experiments evaluate the effectiveness of our dynamic probing approach in the context of a health-related metasearcher. In this subsection, we explain our experimental setup in detail.

First, we select databases from the health category of InvisibleWeb,⁶ which is a manually-maintained directory of Hidden-Web databases. While the directory lists a large number of health-related databases, most of them are either obsolete or too small. In our experiments, we use only the databases with at least a few thousand documents. Most of the small databases are relatively obscure and of little interest. Because this is a relatively small number, and in order to introduce heterogeneity to our experiments, we append four more databases on broader topics (e.g., Science and Nature), and three more news websites (e.g., CNN and NYTimes). We show some sample databases and their sizes in Figure 13.⁷ The complete list of our databases can be found in [20].

Second, we select a subset of queries from a real query trace from Yahoo (provided by Overture⁸). We start by building a sample medical vocabulary using single terms extracted from the health topic pages in MedLinePlus,⁹ an authoritative medical information website. We then randomly pick 2-term and 3-term queries from the Yahoo query trace that use at least two terms from our vocabulary. Again, this selection was done to simulate a metasearcher that focuses on health-related topics.

⁶ http://
⁷ For databases that do not export their sizes, we roughly estimate the size by issuing a query with common terms, e.g. medical OR health OR cancer...
⁸ http://inventory.overture.com/
⁹ http://

Database        | Size
MedWeb          | 4,000
PubMed Central  | 60,000
NIH             | 6,799
Science         | 9,65

Figure 13: Sample Web databases used in our experiment.

Using the above selection method, we prepare a sample query set QS1 which contains 1,000 2-term queries and 1,000 3-term queries. We use QS1 in Section 5.2 to derive the PRD for each database. Similarly, we prepare another query set QS2 that contains, again, 1,000 2-term queries and 1,000 3-term queries. QS2 is used in Sections 5.3 through 5.4 when we evaluate how well our dynamic prober works. Note that a typical Web query has only a small number of terms, with 2.2 terms on average [19]. Therefore, we believe our experiments using 2- or 3-term queries reflect the typical scenario that a real metasearcher can expect.

In all of our experiments, we use the document-frequency-based metric (Section 2.1) and the independence estimator (Eq. 1). Further, we use the independence estimator to create a baseline representing traditional estimation-based selection methods. All of our dynamic probing is done using the greedy algorithm (Section 3.2). Due to its exponential computational cost, it takes a long time for the optimal algorithm to terminate, so we could not finish enough experiments to include their results in this draft.

5.2 Selecting an error function to derive the correct PRD

To obtain the correct PRD from the err(r̂, r) distribution, err(r̂, r) needs to be monotonic and independent of r̂ (Theorem 5). In this subsection, we experimentally compare the absolute-error function err_a(r̂, r) = r̂ − r with the relative-error function err_r(r̂, r) = (r̂ − r)/r, and select the better one for our experiment. From their analytical forms, it is easy to verify that both error functions are monotonic. What we need to verify is the independence property. This can be done by computing the statistical correlation between err(r̂, r) and r̂, where err can be err_a or err_r. If the correlation is close to 0, it means err and r̂ are roughly independent; otherwise they are not.

More specifically, we first obtain the err(r̂, r) value for each of the 1,000 2-term queries in QS1 on a database db. We then compute the correlation between err(r̂, r) and r̂ over these 1,000 queries for db. We repeat this process for all 20 databases and compute the average correlation over all databases. We similarly compute the correlation for the 1,000 3-term sample queries in QS1, and summarize the results in Figure 14. The maximum correlation value among the 20 databases is also included to show the extreme cases.

Figure 14(a) shows that the absolute error err_a(r̂, r) has a high positive correlation with r̂ for both 2-term and 3-term queries. Therefore, err_a(r̂, r) is dependent on r̂ and becomes larger as r̂ gets larger. Figure 14(b) reveals that the relative error err_r(r̂, r) is roughly independent of r̂. Therefore we use err_r(r̂, r) as the error function to derive a PRD. From our experiments, we observe that the shape of the err_r(r̂, r) distribution for 2-term queries is slightly different from that for 3-term queries. Therefore, we maintain two PRDs for each database, one for 2-term queries and the other for 3-term queries, and pick the appropriate PRD depending on the number of terms in a query.

Figure 14: The correlation between err(r̂, r) and r̂: (a) average and maximum correlation of err_a(r̂, r) with r̂ over the 20 databases; (b) average and maximum correlation of err_r(r̂, r) with r̂ over the 20 databases (2-term and 3-term queries).

Figure 15: The effect of dynamic probing on the average correctness (k = 1, t = 0.9).

5.3 Effectiveness of dynamic probing

In the second set of our experiments, we study the impact of dynamic probing on the correctness of database selection. Our main goal in this section is to investigate how accurate an answer becomes as we probe more databases, so we restrict our experiments only to the queries that require at least three probings for DPro to terminate (i.e., E[Cor(DB_k)] ≥ t only after three probings). When we set t = 0.9 and k = 1 as our parameters, over 1,000 of the 2,000 test queries in QS2 (1,000 2-term queries and 1,000 3-term queries) belong to this category. For each query issued, we then ask DPro to report the database with the highest expected correctness after each probing (even if it has not terminated yet). By comparing this reported database to the most relevant database (i.e., DB_k = DB_k^top?) we can measure how accurate the answer becomes as we probe more databases. Note that the correct DB_k^top is inaccessible to DPro during its probing process.

Figure 15 summarizes the result from these experiments. In the figure, the horizontal axis shows the number of probings that DPro has performed so far. The vertical axis shows the fraction of correct answers that DPro reports at the given number of probings. For example, after one probing, DPro reports the correct database for roughly half of these queries. Note that at the point of no probing (# of probings = 0), DPro is identical to the traditional estimation-based method because it does not use any dynamic probing; at this point the average correctness is much lower. From this result, it is clear that dynamic probing significantly improves the answer correctness: we can improve the correctness of the answer by more than twice with only two probings.

(Footnote 10: Note that Cor_a and Cor_p are the same when k = 1. Therefore we do not specify our correctness metric in this experiment.)

Figure 16: The average number of probings under different settings of k and t; each panel plots, for one k value (panel (a) shows k = 1 and panel (c) shows k = 5), the average number of probings against t, under the Cor_a and Cor_p metrics.

5.4 The average amount of probing under different settings

In this subsection, we study how many probings DPro does for different settings of t. We experiment on six t values: {0.7, 0.75, 0.8, 0.85, 0.9, 0.95}. For a larger t, it is expected that DPro probes more databases to meet the threshold. In Figure 16, we show how the number of probings increases as t becomes larger. The x-axis shows the different t values, and the y-axis is the average number of probings DPro does for a particular t, over the 2,000 test queries in QS2. For example, when k = 1 (Figure 16(a)), DPro terminates after only a few probings on average for the threshold value t = 0.9. In Figures 16(b) and (c) we include the results for the absolute (Cor_a) and partial (Cor_p) correctness metrics. When k = 1, the two correctness metrics are the same, so we have only one graph in Figure 16(a). Note that the graph for Cor_a is always above that of Cor_p. Since Cor_p is always larger than Cor_a, DPro reaches the correctness threshold faster under Cor_p and terminates earlier.

The figure shows that our algorithm DPro can find correct databases with a reasonable number of probings. For example, when k = 5 and t = 0.9 (Figure 16(c)), DPro finds a DB_5 with E[Cor_p(DB_5)] > 0.9 after 6.8 probings. In most cases, all 5 returned databases are probed during the selection process. That means that 5 of the 6.8 probings are done on the top-5 databases returned, so the information that we collect during the probing stage can be used to reduce the cost of the document retrieval stage (Figure 4). So the extra probing in the overall metasearching process is only 1.8.

Note that even if the user-specified threshold t is 0.7, the top-k databases that DPro returns may be correct more than 70% of the time. The user threshold t is simply a lower bound for the correctness of the returned answer. To show how accurate the answers DPro returns are, Figure 17 and Figure 18 show the average correctness of the answers for different threshold values. Figure 17 shows the result under the Cor_a metric, and Figure 18 shows the result under the Cor_p metric. The baseline (the triangle line) is the average correctness of the traditional estimation-based selection. Since the traditional method does not depend on the t value, the average correctness remains constant. The dotted lines in the figures represent Avg(Cor) = t. The average correctness of the answers from DPro should be higher than the dotted line, since t is the minimum threshold value for DPro to terminate. From the graphs, we can see that this is indeed the case.

6 Related work

Database selection is a critical step in the metasearching process. Past research mainly focused on applying certain approximate methods to estimate how relevant a database is to the user's query. The databases with the highest estimated relevancy are selected and presented to the user. The quality of database selection is highly dependent on the accuracy of the estimation method. In the early work of bGlOSS [14], which mediates databases with boolean search interfaces, a metasearcher estimates the relevancy of each database by assuming query terms appear independently. vGlOSS [15] extends bGlOSS to support databases with vector-based search interfaces, and uses a high-correlation assumption or a disjoint assumption on query terms to estimate the relevancy of a database under the vector-space model. [] uses term covariance information to model the dependency between each pair of terms, and achieves better estimation than vGlOSS. An even better estimation is reported in [5] by incorporating document linkage information.

There has been parallel research in the distributed information retrieval context. In [2, 5, 4] the relevancy of a database is modelled by the probability of the database containing similar documents to the query. In [4], various estimation methods discussed above are compared on a common basis. Our dynamic probing method is orthogonal to this research in that we are not proposing a new estimation method under a certain relevancy definition. Instead, we use a probabilistic distribution to model the accuracy of a particular estimation method, and use probing to increase the correctness of database selection.

Database selection is related to a broader research area called top-k query answering. Past research [11, 7, 8, 9] largely focused on relational data, and uses deterministic methods to find the absolutely correct top-k answers. In our context of Hidden-Web database selection, enforcing the deterministic approach would end up probing almost all the Hidden-Web databases. In our probabilistic approach, we only probe the databases that would maximally increase our certainty of the top-k answers.

Mediating heterogeneous databases to provide a single query interface has been studied for years [17, 12]. While the existing research focused on integrating data sources with relational search capabilities, we in this paper investigate the mediation of Hidden-Web databases with much more primitive query interfaces over a collection of unstructured textual data.

7 Conclusion

We have presented a new approach to the Hidden Web database selection problem using dynamic probing. In our approach, the accuracy of a particular estimator is modelled using a Probabilistic Relevancy Distribution (PRD). The PRD enables us to quantify the correctness of a particular top-k answer set in a probabilistic sense. We propose an optimal probing strategy that uses the least probing to reach the user-specified correctness threshold. A greedy probing strategy with much lower computational complexity is also presented. Our experimental results reveal that dynamic probing significantly improves the answer's correctness with a reasonably small amount of probing.

Figure 17: Avg(Cor_a): dynamic probing vs. the estimation-based database selection, for the same k settings as Figure 16. Each plot compares dynamic probing using Cor_a, estimation-based selection (no probing), and the line Avg(Cor_a) = t.

Figure 18: Avg(Cor_p): dynamic probing vs. the estimation-based database selection, for the same k settings as Figure 16. Each plot compares dynamic probing using Cor_p, estimation-based selection (no probing), and the line Avg(Cor_p) = t.

Our experimentation on real datasets justifies an effective new direction for metasearching research. In the past, researchers tried to improve the correctness of database selection by constructing more accurate estimators that estimate the database's relevancy to a particular query. A more accurate estimator demands a more comprehensive content summary of each database. For example, storing the pair-wise term covariance [, 5] takes O(M²) space, where M is the size of the vocabulary. However, once the estimator is constructed, the correctness of database selection is fixed at a certain level and cannot be explicitly controlled by the user. In our dynamic probing approach, the user explicitly specifies the desired level of correctness, regardless of what estimator we use. Our results reveal that using the estimator developed in bGlOSS [14], the answer's correctness is greatly improved via a small amount of probing.

References

[1] M.K. Bergman. The Deep Web: Surfacing Hidden Value. Accessible at DeepWeb, 2000.
[2] C. Baumgarten. A Probabilistic Solution to the Selection and Fusion Problem in Distributed Information Retrieval. In Proc. of ACM SIGIR 99, CA, 1999.
[3] J.A. Borges, I. Morales, N.J. Rodriguez. Guidelines for Designing Usable World Wide Web Pages. In Proc. of ACM SIGCHI 96, http://sigchi.org/chi96/, 1996.
[4] N. Craswell, P. Bailey, and D. Hawking. Server Selection on the World Wide Web. In Proc. of ACM Conf. on Digital Libraries 2000, TX, 2000.
[5] J.P. Callan, Z. Lu, and W. Croft. Searching Distributed Collections with Inference Networks. In Proc. of ACM SIGIR 95, WA, 1995.
[6] J. Callan, M. Connell, and A. Du. Automatic Discovery of Language Models for Text Databases. In Proc. of ACM SIGMOD 99, PA, 1999.
[7] S. Chaudhuri and L. Gravano. Optimizing Queries over Multimedia Repositories. In Proc. of ACM SIGMOD 96, Canada, 1996.
[8] S. Chaudhuri and L. Gravano. Evaluating Top-k Selection Queries. In Proc. of VLDB 99, Scotland, 1999.
[9] K. Chang and S. Hwang. Minimal Probing: Supporting Expensive Predicates for Top-k Queries. In Proc. of ACM SIGMOD 02, WI, 2002.
[10] A. Clyde. The Invisible Web. Teacher Librarian 29(4), 2002.
[11] R. Fagin. Combining Fuzzy Information from Multiple Systems. In Proc. of ACM PODS 96, Canada, 1996.
[12] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, et al. The TSIMMIS Project: Integration of Heterogeneous Information Sources. J. Intelligent Information Systems 8(2):117-132, 1997.
[13] M.R. Garey, D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, 1990.
[14] L. Gravano, H. Garcia-Molina, A. Tomasic. The Effectiveness of GlOSS for the Text Database Discovery Problem. In Proc. of ACM SIGMOD 94, MN, 1994.
[15] L. Gravano and H. Garcia-Molina. Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies. In Proc. of VLDB 95, Switzerland, 1995.
[16] L. Gravano, H. Garcia-Molina, A. Tomasic. GlOSS: Text-Source Discovery over the Internet. ACM TODS 24(2):229-264, 1999.
[17] A.Y. Halevy. Answering Queries Using Views: A Survey. VLDB Journal 10(4):270-294, 2001.
[18] P.G. Ipeirotis, L. Gravano. Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection. In Proc. of VLDB 02, China, 2002.
[19] S. Kirsch. The Future of Internet Search: Infoseek's Experiences Searching the Internet. ACM SIGIR Forum 32(2):3-7, 1998.
[20] V.Z. Liu, R.C. Luo, J. Cho, W.W. Chu. A Probabilistic Framework for Hidden Web Database Selection Using Dynamic Probing. Technical report, UCLA Computer Science Department, 2002.

Christos Papadimitriou & Luca Trevisan November 22, 2016

Christos Papadimitriou & Luca Trevisan November 22, 2016 U.C. Bereley CS170: Algorihms Handou LN-11-22 Chrisos Papadimiriou & Luca Trevisan November 22, 2016 Sreaming algorihms In his lecure and he nex one we sudy memory-efficien algorihms ha process a sream

More information

Chapter 2. First Order Scalar Equations

Chapter 2. First Order Scalar Equations Chaper. Firs Order Scalar Equaions We sar our sudy of differenial equaions in he same way he pioneers in his field did. We show paricular echniques o solve paricular ypes of firs order differenial equaions.

More information

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle Chaper 2 Newonian Mechanics Single Paricle In his Chaper we will review wha Newon s laws of mechanics ell us abou he moion of a single paricle. Newon s laws are only valid in suiable reference frames,

More information

20. Applications of the Genetic-Drift Model

20. Applications of the Genetic-Drift Model 0. Applicaions of he Geneic-Drif Model 1) Deermining he probabiliy of forming any paricular combinaion of genoypes in he nex generaion: Example: If he parenal allele frequencies are p 0 = 0.35 and q 0

More information

Retrieval Models. Boolean and Vector Space Retrieval Models. Common Preprocessing Steps. Boolean Model. Boolean Retrieval Model

Retrieval Models. Boolean and Vector Space Retrieval Models. Common Preprocessing Steps. Boolean Model. Boolean Retrieval Model 1 Boolean and Vecor Space Rerieval Models Many slides in his secion are adaped from Prof. Joydeep Ghosh (UT ECE) who in urn adaped hem from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong) Rerieval

More information

1 Review of Zero-Sum Games

1 Review of Zero-Sum Games COS 5: heoreical Machine Learning Lecurer: Rob Schapire Lecure #23 Scribe: Eugene Brevdo April 30, 2008 Review of Zero-Sum Games Las ime we inroduced a mahemaical model for wo player zero-sum games. Any

More information

Some Basic Information about M-S-D Systems

Some Basic Information about M-S-D Systems Some Basic Informaion abou M-S-D Sysems 1 Inroducion We wan o give some summary of he facs concerning unforced (homogeneous) and forced (non-homogeneous) models for linear oscillaors governed by second-order,

More information

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin ACE 56 Fall 005 Lecure 4: Simple Linear Regression Model: Specificaion and Esimaion by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Simple Regression: Economic and Saisical Model

More information

Final Spring 2007

Final Spring 2007 .615 Final Spring 7 Overview The purpose of he final exam is o calculae he MHD β limi in a high-bea oroidal okamak agains he dangerous n = 1 exernal ballooning-kink mode. Effecively, his corresponds o

More information

An introduction to the theory of SDDP algorithm

An introduction to the theory of SDDP algorithm An inroducion o he heory of SDDP algorihm V. Leclère (ENPC) Augus 1, 2014 V. Leclère Inroducion o SDDP Augus 1, 2014 1 / 21 Inroducion Large scale sochasic problem are hard o solve. Two ways of aacking

More information

Random Walk with Anti-Correlated Steps

Random Walk with Anti-Correlated Steps Random Walk wih Ani-Correlaed Seps John Noga Dirk Wagner 2 Absrac We conjecure he expeced value of random walks wih ani-correlaed seps o be exacly. We suppor his conjecure wih 2 plausibiliy argumens and

More information

Vehicle Arrival Models : Headway

Vehicle Arrival Models : Headway Chaper 12 Vehicle Arrival Models : Headway 12.1 Inroducion Modelling arrival of vehicle a secion of road is an imporan sep in raffic flow modelling. I has imporan applicaion in raffic flow simulaion where

More information

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles Diebold, Chaper 7 Francis X. Diebold, Elemens of Forecasing, 4h Ediion (Mason, Ohio: Cengage Learning, 006). Chaper 7. Characerizing Cycles Afer compleing his reading you should be able o: Define covariance

More information

GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

Approximation Algorithms for Unique Games via Orthogonal Separators

Approximation Algorithms for Unique Games via Orthogonal Separators Approximaion Algorihms for Unique Games via Orhogonal Separaors Lecure noes by Konsanin Makarychev. Lecure noes are based on he papers [CMM06a, CMM06b, LM4]. Unique Games In hese lecure noes, we define

More information

Solutions from Chapter 9.1 and 9.2

Solutions from Chapter 9.1 and 9.2 Soluions from Chaper 9 and 92 Secion 9 Problem # This basically boils down o an exercise in he chain rule from calculus We are looking for soluions of he form: u( x) = f( k x c) where k x R 3 and k is

More information

Lecture 33: November 29

Lecture 33: November 29 36-705: Inermediae Saisics Fall 2017 Lecurer: Siva Balakrishnan Lecure 33: November 29 Today we will coninue discussing he boosrap, and hen ry o undersand why i works in a simple case. In he las lecure

More information

Longest Common Prefixes

Longest Common Prefixes Longes Common Prefixes The sandard ordering for srings is he lexicographical order. I is induced by an order over he alphabe. We will use he same symbols (,

More information

STATE-SPACE MODELLING. A mass balance across the tank gives:

STATE-SPACE MODELLING. A mass balance across the tank gives: B. Lennox and N.F. Thornhill, 9, Sae Space Modelling, IChemE Process Managemen and Conrol Subjec Group Newsleer STE-SPACE MODELLING Inroducion: Over he pas decade or so here has been an ever increasing

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

3.1 More on model selection

3.1 More on model selection 3. More on Model selecion 3. Comparing models AIC, BIC, Adjused R squared. 3. Over Fiing problem. 3.3 Sample spliing. 3. More on model selecion crieria Ofen afer model fiing you are lef wih a handful of

More information

EXERCISES FOR SECTION 1.5

EXERCISES FOR SECTION 1.5 1.5 Exisence and Uniqueness of Soluions 43 20. 1 v c 21. 1 v c 1 2 4 6 8 10 1 2 2 4 6 8 10 Graph of approximae soluion obained using Euler s mehod wih = 0.1. Graph of approximae soluion obained using Euler

More information

T L. t=1. Proof of Lemma 1. Using the marginal cost accounting in Equation(4) and standard arguments. t )+Π RB. t )+K 1(Q RB

T L. t=1. Proof of Lemma 1. Using the marginal cost accounting in Equation(4) and standard arguments. t )+Π RB. t )+K 1(Q RB Elecronic Companion EC.1. Proofs of Technical Lemmas and Theorems LEMMA 1. Le C(RB) be he oal cos incurred by he RB policy. Then we have, T L E[C(RB)] 3 E[Z RB ]. (EC.1) Proof of Lemma 1. Using he marginal

More information

Echocardiography Project and Finite Fourier Series

Echocardiography Project and Finite Fourier Series Echocardiography Projec and Finie Fourier Series 1 U M An echocardiagram is a plo of how a porion of he hear moves as he funcion of ime over he one or more hearbea cycles If he hearbea repeas iself every

More information

Ensamble methods: Bagging and Boosting

Ensamble methods: Bagging and Boosting Lecure 21 Ensamble mehods: Bagging and Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Ensemble mehods Mixure of expers Muliple base models (classifiers, regressors), each covers a differen par

More information

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017 Two Popular Bayesian Esimaors: Paricle and Kalman Filers McGill COMP 765 Sep 14 h, 2017 1 1 1, dx x Bel x u x P x z P Recall: Bayes Filers,,,,,,, 1 1 1 1 u z u x P u z u x z P Bayes z = observaion u =

More information

Comparing Means: t-tests for One Sample & Two Related Samples

Comparing Means: t-tests for One Sample & Two Related Samples Comparing Means: -Tess for One Sample & Two Relaed Samples Using he z-tes: Assumpions -Tess for One Sample & Two Relaed Samples The z-es (of a sample mean agains a populaion mean) is based on he assumpion

More information

ACE 564 Spring Lecture 7. Extensions of The Multiple Regression Model: Dummy Independent Variables. by Professor Scott H.

ACE 564 Spring Lecture 7. Extensions of The Multiple Regression Model: Dummy Independent Variables. by Professor Scott H. ACE 564 Spring 2006 Lecure 7 Exensions of The Muliple Regression Model: Dumm Independen Variables b Professor Sco H. Irwin Readings: Griffihs, Hill and Judge. "Dumm Variables and Varing Coefficien Models

More information

Ensamble methods: Boosting

Ensamble methods: Boosting Lecure 21 Ensamble mehods: Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Schedule Final exam: April 18: 1:00-2:15pm, in-class Term projecs April 23 & April 25: a 1:00-2:30pm in CS seminar room

More information

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature On Measuring Pro-Poor Growh 1. On Various Ways of Measuring Pro-Poor Growh: A Shor eview of he Lieraure During he pas en years or so here have been various suggesions concerning he way one should check

More information

Georey E. Hinton. University oftoronto. Technical Report CRG-TR February 22, Abstract

Georey E. Hinton. University oftoronto.   Technical Report CRG-TR February 22, Abstract Parameer Esimaion for Linear Dynamical Sysems Zoubin Ghahramani Georey E. Hinon Deparmen of Compuer Science Universiy oftorono 6 King's College Road Torono, Canada M5S A4 Email: zoubin@cs.orono.edu Technical

More information

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé Bias in Condiional and Uncondiional Fixed Effecs Logi Esimaion: a Correcion * Tom Coupé Economics Educaion and Research Consorium, Naional Universiy of Kyiv Mohyla Academy Address: Vul Voloska 10, 04070

More information

CHAPTER 10 VALIDATION OF TEST WITH ARTIFICAL NEURAL NETWORK

CHAPTER 10 VALIDATION OF TEST WITH ARTIFICAL NEURAL NETWORK 175 CHAPTER 10 VALIDATION OF TEST WITH ARTIFICAL NEURAL NETWORK 10.1 INTRODUCTION Amongs he research work performed, he bes resuls of experimenal work are validaed wih Arificial Neural Nework. From he

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

Robust estimation based on the first- and third-moment restrictions of the power transformation model

Robust estimation based on the first- and third-moment restrictions of the power transformation model h Inernaional Congress on Modelling and Simulaion, Adelaide, Ausralia, 6 December 3 www.mssanz.org.au/modsim3 Robus esimaion based on he firs- and hird-momen resricions of he power ransformaion Nawaa,

More information

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED 0.1 MAXIMUM LIKELIHOOD ESTIMATIO EXPLAIED Maximum likelihood esimaion is a bes-fi saisical mehod for he esimaion of he values of he parameers of a sysem, based on a se of observaions of a random variable

More information

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H.

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H. ACE 56 Fall 005 Lecure 5: he Simple Linear Regression Model: Sampling Properies of he Leas Squares Esimaors by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Inference in he Simple

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates Biol. 356 Lab 8. Moraliy, Recruimen, and Migraion Raes (modified from Cox, 00, General Ecology Lab Manual, McGraw Hill) Las week we esimaed populaion size hrough several mehods. One assumpion of all hese

More information

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN The MIT Press, 2014 Lecure Slides for INTRODUCTION TO MACHINE LEARNING 3RD EDITION alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/i2ml3e CHAPTER 2: SUPERVISED LEARNING Learning a Class

More information

Introduction to Probability and Statistics Slides 4 Chapter 4

Introduction to Probability and Statistics Slides 4 Chapter 4 Inroducion o Probabiliy and Saisics Slides 4 Chaper 4 Ammar M. Sarhan, asarhan@mahsa.dal.ca Deparmen of Mahemaics and Saisics, Dalhousie Universiy Fall Semeser 8 Dr. Ammar Sarhan Chaper 4 Coninuous Random

More information

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1.

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1. Roboics I April 11, 017 Exercise 1 he kinemaics of a 3R spaial robo is specified by he Denavi-Harenberg parameers in ab 1 i α i d i a i θ i 1 π/ L 1 0 1 0 0 L 3 0 0 L 3 3 able 1: able of DH parameers of

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8)

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8) I. Definiions and Problems A. Perfec Mulicollineariy Econ7 Applied Economerics Topic 7: Mulicollineariy (Sudenmund, Chaper 8) Definiion: Perfec mulicollineariy exiss in a following K-variable regression

More information

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis Speaker Adapaion Techniques For Coninuous Speech Using Medium and Small Adapaion Daa Ses Consaninos Boulis Ouline of he Presenaion Inroducion o he speaker adapaion problem Maximum Likelihood Sochasic Transformaions

More information

PENALIZED LEAST SQUARES AND PENALIZED LIKELIHOOD

PENALIZED LEAST SQUARES AND PENALIZED LIKELIHOOD PENALIZED LEAST SQUARES AND PENALIZED LIKELIHOOD HAN XIAO 1. Penalized Leas Squares Lasso solves he following opimizaion problem, ˆβ lasso = arg max β R p+1 1 N y i β 0 N x ij β j β j (1.1) for some 0.

More information

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17 EES 16A Designing Informaion Devices and Sysems I Spring 019 Lecure Noes Noe 17 17.1 apaciive ouchscreen In he las noe, we saw ha a capacior consiss of wo pieces on conducive maerial separaed by a nonconducive

More information

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t...

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t... Mah 228- Fri Mar 24 5.6 Marix exponenials and linear sysems: The analogy beween firs order sysems of linear differenial equaions (Chaper 5) and scalar linear differenial equaions (Chaper ) is much sronger

More information

Notes on Kalman Filtering

Notes on Kalman Filtering Noes on Kalman Filering Brian Borchers and Rick Aser November 7, Inroducion Daa Assimilaion is he problem of merging model predicions wih acual measuremens of a sysem o produce an opimal esimae of he curren

More information

12: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME. Σ j =

12: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME. Σ j = 1: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME Moving Averages Recall ha a whie noise process is a series { } = having variance σ. The whie noise process has specral densiy f (λ) = of

More information

Predator - Prey Model Trajectories and the nonlinear conservation law

Predator - Prey Model Trajectories and the nonlinear conservation law Predaor - Prey Model Trajecories and he nonlinear conservaion law James K. Peerson Deparmen of Biological Sciences and Deparmen of Mahemaical Sciences Clemson Universiy Ocober 28, 213 Ouline Drawing Trajecories

More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

23.5. Half-Range Series. Introduction. Prerequisites. Learning Outcomes

23.5. Half-Range Series. Introduction. Prerequisites. Learning Outcomes Half-Range Series 2.5 Inroducion In his Secion we address he following problem: Can we find a Fourier series expansion of a funcion defined over a finie inerval? Of course we recognise ha such a funcion

More information

Chapter 7: Solving Trig Equations

Chapter 7: Solving Trig Equations Haberman MTH Secion I: The Trigonomeric Funcions Chaper 7: Solving Trig Equaions Le s sar by solving a couple of equaions ha involve he sine funcion EXAMPLE a: Solve he equaion sin( ) The inverse funcions

More information

10. State Space Methods

10. State Space Methods . Sae Space Mehods. Inroducion Sae space modelling was briefly inroduced in chaper. Here more coverage is provided of sae space mehods before some of heir uses in conrol sysem design are covered in he

More information

On Multicomponent System Reliability with Microshocks - Microdamages Type of Components Interaction

On Multicomponent System Reliability with Microshocks - Microdamages Type of Components Interaction On Mulicomponen Sysem Reliabiliy wih Microshocks - Microdamages Type of Componens Ineracion Jerzy K. Filus, and Lidia Z. Filus Absrac Consider a wo componen parallel sysem. The defined new sochasic dependences

More information

5.2. The Natural Logarithm. Solution

5.2. The Natural Logarithm. Solution 5.2 The Naural Logarihm The number e is an irraional number, similar in naure o π. Is non-erminaing, non-repeaing value is e 2.718 281 828 59. Like π, e also occurs frequenly in naural phenomena. In fac,

More information

Solutions to Odd Number Exercises in Chapter 6

Solutions to Odd Number Exercises in Chapter 6 1 Soluions o Odd Number Exercises in 6.1 R y eˆ 1.7151 y 6.3 From eˆ ( T K) ˆ R 1 1 SST SST SST (1 R ) 55.36(1.7911) we have, ˆ 6.414 T K ( ) 6.5 y ye ye y e 1 1 Consider he erms e and xe b b x e y e b

More information

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes Represening Periodic Funcions by Fourier Series 3. Inroducion In his Secion we show how a periodic funcion can be expressed as a series of sines and cosines. We begin by obaining some sandard inegrals

More information

Article from. Predictive Analytics and Futurism. July 2016 Issue 13

Article from. Predictive Analytics and Futurism. July 2016 Issue 13 Aricle from Predicive Analyics and Fuurism July 6 Issue An Inroducion o Incremenal Learning By Qiang Wu and Dave Snell Machine learning provides useful ools for predicive analyics The ypical machine learning

More information

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details!

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details! MAT 257, Handou 6: Ocober 7-2, 20. I. Assignmen. Finish reading Chaper 2 of Spiva, rereading earlier secions as necessary. handou and fill in some missing deails! II. Higher derivaives. Also, read his

More information

Online Convex Optimization Example And Follow-The-Leader

Online Convex Optimization Example And Follow-The-Leader CSE599s, Spring 2014, Online Learning Lecure 2-04/03/2014 Online Convex Opimizaion Example And Follow-The-Leader Lecurer: Brendan McMahan Scribe: Sephen Joe Jonany 1 Review of Online Convex Opimizaion

More information

References are appeared in the last slide. Last update: (1393/08/19)

References are appeared in the last slide. Last update: (1393/08/19) SYSEM IDEIFICAIO Ali Karimpour Associae Professor Ferdowsi Universi of Mashhad References are appeared in he las slide. Las updae: 0..204 393/08/9 Lecure 5 lecure 5 Parameer Esimaion Mehods opics o be

More information

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model Modal idenificaion of srucures from roving inpu daa by means of maximum likelihood esimaion of he sae space model J. Cara, J. Juan, E. Alarcón Absrac The usual way o perform a forced vibraion es is o fix

More information

Lab 10: RC, RL, and RLC Circuits

Lab 10: RC, RL, and RLC Circuits Lab 10: RC, RL, and RLC Circuis In his experimen, we will invesigae he behavior of circuis conaining combinaions of resisors, capaciors, and inducors. We will sudy he way volages and currens change in

More information

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients Secion 3.5 Nonhomogeneous Equaions; Mehod of Undeermined Coefficiens Key Terms/Ideas: Linear Differenial operaor Nonlinear operaor Second order homogeneous DE Second order nonhomogeneous DE Soluion o homogeneous

More information

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015 Explaining Toal Facor Produciviy Ulrich Kohli Universiy of Geneva December 2015 Needed: A Theory of Toal Facor Produciviy Edward C. Presco (1998) 2 1. Inroducion Toal Facor Produciviy (TFP) has become

More information

RC, RL and RLC circuits

RC, RL and RLC circuits Name Dae Time o Complee h m Parner Course/ Secion / Grade RC, RL and RLC circuis Inroducion In his experimen we will invesigae he behavior of circuis conaining combinaions of resisors, capaciors, and inducors.

More information

NCSS Statistical Software. , contains a periodic (cyclic) component. A natural model of the periodic component would be

NCSS Statistical Software. , contains a periodic (cyclic) component. A natural model of the periodic component would be NCSS Saisical Sofware Chaper 468 Specral Analysis Inroducion This program calculaes and displays he periodogram and specrum of a ime series. This is someimes nown as harmonic analysis or he frequency approach

More information

Maintenance Models. Prof. Robert C. Leachman IEOR 130, Methods of Manufacturing Improvement Spring, 2011

Maintenance Models. Prof. Robert C. Leachman IEOR 130, Methods of Manufacturing Improvement Spring, 2011 Mainenance Models Prof Rober C Leachman IEOR 3, Mehods of Manufacuring Improvemen Spring, Inroducion The mainenance of complex equipmen ofen accouns for a large porion of he coss associaed wih ha equipmen

More information

Single and Double Pendulum Models

Single and Double Pendulum Models Single and Double Pendulum Models Mah 596 Projec Summary Spring 2016 Jarod Har 1 Overview Differen ypes of pendulums are used o model many phenomena in various disciplines. In paricular, single and double

More information

Energy Storage Benchmark Problems

Energy Storage Benchmark Problems Energy Sorage Benchmark Problems Daniel F. Salas 1,3, Warren B. Powell 2,3 1 Deparmen of Chemical & Biological Engineering 2 Deparmen of Operaions Research & Financial Engineering 3 Princeon Laboraory

More information

Lecture 2-1 Kinematics in One Dimension Displacement, Velocity and Acceleration Everything in the world is moving. Nothing stays still.

Lecture 2-1 Kinematics in One Dimension Displacement, Velocity and Acceleration Everything in the world is moving. Nothing stays still. Lecure - Kinemaics in One Dimension Displacemen, Velociy and Acceleraion Everyhing in he world is moving. Nohing says sill. Moion occurs a all scales of he universe, saring from he moion of elecrons in

More information

Notes for Lecture 17-18

Notes for Lecture 17-18 U.C. Berkeley CS278: Compuaional Complexiy Handou N7-8 Professor Luca Trevisan April 3-8, 2008 Noes for Lecure 7-8 In hese wo lecures we prove he firs half of he PCP Theorem, he Amplificaion Lemma, up

More information

Learning Objectives: Practice designing and simulating digital circuits including flip flops Experience state machine design procedure

Learning Objectives: Practice designing and simulating digital circuits including flip flops Experience state machine design procedure Lab 4: Synchronous Sae Machine Design Summary: Design and implemen synchronous sae machine circuis and es hem wih simulaions in Cadence Viruoso. Learning Objecives: Pracice designing and simulaing digial

More information

Hamilton- J acobi Equation: Explicit Formulas In this lecture we try to apply the method of characteristics to the Hamilton-Jacobi equation: u t

Hamilton- J acobi Equation: Explicit Formulas In this lecture we try to apply the method of characteristics to the Hamilton-Jacobi equation: u t M ah 5 2 7 Fall 2 0 0 9 L ecure 1 0 O c. 7, 2 0 0 9 Hamilon- J acobi Equaion: Explici Formulas In his lecure we ry o apply he mehod of characerisics o he Hamilon-Jacobi equaion: u + H D u, x = 0 in R n

More information

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes Some common engineering funcions 2.7 Inroducion This secion provides a caalogue of some common funcions ofen used in Science and Engineering. These include polynomials, raional funcions, he modulus funcion

More information

Guest Lectures for Dr. MacFarlane s EE3350 Part Deux

Guest Lectures for Dr. MacFarlane s EE3350 Part Deux Gues Lecures for Dr. MacFarlane s EE3350 Par Deux Michael Plane Mon., 08-30-2010 Wrie name in corner. Poin ou his is a review, so I will go faser. Remind hem o go lisen o online lecure abou geing an A

More information

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS NA568 Mobile Roboics: Mehods & Algorihms Today s Topic Quick review on (Linear) Kalman Filer Kalman Filering for Non-Linear Sysems Exended Kalman Filer (EKF)

More information

5.1 - Logarithms and Their Properties

5.1 - Logarithms and Their Properties Chaper 5 Logarihmic Funcions 5.1 - Logarihms and Their Properies Suppose ha a populaion grows according o he formula P 10, where P is he colony size a ime, in hours. When will he populaion be 2500? We

More information

Math 333 Problem Set #2 Solution 14 February 2003

Math 333 Problem Set #2 Solution 14 February 2003 Mah 333 Problem Se #2 Soluion 14 February 2003 A1. Solve he iniial value problem dy dx = x2 + e 3x ; 2y 4 y(0) = 1. Soluion: This is separable; we wrie 2y 4 dy = x 2 + e x dx and inegrae o ge The iniial

More information

Phys1112: DC and RC circuits

Phys1112: DC and RC circuits Name: Group Members: Dae: TA s Name: Phys1112: DC and RC circuis Objecives: 1. To undersand curren and volage characerisics of a DC RC discharging circui. 2. To undersand he effec of he RC ime consan.

More information

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H.

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H. ACE 56 Fall 5 Lecure 8: The Simple Linear Regression Model: R, Reporing he Resuls and Predicion by Professor Sco H. Irwin Required Readings: Griffihs, Hill and Judge. "Explaining Variaion in he Dependen

More information

Single-Pass-Based Heuristic Algorithms for Group Flexible Flow-shop Scheduling Problems

Single-Pass-Based Heuristic Algorithms for Group Flexible Flow-shop Scheduling Problems Single-Pass-Based Heurisic Algorihms for Group Flexible Flow-shop Scheduling Problems PEI-YING HUANG, TZUNG-PEI HONG 2 and CHENG-YAN KAO, 3 Deparmen of Compuer Science and Informaion Engineering Naional

More information

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing Applicaion of a Sochasic-Fuzzy Approach o Modeling Opimal Discree Time Dynamical Sysems by Using Large Scale Daa Processing AA WALASZE-BABISZEWSA Deparmen of Compuer Engineering Opole Universiy of Technology

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

Solutions for Assignment 2

Solutions for Assignment 2 Faculy of rs and Science Universiy of Torono CSC 358 - Inroducion o Compuer Neworks, Winer 218 Soluions for ssignmen 2 Quesion 1 (2 Poins): Go-ack n RQ In his quesion, we review how Go-ack n RQ can be

More information

Two Coupled Oscillators / Normal Modes

Two Coupled Oscillators / Normal Modes Lecure 3 Phys 3750 Two Coupled Oscillaors / Normal Modes Overview and Moivaion: Today we ake a small, bu significan, sep owards wave moion. We will no ye observe waves, bu his sep is imporan in is own

More information

Rapid Termination Evaluation for Recursive Subdivision of Bezier Curves

Rapid Termination Evaluation for Recursive Subdivision of Bezier Curves Rapid Terminaion Evaluaion for Recursive Subdivision of Bezier Curves Thomas F. Hain School of Compuer and Informaion Sciences, Universiy of Souh Alabama, Mobile, AL, U.S.A. Absrac Bézier curve flaening

More information

STA 114: Statistics. Notes 2. Statistical Models and the Likelihood Function

STA 114: Statistics. Notes 2. Statistical Models and the Likelihood Function STA 114: Saisics Noes 2. Saisical Models and he Likelihood Funcion Describing Daa & Saisical Models A physicis has a heory ha makes a precise predicion of wha s o be observed in daa. If he daa doesn mach

More information

1 Differential Equation Investigations using Customizable

1 Differential Equation Investigations using Customizable Differenial Equaion Invesigaions using Cusomizable Mahles Rober Decker The Universiy of Harford Absrac. The auhor has developed some plaform independen, freely available, ineracive programs (mahles) for

More information

Generalized Least Squares

Generalized Least Squares Generalized Leas Squares Augus 006 1 Modified Model Original assumpions: 1 Specificaion: y = Xβ + ε (1) Eε =0 3 EX 0 ε =0 4 Eεε 0 = σ I In his secion, we consider relaxing assumpion (4) Insead, assume

More information

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions Muli-Period Sochasic Models: Opimali of (s, S) Polic for -Convex Objecive Funcions Consider a seing similar o he N-sage newsvendor problem excep ha now here is a fixed re-ordering cos (> 0) for each (re-)order.

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

SOLUTIONS TO ECE 3084

SOLUTIONS TO ECE 3084 SOLUTIONS TO ECE 384 PROBLEM 2.. For each sysem below, specify wheher or no i is: (i) memoryless; (ii) causal; (iii) inverible; (iv) linear; (v) ime invarian; Explain your reasoning. If he propery is no

More information

BBP-type formulas, in general bases, for arctangents of real numbers

BBP-type formulas, in general bases, for arctangents of real numbers Noes on Number Theory and Discree Mahemaics Vol. 19, 13, No. 3, 33 54 BBP-ype formulas, in general bases, for arcangens of real numbers Kunle Adegoke 1 and Olawanle Layeni 2 1 Deparmen of Physics, Obafemi

More information

Lecture 4 Notes (Little s Theorem)

Lecture 4 Notes (Little s Theorem) Lecure 4 Noes (Lile s Theorem) This lecure concerns one of he mos imporan (and simples) heorems in Queuing Theory, Lile s Theorem. More informaion can be found in he course book, Bersekas & Gallagher,

More information

Errata (1 st Edition)

Errata (1 st Edition) P Sandborn, os Analysis of Elecronic Sysems, s Ediion, orld Scienific, Singapore, 03 Erraa ( s Ediion) S K 05D Page 8 Equaion (7) should be, E 05D E Nu e S K he L appearing in he equaion in he book does

More information

MATH 4330/5330, Fourier Analysis Section 6, Proof of Fourier s Theorem for Pointwise Convergence

MATH 4330/5330, Fourier Analysis Section 6, Proof of Fourier s Theorem for Pointwise Convergence MATH 433/533, Fourier Analysis Secion 6, Proof of Fourier s Theorem for Poinwise Convergence Firs, some commens abou inegraing periodic funcions. If g is a periodic funcion, g(x + ) g(x) for all real x,

More information