Probabilistic Structured Query Methods


Kareem Darwish and Douglas W. Oard^1
Institute for Advanced Computer Studies
University of Maryland, College Park, MD
{kareem,oard}@glue.umd.edu

ABSTRACT

Structured methods for query term replacement rely on separate estimates of term frequency and document frequency to compute the weight for each query term. This paper reviews prior work on structured query techniques and introduces three new variants that leverage estimates of replacement probabilities. Statistically significant improvements in retrieval effectiveness are demonstrated for cross-language retrieval and for retrieval based on optical character recognition when replacement probabilities are used to estimate both term frequency and document frequency.

KEYWORDS

Structured queries, Cross-language information retrieval, Document image retrieval

1 INTRODUCTION

There are many situations in which it is desirable to match a query term with different terms in a document. Well-known examples include stemming (where any word that shares the same stem should be matched), thesaurus expansion (where terms with similar meanings should be matched), and cross-language retrieval (where terms with similar meanings in different languages should be matched). When the mappings among matching terms are known in advance, the usual approach is to conflate the alternatives during indexing. That is the typical way in which stemming is implemented, for example. Query-time implementations are necessary when appropriate matching decisions depend on the nature of the query, as might be the case with systems that provide the searcher with interactive control over thesaurus expansion. This paper reviews presently known techniques for query-time replacement, introduces new techniques that leverage estimates of replacement probability, and presents experimental results that demonstrate improved retrieval effectiveness in two applications: Cross-Language Information Retrieval (CLIR) and retrieval of scanned documents based on Optical Character Recognition (OCR).
CLIR has received more attention than any other query-time replacement problem in recent years, and several effective techniques are now known. Query translation research has developed along two broad directions, typically referred to as dictionary-based and corpus-based techniques. Broadly speaking, corpus-based techniques seek to optimize retrieval effectiveness through reliance on observed translation probabilities in aligned corpora, while dictionary-based techniques are optimized for the case where reliable estimates of translation probability are not available.

^1 College of Information Studies and Institute for Advanced Computer Studies.

A key idea in the so-called vector-space approach to information retrieval is reliance on two statistics: (1) term frequency (TF), the number of occurrences of a term in a document, and (2) document frequency (DF), the number of documents in which a term appears. TF is a measure of aboutness, which has beneficial effects on both precision and recall. DF is a measure of specificity, and its principal effect is on precision. In general, high TF and low DF are preferred, with the optimal combination of those factors typically being determined through experimentation (cf. [14]). Pirkola appears to have been the first to try separately estimating TF and DF for query terms in a CLIR application [13], using the InQuery synonym operator to implement what he called structured queries. InQuery's synonym operator was originally designed to support monolingual thesaurus expansion, so it estimates TF and DF as follows [11]:

$$TF_j(Q_i) = \sum_{\{k \mid D_k \in T(Q_i)\}} TF_j(D_k) \qquad (1)$$

$$DF(Q_i) = \Bigl| \bigcup_{\{k \mid D_k \in T(Q_i)\}} \{ d \mid D_k \in d \} \Bigr| \qquad (2)$$

where Q_i is a query term, D_k is a document term, TF_j(Q_i) is the term frequency of Q_i in document j, DF(Q_i) is the number of documents that contain Q_i, d is a document, and T(Q_i) is the set of known replacements (in this case, translations) for the term Q_i. Essentially, these equations treat any occurrence of a replacement as an occurrence of the query term.
This represents a very cautious strategy in which a high DF for any replacement will result in a high joint DF (and thus a low weight) for that query term. Retrieval results are then dominated by query terms that have no unsafe (very common) replacements. For example, one Arabic query term can mean either "on" or the proper name "Ali". If "Ali" appears in few documents but "on" appears in many, equation (2) will treat the query term as if it were at least as common as "on". When there is not a large disparity in DF, equation (1) implements a kind of query expansion effect. For example, one Arabic word can be translated as "bread" or "bake", and equation (1) would (with proper stemming) reward an occurrence of "baking bread".

Corpus-based approaches to CLIR have generally developed within a framework based on language modeling rather than vector space models, at least in part because modern statistical translation frameworks offer a natural way of integrating translation and language models [18]. In general, language modeling approaches to retrieval rely on collection frequency (CF) in place of DF:^2

$$CF(Q_i) = \sum_{\{k \mid D_k \in T(Q_i)\}} \sum_{j \in C} TF_j(D_k) \qquad (3)$$

where C represents the collection, and the other terms are as defined above. Whether DF is better than CF depends on how we model the searcher's task. When the goal is to find entire documents, DF models the concept of selectivity with higher fidelity.

^2 Hiemstra's work is a notable exception [6].
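The statistics in equations (1) to (3) can be illustrated with a minimal Python sketch. The toy index, the document identifiers, and the function names below are hypothetical and are not taken from InQuery or from the system used in the experiments; the sketch only assumes an in-memory mapping from each document term to its per-document counts.

```python
from typing import Dict, Set

# Hypothetical toy index: document term -> {document id: term frequency}.
Postings = Dict[str, Dict[str, int]]


def pirkola_tf(index: Postings, replacements: Set[str], doc_id: str) -> int:
    """Equation (1): sum the TF of every replacement within one document."""
    return sum(index.get(term, {}).get(doc_id, 0) for term in replacements)


def pirkola_df(index: Postings, replacements: Set[str]) -> int:
    """Equation (2): size of the union of the replacements' document sets."""
    docs: Set[str] = set()
    for term in replacements:
        docs.update(index.get(term, {}))  # adds the document ids (dict keys)
    return len(docs)


def collection_frequency(index: Postings, replacements: Set[str]) -> int:
    """Equation (3): occurrences of any replacement over the whole collection."""
    return sum(sum(index.get(term, {}).values()) for term in replacements)


# Invented counts for the "bread" / "bake" example discussed above.
index: Postings = {
    "bread": {"d1": 2, "d2": 1},
    "bake": {"d1": 1, "d3": 4},
}
translations = {"bread", "bake"}

tf_d1 = pirkola_tf(index, translations, "d1")
df = pirkola_df(index, translations)
cf = collection_frequency(index, translations)
print(tf_d1, df, cf)
```

Document d1 contains both translations, so its joint TF is the sum of the two counts, while the joint DF counts d1 only once because equation (2) takes a union rather than a sum.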

The next section introduces a set of replacement strategies that leverage observed replacement probabilities (from corpora) while retaining the vector space model's concept of DF. The effectiveness and efficiency (relative to present baselines) of this strategy are then shown in subsequent sections for two applications: CLIR, and retrieval from scanned documents using OCR. The paper then concludes with some notes on the limitations of the techniques presented here and opportunities for future work on this problem.

2 BEYOND PIRKOLA'S METHOD

Kwok was the first to introduce a variant of Pirkola's method, aiming to reduce implementation complexity by replacing the union operator with a sum [8]:

$$DF(Q_i) = \sum_{\{k \mid D_k \in T(Q_i)\}} DF(D_k) \qquad (4)$$

An alternative approach, not previously explored, would be to use the maximum document frequency of any replacement (MDF):

$$DF(Q_i) = \max_{\{k \mid D_k \in T(Q_i)\}} \bigl[ DF(D_k) \bigr] \qquad (5)$$

All three variants (Pirkola, Kwok, and MDF) lower bound the DF for a query term by the DF of its most common replacement, and the experiments reported in Sections 3 and 4 below show no statistically significant difference between the three techniques.

All three techniques treat every known replacement as equally likely. This risks a somewhat counterintuitive result: introduction of a translation dictionary with improved coverage of rare translations could actually harm retrieval effectiveness. To see this problem, consider a query term for which 99.9% of the instances should be translated as some rare term (e.g., "superfluous"), but for which a translation that happens to be a common term (e.g., "the") would be appropriate in 0.1% of the cases. In such cases, the common term leads to a high joint DF, effectively diminishing the value of the original query term. This situation arises often with dictionaries built from aligned corpora using statistical methods, since there is always some chance that any term might be observed as a replacement for any other term.
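The lower-bound behaviour of the three unweighted DF estimates can be checked on the "superfluous" / "the" example with a short sketch. The document sets below are invented for illustration, and the function names are hypothetical labels for equations (2), (4), and (5).

```python
from typing import Dict, Iterable, Set

DocSets = Dict[str, Set[int]]  # document term -> ids of documents containing it


def df_pirkola(doc_sets: DocSets, replacements: Iterable[str]) -> int:
    """Equation (2): number of documents containing at least one replacement."""
    docs: Set[int] = set()
    for term in replacements:
        docs |= doc_sets.get(term, set())
    return len(docs)


def df_kwok(doc_sets: DocSets, replacements: Iterable[str]) -> int:
    """Equation (4): sum of the individual document frequencies."""
    return sum(len(doc_sets.get(term, set())) for term in replacements)


def df_mdf(doc_sets: DocSets, replacements: Iterable[str]) -> int:
    """Equation (5): document frequency of the most common replacement."""
    return max((len(doc_sets.get(term, set())) for term in replacements), default=0)


# Invented collection: "the" occurs in 900 documents, "superfluous" in two.
doc_sets: DocSets = {"the": set(range(900)), "superfluous": {3, 950}}
translations = ["superfluous", "the"]

joint = {
    "pirkola": df_pirkola(doc_sets, translations),
    "kwok": df_kwok(doc_sets, translations),
    "mdf": df_mdf(doc_sets, translations),
}
print(joint)
```

In this toy case every estimate is at least 900, the DF of "the", even though "the" is assumed to be the correct translation only rarely, so the rare and informative translation contributes almost nothing to the weight.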
One way to resolve the problem is to use a weighted variant of Kwok's method:

$$DF(Q_i) = \sum_{\{k \mid D_k \in T(Q_i)\}} \bigl[ DF(D_k) \times wt(D_k) \bigr] \qquad (6)$$

In general, any monotone function of the replacement probability could be used for wt(D_k). For the experiments reported below, the weight is simply set to the best available estimate of the replacement probability. Improbable translations that are common terms can also cause problems in equation (1), since common terms are likely to have higher TFs as well. One way to limit this effect is to use a weighted sum in the TF computation:

$$TF_j(Q_i) = \sum_{\{k \mid D_k \in T(Q_i)\}} \bigl[ TF_j(D_k) \times wt(D_k) \bigr] \qquad (7)$$

Again, for the experiments reported below the replacement probability estimate is used as the weight. Finally, either TF formula could be combined with any way of computing DF. In the experiments reported below, the following combinations were tried:

Method     TF Formula   DF Formula
Pirkola    (1)          (2)
Kwok       (1)          (4)
MDF        (1)          (5)
WDF        (1)          (6)
WTF        (7)          (4)
WTF/DF     (7)          (6)

Another way of leveraging information about replacement probabilities is to simply ignore the least likely replacements. Such an approach offers two potential insights. First, it can reveal the extent of the adverse effect of low-probability replacements on each technique. Second, it offers a principled way of tuning the degree of comprehensiveness of the dictionary to optimize the retrieval effectiveness of each technique. Two teams (from the University of Massachusetts [9] and the University of Maryland [2]) tried variants of this approach for the TREC 2002 CLIR track. For the experiments reported below, a greedy technique was used in which replacements were retained in order of decreasing probability until a preset threshold on the cumulative probability was first exceeded. This approach guarantees that at least one replacement is retained. Mean uninterpolated average precision is reported for every threshold value between 0.1 and 1.0, in increments of 0.1.

The experiments were run using a modified version of (**removed for blind reviewing**), which is a vector space retrieval system that was developed locally using Okapi BM-25 weights.
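The weighted estimates in equations (6) and (7) and the greedy cumulative-probability cut-off can be sketched as follows. This is an illustrative reading rather than the code used in the experiments: the function names and the toy counts are hypothetical, the weight is taken to be the replacement probability itself, and no renormalization is applied after pruning because the text does not mention one.

```python
from typing import Dict, List, Tuple


def weighted_df(df: Dict[str, int], probs: Dict[str, float]) -> float:
    """Equation (6): each replacement's DF scaled by its replacement probability."""
    return sum(df.get(term, 0) * p for term, p in probs.items())


def weighted_tf(tf_in_doc: Dict[str, int], probs: Dict[str, float]) -> float:
    """Equation (7): each replacement's TF in one document scaled by its probability."""
    return sum(tf_in_doc.get(term, 0) * p for term, p in probs.items())


def prune_by_cumulative_probability(
    probs: Dict[str, float], threshold: float
) -> List[Tuple[str, float]]:
    """Keep replacements in decreasing probability order until the cumulative
    probability first exceeds the threshold; at least one is always kept."""
    kept: List[Tuple[str, float]] = []
    cumulative = 0.0
    for term, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((term, p))
        cumulative += p
        if cumulative > threshold:
            break
    return kept


# Invented numbers for the "superfluous" / "the" example.
probs = {"superfluous": 0.999, "the": 0.001}
df = {"superfluous": 2, "the": 900}
tf_in_doc = {"superfluous": 1, "the": 12}

wdf = weighted_df(df, probs)
wtf = weighted_tf(tf_in_doc, probs)
kept_half = prune_by_cumulative_probability(probs, 0.5)
kept_all = prune_by_cumulative_probability(probs, 1.0)
print(wdf, wtf, kept_half, kept_all)
```

With these numbers the weighted DF stays close to the DF of the likely translation instead of being dominated by "the", and a threshold of 0.5 discards the improbable translation entirely.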
Reported statistical significance tests were performed using a paired two-tailed t-test and are reported as significant for values of p < […].

3 CLIR

The CLIR experiments reported in this section were performed using the TREC 2002 CLIR track collection, which contains 383,872 articles from the Agence France Presse (AFP) Arabic newswire, 50 topic descriptions written in English, and associated relevance judgments [12]. Queries were formed automatically using all the words in the title field of the topic description, which is designed to be representative of the style of queries typically issued in Web search applications. The documents were stemmed using Al-Stem (a standard resource for the TREC CLIR track), diacritics were removed, and normalization was performed to convert the letters ya and alef maqsoura to ya, and all the variants of alef and hamza (namely alef, alef hamza, alef maad, hamza, waw hamza, and ya hamza) to alef. The English queries were stemmed before translation using the Porter stemmer for compatibility with the translation resources described below.

3.1 Estimating Replacement Probabilities

Five translation resources of three types were combined for the application. Combining resources is useful because (a) the coverage of the combined resources is typically better than that of any individual resource, and (b) combining resources can serve to reinforce good translations. The resources were as follows:

1. Two bilingual term lists that were constructed using two Web-based machine translation systems (Tarjim and Al-Misbar [16][17]). In each case, sets of isolated unique English words found in a 200 MB collection of Los Angeles Times news stories [10] were submitted for translation from English into Arabic. Each system returned at most one translation for each submitted word. Together, the two term lists covered about 15% of the unique Arabic stems in the TREC collection (measured by using Al-Stem on both the term list and the collection).
2. The Salmone Arabic-to-English dictionary (from Tufts University), from which we extracted only the translations. No translation preference information is indicated in this dictionary. The coverage of the resulting term list, measured in the same way, was about 7% of the unique Arabic stems in the TREC collection.
3. Two translation probability tables, one for English-to-Arabic and one for Arabic-to-English. These tables were constructed from tables provided by BBN, which were in turn constructed from a large collection of aligned English and Arabic United Nations documents using the Giza++ implementation of IBM's model 1 statistical machine translation design. The coverage of the Arabic-to-English table, measured in the same way, was 29% of the unique Arabic stems in the TREC collection.

These translation resources were combined in the following manner:

1. All resources that were originally provided as Arabic-to-English were inverted. For the translation probability table, the probabilities for each translation pair were retained and then the inverted tables were renormalized so that the values of the probabilities for each source-language term summed to one. This process likely introduced some error, since probabilities for rare events may not have been accurately estimated.
2. A uniform distribution was used to assign probabilities to the translations obtained from machine translation systems and the Salmone dictionary. Tarjim and Al-Misbar each returned at most one translation for an English word, but two English words might share a common translation. When n alternatives were known from a single source, each was assigned a probability of 1/n.
3. The resulting translation probabilities were then combined by summing the probabilities for a given Arabic translation across the sources in which it appeared and then dividing by the number of sources in which the English term had appeared. For example, if Tarjim, Al-Misbar and Salmone contained the English term, with Tarjim containing some specific translation with probability 1.0, Al-Misbar lacking that translation (i.e., assigning it a probability of 0.0), and Salmone assigning it a probability of 0.5 (because two translations were known), then the resulting combined probability would be 1.0/3 + 0.0/3 + 0.5/3 = 0.5.

The resulting translation resource contained what appeared to be reasonable estimates of translation probabilities, and covered 36% of the unique Arabic stems in the TREC collection.

3.2 Results

Figure 1 shows the mean uninterpolated average precision for each of the six structured query methods for each threshold value and Table 1 shows the same results in tabular form. As a baseline, one-best query translation (using only the most likely translation) was also run. This widely reported baseline seems appropriate in this case because any cumulative probability threshold will result in use of at least the most probable translation for each query term. Kwok's and Pirkola's methods turned out to be essentially indistinguishable, with the MDF method performing nearly as well (statistically significantly worse only at threshold values of 0.2 and 0.3). The WTF/DF method produced results that were statistically significantly better than the one-best baseline for every threshold value except 0.1 and 1.0. Moreover, WTF/DF was the only one of the probabilistic techniques that did not exhibit a dramatic decrease in effectiveness as the threshold increased. The best WTF/DF result (at a threshold of 0.6) is statistically indistinguishable from the best result of Pirkola, Kwok, or MDF (in each case, at a threshold of 0.4), but the reduced dependence on accurate tuning of the threshold makes WTF/DF clearly the preferred method.

Table 1: CLIR: Mean average precision, title queries.
Black (gray) cells represent statistically better (worse) results, compared to the one-best translation baseline.

[Table 1 body not preserved: one row for the baseline and one for each structured query method, one column per cumulative probability threshold.]

[Figure 1 plot not preserved: mean average precision for title queries against threshold, one curve per structured query method plus the baseline.]

Figure 1: CLIR: Dependence of retrieval effectiveness on cumulative probability threshold, title queries.

4 OCR-BASED RETRIEVAL

Previous approaches to retrieval of OCR-degraded text have focused primarily on correcting OCR errors [7][15] or on fuzzy matching techniques that are less sensitive than exact string matching to OCR errors [1][5]. This section demonstrates the generality of the query-time replacement techniques developed above, using them to combine TF and DF evidence for a novel technique which attempts to replace

each query term with possible OCR distortions of the term and to estimate the probability of the replacements. The experiments were conducted with the Zad collection, which was obtained from the University of Maryland [3]. The collection comprises 2,730 documents extracted from Zad Al-Me'ad, a printed book for which an accurately character-coded electronic version (the "clean text") is also available [3]. Three sets of OCR outputs for the same documents were available: print resolution (300x300 dots per inch (dpi)) as originally scanned, and downsampled versions at fine fax resolution (200x200 dpi) and standard fax resolution (200x100 dpi). The test collection includes 25 written topic descriptions and associated relevance judgments. Character normalizations were performed as described above, and character 3-grams (3g) or character 4-grams (4g) were indexed. Darwish and Oard found those index terms to be among the most effective for OCR-based retrieval of Arabic [3].

4.1 Estimating Replacement Probabilities

Term replacement probabilities were estimated using a position-sensitive unigram character distortion model trained on 5,000 words of aligned clean and distorted texts from the collections being searched. The alignment was designed to simulate manual error correction of a small portion of the collection.^3 Since the appearance of Arabic characters varies by position, the standard four character positions (beginning, middle, end, isolated) were modeled. Formally, given a clean word with characters C_1..C_i..C_n and the resulting word after OCR degradation D_1..D_j..D_m, where D_j resulted from C_i, ε is the null character, L is the position of the letter in the word (beginning, middle, end, or isolated), and # is the word boundary, the three edit operations for the models would be:

$$P_{substitution}(C_i \rightarrow D_j) = \frac{count(C_i \rightarrow D_j \mid L_{C_i})}{count(C_i \mid L_{C_i})}$$

$$P_{deletion}(C_i \rightarrow \varepsilon) = \frac{count(C_i \rightarrow \varepsilon \mid L_{C_i})}{count(C_i \mid L_{C_i})}$$

$$P_{insertion}(\varepsilon \rightarrow D_j) = \frac{count(\varepsilon \rightarrow D_j)}{count(C)}$$

If the count in the numerator was zero, the computation would be repeated without conditioning on position. If the count remained zero, a value of zero was recorded. A separate model was trained for each resolution.
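The position back-off described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the trained model: the event format, the class name `DistortionModel`, and the choice of the position-independent count of the clean character as the back-off denominator are all hypothetical, and insertions are left out for brevity.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# One aligned character event: (clean character, position, distorted character).
# A distorted character of None stands for a deletion.
Event = Tuple[str, str, Optional[str]]


class DistortionModel:
    """Substitution and deletion probabilities with back-off on position."""

    def __init__(self, events: Iterable[Event]) -> None:
        self.pair_pos: Counter = Counter()  # (clean, position, distorted)
        self.char_pos: Counter = Counter()  # (clean, position)
        self.pair: Counter = Counter()      # (clean, distorted)
        self.char: Counter = Counter()      # clean
        for clean, pos, distorted in events:
            self.pair_pos[(clean, pos, distorted)] += 1
            self.char_pos[(clean, pos)] += 1
            self.pair[(clean, distorted)] += 1
            self.char[clean] += 1

    def prob(self, clean: str, pos: str, distorted: Optional[str]) -> float:
        # First try the estimate conditioned on the character position.
        n = self.pair_pos[(clean, pos, distorted)]
        if n > 0:
            return n / self.char_pos[(clean, pos)]
        # Numerator was zero: repeat without conditioning on position.
        n = self.pair[(clean, distorted)]
        if n > 0:
            return n / self.char[clean]
        # Still unobserved: record a probability of zero.
        return 0.0


# Invented training events for a single clean character "a".
events = [("a", "beginning", "a")] * 3 + [
    ("a", "beginning", "o"),
    ("a", "end", "e"),
    ("a", "end", None),
]
model = DistortionModel(events)
print(model.prob("a", "beginning", "o"))  # position-conditioned estimate
print(model.prob("a", "beginning", "e"))  # backs off to the unconditioned estimate
print(model.prob("a", "middle", "x"))     # never observed
```

Keeping separate counters for the conditioned and unconditioned events makes the back-off a simple lookup, at the cost of storing each training event twice.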
Two factors made automatic alignment of the OCR output to the clean text challenging. First, the printed and clean text versions in the Zad collection were obtained from different sources that exhibited minor differences (mostly substitution or deletion of particles such as "in", "from", "or", and "then"). Second, some areas in the scanned images of the printed page exhibited image distortions that resulted in relatively long runs of OCR errors. The alignment was performed using SCLITE from the National Institute of Standards and Technology (NIST). SCLITE employs a dynamic programming string alignment algorithm, which attempts to minimize the Levenshtein distance (edit distance) between two strings. Conceptually, the algorithm uses identical matches to anchor alignment, and then uses word position with respect to those anchors to estimate an optimal alignment on the remainder of the words. SCLITE was originally developed for speech recognition applications, but in OCR applications additional character-level evidence is available. SCLITE alignments were therefore accepted only if the number of character edit operations was less than or equal to 50% of the length of the shorter of the two matched words.

^3 Smaller and larger training sets were tried, but no improvement resulted from more than 5,000 words.

To align the words that were not aligned by SCLITE, the following algorithm was used:

1. Using the existing alignments as anchors, given an unaligned word at position l from the preceding anchor in a clean document, sequentially compare it to the words in the corresponding degraded document between the corresponding pair of anchors with position l' from the preceding anchor, where |l' - l| < […].
2. When comparing two words, if the difference between their respective word lengths was less than or equal to 2 characters and the number of edit operations between the two words (using Levenshtein's edit distance) was less than a certain percentage q of the word length of the shorter one (the percentage being the number of edit operations divided by the length of the shorter word), then the newly aligned words were used as anchors. Initially, q was set to 60%.
3. Steps 1 and 2 were iterated two more times using the new anchors with q equal to 40% and 20% to attempt to find more alignments.

This alignment technique works well for print resolution, but it is a significant source of errors for highly degraded cases (e.g., standard fax resolution). Each pair of aligned words was then aligned at the character level by finding the edit distance between the two words using the Levenshtein edit distance algorithm and then backtracing the algorithm to identify insertions, deletions, and substitutions. The resulting model was then used to assign a probability to possible distortions of each query term as follows:

1. For each character in a clean query term, generate all substitutions or deletions that have non-zero probability (i.e., were observed at least once in the training data). The unchanged character is generated at this step as a substitution.
2. For each possible insertion point, generate all possible single insertions. Possible insertion points are before the first character, between any pair of characters, and after the last character. A null insertion is generated at each point to cover the remainder of the probability mass.
3. For each string that could result from the power set of all possible substitutions or deletions and all possible insertions, compute the probability of generating that string as the product of the associated insertion, substitution, and deletion probabilities.

A more efficient implementation would be desirable in an operational setting, but this approach suffices for the experiments reported below.

4.2 Results

Figure 2 shows the mean uninterpolated average precision at print resolution for each of the six structured query methods for each threshold value and Table 2 shows the same data in

tabular form. Figure 3 and Table 3 present the corresponding results for fine fax resolution. As a baseline, the same index terms (3g or 4g) were run with the clean (undistorted) queries, since any cumulative probability threshold results in a superset of that baseline case. No statistically significant differences were observed at any resolution or threshold value between the Pirkola, Kwok, and MDF methods, which tends to confirm the observation made in the CLIR application that the simpler implementation of Kwok's method results in no significant adverse effect on retrieval effectiveness. For print resolution, every structured query technique achieved a statistically significant improvement over the baseline when used with the better of the two indexing terms (4g). Among these, WTF/DF both achieved the greatest improvement (9.7% relative) and exhibited the greatest range of threshold values over which the improvement was statistically significant (0.6 to 1.0). Therefore, as with CLIR, WTF/DF is clearly the preferred technique in this application. No statistically significant improvements over the baseline were observed for the fine fax resolution or the standard fax resolution (not shown). This may, however, reflect errors in the alignment of the training data rather than limitations in the replacement techniques that were tried. The same general trends are observable in Figure 3 as in Figure 2, so the use of WTF/DF is certainly not contraindicated for the fine fax condition.

[Figure 2 plots not preserved: two panels, "Print - 3grams" and "Print - 4grams", each showing threshold vs. mean average precision for the structured query methods and the baseline.]

Figure 2: Print: Dependence of retrieval effectiveness on cumulative probability threshold, title queries.

[Figure 3 plots not preserved: two panels, "Fine Fax - 3grams" and "Fine Fax - 4grams", each showing threshold vs. mean average precision for the structured query methods and the baseline.]

Figure 3: Fine fax: Dependence of retrieval effectiveness on cumulative probability threshold, title queries.

Table 2: Print: Mean average precision, title queries. Black (gray) cells represent statistically better (worse) results, compared to the clean query baseline.
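[Table 2 body not preserved: for each of 3g and 4g indexing, one row for the baseline and one for each structured query method, with one column per cumulative probability threshold.]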

Table 3: Fine fax: Mean average precision, title queries. Black (gray) cells represent statistically better (worse) results, compared to the clean query baseline.

[Table 3 body not preserved: for each of 3g and 4g indexing, one row for the baseline and one for each structured query method, with one column per cumulative probability threshold.]

5 CONCLUSION AND FUTURE WORK

This paper has introduced a family of methods for query term replacement that exploit estimates of replacement probabilities while also incorporating the vector space model's concept of document frequency. Both Kwok's method and MDF were found to achieve retrieval effectiveness similar to that obtained with Pirkola's structured query method, so Kwok's method seems to be a good basis from which to build probabilistic structured query methods. Coverage of rare translations was shown to be problematic for all three methods, however. Use of only the most likely translations was found to be effective and expedient, but only if an appropriate threshold on cumulative probability is used. Of the three probabilistic structured query methods introduced in this paper, WTF/DF was the clear winner, showing both the best retrieval effectiveness and the least sensitivity to the cumulative probability threshold. Finally, the novel approach of producing possible replacements for query terms that could have been generated by OCR proved to be a useful technique for improving retrieval of OCR-degraded text.

There are a number of interesting directions for future work suggested by these results:

1. Improved weighting techniques. The use of raw probability estimates as weights in the WTF/DF method seems intuitively appealing, but it is possible that some function of the probabilities (e.g., log p) may outperform raw probabilities. There are also opportunities to explore better smoothing methods when estimating the probabilities.
2. Other applications. The WTF/DF method can be used in any application where replacement probabilities can be reliably estimated. Examples of potential application areas are thesaurus expansion, speech-based retrieval, statistical approximations of morphology, and perhaps gene sequence matching.
3. Structured document indexing. Query processing and document processing exhibit a strong duality, so it may be possible to leverage some of the techniques developed here at indexing time rather than query time for applications such as stemming, translation-based indexing [11], speech retrieval, and OCR-based retrieval.

Variants of query term replacement are important in several information retrieval applications, and access to reliable estimates of replacement probabilities from corpus statistics is becoming increasingly common. The techniques described in this paper balance effectiveness and efficiency in ways that are likely to prove immediately useful, and they should additionally serve as a solid basis for future research on this important problem.

ACKNOWLEDGMENTS

***Removed for blind reviewing***

REFERENCES

[1] Baeza-Yates, R. and G. Navarro, A Faster Algorithm for Approximate String Matching. Proceedings of Combinatorial Pattern Matching (CPM'96), Springer-Verlag LNCS, v. 1075, pages 1-13.
[2] Darwish, K. and D. Oard, CLIR Experiments at Maryland for TREC 2002: Evidence Combination for Arabic-English Retrieval, TREC.
[3] Darwish, K. and D. Oard, Term Selection for Searching Printed Arabic, SIGIR 2002.
[4] Gey, F. and D. Oard, The TREC-2001 Cross-Language Information Retrieval Track: Searching Arabic Using English, French or Arabic Queries, TREC 2001.
[5] Harding, S., W. Croft, and C. Weir, Probabilistic Retrieval of OCR Degraded Text Using N-Grams. European Conference on Digital Libraries, 1997.
[6] Hiemstra, D., Using Language Models for Information Retrieval. Ph.D. thesis, University of Twente, Enschede.
[7] Hong, T., Degraded Text Recognition Using Visual and Linguistic Context. Ph.D. thesis, Computer Science Department, SUNY Buffalo.
[8] Kwok, K. L., Personal communication.
[9] Larkey, L., J. Allan, M. E. Connell, A. Bolivar, and C. Wade, UMass at TREC 2002: Cross Language and Novelty Tracks, TREC.
[10] NIST, Text Research Collection Volume 5, April.
[11] Oard, D. W. and F. Ertunc, Translation-Based Indexing for Cross-Language Retrieval, ECIR 2002.
[12] Oard, D. W. and F. Gey, The TREC-2002 Arabic/English CLIR Track, TREC.
[13] Pirkola, A., The Effects of Query Structure and Dictionary Setups in Dictionary-Based Cross-Language Information Retrieval, Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 55-63.
[14] Robertson, S. E., S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau, Okapi at TREC-3, In the Fourth Text REtrieval Conference (TREC-3), 1996.
[15] Taghva, K., J. Borsack, and A. Condit, An Expert System for Automatically Correcting OCR Output. Proceedings of the SPIE - Document Recognition.
[16] tarjim.ajeeb.com, Sakhr Technologies, Cairo, Egypt.
[17] ATA Software Technology Limited, North Brentford, Middlesex, UK.
[18] Xu, J., R. Weischedel, and C. Nguyen, Evaluating a Probabilistic Model for Cross-lingual Information Retrieval. In Proceedings of SIGIR, 2001.
