Exploiting association rules and ontology for semantic document indexing

Size: px

Start display at page:

Download "Exploiting association rules and ontology for semantic document indexing"

Gwendolyn Chase
5 years ago
Views:

1 Explotng assocaton rules and ontology for antc document ndexng Fatha Boubekeur IRIT-SIG, Paul Sabater Unversty of Toulouse, CEDEX 9, France Department of Computer Scences, Mouloud Mammer Unversty of Tz-Ouzou, 15000, Algera Mohand Boughanem IRIT-SIG, Paul Sabater Unversty of Toulouse, CEDEX 9, France Lynda Tamne-Lechan IRIT-SIG, Paul Sabater Unversty of Toulouse, CEDEX 9, France Abstract Ths paper descrbes a novel approach for document ndexng based on the dscovery of contextual antc relatons between concepts. The concepts are frst extracted from WordNet ontology. Then we propose to extend and to use the assocaton rules technque n order to dscover condtonal relatons between concepts. Fnally, concepts and related contextual relatons are organzed nto a condtonal graph. Keywords Informaton retreval, Conceptual ndexng, Assocaton rules, WordNet. 1 Introducton Informaton retreval (IR) s concerned wth selectng, from a collecton of documents, those that are lkely to be relevant to a user nformaton need expressed usng a query. Three basc functons are carred out n an nformaton retreval system (IRS): document and nformaton needs representaton and matchng of these representatons [6]. Document representaton s usually called ndexng. The man obectve of ndexng s to assgn to each document a descrptor represented wth a set of features, usually keywords, derved from the document content. Representng the user s nformaton need nvolves a one step or multstep query formulaton by means of pror terms expressed by the user and/or addtve nformaton drven by teratve query mprovements lke relevance feedback [17]. The man goal of document-query matchng, called also query evaluaton, s to estmate a relevance score used as a crteron to rank the lst of documents returned to the user as an answer to hs query. Most of IR models handle durng ths step, an approxmate matchng process usng the frequency dstrbuton of query terms over the documents. Numerous factors affect the IRS performances. Frst, nformaton sources mght contan varous topcs addressed wth a very wde vocabulary that lead to dfferent antc contexts. Ths consequently ncrease the dffculty of dentfyng the accurate antc context addressed by the user s query. Second, users often do not express ther queres n an optmal form consderng ther needs. Thrd, n classcal IR models, documents and queres are bascally represented as bags of weghted terms. A key characterstc of such models s that the degree of document query matchng depends on the number of the shared terms. Ths leads to a lexcal focused relevance estmaton whch s less effectve than a conceptual focused one [14]. Ths paper addresses these lmtatons by proposng a conceptual ndexng approach based on the ont use of ontology and assocaton L. Magdalena, M. Oeda-Acego, J.L. Verdegay (eds): Proceedngs of IPMU 08, pp Torremolnos (Málaga), June 22 27, 2008

2 rules. We thus expect to take advantages from (1) conceptual ndexng of documents and (2) from the contextual dependences between concepts dscovered by means of assocaton rules. The paper s structured as follows: Secton 2 ntroduces the problems we am to tackle, namely the term msmatch and ambguty n IR, then reports some related works and presents our motvatons. The proposed antc ndexng approach s detaled n secton 3. Secton 4 concludes the paper. 2 Related works and motvatons Most of classcal IR models are based on the well known technque of bag of words expressng the fact that both documents and queres are represented usng basc weghted keywords. The performances of such models suffer from the so-called keyword barrer [20] that has serous drawbacks especally at document representaton and query formulaton levels that lead to bad performances. Indeed, n such IRS, a relevant document wll not be retreved n response to a query f the document and query representatons do not share at least one word. Ths mples on one hand, that relevant documents are not retreved f they do not share terms wth the query; on the other hand, rrelevant documents that have common words wth the query are retreved even f these words are not antcally equvalent. Varous approaches and technques attempted to tackle these problems by enhancng the document representaton or query formulaton. Attempts n document representaton mprovements are related to the use of antcs n the ndexng process. In ths context, two man ssues could be dstngushed [3]: antc ndexng and conceptual ndexng. Semantc ndexng s based on technques for contextual word sense dsambguaton (WSD) [10], [13], [18], [21]. Indexng conssts n assocatng the extracted words of document or query, to words of ther own context [21]. Other dsambguaton approaches [20] use herarchcal representatons drven from ontologes to compute the antc dstance or antc smlarty [11], [12], [16] between words to be compared. Conceptual ndexng s based on usng concepts extracted from ontologes and taxonomes n order to ndex documents nstead of usng smple lsts of words [3], [7], [8], [9], [19]. The ndexng process runs manly n two steps: frst dentfyng terms n the document text by matchng document phrases wth concepts n the ontology, then, dsambguatng the dentfed terms. In [15] the FIS-CRM model focusses on the nterrelatons (synonymy and generalty ones) among the document terms n order to buld the related conceptual ndex. A concept s represented by the related document term, the set of ts fuzzy synonymous (extracted from a fuzzy dctonary), and ts more general concepts (extracted from a fuzzy ontology). A weght readustment process allows the added terms to have fuzzy weghts even f they do not occur n the document. Document clusterng s then nvolved usng concept coocurrence measures. We am to tackle the problems due to usng basc keyword based evdence sources n IR, provdng a soluton at the ndexng level. The proposed approach reles on the ont use of concepts and contextual relatons between them n order to ndex the document. Both concepts and related dependences are fnally organzed nto a graphcal representaton resultng n a graphcal representaton of the document ndex. Our approach allows a rcher and more accurate representaton of documents when supportng both contextual and antcal relatons between concepts. Our man propostons presented n ths paper concern a novel conceptual document ndexng approach based on concepts and contextual dependences between them. Ths approach s based on the use of the WordNet general ontology as a source for extractng representatve document concepts, and on assocaton rules based technques n order to dscover the latent concept contextual relatons. More precsely, our conceptual ndexng approach s supported by three man steps: (1) Identfyng the representatve concepts n a document, (2) Dscoverng contextdependent relatons between antcal enttes namely the concepts (rather than between lexcal enttes that are terms) usng a proposed varant of assocaton rules namely antc assocaton Proceedngs of IPMU

3 rules, and (3) combnng both concepts and related antc assocatons nto a compact graphcal representaton. 3 A conceptual document ndexng approach Ths secton detals our conceptual document ndexng approach based on (1) the explotaton of WordNet for dentfyng concepts, (2) assocaton rules for dscoverng contextual relatons between concepts. Both concepts and related relatons are fnally organzed nto a compact condtonal graph. 3.1 Outlne We propose to use WordNet ontology and assocaton rules n order to buld the document conceptual ndex. The document ndexng process s handled through three man steps: (1) Identfyng the representatve concepts of a document: ndex concepts are dentfed n the document usng WordNet. (2) Mnng assocaton rules: contextual relatons between selected concepts are dscovered usng assocaton rules. (3) Buldng a graphcal conceptual ndex: both concepts and related relatons are then organzed nto a graph. Nodes n the graph are concepts and edges traduce relatons between concepts. The above steps are detaled n the followng. 3.2 Index concepts dentfcaton Overvew The whole process of dentfyng the representatve concepts of the document has been well descrbed n [4], we summarze t n the followng. The process reles on the followng steps: (1) Term dentfcaton: the goal of ths step s to dentfy sgnfcatve terms n a document. These terms correspond to entres n the ontology. (2) Term weghtng: n ths step, an alternatve of tf*df weghtng scheme s proposed. The underlyng goal s to elmnate the less frequent (less mportant) terms n a document n order to select only the most representatve ones (ndex terms). (3) Dsambguaton: ndex terms are assocated wth ther correspondng concepts n the ontology. As each extracted term could have many senses (related to dfferent concepts), we dsambguate t usng smlarty measures. A score s thus assgned to each concept based on ts antc dstance to other concepts n the document. The selected concepts are those havng the best scores Basc Notons Terms are represented as lsts of word strngs (a word strng s a character strng that represents a word). The length of a term t, noted t, s then the number of words n t. A mono-word term conssts n a one word lst. A mult-word term conssts n a word lst of more than one word. Let t be a term represented as a lst of words, t [w 1, w 2,,w,., w l ]. Elements n t could be dentcal, representng dfferent occurrences of a same word n t. We note w the th word n t. We recursvely defne the poston of the word w n the lst t by the followng: pos pos t t ( w1 ) 1 ( w ) pos ( w ) t 1 + 1, 1.. l Let t 1 [ w w,..., w m ], t [ y y,..., y l ] 1, 2 gven terms. 2 1, 2 be two Defnton 1 t 2 s a sub-term of t 1 f the whole sequence of words n t 2 occurs n t 1. t 1 s then called sur-term of t Identfcaton of ndex terms Before any document processng, n partcular before prunng stop words, an mportant process for the next steps conssts n extractng monoword and mult-word terms from texts that correspond to entres n WordNet. The technque we propose performs a word by word analyss of the document. It s descrbed n the followng: 466 Proceedngs of IPMU 08

4 Let w be the next word (assumed not to be a stop word), to analyze n the document d. We extract from WordNet, the set S of terms contanng w : let S {C 1, C 2,... C m}, S s composed of mult-word and mono-word terms. S s ranked n decreasng order of term length as follows: S {C (1), C (2), C (m)} where () (1)..(m) s an ndex permutaton such as C (1) C (2) C (m). Terms wth dentcal szes are ndfferently placed one besde another. For each element C () n S, we note Pos C () (w ) the poston of w n the C () lst of words. There are pos C w ( ) ( ) 1 words on the left of w n C (). Let Pos d (w ) be the poston of w n d lst of words. Defnton 2 The relatve context of w occurrence n document d gven the term C (), s the word lst CH defned by: CH sub(d, p,l) where: ( ) ( ) p posd w pos w 1 and l C C( ) (). We so extract the relatve context of w n d, namely CH sub(d, p,l) (c.f. Fg. 1), then we compare word strng lsts CH and C ().. Fgure 1: Identfyng word context n d If CH C (), the C ( +1 ) term of S s analyzed, otherwse, term t k C () s dentfed. The next word to analyze n d s w such that pos w p. ( ) l d + Durng the dentfcaton process, three cases can happen as shown n Fg. 2: Case a) Current dentfed term t k s completely dsont from t k-1. It could be dentcal but we do not treat denttes at ths level. It s thus a new term whch wll be retaned n document descrpton. Case b) Term t k covers partally term t k-1. The two terms are thus dfferent and both dentfed even f they have common words. Case c) Term t k covers completely one or more precedng adacent terms t k-1 t, k-1. In ths case, to allow an effectve dsambguaton, we retan the longest term, namely t k, and elmnate the adacent terms t covers (t k-1... t, k-1). case a) case b) case c) Fgure 2: Term dentfcaton By ths step, we wll have dentfed the set T(d) of all the mult-word or mono-word terms that compose d: T(d) {t 1, t 2,.., t n }. We fnally compute each term frequency n d and elmnate redundant terms, whch leads to: T (d){(t 1,Occ 1 ), (t 2,Occ 2 ),, (t m, Occ m ) / t d, Occ Occ(t ) s the occurrence frequency of t n d, 1 m and m n } Term weghtng Term weghtng assgnes to each term a weght that reflects ts mportance n a document. In the case of mono-word terms, varants of the tf*df formula are used, expressed as follows: W t,d tf(t)*df(t), tf s term frequency, df s the nverse N document frequency such as df () t log df () t, N s the number of documents n the corpus and df(t) the number of documents n the corpus that contan the term t. In the case of mult-word terms, term weghtng approaches use n general statstcal and/or syntactcal analyss. Roughly, they add sngle word frequences or multply the number of term occurrences by the number of sngle words belongng to ths term. We propose a new weghtng formula as varant of tf*df defned as follows: let W t,d be the weght assocated wth the term t n the document d, T (d) the term set of d, Sub (t) T (d) a sub-term of t, Sur (t ) T (d) a sur-term of t, S(t) the set of synsets C assocated wth term t. We defne the probablty that t s a possble sense of Sub (t) as follows: Proceedngs of IPMU

5 ( S( Sub () t ) P t { C S ( Sub () t )/ t C} S Sub () t ( ) We then defne W t, d tf (t) * df(t) as follows: W t, d Occ( t) + Occ( Sur ( t)) [ P( t S( Sub ( t) ) Occ( Sub ( t) )] N * ln df ( t) Where: N s the total number of documents n the corpus, df(t) (document frequency) s the number of documents n the corpus that contan the term t. The underlyng dea s that the global frequency of a term n a document s quantfed on the bass of both ts own frequency and ts frequency wthn each of ts sur-terms as well as ts probable frequency wthn ts sub-terms. Document ndex, Index(d), s then bult by keepng only those terms whose weghts are greater than a fxed threshold Term dsambguaton Each term t n Index(d) may have a number of related senses (.e. WordNet synsets). Let S {C 1,, C,..., C n } be the set of all synsets assocated wth term t. Thus, t has S n senses. We beleve that each ndex term contrbutes to the antc content representaton of d wth only one sense (even f that s somewhat erroneous, snce a term can have dfferent senses n the same document, but we consder here the only domnatng sense). From where, we must choose, for each term belongng to the document ndex (t Index(d)), ts best sense n d. Ths s term dsambguaton. Our dsambguaton approach reles on the computaton of a score (C_Score) for every concept (sense) assocated wth an ndex term. That s, for a term t, the score of ts th sense, noted C, s computed as: Score ( C ) l ( C C ) l WC, d WC, d Dst, k l [ 1,.., m] 1 k nl l k Where m s the number of concepts from Index(d), n l represents the number of WordNet senses whch s proper to each term t l, l Dst C, C s a antc relatedness measure ( ) k between concepts and as defned n [11], [12], [16] and s the weght assocated wth concept C C W C, d l C k, formally defned as follows: C S, W W C, d t, d The concept-sense whch maxmzes the score s then retaned as the best sense of term t. The underlyng dea s that the best sense for a term t n the document d may be the one strongly correlated wth senses assocated wth other mportant terms n d. The set N ( d ) of the selected senses represents the antc core of the document d. 3.3 Dscoverng assocatons between concepts Assocaton rules ntroduced n [1] am to generate all the sgnfcant assocatons between tems n a transactonal database. In our context, assocaton rules are used n order to dscover sgnfcant assocatons between concepts. The formal model s defned n the followng: Let ( d ) { C, C2,..., C,...} N be the antc core 1 of the document d. Each concept C s represented by the term t refers to n the document. The set of ts synonymous defnes ts Dom C C C, C,.... Each value doman ( ) { } value n Dom(C ), s a concept 1, 2 3 that C and C are synonymous. Thus, let η ( d ) {( C, Dom( C ))} ( d ) {( X, Dom( X ))} C N( d ) such, we smply note η the antc core of document d. We propose to use assocaton rules mnng to dscover latent (hdden) contextual relatons between the ndex concepts. Concepts are antc enttes. As assocaton rules allow dscoverng relatons between lexcal enttes, namely terms, we propose to extend them n order to support antc entty assocatons (namely concept assocatons). The proposed extenson s a revsed verson of the one defned n [4]. 468 Proceedngs of IPMU 08

6 Defnton 3 A antc assocaton rule between X and Y η( d ), we note X Y, s defned as follows: X X Dom Y X Y ( X ), Y Dom( Y ) Such that X Y s an assocaton between terms (frequent 1-temsets) X and Y. The ntutve meanng of X Y rule s that f a document s about a concept X, t tends to also be about the concept Y. The aboutness of a document expresses the topc focus of ts content. Ths nterpretaton also apples to the rule X Y. Thus, the rule R : X Y expresses the probablty that the document s about Y knowng that t s about X. The confdence assocated wth R thus rely on the mportance degree of Y n a document d, knowng the mportance degree of X n d. It s formally defned n the followng. Defnton 4 The confdence of the rule R : X Y s formally gven by: Confdence ( R) Support Support mn ( X and Y ) ( X ) ( W W ) X, d, W X, d Y, d Defnton 5 The confdence of a antc assocaton rule R : X Y s defned as: Confdence ( X Y ) max Confdence, X R : X X ( ) ( ) Dom X, Y Dom Y Remark 1. Confdence ( X Y ) always equal 1. In our context, the support of a antc assocaton rule X Y reles on the amount of ndvdual assocaton rules X Y ( X Dom( X ) and Y Dom( Y ), havng confdence greater than the fxed mnmum confdence threshold mnconf1. The support s formally defned n the followng. Defnton 6 The support of the rule R : X Y s formally gven by: / Support ( R) { X Y / Confdence( X Y ) mn conf } { X Y, ( X, Y ) Dom( X ) Dom( Y )} We propose to dscover relatons between concepts n η(d) by means of antc assocaton rules. Semantc assocaton rules are based, n our context, on the followng prncples: (1) a transacton s a document, (2) tems are values from concept domans, (3) an temset s a concept, (4) a antc assocaton rule X Y defnes an mplcaton (.e. a condtonal relaton) between concepts X and Y. By usng assocaton rules, we am to buld a condtonal herarchcal structure of the topc focus of the document content. That s to say, we am at structurng concepts descrbng the document, n a condtonal herarchy whch s supported by the antcs of extracted assocaton rules. The problem of dscoverng assocaton rules between concepts s devded nto two steps, followng the prncple by A-pror algorthm.. Frst, dentfyng all the frequent 1-temsets, correspondng to ndvdual concepts. A frequent concept s n our context, a concept that have a weght greater than a fxed mnmum threshold. Second, mnng assocaton rules between the frequent temsets. The obectve of mnng antc assocaton rules between concepts s to retan the only ones that have a support and a confdence greater than a fxed mnmum threshold mnsup and mnconf respectvely. Some problems can occur when dscoverng assocaton rules, such as redundances and cycles. Redundant rules come generally from transtve propertes: X Y, Y Z and X Z. In order to elmnate redundancy we propose to buld the mnmal covers of the set of extracted rules (that s the mnmal set of non transtve rules). The exstence of cycles n the graph would be due to the smultaneous dscovery of assocaton rules X Y and Y X, or of assocaton rules such as X Y, Y Z and Z X. To solve ths problem, we elmnate the weakest rule Proceedngs of IPMU

7 havng the lowest support, among that leadng to the cycle. If all the rule supports are equal, we randomly elmnate one rule n the cycle. Fnally, concepts and related assocaton rules are structured nto a network leadng to a graphcal ndex. The process of buldng the ndex graph s based on the followng prncples: Nodes n the graph are are concepts from the antc core of the document d, η ( d ) {( C, Dom( C ))}. Thus, nodes n the graph are represented by varables attached to the concepts C, nstantated n the value doman Dom C C C, C,.... ( ) { } 1, 2 3 Node relatons are condtonal dependences dscovered between concepts, by means of antc assocaton rules. Each node X n the graph s annotated by a uncondtonal table named T(X) such that: ( X ), T ( X ) WX d X Dom, (1) In a prevous work [4], we used partcularly CP- Nets [5] as the graphcal model supportng ths conceptual ndex. 3.4 Illustraton The document ndexng process presented above s llustrated through the followng example. Let d((u.c 1.,0.5),(Metropols,0.9), (Impovershment, 0.1),(People,0.4),(Poorness, 0.7), ) a document descrbed by the gven weghted concepts. Metropols, and U.C are synonymes of Cty, thus U.C and Metropols pertan to Cty concept node doman. Smlarly, both of Impovershment and Poorness pertan to Poverty concept node doman, whereas People s assocated wth populaton concept node. That s to say η(d) {(Cty, Dom (Cty)), (Poverty, Dom(Poverty)), (Populaton, Dom(Populaton)) Dom(Cty) {Metropols, U.C}, Dom(Poverty) {Poorness,Impovershment}, Dom(Populaton) {People}. 1 U.C. Urban Center We am to dscover assocatons between Cty, Populaton and Poverty nodes. Applyng Apror algorthm [2] leads: (1) to extract frequent temsets, then (2) to generate assocaton rules between frequent 1- temsets. Relatons of nterest are between ndvdual concepts n the document (rather than between sets of concepts), thus we only have to compute the k-temsets, k1, 2. Assumng a mnmum support threshold mnsup 0.1, the extracted frequent k-temsets (k 1, 2) are gven n Table 1. Support(Impovershment) < msup : the 1- temset Impovershment s not frequent, t s then pruned. The extracted assocaton rules are gven n Table 2. Table 1: Generatng frequent k-temsets 1-Itemsets Itemset Support Metropols 0.9 U.C 0.5 Impovershment 0.1 Poorness 0.7 Frequent 2- temsets Table 2: R 1 : Metropols People R 3 : Metropols Poorness R 5 : U.C People R 7 : U.C Poorness R 9 : People Poorness People 0.4 Metropols, People 0.4 Metropols, Poorness 0.7 U.C, People 0.4 U.C, Poorness 0.5 People, Poorness 0.4 Generated assocaton rules R 2 : People Metropols R 4 : Poorness Metropols R 6 : People U.C R 8 : Poorness U.C R 10 : Poorness People By applyng the formula gven n defnton 4, generated rule confdences are computed leadng to the results gven n Table 3. If we suppose a confdence mnmum threshold mnconf 1, we retan only rules whose confdences are equal or greater than mnconf. The selected rules are shown n Table Proceedngs of IPMU 08

8 Table 3: Confdence rules R R 1 R 3 R 5 R 7 R 9 Confdence(R ) assocated wth concept nodes Populaton, Cty and Poverty respectvely, usng formula 1, whch leads to the graphcal document ndex gven n Fg. 3. R 2 R 4 R 6 R 8 R These rules are frst used to buld antcal assocaton rules that n fact correspond to the ndex graph concept- node relatons. Thus, we deduce: (1) From R 2 : People Metropols and R 6 : People U.C: Populaton Cty (2) From R 4 : Poorness Metropols : Poverty Cty (3) From R 7 : U.C Poorness : Ct y Poverty (4) From R 9 : People Poorness : Populaton Poverty Table 4: Selected assocaton rules R 2 : People Metropols R 4 : Poorness Metropols R 6 : People U.C R 7 : U.C Poorness R 9 : People Poorness We then compute the support of each antc assocaton rule. The results are gven n Table 5. Table 5: Semantc assocaton rules supports Populaton Cty 1 Poverty Cty 0.5 Cty Poverty 0.5 Populaton Poverty 1 Fgure 3: A graphcal document ndex 4 Concluson We descrbed n ths paper a novel approach for conceptual document ndexng. The approach s founded on the ont use of both ontology for dentfyng, weghtng and dsambguatng terms, and assocaton rules to derve context dependent relatons (the context here refers to the document content) between terms leadng to a more expressve document representaton. The approach foundaton s not new but we proposed new technques for dentfyng, weghtng and dsambguatng terms and for dscoverng relatons between related concepts by means of the proposed antc assocaton rules. Semantc assocaton rules allow to derve context dependent relatons between concepts leadng to a more expressve document representaton. The work s stll n progress and results from a comparatve study wth other exstng conceptual ndexng approaches wll be soon avalable. We obvously retan these rules that have a support equal to 1. Two assocatons exst between concepts Cty and Poverty, wth the same support. We thus randomly keep one of them. Suppose Poverty Cty s retaned. The set of selected antc assocaton rules s hghlghted n Table 5. Clearly, retanng the three rules wll lead to a cycle n the ndex graph. To avod ths, we prune the weakest rule (that s the one wth lowest support) Poverty Cty. Fnally, the only selected antc rules are the followng: Populaton Cty and Populaton Poverty. CPT s are then References [1] R.Agrawal, T. Imelnsk, and A. Swam (1993). Mnng assocaton rules between sets of tems n large databases. In Proceedngs of the ACMSIGMOD Conference on Management of data, pages , Washngton, USA, [2] R. Agrawal and R. Srkant (1994). Fast algorthms for mnng assocaton rules. In Proceedngs of the 20th VLdB Conference, pages Proceedngs of IPMU

9 [3] M. Bazz, M.Boughanem, N.Aussenac- Glles N. and C. Chrsment (2005). Semantc Cores for Representng document n IR. In SAC' th ACM Symposum on Appled Computng, pages Santa Fe, New Mexco, USA, [4] F. Boubekeur, M. Boughanem and L. Tamne-Lechan (2007). Semantc Informaton Retreval Based on CP-Nets, n IEEE Internatonal Conference on Fuzzy Systems, London, U.K, uly [5] C. Boutler, R. Brafman, H. Hoos and D. Poole (1999). Reasonng wth Condtonal Ceters Parbus Preference Statements. In Proceedngs of UAI-1999, pages [6] W.B. Croft (1993). Knowledge-based and statstcal approaches to text retreval. In IEEE Expert, 8(2), pages [7] N. Guarno, C. Masolo and G. Vetere (1999). OntoSeek: Usng Large Lngustc Ontologes for Accessng On-Lne Yellow Pages and Product Catalogs. Natonal Research Councl, LADSEBCNR, Padavo, Italy, [8] J. Gonzalo, F. Verdeo, I. Chugur, J. Cgarran. (1998). Indexng wth WordNet synsets can mprove retreval. In proceedngs of COLING/ACL Work, [9] L.R Khan (2000). Ontology-based Informaton Selecton. Phd Thess, Faculty of the Graduate School, Unversty of Southern Calforna. [10] R. Krovetz and W.B. Croft (1992). Lexcal Ambguty and Informaton Retreval. In ACM Transactons on Informaton Systems, 10(1). [11] C. Leacock, G.A. Mller and M. Chodorow (1998). Usng corpus statstcs and WordNet relatons for sense dentfcaton. Computatonal Lngustcs 24(1,) pages [12] D. Ln (1998). An nformaton-theoretc defnton of smlarty. In Proceedngs of 15th Internatonal Conference On Machne Learnng. [13] R. Mhalcea, and D. Moldovan (2000). Semantc ndexng usng WordNet senses. In Proceedngs of ACL Workshop on IR and NLP, HongKong. [14] M. Mauldn, J. Carbonell and R. Thomason (1987). Beyond the keyword barer: knowledge-based nformaton retreval. Informaton servces and use 7(4-5), pages [15] J.A. Olvas, P.J. Garcés and F.P. Romero (2003). An applcaton of the FIS-CRM model to the FISS metasearcher: Usng fuzzy synonymy and fuzzy generalty for representng concepts n documents. Internatonal Journal of Approxmate Reasonng, Volume 34, Issues 2-3, pages , November [16] P. Resnk (1999). Semantc Smlarty n a Taxonomy: An Informaton-Based Measure and ts Applcaton to Problems of Ambguty n Natural Language. Journal of Artfcal Intellgence Research (JAIR), 11. pages [17] J.J. Roccho (1971). Relevance feedback n nformaton retreval. In The SMART Retreval System-experments n Automatc Document Processng, G.Salton edtor, Prentce-Hall, Englewood Clffs, NJ, pages [18] M. Sanderson (1994). Word sense dsambguaton and nformaton retreval. In Proceedngs of the 17th Annual Internatonal ACM-SIGIR Conference on Research and development n Informaton Retreval, pages [19] M.A. Starmand and J.B. Wllam (1996). Conceptual and Contextual Indexng of documents usng WordNet-derved Lexcal Chans. In Proceedngs of 18th BCS-IRSG Annual Colloquum on Informaton Retreval Research. [20] E.M. Voorhees (1993). Usng WordNet to dsambguate Word Senses for Text Retreval. In Proceedngs of the 16thAnnual Conference on Research and development n Informaton Retreval, SIGIR'93, Pttsburgh, PA. [21] D. Yarowsky (1995). Unsupervsed word sense dsambguaton rvalng supervsed methods. In 33rd Annual Meetng, Assocaton for Computatonal Lngustcs, pages , Cambrdge, Massachusetts, USA. 472 Proceedngs of IPMU 08

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Informaton Search and Management Probablstc Retreval Models Prof. Chrs Clfton 7 September 2018 Materal adapted from course created by Dr. Luo S, now leadng Albaba research group 14 Why probabltes