Boolean and Vector Space Retrieval Models

Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE), who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong).

Retrieval Models
- A retrieval model specifies the details of:
  - Document representation
  - Query representation
  - Retrieval function
- Determines a notion of relevance.
- Notion of relevance can be binary or continuous (i.e. ranked retrieval).

Common Preprocessing Steps
- Strip unwanted characters/markup (e.g. HTML tags, punctuation, numbers, etc.).
- Break into tokens (keywords) on whitespace.
- Stem tokens to "root" words: computational → comput
- Remove common stopwords (e.g. a, the, it, etc.).
- Detect common phrases (possibly using a domain-specific dictionary).
- Build inverted index (keyword → list of docs containing it).

Boolean Model
- A document is represented as a set of keywords.
- Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope.
  - [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton
- Output: Document is relevant or not. No partial matches or ranking.

Boolean Retrieval Model
- Popular retrieval model because:
  - Easy to understand for simple queries.
  - Clean formalism.
- Boolean models can be extended to include ranking.
- Reasonably efficient implementations possible for normal queries.

Boolean Models: Problems
- Very rigid: AND means all; OR means any.
- Difficult to express complex user requests.
- Difficult to control the number of documents retrieved.
  - All matched documents will be returned.
- Difficult to rank output.
  - All matched documents logically satisfy the query.
- Difficult to perform relevance feedback.
  - If a document is identified by the user as relevant or irrelevant, how should the query be modified?
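As a rough illustration of the Boolean model, the sketch below builds an inverted index (keyword to set of document IDs) and evaluates the example query with set operations. The toy collection, the helper names `build_inverted_index`, `term`, and `boolean_query`, and the reading of the query with OR between its two bracketed groups are all assumptions made for this example, not part of the slides.

```python
from collections import defaultdict

# Hypothetical toy collection (not from the slides): doc ID -> text.
DOCS = {
    1: "hotel in Rio Brazil near the beach",
    2: "Hilton hotel in Hilo Hawaii",
    3: "budget hotel in Hilo Hawaii",
    4: "Rio Brazil carnival guide",
}


def build_inverted_index(docs):
    """Map each lowercased whitespace token to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


INDEX = build_inverted_index(DOCS)
ALL_DOCS = set(DOCS)


def term(word):
    """Return the set of documents containing the keyword (empty set if unseen)."""
    return INDEX.get(word.lower(), set())


def boolean_query():
    """Evaluate [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton."""
    place = (term("Rio") & term("Brazil")) | (term("Hilo") & term("Hawaii"))
    # AND is set intersection, OR is union, NOT is complement against all docs.
    return place & term("hotel") & (ALL_DOCS - term("Hilton"))


print(sorted(boolean_query()))  # [1, 3]
```

Each document is either in the result set or not; nothing in the output says whether document 1 or document 3 is the better match, which is the ranking limitation listed above.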
Statistical Models
- A document is typically represented by a bag of words (unordered words with frequencies).
- Bag = set that allows multiple occurrences of the same element.
- User specifies a set of desired terms with optional weights:
  - Weighted query terms: Q = < database 0.5; text 0.8; information 0.2 >
  - Unweighted query terms: Q = < database; text; information >
  - No Boolean conditions specified in the query.

Statistical Retrieval
- Retrieval based on similarity between query and documents.
- Output documents are ranked according to similarity to query.
- Similarity based on occurrence frequencies of keywords in query and document.
- Automatic relevance feedback can be supported:
  - Relevant documents "added" to query.
  - Irrelevant documents "subtracted" from query.

Issues for Vector Space Model
- How to determine important words in a document?
  - Word sense?
  - Word n-grams (and phrases, idioms, ...) → terms
- How to determine the degree of importance of a term within a document and within the entire collection?
- How to determine the degree of similarity between a document and the query?
- In the case of the web, what is a collection and what are the effects of links, formatting information, etc.?

The Vector-Space Model
- Assume $t$ distinct terms remain after preprocessing; call them index terms or the vocabulary.
- These "orthogonal" terms form a vector space.
  - Dimension = $t$ = |vocabulary|
- Each term, $i$, in a document or query, $j$, is given a real-valued weight, $w_{ij}$.
- Both documents and queries are expressed as $t$-dimensional vectors:
  $d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$

Graphic Representation
- Example:
  $D_1 = 2T_1 + 3T_2 + 5T_3$
  $D_2 = 3T_1 + 7T_2 + T_3$
  $Q = 0T_1 + 0T_2 + 2T_3$
- [Figure: $D_1$, $D_2$, and $Q$ drawn as vectors in the three-dimensional space with axes $T_1$, $T_2$, $T_3$.]
- Is $D_1$ or $D_2$ more similar to $Q$?
- How to measure the degree of similarity? Distance? Angle? Projection?

Document Collection
- A collection of $n$ documents can be represented in the vector space model by a term-document matrix.
- An "entry" in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document or it simply doesn't exist in the document.

\[
\begin{array}{c|cccc}
       & T_1    & T_2    & \cdots & T_t    \\ \hline
D_1    & w_{11} & w_{21} & \cdots & w_{t1} \\
D_2    & w_{12} & w_{22} & \cdots & w_{t2} \\
\vdots & \vdots & \vdots &        & \vdots \\
D_n    & w_{1n} & w_{2n} & \cdots & w_{tn}
\end{array}
\]
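To make the term-document matrix concrete, here is a minimal sketch that turns a few documents into bags of words and then into matrix rows. The three toy documents are invented for illustration, and raw term counts stand in for the weights $w_{ij}$; any other weighting scheme could be substituted.

```python
from collections import Counter

# Hypothetical toy documents (not from the slides), already preprocessed.
docs = [
    "database text retrieval",
    "text text information",
    "database management",
]

# Bag of words: unordered terms with their frequencies.
bags = [Counter(doc.split()) for doc in docs]

# Index terms T_1..T_t: every distinct term, in a fixed (sorted) order.
vocabulary = sorted(set().union(*bags))

# Row j is document D_j; column i holds the weight of term T_i in D_j.
# Here the weight is simply the raw count; 0 means the term is absent.
matrix = [[bag[t] for t in vocabulary] for bag in bags]

print(vocabulary)
for row in matrix:
    print(row)
```

With the vocabulary sorted alphabetically, the second document becomes the row `[0, 1, 0, 0, 2]`: "information" once, "text" twice, and zeros for the terms it does not contain.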
Term Weights: Term Frequency
- More frequent terms in a document are more important, i.e. more indicative of the topic.
  $f_{ij}$ = frequency of term $i$ in document $j$
- May want to normalize term frequency (tf) so that values are comparable across the entire corpus, dividing by the frequency of the most common term in document $j$:
  $\mathrm{tf}_{ij} = f_{ij} / \max_i \{f_{ij}\}$

Term Weights: Inverse Document Frequency
- Terms that appear in many different documents are less indicative of overall topic.
  $\mathrm{df}_i$ = document frequency of term $i$ = number of documents containing term $i$
  $\mathrm{idf}_i$ = inverse document frequency of term $i$ = $\log(N / \mathrm{df}_i)$ ($N$: total number of documents)
- An indication of a term's discrimination power.
- Log used to dampen the effect relative to tf.

TF-IDF Weighting
- A typical combined term importance indicator is tf-idf weighting:
  $w_{ij} = \mathrm{tf}_{ij} \cdot \mathrm{idf}_i = \mathrm{tf}_{ij} \cdot \log(N / \mathrm{df}_i)$
- A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
- Many other ways of determining term weights have been proposed.
- Experimentally, tf-idf has been found to work well.

Computing TF-IDF: An Example
- Given a document containing terms with given frequencies: A(3), B(2), C(1)
- Assume collection contains 10,000 documents and document frequencies of these terms are: A(50), B(1300), C(250)
- Then (the values shown correspond to the natural logarithm):
  A: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
  B: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
  C: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2

Query Vector
- Query vector is typically treated as a document and also tf-idf weighted.
- Alternative is for the user to supply weights for the given query terms.

Similarity Measure
- A similarity measure is a function that computes the degree of similarity between two vectors.
- Using a similarity measure between the query and each document:
  - It is possible to rank the retrieved documents in the order of presumed relevance.
  - It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.
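The TF-IDF example in this section can be reproduced with a short sketch. It assumes the term frequencies A(3), B(2), C(1), the document frequencies A(50), B(1300), C(250), normalization by the most frequent term in the document, and the natural logarithm, which is the base that matches the idf values printed on the slide. The function name `tf_idf` is a hypothetical helper.

```python
import math

N_DOCS = 10_000                                  # total documents in the collection
freq = {"A": 3, "B": 2, "C": 1}                  # f_ij: raw counts in the document
doc_freq = {"A": 50, "B": 1300, "C": 250}        # df_i: documents containing each term


def tf_idf(freq, doc_freq, n_docs):
    """Return {term: tf * idf} with tf normalized by the most frequent term."""
    max_f = max(freq.values())
    weights = {}
    for term, f in freq.items():
        tf = f / max_f                            # tf_ij = f_ij / max_i f_ij
        idf = math.log(n_docs / doc_freq[term])   # natural log (assumption)
        weights[term] = tf * idf
    return weights


weights = tf_idf(freq, doc_freq, N_DOCS)
for term, w in weights.items():
    print(f"{term}: tf-idf = {w:.2f}")
```

Without intermediate rounding the weights come out near 5.30, 1.36, and 1.23. The slide's 1.3 for B results from rounding idf to 2.0 before multiplying by 2/3.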
Similarity Measure: Inner Product
- Similarity between vectors for the document $d_j$ and query $q$ can be computed as the vector inner product:
  \[ \mathrm{sim}(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} \, w_{iq} \]
  where $w_{ij}$ is the weight of term $i$ in document $j$ and $w_{iq}$ is the weight of term $i$ in the query.
- For binary vectors, the inner product is the number of matched query terms in the document (size of intersection).
- For weighted term vectors, it is the sum of the products of the weights of the matched terms.

Properties of Inner Product
- The inner product is unbounded.
- Favors long documents with a large number of unique terms.
- Measures how many terms matched but not how many terms are not matched.

Inner Product: Examples
- Binary:
  Vocabulary: retrieval, database, architecture, computer, text, management, information
  D = 1, 1, 1, 0, 1, 1, 0
  Q = 1, 0, 1, 0, 0, 1, 1
  sim(D, Q) = 3
  - Size of vector = size of vocabulary = 7
  - 0 means corresponding term not found in document or query
- Weighted:
  $D_1 = 2T_1 + 3T_2 + 5T_3$
  $D_2 = 3T_1 + 7T_2 + 1T_3$
  $Q = 0T_1 + 0T_2 + 2T_3$
  sim($D_1$, Q) = 2*0 + 3*0 + 5*2 = 10
  sim($D_2$, Q) = 3*0 + 7*0 + 1*2 = 2

Cosine Similarity Measure
- Cosine similarity measures the cosine of the angle between two vectors.
- Inner product normalized by the vector lengths:
  \[ \mathrm{CosSim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2 \cdot \sum_{i=1}^{t} w_{iq}^2}} \]
- [Figure: document vectors $D_1$ and $D_2$ and query vector $Q$, with the angles θ between the query and each document.]
- $D_1 = 2T_1 + 3T_2 + 5T_3$: CosSim($D_1$, Q) = $10 / \sqrt{(4+9+25)(0+0+4)}$ = 0.81
- $D_2 = 3T_1 + 7T_2 + 1T_3$: CosSim($D_2$, Q) = $2 / \sqrt{(9+49+1)(0+0+4)}$ = 0.13
- $D_1$ is 6 times better than $D_2$ using cosine similarity but only 5 times better using inner product.

Naïve Implementation
- Convert all documents in collection $D$ to tf-idf weighted vectors, $d_j$, for keyword vocabulary $V$.
- Convert query to a tf-idf-weighted vector $q$.
- For each $d_j$ in $D$ do
  - Compute score $s_j = \mathrm{CosSim}(d_j, q)$
- Sort documents by decreasing score.
- Present top ranked documents to the user.
- Time complexity: $O(|V| \cdot |D|)$. Bad for large $V$ and $D$!
  $|V|$ = 10,000; $|D|$ = 100,000; $|V| \cdot |D|$ = 1,000,000,000

Comments on Vector Space Models
- Simple, mathematically based approach.
- Considers both local (tf) and global (idf) word occurrence frequencies.
- Provides partial matching and ranked results.
- Tends to work quite well in practice despite obvious weaknesses.
- Allows efficient implementation for large document collections.
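The naïve ranking procedure and the two similarity measures can be sketched in a few lines. The example uses the weighted vectors from this section, read as D1 = (2, 3, 5), D2 = (3, 7, 1), and Q = (0, 0, 2); the function names `inner_product` and `cos_sim` are hypothetical helpers, and the vectors are taken as given rather than computed from tf-idf.

```python
import math


def inner_product(u, v):
    """Sum of products of matching term weights."""
    return sum(a * b for a, b in zip(u, v))


def cos_sim(u, v):
    """Inner product normalized by the two vector lengths (0.0 for a zero vector)."""
    norm = math.sqrt(inner_product(u, u) * inner_product(v, v))
    return inner_product(u, v) / norm if norm else 0.0


docs = {"D1": [2, 3, 5], "D2": [3, 7, 1]}
query = [0, 0, 2]

# Naive implementation: score every document, then sort by decreasing score.
scores = {name: cos_sim(vec, query) for name, vec in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{name}: inner product = {inner_product(docs[name], query)}, "
          f"cosine = {scores[name]:.2f}")
```

Scoring touches every term weight of every document, which is where the $O(|V| \cdot |D|)$ cost comes from when the vectors are stored densely, as they are in this sketch.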
Problems with Vector Space Model
- Missing semantic information (e.g. word sense).
- Missing syntactic information (e.g. phrase structure, word order, proximity information).
- Assumption of term independence (e.g. ignores synonymy).
- Lacks the control of a Boolean model (e.g., requiring a term to appear in a document).
  - Given a two-term query "A B", may prefer a document containing A frequently but not B, over a document that contains both A and B, but both less frequently.
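The last problem can be seen in a small numeric sketch. The vectors below are invented for illustration over a hypothetical three-term vocabulary (A, B, C): one document mentions only A, the other mentions both A and B once each among other content. Raw counts are used as weights, which is an assumption of this example.

```python
import math


def cos_sim(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Vocabulary order: (A, B, C). All values are hypothetical.
query = (1, 1, 0)     # two-term query "A B"
only_a = (5, 0, 0)    # contains A frequently, never B
both = (1, 1, 5)      # contains A and B, but each less frequently

score_only_a = cos_sim(only_a, query)
score_both = cos_sim(both, query)
print(f"only A: {score_only_a:.2f}, A and B: {score_both:.2f}")

# A Boolean AND over the same data would accept only the second document.
satisfies_and = [d for d in (only_a, both) if d[0] > 0 and d[1] > 0]
print(satisfies_and)
```

The document containing only A scores higher (about 0.71 versus 0.27) even though it misses one of the two query terms, whereas the Boolean query A AND B would return only the document containing both.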