Computing Relevance, Similarity: The Vector Space Model

Based on Larson and Hearst's slides at UC-Berkeley
http://www.sims.berkeley.edu/courses/is202/f00/

Database Management Systems, R. Ramakrishnan

Document Vectors

- Documents are represented as bags of words.
- They are represented as vectors when used computationally.
  - A vector is like an array of floating-point numbers.
  - It has direction and magnitude.
  - Each vector holds a place for every term in the collection.
  - Therefore, most vectors are sparse.
Document Vectors: One location for each word

[Table: term counts with one row per document id (A through I) and one column per term (nova, galaxy, heat, h'wood, film, role, diet, fur). The cell positions did not survive extraction.]

- "Nova" occurs 10 times in text A.
- "Galaxy" occurs 5 times in text A.
- "Heat" occurs 3 times in text A.
- A blank cell means 0 occurrences.
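As a minimal sketch of the idea of one location per word, the Python below builds a raw count vector for a single text. The function name `count_vector`, the whitespace tokenizer, and the synthetic text are assumptions made for illustration; only the vocabulary and the three counts stated for text A come from the slide.

```python
from collections import Counter

def count_vector(text, vocabulary):
    """Return raw term counts for `text`, one slot per vocabulary term."""
    counts = Counter(text.lower().split())  # naive whitespace tokenizer (assumption)
    return [counts[term] for term in vocabulary]

# Vocabulary taken from the slide's column headings.
vocabulary = ["nova", "galaxy", "heat", "h'wood", "film", "role", "diet", "fur"]

# Synthetic text, built only so that the counts match those stated for text A.
text_a = " ".join(["nova"] * 10 + ["galaxy"] * 5 + ["heat"] * 3)

vector_a = count_vector(text_a, vocabulary)
print(vector_a)

# Most slots are zero, so a sparse form keeps only the nonzero entries.
sparse_a = {term: n for term, n in zip(vocabulary, vector_a) if n}
print(sparse_a)
```

The dictionary form is one common way to store sparse vectors; the dense list is kept here only to make the "place for every term" idea visible.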
We Can Plot the Vectors

[Figure: documents plotted in a term space with axes "Star" and "Diet": a doc about astronomy, a doc about movie stars, and a doc about mammal behavior.]

Assumption: Documents that are close in space are similar.

Vector Space Model

- Documents are represented as vectors in term space.
  - Terms are usually stems.
  - Documents are represented by binary vectors of terms.
- Queries are represented the same as documents.
- A vector distance measure between the query and documents is used to rank retrieved documents.
  - Query and document similarity is based on the length and direction of their vectors.
  - Vector operations can capture Boolean query conditions.
  - Terms in a vector can be weighted in many ways.
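Vector Space Documents and Queries

[Table: binary term vectors over terms t1, t2, t3 for documents D1 through D11, with a final column RSV = Q · Di (retrieval status value). The query row is Q = (q1, q2, q3) = (1, 2, 3). The individual cell positions did not survive extraction.]

Q is a query, also represented as a vector. For example, a document that contains t1 and t3 but not t2 has vector (1, 0, 1) and RSV = 1·1 + 2·0 + 3·1 = 4.

[Figure: the documents plotted in the three-dimensional space of t1, t2, t3, grouped by Boolean term combinations.]

Assigning Weights to Terms

- Binary weights
- Raw term frequency
- tf x idf
  - Recall the Zipf distribution.
  - Want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole.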
Binary Weights

- Only the presence (1) or absence (0) of a term is included in the vector.

[Table: binary vectors over t1, t2, t3 for documents D1 through D11. The cell positions did not survive extraction.]

Raw Term Weights

- The frequency of occurrence for the term in each document is included in the vector.

[Table: raw term-frequency vectors over t1, t2, t3 for documents D1 through D11. The cell positions did not survive extraction.]
TF x IDF Weights

- tf x idf measure:
  - Term frequency (tf)
  - Inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
- Goal: Assign a tf * idf weight to each term in each document.

TF x IDF Calculation

$$w_{ik} = tf_{ik} \cdot \log(N / n_k)$$

where

- $T_k$ = term $k$ in document $D_i$
- $tf_{ik}$ = frequency of term $T_k$ in document $D_i$
- $idf_k$ = inverse document frequency of term $T_k$ in $C$, with $idf_k = \log(N / n_k)$
- $N$ = total number of documents in the collection $C$
- $n_k$ = the number of documents in $C$ that contain $T_k$
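The calculation above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the four-document collection is hypothetical, the function names are invented for the example, and the logarithm is taken base 10.

```python
import math

def document_frequencies(docs):
    """Return n_k: the number of documents that contain each term."""
    n = {}
    for counts in docs.values():
        for term, tf in counts.items():
            if tf > 0:
                n[term] = n.get(term, 0) + 1
    return n

def tf_idf_weights(docs):
    """Return w_ik = tf_ik * log10(N / n_k) for every term in every document."""
    total = len(docs)  # N, the collection size
    n = document_frequencies(docs)
    return {
        doc_id: {
            term: tf * math.log10(total / n[term])
            for term, tf in counts.items()
            if tf > 0
        }
        for doc_id, counts in docs.items()
    }

# Hypothetical collection: raw term frequencies per document.
docs = {
    "D1": {"star": 2, "nova": 3, "galaxy": 1},
    "D2": {"star": 1, "film": 4},
    "D3": {"star": 5, "film": 1, "role": 2},
    "D4": {"star": 1, "diet": 2},
}

weights = tf_idf_weights(docs)
for doc_id, w in weights.items():
    print(doc_id, {term: round(value, 3) for term, value in w.items()})
```

In this toy collection "star" occurs in every document, so its idf is log(4/4) = 0 and it contributes nothing to any vector, however frequent it is inside a document.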
Inverse Document Frequency

- IDF provides high values for rare words and low values for common words.

For a collection of 10000 documents (logarithms are base 10):

$$\log(10000 / 10000) = 0$$
$$\log(10000 / 5000) = 0.301$$
$$\log(10000 / 20) = 2.699$$
$$\log(10000 / 1) = 4$$

TF x IDF Normalization

- Normalize the term weights (so longer documents are not unfairly given more weight).
  - This usually means forcing all values to fall within a certain range, typically between 0 and 1, inclusive.

$$w_{ik} = \frac{tf_{ik} \, \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 \, [\log(N / n_k)]^2}}$$
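A short Python sketch of both slides follows: the idf values for a 10000-document collection, and length normalization of tf x idf weights. The helper names and the two example documents are assumptions for illustration, and base-10 logarithms are assumed.

```python
import math

def idf(n_docs, n_k):
    """Inverse document frequency log10(N / n_k)."""
    return math.log10(n_docs / n_k)

def normalized_weights(tf, n_docs, n):
    """Divide each tf * idf weight by the Euclidean length of the document vector."""
    raw = {term: f * idf(n_docs, n[term]) for term, f in tf.items()}
    length = math.sqrt(sum(w * w for w in raw.values()))
    if length == 0.0:
        return {term: 0.0 for term in raw}  # every term occurs in every document
    return {term: w / length for term, w in raw.items()}

N = 10000
for n_k in (10000, 5000, 20, 1):
    print(n_k, round(idf(N, n_k), 3))

# Hypothetical document frequencies and two documents with the same term
# proportions; the second is simply ten times longer.
n = {"nova": 20, "film": 5000}
short_doc = {"nova": 1, "film": 2}
long_doc = {"nova": 10, "film": 20}

short_w = normalized_weights(short_doc, N, n)
long_w = normalized_weights(long_doc, N, n)
print(short_w)
print(long_w)
```

After normalization the two documents get the same weights, which is the point of the slide: length alone no longer earns a document more weight.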
Pair-wise Document Similarity

      nova  galaxy  heat  h'wood  film  role  diet  fur
A       1      3      1
B       5      2
C                            2      1     5
D                            4      1

How to compute document similarity?

$$D_1 = w_{11}, w_{12}, \ldots, w_{1t}$$
$$D_2 = w_{21}, w_{22}, \ldots, w_{2t}$$
$$sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \cdot w_{2i}$$

- sim(A, B) = (1 × 5) + (3 × 2) = 11
- sim(A, C) = 0
- sim(A, D) = 0
- sim(B, C) = 0
- sim(B, D) = 0
- sim(C, D) = (2 × 4) + (1 × 1) = 9
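The pair-wise computation can be sketched as an inner product over aligned term vectors. The column placement of the counts for documents A through D is a reconstruction (assumed from the term order nova, galaxy, heat, h'wood, film, role, diet, fur), and the function name is illustrative.

```python
from itertools import combinations

def inner_product(d1, d2):
    """Unnormalized similarity: sum of w_1i * w_2i over all terms."""
    return sum(w1 * w2 for w1, w2 in zip(d1, d2))

# Term order: nova, galaxy, heat, h'wood, film, role, diet, fur.
# Column placement is an assumption; blank cells are stored as 0.
docs = {
    "A": [1, 3, 1, 0, 0, 0, 0, 0],
    "B": [5, 2, 0, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 2, 1, 5, 0, 0],
    "D": [0, 0, 0, 4, 1, 0, 0, 0],
}

sims = {
    (a, b): inner_product(docs[a], docs[b])
    for a, b in combinations(sorted(docs), 2)
}
for pair, value in sims.items():
    print(pair, value)
```

Documents that share no terms score 0, so only the pairs (A, B) and (C, D) come out nonzero.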
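Pair-wise Document Similarity (cosine normalization)

$$D_1 = w_{11}, w_{12}, \ldots, w_{1t}$$
$$D_2 = w_{21}, w_{22}, \ldots, w_{2t}$$

Unnormalized:

$$sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \cdot w_{2i}$$

Cosine normalized:

$$sim(D_1, D_2) = \frac{\sum_{i=1}^{t} w_{1i} \cdot w_{2i}}{\sqrt{\sum_{i=1}^{t} (w_{1i})^2} \cdot \sqrt{\sum_{i=1}^{t} (w_{2i})^2}}$$

Vector Space Relevance Measure

$$D_i = w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}}$$
$$Q = w_{q_1}, w_{q_2}, \ldots, w_{q_t}$$

with $w = 0$ if a term is absent.

If term weights are normalized:

$$sim(Q, D_i) = \sum_{j=1}^{t} w_{q_j} \cdot w_{d_{ij}}$$

Otherwise, normalize in the similarity comparison:

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} \cdot w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2} \cdot \sqrt{\sum_{j=1}^{t} (w_{d_{ij}})^2}}$$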
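Computing Relevance Scores

Say we have query vector $Q = (0.4, 0.8)$.
Also, document $D_2 = (0.2, 0.7)$.
What does their similarity comparison yield?

$$sim(Q, D_2) = \frac{(0.4 \times 0.2) + (0.8 \times 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \times [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.424}} \approx 0.98$$

Vector Space with Term Weights and Cosine Matching

[Figure: $Q = (0.4, 0.8)$, $D_1 = (0.8, 0.3)$, and $D_2 = (0.2, 0.7)$ plotted with Term A on the horizontal axis and Term B on the vertical axis, each running from 0 to 1.0. The angles between Q and the two documents are marked $\alpha_1$ and $\alpha_2$.]

$$D_i = (d_{i1}, w_{d_{i1}}; d_{i2}, w_{d_{i2}}; \ldots; d_{it}, w_{d_{it}})$$
$$Q = (q_{i1}, w_{q_{i1}}; q_{i2}, w_{q_{i2}}; \ldots; q_{it}, w_{q_{it}})$$

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} \cdot w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2} \cdot \sqrt{\sum_{j=1}^{t} (w_{d_{ij}})^2}}$$

$$sim(Q, D_2) = \frac{0.64}{\sqrt{0.424}} \approx 0.98$$

$$sim(Q, D_1) = \frac{(0.4 \times 0.8) + (0.8 \times 0.3)}{\sqrt{[(0.4)^2 + (0.8)^2] \times [(0.8)^2 + (0.3)^2]}} = \frac{0.56}{\sqrt{0.584}} \approx 0.73$$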
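The cosine comparison can be checked numerically with a small Python sketch. The vectors are the ones given on the slides; the function name `cosine` and the zero-length guard are assumptions added for the example.

```python
import math

def cosine(q, d):
    """Cosine of the angle between q and d; 0.0 if either vector has zero length."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm = math.sqrt(sum(wq * wq for wq in q)) * math.sqrt(sum(wd * wd for wd in d))
    return dot / norm if norm else 0.0

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

scores = {"D1": cosine(Q, D1), "D2": cosine(Q, D2)}

# Rank documents by decreasing similarity to the query.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, {name: round(s, 2) for name, s in scores.items()})
```

D2 points in nearly the same direction as Q, so it ranks above D1 even though D1 is the longer vector.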
Similarity Measures

- Simple matching (coordination level match): $|Q \cap D|$
- Dice's Coefficient: $\dfrac{2\,|Q \cap D|}{|Q| + |D|}$
- Jaccard's Coefficient: $\dfrac{|Q \cap D|}{|Q \cup D|}$
- Cosine Coefficient: $\dfrac{|Q \cap D|}{|Q|^{1/2} \times |D|^{1/2}}$
- Overlap Coefficient: $\dfrac{|Q \cap D|}{\min(|Q|, |D|)}$

Text Clustering

- Finds overall similarities among groups of documents.
- Finds overall similarities among groups of tokens.
- Picks out some themes, ignores others.
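Returning to the Similarity Measures slide, the five coefficients can be sketched for binary (set-valued) queries and documents. The example sets are hypothetical, the function names are invented for illustration, and both sets are assumed to be non-empty so that no denominator is zero.

```python
import math

def simple_matching(q, d):
    """Coordination level match: number of shared terms."""
    return len(q & d)

def dice(q, d):
    return 2 * len(q & d) / (len(q) + len(d))

def jaccard(q, d):
    return len(q & d) / len(q | d)

def cosine_coefficient(q, d):
    return len(q & d) / math.sqrt(len(q) * len(d))

def overlap(q, d):
    return len(q & d) / min(len(q), len(d))

# Hypothetical query and document, each given as a set of terms.
query = {"nova", "galaxy"}
document = {"nova", "galaxy", "heat", "film"}

for measure in (simple_matching, dice, jaccard, cosine_coefficient, overlap):
    print(measure.__name__, round(measure(query, document), 3))
```

All five share the numerator |Q ∩ D| (doubled for Dice) and differ only in how they normalize it, which is why they tend to rank documents in similar order.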
Text Clustering

Clustering is "The art of finding groups in data." -- Kaufman and Rousseeuw

[Figure: documents plotted in a space with axes Term 1 and Term 2.]

Problems with Vector Space

- There is no real theoretical basis for the assumption of a term space.
  - It is more for visualization than having any real basis.
  - Most similarity measures work about the same.
- Terms are not really orthogonal dimensions.
  - Terms are not independent of all other terms; remember our discussion of correlated terms in text.
Probabilistic Models

- A rigorous formal model attempts to predict the probability that a given document will be relevant to a given query.
- Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle).
- Relies on accurate estimates of probabilities.

Probability Ranking Principle

- "If a reference retrieval system's response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."

Stephen E. Robertson, J. Documentation, 1977
Iterative Query Refinement

Query Modification

- Problem: How can we reformulate the query to help a user who is trying several searches to get at the same information?
  - Thesaurus expansion: Suggest terms similar to query terms.
  - Relevance feedback: Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant.
Relevance Feedback

- Main idea: Modify the existing query based on relevance judgements.
  - Extract terms from relevant documents and add them to the query,
  - AND/OR re-weight the terms already in the query.
- There are many variations:
  - Usually positive weights for terms from relevant docs.
  - Sometimes negative weights for terms from non-relevant docs.
- Users, or the system, guide this process by selecting terms from an automatically-generated list.

Rocchio Method

- Rocchio automatically
  - Re-weights terms.
  - Adds in new terms (from relevant docs).
    - Have to be careful when using negative terms.
    - Rocchio is not a machine learning algorithm.
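Rocchio Method

$$Q_1 = \alpha Q_0 + \frac{\beta}{n_1} \sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2} \sum_{i=1}^{n_2} S_i$$

where

- $Q_0$ = the vector for the initial query
- $R_i$ = the vector for the relevant document $i$
- $S_i$ = the vector for the non-relevant document $i$
- $n_1$ = the number of relevant documents chosen
- $n_2$ = the number of non-relevant documents chosen
- $\alpha$, $\beta$ and $\gamma$ tune the importance of relevant and non-relevant terms (in some studies it was best to set $\beta$ to 0.75 and $\gamma$ to 0.25)

Rocchio/Vector Illustration

[Figure: vectors plotted with Retrieval on the horizontal axis and Information on the vertical axis, each running from 0 to 1.0.]

- $Q_0$ = retrieval of information = (0.7, 0.3)
- $D_1$ = information science = (0.2, 0.8)
- $D_2$ = retrieval systems = (0.9, 0.1)

$$Q' = \tfrac{1}{2} Q_0 + \tfrac{1}{2} D_1 = (0.45, 0.55)$$
$$Q'' = \tfrac{1}{2} Q_0 + \tfrac{1}{2} D_2 = (0.80, 0.20)$$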
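The Rocchio update can be sketched as follows. This is an illustration under assumptions: the function name and signature are invented for the example, the default α = 1.0 is not given on the slides (only β = 0.75 and γ = 0.25 are), and negative weights are left as they are rather than clipped.

```python
def rocchio(q0, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Return Q1 = alpha*Q0 + (beta/n1)*sum(R_i) - (gamma/n2)*sum(S_i)."""
    q1 = [alpha * w for w in q0]
    if relevant:  # skip the term when no relevant documents were chosen
        n1 = len(relevant)
        for k in range(len(q0)):
            q1[k] += beta * sum(r[k] for r in relevant) / n1
    if non_relevant:  # skip the term when no non-relevant documents were chosen
        n2 = len(non_relevant)
        for k in range(len(q0)):
            q1[k] -= gamma * sum(s[k] for s in non_relevant) / n2
    return q1

# Vectors from the Rocchio/Vector illustration, as (retrieval, information).
Q0 = (0.7, 0.3)   # "retrieval of information"
D1 = (0.2, 0.8)   # "information science"
D2 = (0.9, 0.1)   # "retrieval systems"

# The illustration uses alpha = beta = 1/2 and no non-relevant documents.
q_prime = rocchio(Q0, [D1], [], alpha=0.5, beta=0.5, gamma=0.0)
q_double_prime = rocchio(Q0, [D2], [], alpha=0.5, beta=0.5, gamma=0.0)

print([round(w, 2) for w in q_prime])
print([round(w, 2) for w in q_double_prime])
```

Feedback on D1 pulls the query toward the Information axis, while feedback on D2 pulls it toward Retrieval. With a nonzero γ a weight can go negative, which is the case the slides warn about.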