[Figure: example inverted index posting lists, with docids such as (5, 10, 12, 32, 48), (4, 8, 16, 32, 64, 128), and (2, 3, 5, 16); each posting stores a docid and a score.]
Text processing example:
- Original: O'Neal averaged 15.2 points 9.2 rebounds and 1.0 assists per game
- Case folding: o'neal averaged 15.2 points 9.2 rebounds and 1.0 assists per game
- Stopword removal: o'neal averaged 15.2 points 9.2 rebounds 1.0 assists per game
- Stemming: o'neal averag 15.2 point 9.2 rebound 1.0 assist per game
Original      Porter2   Krovetz
organization  organ     organization
organ         organ     organ
heading       head      heading
head          head      head
european      european  europe
europe        europ     europe
urgency       urgenc    urgent
urgent        urgent    urgent
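For reference, the Porter2 column can be reproduced with NLTK's Snowball stemmer (Snowball's "english" stemmer implements the Porter2 algorithm); a minimal sketch over the words in the table above:

```python
# Porter2 stemming via NLTK's Snowball stemmer.
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["organization", "organ", "heading", "head",
             "european", "europe", "urgency", "urgent"]:
    print(word, "->", stemmer.stem(word))
```

(Krovetz stemming is not included in NLTK; separate packages implement it.)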
For vectors $x = (x_1, x_2, \ldots, x_k)$ and $y = (y_1, y_2, \ldots, y_k)$:

$$\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{k} x_i y_i}{\sqrt{\sum_{i=1}^{k} x_i^2}\,\sqrt{\sum_{i=1}^{k} y_i^2}}$$

For a query vector $q = (q_1, \ldots, q_k)$ and a document vector $d = (d_1, \ldots, d_k)$:

$$\cos(q, d) = \frac{\sum_{i=1}^{k} q_i d_i}{\sqrt{\sum_{i=1}^{k} q_i^2}\,\sqrt{\sum_{i=1}^{k} d_i^2}}$$
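A minimal sketch of the cosine measure in Python (the vectors and values are illustrative, not from the slides):

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical query/document vectors over a shared vocabulary
q = [1, 1, 0]
d = [2, 1, 0]
print(cosine(q, d))  # ~0.949
```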
$$w_i = \mathrm{TF}(t_i, D) \times \mathrm{IDF}(t_i)$$

$$\mathrm{TF}_{\mathrm{binary}}(t_i, D) = \begin{cases} 1 & \text{if } c(t_i, D) > 0 \\ 0 & \text{if } c(t_i, D) = 0 \end{cases}$$

$$\mathrm{TF}_{\mathrm{raw}}(t_i, D) = c(t_i, D)$$

$$\mathrm{TF}_{\mathrm{log}}(t_i, D) = \begin{cases} 1 + \log_b c(t_i, D) & \text{if } c(t_i, D) > 0 \\ 0 & \text{if } c(t_i, D) = 0 \end{cases}$$
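A small sketch of the three TF variants (function names are mine, not from the slides):

```python
import math

def tf_binary(count):
    """1 if the term occurs at all, else 0."""
    return 1 if count > 0 else 0

def tf_raw(count):
    """The raw term count."""
    return count

def tf_log(count, base=2):
    """1 + log_b(count) for count > 0, else 0."""
    return 1 + math.log(count, base) if count > 0 else 0

for c in (0, 1, 2, 8):
    print(c, tf_binary(c), tf_raw(c), tf_log(c))
```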
[Figure: growth of $y = x$ versus $y = 1 + \log_2 x$, $y = \log_2 x$, $y = \ln x$, and $y = \log_{10} x$; the logarithmic curves grow far more slowly (e.g., $\ln 2 \approx 0.69$, $\ln 3 \approx 1.10$).]
$$\mathrm{IDF}_{\mathrm{uniform}}(t_i) = 1$$

$$\mathrm{IDF}_{\mathrm{KSJ}}(t) = \log \frac{N}{n_t}$$

$$\mathrm{IDF}_{\mathrm{BM25}}(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}$$

Here $N$ is the number of documents in the corpus and $n_t$ is the number of documents containing $t$; note that $\mathrm{IDF}_{\mathrm{BM25}}(t) < 0$ when $n_t > N/2$.
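A quick sketch of the two non-trivial IDF variants (corpus size and document frequencies below are illustrative):

```python
import math

def idf_ksj(N, n_t):
    """Karen Sparck Jones IDF: log(N / n_t)."""
    return math.log(N / n_t)

def idf_bm25(N, n_t):
    """BM25 IDF; goes negative when n_t > N/2."""
    return math.log((N - n_t + 0.5) / (n_t + 0.5))

N = 1000  # illustrative corpus size
for n_t in (1, 10, 100, 600):
    print(n_t, round(idf_ksj(N, n_t), 3), round(idf_bm25(N, n_t), 3))
```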
Multivariate Bernoulli Distribution
- Results of n independent Bernoulli trials {X_1, ..., X_n}: n different coins, each tossed once.
- Each coin may have a different probability of heads/tails, e.g., X_1: tails, X_2: heads, X_3: heads, X_4: heads.
- Different from a binomial distribution with size n, which tosses the same coin n times (each toss independent of the others).
A Multivariate Bernoulli Model of a Document
Consider a document D as an outcome of the model, and let X = (X_1, ..., X_n) be D's binary term-occurrence vector. What is the probability of X under this MB model?

X_i          observed  P(X_i = 1)  P(X_i = 0)
index        1         0.4         0.6
retrieval    1         0.3         0.7
search       0         0.5         0.5
information  1         0.9         0.1
data         0         0.8         0.2
computer     1         0.9         0.1
science      0         0.4         0.6

P(D) = P(index=1) P(retrieval=1) P(search=0) P(information=1) P(data=0) P(computer=1) P(science=0)
     = 0.4 × 0.3 × 0.5 × 0.9 × 0.2 × 0.9 × 0.6
Naïve Bayes Classification using MB Models

X_i          observed  P(X_i = 1 | IR)  P(X_i = 1 | DB)
index        1         0.7              0.8
search       1         0.9              0.9
information  0         0.8              0.6
data         1         0.5              0.9
computer     0         0.4              0.6
relevance    1         0.9              0.1
SQL          0         0.1              0.8

Priors: P(IR) = 0.3, P(DB) = 0.7.

P(IR | D) ∝ P(IR) Π_i P(X_i | IR) = 0.3 × 0.7 × 0.9 × (1 − 0.8) × 0.5 × (1 − 0.4) × 0.9 × (1 − 0.1)
P(DB | D) ∝ P(DB) Π_i P(X_i | DB) = 0.7 × 0.8 × 0.9 × (1 − 0.6) × 0.9 × (1 − 0.6) × 0.1 × (1 − 0.8)

The ratio is about 6.33: D is 6.33 times more likely to be an IR article than a DB one.
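A minimal sketch that reproduces the computation above (the tables are copied from the slide):

```python
# Naive Bayes with multivariate Bernoulli class models.
observed = {"index": 1, "search": 1, "information": 0, "data": 1,
            "computer": 0, "relevance": 1, "SQL": 0}
p_ir = {"index": 0.7, "search": 0.9, "information": 0.8, "data": 0.5,
        "computer": 0.4, "relevance": 0.9, "SQL": 0.1}
p_db = {"index": 0.8, "search": 0.9, "information": 0.6, "data": 0.9,
        "computer": 0.6, "relevance": 0.1, "SQL": 0.8}

def mb_score(prior, p_one):
    """prior * product of P(X_i = x_i | class) over the observed vector."""
    score = prior
    for term, x in observed.items():
        score *= p_one[term] if x == 1 else 1 - p_one[term]
    return score

ir, db = mb_score(0.3, p_ir), mb_score(0.7, p_db)
print(ir / db)  # ~6.33: D is far more likely an IR article
```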
$$\log \frac{P(R \mid D, Q)}{P(NR \mid D, Q)} \propto \sum_{t_i \in Q \cap D} \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}, \quad \text{where } p_i = P(X_i = 1 \mid R, Q),\; q_i = P(X_i = 1 \mid NR, Q)$$

With $p_i = 0.5$:

$$\sum_{t_i \in Q \cap D} \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} = \sum_{t_i \in Q \cap D} \log \frac{1 - q_i}{q_i} \approx \sum_{t_i \in Q \cap D} \log \frac{N - n_{t_i} + 0.5}{n_{t_i} + 0.5}$$
$$\mathrm{score}_{\mathrm{BM25}}(d, q) = \sum_{q_i \in q} \mathrm{weight}_{\mathrm{BM25}}(q_i, d)$$

$$\mathrm{weight}_{\mathrm{BM25}}(q_i, d) = \frac{(k_1 + 1)\, tf_{q_i, d}}{k_1 \left(1 - b + b \frac{dl}{avdl}\right) + tf_{q_i, d}} \cdot \log \frac{N - n_i + 0.5}{n_i + 0.5}$$

- $tf_{q_i, d}$: the raw frequency of $q_i$ in $d$
- $N$: the total number of documents in the corpus
- $n_i$: the document frequency of $q_i$
- $dl$: the length of document $d$
- $avdl$: the average length of documents in the corpus
- $k_1$ and $b$: two parameters
$k_1$ determines the upper bound of the TF component:

$$\lim_{tf \to \infty} \frac{(k_1 + 1)\, tf}{k_1 \left(1 - b + b \frac{dl}{avdl}\right) + tf} = k_1 + 1$$

For an average-length document ($dl = avdl$):

$$TF = \frac{(k_1 + 1)\, tf}{k_1 + tf}$$
$b$ controls length normalization. For a longer-than-average document ($dl > avdl$), raw tf is penalized:

$$TF = \frac{(k_1 + 1)\, tf}{k_1 \left(1 - b + b \frac{dl}{avdl}\right) + tf} < \frac{(k_1 + 1)\, tf}{k_1 + tf}$$

For a shorter-than-average document ($dl < avdl$), raw tf is boosted:

$$TF = \frac{(k_1 + 1)\, tf}{k_1 \left(1 - b + b \frac{dl}{avdl}\right) + tf} > \frac{(k_1 + 1)\, tf}{k_1 + tf}$$
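A minimal sketch of the BM25 formula above ($k_1 = 1.2$, $b = 0.75$ are common defaults, not values from the slides; the numbers at the bottom are illustrative):

```python
import math

def bm25_weight(tf, n_i, N, dl, avdl, k1=1.2, b=0.75):
    """BM25 term weight: saturating TF component times BM25 IDF."""
    idf = math.log((N - n_i + 0.5) / (n_i + 0.5))
    tf_component = ((k1 + 1) * tf) / (k1 * (1 - b + b * dl / avdl) + tf)
    return tf_component * idf

def bm25_score(query_terms, doc_tf, doc_df, N, dl, avdl):
    """Sum BM25 weights over the query terms that occur in the document."""
    return sum(bm25_weight(doc_tf[t], doc_df[t], N, dl, avdl)
               for t in query_terms if doc_tf.get(t, 0) > 0)

# Illustrative numbers: tf=3 in a longer-than-average document
print(bm25_weight(tf=3, n_i=100, N=10000, dl=1500, avdl=1000))
```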
[Figures: the BM25 TF component $\frac{(k_1 + 1)\, tf}{k_1 (1 - b + b\, dl/avdl) + tf}$ plotted against raw tf for different values of $k_1$, $b$, and $dl/avdl$, illustrating the saturation behavior described above.]
A Unigram Language Model θ
Each term t in the vocabulary V has a probability $P(t \mid \theta)$, with $\sum_{t \in V} P(t \mid \theta) = 1$.

t            P(t | θ)
index        0.21
retrieval    0.32
search       0.18
information  0.11
data         0.06
computer     0.04
science      0.08
The probability that this model θ generates a document $D = t_1 t_2 \ldots t_n$ is

$$P(D \mid \theta) = \prod_{i=1}^{n} P(t_i \mid \theta)$$

e.g., $P(D \mid \theta) = 0.11 \times 0.32 \times 0.18 \times 0.21 \times 0.32$ for D = "information retrieval search index retrieval".
Naïve Bayes Classification using Unigram Language Models

t            c(t,D)  P(t | IR)  P(t | DB)
index        1       0.21       0.17
retrieval    2       0.32       0.05
search       1       0.18       0.22
information  1       0.11       0.12
data         0       0.06       0.33
computer     0       0.04       0.08
science      0       0.08       0.03

Priors: P(IR) = 0.3, P(DB) = 0.7.

$$P(IR \mid D) \propto P(IR) \prod_{t \in D} P(t \mid IR)^{c(t, D)} = 0.3 \times 0.21 \times 0.32^2 \times 0.18 \times 0.11$$
$$P(DB \mid D) \propto P(DB) \prod_{t \in D} P(t \mid DB)^{c(t, D)} = 0.7 \times 0.17 \times 0.05^2 \times 0.22 \times 0.12$$

The ratio is about 16.26: D is 16.26 times more likely to be an IR article than a DB one.
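A quick check of the arithmetic above, as a minimal sketch (tables copied from the slide):

```python
# Naive Bayes with unigram language models per class.
counts = {"index": 1, "retrieval": 2, "search": 1, "information": 1}
p_ir = {"index": 0.21, "retrieval": 0.32, "search": 0.18, "information": 0.11}
p_db = {"index": 0.17, "retrieval": 0.05, "search": 0.22, "information": 0.12}

def unigram_score(prior, model):
    """prior * product of P(t | class)^c(t, D) over the document's terms."""
    score = prior
    for term, c in counts.items():
        score *= model[term] ** c
    return score

ir = unigram_score(0.3, p_ir)
db = unigram_score(0.7, p_db)
print(ir / db)  # ~16.26
```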
Ranking by query likelihood:

$$P(q \mid \theta_D) = \prod_{t \in q} P(t \mid \theta_D), \qquad \log P(q \mid \theta_D) = \sum_{t \in q} \log P(t \mid \theta_D)$$
Recap: The Query Likelihood Model (QL)
- Each document D is generated from a document language model θ_D.
- Estimate a language model θ_D for each document D.
- Rank documents by P(q | θ_D).

Example: for the query "information retrieval", QL ranks D_1 higher than D_2.

D_1's model:                 D_2's model:
t            P(t | θ_D1)     t            P(t | θ_D2)
index        0.21            index        0.17
retrieval    0.32            retrieval    0.05
search       0.18            search       0.22
information  0.11            information  0.12
data         0.06            data         0.33
computer     0.04            computer     0.08
science      0.08            science      0.03

$$P(q \mid \theta_{D_1}) = 0.11 \times 0.32 = 0.0352 \qquad P(q \mid \theta_{D_2}) = 0.12 \times 0.05 = 0.006$$
$$\hat{P}_{MLE}(t \mid \theta_D) = \frac{c(t, D)}{|D|} = \frac{\text{term frequency}}{\text{document length}}$$

Example, for a document D of length 12: $\hat{P}_{MLE}(\text{information} \mid \theta_D) = 2/12$, and $\hat{P}_{MLE}(t \mid \theta_D) = 1/12$ for each of retrieval, an, for, technique, is, and important.
The corpus model can be estimated the same way:

$$\hat{P}_{MLE}(t \mid \text{corpus}) = \frac{\sum_D c(t, D)}{\sum_D |D|} = \frac{\text{corpus term frequency}}{\text{corpus length}}$$

Relatedly, under a multivariate Bernoulli view of the corpus, $\hat{P}_{MLE}(X_i = 1 \mid \text{corpus}) = \frac{DF_{t_i}}{N}$, the document-frequency ratio underlying IDF.
MLE: Recall Several Issues
- Unseen words get zero probability: as long as one query term does not appear in D, the whole document gets zero probability, so ranking by P_MLE(q | θ_D) behaves like a Boolean AND. But no occurrence does not mean impossible.
- Limited sample size: MLE is reasonably good for a large sample, but we are estimating a document model from usually just a few hundred or thousand words. In some cases (covered next lecture) we also need to estimate a query model, where the sample is even smaller.
- Solution: smoothing (discussed a few slides later).
Jelinek-Mercer Smoothing
- Start simple, but reasonably good: interpolate with P(t | Corpus) as the background model:

$$P(t \mid \theta_D) = (1 - \lambda)\, P_{MLE}(t \mid \theta_D) + \lambda\, P(t \mid \text{Corpus})$$

- λ is a constant for all documents, independent of any document or query characteristics.
- Tune λ to optimize retrieval performance, e.g., to maximize mean P@10 or AP over a set of queries in a dataset. The optimal value of λ varies across collections, query sets, etc., so setting λ correctly is very important.
Jelinek-Mercer Smoothing: example with λ = 0.5 and |D| = 1281:

$$P_{smoothed}(\text{the} \mid \theta_D) = 0.5 \times \frac{106}{1281} + 0.5 \times 0.063904 = 0.073326$$

word        freq  P_MLE(· | D)  P(· | Corpus)  Smoothed
the         106   0.082748      0.063904       0.073326
soviet      18    0.014052      0.000208       0.007130
chernobyl   10    0.007806      0.000012       0.003909
disclosure  1     0.000781      0.000053       0.000417
divert      1     0.000781      0.000014       0.000397
downplaye   1     0.000781      0.000001       0.000391
each        1     0.000781      0.000489       0.000635
early       1     0.000781      0.000486       0.000633
Dirichlet Smoothing
- Problem with Jelinek-Mercer: all documents share the same λ, yet longer documents provide better estimates (a larger sample makes their own MLE more reliable).
- Make smoothing depend on the sample size (adaptive). With |D| the length of the sample and μ a constant:

$$P(t \mid \theta_D, \mu) = \frac{c(t, D) + \mu\, P(t \mid \text{Corpus})}{|D| + \mu}$$

- This interpolates the MLE with weight $\frac{|D|}{|D| + \mu}$ and the corpus model with weight $\frac{\mu}{|D| + \mu}$.
Dirichlet Smoothing: example with μ = 500 and |D| = 1281:

$$P_{smoothed}(\text{the} \mid \theta_D) = \frac{106 + 500 \times 0.063904}{1281 + 500} = 0.077458$$

word        freq  P_MLE(· | D)  P(· | Corpus)  Smoothed
the         106   0.082748      0.063904       0.077458
soviet      18    0.014052      0.000208       0.010165
chernobyl   10    0.007806      0.000012       0.005618
disclosure  1     0.000781      0.000053       0.000576
divert      1     0.000781      0.000014       0.000565
downplaye   1     0.000781      0.000001       0.000562
each        1     0.000781      0.000489       0.000699
early       1     0.000781      0.000486       0.000698
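A minimal sketch of both smoothing methods, reproducing the "the" rows of the two examples above:

```python
def jelinek_mercer(tf, dl, p_corpus, lam=0.5):
    """P(t|D) = (1 - lam) * tf/dl + lam * P(t|Corpus)."""
    return (1 - lam) * tf / dl + lam * p_corpus

def dirichlet(tf, dl, p_corpus, mu=500):
    """P(t|D) = (tf + mu * P(t|Corpus)) / (dl + mu)."""
    return (tf + mu * p_corpus) / (dl + mu)

print(jelinek_mercer(106, 1281, 0.063904))  # ~0.073326
print(dirichlet(106, 1281, 0.063904))       # ~0.077458
```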
Smoothing and IDF. Similar to many retrieval models, we can write QL as:

$$score(q, D) = \sum_{t \in q} w(t, D), \quad \text{where } w(t, D) = \log P(t \mid \theta_D)$$

Dirichlet smoothing: $P(t \mid \theta_D) = \frac{tf + \mu\, P(t \mid \text{Corpus})}{|D| + \mu}$; JM smoothing: $P(t \mid \theta_D) = (1 - \lambda)\frac{tf}{|D|} + \lambda\, P(t \mid \text{Corpus})$.

Differentiating with respect to tf (tf is discrete, but let's just assume the functions are all continuous here):

$$w'(t, D) = \frac{1}{P(t \mid \theta_D)} \cdot \frac{\partial P(t \mid \theta_D)}{\partial tf}$$

Dirichlet: $w'(t, D) = \frac{1}{P(t \mid \theta_D)} \cdot \frac{1}{|D| + \mu}$; JM: $w'(t, D) = \frac{1}{P(t \mid \theta_D)} \cdot \frac{1 - \lambda}{|D|}$.
Smoothing and IDF. No matter which smoothing method is employed, common words get a much higher $P_{smoothed}(t \mid \theta_D)$, so by the derivatives above their QL weight grows much more slowly with raw tf than the weight of less common and rare words.

For MLE, $w(t, D) = \log \frac{tf}{|D|}$, so $w'(t, D) = \frac{1}{tf}$, which only depends on tf and is the same for common and less common terms. (Again, tf is discrete, but assume the functions are continuous here.)
[Figure: $\log P_{smoothed}$ vs. raw tf under Dirichlet smoothing with μ = 1500, for a long document (|D| = 5000) and a short document (|D| = 100), and for a common word (P(t | corpus) = 0.01) vs. a less common word (P(t | corpus) = 0.0001). For common words, $\log P_{smoothed}$ increases much more slowly.]
$$KLD(P \parallel Q) = \sum_i P_i \log \frac{P_i}{Q_i}$$
$$KLD(\theta_q \parallel \theta_D) = \sum_t P(t \mid \theta_q) \log \frac{P(t \mid \theta_q)}{P(t \mid \theta_D)} = \sum_t P(t \mid \theta_q) \log P(t \mid \theta_q) - \sum_t P(t \mid \theta_q) \log P(t \mid \theta_D)$$

The first sum depends only on the query, so ranking documents by $-KLD(\theta_q \parallel \theta_D)$ is equivalent to ranking by $\sum_t P(t \mid \theta_q) \log P(t \mid \theta_D)$.
When $\hat{P}_{MLE}(t \mid \theta_q) = \frac{c(t, q)}{|q|}$:

$$\sum_t P(t \mid \theta_q) \log P(t \mid \theta_D) = \frac{1}{|q|} \sum_t c(t, q) \log P(t \mid \theta_D) = \frac{1}{|q|} \log P(q \mid \theta_D)$$

so ranking by negative KLD with an MLE query model is rank-equivalent to QL.
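A small sketch of KLD-based scoring (term names and probabilities are illustrative; the document model is assumed to be pre-smoothed so all query terms have nonzero probability):

```python
import math

def neg_kld_score(query_model, doc_model):
    """sum_t P(t|theta_q) * log P(t|theta_D); the query-entropy term is
    dropped since it is constant across documents."""
    return sum(p_q * math.log(doc_model[t])
               for t, p_q in query_model.items() if p_q > 0)

# With an MLE query model, this ranks documents exactly like QL.
query_model = {"information": 0.5, "retrieval": 0.5}
doc_model = {"information": 0.11, "retrieval": 0.32}
print(neg_kld_score(query_model, doc_model))
```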
How to improve retrieval?
- Clustering search results: group top-ranked results into different topics and show only a few results per topic, to avoid the top-ranked results being biased toward one particular topic.
- Using clusters to improve document representation: because a document is (relatively) short, its representation can be boosted by taking into account the clusters/topics it belongs to.
Cluster-based Document Model (Liu & Croft, SIGIR '04)
- Use k-means for clustering, with unigrams as features.
- Represent a cluster as a language model:

$$P(t \mid \text{cluster}) = \frac{c(t, \text{cluster})}{\sum_{t_i} c(t_i, \text{cluster})}$$

- Smooth a document D's MLE model using both the corpus model (the same as QL) and the model of the cluster D belongs to:

$$P(t \mid \theta_D) = \lambda_1 P_{MLE}(t \mid \theta_D) + \lambda_2 P(t \mid \text{cluster}) + (1 - \lambda_1 - \lambda_2) P(t \mid \text{corpus})$$
LDA-based Document Model (Wei & Croft, SIGIR '06)
- Similar to the cluster-based document model.
- Smooth a document's MLE model with the corpus model and a mixture model of the document's topics.
Pseudo-relevance Feedback
- Pseudo-relevance feedback (PRF), also called blind feedback:
  - Do an initial search using a regular approach, such as QL.
  - Assume the top k ranked results are relevant.
  - Perform relevance feedback based on those top k results, normally by query expansion.
  - Re-run the expanded query.
- A few practical issues: the relevance assumption itself, and efficiency concerns from expanding a short query (2-3 words) into a long one (e.g., ~50 words).
- Practically effective for improving overall search effectiveness (in terms of mean values of effectiveness metrics). Our focus today.
RM1. Derive the joint term/query probability under the relevance model R:

$$P(t, q \mid R) = \sum_{D \in R} P(D \mid R)\, P(t, q \mid D, R) = \sum_{D \in R} P(D \mid R)\, P(t \mid D, R)\, P(q \mid D, R) = \sum_{D \in R} P(D \mid R)\, P(t \mid D) \prod_{q_i \in q} P(q_i \mid D)$$

Assumptions:
- A1: P(D | R) is uniform.
- A2: t and q are independent given D.
- A3: P(t | D, R) = P(t | D).
- A4: P(q | D, R) = Π_{q_i ∈ q} P(q_i | D).
RM1:

$$P(t \mid q, R) \propto P(t, q \mid R) \propto \sum_{D \in R} P(t \mid D) \prod_{q_i \in q} P(q_i \mid D)$$

Computation:
- Iterate over each feedback document (source) D.
- Assign a weight to D: for PRF, retrieve the top k results by QL and weight each document by its QL probability, so higher-ranked results get more weight.
- Expand a term t from D with weight $P(t \mid D) \prod_{q_i \in q} P(q_i \mid D)$.
- Sum the term's weights over the feedback documents.
- Normalize the term weights into a probability:

$$P(t_j \mid q, R) = \frac{\sum_{D \in R} P(t_j \mid D) \prod_{q_i \in q} P(q_i \mid D)}{\sum_{j'} \sum_{D \in R} P(t_{j'} \mid D) \prod_{q_i \in q} P(q_i \mid D)}$$
RM2:

$$P(t \mid q, R) \propto \prod_{q_i \in q} \sum_{D \in R} P(t \mid D)\, P(q_i \mid D)$$

Computation:
- Iterate over each query term q_i.
- Iterate over each feedback document D and assign it the weight P(q_i | D).
- Expand a term t from D with weight P(t | D) P(q_i | D): if both t and q_i occur frequently in D, t gets a greater weight.
- Sum the weights over the documents, then multiply the results across the query terms q_i.
- Normalize the term weights into a probability.
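A minimal sketch of RM1 expansion under the setup above. It assumes the smoothed models P(t | D) of the top-k feedback documents are already given; the small floor for unseen query terms is my assumption, not from the slides:

```python
def rm1(doc_models, query):
    """RM1: P(t|q,R) ~ sum_D P(t|D) * prod_i P(q_i|D), normalized.

    doc_models: list of dicts mapping term -> smoothed P(t|D).
    query: list of query terms.
    """
    weights = {}
    for doc_model in doc_models:
        # P(q|D): product of query-term likelihoods under this document model
        q_likelihood = 1.0
        for q_i in query:
            q_likelihood *= doc_model.get(q_i, 1e-10)  # assumed floor
        for t, p_t in doc_model.items():
            weights[t] = weights.get(t, 0.0) + p_t * q_likelihood
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}
```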
Comparing Different Approaches (Lv & Zhai, CIKM '09)
Pseudo-relevance Feedback: Discussion
- Usually believed to be a useful technique, but somewhat controversial:
  - Recall-oriented; limited improvements in precision at the top.
  - Can make good queries bad, and bad queries worse.
  - Overall improvements are average values of metrics, but improving bad/difficult queries may be more important.
  - Search efficiency concerns.
  - Difficult to control; unpredictable for the user.
  - Difficult to improve in a noisy corpus (such as a web corpus); one option is using a clean corpus for query expansion, e.g., Wikipedia.
Average Precision (AP): example (5 relevant documents in total). Ranking #1:
- Rank = 1, precision = 1
- Rank = 3, precision = 2/3
- Rank = 6, precision = 3/6
- Rank = 9, precision = 4/9
- Rank = 10, precision = 5/10
- AP = 1/5 × (1 + 2/3 + 3/6 + 4/9 + 5/10)
Ranking #2:
- Rank = 2, precision = 1/2
- Rank = 5, precision = 2/5
- Rank = 6, precision = 3/6
- Rank = 7, precision = 4/7
- One relevant result is not retrieved; we treat it as retrieved at rank = ∞, i.e., precision = 0.
- AP = 1/5 × (1/2 + 2/5 + 3/6 + 4/7 + 0)
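A minimal sketch that computes AP for both rankings above (unretrieved relevant documents simply contribute nothing to the sum):

```python
def average_precision(relevant_ranks, num_relevant):
    """AP given the ranks at which relevant documents were retrieved."""
    ap = 0.0
    for i, rank in enumerate(sorted(relevant_ranks), start=1):
        ap += i / rank  # precision at this relevant document's rank
    return ap / num_relevant

print(average_precision([1, 3, 6, 9, 10], 5))  # Ranking #1: ~0.622
print(average_precision([2, 5, 6, 7], 5))      # Ranking #2: ~0.394
```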
$$DCG@k = \sum_{i=1}^{k} \frac{2^{r_i} - 1}{\log_2 (i + 1)}$$

where $r_i$ is the graded relevance of the result at rank $i$.
$$nDCG@k = \frac{DCG@k}{IDCG@k}$$
For a ranking with relevance grades (1, 3, 0, 0, 3):

$$DCG@5 = \frac{2^1 - 1}{\log_2 2} + \frac{2^3 - 1}{\log_2 3} + \frac{0}{\log_2 4} + \frac{0}{\log_2 5} + \frac{2^3 - 1}{\log_2 6}$$

The ideal ordering of the relevant documents is (3, 3, 1, 1, 0):

$$IDCG@5 = \frac{2^3 - 1}{\log_2 2} + \frac{2^3 - 1}{\log_2 3} + \frac{2^1 - 1}{\log_2 4} + \frac{2^1 - 1}{\log_2 5} + \frac{0}{\log_2 6}$$

$$nDCG@5 = \frac{DCG@5}{IDCG@5}$$
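A small sketch that evaluates the example above:

```python
import math

def dcg_at_k(grades, k):
    """DCG@k with gain 2^r - 1 and log2(rank + 1) discount."""
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(grades[:k], start=1))

def ndcg_at_k(grades, ideal_grades, k):
    return dcg_at_k(grades, k) / dcg_at_k(ideal_grades, k)

print(ndcg_at_k([1, 3, 0, 0, 3], [3, 3, 1, 1, 0], 5))  # ~0.66
```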
User    Sequence
User 1  A -> B -> C
User 2  B -> C -> A
User 3  C -> A -> B
User 4  A -> B -> C
User 5  B -> C -> A
User 6  C -> A -> B
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
$$R^2_{\mathrm{adjusted}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - p - 1)}{\sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)}$$
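A minimal sketch of both statistics (the observations and fitted values below are illustrative):

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y = [3, 4, 5, 6]              # illustrative observations
y_hat = [2.8, 4.2, 5.1, 5.9]  # illustrative fitted values
r2 = r_squared(y, y_hat)
print(r2, adjusted_r_squared(r2, n=4, p=1))
```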
        Lucene  Galago  Indri
Task 1  3.40    3.72    3.65
Task 2  3.56    3.48    3.82
      Lucene  Galago  Indri
Mean  3.40    3.82    3.60
Source         DF  Adj SS   Adj MS   F-Value  P-Value
Task           2   8.222    4.1111   1.71     0.2094
System         2   20.222   10.1111  4.20     0.0318
Task * System  4   46.222   11.5556  4.80     0.0082
Error          18  43.333   2.4074
Total          26  118.000
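A sketch of how a two-way ANOVA with this layout (3 tasks × 3 systems × 3 ratings per cell, 27 observations) can be run with statsmodels; the scores here are made up, so the output will not match the table above:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Illustrative data in long format: one row per (task, system, rating).
rng = np.random.default_rng(0)
rows = [(task, system, int(rng.integers(1, 6)))
        for task in ("Task 1", "Task 2", "Task 3")
        for system in ("Lucene", "Galago", "Indri")
        for _ in range(3)]
df = pd.DataFrame(rows, columns=["task", "system", "score"])

# Main effects plus the Task * System interaction.
model = ols("score ~ C(task) * C(system)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```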