6 docid score

8 Original:          O'Neal averaged 15.2 points 9.2 rebounds and 1.0 assists per game
  Lowercased:        o'neal averaged 15.2 points 9.2 rebounds and 1.0 assists per game
  Stopwords removed: o'neal averaged 15.2 points 9.2 rebounds 1.0 assists per game
  Stemmed:           o'neal averag 15.2 point 9.2 rebound 1.0 assist per game

13 Original      Porter2    Krovetz
   organization  organ      organization
   organ         organ      organ
   heading       head       heading
   head          head       head
   european      european   europe
   europe        europ      europe
   urgency       urgenc     urgent
   urgent        urgent     urgent

14 D

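15 $\vec{x} = (x_1, \dots, x_k)$, $\vec{y} = (y_1, \dots, y_k)$
$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \frac{\sum_{i=1}^{k} x_i y_i}{\sqrt{\sum_{i=1}^{k} x_i^2}\,\sqrt{\sum_{i=1}^{k} y_i^2}}$

16 $\vec{x} = (x_1, \dots, x_k)$, $\vec{y} = (y_1, \dots, y_k)$
$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \sum_{i=1}^{k} \frac{x_i}{|\vec{x}|} \cdot \frac{y_i}{|\vec{y}|} = \frac{\vec{x}}{|\vec{x}|} \cdot \frac{\vec{y}}{|\vec{y}|}$

17 $\vec{q} = (q_1, \dots, q_k)$, $\vec{d} = (d_1, \dots, d_k)$
$\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_{i=1}^{k} q_i d_i}{\sqrt{\sum_{i=1}^{k} q_i^2}\,\sqrt{\sum_{i=1}^{k} d_i^2}}$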
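
The cosine measure between a query vector and a document vector can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function name `cosine`, the zero-vector convention, and the example vectors are assumptions.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_x * norm_y)

# Hypothetical term-weight vectors over a three-term vocabulary
q = [1.0, 1.0, 0.0]
d = [2.0, 0.0, 1.0]
print(cosine(q, d))
```

Because the query norm is the same for every document, dropping it does not change the ranking for a fixed query.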

18 $w_i = TF(t_i, D) \times IDF(t_i)$

19 $TF_{binary}(t_i, D) = 1$ if $c(t_i, D) > 0$; $0$ if $c(t_i, D) = 0$

20 $TF_{raw}(t_i, D) = c(t_i, D)$

21 Log-scaled TF: $TF(t_i, D) = 1 + \log_b c(t_i, D)$ if $c(t_i, D) > 0$; $0$ if $c(t_i, D) = 0$

22 [Plot comparing $y = x$ with logarithmic curves, including $y = \log_2 x$, $y = \log x$, and $y = \log_{10} x$]

23 log log31.10 q

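25 $IDF_{uniform}(t_i) = 1$

26 $IDF_{KSJ}(t) = \log \frac{N}{n_t}$

27 $IDF_{BM25}(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5} \approx \log \frac{N - n_t}{n_t}$; $IDF_{BM25}(t) < 0$ when $n_t > N/2$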
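
The TF and IDF variants can be combined into a term weight as TF times IDF. The sketch below is illustrative only: the function names, the default log base of 10 for the log-scaled TF, and the use of the natural log for IDF are assumptions, since the slides leave the base open.

```python
import math

def tf_binary(c):
    """1 if the term occurs in the document, else 0."""
    return 1.0 if c > 0 else 0.0

def tf_raw(c):
    """Raw term count c(t, D)."""
    return float(c)

def tf_log(c, base=10):
    """Log-scaled TF: 1 + log_b c(t, D) for c > 0, else 0 (base is an assumption)."""
    return 1.0 + math.log(c, base) if c > 0 else 0.0

def idf_ksj(N, n_t):
    """log(N / n_t), with N documents in the corpus and n_t containing the term."""
    return math.log(N / n_t)

def idf_bm25(N, n_t):
    """log((N - n_t + 0.5) / (n_t + 0.5)); negative when n_t > N / 2."""
    return math.log((N - n_t + 0.5) / (n_t + 0.5))

def tfidf(c, N, n_t):
    """One possible combination: log-scaled TF times KSJ IDF."""
    return tf_log(c) * idf_ksj(N, n_t)

# Hypothetical counts: term occurs 3 times, in 50 of 10,000 documents
print(tfidf(3, 10_000, 50))
```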

28 Multivariate Bernoulli Distribution
Results of n independent Bernoulli trials {X_1, ..., X_n}
  n different coins; toss each coin once
  Each coin may have a different probability of heads/tails
  X_1: tails, X_2: heads, X_3: heads, X_4: heads
Different from a binomial distribution with size n
  Toss the same coin n times (each toss independent of the others)

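29 A Multivariate Bernoulli Model of Document
Consider a document D as the outcome of the model
Let $X$ be document D's binary term occurrence vector
What is the probability of $X$ under this MB model?
Model table columns: $X_i$, $P(X_i = 1)$, $P(X_i = 0)$; rows: index, retrieval, search, information, data, computer, science
D: index=1 retrieval=1 search=0 information=1 data=0 computer=1 science=0
$P(D) = P(X_{index} = 1)\,P(X_{retrieval} = 1)\,P(X_{search} = 0)\,P(X_{information} = 1)\,P(X_{data} = 0)\,P(X_{computer} = 1)\,P(X_{science} = 0)$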

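30 Naïve Bayes Classification using MB Models
Model table columns: $X_i$, $P(X_i = 1 \mid IR)$, $P(X_i = 1 \mid DB)$; rows: index, search, information, data, computer, relevance, SQL
$P(IR) = 0.3$, $P(DB) = 0.7$
$P(IR \mid D) \propto P(IR) \prod_{X_i \in V} P(X_i = D_i \mid IR)$
$P(DB \mid D) \propto P(DB) \prod_{X_i \in V} P(X_i = D_i \mid DB)$
$\frac{P(IR \mid D)}{P(DB \mid D)} = 6.33$: D is 6.33 times more likely to be an IR article than a DB one.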
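
A minimal sketch of Naïve Bayes classification with multivariate Bernoulli models follows. The priors P(IR) = 0.3 and P(DB) = 0.7 come from the slide; the three-term vocabulary and its per-class probabilities are hypothetical, so the resulting odds will not match the slide's 6.33.

```python
import math

def mb_log_likelihood(doc_terms, p_present):
    """log P(D | class) under a multivariate Bernoulli model.

    p_present maps every vocabulary term to P(X_i = 1 | class); terms absent
    from the document contribute log(1 - p).
    """
    total = 0.0
    for term, p in p_present.items():
        total += math.log(p if term in doc_terms else 1.0 - p)
    return total

def posterior_odds(doc_terms, prior_a, model_a, prior_b, model_b):
    """P(a | D) / P(b | D), computed in log space for numerical stability."""
    log_a = math.log(prior_a) + mb_log_likelihood(doc_terms, model_a)
    log_b = math.log(prior_b) + mb_log_likelihood(doc_terms, model_b)
    return math.exp(log_a - log_b)

# Hypothetical P(X_i = 1 | class) values, for illustration only
ir_model = {"index": 0.8, "search": 0.9, "SQL": 0.1}
db_model = {"index": 0.6, "search": 0.2, "SQL": 0.8}
doc = {"index", "search"}  # the document contains these terms and not "SQL"

odds = posterior_odds(doc, 0.3, ir_model, 0.7, db_model)
print(odds)
```

Note that, unlike a multinomial model, absent vocabulary terms also contribute to the likelihood.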

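31 $\log \frac{P(R \mid D, Q)}{P(NR \mid D, Q)} \overset{rank}{=} \sum_{t_i \in Q \cap D} \log \frac{p_i (1 - q_i)}{(1 - p_i)\, q_i}$
where $p_i = P(X_i = 1 \mid R, Q)$ and $q_i = P(X_i = 1 \mid NR, Q)$
$\sum_{t_i \in Q \cap D} \log \frac{p_i (1 - q_i)}{(1 - p_i)\, q_i} \approx \sum_{t_i \in Q \cap D} \log \frac{1 - q_i}{q_i} \approx \sum_{t_i \in Q \cap D} \log \frac{N - n_i + 0.5}{n_i + 0.5}$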

32 $score_{BM25}(d, q) = \sum_{q_i \in q} weight_{BM25}(q_i, d)$
$weight_{BM25}(q_i, d) = \frac{(k_1 + 1)\, tf_{q_i,d}}{k_1 \left(1 - b + b \frac{dl}{avdl}\right) + tf_{q_i,d}} \times \log \frac{N - n_i + 0.5}{n_i + 0.5}$
$tf_{q_i,d}$: the raw frequency of $q_i$ in $d$
$N$: the total number of documents in the corpus
$dl$: the length of the document $d$
$avdl$: the average length of documents in the corpus
$n_i$: the document frequency of $q_i$
$k_1$ and $b$ are two parameters
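
The BM25 scoring function can be sketched as below. The function names and the defaults `k1=1.2` and `b=0.75` are assumptions (commonly used values, not taken from the slides), and the corpus statistics in the example are hypothetical.

```python
import math

def bm25_weight(tf, dl, avdl, N, n, k1=1.2, b=0.75):
    """BM25 weight of one query term in one document.

    tf: raw frequency of the term in the document; dl: document length;
    avdl: average document length; N: corpus size; n: document frequency.
    """
    if tf == 0:
        return 0.0
    idf = math.log((N - n + 0.5) / (n + 0.5))
    tf_part = (k1 + 1) * tf / (k1 * (1 - b + b * dl / avdl) + tf)
    return tf_part * idf

def bm25_score(query_terms, doc_tf, dl, avdl, N, df, k1=1.2, b=0.75):
    """Sum of BM25 weights over the query terms (terms without df are skipped)."""
    return sum(
        bm25_weight(doc_tf.get(t, 0), dl, avdl, N, df[t], k1, b)
        for t in query_terms
        if t in df
    )

# Hypothetical statistics for a two-term query
df = {"information": 200, "retrieval": 30}
doc_tf = {"information": 2, "retrieval": 3}
print(bm25_score(["information", "retrieval"], doc_tf, dl=120, avdl=100, N=10_000, df=df))
```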

33 $k_1$ determines the upper bound of the TF component of $weight_{BM25}$:
$\lim_{tf \to \infty} \frac{(k_1 + 1)\, tf}{k_1 \left(1 - b + b \frac{dl}{avdl}\right) + tf} = k_1 + 1$
For an average-length document ($dl = avdl$):
$TF = \frac{(k_1 + 1)\, tf}{k_1 \left(1 - b + b \frac{dl}{avdl}\right) + tf} = \frac{(k_1 + 1)\, tf}{k_1 + tf}$

34 For a longer-than-average document ($dl > avdl$), raw tf will be penalized:
$TF = \frac{(k_1 + 1)\, tf}{k_1 \left(1 - b + b \frac{dl}{avdl}\right) + tf} = \frac{(k_1 + 1)\, tf}{k_1 + tf + k_1 b \frac{dl - avdl}{avdl}} < \frac{(k_1 + 1)\, tf}{k_1 + tf}$
For a shorter-than-average document ($dl < avdl$), raw tf will be boosted:
$TF = \frac{(k_1 + 1)\, tf}{k_1 \left(1 - b + b \frac{dl}{avdl}\right) + tf} = \frac{(k_1 + 1)\, tf}{k_1 + tf + k_1 b \frac{dl - avdl}{avdl}} > \frac{(k_1 + 1)\, tf}{k_1 + tf}$

35-37 [Plots of the BM25 TF component $\frac{tf}{k_1 \left(1 - b + b \frac{dl}{avdl}\right) + tf}$ against $tf$]

38 Language model $\theta$ over vocabulary $V$: $\sum_{t \in V} P(t \mid \theta) = 1$
   t            P(t|θ)
   index        0.21
   retrieval    0.32
   search       0.18
   information  0.11
   data         0.06
   computer     0.04
   science      0.08

39 $P(D \mid \theta) = \prod_{i=1}^{n} P(t_i \mid \theta)$, with the same $P(t \mid \theta)$ table as on slide 38

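40 Table columns: t, c(t,d), P(t|IR), P(t|DB); rows: index, retrieval, search, information, data, computer, science
Prior Prob.: P(IR) = 0.3, P(DB) = 0.7
$P(IR \mid D) \propto P(IR) \prod_{t \in D} P(t \mid IR)^{c(t, D)}$
$P(DB \mid D) \propto P(DB) \prod_{t \in D} P(t \mid DB)^{c(t, D)}$
Classify by the ratio $\frac{P(IR \mid D)}{P(DB \mid D)}$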

41 $P(q \mid \theta_D) = \prod_{t \in q} P(t \mid \theta_D)$
$\log P(q \mid \theta_D) = \sum_{t \in q} \log P(t \mid \theta_D)$

42 Recap: The Query likelihood model (QL)
Each document is generated from a document LM $\theta_D$
Estimate a language model $\theta_D$ for the document D
Rank documents by $P(q \mid \theta_D)$
Example: QL ranks $D_1$ higher than $D_2$ for the query "information retrieval"
$P(q \mid \theta_{D_1}) = \prod_{t \in q} P(t \mid \theta_{D_1}) = ?$
$P(q \mid \theta_{D_2}) = \prod_{t \in q} P(t \mid \theta_{D_2}) = ?$
   t            P(t|θ_D1)   P(t|θ_D2)
   index        0.21        0.17
   retrieval    0.32        0.05
   search       0.18        0.22
   information  0.11        0.12
   data         0.06        0.33
   computer     0.04        0.08
   science      0.08        0.03
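
The example can be checked with a short sketch that scores the query "information retrieval" against the two document models listed on the slide. The function name is an assumption, and the code assumes every query term has nonzero probability in the model (no smoothing yet).

```python
import math

# Document language models copied from the slide
D1 = {"index": 0.21, "retrieval": 0.32, "search": 0.18, "information": 0.11,
      "data": 0.06, "computer": 0.04, "science": 0.08}
D2 = {"index": 0.17, "retrieval": 0.05, "search": 0.22, "information": 0.12,
      "data": 0.33, "computer": 0.08, "science": 0.03}

def query_log_likelihood(query_terms, model):
    """log P(q | theta_D) = sum over query terms of log P(t | theta_D)."""
    return sum(math.log(model[t]) for t in query_terms)

query = ["information", "retrieval"]
p1 = math.exp(query_log_likelihood(query, D1))  # 0.11 * 0.32
p2 = math.exp(query_log_likelihood(query, D2))  # 0.12 * 0.05
print(p1, p2, "D1 ranks higher" if p1 > p2 else "D2 ranks higher")
```

Working in log space avoids underflow for longer queries and does not change the ranking.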

43 $\hat{P}_{MLE}(t \mid D) = \frac{c(t, D)}{|D|} = \frac{\text{term frequency}}{\text{document length}}$
Example: $\hat{P}_{MLE}(t \mid D)$ for the terms information, retrieval, an, for, technique, is, important of a sample document D

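44 $\hat{P}_{MLE}(t \mid D) = \frac{c(t, D)}{|D|} = \frac{\text{term frequency}}{\text{document length}}$
$\hat{P}_{MLE}(t \mid corpus) = \frac{\sum_{D} c(t, D)}{\sum_{D} |D|} = \frac{\text{corpus term frequency}}{\text{corpus length}}$
$\hat{P}_{MLE}(X_i = 1 \mid corpus) = \frac{DF(t_i)}{N} = e^{-IDF(t_i)}$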

45 MLE: Recall Several Issues
Unseen words get zero probability
  As long as one query term does not appear in D, the document gets zero probability.
  Ranking by $P_{MLE}(q \mid D)$ is similar to Boolean AND
  But no occurrence does not mean impossible.
Limited sample size
  MLE is reasonably good for a large sample size.
  But we are estimating a document model, usually from just a few hundred or a few thousand words.
  In some cases (will cover next lecture), we also need to estimate a query model. The sample size is even smaller.
Solution: smoothing (will discuss a few slides later)

46 Jelinek-Mercer Smoothing
Start simple, but reasonably good
Using P(t|Corpus) as the background model
$P(t \mid \theta_D) = (1 - \lambda)\, P_{MLE}(t \mid D) + \lambda\, P(t \mid Corpus)$
Set λ to be constant for all documents, independent of any document or query characteristics
Tune λ to optimize retrieval performance
  e.g., maximize the mean value of an effectiveness metric such as AP over a set of different queries in a dataset.
  The optimal value of λ varies with different databases, query sets, etc.
  Correctly setting λ is very important

47 Jelinek-Mercer Smoothing
Example: λ = 0.5, |D| = [value missing], computing $P_{smoothed}(\text{the} \mid D)$
Table columns: word, freq, $P_{MLE}(* \mid D)$, $P(* \mid Corpus)$, Smoothed
Rows: the, soviet, chernobyl, disclosure, divert, downplaye, each, early

48 Dirichlet Smoothing
Problem with Jelinek-Mercer
  All documents have the same λ
  A longer document provides a larger sample, so its own MLE is more reliable
Make smoothing depend on sample size (adaptive)
$P(t \mid \theta_D) = \frac{c(t, D) + \mu\, P(t \mid Corpus)}{|D| + \mu}$
Here |D| is the length of the sample and μ is a constant
MLE weight: $1 - \frac{\mu}{|D| + \mu}$; Corpus weight: $\frac{\mu}{|D| + \mu}$

49 Dirichlet Smoothing
Example: μ = 500, |D| = 1281, computing $P_{smoothed}(\text{the} \mid D)$
Table columns: word, freq, $P_{MLE}(* \mid D)$, $P(* \mid Corpus)$, Smoothed
Rows: the, soviet, chernobyl, disclosure, divert, downplaye, each, early
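
Both smoothing methods can be sketched as one-line estimators. The sketch assumes that λ weights the corpus model in Jelinek-Mercer smoothing and that μ acts as a pseudo-count in Dirichlet smoothing. The settings λ = 0.5, μ = 500, and |D| = 1281 come from the example slides; the term count and corpus probability are hypothetical because the table values did not survive.

```python
def p_mle(tf, doc_len):
    """Maximum likelihood estimate c(t, D) / |D|."""
    return tf / doc_len

def p_jm(tf, doc_len, p_corpus, lam=0.5):
    """Jelinek-Mercer: (1 - lambda) * P_MLE(t | D) + lambda * P(t | Corpus)."""
    return (1 - lam) * p_mle(tf, doc_len) + lam * p_corpus

def p_dirichlet(tf, doc_len, p_corpus, mu=500):
    """Dirichlet: (c(t, D) + mu * P(t | Corpus)) / (|D| + mu)."""
    return (tf + mu * p_corpus) / (doc_len + mu)

doc_len = 1281      # |D| from the Dirichlet example slide
tf = 12             # hypothetical count of the term in D
p_corpus = 0.004    # hypothetical P(t | Corpus)

print(p_mle(tf, doc_len), p_jm(tf, doc_len, p_corpus), p_dirichlet(tf, doc_len, p_corpus))
# An unseen term no longer gets zero probability:
print(p_jm(0, doc_len, p_corpus), p_dirichlet(0, doc_len, p_corpus))
```

The Dirichlet estimate equals a Jelinek-Mercer estimate with a document-dependent λ of μ / (|D| + μ), which is why longer documents rely more on their own counts.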

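50 Smoothing and IDF
Similar to many retrieval models, we can write QL as:
$score(q, D) = \sum_{t \in q} w(t, D)$, where $w(t, D) = \log P(t \mid \theta_D)$
Dirichlet smoothing: $P(t \mid \theta_D) = \frac{tf + \mu\, P(t \mid Corpus)}{|D| + \mu}$
JM smoothing: $P(t \mid \theta_D) = (1 - \lambda) \frac{tf}{|D|} + \lambda\, P(t \mid Corpus)$
$\frac{\partial w(t, D)}{\partial tf} = \frac{1}{P(t \mid \theta_D)} \cdot \frac{\partial P(t \mid \theta_D)}{\partial tf}$
Dirichlet: $\frac{\partial w(t, D)}{\partial tf} = \frac{1}{P(t \mid \theta_D)} \cdot \frac{1}{|D| + \mu}$
JM: $\frac{\partial w(t, D)}{\partial tf} = \frac{1}{P(t \mid \theta_D)} \cdot \frac{1 - \lambda}{|D|}$
tf is discrete, but let's just assume the functions are all continuous here.

51 Smoothing and IDF
No matter which smoothing method is employed, common words get a much higher $P_{smoothed}(t \mid D)$
As raw tf increases, the QL weight (score) of common words increases much more slowly than that of less common and rare words
Dirichlet: $\frac{\partial w(t, D)}{\partial tf} = \frac{1}{P(t \mid \theta_D)} \cdot \frac{1}{|D| + \mu}$
JM: $\frac{\partial w(t, D)}{\partial tf} = \frac{1}{P(t \mid \theta_D)} \cdot \frac{1 - \lambda}{|D|}$
For MLE: $P(t \mid D) = \frac{tf}{|D|}$, so $\frac{\partial w(t, D)}{\partial tf} = \frac{\partial \log (tf / |D|)}{\partial tf} = \frac{1}{tf}$, which depends only on tf and is the same for common and less common terms.
tf is discrete, but let's just assume the functions are all continuous here.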

52 [Plot: $\log P_{smoothed}$ against tf under Dirichlet smoothing, μ = [value missing]]
Long document, |D| = 5000; short document, |D| = 100.
Common word, P(t|corpus) = 0.01; less common word, P(t|corpus) = [value missing].
For common words, $\log P_{smoothed}$ increases much more slowly.

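53 $KLD(P \| Q) = \sum_i P_i \log \frac{P_i}{Q_i}$

54 $KLD(\theta_q \| \theta_D) = \sum_t P(t \mid \theta_q) \log \frac{P(t \mid \theta_q)}{P(t \mid \theta_D)} = \sum_t P(t \mid \theta_q) \log P(t \mid \theta_q) - \sum_t P(t \mid \theta_q) \log P(t \mid \theta_D)$

55 $-KLD(\theta_q \| \theta_D) = \sum_t P(t \mid \theta_q) \log P(t \mid \theta_D) - \sum_t P(t \mid \theta_q) \log P(t \mid \theta_q)$
When $\hat{P}_{MLE}(t \mid \theta_q) = \frac{c(t, q)}{|q|}$:
$\sum_t P(t \mid \theta_q) \log P(t \mid \theta_D) = \sum_t \frac{c(t, q)}{|q|} \log P(t \mid \theta_D) = \frac{1}{|q|} \log P_{QL}(q \mid \theta_D)$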
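
The link between KL-divergence ranking and query likelihood can be checked numerically. In this sketch, an MLE query model is built from the query "information retrieval" and compared with two document models (values borrowed from the QL recap slide); the function names are assumptions. With an MLE query model, the query-entropy term is constant across documents, so ranking by negative KLD matches ranking by QL.

```python
import math
from collections import Counter

def mle_model(terms):
    """MLE language model: c(t, q) / |q| for each distinct term."""
    counts = Counter(terms)
    return {t: c / len(terms) for t, c in counts.items()}

def kld(p, q):
    """KLD(p || q) = sum_t p(t) * log(p(t) / q(t)); assumes q(t) > 0 wherever p(t) > 0."""
    return sum(pt * math.log(pt / q[t]) for t, pt in p.items() if pt > 0)

D1 = {"index": 0.21, "retrieval": 0.32, "search": 0.18, "information": 0.11,
      "data": 0.06, "computer": 0.04, "science": 0.08}
D2 = {"index": 0.17, "retrieval": 0.05, "search": 0.22, "information": 0.12,
      "data": 0.33, "computer": 0.08, "science": 0.03}

query = ["information", "retrieval"]
theta_q = mle_model(query)

# Smaller divergence means a better match
print(kld(theta_q, D1), kld(theta_q, D2))
```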

56 How to improve retrieval?
Clustering search results
  Group top-ranked results into different topics
  Show only a few results for each topic
  This avoids top-ranked results that are biased towards only one particular topic
Using clusters to improve document representation
  Because a document is (relatively) short
  Document representation is improved by taking into account the clusters/topics it belongs to

57 Cluster-based Document Model (Liu & Croft, SIGIR '04)
Using k-means for clustering; unigram as features
Represent a cluster as a language model:
$P(t \mid cluster) = (1 - \lambda) \frac{c(t, cluster)}{\sum_{t_i} c(t_i, cluster)} + \lambda\, P(t \mid corpus)$
Smooth a document D's MLE model using
  The corpus model (the same as QL)
  The model of the cluster D belongs to
$P(t \mid \theta_D) = \lambda_1 P_{MLE}(t \mid D) + \lambda_2 P(t \mid cluster) + (1 - \lambda_1 - \lambda_2)\, P(t \mid corpus)$

58 LDA-based Document Model (Wei & Croft, SIGIR '06)
Similar to the cluster-based document model
Smooth a document MLE model with
  The corpus model
  A mixture model of the document's topics

59 Pseudo-relevance Feedback
Pseudo-relevance feedback (PRF); blind feedback
  Do an initial search using a regular approach, such as QL
  Assume the top k ranked results are relevant
  Perform relevance feedback based on the top k results
    Normally by query expansion
  Re-run the query
A few practical issues
  The assumption
  Efficiency concerns: expand a short query (2-3 words) into a long one (e.g., ~50 words)
Practically effective for improving overall search effectiveness (in terms of the mean values of effectiveness metrics)
Our focus today

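60 RM1
$P(t, q \mid R) = \sum_{D \in R} P(D \mid R)\, P(t, q \mid D, R)$
$= \sum_{D \in R} P(D \mid R)\, P(t \mid D, R)\, P(q \mid D, R)$
$= \sum_{D \in R} P(D \mid R)\, P(t \mid D, R) \prod_{q_i \in q} P(q_i \mid D, R)$
$\propto \sum_{D \in R} P(t \mid D) \prod_{q_i \in q} P(q_i \mid D)$
The steps rely on assumptions A1, A2, A3, and A4.
Assumptions
  A1: $P(D \mid R)$ is uniform
  A2: [missing] and [missing]
  A3: [missing]
  A4: [missing]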

61 RM1
RM1: $P(t \mid q, R) \propto P(t, q \mid R) \propto \sum_{D \in R} P(t \mid D) \prod_{q_i \in q} P(q_i \mid D)$
Computation
  Iterate over each feedback document (source) D
  Assign a weight $\prod_{q_i \in q} P(q_i \mid D)$ to D
    In terms of PRF, we just retrieve the top k results by QL and weight each document by its QL probability
    Higher-ranked results get more weight
  Expand a term t from D by the weight P(t|D)P(q|D)
  Sum up each term's weights over the feedback documents
  Normalize the terms' weights to probabilities:
$P(t \mid q, R) = \frac{\sum_{D \in R} P(t \mid D) \prod_{q_i \in q} P(q_i \mid D)}{\sum_{t_j} \sum_{D \in R} P(t_j \mid D) \prod_{q_i \in q} P(q_i \mid D)}$
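
The RM1 computation steps can be sketched as follows. This is an illustrative implementation under assumptions: each feedback document is represented by an already smoothed model P(t|D) stored as a dict, P(D|R) is uniform, and the function name `rm1` is hypothetical. The two feedback models reuse the D1 and D2 tables from the QL recap slide.

```python
def rm1(query_terms, feedback_models):
    """Estimate P(t | q, R) from a list of feedback document models.

    feedback_models: list of dicts mapping term -> P(t | D), ideally smoothed
    so that every query term has nonzero probability.
    """
    weights = {}
    for model in feedback_models:
        # Document weight: query likelihood prod_i P(q_i | D)
        doc_weight = 1.0
        for q in query_terms:
            doc_weight *= model.get(q, 0.0)
        # Each term in D is expanded with weight P(t | D) * P(q | D)
        for t, p in model.items():
            weights[t] = weights.get(t, 0.0) + p * doc_weight
    total = sum(weights.values())
    if total == 0:
        return {}
    # Normalize the accumulated weights to a probability distribution
    return {t: w / total for t, w in weights.items()}

D1 = {"index": 0.21, "retrieval": 0.32, "search": 0.18, "information": 0.11,
      "data": 0.06, "computer": 0.04, "science": 0.08}
D2 = {"index": 0.17, "retrieval": 0.05, "search": 0.22, "information": 0.12,
      "data": 0.33, "computer": 0.08, "science": 0.03}

expansion = rm1(["information", "retrieval"], [D1, D2])
for term, prob in sorted(expansion.items(), key=lambda kv: -kv[1]):
    print(term, round(prob, 4))
```

In practice the top-weighted terms would be kept as expansion terms and usually interpolated with the original query, a step this sketch omits.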

62 RM2
RM2: $P(t \mid q, R) \propto \prod_{q_i \in q} \sum_{D_j \in R} P(t \mid D_j)\, P(q_i \mid D_j)$
Computation
  Iterate over each query term $q_i$
    Iterate over each feedback document D
      Assign a weight $P(q_i \mid D)$ to D
      Expand a term t from D by $P(t \mid D) P(q_i \mid D)$: if both t and $q_i$ occur frequently in D, t gets a greater weight
    Sum up the weight in each document
  Multiply the expansion weight for each $q_i$
  Normalize the terms' weights to probabilities

63 Comparing different approaches (Lv & Zhai, CIKM '09)

64 Pseudo-relevance Feedback
Usually believed to be a useful technique
But somewhat controversial
  Recall oriented; limited improvements in precision at the top
  Making good queries bad; making bad queries worse
    Overall improvements: average values of metrics?
    But improving bad/difficult queries may be more important
  Search efficiency concern
  Difficult to control; unpredictable for the user
  Difficult to improve in a noisy corpus (such as a web corpus)
    Using some clean corpus for query expansion, e.g., Wikipedia

71 Average Precision (AP): example
Ranking #1 (relevant documents at ranks 1, 3, 6, 9, and 10)
  Rank = 1, precision = 1
  Rank = 3, precision = 2/3
  Rank = 6, precision = 3/6
  Rank = 9, precision = 4/9
  Rank = 10, precision = 5/10
AP = 1/5 × (1 + 2/3 + 3/6 + 4/9 + 5/10) ≈ 0.62

72 Average Precision (AP): example
Ranking #2
AP = ?

73 Average Precision (AP): example
Ranking #2 (relevant documents at ranks 2, 5, 6, and 7)
  Rank = 2, precision = 1/2
  Rank = 5, precision = 2/5
  Rank = 6, precision = 3/6
  Rank = 7, precision = 4/7
  One relevant result is not retrieved: we can consider its retrieved rank = ∞, precision = 0
AP = 1/5 × (1/2 + 2/5 + 3/6 + 4/7 + 0) ≈ 0.39
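
The two worked examples can be reproduced with a short function. The function name and the binary-list encoding of a ranking are assumptions; Ranking #2 is assumed to contain ten results, which the slide does not state.

```python
def average_precision(ranked_relevance, num_relevant):
    """Average precision for one query.

    ranked_relevance: list of 0/1 flags, one per retrieved result in rank order.
    num_relevant: total number of relevant documents for the query, including
    any that were never retrieved (they contribute precision 0).
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant if num_relevant else 0.0

ranking1 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]  # relevant at ranks 1, 3, 6, 9, 10
ranking2 = [0, 1, 0, 0, 1, 1, 1, 0, 0, 0]  # relevant at ranks 2, 5, 6, 7

print(average_precision(ranking1, 5))
print(average_precision(ranking2, 5))
```

Dividing by the total number of relevant documents, rather than by the number retrieved, is what penalizes the missing fifth relevant document in Ranking #2.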

78-79 $DCG@k = \sum_{i=1}^{k} \frac{2^{r_i} - 1}{\log_2 (i + 1)}$

80 $nDCG@k = \frac{DCG@k}{IDCG@k}$

81 [Worked DCG example with discounts log 2, log 3, log 4, log 5, ...; gain values missing]
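
DCG@k and nDCG@k can be sketched as below, assuming the exponential gain $2^{r_i} - 1$ and the $\log_2(i + 1)$ discount. The function names and the graded judgments are hypothetical. As a simplification, the ideal ranking is built only from the judgments in the given list; a full evaluation would use all judged documents for the query.

```python
import math

def dcg_at_k(gains, k):
    """DCG@k with gain 2^r - 1 and discount log2(i + 1), ranks starting at 1."""
    return sum(
        (2 ** r - 1) / math.log2(i + 1)
        for i, r in enumerate(gains[:k], start=1)
    )

def ndcg_at_k(gains, k):
    """nDCG@k = DCG@k / IDCG@k; the ideal order is the gains sorted descending."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance judgments (0 to 3) in rank order
ranking = [3, 0, 2, 1, 0]
print(dcg_at_k(ranking, 5), ndcg_at_k(ranking, 5))
```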

92 User    Sequence          User    Sequence
   User 1  A -> B -> C       User 4  A -> B -> C
   User 2  B -> C -> A       User 5  B -> C -> A
   User 3  C -> A -> B       User 6  C -> A -> B

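99 $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$

100 $R^2_{Adjusted} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - p - 1)}{\sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)}$, where $p$ is the number of predictors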
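
Both statistics can be computed directly from observed and predicted values. This sketch assumes the common convention in which p counts the predictors excluding the intercept, so the residual degrees of freedom are n - p - 1; the function names and data are hypothetical.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    p is assumed to be the number of predictors, excluding the intercept.
    """
    n = len(y)
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 observations")
    return 1 - (1 - r_squared(y, y_hat)) * (n - 1) / (n - p - 1)

# Hypothetical observations and model predictions
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
print(r_squared(y, y_hat), adjusted_r_squared(y, y_hat, p=1))
```

The adjustment penalizes extra predictors, so adjusted R² never exceeds R² when p is at least 1 and R² is at most 1.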

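102 [Table: results by task; columns Lucene, Galago, Indri; one row per Task; values missing]

103 [Table: columns Lucene, Galago, Indri; row Mean; values missing]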

106-107 ANOVA table (values missing)
   Columns: Source, DF, Adj SS, Adj MS, F-Value, P-Value
   Rows: Task, System, Task * System, Error, Total

CS 646 (Fall 2016) Homework 3

Deadline: 11:59pm, Oct 31st, 2016 (EST)
Access the following resources before you start working on HW3:
Download and uncompress the index file and other data from Moodle.