A Latent Variable Graphical Model Derivation of Diversity for Set-based Retrieval

Keywords: Graphical Models::Directed Models, Foundations::Other Foundations

Abstract

Diversity has been heavily motivated as an objective criterion for result sets in the information retrieval literature, and various ad-hoc heuristics have been proposed to explicitly optimize for it. In this paper, we start from first principles and show that optimizing a simple criterion of set-based relevance in a latent variable graphical model (a framework we refer to as probabilistic latent accumulated relevance, or PLAR) leads to diversity as a naturally emergent property of the solution. PLAR derives variants of latent semantic indexing (LSI) kernels for relevance and diversity and does not require ad-hoc tuning parameters to balance them. PLAR also directly motivates the general form of many other ad-hoc diversity heuristics in the literature, albeit with important modifications that we show can lead to improved performance on a diversity testbed from the TREC 6-8 Interactive Track.

1 Introduction

One of the basic tenets of set-based information retrieval is to minimize redundancy (hence maximize diversity) in the result set to increase the chance that the results will contain items relevant to the user's query [9]. However, among the numerous proposals for result set diversification in the literature, most typically employ ad-hoc heuristics for balancing relevance and diversity. In this paper, we argue that diversification should emerge naturally from the optimization of a simple set-based relevance criterion and hence should form an integral part of relevance rather than a separate ad-hoc objective. By formalizing set-based retrieval as a sequence of query optimizations in a latent variable graphical model (a framework we refer to as probabilistic latent accumulated relevance, or PLAR) we show that we can indeed derive diversity as an emergent phenomenon.

One of PLAR's key contributions is to show that we can derive (and formally motivate) variants of many ad-hoc relevance and diversity heuristics commonly used in the information retrieval literature. Here we briefly discuss these areas along with the formal contributions of PLAR:

- Latent semantic indexing (LSI) [8] is a fundamental technique for determining document similarity in a latent space. PLAR directly derives an LSI-like latent similarity kernel as well as a novel variant LSI-like kernel for latent diversity.

- Maximal marginal relevance (MMR) [5] is one of the staples of result set reranking and diversification in the information retrieval literature. PLAR derives a variant of MMR that uses a softer expected marginal relevance in place of the original maximal marginal relevance.

- Set-covering heuristics that ensure a result set optimally covers words or topics are a common heuristic for diversity and can be directly optimized in a supervised setting [20]. While PLAR is unsupervised, its derivation of expected marginal relevance has a direct interpretation as a latent topic set-covering approach.

- Portfolio theory is used in the information retrieval literature to motivate result set diversification based on the choice of uncorrelated items [18] to minimize investment risk (e.g., if one item is suboptimal, the others will not be related). PLAR can be directly interpreted as minimizing an expected risk metric derived from latent topic correlations of items in the result set.
The rest of the paper proceeds as follows: we begin with the formal problem definition for set-based information retrieval and then introduce the PLAR framework and derive its solution; next we discuss related work on diversity metrics, in particular the relation of PLAR to the prototypical but ad-hoc methods discussed above for balancing relevance and diversity; finally, we empirically demonstrate the improved performance of PLAR over all of these ad-hoc diversification objectives on a diversity testbed from the TREC 6-8 Interactive Track.

2 Probabilistic Latent Accumulated Relevance

Our objective in this section is to answer the following questions from first principles: suppose we have a set of items $D$ (e.g., in information retrieval (IR), $D$ could be a set of documents) and we want to recommend a small subset $S_k \subseteq D$ (where $|S_k| = k$ and $k \ll |D|$) relevant to a given query $q$ (itself an item; e.g., in IR, $q$ would be a document-like list of terms); then (Q1) how do we define an appropriate set-based relevance criterion for $S_k$, and (Q2) how do we efficiently find the set $S_k^*$ that optimizes it?

We start with question (Q1) and seek to provide an optimization criterion for set-based retrieval under relevance uncertainty; while each document is either relevant or not, we do not directly observe document relevance and instead treat it as a random variable. Thus we formalize document relevance in a latent variable graphical model as follows:

[Figure 1: Graphical model used in PLAR. Observed (shaded) nodes: the query $q$ and the selected items $s_1, s_2, s_3, \ldots$; latent (unshaded) nodes: the query topic $t'$, the item topics $t_1, t_2, t_3, \ldots$, and the relevance indicators $r_1, r_2, r_3, \ldots$]

In Figure 1, shaded nodes represent observed variables while unshaded nodes are latent. The observed variables are the vector of query terms $q$ and the selected items $s_i$ (where for $1 \le i \le k$, $s_i \in D$). For the latent variables, let $T$ be a discrete topic set. Then variables $t_i \in T$ represent topics for the respective $s_i$, and $t' \in T$ represents a topic for the query $q$. The $r_i$ are binary variables that indicate whether the respective selected items $s_i$ are relevant (1) or not (0). In what follows, we do not differentiate between random variables and specific realizations, assuming the latter are evident from context, e.g., $r_1 = 0$.

The conditional probability tables (CPTs) in this discrete directed graphical model are defined as follows. $P(t_i \mid s_i)$ represents the topic model of item $s_i$, and $P(t' \mid q)$ represents a topic model of the query. There are a variety of ways to learn these topic CPTs based on the nature of the items and query; for an item set $D$ consisting of text documents and a query that can be treated as a text document, one natural probabilistic model for $P(t_i \mid s_i)$ and $P(t' \mid q)$ can be derived from latent Dirichlet allocation (LDA) [4]. We note that unsupervised topic models such as LDA can be trained offline for Figure 1; that is, there is no need to modify LDA inference to take into account the dependencies introduced in Figure 1, since the relevance variables $r_i$ in PLAR are unobserved and hence D-separate all topic variables.

With the CPTs for $P(t_i \mid s_i)$ and $P(t' \mid q)$ learned offline by LDA, we now define the CPTs for the relevance $r_i$ of item $s_i$, which have a very natural definition:

$$P(r_1 = 1 \mid t', t_1) = \begin{cases} 1 & \text{if } t_1 = t' \\ 0 & \text{if } t_1 \neq t' \end{cases}$$

$$\forall j > i: \quad P(r_j = 1 \mid t', r_i = 0, t_i, t_j) = \begin{cases} 1 & \text{if } (t_i \neq t_j) \wedge (t_j = t') \\ 0 & \text{if } (t_i = t_j) \vee (t_j \neq t') \end{cases}$$

Simply, $s_1$ is relevant if its topic $t_1$ matches the query topic $t'$. For $j > i$, $s_j$ is relevant under the same conditions, with the addition that if $s_i$ was not relevant ($r_i = 0$), topic $t_j$ for $s_j$ should also not match $t_i$. We omit a definition of $P(r_j \mid t', r_i = 1, t_i, t_j)$ since it will not be needed in the following derivation. Finally, for $j > i$, we factorize the CPT for the relevance $r_j$ given the irrelevance of all previous items (to be explained) as follows:

$$P(r_j \mid t', t_1, \ldots, t_j, r_1 = 0, \ldots, r_{j-1} = 0) = \prod_{i=1}^{j-1} P(r_j \mid t', r_i = 0, t_i, t_j)$$

By inspection, we note that this factorized perspective is equivalent to a joint CPT where $r_j$ is relevant only if $t_j = t'$ and $\forall i < j$, $t_j \neq t_i$. However, we use the equivalent factorized view since it will facilitate the variable elimination used to simplify probabilistic queries in the graphical model.
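To make these deterministic CPTs concrete, the following is a minimal Python sketch (illustrative only; the function name and signature are ours, not the paper's) of the joint-CPT view just described:

```python
def p_relevant(t_k: int, t_query: int, prev_topics: list[int]) -> float:
    """Deterministic relevance CPT from Section 2: item k is relevant (r_k = 1)
    iff its topic t_k matches the query topic t' AND t_k differs from the topic
    of every previously examined, irrelevant item."""
    if t_k != t_query:
        return 0.0
    return 0.0 if t_k in prev_topics else 1.0
```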
We define the accumulated relevance $\mathrm{AccRel}(S_k, q)$ of set $S_k$ with query $q$ as follows, where, following the click chain model [11] of result set browsing, we assume item $i$ is only examined if all previous items were irrelevant ($r_{<i} = 0$):

$$\mathrm{AccRel}(S_k, q) = \sum_{i=1}^{k} P(r_i = 1 \mid r_1 = 0, \ldots, r_{i-1} = 0, s_1, \ldots, s_i, q)$$

Hence, the formal set-based relevance optimization is to find the set $S_k^* \subseteq D$ that maximizes $\mathrm{AccRel}(S_k, q)$:

$$S_k^* = \arg\max_{S_k = \{s_1, \ldots, s_k\}} \mathrm{AccRel}(S_k, q) \quad (1)$$

Because it is intractable to directly optimize (1) over all possible subsets $S_k \subseteq D$ for large $|D|$, we instead opt for a greedy optimization approach to address (Q2), where we choose each $s_i^*$ in order, conditioned on the previously chosen $s_{<i}^*$. This approach is consistent with all of the result set diversification approaches we cite in Related Work.
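The greedy approach can be summarized as the following generic selection loop (a sketch under our own naming; PLAR instantiates the marginal-gain oracle with the conditional relevance probability derived below):

```python
from typing import Callable, Hashable, Iterable

def greedy_select(
    candidates: Iterable[Hashable],
    k: int,
    marginal: Callable[[Hashable, list], float],
) -> list:
    """Generic greedy subset selection: at step i, add the candidate with the
    highest marginal objective given the items already chosen. PLAR uses
    marginal(s_i, selected) = P(r_i = 1 | r_{<i} = 0, s_1, ..., s_i, q)."""
    selected: list = []
    remaining = set(candidates)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda s: marginal(s, selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```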

To demonstrate our greedy optimization of accumulated relevance, we begin with the selection of the first item $s_1^*$:

$$s_1^* = \arg\max_{s_1} \mathrm{AccRel}(\{s_1\}, q) = \arg\max_{s_1} P(r_1 = 1 \mid s_1, q) = \arg\max_{s_1} \frac{P(r_1 = 1, s_1, q)}{\underbrace{\sum_{r_1} P(r_1, s_1, q)}_{= Z \text{ (indep. of } s_1 \text{, c.f. Appendix A)}}}$$
$$= \arg\max_{s_1} \sum_{t_1, t'} P(r_1 = 1 \mid t', t_1)\, P(t' \mid q)\, P(t_1 \mid s_1) = \arg\max_{s_1} \sum_{t'} P(t' \mid q)\, P(t_1 = t' \mid s_1) \quad (2)$$

This intuitive result follows by exploiting the structure of $P(r_1 \mid t', t_1)$ and draws parallels to latent semantic indexing (LSI) [8], because it chooses $s_1^*$ so as to maximize the similarity between its topic model and that of the query $q$.

Continuing, we next want to select $s_2^*$ given $s_1^*$ in order to maximize $\mathrm{AccRel}(\{s_1^*, s_2\}, q)$. Here, we need only maximize over the second summation term ($i = 2$) in (1), since the first term ($i = 1$) is a constant given $s_1^*$:

$$s_2^* = \arg\max_{s_2 \in D \setminus \{s_1^*\}} \mathrm{AccRel}(\{s_1^*, s_2\}, q) = \arg\max_{s_2 \in D \setminus \{s_1^*\}} P(r_2 = 1 \mid s_1^*, s_2, q, r_1 = 0)$$
$$= \arg\max_{s_2} \frac{P(r_2 = 1, s_1^*, s_2, q, r_1 = 0)}{\underbrace{\sum_{r_2} P(r_2, s_1^*, s_2, q, r_1 = 0)}_{= Z \text{ (indep. of } s_2 \text{, c.f. Appendix A)}}}$$
$$= \arg\max_{s_2} \sum_{t_1, t_2, t'} P(r_2 = 1 \mid r_1 = 0, t_1, t_2, t')\, P(t_1 \mid s_1^*)\, P(r_1 = 0 \mid t_1, t')\, P(t_2 \mid s_2)\, P(t' \mid q) \quad (3)$$

Here we simply rewrote the conditional query $P(r_2 = 1 \mid s_1^*, s_2, q, r_1 = 0)$ in terms of the joint distribution and noted that the denominator is a constant $Z$ independent of the maximization argument, hence irrelevant, as derived in Appendix A. Then we expanded the query into the appropriate marginalization over the joint factored distribution given by the graphical model.

Given (3), we provide the remainder of the fascinating but tedious derivation in Appendix B (actually for the more general case of $k > 1$) and directly provide the final solution for the optimal selection of $s_2^*$:

$$s_2^* = \arg\max_{s_2 \in D \setminus \{s_1^*\}} \underbrace{\left[\sum_{t'} P(t' \mid q)\, P(t_2 = t' \mid s_2)\right]}_{\text{relevance}} - \underbrace{\left[\sum_{t'} P(t' \mid q)\, P(t_1 = t' \mid s_1^*)\, P(t_2 = t' \mid s_2)\right]}_{\text{diversity}} \quad (4)$$

We pause at this point to note that one of our primary objectives in the paper has been achieved. By directly optimizing a set-based relevance criterion, diversity has automatically emerged in the solution. That is, the relevance term measures the similarity between the topic distribution for the query and $s_2$, while the diversity term penalizes $s_2$ for having too similar a topic distribution to the already chosen $s_1^*$, where the diversity term reweights topics according to their probability given $q$. As discussed later, this query-based topic reweighting in the latent diversity term is an important modification of the ad-hoc diversity metrics we compare to.

Continuing onto the most general case, we show the derivation for selecting $s_k^*$ ($k > 1$) once $S_{k-1}^* = \{s_1^*, \ldots, s_{k-1}^*\}$ has been selected:

$$s_k^* = \arg\max_{s_k \in D \setminus S_{k-1}^*} \mathrm{AccRel}(\{s_1^*, \ldots, s_{k-1}^*, s_k\}, q) = \arg\max_{s_k} P(r_k = 1 \mid s_1^*, \ldots, s_{k-1}^*, s_k, r_1 = 0, \ldots, r_{k-1} = 0, q)$$
$$= \arg\max_{s_k} \frac{P(r_k = 1, s_1^*, \ldots, s_{k-1}^*, s_k, r_1 = 0, \ldots, r_{k-1} = 0, q)}{\underbrace{\sum_{r_k} P(r_k, s_1^*, \ldots, s_{k-1}^*, s_k, r_1 = 0, \ldots, r_{k-1} = 0, q)}_{= Z \text{ (indep. of } s_k \text{, c.f. Appendix A)}}}$$
$$= \arg\max_{s_k \in D \setminus S_{k-1}^*} \sum_{t_1, \ldots, t_k, t'} P(t_k \mid s_k)\, P(t' \mid q) \prod_{i=1}^{k-1} \left[ P(r_k = 1 \mid r_i = 0, t', t_i, t_k)\, P(t_i \mid s_i^*) \prod_{j=1}^{i-1} P(r_i = 0 \mid r_j = 0, t', t_i, t_j) \right] \quad (5)$$

Again, we simply rewrote the conditional query $P(r_k = 1 \mid s_1^*, \ldots, s_{k-1}^*, s_k, r_1 = 0, \ldots, r_{k-1} = 0, q)$ in terms of the joint distribution, noted the independence of the denominator w.r.t. the maximization argument as derived in Appendix A, then expanded the query into the appropriate marginalization over the joint factored distribution given by the graphical model.
Given (5), we again refer to the remainder of the derivation in Appendix B, and directly present the final solution for the optimal selection of $s_k^*$ that generally defines probabilistic latent accumulated relevance (PLAR):

$$s_k^* = \arg\max_{s_k \in D \setminus S_{k-1}^*} \sum_{t'} P(t_k = t' \mid s_k)\, P(t' \mid q) \prod_{i=1}^{k-1} \left[1 - P(t_i = t' \mid s_i^*)\right] \quad (6)$$

We note there are many nice properties of this general solution. First, given a learned topic model, it is very efficient and easy to compute. Second, as we will later show, there is a natural set-covering interpretation of the optimization in (6), demonstrating that the best $s_k^*$ covers the most latent topic space not already covered by $\{s_1^*, \ldots, s_{k-1}^*\}$.
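To illustrate how cheaply (6) can be computed given a learned topic model, the following numpy sketch (a minimal illustration of ours, not the authors' released code) maintains the product term incrementally across greedy steps:

```python
import numpy as np

def plar_select(query_topics: np.ndarray, item_topics: np.ndarray, k: int) -> list[int]:
    """Greedy PLAR selection via Eq. (6).

    query_topics: shape (T,), the topic distribution P(t' | q).
    item_topics:  shape (n, T), row i is P(t_i | s_i) for candidate item i.
    Returns the indices of the k selected items in selection order.
    """
    n, _ = item_topics.shape
    uncovered = np.ones(item_topics.shape[1])    # prod_{i<k} (1 - P(t_i = t' | s_i*))
    selected: list[int] = []
    for _ in range(min(k, n)):
        weights = query_topics * uncovered       # query-reweighted, not-yet-covered mass
        scores = item_topics @ weights           # Eq. (6) for every candidate at once
        scores[selected] = -np.inf               # exclude already chosen items
        best = int(np.argmax(scores))
        selected.append(best)
        uncovered *= 1.0 - item_topics[best]     # update latent topic coverage
    return selected
```

Each greedy step costs $O(|D| \cdot |T|)$ and, unlike the heuristics discussed next, there is no trade-off parameter to tune.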

3 Comparison of PLAR to Related Work

As early as 1964, Goffman [9] noted that the relevance of each document in a list has to depend on the documents preceding it. More recently, work on maximal marginal relevance (MMR) [5] was one of the first to formalize diversification as a mathematical optimization criterion, and MMR has proved one of the most popular diversity approaches. Aside from this work, two other notable works are [20], which formalizes a structured SVM loss function based on a set-covering objective, and [18], which borrows concepts from portfolio theory in economics to treat result set diversification as optimization of a risk minimization objective. PLAR as derived in the last section formally motivates these last three somewhat ad-hoc diversification approaches, and we discuss these connections more deeply in the following sections.

Before we discuss specifics though, we provide a general summary of the result set diversification literature in Table 1. Here we break down the different proposals according to whether they are probabilistic, use latent models for determining similarity, use adaptive learning techniques, and finally whether they are unsupervised, i.e., they do not require labeled data or feedback. We note that PLAR is the first proposal (that we are aware of) to combine all four traits in the affirmative; respectively, we believe these four traits provide robustness, generalization with sparse data, data adaptivity, and generality for situations without the availability of manual labeling or user feedback. To strengthen these claims, we note that each trait plays some role in the improved performance of PLAR over the various diversification approaches in the empirical results section.

[Table 1: Dimensions of diversified set-based retrieval systems. Columns: Prob. (uses probabilistic models?), Latent (uses latent topic models?), Learn. (uses some form of learning?), Unsup. (does not require labeled topic data or feedback?). Rows: Carbonell [5], Anagnostopoulos [2], Radlinski [12], Radlinski [13], Clarke [6], Yue [20], Bai [3], Sanderson [16], Yu [19], Agrawal [1], Gollapudi [10], Clough [7], Song [17], Wang [18], and PLAR (this paper). PLAR is the only entry affirmative in all four columns.]

3.1 Latent Semantic Indexing

We noted previously that PLAR directly re-derives latent similarity metrics introduced by latent semantic indexing (LSI) [8]. To clarify, we introduce the relevance metric $\mathrm{Sim}_1^{PLAR}$ given by PLAR in both (2) and (4), where we let $\vec{T}'$ and $\vec{T}_i$ be the respective topic probability vectors for query $q$ and item $s_i$, with vector elements $T'_j = P(t' = j \mid q)$ and $T_{ij} = P(t_i = j \mid s_i)$, and use $\langle \cdot, \cdot \rangle$ for the inner product:

$$\mathrm{Sim}_1^{PLAR}(q, s_i) = \sum_{t'} P(t' \mid q)\, P(t_i = t' \mid s_i) = \langle \vec{T}', \vec{T}_i \rangle$$

We note that $\mathrm{Sim}_1^{PLAR}$ is precisely the unnormalized LSI similarity kernel, directly derived in (2) and (4). More interestingly, using $\langle \cdot, \cdot \rangle_w$ to denote the reweighted inner product where the $l$th term of the dot product is multiplied by $w_l$, we note that a similar analysis applies to the diversity metric $\mathrm{Sim}_2^{PLAR}$ given by the second (diversity) term of PLAR in (4):

$$\mathrm{Sim}_2^{PLAR}(q, s_i, s_j) = \sum_{t'} P(t' \mid q)\, P(t_i = t' \mid s_i)\, P(t_j = t' \mid s_j) = \langle \vec{T}_i, \vec{T}_j \rangle_{\vec{T}'}$$

This analysis shows that PLAR derives a diversity metric based on the LSI similarity metric, but reweighted by the query topic probability $P(t' \mid q)$. This is a fascinating and intuitive result reflecting the intuition that latent diversity between items should be emphasized according to the latent topic probability of the query for which the set-based relevance criterion is being optimized. We note that no other diversity metrics that we are aware of use such a query-reweighted approach at retrieval time.
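Both kernels reduce to a few lines given dense topic vectors; the following sketch (our own naming) makes the reweighted inner product explicit:

```python
import numpy as np

def sim_plar1(q_topics: np.ndarray, i_topics: np.ndarray) -> float:
    """Unnormalized LSI-style relevance kernel: <T', T_i>."""
    return float(q_topics @ i_topics)

def sim_plar2(q_topics: np.ndarray, i_topics: np.ndarray, j_topics: np.ndarray) -> float:
    """Query-reweighted latent diversity kernel: <T_i, T_j>_{T'},
    i.e., each topic's contribution is multiplied by P(t' | q)."""
    return float(np.sum(q_topics * i_topics * j_topics))
```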
3.2 Maximal Marginal Relevance

Maximal marginal relevance (MMR) [5] is perhaps one of the most popular methods for balancing relevance and diversity in set-based information retrieval systems and has been cited over 530 times since its publication in 1998, according to Google Scholar. Like PLAR, MMR proposes to build $S_k^*$ in a greedy manner by selecting $s_k^*$ given $S_{k-1}^* = \{s_1^*, \ldots, s_{k-1}^*\}$ (where as usual $S_k^* = S_{k-1}^* \cup \{s_k^*\}$) according to the following criterion:

$$s_k^* = \arg\max_{s_k \in D \setminus S_{k-1}^*} \left[ \lambda\, \mathrm{Sim}_1(s_k, q) - (1 - \lambda) \max_{s_i^* \in S_{k-1}^*} \mathrm{Sim}_2(s_k, s_i^*) \right] \quad (7)$$

where $\mathrm{Sim}_1(\cdot, \cdot)$ measures the relevance between an item and a query, $\mathrm{Sim}_2(\cdot, \cdot)$ measures the similarity between two items, and the manually tuned $\lambda \in [0, 1]$ trades off relevance and diversity.
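For reference, here is a minimal sketch of (7) instantiated with the PLAR kernels, i.e., the PLAR-LDA-Max variant evaluated later (naming is ours):

```python
import numpy as np

def mmr_select(q_topics: np.ndarray, item_topics: np.ndarray, k: int,
               lam: float = 0.5) -> list[int]:
    """MMR reranking, Eq. (7), with Sim1 = Sim1^PLAR and Sim2 = Sim2^PLAR.
    lam manually trades off relevance and diversity; PLAR needs no such parameter."""
    n = item_topics.shape[0]
    selected: list[int] = []
    for _ in range(min(k, n)):
        best, best_score = -1, -np.inf
        for s in range(n):
            if s in selected:
                continue
            rel = float(q_topics @ item_topics[s])                       # Sim1^PLAR
            div = max((float(np.sum(q_topics * item_topics[s] * item_topics[i]))
                       for i in selected), default=0.0)                  # max_i Sim2^PLAR
            score = lam * rel - (1.0 - lam) * div
            if score > best_score:
                best, best_score = s, score
        selected.append(best)
    return selected
```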

In the case of $s_1^*$, the second term disappears. While MMR is a popular algorithm, it was specified in a rather ad-hoc manner, and good performance typically relies on careful tuning of the $\lambda$ parameter. Furthermore, MMR is agnostic to the specific similarity metrics used, which indeed allows for flexibility, but it gives no indication as to which choices of $\mathrm{Sim}_1$ and $\mathrm{Sim}_2$ are compatible with each other and appropriate for good performance.

We note that a very natural mapping to the MMR algorithm in (7) has emerged automatically from the derivation that maximized accumulated relevance. This is especially apparent for the optimal $s_2^*$ in the form of (4), where we note that MMR and PLAR exactly match when $\lambda = 1/2$, $\mathrm{Sim}_1 = \mathrm{Sim}_1^{PLAR}$, and $\mathrm{Sim}_2 = \mathrm{Sim}_2^{PLAR}$; making these three substitutions into MMR, we might arrive at a general algorithm for PLAR, derived from first principles. However, we note that MMR and PLAR are divergent when considering $S_k^*$ for $k > 2$. Here, MMR chooses to penalize each newly selected item $s_k$ by its minimal diversity (maximal similarity) with respect to items in $S_{k-1}^*$, whereas PLAR (as we will demonstrate next) penalizes each $s_k$ according to its expected, query-weighted overlap with the topic coverage of all items in $S_{k-1}^*$. We formalize this more clearly next, but note that this difference leads to an improvement of PLAR over MMR, even when MMR uses the same similarity and diversity metrics ($\mathrm{Sim}_1^{PLAR}$ and $\mathrm{Sim}_2^{PLAR}$) as used by PLAR.

3.3 Set Covering Heuristics

As an exemplar of set-covering heuristics for result set diversity optimization, we describe the general form of the loss function used by [20] for training SVMs with structured output. The weighted subtopic loss function $\mathrm{WSL}(S_k, q)$ in this work is the weighted percentage of distinct subtopics in the topic set of query $q$ not covered by the retrieved set $S_k$. Formally, we define the set of topics for query $q$ as $T$ and denote the set of topics in $T$ covered by $S_k$ as $T_k$. Then we penalize a result set $S_k$ by the uncovered topics $T \setminus T_k$ according to the sum of their weights, where topic $i$ is weighted according to the count $n_i$ of documents in $D$ covered by topic $i$ (normalized by $\sum_i n_i$), as follows:

$$\mathrm{WSL}(S_k, q) = \frac{\sum_{i \in T \setminus T_k} n_i}{\sum_{i \in T} n_i} \quad (8)$$
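A minimal sketch of (8), assuming subtopic labels are available as in the TREC testbed (function and argument names are ours):

```python
def wsl(covered_topics: set, query_topics: set, topic_doc_counts: dict) -> float:
    """Weighted subtopic loss, Eq. (8): the weighted fraction of the query's
    subtopics left uncovered by the result set, where subtopic i is weighted
    by the number n_i of documents in D covering it."""
    total = sum(topic_doc_counts[t] for t in query_topics)
    missed = sum(topic_doc_counts[t] for t in query_topics - covered_topics)
    return missed / total
```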
While WSL provides a hard set-covering view of diversity, we now demonstrate that an expansion of (6) provides a soft latent set-covering interpretation, with the fascinating insight that $s_k^*$ is chosen so as to best cover the latent topic space not already covered by $\{s_1^*, \ldots, s_{k-1}^*\}$, again with topic importance reweighted by the topic probability given the query $q$.

[Figure 2: Inclusion-exclusion principle. The sets represent candidate items $s^*$ for a query, and the area covered by each set is the information covered by that item. Numbers on the different areas indicate the number of sets that share them.]

Formally expanding the product $\prod_{i=1}^{k-1} (1 - P(t_i = t' \mid s_i^*))$ in (6), collecting terms, and writing it as a series, we arrive at a form that reflects the inclusion-exclusion principle applied to the problem of tabulating the joint topic probability mass for topic $t'$ covered by $\{s_1^*, \ldots, s_{k-1}^*\}$:

$$\prod_{i=1}^{k-1} \left(1 - P(t_i = t' \mid s_i^*)\right) = 1 - \left[ \sum_{i=1}^{k-1} P(t_i = t' \mid s_i^*) - \sum_{i=1}^{k-1} \sum_{j=i+1}^{k-1} P(t_i = t' \mid s_i^*)\, P(t_j = t' \mid s_j^*) + \cdots + (-1)^{k} \prod_{i=1}^{k-1} P(t_i = t' \mid s_i^*) \right] \quad (9)$$

The inclusion-exclusion calculation provided by the term in $[\,\cdot\,]$ is illustrated in Figure 2. In words, this expression is calculating the total topic probability coverage of $t'$ by all selected items $\{s_1^*, \ldots, s_{k-1}^*\}$, properly applying the inclusion-exclusion principle to ensure that overlapping probability coverage is not double counted. Then, referring back to (6), we note that $s_k^*$ is chosen by maximizing a weighted sum over topics, where each topic's weight is determined by its relevance to the query $q$ and the item $s_k$, and is penalized (i.e., due to the $1 - [\,\cdot\,]$) by the topic coverage of $t'$ by the set $\{s_1^*, \ldots, s_{k-1}^*\}$, naturally encouraging diversity. We note that this is a soft probabilistic version of the hard "in or out" topic coverage approach of WSL.
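The equivalence in (9) between the product form and the inclusion-exclusion series is easy to verify numerically; a short check (ours, for illustration):

```python
from itertools import combinations

import numpy as np

def coverage_inclusion_exclusion(p: np.ndarray) -> float:
    """Total probability mass of topic t' covered by the selected items, via
    inclusion-exclusion over the per-item coverages p_i = P(t_i = t' | s_i*)."""
    total = 0.0
    for r in range(1, len(p) + 1):
        for idx in combinations(range(len(p)), r):
            total += (-1) ** (r + 1) * np.prod(p[list(idx)])
    return total

p = np.array([0.3, 0.5, 0.2])
# 1 - coverage equals the product form on the left-hand side of Eq. (9).
assert np.isclose(1.0 - coverage_inclusion_exclusion(p), np.prod(1.0 - p))
```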

3.4 Portfolio Theory

As a final connection to the literature, we note that recent work [18] motivates diversification in set-based information retrieval by a risk-minimizing portfolio selection approach. Rewriting the diversification objective of [18] to match the notation of our paper, they note that the following greedy approach to selecting item $s_k^*$ worked best among the alternatives they tried:

$$s_k^* = \arg\max_{s_k \in D \setminus S_{k-1}^*} \left[ \mathrm{Sim}_{BM25}(s_k, q) - \lambda \sum_{i=1}^{k-1} \omega_i\, \mathrm{Sim}_{TFIDF}(s_i^*, s_k) \right] \quad (10)$$

Here, $\lambda$ is a manually tuned weight parameter; $\omega_1, \ldots, \omega_k$ are the weights of each position in the result set (for lack of a more informed choice, we chose a uniform weighting); $\mathrm{Sim}_{BM25}(s_k, q)$ is the BM25 [14] probabilistic relevance score of document $s_k$ w.r.t. query $q$; and $\mathrm{Sim}_{TFIDF}(s_i^*, s_k)$ is a TFIDF [15] similarity metric between two documents represented as term frequency vectors.

Like MMR, this Portfolio objective directly trades off a relevance and a diversity metric, but it uses a softer diversity metric that penalizes $s_k$ according to the sum of its pairwise similarities with all previous documents $\{s_1^*, \ldots, s_{k-1}^*\}$. This is motivated by portfolio theory, where the risk of investing in a new item $s_k$ is determined by its correlation with all previous items (the theory being that if one of the previous items was suboptimal and very similar to $s_k$, then there is higher risk in choosing $s_k$).

We note that, like this Portfolio approach, PLAR also uses a soft diversity metric. And returning to (4), we note that we can rewrite it in a similar form to show that PLAR can also be viewed as formally motivating a Portfolio-like latent expected risk minimization approach to diversification:

$$s_2^* = \arg\max_{s_2 \in D \setminus \{s_1^*\}} \sum_{t'} P(t' \mid q) \Big[ \underbrace{P(t_2 = t' \mid s_2)}_{\text{relevance}} - \underbrace{P(t_1 = t' \mid s_1^*)\, P(t_2 = t' \mid s_2)}_{\text{correlation risk}} \Big]$$

We note that an ad-hoc generalization of this Portfolio form for $s_2^*$ to the general case of $s_k^*$ might sum over all pairwise correlation risks. However, the previous set-covering interpretation of PLAR indicates that this would actually double count the risk because of overlap in the latent topic space that is not adjusted for using the principle of inclusion-exclusion. In this sense, the Portfolio objective runs the risk of being overly conservative by overestimating risk, and indeed we will see such results in the experiments.
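A sketch of the $s_2$ risk form above (our own naming); extending it naively to $k > 2$ by summing pairwise risks would reproduce exactly the double counting just discussed:

```python
import numpy as np

def plar_risk_score(q_topics: np.ndarray, s2_topics: np.ndarray,
                    s1_topics: np.ndarray) -> float:
    """PLAR's selection rule for s_2, Eq. (4), in portfolio form: expected
    query-weighted relevance minus latent correlation risk with s_1*."""
    relevance = q_topics * s2_topics
    risk = q_topics * s1_topics * s2_topics      # correlation with s_1* per topic
    return float(np.sum(relevance - risk))
```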
4 Experimental Comparison

Since we derived the diversification approach in PLAR from first principles in a latent variable graphical model, one interesting question to ask is how PLAR compares to the more commonly used ad-hoc result set diversification approaches discussed previously. For this purpose, we note that the TREC 6-8 diversity testbed described by [20] is the only publicly available diversity testbed that we found.

We report experiments on a subset of TREC 6-8 data, following the same experimental setup as [20], who measure the weighted subtopic loss (WSL) (8) of item sets; in brief, WSL gives a higher penalty for not covering popular subtopics. We do not compare directly to [20] as their method is supervised, while MMR, the Portfolio approach, and PLAR are inherently unsupervised and do not require manually labeled data.

Standard query and item similarity metrics used in MMR applied to text data include the cosine of the term frequency (TF) and TF-inverse document frequency (TFIDF) vector space models [15]. We denote these variants of MMR as MMR-TF and MMR-TFIDF and found $\lambda = 0.5$ to work best for both. We also implemented the BM25 Portfolio theory-based diversification objective described in [18] with uniform $\omega$ and best $\lambda = 1$. PLAR specifically suggests the use of the LSI-based similarity metrics defined in the last section, and we used LDA to train the topic models $P(t_i \mid s_i)$ and $P(t' \mid q)$ for a resulting algorithm we refer to as PLAR-LDA. LDA was trained with $\alpha = 2.0$, $\beta = 0.5$, and $|T| = 15$; we note the results were not highly sensitive to these parameter choices. We also define PLAR-LDA-Max as MMR using the PLAR similarity and diversity metrics, respectively $\mathrm{Sim}_1^{PLAR}$ and $\mathrm{Sim}_2^{PLAR}$.

[Table 2: Weighted subtopic loss (WSL) of the five methods (Portfolio Theory, MMR-TF, MMR-TFIDF, PLAR-LDA-Max, PLAR-LDA), using all words and the first 10 words of each document. Standard error estimates are shown for the PLAR-LDA methods since their topic models are sampled, which may lead to small variances in performance.]

Average WSL scores are shown in Table 2 on the 17 queries examined by [20]. We use both full documents and also just the first 10 words of each document. Due to the power of the latent topic model and its derived similarity metrics, PLAR-LDA performs better than MMR with standard TF and TFIDF metrics, and without a $\lambda$ parameter to be tuned. Furthermore, we note that PLAR with the MMR-motivated maximum (PLAR-LDA-Max) performs worse on one evaluation, indicating that PLAR-LDA's soft latent set-covering approach may serve as a better diversity metric than a hard MMR-like diversity metric. The Portfolio approach showed somewhat unstable performance; we hypothesize it did worse with more data due to increased similarity and thus double counting in its risk-minimization metric. We note that PLAR-LDA works comparatively very well with short documents (i.e., the results for documents restricted to their first 10 words), since intrinsic document and query similarities are automatically derived from the latent PLAR relevance and diversity metrics.
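The paper does not name its LDA implementation; a sketch of the offline topic-model training step, assuming the gensim library and mirroring the reported hyperparameters ($\alpha = 2.0$, $\beta = 0.5$, $|T| = 15$), might look as follows:

```python
import numpy as np
from gensim import corpora, models

def train_topic_model(token_lists, num_topics=15):
    """Fit LDA offline on tokenized documents; relevance variables need not be modeled."""
    dictionary = corpora.Dictionary(token_lists)
    bow = [dictionary.doc2bow(toks) for toks in token_lists]
    lda = models.LdaModel(bow, id2word=dictionary, num_topics=num_topics,
                          alpha=[2.0] * num_topics, eta=0.5)
    return lda, dictionary

def topic_distribution(lda, dictionary, tokens, num_topics=15):
    """Dense topic vector P(t | d) for a document or query, usable as input
    to the plar_select sketch above."""
    theta = np.zeros(num_topics)
    for t, p in lda.get_document_topics(dictionary.doc2bow(tokens),
                                        minimum_probability=0.0):
        theta[t] = p
    return theta
```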

These results show that PLAR is robust (it does as well as or better than the other diversification algorithms), its latent topic modeling helps in situations with small context, its topic models automatically adapt to the data at hand, and it is fully unsupervised, requiring no manual tuning of diversity.

5 Concluding Remarks

We formalized a relevance criterion for set-based information retrieval in a latent variable graphical model called probabilistic latent accumulated relevance (PLAR) and derived from it a novel solution yielding diversity as an emergent property. We showed that PLAR derives an LSI [8] similarity metric and a novel LSI variant for diversity. We also related the PLAR derivation to three important definitions of diversity in the literature: (a) maximal marginal relevance (MMR) [5], (b) set-covering heuristics [20], and (c) portfolio theory [18], and we showed how PLAR directly motivates all of them, but makes important modifications that led to improved empirical results on the TREC 6-8 diversity testbed. This provides evidence that formally optimizing the correct set-based relevance criterion in information retrieval can lead to natural and effective diversification strategies, demonstrating a path to move information retrieval away from ad-hoc diversity metrics towards formally motivated and better-performing solutions.

References

[1] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. Diversifying search results. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, pages 5-14, New York, NY, USA, 2009. ACM.

[2] A. Anagnostopoulos, A. Z. Broder, and D. Carmel. Sampling search-engine results. In Proceedings of the 14th International Conference on World Wide Web, Chiba, Japan, May 2005. ACM.

[3] J. Bai and J.-Y. Nie. Adapting information retrieval to query contexts. Information Processing and Management, 44(6), November 2008.

[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

[5] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, August 1998.

[6] C. L. A. Clarke, M. Kolla, G. V. Cormack, and O. Vechtomova. Novelty and diversity in information retrieval evaluation. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2008.

[7] P. Clough, M. Sanderson, M. Abouammoh, S. Navarro, and M. Paramita. Multiple approaches to analysing query diversity. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2009. ACM.

[8] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.

[9] W. Goffman. A searching procedure for information retrieval. Information Storage and Retrieval, 2:73-78, 1964.

[10] S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. In Proceedings of the 18th International Conference on World Wide Web. ACM, April 2009.

[11] F. Guo, C. Liu, A. Kannan, T. Minka, M. Taylor, Y.-M. Wang, and C. Faloutsos. Click chain model in web search. In Proceedings of the 2009 International Conference on the World Wide Web, pages 11-20, Madrid, Spain, 2009. ACM.

[12] F. Radlinski and S. Dumais.
Improving personalized web search using result diversification. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2006. ACM.

[13] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.

[14] S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC 1994), Gaithersburg, USA, 1994.

[15] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[16] M. Sanderson. Ambiguous queries: Test collections need more sense. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, July 2008.

[17] R. Song, Z. Luo, J.-Y. Nie, Y. Yu, and H.-W. Hon. Identification of ambiguous queries in web search. Information Processing and Management, 45(2), March 2009.

[18] J. Wang and J. Zhu. Portfolio theory of information retrieval. In Proceedings of SIGIR '09. ACM, July 2009.

[19] C. Yu, L. Lakshmanan, and S. Amer-Yahia. It takes variety to make a world: diversification in recommender systems. In EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology, volume 360. ACM, 2009.

[20] Y. Yue and T. Joachims. Predicting diverse subsets using structural SVMs. In ICML '08: Proceedings of the 25th International Conference on Machine Learning, New York, NY, USA, 2008. ACM.

A Independence of Normalization Constant

Here we derive the normalization constant $Z$ in (5):

$$Z = \sum_{r_k} P(r_k, s_1^*, \ldots, s_{k-1}^*, s_k, r_1 = 0, \ldots, r_{k-1} = 0, q) = \sum_{r_k} \sum_{t_1, \ldots, t_k, t'} P(t_k \mid s_k)\, P(t' \mid q) \prod_{i=1}^{k-1} \left[ P(r_k \mid r_i = 0, t', t_i, t_k)\, P(t_i \mid s_i^*) \prod_{j=1}^{i-1} P(r_i = 0 \mid r_j = 0, t', t_i, t_j) \right]$$

As described in Appendix B, the term $\prod_{j=1}^{i-1} P(r_i = 0 \mid r_j = 0, t', t_i, t_j)$ is implied by the support of the remaining factors. Thus we can omit it for the further derivation and add the constraint $\forall i < k,\ t_i \neq t'$. Hence, we have

$$Z = \sum_{r_k} \sum_{t_1 \neq t', \ldots, t_{k-1} \neq t', t_k, t'} P(t_k \mid s_k)\, P(t' \mid q) \prod_{i=1}^{k-1} P(r_k \mid r_i = 0, t', t_i, t_k)\, P(t_i \mid s_i^*)$$
$$= \sum_{t_1 \neq t', \ldots, t_{k-1} \neq t', t_k, t'} P(t_k \mid s_k)\, P(t' \mid q) \left[ \prod_{i=1}^{k-1} P(r_k = 1 \mid r_i = 0, t', t_i, t_k)\, P(t_i \mid s_i^*) + \prod_{i=1}^{k-1} P(r_k = 0 \mid r_i = 0, t', t_i, t_k)\, P(t_i \mid s_i^*) \right]$$
$$= \sum_{t_1 \neq t', \ldots, t_{k-1} \neq t', t_k, t'} P(t_k \mid s_k)\, P(t' \mid q) \prod_{i=1}^{k-1} P(t_i \mid s_i^*) \left[ \prod_{i=1}^{k-1} P(r_k = 1 \mid r_i = 0, t', t_i, t_k) + \prod_{i=1}^{k-1} P(r_k = 0 \mid r_i = 0, t', t_i, t_k) \right]$$

We note that $\prod_{i=1}^{k-1} P(r_k = 0 \mid r_i = 0, t', t_i, t_k) = 1$ implies that $(t_k \neq t') \vee (\exists i < k,\ t_k = t_i)$ is true, while $\prod_{i=1}^{k-1} P(r_k = 1 \mid r_i = 0, t', t_i, t_k) = 1$ implies that $\neg\big((t_k \neq t') \vee (\exists i < k,\ t_k = t_i)\big)$ is true. Hence, we have

$$\prod_{i=1}^{k-1} P(r_k = 0 \mid r_i = 0, t', t_i, t_k) = 1 - \prod_{i=1}^{k-1} P(r_k = 1 \mid r_i = 0, t', t_i, t_k)$$

Therefore, we obtain

$$Z = \sum_{t_1 \neq t', \ldots, t_{k-1} \neq t', t_k, t'} P(t_k \mid s_k)\, P(t' \mid q) \prod_{i=1}^{k-1} P(t_i \mid s_i^*) = \sum_{t_1 \neq t', \ldots, t_{k-1} \neq t', t'} P(t' \mid q) \prod_{i=1}^{k-1} P(t_i \mid s_i^*) \sum_{t_k} P(t_k \mid s_k)$$
$$= \sum_{t'} P(t' \mid q) \prod_{i=1}^{k-1} \sum_{t_i \neq t'} P(t_i \mid s_i^*) = \sum_{t'} P(t' \mid q) \prod_{i=1}^{k-1} \left(1 - P(t_i = t' \mid s_i^*)\right)$$

The crucial observation here is that $Z$ is independent of the chosen $s_k$; thus, when choosing the maximizing $s_k^*$, $Z$ is irrelevant and can be ignored.

B PLAR Derivation for $s_k$, $k > 1$

Following is the derivation of PLAR from (5) through to the final result in (6):

$$s_k^* = \arg\max_{s_k \in D \setminus S_{k-1}^*} P(r_k = 1, s_1^*, \ldots, s_{k-1}^*, s_k, r_1 = 0, \ldots, r_{k-1} = 0, q)$$
$$= \arg\max_{s_k} \sum_{t_1, \ldots, t_k, t'} P(t_k \mid s_k)\, P(t' \mid q) \prod_{i=1}^{k-1} \left[ P(r_k = 1 \mid r_i = 0, t', t_i, t_k)\, P(t_i \mid s_i^*) \prod_{j=1}^{i-1} P(r_i = 0 \mid r_j = 0, t', t_i, t_j) \right]$$

For all $i < k$, $P(r_k = 1 \mid r_i = 0, t', t_i, t_k)$ is 1 only when $t_k = t'$ and $t_i \neq t'$ hold. Since $t_i \neq t'$ implies that $P(r_i = 0 \mid r_j = 0, t', t_i, t_j) = 1$, the term $\prod_{j=1}^{i-1} P(r_i = 0 \mid r_j = 0, t', t_i, t_j)$ is redundant. Hence we can omit it (adding the constraint $t_i \neq t'$) and obtain

$$s_k^* = \arg\max_{s_k} \sum_{t_1 \neq t', \ldots, t_{k-1} \neq t', t_k, t'} P(t_k \mid s_k)\, P(t' \mid q) \prod_{i=1}^{k-1} P(r_k = 1 \mid r_i = 0, t', t_i, t_k)\, P(t_i \mid s_i^*)$$
$$= \arg\max_{s_k} \sum_{t_k, t'} P(t_k \mid s_k)\, P(t' \mid q) \prod_{i=1}^{k-1} \left( \sum_{t_i \neq t'} P(r_k = 1 \mid r_i = 0, t', t_i, t_k)\, P(t_i \mid s_i^*) \right)$$
$$= \arg\max_{s_k \in D \setminus S_{k-1}^*} \sum_{t'} P(t_k = t' \mid s_k)\, P(t' \mid q) \prod_{i=1}^{k-1} \left(1 - P(t_i = t' \mid s_i^*)\right)$$
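As a sanity check on Appendices A and B, the closed form of the conditional relevance probability implied by (6) and the Appendix A form of $Z$ can be verified against brute-force enumeration over the graphical model of Figure 1; a small sketch (ours, for illustration):

```python
import itertools

import numpy as np

def marginal_bruteforce(q, sel, cand):
    """P(r_k = 1 | r_1 = 0, ..., r_{k-1} = 0, s_1, ..., s_k, q) by exhaustive
    enumeration of topic assignments. On the event r_{<k} = 0, no previously
    selected item's topic may equal t' (Appendix A)."""
    T = len(q)
    num = Z = 0.0
    for ts in itertools.product(range(T), repeat=len(sel) + 2):
        t_prime, t_k, prev = ts[0], ts[1], ts[2:]
        if any(t == t_prime for t in prev):   # would force some r_i = 1
            continue
        w = q[t_prime] * cand[t_k]
        for i, t in enumerate(prev):
            w *= sel[i][t]
        Z += w
        if t_k == t_prime:                    # r_k = 1 iff t_k = t' (and all t_i != t_k)
            num += w
    return num / Z

def marginal_closed_form(q, sel, cand):
    """The same quantity via Eq. (6) divided by the Appendix A form of Z."""
    uncovered = np.ones(len(q))
    for s in sel:
        uncovered *= 1.0 - s
    return float(np.sum(q * cand * uncovered)) / float(np.sum(q * uncovered))

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(4))
sel = [rng.dirichlet(np.ones(4)) for _ in range(2)]
cand = rng.dirichlet(np.ones(4))
assert np.isclose(marginal_bruteforce(q, sel, cand),
                  marginal_closed_form(q, sel, cand))
```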


Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval Sargur Srihari University at Buffalo The State University of New York 1 A Priori Algorithm for Association Rule Learning Association

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Machine learning for pervasive systems Classification in high-dimensional spaces

Machine learning for pervasive systems Classification in high-dimensional spaces Machine learning for pervasive systems Classification in high-dimensional spaces Department of Communications and Networking Aalto University, School of Electrical Engineering stephan.sigg@aalto.fi Version

More information

Predicting Neighbor Goodness in Collaborative Filtering

Predicting Neighbor Goodness in Collaborative Filtering Predicting Neighbor Goodness in Collaborative Filtering Alejandro Bellogín and Pablo Castells {alejandro.bellogin, pablo.castells}@uam.es Universidad Autónoma de Madrid Escuela Politécnica Superior Introduction:

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Lecture 5: Web Searching using the SVD

Lecture 5: Web Searching using the SVD Lecture 5: Web Searching using the SVD Information Retrieval Over the last 2 years the number of internet users has grown exponentially with time; see Figure. Trying to extract information from this exponentially

More information

topic modeling hanna m. wallach

topic modeling hanna m. wallach university of massachusetts amherst wallach@cs.umass.edu Ramona Blei-Gantz Helen Moss (Dave's Grandma) The Next 30 Minutes Motivations and a brief history: Latent semantic analysis Probabilistic latent

More information

Latent Semantic Tensor Indexing for Community-based Question Answering

Latent Semantic Tensor Indexing for Community-based Question Answering Latent Semantic Tensor Indexing for Community-based Question Answering Xipeng Qiu, Le Tian, Xuanjing Huang Fudan University, 825 Zhangheng Road, Shanghai, China xpqiu@fudan.edu.cn, tianlefdu@gmail.com,

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

Approximating the Partition Function by Deleting and then Correcting for Model Edges (Extended Abstract)

Approximating the Partition Function by Deleting and then Correcting for Model Edges (Extended Abstract) Approximating the Partition Function by Deleting and then Correcting for Model Edges (Extended Abstract) Arthur Choi and Adnan Darwiche Computer Science Department University of California, Los Angeles

More information

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting Outline for today Information Retrieval Efficient Scoring and Ranking Recap on ranked retrieval Jörg Tiedemann jorg.tiedemann@lingfil.uu.se Department of Linguistics and Philology Uppsala University Efficient

More information

Comparative Document Analysis for Large Text Corpora

Comparative Document Analysis for Large Text Corpora Comparative Document Analysis for Large Text Corpora Xiang Ren Yuanhua Lv Kuansan Wang Jiawei Han University of Illinois at Urbana-Champaign, Urbana, IL, USA Microsoft Research, Redmond, WA, USA {xren7,

More information

13 Searching the Web with the SVD

13 Searching the Web with the SVD 13 Searching the Web with the SVD 13.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this

More information

9 Searching the Internet with the SVD

9 Searching the Internet with the SVD 9 Searching the Internet with the SVD 9.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this

More information

ChengXiang ( Cheng ) Zhai Department of Computer Science University of Illinois at Urbana-Champaign

ChengXiang ( Cheng ) Zhai Department of Computer Science University of Illinois at Urbana-Champaign Axiomatic Analysis and Optimization of Information Retrieval Models ChengXiang ( Cheng ) Zhai Department of Computer Science University of Illinois at Urbana-Champaign http://www.cs.uiuc.edu/homes/czhai

More information

1 [15 points] Frequent Itemsets Generation With Map-Reduce

1 [15 points] Frequent Itemsets Generation With Map-Reduce Data Mining Learning from Large Data Sets Final Exam Date: 15 August 2013 Time limit: 120 minutes Number of pages: 11 Maximum score: 100 points You can use the back of the pages if you run out of space.

More information

Large-scale Information Processing, Summer Recommender Systems (part 2)

Large-scale Information Processing, Summer Recommender Systems (part 2) Large-scale Information Processing, Summer 2015 5 th Exercise Recommender Systems (part 2) Emmanouil Tzouridis tzouridis@kma.informatik.tu-darmstadt.de Knowledge Mining & Assessment SVM question When a

More information

Medical Question Answering for Clinical Decision Support

Medical Question Answering for Clinical Decision Support Medical Question Answering for Clinical Decision Support Travis R. Goodwin and Sanda M. Harabagiu The University of Texas at Dallas Human Language Technology Research Institute http://www.hlt.utdallas.edu

More information

Lecture 2 August 31, 2007

Lecture 2 August 31, 2007 CS 674: Advanced Language Technologies Fall 2007 Lecture 2 August 31, 2007 Prof. Lillian Lee Scribes: Cristian Danescu Niculescu-Mizil Vasumathi Raman 1 Overview We have already introduced the classic

More information

Prediction of Citations for Academic Papers

Prediction of Citations for Academic Papers 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Statistical Ranking Problem

Statistical Ranking Problem Statistical Ranking Problem Tong Zhang Statistics Department, Rutgers University Ranking Problems Rank a set of items and display to users in corresponding order. Two issues: performance on top and dealing

More information

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) Latent Dirichlet Allocation (LDA) A review of topic modeling and customer interactions application 3/11/2015 1 Agenda Agenda Items 1 What is topic modeling? Intro Text Mining & Pre-Processing Natural Language

More information

DM-Group Meeting. Subhodip Biswas 10/16/2014

DM-Group Meeting. Subhodip Biswas 10/16/2014 DM-Group Meeting Subhodip Biswas 10/16/2014 Papers to be discussed 1. Crowdsourcing Land Use Maps via Twitter Vanessa Frias-Martinez and Enrique Frias-Martinez in KDD 2014 2. Tracking Climate Change Opinions

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient

More information