On the Foundations of Diverse Information Retrieval. Scott Sanner, Kar Wai Lim, Shengbo Guo, Thore Graepel, Sarvnaz Karimi, Sadegh Kharazmi


Outline
- The need for diversity
- The answer: MMR
- But what was the question?
- Expected n-call@k

Search Result Ranking
- We query the daily news for "technology" and get back a page of near-duplicate results (figure: search results screenshot).
- Is this desirable?
- Note that de-duplication does not solve this problem.

Recommendation
- Book search for "cowboys"* (figure: e-book search results).
- *These are actual results I got from an e-book search engine.
- Why are they mostly romance books? Will this appeal to all demographics?

Diversity Beyond IR: Machine Learning
- Classifying Computer Science web pages: select the top features by some feature-scoring metric: computer, computers, computing, computation, computational.
- Certainly all are appropriate, but do these features cover all relevant web pages well?
- A better approach? mRMR (minimum-Redundancy-Maximum-Relevance feature selection)?

Diversity in IR
In this talk we focus on diversity from an IR perspective:
- De-duplication (all search engines handle this via locality-sensitive hashing): same page at different URLs; different page versions (copied Wiki articles).
- Source diversity (easy): web pages vs. news vs. image search vs. YouTube.
- Sense ambiguity (easily addressed through user reformulation): Java, Jaguar, Apple. Arguably not the main motivation.
- Intrinsic diversity (faceted information needs): Heathrow (check-in, food services, ground transport).
- Extrinsic diversity (diverse user population): teens vs. parents, men vs. women, location.
How do these relate to the previous examples? See Radlinski and Joachims on diverse information needs (SIGIR Forum 2009).

Diversification in IR
- Maximum Marginal Relevance (MMR): Carbonell & Goldstein, SIGIR 1998. The standard diversification approach in IR.
- MMR algorithm: S_k is the subset of k selected documents from corpus D. Greedily build S_k from S_{k-1} (with S_0 = ∅) by choosing

      s_k* = argmax_{s_k ∈ D \ S_{k-1}} [ λ Sim_1(s_k, q) - (1 - λ) max_{s_j ∈ S_{k-1}} Sim_2(s_k, s_j) ]
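To make the greedy loop concrete, here is a minimal Python sketch of MMR; the cosine-similarity helper and the toy vectors are illustrative stand-ins for whatever Sim_1 and Sim_2 a real system would use.

    import numpy as np

    def mmr_select(doc_vecs, query_vec, k, lam=0.5):
        """Greedy MMR: trade off query relevance against maximum
        similarity to the already-selected documents."""
        def cos(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        selected, pool = [], list(range(len(doc_vecs)))
        for _ in range(k):
            def score(i):
                rel = cos(doc_vecs[i], query_vec)  # Sim_1(s_i, q)
                red = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                          default=0.0)             # max_j Sim_2(s_i, s_j)
                return lam * rel - (1 - lam) * red
            best = max(pool, key=score)
            selected.append(best)
            pool.remove(best)
        return selected

    # Toy usage: the second pick avoids the near-duplicate of the first.
    docs = np.array([[1.0, 0.0], [0.95, 0.05], [0.2, 0.8]])
    print(mmr_select(docs, np.array([1.0, 0.3]), k=2))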

What was the Question?
- MMR is an algorithm; we don't know what underlying objective it is optimizing.
- There were previous formalization attempts, but the full question went unanswered for 14 years; Chen and Karger, SIGIR 2006, came closest.
- This talk: a complete derivation of MMR, under many assumptions; arguably the assumptions you are making when you use MMR!

Where do we start?
- Let's try to relate the set/ranking objective Precision@k to diversity.*
- *Note: this is non-standard IR! IR evaluates these objectives empirically but never derives algorithms to directly optimize them (largely because of long-tail queries and the absence of labels).

Relating Precision@k Objectives to Diversity
- Chen and Karger, SIGIR 2006: 1-call@k. At least one document in S_k should be relevant (P@k ≥ 1/k). Very diverse: encourages you to cover your bases with S_k. Sanner et al., CIKM 2011: 1-call@k derives MMR with λ = ½.
- van Rijsbergen, 1979: Probability Ranking Principle (PRP). Rank items by probability of relevance (e.g., modeled via term frequency). The PRP relates to k-call@k (P@k = 1), which relates to MMR with λ = 1. Not diverse: it encourages the k-th item to be very similar to the first k-1 items.
- So either λ = ½ (1-call@k, very diverse) or λ = 1 (k-call@k, not diverse)? One should really tune λ for MMR based on query ambiguity. Santos, MacDonald, Ounis, CIKM 2011: learn the best λ given query features. (Only a small fraction of queries have diverse information needs, so this requires good experimental design.)
- So what derives λ ∈ [½, 1]? Any guesses?

Empirical Study of n-call@k
- How does the diversity of n-call@k results change with n? (Figure: estimate of result diversity, plotted against n.)
- Clearly, diversity (pairwise document correlation) decreases with n in n-call@k.
- J. Wang and J. Zhu. Portfolio theory of information retrieval, SIGIR 2009.

Hypothesis
- Let's try optimizing 2-call@k; the derivation builds on Sanner et al., CIKM 2011. Optimizing this leads to MMR with λ = 2/3.
- There seems to be a trend relating λ and n: n = 1 gives λ = 1/2; n = 2 gives λ = 2/3; n = k gives λ = 1.
- Hypothesis: optimizing n-call@k leads to MMR with lim_{k→∞} λ(k, n) = n/(n+1).

Recap
- We wanted to know what objective leads to MMR diversification.
- The evidence supports that optimizing n-call@k leads to diverse, MMR-like behavior with λ ≈ n/(n+1).
- Can we derive MMR from n-call@k?

One Detail is Missing
- We want to optimize n-call@k, i.e., at least n of the k documents should be relevant.
- Great, but given a query and a corpus, how do we do this? The key question: how do we define relevance?
- We need a model for this: probabilistic, given the PRP connections.
- If diversity is needed to cover latent information needs, the relevance model must include latent query/document topics.

Latent Binary Relevance Retrieval Model
Formalize as an optimization over document selections s_1, ..., s_k in a graphical model (figure) with:
- s_i: document selection, i = 1..k (observed)
- t_i: latent topic of document s_i
- r_i: is document s_i relevant?
- q: the query (observed)
- t': latent topic of the query q

How to determine latent topics?
(Figure: the same graphical model; s_1, ..., s_k and q are observed, while t_1, ..., t_k, t', and r_1, ..., r_k are latent.)
We need CPTs for P(t_i | s_i) and P(t' | q). Options:
- Set them arbitrarily
- Topics are words: L1-normalized TF or TF-IDF!
- Topic modeling (not quite LDA)

Defining Relevance
Adapt the 0-1 loss model of the PRP:

    P(r_i = 1 | t', t_i) = 1 if t_i = t', and 0 if t_i ≠ t'.
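One way to read this model is as a generative story, sketched in Python below; treating topic distributions as L1-normalized term-frequency vectors follows the "topics are words" option above, and all vector names here are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def topic_dist(term_freqs):
        """P(t | s) with topics-as-words: L1-normalize a TF vector."""
        v = np.asarray(term_freqs, dtype=float)
        return v / v.sum()

    def sample_relevance(query_tf, doc_tfs):
        """One draw from the latent model: t' ~ P(t | q), t_i ~ P(t | s_i),
        and r_i = 1 iff t_i == t' (the 0-1 relevance above)."""
        t_query = rng.choice(len(query_tf), p=topic_dist(query_tf))
        return [int(rng.choice(len(d), p=topic_dist(d)) == t_query)
                for d in doc_tfs]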

Optimizing Expected 1-call@k

    Exp-1-call@k(S_k, q)
      = E[ ∨_{i=1}^k r_i = 1 | s_1, ..., s_k, q ]
      = P( ∨_{i=1}^k r_i = 1 | s_1, ..., s_k, q )
      = P( r_1 = 1 ∨ [r_1 = 0 ∧ r_2 = 1] ∨ [r_1 = 0 ∧ r_2 = 0 ∧ r_3 = 1] ∨ ... | s_1, ..., s_k, q )
      = Σ_{i=1}^k P( r_i = 1, r_1 = 0, ..., r_{i-1} = 0 | s_1, ..., s_k, q )    (all disjuncts are mutually exclusive)
      = Σ_{i=1}^k P( r_i = 1 | r_1 = 0, ..., r_{i-1} = 0, s_1, ..., s_k, q ) P( r_1 = 0, ..., r_{i-1} = 0 | S, q )

We want argmax_{S = {s_1, ..., s_k}} Exp-1-call@k(S, q). Since s_k is d-separated from r_1, ..., r_{k-1}, those terms can be ignored when selecting greedily:

    s_i* = argmax_{s_i} P( r_i = 1 | r_1 = 0, ..., r_{i-1} = 0, s_1*, ..., s_{i-1}*, s_i, q )

Objective to Optimize: s_1*
Take a greedy approach (like MMR); choose s_1 first:

    s_1* = argmax_{s_1} P( r_1 = 1 | s_1, q )
         = argmax_{s_1} Σ_{t_1, t'} I[t' = t_1] P(t' | q) P(t_1 | s_1)
         = argmax_{s_1} Σ_{t'} P(t' | q) P(t_1 = t' | s_1)

From this expected topic match we can derive numerous kernels, including TF, TF-IDF, and LSI.
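With the topics-as-words choice, the final expression is just a dot product of L1-normalized term vectors; a minimal sketch, with a made-up four-word vocabulary:

    def sim1(query_tf, doc_tf):
        """Sim_1(s, q) = sum over t' of P(t' | q) P(t = t' | s):
        the dot product of two L1-normalized TF (or TF-IDF) vectors."""
        q = [x / sum(query_tf) for x in query_tf]
        d = [x / sum(doc_tf) for x in doc_tf]
        return sum(qi * di for qi, di in zip(q, d))

    print(sim1([2, 1, 0, 1], [1, 1, 0, 0]))  # larger when the distributions overlap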

Objective to Optimize: s_2*
Choose s_2 next, conditioning on the chosen s_1* and on r_1 = 0:

    s_2* = argmax_{s_2} P( r_2 = 1 | r_1 = 0, s_1*, s_2, q )
         = argmax_{s_2} Σ_{t_1, t_2, t'} I[t_2 = t'] P(t_1 | s_1*) I[t_1 ≠ t'] P(t_2 | s_2) P(t' | q)
         = argmax_{s_2} Σ_{t'} P(t' | q) P(t_2 = t' | s_2) (1 - P(t_1 = t' | s_1*))    (query-topic-weighted diversity!)
         = argmax_{s_2} [ Σ_{t'} P(t' | q) P(t_2 = t' | s_2) ] - [ Σ_{t'} P(t' | q) P(t_1 = t' | s_1*) P(t_2 = t' | s_2) ]

The first bracketed term is relevance; the second is a non-diversity penalty. What is λ here? ½: both terms are weighted equally.
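The two bracketed terms translate directly into code; a sketch reusing the L1-normalization convention above (all names illustrative):

    def s2_score(query_tf, doc_tf, chosen_tf):
        """Greedy score for the second document: relevance minus a
        query-topic-weighted redundancy penalty (the lambda = 1/2 form)."""
        q = [x / sum(query_tf) for x in query_tf]
        d = [x / sum(doc_tf) for x in doc_tf]
        c = [x / sum(chosen_tf) for x in chosen_tf]
        relevance = sum(qi * di for qi, di in zip(q, d))
        penalty = sum(qi * ci * di for qi, ci, di in zip(q, c, d))
        return relevance - penalty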

Objective to Optimize: s_k*, k > 2

    s_k* = argmax_{s_k ∈ D \ S_{k-1}} Σ_{t'} P(t_k = t' | s_k) P(t' | q) ∏_{i=1}^{k-1} [1 - P(t_i = t' | s_i*)]

Expanding the product by the principle of inclusion-exclusion:

    ∏_{i=1}^{k-1} [1 - P(t_i = t' | s_i*)] = 1 - [ Σ_i P(t_i = t' | s_i*) - Σ_{i<j} P(t_i = t' | s_i*) P(t_j = t' | s_j*) + ... ]
                                           ≈ 1 - P(t' | s_1*, ..., s_{k-1}*)

so the bracketed term is the probability that topic t' is already covered by the selected documents. This provides a set-covering view of diversity.
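Putting the pieces together for the k-th selection, here is a sketch of the full coverage-discounted score under the same topics-as-words assumptions (names illustrative):

    def greedy_score(query_tf, cand_tf, selected_tfs):
        """Expected 1-call@k greedy score: query-weighted topic match,
        discounted by the probability each topic is already covered."""
        q = [x / sum(query_tf) for x in query_tf]
        d = [x / sum(cand_tf) for x in cand_tf]
        sel = [[x / sum(s) for x in s] for s in selected_tfs]
        score = 0.0
        for t in range(len(q)):
            uncovered = 1.0
            for s in sel:
                uncovered *= 1.0 - s[t]   # prod_i [1 - P(t_i = t' | s_i*)]
            score += q[t] * d[t] * uncovered
        return score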

So far
- We've seen hints of MMR from E[1-call@k].
- We need a few more assumptions to get to MMR.
- Let's also generalize to E[n-call@k] for general n:

      Exp-n-call@k(S_k, q) = P( R_k ≥ n | s_1, ..., s_k, q ),   where R_k = Σ_{i=1}^k r_i

  i.e., the probability that at least n of the k selected documents are relevant.
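Reusing sample_relevance from the earlier sketch, this expectation can be estimated by straightforward Monte Carlo (illustrative only; the derivation below works with the exact expression):

    def exp_n_call_at_k(query_tf, doc_tfs, n, trials=10000):
        """Monte Carlo estimate of P(sum_i r_i >= n) under the latent topic model."""
        hits = sum(sum(sample_relevance(query_tf, doc_tfs)) >= n
                   for _ in range(trials))
        return hits / trials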

Optimization Objective
Continue with the greedy approach for E[n-call@k]: select the next document s_k* given all previously chosen documents S_{k-1}:

    s_k* = argmax_{s_k} P( R_{k-1} + r_k ≥ n | s_1*, ..., s_{k-1}*, s_k, q )
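A brute-force version of this greedy selection, using the Monte Carlo estimator above (purely a sanity-check sketch; the talk derives a closed form instead):

    def greedy_n_call_select(query_tf, doc_tfs, k, n):
        """Greedily pick each next document to maximize the estimated E[n-call@k]."""
        selected, pool = [], list(range(len(doc_tfs)))
        for _ in range(k):
            best = max(pool, key=lambda i: exp_n_call_at_k(
                query_tf, [doc_tfs[j] for j in selected + [i]], n, trials=2000))
            selected.append(best)
            pool.remove(best)
        return selected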

Derivation Nontrivial
Only an overview of the key tricks here. For full details, see:
- Sanner et al., CIKM 2011: 1-call@k (a gentler introduction): http://users.cecs.anu.edu.au/~ssanner/papers/cikm11.pdf
- Lim et al., SIGIR 2012: n-call@k: http://users.cecs.anu.edu.au/~ssanner/papers/sigir12.pdf
- Online SIGIR 2012 appendix: http://users.cecs.anu.edu.au/~ssanner/papers/sigir12_app.pdf

Derivation
(Equation figure.)

Derivation
- Marginalize out all subtopics (using conditional probability).

Derivation
- We write r_k as conditioned on R_{k-1}, where it decomposes into two independent events, hence the "+".

Derivation
- Start pushing the latent topic marginalizations as far inward as possible.

Derivation
- The first term of the "+" is independent of s_k, so it can be removed from the max!

Derivation
- We arrive at a simplified form. This is still a complicated expression, but it can be expressed recursively.

Recursion
- A very similar conditional decomposition to the one used in the first part of the derivation.

Unrolling the Recursion
- We can unroll the previous recursion, express it in closed form, and substitute.
- But where is the max? MMR has a max.

Deterministic Topic Probabilities
- We assume that the topic of each document is known (deterministic), hence P(t_i | s_i) puts all its mass on a single topic; likewise for P(t' | q).
- This means that a document refers to exactly one topic, and likewise for queries. For example: if you search for "Apple" you meant the fruit OR the company, but not both; if a document refers to Apple the fruit, it does not discuss the company Apple Computer.

Deterministic Topic Probabilities
- Generally: P(t_i = t' | s_i) ∈ [0, 1].
- Deterministic: P(t_i = t' | s_i) ∈ {0, 1}.

Convert a ∏ to a max
Assuming deterministic topic probabilities, we can convert a ∏ to a max and vice versa. For x_i ∈ {0 (false), 1 (true)}:

    max_i x_i = ∨_i x_i = ¬ ∧_i (¬ x_i) = 1 - ∧_i (1 - x_i) = 1 - ∏_i (1 - x_i)
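A quick exhaustive check of this identity over boolean assignments (illustrative):

    import math
    from itertools import product

    for xs in product([0, 1], repeat=4):
        assert max(xs) == 1 - math.prod(1 - x for x in xs)
    print("max_i x_i == 1 - prod_i (1 - x_i) for all boolean x")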

Convert a ∏ to a max
- Applying this identity to the optimizing objective (where the deterministic assumption holds), we can rewrite its product term as a max.

Objective After the max Conversion
(Equation figure: the objective with an explicit max over previously selected documents.)

Combinatorial Simplification
- Deterministic topics also permit a combinatorial simplification of some of the sums.
- Assuming that m documents out of the chosen k-1 are relevant, we can count exactly how many times the top term and the bottom term are non-zero.

After Simplification
Final form, assuming a deterministic topic distribution, converting the ∏ to a max, and applying the combinatorial simplification:
- Topic marginalization leads to a probability product kernel Sim_1(·,·): any kernel that L1-normalizes its inputs, so it can be used with TF or TF-IDF!
- MMR drops the q dependence in Sim_2(·,·).
- The argmax is invariant to a constant multiplier, so we use Pascal's rule to normalize the coefficients to [0, 1].

Comparison to MMR
- The optimizing objective used in MMR is

      s_k* = argmax_{s_k ∈ D \ S_{k-1}} [ λ Sim_1(s_k, q) - (1 - λ) max_{s_j ∈ S_{k-1}} Sim_2(s_k, s_j) ]

- We note that the optimizing objective for expected n-call@k has the same form as MMR, with λ determined by m, the number of the chosen documents that are relevant. But m is unknown.

Expectation of m
- m is the expected number of relevant documents among those chosen; since at least n must be relevant, we can lower-bound m by n.
- With the assumption m = n, we obtain λ = n/(n+1): our hypothesis!
- λ = n/(n+1) also roughly follows the empirical behavior observed earlier; the variation is likely due to m differing for each corpus.
- If instead m is held constant, this still yields an MMR-like algorithm.
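Tying the result back to the opening MMR sketch: the derived trade-off is a simple function of n (values shown under the m = n assumption):

    def lambda_for(n):
        """MMR trade-off lambda derived from expected n-call@k (assuming m = n)."""
        return n / (n + 1)

    print([lambda_for(n) for n in (1, 2, 3)])  # 1/2, 2/3, 3/4, ... -> 1 as n grows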

Summary of Contributions
- We derived MMR from n-call@k! After 14 years, we have insight into what MMR is optimizing.
- Don't like the assumptions? Write down the objective you want, and derive the solution!

Bigger Picture: Probabilistic ML for IR
- Search engines are complex beasts, manually optimized (a practice that has grown out of the empirical IR philosophy).
- But there are probabilistic derivations for popular algorithms in IR: TF-IDF, BM25, language models.
- This is an opportunity for more modeling, learning, and optimization: probabilistic models of (latent) information needs, and solutions that autonomously learn and optimize for those needs!