Fielded Sequential Dependence Model for Ad-Hoc Entity Retrieval in the Web of Data

Similar documents
Probabilistic Field Mapping for Product Search


RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Predicting Neighbor Goodness in Collaborative Filtering

arxiv: v1 [cs.ir] 1 May 2018

Language Models. CS6200: Information Retrieval. Slides by: Jesse Anderton

Natural Language Processing. Topics in Information Retrieval. Updated 5/10

Chapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze

A Study of the Dirichlet Priors for Term Frequency Normalisation

A Convolutional Neural Network-based

A Neural Passage Model for Ad-hoc Document Retrieval

Using the Past To Score the Present: Extending Term Weighting Models Through Revision History Analysis

Ranked Retrieval (2)

Lecture 13: More uses of Language Models

Query Performance Prediction: Evaluation Contrasted with Effectiveness

Chapter 3: Basics of Language Modelling

Comparing Relevance Feedback Techniques on German News Articles

Improving Diversity in Ranking using Absorbing Random Walks

Business Process Technology Master Seminar

Language Models, Smoothing, and IDF Weighting

BEWARE OF RELATIVELY LARGE BUT MEANINGLESS IMPROVEMENTS

Prediction of Citations for Academic Papers

The Axiometrics Project

Understanding Comments Submitted to FCC on Net Neutrality. Kevin (Junhui) Mao, Jing Xia, Dennis (Woncheol) Jeong December 12, 2014

Embedding-Based Techniques MATRICES, TENSORS, AND NEURAL NETWORKS

Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search

Motivation. User. Retrieval Model Result: Query. Document Collection. Information Need. Information Retrieval / Chapter 3: Retrieval Models

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy

Language Models. Web Search. LM Jelinek-Mercer Smoothing and LM Dirichlet Smoothing. Slides based on the books: 13

The Use of Dependency Relation Graph to Enhance the Term Weighting in Question Retrieval

Information Retrieval

Collaborative topic models: motivations cont

Within-Document Term-Based Index Pruning with Statistical Hypothesis Testing

Information Retrieval

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model

Statistical Ranking Problem

Concept Tracking for Microblog Search

Restricted Boltzmann Machines for Collaborative Filtering

Information Retrieval and Organisation

ChengXiang ( Cheng ) Zhai Department of Computer Science University of Illinois at Urbana-Champaign

Understanding Interlinked Data

Lower-Bounding Term Frequency Normalization

Passage language models in ad hoc document retrieval

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

ANLP Lecture 22 Lexical Semantics with Dense Vectors

A Three-Way Model for Collective Learning on Multi-Relational Data

Midterm Examination Practice

On the Foundations of Diverse Information Retrieval. Scott Sanner, Kar Wai Lim, Shengbo Guo, Thore Graepel, Sarvnaz Karimi, Sadegh Kharazmi

Language Models. Hongning Wang

CS 646 (Fall 2016) Homework 3

RECSM Summer School: Facebook + Topic Models. github.com/pablobarbera/big-data-upf

Knowledge Discovery in Data: Overview. Naïve Bayesian Classification. .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Modeling the Score Distributions of Relevant and Non-relevant Documents

Web-Mining Agents. Multi-Relational Latent Semantic Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

CS630 Representing and Accessing Digital Information Lecture 6: Feb 14, 2006

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan

Collaborative Recommendation with Multiclass Preference Context

PROBABILISTIC KNOWLEDGE GRAPH CONSTRUCTION: COMPOSITIONAL AND INCREMENTAL APPROACHES. Dongwoo Kim with Lexing Xie and Cheng Soon Ong CIKM 2016

Lecture 9: Probabilistic IR The Binary Independence Model and Okapi BM25

Deep Learning for NLP

Chapter 3: Basics of Language Modeling

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Learning to translate with neural networks. Michael Auli

Positional Language Models for Information Retrieval

A Bayesian Model of Diachronic Meaning Change

Fast Logistic Regression for Text Categorization with Variable-Length N-grams

Advanced Topics in Information Retrieval 5. Diversity & Novelty

Collaborative Filtering with Entity Similarity Regularization in Heterogeneous Information Networks

Variable Latent Semantic Indexing

1998: enter Link Analysis

APPLICATIONS OF MINING HETEROGENEOUS INFORMATION NETWORKS

Using Joint Tensor Decomposition on RDF Graphs

11. Learning To Rank. Most slides were adapted from Stanford CS 276 course.

Ranking-II. Temporal Representation and Retrieval Models. Temporal Information Retrieval

Mining Newsgroups Using Networks Arising From Social Behavior by Rakesh Agrawal et al. Presented by Will Lee

CIDOC-CRM Method: A Standardisation View. Haridimos Kondylakis, Martin Doerr, Dimitris Plexousakis,

CIKM 18, October 22-26, 2018, Torino, Italy

Utilizing Passage-Based Language Models for Document Retrieval

PV211: Introduction to Information Retrieval

Categorization ANLP Lecture 10 Text Categorization with Naive Bayes

ANLP Lecture 10 Text Categorization with Naive Bayes

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search

PV211: Introduction to Information Retrieval

Semantic Web SPARQL. Gerd Gröner, Matthias Thimm. July 16,

ECIR Tutorial 30 March Advanced Language Modeling Approaches (case study: expert search) 1/100.

Score Distribution Models

CS246 Final Exam, Winter 2011

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

PageRank algorithm Hubs and Authorities. Data mining. Web Data Mining PageRank, Hubs and Authorities. University of Szeged.

Fusing Data with Correlations

Information Retrieval

A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank

Decoupled Collaborative Ranking

Citation for published version (APA): Andogah, G. (2010). Geographically constrained information retrieval Groningen: s.n.

A Probabilistic Approach for Integrating Heterogeneous Knowledge Sources

Modeling Controversy Within Populations

Comparison between ontology distances (preliminary results)

Transcription:

Fielded Sequential Dependence Model for Ad-Hoc Entity Retrieval in the Web of Data Nikita Zhiltsov 1,2 Alexander Kotov 3 Fedor Nikolaev 3 1 Kazan Federal University 2 Textocat 3 Textual Data Analytics Lab, Department of Computer Science, Wayne State University

Overview Entities Entity Representation Fielded Sequential Dependence Model Parameter Estimation Results Conclusion 2/34

Knowledge Graphs 3/34

Linked Open Data (LOD) Cloud 4/34

Entities Material objects or concepts in the real world or in fiction (e.g. people, movies, conferences). Entities are connected to other entities by relations (e.g. hasGenre, actedIn, isPCMemberOf). Subject-Predicate-Object (SPO) triple: subject = entity; object = entity (or a primitive data value); predicate = relationship between the subject and the object. Many SPO triples together form a knowledge graph. 5/34
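As a minimal illustration of the slide's point, a handful of SPO triples can be grouped by subject to form an adjacency-list view of a knowledge graph. The entities and predicates below are made-up examples in the spirit of the talk, not data from it:

```python
from collections import defaultdict

# Each triple is (subject, predicate, object); objects may be entities
# or primitive data values, exactly as the slide describes.
triples = [
    ("Tom_Hanks", "actedIn", "Forrest_Gump"),   # entity -> entity
    ("Forrest_Gump", "hasGenre", "Drama"),
    ("Tom_Hanks", "birthYear", "1956"),         # entity -> primitive value
]

# Group triples by subject: the resulting map is the knowledge graph.
graph = defaultdict(list)
for subject, predicate, obj in triples:
    graph[subject].append((predicate, obj))
```

The same structure underlies DBpedia: millions of such triples, one adjacency list per entity.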

DBPedia entity page example 6/34

Entity Retrieval from Knowledge Graph(s) Graph KBs are well suited for addressing information needs that aim at finding specific objects (entities) rather than documents. Given the user's information need expressed as a keyword query, retrieve a relevant set of objects from the knowledge graph(s). 7/34

Typical ERWD tasks
Entity Search: queries refer to a particular entity, e.g. "Ben Franklin", "England football player highest paid", "Einstein Relativity theory"
List Search: complex queries with several relevant entities, e.g. "US presidents since 1960", "animals lay eggs mammals"
Question Answering: queries are questions in natural language, e.g. "Who is the mayor of Santiago?", "For which label did Elvis record his first album?"
8/34

Fundamental problems in ERWD
Designing effective and concise entity representations:
- Pound, Mika et al. Ad-hoc Object Retrieval in the Web of Data, WWW '10
- Blanco, Mika et al. Effective and Efficient Entity Search in RDF Data, ISWC '11
- Neumayer, Balog et al. On the Modeling of Entities for Ad-hoc Entity Search in the Web of Data, ECIR '12
Developing accurate retrieval models:
- Mostly adaptations of standard unigram bag-of-words retrieval models, such as BM25F and MLM
9/34

Overview Entities Entity Representation Fielded Sequential Dependence Model Parameter Estimation Results Conclusion 10/34

Entity document An entity is represented as a structured (multi-fielded) document with five fields:
- names: conventional names of the entity, such as the name of a person or the name of an organization
- attributes: all entity properties other than names
- categories: classes or groups to which the entity has been assigned
- similar entity names: names of entities that are very similar or identical to the given entity
- related entity names: names of entities that are part of the same RDF triple
11/34

Entity document example Multi-fielded entity document for the entity Barack Obama:

Field                 | Content
names                 | barack obama, barack hussein obama ii
attributes            | 44th current president united states, birth place honolulu hawaii
categories            | democratic party united states senator, nobel peace prize laureate, christian
similar entity names  | barack obama jr, barak hussein obama, barack h obama ii
related entity names  | spouse michelle obama, illinois state, predecessor george walker bush
12/34
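The five-field document above can be built by routing an entity's triples into fields. The predicate-to-field mapping below is invented for illustration (the talk does not specify one); unmapped predicates fall back to the attributes field:

```python
FIELDS = ["names", "attributes", "categories",
          "similar entity names", "related entity names"]

# Hypothetical mapping from predicates to fields -- an assumption for the
# sketch, not the mapping used by the authors.
FIELD_OF_PREDICATE = {
    "label": "names",
    "subject": "categories",        # e.g. category membership
    "redirect": "similar entity names",
}

def entity_document(entity, triples):
    doc = {f: [] for f in FIELDS}
    for s, p, o in triples:
        if s == entity:
            # unmapped predicates contribute to the attributes field
            doc[FIELD_OF_PREDICATE.get(p, "attributes")].append(o)
        elif o == entity:
            # entities appearing in the same triple become related entity names
            doc["related entity names"].append(s)
    return doc
```

In practice the routing would also tokenize and lowercase the values, as in the Barack Obama example.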

Overview Entities Entity Representation Fielded Sequential Dependence Model Parameter Estimation Results Conclusion 13/34

Motivation Previous research in ad-hoc IR has focused on two major directions:
Unigram bag-of-words retrieval models for multi-fielded documents:
- Ogilvie and Callan. Combining Document Representations for Known-item Search, SIGIR '03
- Robertson et al. Simple BM25 Extension to Multiple Weighted Fields, CIKM '04
Retrieval models incorporating term dependencies:
- Metzler and Croft. A Markov Random Field Model for Term Dependencies, SIGIR '05
- Huston and Croft. A Comparison of Retrieval Models using Term Dependencies, CIKM '14
Goal: develop a retrieval model that captures both document structure and term dependencies. 14/34

MLM Ranks w.r.t.

$$P(Q \mid D) \stackrel{rank}{=} \prod_{q_i \in Q} P(q_i \mid \theta_D)^{tf(q_i, Q)}, \quad \text{where} \quad P(q_i \mid \theta_D) = \sum_j w_j \, P(q_i \mid \theta_D^j)$$

15/34
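The MLM score above can be sketched in a few lines. One assumption is made explicit here: the slide leaves the per-field estimator implicit, so this sketch uses Dirichlet smoothing for each field language model (a common choice, and the one used later in the talk):

```python
import math

def mlm_term_prob(term, doc_fields, weights, stats, mu=100.0):
    """P(q_i | theta_D) = sum_j w_j P(q_i | theta_D^j), with each per-field
    model Dirichlet-smoothed (an assumption; see lead-in)."""
    p = 0.0
    for j, w_j in weights.items():
        tf = doc_fields[j].count(term)                 # term freq in field j
        cf = stats[j]["cf"].get(term, 0)               # collection freq, field j
        p += w_j * (tf + mu * cf / stats[j]["len"]) / (len(doc_fields[j]) + mu)
    return p

def mlm_score(query, doc_fields, weights, stats, mu=100.0):
    # rank-equivalent log of prod_i P(q_i | theta_D)^{tf(q_i, Q)}
    return sum(math.log(mlm_term_prob(q, doc_fields, weights, stats, mu))
               for q in query)
```

`doc_fields` maps field names to token lists; `stats[j]` holds the collection frequencies and collection length for field j.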

SDM Ranks w.r.t.

$$P_\Lambda(D \mid Q) = \sum_{i \in \{T, O, U\}} \lambda_i f_i(Q, D)$$

The potential function for unigrams is query likelihood with Dirichlet smoothing:

$$f_T(q_i, D) = \log P(q_i \mid \theta_D) = \log \frac{tf_{q_i,D} + \mu \frac{cf_{q_i}}{|C|}}{|D| + \mu}$$

16/34
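A sketch of the unigram potential exactly as on the slide, plus an ordered-bigram potential written by analogy (the slide shows only f_T; f_O applies the same smoothing to exact bigram matches, which is how SDM's ordered potential is usually described):

```python
import math

def f_T(q, doc, cf, clen, mu=100.0):
    """Unigram potential: Dirichlet-smoothed query likelihood."""
    return math.log((doc.count(q) + mu * cf / clen) / (len(doc) + mu))

def f_O(a, b, doc, cf2, clen, mu=100.0):
    """Ordered-bigram potential, sketched by analogy with f_T:
    the same smoothing applied to counts of exact bigram matches."""
    tf = sum(1 for pair in zip(doc, doc[1:]) if pair == (a, b))
    return math.log((tf + mu * cf2 / clen) / (len(doc) + mu))
```

Here `doc` is the document as a token list, `cf`/`cf2` are collection frequencies of the unigram/bigram, and `clen` is the collection length.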

FSDM ranking function FSDM incorporates document structure and term dependencies with the following ranking function:

$$P_\Lambda(D \mid Q) \stackrel{rank}{=} \lambda_T \sum_{q_i \in Q} \tilde{f}_T(q_i, D) + \lambda_O \sum_{q_i \in Q} \tilde{f}_O(q_i, q_{i+1}, D) + \lambda_U \sum_{q_i \in Q} \tilde{f}_U(q_i, q_{i+1}, D)$$

Separate MLMs for bigrams and unigrams give FSDM the flexibility to adjust document scoring depending on the query type. MLM is a special case of FSDM, with $\lambda_T = 1, \lambda_O = 0, \lambda_U = 0$. 17/34

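Putting the pieces together, a self-contained sketch of the FSDM score (not the authors' implementation; their code is linked at the end of the talk). The fielded potentials follow the Dirichlet-smoothed mixture form of the next slide; the unordered-window potential f_U is stubbed with f_O to keep the sketch short, and a tiny epsilon guards the log against zero mixtures:

```python
import math

def dirichlet(tf, dlen, cf, clen, mu=100.0):
    return (tf + mu * cf / clen) / (dlen + mu)

def bigram_count(pair, tokens):
    return sum(1 for t in zip(tokens, tokens[1:]) if t == pair)

def f_T(q, fields, w, stats):
    # fielded unigram potential: log of the field-weighted mixture
    return math.log(1e-12 + sum(
        w[j] * dirichlet(fields[j].count(q), len(fields[j]),
                         stats[j]["cf"].get(q, 0), stats[j]["len"])
        for j in w))

def f_O(a, b, fields, w, stats):
    # fielded ordered-bigram potential: same shape, exact bigram matches
    return math.log(1e-12 + sum(
        w[j] * dirichlet(bigram_count((a, b), fields[j]), len(fields[j]),
                         stats[j]["cf2"].get((a, b), 0), stats[j]["len"])
        for j in w))

def fsdm_score(query, fields, w, lam, stats, f_U=f_O):
    # f_U should score unordered windows; reusing f_O keeps the sketch short
    pairs = list(zip(query, query[1:]))
    return (lam["T"] * sum(f_T(q, fields, w["T"], stats) for q in query)
            + lam["O"] * sum(f_O(a, b, fields, w["O"], stats) for a, b in pairs)
            + lam["U"] * sum(f_U(a, b, fields, w["U"], stats) for a, b in pairs))
```

Setting lam = {"T": 1, "O": 0, "U": 0} reduces this score to the MLM sum, matching the special-case remark on the slide.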

FSDM ranking function Potential function for unigrams in the case of FSDM:

$$\tilde{f}_T(q_i, D) = \log \sum_j w_j^T P(q_i \mid \theta_D^j) = \log \sum_j w_j^T \frac{tf_{q_i,D_j} + \mu_j \frac{cf_{q_i}^j}{|C_j|}}{|D_j| + \mu_j}$$

Example: for the query "apollo astronauts who walked on the moon", different query terms are best matched by different fields (the slide highlights category and attribute matches). 18/34

Parameters of FSDM Overall, FSDM has 3|F| + 3 free parameters: w^T, w^O, w^U, λ. Properties of the ranking function:
1. Linearity with respect to λ: any linear learning-to-rank algorithm can be applied to optimize the ranking function with respect to λ.
2. Linearity with respect to w of the arguments of the monotonic f(·) functions: optimizing these arguments as linear functions of w optimizes each function f(·).
19/34

Overview Entities Entity Representation Fielded Sequential Dependence Model Parameter Estimation Results Conclusion 20/34

Optimization algorithm
1: Q ← training queries
2: for s ∈ {T, O, U} do    // optimize field weights of the three LMs independently
3:     λ ← e_s
4:     ŵ^s ← CoordAsc(Q, λ)
5: end for
6: λ̂ ← CoordAsc(Q, ŵ^T, ŵ^O, ŵ^U)    // optimize λ
The unit vectors e_T = (1, 0, 0), e_O = (0, 1, 0), e_U = (0, 0, 1) are the corresponding settings of the parameters λ in the FSDM ranking function. CoordAsc directly optimizes the target metric, e.g. MAP. 21/34
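The CoordAsc routine named above is metric-driven coordinate ascent. A deliberately minimal stand-in (not the authors' code): greedily perturb one coordinate at a time on the probability simplex and keep any change that improves the objective, which in the paper is the target retrieval metric computed over the training queries:

```python
def coord_ascent(objective, dim, step=0.05, iters=50):
    """Greedy coordinate ascent over a simplex-constrained weight vector.
    `objective` maps a weight list to a score to maximize (e.g. MAP on
    training queries); this is a simplified sketch of CoordAsc."""
    x = [1.0 / dim] * dim
    best = objective(x)
    for _ in range(iters):
        improved = False
        for i in range(dim):
            for delta in (step, -step):
                y = x[:]
                y[i] = max(0.0, y[i] + delta)      # perturb one coordinate
                z = sum(y) or 1.0
                y = [v / z for v in y]             # project back to simplex
                score = objective(y)
                if score > best:
                    x, best, improved = y, score, True
        if not improved:
            break
    return x

# Two-stage use, following the algorithm above: Stage 1 fixes lambda = e_s,
# which isolates the s-type component, so each field-weight vector w^s can be
# tuned on its own; Stage 2 then tunes lambda with the w vectors held fixed.
```

Because the metric is not differentiable, this kind of derivative-free search is the standard choice for direct metric optimization.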

Overview Entities Entity Representation Fielded Sequential Dependence Model Parameter Estimation Results Conclusion 22/34

Collection DBpedia 3.7 was used as the collection in all experiments. It is a structured version of the online encyclopedia Wikipedia and provides descriptions of over 3.5 million entities belonging to 320 classes. 23/34

Query Sets Balog and Neumayer. A Test Collection for Entity Search in DBpedia, SIGIR '13.

Query set     | #Queries | Query types [Pound et al., 2010]
SemSearch ES  | 130      | Entity
ListSearch    | 115      | Type
INEX-LD       | 100      | Entity, Type, Attribute, Relation
QALD-2        | 140      | Entity, Type, Attribute, Relation
24/34

Tuning field weights The attributes field is consistently considered very valuable for both unigrams and bigrams. The names field, as well as the similar entity names field, is highly important for queries aimed at finding named entities. Distinguishing categories from related entity names is particularly important for type queries. 25/34

Tuning λ [Bar charts of the tuned values of λ_T, λ_O, λ_U per query set for (a) SDM and (b) FSDM.] Bigram matches are important for named-entity queries. The transformation of SDM into FSDM increases the importance of bigram matches, which ultimately improves retrieval performance, as we will demonstrate next. 26/34

Experimental results

Query set     | Method | MAP   | P@10  | P@20  | b-pref
SemSearch ES  | MLM-CA | 0.320 | 0.250 | 0.179 | 0.674
              | SDM-CA | 0.254 | 0.202 | 0.149 | 0.671
              | FSDM   | 0.386 | 0.286 | 0.204 | 0.750
ListSearch    | MLM-CA | 0.190 | 0.252 | 0.192 | 0.428
              | SDM-CA | 0.197 | 0.252 | 0.202 | 0.471
              | FSDM   | 0.203 | 0.256 | 0.203 | 0.466
INEX-LD       | MLM-CA | 0.102 | 0.238 | 0.190 | 0.318
              | SDM-CA | 0.117 | 0.258 | 0.199 | 0.335
              | FSDM   | 0.111 | 0.263 | 0.215 | 0.341
QALD-2        | MLM-CA | 0.152 | 0.103 | 0.084 | 0.373
              | SDM-CA | 0.184 | 0.106 | 0.090 | 0.465
              | FSDM   | 0.195 | 0.136 | 0.111 | 0.466
All queries   | MLM-CA | 0.196 | 0.206 | 0.157 | 0.455
              | SDM-CA | 0.192 | 0.198 | 0.155 | 0.495
              | FSDM   | 0.231 | 0.231 | 0.179 | 0.517
27/34

Topic-Level differences between SDM and FSDM [Plots of topic-level differences in average precision between FSDM and SDM for (a) all queries, (b) SemSearch ES, (c) ListSearch, (d) INEX-LD, (e) QALD-2.] Positive values indicate FSDM is better. 28/34

Overview Entities Entity Representation Fielded Sequential Dependence Model Parameter Estimation Results Conclusion 29/34

Conclusion
- We proposed the Fielded Sequential Dependence Model, a novel retrieval model that incorporates term dependencies into structured document retrieval.
- We proposed a two-stage algorithm to directly optimize the parameters of FSDM with respect to the target retrieval metric.
- We experimentally demonstrated that having different field weighting schemes for unigrams and bigrams is effective for different types of ERWD queries.
- Experimental evaluation of FSDM on a standard publicly available benchmark showed that it consistently and, in most cases, statistically significantly outperforms MLM and SDM for the task of ERWD.
30/34

Code and runs are available at github.com/teanalab/fieldedsdm Questions? 31/34

Robustness [Histogram of per-query performance changes relative to MLM-CA, bucketed from <= -100% to >= 100%, for SDM and FSDM.] FSDM is more robust than SDM: FSDM improves the performance of 50% of the queries with respect to MLM-CA, compared to 45% improved by SDM, and decreases the performance of only 26% of the queries, while SDM degrades 40% of them. 32/34

Various Levels of Difficulty

Level             | Model | MAP   | P@10  | P@20  | b-pref
Difficult queries | SDM   | 0.213 | 0.067 | 0.042 | 0.599
                  | FSDM  | 0.239 | 0.065 | 0.043 | 0.621
Medium queries    | SDM   | 0.209 | 0.224 | 0.165 | 0.532
                  | FSDM  | 0.264 | 0.272 | 0.191 | 0.559
Easy queries      | SDM   | 0.139 | 0.298 | 0.262 | 0.316
                  | FSDM  | 0.166 | 0.345 | 0.309 | 0.330

Creating sophisticated entity descriptions is not sufficient for answering difficult queries in the entity retrieval scenario; better capturing the semantics of query terms is required to further improve the precision of FSDM on difficult queries. 33/34

Failure Analysis
SDM errors: overestimating the importance of matches in fields other than names, e.g. "city of charlotte", "give me all soccer clubs in the premier league", "us presidents since 1960".
FSDM errors: neglecting important query terms, e.g. "members of the beaux arts trio", "who created goofy", "where is the residence of the prime minister of spain?"; lack of semantic knowledge, e.g. "did nicole kidman have any siblings".
34/34