PROBABILISTIC KNOWLEDGE GRAPH CONSTRUCTION: COMPOSITIONAL AND INCREMENTAL APPROACHES. Dongwoo Kim, with Lexing Xie and Cheng Soon Ong. CIKM 2016

Similar documents
A Three-Way Model for Collective Learning on Multi-Relational Data

Medical Question Answering for Clinical Decision Support

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Embedding-Based Techniques MATRICES, TENSORS, AND NEURAL NETWORKS

A Convolutional Neural Network-based

Learning to Disentangle Factors of Variation with Manifold Learning

Using Joint Tensor Decomposition on RDF Graphs

A Review of Relational Machine Learning for Knowledge Graphs

A Tutorial on Learning with Bayesian Networks

A Translation-Based Knowledge Graph Embedding Preserving Logical Property of Relations

Topic Models and Applications to Short Documents

RaRE: Social Rank Regulated Large-scale Network Embedding

Andriy Mnih and Ruslan Salakhutdinov

Web-Mining Agents. Multi-Relational Latent Semantic Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Interpretable and Compositional Relation Learning by Joint Training with an Autoencoder

Machine Learning for Biomedical Engineering. Enrico Grisan

A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank

Bayesian Contextual Multi-armed Bandits

Semestrial Project - Expedia Hotel Ranking

A Probabilistic Approach for Integrating Heterogeneous Knowledge Sources

A fast and simple algorithm for training neural probabilistic language models

A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation

A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I. kevin small & byron wallace

Homework 6: Image Completion using Mixture of Bernoullis

Latent Dirichlet Allocation Introduction/Overview

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

A Probabilistic Model for Canonicalizing Named Entity Mentions. Dani Yogatama Yanchuan Sim Noah A. Smith

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Machine Learning Linear Classification. Prof. Matteo Matteucci

Item Recommendation for Emerging Online Businesses

Machine Learning Methods for Radio Host Cross-Identification with Crowdsourced Labels

Essence of Machine Learning (and Deep Learning) Hoa M. Le Data Science Lab, HUST hoamle.github.io

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Collaborative Filtering with Entity Similarity Regularization in Heterogeneous Information Networks

Analogical Inference for Multi-Relational Embeddings

Text mining and natural language analysis. Jefrey Lijffijt

Decoupled Collaborative Ranking

Deep Convolutional Neural Networks for Pairwise Causality

CORE: Context-Aware Open Relation Extraction with Factorization Machines. Fabio Petroni

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

COURSE CONTENT for Computer Science & Engineering [CSE]

Principal Component Analysis

CS246 Final Exam, Winter 2011

Naïve Bayes classification

Cheng Soon Ong & Christian Walder. Canberra February June 2018

MAP Examples. Sargur Srihari

Graph-Based Anomaly Detection with Soft Harmonic Functions

Learning to Query, Reason, and Answer Questions On Ambiguous Texts

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21

Lecture 10: Introduction to reasoning under uncertainty. Uncertainty

Adaptive Crowdsourcing via EM with Prior

SYSTEMATIC CONSTRUCTION OF ANOMALY DETECTION BENCHMARKS FROM REAL DATA. Outlier Detection And Description Workshop 2013

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Multiclass Multilabel Classification with More Classes than Examples

PROBABILISTIC LATENT SEMANTIC ANALYSIS

Learning Terminological Naïve Bayesian Classifiers Under Different Assumptions on Missing Knowledge

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Collaborative Filtering. Radek Pelánek

Lab 12: Structured Prediction

Information Retrieval and Web Search Engines

Learning to Recommend Point-of-Interest with the Weighted Bayesian Personalized Ranking Method in LBSNs

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

ANLP Lecture 22 Lexical Semantics with Dense Vectors

Sum-Product Networks: A New Deep Architecture

Introduction to Probabilistic Machine Learning

R2 R2 2R R2 R1 3R This is the reduced echelon form of our matrix. It corresponds to the linear system of equations

Web Structure Mining Nodes, Links and Influence

Tutorial 2. Fall /21. CPSC 340: Machine Learning and Data Mining

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

arxiv: v2 [cs.cl] 28 Sep 2015

Active network alignment

Bayesian Methods: Naïve Bayes

Observing Dark Worlds (Final Report)

Large-scale Collaborative Ranking in Near-Linear Time

Lecture 21: Spectral Learning for Graphical Models

Active and Semi-supervised Kernel Classification

Thompson sampling for web optimisation. 29 Jan 2016 David S. Leslie

Nonparametric Bayesian Matrix Factorization for Assortative Networks

Programming Assignment 4: Image Completion using Mixture of Bernoullis

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

CS 188: Artificial Intelligence. Outline

Interactive Feature Selection with

The Applications of Tensor Factorization in Inference, Clustering, Graph Theory, Coding and Visual Representation

Classification and Pattern Recognition

Crowdsourcing & Optimal Budget Allocation in Crowd Labeling

ACO Comprehensive Exam March 20 and 21, Computability, Complexity and Algorithms

Midterm, Fall 2003

Natural Language Processing. Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu

Performance evaluation of binary classifiers

Algorithms for Collaborative Filtering

The Knowledge Gradient for Sequential Decision Making with Stochastic Binary Feedbacks

arxiv: v1 [cs.cv] 27 Apr 2016

UAPD: Predicting Urban Anomalies from Spatial-Temporal Data

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan

Knowledge Graph Completion with Adaptive Sparse Transfer Matrix

Transcription:

PROBABILISTIC KNOWLEDGE GRAPH CONSTRUCTION: COMPOSITIONAL AND INCREMENTAL APPROACHES
Dongwoo Kim, with Lexing Xie and Cheng Soon Ong. CIKM 2016

WHAT IS A KNOWLEDGE GRAPH (KG)?
KGs are widely used in various tasks such as search and Q&A.
Triple (the basic unit of a KG):
- Consists of a source entity, a target entity, and a relation
- e.g. <DiCaprio, starred in, The Revenant>

INCOMPLETE KNOWLEDGE GRAPH
A KG is constructed from external sources:
- Wikipedia, Web pages
Many existing KGs are incomplete:
- 93.8% of persons in Freebase have no place of birth (Min et al., 2013)
- Missing facts are hard to obtain from external sources
We need an alternative way to complete the KG.

IN THIS TALK, I WILL DISCUSS
The knowledge graph completion task, where the goal is to identify whether a triple is positive or negative. Specifically, I discuss two approaches to knowledge base completion:
1. Predict unknown triples based on observed triples
2. Manually label unknown triples with human experts
These two approaches are complementary to each other.

1. UNKNOWN TRIPLE PREDICTION
Unknown triples may be inferred from existing triples:
- <DiCaprio, starred in, The Revenant> → <DiCaprio, job, actor>
Statistical relational models:
- Goal: predict unknown triples from known triples
- Find the latent semantics of entities and relations

TENSOR REPRESENTATION OF KG
A KG can be represented as a 3-dimensional binary tensor $X$, or equivalently as a collection of matrices, each of which represents one relation.
Example:
- <DiCaprio, starred in, The Revenant>: $x_{ijk} = 1$ with $i$ = DiCaprio, $j$ = The Revenant, $k$ = starred in
- <DiCaprio, not starred in, Star Trek>: $x_{ij'k} = 0$ with $i$ = DiCaprio, $j'$ = Star Trek, $k$ = starred in
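To make the tensor view concrete, here is a minimal NumPy sketch; the index maps and the third entity are illustrative choices, not from the talk:

```python
import numpy as np

# Illustrative index maps echoing the slide's example
entities = {"DiCaprio": 0, "The Revenant": 1, "Star Trek": 2}
relations = {"starred in": 0}

n, m = len(entities), len(relations)
X = np.zeros((n, n, m))  # X[i, j, k] = 1 iff <entity i, relation k, entity j> is a known positive triple

i, j, k = entities["DiCaprio"], entities["The Revenant"], relations["starred in"]
X[i, j, k] = 1.0  # <DiCaprio, starred in, The Revenant>
# X[i, entities["Star Trek"], k] stays 0: <DiCaprio, starred in, Star Trek> is negative/unknown
```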

BILINEAR TENSOR FACTORISATION (RESCAL) (Nickel et al., 2011)
Latent space model:
- Represent an entity as a D-dimensional vector
- Represent a relation as a D×D matrix
Performs better on knowledge graphs than general-purpose tensor factorisation methods (Dong et al., 2014).

LATENT REPRESENTATION OF ENTITIES AND RELATIONS
Latent factors in the running example:
- Leonardo DiCaprio: $e_D = [0.8, 0.1]^\top$
- Yoshua Bengio: $e_B = [0.2, 0.1]^\top$
- Oscar: $e_O = [0.2, 0.8]^\top$
- Receive Award: $R_{RA} = \begin{bmatrix} 0.1 & 0.9 \\ 0.1 & 0.1 \end{bmatrix}$
The score of a triple is bilinear in the latent factors:
- <Bengio, Receive Award, Oscar>: $x_{B,O,RA} \approx e_B^\top R_{RA} e_O$
- <DiCaprio, Receive Award, Oscar>: $x_{D,O,RA} \approx e_D^\top R_{RA} e_O$

LATENT REPRESENTATION OF ENTITIES AND RELATIONS
Evaluating the bilinear scores:
- <Bengio, Receive Award, Oscar>: $e_B^\top R_{RA} e_O = 0.16$
- <DiCaprio, Receive Award, Oscar>: $e_D^\top R_{RA} e_O = 0.60$
Since $0.16 < 0.60$, DiCaprio is more likely to get the Oscar award!
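These numbers can be checked directly with a few lines of NumPy, using the example vectors above:

```python
import numpy as np

e_D = np.array([0.8, 0.1])    # Leonardo DiCaprio
e_B = np.array([0.2, 0.1])    # Yoshua Bengio
e_O = np.array([0.2, 0.8])    # Oscar
R_RA = np.array([[0.1, 0.9],
                 [0.1, 0.1]])  # Receive Award

print(e_B @ R_RA @ e_O)  # 0.158 ~ 0.16  <Bengio, Receive Award, Oscar>
print(e_D @ R_RA @ e_O)  # 0.602 ~ 0.60  <DiCaprio, Receive Award, Oscar>
```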

GOAL OF STATISTICAL RELATIONAL MODELS
Now suppose the latent factors are unknown:
$e_D = [?, ?]^\top$, $e_B = [?, ?]^\top$, $e_O = [?, ?]^\top$, $R_{RA} = \begin{bmatrix} ? & ? \\ ? & ? \end{bmatrix}$
The goal of a statistical relational model is to:
- Infer the latent vectors and matrices from known triples
- Predict new positive triples from the inferred latent space

PROBABILISTIC BILINEAR (COMPOSITIONAL) MODEL
We propose probabilistic RESCAL (PRESCAL).
Place a probability distribution over entity vectors and relation matrices:
$e_i \sim \mathcal{N}_D(0, \sigma_e^2 I)$, $\mathrm{vec}(R_k) \sim \mathcal{N}_{D^2}(0, \sigma_R^2 I)$
Place a probability distribution over triples:
$x_{ijk} \sim \mathcal{N}(e_i^\top R_k e_j, \sigma_x^2)$
This measures how likely a triple is to be positive.
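A minimal sketch of these generative assumptions; the shapes are NATION-sized for illustration, the variance hyperparameters are made up, and the paper's actual inference procedure is not shown:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_entities, n_relations = 2, 14, 56      # NATION-sized shapes, for illustration
sigma_e, sigma_R, sigma_x = 1.0, 1.0, 0.1   # made-up prior/noise scales

# e_i ~ N_D(0, sigma_e^2 I): one D-dimensional vector per entity
E = rng.normal(0.0, sigma_e, size=(n_entities, D))
# vec(R_k) ~ N_{D^2}(0, sigma_R^2 I): one D x D matrix per relation
R = rng.normal(0.0, sigma_R, size=(n_relations, D, D))

def triple_mean(i, j, k):
    """Mean of x_ijk under the model: e_i^T R_k e_j."""
    return E[i] @ R[k] @ E[j]

x_ijk = rng.normal(triple_mean(0, 1, 0), sigma_x)  # one draw of x_ijk
```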

KNOWLEDGE COMPOSITION
Another limitation of RESCAL: the model ignores path information in the KG during factorisation.
Example: a compositional triple is formed by a path in the KG, e.g. <DiCaprio, starred in, The Revenant> followed by the film's link to its director, Alejandro G. Iñárritu. The factorisation model ignores this compositional relation between Leonardo DiCaprio and Alejandro G. Iñárritu.

KG AUGMENTED WITH COMPOSITIONAL TRIPLES
We propose a probabilistic compositional RESCAL:
- Augment the original tensor with additional data: the compositional triples
- Explicitly include path information in the factorisation (see the sketch below)
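One simple realisation of the augmentation is to add a triple for every length-2 path in the observed KG. A hedged sketch in plain Python; the function and the tuple encoding of the composed relation are my own, and the paper's exact construction may differ:

```python
def compositional_triples(triples):
    """For every pair of observed triples <i, k1, j> and <j, k2, j2>, emit a
    length-2 path triple <i, (k1, k2), j2>."""
    by_source = {}
    for i, k, j in triples:
        by_source.setdefault(i, []).append((k, j))
    composed = set()
    for i, k1, j in triples:
        for k2, j2 in by_source.get(j, []):
            composed.add((i, (k1, k2), j2))
    return composed

# DiCaprio --starred in--> The Revenant --directed by--> Iñárritu
triples = {("DiCaprio", "starred in", "The Revenant"),
           ("The Revenant", "directed by", "Iñárritu")}
print(compositional_triples(triples))
# {('DiCaprio', ('starred in', 'directed by'), 'Iñárritu')}
```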

TENSOR FACTORISATION ON REAL DATASETS
Split each dataset into training/testing sets, and measure predictive performance on the test set given the training set.

Dataset  | # Relations | # Entities | # Positive Triples | Sparsity
KINSHIP  | 26          | 104        | 10,790             | 0.038
UMLS     | 49          | 135        | 6,752              | 0.008
NATION   | 56          | 14         | 2,024              | 0.184

Example triples:
- KINSHIP: <person A, father of, person B>
- UMLS: <disease E, affects, function F>
- NATION: <nation C, supports, nation D>
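The sparsity column can be reproduced as positive triples divided by all possible triples (#relations × #entities²); a quick check:

```python
datasets = {  # (#relations, #entities, #positive triples) from the table
    "KINSHIP": (26, 104, 10790),
    "UMLS":    (49, 135, 6752),
    "NATION":  (56, 14, 2024),
}
for name, (n_rel, n_ent, n_pos) in datasets.items():
    total = n_rel * n_ent * n_ent  # all possible triples
    print(f"{name}: {total} candidates, sparsity = {n_pos / total:.3f}")
# KINSHIP: 281216 candidates, sparsity = 0.038
# UMLS: 893025 candidates, sparsity = 0.008
# NATION: 10976 candidates, sparsity = 0.184
```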

PREDICTION RESULT
[Figure: AUC score vs. training set proportion/size on the NATION, UMLS, and KINSHIP datasets, comparing RESCAL, PRESCAL, and COMPOSITIONAL.]
- Higher AUC indicates better predictive performance
- The probabilistic models outperform the non-probabilistic model
- The compositional model outperforms the others on the UMLS and NATION datasets

2. COMPLETION WITH HUMAN EXPERTS
Some domains may not have enough information to construct a KG (e.g. bioinformatics, medical data):
- Without enough known triples, RESCAL-style models do not work well
Each unknown entry of the KG may be uncovered by domain experts:
- Uncovering a triple costs some amount of budget
- There are too many triples to evaluate them all, e.g. UMLS has about 900,000 candidate triples
Goal: choose the next triple to be evaluated so as to maximise the number of positive triples uncovered given a limited budget.

GREEDY INCREMENTAL COMPLETION WITH A STATISTICAL MODEL
Using the predictions of the bilinear model, a greedy approach to incremental completion:
1. Infer the latent semantics of the KG given the observed triples
2. Query the unknown triple that has the highest expected value under the current approximation: $\arg\max_{ijk} \mathbb{E}[e_i^\top R_k e_j]$
3. Update the latent semantics based on the observation from the queried result
4. Repeat steps 2-3 until the budget limit is reached
A runnable toy sketch of this loop follows.
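In the sketch below, a toy oracle tensor stands in for the human expert, point estimates stand in for the posterior, and a single gradient nudge stands in for the paper's full inference step; the sizes and step size are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, D = 10, 3, 2                                    # toy sizes
X_true = (rng.random((n, n, m)) < 0.1).astype(float)  # hidden ground truth (oracle)

E = rng.normal(0, 0.1, (n, D))     # current entity vectors
R = rng.normal(0, 0.1, (m, D, D))  # current relation matrices
observed = {}                      # (i, j, k) -> queried label

budget = 20
for t in range(budget):
    # Step 2: expected value e_i^T R_k e_j of every triple, from point estimates
    S = np.einsum('id,kde,je->ijk', E, R, E)
    for idx in observed:           # never re-query a labelled triple
        S[idx] = -np.inf
    i, j, k = np.unravel_index(np.argmax(S), S.shape)
    observed[(i, j, k)] = X_true[i, j, k]  # query the expert (here, the oracle)
    # Step 3 (sketched): nudge the latents toward the new observation
    err = observed[(i, j, k)] - E[i] @ R[k] @ E[j]
    gi, gj, gR = R[k] @ E[j], R[k].T @ E[i], np.outer(E[i], E[j])
    E[i] += 0.1 * err * gi
    E[j] += 0.1 * err * gj
    R[k] += 0.1 * err * gR

print(int(sum(observed.values())), "positives found in", budget, "queries")
```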

EXPLOITATION-EXPLORATION
The greedy approach always queries the maximum-expectation triple, $\arg\max_{ijk} \mathbb{E}[e_i^\top R_k e_j]$, and so may not explore the latent space efficiently.
We employ the randomised probability matching (RPM) algorithm (Thompson, 1933; a.k.a. Thompson sampling):
- Optimally balances exploitation and exploration
- Instead of picking the maximum-expectation triple, it takes the uncertainty of unknown triples into account

RANDOMISED PROBABILITY MATCHING
- T1 = <DiCaprio, starred in, The Revenant>: $\mathbb{E}[T_1] = 0.9$
- T2 = <Bengio, receive, Oscar>: $\mathbb{E}[T_2] = 1.5$
- T3 = <DiCaprio, profession, Actor>: $\mathbb{E}[T_3] = 0.2$
Although T2 has the highest expectation, its variance is also high. If we take this uncertainty into account, it might be better to choose T1.
- RPM chooses the triple to be labelled based on a random sample from each triple's distribution
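A minimal sketch of this selection rule, assuming Gaussian posteriors; only the three means come from the slide, and the standard deviations are made up so that T2 is both highest-mean and most uncertain:

```python
import numpy as np

rng = np.random.default_rng(2)

means = np.array([0.9, 1.5, 0.2])  # E[T1], E[T2], E[T3] from the slide
stds  = np.array([0.1, 1.0, 0.1])  # illustrative posterior uncertainties

# Greedy would always pick T2. Thompson sampling instead draws one sample
# per triple from its posterior and queries the argmax of the draws.
samples = rng.normal(means, stds)
pick = int(np.argmax(samples))
print(f"sampled scores = {samples.round(2)}, query T{pick + 1}")
```

Run repeatedly, this rule usually queries T2 but sometimes queries T1, which is how the exploration arises.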

EXPERIMENT: INCREMENTAL COMPLETION
Comparison with two greedy algorithms (Kajino et al., 2015):
- AMDC-POP: greedy population model, always queries the highest-expectation triple
- AMDC-PRED: greedy prediction model, always queries the most ambiguous triple (expected value around 0.5)
Performance metrics (cumulative gain is sketched below):
- Cumulative gain: total number of positive triples uncovered up to time t
- AUC: predictive performance of the model (as used in the first task)
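Cumulative gain is just a running count of positives among the queried triples; a tiny illustration:

```python
import numpy as np

def cumulative_gain(labels):
    """Total number of positive triples uncovered up to each time step t."""
    return np.cumsum(labels)

print(cumulative_gain([1, 0, 1, 1, 0]))  # [1 1 2 3 3]
```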

COMPARISON WITH GREEDY APPROACHES
[Figure: cumulative gain and ROC-AUC score vs. number of queries on the NATION, KINSHIP, and UMLS datasets, comparing PNORMAL-TS with AMDC-POP and AMDC-PRED.]

SUMMARY
- The knowledge graph completion task can be divided into unknown triple prediction and incremental completion
- We develop probabilistic RESCAL and incorporate compositional structure into the factorisation
- In the prediction task, compositional triples help to predict unknown triples
- In the construction task, randomised probability matching efficiently uncovers the latent semantics of the KG

INTUITION BEHIND THE BILINEAR MODEL
- $e_D = [0.8, 0.1]^\top$ (Leonardo DiCaprio), $e_B = [0.2, 0.1]^\top$ (Yoshua Bengio): the first dimension is an indicator of being a good actor
- $e_O = [0.2, 0.8]^\top$ (Oscar): the second dimension is an indicator of being a prestigious award
- $R_{RA} = \begin{bmatrix} 0.1 & 0.9 \\ 0.1 & 0.1 \end{bmatrix}$ (Receive Award): the large 0.9 entry captures the interaction between a good actor and a prestigious award in this relation
The bilinear model infers such latent semantics of entities and relations.

COMPARISON WITH COMPOSITIONAL MODELS
[Figure: cumulative gain and ROC-AUC score vs. number of queries on the NATION, KINSHIP, and UMLS datasets, comparing PRESCAL-TS and COMPOSITIONAL-TS.]

POSTERIOR VARIANCE ANALYSIS
[Figure: posterior variance over queries on the KINSHIP and UMLS datasets.]
The variance of the compositional model shrinks faster than that of the non-compositional model.