Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search


Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search. Xutao Li (1), Michael Ng (2), Yunming Ye (1). (1) Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, China. (2) Department of Mathematics, Hong Kong Baptist University, Hong Kong. SIAM International Conference on Data Mining, 2012.

Outline Motivation Related Work HAR (Idea + Theory + Algorithm) Experimental Results Concluding Remarks

Motivation. Link analysis algorithms are critical to information retrieval tasks, especially Web-related retrieval applications. The Web contains much noise and low-quality information, so the link (hyperlink) structure is helpful, e.g., in Google. There are many applications where the links/hyperlinks can be characterized into different types.

Motivation - Examples of multi-relational data: (a) a multi-relational citation network, (b) a multi-semantic hyperlink network, (c) a multi-channel communication network, (d) a multi-conditional gene interaction network. How to exploit such multi-relational link structures to facilitate the query search task is an important and open research problem.

Outline Motivation Related Work HAR (Idea + Theory + Algorithm) Experimental Results Concluding Remarks

Related Work. The hyperlink structure is exploited by three of the most frequently cited Web IR methods: HITS (Hypertext Induced Topic Search), PageRank and SALSA. HITS was developed in 1997 by Jon Kleinberg. Soon after, Sergey Brin and Larry Page developed their now-famous PageRank method. SALSA was developed in 2000 in reaction to the pros and cons of HITS and PageRank. [See the survey by A. Langville and C. Meyer, A Survey of Eigenvector Methods for Web Information Retrieval, SIAM Review, 2005.] In 2006, Tamara Kolda and Brett Bader proposed the TOPHITS method, which analyzes multi-relational link structures using tensor decomposition.

New Challenge.
PageRank: L. Page, S. Brin, R. Motwani and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. 1998.
HITS: J. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46:604-632, 1999.
SALSA: R. Lempel and S. Moran. The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect. The Ninth International WWW Conference, 2000.
These three methods handle only a single type of relation (hyperlink).
TOPHITS: T. Kolda and B. Bader. The TOPHITS Model for Higher-Order Web Link Analysis. Workshop on Link Analysis, Counterterrorism and Security, 2006.
For TOPHITS, the decomposition may not be unique, and negative hub and authority scores can be produced.

Outline Motivation Related Work HAR (Idea + Theory + Algorithm) Experimental Results Concluding Remarks

The Idea. In order to differentiate relations, we introduce a relevance score for each relation in addition to the hub and authority scores for objects. The hub, authority and relevance scores have a mutually reinforcing relationship. The approach: represent the data with a tensor; construct transition probability tensors with respect to hubs, authorities and relations; set up tensor equations based on a random walk; solve the tensor equations to obtain the hub, authority and relevance scores.

The Representation. Example: five objects and three relations (R1: green, R2: blue, R3: red) among them. [Figure: (a) the multi-relational graph on objects 1-5; (b) the corresponding 5 x 5 x 3 tensor with frontal slices R1, R2, R3.] In the following, we assume that there are m objects and n relations in the multi-relational data. The data is represented as a tensor T = (t_{i1,i2,j1}), where (i1, i2) are the indices for objects and j1 is the index for relations.
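As a minimal sketch of this representation, the following builds a dense m x m x n tensor from link triples with numpy. The triples here are made up for illustration (they are not the figure's exact links), and 0-based indexing is used where the slides use 1-based.

```python
import numpy as np

# Toy multi-relational data at the slide's example scale:
# m = 5 objects, n = 3 relations. Each triple (i1, i2, j1) means
# "object i1 links to object i2 via relation j1".
m, n = 5, 3
triples = [(0, 1, 0), (1, 2, 0), (2, 3, 1), (3, 4, 1), (4, 0, 2), (0, 2, 2)]

# Dense representation of the tensor T = (t_{i1,i2,j1}).
T = np.zeros((m, m, n))
for i1, i2, j1 in triples:
    T[i1, i2, j1] = 1.0
```

For the 100,000 x 100,000 x 39,255 tensor of Experiment 1, a sparse coordinate-list representation would of course be needed instead of a dense array.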

Transition Probability Tensors. Construct H = (h_{i1,i2,j1}), A = (a_{i1,i2,j1}) and R = (r_{i1,i2,j1}) with respect to hubs, authorities and relations by normalizing the entries of T as follows:

h_{i1,i2,j1} = t_{i1,i2,j1} / sum_{i1=1}^{m} t_{i1,i2,j1},  i1 = 1, 2, ..., m,
a_{i1,i2,j1} = t_{i1,i2,j1} / sum_{i2=1}^{m} t_{i1,i2,j1},  i2 = 1, 2, ..., m,
r_{i1,i2,j1} = t_{i1,i2,j1} / sum_{j1=1}^{n} t_{i1,i2,j1},  j1 = 1, 2, ..., n.
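These three normalizations are each a division by a one-mode sum, which can be sketched in a few lines of numpy. The function name is ours; zero fibres (dangling cases) are simply left at zero here, since the talk's theory assumes an irreducible T.

```python
import numpy as np

def transition_tensors(T):
    """Mode-wise normalization of T yielding H, A and R."""
    def normalize(T, axis):
        s = T.sum(axis=axis, keepdims=True)
        # Divide entrywise by the fibre sum; leave zero fibres as zero.
        return np.divide(T, s, out=np.zeros_like(T), where=s > 0)

    H = normalize(T, axis=0)  # hub mode: sums over i1 become 1
    A = normalize(T, axis=1)  # authority mode: sums over i2 become 1
    R = normalize(T, axis=2)  # relation mode: sums over j1 become 1
    return H, A, R
```

On a strictly positive T, every fibre of H, A and R sums to one, matching the three formulas above.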

Transition Probability Tensors. These numbers give estimates of the following conditional probabilities:

h_{i1,i2,j1} = Prob[X_t = i1 | Y_t = i2, Z_t = j1]
a_{i1,i2,j1} = Prob[Y_t = i2 | X_t = i1, Z_t = j1]
r_{i1,i2,j1} = Prob[Z_t = j1 | X_t = i1, Y_t = i2]

where X_t, Y_t and Z_t are random variables referring to a visit to a particular object as a hub, a visit to a particular object as an authority, and the use of a particular relation, respectively, at time t. Here t refers to the time step in the random walk.

HAR - Tensor Equations. Hub score: x̄, authority score: ȳ, relevance score: z̄.

H ȳ z̄ = x̄,  A x̄ z̄ = ȳ,  R x̄ ȳ = z̄,

with sum_{i1=1}^{m} x̄_{i1} = 1, sum_{i2=1}^{m} ȳ_{i2} = 1, sum_{j1=1}^{n} z̄_{j1} = 1.

HAR - Tensor Equations (entrywise). Hub score: x̄, authority score: ȳ, relevance score: z̄, with

sum_{i2=1}^{m} sum_{j1=1}^{n} h_{i1,i2,j1} ȳ_{i2} z̄_{j1} = x̄_{i1},  1 <= i1 <= m,
sum_{i1=1}^{m} sum_{j1=1}^{n} a_{i1,i2,j1} x̄_{i1} z̄_{j1} = ȳ_{i2},  1 <= i2 <= m,
sum_{i1=1}^{m} sum_{i2=1}^{m} r_{i1,i2,j1} x̄_{i1} ȳ_{i2} = z̄_{j1},  1 <= j1 <= n,

and sum_{i1=1}^{m} x̄_{i1} = 1, sum_{i2=1}^{m} ȳ_{i2} = 1, sum_{j1=1}^{n} z̄_{j1} = 1.
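Each of these entrywise sums is a tensor-times-two-vectors contraction, which np.einsum can state directly (index i = i1, j = i2, k = j1, matching the subscripts above; the function names are our own).

```python
import numpy as np

def H_yz(H, y, z):
    # x_{i1} = sum_{i2,j1} h_{i1,i2,j1} * y_{i2} * z_{j1}
    return np.einsum("ijk,j,k->i", H, y, z)

def A_xz(A, x, z):
    # y_{i2} = sum_{i1,j1} a_{i1,i2,j1} * x_{i1} * z_{j1}
    return np.einsum("ijk,i,k->j", A, x, z)

def R_xy(R, x, y):
    # z_{j1} = sum_{i1,i2} r_{i1,i2,j1} * x_{i1} * y_{i2}
    return np.einsum("ijk,i,j->k", R, x, y)
```

Because H is stochastic along the hub mode, H_yz maps a pair of probability vectors to a probability vector, consistent with the normalization constraints above.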

Generalization. When we consider a single relation type, we can set z̄ to be e/n, where e is the vector of all ones, and we obtain two matrix equations:

H ȳ e/n = x̄,  A x̄ e/n = ȳ.

We remark that A can be viewed as the transpose of H. This is exactly the same as solving for the singular vectors to get the hub and authority scoring vectors in SALSA. In summary, the proposed HAR framework is a generalization of SALSA that deals with multi-relational data.

HAR - Query Search. To deal with query processing, we compute hub and authority scores of objects and relevance scores of relations with respect to a query input (like topic-sensitive PageRank):

(1 - α) H ȳ z̄ + α o = x̄,
(1 - β) A x̄ z̄ + β o = ȳ,
(1 - γ) R x̄ ȳ + γ r = z̄,

where o and r are two assigned probability distributions constructed from the query input, and 0 <= α, β, γ < 1 are three parameters.

HAR - Theory.

Ω_m = { u = (u_1, u_2, ..., u_m) ∈ R^m : u_i >= 0 for 1 <= i <= m, and sum_{i=1}^{m} u_i = 1 }
Ω_n = { w = (w_1, w_2, ..., w_n) ∈ R^n : w_j >= 0 for 1 <= j <= n, and sum_{j=1}^{n} w_j = 1 }

Clearly, the solution of HAR lies in a convex set. We then derive the following two theorems based on the Brouwer Fixed Point Theorem.

HAR - Theory.

Theorem 1. Suppose H, A and R are constructed, 0 <= α, β, γ < 1, and o ∈ Ω_m and r ∈ Ω_n are given. If T is irreducible, then there exist x̄ > 0, ȳ > 0 and z̄ > 0 such that (1 - α) H ȳ z̄ + α o = x̄, (1 - β) A x̄ z̄ + β o = ȳ, and (1 - γ) R x̄ ȳ + γ r = z̄, with x̄, ȳ ∈ Ω_m and z̄ ∈ Ω_n.

Theorem 2. Suppose T is irreducible, H, A and R are constructed, 0 <= α, β, γ < 1, and o ∈ Ω_m and r ∈ Ω_n are given. If 1 is not an eigenvalue of the Jacobian matrix of the mapping defined by the tensors, then the solution vectors x̄, ȳ and z̄ are unique.

The HAR Algorithm.
Input: three tensors H, A and R; two initial probability distributions y_0 and z_0 (with sum_{i=1}^{m} [y_0]_i = 1 and sum_{j=1}^{n} [z_0]_j = 1); the assigned probability distributions of objects and/or relations o and r (with sum_{i=1}^{m} [o]_i = 1 and sum_{j=1}^{n} [r]_j = 1); three weighting parameters 0 <= α, β, γ < 1; and the tolerance ε.
Output: three stationary probability distributions x̄ (hub scores), ȳ (authority scores) and z̄ (relevance scores).
Procedure:
1: Set t = 1;
2: Compute x_t = (1 - α) H y_{t-1} z_{t-1} + α o;
3: Compute y_t = (1 - β) A x_t z_{t-1} + β o;
4: Compute z_t = (1 - γ) R x_t y_t + γ r;
5: If ||x_t - x_{t-1}|| + ||y_t - y_{t-1}|| + ||z_t - z_{t-1}|| < ε, stop; otherwise set t = t + 1 and go to Step 2.
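The five-step procedure can be sketched in numpy as follows. This is a minimal illustration, not the authors' implementation: the default parameter values, the uniform initial distributions and the 1-norm stopping test are our choices.

```python
import numpy as np

def har(H, A, R, o, r, alpha=0.1, beta=0.1, gamma=0.1,
        eps=1e-10, max_iter=1000):
    """Sketch of the HAR iteration.

    H, A, R: m x m x n transition probability tensors.
    o, r: query distributions over objects and relations.
    Returns hub scores x, authority scores y, relevance scores z.
    """
    m, _, n = H.shape
    x = np.full(m, 1.0 / m)
    y = np.full(m, 1.0 / m)   # y_0: uniform initial distribution
    z = np.full(n, 1.0 / n)   # z_0
    for _ in range(max_iter):
        # Steps 2-4 of the procedure.
        x_new = (1 - alpha) * np.einsum("ijk,j,k->i", H, y, z) + alpha * o
        y_new = (1 - beta) * np.einsum("ijk,i,k->j", A, x_new, z) + beta * o
        z_new = (1 - gamma) * np.einsum("ijk,i,j->k", R, x_new, y_new) + gamma * r
        # Step 5: stop when the summed 1-norm changes fall below eps.
        diff = (np.abs(x_new - x).sum() + np.abs(y_new - y).sum()
                + np.abs(z_new - z).sum())
        x, y, z = x_new, y_new, z_new
        if diff < eps:
            break
    return x, y, z
```

As a sanity check, fully uniform tensors and query distributions leave the uniform scores fixed, as the tensor equations require.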

Outline Motivation Related Work HAR (Idea + Theory + Algorithm) Experimental Results Concluding Remarks

Evaluation metrics.
P@k: given a particular query q, the precision at position k is P@k = #{relevant documents in top k results} / k.
NDCG@k: a normalized version of the DCG@k metric.
MAP: given a query, the average precision is calculated by averaging the precision scores at each position in the search results where a relevant document is found; MAP is the mean over queries.
R-prec: given a query, R-prec is the precision score after R documents are retrieved, i.e., R-prec = P@R, where R is the total number of relevant documents for that query.
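These standard definitions can be sketched directly (function names are ours; `ranked` is a list of document ids in rank order, `relevant` a set of relevant ids for one query):

```python
def precision_at_k(ranked, relevant, k):
    """P@k: fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Average of P@k over the positions k holding a relevant document;
    averaging this value over all queries gives MAP."""
    hits, total = 0, 0.0
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def r_precision(ranked, relevant):
    """R-prec = P@R, with R the number of relevant documents."""
    return precision_at_k(ranked, relevant, len(relevant)) if relevant else 0.0
```

For example, with ranking [d1, d2, d3, d4] and relevant set {d1, d3}: P@2 = 1/2 and average precision = (1/1 + 2/3) / 2 = 5/6.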

Experiment 1. 100,000 webpages from the .gov Web collection in 2002 TREC, with the 50 topic distillation topics in the TREC 2003 Web track as queries. Links among webpages are labeled by anchor text: 39,255 anchor terms (multiple relations) and 479,122 links with these anchor terms among the 100,000 webpages. If the i1-th webpage links to the i2-th webpage via the j1-th anchor term, we set the entry t_{i1,i2,j1} of T to one. The size of T is 100,000 x 100,000 x 39,255.

                           P@10    P@20    NDCG@10  NDCG@20  MAP     R-prec
HITS                       0.0000  0.0000  0.0000   0.0000   0.0041  0.0000
SALSA                      0.0160  0.0140  0.0157   0.0203   0.0114  0.0084
TOPHITS (500-rank)         0.0020  0.0010  0.0044   0.0028   0.0008  0.0002
TOPHITS (1000-rank)        0.0040  0.0020  0.0088   0.0057   0.0016  0.0010
TOPHITS (1500-rank)        0.0040  0.0030  0.0063   0.0049   0.0011  0.0018
BM25+DepInOut              0.0280  0.0180  0.0419   0.0479   0.0370  0.0370
HAR (rel. query)           0.0560  0.0410  0.0659   0.0747   0.0330  0.0552
HAR (rel. and obj. query)  0.1100  0.0800  0.1545   0.1765   0.1035  0.1051

The results of all comparison algorithms on the TREC data set.

Parameters. [Figure: parameter tuning test, tuning γ with α = β = 0; performance (P@10, NDCG@10, MAP, R-prec) plotted against γ ∈ [0, 1].]

Parameters. [Figure: parameter tuning test, tuning α and β with γ = 0.9; performance (P@10, NDCG@10, MAP, R-prec) plotted against α = β ∈ [0, 1].]

Experiment 2. Five conferences (SIGKDD, WWW, SIGIR, SIGMOD, CIKM). Publication information includes title, authors, reference list, and classification categories associated with each publication: 6,848 publications and 617 different categories. 100 category concepts are used as query inputs to retrieve the relevant publications. Tensor: 6,848 x 6,848 x 617. If the i1-th publication cites the i2-th publication and the i2-th publication has the j1-th category concept, we set the entry t_{i1,i2,j1} of T to one; otherwise we set it to zero.

                    P@10    P@20    NDCG@10  NDCG@20  MAP     R-prec
HITS                0.2260  0.1815  0.3789   0.3792   0.2522  0.2751
SALSA               0.4100  0.3105  0.5606   0.5352   0.3462  0.3929
TOPHITS (50-rank)   0.1360  0.1145  0.1684   0.1557   0.0566  0.0617
TOPHITS (100-rank)  0.1640  0.1340  0.2012   0.1857   0.0646  0.0732
TOPHITS (150-rank)  0.1920  0.1410  0.2315   0.1998   0.0732  0.0765
BM25+DepInOut       0.0170  0.0145  0.0147   0.0138   0.0162  0.0109
HAR (rel. query)    0.5880  0.4155  0.7472   0.6760   0.4731  0.4683

The results of all comparison algorithms on the DBLP data set.

Outline Motivation Related Work HAR (Idea + Theory + Algorithm) Experimental Results Concluding Remarks

Concluding Remarks. Our framework is a general paradigm and can be further extended to data given by higher-order tensors, with potential applications in the semantic web, image retrieval and community discovery. For example, we can consider the query search problem in the semantic web using a (1, 1, 1, 1)th-order rectangular tensor to represent the subject, object, predicate and context relationship. After constructing four transition probability tensors S, O, P and R for subject, object, predicate and context respectively, based on the proposed framework we expect to solve the following set of tensor equations:

S ō p̄ r̄ = s̄,  O s̄ p̄ r̄ = ō,  P s̄ ō r̄ = p̄,  R s̄ ō p̄ = r̄.

Thank you!