Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search Xutao Li 1 Michael Ng 2 Yunming Ye 1 1 Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, China 2 Department of Mathematics, Hong Kong Baptist University, Hong Kong SIAM International Conference on Data Mining, 2012
Outline Motivation Related Work HAR (Idea + Theory + Algorithm) Experimental Results Concluding Remarks
Motivation Link analysis algorithms are critical to information retrieval tasks, especially Web-related retrieval applications: the Web contains much noise and low-quality information, and the link (hyperlink) structure is helpful for ranking, e.g., in Google. There are many applications where the links/hyperlinks can be characterized into different types.
Motivation - Examples of multi-relational data (a) multi-relational citation network (b) multi-semantic hyperlink network (c) multi-channel communication network (d) multi-conditional gene interaction network How to exploit such multi-relational link structures to facilitate the query search task is an important and open research problem.
Outline Motivation Related Work HAR (Idea + Theory + Algorithm) Experimental Results Concluding Remarks
Related Work The hyperlink structure is exploited by three of the most frequently cited Web IR methods: HITS (Hypertext Induced Topic Search), PageRank and SALSA. HITS was developed in 1997 by Jon Kleinberg. Soon after, Sergey Brin and Larry Page developed their now famous PageRank method. SALSA was developed in 2000 in reaction to the pros and cons of HITS and PageRank. [The survey given by A. Langville and C. Meyer, A Survey of Eigenvector Methods for Web Information Retrieval, SIAM Review, 2005.] In 2006, Tamara Kolda and Brett Bader proposed the TOPHITS method to analyze multi-relational link structures by using tensor decomposition.
New Challenge
Single-type relation (hyperlink):
PageRank: L. Page, S. Brin, R. Motwani and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. 1998.
HITS: J. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46:604-632, 1999.
SALSA: R. Lempel and S. Moran. The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect. The Ninth International WWW Conference, 2000.
Multi-relational link structures:
TOPHITS: T. Kolda and B. Bader. The TOPHITS Model for Higher-Order Web Link Analysis. Workshop on Link Analysis, Counterterrorism and Security, 2006. The decomposition may not be unique, and negative hub and authority scores can be produced.
Outline Motivation Related Work HAR (Idea + Theory + Algorithm) Experimental Results Concluding Remarks
The Idea In order to differentiate relations, we introduce a relevance score for each relation besides the hub and authority scores for objects. The hub, authority and relevance scores have a mutually-reinforcing relationship. Represent the data with a tensor construct transition probability tensors w.r.t. hubs, authorities and relations setup tensor equations based on random walk solve the tensor equations for obtaining the hub, authority and relevance scores
The Representation Example: five objects and three relations (R1: green, R2: blue, R3: red) among them. [Figure: (a) the multi-relational graph on objects 1-5; (b) the adjacency slices of the tensor for R1, R2 and R3.] In the following, we assume that there are m objects and n relations in the multi-relational data. It is represented as a tensor T = (t_{i1,i2,j1}), where (i1, i2) are the indices for objects and j1 is the index for relations.
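As an illustrative sketch in NumPy, the tensor T can be built by setting t_{i1,i2,j1} = 1 for each observed link; the link list below is hypothetical, chosen only to mirror the five-object, three-relation example on this slide:

```python
import numpy as np

m, n = 5, 3  # m objects, n relations

# Hypothetical (source, target, relation) triples for illustration.
links = [(0, 1, 0), (1, 2, 0),   # R1 links
         (2, 3, 1), (3, 4, 1),   # R2 links
         (4, 0, 2), (0, 2, 2)]   # R3 links

# t_{i1,i2,j1} = 1 iff object i1 links to object i2 via relation j1.
T = np.zeros((m, m, n))
for i1, i2, j1 in links:
    T[i1, i2, j1] = 1.0
```

For the real data sets later in the talk, T would of course be stored sparsely rather than as a dense array.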
Transition Probability Tensors We construct H = (h_{i1,i2,j1}), A = (a_{i1,i2,j1}) and R = (r_{i1,i2,j1}) with respect to hubs, authorities and relations by normalizing the entries of T as follows:
h_{i1,i2,j1} = t_{i1,i2,j1} / Σ_{i1=1}^{m} t_{i1,i2,j1},  i1 = 1, 2, ..., m,
a_{i1,i2,j1} = t_{i1,i2,j1} / Σ_{i2=1}^{m} t_{i1,i2,j1},  i2 = 1, 2, ..., m,
r_{i1,i2,j1} = t_{i1,i2,j1} / Σ_{j1=1}^{n} t_{i1,i2,j1},  j1 = 1, 2, ..., n.
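A minimal NumPy sketch of these three normalizations (fibers of T that sum to zero are simply left as zero here; the irreducibility assumption used later rules that case out, so no dangling-fiber correction is applied):

```python
import numpy as np

def transition_tensors(T):
    """Normalize T along each mode to obtain H, A and R."""
    s1 = T.sum(axis=0, keepdims=True)  # sum over i1 -> denominators for H
    s2 = T.sum(axis=1, keepdims=True)  # sum over i2 -> denominators for A
    s3 = T.sum(axis=2, keepdims=True)  # sum over j1 -> denominators for R
    with np.errstate(divide='ignore', invalid='ignore'):
        H = np.where(s1 > 0, T / s1, 0.0)
        A = np.where(s2 > 0, T / s2, 0.0)
        R = np.where(s3 > 0, T / s3, 0.0)
    return H, A, R
```

Each nonzero fiber of H, A and R then sums to one along its normalized mode, which is what makes the probabilistic interpretation on the next slide possible.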
Transition Probability Tensors These numbers give estimates of the following conditional probabilities:
h_{i1,i2,j1} = Prob[X_t = i1 | Y_t = i2, Z_t = j1]
a_{i1,i2,j1} = Prob[Y_t = i2 | X_t = i1, Z_t = j1]
r_{i1,i2,j1} = Prob[Z_t = j1 | Y_t = i2, X_t = i1]
where X_t, Y_t and Z_t are random variables referring to a visit to any particular object as a hub, a visit to any particular object as an authority, and the use of any particular relation, respectively, at time t. Here the time t refers to the time step in the random walk.
HAR - Tensor Equations hub score: x̄, authority score: ȳ, relevance score: z̄
H ȳ z̄ = x̄,  A x̄ z̄ = ȳ,  R x̄ ȳ = z̄,
with Σ_{i1=1}^{m} x̄_{i1} = 1, Σ_{i2=1}^{m} ȳ_{i2} = 1, Σ_{j1=1}^{n} z̄_{j1} = 1.
HAR - Tensor Equations hub score: x̄, authority score: ȳ, relevance score: z̄. In componentwise form:
Σ_{i2=1}^{m} Σ_{j1=1}^{n} h_{i1,i2,j1} ȳ_{i2} z̄_{j1} = x̄_{i1},  1 ≤ i1 ≤ m,
Σ_{i1=1}^{m} Σ_{j1=1}^{n} a_{i1,i2,j1} x̄_{i1} z̄_{j1} = ȳ_{i2},  1 ≤ i2 ≤ m,
Σ_{i1=1}^{m} Σ_{i2=1}^{m} r_{i1,i2,j1} x̄_{i1} ȳ_{i2} = z̄_{j1},  1 ≤ j1 ≤ n,
with Σ_{i1=1}^{m} x̄_{i1} = 1, Σ_{i2=1}^{m} ȳ_{i2} = 1, Σ_{j1=1}^{n} z̄_{j1} = 1.
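These three tensor-vector products can be written compactly with numpy.einsum; the helper name har_products below is ours, introduced only for illustration:

```python
import numpy as np

def har_products(H, A, R, x, y, z):
    """One evaluation of the three HAR tensor-vector products."""
    # x_{i1} = sum_{i2,j1} h_{i1,i2,j1} * y_{i2} * z_{j1}
    x_new = np.einsum('abc,b,c->a', H, y, z)
    # y_{i2} = sum_{i1,j1} a_{i1,i2,j1} * x_{i1} * z_{j1}
    y_new = np.einsum('abc,a,c->b', A, x, z)
    # z_{j1} = sum_{i1,i2} r_{i1,i2,j1} * x_{i1} * y_{i2}
    z_new = np.einsum('abc,a,b->c', R, x, y)
    return x_new, y_new, z_new
```

Because H, A and R are stochastic along the normalized mode, each product maps probability vectors to probability vectors, so the normalization constraints are preserved at every step.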
Generalization When we consider a single relation type, we can set z̄ to be the uniform vector e/n (e the vector of all ones), and thus we obtain two matrix equations: H ȳ (e/n) = x̄ and A x̄ (e/n) = ȳ. We remark that in this case A can be viewed as the transpose of H. This is exactly the same as solving for the singular vectors to get the hub and authority scoring vectors in SALSA. In summary, the proposed HAR framework is a generalization of SALSA that deals with multi-relational data.
HAR - Query Search To deal with query processing, we need to compute hub and authority scores of objects and relevance scores of relations with respect to a query input (like topic-sensitive PageRank):
(1 − α) H ȳ z̄ + α o = x̄,
(1 − β) A x̄ z̄ + β o = ȳ,
(1 − γ) R x̄ ȳ + γ r = z̄,
where o and r are two assigned probability distributions constructed from the query input, and 0 ≤ α, β, γ < 1 are three parameters.
HAR - Theory
Ω_m = {u = (u_1, u_2, ..., u_m) ∈ R^m : u_i ≥ 0, 1 ≤ i ≤ m, and Σ_{i=1}^{m} u_i = 1}
Ω_n = {w = (w_1, w_2, ..., w_n) ∈ R^n : w_j ≥ 0, 1 ≤ j ≤ n, and Σ_{j=1}^{n} w_j = 1}
Clearly, the solution of HAR lies in a convex set. We then derive the following two theorems based on the Brouwer Fixed Point Theorem.
HAR - Theory
Theorem 1. Suppose H, A and R are constructed, 0 ≤ α, β, γ < 1, and o ∈ Ω_m and r ∈ Ω_n are given. If T is irreducible, then there exist x̄ > 0, ȳ > 0 and z̄ > 0 such that (1 − α) H ȳ z̄ + α o = x̄, (1 − β) A x̄ z̄ + β o = ȳ, and (1 − γ) R x̄ ȳ + γ r = z̄, with x̄, ȳ ∈ Ω_m and z̄ ∈ Ω_n.
Theorem 2. Suppose T is irreducible, H, A and R are constructed, 0 ≤ α, β, γ < 1, and o ∈ Ω_m and r ∈ Ω_n are given. If 1 is not an eigenvalue of the Jacobian matrix of the mapping from the tensors, then the solution vectors x̄, ȳ and z̄ are unique.
The HAR Algorithm
Input: Three tensors H, A and R; two initial probability distributions y_0 and z_0 (with Σ_{i=1}^{m} [y_0]_i = 1 and Σ_{j=1}^{n} [z_0]_j = 1); the assigned probability distributions of objects and/or relations o and r (Σ_{i=1}^{m} [o]_i = 1 and Σ_{j=1}^{n} [r]_j = 1); three weighting parameters 0 ≤ α, β, γ < 1; and the tolerance ε.
Output: Three stationary probability distributions x̄ (hub scores), ȳ (authority scores) and z̄ (relevance scores).
Procedure:
1: Set t = 1;
2: Compute x_t = (1 − α) H y_{t−1} z_{t−1} + α o;
3: Compute y_t = (1 − β) A x_t z_{t−1} + β o;
4: Compute z_t = (1 − γ) R x_t y_t + γ r;
5: If ‖x_t − x_{t−1}‖ + ‖y_t − y_{t−1}‖ + ‖z_t − z_{t−1}‖ < ε, then stop; otherwise set t = t + 1 and go to Step 2.
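A minimal Python sketch of the procedure above; the function name har and the parameter defaults are illustrative (they are not the values used in the experiments), and uniform initial distributions are assumed:

```python
import numpy as np

def har(H, A, R, o, r, alpha=0.1, beta=0.1, gamma=0.9,
        tol=1e-10, max_iter=1000):
    """Iterate the query-biased HAR equations to a fixed point."""
    m, n = o.shape[0], r.shape[0]
    x = np.full(m, 1.0 / m)   # hub scores
    y = np.full(m, 1.0 / m)   # authority scores (uniform y_0)
    z = np.full(n, 1.0 / n)   # relevance scores (uniform z_0)
    for _ in range(max_iter):
        x_new = (1 - alpha) * np.einsum('abc,b,c->a', H, y, z) + alpha * o
        y_new = (1 - beta)  * np.einsum('abc,a,c->b', A, x_new, z) + beta * o
        z_new = (1 - gamma) * np.einsum('abc,a,b->c', R, x_new, y_new) + gamma * r
        # l1 change of all three score vectors, as in Step 5
        diff = (np.abs(x_new - x).sum() + np.abs(y_new - y).sum()
                + np.abs(z_new - z).sum())
        x, y, z = x_new, y_new, z_new
        if diff < tol:
            break
    return x, y, z
```

Each update keeps the three vectors in the probability simplexes Ω_m and Ω_n, matching the convex-set setting of the theory slides.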
Outline Motivation Related Work HAR (Idea + Theory + Algorithm) Experimental Results Concluding Remarks
Evaluation Metrics P@k: Given a particular query q, the precision at position k is P@k = #{relevant documents in top k results} / k. NDCG@k: a normalized version of the DCG@k metric. MAP: Given a query, the average precision is calculated by averaging the precision scores at each position in the search results where a relevant document is found; MAP is the mean of these values over all queries. R-prec: Given a query, R-prec is the precision score after R documents are retrieved, i.e., R-prec = P@R, where R is the total number of relevant documents for that query.
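A small Python sketch of these metrics with binary relevance; the NDCG discount below uses the common log2(i+1) convention, which the slide does not spell out, so treat that choice as an assumption:

```python
import numpy as np

def precision_at_k(ranked, relevant, k):
    """P@k: fraction of the top-k ranked items that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """AP: mean of P@k over the ranks k where a relevant item appears."""
    hits, score = 0, 0.0
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary gains and a log2(i+1) position discount."""
    dcg = sum(1.0 / np.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / np.log2(i + 1)
                for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```

R-prec is then simply precision_at_k(ranked, relevant, len(relevant)), and MAP averages average_precision over the query set.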
Experiment 1 100,000 webpages from the .gov Web collection in 2002 TREC, with 50 topic distillation topics from the TREC 2003 Web track as queries. Links among webpages via different anchor texts: 39,255 anchor terms (multiple relations) and 479,122 links with these anchor terms among the 100,000 webpages. If the i1-th webpage links to the i2-th webpage via the j1-th anchor term, we set the entry t_{i1,i2,j1} of T to be one. The size of T is 100,000 × 100,000 × 39,255.
The results of all comparison algorithms on the TREC data set:

Method                     P@10    P@20    NDCG@10  NDCG@20  MAP     R-prec
HITS                       0.0000  0.0000  0.0000   0.0000   0.0041  0.0000
SALSA                      0.0160  0.0140  0.0157   0.0203   0.0114  0.0084
TOPHITS (500-rank)         0.0020  0.0010  0.0044   0.0028   0.0008  0.0002
TOPHITS (1000-rank)        0.0040  0.0020  0.0088   0.0057   0.0016  0.0010
TOPHITS (1500-rank)        0.0040  0.0030  0.0063   0.0049   0.0011  0.0018
BM25+DepInOut              0.0280  0.0180  0.0419   0.0479   0.0370  0.0370
HAR (rel. query)           0.0560  0.0410  0.0659   0.0747   0.0330  0.0552
HAR (rel. and obj. query)  0.1100  0.0800  0.1545   0.1765   0.1035  0.1051
Parameters [Figure: performance (P@10, NDCG@10, MAP, R-prec) plotted against γ.] The parameter tuning test: tuning γ with α = β = 0.
Parameters [Figure: performance (P@10, NDCG@10, MAP, R-prec) plotted against α = β.] The parameter tuning test: tuning α and β with γ = 0.9.
Experiment 2 Five conferences (SIGKDD, WWW, SIGIR, SIGMOD, CIKM). Publication information includes title, authors, reference list, and the classification categories associated with each publication: 6848 publications and 617 different categories. 100 category concepts are used as query inputs to retrieve the relevant publications. Tensor: 6848 × 6848 × 617. If the i1-th publication cites the i2-th publication and the i2-th publication has the j1-th category concept, then we set the entry t_{i1,i2,j1} of T to be one; otherwise we set it to be zero.
The results of all comparison algorithms on the DBLP data set:

Method               P@10    P@20    NDCG@10  NDCG@20  MAP     R-prec
HITS                 0.2260  0.1815  0.3789   0.3792   0.2522  0.2751
SALSA                0.4100  0.3105  0.5606   0.5352   0.3462  0.3929
TOPHITS (50-rank)    0.1360  0.1145  0.1684   0.1557   0.0566  0.0617
TOPHITS (100-rank)   0.1640  0.1340  0.2012   0.1857   0.0646  0.0732
TOPHITS (150-rank)   0.1920  0.1410  0.2315   0.1998   0.0732  0.0765
BM25+DepInOut        0.0170  0.0145  0.0147   0.0138   0.0162  0.0109
HAR (rel. query)     0.5880  0.4155  0.7472   0.6760   0.4731  0.4683
Outline Motivation Related Work HAR (Idea + Theory + Algorithm) Experimental Results Concluding Remarks
Concluding Remarks Our framework is a general paradigm and can be further extended to data with higher-order tensors, for potential applications in the semantic web, image retrieval and community discovery. For example, we can consider the query search problem in the semantic web using a (1, 1, 1, 1)th-order rectangular tensor to represent the subject, object, predicate and context relationship. After constructing four transition probability tensors S, O, P and R for subject, object, predicate and context respectively, based on the proposed framework, we expect to solve the following set of tensor equations:
S ō p̄ r̄ = s̄,  O s̄ p̄ r̄ = ō,  P s̄ ō r̄ = p̄,  R s̄ ō p̄ = r̄.
Thank you!