Hyperlinked-Induced Topic Search (HITS) identifies. authorities as good content sources (~high indegree) HITS [Kleinberg 99] considers a web page

Similar documents
LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Data Mining and Matrices

Link Analysis Ranking

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

STA141C: Big Data & High Performance Statistical Computing

CS54701 Information Retrieval. Link Analysis. Luo Si. Department of Computer Science Purdue University. Borrowed Slides from Prof.

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

CS47300: Web Information Search and Management

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

1 Searching the World Wide Web

Online Social Networks and Media. Link Analysis and Web Search

Models and Algorithms for Complex Networks. Link Analysis Ranking

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

Link Analysis. Leonid E. Zhukov

PROBABILISTIC LATENT SEMANTIC ANALYSIS

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search

Authoritative Sources in a Hyperlinked Enviroment

IR: Information Retrieval

Lecture 12: Link Analysis for Web Retrieval

Perturbation of the hyper-linked environment

The Dynamic Absorbing Model for the Web

Information Retrieval and Search. Web Linkage Mining. Miłosz Kadziński

Online Social Networks and Media. Link Analysis and Web Search

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Faloutsos, Tong ICDE, 2009

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

Inf 2B: Ranking Queries on the WWW

Intelligent Data Analysis. PageRank. School of Computer Science University of Birmingham

How does Google rank webpages?

Link Mining PageRank. From Stanford C246

PageRank algorithm Hubs and Authorities. Data mining. Web Data Mining PageRank, Hubs and Authorities. University of Szeged.

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Node Centrality and Ranking on Networks

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

Wiki Definition. Reputation Systems I. Outline. Introduction to Reputations. Yury Lifshits. HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Data science with multilayer networks: Mathematical foundations and applications

1998: enter Link Analysis

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Link Analysis: Hubs and Authorities on the World Wide Web

Node and Link Analysis

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

Spectral Graph Theory and You: Matrix Tree Theorem and Centrality Metrics

Eigenvalues of Exponentiated Adjacency Matrices

Web Structure Mining Nodes, Links and Influence

0.1 Naive formulation of PageRank

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

Complex Networks CSYS/MATH 303, Spring, Prof. Peter Dodds

Computing PageRank using Power Extrapolation


Jeffrey D. Ullman Stanford University

Multiple Relational Ranking in Tensor: Theory, Algorithms and Applications

Applications to network analysis: Eigenvector centrality indices Lecture notes

Introduction to Data Mining

Data and Algorithms of the Web

The Static Absorbing Model for the Web a

ECEN 689 Special Topics in Data Science for Communications Networks

Link Analysis and Web Search

Cross-lingual and temporal Wikipedia analysis

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds

Google Page Rank Project Linear Algebra Summer 2012

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

A Note on Google s PageRank

Finding Authorities and Hubs From Link Structures on the World Wide Web

Slides based on those in:

CS47300: Web Information Search and Management

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology

Data Mining and Analysis: Fundamental Concepts and Algorithms

(a) If A is a 3 by 4 matrix, what does this tell us about its nullspace? Solution: dim N(A) 1, since rank(a) 3. Ax =

Idea: Select and rank nodes w.r.t. their relevance or interestingness in large networks.

The Forward-Backward Embedding of Directed Graphs

K-method of calculating the mutual influence of nodes in a directed weight complex networks

Name: Matriculation Number: Tutorial Group: A B C D E

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology

Links between Kleinberg s hubs and authorities, correspondence analysis, and Markov chains

Multimedia Databases - 68A6 Final Term - exercises

Introduction to Information Retrieval

Some relationships between Kleinberg s hubs and authorities, correspondence analysis, and the Salsa algorithm

be a Householder matrix. Then prove the followings H = I 2 uut Hu = (I 2 uu u T u )u = u 2 uut u

Introduction to Information Retrieval

6.207/14.15: Networks Lectures 4, 5 & 6: Linear Dynamics, Markov Chains, Centralities

Must-read Material : Multimedia Databases and Data Mining. Indexing - Detailed outline. Outline. Faloutsos

CSI 445/660 Part 6 (Centrality Measures for Networks) 6 1 / 68

Math 304 Handout: Linear algebra, graphs, and networks.

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas

Information Retrieval

Latent Semantic Analysis. Hongning Wang

Analysis (SALSA) and the TKC Eect. R. Lempel S. Moran. Department of Computer Science. The Technion, Haifa 32000, Israel

The Second Eigenvalue of the Google Matrix

AXIOMS FOR CENTRALITY

Complex Social System, Elections. Introduction to Network Analysis 1

Some relationships between Kleinberg s hubs and authorities, correspondence analysis, and the Salsa algorithm

CSE 494/598 Lecture-4: Correlation Analysis. **Content adapted from last year s slides

Transcription:

IV.3 HITS Hyperlinked-Induced Topic Search (HITS) identifies authorities as good content sources (~high indegree) hubs as good link sources (~high outdegree) HITS [Kleinberg 99] considers a web page a good authority if many good hubs link to it a good hub if it links to many good authorities ~ mutual reinforcement between hubs & authorities Jon Kleinberg H A H A H A H A 30

HITS Given (partial) Web graph G(V, E), let a(v) and h(v) denote the authority score and hub score of the web page v a(v) / X h(u) (u,v)2e h(v) / X a(w) (v,w)2e Authority and hub scores in matrix notation a = A T h h = Aa with adjacency matrix A, hub & authority score vectors a & h, and constants α and β 31

HITS as Eigenvector Computation Plugging authority and hub equations into each other produces a = A T h = a = A T Aa= A T Aa h = Aa= A A T h = AA T h with a and h as eigenvectors of A T A and AA T, respectively Intuitive Interpretation: A T A is the cocitation matrix, i.e., A T Aij is the number of web pages that link to both i and j AA T is the coreference matrix, i.e., AA T ij is the number of web pages to which both i and j link 32

Cocitation and Coreference Matrix Adjacency matrix A 1 2 3 4 A = 20 0 1 1 3 0 0 1 1 0 0 0 0 6 7 40 0 0 05 Cocitation matrix A T A A T A = 20 0 0 0 3 0 0 0 0 0 0 2 2 6 7 40 0 2 25 Coreference matrix AA T AA T = 22 2 0 0 3 2 2 0 0 0 0 0 0 6 7 40 0 0 05 33

HITS Algorithm a (0) = (1,, 1) T, h (0) = (1,, 1) T Repeat until convergence of a and h: h (i+1) = A a (i) h (i+1) = h (i+1) / h (i+1) // re-normalize h a (i+1) = A T h (i) a (i+1) = a (i+1) / a (i+1) // re-normalize a Convergence is guaranteed under fairly general conditions: For a symmetric n-by-n matrix M and a vector v that is not orthogonal to the principal eigenvector w(m), the unit vector in the direction of M k v converges to w(m) for k 34

Root Set & Expansion Set HITS operates on a query-dependent subgraph of the Web 1. Determine sufficient number of root pages (e.g., 50-100 pages) based on relevance ranking for query (e.g., using TF*IDF) 2. For each root page, add all of its successors 3. For each root page, add up to d predecessors 4. Compute authority and hub scores on the query-dependent subgraph of the Web induced by this expansion set (typically: 1000-5000 pages) 5. Return top-k authorities and top-k hubs 35

Root Set & Expansion Set (Example) Root Set Shortcoming: Relevance scores within root set not considered 36

Root Set & Expansion Set (Example) Root Set Shortcoming: Relevance scores within root set not considered 36

Root Set & Expansion Set (Example) Root Set Shortcoming: Relevance scores within root set not considered 36

Root Set & Expansion Set (Example) Root Set Expansion Set Shortcoming: Relevance scores within root set not considered 36

Improved HITS Potential weaknesses of the HITS algorithm: irritating links (e.g., automatically generated links, spam, etc.) topic drift (e.g., from jaguar car to car) [Bharat and Henzinger 98] introduce edge weights 0 for links within the same host 1/k with k links from k URLs of the same host to 1 URL (aweight) 1/m with m links from 1 URL to m URLs on the same host (hweight) Consider relevance weights rel(v) w.r.t. query (e.g., TF*IDF) a(v) / h(v) / X (u, v)2e X (v, w)2e h(u) rel(v) aweight(u,v) a(w) rel(v) hweight(v,w) 37

Dominant Subtopics in HITS A = 2 3 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 60 0 0 0 0 0 1 0 1 1 7 40 0 0 0 0 0 0 1 0 15 0 0 0 0 0 0 0 1 0 0 1 2 3 4 5 6 7 8 HITS returns the authority and hub vectors 9 10 a = 0.15 0.08 0.26 0.18 0.21 0.12 0.00 0.00 0.00 0.00 T h = 0.10 0.28 0.04 0.15 0.08 0.35 0.00 0.00 0.00 0.00 T Observation: Only the nodes {1,, 6} in the dominant subtopic have a non-zero authority and hub score 38

HITS & SVD The authority vector a and hub vector h determined by HITS are eigenvectors of the matrices AA T and A T A, respectively For A = UΣV T as the SVD of the adjacency matrix A U contains the eigenvectors of AA T as its columns (with U1 corresponding to the hub vector h) V contains the eigenvectors of A T A as its columns (with V1 corresponding to the authority vector a) 39

HITS & SVD (Example) A = U = 6 4 2 3 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 60 0 0 0 0 0 1 0 1 1 7 40 0 0 0 0 0 0 1 0 15 0 0 0 0 0 0 0 1 0 0 2 3 0.20 0.00 0.14 0.00 0.39 0.70 0.00 0.29 0.00 0.43 0.56 0.00 0.66 0.00 0.24 0.16 0.00 0.32 0.00 0.22 0.08 0.00 0.25 0.00 0.49 0.31 0.00 0.53 0.00 0.54 0.31 0.00 0.53 0.00 0.54 0.08 0.00 0.25 0.00 0.49 0.16 0.00 0.32 0.00 0.22 0.56 0.00 0.66 0.00 0.24 0.70 0.00 0.29 0.00 0.43 0.20 0.00 0.14 0.00 0.39 0.00 0.27 0.00 0.33 0.00 0.00 0.80 0.00 0.40 0.00 0.00 0.80 0.00 0.40 0.00 0.00 0.27 0.00 0.33 0.00 7 0.00 0.49 0.00 0.65 0.00 0.00 0.16 0.00 0.54 0.00 5 0.00 0.16 0.00 0.54 0.00 0.00 0.49 0.00 0.65 0.00 1 2 3 4 5 6 7 8 9 10 2 3 2.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.98 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.74 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.48 0.00 0.00 0.00 0.00 0.00 0.00 = 0.00 0.00 0.00 0.00 1.45 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.84 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.00 0.00 60.00 0.00 0.00 0.00 0.00 0.00 0.00 0.71 0.00 0.00 7 40.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.41 0.005 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.30 2 V = 6 4 3 0.34 0.00 0.56 0.00 0.31 0.48 0.00 0.47 0.00 0.07 0.19 0.00 0.45 0.00 0.71 0.26 0.00 0.37 0.00 0.16 0.60 0.00 0.21 0.00 0.13 0.42 0.00 0.25 0.00 0.57 0.42 0.00 0.25 0.00 0.57 0.60 0.00 0.21 0.00 0.13 0.48 0.00 0.47 0.00 0.07 0.34 0.00 0.56 0.00 0.31 0.26 0.00 0.37 0.00 0.16 0.19 0.00 0.45 0.00 0.71 0.00 0.40 0.00 0.27 0.00 0.00 0.33 0.00 0.80 0.00 0.00 0.33 0.00 0.80 0.00 0.00 0.40 0.00 0.27 0.00 7 0.00 0.54 0.00 0.49 0.00 0.00 0.65 0.00 0.16 0.00 5 0.00 0.65 0.00 0.16 0.00 0.00 0.54 0.00 0.49 0.00 40

HITS for Community Detection Problem: Root set may contain multiple subtopics or communities (e.g., for ambiguous queries like jaguar or java) and HITS may favor only the dominant subtopic Approach: Consider the k eigenvectors of A T A associated with the k largest eigenvalues (e.g., using SVD on A) For each of these k eigenvectors, the largest authority scores indicate a densely connected community SVD useful as a general tool to detect communities in graphs 41

HITS vs. PageRank PageRank HITS Matrix construction static query time Matrix size huge moderate Stochastic matrix yes no Dampening by random jumps yes no Outdegree normalization yes no Score stability to perturbations yes no Resilience to topic drift n/a no Resilience to spam no no But: PageRank features (e.g., random jump) could be incorporated into HITS; HITS could be applied to the entire Web; PageRank could also be applied to a query-dependent subgraph 42

HITS vs. PageRank [Najork et al. 07] compare HITS, PageRank, etc. in terms of their retrieval effectiveness when combined with Okapi BM25F Dataset: Web crawl consisting of 463 M web pages containing 17.6 M hyperlinks and referencing 2.9 B distinct URLs; 28 K queries sampled from a query log Methods: PageRank 0.36 0.34 0.32.341.340.339.337.336.336.334.311.311.310.310.310.310 HITS (auth / hub) NDCG@10 0.30 0.28 Degree (in / out) 0.26 0.24.231 all (all links considered) id (only inter-domain links) 0.22 degree-in-id degree-in-ih degree-in-all hits-aut-ih-100 hits-aut-all-100 pagerank hits-aut-id-10 degree-out-all hits-hub-all-100 degree-out-ih hits-hub-ih-100 degree-out-id hits-hub-id-10 bm25f in (only inter-host links) 43

Summary of IV.3 Hubs as web pages that link to good authorities Authorities as web pages that are linked to by good hubs HITS operates on a query-dependent subgraph of the Web determines eigenvectors of the matrices AA T and A T A SVD helps to circumvent the dominant subtopic problem in HITS can be used as a general tool to identify communities in graphs 44

Additional Literature for IV.3 K. Bharat and M. Henzinger: Improved Algorithms for Topic Distillation in a Hyperlinked Environment, SIGIR 1998 A. Borodin, G.O. Roberts, J.S. Rosenthal, and P. Tsaparas: Link analysis ranking: algorithms, theory, and experiments. ACM TOIT 5(1), 2005 J. Dean and M. Henzinger: Finding Related Pages in the World Wide Web, Computer Networks 31:1467-1479, 1999 J. Kleinberg: Authoritative sources in a hyperlinked environment, Journal of the ACM 46:604-632, 1999 M. Najork, H. Zaragoza, and M. Taylor: HITS on the Web: How does it Compare?, SIGIR 2007 45