CS54701 Information Retrieval. Link Analysis. Luo Si. Department of Computer Science Purdue University. Borrowed Slides from Prof.

Similar documents
CS47300: Web Information Search and Management

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Link Analysis. Leonid E. Zhukov

Hyperlinked-Induced Topic Search (HITS) identifies. authorities as good content sources (~high indegree) HITS [Kleinberg 99] considers a web page

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

A Note on Google s PageRank

Link Analysis Ranking

Node Centrality and Ranking on Networks

Online Social Networks and Media. Link Analysis and Web Search

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Information Retrieval and Search. Web Linkage Mining. Miłosz Kadziński

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Data Mining and Matrices

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

0.1 Naive formulation of PageRank

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Complex Social System, Elections. Introduction to Network Analysis 1

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Intelligent Data Analysis. PageRank. School of Computer Science University of Birmingham

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

How does Google rank webpages?

Data and Algorithms of the Web

Online Social Networks and Media. Link Analysis and Web Search

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

Applications to network analysis: Eigenvector centrality indices Lecture notes

1998: enter Link Analysis

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

IR: Information Retrieval

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

Inf 2B: Ranking Queries on the WWW

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

Google Page Rank Project Linear Algebra Summer 2012

Cross-lingual and temporal Wikipedia analysis

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Slides based on those in:

Link Mining PageRank. From Stanford C246

Multiple Relational Ranking in Tensor: Theory, Algorithms and Applications

Lecture 12: Link Analysis for Web Retrieval

Web Structure Mining Nodes, Links and Influence

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Introduction to Data Mining

Ranking on Large-Scale Graphs with Rich Metadata

Wiki Definition. Reputation Systems I. Outline. Introduction to Reputations. Yury Lifshits. HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte

Link Analysis. Stony Brook University CSE545, Fall 2016

The Second Eigenvalue of the Google Matrix

1 Searching the World Wide Web

Spectral Graph Theory and You: Matrix Tree Theorem and Centrality Metrics

Data Mining Recitation Notes Week 3

Uncertainty and Randomization

eigenvalues, markov matrices, and the power method

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

CSI 445/660 Part 6 (Centrality Measures for Networks) 6 1 / 68

PROBABILISTIC LATENT SEMANTIC ANALYSIS

Computing PageRank using Power Extrapolation

The Static Absorbing Model for the Web a

CS47300: Web Information Search and Management

Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search

Index. Copyright (c)2007 The Society for Industrial and Applied Mathematics From: Matrix Methods in Data Mining and Pattern Recgonition By: Lars Elden

STA141C: Big Data & High Performance Statistical Computing

DOMINANT VECTORS APPLICATION TO INFORMATION EXTRACTION IN LARGE GRAPHS. Laure NINOVE OF NONNEGATIVE MATRICES

Graph Models The PageRank Algorithm

PageRank algorithm Hubs and Authorities. Data mining. Web Data Mining PageRank, Hubs and Authorities. University of Szeged.

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya

Faloutsos, Tong ICDE, 2009

Analysis of Google s PageRank

Data Mining and Analysis: Fundamental Concepts and Algorithms

Node and Link Analysis

CS6220: DATA MINING TECHNIQUES

Pr[positive test virus] Pr[virus] Pr[positive test] = Pr[positive test] = Pr[positive test]

Link Analysis and Web Search

Models and Algorithms for Complex Networks. Link Analysis Ranking

The PageRank Problem, Multi-Agent. Consensus and Web Aggregation

Mathematical Properties & Analysis of Google s PageRank

Chapter 10. Finite-State Markov Chains. Introductory Example: Googling Markov Chains

The Dynamic Absorbing Model for the Web

Idea: Select and rank nodes w.r.t. their relevance or interestingness in large networks.

CS6220: DATA MINING TECHNIQUES

How works. or How linear algebra powers the search engine. M. Ram Murty, FRSC Queen s Research Chair Queen s University

6.207/14.15: Networks Lectures 4, 5 & 6: Linear Dynamics, Markov Chains, Centralities

Data Mining Techniques

Lecture 7 Mathematics behind Internet Search

CS249: ADVANCED DATA MINING

Page rank computation HPC course project a.y

Calculating Web Page Authority Using the PageRank Algorithm

Math 443/543 Graph Theory Notes 5: Graphs as matrices, spectral graph theory, and PageRank

PageRank: The Math-y Version (Or, What To Do When You Can t Tear Up Little Pieces of Paper)

Conditioning of the Entries in the Stationary Vector of a Google-Type Matrix. Steve Kirkland University of Regina

Jeffrey D. Ullman Stanford University

Finding central nodes in large networks

Degree Distribution: The case of Citation Networks

Lecture 10. Lecturer: Aleksander Mądry Scribes: Mani Bastani Parizi and Christos Kalaitzis

c 2016 Society for Industrial and Applied Mathematics

Perturbation of the hyper-linked environment

Transcription:

CS54701 Information Retrieval Link Analysis Luo Si Department of Computer Science Purdue University Borrowed Slides from Prof. Rong Jin (MSU)

Citation Analysis

Web Structure Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge What does it look likes? Web is small world The diameter of the web is 19 e.g. the average number of clicks from one web site to another is 19

Bowtie Structure Strongly Connected Component Broder et al., 2001

Bowtie Structure Sites that link towards the center of the web Broder et al., 2001

Bowtie Structure Sites that link from the center of the web Broder et al., 2001

Inlinks and Outlinks Both degrees of incoming and outgoing links follow power law Broder et al., 2001

Early Approaches Basic Assumptions Hyperlinks contain information about the human judgment of a site The more incoming links to a site, the more it is judged important Bray 1996 The visibility of a site is measured by the number of other sites pointing to it The luminosity of a site is measured by the number of other sites to which it points Limitation: failure to capture the relative importance of different parents (children) sites

HITS - Kleinberg s Algorithm HITS Hypertext Induced Topic Selection For each vertex v Є V in a subgraph of interest: a(v) - the authority of v h(v) - the hubness of v A site is very authoritative if it receives many citations. Citation from important sites weight more than citations from less-important sites Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites

Authority and Hubness 2 5 3 1 1 6 4 7 a(1) = h(2) + h(3) + h(4) h(1) = a(5) + a(6) + a(7)

Authority and Hubness: Version 1 R e c u rs iv e d e p e n d e n c y a ( v ) h ( w ) w p a [ v ] h ( v ) a ( w ) w c h [ v ] HubsAuthorities(G) 1 1 [1,,1] Є R 2 a h 1 0 0 3 t 1 4 repeat 5 for each v in V 6 do a (v) Σ h (w) t 7 h (v) Σ a (w) t w Є pa[v] 8 t t + 1 t -1 9 until a a + h h < ε t t -1 t t -1 10 return (a, h ) t t V w Є pa[v] t -1

Authority and Hubness: Version 1 R e c u rs iv e d e p e n d e n c y a ( v ) h ( w ) w p a [ v ] h ( v ) a ( w ) w c h [ v ] HubsAuthorities(G) 1 1 [1,,1] Є R 2 a h 1 0 0 3 t 1 4 repeat 5 for each v in V 6 do a (v) Σ h (w) t 7 h (v) Σ a (w) t w Є pa[v] 8 t t + 1 t -1 9 until a a + h h < ε t t -1 t t -1 10 return (a, h ) t t V w Є pa[v] t -1 Problems?

Authority and Hubness: Version 2 R e c u rs iv e d e p e n d e n c y a ( v ) h ( w ) w p a [ v ] h ( v ) a ( w ) a( v) h( v) w c h [ v ] + Normalization a( v) w a( w) h( v) w h( w) HubsAuthorities(G) 1 1 [1,,1] Є R 2 a 0 h 0 1 3 t 1 4 repeat 5 for each v in V 6 do a (v) Σ h (w) t 7 h t (v) Σ w Є pa[v] a (w) t -1 8 a t a t / a 9 h t h t / h 10 t t + 1 11 until a t a t -1 + h t h t -1 < ε 12 return (a, h ) t t V w Є pa[v] t -1

HITS Example Results Authority Hubness 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Authority and hubness weights

Authority and Hubness Authority score Not only depends on the number of incoming links But also the quality (e.g., hubness) of the incoming links Hubness score Not only depends on the number of outgoing links But also the quality (e.g., hubness) of the outgoing links

Authority and Hub Column vector a: a i is the authority score for the i-th site Column vector h: h i is the hub score for the i-th site Matrix M: M i, j 1 th e ith site p o in ts to th e jth site 0 o th e rw ise M =

Authority and Hub Vector a: a i is the authority score for the i-th site Vector h: h i is the hub score for the i-th site Matrix M: M i, j 1 th e ith site p o in ts to th e jth site 0 o th e rw ise Recursive dependency: a(v) Σ h(w) w Є pa[v] h(v) Σ a(w) w Є ch[v]

Authority and Hub Column vector a: a i is the authority score for the i-th site Column vector h: h i is the hub score for the i-th site Matrix M: M i, j 1 th e ith site p o in ts to th e jth site 0 o th e rw ise Recursive dependency: a(v) Σ h(w) w Є pa[v] h(v) Σ a(w) w Є ch[v] a M T h h M a

Authority and Hub Column vector a: a i is the authority score for the i-th site Column vector h: h i is the hub score for the i-th site M i, j Recursive dependency: Matrix M: 1 th e ith site p o in ts to th e jth site 0 o th e rw ise Normalization Procedure a(v) Σ h(w) w Є pa[v] h(v) Σ a(w) w Є ch[v] T t t t a M h h M a t t t

Authority and Hub T t t t Apply SVD to matrix M T t t t t a M h a M M a h M a h M M h T t t t t t t t T i T i i i M UΣ V u v a u 1, h v 1

PageRank Introduced by Page et al (1998) The weight is assigned by the rank of parents Difference with HITS HITS takes Hubness & Authority weights The page rank is proportional to its parents rank, but inversely proportional to its parents outdegree

Matrix Notation M i, j 1 th e ith site p o in ts to th e jth site 0 o th e rw ise B i, j j 1 M i, j i, j 0 0 o th e rw is e j M M = B =

Matrix Notation : rep resen ts th e ran k sco re fo r th e i-th w eb p ag e r r i Finding Pagerank to find principle eigenvector of B r = α B T r α : eigenvalue r : eigenvector of B

Matrix Notation

Random Walk Model Consider a random walk through the Web graph?? B =???

Random Walk Model Consider a random walk through the Web graph B =

Random Walk Model Consider a random walk through the Web graph B =

Random Walk Model Consider a random walk through the Web graph B = T, what is portion of time that the surfer will spend time on each site?

Random Walk Model Consider a random walk through the Web graph B = p( k) : p e rc e n ta g e o f tim e th a t th e su rfe r w ill sta y a t th e i-th site p k ( ) p ( i ) B i, k i T p B p

Adding Self Loop Allow surfer to decide to stay on the same place B = B ' B (1 ) I

Problem Rank Sink Problem In general, many Web pages have no inlinks/outlinks It results in dangling edges in the graph B = 0 0 0 0 0 0 0 0 0 0 0 1 0 0 r(new page) = 0

Problem Rank Sink Problem In general, many Web pages have no inlinks/outlinks It results in dangling edges in the graph B = 0 0 0 0 0 1 0 0 0 0 0 0 0 0 r(new page) = 1

Distribution of the Mixture Model H i, j 1/ n B ' H (1 ) B r B r ' T Prevent the page ranks to be either zeros or one.

Stability Whether the link analysis algorithms based on eigenvectors are stable in the sense that results don t change significantly? The connectivity of a portion of the graph is changed arbitrary How will it affect the results of algorithms?

Stability of HITS Ng et al (2001) A bound on the number of hyperlinks k that can be added or deleted from one page without affecting the authority or hubness weights It is possible to perturb a symmetric matrix by a quantity that grows as δ that produces a constant perturbation of the dominant eigenvector δ: eigengap λ1 λ2 d: maximum outdegree of G

Stability of PageRank r r 2 r ( j ) jv Ng et al (2001) V: the set of vertices touched by the perturbation The parameter ε of the mixture model has a stabilization role If the set of pages affected by the perturbation have a small rank, the overall change will also be small