Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Similar documents
Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

A Note on Google s PageRank

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

Link Analysis Ranking

1998: enter Link Analysis

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

Lecture 12: Link Analysis for Web Retrieval

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

ECEN 689 Special Topics in Data Science for Communications Networks

Link Analysis. Leonid E. Zhukov

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

Math 304 Handout: Linear algebra, graphs, and networks.

IR: Information Retrieval

Page rank computation HPC course project a.y

Data Mining and Matrices

Finding central nodes in large networks

Online Social Networks and Media. Link Analysis and Web Search

eigenvalues, markov matrices, and the power method

How does Google rank webpages?

Graph Models The PageRank Algorithm

Google Page Rank Project Linear Algebra Summer 2012

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Lecture 7 Mathematics behind Internet Search

Data Mining Recitation Notes Week 3

0.1 Naive formulation of PageRank

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Applications to network analysis: Eigenvector centrality indices Lecture notes

Online Social Networks and Media. Link Analysis and Web Search

Node Centrality and Ranking on Networks

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Complex Social System, Elections. Introduction to Network Analysis 1

Calculating Web Page Authority Using the PageRank Algorithm

Uncertainty and Randomization

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Web Structure Mining Nodes, Links and Influence

Inf 2B: Ranking Queries on the WWW

Link Mining PageRank. From Stanford C246

Slides based on those in:

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

CS54701 Information Retrieval. Link Analysis. Luo Si. Department of Computer Science Purdue University. Borrowed Slides from Prof.

Monte Carlo methods in PageRank computation: When one iteration is sufficient

AXIOMS FOR CENTRALITY

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

CSI 445/660 Part 6 (Centrality Measures for Networks) 6 1 / 68

CS47300: Web Information Search and Management

Degree Distribution: The case of Citation Networks

1 Searching the World Wide Web

Wiki Definition. Reputation Systems I. Outline. Introduction to Reputations. Yury Lifshits. HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte

How works. or How linear algebra powers the search engine. M. Ram Murty, FRSC Queen s Research Chair Queen s University

Practice Problems - Linear Algebra

Introduction to Data Mining

COMPSCI 514: Algorithms for Data Science

CS249: ADVANCED DATA MINING

Lecture: Local Spectral Methods (1 of 4)

arxiv:cond-mat/ v1 3 Sep 2004

Node and Link Analysis

Pr[positive test virus] Pr[virus] Pr[positive test] = Pr[positive test] = Pr[positive test]

googling it: how google ranks search results Courtney R. Gibbons October 17, 2017

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Information Retrieval and Search. Web Linkage Mining. Miłosz Kadziński

Link Analysis. Stony Brook University CSE545, Fall 2016

Math 443/543 Graph Theory Notes 5: Graphs as matrices, spectral graph theory, and PageRank

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

The Dynamic Absorbing Model for the Web

Application. Stochastic Matrices and PageRank

Algebraic Representation of Networks

The Google Markov Chain: convergence speed and eigenvalues

Data and Algorithms of the Web

Computing PageRank using Power Extrapolation

Hyperlinked-Induced Topic Search (HITS) identifies. authorities as good content sources (~high indegree) HITS [Kleinberg 99] considers a web page

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya

Announcements: Warm-up Exercise:

Randomization and Gossiping in Techno-Social Networks

STA141C: Big Data & High Performance Statistical Computing

The Static Absorbing Model for the Web a

PageRank algorithm Hubs and Authorities. Data mining. Web Data Mining PageRank, Hubs and Authorities. University of Szeged.

Intelligent Data Analysis. PageRank. School of Computer Science University of Birmingham

What is this Page Known for? Computing Web Page Reputations. Outline

Class President: A Network Approach to Popularity. Due July 18, 2014

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds

Local properties of PageRank and graph limits. Nelly Litvak University of Twente Eindhoven University of Technology, The Netherlands MIPT 2018

SUPPLEMENTARY MATERIALS TO THE PAPER: ON THE LIMITING BEHAVIOR OF PARAMETER-DEPENDENT NETWORK CENTRALITY MEASURES

Krylov Subspace Methods to Calculate PageRank

PAGERANK PARAMETERS. Amy N. Langville. American Institute of Mathematics Workshop on Ranking Palo Alto, CA August 17th, 2010

Chapter 10. Finite-State Markov Chains. Introductory Example: Googling Markov Chains

Applications. Nonnegative Matrices: Ranking

DS504/CS586: Big Data Analytics Graph Mining II

Multiple Relational Ranking in Tensor: Theory, Algorithms and Applications

Lecture 14: Random Walks, Local Graph Clustering, Linear Programming

Faloutsos, Tong ICDE, 2009

IS4200/CS6200 Informa0on Retrieval. PageRank Con+nued. with slides from Hinrich Schütze and Chris6na Lioma

CS6220: DATA MINING TECHNIQUES

Updating PageRank. Amy Langville Carl Meyer

UpdatingtheStationary VectorofaMarkovChain. Amy Langville Carl Meyer

Google Matrix, dynamical attractors and Ulam networks Dima Shepelyansky (CNRS, Toulouse)

Transcription:

Link Analysis Information Retrieval and Data Mining Prof. Matteo Matteucci

Hyperlinks for Indexing and Ranking 2 Page A Hyperlink Page B Intuitions The anchor text might describe the target page B Anchor text indexing The hyperlink from A to B connotes a conferral of authority on page B, by the creator of page A Link based ranking

Anchor Text Indexing (1) 3 Issues of conventional inverted index search: Sometims page B does not provide a description of itself http://www.ibm.com page not contain computer Gap between how a page presents itself and how web users would describe it (e.g., Big Blue -> IBM) Many pages embed text in graphics and images, making HTML parsing inefficient Solution: Include anchor text terms in inverted indexing Weight anchor text terms based on frequency (to penalize words such as click or here )

Anchor Text Indexing (2) 4 Armonk, NY-based computer giant IBM announced today www.ibm.com Joe s computer hardware links Dell HP IBM Big Blue today announced record profits for the quarter

Link Based Ranking 5 Second generation search engine identify relevance topic-relatedness (boolean, vector, probabilistic, ) authoritativeness (link base ranking) The Web can be seen as a network of recommendations Link analysis / centrality indices used for 60+ years Sociology Psychology Bibliometrics Information Retrieval

Why Link Analysis? 6 The WWW can been seen as a (directed) graph: Vertices: web pages Edges: hyperlinks Goal: use of hyperlinks to index/rank Web search results The same can be done for other interlinked environments: Dictionaries Encyclopedias Scientific publications Social networks

Centrality Indeces 7 Several centrality indices exists: Spectral indices, based on linear algebra construction Path-based indices, based on the number of paths or shortest paths (geodesics) passing through a vertex Geometric indices, based on distances from a vertex to other vertices The center of a star is more important than the other nodes It has largest degree It is on the shortest paths It is maximally close to everybody

(In-)Degree Centrality (1) 8 Basics assumptions Any Web page is assigned a non-negative score A page score is derived from links made to that page from other web pages Links to a given page are called backlinks Web democracy: pages vote for the importance of other pages by linking to them (In-)degree centrality Count the number of incoming links (Or the nodes at distance 1) å x k = 1 i k

(In-)Degree Centrality (2) 9 1 3 2 4 x x1 2 x 2 1 x 3 3 x4 2 An important issue: Ignores the fact that a link to page j from an important page should boost page j s importance score more than a link from an unimportant one Indeed a link to your homepage from www.yahoo.com should boost your page s score more than a link from www.foobar.com

Betweenness Centrality (1971) 10 Count the fraction of shortest paths from i to j passing through node k s x k = ij (k) å i, j¹k s ij Often scaled dividing by (n-1)/(n-2) in directed graphs Let Consider Node 1 1 3 2 3 2 4 3 2 3 4 4 2 4 3 é ê ê x = ê ê ê ëê x 1 x 2 x 3 x 4 ù ú é ú ê ú = ê ú ê ú ê ûú ë 3/ 6 0 0 1/ 6 ù ú ú ú ú û 2 4

Kats Centrality (1953) 11 Count the number of paths of length t ending at node k Discount this number by α t + x k = åa t P k (t) = a t ( E t ) jk t=0 + å t=0 n å j=1 1 3 Lets Consider Node 1 n n x 1 = a å (E) j1 +a 2 å (E 2 ) j1 +¼ j=1 = 2a +5a 2 +¼ Computed efficiently by j=1 x = ((I -ae T ) -1 - I )1 é ê E = ê ê ê ë Number of paths of length 1 from node j to node 1 0 1 1 1 0 0 1 1 1 0 0 0 1 0 1 0 Column vector of n ones ù ú ú ú ú û é ê E 2 = ê ê ê ë 2 4 Number of paths of length 2 from node j to node 1 2 0 2 1 2 0 1 0 0 1 1 1 1 1 1 1 ù ú ú ú ú û

Closeness Centrality (1950) 12 Geometric-based: inverse of the sum of the distances from each node 1 x k = å d( j,k) The summation is over all nodes such that Lets consider Node 1 1 3 2 4 j¹k Harmonic Centrality (2000) includes infinity as well... 1 x 1 = 2 +1+1 = 1 4 1 x 2 = 1+ 2 + 2 = 1 5 1 x 3 = 1+1+1 = 1 3 1 x 4 = 1+1+ 2 = 1 4 d( j,k) <

Wei Index (1953) 13 Spectral-based: if a matrix represents whether a team defeated another team, we can define a general score by iteratively computing the sum of the scores of the teams that have been defeated In our context: If page j links to page k, we boost page k s score by x j x k = å jîl k x j L k is the set of page k s backlinks Issue: a single page gains influence just by linking lots of other pages

Seeley Index (1949) 14 Spectral-based: in a group of children, a child is as popular as the sum of the popularities of the children who like him, but popularities are divided evenly among friends In our context: If page j contains n j links, one of which to page k, we will boost page k s score by x j /n j instead than by x j x k jl k x n j j L k is the set of page k s backlinks n j is the number of page j s outgoing links Each page gets one vote, which is divided up among its outgoing links

Seeley Index (1949) 15 x x x x 1 2 3 4 x3 x4 1 2 x1 3 x1 x2 x4 3 2 2 x1 x2 3 2 Ax x 1 3 2 4 A A ij ij 0 1 n j 0 0 1 1/ 2 1/ 3 0 0 0 A 1/ 3 1/ 2 0 1/ 2 1/ 3 1/ 2 0 0 if there is no link between page j and i otherwise, with n j the number of outgoing links of page j Solve the eigensystem for the eigenvector corresponding to eigenvalue 1 The matrix A is the transpose of the (weighted) adjacency matrix of the Web graph

Seeley Index (1949) 16 Can be obtained by the power iteration method lim where x 0 is some initial column vector with non-zero entries Convergence is guaranteed for strongly connected graphs (i.e. if you can get from any page to any other page in a finite number of steps ) A surfer moves from one page to next randomly choosing one of the outgoing links The component x j of the normalized score vector x is the time the surfer spends, in the long run, on page j of the web More important pages tend to be linked to by many other pages and so the surfer hits those most often. x k k A x0 Steady state in Ergothic Markov Chain!!!

Seeley Example 17 Power iteration k=0 0.250 0.250 0.250 0.250

Seeley Example 18 Power iteration k=1 0.375 0.333 0.083 0.208

Seeley Example 19 Power iteration k=2 0.437 0.270 0.125 0.167

Seeley Example 20 Power iteration k=3 0.354 0.291 0.145 0.208

Seeley Example 21 Power iteration k=4 0.395 0.295 0.118 0.191

Seeley Example 22 Power iteration k 0.387 0.290 0.129 0.194

PageRank (1999) 23 Spectral-based: start from Seeley s equation and add an adjustment to make it have a unique solution x k = (1- m) å jîl k x j n j + m 1 n Enhanced Random surfer moves from one page to next If the surfer is currently at a page with r outgoing links, with probability 1-m he randomly chooses one of these links with probability m he jumps to any randomly selected page Equivalent to adding implicit links from each page to any other page; the resulting graph is strongly connected

Uniqueness of Ranking (1) 24 Power iterations may not always converge to a unique ranking when the graph is not strongly connected Example: 1 3 2 4 5 A 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1/ 2 0 0 1 0 1/ 2 0 0 0 0 0 Equation x = Ax is satisfied by any linear combination of: x x (1) (2) [1/ 2 1/ 2 0 0 0] T [0 0 1/ 2 1/ 2 0] T

Uniqueness of Ranking (2) 25 To fix this, at each step the random surfer can randomly jump to any page of the Web with some probability m being (1 m) a damping factor We can generate unique importance scores by modifying the matrix A as follows M (1 m) A ms where S R nxn, S i,j = 1/n 0m 1 The PageRank score is computed as x lim k k M x 0

Uniqueness of Ranking (3) 26 For example, setting m = 0.15 A 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1/ 2 0 0 1 0 1/ 2 0 0 0 0 0 M 0.03 0.88 0.03 0.03 0.03 0.88 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.88 0.455 0.03 0.03 0.88 0.03 0.455 0.03 0.03 0.03 0.03 0.03 The unique PageRank score is given by x 0.2 0.2 0.285 0.285 0.03 T Instead of x x (1) (2) [1/ 2 1/ 2 0 0 0] T [0 0 1/ 2 1/ 2 0] T

Dealing with Dangling Nodes (1) 31 Nodes without outgoing links, e. g., webpages with no links to other pages: image & music files pdf files pages whose links haven t been crawled protected pages Some estimates say more than 50%, others say between 60% and 80% A web with dangling nodes produces a matrix A which contains one or more columns of all zeros. Matrix A does not have an eigenvalue equal to 1

Dealing with Dangling Nodes (2) 32 Solution 1: Remove dangling nodes build matrix M, with m > 0 Compute the importance scores for the remaining nodes Reinsert dangling nodes, computing their importance scores based on incoming links only Solution 2 replace the all-zero columns of matrix A corresponding to dangling nodes with 1/n term equivalent to browsing to a new page at random when the user reaches a page with no outgoing links build matrix M, with m > 0 Compute the importance scores

PageRank Example (Solution 1) 33 Remove dangling nodes 3 6 1 2 5 0 1 0 0 0 0 0 0 0 0 0 0 0.5 0 0 0 0.5 0 A 0 0 0 0 0.5 0 0.5 0 1 0 0 1 0 0 0 0 0 0 4

PageRank Example (Solution 1) 34 Power iteration k = 0 0.20 0.20 0.20 0.20 0.20 0 1 0 0 0 0 0 0 0 0 0 0 0.5 0 0 0 0.5 0 A 0 0 0 0 0.5 0 0.5 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 A 0.5 0 0 1 0 0.5 0 1 0 1 0 0 0 0 0

PageRank Example (Solution 1) 35 Power iteration k = 1 (with dumping m = 0.15) 0.29 0.03 0.20 0.03 0.46

PageRank Example (Solution 1) 36 Power iteration k = 2 (with dumping m=0.15) 0.50 0.03 0.06 0.03 0.38

PageRank Example (Solution 1) 37 Power iteration k = 3 (with dumping m=0.15) 0.39 0.03 0.06 0.03 0.51

PageRank Example (Solution 1) 38 Power iteration k = 4 (with dumping m=0.15) 0.48 0.03 0.06 0.03 0.40

PageRank Example (Solution 1) 39 Power iteration k (with dumping m=0.15) 0.44 0.03 0.06 0.03 0.45

PageRank Example (Solution 1) 40 Add dangling node (x4 = x5 / 2) 0.44 0.03 0.06 0.03 0.45 0.23

References (some) 65 S. Chakrabarti, Mining the Web, Morgan Kaufman, 2003 J. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM, Vol. 46, Issue 5, pp. 604-632, 1999 S. Brin, L. Page The anatomy of a large-scale hypertextual Web search engine, Proceedings of the seventh international conference on World Wide Web 7: pp. 107-117, 1998 K. Bryan, T. Leise, The $25,000,000,000 eigenvector. The linear algebra behind Google, SIAM Review, Vol. 48, Issue 3, pp. 569-81. 2006. S. Vigna, A critical, historical and mathematical review of centrality scores, SSMS Summer School, Santorini, 2012