Link Analysis Ranking

How do search engines decide how to rank your query results?
- Guess why Google ranks the query results the way it does.
- How would you do it?

Naïve ranking of query results
- Given a query q, rank the web pages p in the index based on sim(p, q).
- In which scenarios is this not such a good idea?

Why Link Analysis?
- First-generation search engines
  - viewed documents as flat text files
  - could not cope with size, spamming, and user needs
  - Example: Honda website, keywords: automobile manufacturer
- Second-generation search engines
  - ranking becomes critical
  - use of Web-specific data: link analysis
  - shift from relevance to authoritativeness
  - a success story for network analysis

Link Analysis: Intuition
- A link from page p to page q denotes endorsement: page p considers page q an authority on a subject.
- Mine the web graph of recommendations.
- Assign an authority value to every page.

Link Analysis Ranking Algorithms
- Start with a collection of web pages.
- Extract the underlying hyperlink graph.
- Run the LAR algorithm on the graph.
- Output: an authority weight w for each node.

Algorithm input
- Query dependent: rank a small subset of pages related to a specific query. HITS (Kleinberg 98) was proposed as query dependent.
- Query independent: rank the whole Web. PageRank (Brin and Page 98) was proposed as query independent.

Query-dependent LAR
- Given a query q, find a subset of web pages S that are related to q.
- Rank the pages in S based on some ranking criterion.

Query-dependent input
[Figure: the Root Set, together with the pages that point into it (IN) and the pages it points to (OUT), forms the Base Set.]

Properties of a good seed set S
- S is relatively small.
- S is rich in relevant pages.
- S contains most (or many) of the strongest authorities.

How to construct a good seed set S
- For query q, first collect the t highest-ranked pages for q from a text-based search engine to form the set Γ.
- Set S = Γ.
- Add to S all the pages pointing to Γ.
- Add to S all the pages that pages from Γ point to.
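The construction above can be sketched in a few lines. This is only an illustration: the toy graph, the page names, and the helper `build_base_set` are hypothetical, and the root set stands in for whatever a text-based search engine would return.

```python
def build_base_set(root_set, out_links, in_links):
    """Expand the root set with pages pointing into it and pages it points to."""
    base = set(root_set)
    for page in root_set:
        base.update(in_links.get(page, ()))   # pages pointing to the root set
        base.update(out_links.get(page, ()))  # pages the root set points to
    return base

# Hypothetical toy web graph: page -> pages it links to.
out_links = {"a": ["b", "c"], "b": ["c"], "d": ["a"], "e": ["f"]}

# Invert the graph once so that in-links can be looked up directly.
in_links = {}
for src, targets in out_links.items():
    for dst in targets:
        in_links.setdefault(dst, []).append(src)

root_set = {"a"}  # stands in for the t highest-ranked pages for q
base_set = build_base_set(root_set, out_links, in_links)
print(sorted(base_set))
```

Pages "e" and "f" stay out of the base set because they neither point to the root set nor are pointed to by it.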

Link Filtering
- Navigational links serve the purpose of moving within a site (or to related sites):
  www.espn.com -> www.espn.com/nba
  www.yahoo.com -> www.yahoo.it
  www.espn.com -> www.msn.com
- Filter out navigational links: same domain name, same IP address.
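A rough sketch of the "same domain name" filter, assuming links are given as (source URL, target URL) pairs. The helpers `host` and `drop_navigational` are hypothetical names. Only exact host equality is checked here, so links to related sites on a different host (and the IP-address criterion) are not covered.

```python
from urllib.parse import urlsplit

def host(url):
    """Return the lowercase host name of a URL, tolerating a missing scheme."""
    if "//" not in url:
        url = "//" + url  # lets urlsplit recognize the network location
    return (urlsplit(url).hostname or "").lower()

def drop_navigational(links):
    """Keep only links whose source and target live on different hosts."""
    return [(src, dst) for src, dst in links if host(src) != host(dst)]

links = [
    ("www.espn.com", "www.espn.com/nba"),  # same host: filtered out
    ("www.espn.com", "www.msn.com"),       # different host: kept by this simple test
]
kept = drop_navigational(links)
print(kept)
```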

How do we rank the pages in seed set S?
- In-degree?
- Intuition
- Problems

Hubs and Authorities [K98]
- Authority is not necessarily transferred directly between authorities.
- Pages have a double identity: a hub identity and an authority identity.
- Good hubs point to good authorities.
- Good authorities are pointed to by good hubs.
[Figure: hubs on one side pointing to authorities on the other.]

HITS Algorithm
- Initialize all weights to 1.
- Repeat until convergence:
  - O operation: hubs collect the weight of the authorities, $h_i = \sum_{j:\, i \to j} a_j$
  - I operation: authorities collect the weight of the hubs, $a_i = \sum_{j:\, j \to i} h_j$
  - Normalize the weights under some norm.
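A minimal sketch of this iteration, assuming the graph is given as a list of (source, target) edges and that weights are normalized under the L2 norm. The function name `hits`, the fixed iteration count, and the toy edges are illustrative choices, not part of the slides.

```python
import math

def hits(edges, iterations=50):
    """Return (hub, authority) weight dictionaries for a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}  # initialize all weights to 1
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # O operation: each hub collects the weight of the authorities it points to.
        hub = {n: 0.0 for n in nodes}
        for i, j in edges:
            hub[i] += auth[j]
        # I operation: each authority collects the weight of the hubs pointing to it.
        auth = {n: 0.0 for n in nodes}
        for i, j in edges:
            auth[j] += hub[i]
        # Normalize both vectors to unit L2 length.
        for weights in (hub, auth):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for n in weights:
                weights[n] /= norm
    return hub, auth

edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph "a1" is pointed to by all three hubs and ends up as the strongest authority, while "h1", which points to both authorities, ends up as the strongest hub.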

HITS and eigenvectors
- The HITS algorithm is a power-method eigenvector computation.
- In vector terms, $a^t = A^T h^{t-1}$ and $h^t = A a^{t-1}$, so $a^t = A^T A\, a^{t-2}$ and $h^t = A A^T h^{t-2}$.
- The authority weight vector a is the principal eigenvector of $A^T A$, and the hub weight vector h is the principal eigenvector of $A A^T$.
- Why do we need normalization?
- The vectors a and h are singular vectors of the matrix A.
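The eigenvector claim can be checked numerically. The sketch below, on a hypothetical 3-page adjacency matrix, runs the power method on $A^T A$ and verifies that the result satisfies $A^T A\, a = \lambda a$. It also hints at why normalization is needed: without it, the entries would grow like $\lambda^t$.

```python
import math

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

# Hypothetical adjacency matrix: A[i][j] = 1 if page i links to page j.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
At = [list(col) for col in zip(*A)]  # transpose of A

a = [1.0, 1.0, 1.0]
for _ in range(100):
    a = matvec(At, matvec(A, a))             # a <- A^T A a
    norm = math.sqrt(sum(x * x for x in a))
    a = [x / norm for x in a]                # keep the vector at unit length

Ma = matvec(At, matvec(A, a))
lam = sum(x * y for x, y in zip(a, Ma))      # Rayleigh quotient (a has unit length)
residual = max(abs(y - lam * x) for x, y in zip(a, Ma))
print(lam, residual)
```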

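Singular Value Decomposition
$$A = U \Sigma V^T = [u_1\ u_2\ \cdots\ u_r]\ \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)\ [v_1\ v_2\ \cdots\ v_r]^T$$
with U of size [n × r], Σ of size [r × r], and V^T of size [r × n].
- r: rank of matrix A
- $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r$: singular values (square roots of the eigenvalues of $AA^T$, $A^TA$)
- $u_1, u_2, \ldots, u_r$: left singular vectors (eigenvectors of $AA^T$)
- $v_1, v_2, \ldots, v_r$: right singular vectors (eigenvectors of $A^TA$)
- $A = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots + \sigma_r u_r v_r^T$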

Singular Value Decomposition
- Linear trend v in matrix A: the tendency of the row vectors of A to align with vector v.
- Strength of the linear trend: $\|Av\|$
- SVD discovers the linear trends in the data.
- $u_i$, $v_i$: the i-th strongest linear trends
- $\sigma_i$: the strength of the i-th strongest linear trend
- HITS discovers the strongest linear trend in the authority space.
[Figure: data points with the two strongest trends $v_1$, $v_2$ and their strengths $\sigma_1$, $\sigma_2$.]

HITS and the TKC effect
- The HITS algorithm favors the most dense community of hubs and authorities: the Tightly Knit Community (TKC) effect.
- After n iterations, the weight of node p is proportional to the number of $(BF)^n$ paths that leave node p.
[Figure: a two-community example graph showing the node weights after each iteration, and after normalization with the max element as n → ∞.]
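The TKC effect can be reproduced on a small assumed example: two disconnected communities, one with 3 hubs fully linked to 3 authorities and one with 2 hubs fully linked to 3 authorities. The graph and the function `hits_authorities` are illustrative and are not necessarily the graph drawn on the slides. With max-element normalization, the denser community keeps weight 1 while the other decays geometrically, by a factor of 6/9 per iteration.

```python
def hits_authorities(edges, iterations):
    """Run HITS and return authority weights normalized by the max element."""
    nodes = {n for edge in edges for n in edge}
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        hub = dict.fromkeys(nodes, 0.0)
        for i, j in edges:
            hub[i] += auth[j]          # O operation
        auth = dict.fromkeys(nodes, 0.0)
        for i, j in edges:
            auth[j] += hub[i]          # I operation
        top = max(auth.values())
        auth = {n: w / top for n, w in auth.items()}
    return auth

# Dense community: 3 hubs x 3 authorities. Sparser community: 2 hubs x 3 authorities.
dense = [(f"h{i}", f"a{j}") for i in range(3) for j in range(3)]
sparse = [(f"g{i}", f"b{j}") for i in range(2) for j in range(3)]

auth = hits_authorities(dense + sparse, iterations=60)
print(auth["a0"], auth["b0"])
```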

Query-independent LAR
- Have an a priori ordering π of the web pages.
- Q: the set of pages that contain the keywords in the query q.
- Present the pages in Q ordered according to π.
- What are the advantages of such an approach?

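InDegree algorithm
- Rank pages according to in-degree: $w_i = |B(i)|$, where B(i) is the set of pages that link to page i.
- Ranking in the example: 1. Red Page, 2. Yellow Page, 3. Blue Page, 4. Purple Page, 5. Green Page
[Figure: example graph with the in-degree weight w of each page.]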
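A short sketch of in-degree ranking on a hypothetical edge list. Ties are broken alphabetically, which is an arbitrary choice made here for reproducibility.

```python
from collections import Counter

def indegree_rank(edges):
    """Return the nodes sorted by decreasing in-degree (ties broken by name)."""
    indegree = Counter(dst for _, dst in edges)
    nodes = {n for edge in edges for n in edge}
    return sorted(nodes, key=lambda n: (-indegree[n], n))

edges = [("a", "c"), ("b", "c"), ("a", "b"), ("d", "c")]
ranking = indegree_rank(edges)
print(ranking)
```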

PageRank algorithm [BP98]
- Good authorities should be pointed to by good authorities.
- Random walk on the web graph:
  - pick a page at random
  - with probability 1 - α, jump to a random page
  - with probability α, follow a random outgoing link
- Rank according to the stationary distribution:
  $PR(p) = \alpha \sum_{q \to p} \frac{PR(q)}{|F(q)|} + (1 - \alpha)\,\frac{1}{n}$, where F(q) is the set of pages that q links to.
- Ranking in the example: 1. Red Page, 2. Purple Page, 3. Yellow Page, 4. Blue Page, 5. Green Page

Markov chains
- A Markov chain describes a discrete-time stochastic process over a set of states $S = \{s_1, s_2, \ldots, s_n\}$ according to a transition probability matrix $P = \{P_{ij}\}$.
- $P_{ij}$ = probability of moving to state j when at state i
- $\sum_j P_{ij} = 1$ (stochastic matrix)
- Memorylessness property: the next state of the chain depends only on the current state and not on the past of the process (first-order MC). Higher-order MCs are also possible.

Random walks
- Random walks on graphs correspond to Markov chains.
- The set of states S is the set of nodes of the graph G.
- The transition probability matrix holds, for each pair of nodes, the probability that we follow an edge from one node to the other.

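An example
[Figure: a directed graph on the nodes $v_1, \ldots, v_5$, with its adjacency matrix A and the corresponding transition matrix P.]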

State probability vector
- The vector $q^t = (q^t_1, q^t_2, \ldots, q^t_n)$ stores the probability of being at state i at time t.
- $q^0_i$ = the probability of starting from state i
- $q^t = q^{t-1} P$

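An example
For the transition matrix P of the example graph on $v_1, \ldots, v_5$, the update $q^{t+1} = q^t P$ reads:
$q^{t+1}_1 = \tfrac{1}{3} q^t_4 + \tfrac{1}{2} q^t_5$
$q^{t+1}_2 = \tfrac{1}{2} q^t_1 + q^t_3 + \tfrac{1}{3} q^t_4$
$q^{t+1}_3 = \tfrac{1}{2} q^t_1 + \tfrac{1}{3} q^t_4$
$q^{t+1}_4 = \tfrac{1}{2} q^t_5$
$q^{t+1}_5 = q^t_2$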

Stationary distribution
- A stationary distribution for a MC with transition matrix P is a probability distribution π such that π = πP.
- A MC has a unique stationary distribution if
  - it is irreducible: the underlying graph is strongly connected
  - it is aperiodic: for random walks, the underlying graph is not bipartite
- The probability $\pi_i$ is the fraction of times that we visited state i as t → ∞.
- The stationary distribution is an eigenvector of matrix P: the principal left eigenvector of P. Stochastic matrices have maximum eigenvalue 1.

Computing the stationary distribution
- The Power Method
  - Initialize to some distribution $q^0$.
  - Iteratively compute $q^t = q^{t-1} P$.
  - After enough iterations, $q^t \approx \pi$.
  - It is called the power method because it computes $q^t = q^0 P^t$.
- Why does it converge? It follows from the fact that any vector can be written as a linear combination of the eigenvectors: $q^0 = v_1 + c_2 v_2 + \cdots + c_n v_n$.
- The rate of convergence is determined by $\lambda_2^t$.
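A minimal sketch of the power method for a row-stochastic matrix. The 3-state chain below is a made-up example (irreducible and aperiodic) whose stationary distribution is (0.25, 0.5, 0.25). The stopping rule based on the L1 change between iterates is an assumption of this sketch.

```python
def step(q, P):
    """One update q <- q P for a row vector q and a row-stochastic matrix P."""
    n = len(P)
    return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]

def power_method(P, q0, eps=1e-12, max_iter=10_000):
    """Iterate q^t = q^{t-1} P until the L1 change drops below eps."""
    q = q0
    for _ in range(max_iter):
        nxt = step(q, P)
        if sum(abs(a - b) for a, b in zip(nxt, q)) < eps:
            return nxt
        q = nxt
    return q

# Hypothetical chain: each row sums to 1.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]

pi = power_method(P, q0=[1.0, 0.0, 0.0])
print(pi)
```

The second-largest eigenvalue of this P is 0.5, so the error roughly halves with every iteration.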

The PageRank random walk
- Vanilla random walk: make the adjacency matrix stochastic and run a random walk.
[Figure: the transition matrix P of the example graph.]

The PageRank random walk
- What about sink nodes? What happens when the random walk moves to a node without any outgoing links?
[Figure: the transition matrix P of the example graph, with an all-zero row for the sink node.]

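The PageRank random walk
- Replace the all-zero rows of the sink nodes with a vector v (typically, the uniform vector):
  $P' = P + d v^T$, where $d_i = 1$ if i is a sink and $d_i = 0$ otherwise.
[Figure: the resulting matrix P' for the example graph.]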

The PageRank random walk
- How do we guarantee irreducibility?
- Add a random jump to vector v with probability 1 - α (typically, to a uniform vector):
  $P'' = \alpha P' + (1 - \alpha)\, u v^T$, where u is the vector of all 1s.
[Figure: the resulting matrix P'' for the example graph.]
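The two fixes (sink rows replaced by v, then a random jump mixed in) can be sketched as a single dense construction. The function name `google_matrix`, the value α = 0.85, the uniform v, and the 3-node graph are assumptions made for illustration.

```python
def google_matrix(A, alpha=0.85):
    """Build P'' = alpha * P' + (1 - alpha) * u v^T from a 0/1 adjacency matrix A."""
    n = len(A)
    v = [1.0 / n] * n                      # uniform jump vector
    rows = []
    for row in A:
        out_degree = sum(row)
        if out_degree == 0:
            p_row = v[:]                   # sink: the zero row is replaced by v (P')
        else:
            p_row = [x / out_degree for x in row]   # row-normalized adjacency (P)
        # Mix in the random jump: every row of u v^T equals v.
        rows.append([alpha * p + (1 - alpha) * vj for p, vj in zip(p_row, v)])
    return rows

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, and node 2 is a sink.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
G = google_matrix(A)
for row in G:
    print([round(x, 4) for x in row])
```

Every entry of the result is positive, so the chain is irreducible and aperiodic. The price is that the matrix is now dense.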

Effects of random jump
- Guarantees irreducibility.
- Motivated by the concept of the random surfer.
- Offers additional flexibility: personalization, anti-spam.
- Controls the rate of convergence: the second eigenvalue of matrix P'' is α.

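A PageRank algorithm
- Performing the vanilla power method is now too expensive: the matrix P'' is not sparse.
- Iteration:
  $q^0 = v$, t = 1
  repeat
    $q^t = (P'')^T q^{t-1}$
    $\delta = \|q^t - q^{t-1}\|$
    t = t + 1
  until δ < ε
- Efficient computation of $y = (P'')^T x$:
    $y = \alpha P^T x$
    $\beta = \|x\|_1 - \|y\|_1$
    $y = y + \beta v$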
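A sketch of the efficient iteration, assuming the graph is stored as a dictionary of out-link lists so that only existing links are touched. The function name `pagerank`, the defaults (α = 0.85, uniform v), and the toy graph are assumptions. The scalar β collects the probability mass that left through random jumps and sink nodes and redistributes it according to v.

```python
def pagerank(out_links, alpha=0.85, eps=1e-10, max_iter=1000):
    """Power iteration for PageRank using only the sparse link structure."""
    nodes = sorted(set(out_links) | {d for ts in out_links.values() for d in ts})
    v = {u: 1.0 / len(nodes) for u in nodes}   # uniform jump vector
    x = dict(v)                                # q^0 = v
    for _ in range(max_iter):
        # y = alpha * P^T x, computed link by link.
        y = dict.fromkeys(nodes, 0.0)
        for u, targets in out_links.items():
            if targets:
                share = alpha * x[u] / len(targets)
                for d in targets:
                    y[d] += share
        # beta = ||x||_1 - ||y||_1: mass lost to random jumps and sinks.
        beta = sum(x.values()) - sum(y.values())
        for u in nodes:
            y[u] += beta * v[u]
        delta = sum(abs(y[u] - x[u]) for u in nodes)
        x = y
        if delta < eps:
            break
    return x

# Hypothetical graph: a -> b, a -> c, b -> c, and c is a sink.
pr = pagerank({"a": ["b", "c"], "b": ["c"], "c": []})
print({u: round(w, 4) for u, w in pr.items()})
```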

Random walks on undirected graphs
- In the stationary distribution of a random walk on an undirected graph, the probability of being at node i is proportional to the (weighted) degree of the vertex.
- Random walks on undirected graphs are not interesting.
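The degree claim is easy to verify on a small example: take π proportional to the degrees and check that one step of the walk leaves it unchanged. The 4-node graph below is hypothetical.

```python
# Hypothetical undirected graph on 4 nodes (contains a triangle, so it is not bipartite).
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
n = 4

adj = [[0] * n for _ in range(n)]
for u, w in edges:
    adj[u][w] = adj[w][u] = 1

degree = [sum(row) for row in adj]
pi = [d / sum(degree) for d in degree]          # candidate: pi_i proportional to degree

# Transition matrix of the random walk: move to a uniformly chosen neighbor.
P = [[adj[i][j] / degree[i] for j in range(n)] for i in range(n)]

pi_next = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
print(pi, pi_next)
```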

Research on PageRank
- Specialized PageRank
  - personalization [BP98]: instead of picking a node uniformly at random, favor specific nodes that are related to the user
  - topic-sensitive PageRank [H]: compute many PageRank vectors, one for each topic; estimate the relevance of the query to each topic; produce the final PageRank as a weighted combination
- Updating PageRank [Chien et al.]
- Fast computation of PageRank
  - numerical analysis tricks
  - node aggregation techniques
  - dealing with the Web frontier

Previous work
- The problem of identifying the most important nodes in a network has been studied before in social networks and bibliometrics.
- The idea is similar:
  - A link from node p to node q denotes endorsement.
  - Mine the network at hand.
  - Assign a centrality/importance/standing value to every node.

Social network analysis
- Evaluate the centrality of individuals in social networks.
- Degree centrality: the (weighted) degree of a node.
- Distance centrality: based on the (weighted) distance of a node to the rest of the graph,
  $D_c(v) = \frac{1}{\sum_{u \ne v} d(v,u)}$
- Betweenness centrality: the average number of (weighted) shortest paths that use node v,
  $B_c(v) = \sum_{s \ne v \ne t} \frac{\sigma_{st}(v)}{\sigma_{st}}$
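A small sketch of the first two measures for an unweighted, connected, undirected graph with at least two nodes. Distance centrality is computed here as the reciprocal of the total shortest-path distance to all other nodes, which is one common convention and an assumption of this sketch. The path graph and the function names are illustrative. Betweenness is omitted to keep the example short.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path (hop) distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def degree_centrality(adj):
    return {v: len(neighbors) for v, neighbors in adj.items()}

def distance_centrality(adj):
    """1 / (sum of distances from v to all other nodes); larger means more central."""
    return {v: 1.0 / sum(bfs_distances(adj, v).values()) for v in adj}

# Hypothetical path graph 0 - 1 - 2 - 3 - 4 as adjacency lists.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
deg = degree_centrality(adj)
closeness = distance_centrality(adj)
print(deg, closeness)
```

The middle node 2 has the smallest total distance to the others (6), so it gets the highest distance centrality, even though nodes 1, 2, and 3 all have the same degree.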

Counting paths [Katz 53]
- The importance of a node is measured by the weighted sum of paths that lead to this node.
- $A^m[i,j]$ = number of paths of length m from i to j
- Compute $P = bA + b^2A^2 + \cdots + b^mA^m + \cdots = (I - bA)^{-1} - I$
- The series converges when $b < 1/\lambda_1(A)$.
- Rank nodes according to the column sums of the matrix P.
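A sketch that computes the column sums of P by truncating the series instead of inverting $I - bA$. Truncation is a simplification chosen here, not necessarily how the original is computed. The function name `katz_scores`, the value b = 0.5, and the toy graph are assumptions. The row vector r holds $b^m \mathbf{1}^T A^m$, so adding it up over m gives the column sums.

```python
def katz_scores(A, b, terms=200):
    """Column sums of b*A + b^2*A^2 + ... truncated after `terms` powers."""
    n = len(A)
    r = [1.0] * n           # r = 1^T, then r <- b * r A at every step
    scores = [0.0] * n
    for _ in range(terms):
        r = [b * sum(r[i] * A[i][j] for i in range(n)) for j in range(n)]
        scores = [s + x for s, x in zip(scores, r)]
    return scores

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
scores = katz_scores(A, b=0.5)
print(scores)  # [0.0, 0.5, 1.25]
```

Node 2 is reached by two paths of length 1 (weight 0.5 each) and one path of length 2 (weight 0.25), giving 1.25. Node 1 is reached by a single path of length 1.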

Bibliometrics
- Impact factor (E. Garfield 72): counts the number of citations received for papers of the journal in the previous two years.
- Pinsky-Narin 76: perform a random walk on the set of journals; $P_{ij}$ = the fraction of citations from journal i that are directed to journal j.