Intelligent Data Analysis. PageRank. School of Computer Science University of Birmingham

Similar documents
MONICA BIANCHINI, MARCO GORI, and FRANCO SCARSELLI University of Siena

IR: Information Retrieval

Link Analysis. Leonid E. Zhukov

Link Mining PageRank. From Stanford C246

Page rank computation HPC course project a.y

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

A Note on Google s PageRank

Graph Models The PageRank Algorithm

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

Uncertainty and Randomization

1998: enter Link Analysis

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

Monte Carlo methods in PageRank computation: When one iteration is sufficient

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

Link Analysis Ranking

Data Mining and Matrices

Online Social Networks and Media. Link Analysis and Web Search

CS47300: Web Information Search and Management

Data Mining Recitation Notes Week 3

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

0.1 Naive formulation of PageRank

Hyperlinked-Induced Topic Search (HITS) identifies. authorities as good content sources (~high indegree) HITS [Kleinberg 99] considers a web page

CS54701 Information Retrieval. Link Analysis. Luo Si. Department of Computer Science Purdue University. Borrowed Slides from Prof.

Slides based on those in:

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

Google Page Rank Project Linear Algebra Summer 2012

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Online Social Networks and Media. Link Analysis and Web Search

Introduction to Data Mining

Web Structure Mining Nodes, Links and Influence

Lecture 12: Link Analysis for Web Retrieval

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Calculating Web Page Authority Using the PageRank Algorithm

Node Centrality and Ranking on Networks

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Link Analysis. Stony Brook University CSE545, Fall 2016

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

The Static Absorbing Model for the Web a

The Second Eigenvalue of the Google Matrix

Mathematical Properties & Analysis of Google s PageRank

Spectral Graph Theory and You: Matrix Tree Theorem and Centrality Metrics

Inf 2B: Ranking Queries on the WWW

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds

STA141C: Big Data & High Performance Statistical Computing

The Dynamic Absorbing Model for the Web

The Google Markov Chain: convergence speed and eigenvalues

Analysis and Computation of Google s PageRank

THEORY OF SEARCH ENGINES. CONTENTS 1. INTRODUCTION 1 2. RANKING OF PAGES 2 3. TWO EXAMPLES 4 4. CONCLUSION 5 References 5

Link Analysis and Web Search

Analysis of Google s PageRank

arxiv:cond-mat/ v1 3 Sep 2004

eigenvalues, markov matrices, and the power method

PageRank algorithm Hubs and Authorities. Data mining. Web Data Mining PageRank, Hubs and Authorities. University of Szeged.

ECEN 689 Special Topics in Data Science for Communications Networks

CS6220: DATA MINING TECHNIQUES

Information Retrieval and Search. Web Linkage Mining. Miłosz Kadziński

Degree Distribution: The case of Citation Networks

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

On the mathematical background of Google PageRank algorithm

Computing PageRank using Power Extrapolation

1 Searching the World Wide Web

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Math 304 Handout: Linear algebra, graphs, and networks.

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Algebraic Representation of Networks

6.207/14.15: Networks Lectures 4, 5 & 6: Linear Dynamics, Markov Chains, Centralities

Randomization and Gossiping in Techno-Social Networks

Wiki Definition. Reputation Systems I. Outline. Introduction to Reputations. Yury Lifshits. HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

Bruce Hendrickson Discrete Algorithms & Math Dept. Sandia National Labs Albuquerque, New Mexico Also, CS Department, UNM

How does Google rank webpages?

IS4200/CS6200 Informa0on Retrieval. PageRank Con+nued. with slides from Hinrich Schütze and Chris6na Lioma

Applications to network analysis: Eigenvector centrality indices Lecture notes

Lesson Plan. AM 121: Introduction to Optimization Models and Methods. Lecture 17: Markov Chains. Yiling Chen SEAS. Stochastic process Markov Chains

Lecture 10. Lecturer: Aleksander Mądry Scribes: Mani Bastani Parizi and Christos Kalaitzis

Application. Stochastic Matrices and PageRank

Data and Algorithms of the Web

Math 443/543 Graph Theory Notes 5: Graphs as matrices, spectral graph theory, and PageRank

Introduction to Information Retrieval

CS6220: DATA MINING TECHNIQUES

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya

Applications. Nonnegative Matrices: Ranking

Updating PageRank. Amy Langville Carl Meyer

c 2005 Society for Industrial and Applied Mathematics

Links between Kleinberg s hubs and authorities, correspondence analysis, and Markov chains

Computational Economics and Finance

Combating Web Spam with TrustRank

NewPR-Combining TFIDF with Pagerank

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

Faloutsos, Tong ICDE, 2009

Pr[positive test virus] Pr[virus] Pr[positive test] = Pr[positive test] = Pr[positive test]

MATH36001 Perron Frobenius Theory 2015

Lecture 7 Mathematics behind Internet Search

Statistical Problem. . We may have an underlying evolving system. (new state) = f(old state, noise) Input data: series of observations X 1, X 2 X t

PV211: Introduction to Information Retrieval

Transcription:

Intelligent Data Analysis PageRank Peter Tiňo School of Computer Science University of Birmingham

Information Retrieval on the Web Most scoring methods on the Web have been derived in the context of Information Retrieval: Use numerical vector representations d j R T of documents d j D. Here, T is the number of terms. Apply some form of similarity measure in the vector space R T of document representations. But the Web is huge! For a given query, there may be unrealistically many well-matching documents. Cannot rely just on term-based similarity between documents! Peter Tiňo 1

Need to rank pages/documents In large hypertextual systems there is usually a strong topological interconnection structure. Ideas that have already been around for some time Citation indexes in scientific literature Self-evaluating groups - each member of a group evaluates all the other members of the group Idea: Rely on the democratic nature of the web! Rank a page based on how it is embedded in the interconnection structure of the Web, not based on its content. Peter Tiňo 2

PageRank Authority of a page p takes into an account: the number of incoming links (number of citations) authority of pages q that cite p with forward links selective citations from q are more valuable than flat (uniform) citations of a large number of pages. Note the self-referencing nature of page authority! Peter Tiňo 3

Formalising PageRank For a page p {1, 2,..., N} we define: pa(p) - set of pages pointing to p h p - number of hyperlinks from p (outdegree of p) dumping factor 0 < d < 1 - d is the proportion of authority coming from other pages, (1 d) is the authority given to p by default ( for free ) PageRank equation (Brin, Page 1998): x p = (1 d) + d q pa(p) x q h q Peter Tiňo 4

PageRank in Matrix form Stack all N page authorities x p into a column vector x = (x 1, x 2,..., x N ) T R N. Construct N N transition matrix W = (w i,j ): w i,j = 1/h j, if the is a hyperlink from j to i; w i,j = 0, otherwise. 1 N - column N-dimensional vector of 1 s. x = d Wx + (1 d) 1 N Peter Tiňo 5

PageRank - finding the solution System of N equations with N unknowns x p x = d Wx + (1 d) 1 N The system is contractive and hence its dynamical form x(t) = d Wx(t 1) + (1 d) 1 N converges (for any initial condition x(0)) to a unique fixed point x (the solution): x = d Wx + (1 d) 1 N Peter Tiňo 6

Problem with dangling pages Dangling pages - pages without hyperlinks. If p is a dangling page, then p-th column of transition matrix W is null. Each column corresponding to a non-dangling page sums to 1 (as in stochastic matrix). Because of dangling pages, the transition matrix W is not stochastic. We cannot apply all the nice results available for (browsing) processes governed by stochastic matrices :-( Peter Tiňo 7

Getting around the problem of dangling pages Idea: Introduce a dummy page r r has a link to itself every dangling page is made to point to r Can you think what this does to the transition matrix W? Hint: We will get an extended transition matrix W - Introduce a dangling page indicator vector r = (r 1, r 2,..., r N ), where r p = 1, if p is a dangling page, else r p = 0. - Stack r at the bottom of W. - To such row-extended matrix add an additional column of N 0 s followed by a single 1. Peter Tiňo 8

Another way of getting around the problem of dangling pages Idea: Make dangling pages point to all pages of the Web. Construct matrix V with N equal rows N 1 r: V = 1 N 1 N r Modified PageRank equation (Ng et al. 2001) x = d (W + V)x + 1 d N 1 N Peter Tiňo 9

Solution x of the above system is related to the solution x of the original PageRank equation as follows: x = x x 1 Recall that the L 1 norm x 1 of x is x 1 = x 1 + x 2 +... + x N Hence x can be thought of as representing probabilities of visiting nodes when surfing the web. Peter Tiňo 10

Think about stochastic interpretation of various forms of PageRank equations Hints: Each page is a node in a graph. Nodes are connected as described by the hyperlink structure. Connections from a node p are weighted by probabilities of their usage (given that we are currently in p). The surfer never stops navigating. At each time step the surfer may become bored with probability (1 d). When bored, the surfer jumps to any web page with uniform probability N 1. Peter Tiňo 11

Web communities Community - any subset of pages together with their hyperlink structure. Formally, it is a subgraph G of the Web graph. Energy E G of community G - sum of PageRank of all its pages. It is a measure of the community s authority. E G = p G x p Given a community G, we define 3 related communities: out(g) - (sub)community of pages in G that point outside G dp(g) - (sub)community of dangling pages in G into(g) - external pages (outside G) that point to G Peter Tiňo 12

Community energy decomposition G - number of pages in G. Larger communities tend to have higher authority. E into G - energy coming from outside G into G E out G E dp G - energy released from G to external pages - energy lost in dangling pages of G The energy of G decomposes as (Bianchini et al., 2005): E G = G + E into G Eout G Edp G Peter Tiňo 13

Community energy decomposition - cont d For any page p (inside or outside G), ρ p hyperlinks from p that point to G. is the fraction of all E into G = d 1 d p into(g) ρ p x p E out G = d 1 d p out(g) (1 ρ p )x p E dp G = d 1 d p dp(g) x p Peter Tiňo 14

Lessons learnt Community energy decomposition provides an insight into how the score migrates in the Web. In order to maximise energy of G, we need to care about references received from other communities pay attention to links pointing outside G minimise the number of dangling pages inside G Peter Tiňo 15