Graph Models The PageRank Algorithm


Anna-Karin Tornberg
Mathematical Models, Analysis and Simulation
Fall semester, 2013

The PageRank Algorithm
- Invented by Larry Page and Sergey Brin around 1998 and used in the prototype of Google's search engine.
- Estimates the popularity/importance of a webpage based on the interconnections of the web.
- Two basic assumptions:
  i) A page with more incoming links is more important than a page with fewer incoming links.
  ii) A page with a link from a page of high importance is also important.
- We have used incidence matrices to define the structure of a graph. In our earlier examples, two nodes were connected by a link in one direction only. Now, webpage 1 can link to webpage 2, while 2 also links to 1.
- From now on, "node" and "webpage" (or simply "page") are used interchangeably.

The first model - the bored surfer
- Imagine a bored surfer who clicks links in a random manner.
- If a page has a number of links, the bored surfer is equally likely to click on any of them.
- If there are no links from the current webpage, the surfer goes to another webpage at random.
- If the vector $x \in \mathbb{R}^N$ contains the probabilities that the surfer is at website $1, 2, \ldots, N$ at a certain instant, then we want to create a matrix $A$ such that $Ax$ contains the probabilities that the surfer is at website $1, 2, \ldots, N$ after one more step.

Defining the matrix
- Denote by $L(j)$ the number of links from page $j$.
- First, define the matrix entries $a_{ij}$, $i, j = 1, \ldots, N$:
  $$a_{ij} = \begin{cases} \dfrac{1}{L(j)} & \text{if there is a link from } j \text{ to } i, \\ 0 & \text{otherwise.} \end{cases}$$
- This encodes the assumption that all links from a page will be clicked with equal probability.
- If there are no links from a page, this however renders a column of zeros. Then set all values in that column to $1/N$, using the assumption that the surfer will pick a new page at random (i.e. all pages have equal probability).
- This yields
  $$a_{ij} = \begin{cases} \dfrac{1}{L(j)} & \text{if there is a link from } j \text{ to } i, \\ \dfrac{1}{N} & \text{if there are no links from } j, \\ 0 & \text{otherwise.} \end{cases}$$
- NOTES: EXAMPLE.
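A minimal sketch of how the matrix $A$ could be assembled from a link structure. The link_matrix helper and the 4-page web below are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def link_matrix(links, N):
    """Build the column-stochastic matrix A from a dict mapping
    page j to the list of pages that j links to (pages 0..N-1)."""
    A = np.zeros((N, N))
    for j in range(N):
        out = links.get(j, [])
        if out:                    # a_ij = 1/L(j) for each link j -> i
            for i in out:
                A[i, j] = 1.0 / len(out)
        else:                      # no links from j: whole column is 1/N
            A[:, j] = 1.0 / N
    return A

# Hypothetical 4-page web: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}, page 3 has no links.
A = link_matrix({0: [1, 2], 1: [2], 2: [0]}, N=4)
print(A.sum(axis=0))               # every column sums to 1
```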

The page rank
- If the vector $x \in \mathbb{R}^N$ contains the probabilities that the surfer is at website $1, 2, \ldots, N$ at a certain instant, then $Ax$ contains the probabilities that the surfer is at website $1, 2, \ldots, N$ after one more step.
- The page rank is given by the vector $x$ such that a multiplication by $A$ no longer changes the probabilities, i.e. $x = Ax$.
- This has a solution if the matrix $A$ has an eigenvalue $\lambda = 1$ with corresponding eigenvector $x$.
- Our matrix $A$ is a so-called column-stochastic matrix: all entries are non-negative, and the entries in each column sum to 1.
- The Perron-Frobenius theorem ensures that every stochastic matrix has an eigenvalue $\lambda = 1$, and that no other eigenvalue is larger in magnitude.
- Without further assumptions, it does however not guarantee that $\lambda = 1$ is a simple eigenvalue, and hence that $x$ is unique.

The Power Method
- The dominant eigenvalue of a matrix $A$ is the eigenvalue with the largest magnitude, and a dominant eigenvector is an eigenvector corresponding to this eigenvalue.
- Introduce the power iteration (with $x_0$ a unit vector),
  $$x_0, \quad x_1 = \frac{Ax_0}{\|Ax_0\|}, \quad x_2 = \frac{Ax_1}{\|Ax_1\|}, \quad \ldots, \quad x_k = \frac{Ax_{k-1}}{\|Ax_{k-1}\|}, \quad \ldots$$
  Then this sequence converges to a unit dominant eigenvector, under the assumptions
  i) $A$ has an eigenvalue that is strictly greater in magnitude than its other eigenvalues,
  ii) the starting vector $x_0$ has a non-zero component in the direction of an eigenvector associated with the dominant eigenvalue,
  and the sequence
  $$x_1^T A x_1, \quad x_2^T A x_2, \quad \ldots, \quad x_k^T A x_k, \quad \ldots$$
  converges to the dominant eigenvalue.
- WORKSHEET
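As a complement to the worksheet, a small sketch of the power iteration applied to the hypothetical 4-page matrix from the previous example (the power_method function is an assumption for illustration):

```python
import numpy as np

def power_method(A, tol=1e-12, max_iter=1000):
    """Power iteration: returns an approximate unit dominant eigenvector
    and the corresponding Rayleigh quotient x^T A x."""
    n = A.shape[0]
    x = np.ones(n) / np.sqrt(n)          # unit starting vector
    for _ in range(max_iter):
        y = A @ x
        y = y / np.linalg.norm(y)        # normalize each iterate
        if np.linalg.norm(y - x) < tol:
            return y, y @ (A @ y)
        x = y
    return x, x @ (A @ x)

# Column-stochastic matrix of the 4-page example above
# (page 3 has no outgoing links, hence the 1/N column):
A = np.array([[0.0, 0.0, 1.0, 0.25],
              [0.5, 0.0, 0.0, 0.25],
              [0.5, 1.0, 0.0, 0.25],
              [0.0, 0.0, 0.0, 0.25]])
x, lam = power_method(A)
print(lam, x / x.sum())                  # eigenvalue ~1; rescaled entries sum to 1
```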

What can fail?
- The PageRank algorithm is simply the power iteration $x_k = A^k x_0$, where $x_k$ converges to the PageRank vector $x$ as $k \to \infty$. All entries of $x$ are non-negative; the node whose entry has the largest value is ranked the most important, and so on. ($x_0$ must have non-negative entries.)
- In the homework, you are asked to show that for a column-stochastic matrix $A$,
  $$\sum_{i=1}^{N} (A^k x)_i = \sum_{i=1}^{N} x_i,$$
  i.e. if the entries of $x_0$ are scaled to sum to 1, so will the entries of $x_k$.
- Can the PageRank algorithm fail with the matrix $A$ as we have constructed it so far? [EXAMPLE].

Reducible graphs
- A graph is called irreducible if we can reach all nodes, independent of which node we start from.
- A graph that is not irreducible is called reducible.
- Example of a reducible graph: imagine two sets of nodes, C and D.
- Both set C and set D contain many nodes, and they all link to other nodes.
- Now assume that nodes in set C link to nodes in set D, but no node in set D links to a node in set C. That means that once we are at a node in set D, there is no possibility of reaching a node in set C by following the link structure.
- In the algorithm, the random restart, with 1/N as each column entry of A, will never occur, since every node has outgoing links. This means that if we are at a node in set D, we have zero probability of returning to set C.
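A small numerical illustration of this failure mode, using an assumed reducible 4-page graph with C = {0, 1} and D = {2, 3}: the iteration $x_k = A^k x_0$ drains all probability out of C.

```python
import numpy as np

# Reducible 4-page web: C = {0, 1} links into D = {2, 3}, but no node in D
# links back to C. Every page has outgoing links, so the 1/N "no-links"
# column never appears. Columns hold the 1/L(j) link probabilities.
A = np.array([[0.0, 0.5, 0.0, 0.0],   # 1 -> 0
              [0.5, 0.0, 0.0, 0.0],   # 0 -> 1
              [0.5, 0.0, 0.0, 1.0],   # 0 -> 2, 3 -> 2
              [0.0, 0.5, 1.0, 0.0]])  # 1 -> 3, 2 -> 3

x = np.full(4, 0.25)                  # start with probabilities summing to 1
for _ in range(200):
    x = A @ x                         # x_k = A^k x_0
print(x)                              # essentially all weight ends up on D = {2, 3}
```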

The Google matrix
- In order to calculate PageRanks for a reducible web graph, Page and Brin proposed to define the following matrix:
  $$G = \alpha A + \frac{1-\alpha}{N} \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{bmatrix},$$
  where $A$ is the matrix we have already defined, and the damping factor $\alpha$ has a default value of 0.85. This gives the surfer a $1-\alpha$ probability to jump randomly to any page.
- The matrix $G$ is still a column-stochastic matrix, but now the entries are not only non-negative, they are strictly positive.
- For such a matrix, the Perron-Frobenius theorem tells us that the eigenvalue $\lambda = 1$ is a simple eigenvalue (multiplicity 1), and that all other eigenvalues are of smaller magnitude.
- Hence, $x_k = G^k x_0$ will converge to a non-negative eigenvector $x$ as $k \to \infty$, which is unique up to normalization (given that $x_0$ is non-negative).
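A sketch of the resulting computation, with a hypothetical pagerank helper that forms G explicitly; this is fine for a toy example, while for a real web graph one would apply G to a vector without ever storing it as a dense matrix:

```python
import numpy as np

def pagerank(A, alpha=0.85, max_iter=200):
    """PageRank by repeated multiplication with the Google matrix
    G = alpha*A + (1-alpha)/N * (matrix of ones), for a given
    column-stochastic matrix A."""
    N = A.shape[0]
    G = alpha * A + (1.0 - alpha) / N * np.ones((N, N))
    x = np.full(N, 1.0 / N)           # entries sum to 1
    for _ in range(max_iter):
        x = G @ x                     # G is column stochastic, so sum(x) stays 1
    return x

# The reducible 4-page example from before: with damping, every page,
# including those in C = {0, 1}, now receives a strictly positive rank.
A = np.array([[0.0, 0.5, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 1.0],
              [0.0, 0.5, 1.0, 0.0]])
print(pagerank(A))
```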