PageRank. Ryan Tibshirani, 36-462/36-662: Data Mining. January 24, 2012. Optional reading: ESL 14.10



Information retrieval with the web

Last time we learned about information retrieval: how to compute similarity scores (distances) of documents to a given query string. But what if the documents are webpages, and our collection is the whole web (or a big chunk of it)? Now we have two problems:

- The techniques from the last lecture (normalization, IDF weighting) are computationally infeasible at this scale. There are about 30 billion webpages!
- Some webpages should be assigned more priority than others, for being more important.

Fortunately, there is an underlying structure that we can use to solve both of these problems: the links between webpages.

Web search before Google

(Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

PageRank algorithm

The PageRank algorithm was famously invented by Larry Page and Sergey Brin, founders of Google. The algorithm assigns a PageRank (a score, or a measure of importance) to each webpage.

Suppose we have webpages numbered 1, ..., n. The PageRank of webpage i is based on its linking webpages (the webpages j that link to i). But we don't just count the number of linking webpages; that is, not all linking webpages are treated equally. Instead, we weight the links from different webpages, following two ideas:

- Webpages that link to i, and have high PageRank scores themselves, should be given more weight.
- Webpages that link to i, but link to a lot of other webpages in general, should be given less weight.

Note that the first idea is circular! (But that's OK.)

BrokenRank (almost PageRank) definition

Let L_{ij} = 1 if webpage j links to webpage i (written j → i), and L_{ij} = 0 otherwise. Also let m_j = Σ_{k=1}^n L_{kj}, the total number of webpages that j links to.

We're going to define something that's almost PageRank, but not quite, because it's broken. The BrokenRank p_i of webpage i is defined as

    p_i = Σ_{j → i} p_j / m_j = Σ_{j=1}^n (L_{ij} / m_j) p_j.

Does this match our ideas from the last slide? Yes: for j → i, the weight is p_j / m_j. This increases with p_j, but decreases with m_j.

BrokenRank in matrix notation

Written in matrix notation,

    p = (p_1, p_2, ..., p_n)^T,

    L = [ L_11 L_12 ... L_1n
          L_21 L_22 ... L_2n
          ...
          L_n1 L_n2 ... L_nn ],

    M = diag(m_1, m_2, ..., m_n) = [ m_1  0  ...  0
                                      0  m_2 ...  0
                                     ...
                                      0   0  ... m_n ].

Dimensions: p is n × 1, L and M are n × n.

We can now re-express the definition on the previous slide. The BrokenRank vector p is defined as p = LM^{-1} p.
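As a concrete sketch of these matrices (using a small made-up three-page graph, not an example from the slides), L, M, and A = LM^{-1} can be built directly from the link structure:

```python
# Hypothetical 3-page web: page 0 links to 1 and 2, page 1 links to 2,
# and page 2 links to 0. links[j] = list of pages that page j links to.
links = {0: [1, 2], 1: [2], 2: [0]}
n = 3

# L[i][j] = 1 if webpage j links to webpage i (j -> i)
L = [[0] * n for _ in range(n)]
for j, targets in links.items():
    for i in targets:
        L[i][j] = 1

# m[j] = sum_k L[k][j], the number of webpages that j links to
m = [sum(L[k][j] for k in range(n)) for j in range(n)]

# A = L M^{-1} just rescales column j of L by 1/m[j]
A = [[L[i][j] / m[j] for j in range(n)] for i in range(n)]
```

Each column of A sums to 1, which is what makes the Markov-chain interpretation on the coming slides work.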

Eigenvalues and eigenvectors

Let A = LM^{-1}. Then p = Ap. This means that p is an eigenvector of the matrix A with eigenvalue 1. Great! We know how to compute the eigenvalues and eigenvectors of A, and there are even methods for doing this quickly when A is large and sparse. (Why is our A sparse?)

But wait... how do we know that A has an eigenvalue of 1, so that such a vector p exists? And even if it does exist, will it be unique (well-defined)? To investigate these questions, it helps to interpret BrokenRank in terms of a Markov chain.

BrokenRank as a Markov chain

You can think of a Markov chain as a random process that moves between states numbered 1, ..., n (each step of the process is one move). Recall that for a Markov chain with an n × n transition matrix P, we have P(go from i to j) = P_{ij}.

Suppose p^{(0)} is an n-dimensional vector giving the probability of being in each state to begin with. After one step, the probability of being in each state is given by p^{(1)} = P^T p^{(0)}. (Why?)

Now consider a Markov chain with the states as webpages, and with transition matrix A^T. Note that (A^T)_{ij} = A_{ji} = L_{ji} / m_i, so we can describe the chain as

    P(go from i to j) = 1/m_i if i → j, and 0 otherwise.

(Check: does this make sense?) This is like a random surfer, i.e., a person surfing the web by clicking on links uniformly at random.

Stationary distribution

A stationary distribution of our Markov chain is a probability vector p (i.e., its entries are ≥ 0 and sum to 1) with p = Ap. This means that the distribution after one step of the Markov chain is unchanged. Note that this is exactly what we're looking for: an eigenvector of A corresponding to eigenvalue 1!

If the Markov chain is strongly connected, meaning that any state can be reached from any other state, then the stationary distribution p exists and is unique. Furthermore, we can think of the stationary distribution as the proportion of visits the chain pays to each state after a very long time (the ergodic theorem):

    p_i = lim_{t → ∞} (# of visits to state i in t steps) / t.

Our interpretation: the BrokenRank p_i is the proportion of time our random surfer spends on webpage i if we let him go forever.
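The ergodic interpretation can be checked by simulation, again on the made-up three-page graph 0 → {1, 2}, 1 → {2}, 2 → {0} (not from the slides), whose stationary distribution works out to (0.4, 0.2, 0.4):

```python
import random

# Hypothetical 3-page graph: links[j] = pages that page j links to
links = {0: [1, 2], 1: [2], 2: [0]}

random.seed(0)
state, steps = 0, 200_000
visits = [0, 0, 0]
for _ in range(steps):
    # the random surfer clicks one of the current page's links at random
    state = random.choice(links[state])
    visits[state] += 1

# long-run proportions of visits approximate the stationary distribution
props = [v / steps for v in visits]
```

Here `props` should come out near (0.4, 0.2, 0.4), matching the eigenvector computation.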

Why is BrokenRank broken?

There's a problem here. Our Markov chain (a random surfer on the web graph) is not strongly connected, in (at least) three cases:

- Disconnected components
- Dangling links
- Loops

Actually, even for Markov chains that are not strongly connected, a stationary distribution always exists, but it may be nonunique. In other words, the BrokenRank vector p exists but is ambiguously defined.

BrokenRank example

Here

    A = LM^{-1} = [ 0 0 1 0 0
                    1 0 0 0 0
                    0 1 0 0 0
                    0 0 0 0 1
                    0 0 0 1 0 ]

(Check: does this match both definitions?). Here there are two eigenvectors of A with eigenvalue 1:

    p = (1/3, 1/3, 1/3, 0, 0)^T  and  p = (0, 0, 0, 1/2, 1/2)^T.

These are totally opposite rankings!
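The ambiguity is easy to verify numerically; a minimal check (a sketch, not from the slides) that both vectors are fixed points of A:

```python
# The 5-page BrokenRank matrix A = L M^{-1} from the example above
A = [
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]

def matvec(A, p):
    """Compute the matrix-vector product A p."""
    return [sum(A[i][j] * p[j] for j in range(len(p))) for i in range(len(A))]

p1 = [1/3, 1/3, 1/3, 0, 0]
p2 = [0, 0, 0, 1/2, 1/2]

# Both are probability vectors with A p = p, so BrokenRank is ill-defined here
# (and so is any convex combination of p1 and p2).
```

Since every convex combination of p1 and p2 is also stationary, there are in fact infinitely many candidate rankings for this graph.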

PageRank definition

PageRank is given by a small modification of BrokenRank:

    p_i = (1-d)/n + d Σ_{j → i} p_j / m_j = (1-d)/n + d Σ_{j=1}^n (L_{ij} / m_j) p_j,

where 0 < d < 1 is a constant (apparently Google uses d = 0.85). In matrix notation, this is

    p = ( ((1-d)/n) E + d LM^{-1} ) p,

where E is the n × n matrix of 1s, subject to the constraint Σ_{i=1}^n p_i = 1.

(Check: are these definitions the same? Show that the second definition gives the first. Hint: if e is the n-vector of all 1s, then E = ee^T, and e^T p = 1.)
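Working out the check above takes one short derivation: using E = ee^T and the constraint e^T p = 1,

```latex
p = \left(\frac{1-d}{n}E + dLM^{-1}\right)p
  = \frac{1-d}{n}\,e\,(e^T p) + dLM^{-1}p
  = \frac{1-d}{n}\,e + dLM^{-1}p,
```

and reading off component i gives p_i = (1-d)/n + d Σ_{j=1}^n (L_{ij} / m_j) p_j, which is exactly the first definition.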

PageRank as a Markov chain

Let A = ((1-d)/n) E + d LM^{-1}, and consider as before a Markov chain with transition matrix A^T. Then (A^T)_{ij} = A_{ji} = (1-d)/n + d L_{ji} / m_i, so the chain can be described as

    P(go from i to j) = (1-d)/n + d/m_i if i → j, and (1-d)/n otherwise.

(Check: does this make sense?) The chain moves through a link with probability (1-d)/n + d/m_i, and with probability (1-d)/n it jumps to an unlinked webpage. Hence this is like a random surfer with random jumps.

Fortunately, the random jumps get rid of our problems: our Markov chain is now strongly connected. Therefore the stationary distribution (i.e., the PageRank vector) p is unique.

PageRank example

With d = 0.85,

    A = ((1-d)/n) E + d LM^{-1}

      = (0.15/5) [ 1 1 1 1 1          [ 0 0 1 0 0
                   1 1 1 1 1            1 0 0 0 0
                   1 1 1 1 1   + 0.85   0 1 0 0 0
                   1 1 1 1 1            0 0 0 0 1
                   1 1 1 1 1 ]          0 0 0 1 0 ]

      = [ 0.03 0.03 0.88 0.03 0.03
          0.88 0.03 0.03 0.03 0.03
          0.03 0.88 0.03 0.03 0.03
          0.03 0.03 0.03 0.03 0.88
          0.03 0.03 0.03 0.88 0.03 ].

Now there is only one eigenvector of A with eigenvalue 1:

    p = (0.2, 0.2, 0.2, 0.2, 0.2)^T.
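A quick numerical check of this example (a sketch, not from the slides): build A from the BrokenRank matrix and verify that the uniform vector is fixed.

```python
d, n = 0.85, 5
B = [  # the BrokenRank matrix L M^{-1} from the earlier example
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]

# A = ((1-d)/n) E + d B: every entry gets 0.03, linked entries get 0.88
A = [[(1 - d) / n + d * B[i][j] for j in range(n)] for i in range(n)]

p = [1 / n] * n  # the uniform probability vector (0.2, ..., 0.2)
Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
```

Every entry of `Ap` comes back as 0.2 (up to floating-point error), confirming p = Ap.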

Computing the PageRank vector

We could compute the PageRank vector p via traditional methods, i.e., an eigendecomposition, but this takes on the order of n^3 operations. When n = 3 × 10^10, n^3 = 2.7 × 10^31. Yikes! (But a bigger concern would be memory...)

Fortunately, there is a much faster way to compute the eigenvector of A with eigenvalue 1. If we begin with any initial distribution p^{(0)}, and compute

    p^{(1)} = Ap^{(0)},  p^{(2)} = Ap^{(1)},  ...,  p^{(t)} = Ap^{(t-1)},

then p^{(t)} → p as t → ∞. In practice, we just repeatedly multiply by A until there isn't much change between iterations.
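The iteration above is just the power method. A minimal pure-Python sketch (the three-page graph used for the demo is made up, not from the slides):

```python
def power_iterate(A, tol=1e-12, max_iter=1000):
    """Repeatedly apply p <- A p until the change is below tol."""
    n = len(A)
    p = [1 / n] * n  # any initial distribution works
    for _ in range(max_iter):
        new = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
    return p

# Hypothetical 3-page graph 0 -> {1, 2}, 1 -> {2}, 2 -> {0}, with d = 0.85
d, n = 0.85, 3
B = [[0, 0, 1], [0.5, 0, 0], [0.5, 1, 0]]  # BrokenRank matrix L M^{-1}
A = [[(1 - d) / n + d * B[i][j] for j in range(n)] for i in range(n)]
p = power_iterate(A)
```

The result sums to 1 and puts the most mass on page 2, which is linked to by both other pages.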

Computation, continued

There are still important questions remaining about computing the PageRank vector p (with the algorithm presented on the last slide):

1. How can we perform each iteration quickly (multiply by A quickly)?
2. How many iterations does it take (generally) to get a reasonable answer?

At a very broad level, the answers are:

1. Use the sparsity of the web graph. (How?)
2. Not very many, if A has a large spectral gap (difference between its first and second largest absolute eigenvalues); here the largest is 1, and the second largest is d.

(PageRank in R: see the function page.rank in the package igraph.)
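One answer to the "How?" in point 1: each iteration never needs the dense matrix A, because one sweep over the adjacency lists computes Ap in O(#links) time. A sketch under the assumption that every page has at least one outlink (dangling pages would need separate handling); the three-page graph is the same made-up example as before:

```python
def pagerank_sparse(links, n, d=0.85, tol=1e-12, max_iter=1000):
    """Power iteration using only adjacency lists; A is never formed.
    Assumes every page j in `links` has at least one outlink."""
    p = [1 / n] * n
    for _ in range(max_iter):
        new = [(1 - d) / n] * n          # the "random jump" part
        for j, targets in links.items():
            share = d * p[j] / len(targets)
            for i in targets:            # page j passes d*p[j] along its links
                new[i] += share
        if max(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
    return p

# Same hypothetical 3-page graph as before: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
p = pagerank_sparse({0: [1, 2], 1: [2], 2: [0]}, n=3)
```

On the real web graph, each page links to only a handful of others, so this sweep touches a tiny fraction of the n^2 entries a dense multiply would.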

A basic web search

For a basic web search, given a query, we could do the following:

1. Compute the PageRank vector p once. (Google recomputes this from time to time, to stay current.)
2. Find the documents containing all words in the query.
3. Sort these documents by PageRank, and return the top k (e.g., k = 50).

This is a little too simple... but we can use the similarity scores learned last time, changing the last step to:

3. Sort these documents by PageRank, and keep only the top K (e.g., K = 5000).
4. Sort by similarity to the query (e.g., normalized, IDF-weighted distance), and return the top k (e.g., k = 50).

Google uses a combination of PageRank, similarity scores, and other techniques (it's proprietary!).
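The steps above can be sketched as one function (all the names here — `docs`, `pagerank`, `similarity` — are hypothetical stand-ins, not from the slides; `similarity` plays the role of last lecture's query-similarity score):

```python
def basic_search(query_words, docs, pagerank, similarity, K=5000, k=50):
    """docs: doc id -> set of words; pagerank: doc id -> PageRank score;
    similarity: doc id -> similarity-to-query score (hypothetical inputs)."""
    # Step 2: documents containing all words in the query
    hits = [doc for doc, words in docs.items() if set(query_words) <= words]
    # Step 3: keep only the top K by PageRank
    hits = sorted(hits, key=lambda doc: pagerank[doc], reverse=True)[:K]
    # Step 4: return the top k by similarity to the query
    return sorted(hits, key=lambda doc: similarity[doc], reverse=True)[:k]

# Toy example
docs = {"a": {"page", "rank"}, "b": {"page"}, "c": {"page", "rank", "web"}}
pagerank = {"a": 0.5, "b": 0.3, "c": 0.2}
similarity = {"a": 0.1, "b": 0.4, "c": 0.9}
result = basic_search(["page", "rank"], docs, pagerank, similarity)
```

Note the design: PageRank acts as a query-independent prefilter (computed once), and the query-dependent similarity score only has to rank the surviving K documents.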

Variants/extensions of PageRank

A precursor to PageRank:

- Hubs and authorities: using link structure to determine "hubs" and "authorities"; a similar algorithm was used by Ask.com. (Kleinberg (1997), Authoritative Sources in a Hyperlinked Environment.)

Following its discovery, there has been a huge amount of work to improve/extend PageRank, and not only at Google! There are many, many academic papers too; here are a few:

- Intelligent surfing: pointing the surfer towards webpages that are textually relevant. (Richardson and Domingos (2002), The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank.)
- TrustRank: pointing the surfer away from spam. (Gyongyi et al. (2004), Combating Web Spam with TrustRank.)
- PigeonRank: pigeons, the real reason for Google's success. (http://www.google.com/onceuponatime/technology/pigeonrank.html.)

Encouragement

(I don't mean: "Google found the 25 billion dollar eigenvector.") Beautiful math tends to be useful, and useful things tend to have beautiful math... Statistics is often where it comes together.

Recap: PageRank

PageRank is a ranking for webpages based on their importance. For a given webpage, its PageRank is based on the webpages that link to it. It helps if these linking webpages have high PageRank themselves; it hurts if these linking webpages also link to a lot of other webpages.

We defined it by modifying a simpler ranking system (BrokenRank) that didn't quite work. The PageRank vector p corresponds to the eigenvector of a particular matrix A with eigenvalue 1. It can also be explained in terms of a Markov chain, interpreted as a random surfer with random jumps. These jumps were crucial, because they made the chain strongly connected, and guaranteed that the PageRank vector (stationary distribution) p is unique.

We can compute p by repeatedly multiplying by A. PageRank can be combined with similarity scores for a basic web search.

Next time: clustering

E.g., apples with apples and oranges with oranges! (If only it were that easy...)