IS4200/CS6200 Information Retrieval: PageRank Continued (with slides from Hinrich Schütze and Christina Lioma)

Exercise: Assumptions underlying PageRank
Assumption 1: A link on the web is a quality signal: the author of the link thinks that the linked-to page is high quality.
Assumption 2: The anchor text describes the content of the linked-to page.
Is assumption 1 true in general? Is assumption 2 true in general?

Google bombs
A Google bomb is a search with bad results due to maliciously manipulated anchor text.
Google introduced a new weighting function in January 2007 that fixed many Google bombs.
Still some remnants: [dangerous cult] on Google, Bing, Yahoo, due to coordinated link creation by those who dislike the Church of Scientology.
Defused Google bombs: [dumb motherf ], [who is a failure?], [evil empire]

Origins of PageRank: Citation analysis (1)
Citation analysis: the analysis of citations in the scientific literature.
Example citation: "Miller (2001) has shown that physical activity alters the metabolism of estrogens."
We can view "Miller (2001)" as a hyperlink linking two scientific articles.
One application of these hyperlinks in the scientific literature: measure the similarity of two articles by the overlap of other articles citing them. This is called cocitation similarity.
Cocitation similarity on the web: Google's "find pages like this" or "Similar" feature.

Origins of PageRank: Citation analysis (2)
Another application: citation frequency can be used to measure the impact of an article.
Simplest measure: each article gets one vote. Not very accurate.
On the web: citation frequency = inlink count.
A high inlink count does not necessarily mean high quality, mainly because of link spam.
Better measure: weighted citation frequency or citation rank. An article's vote is weighted according to its citation impact.
Circular? No: this can be formalized in a well-defined way.

Origins of PageRank: Citation analysis (3)
Better measure: weighted citation frequency or citation rank. This is basically PageRank.
PageRank was invented in the context of citation analysis, by Pinski and Narin in the 1970s.
Citation analysis is a big deal: the budget and salary of this lecturer are / will be determined by the impact of his publications!

Origins of PageRank: Summary
We can use the same formal representation for citations in the scientific literature and for hyperlinks on the web.
Appropriately weighted citation frequency is an excellent measure of quality, both for web pages and for scientific publications.
Next: the PageRank algorithm for computing weighted citation frequency on the web.

Model behind PageRank: Random walk
Imagine a web surfer doing a random walk on the web:
Start at a random page.
At each step, go out of the current page along one of the links on that page, equiprobably.
In the steady state, each page has a long-term visit rate. This long-term visit rate is the page's PageRank.
PageRank = long-term visit rate = steady-state probability.

Formalization of random walk: Markov chains
A Markov chain consists of N states, plus an N × N transition probability matrix P.
state = page
At each step, we are on exactly one of the pages.
For 1 ≤ i, j ≤ N, the matrix entry P_ij tells us the probability of j being the next page, given that we are currently on page i.
Clearly, for all i, ∑_{j=1}^{N} P_ij = 1.

Example web graph

Link matrix for example

        d_0  d_1  d_2  d_3  d_4  d_5  d_6
  d_0    0    0    1    0    0    0    0
  d_1    0    1    1    0    0    0    0
  d_2    1    0    1    1    0    0    0
  d_3    0    0    0    1    1    0    0
  d_4    0    0    0    0    0    0    1
  d_5    0    0    0    0    0    1    1
  d_6    0    0    0    1    1    0    1

Transition probability matrix P for example

        d_0   d_1   d_2   d_3   d_4   d_5   d_6
  d_0  0.00  0.00  1.00  0.00  0.00  0.00  0.00
  d_1  0.00  0.50  0.50  0.00  0.00  0.00  0.00
  d_2  0.33  0.00  0.33  0.33  0.00  0.00  0.00
  d_3  0.00  0.00  0.00  0.50  0.50  0.00  0.00
  d_4  0.00  0.00  0.00  0.00  0.00  0.00  1.00
  d_5  0.00  0.00  0.00  0.00  0.00  0.50  0.50
  d_6  0.00  0.00  0.00  0.33  0.33  0.00  0.33
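The transition matrix is just the link matrix with each row divided by its row sum: row i, column j of the link matrix is 1 if page d_i links to page d_j, and dividing by the outdegree turns each row into a probability distribution. A minimal NumPy sketch of this step, assuming (as in this example) that no row is all zeros; dead ends are handled by teleporting below:

    import numpy as np

    # Link matrix from the previous slide: entry [i, j] = 1 iff d_i links to d_j.
    L = np.array([[0, 0, 1, 0, 0, 0, 0],
                  [0, 1, 1, 0, 0, 0, 0],
                  [1, 0, 1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 1, 0, 0],
                  [0, 0, 0, 0, 0, 0, 1],
                  [0, 0, 0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1, 0, 1]], dtype=float)

    # Normalize each row by its outdegree so rows become probability distributions.
    P = L / L.sum(axis=1, keepdims=True)
    print(P.round(2))   # reproduces the matrix above (1/3 shown as 0.33)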

Long-term visit rate
Recall: PageRank = long-term visit rate.
The long-term visit rate of page d is the probability that a web surfer is at page d at a given point in time.
Next: what properties must hold of the web graph for the long-term visit rate to be well defined?
The web graph must correspond to an ergodic Markov chain.
First a special case: the web graph must not contain dead ends.

Dead ends
The web is full of dead ends (pages with no outlinks).
A random walk can get stuck in dead ends.
If there are dead ends, long-term visit rates are not well defined (or nonsensical).

Teleporting to get out of dead ends
At a dead end, jump to a random web page with probability 1/N.
At a non-dead end, with probability 10%, jump to a random web page (to each with probability 0.1/N).
With the remaining probability (90%), go out on a random hyperlink. For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225.
10% is a parameter, the teleportation rate.
Note: jumping from a dead end is independent of the teleportation rate.
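A sketch of teleporting as a matrix transformation, following the rules on this slide: dead-end rows become uniform (probability 1/N each), and every other row keeps its links with probability 1 − α and teleports with probability α. The function name teleport and the default α = 0.1 are illustrative choices, not from the lecture:

    import numpy as np

    def teleport(L, alpha=0.1):
        """Turn a 0/1 link matrix into a teleporting transition matrix."""
        N = L.shape[0]
        out = L.sum(axis=1, keepdims=True)        # outdegree of each page
        # Dead-end rows: jump to a random page with probability 1/N each.
        P = np.where(out == 0, 1.0 / N, L / np.maximum(out, 1.0))
        dead = (out == 0).flatten()
        # Other rows: follow a link with prob. 1 - alpha, teleport with alpha.
        P[~dead] = (1 - alpha) * P[~dead] + alpha / N
        return P

Note how the dead-end rows never involve α, matching the slide's remark that jumping from a dead end is independent of the teleportation rate.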

Result of teleporting
With teleporting, we cannot get stuck in a dead end.
But even without dead ends, a graph may not have well-defined long-term visit rates.
More generally, we require that the Markov chain be ergodic.

Ergodic Markov chains
A Markov chain is ergodic if it is irreducible and aperiodic.
Irreducibility. Roughly: there is a path from any page to any other page.
Aperiodicity. Roughly: the pages cannot be partitioned such that the random walker visits the partitions sequentially.
A non-ergodic Markov chain: (figure omitted)

Ergodic Markov chains
Theorem: For any ergodic Markov chain, there is a unique long-term visit rate for each state.
This is the steady-state probability distribution.
Over a long time period, we visit each state in proportion to this rate.
It doesn't matter where we start.
Teleporting makes the web graph ergodic.
The web graph + teleporting therefore has a steady-state probability distribution.
Each page in the web graph + teleporting has a PageRank.

Formalization of "visit": Probability vector
A probability (row) vector x = (x_1, ..., x_N) tells us where the random walk is at any point.
Example: x = (0 0 0 ... 1 ... 0 0 0), with the single 1 in position i: the walk is on page i with certainty.
More generally: the random walk is on page i with probability x_i.
Example: x = (0.05 0.01 0.0 ... 0.2 ... 0.01 0.05 0.03), with ∑_i x_i = 1.

Change in probability vector
If the probability vector is x = (x_1, ..., x_N) at this step, what is it at the next step?
Recall that row i of the transition probability matrix P tells us where we go next from state i.
So from x, the next state is distributed as xP.
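A one-line check of this update, using the 2-state chain from the power-method example later in this lecture (P_11 = 0.1, P_12 = 0.9, P_21 = 0.3, P_22 = 0.7):

    import numpy as np

    P = np.array([[0.1, 0.9],
                  [0.3, 0.7]])
    x = np.array([0.0, 1.0])   # the walk is on page d_2 with certainty

    print(x @ P)               # next-step distribution xP: [0.3 0.7]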

Steady state in vector notation
The steady state in vector notation is simply a vector π = (π_1, π_2, ..., π_N) of probabilities.
(We use π to distinguish it from the notation for the probability vector x.)
π_i is the long-term visit rate (or PageRank) of page i.
So we can think of PageRank as a very long vector, with one entry per page.

Steady-state distribution: Example
What is the PageRank / steady state in this example? Transition probabilities: P_11 = 0.25, P_12 = 0.75, P_21 = 0.25, P_22 = 0.75.

  t_0   x    = (0.25, 0.75)
  t_1   xP   = (0.25, 0.75)   (convergence)

Update rule: P_t(d_1) = P_{t-1}(d_1) · P_11 + P_{t-1}(d_2) · P_21, and P_t(d_2) = P_{t-1}(d_1) · P_12 + P_{t-1}(d_2) · P_22.

PageRank vector = π = (π_1, π_2) = (0.25, 0.75)
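The same answer falls out algebraically from π = πP together with the normalization π_1 + π_2 = 1, a constraint the slides use implicitly:

    π_1 = π_1 · P_11 + π_2 · P_21 = 0.25 · π_1 + 0.25 · π_2 = 0.25 · (π_1 + π_2) = 0.25
    π_2 = 1 − π_1 = 0.75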

How do we compute the steady-state vector?
In other words: how do we compute PageRank?
Recall: π = (π_1, π_2, ..., π_N) is the PageRank vector, the vector of steady-state probabilities, and if the distribution in this step is x, then the distribution in the next step is xP.
But π is the steady state! So: π = πP.
Solving this matrix equation gives us π.
π is the principal left eigenvector of P, that is, the left eigenvector with the largest eigenvalue.
All transition probability matrices have largest eigenvalue 1.
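Since a left eigenvector of P is a right eigenvector of P transposed, π can be computed with a standard eigendecomposition. A minimal NumPy sketch, using the 2-state chain from the steady-state example above:

    import numpy as np

    P = np.array([[0.25, 0.75],
                  [0.25, 0.75]])

    # Left eigenvectors of P are right eigenvectors of P.T.
    eigvals, eigvecs = np.linalg.eig(P.T)

    # Pick the eigenvector for eigenvalue 1 and normalize it to sum to 1.
    k = np.argmin(np.abs(eigvals - 1.0))
    pi = np.real(eigvecs[:, k])
    pi = pi / pi.sum()
    print(pi)   # [0.25 0.75]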

One way of computing the PageRank π
Start with any distribution x, e.g., the uniform distribution.
After one step, we're at xP.
After two steps, we're at xP^2.
After k steps, we're at xP^k.
Algorithm: multiply x by increasing powers of P until convergence.
This is called the power method.
Recall: regardless of where we start, we eventually reach the steady state π.
Thus: we will eventually (in asymptotia) reach the steady state.
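A sketch of the power method exactly as described, stopping when successive iterates stop changing; the tolerance and iteration cap are illustrative choices, not from the lecture:

    import numpy as np

    def power_method(P, tol=1e-10, max_iter=1000):
        """Iterate x <- xP from a uniform start until convergence to pi."""
        N = P.shape[0]
        x = np.full(N, 1.0 / N)       # any starting distribution works
        for _ in range(max_iter):
            x_next = x @ P            # one more power of P
            if np.abs(x_next - x).sum() < tol:
                break                 # converged to the steady state
            x = x_next
        return x_next

    P = np.array([[0.1, 0.9],
                  [0.3, 0.7]])        # the chain in the example below
    print(power_method(P))            # [0.25 0.75]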

Power method: Example
What is the PageRank / steady state in this example?

Computing PageRank: Power method example
Transition probabilities: P_11 = 0.1, P_12 = 0.9, P_21 = 0.3, P_22 = 0.7. Start at x = (0, 1).

  t_0   x      = (0, 1)
  t_1   xP     = (0.3, 0.7)
  t_2   xP^2   = (0.24, 0.76)
  t_3   xP^3   = (0.252, 0.748)
  t_4   xP^4   = (0.2496, 0.7504)
  ...
  t_∞   xP^∞   = (0.25, 0.75)

Update rule: P_t(d_1) = P_{t-1}(d_1) · P_11 + P_{t-1}(d_2) · P_21, and P_t(d_2) = P_{t-1}(d_1) · P_12 + P_{t-1}(d_2) · P_22.

PageRank vector = π = (π_1, π_2) = (0.25, 0.75)

Power method: Example
What is the PageRank / steady state in this example? The steady-state distribution (= the PageRanks) in this example is 0.25 for d_1 and 0.75 for d_2.

Exercise: Compute PageRank using the power method. (The graph is omitted from this transcript; its transition probabilities, from the solution below, are P_11 = 0.7, P_12 = 0.3, P_21 = 0.2, P_22 = 0.8.)

Solution
Transition probabilities: P_11 = 0.7, P_12 = 0.3, P_21 = 0.2, P_22 = 0.8. Start at x = (0, 1).

  t_0   x      = (0, 1)
  t_1   xP     = (0.2, 0.8)
  t_2   xP^2   = (0.3, 0.7)
  t_3   xP^3   = (0.35, 0.65)
  t_4   xP^4   = (0.375, 0.625)
  ...
  t_∞   xP^∞   = (0.4, 0.6)

Update rule: P_t(d_1) = P_{t-1}(d_1) · P_11 + P_{t-1}(d_2) · P_21, and P_t(d_2) = P_{t-1}(d_1) · P_12 + P_{t-1}(d_2) · P_22.

PageRank vector = π = (π_1, π_2) = (0.4, 0.6)
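As a quick check, the power_method sketch given earlier reproduces this fixed point:

    import numpy as np

    P = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
    print(power_method(P))   # [0.4 0.6], with power_method from the sketch above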

PageRank summary
Preprocessing:
Given the graph of links, build the matrix P.
Apply teleportation.
From the modified matrix, compute π. π_i is the PageRank of page i.
Query processing:
Retrieve the pages satisfying the query.
Rank them by their PageRank.
Return the reranked list to the user.
(A minimal sketch of this pipeline is given below.)
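Composing the earlier sketches (teleport and power_method as defined above) gives the preprocessing pipeline; the tiny link graph and the query result set here are illustrative stand-ins, not from the lecture:

    import numpy as np

    # Toy link graph: entry [i, j] = 1 iff page i links to page j.
    links = np.array([[0, 1, 1],
                      [1, 0, 0],
                      [0, 0, 0]], dtype=float)   # page 2 is a dead end

    P = teleport(links, alpha=0.1)   # build P and apply teleportation
    pi = power_method(P)             # steady-state / PageRank vector

    matching = [0, 2]                # pages satisfying the query (toy stand-in)
    ranked = sorted(matching, key=lambda i: pi[i], reverse=True)
    print(ranked)                    # matching pages, highest PageRank first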

PageRank issues
Real surfers are not random surfers. Examples of nonrandom surfing: the back button, short vs. long paths, bookmarks, directories, and search! The Markov model is not a good model of surfing, but it's good enough as a model for our purposes.
Simple PageRank ranking (as described on the previous slide) produces bad results for many pages. Consider the query [video service]: the Yahoo home page (i) has a very high PageRank and (ii) contains both "video" and "service". If we ranked all pages containing the query terms according to PageRank alone, the Yahoo home page would be top-ranked. Clearly not desirable.

How important is PageRank?
Frequent claim: PageRank is the most important component of web ranking.
The reality: there are several components that are at least as important, e.g., anchor text, phrases, term proximity, tiered indexes...
Rumor has it that PageRank in its original form (as presented here) now has a negligible impact on ranking.
However, variants of a page's PageRank are still an essential part of ranking.
Addressing link spam is difficult and crucial.