INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda, Georgia Tech. Lecture for CS 6505.

1 MARKOV CHAIN BASICS
2 ERGODICITY
3 WHAT IS THE STATIONARY DISTRIBUTION?
4 PAGERANK
5 MIXING TIME
6 PREVIEW OF FURTHER TOPICS

What is a Markov chain? Example: Life in CS 6210, discrete time t = 0, 1, 2, ...

[State diagram with four states: Listen to Kishore, Check Email, StarCraft, Sleep. Transitions: Listen → Listen (.5), Listen → Email (.5); Email → Listen (.2), Email → StarCraft (.5), Email → Sleep (.3); StarCraft → Email (.3), StarCraft → StarCraft (.7); Sleep → Listen (.7), Sleep → Sleep (.3).]

Each vertex is a state of the Markov chain, so the states form a directed graph, possibly with self-loops. Edge weights represent transition probabilities: they are non-negative, and the weights of the outgoing edges of each vertex sum to 1.

Transition matrix. In general: N states Ω = {1, 2, ..., N} and an N × N transition matrix P, where

    P(i, j) = weight of edge i → j = Pr(going from i to j).

For the earlier example (states: 1 = Listen, 2 = Email, 3 = StarCraft, 4 = Sleep):

    P = [ .5  .5   0   0 ]
        [ .2   0  .5  .3 ]
        [  0  .3  .7   0 ]
        [ .7   0   0  .3 ]

P is a stochastic matrix: each row sums to 1.
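
As a quick sanity check, the matrix and the row-sum condition are easy to state in code. A minimal NumPy sketch (the 0-indexed state order is an implementation choice):

```python
import numpy as np

# States: 0 = Listen, 1 = Email, 2 = StarCraft, 3 = Sleep.
P = np.array([[.5, .5, 0, 0],
              [.2, 0, .5, .3],
              [0, .3, .7, 0],
              [.7, 0, 0, .3]])

# Stochastic matrix: non-negative entries, each row sums to 1.
assert (P >= 0).all() and np.allclose(P.sum(axis=1), 1)
```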

One-step transitions. Time: t = 0, 1, 2, .... Let X_t denote the state at time t; X_t is a random variable. For states k and j,

    Pr(X_1 = j | X_0 = k) = P(k, j).

In general, for t ≥ 1: given in state k_0 at time 0, in k_1 at time 1, ..., in k_{t-1} at time t-1, what is the probability of being in state j at time t?

    Pr(X_t = j | X_0 = k_0, X_1 = k_1, ..., X_{t-1} = k_{t-1}) = Pr(X_t = j | X_{t-1} = k_{t-1}) = P(k_{t-1}, j).

The process is memoryless: only the current state matters, previous states do not. This is known as the Markov property, hence the term Markov chain.
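
The Markov property also makes simulation trivial: each step samples the next state from the row of P for the current state, ignoring all earlier history. A minimal sketch, reusing the P above:

```python
import numpy as np

P = np.array([[.5, .5, 0, 0], [.2, 0, .5, .3],
              [0, .3, .7, 0], [.7, 0, 0, .3]])
rng = np.random.default_rng(0)

def simulate(P, x0, steps):
    """Sample a trajectory X_0, ..., X_steps of the chain."""
    path = [x0]
    for _ in range(steps):
        # The next state depends only on the current one (memorylessness).
        path.append(rng.choice(len(P), p=P[path[-1]]))
    return path

print(simulate(P, x0=1, steps=10))   # a random trajectory starting from Email
```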

2-step transitions. What is the probability of Listen at time 2 given Email at time 0? Try all possibilities for the state at time 1:

    Pr(X_2 = Listen | X_0 = Email)
      = Pr(X_2 = Listen | X_1 = Listen) Pr(X_1 = Listen | X_0 = Email)
      + Pr(X_2 = Listen | X_1 = Email) Pr(X_1 = Email | X_0 = Email)
      + Pr(X_2 = Listen | X_1 = StarCraft) Pr(X_1 = StarCraft | X_0 = Email)
      + Pr(X_2 = Listen | X_1 = Sleep) Pr(X_1 = Sleep | X_0 = Email)
      = (.5)(.2) + 0 + 0 + (.7)(.3) = .31

In matrix form (states: 1 = Listen, 2 = Email, 3 = StarCraft, 4 = Sleep):

    P = [ .5  .5   0   0 ]        P^2 = [ .35  .25  .25  .15 ]
        [ .2   0  .5  .3 ]              [ .31  .25  .35  .09 ]
        [  0  .3  .7   0 ]              [ .06  .21  .64  .09 ]
        [ .7   0   0  .3 ]              [ .56  .35   0   .09 ]

and indeed Pr(X_2 = Listen | X_0 = Email) = P^2(2, 1) = .31.

k-step transitions. 2-step transition probabilities: use P^2. In general, for states i and j:

    Pr(X_{t+2} = j | X_t = i) = Σ_{k=1}^{N} Pr(X_{t+2} = j | X_{t+1} = k) Pr(X_{t+1} = k | X_t = i)
                              = Σ_k P(i, k) P(k, j)
                              = P^2(i, j).

l-step transition probabilities: use P^l. For states i and j and integer l ≥ 1,

    Pr(X_{t+l} = j | X_t = i) = P^l(i, j).
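
A quick numeric check of the 2-step computation, using matrix powers (a minimal sketch with the P above):

```python
import numpy as np

P = np.array([[.5, .5, 0, 0], [.2, 0, .5, .3],
              [0, .3, .7, 0], [.7, 0, 0, .3]])

P2 = P @ P
print(P2[1, 0])   # Pr(X_2 = Listen | X_0 = Email) ~ 0.31
print(np.linalg.matrix_power(P, 10))   # 10-step transition probabilities
```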

Random Initial State. Suppose the state at time 0 is not fixed but is chosen from a probability distribution μ_0. Notation: X_0 ~ μ_0. What is the distribution of X_1? For state j,

    Pr(X_1 = j) = Σ_{i=1}^{N} Pr(X_0 = i) Pr(X_1 = j | X_0 = i) = Σ_i μ_0(i) P(i, j) = (μ_0 P)(j).

So X_1 ~ μ_1 where μ_1 = μ_0 P. And X_t ~ μ_t where μ_t = μ_0 P^t.

Back to the CS 6210 example: big t? Let's look again at our CS 6210 example:

    P = [ .5  .5   0   0 ]        P^2 = [ .35  .25  .25  .15 ]
        [ .2   0  .5  .3 ]              [ .31  .25  .35  .09 ]
        [  0  .3  .7   0 ]              [ .06  .21  .64  .09 ]
        [ .7   0   0  .3 ]              [ .56  .35   0   .09 ]

    P^10 = [ .247770  .244781  .402267  .105181 ]
           [ .245167  .244349  .405688  .104796 ]
           [ .239532  .243413  .413093  .103963 ]
           [ .251635  .245423  .397189  .105754 ]

    P^20 = [ .244190  .244187  .406971  .104652 ]
           [ .244187  .244186  .406975  .104651 ]
           [ .244181  .244185  .406984  .104650 ]
           [ .244195  .244188  .406966  .104652 ]

Every row is converging to π = [.244186, .244186, .406977, .104651].
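
These powers are easy to reproduce; a minimal sketch:

```python
import numpy as np

P = np.array([[.5, .5, 0, 0], [.2, 0, .5, .3],
              [0, .3, .7, 0], [.7, 0, 0, .3]])

for t in (2, 10, 20):
    print(f"P^{t}:")
    print(np.linalg.matrix_power(P, t).round(6))
# Every row of P^20 is close to [.244186, .244186, .406977, .104651].
```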

Limiting Distribution. For big t,

    P^t ≈ [ .244186  .244186  .406977  .104651 ]
          [ .244186  .244186  .406977  .104651 ]
          [ .244186  .244186  .406977  .104651 ]
          [ .244186  .244186  .406977  .104651 ]

Regardless of where it starts (X_0), for big t:

    Pr(X_t = 1) = .244186
    Pr(X_t = 2) = .244186
    Pr(X_t = 3) = .406977
    Pr(X_t = 4) = .104651

Let π = [.244186, .244186, .406977, .104651]. In other words, for big t, X_t ~ π. π is called a stationary distribution.

Once we reach π we stay at π: if X_t ~ π then X_{t+1} ~ π; in other words, πP = π. Any distribution π where πP = π is called a stationary distribution of the Markov chain.
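
Since π satisfies πP = π, it can also be computed directly as a left eigenvector of P with eigenvalue 1, rather than by powering P. A minimal sketch (using np.linalg.eig is one of several ways to do this):

```python
import numpy as np

P = np.array([[.5, .5, 0, 0], [.2, 0, .5, .3],
              [0, .3, .7, 0], [.7, 0, 0, .3]])

# pi P = pi means pi is a left eigenvector of P with eigenvalue 1,
# i.e. a right eigenvector of P.T.
vals, vecs = np.linalg.eig(P.T)
v = np.real(vecs[:, np.isclose(vals, 1)].ravel())
pi = v / v.sum()          # normalize so the entries sum to 1
print(pi)                 # ~ [.244186, .244186, .406977, .104651]
assert np.allclose(pi @ P, pi)   # stationarity: pi P = pi
```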

Stationary Distributions. Key questions:
- When is there a stationary distribution?
- If there is at least one, is it unique?
- Assuming there is a unique stationary distribution: do we always reach it? What is it?

Mixing time = time to reach the unique stationary distribution.

Algorithmic goal: if we have a distribution π that we want to sample from, can we design a Markov chain that has:
- unique stationary distribution π,
- convergence to π from every X_0,
- fast mixing time?

2 ERGODICITY

Irreducibility. We want a unique stationary distribution π, reached from every starting state X_0. But if there are multiple strongly connected components (SCCs), then the chain cannot get from one to the other.

[Figure: a Markov chain on states 1-6 whose graph has more than one strongly connected component; states 1 and 5 lie in different components.]

Starting at 1 reaches a different distribution than starting at 5.

State i communicates with state j if, starting at i, we can reach j: there exists t with P^t(i, j) > 0. A Markov chain is irreducible if all pairs of states communicate.

Periodicity. Example of a bipartite Markov chain:

[Figure: a 4-state bipartite Markov chain; every transition crosses between the two sides, with states 1 and 3 on opposite sides.]

Starting at 1 reaches a different distribution than starting at 3. So we also need to rule out periodicity.

Aperiodic.

[Figure: a 4-state chain in which every return to state 1 is made up of cycles of length 3 or 5.]

The return times of state i are R_i = {t : P^t(i, i) > 0}. In the above example, R_1 = {3, 5, 6, 8, 9, ...}. Let r = gcd(R_i) be the period of state i.

If P is irreducible then all states have the same period. If r = 2 then the Markov chain is bipartite. A Markov chain is aperiodic if r = 1.

Ergodic: Unique Stationary Distribution. Ergodic = irreducible and aperiodic.

Fundamental Theorem for Markov Chains: an ergodic Markov chain has a unique stationary distribution π, and for every initial distribution μ_0 (X_0 ~ μ_0):

    lim_{t→∞} μ_t = π.

In other words, for big enough t, all rows of P^t are (approximately) π. How big does t need to be? And what is π?

3 WHAT IS THE STATIONARY DISTRIBUTION?

Determining π: Symmetric Markov Chain. P is symmetric if P(i, j) = P(j, i) for all pairs i, j. Then π is uniformly distributed over the states {1, ..., N}: π(j) = 1/N for all states j.

Proof: We verify that πP = π for this π, i.e., that (πP)(j) = π(j) for all states j:

    (πP)(j) = Σ_{i=1}^{N} π(i) P(i, j)
            = (1/N) Σ_i P(i, j)
            = (1/N) Σ_i P(j, i)    (since P is symmetric)
            = 1/N                  (since rows of P always sum to 1)
            = π(j).

Determining π: Reversible Markov Chain. P is reversible with respect to π if π(i)P(i, j) = π(j)P(j, i) for all pairs i, j. If we can find such a π, then it is the stationary distribution.

Proof: Similar to the symmetric case; we check that (πP)(j) = π(j) for all states j:

    (πP)(j) = Σ_{i=1}^{N} π(i) P(i, j)
            = Σ_i π(j) P(j, i)    (by reversibility)
            = π(j) Σ_i P(j, i)
            = π(j).

Some Examples. Random walk on a d-regular, connected, undirected graph G: what is π? The walk is symmetric: for each edge (i, j), P(i, j) = P(j, i) = 1/d. So π is uniform: π(i) = 1/n.

Random walk on a general connected undirected graph G: what is π? Consider π(i) = d(i)/Z, where d(i) = degree of vertex i and Z = Σ_{j∈V} d(j). (Note Z = 2m = 2|E|.) Check that the walk is reversible with respect to this π:

    π(i) P(i, j) = (d(i)/Z)(1/d(i)) = 1/Z = π(j) P(j, i).

What if G is a directed graph? Then the walk may not be reversible, and if it is not reversible we usually cannot determine the stationary distribution, since typically N is HUGE.
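
To illustrate the degree-proportional stationary distribution, here is a minimal sketch on a small undirected graph (the particular graph is an arbitrary choice for illustration):

```python
import numpy as np

# Adjacency matrix of a connected undirected graph on 4 vertices:
# edges {0,1}, {1,2}, {1,3}, {2,3}; degrees d = (1, 3, 2, 2).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]])

d = A.sum(axis=1)
P = A / d[:, None]        # random walk: from i, move to a uniform neighbor
pi = d / d.sum()          # claimed stationary distribution: d(i) / 2|E|
assert np.allclose(pi @ P, pi)

# Detailed balance pi(i) P(i,j) = pi(j) P(j,i): both sides equal A(i,j)/2|E|.
F = pi[:, None] * P
assert np.allclose(F, F.T)
```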

4 PAGERANK

PageRank. PageRank is an algorithm devised by Brin and Page (1998) to determine the importance of webpages.

Webgraph: V = webpages, E = directed edges for hyperlinks.

Notation: for a page x ∈ V, let
    Out(x) = {y : x→y ∈ E} = pages that x links to,
    In(x) = {w : w→x ∈ E} = pages that link to x.

Let π(x) = rank of page x. We are trying to define π(x) in a sensible way.

First Ranking Idea. First idea for ranking pages: as with academic papers, use citation counts. Here, a citation = a link to a page. So set π(x) = |In(x)| = number of links to x.

Refining the Ranking Idea. What if one webpage has 500 links, one of which is to Eric's page, while another webpage has only 5 links, one of which is to Santosh's page. Which link is more valuable?

Academic papers: if a paper cites 50 other papers, then each reference gets 1/50 of a citation.

Webpages: if a page y has |Out(y)| outgoing links, then each linked page gets 1/|Out(y)|. New solution:

    π(x) = Σ_{y ∈ In(x)} 1/|Out(y)|.

Further Refining the Ranking Idea. Previous: π(x) = Σ_{y ∈ In(x)} 1/|Out(y)|. But if Eric's children's webpage has a link to Eric's page and CNN has a link to Santosh's page, which link is more important?

Solution: define π(x) recursively. Page y has importance π(y), and a link from y passes on π(y)/|Out(y)| of a citation:

    π(x) = Σ_{y ∈ In(x)} π(y)/|Out(y)|.

Random Walk. Importance of page x:

    π(x) = Σ_{y ∈ In(x)} π(y)/|Out(y)|.

This is a recursive definition of π; how do we find it? Look at the random walk on the webgraph G = (V, E): from a page y ∈ V, choose a random outgoing link and follow it. This is a Markov chain where, for y→x ∈ E,

    P(y, x) = 1/|Out(y)|.

What is the stationary distribution of this Markov chain?

We need π where π = πP. Thus,

    π(x) = Σ_{y∈V} π(y) P(y, x) = Σ_{y ∈ In(x)} π(y)/|Out(y)|.

This is identical to the definition of the importance vector π. Summary: the stationary distribution of the random walk on the webgraph gives the importance π(x) of page x.

Random Walk on the Webgraph. Is π the only stationary distribution? In other words, is the Markov chain ergodic? We would need G to be strongly connected, and it probably is not. Also, some pages have no outgoing links... then hit the random button!

Solution to make the chain ergodic: introduce a damping factor α with 0 < α < 1 (in practice, apparently α ≈ .85). From page y:
- with probability α, follow a random outgoing link of page y;
- with probability 1 - α, go to a completely random page (uniformly chosen from all pages V).

Random Surfer. Let N = |V| denote the number of webpages. Transition matrix of the new Random Surfer chain:

    P(y, x) = (1 - α)/N                    if y→x ∉ E
    P(y, x) = (1 - α)/N + α/|Out(y)|       if y→x ∈ E

This new Random Surfer Markov chain is ergodic. Thus, its unique stationary distribution is the desired π. How do we find π? Take last week's π and compute πP^t for big t. What is a big enough t?
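
A minimal power-iteration sketch of the Random Surfer chain on a toy webgraph. The graph and the iteration count are illustrative assumptions, and this sketch gives dangling pages no special handling, so every page in the toy graph has at least one outgoing link:

```python
import numpy as np

def pagerank(out_links, alpha=0.85, iters=100):
    """Power iteration for the Random Surfer chain.

    out_links[y] = list of pages y links to; every page must have
    at least one outgoing link in this simple sketch.
    """
    n = len(out_links)
    # Random Surfer transitions:
    # P(y, x) = (1 - alpha)/n, plus alpha/|Out(y)| if y -> x is an edge.
    P = np.full((n, n), (1 - alpha) / n)
    for y, links in enumerate(out_links):
        for x in links:
            P[y, x] += alpha / len(links)
    pi = np.full(n, 1.0 / n)     # start from the uniform distribution
    for _ in range(iters):
        pi = pi @ P              # mu_{t+1} = mu_t P
    return pi

# Toy 4-page webgraph: 0 -> 1,2; 1 -> 2; 2 -> 0; 3 -> 2.
print(pagerank([[1, 2], [2], [0], [2]]))
```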

5 MIXING TIME

Mixing Time. How fast does an ergodic Markov chain reach its unique stationary distribution π? We need a way to measure distance from π; we use total variation distance. For distributions μ and ν on a set Ω:

    d_TV(μ, ν) = (1/2) Σ_{x∈Ω} |μ(x) - ν(x)|.

Example: Ω = {1, 2, 3, 4}, μ is uniform (μ(1) = μ(2) = μ(3) = μ(4) = .25), and ν has ν(1) = .5, ν(2) = .1, ν(3) = .15, ν(4) = .25. Then

    d_TV(μ, ν) = (1/2)(.25 + .15 + .1 + 0) = .25.
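
The example is a one-liner to verify; a minimal sketch:

```python
import numpy as np

def tv_distance(mu, nu):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * np.abs(np.asarray(mu) - np.asarray(nu)).sum()

print(tv_distance([.25, .25, .25, .25], [.5, .1, .15, .25]))  # 0.25
```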

Mixing Time. Consider an ergodic Markov chain with state space Ω, transition matrix P, and unique stationary distribution π. For a state x ∈ Ω, the time to mix from x is

    T(x) = min{t : d_TV(P^t(x, ·), π) ≤ 1/4},

and the mixing time is T_mix = max_x T(x). Summarizing in words: the mixing time is the time to get within distance 1/4 of π from the worst initial state X_0.

The choice of the constant 1/4 is somewhat arbitrary: we can get within distance ε in time O(T_mix log(1/ε)).

Mixing Time of Random Surfer. Coupling proof: consider two copies (X_t) and (Y_t) of the Random Surfer chain. Choose Y_0 from π, so that Y_t ~ π for all t; X_0 is arbitrary. Couple the chains as follows:
- If X_{t-1} = Y_{t-1}, they make the same transition at time t (so once the chains meet, they stay together).
- If X_{t-1} ≠ Y_{t-1}, then with probability 1 - α both chains jump to the same random page z.

Therefore Pr(X_t ≠ Y_t) ≤ α^t. Setting t ≥ 2/log_2(1/α) gives Pr(X_t ≠ Y_t) ≤ 1/4. Therefore the mixing time satisfies

    T_mix ≤ 2/log_2(1/α) ≈ 8.5 for α = .85.
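
A quick numeric check of this bound:

```python
import math

alpha = 0.85
print(2 / math.log2(1 / alpha))   # ~8.53: the mixing-time bound above
# Smallest integer t with alpha**t <= 1/4 (the coupling guarantee):
print(min(t for t in range(1, 100) if alpha ** t <= 0.25))   # 9
```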