INTRODUCTION TO MCMC AND PAGERANK
Eric Vigoda, Georgia Tech
Lecture for CS 6505


1 MARKOV CHAIN BASICS
2 ERGODICITY
3 WHAT IS THE STATIONARY DISTRIBUTION?
4 PAGERANK
5 MIXING TIME
6 PREVIEW OF FURTHER TOPICS

What is a Markov chain?
Example: Life in CS 6210, discrete time t = 0, 1, 2, ...:

[Figure: directed graph on four states — Listen to Kishore, Check Email, StarCraft, Sleep — with weighted edges and self-loops.]

Each vertex is a state of the Markov chain. The state diagram is a directed graph, possibly with self-loops. Edge weights represent transition probabilities, so they are non-negative and the weights of the outgoing edges of each vertex sum to 1.

Transition matrix
In general: N states Ω = {1, 2, ..., N} and an N × N transition matrix P, where

P(i, j) = weight of edge i → j = Pr(going from i to j).

For the earlier example, with states ordered 1 = Listen, 2 = Email, 3 = StarCraft, 4 = Sleep:

$$P = \begin{pmatrix} .5 & .5 & 0 & 0 \\ .2 & 0 & .5 & .3 \\ 0 & .3 & .7 & 0 \\ .7 & 0 & 0 & .3 \end{pmatrix}$$

P is a stochastic matrix: every row sums to 1.
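To make this concrete, here is a minimal sketch in Python/NumPy (not from the original lecture) that encodes the transition matrix above and checks that it is stochastic:

```python
import numpy as np

# Transition matrix of the CS 6210 chain.
# State order: 0 = Listen, 1 = Email, 2 = StarCraft, 3 = Sleep.
P = np.array([
    [0.5, 0.5, 0.0, 0.0],   # Listen
    [0.2, 0.0, 0.5, 0.3],   # Email
    [0.0, 0.3, 0.7, 0.0],   # StarCraft
    [0.7, 0.0, 0.0, 0.3],   # Sleep
])

# Stochastic: entries are non-negative and every row sums to 1.
assert (P >= 0).all() and np.allclose(P.sum(axis=1), 1.0)
```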

One-step transitions
Time: t = 0, 1, 2, .... Let X_t denote the state at time t; X_t is a random variable. For states k and j,
$$\Pr(X_1 = j \mid X_0 = k) = P(k, j).$$
In general, for t ≥ 1, given: in state k_0 at time 0, in k_1 at time 1, ..., in k_{t-1} at time t − 1, what is the probability of being in state j at time t?
$$\Pr(X_t = j \mid X_0 = k_0, X_1 = k_1, \dots, X_{t-1} = k_{t-1}) = \Pr(X_t = j \mid X_{t-1} = k_{t-1}) = P(k_{t-1}, j).$$
The process is memoryless: only the current state matters; previous states do not. This is known as the Markov property, hence the term Markov chain.
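The Markov property is exactly what makes the chain easy to simulate: to take a step we only need the current state. A small sketch (my own illustration, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[.5, .5, 0, 0], [.2, 0, .5, .3], [0, .3, .7, 0], [.7, 0, 0, .3]])
states = ["Listen", "Email", "StarCraft", "Sleep"]

x = 0                      # X_0 = Listen
for t in range(1, 11):
    # The next state is drawn from row x of P: it depends only on X_{t-1}.
    x = rng.choice(4, p=P[x])
    print(f"X_{t} = {states[x]}")
```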

2-step transitions
What is the probability of Listen at time 2 given Email at time 0? Try all possibilities for the state at time 1:
\begin{align*}
\Pr(X_2 = \text{Listen} \mid X_0 = \text{Email})
&= \Pr(X_2 = \text{Listen} \mid X_1 = \text{Listen})\Pr(X_1 = \text{Listen} \mid X_0 = \text{Email}) \\
&\quad + \Pr(X_2 = \text{Listen} \mid X_1 = \text{Email})\Pr(X_1 = \text{Email} \mid X_0 = \text{Email}) \\
&\quad + \Pr(X_2 = \text{Listen} \mid X_1 = \text{StarCraft})\Pr(X_1 = \text{StarCraft} \mid X_0 = \text{Email}) \\
&\quad + \Pr(X_2 = \text{Listen} \mid X_1 = \text{Sleep})\Pr(X_1 = \text{Sleep} \mid X_0 = \text{Email}) \\
&= (.5)(.2) + 0 + 0 + (.7)(.3) = .31.
\end{align*}
With states 1 = Listen, 2 = Email, 3 = StarCraft, 4 = Sleep:
$$P = \begin{pmatrix} .5 & .5 & 0 & 0 \\ .2 & 0 & .5 & .3 \\ 0 & .3 & .7 & 0 \\ .7 & 0 & 0 & .3 \end{pmatrix}, \qquad P^2 = \begin{pmatrix} .35 & .25 & .25 & .15 \\ .31 & .25 & .35 & .09 \\ .06 & .21 & .64 & .09 \\ .56 & .35 & 0 & .09 \end{pmatrix}.$$
Note that .31 is exactly the (Email, Listen) entry of P^2.

k-step transitions
2-step transition probabilities: use P^2. In general, for states i and j:
\begin{align*}
\Pr(X_{t+2} = j \mid X_t = i)
&= \sum_{k=1}^{N} \Pr(X_{t+2} = j \mid X_{t+1} = k)\Pr(X_{t+1} = k \mid X_t = i) \\
&= \sum_k P(k, j) P(i, k) = \sum_k P(i, k) P(k, j) = P^2(i, j).
\end{align*}
ℓ-step transition probabilities: use P^ℓ. For states i and j and integer ℓ ≥ 1,
$$\Pr(X_{t+\ell} = j \mid X_t = i) = P^\ell(i, j).$$
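A quick numeric check of this (a sketch, assuming the matrix P from the earlier example):

```python
import numpy as np

P = np.array([[.5, .5, 0, 0], [.2, 0, .5, .3], [0, .3, .7, 0], [.7, 0, 0, .3]])

# l-step transition probabilities are the entries of P^l.
P2 = np.linalg.matrix_power(P, 2)
# Pr(X_2 = Listen | X_0 = Email) = P^2(Email, Listen):
print(P2[1, 0])   # 0.31, matching the hand computation
```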

Random initial state
Suppose the state at time 0 is not fixed but is chosen from a probability distribution µ_0. Notation: X_0 ∼ µ_0. What is the distribution of X_1? For state j,
\begin{align*}
\Pr(X_1 = j) &= \sum_{i=1}^{N} \Pr(X_0 = i)\Pr(X_1 = j \mid X_0 = i) \\
&= \sum_i \mu_0(i) P(i, j) = (\mu_0 P)(j).
\end{align*}
So X_1 ∼ µ_1 where µ_1 = µ_0 P. And X_t ∼ µ_t where µ_t = µ_0 P^t.
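In code, the distribution evolves by a vector-matrix product (again a sketch with the example chain):

```python
import numpy as np

P = np.array([[.5, .5, 0, 0], [.2, 0, .5, .3], [0, .3, .7, 0], [.7, 0, 0, .3]])

mu0 = np.array([1.0, 0.0, 0.0, 0.0])          # start deterministically in Listen
mu1 = mu0 @ P                                 # mu_1 = mu_0 P
mu10 = mu0 @ np.linalg.matrix_power(P, 10)    # mu_t = mu_0 P^t
print(mu1, mu10)
```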

Back to the CS 6210 example: big t?
Let's look again at our CS 6210 example:
$$P = \begin{pmatrix} .5 & .5 & 0 & 0 \\ .2 & 0 & .5 & .3 \\ 0 & .3 & .7 & 0 \\ .7 & 0 & 0 & .3 \end{pmatrix}, \qquad P^2 = \begin{pmatrix} .35 & .25 & .25 & .15 \\ .31 & .25 & .35 & .09 \\ .06 & .21 & .64 & .09 \\ .56 & .35 & 0 & .09 \end{pmatrix},$$
$$P^{10} = \begin{pmatrix} .247770 & .244781 & .402267 & .105181 \\ .245167 & .244349 & .405688 & .104796 \\ .239532 & .243413 & .413093 & .103963 \\ .251635 & .245423 & .397189 & .105754 \end{pmatrix}, \qquad P^{20} = \begin{pmatrix} .244190 & .244187 & .406971 & .104652 \\ .244187 & .244186 & .406975 & .104651 \\ .244181 & .244185 & .406984 & .104650 \\ .244195 & .244188 & .406966 & .104652 \end{pmatrix}.$$
Every row is converging to the same vector π = [.244186, .244186, .406977, .104651].

Limiting distribution
For big t,
$$P^t \approx \begin{pmatrix} .244186 & .244186 & .406977 & .104651 \\ .244186 & .244186 & .406977 & .104651 \\ .244186 & .244186 & .406977 & .104651 \\ .244186 & .244186 & .406977 & .104651 \end{pmatrix}.$$
Regardless of the starting state X_0, for big t:
Pr(X_t = 1) = .244186, Pr(X_t = 2) = .244186, Pr(X_t = 3) = .406977, Pr(X_t = 4) = .104651.
Let π = [.244186, .244186, .406977, .104651]. In other words, for big t, X_t ∼ π. π is called a stationary distribution.

Limiting distribution
Let π = [.244186, .244186, .406977, .104651]; π is called a stationary distribution. Once we reach π we stay at π: if X_t ∼ π then X_{t+1} ∼ π; in other words, πP = π. Any distribution π with πP = π is called a stationary distribution of the Markov chain.
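Since πP = π says that π is a left eigenvector of P with eigenvalue 1, we can compute it numerically (a sketch, not part of the lecture):

```python
import numpy as np

P = np.array([[.5, .5, 0, 0], [.2, 0, .5, .3], [0, .3, .7, 0], [.7, 0, 0, .3]])

# pi P = pi  <=>  P^T pi^T = pi^T: find the eigenvector of P^T for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = v / v.sum()                 # normalize to a probability distribution
print(pi)                        # approx [.244186, .244186, .406977, .104651]
```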

Stationary distributions
Key questions:
When is there a stationary distribution?
If there is at least one, is it unique, or can there be more than one?
Assuming there is a unique stationary distribution: do we always reach it? And what is it?
Mixing time = time to reach the unique stationary distribution.
Algorithmic goal: if we have a distribution π that we want to sample from, can we design a Markov chain that has:
a unique stationary distribution π,
convergence to π from every X_0,
a fast mixing time?

2 ERGODICITY

Irreducibility
We want a unique stationary distribution π, reached from every starting state X_0. But if the chain has multiple strongly connected components (SCCs), it cannot get from one component to the other:

[Figure: a six-state chain with two strongly connected components.]

Starting at 1 gets to a different distribution than starting at 5. State i communicates with state j if, starting at i, the chain can reach j: there exists t with P^t(i, j) > 0. A Markov chain is irreducible if all pairs of states communicate.

Periodicity
Example of a bipartite Markov chain:

[Figure: a four-state bipartite chain.]

Starting at 1 gets to a different distribution than starting at 3: the chain alternates between the two sides, so the distribution at time t depends on the parity of t. We need the chain to have no periodicity.

Aperiodic

[Figure: a four-state chain whose walk can return to state 1 in 3 or 5 steps.]

The return times of state i are R_i = {t : P^t(i, i) > 0}. In the example above, R_1 = {3, 5, 6, 8, 9, ...}. Let r = gcd(R_i) be the period of state i. If P is irreducible then all states have the same period. If r = 2 then the Markov chain is bipartite. A Markov chain is aperiodic if r = 1.

Ergodic: unique stationary distribution
Ergodic = irreducible and aperiodic.
Fundamental Theorem for Markov Chains: an ergodic Markov chain has a unique stationary distribution π, and for every initial distribution X_0 ∼ µ_0:
$$\lim_{t \to \infty} \mu_t = \pi.$$
In other words, for big enough t, all rows of P^t are (approximately) π. How big does t need to be? And what is π?

Proof idea: an ergodic MC has a unique stationary distribution
What is π? Fix a state i and set X_0 = i. Let T be the first time we visit state i again; T is a random variable. For every state j, let ρ(j) = expected number of visits to j up to time T. (Note ρ(i) = 1.) Let π(j) = ρ(j)/Z where Z = Σ_k ρ(k). One can verify that this π is a stationary distribution.
Why is it unique, and why do we always reach it? Run two chains (X_t) and (Y_t): X_0 is arbitrary, and Y_0 is chosen from π, so that Y_t ∼ π for all t. Using irreducibility and aperiodicity, one can couple the transitions of the two chains so that for big t we have X_t = Y_t, and thus X_t ∼ π.

3 WHAT IS THE STATIONARY DISTRIBUTION?

Determining π: symmetric Markov chain
P is symmetric if P(i, j) = P(j, i) for all pairs i, j. Then π is uniform over the states {1, ..., N}: π(j) = 1/N for all states j.
Proof: We verify that πP = π for this π, i.e. that (πP)(j) = π(j) for all states j:
\begin{align*}
(\pi P)(j) &= \sum_{i=1}^{N} \pi(i) P(i, j) \\
&= \frac{1}{N} \sum_{i=1}^{N} P(i, j) \\
&= \frac{1}{N} \sum_{i=1}^{N} P(j, i) && \text{since $P$ is symmetric} \\
&= \frac{1}{N} && \text{since rows of $P$ sum to 1} \\
&= \pi(j).
\end{align*}
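A quick numeric sanity check on a small symmetric chain (my own example, not from the lecture):

```python
import numpy as np

# A symmetric stochastic matrix on 3 states: P(i, j) = P(j, i).
P = np.array([[.5, .3, .2],
              [.3, .4, .3],
              [.2, .3, .5]])
assert np.allclose(P, P.T)

pi = np.full(3, 1 / 3)          # claim: the uniform distribution is stationary
assert np.allclose(pi @ P, pi)  # pi P = pi
```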

Determining π: reversible Markov chain
P is reversible with respect to π if π(i)P(i, j) = π(j)P(j, i) for all pairs i, j. If we can find such a π, then it is the stationary distribution.
Proof: Similar to the symmetric case; we check that (πP)(j) = π(j) for all states j:
\begin{align*}
(\pi P)(j) &= \sum_{i=1}^{N} \pi(i) P(i, j) \\
&= \sum_{i=1}^{N} \pi(j) P(j, i) && \text{by reversibility} \\
&= \pi(j) \sum_{i=1}^{N} P(j, i) \\
&= \pi(j).
\end{align*}

Some examples
Random walk on a d-regular, connected, undirected graph G: what is π? The walk is symmetric: for each edge (i, j), P(i, j) = P(j, i) = 1/d. So π is uniform: π(i) = 1/n.
Random walk on a general connected undirected graph G: what is π? Consider π(i) = d(i)/Z where d(i) is the degree of vertex i and Z = Σ_{j ∈ V} d(j). (Note Z = 2m = 2|E|.) Check that the walk is reversible with respect to this π:
$$\pi(i) P(i, j) = \frac{d(i)}{Z} \cdot \frac{1}{d(i)} = \frac{1}{Z} = \pi(j) P(j, i).$$
What if G is a directed graph? Then the walk may not be reversible, and if it is not reversible we usually cannot write down the stationary distribution, since typically N is HUGE.
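A sketch verifying the degree formula and detailed balance on a small undirected graph (the graph is my own choice, for illustration):

```python
import numpy as np

# Undirected graph as an adjacency matrix.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]])
d = A.sum(axis=1)            # vertex degrees
P = A / d[:, None]           # random walk: P(i, j) = 1/d(i) for each edge (i, j)

pi = d / d.sum()             # claim: pi(i) = d(i) / (2|E|)
assert np.allclose(pi @ P, pi)     # stationary: pi P = pi
F = pi[:, None] * P                # F(i, j) = pi(i) P(i, j)
assert np.allclose(F, F.T)         # detailed balance (reversibility)
```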

4 PAGERANK

PageRank
PageRank is an algorithm devised by Brin and Page in 1998 to determine the importance of webpages.
Webgraph: V = webpages, E = directed edges for hyperlinks.
Notation: for page x ∈ V, let
Out(x) = {y : x → y ∈ E} = pages that x links to,
In(x) = {w : w → x ∈ E} = pages that link to x.
Let π(x) = rank of page x. We are trying to define π(x) in a sensible way.

First ranking idea
First idea for ranking pages: as with academic papers, use citation counts. Here, a citation = a link to a page. So set π(x) = |In(x)| = number of links to x.

Refining the ranking idea
What if one webpage has 500 links, one of which is to Eric's page, while another webpage has only 5 links, one of which is to Santosh's page? Which link is more valuable?
Academic papers: if a paper cites 50 other papers, then each reference gets 1/50 of a citation.
Webpages: if a page y has |Out(y)| outgoing links, then each linked page gets 1/|Out(y)|. New solution:
$$\pi(x) = \sum_{y \in In(x)} \frac{1}{|Out(y)|}.$$

Further refining the ranking idea
Previous: π(x) = Σ_{y ∈ In(x)} 1/|Out(y)|. But if Eric's children's webpage has a link to Eric's page and CNN has a link to Santosh's page, which link is more important?
Solution: define π(x) recursively. Page y has importance π(y), and a link from y contributes π(y)/|Out(y)| of a citation:
$$\pi(x) = \sum_{y \in In(x)} \frac{\pi(y)}{|Out(y)|}.$$

Random walk
Importance of page x:
$$\pi(x) = \sum_{y \in In(x)} \frac{\pi(y)}{|Out(y)|}.$$
This is a recursive definition of π; how do we find it? Look at the random walk on the webgraph G = (V, E): from a page y ∈ V, choose a random link and follow it. This is a Markov chain with, for y → x ∈ E:
$$P(y, x) = \frac{1}{|Out(y)|}.$$
What is the stationary distribution of this Markov chain?

Random walk
Random walk on the webgraph G = (V, E), with P(y, x) = 1/|Out(y)| for y → x ∈ E. What is its stationary distribution? We need to find π with π = πP. Thus
$$\pi(x) = \sum_{y \in V} \pi(y) P(y, x) = \sum_{y \in In(x)} \frac{\pi(y)}{|Out(y)|}.$$
This is identical to the definition of the importance vector π. Summary: the stationary distribution of the random walk on the webgraph gives the importance π(x) of a page x.

Random walk on the webgraph
Random walk on the webgraph G = (V, E): is π the only stationary distribution? In other words, is the Markov chain ergodic? We would need G to be strongly connected, and it probably is not. And some pages have no outgoing links... then hit the random button!
Solution to make the chain ergodic: introduce a damping factor α with 0 < α < 1 (in practice, apparently α ≈ .85). From page y:
with probability α, follow a random outgoing link from page y;
with probability 1 − α, go to a completely random page (uniformly chosen from all pages V).

Random surfer
Let N = |V| denote the number of webpages. Transition matrix of the new random-surfer chain:
$$P(y, x) = \begin{cases} \dfrac{1-\alpha}{N} + \dfrac{\alpha}{|Out(y)|} & \text{if } y \to x \in E, \\[1ex] \dfrac{1-\alpha}{N} & \text{if } y \to x \notin E. \end{cases}$$
This new random-surfer Markov chain is ergodic. Thus its unique stationary distribution is the desired π. How do we find π? Take last week's π as the starting distribution and compute πP^t for big t. What is a big enough t?
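A toy power-iteration sketch of the random-surfer chain; the four-page webgraph here is hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical toy webgraph: out_links[y] = pages that y links to.
out_links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
N, alpha = 4, 0.85

# Random-surfer matrix: P(y, x) = (1-alpha)/N, plus alpha/|Out(y)| if y -> x.
P = np.full((N, N), (1 - alpha) / N)
for y, outs in out_links.items():
    for x in outs:
        P[y, x] += alpha / len(outs)

# Power iteration: pi_t = pi_0 P^t converges to the PageRank vector.
pi = np.full(N, 1 / N)
for _ in range(100):
    pi = pi @ P
print(pi.round(4))           # the PageRank scores of the four pages
```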

5 MIXING TIME

Mixing time
How fast does an ergodic MC reach its unique stationary distribution π? We need to measure the distance from π; we use total variation distance. For distributions µ and ν on a set Ω:
$$d_{TV}(\mu, \nu) = \frac{1}{2} \sum_{x \in \Omega} |\mu(x) - \nu(x)|.$$
Example: Ω = {1, 2, 3, 4}. µ is uniform: µ(1) = µ(2) = µ(3) = µ(4) = .25. And ν has: ν(1) = .5, ν(2) = .1, ν(3) = .15, ν(4) = .25. Then
$$d_{TV}(\mu, \nu) = \frac{1}{2}(.25 + .15 + .1 + 0) = .25.$$
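The definition translates directly into code (a sketch reproducing the example above):

```python
import numpy as np

def d_tv(mu, nu):
    # Total variation distance: half the L1 distance between distributions.
    return 0.5 * np.abs(mu - nu).sum()

mu = np.array([.25, .25, .25, .25])
nu = np.array([.50, .10, .15, .25])
print(d_tv(mu, nu))   # 0.25
```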

Mixing time
Consider an ergodic MC with state space Ω, transition matrix P, and unique stationary distribution π. For a state x ∈ Ω, the time to mix from x is
$$T(x) = \min\{t : d_{TV}(P^t(x, \cdot), \pi) \le 1/4\}.$$
Then the mixing time is T_mix = max_x T(x). Summarizing in words: the mixing time is the time to get within distance 1/4 of π from the worst initial state X_0. The choice of the constant 1/4 is somewhat arbitrary: we can get within distance ε in time O(T_mix log(1/ε)).
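For a small chain we can compute T(x) directly from the definition; a sketch using the CS 6210 example:

```python
import numpy as np

def d_tv(mu, nu):
    return 0.5 * np.abs(mu - nu).sum()

P = np.array([[.5, .5, 0, 0], [.2, 0, .5, .3], [0, .3, .7, 0], [.7, 0, 0, .3]])
pi = np.array([.244186, .244186, .406977, .104651])

# T(x) = min { t : d_TV(P^t(x, .), pi) <= 1/4 };  T_mix = max over x.
T = []
for x in range(4):
    Pt, t = np.eye(4), 0
    while d_tv(Pt[x], pi) > 0.25:
        Pt, t = Pt @ P, t + 1
    T.append(t)
print(T, max(T))   # per-state mixing times and T_mix
```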

Mixing time of the random surfer
Coupling proof: consider two copies (X_t) and (Y_t) of the random-surfer chain. Choose Y_0 from π, so that Y_t ∼ π for all t; X_0 is arbitrary.
If X_{t−1} = Y_{t−1}, the two chains make the same transition at time t. If X_{t−1} ≠ Y_{t−1}, then with probability 1 − α both chains jump to the same random page z. Therefore,
$$\Pr(X_t \ne Y_t) \le \alpha^t.$$
By the coupling inequality, d_TV(µ_t, π) ≤ Pr(X_t ≠ Y_t). Setting t ≥ log_α(1/4) = 2/log_2(1/α) gives Pr(X_t ≠ Y_t) ≤ 1/4. Therefore the mixing time satisfies
$$T_{mix} \le \frac{2}{\log_2(1/\alpha)} \approx 8.5 \quad \text{for } \alpha = .85.$$

6 PREVIEW OF FURTHER TOPICS

Example chain: random matchings
In an undirected graph, a matching is a subset of vertex-disjoint edges. Let Ω = the collection of all matchings of G (of all sizes). Can we generate a matching uniformly at random from Ω, in time polynomial in n = |V|?

Markov chain for matchings
Consider an undirected graph G = (V, E). From a matching X_t, the transition X_t → X_{t+1} is defined as follows:
1. Choose an edge e = (v, w) uniformly at random from E.
2. If e ∈ X_t, set X_{t+1} = X_t \ {e}.
3. Else if v and w are both unmatched in X_t, set X_{t+1} = X_t ∪ {e}.
4. Otherwise, set X_{t+1} = X_t.
The chain is symmetric and ergodic, and thus π is uniform over Ω. How fast does it reach π? Further topic (in an MCMC class): we'll see that it is close to π after poly(n) steps, and this holds for every G. Thus we can generate a (near-)uniform random matching of a graph in polynomial time. A simulation sketch follows below.
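A direct simulation of these transition rules on a small graph (my own sketch; the 4-cycle and step count are arbitrary choices):

```python
import random

def step(edges, M):
    # One transition of the matchings chain from matching M.
    e = random.choice(edges)            # 1. pick a uniformly random edge e = (v, w)
    v, w = e
    matched = {u for f in M for u in f}
    if e in M:
        M.remove(e)                     # 2. e is in X_t: delete it
    elif v not in matched and w not in matched:
        M.add(e)                        # 3. both endpoints unmatched: add e
    return M                            # 4. otherwise X_{t+1} = X_t

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # a 4-cycle
M = set()
for _ in range(1000):
    M = step(edges, M)
print(sorted(M))   # a (near-)uniform random matching of the 4-cycle
```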