Introduction to Markov Chains and Riffle Shuffling

Nina Kuklisova
Math REU 2012, University of Chicago
September 27, 2012

Abstract

In this paper, we introduce Markov chains and their basic properties, and we look at a simple application to shuffling cards. We derive the rate of convergence to the stationary distribution for the most common shuffling method, the dovetail shuffle.

Contents

1 Introduction
2 Markov Chains
2.1 Basic Definitions and Concepts
2.2 Definition of Markov Chains
2.3 Coupling
2.4 Stopping and Stationary Times
2.5 Time Reversal
3 Riffle Shuffles
3.1 Gilbert-Shannon-Reeds Model
3.2 Approach to Uniformity in the GSR Shuffling Model

1 Introduction

A Markov chain is a type of stochastic process that was first studied by Markov in 1906. A process consists of a sequence of states; in a Markov chain, the distribution of each state depends only on the state directly preceding it. Markov chains are of great interest because they can model many different problems. They were studied rigorously by P. Diaconis [1]; further properties of Markov chains can be found in [2]. The motivating example of this paper is card shuffling.

This paper is mostly expository. In the first part, we introduce Markov chains as they
are described in [2]; then, we reprove one of the first fundamental theorems on shuffling, which was first proved in [3]. This paper was written without any previous knowledge of statistics, and it does not assume the reader to be familiar with this field. For concepts that are more complex and need a more detailed explanation, we give references to the literature.

2 Markov Chains

2.1 Basic Definitions and Concepts

Most of the material in this section is explained in further detail in [2]. The majority of the structures that we talk about are probability distributions: these represent the range of possible outcomes and their respective probabilities. The total probability of all outcomes must be one.

Definition 2.1. A probability distribution on a countable set $\Omega$ is a function $P : \Omega \to [0, 1]$ such that $\sum_{x \in \Omega} P(x) = 1$.

A random variable $X$ is a measurable function defined on $\Omega$. The probability measure $P_X$ of the random variable $X$ on $\mathbb{R}$ is called its distribution; for a Borel set $B$, it is defined by $P_X(B) := P(X \in B)$. Two events $A_1, A_2$ are independent if $P(A_1 \cap A_2) = P(A_1)P(A_2)$.

Definition 2.2. A probability (state) space is a triple $(\Omega, \mathcal{F}, P)$, where $\Omega$ is the set of outcomes, $\mathcal{F}$ the set of events, and $P$ is a probability measure on $\Omega$, so that $P : \mathcal{F} \to [0, 1]$ assigns probabilities to events. We will denote this probability space by $P$.

For example, for a coin toss, $\Omega = \{\text{Heads }(H), \text{Tails }(T)\}$; if we are considering two tosses, $\mathcal{F} = \{(HH), (TT), (HT), (TH)\}$ and $P(H,T) = P(T,H) = P(T,T) = P(H,H) = 1/4$.

There are two different types of variables that make up probability distributions: discrete and absolutely continuous.

Definition 2.3. A discrete variable $X$ has only a finite number of possible outcomes.
Its simplest example is the coin toss. For this type of variable, there is a finite set, called the support of $X$, that contains all the possible values of $X$. If we denote these values $x_1, x_2, \ldots, x_n$, then
$$\sum_{i=1}^{n} P(x_i) = 1.$$

Definition 2.4. An absolutely continuous variable can take on any value within a certain range, just like a continuous function defined on some interval.

For this type of variable, the probability that the variable takes a value in an interval $A$ equals the area under a curve over this interval, relative to the area under the whole curve. This curve is given by the density function $p(x)$ on $\mathbb{R}$, for which
$$P_X(A) = \int_A p(x)\,dx \quad \text{for any } A \subseteq \Omega$$
is the probability that the variable has a value in the interval $A$. Thus, for any domain, we can define some kind of probability space. The density function shows how closely packed the probability of the outcomes is.

Definition 2.5. For a random variable $X$, we can define its expectation: for a discrete random variable $X$ with support $\Omega$, the expectation $E(X)$ is defined by
$$E(X) = \sum_{x \in \Omega} x\, P\{X = x\}; \tag{1}$$
for an absolutely continuous random variable $X$ with density $p(x)$,
$$E(X) = \int_{\mathbb{R}} x\, p(x)\,dx. \tag{2}$$

Often, the initial distribution is concentrated at a single definite starting state $x$. This is denoted by $P_x$ and $E_x$.

For more complex models, it is useful to define probability distributions. In most processes, the probability of each outcome is different; we can only say that the sum of all these probabilities over a distribution is 1. Most of the distributions described in this paper have independent and identically distributed random variables, which we will simply denote as i.i.d.

When we are interested in a specific range of outcomes, we study marginal distributions. These can be defined for both discrete and absolutely continuous distributions $F$.

Definition 2.6. If $F$ is the distribution of the variables $(X_1, X_2, \ldots, X_d)$, the marginal distribution of $X_i$ is $F_i(x) = P(X_i \le x)$.
So, the marginal distribution for a variable can be visualized as a cut of the original distribution through the variable $X_i$.
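Definition 2.6 is easy to illustrate numerically in the discrete case. The following sketch (ours, not part of the original paper; the joint distribution is an arbitrary made-up example) computes marginal probability mass functions, the discrete analogue of the marginal distribution, by summing the joint pmf over the other coordinate.

```python
# Marginals of a discrete joint pmf (cf. Definition 2.6).
# The joint distribution below is an arbitrary illustrative example;
# its values sum to 1.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def marginal(joint, axis):
    """Marginal pmf of coordinate `axis`: sum the joint pmf
    over all outcomes of the other coordinate."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + p
    return out

mu = marginal(joint, 0)   # distribution of X (approx. {0: 0.3, 1: 0.7})
nu = marginal(joint, 1)   # distribution of Y (approx. {0: 0.4, 1: 0.6})
```

Each marginal is itself a probability distribution: its values sum to 1.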
In this paper, we mostly use only two variables. Their marginal distributions are denoted by the Greek letters $\mu(x)$ and $\nu(x)$.

We often are interested in situations when some specific conditions are satisfied, so we look at joint distributions: for $x$ and $y$ defined on a probability space $\Omega_1 \times \Omega_2$, we can define a joint distribution $P(x, y) = P(X = x \text{ and } Y = y)$. Then, for independent random variables, $P(x, y) = P(x)P(y)$.

2.2 Definition of Markov Chains

What we call a chain here is a sequence of random variables where each somehow depends on the previous one. It is a process that changes in time increments $\delta t$, and we denote the state at time $t$ by $X_t$. In the case of a coin toss, at each step we have the same probability for all outcomes, regardless of the previous tosses. This makes the coin toss a good example of a stochastic process with the Markov property. So, we can establish the main properties of the distributions studied in this paper.

Definition 2.7. The Markov property of a sequence of random variables $(X_0, X_1, \ldots, X_t, X_{t+1})$ defined on $\Omega$ means that the distribution of the last element depends only on the element directly before it:
$$P\{X_{t+1} = y \mid X_t = x, \ldots, X_0 = z\} = P\{X_{t+1} = y \mid X_t = x\}.$$

For a process with discrete variables, each such state can be represented by a vector. We can imagine these processes as applying a transition matrix $P$ many times. At time $t$, $t$ transitions have occurred, so we denote the corresponding matrix by $P^t$; $P^t(x, y)$ denotes the probability of getting from $x$ to $y$ in $t$ steps. We will denote events by $H_t = \bigcap_{s=0}^{t-1}\{X_s = x_s\}$, where the $x_s$ can be any outcomes.

Definition 2.8. A sequence of random variables $(X_0, X_1, \ldots, X_t, X_{t+1})$ is a Markov chain with state space $\Omega$ and transition matrix $P$ if for all $x, y \in \Omega$, at any $t$, and all events $H_t = \bigcap_{s=0}^{t-1}\{X_s = x_s\}$ satisfying
$$P(H_t \cap \{X_t = x\}) > 0,$$
we have
$$P\{X_{t+1} = y \mid H_t \cap \{X_t = x\}\} = P\{X_{t+1} = y \mid X_t = x\} = P(x, y).$$
We have just defined a Markov chain as a sequence of random variables such that, whenever the outcome $x$ has nonzero probability at the current step, the probability that the next step is $y$ depends only on $x$: it is the transition probability $P(x, y)$, regardless of the earlier history.

As we have seen, one of the simplest Markov chains that we could think of is the coin toss. If we denote the set of possible auxiliary outcomes at each step by $\Lambda$, we can define a random mapping representation.
Definition 2.9. A random mapping representation of a transition matrix $P$ on state space $\Omega$ is a function $f : \Omega \times \Lambda \to \Omega$, together with a $\Lambda$-valued random variable $Z$, satisfying
$$P\{f(x, Z) = y\} = P(x, y).$$

We needed such a precise definition because it can be widely used. In this paper, we will see its use for studying card shuffling.

Theorem 2.10. Every transition matrix on a finite state space has a random mapping representation.

Proof. Take a Markov chain with state space $\Omega = \{x_1, x_2, \ldots, x_n\}$ and transition matrix $P$. Choose the auxiliary random variable $Z$ uniformly from the interval $\Lambda = [0, 1]$. Define one auxiliary probability function showing how likely it is that after $x_j$, we get at most $x_k$:
$$F_{j,k} = \sum_{i=1}^{k} P(x_j, x_i),$$
and another auxiliary function for which $f(x_j, z) := x_k$ when $F_{j,k-1} < z \le F_{j,k}$; then
$$P\{f(x_j, Z) = x_k\} = P\{F_{j,k-1} < Z \le F_{j,k}\} = P(x_j, x_k).$$
Since we could have used any $x_j$ and $x_k$, we have a random mapping representation for any transition matrix.

2.3 Coupling

Here, we proceed to one of the key notions in this paper, which is called coupling and leads to stationary distributions. Recall that we have defined marginal distributions: for a variable $X$, we can have $P(X \le x) = \mu(x)$; for another variable $Y$, $P(Y \le y) = \nu(y)$.

Definition 2.11. A coupling of $\mu$ and $\nu$ is a pair of random variables $(X, Y)$ defined on a single probability space such that the marginal distribution of $X$ is $\mu$ and that of $Y$ is $\nu$. A coupling $(X, Y)$ satisfies $P\{X = x\} = \mu(x)$ and $P\{Y = y\} = \nu(y)$.

Using the familiar example of coin tosses: if we toss two independent fair coins, we have $P\{X = x \text{ and } Y = y\} = 1/4$ for all possible pairs $(x, y)$ from $\{0, 1\}$.

Definition 2.12. For a probability transition matrix $P$, a distribution $\pi$ on $\Omega$ satisfying $\pi = \pi P$ is a stationary distribution of the Markov chain.
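Returning briefly to Theorem 2.10: its proof translates directly into code. Below is a small sketch (ours, not the paper's; the two-state matrix is an arbitrary example). The cumulative row sums play the role of the $F_{j,k}$, and a uniform $Z$ on $[0,1]$ drives one step of the chain.

```python
import random
from bisect import bisect_left

def random_mapping(P):
    """Build f(x, z) as in the proof of Theorem 2.10:
    f(x_j, z) = x_k exactly when F_{j,k-1} < z <= F_{j,k},
    where F_{j,k} = P(x_j, x_1) + ... + P(x_j, x_k)."""
    cum = []
    for row in P:
        acc, c = 0.0, []
        for p in row:
            acc += p
            c.append(acc)
        cum.append(c)
    # bisect_left returns the first index k with z <= cum[x][k]
    return lambda x, z: bisect_left(cum[x], z)

P = [[0.3, 0.7],
     [0.5, 0.5]]          # arbitrary two-state transition matrix
f = random_mapping(P)

# Running the chain: X_{t+1} = f(X_t, Z_{t+1}) with Z uniform on [0, 1].
rng = random.Random(0)
x = 0
for _ in range(1000):
    x = f(x, rng.random())
```

The same `f` works for any finite transition matrix, which is exactly the content of the theorem.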
As an example, we can take a look at the simple random walk on a graph $G = (V, E)$ with vertex set $V$ and edge set $E$. We denote the number of neighbors of a vertex $x$ by $\deg(x)$, and write $x \sim y$ for neighboring vertices. The probability that a person standing on any vertex $y \in V$ will move to a given neighboring vertex $x$ is $\frac{1}{\deg(y)}$. For any vertex $y \in V$, summing over its neighboring vertices $x$, we get
$$\sum_{x : x \sim y} \deg(x) P(x, y) = \sum_{x : x \sim y} \deg(x)\,\frac{1}{\deg(x)} = \deg(y).$$
With $|E|$ the total number of edges, we can therefore define the probability measure
$$\pi(y) = \frac{\deg(y)}{2|E|},$$
and for any $y \in \Omega$, the probability measure $\pi(y)$ is always a stationary distribution for the walk.

Definition 2.13. For $x \in \Omega$, the hitting time for $x$ is $\tau_x = \min\{t \ge 0 : X_t = x\}$. This is the first time at which the chain visits state $x$. The first return time is $\tau_x^+ = \min\{t \ge 1 : X_t = x\}$; here, we only consider positive times.

The notion of hitting time permits us to establish further properties of Markov chains with the property of irreducibility.

Definition 2.14. A chain $P$ is called irreducible if for any two states $x, y \in \Omega$, there exists an integer $t > 0$ such that $P^t(x, y) > 0$. This means that it is possible to get from any state to any other state using transitions of positive probability.

Theorem 2.15. Let $P$ be the transition matrix of an irreducible Markov chain. Then
(a) there exists a probability distribution $\pi$ on $\Omega$ such that $\pi = \pi P$ and $\pi(x) > 0$ for all $x \in \Omega$;
(b) $\pi(x) = \frac{1}{E_x(\tau_x^+)}$.

Proof. (a) We will study the properties of the state $x$ by looking at states $y$ and $z$ as well. Fix an arbitrary state $z$ of the Markov chain and, for each state $y$, let $\pi(y)$ be the expected number of visits to $y$ before returning to $z$: the sum over all times $t$ of the probability that $X_t = y$ while the first return time to $z$ is larger than $t$,
$$\pi(y) := \sum_{t=0}^{\infty} P_z\{X_t = y,\ \tau_z^+ > t\}. \tag{3}$$
Then $\pi$ is a stationary distribution (up to normalization) if $\pi P = \pi$.
For any such $y$,
$$(\pi P)(y) = \sum_{x \in \Omega} \sum_{t=0}^{\infty} P_z\{X_t = x,\ \tau_z^+ > t\}\, P(x, y) = \sum_{t=0}^{\infty} P_z\{X_{t+1} = y,\ \tau_z^+ \ge t + 1\}$$
$$= \sum_{t=0}^{\infty} P_z\{X_t = y,\ \tau_z^+ > t\} - P_z\{X_0 = y,\ \tau_z^+ > 0\} + \sum_{t=1}^{\infty} P_z\{X_t = y,\ \tau_z^+ = t\}.$$
Now, two cases can occur:

if $y = z$, then $P_z\{X_0 = z,\ \tau_z^+ > 0\} = 1$ and $\sum_{t=1}^{\infty} P_z\{X_t = z,\ \tau_z^+ = t\} = 1$;

if $y \ne z$, then $P_z\{X_0 = y,\ \tau_z^+ > 0\} = 0$ and $\sum_{t=1}^{\infty} P_z\{X_t = y,\ \tau_z^+ = t\} = 0$.

Thus, the last two terms in the display above cancel, and $\pi P = \pi$.

(b) Since $\sum_x \pi(x) = E_z(\tau_z^+)$, we can normalize the distribution: for any $x \in \Omega$,
$$\Pi(x) = \frac{\pi(x)}{E_z(\tau_z^+)},$$
and taking $z = x$ gives $\Pi(x) = \frac{1}{E_x(\tau_x^+)}$.

2.4 Stopping and Stationary Times

The following definitions seem trivial, but they become helpful in our study of Markovian processes.

Definition 2.16. For a sequence $(X_t)_{t=0}^{\infty}$ of $\Omega$-valued random variables, a $\{0, 1, \ldots\} \cup \{\infty\}$-valued random variable $\tau$ is a stopping time for $(X_t)$ if for each $t \in \{0, 1, \ldots\}$, there is a set $B_t \subseteq \Omega^{t+1}$ such that
$$\{\tau = t\} = \{(X_0, X_1, \ldots, X_t) \in B_t\}.$$
We can also say that for a stopping time $\tau$, the event $\{\tau = t\}$ is determined by $X_0, \ldots, X_t$.

As an example, we can define a stopping time for a market stock: a trader can sell it after it exceeds a certain value, and the time when this happens is a stopping time.

Recall the random mapping representation (from Definition 2.9): we can apply the map $f$ to an i.i.d. sequence $(Z_t)_{t=1}^{\infty}$; then the sequence $(X_t)_{t=0}^{\infty}$ defined by
$$X_0 = x, \quad X_t = f(X_{t-1}, Z_t) \tag{4}$$
is a Markov chain with transition matrix $P$.

Definition 2.17. A random time $\tau$ is called a randomized stopping time for the Markov chain $(X_t)$ if it is a stopping time for the sequence $(Z_t)$.

Let us take a look at an example: the lazy random walk on the hypercube $\{0, 1\}^n$. At each step of this process, an element $(k, B)$ is selected uniformly at random from $\{1, 2, \ldots, n\} \times \{0, 1\}$, and coordinate $k$ is updated with the bit $B$. The chain is determined by the i.i.d. sequence $(Z_t)$, with $Z_t = (K_t, B_t)$ being the coordinate and bit
pair used to update at step $t$. Let us define
$$\tau_{\mathrm{ref}} := \min\{t \ge 0 : \{K_1, \ldots, K_t\} = \{1, 2, \ldots, n\}\},$$
the first time when each coordinate has been updated at least once. At this time, all of the coordinates have been replaced with independent fair bits, so the chain's distribution is uniform on $\{0, 1\}^n$. So $X_{\tau_{\mathrm{ref}}}$ is an exact sample from the stationary distribution $\pi$. Since $\tau_{\mathrm{ref}}$ is not a function of $(X_t)$ but of $(Z_t)$, it is a stopping time for $(Z_t)$, hence a randomized stopping time.

Definition 2.18. For $(X_t)$ an irreducible Markov chain with stationary distribution $\pi$, a stationary time $\tau$ for $(X_t)$ is a randomized stopping time, possibly depending on the starting position $x$, such that the distribution of $X_\tau$ is $\pi$:
$$P_x\{X_\tau = y\} = \pi(y).$$

Definition 2.19. A strong stationary time for a Markov chain $(X_t)$ with stationary distribution $\pi$ is a randomized stopping time $\tau$, possibly depending on the starting position $x$, such that
$$P_x\{\tau = t,\ X_\tau = y\} = P_x\{\tau = t\}\,\pi(y).$$
This means that for a strong stationary time $\tau$, $X_\tau$ has distribution $\pi$ and is independent of $\tau$. As an example, consider again the lazy random walk on the hypercube; $\tau_{\mathrm{ref}}$ is also a strong stationary time.

Lemma 2.20. Let $(X_t)$ be an irreducible Markov chain with stationary distribution $\pi$. If $\tau$ is a strong stationary time for $(X_t)$, then for all $t \ge 0$,
$$P_x\{\tau \le t,\ X_t = y\} = P_x\{\tau \le t\}\,\pi(y). \tag{5}$$

Proof. Suppose that $(X_t)$ is generated as in (4) by an i.i.d. sequence $Z_1, Z_2, \ldots$. Then, for any $s \le t$,
$$P_x\{\tau = s,\ X_t = y\} = \sum_{z \in \Omega} P_x\{X_t = y \mid \tau = s,\ X_s = z\}\, P_x\{\tau = s,\ X_s = z\}. \tag{6}$$
Now, by the definition of a randomized stopping time, there is a set $B \subseteq \Lambda^s$ for which $\{\tau = s\} = \{(Z_1, \ldots, Z_s) \in B\}$. Moreover, we can define a function $f_r$ for which
$$X_{s+r} = f_r(X_s, Z_{s+1}, \ldots, Z_{s+r}).$$
The vectors $(Z_1, \ldots, Z_s)$ and $(Z_{s+1}, \ldots, Z_t)$ are independent, which means that
$$P_x\{X_t = y \mid \tau = s,\ X_s = z\} = P_x\{f_{t-s}(z, Z_{s+1}, \ldots, Z_t) = y \mid (Z_1, \ldots, Z_s) \in B,\ X_s = z\} = P^{t-s}(z, y). \tag{7}$$
When we put equations (6) and (7) together, using the definition of a strong stationary time, we can see that
$$P_x\{\tau = s,\ X_t = y\} = \sum_{z \in \Omega} P^{t-s}(z, y)\,\pi(z)\, P_x\{\tau = s\} = \pi(y)\, P_x\{\tau = s\}.$$
This further implies that
$$P_x\{\tau \le t,\ X_t = y\} = \sum_{s \le t} P_x\{\tau = s,\ X_t = y\} = \sum_{s \le t} \pi(y)\, P_x\{\tau = s\} = P_x\{\tau \le t\}\,\pi(y).$$

2.5 Time Reversal

The analysis of many processes would be simpler if done from the end. For this reason, it is good to see which properties a Markovian process shares with its inverse process.

Definition 2.21. The time reversal of an irreducible Markov chain with transition matrix $P$ and stationary distribution $\pi$ is the chain with matrix
$$\hat P(x, y) := \frac{\pi(y)\, P(y, x)}{\pi(x)}.$$

Definition 2.22. For a distribution $\mu$ on a group $G$, the inverse distribution $\hat\mu$ is defined by $\hat\mu(g) := \mu(g^{-1})$ for all $g \in G$.

Proposition 2.23. Let $(X_t)$ be an irreducible Markov chain with transition matrix $P$ and stationary distribution $\pi$. Write $(\hat X_t)$ for the time-reversed chain with transition matrix $\hat P$. Then $\pi$ is stationary for $\hat P$, and for any $x_0, x_1, \ldots, x_t \in \Omega$ we have
$$P_\pi\{X_0 = x_0, \ldots, X_t = x_t\} = P_\pi\{\hat X_0 = x_t, \ldots, \hat X_t = x_0\}.$$

Proof. Using the definition of $\hat P$ and cancelling the intermediate $\pi$ factors,
$$P_\pi\{\hat X_0 = x_n, \ldots, \hat X_n = x_0\} = \pi(x_n)\,\hat P(x_n, x_{n-1}) \cdots \hat P(x_2, x_1)\,\hat P(x_1, x_0)$$
$$= \pi(x_0)\, P(x_0, x_1)\, P(x_1, x_2) \cdots P(x_{n-1}, x_n) = P_\pi\{X_0 = x_0, \ldots, X_n = x_n\}.$$

So, we know another useful property of Markov chains: a distribution that is stationary for a Markov chain is also stationary for its inverse.

3 Riffle Shuffles

3.1 Gilbert-Shannon-Reeds Model

The Gilbert-Shannon-Reeds model is the first mathematically precise model of shuffling. It describes the most common strategy of card shuffling: a deck is cut into two heaps; then,
a card is dropped from the left or the right heap, with probability proportional to the number of cards in the heap, until there are no more cards in either of them. We will denote this model by GSR.

Since the GSR model was the first mathematically precise model of shuffling, its convergence is an important result. It was first derived in a famous paper by Bayer and Diaconis [3], and the following section derives this result as well. Most authors currently studying this phenomenon present multiple processes for which this model can be used [2], but our analysis focuses on the fundamental convergence property of this model. Knowing these properties also allows performing many card tricks; in addition to the statistical analysis, Diaconis wrote a book with Graham on its use in magic card tricks [9].

Definition 3.1. For a deck of $n$ cards, an $a$-shuffle proceeds in the following way: take the stack of cards and cut it into $a$ packets. Then drop the cards from these packets successively onto one big pile, in the following way: let $b_i$ be the number of cards in packet $i$ at any moment; then the chance that the next card dropped will be from this packet is $b_i \big/ \sum_{j=1}^{a} b_j$.

A rising sequence of a permutation is a maximal set of consecutive card values that appear in increasing order as we read through the deck.

Theorem 3.2. For an $a$-shuffle, the probability that it results in a specific permutation $\pi$ is
$$\frac{\binom{a + n - r}{n}}{a^n},$$
where $r$ is the number of rising sequences in $\pi$.

Proof. We can look at this process from the end. At that point, we have $r$ rising sequences. We can choose how to reorder them into the $a$ packets: we must make $r - 1$ cuts that ensure dividing the deck into the $r$ rising sequences that we want; after that, we can place the remaining $a - r$ cuts wherever we like. After any such cut, when we are recreating the original sequence of $n$ cards, we have $\binom{a + n - r}{n}$ possibilities. The number of possible initial $a$-shuffles is $a^n$, since each of these $n$ cards can be in one of $a$ packets. So, this probability is $\binom{a + n - r}{n} \big/ a^n$.

Corollary 3.3.
If a deck of cards is given a sequence of $m$ shuffles of types $a_1, a_2, \ldots, a_m$, then the chance that the deck is in arrangement $\pi$ is
$$\frac{\binom{n + a - r}{n}}{a^n}, \quad \text{for } a = a_1 a_2 \cdots a_m,$$
with $r$ the number of rising sequences in $\pi$.

Proof. Knowing Theorem 3.2, we can see that, given the number of rising sequences $R(\pi)$, the conditional law of $\pi$ is uniform. Now, we use Lemma 1 of the famous paper by Rogers and Pitman [7]: once we know that the family of distributions for the process of these rising sequences is complete, the requirements of this Lemma are satisfied and $R(\pi)$ is a Markov chain. So, we want to show that if
$$\sum_{r=1}^{n} \binom{a^m + n - r}{n} f(r) = \sum_{r=1}^{n} \binom{a^m + n - r}{n} g(r) \quad \text{for } m = 0, 1, 2, \ldots, n,$$
then $f = g$.
Taking x = a m, m ( ) a m + n r f(r) n = [(x + n )(x + n 2)...xf() + n! i= +(x + n 2)(x + n 3)...(x )f(2) +... + +x(x )...(x (n ))f(n)] = f(i) at x = i. Since the same decomposition holds for the right hand side of the equation, we can see that f(i) = g(i). 3.2 Approach to Uniformity in the GSR Shuffling Model In this subsection, we ll get to prove how does this distribution converge to a uniform one. m +n r Proposition 3.4. Let Q m (r) = (2 n ) be the probability of a permutation with r rising 2 mn sequences after m shuffles from the GSR distribution. Let r = n/2 + h, n/2 + h n/2, and m = log 2 (n 3/2 c) with 0 < c < fixed. Then Proof. Q m (r) = n! exp{ c n ( h + 2 + O C( h n )) 24c 2 2 ( h cn )2 + O C ( n )}. Recalling the inequality Q m (r) = (2m + n r)(2 m + n r )...(2 m r) n!(2 m ) n = n n! exp{ ln( + (n/2) h i )}. cn 3/2 i=0 x x2 2 + x3 3 x4 ln( + x) x x2 2 + x3 3, valid for 2 < x < ; we can bound the logarithmic term. We will evaluate all the terms of decomposition of Q m just with the standard summation formulas: n i = n(n+) n gives ( n h+/2 h i) = 2 cn 3/2 2 c ; n i= i=0 n i 2 = n(n+)(2n+) n gives ( n h 6 2c 2 n 3 2 i)2 = + ( h 24c 2 2 cn )2 + O c ( ); n i= i=0 n i 3 = n2 (n+) 2 n gives ( n h 4 3c i= 3 n 9/2 2 i)3 = O C ( h ); n 3/2 i=0 n ( n h 2 i)4 = O c ( ) gives ( n h n 2 i)4 = O c ( ). n i=0 c 4 n 6 n i=0 Putting these together, we get the estimate above.
Proposition 3.5. Let $h^*$ be the integer such that $Q_m(n/2 + h) \ge 1/n!$ exactly when $h \le h^*$. Then, for any fixed $c$, as $n$ goes to infinity,
$$h^* = -\frac{\sqrt n}{24c} + B + O_c\!\left(\frac{1}{\sqrt n}\right),$$
with $B$ a bounded quantity depending on $c$.

Proof. This bound can be found by looking at Proposition 3.4: its exponent must be nonnegative in order to have $Q_m(n/2 + h) \ge 1/n!$. If we set the exponent equal to zero and solve for $h$, the resulting expression is the one above.

Theorem 3.6. Let $Q_m$ be the Gilbert-Shannon-Reeds distribution on the symmetric group $S_n$, and let $U$ be the uniform distribution. Then for $m = \log_2(n^{3/2} c)$, with $0 < c < \infty$ fixed, as $n$ tends to $\infty$,
$$\|Q_m - U\| = 1 - 2\Phi\left(\frac{-1}{4c\sqrt 3}\right) + O_C\!\left(\frac{1}{n^{1/4}}\right) \tag{8}$$
with $\Phi(x) = \int_{-\infty}^{x} e^{-t^2/2}\,dt\big/\sqrt{2\pi}$.

Proof. We have seen that the number of rising sequences is sufficient information for computing the probability of a permutation. This allows us to use the result of a paper by Diaconis and Zabell [4], which says that the total variation distance between two probabilities is equal to the total variation distance between the induced laws of any sufficient statistic. Then, if we denote the number of permutations with $n/2 + h$ rising sequences by $R_{nh}$, Proposition 3.5 gives
$$\|Q_m - U\| = \sum_{-n/2 < h \le h^*} R_{nh}\left(Q_m\!\left(\frac n2 + h\right) - \frac{1}{n!}\right).$$
The number of descents is crucial here: $\pi$ has $r$ rising sequences if and only if $\pi^{-1}$ has $r - 1$ descents. We can also recall that the Eulerian number $a_{nj}$ denotes the number of permutations with $j$ descents. In his study of Eulerian numbers in [8], Tanny showed that the chance that the sum of $n$ variables that are i.i.d. uniform on $[0, 1]$ lies between $j$ and $j + 1$ equals $a_{nj}/n!$. If this is so, then $a_{nj}/n!$ behaves according to the central limit theorem, and the same is then true for $R_{nh}/n!$. Therefore
$$\frac{1}{n!}\sum_{h=-n/2}^{h^*} R_{nh} = \Phi\left(\frac{-1}{4c\sqrt 3}\right)\left(1 + O\!\left(\frac{1}{\sqrt n}\right)\right)$$
uniformly. We can also use the local central limit theorem as stated in [6], which, used with $x_h = h\big/\sqrt{n/12}$, gives
$$\frac{R_{nh}}{n!} = \frac{e^{-(1/2)x_h^2}}{\sqrt{2\pi n/12}}\,(1 + o(1)) \quad \text{uniformly in } h \tag{9}$$
(the derivation is almost identical to the one done in [6]). Now, we use Proposition 3.5. Its range can conveniently be divided into two zones:
$$A_1 = \left\{-\frac{10n^{3/4}}{c} \le h \le h^*\right\} \quad \text{and} \quad A_2 = \left\{-\frac n2 \le h < -\frac{10n^{3/4}}{c}\right\}.$$
Proposition 3.4 and (9) put together imply that
$$\sum_{h \in A_1} R_{nh}\, Q_m\!\left(\frac n2 + h\right) = \frac{e^{-1/24c^2}}{\sqrt{2\pi n/12}} \sum_{h \in A_1} e^{-\frac12 x_h^2 - \frac{h}{c\sqrt n} + O_c(n^{-1/4})}\,(1 + o(1))$$
$$= \frac{e^{-1/24c^2}}{\sqrt{2\pi}} \int_{-\infty}^{-1/(4c\sqrt 3)} e^{-x^2/2 - x/(2c\sqrt 3)}\,dx \left(1 + O\!\left(\frac{1}{n^{1/4}}\right)\right) = \Phi\left(\frac{1}{4c\sqrt 3}\right)\left(1 + O\!\left(\frac{1}{n^{1/4}}\right)\right),$$
by completing the square in the exponent. Now, for $h$ in $A_2$,
$$Q_m\!\left(\frac n2 + h\right) \le Q_m(1) \le \frac{e^{\sqrt n/2c}}{n!}.$$
A standard large deviation bound is given in Chapter 16 of [5]; applying it to our sum, uniformly in $n$ we get
$$\sum_{h \in A_2} \frac{R_{nh}}{n!} \le \frac{2n^{1/4}}{10n^{1/4}\sqrt{2\pi}} \exp\left[-\frac12\left(\frac{10n^{1/4}}{c}\right)^2\right].$$
This means that only zone $A_1$ contributes. Subtracting the uniform part, $\frac1{n!}\sum_{h \in A_1} R_{nh} = \Phi(-1/(4c\sqrt3)) + O(n^{-1/4})$, gives
$$\|Q_m - U\| = \Phi\left(\frac{1}{4c\sqrt 3}\right) - \Phi\left(\frac{-1}{4c\sqrt 3}\right) + O\!\left(\frac{1}{n^{1/4}}\right) = 1 - 2\Phi\left(\frac{-1}{4c\sqrt 3}\right) + O\!\left(\frac{1}{n^{1/4}}\right),$$
so the speed of convergence is as described above.

This theorem also yields a corollary:

Corollary 3.7. If $n$ cards are shuffled $m$ times with $m = \frac32\log_2 n + \theta$, then for large $n$,
$$\|Q_m - U\| = 1 - 2\Phi\left(\frac{-2^{-\theta}}{4\sqrt 3}\right) + O\!\left(\frac{1}{n^{1/4}}\right),$$
with $\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt$.

Therefore, if $\theta$ is large, the distance to uniformity approaches 0, while for $\theta$ very negative, it approaches 1. We can calculate the variation distances for different numbers of cards; we then see that about $\frac32\log_2 n$ shuffles are necessary for shuffling $n$ cards.

Acknowledgements. This paper could never have appeared without my mentor Mohammad Abbas Rezaei, and his patience in correcting my mistakes and explaining to me how to use LaTeX. Also, it could not have been written without the advice of Prof. Lalley and helpful comments from Jacob Perlman and Marcelo Alvisio.
References

[1] P. Diaconis (1998). From Shuffling Cards to Walking Around the Building: An Introduction to Modern Markov Chain Theory. Doc. Math. J. DMV, 187-204.
[2] D. A. Levin, Y. Peres and E. L. Wilmer. Markov Chains and Mixing Times.
[3] D. Bayer and P. Diaconis (1992). Trailing the Dovetail Shuffle to its Lair. Ann. Applied Prob. 2, 294-313.
[4] P. Diaconis and S. Zabell (1982). Updating Subjective Probability. J. Amer. Statist. Assoc. 77, 822-830.
[5] W. Feller (1971). An Introduction to Probability Theory and Its Applications. Wiley, New York.
[6] G. F. Lawler and V. Limic. Random Walk: A Modern Introduction.
[7] L. Rogers and J. Pitman (1981). Markov Functions. Ann. Probab. 9, 573-582.
[8] S. Tanny (1973). A Probabilistic Interpretation of the Eulerian Numbers. Duke Math. J. 40, 717-722.
[9] P. Diaconis and R. Graham (2011). Magical Mathematics: The Mathematical Ideas that Animate Great Magic Tricks. Princeton University Press.