
Math 5040: Markov chain Monte Carlo methods
S. Ethier

References:
1. Sheldon Ross, Introduction to Probability Models, Section 4.9.
2. Gregory Lawler, Introduction to Stochastic Processes, Section 7.3.
3. Persi Diaconis, The Markov Chain Monte Carlo Revolution, BAMS, 2008.

Let h be a positive function on a large finite set S. How do we simulate an S-valued random variable X with distribution

π(x) = P(X = x) = h(x) / ∑_{y ∈ S} h(y)?  (1)

Typically, we do not know the value of the sum in the denominator. For example, h could be identically 1, but we might not know, or be able to count, the number of elements in S. The idea of MCMC is to find a Markov chain on S with stationary distribution π, and to simulate the Markov chain for sufficiently many steps that it is approximately in equilibrium. The proportion of time spent in state i will be approximately π(i).

Example. Let S be the set of all 25 × 25 matrices of 0s and 1s with no adjacent 1s in the same row or column. Let M be uniformly distributed over S. We want to estimate the expected proportion of 1s in M. The first problem is to construct the Markov chain. Consider the following one-step transition. Starting at M ∈ S, choose an entry (i, j) at random from {1, 2, ..., 25}². If M(i, j) = 1, then change the entry at (i, j) to 0. If M(i, j) = 0, then change the entry at (i, j) to 1 if the resulting matrix belongs to S; otherwise leave it unchanged. Then P(M_1, M_2) = 1/25² if M_1 and M_2 are elements of S that differ in exactly one entry, and P(M_1, M_1) = j/25², where j is the number of zero entries of M_1 that cannot be changed to a 1. This MC is irreducible and aperiodic with P symmetric, hence doubly stochastic. So the uniform distribution is stationary.

In[89]:= K = 25; M = ConstantArray[0, {K, K}]; steps = 1000000; ones = 0; sum = 0;
(* one step of the chain: pick a random entry; a 1 is always flipped to 0,
   and a 0 is flipped to 1 only if no nearest neighbor is a 1 *)
For[n = 1, n <= steps, n++,
 i = Floor[K*RandomReal[]] + 1; j = Floor[K*RandomReal[]] + 1;
 If[M[[i, j]] == 1,
  M[[i, j]] = 0; ones--,
  If[(i == 1 || M[[Max[i - 1, 1], j]] == 0) && (i == K || M[[Min[i + 1, K], j]] == 0) &&
    (j == 1 || M[[i, Max[j - 1, 1]]] == 0) && (j == K || M[[i, Min[j + 1, K]]] == 0),
   M[[i, j]] = 1; ones++]];
 sum += ones];   (* running count of 1s, accumulated over all steps *)
Print[N[sum/(steps*K*K)]];

0.230537

Remark. Instead of sampling the Markov chain at every step, we might sample it every 2500 steps, say; the observations would then be closer to independent. How would we estimate the size of the set of 25 × 25 matrices of 0s and 1s with no adjacent 1s in the same row or column? This is a tricky problem, but see Lawler and Coyle's Lectures on Contemporary Probability (1999) for a suggested solution. It is known that the number |S_N| of N × N matrices of 0s and 1s with no adjacent 1s in the same row or column satisfies β = lim_{N→∞} |S_N|^{1/N²}, where 1.50304 ≤ β ≤ 1.50351; roughly, |S_N| ≈ β^{N²}.

Reversibility. The method works well when the function h is identically 1, but in the general case we need the concept of reversibility. Consider a M.C. X_0, X_1, X_2, ... with transition matrix P and stationary initial distribution π; it is then a stationary process. Now reverse the process: X_n, X_{n−1}, X_{n−2}, .... It is Markovian with transition probabilities

Q(i, j) = P(X_n = j | X_{n+1} = i) = P(X_n = j) P(X_{n+1} = i | X_n = j) / P(X_{n+1} = i) = π(j)P(j, i)/π(i).

If Q(i, j) = P(i, j) for all i, j, or equivalently,

π(i)P(i, j) = π(j)P(j, i) for all i, j,  (2)

then the M.C. is said to be reversible.

The equations π(i)P(i, j) = π(j)P(j, i) for all i, j are called the detailed balance equations. Given a transition matrix P with a unique stationary distribution π, if we can find x(i) such that x(i)P(i, j) = x(j)P(j, i) for all i, j, and ∑_i x(i) = 1, then summing over i we get ∑_i x(i)P(i, j) = ∑_i x(j)P(j, i) = x(j), so x is a stationary distribution and hence x(i) = π(i) by uniqueness of stationary distributions.

Example. Recall the random walk on a connected finite graph. There are weights w_ij associated with the edge from i to j. If we define

P(i, j) = w_ij / ∑_k w_ik,

where the sum extends over all neighbors k of vertex i, we have an irreducible finite M.C., so there is a unique stationary distribution. What is it? Check the detailed balance equations. Use w_ij = w_ji to derive

π(i) = ∑_j w_ij / ∑_i ∑_j w_ij.
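As a quick check (a sketch, not from the original notes), the following Mathematica code takes an arbitrary symmetric weight matrix w on four vertices, builds P, verifies the detailed balance equations, and confirms that the candidate π above is stationary.

w = {{0, 1, 2, 0}, {1, 0, 1, 3}, {2, 1, 0, 1}, {0, 3, 1, 0}};  (* symmetric edge weights, example values *)
P = Table[w[[i, j]]/Total[w[[i]]], {i, 4}, {j, 4}];   (* P(i,j) = w_ij / sum_k w_ik *)
pi = Table[Total[w[[i]]]/Total[w, 2], {i, 4}];        (* candidate stationary distribution *)
And @@ Flatten[Table[pi[[i]] P[[i, j]] == pi[[j]] P[[j, i]], {i, 4}, {j, 4}]]  (* detailed balance: True *)
pi.P == pi   (* hence pi is stationary: True *)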

How do we find a Markov chain for general h? Here is the Hastings–Metropolis algorithm. Let Q be any irreducible transition matrix on S and define P to be the transition matrix

P(i, j) = Q(i, j)α(i, j) if i ≠ j,
P(i, i) = Q(i, i) + ∑_{k: k ≠ i} Q(i, k)(1 − α(i, k)),  (3)

where

α(i, j) = min( h(j)Q(j, i) / [h(i)Q(i, j)], 1 ).  (4)

Interpretation: Instead of making a transition from i to j with probability Q(i, j), we censor that transition with probability 1 − α(i, j). That is, with probability α(i, j) the proposed transition from i to j is accepted, and with probability 1 − α(i, j) the proposed transition from i to j is rejected and the chain stays at i.

If α(i, j) = h(j)Q(j, i)/[h(i)Q(i, j)], then α(j, i) = 1 and

h(i)P(i, j) = h(i)Q(i, j)α(i, j) = h(i)Q(i, j) · h(j)Q(j, i)/[h(i)Q(i, j)] = h(j)Q(j, i) = h(j)Q(j, i)α(j, i) = h(j)P(j, i).  (5)

Similarly if α(i, j) = 1. So the detailed balance equations hold for P. In particular, the unique stationary distribution is proportional to the function h.
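To illustrate (a minimal sketch, not from the original notes), the following Mathematica code runs the Hastings–Metropolis chain for an arbitrary positive h on {1, ..., m}, with the uniform proposal Q(i, j) = 1/m; since this Q is symmetric, it cancels in (4).

m = 5; h = {1., 2., 3., 4., 5.};   (* any positive weights; the target pi is h/Total[h] *)
state = 1; counts = ConstantArray[0, m];
For[n = 1, n <= 100000, n++,
 j = RandomInteger[{1, m}];               (* propose j from the uniform Q *)
 alpha = Min[h[[j]]/h[[state]], 1];       (* acceptance probability (4), Q symmetric *)
 If[RandomReal[] < alpha, state = j];     (* accept, or stay at the current state *)
 counts[[state]]++];
N[counts/Total[counts]]   (* should be close to {1, 2, 3, 4, 5}/15 *)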

Example. Let S be the set of all K × K matrices of 0s and 1s. (Physicists often take matrices of 1s and −1s instead, to indicate spin orientation.) Let g be a symmetric function on {0, 1}² (i.e., g(0, 1) = g(1, 0)). Let β be a positive parameter. Define the energy of the matrix M to be

E = ∑_{(i,j) ∼ (i′,j′)} g(M(i, j), M(i′, j′)),  (6)

where (i, j) ∼ (i′, j′) means that (i, j) and (i′, j′) are nearest neighbors, that is, |i − i′| + |j − j′| = 1. We assume that the probability distribution of interest on S gives equal weight to all matrices with the same energy, so we let

h(M) = e^{−βE}.  (7)

The previous example is the limiting case with g(0, 0) = g(0, 1) = g(1, 0) = 0, g(1, 1) = 1, and β = ∞. Another important example is the Ising model, with g(0, 0) = g(1, 1) = 0 and g(0, 1) = 1. (Configurations, i.e., matrices, with many pairs of neighboring spins pointing in opposite directions have high energy.) Starting with M_1, choose an entry at random from the K² possibilities and let M_2 be the matrix obtained by changing the chosen entry. If h(M_2)/h(M_1) ≥ 1, then return matrix M_2. If h(M_2)/h(M_1) = q < 1, then with probability q return the matrix M_2 and with probability 1 − q return the matrix M_1. This is precisely the Hastings–Metropolis algorithm.
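As a sketch (not from the original notes; K, beta, and the step count are arbitrary example values), here is this Metropolis chain for the Ising choice of g, computing the change in energy locally from the at most four neighbors of the flipped entry rather than recomputing (6) from scratch:

K = 10; beta = 0.5; M = RandomInteger[1, {K, K}];   (* random initial 0-1 matrix *)
g[a_, b_] := If[a != b, 1, 0];                      (* Ising: opposite neighbors cost 1 *)
nbrs[i_, j_] := Select[{{i - 1, j}, {i + 1, j}, {i, j - 1}, {i, j + 1}},
   1 <= #[[1]] <= K && 1 <= #[[2]] <= K &];
For[n = 1, n <= 100000, n++,
 {i, j} = RandomInteger[{1, K}, 2];                 (* entry to flip *)
 old = M[[i, j]]; new = 1 - old;
 dE = Sum[g[new, M[[p[[1]], p[[2]]]]] - g[old, M[[p[[1]], p[[2]]]]], {p, nbrs[i, j]}];
 If[RandomReal[] < Min[Exp[-beta*dE], 1], M[[i, j]] = new]];  (* accept with prob min(h(M2)/h(M1), 1) *)
N[Total[M, 2]/K^2]   (* proportion of 1s in the sampled matrix *)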

Gibbs sampler. There is a widely used version of the Hastings–Metropolis algorithm called the Gibbs sampler. Here we want to simulate the distribution of an n-dimensional random vector X = (X_1, X_2, ..., X_n). Again, its probability density π(x) is specified only up to a constant multiple. We assume that we can simulate a random variable with density equal to the conditional density of X_i, given the values of X_j for all j ≠ i. Start from x = (x_1, ..., x_n). Choose a coordinate i at random. Then choose a value x according to the conditional density of X_i, given that X_j = x_j for all j ≠ i. Consider the state y = (x_1, ..., x_{i−1}, x, x_{i+1}, ..., x_n) as a candidate for a transition. Apply Hastings–Metropolis with

Q(x, y) = (1/n) P(X_i = x | X_j = x_j for all j ≠ i) = π(y) / [n P(X_j = x_j for all j ≠ i)].

The acceptance probability is

α(x, y) = min( π(y)Q(y, x) / [π(x)Q(x, y)], 1 ) = min( π(y)π(x) / [π(x)π(y)], 1 ) = 1.

Hence the candidate y is always accepted.
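For instance (a minimal sketch, not from the original notes), take (X_1, X_2) uniformly distributed on {(x_1, x_2) ∈ {0, ..., 9}² : x_1 + x_2 ≤ 9}. Each conditional distribution is uniform on an interval of integers, so the Gibbs update can be simulated directly:

x = {0, 0}; tally = 0; steps = 100000;
For[n = 1, n <= steps, n++,
 i = RandomInteger[{1, 2}];                    (* choose a coordinate at random *)
 x[[i]] = RandomInteger[{0, 9 - x[[3 - i]]}];  (* draw from the conditional distribution *)
 If[x[[1]] + x[[2]] == 9, tally++]];
N[tally/steps]   (* estimates P(X1 + X2 = 9) = 10/55 = 0.1818... *)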

The method also applies when the random variables are continuous. Example. Suppose we want to simulate n points in the unit disk at least d apart. Take n = 35 and d = 1/4. We can use the Gibbs sampler. Initialize as follows.

[Figure: the 35 initial points, arranged on a rectangular grid inside the unit circle.]

At each step, choose one of the n points at random, say point m. Then simulate a random point in the unit disk (rejection method) until we get one that is at distance at least d from each of the other points, and replace point m with the new one.

n = 35; d = 1/4;
coords = {{-.75, 0.}, {-.5, 0.}, {-.25, 0.}, {0., 0.}, {.25, 0.}, {.5, 0.}, {.75, 0.},
  {-.75, .25}, {-.5, .25}, {-.25, .25}, {0., .25}, {.25, .25}, {.5, .25}, {.75, .25},
  {-.75, -.25}, {-.5, -.25}, {-.25, -.25}, {0., -.25}, {.25, -.25}, {.5, -.25}, {.75, -.25},
  {-.75, .5}, {-.5, .5}, {-.25, .5}, {0., .5}, {.25, .5}, {.5, .5}, {.75, .5},
  {-.75, -.5}, {-.5, -.5}, {-.25, -.5}, {0., -.5}, {.25, -.5}, {.5, -.5}, {.75, -.5}};
Show[ContourPlot[x^2 + y^2 == 1, {x, -1, 1}, {y, -1, 1}], ListPlot[coords]]
For[run = 1, run <= 1000, run++,
 m = IntegerPart[n RandomReal[]] + 1;   (* choose one of the n points at random *)
 For[try = 1, try <= 1000, try++,
  x = 2 RandomReal[] - 1; y = 2 RandomReal[] - 1;   (* candidate point: rejection method *)
  If[x^2 + y^2 < 1,
   flag = 1;
   For[i = 1, i <= n, i++,
    If[Sqrt[(coords[[i, 1]] - x)^2 + (coords[[i, 2]] - y)^2] <= d && i != m, flag = 0]];
   If[flag == 1, coords[[m, 1]] = x; coords[[m, 2]] = y; try = 1001]]]];  (* accept and stop trying *)
Show[ContourPlot[x^2 + y^2 == 1, {x, -1, 1}, {y, -1, 1}], ListPlot[coords]]

Output: [Figure: the resulting configuration of the 35 points in the unit disk.]

Example. Self-avoiding walks (SAWs). (From the Lawler–Coyle book.) A SAW of length n in Z^d is a path ω = (ω_0, ω_1, ..., ω_n) with ω_0 = 0, ω_i ∈ Z^d, |ω_i − ω_{i−1}| = 1, and ω_i ≠ ω_j whenever i ≠ j. SAWs are models for polymer chains in chemistry. Let Ω_n be the set of all SAWs of length n. What is |Ω_n|? Notice that

d^n ≤ |Ω_n| ≤ 2d(2d − 1)^{n−1}  (8)

(the lower bound counts the walks whose steps all go in positive coordinate directions). In fact, since

|Ω_{n+m}| ≤ |Ω_n| |Ω_m|,  (9)

it can be shown that

lim_{n→∞} |Ω_n|^{1/n} = β  (10)

exists, where d ≤ β ≤ 2d − 1.

There are a number of interesting questions that one can ask about SAWs. For example, what is the asymptotic probability that two independent n-step SAWs, when combined end to end, form a 2n-step SAW? For another example, how large are SAWs, that is, how does E[|ω_n|²] behave for large n? These questions are too hard even for specialists to answer. But perhaps we can get some sense of what is going on by simulating a random SAW. Notice that the obvious rejection method is extremely inefficient if n is large. Let us assume from now on that d = 2. We use MCMC. We want to simulate the uniform distribution on Ω_n for some specified n. Our approach is based on what is called the pivot algorithm.

Let O denote the set of orthogonal transformations of the plane that map Z² onto Z². These include the rotations by π/2, π, and 3π/2, and the reflections about the coordinate axes and about the lines y = x and y = −x. (We exclude the identity transformation from O, so |O| = 7.) Consider the following MC on Ω_n: Starting with ω = (ω_0, ω_1, ..., ω_n), choose a number at random from {0, 1, ..., n − 1} and call it k. Choose a transformation T at random from O. Consider the walk obtained by fixing the first k steps of the walk but performing the transformation T on the remaining part of the walk, using ω_k as the origin for the transformation. This gives us a new path, which may or may not be a SAW. If it is, return it as the new path; if it isn't, return the original path. This gives us a MC on Ω_n. It is irreducible and aperiodic with a symmetric transition matrix P. So the limiting distribution is uniform.

L = 100;
saw = Table[{n, 0}, {n, 0, L}];   (* initial SAW: a straight line to the right *)
For[step = 1, step <= 2000, step++,
 k = Floor[L RandomReal[]];       (* pivot site, chosen from {0, 1, ..., L-1} *)
 m = Floor[7 RandomReal[]] + 1;   (* one of the 7 nontrivial lattice symmetries *)
 saw1 = saw;                      (* first k steps of saw are unchanged in saw1 *)
 saw2 = Table[saw[[j + 1]] - saw[[k + 1]], {j, k, L}];   (* remaining steps, origin shifted to the pivot *)
 (* transform saw2: m = 1, 2, 3 are the rotations by Pi/2, Pi, 3Pi/2; m = 4, 5 are the
    reflections about the x- and y-axes; m = 6, 7 the reflections about y = x and y = -x *)
 T = Switch[m,
   1, {{0, -1}, {1, 0}},
   2, {{-1, 0}, {0, -1}},
   3, {{0, 1}, {-1, 0}},
   4, {{1, 0}, {0, -1}},
   5, {{-1, 0}, {0, 1}},
   6, {{0, 1}, {1, 0}},
   7, {{0, -1}, {-1, 0}}];
 saw3 = (T.#) & /@ saw2;
 For[j = k, j <= L, j++,
  saw1[[j + 1]] = saw[[k + 1]] + saw3[[j - k + 1]]];   (* insert transformed segment *)
 flag = 1;   (* check whether saw1 is a SAW *)
 For[j = 0, j <= L, j++,
  For[jj = j + 1, jj <= L, jj++,
   If[saw1[[j + 1]] == saw1[[jj + 1]], flag = 0]]];
 If[flag == 1, saw = saw1]];   (* if it is, return saw1; otherwise do nothing *)
Print[saw]
ListPlot[saw, Joined -> True, AspectRatio -> 1]

[Figure: the resulting 100-step self-avoiding walk, plotted with ListPlot.]

Code-breaking example. Reference: Diaconis article, http://www-stat.stanford.edu/~cgates/persi/papers/mcmcrev.pdf. A substitution code uses a map f : {code space} → {usual alphabet, space, comma, period, digits, etc.}, which is unknown. The first step is to download a standard e-text and count the frequencies of the various one-step transitions, thereby getting an estimate of the one-step transition matrix M between consecutive letters. (This is not the MC that we will simulate.) Then we define the plausibility of a particular f by

h(f) = ∏_i M(f(s_i), f(s_{i+1})),

where s_1, s_2, ... is the coded message. Ideally, we would like to find the f that maximizes this function. Instead we will apply MCMC to simulate from the distribution π(f) := h(f) / ∑_g h(g).

Now suppose our symbol space has m elements and our alphabet has n ≥ m elements. Then let S be the set of one-to-one functions f, so |S| = n(n − 1) ⋯ (n − m + 1), which is large if m = n = 40, say. (We ignore the distinction between upper-case and lower-case letters.) What will our MC on S be? Let the proposal matrix Q(f, f′) be determined by requiring that a transition from f correspond to a random switch of two symbols. This Q is symmetric, and we apply the Hastings–Metropolis algorithm to get P. Start with a preliminary guess, f. Change to f′ by a random transposition of the values f assigns to two symbols. If h(f′)/h(f) ≥ 1, accept f′. If q := h(f′)/h(f) < 1, accept f′ with probability q. Otherwise stay with f.
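Here is a minimal Mathematica sketch of this chain (not from the original notes). The transition matrix Mtrans and the coded message msg are random placeholders standing in for the estimates and data described above, and the plausibility is computed on the log scale to avoid underflow:

n = 26;
Mtrans = RandomReal[{0.01, 1}, {n, n}];   (* placeholder for the estimated transition matrix M *)
msg = RandomInteger[{1, n}, 200];         (* placeholder for the coded message s_1, s_2, ... *)
logh[f_] := Sum[Log[Mtrans[[f[[msg[[i]]]], f[[msg[[i + 1]]]]]]], {i, Length[msg] - 1}];
f = Range[n];                             (* preliminary guess: the identity map *)
For[step = 1, step <= 5000, step++,
 {a, b} = RandomSample[Range[n], 2];      (* random transposition of two values *)
 fnew = f; fnew[[{a, b}]] = f[[{b, a}]];
 If[RandomReal[] < Min[Exp[logh[fnew] - logh[f]], 1], f = fnew]];  (* accept with prob min(h(f')/h(f), 1) *)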

The occasional acceptance of a less plausible f′ allows the chain to visit less plausible f's, and keeps it from getting stuck in local maxima. Of course, the space of f's is huge (40! or so). Why should this Metropolis random walk succeed? To investigate this, Marc tried the algorithm out on a problem to which he knew the answer. Figure 2 of the Diaconis article shows a well-known section of Shakespeare's Hamlet. The text was scrambled at random and the Monte Carlo algorithm was run. Figure 3 shows sample output.


From the Diaconis article: Here is an example drawn from course work of Stanford students Marc Coram and Phil Beineke.

Example 1 (Cryptography). Stanford's Statistics Department has a drop-in consulting service. One day, a psychologist from the state prison system showed up with a collection of coded messages; Figure 1 shows part of a typical example. The problem was to decode these messages. Marc guessed that the code was a simple substitution cipher, each symbol standing for a letter, number, punctuation mark or space. Thus, there is an unknown function

f : {code space} → {usual alphabet}.

One standard approach to decrypting is to use the statistics of written English to guess at probable choices for f, try these out, and see if the decrypted messages make sense. To get the statistics, Marc downloaded a standard text (e.g., War and Peace) and recorded the first-order transitions: the proportion of consecutive text symbols from x to y. This gives a matrix M(x, y) of transitions. One may then associate a plausibility to f via

Pl(f) = ∏_i M(f(s_i), f(s_{i+1})).

Courses have designed homework problems around this example [17]. Students are usually able to successfully decrypt messages from fairly short texts; in the prison example, about a page of code was available. The algorithm was run on the prison text. A portion of the final result is shown in Figure 4. It gives a useful decoding that seemed to work on additional texts.