Statistical Data Mining and Medical Signal Detection. Lecture Five: Stochastic Algorithms and MCMC. July 6, Motoya Machida


Statistical Data Mining and Medical Signal Detection
Lecture Five: Stochastic Algorithms and MCMC
July 6, 2011
Motoya Machida

Plotting Density Functions

The function dnorm() returns the normal density value, and the graph of the density is given by

> x = seq(0, 10, by=0.1)
> plot(x, dnorm(x, mean=4.5, sd=1.8), type="l", main="normal Density")

The function dbeta() returns the beta density value.

> x = seq(0, 1, by=0.02)
> plot(x, dbeta(x, shape1=4, shape2=8), type="l", main="beta Density")

The parameters $\alpha_1$ and $\alpha_2$ correspond to shape1 and shape2.

Normal Mixture

Let $f(x; \mu, \sigma)$ be the normal density function with mean $\mu$ and standard deviation (SD) $\sigma$. Given parameters $\mu_1, \sigma_1, \mu_2, \sigma_2$, and $0 < \gamma < 1$, the normal mixture density is defined as

$$g(x; \mu_1, \sigma_1, \mu_2, \sigma_2, \gamma) = \gamma f(x; \mu_1, \sigma_1) + (1 - \gamma) f(x; \mu_2, \sigma_2).$$

When the two means are well separated relative to the SDs, the mixture of two normal distributions is bimodal, and the two peaks are located approximately at the respective means $\mu_1$ and $\mu_2$.

Plotting a Normal Mixture

The function nm() is written in the form

# p = (mu1, sigma1, mu2, sigma2, gamma)
nm = function(x, p=c(0.3, 0.2, 0.8, 0.1, 0.4)){
  p[5] * dnorm(x, p[1], p[2]) + (1-p[5]) * dnorm(x, p[3], p[4])
}

It is executed by

> p0 = c(0.2, 0.2, 0.7, 0.1, 0.4)
> x = seq(0,1,by=0.01)
> plot(x, nm(x,p=p0), type="l", main="normal Mixture")
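As a quick sanity check on the mixture density, the sketch below integrates nm() numerically and locates the two peaks for the parameter p0 used in the plot. The integration range and the two search windows are illustrative assumptions, chosen by eye for this particular p0; they are not part of the course files.

```r
# Normal mixture density from the slide; p = (mu1, sigma1, mu2, sigma2, gamma)
nm <- function(x, p = c(0.3, 0.2, 0.8, 0.1, 0.4)) {
  p[5] * dnorm(x, p[1], p[2]) + (1 - p[5]) * dnorm(x, p[3], p[4])
}
p0 <- c(0.2, 0.2, 0.7, 0.1, 0.4)

# Total mass over a range wide enough to hold essentially all of both components
# (the range [-2, 2] is an assumption suited to this p0)
total_mass <- integrate(nm, lower = -2, upper = 2, p = p0)$value

# Locate each peak with a one-dimensional search in a window around each mean
# (window endpoints are illustrative choices)
peak1 <- optimize(nm, interval = c(0.0, 0.4), maximum = TRUE, p = p0)$maximum
peak2 <- optimize(nm, interval = c(0.5, 0.9), maximum = TRUE, p = p0)$maximum

print(c(total_mass = total_mass, peak1 = peak1, peak2 = peak2))
```

The total mass should be close to 1, and the two peaks should sit close to, but not exactly at, the means 0.2 and 0.7, because each component's tail slightly shifts the other's peak.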

Optimization by Random Propagation

Consider a nonnegative objective function $f(x)$ on the interval $[a, b]$, and the optimization problem of finding

$$x^* = \operatorname{argmax}\{f(x) : a \le x \le b\}.$$

Our search algorithm employs a random propagation strategy.

Search Algorithm
1. Choose the initial state $X_0$ at time $t = 0$.
2. Suppose that $X_t = x$ at time $t$. Then pick the next move $\Delta x$ randomly and set $y = x + \Delta x$.
3. If $a \le y \le b$ and $f(y) > f(x)$, then set $X_{t+1} = y$.
4. Otherwise, stay at $X_{t+1} = x$ at time $t + 1$.
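The four steps of the search algorithm can be sketched in R as follows. This is not the course's search.r: the function name greedy_search, the starting point x0 = 0.5, and the choice of normally distributed steps with standard deviation sigma are all assumptions made for illustration.

```r
# Normal mixture objective; p = (mu1, sigma1, mu2, sigma2, gamma)
nm <- function(x, p = c(0.3, 0.2, 0.8, 0.1, 0.4)) {
  p[5] * dnorm(x, p[1], p[2]) + (1 - p[5]) * dnorm(x, p[3], p[4])
}

# Hypothetical helper: strict-ascent random search on [a, b]
greedy_search <- function(f, a = 0, b = 1, x0 = 0.5, sigma = 0.05,
                          run.time = 500, ...) {
  x <- numeric(run.time + 1)
  x[1] <- x0                                     # step 1: initial state X_0
  for (t in seq_len(run.time)) {
    y <- x[t] + rnorm(1, mean = 0, sd = sigma)   # step 2: y = x + dx (normal step assumed)
    if (y >= a && y <= b && f(y, ...) > f(x[t], ...)) {
      x[t + 1] <- y                              # step 3: accept an uphill move
    } else {
      x[t + 1] <- x[t]                           # step 4: otherwise stay
    }
  }
  x
}

set.seed(1)
p0 <- c(0.2, 0.2, 0.7, 0.1, 0.4)
path <- greedy_search(nm, p = p0)
plot(0:500, path, type = "l", xlab = "time", ylab = "state")
```

Because only strictly uphill moves are accepted, the objective value never decreases along the path, so the search can settle at whichever peak it reaches first rather than at the global maximum.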

Trajectories of Search Algorithm

Consider a normal mixture density function as the objective function of choice. Download the R source files nm.r and search.r onto your own machine, and see how the search algorithm seeks the global maximum.

- Choose a different standard deviation $\sigma$ for the random steps $\Delta x$.
- Use a different parameter $p = (\mu_1, \sigma_1, \mu_2, \sigma_2, \gamma)$ for the objective function.

> source("nm.r")
> source("search.r")
> search(ff=nm, sigma=0.05, run.time=500, random=FALSE, p=c(0.2, 0.2, 0.7, 0.1, 0.4))

Optimization by Random Ascending

In addition to randomizing the search direction, we can relax the strict ascent criterion and allow some moves to descend.

Search Algorithm
1. Choose the initial state $X_0$ at time $t = 0$.
2. Suppose that $X_t = x$ at time $t$. Then pick the next move $\Delta x$ randomly and set $y = x + \Delta x$.
3. Generate a uniform random variable $U$ on $[0, 1]$.
4. If $0 \le y \le 1$ and $f(y)/f(x) > U$, then set $X_{t+1} = y$.
5. Otherwise, stay at $X_{t+1} = x$ at time $t + 1$.

> search(ff=nm, sigma=0.05, run.time=500, random=TRUE, p=c(0.2, 0.2, 0.7, 0.1, 0.4))
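A minimal sketch of the relaxed search follows. As before, this is not the course's search.r: the name relaxed_search, the starting point, and the normal step distribution are assumptions. The only change from strict ascent is that a move is accepted whenever $f(y)/f(x)$ exceeds a fresh uniform draw, so uphill moves are always taken and downhill moves are taken with probability $f(y)/f(x)$.

```r
# Normal mixture objective; p = (mu1, sigma1, mu2, sigma2, gamma)
nm <- function(x, p = c(0.3, 0.2, 0.8, 0.1, 0.4)) {
  p[5] * dnorm(x, p[1], p[2]) + (1 - p[5]) * dnorm(x, p[3], p[4])
}

# Hypothetical helper: random-ascending search on [0, 1]
relaxed_search <- function(f, x0 = 0.5, sigma = 0.05, run.time = 500, ...) {
  x <- numeric(run.time + 1)
  x[1] <- x0                                     # step 1: initial state X_0
  for (t in seq_len(run.time)) {
    y <- x[t] + rnorm(1, mean = 0, sd = sigma)   # step 2: y = x + dx (normal step assumed)
    u <- runif(1)                                # step 3: uniform U on [0, 1]
    if (y >= 0 && y <= 1 && f(y, ...) / f(x[t], ...) > u) {
      x[t + 1] <- y                              # step 4: accept, possibly downhill
    } else {
      x[t + 1] <- x[t]                           # step 5: otherwise stay
    }
  }
  x
}

set.seed(1)
path <- relaxed_search(nm, p = c(0.2, 0.2, 0.7, 0.1, 0.4))
plot(0:500, path, type = "l", xlab = "time", ylab = "state")
```

With a symmetric step distribution this rule coincides with the Metropolis acceptance rule, which is why the path keeps wandering between the peaks instead of freezing at one of them.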

Search Trajectories

See how the evolution of the search produces different trajectories. Save the trajectory data from $t = 0$, and plot the trajectory of moves as a time series.

- Run with (random=TRUE) or without (random=FALSE) the random ascending strategy.
- Choose a different running time N (run.time=N).

> par(mfrow=c(2,1))
> xx = search(ff=nm, sigma=0.05, run.time=500, random=FALSE, p=c(0.2, 0.2, 0.7, 0.1, 0.4), save.time=0)
> plot(0:500, xx, type="l", main="trajectory", xlab="time", ylab="state")

A Long Run Behavior

Observe the long-run behavior of the search moves, and investigate how they are distributed. Save the trajectory data from $t = 1000$ (save.time=1000) to $N = 2000$ (run.time=2000). Construct the histogram of the data, and use a different parameter $p = (\mu_1, \sigma_1, \mu_2, \sigma_2, \gamma)$.

> par(mfrow=c(2,1))
> xx = search(ff=nm, sigma=0.05, run.time=2000, random=TRUE, p=c(0.2, 0.2, 0.7, 0.1, 0.4), save.time=1000)
> hist(xx, freq=FALSE, breaks=seq(0,1,by=0.05), col="blue", main="trajectory", xlab="state")

Random beta-walk

Here we introduce a random walk on $(0, 1)$ in which the next step $X_{t+1}$ is drawn from the beta distribution with parameters $\alpha = \max(\delta + \theta X_t, 1)$ and $\beta = \max(\delta + \theta(1 - X_t), 1)$. These parameters control the behavior of the walk.

- A smaller $\delta$ keeps a sample path closer to either boundary.
- The larger the value of $\theta$, the smaller each step becomes.

Thus, $0 < \delta < 1$ changes the shape of the stationary distribution of the random walk, and $\theta > 0$ influences its speed of convergence. Later we use this random walk as a proposal Markov chain on $(0, 1)$.
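One step of the beta walk, and a sample path built from it, might look like the sketch below. The names beta_step and beta_walk and the default starting point 0.5 are hypothetical; the course's bwalk.r may be organized differently.

```r
# Hypothetical stand-in for a single move of the beta walk
beta_step <- function(x, delta = 0.8, theta = 20) {
  a <- max(delta + theta * x, 1)         # alpha = max(delta + theta * X_t, 1)
  b <- max(delta + theta * (1 - x), 1)   # beta  = max(delta + theta * (1 - X_t), 1)
  rbeta(1, shape1 = a, shape2 = b)
}

# Hypothetical helper: run the walk for run.time steps from init.state
beta_walk <- function(run.time = 500, init.state = 0.5, delta = 0.8, theta = 20) {
  x <- numeric(run.time + 1)
  x[1] <- init.state
  for (t in seq_len(run.time)) {
    x[t + 1] <- beta_step(x[t], delta, theta)
  }
  x
}

set.seed(1)
path <- beta_walk()
plot(0:500, path, type = "l", xlab = "time", ylab = "state")
```

A larger theta concentrates the beta distribution around the current state, which is why each step becomes smaller as theta grows.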

Sample Paths

Download the R code bwalk.r and observe a sample path of the beta random walk.

- Choose different values of $\delta$ and $\theta$.
- Change the running time and see how long it takes to display stationary behavior.

> source("bwalk.r")
> sample.path = rwalk(move=bmove, trajectory=TRUE, run.time=500, delta=0.8, theta=20)
> par(mfrow=c(2,1))
> plot(sample.path, type="l", xlab="time", ylab="state")

A Long Run Behavior

The long-run behavior can be observed from the histogram of $X_t$ collected at the end of repeated runs.

- Change the running time, and see whether the distribution of $X_t$ is different.
- Obtain the histogram of $X_t$ for different choices of $\delta$ and $\theta$.
- Change the initial point by adding init.state=1, and see whether it affects the long-run behavior.

> sample.data = rwalk(move=bmove, run.time=100, delta=0.8, theta=20, sample.size=500)
> hist(sample.data, freq=FALSE, breaks=seq(0,1,by=0.05), col="red")

Emergence of Markov Chain Monte Carlo

In reality the state space for $\pi(x)$ is not $\mathbb{R}$. Either it is a subset of $\mathbb{R}^n$ (in Bayesian applications), or it has a complex discrete structure (e.g., the Ising model). For such models, neither the inverse probability transform nor the resampling method is applicable in general.

By way of the Markov chain convergence theorem one can construct a Markov chain $(X_t)$ whose stationary distribution is $\pi$:

$$X_t \xrightarrow{\mathcal{L}} \pi \quad \text{as } t \to \infty.$$

This methodology gives a way to overcome the limitation of Monte Carlo simulation. But how can we make it happen?

Metropolis-Hastings Algorithm

Let $\pi$ be a pdf of interest on a state space $S$, and let $Q$ be a transition probability on $S$. Define the acceptance probability by

$$A(x, y) := \min\left\{ \frac{\pi(y)\, Q(y, x)}{\pi(x)\, Q(x, y)},\ 1 \right\}$$

for a move from $x$ to $y$ satisfying $Q(x, y) \neq 0$ and $Q(y, x) \neq 0$.

Metropolis-Hastings Algorithm (MHA)
1. Choose an initial state $X_0$ at time $t = 0$.
2. Suppose that $X_t = x$ at time $t$. Then pick $y$ according to $Q(x, \cdot)$.
3. Move to $X_{t+1} = y$ with probability $A(x, y)$, or stay at $X_{t+1} = x$ with probability $1 - A(x, y)$.
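The three MHA steps can be sketched in R with the beta walk as the proposal $Q$ and the normal mixture as the target $\pi$. The names beta_shapes, q_dens, and mh_beta are hypothetical and this is not the course's metro.r; in particular, this sketch records one long chain, whereas the course function appears to collect the end states of many shorter runs. Because the beta proposal is not symmetric, the ratio $Q(y, x)/Q(x, y)$ has to be kept in the acceptance probability.

```r
# Target: normal mixture; p = (mu1, sigma1, mu2, sigma2, gamma).
# Only ratios of the target are used, so it need not be normalized on [0, 1].
nm <- function(x, p = c(0.3, 0.2, 0.8, 0.1, 0.4)) {
  p[5] * dnorm(x, p[1], p[2]) + (1 - p[5]) * dnorm(x, p[3], p[4])
}

# Beta-walk shape parameters for a proposal made from state x
beta_shapes <- function(x, delta, theta) {
  c(max(delta + theta * x, 1), max(delta + theta * (1 - x), 1))
}

# Proposal density Q(x, y): density of moving from x to y
q_dens <- function(x, y, delta, theta) {
  s <- beta_shapes(x, delta, theta)
  dbeta(y, s[1], s[2])
}

# Hypothetical helper: Metropolis-Hastings with a beta-walk proposal
mh_beta <- function(target, run.time = 2000, init.state = 0.5,
                    delta = 0.8, theta = 20) {
  x <- numeric(run.time + 1)
  x[1] <- init.state                               # step 1: initial state X_0
  for (t in seq_len(run.time)) {
    cur <- x[t]
    s <- beta_shapes(cur, delta, theta)
    y <- rbeta(1, s[1], s[2])                      # step 2: y drawn from Q(x, .)
    ratio <- (target(y) * q_dens(y, cur, delta, theta)) /
             (target(cur) * q_dens(cur, y, delta, theta))
    # step 3: accept with probability A(x, y) = min(ratio, 1);
    # a non-finite ratio (degenerate proposal) is treated as a rejection
    accept <- is.finite(ratio) && runif(1) < min(ratio, 1)
    x[t + 1] <- if (accept) y else cur
  }
  x
}

set.seed(1)
chain <- mh_beta(nm)
hist(chain, freq = FALSE, breaks = seq(0, 1, by = 0.05), col = "red")
```

The histogram can be compared by eye with the shape of nm(x) on $[0, 1]$; the curve itself is not normalized on that interval, so only the shape is comparable.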

MHA for Normal Mixture

Here we use a random beta walk as a proposal chain, and run the Metropolis-Hastings Algorithm to generate samples from a normal mixture density on $[0, 1]$.

> source("nm.r")
> source("metro.r")
> sample = bwalk.metro(ff=nm, run.time=100, delta=0.8, theta=20, sample.size=500)
> hist(sample, freq=FALSE, breaks=seq(0,1,by=0.05), col="red")
> x = seq(0,1,by=0.01)
> lines(x, nm(x))

Change the running time and the parameters of the random beta walk, and see how the histogram differs from the target normal mixture distribution.

Markov chain

Let $S$ be a discrete state space and let $(X_t,\ t = 0, 1, \ldots)$ be a discrete-time stochastic process taking values in $S$. Then $X_t$ is called a Markov chain if

$$P(X_t = x_t \mid X_s = x_s,\ s \le t') = P(X_t = x_t \mid X_{t'} = x_{t'}) \quad \text{for } t' < t.$$

Let $P(x, y)$ be a transition probability, that is, a probability distribution $P(x, \cdot)$ on $S$ for each $x \in S$. Then one can construct a Markov chain $X_t$ so that

$$P(X_t = y \mid X_{t-1} = x) = P(x, y).$$

Then we can easily see that

$$P(X_{t+n} = y \mid X_t = x) = P^n(x, y),$$

where $P^n$ is the $n$-th power of the transition matrix.
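The identity for $n$-step transitions can be checked numerically on a small example. The three-state transition matrix below and the helper mat_pow are made up for illustration; they do not come from the lecture.

```r
# Illustrative 3-state transition matrix (rows sum to 1)
P <- matrix(c(0.50, 0.50, 0.00,
              0.25, 0.50, 0.25,
              0.00, 0.50, 0.50), nrow = 3, byrow = TRUE)

# Hypothetical helper: n-th matrix power by repeated multiplication
mat_pow <- function(P, n) {
  out <- diag(nrow(P))
  for (i in seq_len(n)) out <- out %*% P
  out
}

# Two-step probability of going from state 1 to state 3
P2 <- mat_pow(P, 2)
two_step_by_hand <- sum(P[1, ] * P[, 3])   # sum over the intermediate state

# P^5 should equal P^2 %*% P^3 up to rounding
P5 <- mat_pow(P, 5)
ck_gap <- max(abs(P5 - mat_pow(P, 2) %*% mat_pow(P, 3)))

print(P2[1, 3])
print(two_step_by_hand)
print(ck_gap)
```

State 3 cannot be reached from state 1 in one step, but it can be reached in two steps through state 2, with probability $0.5 \times 0.25 = 0.125$.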

Property of Time-Reversibility

Let $\pi$ be a probability distribution on $S$. We call $\pi$ stationary if

$$\sum_{x \in S} \pi(x) P(x, y) = \pi(y) \quad \text{for all } y \in S.$$

A Markov chain $\tilde{P}$ is said to be the time-reversed chain of $P$ if the detailed balance

$$\pi(x) P(x, y) = \pi(y) \tilde{P}(y, x)$$

holds between $P$ and $\tilde{P}$. Moreover, $P$ is called a reversible Markov chain when $P = \tilde{P}$. In particular, if the probability distribution $\pi$ satisfies the detailed balance between the two Markov chains $P$ and $\tilde{P}$, then $\pi$ is stationary for both $P$ and $\tilde{P}$.
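A small numerical illustration may help here. The three-state matrix below is a made-up example of a chain that is not reversible: it drifts around a cycle, its stationary distribution is uniform, and its time reversal drifts the other way. The variable names are assumptions for this sketch.

```r
# Illustrative 3-state chain with a cyclic drift (rows and columns sum to 1)
P <- matrix(c(0.1, 0.6, 0.3,
              0.3, 0.1, 0.6,
              0.6, 0.3, 0.1), nrow = 3, byrow = TRUE)
pi_s <- rep(1 / 3, 3)                      # uniform, since P is doubly stochastic

# Stationarity: sum_x pi(x) P(x, y) = pi(y)
stat_gap <- max(abs(as.vector(pi_s %*% P) - pi_s))

# Time reversal from the detailed balance pi(x) P(x, y) = pi(y) P_rev(y, x)
flow  <- pi_s * P                          # flow[x, y] = pi(x) P(x, y)
P_rev <- t(flow) / pi_s                    # P_rev[y, x] = pi(x) P(x, y) / pi(y)

# pi is stationary for the reversed chain as well
rev_stat_gap <- max(abs(as.vector(pi_s %*% P_rev) - pi_s))

# P is reversible only if it equals its own time reversal
is_reversible <- isTRUE(all.equal(P, P_rev))

print(P_rev)
print(c(stat_gap = stat_gap, rev_stat_gap = rev_stat_gap))
print(is_reversible)
```

Here the reversed chain is simply the transpose of $P$, so the two chains share the stationary distribution even though $P$ itself is not reversible.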

Ergodicity and Limit Theorem

We call $(X_t)$ (and its transition probability $P$) an ergodic Markov chain if $P$ is irreducible and aperiodic. Now suppose that we have devised an ergodic Markov chain whose stationary distribution is $\pi$. Then we have the following convergence theorem:

If $P$ is irreducible and aperiodic, then there is a unique stationary distribution $\pi$ such that

$$P^n(x, y) = P(X_n = y \mid X_0 = x) \to \pi(y) \quad \text{as } n \to \infty,$$

regardless of the choice of $x$.

In terms of the sample path, it implies that

$$X_t \xrightarrow{\mathcal{L}} \pi \quad \text{as } t \to \infty.$$
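The convergence theorem can be seen numerically on a small chain. The three-state matrix below is an illustrative example (irreducible, and aperiodic because every state can stay put); its stationary distribution $(0.25, 0.5, 0.25)$ can be verified by hand. The helper mat_pow is hypothetical.

```r
# Illustrative irreducible, aperiodic 3-state chain
P <- matrix(c(0.50, 0.50, 0.00,
              0.25, 0.50, 0.25,
              0.00, 0.50, 0.50), nrow = 3, byrow = TRUE)
pi_s <- c(0.25, 0.50, 0.25)                # stationary distribution of P

# Hypothetical helper: n-th matrix power by repeated multiplication
mat_pow <- function(P, n) {
  out <- diag(nrow(P))
  for (i in seq_len(n)) out <- out %*% P
  out
}

# Check stationarity, then compare every row of P^n with pi
stat_gap <- max(abs(as.vector(pi_s %*% P) - pi_s))
P50 <- mat_pow(P, 50)
limit_gap <- max(abs(P50 - matrix(pi_s, nrow = 3, ncol = 3, byrow = TRUE)))

print(mat_pow(P, 2))   # rows still depend on the starting state
print(P50)             # rows are numerically identical to pi
print(limit_gap)
```

Each row of $P^n$ is the distribution of $X_n$ for one starting state, so rows that agree with each other mean the starting state has been forgotten.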

Transition Probability for MHA

One should choose $Q$ to be irreducible on $S$, and call it a proposal chain. In terms of the transition probability, the Metropolis-Hastings algorithm is given by

$$P(x, y) = \begin{cases} Q(x, y)\, A(x, y) & \text{if } x \neq y; \\ Q(x, x) + \sum_{z \in S} Q(x, z)\,(1 - A(x, z)) & \text{if } x = y. \end{cases}$$

Then $P$ is an irreducible, aperiodic, and reversible transition probability with the stationary distribution $\pi$.
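On a finite state space the Metropolis-Hastings transition matrix can be built explicitly and checked. The target pi_s and the asymmetric proposal Q on four states below are made-up values for illustration, and the variable names are assumptions of this sketch.

```r
# Illustrative target distribution and asymmetric proposal on 4 states
pi_s <- c(0.1, 0.2, 0.3, 0.4)
Q <- matrix(c(0.5, 0.5, 0.0, 0.0,
              0.3, 0.2, 0.5, 0.0,
              0.0, 0.3, 0.2, 0.5,
              0.0, 0.0, 0.5, 0.5), nrow = 4, byrow = TRUE)

# Acceptance probabilities A(x, y) = min(pi(y) Q(y, x) / (pi(x) Q(x, y)), 1),
# defined where Q(x, y) > 0 (and set to 0 elsewhere, where no move is proposed)
flow <- pi_s * Q                           # flow[x, y] = pi(x) Q(x, y)
A <- matrix(0, 4, 4)
pos <- Q > 0
A[pos] <- pmin(t(flow)[pos] / flow[pos], 1)
diag(A) <- 1                               # a proposed move x -> x is always accepted

# Transition matrix: off-diagonal Q * A, diagonal from the formula for x = y
P <- Q * A
diag(P) <- diag(Q) + rowSums(Q * (1 - A))

# Checks: rows sum to 1, pi is stationary, and detailed balance holds
row_gap  <- max(abs(rowSums(P) - 1))
stat_gap <- max(abs(as.vector(pi_s %*% P) - pi_s))
bal <- pi_s * P                            # bal[x, y] = pi(x) P(x, y)
balance_gap <- max(abs(bal - t(bal)))

print(round(P, 4))
print(c(row_gap = row_gap, stat_gap = stat_gap, balance_gap = balance_gap))
```

The matrix of flows $\pi(x) P(x, y)$ being symmetric is exactly the detailed balance condition, and stationarity of $\pi$ follows by summing it over $x$.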

Reversibility for MHA

The detailed balance clearly holds when either $x = y$ or $\pi(x) Q(x, y) = \pi(y) Q(y, x)$. Suppose that $x \neq y$ and that $\pi(x) Q(x, y) > \pi(y) Q(y, x)$. Then

$$A(x, y) = \frac{\pi(y)\, Q(y, x)}{\pi(x)\, Q(x, y)} \quad \text{and} \quad A(y, x) = 1,$$

and therefore we obtain

$$\pi(x) P(x, y) = \pi(x) Q(x, y) A(x, y) = \pi(y) Q(y, x) = \pi(y) P(y, x).$$

It also holds in the case $\pi(x) Q(x, y) < \pi(y) Q(y, x)$ by symmetry.
