ISyE8843A, Brani Vidakovic Handout

1 MCMC Methodology


Independence of $X_1, \dots, X_n$ is not critical for an approximation of the form

$$E^{\theta|x} h(X) \approx \frac{1}{n} \sum_{i=1}^{n} h(X_i), \qquad X_i \sim \pi(\theta|x).$$

In fact, when the $X_i$ are dependent, the ergodic theorems describe the approximation. An easy and convenient form of dependence is Markov chain dependence. The Markov dependence is well suited to computer simulations, since producing a future realization of the chain requires only the current state.

1.1 Theoretical Background and Notation

Random variables $X_1, X_2, \dots, X_n, \dots$ constitute a Markov chain on a continuous state space if they possess the Markov property,

$$P(X_{n+1} \in A \mid X_1, \dots, X_n) = P(X_{n+1} \in A \mid X_n) = Q(X_n, A) = Q(A \mid X_n),$$

for some probability distribution $Q$. Typically, $Q$ is assumed time-homogeneous, i.e., independent of $n$ ("time"). The transition kernel (from the state at time $n$ to the state at time $n+1$) defines a probability measure on the state space, and we will assume that the density $q$ exists, i.e.,

$$Q(A \mid X_n = x) = \int_A q(x, y)\,dy = \int_A q(y|x)\,dy.$$

A distribution $\Pi$ is invariant if for all measurable sets $A$

$$\Pi(A) = \int Q(A|x)\,\Pi(dx).$$

If the transition density $q$ exists, the density $\pi$ is stationary if $q(x|y)\pi(y) = q(y|x)\pi(x)$. Here and in the sequel we assume that the density for $\Pi$ exists, $\Pi(A) = \int_A \pi(x)\,dx$.

A distribution $\Pi$ is an equilibrium distribution if, for $Q^n(A|x) = P(X_n \in A \mid X_0 = x)$,

$$\lim_{n \to \infty} Q^n(A|x) = \Pi(A).$$

In plain terms, the Markov chain forgets the initial distribution and converges to the stationary distribution.

The Markov chain is irreducible if for each $A$ for which $\Pi(A) > 0$, and for each $x$, one can find $n$ such that $Q^n(A|x) > 0$.

The Markov chain $X_0, \dots, X_n, \dots$ is recurrent if for each $B$ such that $\Pi(B) > 0$,

$$P(X_n \in B \text{ i.o.} \mid X_0 = x) = 1, \quad \text{a.s. (in the distribution of } X_0).$$

It is Harris recurrent if $P(X_n \in B \text{ i.o.} \mid X_0 = x) = 1$ for all $x$. The acronym i.o. stands for infinitely often.
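The remark that averages along a dependent chain still approximate expectations can be illustrated with a small sketch. The chain below is an assumption made for illustration only (it does not appear in the handout): an AR(1) chain $X_{n+1} = aX_n + e_n$ with standard normal $e_n$, whose stationary distribution is $N(0, 1/(1-a^2))$. The values of `a`, `nn`, the starting point, and the function `h` are likewise illustrative.

```matlab
% Ergodic average along a dependent (Markov) chain: a minimal sketch.
% Assumed example: AR(1) chain X_{n+1} = a*X_n + e_n, e_n ~ N(0,1),
% with stationary distribution N(0, 1/(1 - a^2)).
a  = 0.5;             % autoregression coefficient (illustrative)
nn = 100000;          % chain length (illustrative)
x  = zeros(1, nn);
x(1) = 10;            % deliberately poor starting value
for n = 1:nn-1
    % only the current state x(n) is needed to produce x(n+1)
    x(n+1) = a * x(n) + randn;
end
h     = @(t) t.^2;            % function whose expectation is approximated
est   = mean(h(x));           % (1/n) * sum of h(X_i) along the chain
exact = 1 / (1 - a^2);        % E h(X) under the stationary distribution
fprintf('ergodic average = %.3f, stationary value = %.3f\n', est, exact);
```

Although consecutive states are correlated and the chain starts far from its typical range, the average settles near the stationary value, which is the behavior the ergodic theorems describe.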

Figure 1: Nicholas Constantine Metropolis, 1915-1999.

1.2 Metropolis Algorithm

The Metropolis algorithm is fundamental to MCMC development. Assume that the target distribution $\pi$ is known up to a normalizing constant. We would like to construct a chain with $\pi$ as its stationary distribution. As in ARM, we take a proposal distribution $q(x, y) = q(y|x)$, where the proposal for a new value of the chain is $y$, given that the chain is at value $x$. Thus $q$ defines the transition kernel $Q(A, x) = \int_A q(y|x)\,dy$, which is the probability of a transition to some $y \in A$.

Detailed Balance Equation. A Markov chain with transition density $q(x, y) = q(y|x)$ satisfies the detailed balance equation if there exists a distribution $f$ such that

$$q(y|x) f(x) = q(x|y) f(y). \qquad (1)$$

The distribution $f$ is then stationary (invariant) and the chain is reversible. Indeed, if (1) holds,

$$\int q(x|y) f(y)\,dy = \int q(y|x) f(x)\,dy = f(x) \int q(y|x)\,dy = f(x),$$

which is the definition of an invariant distribution.

For a given target distribution $\pi$, the proposal $q$ is admissible if

$$\operatorname{supp} \pi(x) \subset \bigcup_x \operatorname{supp} q(\cdot|x).$$

The Metropolis-Hastings algorithm is universal: one can select an arbitrary proposal distribution that is admissible. Of course, such an arbitrary distribution/kernel cannot be expected to satisfy the detailed balance equation (1) for the target distribution $\pi$, i.e.,

$$q(y|x)\pi(x) \ne q(x|y)\pi(y).$$

Suppose (wlog) that

$$q(y|x)\pi(x) > q(x|y)\pi(y).$$

Then there is a factor $\rho(x, y) \le 1$ such that the above inequality is balanced,

$$q(y|x)\,\rho(x, y)\,\pi(x) = q(x|y)\,\pi(y).$$

By solving with respect to $\rho(x, y)$ one obtains

$$\rho(x, y) = \frac{q(x|y)\pi(y)}{q(y|x)\pi(x)} \wedge 1,$$

where $a \wedge b$ denotes $\min\{a, b\}$. What is the transition kernel corresponding to the modified equation?

$$q_M(y|x) = q(y|x)\rho(x, y) + \mathbf{1}(y = x)\left(1 - \int q(y|x)\rho(x, y)\,dy\right).$$

Metropolis-Hastings Algorithm. Assume that the target distribution $\pi$ is known up to the normalizing constant. This is the case for posteriors, which are always known up to a proportionality constant as products of the likelihood and a prior.

STEP 1 Start with an arbitrary $x_0$ from the support of the target distribution.

STEP 2 At stage $n$, generate a proposal $y$ from $q(y|x_n)$.

STEP 3 Take $x_{n+1} = y$ with probability $\rho(x_n, y) = \frac{q(x_n|y)\pi(y)}{q(y|x_n)\pi(x_n)} \wedge 1$. Otherwise, take $x_{n+1} = x_n$. This random acceptance is done by generating a uniform on $(0, 1)$ random variable $U$ and accepting the proposal $y$ if $U \le \rho(x_n, y)$.

STEP 4 Increase $n$ and return to STEP 2.

Some Common Choices for q. If $q(x|y) = q(y|x)$, i.e., if the kernel is symmetric, the acceptance ratio $\rho(x, y)$ simplifies to

$$\frac{\pi(y)}{\pi(x)} \wedge 1,$$

since the proposal kernels in the numerator and denominator cancel. If in addition $q$ depends on $(x, y)$ only via $|y - x|$, i.e., $q(x, y) = q^*(|y - x|)$ for some distribution $q^*$, the algorithm is called the Metropolis random walk. A symmetric kernel is the original proposal from Metropolis et al. (1953).

If the proposal $q(x, y)$ does not depend on $x$, i.e., $q(y|x) = q(y)$, the algorithm is called the independence Metropolis. It is similar to the acceptance/rejection method (ARM), but unlike the ARM, every step produces a realization from the target distribution. That realization may be repeated many times, which happens when the proposal is not accepted and the current state is repeatedly taken to be the new state.

1.2.1 Examples

[From Johnson and Albert (1999)] A small company improved a product and wants to make inference about the proportion $p$ of potential customers who will buy the product if the new product is preferred to the old one. The company is certain that this proportion will exceed 0.5 and uses the uniform prior on $[0.5, 1]$. Out of 20 customers surveyed, 12 prefer the new product. Find the posterior for $p$.
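As a small illustration of the independence Metropolis described above, the chain can be run directly on the untransformed $p$ with the uniform prior on $[0.5, 1]$ serving as the proposal. This is only a sketch under stated assumptions: the counts (12 of 20) are taken as read from the example, and the iteration count, starting value, and variable names are illustrative and not part of the handout's albertmc.m.

```matlab
% Independence Metropolis for p on [0.5, 1] with a Uniform(0.5, 1) proposal.
% Assumed data, as read from the example: y = 12 of n = 20 prefer the new product.
y = 12;  n = 20;
logpost = @(p) y*log(p) + (n - y)*log(1 - p);   % unnormalized log posterior on [0.5, 1]
nn    = 20000;               % number of iterations (illustrative)
ps    = zeros(1, nn);
p_old = 0.75;                % arbitrary start inside the support
nacc  = 0;                   % number of accepted proposals
for i = 1:nn
    p_prop = 0.5 + 0.5*rand; % proposal q(y) = Uniform(0.5, 1), independent of p_old
    % q is constant on the support, so rho reduces to min(pi(y)/pi(x), 1)
    rho = min(exp(logpost(p_prop) - logpost(p_old)), 1);
    if rand <= rho
        p_old = p_prop;
        nacc  = nacc + 1;
    end
    ps(i) = p_old;           % a rejected proposal repeats the current state
end
fprintf('acceptance rate = %.2f, posterior mean of p = %.3f\n', nacc/nn, mean(ps));
```

Because the proposal density is constant on the support, it cancels in the acceptance ratio even though the kernel is of the independence type rather than a random walk. The normal random-walk proposal on a transformed parameter, used in the handout's own program, avoids having to respect the boundaries of $[0.5, 1]$.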

Figure 2: s = 0.5. Panels (a) and (b).

Since the support of $p$ is $[0.5, 1]$, we transform the parameter by $\theta = \log\frac{p - 0.5}{1 - p}$, so that $\theta \in (-\infty, \infty)$. For such $\theta$ it is easier to specify the proposal, although one can construct a Metropolis chain for the untransformed parameter. The inverse transformation is

$$p = \frac{1/2 + \exp\{\theta\}}{1 + \exp\{\theta\}},$$

with Jacobian $\frac{1/2 \exp\{\theta\}}{(1 + \exp\{\theta\})^2}$, and the density for $\theta$ is proportional to

$$\frac{(1/2 + \exp\{\theta\})^{12} \exp\{\theta\}}{(1 + \exp\{\theta\})^{22}}.$$

The proposal distribution is normal $N(\theta_n, s^2)$, where $\theta_n$ is the current state of the chain and $s^2$ is to be specified. Here is a matlab program illustrating the sampling (albertmc.m at the course web page).

    nn = 40000;              % nn = number of metropolis iterations
    s = 10;                  % s = std of normal proposal density (0.5 in Figure 2)
    burn = 2000;             % burn = burnin amount
    ps = zeros(1, nn);       % p's
    thetas = zeros(1, nn);   % transformed p's
    old = 0;                 % start, theta_0
    for i = 1:nn
        prop = old + s*randn(1,1);   % proposal from N(theta_old, s^2)
        u = rand(1,1);
        ep = exp(prop); eo = exp(old);
        post_p = ((1/2 + ep)^12 * ep)/((1 + ep)^22);
        post_o = ((1/2 + eo)^12 * eo)/((1 + eo)^22);
        new = old;
        if u <= min(post_p/post_o, 1)
            new = prop;      % accept proposal as 'new'
            old = new;       % and set 'old' to be the 'new' for the next iteration
        end
        thetas(i) = new;                          % collect all theta's
        ps(i) = (1/2 + exp(new))/(1 + exp(new));  % back-transformation to p's
    end

Figure 3: s = 10. Panels (a) and (b).

Figures 2 and 3 illustrate the simulation for s = 0.5 and s = 10 in the proposal distribution. Panels (a) depict the histograms for $\theta$ and $p$, while panels (b) depict the last 500 simulations of the chain. Notice that the chain in Figure 2(b) mixes well and indeed resembles a random walk, while its counterpart in Figure 3(b) shows poor mixing.

Weibull Example. The Weibull distribution is used extensively in reliability, queueing theory, and many other engineering applications, partly for its ability to describe different hazard rate behavior and partly for historic reasons. The Weibull distribution parameterized by $\alpha$, the shape or slope, and $\eta^{-1/\alpha}$, the scale,

$$f(x|\alpha, \eta) = \alpha \eta x^{\alpha - 1} e^{-x^{\alpha}\eta},$$

is not a member of the exponential family of distributions, and explicit posteriors for $\alpha$ and $\eta$ are not available. Consider the prior

$$\pi(\alpha, \eta) \propto e^{-\alpha} \eta^{\beta - 1} e^{-\beta\eta},$$

and observations data = [0.2 0.1 0.25];. Imagine that these data are extremely expensive, obtained by performing a destructive inspection of pricey products. Construct an MCMC sampler based on the Metropolis-Hastings algorithm and approximate the posteriors for $\alpha$ and $\eta$. Assume the hyperparameter beta = 2; and the proposal distribution

$$q(\alpha', \eta' \mid \alpha, \eta) = \frac{1}{\alpha\eta} \exp\left\{-\frac{\alpha'}{\alpha} - \frac{\eta'}{\eta}\right\}$$

(product of two exponentials with means $\alpha$ and $\eta$). Note that $q(\alpha', \eta' \mid \alpha, \eta) \ne q(\alpha, \eta \mid \alpha', \eta')$, so $q$ does not cancel in the expression for $\rho$. Some hints that are checked and should work well:

(i) Start with arbitrary initial values, say: alpha = 2; eta = 2;

(ii) Set alphas = []; etas = []; and do:

    data = [0.2 0.1 0.25];  n = length(data);  beta = 2;
    prod1 = prod(data);
    for i = 1:10000
        alpha_prop = - alpha * log(rand);   % exponential proposal with mean alpha
        eta_prop = - eta * log(rand);       % exponential proposal with mean eta
        prod2 = prod( exp( eta * data.^alpha - eta_prop * data.^alpha_prop ) );
        rr = (eta_prop/eta)^(beta-1) * exp(alpha - alpha_prop - beta * ...
             (eta_prop - eta)) * exp(- alpha/alpha_prop - eta/eta_prop + ...
             alpha_prop/alpha + eta_prop/eta) * prod1.^(alpha_prop - alpha) * ...
             prod2 * ((alpha_prop * eta_prop)/(alpha * eta))^(n-1);
        r = min(rr, 1);
        if (rand < r)
            alpha = alpha_prop; eta = eta_prop;
        end
        alphas = [alphas alpha]; etas = [etas eta];
    end

The values alpha_prop and eta_prop are proposals from independent exponential distributions with means alpha and eta. In rr, the first two factors are the prior ratio, the third exponential together with the power n-1 accounts for the proposal ratio, and prod1, prod2 with the remaining power carry the likelihood ratio. Burn in 500 out of 10000 simulations (usually 100-500 is enough) to make sure that there is no influence of the initial values for $\alpha$ and $\eta$, and plot the histograms of their posterior distributions.

    figure(1)
    subplot(1,2,1)
    hist(alphas(500:end), 100)
    subplot(1,2,2)
    hist(etas(500:end), 100)

Finally, report the mean and variance of alphas and etas. These are the desired Bayes estimators with their posterior precisions.

1.3 Gibbs Sampler

The Gibbs sampler, introduced in [11], is a special case of a Single Component Metropolis algorithm. Define $X = (X_1, X_2, \dots, X_p)$. Each step of the algorithm will consist of $p$ coordinatewise updates. Define

$$X_{-i} = (X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_p),$$

and

$$X^n_{-i} = (X_1^{n+1}, \dots, X_{i-1}^{n+1}, X_{i+1}^{n}, \dots, X_p^{n}), \quad i = 1, \dots, p.$$

$X^n_{-i}$ is the update at step $n$, where the first $i - 1$ coordinates are updated to their values at step $n + 1$ and the coordinates at positions $i + 1, i + 2, \dots, p$ are not updated and are still at step $n$. Let $q_i(Y_i \mid X_i^n, X_{-i}^n)$ be the proposal distribution that generates proposals for the $i$th coordinate only.
Then the single component Metropolis proceeds as follows:

Generate a candidate $Y_i \sim q_i(Y_i \mid X_i^n, X_{-i}^n)$.

Accept the candidate $Y_i$ with probability

$$\rho(X_i^n, X_{-i}^n, Y_i) = \frac{\pi(Y_i \mid X_{-i}^n)\, q_i(X_i^n \mid Y_i, X_{-i}^n)}{\pi(X_i^n \mid X_{-i}^n)\, q_i(Y_i \mid X_i^n, X_{-i}^n)} \wedge 1$$

as $X_i^{n+1} = Y_i$. Otherwise set $X_i^{n+1} = X_i^n$.

Go to the next coordinate at step $n$.

Let

$$\pi(X_i \mid X_{-i}) = \frac{\pi(X)}{\int \pi(X)\,dX_i}$$

be the full conditional. The Gibbs sampler is a single step Metropolis algorithm with $q_i(Y_i \mid X_i, X_{-i}) = \pi(Y_i \mid X_{-i})$. Obviously, at each step $n$, $\rho(X_i^n, X_{-i}^n, Y_i) = 1$, and each update in the Gibbs algorithm is accepted.

In more familiar notation, suppose that $\theta = (\theta_1, \dots, \theta_p)$ is a multidimensional parameter of interest. Each component can be univariate or multivariate. Suppose that we can simulate from the conditional densities $\pi(\theta_i \mid \theta_{-i})$, where, as above, $\theta_{-i}$ denotes the parameter vector $\theta$ without the component $i$. If the current state of $\theta$ is $\theta^n = (\theta_1^n, \theta_2^n, \dots, \theta_p^n)$, the Gibbs sampler produces $\theta^{n+1}$ in the following way:

Draw $\theta_1^{n+1}$ from $\pi(\theta_1 \mid \theta_2^n, \theta_3^n, \dots, \theta_p^n)$
Draw $\theta_2^{n+1}$ from $\pi(\theta_2 \mid \theta_1^{n+1}, \theta_3^n, \dots, \theta_p^n)$
Draw $\theta_3^{n+1}$ from $\pi(\theta_3 \mid \theta_1^{n+1}, \theta_2^{n+1}, \theta_4^n, \dots, \theta_p^n)$
...
Draw $\theta_{p-1}^{n+1}$ from $\pi(\theta_{p-1} \mid \theta_1^{n+1}, \theta_2^{n+1}, \dots, \theta_{p-2}^{n+1}, \theta_p^n)$
Draw $\theta_p^{n+1}$ from $\pi(\theta_p \mid \theta_1^{n+1}, \theta_2^{n+1}, \dots, \theta_{p-1}^{n+1})$

Above, we have assumed a fixed updating order. This need not always be the case, since it is possible to generalize the Gibbs sampler in a number of ways. One can use a random updating order, i.e., pick the block to update randomly. It is also possible to update only one block per iteration and to choose the block to update with some preassigned probability; see Gilks et al. [12] for discussion.

1.3.1 Finding the Full Conditionals

The full conditionals needed for implementation of the Gibbs sampler are conceptually easy to find. From the joint distribution of all variables, only the expressions that contain the particular variable enter the conditional distribution. The difficulty is (as always) in finding normalizing constants. Suppose $\theta = (\theta_s, \theta_{-s})$ and we are interested in the full conditional for $\theta_s$.
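The full conditional is

$$\pi(\theta_s \mid \theta_{-s}) = \frac{\pi(\theta_s, \theta_{-s})}{\int \pi(\theta_s, \theta_{-s})\,d\theta_s} \propto \pi(\theta_s, \theta_{-s}).$$

Example. A popular simple model to illustrate the Gibbs sampler and finding the full conditionals is the following [Gilks (from [12]), Chapter 5, page 76.]

$$Y_1, Y_2, \dots, Y_n \sim N(\mu, 1/\tau)$$
$$\mu \sim N(0, 1)$$
$$\tau \sim Gamma(2, 1).$$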

The joint distribution is

$$f(y, \mu, \tau) = \left\{\prod_{i=1}^{n} f(y_i \mid \mu, \tau)\right\} \pi(\mu)\pi(\tau) = (2\pi)^{-(n+1)/2}\, \tau^{n/2} \exp\left\{-\frac{\tau}{2} \sum_{i=1}^{n} (y_i - \mu)^2\right\} \exp\{-\mu^2/2\}\; \tau \exp\{-\tau\}.$$

To find the full conditional for $\mu$ we select the terms from $f(y, \mu, \tau)$ that contain $\mu$ and normalize. Indeed,

$$\pi(\mu \mid \tau, y) = \frac{\pi(\mu, \tau \mid y)}{\pi(\tau \mid y)} = \frac{\pi(\mu, \tau, y)}{\pi(\tau, y)} \propto \pi(\mu, \tau, y).$$

Thus,

$$\pi(\mu \mid \tau, y) \propto \exp\left\{-\frac{\tau}{2} \sum_{i=1}^{n} (y_i - \mu)^2\right\} \exp\{-\mu^2/2\} \propto \exp\left\{-\frac{1}{2}(1 + n\tau)\left(\mu - \frac{\tau \sum y_i}{1 + n\tau}\right)^2\right\},$$

which is the normal $N\left(\frac{\tau \sum y_i}{1 + n\tau}, \frac{1}{1 + n\tau}\right)$ distribution. Similarly,

$$\pi(\tau \mid \mu, y) \propto \tau^{n/2} \exp\left\{-\frac{\tau}{2} \sum_{i=1}^{n} (y_i - \mu)^2\right\} \tau \exp\{-\tau\} = \tau^{n/2 + 1} \exp\left\{-\tau\left[1 + \frac{1}{2} \sum_{i=1}^{n} (y_i - \mu)^2\right]\right\},$$

which is an unnormalized gamma $Gamma\left(2 + n/2,\; 1 + \frac{1}{2}\sum_{i=1}^{n} (y_i - \mu)^2\right)$ density.

The matlab code mcmc.m implements the sampler. The function rand_gamma.m generates random gamma variates and is a part of BayesLab. We simulated n = 30 observations from the $N(0, 4^2)$ distribution and started the Gibbs sampler with $\mu = 0$ and $\tau = 2$.

    n = 30;                  % sample size
    randn('state', 1);       % fix the seed (assumed value)
    y = 4 * randn(1,n);      % data from N(0, 4^2) (true mean 0 is an assumed value)
    NN = 10000;
    mus = zeros(1, NN); taus = zeros(1, NN);
    suma = sum(y);
    mu = 0;                  % set the parameters as prior means
    tau = 2;
    for i = 1:NN
        % draw mu from its full conditional given the current tau
        new_mu = sqrt(1/(1+n*tau)) * randn + (tau * suma)/(1+n*tau);
        % draw tau from its full conditional given the updated mu
        par = 1 + 1/2 * sum( (y - new_mu).^2 );
        new_tau = rand_gamma(2 + n/2, par, 1, 1);
        mus(i) = new_mu;
        taus(i) = new_tau;
        mu = new_mu;
        tau = new_tau;
    end

Figure 4: (a) Last 500 simulations (out of 10000) for µ (top) and τ (below); (b) histograms of µ and τ; (c) joint simulation for µ and τ.

Figure 4(a) depicts the last 500 simulations (out of 10000) for $\mu$ (top) and $\tau$ (below). Panel (b) gives histograms of $\mu$ and $\tau$, and panel (c) presents a joint simulation for $\mu$ and $\tau$. The burn-in period was 1000, so only 9000 variates have been used to approximate the posteriors. The MCMC estimators are $\hat{\mu} = 0.0938$ and $\hat{\tau} = 0.0652$. The performance is quite good since the theoretical parameters are 0 and 1/16 = 0.0625, respectively.

Exercises

1. Beetle Mortality. The data come from Bliss [2] (cited in Dobson [10]) and are shown in Table 1. The data involve counting the number of beetles killed after five hours of exposure to various concentrations of gaseous carbon disulphide (CS$_2$). The analysis concerns estimating the proportion $r_i/n_i$ of beetles that are killed by the gas. Consider the model

$$P(\text{death} \mid w_i) = h(w_i) = \left(\frac{\exp\{x_i\}}{1 + \exp\{x_i\}}\right)^{m_1},$$

where $m_1 > 0$, $w_i$ is a covariate (dose), and $x_i = \frac{w_i - \mu}{\sigma}$, $\mu \in R$, $\sigma^2 > 0$.

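Table 1: Data on Beetle Mortality from Bliss (1935). Batches of adult beetles were exposed to gaseous carbon disulphide for five hours. Columns: Dosage (log CS$_2$ mg/litre), Beetles, Killed.

The priors are $m_1 \sim Gamma(a_0, b_0)$, $\mu \sim N(c_0, d_0)$, $\sigma^2 \sim IG(e_0, f_0)$ [$\tau = \sigma^{-2} \sim Gamma(e_0, f_0)$]. The joint posterior $\pi(\mu, \sigma^2, m_1 \mid y)$ is proportional to

$$f(y \mid \mu, \sigma^2, m_1)\,\pi(\mu, \sigma^2, m_1) \propto \left(\prod_{i=1}^{k} [h(w_i)]^{y_i} [1 - h(w_i)]^{n_i - y_i}\right) \frac{m_1^{a_0 - 1}}{(\sigma^2)^{e_0 + 1}} \exp\left\{-\frac{1}{2}\left(\frac{\mu - c_0}{d_0}\right)^2 - \frac{m_1}{b_0} - \frac{1}{f_0 \sigma^2}\right\}.$$

The transformation $\theta = (\theta_1, \theta_2, \theta_3) = (\mu, \frac{1}{2}\log(\sigma^2), \log m_1)$ is supported on $R^3$, so a multivariate normal proposal for $\theta$ is possible. In the new variables,

$$\pi(\theta \mid y) \propto \left(\prod_{i=1}^{k} [h(w_i)]^{y_i} [1 - h(w_i)]^{n_i - y_i}\right) \exp\{a_0\theta_3 - 2e_0\theta_2\} \exp\left\{-\frac{1}{2}\left(\frac{\theta_1 - c_0}{d_0}\right)^2 - \frac{\exp\{\theta_3\}}{b_0} - \frac{\exp\{-2\theta_2\}}{f_0}\right\}.$$

The acceptance probability is numerically more stable if it is represented as

$$\rho = \exp\{\log \pi^*(\theta' \mid y) - \log \pi^*(\theta^n \mid y)\} \wedge 1,$$

for the unnormalized posterior $\pi^*$, proposal $\theta'$, and current state $\theta^n$.

The choice of the hyperparameters is $a_0 = 0.25$, $b_0 = 4$ ($m_1$ has prior mean 1, as in the standard logit model), $c_0 = 2$, $d_0 = 10$, $e_0 = 2$, $f_0 = 1000$. The proposal density is $MVN_3(\theta^n, \Sigma)$ with $\Sigma = diag(0.00012, 0.033, 0.10)$.

2. Contingency Table. See the matlab file albertmc2.m on the Bayes page. Figure 5 depicts the output.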
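For Exercise 1, the log-scale form of the acceptance step can be sketched as follows. This is a minimal sketch under stated assumptions, not a solution from the handout: the three data vectors are hypothetical placeholders rather than the Bliss data, the hyperparameter and $\Sigma$ values are assumed as stated in the exercise, and the names `logpost`, `loghf`, and `xfun` are introduced only for illustration.

```matlab
% Random-walk Metropolis on theta = (mu, log(sigma), log(m1)) with the
% acceptance ratio evaluated on the log scale.
% Hypothetical placeholder data (NOT the Bliss data):
w  = [1.70 1.75 1.80 1.85];      % doses
nb = [60 60 60 60];              % beetles per batch
yk = [8 20 45 58];               % beetles killed
% Hyperparameters and proposal covariance (assumed values)
a0 = 0.25; b0 = 4; c0 = 2; d0 = 10; e0 = 2; f0 = 1000;
Sigma = diag([0.00012 0.033 0.10]);

xfun  = @(th) (w - th(1)) ./ exp(th(2));                  % x_i = (w_i - mu)/sigma
loghf = @(th) exp(th(3)) .* (xfun(th) - log(1 + exp(xfun(th))));  % log h(w_i)
% log of the unnormalized posterior in the transformed variables
logpost = @(th) sum(yk .* loghf(th) + (nb - yk) .* log1p(-exp(loghf(th)))) ...
          + a0*th(3) - 2*e0*th(2) ...
          - 0.5*((th(1) - c0)/d0)^2 - exp(th(3))/b0 - exp(-2*th(2))/f0;

theta = [1.8, log(0.02), 0];     % starting state (illustrative)
R = chol(Sigma);                 % R'*R = Sigma
nacc = 0;
for it = 1:200
    prop = theta + randn(1,3) * R;             % proposal from MVN_3(theta, Sigma)
    logr = logpost(prop) - logpost(theta);     % log of the acceptance ratio
    if log(rand) <= logr                       % same as rand <= min(exp(logr), 1)
        theta = prop;
        nacc = nacc + 1;
    end
end
fprintf('accepted %d of 200 proposals\n', nacc);
```

Working with differences of log posteriors avoids forming products of many small probabilities, which can underflow before the ratio is taken. Comparing `log(rand)` with `logr` accepts with probability $\min\{1, e^{logr}\}$ without ever exponentiating a large value.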

Figure 5: Graphical output from albertmc2.m. Panels (a), (b), (c) show $p_C$, $p_L$, and a histogram of the differences $p_L - p_C$.

References

[1] Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. J. Roy. Statist. Soc. Ser. B, 36, 192-236.
[2] Bliss, C. I. (1935). The calculation of the dosage-mortality curve. The Annals of Applied Biology, 22, 134-167.
[3] Brooks, S.P. (1998). Monte Carlo Methods and its application. The Statistician, 47, 69-100.
[4] Brooks, S.P. (1998). Quantitative convergence assessment for Markov chain Monte Carlo via cusums. Statistics and Computing, 8, 267-274.
[5] Casella, G., George, E.I. (1992). Explaining the Gibbs Sampler. The American Statistician, 46, 167-174. (Celebrated "Gibbs for Kids".)
[6] Chib, S., Greenberg, E. (1995). Understanding the Metropolis-Hastings Algorithm. The American Statistician, 49, 327-335.
[7] Gelfand, A.E., Smith, A.F.M. (1990). Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
[8] Gelfand, A.E. (2000). Gibbs Sampling. Journal of the American Statistical Association, 95, 1300-1304.
[9] Robert, C. (2001). Bayesian Choice, Second Edition. Springer Verlag.
[10] Dobson, A. J. (1983). An Introduction to Statistical Modelling. Chapman and Hall.
[11] Geman, S. and Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-6, No. 6, 721-741.
[12] Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall.
[13] Hastings, W.K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97-109.
[14] Johnson, V. and Albert, J. (1999). Ordinal Data Modeling. Springer Verlag, NY.
[15] Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21, 1087-1092.
[16] Robert, C.P., Casella, G. (1999). Monte Carlo Statistical Methods. Springer-Verlag, New York.
[17] Roberts, G.O., Gelman, A., Gilks, W.R. (1997). Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms. Annals of Applied Probability, 7, 110-120.
[18] Tanner, M.A. (1996). Tools for Statistical Inference. Springer-Verlag, New York.
[19] Tierney, L. (1994). Markov Chains for exploring posterior distributions (with discussion). Annals of Statistics, 22, 1701-1762.
