Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet. Email: arnaud@cs.ubc.ca

1.1 Outline

- Introduction to Markov chain Monte Carlo
- The Gibbs Sampler
- Examples

2.1 Summary of Last Lecture

Rejection Sampling and Importance Sampling are two general methods, but they are limited to problems of moderate dimension. Problem: we try to sample all the components of a potentially high-dimensional parameter simultaneously. There are two ways to implement incremental strategies instead:
- Iteratively: Markov chain Monte Carlo.
- Sequentially: Sequential Monte Carlo.

2.2 Motivating Example: Nuclear Pump Data

Multiple failures in a nuclear plant:

Pump      1      2      3      4       5     6      7     8     9     10
Failures  5      1      5      14      3     19     1     1     4     22
Times     94.32  15.72  62.88  125.76  5.24  31.44  1.05  1.05  2.10  10.48

Model: failures of the $i$th pump follow a Poisson process with parameter $\lambda_i$ ($1 \leq i \leq 10$). For an observed time $t_i$, the number of failures $p_i$ is thus a Poisson $\mathcal{P}(\lambda_i t_i)$ random variable. The unknowns consist of $\theta := (\lambda_1, \ldots, \lambda_{10}, \beta)$.

2.3 Bayesian Model for Nuclear Pump Data

Hierarchical model with $\alpha = 1.8$, $\gamma = 0.01$ and $\delta = 1$:
$$\lambda_i \overset{\text{iid}}{\sim} \text{Ga}(\alpha, \beta) \quad \text{and} \quad \beta \sim \text{Ga}(\gamma, \delta).$$
The posterior distribution is proportional to
$$\prod_{i=1}^{10} \left\{ (\lambda_i t_i)^{p_i} \exp(-\lambda_i t_i)\, \lambda_i^{\alpha-1} \exp(-\beta \lambda_i) \right\} \beta^{10\alpha}\, \beta^{\gamma-1} \exp(-\delta\beta)$$
$$\propto \prod_{i=1}^{10} \left\{ \lambda_i^{p_i+\alpha-1} \exp(-(t_i+\beta)\lambda_i) \right\} \beta^{10\alpha+\gamma-1} \exp(-\delta\beta).$$
This multidimensional distribution is rather complex. It is not obvious how the inverse cdf method, the rejection method or importance sampling could be used in this context.

2.4 Conditional Distributions

The conditionals have a familiar form:
$$\lambda_i \mid (\beta, t_i, p_i) \sim \text{Ga}(p_i + \alpha,\ t_i + \beta) \quad \text{for } 1 \leq i \leq 10,$$
$$\beta \mid (\lambda_1, \ldots, \lambda_{10}) \sim \text{Ga}\left(\gamma + 10\alpha,\ \delta + \sum_{i=1}^{10} \lambda_i\right).$$
Instead of directly sampling the vector $\theta = (\lambda_1, \ldots, \lambda_{10}, \beta)$ at once, one could suggest sampling it iteratively, starting for example with the $\lambda_i$'s for a given guess of $\beta$, followed by an update of $\beta$ given the new samples $\lambda_1, \ldots, \lambda_{10}$.

2.4 Conditional Distributions

Given a sample $\theta^t := (\lambda_1^t, \ldots, \lambda_{10}^t, \beta^t)$ at iteration $t$, one could proceed as follows at iteration $t+1$:
1. Sample $\lambda_i^{t+1} \mid (\beta^t, t_i, p_i) \sim \text{Ga}(p_i + \alpha,\ t_i + \beta^t)$ for $1 \leq i \leq 10$.
2. Sample $\beta^{t+1} \mid (\lambda_1^{t+1}, \ldots, \lambda_{10}^{t+1}) \sim \text{Ga}\left(\gamma + 10\alpha,\ \delta + \sum_{i=1}^{10} \lambda_i^{t+1}\right)$.
Instead of directly sampling in a space with 11 dimensions, one samples in spaces of dimension 1. Note that the deterministic version of such an algorithm would not generally converge towards the global maximum of the joint distribution.
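To make the scheme concrete, here is a minimal sketch of this Gibbs sampler in Python/NumPy. It is not from the slides; the iteration count, seed and burn-in are illustrative choices.

```python
import numpy as np

# Nuclear pump data: failure counts p_i and operation times t_i.
p = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22])
t = np.array([94.32, 15.72, 62.88, 125.76, 5.24, 31.44,
              1.05, 1.05, 2.10, 10.48])

alpha, gamma, delta = 1.8, 0.01, 1.0    # fixed hyperparameters
n_iter = 5_000
rng = np.random.default_rng(0)

beta = 1.0                              # initial guess for beta
lam_samples = np.empty((n_iter, 10))
beta_samples = np.empty(n_iter)

for it in range(n_iter):
    # 1. lambda_i | beta ~ Ga(p_i + alpha, t_i + beta)  (rate parametrization,
    #    hence scale = 1 / rate for NumPy's gamma sampler).
    lam = rng.gamma(shape=p + alpha, scale=1.0 / (t + beta))
    # 2. beta | lambda ~ Ga(gamma + 10*alpha, delta + sum_i lambda_i).
    beta = rng.gamma(shape=gamma + 10 * alpha,
                     scale=1.0 / (delta + lam.sum()))
    lam_samples[it], beta_samples[it] = lam, beta

print("posterior mean of beta:", beta_samples[1_000:].mean())  # after burn-in
```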

2.4 Conditional Distributions

The structure of the algorithm raises several questions:
- Are we sampling from the desired joint distribution?
- If yes, how many times should the iteration above be repeated?
The validity of the approach described here stems from the fact that the sequence $\{\theta^t\}$ defined above is a Markov chain, and some Markov chains have very nice properties.

2.5 Introduction to Markov Chain Monte Carlo

Markov chain: a sequence of random variables $\{X_n;\ n \in \mathbb{N}\}$ defined on $(X, \mathcal{B}(X))$ which satisfies, for any $A \in \mathcal{B}(X)$,
$$P(X_n \in A \mid X_0, \ldots, X_{n-1}) = P(X_n \in A \mid X_{n-1}),$$
and we will write $P(x, A) = P(X_n \in A \mid X_{n-1} = x)$.

Markov chain Monte Carlo: given a target $\pi$, design a transition kernel $P$ such that, asymptotically as $N \to \infty$,
$$\frac{1}{N} \sum_{n=1}^{N} \varphi(X_n) \to \int_X \varphi(x)\, \pi(x)\, dx \quad \text{and/or} \quad X_n \sim \pi.$$
It should be easy to simulate the Markov chain even if $\pi$ is complex.

2.6 Example

Consider the autoregression, for $|\alpha| < 1$,
$$X_n = \alpha X_{n-1} + V_n, \quad \text{where } V_n \sim \mathcal{N}(0, \sigma^2).$$
The limiting distribution is
$$\pi(x) = \mathcal{N}\left(x;\ 0,\ \frac{\sigma^2}{1-\alpha^2}\right).$$
To sample from $\pi$, we could just sample the Markov chain and asymptotically we would have $X_n \sim \pi$. Obviously, in this case this is useless because we can sample from $\pi$ directly.
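The following short sketch (mine, with illustrative values of $\alpha$ and $\sigma$) simulates this chain from a starting point far from stationarity and checks that the empirical variance of the trajectory matches $\sigma^2/(1-\alpha^2)$:

```python
import numpy as np

alpha, sigma = 0.9, 1.0        # illustrative values
n_steps = 100_000
rng = np.random.default_rng(1)

x = np.empty(n_steps)
x[0] = 10.0                    # deliberately far from the limiting distribution
for n in range(1, n_steps):
    x[n] = alpha * x[n - 1] + rng.normal(0.0, sigma)

# Ergodic averages should match the moments of N(0, sigma^2 / (1 - alpha^2)).
print("empirical variance:  ", x[1_000:].var())
print("theoretical variance:", sigma**2 / (1 - alpha**2))
```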

2.7 Example

Graphically, consider a large number of independent Markov chains run in parallel. We assume that the initial distribution of these Markov chains is $\mathcal{U}[0, 2]$, so initially the Markov chain samples are not distributed according to $\pi$.

2.7 Example

[Figure: from top left to bottom right, histograms of the independent Markov chains' samples at successive iterations, with a normal distribution as target distribution.]

2.7 Example

The target normal distribution seems to attract the distribution of the samples, and even to be a fixed point of the algorithm. This is what we wanted to achieve: it seems that we have produced independent samples from the normal distribution. In fact, one can show that in many (all?) situations of interest it is not necessary to run many Markov chains in parallel to obtain such samples; one can instead consider a unique Markov chain and build the histogram from this single trajectory.

2.8 Example: Mixture of Normals

[Figure: histogram of the Markov chain samples after a small number of iterations, with a mixture of normals as target distribution.]

2.8 Example: Mixture of Normals

[Figure: the same experiment after more iterations; the histogram is now much closer to the mixture-of-normals target.]

2.9 Markov chain Monte Carlo

The estimate of the target distribution, through the series of histograms, improves with the number of iterations. Assume that we have stored $\{X_n,\ 1 \leq n \leq N\}$ for $N$ large and wish to estimate $\int_X \varphi(x)\,\pi(x)\,dx$. In the light of the numerical experiments, one can suggest the estimator
$$\frac{1}{N} \sum_{n=1}^{N} \varphi(X_n),$$
which is exactly the estimator that we would use if $\{X_n,\ 1 \leq n \leq N\}$ were independent. In fact, it can be proved, under relatively mild conditions, that such an estimator is consistent despite the fact that the samples are NOT independent! Under additional conditions, a CLT also holds with a rate of convergence in $1/\sqrt{N}$.

2.9 Markov chain Monte Carlo

To summarize, we are interested in Markov chains with transition kernel $P$ which have the following three important properties observed above:
- The desired distribution $\pi$ is a fixed point of the algorithm or, in more appropriate terms, an invariant distribution of the Markov chain, i.e.
$$\int_X \pi(x)\, P(x, y)\, dx = \pi(y).$$
- The successive distributions of the Markov chains are attracted by $\pi$, or converge towards $\pi$.
- The estimator $\frac{1}{N}\sum_{n=1}^{N} \varphi(X_n)$ converges towards $\mathbb{E}_\pi(\varphi(X))$ and asymptotically $X_n \sim \pi$.

2.9 Markov chain Monte Carlo

Given $\pi(x)$, there is an infinite number of kernels $P(x, y)$ which admit $\pi(x)$ as their invariant distribution. The art of MCMC consists of coming up with good ones. Convergence is ensured under very weak assumptions, namely irreducibility and aperiodicity. It is usually very easy to establish that an MCMC sampler converges towards $\pi$, but very difficult to obtain rates of convergence.

2.10 The Gibbs Sampler

Consider the target distribution $\pi(\theta)$ such that $\theta = (\theta^1, \theta^2)$. Then the two-component Gibbs sampler proceeds as follows.
- Initialization: select deterministically or randomly $\theta_0 = (\theta_0^1, \theta_0^2)$.
- Iteration $i$; $i \geq 1$:
  - Sample $\theta_i^1 \sim \pi(\theta^1 \mid \theta_{i-1}^2)$.
  - Sample $\theta_i^2 \sim \pi(\theta^2 \mid \theta_i^1)$.
Sampling from these conditionals is often feasible even when sampling from the joint is impossible (e.g. nuclear pump data).
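For concreteness, here is a minimal sketch (my own example, not from the slides) of this two-component scheme for a standard bivariate normal target with correlation $\rho$, whose full conditionals are $\theta^1 \mid \theta^2 \sim \mathcal{N}(\rho\,\theta^2,\ 1-\rho^2)$ and symmetrically for $\theta^2$:

```python
import numpy as np

rho = 0.8                      # illustrative correlation
n_iter = 10_000
rng = np.random.default_rng(2)

theta = np.zeros(2)            # deterministic initialization
samples = np.empty((n_iter, 2))

for i in range(n_iter):
    # Full conditionals of a standard bivariate normal with correlation rho.
    theta[0] = rng.normal(rho * theta[1], np.sqrt(1 - rho**2))
    theta[1] = rng.normal(rho * theta[0], np.sqrt(1 - rho**2))
    samples[i] = theta

print("empirical correlation:", np.corrcoef(samples.T)[0, 1])  # close to rho
```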

2.11 Invariant Distribution

Clearly $\{(\theta_i^1, \theta_i^2)\}$ is a Markov chain and its transition kernel is
$$P\left((\theta^1, \theta^2), (\widetilde{\theta}^1, \widetilde{\theta}^2)\right) = \pi(\widetilde{\theta}^1 \mid \theta^2)\, \pi(\widetilde{\theta}^2 \mid \widetilde{\theta}^1).$$
Then $\int\!\!\int \pi(\theta^1, \theta^2)\, P\left((\theta^1, \theta^2), (\widetilde{\theta}^1, \widetilde{\theta}^2)\right) d\theta^1 d\theta^2$ satisfies
$$\int\!\!\int \pi(\theta^1, \theta^2)\, \pi(\widetilde{\theta}^1 \mid \theta^2)\, \pi(\widetilde{\theta}^2 \mid \widetilde{\theta}^1)\, d\theta^1 d\theta^2 = \int \pi(\theta^2)\, \pi(\widetilde{\theta}^1 \mid \theta^2)\, \pi(\widetilde{\theta}^2 \mid \widetilde{\theta}^1)\, d\theta^2$$
$$= \int \pi(\widetilde{\theta}^1, \theta^2)\, \pi(\widetilde{\theta}^2 \mid \widetilde{\theta}^1)\, d\theta^2 = \pi(\widetilde{\theta}^1)\, \pi(\widetilde{\theta}^2 \mid \widetilde{\theta}^1) = \pi(\widetilde{\theta}^1, \widetilde{\theta}^2).$$
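The invariance can also be checked numerically. In the sketch below (mine, reusing the bivariate normal example above), a large population of samples is started exactly at $\pi$ and one Gibbs transition is applied to each; the output population still has the covariance of $\pi$:

```python
import numpy as np

rho = 0.8
n = 200_000
rng = np.random.default_rng(3)

# Start exactly at the target: theta ~ N(0, [[1, rho], [rho, 1]]).
cov = np.array([[1.0, rho], [rho, 1.0]])
theta = rng.multivariate_normal([0.0, 0.0], cov, size=n)

# One Gibbs transition applied to every sample of the population.
theta[:, 0] = rng.normal(rho * theta[:, 1], np.sqrt(1 - rho**2))
theta[:, 1] = rng.normal(rho * theta[:, 0], np.sqrt(1 - rho**2))

# If pi is invariant, the population is still (approximately) N(0, cov).
print("covariance after one sweep:\n", np.cov(theta.T))
```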

2.12 Irreducibility

This does not ensure that the Gibbs sampler converges towards the invariant distribution! Additionally it is required to ensure irreducibility: loosely speaking, the Markov chain can move to any set $A$ such that $\pi(A) > 0$ from (almost) any starting point. This ensures that
$$\frac{1}{N}\sum_{n=1}^{N} \varphi\left(\theta_n^1, \theta_n^2\right) \to \int\!\!\int \varphi\left(\theta^1, \theta^2\right) \pi\left(\theta^1, \theta^2\right) d\theta^1 d\theta^2,$$
but NOT that asymptotically $(\theta_n^1, \theta_n^2) \sim \pi$.

2.13 Irreducibility

[Figure: a distribution that can lead to a reducible Gibbs sampler. The density is supported on two blocks along the diagonal of $[0,2]^2$, with zero density in the two off-diagonal regions, so coordinate-wise updates cannot move from one block to the other.]
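To see the reducibility concretely, here is a toy sketch (mine, using a uniform distribution on two diagonal blocks as a stand-in for the figure). Started in one block, the coordinate-wise updates can never reach the other block:

```python
import numpy as np

rng = np.random.default_rng(4)

def cond_sample(other):
    # Target: uniform on [0,1]^2 U [1,2]^2 (zero density elsewhere).
    # Given one coordinate, the conditional is uniform on a single block.
    return rng.uniform(0, 1) if other < 1 else rng.uniform(1, 2)

x, y = 0.5, 0.5                # start in the lower-left block
for _ in range(100_000):
    x = cond_sample(y)
    y = cond_sample(x)

print("still in the lower-left block:", x < 1 and y < 1)  # always True
```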

2.14 Aperiodicity

Consider a simple example where $X = \{1, 2\}$ and $P(1, 2) = P(2, 1) = 1$. Clearly the invariant distribution is given by $\pi(1) = \pi(2) = \frac{1}{2}$. However, we know that if the chain starts at $X_0 = 1$, then $X_{2n} = 1$ and $X_{2n+1} = 2$ for any $n$. We have
$$\frac{1}{N}\sum_{n=1}^{N} \varphi(X_n) \to \int \varphi(x)\, \pi(x)\, dx,$$
but clearly $X_n$ is NOT distributed according to $\pi$. You need to make sure that you do NOT explore the space in a periodic way to ensure that $X_n \sim \pi$ asymptotically.
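A tiny simulation (mine) of this periodic chain: the time average of $\varphi(X_n) = \mathbf{1}(X_n = 1)$ converges to $\pi(1) = 1/2$, even though $X_n$ itself is deterministic given the starting point and never converges in distribution:

```python
import numpy as np

n_steps = 10_001
x = np.empty(n_steps, dtype=int)
x[0] = 1
for n in range(1, n_steps):
    x[n] = 2 if x[n - 1] == 1 else 1      # P(1, 2) = P(2, 1) = 1

print("time average of 1(X_n = 1):", (x == 1).mean())   # ~ 0.5
print("states visited at even n:", np.unique(x[::2]))   # [1] only: periodic
```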

2.15 About the Gibbs sampler

Even when irreducibility and aperiodicity are ensured, the Gibbs sampler can still converge very slowly.

[Figure: illustration of a slowly converging Gibbs sampler.]

2.16 More about the Gibbs sampler

If $\theta = (\theta^1, \ldots, \theta^p)$ where $p > 2$, the Gibbs sampling strategy still applies.
- Initialization: select deterministically or randomly $\theta_0 = (\theta_0^1, \ldots, \theta_0^p)$.
- Iteration $i$; $i \geq 1$: for $k = 1:p$, sample $\theta_i^k \sim \pi\left(\theta^k \mid \theta_i^{-k}\right)$, where $\theta_i^{-k} = \left(\theta_i^1, \ldots, \theta_i^{k-1}, \theta_{i-1}^{k+1}, \ldots, \theta_{i-1}^p\right)$.

2.17 Random Scan Gibbs sampler

- Initialization: select deterministically or randomly $\theta_0 = (\theta_0^1, \ldots, \theta_0^p)$.
- Iteration $i$; $i \geq 1$:
  - Sample $K \sim \mathcal{U}\{1, \ldots, p\}$.
  - Set $\theta_i^{-K} = \theta_{i-1}^{-K}$.
  - Sample $\theta_i^K \sim \pi\left(\theta^K \mid \theta_i^{-K}\right)$, where $\theta_i^{-K} = \left(\theta_i^1, \ldots, \theta_i^{K-1}, \theta_i^{K+1}, \ldots, \theta_i^p\right)$.
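A generic skeleton (my own sketch) for the random-scan variant. It assumes a list `sample_conditional` of model-specific functions, where `sample_conditional[k](theta, rng)` draws $\theta^k$ from the full conditional $\pi(\theta^k \mid \theta^{-k})$; replacing the random index by a `for k in range(p)` loop gives the systematic scan of the previous slide:

```python
import numpy as np

def random_scan_gibbs(theta0, sample_conditional, n_iter, rng):
    """Random-scan Gibbs: update one randomly chosen coordinate per iteration.

    sample_conditional[k](theta, rng) must return a draw from
    pi(theta^k | theta^{-k}); these full conditionals are model-specific.
    """
    theta = np.array(theta0, dtype=float)
    p = len(theta)
    out = np.empty((n_iter, p))
    for i in range(n_iter):
        k = rng.integers(p)                    # K ~ U{0, ..., p-1}
        theta[k] = sample_conditional[k](theta, rng)
        out[i] = theta                         # other coordinates carried over
    return out
```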

2.18 Practical Recommendations

- Try to have as few blocks as possible. Put the most correlated variables in the same block. If necessary, reparametrize the model to achieve this.
- Integrate analytically over as many variables as possible: pretty algorithms can be much more inefficient than ugly algorithms.
- There is no general result saying that strategy A is better than strategy B in all cases: you need experience.

2.19 Bayesian Variable Selection Example

We select the following model:
$$Y = \sum_{i=1}^{p} \beta_i X_i + \sigma V \quad \text{where } V \sim \mathcal{N}(0, 1),$$
where we assume $\sigma^2 \sim \text{IG}\left(\frac{\nu}{2}, \frac{\gamma}{2}\right)$ and, for $\alpha^2 \ll 1$,
$$\beta_i \sim \frac{1}{2}\mathcal{N}(0, \alpha^2\delta^2\sigma^2) + \frac{1}{2}\mathcal{N}(0, \delta^2\sigma^2).$$
We introduce a latent variable $\gamma_i \in \{0, 1\}$ such that
$$\Pr(\gamma_i = 0) = \Pr(\gamma_i = 1) = \frac{1}{2}, \quad \beta_i \mid \gamma_i = 0 \sim \mathcal{N}(0, \alpha^2\delta^2\sigma^2), \quad \beta_i \mid \gamma_i = 1 \sim \mathcal{N}(0, \delta^2\sigma^2).$$

2.20 A Bad Gibbs Sampler

We have parameters $(\beta_{1:p}, \gamma_{1:p}, \sigma^2)$ and observe $n$ observations $D = \{x_i, y_i\}_{i=1}^n$. A potential Gibbs sampler consists of sampling iteratively from $p(\beta_{1:p} \mid D, \gamma_{1:p}, \sigma^2)$ (Gaussian), $p(\sigma^2 \mid D, \gamma_{1:p}, \beta_{1:p})$ (inverse-gamma) and $p(\gamma_{1:p} \mid D, \beta_{1:p}, \sigma^2)$. In particular,
$$p(\gamma_{1:p} \mid D, \beta_{1:p}, \sigma^2) = \prod_{i=1}^{p} p(\gamma_i \mid \beta_i, \sigma^2)$$
and
$$p(\gamma_i = 1 \mid \beta_i, \sigma^2) = \frac{\frac{1}{\sqrt{2\pi}\,\delta\sigma}\exp\left(-\frac{\beta_i^2}{2\delta^2\sigma^2}\right)}{\frac{1}{\sqrt{2\pi}\,\delta\sigma}\exp\left(-\frac{\beta_i^2}{2\delta^2\sigma^2}\right) + \frac{1}{\sqrt{2\pi}\,\alpha\delta\sigma}\exp\left(-\frac{\beta_i^2}{2\alpha^2\delta^2\sigma^2}\right)}.$$
The Gibbs sampler becomes reducible as $\alpha$ goes to zero.

2.21 Bayesian Variable Selection Example

This is the result of bad modelling and a bad algorithm. You would like to put $\alpha = 0$ and write
$$Y = \sum_{i=1}^{p} \gamma_i \beta_i X_i + \sigma V \quad \text{where } V \sim \mathcal{N}(0, 1),$$
where $\gamma_i = 1$ if $X_i$ is included and $\gamma_i = 0$ otherwise. However, this suggests that $\beta_i$ is defined even when $\gamma_i = 0$. A neater way to write such models is
$$Y = \sum_{\{i:\, \gamma_i = 1\}} \beta_i X_i + \sigma V = \beta_\gamma^T X_\gamma + \sigma V,$$
where, for a vector $\gamma = (\gamma_1, \ldots, \gamma_p)$, $\beta_\gamma = \{\beta_i : \gamma_i = 1\}$, $X_\gamma = \{X_i : \gamma_i = 1\}$ and $n_\gamma = \sum_{i=1}^{p} \gamma_i$. Prior distributions:
$$\pi\left(\beta_\gamma, \sigma^2 \mid \gamma\right) = \mathcal{N}\left(\beta_\gamma;\ 0,\ \delta^2\sigma^2 I_{n_\gamma}\right) \text{IG}\left(\sigma^2;\ \frac{\nu}{2}, \frac{\gamma}{2}\right)$$
and $\pi(\gamma) = \prod_{i=1}^{p} \pi(\gamma_i) = 2^{-p}$.

2.22 A Better Gibbs Sampler

We are interested in sampling from the trans-dimensional distribution $\pi(\gamma, \beta_\gamma, \sigma^2 \mid D)$. However, we know that
$$\pi\left(\gamma, \beta_\gamma, \sigma^2 \mid D\right) = \pi(\gamma \mid D)\, \pi\left(\beta_\gamma, \sigma^2 \mid D, \gamma\right),$$
where $\pi(\gamma \mid D) \propto \pi(D \mid \gamma)\, \pi(\gamma)$ and
$$\pi(D \mid \gamma) = \int \pi\left(D, \beta_\gamma, \sigma^2 \mid \gamma\right) d\beta_\gamma\, d\sigma^2 \propto \Gamma\left(\frac{\nu + n}{2} + 1\right) \delta^{-n_\gamma} \left|\Sigma_\gamma\right|^{1/2} \left(\frac{\gamma + \sum_{i=1}^n y_i^2 - \mu_\gamma^T \Sigma_\gamma^{-1} \mu_\gamma}{2}\right)^{-\left(\frac{\nu + n}{2} + 1\right)},$$
where $\mu_\gamma$ and $\Sigma_\gamma$ are defined from the data for each model $\gamma$.

2.23 Bayesian Variable Selection Example

$\pi(\gamma \mid D)$ is a discrete probability distribution with $2^p$ potential values. We can use the Gibbs sampler to sample from it.
- Initialization: select deterministically or randomly $\gamma_0 = (\gamma_0^1, \ldots, \gamma_0^p)$.
- Iteration $i$; $i \geq 1$: for $k = 1:p$, sample $\gamma_i^k \sim \pi\left(\gamma^k \mid D, \gamma_i^{-k}\right)$, where $\gamma_i^{-k} = \left(\gamma_i^1, \ldots, \gamma_i^{k-1}, \gamma_{i-1}^{k+1}, \ldots, \gamma_{i-1}^p\right)$.
- Optional step: sample $\left(\beta_{\gamma,i}, \sigma_i^2\right) \sim \pi\left(\beta_\gamma, \sigma^2 \mid D, \gamma_i\right)$.
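A schematic sketch (mine) of one sweep of this Gibbs sampler over the binary vector $\gamma$. The function `log_marginal` is a placeholder standing in for $\log \pi(D \mid \gamma) + \log \pi(\gamma)$, i.e. the closed-form expression on the previous slide, which is model-specific:

```python
import numpy as np

def gibbs_sweep_gamma(gamma, log_marginal, rng):
    """One systematic-scan sweep targeting pi(gamma | D).

    log_marginal(gamma) must return log pi(D | gamma) + log pi(gamma);
    it is a placeholder for the closed-form marginal likelihood.
    """
    gamma = gamma.copy()
    for k in range(len(gamma)):
        logp = np.empty(2)
        for v in (0, 1):                  # evaluate both values of gamma_k
            gamma[k] = v
            logp[v] = log_marginal(gamma)
        # pi(gamma_k = 1 | D, gamma_{-k}), computed in a stable way.
        p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))
        gamma[k] = rng.random() < p1
    return gamma
```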

2.23 Bayesian Variable Selection Example

This very simple sampler is much more efficient than the previous one. However, it can also mix very slowly because the components are updated one at a time. Updating correlated components together would significantly increase the convergence speed of the algorithm, at the cost of an increased complexity.

2.23 Bayesian Variable Selection Example

The Gibbs sampler is a generic tool to sample approximately from high-dimensional distributions. Each time you face a problem, you need to think hard about it to design an efficient algorithm. Except for the choice of the partition of the parameters, the Gibbs sampler is parameter-free; this does not mean it is efficient.