Spring 2006: Introduction to Markov Chain Monte Carlo (MCMC)

36-724 Spring 2006: Introduction to Markov Chain Monte Carlo (MCMC)
Brian Junker
February 16, 2006

Outline:
- Hierarchical Normal Model
- Direct Simulation
- An Alternative Approach: MCMC
- Complete Conditionals for Hierarchical Normal
- Aside: Marginal Density Estimates
- Summary

Hierarchical Normal Model

[Figure: a three-level tree with the prior $p(\mu,\tau^2)$ at the top; group means $\theta_1,\ldots,\theta_J$ drawn from $N(\theta_j \mid \mu,\tau^2)$ in the middle; and observations $y_{1j},\ldots,y_{n_j j}$ drawn from $N(y \mid \theta_j,\sigma^2)$ within each group $j$ at the bottom.]

With $\sigma_j^2 = \sigma^2/n_j$ known, the model is
$$
p(y,\theta,\mu,\tau^2 \mid \sigma^2) = \underbrace{\prod_j N(\bar y_j \mid \theta_j,\sigma_j^2)}_{\text{Level 1}}\;
\underbrace{\prod_j N(\theta_j \mid \mu,\tau^2)}_{\text{Level 2}}\;
\underbrace{p(\mu,\tau^2)}_{\text{Level 3}},
$$
where $\bar y_j = \frac{1}{n_j}\sum_{i=1}^{n_j} y_{ij}$ is the sample mean of group $j$.
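To make the three levels concrete, here is a short R simulation sketch of one data set from this model; the values of $J$, $n_j$, $\sigma^2$, $\mu$, and $\tau^2$ are invented for illustration (Level 3 is fixed at a point rather than drawn from a prior). The quantities ybar and sigma2j defined here are reused in the sketches below.

```r
# Simulate one data set from the three-level hierarchical normal model
# (illustrative values only; not data from the lecture).
set.seed(1)
J       <- 8; n_j <- 20; sigma2 <- 4
mu.true <- 5; tau2.true <- 9                       # Level 3 fixed at a point for simulation
theta.true <- rnorm(J, mu.true, sqrt(tau2.true))   # Level 2: theta_j | mu, tau^2
y <- matrix(rnorm(J * n_j, rep(theta.true, each = n_j), sqrt(sigma2)),
            nrow = n_j)                            # Level 1: y_ij | theta_j
ybar    <- colMeans(y)                             # group means  ybar_j
sigma2j <- rep(sigma2 / n_j, J)                    # known variances sigma_j^2 = sigma^2 / n_j
```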

Direct Simulation

Taking $p(\mu,\tau^2) \propto p(\tau^2)$, we have
$$
p(\theta,\mu,\tau^2 \mid y) = p(\theta \mid \mu,\tau^2,y)\, p(\mu \mid \tau^2,y)\, p(\tau^2 \mid y),
$$
where
$$
p(\theta_j \mid \mu,\tau^2,y_j) = N(\theta_j \mid \hat\theta_j, V_j), \qquad
\hat\theta_j = \frac{\bar y_j/\sigma_j^2 + \mu/\tau^2}{1/\sigma_j^2 + 1/\tau^2}, \qquad
V_j = \frac{1}{1/\sigma_j^2 + 1/\tau^2},
$$
$$
p(\mu \mid \tau^2,y) = N(\mu \mid \hat\mu, V_\mu), \qquad
\hat\mu = \frac{\sum_j \bar y_j/(\tau^2+\sigma_j^2)}{\sum_j 1/(\tau^2+\sigma_j^2)}, \qquad
V_\mu = \frac{1}{\sum_j 1/(\tau^2+\sigma_j^2)},
$$
and
$$
p(\tau^2 \mid y) \propto V_\mu^{1/2}\Big[\prod_j N(\bar y_j \mid \hat\mu,\, \tau^2+\sigma_j^2)\Big]\, p(\tau^2).
$$

This wasn't bad, but it requires cleverness and lacks flexibility.
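A minimal R sketch of this recipe, reusing the simulated ybar and sigma2j from above and taking $p(\tau^2)$ flat on a grid (both choices are illustrative, not from the lecture): draw $\tau^2$ from its marginal posterior on the grid, then $\mu$, then the $\theta_j$.

```r
# Direct (non-iterative) simulation from p(theta, mu, tau2 | y); sketch only.
tau2.grid <- seq(0.01, 50, length = 500)
log.post  <- sapply(tau2.grid, function(tau2) {       # log p(tau2 | y) up to a constant
  w     <- 1 / (tau2 + sigma2j)
  Vmu   <- 1 / sum(w)
  muhat <- sum(w * ybar) * Vmu
  0.5 * log(Vmu) + sum(dnorm(ybar, muhat, sqrt(tau2 + sigma2j), log = TRUE))
})
wts <- exp(log.post - max(log.post))

M <- 2000
sims <- t(replicate(M, {
  tau2  <- sample(tau2.grid, 1, prob = wts)                    # tau2 ~ p(tau2 | y)
  w     <- 1 / (tau2 + sigma2j); Vmu <- 1 / sum(w)
  mu    <- rnorm(1, sum(w * ybar) * Vmu, sqrt(Vmu))            # mu ~ p(mu | tau2, y)
  Vj    <- 1 / (1/sigma2j + 1/tau2)
  theta <- rnorm(J, (ybar/sigma2j + mu/tau2) * Vj, sqrt(Vj))   # theta_j ~ p(theta_j | mu, tau2, y)
  c(theta, mu = mu, tau2 = tau2)                               # one draw from the joint posterior
}))
```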

An Alternative Approach: MCMC

The problem: learn about $\pi(\tau) = \pi(\theta,\mu,\tau^2)$, where $\tau = (\theta,\mu,\tau^2)$ is some high-dimensional set of variables (parameters).

The essential idea: define a (stationary) Markov chain $M_0, M_1, M_2, \ldots$ with states $M_m = (\theta^m, \mu^m, (\tau^2)^m)$. Simulate the $M_m$'s; under regularity conditions (e.g., Tierney, 1994), $M_m$ will converge in distribution to a stationary distribution $\pi(\theta,\mu,\tau^2)$ satisfying
$$
\pi(\theta^1,\mu^1,(\tau^2)^1) = \int P\big[M_{m+1} = (\theta^1,\mu^1,(\tau^2)^1) \mid M_m = (\theta^0,\mu^0,(\tau^2)^0)\big]\, \pi(\theta^0,\mu^0,(\tau^2)^0)\, d(\theta^0,\mu^0,(\tau^2)^0).
$$

For Bayes, design the chain so that $\pi(\theta,\mu,\tau^2)$ turns out to be the posterior, $p(\theta,\mu,\tau^2 \mid y)$.

Typical MCMC paradigm: write $p(\theta,\mu,\tau^2 \mid y) = p(\tau_1,\tau_2,\ldots,\tau_d \mid y)$, where $(\tau_1,\ldots,\tau_d)$ is a disjoint partition of the original model parameters $(\theta,\mu,\tau^2)$.

General theory of MCMC (e.g., Tierney, 1994; Chib and Greenberg, 1995): construct the state $M_m = (\tau_1^{(m)},\ldots,\tau_d^{(m)})$ by sampling each $\tau_k$ from its complete conditional distribution. To step from $M_{m-1} = (\tau_1^{(m-1)},\ldots,\tau_d^{(m-1)})$ to $M_m = (\tau_1^{(m)},\ldots,\tau_d^{(m)})$:

1. $\tau_1^{(m)} \sim p(\tau_1 \mid \tau_2^{(m-1)},\ldots,\tau_d^{(m-1)},y) \equiv p(\tau_1 \mid \text{rest})$;
2. $\tau_2^{(m)} \sim p(\tau_2 \mid \tau_1^{(m)},\tau_3^{(m-1)},\ldots,\tau_d^{(m-1)},y) \equiv p(\tau_2 \mid \text{rest})$;
3. $\tau_3^{(m)} \sim p(\tau_3 \mid \tau_1^{(m)},\tau_2^{(m)},\tau_4^{(m-1)},\ldots,\tau_d^{(m-1)},y) \equiv p(\tau_3 \mid \text{rest})$;
   ...
d. $\tau_d^{(m)} \sim p(\tau_d \mid \tau_1^{(m)},\tau_2^{(m)},\tau_3^{(m)},\ldots,\tau_{d-1}^{(m)},y) \equiv p(\tau_d \mid \text{rest})$.

This is the basic idea behind all so-called Gibbs samplers.
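Neither the target nor the code below is from the lecture; it is a minimal illustration of one full sweep of this scheme on a toy two-block problem where both complete conditionals are known: for a standard bivariate normal with correlation $\rho$, $x_1 \mid x_2 \sim N(\rho x_2, 1-\rho^2)$, and symmetrically for $x_2 \mid x_1$.

```r
# Toy Gibbs sampler for a bivariate normal target with correlation rho.
set.seed(1)
rho <- 0.8
M   <- 5000
x1  <- x2 <- numeric(M)
x1[1] <- x2[1] <- 0                                    # arbitrary starting state M_0
for (m in 2:M) {
  x1[m] <- rnorm(1, rho * x2[m - 1], sqrt(1 - rho^2))  # step 1: x1 | x2 (old x2)
  x2[m] <- rnorm(1, rho * x1[m],     sqrt(1 - rho^2))  # step 2: x2 | x1 (new x1)
}
cor(x1, x2)   # should be close to rho once the chain has burned in
```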

Markov Chain Monte Carlo (MCMC) vs. independent-draws MC: replace draws from the higher-dimensional $p(\tau \mid y)$ with a sequence of low-dimensional draws from $p(\tau_1 \mid \text{rest})$, $p(\tau_2 \mid \text{rest})$, etc. For Bayes, one can often exploit partial conjugacy to simplify the problem of drawing from $p(\tau_k \mid \text{rest})$.

Costs: dependent rather than iid samples, and approximate rather than exact sampling of $p(\tau \mid y)$.

Sampling schemes

Once a set of complete conditionals is decided upon:

- If you can sample $\tau_k$ directly from $p(\tau_k \mid \text{rest})$, do so; this is a Gibbs step.
- Otherwise perform a Metropolis step:
  - Sample a proposal value $\tau_k^*$ from any convenient proposal distribution $q_{mk}(\tau_k^* \mid \tau_k^{(m-1)})$.
  - Compute the acceptance probability $\alpha_k$ from $q_{mk}(\tau_k^* \mid \tau_k^{(m-1)})$ and the complete conditional $p(\tau_k \mid \text{rest}) \equiv p(\tau_k \mid \tau_1^{(m)},\ldots,\tau_{k-1}^{(m)},\tau_{k+1}^{(m-1)},\ldots,\tau_d^{(m-1)},y)$:
$$
\alpha_k = \min\left\{\frac{p(\tau_k^* \mid \text{rest})\, q_{mk}(\tau_k^{(m-1)} \mid \tau_k^*)}{p(\tau_k^{(m-1)} \mid \text{rest})\, q_{mk}(\tau_k^* \mid \tau_k^{(m-1)})},\; 1\right\};
$$
  - Accept $\tau_k^{(m)} = \tau_k^*$ with probability $\alpha_k$; otherwise set $\tau_k^{(m)} = \tau_k^{(m-1)}$.

This is often called Metropolis-Hastings within Gibbs.
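Here is a minimal sketch of one such step for a scalar block, assuming a symmetric Gaussian random-walk proposal (so the $q$ terms cancel in $\alpha_k$); the function name, its arguments, and the N(2, 1) target used in the illustration are invented, not from the lecture.

```r
# One random-walk Metropolis step for a scalar block (sketch).
# log.cc must return log p(tau_k | rest) up to an additive constant.
metropolis.step <- function(tau.curr, log.cc, sd.prop = 0.5) {
  tau.star  <- rnorm(1, tau.curr, sd.prop)               # symmetric proposal
  log.alpha <- log.cc(tau.star) - log.cc(tau.curr)       # log acceptance ratio (q's cancel)
  if (log(runif(1)) < log.alpha) tau.star else tau.curr  # accept, or keep the old value
}

# Illustration: pretend the complete conditional is N(2, 1).
set.seed(1)
log.cc <- function(x) dnorm(x, mean = 2, sd = 1, log = TRUE)
draws  <- numeric(5000)
for (m in 2:5000) draws[m] <- metropolis.step(draws[m - 1], log.cc)
mean(draws[-(1:500)])   # should be near 2 after discarding burn-in
```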

Identifying the complete conditionals

Suppose $y$ only depends on $\theta$, so the likelihood is $p(y \mid \theta)$, and suppose the $\theta$'s, $\mu$'s and $\tau^2$'s have prior distributions $p(\theta \mid \mu,\tau^2)$, $p(\mu \mid \tau^2)$ and $p(\tau^2)$. Then the complete conditional for $\theta$, for example, is
$$
p(\theta \mid \text{rest}) = p(\theta \mid y,\mu,\tau^2)
= \frac{p(y,\theta,\mu,\tau^2)}{\int p(y,t,\mu,\tau^2)\, dt}
= \frac{p(y \mid \theta)\, p(\theta \mid \mu,\tau^2)\, p(\mu \mid \tau^2)\, p(\tau^2)}{\int p(y \mid t)\, p(t \mid \mu,\tau^2)\, p(\mu \mid \tau^2)\, p(\tau^2)\, dt}
\propto p(y \mid \theta)\, p(\theta \mid \mu,\tau^2).
$$

Key observation: the shape of $p(\theta \mid y,\mu,\tau^2)$ is determined by just the parts of the joint model that depend explicitly on $\theta$. Similarly for the other parameters.

Convergence

$$
\underbrace{M_0, M_1, M_2, \ldots, M_B}_{\text{Burn-in segment}},\;
\underbrace{M_{B+1}, \ldots, M_{B+M}}_{\text{Usable MCMC sample}}
$$

How large should B be?
- Use time series plots to see when the chain has stabilized.
- Use acf() in Splus/R to check when Corr($M_m$, $M_{m+b}$) is small.
- Run several chains from different starting points; let B be large enough that the between-chain variation is no larger than the within-chain variation.

The CODA subroutine package for Splus/R offers a menu of such checks (www.mrc-bsu.cam.ac.uk/bugs/).

Even after burn-in, the MC sample may have to be grouped or sub-sampled to reduce autocorrelation within the sample.
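As a short illustration of these checks, applied here to the Metropolis draws from the sketch above (any vector of MCMC output would do); the effective-sample-size line is a rough textbook approximation, not anything from the lecture.

```r
# Burn-in and mixing checks for a vector of MCMC draws.
par(mfrow = c(1, 2))
plot(draws, type = "l", main = "Trace plot")   # has the chain stabilized?
acf(draws, main = "Autocorrelation")           # how fast does Corr(M_m, M_{m+b}) decay?

# Rough effective sample size from the estimated autocorrelations
# (the CODA package offers similar diagnostics, e.g. effectiveSize()).
rho <- acf(draws, plot = FALSE, lag.max = 100)$acf[-1]
length(draws) / (1 + 2 * sum(rho[rho > 0]))
```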

Complete Conditionals for Hierarchical Normal

$$
p(y,\theta,\mu,\tau^2 \mid \sigma^2) \propto \prod_j N(\bar y_j \mid \theta_j,\sigma_j^2)\; \prod_j N(\theta_j \mid \mu,\tau^2)\; p(\mu,\tau^2)
$$

$$
p(\theta_j \mid \text{rest}) \propto N(\bar y_j \mid \theta_j,\sigma_j^2)\, N(\theta_j \mid \mu,\tau^2)
\propto N\!\left(\theta_j \,\middle|\, \frac{\bar y_j/\sigma_j^2 + \mu/\tau^2}{1/\sigma_j^2 + 1/\tau^2},\; \frac{1}{1/\sigma_j^2 + 1/\tau^2}\right)
$$

$$
p(\mu \mid \text{rest}) \propto \prod_j N(\theta_j \mid \mu,\tau^2)\, p(\mu,\tau^2)
\propto N(\mu \mid \bar\theta,\, \tau^2/J)\, p(\mu,\tau^2), \qquad \bar\theta = \tfrac{1}{J}\sum_j \theta_j
$$

$$
p(\tau^2 \mid \text{rest}) \propto (\tau^2)^{-J/2}\exp\Big\{-\tfrac{1}{2}\sum_j(\theta_j-\mu)^2/\tau^2\Big\}\, p(\mu,\tau^2)
\propto \text{Inv-Gamma}\!\left(\tau^2 \,\middle|\, \tfrac{J}{2}-1,\; \tfrac{1}{2}\sum_j(\theta_j-\mu)^2\right) p(\mu,\tau^2)
$$

If $p(\mu,\tau^2) \propto \text{IG}(\tau^2 \mid \alpha,\beta)$, we can use Gibbs steps (i.e., sample these complete conditionals directly)... See the R code for this lecture.
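The lecture's own R code is not reproduced here; the following is a minimal sketch of the Gibbs sampler just described, reusing the simulated ybar and sigma2j from above, with a flat prior on $\mu$ and $\tau^2 \sim \text{IG}(\alpha,\beta)$ (the values of $\alpha$ and $\beta$ are illustrative).

```r
# Gibbs sampler for the hierarchical normal model (sketch, not the lecture's code).
alpha <- 1; beta <- 1                       # illustrative IG(alpha, beta) hyperparameters
M     <- 5000
theta <- matrix(NA, M, J); mu <- tau2 <- numeric(M)
theta[1, ] <- ybar; mu[1] <- mean(ybar); tau2[1] <- var(ybar)   # starting state

for (m in 2:M) {
  # theta_j | rest ~ N( (ybar_j/sigma_j^2 + mu/tau^2) * V_j , V_j )
  Vj <- 1 / (1/sigma2j + 1/tau2[m - 1])
  theta[m, ] <- rnorm(J, (ybar/sigma2j + mu[m - 1]/tau2[m - 1]) * Vj, sqrt(Vj))

  # mu | rest ~ N( mean(theta) , tau^2/J )          (flat prior on mu)
  mu[m] <- rnorm(1, mean(theta[m, ]), sqrt(tau2[m - 1] / J))

  # tau^2 | rest ~ Inv-Gamma( alpha + J/2 , beta + sum((theta_j - mu)^2)/2 )
  S <- sum((theta[m, ] - mu[m])^2)
  tau2[m] <- 1 / rgamma(1, shape = alpha + J/2, rate = beta + S/2)
}

colMeans(theta[-(1:1000), ])   # posterior means of the theta_j after burn-in
```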

Aside: Marginal Density Estimates

MCMC (and other Monte Carlo methods) can be used to obtain a sample from the joint posterior $p(\tau_1,\ldots,\tau_d \mid y)$. What if we just want $p(\tau_1 \mid y)$? Let $(\tau_1^{(m)},\ldots,\tau_d^{(m)})$, $m = 1,\ldots,M$, be a sample from $p(\tau_1,\ldots,\tau_d \mid y)$.

Method 1:
$$
P[\tau_1 \le t \mid y] = E\big[\mathbf{1}_{\{\tau_1 \le t\}} \mid y\big] \approx \frac{1}{M}\sum_m \mathbf{1}_{\{\tau_1^{(m)} \le t\}},
$$
so $\tau_1^{(1)},\ldots,\tau_1^{(M)}$ is a sample from $p(\tau_1 \mid y)$, and a histogram or density estimate based on the $\tau_1^{(m)}$ estimates $p(\tau_1 \mid y)$. Its variance is
$$
\mathrm{Var}\Big(\tfrac{1}{M}\sum_m \mathbf{1}_{\{\tau_1^{(m)} \le t\}} \,\Big|\, y\Big) = \tfrac{1}{M}\,\mathrm{Var}\big(\mathbf{1}_{\{\tau_1 \le t\}} \mid y\big).
$$

Method 2:
$$
P[\tau_1 \le t \mid y] = E\big\{P[\tau_1 \le t \mid y,\tau_2,\ldots,\tau_d] \,\big|\, y\big\} \approx \frac{1}{M}\sum_m P[\tau_1 \le t \mid y,\tau_2^{(m)},\ldots,\tau_d^{(m)}];
$$
differentiating, we see that $\frac{1}{M}\sum_m p(\tau_1 \mid y,\tau_2^{(m)},\ldots,\tau_d^{(m)})$ also estimates $p(\tau_1 \mid y)$, with variance
$$
\mathrm{Var}\Big(\tfrac{1}{M}\sum_m P[\tau_1 \le t \mid y,\tau_2^{(m)},\ldots,\tau_d^{(m)}] \,\Big|\, y\Big) = \tfrac{1}{M}\,\mathrm{Var}\big(P[\tau_1 \le t \mid y,\tau_2,\ldots,\tau_d] \mid y\big).
$$

Which is better? Since
$$
\mathrm{Var}\big(\mathbf{1}_{\{\tau_1 \le t\}} \mid y\big) = \mathrm{Var}\big(P[\tau_1 \le t \mid y,\tau_2,\ldots,\tau_d] \mid y\big) + E\big[\mathrm{Var}\big(\mathbf{1}_{\{\tau_1 \le t\}} \mid y,\tau_2,\ldots,\tau_d\big) \mid y\big],
$$
the second method can be expected to have lower variance (the Rao-Blackwellized density estimate: Casella & Robert, 1996, Biometrika).
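Continuing the Gibbs sketch above (same illustrative data and draws), the two estimates of $p(\theta_1 \mid y)$ can be compared as follows; the Rao-Blackwellized estimate averages the complete conditional $N(\theta_1 \mid \hat\theta_1, V_1)$ over the $(\mu^{(m)}, (\tau^2)^{(m)})$ draws.

```r
# Marginal density of theta_1: kernel estimate vs. Rao-Blackwellized estimate.
keep <- 1001:M                                   # discard burn-in
grid <- seq(min(theta[keep, 1]) - 1, max(theta[keep, 1]) + 1, length = 200)

# Method 1: density estimate from the theta_1 draws themselves.
est1 <- density(theta[keep, 1])

# Method 2: average N(theta_1 | thetahat_1, V_1) over the (mu, tau2) draws.
rb <- sapply(grid, function(t) {
  V1 <- 1 / (1/sigma2j[1] + 1/tau2[keep])
  mean(dnorm(t, (ybar[1]/sigma2j[1] + mu[keep]/tau2[keep]) * V1, sqrt(V1)))
})

plot(est1, main = "p(theta_1 | y)")              # Method 1
lines(grid, rb, lty = 2)                         # Method 2: smoother, lower variance
```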

Summary

- As an alternative to direct iid sampling from the joint posterior, one can do successive substitution sampling of the complete conditionals. This sets up a Monte Carlo Markov chain whose stationary distribution is the joint posterior.
- The complete conditionals are often easy to identify. Even if they cannot be sampled directly with Gibbs steps, they can be sampled using the Metropolis-Hastings rejection method.
- The complete conditionals are also useful for computing Rao-Blackwellized marginal density estimates.
- Heavy autocorrelation between draws in MCMC means we have to throw away an initial burn-in segment of draws, and we may also have to subsample or block-average the draws we keep to reduce autocorrelation.
- MCMC is easy to set up but slow to operate. Much of the art of MCMC is choosing good parametrizations so that the autocorrelation between MCMC draws is low.