MH I

Gopi Goswami (goswami@stat.harvard.edu) - January 2, 2006

- the Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution
- a lot of Bayesian methods rely on the use of the MH algorithm and its famous cousin, the Gibbs sampler

MH II

- goal is to sample from the target density: $\pi(x) \propto \exp[-H(x)/\beta]$
- the above form is known as the Boltzmann form of a distribution
- $H(x)$ is called the fitness or energy function
- $\beta$ is called the temperature
- EXAMPLE: target density for $X \sim \mathrm{Normal}_1(\mu, \sigma^2)$:

  $\pi(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right] \propto \exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right]$

  here $H(x) = \frac{1}{2\sigma^2}(x-\mu)^2$ and $\beta = 1$

MH III

- going to use a proposal distribution pdf to generate guesses or proposals for the draws from the target: $T(x, \cdot)$
- EXAMPLE: given $x$, Normal proposal pdf for $Y \sim \mathrm{Normal}_1(x, \tau^2)$:

  $T(x, \cdot) \equiv \mathrm{Normal}_1(x, \tau^2; \cdot)$

MH IV

- going to have to evaluate the proposal density pdf: $T(x, y)$
- make sure $T(y, x) > 0$ whenever $T(x, y) > 0$, otherwise the sampler will not work
- also, don't just assume $T(y, x) = T(x, y)$; this is a very common trap for beginners
- EXAMPLE: Normal proposal density:

  $T(x, y) \equiv \mathrm{Normal}_1(x, \tau^2; y) \propto \exp\left[-\frac{1}{2\tau^2}(y-x)^2\right]$
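
to make the trap concrete, here is a small sketch (not from the slides) of an asymmetric proposal: a multiplicative random walk on $(0, \infty)$. The move looks symmetric, but the change-of-variables factor $1/y$ in the density makes $T(x, y) \neq T(y, x)$; the name log_T and the scale tau are illustrative:

```python
import math

# Multiplicative random walk on (0, inf): propose y = x * exp(tau * Z), Z ~ N(0, 1),
# i.e., log y ~ Normal(log x, tau^2). The density of y picks up a 1/y Jacobian
# factor, so T(x, y) != T(y, x) even though the move "feels" symmetric.
def log_T(x, y, tau=0.5):
    z = (math.log(y) - math.log(x)) / tau
    return -0.5 * z * z - math.log(y * tau * math.sqrt(2.0 * math.pi))

print(log_T(1.0, 2.0))  # log T(x, y)
print(log_T(2.0, 1.0))  # log T(y, x): different, so the Hastings ratio needs both
```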

MH V

- acceptance probability used in the Metropolis-Hastings algorithm:

  $\alpha(x, y) = \min\left\{1, \frac{\pi(y)\,T(y, x)}{\pi(x)\,T(x, y)}\right\}$

- if the proposal is symmetric, i.e., if $T(x, y) = T(y, x)$, then we have:

  $\alpha(x, y) = \min\left\{1, \frac{\pi(y)}{\pi(x)}\right\}$

  if $\pi(y) \geq \pi(x)$ then $\alpha(x, y) = 1$; if $\pi(y) < \pi(x)$ then $\alpha(x, y) < 1$
- aside: if the proposal is symmetric then the algorithm is called the Metropolis algorithm
- note: since we deal with ratios above, it is enough to know $\pi(\cdot)$ and $T(\cdot, \cdot)$ up to a proportionality constant

MH VI

the MH algorithm (for $N$-many iterations):

1. initialize: set $t = 0$ and get a starting value $x^{(t)}$
2. propose: generate $y$ from $T(x^{(t)}, \cdot)$
3. eval: evaluate the acceptance probability $\alpha(x^{(t)}, y)$
4. move: generate $u$ from $\mathrm{Uniform}(0, 1)$ and set

   $x^{(t+1)} = \begin{cases} y & \text{if } u \leq \alpha(x^{(t)}, y) \\ x^{(t)} & \text{otherwise} \end{cases}$

5. if $t \geq N$ stop, otherwise set $t = t + 1$ and go to step 2

aside: it is enough to compute $\alpha(\cdot, \cdot)$ without the min part. Why? Because $u \leq 1$, so comparing $u$ against $\min\{1, r\}$ accepts exactly when comparing $u$ against $r$ does.
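
a minimal sketch of these five steps in Python (the interface is assumed, not from the slides: pi is the unnormalized target density, propose(x) draws from $T(x, \cdot)$, and T(x, y) evaluates the proposal density):

```python
import random

def metropolis_hastings(pi, propose, T, x0, N):
    """Run the 5-step MH algorithm and return the chain x^(0), ..., x^(N)."""
    x = x0                                   # step 1: initialize
    chain = [x]
    for t in range(N):
        y = propose(x)                       # step 2: propose y ~ T(x, .)
        ratio = (pi(y) * T(y, x)) / (pi(x) * T(x, y))
        alpha = min(1.0, ratio)              # step 3: acceptance probability
        u = random.uniform(0.0, 1.0)         # step 4: move
        if u <= alpha:
            x = y
        chain.append(x)                      # step 5 is the loop itself
    return chain
```

per the aside, the min(1.0, ratio) could be dropped: whenever ratio >= 1, any u <= 1 accepts either way.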

MH VII

process the samples $\{x^{(t)} : t = 0, 1, \ldots, N\}$:

- discard some initial samples, say the first $N/10$, as the burn-in period; for notational ease, reindex the rest as $\{x^{(t)} : t = 1, 2, \ldots, M\}$
- use the rest for inference
- EXAMPLE: to estimate the mean of the target density, use the estimator $\frac{1}{M}\sum_{t=1}^{M} x^{(t)}$
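
continuing the sketch above (same hypothetical pi, propose, T), the post-processing might look like:

```python
N = 10_000
raw = metropolis_hastings(pi, propose, T, x0=0.0, N=N)

burn_in = N // 10              # discard, say, N/10 initial samples
kept = raw[burn_in:]           # reindex as x^(1), ..., x^(M)
M = len(kept)

mean_hat = sum(kept) / M       # the estimator (1/M) * sum_t x^(t)
```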

MH VIII

a simple example, set up:

- target: $X \sim \mathrm{Normal}_1(\mu, \sigma^2)$
- proposal: $Y \sim \mathrm{Normal}_1(x, \tau^2)$

so we have:

- $\pi(x) \propto \exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right]$
- $T(x, \cdot) \equiv \mathrm{Normal}_1(x, \tau^2; \cdot)$, so $T(x, y) \propto \exp\left[-\frac{1}{2\tau^2}(y-x)^2\right]$; note it is symmetric!
- $\alpha(x, y) = \min\left\{1, \frac{\pi(y)}{\pi(x)}\right\} = \min\left\{1, \exp\left[-\frac{1}{2\sigma^2}(y-\mu)^2 + \frac{1}{2\sigma^2}(x-\mu)^2\right]\right\}$

important aside: all the above expressions are nice and fine, but while implementing, do all your computations in log-scale
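
following the aside, a log-scale sketch of the acceptance step for this example; since the proposal is symmetric, $T$ cancels and only $\log\pi$ is needed:

```python
import math
import random

def log_alpha(x, y, mu, sigma):
    # log of the ratio pi(y)/pi(x) for the Normal(mu, sigma^2) target;
    # staying in log-scale avoids overflow/underflow in the exponentials
    return (-(y - mu) ** 2 + (x - mu) ** 2) / (2.0 * sigma ** 2)

# accept the proposal y when log(u) <= log_alpha(x, y, mu, sigma):
x, mu, sigma, tau = 0.0, 3.0, 2.0, 1.0   # illustrative values
y = random.gauss(x, tau)
if math.log(random.random()) <= log_alpha(x, y, mu, sigma):
    x = y
```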

MH IX

some general guidelines for implementing a typical MH sampler:

- tweak your proposal $T(x, y)$ so that the average acceptance rate is (see [2, Gelman et al.]):
  - in $[40\%, 50\%]$ if $x, y \in \mathbb{R}^1$
  - in $[20\%, 30\%]$ if $x, y \in \mathbb{R}^d$, $d > 1$
- a too high (above 70%) or too low (below 10%) acceptance rate is a sign of a bad choice of $T(x, y)$
- start your sampler from dispersed starting values and check that the chains converge to the same region of the sample space
- propose to move very highly correlated variables together
- do not use very high-dimensional proposals; such proposals are rarely accepted
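
a crude tuning loop in the spirit of the first guideline, sketched for the Normal example of MH VIII (the starting tau, adjustment factor, and chain length are all made-up choices, not from the slides):

```python
import math
import random

def run_chain(tau, n=2_000, mu=3.0, sigma=2.0):
    """Metropolis chain for the Normal(mu, sigma^2) target from MH VIII."""
    x, chain = 0.0, []
    for _ in range(n):
        y = random.gauss(x, tau)
        log_a = (-(y - mu) ** 2 + (x - mu) ** 2) / (2.0 * sigma ** 2)
        if math.log(random.random()) <= log_a:
            x = y
        chain.append(x)
    return chain

def acceptance_rate(chain):
    # for a continuous proposal, repeating a value almost surely means rejection
    moves = sum(1 for a, b in zip(chain, chain[1:]) if a != b)
    return moves / (len(chain) - 1)

tau = 8.0                        # deliberately too bold a starting proposal
for _ in range(25):
    rate = acceptance_rate(run_chain(tau))
    if rate > 0.50:
        tau *= 1.2               # accepting too often: take bolder steps
    elif rate < 0.40:
        tau /= 1.2               # accepting too rarely: take smaller steps
    else:
        break                    # landed in the 1-D target band [40%, 50%]
print(tau, rate)
```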

EM I

- goal is to find the Maximum Likelihood Estimator (MLE) or the Maximum A Posteriori (MAP) estimator
- the Expectation-Maximization (EM) algorithm is the most popular method for the above
- the above maximization problem involves two steps: the Expectation step (E-step) and the Maximization step (M-step)

EM II

set up:

- data: $y := (y_1, y_2, \ldots, y_n)$
- parameter of interest: $\theta$
- nuisance parameter or missing data: $z$

E-step:

$Q(\theta \mid \theta^{(t)}) := \begin{cases} \mathrm{E}_{\theta^{(t)}}[\log p(\theta, z \mid y)] = \int \log p(\theta, z \mid y)\, p(z \mid \theta^{(t)}, y)\, dz & \text{for MAP} \\ \mathrm{E}_{\theta^{(t)}}[\log p(z, y \mid \theta)] = \int \log p(z, y \mid \theta)\, p(z \mid \theta^{(t)}, y)\, dz & \text{for MLE} \end{cases}$

M-step:

$\theta^{(t+1)} := \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$

EM III

the EM algorithm with $\epsilon$-close stopping:

1. initialize: set $t = 0$ and get a starting value $\theta^{(t)}$
2. E-step: get $Q(\theta \mid \theta^{(t)})$
3. M-step: get $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$
4. if $\|\theta^{(t+1)} - \theta^{(t)}\| \leq \epsilon$ stop, otherwise set $t = t + 1$ and go to step 2

in some easy cases you can combine the E-step and the M-step, if you have a closed-form expression for $Q(\theta \mid \theta^{(t)})$ and (hence) for $\arg\max_{\theta} Q(\theta \mid \theta^{(t)})$
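
a generic sketch of this loop (assuming, for simplicity, a scalar $\theta$ and a user-supplied EM_step that performs the combined E-step and M-step; both names are placeholders):

```python
def em(EM_step, theta0, eps=1e-8, max_iter=1_000):
    """Iterate theta^(t+1) = argmax_theta Q(theta | theta^(t)) until eps-close.

    EM_step(theta_t) must do the E-step and M-step together and return
    theta^(t+1); this is the 'combined' case mentioned above.
    """
    theta = theta0                          # step 1: initialize
    for t in range(max_iter):
        theta_new = EM_step(theta)          # steps 2-3: E-step + M-step
        if abs(theta_new - theta) <= eps:   # step 4: eps-close stopping
            return theta_new
        theta = theta_new
    return theta
```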

EM IV

EXAMPLE: we want the MAP estimator of $\mu$ from (with $\sigma^2$ unknown):

- $y_i \sim \mathrm{Normal}_1(\mu, \sigma^2)$, $i = 1, 2, \ldots, n$
- $\mu \sim \mathrm{Normal}_1(\mu_0, \tau_0^2)$
- $p(\log \sigma) \propto 1$

so we have:

- data: $y := (y_1, y_2, \ldots, y_n)$
- parameter of interest: $\theta = \mu$
- nuisance parameter: $z = \sigma^2$

EM V

we observe:

$\log p(\theta, z \mid y) = \log p(\mu, \sigma^2 \mid y) = \mathrm{const} - \frac{1}{2\tau_0^2}(\mu - \mu_0)^2 - (n+1)\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2$

we also note:

$p(z \mid \theta^{(t)}, y) = p(\sigma^2 \mid \mu^{(t)}, y) \equiv \mathrm{Inv}\text{-}\chi^2\left(n, \frac{1}{n}\sum_{i=1}^{n}(y_i - \mu^{(t)})^2\right)$

EM VI

E-step: only compute the expectations of the terms which involve $\theta$, because the other terms are not useful in the M-step; so we note:

$Q(\theta \mid \theta^{(t)}) = Q(\mu \mid \mu^{(t)}) = \mathrm{const} - \frac{1}{2\tau_0^2}(\mu - \mu_0)^2 - \mathrm{E}_{\mu^{(t)}}\left[\frac{1}{2\sigma^2}\right]\sum_{i=1}^{n}(y_i - \mu)^2$

$= \mathrm{const} - \frac{1}{2\tau_0^2}(\mu - \mu_0)^2 - \frac{1}{2}\left\{\frac{1}{n}\sum_{i=1}^{n}(y_i - \mu^{(t)})^2\right\}^{-1}\sum_{i=1}^{n}(y_i - \mu)^2$

(the scaled inverse-$\chi^2$ distribution from EM V gives $\mathrm{E}_{\mu^{(t)}}[1/\sigma^2] = \left\{\frac{1}{n}\sum_{i=1}^{n}(y_i - \mu^{(t)})^2\right\}^{-1}$)

we are ignoring the following term for the reason mentioned above: $-(n+1)\,\mathrm{E}_{\mu^{(t)}}[\log\sigma]$

EM VII

M-step: note $Q(\theta \mid \theta^{(t)}) = Q(\mu \mid \mu^{(t)})$ is a quadratic in $\mu$ and hence easy to maximize; taking derivatives once (and then twice) one can show:

$\theta^{(t+1)} := \arg\max_{\theta} Q(\theta \mid \theta^{(t)}) = \arg\max_{\mu} Q(\mu \mid \mu^{(t)}) = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\frac{1}{n}\sum_{i=1}^{n}(y_i - \mu^{(t)})^2}\,\bar y}{\frac{1}{\tau_0^2} + \frac{n}{\frac{1}{n}\sum_{i=1}^{n}(y_i - \mu^{(t)})^2}} =: \mu^{(t+1)}$
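
putting EM IV through EM VII together, a runnable sketch (the data are simulated and all numeric settings are illustrative, not from the slides):

```python
import random

random.seed(1)
n, mu_true, sigma_true = 200, 5.0, 2.0
y = [random.gauss(mu_true, sigma_true) for _ in range(n)]   # simulated data

mu0, tau0_sq = 0.0, 100.0          # prior: mu ~ Normal(mu0, tau0^2)
ybar = sum(y) / n

mu, eps = 0.0, 1e-10               # starting value, stopping tolerance
while True:
    # E-step: E[1/sigma^2 | mu^(t), y] = 1/v with v = (1/n) sum_i (y_i - mu)^2
    v = sum((yi - mu) ** 2 for yi in y) / n
    # M-step: precision-weighted average of the prior mean and the data mean
    mu_new = (mu0 / tau0_sq + n * ybar / v) / (1.0 / tau0_sq + n / v)
    if abs(mu_new - mu) <= eps:
        break
    mu = mu_new

print(mu)   # close to ybar, shrunk slightly toward mu0
```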

EM VIII

EXAMPLE: we want the MLE for the mixture proportions $(\pi_1, \pi_2, \ldots, \pi_k)$:

- we have $k$-many known densities $f_j(\cdot)$, $j = 1, 2, \ldots, k$
- there are $k$-many unknown proportions $\pi_j$, $j = 1, 2, \ldots, k$, with $\sum_{j=1}^{k}\pi_j = 1$
- $y_i \sim \sum_{j=1}^{k}\pi_j f_j(\cdot)$, $i = 1, 2, \ldots, n$

so we have:

- data: $y := (y_1, y_2, \ldots, y_n)$
- parameter of interest: $\theta := (\pi_1, \pi_2, \ldots, \pi_k)$
- introduce missing data: $z := (z_1, z_2, \ldots, z_n)$ such that $[z_i \mid \theta] \sim \mathrm{Multinomial}(1, \theta)$, $i = 1, 2, \ldots, n$

note: here we need to cook up the missing data in such a way that integrating / summing it out gives us back our original model; see the next slide

EM IX

now we can rewrite our model as:

- $[y_i \mid z_i = e_j, \theta] \sim f_j(\cdot)$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, k$
- $p(z_i = e_j \mid \theta) = \pi_j$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, k$

here $e_j$ is the $j$-th canonical vector for $j = 1, 2, \ldots, k$ (e.g., $e_1 = (1, 0, 0, \ldots, 0)$, etc.)

check that: $\sum_z p(y, z \mid \theta) = p(y \mid \theta)$

so we have:

$\log p(z, y \mid \theta) = \sum_{i=1}^{n}\sum_{j=1}^{k} z_{ij} \log\{\pi_j f_j(y_i)\}$

and

$p(z_{ij} = 1 \mid y, \theta^{(t)}) = p(z_i = e_j \mid y, \theta^{(t)}) = \frac{\pi_j^{(t)} f_j(y_i)}{\sum_{j'=1}^{k}\pi_{j'}^{(t)} f_{j'}(y_i)} =: a_{ij}^{(t)}, \text{ say}$

EM X

E-step:

$Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n}\sum_{j=1}^{k} \mathrm{E}_{\theta^{(t)}}(z_{ij}) \log\{\pi_j f_j(y_i)\} = \sum_{i=1}^{n}\sum_{j=1}^{k} a_{ij}^{(t)} \log\{\pi_j f_j(y_i)\}$

M-step: it is a constrained maximization problem with $\sum_{j=1}^{k}\pi_j = 1$, which gives $\theta^{(t+1)} := \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$, i.e., componentwise:

$\pi_j^{(t+1)} = \frac{\sum_{i=1}^{n} a_{ij}^{(t)}}{\sum_{i=1}^{n}\sum_{j=1}^{k} a_{ij}^{(t)}} = \frac{1}{n}\sum_{i=1}^{n} a_{ij}^{(t)}$
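
a runnable sketch of this mixture EM with $k = 2$ known Normal densities (the component parameters and the true proportions (0.3, 0.7) are made up for illustration):

```python
import math
import random

def normal_pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

# two KNOWN component densities f_1, f_2; only the proportions are unknown
f = [lambda x: normal_pdf(x, 0.0, 1.0), lambda x: normal_pdf(x, 4.0, 1.0)]

random.seed(2)   # simulate data with true proportions (0.3, 0.7)
y = [random.gauss(0.0, 1.0) if random.random() < 0.3 else random.gauss(4.0, 1.0)
     for _ in range(500)]

pi_hat = [0.5, 0.5]                               # starting value
for t in range(200):
    # E-step: responsibilities a_ij = pi_j f_j(y_i) / sum_j' pi_j' f_j'(y_i)
    a = []
    for yi in y:
        w = [pi_hat[j] * f[j](yi) for j in range(2)]
        total = sum(w)
        a.append([wj / total for wj in w])
    # M-step: pi_j^(t+1) = (1/n) sum_i a_ij
    pi_hat = [sum(ai[j] for ai in a) / len(y) for j in range(2)]

print(pi_hat)                                     # should be near [0.3, 0.7]
```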

EM XI

- the tricky (theoretical) part of the EM algorithm is that many missing data schemes may give rise to the same model under consideration, but not all of them are helpful
- EXAMPLE: in the mixture proportions example, defining $z$ the following way is not helpful at all (although it satisfies $\sum_z p(y, z \mid \theta) = p(y \mid \theta)$):

  $p(z_i = j \mid \theta) = \pi_j, \quad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, k$

  note here $z_i$ is of dimension 1, as opposed to $k$ as before; the indicator coding is what makes $\log p(z, y \mid \theta)$ linear in $z$, so the E-step reduces to plugging in $\mathrm{E}_{\theta^{(t)}}(z_{ij})$
- finding the best missing data scheme is an art, really; check out The Art of Data Augmentation [5, van Dyk et al.]

References

[1] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (C/R: p22-37). Journal of the Royal Statistical Society, Series B, Methodological, 39:1-22, 1977.

[2] A. Gelman, G. O. Roberts, and W. R. Gilks. Efficient Metropolis jumping rules. In Bayesian Statistics 5: Proceedings of the Fifth Valencia International Meeting, pages 599-607, 1996.

[3] Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using multiple sequences (Disc: p483-501, 503-511). Statistical Science, 7:457-472, 1992.

[4] Charles J. Geyer. Practical Markov chain Monte Carlo (Disc: p483-503). Statistical Science, 7:473-483, 1992.

[5] David A. van Dyk and Xiao-Li Meng. The art of data augmentation (Pkg: p1-111). Journal of Computational and Graphical Statistics, 10(1):1-50, 2001.