Tutorial Lectures on MCMC II


Tutorial Lectures on MCMC II. Sujit Sahu (in close association with Gareth Roberts). Utrecht: August 2000. You are here because: you would like to see more advanced methodology, and maybe some computer illustration.

Outline: Convergence Study for Better Design; Graphical Models; BUGS illustrations; Bayesian Model Choice; Reversible Jump; Adaptive Methods.

Rate of Convergence. Let ||.|| denote a suitable divergence measure between distributions. The rate of convergence r is given by ||P^n(x, .) − π|| ≤ M(x) r^n, where π is the target density and M(x) is a non-negative function taking care of the effect of the starting point x. For example, if π is the density of a bivariate normal with correlation ρ, then the rate of the two-variable Gibbs sampler is ρ².

The rate r is a global measure of convergence: it does not depend on x, the starting point. It not only tells how fast the chain converges, but also indicates the auto-correlation present in the MCMC sequence. The last point is concerned with the nse (numerical standard error) of ergodic averages.

Use the rate to design MCMC. Recall: r = ρ² for the bivariate normal example. Can we have faster convergence if correlations are reduced? Answer: yes and no. No, because the correlation structure matters a great deal in high dimensions.
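The bivariate-normal example can be checked by simulation. The sketch below is my own illustration (not from the lectures): it runs a two-variable Gibbs sampler on a standard bivariate normal target with correlation ρ, for which the x-subchain is AR(1) with coefficient ρ², so the lag-1 autocorrelation of the x-draws estimates the convergence rate.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Gibbs sampler for (X, Y) ~ N(0, [[1, rho], [rho, 1]]).

    The full conditionals are X | Y = y ~ N(rho*y, 1 - rho^2),
    and symmetrically for Y | X = x.
    """
    rng = np.random.default_rng(seed)
    x = y = 5.0                     # deliberately poor starting point
    xs = np.empty(n_iter)
    for t in range(n_iter):
        x = rho * y + np.sqrt(1 - rho**2) * rng.standard_normal()
        y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal()
        xs[t] = x
    return xs

xs = gibbs_bivariate_normal(rho=0.9, n_iter=200_000)
xs = xs[1000:]                      # discard burn-in
lag1 = np.corrcoef(xs[:-1], xs[1:])[0, 1]
print(lag1)                         # close to rho^2 = 0.81
```

With ρ = 0.9 the chain mixes slowly (rate 0.81); reducing ρ reduces the lag-1 autocorrelation, which is the "yes" part of the answer above.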

That is why Roberts and Sahu (1997) investigate the effects of blocking, parameterisation, correlation structure and updating schemes on the Gibbs sampler in the Gaussian case.

Extrapolation. Extrapolate the results to non-normal cases with a similar correlation structure. Some of this should work because of the asymptotic normality of the posterior distribution. Two further papers, Roberts and Sahu (2000) and Sahu and Roberts (1999), investigate the issues in detail.

We show: under a Gaussian setup, the rate of convergence of the EM/ECM algorithm is the same as the rate of convergence of the Gibbs sampler. Implications: if a given problem is easy (hard) for the EM, then it is easy (hard) for the Gibbs too; and an improvement strategy for one algorithm can be used to hasten convergence of the other.

Extra benefits of first running the EM: EM convergence is easy to assess; the lessons learned can be used to implement the Gibbs sampler; and the last EM iterate can be used as the starting point in MCMC. Further, we also prove the intuitive result that, under conditions, the Gibbs sampler may converge faster than the EM algorithm, which does not use the assumed proper prior distributions.

Example: Sahu and Roberts (1999). The one-way random-effects model y_ij = μ + η_i + e_ij, with η_i ~ N(0, σ_η²) and e_ij ~ N(0, σ_e²). This is called NCEN (not centred). Priors: gamma priors on the precisions σ_η^{-2} and σ_e^{-2}; a flat prior for μ.

HCEN (hierarchically centred; Gelfand, Sahu and Carlin, 1995): y_ij = λ_i + e_ij, with λ_i ~ N(μ, σ_η²) and e_ij ~ N(0, σ_e²). AUXA (auxiliary; Meng and Van Dyk, 1997): a reparameterisation of NCEN through an auxiliary working parameter.

Convergence comparison (Rate = principal eigenvalue of the lag-1 auto-correlation matrix of (μ, σ_η², σ_e²)):

                 NCEN    AUXA    HCEN
ECM    Rate      0.832   0.845   0.287
       NIT       3431    150     15
       CPU       23.32   5.57    0.25
Gibbs  Rate      0.786   0.760   0.526

Outline: Convergence Study; Graphical Models; BUGS illustrations; Bayesian Model Choice; Reversible Jump; Adaptive Methods.

Directed acyclic graphs (DAG). A DAG is a graph in which all edges are directed and there are no directed loops. For example, the DAG with vertices a, b, c, d (edges a → c, b → c, c → d) expresses the natural factorisation of a joint distribution into factors, each giving the distribution of a variable given its parents: π(a, b, c, d) = π(a) π(b) π(c | a, b) π(d | c).

The role of graphical models. Graphical modelling provides a powerful language for specifying and understanding statistical models. Graphs consist of vertices representing variables, and edges (directed or otherwise) that express conditional dependence properties. For setting up MCMC, the graphical structure assists in identifying which terms need to be in a full conditional.

Markov property. Variables are conditionally independent of their non-descendants, given their parents. In the graph shown (vertices a, b, c, d, e, f), c is conditionally independent of (f, e) given (a, b, d). Find other conditional independence relationships yourself.

Since π(v | rest) ∝ π(all vertices) as a function of v, the full conditional is π(v | rest) ∝ π(v | pa(v)) ∏_{w : v ∈ pa(w)} π(w | pa(w)), where pa = parents and "rest" means everything but v. That is, there is one term for the variable itself, and one for each of its children.
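The rule above is easy to mechanise. A minimal sketch (function and variable names are my own) that reads off the full-conditional factors from a table of parents, using the DAG of the earlier slide:

```python
def full_conditional_terms(v, parents):
    """Return the factors pi(. | pa(.)) involving vertex v:
    one term for v itself, and one for each child of v."""
    terms = [(v, tuple(parents[v]))]
    for w, pa in parents.items():
        if v in pa:                 # w is a child of v
            terms.append((w, tuple(pa)))
    return terms

# DAG from the slides: a -> c, b -> c, c -> d
parents = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}

print(full_conditional_terms("c", parents))
# [('c', ('a', 'b')), ('d', ('c',))], i.e. pi(c|a,b) * pi(d|c)
```

This is essentially what BUGS does when it assembles a Gibbs sampler from a graphical model.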

In the graph above, π(c | rest) ∝ π(c | a, b) π(d | c). This leads us to...

Outline: Theoretical Convergence Study; Graphical Models; BUGS illustrations; Bayesian Model Choice; Reversible Jump; Adaptive Methods.

BUGS (Bayesian inference Using Gibbs Sampling) is a general-purpose computer program for doing MCMC. It exploits the structure of graphical models to set up the Gibbs sampler. It is (still) freely available from www.mrc-bsu.cam.ac.uk. We will go through two non-trivial examples. (Sorry, no introduction to BUGS.)

Example: regression models with fat tails and skewness. Assume y_i = x_i'β + δ z_i + ε_i. Specification for ε_i: (1) normal, ε_i ~ N(0, σ²); or (2) Student-t, ε_i ~ t_ν(0, σ²). Specification for z_i: (1) z_i ~ N(0, 1) restricted to z_i > 0; or (2) z_i ~ t_ν(0, 1) restricted to z_i > 0.

The regression parameters β and the skewness parameter δ are unrestricted; σ² and the remaining parameters are given appropriate prior distributions. All details, including a recent technical report and computer programs, are available from my home page: www.maths.soton.ac.uk/staff/sahu/utrecht
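To see what this construction buys, one can simulate data of this form. A hedged sketch (parameter values and names are my own illustration, not those of the technical report): a positive half-normal z_i shifts and skews the response, while Student-t errors give fat tails.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
x = rng.uniform(0, 1, n)
beta0, beta1, delta = 1.0, 2.0, 1.5     # illustrative values only

z = np.abs(rng.standard_normal(n))       # half-normal: induces skewness
eps = rng.standard_t(df=4, size=n)       # Student-t: fat tails
y = beta0 + beta1 * x + delta * z + eps

resid = y - (beta0 + beta1 * x)
print(resid.mean())   # approx delta * E|Z| = 1.5 * sqrt(2/pi) ~ 1.20
```

The nonzero mean of the residuals shows why δ and the intercept must be handled together when fitting such models.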

Please make your own notes for BUGS here. Basic 5 steps: check model, load data, compile, load inits, gen inits. Then update (from the Model menu) and Samples (from the Inference menu).

Example: a Bayesian test for determining the number of components in mixtures, developing further the ideas of Mengersen and Robert (1996). Let d(f, g) denote the Kullback–Leibler distance between densities f and g, and let f_k(y) = Σ_{j=1}^k w_j f(y; θ_j) be the k-component mixture. We develop an easy-to-use approximation for d(f_k, f_{k+1}).

The Bayesian test evaluates the posterior probability P(d(f_k, f_{k+1}) < ε | data). A simple decision rule is to select the k-component model if the above posterior probability is high (greater than a fixed cut-off, for example). Again, the paper and computer programs are available from my home page.
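The Kullback–Leibler distance between mixture fits can be approximated by straightforward numerical integration. An illustrative sketch (my own, not the approximation developed in the paper) comparing a two-component normal mixture with a single normal:

```python
import numpy as np

def normal_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(y, weights, mus, sigmas):
    return sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, mus, sigmas))

def kl_distance(p, q, grid):
    """KL(p || q) approximated by a Riemann sum on a fine grid."""
    py, qy = p(grid), q(grid)
    return float(np.sum(py * np.log(py / qy)) * (grid[1] - grid[0]))

grid = np.linspace(-10.0, 10.0, 20001)
f2 = lambda y: mixture_pdf(y, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])  # two components
f1 = lambda y: mixture_pdf(y, [1.0], [0.0], [1.0])                  # one component

kl = kl_distance(f2, f1, grid)
print(kl)   # positive: the two fits genuinely differ
```

If d(f_k, f_{k+1}) computed this way is near zero, the extra component adds essentially nothing, which is exactly the intuition behind the test.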

How to choose normal mixtures? [Figure: densities of five normal mixtures corresponding to distance = 0.1, 0.2, 0.3, 0.4 and 0.5; the height at 0 decreases as the distance increases.]

Outline: Convergence Study; Graphical Models; BUGS illustrations; Bayesian Model Choice; Reversible Jump; Adaptive Methods.

Will discuss: 1. a very simple method, cross-validation; 2. a very complex method, reversible jump. Assume parameter θ and data y = (y_1, ..., y_n). Examine the influence of y_i on the fit by using the cross-validatory predictive distribution of y_i: π(y_i | y_{−i}) = ∫ π(y_i | θ) π(θ | y_{−i}) dθ. This is called the conditional predictive ordinate, or CPO. One can also calculate residuals y_i − E(y_i | y_{−i}).

How to estimate the CPO? Gelfand and Dey (1994). For conditionally independent data it can be shown that π(y_i | y_{−i}) = { (1/N) Σ_{j=1}^N 1 / π(y_i | θ^(j)) }^{−1}, where θ^(1), ..., θ^(N) are MCMC samples from the posterior. Plot the CPOs against observation number; then detect outliers etc. for model checking.
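The harmonic-mean identity above is straightforward to implement from a matrix of log-likelihood values, working on the log scale for stability. A sketch (the toy "posterior" draws are my own illustration):

```python
import numpy as np

def cpo(loglik):
    """CPO_i from an (N draws x n obs) matrix of log pi(y_i | theta^(j)),
    via the harmonic-mean identity of Gelfand and Dey (1994):
        CPO_i = [ (1/N) sum_j 1 / pi(y_i | theta^(j)) ]^{-1}.
    Computed stably with a log-sum-exp of -loglik."""
    N = loglik.shape[0]
    m = (-loglik).max(axis=0)
    log_mean_inv = m + np.log(np.exp(-loglik - m).sum(axis=0)) - np.log(N)
    return np.exp(-log_mean_inv)

# toy check: y_i ~ N(theta, 1), theta-draws standing in for a posterior
rng = np.random.default_rng(2)
theta = rng.normal(0.0, 0.1, size=5000)
y = np.array([0.1, 0.0, 4.0])            # the last point is an outlier
loglik = -0.5 * (y[None, :] - theta[:, None]) ** 2 - 0.5 * np.log(2 * np.pi)
c = cpo(loglik)
print(c)   # the outlier has a much smaller CPO
```

Plotting log(c) against observation number is exactly the diagnostic plot used in the next example.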

Example: adaptive score data, Cook and Weisberg (1994); simple linear regression. [Figure: scatterplot of y against x, and log(CPO) against observation number.] The log(CPO) plot indicates one outlier.

The pseudo-Bayes factor (PsBF) is a variant of the Bayes factor: PsBF = ∏_i π_1(y_i | y_{−i}) / ∏_i π_2(y_i | y_{−i}), with the cross-validatory predictive densities replacing the marginal likelihoods under models 1 and 2. There are many other excellent methods available to calculate the Bayes factor; see DiCiccio et al. (1997) for a review. We next turn to the second, more complex method.

Outline: Convergence Study; Graphical Models; BUGS illustrations; Bayesian Model Choice; Reversible Jump; Adaptive Methods.

Reversibility. A Markov chain with transition kernel P(x, y) and invariant distribution π(·) is reversible if π(x) P(x, y) = π(y) P(y, x). This is a sufficient condition for π(·) to be the invariant distribution of P(·, ·). Reversible chains are easier to analyse: all eigenvalues of P are real. Reversibility is not a necessary condition for MCMC to work; the Gibbs sampler, when updated in a deterministic order, is not reversible.
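Detailed balance is easy to verify numerically for a small discrete chain. The sketch below (my own toy example) builds a three-state Metropolis chain, whose symmetric proposal guarantees reversibility, and checks both the balance condition and the realness of the eigenvalues:

```python
import numpy as np

# Target on {0, 1, 2} and a Metropolis chain with a symmetric proposal
pi = np.array([0.2, 0.3, 0.5])
prop = np.full((3, 3), 1 / 3)            # symmetric proposal q(x, y)

P = np.zeros((3, 3))
for x in range(3):
    for y in range(3):
        if x != y:
            P[x, y] = prop[x, y] * min(1.0, pi[y] / pi[x])
    P[x, x] = 1.0 - P[x].sum()           # rejection mass stays at x

balance = pi[:, None] * P                # pi(x) P(x, y)
print(np.allclose(balance, balance.T))   # True: detailed balance holds
print(np.iscomplex(np.linalg.eigvals(P)).any())   # False: eigenvalues real
```

Summing the columns of `balance` also confirms that π is invariant for P, the "sufficient condition" stated above.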

Reversible jump. Green (1995) extended the Metropolis–Hastings algorithm to varying-dimensional state spaces. Let k index the models; conditional on k, the state space is assumed to be n_k-dimensional. Recall that the acceptance ratio is α(x, y) = min{ 1, [π(y) q(y, x)] / [π(x) q(x, y)] }. Now require that the numerator and denominator have densities with respect to a common dominating measure ("dimension balancing"): that is, a dominating measure for the joint distribution of the current state (in equilibrium) and the next one.

We need to find a bijection (y, u′) = T(x, u), where u and u′ are random numbers of appropriate dimensions so that dim(x, u) = dim(y, u′) and the transformation T is one-to-one. The acceptance probability now becomes α = min{ 1, [π(y) q′(u′)] / [π(x) q(u)] × |∂(y, u′)/∂(x, u)| }, where q and q′ are the densities of u and u′. How can we understand this?

The usual M–H ratio: α(x, y) = min{ 1, [π(y)/π(x)] × [q(y, x)/q(x, y)] }. (1) The reversible-jump ratio: α = min{ 1, [π(y)/π(x)] × [q′(u′)/q(u)] × |∂(y, u′)/∂(x, u)| }. (2) The ratios in the last expression match those of the first: the first ratio in (2) comes exactly from the first ratio in (1). Look at the arguments of α in both cases.

According to (1), the remaining factors in (2) should equal q(y, x)/q(x, y), the ratio of the joint proposal densities. Recall that (y, u′) = T(x, u) and T is one-to-one; hence the density ratio is just the appropriate Jacobian. To see this, consider your favourite example of a transformation and obtain the ratio of the original and transformed densities. Peter Green's explanation is much more rigorous.
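The Jacobian factor can be checked on a toy transformation. The sketch below (my own example, in the spirit of Green's split move) differentiates the bijection (x, u) → (x − u, x + u) numerically; its Jacobian determinant, 2, is exactly the factor that would enter the reversible-jump ratio for this move:

```python
import numpy as np

def T(x, u):
    """A dimension-matching bijection for a 1 -> 2 'split' move:
    (x, u) -> (y1, y2) = (x - u, x + u)."""
    return np.array([x - u, x + u])

def jacobian_det(f, x, u, h=1e-6):
    """Numerical |det| of the 2x2 Jacobian of f at (x, u),
    by central differences."""
    J = np.column_stack([
        (f(x + h, u) - f(x - h, u)) / (2 * h),
        (f(x, u + h) - f(x, u - h)) / (2 * h),
    ])
    return abs(np.linalg.det(J))

jd = jacobian_det(T, x=1.3, u=0.4)
print(jd)   # 2.0 (up to floating-point error)
```

For a linear map like this the Jacobian is constant; for a general T it must be evaluated at the current (x, u).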

» à º Example: Enzyme data, Richardson and Green (1997) Probabilities (after the correction in 1998) ¾»ÀÁH 2 3 4 5 ¼ ½ 0.047 0.343 0.307 0.200 0.103 Our distance based approach: g replacements 0 2 4 6 8 10 Density 0.0 0.5 1.0 1.5 Distance.W4

Outline: Convergence Study; Graphical Models; BUGS illustrations; Bayesian Model Choice; Reversible Jump; Adaptive Methods.

Problems with MCMC: slow convergence, and problems in estimating the nse due to dependence. A way forward is to consider regeneration. Regeneration points divide the Markov sample path into i.i.d. tours, so the tour means can be used as independent batch means. Using renewal theory and ratio estimation, one can approximate expectations and nse in a way that is valid without quantifying the Markov dependence. Mykland, Tierney and Yu (1995); Robert (1995).
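Tour means lead to a simple nse estimate. A sketch of the ratio estimator (the function name and the delta-method variance formula are my own rendering of the standard regenerative recipe, not code from the cited papers):

```python
import numpy as np

def regenerative_nse(g_values, regen):
    """Estimate E[g] and its nse from a chain split at regeneration points.

    g_values : array of g(X_t) along the chain
    regen    : boolean array, True where a tour starts
    Tours are i.i.d., so no mixing assumptions are needed.
    """
    starts = np.flatnonzero(regen)
    sums = np.add.reduceat(g_values, starts)            # per-tour sums S_k
    lens = np.diff(np.append(starts, len(g_values)))    # per-tour lengths N_k
    k = len(starts)
    est = sums.sum() / lens.sum()                       # ratio estimator
    nbar = lens.mean()
    # delta-method variance of the ratio estimator
    var = np.mean((sums - est * lens) ** 2) / (k * nbar ** 2)
    return est, np.sqrt(var)

# sanity check: an i.i.d. chain where every point is a regeneration
rng = np.random.default_rng(3)
x = rng.normal(size=10_000)
est, nse = regenerative_nse(x, np.ones_like(x, dtype=bool))
print(est, nse)   # near 0 and near 1/sqrt(10000) = 0.01
```

In the i.i.d. limiting case the formula collapses to the usual standard error of the mean, which is the check performed above.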

Regeneration also provides a framework for adaptation. Background: infinite adaptation leads to bias in the estimates of expectations (Gelfand and Sahu, 1994). But the bias disappears if adaptation is performed at regeneration points; Gilks, Roberts and Sahu (1998) prove a theorem to this effect and illustrate it. How do you see regeneration points?

Nummelin's splitting. Suppose that the transition kernel P(x, A) satisfies the minorisation condition P(x, A) ≥ s(x) ν(A), where ν is a probability measure and s is a non-negative function such that ∫ s(x) π(x) dx > 0. Now, given a realisation X_0, X_1, ..., construct conditionally independent 0/1 random variables δ_0, δ_1, ... with

P(δ_t = 1 | X_t, X_{t+1}) = s(X_t) ν(X_{t+1}) / P(X_t, X_{t+1}); the times with δ_t = 1 are the regeneration points. From our experience, this is a little hard to do in multiple dimensions. Sahu and Zhigljavsky (1998) provide an alternative MCMC sampler in which regeneration points are automatically identified. Let w(x) = π(x) / ψ(x), where ψ is a proposal density and κ > 0 is a tuning constant.

Self-regenerative algorithm: set t = 1 and n = 0. Step I: generate X_t ~ ψ. Step II: generate a geometric number of copies G_t of X_t (with success probability depending on κ w(X_t)), include them in the sample, and set n = n + G_t. If n ≥ N then stop; else set t = t + 1 and return to Step I. Every return to Step I is a regeneration point.

Example: Sahu and Zhigljavsky (1998), a non-linear regression model. [Figure: time series and ACF plots of θ for the Metropolis, Gilks–Roberts–Sahu and ASR (self-regenerative) samplers, 6000 iterations each.]

Strengths of MCMC: freedom in modelling; freedom in inference; opportunities for simultaneous inference; allows/encourages sensitivity analysis; model comparison/criticism/choice. Weaknesses of MCMC: precision of order N^{−1/2} only; possibility of slow convergence.
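The N^{−1/2} precision is easy to demonstrate: multiplying the sample size by 100 shrinks the Monte Carlo error by only a factor of 10. A quick sketch (my own illustration, using independent sampling as the best case):

```python
import numpy as np

rng = np.random.default_rng(4)

def mc_error(n, reps=200):
    """Root-mean-square error of the plain Monte Carlo estimate of
    E[X] = 0 for X ~ N(0, 1), over `reps` independent replications."""
    estimates = rng.standard_normal((reps, n)).mean(axis=1)
    return float(np.sqrt((estimates ** 2).mean()))

e1, e2 = mc_error(100), mc_error(10_000)
print(e1 / e2)   # near sqrt(10000 / 100) = 10
```

A correlated MCMC chain does no better than this and usually worse, since the effective sample size is smaller than N.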

Ideas not talked about (plenty): perfect simulation; particle filtering; Langevin-type methodology; Hybrid Monte Carlo; MCMC methods for dynamically evolving data sets. We definitely need more fool-proof automatic methods for large data analysis.

Online resources. Some links are available from: www.maths.soton.ac.uk/staff/sahu/utrecht — the MCMC preprint service (Cambridge); Debug, the BUGS user group in Germany; and the MRC Biostatistics Unit, Cambridge. The Debug site http://userpage.ukbf.fu-berlin.de/ debug/ maintains a database of references.