Advances and Applications in Perfect Sampling


Advances and Applications in Perfect Sampling
Ph.D. Dissertation Defense
Ulrike Schneider (advisor: Jem Corcoran)
May 8, 2003
Department of Applied Mathematics, University of Colorado

Outline
Introduction:
(1) MCMC Methods
(2) Perfect Sampling
Methods:
(3) Slice Coupling
(4) Variants on the IMH Algorithm
Applications:
(5) Bayesian Variable Selection
(6) Computing Self-Energy Using Feynman Diagrams

1. MCMC Methods. MCMC = Markov chain Monte Carlo. Originally developed for use in Monte Carlo techniques (approximating deterministic quantities using random processes). The main method is the Metropolis-Hastings algorithm (METROPOLIS, 1953; HASTINGS, 1970). MCMC has since been used extensively outside Monte Carlo methods. Introduction

1. MCMC Methods. Simulate from non-standard distributions with the help of Markov chains: artificially create a Markov chain that has the desired distribution as its equilibrium distribution. Start the chain in some state at time t = 0 and run it long enough (this yields approximate samples). HOW LONG IS LONG ENOUGH?? The problem of assessing convergence is a major drawback in the use of MCMC methods.

2. Perfect Sampling!! MCMC WITHOUT STATISTICAL ERROR!! Enables exact simulation from the stationary distribution of certain Markov chains. First paper by PROPP and WILSON in 1996. Parallel chains are started in each state. Chains are run as if started at time t = −∞ and stopped at time t = 0. This can be done in finite time for uniformly ergodic Markov chains.

Essential Idea: Coupling Sample Paths. Once two sample paths of a Markov chain couple (coalesce), they stay together; T denotes the forward coupling time. Perfect sampling: go back far enough in time so that by time t = 0, all chains have coalesced (backward coupling time).

[Figure: a four-state chain (s1, s2, s3, s4) run from starting times t = −1, −2, −3 using common random inputs u1, u2, u3. Starting at −1: no coalescence by t = 0. Starting at −2: no coalescence by t = 0. Starting at −3: coalescence by t = 0, and s2 is a draw from the stationary distribution!]

Why backwards? If started stationary, the Markov chain is stationary at all fixed time steps, but the time of coalescence is random. Reporting states at the random forward coupling time therefore no longer necessarily gives draws from the stationary distribution. [Figure: forward coupling of the four-state chain over times t = 0, 1, 2, 3, with forward coupling time T = 3.]
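The backward-coupling construction can be sketched for a toy monotone chain. Everything in this block (the four-state random walk and its update rule) is an illustrative assumption of ours, not the example from the talk:

```python
import random

# Coupling from the past (Propp & Wilson, 1996) on a small finite chain.
# update(s, u) is a random-function representation: it maps state s and a
# SHARED uniform u to the next state, so all chains use the same randomness.

def update(s, u):
    # a simple ergodic random walk on {0, 1, 2, 3}
    if u < 0.5:
        return max(s - 1, 0)
    return min(s + 1, 3)

def cftp(rng, states=range(4)):
    us = []              # shared randomness for times -1, -2, -3, ...
    T = 1
    while True:
        while len(us) < T:
            us.append(rng.random())
        # run every start state from time -T to time 0, reusing the SAME u's
        ends = set()
        for s in states:
            x = s
            for t in range(T, 0, -1):   # the input for time -t is us[t-1]
                x = update(x, us[t - 1])
            ends.add(x)
        if len(ends) == 1:              # all chains coalesced by time 0
            return ends.pop()           # an exact stationary draw
        T *= 2                          # go further back, KEEP old randomness
```

Doubling T while reusing the previously drawn inputs is essential: drawing fresh randomness for the earlier times would bias the output.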

The Challenge. What about infinite and continuous state spaces? Perfect sampling theoretically works for all uniformly ergodic chains, BUT we need a way to detect a backward coupling time! Ideas include minorization criteria, bounding processes, perfect slice sampling, the Harris coupler, the IMH algorithm, slice coupling, ... Theoretical development has slowed down; the focus has shifted towards applying perfect sampling algorithms to relevant problems, and applications are non-trivial.

3. Slice Coupling. Get a potentially common draw from two different continuous distributions! This will enable us to couple continuous sample paths in perfect sampling algorithms. [Figure: two chains, updated with x ∼ N(x, 1) and y ∼ N(y, 1), evolving over time.]


3. Slice Coupling. Slicing uniformly under the curve of a density function (area represents probability!). Different techniques:
Layered multishift coupler (WILSON, 2000).
Folding coupler, shift-and-fold step (CORCORAN and SCHNEIDER, 2003).
Shift-and-patch algorithm for non-invertible densities (CORCORAN and SCHNEIDER, 2003).

4. The IMH Algorithm. IMH = independent Metropolis-Hastings. Perfect IMH is the perfect-sampling counterpart (CORCORAN and TWEEDIE, 2000) to the Metropolis-Hastings algorithm using an independent candidate density.

Regular Metropolis-Hastings. We want a sample from π(x) (possibly unnormalized). Use a candidate density q(x) instead. Create the Metropolis-Hastings chain X_t according to the transition law: assume X_t = x and draw a candidate y ∼ q(·). Accept the candidate (X_{t+1} = y) with probability α(x, y) = min(π(y)q(x) / (π(x)q(y)), 1); otherwise reject (X_{t+1} = x). This Markov chain has stationary distribution π(x).
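A generic sketch of this chain with an independent candidate density (function names and interfaces are ours, not from the talk):

```python
import math
import random

def imh_chain(log_pi, sample_q, log_q, n_steps, x0, rng=random.Random(0)):
    """Forward Metropolis-Hastings with an independent candidate density."""
    x = x0
    out = []
    for _ in range(n_steps):
        y = sample_q(rng)
        # alpha(x, y) = min(pi(y) q(x) / (pi(x) q(y)), 1), computed in logs
        log_alpha = min(0.0, (log_pi(y) - log_q(y)) - (log_pi(x) - log_q(x)))
        if math.log(rng.random()) < log_alpha:
            x = y               # accept the candidate
        out.append(x)           # on rejection, the chain stays at x
    return out
```

For example, with target π(x) ∝ x on [0, 1] and a Uniform(0, 1) candidate, the chain's long-run mean approaches 2/3.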

Perfect IMH. Reorder the state space S and assume that there exists l ∈ S such that π(l)/q(l) = max_{x∈S} π(x)/q(x). This l (the "lowest" element of S) satisfies α(l, y) ≤ α(x, y) for all x ∈ S, i.e. if we accept a move from state l to state y, any other x ∈ S will also move to y. This allows us to detect a backward coupling time.

Illustration of IMH (figure sequence):
Draw candidate q0 at time t = 0; the lowest state l rejects q0.
Draw candidate q1 at time t = −1; l rejects q1.
Draw candidate q2 at time t = −2; l accepts q2!
Therefore any state accepts q2 at time t = −2, and all chains coalesce.
Transitioning forward: accept candidate q1, then reject candidate q0.
X0 is a draw from π!
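The backward/forward procedure illustrated above can be written as a short sketch, assuming the maximum π/q ratio is known (names and interfaces are ours):

```python
import math
import random

def perfect_imh(log_pi, log_q, sample_q, log_max_ratio, rng, max_back=100000):
    # Perfect IMH (Corcoran & Tweedie, 2000).
    # log_max_ratio = log max_x pi(x)/q(x), attained at the "lowest" state l.
    cands, us = [], []
    for k in range(max_back):              # candidate/uniform pair for time -k
        y, u = sample_q(rng), rng.random()
        cands.append(y)
        us.append(u)
        # alpha(l, y) = min((pi(y)/q(y)) / (pi(l)/q(l)), 1)
        alpha_l = math.exp(min(0.0, log_pi(y) - log_q(y) - log_max_ratio))
        if u < alpha_l:                    # l accepts => every state accepts
            x = y
            for j in range(k - 1, -1, -1): # roll forward to time 0,
                yj, uj = cands[j], us[j]   # REUSING the stored pairs
                alpha = math.exp(min(0.0, (log_pi(yj) - log_q(yj))
                                          - (log_pi(x) - log_q(x))))
                if uj < alpha:
                    x = yj
            return x                       # an exact draw from pi
    raise RuntimeError("no backward coupling found within max_back steps")
```

Reusing the stored (candidate, uniform) pairs on the forward pass is exactly what makes the output an exact draw rather than an approximate one.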

Variants on IMH. We only need to know the value of the maximum π/q ratio, not where it occurs. (1) Bounded IMH (SCHNEIDER and CORCORAN, 2002): we do not even need to know the maximum exactly! An upper bound for π/q gives a lower bound for α(l, y), so we can still obtain exact draws. Used in: the variable selection problem; computing self-energy for the interacting fermion problem.

Variants on IMH. (2) Approximate IMH (CORCORAN and SCHNEIDER, 2003): if no upper bound is available, a built-in random search appears to outperform regular forward IMH at the same computational cost. [Figure: histograms of 100,000 draws using approximate IMH with built-in maximization vs. 100,000 draws using forward IMH.]

Variants on IMH. (3) Adaptive IMH (CORCORAN and SCHNEIDER, 2003): forward IMH with a self-targeting candidate. Appears to converge very rapidly. [Figure sequence, 100,000 draws each: forward MH after 300, 500, 1000, and 2000 time steps; adaptive IMH after 100 time steps, and after refinements 1 and 2.]

5. Bayesian Variable Selection. Linear regression model with Gaussian noise:
y = γ1θ1x1 + ··· + γqθqxq + ε
where
x_i ∈ R^n, i = 1, ..., q: predictors (fixed, known)
θ_i ∈ R, i = 1, ..., q: coefficients (random)
ε ∼ N(0, σ²I): noise vector (random)
γ_i ∈ {0, 1}, i = 1, ..., q: indicators (random)
Applications

The Goal. Given an observation y, choose the "best" subset of the predictors: which predictors were part of the model? This amounts to finding values of γ = (γ1, ..., γq)! Best? From the Bayesian perspective: select the γ that appears most frequently when sampling from the posterior of the model. We need to be able to simulate from the posterior π_{Γ,σ²,Θ}! Usually, Bayesian approaches use regular MCMC methods, which raises the question of convergence.

The Model: Priors. We incorporate the following standard normal-gamma conjugate priors:
λν/σ² ∼ χ²(ν), i.e. Z := 1/σ² ∼ Γ(ν/2, λν/2)
θ | Z ∼ N(ξ, σ²V)
γ_i i.i.d. Bernoulli(1/2), i = 1, ..., q
The variance σ² for ε and θ is random; V, ξ, λ, and ν are hyperparameters (fixed and known).

The Model: Posterior. Linear regression with Gaussian noise yields the likelihood
L(γ, θ, z) = z^(n/2) exp{ −(1/2) z (y − Σ_{i=1}^q γ_i θ_i x_i)ᵀ (y − Σ_{i=1}^q γ_i θ_i x_i) }
The posterior is proportional to likelihood × priors:
π_{Γ,Z,Θ}(γ, z, θ | y) ∝ z^((n+q+ν)/2 − 1) exp{ −(1/2) z [ (y − Σ_{i=1}^q γ_i θ_i x_i)ᵀ (y − Σ_{i=1}^q γ_i θ_i x_i) + (θ − ξ)ᵀ V⁻¹ (θ − ξ) + λν ] }

Increasing Layers of Complexity.
Fixed variance, fixed coefficients: sampling γ = (γ1, ..., γq) using support-set coupling within a Gibbs sampler (HUANG and DJURIĆ, 2002).
Random variance, fixed coefficients: sampling (γ, z) = (γ1, ..., γq, z) using slice coupling within a Gibbs sampler (SCHNEIDER and CORCORAN, 2002).
Random variance, random coefficients: sampling (γ, z, θ) = (γ1, ..., γq, z, θ1, ..., θq) using bounded IMH (SCHNEIDER and CORCORAN, 2002).

The General Case. Incorporate random variance and random coefficients. To reduce the size of the state space, define β_i := γ_i θ_i, which has a mixture prior distribution. The values of γ can be recovered from β = (β1, ..., βq)ᵀ by setting γ_i = 0 if β_i = 0 and γ_i = 1 if β_i ≠ 0, for i = 1, ..., q.

The General Case: Using bounded IMH. Posterior:
π_{B,Z}(β, z | y) ∝ L(β, z) g_{B|Z}(β | z) g_Z(z)
where L(β, z) = z^(n/2) exp{ −(1/2) z (y − Σ_{i=1}^q β_i x_i)ᵀ (y − Σ_{i=1}^q β_i x_i) }.
Choose the candidate density q(β, z) ∝ z^(n/2) g_{B|Z}(β | z) g_Z(z). Then
π/q = L(β, z) / z^(n/2) = exp{ −(1/2) z (y − Σ_{i=1}^q β_i x_i)ᵀ (y − Σ_{i=1}^q β_i x_i) } ≤ 1

The General Case: Using bounded IMH. To get candidate values (z, β) according to q(z, β) ∝ z^(n/2) g_{B|Z}(β | z) g_Z(z), sample hierarchically:
Draw Z ∼ Γ((n+ν)/2, λν/2)
Draw B | Z ∼ N(ξ, (1/z) V)
Set β_i = 0 with probability 1/2 (i = 1, ..., q)
We can now use bounded IMH to get exact draws from the posterior.
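A minimal sketch of this hierarchical candidate draw, assuming for simplicity a diagonal V (that restriction is ours, not the talk's; names are illustrative):

```python
import random

def sample_candidate(n, nu, lam, xi, V_diag, rng):
    """Hierarchical draw from the candidate q(z, beta).
    V is assumed DIAGONAL here (an illustrative simplification)."""
    q = len(xi)
    # Z ~ Gamma(shape = (n + nu)/2, rate = lam*nu/2); gammavariate takes scale
    z = rng.gammavariate((n + nu) / 2.0, 2.0 / (lam * nu))
    # B | Z = z ~ N(xi, V / z), componentwise for diagonal V
    beta = [rng.gauss(xi[i], (V_diag[i] / z) ** 0.5) for i in range(q)]
    # each beta_i is set to 0 with probability 1/2 (mixture prior on beta)
    beta = [b if rng.random() < 0.5 else 0.0 for b in beta]
    return z, beta
```

These candidates then feed the bounded-IMH accept/reject step, since π/q ≤ 1 from the previous slide.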

Some Results: the Hald data set. q = 4, n = 13; y contains heat measurements of cement, and the predictor variables describe the composition of the cement (aluminate, silicate, ferrite, dicalcium).
component γ | percentage
(0,1,0,0) | 69 %
(1,1,0,0) | 14 %
(1,0,1,0) | 13 %
(0,1,1,0) | 3 %
(0,1,0,1) | 1 %
Marginal inclusion probabilities: P(γ1 = 1) = 27 %, P(γ2 = 1) = 87 %, P(γ3 = 1) = 16 %, P(γ4 = 1) = 1 %.

6. Computing Self-Energy. Compute the self-energy for the interacting fermion problem: create and destroy a particle on a lattice of atoms (such as a crystal). The particle interacts with other electrons, and a wake of energy is created around its movement. This self-energy is quantified with the help of Feynman diagrams. We approximate the sum using Monte Carlo methods and perfect sampling (CORCORAN, SCHNEIDER and SCHÜTTLER, 2003):
σ(k) = Σ_{n=1}^{n_max} Σ_{g∈G_n} (T/N)^n Σ_{k_1,...,k_n∈K} F_g^{(n)}(k, k_1, ..., k_n)

7. Future Research. Address large backward coupling times of the IMH algorithm (Bayesian variable selection): multistage coupling? Find analytical error bounds and employ convergence diagnostics for the approximate and adaptive IMH algorithms. More to come...

References
[1] J.N. Corcoran and U. Schneider. Variants on the independent Metropolis-Hastings algorithm: approximate and adaptive IMH. In preparation, 2003.
[2] J.N. Corcoran and U. Schneider. Shift and scale coupling methods for perfect simulation. Prob. Eng. Inf. Sci., 17:277–303, 2003.
[3] J.N. Corcoran, U. Schneider, and H.-B. Schüttler. Perfect stochastic summation for high order Feynman diagrams. Submitted for publication, 2003. Preprint at http://amath.colorado.edu/student/schneidu/feynman.html.
[4] Y. Huang and P. Djurić. Variable selection by perfect sampling. EURASIP J. Appl. Sig. Pr., 1:38–45, 2002.
[5] A.E. Raftery, D. Madigan, and J.A. Hoeting. Bayesian model averaging for linear regression models. J. Am. Stat. Ass., 92:179–191, 1997.
[6] U. Schneider and J.N. Corcoran. Perfect sampling for Bayesian variable selection in a linear regression model. J. Stat. Plann. Inference, 2003. To appear. Preprint at http://www.cgd.ucar.edu/stats/staff/uli/modelselect.html.