CS281A/Stat241A Lecture 22

Monte Carlo Methods
Peter Bartlett

Key ideas of this lecture
- Sampling in Bayesian methods: predictive distribution
- Example: normal with conjugate prior
- Posterior predictive model checking
- Example: normal with semiconjugate prior
- Gibbs sampling
- Nonconjugate priors
- Metropolis algorithm
- Metropolis-Hastings algorithm
- Markov chain Monte Carlo

Sampling in Bayesian Methods

Consider a model
$$p(x, y, \theta) = \underbrace{p(\theta)}_{\text{prior}} \, p(x \mid \theta) \, p(y \mid x, \theta).$$
Given observations $(x_1, y_1), \ldots, (x_n, y_n)$ and $x$, we wish to estimate the distribution of $y$, perhaps to make a forecast $\hat y$ that minimizes the expected loss $\mathbb{E}\, L(\hat y, y)$. We condition on the observed variables $((x_1, y_1), \ldots, (x_n, y_n), x)$ and marginalize out the unknown variable ($\theta$) in the joint
$$p(x_1, y_1, \ldots, x_n, y_n, x, y, \theta) = p(\theta) \, p(x, y \mid \theta) \prod_{i=1}^{n} p(x_i, y_i \mid \theta).$$

Sampling in Bayesian Methods

This gives the posterior predictive distribution
$$p(y \mid x_1, y_1, \ldots, x_n, y_n, x) = \int p(y \mid \theta, D_n, x) \, p(\theta \mid D_n, x) \, d\theta = \int p(y \mid \theta, x) \, p(\theta \mid D_n, x) \, d\theta,$$
where $D_n = (x_1, y_1, \ldots, x_n, y_n)$ is the observed data (the second equality uses the conditional independence of $y$ and $D_n$ given $\theta, x$).

We can also consider the prior predictive distribution, which is the same quantity when nothing is observed:
$$p(y) = \int p(y \mid \theta) \, p(\theta) \, d\theta.$$
It is often useful for evaluating whether the prior reflects reasonable beliefs about the observations.

Sampling from the posterior

We wish to sample from the posterior predictive distribution
$$p(y \mid D_n) = \int p(y \mid \theta) \, p(\theta \mid D_n) \, d\theta.$$
It might be straightforward to sample from $p(\theta \mid D_n) \propto p(D_n \mid \theta) \, p(\theta)$. For example, if we have a conjugate prior $p(\theta)$, it can be an easy calculation to obtain the posterior, and then we can sample from it. It is typically straightforward to sample from $p(y \mid \theta)$.

Sampling from the posterior

In these cases, we can
1. sample $\theta^t \sim p(\theta \mid D_n)$,
2. sample $y^t \sim p(y \mid \theta^t)$.
Then we have $(\theta^t, y^t) \sim p(y, \theta \mid D_n)$, and hence we have samples from the posterior predictive distribution, $y^t \sim p(y \mid D_n)$.

Example: Gaussian with Conjugate Prior

Consider the model
$$y \sim N(\mu, \sigma^2), \qquad \mu \mid \sigma^2 \sim N(\mu_0, \sigma^2/\kappa_0), \qquad 1/\sigma^2 \sim \mathrm{gamma}(\nu_0/2, \nu_0 \sigma_0^2/2),$$
where $x \sim \mathrm{gamma}(a, b)$ means
$$p(x) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx}.$$

Example: Gaussian with Conjugate Prior

This is a conjugate prior: the posterior is
$$1/\sigma^2 \mid y_1, \ldots, y_n \sim \mathrm{gamma}(\nu_n/2, \nu_n \sigma_n^2/2), \qquad \mu \mid y_1, \ldots, y_n, \sigma^2 \sim N(\mu_n, \sigma^2/\kappa_n),$$
with
$$\nu_n = \nu_0 + n, \qquad \kappa_n = \kappa_0 + n, \qquad \mu_n = (\kappa_0 \mu_0 + n \bar y)/\kappa_n,$$
$$\sigma_n^2 = \Bigl( \nu_0 \sigma_0^2 + \sum_i (y_i - \bar y)^2 + (\bar y - \mu_0)^2 \, \kappa_0 n / \kappa_n \Bigr) \Big/ \nu_n.$$

Example: Gaussian with Conjugate Prior

So we can easily compute the posterior distribution $p(\theta \mid D_n)$ (with $\theta = (\mu, \sigma^2)$), and then:
1. sample $\theta^t \sim p(\theta \mid D_n)$,
2. sample $y^t \sim p(y \mid \theta^t)$.
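Below is a minimal sketch of these two steps in Python (the function name, hyperparameter values, and synthetic data are illustrative choices, not from the lecture). Note that $\mathrm{gamma}(a, b)$ with rate $b$ corresponds to NumPy's shape/scale parameterization with scale $1/b$.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_predictive_samples(y, mu0, kappa0, nu0, sigma0_sq, T=5000):
    """Draw T samples from p(y | D_n) for the normal model with
    conjugate prior, using the posterior formulas from the slides."""
    n, ybar = len(y), np.mean(y)
    nu_n, kappa_n = nu0 + n, kappa0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    sigma_n_sq = (nu0 * sigma0_sq + np.sum((y - ybar) ** 2)
                  + (ybar - mu0) ** 2 * kappa0 * n / kappa_n) / nu_n
    # Step 1: theta^t ~ p(theta | D_n): precision is gamma, mu is normal.
    sigma_sq = 1 / rng.gamma(nu_n / 2, 2 / (nu_n * sigma_n_sq), size=T)
    mu = rng.normal(mu_n, np.sqrt(sigma_sq / kappa_n))
    # Step 2: y^t ~ p(y | theta^t).
    return rng.normal(mu, np.sqrt(sigma_sq))

y_obs = rng.normal(1.5, 2.0, size=20)  # synthetic data
draws = posterior_predictive_samples(y_obs, mu0=0.0, kappa0=1.0,
                                     nu0=1.0, sigma0_sq=1.0)
print(draws.mean(), draws.std())
```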

Posterior Predictive Model Checking

Another application of sampling. Suppose we have a model $p(\theta)$, $p(y \mid \theta)$, and we see data $D_n = (y_1, \ldots, y_n)$. We obtain the predictive distribution $p(y \mid y_1, \ldots, y_n)$. Suppose that some important feature of the predictive distribution does not appear to be consistent with the empirical distribution. Is this due to sampling variability, or to a model mismatch?

Posterior Predictive Model Checking

1. Sample $\theta^t \sim p(\theta \mid D_n)$.
2. Sample $Y^t = (y^t_1, \ldots, y^t_n)$, with $y^t_i \sim p(y \mid \theta^t)$.
3. Calculate $f(Y^t)$, the statistic of interest.
We can use the Monte Carlo approximation to the distribution of $f(Y^t)$ to assess the model fit: if the observed value $f(D_n)$ lies far in the tails of this distribution, sampling variability alone is an unlikely explanation.
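As an illustration, here is a hedged sketch of this check for the conjugate normal model above, with the sample maximum as the statistic $f$ (the choice of statistic, names, and data are mine, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)

def ppc_statistic_draws(y, mu0, kappa0, nu0, sigma0_sq, f, T=2000):
    """For t = 1..T: draw theta^t from the posterior, a replicated
    dataset Y^t of size n, and record the statistic f(Y^t)."""
    n, ybar = len(y), np.mean(y)
    nu_n, kappa_n = nu0 + n, kappa0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    sigma_n_sq = (nu0 * sigma0_sq + np.sum((y - ybar) ** 2)
                  + (ybar - mu0) ** 2 * kappa0 * n / kappa_n) / nu_n
    stats = np.empty(T)
    for t in range(T):
        sigma_sq = 1 / rng.gamma(nu_n / 2, 2 / (nu_n * sigma_n_sq))
        mu = rng.normal(mu_n, np.sqrt(sigma_sq / kappa_n))
        y_rep = rng.normal(mu, np.sqrt(sigma_sq), size=n)  # replicated data
        stats[t] = f(y_rep)
    return stats

y_obs = rng.normal(1.5, 2.0, size=20)
stats = ppc_statistic_draws(y_obs, 0.0, 1.0, 1.0, 1.0, f=np.max)
# Tail probability of the observed statistic under the fitted model.
print("posterior predictive p-value:", np.mean(stats >= np.max(y_obs)))
```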


Semiconjugate priors

Consider again a normal distribution with conjugate prior:
$$y \sim N(\mu, \sigma^2), \qquad \mu \mid \sigma^2 \sim N(\mu_0, \sigma^2/\kappa_0), \qquad 1/\sigma^2 \sim \mathrm{gamma}(\nu_0/2, \nu_0 \sigma_0^2/2).$$
In order to ensure conjugacy, the $\mu$ and $\sigma^2$ distributions are not marginally independent.

Semiconjugate priors

If it is more appropriate to have decoupled prior distributions, we might consider a semiconjugate prior, that is, a prior that is a product of priors, $p(\theta) = p(\mu) \, p(\sigma^2)$, each of which, conditioned on the other parameters, is conjugate. For example, for the normal distribution, we could choose
$$\mu \sim N(\mu_0, \tau_0^2), \qquad 1/\sigma^2 \sim \mathrm{gamma}(\nu_0/2, \nu_0 \sigma_0^2/2).$$

Semiconjugate priors

The marginal posterior of $1/\sigma^2$ is not a gamma distribution. However, the conditional distribution $p(1/\sigma^2 \mid \mu, y_1, \ldots, y_n)$ is a gamma:
$$p(\sigma^2 \mid \mu, y_1, \ldots, y_n) \propto p(y_1, \ldots, y_n \mid \mu, \sigma^2) \, p(\sigma^2),$$
$$p(\mu \mid \sigma^2, y_1, \ldots, y_n) \propto p(y_1, \ldots, y_n \mid \mu, \sigma^2) \, p(\mu).$$
These distributions are called the full conditional distributions of $\sigma^2$ and of $\mu$: each is conditioned on everything else. In both cases, the distribution is the posterior of a conjugate prior, so it is easy to calculate.

Semiconjugate priors: Gibbs sampling

$$p(\sigma^2 \mid \mu, y_1, \ldots, y_n) \propto p(y_1, \ldots, y_n \mid \mu, \sigma^2) \, p(\sigma^2),$$
$$p(\mu \mid \sigma^2, y_1, \ldots, y_n) \propto p(y_1, \ldots, y_n \mid \mu, \sigma^2) \, p(\mu).$$
Notice that, because the full conditional distributions are posteriors of conjugate priors, they are easy to calculate, and hence easy to sample from. Suppose we had $\sigma^{2(1)} \sim p(\sigma^2 \mid y_1, \ldots, y_n)$. Then if we choose $\mu^{(1)} \sim p(\mu \mid \sigma^{2(1)}, y_1, \ldots, y_n)$, we would have $(\mu^{(1)}, \sigma^{2(1)}) \sim p(\mu, \sigma^2 \mid y_1, \ldots, y_n)$.

Semiconjugate priors

In particular, $\mu^{(1)} \sim p(\mu \mid y_1, \ldots, y_n)$, so we can choose $\sigma^{2(2)} \sim p(\sigma^2 \mid \mu^{(1)}, y_1, \ldots, y_n)$, giving $(\mu^{(1)}, \sigma^{2(2)}) \sim p(\mu, \sigma^2 \mid y_1, \ldots, y_n)$, and so on.

Gibbs sampling

For parameters $\theta = (\theta_1, \ldots, \theta_p)$:
1. Set some initial values $\theta^0$.
2. For $t = 1, \ldots, T$: for $i = 1, \ldots, p$, sample
$$\theta^t_i \sim p(\theta_i \mid \theta^t_1, \ldots, \theta^t_{i-1}, \theta^{t-1}_{i+1}, \ldots, \theta^{t-1}_p).$$
Notice that the iterates $\theta^1, \theta^2, \ldots, \theta^T$ are dependent. They form a Markov chain: $\theta^t$ depends on $\theta^1, \ldots, \theta^{t-1}$ only through $\theta^{t-1}$. The posterior distribution is a stationary distribution of the Markov chain (by the argument on the previous slide), and the distribution of $\theta^T$ approaches the posterior. A sketch of this sampler for the semiconjugate normal model follows.
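Below, a minimal sketch for the semiconjugate normal model. The closed-form full conditionals used ($\mu$ normal, $1/\sigma^2$ gamma) are the standard ones for this model; they are my reconstruction rather than formulas stated on the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_semiconjugate(y, mu0, tau0_sq, nu0, sigma0_sq, T=5000):
    """Gibbs sampler for the semiconjugate normal model:
    mu ~ N(mu0, tau0_sq), 1/sigma^2 ~ gamma(nu0/2, nu0*sigma0_sq/2)."""
    n, ybar = len(y), np.mean(y)
    mu, sigma_sq = ybar, np.var(y)  # initial values
    samples = np.empty((T, 2))
    for t in range(T):
        # Full conditional of mu: normal (conjugate, given sigma_sq).
        tau_n_sq = 1 / (1 / tau0_sq + n / sigma_sq)
        mu_n = tau_n_sq * (mu0 / tau0_sq + n * ybar / sigma_sq)
        mu = rng.normal(mu_n, np.sqrt(tau_n_sq))
        # Full conditional of 1/sigma^2: gamma (conjugate, given mu).
        a = (nu0 + n) / 2
        b = (nu0 * sigma0_sq + np.sum((y - mu) ** 2)) / 2
        sigma_sq = 1 / rng.gamma(a, 1 / b)
        samples[t] = (mu, sigma_sq)
    return samples

y_obs = rng.normal(1.5, 2.0, size=20)
chain = gibbs_semiconjugate(y_obs, mu0=0.0, tau0_sq=10.0,
                            nu0=1.0, sigma0_sq=1.0)
print(chain[1000:].mean(axis=0))  # posterior means after burn-in
```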


Nonconjugate priors

Sometimes conjugate or semiconjugate priors are not available or are unsuitable. In such cases, it can be difficult to sample directly from $p(\theta \mid y)$ (or from $p(\theta_i \mid \text{all else})$). Note that we do not need i.i.d. samples from $p(\theta \mid y)$; rather, we need $\theta^1, \ldots, \theta^m$ such that the empirical distribution approximates $p(\theta \mid y)$. Equivalently, for any $\theta, \theta'$, we want
$$\frac{\mathbb{E}[\#\,\theta \text{ in sample}]}{\mathbb{E}[\#\,\theta' \text{ in sample}]} \approx \frac{p(\theta \mid y)}{p(\theta' \mid y)}.$$

Nonconjugate priors

Some intuition: suppose we already have $\theta^1, \ldots, \theta^t$, and a candidate $\theta$. Should we add it (set $\theta^{t+1} = \theta$)? We could base the decision on how $p(\theta \mid y)$ compares to $p(\theta^t \mid y)$. Happily, we do not need to be able to compute $p(\theta \mid y)$:
$$\frac{p(\theta \mid y)}{p(\theta^t \mid y)} = \frac{p(y \mid \theta) \, p(\theta) \, p(y)}{p(y) \, p(y \mid \theta^t) \, p(\theta^t)} = \frac{p(y \mid \theta) \, p(\theta)}{p(y \mid \theta^t) \, p(\theta^t)}.$$
If $p(\theta \mid y) > p(\theta^t \mid y)$, we set $\theta^{t+1} = \theta$. If $r = p(\theta \mid y) / p(\theta^t \mid y) < 1$, we expect $\theta$ to appear $r$ times as often as $\theta^t$, so we accept $\theta$ (set $\theta^{t+1} = \theta$) with probability $r$.

Metropolis Algorithm

Fix a symmetric proposal distribution, $q(\theta \mid \theta') = q(\theta' \mid \theta)$. Given a sample $\theta^1, \ldots, \theta^t$:
1. Sample $\theta \sim q(\theta \mid \theta^t)$.
2. Calculate
$$r = \frac{p(\theta \mid y)}{p(\theta^t \mid y)} = \frac{p(y \mid \theta) \, p(\theta)}{p(y \mid \theta^t) \, p(\theta^t)}.$$
3. Set
$$\theta^{t+1} = \begin{cases} \theta & \text{with probability } \min(r, 1), \\ \theta^t & \text{with probability } 1 - \min(r, 1). \end{cases}$$
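A minimal sketch of the algorithm with a Gaussian random-walk proposal; the target (a standard-normal log posterior) and the step size are placeholder choices of mine. Working with log densities avoids numerical underflow in the ratio $r$.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_post(theta):
    """Unnormalized log posterior log p(y|theta) + log p(theta).
    Placeholder target: standard normal."""
    return -0.5 * theta ** 2

def metropolis(log_post, theta0=0.0, step=1.0, T=10000):
    theta = theta0
    chain = np.empty(T)
    for t in range(T):
        proposal = rng.normal(theta, step)   # symmetric proposal
        log_r = log_post(proposal) - log_post(theta)
        if np.log(rng.uniform()) < log_r:    # accept w.p. min(1, r)
            theta = proposal
        chain[t] = theta
    return chain

chain = metropolis(log_post)
print(chain[2000:].mean(), chain[2000:].std())  # roughly 0 and 1
```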

Metropolis Algorithm

Notice that $\theta^1, \ldots, \theta^t$ is a Markov chain. We will see that its stationary distribution is $p(\theta \mid y)$. But we have to wait until the Markov chain mixes.

Metropolis Algorithm

In practice:
1. Wait until some time $\tau$ (the burn-in time) when it seems that the chain has reached the stationary distribution.
2. Gather more samples, $\theta^{\tau+1}, \ldots, \theta^{\tau+m}$.
3. Approximate the posterior $p(\theta \mid y)$ using the empirical distribution of $\{\theta^{\tau+1}, \ldots, \theta^{\tau+m}\}$.

Metropolis Algorithm

The higher the correlation between $\theta^t$ and $\theta^{t+1}$, the longer the burn-in period needs to be, and the worse the approximation of the posterior we get from the empirical distribution of an $m$-sample: the effective sample size is much less than $m$. We can control the correlation using the proposal distribution. Typically, the best performance occurs for an intermediate value of a scale parameter of the proposal distribution: small variance gives high correlation; high variance gives samples far from the posterior mode, and hence a low acceptance probability.

Metropolis Algorithm

We need to choose a proposal distribution that allows the Markov chain to move around the parameter space quickly, but not with such big steps that the proposals are frequently rejected. Rule of thumb: tune the proposal distribution so that the acceptance probability is between 20% and 50%.
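One simple way to apply this rule of thumb (a sketch of my own, not a procedure from the lecture): adapt the proposal scale during burn-in based on the recent acceptance rate, then freeze it before collecting the samples you keep.

```python
import numpy as np

rng = np.random.default_rng(4)

def tune_step(log_post, theta0=0.0, step=1.0, rounds=20, batch=100):
    """During burn-in, grow or shrink the proposal scale until the
    acceptance rate lands in the 20-50% range."""
    theta = theta0
    for _ in range(rounds):
        accepts = 0
        for _ in range(batch):
            proposal = rng.normal(theta, step)
            if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
                theta, accepts = proposal, accepts + 1
        rate = accepts / batch
        if rate < 0.2:
            step *= 0.7   # rejecting too often: take smaller steps
        elif rate > 0.5:
            step *= 1.4   # accepting too often: take bolder steps
    return theta, step

theta, step = tune_step(lambda th: -0.5 * th ** 2)
print("tuned step size:", step)
```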


Metropolis-Hastings Algorithm

Gibbs sampling and the Metropolis algorithm are special cases of Metropolis-Hastings. Suppose we wish to sample from $p(u, v)$. We come up with two proposal distributions, $q_u(u \mid u', v')$ and $q_v(v \mid u', v')$. Notice that they can depend on the other variable, and they do not need to be symmetric as in the Metropolis algorithm.

Metropolis-Hastings Algorithm

1. Update the $u$ component:
(a) Sample $u \sim q_u(u \mid u^t, v^t)$.
(b) Compute
$$r = \frac{p(u, v^t) \, q_u(u^t \mid u, v^t)}{p(u^t, v^t) \, q_u(u \mid u^t, v^t)}.$$
(c) Set
$$u^{t+1} = \begin{cases} u & \text{with probability } \min(1, r), \\ u^t & \text{with probability } 1 - \min(1, r). \end{cases}$$

Metropolis-Hastings Algorithm

2. Update the $v$ component:
(a) Sample $v \sim q_v(v \mid u^{t+1}, v^t)$.
(b) Compute
$$r = \frac{p(u^{t+1}, v) \, q_v(v^t \mid u^{t+1}, v)}{p(u^{t+1}, v^t) \, q_v(v \mid u^{t+1}, v^t)}.$$
(c) Set
$$v^{t+1} = \begin{cases} v & \text{with probability } \min(1, r), \\ v^t & \text{with probability } 1 - \min(1, r). \end{cases}$$
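A sketch of an MH update with a genuinely asymmetric proposal (the log-normal random walk and the gamma-shaped target are illustrative choices of mine). For a proposal $u' = u \exp(sZ)$ with $Z \sim N(0, 1)$, the density ratio $q(u \mid u') / q(u' \mid u)$ works out to $u'/u$, which is the extra factor in the acceptance ratio.

```python
import numpy as np

rng = np.random.default_rng(5)

def log_p(u):
    """Unnormalized log target on u > 0: a gamma(3, 1) shape."""
    return 2 * np.log(u) - u if u > 0 else -np.inf

def mh_lognormal(log_p, u0=1.0, s=0.5, T=20000):
    """Metropolis-Hastings with an asymmetric log-normal random walk."""
    u = u0
    chain = np.empty(T)
    for t in range(T):
        u_prop = u * np.exp(s * rng.standard_normal())
        # log r = log p(u') - log p(u) + log q(u|u') - log q(u'|u)
        log_r = log_p(u_prop) - log_p(u) + np.log(u_prop / u)
        if np.log(rng.uniform()) < log_r:
            u = u_prop
        chain[t] = u
    return chain

chain = mh_lognormal(log_p)
print(chain[5000:].mean())  # near 3, the gamma(3, 1) mean
```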

Metropolis-Hastings Algorithm

This is just like the Metropolis algorithm, but the acceptance ratio contains an extra factor:
$$r = \frac{p(u, v^t)}{p(u^t, v^t)} \cdot \frac{q_u(u^t \mid u, v^t)}{q_u(u \mid u^t, v^t)}.$$
If $u$ is much more likely to be reached from $u^t$ than vice versa, we downweight the probability of acceptance (to avoid over-representing $u$).

Metropolis versus M-H

If $q_u$ is symmetric, $q_u(u \mid u', v) = q_u(u' \mid u, v)$, then the correction factor
$$\frac{q_u(u^t \mid u, v^t)}{q_u(u \mid u^t, v^t)}$$
is 1, so the acceptance probability is the same as in the Metropolis algorithm. That is, Metropolis is a special case of Metropolis-Hastings.

Metropolis versus Gibbs

If $q_u$ is the full conditional distribution of $u$ given $v$, that is, $q_u(u \mid u', v) = p(u \mid v)$, then the acceptance ratio is
$$r = \frac{p(u, v^t) \, q_u(u^t \mid u, v^t)}{p(u^t, v^t) \, q_u(u \mid u^t, v^t)} = \frac{p(u, v^t) \, p(u^t \mid v^t)}{p(u^t, v^t) \, p(u \mid v^t)} = \frac{p(u \mid v^t) \, p(v^t) \, p(u^t \mid v^t)}{p(u^t \mid v^t) \, p(v^t) \, p(u \mid v^t)} = 1.$$

Metropolis versus Gibbs

That is, if we propose a value from the full conditional distribution, we accept it with probability 1. So Gibbs sampling is a special case of Metropolis-Hastings.

Gibbs plus Metropolis plus MH

Notice that we can choose different proposal distributions for different variables $u, v, \ldots$ Since Metropolis and Gibbs correspond to specific choices of the proposal distribution, we can easily combine these algorithms: for some variables, full conditional distributions might be available (typically because of a conjugate prior for that variable), and we can do Gibbs sampling; for the other variables, they are not available, but we can choose an alternative proposal distribution. A sketch of such a hybrid sampler follows.
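Here is a sketch of such a hybrid sampler for a normal model (all modeling choices are mine, including the log-normal prior on $\sigma^2$ standing in for a nonconjugate choice): a Gibbs step for $\mu$, whose full conditional is available, and an MH step for $\sigma^2$, whose full conditional is not.

```python
import numpy as np

rng = np.random.default_rng(6)

def metropolis_within_gibbs(y, mu0, tau0_sq, T=5000, s=0.5):
    """Gibbs update for mu (conjugate normal prior), MH update for
    sigma^2 (log-normal random walk; prior: log sigma^2 ~ N(0, 1))."""
    n, ybar = len(y), np.mean(y)
    mu, sigma_sq = ybar, np.var(y)
    chain = np.empty((T, 2))

    def log_post_sigma(sig_sq):
        # log p(y | mu, sig_sq) + log p(sig_sq), up to constants;
        # the -log(sig_sq) term is the Jacobian of the log-normal prior.
        loglik = (-n / 2 * np.log(sig_sq)
                  - np.sum((y - mu) ** 2) / (2 * sig_sq))
        return loglik - 0.5 * np.log(sig_sq) ** 2 - np.log(sig_sq)

    for t in range(T):
        # Gibbs step: full conditional of mu is normal.
        tau_n_sq = 1 / (1 / tau0_sq + n / sigma_sq)
        mu_n = tau_n_sq * (mu0 / tau0_sq + n * ybar / sigma_sq)
        mu = rng.normal(mu_n, np.sqrt(tau_n_sq))
        # MH step: log-normal proposal, correction factor prop/sigma_sq.
        prop = sigma_sq * np.exp(s * rng.standard_normal())
        log_r = (log_post_sigma(prop) - log_post_sigma(sigma_sq)
                 + np.log(prop / sigma_sq))
        if np.log(rng.uniform()) < log_r:
            sigma_sq = prop
        chain[t] = (mu, sigma_sq)
    return chain

y_obs = rng.normal(1.5, 2.0, size=30)
print(metropolis_within_gibbs(y_obs, mu0=0.0, tau0_sq=10.0)[1000:].mean(axis=0))
```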


Markov Chain Monte Carlo

The transition probability matrix of a Markov chain determines the state evolution: $A_{ij} = \Pr(x_{t+1} = j \mid x_t = i)$. Recall that a distribution over states, $p_t(x) = (\Pr(x_t = 1), \ldots, \Pr(x_t = N))$, evolves as $p_{t+1}^\top = p_t^\top A$. A stationary distribution $p$ on $\mathcal{X}$ satisfies $p^\top A = p^\top$.
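A small numeric illustration (the 3-state chain is arbitrary): iterate $p_{t+1}^\top = p_t^\top A$ and compare against the stationary distribution obtained as the left eigenvector of $A$ with eigenvalue 1.

```python
import numpy as np

# An arbitrary 3-state transition matrix; rows sum to 1.
A = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Evolve an initial distribution: p_{t+1} = p_t A.
p = np.array([1.0, 0.0, 0.0])
for _ in range(50):
    p = p @ A

# Stationary distribution: left eigenvector of A with eigenvalue 1.
vals, vecs = np.linalg.eig(A.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()

print(p, pi)  # the two should agree after enough steps
```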

Markov Chain Monte Carlo

An ergodic Markov chain is irreducible (no islands) and aperiodic. It always has a unique stationary distribution: for all $p_0$, $p_0^\top A^t \to p^\top$. An ergodic Markov chain mixes exponentially fast: for some $C, \tau$ and stationary distribution $p$,
$$\| p_0^\top A^t - p^\top \|_1 \le C e^{-t/\tau}.$$

Markov Chain Monte Carlo

If $p$ satisfies the detailed balance equations
$$p_i A_{ij} = p_j A_{ji},$$
then $p$ is a stationary distribution, and the chain is called reversible:
$$\Pr(x_t = i, x_{t+1} = j) = \Pr(x_t = j, x_{t+1} = i).$$

Metropolis-Hastings

We will show that the distribution $p$ and the transition probabilities $A$ of Metropolis-Hastings satisfy the detailed balance equations $p_i A_{ij} = p_j A_{ji}$, and hence $p$ is the stationary distribution. Thus, if the Markov chain is ergodic, it converges to $p$, and in particular the empirical averages converge:
$$\frac{1}{T - \tau + 1} \sum_{t=\tau}^{T} f(x_t) \to \mathbb{E}_p f.$$
For simplicity, we consider a single-variable update.

Metropolis-Hastings

The MH proposal to move from $i$ to $j$ is accepted with probability
$$a(i, j) = \min\left(1, \frac{p(j) \, q(i \mid j)}{p(i) \, q(j \mid i)}\right).$$
Thus,
$$p_i A_{ij} = p_i \, q(j \mid i) \min\left(1, \frac{p(j) \, q(i \mid j)}{p(i) \, q(j \mid i)}\right) = p_j \, q(i \mid j) \min\left(\frac{p_i \, q(j \mid i)}{p_j \, q(i \mid j)}, 1\right) = p_j A_{ji}.$$
So $p$ is a stationary distribution.
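As a sanity check (my own construction, not from the slides), we can build the MH transition matrix for a small discrete target and verify detailed balance numerically:

```python
import numpy as np

# Target distribution on 3 states and an arbitrary proposal matrix q.
p = np.array([0.2, 0.3, 0.5])
q = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

# MH transitions: A_ij = q_ij * a(i, j) for i != j, with the
# rejected mass collected on the diagonal.
N = len(p)
A = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        if i != j:
            A[i, j] = q[i, j] * min(1.0, p[j] * q[j, i] / (p[i] * q[i, j]))
    A[i, i] = 1.0 - A[i].sum()

F = p[:, None] * A            # flows F_ij = p_i A_ij
print(np.allclose(F, F.T))    # detailed balance: True
print(np.allclose(p @ A, p))  # stationarity: True
```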

Metropolis-Hastings

Will the distribution converge to the stationary distribution? When is the Markov chain ergodic? For instance, if $p$ is continuous on $\mathbb{R}^d$ and the proposal distribution $q(x' \mid x)$ is a Gaussian centered at the current value, that is enough.

Metropolis-Hastings

There are cases where we do not have an ergodic Markov chain. For example, consider Gibbs sampling in a directed graphical model $X_1 \to S \leftarrow X_2$, where $X_1, X_2$ are outcomes of coin tosses and $S$ is the indicator for $X_1 = X_2$. Conditioned on $S$ and one coin, the other coin is determined, so single-variable updates can never change the configuration, even though other configurations have positive probability: the chain is reducible. In such cases, we can update blocks of variables at once: do exact inference for the block in a Gibbs sampling step.

Metropolis-Hastings

The mixing time of the Markov chain (the time constant in the exponential convergence to the stationary distribution) is of crucial importance in practice, but often we do not have useful upper bounds on the mixing time.
