LECTURE 15 Markov chain Monte Carlo


There are many settings where posterior computation is a challenge in that one does not have a closed-form expression for the posterior distribution. Markov chain Monte Carlo (MCMC) methods are a general, all-purpose approach for sampling from a posterior distribution. To explain MCMC we will need to present some general Markov chain theory. First, however, we justify Gibbs sampling, which can be done without the use of any Markov chain theory.

The basic problem is that we would like to generate samples from
$$\pi(\theta) \equiv f(\theta \mid x) = \frac{f(x \mid \theta)\, f(\theta)}{f(x)} = \frac{w(\theta)}{Z},$$
where the normalization constant $Z = \int f(x,\theta)\, d\theta$ is in general intractable. The objective of our MCMC algorithms will be to sample from $\pi(\theta)$ without ever having to compute $Z$. The computation of $w(\theta)$ is usually tractable since evaluating the likelihood and the prior are typically analytic operations.

15.1. Gibbs sampler

The idea behind a Gibbs sampler is that one wants to sample from a joint posterior distribution. We do not have access to a closed form for the joint, but we do have an analytic form for the conditionals. Consider a joint posterior $f(\theta \mid x)$ where $x$ is the data and $\theta = \{\theta_1, \theta_2\}$. For ease of notation we will write $\pi(\theta_1, \theta_2) \equiv f(\theta_1, \theta_2 \mid x)$. The idea behind the Gibbs sampling algorithm is that the following procedure will provide samples from the joint distribution $\pi(\theta_1, \theta_2)$ (a code sketch is given after the listing):

(1) Set $\theta_2 \sim \mathrm{Unif}[\text{support of } \theta_2]$
(2) Draw $\theta_1' \sim \pi(\theta_1 \mid \theta_2)$
(3) Draw $\theta_2' \sim \pi(\theta_2 \mid \theta_1')$
(4) Set $\theta_2 := \theta_2'$
(5) Go to step 2.
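To make the procedure concrete, here is a minimal sketch in Python of a Gibbs sampler for a toy bivariate normal target with correlation rho, where both full conditionals are univariate normals in closed form. The target, the correlation value, and the number of iterations are illustrative choices, not part of the notes.

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, T=5000, rng=None):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each full conditional is a univariate normal, e.g.
    theta1 | theta2 ~ N(rho * theta2, 1 - rho**2), so every Gibbs step
    is an exact draw from a conditional of the joint target.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta2 = rng.uniform(-1.0, 1.0)       # step 1: arbitrary starting value
    samples = np.empty((T, 2))
    for t in range(T):
        theta1 = rng.normal(rho * theta2, np.sqrt(1 - rho**2))  # step 2: theta1 | theta2
        theta2 = rng.normal(rho * theta1, np.sqrt(1 - rho**2))  # step 3: theta2 | theta1
        samples[t] = (theta1, theta2)     # keep the updated pair, then repeat
    return samples

samples = gibbs_bivariate_normal()
print(samples.mean(axis=0))               # close to (0, 0)
print(np.corrcoef(samples.T)[0, 1])       # close to rho
```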

We now show why the draws $(\theta_1', \theta_2')$ from the above algorithm are samples from $\pi(\theta_1, \theta_2)$. The following chain of equalities starts with the joint distribution of the new state produced by one pass of the algorithm, assuming the current state $(\theta_1, \theta_2)$ is already distributed according to $\pi$, and shows that the new state has the same joint distribution:
$$
\begin{aligned}
p(\theta_1', \theta_2') &= \int\!\!\int \pi(\theta_1' \mid \theta_2)\, \pi(\theta_2' \mid \theta_1')\, \pi(\theta_1, \theta_2)\, d\theta_1\, d\theta_2 \\
&= \pi(\theta_2' \mid \theta_1') \int \pi(\theta_1' \mid \theta_2) \left[ \int \pi(\theta_1, \theta_2)\, d\theta_1 \right] d\theta_2 \\
&= \pi(\theta_2' \mid \theta_1') \int \pi(\theta_1' \mid \theta_2)\, \pi(\theta_2)\, d\theta_2 \\
&= \pi(\theta_2' \mid \theta_1')\, \pi(\theta_1') \\
&= \pi(\theta_1', \theta_2').
\end{aligned}
$$
One can also derive the Gibbs sampler from the more general Metropolis-Hastings algorithm. We will leave that as an exercise.

15.2. Markov chains

Before we discuss Markov chain Monte Carlo methods we first have to define some basic properties of Markov chains. The Markov chains we discuss will always be discrete time. The state space, the set of values the chain can take at any time $t = 1, \dots, T$, can be either discrete or continuous. We will denote the state space as $S$, and for almost all examples we will consider a finite discrete state space $S = \{s_1, \dots, s_L\}$; this is so we can use linear algebra rather than operator theory for our analysis. In the context of Bayesian inference it is natural to think of the state space $S$ as the space of parameter values $\Theta$, with a state $s_i$ corresponding to a parameter value $\theta_i$.

For a discrete-state Markov chain we can define a Markov transition matrix $P$ where each element
$$P_{s_t, s_{t+1}} = \Pr(s_t \to s_{t+1})$$
is the probability of moving from one state to another at any time $t$. We consider a probability vector $\nu$ to be a vector of $L$ numbers with $\sum_\ell \nu_\ell = 1$ and $\nu_\ell \geq 0$.

We will require our chain to mix and have a unique stationary distribution. This requirement will be captured by two criteria: invariance and irreducibility (or ergodicity). We start with invariance: we would like the chain to have the following property
$$\lim_{T \to \infty} P^T \nu = \nu^* \quad \text{for all probability vectors } \nu,$$
where the limit $\nu^*$ is called the invariant measure or limiting distribution. The existence of a unique invariant distribution implies the following general balance condition
$$\sum_{s'} P(s \mid s')\, \nu^*(s') = \nu^*(s).$$
There is a simple check for invariance given the transition matrix $P$: compute the eigendecomposition $P = U^T \Lambda U$. If we rank the eigenvalues $\lambda_\ell$ from largest to smallest, we know the largest eigenvalue is $\lambda_1 = 1$. We also know that no eigenvalue can be less than $-1$.
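The eigenvalue check described above can be carried out numerically. Below is a minimal sketch in Python for a made-up 3-state chain (not from the notes), using the convention that $P$ acts on probability vectors as $\nu \mapsto P\nu$, so each column of $P$ sums to 1. It verifies that the largest eigenvalue is 1, reads off the invariant measure from the corresponding eigenvector, and confirms that iterating the chain converges to it.

```python
import numpy as np

# A made-up 3-state transition matrix; columns sum to 1 so that P @ nu
# maps probability vectors to probability vectors.
P = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.1],
              [0.2, 0.2, 0.6]])

eigvals, eigvecs = np.linalg.eig(P)
order = np.argsort(-eigvals.real)          # rank eigenvalues from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals.real)                        # lambda_1 = 1, all others inside (-1, 1)

# The invariant measure nu* is the eigenvector for lambda_1 = 1, normalized to sum to 1.
nu_star = eigvecs[:, 0].real
nu_star /= nu_star.sum()

# Iterating the chain from an arbitrary starting distribution converges to nu*.
nu = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    nu = P @ nu
print(nu_star)
print(nu)                                  # agrees with nu_star to numerical precision
```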

So we now consider
$$\lim_{N \to \infty} P^N = \lim_{N \to \infty} \sum_\ell \lambda_\ell^N\, u_\ell u_\ell^T,$$
which will converge as long as no eigenvalue $\lambda_\ell = -1$. In addition, all eigenvalues $\lambda_\ell \in (-1, 1)$ have no effect on the limit, since their contributions vanish as $N \to \infty$.

Ergodicity or irreducibility of the chain means the following: there exists an $N$ such that $(P^N)_{s' s} > 0$ for all $s$ and all $s'$ with $\nu(s') > 0$. Another way of stating this is that the entire state space is reachable from any point in the state space. Again we can check for irreducibility using linear algebra. We first define the generator of the chain $L = P - I$ and look at the eigenvalues of $-L = I - P$, ordered from smallest to largest. We know the smallest eigenvalue is $\lambda_1 = 0$, with corresponding eigenvector $\mathbf{1}$. If the second eigenvalue $\lambda_2 > 0$ then the chain is irreducible, and $\lambda_2 - \lambda_1 = \lambda_2$ is called the spectral gap. For a Markov chain with a unique invariant measure that is ergodic, the following mixing rate holds
$$\sup_\nu \left\| P^N \nu - \nu^* \right\| = O\big((1 - \lambda_2)^N\big).$$

We want our chains to mix. Algorithmically we will design Markov chains that satisfy what is called detailed balance:
$$P(s' \mid s)\, \nu(s) = P(s \mid s')\, \nu(s'), \qquad \forall\, s, s'.$$
Detailed balance is easy to check for in an algorithm, and detailed balance plus ergodicity implies that the chain mixes. In the next section we see why detailed balance is easy to verify for the most common MCMC algorithm, Metropolis-Hastings.

15.3. Metropolis-Hastings algorithm

We begin with some notation. We define a Markov transition probability or Markov transition kernel as
$$Q(s'; s) \equiv f(s' \mid s),$$
a conditional probability of moving from $s$ to $s'$; in the case of a finite state space these values are given by a Markov transition matrix. We also have state probabilities
$$p(s) = \frac{w(s)}{Z},$$
where we can evaluate $w(s)$ using the prior and likelihood. Note that whenever we write a ratio $p(s')/p(s)$ we can use $w(s')/w(s)$ as a replacement, since the normalization constant $Z$ cancels.

The following is the Metropolis-Hastings algorithm (a code sketch is given after the listing):

(1) $t = 1$
(2) $s_t \sim \mathrm{Unif}[\text{support of } s]$
(3) Draw $s' \sim Q(s'; s_t)$
(4) Compute the acceptance probability
$$\alpha = \min\left(1,\ \frac{p(s')\, Q(s_t; s')}{p(s_t)\, Q(s'; s_t)}\right)$$
(5) Accept $s'$ with probability $\alpha$: draw $u \sim \mathrm{Unif}[0,1]$; if $u \leq \alpha$ then set $t = t + 1$ and $s_t = s'$
(6) If $t < T$ go to step 3, else stop.
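To make the algorithm concrete, here is a minimal sketch in Python of random-walk Metropolis-Hastings for a one-dimensional target. The unnormalized target $w(s)$ (a two-component Gaussian mixture) and the proposal scale are made-up choices for illustration; following common practice the sketch records the current state at every iteration, repeating it when a proposal is rejected, and works with $\log w$ so that only ratios $w(s')/w(s)$ are ever needed, never $Z$.

```python
import numpy as np

def log_w(s):
    """Unnormalized log target: a mixture of N(-2, 1) and N(2, 1).
    Only w(s) up to a constant is needed; Z is never computed."""
    return np.logaddexp(-0.5 * (s - 2.0) ** 2, -0.5 * (s + 2.0) ** 2)

def random_walk_mh(log_w, T=10000, sigma=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    s = rng.uniform(-5.0, 5.0)                 # step 2: arbitrary starting state
    samples = np.empty(T)
    accepted = 0
    for t in range(T):
        s_prop = rng.normal(s, sigma)          # step 3: symmetric random-walk proposal
        # step 4: with a symmetric Q the proposal terms cancel, so
        # alpha = min(1, w(s') / w(s)), computed on the log scale.
        log_alpha = min(0.0, log_w(s_prop) - log_w(s))
        if np.log(rng.uniform()) < log_alpha:  # step 5: accept with probability alpha
            s = s_prop
            accepted += 1
        samples[t] = s                         # rejected proposals repeat the current state
    return samples, accepted / T

samples, acc_rate = random_walk_mh(log_w)
print(acc_rate)                                # acceptance rate, sensitive to sigma
print(samples.mean())                          # near 0 by symmetry of the target
```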

The Metropolis-Hastings algorithm is designed to generate samples $s_1, \dots, s_T$ from the posterior distribution $p(s)$. We will show shortly that the algorithm satisfies detailed balance. Before that we state some properties of the above algorithm.

A common proposal $Q(s'; s)$ is the random-walk proposal $s' \sim N(s, \sigma^2)$. If $\sigma^2$ is very small then typically the acceptance ratio $\alpha$ will be near 1; however, in this case two consecutive draws $s_t, s_{t+1}$ will be strongly dependent. If $\sigma^2$ is very large then the acceptance ratio $\alpha$ will be near 0; however, in this case an accepted draw will be nearly independent of the previous one. There is a trade-off between how local or global the steps should be and what acceptance ratio $\alpha$ is good; some theory suggests $\alpha = .25$ is optimal. It is also the case that the first $T_0$ samples are not drawn from the stationary distribution, since the chain has not yet reached stationarity. For this reason one typically does not include the first $T_0$ samples; this is called the burn-in period.

We now show detailed balance. First observe that for $s' \neq s$ we have $P(s' \mid s) = \alpha\, Q(s'; s)$. We start with
$$
\begin{aligned}
P(s' \mid s)\, \nu(s) &= Q(s'; s)\, \min\left(1,\ \frac{\nu(s')\, Q(s; s')}{\nu(s)\, Q(s'; s)}\right) \nu(s) \\
&= \min\big(\nu(s)\, Q(s'; s),\ \nu(s')\, Q(s; s')\big) \\
&= Q(s; s')\, \min\left(\frac{\nu(s)\, Q(s'; s)}{\nu(s')\, Q(s; s')},\ 1\right) \nu(s') \\
&= P(s \mid s')\, \nu(s').
\end{aligned}
$$
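Detailed balance can also be verified numerically on a finite state space. The sketch below, for a made-up 3-state target $\nu$ and proposal kernel $Q$ (neither from the notes), builds the full Metropolis-Hastings transition matrix $P$ and checks entry by entry that $P(s' \mid s)\,\nu(s) = P(s \mid s')\,\nu(s')$, and hence that $\nu$ is invariant.

```python
import numpy as np

rng = np.random.default_rng(0)
nu = np.array([0.2, 0.3, 0.5])              # made-up target distribution on 3 states
Q = rng.uniform(size=(3, 3))
Q /= Q.sum(axis=0, keepdims=True)           # proposal kernel: Q[s_new, s_old], columns sum to 1

# Build the Metropolis-Hastings transition matrix P[s_new, s_old].
P = np.zeros((3, 3))
for s in range(3):
    for s_new in range(3):
        if s_new == s:
            continue
        alpha = min(1.0, (nu[s_new] * Q[s, s_new]) / (nu[s] * Q[s_new, s]))
        P[s_new, s] = alpha * Q[s_new, s]   # propose s -> s_new, accept with probability alpha
    P[s, s] = 1.0 - P[:, s].sum()           # rejected proposals leave the chain at s

# Detailed balance: P(s'|s) nu(s) == P(s|s') nu(s') for every pair of states.
flux = P * nu                               # flux[s_new, s] = P(s_new | s) * nu(s)
print(np.allclose(flux, flux.T))            # True

# Detailed balance implies nu is invariant: P nu = nu.
print(np.allclose(P @ nu, nu))              # True
```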

LECTURE 16 Hidden Markov Models

The idea of a hidden Markov model (HMM) is an extension of a Markov chain. The basic formalism is that we have two sets of variables: $X_1, \dots, X_T$, which are observed, and $Z_1, \dots, Z_T$, which are hidden states. They have the following conditional dependence structure
$$z_{t+1} \sim f(z_t; \theta_1), \qquad x_{t+1} \sim g(z_{t+1}; \theta_2),$$
where we think of $t$ as time and $f$ and $g$ are conditional distributions. In this case we think of time as discrete. Typically in HMMs we consider the hidden states to be discrete; there are more general state space models where both the hidden variables and the observables are continuous. The parameters of the conditional distribution $f$ are often called the transition probabilities, and the parameters of the observation distribution $g(x_{t+1}; \theta_2)$ are often called the emission probabilities. We will often use the notation $x_{1:t} \equiv (x_1, \dots, x_t)$.

The questions normally asked using an HMM include:

Filtering: Given the observations $x_1, \dots, x_t$ we want to know the hidden states $z_1, \dots, z_t$, so we want to infer $p(z_{1:t} \mid x_{1:t})$.
Smoothing: Given the observations $x_1, \dots, x_T$ we want to know the hidden states $z_1, \dots, z_t$ where $t < T$. Here we are using past and future observations to infer hidden states, $p(z_{1:t} \mid x_{1:T})$.
Posterior sampling: $z_{1:T} \sim p(z_{1:T} \mid x_{1:T})$.

The hidden variables in an HMM are what make inference challenging. We start by writing down the joint complete likelihood
$$\mathrm{Lik}(x_1, \dots, x_T, z_1, \dots, z_T; \theta_1, \theta_2) = \pi(z_1) \prod_{t=2}^{T} f(z_t \mid z_{t-1}, \theta_1) \prod_{t=1}^{T} g(x_t \mid z_t, \theta_2),$$
where $\pi$ is the probability of the initial state. One can obtain the likelihood of the observed data by marginalization
$$\mathrm{Lik}(x_1, \dots, x_T; \theta_1, \theta_2) = \sum_{z_1, \dots, z_T} \pi(z_1) \prod_{t=2}^{T} f(z_t \mid z_{t-1}, \theta_1) \prod_{t=1}^{T} g(x_t \mid z_t, \theta_2).$$
Naively the above sum is intractable since it runs over all possible hidden trajectories: if we assume $N$ hidden states then we would have $N^T$ possible trajectories.
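As a concrete illustration of this marginalization, here is a minimal sketch in Python that computes the observed-data likelihood of a tiny discrete HMM by brute-force enumeration of all $N^T$ hidden trajectories. The two-state transition matrix, emission matrix, initial distribution, and observation sequence are made-up numbers; the point is only that the cost of the naive sum grows exponentially in $T$.

```python
import itertools
import numpy as np

# A made-up 2-state HMM with 2 observation symbols.
pi = np.array([0.6, 0.4])        # initial distribution pi(z_1)
A = np.array([[0.7, 0.3],        # A[i, j] = f(z_t = j | z_{t-1} = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],        # B[i, k] = g(x_t = k | z_t = i)
              [0.2, 0.8]])

x = [0, 1, 1, 0]                 # an observed sequence x_{1:T}
T, N = len(x), len(pi)

# Sum the complete-data likelihood over all N^T hidden trajectories z_{1:T}.
lik = 0.0
for z in itertools.product(range(N), repeat=T):
    p = pi[z[0]] * B[z[0], x[0]]
    for t in range(1, T):
        p *= A[z[t - 1], z[t]] * B[z[t], x[t]]
    lik += p

print(lik)                       # here N**T = 16 terms; the count grows exponentially in T
```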