CSC 2541: Bayesian Methods for Machine Learning
Radford M. Neal, University of Toronto, 2011
Lecture 3

More Markov Chain Monte Carlo Methods

The Metropolis algorithm isn't the only way to do MCMC. We'll look at two more methods next:

Gibbs Sampling updates one variable (or some other subset of the variables) at a time. It requires that the conditional distributions be tractable.

Slice Sampling is a general framework; one application of it also updates one variable at a time, but doesn't require a tractable conditional distribution.

Both of these methods eliminate or reduce the need for careful tuning of a proposal distribution, as is needed for Metropolis updates.

Gibbs Sampling

The Gibbs sampling algorithm updates each component of the state by sampling from its conditional distribution given the other components. Let the current state be $x = (x_1, \ldots, x_n)$. The next state, $x' = (x'_1, \ldots, x'_n)$, is produced as follows:

Sample $x'_1$ from $x_1 \mid x_2, \ldots, x_n$.
Sample $x'_2$ from $x_2 \mid x'_1, x_3, \ldots, x_n$.
...
Sample $x'_i$ from $x_i \mid x'_1, \ldots, x'_{i-1}, x_{i+1}, \ldots, x_n$.
...
Sample $x'_n$ from $x_n \mid x'_1, \ldots, x'_{n-1}$.

In many interesting problems we can easily sample from these conditional distributions, even though the joint distribution cannot easily be sampled from. In many other problems, we can update some of the components this way, and use other updates (e.g., Metropolis or slice sampling) for the others.

A Strategy for Gibbs Sampling

1) Write down an expression for the joint posterior density of all parameters, omitting constant factors as convenient. This is found as the product of the prior and the likelihood.

2) For each parameter in turn, look at this joint density, and eliminate all factors that don't depend on this parameter. Try to view the remaining factors as the density for some standard (tractable) distribution, disregarding any constant factors in the density function for the standard distribution.

3) Write a program that samples from each of these standard conditional distributions in turn. The parameters of these standard distributions will in general depend on the variables conditioned on.

This works if the model and prior are conditionally conjugate: fixing values for all but one parameter gives a conjugate model and prior for the remaining parameter. This covers a much larger class of problems than those where the model and prior are conjugate for all parameters at once.

Example: Linear Regression with Gaussian Noise

Gibbs sampling is possible for a simple linear regression model with Gaussian noise, if the regression coefficients have a Gaussian prior distribution and the noise precision (reciprocal of variance) has a Gamma prior. For example, consider a model of $n$ independent observations, $(x_i, y_i)$, with $x_i$ and $y_i$ both being real:

$$y_i \mid x_i, \beta_0, \beta_1, \tau \;\sim\; N(\beta_0 + \beta_1 x_i,\ 1/\tau)$$
$$\beta_0 \;\sim\; N(\mu_0,\ 1/\tau_0)$$
$$\beta_1 \;\sim\; N(\mu_1,\ 1/\tau_1)$$
$$\tau \;\sim\; \mathrm{Gamma}(a, b)$$

Here, we'll assume that $\mu_0$, $\mu_1$, $\tau_0$, $\tau_1$, $a$, and $b$ are known constants. The $\mathrm{Gamma}(a,b)$ distribution for $\tau$ has density function $\frac{b^a}{\Gamma(a)} \tau^{a-1} \exp(-b\tau)$.

Gibbs Sampling for the Linear Regression Example: β0

The $N(\mu_0, 1/\tau_0)$ prior for $\beta_0$ combines with the likelihood, with $\beta_1$ and $\tau$ fixed, to give a conditional posterior that is normal. This is like sampling for the mean of a normal distribution given independent data points, $(y_i - \beta_1 x_i)$, that all have variance $1/\tau$:

$$\beta_0 \mid x, y, \beta_1, \tau \;\sim\; N\!\left( \frac{\tau_0 \mu_0 + \tau \sum_i (y_i - \beta_1 x_i)}{\tau_0 + n\tau},\; \frac{1}{\tau_0 + n\tau} \right)$$

We can see this by looking at the relevant part of the log joint posterior density:

$$-(\tau_0/2)(\beta_0 - \mu_0)^2 \;-\; (\tau/2) \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$$

This is a quadratic function of $\beta_0$, corresponding to the log of a normal density, with the coefficient of $\beta_0^2$ being $-(\tau_0 + n\tau)/2$, giving the variance above. The coefficient of $\beta_0$ is $\tau_0\mu_0 + \tau\sum_i (y_i - \beta_1 x_i)$, giving the mean above.

Gibbs Sampling for the Linear Regression Example: β1

Also, the $N(\mu_1, 1/\tau_1)$ prior for $\beta_1$ combines with the likelihood, with $\beta_0$ and $\tau$ fixed, to give a conditional posterior that is normal. This also is like sampling for the mean of a normal distribution given independent data points, but with data points $(y_i - \beta_0)/x_i$ that have different variances, equal to $1/(\tau x_i^2)$:

$$\beta_1 \mid x, y, \beta_0, \tau \;\sim\; N\!\left( \frac{\tau_1 \mu_1 + \tau \sum_i x_i (y_i - \beta_0)}{\tau_1 + \tau \sum_i x_i^2},\; \frac{1}{\tau_1 + \tau \sum_i x_i^2} \right)$$

Again, we can see this by looking at the relevant part of the log joint posterior density:

$$-(\tau_1/2)(\beta_1 - \mu_1)^2 \;-\; (\tau/2) \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$$

This is a quadratic function of $\beta_1$, corresponding to the log of a normal density, with the coefficient of $\beta_1^2$ being $-(\tau_1 + \tau \sum_i x_i^2)/2$, giving the variance above. The coefficient of $\beta_1$ is $\tau_1\mu_1 + \tau\sum_i x_i(y_i - \beta_0)$, giving the mean above.

Gibbs Sampling for the Linear Regression Example: τ

Finally, the Gamma prior for $\tau$ combines with the likelihood, with $\beta_0$ and $\beta_1$ fixed, to give a conditional posterior that is also Gamma:

$$\tau \mid x, y, \beta_0, \beta_1 \;\sim\; \mathrm{Gamma}\!\left( a + n/2,\; b + \sum_i (y_i - \beta_0 - \beta_1 x_i)^2 / 2 \right)$$

The relevant part of the log joint posterior density is

$$(a-1)\log\tau \;-\; b\tau \;+\; (n/2)\log\tau \;-\; (\tau/2) \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$$

This combines into the sum of $\log\tau$ times $a + (n/2) - 1$ and $-\tau$ times $b + \sum_i (y_i - \beta_0 - \beta_1 x_i)^2 / 2$, giving a Gamma distribution with parameters as above.
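Putting the three conditional distributions together, here is a minimal Python sketch of this Gibbs sampler (a possible implementation, not code from the lecture; the function name, default hyperparameter values, and initial state are my own choices):

```python
import numpy as np

def gibbs_linreg(x, y, mu0=0.0, tau0=1.0, mu1=0.0, tau1=1.0,
                 a=1.0, b=1.0, n_iter=5000, rng=None):
    """Gibbs sampler for y_i ~ N(beta0 + beta1*x_i, 1/tau), with
    N(mu0,1/tau0) and N(mu1,1/tau1) priors on beta0 and beta1, and
    a Gamma(a,b) prior on the noise precision tau."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    beta0, beta1, tau = 0.0, 0.0, 1.0          # arbitrary initial state
    samples = np.empty((n_iter, 3))
    for t in range(n_iter):
        # beta0 | rest: normal with precision tau0 + n*tau
        prec0 = tau0 + n * tau
        mean0 = (tau0 * mu0 + tau * np.sum(y - beta1 * x)) / prec0
        beta0 = rng.normal(mean0, 1.0 / np.sqrt(prec0))
        # beta1 | rest: normal with precision tau1 + tau * sum(x_i^2)
        prec1 = tau1 + tau * np.sum(x**2)
        mean1 = (tau1 * mu1 + tau * np.sum(x * (y - beta0))) / prec1
        beta1 = rng.normal(mean1, 1.0 / np.sqrt(prec1))
        # tau | rest: Gamma(a + n/2, b + sum of squared residuals / 2)
        # (numpy's gamma takes a scale parameter, the reciprocal of the rate)
        resid = y - beta0 - beta1 * x
        tau = rng.gamma(a + n / 2, 1.0 / (b + np.sum(resid**2) / 2))
        samples[t] = beta0, beta1, tau
    return samples
```

After discarding an initial burn-in portion, posterior means for $\beta_0$, $\beta_1$, and $\tau$ can be estimated by averaging the columns of the returned array. For instance, with synthetic data generated as y = 1.0 + 2.0*x plus N(0, 0.25) noise, the column means should come out near 1.0, 2.0, and 4.0.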

The Idea of Slice Sampling (I)

Slice sampling operates by sampling uniformly from under the curve of a density function $\pi(x)$. Ignoring the vertical coordinate then yields a sample of values for $x$ drawn from this density.

[Figure: points scattered uniformly under the curve of $\pi(x)$; their $x$-coordinates form a sample from $\pi$.]

However, we won't try to pick points drawn independently from under $\pi(x)$. We will instead define a Markov chain that leaves this uniform distribution invariant. Note that we can just as easily use any constant multiple of $\pi(x)$, so we have no need to calculate the normalizing constant for $\pi$.

The Idea of Slice Sampling (II)

Let's write a point under the curve $\pi(x)$ as $(x, y)$, with $y$ being the height above the $x$-axis. The uniform distribution under $\pi(x)$ is left invariant by a Markov chain that alternately

1) Picks a new $y$ uniformly from $(0, \pi(x))$.
2) Performs some update on $x$ alone that leaves invariant the uniform distribution on the slice through the density at height $y$.

If in step (2) we picked a point from the slice independently of the current $x$, this would just be Gibbs sampling on the $(x, y)$ space.

A Picture of a Slice

For one-dimensional $x$, the slice distribution that we sample from in step (2) will be uniform over the union of some collection of intervals.

[Figure: a horizontal line at height $y$ cutting through a density curve, with the slice consisting of the intervals where $\pi(x) \ge y$.]

If $\pi$ is unimodal, the slice is a single interval.

Slice Sampling for One Variable

We can use simple slice sampling for one variable at a time as an alternative to Gibbs sampling or single-variable Metropolis updates. We are probably best off doing just a single slice sampling update for each variable before going on to the next. (But note that this is not quite the same thing as Gibbs sampling.)

On our visit to variable $i$, we first draw a value $y$ uniformly from the interval $(0,\ \pi(x_i \mid x_j : j \ne i))$. Note that $\pi$ needn't be normalized. We then update $x_i$ in a way that leaves the slice distribution defined by $y$ invariant.

The big question: How can we efficiently update $x_i$ in such a way?

Slice Sampling for One Variable: Finding an Interval by Stepping Out

Given a current point $x$, and a height $y$, we start by finding an interval $(x_0, x_1)$ around $x$ for which both $x_0$ and $x_1$ are outside the slice. As our first attempt, we try the interval $(x - wu,\ x + w(1-u))$, where $u$ is drawn uniformly from $(0, 1)$. If both ends of this interval are outside the slice ($\pi(x_0) < y$ and $\pi(x_1) < y$), we stop. Otherwise, we step outwards from the ends, a distance $w$ each step, until the ends are outside.

[Figure: an initial interval of width $w$ placed randomly around $x$, extended in steps of $w$ until both ends lie outside the slice.]

A crucial point: We have the same probability of arriving at this final interval if we start at any point within it that is inside the slice.

Slice Sampling for One Variable: Sampling from the Interval Found

Once we have an interval $(x_0, x_1)$ around the current point, $x$, with both ends outside the slice, we select a point inside this interval in a way that leaves the slice distribution invariant. We proceed as follows:

1) Draw $x'$ uniformly from $(x_0, x_1)$.
2) If $x'$ is inside the slice, it's our new point. We test this by checking whether $\pi(x') \ge y$.
3) Otherwise, we narrow the interval by setting either $x_0$ or $x_1$ to $x'$, keeping the current point, $x$, inside the new interval, and then repeat.

[Figure: rejected points shrinking the interval toward the current point until a point inside the slice is drawn.]

Detailed balance holds for $T(x, x')$ when the stepping-out procedure for finding the interval around $x$ is followed by the above procedure for drawing $x'$.
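The stepping-out and shrinkage procedures translate directly into code. Below is a minimal Python sketch of a single slice sampling update for one variable (my own illustration, not code from the lecture; the max_steps cap is a simplification of the limit in Neal's published algorithm, which apportions a maximum number of steps randomly between the two ends, and in practice one would work with log π to avoid underflow):

```python
import numpy as np

def slice_sample_1d(x, pi, w, rng, max_steps=100):
    """One slice sampling update for a single variable.

    x   : current point
    pi  : function computing the (possibly unnormalized) density
    w   : estimate of the typical slice width, used for stepping out
    """
    # Step 1: draw the slice height y uniformly from (0, pi(x)).
    y = rng.uniform(0.0, pi(x))

    # Step 2: stepping out -- place an interval of width w randomly
    # around x, then extend each end by w until it is outside the slice.
    u = rng.uniform(0.0, 1.0)
    x0, x1 = x - w * u, x + w * (1.0 - u)
    steps = 0
    while pi(x0) >= y and steps < max_steps:
        x0 -= w; steps += 1
    steps = 0
    while pi(x1) >= y and steps < max_steps:
        x1 += w; steps += 1

    # Step 3: shrinkage -- draw uniformly from (x0, x1); a point outside
    # the slice shrinks the interval toward the current point x.
    while True:
        x_new = rng.uniform(x0, x1)
        if pi(x_new) >= y:
            return x_new
        if x_new < x:
            x0 = x_new
        else:
            x1 = x_new
```

For example, repeatedly calling slice_sample_1d with pi = lambda t: np.exp(-0.5 * t * t) and w = 2.0 yields a chain whose equilibrium distribution is standard normal.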

Requirements and Characteristics of One-Variable Slice Sampling

There are no restrictions on the distributions we can sample from. We need to be able to calculate only $\pi(x)$, up to a constant factor.

We must decide on a width, $w_i$, for the initial interval and for any subsequent steps outward. Ideally, $w_i$ will be a few times the standard deviation for $x_i$. If the distribution for $x_i$ is unimodal, we can use data from past iterations to choose a good value for $w_i$, since in this case the distribution of the new point does not depend on $w_i$.

The new $x_i$ we get is not an independent point from the conditional distribution for $x_i$. The penalty from this could be large for heavy-tailed distributions.

Accuracy of MCMC Estimates

Recall that when estimating $E_\pi[a(x)]$ by the average of $a(x)$ at $n$ independent points drawn according to $\pi$, the estimate has standard error (the square root of an estimate of the variance in repetitions of the method) of $s_a / \sqrt{n}$, where $s_a$ is the sample standard deviation of $a(x)$.

Under mild conditions (finite variance of $a(x)$ and geometric ergodicity of the Markov chain), the standard error for an MCMC estimate based on $n$ dependent points from $\pi$ has the same form, except that $n$ is replaced by an effective sample size. We write this effective sample size for $n$ points as $n/\tau$, where $\tau$ is called the autocorrelation time.

Note: This all assumes that the early part of the chain (before near convergence) has been discarded, with only later points being used.

Derivation of the Variance of ā

Here is the variance of the sample mean, $\bar{a}$, of $n$ dependent points, $a^{(1)}, \ldots, a^{(n)}$, all from the same distribution, when the true mean is $\mu$ and the autocovariance function for the (stationary) series is $\gamma(k)$:

$$\mathrm{Var}[\bar{a}] \;=\; E[(\bar{a}-\mu)^2] \;=\; E\!\left[\left(\frac{1}{n}\sum_{t=1}^{n}(a^{(t)}-\mu)\right)^{\!2}\right] \;=\; \frac{1}{n^2}\sum_{t,t'=1}^{n} E[(a^{(t)}-\mu)(a^{(t')}-\mu)] \;=\; \frac{1}{n^2}\sum_{t,t'=1}^{n} \gamma(t'-t) \;=\; \frac{1}{n}\sum_{-n<s<n} (1-|s|/n)\,\gamma(s)$$

As $n \to \infty$, we get the following:

$$\mathrm{Var}[\bar{a}] \;\approx\; \frac{1}{n}\sum_{-n<s<n}\gamma(s) \;=\; \frac{1}{n}\left[\gamma(0) + 2\sum_{s=1}^{\infty}\gamma(s)\right] \;=\; \frac{\sigma^2}{n}\left[1 + 2\sum_{s=1}^{\infty}\rho(s)\right]$$

Here $\sigma^2 = \gamma(0)$ is the variance of $a$, and $\rho(s) = \gamma(s)/\sigma^2$ is the autocorrelation at lag $s$.

Estimating the Autocorrelation Time

The effective sample size is therefore $n/\tau$, with $\tau = 1 + 2\sum_{s=1}^{\infty}\rho(s)$. How can we estimate this?

The standard estimate of $\rho(s)$ for $0 \le s < n$ is $\hat\rho(s) = \hat\gamma(s)/\hat\gamma(0)$, with

$$\hat\gamma(s) \;=\; \frac{1}{n}\sum_{i=1}^{n-s} (a^{(i)}-\bar{a})(a^{(i+s)}-\bar{a})$$

The obvious estimate of the autocorrelation time is $\hat\tau = 1 + 2\sum_{s=1}^{n-1}\hat\rho(s)$, but this doesn't work. Even as $n \to \infty$, the estimate doesn't converge to the true value, because more and more noise gets included in the sum of the $\hat\rho(s)$. Instead, one needs to cut off the sum of the $\hat\rho(s)$ at some lag above which $\hat\rho(s)$ seems to be nearly zero.
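As an illustration, here is a small Python sketch of this estimator, cutting off the sum at the first lag where the estimated autocorrelation drops below a small threshold (one simple cutoff rule among several; the threshold value is my own choice):

```python
import numpy as np

def autocorrelation_time(a, threshold=0.05):
    """Estimate the autocorrelation time tau = 1 + 2*sum rho(s),
    truncating the sum at the first lag where rho(s) falls below
    the threshold, above which the terms are mostly noise."""
    a = np.asarray(a, dtype=float)
    n = len(a)
    centered = a - a.mean()
    gamma0 = np.sum(centered**2) / n          # estimate of gamma(0)
    tau = 1.0
    for s in range(1, n):
        gamma_s = np.sum(centered[:n - s] * centered[s:]) / n
        rho_s = gamma_s / gamma0
        if rho_s < threshold:                 # cut off the sum here
            break
        tau += 2.0 * rho_s
    return tau

# Effective sample size and standard error for an MCMC estimate:
#   ess = n / tau
#   std_err = a.std(ddof=1) / np.sqrt(ess)
```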

Running Multiple Chains

We can use more than one realization of the Markov chain for MCMC, with different seeds and different initial states. We find estimates by averaging values from all chains, after discarding burn-in from each chain. If the same number of iterations is used from each chain, the estimate from $C$ chains is

$$\bar{a} \;=\; \frac{1}{C}\sum_{c=1}^{C} \bar{a}_c$$

where $\bar{a}_c$ is the sample mean from chain $c$. (We could instead use multiple chains of different lengths, but then the analysis is a bit more complicated.)
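A minimal Python sketch of this pooled estimate, together with the direct standard-error estimate from the spread of chain means described on the next slide (the function name is mine):

```python
import numpy as np

def multi_chain_estimate(chains, burn_in):
    """Pool estimates from C chains of equal length, discarding
    burn_in iterations from each. Returns the pooled mean and a
    direct standard-error estimate from the spread of the chain
    means (reasonable only when C is at least ten or so)."""
    chain_means = np.array([np.mean(c[burn_in:]) for c in chains])
    C = len(chain_means)
    pooled = chain_means.mean()
    std_err = chain_means.std(ddof=1) / np.sqrt(C)
    return pooled, std_err
```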

Advantages and Disadvantages of Multiple Chains

Using $C > 1$ chains has some advantages:

- The needed burn-in period can be assessed more reliably.
- A direct estimate of the standard error can be obtained as the standard deviation of the $\bar{a}_c$ divided by $\sqrt{C}$. (This needs $C$ to be at least ten or so.)
- Autocovariances can be estimated more reliably, using $\bar{a}$.
- The $C$ chains can be run in parallel if $C$ processors are available.

There are also disadvantages:

- For a given number of total iterations, more are wasted if $C$ burn-in periods are discarded rather than one.
- If the time actually needed to reach equilibrium is very long, we may need to spend all our available time on one chain to reach equilibrium.

For relatively simple problems, using about ten chains may be best. For difficult problems, for which the required burn-in period is long and uncertain, it is possible that one chain may be best.