Stat 535 C - Statistical Computing & Monte Carlo Methods
Lecture 15 - 7th March 2006
Arnaud Doucet
Email: arnaud@cs.ubc.ca

1.1 Outline

Mixture and composition of kernels. Hybrid algorithms. Examples.

2.1 Mixture of proposals

If K_1 and K_2 are π-invariant, then the mixture kernel

K(θ, θ′) = λ K_1(θ, θ′) + (1 − λ) K_2(θ, θ′)

is also π-invariant. If K_1 and K_2 are π-invariant, then the composition

K_1 K_2(θ, θ′) = ∫ K_1(θ, z) K_2(z, θ′) dz

is also π-invariant.

2.1 Mixture of proposals

Important: It is not necessary for either K_1 or K_2 to be irreducible and aperiodic to ensure that the mixture/composition is irreducible and aperiodic. For example, to sample from π(θ_1, θ_2) we can have a kernel K_1 that updates θ_1 and keeps θ_2 fixed, whereas the kernel K_2 updates θ_2 and keeps θ_1 fixed.

2.2 Applications of Mixture and Composition of MH algorithms

For K_1, we have q_1(θ, θ′) = q_1((θ_1, θ_2), θ_1′) δ_{θ_2}(θ_2′) and

r_1(θ, θ′) = [π(θ_1′, θ_2) q_1((θ_1′, θ_2), θ_1)] / [π(θ_1, θ_2) q_1((θ_1, θ_2), θ_1′)]
           = [π(θ_1′ | θ_2) q_1((θ_1′, θ_2), θ_1)] / [π(θ_1 | θ_2) q_1((θ_1, θ_2), θ_1′)].

For K_2, we have q_2(θ, θ′) = δ_{θ_1}(θ_1′) q_2((θ_1, θ_2), θ_2′) and

r_2(θ, θ′) = [π(θ_1, θ_2′) q_2((θ_1, θ_2′), θ_2)] / [π(θ_1, θ_2) q_2((θ_1, θ_2), θ_2′)]
           = [π(θ_2′ | θ_1) q_2((θ_1, θ_2′), θ_2)] / [π(θ_2 | θ_1) q_2((θ_1, θ_2), θ_2′)].

We then combine these kernels through mixture or composition.

2.3 Composition of MH algorithms

Assume we use a composition of these kernels; the resulting algorithm proceeds as follows at iteration i.

MH step to update component 1: Sample θ_1* ~ q_1((θ_1^(i−1), θ_2^(i−1)), ·) and compute

α_1((θ_1^(i−1), θ_2^(i−1)), (θ_1*, θ_2^(i−1))) = min{1, [π(θ_1* | θ_2^(i−1)) q_1((θ_1*, θ_2^(i−1)), θ_1^(i−1))] / [π(θ_1^(i−1) | θ_2^(i−1)) q_1((θ_1^(i−1), θ_2^(i−1)), θ_1*)]}.

With probability α_1((θ_1^(i−1), θ_2^(i−1)), (θ_1*, θ_2^(i−1))), set θ_1^(i) = θ_1*; otherwise θ_1^(i) = θ_1^(i−1).

2.3 Composition of MH algorithms

MH step to update component 2: Sample θ_2* ~ q_2((θ_1^(i), θ_2^(i−1)), ·) and compute

α_2((θ_1^(i), θ_2^(i−1)), (θ_1^(i), θ_2*)) = min{1, [π(θ_2* | θ_1^(i)) q_2((θ_1^(i), θ_2*), θ_2^(i−1))] / [π(θ_2^(i−1) | θ_1^(i)) q_2((θ_1^(i), θ_2^(i−1)), θ_2*)]}.

With probability α_2((θ_1^(i), θ_2^(i−1)), (θ_1^(i), θ_2*)), set θ_2^(i) = θ_2*; otherwise θ_2^(i) = θ_2^(i−1).
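
As an illustration, here is a minimal Python sketch of this component-wise composition. The bivariate Gaussian target and the symmetric Gaussian random-walk proposals are assumptions made purely for the example; they are not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi(theta1, theta2):
    # toy target (assumption): bivariate Gaussian with correlation 0.8
    rho = 0.8
    return -0.5 * (theta1**2 - 2 * rho * theta1 * theta2 + theta2**2) / (1 - rho**2)

def mh_update_component(theta, k, scale):
    """One MH step on component k, holding the other component fixed."""
    prop = theta.copy()
    prop[k] += scale * rng.standard_normal()
    # symmetric proposal, so the acceptance ratio is just pi(prop) / pi(theta)
    if np.log(rng.uniform()) < log_pi(*prop) - log_pi(*theta):
        return prop
    return theta

theta = np.zeros(2)
samples = np.empty((5000, 2))
for i in range(5000):
    theta = mh_update_component(theta, 0, scale=1.0)  # kernel K_1: update theta_1
    theta = mh_update_component(theta, 1, scale=1.0)  # kernel K_2: update theta_2
    samples[i] = theta
```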

2.4 Properties

It is clear that in such cases neither K_1 nor K_2 is irreducible and aperiodic: each of them updates only one component. However, the composition and mixture of these kernels can be irreducible and aperiodic, because then all the components are updated.

2.5 Back to the Gibbs sampler

Consider now the case where q_1((θ_1, θ_2), θ_1′) = π(θ_1′ | θ_2). Then

r_1(θ, θ′) = [π(θ_1′ | θ_2) q_1((θ_1′, θ_2), θ_1)] / [π(θ_1 | θ_2) q_1((θ_1, θ_2), θ_1′)] = [π(θ_1′ | θ_2) π(θ_1 | θ_2)] / [π(θ_1 | θ_2) π(θ_1′ | θ_2)] = 1.

Similarly, if q_2((θ_1, θ_2), θ_2′) = π(θ_2′ | θ_1), then r_2(θ, θ′) = 1. If you take the full conditional distributions as proposal distributions in the MH kernels, then you have the Gibbs sampler!

2.6 General hybrid algorithm

Generally speaking, to sample from π(θ) where θ = (θ_1, ..., θ_p), we can use the following algorithm.

Iteration i (i ≥ 1): For k = 1:p, sample θ_k^(i) using an MH step with proposal distribution q_k((θ_{−k}^(i), θ_k^(i−1)), θ_k′) and target π(θ_k | θ_{−k}^(i)), where θ_{−k}^(i) = (θ_1^(i), ..., θ_{k−1}^(i), θ_{k+1}^(i−1), ..., θ_p^(i−1)).

2.6 General hybrid algorithm

If we have q_k(θ_1:p, θ_k′) = π(θ_k′ | θ_{−k}), then we are back to the Gibbs sampler. We can update some parameters according to π(θ_k | θ_{−k}), in which case the move is automatically accepted, and others according to different proposals.

Example: Assume we have π(θ_1, θ_2) where it is easy to sample from π(θ_1 | θ_2); for the other component we use an MH step of invariant distribution π(θ_2 | θ_1).

2.6 General hybrid algorithm

At iteration i:
Sample θ_1^(i) ~ π(θ_1 | θ_2^(i−1)).
Sample θ_2^(i) using one MH step with proposal distribution q_2((θ_1^(i), θ_2^(i−1)), θ_2′) and target π(θ_2 | θ_1^(i)).

Remark: There is NO NEED to run the MH algorithm for multiple steps to ensure that θ_2^(i) is distributed according to π(θ_2 | θ_1^(i)).
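
A minimal sketch of this hybrid scheme, assuming (purely for illustration) a bivariate Gaussian target with correlation rho so that π(θ_1 | θ_2) is available in closed form, while θ_2 is updated with a single random-walk MH step:

```python
import numpy as np

rho = 0.8                       # assumed correlation of the toy Gaussian target
rng = np.random.default_rng(1)

def log_cond_theta2(theta2, theta1):
    # log pi(theta_2 | theta_1), up to an additive constant
    return -0.5 * (theta2 - rho * theta1) ** 2 / (1 - rho**2)

theta1, theta2 = 0.0, 0.0
chain = np.empty((5000, 2))
for i in range(5000):
    # exact Gibbs step: theta_1 | theta_2 ~ N(rho * theta_2, 1 - rho^2)
    theta1 = rho * theta2 + np.sqrt(1 - rho**2) * rng.standard_normal()
    # single MH step with invariant distribution pi(theta_2 | theta_1);
    # as noted above, there is no need to iterate it to convergence
    prop = theta2 + rng.standard_normal()
    if np.log(rng.uniform()) < log_cond_theta2(prop, theta1) - log_cond_theta2(theta2, theta1):
        theta2 = prop
    chain[i] = theta1, theta2
```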

3.1 Alternative acceptance probabilities

The standard MH algorithm uses the acceptance probability

α(θ, θ′) = min{1, [π(θ′) q(θ′, θ)] / [π(θ) q(θ, θ′)]}.

This is not necessary, and one can also use any function

α(θ, θ′) = δ(θ, θ′) / [π(θ) q(θ, θ′)]

such that δ(θ, θ′) = δ(θ′, θ) and 0 ≤ α(θ, θ′) ≤ 1.

Example (Baker, 1965):

α(θ, θ′) = π(θ′) q(θ′, θ) / [π(θ′) q(θ′, θ) + π(θ) q(θ, θ′)].

3.1 Alternative acceptance probabilities

Indeed, one can check that

K(θ, θ′) = α(θ, θ′) q(θ, θ′) + (1 − ∫ α(θ, u) q(θ, u) du) δ_θ(θ′)

is π-reversible. We have

π(θ) α(θ, θ′) q(θ, θ′) = π(θ) [δ(θ, θ′) / (π(θ) q(θ, θ′))] q(θ, θ′) = δ(θ, θ′) = δ(θ′, θ) = π(θ′) α(θ′, θ) q(θ′, θ).

The MH acceptance is favoured as it maximizes the acceptance probability.
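
A small numerical sketch of the comparison; the ratio r = π(θ′) q(θ′, θ) / [π(θ) q(θ, θ′)] is all either rule needs.

```python
def alpha_mh(r):
    """Standard MH acceptance: min(1, r)."""
    return min(1.0, r)

def alpha_baker(r):
    """Baker (1965) acceptance: pi'q' / (pi q + pi'q') = r / (1 + r)."""
    return r / (1.0 + r)

for r in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(f"r = {r:5.1f}   MH: {alpha_mh(r):.3f}   Baker: {alpha_baker(r):.3f}")
# For every r, alpha_mh(r) >= alpha_baker(r), which is why MH is favoured.
```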

4.1 Logistic Regression Example

In 1986, the space shuttle Challenger exploded, the explosion being the result of an O-ring failure. The failure was believed to be a consequence of the cold weather at launch time: 31°F. We have access to the data of 23 previous flights, which give for flight i the temperature at flight time x_i and y_i = 1 in case of failure, zero otherwise (Robert & Casella, p. 15). We want a model relating Y to x. Obviously this cannot be a linear model Y = α + xβ, as we want Y ∈ {0, 1}.

4.1 Logistic Regression Example

We select a simple logistic regression model

Pr(Y = 1 | x) = 1 − Pr(Y = 0 | x) = exp(α + xβ) / (1 + exp(α + xβ)).

Equivalently, we have

logit(x) = log[Pr(Y = 1 | x) / Pr(Y = 0 | x)] = α + xβ.

This link keeps Pr(Y = 1 | x) in (0, 1), as required for a binary response.

4.1 Logistic Regression Example

We follow a Bayesian approach and select

π(α, β) = π(α | b) π(β) = b^{-1} exp(α) exp(−b^{-1} exp(α)),

i.e. an exponential prior on exp(α) and a flat prior on β. The hyperparameter b is selected in a data-dependent way such that E[α] = α̂, where α̂ is the MLE of α (Robert & Casella). As a simple proposal distribution we use

q((α, β), (α′, β′)) = π(α′ | b) N(β′; β^(i−1), σ̂_β²),

where σ̂_β² is the variance estimate associated with the MLE β̂.

4.1 Logistic Regression Example

The algorithm proceeds as follows at iteration i:

Sample (α*, β*) ~ π(α | b) N(β; β^(i−1), σ̂_β²) and compute

ζ((α^(i−1), β^(i−1)), (α*, β*)) = min{1, [π(α*, β* | data) π(α^(i−1) | b)] / [π(α^(i−1), β^(i−1) | data) π(α* | b)]}.

Set (α^(i), β^(i)) = (α*, β*) with probability ζ((α^(i−1), β^(i−1)), (α*, β*)); otherwise set (α^(i), β^(i)) = (α^(i−1), β^(i−1)).
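
The following is a minimal Python sketch of this sampler. The data arrays x (temperatures) and y (failure indicators), the prior scale b and the proposal standard deviation sigma_beta are taken as given inputs; calibrating b and sigma_beta from the MLE, as described above, is not shown. Note that because the proposal for α is the prior π(α | b), the prior terms cancel and the acceptance ratio ζ reduces to a likelihood ratio.

```python
import numpy as np

def log_like(alpha, beta, x, y):
    # Bernoulli log-likelihood with logistic link: Pr(Y=1|x) = e^eta / (1 + e^eta)
    eta = alpha + beta * x
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def mh_logistic(x, y, b, sigma_beta, n_iter=10000, seed=0):
    rng = np.random.default_rng(seed)
    alpha, beta = 0.0, 0.0
    chain = np.empty((n_iter, 2))
    for i in range(n_iter):
        # proposal: alpha* from the prior (exp(alpha) ~ Exponential with mean b),
        # beta* from a Gaussian random walk
        alpha_p = np.log(rng.exponential(scale=b))
        beta_p = beta + sigma_beta * rng.standard_normal()
        # the prior on alpha cancels with the independence proposal, so the
        # acceptance probability zeta reduces to a likelihood ratio
        if np.log(rng.uniform()) < log_like(alpha_p, beta_p, x, y) - log_like(alpha, beta, x, y):
            alpha, beta = alpha_p, beta_p
        chain[i] = alpha, beta
    return chain
```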

4.1 Logistic Regression Example

[Figure: running averages (1/k) Σ_{i=1}^k α^(i) (left, intercept) and (1/k) Σ_{i=1}^k β^(i) (right, slope) over 10,000 iterations.]

4.1 Logistic Regression Example

[Figure: histogram estimates of p(α | data) (left, intercept) and p(β | data) (right, slope).]

4.1 Logistic Regression Example

[Figure: predictive Pr(Y = 1 | x) = ∫ Pr(Y = 1 | x, α, β) π(α, β | data) dα dβ; predictions of the failure probability at 65°F and 45°F.]

4.2 Probit Regression Example

We consider the following example: we take 4 measurements on 100 genuine Swiss banknotes and 100 counterfeit ones. The response variable y is 0 for genuine and 1 for counterfeit, and the explanatory variables are
- x_1: the length,
- x_2: the width of the left edge,
- x_3: the width of the right edge,
- x_4: the bottom margin width.
All measurements are in millimeters.

4.2 Probit Regression Example

[Figure. Left: plot of the status indicator versus the bottom margin width (mm). Right: boxplots of the bottom margin width for the two counterfeit statuses.]

4.2 Probit Regression Example

Instead of selecting a logistic link, we select a probit one here:

Pr(Y = 1 | x) = Φ(x_1 β_1 + ... + x_4 β_4), where Φ(u) = (1/√(2π)) ∫_{−∞}^{u} exp(−v²/2) dv.

For n data points, the likelihood is then given by

f(y_1:n | β, x_1:n) = ∏_{i=1}^{n} Φ(x_i^T β)^{y_i} (1 − Φ(x_i^T β))^{1−y_i}.

4.2 Probit Regression Example

We assume a vague prior β ~ N(0, 100 I_4) and use a simple random walk sampler with Σ̂ the covariance matrix associated with the MLE, estimated using a simple deterministic method. The algorithm at iteration i is simply:

Sample β* ~ N(β^(i−1), τ² Σ̂) and compute

α(β^(i−1), β*) = min{1, π(β* | y_1:n, x_1:n) / π(β^(i−1) | y_1:n, x_1:n)}.

Set β^(i) = β* with probability α(β^(i−1), β*) and β^(i) = β^(i−1) otherwise.

The best results were obtained with τ² = 1.
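
A minimal sketch of this random-walk sampler; the design matrix X, the responses y, the MLE covariance Sigma and the scaling tau2 are assumed given, and the vague N(0, 100 I) prior from the slide is included explicitly.

```python
import numpy as np
from scipy.stats import norm

def log_post_probit(beta, X, y, prior_var=100.0):
    p = np.clip(norm.cdf(X @ beta), 1e-12, 1 - 1e-12)   # avoid log(0)
    log_like = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    log_prior = -0.5 * beta @ beta / prior_var          # beta ~ N(0, prior_var * I)
    return log_like + log_prior

def rw_mh_probit(X, y, Sigma, tau2=1.0, n_iter=10000, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    L = np.linalg.cholesky(tau2 * Sigma)   # proposal: N(beta, tau2 * Sigma)
    beta = np.zeros(d)
    chain = np.empty((n_iter, d))
    for i in range(n_iter):
        prop = beta + L @ rng.standard_normal(d)
        if np.log(rng.uniform()) < log_post_probit(prop, X, y) - log_post_probit(beta, X, y):
            beta = prop
        chain[i] = beta
    return chain
```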

4.2 Probit Regression Example

[Figure: traces (left), histograms (middle) and autocorrelations (right) for β_1^(i), ..., β_4^(i).]

4.2 Probit Regression Example

One way to monitor the performance of the algorithm producing the chain {X^(i)} is to display the autocorrelations

ρ_k = cov(X^(i), X^(i+k)) / var(X^(i)),

which can be estimated from the chain, at least for small values of k. Sometimes one uses an effective sample size measure

N_ess = N (1 + 2 Σ_{k=1}^{k_0} ρ_k)^{−1}.

This represents approximately the sample size of an equivalent set of i.i.d. samples. One should be very careful with such measures, which can be very misleading.
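
A sketch of this diagnostic, estimating ρ_k empirically up to a cut-off k_0; the cut-off and the simple estimator below are common choices assumed for illustration, not something prescribed by the slide.

```python
import numpy as np

def effective_sample_size(chain, k0=100):
    """Crude N_ess estimate from empirical autocorrelations up to lag k0."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = x @ x / n
    rho = np.array([(x[:-k] @ x[k:]) / (n * var) for k in range(1, k0 + 1)])
    return n / (1.0 + 2.0 * rho.sum())

# e.g. effective_sample_size(chain[:, 0]) for the chain of the first coefficient
```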

4.2 Probit Regression Example

We found E[β | y_1:n, x_1:n] = (1.22, 0.95, 0.96, 1.15), so a simple plug-in estimate of the predictive probability of a counterfeit bill is

p̂ = Φ(1.22 x_1 + 0.95 x_2 + 0.96 x_3 + 1.15 x_4).

For x = (214.9, 130.1, 129.9, 9.5), we obtain p̂ = 0.59. A better estimate is obtained by

∫ Φ(β_1 x_1 + β_2 x_2 + β_3 x_3 + β_4 x_4) π(β | y_1:n, x_1:n) dβ.
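
A sketch of the two estimates, assuming a matrix `chain` of posterior draws of β (e.g. from the random-walk sampler sketched earlier) and a new covariate vector x:

```python
import numpy as np
from scipy.stats import norm

def predictive_prob(chain, x, burn_in=1000):
    """Posterior-averaged estimate of Pr(Y = 1 | x), alongside the plug-in estimate."""
    draws = chain[burn_in:]
    p_avg = norm.cdf(draws @ x).mean()            # E[Phi(x' beta) | data]
    p_plugin = norm.cdf(draws.mean(axis=0) @ x)   # Phi(x' E[beta | data])
    return p_avg, p_plugin
```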

4.3 Gibbs sampling for Probit Regression

It is impossible to use Gibbs sampling to sample directly from π(β | y_1:n, x_1:n). We therefore introduce the following unobserved latent variables:

Z_i ~ N(x_i^T β, 1),  Y_i = 1 if Z_i > 0, and 0 otherwise.

We have now defined a joint distribution f(y_i, z_i | β, x_i) = f(y_i | z_i) f(z_i | β, x_i).

4.3 Gibbs sampling for Probit Regression

Now we can check that

f(y_i = 1 | x_i, β) = ∫ f(y_i, z_i | β, x_i) dz_i = ∫_0^∞ f(z_i | β, x_i) dz_i = Φ(x_i^T β);

we haven't changed the model! We are now going to sample from π(β, z_1:n | x_1:n, y_1:n) instead of π(β | x_1:n, y_1:n), because the full conditional distributions are simple:

π(β | y_1:n, x_1:n, z_1:n) = π(β | x_1:n, z_1:n) (standard Gaussian!),

π(z_1:n | y_1:n, x_1:n, β) = ∏_{k=1}^{n} π(z_k | y_k, x_k, β), where π(z_k | y_k, x_k, β) = N_+(x_k^T β, 1) if y_k = 1 and N_−(x_k^T β, 1) if y_k = 0.
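
A minimal sketch of the resulting data-augmentation Gibbs sampler. The truncated-normal draws use scipy.stats.truncnorm, and a vague N(0, 100 I) prior on β is assumed; the slide only says the conditional of β is Gaussian, and the exact form below corresponds to that prior choice.

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_probit(X, y, n_iter=10000, prior_var=100.0, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = np.zeros(d)
    chain = np.empty((n_iter, d))
    V = np.linalg.inv(X.T @ X + np.eye(d) / prior_var)   # cov of beta | z
    L = np.linalg.cholesky(V)
    for i in range(n_iter):
        # z_k | beta, y_k: N(x_k' beta, 1) truncated to (0, inf) if y_k = 1,
        # and to (-inf, 0) if y_k = 0
        mu = X @ beta
        a = np.where(y == 1, -mu, -np.inf)   # standardized lower bounds
        b = np.where(y == 1, np.inf, -mu)    # standardized upper bounds
        z = mu + truncnorm.rvs(a, b, random_state=rng)
        # beta | z: Gaussian with covariance V and mean V X'z
        beta = V @ (X.T @ z) + L @ rng.standard_normal(d)
        chain[i] = beta
    return chain
```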

4.3 Gibbs sampling for Probit Regression

[Figure: traces (left), histograms (middle) and autocorrelations (right) for β_1^(i), ..., β_4^(i).]

4.3 Gibbs sampling for Probit Regression

The results obtained through Gibbs sampling are very similar to those from MH. We can also adopt a Zellner-type prior and obtain very similar results. Very similar results were also obtained with a logistic link function using MH; Gibbs sampling is feasible in that case but more difficult.