Direct Simulation Methods #2

Direct Simulation Methods #2 Econ 690 Purdue University

Outline
1. A Generalized Rejection Sampling Algorithm
2. The Weighted Bootstrap
3. Importance Sampling

Rejection Sampling Algorithm #2 Suppose we wish to obtain draws from some target density f(θ). Let Θ denote the support of f(θ). Suppose there exists some approximating density s(θ), called the source density, with support Θ*, where Θ ⊆ Θ*.

Rejection Sampling Algorithm #2 In many applications requiring posterior simulation, the normalizing constant of the target density is unknown, since the joint or conditional posteriors are only given up to proportionality. To this end, let us work with the kernels of both the source and target densities and write

f(θ) = c_f f̃(θ) and s(θ) = c_s s̃(θ),

so that f̃ and s̃ denote the target and source kernels, respectively, and c_f and c_s denote the associated normalizing constants. Finally, let

M = max_{θ ∈ Θ} f̃(θ)/s̃(θ).

Rejection Sampling Algorithm #2 Consider the following algorithm:
1. Draw U ∼ U(0, 1).
2. Draw a candidate from the source density s(θ), i.e., θ_cand ∼ s(θ).
3. If U ≤ f̃(θ_cand)/[M s̃(θ_cand)], accept θ_cand as a draw from f(θ); otherwise, return to step 1.
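As a concrete reference point, here is a minimal Python/NumPy sketch of the three steps above. The names target_kernel, source_kernel and source_draw are placeholder functions supplied by the user (they are not part of the lecture notes), and M is assumed to bound f̃(θ)/s̃(θ) on the support of f.

```python
import numpy as np

def rejection_sample(target_kernel, source_kernel, source_draw, M, rng=None):
    """One draw from the target density via the generalized rejection algorithm.

    target_kernel, source_kernel: unnormalized kernels (f and s kernels above).
    source_draw: function of rng returning one draw from the source density s.
    M: an upper bound on target_kernel(theta) / source_kernel(theta).
    """
    if rng is None:
        rng = np.random.default_rng()
    while True:
        u = rng.uniform()                     # step 1: U ~ U(0, 1)
        cand = source_draw(rng)               # step 2: candidate from s(theta)
        if u <= target_kernel(cand) / (M * source_kernel(cand)):
            return cand                       # step 3: accept; otherwise try again
```

For the triangular example discussed below, for instance, target_kernel(θ) = max(1 − |θ|, 0), source_kernel(θ) = exp(−θ²/(2σ²)), and M is the maximum of their ratio.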

For this algorithm, we seek to answer the following questions:
(a) Show how this algorithm includes the previous one as a special case.
(b) What is the overall acceptance rate in this algorithm?
(c) Sketch a proof of why this algorithm provides a draw from f(θ).

As for our first question, consider using the algorithm above to generate a draw from f(θ) with compact support [a, b]. In addition, employ a source density s(θ) that is Uniform over the interval [a, b]. In this case we can write

f(θ) = c_f g(θ) I(a ≤ θ ≤ b) = c_f f̃(θ), where f̃(θ) = g(θ) I(a ≤ θ ≤ b),

and

s(θ) = [b − a]^(-1) I(a ≤ θ ≤ b) = [b − a]^(-1) s̃(θ).

It follows that

M = max_{a ≤ θ ≤ b} f̃(θ)/s̃(θ) = max_{a ≤ θ ≤ b} g(θ) = c_f^(-1) max_{a ≤ θ ≤ b} f(θ) = c_f^(-1) M̄,

where M̄, the maximum of f, is the constant used in our first rejection sampling algorithm.

To implement the algorithm with the given Uniform source density, we first generate θ_cand ∼ U(a, b), which is equivalent to writing θ_cand = a + (b − a)U_1 with U_1 ∼ U(0, 1). We then generate U_2 ∼ U(0, 1) and accept θ_cand provided

U_2 ≤ f̃(θ_cand)/[M s̃(θ_cand)] = g(θ_cand)/(c_f^(-1) M̄) = f(θ_cand)/M̄.

This decision rule and the random variables U_1 and U_2 are identical to those described in the first algorithm.

As for the second question, the overall acceptance rate is

Pr(accept) = ∫_{Θ*} [f̃(θ)/(M s̃(θ))] s(θ) dθ = ∫_{Θ*} [f̃(θ)/(M s̃(θ))] c_s s̃(θ) dθ = (c_s/M) ∫_Θ f̃(θ) dθ = c_s/(c_f M).

As for part (c), we note that for any subset A of Θ,

Pr(θ_cand ∈ A | accept) = Pr(θ_cand ∈ A, U ≤ f̃(θ_cand)/[M s̃(θ_cand)]) / Pr(accept)
= [∫_A (f̃(θ)/(M s̃(θ))) s(θ) dθ] / [c_s/(c_f M)]
= [(c_s/M) ∫_A f̃(θ) dθ] / [c_s/(c_f M)]
= ∫_A c_f f̃(θ) dθ = ∫_A f(θ) dθ.

Since this holds for every such A, it follows that when θ is accepted from the algorithm, it is indeed a draw from f(θ).

Let us now carry out the following exercise, which illustrates the use of the algorithm, as we seek to generate draws from the triangular distribution

f(θ) = (1 − |θ|) I(−1 ≤ θ ≤ 1).

(a) Comment on the performance of an algorithm that uses a U(−1, 1) source density.
(b) Comment on the performance of an alternate algorithm that uses a N(0, σ²) source density. First, consider a standard Normal source with σ² = 1. Then, investigate the performance of the acceptance/rejection method with σ² = 2 and σ² = 1/6.

(a) For the U(−1, 1) source density, our previous derivation showed that our generalized algorithm reduces to the first rejection sampling algorithm discussed in the previous lecture. Thus, it follows that the overall acceptance rate is 0.5. We regard this as a benchmark and seek to determine whether an alternate choice of source density can lead to increased efficiency.

For the Normal source density in part (b), when σ² = 1, the maximum of the target/source ratio occurs at θ = 0, yielding a value of M = 1. Since the normalizing constant of the standard Normal is c_s = (2π)^(-1/2), it follows that the overall acceptance rate is (2π)^(-1/2) ≈ 0.40. When comparing this algorithm to one with, say, σ² = 2, it is clear that the standard Normal source will be preferred. The maximum of the target/source ratio with σ² = 2 again occurs at θ = 0, yielding M = 1. However, the overall acceptance probability reduces to 1/√(4π) ≈ 0.28.

The final choice of σ² = 1/6 is reasonably tailored to fit this application. One can easily show that the mean of the target is zero and the variance is 1/6, so that the N(0, 1/6) source is chosen to match these two moments of the triangular density. With a little algebra, one can show that the maximum of the target/source ratio occurs at θ = (1 + √(1/3))/2 ≈ 0.789. (Note that another maximum also occurs at −0.789, since both the target and source are symmetric about zero.) With this result in hand, the maximized value of the target/source ratio is M ≈ 1.37, yielding a theoretical acceptance rate of 1/(1.37√(2π[1/6])) ≈ 0.72. Thus, the N(0, 1/6) source density is the most efficient of the candidates considered here.
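To check these acceptance rates by simulation, the following Python/NumPy sketch (an illustration, not part of the original notes) draws candidates from each N(0, σ²) source, applies the accept/reject rule with the kernels and maxima derived above, and reports the empirical acceptance rate:

```python
import numpy as np

rng = np.random.default_rng(0)

def tri_kernel(theta):
    """Triangular kernel on [-1, 1]: 1 - |theta| inside the support, 0 outside."""
    return np.maximum(1.0 - np.abs(theta), 0.0)

def empirical_acceptance(sigma2, M, n=200_000):
    """Empirical acceptance rate with a N(0, sigma2) source for the triangular target."""
    cand = rng.normal(0.0, np.sqrt(sigma2), size=n)        # candidates from the source
    source_kernel = np.exp(-cand**2 / (2.0 * sigma2))      # Normal kernel, constant dropped
    u = rng.uniform(size=n)
    return np.mean(u <= tri_kernel(cand) / (M * source_kernel))

# Maxima of the target/source ratio derived in the text
for sigma2, M in [(1.0, 1.0), (2.0, 1.0), (1.0 / 6.0, 1.37)]:
    print(sigma2, empirical_acceptance(sigma2, M))
# The empirical rates should be close to the theoretical values derived above.
```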

Figure: Triangular Density together with Three Different Scaled Source Densities (the N(0,1/6), U(−1,1), and N(0,1) sources, plotted over θ ∈ [−1, 1]).

The Weighted Bootstrap A potential difficulty of the previous method is the calculation of M. The weighted bootstrap of Smith and Gelfand (1992, American Statistician) and the highly related sampling-importance resampling (SIR) algorithm of Rubin (1987 JASA) circumvent the need to calculate this blanketing constant M. Another alternative for log-concave densities [which, in the univariate case, refers to a density for which the second derivative of the log density is everywhere non-positive] is adaptive rejection sampling, which we do not discuss here. [Gilks and Wild (1992, JRSS C)]

Consider the following procedure for obtaining a draw from a density of interest f:
1. Draw θ_1, θ_2, ..., θ_n from some approximating source density s(θ).
2. Like our last rejection sampling algorithm, let us work with the kernels of the densities and write f(θ) = c_f f̃(θ) and s(θ) = c_s s̃(θ). Set w(θ_j) = f̃(θ_j)/s̃(θ_j) and define the normalized weights
w_j = w(θ_j) / Σ_{i=1}^{n} w(θ_i).
3. Draw θ from the discrete set {θ_1, θ_2, ..., θ_n} with Pr(θ = θ_j) = w_j, j = 1, 2, ..., n.

We seek to show that θ provides an approximate draw from f(θ), with the accuracy of the approach improving with n, the simulated sample size from the source density. Note that, for any value a,

Pr(θ ≤ a) = Σ_{j=1}^{n} w_j I(θ_j ≤ a) = [n^(-1) Σ_{j=1}^{n} w(θ_j) I(θ_j ≤ a)] / [n^(-1) Σ_{j=1}^{n} w(θ_j)].

Continuing, by the law of large numbers the numerator converges to

E_s[w(θ) I(θ ≤ a)] = ∫ [f̃(θ)/s̃(θ)] I(θ ≤ a) c_s s̃(θ) dθ = (c_s/c_f) ∫_{−∞}^{a} f(θ) dθ,

while the denominator converges to E_s[w(θ)] = c_s/c_f. Hence Pr(θ ≤ a) → ∫_{−∞}^{a} f(θ) dθ as n → ∞, so that θ is (approximately) a draw from f(θ).

The following example illustrates the performance of the weighted bootstrap, again considering the problem of sampling from the triangular density. For our source, we use a U(−1, 1) density and generate 1,500 draws from it. We then evaluate the source density at these draws (which is always 1/2!) as well as the triangular density at these 1,500 values, and calculate the weights w_i, i = 1, 2, ..., 1,500. From this discrete distribution, we generate 50,000 draws. A histogram of these draws is shown in the figure below.
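A compact Python/NumPy sketch of this experiment (an illustration using the triangular kernel from the earlier example and the draw counts quoted above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1: 1,500 draws from the U(-1, 1) source
n = 1_500
theta = rng.uniform(-1.0, 1.0, size=n)

# Step 2: evaluate the densities and form the normalized weights
target = np.maximum(1.0 - np.abs(theta), 0.0)    # triangular density at the draws
source = np.full(n, 0.5)                         # uniform density at the draws (always 1/2)
w = target / source
w_bar = w / w.sum()

# Step 3: resample 50,000 values from the discrete distribution {theta_j, w_bar_j}
draws = rng.choice(theta, size=50_000, replace=True, p=w_bar)
# A histogram of `draws` should resemble the triangular density, as in the figure below.
```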

Figure: Histogram from Weighted Bootstrap Simulations (50,000 resampled draws over θ ∈ [−1, 1]).

Importance Sampling Again, consider the problem of calculating a posterior moment of the form

E[g(θ) | y] = ∫_Θ g(θ) p(θ | y) dθ,

and we assume (as is typically the case) that it is not possible to draw directly from p(θ | y).

A potentially useful method in this situation is to apply importance sampling, whose use was championed for Bayesian applications by Kloek and van Dijk (1978, Econometrica) and Geweke (1989, Econometrica). To provide an intuitive explanation behind the importance sampling estimator, note that the posterior mean calculation can be re-written in the following way:

E[g(θ) | y] = ∫_Θ g(θ) p(θ)L(θ) dθ / ∫_Θ p(θ)L(θ) dθ = [∫_Θ g(θ) {p(θ)L(θ)/I(θ)} I(θ) dθ] / [∫_Θ {p(θ)L(θ)/I(θ)} I(θ) dθ],

for some importance function I(θ) whose support includes Θ.

Written this way, one can see that the original integrals have been transformed into new (though equivalent) problems of moment calculation. The averaging, however, is now performed with respect to I(θ) instead of p(θ | y). Provided one can draw from the importance function I(θ), direct Monte Carlo integration can be performed on the numerator and denominator individually to produce the importance sampling estimator (shown below):

ĝ_IS = Σ_{i=1}^{N} w̄(θ_i) g(θ_i),

a weighted average of the g(θ_i), with w̄(θ_i) = w(θ_i)/Σ_i w(θ_i) denoting the (normalized) weight and w(θ_i) ∝ p(θ_i)L(θ_i)/I(θ_i).
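In code, the estimator is simply a weighted average. The sketch below is a hedged illustration in Python/NumPy: log_prior, log_like and log_imp are hypothetical user-supplied functions for log p(θ), log L(θ) and log I(θ), and the weights are formed on the log scale for numerical stability (the normalization cancels the subtracted constant).

```python
import numpy as np

def importance_estimate(g, theta, log_prior, log_like, log_imp):
    """Importance sampling estimate of E[g(theta) | y] from draws theta ~ I(theta)."""
    log_w = log_prior(theta) + log_like(theta) - log_imp(theta)   # log w(theta_i)
    w = np.exp(log_w - log_w.max())                               # stabilized unnormalized weights
    w_bar = w / w.sum()                                           # normalized weights
    return np.sum(w_bar * g(theta)), w                            # estimate, plus weights for diagnostics
```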

Importantly, note that for the case of importance sampling, the θ_i are iid draws from I(θ), and are not draws from p(θ | y) as was the case in direct Monte Carlo integration. Since the importance function I(θ) is under the control of the researcher, it can (and should) be a density from which samples can be easily obtained. Though the importance sampling estimator may seem like a convenient way to solve all problems of posterior moment calculation, note that if I(θ) is a poor approximation to p(θ | y), then the weights w(θ_i) will typically be negligible for most draws, so that the estimate is effectively based on only a few points, resulting in a very inaccurate and unstable estimate.

Common sense, of course, suggests that the accuracy of an importance sampling estimate will improve as I(θ) more closely approximates the target distribution p(θ | y). Indeed, if I(θ) and p(θ | y) coincide, then the normalized weights equal 1/N, and the importance sampling estimator reduces to the ideal case, the direct Monte Carlo estimator. However, this is an ideal that we cannot achieve, as direct sampling from p(θ | y) is typically not possible.

Geweke (1989) shows that, under standard conditions, namely:
E[g(θ) | y] ≡ ḡ exists,
Var[g(θ) | y] ≡ σ² is finite, and
the support of I(θ) includes the support of p(θ | y),
we have ĝ_IS → ḡ in probability, i.e., the importance sampling estimator is consistent. Establishing a rate of convergence is a little more involved. The important new condition is that w(θ) must be bounded above on Θ, which means that, in general, I(θ) should be chosen to have heavier tails than p(θ | y).

Under this new condition, Geweke (1989) shows

√N (ĝ_IS − ḡ) → N(0, τ²) in distribution,

where τ² can be consistently estimated by

τ̂² = N Σ_{i=1}^{N} [g(θ_i) − ĝ_IS]² w²(θ_i) / [Σ_{i=1}^{N} w(θ_i)]².

Note that, if I(θ) = p(θ | y), then w(θ) = 1, and the above reduces to our estimate of Var[g(θ) | y] = σ². Also note that the numerical standard error (denoted NSE) can be approximated as

NSE[ĝ_IS] ≈ √(τ̂²/N).

To evaluate the performance of a particular importance sampling estimator, Geweke (1989) suggests a number of diagnostics, including:
Monitoring the weights w_i. If a few draws receive very large weights, this is suggestive of an inaccurate approximation of the posterior moment.
Keeping track of the fraction of total weight assigned to the draw receiving the highest weight. If this is found to be much larger than 1/N, it suggests that I(θ) can be improved.

A more formal and widely used statistic involves the calculation of a quantity termed the relative numerical efficiency, or RNE. This statistic seeks to quantify how much is lost (owing to a choice of I(θ) that is far from the target p(θ | y)) by using importance sampling relative to the numerical precision that would have been obtained under direct Monte Carlo integration. Specifically, the RNE is defined as the ratio σ²/τ², which can be consistently estimated by σ̂²/τ̂², where

σ̂² = Σ_{i=1}^{N} [g(θ_i) − ĝ_IS]² w̄(θ_i)

and τ̂² is defined as above.

Since NSE[ĝ_IS] ≈ √(τ̂²/N), it follows that NSE[ĝ_IS] ≈ √(σ̂²/(N × RNE)), and thus N × RNE has the interpretation of an effective sample size. Since RNE tends to be smaller than 1 in practice, and may be significantly smaller when there is large variability in the weights w, this gives a sense of the loss relative to direct Monte Carlo integration.
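Given the importance draws and their (unnormalized) weights, the NSE and RNE formulas above translate directly into code. The following Python/NumPy sketch (again an illustration, not part of the notes) returns the point estimate, NSE, RNE and effective sample size:

```python
import numpy as np

def is_diagnostics(gvals, w):
    """Diagnostics for an importance sampling estimate.

    gvals: g(theta_i) evaluated at the importance draws.
    w: unnormalized weights w(theta_i).
    """
    N = len(w)
    w_bar = w / w.sum()
    g_hat = np.sum(w_bar * gvals)                               # IS estimate of E[g | y]
    sigma2_hat = np.sum(w_bar * (gvals - g_hat) ** 2)           # estimate of Var[g | y]
    tau2_hat = N * np.sum((gvals - g_hat) ** 2 * w ** 2) / w.sum() ** 2
    nse = np.sqrt(tau2_hat / N)                                 # numerical standard error
    rne = sigma2_hat / tau2_hat                                 # relative numerical efficiency
    return g_hat, nse, rne, N * rne                             # N * rne = effective sample size
```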

To illustrate the performance and use of importance sampling, we consider the following experiment. Suppose it is of interest to calculate the first and second moments of a standard normal random variable. As an importance function, I(θ), we employ the Laplace or double exponential distribution:

I(θ) = (2b)^(-1) exp(−|θ − µ|/b).

Thus, the Laplace density has two parameters: a mean µ and a parameter b which controls the variance. (Specifically, Var(θ) = 2b².) Note: The Laplace distribution will have heavier tails than the normal, since we have a linear rather than quadratic term in the exponential kernel.

We will consider the use of two different Laplace distributions as importance functions: a Laplace(0, 2) distribution and a Laplace(3, 1) distribution. We will obtain 2,500 different importance sampling estimates under each importance function, each time generating 5,000 draws from the corresponding Laplace distribution. This procedure will approximate the sampling distributions of the importance sampling estimators. In addition, we plot the sampling distribution of RNE for each case.
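A scaled-down sketch of one replication of this experiment in Python/NumPy (an illustration under the Laplace parameterization above, with the standard normal as the known target so that w(θ) is simply the ratio of the two densities):

```python
import numpy as np

rng = np.random.default_rng(2)

def one_replication(mu, b, n=5_000):
    """One importance sampling estimate of the first and second moments of a N(0,1) variate."""
    theta = rng.laplace(mu, b, size=n)                          # draws from the Laplace(mu, b) importance function
    log_target = -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)      # standard normal log density
    log_imp = -np.abs(theta - mu) / b - np.log(2 * b)           # Laplace(mu, b) log density
    w = np.exp(log_target - log_imp)                            # importance weights w(theta_i)
    w_bar = w / w.sum()
    return np.sum(w_bar * theta), np.sum(w_bar * theta**2)      # estimated mean and second moment

print(one_replication(0.0, 2.0))   # close to (0, 1); repeating 2,500 times traces out the sampling distribution
print(one_replication(3.0, 1.0))   # noticeably noisier, reflecting the poorer Laplace(3,1) importance function
```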

Figure: Laplace(0,2) importance function and Standard Normal Density (plotted over θ ∈ [−4, 4]).

Figure: Sampling distributions using the Laplace(0,2) importance function (panels: sampling distribution of the mean, of the second moment, and of the RNE; RNE values lie roughly between 0.95 and 1.01).

Figure: Laplace(3,1) importance function and Standard Normal Density (plotted over θ ∈ [−4, 8]).

Figure: Sampling distributions using the Laplace(3,1) importance function (panels: sampling distribution of the mean, of the second moment, and of the RNE; RNE values lie roughly between 0.07 and 0.12).

Clearly, the Laplace(0,2) density performs better than the Laplace(3,1) alternative. Note that, in this latter case, the effective sample size is approximately one-tenth of the sample size under iid sampling. That is, to achieve the same level of numerical precision in the estimated mean, we would need an importance sample roughly ten times as large as the sample required under direct Monte Carlo integration.