BENCHMARK ESTIMATION FOR MARKOV CHAIN MONTE CARLO SAMPLERS


BENCHMARK ESTIMATION FOR MARKOV CHAIN MONTE CARLO SAMPLERS

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Subharup Guha, M.Sc.

The Ohio State University
2004

Dissertation Committee:
Steven N. MacEachern, Co-Adviser
Mario Peruggia, Co-Adviser
L. Mark Berliner
Peter F. Craigmile

Department of Statistics

© Copyright by Subharup Guha, 2004

ABSTRACT

While studying various features of the posterior distribution of a vector-valued parameter using an MCMC sample, systematic subsampling of the MCMC output can only lead to poorer estimation. Nevertheless, a 1-in-k subsample is often all that is retained in investigations where intensive computations are involved or where speed is essential. The goal of benchmark estimation is to produce a number of estimates based on the best available information, i.e. the entire MCMC sample, and to use these to improve other estimates made on the basis of the subsample. We take a simple approach by creating a weighted subsample where the weights are quickly obtained as the solution to a system of linear equations. We provide a theoretical basis for the method and illustrate the technique using examples from the literature. For a subsampling rate of 1-in-10, the observed reductions in MSE often exceed 50% for a number of posterior features. Much larger gains are expected for certain complex estimation methods and for the commonly used thinner subsampling rates. Benchmark estimation can be used wherever other fast or efficient estimation strategies, such as importance sampling, already exist; we show how the two strategies can be used in conjunction with each other. We also discuss some asymptotic properties of benchmark estimators that provide insight into the gains associated with the technique. The observed gains closely match the theoretical values predicted by the asymptotics, even for k as small as 10.

Dedicated to my dear parents

ACKNOWLEDGMENTS

I am indebted to Dr. Steve MacEachern for many useful research ideas and for his unstinting help and guidance over the years. I am grateful to Dr. Mario Peruggia for his patience, insight and invaluable support. I am deeply appreciative of Dr. Mark Berliner and Dr. Peter Craigmile for acting as references for my work. Finally, I thank my parents and lovingly dedicate my dissertation to them.

VITA

September 13, ............. Born, Calcutta, India
.......................... Master of Science in Statistics, Indian Institute of Technology, Kanpur, India
1999-present .............. Graduate Teaching Associate, The Ohio State University

PUBLICATIONS

MacEachern, S. N., Peruggia, M. and Guha, S. (2003). Discussion of "A theory of statistical models for Monte Carlo integration" by Kong, McCullagh, Nicolae, Tan and Meng. Journal of the Royal Statistical Society, Series B, 65(3).

FIELDS OF STUDY

Major Field: Statistics

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Introduction
   1.1 Monte Carlo Methods
       1.1.1 Rejection Method
       1.1.2 Importance Sampling
       1.1.3 MCMC Estimation
   1.2 Systematic Subsampling
   1.3 Benchmark Estimation for MCMC Samplers
   1.4 An Overview of this Dissertation
2. A Simple Approach to Benchmark Estimation
   2.1 An Improved Subsample Estimator
   2.2 Some Methods of Obtaining Weights
   2.3 Benchmark Estimation and Importance Sampling
3. Theoretical Results
   3.1 Notation
   3.2 Asymptotic Properties of Post-stratification Estimators Based on a Subsample
       Asymptotic A: n is held fixed and k tends to ∞
       Asymptotic B: k is held fixed and n tends to ∞
       Asymptotic C: n and k jointly tend to ∞
4. Illustrations
   4.1 Failure Times of Power Plant Pumps
   4.2 Allometry Example: Brain Masses and Body Weights
5. Conclusions and Future Work

Appendices:

A. Proof of Theorem
B. Results Used to Establish Theorem
   B.1 A New Set of Asymptotic Tools
   B.2 Details of Lemmas
C. Likelihood Function of an Over- or Under-dispersed Model for the Pumps Dataset
D. Allometry Example: R Code for Generating Draws from the Semiparametric Model Posterior

Bibliography

LIST OF TABLES

4.1 Comparison of MSE of the subsample estimators for a 1-in-10 systematic subsample
4.2 Comparison of MSE of the subsample estimators for a 1-in-100 systematic subsample

LIST OF FIGURES

4.1 Trellis plots of the percent reductions in MSE produced by the subsample estimator for all method-by-rate combinations and all features of interest E[g(θ)]
4.2 Level plot of the percent MSE reduction for the weighted subsample estimation of β_2 using the post-stratification weights (iii) and 1-in-10 subsamples
4.3 Plot of B̂_c ± 2 ŜE(B̂_c) versus c
4.4 A comparison of the predictive mass functions of the number of failures of a pump on which there is no data, having an operation time of ... hours
4.5 Importance sampling weights versus β for the target posterior corresponding to the model M
4.6 The percent reductions in MSE for various features of the target posterior, π
4.7 A normal probability plot of the log-transformed, recentered covariates for the Weisberg data
4.8 Percent reductions in variance achieved by post-stratification, relative to subsample-based empirical average estimation, for some of the investigated posterior features
4.9 Percent reductions in variance achieved by post-stratification, relative to subsample-based empirical average estimation, for the remaining six posterior features investigated
4.10 A comparison of the asymptotic and the actual variance reductions for PS-(2). The dashed lines indicate a margin of two standard errors
4.11 A comparison of the asymptotic and the actual variance reductions for PS-(2) for the remaining set of posterior features. The dashed lines indicate a margin of two standard errors
4.12 Estimated asymptotic reductions in variance and squared multiple correlations for different posterior features
4.13 Estimated asymptotic reductions in variance and squared multiple correlations for different posterior features

CHAPTER 1

Introduction

1.1 Monte Carlo Methods

Bayesian methods have long been touted as providing an optimal approach to statistics (Savage, 1954). Bayesian methods have a common foundation with traditional approaches to statistics. Both approaches begin with a description of the outcome of an experiment, x, as a random quantity whose value is influenced by a parameter, θ. The outcome of the experiment is referred to as the data. In small problems, the data will consist of dozens of measurements; in very large problems, the data may come in terabytes. In a simple setting, the parameter may be as direct as the mean of a probability distribution for x; in more complex settings, the parameter may be a vector with hundreds, thousands, or even an infinite number of dimensions. The more complex parameters are used to describe various features of the distribution of x, including the shape of the distribution and the relationship between various components of x.

Bayesian methods depart from traditional statistical methods in how they treat the parameter. Since we use the language of probability to describe our uncertainty about the world, the Bayesian naturally uses this language to describe our uncertainty

about the parameter, θ, as well. Having done so, the formal distinction between parameter and data disappears, and the tools of conditional probability can be used to describe our uncertainty about (or conversely, knowledge about) the parameter after having seen the data. This last probability distribution is referred to as the posterior distribution for θ given x. This distribution summarizes all current knowledge about the parameter, and so is the basis of all inference about the model parameter, θ.

Formally, a Bayesian model consists of the data x, and a parameter θ taking values in a set Θ. The joint probability distribution, denoted by f(x, θ), is specified by the model, usually in terms of the likelihood function f(x | θ) and the prior f(θ). All inference is based on the posterior distribution, denoted by π. The posterior density at a point θ is given by π(θ) = f(x, θ)/m(x), where m(x) = ∫ f(x, θ) µ(dθ) is the marginal likelihood of the data and µ is a σ-finite measure. Interest often focuses on features that can be expressed as a posterior expectation, E[g(θ)], for some function g(·). Written explicitly, E[g(θ)] = ∫ g(θ) f(x, θ) µ(dθ) / m(x). Examples of such posterior features include posterior moments, quantiles, HPD regions and the posterior density itself.

In many Bayesian models, the parameter has a large number of components, and the posterior distribution is complex enough that it cannot be evaluated exactly. Evaluating posterior features by numerical integration is therefore not possible unless the parameter consists of a very small number of components. Widespread use of Bayesian methods was hindered by a lack of computational strategies and power that restricted their use to a small number of stylized problems. A seminal paper (Gelfand and Smith, 1990) advocated the fitting of statistical models

with simulation (or Monte Carlo) methods, rather than through analytic integration, analytic approximation, or numerical integration. Since that time, there has been increased acceptance of the use of simulation-based approximations to fit realistic models. There has also been tremendous growth in the use of Bayesian methods, to the point where several departments have been founded with an exclusively Bayesian approach to statistics.

Monte Carlo methods use a stream of random numbers to generate a sample of parameter vectors drawn from the posterior distribution of θ. The information contained in these parameter vectors is then used to provide estimates of a broad class of features of the posterior distribution. Gilks et al. (1996) provides an excellent overview of these techniques.

Formally, simulation methods use a random sample of N draws, denoted by θ^(1), ..., θ^(N), to estimate posterior features of interest. These sample draws may or may not be independent. If the draws are i.i.d. with a common distribution π_s, referred to as the source distribution, the law of large numbers states that the empirical average, Ê[g(θ)] = (1/N) Σ_{i=1}^N g(θ^(i)), tends almost surely to E_{π_s}[g(θ)]. Therefore, if the source distribution π_s is identical to the posterior distribution π and the sample size N is large, Ê[g(θ)] may be regarded as a reasonable estimate of the posterior feature E[g(θ)].

In practice, it is usually not possible to sample directly from π if the marginal likelihood m(x) cannot be analytically or numerically evaluated. This is the case with most realistic Bayesian models. We could obtain i.i.d. samples from a different source distribution and then weight the samples appropriately to obtain an estimate of E[g(θ)]. Two techniques that use this approach are importance sampling and the rejection method. A drawback of both these methods is that they are not automatic and often require considerable skill to be effectively applied.

1.1.1 Rejection Method

Given a set of i.i.d. draws from a source distribution π_s, the rejection method re-samples from these draws with a probability compensating for the difference between the source and the posterior distribution. The resulting sample of draws is approximately i.i.d. π. The success of the method depends on being able to find a good envelope, π_s, for the posterior. For most Bayesian models, it is difficult to find a reasonably good envelope.

1.1.2 Importance Sampling

Geweke (1989) provides an overview of importance sampling for i.i.d. draws. Let θ^(1), ..., θ^(N) be a sample of i.i.d. draws from the source distribution π_s. It is often convenient to work with a non-normalized version π*_s(θ) of the source density, where π_s(θ) = π*_s(θ)/d and d > 0 is a constant. The non-normalized importance weight of a point θ equals w*(θ) = f(x, θ)/π*_s(θ). On the other hand, the normalized weight, w(θ) = π(θ)/π_s(θ), cannot be computed exactly. The importance sampling estimator of E[g(θ)] is

Ê[g(θ)]_IS = Σ_{i=1}^n w*(θ^(i)) g(θ^(i)) / Σ_{i=1}^n w*(θ^(i)).   (1.1)

Under assumptions about the existence of first moments and about the support of π_s(θ), Theorem 1 of Geweke's paper states that Ê[g(θ)]_IS is a consistent estimator of E[g(θ)]. Under additional assumptions about the existence of appropriate second-order moments, Theorem 2 states that the estimator is asymptotically normal: √n (Ê[g(θ)]_IS − E[g(θ)]) → N(0, σ²_IS), where σ²_IS = E[w(θ) (g(θ) − E[g(θ)])²].
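As a concrete illustration of (1.1), the following minimal R sketch computes a self-normalized importance sampling estimate. The standard normal "posterior" and the Cauchy source are illustrative assumptions, not an example from this chapter.

    # Self-normalized importance sampling, as in (1.1).
    # Illustrative choices: target density proportional to dnorm (the "posterior"),
    # source density dcauchy (thicker-tailed than the target).
    set.seed(1)
    N <- 10000
    theta <- rcauchy(N)                      # i.i.d. draws from the source pi_s
    w.star <- dnorm(theta) / dcauchy(theta)  # non-normalized weights w*(theta)
    g <- theta^2                             # feature of interest: E[theta^2]
    est.IS <- sum(w.star * g) / sum(w.star)  # self-normalized estimate (1.1)
    est.IS                                   # close to 1, the true E[theta^2] under N(0,1)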

The relative numerical efficiency of Ê[g(θ)]_IS is defined as RNE = Var(g(θ))/σ²_IS. The RNE can be estimated consistently based on the importance sample. An RNE much smaller than unity indicates that the estimator Ê[g(θ)]_IS has a low precision. A better importance sampling source should then be considered, because the posterior itself, at least theoretically, is a source corresponding to a substantially better estimator. An RNE greater than unity implies that the estimator Ê[g(θ)]_IS has a smaller asymptotic MSE than the empirical average estimator based on i.i.d. draws from the posterior.

Theorem 3 of Geweke's paper states that π*_s(θ) = |g(θ) − E[g(θ)]| f(x, θ) optimizes the asymptotic variance of Ê[g(θ)]_IS. Importance sampling can therefore be regarded as a variance reduction technique. However, notice that the optimal source depends on g(·). Thus, computational costs may prevent one from sampling from the optimal source corresponding to each investigated feature. Nevertheless, the form of the optimal weight function suggests that a thicker-tailed source than the posterior would produce reasonably precise (although possibly sub-optimal) estimates. Choices for the source include a multivariate normal or t-distribution, with the parameters chosen to match the tail behavior of the posterior.

A useful diagnostic for the effectiveness of importance sampling estimation is a matrix scatterplot of the importance sampling weights versus the coordinates of the sampled points. The existence of high weights for outlying coordinate values would indicate that the importance sampling estimates are imprecise.

A related measure of the effectiveness of importance sampling is the effective sample size. This is defined as follows: given a set of N i.i.d. draws from the source

distribution π_s,

ESS = N / (1 + Var_{π_s}(w(θ))),

where w(·) is the normalized importance sampling weight. The intuitive interpretation of this quantity is that importance sampling estimation with π_s as the source has roughly the same precision as empirical average estimation based on ESS i.i.d. draws sampled from π. An ESS much smaller than N suggests imprecise importance sampling estimation.

The ESS involves essentially the same idea as the RNE. The two measures are equal if w(θ) and g(θ) are uncorrelated under π (Liu, 2001). However, the ESS is easier to estimate and is not tied to any particular posterior feature. The ESS is maximized when π_s ≡ π. Intuitively, the importance sampling source that works best for a wide variety of posterior features is the posterior distribution itself.

To summarize, the source density π_s(θ) should closely resemble the posterior density. This corresponds to Var_{π_s}(w(θ)) being small. The source should also be thicker-tailed than the posterior, so that large importance weights do not occur in the tails. This strategy would produce reasonably accurate importance sampling estimates of most posterior features.

Importance Link Function (ILF) estimation, introduced in MacEachern and Peruggia (2000a), is a sophisticated version of importance sampling. A transformation is applied to the source distribution so that the density of the transformed distribution more closely matches the posterior. Re-centering and re-scaling (applying an affine transformation) often achieves this goal. It is also computationally attractive because the Jacobian is trivially known for an affine transformation. More accurate importance sampling estimates are obtained in this fashion.
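Returning to the effective sample size defined above: since w(θ) differs from w*(θ) only by a multiplicative constant, the ESS can be estimated directly from the non-normalized weights. A minimal R sketch, reusing the hypothetical w.star from the earlier importance sampling illustration:

    # Estimated effective sample size from non-normalized importance weights.
    ess <- function(w.star) {
      w <- w.star / mean(w.star)   # estimates the normalized weight, which has mean 1
      N <- length(w.star)
      N / (1 + var(w))             # ESS = N / (1 + Var(w)); equals N when w is constant
    }
    ess(w.star)                    # a value much smaller than N flags imprecise estimation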

1.1.3 MCMC Estimation

For almost any Bayesian model, Markov Chain Monte Carlo (MCMC) methods can be applied to produce dependent samples approximately distributed as the posterior. Tierney (1994) states the following property of Markov chains: suppose the transition kernel P of the Markov chain is aperiodic, irreducible and has an invariant distribution π. Then π is the unique invariant distribution of P. For any starting value θ_0 belonging to a set having π-probability equal to 1, the n-th iterate of the kernel, P^n(θ_0, ·), converges in total variation distance to the distribution π as n → ∞. This convergence holds for all θ_0 ∈ Θ if the chain P is Harris recurrent.

This result implies that, under mild conditions, a Markov chain having the posterior π as the invariant distribution ultimately produces dependent samples approximately distributed as π. Let θ^(1), ..., θ^(N) represent the MCMC sample from the posterior. As stated in Tierney's paper, ergodic results and central limit theorems also hold for the empirical average estimators of most posterior features. These results provide a theoretical justification for discarding the initial MCMC draws and for using the post burn-in draws to estimate posterior features.

MCMC samples can be produced using the Metropolis-Hastings algorithm or Gibbs sampling, which is a special case of the former method.

Metropolis-Hastings Algorithm. Let Q(y, ·) be a distribution having the density q(y, ·) with respect to the measure µ. Let X_n = y denote the current state of the chain. The chain proposes a candidate state z ∼ Q(y, ·). It moves to the proposed state (i.e. X_{n+1} = z) with probability α(y, z) and rejects the move (i.e. X_{n+1} = y)

with probability 1 − α(y, z), where

α(y, z) = min{ [π(z) q(z, y)] / [π(y) q(y, z)], 1 }  if π(y) q(y, z) > 0,  and  α(y, z) = 1  if π(y) q(y, z) = 0.

Under weak assumptions, it can be shown that this chain has the invariant distribution π. The algorithm depends on π only through the ratio π(z)/π(y), which can be computed irrespective of the value of the marginal likelihood, m(x). In most situations, the Metropolis-Hastings algorithm can therefore be used to generate draws from the posterior.

Special cases of Metropolis-Hastings chains include random walk chains, for which the proposal distribution is of the form q(y, z) = h(z − y). The hit-and-run algorithm also belongs to this class. Independence chains, in which the candidate steps are chosen according to a fixed distribution (i.e. q(y, z) = h(z)), are another special case. These chains are closely related to importance sampling. Section 2.4 of Tierney's paper discusses hybrid strategies like cycles and mixtures. The Gibbs sampler is an example, in which several reducible kernels are cycled in a fixed or random order to obtain an irreducible kernel. A mixture or cycle is uniformly ergodic if one of its components is uniformly ergodic (Gilks et al., 1996). This property can be used to construct faster mixing chains. Mixing is also improved by occasionally restarting the chain by combining it with an independence chain.
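The following R sketch illustrates the algorithm for a random walk chain; the standard normal target and the proposal scale are illustrative assumptions only. Because the acceptance ratio uses only π(z)/π(y), the target density need not be normalized:

    # Random walk Metropolis-Hastings for a univariate target known up to a constant.
    # log.target: log of an unnormalized target density; here a N(0,1) illustration.
    log.target <- function(theta) -theta^2 / 2
    set.seed(1)
    N <- 10000
    theta <- numeric(N)
    theta[1] <- 0
    for (i in 2:N) {
      z <- theta[i - 1] + rnorm(1, sd = 1)          # proposal q(y, z) = h(z - y)
      log.alpha <- log.target(z) - log.target(theta[i - 1])
      theta[i] <- if (log(runif(1)) < log.alpha) z else theta[i - 1]  # accept/reject
    }
    mean(theta^2)  # empirical average estimate of E[theta^2]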

Gibbs Sampling. In this sampling scheme, the elements of the vector of model parameters are grouped into blocks, each consisting of a small number of components. Block sizes are typically 1, although it may be natural (as in a Bayesian random-effects model) to group an entire precision matrix into a single block. Grouping highly correlated components into a single higher-dimensional block may also improve the mixing of the chain.

Gibbs sampling updates the current state one block at a time. In the context of the Metropolis-Hastings algorithm, the proposal distribution is the conditional posterior distribution of the block being updated, given the data x and the current values of the remaining blocks. Candidate steps are always accepted, since α(y, z) is identically equal to 1. The block sizes are chosen to be very small because the full conditionals have to be sampled in a straightforward manner, possibly using numerical integration (or other techniques, like adaptive rejection sampling) when the full conditionals cannot be identified or easily sampled from. Random permutations of the updating order of the blocks are allowed. In fact, not all of the blocks need to be updated in a given cycle. Result GG1 of Gelfand and Smith (1990) states that the chain converges to π, as long as the blocks are updated according to an i.o. (infinitely often) visiting scheme.

Gibbs sampling is widely used because of its ease of implementation. However, these chains may mix slowly compared to more specialized chains. Provided the model follows a standard distribution at each hierarchical stage, the BUGS software can be used to automatically implement Gibbs sampling.
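As a toy illustration of block-at-a-time updating, consider a bivariate normal posterior with correlation ρ, for which both full conditionals are univariate normal. The target and its conditionals here are assumptions for the sketch, not a model from this dissertation:

    # Gibbs sampler for a bivariate normal target with mean 0, unit variances,
    # and correlation rho; each full conditional is N(rho * other, 1 - rho^2).
    set.seed(1)
    rho <- 0.9
    N <- 10000
    theta <- matrix(0, nrow = N, ncol = 2)
    for (i in 2:N) {
      # update block 1 given the current value of block 2
      theta[i, 1] <- rnorm(1, mean = rho * theta[i - 1, 2], sd = sqrt(1 - rho^2))
      # update block 2 given the new value of block 1
      theta[i, 2] <- rnorm(1, mean = rho * theta[i, 1], sd = sqrt(1 - rho^2))
    }
    cor(theta[, 1], theta[, 2])  # close to rho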

Importance Sampling Based on MCMC Draws. After an MCMC sample has been obtained from the posterior distribution, any subsequent change to the prior or the likelihood function of the model results in a different posterior distribution. We refer to the posterior resulting from the subsequent changes as the target posterior. Many applications involve the investigation of thousands of changes to the model, with each change producing a different posterior distribution. Generating a different MCMC sample for each target posterior is usually not cost-effective. Moreover, the target model may involve parameters that do not have an explicitly written standard distribution at each hierarchical stage. Off-the-shelf packages like BUGS then cannot be used to generate MCMC draws easily from the target posterior. Departures from a conjugate hierarchical structure may also result in slower mixing chains.

In such cases, importance sampling provides estimates of the target posterior features if the source posterior density dominates the target posterior density. Let π_s denote the source posterior, and π_t denote the target posterior distribution. Let m_s (m_t) denote the marginal density of the data under the source (target) model. Both m_t and m_s are typically unknown. The non-normalized importance sampling weight w*(θ) equals (f_t(x | θ) f_t(θ)) / (f_s(x | θ) f_s(θ)), where the subscripts t and s stand for target and source, and f_t(θ) and f_s(θ) are the respective priors under the target and the source models.

The importance sampling estimator Ê[g(θ)]_IS defined in (1.1) is consistent, by the ergodic result for empirical averages based on MCMC draws. Analogously to Theorem 2 of Geweke's paper, an application of the delta method and the central limit theorem for geometrically ergodic Markov chains (Geyer, 1992) gives the following result: assume that the Markov chain is geometrically ergodic with invariant distribution π_s. Under mild conditions, the estimator Ê[g(θ)]_IS is asymptotically normal: √n (Ê[g(θ)]_IS − E[g(θ)]) → N(0, σ²_IS) for some σ²_IS > 0. As described in Geyer (1992), the asymptotic variance σ²_IS can be estimated consistently by the batch means method or by window estimation.
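A minimal R sketch of the batch means method just mentioned; the batch count of 50 is an arbitrary illustrative choice, and g denotes the vector of g(θ) evaluations along a single MCMC run:

    # Batch means estimate of the asymptotic variance sigma^2 in
    # sqrt(n) * (g.bar - E[g]) -> N(0, sigma^2), from a single MCMC run.
    batch.means.var <- function(g, n.batches = 50) {
      b <- floor(length(g) / n.batches)       # draws per batch
      g <- g[seq_len(b * n.batches)]          # drop any leftover draws
      means <- colMeans(matrix(g, nrow = b))  # one mean per non-overlapping batch
      b * var(means)                          # variance of batch means, scaled by b
    }
    # e.g. batch.means.var(theta.chain^2) for draws stored in a vector theta.chain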

As with independent draws, a sufficient condition for importance sampling to produce reasonably accurate estimates is that the importance weight function w(θ) is bounded in the tails. Similarly to importance sampling with independent draws, a matrix scatterplot of the importance sampling weights versus the coordinates of the sample may be used to verify this. The ESS may be used because of computational convenience. Its use may be valid, especially if the j-lag covariance Cov_{π_s}(g(θ^(0)), g(θ^(j))) decays fast enough (e.g. if the function g(·) is bounded, in which case the lag covariance decays exponentially in j).

Importance sampling techniques can be used to produce faster mixing chains. As mentioned earlier, cycles or mixtures having a uniformly ergodic component are uniformly ergodic. Combining a uniformly ergodic chain with the regular Gibbs sampler therefore produces a faster mixing chain. As an example of a uniformly ergodic chain, consider an independence chain having a proposal density of h(z) for the candidate state z. Let y be the current state of the chain. Then w(z) = π_t(z)/h(z) is the importance sampling weight of the candidate state z, with the distribution h regarded as the source. The acceptance probability of the chain is α(y, z) = min{w(z)/w(y), 1}. Gilks et al. (1996) state that an independence chain is uniformly ergodic if the weight function is bounded, i.e. if the density h is thicker-tailed than the target posterior. Experience with importance sampling is therefore useful for constructing such uniformly ergodic independence chains.

An interesting feature of the importance link methodology (discussed earlier) is that it produces consistent importance sampling estimates even when the Markov chain is reducible. This contradicts common notions about MCMC methods. For example, consider a reducible chain having two separate components Θ_1 and Θ_2. The

probability of the chain going from one component to the other is zero. Suppose that a (possibly many-to-many) ILF mapping exists from the set Θ_1 onto the set Θ_2. A sample initialized in the set Θ_1 can then be transformed into a sample of points belonging to the set Θ_2. The transformed sample can be used to estimate the feature E[g(θ) I_{Θ_2}(θ)] consistently by importance sampling. The feature E[g(θ)] = E[g(θ) I_{Θ_1}(θ)] + E[g(θ) I_{Θ_2}(θ)] can therefore be consistently estimated.

1.2 Systematic Subsampling

While using an MCMC sample to investigate the posterior distribution of a vector-valued parameter θ, a subsample of the MCMC output is often all that is retained for further investigation of the posterior distribution. Systematic subsampling of the MCMC output is not recommended unless the computational cost of processing the output or of creating the estimator is much greater than the cost of generating the sample. This is convincingly argued in Geyer (1992) as follows: consider the empirical average estimator, Ê_k[g(θ)], based on a systematic 1-in-k subsample of size n. The costs of generating the samples and of creating the estimator are approximately equal to c_1 nk and c_2 n, respectively, for large n. The total cost is therefore approximately (c_1 k + c_2)n. The precision of the estimator is approximately n/σ²_k, where σ²_k is the asymptotic variance of Ê_k[g(θ)]. The effective precision of the estimator, i.e. the precision per unit cost, is therefore (σ²_k (c_1 k + c_2))⁻¹ for large values of n. If the ratio c_2/c_1 is negligible, the limit equals (k σ²_k c_1)⁻¹. As shown in Geyer's paper and also in MacEachern and Berliner (1994), k σ²_k > σ²_1 if k > 1. Any subsampling is therefore bad if this cost structure applies.
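The inequality k σ²_k > σ²_1 is easy to check numerically. The R sketch below uses a stationary AR(1) chain as a stand-in for positively correlated MCMC output (an illustrative assumption); it relies on the standard AR(1) facts that a 1-in-k subsample of such a chain is itself AR(1) with coefficient φ^k, and that the sample mean then has asymptotic variance γ_0 (1 + φ^k)/(1 − φ^k):

    # Check k * sigma_k^2 > sigma_1^2 for an AR(1) chain with autocorrelation phi.
    phi <- 0.9
    gamma0 <- 1 / (1 - phi^2)                 # stationary variance of the chain
    sigma2 <- function(k) gamma0 * (1 + phi^k) / (1 - phi^k)
    k <- c(1, 2, 5, 10, 100)
    cbind(k, k * sigma2(k))                   # increasing in k, as claimed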

Nevertheless, subsampling is often necessary in computationally intensive or real-time, interactive investigations where speed is essential. Examples include expensive plot processing and examination of changes in the prior (sensitivity analysis), likelihood (robustness) or data (case influence). Typically, such studies involve hundreds or thousands of changes to the model, necessitating subsampling. Practical constraints, like the limited disk space available to users of shared computing resources, often make it infeasible to store the entire sample of MCMC draws when the parameter has a large number of components. A subsample is then retained for future investigation of the posterior.

Subsampling may actually result in an increase in the effective precision of estimation if the ratio c_2/c_1 is non-negligible or if the processing cost is non-linear in n. The optimal value of k would then be different from one. As an example of a situation where subsampling is beneficial, consider the estimator for the marginal likelihood introduced in section 5.2 of Kong et al. (2003), whose asymptotic precision is of a higher order than that of Chib (1995). Using the same notation as above, the new likelihood estimator, based on a subsample of size n, has a total cost of c_1 n + c_2 n² and a precision of p_k n². The effective precision is therefore approximately equal to p_k/c_2 for large n. The positive correlation typical of Gibbs samplers implies that setting k > 1 results in an estimator having a higher effective precision. The above argument appears in MacEachern, Peruggia and Guha (2003).

1.3 Benchmark Estimation for MCMC Samplers

This dissertation introduces a subsample-based estimation technique called benchmark estimation. The goal of benchmark estimation is to improve estimation based on a subsample by using some of the discarded information available in the full MCMC sample. The benchmark estimates must be quick and easy to compute. They must also be compatible with quick computations for further, more (computationally) expensive analyses based on the eventual subsample.

Several motivating perspectives are useful to understand and investigate various aspects of benchmark estimation. The point of view of calibration estimation, developed in the sampling literature to improve survey estimates (Deville and Särndal, 1992; Vanderhoeft, 2001), helps to bring all these perspectives together into a unified framework. In calibration estimation, a probability sample from a finite population is used to compute estimates of population quantities of interest. The (regression type) estimators are built as weighted averages of the observations in the sample, with the weights determined so as to satisfy a (vector-valued) calibration equation which forces the resulting estimators to produce exact estimates of known population features. Usually, the constraints imposed by the calibration equation do not determine a unique set of weights. Thus, among the sets of weights satisfying the calibration equation, one chooses the set that yields weights that are as close as possible (with respect to some distance metric) to a fixed set of prespecified (typically uniform) weights.

To cast MCMC benchmark estimation into the framework of calibration estimation, we regard the MCMC output as a finite population and a 1-in-k systematic subsample as a probability sample drawn from the finite population. This systematic

sampling design gives each unit in the population a probability 1/k of being selected, though many joint inclusion probabilities are 0. In this setting, the (vector-valued) benchmark E[h(θ)], for which the subsample estimate is forced to match the full sample estimate, corresponds to the auxiliary information available through the calibration equation. Once the calibration weights have been calculated, they can be used to compute the calibration subsample estimate of any feature E[g(θ)]. As the full MCMC sample size increases, the asymptotic performance of these benchmark estimators matches that of the corresponding calibration estimators. The benchmark estimators introduced in Chapter 2 can be shown to be calibration estimators corresponding to appropriately chosen calibration equations and metrics.

1.4 An Overview of this Dissertation

The rest of this dissertation is organized as follows: Chapter 2 investigates two methods of creating weights: post-stratification and maximum entropy. In their simplest form, post-stratification weights are derived by partitioning the parameter space into a finite number of regions and by forcing the weighted subsample frequencies of each region to match the corresponding raw frequencies for the entire MCMC sample. The weights are taken to be constant over the various elements of the partition and to sum to one.

An improved version of post-stratification (and, in fact, the approach that in our experience has generated the most successful estimators) begins with a representation of an arbitrary function g(θ) as a countable linear combination of basis functions h_j(θ): g(θ) = Σ_{j=1}^∞ c_j h_j(θ). The estimand, E[g(θ)], is expressed as the same linear

combination of integrals of the basis functions, Σ_{j=1}^∞ c_j E[h_j(θ)]. Splitting the infinite series representation of g(θ) into two parts, we have a finite series which may provide a good approximation to g(θ) and an infinite remainder series that fills out the representation of g(θ). Focusing on the finite series, we determine the weights by forcing estimates of E[h_j(θ)] based on the subsample to match those based on the full sample. In addition, we require the weights to be constant over the elements of a suitably chosen partition of the parameter space and to sum to one. This produces a better estimate of E[Σ_{j=1}^m c_j h_j(θ)] than one based on the subsample alone. The improvement carries over to the estimation of E[g(θ)] when the tail of the series is of minor importance. We refer to the finite set of basis functions as the (vector-valued) benchmark function.

In both the basic and improved post-stratification approaches, we specify enough conditions that (for virtually all MCMC samples of reasonable size) there is a unique set of weights satisfying them. Thus, from the point of view of calibration estimation, the choice of the distance metric becomes immaterial, in the sense that any metric would yield identical weights. In this respect, our post-stratification weights arise from a degenerate instance of a problem of calibration estimation. In the case of the maximum entropy weights, however, we do not specify enough benchmark conditions to make the weights unique. Rather, among the sets of weights satisfying an under-determined number of benchmark conditions, we select the set having maximum entropy and this, from the point of view of calibration estimation, is tantamount to choosing a specific distance metric.

As we mentioned earlier, many computationally expensive investigations involve hundreds or thousands of changes to the model. Each change in the prior, likelihood

or the data corresponds to a different posterior distribution. Generating a different MCMC sample for each posterior is usually impossible because of the prohibitive amounts of time and computational effort required. Importance sampling can then be used along with subsampling to estimate quickly various features of interest of the different target posteriors. We show how post-stratification can be used in conjunction with importance sampling to improve future estimation based on the stored subsample. The combination of the techniques makes possible quick and accurate estimation in these situations. Moreover, the cost of combining the two techniques is negligible.

Chapter 3 investigates the large-sample properties of post-stratification estimators under three asymptotic situations:

Case A. The subsampling distance k tends to ∞, with the subsample size n remaining fixed. This asymptotic is motivated by the fact that in situations where subsampling is necessary, the cost of processing the MCMC draws to produce empirical average estimates generally exceeds the cost of generating the draws.

Case B. The subsample size n tends to ∞, with the subsampling distance k remaining fixed. This is motivated by benchmark estimation based on a large number of widely spaced (approximately independent) MCMC draws, where the subsampling distance k is fixed by an initial run of the MCMC algorithm.

Case C. The subsample size n and the subsampling distance k jointly tend to ∞. This asymptotic is also a natural candidate for theoretical exploration. It is motivated by MCMC estimation based on a large subsample of widely spaced and approximately independent draws. Viewing this process as a whole, k and n both tend to ∞ as the computational resources grow.

For Case C, we obtain a general result that applies to subsample-based estimators relying on a combination of importance sampling and post-stratification weights. As a corollary, we obtain expressions for the asymptotic variances of post-stratification estimators that do not rely on importance sampling. Although the asymptotic is technically difficult, a new set of analytical tools, presented as Lemma B.1.1 and its corollaries in the Appendix, allows us to pass in a relatively straightforward manner to an i.i.d. sample of draws. The asymptotic precisions of the combination-weighted estimators are then quantified by a linear projection on the space of the benchmark functions and the strata indicators. The theoretical results presented in Chapter 3 suggest the substantial gains that we see in practice for benchmark estimation.

Chapter 4 illustrates the methodology on examples from the literature. By itself and in conjunction with importance sampling, benchmark estimation results in substantial benefits in the estimation of E[g(θ)] for a variety of functions, g(θ). The extent of the improvement in the estimation of E[g(θ)] for functions that are noticeably different from the benchmarks is striking, even for values of m as small as 3 or 4. As illustrated in Chapter 4 using the data from George, Makov and Smith (1993), the theoretical results are a close match to the actual behavior of the estimators, even for a subsample distance k as small as 10.

Chapter 5 summarizes the dissertation and provides pointers to future research.

Appendix A contains the proofs of the result stated in Chapter 3 describing the behavior of subsample-based post-stratification estimators under Asymptotic B.

Appendix B develops the asymptotic tool used in Chapter 3 to investigate the properties, under Asymptotic C, of subsample estimators based on a combination of

post-stratification and importance sampling weights. It also provides the details of the proofs of related results.

Appendix C derives the closed-form expression for the likelihood function obtained in the pumps example of Chapter 4 by considering over- or under-dispersed waiting times for the basic gamma-Poisson model.

Appendix D contains the R code used to generate MCMC samples for the allometry example of Chapter 4.

CHAPTER 2

A Simple Approach to Benchmark Estimation

2.1 An Improved Subsample Estimator

Let θ ∈ Θ be a vector-valued parameter. Imagine that an MCMC sample is drawn from the posterior distribution of θ. Call the sequence of draws θ^(1), θ^(2), ..., θ^(N). The draws are used to estimate some feature of the posterior distribution. Often these features of interest can be represented as E[g(θ)] for some (possibly vector-valued) function g(θ). The most straightforward estimator for E[g(θ)] is

Ê[g(θ)]_f = (1/N) Σ_{i=1}^N g(θ^(i)),   (2.1)

where the subscript f denotes the full sample estimator. If one selects a systematic 1-in-k subsample of the data, the natural estimator is

Ê[g(θ)]_s = (1/n) Σ_{i=1}^n g(θ^(ki)),   (2.2)

where N = kn and the subscript s denotes the subsample estimator. As mentioned in Chapter 1, this form of subsampling always leads to poorer estimation; the unweighted subsample estimator (2.2) has a larger variance than the full sample estimator (2.1).

We wish to use the information available from the full sample to improve future estimation based on the subsample for a large class of posterior features. For an

appropriately chosen (and possibly vector-valued) function h(θ), we refer to the feature E[h(θ)] as the benchmark. We now create a weighted version of the subsample estimator of E[g(θ)] as follows:

Ê[g(θ)]_w = Σ_{i=1}^n w_i g(θ^(ki)),   (2.3)

where Σ_{i=1}^n w_i = 1. The weights w_i are chosen so that they force the weighted subsample benchmark estimate to equal the full sample estimate:

Ê[h(θ)]_w = Ê[h(θ)]_f.   (2.4)

Thus Ê[h(θ)]_w and Ê[h(θ)]_f have the same distributions, provided the weights can be constructed, and all features of their distributions conditional on this event are the same. For a vector-valued benchmark function, any linear combination of its coordinates results in the same estimate for both the subsample and the full sample, and the estimators have the same distribution. In particular, the two estimators have the same variance, and we have possibly greatly increased precision for our subsample estimator of E[g(θ)].

The connection between a conditionally conjugate structure and linear posterior expectation in exponential families implies that, for many popular models, quantities such as the conditional posterior mean for a case or the conditional posterior predictive mean will be a linear function of hyperparameters. The structure of the hierarchical model enables us to use benchmark functions based on the hyperparameters to create more accurate estimates of these quantities. The reduction in variability when moving from Ê[h(θ)]_s to Ê[h(θ)]_w also appears when examining expectations of functions g(θ) that are similar to h(θ). Functions such as a predictive variance, which depend on first and second moments, will typically be closely related to benchmark functions based on the hyperparameters, and so they will be more accurately estimated with our technique.

The weighted subsample, (w_i, θ^(ki)) for i = 1, 2, ..., n, is now used in place of the unweighted subsample. The weights act exactly as they would if arising from an importance sample, and so we obtain weighted subsample estimates Ê[g(θ)]_w for various features of interest E[g(θ)] of the posterior. Techniques and software developed for importance samples can be used without modification for these weighted samples.
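Once weights are in hand, every weighted subsample estimate has the same form as a normalized importance sampling estimate. A minimal R sketch of the estimator in (2.3), with illustrative names:

    # Weighted subsample estimate (2.3): theta.sub holds the n subsampled draws
    # (one per row) and w the corresponding weights, constructed to sum to one.
    est.w <- function(g, theta.sub, w) {
      sum(w * apply(theta.sub, 1, g))  # sum_i w_i g(theta^(ki)) for scalar-valued g
    }
    # e.g. est.w(function(th) th[1]^2, theta.sub, w) estimates E[theta_1^2]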

2.2 Some Methods of Obtaining Weights

The constraints that Σ_{i=1}^n w_i = 1 and that Ê[h(θ)]_w = Ê[h(θ)]_f will not typically determine the w_i. With a single real benchmark function, we would have only two linear constraints on the w_i. We supplement the constraints with a principle that yields a unique set of weights. The two principles we investigate are motivated by the literatures on survey sampling and information theory.

Weights by Post-stratification. Post-stratification is a standard technique in survey sampling, designed to ensure that a sample matches certain characteristics of a population. The population characteristics are matched by computing a weight for each unit in the sample. Large sample results show that a post-stratified sample behaves almost exactly like a proportionally allocated stratified sample. This type of stratification reduces the variance of estimates as compared to a simple random sample.

In this setting, the full sample plays the role of the population while the subsample plays the role of the sample. Thus, the essence of the technique is to partition the parameter space into (say) m strata, and to assign the same weight to each θ^(ki) in a

stratum. Formally, suppose that {Θ_j}_{j=1}^m is a finite partition of the parameter space Θ. Let I_j(θ) denote the indicator of the set Θ_j, for j = 1, ..., m. The natural application of the post-stratification method takes as the benchmark function the vector of these m indicator functions. That is, h(θ) = (I_1(θ), I_2(θ), ..., I_m(θ)). We assign the same weight to all subsample points belonging to a given stratum. Specifically, for all i such that θ^(ki) ∈ Θ_j, we set w_i = v_j, where, according to (2.4), the values v_j are determined by

Σ_{i=1}^n v_j I_j(θ^(ki)) = (1/N) Σ_{i=1}^N I_j(θ^(i)),   j = 1, ..., m.

The post-stratification weights are then obtained as

v_j = [N⁻¹ Σ_{i=1}^N I_j(θ^(i))] / [Σ_{i=1}^n I_j(θ^(ki))],   (2.5)

provided each of the strata contains at least one subsample point. As in survey sampling, with fairly well chosen strata, the chance that any of the strata is empty of subsample points is negligible. Intuitively, the post-stratification weight v_j is the ratio of the proportion of full sample points in Θ_j to the number of subsample points in Θ_j. We refer to this subsample estimator as the basic post-stratification estimator, Ê[g(θ)]_{w,PS}.
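A minimal R sketch of the basic weights in (2.5), with strata defined by cutting one coordinate of the draws at fixed points; the strata boundaries here are illustrative assumptions:

    # Basic post-stratification weights (2.5). theta.full: all N draws of one
    # coordinate; theta.sub: the n subsampled draws; breaks: strata boundaries.
    ps.weights <- function(theta.full, theta.sub, breaks) {
      s.full <- cut(theta.full, breaks)   # stratum of each full sample draw
      s.sub  <- cut(theta.sub,  breaks)   # stratum of each subsample draw
      # v_j = (proportion of full sample in stratum j) / (subsample count in j)
      v <- (table(s.full) / length(theta.full)) / table(s.sub)
      as.numeric(v)[as.integer(s.sub)]    # w_i = v_j for theta^(ki) in stratum j
    }
    # e.g. w <- ps.weights(theta.full, theta.sub, breaks = c(-Inf, -1, 0, 1, Inf))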

The perspective of a basis expansion of g(θ) provides a more sophisticated use of post-stratification. Instead of using a basis formed of indicator functions (essentially a Haar basis), alternative bases consist of functions other than indicators. An attractive basis, due to its success throughout statistics, is the polynomial basis that generates Taylor series. Assigning equal weight to subsample points within each given post-stratum yields n − m linear constraints on the weights, and forcing the weights to sum to 1 provides one additional constraint. Supplementing these with a further m − 1 linear constraints on the weights (and also with mild conditions on the posterior distribution and simulation method to guarantee uniqueness) defines the weights. These constraints are provided by matching the full sample estimates and the weighted subsample estimates of a vector-valued benchmark function consisting of m − 1 components, denoted by h(θ) = (h_1(θ), ..., h_{m−1}(θ)). The weights are quickly obtained as the solution to a system of m linear equations. This version of post-stratification has proven to be extremely effective in practice.

Let c represent the column vector (1, Ê[h_1(θ)]_f, Ê[h_2(θ)]_f, ..., Ê[h_{m−1}(θ)]_f)'. For t = 1, ..., m − 1 and j = 1, ..., m, let h_{t,j} = Σ_{i=1}^n h_t(θ^(ki)) I_j(θ^(ki)) be the sum of the function h_t(θ) over the subsample points belonging to stratum Θ_j. For j = 1, ..., m, let n_j be the number of subsample points falling in stratum Θ_j. Define the square matrix B as

B = [ n_1/n         ...  n_m/n
      h_{1,1}/n     ...  h_{1,m}/n
      ...
      h_{m−1,1}/n   ...  h_{m−1,m}/n ].   (2.6)

Provided the matrix is invertible, the vector of modified post-stratification weights, v = (v_1, ..., v_m)', is determined as the unique solution to the following system of linear equations:

v = (nB)⁻¹ c.   (2.7)

We refer to an estimator based on these benchmark weights as a modified post-stratification estimator, Ê[g(θ)]_{w2}. The estimator is defined arbitrarily when B is singular. We shall prove later, under mild assumptions, that this is a rare event for n large enough.
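A minimal R sketch of (2.6)-(2.7); the single benchmark function, the two-strata partition that makes B square, and all names are illustrative assumptions:

    # Modified post-stratification weights (2.7): v = (nB)^{-1} c.
    # With a single real benchmark function h, the partition has m = 2 strata,
    # so that B in (2.6) is a square 2 x 2 matrix.
    # theta.full: all N draws; theta.sub: the n subsampled draws;
    # strata: factor with 2 levels giving the stratum of each subsample draw.
    mps.weights <- function(theta.full, theta.sub, h, strata) {
      n <- length(theta.sub)
      c.vec <- c(1, mean(h(theta.full)))                  # c = (1, E-hat[h]_f)'
      B <- rbind(tabulate(strata, 2) / n,                 # row 1: n_j / n
                 tapply(h(theta.sub), strata, sum) / n)   # row 2: h_{1,j} / n
      v <- solve(n * B, c.vec)                            # stratum weights v (2.7)
      v[as.integer(strata)]                               # w_i = v_j on stratum j
    }
    # e.g. w <- mps.weights(th.full, th.sub, h = function(t) t,
    #                       strata = factor(th.sub > median(th.full)))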

Equivalent expressions for the estimator Ê[g(θ)]_{w2} are

Ê[g(θ)]_{w2} = (ḡ_1, ..., ḡ_m) B⁻¹ c = Σ_{j=1}^m (n_j v_j) ḡ_{j,n},   (2.8)

where ḡ_j = (1/n) Σ_{i=1}^n g(θ^(ki)) I_j(θ^(ki)) and ḡ_{j,n} = (1/n_j) Σ_{i=1}^n g(θ^(ki)) I_j(θ^(ki)), for j = 1, ..., m.

Unlike the simple post-stratification weights, some of the weights obtained by the modified version of post-stratification may occasionally be negative. The weighted subsample cannot be interpreted as a probability sample in that case. However, it can be shown that this happens very rarely for large subsample sizes. In the simulation study of Chapter 4, where subsamples of size n = 1000 are generated, the modified post-stratification weights were positive for all 100 independent replications of the chain.

Maximum Entropy Weights. Information theory describes, in various fashions, the amount of information in the data about a parameter or distribution. In a Bayesian context, it is often used to describe subjective information (playing the role of data) in order to elicit a prior distribution. This is accomplished by specifying a number of features of the distribution, typically expectations, as the information about the prior. The prior is then chosen to reflect this information but no more. With entropy defined as the negative of information, the prior which reflects exactly the desired information is the one that maximizes entropy among those priors matching the constraints.

In our setting, we borrow this technique, matching exactly the information in the full sample benchmark estimates, but no more. Let w = (w_1, w_2, ..., w_n) be the n-tuple of weights given in (2.3). Let us denote by Ω the (possibly empty) set of all n-tuples of weights that satisfy (2.4). Thus Ω is the set {w : w_i ≥ 0 for all i, Σ_{i=1}^n w_i = 1, Σ_{i=1}^n w_i h(θ^(ki)) = Ê[h(θ)]_f}.

Definition. The entropy of an n-tuple w belonging to the set Ω is defined as

En(w) = −Σ_{i=1}^n w_i ln w_i,

subject to the convention that 0 ln(0) equals 0.

We observe that for all w belonging to Ω, En(w) ≤ En((1/n, 1/n, ..., 1/n)) = ln(n). Since Ω is closed, there exists an element w* of Ω such that En(w*) = sup_{w ∈ Ω} En(w). These weights w* are called maximum entropy weights, and they exist whenever Ω is non-empty. Finding maximum entropy weights w* is thus equivalent to maximizing En(w) subject to the constraints w_i ≥ 0 for i = 1, ..., n, Σ_{i=1}^n w_i = 1, and Σ_{i=1}^n w_i h(θ^(ki)) = Ê[h(θ)]_f. Since Ω is a closed, convex set and En(w) is a strictly concave function, the maximum entropy weights w* are unique whenever they exist.

For a real benchmark function h(θ) and most subsamples of reasonable size, it can be shown that the maximum entropy weights w* are given by

w*_i = e^{λ_1 + λ_2 h(θ^(ki))},   i = 1, 2, ..., n,   (2.9)

where λ_2 ∈ R satisfies the equation

Σ_{i=1}^n (h(θ^(ki)) − Ê[h(θ)]_f) exp(λ_2 (h(θ^(ki)) − Ê[h(θ)]_f)) = 0.   (2.10)
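Equation (2.10) is a one-dimensional root-finding problem, so the maximum entropy weights are cheap to compute. A minimal R sketch; the search interval for λ_2 is an illustrative assumption:

    # Maximum entropy weights (2.9)-(2.10) for a real benchmark function h.
    # h.sub: h(theta^(ki)) over the subsample; bench.f: full sample estimate E-hat[h]_f.
    maxent.weights <- function(h.sub, bench.f, interval = c(-50, 50)) {
      d <- h.sub - bench.f
      # Left-hand side of (2.10), a strictly increasing function of lambda2.
      lhs <- function(lambda2) sum(d * exp(lambda2 * d))
      lambda2 <- uniroot(lhs, interval)$root
      w <- exp(lambda2 * h.sub)   # proportional to (2.9); exp(lambda1) fixed below
      w / sum(w)                  # normalizing determines lambda1 implicitly
    }
    # e.g. w <- maxent.weights(h(theta.sub), mean(h(theta.full)))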


More information

Statistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling

Statistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling 1 / 27 Statistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling Melih Kandemir Özyeğin University, İstanbul, Turkey 2 / 27 Monte Carlo Integration The big question : Evaluate E p(z) [f(z)]

More information

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013 Eco517 Fall 2013 C. Sims MCMC October 8, 2013 c 2013 by Christopher A. Sims. This document may be reproduced for educational and research purposes, so long as the copies contain this notice and are retained

More information

Adaptive Monte Carlo methods

Adaptive Monte Carlo methods Adaptive Monte Carlo methods Jean-Michel Marin Projet Select, INRIA Futurs, Université Paris-Sud joint with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Peter Beerli October 10, 2005 [this chapter is highly influenced by chapter 1 in Markov chain Monte Carlo in Practice, eds Gilks W. R. et al. Chapman and Hall/CRC, 1996] 1 Short

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

Markov Chain Monte Carlo in Practice

Markov Chain Monte Carlo in Practice Markov Chain Monte Carlo in Practice Edited by W.R. Gilks Medical Research Council Biostatistics Unit Cambridge UK S. Richardson French National Institute for Health and Medical Research Vilejuif France

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

I. Bayesian econometrics

I. Bayesian econometrics I. Bayesian econometrics A. Introduction B. Bayesian inference in the univariate regression model C. Statistical decision theory D. Large sample results E. Diffuse priors F. Numerical Bayesian methods

More information

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC Stat 451 Lecture Notes 07 12 Markov Chain Monte Carlo Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapters 8 9 in Givens & Hoeting, Chapters 25 27 in Lange 2 Updated: April 4, 2016 1 / 42 Outline

More information

Basic Sampling Methods

Basic Sampling Methods Basic Sampling Methods Sargur Srihari srihari@cedar.buffalo.edu 1 1. Motivation Topics Intractability in ML How sampling can help 2. Ancestral Sampling Using BNs 3. Transforming a Uniform Distribution

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

INTRODUCTION TO MARKOV CHAIN MONTE CARLO

INTRODUCTION TO MARKOV CHAIN MONTE CARLO INTRODUCTION TO MARKOV CHAIN MONTE CARLO 1. Introduction: MCMC In its simplest incarnation, the Monte Carlo method is nothing more than a computerbased exploitation of the Law of Large Numbers to estimate

More information

Answers and expectations

Answers and expectations Answers and expectations For a function f(x) and distribution P(x), the expectation of f with respect to P is The expectation is the average of f, when x is drawn from the probability distribution P E

More information

The Bayesian Approach to Multi-equation Econometric Model Estimation

The Bayesian Approach to Multi-equation Econometric Model Estimation Journal of Statistical and Econometric Methods, vol.3, no.1, 2014, 85-96 ISSN: 2241-0384 (print), 2241-0376 (online) Scienpress Ltd, 2014 The Bayesian Approach to Multi-equation Econometric Model Estimation

More information

On Bayesian Computation

On Bayesian Computation On Bayesian Computation Michael I. Jordan with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang Previous Work: Information Constraints on Inference Minimize the minimax risk under constraints

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

Lecture 8: The Metropolis-Hastings Algorithm

Lecture 8: The Metropolis-Hastings Algorithm 30.10.2008 What we have seen last time: Gibbs sampler Key idea: Generate a Markov chain by updating the component of (X 1,..., X p ) in turn by drawing from the full conditionals: X (t) j Two drawbacks:

More information

arxiv: v1 [stat.co] 18 Feb 2012

arxiv: v1 [stat.co] 18 Feb 2012 A LEVEL-SET HIT-AND-RUN SAMPLER FOR QUASI-CONCAVE DISTRIBUTIONS Dean Foster and Shane T. Jensen arxiv:1202.4094v1 [stat.co] 18 Feb 2012 Department of Statistics The Wharton School University of Pennsylvania

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Monte Carlo Integration using Importance Sampling and Gibbs Sampling

Monte Carlo Integration using Importance Sampling and Gibbs Sampling Monte Carlo Integration using Importance Sampling and Gibbs Sampling Wolfgang Hörmann and Josef Leydold Department of Statistics University of Economics and Business Administration Vienna Austria hormannw@boun.edu.tr

More information

Markov Chain Monte Carlo Using the Ratio-of-Uniforms Transformation. Luke Tierney Department of Statistics & Actuarial Science University of Iowa

Markov Chain Monte Carlo Using the Ratio-of-Uniforms Transformation. Luke Tierney Department of Statistics & Actuarial Science University of Iowa Markov Chain Monte Carlo Using the Ratio-of-Uniforms Transformation Luke Tierney Department of Statistics & Actuarial Science University of Iowa Basic Ratio of Uniforms Method Introduced by Kinderman and

More information

FAV i R This paper is produced mechanically as part of FAViR. See for more information.

FAV i R This paper is produced mechanically as part of FAViR. See  for more information. Bayesian Claim Severity Part 2 Mixed Exponentials with Trend, Censoring, and Truncation By Benedict Escoto FAV i R This paper is produced mechanically as part of FAViR. See http://www.favir.net for more

More information

Advanced Statistical Modelling

Advanced Statistical Modelling Markov chain Monte Carlo (MCMC) Methods and Their Applications in Bayesian Statistics School of Technology and Business Studies/Statistics Dalarna University Borlänge, Sweden. Feb. 05, 2014. Outlines 1

More information

Markov chain Monte Carlo

Markov chain Monte Carlo 1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop

More information

Brief introduction to Markov Chain Monte Carlo

Brief introduction to Markov Chain Monte Carlo Brief introduction to Department of Probability and Mathematical Statistics seminar Stochastic modeling in economics and finance November 7, 2011 Brief introduction to Content 1 and motivation Classical

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS. In general. Inference in Bayesian models SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

More information

MONTE CARLO METHODS. Hedibert Freitas Lopes

MONTE CARLO METHODS. Hedibert Freitas Lopes MONTE CARLO METHODS Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes hlopes@chicagobooth.edu

More information

Advances and Applications in Perfect Sampling

Advances and Applications in Perfect Sampling and Applications in Perfect Sampling Ph.D. Dissertation Defense Ulrike Schneider advisor: Jem Corcoran May 8, 2003 Department of Applied Mathematics University of Colorado Outline Introduction (1) MCMC

More information

Bayesian Inference. Chapter 4: Regression and Hierarchical Models

Bayesian Inference. Chapter 4: Regression and Hierarchical Models Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Advanced Statistics and Data Mining Summer School

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

TEORIA BAYESIANA Ralph S. Silva

TEORIA BAYESIANA Ralph S. Silva TEORIA BAYESIANA Ralph S. Silva Departamento de Métodos Estatísticos Instituto de Matemática Universidade Federal do Rio de Janeiro Sumário Numerical Integration Polynomial quadrature is intended to approximate

More information

ABC methods for phase-type distributions with applications in insurance risk problems

ABC methods for phase-type distributions with applications in insurance risk problems ABC methods for phase-type with applications problems Concepcion Ausin, Department of Statistics, Universidad Carlos III de Madrid Joint work with: Pedro Galeano, Universidad Carlos III de Madrid Simon

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model

More information

EM Algorithm II. September 11, 2018

EM Algorithm II. September 11, 2018 EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data

More information

Marginal Specifications and a Gaussian Copula Estimation

Marginal Specifications and a Gaussian Copula Estimation Marginal Specifications and a Gaussian Copula Estimation Kazim Azam Abstract Multivariate analysis involving random variables of different type like count, continuous or mixture of both is frequently required

More information

University of Toronto Department of Statistics

University of Toronto Department of Statistics Norm Comparisons for Data Augmentation by James P. Hobert Department of Statistics University of Florida and Jeffrey S. Rosenthal Department of Statistics University of Toronto Technical Report No. 0704

More information

Monte Carlo methods for sampling-based Stochastic Optimization

Monte Carlo methods for sampling-based Stochastic Optimization Monte Carlo methods for sampling-based Stochastic Optimization Gersende FORT LTCI CNRS & Telecom ParisTech Paris, France Joint works with B. Jourdain, T. Lelièvre, G. Stoltz from ENPC and E. Kuhn from

More information

Bayesian Inference. Chapter 4: Regression and Hierarchical Models

Bayesian Inference. Chapter 4: Regression and Hierarchical Models Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative

More information

On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational

More information

Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

More information

Statistical Inference for Stochastic Epidemic Models

Statistical Inference for Stochastic Epidemic Models Statistical Inference for Stochastic Epidemic Models George Streftaris 1 and Gavin J. Gibson 1 1 Department of Actuarial Mathematics & Statistics, Heriot-Watt University, Riccarton, Edinburgh EH14 4AS,

More information

Markov Chain Monte Carlo Methods

Markov Chain Monte Carlo Methods Markov Chain Monte Carlo Methods John Geweke University of Iowa, USA 2005 Institute on Computational Economics University of Chicago - Argonne National Laboaratories July 22, 2005 The problem p (θ, ω I)

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US

Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US Gerdie Everaert 1, Lorenzo Pozzi 2, and Ruben Schoonackers 3 1 Ghent University & SHERPPA 2 Erasmus

More information

10. Exchangeability and hierarchical models Objective. Recommended reading

10. Exchangeability and hierarchical models Objective. Recommended reading 10. Exchangeability and hierarchical models Objective Introduce exchangeability and its relation to Bayesian hierarchical models. Show how to fit such models using fully and empirical Bayesian methods.

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

16 : Markov Chain Monte Carlo (MCMC)

16 : Markov Chain Monte Carlo (MCMC) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 16 : Markov Chain Monte Carlo MCMC Lecturer: Matthew Gormley Scribes: Yining Wang, Renato Negrinho 1 Sampling from low-dimensional distributions

More information

Sequential Monte Carlo Methods

Sequential Monte Carlo Methods University of Pennsylvania Bradley Visitor Lectures October 23, 2017 Introduction Unfortunately, standard MCMC can be inaccurate, especially in medium and large-scale DSGE models: disentangling importance

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information

Bayesian Nonparametric Regression for Diabetes Deaths

Bayesian Nonparametric Regression for Diabetes Deaths Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,

More information

Mixture models. Mixture models MCMC approaches Label switching MCMC for variable dimension models. 5 Mixture models

Mixture models. Mixture models MCMC approaches Label switching MCMC for variable dimension models. 5 Mixture models 5 MCMC approaches Label switching MCMC for variable dimension models 291/459 Missing variable models Complexity of a model may originate from the fact that some piece of information is missing Example

More information

Bayesian Networks in Educational Assessment

Bayesian Networks in Educational Assessment Bayesian Networks in Educational Assessment Estimating Parameters with MCMC Bayesian Inference: Expanding Our Context Roy Levy Arizona State University Roy.Levy@asu.edu 2017 Roy Levy MCMC 1 MCMC 2 Posterior

More information

Introduction to Markov Chain Monte Carlo & Gibbs Sampling

Introduction to Markov Chain Monte Carlo & Gibbs Sampling Introduction to Markov Chain Monte Carlo & Gibbs Sampling Prof. Nicholas Zabaras Sibley School of Mechanical and Aerospace Engineering 101 Frank H. T. Rhodes Hall Ithaca, NY 14853-3801 Email: zabaras@cornell.edu

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods Pattern Recognition and Machine Learning Chapter 11: Sampling Methods Elise Arnaud Jakob Verbeek May 22, 2008 Outline of the chapter 11.1 Basic Sampling Algorithms 11.2 Markov Chain Monte Carlo 11.3 Gibbs

More information

Dynamic System Identification using HDMR-Bayesian Technique

Dynamic System Identification using HDMR-Bayesian Technique Dynamic System Identification using HDMR-Bayesian Technique *Shereena O A 1) and Dr. B N Rao 2) 1), 2) Department of Civil Engineering, IIT Madras, Chennai 600036, Tamil Nadu, India 1) ce14d020@smail.iitm.ac.in

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet. Stat 535 C - Statistical Computing & Monte Carlo Methods Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Introduction to Markov chain Monte Carlo The Gibbs Sampler Examples Overview of the Lecture

More information