Multivariate Output Analysis for Markov Chain Monte Carlo
Dootika Vats, School of Statistics, University of Minnesota
James M. Flegal, Department of Statistics, University of California, Riverside
Galin L. Jones, School of Statistics, University of Minnesota

December 23, 2015

Abstract

Markov chain Monte Carlo (MCMC) produces a correlated sample in order to estimate expectations with respect to a target distribution. A fundamental question is when should sampling stop so that we have good estimates of the desired quantities? The key to answering this question lies in assessing the Monte Carlo error through a multivariate Markov chain central limit theorem. However, the multivariate nature of this Monte Carlo error has largely been ignored in the MCMC literature. We provide conditions for consistently estimating the asymptotic covariance matrix via the multivariate batch means estimator. Based on this result, we present a relative standard deviation fixed-volume sequential stopping rule for terminating simulation. We show that this stopping rule is asymptotically equivalent to terminating when the effective sample size for the MCMC sample is above a desired threshold, giving an intuitive and theoretically justified approach to implementing the proposed method. The finite sample properties of the proposed method are demonstrated in examples.

(Research supported by the National Science Foundation. Research supported by the National Institutes of Health and the National Science Foundation.)

1 Introduction

Markov chain Monte Carlo (MCMC) algorithms are used to estimate expectations with respect to a probability distribution when independent sampling is difficult. Typically, interest is in estimating a vector of quantities such as means, moments, and quantiles associated with a given probability distribution. However, analysis of the MCMC output routinely focuses on a univariate Markov chain central limit theorem (CLT). Thus standard convergence diagnostics, sequential stopping rules for termination, effective sample size definitions, and confidence intervals are univariate in nature. We present a methodological framework for multivariate analysis of MCMC output.

Let F be a probability distribution with support X and g : X → R^p be an F-integrable function such that θ := E_F g is of interest. If {X_t} is an F-invariant Harris recurrent Markov chain, set {Y_t} = {g(X_t)} and estimate θ with θ_n = n^{-1} Σ_{t=1}^n Y_t, since θ_n → θ with probability 1 as n → ∞. Finite sampling leads to an unknown Monte Carlo error, θ_n − θ, and estimating this error is essential to assessing the quality of estimation. If, for δ > 0, g has 2 + δ moments under F and {X_t} is geometrically ergodic, an approximate sampling distribution for the Monte Carlo error is available via a Markov chain CLT. That is, there exists a p × p positive definite symmetric matrix, Σ, such that, as n → ∞,

  √n (θ_n − θ) →d N_p(0, Σ).  (1)

Assessment of the Monte Carlo error rests on estimating Σ. This, however, can be challenging since

  Σ = Var_F(Y_1) + 2 Σ_{k=1}^∞ Cov_F(Y_1, Y_{1+k}).

We consider the multivariate batch means (mbm) estimator of Σ and establish conditions for its strong consistency. Using the mbm estimator we develop multivariate confidence regions, multivariate sequential termination rules, and multivariate effective sample size. Thus we provide a methodological framework for multivariate MCMC output analysis.

Univariate output analysis methods are motivated by the univariate CLT

  √n (θ_{n,i} − θ_i) →d N(0, σ_i²),  (2)

as n → ∞, where θ_{n,i} and θ_i are the ith components of θ_n and θ, respectively, and σ_i² is the ith diagonal element of Σ. Flegal and Jones (2010), Hobert et al. (2002), and Jones et al. (2006) propose strongly consistent estimators of σ_i² using spectral variance methods, regenerative simulation, and batch means methods, respectively.

Consistent estimators of the asymptotic variance in independent sampling are used to sequentially terminate simulation when the half-width of the confidence interval for θ_i is below a tolerance level set a priori. In MCMC, this idea was formalized by Jones et al. (2006) via the fixed-width sequential stopping rule. For a desired tolerance of ε, simulation is terminated the first time, for all components,

  t_* σ_{n,i} / √n + ε I(n < n*) + n^{-1} ≤ ε,  (3)

where σ_{n,i}² is a strongly consistent estimator of σ_i², t_* is an appropriate t-distribution quantile, and n* is a minimum simulation effort specified by the user. The fixed-width sequential stopping rule laid the foundation for termination based on quality of estimation rather than convergence of the Markov chain (for a review of convergence diagnostics see Cowles and Carlin, 1996). Fixed-width procedures are aimed at terminating simulation when the estimation is reliable in the sense that if the procedure is repeated again, the estimates will not be vastly different (Flegal et al., 2008).

Implementing the fixed-width sequential stopping rule at (3) requires careful analysis
for choosing ε for each θ_{n,i}. Since this can be challenging for large p, Flegal and Gong (2015) proposed relative fixed-width sequential stopping rules that terminate simulation relative to attributes of the estimation process. Specifically, they proposed a relative standard deviation fixed-width sequential stopping rule, where the simulation is terminated relative to the uncertainty in the target distribution. That is, if λ_{n,i}² is the sample variance for the ith component of {Y_t}, simulation is terminated the first time, for all components,

  t_* σ_{n,i} / √n + ε λ_{n,i} I(n < n*) + n^{-1} ≤ ε λ_{n,i}.  (4)

Jones et al. (2006) and Flegal and Gong (2015) showed that the confidence intervals created at termination using fixed-width and relative standard deviation fixed-width sequential stopping rules are asymptotically valid, in that the confidence intervals created at termination have the right coverage probability as ε → 0 when σ_{n,i}² → σ_i² with probability 1 as n → ∞. Thus, strong consistency of the estimators of σ_i² is vital for valid inference using sequential termination rules.

However, these termination rules are univariate in that simulation is terminated when all components of θ_n individually satisfy the termination rule. Clearly, termination is dictated by the component that mixes slowest, and cross-correlations between components are ignored. In addition, corrections for multiple testing result in conservative confidence intervals, leading to delayed termination.

We propose a multivariate approach to termination called the relative standard deviation fixed-volume sequential stopping rule. If Λ_n is the sample covariance matrix of {Y_t} and |·| denotes determinant, then simulation is terminated the first time

  (Volume of Confidence Region)^{1/p} + ε |Λ_n|^{1/2p} I(n < n*) + n^{-1} < ε |Λ_n|^{1/2p}.
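As a minimal sketch of how the univariate rule (4) would be checked in practice, the following Python helper (our own translation; the paper's computations are done in R, and the function name and quantile choice here are illustrative) evaluates the componentwise inequality:

```python
import numpy as np
from scipy import stats

def rel_sd_fixed_width_done(n, sigma_hat, lambda_hat, eps, n_star, alpha=0.05):
    """Check rule (4) for all components: terminate when, for every i,
    t * sigma_hat[i]/sqrt(n) + eps*lambda_hat[i]*I(n < n_star) + 1/n
        <= eps * lambda_hat[i]."""
    sigma_hat = np.asarray(sigma_hat)    # batch means std. error estimates
    lambda_hat = np.asarray(lambda_hat)  # sample std. deviations
    t = stats.t.ppf(1 - alpha / 2, df=n - 1)  # illustrative quantile choice
    lhs = t * sigma_hat / np.sqrt(n) + eps * lambda_hat * (n < n_star) + 1.0 / n
    return bool(np.all(lhs <= eps * lambda_hat))
```

The indicator term forces the rule to fail before the minimum simulation effort n* is reached, exactly as in (3) and (4).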
Since the volume of the confidence region is proportional to the square root of the determinant of the estimate of Σ, an interpretation is that the simulation is terminated when the generalized variance of the Monte Carlo error is small relative to the estimated generalized variance with respect
to F; that is, the estimate of Σ is small compared to the estimate of Cov_F(Y_1). This is a multivariate generalization of the relative standard deviation fixed-width sequential stopping rule of Flegal and Gong (2015). We show that the relative standard deviation fixed-volume sequential stopping rule leads to asymptotically valid confidence regions. The bottleneck in demonstrating asymptotic validity is proposing estimators of Σ that are strongly consistent. To this end, we investigate the mbm estimator and provide conditions for strong consistency.

Another common way to terminate simulation is to continue until the effective sample size (ESS) is greater than some W, set a priori (see Drummond et al. (2006), Atkinson et al. (2008), and Giordano et al. (2015) for a few examples). ESS is the equivalent number of independent and identically distributed samples contained in the correlated sample. Terminating using ESS is intuitive since it is comparable to termination for independent sampling. Gong and Flegal (2015) showed that this method of termination is equivalent to terminating according to the relative standard deviation fixed-width stopping rule. However, we are unaware of any formal definition of multivariate ESS (mESS). Since the samples {Y_t} are vector-valued, it is common to calculate ESS for all components simultaneously. In Section 3.1, we introduce a definition of mESS and present a strongly consistent estimator, m̂ESS. We show that terminating according to the relative standard deviation fixed-volume sequential stopping rule is asymptotically equivalent to terminating when

  m̂ESS ≥ W_{p,α,ε},  (5)

where 1 − α represents the confidence level and ε is the precision specified in the relative standard deviation fixed-volume sequential stopping rule. Notice that W_{p,α,ε} is only a function of the dimension of the estimation problem and the specified precision. Thus the user can specify a precision, calculate W_{p,α,ε}, and simulate until (5) is satisfied.
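The threshold W_{p,α,ε} and the estimator m̂ESS are both cheap to compute. As a sketch (Python, with our own function names; the closed forms used here are the ones derived in Section 3.1):

```python
import numpy as np
from math import pi, gamma
from scipy import stats

def min_ess(p, alpha, eps):
    """W_{p,alpha,eps} via the chi-squared approximation:
    2^{2/p} * pi / (p * Gamma(p/2))^{2/p} * chi2_{1-alpha, p} / eps^2."""
    return (2 ** (2 / p) * pi / (p * gamma(p / 2)) ** (2 / p)
            * stats.chi2.ppf(1 - alpha, p) / eps ** 2)

def mess_hat(n, Lambda_n, Sigma_n):
    """Estimated multivariate ESS: n * (|Lambda_n| / |Sigma_n|)^{1/p}."""
    p = Lambda_n.shape[0]
    return n * (np.linalg.det(Lambda_n) / np.linalg.det(Sigma_n)) ** (1 / p)
```

A user would fix p, α, and ε, compute W = min_ess(p, alpha, eps) once, and then simulate until mess_hat exceeds W.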
This sequential process is the same as terminating when the relative standard deviation fixed-volume sequential stopping rule is satisfied.
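Putting the pieces together, the fixed-volume check itself can be sketched as follows (Python; a hypothetical helper of ours, using the confidence-region volume formula given later in Section 3 with a_n = n/b_n batches):

```python
import numpy as np
from math import pi, gamma
from scipy import stats

def fixed_volume_done(n, b, Sigma_n, Lambda_n, eps, n_star, alpha=0.10):
    """Relative standard deviation fixed-volume rule: terminate when
    Vol(C_alpha(n))^{1/p} + eps*|Lambda_n|^{1/2p}*I(n < n_star) + 1/n
        <= eps * |Lambda_n|^{1/2p}."""
    p = Sigma_n.shape[0]
    a = n // b                                    # number of batches
    F = stats.f.ppf(1 - alpha, p, a - p)
    vol = (2 * pi ** (p / 2) / (p * gamma(p / 2))          # unit-ball constant
           * (p * (a - 1) / (n * (a - p)) * F) ** (p / 2)  # squared radius^{p/2}
           * np.sqrt(np.linalg.det(Sigma_n)))
    rel = eps * np.linalg.det(Lambda_n) ** (1 / (2 * p))
    return bool(vol ** (1 / p) + rel * (n < n_star) + 1.0 / n <= rel)
```

In a sequential run, Sigma_n would be the multivariate batch means estimate of Section 2 and Lambda_n the sample covariance matrix of the output so far.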
Generally, our proposed mESS and relative standard deviation fixed-volume sequential stopping rule methods terminate earlier than simultaneous univariate methods, since termination is dictated by the overall Markov chain and not by the component that mixes the slowest. In addition, since multiple testing is no longer an issue, the volume of the confidence region is considerably smaller even in moderate-p problems. Using the inherent multivariate nature of the problem and acknowledging the existence of cross-correlation in estimating effective sample size leads to a more realistic understanding of the estimation process.

1.1 An Illustrative Example

For i = 1,…,K, let Y_i be a binary response variable and X_i = (x_{i1}, x_{i2},…, x_{i5}) be the observed predictors for the ith observation. Assume τ² is known,

  Y_i | X_i, β ~ind Bernoulli( 1 / (1 + e^{−X_i β}) ), and β ~ N_5(0, τ² I_5).  (6)

This simple hierarchical model results in an intractable posterior, F, on R^5. The dataset used is the logit dataset in the mcmc R package. The goal is to estimate the posterior mean of β, E_F β. Thus g here is the identity function mapping to R^5. We implement a random walk Metropolis–Hastings algorithm with a multivariate normal proposal distribution N_5(·, 0.35 I_5), where I_5 is the 5 × 5 identity matrix and the 0.35 scaling ensures an optimal acceptance probability as suggested by Roberts et al. (1997). We obtain an MCMC sample of size 10^5 and calculate the Monte Carlo estimate for E_F β. The starting value for β is a random draw from the prior distribution.

The covariance matrix Σ is estimated using the mbm estimator described in Section 2. We also implement the univariate batch means (ubm) methods described in Jones et al. (2006) to estimate σ_i², which capture the autocorrelation in each component while ignoring the cross-correlation. This cross-correlation is often significant, as exhibited in Figure 1, and can only be captured by multivariate methods like mbm. In Figure 2 we
[Figure 1: Autocorrelation plot for β_1, cross-correlation plot between β_1 and β_3, and trace plots for β_1 and β_3. The cross-correlation is ignored by univariate methods. Monte Carlo sample size = 10^5.]

present 90% confidence regions created using mbm and ubm for β_1 and β_3 (for the purposes of this figure, we treat this as a problem in 2 dimensions). The boxes were created using ubm, both uncorrected and corrected with Bonferroni, and the ellipse was constructed using the mbm estimator.

To assess the confidence regions, we verify their coverage probabilities over 1000 independent replications with Monte Carlo sample sizes in {10^4, 10^5, 10^6}. The true posterior mean, (0.5706, …), is obtained by taking the mean over 10^9 iterations. For each of the 1000 replications, it was noted whether the true posterior mean was present in the confidence region. In addition, the volume of the confidence region to the pth root was also observed. Table 1 summarizes the results. Note that though the uncorrected univariate methods provide the smallest confidence regions, their coverage probabilities are far from desirable. For a large enough Monte Carlo sample size, mbm produces 90% coverage probabilities with a systematically lower volume than ubm corrected with Bonferroni (ubm-bonferroni). This conclusion agrees with Figure 2, where, due to the elliptical structure, mbm ignores a large chunk of area
[Figure 2: Joint 90% confidence region for β_1 and β_3. The ellipse is generated using the mbm estimator, the dotted line using the uncorrected univariate BM estimator, and the dashed line using the univariate BM estimator corrected by Bonferroni. Monte Carlo sample size = 10^5.]

that is contained in the ubm-bonferroni confidence region.

This example shows that even simple MCMC problems produce complex dependence structures within and across components of the samples. Ignoring this dependence structure leads to an incomplete understanding of the estimation process. Not only do we gain more information about the Monte Carlo error using multivariate methods, but we avoid using conservative Bonferroni methods. It is evident that termination rules based on the volume of confidence regions would generally terminate earlier for multivariate methods while extracting more information from the samples.

The rest of the paper is organized as follows. In Section 2 we define the mbm estimator and provide conditions for strong consistency. In Section 3 we formally introduce a general class of relative fixed-volume sequential termination rules. In addition, we introduce a multivariate definition of ESS and relate it to sequential termination rules. In Section 4 we continue our implementation of the Bayesian logistic regression model and consider additional examples. We choose a vector autoregressive process of order 1, where convergence of the process can be manipulated. Specifically, we construct the process in such a way that one component mixes slowly, while the others are fairly well behaved. Such behavior is known to occur often in hierarchical models with priors
on the variance components. The next example is that of a Bayesian lasso, where the posterior is in 51 dimensions. We also implement our output analysis methods for a fairly complicated Bayesian dynamic spatial-temporal model. We conclude with a discussion in Section 5.

Table 1: Volume to the pth (p = 5) root and coverage probabilities for 90% confidence regions constructed using mbm, ubm uncorrected, and ubm corrected for Bonferroni. Replications = 1000; standard errors are indicated in parentheses.

  Volume to the pth root
  n     mbm            ubm-bonferroni   ubm
  1e4   … (7.94e-05)   … (9.23e-05)     … (6.48e-05)
  1e5   … (1.20e-05)   … (1.42e-05)     … (1.00e-05)
  1e6   … (1.70e-06)   … (2.30e-06)     … (1.60e-06)

  Coverage Probabilities
  1e4   … (0.0104)     … (0.0099)       … (0.0155)
  1e5   … (0.0103)     … (0.0090)       … (0.0156)
  1e6   … (0.0097)     … (0.0094)       … (0.0153)

2 Multivariate Batch Means Estimator

Let n = a_n b_n, where a_n is the number of batches and b_n is the batch size. For k = 0,…, a_n − 1, define

  Ȳ_k := b_n^{-1} Σ_{t=1}^{b_n} Y_{k b_n + t}.

Then Ȳ_k is the mean vector for batch k, and the mbm estimator of Σ is given by

  Σ_n = (b_n / (a_n − 1)) Σ_{k=0}^{a_n − 1} (Ȳ_k − θ_n)(Ȳ_k − θ_n)^T.  (7)

When Y_t is univariate, the batch means estimator has been well studied for MCMC problems (Flegal and Jones, 2010; Jones et al., 2006) and for steady-state simulations (Damerdji, 1991; Glynn and Iglehart, 1990; Glynn and Whitt, 1991). Glynn and Whitt (1991) showed that the batch means estimator cannot be consistent for fixed batch size b_n. Damerdji (1991, 1995), Jones et al. (2006), and Flegal and Jones (2010) established its asymptotic properties, including strong consistency and mean square consistency,
when both the batch size and number of batches increase with n. The multivariate extension in (7) was first introduced by Chen and Seila (1987). For steady-state simulation output, Charnes (1995) and Muñoz and Glynn (2001) studied confidence regions for θ based on the mbm; however, the theoretical properties of the mbm remain unexplored. In Theorem 1, we present conditions for strong consistency of Σ_n in estimating Σ for MCMC, but our results hold for more general processes. Our main assumption on the process is that of a strong invariance principle.

Condition 1. Let ‖·‖ denote the Euclidean norm and {B(t), t ≥ 0} be a p-dimensional multivariate Brownian motion. There exists a nonnegative increasing function γ on the positive integers, a sufficiently rich probability space Ω, an n_0 ∈ N, a p × p lower triangular matrix L, and a finite random variable D such that, for almost all ω ∈ Ω and for all n > n_0,

  ‖n(θ_n − θ) − L B(n)‖ < D γ(n)  w.p. 1 as n → ∞.  (8)

A strong law, CLT, and functional central limit theorem (FCLT) are consequences of Condition 1, which holds for independent processes (Berkes and Philipp, 1979; Einmahl, 1989; Zaitsev, 1998), martingale sequences (Eberlein, 1986), renewal processes (Horvath, 1984), and for φ-mixing and strongly mixing sequences (Dehling and Philipp, 1982; Kuelbs and Philipp, 1980). Using the results of Kuelbs and Philipp (1980), Vats et al. (2015) established Condition 1 with γ(n) = n^{1/2 − λ}, λ > 0, for polynomially ergodic Markov chains.

As in the univariate case, the batch means estimator can be consistent only if the batch size increases with n.

Condition 2. The batch size b_n is an integer sequence such that b_n → ∞ and n/b_n → ∞ as n → ∞, where b_n and n/b_n are monotonically increasing.

Condition 3. There exists a constant c ≥ 1 such that Σ_n (b_n n^{-1})^c < ∞.

In Theorem 1, we establish strong consistency of Σ_n. The proof is given in Appendix A.1.
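Equation (7) is straightforward to implement. The following Python sketch (our own helper, not taken from an existing package) computes the mbm estimate from an n × p matrix of output:

```python
import numpy as np

def mbm(Y, b):
    """Multivariate batch means estimator (7).
    Y: (n, p) array with rows Y_t = g(X_t); b: batch size b_n."""
    n, p = Y.shape
    a = n // b                       # number of batches a_n
    Y = Y[:a * b]                    # drop an incomplete final batch
    batch_means = Y.reshape(a, b, p).mean(axis=1)   # the vectors Ybar_k
    centered = batch_means - Y.mean(axis=0)          # subtract theta_n
    return b / (a - 1) * centered.T @ centered
```

For independent output the result should be close to the sample covariance; for correlated output its entries are inflated by the auto- and cross-correlations, which is exactly what (7) is designed to capture.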
Theorem 1. Let X be a geometrically ergodic Markov chain with invariant distribution F and let g be an F-measurable real-valued function such that E_F ‖g‖^{2+δ} < ∞. Then (8) holds with γ(n) = n^{1/2 − λ} for some λ > 0. Let Conditions 2 and 3 hold. If b_n^{-1/2} (log n)^{1/2} n^{1/2 − λ} → 0 as n → ∞, then Σ_n → Σ with probability 1 as n → ∞.

Remark 1. The theorem holds more generally outside the context of Markov chains for processes that satisfy Condition 1. The general statement of the theorem is provided in Appendix A.1.

Remark 2. The value of λ depends on the mixing time of the Markov chain. For slowly mixing Markov chains λ is closer to 0, and for fast mixing chains λ is closer to 1/2.

Remark 3. It is natural to consider b_n = ⌊n^ν⌋ for 0 < ν < 1. Then ν > 1 − 2λ is required to satisfy b_n^{-1/2} (log n)^{1/2} n^{1/2 − λ} → 0 as n → ∞. Hence, for a fast mixing process (λ close to 1/2), smaller batch sizes suffice, and similarly, slow mixing processes require larger batch sizes. This reinforces our intuition that higher correlation calls for larger batch sizes.

Remark 4. Since Σ is non-singular, Σ_n should be non-singular, which requires a_n > p. Thus larger Monte Carlo sample sizes will be required for higher dimensional estimation problems.

An immediate consequence of the strong consistency of the mbm estimator is the convergence of its eigenvalues. The proof follows from Theorem 2 in Vats et al. (2015).

Corollary 1. Let λ_1 > λ_2 > … > λ_p > 0 be the eigenvalues of Σ, and let λ̂_1 > λ̂_2 > … > λ̂_p be the eigenvalues of Σ_n. Then, under the conditions of Theorem 1, λ̂_k → λ_k w.p. 1 as n → ∞ for all 1 ≤ k ≤ p.

3 Termination Rules

We consider multivariate sequential termination rules that lead to asymptotically valid confidence regions. Let F_{1−α, p, a_n−p} denote the 1 − α quantile of a central F distribution with p numerator degrees of freedom and a_n − p denominator degrees of freedom. A
100(1 − α)% confidence region for θ is the set

  C_α(n) = { θ ∈ R^p : n (θ_n − θ)^T Σ_n^{-1} (θ_n − θ) < (p(a_n − 1)/(a_n − p)) F_{1−α, p, a_n−p} }.

Then C_α(n) forms an ellipsoid in p dimensions oriented along the directions of the eigenvectors of Σ_n. Letting |·| denote determinant, the volume of C_α(n) is

  Vol(C_α(n)) = (2 π^{p/2} / (p Γ(p/2))) ( (p(a_n − 1) / (n(a_n − p))) F_{1−α, p, a_n−p} )^{p/2} |Σ_n|^{1/2}.  (9)

Since p is fixed and, as n → ∞, Σ_n → Σ with probability 1, Vol(C_α(n)) → 0 with probability 1 as n → ∞. If ε > 0 and s(n) is a positive real-valued function defined on the positive integers, then a fixed-volume sequential stopping rule terminates the simulation at the random time

  T(ε) = inf{ n ≥ 0 : Vol(C_α(n))^{1/p} + s(n) ≤ ε }.  (10)

Glynn and Whitt (1992) showed that terminating at T(ε) yields confidence regions that are asymptotically valid in that, as ε → 0, Pr[θ ∈ C_α(T(ε))] → 1 − α if (a) an FCLT holds, (b) s(n) = o(n^{-1/2}) as n → ∞, and (c) Σ_n → Σ with probability 1 as n → ∞. An FCLT holds under Condition 1, and our Theorem 1 establishes consistency of mbm. Next, (b) will be satisfied if we set s(n) = ε I(n < n*) + n^{-1}, which ensures simulation does not terminate before n* iterations. Also, n* should satisfy a_n > p so that Σ_n is positive definite; recall Remark 4.

The sequential stopping rule (10) can be difficult to implement in practice since the choice of ε depends on the units of θ and has to be carefully chosen for every application. We consider a generalization of (10) which can be used more naturally and which, we will show, connects nicely to the idea of a multivariate effective sample size. Let K(Y, p) > 0 be an attribute of the estimation process and suppose K_n(Y, p) > 0 is an estimator of K(Y, p); for example, take θ = K(Y, p) and θ_n = K_n(Y, p). Set
s(n) = ε K_n(Y, p) I(n < n*) + n^{-1} and define

  T(ε) = inf{ n ≥ 0 : Vol(C_α(n))^{1/p} + s(n) ≤ ε K_n(Y, p) }.  (11)

The following result establishes asymptotic validity of this termination rule. The proof is provided in Appendix A.2.

Theorem 2. Suppose an FCLT holds. If K_n(Y, p) → K(Y, p) w.p. 1 and Σ_n → Σ w.p. 1 as n → ∞, then, as ε → 0, T(ε) → ∞ and Pr[θ ∈ C_α(T(ε))] → 1 − α.

Suppose K(Y, p) = |Λ|^{1/2p} := |Cov_F Y_1|^{1/2p} and, if Λ_n is the usual sample covariance matrix for {Y_t}, set K_n(Y, p) = |Λ_n|^{1/2p}. Then K_n(Y, p) → K(Y, p) with probability 1 as n → ∞, and T(ε) is the first time the uncertainty in estimation (measured via the volume of the confidence region) is an εth fraction of the uncertainty in the target distribution. The relative standard deviation fixed-volume sequential stopping rule is formalized as terminating at

  T_SD(ε) = inf{ n ≥ 0 : Vol(C_α(n))^{1/p} + ε |Λ_n|^{1/2p} I(n < n*) + n^{-1} ≤ ε |Λ_n|^{1/2p} }.  (12)

Note that Λ_n is positive definite as long as n is larger than p, so |Λ_n|^{1/2p} > 0.

3.1 Relation to Effective Sample Size

Kass et al. (1998), Liu (2008), and Robert and Casella (2013) define ESS for the ith component of the process as

  ESS_i = n / ( 1 + 2 Σ_{k=1}^∞ ρ(Y_1^{(i)}, Y_{1+k}^{(i)}) ),

where ρ(Y_1^{(i)}, Y_{1+k}^{(i)}) = ρ_i(k) is the lag-k correlation for the ith component of Y. It is challenging to estimate ρ_i(k) consistently, especially for larger k. Alternatively, Gong
and Flegal (2015) rewrite the above definition as

  ESS_i = n λ_i² / σ_i²,

where σ_i² is the ith diagonal element of Σ and λ_i² is the ith diagonal element of Λ. Using this formulation, a consistent estimate of ESS_i is obtained by using consistent estimators of λ_i² and σ_i², via the sample variance and univariate batch means estimators, respectively. The R package coda estimates ESS_i by using the sample variance to estimate λ_i², but estimates σ_i² by determining the spectral density at frequency zero of an approximating autoregressive process. Thus, it does not estimate σ_i² directly. The R package mcmcse uses the univariate batch means method to estimate σ_i². The estimate of ESS_i is then

  ÊSS_i = n λ̂_{n,i}² / σ_{n,i}²,

where λ̂_{n,i}² is the sample variance and σ_{n,i}² is the batch means estimator. Then ÊSS_i is a consistent estimate of the univariate ESS_i. However, in both of these packages, ESS_i is estimated for each component separately, and a conservative estimate of the overall ESS is taken to be the minimum of all ESS_i. This again leads to the situation where the estimate of ESS is dictated by the component that mixes the slowest, while ignoring the behavior of the other components.

A natural definition of multivariate ESS is

  mESS = n ( |Λ| / |Σ| )^{1/p}.  (13)

When p = 1, mESS reduces to the form of ESS presented above. In (13), we replace the concept of variance with generalized variance. For a random vector, the determinant of its covariance matrix is called its generalized variance. This was first introduced by Wilks (1932) as a univariate measure of spread for a multivariate distribution. Wilks
(1932) recommended the use of the pth root of the generalized variance. This was formalized by SenGupta (1987) as the standardized generalized variance, in order to compare variability over different dimensions. Due to the strong consistency of mbm established in Theorem 1, a strongly consistent estimator of mESS is

  m̂ESS = n ( |Λ_n| / |Σ_n| )^{1/p}.  (14)

Rearranging the defining inequality in (12) yields that, when n ≥ n*,

  m̂ESS ≥ (1/ε²) [ (2 π^{p/2} / (p Γ(p/2)))^{1/p} ( (p(a_n − 1)/(a_n − p)) F_{1−α, p, a_n−p} )^{1/2} + n^{-1/2} |Σ_n|^{-1/2p} ]².

Thus, the relative standard deviation fixed-volume sequential stopping rule is equivalent to terminating the first time m̂ESS is larger than a lower bound. This lower bound is a function of n and thus is difficult to determine before starting the simulation. However, as n → ∞, the scaled F distribution converges to a χ² distribution and the second term vanishes, leading to the approximation

  m̂ESS ≥ (2^{2/p} π / (p Γ(p/2))^{2/p}) (χ²_{1−α, p} / ε²).

Thus, in order to determine termination, a user can a priori decide to simulate until W effective samples are obtained. This choice of W will correspond to some α and ε and thus be an asymptotically valid termination rule. The choice of W should be made keeping in mind that the samples obtained are only approximately from the invariant distribution F, and this does not change in finite time. Thus the choice of W should be conservative to begin with and be bigger for larger p.

Example 1. Suppose p = 5 (as in the logistic regression setting of Section 1.1) and that we want a precision of ε = .05 (so the desired computational uncertainty is a .05th fraction of the uncertainty in the target distribution) for a 95% confidence region. This
requires m̂ESS of approximately 8605 or more. On the other hand, if we simulate until m̂ESS = 10000, we obtain a precision of ε ≈ 0.046.

3.2 Univariate Termination Rules

In this section we formally present the univariate termination rules we will implement in Section 4 as a comparison to the multivariate rules. Recall that, for the ith component of θ, θ_{n,i} is the Monte Carlo estimate for θ_i, and σ_i² is the asymptotic variance in the univariate CLT. Let σ_{n,i}² be the univariate batch means estimator of σ_i² and let λ_{n,i}² be the sample variance for the ith component. Common practice is to terminate simulation when all components satisfy a termination rule. We focus on the relative standard deviation fixed-width sequential stopping rules. Due to multiple testing, a Bonferroni correction is often used. We will refer to the uncorrected method as the ubm method and the corrected method as ubm-bonferroni.

To create 100(1 − α)% univariate uncorrected confidence intervals, the relative standard deviation fixed-width rule terminates the first time

  max_{i=1,…,p} { λ_{n,i}^{-1} ( t_{1−α/2, a_n−1} σ_{n,i} n^{-1/2} + ε λ_{n,i} I(n < n*) + n^{-1} ) } ≤ ε.

To create 100(1 − α)% univariate corrected confidence intervals, the relative standard deviation fixed-width rule terminates the first time

  max_{i=1,…,p} { λ_{n,i}^{-1} ( t_{1−α/(2p), a_n−1} σ_{n,i} n^{-1/2} + ε λ_{n,i} I(n < n*) + n^{-1} ) } ≤ ε.

Remark 5. If it is unclear which features of F are of interest a priori, i.e., g is not known before simulation, one can implement Scheffé's simultaneous confidence intervals. This method can also be used to create univariate confidence intervals for arbitrary contrasts a posteriori. Scheffé's simultaneous intervals are boxes on the coordinate axes that
bound the confidence ellipsoids. For all a^T θ, where a ∈ R^p, the intervals

  a^T θ_n ± ( (a^T Σ_n a) (p(a_n − 1) / (n(a_n − p))) F_{1−α, p, a_n−p} )^{1/2}

will have simultaneous coverage probability 1 − α. Since we generally are not interested in all possible linear combinations, Scheffé's intervals are often too conservative. In fact, if the number of confidence intervals is less than p, Scheffé's intervals are more conservative than Bonferroni corrected intervals.

4 Examples

In each of the following examples we present a target distribution F and a Markov chain with F as its invariant distribution, we specify g, and we are interested in estimating E_F g. We consider the finite sample performance (based on 1000 independent replications) of the relative standard deviation fixed-volume sequential stopping rules. In each case we use 90% confidence regions and various choices of ε, and we specify our choices of n* and b_n.

4.1 Bayesian Logistic Regression

We continue the Bayesian logistic regression example introduced in Section 1.1. Recall that a random walk Metropolis–Hastings algorithm was implemented to sample from the intractable posterior. We prove the chain is geometrically ergodic in Appendix A.3.

Theorem 3. The random walk based Metropolis–Hastings algorithm with invariant distribution given by the posterior from (6) is geometrically ergodic.

As a consequence of Theorem 3 and the fact that F has a moment generating function, an FCLT and thus a CLT holds. This ensures the validity of Theorems 1 and 2. Motivated by the ACF plot in Figure 1, b_n was set to n^{1/2} and a minimum simulation effort n* was imposed. For calculating coverage probabilities, we declare the truth to be the posterior mean from a long independent simulation. The results are presented in Table 2. As before,
Table 2: Bayesian logistic regression: over 1000 replications, we present termination iterations, effective sample size at termination, and coverage probabilities at termination for each corresponding method. Standard errors are in parentheses.

  Termination Iteration
  ε       mbm          ubm-bonferroni   ubm
  0.05    … (196)      … (391)          … (213)
  0.02    … (1158)     … (1880)         … (1036)
  0.01    … (1837)     … (7626)         … (3150)

  Effective Sample Size
  0.05    … (9)        9270 (13)        4643 (7)
  0.02    … (51)       … (65)           … (36)
  0.01    … (110)      … (271)          … (116)

  Coverage Probabilities
  0.05    … (0.0099)   … (0.0091)       … (0.0157)
  0.02    … (0.0097)   … (0.0090)       … (0.0155)
  0.01    … (0.0098)   … (0.0097)       … (0.0155)

the univariate uncorrected method has less than desired coverage probabilities. For ε = 0.02 and 0.01, the coverage probabilities for both the mbm and ubm-bonferroni regions are at 90%. However, termination for mbm is significantly earlier. This is a consequence of two things. First, accounting for the multivariate nature of the problem allows us to capture more information from a smaller sample; this is reflected in the estimates of ESS. Second, using a Bonferroni correction gives larger confidence regions, leading to delayed termination.

Recall that, due to Theorem 2, as ε decreases to zero, the coverage probability of confidence regions created at termination using the relative standard deviation fixed-volume sequential stopping rule converges to the nominal level. This is demonstrated in Figure 3, where we present the coverage probability over 1000 replications as ε decreases. Notice that the increase in coverage probabilities need not be monotonic due to the randomness of the underlying process.

To illustrate the convergence of the eigenvalues of the mbm estimator, we plot the condition number of the estimate under the spectral norm. That is, we report λ̂_max / λ̂_min
[Figure 3: Bayesian logistic regression; plot of coverage probability with confidence bands as ε decreases, at the 90% nominal rate. Replications = 1000.]

[Figure 4: Bayesian logistic regression; plot of the condition number under the spectral norm with confidence bands versus Monte Carlo sample size. Replications = 1000.]

over varying Monte Carlo sample sizes, and present the plot in Figure 4. In Figure 4, the condition number for the mbm estimator decreases with an increase in sample size, indicating better conditioning at large simulation sizes.

4.2 Vector Autoregressive Process

Consider the vector autoregressive process of order 1 (VAR(1)). For t = 1, 2, …,

  Y_t = Φ Y_{t−1} + ε_t,
where Y_t ∈ R^p, Φ is a p × p matrix, and ɛ_t iid ∼ N_p(0, Ω), where Ω is a p × p positive definite symmetric matrix. The matrix Φ determines the nature of the autocorrelation. This Markov chain has invariant distribution F = N_p(0, V), where V satisfies vec(V) = (I_{p²} − Φ ⊗ Φ)^{−1} vec(Ω), with ⊗ denoting the Kronecker product, and is geometrically ergodic when the spectral radius of Φ is less than 1 (Tjøstheim, 1990). Consider the goal of estimating the mean of F, E_F Y = 0, with Ȳ_n. Then

Σ = (I − Φ)^{−1} V + V (I − Φ^T)^{−1} − V.

Let p = 5, Φ = diag(0.9, 0.5, 0.1, 0.1, 0.1), and let Ω be the AR(1) covariance matrix with autocorrelation 0.9. Since the first eigenvalue of Φ is large, the first component mixes slowest. This component will dictate the primary orientation of the confidence ellipsoids.

We sample the process for 10^5 iterations and in Figure 5 present the autocorrelation (ACF) plots for Y^(1) and Y^(3) and the cross-correlation (CCF) plot between Y^(1) and Y^(3), in addition to the trace plot for Y^(1). Notice that Y^(1) has larger significant lags than Y^(3), and there is significant cross-correlation between Y^(1) and Y^(3). Figure 6 displays joint confidence regions for Y^(1) and Y^(3). Recall that the true mean is (0, 0); it is present in all three regions, but the ellipse produced by mbm has significantly smaller volume than the ubm boxes. The orientation of the ellipse is determined by the cross-correlations seen in Figure 5.

We set n = 1000, batch size b_n = ⌊n^{1/3}⌋, and ɛ ∈ {0.05, 0.02, 0.01}, and at termination of each method calculate the coverage probabilities and effective sample size. Results are presented in Table 3. Note that as ɛ decreases, termination time increases and coverage probabilities tend to the 90% nominal level for each method (due to asymptotic validity). Also note that the uncorrected univariate method produces confidence regions with undesirable coverage probabilities and thus is not of interest. Consider ɛ = 0.02 in Table 3.
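As a numerical check on the formulas above, the stationary covariance V and the asymptotic covariance Σ for this VAR(1) setup can be computed directly. This is a sketch in NumPy using the example's stated values; the variable names are ours.

```python
import numpy as np

# Setup from the example: p = 5, Phi = diag(.9, .5, .1, .1, .1),
# Omega an AR(1) covariance matrix with autocorrelation 0.9.
p = 5
phi = np.diag([0.9, 0.5, 0.1, 0.1, 0.1])
idx = np.arange(p)
omega = 0.9 ** np.abs(idx[:, None] - idx[None, :])

# Stationary covariance: vec(V) = (I_{p^2} - Phi (x) Phi)^{-1} vec(Omega)
V = np.linalg.solve(np.eye(p * p) - np.kron(phi, phi),
                    omega.ravel()).reshape(p, p)

# Sanity check: V solves the discrete Lyapunov equation V = Phi V Phi^T + Omega
assert np.allclose(V, phi @ V @ phi.T + omega)

# Asymptotic covariance: Sigma = (I - Phi)^{-1} V + V (I - Phi^T)^{-1} - V
I = np.eye(p)
Sigma = np.linalg.solve(I - phi, V) + V @ np.linalg.inv(I - phi.T) - V
```

Because Φ is diagonal here, the slow first component dominates: its entry of Σ is far larger than the others, which is exactly why the first component dictates the orientation of the confidence ellipsoids.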
Termination for mbm is at 8.8e4 iterations compared to 9.6e5 for ubm-Bonferroni. However, the estimate of multivariate ESS at 8.8e4 iterations is 4.7e4 samples, compared to a univariate ESS of 5.6e4 samples at 9.6e5 iterations. This is because the leading component Y^(1) mixes much more slowly than the other components and defines the behavior of the univariate ESS. A small study presented in Table 4 elaborates on this behavior. Over 100 replications of Monte Carlo sample sizes 10^5 and 10^6, we present the mean estimate of ESS using multivariate and univariate methods. The estimate of ESS for the first component is significantly smaller than for all other components, leading to a conservative univariate estimate of ESS.

Figure 5: VAR: ACF plots for Y^(1) and Y^(3) and the CCF plot between Y^(1) and Y^(3), in addition to the trace plot for Y^(1). The observed cross-correlation is completely ignored by univariate methods. Monte Carlo sample size is 10^5.

4.3 Bayesian Lasso

Let y be a K × 1 response vector and X be a K × q matrix of predictors. We consider the following Bayesian Lasso formulation of Park and Casella (2008):

y | β, σ², τ² ∼ N_K(Xβ, σ² I_K)
β | σ², τ² ∼ N_q(0, σ² D_τ), where D_τ = diag(τ₁², τ₂², ..., τ_q²)
Figure 6: VAR: Joint 90% confidence region for the first two components of Y. The solid ellipse is generated using the mbm estimator, the dotted box using the uncorrected ubm estimator, and the dashed box using the ubm estimator corrected by Bonferroni. Monte Carlo sample size is 10^5.

σ² ∼ Inverse-Gamma(α, ξ)
τ_j² iid ∼ Exponential(λ²/2) for j = 1, ..., q,

where λ, α, and ξ are fixed, and the Inverse-Gamma(a, b) density is proportional to x^{−a−1} e^{−b/x}. We use a deterministic scan Gibbs sampler to draw approximate samples from the posterior; see Khare and Hobert (2013) for a full description of the algorithm. Khare and Hobert (2013) showed that for K ≥ 3, this Gibbs sampler is geometrically ergodic for arbitrary q, X, and λ.

We fit this model to the cookie dough dataset of Osborne et al. (1984). The data were collected to test the feasibility of near infra-red (NIR) spectroscopy for measuring the composition of biscuit dough pieces. There are 72 observations; the response variable is taken as the amount of dry flour content measured, and the predictor variables are 25 measurements of spectral data spaced equally between 1100 and 2498 nanometers. Since we are interested in estimating the posterior mean of (β, τ², σ²), p = 51. The data are available in the R package ppls, and the Gibbs sampler is implemented in the function blasso in the R package monomvn. The truth was declared by averaging posterior means from 1000 parallel chains of length 1e6. We set n = 2e4 and batch size b_n = ⌊n^{1/3}⌋.
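The batch sizes quoted in these examples (b_n = ⌊n^{1/3}⌋ here, ⌊n^{1/2}⌋ later) enter the mbm estimator directly. The following is a minimal NumPy sketch of the standard batch means formula, not the mcmcse implementation; the function name and array layout are ours.

```python
import numpy as np

def mbm(chain, batch_exponent=1 / 3):
    """Multivariate batch means (mBM) estimate of the asymptotic covariance.

    chain: (n, p) array of MCMC output, one row per iteration.
    Batch size is b = floor(n^batch_exponent), matching b_n = n^{1/3} above.
    """
    n, p = chain.shape
    b = int(np.floor(n ** batch_exponent))  # batch size b_n
    a = n // b                              # number of batches a_n
    used = chain[: a * b]                   # drop any incomplete final batch
    batch_means = used.reshape(a, b, p).mean(axis=1)
    centered = batch_means - used.mean(axis=0)
    # Sigma_hat = b/(a-1) * sum_k (Ybar_k - Ybar)(Ybar_k - Ybar)^T
    return b * (centered.T @ centered) / (a - 1)
```

For the Lasso example, storing the Gibbs output as an n × 51 array and calling mbm on it yields a 51 × 51 estimate; note that positive definiteness requires more batches than components (a_n > p).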
Table 3: VAR: Over 1000 replications, we present termination iterations, effective sample size at termination, and coverage probabilities at termination for the mbm, ubm-Bonferroni, and ubm methods at ɛ = 0.05, 0.02, and 0.01. Standard errors are in parentheses.

Table 4: VAR: Effective sample size (ESS) estimated using the proposed multivariate method and the univariate method of Gong and Flegal (2015) for Monte Carlo sample sizes of n = 1e5 and 1e6 over 100 replications. Standard errors are in parentheses.

In Table 5 we present termination results from 1000 replications. With p = 51, the uncorrected univariate method produces confidence regions with very low coverage probabilities. The ubm-Bonferroni and mbm methods provide competitive coverage probabilities at termination. However, termination for mbm is significantly earlier than for the univariate methods over all values of ɛ. For ɛ = 0.05 and 0.02 we observe zero standard error for termination using mbm, since termination is achieved at the same 10% increment over all 1000 replications; thus the variability in those estimates is less than 10% of the size of the estimate. Aside from the conservative multiple testing correction, delayed termination for the univariate methods is also due to conservative univariate estimates of ESS. Figure 7
Table 5: Bayesian Lasso: Over 1000 replications, we present termination iterations, effective sample size at termination, and coverage probabilities at termination for the mbm, ubm-Bonferroni, and ubm methods at ɛ = 0.05, 0.02, and 0.01. Standard errors are in parentheses.

shows the ESS using multivariate and univariate methods at termination over 1000 replications for ɛ = 0.02. The multivariate ESS at termination is more conservative than the univariate ESS at ubm-Bonferroni termination, but the number of iterations to termination is significantly lower for mbm.

4.4 Bayesian Dynamic Spatial-Temporal Model

Gelfand et al. (2005) propose a Bayesian hierarchical model for univariate and multivariate dynamic spatial data, viewing time as discrete and space as continuous. The methods in their paper are implemented in the R package spBayes. We present a simpler version of the dynamic model, as described by Finley et al. (2015). Let s = 1, 2, ..., N_s index location sites and t = 1, 2, ..., N_t index time points. Let the observed measurement at location s and time t be denoted by y_t(s). In addition, let x_t(s) be the q × 1 vector of predictors observed at location s and time t, and let β_t be the
Figure 7: Bayesian Lasso: Plot of termination iteration and ESS for mbm, ubm-Bonferroni, and ubm for ɛ = 0.02. Replications = 1000.

q × 1 vector of coefficients. For t = 1, 2, ..., N_t,

y_t(s) = x_t(s)^T β_t + u_t(s) + ɛ_t(s),   ɛ_t(s) ind ∼ N(0, τ_t²);   (15)
β_t = β_{t−1} + η_t,   η_t iid ∼ N(0, Σ_η);
u_t(s) = u_{t−1}(s) + w_t(s),   w_t(s) ind ∼ GP(0, σ_t² ρ(·; φ_t)),   (16)

where GP(0, σ_t² ρ(·; φ_t)) denotes a spatial Gaussian process with covariance function σ_t² ρ(·; φ_t). Here, σ_t² denotes the spatial variance component and ρ(·; φ_t) is the correlation function with exponential decay. Equation (15) is referred to as the measurement equation, and ɛ_t(s) denotes the measurement error, assumed to be independent over location and time. Equation (16) contains the transition equations, which capture the Markovian nature of the dependence in time. To complete the Bayesian hierarchy, the following priors are assumed:

β_0 ∼ N(m_0, C_0) and u_0(s) ≡ 0;
τ_t² ∼ IG(a_τ, b_τ) and σ_t² ∼ IG(a_s, b_s);
Σ_η ∼ IW(a_η, B_η) and φ_t ∼ Unif(a_φ, b_φ),

where IW denotes the inverse-Wishart distribution with density proportional to |Σ_η|^{−(a_η + q + 1)/2} e^{−tr(B_η Σ_η^{−1})/2}, and IG(a, b) is the inverse-Gamma distribution with density proportional to x^{−a−1} e^{−b/x}.

We fit the model to the NETemp dataset in the spBayes package. This dataset contains monthly temperature measurements from 356 weather stations on the east coast of the USA, collected from January 2000 to December. The elevation of the weather stations is also available as a covariate. We choose a subset of the data with 10 weather stations for the year 2000, and fit the model with an intercept. The resulting posterior has p = 185 components. A conditional Metropolis-Hastings sampler is described in Gelfand et al. (2005) and implemented in the spDynLM function. Default hyperparameter settings were used. The rate of convergence of this sampler has not been studied; thus we do not know whether the sampler is geometrically ergodic.

Our goal is to estimate the posterior expectation of θ = (β_t, u_t(s), σ_t², Σ_η, τ_t², φ_t). For the calculation of coverage probabilities, 1000 parallel runs of length 2e6 were averaged and declared as the truth. We set batch size b_n = ⌊n^{1/2}⌋ and n = 5e4 so that the number of batches satisfies a_n > p, ensuring positive definiteness of Σ_n.

Due to the Markovian transition equations in (16), the β_t and u_t exhibit significant covariance structure in the posterior distribution. This is evident in Figure 8, where, for Monte Carlo sample size n = 10^5, we present confidence regions for β_1^(0) and β_2^(0) (the intercept coefficients for the first and second months) and for u_1(1) and u_2(1) (the additive spatial coefficients for the first two weather stations). The thin ellipses indicate that the principal direction of variation is due to the correlation between the components. This significant reduction in volume, along with the conservative Bonferroni correction (p = 185), results in increased delay in termination.
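The multivariate ESS used throughout combines the sample covariance Λ_n with the mbm estimate Σ_n through their determinants. Assuming the definition ESS = n (det Λ / det Σ)^{1/p}, as implemented in the mcmcse package, a self-contained sketch looks as follows; the function name is ours.

```python
import numpy as np

def multivariate_ess(chain, batch_exponent=1 / 2):
    """Multivariate effective sample size: n * (det(Lambda)/det(Sigma))^{1/p},
    with Lambda the sample covariance and Sigma the mBM estimate
    (batch size b = floor(n^batch_exponent), as in this example)."""
    n, p = chain.shape
    lam = np.cov(chain, rowvar=False)
    b = int(np.floor(n ** batch_exponent))
    a = n // b
    used = chain[: a * b]
    batch_means = used.reshape(a, b, p).mean(axis=1)
    centered = batch_means - used.mean(axis=0)
    sigma = b * (centered.T @ centered) / (a - 1)
    # log-determinants avoid overflow/underflow when p is large (here p = 185)
    ld_lam = np.linalg.slogdet(lam)[1]
    ld_sig = np.linalg.slogdet(sigma)[1]
    return n * np.exp((ld_lam - ld_sig) / p)
```

For a chain with no serial correlation, Σ ≈ Λ and the estimate is close to n; strong positive autocorrelation inflates det Σ relative to det Λ and drives the ESS down.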
For smaller values of ɛ it was not possible to store the MCMC output in memory on an 8-gigabyte machine using the ubm-Bonferroni method. As a result, in Table 6 the univariate methods could not be implemented for smaller ɛ values. For ɛ = 0.10, termination for mbm was at n = 5e4 for every replication. At
Figure 8: Bayesian Spatial: 90% confidence regions for β_1^(0) and β_2^(0), and for u_1(1) and u_2(1). Monte Carlo sample size = 10^5.

Figure 9: Bayesian Spatial: Plot of ɛ versus observed coverage probability for the mbm estimator over 1000 replications with b_n = ⌊n^{1/2}⌋.

these minimum iterations, the coverage probability for mbm is at 88%, whereas both univariate methods have far lower coverage probabilities: 0.62 for ubm-Bonferroni and lower still for ubm. The coverage probabilities for the uncorrected method are quite small since we are making 185 confidence regions simultaneously. In Figure 9, we illustrate the asymptotic validity of the confidence regions constructed using the relative standard deviation fixed-volume sequential stopping rule by presenting the observed coverage probabilities over 1000 replications for several values of ɛ.
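As the abstract notes, the relative standard deviation fixed-volume rule is asymptotically equivalent to terminating once the multivariate ESS exceeds a lower bound determined by p, the confidence level, and ɛ. A sketch of that bound, assuming the form used by the mcmcse package; the caller supplies the chi-squared quantile (e.g. from a table) so that only the standard library is needed.

```python
from math import exp, lgamma, log, pi

def min_ess(p, eps, chi2_quantile):
    """Lower bound on multivariate ESS for termination at precision eps:
        (2^{2/p} * pi / (p * Gamma(p/2))^{2/p}) * chi2_quantile / eps**2,
    where chi2_quantile is the (1 - alpha) quantile of a chi-squared
    distribution with p degrees of freedom.
    """
    # work on the log scale; (p * Gamma(p/2))^{2/p} overflows for large p
    log_w = (2 / p) * log(2) + log(pi) - (2 / p) * (log(p) + lgamma(p / 2))
    return exp(log_w) * chi2_quantile / eps**2
```

For example, with p = 1, α = 0.05, and ɛ = 0.05 (quantile χ²_{0.95,1} ≈ 3.84) this gives roughly 6,100 effective samples; one then simulates until the estimated multivariate ESS exceeds the bound.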
Table 6: Bayesian Spatial: Over 1000 replications, we present termination iteration, effective sample size at termination, and coverage probabilities at termination for the mbm, ubm-Bonferroni, and ubm methods at the 90% nominal level. Standard errors are in parentheses.

5 Discussion

Multivariate analysis of MCMC output data has received little attention thus far. Seila (1982) and Chen and Seila (1987) built a framework for multivariate analysis of a Markov chain using regenerative simulation. Since establishing regenerative properties for a Markov chain requires a separate analysis for every problem, and will not work well for component-wise Metropolis-Hastings samplers, the application of their work is limited. More recently, Vats et al. (2015) showed strong consistency of the multivariate spectral variance (msve) estimators of Σ, which could potentially substitute for the mbm estimator in applications to termination rules. However, outside of toy problems where p is small, the msve estimator is computationally demanding compared to the mbm estimator.

To compare these two estimators, we extend the VAR(1) example discussed in Section 4.2 to p = 50. Since in this case we know the true Σ, we assess the relative error in estimation, ‖Σ̂ − Σ‖_F / ‖Σ‖_F, where Σ̂ is either the mbm or msve estimator and ‖·‖_F is the Frobenius norm. Results are in Table 7. The msve estimator overall has smaller relative error, but as the Monte Carlo sample size increases, its computation
Table 7: Relative error and computation time (in seconds) comparison between the msve and mbm estimators for a VAR(1) model with p = 50, over varying Monte Carlo sample sizes.

time is significantly higher than for the mbm estimator. Also note that the relative error of both estimators decreases as the Monte Carlo sample size increases. The msve and mbm estimators and the multivariate ESS have been implemented in the mcmcse R package (Flegal et al., 2015).

Through our examples in Section 4, we present a variety of settings where multivariate estimation leads to improved analysis when compared to univariate methods. The univariate methods provide a fundamental framework to build on, but are not equipped for analyzing multivariate output, even when the dimension of the problem is as small as 5. This is because two layers of conservative methods are required to implement the relative standard deviation fixed-volume sequential stopping rule with univariate estimators: (i) the termination rule must be satisfied for each component, and (ii) a correction is needed to adjust for multiple testing. As demonstrated in our examples, uncorrected methods have less than the desired coverage probabilities even for problems with small p. For large p, uncorrected methods provide confidence regions with coverage probabilities very close to zero. Thus, a correction is essential when using univariate methods.

References

Atkinson, Q. D., Gray, R. D., and Drummond, A. J. (2008). mtDNA variation predicts population size in humans and reveals a major Southern Asian chapter in human prehistory. Molecular Biology and Evolution, 25(2).
Berkes, I. and Philipp, W. (1979). Approximation theorems for independent and weakly dependent random vectors. The Annals of Probability, 7.

Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.

Charnes, J. M. (1995). Analyzing multivariate output. In Proceedings of the 27th Conference on Winter Simulation. IEEE Computer Society.

Chen, D.-F. R. and Seila, A. F. (1987). Multivariate inference in stationary simulation using batch means. In Proceedings of the 19th Conference on Winter Simulation. ACM.

Cowles, M. K. and Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 91.

Csörgő, M. and Révész, P. (1981). Strong Approximations in Probability and Statistics. Academic Press.

Damerdji, H. (1991). Strong consistency and other properties of the spectral variance estimator. Management Science, 37.

Damerdji, H. (1994). Strong consistency of the variance estimator in steady-state simulation output analysis. Mathematics of Operations Research, 19.

Damerdji, H. (1995). Mean-square consistency of the variance estimator in steady-state simulation output analysis. Operations Research, 43(2).

Dehling, H. and Philipp, W. (1982). Almost sure invariance principles for weakly dependent vector-valued random variables. The Annals of Probability, 10.

Drummond, A. J., Ho, S. Y., Phillips, M. J., Rambaut, A., et al. (2006). Relaxed phylogenetics and dating with confidence. PLoS Biology, 4(5):699.

Eberlein, E. (1986). On strong invariance principles under dependence assumptions. The Annals of Probability, 14.
Bayesian spatial hierarchical modeling for temperature extremes Indriati Bisono Dr. Andrew Robinson Dr. Aloke Phatak Mathematics and Statistics Department The University of Melbourne Maths, Informatics
More informationMALA versus Random Walk Metropolis Dootika Vats June 4, 2017
MALA versus Random Walk Metropolis Dootika Vats June 4, 2017 Introduction My research thus far has predominantly been on output analysis for Markov chain Monte Carlo. The examples on which I have implemented
More informationStat 516, Homework 1
Stat 516, Homework 1 Due date: October 7 1. Consider an urn with n distinct balls numbered 1,..., n. We sample balls from the urn with replacement. Let N be the number of draws until we encounter a ball
More informationBayesian Methods in Multilevel Regression
Bayesian Methods in Multilevel Regression Joop Hox MuLOG, 15 september 2000 mcmc What is Statistics?! Statistics is about uncertainty To err is human, to forgive divine, but to include errors in your design
More informationA Fully Nonparametric Modeling Approach to. BNP Binary Regression
A Fully Nonparametric Modeling Approach to Binary Regression Maria Department of Applied Mathematics and Statistics University of California, Santa Cruz SBIES, April 27-28, 2012 Outline 1 2 3 Simulation
More informationLecture 8: Bayesian Estimation of Parameters in State Space Models
in State Space Models March 30, 2016 Contents 1 Bayesian estimation of parameters in state space models 2 Computational methods for parameter estimation 3 Practical parameter estimation in state space
More informationGibbs Sampling in Linear Models #2
Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling
More informationHierarchical Modeling for Univariate Spatial Data
Hierarchical Modeling for Univariate Spatial Data Geography 890, Hierarchical Bayesian Models for Environmental Spatial Data Analysis February 15, 2011 1 Spatial Domain 2 Geography 890 Spatial Domain This
More informationNearest Neighbor Gaussian Processes for Large Spatial Data
Nearest Neighbor Gaussian Processes for Large Spatial Data Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns
More informationCan we do statistical inference in a non-asymptotic way? 1
Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.
More informationMarkov Chain Monte Carlo methods
Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As
More informationLong-Run Covariability
Long-Run Covariability Ulrich K. Müller and Mark W. Watson Princeton University October 2016 Motivation Study the long-run covariability/relationship between economic variables great ratios, long-run Phillips
More informationMarkov Chain Monte Carlo methods
Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning
More informationStat 5101 Lecture Notes
Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters
More informationA quick introduction to Markov chains and Markov chain Monte Carlo (revised version)
A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) Rasmus Waagepetersen Institute of Mathematical Sciences Aalborg University 1 Introduction These notes are intended to
More informationStochastic process for macro
Stochastic process for macro Tianxiao Zheng SAIF 1. Stochastic process The state of a system {X t } evolves probabilistically in time. The joint probability distribution is given by Pr(X t1, t 1 ; X t2,
More informationLatent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent
Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary
More informationA Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data
A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data Yujun Wu, Marc G. Genton, 1 and Leonard A. Stefanski 2 Department of Biostatistics, School of Public Health, University of Medicine
More informationPractical unbiased Monte Carlo for Uncertainty Quantification
Practical unbiased Monte Carlo for Uncertainty Quantification Sergios Agapiou Department of Statistics, University of Warwick MiR@W day: Uncertainty in Complex Computer Models, 2nd February 2015, University
More informationPart 8: GLMs and Hierarchical LMs and GLMs
Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course
More information1 Random walks and data
Inference, Models and Simulation for Complex Systems CSCI 7-1 Lecture 7 15 September 11 Prof. Aaron Clauset 1 Random walks and data Supposeyou have some time-series data x 1,x,x 3,...,x T and you want
More informationOnline Appendix for: A Bounded Model of Time Variation in Trend Inflation, NAIRU and the Phillips Curve
Online Appendix for: A Bounded Model of Time Variation in Trend Inflation, NAIRU and the Phillips Curve Joshua CC Chan Australian National University Gary Koop University of Strathclyde Simon M Potter
More informationExtreme Value Analysis and Spatial Extremes
Extreme Value Analysis and Department of Statistics Purdue University 11/07/2013 Outline Motivation 1 Motivation 2 Extreme Value Theorem and 3 Bayesian Hierarchical Models Copula Models Max-stable Models
More informationStat 451 Lecture Notes Monte Carlo Integration
Stat 451 Lecture Notes 06 12 Monte Carlo Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 6 in Givens & Hoeting, Chapter 23 in Lange, and Chapters 3 4 in Robert & Casella 2 Updated:
More informationRecall that the AR(p) model is defined by the equation
Estimation of AR models Recall that the AR(p) model is defined by the equation X t = p φ j X t j + ɛ t j=1 where ɛ t are assumed independent and following a N(0, σ 2 ) distribution. Assume p is known and
More informationComparing Non-informative Priors for Estimation and. Prediction in Spatial Models
Comparing Non-informative Priors for Estimation and Prediction in Spatial Models Vigre Semester Report by: Regina Wu Advisor: Cari Kaufman January 31, 2010 1 Introduction Gaussian random fields with specified
More informationIntroduction to Bayesian methods in inverse problems
Introduction to Bayesian methods in inverse problems Ville Kolehmainen 1 1 Department of Applied Physics, University of Eastern Finland, Kuopio, Finland March 4 2013 Manchester, UK. Contents Introduction
More informationKernel adaptive Sequential Monte Carlo
Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline
More informationBayesian linear regression
Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding
More informationBayesian Computation in Color-Magnitude Diagrams
Bayesian Computation in Color-Magnitude Diagrams SA, AA, PT, EE, MCMC and ASIS in CMDs Paul Baines Department of Statistics Harvard University October 19, 2009 Overview Motivation and Introduction Modelling
More informationModel Selection for Geostatistical Models
Model Selection for Geostatistical Models Richard A. Davis Colorado State University http://www.stat.colostate.edu/~rdavis/lectures Joint work with: Jennifer A. Hoeting, Colorado State University Andrew
More informationBayesian Linear Models
Bayesian Linear Models Sudipto Banerjee September 03 05, 2017 Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles Linear Regression Linear regression is,
More informationBayesian Dynamic Modeling for Space-time Data in R
Bayesian Dynamic Modeling for Space-time Data in R Andrew O. Finley and Sudipto Banerjee September 5, 2014 We make use of several libraries in the following example session, including: ˆ library(fields)
More informationarxiv: v1 [stat.co] 18 Feb 2012
A LEVEL-SET HIT-AND-RUN SAMPLER FOR QUASI-CONCAVE DISTRIBUTIONS Dean Foster and Shane T. Jensen arxiv:1202.4094v1 [stat.co] 18 Feb 2012 Department of Statistics The Wharton School University of Pennsylvania
More informationChapter 3 - Temporal processes
STK4150 - Intro 1 Chapter 3 - Temporal processes Odd Kolbjørnsen and Geir Storvik January 23 2017 STK4150 - Intro 2 Temporal processes Data collected over time Past, present, future, change Temporal aspect
More informationDefault Priors and Effcient Posterior Computation in Bayesian
Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature
More informationStatistical Data Analysis
DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the
More informationECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS
ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS 1. THE CLASS OF MODELS y t {y s, s < t} p(y t θ t, {y s, s < t}) θ t = θ(s t ) P[S t = i S t 1 = j] = h ij. 2. WHAT S HANDY ABOUT IT Evaluating the
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More information