Asymptotically Stable Drift and Minorization for Markov Chains with Application to Albert and Chib's Algorithm


Asymptotically Stable Drift and Minorization for Markov Chains with Application to Albert and Chib's Algorithm

Qian Qin and James P. Hobert
Department of Statistics, University of Florida

December 2017

Abstract

The use of MCMC algorithms in high-dimensional Bayesian problems has become routine. This has spurred so-called convergence complexity analysis, the goal of which is to ascertain how the convergence rate of a Monte Carlo Markov chain scales with sample size, n, and/or number of covariates, p. Recent work suggests that, while often useful for establishing convergence rates when n and p are fixed, techniques based on drift and minorization may not be versatile enough to handle the more delicate task of convergence complexity analysis. This article provides some general ideas for sharpening drift and minorization conditions, particularly in cases where n and/or p is large. The key ideas include developing an appropriately centered drift function, and suppressing high-dimensionality in the construction of minorization conditions. These concepts are employed in a thorough convergence complexity analysis of Albert and Chib's (1993) data augmentation algorithm for the Bayesian probit model. The main result is that the geometric convergence rate of the underlying Markov chain is bounded away from 1, both as n → ∞ with p fixed, and as p → ∞ with n fixed. Furthermore, the first computable bounds on the total variation distance to stationarity are byproducts of the asymptotic analysis.

Key words and phrases: Bayesian probit regression, convergence complexity, geometric ergodicity, high-dimensional inference, large p - small n, Markov chain Monte Carlo.

1 Introduction

Markov chain Monte Carlo (MCMC) has become an indispensable tool in Bayesian analysis, and it is now well known that the ability to utilize an MCMC algorithm in a principled manner, with regard to choosing

the Monte Carlo sample size, for example, requires an understanding of the convergence properties of the underlying Markov chain (see, e.g., Flegal et al.). Taking this a step further, in modern high-dimensional problems, it is also important to understand how the convergence properties of the chain change as the sample size, n, and/or the number of covariates, p, increases. Denote the data (i.e., responses and covariates) by D. If we imagine n or p (or both) increasing, this leads to consideration of a sequence of data sets, {D_j}, and corresponding sequences of posterior distributions and Monte Carlo Markov chains. A natural question to ask is: What can we say about the convergence properties of the Markov chains as j → ∞? There is currently a great deal of interest in questions like this in the MCMC community (see, e.g., Rajaratnam and Sparks, 2015; Durmus and Moulines, 2016; Johndrow et al., 2016; Yang et al., 2016; Yang and Rosenthal, 2017). Rajaratnam and Sparks (2015) call this the study of convergence complexity, and we will follow their lead. Asymptotic analysis of the convergence properties of a sequence of Markov chains associated with increasingly large finite state spaces has a long history in the computer science literature, dating back at least to Sinclair and Jerrum. While the techniques developed in computer science have been successfully applied to a few problems in statistics (see, e.g., Yang et al., 2016), they are generally not applicable in situations where the state space is high-dimensional and uncountable, which is the norm for Monte Carlo Markov chains in Bayesian statistics. In this paper, we show how methods based on drift and minorization can be used to study the convergence complexity of such Markov chains. Let π : X → [0, ∞) denote an intractable probability density function (pdf), where X ⊆ R^d, and let Π denote the corresponding probability measure, i.e., for measurable A, Π(A) = ∫_A π(x) dx. (More general spaces will be considered later.)
Let K(x, ·), x ∈ X, denote the Markov transition function (Mtf) of an irreducible, aperiodic, Harris recurrent Markov chain with invariant probability measure Π. (See Meyn and Tweedie, 2012, for definitions.) The chain is called geometrically ergodic if there exist M : X → [0, ∞) and ρ ∈ [0, 1) such that

‖K^m(x, ·) − Π(·)‖_TV ≤ M(x) ρ^m for all x ∈ X and all m ∈ N,   (1)

where ‖·‖_TV denotes the total variation norm, and K^m(x, ·) is the m-step Mtf. The important practical benefits of basing one's MCMC algorithm on a geometrically ergodic Markov chain have been well documented by, e.g., Roberts and Rosenthal (1998), Jones and Hobert (2001), Flegal et al., and Łatuszyński et al. Define the geometric convergence rate of the chain as

ρ* = inf { ρ ∈ [0, 1] : (1) holds for some M : X → [0, ∞) }.

Clearly, the chain is geometrically ergodic if and only if ρ* < 1. Establishing the convergence rate of a practically relevant Monte Carlo Markov chain can be quite challenging. A key tool for this purpose has been the technique developed by Rosenthal (1995), which allows for the construction of an upper bound on ρ* using drift and minorization conditions. This method, which is described in detail in Section 2, has been used to establish the geometric ergodicity of myriad Monte Carlo Markov chains (see, e.g., Fort et al., 2003; Marchev and Hobert, 2004; Roy and Hobert, 2010; Vats). Since the drift and minorization (hereafter, d&m) method is still the most reliable, and arguably the only reliable, tool for bounding ρ* for practically relevant Monte Carlo Markov chains on uncountable state spaces, it is important to know whether it remains useful in the context of convergence complexity analysis. Unfortunately, it turns out that most of the upper bounds on ρ* that have been produced using techniques based on d&m converge to 1 (often exponentially fast), becoming trivial, as n and/or p grows (see, e.g., Rajaratnam and Sparks, 2015). One example of this is Roy and Hobert's (2007) analysis of Albert and Chib's (1993) data augmentation algorithm for the Bayesian probit model, which establishes geometric ergodicity of the underlying Markov chain, but also leads to an upper bound on ρ* that converges to 1 as n → ∞. There are, of course, many possible explanations for why the d&m-based upper bounds on ρ* converge to 1. It could be that the associated Monte Carlo Markov chains actually have poor asymptotic properties, or, if not, perhaps Rosenthal's (1995) method is simply not up to the more delicate task of convergence complexity analysis. We show that, in the case of Albert and Chib's (1993) chain, neither of these potential explanations is correct.
Indeed, our careful d&m analysis of Albert and Chib's (1993) chain (hereafter, A&C's chain) leads to upper bounds on ρ* that are bounded away from 1 in both the large n, small p case and the large p, small n case. We believe that this is the first successful convergence complexity analysis of a practically relevant Monte Carlo Markov chain using Rosenthal's (1995) method. Recently, Yang and Rosenthal (2017) used a modified version of Rosenthal's (1995) technique to successfully analyze the convergence complexity of a Gibbs sampler for a simple Bayesian linear mixed model. We note that, because one of the variance components in their model is assumed known, it is actually straightforward to sample directly from the posterior distribution using a univariate rejection sampler (Jones, 2001, Section 3.9). Thus, while Yang and Rosenthal's (2017) results are impressive, and their methods may suggest a way forward, the Monte Carlo Markov chain that they analyzed is not practically relevant. One of the main contributions of this article is a general approach for sharpening the upper bounds that are produced via d&m, particularly in cases where n and/or p is large. The key ideas include centering the drift function at a region of the state space that the chain visits frequently, and suppressing high-dimensionality in the construction of minorization conditions. In addition, for two-block Gibbs chains,

we introduce a technique for constructing asymptotically stable minorization conditions that is based on the well-known fact that the two marginal Markov chains and the joint chain all share the same geometric convergence rate. Before describing our results for A&C's chain, we introduce an alternative definition of geometric ergodicity. Let L²(Π) denote the set of signed measures µ that are absolutely continuous with respect to Π and satisfy ∫_X (dµ/dΠ)² dΠ < ∞. As in Roberts and Rosenthal (1997), we say that the Markov chain with Mtf K is L²-geometrically ergodic if there exists ρ < 1 such that, for each probability measure ν ∈ L²(Π), there exists a constant M_ν < ∞ such that

‖νK^m(·) − Π(·)‖_TV ≤ M_ν ρ^m for all m ∈ N,

where νK^m(·) = ∫_X K^m(x, ·) ν(dx). We define the L²-geometric convergence rate, ρ**, to be the infimum of all ρ ∈ [0, 1] that satisfy this definition. Not surprisingly, ρ* and ρ** are closely related (see, e.g., Roberts and Tweedie). We now provide an overview of our results for Albert and Chib's (1993) Markov chain, starting with a brief description of their algorithm. Let {X_i}_{i=1}^n be p-dimensional covariates, and let {Y_i}_{i=1}^n be a corresponding sequence of binary random variables such that Y_i | X_i, B ~ Bernoulli(Φ(X_i^T B)), independently, where B is a p × 1 vector of unknown regression coefficients, and Φ is the standard normal distribution function. Consider a Bayesian analysis using a prior density for B given by

ω(β) ∝ exp{ −(1/2)(β − v)^T Q (β − v) },

where v ∈ R^p, and Q ∈ R^{p×p} is either a positive definite matrix (proper Gaussian prior) or the zero matrix (flat improper prior). Assume for now that the posterior is proper. (Propriety under a flat prior is discussed in Section 3.) As usual, let X denote the n × p matrix whose ith row is X_i^T, and let Y = (Y_1, Y_2, ..., Y_n)^T denote the vector of responses. The intractable posterior density is given by

π_{B|Y,X}(β | Y, X) ∝ [ ∏_{i=1}^n Φ(X_i^T β)^{Y_i} (1 − Φ(X_i^T β))^{1−Y_i} ] ω(β).   (2)
The standard method for exploring (2) is the classical data augmentation algorithm of Albert and Chib (1993), which simulates a Harris ergodic (irreducible, aperiodic, and Harris recurrent) Markov chain, {B_m}_{m=0}^∞, that has invariant density π_{B|Y,X}. In order to state the algorithm, we require a bit of notation. For θ ∈ R, σ > 0, and i ∈ {0, 1}, let TN(θ, σ²; i) denote the N(θ, σ²) distribution truncated to (0, ∞) if i = 1, and to (−∞, 0) if i = 0. The matrix Σ := X^T X + Q is necessarily non-singular because of propriety. If the current state of A&C's Markov chain is B_m = β, then the new state, B_{m+1}, is drawn using the following two steps:

Iteration m + 1 of the data augmentation algorithm:

1. Draw {Z_i}_{i=1}^n independently with Z_i ~ TN(X_i^T β, 1; Y_i), and let Z = (Z_1, Z_2, ..., Z_n)^T.

2. Draw B_{m+1} ~ N_p(Σ^{-1}(X^T Z + Qv), Σ^{-1}).

The convergence rate of A&C's chain has been studied by several authors. Roy and Hobert (2007) proved that when Q = 0, the chain is always geometrically ergodic. (Again, we are assuming posterior propriety.) A similar result for proper normal priors was established by Chakraborty and Khare (2017). Both results were established using a technique that does not require construction of a minorization condition (see Meyn and Tweedie, 2012), and, consequently, does not yield an explicit upper bound on ρ*. Thus, neither paper addresses the issue of convergence complexity. However, in Section 5 we prove that Roy and Hobert's (2007) drift function cannot be used to construct an upper bound on ρ* that is bounded away from 1 as n → ∞. Johndrow et al. (2016) recently established a convergence complexity result for the intercept-only version of the model (p = 1 and X_i = 1 for i = 1, 2, ..., n) with a proper univariate normal prior, under the assumption that all the responses are successes (Y_i = 1 for i = 1, 2, ..., n). Their results, which are based on Cheeger's inequality, imply that ρ* → 1 as n → ∞, indicating that the algorithm is inefficient for large samples. The results established herein provide a much more complete picture of the convergence behavior of A&C's chain. Three different regimes are considered: (i) fixed n and p, (ii) large n, small p, and (iii) large p, small n. Our analysis is based on two different drift functions that are both appropriately centered at the posterior mode. One of the two drift functions is designed for case (ii), and the other for case (iii). We establish d&m conditions for both drift functions, and these are used in conjunction with Rosenthal's (1995) result to construct two explicit upper bounds on ρ*.
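In code, one iteration of the two-step update above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the helper name `ac_step` is ours.

```python
import numpy as np
from scipy.stats import truncnorm

def ac_step(beta, X, Y, Q, v, rng):
    """One iteration of the two-step data augmentation update sketched above."""
    mean = X @ beta
    # Step 1: Z_i ~ TN(X_i^T beta, 1; Y_i): truncated to (0, inf) when Y_i = 1,
    # and to (-inf, 0) when Y_i = 0.  truncnorm takes standardized bounds.
    lo = np.where(Y == 1, -mean, -np.inf)
    hi = np.where(Y == 1, np.inf, -mean)
    Z = mean + truncnorm.rvs(lo, hi, random_state=rng)
    # Step 2: B_{m+1} ~ N_p(Sigma^{-1}(X^T Z + Q v), Sigma^{-1}), Sigma = X^T X + Q.
    Sigma_inv = np.linalg.inv(X.T @ X + Q)
    return rng.multivariate_normal(Sigma_inv @ (X.T @ Z + Q @ v), Sigma_inv)
```

Iterating `ac_step` produces a realization of the chain {B_m} with invariant density π_{B|Y,X}.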
These d&m conditions are also used to construct two computable upper bounds on the total variation distance to stationarity, as in (1), which improves upon the analyses of Roy and Hobert (2007) and Chakraborty and Khare (2017). The goal in case (ii) is to study the asymptotic behavior of the geometric convergence rate as n → ∞, when p is fixed. To this end, we consider a sequence of data sets, D_n := {(X_i, Y_i)}_{i=1}^n. So, each time n increases by 1, we are given a new p × 1 covariate vector and a corresponding binary response. To facilitate the asymptotic study, we assume that the (X_i, Y_i) pairs are generated according to a random mechanism that is governed by very weak assumptions that are consistent with the probit regression model.
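One concrete such mechanism (our own illustration, not a construction from the paper) draws covariates i.i.d. and generates responses through the latent-variable representation of the probit model, so that P(Y_i = 1 | X_i) = Φ(X_i^T B) by construction:

```python
import numpy as np

def generate_probit_data(n, beta, rng):
    """Simulate {(X_i, Y_i)} from the probit model via the latent-variable
    representation: Z_i ~ N(X_i^T beta, 1) and Y_i = 1{Z_i > 0}, so that
    P(Y_i = 1 | X_i) = Phi(X_i^T beta).  The i.i.d. normal covariates are
    one simple choice of random mechanism, not an assumption of the paper."""
    p = len(beta)
    X = rng.normal(size=(n, p))        # covariates
    Z = X @ beta + rng.normal(size=n)  # latent Gaussians
    return X, (Z > 0).astype(int)
```

Growing data sets D_n are then obtained by appending one simulated pair at a time.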

We show that there exists a constant ρ̄ < 1 such that, almost surely, lim sup_{n→∞} ρ*(D_n) ≤ ρ̄. Apart from this general result, we are also able to show that, in the intercept-only model considered by Johndrow et al. (2016), the A&C chain is actually quite well behaved as long as the proportion of successes is bounded away from 0 and 1. To be specific, let {Y_i} denote a fixed sequence of binary responses, and let p̂_n = n^{-1} Σ_{i=1}^n Y_i. Our results imply that, as long as 0 < lim inf_{n→∞} p̂_n ≤ lim sup_{n→∞} p̂_n < 1, there exists ρ̄ < 1 such that both ρ*(D_n) and its upper bound ρ̂(D_n) are eventually bounded above by ρ̄, and there is a closed-form expression for ρ̄. In case (iii), n is fixed and p → ∞. There are several important differences between cases (ii) and (iii). First, in case (ii), since p is fixed, only a single prior distribution need be considered. In contrast, when p → ∞, we must specify a sequence of priors, {(Q_p, v_p)}_{p=1}^∞, where v_p is a p × 1 vector, and Q_p is a p × p positive definite matrix. (When p > n, a positive definite Q is required for posterior propriety.) Also, in case (iii), there is a fixed vector of responses of length n, and it is somewhat unnatural to consider the new columns of X to be random. Let D_p := {v_p, Q_p, X_p, Y} denote a fixed sequence of priors and data sets, where Y is a fixed n × 1 vector of responses, and X_p is an n × p matrix. We show that, under a natural regularity condition on X_p Q_p^{-1} X_p^T, there exists a ρ̄ < 1 such that ρ*(D_p) ≤ ρ̄ for all p. The remainder of the paper is laid out as follows. In Section 2, we formally introduce the concept of stable d&m conditions, and describe our general ideas for constructing such conditions. The centered drift functions that are used in our analysis of A&C's chain are described in Section 3. In Section 4, we provide results for A&C's chain in the case where n and p are both fixed. Two sets of d&m conditions are established, and corresponding exact total variation bounds on the distance to stationarity are provided.
The heart of the paper is Section 5, where it is shown that the geometric convergence rate of A&C's chain is bounded away from 1 as n → ∞ for fixed p, and as p → ∞ for fixed n. A good deal of technical material is relegated to an Appendix.

2 Asymptotically Stable Drift and Minorization

Let X be a set equipped with a countably generated σ-algebra B(X). Suppose that K : X × B(X) → [0, 1] is an Mtf with invariant probability measure Π, so that Π(A) = ∫_X K(x, A) Π(dx) for all measurable A. Assume that the corresponding Markov chain is Harris ergodic. Recall the definitions of geometric ergodicity and geometric convergence rate from Section 1. The following result has proven extremely useful for establishing geometric ergodicity in the context of Monte Carlo Markov chains used to study complex Bayesian posterior distributions.

Theorem 1 (Rosenthal, 1995). Suppose that K(x, ·) satisfies the drift condition

∫_X V(x′) K(x, dx′) ≤ λ V(x) + L for all x ∈ X,   (3)

for some V : X → [0, ∞), λ < 1 and L < ∞. Suppose that it also satisfies the minorization condition

K(x, ·) ≥ ε Q(·) whenever V(x) < d,   (4)

for some ε > 0, probability measure Q(·) on X, and d > 2L/(1 − λ). Then, assuming the chain is started according to the probability measure ν, for any 0 < r < 1, we have

‖νK^m − Π‖_TV ≤ (1 − ε)^{rm} + [ ((1 + 2L + λd)/(1 + d))^{1−r} {1 + 2(λd + L)}^r ]^m ( 1 + L/(1 − λ) + ∫_X V(x) ν(dx) ).

The function V is called the drift (or Lyapunov) function, and {x ∈ X : V(x) < d} is called the small set associated with V. We will refer to ε as the minorization number. Manipulation of the total variation bound in Theorem 1 leads to

‖νK^m − Π‖_TV ≤ ( 2 + L/(1 − λ) + ∫_X V(x) ν(dx) ) ρ̂^m,

where

ρ̂ := max{ (1 − ε)^r , ((1 + 2L + λd)/(1 + d))^{1−r} {1 + 2(λd + L)}^r },

and r ∈ (0, 1) is arbitrary. Then ρ̂ is an upper bound on the geometric convergence rate ρ*. It is easy to verify that, when λ < 1, L < ∞, and ε > 0, there exists r ∈ (0, 1) such that ρ̂ < 1. The bound ρ̂ has a reputation for being too conservative. This is partly due to the fact that there are toy examples where the true ρ* is known, and ρ̂ is quite far off. There also exist myriad analyses of practical Monte Carlo Markov chains where the d&m conditions (3) and (4) have been established, proving that the underlying chain is indeed geometrically ergodic, but the total variation bound of Theorem 1 is useless because ρ̂ is so near unity. Of course, the quality of the bound ρ̂ depends on the choice of drift function, and on the sharpness of (3) and (4). Our results for the A&C chain suggest that poorly chosen drift functions and/or loose inequalities in the d&m conditions are to blame for at least some of the unsuccessful applications of Theorem 1. We now introduce the concept of asymptotically stable d&m. Consider a sequence of geometrically ergodic Markov chains, {Ψ_j}_{j=1}^∞, with corresponding geometric convergence rates given by ρ*_j.
In practice, j is usually the sample size, n, or the number of covariates, p. We are interested in the asymptotic behavior of the rate sequence. For example, we might want to know if it

is bounded away from 1. Suppose that for each chain, Ψ_j, we have d&m conditions defined through λ^(j), L^(j), and ε^(j), and thus an upper bound on the convergence rate, ρ̂^(j) ∈ [ρ*_j, 1). The following simple result provides conditions under which these upper bounds are unstable, that is, lim sup_{j→∞} ρ̂^(j) = 1.

Proposition 2. Suppose that there exists a subsequence of {Ψ_j}_{j=1}^∞, call it {Ψ_{j_l}}_{l=1}^∞, that satisfies one or more of the following three conditions: (i) λ^(j_l) → 1; (ii) L^(j_l) → ∞, while ε^(j_l) is bounded away from 1 and λ^(j_l) is bounded away from 0; (iii) ε^(j_l) → 0. Then the corresponding subsequence of the upper bounds, {ρ̂^(j_l)}, converges to 1.

Proof. Throughout the proof, we drop the superscripts on λ, L, and ε. If (i) holds, then

(1 + 2L + λd)/(1 + d) ≥ λ → 1.

Since 1 + 2(λd + L) ≥ 1, regardless of the asymptotic behavior of r ∈ (0, 1), we have

lim inf_{l→∞} ((1 + 2L + λd)/(1 + d))^{1−r} {1 + 2(λd + L)}^r ≥ 1.

Hence, ρ̂^(j_l) → 1. Now assume that (ii) holds. Then 1 + 2(λd + L) → ∞. Meanwhile, for large enough l,

(1 + 2L + λd)/(1 + d) ≥ λ > c,

where c is a positive constant. Hence, to ensure that

((1 + 2L + λd)/(1 + d))^{1−r} {1 + 2(λd + L)}^r

is bounded, one must let r → 0. However, this would result in (1 − ε)^r → 1. Finally, assume that (iii) holds. Then (1 − ε)^r → 1 regardless of the asymptotic behavior of r ∈ (0, 1).

If one or more of the conditions in Proposition 2 holds for some subsequence of {Ψ_j}_{j=1}^∞, then we say that the d&m conditions are asymptotically unstable in j. On the other hand, if (i) λ^(j) is bounded away from 1, (ii) L^(j) is bounded above, and (iii) ε^(j) is bounded away from 0, then the sequence {ρ̂^(j)} can be bounded away from 1, thus giving an asymptotically nontrivial upper bound on ρ*_j. We say that the drift conditions are stable if (i) and (ii) hold, and, likewise, that the minorization conditions are stable if (iii) holds. Before proceeding to discuss techniques for achieving stability, we note that there exist some elementary transformations of V that could affect the quality of ρ̂, and even the stability of d&m.
For instance, if (3) is satisfied, then

∫_X L^{-1} V(x′) K(x, dx′) ≤ λ L^{-1} V(x) + 1.

Thus, if L^(j) → ∞, we can simply re-scale each drift function by its corresponding (L^(j))^{-1}, which results in L^(j) ≡ 1. However, such re-scaling cannot be used to stabilize minorization conditions. In the above example, suppose that for some d > 2/(1 − λ), K(x, ·) ≥ ε Q(·) whenever L^{-1} V(x) < d. Then K(x, ·) ≥ ε Q(·) whenever V(x) < Ld, where Ld > 2L/(1 − λ). Thus, if one cannot establish a stable minorization condition associated with V(x), then there is no hope of finding a stable minorization associated with L^{-1} V(x). Stability could also be affected by adding a scalar to the drift function, i.e., replacing V(x) with V(x) + c. One can easily show that subtracting a positive number from V, while preserving the nonnegativity of V, decreases the volume of the small set, and thus increases the minorization number, ε. In fact, it always improves the quality of ρ̂. Needless to say, we will only deal with instability that cannot be prevented by these trivial transformations. In particular, throughout the paper, we consider only drift functions whose infimum is 0, that is, we make the elementary transformation V(x) ← V(x) − inf_{x∈X} V(x). Choosing an appropriate drift function is extremely important for establishing stable d&m conditions. The total variation bound in Theorem 1 is based on a theoretical coupling argument that involves two dependent copies of the Markov chain in question, one started in stationarity, and the other started from an arbitrary initial distribution. The total variation distance between the two chains after m iterations is bounded above by P(T > m), where T is the coupling time. In the coupling construction, there is a chance to couple each time the chains enter the small set, {x ∈ X : V(x) < d}. Loosely speaking, we want the chains to visit the small set as often as possible, since each visit provides a chance to couple, and fast coupling means small values of P(T > m), which, in turn, leads to small total variation distance.
This argument suggests that one should choose a drift function that is centered with respect to the Markov chain, in the sense that the drift function should take small values in the part of the state space where the chain is likely to spend the bulk of its time. Of course, if the chain is well suited to the problem, it should spend the bulk of its time in the high posterior probability regions of X. A properly centered drift function is likely to give a small L, and in turn a small d, which is often necessary for a large ε. This intuition is not new, and many authors do attempt to center their drift functions in some way, although, in many cases, the centering is not delicate enough to ensure that the drift function reflects the dynamics of the chain. However, we believe that our work is one of the first to illustrate the importance of centering in a quantitative fashion through a genuinely non-trivial example. Indeed, in Section 5 we show that, for A&C's chain, in the large n, small p case, the un-centered drift function employed by Roy and Hobert (2007) cannot possibly lead to stable d&m, while a centered version of the same drift function does. Suppose now that V is properly centered and stable drift has been established. Obtaining stable minorization can still be difficult due to the intractability of K(x, ·). We now describe a common difficulty

encountered in the convergence analysis of two-block Gibbs samplers and data augmentation algorithms. For simplicity, we'll investigate Mtfs that admit density functions with respect to the Lebesgue measure, but most of the material in the remainder of this section can be easily generalized. Suppose that X ⊆ R^p, and that K(x, ·) is associated with a Markov transition density (Mtd) of the form

k(x, x′) = ∫_{R^n} s(x′ | z) h(z | x) dz,   (6)

where n is the sample size, z = (z_1, z_2, ..., z_n)^T ∈ R^n is a vector of latent data, and s : X × R^n → [0, ∞) and h : R^n × X → [0, ∞) are conditional densities associated with some joint density on X × R^n. In practice, it is often the case that h(z | x) can be factored as follows:

h(z | x) = ∏_{i=1}^n h_i(z_i | x),

where, for each i, h_i : R × X → [0, ∞) is a univariate conditional pdf. Usually, s(· | z) and h_i(· | x) are tractable, but there is no closed form for k(x, x′). However, there is a well-known argument for establishing a minorization condition in this case. Suppose that, whenever V(x) < d,

h_i(t | x) > ɛ_i ν_i(t), i = 1, 2, ..., n,

where ɛ_i > 0 and ν_i : R → [0, ∞) is a pdf. Then, whenever V(x) < d, we have

k(x, x′) > ( ∏_{i=1}^n ɛ_i ) ∫_{R^n} s(x′ | z) ∏_{i=1}^n ν_i(z_i) dz.   (7)

Since ∫_{R^n} s(x′ | z) ∏_{i=1}^n ν_i(z_i) dz is a pdf (in x′) on X, (7) gives a minorization condition with ε = ∏_{i=1}^n ɛ_i. Unfortunately, this quantity will almost always converge to 0 as n → ∞. Consequently, if our sequence of Markov chains is indexed by n, then we have unstable minorization. This problem is well known (see, e.g., Rajaratnam and Sparks, 2015). The instability of the minorization described above is due to the fact that the dimension of z grows with n. However, it is often the case that s(x′ | z) depends on z only through f(z), where f : R^n → Y is a function into some fixed space Y, say Y ⊆ R^q, where q does not depend on n. Then integrating along f(z) = γ in (6) yields

k(x, x′) = ∫_Y s̃(x′ | γ) h̃(γ | x) dγ,

where s̃(x′ | f(z)) = s(x′ | z), and

∫_C h̃(γ | x) dγ = ∫_{{z : f(z) ∈ C}} h(z | x) dz

for all x ∈ X and any measurable C ⊆ Y. Note that this new representation of k(x, x′) no longer contains n explicitly, and the high-dimensionality problem for z is resolved. However, we now have a new problem: namely, h̃(γ | x) is likely to be quite intractable. Fortunately, there is also a way to circumvent this difficulty. Consider a Markov chain driven by the Mtd

k̃(γ, γ′) = ∫_X h̃(γ′ | x) s̃(x | γ) dx, γ, γ′ ∈ Y.

We call this the flipped version of the chain defined by k(x, x′). It is well known that these two chains share the same geometric convergence rate (see, e.g., Roberts and Rosenthal, 2001; Diaconis et al.). Thus, we can analyze the original sequence of Markov chains indirectly through the corresponding flipped chains. Recall that the drift condition for the original chain is

∫_X V(x′) ∫_Y s̃(x′ | γ) h̃(γ | x) dγ dx′ ≤ λ V(x) + L.   (8)

Define Ṽ(γ) = ∫_X V(x) s̃(x | γ) dx. Multiplying both sides of (8) by s̃(x | γ), and integrating out x, yields the following drift condition for the flipped chain:

∫_Y Ṽ(γ′) k̃(γ, γ′) dγ′ ≤ λ Ṽ(γ) + L.

Moreover, since λ and L are the same as before, this drift condition is stable. Now, as mentioned above, s̃(x | f(z)) = s(x | z) is often tractable. Suppose that there exist ε > 0 and a pdf ν : X → [0, ∞) such that s̃(x | γ) ≥ ε ν(x) whenever Ṽ(γ) < d. Then we have the following minorization condition for the flipped chain:

k̃(γ, γ′) ≥ ε ∫_X h̃(γ′ | x) ν(x) dx whenever Ṽ(γ) < d.

This minorization is stable as long as ε is bounded away from 0 asymptotically. The flipped chain argument is not always applicable, but when it is, it simplifies calculations greatly. Furthermore, we will demonstrate that the flipped chain argument can also be useful for establishing minorization conditions that are stable in p. We end this section with a result that allows us to use information about a flipped chain to get total variation bounds for the original chain. Obviously, if we have d&m conditions for a flipped chain, then this provides us with an upper bound on the geometric convergence rate of the corresponding original chain.
However, using the information about the flipped chain to get total variation bounds for the corresponding original chain requires a bit of work. The following result follows immediately from Proposition 24, which is stated and proven in Subsection B.1.

Corollary 3. Suppose we have d&m conditions for a flipped chain driven by k̃. Let λ, L, and ε denote the drift and minorization parameters, and let ρ̂_f denote the corresponding bound on the geometric convergence rate. Then, if the original chain is started at x ∈ X, we have, for m ≥ 1,

‖K^m(x, ·) − Π(·)‖_TV ≤ ( 2 + L/(1 − λ) + ∫_Y Ṽ(γ) h̃(γ | x) dγ ) ρ̂_f^{m−1}.

In the next section, we begin our analysis of the A&C chain.

3 Albert and Chib's Markov Chain and the Centered Drift Functions

3.1 Basics

As in the Introduction, let {X_i}_{i=1}^n be p-dimensional covariates and let {Y_i}_{i=1}^n be a sequence of binary random variables such that Y_i | X_i, B ~ Bernoulli(Φ(X_i^T B)), independently, where B is a p × 1 vector of unknown regression coefficients. Suppose that, having observed the data, D := {(X_i, Y_i)}_{i=1}^n, we wish to perform a Bayesian analysis using a prior density for B given by

ω(β) ∝ exp{ −(1/2)(β − v)^T Q (β − v) },

where v ∈ R^p, and Q is a p × p inverse covariance matrix that is either positive definite or vanishing. Let X denote the n × p matrix whose ith row is X_i^T, and let Y = (Y_1, Y_2, ..., Y_n)^T denote the vector of responses. The posterior distribution (2) is proper precisely when

c(X, Y) := ∫_{R^p} ∏_{i=1}^n Φ(X_i^T β)^{Y_i} (1 − Φ(X_i^T β))^{1−Y_i} ω(β) dβ < ∞.

When Q is positive definite, ω(β) is a proper normal density, and the posterior is automatically proper. If Q = 0, then propriety is not guaranteed. Define X* as the n × p matrix whose ith row is X_i^T if Y_i = 1, and −X_i^T if Y_i = 0. Chen and Shao (2000) proved that, when the prior is flat (i.e., Q = 0), the following two conditions are necessary and sufficient for posterior propriety:

(C1) X* has full column rank;

(C2) There exists a vector a = (a_1, a_2, ..., a_n)^T ∈ R^n such that a_i > 0 for all i, and X*^T a = 0.

Until further notice, we will assume that the posterior is proper.
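Conditions C1 and C2 are mechanically checkable for a given data set. The sketch below (our own illustration; the function name is hypothetical) verifies C1 with a rank computation and C2 with a linear programming feasibility problem, using the fact that positivity of a is scale-invariant, so one may search for a_i ≥ 1:

```python
import numpy as np
from scipy.optimize import linprog

def posterior_proper_flat_prior(X, Y):
    """Check Chen and Shao's (2000) conditions C1 and C2 for the flat prior.
    X_star has ith row X_i if Y_i = 1 and -X_i if Y_i = 0."""
    X_star = np.where(Y[:, None] == 1, X, -X)
    n, p = X_star.shape
    # C1: X_star has full column rank.
    if np.linalg.matrix_rank(X_star) < p:
        return False
    # C2: exists a with a_i > 0 and X_star^T a = 0.  Since any such a can be
    # rescaled so min a_i >= 1, test feasibility of {a_i >= 1, X_star^T a = 0}.
    res = linprog(c=np.zeros(n), A_eq=X_star.T, b_eq=np.zeros(p),
                  bounds=[(1, None)] * n)
    return res.status == 0
```

For instance, with an intercept-only design, the check returns True when both successes and failures occur, and False when all responses are successes.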

A&C's algorithm to draw from the intractable posterior is based on the following latent data model. Given X and B, let {(Y_i, Z_i)}_{i=1}^n be a sequence of independent random vectors such that Y_i | Z_i, X, B is a point mass at 1_{R₊}(Z_i), and Z_i | X, B ~ N(X_i^T B, 1). Clearly, under this hierarchical structure, Y_i | X_i, B ~ Bernoulli(Φ(X_i^T B)), which is consistent with the original model. Thus, if we let π_{B,Z|Y,X}(β, z | Y, X) denote the corresponding augmented posterior density, where Z = (Z_1, Z_2, ..., Z_n)^T, then it's clear that

∫_{R^n} π_{B,Z|Y,X}(β, z | Y, X) dz = π_{B|Y,X}(β | Y, X),

which is the target posterior from (2). Albert and Chib's algorithm is simply a two-variable Gibbs sampler based on π_{B,Z|Y,X}(β, z | Y, X). Indeed, the Mtd, k_AC : R^p × R^p → R₊, is defined as

k_AC(β, β′) := k_AC(β, β′; Y, X) = ∫_{R^n} π_{B|Z,Y,X}(β′ | z, Y, X) π_{Z|B,Y,X}(z | β, Y, X) dz.

One can show that B | Z, Y, X ~ N_p(Σ^{-1}(X^T Z + Qv), Σ^{-1}), where, again, Σ = X^T X + Q. Moreover, the density π_{Z|B,Y,X}(z | β, Y, X) is a product of n univariate densities, where Z_i | B, Y, X ~ TN(X_i^T B, 1; Y_i). Obviously, these are the conditional densities that appear in the algorithm described in the Introduction.

3.2 A centered drift function

Roy and Hobert (2007) and Chakraborty and Khare (2017) both used the drift function V_0(β) = ‖Σ^{1/2} β‖². While this drift function is certainly amenable to analysis, it is not centered in any practical sense. Indeed, V_0(β) takes on its minimum when β = 0, but, in general, there is no reason to expect A&C's Markov chain to make frequent visits to the vicinity of the origin. This heuristic is borne out by a result in Section 5 showing that V_0 cannot lead to stable d&m in the large n, small p case. As an alternative to V_0(β), we consider drift functions of the form

V(β) = ‖M(β − β*)‖²,   (9)

where M = M(X, Y) is a matrix with p columns, and β* = β*(X, Y) is a point in R^p that is attractive to A&C's Markov chain. A candidate for β* would be a fixed point of the chain, that is, a value B_F satisfying

∫_{R^p} β′ k_AC(B_F, β′) dβ′ = B_F.   (10)

Indeed, this equation suggests that the chain tends to stay around the point B_F, and hence B_F might be an effective point at which to center the drift function. The following result shows that there is, in fact, a unique solution to (10).

Proposition 4. The posterior density, π_{B|Y,X}(β | Y, X), possesses a unique mode, and this mode is the unique solution to (10).

Proof. Note that

log π_{B|Y,X}(β | Y, X) = log π_{Y|B,X}(Y | β, X) + log ω(β) + constant.

It is well known that log ω(β) is concave, and that log π_{Y|B,X}(Y | β, X), as a function of β, is strictly concave. Therefore, since the posterior density is proper, it must possess a unique mode in R^p. Now, the mode is the solution to the following equation:

(d/dβ) { log ω(β) + log π_{Y|B,X}(Y | β, X) } = 0,   (11)

where

log ω(β) = −(1/2)(β − v)^T Q (β − v) + constant,

and

log π_{Y|B,X}(Y | β, X) = Σ_{i=1}^n log{ Φ(X_i^T β) 1_{{1}}(Y_i) + (1 − Φ(X_i^T β)) 1_{{0}}(Y_i) } + constant.

Thus, (11) is equivalent to the following equation:

Σ_{i=1}^n [ (φ(X_i^T β)/Φ(X_i^T β)) 1_{{1}}(Y_i) − (φ(X_i^T β)/(1 − Φ(X_i^T β))) 1_{{0}}(Y_i) ] X_i − Q(β − v) = 0.   (12)

On the other hand, a fixed point solves the following equation in β:

∫_{R^p} β′ ∫_{R^n} π_{B|Z,Y,X}(β′ | z, Y, X) π_{Z|B,Y,X}(z | β, Y, X) dz dβ′ = β.   (13)

Recall that B | Z, Y, X ~ N_p(Σ^{-1}(Qv + X^T Z), Σ^{-1}). Thus, (13) is equivalent to

Qv + Σ_{i=1}^n X_i E(Z_i | B = β, Y, X) = Σβ.   (14)

By (27) in Subsection A.2, we have

E(Z_i | B = β, Y, X) = X_i^T β + (φ(X_i^T β)/Φ(X_i^T β)) 1_{{1}}(Y_i) − (φ(X_i^T β)/(1 − Φ(X_i^T β))) 1_{{0}}(Y_i),

where φ is the pdf of the standard normal distribution. Routine calculation shows that (14) is equivalent to (12).

Remark 5. The posterior density, $\pi_{B|Y,X}(\beta \mid Y, X)$, is invariant for A&C's Markov chain, so it should not be too surprising that the chain tends to hang around the posterior mode. On the other hand, the fact that the posterior mode is an exact, unique fixed point for the chain is a bit unexpected.

We will consider two different versions of (9), both centered at $B_F$. The first, which will be used in the large $n$, small $p$ case, is simply a centered version of $V_0$ given by

$V_1(\beta) = \|\Sigma^{1/2}(\beta - B_F)\|^2.$

In the large $p$, small $n$ case, we assume that $Q$ is positive definite (which is necessary for posterior propriety) and that $X$ has full row rank, and we use the following drift function:

$V_2(\beta) = \|(X\Sigma^{-1}X^T)^{-1/2} X(\beta - B_F)\|^2.$

In the next section, we establish two sets of d&m conditions for the A&C chain based on $V_1$ and $V_2$.

4 Results for the Albert and Chib Chain, Part I: Fixed n and p

4.1 Drift inequalities for $V_1$ and $V_2$

Define $g : \mathbb{R} \to \mathbb{R}$ as

$g(\theta) = \frac{\theta\phi(\theta)}{\Phi(\theta)} + \frac{\phi(\theta)^2}{\Phi(\theta)^2}.$

For any $\beta \in \mathbb{R}^p$, let $D(\beta)$ denote an $n \times n$ diagonal matrix with $i$th diagonal element

$1 - g(X_i^T\beta)\, 1_{\{1\}}(Y_i) - g(-X_i^T\beta)\, 1_{\{0\}}(Y_i).$

Lemma 6. If $V(\beta) = \|M(\beta - B_F)\|^2$, where $M$ is any matrix with $p$ columns, then

$\int_{\mathbb{R}^p} V(\beta')\, k_{AC}(\beta, \beta')\, d\beta' \le \sup_{t \in (0,1)} \big\|M\Sigma^{-1}X^T D\big(B_F + t(\beta - B_F)\big) X(\beta - B_F)\big\|^2 + 2\,\mathrm{tr}\{M\Sigma^{-1}M^T\}.$

Proof. Note that

$\int_{\mathbb{R}^p} V(\beta')\, \pi_{B|Z,Y,X}(\beta' \mid z, Y, X)\, d\beta' = \|M\Sigma^{-1}(X^T z + Qv) - MB_F\|^2 + \mathrm{tr}\{M\Sigma^{-1}M^T\}.$

Moreover,

$\int_{\mathbb{R}^n} \|M\Sigma^{-1}(X^T z + Qv) - MB_F\|^2\, \pi_{Z|B,Y,X}(z \mid \beta, Y, X)\, dz$
$= \|M\Sigma^{-1}\{X^T \mathrm{E}(Z \mid B = \beta, Y, X) + Qv\} - MB_F\|^2 + \mathrm{tr}\{M\Sigma^{-1}X^T \mathrm{var}(Z \mid B = \beta, Y, X)\, X\Sigma^{-1}M^T\}. \qquad (15)$

For two symmetric matrices of the same size, $M_1$ and $M_2$, we write $M_1 \preceq M_2$ if $M_2 - M_1$ is non-negative definite. By Lemma 23 in Subsection A.2, $\mathrm{var}(Z \mid B = \beta, Y, X) \preceq I_n$. It follows that

$M\Sigma^{-1}X^T \mathrm{var}(Z \mid B = \beta, Y, X)\, X\Sigma^{-1}M^T \preceq M\Sigma^{-1}X^T X\Sigma^{-1}M^T \preceq M\Sigma^{-1}M^T. \qquad (16)$

Therefore,

$\int_{\mathbb{R}^p} V(\beta')\, k_{AC}(\beta, \beta')\, d\beta' = \int_{\mathbb{R}^p} V(\beta') \int_{\mathbb{R}^n} \pi_{B|Z,Y,X}(\beta' \mid z, Y, X)\, \pi_{Z|B,Y,X}(z \mid \beta, Y, X)\, dz\, d\beta'$
$\le \|M\Sigma^{-1}\{X^T \mathrm{E}(Z \mid B = \beta, Y, X) + Qv\} - MB_F\|^2 + 2\,\mathrm{tr}\{M\Sigma^{-1}M^T\}. \qquad (17)$

Now, for $\alpha \in \mathbb{R}^p$, define

$\mu(\alpha) = M\Sigma^{-1}\{X^T \mathrm{E}(Z \mid B = B_F + \alpha, Y, X) + Qv\} - MB_F.$

Since $B_F$ is a fixed point, we have

$\mu(0) = M\int_{\mathbb{R}^p} \beta\, k_{AC}(B_F, \beta)\, d\beta - MB_F = 0.$

By the mean value theorem for vector-valued functions (see, e.g., Rudin, 1976, Theorem 5.19), for any $\alpha \in \mathbb{R}^p$,

$\|\mu(\alpha)\|^2 \le \sup_{t \in (0,1)} \left\|\frac{\partial \mu(t\alpha)}{\partial t}\right\|^2.$

Now, by (27) in Subsection A.2,

$\frac{\partial \mu(t\alpha)}{\partial t} = M\Sigma^{-1}\sum_{i=1}^n X_i\, \frac{\partial}{\partial t}\, \mathrm{E}(Z_i \mid B = B_F + t\alpha, Y, X)$
$= M\Sigma^{-1}\sum_{i=1}^n X_i\, (X_i^T\alpha)\, \frac{d}{d\theta}\left[\theta + \frac{\phi(\theta)}{\Phi(\theta)}\, 1_{\{1\}}(Y_i) - \frac{\phi(\theta)}{1 - \Phi(\theta)}\, 1_{\{0\}}(Y_i)\right]_{\theta = X_i^T(B_F + t\alpha)}$
$= M\Sigma^{-1}X^T D(B_F + t\alpha)\, X\alpha.$

Hence,

$\|\mu(\alpha)\|^2 \le \sup_{t \in (0,1)} \|M\Sigma^{-1}X^T D(B_F + t\alpha)\, X\alpha\|^2. \qquad (18)$

The result then follows from (17) and (18) by taking $\alpha = \beta - B_F$.

We now use Lemma 6 to establish explicit drift inequalities for $V_1$ and $V_2$. We begin with $V_1$.

Proposition 7. For all $\beta \in \mathbb{R}^p$, we have

$\int_{\mathbb{R}^p} V_1(\beta')\, k_{AC}(\beta, \beta')\, d\beta' \le \left[\sup_{t \in (0,1)} \sup_{\alpha \ne 0} \frac{\|\Sigma^{-1/2}X^T D(B_F + t\alpha)\, X\alpha\|^2}{\|\Sigma^{1/2}\alpha\|^2}\right] V_1(\beta) + 2p.$

Proof. Taking $M = \Sigma^{1/2}$ in Lemma 6 yields

$\int_{\mathbb{R}^p} V_1(\beta')\, k_{AC}(\beta, \beta')\, d\beta' \le \sup_{t \in (0,1)} \big\|\Sigma^{-1/2}X^T D\big(B_F + t(\beta - B_F)\big) X(\beta - B_F)\big\|^2 + 2p$
$\le \left[\sup_{t \in (0,1)} \sup_{\alpha \ne 0} \frac{\|\Sigma^{-1/2}X^T D(B_F + t\alpha)\, X\alpha\|^2}{\|\Sigma^{1/2}\alpha\|^2}\right] V_1(\beta) + 2p.$

In Subsection B.2, we prove that

$\sup_{t \in (0,1)} \sup_{\alpha \ne 0} \frac{\|\Sigma^{-1/2}X^T D(B_F + t\alpha)\, X\alpha\|^2}{\|\Sigma^{1/2}\alpha\|^2} < 1.$

For a symmetric matrix $M$, let $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ denote the smallest and largest eigenvalues of $M$, respectively. Here is the analogue of Proposition 7 for $V_2$.

Proposition 8. Assume that $X$ has full row rank. Then for all $\beta \in \mathbb{R}^p$, we have

$\int_{\mathbb{R}^p} V_2(\beta')\, k_{AC}(\beta, \beta')\, d\beta' \le \lambda_{\max}^2(X\Sigma^{-1}X^T)\, V_2(\beta) + 2n.$

Proof. Taking $M = (X\Sigma^{-1}X^T)^{-1/2}X$ in Lemma 6 and applying Lemma 23 yields

$\int_{\mathbb{R}^p} V_2(\beta')\, k_{AC}(\beta, \beta')\, d\beta' \le \sup_{t \in (0,1)} \big\|(X\Sigma^{-1}X^T)^{1/2} D\big(B_F + t(\beta - B_F)\big) X(\beta - B_F)\big\|^2 + 2n$
$\le \sup_{t \in (0,1)} \lambda_{\max}^2\big\{(X\Sigma^{-1}X^T)^{1/2} D\big(B_F + t(\beta - B_F)\big) (X\Sigma^{-1}X^T)^{1/2}\big\}\, V_2(\beta) + 2n$
$\le \lambda_{\max}^2(X\Sigma^{-1}X^T)\, V_2(\beta) + 2n.$

4.2 Drift and minorization for the Albert and Chib chain based on $V_1$

In this subsection, we exploit the flipped chain idea described in Section 2. In particular, $V_1$ is used to establish d&m conditions for a flipped chain that has the same geometric convergence rate as A&C's chain. Later, in Section 5, we will use these results to prove asymptotic stability as $n \to \infty$.
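Unlike the supremum in Proposition 7, the drift coefficient in Proposition 8, $\lambda_{\max}^2(X\Sigma^{-1}X^T)$, is a single eigenvalue computation. A quick numerical check (a sketch, with simulated data and an assumed prior precision $Q = I_p$) confirms that it is strictly below 1 when $Q$ is positive definite, even with $p$ much larger than $n$:

```python
import numpy as np

# Drift coefficient from Proposition 8: lambda = lambda_max(X Sigma^{-1} X')^2.
rng = np.random.default_rng(1)
n, p = 5, 50                       # large p, small n; X has full row rank a.s.
X = rng.standard_normal((n, p))
Q = np.eye(p)                      # positive definite prior precision (assumed)
Sigma = X.T @ X + Q
M = X @ np.linalg.solve(Sigma, X.T)            # X Sigma^{-1} X', an n x n matrix
lam = np.linalg.eigvalsh(M).max() ** 2
assert 0 < lam < 1                 # strict since Q is positive definite
```

The strict inequality holds because $\Sigma \succ X^T X$ whenever $Q \succ 0$, so every eigenvalue of $X\Sigma^{-1}X^T$ is below 1.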

Note that $\pi_{B|Z,Y,X}(\beta \mid Z, Y, X)$ depends on the $n$-dimensional vector $Z$ only through $X^T Z$, which is a one-to-one function of the following $p$-dimensional vector:

$\Gamma := \Sigma^{1/2}\big\{\Sigma^{-1}(X^T Z + Qv) - B_F\big\}.$

Hence, we can represent the Mtd of the A&C chain as follows:

$k_{AC}(\beta, \beta') = \int_{\mathbb{R}^p} \pi_{B|\Gamma,Y,X}(\beta' \mid \gamma, Y, X)\, \pi_{\Gamma|B,Y,X}(\gamma \mid \beta, Y, X)\, d\gamma.$

Recalling the discussion in Section 2, this maneuver seems to represent progress, since we have replaced $n$ with $p$. However, it is difficult to establish a minorization condition using this version of $k_{AC}(\beta, \beta')$ because $\pi_{\Gamma|B,Y,X}(\gamma \mid \beta, Y, X)$ lacks a closed form. On the other hand, this chain has the same geometric convergence rate as the flipped chain defined by the following Mtd:

$\tilde{k}_{AC}(\gamma, \gamma') := \tilde{k}_{AC}(\gamma, \gamma'; Y, X) = \int_{\mathbb{R}^p} \pi_{\Gamma|B,Y,X}(\gamma' \mid \beta, Y, X)\, \pi_{B|\Gamma,Y,X}(\beta \mid \gamma, Y, X)\, d\beta.$

Constructing a minorization condition for this Mtd is much less daunting since $B \mid \Gamma, Y, X \sim \mathrm{N}(\Sigma^{-1/2}\Gamma + B_F, \Sigma^{-1})$. Here is the main result of this subsection.

Proposition 9. Let $\tilde{V}_1(\gamma) = \|\gamma\|^2$. The Mtd $\tilde{k}_{AC}$ satisfies the drift condition

$\int_{\mathbb{R}^p} \tilde{V}_1(\gamma')\, \tilde{k}_{AC}(\gamma, \gamma')\, d\gamma' \le \lambda \tilde{V}_1(\gamma) + L,$

where $L = p(1 + \lambda)$, and

$\lambda = \sup_{t \in (0,1)} \sup_{\alpha \ne 0} \frac{\|\Sigma^{-1/2}X^T D(B_F + t\alpha)\, X\alpha\|^2}{\|\Sigma^{1/2}\alpha\|^2}.$

Moreover, for $d > 2L/(1 - \lambda)$, $\tilde{k}_{AC}$ satisfies the minorization condition

$\tilde{k}_{AC}(\gamma, \gamma') \ge \varepsilon q(\gamma') \quad \text{whenever } \tilde{V}_1(\gamma) \le d,$

where $q : \mathbb{R}^p \to [0, \infty)$ is a pdf, and $\varepsilon = 2^{-p/2} e^{-d}$.

Proof. We begin with the drift. It is easy to verify that

$\int_{\mathbb{R}^p} V_1(\beta)\, \pi_{B|\Gamma,Y,X}(\beta \mid \gamma, Y, X)\, d\beta = \tilde{V}_1(\gamma) + p.$

We now use the techniques described at the end of Section 2 to convert the drift inequality in Proposition 7 into a drift inequality for the flipped chain. We know that

$\int_{\mathbb{R}^p}\int_{\mathbb{R}^p} V_1(\beta')\, \pi_{B|\Gamma,Y,X}(\beta' \mid \gamma', Y, X)\, \pi_{\Gamma|B,Y,X}(\gamma' \mid \beta, Y, X)\, d\gamma'\, d\beta' \le \lambda V_1(\beta) + 2p.$

Multiplying both sides of the above inequality by $\pi_{B|\Gamma,Y,X}(\beta \mid \gamma, Y, X)$ and integrating with respect to $\beta$ yields

$\int_{\mathbb{R}^p} \tilde{V}_1(\gamma')\, \tilde{k}_{AC}(\gamma, \gamma')\, d\gamma' \le \lambda \tilde{V}_1(\gamma) + p(1 + \lambda).$

As explained in Section 2, to establish the minorization condition, it suffices to show that there exists a pdf $\nu(\beta) := \nu(\beta \mid Y, X)$ such that

$\pi_{B|\Gamma,Y,X}(\beta \mid \gamma, Y, X) \ge \varepsilon\nu(\beta) \qquad (19)$

whenever $\tilde{V}_1(\gamma) \le d$. Recall that $B \mid \Gamma, Y, X \sim \mathrm{N}(\Sigma^{-1/2}\Gamma + B_F, \Sigma^{-1})$. Define

$\nu_1(\beta) = \inf_{\gamma : \tilde{V}_1(\gamma) \le d} \pi_{B|\Gamma,Y,X}(\beta \mid \gamma, Y, X) = \inf_{\gamma : \tilde{V}_1(\gamma) \le d} \frac{|\Sigma|^{1/2}}{(2\pi)^{p/2}} \exp\left\{-\frac{1}{2}\big\|\Sigma^{1/2}(\beta - B_F) - \gamma\big\|^2\right\}.$

Then $\nu(\beta) = \nu_1(\beta) / \int_{\mathbb{R}^p} \nu_1(\beta)\, d\beta$ is a pdf, and whenever $\tilde{V}_1(\gamma) \le d$,

$\pi_{B|\Gamma,Y,X}(\beta \mid \gamma, Y, X) \ge \left[\int_{\mathbb{R}^p} \nu_1(\beta)\, d\beta\right] \nu(\beta).$

This is (19) with

$\varepsilon = \int_{\mathbb{R}^p} \nu_1(\beta)\, d\beta = (2\pi)^{-p/2} \int_{\mathbb{R}^p} \inf_{\gamma : \|\gamma\|^2 \le d} \exp\left\{-\frac{1}{2}\|\beta - \gamma\|^2\right\} d\beta,$

where the second equality follows from the change of variables $\beta \mapsto \Sigma^{1/2}(\beta - B_F)$. Finally, since $\|\beta - \gamma\|^2 \le 2(\|\beta\|^2 + \|\gamma\|^2)$,

$\varepsilon \ge (2\pi)^{-p/2}\, e^{-d} \int_{\mathbb{R}^p} \exp\{-\|\beta\|^2\}\, d\beta = 2^{-p/2} e^{-d}.$

Mainly, Proposition 9 will be used to establish asymptotic stability results for A&C's chain in the large $n$, small $p$ case. Indeed, since the Markov chains defined by $k_{AC}$ and $\tilde{k}_{AC}$ have the same geometric convergence rate, $\hat{\rho}$ calculated using (5) with $\lambda$, $L$, and $\varepsilon$ from Proposition 9 is an upper bound on $\rho = \rho(X, Y)$ for the A&C chain.
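The constants in Proposition 9 are elementary functions of $\lambda$ and $p$, so they are easy to tabulate. The sketch below (with an illustrative choice of $\lambda$ and a hypothetical `slack` factor to enforce $d > 2L/(1-\lambda)$) shows how quickly $\varepsilon = 2^{-p/2}e^{-d}$ collapses as $p$ grows, which is precisely why the minorization is built on the $p$-dimensional flipped chain rather than in $n$ dimensions:

```python
import math

def dm_constants(lam, p, slack=1.1):
    """Drift/minorization constants from Proposition 9 (sketch).

    Requires 0 <= lam < 1; d must strictly exceed 2L/(1 - lam), enforced
    here via the (illustrative) multiplicative slack factor.
    """
    L = p * (1 + lam)                 # L = p(1 + lambda)
    d = slack * 2 * L / (1 - lam)     # any d > 2L/(1 - lambda) is admissible
    eps = 2 ** (-p / 2) * math.exp(-d)
    return L, d, eps

for p in (1, 5, 20):
    L, d, eps = dm_constants(0.5, p)
    print(f"p={p}: L={L}, d={d:.2f}, eps={eps:.3e}")
```

Since $L$ grows linearly in $p$ and $d$ grows with $L$, $\varepsilon$ decays geometrically in $p$; the corresponding construction in Subsection 4.3 pays the same price in $n$ instead.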

On the other hand, Proposition 9 can also be used in conjunction with Theorem 1 and Corollary 3 to get computable bounds on the total variation distance to stationarity for the A&C chain for fixed $n$ and $p$. In order to state the result, we require a bit of notation. For an integer $m \ge 1$, let $k^m_{AC} : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}_+$ denote the chain's $m$-step Mtd. For $\beta \in \mathbb{R}^p$, let $\varphi(\beta) = \mathrm{E}(Z \mid B = \beta, Y, X) \in \mathbb{R}^n$. Then results in Subsection A.2 show that the $i$th element of $\varphi(\beta)$ is given by

$X_i^T\beta + \frac{\phi(X_i^T\beta)}{\Phi(X_i^T\beta)}\, 1_{\{1\}}(Y_i) - \frac{\phi(X_i^T\beta)}{1 - \Phi(X_i^T\beta)}\, 1_{\{0\}}(Y_i).$

Proposition 10. If $\hat{\rho}$ is calculated using (5) with $\lambda$, $L$, and $\varepsilon$ from Proposition 9, then for $m \ge 1$ and $\beta \in \mathbb{R}^p$,

$\int_{\mathbb{R}^p} \big|k^m_{AC}(\beta, \beta'; Y, X) - \pi_{B|Y,X}(\beta' \mid Y, X)\big|\, d\beta' \le H(\beta)\, \hat{\rho}^{\,m-1},$

where

$H(\beta) = 2 + \frac{L}{1 - \lambda} + \mathrm{tr}(X\Sigma^{-1}X^T) + \big\|\Sigma^{1/2}\big\{\Sigma^{-1}(X^T\varphi(\beta) + Qv) - B_F\big\}\big\|^2.$

Proof. We simply apply Corollary 3. First, putting $M = \Sigma^{1/2}$ in (15), we have

$\int_{\mathbb{R}^p} \tilde{V}_1(\gamma)\, \pi_{\Gamma|B,Y,X}(\gamma \mid \beta, Y, X)\, d\gamma = \int_{\mathbb{R}^n} \big\|\Sigma^{1/2}\big\{\Sigma^{-1}(X^T z + Qv) - B_F\big\}\big\|^2\, \pi_{Z|B,Y,X}(z \mid \beta, Y, X)\, dz$
$= \big\|\Sigma^{1/2}\big\{\Sigma^{-1}(X^T\varphi(\beta) + Qv) - B_F\big\}\big\|^2 + \mathrm{tr}\big\{\Sigma^{-1/2}X^T \mathrm{var}(Z \mid B = \beta, Y, X)\, X\Sigma^{-1/2}\big\}.$

Finally, a calculation similar to (16) shows that

$\mathrm{tr}\big\{\Sigma^{-1/2}X^T \mathrm{var}(Z \mid B = \beta, Y, X)\, X\Sigma^{-1/2}\big\} \le \mathrm{tr}(X\Sigma^{-1}X^T),$

and the result follows.

Calculating $\lambda$ in Proposition 9 calls for maximization of a function on $(0,1) \times \mathbb{R}^p$, which may be difficult. Here we provide an upper bound on $\lambda$ that is easy to compute when $p$ is small. Let $\{S_j\}_{j=1}^{2^p}$ denote the open orthants of $\mathbb{R}^p$. For instance, if $p = 2$, then $S_1$, $S_2$, $S_3$ and $S_4$ are the open quadrants of the real plane. Define $W(S_j) = W(S_j; X, Y)$ as follows:

$W(S_j) = \sum_{X_i \in S_j} X_i\, 1_{\{0\}}(Y_i)\, X_i^T + \sum_{X_i \in -S_j} X_i\, 1_{\{1\}}(Y_i)\, X_i^T.$

The following result is proven in Subsection B.3.

Proposition 11. An upper bound on $\lambda^{1/2}$ in Proposition 9 is

$\lambda_{\max}\big(\Sigma^{-1/2}X^T X\Sigma^{-1/2}\big) - \frac{2}{\pi}\, \min_{1 \le j \le 2^p} \lambda_{\min}\big(\Sigma^{-1/2}\, W(S_j)\, \Sigma^{-1/2}\big).$

If this upper bound is strictly less than 1 (which is always true when $Q$ is positive definite), then its square can be used in place of $\lambda$ in Proposition 9.
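The bound in Proposition 11 is a matter of sorting the rows of $X$ into orthants and assembling eigenvalues. A sketch (assuming $W(S_j)$ as displayed above, and that no $X_i$ has a zero coordinate, so rows on orthant boundaries do not arise; the function name is illustrative):

```python
import numpy as np
from itertools import product

def prop11_bound(X, y, Q):
    """Upper bound on lambda^{1/2} from Proposition 11 (sketch).

    Assumes every row of X lies in an open orthant (no zero coordinates).
    """
    n, p = X.shape
    Sigma = X.T @ X + Q
    w, U = np.linalg.eigh(Sigma)
    Sig_isqrt = U @ np.diag(w ** -0.5) @ U.T          # Sigma^{-1/2}
    first = np.linalg.eigvalsh(Sig_isqrt @ X.T @ X @ Sig_isqrt).max()
    mins = []
    for signs in product([1, -1], repeat=p):          # one sign pattern per orthant S_j
        in_Sj = np.all(X * signs > 0, axis=1)         # X_i in S_j
        in_negSj = np.all(X * signs < 0, axis=1)      # X_i in -S_j
        rows0 = X[in_Sj & (y == 0)]                   # contribute via 1_{0}(Y_i)
        rows1 = X[in_negSj & (y == 1)]                # contribute via 1_{1}(Y_i)
        W = rows0.T @ rows0 + rows1.T @ rows1         # W(S_j)
        mins.append(np.linalg.eigvalsh(Sig_isqrt @ W @ Sig_isqrt).min())
    return first - (2 / np.pi) * min(mins)
```

For the intercept-only model discussed in Section 5 ($p = 1$, $X_i \equiv 1$), this reduces to $n/(n+Q) - (2/\pi)\, n\{\hat{p}_n \wedge (1-\hat{p}_n)\}/(n+Q)$, which can be used as a check.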

4.3 Drift and minorization for the Albert and Chib chain based on $V_2$

The A&C chain has the same convergence rate as the flipped chain defined by the following Mtd:

$\check{k}_{AC}(z, z') := \check{k}_{AC}(z, z'; Y, X) = \int_{\mathbb{R}^p} \pi_{Z|B,Y,X}(z' \mid \beta, Y, X)\, \pi_{B|Z,Y,X}(\beta \mid z, Y, X)\, d\beta.$

In this subsection, we use $V_2$ to establish d&m conditions for this chain, and these will be used later to prove asymptotic stability as $p \to \infty$. First, for $z \in \mathbb{R}^n$, define

$w(z) = (X\Sigma^{-1}X^T)^{-1/2}\, X\big\{\Sigma^{-1}(X^T z + Qv) - B_F\big\}.$

Now define $\check{V}_2 : \mathbb{R}^n \to [0, \infty)$ as $\check{V}_2(z) = \|w(z)\|^2$.

Proposition 12. The Mtd $\check{k}_{AC}$ satisfies the drift condition

$\int_{\mathbb{R}^n} \check{V}_2(z')\, \check{k}_{AC}(z, z')\, dz' \le \lambda \check{V}_2(z) + L, \qquad (20)$

where $\lambda = \lambda_{\max}^2(X\Sigma^{-1}X^T)$ and $L = n(1 + \lambda)$. Moreover, for $d > 2L/(1 - \lambda)$, $\check{k}_{AC}$ satisfies the minorization condition

$\check{k}_{AC}(z, z') \ge \varepsilon q(z') \quad \text{whenever } \check{V}_2(z) \le d,$

where $q : \mathbb{R}^n \to [0, \infty)$ is a pdf, and $\varepsilon = 2^{-n/2} e^{-d}$.

Proof. It is easy to verify that

$\int_{\mathbb{R}^p} V_2(\beta)\, \pi_{B|Z,Y,X}(\beta \mid z, Y, X)\, d\beta = \check{V}_2(z) + n.$

We know from Proposition 8 that

$\int_{\mathbb{R}^p} V_2(\beta') \left\{\int_{\mathbb{R}^n} \pi_{B|Z,Y,X}(\beta' \mid z, Y, X)\, \pi_{Z|B,Y,X}(z \mid \beta, Y, X)\, dz\right\} d\beta' \le \lambda_{\max}^2(X\Sigma^{-1}X^T)\, V_2(\beta) + 2n.$

Multiplying both sides of the above inequality by $\pi_{B|Z,Y,X}(\beta \mid z, Y, X)$ and integrating with respect to $\beta$ yields (20).

We now move on to the minorization condition. Note that $\pi_{Z|B,Y,X}(z \mid B, Y, X)$ depends on $B$ only through $XB$, which is a one-to-one function of the $n$-dimensional vector

$A := (X\Sigma^{-1}X^T)^{-1/2}\, X(B - B_F).$

Hence, $\check{k}_{AC}(z, z')$ can be re-expressed as

$\check{k}_{AC}(z, z') = \int_{\mathbb{R}^n} \pi_{Z|A,Y,X}(z' \mid \alpha, Y, X)\, \pi_{A|Z,Y,X}(\alpha \mid z, Y, X)\, d\alpha,$

where $A \mid Z, Y, X \sim \mathrm{N}(w(Z), I_n)$. To get the minorization condition, we will construct a pdf $\nu(\alpha) := \nu(\alpha \mid Y, X)$ such that

$\pi_{A|Z,Y,X}(\alpha \mid z, Y, X) \ge \varepsilon\nu(\alpha)$

whenever $\check{V}_2(z) \le d$. Define

$\nu_1(\alpha) = \inf_{z : \check{V}_2(z) \le d} \pi_{A|Z,Y,X}(\alpha \mid z, Y, X) = \inf_{w : \|w\|^2 \le d} (2\pi)^{-n/2} \exp\left\{-\frac{1}{2}\|\alpha - w\|^2\right\}.$

Then $\nu(\alpha) = \nu_1(\alpha) / \int_{\mathbb{R}^n} \nu_1(\alpha)\, d\alpha$ is a pdf, and

$\varepsilon = (2\pi)^{-n/2} \int_{\mathbb{R}^n} \inf_{w : \|w\|^2 \le d} \exp\left\{-\frac{1}{2}\|\alpha - w\|^2\right\} d\alpha \ge (2\pi)^{-n/2}\, e^{-d} \int_{\mathbb{R}^n} \exp\{-\|\alpha\|^2\}\, d\alpha = 2^{-n/2} e^{-d}.$

The next result is the analogue of Proposition 10. The proof is omitted, as it is essentially the same as the proof of the said proposition.

Proposition 13. Assume that $X$ has full row rank. Let $\lambda$, $L$, and $\varepsilon$ be as in Proposition 12. If $\hat{\rho}$ is calculated using (5), then for $m \ge 1$ and $\beta \in \mathbb{R}^p$,

$\int_{\mathbb{R}^p} \big|k^m_{AC}(\beta, \beta'; Y, X) - \pi_{B|Y,X}(\beta' \mid Y, X)\big|\, d\beta' \le H(\beta)\, \hat{\rho}^{\,m-1},$

where

$H(\beta) = 2 + \frac{L}{1 - \lambda} + \mathrm{tr}(X\Sigma^{-1}X^T) + \big\|(X\Sigma^{-1}X^T)^{-1/2}\, X\big\{\Sigma^{-1}(X^T\varphi(\beta) + Qv) - B_F\big\}\big\|^2.$

Let $\rho = \rho(X, Y)$ denote the geometric convergence rate of the A&C chain. In the next section, which is the heart of the paper, we develop general convergence complexity results showing that, under weak regularity conditions, $\rho$ is bounded away from 1 both as $n \to \infty$ for fixed $p$, and as $p \to \infty$ for fixed $n$.

5 Results for the Albert and Chib Chain, Part II: Asymptotics

5.1 Large n, small p

In this section, we consider the case where $p$ is fixed and $n$ grows. In particular, we are interested in what happens to the geometric convergence rate of the A&C chain in this setting. Recall that the prior on $B$ is $\omega(\beta) \propto \exp\{-(\beta - v)^T Q(\beta - v)/2\}$.

Since $p$ is fixed, so is the prior. Hence, the hyperparameters $v$ and $Q$ will remain fixed throughout this subsection. In Subsection 3.1, we introduced the data set $D := \{(X_i, Y_i)\}_{i=1}^n$. We now consider a sequence of data sets, $D_n := \{(X_i, Y_i)\}_{i=1}^n$. So, each time $n$ increases by 1, we are given a new $p \times 1$ covariate vector, $X_i$, and a corresponding binary response, $Y_i$. In order to study the asymptotics, we assume that the $(X_i, Y_i)$ pairs are generated according to a random mechanism that is consistent with the probit regression model. In particular, we make the following assumptions:

(A1) The pairs $\{(X_i, Y_i)\}_{i=1}^\infty$ are iid random vectors such that $Y_i \mid X_i \sim \mathrm{Bernoulli}(G(X_i))$, where $G : \mathbb{R}^p \to (0, 1)$ is a measurable function;

(A2) $\mathrm{E}(X_1 X_1^T)$ is finite and positive definite;

(A3) For $j \in \{1, 2, \ldots, 2^p\}$ and $\beta \ne 0$, $P\big(1_{S_j}(X_1)\, X_1^T\beta \ne 0\big) > 0$.

Assumption (A1) contains the probit model as a special case. Thus, our results concerning the asymptotic behavior of A&C's Markov chain do not require strict adherence of the data to the probit regression model. The only reason for assuming (A2) is to guarantee that, almost surely, $X$ will eventually have full column rank. This is a necessary condition for posterior propriety when $Q = 0$. While (A3) is rather technical, it is clearly satisfied if $X_1$ follows a distribution admitting a pdf with respect to Lebesgue measure that is positive over some open ball $\{\gamma \in \mathbb{R}^p : \|\gamma\| < c\}$, where $c > 0$. It is shown in Subsection B.4 that (A3) also allows for an intercept. That is, even if the first component of $X_1$ is 1 (constant), then as long as the remaining $p - 1$ components satisfy the density condition described above, (A3) is still satisfied.

It was assumed throughout Section 3 that the posterior distribution is proper. Of course, for a fixed data set, this is checkable. All we need in the large $n$, small $p$ regime is a guarantee that the posterior is proper for all large $n$, almost surely. The following result is proven in Subsection B.5.

Proposition 14. Under assumptions (A1)-(A3), almost surely, the posterior distribution is proper for all sufficiently large $n$.

For fixed $n$, $D_n := \{(X_i, Y_i)\}_{i=1}^n$ represents the first $n$ random covariate vectors and responses. Let $\rho(D_n)$ denote the (random) geometric convergence rate of the corresponding A&C chain. Here is one of our main results.

Theorem 15. If (A1)-(A3) hold, then there exists a constant $\bar{\rho} < 1$ such that, almost surely,

$\limsup_{n \to \infty} \rho(D_n) \le \bar{\rho}.$

Proof. Let $\hat{\rho}(D_n)$ denote the upper bound on $\rho(D_n)$ that is based on $\lambda$, $L$, and $\varepsilon$ from Proposition 9. We prove the result by showing that, almost surely, $\limsup_{n \to \infty} \hat{\rho}(D_n) \le \bar{\rho} < 1$. Note that $L = p(1 + \lambda)$ and $\varepsilon = 2^{-p/2} e^{-d}$, where $d > 2L/(1 - \lambda)$. Thus, control over $\lambda$ provides control over $L$ and $\varepsilon$ as well. In particular, to prove the result it suffices to show that there exists a constant $c \in [0, 1)$ such that, almost surely,

$\limsup_{n \to \infty} \lambda(D_n) \le c. \qquad (21)$

Noting that $\Sigma^{-1/2}X^T X\Sigma^{-1/2} \preceq I_p$, we have, by Proposition 11,

$\lambda^{1/2}(D_n) \le 1 - \frac{2}{\pi}\, \min_{1 \le j \le 2^p} \lambda_{\min}\left[\left(\frac{\Sigma}{n}\right)^{-1/2}\left\{\frac{1}{n}\sum_{X_i \in S_j} X_i\, 1_{\{0\}}(Y_i)\, X_i^T + \frac{1}{n}\sum_{X_i \in -S_j} X_i\, 1_{\{1\}}(Y_i)\, X_i^T\right\}\left(\frac{\Sigma}{n}\right)^{-1/2}\right]. \qquad (22)$

Fix $j \in \{1, 2, \ldots, 2^p\}$. By (A1) and the strong law, almost surely,

$\lim_{n \to \infty} \left(\frac{\Sigma}{n}\right)^{-1/2}\left\{\frac{1}{n}\sum_{X_i \in S_j} X_i\, 1_{\{0\}}(Y_i)\, X_i^T + \frac{1}{n}\sum_{X_i \in -S_j} X_i\, 1_{\{1\}}(Y_i)\, X_i^T\right\}\left(\frac{\Sigma}{n}\right)^{-1/2}$
$= \big[\mathrm{E}(X_1 X_1^T)\big]^{-1/2}\left\{\mathrm{E}\big(X_1 X_1^T\, 1_{S_j}(X_1)\, 1_{\{0\}}(Y_1)\big) + \mathrm{E}\big(X_1 X_1^T\, 1_{S_j}(-X_1)\, 1_{\{1\}}(Y_1)\big)\right\}\big[\mathrm{E}(X_1 X_1^T)\big]^{-1/2}$
$= \big[\mathrm{E}(X_1 X_1^T)\big]^{-1/2}\left\{\mathrm{E}\big(X_1 X_1^T\, 1_{S_j}(X_1)(1 - G(X_1))\big) + \mathrm{E}\big(X_1 X_1^T\, 1_{S_j}(-X_1)\, G(X_1)\big)\right\}\big[\mathrm{E}(X_1 X_1^T)\big]^{-1/2}.$

It follows from (26) and Lemma 22 that, almost surely, the smallest eigenvalue of the matrix inside the brackets in (22) satisfies

$\lim_{n \to \infty} \lambda_{\min}\big[\,\cdot\,\big] \ge \lambda_{\max}^{-1}\big(\mathrm{E}(X_1 X_1^T)\big)\, \lambda_{\min}\left\{\mathrm{E}\big(X_1 X_1^T\, 1_{S_j}(X_1)(1 - G(X_1))\big) + \mathrm{E}\big(X_1 X_1^T\, 1_{S_j}(-X_1)\, G(X_1)\big)\right\}. \qquad (23)$

By (A2), $\lambda_{\max}^{-1}\big(\mathrm{E}(X_1 X_1^T)\big) > 0$. Hence, by (22) and (23), to show that (21) holds almost surely, it is enough to show that

$\lambda_{\min}\left\{\mathrm{E}\big(X_1 X_1^T\, 1_{S_j}(X_1)(1 - G(X_1))\big) + \mathrm{E}\big(X_1 X_1^T\, 1_{S_j}(-X_1)\, G(X_1)\big)\right\} > 0.$

By (A3), for any $\beta \ne 0$,

$P\big(1_{S_j}(X_1)\, X_1^T\beta \ne 0 \;\text{ or }\; 1_{S_j}(-X_1)\, X_1^T\beta \ne 0\big) > 0. \qquad (24)$

Since $0 < G(X_1) < 1$, (24) implies that, for any $\beta \ne 0$,

$P\Big(\beta^T\big\{X_1 X_1^T\, 1_{S_j}(X_1)(1 - G(X_1))\big\}\beta + \beta^T\big\{X_1 X_1^T\, 1_{S_j}(-X_1)\, G(X_1)\big\}\beta > 0\Big) > 0.$

As a result, for any $\beta \in \mathbb{R}^p$ such that $\|\beta\|^2 = 1$,

$\beta^T\left\{\mathrm{E}\big(X_1 X_1^T\, 1_{S_j}(X_1)(1 - G(X_1))\big) + \mathrm{E}\big(X_1 X_1^T\, 1_{S_j}(-X_1)\, G(X_1)\big)\right\}\beta > 0. \qquad (25)$

It follows from (A2) that the left-hand side of (25) is continuous in $\beta$. Hence, we can take the infimum on both sides of the inequality with respect to $\beta$ and retain the strict inequality, which yields

$\lambda_{\min}\left\{\mathrm{E}\big(X_1 X_1^T\, 1_{S_j}(X_1)(1 - G(X_1))\big) + \mathrm{E}\big(X_1 X_1^T\, 1_{S_j}(-X_1)\, G(X_1)\big)\right\} > 0,$

and the result follows.

Theorem 15 shows that, under weak regularity conditions on the random mechanism that generates $D_n := \{(X_i, Y_i)\}_{i=1}^n$, A&C's MCMC algorithm scales well with $n$. Johndrow et al. (2016) studied the convergence rate of the A&C chain as $n \to \infty$ for a particular fixed sequence of covariates and responses. Suppose that $p = 1$, $Q > 0$ is a constant, $v = 0$, and $X_1 = X_2 = \cdots = X_n = 1$. Johndrow et al. (2016) showed that if all the Bernoulli trials result in success, i.e., $Y_1 = Y_2 = \cdots = Y_n = 1$, then $\lim_{n \to \infty} \rho_*(D_n) = 1$. That is, in terms of the $L^2$-geometric convergence rate, the convergence is arbitrarily slow for sufficiently large $n$. As we now explain, our results can be used to show that, in Johndrow et al.'s (2016) setting, almost any other sequence of responses leads to well-behaved convergence rates.

Let $\{Y_i\}_{i=1}^\infty$ denote a fixed sequence of binary responses, and define $\hat{p}_n = n^{-1}\sum_{i=1}^n Y_i$. It follows from Propositions 9 and 11 that the A&C chain satisfies d&m conditions with

$\lambda = \left[\frac{n}{n + Q} - \frac{2}{\pi}\, \frac{n}{n + Q}\, \{\hat{p}_n \wedge (1 - \hat{p}_n)\}\right]^2 \le \left[1 - \frac{2}{\pi}\, \{\hat{p}_n \wedge (1 - \hat{p}_n)\}\right]^2,$

$L = 1 + \lambda$, and $\varepsilon = 2^{-1/2} e^{-d}$ for $d > 2L/(1 - \lambda)$. For any fixed $n$, suppose that there exist $c_1, c_2 \in (0, 1)$ such that $c_1 \le \hat{p}_n \le c_2$; then using (5) one can find $\bar{\rho} < 1$, which depends only on $c_1 \wedge (1 - c_2)$, such that $\rho(D_n) \le \bar{\rho}$. It now follows that the geometric convergence rates, $\rho(D_n)$, are eventually bounded away from 1 so long as $0 < \liminf_{n \to \infty} \hat{p}_n \le \limsup_{n \to \infty} \hat{p}_n < 1$. It is important to note that $D_n$ is not random here. Moreover, an analogous result holds for $\rho_*(D_n)$. Here is a formal statement.

Corollary 16.
For the intercept-only model described above, if $0 < \liminf_{n \to \infty} \hat{p}_n \le \limsup_{n \to \infty} \hat{p}_n < 1$, then for any

$\delta \in \left(0,\; 1 - \left[1 - \frac{2}{\pi}\left\{\liminf_{n \to \infty} \hat{p}_n \wedge \Big(1 - \limsup_{n \to \infty} \hat{p}_n\Big)\right\}\right]^2\right),$

we have $\limsup_{n \to \infty} \rho(D_n) \le \bar{\rho} < 1$ and $\limsup_{n \to \infty} \rho_*(D_n) \le \bar{\rho} < 1$, where $\bar{\rho}$ equals $\hat{\rho}$ in (5) with

$\lambda = \left[1 - \frac{2}{\pi}\left\{\liminf_{n \to \infty} \hat{p}_n \wedge \Big(1 - \limsup_{n \to \infty} \hat{p}_n\Big)\right\}\right]^2 + \delta,$

$L = 1 + \lambda$, and $\varepsilon = 2^{-1/2} e^{-d}$, where $d > 2L/(1 - \lambda)$.
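The intercept-only drift rate displayed in the previous paragraph is easy to tabulate. The sketch below (with an illustrative default $Q = 1$, and `min` playing the role of $\wedge$) shows both behaviors at once: the rate stabilizes below 1 when $\hat{p}_n$ is bounded away from 0 and 1, while for all-success data ($\hat{p}_n = 1$) the bound degenerates to $[n/(n+Q)]^2 \to 1$, consistent with the Johndrow et al. (2016) example:

```python
import numpy as np

def lam_intercept(n, phat, Q=1.0):
    """Drift rate lambda = [n/(n+Q)]^2 [1 - (2/pi) min(phat, 1-phat)]^2
    for the intercept-only probit model (sketch; Q = 1 is illustrative)."""
    return (n / (n + Q)) ** 2 * (1 - (2 / np.pi) * min(phat, 1 - phat)) ** 2

for n in (10, 100, 10_000):
    print(n, lam_intercept(n, phat=0.3))   # stays below [1 - (2/pi) 0.3]^2 < 1
print(lam_intercept(10_000, phat=1.0))     # all successes: factor [n/(n+Q)]^2, tending to 1
```

This is exactly the dichotomy in Corollary 16: the limiting bound depends only on $\liminf \hat{p}_n \wedge (1 - \limsup \hat{p}_n)$.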

Proof. It suffices to prove the result for $\rho_*$, since the argument for $\rho$ has already been provided. Fix an arbitrary $\delta$. By Proposition 10, when $n$ is sufficiently large, for any $\beta \in \mathbb{R}$ and $m \ge 1$,

$\int_{\mathbb{R}} \big|k^m_{AC}(\beta, \beta'; Y, X) - \pi_{B|Y,X}(\beta' \mid Y, X)\big|\, d\beta' \le H(\beta)\, \bar{\rho}^{\,m-1},$

where $H(\beta)$ is given in the said proposition. Let $\Pi$ be the probability measure corresponding to the posterior density, $\pi_{B|Y,X}(\beta \mid Y, X)$. Then for any probability measure $\nu \in L^2(\Pi)$ and $m \ge 1$,

$\int_{\mathbb{R}} \left|\int_{\mathbb{R}} k^m_{AC}(\beta, \beta'; Y, X)\, \nu(d\beta) - \pi_{B|Y,X}(\beta' \mid Y, X)\right| d\beta'$
$\le \int_{\mathbb{R}}\int_{\mathbb{R}} \big|k^m_{AC}(\beta, \beta'; Y, X) - \pi_{B|Y,X}(\beta' \mid Y, X)\big|\, d\beta'\, \nu(d\beta) \le \left[\int_{\mathbb{R}} H(\beta)\, \nu(d\beta)\right] \bar{\rho}^{\,m-1}.$

One can verify that $H(\beta)$ can be bounded by polynomial functions. As a result, by Theorem 2.3 in Chen and Shao (2000), $\int_{\mathbb{R}} H^2(\beta)\, \pi_{B|Y,X}(\beta \mid Y, X)\, d\beta < \infty$. Then, by Cauchy-Schwarz, $\int_{\mathbb{R}} H(\beta)\, \nu(d\beta) < \infty$. Therefore, $\rho_*(D_n) \le \bar{\rho}$ for all sufficiently large $n$.

Recall that the convergence rate analyses of Roy and Hobert (2007) and Chakraborty and Khare (2017), which establish the geometric ergodicity of the A&C chain for fixed $n$ and $p$, are based on an un-centered drift function. We end this subsection with a result showing that, while this un-centered drift may be adequate for non-asymptotic results, it simply does not match the dynamics of the A&C chain well enough to yield a result like Theorem 15. The following result is proven in Subsection B.6.

Proposition 17. Assume that (A1) and (A2) hold, and that there exists $\beta_* \in \mathbb{R}^p$ such that $\beta_* \ne 0$ and $G(\gamma) = \tilde{G}(\gamma^T\beta_*)$ for all $\gamma \in \mathbb{R}^p$, where $\tilde{G} : \mathbb{R} \to (0, 1)$ is a strictly increasing function such that $\tilde{G}(0) = 1/2$. Then, almost surely, any drift and minorization based on $V_0(\beta) = \|\Sigma^{1/2}\beta\|^2$ is necessarily unstable in $n$.

Remark 18. In the above proposition, if $\tilde{G}(\theta) = \Phi(\theta)$ for all $\theta \in \mathbb{R}$, then the probit model is correctly specified, and the true parameter is $\beta_*$.

5.2 Large p, small n

In this subsection, we consider the case where $n$ is fixed and $p$ grows. In contrast with the strategy of the previous subsection, here we consider a deterministic sequence of data sets. Also, since $p$ is changing, we need to specify a sequence of prior parameters $\{(Q_p, v_p)\}_{p=1}^\infty$.
Let $D_p := \{v_p, Q_p, X_p, Y\}$ denote a sequence of priors and data sets, where $Y$ is a fixed $n \times 1$ vector of responses, $X_p$ is an $n \times p$ matrix, and $v_p$ is


Computer intensive statistical methods

Computer intensive statistical methods Lecture 11 Markov Chain Monte Carlo cont. October 6, 2015 Jonas Wallin jonwal@chalmers.se Chalmers, Gothenburg university The two stage Gibbs sampler If the conditional distributions are easy to sample

More information

( ) ( ) Monte Carlo Methods Interested in. E f X = f x d x. Examples:

( ) ( ) Monte Carlo Methods Interested in. E f X = f x d x. Examples: Monte Carlo Methods Interested in Examples: µ E f X = f x d x Type I error rate of a hypothesis test Mean width of a confidence interval procedure Evaluating a likelihood Finding posterior mean and variance

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet. Stat 535 C - Statistical Computing & Monte Carlo Methods Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Introduction to Markov chain Monte Carlo The Gibbs Sampler Examples Overview of the Lecture

More information

Probability and Measure

Probability and Measure Part II Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2018 84 Paper 4, Section II 26J Let (X, A) be a measurable space. Let T : X X be a measurable map, and µ a probability

More information

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary

More information

Zero Variance Markov Chain Monte Carlo for Bayesian Estimators

Zero Variance Markov Chain Monte Carlo for Bayesian Estimators Noname manuscript No. will be inserted by the editor Zero Variance Markov Chain Monte Carlo for Bayesian Estimators Supplementary material Antonietta Mira Reza Solgi Daniele Imparato Received: date / Accepted:

More information

Tail inequalities for additive functionals and empirical processes of. Markov chains

Tail inequalities for additive functionals and empirical processes of. Markov chains Tail inequalities for additive functionals and empirical processes of geometrically ergodic Markov chains University of Warsaw Banff, June 2009 Geometric ergodicity Definition A Markov chain X = (X n )

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

When is a Markov chain regenerative?

When is a Markov chain regenerative? When is a Markov chain regenerative? Krishna B. Athreya and Vivekananda Roy Iowa tate University Ames, Iowa, 50011, UA Abstract A sequence of random variables {X n } n 0 is called regenerative if it can

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research Linear models Linear models are computationally convenient and remain widely used in applied econometric research Our main focus in these lectures will be on single equation linear models of the form y

More information

Simulated Annealing for Constrained Global Optimization

Simulated Annealing for Constrained Global Optimization Monte Carlo Methods for Computation and Optimization Final Presentation Simulated Annealing for Constrained Global Optimization H. Edwin Romeijn & Robert L.Smith (1994) Presented by Ariel Schwartz Objective

More information

Metric spaces and metrizability

Metric spaces and metrizability 1 Motivation Metric spaces and metrizability By this point in the course, this section should not need much in the way of motivation. From the very beginning, we have talked about R n usual and how relatively

More information

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we

More information

General Theory of Large Deviations

General Theory of Large Deviations Chapter 30 General Theory of Large Deviations A family of random variables follows the large deviations principle if the probability of the variables falling into bad sets, representing large deviations

More information

A Bayesian perspective on GMM and IV

A Bayesian perspective on GMM and IV A Bayesian perspective on GMM and IV Christopher A. Sims Princeton University sims@princeton.edu November 26, 2013 What is a Bayesian perspective? A Bayesian perspective on scientific reporting views all

More information

LECTURE 15 Markov chain Monte Carlo

LECTURE 15 Markov chain Monte Carlo LECTURE 15 Markov chain Monte Carlo There are many settings when posterior computation is a challenge in that one does not have a closed form expression for the posterior distribution. Markov chain Monte

More information

Proofs for Large Sample Properties of Generalized Method of Moments Estimators

Proofs for Large Sample Properties of Generalized Method of Moments Estimators Proofs for Large Sample Properties of Generalized Method of Moments Estimators Lars Peter Hansen University of Chicago March 8, 2012 1 Introduction Econometrica did not publish many of the proofs in my

More information

Analysis of the Gibbs sampler for a model. related to James-Stein estimators. Jeffrey S. Rosenthal*

Analysis of the Gibbs sampler for a model. related to James-Stein estimators. Jeffrey S. Rosenthal* Analysis of the Gibbs sampler for a model related to James-Stein estimators by Jeffrey S. Rosenthal* Department of Statistics University of Toronto Toronto, Ontario Canada M5S 1A1 Phone: 416 978-4594.

More information

Partially Collapsed Gibbs Samplers: Theory and Methods. Ever increasing computational power along with ever more sophisticated statistical computing

Partially Collapsed Gibbs Samplers: Theory and Methods. Ever increasing computational power along with ever more sophisticated statistical computing Partially Collapsed Gibbs Samplers: Theory and Methods David A. van Dyk 1 and Taeyoung Park Ever increasing computational power along with ever more sophisticated statistical computing techniques is making

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

7.1 Coupling from the Past

7.1 Coupling from the Past Georgia Tech Fall 2006 Markov Chain Monte Carlo Methods Lecture 7: September 12, 2006 Coupling from the Past Eric Vigoda 7.1 Coupling from the Past 7.1.1 Introduction We saw in the last lecture how Markov

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

Mathematical Methods for Physics and Engineering

Mathematical Methods for Physics and Engineering Mathematical Methods for Physics and Engineering Lecture notes for PDEs Sergei V. Shabanov Department of Mathematics, University of Florida, Gainesville, FL 32611 USA CHAPTER 1 The integration theory

More information

Weak convergence of Markov chain Monte Carlo II

Weak convergence of Markov chain Monte Carlo II Weak convergence of Markov chain Monte Carlo II KAMATANI, Kengo Mar 2011 at Le Mans Background Markov chain Monte Carlo (MCMC) method is widely used in Statistical Science. It is easy to use, but difficult

More information

Topological properties of Z p and Q p and Euclidean models

Topological properties of Z p and Q p and Euclidean models Topological properties of Z p and Q p and Euclidean models Samuel Trautwein, Esther Röder, Giorgio Barozzi November 3, 20 Topology of Q p vs Topology of R Both R and Q p are normed fields and complete

More information

Approximate Inference using MCMC

Approximate Inference using MCMC Approximate Inference using MCMC 9.520 Class 22 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Basic Sampling Algorithms. 4. Markov

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

MARGINAL MARKOV CHAIN MONTE CARLO METHODS

MARGINAL MARKOV CHAIN MONTE CARLO METHODS Statistica Sinica 20 (2010), 1423-1454 MARGINAL MARKOV CHAIN MONTE CARLO METHODS David A. van Dyk University of California, Irvine Abstract: Marginal Data Augmentation and Parameter-Expanded Data Augmentation

More information

Lecture 4: September Reminder: convergence of sequences

Lecture 4: September Reminder: convergence of sequences 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 4: September 6 In this lecture we discuss the convergence of random variables. At a high-level, our first few lectures focused

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Markov Chains CK eqns Classes Hitting times Rec./trans. Strong Markov Stat. distr. Reversibility * Markov Chains

Markov Chains CK eqns Classes Hitting times Rec./trans. Strong Markov Stat. distr. Reversibility * Markov Chains Markov Chains A random process X is a family {X t : t T } of random variables indexed by some set T. When T = {0, 1, 2,... } one speaks about a discrete-time process, for T = R or T = [0, ) one has a continuous-time

More information

ELEMENTARY LINEAR ALGEBRA

ELEMENTARY LINEAR ALGEBRA ELEMENTARY LINEAR ALGEBRA K R MATTHEWS DEPARTMENT OF MATHEMATICS UNIVERSITY OF QUEENSLAND First Printing, 99 Chapter LINEAR EQUATIONS Introduction to linear equations A linear equation in n unknowns x,

More information

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS 1. THE CLASS OF MODELS y t {y s, s < t} p(y t θ t, {y s, s < t}) θ t = θ(s t ) P[S t = i S t 1 = j] = h ij. 2. WHAT S HANDY ABOUT IT Evaluating the

More information

The Pennsylvania State University The Graduate School RATIO-OF-UNIFORMS MARKOV CHAIN MONTE CARLO FOR GAUSSIAN PROCESS MODELS

The Pennsylvania State University The Graduate School RATIO-OF-UNIFORMS MARKOV CHAIN MONTE CARLO FOR GAUSSIAN PROCESS MODELS The Pennsylvania State University The Graduate School RATIO-OF-UNIFORMS MARKOV CHAIN MONTE CARLO FOR GAUSSIAN PROCESS MODELS A Thesis in Statistics by Chris Groendyke c 2008 Chris Groendyke Submitted in

More information

The simple slice sampler is a specialised type of MCMC auxiliary variable method (Swendsen and Wang, 1987; Edwards and Sokal, 1988; Besag and Green, 1

The simple slice sampler is a specialised type of MCMC auxiliary variable method (Swendsen and Wang, 1987; Edwards and Sokal, 1988; Besag and Green, 1 Recent progress on computable bounds and the simple slice sampler by Gareth O. Roberts* and Jerey S. Rosenthal** (May, 1999.) This paper discusses general quantitative bounds on the convergence rates of

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Advances and Applications in Perfect Sampling

Advances and Applications in Perfect Sampling and Applications in Perfect Sampling Ph.D. Dissertation Defense Ulrike Schneider advisor: Jem Corcoran May 8, 2003 Department of Applied Mathematics University of Colorado Outline Introduction (1) MCMC

More information

Markov chain Monte Carlo

Markov chain Monte Carlo 1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

Fourier Series. 1. Review of Linear Algebra

Fourier Series. 1. Review of Linear Algebra Fourier Series In this section we give a short introduction to Fourier Analysis. If you are interested in Fourier analysis and would like to know more detail, I highly recommend the following book: Fourier

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

University of Regina. Lecture Notes. Michael Kozdron

University of Regina. Lecture Notes. Michael Kozdron University of Regina Statistics 252 Mathematical Statistics Lecture Notes Winter 2005 Michael Kozdron kozdron@math.uregina.ca www.math.uregina.ca/ kozdron Contents 1 The Basic Idea of Statistics: Estimating

More information

An introduction to adaptive MCMC

An introduction to adaptive MCMC An introduction to adaptive MCMC Gareth Roberts MIRAW Day on Monte Carlo methods March 2011 Mainly joint work with Jeff Rosenthal. http://www2.warwick.ac.uk/fac/sci/statistics/crism/ Conferences and workshops

More information

Convergence Rate of Markov Chains

Convergence Rate of Markov Chains Convergence Rate of Markov Chains Will Perkins April 16, 2013 Convergence Last class we saw that if X n is an irreducible, aperiodic, positive recurrent Markov chain, then there exists a stationary distribution

More information

Part III. 10 Topological Space Basics. Topological Spaces

Part III. 10 Topological Space Basics. Topological Spaces Part III 10 Topological Space Basics Topological Spaces Using the metric space results above as motivation we will axiomatize the notion of being an open set to more general settings. Definition 10.1.

More information

Lecture 7. µ(x)f(x). When µ is a probability measure, we say µ is a stationary distribution.

Lecture 7. µ(x)f(x). When µ is a probability measure, we say µ is a stationary distribution. Lecture 7 1 Stationary measures of a Markov chain We now study the long time behavior of a Markov Chain: in particular, the existence and uniqueness of stationary measures, and the convergence of the distribution

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

Stat 5421 Lecture Notes Proper Conjugate Priors for Exponential Families Charles J. Geyer March 28, 2016

Stat 5421 Lecture Notes Proper Conjugate Priors for Exponential Families Charles J. Geyer March 28, 2016 Stat 5421 Lecture Notes Proper Conjugate Priors for Exponential Families Charles J. Geyer March 28, 2016 1 Theory This section explains the theory of conjugate priors for exponential families of distributions,

More information

On the Applicability of Regenerative Simulation in Markov Chain Monte Carlo

On the Applicability of Regenerative Simulation in Markov Chain Monte Carlo On the Applicability of Regenerative Simulation in Markov Chain Monte Carlo James P. Hobert 1, Galin L. Jones 2, Brett Presnell 1, and Jeffrey S. Rosenthal 3 1 Department of Statistics University of Florida

More information

ELEMENTARY LINEAR ALGEBRA

ELEMENTARY LINEAR ALGEBRA ELEMENTARY LINEAR ALGEBRA K. R. MATTHEWS DEPARTMENT OF MATHEMATICS UNIVERSITY OF QUEENSLAND Second Online Version, December 1998 Comments to the author at krm@maths.uq.edu.au Contents 1 LINEAR EQUATIONS

More information

The Metropolis-Hastings Algorithm. June 8, 2012

The Metropolis-Hastings Algorithm. June 8, 2012 The Metropolis-Hastings Algorithm June 8, 22 The Plan. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture February Arnaud Doucet

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture February Arnaud Doucet Stat 535 C - Statistical Computing & Monte Carlo Methods Lecture 13-28 February 2006 Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Limitations of Gibbs sampling. Metropolis-Hastings algorithm. Proof

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models

Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models Christopher Paciorek, Department of Statistics, University

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information