Checking for Prior-Data Conflict

by

Michael Evans, Department of Statistics, University of Toronto
and
Hadas Moshonov, Department of Statistics, University of Toronto

Technical Report No. 413, February 24, 2005

TECHNICAL REPORT SERIES, UNIVERSITY OF TORONTO, DEPARTMENT OF STATISTICS

Checking for Prior-Data Conflict

M. Evans and H. Moshonov, U. of Toronto

Abstract

Inference proceeds from ingredients chosen by the analyst and data. To validate any inferences drawn it is essential that the inputs chosen be deemed appropriate for the data. In the Bayesian context these inputs consist of both the sampling model and the prior. There are thus two possibilities for failure: the data may not have arisen from the sampling model, or the prior may place most of its mass on parameter values that are not feasible in light of the data, referred to here as prior-data conflict. Failure of the sampling model can only be fixed by modifying the model, while prior-data conflict can be overcome if sufficient data is available. We examine how to assess whether or not a prior-data conflict exists and how to assess when its effects can be ignored for inferences. The concept of prior-data conflict is seen to lead to a partial characterization of what is meant by a noninformative prior or a noninformative sequence of priors.

1 Introduction

Model checking is a necessary component of statistical analyses, because statistical analyses are based on assumptions. These assumptions typically take the form of possibilities for how the observed data were generated (the sampling model) and, in Bayesian contexts, a quantification of the investigator's beliefs (the prior) about the actual data-generating mechanism among the possibilities included in the sampling model. In certain contexts it might be argued that we have specific information that leads to a sampling model, or that the investigator has a very clear idea of what an appropriate prior is, but in general this does not seem to be the case. In fact the sampling model and prior often seem, if not chosen arbitrarily, then at least selected for convenience.
Accordingly it seems quite important that, before carrying out an inferential analysis, we first check that the choices made make sense in light of the data collected. For, if we can show that the observed data are surprising in light of the sampling model and the prior, then we must at least be suspicious about the validity of the inferences drawn (ignoring any ambiguities about the correct inference process itself). Our concern here is with the Bayesian context, where we have two ingredients, namely the sampling model and the prior. Several authors have considered model checking in this context, such as Guttman (1967), Box (1980), Rubin

(1984), and Gelman, Meng and Stern (1996). We note that all of these authors considered the effect of both the sampling model and the prior simultaneously. There are, however, two possible ways in which the Bayesian model can fail. First, the sampling model may fail. By this we mean that the observed data are surprising for each of the assumed distributions in the model. If this is the case, then either we had the misfortune to observe a rare data set or the model is in fact not appropriate. In the former case collecting more data may resolve the issue, but in the latter case it will not. Second, assuming the sampling model is appropriate, the Bayesian model may fail by the prior placing its mass primarily on distributions in the sampling model for which the observed data are surprising. In other words, there are distributions in the model for which the data are not surprising, but the prior places little or no mass there. We refer to this as prior-data conflict. This conflict again leads to doubts about the validity of inferences drawn from the Bayesian model. We note that in the second context increasing the amount of data can lead to a resolution of the problem, for as the amount of data increases the effect of the prior decreases, vanishing in the limit. What this implies is that even if there is prior-data conflict, we may have sufficient data so that its effect can be ignored. The preceding discussion leads us to the conclusion that in Bayesian contexts we must check individually for the separate sources of failure. First we must check for the failure of the sampling model. There are many frequentist methods for this, as well as Bayesian methods as discussed in Bayarri and Berger (2000). If we have no evidence that the sampling model is in error, then we must check to see whether or not there is any prior-data conflict and, if there is, whether or not this conflict leads to erroneous inferences.
Of course, if we find evidence that the sampling model is wrong, then it doesn't make sense to check for prior-data conflict. See also Section 6 for further justification for separately checking the sampling model and for prior-data conflict. We note also that if we do obtain evidence of the sampling model failing, this failure may not lead to serious problems for our inferences in certain circumstances. This is because the model may only be approximately correct and we have observed a sufficient amount of data to detect this deviation. If the approximate correctness of the model is adequate for the application at hand, then we would not want to discard the model. This is, however, a separate issue from what we will discuss in this paper, and it needs to be addressed by methods other than those we present here. In any case, throughout the remainder of this paper we assume that the sampling model is correct and focus on looking for prior-data conflict. The question now arises as to how we should check for the existence of a prior-data conflict. In Section 2 we argue that it is appropriate to use the prior predictive distribution of the minimal sufficient statistic to do this. In other words, if the observed value of the minimal sufficient statistic is a surprising value from its prior predictive distribution, then we have evidence that a prior-data conflict exists. In Section 3 we show how the concept of ancillarity becomes

relevant when looking for prior-data conflict. In Section 4 we discuss the implications of our definition of prior-data conflict for the characterization of a noninformative prior. In Section 5 we discuss how to assess whether or not an observed prior-data conflict can be ignored, so that the Bayesian model can be used to derive inferences about the unknowns in the sampling model. In Section 6 we discuss a factorization of the joint distribution of the parameter and data that serves as further support for the approach taken here.

In preparation for our discussion we specify some notation. We denote the sample space for the data s by S, the parameter space by Ω, and the sampling model by {f_θ : θ ∈ Ω}, where each f_θ is a probability density on S with respect to some support measure µ. The prior probability measure on Ω is denoted by Π, with density π with respect to a support measure υ on Ω. The joint distribution of (θ, s) leads to the prior predictive measure for s given by

M(B) = ∫_Ω ∫_B f_θ(s) π(θ) µ(ds) υ(dθ) = ∫_B m(s) µ(ds),

where

m(s) = ∫_Ω f_θ(s) π(θ) υ(dθ)

is the density of M with respect to µ. We note that we are restricting our attention here to the case where Π is proper, so that M is indeed a probability measure. We do, however, make some comments relevant to the improper case. When we observe the data s₀, we have the posterior probability measure Π(· | s₀) with density, with respect to υ, given by

π(θ | s₀) = f_θ(s₀) π(θ) / m(s₀)

for inferences about θ. For a function T : S → T we denote the marginal densities by f_{θ,T}, with respect to some support measure λ on T, and this leads to the marginal prior predictive density of T given by

m_T(t) = ∫_Ω f_{θ,T}(t) π(θ) υ(dθ),

again with respect to λ. We will also refer to the posterior predictive distribution of T, which has density

m_T(t | s₀) = ∫_Ω f_{θ,T}(t) π(θ | s₀) υ(dθ)

with respect to λ.
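As a concrete illustration of these definitions, the prior predictive density m(s) can be computed directly by quadrature. The sketch below is our own code, not the paper's; it takes a N(θ, 1) sampling density with a N(0, 4) prior, for which the prior predictive has the closed form N(0, 5), giving a check on the computation.

```python
# Illustrative sketch (our code, not the paper's): compute the prior
# predictive density m(s) = ∫ f_theta(s) pi(theta) d(theta) by midpoint
# quadrature for a N(theta, 1) model with a N(0, 4) prior; closed form N(0, 5).
from math import exp, pi, sqrt

def norm_pdf(x, mean, var):
    return exp(-(x - mean) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

def prior_predictive(s, lo=-50.0, hi=50.0, steps=20000):
    # midpoint rule over a grid wide enough to capture essentially all prior mass
    h = (hi - lo) / steps
    return sum(norm_pdf(s, th, 1.0) * norm_pdf(th, 0.0, 4.0) * h
               for th in (lo + (i + 0.5) * h for i in range(steps)))

print(prior_predictive(1.0), norm_pdf(1.0, 0.0, 5.0))  # the two agree closely
```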
To assess whether or not a prior-data conflict exists, we need to compare some observed value with a fixed distribution to determine whether or not it is surprising. For convenience we will use the P-value for this purpose, although we are not necessarily recommending this as "the" way to measure surprise. There are various concerns with P-values, and in practical contexts we recommend at least plotting the distribution being used in the comparison in addition to computing the P-value.

2 Checking for Prior-Data Conflict: Sufficiency

Intuitively, a prior-data conflict exists whenever the data provide little or no support to those values of θ where the prior places its support. So, if the region where the prior places most of its mass is somewhat separate from the region where the likelihood function is relatively high, then the measure we use to assess the conflict should indicate this. This might lead one to compare the MLE θ̂(s₀) (or some other consistent estimate of θ) to the prior distribution to assess the degree of conflict. If θ̂(s₀) lay in a region where the prior placed little support, e.g., out in the tails of π, then we might conclude that a conflict exists. We note, however, that this is not quite correct, for if the spread of the likelihood were sufficiently wide, then there could be a significant overlap between the effective support of the prior and regions of support for the likelihood, even if θ̂(s₀) were "out in the tails" of π. Therefore, a more appropriate measure of prior-data conflict would seem to be one that compared the effective support of the prior (loosely speaking, a region that contains most of the mass of the prior) with the region where the likelihood is relatively high. If there is very little overlap between these regions, then we have evidence that a prior-data conflict exists. It is not clear, however, how to appropriately measure this degree of overlap. Accordingly we take a somewhat different approach. We note that the prior induces a probability distribution on the set of possible likelihood functions via the prior predictive probability measure M. If the observed likelihood L(· | s₀), given by L(θ | s₀) = f_θ(s₀), is a surprising value from this distribution, then this would seem to indicate that a prior-data conflict exists. We can simplify this somewhat by noting that the likelihood is in effect equivalent to a minimal sufficient statistic T; namely, if we know the value T(s₀), then we can obtain the map L(· | s₀), and conversely.
So, instead of comparing the observed likelihood to its marginal model, we can compare T(s₀) to its prior distribution M_T. We can then choose T so that this comparison can be made relatively simply.

We might ask if anything in the observed data s₀, beyond the value of T(s₀), has any relevance to checking for a prior-data conflict. The following result suggests that this is not the case.

Theorem 1. Suppose T is a sufficient statistic for the model {f_θ : θ ∈ Ω} for data s. Then the conditional prior predictive distribution of the data s given T is independent of the prior π.
Proof: See the Appendix.

So the conditional prior distribution of any aspect of the data, beyond the value of T(s₀), doesn't depend on the prior. This suggests that such information can tell us nothing about whether or not a prior-data conflict exists, and so can be ignored for this purpose. Note also the following result.

Theorem 2. If L(· | s₀) is nonzero only on a θ-region where π places no mass,

then T(s₀) is an impossible outcome for M_T.
Proof: See the Appendix.

So we see, at least in this extreme case of nonoverlapping support for the likelihood and prior, that comparing T(s₀) to M_T leads to definite evidence of a prior-data conflict. These results support our assertion that comparing the observed value T(s₀) of a minimal sufficient statistic to its prior distribution M_T is an appropriate method for checking for prior-data conflict. We note that Box (1980) suggested comparing the observed data s₀ to M as an appropriate method for model checking. So the method we are suggesting here is really just a restriction of Box's approach to minimal sufficient statistics, and we are further suggesting that this is appropriate only for checking for prior-data conflict and not model checking generally. This, together with the discussion in Section 6, would seem to lead to a resolution of some anomalous behavior of Box's approach in particular examples. We note also that the discussion in Section 3 leads to a further modification of Box's approach. The following examples show that basing the check for prior-data conflict on comparing T(s₀) to M_T produces acceptable results. We also consider some alternatives in these examples (such as using the posterior predictive) and show that these have bad properties for the problem of checking for prior-data conflict. Perhaps the most basic model is the following.

Example 1. Location normal model

Suppose s = (s₁, ..., sₙ) is a sample from a N(θ, 1) distribution with θ ∈ R¹, and we place a N(θ₀, σ₀²) prior on θ. A minimal sufficient statistic is T(s) = s̄ ~ N(θ, 1/n). For the prior predictive distribution of s̄ we can write s̄ = θ + Z, where Z ~ N(0, 1/n) independent of θ. So the prior predictive distribution of s̄ is N(θ₀, σ₀² + 1/n).
Accordingly it makes sense to see if s̄₀ lies in the tails of this distribution, and we assess this via the P-value

M_T(m_T(s̄) ≤ m_T(s̄₀)) = 2[1 − Φ(|s̄₀ − θ₀| / (σ₀² + 1/n)^{1/2})].   (1)

This shows that when s̄₀ lies in the tails of its prior distribution, we have evidence of a prior-data conflict existing. Note that using the prior predictive distribution of s̄ to assess this, instead of the prior for θ, results in the standardization by (σ₀² + 1/n)^{1/2} rather than σ₀. So s̄₀ has to be somewhat further out in the tail than if we simply compared the MLE to the prior of θ. Now suppose that the true value of θ is θ*. Then as n → ∞ we have that (1) converges almost surely to

2[1 − Φ(|θ* − θ₀| / σ₀)].

Therefore, if the true value of θ lies in the tails of the prior, then asymptotically (1) will detect this and lead to evidence that a prior-data conflict exists.
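The P-value (1) is immediate to compute; the following minimal sketch is our own code (the function name and the illustrative numbers are ours, not the paper's).

```python
# Sketch of the P-value (1) for the location normal model (our code): compare
# the observed mean to its prior predictive N(theta0, sigma0^2 + 1/n).
from math import sqrt
from statistics import NormalDist

def conflict_pvalue(sbar0, theta0, sigma0, n):
    """Two-sided prior predictive P-value of Example 1."""
    z = abs(sbar0 - theta0) / sqrt(sigma0 ** 2 + 1.0 / n)
    return 2.0 * (1.0 - NormalDist().cdf(z))

print(conflict_pvalue(sbar0=3.0, theta0=0.0, sigma0=1.0, n=25))    # small: conflict
print(conflict_pvalue(sbar0=3.0, theta0=0.0, sigma0=100.0, n=25))  # near 1: diffuse prior
```

The second call illustrates the limiting behavior discussed above: a very diffuse prior (large σ₀) cannot be found in conflict with the data.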

Also note that as σ₀ → ∞, (1) converges to 1, and so no evidence will be found of a prior-data conflict. This is not surprising, for recall that we are proceeding as if the sampling model is correct, and a diffuse prior simply indicates that all values are equally likely, so we should definitely not find any conflict. Further, as σ₀ → 0, so that we have a very precise prior, (1) will definitely find evidence of a prior-data conflict for large n unless θ₀ is indeed the true value.

One might wonder about the possibility of using the posterior predictive distribution of the minimal sufficient statistic instead. In fact a result analogous to Theorem 1 is true in the posterior context as well, so it makes sense to only look at functions of the minimal sufficient statistic again. The posterior predictive distribution for s̄ is given by

s̄ ~ N((1/σ₀² + n)⁻¹(θ₀/σ₀² + n s̄₀), (1/σ₀² + n)⁻¹ + 1/n).

Then, comparing s̄₀ to this distribution, we would find a prior-data conflict if the P-value

2[1 − Φ(|s̄₀ − (1/σ₀² + n)⁻¹(θ₀/σ₀² + n s̄₀)| / ((1/σ₀² + n)⁻¹ + 1/n)^{1/2})]   (2)

is small. As n → ∞ we have that (2) converges almost surely to 1, irrespective of the true value of θ. So if we were to use the posterior predictive, we would never conclude that a prior-data conflict exists. This is clearly an unacceptable situation. This phenomenon occurs because the data inevitably overwhelm the effect of the prior in the posterior, and we have an instance of the double use of the data, as discussed in Bayarri and Berger (2000). Similarly, coin tossing represents a very simple context.

Example 2. Bernoulli model

Suppose s = (s₁, ..., sₙ) is a sample from a Bernoulli(θ) model where θ ∈ [0, 1] is unknown, and θ is given a Beta(α, β) prior distribution. A minimal sufficient statistic is T(s) = n s̄ = Σ sᵢ ~ Binomial(n, θ). The prior predictive probability function of T(s) = n s̄ is then given by

m_T(t) = C(n, t) [Γ(α + β) / (Γ(α)Γ(β))] [Γ(t + α)Γ(n − t + β) / Γ(n + α + β)],

where C(n, t) denotes the binomial coefficient. To assess whether or not t₀ = n s̄₀ is surprising, we must compare t₀ to m_T.
This can be conveniently done by computing the tail probability

M_T(m_T(t) ≤ m_T(t₀)),   (3)

which, in this case, must be done numerically. In essence, (3) is the prior probability of obtaining a value of the minimal sufficient statistic with probability of

occurrence no greater than that for the observed value. Because of cancellations we can write (3) as

M_T(Γ(t + α)Γ(n − t + β) / [Γ(t + 1)Γ(n − t + 1)] ≤ Γ(t₀ + α)Γ(n − t₀ + β) / [Γ(t₀ + 1)Γ(n − t₀ + 1)]).   (4)

Notice that when α = β = 1, (4) equals 1, and we never have any evidence of prior-data conflict. This makes sense, however, as given that the sampling model is correct, there should be no conflict between the data and a noninformative prior. To illustrate the use of the P-value given by (4), we consider a numerical example. For this suppose that n = 10, α = 5 and β = 20. In this case the prior distribution puts most of its mass on values of θ in (0, 1/2). Figure 1 is a plot of the prior predictive probability function of the minimal sufficient statistic.

Figure 1: Plot of the prior predictive probability function of the minimal sufficient statistic for a sample of n = 10 from the Bernoulli(θ) distribution when θ ~ Beta(5, 20).

Generating a sample of size 10 from the Bernoulli(.9) distribution, we obtained n s̄₀ = 9. Note that θ = .9 is outside the effective support of the prior distribution, i.e., a prior-data conflict exists. The prior predictive P-value is given by

M_T(m_T(t) ≤ m_T(9)) = m_T(9) + m_T(10) = .000117.

As expected, this is an indication of a prior-data conflict. Generating a sample of size 10 from the Bernoulli(.25) distribution, we obtained n s̄₀ = 1. Note that this value θ = .25 lies well in the support of the prior distribution, i.e., a prior-data conflict does not exist. The P-value is given by M_T(m_T(t) ≤ m_T(1)) = 1. As expected, this does not indicate any prior-data conflict. The following example illustrates that the notion of a noninformative prior is somewhat subtle. We discuss this further in Section 4.
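Returning to Example 2, the computation of (3) can be sketched as follows. This is our own code, not the paper's: the prior predictive of t = n s̄ is beta-binomial, and the P-value sums m_T over all t whose probability is no greater than that of the observed t₀.

```python
# Numerical version of the tail probability (3) for Example 2 (our code):
# m_T is the beta-binomial probability function C(n,t) B(t+a, n-t+b) / B(a,b),
# and the P-value sums m_T over all t no more probable than the observed t0.
from math import comb, lgamma, exp

def m_T(t, n, a, b):
    return comb(n, t) * exp(lgamma(a + b) - lgamma(a) - lgamma(b)
                            + lgamma(t + a) + lgamma(n - t + b) - lgamma(n + a + b))

def prior_pvalue(t0, n, a, b):
    probs = [m_T(t, n, a, b) for t in range(n + 1)]
    return sum(p for p in probs if p <= probs[t0])

# n = 10 with a Beta(5, 20) prior, which concentrates on (0, 1/2):
print(prior_pvalue(9, 10, 5.0, 20.0))  # ≈ 0.000117: t0 = 9 conflicts with the prior
print(prior_pvalue(1, 10, 5.0, 20.0))  # 1.0: t0 = 1 is the modal value
```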

Example 3. Negative-binomial sampling

Suppose that a coin is tossed in independent tosses until k heads are obtained, and the number of tails observed until the k-th head is s. Then s is distributed Negative-Binomial(k, θ) and is minimal sufficient. Note that the likelihood function is a positive multiple of the likelihood function from the Bernoulli model in Example 2 when we obtain k ones in a sample of size n = s + k. It is then clear that posterior inferences about θ are the same from the two models. We will see, however, that the checks for prior-data conflict are somewhat different. Suppose that we place a uniform prior on θ. Then, as noted in Example 2, the prior predictive for the minimal sufficient statistic under Bernoulli sampling is uniform, and we never have evidence of a prior-data conflict existing. The prior predictive under negative-binomial sampling, however, is given by

m(s) = C(s + k − 1, k − 1) ∫₀¹ θᵏ (1 − θ)ˢ dθ
     = [Γ(s + k) / (Γ(k) Γ(s + 1))] [Γ(k + 1) Γ(s + 1) / Γ(s + k + 2)]
     = k / [(s + k)(s + k + 1)],   (5)

and this is not uniform. Indeed, it cannot be uniform, because the support of this distribution is the nonnegative integers. Notice that (5) is decreasing in s, and so

M(m(s) ≤ m(s₀)) = Σ_{s=s₀}^∞ k / [(s + k)(s + k + 1)]
                = Σ_{s=s₀}^∞ k [1/(s + k) − 1/(s + k + 1)]
                = k / (s₀ + k),   (6)

and we see that this P-value is small only when s₀ is large compared to k. While this P-value may seem unusual when compared to the binomial case, there is a subtle difference between the two situations. In particular, in the Binomial(n, θ) case it is permissible to have θ = 0 and so, for example, observe n consecutive tails. In the Negative-Binomial(k, θ) case, however, it is not possible to have θ = 0, as there is no such thing as the Negative-Binomial(k, 0) distribution. The uniform prior on the unit interval places positive mass on every interval about 0 and so is not really a sensible prior for this model, as the positive mass indicates a prior belief that θ = 0 is a possible value.
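The telescoping sum above is easy to verify numerically; the quick sketch below is our own code, not the paper's.

```python
# Check of the closed form (6) in Example 3 (our code): under a uniform prior,
# m(s) = k / ((s + k)(s + k + 1)) and, since m is decreasing in s, the P-value
# is the tail sum from s0, which telescopes to k / (s0 + k).
def m(s, k):
    return k / ((s + k) * (s + k + 1))

def pvalue_direct(s0, k, terms=200000):
    # brute-force partial sum of the tail; the remainder is k / (s0 + terms + k)
    return sum(m(s, k) for s in range(s0, s0 + terms))

k, s0 = 3, 47
print(pvalue_direct(s0, k), k / (s0 + k))  # both close to 0.06
```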
Recall our restriction that, when looking for prior-data conflict, we assume that the sampling model is correct. If we place a uniform prior on [ε, 1] with ε > 0, as this places no mass at 0, and denote the corresponding prior predictive by m_ε, then the dominated convergence theorem establishes that lim_{ε→0} M_ε(m_ε(s) ≤ m_ε(s₀)) is equal to (6). In fact, when k = 1 we have that M_ε(m_ε(s) ≤ m_ε(s₀)) = (1 − ε)^{s₀}/(s₀ + 1). So we see that (6) is an approximation to the P-value we will obtain when ε is small. Accordingly we can view (6) as an approximation to the P-value we would obtain for a prior that

is 0 in a small neighborhood of 0 and rises smoothly to 1 outside this interval. If (6) is small, then we have evidence of prior-data conflict when using this kind of prior. The following example illustrates that looking at different functions of the minimal sufficient statistic may be appropriate when assessing prior-data conflict.

Example 4. Location-scale normal model

Suppose that x = (x₁, ..., xₙ) is a sample from a N(µ, σ²) distribution, where µ ∈ R¹ and σ > 0 are unknown. With s² = (n − 1)⁻¹ Σ (xᵢ − x̄)², then (x̄, s²) is a minimal sufficient statistic for (µ, σ²), with x̄ ~ N(µ, σ²/n) independent of s² ~ (σ²/(n − 1)) χ²(n − 1). Suppose we put the prior on (µ, σ²) given by

µ | σ² ~ N(µ₀, τ₀² σ²),  1/σ² ~ Gamma(α₀, β₀).   (7)

The posterior distribution of (µ, σ²) is then given by

µ | σ², x₁, ..., xₙ ~ N(µₓ, (n + 1/τ₀²)⁻¹ σ²),  1/σ² | x₁, ..., xₙ ~ Gamma(α₀ + n/2, βₓ),

where

µₓ = (1/τ₀² + n)⁻¹ (µ₀/τ₀² + n x̄)

and

βₓ = β₀ + µ₀²/(2τ₀²) + (n/2) x̄² + ((n − 1)/2) s² − (1/2)(1/τ₀² + n)⁻¹ (µ₀/τ₀² + n x̄)².

The joint prior predictive distribution of (x̄, s²) is given by

m(x̄, s²) = ∫∫ f(x̄, s² | µ, σ²) π(µ | σ²) π(σ²) dσ² dµ
          = [Γ(n/2 + α₀)/Γ(α₀)] (n + 1/τ₀²)^{−1/2} (2π)^{−n/2} β₀^{α₀} τ₀^{−1} βₓ^{−(n/2) − α₀}.   (8)

We then need to assess whether or not an observed value (x̄₀, s₀²) is a reasonable value from the distribution given by (8). While there might be a number of ways of assessing this, we compute the probability of observing a value of (x̄, s²) at least as extreme as (x̄₀, s₀²), namely, we compute the P-value

M(m(x̄, s²) ≤ m(x̄₀, s₀²)).

To compute this, for specified values of the hyperparameters α₀, β₀, µ₀ and τ₀², we generate (µ, σ²) using (7), then generate (x̄, s²) from their joint distribution given (µ, σ²), and evaluate m(x̄, s²) using the formula in (8). Repeating this many times, and recording the proportion of values of m(x̄, s²) that are less than or equal to m(x̄₀, s₀²), gives us a Monte Carlo estimate of the P-value. For example, suppose that (µ, σ) = (0, 1) and that for a sample of size n = 20 from this distribution we obtained observed values x̄₀ and s₀². Then for the prior specified by τ₀² = 1, µ₀ = 5, α₀ = 10 and β₀ = 5, the above simulation process, based on a Monte Carlo sample of size N = 10³, gives a small P-value, and so clear evidence of a prior-data conflict is obtained.

Rather than computing a P-value based on the full prior predictive distribution of the minimal sufficient statistic, it is also possible to use the prior predictive distribution of some function of the minimal sufficient statistic. For example, we could instead use the marginal prior predictive distributions of x̄ and s². We note that we can write x̄ = µ + (σ²/n)^{1/2} z, where z ~ N(0, 1), and when µ | σ² ~ N(µ₀, τ₀²σ²), then x̄ | σ² ~ N(µ₀, σ²(τ₀² + 1/n)). Therefore, after multiplying the N(µ₀, σ²(τ₀² + 1/n)) density by the marginal prior for 1/σ² and integrating out σ², we have that x̄ ~ t_{2α₀}(µ₀, (β₀(τ₀² + 1/n)/α₀)^{1/2}), where t_λ(0, 1) denotes the t distribution with λ degrees of freedom and t_λ(µ, σ) = µ + σ t_λ(0, 1), is the marginal prior predictive distribution of x̄. Therefore the P-value based on x̄ alone is

2[1 − G_{2α₀}(|x̄₀ − µ₀| / (β₀(τ₀² + 1/n)/α₀)^{1/2})],

where G_{2α₀} is the cdf of the t distribution with 2α₀ degrees of freedom. For the above generated data this P-value is also small, which again indicates the existence of a prior-data conflict. Similarly, we find that s² ~ (β₀/α₀) F(n − 1, 2α₀) is the marginal prior predictive of s². For the above example we compare s₀²/0.5 with the F(19, 20) distribution.
Computing the P-value as the probability of obtaining a value from the F(19, 20) distribution with density smaller than that obtained at the observed value leads to a large P-value, which does not indicate any prior-data conflict. We see from these two checks that the conflict arises from the location of the data and not its spread. Similar results are available for the normal regression model when using a conjugate prior. In general it seems difficult to determine the large-sample behavior of M_T.

Example 5. Exponential models

Suppose we have a sample s = (s₁, ..., sₙ) from a density of the form

f_θ(sᵢ) = c(θ) h(sᵢ) exp{θᵗ u(sᵢ)}.

Then the minimal sufficient statistic T(s) = n⁻¹ Σᵢ₌₁ⁿ u(sᵢ) converges almost surely (P_θ) to E_θ(u), and so we expect that, for large n, the prior predictive distribution of T will be closely approximated by the prior distribution of E_θ(u). So, for large n, if T(s₀) is extreme with respect to the prior distribution of E_θ(u), we will have evidence of prior-data conflict. The following example has posed a variety of problems for inference but seems to lead to sensible answers when we apply our criterion for assessing whether or not any prior-data conflict exists.

Example 6.

Suppose we observe s = (s₁, ..., sₙ), where sᵢ ~ N(θᵢ, 1) with s₁, ..., sₙ independent and θ₁, ..., θₙ independent N(θ₀, σ₀²). In this case s is a minimal sufficient statistic with prior predictive distribution N_n(θ₀(1, ..., 1)ᵗ, (σ₀² + 1)I). So we see that we have evidence of prior-data conflict whenever

(σ₀² + 1)⁻¹ Σᵢ₌₁ⁿ (sᵢ − θ₀)²

is extreme with respect to the Chi-squared(n) distribution, i.e., whenever the P-value

P(χ²ₙ > (σ₀² + 1)⁻¹ Σᵢ₌₁ⁿ (sᵢ − θ₀)²)

is small. Notice that as σ₀² → ∞ this P-value goes to 1, and so we have no evidence of a conflict existing when we have a diffuse prior. When σ₀² → 0 we have evidence of a prior-data conflict whenever s₁, ..., sₙ doesn't look like a sample from the N(θ₀, 1) distribution.

3 Checking for Prior-Data Conflict: Ancillarity

In principle we could look for prior-data conflict by comparing the observed value of U(T(s)) with its marginal prior predictive distribution for any function U; e.g., consider Example 4. We note, however, that for certain choices of U this will clearly not be appropriate. For example, if U is ancillary, then the marginal prior predictive distribution of U is the same as its marginal distribution from the sampling model, i.e., the marginal prior predictive distribution does not depend on the prior. So if U(T(s₀)) is surprising, this cannot be evidence of a prior-data conflict but rather indicates a problem with the sampling model.
Since we are assuming here that the sampling model has passed its checks, we want to avoid this possibility. A simple answer would be to say that we simply don't choose U to be ancillary. But we note that the prior predictive distribution of T will also be affected by aspects that are ancillary. In other words, T(s₀) may be a surprising value simply because U(T(s₀)) is surprising for some ancillary U, and we want to avoid this. Accordingly, in such a context, it makes sense to compare

the observed value T(s₀) with its prior predictive distribution given the value U(T(s₀)); i.e., we remove the variation in the prior predictive distribution of T that is associated with U by conditioning on U. Note that we avoid the necessity of conditioning on an ancillary whenever we have a complete minimal sufficient statistic T, as then Basu's theorem implies that every ancillary U is independent of T. So, for the examples discussed in Section 2, we do not need to condition on ancillaries; i.e., the prior predictive distribution of T does not exhibit variation that can be ascribed to any ancillary U.

If U₁ ∘ T and U₂ ∘ T are ancillary and U₁ ∘ T = h ∘ U₂ ∘ T for some h, then we need only do the check based on the prior predictive distribution of T given U₂(T(s₀)), since this is conditioning on more information and so removing more variation. When an ancillary U₁ is such that there is no function h and ancillary U₂ (other than a choice where h is one-to-one) such that U₁ ∘ T = h ∘ U₂ ∘ T, then U₁ is a maximal ancillary. So in general we would like to condition on maximal ancillaries whenever possible. We note, however, as discussed in Lehmann and Scholz (1992), the lack of a general method for obtaining maximal ancillaries. Still, if we have an ancillary U available, it seems better to condition on U, and so remove the variation due to U, even if we cannot determine whether or not U is a maximal ancillary or determine a maximal ancillary based on U. Furthermore, it can happen that there is more than one maximal ancillary in a problem; e.g., see Lehmann and Scholz (1992). We note that the lack of a unique maximal ancillary can cause problems in frequentist approaches to inference, because inference about θ can be different depending on which ancillary we choose to condition on. This does not cause a problem here, however.
For if the check based on conditioning on one maximal ancillary indicates a prior-data conflict, then a conflict exists irrespective of what happens when we condition on other maximal ancillaries. In essence we are considering different aspects of the prior predictive distribution when we condition on different maximal ancillaries. The following example is presented in Cox and Hinkley (1974) as a particularly compelling argument for the use of ancillary statistics in inference.

Example 7. Mixtures

Suppose we have a response x that is either from a N(θ, σ₀²) or a N(θ, σ₁²) distribution, where θ ∈ R¹ is unknown and σ₀², σ₁² are both known and unequal. Suppose these two distributions correspond to the variation exhibited by two measuring instruments, and the particular instrument used is chosen according to c ~ Bernoulli(p₀), where p₀ is known. Then we see that (c, x) is minimal sufficient and c is ancillary. We place a N(θ₀, 1) prior on θ. Therefore, when c = 0 we would use the prior predictive based on the prior and the N(θ, σ₀²) family, namely x ~ N(θ₀, σ₀² + 1), and when c = 1 we would use the prior predictive based on the prior and the N(θ, σ₁²) family, namely x ~ N(θ₀, σ₁² + 1), to check for prior-data conflict. This example nicely illustrates the use of ancillary statistics, for there are two "components" to the prior predictive distribution of the minimal sufficient

statistic, and the ancillary statistic selects one. We see that x may lead to a prior-data conflict for both components, one component or neither component. This will depend on the prior π and the values of σ₀² and σ₁². Note also that if p₀ is very small and we observed c = 1, then we might consider this as evidence of a problem with the model, not as indicating that a prior-data conflict exists (see Section 6).

More generally, if we have that (x, u) is minimal sufficient for a model with x | u ~ f_θ(· | u) and u ~ h a maximal ancillary, then

m(x | u) = ∫_Ω f_θ(x | u) π(θ) υ(dθ).

We compare the observed value of x with this distribution to see if there is any evidence of prior-data conflict. Clearly the variation in the value of the minimal sufficient statistic due to changes in u can tell us nothing about whether or not there is any prior-data conflict. Conditioning on u removes this variation. The following example is presented in Cox and Hinkley (1974) as a case where the use of ancillary statistics leads to ambiguity for inferences about the model parameter.

Example 8. Special Multinomial Distribution

Suppose that we have a sample of n from the Multinomial(1, (1 − θ)/6, (1 + θ)/6, (2 − θ)/6, (2 + θ)/6) distribution, where θ ∈ [−1, 1] is unknown. Then the cell counts (f₁, f₂, f₃, f₄) constitute a minimal sufficient statistic, and U₁ = (f₁ + f₂, f₃ + f₄) is ancillary, as is U₂ = (f₁ + f₄, f₂ + f₃). Then (f₁, f₂, f₃, f₄) | U₁ is given by f₁ | U₁ ~ Binomial(f₁ + f₂, (1 − θ)/2) independent of f₃ | U₁ ~ Binomial(f₃ + f₄, (2 − θ)/4), giving

m(f₁, f₂, f₃, f₄ | U₁) = C(f₁ + f₂, f₁) C(f₃ + f₄, f₃) ∫₋₁¹ ((1 − θ)/2)^{f₁} ((1 + θ)/2)^{f₂} ((2 − θ)/4)^{f₃} ((2 + θ)/4)^{f₄} π(θ) dθ.

This distribution is 2-dimensional and can be used as described in Example 4 for prior-data conflict checking. Alternatively, we could use the 1-dimensional distributions f₁ | U₁ and f₃ | U₁.
In Example 4 we saw that separate 1-dimensional assessments can lead to more insight concerning any prior-data conflict that may exist. For example, suppose π is a Beta(α, β) distribution on [−1, 1]. When α = β = 2, so the prior concentrates about 0, and f1 + f2 = 10, Figure 2 gives the conditional prior predictive probability function for f1 | U1. We see that we will find evidence of a prior-data conflict whenever f1 ∈ {0, 1, 9, 10}. On the other hand, when α = 10, β = 2, so the prior concentrates about 1, and f1 + f2 = 10, Figure 3 gives the conditional prior predictive probability function for f1 | U1. In

this case we will find evidence of a prior-data conflict whenever f1 ∈ {3, ..., 10}.

Figure 2: Conditional prior predictive of f1 given U1 when α = β = 2.

Figure 3: Conditional prior predictive of f1 given U1 when α = 10, β = 2.

Alternatively, we could use f1 | U2 ~ Binomial(f1 + f4, (1 − θ)/3). In this case, when α = β = 2 and f1 + f4 = 10, Figure 4 gives the conditional prior predictive probability function for f1 | U2. We see that we will find evidence of a prior-data conflict whenever f1 ∈ {0, 8, 9, 10}. This differs from the previous check involving f1, but there is no conflict between them. If either check indicates that f1 is a surprising value, then we have evidence of a prior-data conflict existing. We also have available the checks using f3 | U1 and f2 | U2.

Figure 4: Conditional prior predictive of f1 given U2 when α = β = 2.

The following example can be considered as an archetype for the common situation where an ancillary is determined as the maximal invariant under a transformation group acting on the sample space.

Example 9. Location Cauchy

Suppose we have a sample s = (s1, ..., sn) from a distribution with density proportional to (1 + (x − θ)²)^{−1}, where θ ∈ R¹. Then it is known that T(s) = (s(1), ..., s(n)), the order statistic, is a minimal sufficient statistic. Further, we have that U(T(s)) = (s(2) − s(1), ..., s(n) − s(1)) = (u1, ..., u_{n−1}) is ancillary. Clearly the conditional distribution of T given U can be expressed as the conditional distribution of s(1) given U, and this has conditional density proportional to

(1 + (s(1) − θ)²)^{−1} ∏_{i=1}^{n−1} (1 + (s(1) + u_i − θ)²)^{−1}.

Note that the normalizing constant will not involve θ, and so the conditional prior predictive density of s(1) is

m(s(1) | u1, ..., u_{n−1}) = ∫ (1 + (s(1) − θ)²)^{−1} ∏_{i=1}^{n−1} (1 + (s(1) + u_i − θ)²)^{−1} π(θ) dθ
= ∫ (1 + v²)^{−1} ∏_{i=1}^{n−1} (1 + (v + u_i)²)^{−1} π(s(1) − v) dv (9)

for prior π. Integrating s(1) out of the above expression shows that the normalizing constant is given by

∫ (1 + v²)^{−1} ∏_{i=1}^{n−1} (1 + (v + u_i)²)^{−1} dv. (10)

We then assess whether or not there is any prior-data conflict by comparing the observed value of s(1) with the distribution given by m(· | u1, ..., u_{n−1}). To evaluate a P-value associated with this conditional prior predictive we must integrate numerically, and this can be a bit tricky. For example, consider the following ordered sample of n = 10 from the Cauchy distribution with θ = 0.

In Figure 5 we have plotted the normalized integrand in (10) over the interval (−u_{n−1}, u1). This is in fact a plot of the conditional density of v = s(1) − θ given θ and (u1, ..., u_{n−1}).

Figure 5: The conditional density of v = s(1) − θ given θ and (u1, ..., u_{n−1}).

We can combine the information in this plot with our knowledge about where the prior places its mass to determine where we should place the integration points to evaluate the integral in (9). For example, when the prior is a N(0, 1) distribution we see that, for this data set, we need to tabulate m(s(1) | u1, ..., u_{n−1}) for s(1) in the interval (−8, 1). Note that we need only tabulate the normalized integrand in (10) once to tabulate m(s(1) | u1, ..., u_{n−1}), and this saves considerable computation. Doing this we obtain the conditional density m(· | u1, ..., u_{n−1}) shown in Figure 6. From this we can see that the observed value of s(1) provides no evidence of prior-data conflict, and this is what we would expect.

Figure 6: The density function m(· | u1, ..., u_{n−1}) with a N(0, 1) prior.

Suppose, however, that the prior distribution is N(θ0, 1). In this case the density m(· | u1, ..., u_{n−1}) looks just like that plotted in Figure 6 but is translated by

θ0 units. So we see that if θ0 is more than 3, then the observed value of s(1) will provide evidence of a prior-data conflict. When the prior is N(θ0, σ0²) and σ0² → 0, then m(· | u1, ..., u_{n−1}) converges to a density that is the same as the density depicted in Figure 5, but translated by θ0. So when θ0 > 1 or θ0 < −2 we see clearly that there will be evidence of a prior-data conflict. Note that the left-tail probability is given by

P(s(1) | u1, ..., u_{n−1}) = [∫ (1 + v²)^{−1} ∏_{i=1}^{n−1} (1 + (v + u_i)²)^{−1} Φ((s(1) − θ0 − v)/σ0) dv] / [∫ (1 + v²)^{−1} ∏_{i=1}^{n−1} (1 + (v + u_i)²)^{−1} dv],

and this converges to 1/2 as σ0² → ∞. So for a very diffuse prior we will not find any evidence of a prior-data conflict.

We note that the method of calculation that we have outlined here will work with any location family when the minimal sufficient statistic is given by the order statistic, as then (u1, ..., u_{n−1}) will always be ancillary. For example, when we are sampling from any t distribution, this will be the case. Further, this method can be generalized to location-scale models.

As noted in Lehmann and Scholz (1992), there can be a conflict between ancillarity and sufficiency in frequentist inference, in the sense that it is not entirely clear which should be imposed first. In other words, there are situations where ancillary statistics exist that are not functions of the minimal sufficient statistic. In the context of checking for a prior-data conflict, however, this does not pose a difficulty, as it is clear from Theorem 1 that we only consider functions of the minimal sufficient statistic for this. Of course, for Bayesian inferences about θ the concept of ancillarity has no relevance.

4 Noninformative Priors

Various definitions are available for expressing what it means for a prior to be noninformative. For example, see Kass and Wasserman (1996) for a discussion of the various possibilities and of the difficulties inherent in this, and also Bernardo (1979) and Berger and Bernardo (1992).
A somewhat different requirement for noninformativity arises from considerations about the existence of a prior-data conflict. For if a prior is such that we would never conclude that a prior-data conflict exists, no matter what data is obtained, then it seems reasonable to say that such a prior is at least a candidate for being called noninformative. Example 2 gives an example where the uniform prior on a compact parameter space satisfies this requirement. We saw in Example 3 that, while the parameter space is contained in a compact set, it is not the case that the uniform distribution is noninformative.

Strictly interpreted, our requirement only applies to proper priors because M_T is not a probability measure unless Π is proper. We can, however, extend this to sequences of priors that converge in some sense to an improper prior. As in Example 1, we can say that a sequence of priors satisfies this requirement for noninformativity if the P-value we choose to compute converges to 1 for every possible set of data. In general it is not clear whether a proper prior, or a sequence of proper priors, will exist for a given problem that satisfies this requirement. Further, even if we can show that such objects do exist, it is likely that further requirements must be satisfied for a prior to be called noninformative. If it is possible, however, for a so-called noninformative prior to conflict with the data in some sense, then it does seem that it is putting some information into the analysis. So we consider the absence of the possibility of any prior-data conflict as a necessary characteristic of noninformativity rather than as a characterization of this concept. The following example demonstrates that more than one sequence can satisfy the requirement.

Example 10. Location normal model

Consider the context of Example 1, but now suppose that we take the prior Π_w for θ to be the uniform distribution on (−w, w) for some w > 0. Then the prior predictive density of T(s) = s̄ is given by

m_{T,w}(s̄) = (1/2w) ∫_{−w}^{w} (n/2π)^{1/2} exp{−(n/2)(s̄ − θ)²} dθ = (1/2w) [Φ(n^{1/2}(w − s̄)) − Φ(n^{1/2}(−w − s̄))].

It is clear that m_{T,w} is unimodal, with mode at s̄ = 0, and is symmetric about 0. Then the relevant P-value is given by

M_{T,w}(m_{T,w}(s̄) < m_{T,w}(s̄0)) = 1 − M_{T,w}((−|s̄0|, |s̄0|)) = 1 − ∫_{−|s̄0|}^{|s̄0|} (1/2w) [Φ(n^{1/2}(w − s̄)) − Φ(n^{1/2}(−w − s̄))] ds̄,

which converges to 1 as w → ∞ by the dominated convergence theorem. So in this case we see that a natural sequence of priors converging to the uniform prior satisfies our requirement, as in the limit there is no prior-data conflict. This is also true, as discussed in Example 1, for a sequence of N(θ0, σ0²) priors with σ0² → ∞.
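The convergence of this P-value to 1 can be checked numerically. The sketch below is ours: it evaluates m_{T,w} through the closed form above and approximates the P-value by a midpoint rule, with an illustrative observed mean s̄0 = 1.5 and n = 5.

```python
import math

def normal_cdf(z):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prior_predictive_density(sbar, w, n):
    # m_{T,w}(sbar): Uniform(-w, w) prior mixed with the N(theta, 1/n)
    # sampling distribution of the sample mean.
    rootn = math.sqrt(n)
    return (normal_cdf(rootn * (w - sbar))
            - normal_cdf(rootn * (-w - sbar))) / (2.0 * w)

def conflict_pvalue(sbar_obs, w, n, grid=2000):
    """M_{T,w}(m_{T,w}(sbar) < m_{T,w}(sbar_obs)).

    Since m_{T,w} is symmetric and unimodal at 0, this is one minus the
    prior predictive mass of (-|sbar_obs|, |sbar_obs|), computed here by
    the midpoint rule."""
    a = abs(sbar_obs)
    h = 2.0 * a / grid
    mass = sum(prior_predictive_density(-a + (i + 0.5) * h, w, n)
               for i in range(grid)) * h
    return 1.0 - mass
```

Since m_{T,w} is bounded by 1/(2w) everywhere, the mass being subtracted is at most |s̄0|/w, which makes the convergence of the P-value to 1 as w → ∞ transparent.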
Both of these sequences give the same limiting posterior inferences. The discussion in Section 3 indicates that the definition of a noninformative prior must also involve ancillary statistics.

Example 11. Mixtures

Consider again the context of Example 7 and suppose that we take the prior on θ to be the uniform prior on (−w, w). Then, after conditioning on

the ancillary c, the analysis of Example 10 shows that the improper uniform prior satisfies our requirement for noninformativity, as the sequence of P-values converges to 1. Notice, however, that if p0 was small and we didn't condition on c, then if we observed c = 1 the P-value would not converge to 1 as w → ∞. This reinforces the necessity of conditioning on ancillaries when considering whether or not a prior-data conflict exists.

In Examples 8 and 9 it is not entirely clear what priors will satisfy our requirement for noninformativity. Partly this is a technical problem that might yield results with more work, but it is conceivable that in certain problems no such priors or sequences of priors will exist. Further, there is no reason to suppose that, even if such a prior exists, it will necessarily be unique. In Example 13 we discuss what it means for a sequence of priors to be noninformative for the location-scale normal problem, as discussed in Example 4, in light of our requirement. This presents a more challenging problem than those considered in this section.

5 Diagnostics for Ignoring Prior-data Conflict

Suppose we have concluded that there is evidence of a prior-data conflict. The question then remains as to what to do next. In general this is a somewhat difficult question to answer, as it may well entail using a different prior to avoid the conflict, and thus we would need to prescribe how to go about choosing such a prior in a scientifically defendable way. There are circumstances, however, where the answer can be as simple as collecting more data. For we know that, with sufficient data, the effect of the prior on our inferences is immaterial. Of course, circumstances can arise where collecting more data is not possible, and so we may be left with the more difficult problem of selecting a different prior, but we would still like to know if, in a given context, we have enough data to ignore the prior-data conflict and, if not, how much more data we need.
Intuitively, if the inferences about θ that we are interested in making are not strongly dependent on the prior, then we might feel that we can ignore any prior-data conflict. We need, however, some quantitative assessment of this, and there are several possibilities. When there is a prior that is noninformative, we can simply compare the posterior inferences obtained via the noninformative prior with those obtained using the informative prior. If these inferences do not differ by an amount that is considered of practical importance, then it seems reasonable to ignore the prior-data conflict. The amount of difference will depend on the particular application, as a small difference in one context may be considered quite large in another. Consider the following examples.

Example 12. Bernoulli model

Suppose the situation is as described in Example 2 and we specify a Beta(α, β) prior distribution for θ. Suppose further we are interested in estimating θ and

for this we use the posterior mode. Given that the posterior distribution is Beta(α + ns̄, β + n(1 − s̄)), the mode is (α + ns̄ − 1)/(α + β + n − 2). A noninformative prior is, as described in Example 2, given by α = β = 1. A comparison between the posterior modes leads to the difference

(α + ns̄ − 1)/(α + β + n − 2) − s̄ = [(α − 1)(1 − s̄) − (β − 1)s̄]/(α + β + n − 2). (11)

If (11) is smaller than some prescribed value δ, then we can say that any prior-data conflict can be ignored. Note that (11) is bounded above by (α + β − 2)/(α + β + n − 2), and so we can determine n so that this difference is always smaller than δ for a given choice of prior. The value of δ is a value such that a difference greater than this is of practical significance and not otherwise. Necessarily, this value depends upon the application and must be specified by the analyst.

Example 13. Location-scale normal model

Suppose the situation is as described in Example 4 and we are interested in 0.95-hpd intervals for µ. With the prior specified as in (7), the posterior distribution of µ is

µ ~ µ_x + (n + 1/τ0²)^{−1/2} (α0 + n/2)^{−1/2} β_x^{1/2} t_{2α0+n},

where µ_x, β_x are as specified in Example 4. Therefore, a γ-hpd interval for µ is given by

µ_x ± (n + 1/τ0²)^{−1/2} (α0 + n/2)^{−1/2} β_x^{1/2} G^{−1}_{2α0+n}((1 + γ)/2), (12)

where G^{−1}_{2α0+n} is the inverse cdf of the t distribution on 2α0 + n degrees of freedom. For a noninformative sequence of priors we consider what happens to M(m(x̄, s²) ≤ m(x̄0, s0²)) as a function of the hyperparameters µ0, τ0², α0, and β0. From (8), and some easy algebraic manipulations, we see that

M(m(x̄, s²) ≤ m(x̄0, s0²)) = M( (n/((n − 1)(nτ0² + 1))) (α0/β0) (x̄ − µ0)² + (α0/β0) s² ≥ (n/((n − 1)(nτ0² + 1))) (α0/β0) (x̄0 − µ0)² + (α0/β0) s0² ). (13)

Now recall that s² ~ (β0/α0) F(n−1, 2α0), and so the left side of the inequality in (13) converges almost surely (and so in distribution also) to a random variable with the F(n−1, 2α0) distribution, as τ0² → ∞ and β0 → 0 with the added restriction that β0τ0² → ∞. Then, with the additional restriction that α0/β0 → 0, we see that (13) converges to 1.
Note that since β0 → 0, this also implies that α0 → 0. Therefore a sequence of priors, as specified in Example 4, via the hyperparameters (µ0, τ0², α0, β0), will be noninformative for this problem whenever

22 τ,β τ 2 and α /β. Actually, it is apparent that we need only require that α /β to have a noninformative sequence of priors, so there are many such sequences. The condition τ is saying that the variance of the prior on µ is becoming large. Since the mean of a Gamma(α,β ) distribution is α /β, the condition α /β says that the mass of the prior distribution for 1/σ 2 is concentrating at. But since the variance a Gamma(α,β ) distribution is α /β 2, the condition β τ 2 implies that the concentration must occur at a suitably slow rate. For example, if we take β =1/τ and α =1/τ p with p>1, these conditions will be satisþed as τ. Note that with these choices the prior variance of 1/σ 2 is α /β 2 = τ 2 p and so this will become inþnite as τ whenever 1 <p<2, will remain constant at 1 when p =2and goes towhenp>2. Under a sequence of priors, as speciþed above, we have that the γ-hpd interval for µ given by (12) converges to x ± s r n 1 n n G 1 n µ 1+γ 2. (14) This interval differs from the classical frequentist interval by using the factor (1 1/n) 1/2 G 1 n ((1 + γ)/2) instead of G 1 n 1 ((1 + γ)/2). Note, however, that the (1 1/n) 1/2 t n distribution and the t n 1 distribution are very similar. Now we want to compare the two hpd intervals to obtain our diagnostic. Perhaps the most natural comparison is to compute the length measure of the symmetric difference of the two intervals. Certainly something like this would be necessary for general regions. For intervals, however, if one can show that the differences between the corresponding endpoints are less than δ then the length measure of the symmetric difference is also less than δ. So our diagnostic is the maximum difference between the corresponding endpoints of the intervals given by (13) and by (14). If this maximum is less than δ, where δ is a value such that a difference greater than this is of practical signiþcance and not otherwise, then we can ignore any prior-data conßict. Other diagnostics suggest themselves. 
For example, in general we could compute the divergence between the posteriors under the informative and noninformative priors. For this, however, we would need to first state a cut-off value for the divergence, below which we would not view the difference between the distributions as being material. As indicated in Section 4, it does not always seem obvious how to obtain a noninformative prior or sequence of priors. One possibility in these circumstances is to simply select another prior that seems reasonable for the problem and compare inferences as we have done in Examples 12 and 13. For example, we may use some convenient reference prior, even if we have not been able to prove that this prior is noninformative or arises from a noninformative sequence of priors.
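For the Bernoulli setting of Example 12 such a divergence has a closed form, since both posteriors are Beta. The sketch below is ours: it computes the Kullback-Leibler divergence between the posterior under an informative Beta(α, β) prior and the posterior under the uniform (α = β = 1) noninformative prior, with a self-contained digamma function.

```python
import math

def digamma(x):
    # Psi function: recurrence to push x >= 6, then the asymptotic series.
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def kl_beta(a1, b1, a2, b2):
    """KL(Beta(a1, b1) || Beta(a2, b2)), in closed form."""
    return (log_beta(a2, b2) - log_beta(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

def posterior_divergence(successes, n, alpha, beta):
    """KL between the posterior under the Beta(alpha, beta) prior and the
    posterior under the uniform (noninformative) Beta(1, 1) prior."""
    return kl_beta(alpha + successes, beta + n - successes,
                   1.0 + successes, 1.0 + n - successes)
```

With the data proportion held fixed, the divergence decreases as n grows, quantifying the point at which the informative prior, and hence any prior-data conflict, stops mattering relative to a stated cut-off.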

6 Factoring the Joint Distribution

As further support for the approach taken here, consider the following factorization of the joint distribution of the data and parameter. This seems to point to a logical separation between the activities of checking the sampling model, checking for prior-data conflict, and making inferences about the model parameter. In a Bayesian analysis with a proper prior, the full information available to the statistician for subsequent analysis is effectively the observed data s0 and the joint distribution of (θ, s) as given by the measure P_θ × Π. Denoting a minimal sufficient statistic by T, we see that this factors as P(· | T) × P_{T,θ} × Π, where P_{T,θ} is the marginal probability measure of T and P(· | T) is the conditional distribution of the data s given T. The measure P(· | T) is independent of θ and the prior, and so only depends on the choice of the sampling model {P_θ : θ ∈ Ω}. Therefore, we can compare the observed data s0 against this distribution to assess whether or not the sampling model is reasonable. Also, we can write P_{T,θ} × Π as M_T × Π(· | T), where M_T is the prior predictive distribution of T and Π(· | T) is the posterior distribution of θ. As we have discussed in this paper, comparing the observed value T(s0) with the distribution M_T is an appropriate method for assessing whether or not there is a conflict between the prior and the data. For a maximal ancillary U that is a function of the minimal sufficient statistic T, we can write M_T × Π(· | T) as P_U × M_T(· | U) × Π(· | T), where P_U is the marginal distribution of U and M_T(· | U) is the conditional prior predictive distribution of T given U. We have that P_U depends only on the choice of the sampling model {P_θ : θ ∈ Ω}, and so we can also compare the observed value U(s0) with P_U to assess whether or not the sampling model is reasonable.
In a case where U is not independent of T, we have argued that it is more appropriate to compare T(s0) with M_T(· | U), rather than with M_T, to assess whether or not a prior-data conflict exists, as conditioning on U removes inappropriate variation (variation that does not depend on the prior) from the comparison. Finally, when we feel comfortable with both the sampling model and the prior, we can proceed to inferences about θ, and for this we use the posterior distribution Π(· | T). Note that checking for prior-data conflict is only appropriate when we know the sampling model is reasonable, and making inferences about θ is only appropriate when we know both that the sampling model is appropriate and that there is no prior-data conflict.

In Box (1980) a method was suggested for model checking based upon the full marginal prior predictive M for the data. In several examples this can be seen to give anomalous results. The approach taken in this paper can be seen as a refinement of Box's proposal. For we can factor M as P(· | T) × M_T, and we see that the first component is available for checking the sampling model and the second component is available for checking for prior-data conflict.
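For a Bernoulli sample this factorization can be made concrete. Conditionally on T = Σ s_i every arrangement of the successes is equally likely, which is the P(· | T) component, while under a Beta prior T is beta-binomial, which is the M_T component. The sketch below is ours; the number of runs is just one convenient discrepancy statistic for the model check, and the example values are illustrative.

```python
import math
from itertools import combinations

def runs(seq):
    # Number of maximal blocks of equal symbols.
    return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def model_check_pvalue(s):
    """Sampling model check via P(. | T): given T = sum(s), all
    arrangements of the successes are equally likely, so compare the
    observed number of runs with its conditional lower tail (few runs
    indicate clumping that the Bernoulli model would rarely produce)."""
    n, t = len(s), sum(s)
    obs = runs(s)
    total = extreme = 0
    for pos in combinations(range(n), t):
        arr = [1 if i in pos else 0 for i in range(n)]
        total += 1
        if runs(arr) <= obs:
            extreme += 1
    return extreme / total

def prior_check_pvalue(t, n, alpha, beta):
    """Prior-data conflict check via M_T: under a Beta(alpha, beta)
    prior, T is beta-binomial(n, alpha, beta)."""
    def log_b(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    probs = [math.exp(math.lgamma(n + 1) - math.lgamma(k + 1)
                      - math.lgamma(n - k + 1)
                      + log_b(k + alpha, n - k + beta) - log_b(alpha, beta))
             for k in range(n + 1)]
    m_obs = probs[t]
    return sum(p for p in probs if p <= m_obs + 1e-12)
```

A clumped sequence such as 111111000000 fails the P(· | T) check, while its total T = 6 raises no prior-data conflict under a Beta(2, 2) prior; the two checks examine the two factors separately.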

Published version: Evans, M. and Moshonov, H. (2006). Checking for Prior-Data Conflict. Bayesian Analysis, 1(4), pp. 893-914.

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

Hierarchical Models & Bayesian Model Selection

Hierarchical Models & Bayesian Model Selection Hierarchical Models & Bayesian Model Selection Geoffrey Roeder Departments of Computer Science and Statistics University of British Columbia Jan. 20, 2016 Contact information Please report any typos or

More information

Bayesian Inference. p(y)

Bayesian Inference. p(y) Bayesian Inference There are different ways to interpret a probability statement in a real world setting. Frequentist interpretations of probability apply to situations that can be repeated many times,

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 1) Fall 2017 1 / 10 Lecture 7: Prior Types Subjective

More information

P (E) = P (A 1 )P (A 2 )... P (A n ).

P (E) = P (A 1 )P (A 2 )... P (A n ). Lecture 9: Conditional probability II: breaking complex events into smaller events, methods to solve probability problems, Bayes rule, law of total probability, Bayes theorem Discrete Structures II (Summer

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Lecture 2: Conjugate priors

Lecture 2: Conjugate priors (Spring ʼ) Lecture : Conjugate priors Julia Hockenmaier juliahmr@illinois.edu Siebel Center http://www.cs.uiuc.edu/class/sp/cs98jhm The binomial distribution If p is the probability of heads, the probability

More information

The Exciting Guide To Probability Distributions Part 2. Jamie Frost v1.1

The Exciting Guide To Probability Distributions Part 2. Jamie Frost v1.1 The Exciting Guide To Probability Distributions Part 2 Jamie Frost v. Contents Part 2 A revisit of the multinomial distribution The Dirichlet Distribution The Beta Distribution Conjugate Priors The Gamma

More information

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Principles of Statistical Inference Recap of statistical models Statistical inference (frequentist) Parametric vs. semiparametric

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006 Hypothesis Testing Part I James J. Heckman University of Chicago Econ 312 This draft, April 20, 2006 1 1 A Brief Review of Hypothesis Testing and Its Uses values and pure significance tests (R.A. Fisher)

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

Beta statistics. Keywords. Bayes theorem. Bayes rule

Beta statistics. Keywords. Bayes theorem. Bayes rule Keywords Beta statistics Tommy Norberg tommy@chalmers.se Mathematical Sciences Chalmers University of Technology Gothenburg, SWEDEN Bayes s formula Prior density Likelihood Posterior density Conjugate

More information

arxiv: v1 [math.st] 22 Feb 2013

arxiv: v1 [math.st] 22 Feb 2013 arxiv:1302.5468v1 [math.st] 22 Feb 2013 What does the proof of Birnbaum s theorem prove? Michael Evans Department of Statistics University of Toronto Abstract: Birnbaum s theorem, that the sufficiency

More information

Foundations of Statistical Inference

Foundations of Statistical Inference Foundations of Statistical Inference Jonathan Marchini Department of Statistics University of Oxford MT 2013 Jonathan Marchini (University of Oxford) BS2a MT 2013 1 / 27 Course arrangements Lectures M.2

More information

7. Estimation and hypothesis testing. Objective. Recommended reading

7. Estimation and hypothesis testing. Objective. Recommended reading 7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

One-parameter models

One-parameter models One-parameter models Patrick Breheny January 22 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/17 Introduction Binomial data is not the only example in which Bayesian solutions can be worked

More information

Central Limit Theorem and the Law of Large Numbers Class 6, Jeremy Orloff and Jonathan Bloom

Central Limit Theorem and the Law of Large Numbers Class 6, Jeremy Orloff and Jonathan Bloom Central Limit Theorem and the Law of Large Numbers Class 6, 8.5 Jeremy Orloff and Jonathan Bloom Learning Goals. Understand the statement of the law of large numbers. 2. Understand the statement of the

More information

Introduction to Probability

Introduction to Probability LECTURE NOTES Course 6.041-6.431 M.I.T. FALL 2000 Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Professors of Electrical Engineering and Computer Science Massachusetts Institute

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Chapter 8 Maximum Likelihood Estimation 8. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

University of Regina. Lecture Notes. Michael Kozdron

University of Regina. Lecture Notes. Michael Kozdron University of Regina Statistics 252 Mathematical Statistics Lecture Notes Winter 2005 Michael Kozdron kozdron@math.uregina.ca www.math.uregina.ca/ kozdron Contents 1 The Basic Idea of Statistics: Estimating

More information

Bayesian Analysis for Natural Language Processing Lecture 2

Bayesian Analysis for Natural Language Processing Lecture 2 Bayesian Analysis for Natural Language Processing Lecture 2 Shay Cohen February 4, 2013 Administrativia The class has a mailing list: coms-e6998-11@cs.columbia.edu Need two volunteers for leading a discussion

More information

Lecture 2. (See Exercise 7.22, 7.23, 7.24 in Casella & Berger)

Lecture 2. (See Exercise 7.22, 7.23, 7.24 in Casella & Berger) 8 HENRIK HULT Lecture 2 3. Some common distributions in classical and Bayesian statistics 3.1. Conjugate prior distributions. In the Bayesian setting it is important to compute posterior distributions.

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

Topic 12 Overview of Estimation

Topic 12 Overview of Estimation Topic 12 Overview of Estimation Classical Statistics 1 / 9 Outline Introduction Parameter Estimation Classical Statistics Densities and Likelihoods 2 / 9 Introduction In the simplest possible terms, the

More information

University of Toronto

University of Toronto A Limit Result for the Prior Predictive by Michael Evans Department of Statistics University of Toronto and Gun Ho Jang Department of Statistics University of Toronto Technical Report No. 1004 April 15,

More information

New Bayesian methods for model comparison

New Bayesian methods for model comparison Back to the future New Bayesian methods for model comparison Murray Aitkin murray.aitkin@unimelb.edu.au Department of Mathematics and Statistics The University of Melbourne Australia Bayesian Model Comparison

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Mathematical Statistics

Mathematical Statistics Mathematical Statistics MAS 713 Chapter 8 Previous lecture: 1 Bayesian Inference 2 Decision theory 3 Bayesian Vs. Frequentist 4 Loss functions 5 Conjugate priors Any questions? Mathematical Statistics

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Chapter 4: An Introduction to Probability and Statistics

Chapter 4: An Introduction to Probability and Statistics Chapter 4: An Introduction to Probability and Statistics 4. Probability The simplest kinds of probabilities to understand are reflected in everyday ideas like these: (i) if you toss a coin, the probability

More information

Lecture 6: Model Checking and Selection

Lecture 6: Model Checking and Selection Lecture 6: Model Checking and Selection Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de May 27, 2014 Model selection We often have multiple modeling choices that are equally sensible: M 1,, M T. Which

More information

Lecture 1: August 28

Lecture 1: August 28 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random

More information

Probability and Estimation. Alan Moses

Probability and Estimation. Alan Moses Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.

More information

Compute f(x θ)f(θ) dθ

Compute f(x θ)f(θ) dθ Bayesian Updating: Continuous Priors 18.05 Spring 2014 b a Compute f(x θ)f(θ) dθ January 1, 2017 1 /26 Beta distribution Beta(a, b) has density (a + b 1)! f (θ) = θ a 1 (1 θ) b 1 (a 1)!(b 1)! http://mathlets.org/mathlets/beta-distribution/

More information

An Extended BIC for Model Selection

An Extended BIC for Model Selection An Extended BIC for Model Selection at the JSM meeting 2007 - Salt Lake City Surajit Ray Boston University (Dept of Mathematics and Statistics) Joint work with James Berger, Duke University; Susie Bayarri,

More information

2 Inference for Multinomial Distribution

2 Inference for Multinomial Distribution Markov Chain Monte Carlo Methods Part III: Statistical Concepts By K.B.Athreya, Mohan Delampady and T.Krishnan 1 Introduction In parts I and II of this series it was shown how Markov chain Monte Carlo

More information

10. Exchangeability and hierarchical models Objective. Recommended reading

10. Exchangeability and hierarchical models Objective. Recommended reading 10. Exchangeability and hierarchical models Objective Introduce exchangeability and its relation to Bayesian hierarchical models. Show how to fit such models using fully and empirical Bayesian methods.

More information

Computational Perception. Bayesian Inference

Computational Perception. Bayesian Inference Computational Perception 15-485/785 January 24, 2008 Bayesian Inference The process of probabilistic inference 1. define model of problem 2. derive posterior distributions and estimators 3. estimate parameters

More information

Classical and Bayesian inference

Classical and Bayesian inference Classical and Bayesian inference AMS 132 January 18, 2018 Claudia Wehrhahn (UCSC) Classical and Bayesian inference January 18, 2018 1 / 9 Sampling from a Bernoulli Distribution Theorem (Beta-Bernoulli

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

Chapter 6 Likelihood Inference

Chapter 6 Likelihood Inference Chapter 6 Likelihood Inference CHAPTER OUTLINE Section 1 The Likelihood Function Section 2 Maximum Likelihood Estimation Section 3 Inferences Based on the MLE Section 4 Distribution-Free Methods Section

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Why Bayesian? Rigorous approach to address statistical estimation problems. The Bayesian philosophy is mature and powerful.

Why Bayesian? Rigorous approach to address statistical estimation problems. The Bayesian philosophy is mature and powerful. Why Bayesian? Rigorous approach to address statistical estimation problems. The Bayesian philosophy is mature and powerful. Even if you aren t Bayesian, you can define an uninformative prior and everything

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Advanced Methods for Data Analysis (36-402/36-608 Spring 2014 1 Generalized linear models 1.1 Introduction: two regressions So far we ve seen two canonical settings for regression.

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

1 Probability Model. 1.1 Types of models to be discussed in the course

1 Probability Model. 1.1 Types of models to be discussed in the course Sufficiency January 18, 016 Debdeep Pati 1 Probability Model Model: A family of distributions P θ : θ Θ}. P θ (B) is the probability of the event B when the parameter takes the value θ. P θ is described

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

Stat 5421 Lecture Notes Proper Conjugate Priors for Exponential Families Charles J. Geyer March 28, 2016

Stat 5421 Lecture Notes Proper Conjugate Priors for Exponential Families Charles J. Geyer March 28, 2016 Stat 5421 Lecture Notes Proper Conjugate Priors for Exponential Families Charles J. Geyer March 28, 2016 1 Theory This section explains the theory of conjugate priors for exponential families of distributions,

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet. Stat 535 C - Statistical Computing & Monte Carlo Methods Arnaud Doucet Email: arnaud@cs.ubc.ca 1 Suggested Projects: www.cs.ubc.ca/~arnaud/projects.html First assignement on the web: capture/recapture.

More information

Remarks on Improper Ignorance Priors

Remarks on Improper Ignorance Priors As a limit of proper priors Remarks on Improper Ignorance Priors Two caveats relating to computations with improper priors, based on their relationship with finitely-additive, but not countably-additive

More information

Discrete Binary Distributions

Discrete Binary Distributions Discrete Binary Distributions Carl Edward Rasmussen November th, 26 Carl Edward Rasmussen Discrete Binary Distributions November th, 26 / 5 Key concepts Bernoulli: probabilities over binary variables Binomial:

More information

Probability. Lecture Notes. Adolfo J. Rumbos

Probability. Lecture Notes. Adolfo J. Rumbos Probability Lecture Notes Adolfo J. Rumbos October 20, 204 2 Contents Introduction 5. An example from statistical inference................ 5 2 Probability Spaces 9 2. Sample Spaces and σ fields.....................

More information

Uncertain Inference and Artificial Intelligence

Uncertain Inference and Artificial Intelligence March 3, 2011 1 Prepared for a Purdue Machine Learning Seminar Acknowledgement Prof. A. P. Dempster for intensive collaborations on the Dempster-Shafer theory. Jianchun Zhang, Ryan Martin, Duncan Ermini

More information