Notes on Jaynes's Treatment of the Marginalization Paradox
Kevin S. Van Horn, Leuther Analytics. November 23, 2003.

In Chapter 15 of Probability Theory: The Logic of Science [2] (abbreviated PTLOS), Jaynes discusses the Marginalization Paradox (MP) of Dawid, Stone, and Zidek [1] (abbreviated DSZ), going into details for two specific examples. In the MP we have a model with four variables: $\zeta$, $\eta$, $y$, and $z$. $\zeta$ is the parameter of interest, $\eta$ is a nuisance parameter, and $y$ and $z$ are data. $\zeta$ and $\eta$ are assigned independent priors $\pi_1(\zeta)$ and $\pi_2(\eta)$, and we are given $p(y, z \mid \zeta, \eta)$. Suppose now that $\pi_2(\eta)$ belongs to a certain class of improper priors; the MP is that, deriving $p(\zeta \mid z)$ in two different but apparently equally legitimate ways, we get different answers. This result has been used to argue that one should not use improper priors at all in Bayesian analyses.

Jaynes argues that the problem is not the use of an improper prior, but that the two derivations implicitly condition on different states of information, and hence we should expect to get different answers. In this note I show that Jaynes is wrong: the problem does in fact stem from the use of an improper prior. To Jaynes's credit, however, the paradox is resolved by carefully following the procedures he advocates in PTLOS.

1 The change-point problem

In PTLOS, Jaynes looks specifically at a particular parameterization of the change-point problem, wherein $y$ is a scalar, $z$ is a vector of length $n-1$, and we have

  $p(y, z \mid \zeta, \eta, I_1) = c^n(\zeta)\, \eta^n\, y^{n-1} \exp(-\eta y\, Q(\zeta, z))$   (1)

for a function $c$ and a positive function $Q$. Here $I_1$ is the state of information of a Bayesian $B_1$ who knows about all four variables and the above sampling distribution. Integrating out $y$, we find

  $p(z \mid \zeta, \eta, I_1) = \dfrac{c^n(\zeta)\,(n-1)!}{Q(\zeta, z)^n}$.   (2)
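A quick numerical sanity check of the step from (1) to (2): since $\int_0^\infty y^{n-1} e^{-\eta y Q}\, dy = (n-1)!/(\eta Q)^n$, the factor $\eta^n$ in (1) cancels and the marginal for $z$ no longer depends on $\eta$. The particular values of $n$, $Q$, and $\eta$ below are arbitrary illustrative choices:

```python
import math

def trapezoid(f, a, b, steps):
    """Composite trapezoid rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

n, Q = 5, 3.0                          # arbitrary illustrative values
exact = math.factorial(n - 1) / Q**n   # (n-1)!/Q^n: the eta-free result in (2)

results = {}
for eta in (0.7, 2.0):                 # two different nuisance-parameter values
    # eta^n * integral_0^infinity of y^(n-1) exp(-eta*y*Q) dy,
    # truncated where the integrand is negligibly small
    upper = 50.0 / (eta * Q)
    integral = trapezoid(lambda y: y**(n - 1) * math.exp(-eta * y * Q),
                         0.0, upper, 100_000)
    results[eta] = eta**n * integral

print(results, exact)
```

Both entries of `results` agree with `exact` to high accuracy, confirming that the $\eta$-dependence cancels.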
Since the right-hand side of (2) has no dependence on $\eta$, Jaynes follows DSZ in concluding (PTLOS) that

  $p(z \mid \zeta, I_1) = \dfrac{c^n(\zeta)\,(n-1)!}{Q(\zeta, z)^n}$.   (3)

Now, this is the first point where we need to exercise some caution. How do we justify the last step above? Let's carefully apply the rules of probability theory. To get $p(z \mid \zeta, I_1)$ from $p(z \mid \zeta, \eta, I_1)$ we must multiply by $p(\eta \mid \zeta, I_1)$ and integrate out $\eta$:

  $p(z \mid \zeta, I_1) = \int d\eta\, p(z, \eta \mid \zeta, I_1)$
  $\qquad = \int d\eta\, p(z \mid \eta, \zeta, I_1)\, p(\eta \mid \zeta, I_1)$
  $\qquad = p(z \mid \eta_0, \zeta, I_1) \int d\eta\, p(\eta \mid \zeta, I_1)$
  $\qquad = p(z \mid \eta_0, \zeta, I_1)$
  $\qquad = \dfrac{c^n(\zeta)\,(n-1)!}{Q(\zeta, z)^n}$,

for any $0 < \eta_0 < \infty$. The third equality follows because $p(z \mid \eta, \zeta, I_1)$ has no dependence on $\eta$.

What if $p(\eta \mid \zeta, I_1)$ is improper? The usual procedure would be to replace $=$ with $\propto$ above, in the hope that the end result will be a proper distribution, and then normalize when we are done. The problem, however, is that by definition the integral on the third line above diverges. Thus, the above derivation of $p(z \mid \zeta)$ is invalid when using an improper prior over $\eta$. In such circumstances, Jaynes tells us that we must consider the sequence of proper priors whose limit our improper prior represents, go back and do our analysis with these proper priors, and see if our result converges to some well-defined limit. If it does converge, we take this limit as our result for the improper prior. If it does not converge, then we cannot use the improper prior. In this case we get the same answer at every point in whatever sequence of proper priors we might choose, and so we have our well-defined limit. Thus we can justify (3) even for improper priors.

At this point Jaynes, again following DSZ, postulates another Bayesian $B_2$ who does not know of the existence of $\eta$ or $y$; his knowledge is limited to $\zeta$, $z$, the prior $\pi_1(\zeta)$, and the marginal sampling distribution (3). $B_2$ then applies Bayes' rule to derive (PTLOS 15.7)

  $p(\zeta \mid z, I_2) \propto \dfrac{\pi_1(\zeta)\, c^n(\zeta)}{[Q(\zeta, z)]^n}$,   (4)

where $I_2$ represents the state of information of $B_2$. Suppose now that $B_1$ assigns an improper prior for $\eta$:

  $\pi_2(\eta) \propto \eta^k, \quad 0 < \eta < \infty$.
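The divergence blocking the naive derivation is concrete: for $\pi_2(\eta) \propto \eta^k$, the normalization integral $\int_0^\infty \eta^k\, d\eta$ diverges for every real $k$ — at the upper limit when $k \geq -1$ and at the lower limit when $k \leq -1$. A small sketch using the closed-form partial integrals (the sample values of $k$ are arbitrary):

```python
import math

def partial_integral(k, a, b):
    """Closed form of the integral of eta^k over [a, b], with 0 < a < b."""
    if k == -1:
        return math.log(b) - math.log(a)
    return (b**(k + 1) - a**(k + 1)) / (k + 1)

# k >= -1: the integral over [1, T] grows without bound as T -> infinity
grows_at_infinity = {k: [partial_integral(k, 1.0, T) for T in (1e2, 1e4, 1e6)]
                     for k in (0.5, -1.0)}

# k <= -1: the integral over [eps, 1] grows without bound as eps -> 0
blows_up_at_zero = {k: [partial_integral(k, eps, 1.0) for eps in (1e-2, 1e-4, 1e-6)]
                    for k in (-1.0, -2.5)}

print(grows_at_infinity, blows_up_at_zero)
```

Each list of partial integrals increases without bound as its cutoff is pushed toward the offending endpoint, so no choice of $k$ yields a proper prior on $(0, \infty)$.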
Applying Bayes' rule with $p(\zeta, \eta \mid I_1) = \pi_1(\zeta)\, \pi_2(\eta)$ and the sampling distribution (1) to obtain $p(\zeta, \eta \mid y, z, I_1)$, then integrating out $\eta$, Jaynes and DSZ obtain the marginal posterior for $\zeta$ (PTLOS)

  $p(\zeta \mid y, z, I_1) \propto \dfrac{\pi_1(\zeta)\, c^n(\zeta)}{[Q(\zeta, z)]^{n+k+1}}$   (5)

and thereby conclude, since (5) has no dependence on $y$, that

  $p(\zeta \mid z, I_1) \propto \dfrac{\pi_1(\zeta)\, c^n(\zeta)}{[Q(\zeta, z)]^{n+k+1}}$.   (6)

(4) and (6) are clearly different posterior distributions for $\zeta$. Jaynes asserts that (6) is in fact correct for $B_1$, and attributes the difference with (4) to the different prior information of $B_1$ and $B_2$.

To see that Jaynes's analysis is incorrect, it suffices to note that

  $p(z \mid \zeta, I_1) = p(z \mid \zeta, I_2)$
  $p(\zeta \mid I_1) = p(\zeta \mid I_2)$;

$B_1$ is also entitled to apply Bayes' rule to the above, and doing so, obtains the same answer as $B_2$:

  $p(\zeta \mid z, I_1) \propto \dfrac{\pi_1(\zeta)\, c^n(\zeta)}{[Q(\zeta, z)]^n}$.   (7)

Since this contradicts (6), we do indeed have a paradox to resolve. We resolve the paradox as follows:

- We show that the step from (5) to (6) is invalid: taking this step involves integrating $p(y \mid z, I_1)$, which itself is improper, and hence the integral diverges. Thus we cannot obtain (6).
- We take the conservative approach of defining the sequence of proper priors whose limit $\pi_2(\eta)$ represents, then deriving $p(\zeta \mid y, z)$ for each of these proper priors; this gives us (7), and not (6), as the limit.

2 An overlooked impropriety

As we did with (3), let us carefully go through the steps we would have to take to derive (6) from (5):

  $p(\zeta \mid z, I_1) = \int dy\, p(\zeta, y \mid z, I_1)$
  $\qquad = \int dy\, p(\zeta \mid y, z, I_1)\, p(y \mid z, I_1)$
  $\qquad = p(\zeta \mid y_0, z, I_1) \int dy\, p(y \mid z, I_1)$
  $\qquad = p(\zeta \mid y_0, z, I_1)$

for any $0 < y_0 < \infty$, since $p(\zeta \mid y, z, I_1)$ has no dependence on $y$. As in the discussion of (3), the above derivation fails if $p(y \mid z, I_1)$ is improper, as the integral on the second-to-last line diverges.

We now show that $p(y \mid z, I_1)$ is, indeed, improper, given the improper prior over $\eta$. Applying (1) and the improper prior $\pi_2(\eta) \propto \eta^k$, we obtain

  $p(y \mid z, I_1) \propto p(y, z \mid I_1)$
  $\qquad = \sum_\zeta \pi_1(\zeta) \int_0^\infty d\eta\, \eta^k\, p(y, z \mid \zeta, \eta, I_1)$
  $\qquad \propto y^{n-1} \sum_\zeta \pi_1(\zeta)\, c^n(\zeta) \int_0^\infty d\eta\, \eta^{n+k} \exp(-\eta y\, Q(\zeta, z))$
  $\qquad = y^{n-1} \sum_\zeta \pi_1(\zeta)\, c^n(\zeta)\, \dfrac{\Gamma(n+k+1)}{[y\, Q(\zeta, z)]^{n+k+1}}$
  $\qquad \propto y^{-k-2}$,

which is improper. In the above we used the equality $\int_0^\infty x^{\alpha-1} e^{-\beta x}\, dx = \Gamma(\alpha)/\beta^\alpha$, which should be familiar as the normalization constant for the gamma distribution. Thus we cannot obtain (6) from (5), and the paradox disappears. I believe this to be the heart of the Marginalization Paradox: the implicit use of a divergent integral when an improper prior for a parameter leads to an improper distribution for a data variable.

3 Take it to the limit

The divergent integral in Section 2 tells us that we need to return to basics; that is, we need to repeat the analysis with a sequence of proper priors, and see if the results tend to some well-defined limit. Following Jaynes, we look at proper priors of the form

  $\pi_2(\eta) \propto \eta^a \exp(-b\eta)$   (8)

for $b > 0$ and $a > -1$; our previous improper prior corresponds to $a = k$ and $b \to 0$. Let us write $I_1(b)$ for the state of information corresponding to use of this prior for a specific value $b$. Corresponding to (5) we now have

  $p(\zeta \mid y, z, I_1(b)) \propto \dfrac{\pi_1(\zeta)\, c^n(\zeta)}{[b + y\, Q(\zeta, z)]^{n+a+1}}$,   (9)

which is just the corresponding equation in PTLOS. We must be careful here, though. We are going to multiply by $p(y \mid z, I_1(b))$ and then integrate out $y$ in order to obtain $p(\zeta \mid z, I_1(b))$. In using $\propto$ instead of $=$ we have dropped factors that do not depend on $\zeta$, but may depend on $y$; we need the $y$-dependent factors to get the correct result. So
we shall put the normalization constant back in:

  $p(\zeta \mid y, z, I_1(b)) = \dfrac{\pi_1(\zeta)\, c^n(\zeta)\, [b + y\, Q(\zeta, z)]^{-(n+a+1)}}{\sum_{\zeta'} \pi_1(\zeta')\, c^n(\zeta')\, [b + y\, Q(\zeta', z)]^{-(n+a+1)}}$.   (10)

Now let us again use (1) to derive the marginal distribution of $y$ conditional on $z$, only this time for a proper prior:

  $p(y \mid z, I_1(b)) \propto p(y, z \mid I_1(b))$
  $\qquad = \sum_\zeta \pi_1(\zeta) \int_0^\infty d\eta\, p(\eta \mid I_1(b))\, p(y, z \mid \zeta, \eta, I_1(b))$
  $\qquad \propto y^{n-1} \sum_\zeta \pi_1(\zeta)\, c^n(\zeta) \int_0^\infty d\eta\, \eta^{n+a} \exp(-\eta\, [b + y\, Q(\zeta, z)])$
  $\qquad \propto y^{n-1} \sum_\zeta \pi_1(\zeta)\, c^n(\zeta)\, [b + y\, Q(\zeta, z)]^{-(n+a+1)}$.   (11)

The proportionality constant depends only on $z$ and $b$; in particular, it doesn't depend on $\zeta$. We now combine (11) with (10) to obtain

  $p(\zeta \mid z, I_1(b)) = \int_0^\infty dy\, p(\zeta \mid y, z, I_1(b))\, p(y \mid z, I_1(b))$
  $\qquad \propto \pi_1(\zeta)\, c^n(\zeta) \int_0^\infty dy\, y^{n-1}\, [b + y\, Q(\zeta, z)]^{-(n+a+1)}$.   (12)

To finish the derivation, we need the following facts, obtained by use of Mathematica:

  $\int_0^y t^r\, (b + ut)^{-m}\, dt = \dfrac{y^{r+1}\, b^{-m}}{r+1}\, {}_2F_1(r+1, m; r+2; -uy/b)$   (13)

  ${}_2F_1(\alpha, \beta; \gamma; s) = \dfrac{\Gamma(\gamma)}{\Gamma(\beta)\,\Gamma(\gamma-\beta)} \int_0^1 t^{\beta-1}\, (1-t)^{\gamma-\beta-1}\, (1 - ts)^{-\alpha}\, dt$.   (14)

From (14) we find

  ${}_2F_1(\alpha, \beta; \gamma; 0) = \dfrac{\Gamma(\gamma)}{\Gamma(\beta)\,\Gamma(\gamma-\beta)} \int_0^1 t^{\beta-1}\, (1-t)^{\gamma-\beta-1}\, dt = \dfrac{\Gamma(\gamma)}{\Gamma(\beta)\,\Gamma(\gamma-\beta)} \cdot \dfrac{\Gamma(\beta)\,\Gamma(\gamma-\beta)}{\Gamma(\gamma)} = 1$;   (15)

furthermore, as $y \to \infty$, using the symmetry of ${}_2F_1$ in its first two arguments,

  $y^{r+1}\, {}_2F_1(r+1, m; r+2; -uy/b) = \kappa \int_0^1 t^{m-1}\, (1-t)^{r-m+1}\, y^{r+1}\, (1 + uyt/b)^{-(r+1)}\, dt$
  $\qquad \to \kappa\, (b/u)^{r+1} \int_0^1 t^{m-r-2}\, (1-t)^{r-m+1}\, dt$
  $\qquad = \kappa\, \Gamma(m-r-1)\, \Gamma(r-m+2)\, (b/u)^{r+1}$
  $\qquad = \dfrac{\Gamma(r+2)\, \Gamma(m-r-1)}{\Gamma(m)}\, (b/u)^{r+1}$,   (16)

where $\kappa \equiv \Gamma(r+2)/[\Gamma(m)\,\Gamma(r-m+2)]$. Combining (13), (15), and (16), we obtain

  $\int_0^\infty y^r\, (b + uy)^{-m}\, dy = \dfrac{\Gamma(r+2)\, \Gamma(m-r-1)}{(r+1)\, \Gamma(m)}\, b^{r-m+1}\, u^{-(r+1)}$.   (17)

Substituting this into (12), with $r = n-1$, $u = Q(\zeta, z)$, and $m = n+a+1$, we obtain

  $p(\zeta \mid z, I_1(b)) \propto \dfrac{\pi_1(\zeta)\, c^n(\zeta)}{[Q(\zeta, z)]^n}$,

which accords with (7). The above holds regardless of $b$, so it is also the limit as $b \to 0$. Thus (7), and not (6), is the correct solution for $B_1$.

4 What went wrong: non-uniform convergence

The reader may still be puzzled as to how $p(\zeta \mid z, I_1)$ could differ from $p(\zeta \mid y, z, I_1)$ if the latter has no dependence on $y$. To clarify what is happening we look again at what happens with the proper priors (8) as $b \to 0$ and we approach the improper prior $\eta^a$. The problem is one of non-uniform convergence. Comparing (5) and (9), we have

  $\pi_1(\zeta)\, c^n(\zeta)\, [b + y\, Q(\zeta, z)]^{-(n+a+1)} \approx \pi_1(\zeta)\, c^n(\zeta)\, y^{-(n+a+1)}\, [Q(\zeta, z)]^{-(n+a+1)}$

only for $y \gg b/Q(\zeta, z)$. For any fixed value of $y$, $p(\zeta \mid y, z, I_1(b))$ does indeed tend to (5) as $b \to 0$; nonetheless, for all values of $b$, no matter how small, there remain values of $y$ for which $p(\zeta \mid y, z, I_1(b))$ differs greatly from (5).

Is this important? Since the only way to get rid of the conditioning on $y$ is to multiply by $p(y \mid z, I_1(b))$ and then integrate out $y$, to obtain (6) we need to have $P(y \leq Cb/Q(\zeta, z) \mid z, I_1(b)) \to 0$, for any $\zeta$, $z$, and constant $C > 0$, as $b \to 0$.

In Figure 1 we plot $p(u \mid z, I_1(b))$, where $u \equiv y\bar{Q}/b$ and $\bar{Q}$ is the average of $Q(\zeta, z)$, $1 \leq \zeta \leq n$. We chose $a = n = 1$, $c = 1.5$, and randomly generated the vector $z$ as follows: $x_i$, $1 \leq i \leq n/2$, were drawn from an exponential distribution with mean 1; $x_i$, $n/2 < i \leq n$, were drawn from an exponential distribution with mean $1/c$; and $z_i \equiv x_i/x_1$.

[Figure 1: $p(u \mid z, I_1(b))$, where $u \equiv y\bar{Q}/b$.]

The ratios $Q(\zeta, z)/\bar{Q}$ range from 0.819 to 1.167; thus $u \approx 1$ corresponds roughly to $y \approx b/Q(\zeta, z)$. We see then that in this case there is substantial probability mass in the region $y \leq b/Q(\zeta, z)$ for all $\zeta$, which is precisely the region in which (5) differs substantially from (9). We have not specified $b$, because $p(u \mid z, I_1(b))$ has no dependence on $b$! Thus as $b \to 0$ we find that, for any $\zeta$, $P(u \leq C\bar{Q}/Q(\zeta, z) \mid z, I_1(b))$ remains constant, and hence so does $P(y \leq Cb/Q(\zeta, z) \mid z, I_1(b))$. To see this, note from (11) and the definition of $u$ that

  $p(u \mid z, I_1(b)) \propto u^{n-1} \sum_\zeta \pi_1(\zeta)\, c^n(\zeta)\, [b + bu\, Q(\zeta, z)/\bar{Q}]^{-(n+a+1)}$
  $\qquad \propto u^{n-1} \sum_\zeta \pi_1(\zeta)\, c^n(\zeta)\, [1 + u\, Q(\zeta, z)/\bar{Q}]^{-(n+a+1)}$,

which has no dependence on $b$.

5 DSZ Example #5

Jaynes analyzes a second specific MP example in PTLOS; again, the apparent paradox can be traced to an overlooked improper distribution, and the paradox is resolved by noting that the integral we must take to obtain the paradox diverges. In this case we have iid variables $x_i$, $1 \leq i \leq n$, each normally distributed with mean $\mu$ and variance $\sigma^2$. We then define $\zeta = \mu/\sigma$, $\nu = \sigma$, $y = R$, and $z = r$, where

  $R^2 \equiv n\,(\bar{x}^2 + s^2)$
  $r \equiv n\bar{x}/R$
  $\bar{x} \equiv n^{-1} \sum_{i=1}^n x_i$
  $s^2 \equiv n^{-1} \sum_{i=1}^n (x_i - \bar{x})^2$.

The paradox arises when we use an improper prior of the form

  $p(\mu, \sigma) \propto \sigma^{-\gamma-1}, \quad \gamma > 0$,

taken as the limit of the proper prior

  $p(\mu, \sigma) \propto \sigma^{-\gamma-1} \exp(-\beta/\sigma - \alpha\mu^2)$

as $\alpha, \beta \to 0$. Similarly to the first example, $p(\zeta \mid r)$ is derived by first finding $p(\zeta \mid r, R) \propto h(\zeta, r)$, noting that this has no dependence on $R$, and concluding that $p(\zeta \mid r) \propto h(\zeta, r)$ also. As before, this last step is invalid, and therefore the paradox cannot be created, because $p(R \mid r)$ is improper when we use the improper prior for $\mu$ and $\sigma$. We now present the proof that $p(R \mid r)$ is improper.

Theorem: Let $x_i \sim N(\mu, \sigma)$ for $1 \leq i \leq n$, and define $S^2 = \sum_i (x_i - \bar{x})^2$. Then $\bar{x}$ and $S^2$ are independent given $\mu$ and $\sigma$; $\bar{x} \sim N(\mu, \sigma/\sqrt{n})$; and $S^2 \sim \sigma^2 \chi^2_{n-1}$.

Proof: This is a standard result in statistics; see, for example, M. G. Bulmer, Principles of Statistics.

We have $s^2 = S^2/n$; define also $u = S^2\sigma^{-2} = n\sigma^{-2} s^2$. Then $du = n\sigma^{-2}\, ds^2$, and combining this with the above theorem we have

  $p(\bar{x} \mid \mu, \sigma) \propto \sigma^{-1} \exp\!\left(-\dfrac{n(\bar{x} - \mu)^2}{2\sigma^2}\right)$
  $p(u \mid \mu, \sigma) \propto u^{(n-3)/2}\, e^{-u/2}$
  $p(s^2 \mid \mu, \sigma) \propto \sigma^{-n+1}\, s^{n-3} \exp\!\left(-\dfrac{n}{2\sigma^2}\, s^2\right)$
  $p(\bar{x}, s^2 \mid \mu, \sigma) \propto \sigma^{-n}\, s^{n-3} \exp\!\left(-\dfrac{n}{2\sigma^2}\left[s^2 + (\bar{x} - \mu)^2\right]\right)$.

We will later remove the conditioning on $\mu$ and $\sigma$ by multiplying by the priors on these variables and integrating them out. For this reason we explicitly show any factors dependent on $\mu$ or $\sigma$, rather than absorbing them into the proportionality constants above.

We now want to do a change of variables from $\bar{x}$ and $s^2$ to $r$ and $R$. To compute the Jacobian, we first find

  $\dfrac{\partial R}{\partial \bar{x}} = \dfrac{1}{2}\left[n(\bar{x}^2 + s^2)\right]^{-1/2} \cdot 2n\bar{x} = n\bar{x}R^{-1}$
  $\dfrac{\partial R}{\partial s^2} = \dfrac{1}{2}\left[n(\bar{x}^2 + s^2)\right]^{-1/2} \cdot n = \dfrac{1}{2}\, nR^{-1}$
  $\dfrac{\partial r}{\partial \bar{x}} = \dfrac{Rn - n\bar{x}\cdot n\bar{x}R^{-1}}{R^2} = \dfrac{n}{R} - \dfrac{n^2\bar{x}^2}{R^3}$
  $\dfrac{\partial r}{\partial s^2} = -n\bar{x}R^{-2} \cdot \dfrac{1}{2}\, nR^{-1} = -\dfrac{n^2\bar{x}}{2R^3}$.

Then the Jacobian is the absolute value of

  $\dfrac{\partial R}{\partial \bar{x}}\dfrac{\partial r}{\partial s^2} - \dfrac{\partial R}{\partial s^2}\dfrac{\partial r}{\partial \bar{x}} = n\bar{x}R^{-1}\left(-\dfrac{n^2\bar{x}}{2R^3}\right) - \dfrac{1}{2}nR^{-1}\left(\dfrac{n}{R} - \dfrac{n^2\bar{x}^2}{R^3}\right)$
  $\qquad = -\dfrac{n^3\bar{x}^2}{2R^4} - \dfrac{n^2}{2R^2} + \dfrac{n^3\bar{x}^2}{2R^4} = -\dfrac{1}{2}\, n^2 R^{-2}$,

giving $d\bar{x}\, ds^2 = 2n^{-2}R^2\, dr\, dR$. We now combine this with the fact that

  $\bar{x} = n^{-1} rR$
  $s = \left[n^{-1}R^2 - \bar{x}^2\right]^{1/2} = \left[n^{-1}R^2 - n^{-2}r^2R^2\right]^{1/2} = n^{-1}R\, (n - r^2)^{1/2}$

to obtain

  $p(R, r \mid \mu, \sigma) \propto R^2\, p(\bar{x}, s^2 \mid \mu, \sigma)$
  $\qquad \propto R^2\, \sigma^{-n}\, s^{n-3} \exp\!\left(-\dfrac{n}{2\sigma^2}\left[s^2 + \bar{x}^2 + \mu^2 - 2\bar{x}\mu\right]\right)$
  $\qquad \propto R^2\, \sigma^{-n}\, s^{n-3} \exp\!\left(-\dfrac{1}{2\sigma^2}\left[R^2 - 2\mu rR + n\mu^2\right]\right)$
  $\qquad \propto R^{n-1}\, \sigma^{-n}\, v(r) \exp\!\left(-\dfrac{1}{2\sigma^2}\left[R^2 - 2\mu rR + n\mu^2\right]\right)$,

where $v(r) \equiv (n - r^2)^{(n-3)/2}$. Now we remove the $\mu$ dependence by multiplying $p(R, r \mid \mu, \sigma)$ by the prior over $\mu$ and then integrating out $\mu$:

  $f(r, R, \mu, \sigma) \equiv \exp\!\left(-\dfrac{1}{2\sigma^2}\left[R^2 - 2\mu rR + n\mu^2\right]\right) p(\mu)$
  $\qquad \propto \exp\!\left(-\dfrac{1}{2\sigma^2}\left[R^2 - 2\mu rR + (n + 2\sigma^2\alpha)\mu^2\right]\right)$.

Let $\lambda \equiv n + 2\sigma^2\alpha$ and $m \equiv rR/\lambda$; then

  $f(r, R, \mu, \sigma) \propto \exp\!\left(-\dfrac{1}{2\sigma^2}\left[R^2(1 - r^2/\lambda) + \lambda(\mu - m)^2\right]\right)$

and

  $\int f(r, R, \mu, \sigma)\, d\mu \propto \lambda^{-1/2}\, \sigma \exp\!\left(-\dfrac{R^2}{2\sigma^2}(1 - r^2/\lambda)\right)$.

Now, we are interested in what happens as $\alpha, \beta \to 0$ simultaneously. As $\beta \to 0$ we see that the prior over $\sigma$ goes to $p(\sigma) \propto \sigma^{-\gamma-1}$, and as we approach this limit it becomes highly probable that $\sigma$ is very near zero. Recall that $\gamma > 0$ is fixed. Thus as $\alpha, \beta \to 0$ we can assume that $\alpha\sigma^2$ is very small. Then

  $\lambda^{-1} \approx n^{-1}(1 - 2n^{-1}\alpha\sigma^2)$,

hence

  $1 - r^2/\lambda \approx 1 - n^{-1}r^2 + 2n^{-2}r^2\alpha\sigma^2$,

and so

  $\dfrac{R^2}{2\sigma^2}(1 - r^2/\lambda) \approx \dfrac{R^2}{2\sigma^2}(1 - r^2/n) + \dfrac{\alpha r^2 R^2}{n^2}$.

Likewise, we have

  $\lambda^{-1/2} \approx n^{-1/2}(1 - n^{-1}\alpha\sigma^2)$.

So we have

  $\int f(r, R, \mu, \sigma)\, d\mu \propto g(\alpha, \sigma)\, h(\alpha, r, R)\, \sigma \exp\!\left(-\dfrac{R^2}{2\sigma^2}\, w(r)\right)$,

where, for $\alpha, \beta \ll 1$,

  $w(r) \equiv 1 - r^2/n$
  $g(\alpha, \sigma) \equiv 1 - n^{-1}\alpha\sigma^2$
  $h(\alpha, r, R) \equiv \exp(-n^{-2}\alpha r^2 R^2)$.

Note that $-\sqrt{n} \leq r \leq \sqrt{n}$ and so $w(r) \leq 1$. This gives us

  $p(R, r \mid \sigma) \propto R^{n-1}\, v(r)\, h(\alpha, r, R)\, g(\alpha, \sigma)\, \sigma^{-n+1} \exp\!\left(-\dfrac{R^2}{2\sigma^2}\, w(r)\right)$.
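The Jacobian computed above is easy to check numerically: with $R = [n(\bar{x}^2 + s^2)]^{1/2}$ and $r = n\bar{x}/R$, central finite differences should reproduce $\partial(R, r)/\partial(\bar{x}, s^2) = -n^2/(2R^2)$. A sketch at an arbitrary test point:

```python
import math

n = 7
xbar, s2 = 0.8, 1.3   # arbitrary test point

def R(xb, s2v):
    return math.sqrt(n * (xb * xb + s2v))

def r(xb, s2v):
    return n * xb / R(xb, s2v)

def partial(f, xb, s2v, which, eps=1e-6):
    """Central finite-difference partial derivative of f(xbar, s^2)."""
    if which == 'xbar':
        return (f(xb + eps, s2v) - f(xb - eps, s2v)) / (2 * eps)
    return (f(xb, s2v + eps) - f(xb, s2v - eps)) / (2 * eps)

# Determinant of the 2x2 matrix of partials [[R_xbar, R_s2], [r_xbar, r_s2]]
jac = (partial(R, xbar, s2, 'xbar') * partial(r, xbar, s2, 's2')
       - partial(R, xbar, s2, 's2') * partial(r, xbar, s2, 'xbar'))
exact = -0.5 * n**2 / R(xbar, s2)**2   # the analytic value -n^2/(2 R^2)
print(jac, exact)
```

The finite-difference determinant matches the analytic value, supporting the change-of-variables factor $d\bar{x}\, ds^2 = 2n^{-2}R^2\, dr\, dR$.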
Now we remove the $\sigma$ dependence by multiplying by the prior over $\sigma$ and then integrating out $\sigma$:

  $F(\alpha, r, R, \sigma) \equiv p(\sigma)\, g(\alpha, \sigma)\, \sigma^{-n+1} \exp\!\left(-\dfrac{R^2}{2\sigma^2}\, w(r)\right) = F_1(r, R, \sigma) - F_2(\alpha, r, R, \sigma)$

  $F_1(r, R, \sigma) \equiv p(\sigma)\, \sigma^{-n+1} \exp\!\left(-\dfrac{R^2}{2\sigma^2}\, w(r)\right) = Z^{-1}\, \sigma^{-n-\gamma}\, e^{-\beta/\sigma} \exp\!\left(-\dfrac{R^2}{2\sigma^2}\, w(r)\right)$

  $F_2(\alpha, r, R, \sigma) \equiv n^{-1}\alpha\sigma^2\, F_1(r, R, \sigma)$,

where $Z$ is the normalization constant for the prior over $\sigma$. We now note that

  $1 > e^{-\beta/\sigma} > e^{-\beta}\, e^{-\beta/\sigma^2}$.

(Proof: $1 < \sigma + 1/\sigma$, hence $1/\sigma < 1 + 1/\sigma^2$, hence $-\beta/\sigma > -\beta - \beta/\sigma^2$.) This means that

  $Z^{-1}\sigma^{-n-\gamma} \exp\!\left(-\dfrac{R^2}{2\sigma^2} w(r)\right) > F_1(r, R, \sigma) > e^{-\beta}\, Z^{-1}\sigma^{-n-\gamma} \exp\!\left(-\dfrac{R^2}{2\sigma^2} w(r) - \dfrac{\beta}{\sigma^2}\right)$.

Applying the change of variables $\rho = \sigma^2$, with $d\rho = 2\sigma\, d\sigma$, we find

  $(2Z)^{-1} \int_0^\infty \rho^{-(n+\gamma+1)/2} \exp\!\left(-\dfrac{R^2 w(r)}{2\rho}\right) d\rho \;>\; \int_0^\infty F_1(r, R, \sigma)\, d\sigma \;>\; (2Z)^{-1} e^{-\beta} \int_0^\infty \rho^{-(n+\gamma+1)/2} \exp\!\left(-\dfrac{R^2 w(r)/2 + \beta}{\rho}\right) d\rho$.

We've now matched the form of an inverted-gamma distribution for $\rho$, so the integrals are just the corresponding normalization constants:

  $\dfrac{\Gamma((n+\gamma-1)/2)}{2Z\, [R^2 w(r)/2]^{(n+\gamma-1)/2}} \;>\; \int_0^\infty F_1(r, R, \sigma)\, d\sigma \;>\; \dfrac{e^{-\beta}\, \Gamma((n+\gamma-1)/2)}{2Z\, [R^2 w(r)/2 + \beta]^{(n+\gamma-1)/2}}$.

Likewise, we find for $F_2$ that

  $\dfrac{\alpha\, \Gamma((n+\gamma-3)/2)}{2nZ\, [R^2 w(r)/2]^{(n+\gamma-3)/2}} \;>\; \int_0^\infty F_2(\alpha, r, R, \sigma)\, d\sigma \;>\; \dfrac{\alpha\, e^{-\beta}\, \Gamma((n+\gamma-3)/2)}{2nZ\, [R^2 w(r)/2 + \beta]^{(n+\gamma-3)/2}}$.

We see that as $\alpha, \beta \to 0$, if $w(r) \neq 0$ then

  $\int_0^\infty F(\alpha, r, R, \sigma)\, d\sigma \to C_1\, [R^2 w(r)]^{-(n+\gamma-1)/2}$

for some constant $C_1$. Then

  $p(R \mid r) \propto p(R, r) \propto v(r)\, R^{n-1}\, h(\alpha, r, R) \int_0^\infty F(\alpha, r, R, \sigma)\, d\sigma \to C_1\, v(r)\, R^{n-1}\, [R^2 w(r)]^{-(n+\gamma-1)/2} \propto R^{-\gamma}$,

which is improper.

What about the case $w(r) = 0$? This occurs when $r = \pm\sqrt{n}$, that is to say, when all the $x_i$ are identical. As previously mentioned, as $\beta \to 0$ the prior for $\sigma$ becomes concentrated near 0, and so the $x_i$ tend to group closely together in the prior predictive distribution. Thus, the prior predictive distribution for $w(r)$ tends to $\delta(w(r))$, making $w(r) = 0$ a case of interest. For $w(r) = 0$ we obtain

  $F_1(r, R, \sigma) = Z^{-1}\, \sigma^{-n-\gamma}\, e^{-\beta/\sigma}$.

Thus $F(\alpha, r, R, \sigma)$ has no dependence on $r$ or $R$, nor does its integral over $\sigma$, and so we find

  $p(R \mid r) \propto p(R, r) \propto v(r)\, R^{n-1}\, h(\alpha, r, R) \int_0^\infty F(\alpha, r, R, \sigma)\, d\sigma \propto R^{n-1}\, h(\alpha, r, R) \to R^{n-1}$,

which is also improper.

6 Conclusion

What have we learned from this? Certainly, we see that the use of improper priors can be a tricky business. However, avoiding improper priors altogether may be an extreme solution. In a sense, the usual rule for dealing with improper priors, more-or-less explicitly laid out in PTLOS, still protects us from the Marginalization Paradox. That rule has two parts:

- Make sure you carefully follow the rules of probability theory; don't skip steps that seem obvious or unimportant.
- If you ever find yourself trying to evaluate a divergent integral, stop immediately; you must abandon the improper prior and use a proper prior. Optionally, you may do the analysis for a sequence of proper priors and see if the solutions tend to a well-defined limit.

This note does not, of course, constitute a proof that the above simple rule will always protect one from inconsistencies in the use of improper priors. It seems clear that some foundational work remains to be done on the safe use of improper priors. In particular, the following questions need to be answered:

- What, exactly, do we mean when we say that an improper prior is the limit of a particular sequence of proper priors? This is a trickier question than one might think; in particular, it appears that no reasonable definition can be given without reference to some class of likelihood functions with which the prior is to be combined.
- If $\pi_i$, $1 \leq i < \infty$, are a sequence of proper priors whose limit is the improper prior $\pi$, what rules for use of improper priors will guarantee that a solution obtained using $\pi$ is the well-defined limit of the sequence of solutions obtained by use of the priors $\pi_i$?

References

[1] Dawid, A. P., Stone, M., and Zidek, J. V. (1973). Marginalization paradoxes in Bayesian and structural inference (with discussion). J. Roy. Statist. Soc. B 35, 189-233.

[2] Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
More informationOverall Objective Priors
Overall Objective Priors Jim Berger, Jose Bernardo and Dongchu Sun Duke University, University of Valencia and University of Missouri Recent advances in statistical inference: theory and case studies University
More informationDefault priors and model parametrization
1 / 16 Default priors and model parametrization Nancy Reid O-Bayes09, June 6, 2009 Don Fraser, Elisabeta Marras, Grace Yun-Yi 2 / 16 Well-calibrated priors model f (y; θ), F(y; θ); log-likelihood l(θ)
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationStrong Lens Modeling (II): Statistical Methods
Strong Lens Modeling (II): Statistical Methods Chuck Keeton Rutgers, the State University of New Jersey Probability theory multiple random variables, a and b joint distribution p(a, b) conditional distribution
More informationCHAPTER 6 SOME CONTINUOUS PROBABILITY DISTRIBUTIONS. 6.2 Normal Distribution. 6.1 Continuous Uniform Distribution
CHAPTER 6 SOME CONTINUOUS PROBABILITY DISTRIBUTIONS Recall that a continuous random variable X is a random variable that takes all values in an interval or a set of intervals. The distribution of a continuous
More informationHierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31
Hierarchical models Dr. Jarad Niemi Iowa State University August 31, 2017 Jarad Niemi (Iowa State) Hierarchical models August 31, 2017 1 / 31 Normal hierarchical model Let Y ig N(θ g, σ 2 ) for i = 1,...,
More informationGibbs Sampling in Endogenous Variables Models
Gibbs Sampling in Endogenous Variables Models Econ 690 Purdue University Outline 1 Motivation 2 Identification Issues 3 Posterior Simulation #1 4 Posterior Simulation #2 Motivation In this lecture we take
More informationIntroduction to Maximum Likelihood Estimation
Introduction to Maximum Likelihood Estimation Eric Zivot July 26, 2012 The Likelihood Function Let 1 be an iid sample with pdf ( ; ) where is a ( 1) vector of parameters that characterize ( ; ) Example:
More informationDavid Giles Bayesian Econometrics
David Giles Bayesian Econometrics 1. General Background 2. Constructing Prior Distributions 3. Properties of Bayes Estimators and Tests 4. Bayesian Analysis of the Multiple Regression Model 5. Bayesian
More informationChapter 5. Chapter 5 sections
1 / 43 sections Discrete univariate distributions: 5.2 Bernoulli and Binomial distributions Just skim 5.3 Hypergeometric distributions 5.4 Poisson distributions Just skim 5.5 Negative Binomial distributions
More informationSAMPLE CHAPTER. Avi Pfeffer. FOREWORD BY Stuart Russell MANNING
SAMPLE CHAPTER Avi Pfeffer FOREWORD BY Stuart Russell MANNING Practical Probabilistic Programming by Avi Pfeffer Chapter 9 Copyright 2016 Manning Publications brief contents PART 1 INTRODUCING PROBABILISTIC
More informationAlgebra Exam. Solutions and Grading Guide
Algebra Exam Solutions and Grading Guide You should use this grading guide to carefully grade your own exam, trying to be as objective as possible about what score the TAs would give your responses. Full
More informationSequences and infinite series
Sequences and infinite series D. DeTurck University of Pennsylvania March 29, 208 D. DeTurck Math 04 002 208A: Sequence and series / 54 Sequences The lists of numbers you generate using a numerical method
More informationHypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3
Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest
More informationThe exponential family: Conjugate priors
Chapter 9 The exponential family: Conjugate priors Within the Bayesian framework the parameter θ is treated as a random quantity. This requires us to specify a prior distribution p(θ), from which we can
More informationHarrison B. Prosper. CMS Statistics Committee
Harrison B. Prosper Florida State University CMS Statistics Committee 08-08-08 Bayesian Methods: Theory & Practice. Harrison B. Prosper 1 h Lecture 3 Applications h Hypothesis Testing Recap h A Single
More informationPhysics 403. Segev BenZvi. Choosing Priors and the Principle of Maximum Entropy. Department of Physics and Astronomy University of Rochester
Physics 403 Choosing Priors and the Principle of Maximum Entropy Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Odds Ratio Occam Factors
More informationMore Spectral Clustering and an Introduction to Conjugacy
CS8B/Stat4B: Advanced Topics in Learning & Decision Making More Spectral Clustering and an Introduction to Conjugacy Lecturer: Michael I. Jordan Scribe: Marco Barreno Monday, April 5, 004. Back to spectral
More informationBayesian Inference. STA 121: Regression Analysis Artin Armagan
Bayesian Inference STA 121: Regression Analysis Artin Armagan Bayes Rule...s! Reverend Thomas Bayes Posterior Prior p(θ y) = p(y θ)p(θ)/p(y) Likelihood - Sampling Distribution Normalizing Constant: p(y
More information1 Data Arrays and Decompositions
1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is
More informationHypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33
Hypothesis Testing Econ 690 Purdue University Justin L. Tobias (Purdue) Testing 1 / 33 Outline 1 Basic Testing Framework 2 Testing with HPD intervals 3 Example 4 Savage Dickey Density Ratio 5 Bartlett
More informationPhysics 225 Relativity and Math Applications. Fall Unit 7 The 4-vectors of Dynamics
Physics 225 Relativity and Math Applications Fall 2011 Unit 7 The 4-vectors of Dynamics N.C.R. Makins University of Illinois at Urbana-Champaign 2010 Physics 225 7.2 7.2 Physics 225 7.3 Unit 7: The 4-vectors
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 8 10/1/2008 CONTINUOUS RANDOM VARIABLES
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 8 10/1/2008 CONTINUOUS RANDOM VARIABLES Contents 1. Continuous random variables 2. Examples 3. Expected values 4. Joint distributions
More informationGAUSSIAN PROCESS REGRESSION
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The
More informationOne sided tests. An example of a two sided alternative is what we ve been using for our two sample tests:
One sided tests So far all of our tests have been two sided. While this may be a bit easier to understand, this is often not the best way to do a hypothesis test. One simple thing that we can do to get
More informationLimiting Distributions
Limiting Distributions We introduce the mode of convergence for a sequence of random variables, and discuss the convergence in probability and in distribution. The concept of convergence leads us to the
More informationUniversity of Toronto
A Limit Result for the Prior Predictive by Michael Evans Department of Statistics University of Toronto and Gun Ho Jang Department of Statistics University of Toronto Technical Report No. 1004 April 15,
More informationLesson 28: A Focus on Square Roots
now Lesson 28: A Focus on Square Roots Student Outcomes Students solve simple radical equations and understand the possibility of extraneous solutions. They understand that care must be taken with the
More informationPart 2: One-parameter models
Part 2: One-parameter models 1 Bernoulli/binomial models Return to iid Y 1,...,Y n Bin(1, ). The sampling model/likelihood is p(y 1,...,y n ) = P y i (1 ) n P y i When combined with a prior p( ), Bayes
More informationHOW TO CREATE A PROOF. Writing proofs is typically not a straightforward, algorithmic process such as calculating
HOW TO CREATE A PROOF ALLAN YASHINSKI Abstract We discuss how to structure a proof based on the statement being proved Writing proofs is typically not a straightforward, algorithmic process such as calculating
More informationThis exam is closed book and closed notes. (You will have access to a copy of the Table of Common Distributions given in the back of the text.
TEST #3 STA 5326 December 4, 214 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. (You will have access to
More informationProf. Thistleton MAT 505 Introduction to Probability Lecture 13
Prof. Thistleton MAT 55 Introduction to Probability Lecture 3 Sections from Text and MIT Video Lecture: Sections 5.4, 5.6 http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-4- probabilisticsystems-analysis-and-applied-probability-fall-2/video-lectures/lecture-8-continuousrandomvariables/
More informationStat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.
Stat 535 C - Statistical Computing & Monte Carlo Methods Arnaud Doucet Email: arnaud@cs.ubc.ca 1 Suggested Projects: www.cs.ubc.ca/~arnaud/projects.html First assignement on the web: capture/recapture.
More informationProbability, Entropy, and Inference / More About Inference
Probability, Entropy, and Inference / More About Inference Mário S. Alvim (msalvim@dcc.ufmg.br) Information Theory DCC-UFMG (2018/02) Mário S. Alvim (msalvim@dcc.ufmg.br) Probability, Entropy, and Inference
More informationParametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a
Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a
More informationPart 4: Multi-parameter and normal models
Part 4: Multi-parameter and normal models 1 The normal model Perhaps the most useful (or utilized) probability model for data analysis is the normal distribution There are several reasons for this, e.g.,
More informationCommon ontinuous random variables
Common ontinuous random variables CE 311S Earlier, we saw a number of distribution families Binomial Negative binomial Hypergeometric Poisson These were useful because they represented common situations:
More information1.4 Mathematical Equivalence
1.4 Mathematical Equivalence Introduction a motivating example sentences that always have the same truth values can be used interchangeably the implied domain of a sentence In this section, the idea of
More information1 A simple example. A short introduction to Bayesian statistics, part I Math 217 Probability and Statistics Prof. D.
probabilities, we ll use Bayes formula. We can easily compute the reverse probabilities A short introduction to Bayesian statistics, part I Math 17 Probability and Statistics Prof. D. Joyce, Fall 014 I
More informationSecond-Order Homogeneous Linear Equations with Constant Coefficients
15 Second-Order Homogeneous Linear Equations with Constant Coefficients A very important class of second-order homogeneous linear equations consists of those with constant coefficients; that is, those
More information