The Jeffreys Prior
Yingbo Li
Clemson University
MATH 9810
Sir Harold Jeffreys
English mathematician, statistician, geophysicist, and astronomer. His book Theory of Probability, which first appeared in 1939, played an important role in the revival of the Bayesian view of probability.
Harold Jeffreys (1891-1989)
Information Matrix
Suppose data $X$ has density $f(x \mid \theta)$ which is twice differentiable in the coordinates of the unknown parameter $\theta = (\theta_1, \ldots, \theta_p)$.
Expected Fisher information: the $p \times p$ matrix $I$ with $(j,k)$ entry
\[ I_{j,k}(\theta) = -E_{X \mid \theta}\left[ \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(X \mid \theta) \right] \]
\[ \stackrel{\text{if indpt.}}{=} -\sum_{i=1}^{n} E_{X_i \mid \theta}\left[ \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f_i(X_i \mid \theta) \right] \]
\[ \stackrel{\text{if i.i.d.}}{=} -n\, E_{X_1 \mid \theta}\left[ \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log f(X_1 \mid \theta) \right] \]
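The i.i.d. identity above can be checked numerically. A minimal sketch (not from the slides; the function name `bernoulli_fisher_info` is hypothetical), taking Bernoulli data as the example, where the per-observation second derivative of the log-density is $-x/\theta^2 - (1-x)/(1-\theta)^2$ and the expectation over $x \in \{0,1\}$ can be computed exactly:

```python
def bernoulli_fisher_info(theta, n=1):
    """Expected Fisher information I(theta) for n i.i.d. Bernoulli(theta) draws.

    d^2/dtheta^2 log f(x|theta) = -x/theta^2 - (1-x)/(1-theta)^2,
    so -E[.] with P(x=1)=theta is 1/theta + 1/(1-theta) = 1/(theta(1-theta)).
    """
    per_obs = theta / theta**2 + (1 - theta) / (1 - theta)**2
    return n * per_obs  # i.i.d. case: n times the per-observation information

theta = 0.3
# matches the closed form n / (theta * (1 - theta))
assert abs(bernoulli_fisher_info(theta, n=10) - 10 / (theta * (1 - theta))) < 1e-9
```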
Jeffreys-rule Prior
Derived to be invariant under one-to-one transformations, Jeffreys proposed
\[ p(\theta) \propto \sqrt{\det I(\theta)}, \]
where $I(\theta)$ is the Fisher information. This is called the Jeffreys-rule prior (often incorrectly shortened to the Jeffreys prior).
If $\psi$ is a one-to-one transformation of $\theta$, with Jacobian $J_{ij} = \partial\theta_i / \partial\psi_j$, then $p(\psi) = p(\theta)\,|\det J|$, and
\[ I^*(\psi) = J\, I(\theta)\, J^T \;\Rightarrow\; \det I^*(\psi) = [\det I(\theta)]\,(\det J)^2, \]
so the Jeffreys-rule prior $p(\psi) \propto \sqrt{\det I^*(\psi)}$ is invariant.
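The invariance argument can be illustrated numerically. A sketch (not from the slides; helper names are hypothetical) for the Bernoulli parameter $\theta$ and its log-odds $\psi = \log(\theta/(1-\theta))$, where $I(\theta) = 1/(\theta(1-\theta))$ and the Jacobian is $d\theta/d\psi = \theta(1-\theta)$; the change-of-variables route and the direct Jeffreys-rule route give the same prior in $\psi$:

```python
import math

def theta_of(psi):
    """Inverse logit: theta as a function of the log-odds psi."""
    return 1 / (1 + math.exp(-psi))

def jeffreys_theta(theta):
    """Jeffreys-rule prior sqrt(I(theta)) for Bernoulli, up to a constant."""
    return (theta * (1 - theta)) ** -0.5

psi = 0.7
theta = theta_of(psi)
jac = theta * (1 - theta)  # dtheta/dpsi for the logit transformation

# Route 1: change of variables, p(psi) = p(theta) * |dtheta/dpsi|
p_psi_cov = jeffreys_theta(theta) * jac

# Route 2: Jeffreys rule applied directly in psi: I*(psi) = J I(theta) J
p_psi_direct = math.sqrt(jac * (1 / (theta * (1 - theta))) * jac)

assert abs(p_psi_cov - p_psi_direct) < 1e-12  # the two routes agree
```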
Example: Jeffreys Prior for Bernoulli Data
The Jeffreys prior for i.i.d. Bernoulli data $X_i \mid \theta \stackrel{\text{iid}}{\sim} \text{Ber}(\theta)$ is
\[ p(\theta) \propto \frac{1}{\sqrt{\theta(1-\theta)}} = \text{Beta}(1/2, 1/2) \]
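As a check (not from the slides) that this unnormalized prior really is Beta(1/2, 1/2): its integral over $(0,1)$ should equal the Beta function $B(1/2,1/2) = \pi$. The substitution $\theta = \sin^2 u$ removes the endpoint singularities, and a midpoint sum recovers $\pi$:

```python
import math

def unnorm_jeffreys(theta):
    """Unnormalized Jeffreys prior for Bernoulli: theta^{-1/2} (1-theta)^{-1/2}."""
    return theta**-0.5 * (1 - theta)**-0.5

# Substituting theta = sin^2(u), the integrand times dtheta becomes exactly
# 2 du on (0, pi/2), so the integral is B(1/2,1/2) = pi.
N = 100_000
total = 0.0
for k in range(N):
    u = (k + 0.5) * (math.pi / 2) / N          # midpoint rule on (0, pi/2)
    theta = math.sin(u) ** 2
    dtheta = 2 * math.sin(u) * math.cos(u) * (math.pi / 2) / N
    total += unnorm_jeffreys(theta) * dtheta

assert abs(total - math.pi) < 1e-6  # normalizing constant is pi = B(1/2, 1/2)
```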
Note: Jeffreys would often deviate from using the Jeffreys-rule prior (so "Jeffreys prior" is the name best used for the priors he actually recommended).
For a Poisson mean $\lambda$, he curiously recommended $p(\lambda) \propto 1/\lambda$ instead of the Jeffreys-rule prior $p(\lambda) \propto 1/\sqrt{\lambda}$, even though $p(\lambda) \propto 1/\lambda$ does not work (it yields an improper posterior) when $x = 0$ is observed.
For normal problems with unknown variance, he recommended the independent Jeffreys prior.
Example: Normal Data
Suppose observations $X_i \mid \mu, \sigma^2 \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2)$, for $i = 1, 2, \ldots, n$. Denote the parameters $\theta = (\mu, \sigma^2)$. Then the Fisher information is
\[ I(\theta) = \begin{pmatrix} n/\sigma^2 & 0 \\ 0 & n/(2\sigma^4) \end{pmatrix} \]
Jeffreys-rule prior: $p(\theta) \propto (1/\sigma^6)^{1/2} = 1/\sigma^3$
Independent Jeffreys prior: $p(\theta) \propto 1/\sigma^2$ (ultimately recommended by Jeffreys)
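A small numerical sketch of the Jeffreys-rule computation above (not from the slides; `sqrt_det_fisher` is a hypothetical helper): taking the determinant of the diagonal Fisher matrix and its square root, the result scales as $(\sigma^2)^{-3/2} = 1/\sigma^3$, so multiplying by $(\sigma^2)^{3/2}$ should give a constant:

```python
def sqrt_det_fisher(n, sigma2):
    """sqrt(det I(theta)) for N(mu, sigma^2) with n observations."""
    det = (n / sigma2) * (n / (2 * sigma2**2))  # product of diagonal entries
    return det ** 0.5

n = 10
# If sqrt(det I) ∝ 1/sigma^3 = (sigma^2)^{-3/2}, this ratio is constant:
ratio = [sqrt_det_fisher(n, s2) * s2**1.5 for s2 in (0.5, 1.0, 4.0)]
assert max(ratio) - min(ratio) < 1e-9
```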
Strengths of the Jeffreys-rule Prior
- Almost always defined
- Invariant to transformations
- Almost always yields a proper posterior (mixture models are the main known examples in which improper posteriors result)
- Great for one-dimensional parameters
- Arises from many other approaches, such as minimum description length and various entropy approaches
Weaknesses of the Jeffreys-rule Prior
- It depends on the statistical model, and hence appears to violate the likelihood principle; but all reasonable objective Bayesian theories do this.
  Example: suppose $X$ is negative binomial, i.e.
  \[ p(x \mid r, \theta) = \frac{\Gamma(x + r)}{\Gamma(x + 1)\Gamma(r)}\, \theta^r (1 - \theta)^x, \quad \text{for } x = 0, 1, \ldots \]
  The Jeffreys-rule prior is $p(\theta) \propto \dfrac{1}{\theta \sqrt{1 - \theta}}$, which differs from the Beta(1/2, 1/2) obtained for binomial sampling.
- It requires the Fisher information to exist. Example of non-existence: the Uniform$[0, \theta]$ distribution.
- Often fails badly for higher-dimensional parameters
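The negative-binomial claim can be verified by Monte Carlo. A sketch (not from the slides; `nb_fisher_mc` is a hypothetical helper) assuming $r$ is fixed and known: the Fisher information is $I(\theta) = r/(\theta^2(1-\theta))$, whose square root gives the prior $\theta^{-1}(1-\theta)^{-1/2}$ stated above, even though the likelihood in $\theta$ is proportional to the binomial one:

```python
import random

def nb_fisher_mc(r, theta, reps=100_000, seed=1):
    """Monte Carlo estimate of -E[d^2/dtheta^2 log p(x | r, theta)] for NB data."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        # Draw x ~ NB(r, theta): number of failures before the r-th success.
        x, successes = 0, 0
        while successes < r:
            if rng.random() < theta:
                successes += 1
            else:
                x += 1
        # -d^2/dtheta^2 [r log(theta) + x log(1-theta)] = r/theta^2 + x/(1-theta)^2
        total += r / theta**2 + x / (1 - theta)**2
    return total / reps

r, theta = 3, 0.4
exact = r / (theta**2 * (1 - theta))  # claimed closed form
assert abs(nb_fisher_mc(r, theta) - exact) / exact < 0.02
```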
Example of Failure: the Neyman-Scott Problem
Suppose we observe $X_{ij} \sim N(\mu_i, \sigma^2)$, $i = 1, \ldots, n$; $j = 1, 2$.
Defining $\bar{X}_i = (X_{i1} + X_{i2})/2$, $\bar{X} = (\bar{X}_1, \ldots, \bar{X}_n)$, $S^2 = \sum_{i=1}^n (X_{i1} - X_{i2})^2$, and $\mu = (\mu_1, \ldots, \mu_n)$, the likelihood function is
\[ p(X \mid \mu, \sigma^2) \propto \frac{1}{\sigma^{2n}} \exp\left[ -\frac{1}{\sigma^2} \left( \|\bar{X} - \mu\|^2 + \frac{S^2}{4} \right) \right] \]
The Fisher information matrix is
\[ I(\mu, \sigma^2) = \text{diag}\{2/\sigma^2, \ldots, 2/\sigma^2, n/\sigma^4\}, \]
the last entry corresponding to the information for $\sigma^2$.
Jeffreys-rule prior: $p(\mu, \sigma^2) \propto 1/\sigma^{n+2}$
The Jeffreys-rule prior $p(\mu, \sigma^2) \propto 1/\sigma^{n+2}$ is bad. For instance, the marginal posterior for $\sigma^2$ is
\[ p(\sigma^2 \mid X) \propto \left( \frac{1}{\sigma^2} \right)^{n+1} \exp\left( -\frac{S^2}{4\sigma^2} \right). \]
Here $S^2$ depends on $X$. The posterior mean is
\[ E(\sigma^2 \mid X) = \frac{S^2}{4(n-1)} \]
Suppose the data $X$ are generated from the sampling distribution under the true parameter $\sigma_0^2$. Because the frequentist distribution of $S^2/(2\sigma_0^2)$ is chi-squared with $n$ degrees of freedom, as $n \to \infty$, the posterior mean $S^2/[4(n-1)] \to \sigma_0^2/2$.
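This inconsistency is easy to see in simulation. A sketch (not from the slides; `posterior_mean_sigma2` is a hypothetical helper): generate Neyman-Scott data with true $\sigma_0^2 = 1$, compute $S^2$ and the posterior mean $S^2/(4(n-1))$, and observe that for large $n$ it sits near $\sigma_0^2/2 = 0.5$ rather than $1$:

```python
import random

def posterior_mean_sigma2(n, sigma0=1.0, seed=42):
    """E(sigma^2 | X) = S^2 / (4(n-1)) under the Jeffreys-rule prior,
    for simulated Neyman-Scott data with n groups of two observations."""
    rng = random.Random(seed)
    s2 = 0.0
    for _ in range(n):
        mu_i = rng.gauss(0, 5)              # arbitrary group means
        x1 = rng.gauss(mu_i, sigma0)
        x2 = rng.gauss(mu_i, sigma0)
        s2 += (x1 - x2) ** 2                # S^2 = sum_i (X_i1 - X_i2)^2
    return s2 / (4 * (n - 1))

est = posterior_mean_sigma2(100_000)
# est is near sigma0^2 / 2 = 0.5, far from the true value sigma0^2 = 1
assert abs(est - 0.5) < 0.05
```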
Thus under the Jeffreys-rule prior, the posterior distribution of $\sigma^2$ is inconsistent, i.e., it does not concentrate around the true value $\sigma_0^2$.
The independent Jeffreys prior is fine here. There also exist cases where the independent Jeffreys prior does not work.