1 / 16  Default priors and model parametrization
Nancy Reid, O-Bayes09, June 6, 2009
with Don Fraser, Elisabeta Marras, Grace Yun-Yi
2 / 16  Well-calibrated priors
- model f(y; θ), F(y; θ); log-likelihood ℓ(θ) = log f(y; θ)
- can we find priors that are guaranteed to be well calibrated, at least approximately?
- what structure do such priors need to have?
- can we do this for all component parameters simultaneously?
- Welch & Peers, 1963: π(θ)dθ ∝ i^{1/2}(θ)dθ, with expected Fisher information (matrix) i(θ) = E{−ℓ''(θ)}
- Peers, 1965; Tibshirani, 1989: parameter of interest θ = (ψ, λ): π(θ)dθ ∝ i_{ψψ}^{1/2}(θ) g(λ) dθ
- matching priors derived using Edgeworth expansions
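The Welch–Peers result above can be illustrated by simulation. The sketch below (not from the slides; sample size, rate, and nominal level are illustrative choices) uses an Exponential(rate θ) sample, where i(θ) = n/θ², so the Welch–Peers prior is ∝ 1/θ and the posterior is Gamma(n, rate = Σy_i). Because this is a scale model, the frequentist matching here is exact rather than merely second order.

```python
import numpy as np
from scipy import stats

# Welch-Peers prior for an Exponential(rate theta) sample: i(theta) = n/theta^2,
# so pi(theta) ∝ 1/theta.  Posterior: theta | y ~ Gamma(n, rate = sum(y)).
# Check that one-sided posterior credible bounds have the advertised
# frequentist coverage (exact here, since this is a scale model).
rng = np.random.default_rng(1)
n, theta0, nominal, reps = 5, 2.0, 0.9, 4000
covered = 0
for _ in range(reps):
    s = rng.exponential(scale=1.0 / theta0, size=n).sum()
    # 90% upper posterior credible bound for theta
    upper = stats.gamma.ppf(nominal, a=n, scale=1.0 / s)
    covered += theta0 <= upper
print(covered / reps)  # close to 0.90
```

The same experiment with a non-matching prior (say, flat on θ) would show coverage drifting away from the nominal level for small n.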
4 / 16  Approximate location models
- location model Y ~ f(y − θ): π(θ)dθ ∝ dθ
- location model: shifting θ → θ + dθ, y → y + dθ leaves F(y; θ) unchanged, i.e. dF(y; θ) = 0  (scalar y, θ)
- general model: require dF(y⁰; θ) = 0 at the observed sample point y⁰:
    F_y(y⁰; θ)dy + F_{;θ}(y⁰; θ)dθ = 0,  so  dy = −{F_{;θ}(y⁰; θ)/F_y(y⁰; θ)}dθ = V(θ)dθ  (scalar y, vector θ)
- sample y_1, …, y_n:  dy_i = V_i(θ)dθ  (i.i.d. Y_i, vector θ)
5 / 16  Default prior
- dy = V(θ)dθ, with V(θ) = {V_1(θ), …, V_n(θ)} an n × p matrix
- possible default prior: π(θ)dθ ∝ |V'(θ)V(θ)|^{1/2} dθ
- convert to maximum-likelihood coordinates: differentiating the score equation ℓ_θ(θ̂; y) = 0 gives ℓ_{θθ}(θ̂; y)dθ̂ + ℓ_{θ;y}(θ̂; y)dy = 0
- hence dθ̂ = ĵ^{−1}H dy = ĵ^{−1}H V(θ)dθ
- proposed default prior: π(θ)dθ ∝ |ĵ^{−1}H V(θ)| dθ, with ĵ = j(θ̂⁰) the observed Fisher information
6 / 16  ... default prior
- V(θ) links dy to dθ; ĵ^{−1}H links dy to dθ̂
- default prior: π(θ)dθ ∝ |ĵ^{−1}H V(θ)| dθ
- gives the right-invariant prior for the transformation parameter in transformation models
- provides an extension of the right-invariant prior to general (continuous) models
- V(θ) = dy/dθ = −F_{;θ}(y; θ)/f(y; θ) |_{y = y⁰}
- H = H(y⁰; θ̂⁰) = ∂²ℓ(θ; y)/∂θ∂y |_{(θ̂⁰, y⁰)}
- ĵ = j(θ̂⁰) = −ℓ_{θθ}(θ̂⁰)
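The three ingredients V(θ), H, and ĵ can be assembled numerically. The sketch below (the data vector y⁰ is an illustrative choice, not from the slides) does this for a scalar Exponential(rate θ) model, where V_i(θ) = −y_i/θ, H_i = ℓ_{θ;y_i} = −1, and ĵ = n/θ̂², so the construction recovers the right-invariant scale prior ∝ 1/θ.

```python
import numpy as np

# Numerical sketch of the default-prior construction pi(theta) ∝ |jhat^{-1} H V(theta)|
# for an Exponential(rate theta) model; the data y0 are illustrative.
y0 = np.array([0.3, 1.1, 0.7, 2.0, 0.9])
n = y0.size
thetahat = n / y0.sum()                      # MLE of the rate

def V(theta, eps=1e-6):
    # V_i = -F_{;theta}(y_i; theta) / f(y_i; theta), by central differences on F
    F = lambda y, th: 1.0 - np.exp(-th * y)
    dF_dtheta = (F(y0, theta + eps) - F(y0, theta - eps)) / (2 * eps)
    f = theta * np.exp(-theta * y0)
    return -dF_dtheta / f                    # analytically: -y0 / theta

H = -np.ones(n)                              # l_{theta;y_i} = -1 for this model
jhat = n / thetahat**2                       # observed Fisher information at thetahat

def default_prior(theta):
    return abs(H @ V(theta) / jhat)

# The resulting prior is proportional to 1/theta (the right-invariant prior):
ratio = default_prior(2.0) / default_prior(0.5)
print(ratio)  # close to 0.25
```

Here the default prior and the Jeffreys prior coincide; slide 9 shows a Gamma-shape example where they do not.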
7 / 16  Example: regression
- y_i = x_i'β + σε_i, i = 1, …, n; θ = (β, σ²); write z⁰(θ) = (y⁰ − Xβ)/σ and ẑ⁰ = z⁰(θ̂⁰)
- V(θ) = dy/dθ = {X, z⁰(θ)/(2σ)}
- H = (X'/σ̂², ẑ⁰'/σ̂³),  ĵ = diag{X'X/σ̂², n/(2σ̂⁴)}
- W(θ) = ĵ^{−1}H V(θ)
    = [ I            (X'X)^{−1}X'z⁰(θ)/(2σ) ]
      [ 2σ̂ẑ⁰'X/n     σ̂ẑ⁰'z⁰(θ)/(nσ)        ]
    = [ I    (β̂⁰ − β)/(2σ²) ]
      [ 0    σ̂²/σ²          ]
8 / 16  ... regression
- y ~ N(Xβ, σ²), θ = (β, σ²)
- V(θ) = {X, (y⁰ − Xβ)/(2σ²)}: the design matrix and the residual direction, from y = Xβ + σε
- dβ̂ = dβ + (β̂⁰ − β)dσ²/(2σ²),  dσ̂² = σ̂² dσ²/σ²
- π(θ)dθ ∝ dβ dσ²/σ² ∝ dβ dσ/σ
- nonlinear regression, y ~ N{x(β), σ²}, leads to dβ̆ dσ/σ, with β̆ the coordinates on the tangent plane ẋ(β̂⁰)(β − β̂⁰)
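The calibration of the prior dβ dσ/σ for the normal linear model can be checked directly. The sketch below (dimensions, coefficients, and error scale are illustrative choices) uses the fact that under this prior the marginal posterior of σ² satisfies RSS/σ² ~ χ²_{n−p}, which coincides with the exact frequentist pivot, so one-sided credible bounds are exactly calibrated.

```python
import numpy as np
from scipy import stats

# Under pi(beta, sigma) d beta d sigma / sigma, the posterior for sigma^2 in the
# normal linear model gives RSS/sigma^2 ~ chi^2_{n-p} -- the same distribution
# as the frequentist pivot, so coverage of credible bounds is exact.
rng = np.random.default_rng(2)
n, p, sigma0, nominal, reps = 20, 3, 1.5, 0.95, 2000
X = rng.standard_normal((n, p))
beta0 = np.array([1.0, -2.0, 0.5])
covered = 0
for _ in range(reps):
    y = X @ beta0 + sigma0 * rng.standard_normal(n)
    betahat, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(rss[0])
    # 95% upper posterior credible bound for sigma^2
    upper = rss / stats.chi2.ppf(1 - nominal, df=n - p)
    covered += sigma0**2 <= upper
print(covered / reps)  # close to 0.95
```

Any deviation from the nominal level here is pure Monte Carlo noise; a prior flat in σ² instead would shift the coverage away from 0.95.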
9 / 16  Scalar parameter
- the default prior is data dependent: it changes with y⁰; based on an approximate local location model
- Jeffreys prior π_J(θ)dθ ∝ i^{1/2}(θ)dθ gives frequentist matching to second order; depends on the model, not the data: an unconditional prior
[Figure: flat priors for the Gamma shape parameter θ, comparing the default (data-dependent) prior with the Jeffreys prior.]
10 / 16  Targetted priors
- if the parameter of interest is curved, then the prior needs to be targetted on the parameter of interest
- marginalization paradox (Dawid, Stone and Zidek, 1973)
- Example: normal circle (Stein): y_i ~ N(µ_i, 1/n), i = 1, …, k; ψ = (Σµ_i²)^{1/2}
- default flat prior: π(µ)dµ ∝ dµ
- posterior: nψ² | y ~ χ²_k(n‖y‖²); exact sampling distribution: n‖y‖² ~ χ²_k(nψ²)
[Figure: "Normal circle, k = 2, 5, 10": posterior p-value as a function of ψ.]
11 / 16  ... normal circle
[Figure: "Normal circle, k = 2, 5, 10": posterior p-value against ψ, and actual coverage against nominal probability.]
- Bayes − frequentist discrepancy: Φ{(k − 1)/(ψ√n)}
- not fixed by a hierarchy of priors
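The miscalibration in the normal-circle example is easy to reproduce. The sketch below (the choices k = 10, n = 4, ψ₀ = 2 and a µ on the first axis are illustrative) evaluates the flat-prior posterior CDF of ψ at the true value over repeated samples; if the posterior were well calibrated these values would be uniform, but they pile up in the lower tail.

```python
import numpy as np
from scipy import stats

# Normal circle: y_i ~ N(mu_i, 1/n), flat prior on mu, psi = ||mu||.
# Under the flat prior, n*psi^2 | y ~ noncentral chi^2_k with noncentrality
# n*||y||^2.  Evaluate the posterior CDF at the true psi0 across replications.
rng = np.random.default_rng(3)
k, n, psi0, reps = 10, 4, 2.0, 1000
mu = np.zeros(k); mu[0] = psi0           # any mu with ||mu|| = psi0
cdf_at_truth = np.empty(reps)
for i in range(reps):
    y = mu + rng.standard_normal(k) / np.sqrt(n)
    lam = n * np.sum(y**2)               # posterior noncentrality parameter
    cdf_at_truth[i] = stats.ncx2.cdf(n * psi0**2, df=k, nc=lam)
frac_low = np.mean(cdf_at_truth <= 0.05)
print(frac_low)  # far above the nominal 0.05: flat prior badly miscalibrated
```

The effect grows with k, consistent with the (k − 1)/(ψ√n) discrepancy noted on the slide: ‖y‖² overestimates ψ² by about k/n, and marginalizing a flat prior on µ does not correct for this.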
12 / 16  Targetted priors: strong matching
- use Laplace approximations to the posterior survivor function and to the frequentist p-value
- the common structure of the approximations makes comparison easy:
    s(ψ) ≐ Φ{r + (1/r) log(q_B/r)}: Bayesian survivor value
    p(ψ) ≐ Φ{r + (1/r) log(q_F/r)}: frequentist p-value
    r = r(ψ) = ± [2{ℓ(θ̂) − ℓ(ψ, λ̂_ψ)}]^{1/2}
- q_B contains the prior; q_F involves various information functions and sample-space derivatives
- setting q_B = q_F gives π(θ̂)/π(θ̂_ψ) = …: a default prior along the curve θ = θ̂_ψ (Fraser & Reid, 2002)
- need to extend to the full parameter space
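A scalar sketch of the Bayesian tail approximation s(ψ) ≐ Φ{r + (1/r)log(q_B/r)}: for a scalar parameter with no nuisance part, q_B reduces to −ℓ'(ψ) ĵ^{−1/2} π(θ̂)/π(ψ). The numbers below (n = 10, s = 12, ψ = 1.2) are illustrative; the model is Exponential(rate θ) with prior π(θ) ∝ 1/θ, so the exact posterior is Gamma(n, rate = s) and the approximation can be checked directly.

```python
import numpy as np
from scipy import stats

# Scalar Laplace tail approximation Phi(r + (1/r) log(q_B/r)) for the posterior
# CDF, checked against the exact Gamma posterior (illustrative n, s, psi).
n, s = 10, 12.0
thetahat = n / s
jhat = n / thetahat**2                   # observed information at the MLE

def loglik(th):
    return n * np.log(th) - th * s

def posterior_cdf_approx(psi):
    r = np.sign(psi - thetahat) * np.sqrt(2 * (loglik(thetahat) - loglik(psi)))
    score = n / psi - s                  # l'(psi)
    # q_B = -l'(psi) * jhat^{-1/2} * pi(thetahat)/pi(psi); prior 1/theta gives
    # prior ratio psi/thetahat
    q = -score / np.sqrt(jhat) * (psi / thetahat)
    rstar = r + np.log(q / r) / r
    return stats.norm.cdf(rstar)

exact = stats.gamma.cdf(1.2, a=n, scale=1.0 / s)   # Pr(theta <= 1.2 | y)
approx = posterior_cdf_approx(1.2)
print(exact, approx)  # agree to roughly three decimals
```

Even at n = 10 the approximation tracks the exact posterior tail closely, which is what makes the comparison of q_B with q_F on the next slide informative about the prior.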
13 / 16  ... details
- q_B = ℓ_ψ(θ̂_ψ) {|j_{λλ}(θ̂_ψ)|^{1/2} / |j(θ̂)|^{1/2}} π(θ̂)/π(θ̂_ψ)
- q_F = {|ℓ_{;V}(θ̂) − ℓ_{;V}(θ̂_ψ), ℓ_{λ;V}(θ̂_ψ)| / |ℓ_{θ;V}(θ̂)|} {|j(θ̂)|^{1/2} / |j_{λλ}(θ̂_ψ)|^{1/2}}
- setting q_B = q_F gives
    π(θ̂_ψ)/π(θ̂) ∝ ℓ_ψ(θ̂_ψ) |ℓ_{θ;V}(θ̂)| |j_{λλ}(θ̂_ψ)| / {|ℓ_{;V}(θ̂) − ℓ_{;V}(θ̂_ψ), ℓ_{λ;V}(θ̂_ψ)| |j(θ̂)|}
  along the profile curve C_ψ = {θ : θ = (ψ, λ̂_ψ)}
- based on the exponential-family approximation at y⁰
- use observed information off the curve to extend the prior to the full parameter space, to get a version that, when integrated via Laplace, brings us back to the frequentist r* approximation:
    π(θ) ∝ |j_{(θθ)}(θ̂_ψ)|^{1/2} / |j_{(λλ)}(θ)|^{1/2}
14 / 16  ... details
- any continuous model can be approximated to O(n^{−1}) by an exponential-family model (at y⁰)
- canonical parameter: φ(θ) = ℓ_{;V}(θ; y⁰) = Σ_i ℓ_{y_i}(θ; y⁰) V_i(θ̂⁰)
- V(θ) is the same matrix as in the default prior
- connection through the location-model approximation; ancillarity; flat prior
15 / 16  Conclusions
- calibrated priors are data dependent
- focus motivated by asymptotic theory for likelihood inference
- reference priors are also targetted on the parameter of interest
- marginalization to curved parameters using flat priors may lead to poorly calibrated inferences
- hierarchical Poisson models: E(y_{ij}) = c₀ x_{ij} exp(µ + α_i + β_j + γ_{ij}), with non-informative uniform priors on µ, α, σ_β, σ_γ
- difficulties and opportunities with new large data sets
- checking sensitivity to the prior specification can be done simply using the asymptotic approximation Φ(r*_B)
- connections to empirical Bayes?
16 / 16  Some references
Fraser, D.A.S., Marras, E., Reid, N. and Yi, G. (2009). Default priors for Bayesian and frequentist inference.
Fraser, D.A.S. and Reid, N. (2002). Strong matching of frequentist and Bayesian inference. J. Statist. Plan. Infer. 103, 263–285.
Berger, J.O., Bernardo, J.M. and Sun, D. (2009). The formal definition of reference priors. Ann. Statist. 37, 905–938.
Datta, G.S. and Ghosh, M. (1995). Some remarks on noninformative priors. J. Amer. Statist. Assoc. 90, 1357–1363.
Dawid, A.P., Stone, M. and Zidek, J.V. (1973). Marginalization paradoxes in Bayesian and structural inference. J. Roy. Statist. Soc. B 35, 189–233.
Bernardo, J.M. (1979). Reference posterior distributions for Bayesian inference (with discussion). J. Roy. Statist. Soc. B 41, 113–147.