Bayesian and frequentist inference

Size: px

Start display at page:

Download "Bayesian and frequentist inference"

Sheena Sabrina Black
5 years ago
Views:

1 Bayesian and frequentist inference Nancy Reid March 26, 2007 Don Fraser, Ana-Maria Staicu

2 Overview Methods of inference Asymptotic theory Approximate posteriors matching priors Examples Logistic regression normal circle Constructing priors Conclusions

3 A Basic Structure data y = (y 1,..., y n ) R n model f (y; θ) probability density parameters θ = (θ 1,..., θ k ) R k inference about θ (or components of θ) an interval of values of θ consistent with the data a probability distribution for θ, conditional on the data

5 The average 12 month weight loss on the Atkins diet was 10.3 lbs, compared to 3.5, 5.7 and 4.8 lbs on the Zone, Learn and Ornish diets, respectively. 1 mean 12 month weight loss on the Atkins diet was 4.7 kg (95% confidence interval 3.1 to 6.3 kg) mean 12 month weight loss was significantly different between the Atkins and the Zone diets (p < 0.05) The probability that the average weight loss on the Atkins diet is greater than that on the Zone diet is JAMA, Mar 7

8 π(θ y) f (y; θ)π(θ) f (y; θ)π(θ) π(θ y) = f (y; θ)π(θ)dθ parameter of interest ψ(θ) Bayesian inference π m (ψ y) = {ψ(θ)=ψ} π(θ y)dθ integrals often computed using simulation methods marginal posterior density summarizes all information about ψ available from the data

9 nonbayesian inference find a function of the data t(y), say, with 1. distribution depending only on ψ 2. distribution that is known usually based on the log-likelihood function l(θ; y) = log f (y; θ) ˆθ arg sup l(θ; y) ˆθ ψ arg sup ψ(θ)=ψ l(θ; y) likelihood ratio statistic: 2{l(ˆθ; y) l(ˆθ ψ ; y)} d χ 2 d 1 Wald statistic ˆΣ 1/2 d ( ˆψ ψ) N(0, I) if ψ is scalar: r (ψ). N(0, 1) statistic is derived from the model (via the likelihood function) distribution comes from asymptotic theory likelihood is not (usually) integrated but is (usually) maximized

10 ... Bayesian inference well defined basis for inference internally consistent leads to optimal results from one point of view requires a probability distribution for θ ( a prior distribution) not necessarily calibrated not always clear how much answers depend on the choice of prior requires (high-) dimensional integration, or simulation, or approximation useful in applications

11 ... nonbayesian inference properties well understood calibrated: properties correspond to long-run frequency choice of function t(y) may be non-optimal approximation to distribution of t(y ) may be poor approximation to distribution of t(y ) may be excellent useful in applications

12 Fisher (1920s, frequentist): I know only one case in mathematics of a doctrine which has been accepted and developed by the most eminent men of their time, and is now perhaps accepted by men now living, which at the same time has appeared to a succession of sound writers to be fundamentally false and devoid of foundation Lindley (1950s, Bayesian): Every statistician would be a Bayesian if he took the trouble to read the literature thoroughly and was honest enough to admit that he might have been wrong

13 Approximate posteriors ψ π m (ψ y)dψ = Φ(r B ){1 + O(n 3/2 )} r B = r + 1 r log q B r r = [2{l p ( ˆψ) l p (ψ)}] 1/2 q B = l p(ψ)j 1/2 p ( ˆψ) j λλ(ˆθ) 1/2 π(ˆθ ψ ) j λλ (ˆθ ψ ) 1/2 π(ˆθ) l p (ψ) = l(ψ, ˆλ ψ ) = l(ˆθ ψ )

14 Bayesian and frequentist inference Asymptotic theory Approximate posteriors Approximate posteriors Approximate posteriors ψ πm(ψ y)dψ = Φ(r B ){1 + O(n 3/2 )} rb = r + 1 r log q B r r = [2{lp( ˆψ) lp(ψ)}] 1/2 q B = l p(ψ)j 1/2 p ( ˆψ) jλλ(ˆθ) 1/2 π(ˆθψ) jλλ(ˆθψ) 1/2 π(ˆθ) lp(ψ) = l(ψ, ˆλψ) = l(ˆθψ) marginal posterior density: π m (ψ y) = {ψ(θ)=ψ} exp{l(θ; y}π(θ)dθ/ exp{l(θ; y)}π(θ)dθ = c exp{l(ˆθ ψ )} j λλ (ˆθ ψ ) 1/2 π(ˆθ ψ ) {1 + O(n 3/2 )} exp{l(ˆθ)} j(ˆθ) 1/2 π(ˆθ). j λλ (ˆθ ψ ) 1/2 π(ˆθ ψ ) = c exp{l(ˆθ ψ ) l(ˆθ)} j p ( ˆψ) 1/2 j λλ (ˆθ ψ ) 1/2 π(ˆθ) π m (r y) = c exp( 1 2 r 2 ) r q B = c exp( 1 2 r 2 log r ) q B

15 ... approximation to the posterior is really good Example: y i N(θ, I), y i = (y i1,..., y ik ), θ = (θ 1,..., θ k ) ψ = (θ θ2 k ) flat prior π(θ)dθ dθ posterior is proper, in fact is χ 2 k (n y 2 ) { rb = ) } 1 (k 1)/2 n( ˆψ ψ) + n( ˆψ ψ) (ψˆψ log

16 ... normal circle, k=2 normal circle, n=1 normal circle posterior marginal difference psi psi

17 ... normal circle, increasing k changing k increasing k posterior marginal difference k=5 k=10 k= psi psi

18 Are Bayesian and nonbayesian solutions really that different? posterior probability limit Pr θ Y {θ θ (1 α) y} = 1 α upper confidence bound Pr Y θ {θ θ (1 α) (π, Y )} = 1 α first depends on the prior, second is evaluated under the model is there a prior for which both limits are equal? no 2, but there might be a prior for which they are approximately equal: Pr Y θ {θ θ (1 α) (π, Y )} = 1 α + O(1/n) matching prior 2 except in a location model

19 ... matching priors if θ R, then π(θ) i 1/2 (θ) (Welch & Peers, 1963) if θ = (ψ, λ) then π(θ) i 1/2 ψψ (θ)g(λ) (Peers, 1965; Tibshirani, 1989) where we reparametrize if necessary so that i ψλ (θ) = 0 matching priors are one version of objective priors they depend on the model reference priors of Berger and Bernardo are a different version of objective priors with a slightly different goal: maximize the distance from prior to posterior good news: Bayesian inference with matching priors is calibrated bad news: there is a whole family of such priors, even in relatively simple situations

20 ... good news: matching priors are unique when used in the r approximation(diciccio & Martin 93, Staicu 07) ψ π m (ψ y)dψ = Φ(r B ){1 + O(n 3/2 )} r B = r + 1 r log q B r r = [2{l p ( ˆψ) l p (ψ)}] 1/2 q B = l p(ψ)j 1/2 p ( ˆψ) j λλ(ˆθ) 1/2 π(ψ, ˆλ ψ ) j λλ (ˆθ ψ ) 1/2 π( ˆψ, ˆλ) = i 1/2 when ψ orthogonal to λ, ˆλ ψ = ˆλ{1 + O p (n 1 )} ψψ (ψ, ˆλ ψ ) g(ˆλ ψ ) i 1/2 ψψ ( ˆψ, ˆλ) g(ˆλ)

21 Logistic regression The first ten out of 79 sets of observations on the physical characteristics of urine. Presence/absence of calcium oxalate crystals is indicated by 1/0. Two cases with missing values. 3 Case Crystals Specific gravity ph Osmolarity Conductivity Urea Calcium Andrews & Herzberg, 1985

22 ... logistic regression Model: Independent binary responses Y 1,..., Y n with Pr(Y i = 1) = linear exponential family: exp(x T i β) 1 + exp(x T i β) l(β; y) = exp{β 0 Σy i + β 1 Σx 1i y i + + β 6 Σx 6i y i c(β) d(y)} parameter of interest ψ = β 6, say orthogonal nuisance parameter λ j = E(yx j ) ˆλ ψ ˆλ (special to exponential families) π(ψ, λ) i 1/2 ψψ (ψ, λ), unique

23 BN and Laplace approximation beta_p

24 ... logistic regression Confidence/posterior interval for parameter of interest method lower bound upper bound p-value for 0 Φ(r ) Φ(rB ) st order First line uses a (very accurate) approximation to the conditional distribution of Σy i x 6i, given sufficient statistics for nuisance parameters (exponential family) Fitted using cond library of package hoa

25 Bayesian and frequentist inference Examples Logistic regression... logistic regression... logistic regression Confidence/posterior interval for parameter of interest method lower bound upper bound p-value for 0 Φ(r ) Φ(rB ) st order First line uses a (very accurate) approximation to the conditional distribution of Σy ix 6i, given sufficient statistics for nuisance parameters (exponential family) Fitted using cond library of package hoa n = 77, d = 7, but the information in 77 binary observations seems to be comparable to the information in about 10 continuous observations

26 Matching priors need to be targetted on the parameter of interest as do all objective priors flat priors for vector parameters don t lead to calibrated inferences Example: y i N(θ i, 1), i = 1,..., k exact and approximate posterior for θi 2 nearly identical

27 changing k increasing k posterior marginal difference k=5 k=10 k= psi psi

28 Matching priors need to be targetted on the parameter of interest as do all objective priors flat priors for vector parameters don t lead to calibrated inferences Example: y i N(θ i, 1), i = 1,..., k exact and approximate posterior for θ 2 i nearly identical posterior used prior π(θ) 1, marginalize to ψ = θ 2 i posterior limits are not probability limits there is an exact nonbayesian solution, based on marginal density of y 2 i (noncentral χ 2 ) Dawid, Stone, Zidek, 1974

29 dependence on dimension difference dim = 10 dim = 5 dim = psi y1

30 dependence on n difference y1 psi

31 nonbayesian asymptotics find a function of the data with known distribution depending on ψ our candidate: rf = r F (ψ; y) with N(0, 1) distribution r F = r + 1 r log q F r q F = {χ(ˆθ) χ(ˆθ ψ )}/ˆσ χ a maximum likelihood type departure χ(θ) = scalar function of ϕ(θ) θ ϕ(θ) = ϕ(θ; y 0 ): a re-parametrization of the model

32 ... nonbayesian asymptotics where does this come from and why does it work using ϕ we can construct an approximate exponential family model this is the route to accurate density approximation using saddlepoint-type arguments to get ϕ(θ) differentiate the log-likelihood on the sample space ϕ j (θ) = d dt l(θ; y 0 + tv j ) t=0 jiggle the observed data in directions v j, and record the change in the log likelihood gives a way to implement approximate conditioning leads to quite accurate frequentist solutions

33 ... does this help with finding priors? 1. strong matching priors (FR, 2002) posterior marginal distribution s(ψ) = Φ(r B ) nonbayesian p-value function p(ψ) = Φ(r F ) r F = r + 1 r log q F r r B = r + 1 r log q B r Set q F = q B, and solve for prior π (if possible) guarantees matching to 3rd order leads to a data dependent prior automatically targetted on parameter of interest only a function of the parameter of interest

34 Bayesian and frequentist inference Constructing priors... does this help with finding priors?... does this help with finding priors? 1. strong matching priors (FR, 2002) posterior marginal distribution s(ψ) = Φ(rB ) nonbayesian p-value function p(ψ) = Φ(rF ) rf = r + 1 r log q F r rb = r + 1 r log q B r Set q F = q B, and solve for prior π (if possible) guarantees matching to 3rd order leads to a data dependent prior automatically targetted on parameter of interest only a function of the parameter of interest q B = l p(ψ)j 1/2 p ( ˆψ) blue prior q F = {χ(ˆθ) χ(ˆθ ψ )}/ˆσ χ

35 ... does this help 2. approximate location models every model f (y; θ) can be approximated by a location model (locally) location models have automatic prior for vector θ: flat change at θ that generates an increment d ˆθ at observed data, W (θ), say flat prior is π(θ) W (θ) = AV (θ) = d ˆθ dθ V is a n k matrix with ith row V i (θ) = F ;θ (y 0 i ; θ) F y (y 0 i ; θ) gives a natural default prior for vector parameters still need to deal with marginalization to target parameters of interest

36 Bayesian and frequentist inference Constructing priors... does this help 2. approximate location models... does this help every model f (y; θ) can be approximated by a location model (locally) location models have automatic prior for vector θ: flat change at θ that generates an increment d ˆθ at observed data, W (θ), say flat prior is π(θ) W (θ) = AV (θ) = d ˆθ dθ V is a n k matrix with ith row V i(θ) = F;θ (y i 0 ; θ) Fy(yi 0 ; θ) gives a natural default prior for vector parameters still need to deal with marginalization to target parameters of interest approximate location model f (y β(θ)), say factors as f 1 (a)f 2 (ˆθ β(θ)) if f 2 ( ) = 0 then flat prior is β θ (θ) d ˆθ = β θ (θ)dθ

37 Summarizing asymptotic theory is useful for obtaining (very) good approximations to distributions can be used in Bayesian or nonbayesian contexts can be used to derive posterior intervals that are calibrated can be used to simplify the choice of matching priors finding priors is not straightforward likelihood plus first derivative change in the sample space is instrinsic to inference, at least asymptotically to O(n 3/2 ) flat priors for a location parameter will not lead to calibrated posteriors for curved parameters

38 Economist Jan 14, 2006 We re all Bayesians really

39 ... Bayesian Griffiths and Tenenbaum Psychological Science asked subjects to make predictions based on limited evidence concluded that the predictions were consistent with Bayesian methods of inference the results of our experiment reveal a far closer correspondence between optimal statistical inference and everyday cognition than that suggested by previous research

40 ... Bayesian Goldstein (2006): we argue first that the subjectivist Bayes approach is the only feasible method for tackling many scientific problems Neal (1998): Using technological or reference priors chosen solely for convenience, or out of a mis-guided desire for pseudo-objectivity... [or] using priors that vary with the amount of data that you have collected... have no real Bayesian justification, and since they are usually offered with no other justification either, I consider them to be highly dubious. Wasserman (2006): I think it would be hard to defend a sequence of subjective analyses that yield poor frequency coverage

41 References 1. ba.stat.cmu Volume 2 # 3 Berger 2006, Goldstein 2006, Wasserman DiCiccio and Martin Fraser Reid Little Tibshirani Welch and Peers 1963

Applied Asymptotics Case studies in higher order inference

Applied Asymptotics Case studies in higher order inference Nancy Reid May 18, 2006 A.C. Davison, A. R. Brazzale, A. M. Staicu Introduction likelihood-based inference in parametric models higher order approximations