An Overview of Objective Bayesian Analysis

Size: px

Start display at page:

Download "An Overview of Objective Bayesian Analysis"

Reynard Boyd
5 years ago
Views:

1 An Overview of Objective Bayesian Analysis James O. Berger Duke University visiting the University of Chicago Department of Statistics Spring Quarter,

2 Lectures Lecture 1. Objective Bayesian Analysis: Introduction and a Casual History Lecture 2. Objective Bayesian Estimation Lecture 3. Objective Bayesian Hypothesis Testing and Conditional Frequentist Testing Lecture 4. Essentials of Objective Bayesian Model Uncertainty Lecture 5. Methodology of Objective Bayesian Model Uncertainty 2

3 Objective Bayesian Hypothesis Testing A pedagogical example from high-energy physics: When the Large Hadron Collider at CERN reaches full power, one major goal will be to determine if the Higgs boson particle actually exists. 3

4 Department of Statistics ' & University of Chicago: 2011 Atlas Detector at the Large Hadron Collider (CERN) 4 $ %

5 Data: X = # events observed in time T that are characteristic of Higgs boson production in LHC particle collisions. Statistical Model: X has density where Poisson(x θ + b) = (θ + b)x e (θ+b) x!, θ is the mean rate of production of Higgs events in time T ; b is the (known) mean rate of production of events with the same characteristics from background sources in time T. To test: H 0 : θ = 0 vs H 1 : θ > 0. (H 0 corresponds to no Higgs. ) 5

6 P-value: p = P (X x b, θ = 0) = m=x Poisson(m 0 + b) Case 1: p = if x = 7, b = 1.2 Case 2: p = if x = 6, b = 2.2. Physicists mostly have not heard of p-values; they tend to talk in terms of the number of σ (standard errors) from the null. Those who have heard of them mostly interpret them incorrectly. Those who understand p-values know their use is difficult: Luc Demortier: In any search for new physics, a small p value should only be seen as a first step in the interpretation of the data, to be followed by a serious investigation of an alternative hypothesis. Only by showing that the latter provides a better explanation of the observations than the null hypothesis can one make a convincing case for discovery. Sequential testing: This is actually a sequential experiment, so p should be further adjusted to account for multiple looks at the data. 6

7 Bayes factor of H 0 to H 1 : ratio of likelihood under H 0 to average likelihood under H 1 (or odds of H 0 to H 1 ) B 01 (x) = Poisson(x 0 + b) Poisson(x θ + b)π(θ) dθ = b x e b 0 0 (θ + b)x e (θ+b) π(θ) dθ. Subjective approach: Choose π(θ) subjectively (e.g., using the standard physics model predictions of the mass of the Higgs). Objective approach: Choose π(θ) to be the intrinsic prior (discussed later) π I (θ) = b(θ + b) 2. (Note that this prior is proper and has median b.) Bayes factor: is then given by B 01 = 0 b x e b (θ + b)x e (θ+b) b(θ + b) 2 dθ = b(x 1) e b Γ(x 1, b), where Γ is the incomplete gamma function. Case 1: B 01 = (recall p = ) Case 2: B 01 = 0.26 (recall p = 0.025) 7

8 Posterior probability of the null hypothesis: The objective choice of prior probabilities of the hypotheses is Pr(H 0 ) = Pr(H 1 ) = 0.5, in which case Pr(H 0 x) = B B 01. Case 1: Pr(H 0 x) = (recall p = ) Case 2: Pr(H 0 x) = 0.21 (recall p = 0.025) Complete posterior distribution: is given by Pr(H 0 x), the posterior probability of null hypothesis π(θ x, H 1 ), the posterior distribution of θ under H 1 A useful summary of the complete posterior is Pr(H 0 x) and C, a (say) 95% posterior credible set for θ under H 1. Case 1: Pr(H 0 x) = ; C = (1.0, 10.5) Case 2: Pr(H 0 x) = 0.21; C = (0.2, 8.2) Note: For testing precise hypotheses, confidence intervals alone are not a satisfactory inferential summary. 8

9 x and H Figure 1: Pr(H 0 x) (the vertical bar), and the posterior density for θ given 9

10 A Lengthy Aside: What should a Fisherian or a frequentist think of the discrepancy between the p-value and the objective Bayesian answers in precise hypothesis testing? I. Many Fisherians (and arguably Fisher) prefer likelihood ratios to p-values, when they are available (e.g., genetics). A lower bound on the Bayes factor (or likelihood ratio): choose π(θ) to be a point mass at ˆθ, yielding B 01 (x) = Poisson(x 0 + b) 0 Poisson(x θ + b)π(θ) dθ ( ) x Poisson(x 0 + b) b Poisson(x ˆθ + b) = min{1, e x b }. x Case 1: B (recall p = ) Case 2: B (recall p = 0.025) 10

11 II. To a frequentist, p-values are not particularly relevant anyway; the difference between p-values and frequentist error probabilities can be seen from the conditional frequentist viewpoint (Kiefer, 1977 JASA; Brown, 1978 AOS) Find a statistic S that reflects the strength of evidence in the data. Compute the frequentist measure of error conditional on S. Artificial example: Observe X 1 and X 2 where θ + 1 with probability 1/2 X i = θ 1 with probability 1/ Confidence set for θ: C(X 1, X 2 ) = (X 1 + X 2 ) if X 1 X 2 X 1 1 if X 1 = X 2. Unconditional coverage: P θ (C(X 1, X 2 ) contains θ) = Strength of evidence in the data: Measured by S = X 1 X 2 (either 0 or 2) { Conditional coverage: P θ (C(X 1, X 2 ) contains θ S) = 0.5 if S = if S = 2. 11

12 For testing, Berger, Brown and Wolpert, 1994 AOS (continuous case) and Dass, 2001 Sankhya (discrete case) proposed: Develop S as follows: Let p i (x) be the p-value from testing H i against the other hypothesis; use the p i (x) as measures of strength of evidence, as Fisher suggested. Define the conditioning statistic S = max{p 0 (x), p 1 (x)}; its use is based on deciding that data (in either the rejection or acceptance regions) with the same p-value has the same strength of evidence. Accept H 0 when p 0 > p 1, and reject otherwise. Compute Type I and Type II conditional error probabilities as α(s) = P 0 (rejecting H 0 S = s) P 0 (p 0 p 1 S(X) = s) β(s) = P 1 (accepting H 0 S = s) P 1 (p 0 > p 1 S(X) = s), where P i refers to probability under H i. 12

13 Objective Bayes and Conditional Frequentist Agreement Theorem 1 The conditional frequentist error probabilities, α(s) and β(s), exactly equal the (objective) posterior probabilities of H 0 and H 1, so conditional frequentists and Bayesians report the same error probabilities. In the physics example, the conditional Type I error is thus α(s) = Pr(H 0 x) = B 01 /(1 + B 01 ) (= in Case 1; =0.21 in Case 2) The result can be viewed as a way to convert p-values into real frequentist error probabilities, when there is an alternative hypothesis. The conditional error probabilities α(s) and β(s) are fully data-dependent, yet fully frequentist. The procedure also applies to sequential settings, since the Bayesian error probabilities ignore the stopping rule. (Berger, Boukai and Wang, 1999 Biom.) Savage (1961): When I first heard the stopping rule principle from Barnard in the early 50 s, I thought it was scandalous that anyone in the profession could espouse a principle so obviously wrong, even as today I find it scandalous that anyone could deny a principle so obviously right. 13

14 General Formulation (Dass and Berger, 2003 Scand. J. Stat.) Test H 0 : X has density p 0 (x η) versus H 1 : X has density p 1 (x η, ξ} on X, with respect to a σ-finite dominating measure λ. {p 0 (x η) : η Ω} is invariant with respect to a group operation g G. g (η, ξ) = (g η, ξ) and {p 1 (x η, ξ) : η Ω} is invariant with respect to g for each given ξ. Utilize the right-haar prior, ν Ḡ (η), on η and a proper prior π(ξ) on ξ. The original problem is reduced to testing distributions of the maximal invariant, T: H 0 : T f 0 versus H 1 : T f 1, where f 0 (t) = p 0 (x η)dν Ḡ (η) and f 1 (t) = p 1 (x η, ξ)dν Ḡ (η)dπ(ξ). Since these are simple distributions, the rest of the analysis proceeds as before, and the equivalence between posterior probabilities and conditional frequentist error probabilities holds. 14

15 Example: Nonparametric testing of fit (Berger and Guglielmi, 2001 JASA) To test: H 0 : X N (µ, σ) versus H 1 : X F (µ, σ), where F is an unknown location-scale distribution. Determine the Bayes factor B 01 (x), using a Polya tree prior for F, centered at H 0 ; the right-haar prior π(µ, σ) = 1/σ for the location-scale parameters As before, the conditional frequentist Type I error, based on p-value conditioning, is α(x) = B 01 (x)/(1 + B 01 (x)) = Pr(H 0 x). Notes: This gives exact conditional frequentist error probabilities, even for small samples (e.g., n = 3). Computation of B(x) uses the (simple) exact formula for a Polya tree marginal distribution, together with importance sampling to deal with (µ, σ). 15

16 A Teaching Aside 1. One way to help teach students that p-values are very different from error probabilities (both Bayesian and frequentist) is to utilize the applet (of German Molina) available at berger. 2. Here s another useful way to explain the difference using (say) the physics example, where the p-value.00025, but the objective posterior probability (or conditional frequentist Type I error) of the null differs by a factor of 30: In the example, a factor of.0014/ is due to the difference between a tail area {X : X 7} and the actual observation X = 7. The remaining factor of roughly 5.4 in favor of the null results from the Ockham s razor penalty that Bayesian analysis automatically gives to a more complex model. (A more complex model will virtually always fit the data better than a simple model, and so some penalty is necessary to avoid overfitting.) 16

17 When there is no alternative hypothesis, robust Bayesian theory suggests a way to calibrate p-values. (Sellke, Bayarri and Berger, 2001 Am. Stat.). A proper p-value satisfies H 0 : p(x) Uniform(0, 1). Consider testing this versus H 1 : p f(p), where Y = log(p) has a decreasing failure rate (a natural non-parametric alternative). Theorem 2 If p < e 1, B 01 e p log(p). For the physics example, with p =.00025, ep log p = (recall that the Bayes factor we computed was B01 I = ). An analogous lower bound on the conditional Type I frequentist error is α (1 + [ e p log(p)] 1 ) 1. p α(p)

18 Choosing priors for common parameters in testing If pure subjective (evidence-based?) choice is not possible, here are some guidelines: Gold standard: if there are parameters in each hypothesis that have the same group invariance structure, one can use the right-haar priors for those parameters (even though improper) (Berger, Pericchi and Varshavsky, 1998) Silver standard: if there are parameters in each hypothesis that have the same scientific meaning, reasonable default priors (e.g. the constant prior 1) can be used (e.g., in the later vaccine example, where p 1 means the same in H 0 and H 1 ). Bronze standard: to try to obtain parameters that have the same scientific meaning (beware the fallacy of Greek letters ), one strategy often employed is to orthogonalize the parameters, i.e., reparameterize so that the partial Fisher information for those parameters is zero. 18

19 Example: (location-scale) Suppose X 1, X 2,..., X n are i.i.d with density Several models are entertained: M N : g is N(0, 1) M U : g is Uniform (0, 1) M C : g is Cauchy (0, 1) f(x i µ, σ) = 1 ( ) σ g xi µ σ M L : g is Left Exponential ( 1 σ ex µ, x µ) M R : g is Right Exponential ( 1 σ e (x µ), x µ) All models have location-scale parameters (µ, σ), for which the right-haar prior density is π(µ, σ) = 1 σ. 19

20 The marginal densities m(x M) = 0 n i=1 [ 1 σ g ( xi µ σ )] 1 σ dµ dσ for the five models are given by 1. Normal: m(x M N ) = 2. Uniform: m(x M U ) = Γ((n 1)/2) (2 π) (n 1)/2 n( i (x i x) 2 ) (n 1)/2 1 n (n 1)(x (n) x (1) ) n 1 3. Cauchy: m(x M C ) is given in Spiegelhalter (1985). 4. Left Exponential: m(x M L ) = (n 2)! n n (x (n) x) n 1 5. Right Exponential: m(x M R ) = (n 2)! n n ( x x (1) ) n 1 20

21 Darwin s Data, n = 15 Cavendish s Data, n = Stigler s Data Set 9, n = 20 Generated Cauchy Data, n =

22 The objective posterior probabilities of the five models, for each data set are as follows: Data Models set Normal Uniform Cauchy L. Exp. R. Exp. Darwin Cavendish Stigler Cauchy

23 Choosing priors for non-common parameters If subjective (evidence-based?) choice is not possible, some guidelines are: Vague proper priors are horrible (related to the Jeffreys-Lindley paradox): for instance, if X N(µ, 1) and we test H 0 : µ = 0 versus H 1 : µ 0 with a Uniform( c, c) prior for θ, the Bayes factor is B 01 (c) = f(x 0) c c f(x µ)(2c) 1 dµ 2c f(x 0) f(x µ)dµ for large c, which depends dramatically on the choice of c. Improper priors are problematical, because they are unnormalized; is B 01 = f(x 0) f(x µ)(1)dµ or B 01 = f(x 0)? f(x µ)(2)dµ Robust solution: if one can specify a plausible range c 1 c c 2, look at B 01 (c) over this range and hope the conclusion is robust. (Not obvious for higher dimensional parameters, but there is a literature.) 23

24 Various proposed default priors for non-common parameters Fractional priors (O Hagan): use a fraction γ of the model likelihood (usually γ = parameter dimension / sample size ) as the prior, with L(θ) 1 γ as the likelihood. Intrinsic priors (Berger, Pericchi and others): generate priors from training samples (either actual subsets of the data, or imaginary data generated under the null model). Conventional priors that have at least some nice properties: e.g., Zellner-Siow priors for linear models are invariant to scale changes in covariates consistent (the true model will be selected as n ) information-consistent (e.g., will reject as t or F statistics ) coherent (roughly, are logically connected) Various efforts at predictive matching priors. Approximations (such as BIC); these can capture part of the prior influence, but not all. 24

25 A Brief Aside: Approximating a believable precise hypothesis by an exact precise null hypothesis* A precise null, like H 0 : θ = θ 0, is typically never true exactly; rather, it is used as a surrogate for a real null H ϵ 0 : θ θ 0 < ϵ, ϵ small. Result (Berger and Delampady, 1987 Statistical Science): Robust Bayesian theory can be used to show that, under reasonable conditions, if ϵ < 1 4 σˆθ, where σˆθ is the standard error of the estimate of θ, then P r(h ϵ 0 x) P r(h 0 x). Note: Typically, σˆθ c n, where n is the sample size, so for large n the above condition can be violated, and using a precise null may not be appropriate, even if the real null is believable. *In non-precise hypothesis testing, such as one-sided testing, the discrepancy between p-values and Bayesian/conditional frequentist answers can be much smaller. 25

26 Department of Statistics ' University of Chicago: 2011 $ A Recent Medical Example & 26 %

27 Hypotheses, Data, and Classical Test: Alvac had shown no effect Aidsvax had shown no effect Question: Would Alvac as a primer and Aidsvax as a booster work? The Study: Conducted in Thailand with 16,395 individuals from the general (not high-risk) population: 74 HIV cases reported in the 8198 individuals receiving placebos 51 HIV cases reported in the 8197 individuals receiving the treatment Model: X 1 Binomial(x 1 p 1, 8198) and X 2 Binomial(x 2 p 2, 8197), respectively, so that p 1 and p 2 denote the probability of HIV in the placebo and treatment populations, respectively. Classical test of H 0 : p 1 = p 2 versus H 1 : p 1 p 2 yielded a p-value of

28 Bayesian Analysis: Reparameterize to p 1 and V = 100 so that we are testing H 0 : V = 0, p 1 arbitrary H 1 : V 0, p 1 arbitrary. Prior distribution: P r(h i ) = prior probability that H i is true, i = 0, 1, Let π 0 (p 1 ) = π 1 (p 1 ), and choose them to be either ( 1 p 2 p 1 ), uniform on (0,1) subjective (evidence-based?) priors based on knowledge of HIV infection rates Note: the answers are essentially the same for either choice. For V under H 1, consider the priors uniform on (-20, 60) (apriori subjective evidence-based beliefs) uniform on ( 100c/3, 100c) for 0 < c < 1, to study sensitivity (constrained also to V > 100(1 1 p 1 ). 28

29 Posterior probability of the null hypothesis: P r(h 0 data) = P r(h 0 )B 01 P r(h 0 )B 01 + P r(h 1 ), where the Bayes factor of H 0 to H 1 is B 01 = Binomial(74 p1, 8198) Binomial(51 p 1, 8197)π 0 (p 1 )dp 1 Binomial(74 p1, 8198) Binomial(51 p 2, 8197)π 0 (p 1 )π 1 (p 2 p 1 )dp 1 dp 2. For the prior for V that is uniform on (-20, 60), B 01 1/4 ( recall, p-value.04) If the prior probabilitites of the hypotheses are each 1/2, the overall posterior density of V has a point mass of size 0.20 at V = 0, a density (having total mass 0.80) on non-zero values of V : 29

30 RV144 data; no adjustment probability density of VE VE 30

31 Robust Bayes: For the prior on V that is uniform on ( 100c/3, 100c), the Bayes factor is Thai B01; psi ~ Un( c/3, c) B_01(c) Note: There is sensitivity to c, indeed 0.22 < B 01 (c) < 1, but why would this cause one to instead report p = 0.04, knowing it will be misinterpreted? Note: Uniform priors are the extreme points of monotonic priors, and so such robustness curves are quite general. c 31

the unification of statistics its uses in practice and its role in Objective Bayesian Analysis:

Objective Bayesian Analysis: its uses in practice and its role in the unification of statistics James O. Berger Duke University and the Statistical and Applied Mathematical Sciences Institute Allen T.