Bayes Testing and More

STA 732. Surya Tokdar

Bayes testing

The basic goal of testing is to provide a summary of evidence toward/against a hypothesis of the kind $H_0 : \theta \in \Theta_0$, for some scientifically important subset $\Theta_0$ of the parameter space $\Theta$. For a data model $X \sim f(x \mid \theta)$, $\theta \in \Theta$, a Bayesian would start by specifying a prior pdf $\pi(\theta)$ for $\theta$. The prior then combines with the data $X = x$ to produce a posterior pdf $\pi(\theta \mid x)$ for $\theta$. At this stage, we can simply summarize the evidence toward $H_0$ by

$$P(H_0 \mid x) = \Pr(\theta \in \Theta_0 \mid X = x) = \int_{\Theta_0} \pi(\theta \mid x)\,d\theta$$

and the evidence against $H_0$ is simply $1 - P(H_0 \mid x)$. This probability represents our updated belief about the statement $H_0$. If a reject/accept $H_0$ type decision is indeed warranted, then we could do it by subjecting $\Pr(\theta \in \Theta_0 \mid X = x)$ to a cut-off of our choice. That is, we reject $H_0$ if $\Pr(\theta \in \Theta_0 \mid X = x) < k$ for some (positive) cut-off $k$. How do we choose this cut-off?

Loss function

To guide the choice of a cut-off, we need to carefully think about the consequences of our decisions. We now have to pretend that $\theta$ is going to be observed (in future) and our decision is going to be checked against the observed value. If the decision matches the observed value, we incur no penalty, otherwise we are penalized a positive amount. Let $d_0$ denote "we decide $\theta \in \Theta_0$" and $d_1$ denote "we decide $\theta \notin \Theta_0$". Then we incur a penalty if we go for $d_0$ and the observed $\theta$ turns out to be in $\Theta \setminus \Theta_0$, or if we go for $d_1$ and $\theta$ turns out to be in $\Theta_0$. These two penalties can potentially differ in the amount we lose. This is expressed in the following loss table:

                $\theta \in \Theta_0$    $\theta \in \Theta \setminus \Theta_0$
    $d_0$       $0$                      $w_0$
    $d_1$       $w_1$                    $0$

If we denote by $\mathrm{loss}(d, \theta)$ the loss incurred when we go for a decision $d \in \{d_0, d_1\}$ and the parameter value is later observed to be $\theta$, then

$$\mathrm{loss}(d_0, \theta) = 0, \ \theta \in \Theta_0, \qquad \mathrm{loss}(d_0, \theta) = w_0, \ \theta \in \Theta \setminus \Theta_0,$$
$$\mathrm{loss}(d_1, \theta) = w_1, \ \theta \in \Theta_0, \qquad \mathrm{loss}(d_1, \theta) = 0, \ \theta \in \Theta \setminus \Theta_0.$$

Therefore the posterior expected loss of a decision $d$,

$$r(d) = E[\mathrm{loss}(d, \theta) \mid X = x] = \int \mathrm{loss}(d, \theta)\,\pi(\theta \mid x)\,d\theta,$$

can be simplified to

$$r(d_0) = w_0 \int_{\Theta \setminus \Theta_0} \pi(\theta \mid x)\,d\theta = w_0 \Pr(\theta \in \Theta \setminus \Theta_0 \mid X = x),$$
$$r(d_1) = w_1 \int_{\Theta_0} \pi(\theta \mid x)\,d\theta = w_1 \Pr(\theta \in \Theta_0 \mid X = x).$$

If we go for the decision that minimizes our posterior expected loss, then we are committed to reject $H_0$ if (and only if)

$$r(d_1) < r(d_0) \iff \frac{\Pr(\theta \in \Theta_0 \mid X = x)}{\Pr(\theta \in \Theta \setminus \Theta_0 \mid X = x)} < \frac{w_0}{w_1} \iff \Pr(\theta \in \Theta_0 \mid X = x) < \frac{w_0}{w_0 + w_1};$$

the last equivalence follows from the fact that $\Pr(\theta \in \Theta \setminus \Theta_0 \mid X = x) = 1 - \Pr(\theta \in \Theta_0 \mid X = x)$. Tying back to the preceding section, we see that the cut-off $k = w_0/(w_0 + w_1)$ is determined by the relative gravity of the two possible mistakes we can make.

Notice that the above approach differs starkly from the error-control foundation of the classical testing procedures. In the Bayesian setting, once the post-data belief about $\theta$ is expressed by the posterior $\pi(\theta \mid x)$, the actual decisions are based entirely on the expected costs associated with the two decisions, where expectations are evaluated via $\pi(\theta \mid x)$. Unlike the classical setting, there is no frequentist guarantee that is sought here.

Issues with testing point nulls

Consider the statistical analysis done by Laplace on female birthrate. He had modeled $X$ = number of female births among $n$ births as $X \sim \mathrm{Bin}(n, p)$ with $p \sim \mathrm{Unif}(0, 1) = \mathrm{Be}(1, 1)$. The observed data were $n = $ […] and $X = $ […], which lead to the posterior pdf $\mathrm{Be}(249146, $ […]$)$. For testing $H_0 : p \ge 0.5$ against $H_1 : p < 0.5$ Laplace would report $\Pr(p \ge 0.5 \mid X = x) = $ […].

One can argue that what Laplace really wanted to test was $H_0 : p = 0.5$ against $H_1 : p \ne 0.5$. This presents a unique challenge. Because $p$ is modeled with a pdf over $[0, 1]$, the posterior is also a pdf over $[0, 1]$ and hence $\Pr(p = 0.5 \mid X = x) = \Pr(p = 0.5) = 0$. Note that this zero does not reflect that the posterior concentrates away from $p = 0.5$. It is simply an artifact of our prior on $p$, which treats $p$ as a continuous random variable, and so the probability of any single value is simply zero. There are a couple of different ways to go about this.
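The loss-based cut-off and the point-null issue can be illustrated with a small R sketch. The data (`x`, `n`) and the loss weights (`w0`, `w1`) below are hypothetical values chosen only for illustration; they are not Laplace's data. The sketch checks that comparing the two posterior expected losses gives the same decision as comparing $P(H_0 \mid x)$ with $k = w_0/(w_0 + w_1)$, and that the posterior mass of shrinking neighborhoods of $0.5$ goes to zero.

```r
## Hypothetical data and loss weights (not from the handout).
n <- 100; x <- 40          # number of trials and of successes
w0 <- 1                    # loss for deciding d0 when theta is outside Theta0
w1 <- 4                    # loss for deciding d1 when theta is inside Theta0

## A Unif(0, 1) = Be(1, 1) prior gives a Be(x + 1, n - x + 1) posterior.
a.post <- x + 1
b.post <- n - x + 1

## Posterior probability of H0: p >= 0.5.
post.H0 <- pbeta(0.5, a.post, b.post, lower.tail = FALSE)

## Posterior expected losses of the two decisions.
r.d0 <- w0 * (1 - post.H0)
r.d1 <- w1 * post.H0

## Rejecting when r(d1) < r(d0) is the same as comparing with the cut-off k.
k <- w0 / (w0 + w1)
reject.by.risk   <- r.d1 < r.d0
reject.by.cutoff <- post.H0 < k

## Posterior mass of |p - 0.5| < d for shrinking d: it vanishes as d -> 0,
## whatever the data, because the posterior is a pdf.
d.grid <- c(1e-1, 1e-2, 1e-3, 1e-4)
post.nbhd <- pbeta(0.5 + d.grid, a.post, b.post) -
  pbeta(0.5 - d.grid, a.post, b.post)
print(data.frame(d = d.grid, prob = post.nbhd))
```

Changing `w0` and `w1` moves the cut-off `k` without touching the posterior, which is the separation between belief and decision that the loss-function argument relies on.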

Bayesian tail area probability

The goal of testing a point null $H_0 : \theta = \theta_0$ can be interpreted as judging the plausibility of a special value $\theta_0$ (e.g., for female birth rate $p = 0.5$ is special because it captures equal odds). This can be effectively done by communicating how central $\theta_0$ is to the posterior pdf $\pi(\theta \mid x)$. We could look at all $100(1 - \alpha)\%$, equal-tail, posterior credible intervals for $\theta$ [given by the $\alpha/2$ and $(1 - \alpha/2)$th posterior quantiles of $\theta$] and check what is the largest value of $\alpha$ for which the interval includes $\theta_0$. This limiting $\alpha$ value is simply

$$2 \min\{\Pr(\theta > \theta_0 \mid X = x),\ \Pr(\theta < \theta_0 \mid X = x)\}.$$

If this summary is close to zero, it reflects that $\theta_0$ is far out in the tails of the $\pi(\theta \mid x)$ pdf. I refer to the above number as a Bayesian tail area probability that quantifies evidence in support of $H_0$ [with obvious analogy to p-values for classical testing].

Ignorance range

Some statisticians contest the basic premise of a point null, arguing that it gives an extreme abstraction of a range of interesting values. That is, with $H_0 : \theta = \theta_0$ we perhaps want to capture $H_0 : |\theta - \theta_0| < d$ for some small positive number $d$. Thus one could instead report $P(|\theta - \theta_0| < d \mid X = x)$ for all (interesting) $d > 0$. The best way to report this would be to plot $P(|\theta - \theta_0| < d \mid X = x)$ as a function of $d > 0$.

Formal testing

There is in fact one other way to approach the point null testing problem. It requires using a prior distribution that recognizes that $\theta_0$ is a special value and assigns it a positive probability. For female birthrate, this can be achieved if we describe $p$ as follows:

$$\Pr(p = 0.5) = p_0, \qquad p \mid [p \ne 0.5] \sim \pi_1(p).$$

The above indeed defines a random variable $p$ which takes values in $[0, 1]$, but it is described by a mixture of a point mass at $0.5$ and a pdf over $[0, 1]$. In fact one can write the prior pdf of $p$ as

$$\pi(p) = p_0\,\delta_{0.5}(p) + (1 - p_0)\,\pi_1(p)$$

where $\delta_a(x)$ denotes the Kronecker delta function ($\delta_a(x) = 1$ if $x = a$, and is zero otherwise). This leads to the following calculation of the posterior pdf:

$$\pi(p \mid x) = \mathrm{const} \times p^x (1 - p)^{n - x}\,\pi(p) = p_0(x)\,\delta_{0.5}(p) + (1 - p_0(x))\,\pi_1(p \mid x)$$

where $\pi_1(p \mid x) = \mathrm{const} \times p^x (1 - p)^{n - x}\,\pi_1(p)$ and

$$p_0(x) = \frac{p_0\,(0.5)^n}{p_0\,(0.5)^n + (1 - p_0) \int p^x (1 - p)^{n - x}\,\pi_1(p)\,dp}.$$

Notice that $\Pr(p = 0.5 \mid X = x)$ is precisely $p_0(x)$. Therefore we could report $p_0(x)$ as a summary of evidence in support of $H_0$, as it precisely gives $P(H_0 \mid x)$.

However, such a formal framework for hypothesis testing is not universally accepted. A major concern is the use of a drastically different prior on $\theta$ than what one would have used if only a credible interval was to be reported. The difference in the choice of prior can have a pronounced effect on the posterior inference. The difference is often stark when apparently low-information priors are used for both cases. See the next example [known as Lindley's paradox].

Example. Imagine a city where 49,581 boys and 48,870 girls have been born over a certain period of time. The number of female births $X$ is modeled with $X \sim \mathrm{Bin}(n, p)$, with $n = 98451$ and $p \in [0, 1]$. For the non-informative choice $\pi(p) = \mathrm{Unif}(0, 1)$ we get $P(p \le 0.5 \mid X = 49581, n = 98451) = 0.012$, and so a Bayesian tail area probability is $2 \times 0.012 = 0.024$, indicating moderately strong evidence against $H_0$. For a low-information point-null prior with $p_0 = 0.5$ and $\pi_1(p) = \mathrm{Unif}(0, 1)$, we get $p_0(x) = 0.95$, indicating rather strong evidence toward $H_0$.

Several points are to be noted here. Under the formal Bayes approach, the continuous part of the posterior distribution is still $\mathrm{Be}(49582, 48871)$, which puts only $0.012$ probability on $p$ being $0.5$ or smaller. So the continuous part supports $p$ being fairly different from $0.5$, whereas the discrete part assigns a 95% posterior probability to $p = 0.5$. This is to be interpreted as: it is fairly likely that $p = 0.5$, but if it is not, then it is likely to be substantially different from $0.5$. A useful graphical summary of this shows the posterior probability of $H_0$ as a vertical bar and the curve $\pi_1(p \mid x)$ scaled appropriately.

[Figure: vertical bar for the posterior probability of $H_0$ at $p = 0.5$, together with the scaled curve $\pi_1(p \mid x)$.]

One has to critically judge the role of the point null hypothesis toward the scientific goal of the study.
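The two summaries quoted in the example can be reproduced with a few lines of R. This is a minimal sketch using the counts stated in the example and the $\mathrm{Unif}(0, 1)$ choice for $\pi_1$; the variable names are illustrative. The marginal likelihood under the alternative uses the closed form $\binom{n}{x} B(x + 1, n - x + 1)$, and everything is kept on the log scale.

```r
x <- 49581; n <- 98451     # counts quoted in the example
p0 <- 0.5                  # prior mass on the point null p = 0.5

## Unif(0, 1) prior: the posterior is Be(x + 1, n - x + 1).
post.le.half <- pbeta(0.5, x + 1, n - x + 1)
tail.area <- 2 * min(post.le.half, 1 - post.le.half)

## Point-null prior: log marginal likelihoods under the two models.
log.m0 <- dbinom(x, n, 0.5, log = TRUE)              # null: p = 0.5
log.m1 <- lchoose(n, x) + lbeta(x + 1, n - x + 1)    # alternative: p ~ Unif(0, 1)

## Log Bayes factor of null to alternative, and posterior null probability p0(x).
log.bf01 <- log.m0 - log.m1
p0.x <- 1 / (1 + (1 - p0) / p0 * exp(-log.bf01))

print(c(tail.area = tail.area, bf01 = exp(log.bf01), p0.x = p0.x))
```

The same data give a small tail area and a large posterior null probability, which is the conflict the example is meant to expose.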
In the female birthrate example above, if we instead tested for $H_0 : p = 0.51$ with $\pi(p) = 0.5\,\delta_{0.51}(p) + 0.5\,\mathrm{Unif}(0, 1)$, we would come up with a very similar conclusion: it is fairly likely that $p = 0.51$, but if it is not, then it is likely to be substantially different from $0.51$. The two conclusions are conflicting. If testing for $p = 0.51$ is not deemed dramatically different from testing for $p = 0.5$, then neither of the two point null hypotheses makes sense. Both are fake nulls which can lead to misleading and conflicting answers.

We could write

$$\frac{p_0(x)}{1 - p_0(x)} = \frac{p_0}{1 - p_0} \times \frac{f(x \mid p = 0.5)}{\int f(x \mid p)\,\pi_1(p)\,dp}.$$

The second ratio on the right, called the Bayes factor (more below), gives the ratio between the conditional data pdfs under the null [$p = 0.5$] and the alternative [$p \sim \pi_1(p)$] models. The fact that under the alternative model $p$ is very likely to be different from $p = 0.5$ does not mean that the alternative model is in better agreement with the observed data than the null model. The point is, posterior summaries under an assumed model carry little information on how well the model fits the data.

A related point is that Bayes tail area probabilities may lose relevance if a point null is deemed important, and may produce very different answers than a formal Bayes analysis.

Another related point is that while flat priors offer a lot of fidelity to any observed data, with the posterior being determined mostly by the likelihood function, they also carry little support for any observed data when compared against a more precise model. One needs extra caution when comparing a flat prior model with a precise prior model.

Model comparison and Bayes factor

In the point-null approach, we actually considered two different models:

$$M_0 : X \sim f(x \mid \theta_0)$$
$$M_1 : X \sim f(x \mid \theta), \ \theta \sim \pi_1(\theta)$$

along with prior model probabilities, $P(M_0) = p_0$ and $P(M_1) = 1 - p_0$. The quantity $p_0(x)$ is precisely $p_0(x) = P(M_0 \mid x)$. This setting generalizes to a more complex framework with potentially many models:

$$M_1 : X \sim f_1(x \mid \theta_1), \ \theta_1 \sim \pi_1(\theta_1), \ \theta_1 \in \Theta_1$$
$$M_2 : X \sim f_2(x \mid \theta_2), \ \theta_2 \sim \pi_2(\theta_2), \ \theta_2 \in \Theta_2$$
$$\vdots$$
$$M_k : X \sim f_k(x \mid \theta_k), \ \theta_k \sim \pi_k(\theta_k), \ \theta_k \in \Theta_k$$

where each model can have its own distinct family of pdfs/pmfs with different parameters living on different spaces. The specification is completed by attaching prior model probabilities: $P(M_1) = p_1, \ldots, P(M_k) = p_k$ with $p_i \ge 0$ and $\sum_i p_i = 1$. Bayes rule gives that the posterior probability of model $M_j$ is

$$P(M_j \mid X = x) = p_j(x) = \frac{p_j \int_{\Theta_j} f_j(x \mid \theta_j)\,\pi_j(\theta_j)\,d\theta_j}{\sum_{i=1}^k p_i \int_{\Theta_i} f_i(x \mid \theta_i)\,\pi_i(\theta_i)\,d\theta_i}$$

and the conditional posterior distribution of $\theta_j$ under model $M_j$ is

$$\pi_j(\theta_j \mid x) = \frac{f_j(x \mid \theta_j)\,\pi_j(\theta_j)}{\int_{\Theta_j} f_j(x \mid \theta_j)\,\pi_j(\theta_j)\,d\theta_j}.$$

Bayes factor

The posterior odds of model $M_i$ to model $M_j$ is

$$\frac{p_i(x)}{p_j(x)} = \frac{p_i \int_{\Theta_i} f_i(x \mid \theta_i)\,\pi_i(\theta_i)\,d\theta_i}{p_j \int_{\Theta_j} f_j(x \mid \theta_j)\,\pi_j(\theta_j)\,d\theta_j} = \frac{p_i}{p_j}\,BF_{ij}(x)$$

where $BF_{ij}(x)$, called the Bayes factor of $M_i$ to $M_j$, is the ratio of the marginal likelihoods of the two models. Many people prefer reporting the Bayes factor to the posterior odds, as the former does not depend on the prior odds. Any reader can multiply the reported Bayes factor by her prior odds to obtain her posterior odds.

Marginal likelihood calculations

If $X \sim f(x \mid \theta)$, $\theta \sim \pi(\theta)$ is a conjugate model then the marginal likelihood $f(x) = \int_\Theta f(x \mid \theta)\,\pi(\theta)\,d\theta$ can be calculated in closed form [this is really the normalizing constant in $\pi(\theta \mid x) = f(x \mid \theta)\,\pi(\theta)/f(x)$]. For example, if $X \sim \mathrm{Bin}(n, p)$ and $p \sim \mathrm{Be}(a, b)$, then

$$f(x) = \binom{n}{x} \int_0^1 p^x (1 - p)^{n - x}\,\frac{p^{a - 1}(1 - p)^{b - 1}}{B(a, b)}\,dp = \binom{n}{x}\,\frac{B(a + x, b + n - x)}{B(a, b)}.$$

An alternative way to calculate the marginal likelihood is this nifty trick:

$$f(x) = \frac{f(x \mid \theta^*)\,\pi(\theta^*)}{\pi(\theta^* \mid x)}$$

at every $\theta^*$ where the posterior pdf is positive. In the binomial model above, I could use the following code to get $f(x)$ in log-scale (always preferred due to numerical stability):

    ## Any point with positive posterior density works; the posterior mean is used here.
    p.star <- (x + a) / (n + a + b)
    log.f.x <- (dbinom(x, n, p.star, log = TRUE)
                + dbeta(p.star, a, b, log = TRUE)
                - dbeta(p.star, a + x, b + n - x, log = TRUE))

For a non-conjugate model, calculation of the marginal likelihood is a fairly challenging task, usually more challenging than sampling $\theta$ from the posterior $\pi(\theta \mid x)$. Common numerical techniques include quadrature (when $\dim(\theta)$ is small), or stochastic calculation based on importance sampling Monte Carlo, frequently coupled with sequential sampling strategies [see Tokdar and Kass (2010)]. The idea of importance sampling is to find an importance density $q(\theta)$ on $\Theta$ and, with samples $\theta^{(k)} \overset{\mathrm{IID}}{\sim} q(\theta)$, $k = 1, \ldots, M$, approximate $f(x)$ by

$$\hat f(x) = \frac{1}{M} \sum_k \frac{f(x \mid \theta^{(k)})\,\pi(\theta^{(k)})}{q(\theta^{(k)})}.$$

This works because by the SLLN, $\hat f(x) \to \int f(x \mid \theta)\,\pi(\theta)\,d\theta$, provided $q(\theta) > 0$ at every $\theta$ where $f(x \mid \theta)\,\pi(\theta) > 0$. However, the variance of this Monte Carlo estimate could be extremely large if $q(\theta)$ looks very different from $\pi(\theta \mid x)$, and you will need a very large $M$ to get a reliable answer. A simple technique that works for standard regular models is as follows.

1. Run an optimizer on the log-posterior to find the posterior mode $\hat\theta$ and the Hessian $H$ (curvature of $-\log \pi(\theta \mid x) = \mathrm{const} - \ell_x(\theta) - \log \pi(\theta)$ at $\theta = \hat\theta$).
2. By the Bernstein-von Mises theorem, $\pi(\theta \mid x) \approx N(\hat\theta, H^{-1})$. This itself could be a good choice, except that normal distributions have tails that decay quickly. If the posterior pdf has slightly heavier tails, then again the importance estimate will have a high variance.
3. Instead it is recommended to take $q(\theta) = t_\nu(\hat\theta, aH^{-1})$, the multivariate $t$ pdf with a modest df $\nu$ (3 is a good small choice which guarantees two finite moments) and a scaling $a > 1$ appropriate to cover the range of the posterior pdf ($a = 5$ should suffice in most cases).

See the code at the end of the handout.

Improper prior

In the birthrate example above, with the point-null model, we used a $\mathrm{Unif}(0, 1)$ prior on $p$ given $p \ne 0.5$. What happens if we used a uniform prior on $\log \frac{p}{1 - p}$? Recall that this corresponds to the improper $\mathrm{Be}(0, 0)$ prior on $p$ with pdf $\pi_1(p) = c/\{p(1 - p)\}$, with $c$ arbitrary. For our data with $x = 49581 > 0$ and $n - x = 48870 > 0$ the resulting posterior is a proper $\mathrm{Be}(49581, 48870)$ pdf. But,

$$p_0(x) = \frac{p_0\,(0.5)^n}{p_0\,(0.5)^n + (1 - p_0)\,c\,B(49581, 48870)}$$

which depends on the choice of $c$. This is a common problem with using improper priors for comparing models, though some solutions now exist in the literature (see Berger and Pericchi 1996 for reporting the intrinsic Bayes factor while testing with improper priors).

Multiple Testing

Although multiple testing may refer to many different statistical inference problems, we restrict ourselves to situations where a moderate to large number of related hypotheses are to be tested together. Two common situations are large scale significance testing (e.g., in microarray studies) and variable selection in linear regression models. Most large scale significance testing can be conceptualized as follows: we have data $X_1, \ldots, X_m$ on $m$ objects (say genes) which are modeled as $X_i \overset{\mathrm{IND}}{\sim} N(\mu_i, \sigma^2)$ and it is desired to test which of the means $\mu_i$ are non-zero. In Gaussian linear regression of the form $Y_i = \alpha + z_i^T \beta + \epsilon_i$, $\epsilon_i \overset{\mathrm{IID}}{\sim} N(0, \sigma^2)$, it may be desired to determine which of the coordinates of $\beta$ are non-zero. In this set of notes, I will only discuss the large scale significance testing problem. Similar concerns and concepts apply to regression (HW 4). Two excellent papers on these issues are Scott and Berger (2006) and Scott and Berger (2010).

For large scale significance testing, either from the classical or the Bayesian perspective, a foundational point has been to treat the $m$ separate cases not in isolation (as you'd do for IID cases) but in unison within a framework of exchangeability. The idea is to learn from all cases even though separate decisions are to be taken on each. This concept underlies all modern classical multiple testing approaches based on false discovery rate and its variants, e.g., the method by Benjamini and Hochberg (1995). Scott and Berger (2006) recommend the following Bayesian approach.
Conditionally on $\sigma^2$, assign the following product prior on $(\mu_1, \ldots, \mu_m)$ determined by a common null propensity parameter $p$ and a non-zero mean spread $V$:

$$\mu_i \mid (\sigma^2, p, V) \overset{\mathrm{IND}}{\sim} p\,\delta_0(\mu_i) + (1 - p)\,N(\mu_i \mid 0, V), \quad i = 1, \ldots, m,$$

and assign these new parameters the following prior:

$$(p, V) \mid \sigma^2 \sim a p^{a - 1} \times \frac{1}{\sigma^2}\left(1 + V/\sigma^2\right)^{-2}.$$

In some cases $\sigma$ may be assumed known, otherwise assign a default prior $\pi(\sigma^2) = 1/\sigma^2$. For the conditional prior on $p$, a default choice could be $a = 1$, leading to the uniform pdf. In many situations only a small proportion of cases are expected to be non-zero, and so one could choose a large value of $a$ to reflect such prior belief. The posterior probability of a zero mean is

$$p_i := P(\mu_i = 0 \mid x) = \int \left[1 + \frac{1 - p}{p}\left(1 + V/\sigma^2\right)^{-1/2} \exp\left\{\frac{x_i^2\,V}{2\sigma^2(\sigma^2 + V)}\right\}\right]^{-1} \pi(p, V, \sigma^2 \mid x)\,dp\,dV\,d\sigma^2$$

and the integral is easily evaluated by an efficient importance sampling Monte Carlo.

Berger provides the following toy example to illustrate why such a prior choice makes sense for the large scale significance testing problem. Assume $\sigma = 1$ is known. Consider the following ten signal observations: $-8.48, -5.43, -4.81, -2.64, -2.40, 3.32, 4.07, 4.81, 5.81,$ […]. Next, generate $n = 10, 50, 500,$ and $5000$ $N(0, 1)$ noise observations. Mix them together and try to identify the signals. Here are results from such an experiment:

[Table: posterior null probabilities $p_i$ of the ten signal observations, by number of noise observations $n$, with a final column counting the noise observations with $p_i < 0.6$. The entries were not recovered.]

Clearly, the joint analysis provides a multiplicity adjustment: the same signals are deemed weaker when lots of noise observations are added. In contrast, if one treated the cases independently, with the following prior for the $i$-th case:

$$\mu_i \sim \pi_0\,\delta_0(\mu_i) + (1 - \pi_0)\,N(\mu_i \mid 0, V_i), \qquad V_i \mid \sigma^2 \sim \frac{1}{\sigma^2 (1 + V_i/\sigma^2)^2},$$

then the number of noise observations with $p_i < 0.6$ would grow linearly in $n$.

One final point about this. I used the model of Scott and Berger (2006) to illustrate the issue of multiplicity and the need for a joint (hierarchical) model. There are other important issues that one needs to care about in large scale significance testing. Brad Efron has a series of interesting work on this (with his two groups model). I did some work on this incorporating non-parametric Bayes within Efron's framework.

P-value calibration

In formal Bayesian testing, we are able to quantify in $P(H_0 \mid x)$ our certainty about $H_0$ (modulo the prior and the data). In classical testing, one often uses the p-value to reflect strength of evidence against $H_0$. How do these two measures compare? An excellent read on this is Berger (Stat Sci, 2003), who summarizes a long series of work by Berger, Wolpert, Sellke, Delampady, Bayarri and others.

Berger (2003) argues that in almost every common situation, a p-value carries less evidence against $H_0$ than what its numerical value suggests. A p-value of 0.05 usually reflects a […] chance for $H_0$, and at the very least a 25% chance of $H_0$ (under equal prior odds).

Recall that for a family of classical tests based on a single test statistic $T(x)$ and all possible thresholds $c$, the p-value is

$$p(x) = \max_{\theta \in \Theta_0} P_{[X \mid \theta]}(T(X) > T(x))$$

where the maximum is usually attained at some fixed point $\theta_0 \in \Theta_0$. So under the null, $p(X) \sim \mathrm{Unif}(0, 1)$, and under the alternative one will expect $p(X)$ to be smaller. Thus one could formulate a testing problem on $p(X)$ itself: $H_0 : p(X) \sim \mathrm{Unif}(0, 1)$ and $H_1 : p(X) \sim f$, where $f$ is a pdf on $[0, 1]$ concentrated around 0. Sellke, Berger and Bayarri (Am Stat 2001) consider various reasonable non-parametric choices of $f$ and show that they yield a lower bound on the Bayes factor,

$$B_{01}(p) \ge -e\,p \log p,$$

and so a lower bound on the posterior null probability,

$$P(H_0 \mid x) \ge \left(1 + [-e\,p \log p]^{-1}\right)^{-1}.$$

The main reason behind this discrepancy between the numeric value $p(x) = p$ and the lower bound on $P(H_0 \mid x)$ is as follows. If we had only observed $p(X) < p$ then the lower bound on $P(H_0 \mid x)$ would indeed be about $p$. But we do get to see a precise value $p$ for $p(x)$, and the information $p(X) = p$ is very different (and less harsh on $H_0$) than the information $p(X) < p$. The lower bound calibrates the strength of evidence against $H_0$ based on $p(X) = p$.
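The calibration bound is easy to tabulate. The following R sketch evaluates $-e\,p \log p$ and the implied lower bound on $P(H_0 \mid x)$ for a few p-values. The function names `bf.bound` and `post.bound` and the `prior.odds` argument are illustrative additions, not part of the handout; equal prior odds (`prior.odds = 1`) recover the formula above. The bound in this form is stated by Sellke, Berger and Bayarri for $p < 1/e$, which the code enforces.

```r
## Lower bound on the Bayes factor of H0 to H1 given an observed p-value.
## The -e * p * log(p) form applies for 0 < p < 1/e.
bf.bound <- function(p) {
  stopifnot(all(p > 0), all(p < exp(-1)))
  -exp(1) * p * log(p)
}

## Corresponding lower bound on P(H0 | x).
## prior.odds = P(H0) / P(H1); the default of 1 means equal prior odds.
post.bound <- function(p, prior.odds = 1) {
  1 / (1 + 1 / (prior.odds * bf.bound(p)))
}

p.vals <- c(0.05, 0.01, 0.001)
calib <- data.frame(p.value    = p.vals,
                    bf.lower   = bf.bound(p.vals),
                    post.lower = post.bound(p.vals))
print(calib)
```

Under equal prior odds, a p-value of 0.05 gives a lower bound of roughly 0.29 on $P(H_0 \mid x)$, well above 0.05 itself.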

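    ## A function to calculate negative loglik + negative log.prior.
    ## Placeholder: the body is model specific.
    neg.lp <- function(theta, ...) {}

    ## This returns log(sum(exp(lx))), but is numerically more stable.
    logsum <- function(lx) return(max(lx) + log(sum(exp(lx - max(lx)))))

    ## The importance sampling log f(x) calculator.
    imp <- function(neg.lp, theta.start, nu = 3, a = 5, n.samp = 1e4){
      d <- length(theta.start)
      ## Posterior mode and Hessian of the negative log-posterior.
      op1 <- optim(theta.start, neg.lp)
      op <- nlm(neg.lp, op1$par, hessian = TRUE)
      theta.hat <- op$estimate
      H <- op$hessian / a            # inverse of the proposal scale matrix a * H^{-1}
      R <- chol(H)                   # upper triangular, t(R) %*% R = H / a
      ## Draw from t_nu(theta.hat, a * H^{-1}) as a scale mixture of normals.
      u.samp <- rgamma(n.samp, nu / 2, 1 / 2)
      u.mat <- outer(rep(1, d), u.samp)
      z.samp <- matrix(rnorm(d * n.samp), nrow = d)
      theta.samp <- (theta.hat
                     + backsolve(R, z.samp) * sqrt(nu / u.mat))
      ## Log density of the multivariate t proposal at each draw.
      log.q <- (lgamma((nu + d) / 2) - lgamma(nu / 2)
                - (d / 2) * log(nu * pi) + sum(log(diag(R)))
                - (nu + d) / 2 * log1p(colSums(z.samp^2) / u.samp))
      ## Log importance weights: log f(x | theta) + log pi(theta) - log q(theta).
      log.wt <- -apply(theta.samp, 2, neg.lp) - log.q
      return(logsum(log.wt) - log(n.samp))
    }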
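A useful sanity check for any marginal likelihood routine is a conjugate model, where the answer is known. The sketch below compares three calculations of $\log f(x)$ for the binomial model with a $\mathrm{Be}(a, b)$ prior: the closed form, the identity $f(x) = f(x \mid p^*)\,\pi(p^*)/\pi(p^* \mid x)$, and importance sampling with a scaled $t_\nu$ proposal. The data and prior values are hypothetical. As assumptions of this sketch, the sampler works on $\theta = \log\{p/(1 - p)\}$ (so the log-target includes the Jacobian term $\log p + \log(1 - p)$) and uses R's `rt` and `dt` directly in one dimension rather than the gamma mixture used in the handout's `imp` function.

```r
set.seed(1)
x <- 7; n <- 20; a <- 1; b <- 1   # hypothetical data and Be(a, b) prior

## 1. Closed form: choose(n, x) * B(a + x, b + n - x) / B(a, b).
log.f.exact <- lchoose(n, x) + lbeta(a + x, b + n - x) - lbeta(a, b)

## 2. Identity f(x) = f(x | p*) pi(p*) / pi(p* | x) at a point p* in (0, 1).
p.star <- (x + a) / (n + a + b)
log.f.trick <- (dbinom(x, n, p.star, log = TRUE)
                + dbeta(p.star, a, b, log = TRUE)
                - dbeta(p.star, a + x, b + n - x, log = TRUE))

## 3. Importance sampling on theta = logit(p).
## Negative log-likelihood plus negative log-prior on the theta scale;
## the exponents x + a and n - x + b absorb the Jacobian log(p) + log(1 - p).
neg.lp <- function(theta) {
  log.p  <- plogis(theta, log.p = TRUE)     # log(p)
  log.1p <- plogis(-theta, log.p = TRUE)    # log(1 - p)
  -(lchoose(n, x) + (x + a) * log.p + (n - x + b) * log.1p - lbeta(a, b))
}

nu <- 3; scale.a <- 5; M <- 1e5
op <- nlm(neg.lp, 0, hessian = TRUE)        # posterior mode and curvature
theta.hat <- op$estimate
s <- sqrt(scale.a / op$hessian[1, 1])       # proposal scale, sqrt(a / H)

theta.samp <- theta.hat + s * rt(M, df = nu)
log.q <- dt((theta.samp - theta.hat) / s, df = nu, log = TRUE) - log(s)
log.wt <- -neg.lp(theta.samp) - log.q

logsum <- function(lx) max(lx) + log(sum(exp(lx - max(lx))))
log.f.is <- logsum(log.wt) - log(M)

print(c(exact = log.f.exact, trick = log.f.trick, importance = log.f.is))
```

The first two numbers agree up to rounding error, while the importance sampling value carries Monte Carlo error that shrinks as `M` grows.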
