DRAFT

Calculating Confidence Limits

DØ Note 4491

J. Linnemann(1), M. Paterno(2) and H.B. Prosper(3)
(1) Michigan State University, East Lansing, Michigan
(2) Fermi National Accelerator Laboratory, Batavia, Illinois
(3) Florida State University, Tallahassee, Florida

(Dated: July 1, 2004)

Abstract

In Run I, the DØ Collaboration adopted a Bayesian method for calculating confidence limits that was recommended by the Confidence Limits Working Group [1]. This paper describes the method and discusses a few important issues pertaining to it.

I. INTRODUCTION

This paper addresses the question of how to present the results of an experiment that observes few or no events. There is broad agreement that the answer is to set an upper limit at some specified confidence level (CL) on the cross section for the process of interest. The aim of this paper is: 1) to provide arguments for the adoption of a fundamentally Bayesian method, and 2) to describe that method in detail. To motivate our choice, and to provide background for the reader, we review in Section II the concepts of confidence and credible intervals, while in Section III we briefly review a few published prescriptions for constructing confidence intervals. We then present in Section IV our reasons for choosing the method adopted by the DØ Collaboration during Run I, and describe the procedure. Finally, in Section VII, we provide a summary and conclusions.

II. FREQUENTIST AND BAYESIAN INTERVALS

A. What is Probability?

There are two schools of statistical inference (Frequentist and Bayesian) that differ in their interpretation of the concept of probability. The laws of probability theory, that is, the mathematical rules that govern the manipulation of abstract symbols called probabilities, are used by both. Much of the confusion that pervades the subject of intervals can be traced, ultimately, to a failure either to accept, or to appreciate, that there is a difference between the two approaches and that this difference matters. The confusion becomes particularly acute when the numerical results of the two schools agree, from which it is sometimes concluded that the conceptual difference between the schools is merely philosophical, and as such irrelevant. This is not so! To see this, let us begin with the definition of probability according to the (dominant) Frequentist school.

1. The Frequentist Meaning of Probability

Although introduced in the earlier work of J. W. Gibbs, the definition of probability we give here is associated with R. Fisher. Two other definitions (those of J. Neyman and R. von Mises), also of the Frequentist school, are less suitable. A clear explanation of each can be found in Ref. [2]. See, also, Ref. [3] for an historical perspective. Letting x denote the outcome of some observation, and I some other information of relevance to that observation, we denote the probability of x, given I, as P(x|I). The Frequentist school defines this probability P(x|I) as the limit (as the sample size tends to infinity) of the relative frequency with which x occurs, out of a sample of observations that satisfy whatever constraints are implied by the specification of I. To be precise, let k be the number of observations that satisfy both x and I (the successes) out of n observations that satisfy I (the trials); then the Frequentist definition of probability is

P(x|I) = lim_{n→∞} k/n.   (II.1)

This definition is extremely important in practice because it provides the connection between frequency data, for example, counts in the bins of a histogram, and the true underlying density function of which the distribution of counts is an estimate. However, the definition is not without difficulty. The problem is that the limit does not exist in a strict mathematical sense! The reason is that there is no deterministic rule that connects the outcome of trial n + 1 with that of the previous trial n, because, by assumption, the outcomes are random. Moreover, the limit cannot be found empirically since no sequence of trials is ever infinite. In practice, probabilities are predicted from a judicious use of combinatoric reasoning, of which the derivation of the binomial distribution is a classic example.
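For intuition, the limiting relative frequency of Eq. (II.1) can be simulated. The sketch below is purely illustrative and is not part of the DØ procedure; the success probability 0.3 and the sample sizes are arbitrary assumptions:

```python
# Illustrative sketch: estimate P(x|I) as the relative frequency k/n of
# "successes" in n Bernoulli trials with true probability 0.3 (an
# arbitrary choice made for this example).
import random

random.seed(1)
true_p = 0.3
for n in (100, 10_000, 1_000_000):
    k = sum(random.random() < true_p for _ in range(n))
    print(n, k / n)  # k/n fluctuates around 0.3, settling as n grows
```

No finite n yields the limit itself, which is the point made above: in practice k/n only estimates P(x|I).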
The relationship between these abstractly derived probabilities and the relative frequencies we measure in experiments is a subtle subject that is well beyond the scope of this paper. We merely note that the pivotal theorem that underpins the relative frequency interpretation establishes that the limit in Eq. (II.1) exists with probability one,

provided that the order of trials is unimportant and the probability of success at each trial is the same. This theorem is due to de Finetti and is itself based on another important theorem (see Appendix C) by the same author.

2. The Bayesian Meaning of Probability

In the Bayesian school, probability is interpreted as a number that measures the degree of belief in, or plausibility of, a proposition. It therefore makes sense to talk about the probability of an hypothesis, whereas such a statement has no meaning for Frequentists. The Bayesian school derives probabilities using the same combinatoric reasoning as deployed by the Frequentists, and so the basic difference between the two schools lies in the fact that in the Frequentist interpretation a probability describes a state of Nature, while in the Bayesian view a probability describes a state of knowledge. It is this difference that leads to the difference in the interpretation of Frequentist and Bayesian intervals and upper limits. In order to highlight these differences, we shall reserve the term confidence intervals for Frequentist intervals and credible intervals for Bayesian intervals.

B. Confidence Intervals

The concept of the confidence interval was introduced by Neyman in a seminal paper published in 1937 [5]. It is easiest to explain the concept with an example. Let us consider the measurement of a cross section x. We suppose that the cross section has one fixed value x₀ and that we have acquired data D that can be used to measure, or as statisticians say estimate, the cross section. By a measurement we shall mean a procedure that yields an interval [L(D), U(D)] whose lower and upper limits, L and U, respectively, depend on the data D. Now imagine repeating the experiment an arbitrarily large number of times. Each repetition of the experiment would, in general, yield different data D and hence a different interval. Some fraction of the intervals in this ensemble of intervals will bracket,

that is, contain, the true (but unknown) value x₀ of the cross section. That fraction, a relative frequency, is called the coverage probability of the ensemble of intervals. Neyman required that the coverage probability for this ensemble be greater than or equal to the desired confidence level, say β = 0.95. If this were so, then the probability that an interval drawn at random from the ensemble contains the true value would be, by construction, at least β. There is a snag, however, with constructing a single ensemble of intervals satisfying the aforementioned criterion: we do not know the true value of the cross section. It is therefore clearly necessary, as Neyman recognized, that the bound on the coverage probability hold true whatever the true value of the parameter to be estimated, here the cross section x. This requires one to construct an ensemble of intervals for every possible value of the cross section. If each ensemble has, by construction, a coverage probability greater than or equal to the desired confidence level, then one is assured that an interval will be drawn with a probability not less than the confidence level. Intervals that satisfy this criterion, called the Neyman criterion, are called confidence intervals and are said to cover. A few things are worth emphasizing: 1) the intervals themselves are the random variables, and 2) the coverage property refers to repeated applications of the method. Consequently, it applies to repetitions of different experiments as well as to repetitions of the same experiment.

1. Credible Intervals

A credible interval [L(D), U(D)] is constructed so that the probability, i.e., the measure of the plausibility of the proposition, that the true value x₀ is in the range [L(D), U(D)] is exactly β. As in the Frequentist prescription, the unknown true value x₀ is assumed fixed.
Unlike the Frequentist prescription, there is no statement concerning the relative frequency of the result of a set of measurements, and no reference to an ensemble of intervals. There is, however, the necessity of assigning a probability, in the absence of the measurement, to the proposition that the quantity of interest lies between x and x + dx. This probability is called the

prior probability (often just the prior) for the quantity of interest, here the cross section. The probability of x in light of the observed data D is called the posterior probability and is denoted P(x|D, I). Since x is a continuous parameter, we can write this probability in terms of its probability density f(x|D, I), where P(x|D, I) = f(x|D, I) dx. The Bayesian upper limit x_UP on x at confidence level β is defined by

β = ∫₀^{x_UP} f(x|D, I) dx,   (II.2)

where f(x|D, I) is called the posterior density. As we describe later, the Bayesian definition of probability allows one to describe directly P(theory|data), while the Frequentist definition does not. The likelihood, used by both Frequentists and Bayesians, describes only P(data|theory).

III. COMMENTS ON RECENT LITERATURE

We have surveyed recent literature and summarize our perceptions in this Section. It is clear from our survey that particle physicists, being pragmatists, use a Bayesian-like approach to deal with systematic uncertainty. Statements such as "...the Poisson distribution is smeared with Gaussian errors..." occur often in publications, and usually without comment. Unfortunately, however, mixing the two interpretations of probability sometimes causes confusion or renders the interpretation of the probabilities ambiguous. To illustrate both points, we consider briefly papers by Helene [7] and Cousins and Highland [8]. We also comment on the method of Feldman and Cousins [9] and the CLs method developed at LEP [10].

A. Helene

Helene [7] notes the difficulty of taking into account cogent prior information in the Frequentist approach (e.g., that a rate must be positive). After performing a proper Bayesian calculation of confidence intervals for the Poisson mean, using a flat prior, Helene purports

to show the superiority of the Bayesian intervals to those of the Frequentists. He tries to do this using an ensemble of Monte Carlo generated experiments. The first point of confusion arises because of Helene's use of a Frequentist concept, the coverage probability over an ensemble (the fraction of intervals that bracket the true value), in a Bayesian context [26]. For Bayesians, an ensemble is not needed to interpret a probability; in particular, it is not needed to interpret a Bayesian confidence level. The second point of confusion arises because the Monte Carlo method used by Helene, as noted by James [13] (see also Prosper [14]), is not the correct method to compute coverage probabilities. To compute the latter one should keep the unknown Poisson mean fixed, generate an ensemble of experiments, and compute the fraction of intervals that contain the true fixed value, that is, the coverage probability. The procedure should then be repeated with another fixed value for the Poisson mean, and repeated often enough to provide a fair sampling of the entire parameter space.

B. Cousins and Highland

Cousins and Highland [8] consider the very important problem of how to include systematic uncertainty in a Poisson upper limit. They write the Poisson mean, µ, as µ = RS, where R might be, for example, a cross section times branching fraction (the parameter of interest) and S is some sensitivity factor for the measurement, for example, the effective integrated luminosity, which is of no intrinsic interest; that is, the sensitivity is a nuisance parameter. A key assumption is that subsidiary measurements lead to an estimate Ŝ for S, with a known standard deviation δS. Cousins and Highland acknowledge, from the outset, the inability of Frequentist methods to account for systematic uncertainty in a straightforward and conceptually coherent way. They therefore use a Bayesian approach to represent the information about S.
Given a likelihood function f(Ŝ|S) describing the result of the subsidiary measurement of the sensitivity, they compute the posterior probability density W(S|Ŝ) assuming a flat prior

density for S. They then convolve the (cumulative) Poisson distribution with the posterior probability density W(S|Ŝ) for S and, having thereby removed the nuisance parameter S, compute upper limits for R using the resulting smeared cumulative Poisson distribution function

C(N|R) = ∫ ( Σ_{n=0}^{N} Poisson(RS, n) ) W(S|Ŝ) dS,   (III.1)

where

Poisson(µ, k) = e^{−µ} µ^k / k!.   (III.2)

In Ref. [8], 90% upper limits were computed using

W(S|Ŝ) = Gaussian(S, Ŝ, δS),   (III.3)

for δS/Ŝ ≤ 0.3, where

Gaussian(λ, a, σ) = (1/(σ√(2π))) exp[−((λ − a)/σ)²/2].   (III.4)

The authors have clearly accepted, and make use of, the notions of Bayesian prior and posterior probabilities, as well as the utility of integrating with respect to a parameter, not to mention the practical utility of a flat prior, yet they refrain from performing a consistent Bayesian calculation. The reason offered is that physicists prefer to treat R using Frequentist methods, since they regard them as objective. We concede that using a prior for R comes at a price: the upper limit on R would be more sensitive to the choice of prior for R than to the choice of prior for S, since the posterior probability for S is dominated by its likelihood function. This, however, does not alter the fact that the choice of prior for S is no more or less subjective than would be a similar choice of prior for R. Moreover, it is not clear how the confidence level is to be interpreted, because of the mixture of Frequentist and Bayesian elements in the calculation. We see no obvious way of proving that the intervals derived by this mixed prescription possess Frequentist coverage in general, though they do possess the property in some specific cases which have been examined.
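The smearing of Eq. (III.1) is straightforward to carry out numerically. The following sketch is our illustration, not code from Ref. [8]; the ±4δS integration grid, the bisection bounds, and the example numbers are all arbitrary choices:

```python
# Sketch of a Cousins-Highland style limit: integrate the cumulative
# Poisson of Eq. (III.1) over a truncated Gaussian W(S|Shat), then
# bisect for the R at which the smeared cumulative equals 1 - beta.
import math

def poisson_cum(mu, N):
    # cumulative Poisson: sum_{n=0}^{N} e^{-mu} mu^n / n!
    term, total = math.exp(-mu), 0.0
    for n in range(N + 1):
        total += term
        term *= mu / (n + 1)
    return total

def smeared_cum(R, N, Shat, dS, steps=400):
    # midpoint-rule integral over S in [Shat - 4 dS, Shat + 4 dS],
    # dropping (and renormalizing away) the unphysical region S <= 0
    total, norm = 0.0, 0.0
    for i in range(steps):
        S = Shat - 4.0 * dS + 8.0 * dS * (i + 0.5) / steps
        if S <= 0.0:
            continue
        w = math.exp(-0.5 * ((S - Shat) / dS) ** 2)
        total += w * poisson_cum(R * S, N)
        norm += w
    return total / norm

def upper_limit(N, Shat, dS, beta=0.90):
    lo, hi = 0.0, 100.0
    for _ in range(100):   # bisection: the smeared cumulative falls with R
        mid = 0.5 * (lo + hi)
        if smeared_cum(mid, N, Shat, dS) > 1.0 - beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(upper_limit(N=0, Shat=1.0, dS=0.3), 2))
```

In the limit δS → 0 this reproduces the unsmeared Poisson result (for N = 0 at 90% CL, R_UP = ln 10 ≈ 2.30), while a sizable δS/Ŝ inflates the limit.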

1. Further Comments

In the method of Cousins and Highland the estimate Ŝ is required to be unbiased, that is, ⟨Ŝ⟩ = S. It is clear that we should also require Ŝ > 0, which implies that the likelihood function f(Ŝ|S) must be restricted to the domain Ŝ ∈ [0, ∞). If the likelihood is a Gaussian it would have to be renormalized, yielding

f(Ŝ|S) = Gaussian(Ŝ, S, δS) / ∫₀^∞ Gaussian(Ŝ, S, δS) dŜ
       = Gaussian(Ŝ, S, δS) / { (1/2) [1 + Erf(S/(√2 δS))] },  Ŝ > 0,   (III.5)

where Erf is the error function. One consequence of the truncation is that the estimate Ŝ is no longer unbiased; but it is not clear that this is a problem. What, in principle, is more important is that the likelihood f(Ŝ|S) be correctly normalized before it is used in Bayes Theorem to compute the posterior density W(S|Ŝ), which itself must be normalized with respect to S. In practice, however, we have found that the use of an unrenormalized Gaussian for the posterior probability density W(S|Ŝ) is a good approximation for δS/Ŝ < 0.3. (Strictly speaking, since Ŝ > δS/0.3, to be consistent, the likelihood function should be renormalized on the restricted domain Ŝ ∈ [δS/0.3, ∞) instead of [0, ∞). However, we have not investigated the effect of this.) Although it is unclear that coverage obtains in general for the confidence intervals computed with this method, there is one circumstance in which the limits cover. If, in Eq. (III.1), we swap the order of the sum and the integral, we obtain a sum over terms each of which is of the form

P(n|R) = ∫₀^∞ Poisson(RS, n) W(S|Ŝ) dS.

Since Σ_{n=0}^∞ P(n|R) = 1, P(n|R) is a non-Poisson discrete probability distribution.
If the counts were distributed according to P(n|R), rather than according to a Poisson distribution, application of the standard Frequentist procedure to compute upper limits, namely, setting the cumulative distribution Σ_{n=0}^{N} P(n|R) equal to 1 − β, where β is the desired confidence level, would yield exactly the same upper limits as obtained with the method of Cousins and Highland. But with the new sampling distribution P(n|R), coverage is guaranteed!
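The claim that P(n|R) sums to unity, and is therefore a bona fide sampling distribution, can be checked numerically. A sketch with arbitrary illustrative values of R, Ŝ and δS, using a truncated, renormalized Gaussian weight as discussed above:

```python
# Numerical check that the smeared terms
# P(n|R) = ∫ Poisson(RS, n) W(S|Shat) dS sum to one over n,
# so they form a proper (non-Poisson) discrete distribution.
import math

def gaussian(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def smeared_pmf(n, R, Shat, dS, steps=2000):
    # midpoint integration over S > 0; dividing by `norm` renormalizes
    # the Gaussian on the truncated domain
    lo, hi = max(1e-9, Shat - 6.0 * dS), Shat + 6.0 * dS
    h = (hi - lo) / steps
    total, norm = 0.0, 0.0
    for i in range(steps):
        S = lo + (i + 0.5) * h
        w = gaussian(S, Shat, dS)
        mu = R * S
        # Poisson term in log space for numerical stability
        total += w * math.exp(-mu + n * math.log(mu) - math.lgamma(n + 1))
        norm += w
    return total / norm

R, Shat, dS = 2.0, 1.0, 0.2   # illustrative values only
s = sum(smeared_pmf(n, R, Shat, dS) for n in range(60))
print(round(s, 6))            # close to 1, as claimed
```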

C. Feldman and Cousins

Feldman and Cousins [9] have described a construction of confidence intervals that avoids unphysical intervals, is consistent with Neyman's viewpoint [5], and can be applied when there is a known background. The algorithm proceeds as follows: for each possible value of the parameter of interest, say the cross section x, event counts are ordered in descending order of their likelihood ratio l(x)/l(x̂), where x̂ is the maximum likelihood estimate of the parameter x; a range of counts is constructed by including counts one at a time, starting with the first of the ordered counts, until the probability content of the range of counts is no less than the desired confidence level. This procedure results in a confidence band which, for a given count n, yields a confidence interval [L(n), U(n)]. The Feldman and Cousins procedure has great merit; however, it is far from clear how systematic uncertainty can be taken into account in a manner consistent with a fully Frequentist viewpoint and in a way that is computationally feasible.

D. The CLs Method

The CLs method [10], which has been used by some groups within DØ, was developed in the context of the Higgs search at LEP. The method is a generalization of a formula proposed by Zech [11]. (See also the discussion by Bob Cousins [12].) Limits calculated with the CLs method are based on the sampling distribution of the statistic

Q = Π_{i=1}^{M} { exp[−(s_i + b_i)] (s_i + b_i)^{n_i} } / { exp(−b_i) b_i^{n_i} }
  = Π_{i=1}^{M} ( (s_i + b_i)/b_i )^{n_i} exp(−s_i),   (III.6)

or rather its logarithm

q ≡ ln Q = −Σ_{i=1}^{M} s_i + Σ_{i=1}^{M} n_i ln(1 + s_i/b_i),   (III.7)

under the signal plus background and background only hypotheses, S and B, respectively. The index i = 1, ..., M runs over channels, where s_i ≡ l_i σ is the mean signal count, with l_i the acceptance times integrated luminosity, b_i the mean background count, and n_i the observed count in the i-th channel. Given the sampling distributions p(q|S) and p(q|B), one calculates a CLs limit, at level β, by solving the following ratio of p-values (that is, tail probabilities),

β = Σ_{q ≤ q₀} p(q|S) / Σ_{q ≤ q₀} p(q|B),   (III.8)

for the upper limit on the cross section σ, where q₀ is the observed value of the statistic q. In practice, the densities p(q|H), with H = S or B, depend on parameters that are not known precisely, such as the mean backgrounds b_i and the effective luminosities l_i. In order to account for systematic uncertainty in these parameters, the densities p(q|H) are integrated with respect to l_i and b_i, weighted by a known prior density π(l_i, b_i),

β = ∫ Σ_{q ≤ q₀} p(q|S) π(l_i, b_i) dl_i db_i / ∫ Σ_{q ≤ q₀} p(q|B) π(l_i, b_i) dl_i db_i.   (III.9)

The principal difficulty with Eq. (III.8) is that it has not been derived from broadly accepted principles. Moreover, since the more general procedure, Eq. (III.9), is neither purely Frequentist nor purely Bayesian, the confidence level β cannot readily be interpreted as either a relative frequency or a degree of belief, as clearly stressed in Ref. [10].

IV. OUR SUGGESTED PROCEDURE

A. Requirements

In choosing a procedure for the setting of upper limits, we felt it necessary to be guided by a few principles. The question of how best to treat uncertainty is under active discussion in our field [16]. We believe that a consensus may be emerging, and we are hopeful that the following principles will form part of this consensus:

Practicality and Power. The method must have sufficient analytical power to answer the questions we want to ask. This includes the handling of searches in which the predicted background is greater than the number of observed events, the handling of unphysical regions, and dealing with uncertainties of all kinds.

Generality. The chosen method should apply to a wide variety of problems, so that special cases arise only rarely.

Self-consistency. Ideally, the method should be conceptually sound and avoid techniques that violate underlying mathematical theorems. This ensures that conclusions are warranted, especially in circumstances where the answers are not readily apparent.

B. Consideration of the Two Schools

Proponents of the Frequentist approach hold that it is objective and, for this reason, preferred for scientific investigations. Consider a typical Frequentist calculation of limits for a cross section. One devises a procedure which, for a well-specified ensemble, constructs for each possible true value of the cross section x an ensemble of intervals. By construction, the coverage probability of each ensemble of intervals within the class of ensembles is greater than or equal to the desired confidence level, say 95%. Moreover, as noted in Sect. II B, if one repeats such a procedure over an arbitrarily large number of different experiments, each measuring a different quantity, the confidence level over this class of classes of intervals is 95%. This is held to be a useful property. Sampling from this class of classes is analogous to the following sampling experiment: pick a ball at random from an urn chosen at random from ten urns, each containing 95% red balls and 5% white ones. Replace the ball in the urn from which it was drawn and repeat the experiment as many times as desired. Each urn is analogous to an ensemble of experiments of a given type for which 95% limits have been calculated.
Clearly, the relative frequency with which the statement I have picked a red

ball is true will approach 0.95 as the number of repetitions grows without limit. Therefore, the argument goes, one would expect that in the ensemble of 95% CL limits published over the past fifty years at least 95% of statements of the form y < y_UP are true, provided that they were all arrived at via a correct Frequentist calculation. A confidence interval is a thoroughly Frequentist concept. Therefore, it is of interest to consider why Fisher, a staunch Frequentist and the inventor of many of the statistical ideas in common use, rejected Neyman's innovation out of hand. Fisher rejected the notion of confidence intervals because of what he regarded as the inherent subjectivity in the definition of their associated confidence level. If an ensemble objectively exists then there is no problem, but Fisher [6] noted, in connection with significance tests, that "...if we possess a unique sample on which significance tests are to be performed, there is always... a multiplicity of populations to each of which we can legitimately regard our sample as belonging; so the phrase repeated sampling from the same population does not enable us to determine which population is to be used to define the probability level, for no one of them has objective reality, all being products of the statistician's imagination." One may say the same of almost all ensembles that define Frequentist confidence levels. In the end, the choice of which ensemble to use is a matter of expert judgement, convention or abstract reasoning rather than the result of an empirically verifiable procedure. It is of course true that the ensemble of published 95% limits has a coverage probability. But in view of the invariably non-objective nature of the ensembles used by each experiment, it is less than clear that the coverage probability of the ensemble of published limits is indeed 95%. More to the point, even if it is, it is not clear how that fact is to be used in a scientific investigation.
There is, however, a more serious problem, albeit a conceptual one. Because it relies on the relative frequency interpretation of probability, the Frequentist view encounters difficulties addressing systematic uncertainties. The supporting mathematical theory does not

provide a general, practical algorithm to eliminate nuisance parameters. Unfortunately, most analyses contain such parameters. The Bayesian approach, on the other hand, provides a general method to handle uncertainties, irrespective of their provenance and nature. In practice, many nominally Frequentist analyses in high energy physics include the idea of integrating over uncertainties. The Cousins and Highland and CLs methods are good examples. But integrating over parameters has no justification within the Frequentist school, because that school eschews the use of a probability density for a parameter. Such a thing simply makes no sense in this approach. In the Bayesian school, nuisance parameters are eliminated by integrating over their probability densities, because that is what probability theory dictates should be done. For practical reasons, as well as for conceptual clarity, the Run I Confidence Limits Working Group [1] recommended the Bayesian approach. This approach is preferred if for no other reason than the fact that most people, whether they are scientists or not, cannot help but think in a Bayesian way. This is hardly surprising, since Bayesian reasoning closely mirrors the way we reason inductively [18]. One feature of the Bayesian school, recognized by some as a flaw and by others as a virtue, is that the posterior probability depends, necessarily, on the priors used in the analysis. Critics cite this as evidence that the method is subjective. Supporters answer that this is merely a platitude, because the result of an analysis will always depend on the information that goes into it. We are not aware of any useful analysis in high energy physics that does not rely upon the use of judgement (that is, subjective input) on the part of the analyst. It is the rules by which the probabilities are manipulated that should be (and are) objective.
The explicit dependence on prior information allows one to study this dependence and to assess the robustness of the conclusions. Moreover, as noted above, the choice of ensemble in Frequentist analysis is invariably subjective; it is whatever a physicist judges to be sensible. A degree of subjectivity seems unavoidable in any realistic analysis, whether the approach is Frequentist or Bayesian. Critics charge that some aspects of the Bayesian approach are arbitrary. Although this charge is a serious one, it cannot be leveled at Bayesians without also being leveled at

Frequentists. The chief weakness of the Bayesian method is the difficulty of specifying a prior to represent minimal information about a parameter. If all that one is willing to say about a cross section to be measured is that it is a number between zero and forty millibarns, then the following two-part question arises: can such minimal information be represented by a prior and, if so, how? Several different priors have been advocated [19] for parameters for which minimal information is available or assumed, but it is not clear how one is to choose between them. We believe, however, that this problem is not fatal, but merely serves as a warning that we may have to rely upon a convention, or a formal rule, to make progress. However, given that every serious scientific investigation entails subjective choices, it is important to examine the sensitivity of one's conclusions to the choices one makes, including the choice of priors. Should it be found that the conclusions are not robust with respect to reasonable choices, the appropriate response is to acquire more and better data.

C. The Method

In our judgement, the benefits of a Bayesian method to construct credible intervals outweigh its principal defect. Its main benefit is a practical one: the method for constructing intervals is conceptually simple, even in the presence of nuisance parameters. Moreover, the method is general and mathematically coherent. The unsatisfactory aspect is the need to accept a convention, or a formal rule, for choosing certain prior probabilities, as described below.

1. Define the model. For the case of new particle searches, there is one accepted model: the expected number of events µ is related to the signal cross section x, the signal efficiency ɛ, the integrated luminosity L, and the expected background b through

µ = b + Lɛx.   (IV.1)

In different analyses, one may replace b by some more complicated expression, containing perhaps a sum of many terms, each of which is a product of background acceptances, background cross sections, branching fractions, and integrated luminosity. Sometimes it is convenient to subsume Lɛ into a single parameter, the effective integrated luminosity, l ≡ Lɛ.

2. Given the model, determine the likelihood function for the data. In the case of counting experiments, there is one conventionally accepted likelihood function: it is proportional to the Poisson distribution with expectation value (mean) µ. The probability of observing k events, given an expectation value of µ, is

P(k|µ, I) = Poisson(µ, k).   (IV.2)

Here I indicates all the information used to build µ, as well as the assumptions that led us to choose the Poisson distribution. With the above model, the likelihood function is proportional to

P(k|x, L, ɛ, b, I) = e^{−(b+Lɛx)} (b + Lɛx)^k / k!.   (IV.3)

Note that the above is P(data|theory), where theory includes the parameters x, L, ɛ, and b; that is, it is the probability of observing k events, given x, L, ɛ, and b. It is not P(theory|data); that is, it is not the probability for any of x, L, ɛ, or b, given k.

3. Assign prior probabilities to all parameters. Next, we use the available information (such as the knowledge of the integrated luminosity, within some bounds of uncertainty) to assign prior probabilities to each of the parameters in the problem. In general, the prior can contain correlations between the parameters. It is useful to think of the prior as P(theory), where theory includes both the cross section x and the nuisance parameters θ. When correlations are negligible, the prior can be factorized into a product of independent priors. Even when correlations are present, ignoring them would not be wrong,

but would rather correspond to accepting a less precise answer than could otherwise be obtained. In most cases, it will be of interest to factor the prior

P(x, θ|I) = P(x|θ, I) P(θ|I)
          = P(x|I) P(θ|I)   (IV.4)

into P(x|I) and P(θ|I). The choice of prior probabilities for the nuisance parameters is discussed in detail in Section VI B. The choice of prior for the signal cross section is more of a problem. As a matter of convention, we suggest adopting a flat prior of finite range:

P(x|I) = dx/M if 0 ≤ x ≤ M, 0 otherwise,   (IV.5)

where M is chosen sufficiently large that the likelihood function for x > M is negligible or, more specifically, so that the answer does not depend on M or remains finite as the limit M → ∞ is taken at the end of the calculation. We have found that this conventional choice leads to answers that most physicists find intuitively reasonable, and which, in the simple case of no background and no uncertainty in sensitivity, give the same upper limits as the Frequentist procedure. This works because in these circumstances the cross section is tightly coupled to the expected number of signal events. Nonetheless, we recommend checking the robustness of limits by studying their behavior over a plausible class of priors. Appendix B describes an important case where the flat prior can cause problems in limit calculations. It is perhaps worth pointing out that we do not recommend choosing a signal prior flat in a theoretical parameter; this can give rather different upper limits from our choice of a flat prior in cross section. For example, the signal cross section can be a rather steeply falling function of a SUSY mass parameter, or of a contact term scale parameter. Choosing a prior flat in such a parameter is the same as choosing a strong dependence in cross section, and will thus represent a strong, rather than weak, opinion about the

likely signal count. As will be seen in Section VI A, such a steeply falling signal prior would result in significantly smaller upper limits than a flat prior.

4. Apply Bayes Theorem to find the posterior probability.

Bayes Theorem, when interpreted in a Bayesian way, relates the pre-data knowledge of the parameters (the prior probabilities) to the post-data knowledge of the parameters (the posterior probabilities), with the linkage being the likelihood function and a normalization constant. This process can be thought of as a form of logic [23], in which the prior knowledge and likelihood are premises for an inference whose outcome is the updated knowledge taking into account both the prior knowledge and the data. That is, Bayes Theorem can be usefully sketched as P(theory | data) ∝ P(data | theory) P(theory). Stated using abstract notation,

P(A | BI) = P(B | AI) P(A | I) / P(B | I), (IV.6)

where, in the context in which this theorem is used, P(A | BI), P(B | AI) and P(A | I) are the posterior, likelihood and prior, respectively. The denominator in Eq. (IV.6) is determined through the normalization condition

Σ_{all A} P(A | BI) = 1. (IV.7)

To apply Bayes Theorem to our problem, we identify the following propositions:

A is the proposition that the cross section lies between x and x + dx, the integrated luminosity is between L and L + dL, the signal efficiency is between ɛ and ɛ + dɛ, and the background count is between b and b + db;

B is the proposition that we have observed k events in the data;

I reflects all relevant prior knowledge. This includes the descriptions of the parameters x, L, ɛ, and b, as well as the assumptions that went into our model.

Thus Bayes Theorem, for our problem, becomes

P(x, L, ɛ, b | k, I) ∝ [e^−(b+Lɛx) (b + Lɛx)^k / k!] P(x | I) P(L, ɛ, b | I), (IV.8)

where the constant of proportionality is determined by the condition

∫_0^M dx ∫ dL ∫_0^1 dɛ ∫ db f(x, L, ɛ, b | k, I) = 1, (IV.9)

where the probability density f(x, L, ɛ, b | k, I) is defined by

P(x, L, ɛ, b | k, I) ≡ f(x, L, ɛ, b | k, I) dx dL dɛ db. (IV.10)

Our notation was careful to write Eq. (IV.8) in terms of densities times a differential since, strictly speaking, Bayes Theorem is a theorem about probabilities, not probability densities.

5. Remove nuisance parameters.

To remove the nuisance parameters L, ɛ, and b, we merely integrate Eq. (IV.8) over them. The result is the posterior distribution for the cross section x:

P(x | k, I) = [∫ dL ∫ dɛ ∫ db f(x, L, ɛ, b | k, I)] dx ≡ f(x | k, I) dx. (IV.11)

6. Use the resulting posterior probability distribution to calculate quantities of interest.

Subsequent to the Bayesian analysis, all information about the cross section is contained in the posterior density function for x. We recommend that this function be published, along with the specific choices made for the prior probabilities. Nevertheless, it may be of interest to convey the most important information in as few numbers as possible. For example, in a search for some new process, the relevant number is usually the upper limit, calculated by integrating Eq. (IV.11) up to the upper limit as described in Section II B 1. Thus, the 95% CL upper limit U is obtained by solving

0.95 = ∫_0^U dx f(x | k, I). (IV.12)
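Steps 5 and 6 can be carried out by brute-force numerical integration. The sketch below is our own illustration, not DØ analysis code: for brevity it fixes the sensitivity at l = Lɛ = 1 and marginalizes only an uncertain background, modeled by a hypothetical truncated-Gaussian prior, before solving Eq. (IV.12) on a grid:

```python
import math

def upper_limit(k, b_hat, db, cl=0.95, M=30.0, nx=3000, nb=400):
    """95% CL upper limit on x: marginalize the background b (step 5)
    over a truncated-Gaussian prior centered at b_hat with width db,
    then integrate the posterior density in x up to the limit (step 6).
    Flat prior on x in [0, M]; sensitivity fixed at l = 1."""
    dx, dbs = M / nx, (b_hat + 5.0 * db) / nb
    kfact = math.factorial(k)
    post = []
    for i in range(nx + 1):
        x = i * dx
        s = 0.0
        for j in range(nb + 1):
            b = j * dbs                       # prior truncated at b >= 0
            mu = b + x
            pois = (math.exp(-mu) * mu**k / kfact if mu > 0
                    else (1.0 if k == 0 else 0.0))
            s += pois * math.exp(-0.5 * ((b - b_hat) / db) ** 2)
        post.append(s)                        # unnormalized f(x | k, I)
    total = sum(post)
    acc = 0.0
    for i, p in enumerate(post):
        acc += p
        if acc >= cl * total:
            return i * dx

u0 = upper_limit(0, 3.0, 0.2)   # k = 0, b = 3.0 +- 0.2: close to 3.0
```

For k = 0 the background integral factorizes out of the posterior, so the limit reduces to the familiar −ln(0.05) ≈ 3.0 of the zero-observed-events case, as the example confirms.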

If it is determined that the posterior distribution for x is peaked away from zero, it may then be more appropriate to quote the mean and the variance of the distribution and, of course, to claim discovery!

V. A BAYESIAN EXAMPLE FOR ZERO BACKGROUND

To illustrate the method described above, we apply it to the problem considered by Cousins and Highland, namely, that of calculating an upper limit U on x for the case of zero background. We assume that the probability to observe n events is given by the Poisson distribution P(n | x, l, I) = Poisson(xl, n), Eq. (III.2). The parameter x may be taken to be the cross section times the branching fraction. The sensitivity factor l is then the effective integrated luminosity; that is, l = ɛL. Following Cousins and Highland [8], we assume that we have an estimate ˆl of l with uncertainty δ_l. As discussed in Section VI B, there are strong arguments to favor representing this information as a prior probability density that goes to zero at l = 0. Accordingly, we model the sensitivity information as a Gamma prior density Gamma(λ, a, b) (see Section VI B 4). A prior of this form is expected if the sensitivity is ultimately the result of a counting experiment that yields a count k, which is then scaled by some factor γ, that is,

ˆl = γ k, (V.1)
δ_l = γ √k, (V.2)

assuming that the probability to observe count k can be modeled by a Poisson distribution

P(k | λ, I) = Poisson(λ, k), (V.3)

with mean count λ. When this likelihood is combined with a flat prior in λ, it yields a Gamma posterior density Gamma(λ, k + 1, 1), which serves as an informative prior

P(λ | I) = Gamma(λ, k + 1, 1), (V.4)

for the likelihood function P(n | x, l, I) = P(n | y, λ) = Poisson(yλ, n), where we have used the fact that l = γλ and where y ≡ γx. We have here chosen to write the prior for λ instead of l to simplify the calculations that follow. If we are given ˆl and δ_l, we can find the effective count k and scale factor γ from

k = η^−2, (V.5)
γ = ˆl η^2, (V.6)

where η = δ_l/ˆl is the relative uncertainty. In practice, it is often convenient first to integrate out the nuisance parameters,

P(n | y, I) = ∫_0^∞ Poisson(yλ, n) Gamma(λ, k + 1, 1) dλ
= [Γ(n + k + 1) / (Γ(n + 1) Γ(k + 1))] y^n / (1 + y)^{n+k+1}, (V.7)

and then apply Bayes Theorem,

P(y | n, I) = P(n | y, I) P(y | I) / ∫_0^∞ P(n | y, I) P(y | I), (V.8)

where P(y | I) = f(y | I) dy is the prior for the scaled cross section y = γx. If we adopt the convention suggested above and take P(y | I) to be flat, we obtain

P(y | n, I) = [Γ(n + k + 1) / (Γ(n + 1) Γ(k))] y^n / (1 + y)^{n+k+1} dy, (V.9)

as the posterior probability for y, from which upper limits, or credible intervals, can be computed as described in Section IV C. In Table I we show the 90% upper limits computed using the posterior probability in Eq. (V.9), where, without loss of generality, we have set the estimated sensitivity ˆl = 1.

VI. COVERAGE AND SENSITIVITY TO PRIOR OF BAYESIAN INTERVALS

A Dependence on Choice of Prior for Cross Section

The choice of a flat prior in cross section is clearly a convention. What motivates such a choice, and how dependent are the resulting intervals on the details of this choice? We have

TABLE I (columns n, η, U(n)_a, U(n)_b; entries not reproduced in this copy): (a) 90% Bayesian upper limits U(n) for the Cousins and Highland problem, computed using a Gamma prior for the sensitivity, with the estimated sensitivity ˆl = 1, for different relative errors η = δ_l/ˆl. (b) For comparison we show the Cousins and Highland limits [8], computed using a Gaussian posterior probability for the sensitivity l. (Note: the translation to the R and S notation of Cousins and Highland is R ↔ x, S ↔ l, Ŝ ↔ ˆl, and σ_r ↔ η.)
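Limits of the kind shown in Table I can be reproduced numerically from the posterior of Eq. (V.9). The sketch below is our own illustration (it assumes ˆl = 1 and zero background): it converts the relative error η into the effective count k = η^−2 and scale γ = η^2, finds the 90% quantile in y on a grid, and converts back to x = y/γ:

```python
import math

def upper_limit_ch(n, eta, cl=0.90, steps=200000):
    """90% CL Bayesian upper limit on x for the Cousins-Highland
    zero-background problem, using the Eq. (V.9) posterior
    g(y) ∝ y**n / (1+y)**(n+k+1), with k = eta**-2 and y = eta**2 * x."""
    k = eta ** -2
    gamma = eta ** 2
    ymax = gamma * (10.0 * n + 50.0)     # generous integration range
    dy = ymax / steps
    def g(y):                            # unnormalized posterior density in y
        if y <= 0.0:
            return 1.0 if n == 0 else 0.0
        return math.exp(n * math.log(y) - (n + k + 1) * math.log1p(y))
    dens = [g(i * dy) for i in range(steps + 1)]
    total = sum(dens)
    acc = 0.0
    for i, d in enumerate(dens):
        acc += d
        if acc >= cl * total:
            return (i * dy) / gamma      # convert y back to x

# eta -> 0 recovers the standard n = 0 Poisson limit -ln(0.10) ≈ 2.30;
# a 10% sensitivity uncertainty inflates it slightly.
```

The logarithmic form of the density avoids overflow for the large effective counts k that correspond to small η.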

performed the following tests. Our basic problem is finding limits on the cross section in the model

µ = b + Lɛx = b + s. (VI.1)

To examine the dependence on priors for s, take the expected background b, the luminosity, and the efficiency as perfectly known constants, reducing our problem to finding Poisson upper confidence limits. In this discussion we can substitute s for x, as they differ by a known scale factor (the sensitivity), which we take for this discussion as ɛL = 1. We now consider the β = 95% upper limit for several plausible priors. The prior we recommend for s is P(s | I) = 1/M, where M is some maximum value. Jeffreys has offered various suggestions for priors [2, 19]. The constant prior is consistent with Jeffreys recommendations for a finite-range location parameter. For a semi-infinite parameter he suggested 1/s, but his analysis emphasizing invariance under change of variable leads to 1/√s for a Poisson mean. This latter is derived from a procedure usually referred to as the Jeffreys prior by statisticians [27]. Together, these ideas suggest consideration of the family 1/s^p. A second interesting family is the exponential family e^−as. Phenomenologically, this family places finite rather than infinite weight at s = 0 and still includes the flat prior as a member (a = 0). More theoretically, this family results from a natural Bayesian updating procedure (see Appendix A): take as the prior the posterior of a previous experiment that used a flat prior and observed 0 events (independent of the original experiment's expected background). The parameter a gives the relative sensitivity of the previous experiment to the present experiment, a = ɛ_0 L_0 / ɛL. A previous experiment with one tenth the present experiment's sensitivity would have a = 0.1, while an equally sensitive previous experiment with which you wished to combine results would have a = 1. It is not difficult to see how the prior affects the limits.
The Bayesian tool for limit calculation is the posterior density

f(s | k, I) = const P(k | s + b) f(s | I) ∝ e^−s (s + b)^k s^−p, (VI.2)

or

f(s | k, I) = const P(k | s + b) f(s | I) ∝ e^−s(1+a) (s + b)^k. (VI.3)

We will set the right tail probability of f(s | k, I) to 1 − β. Changing the prior density f(s | I) will change the shape of the posterior distribution and thus the upper limit on s. Priors which emphasize the region near s = 0 more heavily than the flat prior will produce smaller (more optimistic) limits, while priors which emphasize larger s will produce larger (more conservative) upper limits. It is impossible to define an upper limit if the posterior is not normalizable, though this requirement does not necessarily imply that the prior itself must be normalizable. Consider values of s in the range from m to M (we can let the limits tend toward 0 and ∞ later). The Bayesian credible limits U of size β are defined by

β = ∫_m^U f(s | k, I) ds / ∫_m^M f(s | k, I) ds. (VI.4)

In terms of the (un-normalized) cumulative distribution function

w(x) = ∫_m^x ds e^−s (s + b)^k f(s | I), (VI.5)

the definition of the interval becomes

β = w(U)/w(M), (VI.6)

which is solved by

U = w^−1(β w(M)). (VI.7)

The limits depend weakly on M so long as M is chosen sufficiently larger than k; larger M gives slightly larger limits. This is in accord with the expectation that if one really knows that M is small before the experiment, this knowledge should be incorporated in the estimation of the upper limit. The power family of priors shows considerable dependence on the chosen power. Jeffreys 1/s suggestion leads to an upper limit of 0 regardless of the number of events seen, unless

events are seen and rigorously no background is expected. This choice concentrates too much probability near s = 0. Lest this prior be derided as absurd, it is equivalent to a prior flat in ln(s), and thus is invariant under transformation of variable to any power of s: a prior flat in the logarithm of a mass would give the same result when, say, s ∝ 1/mass^2. However, the square-root prior in s is invariant under even more general transformations. For k = 0, we know that no signal events have been observed, and the limits are independent of b. The 95% limits are 1.9 for the 1/√s prior, 3.0 for the flat prior, and 3.9 for a √s prior. An e^−s prior would give 1.5 for the upper limit (half the value for the flat prior), as a result of combining two k = 0 experiments of equal sensitivity. Figure C shows the dependence of the 95% upper limit for the cases of b = 3, k = 0, 3, 10 for the power prior as a function of the power (with 0 representing the flat prior). When we have substantial data (for example k = 10, b = 3), the assumptions of the prior become less important, but when little data are available the prior counts significantly; this is the case with many search limit calculations. So we must choose our convention carefully. One of the many criteria which have been proposed for choosing priors to represent lack of relevant knowledge is whether the resulting credible intervals satisfy Frequentist coverage [19]. One attraction of choosing a criterion with good coverage properties is that it satisfies expectations built up by limit definitions in simpler situations. Further, in the simplest case of a Poisson mean with no background, a flat prior for the mean results in a limit of the same size as the Frequentist limit, giving a connection in continuity for the simple cases where the limit values can be calculated in either scheme.
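The limits quoted above can be checked with a few lines of numerical integration. The sketch below (our illustration) builds the posterior e^−s (s + b)^k f(s) on a grid and solves β = w(U)/w(M) as in Eqs. (VI.4)-(VI.7):

```python
import math

def upper_limit(k, b, prior, beta=0.95, M=50.0, steps=500000):
    """Bayesian upper limit of credibility beta for the posterior
    density exp(-s) * (s+b)**k * prior(s) on (0, M], Eqs. (VI.4)-(VI.7)."""
    ds = M / steps
    vals = [math.exp(-s) * (s + b) ** k * prior(s)
            for s in (i * ds for i in range(1, steps + 1))]  # avoid s = 0
    total = sum(vals)                                        # w(M)
    w = 0.0
    for i, v in enumerate(vals):
        w += v
        if w >= beta * total:
            return (i + 1) * ds

b, k = 3.0, 0
flat = upper_limit(k, b, lambda s: 1.0)                 # ~3.0
root = upper_limit(k, b, lambda s: math.sqrt(s))        # ~3.9
inv  = upper_limit(k, b, lambda s: 1.0 / math.sqrt(s))  # ~1.9
expo = upper_limit(k, b, lambda s: math.exp(-s))        # ~1.5
```

Because k = 0, the factor (s + b)^k drops out, which is why these limits do not depend on b.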
Though coverage is clearly a criterion outside the interest of many Bayesians, it has been studied often in the statistical literature [19]. The coverage of Bayesian intervals can be shown to converge to Frequentist values as k becomes large, possibly as fast as k^−3/2 for carefully selected priors. For a lucid discussion of these issues see Ref. [20]. To summarize, the flat prior has some attractive features, but does not stand out as a special, stationary point in limit calculations. The choice of a prior matters most, in a case

such as the calculation of upper limits, when there is insufficient data to dominate the prior. As a result, over a plausible range of priors (such as √s to flat), the Bayesian limits may vary by as much as 30%. The conclusion to be drawn from this is that limits are intrinsically, unavoidably, soft when the data on which they are based are limited.

B Dependence on Choice of Informative Prior for Efficiency and Luminosity

The prior for the cross section is intended to be minimally informative, that is, to contain little of our opinions about the un-measured cross section. We will in this section study the sensitivity of the upper limit to varying forms of the informative prior for the nuisance parameters, each representing the same estimate and uncertainty. For simplicity, consider a measurement with a perfectly known expected background, b. The efficiency and luminosity ɛ, L are nuisance parameters: parameters we must estimate, but which are not the goal of the experiment. We typically estimate these parameters in side experiments, or by approximate calculation. We represent the results of these subsidiary measurements, ˆɛ ± δ_ɛ and ˆL ± δ_L, with informative priors for the unknown actual values ɛ, L of the nuisance parameters. We may find it convenient to combine these into an effective luminosity, l = ɛL. Then we will integrate out these nuisance parameters, with most weight being placed at the most likely values of the nuisance parameters. This results in an averaged likelihood, known as a marginal likelihood. We define the fractional uncertainty in the sensitivity by

η = √[(δ_ɛ/ˆɛ)^2 + (δ_L/ˆL)^2] = δ_l/ˆl. (VI.8)

For the purposes of this section, it is convenient to define a scaled sensitivity variable φ = l/ˆl, with estimate ˆφ = 1 ± η. In this spirit, we will use η to parameterize the informative prior for φ, rather than adjust the posterior mean and rms of this distribution to precisely match the estimates.
Without loss of generality, we can further consider unit expected sensitivity

ˆl = 1, so that s = lx = ˆlφx = φx, and we can easily compare numerical values of the upper limits with other results. In the usual fashion, the posterior probability for the cross section will be given by

P(x | k) ∝ P(x) ∫ dφ P(k | φx + b) f(φ | η), (VI.9)

and the proportionality constant is fixed by dividing by the integral of the expression over x. Here f(φ | η) is the informative prior density for the sensitivity. We now consider a number of possible forms for this informative prior, and their motivation.

1 Truncated Gaussian

The most obvious candidate is a Gaussian form,

TGauss(φ | η) = [1/(η √(2π) Z)] exp[−(1/2) ((φ − 1)/η)^2], φ ≥ 0, (VI.10)

Z = ∫_0^∞ dφ [1/(η √(2π))] exp[−(1/2) ((φ − 1)/η)^2]. (VI.11)

As noted in Section V, because a Gaussian allows the possibility of negative values of the sensitivity, it must be truncated so that φ ≥ 0, with the probability normalized to one by dividing by the integral over the allowed range. On its face, this seems somewhat odd, as the need for the truncation goes hand in hand with there being a nonzero probability density at φ = 0, meaning that it is possible that the experiment had no sensitivity. The truncation is a nearly negligible effect for small η, but the distribution is more problematic at larger values of η, culminating with φ = 0 being not only possible but, according to the truncated Gaussian, nearly the most probable value. Note that the effect of the truncation is to give a posterior mean sensitivity φ > 1 and an rms less than the expected value of η.

2 Lognormal Prior

If the uncertainty is due to the sum of a number of small additive effects, a Gaussian, as above, is an appropriate model. However, if the uncertainty is better modeled by a number

of small multiplicative errors, then a Gaussian in ln φ, the lognormal distribution [21], is more appropriate. For our case,

lnor(φ | η) = [1/(η φ √(2π))] exp[−(1/2) (ln φ / η)^2], (VI.12)

where we have again used ˆφ = 1 for simplicity, so that ln ˆφ = 0. The skewness of the distribution again comes into play: the posterior mean and rms of φ are e^{η^2/2} and √(e^{η^2} − 1), both larger than the expected values.

3 Beta prior

One can also study some less phenomenological priors. Consider a sensitivity with an efficiency variable ɛ dominating the uncertainty, that is, δ_L = 0 and η = δ_ɛ/ɛ. A plausible prior density in this case is the Beta distribution,

Beta(ɛ, a, b) = [Γ(a + b) / (Γ(a) Γ(b))] ɛ^{a−1} (1 − ɛ)^{b−1}, 0 ≤ ɛ ≤ 1, (VI.13)

which arises naturally from the following consideration. One performs a Monte Carlo simulation of n events, of which k pass certain cuts. The probability to obtain k counts out of n is given by the binomial distribution

Binomial(ɛ, k, n) = C(n, k) ɛ^k (1 − ɛ)^{n−k}, (VI.14)

where C(n, k) is the binomial coefficient, which when combined with either a flat prior in ɛ, or a Beta prior, yields a Beta posterior density for the efficiency. This posterior density then serves as a Beta prior in a subsequent application of Bayes Theorem. The Beta distribution is the conjugate prior [22] for a binomial likelihood. This means that the posterior of a binomial variable with a Beta(a, b) prior is again a Beta distribution, with a increased by the number of successes and b increased by the number of failures, to give a − 1 total successes and b − 1 total failures. A flat (a = b = 1) distribution in these terms represents no successes and no failures. (For comparison, the Jeffreys prior has a = b = 1/2.)
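The conjugate update just described is a one-liner in practice; the following sketch (ours) returns the posterior Beta parameters, mean, and rms for an efficiency estimated from a Monte Carlo sample in which k of n events pass the cuts:

```python
import math

def efficiency_posterior(k, n, a=1.0, b=1.0):
    """Beta-binomial conjugate update: with a Beta(a, b) prior (flat by
    default) and k successes out of n trials, the posterior is
    Beta(a + k, b + n - k); return its parameters, mean and rms."""
    a_post, b_post = a + k, b + (n - k)
    tot = a_post + b_post
    mean = a_post / tot
    rms = math.sqrt(a_post * b_post / (tot ** 2 * (tot + 1.0)))
    return a_post, b_post, mean, rms

# 800 of 1000 simulated events pass the cuts, flat prior:
a_p, b_p, mean, rms = efficiency_posterior(800, 1000)
```

With the flat prior this gives a posterior Beta(801, 201), mean 801/1002 ≈ 0.799 and rms ≈ 0.013, essentially the familiar binomial estimate with its uncertainty.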


More information

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida Bayesian Statistical Methods Jeff Gill Department of Political Science, University of Florida 234 Anderson Hall, PO Box 117325, Gainesville, FL 32611-7325 Voice: 352-392-0262x272, Fax: 352-392-8127, Email:

More information

Divergent Series: why = 1/12. Bryden Cais

Divergent Series: why = 1/12. Bryden Cais Divergent Series: why + + 3 + = /. Bryden Cais Divergent series are the invention of the devil, and it is shameful to base on them any demonstration whatsoever.. H. Abel. Introduction The notion of convergence

More information

Bayesian Inference: What, and Why?

Bayesian Inference: What, and Why? Winter School on Big Data Challenges to Modern Statistics Geilo Jan, 2014 (A Light Appetizer before Dinner) Bayesian Inference: What, and Why? Elja Arjas UH, THL, UiO Understanding the concepts of randomness

More information

A Discussion of the Bayesian Approach

A Discussion of the Bayesian Approach A Discussion of the Bayesian Approach Reference: Chapter 10 of Theoretical Statistics, Cox and Hinkley, 1974 and Sujit Ghosh s lecture notes David Madigan Statistics The subject of statistics concerns

More information

10. Exchangeability and hierarchical models Objective. Recommended reading

10. Exchangeability and hierarchical models Objective. Recommended reading 10. Exchangeability and hierarchical models Objective Introduce exchangeability and its relation to Bayesian hierarchical models. Show how to fit such models using fully and empirical Bayesian methods.

More information

Probability theory basics

Probability theory basics Probability theory basics Michael Franke Basics of probability theory: axiomatic definition, interpretation, joint distributions, marginalization, conditional probability & Bayes rule. Random variables:

More information

STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01

STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 Nasser Sadeghkhani a.sadeghkhani@queensu.ca There are two main schools to statistical inference: 1-frequentist

More information

Chapter 4: An Introduction to Probability and Statistics

Chapter 4: An Introduction to Probability and Statistics Chapter 4: An Introduction to Probability and Statistics 4. Probability The simplest kinds of probabilities to understand are reflected in everyday ideas like these: (i) if you toss a coin, the probability

More information

Bayesian Inference for Normal Mean

Bayesian Inference for Normal Mean Al Nosedal. University of Toronto. November 18, 2015 Likelihood of Single Observation The conditional observation distribution of y µ is Normal with mean µ and variance σ 2, which is known. Its density

More information

1 Using standard errors when comparing estimated values

1 Using standard errors when comparing estimated values MLPR Assignment Part : General comments Below are comments on some recurring issues I came across when marking the second part of the assignment, which I thought it would help to explain in more detail

More information

Inference for a Population Proportion

Inference for a Population Proportion Al Nosedal. University of Toronto. November 11, 2015 Statistical inference is drawing conclusions about an entire population based on data in a sample drawn from that population. From both frequentist

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2 Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate

More information

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric?

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Statistical Methods for Particle Physics Lecture 4: discovery, exclusion limits

Statistical Methods for Particle Physics Lecture 4: discovery, exclusion limits Statistical Methods for Particle Physics Lecture 4: discovery, exclusion limits www.pp.rhul.ac.uk/~cowan/stat_aachen.html Graduierten-Kolleg RWTH Aachen 10-14 February 2014 Glen Cowan Physics Department

More information

A Measure of the Goodness of Fit in Unbinned Likelihood Fits; End of Bayesianism?

A Measure of the Goodness of Fit in Unbinned Likelihood Fits; End of Bayesianism? A Measure of the Goodness of Fit in Unbinned Likelihood Fits; End of Bayesianism? Rajendran Raja Fermilab, Batavia, IL 60510, USA PHYSTAT2003, SLAC, Stanford, California, September 8-11, 2003 Maximum likelihood

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

The dark energ ASTR509 - y 2 puzzl e 2. Probability ASTR509 Jasper Wal Fal term

The dark energ ASTR509 - y 2 puzzl e 2. Probability ASTR509 Jasper Wal Fal term The ASTR509 dark energy - 2 puzzle 2. Probability ASTR509 Jasper Wall Fall term 2013 1 The Review dark of energy lecture puzzle 1 Science is decision - we must work out how to decide Decision is by comparing

More information

An introduction to Bayesian reasoning in particle physics

An introduction to Bayesian reasoning in particle physics An introduction to Bayesian reasoning in particle physics Graduiertenkolleg seminar, May 15th 2013 Overview Overview A concrete example Scientific reasoning Probability and the Bayesian interpretation

More information

September Math Course: First Order Derivative

September Math Course: First Order Derivative September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which

More information

Sample size determination for a binary response in a superiority clinical trial using a hybrid classical and Bayesian procedure

Sample size determination for a binary response in a superiority clinical trial using a hybrid classical and Bayesian procedure Ciarleglio and Arendt Trials (2017) 18:83 DOI 10.1186/s13063-017-1791-0 METHODOLOGY Open Access Sample size determination for a binary response in a superiority clinical trial using a hybrid classical

More information

Error analysis for efficiency

Error analysis for efficiency Glen Cowan RHUL Physics 28 July, 2008 Error analysis for efficiency To estimate a selection efficiency using Monte Carlo one typically takes the number of events selected m divided by the number generated

More information

Fitting a Straight Line to Data

Fitting a Straight Line to Data Fitting a Straight Line to Data Thanks for your patience. Finally we ll take a shot at real data! The data set in question is baryonic Tully-Fisher data from http://astroweb.cwru.edu/sparc/btfr Lelli2016a.mrt,

More information

Statistical Data Analysis Stat 3: p-values, parameter estimation

Statistical Data Analysis Stat 3: p-values, parameter estimation Statistical Data Analysis Stat 3: p-values, parameter estimation London Postgraduate Lectures on Particle Physics; University of London MSci course PH4515 Glen Cowan Physics Department Royal Holloway,

More information

Confidence Intervals Lecture 2

Confidence Intervals Lecture 2 Confidence Intervals Lecture 2 First ICFA Instrumentation School/Workshop At Morelia,, Mexico, November 18-29, 2002 Harrison B. rosper Florida State University Recap of Lecture 1 To interpret a confidence

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Statistics and Data Analysis

Statistics and Data Analysis Statistics and Data Analysis The Crash Course Physics 226, Fall 2013 "There are three kinds of lies: lies, damned lies, and statistics. Mark Twain, allegedly after Benjamin Disraeli Statistics and Data

More information

Why is the field of statistics still an active one?

Why is the field of statistics still an active one? Why is the field of statistics still an active one? It s obvious that one needs statistics: to describe experimental data in a compact way, to compare datasets, to ask whether data are consistent with

More information

Testing Simple Hypotheses R.L. Wolpert Institute of Statistics and Decision Sciences Duke University, Box Durham, NC 27708, USA

Testing Simple Hypotheses R.L. Wolpert Institute of Statistics and Decision Sciences Duke University, Box Durham, NC 27708, USA Testing Simple Hypotheses R.L. Wolpert Institute of Statistics and Decision Sciences Duke University, Box 90251 Durham, NC 27708, USA Summary: Pre-experimental Frequentist error probabilities do not summarize

More information

Delayed Choice Paradox

Delayed Choice Paradox Chapter 20 Delayed Choice Paradox 20.1 Statement of the Paradox Consider the Mach-Zehnder interferometer shown in Fig. 20.1. The second beam splitter can either be at its regular position B in where the

More information

Introduction to Applied Bayesian Modeling. ICPSR Day 4

Introduction to Applied Bayesian Modeling. ICPSR Day 4 Introduction to Applied Bayesian Modeling ICPSR Day 4 Simple Priors Remember Bayes Law: Where P(A) is the prior probability of A Simple prior Recall the test for disease example where we specified the

More information

Computing Likelihood Functions for High-Energy Physics Experiments when Distributions are Defined by Simulators with Nuisance Parameters

Computing Likelihood Functions for High-Energy Physics Experiments when Distributions are Defined by Simulators with Nuisance Parameters Computing Likelihood Functions for High-Energy Physics Experiments when Distributions are Defined by Simulators with Nuisance Parameters Radford M. Neal Dept. of Statistics, University of Toronto Abstract

More information

Bayesian Estimation An Informal Introduction

Bayesian Estimation An Informal Introduction Mary Parker, Bayesian Estimation An Informal Introduction page 1 of 8 Bayesian Estimation An Informal Introduction Example: I take a coin out of my pocket and I want to estimate the probability of heads

More information

Divergence Based priors for the problem of hypothesis testing

Divergence Based priors for the problem of hypothesis testing Divergence Based priors for the problem of hypothesis testing gonzalo garcía-donato and susie Bayarri May 22, 2009 gonzalo garcía-donato and susie Bayarri () DB priors May 22, 2009 1 / 46 Jeffreys and

More information

2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling

2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling 2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling Jon Wakefield Departments of Statistics and Biostatistics University of Washington Outline Introduction and Motivating

More information

UNIVERSITY OF NOTTINGHAM. Discussion Papers in Economics CONSISTENT FIRM CHOICE AND THE THEORY OF SUPPLY

UNIVERSITY OF NOTTINGHAM. Discussion Papers in Economics CONSISTENT FIRM CHOICE AND THE THEORY OF SUPPLY UNIVERSITY OF NOTTINGHAM Discussion Papers in Economics Discussion Paper No. 0/06 CONSISTENT FIRM CHOICE AND THE THEORY OF SUPPLY by Indraneel Dasgupta July 00 DP 0/06 ISSN 1360-438 UNIVERSITY OF NOTTINGHAM

More information

Incompatibility Paradoxes

Incompatibility Paradoxes Chapter 22 Incompatibility Paradoxes 22.1 Simultaneous Values There is never any difficulty in supposing that a classical mechanical system possesses, at a particular instant of time, precise values of

More information

Lecture Notes 1: Decisions and Data. In these notes, I describe some basic ideas in decision theory. theory is constructed from

Lecture Notes 1: Decisions and Data. In these notes, I describe some basic ideas in decision theory. theory is constructed from Topics in Data Analysis Steven N. Durlauf University of Wisconsin Lecture Notes : Decisions and Data In these notes, I describe some basic ideas in decision theory. theory is constructed from The Data:

More information

2) There should be uncertainty as to which outcome will occur before the procedure takes place.

2) There should be uncertainty as to which outcome will occur before the procedure takes place. robability Numbers For many statisticians the concept of the probability that an event occurs is ultimately rooted in the interpretation of an event as an outcome of an experiment, others would interpret

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

The Laplace Rule of Succession Under A General Prior

The Laplace Rule of Succession Under A General Prior 1 The Laplace Rule of Succession Under A General Prior Kalyan Raman University of Michigan in Flint School of Management Flint, MI 48502 May 2000 ------------------------------------------------------------------------------------------------

More information

Statistical Methods in Particle Physics Lecture 2: Limits and Discovery

Statistical Methods in Particle Physics Lecture 2: Limits and Discovery Statistical Methods in Particle Physics Lecture 2: Limits and Discovery SUSSP65 St Andrews 16 29 August 2009 Glen Cowan Physics Department Royal Holloway, University of London g.cowan@rhul.ac.uk www.pp.rhul.ac.uk/~cowan

More information

Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses

Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses Steven Bergner, Chris Demwell Lecture notes for Cmpt 882 Machine Learning February 19, 2004 Abstract In these notes, a

More information

One-parameter models

One-parameter models One-parameter models Patrick Breheny January 22 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/17 Introduction Binomial data is not the only example in which Bayesian solutions can be worked

More information

Mathematical Statistics

Mathematical Statistics Mathematical Statistics MAS 713 Chapter 8 Previous lecture: 1 Bayesian Inference 2 Decision theory 3 Bayesian Vs. Frequentist 4 Loss functions 5 Conjugate priors Any questions? Mathematical Statistics

More information

Slope Fields: Graphing Solutions Without the Solutions

Slope Fields: Graphing Solutions Without the Solutions 8 Slope Fields: Graphing Solutions Without the Solutions Up to now, our efforts have been directed mainly towards finding formulas or equations describing solutions to given differential equations. Then,

More information

Just Enough Likelihood

Just Enough Likelihood Just Enough Likelihood Alan R. Rogers September 2, 2013 1. Introduction Statisticians have developed several methods for comparing hypotheses and for estimating parameters from data. Of these, the method

More information

A noninformative Bayesian approach to domain estimation

A noninformative Bayesian approach to domain estimation A noninformative Bayesian approach to domain estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu August 2002 Revised July 2003 To appear in Journal

More information

1 Measurement Uncertainties

1 Measurement Uncertainties 1 Measurement Uncertainties (Adapted stolen, really from work by Amin Jaziri) 1.1 Introduction No measurement can be perfectly certain. No measuring device is infinitely sensitive or infinitely precise.

More information

Probability, Entropy, and Inference / More About Inference

Probability, Entropy, and Inference / More About Inference Probability, Entropy, and Inference / More About Inference Mário S. Alvim (msalvim@dcc.ufmg.br) Information Theory DCC-UFMG (2018/02) Mário S. Alvim (msalvim@dcc.ufmg.br) Probability, Entropy, and Inference

More information

I. Induction, Probability and Confirmation: Introduction

I. Induction, Probability and Confirmation: Introduction I. Induction, Probability and Confirmation: Introduction 1. Basic Definitions and Distinctions Singular statements vs. universal statements Observational terms vs. theoretical terms Observational statement

More information

Invariant HPD credible sets and MAP estimators

Invariant HPD credible sets and MAP estimators Bayesian Analysis (007), Number 4, pp. 681 69 Invariant HPD credible sets and MAP estimators Pierre Druilhet and Jean-Michel Marin Abstract. MAP estimators and HPD credible sets are often criticized in

More information