Journeys of an Accidental Statistician

Size: px

Start display at page:

Download "Journeys of an Accidental Statistician"

Lawrence McCormick
5 years ago
Views:

1 Journeys of an Accidental Statistician A partially anecdotal account of A Unified Approach to the Classical Statistical Analysis of Small Signals, GJF and Robert D. Cousins, Phys. Rev. D 57, 3873 (1998) and further unpublished work. Gary Feldman 1 Journeys

2 A Simple Problem Suppose you are searching for a rare process and have a well-known expected background of 3 events, and you observe 0 events. What 90% confidence limit can you set on the unknown rate for this rare process? A classical (or frequentist) statistician makes a statement about the probability of data given theory. That is, given a hypothesis for the value of an unknown true value µ, he or she will give you the probability of obtaining a set of data x, P(x µ). A classical confidence interval (Jerzy Neyman, 1937) is a statement of the form: The unknown true value of µ lies in the region [µ 1,µ 2 ]. If this statement is made at the 90% confidence level, then it will be true 90% of the time, and false 10% of the time. Poisson statistics P(x = 0 µ = 2.3) = 0.1. Therefore, in the standard classical approach, µ < 2.3 at 90% C.L. Since µ = s + b, and b = 3.0, s < -0.7 at 90% C.L. Thus, we are led to a statement that we know is a priori false. Gary Feldman 2 Journeys

3 Bayesian Statistics A Bayesian takes the opposite position from a classical statistician. He or she calculates the probability of theory given data. That is, given a set of data x, he or she will calculate the probability that the unknown true value is µ, P(µ x). This appears attractive because it is what you really want to know. However, it comes at a price: P(x µ) and P(µ x) are related by Bayes Theorem, which in set theory is the statement that an element is in both A and B is P(A B) P(B) = P(B A) P(A) which for probabilities becomes P(µ x) = P(x µ) P(µ)/P(x). P(x) is just a normalization term, but Bayes Theorem transforms P(µ), the prior distribution of degree of belief in µ, to P(µ x), the posterior distribution. A credible interval or Bayesian confidence interval is µ 2 formed by P(µ x) dµ = 90%. µ 1 Gary Feldman 3 Journeys

4 An Example Suppose you have a large number of marbles, which are either white or black, and you wish information on the fraction that are white, µ. You draw a single marble, and it is white. What can you say at 90% confidence? Classical: µ 0.1 Prior Bayesian: flat µ /µ µ 0.1 1/(1-µ) unnormalizable µ µ (1-µ) µ Notice that most of the Bayesian priors do not cover, i.e., they are not true statements the stated fraction of the time (90% in this case). There is no requirement that credible intervals cover. However, Bob Cousins warns [Am. J. Phys. 63, 398 (1995)], if a Bayesian method is known to yield intervals with frequentist coverage appreciably less than the stated C.L. for some possible value of the unknown parameters, then it seems to have no chance of gaining consensus acceptance in particle physics. Gary Feldman 4 Journeys

5 The Role of Bayesian Statistics Prosper [Phys. Rev. D 37, 1153 (1988)] argues for a 1/µ prior based on a scaling argument. I found it unsatisfactory for two reasons: (1) It fails for x = 0. (Unnormalizable) (2) In general, it undercovers. I decided to continue my search for a solution to the simple problem in classical statistics. To quote Bob s prose from our paper: In our view, the attempt to find a non-informative prior within Bayesian inference is misguided. The real power of Bayesian inference lies in its ability to incorporate informative prior information, not ignorance. Prosper wrote that he was using a Bayesian approach because we are merely acknowledging the fact that a coherent solution to the small-signal problem is more easily achieved within a Bayesian framework than one which uses the methods of classical statistics. By the end of this talk, I hope to convince you that this is no longer true. Gary Feldman 5 Journeys

6 Construction of Confidence Intervals Neyman s prescription: Before doing an experiment, for each possible value of theory parameters determine a region of data that occurs C.L. of the time, say 90%. After doing the experiment, find all of values of the theory parameters for which your data is in their 90% region. This is the confidence interval. Notice that there is complete freedom of choice of which 90% to choose. This will be the key to our solution. Gary Feldman 6 Journeys

7 Examples of Poisson Confidence Belts 90% C.L. upper limits for Poisson µ with background = 3 90% C.L. central limits for Poisson µ with background = 3 Gary Feldman 7 Journeys

8 The Solution For both the upper limit and central limit, x = 0 excludes the whole plane. But consider the problem from the point of view of the data. If one measures no events, then clearly the most likely value of µ is zero. Why should one rule out the most likely scenario? Therefore, we propose a new ordering principle based on the ratio of a given µ to the most likely µ, µ ˆ : P(x µ) R = P(x µ ˆ ) Example for µ = 0.5 and b = 3: x P(x µ) ˆ µ P(x ˆ µ ) R rank U.L. C.L Gary Feldman 8 Journeys

9 The Poisson Limits 90% C.L. unified limits for Poisson µ with background = 3 Excerpt from Table IV of the paper (90% C.L.): x/b Gary Feldman 9 Journeys

10 Examples of Gaussian Confidence Belts 90% C.L. upper limits for Gaussian µ 0 vs. x (total background) in σ 90 % C.L. central limits for Gaussian µ 0 Gary Feldman 10 Journeys

11 Flip-flopping How might a typical physicist use these plots? (1) If the result x < 3σ, I will quote an upper limit. (2) If the result x > 3σ, I will quote central confidence interval. (3) If the result x < 0, I will pretend I measured zero. This results in the following: 10% 5% In the range 1.36 µ 4.28, there is only 85% coverage! Due to flip-flopping (deciding whether to use an upper limit or a central confidence region based on the data) these are not valid confidence intervals. Gary Feldman 11 Journeys

12 Unified Solution to the Gaussian Case Notes: (1) This approaches the central limits for x >>1. (2) The upper limit for x = 0 is 1.64, the two-sided rather than the one-sided limit. (3) From the defining 1937 paper of Neyman, this is the only valid confidence belt, since there are 4 requirements for a valid belt: (a) It must cover. (b) For every x, there must be at least one µ. (c) No holes (only valid for single µ). (d) Every limit must include its end points. Gary Feldman 12 Journeys

13 Including Errors on the Background Our paper assumed perfect knowledge of the backgrounds. However, often backgrounds are estimated from an ancillary experiment (sidebands, target-out run, etc.) How does the error on the ancillary experiment affect the limits? A Bayesian can solve this problem trivially by integrating over the background distribution. However, this is not permitted for frequentist. He or she must provide coverage for all possible values of the unknown true background rate. [The unknown true background rate is technically known as a nuisance parameter, because one must provide coverage for it, but you really do not care what its value is.] The exact efficient solution to this problem is computationally taxing, and has never been published. This is a second reason why physicists have been driven to Bayesian statistics. However, there is a simple solution to this problem within the unified approach, and Bob and I have agreed that we will publish it this summer. Gary Feldman 13 Journeys

14 Trick for Including Errors on Backgrounds If one provides coverage for the worst case of the nuisance parameter, then perhaps one will have provided coverage for all possible values of the nuisance parameter. Let µ = unknown true value of the signal parameter β = unknown true value of the background parameter x = measurement of µ + β b = measurement of β in the ancillary experiment The rank R becomes R ( µ, x,b) = P(x µ + ˆ β )P(b ˆ β ) P(x µ ˆ + β ˆ )P(b ˆ β ), where µ ˆ and β ˆ maximize the denominator (as usual), and ˆ β maximizes the numerator. Since R is only a function of µ and the data, one proceeds to construct the confidence belt as before. Gary Feldman 14 Journeys

15 Examples and Test of the Trick Let x be a Poisson measurement of µ + β and b be a Poisson measurement of β/r in an ancillary experiment (i.e., r = signal region/control region), r / n, rb 0, 3 3, 3 6, 3 9, A test of coverage for r = 1.0, 0 µ 20, 0 β 15 at 10,000 randomly picked values of µ and β: Median coverage: 90.25% Fraction that undercovered: 4.10% Median coverage for those that 89.96% undercovered: Worst undercoverage: 89.46% Comment from Harvard statisticians: We have never seen a statistical approximation work this well. Gary Feldman 15 Journeys

16 Back to Neutrinos As I had hoped, once we had the solution to the simple problem, the solution to the neutrino problem was obvious, unique, and natural. (My postdoc, Fred Weber, was doing almost the correct thing instinctively.) Probability of oscillation: P = sin 2 (2θ) sin 2 (1.27 m 2 L/E) Since χ 2 = 2ln(P), the ordering principle becomes a χ 2 = [χ 2 (data point in plane) - χ 2 (data best fit point in plane)]. The prescription: To calculate the acceptance region for each point in the plane, do a Monte Carlo simulation and determine what critical value of χ 2 will contain 90% of the data. Gary Feldman 16 Journeys

17 Combining Experiments: The Higgs Mass An example that may be of interest to LEP experimenters is how to use these techniques to combine different experimental measurements of the Higgs Mass. Let µ be the unknown true value of the Higgs mass, which is presumably a known function of the counting rate, ρ i, where i is an index denoting the experiment. In each experiment i, let β i be the unknown true rate of background. Let b i be a measure of the background in an ancillary experiment, say the sidebands. Finally, let n i be the measured rate of ρ i + β i. The ordering principle becomes: ˆ ˆ P(n i µ, β i )P(b i β i ) i R(µ,n i,b i ) = P(n i µ ˆ, β ˆ i )P(b i β ˆ, i ) i where µ ˆ and β ˆ i maximize the denominator and ˆ β i maximize the numerator. Gary Feldman 17 Journeys

18 The Higgs Mass Example (continued) The maximization of the numerator can be done analytically; in general, the maximization of the denominator requires a simple numerical solution. For each value of µ, the critical value of R c (µ) n i, b i R(µ,n i, b i )<R c ( µ) P(n i µ, ˆ β i )P(b i ˆ β i ) = 0.9 is determined. If the explicit sum is computationally impractical, a Monte Carlo calculation can be substituted. If the measured values of n i and b i are N i and B i, then the 90% confidence interval for µ is given by R(µ,N i, B i ) R c (µ) Gary Feldman 18 Journeys

19 The Higgs Mass Example (continued) Using the following information from CERN-EP/98-046, Lower bound for the Standard Model Higgs boson mass from combining the results of the four LEP experiments Experiment Expected Background Seen Aleph Delphi Opal Total (Not enough information given to include L3.) and the figure giving expected signal vs. Higgs mass, assuming backgrounds are perfectly known, I obtained m H < 75 GeV at 95% C.L., but a sensitivity to the Higgs Mass of only 71 GeV. (5% probability of observing 4 events when you expect 9.15.) (CERN-EP/ numbers: 77.5 GeV and 75.5 GeV) Gary Feldman 19 Journeys

20 Brand X Comparisons: Raster Scan We built a toy model to compare this technique with three other classical techniques that have, or could have, been used in previous experiments. Raster scan: For each m 2, scan along sin 2 (2θ) to find the minimum χ 2 (including the unphysical region), and take the 90% confidence interval to have χ 2 2 χ min (the two-sided limit). This technique (and the other two to be discussed) will have all the problems of the possibility of excluding all physically allowed values, which we originally set out to solve. However, it does cover exactly, but it is not powerful (i.e., it does not do an efficient job of rejecting false hypotheses). Gary Feldman 20 Journeys

21 Brand X Comparisons: Flip-flop Raster Scan Flip-flop raster scan: Same as raster scan, but use a onesided limit (χ 2 2 χ min ), unless the signal is > 3σ. This has all the features of the raster scan, but it significantly undercovers (85%) in a substantial region: Gary Feldman 21 Journeys

22 Brand X Comparisons: Global Scan Global Scan: Find the global minimum and take (χ 2 2 χ min , the two-sided 90% C.L. for χ 2 with 2 degrees of freedom). This technique is powerful, but has significant regions of both undercoverage (76%) and overcoverage (94%). Possible reasons: (1) Sinusoidal nature of the oscillation function. (2) One-dimensional regions. Gary Feldman 22 Journeys

23 Brand X Comparisons: Scorecard Technique Always gives useful information Gives proper coverage Is powerful Raster scan X X Flip-flop R.S. X X X Global scan X X This technique Gary Feldman 23 Journeys

24 Sensitivity The main objection to this work has been that an experiment that observes fewer events than the expected background may report a lower upper limit than a (better designed?) experiment that has no background. To address this problem and to provide additional information for the reader s assessment of the significance of the results, we suggest that experiments that have fewer counts than expected background also report their sensitivity, which we define as the average upper limit that would be obtained by an ensemble of experiments with the expected background and no true signal. For example, in the initial NOMAD publication of 1995 data, we reported an upper limit on high-mass ν µ ν τ oscillations of sin 2 (2θ) < and a sensitivity of A few days ago, at the Neutrino 98 conference, we reported preliminary results from a partial analysis of data yielding an upper limit of and a sensitivity of Gary Feldman 24 Journeys

25 Visit to Harvard Statistics Department Towards the end of this work, I decided to try it out on some professional statisticians whom I know at Harvard. They told me that this was the standard method of constructing a confidence interval! I asked them if they could point to a single reference of anyone using this method before, and they could not. They explained that in statistical theory there is a one-toone correspondence between a hypothesis test and a confidence interval. (The confidence interval is a hypothesis test for each value in the interval.) The Neyman-Pearson Theorem states that the likelihood ratio gives the most powerful hypothesis test. Therefore, it must be the standard method of constructing a confidence interval. I decided to start reading about hypothesis testing Gary Feldman 25 Journeys

26 Kendall and Stuart (1961) At the start of chapter 24 of Kendall and Stuart s The Advanced Theory of Statistics (chapter 23 of Stuart and Ord), I found 1 1/4 cryptic pages that contain all of the information, published and unpublished, in this lecture. (See text) Gary Feldman 26 Journeys

27 1998 Particle Data Group Review The 1998 PDG Review of Particle Physics [European Physics Journal C3 (1998) 1] has supported this approach strongly: The Bayesian methodology, while well adapted to decision-making situations, is not in general appropriate for the objective presentation of experimental data. In order to correct this problem [undercoverage due to flip-flopping] it is necessary to adopt a unified procedure. The appropriate unified procedure and ordering principle are given in Ref. 11 [Feldman- Cousins]. If it is desired to report an upper limit in an objective way such that it has a well-defined statistical meaning in terms of limiting frequency, then report the Frequentist confidence bound(s) as given by the unified Feldman- Cousins approach. we suggest that a measure of sensitivity should be reported whenever expected background is larger or comparable to the number of observed counts. The best such measure we know of is that proposed and tabulated in Ref. 11. Gary Feldman 27 Journeys

28 Conclusion I have yet to find a problem in the construction of classical confidence intervals and regions that is not solvable by the ordering principle suggested here. For example, we have been using it with success in the NOMAD experiment to combine the results for different τ decay modes, including errors on the background determination. Post script for LEP audiences: The LEP Higgs paper (CERN-EP/98-046) states: There is no unique, exact, method to deal with this problem. The method presented here is exact by construction. I believe that this method, due to its simplicity and reliance on classical methods, the classical construction of the confidence interval and the use of the likelihood ratio, deserves to be called unique, if any method does. Gary Feldman 28 Journeys

Statistics of Small Signals

Statistics of Small Signals Gary Feldman Harvard University NEPPSR August 17, 2005 Statistics of Small Signals In 1998, Bob Cousins and I were working on the NOMAD neutrino oscillation experiment and we