Statistics for Particle Physics
Kyle Cranmer, New York University
Remaining Lectures

Lecture 3:
- Compound hypotheses, nuisance parameters, & similar tests
- The Neyman construction (illustrated)
- Inverted hypothesis tests: a dictionary for limits (intervals)
- Coverage as a calibration for our statistical device
- The likelihood principle, and the logic for likelihood-based methods

Lecture 4:
- Systematics, Systematics, Systematics
- Generalizing our procedures to include systematics
- Eliminating nuisance parameters: profiling and marginalization
- Introduction to ancillary statistics & conditioning
- High-dimensional models, Markov Chain Monte Carlo, and hierarchical Bayes
- The look-elsewhere effect and false discovery rate
Lecture 3
Addition to the References: Flip-flopping

Talking with Fred James over lunch, he mentioned Gary Feldman's lectures, "Journeys of an Accidental Statistician":
http://www.hepl.harvard.edu/~feldman/journeys.pdf

"How might a typical physicist use these plots?
(1) If the result is x < 3σ, I will quote an upper limit.
(2) If the result is x > 3σ, I will quote a central confidence interval.
(3) If the result is x < 0, I will pretend I measured zero.
This results in the following: in the range 1.36 ≤ µ ≤ 4.28, there is only 85% coverage! Due to flip-flopping (deciding whether to use an upper limit or a central confidence region based on the data) these are not valid confidence intervals."
[Gary Feldman, "Journeys", slide 11]
Simple vs. Compound Hypotheses

The Neyman-Pearson lemma is the answer for simple hypothesis testing:
- a hypothesis is simple if it has no free parameters and is totally fixed: f(x|H0) vs. f(x|H1)

What about cases where there are free parameters?
- e.g. the mass of the Higgs boson: f(x|H0) vs. f(x|H1, mH)

A test is called *similar* if it has size α for all values of the parameters.
A test is called *Uniformly Most Powerful* if it maximizes the power for all values of the parameter.
Uniformly Most Powerful tests don't exist in general.
Similar Test Examples

In some cases Uniformly Most Powerful tests do exist. Some examples just to clarify the concept:
- H0 is simple: a Gaussian with fixed µ = µ0, σ = σ0
- H1 is composite: a Gaussian with µ < µ0, σ = σ0
- consider H− and H−−: same size, different power, but both have max power
[panels: H− vs. H0 and H−− vs. H0]
Similar Test Examples

In some cases Uniformly Most Powerful tests do exist. Some examples just to clarify the concept:
- H0 is simple: a Gaussian with fixed µ = µ0, σ = σ0
- H1 is composite: a Gaussian with µ > µ0, σ = σ0
- consider H+ and H++: same size, different power, but both have max power
[panels: H0 vs. H+ and H0 vs. H++]
Similar Test Examples

With a slight variation, a Uniformly Most Powerful test doesn't exist:
- H0 is simple: a Gaussian with fixed µ = µ0, σ = σ0
- H1 is composite: a Gaussian with µ = µ0, σ ≠ σ0
- either H+ has good power and H− has bad power, or vice versa
[panels: H−, H0, H+ for the two cases]
Similar Test Examples

Another example that is Uniformly Most Powerful:
- H0 is simple: a Gaussian with fixed µ = µ0, σ = σ0
- H1 is composite: a Gaussian with µ = µ0, σ > σ0
- consider H+ and H++: same size, different power, but both have max power
[panels: H0 vs. H+ and H0 vs. H++]
Composite Hypotheses & the Likelihood Function

When a hypothesis is composite, typically there is a pdf that can be parametrized: f(x|θ)
- for a fixed θ, it defines a pdf for the random variable x
- for a given measurement of x, one can consider f(x|θ) as a function of θ, called the likelihood function
- note, this is not Bayesian, because it still only uses P(data|theory), and the likelihood function is not a pdf!

Sometimes θ has many components, generally divided into:
- parameters of interest: e.g. masses, cross-sections, etc.
- nuisance parameters: e.g. parameters that affect the shape but are not of direct interest (e.g. energy scale)
A simple example: a Poisson distribution describes a discrete event count n for a real-valued mean µ:

  Pois(n|µ) = µ^n e^(−µ) / n!

The likelihood of µ given n is the same equation evaluated as a function of µ:

  L(µ) = Pois(n|µ)

- now it's a continuous function
- but it is not a pdf!

It is common to plot −2 log L. Why? More later.
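The slide's point can be made concrete in a few lines. This is a sketch (the count n_obs = 7 is made up for illustration): it evaluates Pois(n|µ) as a function of µ for fixed n, and confirms that −2 log L is minimized at µ = n.

```python
import math

def pois_pmf(n, mu):
    """Poisson probability Pois(n|mu) = mu^n e^(-mu) / n!"""
    return mu**n * math.exp(-mu) / math.factorial(n)

def nll(mu, n):
    """-2 log L(mu) for an observed count n (the quantity usually plotted)."""
    return -2.0 * math.log(pois_pmf(n, mu))

n_obs = 7  # hypothetical observed count
# Scan mu on a grid: the likelihood is maximized (nll minimized) at mu = n_obs
mus = [n_obs + d / 10.0 for d in range(-30, 31)]
mu_hat = min(mus, key=lambda m: nll(m, n_obs))
```

Note that L(µ) is continuous in µ even though the data n are discrete, which is exactly the distinction the slide draws.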
Confidence Intervals

What is a confidence interval? You see them all the time:
[figure: LEP1 and SLD / LEP2 and Tevatron (prel.) 68% CL contours in the (mt, mW) plane, with curves for mH = 114, 300, 1000 GeV]

One would like to say there is a 68% chance that the true value of (mW, mt) is in this interval
- but that's P(theory|data)!
- the correct frequentist statement is that the interval covers the true value 68% of the time
- remember, the contour is a function of the data, which is random; so it moves around from experiment to experiment

A Bayesian credible interval does mean the probability that the parameter is in the interval. The procedure is very intuitive:

  P(θ ∈ V) = ∫_V dθ π(θ|x) = ∫_V dθ f(x|θ) π(θ) / ∫ dθ f(x|θ) π(θ)
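The Bayesian recipe above can be sketched numerically for the Poisson example used elsewhere in the lecture: posterior ∝ likelihood × prior on a grid, here with a flat prior and a made-up count n = 7, followed by a central credible interval obtained by integrating the tails.

```python
import math

def pois(n, mu):
    return mu**n * math.exp(-mu) / math.factorial(n)

# Posterior for mu given n with a flat prior, on a grid
# (hypothetical numbers: n_obs = 7 observed)
n_obs = 7
dmu = 0.01
mus = [i * dmu for i in range(1, 3000)]
post = [pois(n_obs, m) for m in mus]
norm = sum(post)
post = [p / norm for p in post]

def central_interval(mus, post, cl=0.68):
    """Central credible interval: integrate (1-cl)/2 probability in from each side."""
    tail = (1.0 - cl) / 2.0
    c = 0.0
    for i, p in enumerate(post):
        c += p
        if c >= tail:
            lo = mus[i]
            break
    c = 0.0
    for i in range(len(post) - 1, -1, -1):
        c += post[i]
        if c >= tail:
            hi = mus[i]
            break
    return lo, hi

lo, hi = central_interval(mus, post)
```

Unlike the frequentist interval, the statement "P(µ ∈ [lo, hi]) = 68%" is meant literally here, but it depends on the choice of prior π(θ).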
Inverting Hypothesis Tests

There is a precise dictionary that explains how to move from hypothesis testing to parameter estimation:
- Type I error: probability the interval does not cover the true value of the parameters (note, it is now a function of the parameters)
- Power: probability the interval does not cover a false value of the parameters (also a function of the parameters)
- we don't know the true value, so consider each point θ0 as if it were true

What about null and alternate hypotheses?
- when testing a point θ0, it is considered the null
- all other points are considered the alternate

So what about the Neyman-Pearson lemma & likelihood ratio?
- as mentioned earlier, there are no guarantees like before
- a common generalization that has good power replaces f(x|H0)/f(x|H1) with

  f(x|θ0) / f(x|θ_best(x))
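The generalized likelihood ratio can be sketched for the counting model Pois(n|µs+b) that appears later in this lecture. The numbers s and b below are hypothetical, and the restriction µ ≥ 0 follows the Feldman-Cousins convention of maximizing only over physical values.

```python
import math

def pois(n, mu):
    return mu**n * math.exp(-mu) / math.factorial(n)

def lam(n, mu0, s, b):
    """Generalized likelihood ratio f(n|mu0) / f(n|mu_best(n)) for the
    counting model Pois(n | mu*s + b).  mu_best is the maximum-likelihood
    estimate, restricted to mu >= 0."""
    mu_best = max(0.0, (n - b) / s)
    return pois(n, mu0 * s + b) / pois(n, mu_best * s + b)
```

For n at or below the background expectation, µ_best = 0, so the ratio for µ0 = 0 is exactly 1 (fully consistent); larger n increasingly disfavors µ0 = 0. The ratio is always between 0 and 1 because the denominator maximizes over a set containing µ0.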
The Dictionary

From Kendall:
[table from Kendall, mapping hypothesis-test concepts to confidence-interval concepts]
Extending a Model

We can extend our simple number-counting example

  P(n|H0) = Pois(n|b)        P(n|H1) = Pois(n|s+b)

[figure: Poisson probability distributions for b and for s+b vs. number of events, with the observed count marked]

into parameter estimation in a more general problem:

  P(n|µ) = Pois(n|µs + b)
  H0: µ = 0    vs.    H1: µ ≠ 0   (or H1: µ = 1)

Discovery corresponds to the 5σ confidence interval for µ not including µ = 0.
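A discovery test in this model can be sketched numerically. The p-value for H0: µ = 0 is the Poisson tail probability of seeing a count at least as large as observed; the numbers below (b = 100 expected, 150 observed) are made up for illustration.

```python
import math
from statistics import NormalDist

def pois(n, mu):
    return mu**n * math.exp(-mu) / math.factorial(n)

def p_value(n_obs, b):
    """p-value for H0: mu = 0, i.e. P(n >= n_obs | Pois(b))."""
    return 1.0 - sum(pois(k, b) for k in range(n_obs))

# Hypothetical numbers: b = 100 expected background events, 150 observed
p = p_value(150, 100.0)
z = NormalDist().inv_cdf(1.0 - p)   # equivalent Gaussian significance in sigma
```

Note the Poisson tail is heavier than a naive Gaussian (150 − 100)/√100 = 5σ estimate would suggest, so the significance comes out somewhat below 5σ here.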
Now let's take on Feldman-Cousins.
[figure: Feldman-Cousins confidence belt in the (x, µ) plane]
Neyman Construction Example

For each value of θ, consider f(x|θ).
[figure: family of pdfs f(x|θ) stacked at θ0, θ1, θ2, ...]
Neyman Construction Example

Let's focus on a particular point, f(x|θ0):
- we want a test of size α
- equivalent to a 100(1−α)% confidence interval on θ
- so we find an acceptance region with 1−α probability
[figure: f(x|θ0) with an acceptance region containing probability 1−α]
Neyman Construction Example

Let's focus on a particular point, f(x|θ0):
- there is no unique choice of the acceptance region
- here's an example corresponding to a lower limit
[figure: f(x|θ0) with probability 1−α in the acceptance region and α in one tail]
Neyman Construction Example

Let's focus on a particular point, f(x|θ0):
- there is no unique choice of the acceptance region
- and here's an example of a central interval
[figure: f(x|θ0) with probability 1−α in the central region and α/2 in each tail]
Neyman Construction Example

Let's focus on a particular point, f(x|θ0):
- the choice of this region is called an "ordering rule"
- in the Feldman-Cousins approach, the ordering rule is the likelihood ratio: find the contour of the likelihood ratio that gives size α,

  f(x|θ0) / f(x|θ_best(x)) = kα

[figure: f(x|θ0) with the acceptance region of probability 1−α defined by the likelihood-ratio contour]
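The likelihood-ratio ordering rule can be sketched for the classic Feldman-Cousins example: a Gaussian measurement x ~ N(θ, 1) with the physical boundary θ ≥ 0. The grid range and bin count below are arbitrary numerical choices.

```python
import math

def gauss(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def fc_acceptance(theta, alpha=0.10, lo=-5.0, hi=10.0, nbins=3000):
    """Feldman-Cousins acceptance region for x ~ N(theta, 1), theta >= 0.
    Bins of x are ranked by R(x) = f(x|theta) / f(x|theta_best), with
    theta_best = max(0, x), and the highest-ranked bins are accepted
    until they hold 1 - alpha probability."""
    dx = (hi - lo) / nbins
    xs = [lo + (i + 0.5) * dx for i in range(nbins)]

    def rank(x):
        return gauss(x, theta) / gauss(x, max(0.0, x))

    accepted, prob = [], 0.0
    for x in sorted(xs, key=rank, reverse=True):
        accepted.append(x)
        prob += gauss(x, theta) * dx
        if prob >= 1.0 - alpha:
            break
    return min(accepted), max(accepted)

# far from the boundary, the F-C region approaches the central 90% interval
x1, x2 = fc_acceptance(theta=5.0)
```

Near θ = 0 the ordering rule smoothly trades the lower edge for more of the upper tail, which is exactly what makes the unified intervals flip-flop-free.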
Neyman Construction Example

Now make an acceptance region for every value of θ.
[figure: acceptance regions in x stacked for θ0, θ1, θ2, ...]
Neyman Construction Example

This makes a confidence belt for θ: the regions of data in the confidence belt can be considered as consistent with that value of θ.
[figure: confidence belt traced out by the acceptance regions at each θ]
Neyman Construction Example

Now we make a measurement x0. The points θ where the belt intersects x0 form the confidence interval in θ for this measurement, e.g. [θ−, θ+].
[figure: vertical line at x0 intersecting the belt between θ− and θ+]
Neyman Construction Example

Because the value of x0 is random, so is the confidence interval [θ−, θ+]. However, the interval has probability 1−α to cover the true value of θ.
[figure: belt with x0 and the resulting interval [θ−, θ+]]
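Inverting the belt can be sketched with a deliberately simple belt: for x ~ N(θ, 1) the central 68% acceptance region is [θ−1, θ+1], so the confidence interval is every θ whose acceptance region contains the observed x0.

```python
def acceptance(theta, z=1.0):
    """Central acceptance region for x ~ N(theta, 1): [theta - z, theta + z]."""
    return (theta - z, theta + z)

def invert(x0, thetas, z=1.0):
    """Confidence interval: all theta whose acceptance region contains x0."""
    inside = [t for t in thetas
              if acceptance(t, z)[0] <= x0 <= acceptance(t, z)[1]]
    return min(inside), max(inside)

thetas = [i / 100 for i in range(-500, 501)]   # grid of theta hypotheses
lo, hi = invert(1.3, thetas)                   # close to (0.3, 2.3) for this belt
```

For this symmetric belt the inversion reproduces the familiar x0 ± 1 interval, but the same scan works for any belt shape, including the Feldman-Cousins one.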
A Point about the Neyman Construction

This is not Bayesian: it doesn't mean the probability that the true value of θ is in the interval is 1−α!
[figure: belt with θtrue lying outside the interval [θ−, θ+] for this x0]
A Joke

Maybe it's funnier this time?

"Bayesians address the question everyone is interested in, by using assumptions no-one believes. Frequentists use impeccable logic to deal with an issue of no interest to anyone." — P. G. Hamer
From the archives
The Dictionary (Again)

Showing this again to reinforce the point that there is a formal one-to-one mapping between hypothesis tests and confidence intervals:
- some refer to the Neyman construction as an "inverted hypothesis test"
Feldman-Cousins (Again)
[figure: Feldman-Cousins confidence belt in the (x, µ) plane]
Flip-flopping

One of the features of Feldman-Cousins is that it provides a unified method for upper limits and ~central confidence intervals.

There is some debate about how important this flip-flopping problem is and how satisfactory the unified limits are, but flip-flopping is definitely important and the Feldman-Cousins approach avoids the problem.
- see PhyStat, the F-C paper, or Feldman's lectures for more
Coverage

Coverage is the probability that the interval covers the true value. Methods based on the Neyman construction always cover... by construction.
- sometimes they over-cover (i.e. they are "conservative")

Bayesian methods do not necessarily cover
- but that's not their goal
- it also means you shouldn't interpret a 95% Bayesian credible interval in the same way

Coverage can be thought of as a calibration of our statistical apparatus. [explain under-/over-coverage]
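Coverage as a calibration can be checked directly with pseudo-experiments. Here is a toy check that the ±1σ interval for a Gaussian measurement covers the truth ~68.3% of the time; the true value and number of toys are arbitrary choices.

```python
import random

def covers(theta_true, z=1.0):
    """One pseudo-experiment: draw x ~ N(theta_true, 1) and ask whether
    the interval [x - z, x + z] covers theta_true."""
    x = random.gauss(theta_true, 1.0)
    return x - z <= theta_true <= x + z

random.seed(12345)
n_toys = 50000
coverage = sum(covers(2.0) for _ in range(n_toys)) / n_toys
# for z = 1 the coverage should come out close to 68.3%
```

The same loop, run at many points in parameter space, is how one maps out under- or over-coverage of a real procedure.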
Discrete Problems

In discrete problems (e.g. a number-counting analysis with counts described by a Poisson), one sees:
- discontinuities in the coverage (as a function of the parameter)
- over-coverage (in some regions)
- this is important for experiments with few events

There is a lot of discussion about this; we won't focus on it here.
Coverage

Coverage can be different at each point in the parameter space.
[figure: coverage vs. (ε, µ), example from G. Punzi, PHYSTAT 05, Oxford, UK]
Another Point about the Construction

Note that the confidence belt is constructed before we have any data. In some sense, the inference is influenced by data that we didn't get.
[figure: confidence belt with x0 and the interval [θ−, θ+]]
The Likelihood Principle

"As noted above, in both Bayesian methods and likelihood-ratio based methods, the probability (density) for obtaining the data at hand is used (via the likelihood function), but probabilities for obtaining other data are not used! In contrast, in typical frequentist calculations (e.g., a p-value, which is the probability of obtaining a value as extreme or more extreme than that observed), one uses probabilities of data not seen.

This difference is captured by the Likelihood Principle*: If two experiments yield likelihood functions which are proportional, then your inferences from the two experiments should be identical.

The L.P. is built in to Bayesian inference (except, e.g., when a Jeffreys prior leads to violation). The L.P. is violated by p-values and confidence intervals. Although practical experience indicates that the L.P. may be too restrictive, it is useful to keep in mind. When frequentist results make no sense or are unphysical, in my experience the underlying reason can be traced to a bad violation of the L.P.

*There are various versions of the L.P., strong and weak forms, etc."
— Bob Cousins, CMS, 2008
Building the Distribution of the Test Statistic

In the case of the LEP Higgs search:

  Q = L(x|H1) / L(x|H0)
    = ∏_i^Nchan { Pois(n_i|s_i+b_i) ∏_j^{n_i} [s_i f_s(x_ij) + b_i f_b(x_ij)] / (s_i+b_i) } / { Pois(n_i|b_i) ∏_j^{n_i} f_b(x_ij) }

  q = −ln Q = s_tot − Σ_i^Nchan Σ_j^{n_i} ln( 1 + s_i f_s(x_ij) / (b_i f_b(x_ij)) )

[figures: LEP distributions of −2 ln Q at mH = 115 GeV/c², observed and expected for background and for signal plus background; and observed/expected −2 ln Q vs. mH]
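The single-channel version of q can be sketched directly from the formula. The pdfs f_s, f_b and the event values below are toy stand-ins, not LEP inputs.

```python
import math

def q_statistic(events, s, b, fs, fb):
    """q = -ln Q = s_tot - sum_j ln(1 + s*fs(x_j)/(b*fb(x_j))) for a single
    channel; fs and fb are the signal and background pdfs of the
    discriminating variable."""
    return s - sum(math.log(1.0 + s * fs(x) / (b * fb(x))) for x in events)

def fs(x):
    """Toy signal pdf on [0, 1], peaked at high x."""
    return 2.0 * x

def fb(x):
    """Toy background pdf on [0, 1], flat."""
    return 1.0

# signal-like events (high x) give a smaller, more signal-like q than
# background-like events (low x)
q_sig = q_statistic([0.9, 0.8, 0.95], s=3.0, b=10.0, fs=fs, fb=fb)
q_bkg = q_statistic([0.1, 0.2, 0.05], s=3.0, b=10.0, fs=fs, fb=fb)
```

With no events at all, q reduces to s_tot, the most background-like value the term structure allows; each signal-like event pulls q down.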
Building the Distribution of the Test Statistic

The LEP Higgs Working Group developed a formalism to combine channels and take advantage of discriminating variables in the likelihood ratio. For a single event, the test statistic is

  q(x) = ln( 1 + s f_s(x) / (b f_b(x)) )

Hu and Nielsen's CLFFT (physics/0312050) used a Fourier transform and an exponentiation trick to transform the log-likelihood-ratio distribution for one event into the distribution for an experiment. For N events, the distribution is an N-fold convolution, performed with the Fourier transform:

  ρ_N(q) = F⁻¹{ [F(ρ_1)]^N }

To include Poisson fluctuations on N for a given luminosity, one can exponentiate:

  ρ(q) = Σ_N P(N; Lσ) ρ_N(q) = F⁻¹{ exp( Lσ [F(ρ_1(q)) − 1] ) }

Cousins-Highland was used for the systematic error on the background rate. Getting this to work at the LHC is tricky numerically because we have channels with n_i ranging from ~10 to ~10000 events.
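The exponentiation trick itself is easy to sketch with an FFT. The single-event distribution below is a toy stand-in (a truncated exponential), not the LEP one, and the grid must be long enough that the summed distribution does not wrap around the circular convolution.

```python
import numpy as np

# Single-event distribution rho_1(q) on a uniform grid (toy shape)
N = 1 << 14                      # grid length; must comfortably exceed the
dq = 0.01                        # support of the summed distribution
q = np.arange(N) * dq
rho1 = np.where(q < 5.0, np.exp(-q), 0.0)
rho1 /= rho1.sum()

nu = 5.0                         # expected number of events, L*sigma
# Exponentiation trick: distribution of the sum of a Poisson(nu) number
# of i.i.d. single-event q values, all convolutions done at once:
#   rho(q) = F^{-1}{ exp[ nu * (F(rho_1) - 1) ] }
rho = np.fft.ifft(np.exp(nu * (np.fft.fft(rho1) - 1.0))).real
```

One FFT pair replaces the infinite Poisson-weighted sum of N-fold convolutions; the zero-event probability e^(−ν) appears automatically in the first bin, and the compound-Poisson mean ν·E[q₁] comes out of the same expression.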
Examples of Likelihood Analysis

In these examples, a model that relates precision electroweak observables to parameters of the Standard Model was used:
- the inference is based only on the likelihood function for the data at hand
- there is no prior, so it's not Bayesian; and there is no Neyman construction!
- what is the meaning of this contour if it's not the Neyman construction?
[figures: the "blue-band" Δχ² vs. mH plot with the mH limit of 144 GeV, and the 68% CL contour in the (mt, mW) plane]
Logic of Likelihood-based Methods

Likelihood-based methods settle between two conflicting desires:
- we want to obey the likelihood principle, because it implies a lot of nice things and sounds pretty attractive
- we want nice frequentist properties (and the only way we know to incorporate those properties by construction will violate the likelihood principle)

If we had a way to approximately get the distribution of our test statistic for every value of θ based only on the likelihood function f(x|θ) (and no prior), then we would have a workable solution!
[figure: family of pdfs f(x|θ) stacked at θ0, θ1, θ2, ...]

There is a way to get approximate frequentist results. It's the basis of MINUIT/MINOS. Next time!
Systematics, Systematics, Systematics