
theta - a framework for template-based modeling and inference

Thomas Müller, Jochen Ott, Jeannine Wagner-Kuhr
Institut für Experimentelle Kernphysik, Karlsruhe Institute of Technology (KIT), Germany

June 17, 2010

Statistical methods such as hypothesis tests and interval estimation for Poisson counts in multiple channels are frequently performed in high energy physics. We present an efficient and extensible software framework which uses a template-based model approach. It includes modules to calculate several likelihood-based quantities on a large number of pseudo experiments or on data. The generated values can be used to apply frequentist methods such as hypothesis testing and the Neyman construction for interval estimation, or modified frequentist methods such as the CLs method. It also includes an efficient Markov-Chain Monte-Carlo implementation for Bayesian inference to calculate Bayes factors or Bayesian credible intervals.

Contents

1 Introduction
2 Template-Based Modeling
  2.1 Example Model
3 Statistical Methods
  3.1 Discovery
    3.1.1 Direct Estimation of Z
    3.1.2 Z via Test Statistic Distribution
    3.1.3 Bayes Factor
  3.2 Measurement
    3.2.1 Profile Likelihood Method
    3.2.2 Neyman Construction
    3.2.3 Bayesian Inference: Posterior
  3.3 Exclusion
4 Including Systematic Uncertainties
  4.1 Rate Uncertainties
  4.2 Template Uncertainties
5 theta Framework
  5.1 Combination with external Likelihood Functions
  5.2 Markov Chains in theta
  5.3 Testing

1 Introduction

In High Energy Physics (HEP), the result of an analysis is often the outcome of a statistical procedure. Common cases are interval estimations and hypothesis tests based on models of Poisson counts in multiple channels. In all but the most simple cases, not only the mere number of events is used as input to the statistical method but the measured distribution of a certain observable. Here, we only consider the case where the expected and measured distributions are binned, i.e., given as histograms. The probability of observing n_i events in bin i is the Poisson probability with mean m_i, where the expected Poisson mean m_i is the sum of the expected Poisson means of the different contributing processes. Writing the expectation as a linear combination of templates is the basis of the (more general) model definition used in theta. It will be defined and discussed in more detail in Section 2.

Analytical solutions for the test statistics commonly used, such as the maximum likelihood estimate for a certain parameter or the value of a likelihood ratio, are in general not known for such a model. Therefore, a numerical treatment is necessary. In order to accurately generate the test statistic distribution for a certain model, large-scale Monte-Carlo production of pseudo data and efficient calculation of the test statistic are necessary. This is one main application of theta. Details on how theta can be used in various statistical methods are discussed in Section 3. Section 4 discusses how systematic uncertainties affecting the shape and normalization of templates can be included in theta. Section 5 gives an overview of the architecture and some implementation details of theta. An Appendix is included which contains example configuration files for theta.

theta is licensed under the GPL. The software package and further documentation can be obtained via http://theta-framework.org/.

2 Template-Based Modeling

A typical model predicts the Poisson mean in each bin i as a linear combination of different components

  m_i = p_1 t_{1,i} + p_2 t_{2,i} + p_3 t_{3,i}    (1)

where the p_j are real model parameters. For fixed j, t_{j,i} is a one-dimensional template for process j. (The term "template" is used here to mean both a probability density in an observable and the event count, binned in the observable; it can be thought of as a histogram normalized to the expectation.) The resulting template m contains the predicted Poisson mean for each bin.

While this covers some simple use cases, systematic uncertainties complicate the picture: one source of uncertainty might affect p_1 and p_2, but not p_3; another source of uncertainty might affect the shape of the templates t_i. Therefore, a more general model is considered here which allows the coefficients and templates in the linear combination on the right hand side of eq. (1) to depend arbitrarily on the model parameters.

The most general statistical model in theta predicts templates of a set of observables o_1, ..., o_N. Both the coefficients and the templates are functions of the real model parameters p. Thus, the model prediction for observable o_i is

  m_i(p) = ∑_{k=1}^{M_i} c_{i,k}(p) t_{i,k}(p)    (2)

where the c_{i,k} are real-valued coefficients and the t_{i,k} are the templates of the M_i individual processes contributing in this channel. It is assumed that the template bin contents t_{i,k} are all strictly positive.

Given model parameters p, the probability of observing a data template d is given by a Poisson probability for each bin of the template:

  p_m(d | p) = ∏_{i=1}^{N} ∏_{l=1}^{b_i} Poisson(d_{i,l} | m_{i,l}(p))    (3)

where i runs over all observables, b_i is the number of bins for observable o_i, d_{i,l} is the number of observed events of observable o_i in bin l, and Poisson(n | λ) = λ^n e^{−λ} / n! is the Poisson probability of observing n events, given mean λ. Fixing d and reading eq. (3) as a function of p yields the likelihood function used in many of the statistical methods discussed in Section 3:

  L(p | d) = p_m(d | p).    (4)

In general, an additional term D(p) is introduced in the likelihood function and eq. (4) becomes

  L(p | d) = p_m(d | p) D(p).    (5)

The origin and interpretation of the term D(p) might differ, depending on the application. A typical application is a two-step measurement with two different models. In the first ("sideband") measurement, some parameter (such as the background rate) is measured as p̂ ± ∆p̂. In the second measurement, the parameter of interest is estimated. As part of the likelihood function in the second model, a Gaussian term for p with mean p̂ and width ∆p̂ is included in D. In this application, D can be seen as the approximate likelihood of the first measurement.

Note that the templates t_{i,k} have been introduced as one-dimensional objects. However, all bins are statistically independent. Therefore, bins can be re-ordered arbitrarily without changing the outcome of any method. In particular, it is also possible to start with multi-dimensional templates and concatenate all their bins to obtain a one-dimensional template. Therefore, using one-dimensional templates does not restrict applications as much as it might seem at first.

2.1 Example Model

As example used in the following sections, consider the measurement of a signal using two channels: the first channel is a signal-free region used to measure the background (the "sideband" channel sb); the second channel contains signal (the "signal" channel s). The parameters of this model are the Poisson mean of the background, µ_b, and the Poisson mean of the signal, µ_s, both for the signal channel. It is assumed that the Poisson mean in the sideband channel is a fixed multiple of µ_b, where the factor is known precisely. The model prediction, written in the form of eq. (2), is

  m_sb(µ_s, µ_b) = µ_b t_{sb,1}    (6)
  m_s(µ_s, µ_b) = µ_b t_{s,1} + µ_s t_{s,2}    (7)

where the templates t_{i,j} are constant, i.e., they do not depend on any model parameter. The parameter of interest is µ_s; µ_b is a nuisance parameter.

In this example, the sideband template t_{sb,1} is taken to be a counting-only measurement, i.e., it is a one-bin template with τ as bin content, whereas the templates for the signal channel are a mass-like distribution with 100 bins on the range 0–500, where the background template t_{s,1} is exponentially falling. The signal template t_{s,2} is a normal distribution around 250 with width 50. Both templates in the signal channel are normalized such that the sum of their bin entries equals one. The templates in the signal region are depicted in Figure 1. The values used in the following sections will be µ_b = 20, µ_s = 10, and 10 as the normalization of the sideband background template t_{sb,1}.

[Figure 1: Templates in the signal region of the signal and background process over some mass-like observable. The Signal template is t_{s,2} in eq. (7); Background corresponds to t_{s,1}.]
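To make the model and likelihood definitions concrete, the following standalone C++ sketch builds templates in the spirit of the example model and evaluates the negative log-likelihood corresponding to eqs. (2)–(4). It is not theta code; the exponential decay constant and the use of the model prediction itself as "data" are illustrative choices.

#include <cmath>
#include <cstdio>
#include <vector>

// One channel: observed counts and the templates of eq. (2).
struct Channel {
    std::vector<double> data;                       // d_{i,l}
    std::vector<std::vector<double>> templates;     // t_{i,k}, one vector per process
};

// Negative log-likelihood of eqs. (3)-(4), up to the constant sum over ln(d_{i,l}!):
// -ln L = sum over all bins of (m - d * ln m).
double nll(const std::vector<Channel>& channels,
           const std::vector<std::vector<double>>& coeffs) {
    double result = 0.0;
    for (std::size_t i = 0; i < channels.size(); ++i) {
        const Channel& ch = channels[i];
        for (std::size_t l = 0; l < ch.data.size(); ++l) {
            double m = 0.0;                          // m_{i,l}(p) of eq. (2)
            for (std::size_t k = 0; k < ch.templates.size(); ++k)
                m += coeffs[i][k] * ch.templates[k][l];
            result += m - ch.data[l] * std::log(m);
        }
    }
    return result;
}

int main() {
    // Example-model-like templates: exponential background, Gaussian signal,
    // 100 bins on [0, 500]; one-bin sideband with content tau = 10.
    const int nbins = 100;
    std::vector<double> bkg(nbins), sig(nbins);
    double sum_b = 0.0, sum_s = 0.0;
    for (int l = 0; l < nbins; ++l) {
        double x = (l + 0.5) * 5.0;                  // bin centre
        bkg[l] = std::exp(-x / 150.0);               // decay constant is an assumption, for illustration
        sig[l] = std::exp(-0.5 * std::pow((x - 250.0) / 50.0, 2));
        sum_b += bkg[l]; sum_s += sig[l];
    }
    for (int l = 0; l < nbins; ++l) { bkg[l] /= sum_b; sig[l] /= sum_s; }   // normalize to 1

    Channel signal_region{ std::vector<double>(nbins, 0.0), { bkg, sig } };
    Channel sideband     { { 200.0 },                       { { 10.0 } } };

    // use the model prediction itself as "data" (an Asimov-like choice, for illustration)
    double mu_s = 10.0, mu_b = 20.0;
    for (int l = 0; l < nbins; ++l)
        signal_region.data[l] = mu_b * bkg[l] + mu_s * sig[l];

    std::vector<Channel> channels{ signal_region, sideband };
    std::vector<std::vector<double>> coeffs{ { mu_b, mu_s }, { mu_b } };    // c_{i,k}(p)
    std::printf("-ln L(mu_s=10, mu_b=20) = %.3f\n", nll(channels, coeffs));
    return 0;
}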

3 Statistical Methods

The statistical methods discussed in the following sections are:

1. Discovery: perform a hypothesis test with null hypothesis µ_s = 0 versus µ_s > 0.
2. Measurement: estimate µ_s, and an interval for µ_s.
3. Exclusion: give an upper / lower limit on µ_s.

In theta, different producers are available to address these questions. A producer calculates one or more values, given data and a model as C++ objects. The calculated values are written to an output file and are called products. These products can be used in a second step to make the actual statistical inferences. Examples of producers are:

- a maximum likelihood estimator, which produces the estimated parameter values
- a producer for interval estimation based on the change in ln L, which produces profile likelihood intervals
- a likelihood ratio producer, which calculates the negative log-likelihood for different parameter restrictions of the same model
- a Markov-Chain Monte-Carlo producer, which calculates a histogram of the posterior in one parameter.

In the following sections, some applications of these producers are discussed in more detail with the example model from Section 2.1. Most statistical methods discussed here (such as the likelihood-ratio test, the maximum-likelihood estimator, and the Neyman interval construction) are treated in detail in many introductory statistics texts (such as [4]) and will not be explained or referenced in detail here.

3.1 Discovery

Three methods are discussed:

1. direct estimation of the significance using asymptotic properties of the likelihood ratio
2. significance estimation via the tail of the test statistic distribution
3. the Bayes factor.

The first two methods are frequentist methods. While the first relies on properties which hold for a large number of events, the second is more general but also much more time-consuming to apply, as a large number of pseudo experiments is required for reliable results. Both yield a p value or a Z value (as in "a significance of Z sigma") if applied to a dataset. The third method is the Bayesian counterpart to frequentist hypothesis tests. The Bayes factor is the ratio of the posterior probabilities of the null hypothesis and an alternative hypothesis. Thus, it expresses the relative credibility of the null hypothesis: a small Bayes factor would lead to the conclusion that the null hypothesis is incorrect.

3.1.1 Direct Estimation of Z

The model is formulated such that the null hypothesis corresponds to a subset B of parameter values p. In this example, B consists of those parameter values with µ_s = 0. The ratio of maximized likelihoods λ(d) is defined as

  λ(d) = ln [ max_p L(p | d) / max_{p ∈ B} L(p | d) ]    (8)

where L is the likelihood function as defined in eq. (5). In the numerator, the maximum is taken over all allowed parameter values p, while in the denominator the maximum is taken only over the subset B which corresponds to the null hypothesis, i.e., the model without signal. λ(d) is a real value indicating the compatibility with the null hypothesis: a large value disfavours the null hypothesis.

The p value for a certain dataset d̂ is defined as the probability of observing a value of λ(d) at least as large as λ(d̂), assuming the null hypothesis is true. Calculating the p value thus requires knowledge of the distribution of λ(d) for the null hypothesis. According to Wilks' theorem, 2λ is asymptotically distributed according to a χ² distribution if the data is drawn from the model with p ∈ B. (For simplicity, it is assumed that B fixes only one parameter, as is the case for the example model.)

In HEP, it is common to cite the Z value instead of the p value when reporting the outcome of a hypothesis test. The Z value for a given p value is the number of standard deviations in the sense that

  ∫_Z^∞ G(x) dx = p

where G(x) is the normal distribution around zero with width one. In the case considered here, the estimated Z value for a given value of λ is

  Z_est = √(2λ).    (9)

In theta, this is done using the likelihood ratio producer. It minimizes the likelihood function for two model variants which only differ in the definition of the prior distribution D(p). By choosing a δ-distribution fixing µ_s to zero, the background-only variant of the example model is defined, and by choosing a flat distribution for µ_s, the signal-plus-background variant is defined. theta is used to generate the Z_est distribution by throwing 100,000 pseudo data distributions d according to the model and calculating Z_est(d) each time. The resulting Z_est distribution can be seen in Figure 2. The median estimated Z value is 3.02, which can be quoted as the expected sensitivity.

[Figure 2: Distribution of estimated Z values for the example model with µ_s = 10, using Z_est from eq. (9). The error bars are approximate uncertainties from the finite number of pseudo experiments.]

3.1.2 Z via Test Statistic Distribution

Alternatively, one can determine the distribution of the estimated Z_est values as defined in eq. (9) for the null hypothesis numerically. From this distribution, the p value of a measurement Ẑ is

  p = ∫_{Ẑ}^∞ f(Z) dZ    (10)

where f(Z) is the distribution of Z_est values for the null hypothesis. In order to give an expected p value in case signal is present, eq. (10) is evaluated setting Ẑ to the median Z_est value for the signal-plus-background case.

theta is run twice: for µ_s = 0.0 to generate the Z_est distribution for the null hypothesis, and for µ_s = 10 to determine the median value of the Z_est distribution. For the first case, one million pseudo experiments are thrown, yielding the distribution shown in Figure 3. The second run was already carried out for the previous section and yielded a median value Ẑ = 3.02. The median p value is given by the fraction contained in the filled area in Figure 3 and corresponds to a Z value of 3.06, which is the expected sensitivity as usually quoted.

The advantage of this method compared to the direct significance estimate discussed in the previous section is that it gives the correct answer even if the assumption of large event numbers underlying Z_est in eq. (9) is not fulfilled. More importantly, this method is more general than using Z_est and allows the straight-forward inclusion of systematic uncertainties via the prior-predictive method, which will be discussed in Section 4. The main disadvantage is that it is much more CPU-intensive, as it requires a large number of pseudo experiments in order to accurately describe the tail of the test statistic distribution. For the background-only distribution, evaluating one million pseudo experiments took theta about 10 minutes using one core of a recent CPU.

[Figure 3: Distribution of Z_est for the background-only and signal-plus-background variants of the example model. The y-axis is the number of pseudo experiments per bin; the error bars indicate the approximate statistical errors from the finite number of pseudo experiments. The median of the S+B Z_est distribution is marked with a vertical line and defines the integration region of the B-only Z_est distribution used to calculate the median expected p value, p̂.]
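The conversions between λ, Z_est and p values in eqs. (8)–(10) are easy to script. The sketch below (standalone, not theta code, with hypothetical numbers) computes Z_est from a log-likelihood-ratio value, the corresponding asymptotic p value, and the p value and Z value obtained from the tail of a set of background-only pseudo-experiment test statistic values as in eq. (10).

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Z_est of eq. (9) from the log-likelihood ratio lambda of eq. (8).
double z_est(double lambda) { return std::sqrt(2.0 * std::max(lambda, 0.0)); }

// upper tail of the standard normal distribution: p = int_Z^inf G(x) dx
double p_from_z(double z) { return 0.5 * std::erfc(z / std::sqrt(2.0)); }

// invert p_from_z by bisection (the C++ standard library has no inverse erf)
double z_from_p(double p) {
    double lo = 0.0, hi = 40.0;
    for (int it = 0; it < 200; ++it) {
        double mid = 0.5 * (lo + hi);
        if (p_from_z(mid) > p) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

// eq. (10): p value as the fraction of null-hypothesis pseudo experiments
// with a test statistic at least as large as the observed value zhat
double p_from_toys(const std::vector<double>& null_ts, double zhat) {
    std::size_t n_tail = 0;
    for (double z : null_ts) if (z >= zhat) ++n_tail;
    return double(n_tail) / null_ts.size();
}

int main() {
    // hypothetical numbers, for illustration only
    double lambda = 4.56;                       // e.g. a value corresponding to Z_est of about 3
    double z = z_est(lambda);
    std::printf("Z_est = %.2f, asymptotic p = %.3g\n", z, p_from_z(z));

    std::vector<double> null_ts = { /* Z_est values from background-only pseudo experiments */ };
    if (!null_ts.empty()) {
        double p = p_from_toys(null_ts, z);
        std::printf("tail p value = %.3g  ->  Z = %.2f\n", p, z_from_p(p));
    }
    return 0;
}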

3.1.3 Bayes Factor

The Bayes factor is the ratio of posterior probabilities

  B_01(d) = p(H_0 | d) / p(H_1 | d)    (11)

where p(H_0 | d) is the posterior probability (degree of belief) of the null hypothesis, given data d, and p(H_1 | d) is the posterior probability of the alternative hypothesis. The smaller B_01, the smaller the belief that the null hypothesis is correct. The posterior probabilities in the above equation are given via Bayes' theorem:

  p(H_i | d) = p(d | H_i) π(H_i) / π(d)

where the π are the prior probabilities and p(d | H_i) is the probability of observing data d, given hypothesis H_i. π(d) is a normalization factor which does not have to be determined explicitly, as it cancels in the ratio (11).

In the case considered here, the null hypothesis H_0 is µ_s = 0 and the alternative H_1 is given by µ_s = 10. (If the alternative hypothesis H_1 does not specify a concrete value for µ_s but a whole parameter range, such as µ_s > 0, the minimum Bayes factor obtained by scanning over the allowed values of µ_s can be cited as result.) In order to determine the posterior for H_0 or H_1, the probability is integrated over the nuisance parameters, in this case only µ_b:

  p(d | H_0) = ∫ p(d | µ_s = 0, µ_b) π(µ_b) dµ_b    (12)

and similarly for H_1, where µ_s = 10 is used on the right hand side. As prior π(µ_b) for µ_b, a flat prior is used.

While the integral in eq. (12) is only one-dimensional, it is not uncommon to have of the order of 10 nuisance parameters. Therefore, the numerical evaluation of the integral is done with the Metropolis-Hastings Markov-Chain Monte-Carlo method [5], which performs well in high-dimensional parameter spaces. theta was used to generate the negative logarithm of the posteriors for the two model variants (signal-plus-background and background-only) using Markov Chains. Figure 4 shows the distribution of Bayes factors calculated for 10,000 pseudo datasets sampled according to the signal-plus-background hypothesis. The expected (median) Bayes factor is 0.010.

[Figure 4: Distribution of Bayes factors B_01 in case H_1 is true, setting µ_b = 20, using 10,000 pseudo experiments. The error bars in the x direction indicate the bin borders; the y errors are approximate statistical errors from the finite number of pseudo experiments.]
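Since the marginalization in eq. (12) is one-dimensional for the example model, a simple quadrature already illustrates the Bayes factor calculation; theta itself uses the Metropolis-Hastings method for this. The sketch below is not theta code: it replaces the full template likelihood by a simplified two-channel counting likelihood (signal region plus sideband) with illustrative observed counts, and assumes equal prior probabilities for H_0 and H_1.

#include <cmath>
#include <cstdio>

// Simplified stand-in for the full template likelihood of eq. (3):
// n_s events in the signal region (mean mu_b + mu_s) and
// n_sb events in the sideband (mean tau * mu_b).
double likelihood(double mu_s, double mu_b, double n_s, double n_sb, double tau) {
    double m1 = mu_b + mu_s, m2 = tau * mu_b;
    return std::exp(n_s  * std::log(m1) - m1 - std::lgamma(n_s  + 1.0)
                  + n_sb * std::log(m2) - m2 - std::lgamma(n_sb + 1.0));
}

// eq. (12): p(d | H) = int p(d | mu_s, mu_b) pi(mu_b) dmu_b with a flat prior
// for mu_b on [0, mu_b_max], evaluated with the trapezoidal rule
double marginal(double mu_s, double n_s, double n_sb, double tau,
                double mu_b_max, int steps) {
    double h = mu_b_max / steps, sum = 0.0;
    for (int i = 0; i <= steps; ++i) {
        double w = (i == 0 || i == steps) ? 0.5 : 1.0;
        sum += w * likelihood(mu_s, i * h + 1e-9, n_s, n_sb, tau);
    }
    return sum * h / mu_b_max;                       // flat prior density 1/mu_b_max
}

int main() {
    double n_s = 30.0, n_sb = 200.0, tau = 10.0;     // illustrative "observed" counts
    double p0 = marginal(0.0,  n_s, n_sb, tau, 100.0, 10000);   // H0: mu_s = 0
    double p1 = marginal(10.0, n_s, n_sb, tau, 100.0, 10000);   // H1: mu_s = 10
    // eq. (11) with equal prior probabilities pi(H0) = pi(H1)
    std::printf("B_01 = p(d|H0)/p(d|H1) = %.4f\n", p0 / p1);
    return 0;
}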

3.2 Measurement

The term "measurement" is used in this context as a synonym for point estimation and interval estimation in the statistics literature. Three methods are discussed:

1. the profile likelihood method,
2. the Neyman construction for central intervals, using as test statistic either (i) the likelihood ratio or (ii) the maximum likelihood estimate, and
3. as Bayesian method, the (marginal) posterior of the signal cross section.

3.2.1 Profile Likelihood Method

The profile likelihood function L_p in the parameter µ_s for the example model is defined as

  L_p(µ_s | d) = max_{µ_b} L(µ_s, µ_b | d)

where L is the likelihood function from eq. (5). In general, the maximum on the right hand side is taken over all nuisance parameters, i.e., over all parameters but the parameter of interest. The likelihood ratio λ defined in eq. (8) can then be written as

  λ(d) = ln [ L_p(µ̂_s | d) / L_p(µ_s = 0 | d) ]

where µ̂_s is defined as the value of µ_s which maximizes L_p.

In general, any hypothesis test can be used to construct confidence intervals: the confidence interval with confidence level 1 − α for µ_s consists of those values of µ_s which are not rejected by a hypothesis test at level α. Applying this principle to the hypothesis test discussed in Section 3.1.1 yields the following construction for a confidence interval at level l for µ_s:

1. construct L_p(µ_s | d) by minimizing with respect to µ_b
2. find the value µ̂_s which maximizes L_p(µ_s | d)
3. include in the interval all values of µ_s for which √(2 [ln L_p(µ̂_s) − ln L_p(µ_s)]) < ∆(l).

The value ∆(l) in the last step is found by applying Wilks' theorem to the likelihood ratio test statistic (cf. Section 3.1.1): ∆(l) = √2 erf⁻¹(l), where erf is the error function; for l = 0.683 this gives ∆(l) = 1, corresponding to the familiar criterion ∆ ln L_p < 1/2.
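The three-step construction translates directly into code. The sketch below is standalone and not theta code: it uses a simplified counting likelihood as a stand-in for eq. (5), profiles µ_b by a crude grid scan instead of a proper minimizer, and applies the 68.3% criterion ∆ ln L_p < 0.5; the observed counts are illustrative values near the example-model expectation.

#include <cmath>
#include <cstdio>
#include <limits>

// Simplified negative log-likelihood: n_s observed in the signal region
// (mean mu_b + mu_s) and n_sb in the sideband (mean tau * mu_b).
double nll(double mu_s, double mu_b, double n_s, double n_sb, double tau) {
    double m1 = mu_b + mu_s + 1e-12, m2 = tau * mu_b + 1e-12;
    return m1 - n_s * std::log(m1) + m2 - n_sb * std::log(m2);
}

// Step 1: profile over the nuisance parameter mu_b (crude grid scan).
double profile_nll(double mu_s, double n_s, double n_sb, double tau) {
    double best = std::numeric_limits<double>::infinity();
    for (double mu_b = 0.0; mu_b <= 100.0; mu_b += 0.01)
        best = std::min(best, nll(mu_s, mu_b, n_s, n_sb, tau));
    return best;
}

int main() {
    double n_s = 30.0, n_sb = 200.0, tau = 10.0;     // illustrative observed counts

    // Step 2: find mu_s_hat, the minimum of the profiled NLL.
    double mu_s_hat = 0.0, nll_min = std::numeric_limits<double>::infinity();
    for (double mu_s = 0.0; mu_s <= 60.0; mu_s += 0.05) {
        double v = profile_nll(mu_s, n_s, n_sb, tau);
        if (v < nll_min) { nll_min = v; mu_s_hat = mu_s; }
    }

    // Step 3: the 68.3% interval contains all mu_s with
    // sqrt(2 * delta_nll) < 1, i.e. delta_nll < 0.5.
    double lo = mu_s_hat, hi = mu_s_hat;
    for (double mu_s = 0.0; mu_s <= 60.0; mu_s += 0.05)
        if (profile_nll(mu_s, n_s, n_sb, tau) - nll_min < 0.5) {
            lo = std::min(lo, mu_s);
            hi = std::max(hi, mu_s);
        }
    std::printf("mu_s_hat = %.2f, 68.3%% interval [%.2f, %.2f]\n", mu_s_hat, lo, hi);
    return 0;
}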

theta was run producing pseudo data distributed according to the prediction of the example model, setting µ_s = 10 and µ_b = 20, and determining the lower and upper interval borders for each generated pseudo data distribution. The median value of the estimated µ_s, µ̂_s, and of the lower and upper one-sigma interval borders (where the medians are taken separately over all pseudo experiments) are

  µ̂_s = 9.8 +4.6 −3.9.

The bias in the central value is due to the smallness of µ_s and is a well-known property of the maximum likelihood method. As a cross-check, it was confirmed that this bias vanishes if µ_s and µ_b are scaled simultaneously by a factor 100. The coverage of the intervals was found to be 68%, as desired.

3.2.2 Neyman Construction

Using the Neyman construction to construct central intervals, one has to determine the test statistic distribution as a function of the value of the parameter of interest µ_s. Then, for each fixed µ_s, the central 68% (95%) of the test statistic values are included in a confidence belt. Given a measurement which yields a test statistic value T̂, the cited interval consists of those values of µ_s contained in the belt at this test statistic value. This construction is depicted in Figure 5: for each fixed value of µ_s, the bands indicate the central 68% and 95% of the test statistic distribution (in this case Z_est), respectively. The interval construction starts at the observed test statistic value T̂ on the y-axis. The central value and the 68% C.L. interval are then read off from the intersection points of a horizontal line at T̂ with the 68% belt.

Likelihood Ratio as Test Statistic
As test statistic in the first case, Z_est from eq. (9) is used. The 68% and 95% confidence belts are shown in Figure 5. This figure was constructed by throwing 200,000 pseudo experiments where, for each pseudo experiment, µ_s is drawn randomly from a flat distribution between 0.0 and 30.0. After throwing the pseudo experiments, the µ_s range is divided into 30 bins, and for each bin the quantiles (0.025, 0.16, 0.5, 0.84, 0.975) of the test statistic distribution are determined. These quantiles define the belts as explained above. For low values of µ_s, a considerable fraction of pseudo experiments yields Z_est = 0 and the band contains more than 68% or 95% of the Z_est values. For µ_s = 10, the median test statistic value is T̂ = 3.02 and the expected confidence interval in this case is

  µ̂_s = 10.0 +4.7 −3.9.    (13)

Maximum Likelihood Estimate as Test Statistic
The maximum likelihood estimate µ̂_s is defined as the value of µ_s which maximizes the likelihood function L(µ_s, µ_b | d) when varying both µ_s and µ_b. This estimate is used as test statistic for the Neyman construction. The confidence belts are shown in Figure 6. The interval in the median case was found to be µ̂_s = 10.0 +4.7 −3.9 and thus gives the same expected interval as using Z_est as test statistic.

[Figure 5: 68% and 95% confidence level belts of a central-interval Neyman construction for the example model using Z_est as test statistic (which is equivalent to using a likelihood ratio).]

[Figure 6: 68% and 95% confidence level belts of a central-interval Neyman construction for the example model using the maximum likelihood estimate of µ_s as test statistic.]
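The belt construction itself is mainly bookkeeping: bin the pseudo experiments in µ_s and compute quantiles of the test statistic per bin. A minimal sketch follows (standalone and illustrative; it assumes the pseudo experiments have already been produced and stored as (µ_s, test statistic) pairs, e.g. read back from the theta output).

#include <algorithm>
#include <cstdio>
#include <vector>

struct Toy { double mu_s; double ts; };              // one pseudo experiment

// empirical quantile of a sample (works on a copy, so the input stays untouched)
double quantile(std::vector<double> v, double q) {
    std::size_t idx = static_cast<std::size_t>(q * (v.size() - 1));
    std::nth_element(v.begin(), v.begin() + idx, v.end());
    return v[idx];
}

int main() {
    // pseudo experiments with mu_s drawn flat on [0, 30]; filled elsewhere
    std::vector<Toy> toys = { /* ... */ };

    const int nbins = 30;
    const double mu_max = 30.0;
    const double probs[5] = { 0.025, 0.16, 0.5, 0.84, 0.975 };

    // sort the test statistic values into mu_s bins
    std::vector<std::vector<double>> per_bin(nbins);
    for (const Toy& t : toys) {
        int bin = std::min(nbins - 1, static_cast<int>(t.mu_s / mu_max * nbins));
        per_bin[bin].push_back(t.ts);
    }

    // per bin, the quantiles (0.025, 0.16, 0.5, 0.84, 0.975) define the
    // 95% and 68% central belts and the median curve
    for (int b = 0; b < nbins; ++b) {
        if (per_bin[b].empty()) continue;
        std::printf("mu_s bin %2d:", b);
        for (double p : probs) std::printf(" %7.3f", quantile(per_bin[b], p));
        std::printf("\n");
    }
    return 0;
}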

3.2.3 Bayesian Inference: Posterior

In Bayesian statistics, the posterior in µ_s can be considered the final result of a measurement. Given data d, the posterior in all model parameters (µ_s, µ_b) is given by

  p(µ_s, µ_b | d) = p(d | µ_s, µ_b) π(µ_s, µ_b) / π(d)    (14)

where the π are the priors. The denominator on the right hand side, π(d), is a normalization factor which does not have to be determined explicitly. In order to make statements about the parameter of interest, µ_s, only the marginal distribution of µ_s of the full posterior is considered. It is given by

  p(µ_s | d) = ∫ p(µ_s, µ_b | d) dµ_b.

This marginal posterior is extracted by a Markov-Chain Monte-Carlo method which creates a Markov chain of parameter values (µ_s, µ_b) distributed according to the posterior (14). The posterior for µ_s is shown in Figure 7. As dataset, the template of the model prediction is used directly, i.e., without throwing random pseudo data. The posterior was determined using a chain length of one million and flat priors in µ_s and µ_b. The run time for this Markov Chain was about 6 seconds.

While the posterior in µ_s, p(µ_s), can be considered as the final result, some derived quantities can be used to summarize the posterior. In this case, the most probable value and the central 68% credible level interval are considered. The most probable value can be determined robustly by fitting a normal distribution to the peak region of the posterior and taking the mean of the fitted distribution as the estimated value. The central 68% credible level interval is illustrated in Figure 7. Using these values, the expected result can be summarized as µ̂_s = 10.1 +5.0 −3.6.

[Figure 7: Posterior in µ_s for the example model using µ_s = 10 and a flat prior in µ_s, µ_b. The values for µ_s from the chain were binned in 60 bins from 0.0 to 30.0. The y-axis is the number of chain elements in the bin, which is proportional to the posterior p(µ_s | d).]

3.3 Exclusion

Exclusion can be handled by the methods discussed in Section 3.2, with some minor modifications:

- The upper end of a 90% C.L. central interval from the profile likelihood method can be cited as the 95% C.L. upper limit. Note, however, that the profile likelihood method might perform poorly near the parameter boundary µ_s = 0, which is typically the case when calculating an upper limit. In this case, coverage tests should be carried out to check the validity of the method.
- The Neyman construction remains valid; instead of taking the central 68% of the test statistic distribution as the confidence belt, one would include the upper 95%.
- The Bayesian result is the marginal posterior in µ_s, from which the 95% quantile can easily be derived (see the sketch below).
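Given the Markov chain for µ_s, the central 68% credible interval of Section 3.2.3 and the 95% quantile used as a Bayesian upper limit are plain sample quantiles. A minimal sketch (standalone, not theta code; the chain values are assumed to have been produced elsewhere, e.g. read back from the theta output database):

#include <algorithm>
#include <cstdio>
#include <vector>

// empirical quantile of the chain values (sorts a copy)
double quantile(std::vector<double> chain, double q) {
    std::size_t idx = static_cast<std::size_t>(q * (chain.size() - 1));
    std::nth_element(chain.begin(), chain.begin() + idx, chain.end());
    return chain[idx];
}

int main() {
    // mu_s values of the Markov chain elements; filled elsewhere
    std::vector<double> mu_s_chain = { /* ... */ };
    if (mu_s_chain.empty()) return 0;

    // central 68% credible interval: the 16% and 84% quantiles of the marginal posterior
    double lo = quantile(mu_s_chain, 0.16);
    double hi = quantile(mu_s_chain, 0.84);

    // Bayesian 95% C.L. upper limit: the 95% quantile of the marginal posterior
    double limit = quantile(mu_s_chain, 0.95);

    std::printf("central 68%% interval: [%.2f, %.2f], 95%% upper limit: %.2f\n", lo, hi, limit);
    return 0;
}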

The only method discussed here in more detail is the CLs method [6], which can be used to construct upper limits. Unlike other methods, it has the property that a downward fluctuation of the background does not lead to more stringent upper limits. This is seen as a desirable property by many physicists, as otherwise a poor background model which systematically overestimates the background level would yield a better upper limit than a realistic background model. The CLs value for a certain signal s and background b is defined as

  CLs = (1 − p_{s+b}) / (1 − p_b)    (15)

where p_i is the upper tail of the test statistic distribution for model i, i.e., the probability of observing a test statistic value at least as signal-like as the one observed, assuming model i is true. This definition of CLs is depicted in Figure 8. The 95% upper limit is given by the amount of signal for which CLs as defined in eq. (15) is 0.05.

For this example, Z_est is used as test statistic, and the p values for eq. (15) are given by eq. (10). The expected 95% C.L. upper limit is calculated twice, for µ_s = 0 and for µ_s = 10. For these two cases, the median of the Z_est distribution is 0.00 and 3.02, respectively. The pseudo data sample used is the same as in the Neyman construction using Z_est in Section 3.2.2. As was done there, 30 bins in µ_s are created and the test statistic distribution in each bin is used to calculate the p values in eq. (15). The 95% C.L. upper limits are determined by calculating CLs for all bins in µ_s, interpolating this dependence linearly, and finding the value of µ_s for which CLs is 0.05. This yields an expected upper limit of 7.1 in the case of no signal.

[Figure 8: Illustration of the definition of the CLs value. With the test statistic (TS) distributions for the background-only and the signal-plus-background hypotheses, the CLs value for a given measurement with test statistic value T̂ is defined as (1 − p_{s+b}) / (1 − p_b).]
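A minimal sketch of the CLs scan described above follows (standalone and illustrative, not theta code); it assumes the per-bin test statistic distributions and the observed test statistic value are already available, and it finds the 95% C.L. limit by linear interpolation in µ_s.

#include <cstdio>
#include <vector>

// upper tail probability: fraction of test statistic values >= ts_obs
double p_tail(const std::vector<double>& ts_values, double ts_obs) {
    std::size_t n = 0;
    for (double t : ts_values) if (t >= ts_obs) ++n;
    return ts_values.empty() ? 0.0 : double(n) / ts_values.size();
}

// CLs of eq. (15)
double cls(const std::vector<double>& ts_sb, const std::vector<double>& ts_b, double ts_obs) {
    return (1.0 - p_tail(ts_sb, ts_obs)) / (1.0 - p_tail(ts_b, ts_obs));
}

int main() {
    // test statistic values per mu_s bin for the s+b hypothesis and for the
    // background-only hypothesis; filled elsewhere from pseudo experiments
    std::vector<std::vector<double>> ts_sb_per_bin;  // index = mu_s bin
    std::vector<double> ts_b;
    std::vector<double> mu_s_centers;                // mu_s value of each bin
    double ts_obs = 0.0;                             // observed (or median expected) test statistic

    // scan mu_s, compute CLs per bin, and interpolate linearly to CLs = 0.05
    double limit = -1.0, prev_mu = 0.0, prev_cls = 1.0;
    for (std::size_t b = 0; b < ts_sb_per_bin.size(); ++b) {
        double c = cls(ts_sb_per_bin[b], ts_b, ts_obs);
        if (prev_cls > 0.05 && c <= 0.05) {
            // linear interpolation between the previous and the current bin
            limit = prev_mu + (0.05 - prev_cls) / (c - prev_cls) * (mu_s_centers[b] - prev_mu);
            break;
        }
        prev_mu = mu_s_centers[b];
        prev_cls = c;
    }
    if (limit >= 0.0) std::printf("95%% C.L. upper limit on mu_s: %.2f\n", limit);
    return 0;
}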

4 Including Systematic Uncertainties

Systematic uncertainties can be included in the model by introducing additional (nuisance) parameters q into the model which parametrize the effect of the systematic uncertainty on the predicted templates. Within theta, these nuisance parameters are treated no differently from the other model parameters p. They are given another name here only in order to refer to them unambiguously in the following discussion.

Once an uncertainty is included in the model, there are different possibilities to account for it in the methods presented previously. In a Bayesian method, the most natural way is to include priors for the parameters q and proceed as before, i.e., integrate with respect to q in order to construct the posterior for the parameter of interest or the Bayes factor. For methods which use numerical distributions of the test statistic (such as the Neyman construction, the hypothesis test of Section 3.1.2, or the CLs method), the pseudo data used to calculate the test statistic distribution is generated including the systematic uncertainties, by choosing random values for q before drawing a pseudo data distribution from the model and calculating the value of the test statistic. In this case, assumptions about the distribution of q in the form of a prior for q have to be made. This method is known as the prior-predictive method; a minimal code sketch of this recipe is given at the end of this introduction. (In an alternative method, the posterior-predictive method, posteriors for q are derived first in an auxiliary measurement. These posteriors are then used to throw random values for the test statistic generation as discussed. This will not be discussed here.)

The definition of the test statistic is often based on maximizing the likelihood function. The model used to define this likelihood function can be either the model which includes q (including priors for q), or a model which fixes q to its most probable values, i.e., the model used before including the systematic uncertainties. The latter approach is often more robust in practice, as the number of parameters to vary during the minimization of the negative log-likelihood is smaller. However, including the dependence on some nuisance parameters in the definition of the test statistic can improve the results, as the values of the nuisance parameters can then be determined from data simultaneously with the parameter of interest. This is especially true for systematic uncertainties which have a large impact on the background in signal-free regions. In this case, minimization of the negative log-likelihood effectively measures the parameter value q_j for this uncertainty, which improves the background prediction in the signal region. On the other hand, including a nuisance parameter q_j in the model used for the likelihood definition which has little impact on the model prediction, and thus can hardly be measured from data, might even worsen the results. Thus, general advice on whether or not to include the uncertainties in the model definition used for the likelihood function is to start with a model without these uncertainties and include the uncertainties one by one, starting with those which have the largest impact on the result. These uncertainties would then only be kept if they improve the expected result.
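As a minimal sketch of the prior-predictive recipe (standalone, not theta code): first draw the nuisance parameter from its prior, then draw Poisson pseudo data from the corresponding model prediction. The rate-like nuisance parameter, its Gaussian prior, and the empty nominal template below are illustrative placeholders.

#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);

    // nominal prediction of a template (filled elsewhere, e.g. the example model
    // prediction for mu_b = 20, mu_s = 0)
    std::vector<double> nominal = { /* m_l for each bin */ };

    // prior for a single rate-like nuisance parameter q, e.g. tau = 1 +- 0.2
    std::normal_distribution<double> prior_q(1.0, 0.2);

    const int n_pseudo = 5;
    for (int pe = 0; pe < n_pseudo; ++pe) {
        // 1. draw the nuisance parameter from its prior (prior-predictive step)
        double q = prior_q(rng);

        // 2. draw Poisson pseudo data from the model prediction at this q
        std::vector<int> pseudo(nominal.size());
        for (std::size_t l = 0; l < nominal.size(); ++l) {
            std::poisson_distribution<int> pois(std::max(1e-12, q * nominal[l]));
            pseudo[l] = pois(rng);
        }

        // 3. the test statistic would now be evaluated on 'pseudo'
        std::printf("pseudo experiment %d generated with q = %.3f\n", pe, q);
    }
    return 0;
}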

In a simple case, the uncertainty to consider is merely an uncertainty on the rate of a certain process. How such a case is included in theta is discussed first. More generally, an uncertainty affects the whole template, i.e., the template depends on q. This dependence affects not only the rate but also the shape of the template. This case is discussed in the second subsection.

4.1 Rate Uncertainties

Consider an uncertainty in the example model on the relative normalizations of the background templates in the different channels, i.e., on the relative normalization of t_{sb,1} and t_{s,1} (see eqs. (6) and (7)). To include this uncertainty in the model, an additional parameter τ is introduced, which is the ratio of the background expectation in the sideband and the signal region. Then, the model equations (6) and (7) become

  m_sb(µ_s, µ_b) = τ µ_b t_{sb,1}
  m_s(µ_s, µ_b) = µ_b t_{s,1} + µ_s t_{s,2}.

The absence of any uncertainty corresponds to fixing τ = 1. Assume that by external knowledge (e.g., an auxiliary measurement), τ is known to be τ = 1 ± 0.2. This can be taken into account in theta by defining D(p) to include a normal distribution for τ (see eq. (5)).

As an example, the Neyman construction using the maximum likelihood estimate µ̂_s as test statistic from Section 3.2.2 is re-evaluated by generating pseudo data where, for each pseudo experiment, τ is chosen randomly from a normal distribution around 1 with width 0.2. The model used to calculate the maximum likelihood test statistic fixes τ to 1. The resulting confidence belt using 200,000 pseudo experiments is shown in Figure 9. The expected result (for the median value of the test statistic) is µ̂_s = 10.0 +5.0 −4.0; the interval is larger than without this uncertainty.

4.2 Template Uncertainties

In general, a mere rate uncertainty does not suffice to describe the effect of a systematic uncertainty on the expected distribution in an observable. Rather, the whole template is affected. An example would be a bias in the energy measurement which shifts the signal peak of the example model (see Figure 10). The templates shown there represent the estimated one-sigma deviations, i.e., 68% credible intervals.

In order to include this effect in the model, a parameter δ is introduced which is used to interpolate between the original template without uncertainties, t_{s,2}, and the templates which include the uncertainty, t_{s,2,+} and t_{s,2,−}. This is written as a template function t_{s,2}(δ) such that

  t_{s,2}(0) = t_{s,2}
  t_{s,2}(1) = t_{s,2,+}
  t_{s,2}(−1) = t_{s,2,−}.

As the templates affected by uncertainties represent a one-sigma deviation, a reasonable prior distribution for δ is a normal distribution with mean 0 and width 1. Additional desirable properties of a template interpolation are: for all values of δ, all bins should have positive values, and the bin values as a function of δ should be continuously differentiable.

[Figure 9: 68% and 95% confidence level belts of a central-interval Neyman construction for the example model. Pseudo data was generated including a 20% relative uncertainty on the relative background expectation in the sideband and signal region, τ. The test statistic is the maximum likelihood estimate of µ_s using a model without the uncertainty on τ.]

[Figure 10: Signal templates for the example model. t_{s,2} is the original signal template unaffected by uncertainties (cf. Figure 1); t_{s,2,±} are the signal templates affected by an energy-scale-like uncertainty which affects both the shape and the normalization of the template.]

The first property ensures that the model always remains meaningful and that the evaluation of the negative log-likelihood always yields finite values. The second property is important for stable numerical minimization of the negative log-likelihood, as many algorithms assume a continuous first derivative. There are many different possibilities for a template interpolation with these properties. The one chosen here is

  t_{s,2}(δ) = t_{s,2} · ( t_{s,2,sign(δ)} / t_{s,2} )^{|δ|}

where the equation holds for each individual bin of the templates involved.

For the Neyman construction, pseudo data is generated by choosing a random value for δ from a Gaussian around zero with width one. Then, Poisson data is generated from the model prediction for this value of δ. The maximum likelihood estimate used as test statistic was calculated using the same model, i.e., including the template interpolation with δ as a parameter and the Gaussian prior for δ. The belts calculated from 200,000 pseudo experiments are shown in Figure 11. The expected (median) result of the interval estimation in this case is µ̂_s = 10.0 +4.9 −3.9. The runtime of theta was about 5 minutes.

[Figure 11: 68% and 95% confidence level belts of a central-interval Neyman construction for the example model. Pseudo data was generated including the template uncertainty as depicted in Figure 10.]
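A bin-wise implementation of the template interpolation of Section 4.2 takes only a few lines; the sketch below (standalone, with illustrative toy templates, not theta code) also shows that the boundary conditions t(0) = t_{s,2} and t(±1) = t_{s,2,±} hold.

#include <cmath>
#include <cstdio>
#include <vector>

// Bin-wise template interpolation: t(delta) = t0 * (t_sign(delta) / t0)^{|delta|},
// so that t(0) = t0, t(1) = t_plus, t(-1) = t_minus, and all bins stay positive.
std::vector<double> interpolate(const std::vector<double>& t0,
                                const std::vector<double>& t_plus,
                                const std::vector<double>& t_minus,
                                double delta) {
    const std::vector<double>& shifted = delta >= 0.0 ? t_plus : t_minus;
    std::vector<double> result(t0.size());
    for (std::size_t l = 0; l < t0.size(); ++l)
        result[l] = t0[l] * std::pow(shifted[l] / t0[l], std::fabs(delta));
    return result;
}

int main() {
    // three-bin toy templates (strictly positive, as required by the model)
    std::vector<double> t0      = { 1.0, 2.0, 3.0 };
    std::vector<double> t_plus  = { 1.2, 2.5, 2.8 };
    std::vector<double> t_minus = { 0.9, 1.6, 3.1 };

    for (double delta : { -1.0, -0.5, 0.0, 0.5, 1.0 }) {
        std::vector<double> t = interpolate(t0, t_plus, t_minus, delta);
        std::printf("delta = %+.1f :", delta);
        for (double v : t) std::printf(" %.3f", v);
        std::printf("\n");
    }
    return 0;
}

Interpolating multiplicatively per bin keeps all bin contents positive for any value of δ, which is the first of the two desired properties above.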

[Figure 12: Overview of the collaboration of some important classes in theta. Rectangles are core classes of theta, while the ellipses represent abstract classes which are implemented as plugins. The collaboration of plugin classes depicted here is only an example: theta imposes no restriction on the architecture of plugins.]

5 theta Framework

Figure 12 gives an overview of the architecture of theta. As depicted there, theta consists of a few core classes which implement central concepts of template-based modeling, such as the model, the data, and the negative log-likelihood function (NLLikelihood). Other classes are merely abstract classes in theta; concrete classes are provided by plugins. While theta includes some common plugins, it was specifically designed to enable users to write their own plugins for their needs.

Which plugins to use and the configuration parameters for these plugins are specified in a configuration file. The theta main program reads this configuration file and creates the requested plugin instances. This includes setting up the model, which contains a number of Functions and HistogramFunctions representing the coefficients c_{i,k}(p) and templates t_{i,k}(p) of equation (2), respectively. The distribution D(p) of a model is represented by a Distribution instance. It can be used both for the generation of random parameter values for pseudo data construction and as a term in the likelihood function.

For each pseudo experiment, Data is produced by a DataSource. This can mean throwing random Poisson data according to a model or always passing the same Data which was read from a ROOT file.

This Data is passed to each configured Producer, which has access to the Model and the Data. The Producer typically constructs the likelihood function for the Model and the Data (NLLikelihood), calculates some quantities, and writes them to a Database. A Database implementation contained in theta writes the products to a SQL table in a sqlite3 file.

The only external dependency of the theta core system is Boost [1], a general-purpose C++ library. The core system becomes useful only through plugins. theta provides plugins for many common use cases, such as access to ROOT histograms, writing the results to a sqlite3 database, and many more. If there is no suitable plugin for a particular use case, defining one's own plugin is straight-forward: it is merely a matter of deriving from the appropriate base class and implementing its purely virtual methods.

5.1 Combination with external Likelihood Functions

Combination of different channels is possible by configuring a theta model which includes all channels as different observables. A combination with an external analysis on the likelihood level is possible by exploiting the plugin system and writing a Function plugin which calculates the external negative log-likelihood. This Function can then be used as part of the prior D(p) (cf. eq. (5)) of the model. This allows one either to apply the methods implemented in theta to the external likelihood function or to combine the external likelihood function with the likelihood function calculated internally by theta.

5.2 Markov Chains in theta

The Metropolis-Hastings Markov-Chain Monte-Carlo algorithm produces a Markov chain whose elements are distributed according to a probability density f(x). Given a point x_i within the sequence, the next point in the sequence is found by randomly choosing a proposed point x̂ in the neighborhood of x_i. If the probability density at the proposed point is larger than the current one, i.e., if f(x̂) > f(x_i), the proposal is accepted, i.e., the next point in the chain is x̂. Otherwise, x̂ is only accepted with probability f(x̂)/f(x_i). If the proposal is not accepted, the next point in the sequence is x_i (again).

One crucial point when applying the algorithm is the typical step size of the proposal. If the step size is too large, the proposed points are too far from the current point x_i, the posterior at the proposed point is very small, and the proposal is rejected too often. On the other hand, if the step size is very small, the acceptance rate will be high. In this case, the chain elements behave like a particle in a diffusion process and the covered volume in parameter space is only proportional to √N for a chain of length N.

In theta, the jump kernel is a multivariate normal distribution. Its covariance is determined iteratively from the posterior of average data, i.e., using the average prediction of the model without any randomization. The found width is further scaled to maximize the diffusion speed according to [3] by a factor 2.38/√n, where n is the number of dimensions, i.e., the number of non-fixed parameters. The covariance is determined only once per theta run and used for all pseudo experiments. This provides a considerable speedup compared to methods which determine the jump kernel width for each pseudo experiment.
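A minimal Metropolis-Hastings implementation with a normal jump kernel scaled by 2.38/√n might look as follows. This is an illustrative sketch, not theta code: the target density is a toy two-dimensional normal posterior and the per-parameter widths are set by hand, whereas theta estimates a full covariance matrix from the model.

#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// log of the target density f(x); here an illustrative 2-d normal posterior
double log_f(const std::vector<double>& x) {
    double a = (x[0] - 10.0) / 4.0, b = (x[1] - 20.0) / 5.0;
    return -0.5 * (a * a + b * b);
}

int main() {
    const int n = 2;                                   // number of non-fixed parameters
    const int chain_length = 100000;
    const double scale = 2.38 / std::sqrt(double(n));  // scaling of the jump kernel

    // per-parameter proposal widths (in theta: from the estimated covariance matrix)
    std::vector<double> width = { 4.0, 5.0 };

    std::mt19937 rng(1);
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);

    std::vector<double> x = { 10.0, 20.0 };            // starting point
    double lf = log_f(x);
    int accepted = 0;

    for (int i = 0; i < chain_length; ++i) {
        // propose a new point in the neighborhood of x
        std::vector<double> prop = x;
        for (int j = 0; j < n; ++j) prop[j] += scale * width[j] * gauss(rng);
        double lf_prop = log_f(prop);

        // accept with probability min(1, f(prop)/f(x))
        if (lf_prop >= lf || uniform(rng) < std::exp(lf_prop - lf)) {
            x = prop; lf = lf_prop; ++accepted;
        }
        // (the chain element x would now be histogrammed or written out)
    }
    std::printf("acceptance rate: %.2f\n", double(accepted) / chain_length);
    return 0;
}

The acceptance rate printed at the end is a quick diagnostic of whether the chosen step size is sensible.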

5.3 Testing

Testing is done by evaluating the producers for simple models for which the result is known analytically or from statistics papers in HEP. This includes the likelihood ratio producer for counting experiments and the profile likelihood interval producer for counting experiments, including tests against the values given in [2] for the on/off problem and the Gaussian-mean problem. Also, Bayesian credible intervals for a counting experiment have been checked against analytical results.

Appendix

Two annotated configuration files are given below. The first, exmodel.cfg, defines the example model. The templates in the signal region from eq. (7), t_{s,1} and t_{s,2}, are saved in a ROOT file templates.root and called "bkg" and "signal", respectively. The second configuration file, deltanll.cfg, was used to produce the values of the minimized negative log-likelihood functions for eq. (8) used in Section 3.1.1, as well as the signal-plus-background distribution of the Z_est test statistic in Section 3.1.2.

Plugins used in this example are:

- fixed_poly, a HistogramFunction which does not depend on any parameters and is defined by a polynomial.
- root_histogram, a HistogramFunction which does not depend on any model parameters and returns a histogram from a ROOT file.
- mult, a Function which multiplies all parameters in a list.
- product_distribution, a Distribution defined as the product of other Distributions.
- flat_distribution, a Distribution which is flat on a given interval.
- delta_distribution, a Distribution which fixes a parameter to a given value.
- deltanll_hypotest, a Producer which produces the values of the negative log-likelihoods appearing on the right hand side of equation (8).
- model_source, a DataSource which throws random Poisson data according to a given model.
- sqlite_database, a Database which writes all data to a single sqlite3 database file.

exmodel.cfg:

// the names of all model parameters
parameters = ("mu_s", "mu_b");

// the observables of the model
observables = {
   // the mass variable in the signal region has 100 bins on
   // the range (0, 500):
   mass = {
      range = (0.0, 500.0);
      nbins = 100;
   };
   // counting-only variable for the sideband, i.e., one bin
   // on an arbitrary range:
   sb = {
      range = (0.0, 1.0);
      nbins = 1;
   };
};

tau = 10.0;

// flat template, normalized to tau, for the sideband counting part:
t_sb_1 = {
   type = "fixed_poly";
   observable = "sb";
   normalize_to = "@tau";
   coefficients = [1.0];
};

// templates from a ROOT file for the others:
t_s_1 = {
   type = "root_histogram";
   filename = "templates.root";
   histoname = "bkg";
   normalize_to = 1.0;
};
t_s_2 = {
   type = "root_histogram";
   filename = "templates.root";
   histoname = "signal";
   normalize_to = 1.0;
};

// within the model, the prediction is given for each observable by listing
// the terms of the sum in eq. (2). For each term, a
// coefficient function and a histogram is specified which correspond
// to c_{i,k} and t_{i,k} in eq. (2).
example_model = {
   mass = {
      background = {
         coefficient_function = { type = "mult"; parameters = ("mu_b"); };
         histogram = "@t_s_1";
      };
      signal = {
         coefficient_function = { type = "mult"; parameters = ("mu_s"); };
         histogram = "@t_s_2";
      };
   };
   sb = {
      background = {
         coefficient_function = { type = "mult"; parameters = ("mu_b"); };
         histogram = "@t_sb_1";
      };
   };

   // as distribution for the parameters, the product of two
   // flat distributions is used which are defined below.
   parameter_distribution = {
      type = "product_distribution";
      distributions = ("@mu_b_flat", "@mu_s_flat");
   };
};

// use flat distributions which restrict the
// mu parameters to positive values, but do not add an additional
// term to the likelihood function. If sampling from these
// distributions, they will always return the fixed value
// given by fix_sample_value. This value is also used as
// initial guess for likelihood function minimization.
// As initial step size for minimization or Markov Chains,
// the width parameter is used.
mu_b_flat = {
   type = "flat_distribution";
   mu_b = {
      range = (0.0, "inf");
      fix_sample_value = 20.0;
      width = 2.0;
   };
};
mu_s_flat = {
   type = "flat_distribution";
   mu_s = {
      range = (0.0, "inf");
      fix_sample_value = 10.0;
      width = 1.0;
   };
};

deltanll.cfg:

@include "exmodel.cfg"

// hypotest is a producer which produces random variable
// distributions, given data and a model.
// The deltanll_hypotest producer produces the two minima of the likelihood functions
// using the same model but different parameter distributions, namely the
// background-only distribution and the
// signal-plus-background distribution.
hypotest = {
   type = "deltanll_hypotest";
   name = "hypotest";
   minimizer = { type = "root_minuit"; };
   background_only_distribution = {
      type = "product_distribution";
      distributions = ("@mu_s_zero", "@mu_b_flat");
   };
   // as s+b distribution, the distribution as
   // defined in the model is used:
   signal_plus_background_distribution = "@example_model.parameter_distribution";
};

mu_s_zero = {
   type = "delta_distribution";
   mu_s = 0.0;
};

// main is the setting which glues it all together. It will run n_events
// pseudo experiments which
// 1. sample pseudo data from the data source
// 2. call all the producers on the pseudo data, along with the model
// 3. save the products produced in 2. in the output database
main = {
   data_source = {
      type = "model_source";
      name = "source";
      model = "@example_model";   // included via exmodel.cfg
   };
   model = "@example_model";
   producers = ("@hypotest");
   n_events = 100000;
   output_database = {
      type = "sqlite_database";
      filename = "deltanll_hypo.db";
   };
};

options = {
   plugin_files = ("../../lib/core_plugins.so", "../../lib/root.so");
};

References

[1] Boost C++ libraries. http://boost.org/.

[2] R. D. Cousins, J. T. Linnemann, and J. Tucker. Evaluation of three methods for calculating statistical significance when incorporating a systematic uncertainty into a test of the background-only hypothesis for a Poisson process. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 595(2):480–501, 2008.

[3] A. Gelman, G. O. Roberts, and W. R. Gilks. Efficient Metropolis jumping rules. Bayesian Statistics, 5:599–607, 1996.

[4] F. James. Statistical Methods in Experimental Physics. 2nd edition. World Scientific, 2006.

[5] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[6] A. L. Read. Modified frequentist analysis of search results (the CLs method). CERN-OPEN-2000-205, 2000.