arxiv: v3 [physics.data-an] 23 May 2011

Size: px

Start display at page:

Download "arxiv: v3 [physics.data-an] 23 May 2011"

Francine Walker
5 years ago
Views:

1 Date: October, 8 arxiv:.7v [hysics.data-an] May -values for Model Evaluation F. Beaujean, A. Caldwell, D. Kollár, K. Kröninger Max-Planck-Institut für Physik, München, Germany CERN, Geneva, Switzerland Georg-August-Universität, Göttingen, Germany Abstract Deciding whether a model rovides a good descrition of data is often based on a goodness-of-fit criterion summarized by a -value. Although there is considerable confusion concerning the meaning of -values, leading to their misuse, they are nevertheless of ractical imortance in common data analysis tasks. We motivate their alication using a Bayesian argumentation. We then describe commonly and less commonly known discreancy variables and how they are used to define -values. The distribution of these are then extracted for examles modeled on tyical data analysis tasks, and comments on their usefulness for determining goodness-of-fit are given.

2 Introduction Progress in science is the result of an interlay between model building and the testing of models with exerimental data. In this aer, we discuss model evaluation and focus rimarily on situations where a statement is desired on the validity of a model without exlicit reference to other models. We introduce different discreancy variables for this urose and define -values based on these. We then study the usefulness of the -values for assing judgments on models with a few simle examles reflecting commonly encountered analysis tasks. In the ideal case, it is ossible to calculate the degree-of-belief in a model based on the data. This otion is only available when a comlete set of models and their rior robabilities can be defined. However, the conditions necessary for this ideal case are usually not met in ractice. We nevertheless often want to make some statement concerning the validity of the model(s). We then are left with using robabilities of data outcomes assuming the model to try to make some judgments. These robabilities can be determined deductively since the model is assumed, and therefore frequencies of ossible outcomes can be roduced within the context of the model. These can then be used to roduce frequency distributions of discreancy variables, and -values (one-sided robabilities for the discreancy variables) can be calculated using the distributions and the observed values. The use of -values has been widely discussed in the literature [] and many authors have commented that -values are frequently misused in claiming suort for models []. We give a Bayesian argumentation for the use of -values to make judgments on model validity, and it is in this Bayesian sense that we will use -values. We start with a review of model testing in a Bayesian aroach when an exhaustive set of models is available and a coherent robability analysis is ossible. We then move to situations where this is not the case, and review some ossible choices of discreancy variables for Goodness-of-Fit (GoF) tests. Next, we define the -values for the discreancy variables and evaluate their usefulness for several examle data sets. Note that we omit a discussion of Bayes factors since these require the definition of at least two models for evaluation. Full set of models available. Formulation Assume we have a comlete set of models available for describing the data, such that we are sure the data can be described by one of the models available. The models can be used to calculate direct robabilities ; i.e., relative frequencies of ossible outcomes of the results were one to reroduce the exeriment many times under identical conditions. The robability of a model, M, is denoted by P (M), with P (M), () A discreancy variable [] is an extension of classical test statistics to allow ossible deendence on unknown (nuisance) arameters. We use robability to also include robability density.

3 while the robability densities of the model arameters, θ, are tyically continuous functions. In the Bayesian aroach, the quantities P (M) and P ( θ M) are treated as robabilities, although they are not frequency distributions and are more accurately described as degrees-ofbelief (DoB). This DoB is udated by comaring data with the redictions of the models. P (M = A) = reresents comlete certainty that A is the model which describes the data, and P (A) = reresents comletely certainty that A is not the correct model. The rocedure for udating our DoB using exerimental data is P i+ ( θ, M D) P ( x = D θ, M)P i ( θ, M), () where the index on P reresents a state-of-knowledge. The osterior robability density function, P i+, is usually written simly as P, and the rior is written as P. The osterior describes the state of knowledge after the exeriment is analyzed. The quantity P ( x = D θ, M) reresents a robability of getting the data D given the model and arameter values, and can usually be defined in a number of ways (see for examle Section.). Normalizing, and using P ( D) = M P ( x = D θ, M)P ( θ, M)d θ yields P ( θ, M D) = P ( D θ, M)P ( θ, M) P ( D) This is the classic equation due to Bayes and Lalace [].. () Models can be comared and the DoB in a model can be obtained using P (M D) = P ( θ, M D)d θ. () This evaluation requires the secification of a full set of models and a definition of the rior beliefs such that P (M) =. (5) M It is very sensitive to the definitions of the riors when the data are not very selective.. Examle with full set of models An examle where this aroach was used is given in [5], and is reviewed here. The analysis is to be erformed on an observed energy sectrum, for which we can form a backgroundonly hyothesis, or a background+signal hyothesis. An examle is the search for neutrinoless double beta decay. Note that there is in rincile no mathematical distinction between model and arameters. In ractice, we distinguish them because models are fixed constructs for which we evaluate the degree-of-belief that the model is correct, whereas arameters can take on a range of values and the analysis is used to extract a degree-of-belief for a articular value.

4 In the following, H denotes the hyothesis that the observed sectrum is due to background only; the negation, interreted here as the hyothesis that the signal rocess contributes to the sectrum, is labeled H. The osterior robabilities for H and H can be calculated using P (H sectrum) = P (sectrum H) P (H) P (sectrum) () and P (H sectrum) = P (sectrum H) P (H) P (sectrum). (7) The robability P (sectrum) is P (sectrum) = P (sectrum H) P (H) + P (sectrum H) P (H). (8) The robabilities for a given sectrum can be calculated based on assumtions on the signal strength and background shae as described in [5]. Evidence for a signal or a discovery is then decided based on the resulting value for P (H sectrum). In this analysis, it was assumed that the observed sectrum must come from either the background model or a combination of the background and double beta decay signal. The robability of each case is evaluated and conclusions are drawn from these robabilities. Incomlete set of models In most cases, we analyze data without having an exhaustive set of models available, but nevertheless want to reach conclusions on how well the models account for the data. This information can be used, for examle, to guide the search for new models. In the examle given above, it is ossible that there are unknown sources of background for which redictions are not ossible before the exeriment is erformed. The quantities P (sectrum H) and P (sectrum H) could be individually examined and, if both are on the small side of the exected distribution, doubts concerning the comleteness of our set of models could arise.. General aroach to GoF tests For a given model, we can define one or more discreancy variable(s) and calculate the exected frequency distribution of this discreancy variable. If the discreancy variable is well chosen, then the distribution for a good model should look significantly different than for a bad model. Finding the discreancy variable in the region oulated by incorrect models then gives us cause to think our model is not adequate. Since the shae of the background sectrum is assumed to be known the case of unknown background sources contributing to the measured sectrum is ignored. However, the overall level of background is allowed to vary.

5 . Definition of a -value A -value is the robability that, in a future exeriment, the discreancy variable will have a larger value (indicating greater deviation of the data from the model) than the value observed, assuming that the model is correct and all exerimental effects are erfectly known. In other words, not only is the model the correct one to describe the hysical situation, but correct distribution functions are used to reresent data fluctuations away from the true values. We will focus on GoF tests for the underlying model, but it should be clear that incorrect formulations of the data fluctuations will bias the -value distributions to lower (if the data fluctuations are underestimated) or higher (if the data fluctuations are overestimated) values. In general, any discreancy variable which can be calculated for the observations can be used to define a -value. We use R( x θ, M) and R( D θ, M) to denote discreancy variables evaluated with a ossible set of observations x for given model and arameter values, and for the observed data, x = D, resectively. To simlify the notation, we will occasionally dro the arguments on R and use R D to denote the value of the discreancy variable found from the data set at hand. R can be interreted as a random variable (e.g., ossible χ values for a given model), whereas R D has a fixed value (e.g., the observed χ derived from the data set at hand). Assuming that smaller values of R imly better agreement between the data and model redictions, the definition of (for continuous distributions of R) is written as: = P (R θ, M)dR. (9) R>R D The quantity is the tail-area robability to have found a result with R( x) > R D, assuming that the model M and the arameters θ are valid. If the modeling is correct (including that of the data fluctuations), will have a flat robability distribution between [, ]. For discrete distributions of R, the integral is relaced by a sum, the -value distribution is no longer continuous, and the cumulative distribution for will be ste-like. If the existing data are used to modify the arameter values, the extracted -value will be biased to higher values. The amount of bias will deend on many asects, including the number of data oints, the number of arameters, and the riors. We can remove the bias for the number of fitted arameters in χ fits by evaluating the robability of R = χ for N n degrees-offreedom, P (χ N n), where N is the number of data oints and n is the number of arameters fitted [], when the data fluctuations are Gaussian and indeendent of the arameters, the function to be comared to the data deends linearly on the arameters, and the arameters are chosen such that χ is at its global minimum. In general, the bias introduced by the number of fitted arameters becomes small if N n. -values cannot be turned into robabilistic statements about the model being correct without riors, and statements of suort for a model directly from the -value behave incoherently []. Furthermore, aroximations used for the distributions of the discreancy variables,

6 biases introduced when model arameters are fitted and difficulties in extracting reliable information from numerical algorithms used to evaluate the discreancy variable (see section..) further comlicate their use. -values should therefore be handled with care. Nevertheless, we discuss the use of -values to make judgments about the models at hand. The judgment will be based on a sequence of considerations of the tye: the -value distribution for a good model is exected to be (reasonably) flat between [, ]; the -values for bad models usually have sharly falling distributions starting at = ; small -values are worrisome; if we know that other models can be reasonably constructed which would have higher -values, then a small -value for the model under consideration indicates that we may have icked a oor model; if the -value is not too small, then our model is adequate to describe the existing data.. Bayesian argumentation We contend that the use of -values for evaluation of models as just described is essentially Bayesian in character. Following the arguments given above, assume that the -value robability density for a good model, M, is flat, P ( M ) =, and that for oor models, M i (i =... k), can be reresented by P ( M i ) c i e c i where c i so that the distribution is strongly eaked at and aroximately normalized to. The DoB assigned to model M after finding a -value is then If we take all models to have similar rior DoBs, then P (M ) = P ( M )P (M ) k i= P ( M i)p (M i ). () P (M ) P ( M ) k i= P ( M i). In the limit, we have while for c i i > P (M ) + k i= c i P (M ). Although this formulation in rincile allows for a ranking of models, the vague nature of this rocedure indicates that any model which can be constructed to yield a reasonable - value should be retained. A further consideration is that the correct distributions for the data fluctuations are often not known (due to the vague nature of systematic uncertainties) and best guesses are used. This will generally also lead to non-flat -value distributions for good models. Scientific rejudices (Occam s razor, elegance or esthetics, etc.) will influence the decision and act as a guide in selecting the best model in cases where several good models are available. 5

7 . Discreancy variables considered in this aer.. χ test for data with Gaussian uncertainties For uncorrelated data assumed to follow Gaussian robability distributions relative to the model redictions, the χ value is a natural discreancy variable for a GoF test: ( N y i f(x i ) θ, M) R G = χ = () i where the N data oints are given by {(x i, y i )}, and the rediction of the model for y i is f(x i θ, M). The modeling of the data fluctuations uses fixed standard deviations σ i. If the arameters of the model are fitted to the data by minimizing χ and there are n arameters we relace θ in the formula above with θ. The χ robability distribution is evaluated for N n degrees-of-freedom. In the secial case where f is linear in the arameters, this rocedure again yields a flat -value distribution between [, ]. σ i.. Runs test The standard χ test does not take into account clustering of data below or above exectations. To detect clusters the ordered set of N observations {(x i, y i )} is artitioned into subsets containing the success and failure runs (defined as sequences of consecutive y i above or below the exectation from the model, f(x i θ, M), resectively). Several discreancy variables based on success runs can be found in the literature [7] but these do not take into account the size of the deviation, y i f(x i θ, M). Recently, a discreancy variable for runs incororating this extra information was roosed for ordered data with Gaussian fluctuations [8]. Let A j denote the subset of the observations of the j th success run. The weight of the j th success run is then taken to be ( y i f(x i θ, ) M) χ run,j = j +N j i=j where the sum over i covers the {(x i, y i )} A j and N j is the length of the run. The discreancy variable is the largest weight of any run σ i R sr = max χ run,j. j The exlicit distribution of R sr, used to define the -value, = P (R sr > R D sr N), is given in [8] for the case when ( θ, M) are fully secified (no fitting). A similar discreancy variable can be defined for failure measurements, R fr. To illustrate the definition we resent a simle examle. Suose N = 5 observations at x ositions (,,,, 5) with standardized residuals (y i f(x i θ, M))/σ i given by (.,.,.8,.,.). Then there are two success runs A = {(,.)}, A = {(,.), (5,.)} and we find R sr =. +. =. due to the second run. Similarly, for the single failure run, R fr =.5.

8 .. χ tests for Poisson distributed data In the case of binned, Poisson distributed data, the quantity R P = χ P = N b i= (m i λ i ) λ i () can be used as a discreancy variable where N b is the number of bins, the m i are the numbers of measured events and λ i = λ i ( θ, M) are the model exectations for event bin i; for an exectation λ i the exected variance is λ i. This Pearson χ statistic [9] was originally roosed for multinomial data but has found wide use in analyzing Poisson distributed data. Rather than using the robability distribution of this discreancy variable directly, P (χ N b ) is often used as an aroximation for P (R P ) 5. The distribution of R P has been shown to asymtotically reach a χ distribution for multinomially distributed data, giving some justification for this rocedure. Practically, this is the case for data with a large number of entries m i in all bins. When arameters are first estimated from a fit to the data, the arameter values which minimize R P are used to calculate the λ i, and the -value is evaluated using P (χ N b n). It is often seen that the exected weight, λ i, is relaced with the observed weight, m i, in the denominator (Neyman χ []). The discreancy variable is then R N = χ N = N b i= (m i λ i ) m i. () Again, rather than using the robability distribution of this discreancy variable directly, it is assumed that R N has a distribution which aroximates a χ distribution with N b degrees of freedom and P (χ N b ) is used as an aroximation for P (R N ). In cases where m i =, ractitioners of this aroach set m i = to avoid divergence. Sometimes bins with m i = are ignored, which can lead to very misleading results since finding m i = is valuable information. When arameters are first estimated from a fit to the data, the arameter values which minimize R N are used, and the -value is evaluated using P (χ N b n). Note that in both of these cases, we do not exect flat -value distributions since only aroximations are used for P (R P/N ). The deviations from flatness are exected to be greatest when small event numbers are resent in the data sets... Likelihood ratio test for Poisson distributed data Another otion for a Poisson model is based on the (log of) the likelihood ratio [] (sometimes referred to as the Cash statistic []) R C = log P ( x λ i = m i ) P ( x λ i = λ i ( θ)) = N b i= [ λ i m i + m i log m ] i λ i. () 5 The use of this aroximation robably dates back to a time when comlicated numerical calculations were not ossible. 7

9 In bins where m i =, the last term is set to. Again, rather than using the robability distribution of this discreancy variable directly, since R C has a distribution which aroximates a χ distribution with N b degrees of freedom for large λ i, P (χ N b ) is used as an aroximation for P (R C ). The validity of this aroximation is critically discussed in []. When arameters are first estimated from a fit to the data, the arameter values which maximize the likelihood are used, and the -value is evaluated using P (χ N b n). Note that the distribution of R C asymtotically converges faster to the χ distribution than the distribution of R P (see [])...5 Probability of the data test Any robability of the data can be chosen as the discreancy variable: R L = P ( x θ, M). In this case, larger values of R L imly better agreement with the data. The robability P ( x θ, M) can be used to extract the robability for R L as P (R L )dr L = P ( x θ, M)d x. R L <P ( x θ,m)<r L +dr L An examle of how this is done numerically for Poisson distributed data and using P ( m θ, M) = N b i e λ i λ m i i m i! is given in the Aendix. Once we have P (R L θ, M) we can then evaluate = P (R L )dr L where R D L R L <R D L is the value observed with the data set at hand. In a model with Gaussian uncertainties where we use a roduct of Gaussian densities, P ( x θ, M) N P (x i µ i, σ i ) i where x i N (µ i, σ i ), then taking R L = P ( x θ, M) is equivalent to the usual χ test. If the model arameters are first fitted using the data, we roose the following correction for the number of fitted arameters: calculate the -value taking RL D (no fitted arameters); = P ( x = D θ, M), but assuming a simle hyothesis calculate the χ value which corresonds to this -value using the inverse χ distribution corresonding to N degrees of freedom; 8

10 recalculate the -value using the χ value and N n degrees of freedom. This rocedure is valid for the case where we have Gaussian uncertainties, but is ad-hoc for other cases. We test its usefulness below. Care must be taken in using R L as a discreancy variable, articularly when it is written as the roduct of individual robability densities. The overall shae of the distribution is otentially not tested, and large -values can be roduced with incorrect model choices. An examle of this is given below in Section.... Partial/Prior/Posterior-redictive -value Rather than using the arameters at the mode of the osterior, it is also ossible to define a -value by averaging over the arameter values according to a robability distribution [5]. For the osterior-redictive case [], assuming we are using a robability of the data as discreancy variable, [ = R L <R D L P (R L θ, M)dR L ] P ( θ D, M)d θ. (5) While the artial osterior-redictive -value [5] has the desirable roerty of a flat distribution on [, ], at least in the large samle limit as N, it is not known in general how to comute it for realistic roblems. Furthermore, the numerical effort required to evaluate the double integral in (5) quickly becomes rohibitive. Therefore, these -value definitions are not considered here desite their aeal from a Bayesian ersective...7 Johnson test Johnson [7] roosed a modification of the χ definition to take into account the osterior robability density for the arameters of a model. Rather than evaluating the robability of the data at fixed values of the arameters, the arameters are given values according to their robability density after evaluating the data. Johnson s statistic, R J, is exected to behave asymtotically as a χ distribution with N b degrees of freedom, where N b is a number of bins to be defined, regardless of the number of arameters. Imagine the data is given by a vector of values x and these values are exected to deviate from the model rediction according to individual robabilities P (x j θ, M). The Johnson rescrition is to define N b bins, where bin i contains a robability P i. Intervals x i,j are defined for each data oint j via P i ( θ) = P (x j θ, M)dx j. x i,j The intervals x i,j cover the full range of ossible values for each x j and are ordered so that they include increasing values of x. The definition of the intervals deends on the values of the arameters θ and vary for each data oint j. The arameter values are to be samled from 9

11 the full osterior robability P ( θ D, M). For each θ, a discreancy variable R J is calculated according to ( m i NP i ( θ) ) R J = N b i= NP i ( θ) where m i is the actual number of data oints falling within the intervals x i,j and N is the total number of data oints in the data set. In [7] it is shown that asymtotically R J is distributed as a χ variable with N b degrees of freedom, regardless of the number of fitted arameters n. Hence the -value is calculated using P (χ N b ). For data where the x follow continuous robability densities the bins are usually chosen to have equal robabilities. The number of bins to be chosen is given by a rule of thumb due to Mann/Wald [8] with a modification for small number of bins such that there are at least three: N b = e. log(n) +. This means by default we have N b (N = 5) = 5, N b (N = ) = 8, N b (N =, ) = 7. In case the values of x follow a discrete distribution it is usually not ossible to have equal robabilities for all P i. In this case, a randomization rocedure is used to allocate data oints to bins. () Test of -value definitions In the following, we test the usefulness of the different -values given above by looking at their distributions for secific examles motivated from common situations faced in exerimental hysics. We first consider a data set which consists of a background known to be smoothly rising and, in addition to the background, a ossible signal. This could corresond for examle to an enhancement in a mass sectrum from the resence of a new resonance. The width of the resonance is not known, so that a wide range of widths must be allowed for. Also, the shae of the background is not well known. We do not have an exhaustive set of models to comare and want to look at GoF s for models individually to make decisions. In this examle, we will first assume that distributions of the data relative to exectations are modeled with Gaussian distributions. We then consider the same roblem with small event numbers, so that Poisson statistics are aroriate, and again test our different -value definitions. These examles were also discussed in [9]. Finally, we consider the case of testing an exonential decay law on a samle of measured decay times.

12 . Data with Gaussian uncertainties.. Definition of the data A tyical data set is shown in Fig.. It is generated from the function f(x i ) = A + B x i + C x i + D σ (x i µ) π e σ, (7) with arameter values (A =, B =.5, C =., D = 5, σ =.5, µ = 5.). The y i are generated from f(x i ) as y i = f(x i ) + z i where z i is samled according to N (, ). y 5 5 Data Model I Model II Model III Model IV x Figure : Examle data set for the case N = 5 with Gaussian fluctuations. The fits of the four models are suerosed on the data. The x domain is [, ] and two cases were considered: N = 5 data oints evenly samled in the range, and N = data oints evenly samled in the range. The exerimental resolution (Gaussian with width ) is assumed known and correct. Ensembles consisting of, data sets with N = 5 or N = data oints each were generated and four different models were fitted to the data. Table summarizes the arameters available in each model and the range over which they are allowed to vary. In all models, flat riors were assumed for all arameters for ease of comarison between results. The models were fitted one at a time. Two different fitting aroaches were used: ) The fitting was erformed using the gradient-based fitter MIGRAD from the MINUIT ackage [] accessed from within the Bayesian Analysis Toolkit (BAT) [9]. The starting values for the arameters were chosen at the center of the allowed ranges given in Table.

13 Small range Large range Model Par Min Max Min Max I. A I + B I x i + C I x i A I 5-5 B I. -5 C I II. A II + D II σ II π e (x i µ II ) σ II A II -5 B II -5 µ II 8 5 σ II. III. A III + B III x i + D III σ III π e (x i µ III ) σ III A III -5 B III -5 D III µ III 8 5 σ III. IV. A IV + B IV x i + C IV x i + D IV σ IV π e (x i µ IV ) σ IV A IV -5 B IV -5 C IV.5-5 D IV µ IV 8 5 σ IV. Table : Summary of the models fitted to the data, along with the ranges allowed for the arameters.

14 ) The Markov Chain Monte Carlo (MCMC) imlemented in BAT was run with its default settings first, and then MIGRAD was used to find the mode of the osterior using the arameters at the mode found by the MCMC as starting oint. Since it is known that the distributions of the data are Gaussian and flat riors were used for the arameters, the maximum of the osterior robability corresonds to the minimum of χ. The -values were therefore extracted for each model fit using the χ robability distribution as discussed in Sec.... Model I Model II 8 MIGRAD MCMC Model III.5 Model IV Figure : -value distributions based on R G. The arameters of the models discussed in Table are fitted to N = 5 data oints and allowing arameter values in the small range. The -values corresond to N n degrees of freedom, where n is the number of fitted arameters... Comarison of -values The -value distributions for the four models are given in Fig. for the case N = 5 and using the small fit range from Table. Two different histograms are shown for each model,

15 corresonding to the two fitting techniques described above. The distributions for models I and II are eaked at small values of for MIGRAD only and for MCMC+MIGRAD, and the models would be disfavored in most exeriments. For models III and IV, there is a significant difference in the -value distribution found from fitting using only MIGRAD, or MCMC+MIGRAD, with the -value distribution in the latter case being much flatter. An investigation into the source of this behavior showed that the -value distribution from MIGRAD is very sensitive to the range allowed for the arameters. This can be seen in Fig. where the same fits were erformed but now using the large ranges for the arameters (see table). In the case where larger arameter ranges are allowed, MIGRAD will converge to arameter values which are not at the global mode more often, and the choice of the starting oint and the (initial) ste size for the fit are crucial. Even in this rather simle fitting roblem, it is seen that the use of the MCMC can make a significant difference in the fitting results Model I MIGRAD MCMC Model III Model II Model IV....8 Figure : -value distributions based on R G. The arameters of the models discussed in Table are fitted to N = 5 data oints and allowing arameter values in the large range. The -values corresond to N n degrees of freedom, where n is the number of fitted arameters. The results from the MCMC+MIGRAD fits also deended on the fit range, although to a lesser extent. Figure shows the -value distributions from χ fits to N = 5 data oints for Model IV for the two arameter ranges. There are small but nevertheless significant differences

16 in the distribution, indicating that the arameter range is also an imortant factor in the -value distribution. Note that we are not execting flat -value distributions since the correction for the number of fitted arameters is only valid for models linearly deendent on the arameters. This is not the case here since we have a Gaussian term in the model (see 7)... Model IV.8... MCMC - small range MCMC - large range....8 Figure : -value distributions based on R G for model IV. The arameters of the model are fitted to N = 5 data oints and allowing arameter values in the two ranges range. The -values corresond to N n degrees of freedom, where n is the number of fitted arameters. Focusing on the results from the MCMC+MIGRAD fits, models III and IV give flat - value distributions and would have a large DoB in a majority of exeriments. In general, these distributions satisfy our exectations and these -values can be used to udate our DoBs in the models as described in Section.. The -value distributions for the four models and using the Johnson discreancy variable are given in Fig. 5. Note that in the results shown for this descreancy variable, we have used θ rather than samling θ according to the osterior. We verified that this did not significantly change the -value distribution of this discreancy variable. The distributions show siky behavior, which becomes much smoother as the number of data oints increases. However, there is still very little discriminating ower between the models using this discreancy variable. This was generally the case for other examles considered, and we therefore do not show any further results from the Johnson discreancy variable in this aer. For all further results resented in this aer, we have erformed MIGRAD only fits but with starting arameter values located near the global mode (whose location can be estimated since the underlying model is known). Only the smaller fit ranges were used. In this way, we aroximate the results which would be achieved with otimal fitting. This allows us to generate and analyze larger data sets, and thus have more sensitivity to the shae of the -value distributions. The -value distributions for the four models and using the runs discreancy variable are given in Fig.. In this figure, there are two -value distributions in each lot since we consider both cases where we have runs of data oints below the exectations, and runs of data oints above the exectations. For the correct model, the joint distribution of (R sr ) and (R fr ) should 5

17 .5 Model I.5 Model II MIGRAD MCMC Model III.5 Model IV Figure 5: -value distributions based on R J. The arameters of the models discussed in Table are fitted to N = 5 data oints and allowing arameter values in the small range. The -values corresond to N b degrees of freedom, where N b = 5 is the number of bins.

18 be symmetric. The -values should generally not be too small. The distribution of (R sr ) vs. (R fr ) is shown in Fig. 7 for the examle under consideration. The biasing of the -values resulting from the fitting of arameters is aarent. In using R fr and R sr, rather large -values should therefore be exected for valid models. 7 Model I 7 Model II 5 Success Failure Model III.5 Model IV Figure : -value distributions based on R sr and R fr. The arameters of the models discussed in Table are fitted to N = 5 data oints and allowing arameter values in the small range. The results for the ensembles with N = data oints are shown in Figs. 8 and 9. The models I and II now usually result in very small -values using both the standard χ and runs discreancy variables. For the χ fits, the -value distribution for model IV is again rather flat, as exected, but it would generally be difficult to conclude that model III is not adequate. Although this -value distribution is falling, as oosed to the case where we only had N = 5 data oints, there is still a large robability to get a sizable -value. The runs statistic has a sharly eaked distribution near = (for success runs) for models I,II, and should therefore have better discriminating ower than the standard χ test. In other words, it should more often lead to the desired result that the incorrect models are discarded. 7

19 sr Model IV Model I Figure 7: Joint distribution of the -values for success and failure runs. The results are for N = 5 data oints. Marker sizes in each bin are chosen relative to the bin with the highest robability of the individual model. Bins with robability less than.5 have been excluded from the lot for the urose of clarity. fr For model III, the success runs variable behaves similarly to the χ test, but the failure runs variable is rather flat. For the correct underlying model, model IV, we see that both success and failure runs variables have similar -value distributions. 8

20 8 8 Model I MIGRAD Model II Model III.5 Model IV Figure 8: -value distributions based on R G. The arameters of the models discussed in Table are fitted to N = data oints and allowing arameter values in the small range. The -values corresond to N n degrees of freedom, where n is the number of fitted arameters. 9

21 8 8 Model I Success Failure Model II Model III.5 Model IV Figure 9: -value distributions based on R sr and R fr. The arameters of the models discussed in Table are fitted to N = data oints and allowing arameter values in the small range.

22 . Examle with Poisson uncertainties.. Definition of the data We again use the function f(x) (see (7)), but now generate data sets with fluctuations from a Poisson distribution. The function is used to give the exected number of events in a bin covering an x range, and the observed number of events follows a Poisson distribution with this mean. For N b = 5 bins, e.g., the third bin is centered at x =. and extends over [.,.]. The exected number of events in this bin is defined as λ =.. N b f(x)dx... Comarison of -value distributions For the Poisson case we consider the four discreancy variables described in sections to comare the -value distributions for the four models (I-IV). We also show the -value distribution for the true model for comarison, where the -value is calculated by evaluating R = R( x θ true, M). Different fits were erformed for each of the different discreancy variable definitions. For the χ discreancy variables, χ minimization was erformed using either the exected or observed number of events as weight. For the likelihood ratio test, a maximum likelihood fit was erformed. The likelihood was defined as P ( m θ, M) = N b i= e λ i λ m i i m i! (8) where m i is the observed number of events in a bin and λ i = λ i ( θ). Since we use flat riors for the arameters, the same results were used for the case where the discreancy variable is the robability of the data. A tyical likelihood fit result is shown in Fig. together with the data, while the -value distributions are given in Figs. -. For all definitions of -values considered here, models I,II generally lead to low DoBs, while models III and IV generally have high DoBs. However, there are differences in the behavior of the -values which we now discuss in more detail. The lower left anel in Figs. - show the -value distribution taking the true model and the true arameters, and can be used as a gauge of the biases introduced by the -value definition. There is a bias for all discreancy variables shown in Fig. because aroximations are used for the robability distribution of the discreancy variable (using P (χ N) rather than P (R P N b ) or P (R N N b ). This bias is usually small, excet in the case where the Neyman χ discreancy variable is used. In this case, the robability of a very small -value is much too high even with the true arameters. The -value distribution using the robability of the data, on the other hand, is flat when the true arameters are used as seen in Fig., so that this choice would be otimal in cases where no arameters are fit.

23 number of events 5 5 Data Model I Model II Model III Model IV x Figure : Examle fits to one data set for the four models defined in Table for N b = 5 and the small arameter range. The lines shows the fit functions evaluated using the best-fit arameter values... Pearson χ and Neyman χ Comarison of the behavior of the Pearson χ to the Neyman χ, Fig., clearly shows that using the exected number of events as weight is suerior. The sike at in the -value distribution when using the observed number of events indicates that this quantity does not behave as exected for a χ distribution when dealing with small numbers of events, and will lead to undesirable conclusions more often than anticiated. The behavior of the Pearson χ using the exected number of events for the different models is quite satisfactory, and this quantity makes a good discreancy variable even in this examle with small numbers of events in the bins. Both of these -value distributions become flat in the case of large number of events in all bins... Likelihood Ratio While models I,II are strongly eaked at small -values, for this definition of a -value models III,IV also have a falling -value distribution. This is somewhat worrisome. In assigning a DoB to models based on this -value, this bias towards smaller -values should be considered, otherwise good models will be assigned too low a DoB. The behavior of the -value extracted from Pearson s χ is referable to the likelihood ratio, as it will lead to value judgments more in line with exectations...5 Probability of the Data Using this definition, the -value distribution is flat for the true model, since this is the only case where the correct robability distribution of the discreancy variable is emloyed. For the

24 8 8 Model I Pearson Neyman Likelihood ratio Model II Model III 5 Model IV True model....8 Figure : -value distributions based on R P, R N and R C. The arameters of the models discussed in Table are fitted to N b = 5 bins and allowing arameter values in the small range. The -values corresond to N n degrees of freedom, where n is the number of fitted arameters.

25 fitted models, two -value distributions are shown - one uncorrected for the number of fitted arameters, and one corrected for the fitted arameters as described in Section..5. Using the uncorrected -values, both models III and IV show -value distributions eaking at =, so that both models would tyically be ket. As exected, the correction for the number of fitted arameters ushes to lower values, and roduces a nearly flat distribution for model IV. This ad-hoc correction works well for this examle. In general, this -value definition has similar roerties to the Pearson χ statistic, but with the advantage of a flat -value distribution when the true model was used and no arameters were fit.. Exonential Decay When dealing with event samles with continuous robability distributions for the measurement variables, it is common ractice when determining the arameters of a model to use a roduct of event robabilities (unbinned likelihood): P ( x θ, N M) = P (x i θ, M). i= If the model chosen is correct, then this definition for the robability of the data (or likelihood) can be used successfully to determine aroriate ranges for the arameters of the model. However, P ( x θ, M) defined in this way has no sensitivity to the overall shae of the distribution and can lead to unexected results if this quantity is used in a GoF test of the model in question. We use a common examle, exonential decay, to illustrate this oint (see also discussions in [] and []). Our model is that the data follows an exonential decay law. We measure a set of event times, t, and analyze these data to extract the lifetime arameter τ. We define two different robabilities of the data D = {t i } (i =... N) I Unbinned likelihood II Binned likelihood ( ) P D τ = N i= τ e t i/τ, P ( D τ) = N b i= λ m i ti + t i m i! e λ i, λ i (τ) = dt N t i τ e t/τ. In the first case, the robability density is a roduct of the densities for the individual events, while in the second case the events are counted in time intervals and Poisson robabilities are calculated in each bin. The exected number of events is normalized to the total number of observed events. We consider time intervals with a width t = unit and time measurements ranging from t = to t =. The overall robability is the roduct of the bin robabilities. For each of these two cases, we consider the -value determined from the distribution of R = P ( D τ). In order to make the oint about the imortance of the choice for the discreancy variable for GoF tests, we generated data which do not follow an exonential distribution. The data is generated according to a linearly rising function, and a tyical data set is shown in Fig..

26 8 8 Model I N DoF N - n DoF Model II Model III.5 Model IV True model Figure : -value distributions based on R L. The arameters of the models discussed in Table are fitted to N b = 5 bins and allowing arameter values in the small range. The -values corresond to N b and N b n degrees of freedom, where n is the number of fitted arameters. 5

27 number of decays Unbinned data Binned data Unbinned ML fit t Figure : Tyical data set used for the fitting of the exonential model. The individual event times are shown, as well as the binned contents as defined in the text. The best fit exonential from the unbinned likelihood is also shown on the lot, normalized to the total number of fitted events... Product of exonentials If the data are fitted using a flat rior robability for the lifetime arameter, then we can solve for the mode of the osterior robability of τ analytically, and get the well-known result τ = N ti. Defining ξ t i, so τ = ξ/n, we can also solve analytically for the -value: = t i >ξ dt dt... (τ ) N e t i /τ and the result is = P (N, N) with the regularized incomlete Gamma function P (s, x) = γ (s, x) Γ (s) = x ts e t dt t s e t dt. Surrisingly, as N increases is aroximately constant with a value.5. Regardless of the actual data, is never small and deends only on N. The best fit exonential is comared to the data in Fig., and yields a large -value although the data is not exonentially distributed. This

28 -value definition is clearly useless, and the unbinned likelihood is seen to not be aroriate as a GoF statistic. The definition of χ in the Gaussian data uncertainties case also results from a roduct of robability densities, so the question as to what is different in this case needs clarification. The oint is that in the case of fitting a curve to a data set with Gaussian uncertainties, the data are in the form of airs, {(x i, y i )}, where the measured value of y is assigned to a articular value of x, and the discreancy between y and f(x) is what is tested. In other words, the value of the function is tested at different x so the shae is imortant. In the case considered here, the data are in the form {x i }, and there is no measurement of the value of the function f(x) at a articular x. The orderings of the x i is irrelevant, and there is no sensitivity to the shae of the function... Product of Poisson robabilities In this case, the -value for the model is determined from the roduct of Poisson robabilities using event counting in bins as described above. Since the exected number of events now includes an integration of the robability density and cover a wide range of exectations, it is sensitive to the distribution of the events and gives a valuable -value for GoF. The -value using R P for the data shown in Fig. is = within the recision of the calculation. The exonential model assumtion is now easily ruled out. In comarison to the unbinned likelihood case, the data are now in the form m, where the m i now refer to the number of events in a well-defined bin i which defines an x range. In other words, we have now a measurement of the height of the function f(x) at articular values of x, and are therefore sensitive to the shae of the function. As should be clear from this examle, the choice of discreancy variable can be very imortant in GoF decisions: Maximum Likelihood estimation with unbinned data will always give an otimal arameter estimate in terms of bias and variance, but it will give no information about the correctness of the model. 5 Discussion We have investigated a number of ossible discreancy variables and -value definitions for Goodness-of-Fit tests, with the understanding that these -values are to be used to make judgments on the accetability of models. The sense in which the -values are used follows Bayesian logic as described in Section.. For this urose, the -value distributions should be reasonably uniform for models which are considered good and they should eak at small values for oor models. Using these requirements to guide us, we classify our results in Table for the examle data sets and models described in the text. Based on our exerience with these examles, we also summarize our suggestions for the use of the different discreancy variables and -value definitions considered here. Most common -values use aroximations for the distribution of the discreancy variable. This leads to non-flat distributions of the -value even when no arameters are fitted, and these 7

29 Discreancy variable -value definition Data tye Comment RG P (χ > R G D N n) D = {(x i, yi)} Flat if n = or model linear + Rsr/fr P (Rsr/fr > R sr/fr D N) D = {(x i, yi)} Generally not flat for n > + RP P (χ > R P D N b n) D = m Reasonably flat + RN P (χ > R N D N b n) D = m Peaked at for small numbers of events RC P (χ > R C D N b n) D = m Distribution for R C is less flat than that for RP + RL P (RL < R L D θ, M) D = {(x i, yi)} Same as RG + RL P (RL < R L D θ, M) D = m Almost flat with correction + RL P (RL < R L D θ, M) D = {x i} No sensitivity to shae RJ P (χ > R J D N b ) D = {(x i, yi)} Siky distribution RJ P (χ > R J D N b ) D = m Siky distribution Table : Summary of the discreancy variables considered in this aer and their erformance for different data tyes. Here, n is the number of fit arameters, N is the number of events and Nb denotes the number of bins. The comment concerns the shae of the -value distribution for good models. The last column gives our recommendation for use as a discreancy variable for model judgments based on the given -value definition. 8

30 deviations can be severe in some cases. We discussed the use of the robability of the data itself as a discreancy variable, and showed that it is flat in the case of no fitted arameters and Poisson distributed data. This was due to the use of the correct robability distribution for the discreancy variable. The algorithm described in the Aendix can be used to generate the correct -value distribution for any discreancy variable for Poisson distributed data. For comosite models, where arameters of the model are fit to the data before a -value is calculated, we find that -value distributions can deend strongly on the technical aroach used to fit the data. Using gradient-based aroaches to find minima or maxima requires considerable attention from the user and generally fine tuning of fit ranges and starting values. The finetuning will certainly affect the -value distributions, making their use more difficult. Setting range limits effectively means defining a rior, and imacts -value distributions. First fitting the data with a Markov Chain Monte Carlo before using MIGRAD gave considerably better results, in the sense that the -value distributions extracted in this way followed exectations. We therefore recommend using a MCMC to ma out the arameter sace as a start to the fitting rocedure in situations where giving good starting values for a gradient-based fit is difficult. 9

31 A Fast -value evaluation in MCMC for Poisson distributed data The eak robability for a Poisson distribution P (m λ) = e λ λ m occurs for m = λ. If the robability distribution of a data set is modeled as a roduct of Poisson terms, then the highest robability is given for {m i } = { λ i }. We use this to define the starting oint for a Markov Chain, and move the bin contents u or down (chosen randomly) at each iteration. For each attemted change in the bin content, we aly the usual Metroolis test []. The robability P ( m λ) is easily udated at each change. E.g., if the result in bin i increases from m i to m i, then the robability changes by P P m i!!λm m i m! i m i i. A large number of exeriments can be quickly simulated and the -value extracted. References [] A. Gelman, X. L. Meng, and H. Stern, Posterior redictive assessment of model fitness via realized discreancies, Stat. Sinica, (99) 7. [] See, e.g., M. J. Bayarri and J. O. Berger, P values for Comosite Null Models, Jour. Am. Stat. Assoc., 95 () 7; M. J. Bayarri and J. O. Berger, The interlay of Bayesian and Frequentist Analysis, Statistic. Sci., 9 () 58; I. J. Good, The Bayes/non-Bayes Comromise: a Brief Review, Jour. Am. Stat. Assoc., 87 (99) 597; J. de la Horra and M. T. Rodriguez-Bernal, Posterior redictive -values: what they are and what they are not, Sociedad de Estadstica e Investigacin Oerativa Test () 75; J. I. Marden, Hyothesis Testing: From Values to Bayes Factors, Jour. Am. Stat. Assoc., 95 () ; T. Sellke, M.J. Bayarri and J.O. Berger, Calibration of Values for Testing Precise Null Hyotheses, The American Statistician, 55 (). [] M. J. Schervish, P Values: What They Are and What They Are Not, The American Statistician, 5 (99). [] See, e.g., E. T. Jaynes, Probability Theory - the logic of science, G. L. Bretthorst (ed.) Cambridge University Press (); A. Gelman, J. B. Carlin, H. S. Stern, D. B. Rubin Bayesian Data Analysis, Second Edition (Texts in Statistical Science) Chaman & Hall, London () ; P. Gregory, Bayesian Logical Data Analysis for the Physics Sciences, Cambridge University Press, (5); G. D Agostini, Bayesian Reasoning in Data Analysis - A Critical Introduction, World Scientific,. [5] A. Caldwell and K. Kröninger, Signal discovery in sarse sectra: A Bayesian analysis, Phys. Rev. D 7 () 9.

arxiv: v1 [physics.data-an] 26 Oct 2012

arxiv: v1 [physics.data-an] 26 Oct 2012 Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch