Fall 2012 Analysis of Experimental Measurements B. Eisenstein/rev. S. Errede


Hypothesis Testing: Suppose we have two or (in general) more simple hypotheses which can describe a set of data. "Simple" means explicitly defined, so if parameters have to be fitted, that has already been done. Let us (arbitrarily) call the most important hypothesis H₀, the null hypothesis. The other hypothesis is called H₁, the alternative hypothesis.

We want to make a rigorous, quantitative decision in terms of choosing between the two hypotheses H₀ and H₁, based on measurements (which may have already been performed). Let us call that set of measurements an experiment, and let W represent the space of all possible outcomes of that experiment. We can think of W as containing two interesting regions:

Region 1 is the so-called critical region (aka the rejection region) ω. If an experiment has its outcome in the critical region ω, then we reject the null hypothesis H₀.

Region 2 is the acceptance region W − ω. If an experiment has its outcome in the region W − ω, then we accept the null hypothesis H₀.

In general, we use a test statistic to define the two regions. Thus, hypothesis testing is reduced to studying the properties of test statistics. Let the random variable x be defined as the test statistic.

We define the level of significance α of the hypothesis test as the probability that the test statistic x will fall in the critical region (aka rejection region) when H₀ is true:

   α = P(x ∈ ω | H₀)

Thus, we see that α is the probability that we will reject the null hypothesis H₀ even though H₀ is true. This is bad. Physically, α represents a loss, and therefore hopefully α is small.

We define the power of the hypothesis test, 1 − β, as the probability that the test statistic x will fall in the critical region when the alternative hypothesis H₁ is true:

   1 − β = P(x ∈ ω | H₁)

Thus, we see that 1 − β is the probability that we will reject the null hypothesis H₀ when the alternative hypothesis H₁ is true. This is good. Therefore, hopefully 1 − β is large.

P598AEM Lecture Notes 5

Finally, we see that β must therefore be the probability that we will accept the null hypothesis H₀ when the alternative hypothesis H₁ is true. Physically, β represents a contamination. Thus:

   β = P(x ∈ W − ω | H₁)

This is bad; hopefully β is small.

A good hypothesis test will choose the test statistic x and the critical region ω such that both α (loss) and β (contamination) are small. This will minimize:

a) Errors of the 1st kind (i.e. loss), which occur when the null hypothesis H₀ is rejected even though the null hypothesis H₀ is true (occurs with probability α).

b) Errors of the 2nd kind (i.e. contamination), which occur when the null hypothesis H₀ is accepted when the alternative hypothesis H₁ is true (occurs with probability β).

A PDF f(x|H₀) exists for the test statistic x if the null hypothesis H₀ is true. A PDF f(x|H₁) exists for the test statistic x if the alternative hypothesis H₁ is true. Then:

   α = P(x ∈ ω | H₀) = ∫_ω f(x|H₀) dx

and:

   β = P(x ∈ W − ω | H₁) = ∫_{W−ω} f(x|H₁) dx

These relations are shown graphically in the figure below:

[Figure: the PDFs f(x|H₀) and f(x|H₁) vs. x, with the acceptance region W − ω and the critical region ω (aka rejection region) separated at the critical value x_c; the loss α is the tail of f(x|H₀) lying inside ω, and the contamination β is the tail of f(x|H₁) lying inside W − ω.]
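The α/β trade-off can be made concrete with a minimal numeric sketch. The test-statistic PDFs below, f(x|H₀) = N(0,1) and f(x|H₁) = N(2,1), and the one-sided critical region x > x_c are purely illustrative assumptions, not numbers from these notes:

```python
import math

def norm_cdf(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def loss_and_contamination(x_c, mu0=0.0, mu1=2.0, sigma=1.0):
    """For the critical region x > x_c:
    alpha = P(x > x_c  | H0)  -- loss, error of the 1st kind
    beta  = P(x <= x_c | H1)  -- contamination, error of the 2nd kind"""
    alpha = 1.0 - norm_cdf((x_c - mu0) / sigma)
    beta = norm_cdf((x_c - mu1) / sigma)
    return alpha, beta

alpha, beta = loss_and_contamination(1.645)
print(alpha)  # ~0.05: we wrongly reject H0 about 5% of the time
print(beta)   # ~0.36: H1 outcomes land in the acceptance region 36% of the time
```

Moving x_c to the right decreases α but increases β, which is exactly the compromise discussed above.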

An Example From Particle Physics: Suppose we are carrying out a proton-proton elastic scattering experiment: p + p → p + p. Define this to be the null hypothesis H₀. Assume the experiment uses particle detectors which are only sensitive to charged particles. Sometimes the inelastic reaction p + p → p + p + π⁰ also occurs. Define this to be the alternative hypothesis H₁. Now, the neutral pi-meson decays via the electromagnetic interaction to two gamma rays, i.e. π⁰ → γ + γ. But since the gamma rays are electrically neutral particles, they will not be detected in this experiment.

How do we separate H₀ from H₁, i.e. how do we classify a given event as an elastic scattering event vs. an inelastic scattering event? If the experiment measures the 3-D momentum components p⃗ = pₓx̂ + p_y ŷ + p_z ẑ of all of the final-state charged particles (and we are 100% certain that they are protons, i.e. other inelastic scattering processes such as p + p → Δ⁺⁺ + n have (somehow) been eliminated), then using the measured momenta of the two final-state protons and conservation of momentum and energy, we can calculate the missing mass squared M²_miss for each event. This quantity is just the square of the (relativistic) invariant mass of a hypothetical unobserved particle M, if the reaction is assumed to be p + p → p + p + M.

If the p + p scattering reaction is truly elastic and if the final-state charged particle momenta were measured perfectly, of course we would expect to find M²_miss = 0 for each event. However, if the reaction were truly the alternative hypothesis, that of the inelastic scattering reaction p + p → p + p + π⁰, we would expect M²_miss = M²_π⁰. Thus, here we choose M²_miss to be the test statistic. If the final-state charged particle momentum measurements had no uncertainties associated with them, we would simply have:

   If M²_miss = 0, then the null hypothesis H₀ is true (i.e. p + p elastic scattering).
   If M²_miss = M²_π⁰, then the alternative hypothesis H₁ is true (i.e. p + p inelastic scattering).

However, there are finite final-state momentum measurement uncertainties associated with each event. In fact, there are typically different uncertainties on the measurements of the different x, y, z components of p⃗ for each charged particle {typically σ_pₓ and σ_p_y are comparable, but σ_p_z is significantly different}.

Suppose (in a separate, modified experiment) that we can study known/pure samples of each kind of scattering, elastic and inelastic p + p scattering. The probability distributions associated with M²_miss for the two kinds of scattering might be similar to those shown in the two figures below:

[Figure: P(M²_miss) for pure elastic scattering, peaked at M²_miss = 0, and P(M²_miss) for pure inelastic scattering, peaked at M²_miss = M²_π⁰; the critical region ω lies above the critical value M²_c, with the loss α being the elastic tail inside ω and the contamination β being the inelastic tail below M²_c.]

We define the critical region ω as the region M²_miss > M²_c, where M²_c is some critical mass squared. Clearly, we would like to have both α and β be zero, or at least as small as possible. In this example, we see how compromises must be made. If we want the power of the hypothesis test, 1 − β, to be large (i.e. β small), so that our sample includes as little background as possible, then the level of significance α of the hypothesis test will be large and we will lose many real p + p → p + p elastic scattering events. On the other hand, if we prefer to minimize our losses of real events, we will have to settle for a large background contamination. One solution to this difficulty would be to design/build an experiment with better resolution (i.e. narrower M²_miss distributions). Monte Carlo simulations of the experiment should have been done before and during the design of the detector in order to discover problems like this!

A procedure known as the Neyman-Pearson Test will determine the optimum critical region for the test statistic x that maximizes the power of the hypothesis test, 1 − β, i.e. determines the smallest/least contamination β for a given level of significance α of the hypothesis test (i.e. a given loss/inefficiency of the signal). Formally, we want to maximize 1 − β = P(x ∈ ω | H₁) for a given value of α = P(x ∈ ω | H₀).

Let the PDF for measuring x be f(x|H₀) if the null hypothesis H₀ is true, and f(x|H₁) if the alternative hypothesis H₁ is true. Suppose that x corresponds to the measurement of a single random variable, x_c being the critical value (i.e. if x > x_c, we reject H₀ although it is true, a loss).

Then:

   α = P(x ∈ ω | H₀) = ∫_ω f(x|H₀) dx   and   1 − β = P(x ∈ ω | H₁) = ∫_ω f(x|H₁) dx

where the integrals are over the critical region ω. We can rewrite:

   1 − β = ∫_ω f(x|H₁) dx = ∫_ω [f(x|H₁)/f(x|H₀)] f(x|H₀) dx

Note that f(x|H₁)/f(x|H₀) is also the likelihood ratio of hypothesis H₁ relative to hypothesis H₀, i.e. L(x|H₁)/L(x|H₀). Thus:

   1 − β = ∫_ω [L(x|H₁)/L(x|H₀)] f(x|H₀) dx

i.e. the expectation value of the likelihood ratio over the critical region ω, weighted by f(x|H₀). Now the null hypothesis H₀ is true in the critical region ω with probability α. We know that numerically:

   1 − β = ∫_ω [f(x|H₁)/f(x|H₀)] f(x|H₀) dx = [f(η|H₁)/f(η|H₀)] · α

where η is some point in the critical region ω. In order to make 1 − β as large as possible for a given/fixed value of α, we must make the ratio f(x|H₁)/f(x|H₀) as large as possible for all points in the critical region ω.

Turning this around, we can use this to define the critical region ω. For example, if we decide that we want 1 − β to be e.g. k times as large as α, then we can define the critical region ω as the region where f(x|H₁)/f(x|H₀) > k. In principle, we can solve this and get a compact rule for defining the critical region ω. More generally, the critical region ω is defined such that:

   L(x|H₁)/L(x|H₀) > k

where L is a likelihood and k is a constant (i.e. just a number).

n.b. For composite hypotheses, where parameter estimation and hypothesis testing must be done simultaneously, general techniques for choosing the best test are not well developed. Often, one must e.g. resort to Monte Carlo and brute-force techniques.
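As a numeric sanity check on the Neyman-Pearson construction, the sketch below again assumes the illustrative Gaussian PDFs N(0,1) for H₀ and N(2,1) for H₁ (not anything from these notes). Here the likelihood ratio f(x|H₁)/f(x|H₀) = e^(2x−2) is monotonic in x, so the region LR > k is a one-sided cut x > x_c. Comparing it with a two-sided region of the same α = 5% shows the likelihood-ratio region has the larger power 1 − β:

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

mu1 = 2.0  # alternative-hypothesis mean (illustrative assumption)

# Region A: likelihood-ratio cut x > 1.645 (alpha = 5% under H0)
alpha_A = 1.0 - norm_cdf(1.645)
power_A = 1.0 - norm_cdf(1.645 - mu1)   # P(x > 1.645 | H1)

# Region B: two-sided cut |x| > 1.96 (also alpha = 5% under H0)
alpha_B = 2.0 * (1.0 - norm_cdf(1.96))
power_B = (1.0 - norm_cdf(1.96 - mu1)) + norm_cdf(-1.96 - mu1)

print(round(power_A, 3), round(power_B, 3))  # the likelihood-ratio region wins
```

Both regions reject a true H₀ 5% of the time, but only the likelihood-ratio region spends that 5% where f(x|H₁)/f(x|H₀) is largest.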

Example: The lifetime of the Σ⁰ baryon. This is a very short-lived particle that decays Σ⁰ → Λ⁰ + γ via the electromagnetic interaction; the Λ⁰ baryon subsequently decays via the (charged) weak interaction, e.g. Λ⁰ → p + π⁻. We assume that the electromagnetic decays of the Σ⁰ baryon obey the exponential decay law and thus can be described by an exponential probability density for the time t at which the Σ⁰ decays:

   f(t; τ) = (1/τ) e^(−t/τ)

This normalized PDF gives the probability f(t; τ) dt that the Σ⁰ baryon will decay in an infinitesimal time interval between t and t + dt, with τ = the mean lifetime of the Σ⁰ baryon.

Suppose two theories predict the mean lifetime of the Σ⁰ baryon to be τ₀ and τ₁, respectively. Then let H₀ be the null hypothesis that τ = τ₀, and let H₁ be the alternative hypothesis that τ = τ₁. I.e. if H₀ is true, then f(t; τ₀) = (1/τ₀) e^(−t/τ₀), and if H₁ is true, then f(t; τ₁) = (1/τ₁) e^(−t/τ₁).

Suppose that we have measured the decay times tᵢ for N Σ⁰ baryon decays. The likelihood ratio for the two hypotheses is:

   L(H₁)/L(H₀) = ∏ᵢ₌₁ᴺ f(tᵢ; τ₁) / ∏ᵢ₌₁ᴺ f(tᵢ; τ₀) = (τ₀/τ₁)ᴺ exp[Σᵢ tᵢ (1/τ₀ − 1/τ₁)]

This ratio should exceed some factor k in order to make 1 − β more than k times as large as α:

   (τ₀/τ₁)ᴺ exp[Σᵢ tᵢ (1/τ₀ − 1/τ₁)] > k

Now the measured mean lifetime is the sample mean t̄ = (1/N) Σᵢ tᵢ. This can be solved, yielding:

   t̄ > [ln k + N ln(τ₁/τ₀)] / [N (1/τ₀ − 1/τ₁)] ≡ T

Summary: If we consider t̄ space, the critical region which can best reject H₀ relative to H₁ with a factor k is t̄ > T, with T = [ln k + N ln(τ₁/τ₀)] / [N (1/τ₀ − 1/τ₁)]. With this criterion, 1 − β will be at least k times as large as α, where β is the contamination and α is the loss. In order to obtain actual numbers, we must fix/specify α or β. Let us work out examples where we fix the loss α.

Here: H₀ is the null hypothesis that τ = τ₀, and H₁ is the alternative hypothesis that τ = τ₁ = 2τ₀. Then:

   α = probability that we will reject H₀ although it is true = ∫_{critical region in t̄} f(t̄ | τ₀) dt̄

We first examine the extreme case of N = 1, i.e. we make a single measurement of the lifetime. Then t̄ = t, i.e. the mean lifetime = the (one) lifetime measurement t. Then the two PDFs are:

   f(t | τ₀) = (1/τ₀) e^(−t/τ₀)   and   f(t | τ₁) = (1/(2τ₀)) e^(−t/(2τ₀))

Let us demand, for example, that α = 5%, i.e. we reject the true hypothesis H₀ 5% of the time. Then:

   0.05 = ∫_T^∞ (1/τ₀) e^(−t/τ₀) dt = e^(−T/τ₀)

defines the critical (aka rejection) region ω. Evaluating this yields T/τ₀ = 2.996, or T ≈ 3τ₀.

The value of β is also now determined:

   β = ∫_0^T (1/τ₁) e^(−t/τ₁) dt = 1 − e^(−T/τ₁) = 1 − e^(−3/2)

i.e. the contamination is β ≈ 78%.

Suppose that our single measurement t of the mean lifetime of the Σ⁰ baryon comes out to be t̄ = t < T = 3τ₀. Then we accept H₀, i.e. we accept the hypothesis that τ = τ₀. Only 5% of the time will we reject H₀, even though it is true, if we have chosen T = 3τ₀. Unfortunately, we will accept H₀ a fraction β ≈ 78% of the time even when H₁ is true, i.e. when τ = τ₁ = 2τ₀! Thus, here we commit a Type-II error. This is not a very statistically significant test {but our data only consists of one single event}. It is not a powerful test in the Neyman-Pearson sense: 1 − β = 0.223, i.e. only ≈ 4.45α; solving t̄ > T for k using N = 1 and T = 3τ₀ gives the corresponding k ≈ 2.24. We will indeed do better when N is large and we make many measurements of decay times tᵢ.

If we instead have a statistically significant sample of events, i.e. N is now very large, then in this regime the statistical uncertainty σ_t̄ = τ/√N on the experimental determination of the mean lifetime t̄ is Gaussian-distributed. The PDFs will then be:

   f(t̄ | τ₀) = [√N / (√(2π) τ₀)] e^(−N(t̄ − τ₀)²/(2τ₀²))

   f(t̄ | τ₁) = [√N / (√(2π) 2τ₀)] e^(−N(t̄ − 2τ₀)²/(8τ₀²))

Here, we obtain the lower boundary T of the critical region ω from

   α = ∫_T^∞ f(t̄ | τ₀) dt̄

For α = 0.05 and using the Gaussian single-sided upper 95% C.L. table on p. 8 of P598AEM Lect. Notes 8, we find T/τ₀ = 1 + 1.645/√N, or T = (1 + 1.645/√N) τ₀.

We then use all N of the events and calculate 1 − β = ∫_T^∞ f(t̄ | τ₁) dt̄.

For example, suppose N = 100. Then T = (1 + 1.645/10) τ₀ = 1.1645 τ₀ is the lower boundary of the critical region ω for a loss of α = 5%. We also obtain 1 − β ≈ 0.99999, i.e. a contamination of β ≈ 0.001%. This is fine!!! The two (Gaussian) probability distributions are now well separated from each other, as shown in the figure below:

[Figure: the two well-separated Gaussian PDFs f(t̄ | τ₀) and f(t̄ | τ₁ = 2τ₀) for N = 100, with the critical region ω beginning at T = 1.1645 τ₀.]
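The numbers in this lifetime example can be reproduced directly; the sketch below works in units of τ₀, with τ₁ = 2τ₀ as above:

```python
import math

tau0, tau1 = 1.0, 2.0   # lifetimes in units of tau0, with tau1 = 2*tau0

# N = 1 case: alpha = exp(-T/tau0) = 0.05 fixes the critical boundary T
alpha = 0.05
T1 = -tau0 * math.log(alpha)           # ~2.996 tau0, i.e. T ~ 3 tau0
beta1 = 1.0 - math.exp(-T1 / tau1)     # contamination ~0.777

# Large-N (Gaussian) regime, e.g. N = 100:
def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

N = 100
T100 = tau0 * (1.0 + 1.645 / math.sqrt(N))   # ~1.1645 tau0
sigma1 = tau1 / math.sqrt(N)                 # width of t-bar under H1
beta100 = norm_cdf((T100 - tau1) / sigma1)   # tiny contamination

print(T1, beta1, T100, beta100)
```

With a single event the contamination is a crippling ~78%; with N = 100 the same 5% loss leaves a contamination of order 10⁻⁵.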

Parametric Tests involve the parameters of a distribution, often the Gaussian distribution. An example from particle physics is the search for new particles. For each event of a particle physics reaction such as e⁺e⁻ → π⁺π⁻π⁺π⁻, we might calculate the invariant mass M(π⁺π⁻) of the π⁺π⁻ pairs (n.b. there are four possible opposite-charge-sign pion pair combinations for each event) and plot a histogram of the invariant mass M(π⁺π⁻) of the π⁺π⁻ pairs. A bump in the M(π⁺π⁻) mass distribution could indicate that we are observing the preferential production e⁺e⁻ → π⁺π⁻ + R, where the particle R subsequently decays (in ~10⁻²² s?) to π⁺π⁻.

[Figure: histogram of the number of π⁺π⁻ pairs vs. pair mass, with a bump at m_R between mass limits m_a and m_b, interpreted as R decay events (solid red line) sitting on a smooth background (dashed blue line).]

Even if we could pick out just those π⁺π⁻ pairs which truly came from R decay, and plotted m_R, we would still get a broad bump (and not a δ-function) for two reasons: experimental uncertainties on the measurements of the components of p⃗(π⁺) and p⃗(π⁻) (which are used in the calculation of m_R), and also the uncertainty-principle natural linewidth, ΔE·Δt ≳ ħ = 6.6 × 10⁻²² MeV·s, giving Γ = ħ/τ ≈ 6.6 MeV for τ = 10⁻²² s.

If we try to select π⁺π⁻ pairs which come from R decay by defining an upper and lower limit on m_R, we will get them and also:

a) π⁺π⁻ pairs which are not associated with R particles;
b) π⁺π⁻ pairs made of the π⁺ from an R and an unassociated π⁻ (and vice versa).

The invariant π⁺π⁻ pair masses of these background events need not peak, and even if they do, there is no reason to expect them to peak near m_R. We usually assume that the background is described by a smooth function of mass. (n.b. We can use a Monte Carlo program to generate such a background!) Then, we can interpret the histogram as a possible superposition of R decay events (solid red line) and background (dashed blue line) in the above figure. Of course, statistical fluctuations in the number of events per bin, σ_nᵢ ~ √nᵢ, are always present. It is possible that the entire bump near m_R could be due solely to statistical fluctuations.

Let us estimate the probability that, in the region m_a to m_b, the observed effect at m = m_R is entirely due to a statistical fluctuation of the background. Here, one simple procedure (there are other, more elaborate ones too) is:

Let N be the total number of events in the bump region m_a to m_b. Let B be the number of events in that region which are background. Let H₀ be the null hypothesis that N = B.

We assume that there is a way to estimate B and σ_B, which are assumed to be Gaussian-distributed, either from fitting the mass distribution away from the bump, or from some model, or from a Monte Carlo program, etc. If H₀ is true, then N − B is consistent with zero. If we assume that B and N are estimated independently, then σ² = σ_N² + σ_B². If N is large enough that it is (approximately) Gaussian, then

   Δ = (N − B) / √(σ_N² + σ_B²)

will be a Gaussian-distributed random variable with mean = 0 and unit standard deviation, i.e. distributed as N(0,1). In order to be able to decide whether or not the null hypothesis H₀ is true, we simply apply the rules for Gaussian-distributed random variables. Thus, Δ is the number of standard deviations away from the expected mean, which is zero. We can look in tables for the probability P(≥ Δ) that the given value of Δ will occur by chance. Finally, 1 − P(≥ Δ) is the probability that the alternative hypothesis H₁ is true, i.e. that the total N events are not simply a statistical fluctuation of the B background events.

In particle physics, it is currently not fashionable to quote even a 3 or 4 standard deviation bump as a new particle without a lot of confirming work. Although the probability of N − B > 4σ is very small (odds of roughly 16,000 : 1), there are so many mass plots studied in all high-energy experiments that we expect to occasionally see such large fluctuations. (In some years, it has been estimated that many thousands of mass plots were examined!)
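A minimal sketch of this recipe (the event counts below are invented purely for illustration): with N observed events in the mass window, an estimated background B ± σ_B, and σ_N = √N, the significance is Δ = (N − B)/√(σ_N² + σ_B²):

```python
import math

def bump_significance(n_obs, b_est, sigma_b):
    """Number of standard deviations of the excess over background.
    Assumes n_obs is large enough to be ~Gaussian, with sigma_N = sqrt(n_obs)."""
    sigma = math.sqrt(n_obs + sigma_b ** 2)
    return (n_obs - b_est) / sigma

def one_sided_p(delta):
    """P(a unit-Gaussian fluctuation >= delta)."""
    return 0.5 * (1.0 - math.erf(delta / math.sqrt(2.0)))

delta = bump_significance(1100, 1000.0, 30.0)   # invented counts
print(delta, one_sided_p(delta))
```

A ~2.2σ excess like this one occurs by chance more than 1% of the time, which is why it would not be quoted as a discovery.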
The Goodness-of-Fit Test usually describes a situation where we want to accept or reject a hypothesis H₀ without specifying an alternative. We can still have a PDF f(x|H₀) and a critical region ω, such that if x ∈ ω, then we reject the hypothesis H₀. The probability that we will reject the null hypothesis H₀, even though it is true, is still α = ∫_ω f(x|H₀) dx. However, since we have no alternative hypothesis H₁, we have no way to calculate the probability of accepting H₀ although it is false, i.e. that H₁ is true.

Thus, we can make a quantitative statement against H₀ (e.g. a LSQ fit of a theory prediction to the data is bad), but we have no quantitative evidence for H₁. This leads to the (imprecise) statement that one can never prove a theory, but one can disprove it. How does one choose tests in this case? Some turn out to be more sensitive than others to deviations from H₀.

There are other types of tests, called distribution-free methods, where a test statistic T(x) has a known distribution, independent of the distribution of x, if H₀ is true. In these cases (many such T's exist), one calculates T and then reads from a table the corresponding level of significance of the test.

One test-statistic example is χ², in the test of the goodness of fit of a theoretical probability distribution F(x) to observed data. Here, F(x) is typically proportional to a PDF, normalized so that F(x)Δx gives the number of measurements predicted to fall in the range x to x + Δx. The data is typically plotted in a histogram of bins with constant bin width Δx. Let us assume that the number of events predicted in the i-th histogram bin, nᵢ, is Poisson-distributed. Then the variance is σ²_nᵢ = nᵢ, or σ_nᵢ = √nᵢ. Let mᵢ be the number of experimentally-measured data events in the i-th histogram bin. The test statistic is:

   T = Σᵢ₌₁^{N_bins} (mᵢ − nᵢ)² / σ²_nᵢ = Σᵢ₌₁^{N_bins} (mᵢ − nᵢ)² / nᵢ

If the null hypothesis H₀ is true, i.e. if the theory does agree well with the measurements, then the test statistic T will be χ²-distributed, and we already know how to test for its significance.

Some detail: Let the total number of measurements be N (n.b. same N for theory and for data!).
Then N = Σᵢ₌₁^{N_bins} nᵢ, where N_bins is the number of histogram bins. Suppose that the PDF of the theory is such that nᵢ = N f(xᵢ). Then the test statistic is:

   T = Σᵢ₌₁^{N_bins} [mᵢ − N f(xᵢ)]² / [N f(xᵢ)]

Because of the normalization condition N = Σᵢ nᵢ, we see that not all of the terms in the test statistic T are independent. Once we know the contents of histogram bins 1, 2, …, N_bins − 1, the number of events in the last histogram bin is fixed by the requirement that the total number of events add up to N. Thus, there are N_bins − 1 degrees of freedom for the χ² distribution.
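The binned χ² statistic above is easy to compute; the survival probability P(χ² > T) even has a closed form for an even number of degrees of freedom, which is enough for a sketch (the bin contents below are invented for illustration):

```python
import math

def chi2_statistic(observed, predicted):
    """T = sum over bins of (m_i - n_i)^2 / n_i."""
    return sum((m - n) ** 2 / n for m, n in zip(observed, predicted))

def chi2_sf_even_dof(x, dof):
    """P(chi2 > x) for even dof: exp(-x/2) * sum_{j < dof/2} (x/2)^j / j!"""
    assert dof % 2 == 0
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(dof // 2))

predicted = [10.0, 20.0, 40.0, 20.0, 10.0]   # theory bins, normalized to N = 100
observed = [8, 24, 36, 22, 10]               # invented data, same total N
T = chi2_statistic(observed, predicted)
dof = len(predicted) - 1                     # normalization removes one dof
print(T, chi2_sf_even_dof(T, dof))           # good fit: large p-value
```

Here T ≈ 1.8 on 4 degrees of freedom, a comfortably probable value, so H₀ would not be rejected.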

If, in addition, we were e.g. fitting for M λ-parameters, then we'd instead have N_bins − 1 − M degrees of freedom for χ².

The reason that we can use this χ² to reject H₀ if it is too large is that if H₀ is not true, then each term in the χ² sum (instead of being ~1 on average) will have a nonzero expectation value and contribute much more to the sum than it would if H₀ were true.

One drawback of the χ² test is that if the number of measurements N is small, the theory of χ² won't hold. In that case we must use other test statistics, e.g. those based on likelihood. This is very messy, and in this situation we'd prefer using the Kolmogorov-Smirnov Test.

The Kolmogorov-Smirnov (K-S) Test: The K-S test looks at the Cumulative Distribution Function F(x) rather than the Probability Density Function f(x), and is based on ordered statistics. No binning of the measurements is required here, as would be the case e.g. using a χ² test. Note that the K-S test also has many variations.

The Standard Two-Sided/Double-Sided K-S Test: Assume that we have N independent measurements x₁, x₂, …, x_N of a random variable that we want to compare to an (a priori known/assumed) theoretical prediction, with accompanying theory PDF f(x) and corresponding theory Cumulative Distribution Function (CDF):

   F(x) = ∫_{−∞}^{x} f(x′) dx′

We then form the so-called empirical sample/experimental Cumulative Distribution Function (CDF) S_N(x) as follows:

Define S_N(−∞) = 0. Order (i.e. sort) the measurements in increasing values of the random variable x. (After ordering/sorting, the measurements are located at x₁ ≤ x₂ ≤ x₃ ≤ … ≤ x_N.) Define S_N(x_k) = k/N, i.e. S_N(x) increases/increments by 1/N at each measurement x_k. Thus S_N(x < x₁) = 0 and S_N(x ≥ x_N) = 1, and we can define S_N(+∞) = +1.

For a continuous theory CDF F(x), the so-called two-sided (aka double-sided) K-S test statistic D_N is defined as the supremum of the absolute difference between the experiment and theory CDFs:

   D_N ≡ sup_x |S_N(x) − F(x)|

Since S_N(x_k) = k/N, the K-S test statistic D_N is computed as:

   D_N = max_{k=1:N} { max( |k/N − F(x_k)|, |(k−1)/N − F(x_k)| ) }

where the N independent experimental measurements x_k {k = 1:N} of the random variable x have been sorted into ascending values x₁ < x₂ < x₃ < … < x_N, and the theory CDF F(x) is correspondingly evaluated at x_k for each value of k.

[Figure: a typical double-sided K-S comparison of the staircase experiment CDF S_N(x_k) vs. the smooth theory CDF F(x), with the K-S test statistic D_N indicated (in green) at the point of maximum separation.]

If the null hypothesis H₀ is true (i.e. S_N(x) = F(x)) and provided that no parameter(s) of the theory have been estimated from the data, the double-sided K-S test statistic D_N is distributed in such a way that it is independent of the choice of the theory CDF F(x), i.e. the double-sided K-S test statistic D_N is distribution-free, if N is large enough!

If the experiment S_N(x) is repeated a gazillion times, it can be seen that the double-sided K-S test statistic D_N is itself a random variable. If the null hypothesis H₀ is true (i.e. S_N(x) = F(x)), then in the asymptotic/infinite-statistics limit, the double-sided K-S test statistic D_N converges to zero:

   lim_{N→∞} D_N = lim_{N→∞} sup_k |S_N(x_k) − F(x_k)| = 0
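The two-sided D_N formula can be sketched directly. The five data points and the uniform theory CDF F(x) = x on [0,1] below are invented for illustration:

```python
def ks_statistic(sample, theory_cdf):
    """Two-sided K-S statistic D_N = sup_x |S_N(x) - F(x)|.
    The supremum occurs at a data point, just before or just after
    each 1/N step of the empirical CDF S_N, so we check both sides."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs, start=1):
        f = theory_cdf(x)
        d = max(d, abs(k / n - f), abs((k - 1) / n - f))
    return d

data = [0.9, 0.1, 0.5, 0.2, 0.6]         # invented sample, N = 5
D = ks_statistic(data, lambda x: x)      # theory: uniform on [0, 1]
print(D)  # 0.2
```

No binning was needed: the sorted sample is compared to F(x) point by point.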

However, if the null hypothesis H₀ is true (i.e. S_N(x) = F(x)), then in the asymptotic/infinite-statistics limit, the product z = √N·D_N is a random variable that statistically converges to the Kolmogorov-Smirnov Test Statistic Probability Distribution Function (PDF), f_KS(z) (valid for N > 80):

   lim_{N→∞} √N·D_N = lim_{N→∞} √N · sup_k |S_N(x_k) − F(x_k)| → distributed as f_KS(z)

n.b. f_KS(z) is the PDF of max_{t∈[0,1]} |B(t)|, the maximum of the absolute value of B(t), where B(t) is the so-called Brownian bridge.

The analytic form of the Kolmogorov-Smirnov Test Statistic Probability Distribution Function f_KS(z), the PDF associated with the asymptotic/infinite-statistics limit, is the infinite series expression (valid for N > 80):

   f_KS(z) = 8z Σ_{k=1}^{∞} (−1)^{k−1} k² e^{−2k²z²}

[Figure: linear and semi-log plots of the asymptotic Kolmogorov-Smirnov test statistic PDF f_KS(z) vs. z.]

The Cumulative Distribution Function (CDF) associated with the Kolmogorov-Smirnov Test Statistic Probability Distribution Function f_KS(z) is:

   F_KS(Z) = ∫_0^Z f_KS(z) dz

[Figure: linear and semi-log plots of the Kolmogorov-Smirnov test statistic CDF F_KS(Z) vs. Z.]

The analytic form of the Kolmogorov-Smirnov Test Statistic Cumulative Distribution Function F_KS(Z), in the asymptotic/infinite-statistics limit (valid for N > 80), is:

   F_KS(Z) = 1 − 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²Z²} = (√(2π)/Z) Σ_{k=1}^{∞} e^{−(2k−1)²π²/(8Z²)}

Note that since F_KS(Z) = ∫_0^Z f_KS(z) dz, we have f_KS(z) = dF_KS(z)/dz; hence, differentiating the above infinite series expression for the Kolmogorov-Smirnov test statistic CDF F_KS(Z) with respect to z = Z, we obtain the above infinite series expression for the Kolmogorov-Smirnov test statistic PDF f_KS(z).

From either of the above two plots of the Kolmogorov-Smirnov test statistic CDF F_KS(Z) vs. Z, since in the asymptotic/infinite-statistics limit Z = √N·D_N, if we e.g. choose a critical value F_KS(Z_α) = 1 − α = 0.8 (i.e. α = 0.2), there thus exists a corresponding Z = Z_α for this choice of α. Physically, this choice of critical value means that if the null hypothesis H₀ is true (i.e. S_N(x) = F(x)), then Prob_KS(z ≥ Z_α) = 1 − F_KS(Z_α) = α = 0.2, i.e. 20% of the time we will reject the null hypothesis H₀, and we thus correspondingly commit an error of the 1st kind.

The analytic form of the Kolmogorov-Smirnov test statistic probability function Prob_KS(z ≥ Z) = 1 − F_KS(Z), i.e. the p-value, in the asymptotic limit (valid for N > 80) is:

   p-value = Prob_KS(z ≥ Z) = 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²Z²}
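The alternating series for the asymptotic p-value converges after a handful of terms, so it is trivial to evaluate; a sketch:

```python
import math

def ks_pvalue(z, terms=100):
    """Asymptotic K-S p-value Prob(sqrt(N)*D_N >= z)
    = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 z^2), clamped to [0, 1]."""
    if z <= 0.0:
        return 1.0
    s = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
                  for k in range(1, terms + 1))
    return max(0.0, min(1.0, s))

# Recover the standard asymptotic critical values:
# alpha = 5% at Z ~ 1.358 and alpha = 1% at Z ~ 1.628
print(ks_pvalue(1.3581), ks_pvalue(1.6276))
```

Evaluating the series at a proposed Z_α and checking that it returns α is a quick consistency test of any critical-value table.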

[Figure: linear and semi-log plots of the Kolmogorov-Smirnov test statistic probability function Prob_KS(z ≥ Z) = 1 − F_KS(Z) vs. Z, with several useful critical values α-Z_α marked as * points on the curves.]

The * points marked on the curves of either of the above Prob_KS(z ≥ Z) = 1 − F_KS(Z) vs. Z plots indicate several useful choices of critical values α-Z_α associated with the asymptotic (N > 80) limit of the Kolmogorov-Smirnov test statistic probability function. The standard asymptotic values are summarized in the table below:

Asymptotic (N > 80) Critical Values of the Kolmogorov-Smirnov Test Statistic, Z_α = √N·D_α:

   α (%)     Z_α
   20%       1.0728
   10%       1.2238
    5%       1.3581
    1%       1.6276
    0.1%     1.9495

Typically (by convention), the critical value α = 0.05 = 5% is most frequently chosen.

The p-value = Prob_KS(z ≥ Z) = 1 − F_KS(Z) associated with the double-sided single-sample K-S test statistic z = √N·D_N in the asymptotic limit (valid for N > 80) is:

   p-value = 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2k²z²} = 1 − (√(2π)/z) Σ_{k=1}^{∞} e^{−(2k−1)²π²/(8z²)}

When the p-value = 1, then D_N = 0, i.e. S_N(x) perfectly matches F(x). When the p-value = 0, then S_N(x) is not at all statistically compatible with F(x). The null hypothesis H₀, namely that S_N(x) = F(x) for all x (or, equivalently, that the corresponding experimental vs. theoretical PDFs s_N(x) = f(x) for all x), is formally rejected if √N·D_N > Z_α, or equivalently if D_N > D_α ≡ Z_α/√N, or equivalently if the p-value < α, when using the above table (valid for N > 80).

However, for experimental samples with N < 80, using the above table of asymptotic/infinite-statistics limit results is increasingly inaccurate as N decreases. For decent accuracy, we must instead use a table of critical values D_{N,α} of the double-sided single-sample K-S test for specified values of N, such that Prob_KS(D_N > D_{N,α}) = α.

Critical Values of D_{N,α} for the Double-Sided Single-Sample Kolmogorov-Smirnov Test (tabulated for α = 0.20, 0.10, 0.05, 0.02, 0.01 at various N).

For experimental samples with N < 80, the following parameterization can alternatively be used:

   D_{N,α} ≈ Z_α / (√N + 0.12 + 0.11/√N)

M.A. Stephens, Journal of the Royal Statistical Society, Series B, Vol. 32, pp. 115-122, 1970.

An example of the use of the table of critical values of K-S statistics for N < 80: Suppose we have N = 7 measurements of the random variable x. From the above table, with Prob_KS(D_N > D_{N,α}) = α, for N = 7 measurements:

   α = 0.20:   D_{7,α} = 0.385
   α = 0.10:   D_{7,α} = 0.436
   α = 0.05:   D_{7,α} = 0.4834
   α = 0.02:   D_{7,α} = 0.5384
   α = 0.01:   D_{7,α} = 0.5758

We compare the value of D_N obtained from our double-sided single-sample K-S test with D_{7,α} from the above table. If the null hypothesis H₀ (S_N(x) = F(x)) is true, then for N = 7 events:

   D₇ ≤ 0.385 80% of the time (and 20% of the time we reject the null hypothesis)
   D₇ ≤ 0.436 90% of the time (and 10% of the time we reject the null hypothesis)
   D₇ ≤ 0.4834 95% of the time (and 5% of the time we reject the null hypothesis)
   D₇ ≤ 0.5384 98% of the time (and 2% of the time we reject the null hypothesis)
   D₇ ≤ 0.5758 99% of the time (and 1% of the time we reject the null hypothesis)

If our D₇ is larger than one of these values, then we can cast doubt on the truth of the null hypothesis, at the level indicated.

K-S Test Confidence Intervals: Under the assumption that the null hypothesis H₀, S_N(x) = F(x) (or, equivalently, that the corresponding experimental vs. theoretical PDFs s_N(x) = f(x) for all x), is true, the K-S test statistic D_N is itself a random variable, and has a PDF that is universal and independent of the choice of the theoretical CDF F(x), and furthermore is known for all N. Thus, one may use the K-S test statistic D_N to construct confidence intervals for any continuous theory CDF F(x).

Thus, for any such F(x), we can write the K-S test probability statement:

   Prob_KS(D_N > D_{N,α}) = α

We can invert this probability statement to obtain a confidence interval statement about the theory CDF F(x), valid for all x:

   Prob_KS( S_N(x) − D_{N,α} < F(x) < S_N(x) + D_{N,α} ) = 1 − α

Physically, this statement means that, for any point x, the theory CDF F(x) has a probability 1 − α of being larger than S_N(x) − D_{N,α} but smaller than S_N(x) + D_{N,α}. Hence, one can construct a confidence interval (band) of width ±D_{N,α} centered on the experimental CDF S_N(x); the probability that the true theory CDF F(x) lies within this confidence interval is 1 − α.

If one or more parameters of the theory have been estimated from the data (e.g. the sample mean and/or the sample variance), the critical values in the above table are invalid for use in this situation.
However, in this situation, tables of critical values for these kinds of K-S tests have been prepared/do exist for certain specific cases, such as the Gaussian/normal distribution and the exponential distribution. See e.g. Table 54 of the Biometrika Tables for Statisticians, edited by E.S. Pearson and H.O. Hartley (1972). See also M.A. Stephens, "EDF Statistics for Goodness of Fit and Some Comparisons," Journal of the American Statistical Association, Vol. 69, No. 347, pp. 730-737 (Sept. 1974).
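The small-N critical values tabulated above can be reproduced quite well with the Stephens (1970) parameterization quoted earlier; a sketch, using the standard asymptotic Z_α values:

```python
import math

Z_ALPHA = {0.20: 1.0728, 0.10: 1.2238, 0.05: 1.3581, 0.01: 1.6276}

def ks_critical_d(n, alpha):
    """Approximate finite-N critical value D_{N,alpha} for the double-sided
    one-sample K-S test (Stephens 1970 parameterization):
    D_{N,alpha} ~ Z_alpha / (sqrt(N) + 0.12 + 0.11/sqrt(N))."""
    z = Z_ALPHA[alpha]
    return z / (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n))

print(ks_critical_d(7, 0.05))   # ~0.484, vs. the tabulated 0.4834 for N = 7
```

For N = 7 the parameterization reproduces the 5% and 10% table entries to about three decimal places, which is ample for practical use.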

The K-S Test for Two Experimental CDFs: If we wish to compare two independent experimental samples, with CDFs S_{N₁}(x) and S_{N₂}(x) respectively, for the null hypothesis H₀ that they are the same/identical, a modified version of the above K-S test also exists for use in this situation. Sample S_{N₁}(x) has N₁ entries and sample S_{N₂}(x) has N₂ entries. Again, this is a distribution-free test, one which works for any pair of empirical experimental sample CDFs S_{N₁}(x), S_{N₂}(x). Here, the two-experiment double-sided K-S test statistic D_{N₁,N₂} is defined as:

   D_{N₁,N₂} ≡ sup_x |S_{N₁}(x) − S_{N₂}(x)|

[Figure: the two staircase empirical sample CDFs S_{N₁}(x_k) and S_{N₂}(x_j) vs. x, with the two-sample K-S statistic D_{N₁,N₂} indicated at the point of maximum separation.]

Since the two experimental samples both have finite statistics, the ability to reject the null hypothesis H₀ will be correspondingly weaker. The null hypothesis H₀, S_{N₁}(x) = S_{N₂}(x) (or equivalently their corresponding PDFs s₁(x) = s₂(x) for all x), is formally rejected if:

   √(N_eff) · D_{N₁,N₂} > Z_α,   with N_eff ≡ N₁N₂/(N₁ + N₂)
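A minimal sketch of the two-sample statistic just defined (the two small samples are invented; with such small N₁ and N₂ the asymptotic p-value is only indicative):

```python
import bisect
import math

def ks_two_sample(sample1, sample2):
    """D = sup_x |S_N1(x) - S_N2(x)|, evaluated at the pooled data points
    (the supremum of a difference of two staircases occurs at a step)."""
    s1, s2 = sorted(sample1), sorted(sample2)
    n1, n2 = len(s1), len(s2)
    d = 0.0
    for x in s1 + s2:
        cdf1 = bisect.bisect_right(s1, x) / n1
        cdf2 = bisect.bisect_right(s2, x) / n2
        d = max(d, abs(cdf1 - cdf2))
    return d

def ks_pvalue(z, terms=100):
    """Asymptotic K-S p-value series (clamped to [0, 1])."""
    s = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
                  for k in range(1, terms + 1))
    return max(0.0, min(1.0, s))

a, b = [1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]   # invented samples
D = ks_two_sample(a, b)
n_eff = len(a) * len(b) / (len(a) + len(b))
print(D, ks_pvalue(math.sqrt(n_eff) * D))
```

With larger samples the same recipe applies unchanged, and the asymptotic p-value from z = √(N_eff)·D becomes trustworthy.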

Or: D_{N₁,N₂} > D^α_{N_eff}, using the above table of critical values, or equivalently:

  p-value = 2 Σ_{k=1}^∞ (−1)^{k−1} e^{−2k²z²} = (√(2π)/z) Σ_{k=1}^∞ e^{−(2k−1)²π²/(8z²)} < α,   with z ≡ √N_eff · D_{N₁,N₂}

The Smirnov–Cramér–von Mises Test: The Smirnov–Cramér–von Mises Test is somewhat similar to the Kolmogorov–Smirnov Test; however, the S-C-vM test statistic W²_SCvM uses the entirety of the {square} of the difference between the two CDFs, not just the maximum {absolute} separation distance D_N between the two CDFs. The S-C-vM test statistic W²_SCvM is defined as:

  W²_SCvM ≡ N ∫_{−∞}^{+∞} [S_N(x|X) − F(x|X)]² f(x) dx = N · E[(S_N(x|X) − F(x|X))²]

Using dF(x|X) = f(x|X) dx, we can write:

  W²_SCvM = N ∫₀¹ [S_N(x|X) − F(x|X)]² dF

When written this way, it is obvious that the S-C-vM test statistic W²_SCvM is independent of F! Thus, the S-C-vM Test is distribution-free for all N, not just large N. Since the sorted empirical experimental sample CDF is S_N(x_k|X) = k/N, the S-C-vM test statistic W²_SCvM is numerically computed as:

  W²_SCvM = 1/(12N) + Σ_{k=1}^N [F(x_k) − (2k−1)/(2N)]²

Once the number W²_SCvM has been computed for a given N from the ordered/sorted empirical experimental sample CDF S_N(x_k|X) = k/N vs. the x_k-sorted theory CDF F(x_k), we can then refer to a table of critical values for the S-C-vM Test in order to determine whether to accept/reject the null hypothesis H₀. For the S-C-vM Test, in the asymptotic limit N → ∞, the CDF associated with the random variable W²_SCvM is:

  F_{W²_SCvM}(z) = (1/(π√z)) Σ_{k=0}^∞ [Γ(k+½)/(Γ(½) k!)] √(4k+1) e^{−(4k+1)²/(16z)} K_{1/4}((4k+1)²/(16z))

where K_{1/4} is the Bessel function of order ¼; recall that a Bessel function of order ν has the series expansion:

  J_ν(x) = Σ_{m=0}^∞ [(−1)^m / (m! Γ(m+ν+1))] (x/2)^{2m+ν}

The convergence of this expression fortunately is extremely rapid. For N → ∞, the probability Prob{W²_SCvM > W²_α} = α (i.e. W²_SCvM exceeds W²_α (100·α)% of the time when the null hypothesis H₀ is true). The table below gives values of W²_α for several choices of α.

Critical Values of the Smirnov–Cramér–von Mises Test Statistic:

  W²_α     α
  0.347    10%
  0.461    5%
  0.743    1%
  1.168    0.1%

For finite statistics N, the probability Prob{(W²_SCvM − 0.4/N + 0.6/N²)(1 + 1/N) > W²_α} = α, i.e. this occurs (100·α)% of the time when the null hypothesis H₀: S_N(x|X) = F(x|X) is true. See M.A. Stephens, "EDF Statistics for Goodness of Fit and Some Comparisons", Journal of the American Statistical Association, Vol. 69, No. 347 (Sept. 1974) for additional details.

The Anderson–Darling Test: The Anderson–Darling Test is similar/related to the Smirnov–Cramér–von Mises Test, comparing the entirety of the {square} of the differences between the experiment vs. theory CDFs, for a test of the null hypothesis H₀: S_N(x|X) = F(x|X). If we consider the more general case of a generalized statistical test using the quadratic difference between experiment vs. theory CDFs, the generalized quadratic difference test statistic W² is:

  W² ≡ N ∫₀¹ [S_N(x|X) − F(x|X)]² ψ(F(x|X)) dF

As before, the continuous theory CDF is F(x|X), the empirical experiment sample CDF S_N(x|X) is defined from the sorted empirical experiment sample PDF s_N(x|X) as S_N(x_k|X) = k/N, and ψ(t) is some a priori pre-assigned, non-negative weight function that depends on the continuous theory CDF F(x|X) through t ≡ F(x|X). Here again, the generalized quadratic difference test statistic W² is said to be distribution-free, since any/all F(x|X)-dependence is integrated out in the above integral.
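The finite-N computational formula for W²_SCvM above is straightforward to verify numerically. A minimal Python sketch, not part of the notes — the uniform null hypothesis, sample size, seed, and the SciPy `cramervonmises` cross-check (available in SciPy ≥ 1.6) are my own choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
N = 50
x = np.sort(rng.uniform(0.0, 1.0, N))  # pseudo-data; H0: uniform on [0,1]
F = stats.uniform.cdf(x)               # theory CDF at the sorted data points

# W^2 = 1/(12N) + sum_{k=1}^{N} [F(x_k) - (2k-1)/(2N)]^2
k = np.arange(1, N + 1)
W2 = 1.0 / (12.0 * N) + np.sum((F - (2.0 * k - 1.0) / (2.0 * N)) ** 2)

# Cross-check with SciPy's Cramer-von Mises test (same definition of the statistic)
res = stats.cramervonmises(x, 'uniform')
assert abs(W2 - res.statistic) < 1e-12

# Decision at the 5% level using the asymptotic critical value 0.461 from the table
print(f"W^2 = {W2:.4f}  ->  {'reject' if W2 > 0.461 else 'accept'} H0 at the 5% level")
```

Since H₀ is true by construction here, the test should accept in roughly 95% of repeated experiments at the 5% level.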

If the weight function ψ(t) is chosen to be a constant (i.e. unity), ψ(t) = 1, then the above generalized quadratic difference test statistic is none other than the S-C-vM test statistic W²_SCvM:

  W²_SCvM ≡ N ∫₀¹ [S_N(x|X) − F(x|X)]² dF

Now, if the experiment is repeated a gazillion times, it can then be seen that the {sorted} empirical experimental sample CDF S_N(x|X) is itself a random variable. The product N · S_N(x|X) is in fact binomially distributed with probability F(x|X). The expectation value of S_N(x|X) is E[S_N(x|X)] = F(x|X), with variance F(x|X)(1 − F(x|X))/N. Hence, if we instead choose the weight function ψ(t) for the generalized quadratic difference test statistic to be the inverse of this variance, i.e. ψ(t) = 1/[t(1−t)]:

  ψ(F(x|X)) = 1/[F(x|X)(1 − F(x|X))]

then, for a specified X, the quantity

  √N [S_N(x|X) − F(x|X)] √ψ(F(x|X)) = √N [S_N(x|X) − F(x|X)] / √(F(x|X)(1 − F(x|X)))

in the asymptotic limit N → ∞ has true mean 0 and unit variance when the null hypothesis H₀ is true (i.e. S_N(x|X) = F(x|X))!!! Thus, for the specific choice of weight function ψ(t) = 1/[t(1−t)], where t ≡ F(x|X), the generalized quadratic difference test is then known as the Anderson–Darling Test:

  W²_A-D ≡ N ∫₀¹ [S_N(x|X) − F(x|X)]² / [F(x|X)(1 − F(x|X))] dF

Since the sorted empirical experimental sample CDF is S_N(x_k|X) = k/N, the Anderson–Darling test statistic W²_A-D is numerically computed as:

  W²_A-D = −N − (1/N) Σ_{k=1}^N (2k−1) {log F(x_k) + log[1 − F(x_{N+1−k})]}

Again, the Anderson–Darling test statistic W²_A-D is itself a random variable, and in the asymptotic limit N → ∞ the probability Prob{W²_A-D > W²_α} = α, i.e. W²_A-D exceeds W²_α (100·α)% of the time when the null hypothesis H₀ is true (i.e. S_N(x|X) = F(x|X), or equivalently their corresponding PDFs s(x) = f(x), for all x). Thus, for the Anderson–Darling Test, in the asymptotic limit, the CDF associated with the A-D test statistic is:

  F_{W²_A-D}(z) = (√(2π)/z) Σ_{k=0}^∞ (−1)^k [Γ(k+½)(4k+1)/k!] e^{−(4k+1)²π²/(8z)} ∫₀^∞ e^{z/[8(w²+1)] − (4k+1)²π²w²/(8z)} dw

The convergence of this expression is also extremely rapid, and is such that for N ≥ 5, to 3 decimal places of accuracy, the table of critical values of W²_α for various choices of α is:

Critical Values of the Anderson–Darling Test Statistic:

  W²_α     α
  1.933    10%
  2.492    5%
  3.857    1%
  5.969    0.1%
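The Anderson–Darling sum formula above can be checked directly against its defining weighted integral. The following Python sketch is mine, not part of the notes: the fully specified Gaussian null hypothesis (no fitted parameters, so the asymptotic critical values above apply), the sample size, seed, and the numerical-integration cross-check are all assumptions for illustration.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

rng = np.random.default_rng(3)
N = 20
x = np.sort(rng.normal(0.0, 1.0, N))  # pseudo-data; H0: standard Gaussian (fully specified)
F = stats.norm.cdf(x)

# Sum formula: A^2 = -N - (1/N) sum_k (2k-1) [ln F(x_k) + ln(1 - F(x_{N+1-k}))]
k = np.arange(1, N + 1)
A2 = -N - np.sum((2 * k - 1) * (np.log(F) + np.log(1.0 - F[::-1]))) / N

# Direct check against the defining integral: A^2 = N * int_0^1 (S_N - t)^2 / (t(1-t)) dt,
# where t = F(x) and S_N, viewed as a function of t, steps at the values F(x_k).
# The integrand tends to 0 at both endpoints, so only the jump points need care.
def integrand(t):
    S = np.searchsorted(F, t, side='right') / N  # empirical CDF as a function of t
    return N * (S - t) ** 2 / (t * (1.0 - t))

A2_num, _ = quad(integrand, 0.0, 1.0, points=list(F), limit=200)
assert abs(A2 - A2_num) < 1e-6

# Decision at the 5% level using the asymptotic critical value 2.492 from the table
print(f"A^2 = {A2:.4f}  ->  {'reject' if A2 > 2.492 else 'accept'} H0 at the 5% level")
```

Passing the `points=list(F)` breakpoints to `quad` tells the integrator where the step-function discontinuities lie, so the piecewise-smooth integrand is handled accurately.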


More information

Lecture 2. G. Cowan Lectures on Statistical Data Analysis Lecture 2 page 1

Lecture 2. G. Cowan Lectures on Statistical Data Analysis Lecture 2 page 1 Lecture 2 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

More information

Modern Methods of Data Analysis - WS 07/08

Modern Methods of Data Analysis - WS 07/08 Modern Methods of Data Analysis Lecture V (12.11.07) Contents: Central Limit Theorem Uncertainties: concepts, propagation and properties Central Limit Theorem Consider the sum X of n independent variables,

More information

Statistics for Particle Physics. Kyle Cranmer. New York University. Kyle Cranmer (NYU) CERN Academic Training, Feb 2-5, 2009

Statistics for Particle Physics. Kyle Cranmer. New York University. Kyle Cranmer (NYU) CERN Academic Training, Feb 2-5, 2009 Statistics for Particle Physics Kyle Cranmer New York University 91 Remaining Lectures Lecture 3:! Compound hypotheses, nuisance parameters, & similar tests! The Neyman-Construction (illustrated)! Inverted

More information

Analysis of Z ee with the ATLAS-Detector

Analysis of Z ee with the ATLAS-Detector DESY Summer Student Programme 2008 Hamburg Analysis of Z ee with the ATLAS-Detector Maximilian Schlupp ATLAS Group 11.09.2008 maximilian.schlupp@desy.de maximilian.schlupp@tu-dortmund.de Supervisor: Karsten

More information

Institute for the Advancement of University Learning & Department of Statistics

Institute for the Advancement of University Learning & Department of Statistics Institute for the Advancement of University Learning & Department of Statistics Descriptive Statistics for Research (Hilary Term, 00) Lecture 7: Hypothesis Testing (I.) Introduction An important area of

More information

Lecture 21: October 19

Lecture 21: October 19 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 21: October 19 21.1 Likelihood Ratio Test (LRT) To test composite versus composite hypotheses the general method is to use

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

Statistical Inference. Hypothesis Testing

Statistical Inference. Hypothesis Testing Statistical Inference Hypothesis Testing Previously, we introduced the point and interval estimation of an unknown parameter(s), say µ and σ 2. However, in practice, the problem confronting the scientist

More information

Lecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis

Lecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis Lecture 3 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

More information

Modeling the Goodness-of-Fit Test Based on the Interval Estimation of the Probability Distribution Function

Modeling the Goodness-of-Fit Test Based on the Interval Estimation of the Probability Distribution Function Modeling the Goodness-of-Fit Test Based on the Interval Estimation of the Probability Distribution Function E. L. Kuleshov, K. A. Petrov, T. S. Kirillova Far Eastern Federal University, Sukhanova st.,

More information

1; (f) H 0 : = 55 db, H 1 : < 55.

1; (f) H 0 : = 55 db, H 1 : < 55. Reference: Chapter 8 of J. L. Devore s 8 th Edition By S. Maghsoodloo TESTING a STATISTICAL HYPOTHESIS A statistical hypothesis is an assumption about the frequency function(s) (i.e., pmf or pdf) of one

More information

Maximum-Likelihood fitting

Maximum-Likelihood fitting CMP 0b Lecture F. Sigworth Maximum-Likelihood fitting One of the issues I want to address in this lecture is the fitting of distributions dwell times. We want to find the best curve to draw over a histogram,

More information

Lectures 5 & 6: Hypothesis Testing

Lectures 5 & 6: Hypothesis Testing Lectures 5 & 6: Hypothesis Testing in which you learn to apply the concept of statistical significance to OLS estimates, learn the concept of t values, how to use them in regression work and come across

More information

The Goodness-of-fit Test for Gumbel Distribution: A Comparative Study

The Goodness-of-fit Test for Gumbel Distribution: A Comparative Study MATEMATIKA, 2012, Volume 28, Number 1, 35 48 c Department of Mathematics, UTM. The Goodness-of-fit Test for Gumbel Distribution: A Comparative Study 1 Nahdiya Zainal Abidin, 2 Mohd Bakri Adam and 3 Habshah

More information

INTERVAL ESTIMATION AND HYPOTHESES TESTING

INTERVAL ESTIMATION AND HYPOTHESES TESTING INTERVAL ESTIMATION AND HYPOTHESES TESTING 1. IDEA An interval rather than a point estimate is often of interest. Confidence intervals are thus important in empirical work. To construct interval estimates,

More information

Negative binomial distribution and multiplicities in p p( p) collisions

Negative binomial distribution and multiplicities in p p( p) collisions Negative binomial distribution and multiplicities in p p( p) collisions Institute of Theoretical Physics University of Wroc law Zakopane June 12, 2011 Summary s are performed for the hypothesis that charged-particle

More information

14.3 Are Two Distributions Different?

14.3 Are Two Distributions Different? 614 Chapter 14. Statistical Description of Data 14.3 Are Two Distributions Different? Given two sets of data, we can generalize the questions asked in the previous section and ask the single question:

More information

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics)

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics) Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics) Probability quantifies randomness and uncertainty How do I estimate the normalization and logarithmic slope of a X ray continuum, assuming

More information

Statistics Challenges in High Energy Physics Search Experiments

Statistics Challenges in High Energy Physics Search Experiments Statistics Challenges in High Energy Physics Search Experiments The Weizmann Institute of Science, Rehovot, Israel E-mail: eilam.gross@weizmann.ac.il Ofer Vitells The Weizmann Institute of Science, Rehovot,

More information

STATISTICS OF OBSERVATIONS & SAMPLING THEORY. Parent Distributions

STATISTICS OF OBSERVATIONS & SAMPLING THEORY. Parent Distributions ASTR 511/O Connell Lec 6 1 STATISTICS OF OBSERVATIONS & SAMPLING THEORY References: Bevington Data Reduction & Error Analysis for the Physical Sciences LLM: Appendix B Warning: the introductory literature

More information

Review. December 4 th, Review

Review. December 4 th, Review December 4 th, 2017 Att. Final exam: Course evaluation Friday, 12/14/2018, 10:30am 12:30pm Gore Hall 115 Overview Week 2 Week 4 Week 7 Week 10 Week 12 Chapter 6: Statistics and Sampling Distributions Chapter

More information