
Exact Statistical Inference in Parametric Models

Audun Sektnan
December 2016

Specialization Project
Department of Mathematical Sciences, Norwegian University of Science and Technology
Supervisor: Professor Bo Henry Lindqvist


Summary

In this report we look at ways to construct exact statistical tests for parametric models, by generating samples with certain properties. A number of algorithms are implemented in R; most of the work replicates results from either Lindqvist and Taraldsen (2007) or Lockhart et al. (2007). The generated samples are verified against calculated theoretical results, or by comparison with results from other algorithms, whenever this is possible. Some examples are listed for each algorithm. These are later used in goodness-of-fit tests, for evaluating whether or not a certain data set fits well with either the exponential distribution or the gamma distribution.


Contents

Summary
1 Introduction
  1.1 Sufficiency
  1.2 Generation of co-sufficient samples
  1.3 Exponential distribution, using Algorithm 1
  1.4 Verification
  1.5 Truncated exponential distribution
  1.6 Conditions for using Algorithm 1
  1.7 Examples
2 Algorithm 2
  2.1 Jeffreys prior
  2.2 Truncated exponential distribution, using Algorithm 2
  2.3 Convergence diagnostics
  2.4 Verification - Gibbs algorithm
  2.5 Verification - Exact distribution
3 Gibbs sampler - Gamma distribution
  3.1 Transformation of variables
  3.2 Generation of new value
  3.3 Algorithm - Gibbs sampler
  3.4 Verification of rejection step
  3.5 Examples
  3.6 Convergence diagnostics
4 Application: Goodness-Of-Fit Tests
  4.1 Kolmogorov-Smirnov
  4.2 Cramer-von Mises criterion
  4.3 Other tests
  4.4 Examples
Bibliography

Chapter 1

Introduction

One important part of statistical inference is testing whether or not a random sample comes from a specific probability distribution. Such goodness-of-fit (gof) tests on parametric distributions are often based on the normal distribution and are asymptotic, or certain parameters in the distributions have to be specified, as in the Kolmogorov-Smirnov test. Here we will look at ways to formulate exact tests for certain parametric models, in which there is no need to make any inference on the values of the parameters in the distribution. An important statistical concept related to this is that of sufficiency.

1.1 Sufficiency

Consider a random sample $X = (X_1, X_2, \ldots, X_n)$ from a known probability distribution $f_X(x;\theta)$ with an unknown parameter $\theta$, possibly multidimensional. If the goal is to estimate $\theta$ or some function $g(\theta)$, one would calculate some statistic $T(x)$ as an estimator, using the realization $x = (x_1, x_2, \ldots, x_n)$. These estimates might be equal for different realizations $x$ and $y$, and one might wonder if it is possible to summarize the data from the sample $x$ in such a way that no useful information is lost. This is the concept of sufficient statistics, which is defined as follows (Casella and Berger, 2002):

Definition 1.1 A statistic $T(X)$ is a sufficient statistic for $\theta$ if the conditional distribution of the sample $X$ given the value of $T(X)$ does not depend on $\theta$.

From this it follows that any statistic calculated for estimating a function $g(\theta)$ only depends on the value of the sufficient statistic.

1.2 Generation of co-sufficient samples

The term co-sufficient samples (conditional-sufficient samples) is taken from Lockhart et al. (2007) (from now on referred to as Loc07 for simplicity), and refers to samples drawn from the conditional distribution of $X = (X_1, \ldots, X_n)$ given the value of the sufficient statistic $T$:
$$f_{X \mid T}(x \mid T = t; \theta) = f_{X \mid T}(x \mid T = t),$$
assuming that $X$ has a certain probability distribution $f_X(x;\theta)$. This conditional distribution does not depend on the parameter $\theta$, which follows from the definition of a sufficient statistic. Generation of such samples is in general not straightforward. The paper Lindqvist and Taraldsen (2007) (from here on referred to as Lin07 for simplicity) discusses three different algorithms for generating co-sufficient samples, which we will follow here. The first one, Algorithm 1, is rather simple, but works only when certain conditions are fulfilled, which will be discussed later. The algorithm is as follows:

Algorithm 1

Input: A random sample $x = (x_1, \ldots, x_n)$. Requires that you can generate a random vector $U = (U_1, \ldots, U_n)$ from a known density $f_U(u)$, and that you have functions $\chi(u,\theta)$ and $\tau(u,\theta)$ such that $(\chi(U,\theta), \tau(U,\theta)) \sim (X, T)$ when $\theta$ is known.
1: Calculate the value of the sufficient statistic $t$ from the random sample $x = (x_1, \ldots, x_n)$.
2: Draw $U = (U_1, \ldots, U_n)$ from the density $f_U(u)$.
3: Solve the equation $\tau(U,\theta) = t$ for $\theta$; denote the unique solution as $\hat{\theta}(U, t)$.
4: Return $X_t(U) = \chi\!\left(U, \hat{\theta}(U, t)\right)$.

In general it is difficult to find the exact theoretical distribution for the co-sufficient samples. The joint distribution, conditioned on the value of $T$, will be of the form
$$f_{X_1,\ldots,X_n \mid T}(x_1, \ldots, x_n \mid t) = \frac{f_{X_1,\ldots,X_n, T}(x_1, \ldots, x_n, t)}{f_T(t)}, \qquad (1.1)$$
which in most cases will be a complicated expression. The marginal distributions $f_{X_i \mid T}(x_i \mid t)$ can be found by integrating out the rest of the variables $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$. These marginal distributions are the same for all $i$, so when we later check whether the algorithms produce samples from the correct distribution, we can look at the histogram of the entire sample, not just the component $x_i$.

1.3 Exponential distribution, using Algorithm 1

Assume that a random sample $X = (X_1, X_2, \ldots, X_n)$ comes from an exponential distribution, defined as
$$f_X(x;\theta) = \theta e^{-\theta x}, \quad x > 0, \qquad (1.2)$$

where the parameter $\theta$ is not known. A sufficient statistic for the exponential distribution is just the sum
$$T(X) = \sum_{i=1}^{n} X_i. \qquad (1.3)$$
This can be shown using the factorization theorem. If $U_i \sim \mathrm{Exp}(1)$ independent and identically distributed (iid) for all $i = 1, \ldots, n$, then the functions
$$\chi(U,\theta) = \left( \frac{U_1}{\theta}, \ldots, \frac{U_n}{\theta} \right), \qquad \tau(U,\theta) = \frac{\sum_{i=1}^{n} U_i}{\theta}$$
will have the correct distribution $(\chi(U,\theta), \tau(U,\theta)) \sim (X, T)$, according to Lin07. Algorithm 1 can in this case be written as follows:

Algorithm 1 - Exponential distribution

Input: A random sample $x = (x_1, \ldots, x_n)$.
1: Calculate $t = \sum_{i=1}^{n} x_i$.
2: Draw $U = (U_1, \ldots, U_n)$, where $U_i \sim \mathrm{Exp}(1)$ for $i = 1, \ldots, n$.
3: Calculate $\hat{\theta}(u, t) = \sum_{i=1}^{n} u_i / t$.
4: Return $X_t(u) = \left( u_1 / \hat{\theta}, \ldots, u_n / \hat{\theta} \right)$.
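A minimal R sketch of this algorithm might look as follows; the function name and the example parameters are our own choices, not part of the report.

```r
# Algorithm 1 for the exponential distribution: one co-sufficient sample.
# 'x' is the observed sample; the output has the same sum as x.
cosuff_exp <- function(x) {
  n <- length(x)
  t <- sum(x)                      # step 1: sufficient statistic
  u <- rexp(n, rate = 1)           # step 2: U_i ~ Exp(1)
  theta_hat <- sum(u) / t          # step 3: solve tau(u, theta) = t
  u / theta_hat                    # step 4: chi(u, theta_hat)
}

# Example: a sample of size 5000 from Exp(5), as in Example 1.1 below
set.seed(1)
x <- rexp(5000, rate = 5)
x_star <- cosuff_exp(x)
all.equal(sum(x_star), sum(x))     # the sufficient statistic is preserved
```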

Next we try out this algorithm using three different input samples.

Example 1.1 First a sample of size 5000 is drawn from an exponential distribution with parameter $\theta = 5$. The histogram from this sample is shown to the left in Figure 1.1, with the histogram for the co-sufficient sample to the right. This co-sufficient sample is from a conditional distribution that is not exponential, but since $n$ is so large, this distribution is close to the distribution of the original sample.

Figure 1.1: Original sample from an exponential distribution with $\theta = 5$ (left), and one co-sufficient sample assuming an exponential distribution (right).

Example 1.2 A sample of size 5000 is drawn from a uniform distribution between 0 and 1. The histogram from this sample is shown in Figure 1.2, as well as the histogram of one co-sufficient sample. Although the generated sample has the same value of the sufficient statistic $T$, it is easy to see from the histograms that the original sample did not come from an exponential distribution.

Figure 1.2: Original sample from a uniform distribution between 0 and 1 (left), and one co-sufficient sample assuming an exponential distribution (right).

Example 1.3 Next, the original sample is from a lognormal distribution with location parameter $\mu = 0.5$ and scale parameter $\sigma = 0.5$. The histogram from this sample is shown in Figure 1.3, as well as the histogram of one co-sufficient sample. Here the histograms are more similar, but it is clear that the mode of the lognormal distribution is not at $x = 0$, as appears to be the case for the distribution of the co-sufficient sample.

Figure 1.3: Original sample from a lognormal distribution with $(\mu, \sigma) = (0.5, 0.5)$ (left), and one co-sufficient sample assuming an exponential distribution (right).

1.4 Verification

How do we know that the co-sufficient samples generated in Examples 1.1 to 1.3 are correct? In the case of the exponential distribution, it turns out that it is possible to calculate the joint conditional distribution given by equation (1.1), and also the marginal distributions. The sufficient statistic is a sum of $n$ iid exponentially distributed variables with rate parameter $\theta$, and is therefore gamma-distributed with probability density
$$f_T(t) = \mathrm{Gamma}(n, \theta) = \frac{\theta^n}{\Gamma(n)} t^{n-1} e^{-\theta t}, \quad t > 0.$$

Here the gamma distribution is parametrized with the rate parameter instead of the scale parameter. Now we can write
$$f_{X_1,\ldots,X_n, T}(x_1, \ldots, x_n, t) = f_{T \mid X_1,\ldots,X_n}(t \mid x_1, \ldots, x_n)\, f_{X_1,\ldots,X_n}(x_1, \ldots, x_n).$$
Next,
$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i) = \prod_{i=1}^{n} \theta e^{-\theta x_i} = \theta^n e^{-\theta \sum_{i=1}^{n} x_i},$$
for $x_1, \ldots, x_n > 0$, and zero elsewhere. Next,
$$f_{T \mid X_1,\ldots,X_n}(t \mid x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} x_i = t \\ 0 & \text{else.} \end{cases}$$
Hence, putting these results into equation (1.1) and cancelling some factors, we have that the joint conditional distribution can be written as
$$f_{X_1,\ldots,X_n \mid T}(x_1, \ldots, x_n \mid t) = \begin{cases} \dfrac{\Gamma(n)}{t^{n-1}} & \text{if } \sum_{i=1}^{n} x_i = t \\ 0 & \text{else.} \end{cases}$$
We are interested in the marginal distributions, to test if Algorithm 1 works for the exponential distribution. The marginal distribution of $x_1$, conditioned on $t$, is found by integrating out

$x_2, \ldots, x_n$. This can be done as follows:
$$\begin{aligned}
f_{X_1 \mid T}(x_1 \mid t) &= \frac{\Gamma(n)}{t^{n-1}} \int_0^{t-x_1} \int_0^{t-x_1-x_2} \cdots \int_0^{t-x_1-\cdots-x_{n-2}} \mathrm{d}x_{n-1} \cdots \mathrm{d}x_3\, \mathrm{d}x_2 \\
&= \frac{\Gamma(n)}{t^{n-1}} \int_0^{t-x_1} \cdots \int_0^{t-x_1-\cdots-x_{n-3}} \left( t - x_1 - x_2 - \cdots - x_{n-2} \right) \mathrm{d}x_{n-2} \cdots \mathrm{d}x_2 \\
&= \frac{\Gamma(n)}{t^{n-1}} \int_0^{t-x_1} \cdots \int_0^{t-x_1-\cdots-x_{n-4}} \frac{\left( t - x_1 - x_2 - \cdots - x_{n-3} \right)^2}{2} \mathrm{d}x_{n-3} \cdots \mathrm{d}x_2.
\end{aligned}$$
This pattern continues, and the final result is
$$f_{X_1 \mid T}(x_1 \mid t) = \frac{\Gamma(n)}{t^{n-1}} \frac{(t - x_1)^{n-2}}{(n-2)!},$$
which simplifies to
$$f_{X_1 \mid T}(x_1 \mid t) = \frac{n-1}{t^{n-1}} (t - x_1)^{n-2}, \quad x_1 \in [0, t],\ n \ge 2. \qquad (1.4)$$
This marginal distribution can now be used to check if Algorithm 1 produces co-sufficient samples from the correct distribution, in the case of the exponential distribution. Note that this marginal conditional distribution does not depend on the parameter $\theta$, because we are conditioning on a sufficient statistic. Figure 1.4 shows the results for three different input values of $t$ and $n$. In all three cases we see that the histograms of the generated data fit well with the theoretical distribution specified by equation (1.4). When $n = 2$ we see that the distribution is a constant function, and when $n = 3$ it is a linear function. In any case, for $n > 2$ the distribution has a maximum at $x_1 = 0$ and is a decreasing function between 0 and $t$.
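As an illustration (our own sketch, with arbitrary choices of $n$ and $t$), the generated samples can be compared with the density (1.4) by reusing the cosuff_exp function sketched above:

```r
# Check Algorithm 1 for the exponential case against equation (1.4).
set.seed(2)
n <- 5; t <- 2
x0 <- rep(t / n, n)                          # any input sample with sum t
x1_draws <- replicate(1e5, cosuff_exp(x0)[1])

hist(x1_draws, breaks = 50, freq = FALSE,
     main = "First component of co-sufficient samples", xlab = "x1")
curve((n - 1) / t^(n - 1) * (t - x)^(n - 2), from = 0, to = t,
      col = "blue", add = TRUE)              # theoretical marginal (1.4)
```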

Figure 1.4: Histograms of $x_1$ for three different values of $n$, using Algorithm 1 on the exponential distribution, as well as the corresponding theoretical densities.

1.5 Truncated exponential distribution

Assume that a random sample $X = (X_1, X_2, \ldots, X_n)$ comes from a truncated exponential distribution, defined as
$$f_X(x;\theta) = \begin{cases} \dfrac{\theta e^{\theta x}}{e^{\theta} - 1} & \text{if } \theta \in \mathbb{R} \setminus \{0\} \\ 1 & \text{if } \theta = 0, \end{cases} \quad \text{for } 0 \le x \le 1, \qquad (1.5)$$
where the parameter $\theta \in \mathbb{R}$ is not known. This distribution is obtained by truncating the exponential distribution at $x = 1$, and is actually a valid probability distribution for all $\theta \in \mathbb{R}$, not just for positive values (this distribution is used in Lin07). A sufficient statistic for the truncated exponential distribution is the sum
$$T(X) = \sum_{i=1}^{n} X_i, \qquad (1.6)$$
just as in the case of an exponential sample. This can easily be seen in the same way as for the exponential distribution. To use Algorithm 1 we need to generate a random vector $U$ and find the functions $\chi(u,\theta)$ and $\tau(u,\theta)$ such that $(\chi(U,\theta), \tau(U,\theta)) \sim (X, T)$ when $\theta$ is known. This can be done by inversion, because the cumulative distribution function for a random variable from

the truncated exponential distribution is
$$F_X(x;\theta) = \begin{cases} \dfrac{1 - e^{\theta x}}{1 - e^{\theta}} & \text{if } \theta \in \mathbb{R} \setminus \{0\} \\ x & \text{if } \theta = 0, \end{cases} \quad \text{for } 0 \le x \le 1.$$
Solving $u = F_X(x;\theta)$ for $x$, assuming $\theta \ne 0$, leads to
$$x = \frac{\ln\!\left( 1 + \left( e^{\theta} - 1 \right) u \right)}{\theta}.$$
So, if $U_i \sim \mathrm{Unif}[0,1]$, then this leads to $x_i \sim f_X(x;\theta)$ as defined in equation (1.5). We can therefore choose the functions
$$\chi(U,\theta) = \left( \frac{\ln\!\left( 1 + \left( e^{\theta} - 1 \right) U_1 \right)}{\theta}, \ldots, \frac{\ln\!\left( 1 + \left( e^{\theta} - 1 \right) U_n \right)}{\theta} \right), \qquad \tau(U,\theta) = \sum_{i=1}^{n} \frac{\ln\!\left( 1 + \left( e^{\theta} - 1 \right) U_i \right)}{\theta}.$$
We can now use Algorithm 1 on the truncated exponential distribution. Solving for $\hat{\theta}$ in step 3 is done numerically, using the uniroot function in R.

Algorithm 1 - Truncated exponential distribution

Input: A random sample $x = (x_1, \ldots, x_n)$.
1: Calculate $t = \sum_{i=1}^{n} x_i$ from the random sample $x = (x_1, \ldots, x_n)$.
2: Draw $U = (U_1, \ldots, U_n)$, where $U_i \sim \mathrm{Unif}[0,1]$ for $i = 1, \ldots, n$.
3: Solve
$$\sum_{i=1}^{n} \frac{\ln\!\left( 1 + \left( e^{\theta} - 1 \right) u_i \right)}{\theta} = t.$$
This must be done numerically. Denote the unique solution as $\hat{\theta}(U, t)$.
4: Return
$$X_t(u) = \left( \frac{\ln\!\left( 1 + \left( e^{\hat{\theta}} - 1 \right) u_1 \right)}{\hat{\theta}}, \ldots, \frac{\ln\!\left( 1 + \left( e^{\hat{\theta}} - 1 \right) u_n \right)}{\hat{\theta}} \right).$$

In the truncated exponential case, it is known that the distribution of a co-sufficient sample of size $n$ is the distribution of $n$ independent uniformly distributed variables between 0 and 1, given their sum. The reason for this is as follows: the conditional distribution of $(X_1, \ldots, X_n) \mid (T = t)$ should be independent of $\theta$, because this is what defines a sufficient statistic. It follows that we can choose the parameter value $\theta = 0$ in equation (1.5), which means the $X_i$'s are uniformly distributed between 0 and 1. Hence, the co-sufficient sample $(X_1, \ldots, X_n) \mid (T = t)$ has the distribution of $n$ independent uniform random variables between 0 and 1, given their sum. This result can be used to verify whether the co-sufficient samples generated using Algorithm 1 have the correct distribution. The simplest case is $n = 2$, where $X_1 \mid (X_1 + X_2 = t)$ can be shown to be uniformly distributed between 0 and $t$ for $0 \le t \le 1$, and uniformly distributed between $(t - 1)$ and 1 for $1 \le t \le 2$. Similarly, $X_2$ conditioned on $T$ will have the same distribution as $X_1$ conditioned on $T$. A small R sketch of the procedure is given below, and we then test the algorithm in Example 1.4.
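A minimal R sketch of Algorithm 1 for the truncated exponential distribution is given below; the helper names, the guard near $\theta = 0$, and the search interval for uniroot are our own choices.

```r
# tau(u, theta) - t for the truncated exponential distribution
tau_minus_t <- function(theta, u, t) {
  if (abs(theta) < 1e-10) return(sum(u) - t)   # continuous limit at theta = 0
  sum(log(1 + (exp(theta) - 1) * u)) / theta - t
}

# Algorithm 1 (truncated exponential): one attempted co-sufficient sample
cosuff_truncexp <- function(x, interval = c(-50, 50)) {
  t <- sum(x)
  u <- runif(length(x))
  # step 3: solve tau(u, theta) = t numerically
  theta_hat <- uniroot(function(th) tau_minus_t(th, u, t), interval)$root
  log(1 + (exp(theta_hat) - 1) * u) / theta_hat   # step 4
}
```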

Example 1.4 Co-sufficient samples, assuming a truncated exponential distribution, are generated using Algorithm 1. The input sample is (0.5, 0.5), which is chosen to make the resulting sufficient statistic equal to 1. Hence, a histogram of $x_1$ conditioned on $t = 1$ should be approximately constant between 0 and 1, as would be the case for a uniform variable between 0 and 1. The histogram of $x_1$ from the generated samples is shown in Figure 1.5.

Figure 1.5: Histogram of $x_1$ using Algorithm 1 on the truncated exponential distribution, as well as the theoretical distribution (blue line).

Clearly the distribution of the generated co-sufficient samples is not uniform, so there must be something wrong with the procedure. As mentioned earlier, Algorithm 1 has some conditions that must hold for the generated samples to have the correct distribution, and it turns out that not all of these are fulfilled in the case of the truncated exponential distribution.

1.6 Conditions for using Algorithm 1

The following must hold in order for Algorithm 1 to work:

Uniqueness: The equation $\tau(u,\theta) = t$ has to have a unique solution $\hat{\theta}(U, t)$.

Pivotal condition: The function $\tau(u,\theta)$ depends on $u$ only through a function $r(u)$ that does not depend on $\theta$.

Independence condition: The output sample $\chi(u, \hat{\theta})$ is independent of $\tau(u,\theta)$ for some value of $\theta$.

It is clear that the pivotal condition does not hold in the case of the truncated exponential distribution, and that is why Algorithm 1 produces samples from the wrong distribution. The way to overcome this is to put a prior distribution $\pi(\theta)$ on the parameter $\theta$, which will be considered in the next chapter.

1.7 Examples

We will work with two different data sets:

Ball bearing data: Failure data for 23 ball bearings, measured in millions of revolutions to fatigue failure, gathered from the lecture notes for the course TMA4275: Lifetime Analysis at NTNU (TMA4275, 2016).

Premier League data: The number of points for the football team finishing last in the Premier League, during the period studied (Altomfotball.no, 2016).

Algorithm 1 is used to generate co-sufficient samples from these data sets, assuming an exponential distribution. The histograms are shown in Figure 1.6 and Figure 1.7. In both cases we see that the co-sufficient samples seem to have a different shape than the data. In Chapter 4 we will analyze these data sets for gof.

Figure 1.6: Histogram of the co-sufficient samples generated from the ball bearing data (left) and the histogram of the original data (right), assuming an exponential distribution.

Figure 1.7: Histogram of the co-sufficient samples generated from the Premier League data (left) and the histogram of the original data (right), assuming an exponential distribution.

Chapter 2

Algorithm 2

In this chapter we look at the extension of Algorithm 1 to the case where the conditions mentioned in the previous chapter do not hold. Algorithm 2 is described in Lin07, and here we follow the application of this algorithm to the truncated exponential distribution.

2.1 Jeffreys prior

Algorithm 2 involves choosing a prior for $\theta$, for instance Jeffreys prior. This is given as the square root of the Fisher information. The truncated exponential distribution is given by equation (1.5), so we get
$$\ln f_X(x;\theta) = \ln\theta + \theta x - \ln\!\left(e^{\theta} - 1\right).$$
Differentiating this expression twice gives
$$\frac{\partial}{\partial\theta} \ln f_X(x;\theta) = \frac{1}{\theta} + x - \frac{e^{\theta}}{e^{\theta} - 1},$$
$$\frac{\partial^2}{\partial\theta^2} \ln f_X(x;\theta) = -\frac{1}{\theta^2} + \frac{e^{\theta}}{\left(e^{\theta} - 1\right)^2},$$

and so
$$I(\theta) = -\frac{\partial^2}{\partial\theta^2} \ln f_X(x;\theta) = \frac{1}{\theta^2} - \frac{e^{\theta}}{\left(e^{\theta} - 1\right)^2}.$$
Hence, Jeffreys prior is in this case
$$\pi(\theta) = \sqrt{\frac{1}{\theta^2} - \frac{e^{\theta}}{\left(e^{\theta} - 1\right)^2}}. \qquad (2.1)$$
This can be used as a prior in Algorithm 2, described later.

2.2 Truncated exponential distribution, using Algorithm 2

The algorithm requires that you can generate a random vector $U = (U_1, \ldots, U_n)$ from a known density $f_U(u)$, and that you have functions $\chi(u,\theta)$ and $\tau(u,\theta)$ such that $(\chi(U,\theta), \tau(U,\theta)) \sim (X, T)$ when $\theta$ is known. Now the parameter is considered a random variable $\Theta$, with a chosen prior distribution $\pi(\theta)$, independent of $U$. Denote by $W_t(u)$ the density of $\tau(\Theta, u)$ evaluated at $t$. The algorithm generates a random vector $V$ from a distribution proportional to $W_t(u) f_U(u)$, in contrast to Algorithm 1, where a random vector $U$ was generated from the distribution $f_U(u)$. Solving $\tau(u,\theta) = t$ in terms of $\theta$ and denoting this solution as $\hat{\theta}(u, t)$ (it must be unique), the density of $\tau(\Theta, u)$ at $t$ can be written as
$$W_t(u) = \frac{\pi(\theta)}{\left| \det\!\left( \partial_\theta \tau(u,\theta) \right) \right|} \Bigg|_{\theta = \hat{\theta}(u,t)}. \qquad (2.2)$$
In the case of the truncated exponential distribution, Lin07 uses ordinary inversion, where $f_{U_i}(u_i)$ is the standard uniform distribution:
$$U_i \sim \mathrm{Unif}[0,1]. \qquad (2.3)$$
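For use in the algorithm below, Jeffreys prior (2.1) can be coded directly; a small sketch (the function name is our own):

```r
# Jeffreys prior (2.1) for the truncated exponential model (theta != 0).
# Evaluated naively, the expression suffers from numerical cancellation
# very close to theta = 0, where the true limit is sqrt(1/12).
jeffreys_prior <- function(theta) {
  sqrt(1 / theta^2 - exp(theta) / (exp(theta) - 1)^2)
}
```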

The functions used in the algorithm are
$$\chi(u,\theta) = \left( \frac{\log\!\left(1 + (e^{\theta} - 1)u_1\right)}{\theta}, \ldots, \frac{\log\!\left(1 + (e^{\theta} - 1)u_n\right)}{\theta} \right), \qquad \tau(u,\theta) = \sum_{i=1}^{n} \frac{\log\!\left(1 + (e^{\theta} - 1)u_i\right)}{\theta}.$$
In this case one can write
$$\partial_\theta \tau(u,\theta) \Big|_{\theta = \hat{\theta}(u,t)} = \frac{e^{\hat{\theta}(u,t)}}{\hat{\theta}(u,t)} \sum_{i=1}^{n} \frac{u_i}{1 + \left(e^{\hat{\theta}(u,t)} - 1\right) u_i} - \frac{t}{\hat{\theta}(u,t)}. \qquad (2.4)$$
The algorithm generating the $V$'s is a Markov chain Monte Carlo (MCMC) algorithm, where the proposal is $U = (U_1, \ldots, U_n)$ with $U_i \sim \mathrm{Unif}[0,1]$. This proposal is also used as initialization.

Algorithm 2

Input: A random sample $x = (x_1, \ldots, x_n)$ and the number of iterations $m$.
1: Calculate $t = \sum_{i=1}^{n} x_i$.
2: Initialize $v_1$ by drawing $n$ random variables iid from the standard uniform distribution.
For $j = 2, 3, \ldots, m$:
3: Generate a random sample $U = (U_1, \ldots, U_n)$, where $U_i \sim \mathrm{Unif}[0,1]$.
4: Solve $\sum_{i=1}^{n} \log\!\left(1 + (e^{\theta} - 1)u_i\right)/\theta = t$ numerically for $\theta$, and denote the solution $\hat{\theta}(u, t)$.
5: Calculate the ratio
$$\alpha = \frac{W_t(u) f_U(u)}{W_t(v_{j-1}) f_U(v_{j-1})},$$
using equations (2.2), (2.3) and (2.4), where $v_{j-1}$ is the sample generated at the previous iteration.
6: Draw $z \sim \mathrm{Unif}[0,1]$.
7: If $z < \alpha$ set $v_j = u$; if not, set $v_j = v_{j-1}$.
End iteration
8: Return $v = (v_1, \ldots, v_m)$.

The unique weights used are the values of the density $W_t(u)$ for each accepted sample $u$. The priors used here are Jeffreys prior, given by equation (2.1), and the simpler prior function $\pi(\theta) = 1/\theta$. The prior $1/\theta$ goes towards infinity when $\theta \to 0$, and both priors are numerically delicate near $\theta = 0$. This will affect the performance of the algorithm, and it is therefore chosen that the acceptance probability $\alpha$ is set to zero whenever $\hat{\theta}(u, t)$ or $\hat{\theta}(v_{j-1}, t)$ is less than 0.5.
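A hedged R sketch of Algorithm 2 for this model is given below. It reuses the jeffreys_prior and tau_minus_t helpers sketched earlier and solves step 4 with uniroot; the function names and numerical choices are our own, and the safeguard described above (setting $\alpha$ to zero for small $\hat{\theta}$) is left out of the sketch for brevity.

```r
# Density W_t(u) of tau(Theta, u) at t, equation (2.2): prior / |d tau / d theta|.
W_t <- function(u, t, theta_hat, prior = jeffreys_prior) {
  dtau <- exp(theta_hat) / theta_hat * sum(u / (1 + (exp(theta_hat) - 1) * u)) -
    t / theta_hat                                   # equation (2.4)
  prior(theta_hat) / abs(dtau)
}

# Algorithm 2: independence Metropolis-Hastings with Unif(0,1)^n proposals.
algorithm2 <- function(x, m, prior = jeffreys_prior) {
  n <- length(x); t <- sum(x)
  v <- matrix(NA, nrow = m, ncol = n)
  v[1, ] <- runif(n)                                # step 2: initialization
  theta_v <- uniroot(function(th) tau_minus_t(th, v[1, ], t), c(-50, 49))$root
  w_v <- W_t(v[1, ], t, theta_v, prior)
  for (j in 2:m) {
    u <- runif(n)                                   # step 3: proposal
    theta_u <- uniroot(function(th) tau_minus_t(th, u, t), c(-50, 49))$root
    w_u <- W_t(u, t, theta_u, prior)
    if (runif(1) < w_u / w_v) {                     # steps 5-7 (f_U terms cancel)
      v[j, ] <- u; w_v <- w_u
    } else {
      v[j, ] <- v[j - 1, ]
    }
  }
  v
}
```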

Figure 2.1 shows the distribution of $x_1$ from the co-sufficient samples, in the case when the input sample is (0.5, 0.5) and the prior distribution is $1/\theta$.

Figure 2.1: Histogram of co-sufficient samples generated using Algorithm 2 (left), and the histogram of unique weights, using $1/\theta$ as prior (right). Sample size $n = 2$ and sum $t = 1$.

Figure 2.2 shows the distribution of $x_1$ from the co-sufficient samples, in the case when the input sample is (0.5, 0.5) and the prior distribution on $\theta$ is Jeffreys prior. We observe that the weights are more spread out in the case of $\pi(\theta) = 1/\theta$ than in the case of Jeffreys prior. In both cases, however, we see that the histogram of $x_1$ looks to be distributed correctly for this input sample, and that there does not seem to be a big difference in the rate of convergence. The same number of iterations $m$ is used in both cases.

Figure 2.2: Histogram of co-sufficient samples generated using Algorithm 2 (left), and the histogram of unique weights, using Jeffreys prior (right). Sample size $n = 2$ and sum $t = 1$.

2.3 Convergence diagnostics

Algorithm 2 is an MCMC algorithm that will generate samples that are correlated. To analyze whether the algorithm converges and how the data are correlated, we plot the trace of the first 1000 values of $x_1$ and the sample autocorrelation function (acf) of both $x_1$ and $x_2$ from the generated

samples in the previous section, using Jeffreys prior. This is shown in Figure 2.3. The acfs appear to decay exponentially, but go very slowly towards zero, meaning that there is a high degree of correlation in the co-sufficient samples. The trace plot looks reasonable, and the acceptance rate was in this case 63.3%, so we can conclude that Algorithm 2 seems to converge in a decent way.

2.4 Verification - Gibbs algorithm

The distribution of the co-sufficient samples for the truncated exponential distribution is, as shown earlier, that of $n$ independent uniformly distributed random variables between 0 and 1, given their sum. It turns out that samples from this distribution can be generated using a Gibbs algorithm as follows (taken from Lindqvist and Rannestad (2011)).

Figure 2.3: Trace plot of the first 1000 iterations and acfs for the generated samples using Algorithm 2 on the truncated exponential distribution.

Algorithm - Gibbs algorithm

Input: A random sample $x = (x_1, \ldots, x_n)$ and a chosen number of iterations $M$.
1: Calculate the sum of the original sample, $t = \sum_{i=1}^{n} x_i$.
2: Initialize $X_i^0 = t/n$, so that all the sample points have the same value.
Iterate $m$ from 1 to $M$:
3: Draw two integers $i < j$ from $\{1, \ldots, n\}$, and compute $a = X_i^m + X_j^m$.
4: If $a \le 1$, draw $X_i^{m+1} \sim \mathrm{Unif}[0, a]$; if not, draw $X_i^{m+1} \sim \mathrm{Unif}[a - 1, 1]$.
5: Calculate $X_j^{m+1} = a - X_i^{m+1}$. Set the remaining $n - 2$ points of $X^{m+1}$ equal to those of $X^m$.
End iteration

A minimal R sketch of this algorithm is given below. The algorithm can be used to verify whether Algorithm 2 works for the truncated exponential distribution, also in the general case of a sample of size $n$. Figure 2.4 compares histograms of co-sufficient samples generated using Algorithm 2 and the Gibbs algorithm, for three different values of $n$ and $t$, using the same number of generations for both methods. From this it seems that Algorithm 2 is working properly.
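The sketch below is our own implementation of the steps above; the function name is an assumption.

```r
# Gibbs algorithm for n independent Unif(0,1) variables conditioned on their sum.
# Returns an M x n matrix of (correlated) co-sufficient samples.
gibbs_uniform_sum <- function(x, M) {
  n <- length(x); t <- sum(x)
  X <- rep(t / n, n)                       # step 2: start with equal values
  out <- matrix(NA, nrow = M, ncol = n)
  for (m in 1:M) {
    ij <- sort(sample(n, 2))               # step 3: pick i < j
    a <- X[ij[1]] + X[ij[2]]
    xi <- if (a <= 1) runif(1, 0, a) else runif(1, a - 1, 1)   # step 4
    X[ij[1]] <- xi
    X[ij[2]] <- a - xi                     # step 5
    out[m, ] <- X
  }
  out
}
```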

2.5 Verification - Exact distribution

In Chapter 1 we calculated the theoretical distribution of the co-sufficient samples. We try the same method here, but limit ourselves to the cases $n = 2$ and $n = 3$, because the distribution gets quite difficult to calculate for higher $n$. First the parameter $\theta$ is chosen to be zero. We can do this because the co-sufficient samples have a distribution that is independent of $\theta$, so one can choose any value for this parameter. This means that we are looking for the distribution of $n$ independent random variables uniformly distributed between 0 and 1, given their sum. Following the procedure used for the exponential case in Chapter 1, we first note that the distribution of the sufficient statistic is (WolframMathWorld, 2016)
$$f_T(t) = \frac{1}{2(n-1)!} \sum_{k=0}^{n} (-1)^k \binom{n}{k} (t - k)^{n-1} \operatorname{sgn}(t - k).$$
Hence, the joint conditional distribution can be written as
$$f_{X_1,\ldots,X_n \mid T}(x_1, \ldots, x_n \mid t) = \begin{cases} \dfrac{2(n-1)!}{\sum_{k=0}^{n} (-1)^k \binom{n}{k} (t - k)^{n-1} \operatorname{sgn}(t - k)} & \text{if } \sum_{i=1}^{n} x_i = t \text{ and } 0 \le x_i \le 1,\ i = 1, \ldots, n \\ 0 & \text{else.} \end{cases}$$
To find the marginal distribution of $X_1$ given $T$, we need to integrate out $x_2, \ldots, x_n$. It turns out that this is more difficult than in the case of the exponential distribution, because of the restriction $0 \le x_i \le 1$, $i = 1, \ldots, n$. Therefore we only look at the case $n = 3$, where we get
$$f_{X_1 \mid T}(x_1 \mid t) = \frac{2(n-1)!}{\sum_{k=0}^{n} (-1)^k \binom{n}{k} (t - k)^{n-1} \operatorname{sgn}(t - k)} \int_{\max\{0,\, t - x_1 - 1\}}^{\min\{1,\, t - x_1\}} \mathrm{d}x_2,$$
where the limits follow from the restrictions. Solving the integral and inserting $n = 3$, we get
$$f_{X_1 \mid T}(x_1 \mid t) = \frac{4 \left( \min\{1, t - x_1\} - \max\{0, t - x_1 - 1\} \right)}{\sum_{k=0}^{3} (-1)^k \binom{3}{k} (t - k)^2 \operatorname{sgn}(t - k)}. \qquad (2.5)$$

This marginal distribution can now be used to test Algorithm 2 for the truncated exponential distribution in the case $n = 3$. This is done for two values of the sufficient statistic, $t = 1.5$ and $t = 2.5$, and the histograms of the generated samples, compared with the theoretical density function from equation (2.5), are shown in Figure 2.5. The same number of generations is used in both cases, and we see that the generated samples seem to be correctly distributed, at least for $n = 3$.
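For reference, the density (2.5) is easy to code; the small sketch below (function name and vectorization are our own choices) can be overlaid on histograms such as those in Figure 2.5.

```r
# Marginal density (2.5) of X1 given T = t, for n = 3 uniforms on (0,1).
dmarg_n3 <- function(x1, t) {
  k <- 0:3
  denom <- sum((-1)^k * choose(3, k) * (t - k)^2 * sign(t - k))
  num <- 4 * (pmin(1, t - x1) - pmax(0, t - x1 - 1))
  ifelse(x1 >= 0 & x1 <= 1 & num > 0, num / denom, 0)
}

curve(dmarg_n3(x, t = 1.5), from = 0, to = 1)   # compare with Figure 2.5 (left)
```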

Figure 2.4: Histogram of the co-sufficient samples using Algorithm 2 (left), compared to the samples obtained by using the Gibbs algorithm (right), for three different values of $n$ and $t$ (panels for $n = 2$, $n = 3$ and $n = 5$).

Figure 2.5: Histogram of the co-sufficient samples using Algorithm 2 for two different values of the sufficient statistic $t$ (sample size $n = 3$, sums $t = 1.5$ and $t = 2.5$), as well as the theoretical distribution.

Chapter 3

Gibbs sampler - Gamma distribution

A method for generating co-sufficient samples for the gamma distribution, using a Gibbs sampler, is given in Loc07. This is an alternative to the methods previously discussed. Here we will try to replicate the method used.

3.1 Transformation of variables

The density of a gamma distributed variable is
$$f_X(x) = \frac{1}{\beta^{\alpha} \Gamma(\alpha)} x^{\alpha - 1} e^{-x/\beta}, \quad x \ge 0,$$
where $\alpha$ and $\beta$ are the shape and scale parameters, respectively. If $(X_1, \ldots, X_n)$ is a random sample where the $X_i$ are drawn iid from a gamma distribution, then $T = (s, p)$ is a sufficient statistic, where
$$s = \sum_{i=1}^{n} X_i, \qquad p = \prod_{i=1}^{n} X_i.$$

In the algorithm below we need the joint density of $(s, p, X_3, \ldots, X_n)$, which can be found using the multivariate transformation formula
$$f_{s,p,X_3,\ldots,X_n}(s, p, x_3, \ldots, x_n) = \frac{f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)}{|J(s, p, x_3, \ldots, x_n)|},$$
where $J(s, p, x_3, \ldots, x_n)$ is the Jacobian determinant of the transformation. This is done in Loc07, and the result is
$$f_{s,p,X_3,\ldots,X_n}(s, p, x_3, \ldots, x_n) = \frac{p^{\alpha - 1} e^{-s/\beta}}{\beta^{n\alpha} (\Gamma(\alpha))^n} \cdot \frac{2}{\prod_{i=3}^{n} x_i \sqrt{\left(s - \sum_{i=3}^{n} x_i\right)^2 - 4p \big/ \prod_{i=3}^{n} x_i}}. \qquad (3.1)$$

3.2 Generation of new value

The first step in the algorithm will be to generate a new value for $x_n$, denoted $x^*$, conditioned on $(s, p, x_3, \ldots, x_{n-1})$. Using equation (3.1), it can be seen that this conditional distribution, denoted $f_c(x_n)$, must be proportional to
$$f_c(x_n) \propto \frac{1}{x_n \sqrt{\left(s - \sum_{i=3}^{n} x_i\right)^2 - 4p \big/ \prod_{i=3}^{n} x_i}}.$$
In Loc07 they mention two ways to generate this value: either by using a rejection algorithm, or by calculating the numerical cdf and then using inversion. Here we will use the former of these two approaches. As done in Loc07, we define $C = s - \tilde{s}$ and $D = p/\tilde{p}$, where $\tilde{s} = \sum_{i=3}^{n-1} x_i$ and $\tilde{p} = \prod_{i=3}^{n-1} x_i$. It can be seen that the conditional distribution can be written as
$$f_c(x_n) \propto \frac{1}{x_n \sqrt{(C - x_n)^2 - 4D/x_n}}.$$
Transforming $v = x_n / C$, this leads to a conditional density for $v$ of the form
$$f_c(v) = \frac{K}{\sqrt{v}\, \sqrt{v(1 - v)^2 - c}}, \qquad c = \frac{4D}{C^3}, \qquad (3.2)$$

where $K$ is found by normalization using numerical integration. The first step is to find out where the density is non-zero, which is where the value of $h(v) = v(1 - v)^2 - c$ is positive. This is a polynomial of order 3, and the interval $[a, b]$ is found numerically. Because $h(v)$ is continuous, with limit $-\infty$ when $v \to -\infty$, limit $+\infty$ when $v \to +\infty$, and $h'(v) = 0$ for $v = 1/3$ and $v = 1$, it follows that $a$ is less than $1/3$, $b$ is between $1/3$ and 1, and the last root $d$ is larger than 1. The function $h(v)$ is positive on the interval $(a, b)$, as well as on $(d, \infty)$, but because all the $x_i$'s are positive, the new value must satisfy $x_n^* < x_1 + x_2 + x_n = C$, and hence we get the restriction $v < 1$. So $v \in (a, b)$ is the interval where the density in equation (3.2) is non-zero. Next we use the beta distribution as a proposal, which has density
$$q(x) = \frac{1}{B(\alpha_b, \beta_b)} x^{\alpha_b - 1} (1 - x)^{\beta_b - 1}, \quad 0 < x < 1.$$
Here $B(\alpha_b, \beta_b)$ is the Beta function, and $\alpha_b$ and $\beta_b$ are the parameters of the distribution. Next we define $v = a + x(b - a)$ to transform this distribution to the same interval as the target distribution. This leads to
$$q(v) = \frac{(v - a)^{\alpha_b - 1} (b - v)^{\beta_b - 1} (b - a)^{1 - \alpha_b - \beta_b}}{B(\alpha_b, \beta_b)}, \quad a < v < b.$$
Here the parameters in the beta distribution are chosen to be $\alpha_b = 0.5$ and $\beta_b = 0.5$. To use a rejection algorithm it is necessary to find a constant $k \ge 1$ such that
$$k\, q(v) \ge f_c(v) \quad \forall\, v \in [a, b]. \qquad (3.3)$$

This is done numerically, by finding the maximum of
$$\frac{f_c(v)}{q(v)} = \pi K \sqrt{\frac{(v - a)(b - v)}{v\left(v(1 - v)^2 - c\right)}}$$
on this interval, having used that $B\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = \pi$. This will be possible if the limits when $v \to a$ and $v \to b$ exist. Noting that
$$\lim_{v \to a} \frac{(v - a)(b - v)}{v\left(v(1 - v)^2 - c\right)} = \frac{b - a}{a}\, \lim_{v \to a} \frac{v - a}{v(1 - v)^2 - c} = \frac{b - a}{a\left(3a^2 - 4a + 1\right)},$$
having used L'Hôpital's rule, and similarly
$$\lim_{v \to b} \frac{(v - a)(b - v)}{v\left(v(1 - v)^2 - c\right)} = -\frac{b - a}{b\left(3b^2 - 4b + 1\right)},$$
we get
$$\lim_{v \to a} \frac{f_c(v)}{q(v)} = \pi K \sqrt{\frac{b - a}{a\left(3a^2 - 4a + 1\right)}}, \qquad \lim_{v \to b} \frac{f_c(v)}{q(v)} = \pi K \sqrt{-\frac{b - a}{b\left(3b^2 - 4b + 1\right)}}.$$
These limits will be finite as long as $a \ne \tfrac{1}{3}$ and $b \ne 1$, respectively. Hence there must be a $k$ as required in equation (3.3).
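A hedged R sketch of this rejection step is given below; the function name, the use of polyroot and optimize, the safety factor on $k$, and the numerical normalization are our own choices, and the sketch assumes the usual case where $h(v)$ has two roots in $(0, 1)$.

```r
# Rejection sampling of v from the density (3.2), given C and D.
sample_v <- function(C, D, alpha_b = 0.5, beta_b = 0.5) {
  cc <- 4 * D / C^3
  # roots of h(v) = v^3 - 2v^2 + v - c; keep the two smallest (a < b < 1)
  r <- sort(Re(polyroot(c(-cc, 1, -2, 1))))
  a <- r[1]; b <- r[2]
  f_un <- function(v) 1 / sqrt(v * (v * (1 - v)^2 - cc))    # unnormalized f_c
  K <- 1 / integrate(f_un, a, b)$value                      # normalizing constant
  q <- function(v) dbeta((v - a) / (b - a), alpha_b, beta_b) / (b - a)
  # rejection constant: maximize f_c / q over (a, b), with a small safety margin
  k <- 1.01 * optimize(function(v) K * f_un(v) / q(v), c(a, b), maximum = TRUE)$objective
  repeat {
    v <- a + (b - a) * rbeta(1, alpha_b, beta_b)            # proposal on (a, b)
    if (runif(1) < K * f_un(v) / (k * q(v))) return(v)
  }
}

# Example corresponding to Section 3.4: x = (1,2,3,4,5), replacing x5 = 5
x <- 1:5
C <- sum(x) - sum(x[3:4]); D <- prod(x) / prod(x[3:4])
x_new <- C * sample_v(C, D)
```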

Algorithm - Gibbs sampler

Input: A random sample $x = (x_1, \ldots, x_n)$.
Iterate $n - 2$ times:
1: Generate a new value for $x_n$, denoted $x^*$, conditioned on $(s, p, x_3, \ldots, x_{n-1})$. This is done using a rejection algorithm, as described earlier.
2: Replace $x_n$ by $x^*$.
3: Rotate the sample one step to the left and relabel. See Figure 3.1.
4: Recalculate $C$ and $D$ based on the new sample. These values will be used in Step 1 of the next iteration.
End iteration
5: Calculate the values of $x_1$ and $x_2$.

Figure 3.1: Illustration of how the Gibbs sampler algorithm works (new value, rotate, relabel). Here the entries in bold are the newly generated values.

The last step is done by solving the equations
$$x_1 + x_2 = s - \sum_{i=3}^{n} x_i, \qquad x_1 x_2 = \frac{p}{\prod_{i=3}^{n} x_i}.$$

These equations have the solution
$$x = \frac{\left(s - \sum_{i=3}^{n} x_i\right) \pm \sqrt{\left(s - \sum_{i=3}^{n} x_i\right)^2 - 4p \big/ \prod_{i=3}^{n} x_i}}{2}.$$
The order of $x_1$ and $x_2$ is decided by flipping a coin.

3.4 Verification of rejection step

Let $x = (1, 2, 3, 4, 5)$ be a sample where we want to generate a new value $v$ from the density in equation (3.2), to replace $x_5 = 5$ after multiplying by $C$. The constant $c$ is calculated from the sample, and the theoretical density function can be found by numerically finding $K$ and the zeros $a$ and $b$ of the term inside the square root, which give the interval where $f_c(v)$ is non-zero. Now we use the rejection step algorithm to generate values for $v$, and the corresponding histogram is shown in Figure 3.2. The blue line is the theoretical distribution, and from this it seems reasonable to assume that the rejection step algorithm works fine.

Figure 3.2: Histogram of generated values from the rejection step, as well as the theoretical distribution function.

3.5 Examples

Now we try out the Gibbs sampler on two data sets, and see how the generated samples are distributed compared to the original samples. These are the same data sets that were used in Chapter 1, and both of them will be analyzed for gof in the next chapter.

Example 3.1 - Ball bearing data The Gibbs sampler is used to generate co-sufficient samples, using the ball bearing data as input. The histograms of the co-sufficient samples and the original sample are shown in Figure 3.3.

Figure 3.3: Histogram of the co-sufficient samples generated from the ball bearing data (left) and the histogram of the original data (right).

Example 3.2 - Premier League data Now we try the Premier League data as input, and generate co-sufficient samples. The histogram of all these samples, as well as the histogram for the original sample, are shown in Figure 3.4.

Figure 3.4: Histogram of the co-sufficient samples generated from the Premier League data (left) and the histogram of the original data (right).

3.6 Convergence diagnostics

The Gibbs sampler is an MCMC algorithm that in general will generate correlated samples. To analyze whether the algorithm converges and how the data are correlated, we plot the trace plot of $x_1$ and the sample acfs of $x_1$ and $x_3$ from the generated samples in Example 3.2. This is shown in Figure 3.5. The trace plots for the other components are similar. The ACF seems to become insignificant for lags above 5. The ACFs for all the $x_i$'s are very similar, except for $x_1$ and $x_2$, which both look like the one plotted for $x_1$ in Figure 3.5. The reason for this must be that the Gibbs sampler algorithm generates $x_1$ and $x_2$ in a different way than the rest of the sample. In any case, we can conclude that the Gibbs sampler seems to converge, at least for this data set.

Figure 3.5: Trace plot of the first 1000 iterations of $x_1$ and acfs for the generated samples in Example 3.2.

Chapter 4

Application: Goodness-Of-Fit Tests

Co-sufficient samples can be used to test if a sample comes from a particular distribution. Several tests are described below, and these can be used on the co-sufficient samples by calculating the corresponding test statistic for each generated sample, and comparing how these values are distributed with the value of the test statistic for the original sample. The distribution of the test statistic is the same for both the co-sufficient samples and the original sample, under the assumption that the original sample comes from the particular distribution.

4.1 Kolmogorov-Smirnov

The Kolmogorov-Smirnov gof test is a simple test for analyzing whether or not a sample is from a particular probability distribution. It is based on calculating the maximum distance between the empirical cumulative distribution function (cdf) and the theoretical cdf for the proposed distribution (Handbook). Because the cdf is a non-decreasing function, it turns out that the test statistic can be written as
$$D = \max_{1 \le i \le N} \left( F(Y_i) - \frac{i - 1}{N},\ \frac{i}{N} - F(Y_i) \right), \qquad (4.1)$$
where $F(y)$ is the theoretical cdf and the $Y_i$ are the sample points in increasing order. The maximum likelihood estimate for $\theta$ is used in the theoretical cdf.
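As an illustration (our own sketch, not code from the report), the statistic (4.1) for the exponential null hypothesis can be computed as follows, using the fact that the maximum likelihood estimate of the rate is $n / \sum_i x_i$:

```r
# Kolmogorov-Smirnov statistic (4.1) for an exponential null hypothesis,
# with the rate replaced by its maximum likelihood estimate.
ks_stat_exp <- function(y) {
  n <- length(y)
  Fi <- pexp(sort(y), rate = n / sum(y))   # theoretical cdf at ordered points
  max(Fi - (seq_len(n) - 1) / n, seq_len(n) / n - Fi)
}
```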

Figure 4.1 illustrates how this test works. Here one co-sufficient sample is generated using the ball bearing data as input sample, assuming an exponential distribution, and it is clear that the empirical cdf of the co-sufficient sample matches the theoretical cdf better than is the case for the original sample.

Figure 4.1: Theoretical and empirical cdfs used in the Kolmogorov-Smirnov gof test (theoretical cdf, empirical cdf of the sample, and empirical cdf of the co-sufficient sample).

The Kolmogorov-Smirnov test is used as a one-sided test, rejecting the null hypothesis if the value of the statistic is large.

4.2 Cramer-von Mises criterion

An alternative to Kolmogorov-Smirnov is the one-sample case of the Cramer-von Mises criterion. The statistic used is defined as (Encyclopediaofmath, 2016)
$$\omega^2 = \int \left[ F_n(x) - F(x) \right]^2 \mathrm{d}F(x),$$
where $F_n(x)$ is the empirical distribution function and $F(x)$ is the theoretical cdf for the particular distribution that is assumed under the null hypothesis. If $x_1, \ldots, x_n$ are the sample points, in

increasing order, then the statistic in the one-sample case is given by
$$T = n\omega^2 = \frac{1}{12n} + \sum_{i=1}^{n} \left[ \frac{2i - 1}{2n} - F(x_i) \right]^2.$$
The maximum likelihood estimates for $\alpha$ and $\beta$ are used in the theoretical cdf. The Cramer-von Mises criterion is used as a one-sided test, rejecting the null hypothesis if the value of the statistic is large. A small sketch of how the statistic and the Monte Carlo p-value are computed is given after the list of simple test statistics below.

4.3 Other tests

We also use some very simple test statistics, just to compare the p-values calculated with the ones calculated for Kolmogorov-Smirnov (KS) and Cramer-von Mises (CM). These are named T1, T2, T3, T4 and T5, and are defined as follows:

T1: Maximum of sample
T2: Median of sample
T3: Proportion of sample less than the mean of the sample
T4: Sum of the squares of the components of the sample
T5: Maximum of sample divided by minimum of sample
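To make the testing procedure concrete, the sketch below (our own illustration; the function names and the p-value convention are assumptions, not code from the report) computes the Cramer-von Mises statistic for the exponential null and a Monte Carlo p-value as the proportion of co-sufficient samples whose statistic is at least as extreme as the observed one.

```r
# Cramer-von Mises statistic for the exponential null, with ML estimate of the rate
cvm_stat_exp <- function(y) {
  n <- length(y)
  Fi <- pexp(sort(y), rate = n / sum(y))
  1 / (12 * n) + sum(((2 * seq_len(n) - 1) / (2 * n) - Fi)^2)
}

# Monte Carlo p-value: proportion of co-sufficient samples whose statistic is
# at least as large as the statistic of the observed data.
gof_pvalue <- function(x, stat = cvm_stat_exp, n_rep = 10000, gen = cosuff_exp) {
  t_obs <- stat(x)
  t_rep <- replicate(n_rep, stat(gen(x)))
  mean(t_rep >= t_obs)
}
```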

4.4 Examples

Example: Ball bearing data, exponential distribution Using the ball bearing data as input, we assume the null hypothesis that the data come from an exponential distribution, and generate co-sufficient samples using Algorithm 1. These are used to calculate p-values, and the values are listed in Table 4.1. Figure 4.2 illustrates how the value of the test statistic is distributed for both Kolmogorov-Smirnov and Cramer-von Mises, calculated from the co-sufficient samples. The vertical line is the test statistic for the original sample, and it is easy to see a big difference between the co-sufficient samples and the original sample. Hence the p-values are very small, and the null hypothesis is rejected using any reasonable significance level. We can therefore conclude that the ball bearing data set cannot be described well by an exponential distribution.

Figure 4.2: Histogram of test statistics for the co-sufficient samples for Kolmogorov-Smirnov (left) and Cramer-von Mises (right), as well as the value of the test statistics for the ball bearing sample (red line), under the assumption of an exponential distribution.

Example: Ball bearing data, gamma distribution Using the ball bearing data as input, we assume the null hypothesis that the data come from a gamma distribution, and generate co-sufficient samples using the Gibbs sampler algorithm. These are used to calculate p-values, and the values are listed in Table 4.1.

Table 4.1: Calculated p-values for the ball bearing data.

              KS    CM    T1    T2    T3    T4    T5
Exponential
Gamma

The null hypothesis is not rejected, and we conclude that the ball bearing data can be described quite well by the gamma distribution. This seems quite reasonable when looking at Figure 3.3,

because the co-sufficient samples and the original sample look to have similar shapes.

Example: Premier League data Now we test the same null hypotheses for the Premier League data, and the results are shown in Table 4.2, assuming first an exponential distribution and then a gamma distribution. When the p-value is exactly zero, it means that the test statistic for the original sample is more extreme than for any of the generated co-sufficient samples. It is clear that the data do not come from an exponential distribution, which is not surprising considering how the histogram of the data looks (see Figure 1.7). The last row in Table 4.2 also shows that the data do not fit well with a gamma distribution. The p-values for the Kolmogorov-Smirnov test and the Cramer-von Mises test are both below 0.05, so we can reject the null hypothesis that the data come from a gamma distribution when using a significance level of 5%. Looking at Figure 3.4, we observe that an important difference between the histograms is that the co-sufficient samples tail off for increasing values of $x$, but the original sample has a quite clear cut-off. This is because it is unlikely for the team finishing last in the Premier League to have substantially more than 35 points, since it would require many teams to be very close in the number of points, and no team to be particularly worse than the rest. It is probably this property that makes the p-values so small.

Table 4.2: Calculated p-values for the Premier League data.

              KS    CM    T1    T2    T3    T4    T5
Exponential
Gamma

Bibliography

Altomfotball.no (2016). [Online; accessed 18-December-2016].

Casella, G. and Berger, R. L. (2002). Statistical Inference. 2nd edition.

Encyclopediaofmath (2016). Cramér-von Mises criterion. encyclopediaofmath.org/index.php/cram%c3%a9r-von_mises_test. [Online; accessed 18-October-2016].

Handbook, E. S. Engineering Statistics Handbook, Kolmogorov-Smirnov goodness-of-fit test.

Lindqvist, B. H. and Rannestad, B. (2011). Monte Carlo exact goodness-of-fit tests for nonhomogeneous Poisson processes. Applied Stochastic Models in Business and Industry, 27(3).

Lindqvist, B. H. and Taraldsen, G. (2007). Conditional Monte Carlo based on sufficient statistics with applications. Advances in Statistical Modeling and Inference: Essays in Honor of Kjell A. Doksum.

Lockhart, R. A., O'Reilly, F. J., and Stephens, M. A. (2007). Use of the Gibbs sampler to obtain conditional tests, with applications. Biometrika, 94(4).

TMA4275 (2016). Lifetime analysis. TMA4275-Slides pdf. [Online; accessed 18-December-2016].

WolframMathWorld (2016). html. [Online; accessed 16-December-2016].


More information

1 Probability theory. 2 Random variables and probability theory.

1 Probability theory. 2 Random variables and probability theory. Probability theory Here we summarize some of the probability theory we need. If this is totally unfamiliar to you, you should look at one of the sources given in the readings. In essence, for the major

More information

The bootstrap. Patrick Breheny. December 6. The empirical distribution function The bootstrap

The bootstrap. Patrick Breheny. December 6. The empirical distribution function The bootstrap Patrick Breheny December 6 Patrick Breheny BST 764: Applied Statistical Modeling 1/21 The empirical distribution function Suppose X F, where F (x) = Pr(X x) is a distribution function, and we wish to estimate

More information

Advanced Statistical Modelling

Advanced Statistical Modelling Markov chain Monte Carlo (MCMC) Methods and Their Applications in Bayesian Statistics School of Technology and Business Studies/Statistics Dalarna University Borlänge, Sweden. Feb. 05, 2014. Outlines 1

More information

7. Estimation and hypothesis testing. Objective. Recommended reading

7. Estimation and hypothesis testing. Objective. Recommended reading 7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing

More information

16 : Markov Chain Monte Carlo (MCMC)

16 : Markov Chain Monte Carlo (MCMC) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 16 : Markov Chain Monte Carlo MCMC Lecturer: Matthew Gormley Scribes: Yining Wang, Renato Negrinho 1 Sampling from low-dimensional distributions

More information

Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference

Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference Charles J. Geyer April 6, 2009 1 The Problem This is an example of an application of Bayes rule that requires some form of computer analysis.

More information

Modified Kolmogorov-Smirnov Test of Goodness of Fit. Catalonia-BarcelonaTECH, Spain

Modified Kolmogorov-Smirnov Test of Goodness of Fit. Catalonia-BarcelonaTECH, Spain 152/304 CoDaWork 2017 Abbadia San Salvatore (IT) Modified Kolmogorov-Smirnov Test of Goodness of Fit G.S. Monti 1, G. Mateu-Figueras 2, M. I. Ortego 3, V. Pawlowsky-Glahn 2 and J. J. Egozcue 3 1 Department

More information

Distribution Fitting (Censored Data)

Distribution Fitting (Censored Data) Distribution Fitting (Censored Data) Summary... 1 Data Input... 2 Analysis Summary... 3 Analysis Options... 4 Goodness-of-Fit Tests... 6 Frequency Histogram... 8 Comparison of Alternative Distributions...

More information

Fall 2012 Analysis of Experimental Measurements B. Eisenstein/rev. S. Errede

Fall 2012 Analysis of Experimental Measurements B. Eisenstein/rev. S. Errede Hypothesis Testing: Suppose we have two or (in general) more simple hypotheses which can describe a set of data Simple means explicitly defined, so if parameters have to be fitted, that has already been

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

MCMC Methods: Gibbs and Metropolis

MCMC Methods: Gibbs and Metropolis MCMC Methods: Gibbs and Metropolis Patrick Breheny February 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/30 Introduction As we have seen, the ability to sample from the posterior distribution

More information

Three examples of a Practical Exact Markov Chain Sampling

Three examples of a Practical Exact Markov Chain Sampling Three examples of a Practical Exact Markov Chain Sampling Zdravko Botev November 2007 Abstract We present three examples of exact sampling from complex multidimensional densities using Markov Chain theory

More information

Introduction to Bayesian Methods. Introduction to Bayesian Methods p.1/??

Introduction to Bayesian Methods. Introduction to Bayesian Methods p.1/?? to Bayesian Methods Introduction to Bayesian Methods p.1/?? We develop the Bayesian paradigm for parametric inference. To this end, suppose we conduct (or wish to design) a study, in which the parameter

More information

CPSC 531: Random Numbers. Jonathan Hudson Department of Computer Science University of Calgary

CPSC 531: Random Numbers. Jonathan Hudson Department of Computer Science University of Calgary CPSC 531: Random Numbers Jonathan Hudson Department of Computer Science University of Calgary http://www.ucalgary.ca/~hudsonj/531f17 Introduction In simulations, we generate random values for variables

More information

Lecture 1: August 28

Lecture 1: August 28 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random

More information

Machine Learning for Data Science (CS4786) Lecture 24

Machine Learning for Data Science (CS4786) Lecture 24 Machine Learning for Data Science (CS4786) Lecture 24 Graphical Models: Approximate Inference Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2016sp/ BELIEF PROPAGATION OR MESSAGE PASSING Each

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

The comparative studies on reliability for Rayleigh models

The comparative studies on reliability for Rayleigh models Journal of the Korean Data & Information Science Society 018, 9, 533 545 http://dx.doi.org/10.7465/jkdi.018.9..533 한국데이터정보과학회지 The comparative studies on reliability for Rayleigh models Ji Eun Oh 1 Joong

More information

Non-parametric Inference and Resampling

Non-parametric Inference and Resampling Non-parametric Inference and Resampling Exercises by David Wozabal (Last update. Juni 010) 1 Basic Facts about Rank and Order Statistics 1.1 10 students were asked about the amount of time they spend surfing

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n Chapter 9 Hypothesis Testing 9.1 Wald, Rao, and Likelihood Ratio Tests Suppose we wish to test H 0 : θ = θ 0 against H 1 : θ θ 0. The likelihood-based results of Chapter 8 give rise to several possible

More information

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators Estimation theory Parametric estimation Properties of estimators Minimum variance estimator Cramer-Rao bound Maximum likelihood estimators Confidence intervals Bayesian estimation 1 Random Variables Let

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institute of Technology, Kharagpur

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institute of Technology, Kharagpur Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institute of Technology, Kharagpur Lecture No. # 38 Goodness - of fit tests Hello and welcome to this

More information

Doing Bayesian Integrals

Doing Bayesian Integrals ASTR509-13 Doing Bayesian Integrals The Reverend Thomas Bayes (c.1702 1761) Philosopher, theologian, mathematician Presbyterian (non-conformist) minister Tunbridge Wells, UK Elected FRS, perhaps due to

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

Winter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo

Winter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo Winter 2019 Math 106 Topics in Applied Mathematics Data-driven Uncertainty Quantification Yoonsang Lee (yoonsang.lee@dartmouth.edu) Lecture 9: Markov Chain Monte Carlo 9.1 Markov Chain A Markov Chain Monte

More information

MARKOV CHAIN MONTE CARLO

MARKOV CHAIN MONTE CARLO MARKOV CHAIN MONTE CARLO RYAN WANG Abstract. This paper gives a brief introduction to Markov Chain Monte Carlo methods, which offer a general framework for calculating difficult integrals. We start with

More information

Data Analysis I. Dr Martin Hendry, Dept of Physics and Astronomy University of Glasgow, UK. 10 lectures, beginning October 2006

Data Analysis I. Dr Martin Hendry, Dept of Physics and Astronomy University of Glasgow, UK. 10 lectures, beginning October 2006 Astronomical p( y x, I) p( x, I) p ( x y, I) = p( y, I) Data Analysis I Dr Martin Hendry, Dept of Physics and Astronomy University of Glasgow, UK 10 lectures, beginning October 2006 4. Monte Carlo Methods

More information

36-463/663Multilevel and Hierarchical Models

36-463/663Multilevel and Hierarchical Models 36-463/663Multilevel and Hierarchical Models From Bayes to MCMC to MLMs Brian Junker 132E Baker Hall brian@stat.cmu.edu 1 Outline Bayesian Statistics and MCMC Distribution of Skill Mastery in a Population

More information

Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics

Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics The candidates for the research course in Statistics will have to take two shortanswer type tests

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Independent Events. Two events are independent if knowing that one occurs does not change the probability of the other occurring

Independent Events. Two events are independent if knowing that one occurs does not change the probability of the other occurring Independent Events Two events are independent if knowing that one occurs does not change the probability of the other occurring Conditional probability is denoted P(A B), which is defined to be: P(A and

More information

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions

More information

Foundations of Statistical Inference

Foundations of Statistical Inference Foundations of Statistical Inference Jonathan Marchini Department of Statistics University of Oxford MT 2013 Jonathan Marchini (University of Oxford) BS2a MT 2013 1 / 27 Course arrangements Lectures M.2

More information

Statistical Methods in Particle Physics Lecture 1: Bayesian methods

Statistical Methods in Particle Physics Lecture 1: Bayesian methods Statistical Methods in Particle Physics Lecture 1: Bayesian methods SUSSP65 St Andrews 16 29 August 2009 Glen Cowan Physics Department Royal Holloway, University of London g.cowan@rhul.ac.uk www.pp.rhul.ac.uk/~cowan

More information

General Bayesian Inference I

General Bayesian Inference I General Bayesian Inference I Outline: Basic concepts, One-parameter models, Noninformative priors. Reading: Chapters 10 and 11 in Kay-I. (Occasional) Simplified Notation. When there is no potential for

More information

Katsuhiro Sugita Faculty of Law and Letters, University of the Ryukyus. Abstract

Katsuhiro Sugita Faculty of Law and Letters, University of the Ryukyus. Abstract Bayesian analysis of a vector autoregressive model with multiple structural breaks Katsuhiro Sugita Faculty of Law and Letters, University of the Ryukyus Abstract This paper develops a Bayesian approach

More information

Lecture Notes 3 Convergence (Chapter 5)

Lecture Notes 3 Convergence (Chapter 5) Lecture Notes 3 Convergence (Chapter 5) 1 Convergence of Random Variables Let X 1, X 2,... be a sequence of random variables and let X be another random variable. Let F n denote the cdf of X n and let

More information

Hypothesis testing: theory and methods

Hypothesis testing: theory and methods Statistical Methods Warsaw School of Economics November 3, 2017 Statistical hypothesis is the name of any conjecture about unknown parameters of a population distribution. The hypothesis should be verifiable

More information

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013 Eco517 Fall 2013 C. Sims MCMC October 8, 2013 c 2013 by Christopher A. Sims. This document may be reproduced for educational and research purposes, so long as the copies contain this notice and are retained

More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

STAT STOCHASTIC PROCESSES. Contents

STAT STOCHASTIC PROCESSES. Contents STAT 3911 - STOCHASTIC PROCESSES ANDREW TULLOCH Contents 1. Stochastic Processes 2 2. Classification of states 2 3. Limit theorems for Markov chains 4 4. First step analysis 5 5. Branching processes 5

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

Introduction to Markov Chain Monte Carlo & Gibbs Sampling

Introduction to Markov Chain Monte Carlo & Gibbs Sampling Introduction to Markov Chain Monte Carlo & Gibbs Sampling Prof. Nicholas Zabaras Sibley School of Mechanical and Aerospace Engineering 101 Frank H. T. Rhodes Hall Ithaca, NY 14853-3801 Email: zabaras@cornell.edu

More information

Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference

Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference Charles J. Geyer March 30, 2012 1 The Problem This is an example of an application of Bayes rule that requires some form of computer analysis.

More information

Chapter 6. Estimation of Confidence Intervals for Nodal Maximum Power Consumption per Customer

Chapter 6. Estimation of Confidence Intervals for Nodal Maximum Power Consumption per Customer Chapter 6 Estimation of Confidence Intervals for Nodal Maximum Power Consumption per Customer The aim of this chapter is to calculate confidence intervals for the maximum power consumption per customer

More information