Chapter 3

Sufficient statistics and variance reduction

Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with p.m/d.f. $f_X(x \mid \theta)$. A function $T(X_1, X_2, \ldots, X_n) = T(\mathbf{X})$ of these observations is called a statistic. From a statistical point of view, taking a statistic of the observations is equivalent to taking into account only part of the information in the sample.

Example: An experiment can result in either success or failure, with probabilities $\theta$ and $1 - \theta$ respectively. The experiment is performed independently $n$ times. Let
$$X_i = \begin{cases} 1 & \text{if the $i$th repetition results in success} \\ 0 & \text{if the $i$th repetition results in failure.} \end{cases}$$
Let $S_m = \sum_{i=1}^{m} X_i$ and $S_{n-m} = \sum_{i=m+1}^{n} X_i$. Consider the bivariate statistic $T(\mathbf{X}) = (S_m, S_{n-m})$. This statistic gives information on how many successes are obtained in the first $m$ experiments and on how many successes are obtained in the last $n - m$ experiments. The information on which particular experiments the successes were obtained in is not retained; neither is the information about how many successes are obtained in the first $r$ experiments for $r < m$.

Consider now the statistic $U(\mathbf{X}) = \sum_{i=1}^{n} X_i$. This statistic gives information on the total number of successes in the $n$ repetitions; all other information in the sample is not retained by $U(\mathbf{X})$. Note, in fact, that $U(\mathbf{X})$ retains even less information than $T(\mathbf{X})$. Note also that
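The statistics in this example are easy to compute directly; the following sketch (the sample size, split point $m$ and value of $\theta$ are illustrative choices, not from the text) also confirms that $U(\mathbf{X})$ is a function of $T(\mathbf{X})$:

```python
import random

random.seed(0)

n, m, theta = 10, 4, 0.3  # illustrative values

# Simulate n independent Bernoulli(theta) trials.
x = [1 if random.random() < theta else 0 for _ in range(n)]

# The bivariate statistic T(X) = (S_m, S_{n-m}).
s_m = sum(x[:m])
s_n_minus_m = sum(x[m:])
T = (s_m, s_n_minus_m)

# The statistic U(X) = sum of all X_i.
U = sum(x)

# U is a function of T: U = S_m + S_{n-m}.
assert U == T[0] + T[1]
```

The converse fails: from $U$ alone one cannot recover $(S_m, S_{n-m})$, which is the sense in which $U$ retains less information than $T$.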
$U(\mathbf{X}) = S_m + S_{n-m}$, i.e. $U(\mathbf{X})$ is a function (i.e. a statistic) of $T(\mathbf{X})$. Consequently we come to the conclusion that every time we take a function of a statistic we drop some of the information.

We have argued in the past that the Fisher information
$$I(\theta) = I_X(\theta) = E[S^2(\mathbf{X})], \qquad \text{where } S(\mathbf{X}) = \frac{d}{d\theta} \log f_X(\mathbf{X} \mid \theta),$$
is a measure of the amount of information in the sample $\mathbf{X}$ about the parameter $\theta$. Now, if $T(\mathbf{X})$ is a statistic, then a measure of the amount of information in $T$ about $\theta$ can be given by the Fisher information of $T$, defined by
$$I_T(\theta) = E[S^2(T)], \qquad \text{where } S(T) = \frac{d}{d\theta} \log f_T(T \mid \theta)$$
and $f_T(t \mid \theta)$ is the p.m/d.f. of the statistic $T$. If $\hat\theta(T)$ is an unbiased estimator of $\theta$ based on the statistic $T$, instead of on the whole sample $\mathbf{X}$, then the Cramér–Rao inequality becomes
$$\operatorname{Var}(\hat\theta(T)) \ge \frac{1}{I_T(\theta)}. \tag{3.1}$$
Now, in view of the remarks we made about a statistic being equivalent to taking into account only part of the information in the sample, we should expect to have
$$I_T(\theta) \le I_X(\theta) \tag{3.2}$$
with equality holding if and only if the statistic has retained all the relevant information about $\theta$ and dropped only information which does not relate to $\theta$.

A statistic which retains all the relevant information about $\theta$ and discards only information which does not relate to $\theta$ is said to be sufficient for $\theta$. Unfortunately, tempting as it may be, we cannot adopt strict equality in (3.2) as the formal definition of sufficiency of a statistic $T$, as this is only possible in cases where there is enough regularity for the Fisher information to be defined. We need a formal definition of sufficiency which holds in all cases, irrespective of whether this regularity is present or not.

Formal definition of sufficiency: A statistic $T(\mathbf{X})$ of the observations $\mathbf{X}$ with p.m/d.f. $f_X(\mathbf{x} \mid \theta)$ is said to be sufficient for the parameter $\theta$ if the conditional distribution of $\mathbf{X}$ given $T = t$ is free of $\theta$, i.e. if the conditional p.m/d.f.
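For the success/failure example above, both information quantities can be checked numerically. For a Bernoulli sample, $I_X(\theta) = n/(\theta(1-\theta))$, and $T = \sum X_i$ has the Binomial$(n,\theta)$ distribution; computing $I_T(\theta)$ directly from the definition shows the two coincide, consistent with equality in (3.2) for this statistic. A minimal sketch (the values of $n$, $\theta$ and the finite-difference step are illustrative choices):

```python
from math import comb, log

def fisher_info_binomial(n, theta, eps=1e-6):
    # I_T(theta) = E[(d/dtheta log f_T(T | theta))^2] for T ~ Binomial(n, theta),
    # with the score approximated by a central finite difference.
    def log_pmf(t, th):
        return log(comb(n, t)) + t * log(th) + (n - t) * log(1 - th)
    info = 0.0
    for t in range(n + 1):
        pmf = comb(n, t) * theta**t * (1 - theta)**(n - t)
        score = (log_pmf(t, theta + eps) - log_pmf(t, theta - eps)) / (2 * eps)
        info += pmf * score**2
    return info

n, theta = 10, 0.3  # illustrative values

# Fisher information of the whole Bernoulli sample: n / (theta (1 - theta)).
i_x = n / (theta * (1 - theta))
i_t = fisher_info_binomial(n, theta)

# T = sum(X_i) retains all the information about theta: I_T = I_X.
assert abs(i_t - i_x) < 1e-3
```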
$f_{\mathbf{X} \mid T=t}(\mathbf{x})$ does not involve $\theta$.

From this definition of sufficiency we have the following.

The factorization theorem: A statistic $T(\mathbf{X})$, where $\mathbf{X}$ has joint p.m/d.f. $f_X(\mathbf{x} \mid \theta)$, is sufficient for $\theta$ if and only if
$$f_X(\mathbf{x} \mid \theta) = g(t, \theta)\, h(\mathbf{x}) \qquad \text{for all } \mathbf{x} \in \mathcal{X}^n,$$
where $g(t, \theta)$ is a function of $\theta$ that depends on the observations only through the value $t$ of $T$, and $h(\mathbf{x})$ is a function which does not involve $\theta$.

Proof. We first note that if $f_{X,T}(\mathbf{x}, t \mid \theta)$ is the joint p.m/d.f. of $\mathbf{X}$ and $T$, then
$$f_{X,T}(\mathbf{x}, t \mid \theta) = \begin{cases} f_X(\mathbf{x} \mid \theta) & \text{if } t = T(\mathbf{x}) \\ 0 & \text{if } t \ne T(\mathbf{x}) \end{cases} \;=\; \begin{cases} f_X(\mathbf{x} \mid \theta) & \text{if } \mathbf{x} \in A_t \\ 0 & \text{if } \mathbf{x} \notin A_t \end{cases} \tag{3.3}$$
where the set $A_t = \{\mathbf{x} : T(\mathbf{x}) = t\}$ is the set of all sample results for which $T = t$.

We can understand the result in (3.3) better in terms of an example. Suppose an experiment which can result in either success or failure is repeated independently three times, and on the $i$th repetition we record $X_i = 1$ if we get a success and $X_i = 0$ if we get a failure ($i = 1, 2, 3$). Let the statistic $T = \sum_{i=1}^{3} X_i$ be the number of successes in the three repetitions. The possible outcomes of the sample $\mathbf{X} = (X_1, X_2, X_3)$ and of the statistic $T$ are shown below.
Partition sets                (X_1, X_2, X_3)                      T
A_0 = {(0,0,0)}                                                    0
A_1 = {(1,0,0), (0,1,0), (0,0,1)}                                  1
A_2 = {(1,1,0), (1,0,1), (0,1,1)}                                  2
A_3 = {(1,1,1)}                                                    3

Clearly
$$f_{X,T}\big((0,1,0),\, 2 \mid \theta\big) = \Pr\Big((X_1, X_2, X_3) = (0,1,0),\ \textstyle\sum_{i=1}^{3} X_i = 2\Big) = 0,$$
since clearly we cannot have the result $(X_1, X_2, X_3) = (0,1,0)$ and at the same time $\sum_{i=1}^{3} X_i = 2$. On the other hand,
$$f_{X,T}\big((0,1,0),\, 1 \mid \theta\big) = \Pr\Big((X_1, X_2, X_3) = (0,1,0),\ \textstyle\sum_{i=1}^{3} X_i = 1\Big) = \Pr\big((X_1, X_2, X_3) = (0,1,0)\big) = f_X\big((0,1,0) \mid \theta\big).$$

We now turn our attention to the proof of the factorization theorem. Assume first that $T$ is sufficient for $\theta$, i.e. that $f_{\mathbf{X} \mid T=t}(\mathbf{x})$ is free of the parameter $\theta$. Since for $t = T(\mathbf{x})$
$$f_X(\mathbf{x} \mid \theta) = f_{X,T}(\mathbf{x}, t \mid \theta) = f_{\mathbf{X} \mid T=t}(\mathbf{x})\, f_T(t \mid \theta) \qquad \text{(see (3.3))},$$
the factorization follows by taking $h(\mathbf{x}) \equiv f_{\mathbf{X} \mid T=t}(\mathbf{x})$ and $g(t, \theta) \equiv f_T(t \mid \theta)$.
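The partition sets above and the joint p.m.f. in (3.3) can be checked by direct enumeration (a small sketch; the value of $\theta$ is an illustrative choice):

```python
from itertools import product

theta = 0.4  # illustrative value

# All possible outcomes of (X_1, X_2, X_3) for three Bernoulli trials.
outcomes = list(product([0, 1], repeat=3))

# Partition sets A_t = {x : T(x) = t}, where T = number of successes.
A = {t: [x for x in outcomes if sum(x) == t] for t in range(4)}

def f_x(x, th):
    # Joint pmf of the sample: theta^{sum x} (1 - theta)^{3 - sum x}.
    s = sum(x)
    return th**s * (1 - th)**(3 - s)

def f_xt(x, t, th):
    # Joint pmf of (X, T) as in (3.3): f_X(x | theta) if T(x) = t, else 0.
    return f_x(x, th) if sum(x) == t else 0.0

# (0,1,0) lies in A_1, not A_2, so the joint pmf vanishes at t = 2 ...
assert f_xt((0, 1, 0), 2, theta) == 0.0
# ... while at t = 1 it equals the marginal pmf of the sample point.
assert f_xt((0, 1, 0), 1, theta) == f_x((0, 1, 0), theta)

# The partition sets have sizes C(3, t) = 1, 3, 3, 1.
assert [len(A[t]) for t in range(4)] == [1, 3, 3, 1]
```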
Assume now that the factorization $f_X(\mathbf{x} \mid \theta) = g(t, \theta)\, h(\mathbf{x})$ holds for all $\mathbf{x} \in \mathcal{X}^n$ with $t = T(\mathbf{x})$. It follows that
$$f_T(t \mid \theta) = \sum_{\mathbf{x} \in A_t} f_X(\mathbf{x} \mid \theta) = \sum_{\mathbf{x} \in A_t} g(t, \theta)\, h(\mathbf{x}) = g(t, \theta) \sum_{\mathbf{x} \in A_t} h(\mathbf{x}) = g(t, \theta)\, H(t), \tag{3.4}$$
where the set $A_t = \{\mathbf{x} : T(\mathbf{x}) = t\}$ is the set of all sample results for which $T = t$. In calculating (3.4) we have assumed the observations to be discrete; if they are continuous, replace summations by integrals. Further, in (3.3) we have seen that
$$f_{\mathbf{X} \mid T=t}(\mathbf{x}) = \begin{cases} \dfrac{f_X(\mathbf{x} \mid \theta)}{f_T(t \mid \theta)} & \text{if } \mathbf{x} \in A_t \\[1ex] 0 & \text{if } \mathbf{x} \notin A_t \end{cases}$$
and from (3.4) and the factorization we get
$$f_{\mathbf{X} \mid T=t}(\mathbf{x}) = \begin{cases} \dfrac{g(t, \theta)\, h(\mathbf{x})}{g(t, \theta)\, H(t)} = \dfrac{h(\mathbf{x})}{H(t)} & \text{if } \mathbf{x} \in A_t \\[1ex] 0 & \text{if } \mathbf{x} \notin A_t \end{cases}$$
i.e. $f_{\mathbf{X} \mid T=t}(\mathbf{x})$ is free of $\theta$. This completes the proof of the factorization theorem.

Remark 3.0.1 What are the implications of having the conditional p.m/d.f. $f_{\mathbf{X} \mid T=t}(\mathbf{x})$ free of $\theta$? Given that we know that $T(\mathbf{x}) = t$, it follows that $\mathbf{x}$ must be situated in the set $A_t$; if, further, $f_{\mathbf{X} \mid T=t}(\mathbf{x})$ is free of $\theta$, we can conclude that once we know that $\mathbf{x}$ is in the set $A_t$, the probability of it being in any particular position within $A_t$ does not depend on $\theta$; i.e. once we know that $\mathbf{x}$ is in the set $A_t$, information on its exact position within $A_t$ does not relate to $\theta$. Put another way, all the information in $\mathbf{x}$ relating to $\theta$ is contained in the value of $T(\mathbf{x})$; the information in $\mathbf{x}$ which is not retained by the statistic $T$ does not relate to $\theta$. But we have seen that a statistic $T$ which retains all the relevant information about $\theta$ and discards only information that is not relevant to $\theta$ is what we call a sufficient statistic for $\theta$.

Result: Let $T(\mathbf{X})$ be a statistic of the sample $\mathbf{X}$ whose joint distribution depends on a parameter $\theta$. Then, under certain regularity conditions on the joint p.m/d.f. $f_X(\mathbf{x} \mid \theta)$ of $\mathbf{X}$ and on the p.m/d.f. $f_T(t \mid \theta)$ of $T$,
$$I_T(\theta) \le I_X(\theta) \qquad \forall\, \theta \in \Theta$$
with equality if and only if $T(\mathbf{X})$ is sufficient for $\theta$. Here
$$I_T(\theta) = E\left[\left(\frac{\partial}{\partial\theta} \log f_T(T \mid \theta)\right)^2\right] = -E\left[\frac{\partial^2}{\partial\theta^2} \log f_T(T \mid \theta)\right]$$
and
$$I_X(\theta) = E\left[\left(\frac{\partial}{\partial\theta} \log f_X(\mathbf{X} \mid \theta)\right)^2\right] = -E\left[\frac{\partial^2}{\partial\theta^2} \log f_X(\mathbf{X} \mid \theta)\right].$$

Proof. The inequality $I_T(\theta) \le I_X(\theta)$ will be assumed valid as a consequence of our understanding of what a statistic does and of what the Fisher information represents, although it can be rigorously proved using mathematics. That strict equality holds if and only if $T$ is sufficient for $\theta$ follows from the factorization theorem and is left as an exercise.

Remarks:
1. Notice that the factorization theorem not only gives us necessary and sufficient conditions for the existence of a sufficient statistic; it also identifies the sufficient statistic for us.
2. Sufficiency implies that basing inferences about $\theta$ on procedures involving sufficient statistics, rather than the whole sample, is preferable, since such procedures discard, outright, unnecessary information which does not relate to $\theta$. In particular, in estimating $\theta$, the best unbiased estimators based on sufficient statistics are not going to be any less efficient, in the formal sense, than the best unbiased estimators based on the whole sample, since for $T$ sufficient $I_T(\theta) = I_X(\theta)$; i.e. the CRLB for unbiased estimators based on $T$ is the same as the CRLB for unbiased estimators based on $\mathbf{X}$.

Example 3.0.1 Let $X_1, X_2, \ldots, X_n$ be a random sample from the Bernoulli distribution, i.e.
$$X_i = \begin{cases} 1 & \text{with probability } \theta \\ 0 & \text{with probability } 1 - \theta. \end{cases}$$
Hence $f_{X_i}(x_i \mid \theta) = \theta^{x_i} (1 - \theta)^{1 - x_i}$
for $x_i \in \{0, 1\}$ and all $i$. Use the factorization theorem to find a sufficient statistic for $\theta$, and then confirm that it is sufficient for $\theta$ with the use of the formal definition of sufficiency.

Solution: The joint mass function of the observations $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ is
$$f_X(\mathbf{x} \mid \theta) = \prod_{i=1}^{n} f_{X_i}(x_i \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{\sum_{i=1}^{n} x_i} (1 - \theta)^{n - \sum_{i=1}^{n} x_i} = g\Big(\sum_{i=1}^{n} x_i,\ \theta\Big)\, h(\mathbf{x})$$
with $h(\mathbf{x}) \equiv 1$. Hence, by the factorization theorem, $\sum_{i=1}^{n} X_i$ is a sufficient statistic for $\theta$.

Notice that, since the factorization is not unique, there may be more than one sufficient statistic. For example we could have written
$$f_X(\mathbf{x} \mid \theta) = \theta^{S_m + S_{n-m}} (1 - \theta)^{n - S_m - S_{n-m}} = g(S_m, S_{n-m}, \theta)\, h(\mathbf{x})$$
with, once again, $h(\mathbf{x}) \equiv 1$, and $S_m = \sum_{i=1}^{m} x_i$, $S_{n-m} = \sum_{i=m+1}^{n} x_i$. Hence, by the factorization theorem, $\big(\sum_{i=1}^{m} X_i,\ \sum_{i=m+1}^{n} X_i\big)$ is a bivariate sufficient statistic for $\theta$.

We now show, using the formal definition, that $\sum_{i=1}^{n} X_i$ is indeed a sufficient statistic. The conditional p.m.f. of $\mathbf{X}$ given that $T(\mathbf{X}) = \sum_{i=1}^{n} X_i = t$ is
$$f_{\mathbf{X} \mid T=t}(\mathbf{x}) = \frac{f_{X,T}(\mathbf{x}, t \mid \theta)}{f_T(t \mid \theta)} = \frac{f_X(\mathbf{x} \mid \theta)}{f_T(t \mid \theta)}$$
when $T(\mathbf{x}) = t$, and zero otherwise. The last equality was obtained using (3.3). However, the statistic $T = \sum_{i=1}^{n} X_i$ has the Binomial$(n, \theta)$ distribution. Hence
$$f_{\mathbf{X} \mid T=t}(\mathbf{x}) = \frac{f_X(\mathbf{x} \mid \theta)}{f_T(t \mid \theta)} = \frac{\theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}}{\binom{n}{t} \theta^{t} (1 - \theta)^{n - t}} = \frac{1}{\binom{n}{t}},$$
which is independent of $\theta$, confirming that $\sum_{i=1}^{n} X_i$ is sufficient for $\theta$.
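Both pieces of the argument can be verified by enumeration: in the Bernoulli case $h(\mathbf{x}) \equiv 1$, so $H(t) = |A_t| = \binom{n}{t}$ and (3.4) reads $f_T(t \mid \theta) = g(t, \theta) \binom{n}{t}$, while the conditional probability of each arrangement in $A_t$ is $1/\binom{n}{t}$ whatever the value of $\theta$. A small sketch ($n$ and the $\theta$ values are illustrative choices):

```python
from itertools import product
from math import comb

n = 5  # illustrative sample size

outcomes = list(product([0, 1], repeat=n))

def g(t, theta):
    # g(t, theta) = theta^t (1 - theta)^{n - t} from the factorization.
    return theta**t * (1 - theta)**(n - t)

for theta in (0.1, 0.5, 0.9):
    for t in range(n + 1):
        A_t = [x for x in outcomes if sum(x) == t]  # partition set
        H_t = len(A_t)                              # H(t) = |A_t| since h(x) = 1
        f_T = comb(n, t) * theta**t * (1 - theta)**(n - t)
        # Equation (3.4): f_T(t | theta) = g(t, theta) * H(t).
        assert abs(f_T - g(t, theta) * H_t) < 1e-12
        # Conditional pmf of each x in A_t is 1 / C(n, t), free of theta.
        for x in A_t:
            cond = g(t, theta) / f_T
            assert abs(cond - 1 / comb(n, t)) < 1e-12
```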
Example 3.0.2 Let $X_1, X_2, \ldots, X_n$ be a random sample from the $N(\mu, \sigma^2)$ distribution. Then
$$f_X(\mathbf{x} \mid \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2} (x_i - \mu)^2\right) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} x_i^2 + \frac{\mu}{\sigma^2} \sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2}\right).$$

Suppose that both $\mu$ and $\sigma^2$ are unknown, so that $\theta = (\mu, \sigma^2)^T$. Then
$$f_X(\mathbf{x} \mid \theta) = g\Big(\sum x_i,\ \sum x_i^2,\ \theta\Big)\, h(\mathbf{x})$$
with $h(\mathbf{x}) \equiv 1$ and
$$g\Big(\sum x_i,\ \sum x_i^2,\ \theta\Big) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum x_i^2 + \frac{\mu}{\sigma^2} \sum x_i - \frac{n\mu^2}{2\sigma^2}\right).$$
Hence the bivariate statistic $\big(\sum_{i=1}^{n} X_i,\ \sum_{i=1}^{n} X_i^2\big)$ is sufficient for $(\mu, \sigma^2)$. This should NOT be interpreted as saying that $\sum X_i$ is sufficient for $\mu$ and $\sum X_i^2$ is sufficient for $\sigma^2$. All it says is that all the information contained in the sample about $\mu$ and $\sigma^2$ is also contained in the statistic $\big(\sum X_i,\ \sum X_i^2\big)$.

Suppose now that $\mu$ is unknown but $\sigma^2$ is known, so that we now have $\theta = \mu$ and
$$f_X(\mathbf{x} \mid \theta) = \underbrace{(2\pi\sigma^2)^{-n/2} \exp\left(\frac{\mu}{\sigma^2} \sum x_i - \frac{n\mu^2}{2\sigma^2}\right)}_{=\, g(\sum x_i,\, \theta)} \underbrace{\exp\left(-\frac{1}{2\sigma^2} \sum x_i^2\right)}_{=\, h(\mathbf{x})}.$$
By the factorization theorem we conclude that $\sum_{i=1}^{n} X_i$ is sufficient for $\theta = \mu$.
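One consequence of the sufficiency of $\big(\sum X_i, \sum X_i^2\big)$ is that two samples sharing these two sums have identical likelihoods for every $(\mu, \sigma^2)$. A quick check (the samples and parameter values are illustrative choices):

```python
from math import exp, pi

def normal_likelihood(xs, mu, sigma2):
    # Joint N(mu, sigma2) density of the sample xs.
    n = len(xs)
    q = sum((x - mu)**2 for x in xs)
    return (2 * pi * sigma2)**(-n / 2) * exp(-q / (2 * sigma2))

# Two different samples with the same sum (= 6) and sum of squares (= 18).
a = [0.0, 3.0, 3.0]
b = [1.0, 1.0, 4.0]
assert sum(a) == sum(b)
assert sum(x * x for x in a) == sum(x * x for x in b)

# The likelihoods agree for every choice of (mu, sigma2).
for mu, sigma2 in [(0.0, 1.0), (2.0, 4.0), (-1.0, 0.5)]:
    la = normal_likelihood(a, mu, sigma2)
    lb = normal_likelihood(b, mu, sigma2)
    assert abs(la - lb) <= 1e-12 * max(la, lb)
```

Since the likelihood depends on the data only through $\big(\sum x_i, \sum x_i^2\big)$, no inference about $(\mu, \sigma^2)$ can distinguish the two samples.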
Suppose now that $\mu$ is known but $\sigma^2$ is unknown, so that now $\theta = \sigma^2$ and
$$f_X(\mathbf{x} \mid \theta) = \underbrace{(2\pi\theta)^{-n/2} \exp\left(-\frac{1}{2\theta} \sum x_i^2 + \frac{\mu}{\theta} \sum x_i - \frac{n\mu^2}{2\theta}\right)}_{=\, g(\sum x_i,\, \sum x_i^2,\, \theta)} \underbrace{1}_{=\, h(\mathbf{x})}.$$
By the factorization theorem we conclude that the bivariate statistic $\big(\sum_{i=1}^{n} X_i,\ \sum_{i=1}^{n} X_i^2\big)$ is sufficient for $\theta = \sigma^2$. Note that $\sum_{i=1}^{n} X_i^2$ by itself is not sufficient for $\sigma^2$ unless $\mu = 0$.

Example 3.0.3 Let $X_1, X_2, \ldots, X_n$ be a random sample from the $U(0, \theta)$ distribution, i.e.
$$f_{X_i}(x_i \mid \theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } 0 < x_i < \theta \\[1ex] 0 & \text{otherwise.} \end{cases}$$
Note that $\theta$ is involved in the range of the distribution. Hence it is better if we write the p.d.f. of $X_i$ as
$$f_{X_i}(x_i \mid \theta) = \frac{1}{\theta}\, I_{(0,\theta)}(x_i)$$
where $I_{(0,\theta)}$ is the indicator function of the interval $(0, \theta)$. For any set $A$ the indicator function of $A$ is defined as
$$I_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A. \end{cases}$$
Hence
$$f_X(\mathbf{x} \mid \theta) = \prod_{i=1}^{n} \frac{1}{\theta}\, I_{(0,\theta)}(x_i) = \frac{1}{\theta^n} \prod_{i=1}^{n} I_{(0,\theta)}(x_i) = \underbrace{\frac{1}{\theta^n}\, I_{(0,\theta)}\big(\max_i x_i\big)}_{=\, g(\max x_i,\, \theta)}\, \underbrace{I_{(0,\infty)}\big(\min_i x_i\big)}_{=\, h(\mathbf{x})}$$
(the factor involving $\min_i x_i$ does not involve $\theta$, so it is absorbed into $h(\mathbf{x})$). Hence $\max_{1 \le i \le n} X_i$ is sufficient for $\theta$ by the factorization theorem.
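Sufficiency of $\max_i X_i$ means that the likelihood of a positive sample depends on the data only through its maximum: two samples with the same maximum are indistinguishable as far as $\theta$ is concerned. A sketch (the samples and $\theta$ values are illustrative choices):

```python
def uniform_likelihood(xs, theta):
    # f_X(x | theta) = theta^{-n} I(max x_i < theta), assuming all x_i > 0.
    n = len(xs)
    return theta**(-n) if 0 < max(xs) < theta else 0.0

a = [0.2, 0.9, 0.5]
b = [0.9, 0.1, 0.3]  # a different sample with the same maximum
assert max(a) == max(b)

# Whatever theta is, the two samples have the same likelihood,
# including theta values below the maximum (likelihood zero).
for theta in (0.8, 1.0, 2.0):
    assert uniform_likelihood(a, theta) == uniform_likelihood(b, theta)
```

Note how the indicator bookkeeping matters here: the likelihood drops to zero as soon as $\theta$ falls below $\max_i x_i$, which is exactly the information carried by $g(\max x_i, \theta)$.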