Sufficiency
January 18, 2016
Debdeep Pati

1 Probability Model

Model: A family of distributions $\{P_\theta : \theta \in \Theta\}$. $P_\theta(B)$ is the probability of the event $B$ when the parameter takes the value $\theta$. $P_\theta$ is described by giving a joint pdf or pmf $f(x \mid \theta)$.

Experiment: Observe $X$ (data) $\sim P_\theta$, $\theta$ unknown.

Goal: Make inference about $\theta$.

Joint distribution of independent rv's: If $X = (X_1, \ldots, X_n)$ and $X_1, \ldots, X_n$ are independent with $X_i \sim g_i(x_i \mid \theta)$, then the joint pdf is $f(x \mid \theta) = \prod_{i=1}^n g_i(x_i \mid \theta)$, where $x = (x_1, \ldots, x_n)$. For iid random variables, $g_1 = \cdots = g_n = g$.

1.1 Types of models to be discussed in the course

Let $X = (X_1, \ldots, X_n)$.

1. Random Sample: $X_1, \ldots, X_n$ are iid.
2. Regression Model: $X_1, \ldots, X_n$ are independent (but not necessarily identically distributed; the distribution of $X_i$ may depend on covariates $z_i$).

1.2 Random Sample Models

Example: Let $X_1, X_2, \ldots, X_n$ be iid Poisson($\lambda$), $\lambda$ unknown. Here we have: $X = (X_1, X_2, \ldots, X_n)$, $\theta = \lambda$, $\Theta = \{\lambda : \lambda > 0\}$, and $P_\theta$ is described by the joint pmf
$$f(x \mid \lambda) = f(x_1, \ldots, x_n \mid \lambda) = \prod_{i=1}^n g(x_i \mid \lambda)$$
where $g$ is the Poisson($\lambda$) pmf $g(x \mid \lambda) = \lambda^x e^{-\lambda}/x!$ for $x = 0, 1, 2, \ldots$. Hence
$$f(x \mid \lambda) = \frac{\lambda^{\sum_i x_i} e^{-n\lambda}}{\prod_i x_i!} \quad \text{for } x \in \{0, 1, 2, \ldots\}^n.$$

Example: Let $X_1, X_2, \ldots, X_n$ be iid $N(\mu, \sigma^2)$, with $\mu$ and $\sigma^2$ unknown. Here we have: $X =$
$(X_1, X_2, \ldots, X_n)$, $\theta = (\mu, \sigma^2)$, $\Theta = \{(\mu, \sigma^2) : -\infty < \mu < \infty, \sigma^2 > 0\}$, and $P_\theta$ is described by the joint pdf
$$f(x \mid \mu, \sigma^2) = \prod_{i=1}^n g(x_i \mid \mu, \sigma^2)$$
where $g$ is the $N(\mu, \sigma^2)$ pdf $g(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/(2\sigma^2)}$. Hence
$$f(x \mid \mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x_i-\mu)^2/(2\sigma^2)}.$$

2 Sufficient Statistic

Let $X \sim P_\theta$, $\theta$ unknown. What part (or function) of the data $X$ is essential for inference about $\theta$?

Example: Suppose $X_1, \ldots, X_n$ are iid Bernoulli($p$) (independent tosses of a coin). Intuitively, $T = \sum_i X_i$ = # of heads contains all the information about $p$ in the data. We need to formalize this.

Let $X \sim P_\theta$, $\theta$ unknown.

Definition 1. The statistic $T = T(X)$ is a sufficient statistic for $\theta$ if the conditional distribution of $X$ given $T$ does not depend on the unknown parameter $\theta$.

Abbreviation: $T$ is SS if $\mathcal{L}(X \mid T)$ is the same for all $\theta$, where $\mathcal{L}$ stands for "law" (or distribution).

2.1 Motivation for the definition

Suppose $X \sim P_\theta$, $\theta \in \Theta$, $\theta$ unknown. Let $T = T(X)$ be any statistic. We can imagine that the data $X$ is generated hierarchically as follows:

1. First generate $T \sim \mathcal{L}(T)$.
2. Then generate $X \sim \mathcal{L}(X \mid T)$.

If $T$ is a sufficient statistic for $\theta$, then $\mathcal{L}(X \mid T)$ does not depend on $\theta$ and Step 2 can be carried out without knowing $\theta$. Since, given $T$, the data $X$ can be generated without knowing $\theta$, the data $X$ supplies no further information about $\theta$ beyond what is already contained in $T$.
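The two-step hierarchical description can be illustrated with the coin-toss example just mentioned. The Python sketch below (standard library only; the sample size, success probability, and replication count are arbitrary choices) takes for granted a fact derived later in these notes: given $T = t$, every 0-1 sequence with $t$ ones is equally likely, so Step 2 needs no knowledge of $p$.

```python
import random
from collections import Counter

random.seed(0)
n, p, reps = 3, 0.3, 60_000  # illustrative choices

def direct(n, p):
    """Generate X = (X_1, ..., X_n) directly as iid Bernoulli(p)."""
    return tuple(int(random.random() < p) for _ in range(n))

def hierarchical(n, p):
    """Two-step generation: first T ~ L(T) = Binomial(n, p) (uses p),
    then X | T = t uniform over 0-1 sequences with t ones (p-free)."""
    t = sum(int(random.random() < p) for _ in range(n))  # Step 1
    seq = [1] * t + [0] * (n - t)
    random.shuffle(seq)                                  # Step 2: no p needed
    return tuple(seq)

a = Counter(direct(n, p) for _ in range(reps))
b = Counter(hierarchical(n, p) for _ in range(reps))
# Both schemes give (up to Monte Carlo error) the same distribution
# over datasets in {0,1}^n.
for seq in a.keys() | b.keys():
    assert abs(a[seq] - b[seq]) / reps < 0.015
```

Only Step 1 of the hierarchical scheme touches $p$; Step 2 is exactly the "fake data" mechanism described below.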
Notation: $X \sim P_\theta$, $\theta \in \Theta$, $\theta$ unknown. If $T = T(X)$ is a sufficient statistic for $\theta$, then $T$ contains all the information about $\theta$ in $X$ in the following sense: if $X$ is discarded but we keep $T = T(X)$, we can "fake" the data (without knowing $\theta$) by generating $X^* \sim \mathcal{L}(X \mid T)$. Then $X^*$ has the same distribution as $X$ ($X^* \sim P_\theta$) and the same value of the sufficient statistic ($T(X^*) = T(X)$), and $X^*$ can be used for any purpose we would use the real data for.

Example: If $U(X)$ is an estimator of $\theta$, then $U(X^*)$ is another estimator of $\theta$ which performs just as well, since $U(X) \stackrel{d}{=} U(X^*)$ for all $\theta$.

Cautionary Note: If the model is correct ($X \sim P_\theta$) and $T(X)$ is sufficient for $\theta$, then we can ignore the data $X$ and just use $T(X)$ for inference about $\theta$. BUT if we are not sure that the model is correct, $X$ may contain valuable information about model correctness not contained in $T(X)$.

Example: $X_1, X_2, \ldots, X_n$ iid Bernoulli($p$). $T = \sum_{i=1}^n X_i$ is a sufficient statistic for $p$. Possible model violations:

- The trials might be correlated (not independent).
- The success probability $p$ might not be constant from trial to trial.

These model violations cannot be investigated using the sufficient statistic alone; this can only be done by further investigation with the full data.

2.2 Examples of Sufficient Statistics

1. $X = (X_1, X_2)$ iid Poisson($\lambda$). $T = X_1 + X_2$ is a sufficient statistic for $\lambda$ because
$$P_\lambda(X_1 = x_1, X_2 = x_2 \mid T = t) = \frac{P_\lambda(X_1 = x_1, X_2 = x_2, \overbrace{T = t}^{\text{redundant if } t = x_1 + x_2})}{P_\lambda(T = t)} = \begin{cases} \frac{P_\lambda(X_1 = x_1, X_2 = x_2)}{P_\lambda(T = t)} & \text{if } t = x_1 + x_2 \\ 0 & \text{if } t \neq x_1 + x_2. \end{cases}$$
This follows from the fact that for discrete distributions $P_\theta$,
$$P_\theta(X = x \mid T(X) = t) = \begin{cases} \frac{P_\theta(X = x)}{P_\theta(T(X) = t)} & \text{if } T(x) = t \\ 0 & \text{otherwise.} \end{cases}$$
Assuming $t = x_1 + x_2$,
$$P_\lambda(X_1 = x_1, X_2 = x_2 \mid T = t) = \frac{\frac{\lambda^{x_1} e^{-\lambda}}{x_1!} \cdot \frac{\lambda^{x_2} e^{-\lambda}}{x_2!}}{\frac{(2\lambda)^t e^{-2\lambda}}{t!}} \quad (\text{since } T \sim \text{Poisson}(2\lambda)) \quad = \binom{t}{x_1} \Big(\frac{1}{2}\Big)^t,$$
which does not involve $\lambda$. Thus, $T$ is a sufficient statistic for $\lambda$. Note that
$$P(X_1 = x_1 \mid T = t) = \binom{t}{x_1} \Big(\frac{1}{2}\Big)^{x_1} \Big(\frac{1}{2}\Big)^{t - x_1}, \quad x_1 = 0, 1, \ldots, t.$$
Thus $\mathcal{L}(X_1 \mid T = t)$ is Binomial($t, 1/2$). Given $T = t$, we may generate fake data $X_1^*, X_2^*$ without knowing $\lambda$ which has the same distribution as the real data:

(a) Generate $X_1^* \sim$ Binomial($t, 1/2$). (Toss a fair coin $t$ times and count the number of heads.)
(b) Set $X_2^* = t - X_1^*$.

The real and fake data have the same value of the sufficient statistic: $X_1 + X_2 = t = X_1^* + X_2^*$.

2. Extension of the previous example: If $X = (X_1, X_2, \ldots, X_n)$ are iid Poisson($\lambda$), then $T = X_1 + X_2 + \cdots + X_n$ is a sufficient statistic for $\lambda$. Moreover,
$$P(X_1 = x_1, \ldots, X_n = x_n \mid T = t) = \frac{t!}{x_1! x_2! \cdots x_n!} \Big(\frac{1}{n}\Big)^t = \binom{t}{x_1, \ldots, x_n} \Big(\frac{1}{n}\Big)^{x_1} \cdots \Big(\frac{1}{n}\Big)^{x_n},$$
so that $\mathcal{L}(X \mid T = t)$ is Multinomial with $t$ trials and $n$ categories of equal probability $1/n$ (see Section 4.6).

3. $X = (X_1, X_2)$ iid Expo($\beta$). Then $T = X_1 + X_2$ is a sufficient statistic for $\beta$. To derive this, we need to calculate $\mathcal{L}(X_1, X_2 \mid T = t)$. It suffices to get $\mathcal{L}(X_1 \mid T = t)$ since $X_2 = t - X_1$. How to do this?

(a) Find the joint density $f_{X_1, T}(x_1, t)$.
(b) Then get the conditional density
$$f_{X_1 \mid T}(x_1 \mid t) = \frac{f_{X_1, T}(x_1, t)}{f_T(t)}.$$

Continuing with the steps,

(a) Use the transformation
$$U = X_1, \quad T = X_1 + X_2; \qquad X_1 = U, \quad X_2 = T - U$$
with Jacobian $|J| = 1$. Then
$$f_{U,T}(u, t) = f_{X_1, X_2}(u, t - u)\,|J| = \frac{1}{\beta} e^{-u/\beta} \cdot \frac{1}{\beta} e^{-(t-u)/\beta} \cdot 1 = \frac{1}{\beta^2} e^{-t/\beta}, \quad \text{for } 0 \leq u \leq t < \infty.$$

(b) $T = X_1 + X_2 \sim$ Gamma($2, \beta$), so that
$$f_T(t) = \frac{t e^{-t/\beta}}{\beta^2}, \quad t \geq 0.$$
Alternatively, integrate over $x_1$ in the joint density $f_{X_1, T}(x_1, t)$ to get $f_T(t)$. Now
$$f_{X_1 \mid T}(x_1 \mid t) = \frac{\frac{1}{\beta^2} e^{-t/\beta} I(0 \leq x_1 \leq t)}{\frac{t e^{-t/\beta}}{\beta^2}} = \frac{1}{t} I(0 \leq x_1 \leq t),$$
which does not involve $\beta$. Thus $T = X_1 + X_2$ is a sufficient statistic for $\beta$. Moreover, $\mathcal{L}(X_1 \mid T = t)$ is Unif($0, t$). This can also be seen intuitively by noting that
$$f_{X_1, X_2}(x_1, x_2) = \frac{1}{\beta^2} e^{-(x_1 + x_2)/\beta}$$
is constant on the line segment
$$\{(x_1, x_2) : x_1 \geq 0, x_2 \geq 0, x_1 + x_2 = t\}.$$
Thus, given $T = t$, we may generate fake data $X_1^*, X_2^*$ without knowing $\beta$ which has the same distribution as the real data:

(a) Generate $X_1^* \sim$ Unif($0, t$).
(b) Set $X_2^* = t - X_1^*$.

The real and fake data have the same value of the sufficient statistic: $X_1 + X_2 = t = X_1^* + X_2^*$.

4. Extension of the previous example: If $X = (X_1, X_2, \ldots, X_n)$ are iid Expo($\beta$), then $T = X_1 + X_2 + \cdots + X_n$ is a sufficient statistic for $\beta$, and $\mathcal{L}(X \mid T = t)$ is the uniform distribution on the simplex $\{(x_1, \ldots, x_n) : x_1 + \cdots + x_n = t, \; x_i \geq 0 \;\forall i\}$.
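The exponential derivation can be spot-checked numerically. The Python sketch below (standard library only; the particular values of $t$, $\beta$, and $x_1$ are arbitrary choices) evaluates the joint and marginal densities derived above and confirms that their ratio is the Unif($0, t$) density $1/t$, whatever $\beta$ is.

```python
import math

def joint_x1_t(x1, t, beta):
    """f_{X1,T}(x1, t) = (1/beta^2) e^{-t/beta} on 0 <= x1 <= t, else 0."""
    if not 0.0 <= x1 <= t:
        return 0.0
    return math.exp(-t / beta) / beta ** 2

def marginal_t(t, beta):
    """T = X1 + X2 ~ Gamma(2, beta): f_T(t) = t e^{-t/beta} / beta^2."""
    return t * math.exp(-t / beta) / beta ** 2

t = 3.0
for beta in (0.5, 1.0, 4.0):          # the conditional must not depend on beta
    for x1 in (0.1, 1.5, 2.9):
        cond = joint_x1_t(x1, t, beta) / marginal_t(t, beta)
        assert abs(cond - 1.0 / t) < 1e-12   # Unif(0, t) density, free of beta
```

The $\beta$-dependent factor $e^{-t/\beta}/\beta^2$ cancels exactly in the ratio, which is the algebraic content of sufficiency here.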
5. $X = (X_1, X_2)$ iid Unif($0, \theta$). Then $T = X_1 + X_2$ is not a sufficient statistic for $\theta$.

Proof. We must show that $\mathcal{L}(X_1, X_2 \mid T)$ depends on $\theta$. The support of $(X_1, X_2)$ is $[0, \theta]^2$. Given $T = t$, we know $(X_1, X_2)$ lies on the line $L = \{(x_1, x_2) : x_1 + x_2 = t\}$. Thus, the support of $\mathcal{L}(X_1, X_2 \mid T)$ is $L \cap [0, \theta]^2$, which differs for different values of $\theta$. Since the support of $\mathcal{L}(X_1, X_2 \mid T = t)$ varies with $\theta$, $\mathcal{L}(X_1, X_2 \mid T)$ depends on $\theta$.

6. If $X_1, \ldots, X_n$ are iid Bernoulli($p$), then $T = \sum_{i=1}^n X_i$ is a sufficient statistic for $p$.

First: What is the joint pmf of $X_1, \ldots, X_n$? Note that
$$P(X_1 = 1, X_2 = 0, X_3 = 1, X_4 = 1, X_5 = 0) = p \cdot q \cdot p \cdot p \cdot q = p^3 q^2,$$
where $q = 1 - p$. In general,
$$P(X = x) = P(X_1 = x_1, \ldots, X_n = x_n) = \prod_i p^{x_i} q^{1 - x_i} = p^{\sum_i x_i} q^{n - \sum_i x_i} = p^t q^{n-t} = p^{T(x)} q^{n - T(x)},$$
where $T(x) = t = \sum_{i=1}^n x_i$.

Next, we derive $\mathcal{L}(X \mid T)$. We will use the notation $T(X) = \sum_{i=1}^n X_i = T$ and $T(x) = \sum_{i=1}^n x_i$. Recall that for discrete distributions $P_\theta$,
$$P_\theta(X = x \mid T(X) = t) = \begin{cases} \frac{P_\theta(X = x)}{P_\theta(T(X) = t)} & \text{if } T(x) = t \\ 0 & \text{otherwise.} \end{cases}$$
Assume $T(x) = \sum_i x_i = t$ and $\theta = p$. Then
$$P_\theta(X = x \mid T(X) = t) = \frac{P_\theta(X = x)}{P_\theta(T(X) = t)} = \frac{p^t q^{n-t}}{\binom{n}{t} p^t q^{n-t}} = \frac{1}{\binom{n}{t}},$$
since $T \sim$ Binomial($n, p$). This does not involve $p$, which proves that $T$ is a sufficient statistic for $p$.

Note: The conditional probability is the same for any sequence $x = (x_1, \ldots, x_n)$ with $t$ 1s and $n - t$ 0s. There are $\binom{n}{t}$ such sequences.

Summary: Given $T = X_1 + \cdots + X_n = t$, all possible sequences of $t$ 1s and $n - t$ 0s are equally likely.

Algorithm for generating from $\mathcal{L}(X_1, \ldots, X_n \mid T = t)$:

(a) Put $t$ 1s and $n - t$ 0s in an urn.
(b) Draw them out one by one (without replacement) until the urn is empty.

This makes all possible sequences equally likely. (Think about it!) The resulting sequence $(X_1^*, \ldots, X_n^*)$ (the fake data) has the same value of the sufficient statistic as $(X_1, \ldots, X_n)$: $\sum_i X_i^* = t = \sum_i X_i$.

2.3 Sufficient conditions for sufficiency

Sometimes finding a sufficient statistic can be time-consuming and cumbersome if one proceeds directly from the definition. We need an easily verifiable sufficient condition for finding a sufficient statistic. Suppose $X \sim P_\theta$, $\theta \in \Theta$.

Theorem 6.2.2. $T(X)$ is a sufficient statistic for $\theta$ iff for all $x$,
$$\frac{f_X(x \mid \theta)}{f_T(T(x) \mid \theta)}$$
is constant as a function of $\theta$.

Notation: $f_X(x \mid \theta)$ is the pdf (or pmf) of $X$; $f_T(t \mid \theta)$ is the pdf (or pmf) of $T = T(X)$.

Factorization Criterion (FC): There exist functions $h(x)$ and $g(t \mid \theta)$ such that
$$f(x \mid \theta) = g(T(x) \mid \theta)\, h(x)$$
for all $x$ and $\theta$.

Theorem 1. $T(X)$ is a sufficient statistic for $\theta$ iff the factorization criterion is satisfied.

Proof. (When $X$ is discrete.) Notation: $T = T(X)$, $t = T(x)$.
First, assume $T$ is a sufficient statistic for $\theta$. Then the pmf $f(x \mid \theta)$ can be written as
$$f(x \mid \theta) = \underbrace{P_\theta(T = t)}_{\substack{\text{a function of } t \text{ and } \theta; \\ \text{call it } g(t \mid \theta)}} \; \underbrace{P_\theta(X = x \mid T = t)}_{\substack{\text{depends on } x \text{ but not } \theta \\ \text{(by defn. of suff. stat.); call it } h(x)}} = g(t \mid \theta)\, h(x).$$
Hence FC is true.

Next, assume FC is true. Then
$$P_\theta(X = x \mid T = t) = \frac{P_\theta(X = x)}{P_\theta(T = t)} \; (\text{since } \{X = x\} \subset \{T = t\}) = \frac{f(x \mid \theta)}{\sum_{z : T(z) = t} f(z \mid \theta)} = \frac{g(t \mid \theta) h(x)}{\sum_{z : T(z) = t} g(t \mid \theta) h(z)} = \frac{h(x)}{\sum_{z : T(z) = t} h(z)},$$
which does not involve $\theta$.

2.4 Applications of FC

1. Let $X = (X_1, \ldots, X_n)$ iid Poisson($\lambda$). The joint pmf is
$$f(x \mid \lambda) = f(x_1, \ldots, x_n \mid \lambda) = \prod_i \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \frac{\lambda^{\sum_i x_i} e^{-n\lambda}}{\prod_i x_i!} = \Big(\lambda^{\sum_i x_i} e^{-n\lambda}\Big) \Big(\frac{1}{\prod_i x_i!}\Big) = g(T(x) \mid \lambda)\, h(x),$$
where
$$T(x) = \sum_i x_i, \qquad g(t \mid \lambda) = \lambda^t e^{-n\lambda}, \qquad h(x) = \frac{1}{\prod_i x_i!}.$$
Thus, by FC, $T(X) = \sum_i X_i$ is a sufficient statistic for $\lambda$.

2. Simple Linear Regression: Let
$$X_i = \beta_0 + \beta_1 z_i + \epsilon_i, \quad \epsilon_i \text{ iid } N(0, \sigma_0^2), \quad i = 1, \ldots, n,$$
where $z_i$, $i = 1, \ldots, n$ are known constants. Alternative statement of the model: $X_1, X_2, \ldots, X_n$ independent with $X_i \sim N(\beta_0 + \beta_1 z_i, \sigma_0^2)$.
The data is $X = (X_1, X_2, \ldots, X_n)$; $(z_1, z_2, \ldots, z_n)$ are known constants. The unknown parameter is $\theta = (\beta_0, \beta_1) \in \mathbb{R}^2$. What are the sufficient statistics for this model? Use FC.
$$f(x \mid \theta) = \prod_{i=1}^n \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma_0} e^{-(x_i - \beta_0 - \beta_1 z_i)^2 / 2\sigma_0^2}}_{N(\beta_0 + \beta_1 z_i,\, \sigma_0^2) \text{ density}} = \Big(\frac{1}{\sqrt{2\pi}\,\sigma_0}\Big)^n \exp\Big\{-\frac{1}{2\sigma_0^2} \underbrace{\sum_i (x_i - \beta_0 - \beta_1 z_i)^2}_{S}\Big\}.$$
Here
$$S = \sum_i x_i^2 - 2\sum_i x_i(\beta_0 + \beta_1 z_i) + \sum_i (\beta_0 + \beta_1 z_i)^2 = \sum_i x_i^2 - 2\beta_0 \sum_i x_i - 2\beta_1 \sum_i x_i z_i + \sum_i (\beta_0 + \beta_1 z_i)^2.$$
Plug this back into the exponential and rearrange to get
$$f(x \mid \theta) = \exp\Big\{-\frac{1}{2\sigma_0^2} \sum_i x_i^2\Big\} \Big(\frac{1}{\sqrt{2\pi}\,\sigma_0}\Big)^n \exp\Big\{-\frac{1}{2\sigma_0^2}\Big(-2\beta_0 \sum_i x_i - 2\beta_1 \sum_i x_i z_i + \sum_i (\beta_0 + \beta_1 z_i)^2\Big)\Big\} = g\Big(\sum_i x_i, \sum_i x_i z_i, \beta_0, \beta_1\Big) h(x) = g(T(x), \theta)\, h(x),$$
where $T(x) = (\sum_{i=1}^n x_i, \sum_{i=1}^n x_i z_i)$ and
$$g(t, \theta) = \Big(\frac{1}{\sqrt{2\pi}\,\sigma_0}\Big)^n \exp\Big\{-\frac{1}{2\sigma_0^2}\Big(-2\beta_0 t_1 - 2\beta_1 t_2 + \sum_i (\beta_0 + \beta_1 z_i)^2\Big)\Big\}$$
with $t = (t_1, t_2)$, and $h(x) = \exp\{-\frac{1}{2\sigma_0^2} \sum_{i=1}^n x_i^2\}$.

3. Continuation of the Simple Linear Regression Example: What if the variance $\sigma^2$ is unknown? Now $\theta = (\beta_0, \beta_1, \sigma^2)$ and $\Theta = \mathbb{R}^2 \times (0, \infty)$. (Change $\sigma_0^2$ to $\sigma^2$ in the earlier
formulas to indicate this.) Now $\exp\{-\frac{1}{2\sigma^2}\sum_i x_i^2\}$ is not a function of $x$ alone, but depends also on $\theta$. So we now factor the joint density as
$$f(x \mid \theta) = \Big(\frac{1}{\sqrt{2\pi}\,\sigma}\Big)^n \exp\Big\{-\frac{1}{2\sigma^2}\Big(\sum_i x_i^2 - 2\beta_0 \sum_i x_i - 2\beta_1 \sum_i x_i z_i + \sum_i (\beta_0 + \beta_1 z_i)^2\Big)\Big\} = g\Big(\sum_i x_i^2, \sum_i x_i, \sum_i x_i z_i, \beta_0, \beta_1, \sigma^2\Big) h(x) = g(T(x), \theta)\, h(x),$$
where
$$T(x) = \Big(\sum_i x_i^2, \sum_i x_i, \sum_i x_i z_i\Big) = (t_1, t_2, t_3), \qquad g(t, \theta) = (2\pi\sigma^2)^{-n/2} \exp\Big\{-\frac{1}{2\sigma^2}\Big(t_1 - 2\beta_0 t_2 - 2\beta_1 t_3 + \sum_i (\beta_0 + \beta_1 z_i)^2\Big)\Big\},$$
and $h(x) = 1$. According to FC, $T(X) = (\sum_i X_i^2, \sum_i X_i, \sum_i z_i X_i)$ is a sufficient statistic for $\theta = (\beta_0, \beta_1, \sigma^2)$.

4. Discussion of the preceding examples: We have described two models. The model with $\sigma^2$ known (i.e., $\sigma^2 = \sigma_0^2$) can be regarded as a subset of the model where $\sigma^2$ is unknown:
$$\Theta_1 = \{(\beta_0, \beta_1, \sigma^2) : \sigma^2 = \sigma_0^2\} = \mathbb{R}^2 \times \{\sigma_0^2\}, \qquad \Theta_2 = \{(\beta_0, \beta_1, \sigma^2) : \sigma^2 > 0\} = \mathbb{R}^2 \times (0, \infty), \qquad \Theta_1 \subset \Theta_2.$$
The sufficient statistics we found for these two models were different:

1. $T_1 = (\sum_i X_i, \sum_i z_i X_i)$ is a SS for $\Theta_1$.
2. $T_2 = (\sum_i X_i^2, \sum_i X_i, \sum_i z_i X_i)$ is a SS for $\Theta_2$.

Note: $T_2$ is also a SS for $\Theta_1$, but it is not minimal.

5. Sufficient statistics for random samples from various families of normal distributions: Let $X = (X_1, \ldots, X_n)$ where $X_1, \ldots, X_n$ are iid $N(\mu, \sigma^2)$. Consider different families of normal distributions:
$$\Theta_1 = \{(\mu, \sigma^2) : \sigma^2 > 0\} \quad \text{(all normal distributions)}$$
$$\Theta_2 = \{(\mu, \sigma^2) : \sigma^2 = \sigma_0^2\} \quad \text{(known variance)}$$
$$\Theta_3 = \{(\mu, \sigma^2) : \mu = \mu_0, \sigma^2 > 0\} \quad \text{(known mean)}$$
For each parameter space, the obvious sufficient statistic is different. In all cases, the joint pdf of $X$ is given by
$$f(x \mid \mu, \sigma^2) = \prod_i (2\pi\sigma^2)^{-1/2} \exp\Big\{-\frac{(x_i - \mu)^2}{2\sigma^2}\Big\} = (2\pi\sigma^2)^{-n/2} \exp\Big\{-\frac{1}{2\sigma^2} \sum_i (x_i - \mu)^2\Big\}. \tag{1}$$

$\Theta_3$: Here $\mu = \mu_0$ (a known value), so the unknown parameter is $\theta = \sigma^2$. The joint pdf may be factored as
$$f(x \mid \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\Big\{-\frac{1}{2\sigma^2} \sum_i (x_i - \mu_0)^2\Big\} = g\Big(\sum_i (x_i - \mu_0)^2, \sigma^2\Big) h(x) = g(T_3(x), \sigma^2)\, h(x),$$
where $T_3(x) = \sum_{i=1}^n (x_i - \mu_0)^2$, so that $T_3 = T_3(X) = \sum_i (X_i - \mu_0)^2$ is a SS for $\Theta_3$. Note: $T_3$ is not even a statistic if $\mu$ is unknown (i.e., not fixed).

For the rest ($\Theta_1$ and $\Theta_2$), we modify (1) by substituting
$$\sum_i (x_i - \mu)^2 = \sum_i (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2, \quad \text{where } \bar{x} = n^{-1} \sum_{i=1}^n x_i.$$
(This is an identity valid for all $x_1, x_2, \ldots, x_n$ and $\mu$.) Substituting into (1) and breaking up the exponential yields
$$f(x \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\Big\{-\frac{\sum_i (x_i - \bar{x})^2}{2\sigma^2}\Big\} \exp\Big\{-\frac{n(\bar{x} - \mu)^2}{2\sigma^2}\Big\}. \tag{2}$$

$\Theta_2$: Here $\sigma^2 = \sigma_0^2$ (a known value), so the unknown parameter is $\theta = \mu$. Factor the joint pdf (2) as
$$f(x \mid \mu) = \Big[(2\pi\sigma_0^2)^{-n/2} \exp\Big\{-\frac{1}{2\sigma_0^2} \sum_i (x_i - \bar{x})^2\Big\}\Big] \Big[\exp\Big\{-\frac{n(\bar{x} - \mu)^2}{2\sigma_0^2}\Big\}\Big] = h(x)\, g(\bar{x}, \mu) = h(x)\, g(T_2(x), \mu),$$
where $T_2(x) = \bar{x}$. This shows that $T_2 = T_2(X) = \bar{X}$ is a SS for $\Theta_2$.

$\Theta_1$: Here both $\mu$ and $\sigma^2$ are unknown, so $\theta = (\mu, \sigma^2)$. It is clear that (2) may be written as
$$f(x \mid \mu, \sigma^2) = g\Big(\bar{x}, \sum_i (x_i - \bar{x})^2, \mu, \sigma^2\Big) \cdot 1 = g(T_1(x), \theta)\, h(x),$$
where $T_1(x) = (\bar{x}, \sum_i (x_i - \bar{x})^2)$, so that $T_1 = T_1(X) = (\bar{X}, \sum_i (X_i - \bar{X})^2)$ is a SS for $\Theta_1$.

Note: $T_1$ is also a SS for $\Theta_2$ and $\Theta_3$; neither $T_2$ nor $T_3$ is a SS for $\Theta_1$.

2.5 General Facts about SS

1. If $T = T(X)$ is a SS for $\theta \in \Theta_A$, and $\Theta_B \subset \Theta_A$, then $T$ is a SS for $\theta \in \Theta_B$.
Proof. If $\mathcal{L}(X \mid T)$ is the same for all $\theta \in \Theta_A$, then it is the same for all $\theta \in \Theta_B$.

2. If $T$ is a SS (for $\theta \in \Theta$) and $T = \phi(U)$ where $U = U(X)$, then $U$ is also a SS (for $\theta \in \Theta$).
Proof. (Using FC.) Since $T$ is a SS,
$$f(x \mid \theta) = g(T(x) \mid \theta)\, h(x) = g(\phi(U(x)) \mid \theta)\, h(x) = g^*(U(x) \mid \theta)\, h(x),$$
where $g^*(u \mid \theta) = g(\phi(u) \mid \theta)$. Hence $U(X)$ is a SS.

3. If $T = T(X)$ is a sufficient statistic (for $\theta \in \Theta$), then $U = (S, T)$ is also a sufficient statistic for any $S = S(X)$.
Proof. Immediate consequence of 2) by taking $\phi(s, t) = t$. With this choice of $\phi$, we have $T = \phi(U)$, so $U$ is a SS.

4. If $T = T(X)$ and $U = U(X)$ are related by $T = \phi(U)$ where $\phi$ is a one-one function, then $T$ is a SS iff $U$ is a SS.

2.6 Application to random samples from various families of normal distributions

Recall:

1. $T_1 = (\bar{X}, \sum_i (X_i - \bar{X})^2)$ is a SS for $\Theta_1 = \{(\mu, \sigma^2) : \sigma^2 > 0\}$.
2. $T_2 = \bar{X}$ is a SS for $\Theta_2 = \{(\mu, \sigma^2) : \sigma^2 = \sigma_0^2\}$.
3. $T_3 = \sum_i (X_i - \mu_0)^2$ is a SS for $\Theta_3 = \{(\mu, \sigma^2) : \mu = \mu_0, \sigma^2 > 0\}$.
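All three of these statistics rest on the decomposition $\sum_i (x_i - \mu)^2 = \sum_i (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2$ used in the previous section. A numeric spot-check of that identity (Python sketch, standard library only; the sample and the value of $\mu$ are arbitrary choices):

```python
import random

random.seed(2)
xs = [random.gauss(0.0, 1.0) for _ in range(8)]  # arbitrary sample
mu = 0.7                                         # arbitrary mu
n = len(xs)
xbar = sum(xs) / n

# LHS: sum of squared deviations about mu.
lhs = sum((x - mu) ** 2 for x in xs)
# RHS: within-sample sum of squares plus the mean-shift term.
rhs = sum((x - xbar) ** 2 for x in xs) + n * (xbar - mu) ** 2
assert abs(lhs - rhs) < 1e-9  # identity holds for any data and any mu
```

Because the identity is algebraic (the cross term $2(\bar{x} - \mu)\sum_i (x_i - \bar{x})$ vanishes), it holds exactly for every dataset, not just on average.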
Some facts:

1. $T_1$ is a SS for $\Theta_1$ $\Rightarrow$ $T_1$ is a SS for $\Theta_2$ and for $\Theta_3$. (Follows from Fact 1, since $\Theta_2 \subset \Theta_1$ and $\Theta_3 \subset \Theta_1$.)
2. $T_2$ is a SS for $\Theta_2$ $\Rightarrow$ $T_1$ is a SS for $\Theta_2$. (Follows from Fact 3.)
3. $T_3$ is a SS for $\Theta_3$, and $T_3 = \sum_i (X_i - \mu_0)^2 = \sum_i (X_i - \bar{X})^2 + n(\bar{X} - \mu_0)^2 = \phi(T_1)$ $\Rightarrow$ $T_1$ is a SS for $\Theta_3$. (Follows from Fact 2.)
4. $T_1$ is a SS for $\Theta_1$ $\Rightarrow$ $(\bar{X}, \frac{1}{n-1}\sum_i (X_i - \bar{X})^2)$ is a SS for $\Theta_1$, and $(\sum_i X_i, \sum_i X_i^2)$ is a SS for $\Theta_1$. (Both of these are one-one functions of $T_1$; follows from Fact 4.)

3 Minimal sufficient statistic

Definition 2. A minimal sufficient statistic is a sufficient statistic that is a function of any other sufficient statistic: $T = T(X)$ is minimal sufficient if for every sufficient statistic $S = S(X)$ there exists a function $\psi$ such that $T = \psi(S)$, that is, $T(X) = \psi(S(X))$.

Theorem 2 (Lehmann-Scheffe Theorem). $X \sim P_\theta$, $\theta \in \Theta$. $T(X)$ is a minimal sufficient statistic iff for all $x, y$:
$$T(x) = T(y) \iff \frac{f(x \mid \theta)}{f(y \mid \theta)} \text{ is constant as a function of } \theta.$$

Remark 1. It is difficult to show a statistic is a MSS directly from the definition. For proving minimality, we usually use the Lehmann-Scheffe Theorem. However, it is often very easy to prove a statistic is not a MSS using the definition: if $S$ and $T$ are two different sufficient statistics, and $T$ cannot be written as a function of $S$, then $T$ is not minimal.

Example: Consider the three families of normal distributions used earlier. $T_1$ and $T_2$ are both SS for $\Theta_2$, but $T_1$ clearly cannot be written as a function of $T_2$. Thus $T_1$ is not a MSS for $\Theta_2$. Similarly, $T_1$ and $T_3$ are both SS for $\Theta_3$, but $T_1$ clearly cannot be written as a function of $T_3$. Thus $T_1$ is not a MSS for $\Theta_3$.

Comments on the Lehmann-Scheffe Theorem:

1. In situations where the support of $f(x \mid \theta)$ depends on $\theta$, a better statement (which avoids awkward $\frac{0}{0}$'s) is: for all $x, y$, $T(x) = T(y)$ iff $f(x \mid \theta) = c(x, y) f(y \mid \theta)$ for all $\theta$.

2. The "iff" can be broken down into two results:
(a) If $T(X)$ is sufficient, then for all $x, y$: $T(x) = T(y)$ implies $f(x \mid \theta)/f(y \mid \theta)$ is constant in $\theta$.
(b) A sufficient statistic $T(X)$ is minimal if for all $x, y$: $f(x \mid \theta)/f(y \mid \theta)$ constant in $\theta$ implies $T(x) = T(y)$.
3.1 Examples for the Lehmann-Scheffe Theorem

1. $X = (X_1, \ldots, X_n)$ iid $N(\mu, \sigma^2)$. $T(X) = (\bar{X}, S^2)$, where $S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2$, is a MSS for $(\mu, \sigma^2)$.

2. $X = (X_1, \ldots, X_n)$ iid Uniform($\alpha, \beta$), $\Theta = \{(\alpha, \beta) : -\infty < \alpha < \beta < \infty\}$. $T(X) = (X_{(1)}, X_{(n)})$ is a MSS for $(\alpha, \beta)$, where $X_{(1)} = \min X_i$ and $X_{(n)} = \max X_i$.

We must verify: for all $x, y$, $T(x) = T(y)$ iff there exists $c > 0$ such that $f(x \mid \theta) = c f(y \mid \theta)$ for all $\theta$. ($c$ does not involve $\theta$, but can depend on $x, y$.) In this case,
$$f(x \mid \theta) = \prod_i \frac{1}{\beta - \alpha} I(\alpha \leq x_i \leq \beta) = \frac{1}{(\beta - \alpha)^n} I(x_{(1)} \geq \alpha)\, I(x_{(n)} \leq \beta).$$
Similarly, $f(y \mid \theta) = \frac{1}{(\beta - \alpha)^n} I(y_{(1)} \geq \alpha)\, I(y_{(n)} \leq \beta)$. Clearly, $(x_{(1)}, x_{(n)}) = (y_{(1)}, y_{(n)})$ implies $f(x \mid \theta) = f(y \mid \theta)$ for all $\theta \in \Theta$ (take $c = 1$). This gives one direction. What about the other? Define $A(x) = \{\theta : f(x \mid \theta) > 0\}$. Here $\theta = (\alpha, \beta)$ with $\alpha < \beta$. Assume that there exists $c > 0$ such that $f(x \mid \theta) = c f(y \mid \theta)$ for all $\theta$. Then we must have $A(x) = A(y)$. But
$$A(x) = \{(\alpha, \beta) : \alpha \leq x_{(1)}, \; \beta \geq x_{(n)}\}$$
for any $x$. Thus $A(x) = A(y)$ implies $(x_{(1)}, x_{(n)}) = (y_{(1)}, y_{(n)})$, proving that $(X_{(1)}, X_{(n)})$ is a MSS.

Note: This style of argument can only work for examples similar to the uniform distribution, where the support depends upon the parameter value.

3. $X = (X_1, \ldots, X_n)$ iid Uniform($\theta, \theta + 1$). Then $T(X) = (X_{(1)}, X_{(n)})$ is a MSS for $\theta$.

Comments:

(a) The dimension of the MSS does not have to be the same as the dimension of the parameter.
(b) Shrinking the parameter space does not always change the MSS. When $X = (X_1, \ldots, X_n)$ iid Uniform($\alpha, \beta$), $\Theta_1 = \{(\alpha, \beta) : \alpha < \beta\}$ and $\Theta_2 = \{(\alpha, \beta) : \beta = \alpha + 1\}$ have the same MSS.

4. Random Sample Model: Suppose $X = (X_1, X_2, \ldots, X_n)$ iid $\psi(x \mid \theta)$ (pdf or pmf), where $\psi(x \mid \theta)$ is an arbitrary family of pdf's (pmf's). Then $T(X) = (X_{(1)}, X_{(2)}, \ldots, X_{(n)})$, the order statistics (the data arranged in increasing order), is a sufficient statistic for $\theta$, but may not be minimal.

Proof. (Use FC.)
$$f(x \mid \theta) = \prod_i \psi(x_i \mid \theta) = \prod_i \psi(x_{(i)} \mid \theta) \cdot 1 = g(T(x) \mid \theta)\, h(x).$$

Note: (Assume $x_{(1)} < x_{(2)} < \cdots < x_{(n)}$.) Then $P(X = x \mid T(X) = t) = \frac{1}{n!}$ if $x$ is any rearrangement of $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, and $0$ otherwise. All possible orderings are equally likely. To generate from $\mathcal{L}(X \mid T)$, place the values $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ in a hat and draw them out one by one.

Comment: For random sample models, the order statistics are often the SS.

5. $X = (X_1, \ldots, X_n)$ iid $\psi(x \mid \theta)$ with
$$\psi(x \mid \theta) = \frac{1}{\pi} \cdot \frac{1}{1 + (x - \theta)^2},$$
the Cauchy location family. Look at
$$\frac{f(x \mid \theta)}{f(y \mid \theta)} = \frac{\prod_{i=1}^n \frac{1}{\pi} \cdot \frac{1}{1 + (x_i - \theta)^2}}{\prod_{i=1}^n \frac{1}{\pi} \cdot \frac{1}{1 + (y_i - \theta)^2}}.$$
16 If x (i) y (i) for all i, then the ratio is a constant function of θ. Now suppose f(x θ)/f(ỹ θ) is a constant function of θ. Then (1 + (x i θ) ) c(x, y) (1 + (y i θ) ) for some function c(x, y) independent of θ. This is equivalent to (θ x i θ + x i + 1) c(x, y) (θ y i θ + yi + 1). Clearly, both n (θ x i θ + x i + 1) and n (θ y i θ + yi + 1) are polynomials of degree n in θ with the same set of zeros O L and O R. We can spell out O L x i ± i, i 1,..., n}, O R y i ± i, i 1,..., n}, where i 1, the imaginary root of 1/ Then O L and O R are permutations of each other. Hence x (i) y (i) for all i 1,..., n. 6. Suppose X P θ, θ Θ and P θ has a joint pdf or pmf f(x θ). Fact: X is a SS for θ. Proof. (Using FC) Define T T (X) X. (T is the identity function.) Then f(x θ) f(x θ) 1 g(t (x) θ) h(x) where g f and h(x) 1. Thus T is SS. Proof. (From definition of SS) L(X T (X) t) L(X X t) δ t where δ t is the probability measure which places all its mass at the point (dataset) t. 7. Further suppose X (X 1,..., X n ) where X 1,..., X n are iid from the pdf (pmf) f(x θ). Fact: T (X) X (X 1,..., X n ) is not a MSS. Proof. (from definition of MSS) Let S S(X) (X (1), X (),..., X (n) ) (the order statistics). Since we have a random sample model, S is a SS. But clearly T is not a function of S. (You cannot recover the original ordering of the data given only the order statistics.) Thus T is not a MSS. 16
Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it
More informationProbability Background
Probability Background Namrata Vaswani, Iowa State University August 24, 2015 Probability recap 1: EE 322 notes Quick test of concepts: Given random variables X 1, X 2,... X n. Compute the PDF of the second
More informationExercises with solutions (Set D)
Exercises with solutions Set D. A fair die is rolled at the same time as a fair coin is tossed. Let A be the number on the upper surface of the die and let B describe the outcome of the coin toss, where
More informationNon-parametric Inference and Resampling
Non-parametric Inference and Resampling Exercises by David Wozabal (Last update. Juni 010) 1 Basic Facts about Rank and Order Statistics 1.1 10 students were asked about the amount of time they spend surfing
More informationLecture 11. Multivariate Normal theory
10. Lecture 11. Multivariate Normal theory Lecture 11. Multivariate Normal theory 1 (1 1) 11. Multivariate Normal theory 11.1. Properties of means and covariances of vectors Properties of means and covariances
More informationPCMI Introduction to Random Matrix Theory Handout # REVIEW OF PROBABILITY THEORY. Chapter 1 - Events and Their Probabilities
PCMI 207 - Introduction to Random Matrix Theory Handout #2 06.27.207 REVIEW OF PROBABILITY THEORY Chapter - Events and Their Probabilities.. Events as Sets Definition (σ-field). A collection F of subsets
More informationCS281A/Stat241A Lecture 17
CS281A/Stat241A Lecture 17 p. 1/4 CS281A/Stat241A Lecture 17 Factor Analysis and State Space Models Peter Bartlett CS281A/Stat241A Lecture 17 p. 2/4 Key ideas of this lecture Factor Analysis. Recall: Gaussian
More informationThis exam is closed book and closed notes. (You will have access to a copy of the Table of Common Distributions given in the back of the text.
TEST #3 STA 5326 December 4, 214 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. (You will have access to
More informationQuick Tour of Basic Probability Theory and Linear Algebra
Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra CS224w: Social and Information Network Analysis Fall 2011 Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra Outline Definitions
More informationFirst Year Examination Department of Statistics, University of Florida
First Year Examination Department of Statistics, University of Florida August 20, 2009, 8:00 am - 2:00 noon Instructions:. You have four hours to answer questions in this examination. 2. You must show
More informationCompleteness. On the other hand, the distribution of an ancillary statistic doesn t depend on θ at all.
Completeness A minimal sufficient statistic achieves the maximum amount of data reduction while retaining all the information the sample has concerning θ. On the other hand, the distribution of an ancillary
More informationUniversity of Regina. Lecture Notes. Michael Kozdron
University of Regina Statistics 252 Mathematical Statistics Lecture Notes Winter 2005 Michael Kozdron kozdron@math.uregina.ca www.math.uregina.ca/ kozdron Contents 1 The Basic Idea of Statistics: Estimating
More informationChapter 1. Statistical Spaces
Chapter 1 Statistical Spaces Mathematical statistics is a science that studies the statistical regularity of random phenomena, essentially by some observation values of random variable (r.v.) X. Sometimes
More informationECE534, Spring 2018: Solutions for Problem Set #3
ECE534, Spring 08: Solutions for Problem Set #3 Jointly Gaussian Random Variables and MMSE Estimation Suppose that X, Y are jointly Gaussian random variables with µ X = µ Y = 0 and σ X = σ Y = Let their
More information1.1 Review of Probability Theory
1.1 Review of Probability Theory Angela Peace Biomathemtics II MATH 5355 Spring 2017 Lecture notes follow: Allen, Linda JS. An introduction to stochastic processes with applications to biology. CRC Press,
More informationMAS223 Statistical Inference and Modelling Exercises
MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,
More informationST5215: Advanced Statistical Theory
Department of Statistics & Applied Probability Monday, September 26, 2011 Lecture 10: Exponential families and Sufficient statistics Exponential Families Exponential families are important parametric families
More informationSOLUTION FOR HOMEWORK 6, STAT 6331
SOLUTION FOR HOMEWORK 6, STAT 633. Exerc.7.. It is given that X,...,X n is a sample from N(θ, σ ), and the Bayesian approach is used with Θ N(µ, τ ). The parameters σ, µ and τ are given. (a) Find the joinf
More informationSTAT 418: Probability and Stochastic Processes
STAT 418: Probability and Stochastic Processes Spring 2016; Homework Assignments Latest updated on April 29, 2016 HW1 (Due on Jan. 21) Chapter 1 Problems 1, 8, 9, 10, 11, 18, 19, 26, 28, 30 Theoretical
More informationPart IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015
Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)
More informationOrder Statistics and Distributions
Order Statistics and Distributions 1 Some Preliminary Comments and Ideas In this section we consider a random sample X 1, X 2,..., X n common continuous distribution function F and probability density
More information1 Review of Probability and Distributions
Random variables. A numerically valued function X of an outcome ω from a sample space Ω X : Ω R : ω X(ω) is called a random variable (r.v.), and usually determined by an experiment. We conventionally denote
More informationGenerating Random Variates 2 (Chapter 8, Law)
B. Maddah ENMG 6 Simulation /5/08 Generating Random Variates (Chapter 8, Law) Generating random variates from U(a, b) Recall that a random X which is uniformly distributed on interval [a, b], X ~ U(a,
More informationSTAT 414: Introduction to Probability Theory
STAT 414: Introduction to Probability Theory Spring 2016; Homework Assignments Latest updated on April 29, 2016 HW1 (Due on Jan. 21) Chapter 1 Problems 1, 8, 9, 10, 11, 18, 19, 26, 28, 30 Theoretical Exercises
More informationStat 5101 Lecture Slides: Deck 7 Asymptotics, also called Large Sample Theory. Charles J. Geyer School of Statistics University of Minnesota
Stat 5101 Lecture Slides: Deck 7 Asymptotics, also called Large Sample Theory Charles J. Geyer School of Statistics University of Minnesota 1 Asymptotic Approximation The last big subject in probability
More informationMaximum Likelihood Estimation
Maximum Likelihood Estimation Assume X P θ, θ Θ, with joint pdf (or pmf) f(x θ). Suppose we observe X = x. The Likelihood function is L(θ x) = f(x θ) as a function of θ (with the data x held fixed). The
More informationMcGill University. Faculty of Science. Department of Mathematics and Statistics. Part A Examination. Statistics: Theory Paper
McGill University Faculty of Science Department of Mathematics and Statistics Part A Examination Statistics: Theory Paper Date: 10th May 2015 Instructions Time: 1pm-5pm Answer only two questions from Section
More informationWe introduce methods that are useful in:
Instructor: Shengyu Zhang Content Derived Distributions Covariance and Correlation Conditional Expectation and Variance Revisited Transforms Sum of a Random Number of Independent Random Variables more
More informationReview 1: STAT Mark Carpenter, Ph.D. Professor of Statistics Department of Mathematics and Statistics. August 25, 2015
Review : STAT 36 Mark Carpenter, Ph.D. Professor of Statistics Department of Mathematics and Statistics August 25, 25 Support of a Random Variable The support of a random variable, which is usually denoted
More informationDirection: This test is worth 250 points and each problem worth points. DO ANY SIX
Term Test 3 December 5, 2003 Name Math 52 Student Number Direction: This test is worth 250 points and each problem worth 4 points DO ANY SIX PROBLEMS You are required to complete this test within 50 minutes
More informationRecitation 2: Probability
Recitation 2: Probability Colin White, Kenny Marino January 23, 2018 Outline Facts about sets Definitions and facts about probability Random Variables and Joint Distributions Characteristics of distributions
More informationAn Introduction to Bayesian Linear Regression
An Introduction to Bayesian Linear Regression APPM 5720: Bayesian Computation Fall 2018 A SIMPLE LINEAR MODEL Suppose that we observe explanatory variables x 1, x 2,..., x n and dependent variables y 1,
More informationMIT Spring 2016
Dr. Kempthorne Spring 2016 1 Outline Building 1 Building 2 Definition Building Let X be a random variable/vector with sample space X R q and probability model P θ. The class of probability models P = {P
More informationAppendix A : Introduction to Probability and stochastic processes
A-1 Mathematical methods in communication July 5th, 2009 Appendix A : Introduction to Probability and stochastic processes Lecturer: Haim Permuter Scribe: Shai Shapira and Uri Livnat The probability of
More informationTest Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics
Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics The candidates for the research course in Statistics will have to take two shortanswer type tests
More information1 Random variables and distributions
Random variables and distributions In this chapter we consider real valued functions, called random variables, defined on the sample space. X : S R X The set of possible values of X is denoted by the set
More informationContents 1. Contents
Contents 1 Contents 6 Distributions of Functions of Random Variables 2 6.1 Transformation of Discrete r.v.s............. 3 6.2 Method of Distribution Functions............. 6 6.3 Method of Transformations................
More informationPOISSON PROCESSES 1. THE LAW OF SMALL NUMBERS
POISSON PROCESSES 1. THE LAW OF SMALL NUMBERS 1.1. The Rutherford-Chadwick-Ellis Experiment. About 90 years ago Ernest Rutherford and his collaborators at the Cavendish Laboratory in Cambridge conducted
More information{ p if x = 1 1 p if x = 0
Discrete random variables Probability mass function Given a discrete random variable X taking values in X = {v 1,..., v m }, its probability mass function P : X [0, 1] is defined as: P (v i ) = Pr[X =
More informationMATH 4211/6211 Optimization Constrained Optimization
MATH 4211/6211 Optimization Constrained Optimization Xiaojing Ye Department of Mathematics & Statistics Georgia State University Xiaojing Ye, Math & Stat, Georgia State University 0 Constrained optimization
More informationFebruary 26, 2017 COMPLETENESS AND THE LEHMANN-SCHEFFE THEOREM
February 26, 2017 COMPLETENESS AND THE LEHMANN-SCHEFFE THEOREM Abstract. The Rao-Blacwell theorem told us how to improve an estimator. We will discuss conditions on when the Rao-Blacwellization of an estimator
More informationBEST TESTS. Abstract. We will discuss the Neymann-Pearson theorem and certain best test where the power function is optimized.
BEST TESTS Abstract. We will discuss the Neymann-Pearson theorem and certain best test where the power function is optimized. 1. Most powerful test Let {f θ } θ Θ be a family of pdfs. We will consider
More informationThings to remember when learning probability distributions:
SPECIAL DISTRIBUTIONS Some distributions are special because they are useful They include: Poisson, exponential, Normal (Gaussian), Gamma, geometric, negative binomial, Binomial and hypergeometric distributions
More informationTransformations from R m to R n.
Transformations from R m to R n 1 Differentiablity First of all because of an unfortunate combination of traditions (the fact that we read from left to right and the way we define matrix multiplication
More informationProblems ( ) 1 exp. 2. n! e λ and
Problems The expressions for the probability mass function of the Poisson(λ) distribution, and the density function of the Normal distribution with mean µ and variance σ 2, may be useful: ( ) 1 exp. 2πσ
More information1 Presessional Probability
1 Presessional Probability Probability theory is essential for the development of mathematical models in finance, because of the randomness nature of price fluctuations in the markets. This presessional
More informationProbability Theory and Statistics. Peter Jochumzen
Probability Theory and Statistics Peter Jochumzen April 18, 2016 Contents 1 Probability Theory And Statistics 3 1.1 Experiment, Outcome and Event................................ 3 1.2 Probability............................................
More informationProbabilistic Graphical Models
Parameter Estimation December 14, 2015 Overview 1 Motivation 2 3 4 What did we have so far? 1 Representations: how do we model the problem? (directed/undirected). 2 Inference: given a model and partially
More informationToday s Outline. Biostatistics Statistical Inference Lecture 01 Introduction to BIOSTAT602 Principles of Data Reduction
Today s Outline Biostatistics 602 - Statistical Inference Lecture 01 Introduction to Principles of Hyun Min Kang Course Overview of January 10th, 2013 Hyun Min Kang Biostatistics 602 - Lecture 01 January
More informationProbability Theory for Machine Learning. Chris Cremer September 2015
Probability Theory for Machine Learning Chris Cremer September 2015 Outline Motivation Probability Definitions and Rules Probability Distributions MLE for Gaussian Parameter Estimation MLE and Least Squares
More informationTheory of Statistical Tests
Ch 9. Theory of Statistical Tests 9.1 Certain Best Tests How to construct good testing. For simple hypothesis H 0 : θ = θ, H 1 : θ = θ, Page 1 of 100 where Θ = {θ, θ } 1. Define the best test for H 0 H
More information