STAT 5380 Advanced Mathematical Statistics I (Lecture Notes, Spring 2018). Alex Trindade, Department of Mathematics & Statistics, Texas Tech University


STAT 5380 Advanced Mathematical Statistics I (Lecture Notes, Spring 2018). Alex Trindade, Department of Mathematics & Statistics, Texas Tech University. Based primarily on TPE: Theory of Point Estimation, 2nd edition, by E.L. Lehmann and George Casella, Springer (1998).


Contents

Chapter 1. Preliminaries
  1.1. Conditional Expectation
  1.2. Sufficiency
  1.3. Exponential Families
  1.4. Convex Loss Function

Chapter 2. Unbiasedness
  2.1. UMVU estimators
  2.2. Non-parametric families
  2.3. The Information Inequality
  2.4. Multiparameter Case

Chapter 3. Equivariance
  3.1. Equivariance for a Location Family
  3.2. The General Equivariant Framework
  3.3. Location-Scale Families

Chapter 4. Average-Risk Optimality
  4.1. Bayes Estimation
  4.2. Minimax Estimation
  4.3. Minimaxity and Admissibility in Exponential Families
  4.4. Shrinkage Estimators and Big Data

Chapter 5. Large Sample Theory
  5.1. Convergence in Probability and Order in Probability
  5.2. Convergence in Distribution
  5.3. Asymptotic Comparisons (Pitman Efficiency)
  5.4. Comparison of Sample Mean, Median and Trimmed Mean (M-estimation)

Chapter 6. Maximum Likelihood Estimation
  6.1. Consistency
  6.2. Asymptotic Normality of the MLE
  6.3. Asymptotic Optimality of the MLE


CHAPTER 1

Preliminaries

1.1. Conditional Expectation

Definition. Let (X, A, P) be a probability space. If X ∈ L^1(A, P) and G is a sub-σ-field of A, then E(X | G) is a random variable such that
(i) E(X | G) ∈ G (i.e. it is G-measurable);
(ii) E(I_G X) = E(I_G E(X | G)) for all G ∈ G.

Construction. For X ≥ 0, µ(G) = E(I_G X) is a measure on G and P(G) = 0 ⇒ µ(G) = 0, so by the Radon-Nikodym theorem there exists a G-measurable function E(X | G) such that µ(G) = ∫_G E(X | G) dP, i.e. (ii) is satisfied. This shows the existence of E(X^+ | G) and E(X^- | G), and for general X we define E(X | G) = E(X^+ | G) − E(X^- | G).

Remark 1.1.1. (ii) generalizes to E(Y X) = E(Y E(X | G)) for all Y ∈ G such that E|Y X| < ∞. The conditional probability of A given G is defined for all A ∈ A as P(A | G) = E(I_A | G).

Remark 1.1.2. If X ∈ L^2(A, P), then E(X | G) is the orthogonal projection in L^2(A, P) of X onto the closed linear subspace L^2(G, P) of L^2(A, P), since (i) E(X | G) ∈ L^2(G, P) and (ii) E(Y (X − E(X | G))) = 0 for all Y ∈ L^2(G, P).

Conditioning on a Statistic. Let X be a r.v. defined on (X, A, P) with E|X| < ∞ and let T be a measurable function (not necessarily real-valued) from (X, A) into (T, F):

(X, A, P) --T--> (T, F, P^T).

Such a T is called a statistic. The σ-field of subsets of X induced by T is

σ(T) = {T^{-1}S : S ∈ F} = T^{-1}F.

Definition 1.1.3. E(X | T) ≡ E(X | σ(T)).

Recall that a real-valued function f on X is σ(T)-measurable ⇔ f = g ∘ T for some F-measurable g on T, i.e. f(x) = g(T(x)):

X --T--> T --g--> R.

This implies that E(X | T) is expressible as E(X | T) = h(T) for some F-measurable function h on T which is unique a.e. P^T.

Definition 1.1.4. E(X | t) ≡ h(t).

Example 1.1.5. Suppose (X, T) has probability density p(x, t) w.r.t. Lebesgue measure on R^2 and E|X| < ∞. Then E(X | σ(T)) = h(T), where

h(t) = E(X | T = t) = ( ∫ x p(x, t) dx / ∫ p(x, t) dx ) I_{p_T(t) > 0}(t),   a.s. P^T.

Proof. (i) The right-hand side is Borel measurable in t (by Fubini). (ii) G ∈ σ(T) ⇒ G = T^{-1}F for some F ∈ F ⇒ I_G = I_F(T), and

E(I_G X) = ∫ I_G X dP = ∫∫ x I_F(t) p(x, t) dx dt = ∫ I_F(t) h(t) p_T(t) dt = E[I_F(T) h(T)] = E[I_G h(T)],

which is the defining property E(I_G E(X | σ(T))) = E(I_G X).

Properties of Conditional Expectation. If T is a statistic, X is the identity function on X and f_n, f, g are integrable, then
(i) E[a f(X) + b g(X) | T] = a E[f(X) | T] + b E[g(X) | T] a.s.
(ii) a ≤ f(x) ≤ b a.s. ⇒ a ≤ E[f(X) | T] ≤ b a.s.
(iii) |f_n| ≤ g, f_n(x) → f(x) a.s. ⇒ E[f_n(X) | T] → E[f(X) | T] a.s.
(iv) E[E(f(X) | T)] = E f(X).
(v) If E|h(T) f(X)| < ∞, then E[h(T) f(X) | T] = h(T) E[f(X) | T] a.s.
(vi) If G_1 and G_2 are sub-σ-fields of A with G_1 ⊃ G_2, then E[E(X | G_1) | G_2] = E(X | G_2).

1.2. Sufficiency

Set-up:
X: random observable quantity (the identity function on (X, A, P));
X: the sample space, i.e. the set of possible values of X;
A: σ-algebra of subsets of X;
P = {P_θ, θ ∈ Ω}: a family of probability measures on A (the distributions of X);
T: X → T an A/F-measurable function; T(X) is called a statistic:

(X, A, P) --X--> (X, A, P) --T--> (T, F, P^T).

We adopt this notation because sometimes we wish to talk about T(X(·)), the random variable, and sometimes about T(X(x)) = T(x), a particular element of T. We shall also use the notation P(A | T(x)) for P(A | T = T(x)) and P(A | T) for the random variable P(A | T(·)) on X.
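Before moving on to sufficiency, here is a quick numerical illustration of Example 1.1.5 and property (iv) above. This is an added sketch (not part of the notes); the standard bivariate normal with correlation rho is simply an assumed example for which the exact answer E(X | T = t) = rho·t is known.

```python
# Added sketch: check Example 1.1.5 and property (iv) of conditional expectation
# for an assumed example, the standard bivariate normal with correlation rho,
# where the exact conditional mean is E(X | T = t) = rho * t.
import numpy as np
from scipy.stats import multivariate_normal

rho = 0.6
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
grid = np.linspace(-8.0, 8.0, 4001)
dx = grid[1] - grid[0]

def h(t):
    """E(X | T = t) as the ratio of integrals in Example 1.1.5."""
    dens = joint.pdf(np.column_stack([grid, np.full_like(grid, t)]))
    return np.sum(grid * dens) * dx / (np.sum(dens) * dx)

for t in (-1.0, 0.0, 2.0):
    print(f"t = {t:+.1f}   h(t) = {h(t):+.4f}   rho*t = {rho * t:+.4f}")

# Property (iv): E[E(X | T)] = E(X); both averages below should be close to 0.
rng = np.random.default_rng(0)
xt = joint.rvs(size=2000, random_state=rng)
print(np.mean([h(t) for t in xt[:, 1]]), np.mean(xt[:, 0]))
```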

Definition 1.2.1. The statistic T is sufficient for θ (or for P) iff the conditional distribution of X given T = t is independent of θ for all t, i.e. there exists an F-measurable P(A | T = ·) such that P(A | T = t) = P_θ(A | T = t) a.s. P_θ^T for all A ∈ A and all θ ∈ Ω.

Example 1.2.2. X = (X_1, ..., X_n) iid with pdf f_θ(x) w.r.t. dx, so that

P_θ(dx_1, ..., dx_n) = f_θ(x_1) ··· f_θ(x_n) dx_1 ··· dx_n,

and let T(X) = (X_(1), ..., X_(n)), where X_(i) is the i-th order statistic. The probability mass function of X given T = t is

p_θ(x | t) = (1/n!) δ_{t_1}(x_(1)) ··· δ_{t_n}(x_(n)),

i.e. it assigns point mass 1/n! to each x such that x_(1) = t_1, ..., x_(n) = t_n. This is independent of θ, indicating that T contains all the information about θ contained in the sample.

The Factorization Criterion

Definition 1.2.3. A family of probability measures P = {P_θ : θ ∈ Ω} is equivalent to a p.m. λ if λ(A) = 0 ⇔ P_θ(A) = 0 ∀ θ ∈ Ω. We also say that P is dominated by a σ-finite measure µ on (X, A) if P_θ ≪ µ for all θ ∈ Ω. It is clear that equivalence to λ implies domination by λ.

Theorem 1.2.4. Let P be dominated by a p.m. λ, where λ = Σ_{i=0}^∞ c_i P_{θ_i} (c_i ≥ 0, Σ c_i = 1). Then the statistic T (with range (T, F)) is sufficient for P ⇔ there exists an F-measurable function g_θ(·) such that dP_θ(x) = g_θ(T(x)) dλ(x) ∀ θ ∈ Ω.

Proof. (⇒) Suppose T is sufficient for P. Then P_θ(A | T(x)) = P(A | T(x)) ∀ θ. Throughout this part of the proof X will denote the indicator function of a subset of X. The preceding equality then implies that

E_θ(X | T) = E(X | T)   ∀ X ∈ A, ∀ θ.

8 8. PRELIMINARIES Hence for all θ Ω, X A, G σ(t ), we have E θ (I G E(X T )) = E θ (E θ (I G X T )) = E θ (I G X). Set θ = θ i, multiply by c i and sum over i = 0,, 2,..., to get E λ (I G E(X T )) = E λ (I G X) X A, G σ(t ). This implies that E(X T ) = E λ (X T ) X A, and hence E θ (X T ) = E(X T ) = E λ (X T ) X A, θ. Now define g θ (T ( )) to be the Radon-Nikodym derivative of P θ with respect to λ, with both regarded as measures on σ(t ). We know this exists since λ dominates every P θ. We also know it is σ(t ) measurable, so it can be written in the form g θ (T ( )), and we know that E θ (X) = E λ (g θ (T )X) for all X σ(t ). We need to establish however that this last relation holds for all X A. We do this as follows. X A E θ (X) = E θ [E(X T )] = E λ [g θ (T )E(X T )] = E λ [E(g θ (T )X T )] = E λ [E λ (g θ (T )X T )] = E λ [g θ (T )X]. This shows that g θ (T (x)) = dp θ dλ (x) when P θ and λ are regarded as measures on A. ( ) Suppose that for each θ, dp θ (x) = g dλ θ(t (x)) for some g θ. We shall then show that the conditional probability P λ (A t) is a version of P θ (A t) θ. A A, G σ(t ) I A dp θ = P θ (A T ) dp θ G G = P θ (A T )g θ (T ) dλ and G I A dp θ = = = G G G G I A g θ (T ) dλ E λ [I A g θ (T ) T ] dλ E λ [I A T ]g θ (T ) dλ P θ (A T )g θ (T ) = E λ (I A T )g θ (T ) a.s. λ and hence a.s. P θ θ. Also g θ (T ) 0, a.s. P θ, since dp θ = g θ (T ) dλ. Hence P θ (A T ) = E λ (I A T ) = P λ (A T ) a.s. P θ and the R.S. is independent of θ.

9 .2. SUFFICIENCY 9 Theorem.2.5. (Theorem A.4.2 in appendix of TSH ) If P = {P θ, θ Ω} is dominated by a σ-finite measure µ, then it is equivalent to λ = i=0 c ip θi for some countable subcollection P θi P, i = 0,, 2,..., with c i 0 and c i =. Proof. µ is σ finite, A n A with A, A 2,... disjoint, and A i = X such that 0 < µ(a i ) <, i =, 2,.... Set µ (A) = i= µ(a A i ) 2 i µ(a i ) Then, µ is a probability measure equivalent to µ. Hence we can assume without loss of generality that the dominating measure µ is a probability measure Let and set Then f θ = dp θ dµ S θ = {x: f θ (x) > 0} (.2.) P θ (A) = P θ (A S θ ) = 0 iff µ(a S θ ) = 0. (Since P θ µ and since µ(a S θ ) > 0, f θ > 0 on A S θ P θ (A S θ ) > 0.) A set A A is a kernel if A S θ for some θ; a finite or countable union of kernels is called a chain. Set α = sup µ(c) chains C Then α = µ(c) for some chain C = n=a n, A n S θn. (since {C n } such that µ(c n ) α and for this sequence µ( C n ) = α.) It follows from the following Lemma that P is dominated by λ( ) = n= P 2 n θn ( ). Since it is obvious that λ(a) = 0 P θn (A) = 0 n P θ (A) = 0 θ (by the Lemma), P θ (A) = 0 θ λ(a) = 0 Hence P is equivalent to λ( ) = n= P 2 n θn ( ). Lemma.2.6. If {θ n } is the sequence used in the construction of C, then {P θ, θ Ω} is dominated by {P θn, n =, 2,...}, i.e. P θn (A) = 0 n P θ (A) = 0 θ TSH stands for Testing Statistical Hypotheses, Lehmann & Romano, 3rd ed., Springer, 2005.

10 0. PRELIMINARIES Proof. P θn (A) = 0 n µ(a S θn ) = 0 n (by.2.) (C S θn ) µ(a C) = 0 (Pθ µ) P θ (A C) = 0 θ If P θ (A) > 0 for some θ then, since P θ (A) = P θ (A C) + P θ (A C c ), P θ (A C c ) = P θ (A C c S θ ) > 0 A C c S θ is a kernel disjoint from C C (A C c S θ ) is a chain with µ > α, (P θ (A) > 0 µ(a) > 0) contradicting the definition of α. Hence, P θ (A) = 0 θ. Theorem.2.7. The Factorization Theorem Let µ be a σ-finite measure which dominates P = {P θ : θ Ω} and let p θ = dp θ dµ. Then the statistic T is sufficient for P if and only if there exists a non negative F- measurable function g θ : T R and an A-measurable function h : X R such that (.2.2) p θ (x) = g θ (T (x)) h (x) a.e. µ. Proof. By theorem.2.5, P is equivalent to λ = i c i P θi, where c i 0, i c i =. If T is sufficient for P, p θ (x) = dp θ (x) dµ (x) = dp θ (x) dλ (x) dλ (x) dµ (x) = g θ (T (x)) h (x) by theorem.2.4. On the other hand, if equation (.2.2) holds, (.2.3) dλ (x) = c i dp θi (x) = c i p θi (x) dµ(x) = c i g θi (T (x)) h (x) dµ (x) i= = K (T (x)) h (x) dµ (x).

Thus,

dP_θ(x) = p_θ(x) dµ(x)                                 (by the definition of p_θ)
        = g_θ(T(x)) h(x) dµ(x)                          (by (1.2.2))
        = [ g_θ(T(x)) h(x) / (K(T(x)) h(x)) ] dλ(x)     (by (1.2.3))
        = g̃_θ(T(x)) dλ(x),

where g̃_θ(T(x)) := g_θ(T(x))/K(T(x)), defined to be 0 if K(T(x)) = 0. Hence T is sufficient for P by Theorem 1.2.4.

Remark 1.2.8. If f_θ(x) is the density of X with respect to Lebesgue measure, then T is sufficient for P iff f_θ(x) = g_θ(T(x)) h(x), where h is independent of θ.

Example 1.2.9. Let X_1, ..., X_n be iid N(µ, σ^2), µ ∈ R, σ > 0, and write X = (X_1, ..., X_n). A σ-finite dominating measure on B^n is Lebesgue measure, with

p_{µ,σ^2}(x) = (σ√(2π))^{-n} exp{ −(1/(2σ^2)) Σ x_i^2 + (µ/σ^2) Σ x_i − nµ^2/(2σ^2) } = g_{µ,σ^2}( Σ x_i, Σ x_i^2 ).

Therefore T(X) = (Σ X_i, Σ X_i^2) is sufficient for P = {P_{µ,σ^2}}.

Remark 1.2.10. T̃(X) = (X̄, S^2) is also sufficient for P = {P_{µ,σ^2}}, since g_{µ,σ^2}(Σ x_i, Σ x_i^2) = g̃_{µ,σ^2}(x̄, S^2). T and T̃ are equivalent in the following sense.

Definition 1.2.11. Two statistics T and S are equivalent if they induce the same σ-algebra up to P-null sets, i.e. if there exist a P-null set N and functions f and g such that T(x) = f(S(x)) and S(x) = g(T(x)) for all x ∈ N^c.

Example 1.2.12. Let X_1, ..., X_n be iid U(0, θ), θ > 0, and X = (X_1, ..., X_n). Then

p_θ(x) = θ^{-n} Π_{i=1}^n I_[0,∞)(x_i) I_(−∞,θ](x_i) = θ^{-n} I_[0,∞)(x_(1)) I_(−∞,θ](x_(n)) = g_θ(x_(n)) h(x),

so T(X) = X_(n) is sufficient for θ.
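Example 1.2.9 above says that the N(µ, σ^2) likelihood depends on the data only through T = (Σ x_i, Σ x_i^2). A small numerical check (an added sketch, not from the notes): two different samples constructed to share the same value of T have identical log-likelihoods at every (µ, σ).

```python
# Added sketch for Example 1.2.9: two different samples with the same
# sufficient statistic T = (sum x_i, sum x_i^2) have identical normal
# log-likelihoods at every (mu, sigma).
import numpy as np
from scipy.stats import norm

x = np.array([0.0, 1.0, 2.0])
# Build a second, different sample with the same sum and sum of squares:
a = 0.5
s, ss = x.sum() - a, (x ** 2).sum() - a ** 2      # required sum / sum of squares of (b, c)
b, c = np.roots([1.0, -s, (s ** 2 - ss) / 2.0]).real  # real roots of t^2 - s*t + bc
y = np.array([a, b, c])

print("T(x) =", (x.sum(), (x ** 2).sum()), "  T(y) =", (y.sum(), (y ** 2).sum()))

def loglik(sample, mu, sigma):
    return norm.logpdf(sample, loc=mu, scale=sigma).sum()

for mu, sigma in [(0.0, 1.0), (1.0, 2.0), (-0.5, 0.7)]:
    print(mu, sigma, loglik(x, mu, sigma), loglik(y, mu, sigma))  # equal pairs
```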

Example 1.2.13. X_1, ..., X_n iid N(0, σ^2), Ω = {σ^2 : σ^2 > 0}. Define

T_1(X) = (X_1, ..., X_n),
T_2(X) = (X_1^2, ..., X_n^2),
T_3(X) = (X_1^2 + ··· + X_m^2, X_{m+1}^2 + ··· + X_n^2),
T_4(X) = X_1^2 + ··· + X_n^2,

and note that

p_θ(x) = (σ√(2π))^{-n} exp( −(1/(2σ^2)) Σ_{i=1}^n x_i^2 ).

Each T_i(X) is sufficient. However σ(T_4) ⊂ σ(T_3) ⊂ σ(T_2) ⊂ σ(T_1) (since functions of T_4 are functions of T_3, functions of T_3 are functions of T_2, and functions of T_2 are functions of T_1).

Remark 1.2.14. If T is sufficient for θ and T = H(S), where S is some statistic, then S is also sufficient, since

p_θ(x) = g_θ(T(x)) h(x) = g_θ(H(S(x))) h(x).

Since σ(T) = S^{-1} H^{-1} B_T ⊂ S^{-1} B_S   ( (X, A) --S--> (S, B_S) --H--> (T, B_T) ), T provides a greater reduction of the data than S, strictly greater unless H is one-to-one, in which case S and T are equivalent.

Definition 1.2.15. T is a minimal sufficient statistic if for any sufficient statistic S there exists a measurable function H such that T = H(S) a.s. P.

Theorem 1.2.16. If P is dominated by a σ-finite measure µ, then the statistic U is sufficient iff for every fixed θ and θ_0 the ratio of the densities p_θ and p_{θ_0} with respect to µ, defined to be 1 when both densities are zero, satisfies

p_θ(x) / p_{θ_0}(x) = f_{θ,θ_0}(U(x))   a.s. P

for some measurable f_{θ,θ_0}.

Proof. HW problem (TPE Ch. 1, Problem 6.6).

Theorem 1.2.17. Let P be a finite family with densities {p_0, p_1, ..., p_k}, all having the same support (i.e. S = {x : p_i(x) > 0} is independent of i). Then

T(x) = ( p_1(x)/p_0(x), p_2(x)/p_0(x), ..., p_k(x)/p_0(x) )

is minimal sufficient. (This is also true for a countable collection of densities, with no change in the proof.)

Proof. First, T is sufficient by Theorem 1.2.16, since p_i(x)/p_j(x) is a function of T(x) for all i and j (the common support is needed here). If U is a sufficient statistic then, by Theorem 1.2.16, p_i(x)/p_0(x) is a function of U for each i ⇒ T is a function of U ⇒ T is minimal sufficient.

Remark 1.2.18. Theorem 1.2.17 extends to uncountable collections under further conditions.

Theorem 1.2.19. Let P be a family with common support and suppose P_0 ⊂ P. If T is minimal sufficient for P_0 and sufficient for P, then T is minimal sufficient for P.

Proof. U sufficient for P ⇒ U sufficient for P_0, by Definition 1.2.1. T minimal sufficient for P_0 ⇒ T(x) = H(U(x)) a.s. P_0. But since P has common support, T(x) = H(U(x)) a.s. P.

Remark 1.2.20. (1) Minimal sufficient statistics for uncountable families P can often be obtained by combining the above theorems. (2) Minimal sufficient statistics exist under weak assumptions (but not always). In particular they exist if (X, A) = (R^n, B^n) and P is dominated by a σ-finite measure.

Example 1.2.21. P_0: (X_1, ..., X_n) iid N(θ, 1), θ ∈ {θ_0, θ_1}. P: (X_1, ..., X_n) iid N(θ, 1), θ ∈ R.

p_{θ_1}(x) / p_{θ_0}(x) = exp{ −(1/2) [ Σ (x_i − θ_1)^2 − Σ (x_i − θ_0)^2 ] } = exp{ −(1/2) [ 2 Σ x_i (θ_0 − θ_1) + n θ_1^2 − n θ_0^2 ] }.

This is a (one-to-one) function of x̄, hence X̄ is minimal sufficient for P_0 by Theorem 1.2.17. Since X̄ is sufficient for P (by the factorization theorem), X̄ is minimal sufficient for P by Theorem 1.2.19.

Example 1.2.22. P: (X_1, ..., X_n) iid U(0, θ), θ > 0. Show that X_(n) is minimal sufficient. (This is part of Problem 1.6.6, for which you will need to use Problem 1.6.1.)

14 4. PRELIMINARIES Example Logistic P : (X,..., X n ) iid L(θ, ), θ R. P 0 : (X,..., X n ) iid L(θ, ), θ {0, θ,..., θ n }. p θ (x) = exp [ (x i θ)] n i= { + exp [ (x i θ)]} 2, where so T = (T (X),..., T n (X)) is minimal sufficient, T i (x) = p θ i (x) p 0 (x) = enθ i n j= ( + e x j ) 2 ( + e (x j θ i) ) 2. We will show that T (X) is equivalent to (X (),..., X (n) ), by showing that T (x) = T (y) x () = y (),, x (n) = y (n). Proof. ( ) Obvious from the expression for T i (x). ( ) Suppose that T i (x) = T i (y) for i =, 2,..., n, i.e. i.e. n j= n j= ( + e x j ) 2 ( + e (x j θ i) ) = n ( + e y j ) 2 2 ( + e (y j θ i), i =,..., n, ) 2 + u j ω + u j = n j= j= + v j ω + v j, ω = ω,..., ω n, where u j = e x j, v j = e y j and ω i = e θ i. Here we have two polynomials in ω of degree n which are equal for n + distinct values,, ω,..., ω n, of ω and hence for all ω. ω = 0 n ( + u j ) = j= n ( + u j ω) = j= n ( + v j ) j= n ( + v j ω) ω j= the zero sets of both these polynomials are the same x and y have the same order statistics. By theorem.2.7, the order statistics are therefore minimal sufficient for P 0. They are also sufficient for P, so by theorem.2.9, the order statistics are minimal sufficient for P. There is not much reduction possible here! This is fairly typical of location families, the normal, uniform and exponential distributions providing happy exceptions.

Ancillarity

Definition 1.2.24. A statistic V is said to be ancillary for P if the distribution P_θ^V of V does not depend on θ. It is called first-order ancillary if E_θ V is independent of θ.

Example 1.2.25. In Example 1.2.23, X_(2) − X_(1) is ancillary, since Y_1 = X_1 − θ, ..., Y_n = X_n − θ are iid P_0, and X_(2) − X_(1) = Y_(2) − Y_(1).

Example 1.2.26. P: (X_1, ..., X_n) iid N(θ, 1), θ ∈ R. S^2 = Σ (X_i − X̄)^2 is ancillary, since S^2 = Σ (Y_i − Ȳ)^2, where Y_i = X_i − θ, i = 1, 2, ..., are iid N(0, 1).

Remark 1.2.27. Ancillary statistics by themselves contain no information about θ; however, minimal sufficient statistics may contain ancillary components. For example, in 1.2.23, T = (X_(1), ..., X_(n)) is equivalent to T̃ = (X_(1), X_(2) − X_(1), ..., X_(n) − X_(1)), whose last (n − 1) components are ancillary. You can't drop them, as X_(1) is not even sufficient.

Complete Statistics

A sufficient statistic should bring about the best reduction of the data if it contains as little ancillary material as possible. This suggests requiring that no non-constant function of T be ancillary, or not even first-order ancillary, i.e. that

E_θ f(T) = c for all θ ∈ Ω ⇒ f(T) = c a.s. P,

or equivalently that

E_θ f(T) = 0 for all θ ∈ Ω ⇒ f(T) = 0 a.s. P.

Definition 1.2.28. A statistic T is complete if

(1.2.4)   E_θ f(T) = 0 for all θ ∈ Ω ⇒ f(T) = 0 a.s. P.

T is said to be boundedly complete if (1.2.4) holds for all bounded measurable functions f.

Since complete sufficient statistics are intended to give a good reduction of the data, it is not unreasonable to expect them to be minimal. We shall prove a slightly weaker result.

Theorem 1.2.29. Let U be a complete sufficient statistic. If there exists a minimal sufficient statistic, then U is minimal sufficient.

Proof. Let T be a minimal sufficient statistic and let ψ be a bounded measurable function. We will show that ψ(U) ∈ σ(T), i.e. E(ψ(U) | T) = ψ(U) a.s.

Now E(ψ(U) | T) = g(U) for some measurable g, since T is minimal and U is sufficient. Let h(U) = E(ψ(U) | T) − ψ(U); then E_θ h(U) = 0 ∀ θ, so h(U) = 0 a.s. P, since U is complete. Hence ψ(U) = E(ψ(U) | T) ∈ σ(T). Hence U-measurable bounded functions are T-measurable, i.e. σ(U) ⊂ σ(T), i.e. U is minimal sufficient.

Remark 1.2.30. (1) If P is dominated by a σ-finite measure and (X, A) = (R^n, B^n), the existence of a minimal sufficient statistic does not need to be assumed. (2) A minimal sufficient statistic is not necessarily complete. See the next example.

Example 1.2.31. P = {N(θ, θ^2), θ > 0},

p_θ(x) = (1/(θ√(2π))) e^{−(x−θ)^2/(2θ^2)} = (1/(θ√(2π))) e^{−(1/2)(x/θ − 1)^2}.

The single observation X is minimal sufficient but not complete, since E_θ[I_(0,∞)(X) − Φ(1)] = P_θ(X > 0) − Φ(1) = 0, however P_θ(I_(0,∞)(X) − Φ(1) = 0) = 0 ∀ θ.

Theorem 1.2.32 (Basu's theorem). If T is complete and sufficient for P, then any ancillary statistic is independent of T.

Proof. If S is ancillary, then P_θ(S ∈ B) = p_B, independent of θ. Sufficiency of T ⇒ P_θ(S ∈ B | T) = h(T), independent of θ. E_θ(h(T) − p_B) = 0 ∀ θ ⇒ h(T) = p_B a.s. P by completeness ⇒ S is independent of T.

1.3. Exponential Families

Definition 1.3.1. A family of probability measures {P_θ : θ ∈ Ω} is said to be an s-parameter exponential family if there exists a σ-finite measure µ such that

p_θ(x) = dP_θ/dµ (x) = exp( Σ_{i=1}^s η_i(θ) T_i(x) − B(θ) ) h(x),

where η_i, T_i and B are real-valued.

Remark 1.3.2. (1) The P_θ, θ ∈ Ω, are equivalent (since {x : p_θ(x) > 0} is independent of θ).

17 .3. EXPONENTIAL FAMILIES. 7 (2) The factorization theorem implies that T = (T,, T s ) is sufficient. (3) If we observe X,..., X n, iid with marginal distributions P θ then n j= T (X j) is sufficient for θ. Theorem.3.3. If {, η,..., η s } is LI, then T = (T,..., T s ) is minimal sufficient. (Linear independence of {, η,..., η s } means c η (θ) + + c s η s (θ) + d = 0 θ c = = c s = d = 0. Equivalently we can say that {η i } is affinely independent or AI since the set of points {(η (θ),..., η s (θ)), θ Ω} then lie in a proper affine subspace of R s.) (.3.) Proof. Fix θ 0 Ω and consider { s } dp θ (x) = p θ(x) dp θ0 p θ0 (x) = exp {B(θ 0) B(θ)} exp (η i (θ) η i (θ 0 )) T i (x). If {, η,..., η s } is LI then so is {, η η (θ 0 ),..., η s η s (θ 0 )}. Set S = {(η (θ) η (θ 0 ),..., η s (θ) η s (θ 0 )), θ Ω} R s. subspace of R s. Then span(s) is a linear If dim(span(s)) < s, then there exists a non-zero vector v = (v,..., v s ) s.t. v (η (θ) η (θ 0 )) + + v s (η s (θ) η s (θ 0 )) = 0 θ contradicting the linear independence of {, η i η i (θ 0 )}. Hence (.3.2) dim(span(s)) = s i.e. θ,..., θ s Ω s.t. {(η (θ i ) η (θ 0 ),, η s (θ i ) η s (θ 0 )), i =,, s} is LI. From.3., s j= (η j (θ i ) η j (θ 0 ))T j (x) = ln p θ i (x) p θ0 (x) + (B(θ i) B(θ 0 ))i =,..., s. Since the matrix [η j (θ i ) η j (θ 0 )] s i,j= is non-singular, T j (x) can be expressed uniquely in terms of ln p θ i (x) p θ0, i =,..., s. (x) But p θ i (x), i =,..., s is minimal sufficient for P p θ0 (x) 0 = {P θj, j = 0,,, s} by theorem.2.7. Hence T is minimal sufficient by theorem.2.9.
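As a concrete instance of Definition 1.3.1 (the Poisson family is my own assumed example here, not one worked in the notes): Poisson(λ) on the non-negative integers is a one-parameter exponential family with η = log λ, T(x) = x, B(θ) = λ and h(x) = 1/x!. A short numerical check:

```python
# Added sketch: the Poisson(lam) pmf written in exponential-family form,
#   p(x) = exp(eta * T(x) - A(eta)) * h(x),
# with eta = log(lam), T(x) = x, A(eta) = exp(eta) = lam, h(x) = 1/x!.
from math import exp, factorial, log
from scipy.stats import poisson

lam = 3.7
eta = log(lam)
A = exp(eta)          # A(eta) = e^eta = lam

for x in range(10):
    expfam = exp(eta * x - A) / factorial(x)
    print(x, expfam, poisson.pmf(x, lam))   # the two columns agree
```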

18 8. PRELIMINARIES Example.3.4. p θ (x) = θ 2π exp{ 2 θx2 + θx θ 2 }. η (θ) = 2 θ, η 2(θ) = θ, T (x) = (x 2, x) is sufficient but not minimal θ since rewriting the model as p θ (x) = 2π exp{ 2 θ(x )2 }, we see that T (x) = (x ) 2 is minimal sufficient. Remark.3.5. The exponential family can always be rewritten in such a way that the functions {T i } and {η i } are AI. If there exist constants c,..., c s, d, not all zero, such that c T (x) + + c s T s (x) = d a.s. P then one of the T i s can be expressed in terms of the others (or is constant). After reducing the number of functions T i as far as possible, the same can be done with their coefficients until the new functions {T i } and {η i } are AI. Definition.3.6. (Order of the exponential family.) If the functions {T i, i =,..., s} on X and {η i, i =,..., s} on Ω are both AI, then s is the order of the exponential family ( s ) p θ (x) = dp θ (x) = exp η i (θ) T i (x) B (θ) h (x). dµ Proposition.3.7. The order is well-defined. Proof. We shall show that s + = dim(v ) where V is the set of functions on X defined by V = span{, ln dp θ dp θ0 ( ), θ Ω} (independent of the dominating measure µ and the choice of {η i }, {T i }). ln dp θ dp θ0 (x) = s (η i (θ) η i (θ 0 ))T i (x) + B(θ 0 ) B(θ) i= so that V span{, T i ( ), i =,..., s} dim(v ) s + On the other hand, since {, η i, i =,..., s} is LI, each T j (x) can be expressed as a linear combination of, ln dp θ i dp θ0 (x), i =,..., s, as in the proof of the previous theorem, span{, T i ( ), i =,..., s} V s + dim(v )

19 .3. EXPONENTIAL FAMILIES. 9 Definition.3.8. (Canonical Form) For any s-parameter exponential family (not necessarily of order s) we can view the vector η(θ) = (η (θ),..., η s (θ)) as the parameter rather than θ. Then the density with respect to µ can be rewritten as s p(x, η) = exp[ η i T i (x) A(η)]h(x), η η(ω). Since p(, η) is a probability density with respect to µ, (.3.3) e A(η) = e s η it i (x) h(x)dµ(x). i= Definition.3.9. (The Natural Parameter Set) This is a possibly larger set than {η(θ), θ Ω}. It is the set of all s-vectors for which, by suitable choice of A(η), p(, η) can be a probability density, i.e. N = {η = (η,, η s ) R s : e s η it i (x) h(x)dµ(x) < } Theorem.3.0. N is convex. Proof. Suppose α = (α,..., α s ) and β = (β,..., β s ) N. Then, e p s α it i (x)+( p) s β it i (x) h(x) dµ(x) [ e p [ α i T i (x) h(x) dµ(x)] + e β i T i (x) h(x) dµ(x)] p (Holder s Inequality) < Theorem.3.. T = (T,, T s ) has density p η (t) = exp (η t A (η)) relative to ν = µ T where d µ (x) = h (x) dµ (x). Proof. If f:t R is a bounded measurable function, Ef(T ) = f(t (x))e η T (x) e A(η) d µ(x) = f(t)e η t e A(η) d µ T (t) Definition.3.2. The family of densities p η (t) = exp (η t A (η)), η η(ω), n is called an s-dimensional or s-parameter standard exponential family. (Defined on R s, not X.)

20 20. PRELIMINARIES Theorem.3.3. Let {p η (x)} be the s-parameter exponential family, ( s ) p η (x) = exp η i (θ) T i (x) B (θ) h (x)), η η(ω), and suppose (.3.4) i= φ (x) e s η jt j (x) dµ (x) exists and is finite for some φ and all η j = a j + ib j such that a N (=natural parameter space). Then (i) φ (x) e s η jt j (x) dµ (x) is an analytic function of each η i on {η : R (η) int (N )} and (ii) the derivative of all orders with respect to the η i s of φ (x) e s η jt j (x) dµ (x) can be computed by differentiating under the integral sign. Proof. Let a 0 = (a 0,..., a 0 s) be in int(n ) and let η 0 = a 0 + ib 0. Then φ(x)e s 2 η jt j (x) = h (x) h 2 (x) + i(h 3 (x) h 4 (x)) where h and h 2 are the positive and negative parts of the real part and h 3 and h 4 are the positive and negative parts of the imaginary part. Then φ (x) e s η jt j (x) dµ (x) can be expressed as e η T (x) dµ (x) e η T (x) dµ 2 (x) + i e η T (x) dµ 3 (x) i e η T (x) dµ 4 (x), where dµ i (x) = h i (x) dµ(x), i =,..., 4. Hence it suffices to prove (i) and (ii) for ψ(η ) = e η T (x) dµ(x). Since a 0 int(n ), there exists δ > 0 s.t. ψ(η ) exists and is finite for all η with a a 0 < δ. Now consider the difference quotient ψ(η ) ψ(η 0 ( ) ) = e η0 η η 0 T (x) e(η η 0)T (x) µ(dx) with η η η 0 η 0 < δ/2. Observe that e zt = (zt) j j! zt j = e zt j! zt e zt ezt t e zt z

21 .3. EXPONENTIAL FAMILIES. 2 The integrand in (*) is therefore bounded in absolute value by T (x) e (a0 + δ 2 ) T(x), where a 0 = Re(η) 0 and T (x) e (a0 + δ 2 ) T(x) µ(dx) < since T e δ 4 T }{{} e(a0 + 3δ 4 )T }{{} if T > 0 T e (a0 + δ 2 ) T = bounded integrable {}}{{}}{ T e δ 4 T e (a0 + δ 4 )T if T < 0 (independent of η ). Letting η η 0 in (*) and using the dominated convergence theorem therefore gives (.3.5) φ (η) 0 = T (x)e η0 T(x) µ(dx), where the integral exists and is finite η 0 which is the first component of some η 0 for which Re(η 0 ) N. Applying the same argument to (.3.5) which we applied to (.3.4) existence of all derivatives (i) and (ii). Theorem.3.4. For an exponential family of order s in canonical form and η int (N ), where N is the natural parameter space, ( ) T (i) E η (T ) = A = A η η,, A η s, and (ii) Cov η (T ) = 2 A = η η T [ 2 A η i η j ] s i,j=. so Proof. From theorem.3. e A(η) = e η t ν(dt) = (i) A η i e A(η) = T i (x)e η T (x) h(x)µ(dx) whence E η T i = A (ii) η i. e η T (x) h(x)µ(dx) 2 A η i η j e A(η) + A A η i η j e A(η) = T i (x)t j (x)e η T (x) h(x)µ(dx) i.e. 2 A η i η j = E η (T i T j ) E η (T i )E η (T j ) = Cov η (T i, T j ) Higher order moments of T,, T s are frequently required, e.g. α r r s = E(T r Ts rs ) µ r r s = E[(T E(T )) r (T s E(T s )) rs ]

22 22. PRELIMINARIES etc. These can often be obtained readily from the MGF: M T (u,, u s ) := E(e u T + +u st s ) If M T exists in some neighborhood of 0 ( u 2 i < δ), then all the moments α r,,r s exist and are the coefficients in the power series expansion r u u rs s M T (u,, u s ) = α r,,r s r r,...,r! r s! s The cumulant generating function, CGF, is sometimes more convenient for calculations, especially in connection with sums of independent random vectors. The CGF is defined as K T (u,, u s ) := log M T (u,, u s ). If M T exists in a neighborhood of 0, then so does K T and u r u rs s K T (u,, u s ) = K r r s r! r s!, r,,r s=0 where the coefficients K r r s are called the cumulants of T. The moments and cumulants can be found from each other by formal comparison of the two series. Theorem.3.5. If X has the density s p η (x) = exp [ η i T i (x) A(η)]h(x) i= w.r.t some σ-finite measure µ, then for any η int(n ) the MGF and CGF of T exist in a neighborhood of 0 and Proof. HW problem. K T (u) = A(η + u) A(η) M T (u) = e A(η+u) A(η) Summary on Exponential Families. The family of probability measures {P θ } with densities relative to some σ-finite measure µ, (.3.6) p θ (x) = dp s θ dµ (x) = exp{ η i (θ)t i (x) B(θ)}h(x), θ Ω, is an s-parameter exponential family By redefining the functions T i ( ) and η i ( ) if necessary, we can always arrange for both sets of functions to be affinely independent. The number of summands in the exponent is then the order of the exponential family.

23 .3. EXPONENTIAL FAMILIES. 23 If {, η,..., η s } and {, T,..., T s } are both L.I., then the family is said to be minimal and s = dim(span{, log dp θ ( ), θ Ω}) dp θ0 = order of the exponential family Remark.3.6. Since (.3.6) is by definition a probability density w.r.t. µ for each θ Ω, we have { } exp ηi (θ)t i (x) B(θ) h(x)µ(dx) = { } exp B(θ) = exp ηi (θ)t i (x) h(x)µ(dx) which shows that the dependence of B on θ is through η(θ) = (η (θ),..., η s (θ)) only, i.e. B(θ) = A(η(θ)). Remark.3.7. The previous note implies that each member of the family (.3.6) is a member of the family. s (.3.7) π ξ (x) = exp{ ξ i T i (x) A(ξ)}h(x), ξ = (ξ,..., ξ s ) η(ω) (in fact p θ (x) = π η(θ) (x)). The family of densities {π ξ, ξ η(ω)} defined by (.3.7) is the canonical family associated with (.3.6). It is the same family parameterized by the natural parameter, ξ =vector of coefficients of T i (x), i =,..., s. Remark.3.8. Instead of restricting ξ to the set η(ω), it is natural to extend the family (.3.7) to allow all ξ R s for which we can choose a value of A(ξ) to make (.3.7) a probability density, i.e. for which (.3.8) exp{ ξ i T i (x)}h(x)µ(dx) < N = {ξ R s : (.3.8) holds} is the natural parameter space of the family (.3.7). Remark.3.9. N η(ω) since (.3.7) is by definition a family of probability densities. Definition (Full rank family) As with the original parameterization, we can always redefine ξ to ensure that {T,..., T s } is A.I. If η(ω) contains an s-dimensional rectangle and {T ( ),..., T s ( )} is A.I., then T is minimal sufficient and we say the family (.3.7) is of full rank. (A full rank family is clearly minimal.)

24 24. PRELIMINARIES Remark.3.2. Since N η(ω), full rank int(n ) φ and this is important in view of the consequence of theorem.3.3 that s e A(ξ) = exp( ξ i T i (x))h(x)µ(dx) i= is analytic in each ξ i on the set of s-dimensional complex vectors, ξ : Re(ξ) int(n ). (So derivatives of e A(ξ) w.r.t. ξ i, i =,..., s of all orders can be obtained by differentiation under the integral, yielding explicit expressions for the moments of T for all values of the canonical parameter vector ξ int(n ).) Example Multinomial X M(θ 0,..., θ s ; n) = (X 0,..., X s ), where X i = number of outcomes of type i in n independent trials where θ i, i = 0,..., s, is the probability of an outcome of type i on any one trial. Ω = {θ : θ 0 0,, θ s 0, θ θ s = } () Probability density with respect to counting measure on Z s+ + n! s p θ (x) = x 0! x s! θx 0 0 θs xs I [0, n] (x i )I {n} ( x i ) = exp{ i=0 s x i log θ i }h(x), θ Ω. i=0 This is an (s + )-parameter exponential family with T i (x) = x i, η i (θ) = log θ i. The vectors η(θ), θ Ω, are not confined to a proper affine subspace of R s, so T is minimal sufficient. (2) {T 0,..., T s } is not A.I. since T T s = n. Setting T 0 (x) = x 0 = n x x n gives p θ (x) = h(x) exp{n log θ 0 + s i= x i log θ i θ 0 } Redefining η(θ) = (log θ θ 0,, log θs θ 0 ), we now have an s-parameter representation in which {T,..., T s } is A.I., since the vectors (x,, x s ), x X, are subject only to the constraints x i 0 and s i= x i n. (3) Furthermore the new parameter vectors, η(θ) = (log θ θ 0,, log θs θ 0 ), θ Ω, are not confined to any proper affine subspace of R s, since for any x R s θ 0,..., θ s such that η(θ) = x and so η(ω) = R s. Hence T (x) = (x,..., x s ) is minimal sufficient for P and the order of the family is s. (4) The canonical representation of the family (2) is π ξ (x) = exp{ s ξ i x i A(ξ)}h(x), ξ η(ω) = {(log θ θ 0,, log θ s θ 0 ): θ Ω}

25 .3. EXPONENTIAL FAMILIES. 25 We know from remark.3.6 before that B(θ) = A(η(θ)) for some function A( ). Although it is not necessary, we can verify this directly in this example, since from the representation (2) we have and B(θ) = n log θ 0 θ 0 = θ θ s θ 0 = + θ θ θ s θ 0 = + e η (θ) + + e ηs(θ) A(ξ) = n log( + e ξ + + e ξs ) A(ξ) is of course also determined by e A(ξ) = B(θ) = n log( + e η (θ) + + e ηs(θ) ) exp{ s ξ i x i }h(x)dµ(x). (5) The natural parameter space in this case is N = R s, since we know that N η(ω) and η(ω) = R s by (3) above. Clearly N contains an s-dimensional rectangle and {T,..., T s } is A.I., hence {π ξ (x), ξ N } is of full rank. (6) Moments of T (X) = (X,..., X s ) Theorem.3.4 E ξ T i = A ξ R s ξ i ne ξ i = + e ξ + + e ξ s nθ i /θ 0 = + θ θ θs θ 0 = nθ i and Cov(T i, T j ) = 2 A ξ i ξ j { = ne ξ i (+ +e ξs ) (Moments exist ξ int(n ) = R s ) ne ξ ie ξ j = nθ (+e ξ + +e ξs ) 2 i θ j i j ne2ξ i = nθ (+ +e ξs ) 2 i ( θ i ) i = j
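The multinomial moments in (6) were obtained by differentiating A(ξ). The same computation can be done symbolically; the sketch below (my own added example, using the binomial(m, p) family, i.e. the single-cell case) verifies E_η(T) = A′(η) and Var_η(T) = A″(η) with η = log(p/(1 − p)) and A(η) = m log(1 + e^η).

```python
# Added sketch: symbolic check of E_eta(T) = A'(eta), Var_eta(T) = A''(eta)
# for the binomial(m, p) family in canonical form, eta = log(p/(1-p)),
# T(x) = x, A(eta) = m*log(1 + exp(eta)).
import sympy as sp

eta, m = sp.symbols("eta m", positive=True)
p = sp.exp(eta) / (1 + sp.exp(eta))          # p as a function of eta
A = m * sp.log(1 + sp.exp(eta))

mean = sp.diff(A, eta)                       # should equal m*p
var = sp.diff(A, eta, 2)                     # should equal m*p*(1-p)

print(sp.simplify(mean - m * p))             # prints 0
print(sp.simplify(var - m * p * (1 - p)))    # prints 0
```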

26 26. PRELIMINARIES Theorem (Sufficient condition for completeness of T ) If ( s ) π ξ (x) = exp ξ i T i (x) A (ξ) h (x), ξ η (Ω) i= is a minimal canonical representation of the exponential family P = {p θ : θ Ω} and η (Ω) contains an open subset of R s, then T = (T,..., T s )is complete for P. Proof. Suppose E ξ (f(t )) = 0 ξ η(ω). Then, (.3.9) E ξ f + (T ) = E ξ f (T ) ξ η(ω). Choose ξ 0 int(η(ω)) and r > 0 such that N(ξ 0, r) := {ξ : ξ ξ 0 < r} η(ω). Now define the probability measures, λ + (A) = f + e ξ0 t ν(dt) A f J + e ξ 0 t ν(dt), ν = µ T, d µ(x) = h(x)µ(dx), λ (A) = f e ξ0 t ν(dt) A f J e ξ 0 t ν(dt), where we have assumed that ν({t: f(t) 0}) > 0, since otherwise f = 0 a.s. P T and we are done. Observe now that (.3.0) e δ t λ + (dt) = e δ t λ (dt) δ R s with δ < r since by (.3.9) L.S. = = J J f + (t)e (ξ0+δ) t ν(dt)/ f (t)e (ξ0+δ) t ν(dt)/ J J f + (t)e ξ 0 t ν(dt) f (t)e ξ 0 t ν(dt) Now consider each side of (.3.0) as a function of the complex argument δ = δ 0 + iθ, θ R s. Then L(δ) = R(δ) δ = δ 0 + i θ with δ 0 < r, since (by Theorem.3.3 (i)) both sides are analytic in each component of δ on the set where Re(ξ 0 + δ) N and they are equal when δ is real. In particular, L(iθ) = e iθ t λ + (dt) = R(iθ) = e iθ t λ (dt)

Lemma 1.4.1. Let φ be a convex function on (−∞, ∞) which is bounded below and suppose that φ is not monotone. Then φ takes on its minimum value c, the set φ^{-1}({c}) is a closed interval, and it is a singleton when φ is strictly convex.

Proof. Since φ is convex and not monotone, lim_{x→±∞} φ(x) = ∞. Since φ is continuous, φ attains its minimum value c. φ^{-1}({c}) is closed by continuity and is an interval by convexity. The interval must have zero length if φ is strictly convex.

Theorem 1.4.2. Let ρ be a convex function defined on (−∞, ∞) and X a random variable such that φ(a) = E(ρ(X − a)) is finite for some a. If ρ is not monotone, then φ takes on its minimum value, the set on which it does so is a closed interval, and it is a singleton when ρ is strictly convex.

Proof. By the lemma, we only need to show that φ is convex and not monotone. Since ρ is convex and not monotone, lim_{t→±∞} ρ(t) = ∞, and since X − a → ∓∞ as a → ±∞, it follows that lim_{a→±∞} φ(a) = ∞, so φ is not monotone. The convexity comes from

φ(p a + (1 − p) b) = E ρ( p(X − a) + (1 − p)(X − b) ) ≤ E( p ρ(X − a) + (1 − p) ρ(X − b) ) = p φ(a) + (1 − p) φ(b).
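Theorem 1.4.2 is what guarantees that expected-loss problems such as min_a E|X − a| and min_a E(X − a)^2 have solutions (a median and the mean of X, respectively). A quick numerical illustration (an added sketch, not part of the notes), minimizing the empirical versions of φ(a) for a skewed sample:

```python
# Added sketch for Theorem 1.4.2: phi(a) = E rho(X - a) attains its minimum.
# For rho(t) = |t| a minimizer is a median of X; for rho(t) = t^2 it is E(X).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=3.0, size=100_000)     # a skewed sample

phi_abs = lambda a: np.mean(np.abs(x - a))            # convex, not monotone
phi_sq = lambda a: np.mean((x - a) ** 2)              # strictly convex

a_abs = minimize_scalar(phi_abs, bounds=(0.0, 30.0), method="bounded").x
a_sq = minimize_scalar(phi_sq, bounds=(0.0, 30.0), method="bounded").x

print("argmin E|X - a|  :", a_abs, "   sample median:", np.median(x))
print("argmin E(X - a)^2:", a_sq, "   sample mean  :", np.mean(x))
```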


CHAPTER 2

Unbiasedness

2.1. UMVU estimators

Notation. P = {P_θ, θ ∈ Ω} is a family of probability measures on A (the distributions of X). T : X → R is an A/B-measurable function and T (or T(X)) is called a statistic. g : Ω → R is a function on Ω whose value at θ is to be estimated:

(X, A, P_θ) --X--> (X, A, P_θ) --T--> (R, B, P_θ^T).

Definition 2.1.1. A statistic T (or T(X)) is called an unbiased estimator of g(θ) if E_θ(T(X)) = g(θ) for all θ ∈ Ω.

Objectives of point estimation. In order to specify what we mean by a good estimator of g(θ), we need to specify what we mean when we say that T(X) is close to g(θ). A fairly general way of defining this is to specify a loss function:

L(θ, d) = cost of concluding that g(θ) = d when the parameter value is θ,

with L(θ, d) ≥ 0 and L(θ, g(θ)) = 0. Since T(X) is a random variable, we measure the performance of T(X) for estimating g(θ) in terms of its expected (or long-term average) loss, known as the risk function,

R(θ, T) = E_θ L(θ, T(X)).

The choice of a loss function will depend on the problem and the purpose of the estimation. For many estimation problems, the conclusion is not particularly sensitive to the choice of loss function within a reasonable range of alternatives. Because of this, and especially because of its mathematical convenience, we often choose (and will do so in this chapter) the squared-error loss function

L(θ, d) = (g(θ) − d)^2,

with corresponding risk function

(2.1.1)   R(θ, T) = E_θ (T(X) − g(θ))^2.
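As a small illustration of the risk function just defined (an added sketch, not part of the notes): for X_1, ..., X_n iid Bernoulli(p), T(X) = X̄ and squared-error loss, R(p, X̄) = Var_p(X̄) = p(1 − p)/n, which a Monte Carlo estimate of E_p(X̄ − p)^2 reproduces.

```python
# Added sketch: the risk R(p, Xbar) = E_p (Xbar - p)^2 under squared-error loss
# in the Bernoulli(p) model, by Monte Carlo and by the exact formula p(1-p)/n.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 25, 200_000

for p in (0.1, 0.3, 0.5, 0.8):
    xbar = rng.binomial(n, p, size=reps) / n
    mc_risk = np.mean((xbar - p) ** 2)
    print(f"p = {p:.1f}   MC risk = {mc_risk:.6f}   p(1-p)/n = {p * (1 - p) / n:.6f}")
```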

Ideally we would like to choose T to minimize (2.1.1) uniformly in θ. Unfortunately this is impossible, since the estimator T defined by

(2.1.2)   T(x) = g(θ_0)   ∀ x ∈ X

(where θ_0 is some fixed parameter value in Ω) has the risk function

R(θ, T) = 0 if θ = θ_0,   and   R(θ, T) = (g(θ) − g(θ_0))^2 if θ ≠ θ_0.

An estimator which simultaneously minimized R(θ, T) for all θ ∈ Ω would necessarily have R(θ, T) = 0 ∀ θ ∈ Ω, and this is impossible except in trivial cases.

Why consider the class of unbiased estimators? There is nothing intrinsically good about unbiased estimators. The only criterion for goodness is that R(θ, T) should be small. The hope is that by restricting attention to a class of estimators which excludes (2.1.2), we may be able to minimize R(θ, T) uniformly in θ, and that the resulting estimator will give small values of R(θ, T). This programme is frequently successful if we attempt to minimize R(θ, T) with T restricted to the class of unbiased estimators of g(θ).

Definition 2.1.2. g(θ) is U-estimable if there exists an unbiased estimator of g(θ).

Example 2.1.3. X_1, ..., X_n iid Bernoulli(p), p ∈ (0, 1). g(p) = p is U-estimable, since E_p X̄ = p ∀ p ∈ (0, 1), while h(p) = 1/p is not U-estimable, since if

Σ_x T(x) p^{Σ x_i} (1 − p)^{n − Σ x_i} = 1/p   ∀ p ∈ (0, 1),

then as p → 0 the right-hand side → ∞ while the left-hand side → T(0). So T(0) = ∞, but this is not possible, since then E_p T(X) = ∞ ∀ p ∈ (0, 1).

Remark 2.1.4. (1/n) Σ_{i=1}^n X_i → p a.s. and n / Σ_{i=1}^n X_i → 1/p a.s. for all p ∈ (0, 1). Hence n / Σ X_i is a reasonable estimate of 1/p even though it is not unbiased.

Theorem 2.1.5. If T_0 is an unbiased estimator of g(θ), then the totality of unbiased estimators of g(θ) is given by {T_0 − U : E_θ U = 0 for all θ ∈ Ω}.

Proof. If T is unbiased for g(θ), then T = T_0 − (T_0 − T), where E_θ(T_0 − T) = 0 ∀ θ ∈ Ω. Conversely, if T = T_0 − U where E_θ U = 0 ∀ θ ∈ Ω, then E_θ T = E_θ T_0 = g(θ) ∀ θ ∈ Ω.

Remark 2.1.6. For squared-error loss, L(θ, d) = (d − g(θ))^2, the risk of an unbiased estimator T = T_0 − U is

R(θ, T) = E_θ[(T(X) − g(θ))^2] = Var_θ(T(X)) = Var_θ(T_0(X) − U) = E_θ[(T_0(X) − U)^2] − g(θ)^2,

31 2.. UMVU ESTIMATORS. 3 and hence the risk is minimized by minimizing E θ [(T 0 (X) U) 2 ] with respect to U, i.e. by taking any fixed unbiased estimator of g(θ) and finding the unbiased estimator of zero which minimizes E θ [(T 0 (X) U) 2 ]. Then if U does not depend on θ we shall have found a uniformly minimum risk estimator of g(θ), while if U depends on θ, there is no uniformly minimum risk estimator. Note that for unbiased estimators and squared error loss, the risk is the same as the variance of the estimator, so uniformly minimum risk unbiased is the same as uniformly minimum variance unbiased in this case. Example P (X = ) = p, P (X = k) = q 2 p k, k = 0,,..., where q = p. U is unbiased for 0 0 = T 0 (X) = I { } (X) is unbiased for p, 0 < p < T (X) = I {0} (X) is unbiased for q 2, U(k)P (X = k) = pu( ) + k= = U(0) + U(k)q 2 p k k=0 (U(k) 2U(k ) + U(k 2))p k k= U(k) = ku( ) = ka for some a (comparing coefficients of p k, k = 0,, 2,...) So an unbiased estimator of p with minimum risk (i.e. variance) is T 0 (X) a 0X where a 0 is the value of a which minimizes E p (T 0 (X) ax) 2 = P p (X = k)[t 0 (k) ak] 2 Similarly an unbiased estimator of q 2 with minimum risk (i.e. variance) is T (X) a X where a is the value of a which minimizes E p (T (X) ax) 2 = P p (X = k)[t (k) ak] 2 Some straightforward calculations give a p 0 = p + q 2 k2 p k and a = 0 Since a is independent of p, the estimator T (X) of q 2 is minimum variance unbiased for all p, i.e. UMVU. However a 0 does depend on p and so the estimator T0 (X) = T 0 (X) a 0X is only locally minimum variance unbiased at p. (We are using estimator in a generalized sense here since T0 (X) depends on p. We shall continue to use this terminology.) An UMVU estimator of p does not exist in this case. Definition Let V (θ) = inf T V ar θ (T ) where the inf is over all unbiased estimators of g(θ). If an unbiased estimator T of g(θ) satisfies V ar θ (T ) = V (θ) θ Ω it is called UMVU p

32 32 2. UNBIASEDNESS If V ar θ0 T = V (θ 0 ) for some θ 0 Ω T is called LMVU at θ 0 Remark Let H be the Hilbert space of functions on X which are square integrable with respect to P (i.e. with respect to every P θ P), and let U be the set of all unbiased estimators of 0. If T 0 is an unbiased estimator of g(θ) in H, then a LMVU estimator in H at θ 0 is T 0 P U (T 0 ), where P U denotes orthogonal projection on U in the inner product space L 2 (P θ0 ), i.e. P U (T 0 ) is the unique element of U such that T 0 P U (T 0 ) U (in L 2 (P θ0 )). T 0 P U (T 0 ) is LMVU since P U (T 0 ) = arg min U U E θ0 (T 0 U) 2. Notation We denote the set of all estimators T with E θ T 2 < for all θ Ω by and the set of all unbiased estimators of 0 in by U. Theorem 2... An unbiased estimator T of g (θ) is UMVU iff E θ (T U) = 0 for all U U and for all θ Ω. (i.e. Cov θ (T, U) = 0 since E θ U = 0 for all θ and E θ T = g (θ) for all θ Ω.) Proof. ( ) Suppose T is UMVU. For U U, let T = T + λu with λ real. Then T is unbiased and, by definition of T, V ar θ (T ) = V ar θ (T ) + λ 2 V ar θ (U) + 2λCov θ (T, U) V ar θ (T ) therefore, λ 2 V ar θ (U) + 2λCov θ (T, U) 0. Setting λ = Cov θ(t,u) V ar θ (U) to this inequality unless Cov θ (T, U) = 0. Hence Cov θ (T, U) = 0. gives a contradiction ( ) If E θ (T U) = 0 U U and θ Ω, let T be any other unbiased estimator. If V ar θ (T ) =, then V ar θ (T ) < V ar θ (T ), so suppose V ar θ (T ) <. Then T = T U, for some U which is unbiased for 0 (by Theorem 2..5). Hence U = T T E θ U 2 = E θ (T T ) 2 2E θ T 2 + 2E θ T 2 < U U V ar θ (T ) = V ar θ (T U) = V ar θ (T ) + V ar θ (U) 2Cov θ (T, U) V ar θ (T ) since Cov θ (T, U) = 0, T is UMVU.

33 2.. UMVU ESTIMATORS. 33 Unbiasedness and sufficiency. Suppose now that T is unbiased for g(θ) and S is sufficient for P = {P θ, θ Ω}. Consider Then (a) (b) T = E θ (T S) = E(T S) independent of θ E θ T = E θ E(T S) = E θ (T ) = g(θ) θ. V ar θ (T ) = E θ (T E(T S) + E(T S) g(θ)) 2 = E θ ((T E(T S)) 2 ) + V ar θ (T ) + 2E θ [(T E(T S))(E(T S) g(θ))] V ar θ (T ). On the second line we used the fact that T E(T S) is orthogonal to σ(s). The inequality on the third line is strict for all θ T = E(T S) a.s. P. Theorem If S is a complete sufficient statistic for P, then every U-estimable function g (θ) has one and only one unbiased estimator which is a function of S. Proof. T unbiased E(T S) is unbiased and a function of S T (S), T 2 (S) unbiased E θ (T (S) T 2 (S)) = 0 θ T (S) = T 2 (S) a.s. P (completeness) Theorem (Rao-Blackwell) Suppose S is a complete sufficient statistic for P. Then (i) If g (θ) is U-estimable, there exists an unbiased estimator which uniformly minimizes the risk for any loss function L (θ, d) which is convex in d. (ii) The UMV U in (i) is the unique unbiased estimator which is a function of S; it is the unique unbiased estimator with minimum risk provided the risk is finite and L is strictly convex in d. Proof. (i) L(θ, d) convex in d means L(θ, pd + ( p)d 2 ) pl(θ, d ) + ( p)l(θ, d 2 ), 0 < p <. Let T be any unbiased estimator of g(θ) and let T = E(T S), another unbiased estimator of g(θ). Then R(θ, T ) = E θ [L(θ, E(T S))] E θ [E θ (L(θ, T ) S)], by Jensen s inequality for conditional expectation, = E θ L(θ, T ) = R(θ, T ) θ.

34 34 2. UNBIASEDNESS If T 2 is any other unbiased estimator then T 2 = E(T 2 S) = T a.s. P by Theorem Hence starting from any unbiased estimator and conditioning on the CSS S gives a uniquely defined unbiased estimator which is UMVU and is the unique function of S which is unbiased for g(θ). (ii) The first statement was established at the end of the proof of (i). If T is UMVU then so is T = E(T S) as shown in (i); We will show that T is necessarily the uniquely determined unbiased function of S, by showing that T is a function of S a.s. P. The proof is by contradiction. Suppose that "T is a function of S a.s. P" is false. Then there exists θ and a set of positive P θ measure where But this implies that R(θ, T ) = E θ (L(θ, E(T S))) < E θ (E θ (L(θ, T ) S)) T := E(T S) T (Jensen s inequality is strict unless E(T S) = T a.s. P θ ) = R(θ, T ) contradicting the UMVU property of T. Theorem If P is an exponential family of full rank (i.e. {η,..., η s } and {T,..., T s } are A.I. and η (Ω) contains an open subset of R s ) then the Rao-Blackwell theorem applies to any U-estimable g (θ) with S = T. Proof. T is complete sufficient for P. [Some obvious U-estimable g(θ) s are E θ T i (X) = A ξ i ξ=η(θ), {θ : η(θ) int(n )}, where π ξ (x) = e ξ i T i (x) A(ξ) h(x) is the canonical representation of p θ (x).] Two methods for finding UMVU s Method. Search for a function δ(t ), where T is a CSS, such that E θ δ(t ) = g(θ), θ Ω.

Example. X_1, ..., X_n iid N(µ, σ^2), µ ∈ R, σ^2 > 0. T = (X̄, S^2) is a CSS. E_{µ,σ^2} X̄ = µ, so X̄ is UMVU for µ.

Method 2. Search for an unbiased δ(X) and a CSS T. Then S = E(δ(X) | T) is UMVU.

Example. X_1, ..., X_n iid U(0, θ), θ > 0; g(θ) = θ/2. δ(X) = X_1 is unbiased and X_(n) is a CSS, so S = E(X_1 | X_(n)) is UMVU. To compute S we note that, given X_(n) = x,

X_1 = x with probability 1/n,   and   X_1 ~ U(0, x) with probability (n − 1)/n,

so

S(x) = x (1/n) + ((n − 1)/n)(x/2) = ((n + 1)/(2n)) x.

Hence S(X_(n)) = ((n + 1)/(2n)) X_(n) is UMVU for θ/2, and ((n + 1)/n) X_(n) is UMVU for θ.

Remark. (a) Convexity of L(θ, ·) is crucial to the Rao-Blackwell theorem. (b) Large-sample theory tends to support the use of convex L(θ, ·). Heuristically, if X_1, ..., X_n are iid, then as n → ∞ the error in estimating g(θ) → 0 for any reasonable estimates (in some probabilistic sense). Thus only the behavior of L(θ, d) for d close to g(θ) is relevant for large samples. A Taylor expansion around d = g(θ) gives

L(θ, d) = a(θ) + b(θ)(d − g(θ)) + c(θ)(d − g(θ))^2 + remainder.

But L(θ, g(θ)) = 0 ⇒ a(θ) = 0, and L(θ, d) ≥ 0 ⇒ b(θ) = 0. Hence locally L(θ, d) ≈ c(θ)(d − g(θ))^2, a convex weighted squared-error loss function.
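A Monte Carlo check of the U(0, θ) example above (an added sketch, not part of the notes): both 2X̄ and ((n + 1)/n) X_(n) are unbiased for θ, but the estimator based on the complete sufficient statistic X_(n) has much smaller variance, as the theory predicts.

```python
# Added sketch for the U(0, theta) example: compare two unbiased estimators of
# theta; the UMVU estimator (n+1)/n * X_(n) has the smaller variance.
import numpy as np

rng = np.random.default_rng(42)
theta, n, reps = 5.0, 10, 200_000

x = rng.uniform(0.0, theta, size=(reps, n))
est_mean = 2.0 * x.mean(axis=1)               # unbiased, not a function of X_(n)
est_max = (n + 1) / n * x.max(axis=1)         # UMVU for theta

for name, est in [("2*Xbar", est_mean), ("(n+1)/n*X_(n)", est_max)]:
    print(f"{name:>14}: mean = {est.mean():.4f}   variance = {est.var():.5f}")

# Exact variances: theta^2/(3n) for 2*Xbar and theta^2/(n(n+2)) for the UMVU.
print(theta ** 2 / (3 * n), theta ** 2 / (n * (n + 2)))
```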

36 36 2. UNBIASEDNESS Example Observe X,..., X m, iid N(ξ, σ 2 ), and Y,..., Y n, iid N(η, τ 2 ), independent of X,..., X m. (i) For the 4-parameter family P = {P ξ,η,σ 2,τ 2}, ( X, Ȳ, S2 X, S2 Y ) is a CSS since the exponential family is of full rank. Hence X and SX 2 are UMVU for ξ and σ2 respectively and Ȳ and S2 Y are UMVU for η and τ 2. (ii) For the 3-parameter family P = {P ξ,η,σ 2,σ2}, ( X, Ȳ, SS) is a CSS, where SS := (m )SX 2 + (n )S2 Y. Hence X, Ȳ and SS are UMVU for ξ, η and σ2 m+n 2 respectively. (iii) For the 3-parameter family with ξ = η, σ 2 τ 2 (which arises when estimating a mean from 2 sets of readings with different accuracies), ( X, Ȳ, S2 X, S2 Y ) is minimal sufficient but not complete, since X Ȳ 0 a.s. P, but E θ( X Ȳ ) = 0 θ. To deal with Case (iii) we shall first show the following: If σ2 = r for some fixed τ 2 r, i.e. P = {P ξ,ξ,rτ 2,τ 2} then T = ( X i + r Y j, X 2 i + r Y 2 j ) is CSS Proof. p ξ,τ 2(x, y) = exp (2π) m+n 2 (rτ 2 ) m 2 (τ 2 ) n 2 { x 2 2rτ 2 i + rτ = exp { A(ξ, τ 2 ) } exp 2 mξ x mξ2 2rτ 2 { 2rτ 2 ( x 2 i + r y 2 i ) + y 2 2τ 2 i + } nξ2 nξȳ τ 2 2τ 2 ξ rτ ( x 2 i + r } y i ) Since T is a CSS for P and since T = X i +r Y i m+rn for ξ in P. is unbiased for ξ, it is UMVU T is also unbiased for ξ in P = {P ξ,ξ,σ 2,τ 2} V (ξ 0, σ 2 0, τ 2 0 ) V ar ξ0,σ 2 0,τ 2 0 (T ) = σ0τ 2 0 2, where mτ0 2 + nσ0 2 σ0 2 τ0 2 = r. (V is the smallest variance of all unbiased estimators of ξ for P evaluated at ξ 0, σ 2 0, τ 2 0.)

37 2.2. NON-PARAMETRIC FAMILIES 37 On the other hand, every T which is unbiased for ξ in P is also unbiased in P. Hence if T is unbiased for ξ in P, then V ar ξ0,σ0 2,τ 0 2(T ) V ar ξ 0,σ0 2,τ 0 2( Xi + r Y i ), where r = σ2 0, m + rn τ0 2 and the inequality continues to hold with the left-hand side replaced by V (ξ 0, σ0, 2 τ0 2 ). So V (ξ 0, σ0, 2 τ0 2 ) = σ2 0 τ 0 2 mτ0 2+nσ2 0 and the LMVU estimator at (ξ 0, σ0, 2 τ0 2 ) is Xi + σ2 0 τ 2 0 Yi m + σ2 0 τ 2 0 n Since this estimate depends on the ratio r = σ2 0, an UMVU for ξ does not exist τ0 2 in P. A natural estimate for ξ is Xi + S2 X SY ˆξ = 2 Yi. m + S2 X n SY 2 (See Graybill and Deal, Biometrics, 959, pp for its properties.) 2.2. Non-parametric families Consider X = (X,..., X n ), where X,..., X n are iid F, where F F, a family of distribution functions, and P is the corresponding product measure on (R n, B n ). For example, F 0 = df s with density relative to Lebesgue measure, F = df s with x F (dx) <, F 2 = df s with x 2 F (dx) <, etc.. The estimand is g : F R. For example, g(f ) = g(f ) = xf (dx) = µ F x 2 F (dx) g(f ) = F (a) g(f ) = F (p) Proposition If F 0 is defined as above, then (X (),..., X (n) ) is complete sufficient for F 0 (i.e. for the corresponding family of probability measures P).

Proof. We know that T(X) = (X_(1), ..., X_(n)) is sufficient for P. It remains to show (by problem 1.6.32, p. 72) that T is complete and sufficient for a family P_0 ⊂ P such that each member of P_0 has positive density on R^n. Choose P_0 to be the set of probability measures on B^n with densities relative to Lebesgue measure of the form

C(θ_1, ..., θ_n) exp{ θ_1 Σ x_i + θ_2 Σ_{i<j} x_i x_j + ··· + θ_n x_1 ··· x_n − Σ x_i^{2n} }.

This is an exponential family whose natural parameter set N contains an open set (N = R^n). So S(x) = (Σ x_i, Σ_{i<j} x_i x_j, ..., x_1 ··· x_n) is complete. But S is equivalent to T (consider the n-th degree polynomial whose zeroes are x_(1), ..., x_(n)), so T is complete for F_0.

Measurable functions of the order statistics. If T(x) := (x_(1), ..., x_(n)), then δ(X_1, ..., X_n) ∈ σ(T) ⇔ δ(X_1, ..., X_n) = δ(X_{π_1}, ..., X_{π_n}) for every permutation (π_1, ..., π_n) of (1, ..., n). Since T is a CSS for F_0, this enables us to identify UMVU estimators of estimands g for which they exist.

Example 2.2.2. g(F) = F(a). An obvious unbiased estimator of F(a) is

T_1(X) := (1/n) Σ_{i=1}^n I_{(−∞, a]}(X_i),

and T_1 ∈ σ(T), so T_1 is UMVU for F(a).

Example 2.2.3. g(F) = ∫ x dF, F ∈ F_0 ∩ F_2. Let T_2(x) = (1/n) Σ_{i=1}^n X_i. Then T_2 ∈ σ(T) and, since T is also complete for F_0 ∩ F_2, it is therefore UMVU for µ_F.

Example 2.2.4. g(F) = σ_F^2, F ∈ F_0 ∩ F_4. Let

T_3(x) = S(x)^2 = Σ (x_i − x̄)^2 / (n − 1) = Σ (x_(i) − x̄)^2 / (n − 1).

T_3 ∈ σ(T) and is unbiased for σ_F^2. Since T is complete for F_0 ∩ F_4, T_3 is UMVU for σ_F^2.

Remark 2.2.5. T complete for F does not in general imply that T is complete for a subfamily F_1 ⊂ F. In fact the reverse is true: completeness for F_1 implies completeness for F. However, the same argument used in the proof of Proposition 2.2.1 shows that T is complete for F_0 ∩ F_2 (used in Example 2.2.3) and that T is complete for F_0 ∩ F_4 (used in Example 2.2.4).
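A final simulation check of Examples 2.2.2–2.2.4 (an added sketch, not part of the notes; the exponential distribution below is just an assumed member of F_0): the empirical CDF at a point, the sample mean and S^2 with divisor n − 1 are unbiased for F(a), µ_F and σ_F^2.

```python
# Added sketch: Monte Carlo check that the estimators in Examples 2.2.2-2.2.4
# are unbiased for F(a), mu_F and sigma_F^2 under one particular F.
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(7)
n, reps, a = 15, 100_000, 3.0
dist = expon(scale=2.0)                       # an assumed F in F_0

x = dist.rvs(size=(reps, n), random_state=rng)
ecdf_a = (x <= a).mean(axis=1)                # T_1 from Example 2.2.2
xbar = x.mean(axis=1)                         # T_2 from Example 2.2.3
s2 = x.var(axis=1, ddof=1)                    # T_3 from Example 2.2.4

print("F(a):     ", ecdf_a.mean(), dist.cdf(a))
print("mean:     ", xbar.mean(), dist.mean())
print("variance: ", s2.mean(), dist.var())
```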


More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due ). Show that the open disk x 2 + y 2 < 1 is a countable union of planar elementary sets. Show that the closed disk x 2 + y 2 1 is a countable

More information

Solutions to Tutorial 11 (Week 12)

Solutions to Tutorial 11 (Week 12) THE UIVERSITY OF SYDEY SCHOOL OF MATHEMATICS AD STATISTICS Solutions to Tutorial 11 (Week 12) MATH3969: Measure Theory and Fourier Analysis (Advanced) Semester 2, 2017 Web Page: http://sydney.edu.au/science/maths/u/ug/sm/math3969/

More information

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due 9/5). Prove that every countable set A is measurable and µ(a) = 0. 2 (Bonus). Let A consist of points (x, y) such that either x or y is

More information

Probability and Measure

Probability and Measure Part II Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2018 84 Paper 4, Section II 26J Let (X, A) be a measurable space. Let T : X X be a measurable map, and µ a probability

More information

Lecture 12 November 3

Lecture 12 November 3 STATS 300A: Theory of Statistics Fall 2015 Lecture 12 November 3 Lecturer: Lester Mackey Scribe: Jae Hyuck Park, Christian Fong Warning: These notes may contain factual and/or typographic errors. 12.1

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Chapter 7 Maximum Likelihood Estimation 7. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function

More information

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Theorems Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

Review and continuation from last week Properties of MLEs

Review and continuation from last week Properties of MLEs Review and continuation from last week Properties of MLEs As we have mentioned, MLEs have a nice intuitive property, and as we have seen, they have a certain equivariance property. We will see later that

More information

THEOREMS, ETC., FOR MATH 515

THEOREMS, ETC., FOR MATH 515 THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every

More information

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries Chapter 1 Measure Spaces 1.1 Algebras and σ algebras of sets 1.1.1 Notation and preliminaries We shall denote by X a nonempty set, by P(X) the set of all parts (i.e., subsets) of X, and by the empty set.

More information

LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS. S. G. Bobkov and F. L. Nazarov. September 25, 2011

LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS. S. G. Bobkov and F. L. Nazarov. September 25, 2011 LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS S. G. Bobkov and F. L. Nazarov September 25, 20 Abstract We study large deviations of linear functionals on an isotropic

More information

Minimax lower bounds I

Minimax lower bounds I Minimax lower bounds I Kyoung Hee Kim Sungshin University 1 Preliminaries 2 General strategy 3 Le Cam, 1973 4 Assouad, 1983 5 Appendix Setting Family of probability measures {P θ : θ Θ} on a sigma field

More information

Dynkin (λ-) and π-systems; monotone classes of sets, and of functions with some examples of application (mainly of a probabilistic flavor)

Dynkin (λ-) and π-systems; monotone classes of sets, and of functions with some examples of application (mainly of a probabilistic flavor) Dynkin (λ-) and π-systems; monotone classes of sets, and of functions with some examples of application (mainly of a probabilistic flavor) Matija Vidmar February 7, 2018 1 Dynkin and π-systems Some basic

More information

MATHS 730 FC Lecture Notes March 5, Introduction

MATHS 730 FC Lecture Notes March 5, Introduction 1 INTRODUCTION MATHS 730 FC Lecture Notes March 5, 2014 1 Introduction Definition. If A, B are sets and there exists a bijection A B, they have the same cardinality, which we write as A, #A. If there exists

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Chapter 8 Maximum Likelihood Estimation 8. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function

More information

A D VA N C E D P R O B A B I L - I T Y

A D VA N C E D P R O B A B I L - I T Y A N D R E W T U L L O C H A D VA N C E D P R O B A B I L - I T Y T R I N I T Y C O L L E G E T H E U N I V E R S I T Y O F C A M B R I D G E Contents 1 Conditional Expectation 5 1.1 Discrete Case 6 1.2

More information

Applications of Ito s Formula

Applications of Ito s Formula CHAPTER 4 Applications of Ito s Formula In this chapter, we discuss several basic theorems in stochastic analysis. Their proofs are good examples of applications of Itô s formula. 1. Lévy s martingale

More information

STAT215: Solutions for Homework 2

STAT215: Solutions for Homework 2 STAT25: Solutions for Homework 2 Due: Wednesday, Feb 4. (0 pt) Suppose we take one observation, X, from the discrete distribution, x 2 0 2 Pr(X x θ) ( θ)/4 θ/2 /2 (3 θ)/2 θ/4, 0 θ Find an unbiased estimator

More information

Brief Review on Estimation Theory

Brief Review on Estimation Theory Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on

More information

Miscellaneous Errors in the Chapter 6 Solutions

Miscellaneous Errors in the Chapter 6 Solutions Miscellaneous Errors in the Chapter 6 Solutions 3.30(b In this problem, early printings of the second edition use the beta(a, b distribution, but later versions use the Poisson(λ distribution. If your

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 05 Full points may be obtained for correct answers to eight questions Each numbered question (which may have several parts) is worth

More information

Lecture 4. f X T, (x t, ) = f X,T (x, t ) f T (t )

Lecture 4. f X T, (x t, ) = f X,T (x, t ) f T (t ) LECURE NOES 21 Lecture 4 7. Sufficient statistics Consider the usual statistical setup: the data is X and the paramter is. o gain information about the parameter we study various functions of the data

More information

Classical Estimation Topics

Classical Estimation Topics Classical Estimation Topics Namrata Vaswani, Iowa State University February 25, 2014 This note fills in the gaps in the notes already provided (l0.pdf, l1.pdf, l2.pdf, l3.pdf, LeastSquares.pdf). 1 Min

More information

ENEE 621 SPRING 2016 DETECTION AND ESTIMATION THEORY THE PARAMETER ESTIMATION PROBLEM

ENEE 621 SPRING 2016 DETECTION AND ESTIMATION THEORY THE PARAMETER ESTIMATION PROBLEM c 2007-2016 by Armand M. Makowski 1 ENEE 621 SPRING 2016 DETECTION AND ESTIMATION THEORY THE PARAMETER ESTIMATION PROBLEM 1 The basic setting Throughout, p, q and k are positive integers. The setup With

More information

1 Probability Model. 1.1 Types of models to be discussed in the course

1 Probability Model. 1.1 Types of models to be discussed in the course Sufficiency January 18, 016 Debdeep Pati 1 Probability Model Model: A family of distributions P θ : θ Θ}. P θ (B) is the probability of the event B when the parameter takes the value θ. P θ is described

More information

If Y and Y 0 satisfy (1-2), then Y = Y 0 a.s.

If Y and Y 0 satisfy (1-2), then Y = Y 0 a.s. 20 6. CONDITIONAL EXPECTATION Having discussed at length the limit theory for sums of independent random variables we will now move on to deal with dependent random variables. An important tool in this

More information

Chapter 1. Statistical Spaces

Chapter 1. Statistical Spaces Chapter 1 Statistical Spaces Mathematical statistics is a science that studies the statistical regularity of random phenomena, essentially by some observation values of random variable (r.v.) X. Sometimes

More information

Translation Invariant Experiments with Independent Increments

Translation Invariant Experiments with Independent Increments Translation Invariant Statistical Experiments with Independent Increments (joint work with Nino Kordzakhia and Alex Novikov Steklov Mathematical Institute St.Petersburg, June 10, 2013 Outline 1 Introduction

More information

Chapter 7. Basic Probability Theory

Chapter 7. Basic Probability Theory Chapter 7. Basic Probability Theory I-Liang Chern October 20, 2016 1 / 49 What s kind of matrices satisfying RIP Random matrices with iid Gaussian entries iid Bernoulli entries (+/ 1) iid subgaussian entries

More information

Qualifying Exam in Probability and Statistics. https://www.soa.org/files/edu/edu-exam-p-sample-quest.pdf

Qualifying Exam in Probability and Statistics. https://www.soa.org/files/edu/edu-exam-p-sample-quest.pdf Part : Sample Problems for the Elementary Section of Qualifying Exam in Probability and Statistics https://www.soa.org/files/edu/edu-exam-p-sample-quest.pdf Part 2: Sample Problems for the Advanced Section

More information

Lecture 17: Likelihood ratio and asymptotic tests

Lecture 17: Likelihood ratio and asymptotic tests Lecture 17: Likelihood ratio and asymptotic tests Likelihood ratio When both H 0 and H 1 are simple (i.e., Θ 0 = {θ 0 } and Θ 1 = {θ 1 }), Theorem 6.1 applies and a UMP test rejects H 0 when f θ1 (X) f

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

Hypothesis Test. The opposite of the null hypothesis, called an alternative hypothesis, becomes

Hypothesis Test. The opposite of the null hypothesis, called an alternative hypothesis, becomes Neyman-Pearson paradigm. Suppose that a researcher is interested in whether the new drug works. The process of determining whether the outcome of the experiment points to yes or no is called hypothesis

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 016 Full points may be obtained for correct answers to eight questions. Each numbered question which may have several parts is worth

More information

I. ANALYSIS; PROBABILITY

I. ANALYSIS; PROBABILITY ma414l1.tex Lecture 1. 12.1.2012 I. NLYSIS; PROBBILITY 1. Lebesgue Measure and Integral We recall Lebesgue measure (M411 Probability and Measure) λ: defined on intervals (a, b] by λ((a, b]) := b a (so

More information

Random Process Lecture 1. Fundamentals of Probability

Random Process Lecture 1. Fundamentals of Probability Random Process Lecture 1. Fundamentals of Probability Husheng Li Min Kao Department of Electrical Engineering and Computer Science University of Tennessee, Knoxville Spring, 2016 1/43 Outline 2/43 1 Syllabus

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete

More information

5 Measure theory II. (or. lim. Prove the proposition. 5. For fixed F A and φ M define the restriction of φ on F by writing.

5 Measure theory II. (or. lim. Prove the proposition. 5. For fixed F A and φ M define the restriction of φ on F by writing. 5 Measure theory II 1. Charges (signed measures). Let (Ω, A) be a σ -algebra. A map φ: A R is called a charge, (or signed measure or σ -additive set function) if φ = φ(a j ) (5.1) A j for any disjoint

More information

MIT Spring 2016

MIT Spring 2016 MIT 18.655 Dr. Kempthorne Spring 2016 1 MIT 18.655 Outline 1 2 MIT 18.655 Decision Problem: Basic Components P = {P θ : θ Θ} : parametric model. Θ = {θ}: Parameter space. A{a} : Action space. L(θ, a) :

More information

Chapter 3 : Likelihood function and inference

Chapter 3 : Likelihood function and inference Chapter 3 : Likelihood function and inference 4 Likelihood function and inference The likelihood Information and curvature Sufficiency and ancilarity Maximum likelihood estimation Non-regular models EM

More information

Integration on Measure Spaces

Integration on Measure Spaces Chapter 3 Integration on Measure Spaces In this chapter we introduce the general notion of a measure on a space X, define the class of measurable functions, and define the integral, first on a class of

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Qualifying Exam in Probability and Statistics. https://www.soa.org/files/edu/edu-exam-p-sample-quest.pdf

Qualifying Exam in Probability and Statistics. https://www.soa.org/files/edu/edu-exam-p-sample-quest.pdf Part 1: Sample Problems for the Elementary Section of Qualifying Exam in Probability and Statistics https://www.soa.org/files/edu/edu-exam-p-sample-quest.pdf Part 2: Sample Problems for the Advanced Section

More information

Econ 508B: Lecture 5

Econ 508B: Lecture 5 Econ 508B: Lecture 5 Expectation, MGF and CGF Hongyi Liu Washington University in St. Louis July 31, 2017 Hongyi Liu (Washington University in St. Louis) Math Camp 2017 Stats July 31, 2017 1 / 23 Outline

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Real Analysis Chapter 3 Solutions Jonathan Conder. ν(f n ) = lim

Real Analysis Chapter 3 Solutions Jonathan Conder. ν(f n ) = lim . Suppose ( n ) n is an increasing sequence in M. For each n N define F n : n \ n (with 0 : ). Clearly ν( n n ) ν( nf n ) ν(f n ) lim n If ( n ) n is a decreasing sequence in M and ν( )

More information

Lecture 7 October 13

Lecture 7 October 13 STATS 300A: Theory of Statistics Fall 2015 Lecture 7 October 13 Lecturer: Lester Mackey Scribe: Jing Miao and Xiuyuan Lu 7.1 Recap So far, we have investigated various criteria for optimal inference. We

More information

Part II Probability and Measure

Part II Probability and Measure Part II Probability and Measure Theorems Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Lecture 3 September 29

Lecture 3 September 29 STATS 300A: Theory of Statistics Fall 015 Lecture 3 September 9 Lecturer: Lester Mackey Scribe: Konstantin Lopyrev, Karthik Rajkumar Warning: These notes may contain factual and/or typographic errors.

More information

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R.

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R. Ergodic Theorems Samy Tindel Purdue University Probability Theory 2 - MA 539 Taken from Probability: Theory and examples by R. Durrett Samy T. Ergodic theorems Probability Theory 1 / 92 Outline 1 Definitions

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989),

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989), Real Analysis 2, Math 651, Spring 2005 April 26, 2005 1 Real Analysis 2, Math 651, Spring 2005 Krzysztof Chris Ciesielski 1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer

More information

here, this space is in fact infinite-dimensional, so t σ ess. Exercise Let T B(H) be a self-adjoint operator on an infinitedimensional

here, this space is in fact infinite-dimensional, so t σ ess. Exercise Let T B(H) be a self-adjoint operator on an infinitedimensional 15. Perturbations by compact operators In this chapter, we study the stability (or lack thereof) of various spectral properties under small perturbations. Here s the type of situation we have in mind:

More information

7 Convergence in R d and in Metric Spaces

7 Convergence in R d and in Metric Spaces STA 711: Probability & Measure Theory Robert L. Wolpert 7 Convergence in R d and in Metric Spaces A sequence of elements a n of R d converges to a limit a if and only if, for each ǫ > 0, the sequence a

More information

MIT Spring 2016

MIT Spring 2016 Dr. Kempthorne Spring 2016 1 Outline Building 1 Building 2 Definition Building Let X be a random variable/vector with sample space X R q and probability model P θ. The class of probability models P = {P

More information

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015 Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

1 Directional Derivatives and Differentiability

1 Directional Derivatives and Differentiability Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=

More information

Chapter 1: Probability Theory Lecture 1: Measure space, measurable function, and integration

Chapter 1: Probability Theory Lecture 1: Measure space, measurable function, and integration Chapter 1: Probability Theory Lecture 1: Measure space, measurable function, and integration Random experiment: uncertainty in outcomes Ω: sample space: a set containing all possible outcomes Definition

More information

Spring 2012 Math 541A Exam 1. X i, S 2 = 1 n. n 1. X i I(X i < c), T n =

Spring 2012 Math 541A Exam 1. X i, S 2 = 1 n. n 1. X i I(X i < c), T n = Spring 2012 Math 541A Exam 1 1. (a) Let Z i be independent N(0, 1), i = 1, 2,, n. Are Z = 1 n n Z i and S 2 Z = 1 n 1 n (Z i Z) 2 independent? Prove your claim. (b) Let X 1, X 2,, X n be independent identically

More information

MIT Spring 2016

MIT Spring 2016 Exponential Families II MIT 18.655 Dr. Kempthorne Spring 2016 1 Outline Exponential Families II 1 Exponential Families II 2 : Expectation and Variance U (k 1) and V (l 1) are random vectors If A (m k),

More information

Machine Learning. Lecture 3: Logistic Regression. Feng Li.

Machine Learning. Lecture 3: Logistic Regression. Feng Li. Machine Learning Lecture 3: Logistic Regression Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2016 Logistic Regression Classification

More information

discrete random variable: probability mass function continuous random variable: probability density function

discrete random variable: probability mass function continuous random variable: probability density function CHAPTER 1 DISTRIBUTION THEORY 1 Basic Concepts Random Variables discrete random variable: probability mass function continuous random variable: probability density function CHAPTER 1 DISTRIBUTION THEORY

More information

Notes on Measure, Probability and Stochastic Processes. João Lopes Dias

Notes on Measure, Probability and Stochastic Processes. João Lopes Dias Notes on Measure, Probability and Stochastic Processes João Lopes Dias Departamento de Matemática, ISEG, Universidade de Lisboa, Rua do Quelhas 6, 1200-781 Lisboa, Portugal E-mail address: jldias@iseg.ulisboa.pt

More information

Bayes spaces: use of improper priors and distances between densities

Bayes spaces: use of improper priors and distances between densities Bayes spaces: use of improper priors and distances between densities J. J. Egozcue 1, V. Pawlowsky-Glahn 2, R. Tolosana-Delgado 1, M. I. Ortego 1 and G. van den Boogaart 3 1 Universidad Politécnica de

More information

SOLUTION FOR HOMEWORK 7, STAT p(x σ) = (1/[2πσ 2 ] 1/2 )e (x µ)2 /2σ 2.

SOLUTION FOR HOMEWORK 7, STAT p(x σ) = (1/[2πσ 2 ] 1/2 )e (x µ)2 /2σ 2. SOLUTION FOR HOMEWORK 7, STAT 6332 1. We have (for a general case) Denote p (x) p(x σ)/ σ. Then p(x σ) (1/[2πσ 2 ] 1/2 )e (x µ)2 /2σ 2. p (x σ) p(x σ) 1 (x µ)2 +. σ σ 3 Then E{ p (x σ) p(x σ) } σ 2 2σ

More information