LECTURE 2 NOTES

1. Minimal sufficient statistics.

Definition 1.1 (minimal sufficient statistic). A statistic $t := \phi(x)$ is a minimal sufficient statistic for the parametric model $\mathcal{F}$ if
1. it is sufficient;
2. it can be expressed as a function of any other sufficient statistic; i.e., for any other sufficient statistic $t'$, there is some function $g$ such that $t = g(t')$.

Recall that passing a random variable through a function is a form of data reduction: a statistic $g(t)$ has no more information than $t$. Thus a minimal sufficient statistic is a sufficient statistic with the least information; that is, it is an optimal form of data reduction.

Like sufficient statistics, minimal sufficient statistics are not unique: if $t$ is minimal sufficient, then $g(t)$ for any invertible $g$ is also minimal sufficient. In terms of partitions of the sample space, a minimal sufficient statistic induces the coarsest (hence minimal) partition of the sample space among all sufficient statistics. Although minimal sufficient statistics are not unique, the minimal sufficient partition is unique.

Going back to the coin tossing example, consider the statistic
$$(1.1)\qquad \phi(x_1, x_2, x_3) = \begin{cases} 100x_1 + 10x_2 + x_3 & x_1 + x_2 + x_3 = 2 \\ x_1 + x_2 + x_3 & \text{otherwise.} \end{cases}$$
Table 1 gives the conditional distribution of $(x_1, x_2, x_3)$ given $t$. The statistic is sufficient because the conditional distribution does not depend on $p$. However, it is not minimal because it cannot be expressed as a function of $x_1 + x_2 + x_3$, which we know is also sufficient.

Theorem 1.2. A statistic $t := \phi(x)$ is a minimal sufficient statistic for a parametric model $\mathcal{F}$ if the ratio $\frac{f_\theta(x_1)}{f_\theta(x_2)}$ does not depend on $\theta$ if and only if $\phi(x_1) = \phi(x_2)$.

Proof. We shall prove the theorem when the densities in the model have common support.
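Before continuing, the sufficiency (and non-minimality) of the statistic in (1.1) can be checked numerically. The following Python sketch, not part of the original notes, enumerates the eight outcomes of three independent coin tosses with heads probability $p$, groups them by $t = \phi(x)$, and verifies that the conditional distribution of the outcome given $t$ is the same for different values of $p$ and matches Table 1:

```python
from itertools import product

def phi(x1, x2, x3):
    # The statistic in (1.1): encode the exact outcome as a decimal code
    # when the number of heads is 2; report the number of heads otherwise.
    if x1 + x2 + x3 == 2:
        return 100 * x1 + 10 * x2 + x3
    return x1 + x2 + x3

def conditional_table(p):
    # f_p(x | t) = f_p(x) / P(t) for each outcome x of three i.i.d. tosses.
    joint = {x: p**sum(x) * (1 - p)**(3 - sum(x))
             for x in product([0, 1], repeat=3)}
    mass_t = {}
    for x, fx in joint.items():
        mass_t[phi(*x)] = mass_t.get(phi(*x), 0.0) + fx
    return {x: joint[x] / mass_t[phi(*x)] for x in joint}

# The conditional distribution does not depend on p (sufficiency) ...
t1, t2 = conditional_table(0.3), conditional_table(0.7)
assert all(abs(t1[x] - t2[x]) < 1e-12 for x in t1)
# ... and matches Table 1, e.g. f_p((1,0,0) | t = 1) = 1/3.
assert abs(t1[(1, 0, 0)] - 1/3) < 1e-12
```

Non-minimality is visible in the partition: the three outcomes with two heads fall into three distinct cells (codes 110, 101, 11), so $\phi$ cannot be recovered from $x_1 + x_2 + x_3$ alone.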
STAT 201B

Table 1: The conditional distribution of three coin tosses given $t$. We see that $t$ is not a minimal sufficient statistic.

    (x_1, x_2, x_3)    t      f_p(x | t)
    (0,0,0)            0      1
    (1,0,0)            1      1/3
    (0,1,0)            1      1/3
    (0,0,1)            1      1/3
    (1,1,0)            110    1
    (0,1,1)            11     1
    (1,0,1)            101    1
    (1,1,1)            3      1

First, we show that $t$ is sufficient. For each set $A_t = \{x : \phi(x) = t\}$ in the partition induced by $t$, fix a point $x_t \in A_t$. The density has the form
$$f_\theta(x) = \frac{f_\theta(x)}{f_\theta(x_{\phi(x)})} f_\theta(x_{\phi(x)}).$$
By assumption, the ratio $\frac{f_\theta(x)}{f_\theta(x_{\phi(x)})}$ does not depend on $\theta$. Further,
1. $x_{\phi(x)}$ is a function of $x$, so $\frac{f_\theta(x)}{f_\theta(x_{\phi(x)})}$ depends only on $x$;
2. $f_\theta(x_{\phi(x)})$ depends only on $\phi(x)$.
Thus, by the factorization theorem, $t = \phi(x)$ is a sufficient statistic.

We complete the proof by showing that $t$ is minimal. It suffices to show that $\phi'(x_1) = \phi'(x_2)$ for any sufficient statistic $t' := \phi'(x)$ implies $\phi(x_1) = \phi(x_2)$. By the factorization theorem, the density has the form
$$f_\theta(x) = g_\theta(\phi'(x)) h(x).$$
For any $x_1$, choose $x_2$ such that $\phi'(x_1) = \phi'(x_2)$. We observe that the ratio
$$\frac{f_\theta(x_1)}{f_\theta(x_2)} = \frac{g_\theta(\phi'(x_1)) h(x_1)}{g_\theta(\phi'(x_2)) h(x_2)} = \frac{h(x_1)}{h(x_2)}$$
does not depend on $\theta$. By assumption, this implies $\phi(x_1) = \phi(x_2)$.

Example 1.3. Let $x_i \overset{\text{i.i.d.}}{\sim} \mathrm{Poi}(\theta)$. We showed that $t := \sum_{i \in [n]} x_i$ is a sufficient statistic for the Poisson family. The ratio
$$\frac{f_\theta(x)}{f_\theta(x')} = \frac{\prod_{i \in [n]} x_i'!}{\prod_{i \in [n]} x_i!}\,\theta^{\sum_{i \in [n]} x_i - \sum_{i \in [n]} x_i'},$$
which does not depend on $\theta$ if and only if $t = t'$. By Theorem 1.2, $t$ is a minimal sufficient statistic.

2. Exponential families.

Definition 2.1. A set of densities is an exponential family if the densities are of the form
$$(2.1)\qquad f_\theta(x) = \exp\left(\eta(\theta)^T \phi(x) - a(\theta)\right) h(x),$$
where
- $\eta(\theta) \in \mathbf{R}^p$ are the natural parameters,
- $\phi(x) \in \mathbf{R}^p$ are sufficient statistics,
- $a(\theta) := \log \int \exp\left(\eta(\theta)^T \phi(x)\right) h(x)\,dx$ is the log-partition function,
- $h$ is the base measure.

Exponential families are of statistical relevance because many common distributions are exponential families. For example,

1. binomial:
$$f_p(x) = \binom{n}{x} p^x (1-p)^{n-x} = \exp\left(x \log p + (n-x) \log(1-p)\right)\binom{n}{x} = \exp\left(x \log \tfrac{p}{1-p} + n \log(1-p)\right)\binom{n}{x}.$$
The natural parameter is the logit of $p$: $\log \tfrac{p}{1-p}$.

2. Poisson:
$$f_\lambda(x) = \frac{\lambda^x}{x!} e^{-\lambda} = \exp\left(x \log \lambda - \lambda\right)\frac{1}{x!}.$$

3. Gaussian ($\sigma^2 = 1$):
$$f_\mu(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2}(x - \mu)^2\right) = \exp\left(\mu x - \tfrac{1}{2}\mu^2\right)\frac{e^{-x^2/2}}{\sqrt{2\pi}}.$$

4. Gaussian (unknown $\sigma^2$):
$$f_{\mu,\sigma^2}(x) = \exp\left(-\frac{x^2}{2\sigma^2} + \frac{\mu x}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \log\sigma\right)\frac{1}{\sqrt{2\pi}} = \exp\left(\begin{bmatrix} \frac{\mu}{\sigma^2} \\ -\frac{1}{2\sigma^2} \end{bmatrix}^T \begin{bmatrix} x \\ x^2 \end{bmatrix} - \frac{\mu^2}{2\sigma^2} - \log\sigma\right)\frac{1}{\sqrt{2\pi}}.$$
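As a numerical sanity check (a sketch in Python, not part of the original notes), the snippet below verifies the exponential-family form of the Poisson density from item 2, $\exp(x \log \lambda - \lambda)/x!$, against the usual pmf, and, in line with Example 1.3, confirms that the likelihood ratio for i.i.d. Poisson samples is constant in the parameter exactly when the sums $\sum_i x_i$ agree:

```python
import math

def poisson_pmf(x, lam):
    # Usual form: lam^x e^{-lam} / x!
    return lam**x * math.exp(-lam) / math.factorial(x)

def poisson_expfam(x, lam):
    # Exponential-family form: natural parameter log(lam), sufficient
    # statistic x, log-partition lam, base measure 1/x!.
    return math.exp(x * math.log(lam) - lam) / math.factorial(x)

assert all(abs(poisson_pmf(x, 2.5) - poisson_expfam(x, 2.5)) < 1e-12
           for x in range(10))

def ratio(xs, ys, lam):
    # f_lam(xs) / f_lam(ys) for i.i.d. Poisson samples.
    fx = math.prod(poisson_pmf(x, lam) for x in xs)
    fy = math.prod(poisson_pmf(y, lam) for y in ys)
    return fx / fy

# Equal sums => the ratio is constant in lambda (Example 1.3).
r1 = ratio([1, 2, 3], [2, 2, 2], 0.5)
r2 = ratio([1, 2, 3], [2, 2, 2], 4.0)
assert abs(r1 - r2) < 1e-9
# Different sums => the ratio varies with lambda.
assert abs(ratio([1, 2, 3], [0, 0, 1], 0.5)
           - ratio([1, 2, 3], [0, 0, 1], 4.0)) > 1e-6
```

Here the constant ratio for equal sums is $\prod_i x_i'! / \prod_i x_i!$, exactly the $\theta$-free factor in Example 1.3.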
As all-inclusive as exponential families seem, there are notable exceptions. For example, the $\mathrm{unif}(0, \theta)$ family is not an exponential family. More generally, parametric models whose support depends on the parameter are not exponential families.

Often, it is convenient to re-parametrize an exponential family in terms of its natural parameters $\eta$. Doing so gives the canonical form of the exponential family:
$$f_\eta(x) = \exp\left(\eta^T \phi(x) - a(\eta)\right) h(x).$$
For a set of sufficient statistics $\phi$ and a base measure $h$, the set of all valid natural parameters is the natural parameter space:
$$H := \left\{\eta \in \mathbf{R}^p : \int \exp\left(\eta^T \phi(x)\right) h(x)\,dx < \infty\right\}.$$

Lemma 2.2. The log-partition function of an exponential family in canonical form is convex.

Proof. Let $\eta_1, \eta_2 \in H$. For any $\alpha \in (0, 1)$,
$$a(\alpha\eta_1 + (1-\alpha)\eta_2) = \log \int \exp\left((\alpha\eta_1 + (1-\alpha)\eta_2)^T \phi(x)\right) h(x)\,dx = \log \int \exp\left(\eta_1^T \phi(x)\right)^\alpha \exp\left(\eta_2^T \phi(x)\right)^{1-\alpha} h(x)\,dx.$$
By Hölder's inequality (with exponents $\frac{1}{\alpha}$ and $\frac{1}{1-\alpha}$), this is at most
$$\log\left(\left(\int \exp\left(\eta_1^T \phi(x)\right) h(x)\,dx\right)^\alpha \left(\int \exp\left(\eta_2^T \phi(x)\right) h(x)\,dx\right)^{1-\alpha}\right) = \alpha a(\eta_1) + (1-\alpha) a(\eta_2).$$
Since $a(\eta_1)$ and $a(\eta_2)$ are finite by assumption, so is $a(\alpha\eta_1 + (1-\alpha)\eta_2)$.

Lemma 2.3. The log-partition function $a$ is convex and infinitely differentiable. Further, its derivatives are
$$(2.2)\qquad \nabla a(\theta) = \mathbf{E}_\theta\left[\phi(x)\right],$$
$$(2.3)\qquad \nabla^2 a(\theta) = \mathrm{cov}_\theta\left[\phi(x)\right].$$

Proof. The proof of Lemma 2.2 actually shows $a$ is convex. To complete the proof, we assume exchanging expectation and differentiation is permitted; verifying the validity of the assumption is done by a tedious argument
that appeals to the dominated convergence theorem. We defer the details to the appendix.

We exchange expectation and differentiation to obtain
$$\partial_i a(\theta) = \frac{\int \phi_i(x) \exp\left(\theta^T \phi(x)\right) h(x)\,dx}{\int \exp\left(\theta^T \phi(x)\right) h(x)\,dx} = \int \phi_i(x) \exp\left(\theta^T \phi(x) - a(\theta)\right) h(x)\,dx = \mathbf{E}_\theta\left[\phi_i(x)\right]$$
for any $i \in [p]$, which establishes (2.2). Exchanging expectation and differentiation again,
$$\partial^2_{i,j} a(\theta) = \int \phi_i(x)\left(\phi_j(x) - \partial_j a(\theta)\right) \exp\left(\theta^T \phi(x) - a(\theta)\right) h(x)\,dx$$
$$= \int \phi_i(x)\phi_j(x) \exp\left(\theta^T \phi(x) - a(\theta)\right) h(x)\,dx - \partial_j a(\theta) \int \phi_i(x) \exp\left(\theta^T \phi(x) - a(\theta)\right) h(x)\,dx$$
$$= \mathbf{E}_\theta\left[\phi_i(x)\phi_j(x)\right] - \mathbf{E}_\theta\left[\phi_i(x)\right]\mathbf{E}_\theta\left[\phi_j(x)\right]$$
for any $i, j \in [p]$, which establishes (2.3).

The superficial dimension of an exponential family may be reduced by re-parametrization in two cases:
1. $T := \phi(\mathcal{X})$ is a subset of an affine set (in $\mathbf{R}^p$): there are $a \in \mathbf{R}^p$ and $b \in \mathbf{R}$ such that $a^T \phi(x) = b$ for any $x$.^1
2. $H$ is a subset of an affine set (in $\mathbf{R}^p$).

In the first case, two parameters $\eta_1, \eta_2 \in H$ may correspond to the same density, making the model unidentifiable. Indeed, the densities that correspond to $\eta$ and $\eta + a$ are proportional to each other:
$$f_{\eta+a}(x) \propto \exp\left((\eta + a)^T \phi(x)\right) h(x) = \exp\left(\eta^T \phi(x) + b\right) h(x) \propto \exp\left(\eta^T \phi(x)\right) h(x).$$
In fact, any parameter $\eta + \gamma a$ for some $\gamma \in \mathbf{R}$ is associated with the same density. In the second case, $\eta$ obeys linear equality constraints: the values

^1 An affine set is the set of solutions to a linear system: $\{x \in \mathbf{R}^n : Ax = b\}$ for some $A \in \mathbf{R}^{m \times n}$ and $b \in \mathbf{R}^m$.
of some parameters are determined by the values of other parameters. Thus some of the parameters are extraneous. If neither case holds, the exponential family is minimal.

Definition 2.4. An exponential family in canonical form is minimal if
1. the image of $\mathcal{X}$ under $\phi$ is not a subset of an affine set;
2. $H$ is not a subset of an affine set.

Minimal exponential families are classified into two types: full-rank and curved. A minimal exponential family is full-rank when its natural parameter space $H$ contains an open set. Otherwise, a minimal exponential family is curved.

By the factorization theorem, $\phi(x)$ is a sufficient statistic for densities of the form $f_\eta(x) = \exp\left(\eta^T \phi(x) - a(\eta)\right) h(x)$. If the exponential family is minimal, then $\phi(x)$ is a minimal sufficient statistic.

Theorem 2.5. If $\mathcal{F}$ is a minimal exponential family, then $t = \phi(x)$ is a minimal sufficient statistic.

Proof. By Theorem 1.2, it suffices to show that the ratio $\frac{f_\eta(x_1)}{f_\eta(x_2)}$ is constant in $\eta$ if and only if $\phi(x_1) = \phi(x_2)$. If $\phi(x_1) = \phi(x_2)$, the ratio is
$$\frac{f_\eta(x_1)}{f_\eta(x_2)} = \exp\left(\eta^T(\phi(x_1) - \phi(x_2))\right)\frac{h(x_1)}{h(x_2)} = \frac{h(x_1)}{h(x_2)},$$
which is constant in $\eta$.

Conversely, assume $\frac{f_\eta(x_1)}{f_\eta(x_2)}$ is constant in $\eta$. That is,
$$\frac{f_{\eta_1}(x_1)}{f_{\eta_1}(x_2)} = \frac{f_{\eta_2}(x_1)}{f_{\eta_2}(x_2)} \quad \text{for any } \eta_1, \eta_2 \in H.$$
We simplify to obtain
$$(\eta_1 - \eta_2)^T\left(\phi(x_1) - \phi(x_2)\right) = 0.$$
By the definition of minimal exponential family, $H$ is not a subset of an affine set. Since translates of subspaces are affine sets, the differences $\{\eta_1 - \eta_2 : \eta_1, \eta_2 \in H\}$ cannot lie in a proper subspace of $\mathbf{R}^p$; hence they span $\mathbf{R}^p$.
Since
$$(\eta_1 - \eta_2)^T\left(\phi(x_1) - \phi(x_2)\right) = 0 \quad \text{for all } \eta_1, \eta_2 \in H,$$
we deduce
$$\eta^T\left(\phi(x_1) - \phi(x_2)\right) = 0 \quad \text{for all } \eta \in \mathrm{span}\{\eta_1 - \eta_2 : \eta_1, \eta_2 \in H\} = \mathbf{R}^p,$$
which allows us to conclude $\phi(x_1) - \phi(x_2) = 0$; i.e., $\phi(x_1) = \phi(x_2)$.

The joint distribution of i.i.d. samples from a $p$-dimensional exponential family is another $p$-dimensional exponential family:
$$\prod_{i \in [n]} f_\theta(x_i) = \prod_{i \in [n]} \exp\left(\eta(\theta)^T \phi(x_i) - a(\theta)\right) h(x_i) = \exp\left(\eta(\theta)^T\left(\sum_{i \in [n]} \phi(x_i)\right) - n a(\theta)\right) \prod_{i \in [n]} h(x_i).$$
By the factorization theorem, $\sum_{i \in [n]} \phi(x_i)$ is a sufficient statistic. We remark that its dimension does not grow with the sample size. In fact, i.i.d. exponential families are the only parametric models that admit a sufficient statistic whose dimension does not depend on the sample size. The Pitman-Koopman-Darmois^2 theorem says that if $\{x_i\}_{i \in [n]}$ are i.i.d. samples from a continuous density $f_\theta$ in a parametric model that consists of densities whose support does not depend on the parameter, then there is a sufficient statistic whose dimension does not depend on the sample size if and only if the parametric model is an exponential family.

APPENDIX A

Lemma A.1. If two functions $f_1$ and $f_2$ are
1. proportional to each other: $f_1(x) = c f_2(x)$ for any $x$,
2. equal in integral: $\int f_1(x)\,dx = \int f_2(x)\,dx$,
then they are equal: $f_1(x) = f_2(x)$ for any $x$.

Proof. Indeed, by integrating $f_1(x) = c f_2(x)$ and rearranging, we have
$$\frac{\int f_1(x)\,dx}{\int f_2(x)\,dx} = c.$$
By the assumption that their integrals agree, we deduce $c = 1$, which shows that $f_1(x) = f_2(x)$.

^2 One of the namesakes of the theorem is Dr. Jim Pitman's father, Dr. Edwin Pitman.
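The derivative identities (2.2) and (2.3) of Lemma 2.3 can be illustrated numerically (a sketch, not part of the notes). For the Poisson family in canonical form, the natural parameter is $\eta = \log\lambda$ and the log-partition function is $a(\eta) = e^\eta$, so $a'(\eta) = a''(\eta) = \lambda$, which should match the Poisson mean and variance:

```python
import math

lam = 3.0
eta = math.log(lam)   # natural parameter of the Poisson family
a = math.exp          # log-partition function in canonical form: a(eta) = e^eta

# Centered finite differences approximating a'(eta) and a''(eta).
h = 1e-6
grad_a = (a(eta + h) - a(eta - h)) / (2 * h)
h2 = 1e-4
hess_a = (a(eta + h2) - 2 * a(eta) + a(eta - h2)) / h2**2

# Mean and variance of Poisson(lam), computed by truncating the series.
pmf = [lam**x * math.exp(-lam) / math.factorial(x) for x in range(150)]
mean = sum(x * p for x, p in enumerate(pmf))
var = sum(x**2 * p for x, p in enumerate(pmf)) - mean**2

assert abs(grad_a - mean) < 1e-4   # (2.2): a'(eta) = E[phi(x)] = lam
assert abs(hess_a - var) < 1e-3    # (2.3): a''(eta) = var[phi(x)] = lam
```

The step sizes differ because the second difference is more sensitive to floating-point cancellation; both agree with $\lambda = 3$ to well within the stated tolerances.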
Lemma A.2. Let $f_\theta(x)$ be a member of a canonical exponential family, and let $z(\eta) := \int \exp\left(\eta^T \phi(x)\right) h(x)\,dx$. We have
$$\partial_i z(\eta) = \int \partial_i\left[\exp\left(\eta^T \phi(x)\right)\right] h(x)\,dx$$
at any $\eta \in \mathrm{int}(H)$.

Proof. We show that exchanging differentiation and integration is justified at the origin; the argument at any other $\eta \in \mathrm{int}(H)$ is analogous. By the definition of the derivative,
$$\partial_i z(0) = \lim_n \frac{z(\delta_n e_i) - z(0)}{\delta_n} = \lim_n \int \frac{e^{\delta_n e_i^T \phi(x)} - 1}{\delta_n} h(x)\,dx,$$
where $\{\delta_n\}$ is any decreasing sequence that converges to zero. We assume $\delta_0 := \sup_n \delta_n$ is small enough that $\pm 2\delta_0 e_i \in \mathrm{int}(H)$. We recognize that the limit of the integrand is $\partial_i\left[\exp\left(\eta^T \phi(x)\right)\right]\big|_{\eta = 0}\, h(x)$. To justify exchanging $\lim_n$ and integration, we appeal to the dominated convergence theorem. The (absolute value of the) integrand is at most
$$\frac{\left|e^{\delta_n e_i^T \phi(x)} - 1\right|}{\delta_n}\, h(x) \le e^{\left|\delta_n e_i^T \phi(x)\right|} \left|e_i^T \phi(x)\right| h(x) \qquad \left(|e^u - 1| \le e^{|u|} |u|\right)$$
$$\le e^{\left|\delta_0 e_i^T \phi(x)\right|} \left|e_i^T \phi(x)\right| h(x)$$
$$\le \frac{1}{\delta_0}\, e^{2\left|\delta_0 e_i^T \phi(x)\right|} h(x) \qquad \left(|u| \le e^{|u|}\right)$$
$$\le \frac{1}{\delta_0}\left(e^{2\delta_0 e_i^T \phi(x)} + e^{-2\delta_0 e_i^T \phi(x)}\right) h(x) \qquad \left(e^{|u|} \le e^u + e^{-u}\right).$$
We check that the bound is integrable:
$$\int \frac{1}{\delta_0}\left(e^{2\delta_0 e_i^T \phi(x)} + e^{-2\delta_0 e_i^T \phi(x)}\right) h(x)\,dx = \frac{1}{\delta_0}\left(z(2\delta_0 e_i) + z(-2\delta_0 e_i)\right),$$
which is finite by the assumption that $\pm 2\delta_0 e_i \in \mathrm{int}(H)$. Thus we are free to exchange $\lim_n$ and integration to conclude
$$\partial_i z(0) = \lim_n \int \frac{e^{\delta_n e_i^T \phi(x)} - 1}{\delta_n} h(x)\,dx = \int \lim_n \frac{e^{\delta_n e_i^T \phi(x)} - 1}{\delta_n} h(x)\,dx = \int \partial_i\left[\exp\left(\eta^T \phi(x)\right)\right]\Big|_{\eta = 0} h(x)\,dx.$$

Yuekai Sun
Berkeley, California
November 29, 2015