LECTURE 2 NOTES

1. Minimal sufficient statistics.

Definition 1.1 (minimal sufficient statistic). A statistic $t := \phi(x)$ is a minimal sufficient statistic for the parametric model $\mathcal{F}$ if

1. it is sufficient;
2. it can be expressed as a function of any other sufficient statistic; i.e., for any other sufficient statistic $t'$, there is some function $g$ such that $t = g(t')$.

Recall that passing a random variable $x$ through a function $g$ is a form of data reduction: a statistic $t' = g(t)$ has no more information than $t$. Thus a minimal sufficient statistic is a sufficient statistic with the least information; that is, it is an optimal form of data reduction. Like sufficient statistics, minimal sufficient statistics are not unique: if $t$ is minimal sufficient, then $g(t)$ for any invertible $g$ is also minimal sufficient. In terms of partitions of the sample space, a minimal sufficient statistic induces the coarsest (hence minimal) partition of the sample space among all sufficient statistics. Although minimal sufficient statistics are not unique, the minimal sufficient partition is unique.

Going back to the coin tossing example, consider the statistic

(1.1)    $\phi(x_1, x_2, x_3) = \begin{cases} 100x_1 + 10x_2 + x_3 & x_1 + x_2 + x_3 = 2 \\ x_1 + x_2 + x_3 & \text{otherwise.} \end{cases}$

Table 1 gives the conditional distribution of $(x_1, x_2, x_3)$ given $t$. The statistic is sufficient because the conditional distribution does not depend on $p$. However, it is not minimal because it cannot be expressed as a function of $x_1 + x_2 + x_3$, which we know is also sufficient.

Theorem 1.2. A statistic $t := \phi(x)$ is a minimal sufficient statistic for a parametric model $\mathcal{F}$ if the ratio $f_\theta(x_1)/f_\theta(x_2)$ does not depend on $\theta$ if and only if $\phi(x_1) = \phi(x_2)$.

Proof. We shall prove the theorem when the densities in the model have common support.
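The sufficiency claim for the statistic in (1.1) can be checked by brute force. Below is a minimal sketch (Python; the function names are ours, not from the notes) that enumerates all eight outcomes of three independent Bernoulli($p$) tosses and confirms that the conditional distribution of $x$ given $t$ is the same for every $p$:

```python
import itertools

def phi(x1, x2, x3):
    # the statistic from (1.1)
    return 100*x1 + 10*x2 + x3 if x1 + x2 + x3 == 2 else x1 + x2 + x3

def conditional_table(p):
    # P(x | t) for three independent Bernoulli(p) tosses
    outcomes = list(itertools.product([0, 1], repeat=3))
    joint = {x: p**sum(x) * (1 - p)**(3 - sum(x)) for x in outcomes}
    # total mass of each level set A_t = {x : phi(x) = t}
    t_mass = {}
    for x, pr in joint.items():
        t_mass[phi(*x)] = t_mass.get(phi(*x), 0.0) + pr
    return {x: joint[x] / t_mass[phi(*x)] for x in outcomes}

# the conditional distribution is identical for any two values of p,
# so t carries all the information about p: t is sufficient
tab1, tab2 = conditional_table(0.3), conditional_table(0.8)
for x in tab1:
    assert abs(tab1[x] - tab2[x]) < 1e-12
```

The table produced by `conditional_table` matches Table 1: outcomes with $t \in \{110, 101, 11\}$ or $t \in \{0, 3\}$ sit alone in their level sets (conditional probability 1), while the three outcomes with $t = 1$ each get conditional probability $1/3$.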

STAT 201B

Table 1
The conditional distribution of three coin tosses given $t$. We see that $t$ is not a minimal sufficient statistic.

    (x_1, x_2, x_3)    t      f_p(x | t)
    (0,0,0)            0      1
    (1,0,0)            1      1/3
    (0,1,0)            1      1/3
    (0,0,1)            1      1/3
    (1,1,0)            110    1
    (0,1,1)            11     1
    (1,0,1)            101    1
    (1,1,1)            3      1

First, we show that $t$ is sufficient. For each set $A_t = \{x : \phi(x) = t\}$ in the partition induced by $t$, consider a fixed point $x_t \in A_t$. The density has the form

$f_\theta(x) = \frac{f_\theta(x)}{f_\theta(x_{\phi(x)})} f_\theta(x_{\phi(x)}).$

By assumption, the ratio $f_\theta(x)/f_\theta(x_{\phi(x)})$ does not depend on $\theta$. Further,

1. $x_{\phi(x)}$ is a function of $x$, so $f_\theta(x)/f_\theta(x_{\phi(x)})$ depends only on $x$;
2. $f_\theta(x_{\phi(x)})$ depends only on $\phi(x)$.

Thus, by the factorization theorem, $t = \phi(x)$ is a sufficient statistic.

We complete the proof by showing that $t$ is minimal. It suffices to show that $\phi'(x_1) = \phi'(x_2)$ for any sufficient statistic $t' := \phi'(x)$ implies $\phi(x_1) = \phi(x_2)$. By the factorization theorem, the density has the form

$f_\theta(x) = g_\theta(\phi'(x)) h(x).$

For any $x_1$, choose $x_2$ such that $\phi'(x_1) = \phi'(x_2)$. We observe that the ratio

$\frac{f_\theta(x_1)}{f_\theta(x_2)} = \frac{g_\theta(\phi'(x_1)) h(x_1)}{g_\theta(\phi'(x_2)) h(x_2)} = \frac{h(x_1)}{h(x_2)}$

does not depend on $\theta$. By assumption, this implies $\phi(x_1) = \phi(x_2)$.

Example 1.3. Let $x_i \overset{\text{i.i.d.}}{\sim} \mathrm{Poi}(\theta)$. We showed that $t := \sum_{i \in [n]} x_i$ is a sufficient statistic for the Poisson family. The ratio is

$\frac{f_\theta(x)}{f_\theta(x')} = \frac{\prod_{i \in [n]} x_i'!}{\prod_{i \in [n]} x_i!} \, \theta^{\sum_{i \in [n]} x_i - \sum_{i \in [n]} x_i'},$

which does not depend on $\theta$ if and only if $t = t'$. By Theorem 1.2, $t$ is a minimal sufficient statistic.

2. Exponential families.

Definition 2.1. A set of densities is an exponential family if the densities are of the form

(2.1)    $f_\theta(x) = \exp\big(\eta(\theta)^T \phi(x) - a(\theta)\big) h(x),$

where

- $\eta(\theta) \in \mathbf{R}^p$ are the natural parameters,
- $\phi(x) \in \mathbf{R}^p$ are sufficient statistics,
- $a(\theta) := \log \int \exp\big(\eta(\theta)^T \phi(x)\big) h(x)\,dx$ is the log-partition function,
- $h : \mathcal{X} \to \mathbf{R}_+$ is the base measure.

Exponential families are of statistical relevance because many common distributions are exponential families. For example,

1. binomial:

$f_p(x) = \binom{n}{x} p^x (1-p)^{n-x} = \exp\big(x \log p + (n-x)\log(1-p)\big)\binom{n}{x} = \exp\big(x \log \tfrac{p}{1-p} + n \log(1-p)\big)\binom{n}{x}.$

The natural parameter is the logit of $p$: $\log \tfrac{p}{1-p}$.

2. Poisson:

$f_\lambda(x) = \frac{\lambda^x}{x!} e^{-\lambda} = \exp(x \log \lambda - \lambda) \frac{1}{x!}.$

3. Gaussian ($\sigma^2 = 1$):

$f_\mu(x) = \frac{1}{\sqrt{2\pi}} \exp\big(-\tfrac{1}{2}(x-\mu)^2\big) = \exp\big(\mu x - \tfrac{1}{2}\mu^2\big) \frac{e^{-x^2/2}}{\sqrt{2\pi}}.$

4. Gaussian (unknown $\sigma^2$):

$f_{\mu,\sigma^2}(x) = \exp\Big(-\frac{x^2}{2\sigma^2} + \frac{\mu x}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \log\sigma\Big) \frac{1}{\sqrt{2\pi}} = \exp\bigg(\begin{bmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{bmatrix}^T \begin{bmatrix} x \\ x^2 \end{bmatrix} - \frac{\mu^2}{2\sigma^2} - \log\sigma\bigg) \frac{1}{\sqrt{2\pi}}.$
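Each of the algebraic rewritings above can be verified numerically. A minimal sketch for the binomial and Poisson cases (the function names are ours, not from the notes):

```python
import math

def binom_pmf(x, n, p):
    # standard binomial pmf
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def binom_expfam(x, n, p):
    # exponential-family form: exp( x * logit(p) + n * log(1 - p) ) * C(n, x)
    eta = math.log(p / (1 - p))  # natural parameter: the logit of p
    return math.exp(x * eta + n * math.log(1 - p)) * math.comb(n, x)

def poisson_pmf(x, lam):
    # standard Poisson pmf
    return lam**x / math.factorial(x) * math.exp(-lam)

def poisson_expfam(x, lam):
    # exponential-family form: exp( x * log(lam) - lam ) * (1 / x!)
    return math.exp(x * math.log(lam) - lam) / math.factorial(x)

# the two forms agree on a range of inputs
for x in range(8):
    assert math.isclose(binom_pmf(x, 10, 0.3), binom_expfam(x, 10, 0.3))
    assert math.isclose(poisson_pmf(x, 2.5), poisson_expfam(x, 2.5))
```

The same pattern works for the two Gaussian examples, with $(x, x^2)$ as the sufficient statistics in the unknown-$\sigma^2$ case.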

As all-inclusive as exponential families seem, there are notable exceptions. For example, the $\mathrm{unif}(0, \theta)$ family is not an exponential family. More generally, parametric models whose support depends on the parameter are not exponential families.

Often, it is convenient to re-parametrize an exponential family in terms of its natural parameters $\eta$. Doing so gives the canonical form of the exponential family:

$f_\eta(x) = \exp\big(\eta^T \phi(x) - a(\eta)\big) h(x).$

For a set of sufficient statistics $\phi$ and a base measure $h$, the set of all valid natural parameters is the natural parameter space:

$H := \Big\{\eta \in \mathbf{R}^p : \int \exp\big(\eta^T \phi(x)\big) h(x)\,dx < \infty\Big\}.$

Lemma 2.2. The log-partition function of an exponential family in canonical form is convex.

Proof. Let $\eta_1, \eta_2 \in H$. For any $\alpha \in (0, 1)$,

$a(\alpha\eta_1 + (1-\alpha)\eta_2) = \log \int \exp\big((\alpha\eta_1 + (1-\alpha)\eta_2)^T \phi(x)\big) h(x)\,dx = \log \int \exp\big(\eta_1^T \phi(x)\big)^\alpha \exp\big(\eta_2^T \phi(x)\big)^{1-\alpha} h(x)\,dx.$

By Hölder's inequality (with exponents $\tfrac{1}{\alpha}$ and $\tfrac{1}{1-\alpha}$), this is at most

$\log \bigg( \Big(\int \exp\big(\eta_1^T \phi(x)\big) h(x)\,dx\Big)^\alpha \Big(\int \exp\big(\eta_2^T \phi(x)\big) h(x)\,dx\Big)^{1-\alpha} \bigg) = \alpha\, a(\eta_1) + (1-\alpha)\, a(\eta_2).$

Since $a(\eta_1)$ and $a(\eta_2)$ are finite by assumption, so is $a(\alpha\eta_1 + (1-\alpha)\eta_2)$.

Lemma 2.3. The log-partition function $a : \Theta \to \mathbf{R}$ is convex and infinitely differentiable. Further, its derivatives are

(2.2)    $\nabla a(\theta) = \mathbf{E}_\theta\big[\phi(x)\big],$

(2.3)    $\nabla^2 a(\theta) = \mathrm{cov}_\theta\big[\phi(x)\big].$

Proof. The proof of Lemma 2.2 actually shows $a$ is convex. To complete the proof, we assume exchanging expectation and differentiation is permitted; verifying the validity of the assumption is done by a tedious argument

that appeals to the dominated convergence theorem. We defer the details to the appendix.

We exchange expectation and differentiation to obtain

$\partial_i a(\theta) = \frac{\int \phi_i(x) \exp\big(\theta^T \phi(x)\big) h(x)\,dx}{\int \exp\big(\theta^T \phi(x)\big) h(x)\,dx} = \int \phi_i(x) \exp\big(\theta^T \phi(x) - a(\theta)\big) h(x)\,dx = \mathbf{E}_\theta\big[\phi_i(x)\big]$

for any $i \in [p]$, which establishes (2.2). Exchanging expectation and differentiation again,

$\partial^2_{i,j} a(\theta) = \int \phi_i(x)\big(\phi_j(x) - \partial_j a(\theta)\big) \exp\big(\theta^T \phi(x) - a(\theta)\big) h(x)\,dx$
$= \int \phi_i(x)\phi_j(x) \exp\big(\theta^T \phi(x) - a(\theta)\big) h(x)\,dx - \partial_j a(\theta) \int \phi_i(x) \exp\big(\theta^T \phi(x) - a(\theta)\big) h(x)\,dx$
$= \mathbf{E}_\theta\big[\phi_i(x)\phi_j(x)\big] - \mathbf{E}_\theta\big[\phi_i(x)\big]\,\mathbf{E}_\theta\big[\phi_j(x)\big]$

for any $i, j \in [p]$, which establishes (2.3).

The superficial dimension of an exponential family may be reduced by re-parametrization in two cases:

1. $T := \phi(\mathcal{X})$ is a subset of an affine set (in $\mathbf{R}^p$): there are $a \in \mathbf{R}^p$ and $b \in \mathbf{R}$ such that $a^T \phi(x) = b$ for any $x$.¹
2. $H$ is a subset of an affine set (in $\mathbf{R}^p$).

In the first case, two parameters $\eta_1, \eta_2 \in H$ may correspond to the same density, making the model unidentifiable. Indeed, the densities that correspond to $\eta$ and $\eta + a$ are proportional to each other:

$f_{\eta+a}(x) \propto \exp\big((\eta + a)^T \phi(x)\big) h(x) = \exp\big(\eta^T \phi(x) + b\big) h(x) \propto \exp\big(\eta^T \phi(x)\big) h(x).$

In fact, any parameter $\eta + \gamma a$ for some $\gamma \in \mathbf{R}$ is associated with the same density. In the second case, $\eta$ obeys linear equality constraints: the values

¹ An affine set is the set of solutions to a linear system: $\{x \in \mathbf{R}^n : Ax = b\}$ for some $A \in \mathbf{R}^{m \times n}$ and $b \in \mathbf{R}^m$.

of some parameters are determined by the values of other parameters. Thus some of the parameters are extraneous. If neither case holds, the exponential family is minimal.

Definition 2.4. An exponential family in canonical form is minimal if

1. the image of $\mathcal{X}$ under $\phi$ is not a subset of an affine set;
2. $H$ is not a subset of an affine set.

Minimal exponential families are classified into two types: full-rank and curved. A minimal exponential family is full-rank when its natural parameter space $H$ contains an open set. Otherwise, a minimal exponential family is curved.

By the factorization theorem, $\phi(x)$ is a sufficient statistic for densities of the form $f_\eta(x) = \exp\big(\eta^T \phi(x) - a(\eta)\big) h(x)$. If the exponential family is minimal, then $\phi(x)$ is a minimal sufficient statistic.

Theorem 2.5. If $\mathcal{F}$ is a minimal exponential family, then $t = \phi(x)$ is a minimal sufficient statistic.

Proof. By Theorem 1.2, it suffices to show that the ratio $f_\eta(x_1)/f_\eta(x_2)$ is constant in $\eta$ if and only if $\phi(x_1) = \phi(x_2)$. If $\phi(x_1) = \phi(x_2)$, then $\exp\big(\eta^T(\phi(x_1) - \phi(x_2))\big) = 1$, so the ratio

$\frac{f_\eta(x_1)}{f_\eta(x_2)} = \exp\big(\eta^T(\phi(x_1) - \phi(x_2))\big) \frac{h(x_1)}{h(x_2)} = \frac{h(x_1)}{h(x_2)}$

is constant in $\eta$. Conversely, assume $f_\eta(x_1)/f_\eta(x_2)$ is constant in $\eta$; that is,

$\frac{f_{\eta_1}(x_1)}{f_{\eta_1}(x_2)} = \frac{f_{\eta_2}(x_1)}{f_{\eta_2}(x_2)}$ for any $\eta_1, \eta_2 \in H$.

We simplify to obtain

$(\eta_1 - \eta_2)^T\big(\phi(x_1) - \phi(x_2)\big) = 0.$

By the definition of minimal exponential family, $H$ is not a subset of an affine set. Since subspaces are affine sets, $H$ is not a subset of any subspace, which implies $\mathrm{span}(H) = \mathbf{R}^p$.

Since

$(\eta_1 - \eta_2)^T\big(\phi(x_1) - \phi(x_2)\big) = 0$ for all $\eta_1, \eta_2 \in H$,

we deduce

$\eta^T\big(\phi(x_1) - \phi(x_2)\big) = 0$ for all $\eta \in \mathrm{span}(H)$,

hence

$\eta^T\big(\phi(x_1) - \phi(x_2)\big) = 0$ for all $\eta \in \mathbf{R}^p$,

which allows us to conclude $\phi(x_1) - \phi(x_2) = 0$; i.e., $\phi(x_1) = \phi(x_2)$.

The joint distribution of i.i.d. samples from a $p$-dimensional exponential family is another $p$-dimensional exponential family:

$\prod_{i \in [n]} f_\theta(x_i) = \prod_{i \in [n]} \exp\big(\eta(\theta)^T \phi(x_i) - a(\theta)\big) h(x_i) = \exp\Big(\eta(\theta)^T \big(\textstyle\sum_{i \in [n]} \phi(x_i)\big) - n\,a(\theta)\Big) \prod_{i \in [n]} h(x_i).$

By the factorization theorem, $\sum_{i \in [n]} \phi(x_i)$ is a sufficient statistic. We remark that its dimension does not grow with the sample size. In fact, i.i.d. exponential families are the only parametric models that admit a sufficient statistic whose dimension does not depend on the sample size. The Pitman–Koopman–Darmois² theorem says that if $\{x_i\}_{i \in [n]}$ are i.i.d. samples from a continuous density $f_\theta$ in a parametric model that consists of densities whose support does not depend on the parameter, then there is a sufficient statistic whose dimension does not depend on the sample size if and only if the parametric model is an exponential family.

APPENDIX A

Lemma A.1. If two functions $f_1$ and $f_2$ are

1. proportional to each other: $f_1(x) = c f_2(x)$ for any $x$,
2. equal in total integral: $\int f_1(x)\,dx = \int f_2(x)\,dx$,

then they are equal: $f_1(x) = f_2(x)$ for any $x$.

Proof. Indeed, by integrating $f_1(x) = c f_2(x)$ and rearranging, we have

$\frac{\int f_1(x)\,dx}{\int f_2(x)\,dx} = c.$

By the assumption that their integrals agree, we deduce $c = 1$, which shows that $f_1(x) = f_2(x)$.

² One of the namesakes of the theorem is Dr. Jim Pitman's father, Dr. Edwin Pitman.
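The i.i.d. factorization above can also be checked concretely. A minimal sketch for the Poisson family (our function names, not from the notes): the direct product of marginals agrees with the exponential-family form $\exp(\eta\, t - n\, a(\eta)) \prod_i h(x_i)$, where the sufficient statistic $t = \sum_i x_i$ stays one-dimensional however large $n$ is.

```python
import math

def poisson_joint(xs, lam):
    # direct product of Poisson(lam) marginals
    out = 1.0
    for x in xs:
        out *= lam**x * math.exp(-lam) / math.factorial(x)
    return out

def poisson_joint_expfam(xs, lam):
    # exponential-family form: exp( eta * t - n * a(eta) ) * prod h(x_i),
    # with eta = log(lam), a(eta) = exp(eta) = lam, h(x) = 1/x!
    n, eta = len(xs), math.log(lam)
    t = sum(xs)  # sufficient statistic: dimension 1 for any sample size n
    h = math.prod(1 / math.factorial(x) for x in xs)
    return math.exp(eta * t - n * lam) * h

xs = [3, 0, 2, 5, 1]
assert math.isclose(poisson_joint(xs, 2.0), poisson_joint_expfam(xs, 2.0))
```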

Lemma A.2. Let $f_\theta(x)$ be a member of a canonical exponential family. We have

$\partial_i z(\eta) = \int \partial_i \big[\exp\big(\eta^T \phi(x)\big)\big] h(x)\,dx$

at any $\eta \in \mathrm{int}(H)$.

Proof. We show that exchanging differentiation and integration is justified at the origin. By the definition of the derivative,

$\partial_i z(0) = \lim_{n \to \infty} \frac{z(\delta_n e_i) - z(0)}{\delta_n} = \lim_{n \to \infty} \int \frac{e^{\delta_n e_i^T \phi(x)} - 1}{\delta_n}\, h(x)\,dx,$

where $\{\delta_n\}$ is any decreasing sequence that converges to zero. We assume $\delta_0 := \sup_n \delta_n$ is small enough so that $\pm 2\delta_0 e_i \in \mathrm{int}(H)$. We recognize that the limit of the integrand is $\partial_i \big[\exp\big(\eta^T \phi(x)\big)\big]\big|_{\eta=0}\, h(x)$. To justify exchanging $\lim_{n \to \infty}$ and integration, we appeal to the dominated convergence theorem. The (absolute value of the) integrand is at most

$\Big|\frac{e^{\delta_n e_i^T \phi(x)} - 1}{\delta_n}\Big|\, h(x) \le e^{|\delta_n e_i^T \phi(x)|}\, |e_i^T \phi(x)|\, h(x)$    (since $|e^u - 1| \le e^{|u|}|u|$)
$\le \frac{e^{|\delta_0 e_i^T \phi(x)|}\, |\delta_0 e_i^T \phi(x)|}{\delta_0}\, h(x)$
$\le \frac{e^{2|\delta_0 e_i^T \phi(x)|}}{\delta_0}\, h(x)$    (since $|u| \le e^{|u|}$)
$\le \frac{e^{2\delta_0 e_i^T \phi(x)} + e^{-2\delta_0 e_i^T \phi(x)}}{\delta_0}\, h(x)$    (since $e^{|u|} \le e^u + e^{-u}$).

We check that the bound is integrable:

$\int \frac{e^{2\delta_0 e_i^T \phi(x)} + e^{-2\delta_0 e_i^T \phi(x)}}{\delta_0}\, h(x)\,dx = \frac{z(2\delta_0 e_i) + z(-2\delta_0 e_i)}{\delta_0},$

which, by the assumption that $\pm 2\delta_0 e_i \in \mathrm{int}(H)$, is finite. Thus we are free to exchange $\lim_{n \to \infty}$ and integration to conclude

$\partial_i z(0) = \lim_{n \to \infty} \int \frac{e^{\delta_n e_i^T \phi(x)} - 1}{\delta_n}\, h(x)\,dx = \int \lim_{n \to \infty} \frac{e^{\delta_n e_i^T \phi(x)} - 1}{\delta_n}\, h(x)\,dx = \int \partial_i \big[\exp\big(\eta^T \phi(x)\big)\big]\big|_{\eta=0}\, h(x)\,dx.$

Yuekai Sun
Berkeley, California
November 29, 2015
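A quick numerical check of (2.2) and (2.3) for the Poisson family, where the log-partition function in canonical form is $a(\eta) = e^\eta$ and the mean and variance both equal $\lambda = e^\eta$ (a sketch; the finite-difference step sizes are arbitrary choices of ours):

```python
import math

def a(eta):
    # log-partition of the Poisson family in canonical form:
    # sum_x exp(eta * x) / x! = exp(exp(eta)), so a(eta) = exp(eta)
    return math.exp(eta)

def num_grad(f, x, h=1e-5):
    # central finite difference for the first derivative
    return (f(x + h) - f(x - h)) / (2 * h)

def num_hess(f, x, h=1e-4):
    # central finite difference for the second derivative
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

eta = 0.7
lam = math.exp(eta)  # Poisson mean, which also equals the variance
# (2.2): a'(eta) = E[phi(x)] = lam;  (2.3): a''(eta) = var[phi(x)] = lam
assert math.isclose(num_grad(a, eta), lam, rel_tol=1e-6)
assert math.isclose(num_hess(a, eta), lam, rel_tol=1e-4)
```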