ST5215: Advanced Statistical Theory

Department of Statistics & Applied Probability, Monday, September 26, 2011

Lecture 10: Exponential families and sufficient statistics

Exponential Families
Exponential families are important parametric families in statistical applications.

Definition 2.2 (Exponential families)
A parametric family $\{P_\theta : \theta \in \Theta\}$ dominated by a $\sigma$-finite measure $\nu$ on $(\Omega, \mathcal{F})$ is called an exponential family iff
$$\frac{dP_\theta}{d\nu}(\omega) = \exp\bigl\{ [\eta(\theta)]^\tau T(\omega) - \xi(\theta) \bigr\} h(\omega), \qquad \omega \in \Omega,$$
where $\exp\{x\} = e^x$, $T$ is a random $p$-vector with a fixed positive integer $p$, $\eta$ is a function from $\Theta$ to $\mathcal{R}^p$, $h$ is a nonnegative Borel function on $(\Omega, \mathcal{F})$, and
$$\xi(\theta) = \log\left\{ \int_\Omega \exp\{[\eta(\theta)]^\tau T(\omega)\}\, h(\omega)\, d\nu(\omega) \right\}.$$

Remarks
In Definition 2.2, $T$ and $h$ are functions of $\omega$ only, whereas $\eta$ and $\xi$ are functions of $\theta$ only.
The representation
$$\frac{dP_\theta}{d\nu}(\omega) = \exp\bigl\{ [\eta(\theta)]^\tau T(\omega) - \xi(\theta) \bigr\} h(\omega), \qquad \omega \in \Omega,$$
of an exponential family is not unique: $\tilde\eta(\theta) = D\eta(\theta)$ with a $p \times p$ nonsingular matrix $D$ gives another representation (with $T$ replaced by $\tilde T = (D^\tau)^{-1} T$).
A change of the measure that dominates the family also changes the representation: if we define $\lambda(A) = \int_A h \, d\nu$ for any $A \in \mathcal{F}$, then we obtain an exponential family with densities
$$\frac{dP_\theta}{d\lambda}(\omega) = \exp\bigl\{ [\eta(\theta)]^\tau T(\omega) - \xi(\theta) \bigr\}.$$
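To see why the matrix transformation leaves the family unchanged, note that the exponent is invariant (a one-line check, added here for completeness):
$$[\tilde\eta(\theta)]^\tau \tilde T(\omega) = [D\eta(\theta)]^\tau (D^\tau)^{-1} T(\omega) = [\eta(\theta)]^\tau D^\tau (D^\tau)^{-1} T(\omega) = [\eta(\theta)]^\tau T(\omega),$$
so $\xi(\theta)$, which is defined through the same exponent, is unchanged as well.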

Terminology
In an exponential family, consider the reparameterization $\eta = \eta(\theta)$ and
$$f_\eta(\omega) = \exp\bigl\{ \eta^\tau T(\omega) - \zeta(\eta) \bigr\} h(\omega), \qquad \omega \in \Omega,$$
where $\zeta(\eta) = \log\bigl\{ \int_\Omega \exp\{\eta^\tau T(\omega)\}\, h(\omega)\, d\nu(\omega) \bigr\}$.
This is called the canonical form for the family (it is not unique). The new parameter $\eta$ is called the natural parameter, and the new parameter space $\Xi = \{\eta(\theta) : \theta \in \Theta\}$, a subset of $\mathcal{R}^p$, is called the natural parameter space. An exponential family in canonical form is called a natural exponential family. If there is an open set contained in the natural parameter space of an exponential family, then the family is said to be of full rank.

Example 2.6
The normal family $\{N(\mu, \sigma^2) : \mu \in \mathcal{R}, \sigma > 0\}$ is an exponential family, since the Lebesgue p.d.f. of $N(\mu, \sigma^2)$ can be written as
$$\frac{1}{\sqrt{2\pi}} \exp\left\{ \frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 - \frac{\mu^2}{2\sigma^2} - \log\sigma \right\}.$$
This belongs to an exponential family with $T(x) = (x, -x^2)$, $\eta(\theta) = \left( \frac{\mu}{\sigma^2}, \frac{1}{2\sigma^2} \right)$, $\theta = (\mu, \sigma^2)$, $\xi(\theta) = \frac{\mu^2}{2\sigma^2} + \log\sigma$, and $h(x) = 1/\sqrt{2\pi}$.
Let $\eta = (\eta_1, \eta_2) = \left( \frac{\mu}{\sigma^2}, \frac{1}{2\sigma^2} \right)$. Then $\Xi = \mathcal{R} \times (0, \infty)$ and we obtain a natural exponential family of full rank with $\zeta(\eta) = \eta_1^2/(4\eta_2) + \log\bigl(1/\sqrt{2\eta_2}\bigr)$.

A subfamily of the previous normal family, $\{N(\mu, \mu^2) : \mu \in \mathcal{R}, \mu \neq 0\}$, is also an exponential family with the natural parameter $\eta = \left( \frac{1}{\mu}, \frac{1}{2\mu^2} \right)$ and natural parameter space $\Xi = \{(x, y) : y = x^2/2, \ x \in \mathcal{R}, \ y > 0\}$. This exponential family is not of full rank.
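The following short numerical sketch (not part of the original notes; it assumes NumPy and SciPy are available) verifies that the canonical form with the quantities of Example 2.6 reproduces the $N(\mu, \sigma^2)$ density:

```python
# Verify exp{eta' T(x) - zeta(eta)} h(x) = N(mu, sigma^2) p.d.f. numerically.
import numpy as np
from scipy.stats import norm

mu, sigma = 1.3, 0.7
eta = np.array([mu / sigma**2, 1.0 / (2 * sigma**2)])     # natural parameter
zeta = eta[0]**2 / (4 * eta[1]) + np.log(1.0 / np.sqrt(2 * eta[1]))

x = np.linspace(-3.0, 5.0, 201)
T = np.stack([x, -x**2])                                  # T(x) = (x, -x^2)
h = 1.0 / np.sqrt(2.0 * np.pi)
f_canonical = np.exp(eta @ T - zeta) * h

assert np.allclose(f_canonical, norm.pdf(x, loc=mu, scale=sigma))
```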

Properties
For an exponential family, there is a nonzero measure $\lambda$ such that
$$\frac{dP_\theta}{d\lambda}(\omega) = \exp\bigl\{ [\eta(\theta)]^\tau T(\omega) - \xi(\theta) \bigr\} > 0 \quad \text{for all } \omega \text{ and } \theta. \tag{1}$$
We can use this fact to show that a family of distributions is not an exponential family.
Consider the family of uniform distributions, i.e., $P_\theta$ is $U(0, \theta)$ with an unknown $\theta \in (0, \infty)$. If $\mathcal{P} = \{P_\theta : \theta \in (0, \infty)\}$ is an exponential family, then (1) holds with a nonzero measure $\lambda$. For any $t > 0$, there is a $\theta < t$ such that $P_\theta([t, \infty)) = 0$, which with (1) implies that $\lambda([t, \infty)) = 0$. Also, for any $t \leq 0$, $P_\theta((-\infty, t]) = 0$, which with (1) implies that $\lambda((-\infty, t]) = 0$. Since $t$ is arbitrary, $\lambda \equiv 0$. This contradiction implies that $\mathcal{P}$ cannot be an exponential family.
Which of the parametric families from Tables 1.1 and 1.2 are exponential families?

Example 2.7 (The multinomial family)
Consider an experiment having $k + 1$ possible outcomes with $p_i$ as the probability of the $i$th outcome, $i = 0, 1, \ldots, k$, $\sum_{i=0}^k p_i = 1$. In $n$ independent trials of this experiment, let $X_i$ be the number of trials resulting in the $i$th outcome, $i = 0, 1, \ldots, k$. Then the joint p.d.f. (w.r.t. counting measure) of $(X_0, X_1, \ldots, X_k)$ is
$$f_\theta(x_0, x_1, \ldots, x_k) = \frac{n!}{x_0! x_1! \cdots x_k!}\, p_0^{x_0} p_1^{x_1} \cdots p_k^{x_k}\, I_B(x_0, x_1, \ldots, x_k),$$
where $B = \{(x_0, x_1, \ldots, x_k) : \text{the } x_i\text{'s are integers} \geq 0, \ \sum_{i=0}^k x_i = n\}$ and $\theta = (p_0, p_1, \ldots, p_k)$.
The distribution of $(X_0, X_1, \ldots, X_k)$ is called the multinomial distribution, which is an extension of the binomial distribution. In fact, the marginal distribution of each $X_i$ is the binomial distribution $Bi(p_i, n)$.
$\mathcal{P} = \{f_\theta : \theta \in \Theta\}$ is called the multinomial family, where $\Theta = \{\theta \in \mathcal{R}^{k+1} : 0 < p_i < 1, \ \sum_{i=0}^k p_i = 1\}$.

Example 2.7 (continued)
Let $x = (x_0, x_1, \ldots, x_k)$, $\eta = (\log p_0, \log p_1, \ldots, \log p_k)$, and $h(x) = [n!/(x_0! x_1! \cdots x_k!)] I_B(x)$. Then
$$f_\theta(x_0, x_1, \ldots, x_k) = \exp\{\eta^\tau x\}\, h(x), \qquad x \in \mathcal{R}^{k+1}.$$
Hence, the multinomial family is a natural exponential family with natural parameter $\eta$. However, this representation does not provide an exponential family of full rank, since there is no open set of $\mathcal{R}^{k+1}$ contained in the natural parameter space.
A reparameterization leads to an exponential family of full rank. Using the fact that $\sum_{i=0}^k X_i = n$ and $\sum_{i=0}^k p_i = 1$, we obtain that
$$f_\theta(x_0, x_1, \ldots, x_k) = \exp\{(\eta_*)^\tau x_* - \zeta(\eta_*)\}\, h(x), \qquad x \in \mathcal{R}^{k+1},$$
where $x_* = (x_1, \ldots, x_k)$, $\eta_* = (\log(p_1/p_0), \ldots, \log(p_k/p_0))$, and $\zeta(\eta_*) = -n \log p_0$. The $\eta_*$-parameter space is $\mathcal{R}^k$. Hence, the family $\mathcal{P}$ is a natural exponential family of full rank.
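As an illustrative check (again not in the original notes; it uses scipy.stats.multinomial), the full-rank reparameterization above can be verified numerically for a small example:

```python
# Verify f_theta(x) = exp{eta*' x* - zeta(eta*)} h(x) for the multinomial family,
# with eta*_i = log(p_i / p_0) and zeta(eta*) = -n log(p_0).
import numpy as np
from math import factorial, prod
from scipy.stats import multinomial

n = 6
p = np.array([0.2, 0.5, 0.3])              # k = 2, outcomes 0, 1, 2
x = np.array([1, 3, 2])                    # a point with x_0 + x_1 + x_2 = n

eta_star = np.log(p[1:] / p[0])            # (log(p_1/p_0), log(p_2/p_0))
zeta = -n * np.log(p[0])
h = factorial(n) / prod(factorial(int(xi)) for xi in x)

f_canonical = np.exp(eta_star @ x[1:] - zeta) * h
assert np.isclose(f_canonical, multinomial.pmf(x, n=n, p=p))
```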

Properties
If $X_1, \ldots, X_m$ are independent random vectors with p.d.f.'s in exponential families, then the p.d.f. of $(X_1, \ldots, X_m)$ is again in an exponential family.
The following result summarizes some other useful properties of exponential families. Its proof can be found in Lehmann (1986).

Theorem 2.1
Let $\mathcal{P}$ be a natural exponential family with p.d.f.
$$f_\eta(x) = \exp\bigl\{ \eta^\tau T(x) - \zeta(\eta) \bigr\} h(x).$$
(i) Let $T = (Y, U)$ and $\eta = (\vartheta, \varphi)$, where $Y$ and $\vartheta$ have the same dimension. Then $Y$ has the p.d.f.
$$f_\eta(y) = \exp\{\vartheta^\tau y - \zeta(\eta)\}$$
w.r.t. a $\sigma$-finite measure depending on $\varphi$. In particular, $T$ has a p.d.f. in a natural exponential family.

Theorem 2.1 (continued)
(ii) Furthermore, the conditional distribution of $Y$ given $U = u$ has the p.d.f. (w.r.t. a $\sigma$-finite measure depending on $u$)
$$f_{\vartheta, u}(y) = \exp\{\vartheta^\tau y - \zeta_u(\vartheta)\},$$
which is in a natural exponential family indexed by $\vartheta$.
(iii) If $\eta_0$ is an interior point of the natural parameter space, then the m.g.f. $\psi_{\eta_0}$ of $T$ is finite in a neighborhood of $0$ and is given by
$$\psi_{\eta_0}(t) = \exp\{\zeta(\eta_0 + t) - \zeta(\eta_0)\}.$$
Furthermore, if $f$ is a Borel function satisfying $\int |f| \, dP_{\eta_0} < \infty$, then the function
$$\int f(\omega) \exp\{\eta^\tau T(\omega)\}\, h(\omega)\, d\nu(\omega)$$
is infinitely often differentiable in a neighborhood of $\eta_0$, and the derivatives may be computed by differentiation under the integral sign.

Example 2.5
Let $P_\theta$ be the binomial distribution $Bi(\theta, n)$ with parameter $\theta$, where $n$ is a fixed positive integer. Then $\{P_\theta : \theta \in (0, 1)\}$ is an exponential family, since the p.d.f. of $P_\theta$ w.r.t. the counting measure is
$$f_\theta(x) = \exp\left\{ x \log\frac{\theta}{1-\theta} + n \log(1 - \theta) \right\} \binom{n}{x} I_{\{0,1,\ldots,n\}}(x)$$
($T(x) = x$, $\eta(\theta) = \log\frac{\theta}{1-\theta}$, $\xi(\theta) = -n \log(1 - \theta)$, and $h(x) = \binom{n}{x} I_{\{0,1,\ldots,n\}}(x)$).
If we let $\eta = \log\frac{\theta}{1-\theta}$, then $\Xi = \mathcal{R}$ and the family with p.d.f.'s
$$f_\eta(x) = \exp\{x\eta - n \log(1 + e^\eta)\} \binom{n}{x} I_{\{0,1,\ldots,n\}}(x)$$
is a natural exponential family of full rank.
From Theorem 2.1(iii), the m.g.f. of the binomial distribution $Bi(\theta, n)$ is
$$\psi_\eta(t) = \exp\{n \log(1 + e^{\eta+t}) - n \log(1 + e^\eta)\} = \left( \frac{1 + e^\eta e^t}{1 + e^\eta} \right)^n = (1 - \theta + \theta e^t)^n.$$
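A quick numerical confirmation of this m.g.f. formula (an illustrative sketch assuming SciPy, not part of the notes):

```python
# Check that (1 - theta + theta*e^t)^n matches E[e^{tX}] for X ~ Bi(theta, n).
import numpy as np
from scipy.stats import binom

theta, n, t = 0.35, 10, 0.4
x = np.arange(n + 1)
mgf_direct = np.sum(np.exp(t * x) * binom.pmf(x, n, theta))   # E[e^{tX}]
mgf_formula = (1 - theta + theta * np.exp(t)) ** n
assert np.isclose(mgf_direct, mgf_formula)
```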

Sufficient Statistics
Data reduction without loss of information: a statistic $T(X)$ provides a reduction of the $\sigma$-field $\sigma(X)$. Does such a reduction result in any loss of information concerning the unknown population?
If a statistic $T(X)$ is fully as informative as the original sample $X$, then statistical analyses can be done using $T(X)$, which is simpler than $X$. The next concept describes what we mean by fully informative.

Definition 2.4 (Sufficiency)
Let $X$ be a sample from an unknown population $P \in \mathcal{P}$, where $\mathcal{P}$ is a family of populations. A statistic $T(X)$ is said to be sufficient for $P \in \mathcal{P}$ (or for $\theta \in \Theta$ when $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is a parametric family) iff the conditional distribution of $X$ given $T$ is known (does not depend on $P$ or $\theta$).

Remarks
Once we observe $X$ and compute a sufficient statistic $T(X)$, the original data $X$ do not contain any further information concerning the unknown population $P$ (since the conditional distribution of $X$ given $T$ is unrelated to $P$) and can be discarded.
A sufficient statistic $T(X)$ contains all information about $P$ contained in $X$ and provides a reduction of the data if $T$ is not one-to-one.
The concept of sufficiency depends on the given family $\mathcal{P}$. If $T$ is sufficient for $P \in \mathcal{P}$, then $T$ is also sufficient for $P \in \mathcal{P}_0 \subset \mathcal{P}$, but not necessarily sufficient for $P \in \mathcal{P}_1 \supset \mathcal{P}$.

Example 2.10
Suppose that $X = (X_1, \ldots, X_n)$ and $X_1, \ldots, X_n$ are i.i.d. from the binomial distribution with the p.d.f. (w.r.t. the counting measure)
$$f_\theta(z) = \theta^z (1 - \theta)^{1-z} I_{\{0,1\}}(z), \qquad z \in \mathcal{R}, \ \theta \in (0, 1).$$
Consider the statistic $T(X) = \sum_{i=1}^n X_i$, which is the number of ones in $X$.

Example 2.10 (continued)
For any realization $x$ of $X$, $x$ is a sequence of $n$ ones and zeros. $T$ contains all information about $\theta$: $\theta$ is the probability of an occurrence of a one in $x$, and given $T = t$, what is left in the data set $x$ is the redundant information about the positions of the $t$ ones.
To show $T$ is sufficient for $\theta$, we compute $P(X = x \mid T = t)$. Let $t = 0, 1, \ldots, n$ and $B_t = \{(x_1, \ldots, x_n) : x_i = 0, 1, \ \sum_{i=1}^n x_i = t\}$. If $x \notin B_t$, then $P(X = x, T = t) = 0$. If $x \in B_t$, then
$$P(X = x, T = t) = \prod_{i=1}^n P(X_i = x_i) = \theta^t (1 - \theta)^{n-t} \prod_{i=1}^n I_{\{0,1\}}(x_i).$$
Also,
$$P(T = t) = \binom{n}{t} \theta^t (1 - \theta)^{n-t} I_{\{0,1,\ldots,n\}}(t).$$
Then
$$P(X = x \mid T = t) = \frac{P(X = x, T = t)}{P(T = t)} = \binom{n}{t}^{-1} I_{B_t}(x)$$
is a known p.d.f. (it does not depend on $\theta$). Hence $T(X)$ is sufficient for $\theta \in (0, 1)$ according to Definition 2.4.
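A small simulation (an added sketch, not from the notes) makes the conclusion concrete: for two different values of $\theta$, the empirical conditional distribution of $X$ given $T = t$ is the same uniform distribution over $B_t$:

```python
# Empirically, P(X = x | T = t) should be 1 / C(n, t) for every x in B_t,
# regardless of theta, because T = sum(X_i) is sufficient.
import numpy as np

rng = np.random.default_rng(0)
n, t = 4, 2

def conditional_freqs(theta, reps=200_000):
    X = (rng.random((reps, n)) < theta).astype(int)   # i.i.d. Bernoulli(theta) rows
    X = X[X.sum(axis=1) == t]                         # keep only samples with T = t
    codes = X @ (2 ** np.arange(n))                   # encode each 0-1 row as an integer
    counts = np.bincount(codes, minlength=2**n)
    return counts[counts > 0] / len(X)                # relative frequency of each x in B_t

print(conditional_freqs(0.3))   # each entry close to 1/6 = 1/C(4, 2)
print(conditional_freqs(0.7))   # same limiting frequencies, different theta
```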

How to find a sufficient statistic?
Finding a sufficient statistic by means of the definition is not convenient: it involves guessing a statistic $T$ that might be sufficient and computing the conditional distribution of $X$ given $T$. For families of populations having p.d.f.'s, a simple way of finding sufficient statistics is to use the factorization theorem.

Theorem 2.2 (The factorization theorem)
Suppose that $X$ is a sample from $P \in \mathcal{P}$ and $\mathcal{P}$ is a family of probability measures on $(\mathcal{R}^n, \mathcal{B}^n)$ dominated by a $\sigma$-finite measure $\nu$. Then $T(X)$ is sufficient for $P \in \mathcal{P}$ iff there are nonnegative Borel functions $h$ (which does not depend on $P$) on $(\mathcal{R}^n, \mathcal{B}^n)$ and $g_P$ (which depends on $P$) on the range of $T$ such that
$$\frac{dP}{d\nu}(x) = g_P\bigl(T(x)\bigr)\, h(x).$$
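For instance (a standard illustration, not from these slides), for $X_1, \ldots, X_n$ i.i.d. from the Poisson distribution with mean $\theta$, the joint p.d.f. w.r.t. counting measure factors as
$$\prod_{i=1}^n \frac{\theta^{x_i} e^{-\theta}}{x_i!} = \underbrace{\theta^{\sum_{i=1}^n x_i}\, e^{-n\theta}}_{g_\theta(T(x))} \ \underbrace{\prod_{i=1}^n \frac{1}{x_i!}}_{h(x)},$$
so Theorem 2.2 yields that $T(X) = \sum_{i=1}^n X_i$ is sufficient for $\theta$.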

Sufficient statistics in exponential families
If $\mathcal{P}$ is an exponential family, then Theorem 2.2 can be applied with $g_\theta(t) = \exp\{[\eta(\theta)]^\tau t - \xi(\theta)\}$, i.e., $T$ is a sufficient statistic for $\theta \in \Theta$.
In Example 2.10 the joint distribution of $X$ is in an exponential family with $T(X) = \sum_{i=1}^n X_i$. Hence, we can conclude that $T$ is sufficient for $\theta \in (0, 1)$ without computing the conditional distribution of $X$ given $T$.

Example 2.11 (Truncation families)
Let $\phi(x)$ be a positive Borel function on $(\mathcal{R}, \mathcal{B})$ such that $\int_a^b \phi(x)\, dx < \infty$ for any $a$ and $b$, $-\infty < a < b < \infty$. Let $\theta = (a, b)$, $\Theta = \{(a, b) \in \mathcal{R}^2 : a < b\}$, and
$$f_\theta(x) = c(\theta) \phi(x) I_{(a,b)}(x), \qquad c(\theta) = \left[ \int_a^b \phi(x)\, dx \right]^{-1}.$$

Example 2.11 (continued)
Then $\{f_\theta : \theta \in \Theta\}$, called a truncation family, is a parametric family dominated by the Lebesgue measure on $\mathcal{R}$.
Let $X_1, \ldots, X_n$ be i.i.d. random variables having the p.d.f. $f_\theta$. Then the joint p.d.f. of $X = (X_1, \ldots, X_n)$ is
$$\prod_{i=1}^n f_\theta(x_i) = [c(\theta)]^n I_{(a,\infty)}(x_{(1)})\, I_{(-\infty,b)}(x_{(n)}) \prod_{i=1}^n \phi(x_i), \tag{2}$$
where $x_{(i)}$ is the $i$th ordered value of $x_1, \ldots, x_n$.
Let $T(X) = (X_{(1)}, X_{(n)})$, $g_\theta(t_1, t_2) = [c(\theta)]^n I_{(a,\infty)}(t_1) I_{(-\infty,b)}(t_2)$, and $h(x) = \prod_{i=1}^n \phi(x_i)$. By (2) and Theorem 2.2, $T(X)$ is sufficient for $\theta \in \Theta$.
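A minimal numerical sketch (with the simplest choice $\phi \equiv 1$, so that $f_\theta$ is the $U(a, b)$ density; not part of the original notes) confirms that the joint density factors through $T(x) = (x_{(1)}, x_{(n)})$ as in (2):

```python
# With phi(x) = 1, c(theta) = 1/(b - a) and f_theta is the U(a, b) density.
# Check that prod f_theta(x_i) = g_theta(min(x), max(x)) * h(x) with h = 1.
import numpy as np

a, b, n = 1.0, 4.0, 5
rng = np.random.default_rng(1)
x = rng.uniform(a, b, size=n)

c = 1.0 / (b - a)
joint_direct = np.prod(c * ((x > a) & (x < b)))          # product of f_theta(x_i)
g = c**n * (x.min() > a) * (x.max() < b)                 # g_theta(T(x)); h(x) = 1
assert np.isclose(joint_direct, g)
```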

Example 2.12 (Order statistics)
Let $X = (X_1, \ldots, X_n)$ and $X_1, \ldots, X_n$ be i.i.d. random variables having a distribution $P \in \mathcal{P}$, where $\mathcal{P}$ is the family of distributions on $\mathcal{R}$ having Lebesgue p.d.f.'s. Let $X_{(1)}, \ldots, X_{(n)}$ be the order statistics given in Example 2.9. Note that the joint p.d.f. of $X$ is
$$f(x_1) \cdots f(x_n) = f(x_{(1)}) \cdots f(x_{(n)}).$$
Hence, $T(X) = (X_{(1)}, \ldots, X_{(n)})$ is sufficient for $P \in \mathcal{P}$.
The order statistics can be shown to be sufficient even when $\mathcal{P}$ is not dominated by any $\sigma$-finite measure, in which case Theorem 2.2 is not applicable (see Exercise 31 in §2.6).
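A one-line illustration of the identity above (added here; any family member works, the standard normal from SciPy is used purely as an example):

```python
# The joint p.d.f. of an i.i.d. sample is permutation-invariant, so evaluating
# it at x and at sort(x) gives the same value: f(x_1)...f(x_n) = f(x_(1))...f(x_(n)).
import numpy as np
from scipy.stats import norm

x = np.array([0.4, -1.2, 2.0])
joint = lambda v: np.prod(norm.pdf(v))
assert np.isclose(joint(x), joint(np.sort(x)))
```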