Statistical Inference
Robert L. Wolpert
Institute of Statistics and Decision Sciences
Duke University, Durham, NC, USA

1. Asymptotic Inference in Exponential Families

Let $X_j$ be a sequence of independent, identically distributed random variables from a natural exponential family
\[
  f(x \mid \eta) = h(x)\, e^{\eta' T(x) - A(\eta)},
\]
and let $q$ denote the dimension of $T$ and $\eta$. We have seen that the log moment generating function for $T(X)$ is $A(\omega + \eta) - A(\eta)$, and hence that the mean and covariance for $T$ are given respectively by
\[
  \mathsf{E}[T \mid \eta] = \nabla A(\eta), \qquad
  \mathsf{V}[T \mid \eta] = \nabla^2 A(\eta) = I(\eta),
  \tag{1}
\]
where we recognize the Hessian of $A(\eta)$ as the information matrix, both expected (Fisher) and observed. The likelihood function upon observing a sample of size $n$ is
\[
  L_n(\eta) = \Big[\prod_j h(x_j)\Big]\, e^{\eta' \sum_j T(x_j) - n A(\eta)},
\]
so the Maximum Likelihood Estimator (MLE) $\hat\eta_n$ of $\eta$ satisfies the equation
\[
  \nabla A(\hat\eta_n) = \bar T_n.
\]
Under suitable regularity conditions the Central Limit Theorem will ensure that $\bar T_n$ has an asymptotically normal distribution with (by (1)) mean $\nabla A(\eta)$ and covariance $I(\eta)/n$, so
\[
  \sqrt{n}\,\big[\nabla A(\hat\eta_n) - \nabla A(\eta)\big]
\]
will have an asymptotic $\mathrm{No}(0, I(\eta))$ distribution. By Taylor's Theorem we can write
\[
  \nabla A(\hat\eta_n) - \nabla A(\eta)
  = \nabla^2 A(\eta)\,(\hat\eta_n - \eta) + O(\|\hat\eta_n - \eta\|^2)
  = I(\eta)\,(\hat\eta_n - \eta) + O(1/n),
\]
from which we conclude
\[
  \sqrt{n}\,(\hat\eta_n - \eta) \Rightarrow \mathrm{No}\big(0,\ I(\eta)^{-1}\big)
\]
or, more casually, that $\hat\eta_n$ has approximately a $\mathrm{No}\big(\eta,\ [n\,I(\eta)]^{-1}\big)$ distribution, so the Maximum Likelihood Estimator is consistent, efficient, and asymptotically normal.

Unnatural Families

If we parametrize by some $\theta \in \Theta$ other than the natural parameter $\eta$, but still have a smooth mapping $\theta \mapsto \eta(\theta)$, we can note that $\hat\eta_n = \eta(\hat\theta_n)$ and (again, by Taylor)
\[
  \eta(\hat\theta_n) = \eta(\theta) + J\,(\hat\theta_n - \theta) + O(\|\hat\theta_n - \theta\|^2),
\]
where the Jacobian matrix $J$ is given by $J_{ij} = \partial \eta_j / \partial \theta_i$, so
\[
  \sqrt{n}\,(\hat\theta_n - \theta) \Rightarrow \mathrm{No}\big(0,\ [J\,I(\eta)\,J']^{-1}\big) = \mathrm{No}\big(0,\ I(\theta)^{-1}\big),
\]
and again we have consistency, efficiency, and asymptotic normality. In one-dimensional families this leads to Frequentist confidence intervals of the form
\[
  1 - \alpha \approx \Pr_\theta\Big[\hat\theta_n - z_{\alpha/2}\big/\sqrt{n\,I(\hat\theta_n)}
  \ <\ \theta\ <\ \hat\theta_n + z_{\alpha/2}\big/\sqrt{n\,I(\hat\theta_n)}\Big],
\]
where the $\approx$ is required both because the distribution of $\hat\theta_n$ is only approximately normal, and because $I(\hat\theta_n)$ is only approximately $I(\theta)$.

2. Bayesian Asymptotic Inference

The observed information for a single observation $X = x$ from the model $X \sim f(x \mid \theta)$ is
\[
  i(\theta, x) = -\frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta).
\]
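For the Bernoulli model with $\Pr[X = 1] = \theta$, the observed information and its average over the model can be written out directly. The following is a minimal Python sketch (the evaluation point $\theta = 0.3$ is an arbitrary illustrative choice):

```python
# Observed information for the Bernoulli model f(x|θ) = θ^x (1-θ)^(1-x):
# i(θ, x) = -∂²/∂θ² log f(x|θ) = x/θ² + (1-x)/(1-θ)².
def i_obs(theta, x):
    return x / theta**2 + (1 - x) / (1 - theta) ** 2

# Averaging i(θ, X) over X ~ Bernoulli(θ) should recover the Fisher
# information I(θ) = 1/(θ(1-θ)); checked here at θ = 0.3.
theta = 0.3
expected_info = theta * i_obs(theta, 1) + (1 - theta) * i_obs(theta, 0)
print(expected_info, 1 / (theta * (1 - theta)))   # both ≈ 4.7619
```

The two printed numbers agree, illustrating the relation between observed and expected information used below.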
Evidently the Fisher (expected) information is related to this by $I(\theta) = \mathsf{E}[\,i(\theta, X) \mid \theta\,]$. The likelihood for a sample of size $n$ is just the product of the individual likelihoods, leading to a sum for the log likelihoods, and observed information
\[
  i_n(\theta, x) = \sum_{j=1}^{n} i(\theta, x_j).
\]
If the log likelihood $\log f_n(x \mid \theta)$ is differentiable throughout $\Theta$ and attains a unique maximum at an interior point $\hat\theta_n(x) \in \Theta$, then we can expand $\log f_n(x \mid \theta)$ in a second-order Taylor series for $\theta = \hat\theta_n(x) + \epsilon/\sqrt{n}$ close to $\hat\theta_n(x)$ to find (for $q = 1$-dimensional $\theta$)
\[
\begin{aligned}
  \log f_n(x \mid \theta)
  &= \log f_n(x \mid \hat\theta_n)
   + \frac{(\epsilon/\sqrt{n})}{1!}\,\partial_\theta \log f_n(x \mid \hat\theta_n)
   + \frac{(\epsilon/\sqrt{n})^2}{2!}\,\partial_\theta^2 \log f_n(x \mid \hat\theta_n)
   + o\big((\epsilon/\sqrt{n})^2\,\partial_\theta^2 \log f_n(x \mid \hat\theta_n)\big) \\
  &= \log f_n(x \mid \hat\theta_n) + 0
   - \frac{\epsilon^2}{2}\cdot\frac{1}{n}\sum_{j=1}^{n} i(\hat\theta_n, x_j) + o(1) \\
  &\approx \log f_n(x \mid \hat\theta_n) - \frac{\epsilon^2}{2}\,\mathsf{E}[\,i(\hat\theta_n, X)\,]
   = \log f_n(x \mid \hat\theta_n) - \frac{\epsilon^2}{2}\,I(\hat\theta_n) \\
  &= \log f_n(x \mid \hat\theta_n) - \frac{1}{2}\,n\,I(\hat\theta_n)\,(\theta - \hat\theta_n)^2,
\end{aligned}
\]
where we have used the consistency of $\hat\theta_n$ and have applied the strong law of large numbers for $i(\theta, X)$. Thus we have the likelihood approximation
\[
  f_n(x \mid \theta) \approx \mathrm{No}\big(\hat\theta_n(x),\ n\,I(\hat\theta_n)\big),
\]
normal with mean the MLE $\hat\theta_n(x)$ and precision $n\,I(\hat\theta_n)$ (or covariance $[n\,I(\hat\theta_n)]^{-1}$). Note that the prior distribution is irrelevant, asymptotically, so long as it is smooth and doesn't vanish in a neighborhood of $\hat\theta_n$; thus we have the Bayesian Central Limit Theorem,
\[
  \pi_n(\theta \mid x) \approx \mathrm{No}\big(\hat\theta_n,\ [n\,I(\hat\theta_n)]^{-1}\big),
\]
leading in the $q = 1$-dimensional case to Bayesian credible intervals of the form
\[
  1 - \alpha \approx \Pr\Big[\hat\theta_n - z_{\alpha/2}\big/\sqrt{n\,I(\hat\theta_n)}
  \ <\ \theta\ <\ \hat\theta_n + z_{\alpha/2}\big/\sqrt{n\,I(\hat\theta_n)}\ \Big|\ x\Big].
\]
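The quality of this normal approximation is easy to inspect numerically. The Python sketch below takes the Bernoulli model with a flat prior, so the posterior is a Beta distribution, and compares it with its $\mathrm{No}\big(\hat\theta_n, [n\,I(\hat\theta_n)]^{-1}\big)$ approximation; the sample size and grid are illustrative choices:

```python
import math

def beta_pdf(x, a, b):
    # density of Beta(a, b), via log-gamma for numerical stability
    logc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(logc + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Bernoulli data with T_n = 50 successes in n = 100 trials, flat prior:
# the posterior is Beta(T_n + 1, n - T_n + 1), and the Bayesian CLT says
# it is approximately No(θ̂_n, [n I(θ̂_n)]⁻¹) with θ̂_n = 0.5, I(θ̂) = 1/(θ̂(1-θ̂)).
n, T = 100, 50
theta_hat = T / n
var = theta_hat * (1 - theta_hat) / n           # [n I(θ̂_n)]⁻¹ = θ̂(1-θ̂)/n

grid = [i / 1000 for i in range(350, 651)]      # θ from 0.35 to 0.65
err = max(abs(beta_pdf(x, T + 1, n - T + 1) - normal_pdf(x, theta_hat, var))
          for x in grid)
print(err)   # small compared with the peak density, which is ≈ 8
```

Already at $n = 100$ the worst pointwise discrepancy between the exact posterior density and its normal approximation is only about one percent of the peak height.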
Just as before, but now with a completely different interpretation.

3. An Example

Let $X_i$ be Bernoulli random variables with $\Pr[X_i = 1] = \theta$ for some $\theta \in \Theta = (0, 1)$; this is an exponential family with natural parameter $\eta = \eta(\theta) = \log\frac{\theta}{1 - \theta}$ and natural sufficient statistic (for a sample of size $n$) $T_n = \sum X_j$, hence with log normalizing factor
\[
  A(\eta) = \log(1 + e^{\eta}) = B(\theta) = -\log(1 - \theta),
\]
i.e. with likelihood
\[
  L(\theta) = \theta^{T_n} (1 - \theta)^{n - T_n} = e^{\eta T_n - n \log(1 + e^{\eta})},
\]
and hence with MLEs and ($1 \times 1$) information matrices
\[
  \hat\theta_n = T_n/n = \bar X_n, \qquad I_\theta = \frac{1}{\theta(1 - \theta)},
\]
\[
  \hat\eta_n = \log\frac{T_n}{n - T_n} = \log\frac{\bar X_n}{1 - \bar X_n}, \qquad
  I_\eta = \frac{e^{\eta}}{(1 + e^{\eta})^2} = \theta(1 - \theta),
\]
so $I_\eta(\hat\eta_n) = T_n(n - T_n)/n^2$, $I_\theta(\hat\theta_n) = n^2/\big(T_n(n - T_n)\big)$, and 95% confidence intervals would be
\[
\begin{aligned}
  0.95 &\approx \Pr\Big[\hat\eta_n - 1.96\big/\sqrt{n\,I_\eta(\hat\eta_n)}\ <\ \eta\ <\ \hat\eta_n + 1.96\big/\sqrt{n\,I_\eta(\hat\eta_n)}\Big] \\
  &= \Pr\Big[\log\tfrac{T_n}{n - T_n} - \tfrac{1.96}{\sqrt{T_n(n - T_n)/n}}\ <\ \eta\ <\ \log\tfrac{T_n}{n - T_n} + \tfrac{1.96}{\sqrt{T_n(n - T_n)/n}}\Big].
\end{aligned}
\]
This can be written as an interval $0.95 \approx \Pr[L_n < \theta < R_n]$ for $\theta = e^{\eta}/(1 + e^{\eta})$, with left and right endpoints
\[
\begin{aligned}
  L_n(T_n) &= T_n \Big/ \Big[T_n + (n - T_n)\, \exp\big(+1.96\big/\sqrt{T_n(n - T_n)/n}\big)\Big], \\
  R_n(T_n) &= T_n \Big/ \Big[T_n + (n - T_n)\, \exp\big(-1.96\big/\sqrt{T_n(n - T_n)/n}\big)\Big];
\end{aligned}
\]
for example, with $T_{100} = 10$ successes in $n = 100$ tries, the endpoints are
\[
  L_{100}(10) = 10\big/\big[10 + 90\, e^{1.96/\sqrt{9}}\big] = 1\big/\big[1 + 9\, e^{0.6533}\big] = 0.05465
\]
and $R_{100}(10) = 1/[1 + 9\, e^{-0.6533}] = 0.17597$, while with $T_{100} = 50$ the interval endpoints would be $L_{100}(50) = 50/[50 + 50\, e^{1.96/\sqrt{25}}] = 1/[1 + e^{0.392}] = 0.40324$ and $R_{100}(50) = 1/[1 + e^{-0.392}] = 0.59676$. Intervals for $\theta$ can be made directly using
\[
\begin{aligned}
  0.95 &\approx \Pr\Big[\hat\theta_n - 1.96\big/\sqrt{n\,I_\theta(\hat\theta_n)}\ <\ \theta\ <\ \hat\theta_n + 1.96\big/\sqrt{n\,I_\theta(\hat\theta_n)}\Big] \\
  &= \Pr\Big[\hat\theta_n - 1.96\sqrt{\hat\theta_n(1 - \hat\theta_n)/n}\ <\ \theta\ <\ \hat\theta_n + 1.96\sqrt{\hat\theta_n(1 - \hat\theta_n)/n}\Big].
\end{aligned}
\]
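Both interval constructions are easy to check numerically; a minimal Python sketch (using $z = 1.96$ as in the text):

```python
import math

def eta_interval(T, n, z=1.96):
    # 95% interval for θ built on the natural parameter η = log(θ/(1-θ)),
    # mapped back through θ = e^η/(1+e^η): the endpoints L_n(T), R_n(T).
    half = z / math.sqrt(T * (n - T) / n)        # z / sqrt(n I_η(η̂_n))
    return (T / (T + (n - T) * math.exp(half)),
            T / (T + (n - T) * math.exp(-half)))

def theta_interval(T, n, z=1.96):
    # direct Wald interval θ̂ ± z sqrt(θ̂(1-θ̂)/n)
    th = T / n
    half = z * math.sqrt(th * (1 - th) / n)
    return th - half, th + half

print(eta_interval(10, 100))     # ≈ (0.05465, 0.17597)
print(theta_interval(10, 100))   # ≈ (0.0412, 0.1588)
print(eta_interval(50, 100))     # ≈ (0.40324, 0.59676)
print(theta_interval(50, 100))   # ≈ (0.402, 0.598)
```

The printed endpoints reproduce the worked values in the text.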
This gives an interval with endpoints
\[
  L_{100}(10) = 0.10 - 1.96\sqrt{0.09/100} = 0.10 - 1.96 \times 0.030 = 0.0412, \qquad
  R_{100}(10) = 0.10 + 1.96 \times 0.030 = 0.1588
\]
for $T_{100} = 10$, and
\[
  L_{100}(50) = 0.50 - 1.96\sqrt{0.25/100} = 0.50 - 1.96 \times 0.050 = 0.402, \qquad
  R_{100}(50) = 0.50 + 1.96 \times 0.050 = 0.598
\]
for $T_{100} = 50$. Recall that the exact 95% confidence intervals for $\theta$ are
\[
  \big[\texttt{qbeta}(0.025,\ T_n,\ n - T_n + 1),\ \ \texttt{qbeta}(0.975,\ T_n + 1,\ n - T_n)\big]
\]
in general, or $[\texttt{qbeta}(0.025, 10, 91),\ \texttt{qbeta}(0.975, 11, 90)] = [0.0490, 0.1762]$ for $T_{100} = 10$ and $[\texttt{qbeta}(0.025, 50, 51),\ \texttt{qbeta}(0.975, 51, 50)] = [0.3983, 0.6017]$ for $T_{100} = 50$. In summary,

                            [L_100(10), R_100(10)]   [L_100(50), R_100(50)]
  Normal, based on η:       [0.05465,   0.17597]     [0.40324,   0.59676]
  Normal, based on θ:       [0.04120,   0.15880]     [0.40200,   0.59800]
  Exact Frequentist:        [0.04900,   0.17622]     [0.39832,   0.60168]
  Exact Bayes (1.0, 1.0):   [0.05564,   0.17456]     [0.40364,   0.59636]
  Exact Bayes (0.5, 0.5):   [0.05258,   0.17012]     [0.40317,   0.59683]
  Exact Bayes (0.0, 0.0):   [0.04951,   0.16557]     [0.40270,   0.59730]

Evidently the normal approximations to the Frequentist intervals are both rather close, but a bit too narrow (hence cover $\theta$ with a bit less than the promised 95% probability). Of course the approximations improve with increasing $n$ and become worse for smaller $n$. The intervals based on the natural parameter are very close to the Bayesian credible intervals. All the approximations are better near the middle of the range ($\theta \approx 1/2$) than near the endpoints.
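The exact (Clopper-Pearson) endpoints quoted above come from R's qbeta. For integer $T_n$ the same quantiles can be obtained without R by inverting the binomial tail probability, since $\Pr[\mathrm{Bin}(n, p) \ge T] = I_p(T, n - T + 1)$, the regularized incomplete beta function. A pure-Python sketch using bisection (assumes $0 < T < n$):

```python
import math

def binom_tail_ge(n, p, t):
    # Pr[Bin(n, p) >= t]
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t, n + 1))

def clopper_pearson(T, n, alpha=0.05):
    """Exact (Clopper-Pearson) 95% CI for a binomial proportion, 0 < T < n.
    Equivalent to [qbeta(alpha/2, T, n-T+1), qbeta(1-alpha/2, T+1, n-T)]."""
    def solve(too_big):
        lo, hi = 1e-9, 1 - 1e-9
        for _ in range(100):                  # bisection
            mid = (lo + hi) / 2
            if too_big(mid):
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2
    # lower endpoint L solves Pr[Bin(n, L) >= T] = alpha/2
    lower = solve(lambda p: binom_tail_ge(n, p, T) > alpha / 2)
    # upper endpoint U solves Pr[Bin(n, U) <= T] = alpha/2
    upper = solve(lambda p: 1 - binom_tail_ge(n, p, T + 1) < alpha / 2)
    return lower, upper

print(clopper_pearson(10, 100))   # ≈ (0.0490, 0.1762)
print(clopper_pearson(50, 100))   # ≈ (0.3983, 0.6017)
```

The bisection exploits the monotonicity of the binomial tail in $p$; each endpoint is located to machine precision in 100 halvings.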