MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD

Copyright © 2012 Iowa State University, Statistics 511
INFORMATION CRITERIA

Akaike's Information Criterion is given by $\text{AIC} = -2\,l(\hat{\theta}) + 2k$, where $l(\hat{\theta})$ is the maximized log likelihood and $k$ is the dimension of the model parameter space.
$\text{AIC} = -2\,l(\hat{\theta}) + 2k$ can be used to determine which of multiple models is best for a given data set. Small values of AIC are preferred. The $+2k$ portion of AIC can be viewed as a penalty for model complexity.
Schwarz's Bayesian Information Criterion is given by $\text{BIC} = -2\,l(\hat{\theta}) + k\log(n)$. BIC is the same as AIC except that the penalty for model complexity is greater for BIC (when $n \geq 8$) and grows with $n$.
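As an illustration (not part of the original notes), the sketch below fits two hypothetical normal models to simulated data and compares their AIC and BIC values; the simulated data, sample size, and candidate models are assumptions made only for this example.

```python
# A minimal sketch (assumed example): compare AIC and BIC for two normal
# models fit by maximum likelihood to simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=50)   # simulated data (assumption)
n = len(y)

# Model 1: N(0, sigma^2) -- mean fixed at 0, so k = 1 parameter (sigma^2).
sigma2_1 = np.mean(y**2)                      # MLE of sigma^2 when mu = 0
loglik_1 = np.sum(stats.norm.logpdf(y, loc=0.0, scale=np.sqrt(sigma2_1)))

# Model 2: N(mu, sigma^2) -- k = 2 parameters (mu, sigma^2).
mu_2 = np.mean(y)                             # MLE of mu
sigma2_2 = np.mean((y - mu_2)**2)             # MLE of sigma^2 (divide by n)
loglik_2 = np.sum(stats.norm.logpdf(y, loc=mu_2, scale=np.sqrt(sigma2_2)))

def aic(loglik, k):
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    return -2.0 * loglik + k * np.log(n)

print("Model 1: AIC =", aic(loglik_1, 1), " BIC =", bic(loglik_1, 1, n))
print("Model 2: AIC =", aic(loglik_2, 2), " BIC =", bic(loglik_2, 2, n))
# Smaller values are preferred under either criterion.
```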
AIC and BIC can each be used to compare models even if they are not nested (i.e., even if one is not a special case of the other as in our reduced vs. full model comparison discussed previously). However, if REML likelihoods are used, compared models must have the same model for the response mean. Different models for the mean would yield different error contrasts and therefore different datasets for computation of maximized REML likelihoods.
LARGE n THEORY FOR MLEs

Suppose $\theta$ is a $k \times 1$ parameter vector. Let $l(\theta)$ denote the log likelihood function. Under regularity conditions discussed in, e.g., Shao, J. (2003) Mathematical Statistics, 2nd Ed., Springer, New York, we have the following.
There is an estimator $\hat{\theta}$ that solves the likelihood equations
$$\frac{\partial l(\theta)}{\partial \theta} = 0$$
and is a (weakly) consistent estimator of $\theta$, i.e.,
$$\lim_{n \to \infty} \Pr\!\left[\,\|\hat{\theta} - \theta\| > \varepsilon\,\right] = 0 \quad \text{for any } \varepsilon > 0.$$
For sufficiently large $n$, $\hat{\theta} \;\overset{\cdot}{\sim}\; N\!\left(\theta,\, I^{-1}(\theta)\right)$, where
$$I(\theta) = E\!\left[\left(\frac{\partial l(\theta)}{\partial \theta}\right)\left(\frac{\partial l(\theta)}{\partial \theta}\right)'\right] = -E\!\left[\frac{\partial^2 l(\theta)}{\partial \theta\,\partial \theta'}\right] = \left[-E\!\left\{\frac{\partial^2 l(\theta)}{\partial \theta_i\,\partial \theta_j}\right\}\right]_{i,j \in \{1,\ldots,k\}}.$$
$I(\theta)$ is known as the Fisher Information matrix. $I(\theta)$ can be approximated by the observed Fisher Information matrix, which is given by
$$\hat{I}(\hat{\theta}) = -\left.\frac{\partial^2 l(\theta)}{\partial \theta\,\partial \theta'}\right|_{\theta = \hat{\theta}}.$$
$I(\theta)$ and $\hat{I}(\hat{\theta})$ may depend on unknown nuisance parameters. In such cases, nuisance parameters are replaced by consistent estimators.
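As an illustration (not part of the original notes), the sketch below computes the observed Fisher Information for an i.i.d. Poisson sample, both analytically and by a numerical second derivative of the log likelihood at the MLE; the Poisson model, simulated data, and finite-difference step size are assumptions made for this example.

```python
# A minimal sketch (assumed example): observed Fisher Information for an
# i.i.d. Poisson(lambda) sample, computed analytically and via a central
# finite difference of the log likelihood at the MLE.
import numpy as np
from scipy import special

rng = np.random.default_rng(1)
y = rng.poisson(lam=3.0, size=40)        # simulated data (assumption)
n = len(y)

def loglik(lam):
    # Poisson log likelihood: sum_i [ y_i log(lam) - lam - log(y_i!) ]
    return np.sum(y * np.log(lam) - lam - special.gammaln(y + 1))

lam_hat = np.mean(y)                     # MLE solves the likelihood equation

# Observed information: minus the second derivative of l at lam_hat.
h = 1e-5                                 # step size (assumption)
d2 = (loglik(lam_hat + h) - 2 * loglik(lam_hat) + loglik(lam_hat - h)) / h**2
obs_info_numeric = -d2
obs_info_analytic = np.sum(y) / lam_hat**2   # equals n / lam_hat here

print("numeric  observed information:", obs_info_numeric)
print("analytic observed information:", obs_info_analytic)
# An approximate variance for lam_hat is 1 / I-hat, i.e., lam_hat / n.
```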
A Simple Example

Suppose $y_1, \ldots, y_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$. If we are interested in inference for $\mu$, we can take $\theta = \mu$ and treat $\sigma^2$ as a nuisance parameter. It is straightforward to show that $\bar{y}$ is the unique solution to the likelihood equation. Furthermore, it is straightforward to show that $I(\theta) = \hat{I}(\hat{\theta}) = n/\sigma^2$.
A Simple Example (continued)

Thus, we have
$$\hat{\theta} = \bar{y} \;\overset{\cdot}{\sim}\; N\!\left(\theta = \mu,\; I^{-1}(\theta) = \sigma^2/n\right)$$
and
$$\bar{y} \;\overset{\cdot}{\sim}\; N(\mu,\, s^2/n)$$
for sufficiently large $n$, where $s^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 / (n-1)$.
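As a quick numerical check (not part of the original notes), the sketch below simulates repeated samples and compares the variance of $\bar{y}$ to $\sigma^2/n$; the true parameter values, sample size, and number of replications are assumptions made for the example.

```python
# A minimal sketch (assumed example): check that the sampling variance of
# ybar matches sigma^2 / n by simulation.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 5.0, 2.0, 30            # true values and sample size (assumptions)

ybars = np.array([rng.normal(mu, sigma, size=n).mean() for _ in range(10_000)])

print("simulated variance of ybar:", ybars.var())
print("theoretical sigma^2 / n   :", sigma**2 / n)
# In practice sigma^2 is unknown and s^2 / n is used in its place.
```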
WALD TESTS AND CONFIDENCE INTERVALS

Suppose for large $n$ that
$$\hat{\Sigma}^{-1/2}(\hat{\theta} - \theta) \;\overset{\cdot}{\sim}\; N(0, I)$$
and
$$(\hat{\theta} - \theta)'\,\hat{\Sigma}^{-1}(\hat{\theta} - \theta) \;\overset{\cdot}{\sim}\; \chi^2_k.$$
For example, suppose $\hat{\theta}$ is the MLE of $\theta$ and $\hat{I}(\hat{\theta})$ is the observed information matrix. Then under regularity conditions, we have
$$[\hat{I}(\hat{\theta})]^{1/2}(\hat{\theta} - \theta) \;\overset{\cdot}{\sim}\; N(0, I)$$
and
$$(\hat{\theta} - \theta)'\,\hat{I}(\hat{\theta})\,(\hat{\theta} - \theta) \;\overset{\cdot}{\sim}\; \chi^2_k$$
for sufficiently large $n$.
An approximate $100(1-\alpha)\%$ confidence interval for $\theta_i$ is
$$\hat{\theta}_i \pm z_{1-\alpha/2}\sqrt{\hat{\Sigma}_{ii}},$$
where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the $N(0,1)$ distribution and $\hat{\Sigma}_{ii}$ is element $(i,i)$ of $\hat{\Sigma}$.
An approximate p-value for testing $H_0: \theta = \theta_0$ is
$$\Pr\!\left[\chi^2_k \geq (\hat{\theta} - \theta_0)'\,\hat{\Sigma}^{-1}(\hat{\theta} - \theta_0)\right],$$
where $\chi^2_k$ is a central $\chi^2$ random variable with $k$ degrees of freedom.
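As an illustration (not part of the original notes), the sketch below forms a Wald confidence interval and Wald p-value for a scalar Poisson mean, using the MLE and the inverse observed information as $\hat{\Sigma}$; the model, simulated data, and null value are assumptions made for the example.

```python
# A minimal sketch (assumed example): Wald confidence interval and p-value
# for a scalar parameter (a Poisson mean) from simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.poisson(lam=3.0, size=40)        # simulated data (assumption)
n = len(y)

lam_hat = y.mean()                        # MLE
var_hat = lam_hat / n                     # inverse observed information
alpha = 0.05

# Approximate 100(1 - alpha)% Wald interval: lam_hat +/- z_{1-alpha/2} * SE.
z = stats.norm.ppf(1 - alpha / 2)
ci = (lam_hat - z * np.sqrt(var_hat), lam_hat + z * np.sqrt(var_hat))

# Approximate p-value for H0: lambda = 3 using the chi-square form with k = 1.
lam0 = 3.0
wald_stat = (lam_hat - lam0) ** 2 / var_hat
p_value = stats.chi2.sf(wald_stat, df=1)  # Pr[chi^2_1 >= observed statistic]

print("95% Wald CI:", ci)
print("Wald statistic:", wald_stat, " p-value:", p_value)
```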
MULTIVARIATE DELTA METHOD

Suppose $g$ is a function from $\mathbb{R}^k$ to $\mathbb{R}^m$, i.e.,
$$g(\theta) = \begin{bmatrix} g_1(\theta) \\ g_2(\theta) \\ \vdots \\ g_m(\theta) \end{bmatrix} \quad \text{for } \theta \in \mathbb{R}^k,$$
for some functions $g_1, \ldots, g_m$. Suppose $g$ is differentiable with derivative matrix
$$D = \begin{bmatrix} \dfrac{\partial g_1(\theta)}{\partial \theta_1} & \cdots & \dfrac{\partial g_m(\theta)}{\partial \theta_1} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial g_1(\theta)}{\partial \theta_k} & \cdots & \dfrac{\partial g_m(\theta)}{\partial \theta_k} \end{bmatrix}.$$
Now suppose $\hat{\theta}$ has mean $\theta$ and variance $\Sigma$. Then Taylor's Theorem implies
$$g(\hat{\theta}) \approx g(\theta) + D'(\hat{\theta} - \theta),$$
which implies
$$E[g(\hat{\theta})] \approx g(\theta) + D'\,E(\hat{\theta} - \theta) = g(\theta)$$
and
$$\operatorname{Var}[g(\hat{\theta})] \approx \operatorname{Var}[g(\theta) + D'(\hat{\theta} - \theta)] = D'\Sigma D.$$
If $\hat{\theta} \;\overset{\cdot}{\sim}\; N(\theta, \Sigma)$, it follows that $g(\hat{\theta}) \;\overset{\cdot}{\sim}\; N(g(\theta), D'\Sigma D)$. In practice, we often need to estimate $D$ by replacing $\theta$ in $D$ with $\hat{\theta}$ to obtain $\hat{D}$. Similarly, we often need to replace $\Sigma$ with an estimate $\hat{\Sigma}$.
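As an illustration (not part of the original notes), the sketch below applies the delta method to the ratio of two independent sample means, $g(\theta) = \theta_1/\theta_2$, plugging $\hat{D}$ and $\hat{\Sigma}$ into $D'\Sigma D$; the simulated samples and the choice of $g$ are assumptions made for the example.

```python
# A minimal sketch (assumed example): multivariate delta method for the ratio
# g(theta) = theta_1 / theta_2 of two independent sample means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(10.0, 2.0, size=50)     # sample whose mean is theta_1 (assumption)
w = rng.normal(4.0, 1.0, size=60)      # sample whose mean is theta_2 (assumption)

theta_hat = np.array([x.mean(), w.mean()])
# Independent means, so Sigma-hat is diagonal with entries s^2 / n.
Sigma_hat = np.diag([x.var(ddof=1) / len(x), w.var(ddof=1) / len(w)])

# g(theta) = theta_1 / theta_2; D is the k x m derivative matrix (here 2 x 1):
#   D = [ dg/dtheta_1, dg/dtheta_2 ]' = [ 1/theta_2, -theta_1/theta_2^2 ]'
g_hat = theta_hat[0] / theta_hat[1]
D_hat = np.array([[1.0 / theta_hat[1]],
                  [-theta_hat[0] / theta_hat[1] ** 2]])

var_g_hat = (D_hat.T @ Sigma_hat @ D_hat).item()   # D' Sigma D
z = stats.norm.ppf(0.975)
ci = (g_hat - z * np.sqrt(var_g_hat), g_hat + z * np.sqrt(var_g_hat))
print("ratio estimate:", g_hat, " approximate 95% CI:", ci)
```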
LIKELIHOOD RATIO BASED INFERENCE

Suppose we wish to test the null hypothesis that a reduced model provides an adequate fit to a dataset relative to a more general full model that includes the reduced model as a special case.
Define
$$\Lambda = \frac{\text{Reduced Model Maximized Likelihood}}{\text{Full Model Maximized Likelihood}}.$$
$\Lambda$ is known as the likelihood ratio. $-2\log\Lambda$ is known as the likelihood ratio test statistic. Tests based on $-2\log\Lambda$ are called likelihood ratio tests.
Under the regularity conditions in Shao (2003) mentioned previously, the likelihood ratio test statistic $-2\log\Lambda$ is approximately distributed as central $\chi^2_{k_f - k_r}$ under the null hypothesis, where $k_f$ and $k_r$ are the dimensions of the parameter space under the full and reduced models, respectively. This approximation can be reasonable if $n$ is sufficiently large.
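As an illustration (not part of the original notes), the sketch below carries out a likelihood ratio test of a reduced model $N(0, \sigma^2)$ against a full model $N(\mu, \sigma^2)$, so that $k_f - k_r = 1$; the simulated data and the pair of models are assumptions made for the example.

```python
# A minimal sketch (assumed example): likelihood ratio test of a reduced model
# N(0, sigma^2) against a full model N(mu, sigma^2) for simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = rng.normal(loc=0.5, scale=1.0, size=40)   # simulated data (assumption)
n = len(y)

# Full model: MLEs are ybar and the divide-by-n variance estimate.
mu_hat = y.mean()
s2_full = np.mean((y - mu_hat) ** 2)
loglik_full = np.sum(stats.norm.logpdf(y, mu_hat, np.sqrt(s2_full)))

# Reduced model (H0: mu = 0): MLE of sigma^2 is the mean of y^2.
s2_red = np.mean(y ** 2)
loglik_red = np.sum(stats.norm.logpdf(y, 0.0, np.sqrt(s2_red)))

# -2 log(Lambda), compared to chi^2 with k_f - k_r = 2 - 1 = 1 df.
lrt_stat = -2.0 * (loglik_red - loglik_full)
p_value = stats.chi2.sf(lrt_stat, df=1)
print("-2 log Lambda:", lrt_stat, " approximate p-value:", p_value)
```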
Likelihood Ratio Tests and Confidence Regions for a Subvector of the Full Model Parameter Vector $\theta$

Suppose $\theta$ is a $k \times 1$ vector and is partitioned into vectors $\theta_1$ ($k_1 \times 1$) and $\theta_2$ ($k_2 \times 1$), where $k = k_1 + k_2$ and
$$\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}.$$
Consider a test of $H_0: \theta_1 = \theta_{10}$.
Suppose $\hat{\theta}$ is the MLE of $\theta$ and $\hat{\theta}_2(\theta_1)$ maximizes
$$l\!\left(\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}\right)$$
over $\theta_2$ for any fixed value of $\theta_1$. Then
$$2\left[\,l(\hat{\theta}) - l\!\left(\begin{bmatrix} \theta_{10} \\ \hat{\theta}_2(\theta_{10}) \end{bmatrix}\right)\right]$$
is approximately $\chi^2_{k_1}$ under the null hypothesis by our previous result when $n$ is sufficiently large.
Also,
$$\Pr\left\{ 2\left[\,l(\hat{\theta}) - l\!\left(\begin{bmatrix} \theta_1 \\ \hat{\theta}_2(\theta_1) \end{bmatrix}\right)\right] \leq \chi^2_{k_1, 1-\alpha} \right\} \approx 1 - \alpha,$$
which implies
$$\Pr\left\{ l\!\left(\begin{bmatrix} \theta_1 \\ \hat{\theta}_2(\theta_1) \end{bmatrix}\right) \geq l(\hat{\theta}) - \tfrac{1}{2}\chi^2_{k_1, 1-\alpha} \right\} \approx 1 - \alpha.$$
Thus, the set of values of $\theta_1$ that, when the likelihood is maximized over $\theta_2$, yield a maximized log likelihood within $\tfrac{1}{2}\chi^2_{k_1, 1-\alpha}$ of the log likelihood maximized over all $\theta$ forms a $100(1-\alpha)\%$ confidence region for $\theta_1$. Such a confidence region is known as a profile likelihood confidence region because
$$l\!\left(\begin{bmatrix} \theta_1 \\ \hat{\theta}_2(\theta_1) \end{bmatrix}\right)$$
is the profile log likelihood for $\theta_1$.
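As an illustration (not part of the original notes), the sketch below computes a profile likelihood confidence interval for the mean $\mu$ of a normal sample, treating $\sigma^2$ as the nuisance parameter $\theta_2$ and profiling it out over a grid of $\mu$ values; the simulated data and the grid are assumptions made for the example.

```python
# A minimal sketch (assumed example): profile likelihood confidence interval
# for the mean mu of a normal sample, profiling out sigma^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
y = rng.normal(loc=2.0, scale=1.5, size=30)   # simulated data (assumption)
n = len(y)

def profile_loglik(mu):
    # For fixed mu, the maximizing sigma^2 is mean((y - mu)^2).
    s2 = np.mean((y - mu) ** 2)
    return np.sum(stats.norm.logpdf(y, mu, np.sqrt(s2)))

mu_hat = y.mean()                       # overall MLE of mu
l_max = profile_loglik(mu_hat)          # log likelihood maximized over all theta
cutoff = l_max - 0.5 * stats.chi2.ppf(0.95, df=1)

# The profile likelihood confidence region is the set of mu values whose
# profile log likelihood stays within (1/2) chi^2_{1, 0.95} of the maximum.
grid = np.linspace(mu_hat - 3, mu_hat + 3, 2001)
inside = grid[np.array([profile_loglik(m) for m in grid]) >= cutoff]
print("approximate 95% profile interval: (%.3f, %.3f)"
      % (inside.min(), inside.max()))
```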
A Warning

The normal and $\chi^2$ approximations mentioned in these notes may be crude if sample sizes are not sufficiently large. The regularity conditions mentioned in these notes do not hold if the true parameter falls on the boundary of the parameter space. Thus, as an example, testing $H_0: \sigma^2_u = 0$ is not covered by the methods presented here.