MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright © 2012 (Iowa State University) Statistics / 30

INFORMATION CRITERIA

Akaike's Information Criterion is given by

$$\mathrm{AIC} = -2\,l(\hat{\theta}) + 2k,$$

where $l(\hat{\theta})$ is the maximized log likelihood and $k$ is the dimension of the model parameter space.

$\mathrm{AIC} = -2\,l(\hat{\theta}) + 2k$ can be used to determine which of multiple models is best for a given data set. Small values of AIC are preferred. The $+2k$ portion of AIC can be viewed as a penalty for model complexity.

Schwarz's Bayesian Information Criterion is given by

$$\mathrm{BIC} = -2\,l(\hat{\theta}) + k \log(n).$$

BIC is the same as AIC except that the penalty for model complexity is greater for BIC (when $n \ge 8$, since $\log(8) \approx 2.08 > 2$) and grows with $n$.

AIC and BIC can each be used to compare models even if they are not nested (i.e., even if one is not a special case of the other as in our reduced vs. full model comparison discussed previously). However, if REML likelihoods are used, compared models must have the same model for the response mean. Different models for the mean would yield different error contrasts and therefore different datasets for computation of maximized REML likelihoods.
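As a minimal sketch of how the two criteria might be computed and compared, the following Python example fits two non-nested-or-nested normal models by maximum likelihood (not REML) to made-up data. The data values and the function names `normal_max_loglik`, `aic`, and `bic` are illustrative assumptions, not part of the original notes.

```python
import math

def normal_max_loglik(rss, n):
    """Maximized normal log likelihood when the ML variance estimate is rss / n."""
    sigma2_hat = rss / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2_hat) + 1)

def aic(loglik, k):
    """AIC = -2 l(theta_hat) + 2k."""
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    """BIC = -2 l(theta_hat) + k log(n)."""
    return -2 * loglik + k * math.log(n)

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.8, 9.1, 9.9, 11.2]
n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Model A: constant mean; parameters are (mean, variance), so k = 2.
rss_a = sum((yi - y_bar) ** 2 for yi in y)

# Model B: simple linear regression; parameters are
# (intercept, slope, variance), so k = 3.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar
rss_b = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))

loglik_a = normal_max_loglik(rss_a, n)
loglik_b = normal_max_loglik(rss_b, n)
aic_a, aic_b = aic(loglik_a, 2), aic(loglik_b, 3)
bic_a, bic_b = bic(loglik_a, 2, n), bic(loglik_b, 3, n)

print("Model A: AIC = %.2f, BIC = %.2f" % (aic_a, bic_a))
print("Model B: AIC = %.2f, BIC = %.2f" % (aic_b, bic_b))
```

Because both models are fit by ordinary maximum likelihood to the same responses, comparing their AIC and BIC values is legitimate even though the mean structures differ; the restriction on the mean model applies only when REML likelihoods are used.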

LARGE n THEORY FOR MLEs

Suppose $\theta$ is a $k \times 1$ parameter vector. Let $l(\theta)$ denote the log likelihood function. Under regularity conditions discussed in, e.g., Shao, J. (2003) Mathematical Statistics, 2nd Ed. Springer, New York, we have the following.

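There is an estimator $\hat{\theta}$ that solves the likelihood equations

$$\frac{\partial l(\theta)}{\partial \theta} = 0$$

and is a (weakly) consistent estimator of $\theta$, i.e.,

$$\lim_{n \to \infty} \Pr\left[\|\hat{\theta} - \theta\| > \varepsilon\right] = 0 \quad \text{for any } \varepsilon > 0.$$

For sufficiently large $n$, $\hat{\theta}$ is approximately distributed as $N(\theta, I^{-1}(\theta))$, written $\hat{\theta} \overset{\cdot}{\sim} N(\theta, I^{-1}(\theta))$, where

$$I(\theta) = E\left[\left(\frac{\partial l(\theta)}{\partial \theta}\right)\left(\frac{\partial l(\theta)}{\partial \theta}\right)'\right] = -E\left[\frac{\partial^2 l(\theta)}{\partial \theta\, \partial \theta'}\right] = -E\left[\left\{\frac{\partial^2 l(\theta)}{\partial \theta_i\, \partial \theta_j}\right\}_{i,j \in \{1,\ldots,k\}}\right].$$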

$I(\theta)$ is known as the Fisher Information matrix. $I(\theta)$ can be approximated by the observed Fisher Information matrix, which is given by

$$\hat{I}(\hat{\theta}) \equiv -\left.\frac{\partial^2 l(\theta)}{\partial \theta\, \partial \theta'}\right|_{\theta = \hat{\theta}}.$$

$I(\theta)$ and $\hat{I}(\hat{\theta})$ may depend on unknown nuisance parameters. In such cases, nuisance parameters are replaced by consistent estimators.

A Simple Example

Suppose $y_1, \ldots, y_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$. If we are interested in inference for $\mu$, we can take $\theta = \mu$ and treat $\sigma^2$ as a nuisance parameter. It is straightforward to show that $\bar{y}$ is the unique solution to the likelihood equation. Furthermore, it is straightforward to show that $I(\theta) = \hat{I}(\hat{\theta}) = n/\sigma^2$.

A Simple Example (continued)

Thus, we have

$$\hat{\theta} = \bar{y} \overset{\cdot}{\sim} N\left(\theta = \mu,\; I^{-1}(\theta) = \sigma^2/n\right)$$

and

$$\bar{y} \overset{\cdot}{\sim} N(\mu, s^2/n)$$

for sufficiently large $n$, where

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}.$$
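The claim that the observed information equals $n/\sigma^2$ in this example can be checked numerically. The sketch below is only an illustration: the data are made up, $\sigma^2$ is treated as known for simplicity, and the helper names `normal_loglik` and `observed_information` are hypothetical. It approximates the negative second derivative of the log likelihood at $\hat{\theta} = \bar{y}$ with a central finite difference.

```python
import math

def normal_loglik(mu, y, sigma2):
    """Log likelihood of i.i.d. N(mu, sigma2) data as a function of mu."""
    n = len(y)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((yi - mu) ** 2 for yi in y) / (2 * sigma2))

def observed_information(loglik, theta_hat, h=1e-3):
    """Negative second derivative of a scalar log likelihood at theta_hat,
    approximated by a central finite difference with step h."""
    return -(loglik(theta_hat + h) - 2 * loglik(theta_hat)
             + loglik(theta_hat - h)) / h ** 2

# Made-up data; sigma2 is assumed known purely for illustration.
y = [4.9, 5.6, 4.4, 6.1, 5.0, 5.3, 4.7, 5.8]
sigma2 = 0.25
n = len(y)

mu_hat = sum(y) / n  # solves the likelihood equation
info = observed_information(lambda m: normal_loglik(m, y, sigma2), mu_hat)

# Sample variance s^2, used when sigma2 is an unknown nuisance parameter.
s2 = sum((yi - mu_hat) ** 2 for yi in y) / (n - 1)

print("numerical observed information:", info)
print("n / sigma2:", n / sigma2)
print("estimated variance of y-bar, s^2 / n:", s2 / n)
```

Because this log likelihood is exactly quadratic in $\mu$, the finite-difference value agrees with $n/\sigma^2$ up to rounding error; for other models the two would agree only approximately.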

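WALD TESTS AND CONFIDENCE INTERVALS

Suppose for large $n$ that

$$\hat{\Sigma}^{-1/2}(\hat{\theta} - \theta) \overset{\cdot}{\sim} N(0, I)$$

and

$$(\hat{\theta} - \theta)'\,\hat{\Sigma}^{-1}(\hat{\theta} - \theta) \overset{\cdot}{\sim} \chi^2_k.$$

Here $I$ in $N(0, I)$ denotes the $k \times k$ identity matrix, not the Fisher Information matrix.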

For example, suppose $\hat{\theta}$ is the MLE of $\theta$ and $\hat{I}(\hat{\theta})$ is the observed information matrix. Then under regularity conditions, we have

$$[\hat{I}(\hat{\theta})]^{1/2}(\hat{\theta} - \theta) \overset{\cdot}{\sim} N(0, I)$$

and

$$(\hat{\theta} - \theta)'\,\hat{I}(\hat{\theta})\,(\hat{\theta} - \theta) \overset{\cdot}{\sim} \chi^2_k$$

for sufficiently large $n$.

An approximate $100(1 - \alpha)\%$ confidence interval for $\theta_i$ is

$$\hat{\theta}_i \pm z_{1 - \alpha/2} \sqrt{\hat{\Sigma}_{ii}},$$

where $z_{1 - \alpha/2}$ is the $1 - \alpha/2$ quantile of the $N(0, 1)$ distribution and $\hat{\Sigma}_{ii}$ is element $(i, i)$ of $\hat{\Sigma}$.

An approximate p-value for testing $H_0: \theta = \theta_0$ is

$$\Pr\left[\chi^2_k \ge (\hat{\theta} - \theta_0)'\,\hat{\Sigma}^{-1}(\hat{\theta} - \theta_0)\right],$$

where $\chi^2_k$ is a central $\chi^2$ random variable with $k$ degrees of freedom.
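The Wald interval and p-value can be sketched in a few lines of Python for the scalar case $k = 1$. This is an illustration under stated assumptions: the estimate and variance are made-up numbers, the function names `wald_interval` and `wald_pvalue_1df` are hypothetical, and the p-value uses the identity $\Pr[\chi^2_1 \ge x] = \operatorname{erfc}(\sqrt{x/2})$, which holds only for one degree of freedom.

```python
import math
from statistics import NormalDist

def wald_interval(estimate, variance, alpha=0.05):
    """Approximate 100(1 - alpha)% Wald interval: estimate +/- z * sqrt(variance)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width

def wald_pvalue_1df(estimate, null_value, variance):
    """Wald statistic and approximate p-value for a scalar parameter (k = 1).

    For one degree of freedom, Pr[chi2_1 >= x] = erfc(sqrt(x / 2)).
    """
    stat = (estimate - null_value) ** 2 / variance
    return stat, math.erfc(math.sqrt(stat / 2))

# Made-up values: an estimate with estimated variance 0.04 (standard error 0.2).
theta_hat = 5.2
sigma_hat_11 = 0.04
theta_0 = 5.0

lower, upper = wald_interval(theta_hat, sigma_hat_11)
stat, p_value = wald_pvalue_1df(theta_hat, theta_0, sigma_hat_11)

print("95%% Wald interval: (%.3f, %.3f)" % (lower, upper))
print("Wald statistic: %.3f, approximate p-value: %.4f" % (stat, p_value))
```

For $k > 1$ the same statistic is a quadratic form in $\hat{\Sigma}^{-1}$ and the reference distribution is $\chi^2_k$, which requires a general chi-square tail function rather than the one-degree-of-freedom shortcut used here.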

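MULTIVARIATE DELTA METHOD

Suppose $g$ is a function from $\mathbb{R}^k$ to $\mathbb{R}^m$, i.e., for $\theta \in \mathbb{R}^k$,

$$g(\theta) = \begin{bmatrix} g_1(\theta) \\ g_2(\theta) \\ \vdots \\ g_m(\theta) \end{bmatrix}$$

for some functions $g_1, \ldots, g_m$. Suppose $g$ is differentiable with derivative matrix

$$D \equiv \begin{bmatrix} \dfrac{\partial g_1(\theta)}{\partial \theta_1} & \cdots & \dfrac{\partial g_m(\theta)}{\partial \theta_1} \\ \vdots & & \vdots \\ \dfrac{\partial g_1(\theta)}{\partial \theta_k} & \cdots & \dfrac{\partial g_m(\theta)}{\partial \theta_k} \end{bmatrix}.$$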

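Now suppose $\hat{\theta}$ has mean $\theta$ and variance $\Sigma$. Then Taylor's Theorem implies

$$g(\hat{\theta}) \approx g(\theta) + D'(\hat{\theta} - \theta),$$

which implies

$$E[g(\hat{\theta})] \approx g(\theta) + D'E(\hat{\theta} - \theta) = g(\theta)$$

and

$$\mathrm{Var}[g(\hat{\theta})] \approx \mathrm{Var}[g(\theta) + D'(\hat{\theta} - \theta)] = D'\Sigma D.$$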

If $\hat{\theta} \overset{\cdot}{\sim} N(\theta, \Sigma)$, it follows that

$$g(\hat{\theta}) \overset{\cdot}{\sim} N(g(\theta), D'\Sigma D).$$

In practice, we often need to estimate $D$ by replacing $\theta$ in $D$ with $\hat{\theta}$ to obtain $\hat{D}$. Similarly, we often need to replace $\Sigma$ with an estimate $\hat{\Sigma}$.
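A small numerical sketch of the delta method follows, written in Python. It assumes a made-up estimate $\hat{\theta}$ and variance estimate $\hat{\Sigma}$ with $k = 2$, takes $g(\theta) = \theta_1/\theta_2$ (so $m = 1$) as an arbitrary example, and approximates $\hat{D}$ by central finite differences; the helper names `numerical_derivative_matrix` and `delta_method_variance` are hypothetical.

```python
def numerical_derivative_matrix(g, theta, h=1e-6):
    """Approximate the k x m matrix D with D[i][j] = d g_j / d theta_i,
    using central finite differences with step h."""
    rows = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += h
        down[i] -= h
        g_up, g_down = g(up), g(down)
        rows.append([(a - b) / (2 * h) for a, b in zip(g_up, g_down)])
    return rows

def delta_method_variance(D, Sigma):
    """Compute the m x m matrix D' Sigma D from a k x m D and a k x k Sigma."""
    k, m = len(D), len(D[0])
    return [[sum(D[i][a] * Sigma[i][j] * D[j][b]
                 for i in range(k) for j in range(k))
             for b in range(m)]
            for a in range(m)]

def g(theta):
    """Example transformation: the ratio theta_1 / theta_2 (returned as a list)."""
    return [theta[0] / theta[1]]

# Made-up estimate and estimated variance matrix, for illustration only.
theta_hat = [2.0, 4.0]
Sigma_hat = [[0.04, 0.01],
             [0.01, 0.09]]

D_hat = numerical_derivative_matrix(g, theta_hat)
var_g = delta_method_variance(D_hat, Sigma_hat)

print("g(theta_hat):", g(theta_hat)[0])
print("approximate variance of g(theta_hat):", var_g[0][0])
```

For this choice of $g$ the analytic derivative matrix is $D = (1/\theta_2,\; -\theta_1/\theta_2^2)'$, so the numerical result can be checked by hand; when analytic derivatives are easy to write down, using them directly avoids the finite-difference step.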

LIKELIHOOD RATIO BASED INFERENCE

Suppose we wish to test the null hypothesis that a reduced model provides an adequate fit to a dataset relative to a more general full model that includes the reduced model as a special case.

Define $\Lambda$ as

$$\Lambda = \frac{\text{Reduced Model Maximized Likelihood}}{\text{Full Model Maximized Likelihood}}.$$

$\Lambda$ is known as the likelihood ratio. $-2 \log \Lambda$ is known as the likelihood ratio test statistic. Tests based on $-2 \log \Lambda$ are called likelihood ratio tests.

Under the regularity conditions in Shao (2003) mentioned previously, the likelihood ratio test statistic $-2 \log \Lambda$ is approximately distributed as central $\chi^2_{k_f - k_r}$ under the null hypothesis, where $k_f$ and $k_r$ are the dimensions of the parameter space under the full and reduced models, respectively. This approximation can be reasonable if $n$ is sufficiently large.
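To make the likelihood ratio test concrete, here is a hedged Python sketch using a model chosen purely for illustration: two groups of made-up Poisson counts, with a reduced model that assumes a common rate ($k_r = 1$) and a full model with a separate rate per group ($k_f = 2$). The function name `poisson_loglik` and the data are assumptions, and the p-value relies on the one-degree-of-freedom identity $\Pr[\chi^2_1 \ge x] = \operatorname{erfc}(\sqrt{x/2})$.

```python
import math

def poisson_loglik(counts, rate):
    """Log likelihood of i.i.d. Poisson(rate) counts."""
    return sum(c * math.log(rate) - rate - math.lgamma(c + 1) for c in counts)

# Made-up counts for two groups.
group1 = [3, 5, 4, 6, 2, 4]
group2 = [7, 9, 6, 8, 10, 8]
pooled = group1 + group2

# Reduced model: one common rate; its MLE is the pooled sample mean.
rate_reduced = sum(pooled) / len(pooled)
loglik_reduced = poisson_loglik(pooled, rate_reduced)

# Full model: separate rates; each MLE is the group sample mean.
rate1 = sum(group1) / len(group1)
rate2 = sum(group2) / len(group2)
loglik_full = poisson_loglik(group1, rate1) + poisson_loglik(group2, rate2)

# -2 log(Lambda) = -2 [l(reduced) - l(full)], compared with chi-square on
# k_f - k_r = 2 - 1 = 1 degree of freedom.
lrt_stat = -2 * (loglik_reduced - loglik_full)
p_value = math.erfc(math.sqrt(lrt_stat / 2))

print("-2 log Lambda = %.4f" % lrt_stat)
print("approximate p-value = %.4f" % p_value)
```

The statistic is always nonnegative because the full model's maximized likelihood can never be smaller than that of a reduced model nested within it.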

Likelihood Ratio Tests and Confidence Regions for a Subvector of the Full Model Parameter Vector $\theta$

Suppose $\theta$ is a $k \times 1$ vector and is partitioned into vectors $\theta_1$ ($k_1 \times 1$) and $\theta_2$ ($k_2 \times 1$), where $k = k_1 + k_2$ and

$$\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}.$$

Consider a test of $H_0: \theta_1 = \theta_{10}$.

Suppose $\hat{\theta}$ is the MLE of $\theta$ and $\hat{\theta}_2(\theta_1)$ maximizes $l\left(\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}\right)$ over $\theta_2$ for any fixed value of $\theta_1$. Then

$$2\left[l(\hat{\theta}) - l\left(\begin{bmatrix} \theta_{10} \\ \hat{\theta}_2(\theta_{10}) \end{bmatrix}\right)\right]$$

is approximately $\chi^2_{k_1}$ under the null hypothesis by our previous result when $n$ is sufficiently large.

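Also,

$$\Pr\left\{2\left[l(\hat{\theta}) - l\left(\begin{bmatrix} \theta_1 \\ \hat{\theta}_2(\theta_1) \end{bmatrix}\right)\right] \le \chi^2_{k_1, 1 - \alpha}\right\} \approx 1 - \alpha,$$

which implies

$$\Pr\left\{l\left(\begin{bmatrix} \theta_1 \\ \hat{\theta}_2(\theta_1) \end{bmatrix}\right) \ge l(\hat{\theta}) - \frac{1}{2}\chi^2_{k_1, 1 - \alpha}\right\} \approx 1 - \alpha.$$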

Thus, the set of values of $\theta_1$ that, when maximizing over $\theta_2$, yield a maximized log likelihood within $\frac{1}{2}\chi^2_{k_1, 1 - \alpha}$ of the log likelihood maximized over all $\theta$ forms a $100(1 - \alpha)\%$ confidence region for $\theta_1$. Such a confidence region is known as a profile likelihood confidence region because

$$l\left(\begin{bmatrix} \theta_1 \\ \hat{\theta}_2(\theta_1) \end{bmatrix}\right)$$

is the profile log likelihood for $\theta_1$.
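A profile likelihood interval can be sketched for the normal model with $\theta_1 = \mu$ and nuisance parameter $\theta_2 = \sigma^2$. In this Python illustration the data are made up, the helper names `profile_loglik` and `bisect_root` are hypothetical, and the cutoff $\chi^2_{1, 0.95}$ is obtained as the square of the 0.975 standard normal quantile. For fixed $\mu$, the maximizing value of $\sigma^2$ is $\hat{\sigma}^2(\mu) = n^{-1}\sum_i (y_i - \mu)^2$, which gives the profile log likelihood in closed form; the interval endpoints are then found by bisection.

```python
import math
from statistics import NormalDist

def profile_loglik(mu, y):
    """Profile log likelihood for mu: sigma^2 is replaced by its maximizer
    sigma2_hat(mu) = mean((y_i - mu)^2) for the given mu."""
    n = len(y)
    sigma2_mu = sum((yi - mu) ** 2 for yi in y) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2_mu) + 1)

def bisect_root(f, lo, hi, tol=1e-10):
    """Locate a sign change of f on [lo, hi] by bisection.
    Assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Made-up data for illustration only.
y = [4.9, 5.6, 4.4, 6.1, 5.0, 5.3, 4.7, 5.8]
n = len(y)
mu_hat = sum(y) / n

# chi-square(1) 0.95 quantile equals the squared 0.975 normal quantile.
cutoff = NormalDist().inv_cdf(0.975) ** 2
threshold = profile_loglik(mu_hat, y) - 0.5 * cutoff

def gap(mu):
    """Positive inside the confidence region, negative outside."""
    return profile_loglik(mu, y) - threshold

# Search one data range on either side of mu_hat; for these data the
# profile log likelihood is below the threshold at both outer points.
spread = max(y) - min(y)
lower = bisect_root(gap, mu_hat - spread, mu_hat)
upper = bisect_root(gap, mu_hat, mu_hat + spread)

print("95%% profile likelihood interval for mu: (%.4f, %.4f)" % (lower, upper))
```

In this particular model the endpoints also have a closed form, $\bar{y} \pm \hat{\sigma}\sqrt{\exp(\chi^2_{1,1-\alpha}/n) - 1}$ with $\hat{\sigma}^2 = \hat{\sigma}^2(\bar{y})$, which offers a check on the numerical search; bisection is shown because most models have no such formula.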

A Warning

The normal and $\chi^2$ approximations mentioned in these notes may be crude if sample sizes are not sufficiently large. The regularity conditions mentioned in these notes do not hold if the true parameter falls on the boundary of the parameter space. Thus, as an example, testing $H_0: \sigma_u^2 = 0$ is not covered by the methods presented here.
