Akaike Information Criterion

1 Akaike Information Criterion

Shuhua Hu
Center for Research in Scientific Computation
North Carolina State University, Raleigh, NC
February 7

2 Background

Model
Statistical model: Y_j = h(t_j; q) + E_j, j = 1, 2, ..., n
- h(t_j; q): the observed part of the solution of a mathematical model (e.g., described by a system of ODEs or PDEs) with model parameter q, evaluated at the measurement time point t_j.
- E_j: measurement error at measurement time point t_j (a random variable).
- Y_j: observation at measurement time point t_j (a random variable).

Under the assumption that the E_j are i.i.d. N(0, σ²), we obtain the
probability distribution model:
  g(y|\theta) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_j - h(t_j;q))^2}{2\sigma^2} \right)
- g: probability density function of Y = (Y_1, Y_2, ..., Y_n)^T given the parameter θ.
- θ includes the mathematical model parameter q and the statistical model parameter σ, i.e., θ = (q, σ)^T.
- y = (y_1, y_2, ..., y_n)^T is a realization of Y.
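
To make the setup concrete, here is a small illustrative sketch (not from the slides; the observation operator h, parameter values, and helper names are my own choices). It simulates data from the statistical model and evaluates the log of the probability distribution model g(y|θ) under i.i.d. normal errors.

import numpy as np

def h(t, q):
    # toy observation operator: exponential decay with parameters q = (q1, q2)
    return q[0] * np.exp(-q[1] * t)

def log_g(y, t, q, sigma):
    # log of the probability distribution model g(y | theta), theta = (q, sigma)
    resid = y - h(t, q)
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma**2) - resid**2 / (2.0 * sigma**2))

rng = np.random.default_rng(42)
t = np.linspace(0.0, 5.0, 25)
q_true, sigma_true = (2.0, 0.7), 0.1
y = h(t, q_true) + rng.normal(scale=sigma_true, size=t.size)
print(log_g(y, t, q_true, sigma_true))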

3 Background

Risk
Modeling error (in terms of the uncertainty assumption)
- An inappropriate parametric probability distribution was specified for the data at hand.
Estimation error
  \|\vartheta - \hat\theta\|^2 = \underbrace{\|\vartheta - \theta\|^2}_{\text{bias}} + \underbrace{\|\theta - \hat\theta\|^2}_{\text{variance}}
- ϑ: parameter vector for the full reality model.
- θ: the projection of ϑ onto the parameter space Θ_k of the approximating model.
- θ̂: the maximum likelihood estimate of θ in Θ_k.

4 Background

Principle of Parsimony (with the same data set)
Variance: for sufficiently large sample size n, we have
  n\,\|\theta - \hat\theta\|^2 \overset{\text{asym.}}{\sim} \chi^2_k, \qquad \text{where } E(\chi^2_k) = k.
Akaike Information Criterion

5 Kullback-Leibler Information

Information lost when the approximating model is used to approximate the full reality.
Continuous case:
  I(f, g(\cdot|\theta)) = \int_\Omega f(x)\log\left(\frac{f(x)}{g(x|\theta)}\right) dx
                        = \int_\Omega f(x)\log(f(x))\,dx - \underbrace{\int_\Omega f(x)\log(g(x|\theta))\,dx}_{\text{relative K-L information}}
- f: the full reality or truth, in terms of a probability distribution.
- g: the approximating model, in terms of a probability distribution.
- θ: parameter vector in the approximating model g.
Remark
- I(f,g) ≥ 0, with I(f,g) = 0 if and only if f = g almost everywhere.
- I(f,g) ≠ I(g,f), which implies that the K-L information is not a true distance.
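
As an illustration (not part of the slides), the K-L information can be evaluated numerically when both f and g are known. The sketch below assumes a standard normal "truth" and a shifted, rescaled normal approximating model (my own choices), and checks the grid-based value against the closed-form K-L divergence between two univariate normals.

import numpy as np
from scipy import stats

# Grid-based evaluation of I(f, g) = \int f(x) log(f(x)/g(x|theta)) dx.
xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]
fx = stats.norm.pdf(xs, loc=0.0, scale=1.0)   # truth f: N(0, 1)
gx = stats.norm.pdf(xs, loc=0.5, scale=1.2)   # approximating model g: N(0.5, 1.2^2)
kl_numeric = np.sum(fx * np.log(fx / gx)) * dx

# Closed-form K-L divergence between two normals, for comparison.
m0, s0, m1, s1 = 0.0, 1.0, 0.5, 1.2
kl_exact = np.log(s1 / s0) + (s0**2 + (m0 - m1)**2) / (2.0 * s1**2) - 0.5

print(kl_numeric, kl_exact)   # both positive and nearly equal; zero only if g coincides with f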

6 Akaike Information Criterion (1973)

Motivation
- The truth f is unknown.
- The parameter θ in g must be estimated from the empirical data {y_j}_{j=1}^n.
  - y = (y_1, y_2, ..., y_n)^T is a realization of Y, which is a random sample from the density function f(y).
  - θ̂(Y): the estimator of θ; it is a random variable.
  - Consequently, I(f, g(·|θ̂(Y))) is a random variable.
Remark
- We therefore use the expected K-L information E_Y[I(f, g(·|θ̂(Y)))] to measure the distance between g and f.

7 AIC

Selection Target
Minimize over g ∈ G the expected K-L information
  E_Y[I(f, g(\cdot|\hat\theta(Y)))] = \int_\Omega f(x)\log(f(x))\,dx - \underbrace{\int_\Omega f(y)\left[\int_\Omega f(x)\log(g(x|\hat\theta(y)))\,dx\right] dy}_{E_Y E_X[\log(g(X|\hat\theta(Y)))]}
- G: the collection of admissible models (in terms of probability density functions).
- θ̂(y) is the maximum likelihood estimate based on model g and the data {y_j}_{j=1}^n.
Since the first term does not depend on g, minimizing the expected K-L information over G is equivalent to the
Model Selection Criterion: maximize over g ∈ G the quantity E_Y E_X[log(g(X|θ̂(Y)))].

8 AIC

Key Result
An approximately unbiased estimate of E_Y E_X[log(g(X|θ̂(Y)))], for a large sample and a good model, is
  \log(L(\hat\theta\,|\,y)) - k
- L(θ̂|y) is the likelihood of θ̂ given the data {y_j}_{j=1}^n, that is, L(θ̂|y) = g(y|θ̂).
- θ̂: the maximum likelihood estimate of θ given the data {y_j}_{j=1}^n, that is, θ̂ = θ̂(y).
- k: the number of estimated parameters (including mathematical model parameters and statistical model parameters).
Remark
- Large sample: needed to ensure that the estimate θ̂ provides a good approximation to the true value θ_0 (consistency property of the maximum likelihood estimator).
- Good model: a model that is close to f in the sense of having a small K-L value.

9 AIC

Maximum Likelihood Case
  AIC = -2\log L(\hat\theta\,|\,y) + 2k
where -2\log L(\hat\theta|y) corresponds to the bias component and 2k to the variance component.
- Calculate the AIC value for each model with the same data set; the best model is the one with the minimum AIC value.
- The value of AIC depends on the data {y_j}_{j=1}^n, which leads to model selection uncertainty.

Least-Squares Case
Assumption: i.i.d. normally distributed errors.
  AIC = n\log\left(\frac{RSS}{n}\right) + 2k
- RSS denotes the residual sum of squares of the fitted model.
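
A minimal sketch (the helper names and the simulated data are my own) of both AIC forms, applied to two nested polynomial regression models fitted by least squares to the same data set.

import numpy as np

def aic_ml(loglik_hat, k):
    # AIC = -2 log L(theta_hat | y) + 2k
    return -2.0 * loglik_hat + 2 * k

def aic_ls(rss, n, k):
    # Least-squares form under i.i.d. normal errors: AIC = n log(RSS/n) + 2k.
    # k must count every estimated parameter, including the error variance.
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * t + rng.normal(scale=0.1, size=t.size)

for degree in (1, 2):                      # two candidate mathematical models
    coef = np.polyfit(t, y, degree)
    rss = np.sum((y - np.polyval(coef, t)) ** 2)
    k = degree + 1 + 1                     # regression coefficients plus sigma
    print(degree, aic_ls(rss, t.size, k))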

10 Takeuchi's Information Criterion (1976)

Useful in cases where the model is not particularly close to the truth.
Model Selection Criterion: maximize over g ∈ G the quantity E_Y E_X[log(g(X|θ̂(Y)))].
Key Result
An approximately unbiased estimator of E_Y E_X[log(g(X|θ̂(Y)))] for a large sample is
  \log(L(\hat\theta\,|\,y)) - \mathrm{tr}\left(J(\theta_0)\,I(\theta_0)^{-1}\right)
where
  J(\theta_0) = E_f\left[\left(\frac{\partial}{\partial\theta}\log g(x|\theta_0)\right)\left(\frac{\partial}{\partial\theta}\log g(x|\theta_0)\right)^T\right],
  I(\theta_0) = -E_f\left[\frac{\partial^2 \log g(x|\theta_0)}{\partial\theta_i\,\partial\theta_j}\right]_{k\times k}.
Remark
- If g ≡ f, then I(θ_0) = J(θ_0), and hence tr(J(θ_0) I(θ_0)^{-1}) = k.
- If g is close to f, then tr(J(θ_0) I(θ_0)^{-1}) ≈ k.

11 TIC

  TIC = -2\log(L(\hat\theta\,|\,y)) + 2\,\mathrm{tr}\left(\hat J(\hat\theta)\,[\hat I(\hat\theta)]^{-1}\right),
where Î(θ̂) and Ĵ(θ̂) are both k × k matrices estimated from the data:
  \hat I(\hat\theta) = -\left[\frac{\partial^2 \log g(y|\hat\theta)}{\partial\theta\,\partial\theta^T}\right]  (estimate of I(θ_0)),
  \hat J(\hat\theta) = \sum_{j=1}^{n}\left[\frac{\partial}{\partial\theta}\log g(y_j|\hat\theta)\right]\left[\frac{\partial}{\partial\theta}\log g(y_j|\hat\theta)\right]^T  (estimate of J(θ_0); the sum runs over the individual observations, since the total score vanishes at the MLE).
Remark
- Attractive in theory; rarely used in practice, because a very large sample size is needed to obtain good estimates of both I(θ_0) and J(θ_0).
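
A small sketch of TIC for the simple case of an i.i.d. normal working model, where the per-observation scores and Hessians are available in closed form. The data-generating distribution, sample size, and seed are illustrative assumptions; with a misspecified working model the estimated penalty tr(Ĵ Î⁻¹) typically differs from k = 2.

import numpy as np

rng = np.random.default_rng(1)
y = rng.standard_t(df=5, size=500)      # "truth" is heavier-tailed than the working model
n = y.size

# MLE of the working normal model g(y | mu, s2)
mu = y.mean()
s2 = np.mean((y - mu) ** 2)
r = y - mu
loglik = np.sum(-0.5 * np.log(2.0 * np.pi * s2) - r**2 / (2.0 * s2))

# Per-observation score vectors with respect to theta = (mu, s2)
scores = np.column_stack([r / s2,
                          -0.5 / s2 + r**2 / (2.0 * s2**2)])
J_hat = scores.T @ scores               # sum of score outer products

# Summed Hessian of the log-likelihood, evaluated at the MLE
H = np.array([[-n / s2,            -np.sum(r) / s2**2],
              [-np.sum(r) / s2**2,  np.sum(0.5 / s2**2 - r**2 / s2**3)]])
I_hat = -H                              # observed information matrix

penalty = np.trace(J_hat @ np.linalg.inv(I_hat))
tic = -2.0 * loglik + 2.0 * penalty
aic = -2.0 * loglik + 2.0 * 2           # k = 2 estimated parameters, for comparison
print(penalty, tic, aic)                # penalty is close to k only when g is close to the truth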

12 AIC_c: A Small-Sample AIC

Used in the case where the sample size is small relative to the number of parameters (rule of thumb: n/k < 40).
Univariate Case
Assumption: i.i.d. normal error distribution, with the truth contained in the model set.
  AIC_c = AIC + \underbrace{\frac{2k(k+1)}{n-k-1}}_{\text{bias correction}}
Remark
- The bias-correction term varies by type of model (e.g., normal, exponential, Poisson).
- In practice, AIC_c is generally suitable unless the underlying probability distribution is extremely nonnormal, especially in terms of being strongly skewed.
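
A one-line helper (a hedged sketch with my own naming) implementing the univariate correction; as n grows, the correction vanishes and AIC_c approaches AIC.

def aicc(aic, k, n):
    # AICc = AIC + 2k(k+1)/(n - k - 1); requires n > k + 1
    return aic + 2.0 * k * (k + 1) / (n - k - 1)

print(aicc(100.0, k=5, n=30), aicc(100.0, k=5, n=3000))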

13 AIC_c

Multivariate Case
Assumption: each row of E is i.i.d. N(0, Σ), with the truth contained in the model set. Applying the correction to the multivariate regression model
  Y = TB + E,  where  Y ∈ R^{n×p}, T ∈ R^{n×k}, B ∈ R^{k×p},
gives
  AIC_c = AIC + \frac{2\bar{k}(\bar{k}+1+p)}{n-k-1-p}
- p: total number of components.
- n: number of independent multivariate observations, each with p nonindependent components.
- \bar{k}: total number of unknown parameters, \bar{k} = kp + p(p+1)/2.
Remark
- Bedrick and Tsai [1] claimed that this result can be extended to the multivariate nonlinear regression model.

14 AIC Differences, Likelihood of a Model, Akaike Weights

AIC differences
Information loss when the fitted model is used rather than the best approximating model in the set:
  \Delta_i = AIC_i - AIC_{\min}
- AIC_min: the AIC value of the best model in the set.
Likelihood of a model
Useful for making inferences about the relative strength of evidence for each of the models in the set:
  L(g_i\,|\,y) \propto \exp\left(-\tfrac{1}{2}\Delta_i\right),
where ∝ means "is proportional to".
Akaike weights
Weight of evidence in favor of model i being the best approximating model in the set:
  w_i = \frac{\exp(-\tfrac{1}{2}\Delta_i)}{\sum_{r=1}^{R}\exp(-\tfrac{1}{2}\Delta_r)}
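
A short sketch computing Δ_i, the relative model likelihoods, and the Akaike weights for a set of AIC values (the numbers and the function name are made up for illustration).

import numpy as np

def akaike_weights(aic_values):
    aic = np.asarray(aic_values, dtype=float)
    delta = aic - aic.min()                # Delta_i = AIC_i - AIC_min
    rel_lik = np.exp(-0.5 * delta)         # L(g_i | y), up to a constant
    weights = rel_lik / rel_lik.sum()      # w_i
    return delta, rel_lik, weights

delta, rel_lik, w = akaike_weights([102.3, 100.1, 107.8, 100.9])
print(delta, w)                            # the weights sum to 1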

15 Confidence Set for the K-L Best Model

Three heuristic approaches (see [4]):
1. Based on the Akaike weights w_i
   To obtain a 95% confidence set on the actual K-L best model, sum the Akaike weights from largest to smallest until that sum just reaches 0.95; the corresponding subset of models is the confidence set on the K-L best model.
2. Based on the AIC differences Δ_i
   0 ≤ Δ_i ≤ 2: substantial support;
   4 ≤ Δ_i ≤ 7: considerably less support;
   Δ_i > 10: essentially no support.
   Remark: particularly useful for nested models, but may break down when the model set is large. The guideline values may be somewhat larger for nonnested models.

16 Confidence Set for the K-L Best Model (continued)

3. Motivated by likelihood-based inference
   The confidence set of models consists of all models for which the ratio
     \frac{L(g_i\,|\,y)}{L(g_{\min}\,|\,y)} > \alpha,
   where α might be chosen as 1/8 and g_min denotes the model with the minimum AIC value.
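
A sketch of the first and third heuristic approaches applied to the same made-up AIC values used above; the returned indices are positions in the candidate list.

import numpy as np

aic = np.array([102.3, 100.1, 107.8, 100.9])
delta = aic - aic.min()
rel_lik = np.exp(-0.5 * delta)
w = rel_lik / rel_lik.sum()

# Approach 1: add models by decreasing weight until the cumulative weight reaches 0.95.
order = np.argsort(w)[::-1]
cum = np.cumsum(w[order])
conf_set_95 = order[: np.searchsorted(cum, 0.95) + 1]

# Approach 3: keep all models whose likelihood ratio to the best model exceeds alpha = 1/8.
conf_set_lr = np.flatnonzero(rel_lik / rel_lik.max() > 1.0 / 8.0)

print(conf_set_95, conf_set_lr)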

17 Multimodel Inference

Unconditional Variance Estimator
  \widehat{\mathrm{var}}(\hat{\bar\theta}) = \left[\sum_{i=1}^{R} w_i \sqrt{\widehat{\mathrm{var}}(\hat\theta_i\,|\,g_i) + (\hat\theta_i - \hat{\bar\theta})^2}\right]^2
- θ is a parameter common to all R models.
- θ̂_i means that the parameter θ is estimated based on model g_i, and \hat{\bar\theta} is the model-averaged estimate \hat{\bar\theta} = \sum_{i=1}^{R} w_i\,\hat\theta_i.
Remark
- "Unconditional" means not conditional on any particular model, but still conditional on the full set of models considered.
- If θ is a parameter common to only a subset of the R models, then the w_i must be recalculated based on just those models (so that the new weights satisfy Σ w_i = 1).
- Use the unconditional variance unless the selected model is strongly supported (for example, w_min > 0.9).
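
A sketch of the model-averaged estimate and the unconditional variance estimator; the weights, per-model estimates, and model-conditional variances below are made-up illustrative numbers.

import numpy as np

w       = np.array([0.55, 0.30, 0.15])   # Akaike weights (sum to 1)
theta_i = np.array([1.20, 1.05, 1.40])   # theta_hat_i from each model g_i
var_i   = np.array([0.04, 0.05, 0.06])   # var(theta_hat_i | g_i)

theta_bar = np.sum(w * theta_i)          # model-averaged estimate
var_uncond = np.sum(w * np.sqrt(var_i + (theta_i - theta_bar) ** 2)) ** 2
print(theta_bar, var_uncond)             # the unconditional variance incorporates model selection uncertainty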

18 Summary of Akaike Information Criteria

Advantages
- Valid for both nested and nonnested models.
- Can compare models with different error distributions.
- Avoids multiple testing issues.
Selected Model
- The model with the minimum AIC value.
- Specific to the given data set.
Pitfalls in Using Akaike Information Criteria
- Cannot be used to compare models fitted to different data sets. For example, if nonlinear regression model g_1 is fitted to a data set with n = 140 observations, one cannot validly compare it with model g_2 fitted after 7 outliers have been deleted, leaving only n = 133.

19 Summary of Akaike Information Criteria (continued)

- Use the same response variable for all candidate models (see the sketch after this list). For example, if there is interest in the normal and log-normal model forms, the models would have to be expressed, respectively, as
    g_1(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \qquad
    g_2(x|\mu,\sigma) = \frac{1}{x\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(\log(x)-\mu)^2}{2\sigma^2}\right],
  instead of
    g_1(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \qquad
    g_2(\log(x)|\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(\log(x)-\mu)^2}{2\sigma^2}\right].
- Do not mix null hypothesis testing with information criteria. An information criterion is not a test, so avoid using "significant" and "not significant", or "rejected" and "not rejected", when reporting results. Do not use AIC to rank the models in the set and then test whether the best model is significantly better than the second-best model.
- Retain all the components of each likelihood when comparing different probability distributions.
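
For the normal versus log-normal example above, a small sketch (function names and data are my own) of the two log-densities written on the same response scale x; the -log(x) Jacobian term is what makes the log-normal log-likelihood, and hence its AIC, comparable with the normal one on the same data.

import numpy as np

def logpdf_normal(x, mu, sigma):
    # log g_1(x | mu, sigma): normal density for the response x itself
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (x - mu) ** 2 / (2.0 * sigma**2)

def logpdf_lognormal(x, mu, sigma):
    # log g_2(x | mu, sigma): log-normal density for x, including the -log(x) term
    return (-np.log(x) - 0.5 * np.log(2.0 * np.pi * sigma**2)
            - (np.log(x) - mu) ** 2 / (2.0 * sigma**2))

# Both log-likelihoods are sums over the same observed x values.
x = np.array([0.8, 1.3, 2.1, 0.5, 1.7])
print(np.sum(logpdf_normal(x, 1.0, 0.5)), np.sum(logpdf_lognormal(x, 0.0, 0.5)))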

20 References

[1] E.J. Bedrick and C.L. Tsai, Model Selection for Multivariate Regression in Small Samples, Biometrics, 50 (1994).
[2] H. Bozdogan, Model Selection and Akaike's Information Criterion (AIC): The General Theory and Its Analytical Extensions, Psychometrika, 52 (1987).
[3] H. Bozdogan, Akaike's Information Criterion and Recent Developments in Information Complexity, Journal of Mathematical Psychology, 44 (2000).
[4] K.P. Burnham and D.R. Anderson, Model Selection and Inference: A Practical Information-Theoretic Approach, Springer-Verlag, New York, 1998.
[5] K.P. Burnham and D.R. Anderson, Multimodel Inference: Understanding AIC and BIC in Model Selection, Sociological Methods and Research, 33 (2004).
[6] C.M. Hurvich and C.L. Tsai, Regression and Time Series Model Selection in Small Samples, Biometrika, 76 (1989).
