GENERAL FORMULA OF BIAS-CORRECTED AIC IN GENERALIZED LINEAR MODELS
Shinpei Imori, Hirokazu Yanagihara, Hirofumi Wakaki
Department of Mathematics, Graduate School of Science, Hiroshima University

ABSTRACT

The present paper considers a bias correction of Akaike's information criterion (AIC) for selecting variables in the generalized linear model (GLM). When the sample size is not large, the AIC has a non-negligible bias that adversely affects variable selection. In the present study, we obtain a simple expression for a bias-corrected AIC (corrected AIC, or CAIC) in GLMs. A numerical study reveals that the CAIC performs better than the AIC for variable selection.

Key words: Akaike's information criterion, Bias correction, Generalized linear model, Maximum likelihood estimation, Variable selection.

1. INTRODUCTION

In real data analysis, choosing the best subset of variables in a regression model is an important problem. For model selection, it is common to measure the goodness of fit of a model by the risk function based on the expected Kullback-Leibler (KL) information (Kullback and Leibler (1951)). In practice, we must estimate the risk function, which depends on unknown parameters. The most famous estimator of the risk function is Akaike's information criterion (AIC) proposed by Akaike (1973). Since the AIC is simply defined as (-2) x (the maximum log-likelihood) + 2 x (the number of parameters), it is widely applied in chemometrics, engineering, econometrics, psychometrics, and many other fields for selecting an appropriate model from a set of explanatory variables. However, the order of the bias of the AIC to the risk function is O(n^{-1}), which implies that the AIC can have a non-negligible bias to the risk function when the sample size n is not large. The AIC tends to underestimate the risk function, and its bias tends to increase with the number of parameters in the model.
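To make the definition concrete, the following sketch (our own illustration, not part of the paper) computes the AIC of an intercept-only Poisson model, whose maximum likelihood estimate of the mean is simply the sample mean:

```python
import math

def poisson_loglik(lam, y):
    # log-likelihood of i.i.d. Poisson(lam) observations
    return sum(yi * math.log(lam) - lam - math.lgamma(yi + 1) for yi in y)

def aic(max_loglik, n_params):
    # AIC = -2 * (maximum log-likelihood) + 2 * (number of parameters)
    return -2.0 * max_loglik + 2.0 * n_params

y = [2, 3, 1, 0, 4, 2, 3, 1]
lam_hat = sum(y) / len(y)   # MLE of the Poisson mean (one parameter)
print(aic(poisson_loglik(lam_hat, y), 1))
```

Evaluating the criterion at any other mean value gives a larger AIC for the same number of parameters, which illustrates why the criterion is evaluated at the MLE.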
As a result, the AIC tends to choose a model that has more parameters than the true model as the best model (see Shibata (1980)). Owing to these characteristics, the bias causes a disadvantage whereby the model having the most parameters is easily chosen by the AIC among the candidate models as the best model. One method of avoiding this undesirable result is to correct the bias of the AIC. A number of authors
(Corresponding author e-mail address: m06547@hiroshima-u.ac.jp)
have investigated bias-corrected AICs for various models. For example, Sugiura (1978) developed an unbiased estimator of the risk function in linear regression models, which was shown to be the UMVUE of the risk function by Davies et al. (2006). Hurvich and Tsai (1989) formally adjusted the bias of the AIC (the resulting criterion is called the AICc) in several models. In particular, the AICc is equivalent to Sugiura's bias-corrected AIC in the case of the linear regression model. Wong and Li (1998) extended Hurvich and Tsai's AICc to a wider class of models and verified through numerical studies that their AICc performs better than the original AIC. Unfortunately, except for the linear regression model, the AICc does not completely reduce the bias of the AIC to O(n^{-2}). As mentioned previously, the goodness of fit of the model is measured by the risk function based on the expected KL information. Thus, obtaining a higher-order asymptotically unbiased estimator of the risk function will allow us to measure the goodness of fit of the model more accurately, which in turn facilitates a more reasonable selection of variables. From this viewpoint, Yanagihara et al. (2003) and Kamo et al. (2011) proposed bias-corrected AICs in the logistic regression model and the Poisson regression model, respectively, which completely reduce the bias of the AIC to O(n^{-2}). We refer to the AIC whose bias is completely corrected to O(n^{-2}) as the corrected AIC (CAIC). Frequently, the CAIC improves the performance of the original AIC dramatically, which strongly suggests the usefulness of the CAIC for real data analysis. Nevertheless, the CAIC is rarely used in real data analysis because it has been derived only for a few models. Moreover, since the derivation of the CAIC is complicated, a great deal of work is needed to carry out the calculation if a researcher wants to use the CAIC in a model for which it has not yet been derived.
Hence, under the present circumstances the CAIC is not a user-friendly model selector, although it performs better than the original AIC. If the CAIC can be obtained with little effort, it will become a useful and user-friendly model selector. The goal of the present paper is therefore to derive a simple formula for the CAIC in as wide a class of models as possible. The model considered is the generalized linear model (GLM), proposed by Nelder and Wedderburn (1972). The GLM can express a number of statistical models by changing the distribution and the link function, such as the normal linear regression model, the logistic regression model, and the probit model, which are commonly used in many applied fields; cf. Barnett and Nurmagambetov (2011), Matas et al. (2010), Sánchez-Carneo et al. (2011), and Teste and Lieffers (2011). Practically speaking, the GLM can be easily fitted to real data using the glm function in R (R Development Core Team (2011)), which implements a number of distributions and link functions. Since the model considered herein is wide and can be easily fitted to real data, the CAIC in the GLM should prove useful in real data analysis. Generally, the additional bias-correction terms in the CAIC require estimators of several orders of moments of the response variables, and such moments would have to be calculated for each specified model. However, we emphasize that
the moments do not remain in our formula. Hence, in order to apply the CAIC to real data with our formula, we need only calculate derivatives of the log-likelihood function up to the fourth order. Our formula can be evaluated using formula manipulation software such as Mathematica.

The remainder of the present paper is organized as follows: In Section 2, we consider a stochastic expansion of the maximum likelihood estimator (MLE) in the GLM. In Section 3, we propose a new information criterion by completely reducing the bias of the AIC in the GLM to O(n^{-2}). In Section 4, we present several examples of the CAIC in the models implemented in the glm program; these examples should be helpful to applied researchers. In Section 5, we investigate the performance of the proposed CAIC through numerical simulations. Technical details are provided in the Appendix.

2. STOCHASTIC EXPANSION OF THE MLE IN THE GLM

The GLM considered herein allows us to fit regression models for response variables that follow a very general distribution belonging to the exponential family, whose probability density function is given as follows:

    f(y; θ, φ) = exp[ {θy − b(θ)}/a(φ) + c(y, φ) ],   (2.1)

where a(·), b(·), and c(·) are known functions, the unknown parameter θ is referred to as the natural location parameter, and φ is often referred to as the scale parameter. (For details of the GLM, see, e.g., Meyer et al. (2002).) In the present paper, we assume that φ is known. The exponential family includes the normal, binomial, Poisson, geometric, negative binomial, exponential, gamma, and inverse Gaussian distributions. Let each datum consist of a sequence {(y_i, x_i); i = 1, ..., n}, where y_1, ..., y_n are independent random variables referred to as response variables, and x_1, ..., x_n are p-dimensional nonstochastic vectors referred to as explanatory variables.
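To make the exponential-family form concrete, the sketch below (our own illustration) checks numerically that the Poisson probability mass function arises from the density above with θ = log λ, b(θ) = e^θ, a(φ) = 1, and c(y, φ) = −log(y!):

```python
import math

def expfam_density(y, theta, b, c, a_phi=1.0):
    # generic exponential-family density: exp{(theta*y - b(theta))/a(phi) + c(y, phi)}
    return math.exp((theta * y - b(theta)) / a_phi + c(y))

lam = 3.5
theta = math.log(lam)              # natural parameter
b = math.exp                       # b(theta) = e^theta
c = lambda y: -math.lgamma(y + 1)  # c(y, phi) = -log(y!)

for y in range(6):
    direct = lam**y * math.exp(-lam) / math.factorial(y)  # Poisson pmf
    assert abs(expfam_density(y, theta, b, c) - direct) < 1e-12
print("Poisson pmf matches the exponential-family form")
```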
The expectation of the response y_i is related to a linear predictor η_i = x_i'β through a link function h(·), i.e., h(E[y_i]) = h(µ(θ_i)) = η_i. For theoretical purposes, we define u = (h ∘ µ)^{-1}, i.e., θ_i = u(η_i). When h = µ^{-1}, i.e., u is the identity function, we say that h is the natural link function; for example, the logistic regression model uses the natural link function. Finally, the candidate model is expressed as y_i ~ ind. f(y_i; θ_i(β), φ), where f(·) is given by (2.1). The p-dimensional unknown vector β can be estimated by the maximum likelihood method. The joint probability density function of y = (y_1, ..., y_n)' is given by

    f(y; β) = ∏_{i=1}^n f(y_i; θ_i(β), φ) = exp[ Σ_{i=1}^n { (θ_i y_i − b(θ_i))/a(φ) + c(y_i, φ) } ].
Hence, the log-likelihood function of the GLM is expressed as

    l(β; y) = log f(y; β) = Σ_{i=1}^n { (θ_i y_i − b(θ_i))/a(φ) + c(y_i, φ) }.   (2.2)

Let β̂ be the MLE of β. Here, β̂ is given as the solution of the following likelihood equation:

    ∂l(β; y)/∂β = (1/a(φ)) Σ_{i=1}^n ( y_i − ∂b(θ_i)/∂θ_i ) (∂θ_i/∂η_i) x_i = (1/a(φ)) X'Λ(y − µ) = 0_p,

where X = (x_1, ..., x_n)', Λ = diag(∂θ_1/∂η_1, ..., ∂θ_n/∂η_n), and µ = (∂b(θ_1)/∂θ_1, ..., ∂b(θ_n)/∂θ_n)'. Note that b(θ) is a C^∞-class function and all orders of moments of y exist in the interior Θ_0 of the natural parameter space Θ. In order to use some properties of the MLE, we make the following regularity assumptions (see, e.g., Fahrmeir and Kaufmann (1985)):

    C.1: x_i'β ∈ h(M), (i = 1, ..., n), for all β ∈ B,
    C.2: h is three times continuously differentiable,
    C.3: for all x_i ∈ χ, ∂θ_i/∂η_i ≠ 0, (i = 1, ..., n),
    C.4: there exists n_0 s.t. X'X has full rank for n ≥ n_0,

where B is an admissible open set in R^p for the parameter β, χ is a compact set for the regressors x_i, and M denotes the image µ(Θ_0). Condition C.1 is necessary in order to define the GLM for all β. Condition C.2 is necessary in order to calculate the bias. Conditions C.3 and C.4 ensure that X'VX is positive definite for all β ∈ B and n ≥ n_0, where

    V = diag( ∂²b(θ_1)/∂θ_1², ..., ∂²b(θ_n)/∂θ_n² ).

Moreover, we impose the following additional conditions to ensure strong consistency and asymptotic normality of β̂, which can be derived by slightly modifying the results reported by Fahrmeir and Kaufmann (1985):

    C.5: the sequence {x_i} lies in χ with u(x'β) ∈ Θ_0 for all β ∈ B,
    C.6: lim inf_{n→∞} λ_min(X'VX/n) > 0,
    C.7: there exist c > 0 and n_1 s.t. λ_min(X'X) > c λ_max(X'X) for n ≥ n_1,

where λ_min(A) and λ_max(A) are the smallest and largest eigenvalues of a symmetric matrix A. According to Theorem 5 in Fahrmeir and Kaufmann (1985), β̂ is strongly consistent and asymptotically normal under these conditions. Furthermore, from C.6, X'VX = O(n) as n → ∞.
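The likelihood equation above is routinely solved by Newton-Raphson (Fisher scoring) iteration. The following sketch, our own illustration with simulated data rather than anything from the paper, fits a Poisson regression with the log link and checks that the score X'(y − µ) vanishes at the MLE (for this natural link, Λ is the identity):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.binomial(1, 0.4, size=(n, p - 1))])
beta_true = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(X @ beta_true))

# Newton-Raphson for the Poisson log-link GLM: solve X'(y - mu) = 0
beta = np.zeros(p)
for _ in range(25):
    mu = np.exp(X @ beta)        # mean, mu_i = exp(eta_i)
    W = mu                       # Fisher weights for the log link
    step = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - mu))
    beta += step
    if np.max(np.abs(step)) < 1e-10:
        break

score = X.T @ (y - np.exp(X @ beta))  # likelihood-equation residual
print(beta, np.max(np.abs(score)))
```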
Based on the above conditions, β̂ can be formally expanded as follows:

    β̂ = β + (1/√n) b_1 + (1/n) b_2 + (1/(n√n)) b_3 + O_p(n^{-2}).   (2.3)

Note that ∂l(β̂; y)/∂β = 0_p. By applying a Taylor expansion around β̂ = β to this equation, the likelihood equation is expanded as follows:

    0_p = (1/√n)(g − G_2 b_1) − (1/n){ G_2 b_2 + (1/2) G_3 (b_1 ⊗ b_1) }
          − (1/(n√n)){ G_2 b_3 + (1/2) G_3 (b_1 ⊗ b_2 + b_2 ⊗ b_1) + (1/6)(I_p ⊗ b_1') G_4 (b_1 ⊗ b_1) } + O_p(n^{-2}),   (2.4)

where 0_p is a p-dimensional vector of zeros, ⊗ denotes the Kronecker product, and

    g = (1/√n) ∂l(β; y)/∂β = (1/√n) Σ_{i=1}^n (y_i − d_i1) c_i1 x_i,
    G_2 = −(1/n) ∂²l(β; y)/∂β∂β' = (1/n) Σ_{i=1}^n { d_i2 c_i1² − (y_i − d_i1) c_i2 } x_i x_i',
    G_3 = −(1/n) (∂/∂β') ⊗ ∂²l(β; y)/∂β∂β'
        = (1/n) Σ_{i=1}^n { d_i3 c_i1³ + 3 d_i2 c_i1 c_i2 − (y_i − d_i1) c_i3 } (x_i ⊗ x_i x_i'),
    G_4 = −(1/n) (∂²/∂β∂β') ⊗ ∂²l(β; y)/∂β∂β'
        = (1/n) Σ_{i=1}^n { d_i4 c_i1⁴ + 6 d_i3 c_i1² c_i2 + 3 d_i2 c_i2² + 4 d_i2 c_i1 c_i3 − (y_i − d_i1) c_i4 } (x_i x_i' ⊗ x_i x_i').

Here, the coefficients c_ik and d_ik (i = 1, ..., n; k = 1, ..., 4) are defined as

    c_ik = ∂^k θ_i / ∂η_i^k,   d_ik = ∂^k b(θ_i) / ∂θ_i^k.   (2.5)

Note that c_ik is determined by the link function, and d_ik is determined by the distribution of the model. Let us define Z_i = √n (G_i − M_i) (i = 2, 3, 4), where M_i = E[G_i], the explicit forms of which are

    M_2 = (1/n) Σ_{i=1}^n d_i2 c_i1² x_i x_i',
    M_3 = (1/n) Σ_{i=1}^n (d_i3 c_i1³ + 3 d_i2 c_i1 c_i2)(x_i ⊗ x_i x_i'),
    M_4 = (1/n) Σ_{i=1}^n (d_i4 c_i1⁴ + 6 d_i3 c_i1² c_i2 + 3 d_i2 c_i2² + 4 d_i2 c_i1 c_i3)(x_i x_i' ⊗ x_i x_i').
Thus, Z_i (i = 2, 3, 4) can be expressed as

    Z_2 = −(1/√n) Σ_{i=1}^n (y_i − d_i1) c_i2 x_i x_i',
    Z_3 = −(1/√n) Σ_{i=1}^n (y_i − d_i1) c_i3 (x_i ⊗ x_i x_i'),
    Z_4 = −(1/√n) Σ_{i=1}^n (y_i − d_i1) c_i4 (x_i x_i' ⊗ x_i x_i').

Based on the regularity assumptions, the nonsingularity of M_2 is guaranteed. Furthermore, the regularity assumptions together with Conditions C.6 and C.7 ensure the asymptotic normality of Z_i. Hence, we can rewrite (2.4) as

    0_p = (1/√n)(g − M_2 b_1) − (1/n){ M_2 b_2 + (1/2) M_3 (b_1 ⊗ b_1) + Z_2 b_1 }
          − (1/(n√n)){ M_2 b_3 + (1/2) M_3 (b_1 ⊗ b_2 + b_2 ⊗ b_1) + (1/6)(I_p ⊗ b_1') M_4 (b_1 ⊗ b_1) + Z_2 b_2 + (1/2) Z_3 (b_1 ⊗ b_1) } + O_p(n^{-2}).   (2.6)

Comparing the terms of the same order on both sides of (2.6), the explicit forms of b_1, b_2, and b_3 are obtained as follows:

    b_1 = M_2^{-1} g,
    b_2 = −M_2^{-1} { (1/2) M_3 (b_1 ⊗ b_1) + Z_2 b_1 },
    b_3 = −M_2^{-1} { (1/2) M_3 (b_1 ⊗ b_2 + b_2 ⊗ b_1) + (1/6)(I_p ⊗ b_1') M_4 (b_1 ⊗ b_1) + Z_2 b_2 + (1/2) Z_3 (b_1 ⊗ b_1) }.

3. BIAS CORRECTION OF THE AIC

The goodness of fit of the model is measured by the risk function based on the expected KL information, as follows:

    R = E_y E_{y*}[ −2 l(β̂; y*) ],

where y* = (y_1*, ..., y_n*)' is an n-dimensional random vector that is independent of y and has the same distribution as y. At the beginning of this section, we derive the bias of −2l(β̂; y) to R. Under ordinary circumstances, calculating the expectations of y under the specific distribution is needed in order to express the bias. However, based on the characteristics of the exponential family, we can obtain the bias without calculating the expectations of y under the specific distribution; the explicit form of the bias can be expressed through several derivatives of the log-likelihood function.
The bias incurred when we estimate R by −2l(β̂; y) is given as

    B = R − E_y[ −2 l(β̂; y) ] = E_y[ E_{y*}[ 2l(β̂; y) − 2l(β̂; y*) ] ] = 2 E_y[ Σ_{i=1}^n (y_i − d_i1) θ̂_i ].   (3.1)

By applying a Taylor expansion around β̂ = β to θ̂_i = (h ∘ µ)^{-1}(x_i'β̂), θ̂_i is expanded as

    θ̂_i = θ_i + (β̂ − β)' ∂θ_i/∂β + (1/2)(β̂ − β)' { ∂²θ_i/∂β∂β' } (β̂ − β)
          + (1/6)(β̂ − β)' { (∂/∂β') ⊗ ∂²θ_i/∂β∂β' } { (β̂ − β) ⊗ (β̂ − β) } + O_p(n^{-2}).   (3.2)

Substituting the stochastic expansion of β̂ in (2.3) into (3.2) yields the following:

    θ̂_i = θ_i + (1/√n) c_i1 x_i'b_1 + (1/n){ c_i1 x_i'b_2 + (1/2) c_i2 (x_i'b_1)² }
          + (1/(n√n)){ c_i1 x_i'b_3 + c_i2 (x_i'b_1)(x_i'b_2) + (1/6) c_i3 (x_i'b_1)³ } + O_p(n^{-2}).   (3.3)

By combining (3.1) and (3.3), we obtain

    B = 2 E[ Σ_i (y_i − d_i1) θ_i ] + (2/√n) E[ Σ_i (y_i − d_i1) c_i1 x_i'b_1 ]
        + (2/n) E[ Σ_i (y_i − d_i1) { c_i1 x_i'b_2 + (1/2) c_i2 (x_i'b_1)² } ]
        + (2/(n√n)) E[ Σ_i (y_i − d_i1) { c_i1 x_i'b_3 + c_i2 (x_i'b_1)(x_i'b_2) + (1/6) c_i3 (x_i'b_1)³ } ] + O(n^{-2}).   (3.4)

Recall that d_i1 = ∂b(θ_i)/∂θ_i = E[y_i]. This yields the first term of (3.4), as follows:

    2 E[ Σ_i (y_i − d_i1) θ_i ] = 0.   (3.5)

Since E[gg'] = M_2, the second term of (3.4) can be calculated as

    (2/√n) E[ Σ_i (y_i − d_i1) c_i1 x_i'b_1 ] = 2 E[ g'M_2^{-1}g ] = 2p.   (3.6)

The third term of (3.4) can be obtained as

    (2/n) E[ Σ_i (y_i − d_i1) { c_i1 x_i'b_2 + (1/2) c_i2 (x_i'b_1)² } ]
        = (3/n²) Σ_i d_i3 c_i1² c_i2 U_ii² − (1/n³) Σ_{i,j} d_i3 c_i1³ (d_j3 c_j1³ + 3 d_j2 c_j1 c_j2) U_ij³ + O(n^{-2}),   (3.7)

where Σ_{i,j} refers to Σ_{i=1}^n Σ_{j=1}^n, and U_ij is the (i, j)th element of the matrix U = XM_2^{-1}X', i.e.,

    U_ij = x_i' M_2^{-1} x_j.   (3.8)
Note that the coefficient U_ij is determined by both the link function and the distribution of the model. The derivation of (3.7) is shown in Appendix A.1. Furthermore, the fourth term of (3.4) can be expanded as

    (2/(n√n)) E[ Σ_i (y_i − d_i1) { c_i1 x_i'b_3 + c_i2 (x_i'b_1)(x_i'b_2) + (1/6) c_i3 (x_i'b_1)³ } ]
        = −(1/n²) Σ_i (d_i4 c_i1⁴ + 6 d_i3 c_i1² c_i2 − d_i2 c_i2²) U_ii²
          + (1/n³) Σ_{i,j} { 2 d_i3 c_i1³ (d_j3 c_j1³ + 3 d_j2 c_j1 c_j2) + 4 (d_i2 c_i1 c_i2)(d_j2 c_j1 c_j2) } U_ij³
          + (1/n³) Σ_{i,j} { d_i3 c_i1³ (d_j3 c_j1³ + 3 d_j2 c_j1 c_j2) + 4 (d_i2 c_i1 c_i2)(d_j2 c_j1 c_j2) } U_ij U_ii U_jj + O(n^{-2}).   (3.9)

The detailed derivation of (3.9) is given in Appendix A.2. Finally, by substituting (3.5), (3.6), (3.7), and (3.9) into (3.4), we obtain the asymptotic expansion of B up to order n^{-1} as

    B = 2p + (1/n)(w_1 + w_2) + O(n^{-2}),   (3.10)

where

    w_1 = −(1/n) Σ_i (d_i4 c_i1⁴ + 3 d_i3 c_i1² c_i2 − d_i2 c_i2²) U_ii²,
    w_2 = (1/n²) Σ_{i,j} { d_i3 c_i1³ (d_j3 c_j1³ + 3 d_j2 c_j1 c_j2) + 4 (d_i2 c_i1 c_i2)(d_j2 c_j1 c_j2) } (U_ij³ + U_ij U_ii U_jj).   (3.11)

By a simple calculation, we have c_i1 = 1 and c_i2 = 0 when the link function is natural. Thus, if the model has the natural link function, w_1 and w_2 become simple, as follows:

    w_1 = −(1/n) Σ_i d_i4 U_ii²,
    w_2 = (1/n²) Σ_{i,j} d_i3 d_j3 (U_ij³ + U_ij U_ii U_jj).   (3.12)

Equation (3.10) yields the following formula for the CAIC:

    CAIC = AIC + (1/n)(ŵ_1 + ŵ_2),   (3.13)

where ŵ_1 and ŵ_2 are defined by replacing β in w_1 and w_2 with β̂. If h is the natural link function, we can use w_1 and w_2 in (3.12); otherwise, we have to use w_1 and w_2 in (3.11). Note that ŵ_1 and ŵ_2 depend only on several derivatives of the log-likelihood. Therefore, the coefficients ŵ_1 and ŵ_2 can be obtained comfortably using formula manipulation software.
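As a worked illustration of the resulting criterion, the sketch below (our own code, not the authors') computes CAIC = AIC + (ŵ_1 + ŵ_2)/n for a Poisson regression with the log link, using the simplified natural-link coefficients stated above, in which d_i3 = d_i4 = exp(θ_i):

```python
import numpy as np

def caic_poisson_log(X, y):
    # CAIC = AIC + (w1_hat + w2_hat)/n for the Poisson GLM with log link,
    # a sketch assuming the natural-link forms of w1 and w2 stated above
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(50):                     # Newton-Raphson MLE
        lam = np.exp(X @ beta)
        beta += np.linalg.solve((X.T * lam) @ X, X.T @ (y - lam))
    lam = np.exp(X @ beta)
    loglik = np.sum(y * np.log(lam) - lam) - sum(
        np.sum(np.log(np.arange(1, yi + 1))) for yi in y)
    aic = -2.0 * loglik + 2.0 * p
    U = X @ np.linalg.solve((X.T * lam) @ X / n, X.T)  # U_ij = x_i' M2^{-1} x_j
    d3 = d4 = lam                                      # d_i3 = d_i4 = exp(theta_i)
    w1 = -np.sum(d4 * np.diag(U) ** 2) / n
    w2 = np.sum(np.outer(d3, d3) * (U**3 + U * np.outer(np.diag(U), np.diag(U)))) / n**2
    return aic + (w1 + w2) / n

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
y = rng.poisson(np.exp(0.4 + 0.2 * X[:, 1]))
print(caic_poisson_log(X, y))
```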
4. THE CAIC IN THE MODELS IMPLEMENTED IN glm

In this section, we present several examples of the CAIC for the GLMs that can be fitted with the glm function of the R software. In glm, the binomial distribution accepts the logit, probit, cauchit, and cloglog (complementary log-log) links; the gamma distribution accepts the inverse, identity, and log links; the Poisson distribution accepts the log, identity, and sqrt links; and the inverse Gaussian distribution accepts the 1/µ², inverse, and log links. The examples below are obtained through our formula (3.13). If h is the natural link function, w_1 and w_2 in (3.13) are expressed as in (3.12); otherwise, w_1 and w_2 in (3.13) are expressed as in (3.11). Next, we present the coefficients of the CAIC in all of the models mentioned above. The results indicate that the CAIC in a model with a non-natural link function is more complex than that in a model with the natural link function. The log-likelihood function of the GLM is given by Equation (2.2).

4.1. Case of the Binomial Distribution

When we assume that y_i is distributed according to the binomial distribution B(m_i, p_i) (i = 1, ..., n), the parameters and functions determined by the distribution are given by

    θ_i = log{ p_i/(1 − p_i) },  b(θ_i) = m_i log(1 + exp(θ_i)),  φ = 1,  a(φ) = 1,  c(y_i, φ) = log C(m_i, y_i),

where C(m_i, y_i) denotes the binomial coefficient. Then, the coefficients of the CAIC determined by the distribution are given by

    d_i1 = m_i exp(θ_i)/(1 + exp(θ_i)),
    d_i2 = m_i exp(θ_i)/(1 + exp(θ_i))²,
    d_i3 = m_i exp(θ_i)(1 − exp(θ_i))/(1 + exp(θ_i))³,
    d_i4 = m_i exp(θ_i)(1 − 4 exp(θ_i) + exp(2θ_i))/(1 + exp(θ_i))⁴.

The remaining coefficients of the CAIC, which are determined by the link function, are as follows:

Case of the logistic link function, i.e., E[y_i] = m_i p_i = m_i (1 + exp(−η_i))^{-1} (natural link function):

    U_ij = x_i' { (1/n) Σ_k m_k exp(−η_k)(1 + exp(−η_k))^{-2} x_k x_k' }^{-1} x_j.

The CAIC derived from the above coefficients coincides with the CAIC reported by Yanagihara et al. (2003).
Case of the probit link function, i.e., m_i p_i = m_i Φ(η_i), where Φ(·) is the cumulative distribution function (CDF) of the standard normal distribution:

    c_i1 = ϕ(η_i)/[ {1 − Φ(η_i)}Φ(η_i) ],
    c_i2 = −ϕ(η_i)[ η_i Φ(η_i){1 − Φ(η_i)} + ϕ(η_i){1 − 2Φ(η_i)} ] / [ {1 − Φ(η_i)}Φ(η_i) ]²,
    U_ij = x_i' [ (1/n) Σ_k m_k ϕ(η_k)² / [ {1 − Φ(η_k)}Φ(η_k) ] x_k x_k' ]^{-1} x_j,

where ϕ(·) is the probability density function (PDF) of the standard normal distribution.

Case of the cauchit link function, i.e., m_i p_i = m_i Ψ(η_i), where Ψ(·) is the CDF of the standard Cauchy distribution:

    c_i1 = ψ(η_i)/[ {1 − Ψ(η_i)}Ψ(η_i) ],
    c_i2 = −ψ(η_i)²[ 2π η_i Ψ(η_i){1 − Ψ(η_i)} + {1 − 2Ψ(η_i)} ] / [ {1 − Ψ(η_i)}Ψ(η_i) ]²,
    U_ij = x_i' [ (1/n) Σ_k m_k ψ(η_k)² / [ {1 − Ψ(η_k)}Ψ(η_k) ] x_k x_k' ]^{-1} x_j,

where ψ(·) is the PDF of the standard Cauchy distribution.

Case of the cloglog link function, i.e., m_i p_i = m_i − m_i exp{−exp(η_i)}:

    c_i1 = exp(η_i)/[ 1 − exp{−exp(η_i)} ],
    c_i2 = exp(η_i)[ 1 − {1 + exp(η_i)} exp{−exp(η_i)} ] / [ 1 − exp{−exp(η_i)} ]²,
    U_ij = x_i' [ (1/n) Σ_k m_k exp(2η_k) / ( exp{exp(η_k)} [ 1 − exp{−exp(η_k)} ] ) x_k x_k' ]^{-1} x_j.

4.2. Case of the Poisson Distribution

Second, when we assume that y_i is distributed according to a Poisson distribution Po(λ_i) (i = 1, ..., n), the parameters and functions determined by the model are given as follows:

    θ_i = log λ_i,  b(θ_i) = exp(θ_i),  φ = 1,  a(φ) = 1,  c(y_i, φ) = −log(y_i!).

The coefficients of the CAIC determined by the distribution are given by

    d_i1 = d_i2 = d_i3 = d_i4 = exp(θ_i).

The remaining coefficients of the CAIC, which are determined by the link function, are as follows:

Case of the log link function, i.e., E[y_i] = λ_i = exp(η_i) (natural link function):

    U_ij = x_i' { (1/n) Σ_k exp(η_k) x_k x_k' }^{-1} x_j.

The CAIC obtained from the above coefficients coincides with that reported by Kamo et al. (2011).
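To make the matrix U concrete in the Poisson/log-link case, the following sketch (our own code; the data and names are hypothetical) builds U_ij = x_i'{(1/n) Σ_k exp(η_k) x_k x_k'}^{-1} x_j from a design matrix and a coefficient vector:

```python
import numpy as np

def U_matrix_poisson_log(X, beta):
    # U_ij = x_i' M2^{-1} x_j with M2 = (1/n) sum_k exp(eta_k) x_k x_k'
    n = X.shape[0]
    eta = X @ beta
    M2 = (X.T * np.exp(eta)) @ X / n
    return X @ np.linalg.solve(M2, X.T)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
beta = np.array([0.2, 0.1, -0.3])
U = U_matrix_poisson_log(X, beta)
print(U.shape)
```

Since M_2 is positive definite here, U is symmetric with positive diagonal entries, which is what the bias-correction terms consume.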
Case of the identity link function, i.e., λ_i = η_i:

    c_i1 = 1/η_i,  c_i2 = −1/η_i²,  U_ij = x_i' { (1/n) Σ_k η_k^{-1} x_k x_k' }^{-1} x_j.

Case of the sqrt link function, i.e., λ_i = η_i²:

    c_i1 = 2/η_i,  c_i2 = −2/η_i²,  U_ij = x_i' { (4/n) Σ_k x_k x_k' }^{-1} x_j.

4.3. Case of the Gamma Distribution

When we assume that a positive observed response y_i is distributed according to the gamma distribution Γ(λ_i, ν) (i = 1, ..., n), the parameters and functions determined by the model are given as follows:

    θ_i = −λ_i/ν,  b(θ_i) = −log(−θ_i),  φ = ν,  a(φ) = 1/φ,
    c(y_i, φ) = ν log ν − log Γ(ν) + (ν − 1) log y_i,

where Γ(·) is the gamma function. Then, the coefficients of the CAIC determined by the distribution are given by

    d_i1 = −θ_i^{-1},  d_i2 = θ_i^{-2},  d_i3 = −2θ_i^{-3},  d_i4 = 6θ_i^{-4}.

The remaining coefficients of the CAIC, which are determined by the link function, are as follows:

Case of the inverse link function, i.e., E[y_i] = ν/λ_i = η_i^{-1} (natural link function):

    U_ij = x_i' { (ν/n) Σ_k η_k^{-2} x_k x_k' }^{-1} x_j.

Case of the log link function, i.e., ν/λ_i = exp(η_i):

    c_i1 = exp(−η_i),  c_i2 = −exp(−η_i),  U_ij = x_i' { (ν/n) Σ_k x_k x_k' }^{-1} x_j.

Case of the identity link function, i.e., ν/λ_i = η_i:

    c_i1 = η_i^{-2},  c_i2 = −2η_i^{-3},  U_ij = x_i' { (ν/n) Σ_k η_k^{-2} x_k x_k' }^{-1} x_j.
4.4. Case of the Inverse Gaussian Distribution

When we assume that a positive observed response y_i is distributed according to the inverse Gaussian distribution IG(µ_i, λ) (i = 1, ..., n), the parameters and functions determined by the model are given as follows:

    θ_i = −µ_i^{-2},  b(θ_i) = −2(−θ_i)^{1/2},  φ = λ,  a(φ) = 2/φ,
    c(y_i, φ) = (1/2) log λ − (1/2) log(2π y_i³) − λ/(2y_i).

Then, the coefficients of the CAIC determined by the distribution are given as

    d_i1 = (−θ_i)^{-1/2},  d_i2 = (1/2)(−θ_i)^{-3/2},  d_i3 = (3/4)(−θ_i)^{-5/2},  d_i4 = (15/8)(−θ_i)^{-7/2}.

The remaining coefficients of the CAIC, which are determined by the link function, are as follows:

Case of the 1/µ² link function, i.e., E[y_i] = µ_i = η_i^{-1/2} (natural link function):

    U_ij = x_i' { (λ/(4n)) Σ_k η_k^{-3/2} x_k x_k' }^{-1} x_j.

Case of the inverse link function, i.e., µ_i = η_i^{-1}:

    c_i1 = −2η_i,  c_i2 = −2,  U_ij = x_i' { (λ/n) Σ_k η_k^{-1} x_k x_k' }^{-1} x_j.

Case of the log link function, i.e., µ_i = exp(η_i):

    c_i1 = 2 exp(−2η_i),  c_i2 = −4 exp(−2η_i),  U_ij = x_i' { (λ/n) Σ_k exp(−η_k) x_k x_k' }^{-1} x_j.

5. NUMERICAL STUDIES

In this section, we conduct numerical studies to show that the CAIC performs better than the original AIC. We first examine the selection frequencies of the candidate models and the prediction errors of the best models selected by the criteria. We prepared eight candidate models with n = 50 and 100. First, we constructed an n × 8 explanatory variable matrix X = (x_1, ..., x_n)'. The first column of X is 1_n, an n-dimensional vector of ones, and the remaining seven columns of X were defined by realizations of independent dummy variables from the binomial distribution B(1, 0.4). In this simulation, we prepared two parameter vectors β, as follows:

    Case 1: β = (0.65, 0.65)',
    Case 2: β = (0.1, 0.1, 0.3, 0.5)'.
The explanatory variable matrix in the jth model consists of the first j columns of X (j = 1, ..., 8). Thus, in Case 1 the true model is the second model, and in Case 2 the true model is the fourth model. We simulated 1,000 realizations of y = (y_1, ..., y_n)' from the probit regression model, i.e., y_i ~ ind. B(1, p_i), where p_i = Φ(x_i'β) (i = 1, ..., n).

Table 1: Selection probability and prediction error for the case of β = (0.65, 0.65)'. [Table entries not recovered.]

Table 2: Selection probability and prediction error for the case of β = (0.1, 0.1, 0.3, 0.5)'. [Table entries not recovered.]

Tables 1 and 2 list the following properties:
(1) selection probability: the frequency with which each model is chosen by minimizing the information criterion;
(2) prediction error of the best model (PE_B): the risk function of the model selected by the information criterion as the best model, which is estimated as

    PE_B = (1/1000) Σ_{i=1}^{1000} E_{y*}[ −2 l(β̂_Bi; y*) ],

where y* is a future observation and β̂_Bi is the value of β̂ of the selected model at the ith iteration. In particular, PE_B is an important property because it is equivalent to the expected KL information between the true model and the best model selected by the criteria. In Case 1 with n = 50 and 100 and in Case 2 with n = 100, the model having the smallest risk function (referred to as the principal best model) coincides with the true model. On the other hand, in Case 2 with n = 50,
the principal best model was the first model rather than the fourth model, i.e., the true model did not coincide with the principal best model. This means that, for prediction in Case 2 with n = 50, using a model smaller than the true model is better. From Tables 1 and 2, we can see that the selection probabilities and prediction errors of the CAIC were improved over those of the AIC in all situations. We simulated several other models and obtained similar results.

Table 3: Selection probability and estimated prediction error for the prostate cancer data. The candidate models combine the logistic and probit links with subsets of the predictors X-ray, Stage, Grade, Age, and Acid. [Table entries, including the PE_B row, not recovered.]

Next, for the purpose of analyzing real data with the GLM, we consider the data reported in Brown (1980), who discussed an experiment in which 53 prostate cancer patients underwent surgery so that their lymph nodes could be examined for evidence of cancer. The response variable indicates nodal involvement, and there were five predictor variables: X-ray, Stage, Age, Acid, and Grade. First, we assume that the response variable y_i is distributed according to B(1, p_i) (i = 1, ..., n). For the link function, we prepare two functions: the logistic link function and the probit link function. In this analysis, we select the link function and the variables simultaneously. Table 3 shows the selection probability of each model selected by minimizing the information criteria and the estimated prediction error of the best model selected by each criterion. We divide the data into a calibration sample and a validation sample. The sample sizes of the calibration sample and the validation sample were 43 and 10, respectively.
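The leave-10-out estimate of the prediction error described here can be sketched as follows; this is our own hedged illustration with synthetic binary data (the helper `fit_logistic` and all names are ours, and predicted probabilities are clipped to keep the held-out log-likelihood finite):

```python
import numpy as np

def fit_logistic(X, y, iters=50):
    # Newton-Raphson MLE for a Bernoulli GLM with logit link
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = p * (1 - p) + 1e-9
        beta += np.linalg.solve((X.T * W) @ X, X.T @ (y - p))
    return beta

def loo10_prediction_error(X, y, n_splits=100, seed=0):
    # repeatedly leave out 10 observations, fit on the rest,
    # and accumulate -2 log f(y_i) on the held-out block
    rng = np.random.default_rng(seed)
    n = len(y)
    total = 0.0
    for _ in range(n_splits):
        held = rng.choice(n, size=10, replace=False)
        mask = np.ones(n, bool)
        mask[held] = False
        beta = fit_logistic(X[mask], y[mask])
        p = 1.0 / (1.0 + np.exp(-(X[held] @ beta)))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        total += -2.0 * np.sum(y[held] * np.log(p) + (1 - y[held]) * np.log(1 - p))
    return total / n_splits

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(53), rng.binomial(1, 0.5, size=(53, 2))])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.0, 0.5])))))
print(loo10_prediction_error(X, y))
```

Whether one averages over splits or reports the raw sum is a normalization choice; only relative comparisons between criteria matter here.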
The best subset of the variables and the link function were selected by the criteria derived from the calibration sample. The selection probabilities were evaluated from only the calibration sample. The prediction errors were estimated as follows. Let d_j = (d_1j, ..., d_nj)' be an n-dimensional vector expressing a pattern of leaving out 10 observations at the jth iteration (j = 1, ..., 100), i.e., d_ij = 1 or 0 and Σ_{i=1}^n d_ij = 10. Moreover, we let β̂_{B,[−d_j]} denote
β̂_{[−d_j]} of β of the best model evaluated from the calibration sample, where β̂_{[−d_j]} is given as

    β̂_{[−d_j]} = arg max_β Σ_{i=1}^n (1 − d_ij) log f(y_i; β).

Finally, the estimated PE_B is given as

    PE_B = (1/100) Σ_{j=1}^{100} Σ_{i=1}^n d_ij { −2 log f(y_i; β̂_{B,[−d_j]}) }.

Table 3 indicates that the models selected by the AIC were spread over a wider range than those selected by the CAIC, although the model selected most often by the AIC is the same as that selected most often by the CAIC. In particular, the selection probability of the model selected most often by the CAIC is much higher than the corresponding probability for the AIC. The estimated prediction error of the CAIC was smaller than that of the AIC. Thus, the CAIC can be considered to have improved the accuracy of the original AIC. Consequently, from Tables 1, 2, and 3, we recommend the use of the CAIC rather than the AIC for selecting variables in GLMs.

6. CONCLUSION

In the present paper, we derived a simple formula for the CAIC in the GLM. All of the coefficients in our formula are obtained from the first through fourth derivatives of the log-likelihood function. The GLM can express a number of statistical models by changing the distribution and the link function and can be easily fitted to real data using the function glm in the R software, which implements several distributions and link functions. Hence, the present result is useful for real data analysis. Moreover, the numerical studies revealed that the CAIC performs better than the original AIC. We presented explicit forms of the CAIC for all of the models that are implemented in glm; these explicit forms should be useful in real data analysis. Even if a researcher wants to use the CAIC in a model for which an explicit form has not yet been derived, the researcher can easily obtain the CAIC using formula manipulation software. In the present paper, we have dealt primarily with variable selection.
However, in the simulation and in the real data analysis, we also considered the selection of the link function. If we choose the link function by minimizing the original AIC, the optimal link function is determined only by maximizing the log-likelihood function, because the penalty term of the AIC depends only on the number of parameters, which is common to all candidate link functions. On the other hand, if we use the CAIC to select the link function, the optimal link function is not determined only by maximizing the log-likelihood function, because the bias-correction term itself depends on the link function. Thus, using the CAIC will allow us to select a more appropriate link function. As mentioned above, we confirm that the present results are useful and user friendly.
APPENDIX

A.1 Derivation of the Third Term of (3.4)

In order to calculate the moments of b_1 and Z_2, we rewrite the third term of (3.4) using b_1, Z_2, and M_3 as

    (2/n) E[ Σ_{i=1}^n (y_i − d_i1) { c_i1 x_i'b_2 + (1/2) c_i2 (x_i'b_1)² } ]
        = −(1/√n) E[ b_1'M_3(b_1 ⊗ b_1) ] − (3/√n) E[ b_1'Z_2 b_1 ],   (A.1)

where c_ik and d_ik are defined in (2.5), and U_ij is defined in (3.8). Let φ_{b_1}(t) be the characteristic function of the distribution of b_1, defined as

    φ_{b_1}(t) = E[ exp(i t'b_1) ] = ∏_{j=1}^n E[ exp{ i(y_j − d_j1)s_j } ],   s_j = (1/√n) c_j1 t'M_2^{-1} x_j,

where t = (t_1, ..., t_p)'. Note that E[exp{i(y − µ)s}] is the characteristic function of y − µ, which is expressed as

    E[ exp{ i(y − µ)s } ] = exp{ b(θ + is) − b(θ) − iµs }.

Therefore, we have

    φ_{b_1}(t) = ∏_{j=1}^n exp{ b(θ_j + is_j) − b(θ_j) − i d_j1 s_j }.

Based on the property of a random variable with mean zero, the third moment is equivalent to the third cumulant. Since s_j = O(n^{-1/2}), log φ_{b_1}(t) can be expanded as

    log φ_{b_1}(t) = Σ_{j=1}^n { b(θ_j + is_j) − b(θ_j) − i d_j1 s_j }
        = Σ_{j=1}^n { (1/2) d_j2 (is_j)² + (1/6) d_j3 (is_j)³ + (1/24) d_j4 (is_j)⁴ } + O(n^{-3/2}).

Thus, the third cumulant of b_1 = (b_11, ..., b_1p)' is computed through the derivatives of log φ_{b_1}(t), i.e.,

    E[ b_1α1 b_1α2 b_1α3 ] = i^{-3} ∂³ log φ_{b_1}(t) / ∂t_α1 ∂t_α2 ∂t_α3 |_{t = 0_p}
        = Σ_{j=1}^n d_j3 (∂s_j/∂t_α1)(∂s_j/∂t_α2)(∂s_j/∂t_α3) + O(n^{-3/2}).

Note that

    ∂s_j/∂t_αl = (1/√n) c_j1 e_αl' M_2^{-1} x_j,
where e_j is the p-dimensional vector whose jth element is 1 and whose other elements are 0. Thus, using Equations (2.5) and (3.8), we obtain

    (1/√n) E[ b_1'M_3(b_1 ⊗ b_1) ] = (1/n³) Σ_{i,j} d_j3 c_j1³ (d_i3 c_i1³ + 3 d_i2 c_i1 c_i2) U_ij³ + O(n^{-2}).   (A.2)

Let φ_{b_1,Z_2}(t, T) denote the characteristic function of the joint distribution of b_1 and Z_2,

    φ_{b_1,Z_2}(t, T) = ∏_{j=1}^n exp{ b(θ_j + iv_j) − b(θ_j) − i d_j1 v_j },

where T = (T_ij) (i, j = 1, ..., p) and

    v_j = (1/√n) { c_j1 t'M_2^{-1}x_j − c_j2 x_j'T x_j }.

In the same manner as in the calculation of log φ_{b_1}(t), we have

    log φ_{b_1,Z_2}(t, T) = Σ_{j=1}^n { b(θ_j + iv_j) − b(θ_j) − i d_j1 v_j }
        = Σ_{j=1}^n { (1/2) d_j2 (iv_j)² + (1/6) d_j3 (iv_j)³ + (1/24) d_j4 (iv_j)⁴ } + O(n^{-3/2}).

Note that

    ∂v_k/∂t_i = (1/√n) c_k1 e_i'M_2^{-1}x_k,   ∂v_k/∂T_ij = −(1/√n) c_k2 (e_i'x_k)(e_j'x_k).

Hence, we obtain

    (1/√n) E[ b_1'Z_2 b_1 ] = −(1/n²) Σ_i d_i3 c_i1² c_i2 U_ii² + O(n^{-2}).   (A.3)

Substituting (A.2) and (A.3) into (A.1), the third term of (3.4) is given by (3.7).

A.2 Derivation of the Fourth Term of (3.4)

In order to use the asymptotic properties, we express the fourth term of (3.4) in terms of b_1, Z_2, Z_3, M_2, M_3, and M_4 as

    (2/(n√n)) E[ Σ_{i=1}^n (y_i − d_i1) { c_i1 x_i'b_3 + c_i2 (x_i'b_1)(x_i'b_2) + (1/6) c_i3 (x_i'b_1)³ } ]
        = (1/n) E[ (b_1⊗b_1)'M_3'M_2^{-1}M_3(b_1⊗b_1) ] − (1/(3n)) E[ (b_1⊗b_1)'M_4(b_1⊗b_1) ]
          + (3/n) E[ (b_1⊗b_1)'M_3'M_2^{-1}Z_2 b_1 ] + (4/n) E[ b_1'Z_2 M_2^{-1} Z_2 b_1 ] − (4/(3n)) E[ b_1'Z_3(b_1⊗b_1) ].   (A.4)
18 Let φ b,z 2,Z 3 ( ) denote the characteristic function of the joint distribution for b, Z 2, and Z 3 defined by { ( ) b(θj + ir j ) b(θ j ) φ b,z 2,Z 3 (t, T 2, T 3 ) = exp id j r j, where T 2 = (T (2) ij ) (i, j =,..., p), T 3 = (T (3) ) (i, j, k =,..., p) and r j = j= ijk n ( c j t M 2 x j + c j2 x jt 2 x j + c j3 x jt 3 (x j x j )). In order to simplify the calculations, we define the following notation: τ kj = k! (id jkr j ) k, 2 τ 2α κ ij = = t α= i t j n m= 2 τ 2α κ i,kl = = α= t i T (2) n jk m= 2 τ 2α κ ik,jl = = α= T (2) (2) ik T n jl κ i,ijk = α= 2 τ 2α t i T (3) ijk = n m= d m2 c 2 m(e im 2 x m )(e jm 2 x m ), (A.5) d m2 c m c m2 (e im 2 x m )(e kx m )(e lx m ), (A.6) d m2 c 2 m2(e ix m )(e kx m )(e jx m )(e lx m ), m= Using the derivations of ϕ b,z 2,Z 3, the first term of (A.4) is given by E[(b b ) M 3M 2 M 3 (b b )] = = (A.7) d m2 c m c m3 (e im 2 x m )(e ix m )(e jx m )(e kx m ). (A.8) p [M 3M 2 M 3 ] i,j,k,l E[b i b j b k b l ] i,j,k,l p [M 3M 2 4 φ b,z M 3 ] 2,Z 3 (t, T 2, T 3 ) i,j,k,l t i t j t k t l i,j,k,l t=0p,t 2 =0,T 3 =0 By applying a Taylor expansion, we obtain { ( ) 4 b(θj + ir j ) b(θ j ) exp id j r j t i t j t k t l j= t=0p,t2=0,t3=0 { 4 = exp (τ 2j + τ 3j + τ 4j ) + O(n ) 3/2 t i t j t k t l j= t=0 p,t 2 =0,T 3 =0 { 4 τ 4α = κ ij κ kl + κ ik κ jl + κ jk κ il + + O(n 2/3 ) exp { + O(n 3/2 ). t i t j t k t l Note that r j = O(n /2 ) and 4 τ 4α t i t j t k t l = α= α= 3 d α4 r α t i r α t j r α t k r α t l = O(n ). 8.
Hence, the first term of (A.4) is expressed as

$$
E[(b_1 \otimes b_1)' M_3' M_2^{-1} M_3 (b_1 \otimes b_1)]
= \sum_{i,j,k,l}^{p} [M_3' M_2^{-1} M_3]_{i,j,k,l} (\kappa_{ij} \kappa_{kl} + \kappa_{ik} \kappa_{jl} + \kappa_{jk} \kappa_{il}) + O(n^{-1}). \quad (A.9)
$$

By substituting (A.5) into (A.9), we obtain

$$
E[(b_1 \otimes b_1)' M_3' M_2^{-1} M_3 (b_1 \otimes b_1)]
= \frac{1}{n^2} \sum_{i,j} (d_{i3} c_{i1}^3 + 3 d_{i2} c_{i1} c_{i2})(d_{j3} c_{j1}^3 + 3 d_{j2} c_{j1} c_{j2})(U_{ii} U_{ij} U_{jj} + 2 U_{ij}^3) + O(n^{-1}). \quad (A.10)
$$

The remaining terms of (A.4) are calculated in the same manner as the first term. The second term of (A.4) is obtained similarly to (A.10) as

$$
E[(b_1 \otimes b_1)' M_4 (b_1 \otimes b_1)]
= \frac{3}{n} \sum_{i} (d_{i4} c_{i1}^4 + 6 d_{i3} c_{i1}^2 c_{i2} + 3 d_{i2} c_{i2}^2 + 4 d_{i2} c_{i1} c_{i3}) U_{ii}^2 + O(n^{-1}). \quad (A.11)
$$

Next, we calculate the third term of (A.4), which is expressed as

$$
E[(b_1 \otimes b_1)' M_3' M_2^{-1} Z_2 b_1]
= \sum_{i,j,k,l}^{p} [M_3' M_2^{-1}]_{i,j,k} \, E[b_{1i} b_{1j} b_{1l} Z_{2,kl}]
= \sum_{i,j,k,l}^{p} [M_3' M_2^{-1}]_{i,j,k} (\kappa_{ij} \kappa_{l,kl} + \kappa_{ik} \kappa_{j,kl} + \kappa_{jk} \kappa_{i,kl}) + O(n^{-1}). \quad (A.12)
$$

Expression (A.6) then implies that

$$
E[(b_1 \otimes b_1)' M_3' M_2^{-1} Z_2 b_1]
= \frac{1}{n^2} \sum_{i,j} (d_{i3} c_{i1}^3 + 3 d_{i2} c_{i1} c_{i2}) d_{j2} c_{j1} c_{j2} (U_{ii} U_{ij} U_{jj} + 2 U_{ij}^3) + O(n^{-1}). \quad (A.13)
$$

The fourth term of (A.4) is expressed as

$$
E[b_1' Z_2' M_2^{-1} Z_2 b_1]
= \sum_{i,j,k,l}^{p} [M_2^{-1}]_{jk} (\kappa_{il} \kappa_{ik,jl} + \kappa_{i,ij} \kappa_{l,kl} + \kappa_{i,kl} \kappa_{l,ij}) + O(n^{-1}). \quad (A.14)
$$

It follows from (A.7) and (A.14) that

$$
E[b_1' Z_2' M_2^{-1} Z_2 b_1]
= \frac{1}{n} \sum_{i} d_{i2} c_{i2}^2 U_{ii} + \frac{1}{n^2} \sum_{i,j} (d_{i2} c_{i1} c_{i2})(d_{j2} c_{j1} c_{j2})(U_{ii} U_{ij} U_{jj} + U_{ij}^3) + O(n^{-1}). \quad (A.15)
$$

Finally, we calculate the fifth term of (A.4). Note that

$$
E[b_1' Z_3 (b_1 \otimes b_1)]
= \sum_{i,j,k}^{p} E[Z_{3,ijk} b_{1i} b_{1j} b_{1k}]
= \sum_{i,j,k}^{p} (\kappa_{ij} \kappa_{k,ijk} + \kappa_{ik} \kappa_{j,ijk} + \kappa_{jk} \kappa_{i,ijk}) + O(n^{-1}). \quad (A.16)
$$
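The three-pairing pattern $\kappa_{ij}\kappa_{kl} + \kappa_{ik}\kappa_{jl} + \kappa_{jk}\kappa_{il}$ that recurs in these expansions is the standard Wick/Isserlis decomposition of a fourth-order moment of an asymptotically normal vector. As an illustrative check (a generic 2-dimensional symbolic covariance, not the paper's specific $\kappa$'s), the following sympy snippet differentiates the Gaussian moment generating function four times and confirms the pairing formula:

```python
import sympy as sp

# Wick/Isserlis check: if b is mean-zero Gaussian with Cov(b_i, b_j) = k_ij,
# then E[b_i b_j b_k b_l] = k_ij*k_kl + k_ik*k_jl + k_il*k_jk.
# We verify this by differentiating the MGF exp(t' K t / 2) four times.
t = sp.symbols('t0 t1')
kap = sp.Matrix(2, 2, lambda i, j: sp.Symbol(f'k{min(i, j)}{max(i, j)}'))
mgf = sp.exp(sp.Rational(1, 2) * sum(kap[a, b] * t[a] * t[b]
                                     for a in range(2) for b in range(2)))

def fourth_moment(i, j, k, l):
    # E[b_i b_j b_k b_l] = d^4 MGF / dt_i dt_j dt_k dt_l evaluated at t = 0
    expr = sp.diff(mgf, t[i], t[j], t[k], t[l])
    return expr.subs({t[0]: 0, t[1]: 0})

for (i, j, k, l) in [(0, 0, 1, 1), (0, 1, 0, 1), (0, 0, 0, 1)]:
    pairing = (kap[i, j] * kap[k, l] + kap[i, k] * kap[j, l]
               + kap[i, l] * kap[j, k])
    assert sp.simplify(fourth_moment(i, j, k, l) - pairing) == 0
print("Isserlis pairing verified")
```

The appendix's mixed moments such as $E[b_{1i} b_{1j} b_{1l} Z_{2,kl}]$ follow the same pairing logic, with the cross-covariances $\kappa_{i,kl}$ and $\kappa_{ik,jl}$ taking the place of $\kappa_{ij}$ for pairs involving $Z_2$.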
Substituting (A.8) into (A.16) yields

$$
E[b_1' Z_3 (b_1 \otimes b_1)] = \frac{3}{n} \sum_{i} d_{i2} c_{i1} c_{i3} U_{ii}^2 + O(n^{-1}). \quad (A.17)
$$

Consequently, from (A.10), (A.11), (A.13), (A.15), and (A.17), we obtain the fourth term of (3.4) as (3.9).

ACKNOWLEDGEMENTS

The authors would like to thank Dr. Mariko Yamamura of Hiroshima University and Dr. Eisuke Inoue of Kitasato University for their helpful comments and suggestions. The authors would also like to thank Dr. Kenichi Kamo of Sapporo Medical University for teaching us the R programming language.

REFERENCES

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (eds. Petrov, B. N. & Csáki, F.), 267-281, Akadémiai Kiadó, Budapest.

[2] Barnett, S. B. L. & Nurmagambetov, T. A. (2011). Costs of asthma in the United States: 2002-2007. J. Allergy Clin. Immunol., 127, 145-152.

[3] Brown, B. W. (1980). Prediction analysis for binary data. In Biostatistics Casebook (eds. Miller, R. G., Jr., Efron, B., Brown, B. W. & Moses, L. E.), 3-18, Wiley, New York.

[4] Davies, S. L., Neath, A. A. & Cavanaugh, J. E. (2006). Estimation optimality of corrected AIC and modified Cp in linear regression. Internat. Statist. Rev., 74, 161-168.

[5] Fahrmeir, L. & Kaufmann, H. (1985). Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. Ann. Statist., 13, 342-368.

[6] Hurvich, C. M. & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297-307.

[7] Kamo, K., Yanagihara, H. & Satoh, K. (2011). Bias-corrected AIC for selecting variables in Poisson regression models. Comm. Statist. Theory Methods (in press).

[8] Kullback, S. & Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Statist., 22, 79-86.

[9] Nelder, J. A. & Wedderburn, R. W. M. (1972). Generalized linear models. J. R. Statist. Soc. A, 135, 370-384.
[10] Matas, M., Picornell, A., Cifuentes, C., Payeras, A., Bassa, A., Homar, F., López-Labrador, F. X., Moya, A., Ramon, M. M. & Castro, J. A. (2010). Relating the liver damage with hepatitis C virus polymorphism in core region and human variables in HIV-1-coinfected patients. Infect. Genet. Evol., 10.

[11] Myers, R. H., Montgomery, D. C. & Vining, G. G. (2002). Generalized Linear Models with Applications in Engineering and the Sciences. Wiley-Interscience, New York.

[12] R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

[13] Sánchez-Carnero, N., Couñago, E., Rodríguez-Pérez, D. & Freire, J. (2011). Exploiting oceanographic satellite data to study the small scale coastal dynamics in a NE Atlantic open embayment. J. Marine Syst., 87.

[14] Shibata, R. (1980). Asymptotically efficient selection of the order of the model for estimating parameters of a linear process. Ann. Statist., 8, 147-164.

[15] Sugiura, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections. Commun. Statist. Theory Meth., A7, 13-26.

[16] Teste, F. P. & Lieffers, V. J. (2011). Snow damage in lodgepole pine stands brought into thinning and fertilization regimes. Forest Ecol. Manag., 261.

[17] Wong, C. S. & Li, W. K. (1998). A note on the corrected Akaike information criterion for threshold autoregressive models. J. Time Ser. Anal., 19, 113-124.

[18] Yanagihara, H., Sekiguchi, R. & Fujikoshi, Y. (2003). Bias correction of AIC in logistic regression models. J. Statist. Plann. Inference, 115, 349-360.
More informationInference For High Dimensional M-estimates. Fixed Design Results
: Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and
More information