MODEL SELECTION BASED ON QUASI-LIKELIHOOD WITH APPLICATION TO OVERDISPERSED DATA
J. Jpn. Soc. Comp. Statist., 26 (2013), DOI: /jjscs

Yiping Tang

ABSTRACT Overdispersion is a common phenomenon in discrete data analysis with generalized linear models, and one that cannot be ignored. In the literature on model selection with overdispersion, the mainstream approach is to use the maximum likelihood method, with the strong assumption that an entire distribution is specified for the data, to construct information criteria. For example, Fitzmaurice (1997) proposed model selection methods based on Efron's (1986) double-exponential families. However, the assumption of a parametric distribution for the data is sometimes too strong. In this paper, we propose a new model selection strategy based on quasi-likelihood functions. First, we define a statistical model through its first two moments, in the form of a mean and a variance. Second, we define the semiparametric information based on such semiparametric statistical models. Finally, we propose the semiparametric information criterion (SIC) based on quasi-likelihood estimators, with weak assumptions on the first two moments only. Although the proposed semiparametric information criterion is inspired by data with overdispersion, it applies more generally, no matter whether the data are discrete or continuous, and it may be useful for any model for which quasi-likelihood methods are appropriate. We also demonstrate the use of SIC through simulation studies and an application to the well-known data on germination of Orobanche.

1. Introduction

1.1. Overdispersed data

In many areas of application, such as biostatistics, life insurance and epidemiological studies, when data are analyzed with GLMs (including binomial and Poisson models), the phenomenon of overdispersion is widespread.
That is, under a standard exponential family, a GLM specifies a particular mean-variance relationship, but the variation of the data is greater than the variance predicted by that relationship.

Example 1 We use a familiar example introduced by McCullagh and Nelder (1989, p. 125) to illustrate the problem of overdispersion. Overdispersion commonly arises in cluster sampling. Clusters usually vary in size; for simplicity, we assume that the cluster size, k, is fixed and that the m individuals sampled actually come from m/k clusters. In the ith cluster, the number of positive respondents, Zᵢ, is assumed to have the binomial distribution with index k and parameter πᵢ, which is considered as a random variable and thus varies from cluster to cluster. The total number of positive respondents is Y = Z₁ + Z₂ + ⋯ + Z_{m/k}.

Graduate School of Science, Chiba University, 1-33 Yayoi-cho, Inage-ku, Chiba, Japan. tou.kihei@gmail.com
Key words: Akaike information criterion; Generalized linear models; Kullback-Leibler information; Model selection; Overdispersion; Quasi-likelihood
If we let E(πᵢ) = π and Var(πᵢ) = τ²π(1 − π), it may be shown that the unconditional mean and variance of Y are given by

E(Y) = mπ,  Var(Y) = mπ(1 − π){1 + (k − 1)τ²} = mπ(1 − π)σ².

Note that the above results depend only on the first two moments of πᵢ. Note also that the dispersion parameter σ² = 1 + (k − 1)τ² depends both on the cluster size and on the variability of π from cluster to cluster, but not on the sample size, m. Overdispersion can occur only if m > 1.

1.2. Modeling overdispersed data

In order to model overdispersion, many approaches have been proposed that use the maximum likelihood method under the assumption of a fully specified distribution for the data. Williams (1982) modified the standard logistic-linear analysis by using beta-binomial distributions to accommodate overdispersed binomial data. Lawless (1987) proposed negative-binomial regression models to deal with overdispersed Poisson data. Efron (1986) proposed the double-exponential family method, which adds a dispersion parameter to control the variance independently of the mean. Anderson et al. (1994) proposed the modified information criteria QAIC and QAICc, which are based on the quasi-likelihood, to deal with model selection for overdispersed data, and found that these criteria performed well in product multinomial models of capture-recapture data in the presence of differing levels of overdispersion. Hurvich and Tsai (1995) provided information on the use of AICc with overdispersed data. Fitzmaurice (1997) developed model selection strategies based on Efron's (1986) double-exponential families, so that the maximum likelihood method can still be used and model selection can be carried out in the generalized linear model framework. However, these methods share a common drawback in practical data analysis: the true distribution of the data can never be known, so the assumption of specifying an exact distribution is very strong.
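The mean-variance calculation above is easy to check by simulation. The sketch below uses illustrative values (not from the paper) and draws πᵢ from a Beta distribution, one convenient choice with the stated first two moments, for which τ² = 1/(c + 1) when the Beta concentration is c:

```python
import numpy as np

rng = np.random.default_rng(0)

m, k = 120, 6             # individuals and (fixed) cluster size
n_clusters = m // k
pi, c = 0.3, 9.0          # E(pi_i) = pi; Beta concentration c gives tau^2 = 1/(c+1)
tau2 = 1.0 / (c + 1.0)

reps = 200_000
# pi_i ~ Beta(c*pi, c*(1-pi)) has mean pi and variance tau2 * pi * (1-pi)
p = rng.beta(c * pi, c * (1 - pi), size=(reps, n_clusters))
# Z_i | pi_i ~ Binomial(k, pi_i); Y is the total over the m/k clusters
Y = rng.binomial(k, p).sum(axis=1)

mean_theory = m * pi
var_theory = m * pi * (1 - pi) * (1 + (k - 1) * tau2)   # = m*pi*(1-pi)*sigma^2
mean_emp, var_emp = Y.mean(), Y.var()
print(mean_emp, mean_theory)
print(var_emp, var_theory)   # inflated above the binomial value m*pi*(1-pi)
```

The empirical variance matches mπ(1 − π)σ² and clearly exceeds the nominal binomial variance mπ(1 − π).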
Wedderburn (1974) proposed an important method called the quasi-likelihood method. He added a dispersion parameter to control the variance independently of the mean, but weakened the assumption of a fully specified distribution: he used assumptions on the first two moments only, rather than on the entire distribution of the data. An attractive property of the quasi-likelihood function is that it behaves much like an ordinary likelihood function. When we deal with overdispersed data and use only the first two moments, maximum likelihood methods are not available; consequently, when a model selection problem arises, the maximum-likelihood-based information criterion AIC is no longer effective.

1.3. Contributions of the paper

The purpose of this paper is to extend GLMs to situations where we do not wish to impose any distributional assumption on the data; instead, we make assumptions on its first two moments. Using this semiparametric information, we proceed by extending the idea of Akaike (1974), using the Kullback-Leibler information to measure the discrepancy between the true distribution and the assumed semiparametric models. We call the resulting information criterion the semiparametric information criterion (SIC). One
major application is the use of SIC for analyzing overdispersed data.

The rest of the paper is organized as follows. In Section 2, we briefly review the quasi-likelihood method of Wedderburn (1974); the quasi-likelihood estimator is used in the subsequent sections for constructing predictive semiparametric models. Section 3 contains the main theoretical results: new information criteria based on the semiparametric models are developed there. In Section 4, we apply the proposed model selection techniques to the classical seed germination data of Crowder (1978); our results are consistent with most results in the literature. Finally, in Section 5, we present a series of simulation studies that demonstrate the potential usefulness of the proposed model selection methods and also indicate problems for future research.

2. Quasi-likelihood

Generalized linear models are widely used as a standard tool in modern regression analysis. They were first introduced by Nelder and Wedderburn (1972). Their estimation is based on the iteratively reweighted least squares algorithm, which requires only a relationship between the mean and the variance instead of the full distribution. Wedderburn (1974) extended the GLM by replacing the likelihood function with the quasi-likelihood function. McCullagh (1983) further developed the quasi-likelihood approach and established the consistency and asymptotic normality of the parameter estimates under second-moment assumptions. We briefly review the quasi-likelihood method here.

Suppose that the n components of the response vector Y are independent with mean vector µ and covariance matrix ϕV(µ); ϕ may be a known or unknown constant and V(·) is known. Since the components of Y are independent, the matrix V(µ) can be written in the form V(µ) = diag{V₁(µ), ..., Vₙ(µ)}.
A further assumption on V(µ) is that Vᵢ(µ) depends only on the ith component of µ. We first consider a single component Y of the response vector Y. Under the above conditions, we define

U = ∂Q(Y; µ)/∂µ = (Y − µ)/{ϕV(µ)}.  (1)

The function U has the following properties:

E(U) = 0,  Var(U) = 1/{ϕV(µ)},  E(−∂U/∂µ) = 1/{ϕV(µ)}.

The quasi-likelihood function Q(Y; µ) may therefore be written as

Q(Y; µ) = ∫_Y^µ (Y − t)/{ϕV(t)} dt.

We refer to Q(Y; µ) as the quasi-likelihood or, more correctly, as the log quasi-likelihood for µ. Since the components of Y are independent by assumption, the log quasi-likelihood for the complete data, Q(Y; µ), is the sum of the individual contributions. The quasi-likelihood estimating equations for the regression parameters β, obtained by differentiating Q(Y; µ) with respect to β, may be written in the form U(β) = 0, where

U(β) = DᵀV⁻¹(Y − µ)/ϕ  (2)
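The integral defining Q(Y; µ) can be evaluated in closed form for standard variance functions. As a numerical sketch (illustrative choices: the binomial-type variance function V(t) = t(1 − t) with ϕ = 1; the paper leaves V(·) general), the integral recovers the Bernoulli log-likelihood up to a term constant in µ:

```python
import numpy as np

# Numerical check of Q(y; mu) = ∫_y^mu (y - t) / {phi * V(t)} dt for V(t) = t(1-t), phi = 1.
def quasi_loglik(y, mu, phi=1.0, n_grid=200_001):
    t = np.linspace(y, mu, n_grid)
    f = (y - t) / (phi * t * (1.0 - t))
    return np.sum((f[1:] + f[:-1]) * np.diff(t)) / 2.0   # trapezoid rule

# For V(t) = t(1-t), integrating gives the Bernoulli log-likelihood up to a
# term constant in mu:
#   Q(y; mu) = y log(mu) + (1-y) log(1-mu) - {y log(y) + (1-y) log(1-y)}.
def closed_form(y, mu):
    return (y * np.log(mu) + (1 - y) * np.log(1 - mu)
            - y * np.log(y) - (1 - y) * np.log(1 - y))

y, mu = 0.3, 0.6
print(quasi_loglik(y, mu), closed_form(y, mu))   # both ≈ -0.1838
```

By construction Q(y; µ) ≤ 0 and is maximized at µ = y, exactly as a log-likelihood centered at its maximum would be.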
is called the quasi-score function. In this expression, the components of D, of order n × q, are D_ir = ∂µᵢ/∂β_r, the derivatives of µ(β) with respect to the parameters. The covariance matrix of U(β), which is also the negative expected value of ∂U(β)/∂β, is

i_β = DᵀV⁻¹D/ϕ.

For quasi-likelihood functions, this matrix plays a role similar to that of the Fisher information for ordinary likelihood functions.

Let yᵢ (i = 1, ..., n) be independent response variables, which may be proportions or counts, with mean µᵢ and variance ϕV(µᵢ). In addition, suppose that E[yᵢ] = µᵢ = t⁻¹(ηᵢ), where ηᵢ = xᵢ′β, t⁻¹(·) is the inverse link function, xᵢ is a q-vector of covariates, β is a q-vector of unknown parameters β = (β₁, ..., β_q)′, and ϕ does not depend on β. Let X be the n × q matrix of covariates with ith row X₍ᵢ₎, θ = (β, ϕ)′ ∈ Θ ⊂ R^{q+1} and q < n. Since the above model depends only on assumptions about the mean and variance, we may use the quasi-likelihood method to estimate the unknown parameters θ. For estimating β, we use the quasi-score functions

U₁ⱼ(β, ϕ) = Σᵢ₌₁ⁿ [ {yᵢ − t⁻¹(X₍ᵢ₎β)} / {ϕV(t⁻¹(X₍ᵢ₎β))} ] Δᵢ Xᵢⱼ,  for j = 1, ..., q.  (3)

For estimating ϕ, we use the second-moment function

U₂(β, ϕ) = Σᵢ₌₁ⁿ [ {yᵢ − t⁻¹(X₍ᵢ₎β)}² / {ϕV(t⁻¹(X₍ᵢ₎β))} ] − (n − q − 1),  (4)

where X₍ᵢ₎ is the ith row of X and Δᵢ = ∂t⁻¹(ηᵢ)/∂ηᵢ. The quasi-likelihood estimators (β̂, ϕ̂) are the simultaneous solutions of the q + 1 equations U₁ⱼ(β̂, ϕ̂) = 0, for j = 1, ..., q, and U₂(β̂, ϕ̂) = 0. According to the results of McCullagh (1983), the numerical value of β̂ is not affected by ϕ, whether ϕ is known or not, so U₁ⱼ(β, ϕ) can be viewed as a function of β alone.
We may rewrite the quasi-likelihood equations in matrix form as

U₁(β̂, ϕ) = DᵀV⁻¹{Y − t⁻¹(Xβ̂)} = 0,  (5)

where Y is an n × 1 vector; t⁻¹(Xβ̂) = (t⁻¹(X₍₁₎β̂), ..., t⁻¹(X₍ₙ₎β̂))′; D is an n × q matrix with components Dᵢⱼ = ΔᵢXᵢⱼ, for i = 1, ..., n and j = 1, ..., q; and V is an n-dimensional diagonal matrix with entries ϕV(t⁻¹(X₍ᵢ₎β)), for i = 1, ..., n. The asymptotic covariance matrix of β̂ is cov(β̂) = (DᵀV⁻¹D)⁻¹. Beginning with an arbitrary value β₀ sufficiently close to β, the sequence of parameter estimates defined by the Newton-Raphson method with Fisher scoring is

β₁ = β₀ + (D₀ᵀV₀⁻¹D₀)⁻¹ D₀ᵀV₀⁻¹{Y − t⁻¹(Xβ₀)}.  (6)

The quasi-likelihood estimator β̂ is obtained by iterating until convergence. Approximate unbiasedness and asymptotic normality of β̂ follow directly from (6) under the second-moment assumptions. Let θ̂ = (β̂, ϕ̂)′. Moore (1986) established consistency and asymptotic normality of θ̂.
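As a concrete illustration, the Fisher-scoring iteration (6) can be sketched for a logistic mean with quasi-binomial variance ϕµᵢ(1 − µᵢ)/nᵢ. All data and settings below are simulated and illustrative, not from the paper; note that ϕ cancels from the scoring step, and the dispersion is estimated afterwards from the Pearson-type equation (4):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 200, 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, -1.0])
pi_true = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(trials, pi_true) / trials       # observed proportions

beta = np.zeros(2)
for _ in range(50):
    eta = X @ beta
    mu = 1 / (1 + np.exp(-eta))
    Delta = mu * (1 - mu)                  # d mu / d eta for the logit link
    D = Delta[:, None] * X                 # D_ij = Delta_i * X_ij
    Vinv = trials / (mu * (1 - mu))        # V(mu_i) = mu(1-mu)/n_i; phi cancels
    W = D.T * Vinv                         # D^T V^{-1}
    step = np.linalg.solve(W @ D, W @ (y - mu))   # Fisher-scoring update (6)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

# Pearson-type estimate of the dispersion phi, cf. equation (4)
q = X.shape[1]
phi_hat = np.sum((y - mu) ** 2 * Vinv) / (n - q - 1)
print(beta, phi_hat)
```

With purely binomial data, as here, ϕ̂ should be close to 1; overdispersed data would drive it above 1 while leaving β̂ essentially unchanged.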
3. Semiparametric Model Selection Criteria

In this section, we propose the semiparametric model selection criteria SIC and SIC_T. They apply generally, no matter whether the data are discrete or continuous. For convenience, we suppose in the subsequent paragraphs that the data are continuous; if they are discrete, integration is replaced by summation.

3.1. Semiparametric models

We assume that the true probability density function of Y is h(y) and the true distribution function is H(y). We relax the assumption of a complete specification of the probability model, using only a general form of the variance function in terms of the mean, as in GLMs. Now suppose that the statistician assumes that the random variables yᵢ (i = 1, ..., n) have distribution G(·), density function g(·), mean µᵢ(β) and variance ϕV(µᵢ(β)). The first two moments can be combined as

m(yᵢ, θ) = m(yᵢ, β, ϕ) = ( yᵢ − µᵢ(β), (yᵢ − µᵢ(β))² − ϕV(µᵢ(β)) )′ = (m₁ᵢ, m₂ᵢ)′.

By assumption, we have E_G[m(yᵢ, θ)] = ∫ m(y, θ) dG(y) = 0. Let M denote the set of all possible distribution functions, and let

G(θ) = { G ∈ M : ∫ m(y, θ) dG(y) = 0 }.

Define G = ∪_{θ∈Θ} G(θ) as the set of all distribution functions satisfying the moment conditions. The class G is the statistical model we consider in this paper.

3.2. Kullback-Leibler information

Extending the idea of AIC, we generalize the data-based model selection approach, based again on the Kullback-Leibler information, which we use to measure the closeness between H(y) and the assumed semiparametric model G(·). Let g vary in the statistical model class G. We consider the constrained minimization problem

v(β, ϕ) = inf_{g∈G} ∫ log{h(y)/g(y)} h(y) dy  (7)

subject to ∫ m(y, β, ϕ)g(y) dy = 0 and ∫ g(y) dy = 1.
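To make the moment conditions concrete: under a correctly specified model, both components of m(yᵢ, θ) have mean zero. A minimal numerical sketch (the Gaussian data and the constant variance function V ≡ 1 are illustrative choices; the paper leaves the distribution unspecified):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=2.0, scale=1.5, size=100_000)

mu, phi = 2.0, 1.5 ** 2       # correctly specified first two moments
V = 1.0                       # illustrative constant variance function
m1 = y - mu                   # first moment condition
m2 = (y - mu) ** 2 - phi * V  # second moment condition
print(m1.mean(), m2.mean())   # both close to zero under correct specification
```

Misspecifying either µ or ϕ would shift the corresponding empirical moment away from zero, which is exactly what the information defined next detects.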
Using results from convex analysis (e.g., Kitamura, 2006), the solution of the infinite-dimensional optimization problem v(β, ϕ) can be written equivalently as the solution of the following finite-dimensional optimization problem v*(β, ϕ):

v*(β, ϕ) = sup_{λ,r} [ λ + ∫ log(−λ − r′m(y, β, ϕ)) h(y) dy ].  (8)
Since we are only interested in the maximization of v*(β, ϕ) with respect to β and ϕ, we can ignore the remaining terms and proceed to define the semiparametric information of the statistical model G as follows.

Definition 1 The semiparametric Kullback-Leibler information for the quasi-likelihood function is defined as

SKL(θ) = sup_{r∈R²} E_H[ρ(y, θ, r)] = sup_{r∈R²} ∫ ρ(y, θ, r) h(y) dy,  (9)

where ρ(y, θ, r) = log(1 + r′m(y, β, ϕ)).

Since the true distribution function H(·) is unknown, we use the empirical distribution of the data to estimate it.

3.3. Regularity conditions

In this paper, we use SKL(θ) to select an appropriate semiparametric model, considering both the correctly specified case and the misspecified case. A semiparametric model is called correctly specified if the assumed moments, evaluated at a particular value of the parameter, are equal to the moments of the underlying true distribution. First, we state the regularity conditions used in the sequel to prove the asymptotic unbiasedness of the proposed information criteria.

C1 Define θ* = (β*, ϕ*)′ as the pseudo-true value, that is, the value at which the semiparametric Kullback-Leibler information SKL(θ) is minimized.

C2 ρ(y, θ, r) is twice continuously differentiable in θ ∈ Θ and r ∈ R². Note that C2 implies that

r(θ) = arg sup_{r∈R²} ∫ ρ(y, θ, r) h(y) dy  (10)

is continuous with respect to θ. Furthermore, from C1 we know that there is a unique r* corresponding to θ*:

r* = arg sup_{r∈R²} ∫ ρ(y, θ*, r) h(y) dy.  (11)

C3 The problem inf_{θ∈Θ} SKL(θ) has a unique saddle point solution (θ*, r*), where r* is defined in (11). When the model G is correctly specified, there exists a unique (θ*, r*) satisfying SKL(θ*) = 0.
C4 The following law of large numbers holds:

sup_{θ∈Θ} sup_{r∈R²} | (1/n) Σᵢ₌₁ⁿ ρ(yᵢ, θ, r) − E_H[ρ(y, θ, r)] | = o_p(1).

3.4. The expected information

When we consider SKL(θ) as defined in (9), the minimizer of SKL(θ) can be viewed as the solution to the following saddle point problem:

θ* = arg min_{θ∈Θ} sup_{r∈L(θ)} E_H[ρ(y, θ, r)],  (12)
where Θ denotes the parameter space for θ, L(θ) = {r : r′m(y, θ) ∈ I}, and I is an open interval containing zero. Next we describe the process of solving this saddle point problem. First, the expectation of ρ(y, θ, r) is maximized over r for given θ, so that r(θ) satisfies

E_H[ρ₁(y, θ, r) m(y, θ)] = 0,

where ρ₁(y, θ, r) = 1/(1 + r′m(y, θ)) is the first derivative of ρ with respect to r′m(y, θ).

C5 Assume that ∫ ρ(y, θ, r)h(y) dy and ∫ ρ₁(y, θ, r)h(y) dy are differentiable under the integral sign.

Second, θ* is the minimizer of the quantity E_H[ρ(y, θ, r(θ))], implying that

E_H[ρ₁(y, θ*, r*) M(y, θ*)′ r(θ*)] = 0,

where M(y, θ) = ∂m(y, θ)/∂θ′. Now denote the vector of first derivatives of ρ(y, θ, r) with respect to θ and r by

Υ(θ, r) = ( ρ₁(y, θ, r) M(y, θ)′r, ρ₁(y, θ, r) m(y, θ) )′.

The values (θ*, r(θ*)) must satisfy the first-order condition

E_H[Υ(θ*, r(θ*))] = 0.  (13)

C6 Define

Q = E_H[ ∂Υ(θ*, r*)/∂(θ, r)′ ],  S = E_H[ Υ(θ*, r*)Υ(θ*, r*)′ ].

Assume that Q and S are finite and that S is invertible.

However, θ* is not computable because it involves the expectation with respect to the true density function h(y). One way forward is to use the value that minimizes the empirical version of the expected semiparametric Kullback-Leibler information:

θ̃ = arg min_{θ∈Θ} { (1/n) Σᵢ₌₁ⁿ ρ(yᵢ, θ, r̂(θ)) },  (14)

where r(θ) is defined in (10). To approximate r(θ), we use the empirical distribution function in place of the true distribution function H(·) to obtain

r̂(θ) = arg max_{r∈R²} { (1/n) Σᵢ₌₁ⁿ ρ(yᵢ, θ, r) }.  (15)

We can show that (θ̃, r̂(θ̃)) is consistent and asymptotically normally distributed along lines similar to Chen et al. (2007); we omit the proof here. The calculation of (θ̃, r̂(θ̃)) is, however, unstable and usually difficult. Since the quasi-likelihood estimator θ̂ defined in Section 2 is also consistent, it is reasonable to use (θ̂, r̂(θ̂)) instead.
It is easy to show that (θ̂, r̂(θ̂)) also possesses the properties of consistency and asymptotic normality.
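The inner maximization (15) is a smooth concave problem in r and can be solved by Newton's method with step halving. The sketch below is entirely illustrative (an intercept-only overdispersed-binomial sample, with θ̂ obtained by the method of moments): it computes r̂(θ̂) for two candidate moment models, one that wrongly fixes ϕ = 1 and one that estimates ϕ, and evaluates Σᵢ ρ(yᵢ, θ̂, r̂(θ̂)) + k, the criterion proposed as SIC in Section 3.5, for each. The misspecified variance model is heavily penalized through the goodness-of-fit term:

```python
import numpy as np

def rho_sum(r, m):
    """sum_i log(1 + r' m_i), or -inf outside the domain 1 + r'm_i > 0."""
    u = 1.0 + m @ r
    return np.sum(np.log(u)) if np.all(u > 1e-12) else -np.inf

def solve_r(m, n_iter=200):
    """Newton ascent with step halving for r_hat(theta) in equation (15)."""
    r = np.zeros(m.shape[1])
    for _ in range(n_iter):
        u = 1.0 + m @ r
        grad = (m / u[:, None]).sum(axis=0)
        hess = -(m[:, :, None] * m[:, None, :] / (u ** 2)[:, None, None]).sum(axis=0)
        step = np.linalg.solve(hess, -grad)      # Newton direction
        t = 1.0
        while rho_sum(r + t * step, m) < rho_sum(r, m) and t > 1e-12:
            t /= 2.0                             # halve the step until feasible/improving
        r = r + t * step
        if np.max(np.abs(grad)) < 1e-8:
            break
    return r

# Illustrative overdispersed data: an intercept-only beta-binomial sample.
rng = np.random.default_rng(4)
n, n_i, pi_true, c = 300, 20, 0.4, 10.0
P = rng.beta(c * pi_true, c * (1 - pi_true), size=n)
y = rng.binomial(n_i, P) / n_i

mu = y.mean()                                  # method-of-moments fit (intercept only)
Vmu = mu * (1 - mu) / n_i                      # binomial-type variance function
phi = np.sum((y - mu) ** 2 / Vmu) / (n - 2)    # Pearson-type dispersion, cf. (4)

def sic(m, k):
    r_hat = solve_r(m)
    return rho_sum(r_hat, m) + k

m_binom = np.column_stack([y - mu, (y - mu) ** 2 - Vmu])        # phi fixed at 1, k = 1
m_quasi = np.column_stack([y - mu, (y - mu) ** 2 - phi * Vmu])  # phi estimated,  k = 2
sic_binom, sic_quasi = sic(m_binom, 1), sic(m_quasi, 2)
print(sic_binom, sic_quasi)   # the model ignoring overdispersion scores much worse
```

Under the model with estimated ϕ, the empirical moments are nearly zero, r̂ is close to the origin and the criterion is dominated by the penalty k; under the ϕ = 1 model, the second moment condition is systematically violated and the goodness-of-fit term grows with n.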
Definition 2 In the statistical model G, the expected semiparametric information is defined as

E_H[SKL(θ̂)] = E_H[ ∫ ρ(y, θ̂, r̂(θ̂)) h(y) dy ].  (16)

For later use, we write out the components of Υ(θ, r). We have

M(y, θ) = ∂m(y, θ)/∂θ′ = [ M₁ M₃ ; M₂ M₄ ],

where

M₁ = ∂m₁/∂β = −∂µᵢ(β)/∂β,
M₂ = ∂m₂/∂β = 2(µᵢ(β) − y) ∂µᵢ/∂β − ϕ{∂V(µᵢ(β))/∂µᵢ} ∂µᵢ/∂β,
M₃ = ∂m₁/∂ϕ = 0,
M₄ = ∂m₂/∂ϕ = −V(µᵢ(β)).

3.5. The proposed information criteria

Definition 3 Let k be the dimension of θ. We define the information criterion

SIC = Σᵢ₌₁ⁿ ρ(yᵢ, θ̂, r̂(θ̂)) + k,  (17)

where θ̂ = (β̂, ϕ̂)′, with β̂ the quasi-likelihood estimator and ϕ̂ the estimator based on the usual Pearson residuals. We call this the semiparametric information criterion, or SIC for short.

The proposed criterion SIC has the following property.

Proposition 1 Suppose that the model is correctly specified and that (C1)-(C6) hold. Then SIC is an asymptotically unbiased estimator of the expected information.

Proof. Under C4, we know that as n goes to infinity, (1/n)Σᵢ₌₁ⁿ ρ(yᵢ, θ̂, r̂(θ̂)) converges in probability to E_H[ρ(y, θ̂, r̂(θ̂))]. The expected information can therefore be decomposed as

E_H[SKL(θ̂)]
 = E_H[ E_H[ρ(y, θ̂, r̂(θ̂)) | θ̂] ]
 = { E_H[ E_H[ρ(y, θ̂, r̂(θ̂)) | θ̂] ] − E_H[ E_H[ρ(y, θ*, r(θ*))] ] }
 + { E_H[ E_H[ρ(y, θ*, r(θ*))] ] − E_H[ (1/n)Σᵢ₌₁ⁿ ρ(yᵢ, θ*, r(θ*)) ] }
 + { E_H[ (1/n)Σᵢ₌₁ⁿ ρ(yᵢ, θ*, r(θ*)) ] − E_H[ (1/n)Σᵢ₌₁ⁿ ρ(yᵢ, θ̂, r̂(θ̂)) ] }
 + E_H[ (1/n)Σᵢ₌₁ⁿ ρ(yᵢ, θ̂, r̂(θ̂)) ]
 = E_H[ (1/n)Σᵢ₌₁ⁿ ρ(yᵢ, θ̂, r̂(θ̂)) ] + D₁ + D₂ + D₃,

where D₁, D₂ and D₃ denote the three bracketed differences, in order. We now compute these quantities. First, consider D₁. By Taylor's theorem, together with C5 and C6, expanding ρ(y, θ̂, r̂(θ̂)) around (θ*, r*) up to the second order, we have
ρ(y, θ̂, r̂(θ̂)) ≐ ρ(y, θ*, r*) + ( (θ̂ − θ*)′, (r̂(θ̂) − r*)′ ) ∂ρ(y, θ, r)/∂(θ, r) |_{(θ*,r*)}
 + (1/2) ( (θ̂ − θ*)′, (r̂(θ̂) − r*)′ ) ∂²ρ(y, θ, r)/∂(θ, r)² |_{(θ*,r*)} ( (θ̂ − θ*)′, (r̂(θ̂) − r*)′ )′.

Taking expectations of both sides with respect to the true distribution function H(y), we obtain

E_H[ ∂ρ(y, θ, r)/∂(θ, r) |_{(θ*,r*)} ] = E_H[Υ(θ*, r*)] = 0.

Thus we deduce that

D₁ = E_H[ E_H[ρ(y, θ̂, r̂(θ̂)) | θ̂] ] − E_H[ E_H[ρ(y, θ*, r(θ*))] ]
 = (1/2) E_H[ E_H[ ( (θ̂ − θ*)′, (r̂(θ̂) − r*)′ ) ∂²ρ(y, θ, r)/∂(θ, r)² |_{(θ*,r*)} ( (θ̂ − θ*)′, (r̂(θ̂) − r*)′ )′ ] ] + o_p(1).

Next, we consider a second-order Taylor expansion for D₃, obtaining

ρ(y, θ*, r(θ*)) ≐ ρ(y, θ̂, r̂(θ̂)) + ( (θ* − θ̂)′, (r(θ*) − r̂(θ̂))′ ) ∂ρ(y, θ, r)/∂(θ, r) |_{(θ̂, r̂(θ̂))}
 + (1/2) ( (θ* − θ̂)′, (r(θ*) − r̂(θ̂))′ ) ∂²ρ(y, θ, r)/∂(θ, r)² |_{(θ̂, r̂(θ̂))} ( (θ* − θ̂)′, (r(θ*) − r̂(θ̂))′ )′.

The consistency of (θ̂, r̂(θ̂)) implies that

E_H[ ∂ρ(y, θ, r)/∂(θ, r) |_{(θ̂, r̂(θ̂))} ] = 0.

These results jointly imply that

D₃ = E_H[ E_H[ρ(y, θ*, r(θ*)) | θ̂] ] − E_H[ E_H[ρ(y, θ̂, r̂(θ̂)) | θ̂] ]
 = (1/2) E_H[ E_H[ ( (θ* − θ̂)′, (r(θ*) − r̂(θ̂))′ ) ∂²ρ(y, θ, r)/∂(θ, r)² |_{(θ̂, r̂(θ̂))} ( (θ* − θ̂)′, (r(θ*) − r̂(θ̂))′ )′ ] ] + o_p(1).

Finally, we consider D₂. By the law of large numbers, D₂ = 0. In addition, the quadratic forms in both D₁ and D₃ converge to centrally distributed chi-square random variables with k degrees of freedom. This completes the proof.

For a possibly misspecified model, we propose the following criterion for model selection.
Definition 4 Consider a possibly misspecified semiparametric model G as defined in Section 3.1. Let

Ŝ = (1/n) Σᵢ₌₁ⁿ Υᵢ(θ̂, r̂(θ̂)) Υᵢ(θ̂, r̂(θ̂))′,  Q̂ = (1/n) Σᵢ₌₁ⁿ ∂Υᵢ(θ, r)/∂(θ, r)′ |_{(θ̂, r̂(θ̂))}.

We define the information criterion

SIC_T = Σᵢ₌₁ⁿ ρ(yᵢ, θ̂, r̂(θ̂)) + trace(ŜQ̂⁻¹).  (18)

The proposed criterion SIC_T has the following property.

Proposition 2 Under the regularity conditions (C1)-(C6) listed in Sections 3.3 and 3.4, SIC_T is an asymptotically unbiased estimator of the expected information. SIC_T can be understood as a semiparametric extension of the Takeuchi information criterion for parametric likelihood analysis (Takeuchi, 1976).

Proof. Under (C1)-(C6), we obtain

(1/n) Σᵢ₌₁ⁿ Υᵢ(θ*, r*)Υᵢ(θ*, r*)′ →_P E_H[Υ(θ*, r*)Υ(θ*, r*)′] = S

and

(1/n) Σᵢ₌₁ⁿ ∂Υᵢ(θ, r)/∂(θ, r)′ |_{(θ*,r*)} →_P E_H[ ∂Υ(θ*, r*)/∂(θ, r)′ ] = Q.

Furthermore, using the consistency and asymptotic normality of (θ̂, r̂(θ̂)),

√n ( (θ̂ − θ*)′, (r̂(θ̂) − r*)′ )′ → N(0, Q⁻¹SQ⁻¹),

we obtain

2n E_H[ E_H[ρ(y, θ̂, r̂(θ̂)) | θ̂] − ρ(y, θ*, r*) ]
 = n E_H[ E_H[ ( (θ̂ − θ*)′, (r̂(θ̂) − r*)′ ) ∂²ρ(y, θ, r)/∂(θ, r)² |_{(θ*,r*)} ( (θ̂ − θ*)′, (r̂(θ̂) − r*)′ )′ ] ]
 = trace( E_H[ ∂²ρ(y, θ, r)/∂(θ, r)² |_{(θ*,r*)} ] · n E_H[ ( (θ̂ − θ*)′, (r̂(θ̂) − r*)′ )′ ( (θ̂ − θ*)′, (r̂(θ̂) − r*)′ ) ] )
 →_P trace( Q(Q⁻¹SQ⁻¹) ) = trace(SQ⁻¹),
where S and Q can be written explicitly as

S = E_H[ Υ(θ*, r*)Υ(θ*, r*)′ ]
 = E_H[ ( ρ₁M(y, θ*)′r* ; ρ₁m(y, θ*) ) ( ρ₁r*′M(y, θ*), ρ₁m(y, θ*)′ ) ]
 = E_H[ [ ρ₁² M(y, θ*)′r* r*′M(y, θ*)   ρ₁² M(y, θ*)′r* m(y, θ*)′ ;
          ρ₁² m(y, θ*) r*′M(y, θ*)      ρ₁² m(y, θ*)m(y, θ*)′ ] ],

Q = E_H[ ∂Υ(θ*, r*)/∂(θ, r)′ ]
 = E_H[ [ ρ₁ ∂(M′r)/∂θ′ + ρ₂ M′r r′M   ρ₁M′ + ρ₂ M′r m(y, θ*)′ ;
          ρ₁M + ρ₂ m(y, θ*) r′M        ρ₂ m(y, θ*)m(y, θ*)′ ] ],

where M = M(y, θ*), ρ₂ = −1/(1 + r′m(y, θ))² is the second derivative of ρ with respect to r′m(y, θ), and the derivative of M′r with respect to θ can be written as

∂(M(y, θ)′r)/∂θ′ = [ (∂²m₁/∂β∂β′, ∂²m₂/∂β∂β′) r   (∂²m₁/∂ϕ∂β, ∂²m₂/∂ϕ∂β) r ;
                     (∂²m₁/∂β∂ϕ, ∂²m₂/∂β∂ϕ) r     (∂²m₁/∂ϕ², ∂²m₂/∂ϕ²) r ].  (19)

The entries in (19) are calculated as follows:

∂²m₁/∂β∂β′ = −∂²µᵢ/∂β∂β′,  ∂²m₁/∂ϕ∂β = ∂²m₁/∂β∂ϕ = 0,  ∂²m₁/∂ϕ² = 0,
∂²m₂/∂β∂β′ = 2(µᵢ(β) − yᵢ) ∂²µᵢ/∂β∂β′ + 2(∂µᵢ/∂β)(∂µᵢ/∂β)′ − ϕ{∂²V(µᵢ(β))/∂µᵢ²}(∂µᵢ/∂β)(∂µᵢ/∂β)′ − ϕ{∂V(µᵢ(β))/∂µᵢ} ∂²µᵢ/∂β∂β′,
∂²m₂/∂ϕ∂β = ∂²m₂/∂β∂ϕ = −{∂V(µᵢ(β))/∂µᵢ} ∂µᵢ/∂β,  ∂²m₂/∂ϕ² = 0.

4. Data Analysis

Crowder (1978, Table 3) considered data on the germination of seeds of Orobanche, a parasitic plant growing on the roots of flowering plants. In this experiment, two varieties of Orobanche seeds (O. aegyptiaca 75 and O. aegyptiaca 73) were tested for germination when treated with two different root extracts (bean and cucumber). The goal of the experiment was to determine the effect of the root extract on inhibiting the growth of the parasitic plant. These data were also analyzed by Breslow and Clayton (1993) to illustrate the use of generalized linear mixed models for overdispersion. Those studies did not consider aspects of model selection; here we reanalyze the data by applying the model selection techniques proposed above based on the quasi-likelihood. In the Orobanche data, yᵢ is the proportion of germinated seeds and nᵢ is the number of seeds in the ith data set (i = 1, ..., n = 21). We apply the quasi-likelihood method to deal with these overdispersed data.
We suppose that E[yᵢ] = µᵢ = πᵢ and Var[yᵢ] = ϕV(µᵢ) = ϕπᵢ(1 − πᵢ)/nᵢ, where ϕ is the overdispersion parameter. Furthermore, we suppose that each mean πᵢ is associated with a set of covariates xᵢ through a logistic link function, namely logit(πᵢ) = log{πᵢ/(1 − πᵢ)} = Xᵢᵀβ. Here Xᵢ = (1, x₁ᵢ, x₂ᵢ, x₁ᵢx₂ᵢ)′, where x₁ᵢ = 1 if the root extract is cucumber and 0 if it is bean, and x₂ᵢ = 1 if the seed variety is O. aegyptiaca 75 and 0 if it is O. aegyptiaca 73. We consider five candidate models:
(A) logit(πᵢ) = β₀
(B) logit(πᵢ) = β₀ + β₁x₁ᵢ
(C) logit(πᵢ) = β₀ + β₂x₂ᵢ
(D) logit(πᵢ) = β₀ + β₁x₁ᵢ + β₂x₂ᵢ
(E) logit(πᵢ) = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + β₃x₁ᵢx₂ᵢ

We first used the quasi-binomial likelihood method of the free software R (R Development Core Team, 2012) to estimate the regression coefficients in all the candidate models; the results are summarized in Table 1. From Table 1, we see that the standard errors of the parameters are adjusted by the overdispersion factor to a reasonable level. The overdispersion parameter also has an impact on hypothesis testing. An F-test comparing model (E) with model (D) is not significant, and the p value is 0.25 when model (B) is compared with model (D); thus neither pair of models differs significantly when the p values are compared with the 5% threshold. On the other hand, both the difference between models (A) and (B) and that between models (C) and (D) are significant by appropriate F-tests.

Table 1: Summaries of quasi-binomial analysis for the Orobanche data (estimated standard errors in parentheses)
Models            (A)      (B)      (C)      (D)      (E)
Intercept (sd)    (0.15)   (0.15)   (0.27)   (0.22)   (0.25)
Extracts (sd)              (0.21)            (0.21)   (0.34)
Seeds (sd)                          (0.33)   (0.23)   (0.30)
Interaction (sd)                                      (0.42)
Overdispersion

Next we computed the values of the various information criteria; the results are summarized in Table 2. In Table 2, AIC_BB denotes the values of AIC based on the beta-binomial model. As can be seen in Table 2, SIC has the smallest value at model (B) among all the candidate models. SIC ranks the models in the following way: (B) > (A) > (D) > (E) > (C). This result is consistent with the results obtained by Crawley (2005), who used the quasi-likelihood method to analyze the Orobanche data and applied F-tests to compare the full model (E) with simpler models.
He found no compelling evidence that the seed variety and the interaction should be kept in the model, and concluded that the minimal adequate model is model (B). On the other hand, SIC_T ranks the models in the following way: (C) > (A) > (B) > (D) >

Table 2: Semiparametric model selection criteria for the Orobanche data
Models   (A)   (B)   (C)   (D)   (E)
SIC
SIC_T
AIC_BB
(E), which is quite different from the ranking obtained using SIC. Even in full likelihood analysis, Takeuchi-type information criteria have been reported to be unstable in some cases (Shibata, 1989). Compared with SIC and SIC_T, AIC_BB selects the most complicated model (E) as the best one.

5. Simulation Studies

We now examine the performance of SIC and SIC_T numerically. The other criterion considered in the simulation studies is AIC (Akaike, 1973). We computed the frequency with which each information criterion selects each candidate model, following the idea of Greven and Kneib (2010).

We generated data {(yᵢ, nᵢ), i = 1, ..., n}, where yᵢ is the proportion of successes in nᵢ trials. The total sample size is fixed as N = Σᵢ₌₁ⁿ nᵢ. As in the germination data set, we consider the case of two covariates x₁ and x₂. First, we generated the numbers of trials nᵢ. The data are divided into J clusters. Let mⱼ be the total number of trials in the jth cluster, so that Σⱼ₌₁ᴶ mⱼ = N. We generated the mⱼ from the multinomial distribution Multinomial(N, p_N). We further suppose that there are v sets of data in every cluster. Given a fixed number of trials mⱼ, the numbers of trials n_ij in the sets are generated from the multinomial distribution Multinomial(mⱼ, (q₁, ..., q_v)), where q₁ = ... = q_v = 1/v and mⱼ = Σᵢ₌₁ᵛ n_ij. The next step is to generate the proportions of successes yᵢ. In order to ensure the presence of overdispersion, we suppose that the data follow a mixture model with nᵢyᵢ | Pᵢ ~ Binomial(nᵢ, Pᵢ), so that marginally nᵢyᵢ has mean nᵢπᵢ, where Pᵢ is a beta random variable with parameters αᵢ = cπᵢ and βᵢ = c(1 − πᵢ), with c = αᵢ + βᵢ a constant independent of i.
A simple calculation shows that yᵢ has mean E[yᵢ] = πᵢ and variance

Var[yᵢ] = πᵢ(1 − πᵢ){1 + (nᵢ − 1)/(c + 1)}/nᵢ,

which is larger than the nominal binomial variance πᵢ(1 − πᵢ)/nᵢ. The degree of overdispersion is controlled by the positive constant c, with smaller c corresponding to larger overdispersion. Finally, πᵢ is related to the regression parameters through the logistic link function, namely logit(πᵢ) = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + β₃x₁ᵢx₂ᵢ. The five candidate models in our simulation studies are the same as those considered in Section 4.

We considered three scenarios. In each case, we set J = 4, N = 2000, p_N = (1/2, 1/4, 1/8, 1/8), n = (40, 60, 80) and c = (20, 40, 60, 80). The true model is logit(πᵢ) = β₀ + β₁x₁ᵢ, where β₀ = 0.27 and β₁ = …. In each case we generated the pseudo-random numbers 2000 times. In the following tables, AIC denotes Akaike's information criterion based on the binomial model, while AIC_BB refers to the AIC based on the beta-binomial model.

5.1. Scenario I

In this scenario the covariates x₁ᵢ and x₂ᵢ are binary variables taking values 0 and 1. The results are summarized in Table 3. In Table 3, we find that when n = 40 and c = 20, AIC_BB chooses the true model (B) in over 76% of the samples. Both SIC and SIC_T select the true model in over 60% of the samples, and SIC is slightly better than SIC_T. AIC ignores the overdispersion and performs poorly, with only about a 45% selection frequency. When c increases from 20 to 80, the degree of overdispersion decreases; in the extreme case, as c goes to infinity, the beta-binomial distribution tends to the binomial distribution.
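The variance formula above is easy to verify by direct simulation. The sketch below uses illustrative values (nᵢ = 20, πᵢ = 0.5, c = 20), not the settings of the scenarios:

```python
import numpy as np

# Checking Var[y_i] = pi(1-pi){1 + (n_i - 1)/(c + 1)}/n_i by simulating
# the beta-binomial mixture described above.
rng = np.random.default_rng(3)
n_i, pi, c = 20, 0.5, 20.0
reps = 300_000

P = rng.beta(c * pi, c * (1 - pi), size=reps)   # P_i ~ Beta(c*pi, c*(1-pi))
y = rng.binomial(n_i, P) / n_i                  # n_i * y_i | P_i ~ Binomial(n_i, P_i)

var_theory = pi * (1 - pi) * (1 + (n_i - 1) / (c + 1)) / n_i
var_binom = pi * (1 - pi) / n_i                 # nominal binomial variance
print(y.mean(), y.var(), var_theory, var_binom)
```

With c = 20 and nᵢ = 20, the inflation factor 1 + (nᵢ − 1)/(c + 1) is about 1.9, so the mixture variance is nearly double the nominal binomial variance, which is the sort of overdispersion the simulation scenarios are designed to produce.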
Table 3: The frequency of models selected by various criteria in scenario I
n  c  Criteria     (A)       (B)        (C)      (D)       (E)
      AIC(%)       0(0)      907(45)    0(0)     452(23)   641(32)
      SIC(%)       657(33)   1329(66)   8(0)     6(0)      0(0)
      SIC_T(%)     690(34)   1206(60)   3(0)     47(2)     54(3)
      AIC_BB(%)    0(0)      1524(76)   0(0)     284(14)   192(10)
      AIC(%)       0(0)      1327(66)   0(0)     350(18)   323(16)
      SIC(%)       209(10)   1778(89)   2(0)     11(1)     0(0)
      SIC_T(%)     217(11)   1610(80)   0(0)     96(5)     77(4)
      AIC_BB(%)    0(0)      1519(76)   0(0)     289(14)   192(10)
      AIC(%)       0(0)      1172(59)   0(0)     396(20)   432(22)
      SIC(%)       137(7)    1844(92)   3(0)     15(1)     1(0)
      SIC_T(%)     149(7)    1541(77)   3(0)     88(4)     219(11)
      AIC_BB(%)    0(0)      1544(77)   0(0)     276(14)   180(9)
      AIC(%)       0(0)      1458(73)   0(0)     341(17)   201(10)
      SIC(%)       14(1)     1981(99)   1(0)     4(0)      0(0)
      SIC_T(%)     12(1)     1444(72)   1(0)     227(11)   316(16)
      AIC_BB(%)    0(0)      1567(78)   0(0)     282(14)   151(8)

In every scenario, when n = 40 and c increases from 40 to 60, the selection frequencies of AIC, SIC and SIC_T become higher, while the behavior of AIC_BB seems stable. A similar tendency is observed when n = 80 and c = (40, 60). The frequencies of models selected by the various criteria when n = 60 and c = (20, 40, 60, 80) are intermediate between the cases n = 40, c = (20, 80) and n = 80, c = (20, 80); we omit them in all scenarios. When the sample size is n = 80, AIC_BB selects the true model in 77% of the samples. SIC behaves better than AIC_BB in all cases: it chooses the true model 92% of the time and achieves the highest rate, 99%, when n = 80 and c = 80. SIC_T behaves better than AIC but a little worse than AIC_BB. When c increases, AIC, SIC and SIC_T all improve, while AIC_BB seems stable.

5.2. Scenario II

In this scenario, the covariate x₁ᵢ was generated from the standard normal distribution N(0, 1) and x₂ᵢ from the uniform distribution U(0, 1). The results are shown in Table 4.
In Table 4, when the sample size is n = 40, both SIC and SIC_T perform better than AIC but worse than AIC_BB. When c increases from 20 to 80, AIC_BB remains essentially constant, selecting the true model in about 74% of the samples; the selection rate of SIC increases correspondingly from 54% to 68%, while that of SIC_T increases from 49% to 62%. When the sample size is set to n = 80, AIC_BB performs stably with selection rates varying around 76%, and both SIC and SIC_T perform better than AIC_BB in all cases. SIC behaves more stably than SIC_T.

Table 4: The frequency of models selected by various criteria in scenario II

  n   c  Criteria    (A)       (B)        (C)     (D)       (E)
 40  20  AIC(%)      0(0)      677(34)    0(0)    507(25)   816(41)
         SIC(%)      927(46)   1070(54)   3(0)    0(0)      0(0)
         SIC_T(%)    993(50)   984(49)    3(0)    11(1)     9(0)
         AIC_BB(%)   0(0)      1504(75)   0(0)    290(14)   206(10)
 40  80  AIC(%)      0(0)      1175(59)   0(0)    443(22)   382(19)
         SIC(%)      643(32)   1356(68)   0(0)    1(0)      0(0)
         SIC_T(%)    708(35)   1243(62)   0(0)    29(1)     20(1)
         AIC_BB(%)   0(0)      1474(74)   0(0)    310(16)   216(11)
 80  20  AIC(%)      0(0)      930(46)    0(0)    493(25)   577(29)
         SIC(%)      143(7)    1744(87)   106(5)  4(0)      3(0)
         SIC_T(%)    156(8)    1678(84)   98(5)   44(2)     24(1)
         AIC_BB(%)   0(0)      1530(76)   0(0)    267(13)   203(10)
 80  80  AIC(%)      0(0)      1354(68)   0(0)    365(18)   281(14)
         SIC(%)      35(2)     1936(97)   27(1)   2(0)      0(0)
         SIC_T(%)    42(2)     1758(88)   26(1)   106(5)    68(3)
         AIC_BB(%)   0(0)      1546(77)   0(0)    290(14)   164(8)

Scenario III

In this case, both x_1i and x_2i were generated independently from the standard normal distribution N(0, 1). The obtained results are shown in Table 5. When the sample size is n = 40, both SIC and SIC_T perform better than AIC but worse than AIC_BB. When c increases from 20 to 80, the values of AIC_BB are relatively stable, with selection rates of the true model larger than 75%; SIC becomes a little better, the percentage of times the true model is selected increasing from 54% to 68%, while the selection rate of SIC_T increases from 49% to 63%. When the sample size is n = 80, AIC_BB again performs stably with selection rates around 76%, and both SIC and SIC_T perform better than AIC_BB in all cases.

Table 5: The frequency of models selected by various criteria in scenario III

  n   c  Criteria    (A)       (B)        (C)     (D)       (E)
 40  20  AIC(%)      0(0)      704(35)    0(0)    526(26)   770(38)
         SIC(%)      923(46)   1077(54)   0(0)    0(0)      0(0)
         SIC_T(%)    988(49)   987(49)    0(0)    9(0)      16(1)
         AIC_BB(%)   0(0)      1501(75)   0(0)    289(14)   210(10)
 40  80  AIC(%)      0(0)      1180(59)   0(0)    396(20)   424(21)
         SIC(%)      642(32)   1357(68)   0(0)    1(0)      0(0)
         SIC_T(%)    703(35)   1253(63)   0(0)    25(1)     19(1)
         AIC_BB(%)   0(0)      1503(75)   0(0)    285(14)   212(11)
 80  20  AIC(%)      0(0)      871(44)    0(0)    470(24)   659(33)
         SIC(%)      175(9)    1766(88)   46(2)   11(1)     2(0)
         SIC_T(%)    185(9)    1707(85)   39(2)   44(2)     25(1)
         AIC_BB(%)   0(0)      1506(75)   0(0)    278(14)   216(11)
 80  80  AIC(%)      0(0)      1314(66)   0(0)    398(20)   288(14)
         SIC(%)      41(2)     1946(97)   10(1)   3(0)      0(0)
         SIC_T(%)    51(3)     1769(88)   9(0)    102(5)    69(3)
         AIC_BB(%)   0(0)      1519(76)   0(0)    322(16)   159(8)
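As a rough illustration of the selection task in these scenarios, the sketch below fits nested logistic models to overdispersed proportions by IRLS and scores each with the naive binomial criterion −2 × (binomial log-likelihood) + 2k, i.e. the AIC column of the tables. The slope value 0.8 is a hypothetical stand-in for the lost β_1, and the function names are ours; SIC itself replaces the log-likelihood with the quasi-likelihood defined in Section 3, which this snippet does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_logistic_irls(X, y, n, iters=50):
    # IRLS fit of a logistic model for binomial proportions y with index n.
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ beta))
        W = n * mu * (1 - mu)                       # working weights
        z = X @ beta + (y - mu) / (mu * (1 - mu))   # working response
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

def binom_loglik(X, y, n, beta):
    # Binomial log-likelihood kernel of the fitted proportions.
    mu = 1 / (1 + np.exp(-X @ beta))
    return float(np.sum(n * (y * np.log(mu) + (1 - y) * np.log(1 - mu))))

# overdispersed data from the true model logit(pi) = b0 + b1*x1
m, n, c = 400, 40, 20
x1, x2 = rng.normal(size=m), rng.normal(size=m)
pi = 1 / (1 + np.exp(-(0.27 + 0.8 * x1)))           # 0.8: hypothetical slope
y = rng.binomial(n, rng.beta(c * pi, c * (1 - pi))) / n

candidates = {
    "(A)": np.column_stack([np.ones(m)]),
    "(B)": np.column_stack([np.ones(m), x1]),
    "(E)": np.column_stack([np.ones(m), x1, x2, x1 * x2]),
}
scores = {}
for name, X in candidates.items():
    b = fit_logistic_irls(X, y, n)
    scores[name] = -2 * binom_loglik(X, y, n, b) + 2 * X.shape[1]
print(sorted(scores, key=scores.get))               # models ordered best-first
```

Because the binomial weights understate the true variance, the 2k penalty is often too weak here, which is exactly the overfitting of AIC reported in the tables.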
6. Conclusions

In this paper we proposed two information criteria, SIC and SIC_T, for selecting semiparametric models in conjunction with quasi-likelihood methodologies. These model selection techniques complement the usual deviance analyses for generalized linear models. We demonstrated the potential usefulness of the proposed criteria through both data analysis and simulation studies.

In our simulation studies we generated the data from the beta-binomial model, so that the standard Akaike information criterion could also be used. In this sense the comparison is unfair to our proposed methods, since both SIC and SIC_T use only partial information, namely the first two moments. Nevertheless, our simulation results show that both SIC and SIC_T are comparable with AIC in terms of the frequency with which the true model is selected.

When trace(Ŝ Q̂^{-1}) = k, the Takeuchi-type information criterion SIC_T reduces to SIC; this occurs when the model is correctly specified. However, our simulation results show that SIC_T may be unstable, for a couple of reasons. One reason is the difficulty of estimating S and Q. The instability of the Takeuchi-type information criterion has been reported even in full-likelihood analysis (Shibata, 1989). The results in Section 5 suggest that both SIC and SIC_T may be used when the sample size is moderately large.

Our simulation results also indicate that two factors affect the performance of SIC and SIC_T: the sample size n and the degree of overdispersion c. When the sample size is fixed and c increases, the frequencies with which the true model is selected by SIC and SIC_T also increase. When the sample size n increases, both SIC and SIC_T attain higher true-model selection percentages. As often pointed out in the literature, AIC tends to select overparametrized models, and this phenomenon is also observed in the results shown in Tables 3-5.
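The trace condition above can be illustrated numerically. For a quasi-binomial logistic model the score contributions are u_i = n(y_i − μ_i)x_i, the sensitivity matrix is Q̂ = Σ n μ_i(1 − μ_i) x_i x_iᵀ, and Ŝ = Σ u_i u_iᵀ; the sketch below (our own construction, with the true β plugged in for simplicity and 0.8 as a hypothetical slope) shows trace(Ŝ Q̂^{-1}) falling near k for correctly specified binomial data and well above k under beta-binomial overdispersion.

```python
import numpy as np

rng = np.random.default_rng(3)

def trace_penalty(X, y, n, beta):
    # Quasi-binomial logistic model: score contributions u_i = n*(y_i - mu_i)*x_i,
    # sensitivity Q = sum n*mu*(1-mu)*x x', variability S = sum u u'.
    mu = 1 / (1 + np.exp(-X @ beta))
    U = (n * (y - mu))[:, None] * X
    S = U.T @ U
    Q = X.T @ ((n * mu * (1 - mu))[:, None] * X)
    return np.trace(S @ np.linalg.inv(Q))

m, n, c = 2000, 40, 20
x1 = rng.normal(size=m)
X = np.column_stack([np.ones(m), x1])
beta = np.array([0.27, 0.8])                    # 0.8 is a hypothetical slope
pi = 1 / (1 + np.exp(-X @ beta))

y_bin = rng.binomial(n, pi) / n                              # correctly specified
y_bb = rng.binomial(n, rng.beta(c * pi, c * (1 - pi))) / n   # overdispersed

print(round(trace_penalty(X, y_bin, n, beta), 2))  # should be near k = 2
print(round(trace_penalty(X, y_bb, n, beta), 2))   # inflated, roughly k*(1+(n-1)/(c+1))
```

Under correct specification E[Ŝ] = Q̂, so the trace equals k in expectation; under overdispersion Ŝ is inflated and the Takeuchi-type penalty exceeds 2k.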
AIC based on the inappropriate binomial model is particularly prone to selecting the most complicated model (E). AIC based on the beta-binomial distribution is clearly better but shows the same tendency. The quasi-likelihood-based model selection criteria SIC and SIC_T, on the other hand, tend to select simpler models. SIC appears more stable than SIC_T in our simulation studies. Given the relative instability and computational complexity of SIC_T, when both SIC and SIC_T are available, SIC is the relatively conservative choice.

Some problems remain unsolved. For example, the performance of SIC and SIC_T deteriorates when the true model becomes more complicated. We will address these problems in future work.

Acknowledgments

This paper is part of my Ph.D. project. I would like to thank my supervisor, Professor Jinfang Wang, for his kind supervision during these years. I am grateful to the editor and the anonymous referees for their valuable suggestions, which improved the quality of the paper.

REFERENCES

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, B. N. Petrov and F. Csaki (eds), Budapest: Akademiai Kiado.
Anderson, D. R., Burnham, K. P. and White, G. C. (1994). AIC model selection in overdispersed capture-recapture data. Ecology, 75.
Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. J. Am. Statist. Ass., 88.
Chen, X., Hong, H. and Shum, M. (2007). Nonparametric likelihood ratio model selection tests between parametric likelihood and moment condition models. Journal of Econometrics, 141.
Crawley, M. J. (2005). Statistics: An Introduction Using R. John Wiley & Sons, Ltd.
Crowder, M. J. (1978). Beta-binomial anova for proportions. Appl. Statist., 27.
Efron, B. (1986). Double exponential families and their use in generalized linear regression. J. Am. Statist. Ass., 81.
Fitzmaurice, D. M. (1997). Model selection with overdispersed data. Statistician, 46.
Greven, S. and Kneib, T. (2010). On the behaviour of marginal and conditional AIC in linear mixed models. Biometrika, 97.
Hurvich, C. M. and Tsai, C.-L. (1995). Model selection for extended quasi-likelihood models in small samples. Biometrics, 51.
Kitamura, Y. (2006). Empirical likelihood methods in econometrics: theory and practice. Unpublished manuscript, Yale University.
Lawless, J. F. (1987). Negative binomial and mixed Poisson regression. Canadian J. Statist., 15.
McCullagh, P. (1983). Quasi-likelihood functions. Ann. Statist., 11.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman and Hall.
Moore, D. F. (1986). Asymptotic properties of moment estimators for overdispersed counts and proportions. Biometrika, 73.
Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. J. R. Statist. Soc. A, 135.
R Development Core Team (2012). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Shibata, R. (1989). Statistical aspect of model selection. In From Data to Model, Ed. J. C. Willems, Springer.
Takeuchi, K. (1976). Distribution of informational statistics and criteria for adequacy of models. Mathematical Sciences (in Japanese), 153.
Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61.
Williams, D. A. (1982). Extra-binomial variation in logistic-linear models. Appl. Statist., 31.

(Received: December 26, 2012; Accepted: September 23, 2013)
More informationLinear model A linear model assumes Y X N(µ(X),σ 2 I), And IE(Y X) = µ(x) = X β, 2/52
Statistics for Applications Chapter 10: Generalized Linear Models (GLMs) 1/52 Linear model A linear model assumes Y X N(µ(X),σ 2 I), And IE(Y X) = µ(x) = X β, 2/52 Components of a linear model The two
More informationLinear Regression Models P8111
Linear Regression Models P8111 Lecture 25 Jeff Goldsmith April 26, 2016 1 of 37 Today s Lecture Logistic regression / GLMs Model framework Interpretation Estimation 2 of 37 Linear regression Course started
More informationOn prediction and density estimation Peter McCullagh University of Chicago December 2004
On prediction and density estimation Peter McCullagh University of Chicago December 2004 Summary Having observed the initial segment of a random sequence, subsequent values may be predicted by calculating
More informationUNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator
UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS Duration - 3 hours Aids Allowed: Calculator LAST NAME: FIRST NAME: STUDENT NUMBER: There are 27 pages
More informationi=1 h n (ˆθ n ) = 0. (2)
Stat 8112 Lecture Notes Unbiased Estimating Equations Charles J. Geyer April 29, 2012 1 Introduction In this handout we generalize the notion of maximum likelihood estimation to solution of unbiased estimating
More informationLast lecture 1/35. General optimization problems Newton Raphson Fisher scoring Quasi Newton
EM Algorithm Last lecture 1/35 General optimization problems Newton Raphson Fisher scoring Quasi Newton Nonlinear regression models Gauss-Newton Generalized linear models Iteratively reweighted least squares
More informationSTA216: Generalized Linear Models. Lecture 1. Review and Introduction
STA216: Generalized Linear Models Lecture 1. Review and Introduction Let y 1,..., y n denote n independent observations on a response Treat y i as a realization of a random variable Y i In the general
More informationReconstruction of individual patient data for meta analysis via Bayesian approach
Reconstruction of individual patient data for meta analysis via Bayesian approach Yusuke Yamaguchi, Wataru Sakamoto and Shingo Shirahata Graduate School of Engineering Science, Osaka University Masashi
More informationMultinomial Logistic Regression Models
Stat 544, Lecture 19 1 Multinomial Logistic Regression Models Polytomous responses. Logistic regression can be extended to handle responses that are polytomous, i.e. taking r>2 categories. (Note: The word
More informationNormal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,
Likelihood Let P (D H) be the probability an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D variable. Before the experiment, the data D are unknown, and the probability
More informationHypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations
Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations Takeshi Emura and Hisayuki Tsukuma Abstract For testing the regression parameter in multivariate
More informationGeneralized linear models
Generalized linear models Douglas Bates November 01, 2010 Contents 1 Definition 1 2 Links 2 3 Estimating parameters 5 4 Example 6 5 Model building 8 6 Conclusions 8 7 Summary 9 1 Generalized Linear Models
More information