H-LIKELIHOOD ESTIMATION METHOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL

Intesar N. El-Saeiti
Department of Statistics, Faculty of Science, University of Benghazi, Libya.
entesar.el-saeiti@uob.edu.ly

ABSTRACT

Clustered or hierarchically structured data with binary responses are common in many practical applications. Clusters may or may not contain equal numbers of observations. Such data structures often involve complex patterns of variability. Mixed models are often the most appropriate models to use in practice, as they contain fixed effects of interest and random effects to account for the clustering. The random effects reflect multiple error structures. According to Lee and Nelder (1996), a preferred model for clustered binary data with mixed effects is the Hierarchical Generalized Linear Model (HGLM). This article compares the performance of the h-likelihood estimation method for mixed effects models of clustered binary data with balanced and unbalanced cluster sizes. The comparison was carried out by computer simulation in terms of the bias of the parameter estimates, Type I error rate, power, and standard error. The simulation used different numbers of clusters and different cluster sizes. The results show that the h-likelihood method performs better for balanced than for unbalanced mixed effects clustered binary data models.

Keywords: Hierarchical Generalized Linear Model (HGLM), H-Likelihood Method, Binary Response.

INTRODUCTION

Many research studies in health, finance, education, and the social sciences involve collecting binary data clustered into groups, such as the smoking status of students sampled from different schools or the disease status of animals from different farms.
Such data would be expected to be correlated within clusters, as students from the same school tend to be more similar than those from different schools, and animals from the same farm tend to be more similar than those from different farms. When designing such studies, a choice needs to be made regarding the number of groups to sample from. A larger number of groups or schools results in less dependence in the data and more precision in estimating the effects of explanatory variables. In some experiments, the clusters may be balanced or unbalanced; that is, the number of observations in a cluster (the size of the cluster) may be equal or may differ among the clusters. Unbalanced clusters result from sub-sampling unequal numbers of observations from each cluster. Unbalanced clusters also occur when there are randomly missing vector elements for a clustered multivariate outcome or when subjects differ in the number of relevant vector elements for the analysis. Many authors have studied unbalanced clustered data; differing cluster sizes can lead to different dispersions for each cluster. Unbalanced data within clusters raise the problem of heterogeneous models, which require different variance components, as addressed in previous studies for continuous responses (El-Saeiti, 2004).

In this article, the researcher used a nested design with a mixed effects model. The mixed model is the most appropriate model to use in real-life applications, as it contains both fixed and random factors. When the model contains both fixed effects and random effects, it is called a generalized linear mixed model (GLMM) or a hierarchical generalized linear model (HGLM) (Lee and Nelder, 1996). Hierarchical generalized linear models allow extra error components in the linear predictors of generalized linear models. The distribution of these components is not required to be normal, which allows a broader class of models.
In hierarchical generalized linear models, the response and the random effects are allowed to follow any distribution in the exponential family; for more details see McCulloch and Searle (2001). As such, the HGLM is more appropriate for clustered data than the generalized linear model (GLM). In generalized linear models, maximum likelihood (ML) is used to estimate the mean component. An extension of ML for HGLMs is the Restricted Pseudo Likelihood (REPL) estimation method for binary mixed effects models, discussed in depth by El-Saeiti (2015). Geys, Molenberghs, and Ryan (1997) showed that ML and REPL give parameter estimates that agree fairly closely. Hierarchical likelihood (HL) estimation is used to estimate both the mean parameters and the dispersion parameters. In HL, as with REPL, the distribution of the random components does not need to be normal; this allows for a broader class of models (Lee and Nelder, 1996).

Lee and Nelder (1996) defined the hierarchical likelihood (h-likelihood) as

$$ h = \ln f(y \mid v; \beta, \phi) + \ln f(v; \alpha) \tag{1} $$

$$ \phantom{h} = l(\beta, \phi; y \mid v) + l(\alpha; v), \tag{2} $$

where $f(y \mid v; \beta, \phi)$ and $f(v; \alpha)$ denote the conditional density function of $y$ given the random effect $v$ and the density function of $v$, respectively. Here $\beta$ are the fixed effects, $\phi$ are the dispersion parameters for the conditional distribution of $y \mid v$, and $\alpha$ are the parameters for the random effects. The random component $v$ is the scale on which the random effect $u$ occurs linearly in the linear predictor, $v = v(u)$. One reason for developing an algorithm for the $v$-scale rather than for the $u$-scale is that $v$ can often assume any real value, whereas $u$ usually has range restrictions, which may cause problems in convergence (Lee & Nelder, 1996). The estimates obtained by maximizing the h-likelihood are called the maximum h-likelihood estimates (MHLEs); they are obtained by solving

$$ \frac{\partial h}{\partial \beta} = 0, \qquad \frac{\partial h}{\partial v} = 0. $$

As an example to explain the HGLM, consider a binary outcome. According to Lee & Nelder (1996), the appropriate distribution for the dependent variable is binomial (since the outcome is binary) and the appropriate distribution for the random effect is a beta distribution. For more detail and examples on binary outcomes with a beta distribution for the random effects, see El-Saeiti (2013), Lalonde (2009), and Lee and Nelder (1996). The HGLM pieces (response distribution, random effect distribution, linear component, and link function, respectively) are:

$$ Y_{ij} \mid u_i \sim \mathrm{Bin}(\mu, \mu(1-\mu)), \qquad u_i \sim \mathrm{Beta}(\gamma, \lambda), \qquad \eta_{ij} = x_{ij}\beta + v(u_i), \qquad \eta_{ij} = \mathrm{logit}(p). $$

The h-likelihood for the binomial-beta model (Lee & Nelder, 1996) is $h = l(\beta, \phi; y \mid v) + l(\alpha; v)$. The h-likelihood estimating equations for the fixed part $\beta$ and the random component $v$ are, respectively,

$$ \frac{\partial h}{\partial \beta_k} = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \left[ x_{ijk}\, y_{ij} - x_{ijk}\, \frac{e^{x_{ij}\beta + v_i}}{1 + e^{x_{ij}\beta + v_i}} \right] = 0, \tag{3} $$

and

$$ \frac{\partial h}{\partial v_i} = \sum_{j=1}^{n_i} \left[ y_{ij} - \frac{e^{x_{ij}\beta + v_i}}{1 + e^{x_{ij}\beta + v_i}} \right] + \gamma - (\gamma + \lambda)\, \frac{e^{v_i}}{1 + e^{v_i}} = 0, \tag{4} $$

where $K$ is the number of clusters and $n_i$ is the number of observations in cluster $i$.
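To make equation (4) concrete, the following is a minimal Python sketch rather than the article's implementation. It assumes Bernoulli responses, takes v = logit(u) so that the log density of v is γv − (γ + λ) ln(1 + e^v) up to a constant, holds β fixed, and solves equation (4) for a single cluster by Newton-Raphson. The function names (`h_cluster`, `score_v`, `solve_v`) and the five-observation cluster are hypothetical and chosen only for illustration.

```python
import math


def expit(z):
    """Inverse logit, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log1pexp(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def h_cluster(v, eta_fixed, y, gamma, lam):
    """Contribution of one cluster to h, up to a constant free of v and beta.

    eta_fixed[j] holds x_ij * beta; v is the random effect on the logit scale.
    """
    loglik = sum(yj * (e + v) - log1pexp(e + v) for yj, e in zip(y, eta_fixed))
    log_density_v = gamma * v - (gamma + lam) * log1pexp(v)
    return loglik + log_density_v


def score_v(v, eta_fixed, y, gamma, lam):
    """Left-hand side of equation (4): derivative of h with respect to v_i."""
    resid = sum(yj - expit(e + v) for yj, e in zip(y, eta_fixed))
    return resid + gamma - (gamma + lam) * expit(v)


def solve_v(eta_fixed, y, gamma, lam, v0=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson iteration for score_v(v) = 0 with beta held fixed."""
    v = v0
    for _ in range(max_iter):
        s = score_v(v, eta_fixed, y, gamma, lam)
        # Second derivative of h in v; it is strictly negative (h is concave in v).
        hess = -sum(expit(e + v) * (1.0 - expit(e + v)) for e in eta_fixed)
        hess -= (gamma + lam) * expit(v) * (1.0 - expit(v))
        step = s / hess
        v -= step
        if abs(step) < tol:
            break
    return v


# Hypothetical cluster: five binary outcomes with fixed part 1 + 0.2 * x.
x = [2.0, -1.0, 4.5, 3.0, 0.5]
eta_fixed = [1.0 + 0.2 * xj for xj in x]
y = [1, 0, 1, 1, 0]
v_hat = solve_v(eta_fixed, y, gamma=2.0, lam=3.0)
u_hat = expit(v_hat)  # random effect back on the (0, 1) scale
print(v_hat, u_hat)
```

At the solution, `u_hat` equals the sum of the residuals y − p plus γ, divided by γ + λ, which is simply equation (4) rearranged. In a full fit this inner step would alternate with an update of β from equation (3).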

Thus, equating $\partial h / \partial v_i$ to zero gives an estimate of the random effect,

$$ \hat{u}_i = \frac{\sum_{j=1}^{n_i} \left( y_{ij} - p_{ij} \right) + \gamma}{\gamma + \lambda}, \qquad p_{ij} = \frac{e^{x_{ij}\beta + v_i}}{1 + e^{x_{ij}\beta + v_i}}. $$

Equations (3) and (4) can then be solved using either the Newton-Raphson method or Fisher's scoring method (Gu, 2008).

SIMULATION

To generate data, the researcher created two data sets: the first with balanced cluster sizes and the second with unbalanced cluster sizes. Parameter values were fixed, the covariate and the random effect were generated, and the success probability of the dependent variable was calculated. For unequal cluster sizes, the number of subjects per cluster was drawn from a Poisson distribution, whose mean was the mean number of observations per cluster. By choosing different mean cluster sizes (n = 10, 25, 50, 100), the researcher showed the difference in statistical performance for various sample sizes. The next step was to generate a normally distributed continuous variable $x_{1ij}$ with mean 3 and known variance 20, $x_{1ij} \sim N(3, 20)$. Next, the researcher generated a beta distributed random variable $u_i$ with parameters $\gamma = 2$ and $\lambda = 3$ for each cluster $i$, $u_i \sim \mathrm{Beta}(2, 3)$. For equal cluster sizes, the same steps were taken, but the number of observations was the same in each cluster. Finally, $Y_{ij}$ was generated for each data unit from a Bernoulli distribution with success probability

$$ p_{ij} = \frac{e^{\beta_0 + \beta_1 x_{ij} + u_i}}{1 + e^{\beta_0 + \beta_1 x_{ij} + u_i}}, $$

where $\beta_0 = 1$ and $\beta_1 = 0.2$. Parameter estimates were obtained using h-likelihood (Heo and Leon, 2005). The number of clusters was set to K = 10, 20, 50, 100; the cluster size for balanced clusters was n = 10, 25, 50, 100; and for unbalanced clusters the mean number of observations per cluster was n = 10, 25, 50, 100. For each combination of K and n, 1,000 data sets were generated for each case (equal and unequal cluster sizes) to calculate the power, Type I error rate, and standard errors.
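The data-generating steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the article's code: the names `poisson` and `simulate` are hypothetical, the Poisson draw uses Knuth's multiplication method because the standard library has no Poisson sampler, and unbalanced cluster sizes are floored at 1 because the article does not say how empty clusters were handled.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate by Knuth's multiplication method (fine for modest means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate(n_clusters, mean_size, balanced, beta0=1.0, beta1=0.2, seed=0):
    """Return a list of (cluster, x, y) rows for one simulated data set."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_clusters):
        # Balanced: every cluster has mean_size units.
        # Unbalanced: Poisson-distributed size, floored at 1 (an assumption).
        n_i = mean_size if balanced else max(1, poisson(rng, mean_size))
        u_i = rng.betavariate(2.0, 3.0)          # u_i ~ Beta(2, 3)
        for _ in range(n_i):
            x = rng.gauss(3.0, math.sqrt(20.0))  # x_ij ~ N(3, 20), i.e. variance 20
            eta = beta0 + beta1 * x + u_i        # u_i enters the predictor directly
            p = 1.0 / (1.0 + math.exp(-eta))
            y = 1 if rng.random() < p else 0     # Bernoulli(p) outcome
            rows.append((i, x, y))
    return rows


balanced_data = simulate(n_clusters=10, mean_size=25, balanced=True, seed=1)
unbalanced_data = simulate(n_clusters=10, mean_size=25, balanced=False, seed=1)
print(len(balanced_data), len(unbalanced_data))
```

Repeating the call with 1,000 different seeds for each combination of K and n would reproduce the structure of the simulation design, though not necessarily the exact data sets used in the article.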
To calculate the power, Type I error rate, and standard error, data were generated according to the model with the systematic component $\eta_{ij} = \beta_0 + \beta_1 x_{1ij} + v_i$, with a single active treatment effect $\beta_1$. The model was then fitted with the systematic component $\eta_{ij} = \beta_0 + \beta_1 x_{1ij} + \beta_2 x_{2ij} + v_i$, where $\beta_0$ was the intercept, $\beta_1$ was the treatment effect, $x_1$ was generated from a normal distribution, $\beta_2$ was an extra parameter, and $x_2$ was a second covariate generated from the Poisson distribution with mean 3, $x_2 \sim \mathrm{Poi}(\lambda = 3)$. Power was estimated as the proportion of correct detections of significance for $\beta_1$, while the Type I error rate was estimated as the proportion of incorrect detections of significance for $\beta_2$. For the h-likelihood HGLM described above, the systematic component used to generate the data was $\eta_{ij} = 1 + 0.2 x_{1ij} + v_i$, and the systematic component of the fitted model was $\eta_{ij} = 1 + 0.2 x_{1ij} + 3.1 x_{2ij} + v_i$, where $v_i \sim \mathrm{Beta}(2, 3)$. For the Binomial Beta h-likelihood, the researcher used the hglm function in the hglm package in R, which returned the estimates of the parameters $\beta$ together with t-statistics and p-values. Over the simulation, the average of the 1,000 estimates was calculated for $\beta_1$ and $\beta_2$, along with the power of the hypothesis test for $\beta_1$, the Type I error rate of the hypothesis test for $\beta_2$, and the standard error for $\beta_1$.
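The summary step described above amounts to simple averages and rejection proportions over the replications. The Python sketch below shows this bookkeeping only; the model fitting itself (done with hglm in R in the article) is not reproduced. The function name `summarize`, the dictionary keys, and the four toy replications are all hypothetical.

```python
def summarize(replications, alpha=0.05):
    """Summarize simulation output.

    Each replication is a dict with hypothetical keys:
      b1, b2 : estimates of beta_1 and beta_2
      se1    : standard error of the beta_1 estimate
      p1, p2 : p-values for H0: beta_1 = 0 and H0: beta_2 = 0
    """
    r = len(replications)
    return {
        "mean_b1": sum(d["b1"] for d in replications) / r,
        "mean_b2": sum(d["b2"] for d in replications) / r,
        # Power: share of replications that reject H0: beta_1 = 0 (true beta_1 = 0.2).
        "power": sum(1 for d in replications if d["p1"] < alpha) / r,
        # Type I error: share that reject H0: beta_2 = 0 although beta_2 = 0 is true.
        "type1": sum(1 for d in replications if d["p2"] < alpha) / r,
        "mean_se1": sum(d["se1"] for d in replications) / r,
    }


# Four made-up replications, only to show the input format.
toy = [
    {"b1": 0.21, "b2": 0.01, "se1": 0.05, "p1": 0.001, "p2": 0.50},
    {"b1": 0.19, "b2": -0.02, "se1": 0.06, "p1": 0.030, "p2": 0.04},
    {"b1": 0.18, "b2": 0.00, "se1": 0.07, "p1": 0.200, "p2": 0.70},
    {"b1": 0.22, "b2": 0.01, "se1": 0.06, "p1": 0.004, "p2": 0.90},
]
result = summarize(toy)
print(result)
```

In the article's design, `replications` would hold 1,000 entries for each combination of K and n, once for the balanced case and once for the unbalanced case.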

RESULTS

Table 1 gives the Binomial Beta h-likelihood parameter estimates.

Table 1: Parameter estimates

Clusters   Cluster size   Balanced β̂1   Balanced β̂2      Unbalanced β̂1   Unbalanced β̂2
K = 10     10             0.2319765     -0.007228321     0.1958833       0.009286461
           25             0.1939059      0.003553967     0.2017746       0.0108503
           50             0.1970002     -0.002042296     0.188225       -0.0001238602
           100            0.199145       0.002284678     0.2009817      -0.01050844
K = 20     10             0.215392      -0.03054897      0.210038        0.01873527
           25             0.2038395     -0.01017131      0.2013315      -0.001884942
           50             0.2035105      0.004907986     0.2022876       0.0006811804
           100            0.2006388     -0.002680622     0.1983477      -0.000997808
K = 50     10             0.2080814      0.001532905     0.1958833       0.009286461
           25             0.1994717      0.002696468     0.2022252       0.006061514
           50             0.1967751     -0.0005004571    0.2000865       0.002234016
           100            0.2001256      0.0007905866    0.20241         0.000397104
K = 100    10             0.2004939      0.001584383     0.196161        0.003048525
           25             0.2016236     -0.002657747     0.202098        0.002534502
           50             0.1991661      0.0008547018    0.2014994       0.001459892
           100            0.1996344     -0.00128299      0.1980433       0.001697924

For both balanced and unbalanced cluster sizes, the Binomial Beta h-likelihood estimates of β1 and β2 were very close to the true values, β1 = 0.2 and β2 = 0. The Binomial Beta h-likelihood was therefore a good estimation method, with estimated values close to the true parameters. The results show that the performance of the Binomial Beta h-likelihood estimates is similar regardless of inequality in cluster size.

Table 2 gives the Binomial Beta h-likelihood Type I error rate for β2 for balanced and unbalanced cluster sizes. Type I error rates were computed as the proportion of p-values less than 0.05 under the null hypothesis H0: β2 = 0. Ideally, the Type I error rate should be close to 0.05. The Type I error rates for β2 differed slightly between equal and unequal cluster sizes. For large cluster sizes, the balanced design generally had smaller Type I error rates than the unbalanced design.

Table 2: Type I error rate

Clusters   Cluster size   Balanced   Unbalanced
K = 10     10             0.12       0.085
           25             0.07       0.095
           50             0.12       0.09
           100            0.073      0.104
K = 20     10             0.136      0.109
           25             0.09       0.096
           50             0.165      0.108
           100            0.067      0.087
K = 50     10             0.067      0.085
           25             0.065      0.126
           50             0.087      0.104
           100            0.123      0.089
K = 100    10             0.102      0.06
           25             0.082      0.134
           50             0.095      0.136
           100            0.087      0.121

Table 3 gives the Binomial Beta h-likelihood power of the hypothesis test for β1. Statistical power was computed as the proportion of correct rejections of the hypothesis H0: β1 = 0. In the simulation, the test was conducted 1,000 times to see how often it was significant; the power was the proportion of those 1,000 tests that correctly rejected.

Table 3: Power

Clusters   Cluster size   Balanced   Unbalanced
K = 10     10             0.89       0.906
           25             1          0.677
           50             1          0.864
           100            1          0.991
K = 20     10             0.998      0.615
           25             1          0.937
           50             1          0.999
           100            1          1
K = 50     10             1          0.906
           25             1          1
           50             1          1
           100            1          1
K = 100    10             1          0.991
           25             1          1
           50             1          1
           100            1          1

The balanced design was more powerful than the unbalanced design, especially with small sample sizes. The power for balanced clusters was higher than for unbalanced clusters, which means that the Binomial Beta h-likelihood is a better estimation method for the balanced than for the unbalanced clustered binary model.

Table 4 gives the standard error (SE). The SE was computed as the average of the 1,000 standard errors of the estimates of β1. A smaller SE represents smaller estimated variability, or greater precision, of the parameter estimates (Heo and Leon, 2005). The standard error of the estimate of β1 indicates whether or not efficiency improved. From Table 4, the Binomial Beta h-likelihood showed small standard errors for the balanced clusters.

Table 4: Standard error

Clusters   Cluster size   Balanced      Unbalanced
K = 10     10             0.07152838    0.05695932
           25             0.04197166    0.08128032
           50             0.02903917    0.05683201
           100            0.0202115     0.04005908
K = 20     10             0.04737441    0.09272815
           25             0.02885089    0.05658015
           50             0.02028826    0.04003575
           100            0.0142676     0.02824783
K = 50     10             0.02903625    0.05695932
           25             0.01807183    0.03579394
           50             0.0127137     0.02526456
           100            0.00901145    0.01782909
K = 100    10             0.0202617     0.04016537
           25             0.01277624    0.0252753
           50             0.009003529   0.01786361
           100            0.006371349   0.01261467

Tables 1 to 4 summarize the simulation results for the Binomial Beta h-likelihood method with equal and unequal cluster sizes: parameter estimates, power, Type I error rate, and standard error. The tables show that the Binomial Beta h-likelihood was a good estimation method, because the average of 1,000 replications gave estimates that were very close to the true values, 0.2 for β1 and zero for β2. The power for the balanced design was higher than for the unbalanced design, and the Type I error rate for balanced clusters was somewhat smaller than for unbalanced clusters. A smaller average SE represents smaller estimated variability, or greater precision, of the parameter estimates (Heo and Leon, 2005). Overall, the balanced cluster sizes gave somewhat better values than the unbalanced cluster sizes.

CONCLUSIONS

The Binomial Beta h-likelihood was an effective method for mixed effects models of clustered binary data, with slight differences according to cluster size. The average of 1,000 replications gave estimates that were close to the true values.
The power of the hypothesis test for the regression parameters was better for balanced than for unbalanced clusters, and the Type I error rate of the hypothesis test for the regression parameters was acceptable, with smaller values for balanced than for unbalanced clusters. The standard error of the regression parameters was small. The simulation in this article indicates that the Binomial Beta h-likelihood is a more acceptable estimation method for balanced than for unbalanced clustered binary responses. The simulation results demonstrate the capability of the Binomial Beta h-likelihood estimation method with balanced cluster sizes.

FUTURE WORK

Because the Binomial Beta h-likelihood performs better for balanced than for unbalanced clustered binary responses, a natural next step is to adjust the Binomial Beta h-likelihood estimation method to handle unbalanced cluster sizes; this will be the author's next work.

References

El-Saeiti, I. N. (2004). Messy data in heteroscedastic models case study: Mixed nested design. M.Sc. thesis.
El-Saeiti, I. N. (2013). Adjusted variance components for unbalanced clustered binary data models. Ph.D. dissertation.
El-Saeiti, I. N. (2015). Performance of mixed effects for clustered binary data models. AIP Conf. Proc., 1643:80-85.
Geys, H., Molenberghs, G., and Ryan, L. M. (1997). Pseudo-likelihood inference for clustered binary data. Commun. Statist.-Theory Meth., 26(11):2743-2767.
Gu, Z. (2008). Model diagnostics for generalized linear mixed models. Dissertation.
Heo, M. and Leon, A. (2005). Performance of a mixed effects logistic regression model for binary outcomes with unequal cluster size. Journal of Biopharmaceutical Statistics, 15:513-526.
Lalonde, T. L. (2009). Components of overdispersion in hierarchical generalized linear models. Dissertation.
Lee, Y. and Nelder, J. A. (1996). Hierarchical generalized linear models. Journal of the Royal Statistical Society, Series B (Methodological), 58(4):619-678.
McCulloch, C. E. and Searle, S. R. (2001). Generalized, Linear, and Mixed Models. John Wiley & Sons, Inc., New York.