H-LIKELIHOOD ESTIMATION METHOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL


Intesar N. El-Saeiti
Department of Statistics, Faculty of Science, University of Benghazi, Libya
entesar.el-saeiti@uob.edu.ly

ABSTRACT

Clustered or hierarchical data with binary responses are common in many practical applications. Clusters may or may not contain equal numbers of observations. Such data structures often involve complex patterns of variability. Mixed models are often the most appropriate models in practice, as they contain fixed effects of interest and random effects that account for the clustering. The random effects reflect multiple error structures. According to Lee and Nelder (1996), a preferred model for clustered binary mixed effects data is the Hierarchical Generalized Linear Model (HGLM). This article compares the performance of the h-likelihood estimation method for mixed effects clustered binary data models with balanced and unbalanced cluster sizes. The comparison was carried out by computer simulation in terms of the bias of the parameter estimates, Type I error rate, power, and standard error. The simulation used different numbers of clusters and different cluster sizes. The results show that the method performs better for balanced mixed effects clustered binary data models than for unbalanced ones.

Keywords: Hierarchical Generalized Linear Model (HGLM), H-Likelihood Method, Binary Response.

INTRODUCTION

Many research studies in health, finance, education, and the social sciences involve collecting binary data clustered into groups, such as the smoking status of students sampled from different schools or the disease status of animals from different farms.
Such data are expected to be correlated within clusters: students from the same school tend to be more similar than students from different schools, and animals from the same farm tend to be more similar than animals from different farms. When designing such studies, a choice needs to be made regarding the number of groups to sample. A larger number of groups results in less dependence in the data and more precision in estimating the effects of explanatory variables.

Clusters may be balanced or unbalanced; that is, the number of observations in a cluster (the cluster size) may be equal or may differ among clusters. Unbalanced clusters result from sub-sampling unequal numbers of observations from each cluster. They also occur when vector elements of a clustered multivariate outcome are randomly missing, or when subjects differ in the number of vector elements relevant to the analysis. Many authors have studied unbalanced clustered data. Different cluster sizes can lead to different dispersions for each cluster, which raises the problem of heterogeneous models that require different variance components, as addressed in previous work for continuous responses (El-Saeiti, 2004).

This article uses a nested design with a mixed effects model. The mixed model is often the most appropriate model in practice, as it contains both fixed and random factors. A model that contains both fixed effects and random effects is called a generalized linear mixed model (GLMM) or a hierarchical generalized linear model (HGLM) (Lee and Nelder, 1996). Hierarchical generalized linear models allow extra error components in the linear predictors of generalized linear models. The distribution of these components is not required to be normal, which allows a broader class of models.
In hierarchical generalized linear models, the response and the random effects are allowed to follow any distribution in the exponential family; for more details see McCulloch and Searle (2001). As such, the HGLM is more appropriate for clustered data than the generalized linear model (GLM). In generalized linear models, maximum likelihood (ML) is used to estimate the mean component. An extension of ML to the HGLM is the Restricted Pseudo Likelihood (REPL) estimation method for binary mixed effects models, which is discussed in depth by El-Saeiti (2015). Geys, Molenberghs and Ryan (1997) showed that ML and REPL give parameter estimates that agree fairly closely. Hierarchical likelihood (HL) estimation is used to estimate both the mean parameters and the dispersion parameters. In HL, as in REPL, the distribution of the random components does not need to be normal; this allows for a broader class of models (Lee and Nelder, 1996).

Lee and Nelder (1996) defined the hierarchical likelihood (h-likelihood) for $y$ as

\[ h = \ln f(y \mid v; \beta, \phi) + \ln f(v; \alpha) \tag{1} \]
\[ \phantom{h} = l(\beta, \phi; y \mid v) + l(\alpha; v), \tag{2} \]

where $f(y \mid v; \beta, \phi)$ and $f(v; \alpha)$ denote the conditional density function of $y$ given the random effect $v$ and the density function of $v$, respectively. Here $\beta$ are the fixed effects, $\phi$ are the dispersion parameters for the conditional distribution of $y \mid v$, and $\alpha$ are the parameters for the random effects. The random component $v$ is the scale on which the random effect $u$ occurs linearly in the linear predictor, $v = v(u)$. One reason for developing an algorithm on the $v$-scale rather than on the $u$-scale is that $v$ can often assume any real value, whereas $u$ usually has range restrictions, which may cause problems in convergence (Lee and Nelder, 1996).

Estimates obtained by maximizing the h-likelihood are called maximum h-likelihood estimates (MHLEs); they are obtained by solving

\[ \frac{\partial h}{\partial \beta} = 0, \qquad \frac{\partial h}{\partial v} = 0. \]

As an example of the HGLM, consider a binary outcome. According to Lee and Nelder (1996), the appropriate distribution for the dependent variable is binomial (since the outcome is binary) and the appropriate distribution for the random effect is a beta distribution. For more detail and examples on binary outcomes with a beta distribution for the random effects, see El-Saeiti (2013), Lalonde (2009) and Lee and Nelder (1996). The pieces of the HGLM (response distribution, random effect distribution, linear component, and link function, respectively) are:

\[ Y_{ij} \mid u_i \sim \mathrm{Bin}\bigl(\mu, \mu(1-\mu)\bigr), \qquad u_i \sim \mathrm{Beta}(\gamma, \lambda), \]
\[ \eta_{ij} = x_{ij}\beta + v(u_i), \qquad \eta_{ij} = \mathrm{logit}(p). \]

The h-likelihood for the binomial-beta model (Lee and Nelder, 1996) is

\[ h = l(\beta, \phi; y \mid v) + l(\alpha; v). \]

The h-likelihood estimating equations for the fixed part $\beta$ and the random component $v$ are, respectively,

\[ \frac{\partial h}{\partial \beta_k} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left[ x_{ijk}\, y_{ij} - n_i\, x_{ijk}\, \frac{e^{x_{ij}\beta + v_i}}{1 + e^{x_{ij}\beta + v_i}} \right] = 0, \tag{3} \]

and

\[ \frac{\partial h}{\partial v_i} = \sum_{j=1}^{n_i} \left[ y_{ij} - \frac{e^{x_{ij}\beta + v_i}}{1 + e^{x_{ij}\beta + v_i}} \right] + \gamma - (\gamma + \lambda)\, \frac{e^{v_i}}{1 + e^{v_i}} = 0. \tag{4} \]
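The random-effect score equation (4) can be explored numerically for a single cluster. The following is a minimal sketch and not the author's implementation: it assumes Bernoulli responses, a logit link, and the density of $v = \mathrm{logit}(u)$ for $u \sim \mathrm{Beta}(\gamma, \lambda)$ written up to an additive constant. The function names (`h_cluster`, `score_v`, `solve_v`) and the example data are hypothetical.

```python
import math

def inv_logit(z):
    """Logistic function e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def h_cluster(v, y, eta_fixed, gamma, lam):
    """One cluster's contribution to h, dropping terms free of v.

    eta_fixed holds x_ij * beta for each observation (assumed given).
    """
    # Bernoulli log-likelihood of y given the random effect v
    ll = sum(yi * (e + v) - math.log1p(math.exp(e + v))
             for yi, e in zip(y, eta_fixed))
    # log density of v = logit(u), u ~ Beta(gamma, lam), up to a constant
    lv = gamma * v - (gamma + lam) * math.log1p(math.exp(v))
    return ll + lv

def score_v(v, y, eta_fixed, gamma, lam):
    """Derivative of h_cluster with respect to v, as in equation (4)."""
    resid = sum(yi - inv_logit(e + v) for yi, e in zip(y, eta_fixed))
    return resid + gamma - (gamma + lam) * inv_logit(v)

def solve_v(y, eta_fixed, gamma, lam, tol=1e-10, max_iter=100):
    """Newton-Raphson iteration for the root of score_v, started at v = 0."""
    v = 0.0
    for _ in range(max_iter):
        probs = [inv_logit(e + v) for e in eta_fixed]
        u = inv_logit(v)
        # negative second derivative of h_cluster in v (always positive)
        info = sum(p * (1.0 - p) for p in probs) + (gamma + lam) * u * (1.0 - u)
        step = score_v(v, y, eta_fixed, gamma, lam) / info
        v += step
        if abs(step) < tol:
            break
    return v

# Hypothetical cluster: five binary responses and fixed-effect predictors
y_example = [1, 0, 1, 1, 0]
eta_example = [0.5, -0.2, 1.0, 0.3, -0.5]
v_hat = solve_v(y_example, eta_example, gamma=2.0, lam=3.0)
u_hat = inv_logit(v_hat)  # implied estimate on the u-scale
```

Working on the $v$-scale keeps the iteration unconstrained, which is the convergence argument attributed above to Lee and Nelder (1996); the estimate on the $u$-scale is recovered afterwards through the inverse logit.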

Equating $\partial h / \partial v_i$ in (4) to zero gives an estimate of the random effect,

\[ \hat{u}_i = \frac{\sum_{j=1}^{n_i} y_{ij} - n_i p_i + \gamma}{\gamma + \lambda}, \]

where $u_i = e^{v_i} / (1 + e^{v_i})$ and $n_i p_i$ stands for the sum of the fitted success probabilities in cluster $i$. Equations (3) and (4) can then be solved by either the Newton-Raphson method or Fisher's scoring method (Gu, 2008).

SIMULATION

Two data sets were generated: the first with balanced cluster sizes and the second with unbalanced cluster sizes. The parameter values were defined first; then the covariate and the random effect were generated, and the success probability of the dependent variable was calculated. For unequal cluster sizes, the number of subjects per cluster was drawn from a Poisson distribution whose mean was the mean number of observations per cluster. By choosing different mean cluster sizes ($\bar{n}$ = 10, 25, 50, 100), the researcher showed the difference in statistical performance for various sample sizes. The next step was to generate a normally distributed continuous variable $x_{1ij}$ with mean 3 and known variance 20, $x_{1ij} \sim N(3, 20)$. A beta-distributed random variable $u_i$ with parameters $\gamma = 2$ and $\lambda = 3$ was then generated for each cluster $i$, $u_i \sim \mathrm{Beta}(2, 3)$. For equal cluster sizes the same steps were taken, except that the number of observations was the same in every cluster. Finally, $Y_{ij}$ was generated for each data unit from a Bernoulli distribution with success probability

\[ p_{ij} = \frac{e^{\beta_0 + \beta_1 x_{ij} + u_i}}{1 + e^{\beta_0 + \beta_1 x_{ij} + u_i}}, \]

where $\beta_0 = 1$ and $\beta_1 = 0.2$. Parameter estimates were obtained using h-likelihood (Heo and Leon, 2005).

The number of clusters was set to K = 10, 20, 50, 100. The cluster size for balanced clusters was n = 10, 25, 50, 100, and for unbalanced clusters the mean number of observations per cluster was $\bar{n}$ = 10, 25, 50, 100. For each combination of K and n, 1,000 data sets were generated for each case (equal and unequal) to calculate the power, Type I error rate, and standard errors.
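The data-generating steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the author's code: it uses only the Python standard library, draws Poisson cluster sizes with Knuth's multiplication method, and replaces a Poisson draw of zero by one observation, a choice the article does not specify. The function names and the seed are hypothetical.

```python
import math
import random

def poisson_draw(rng, mean):
    """Poisson variate via Knuth's multiplication method (moderate means)."""
    limit = math.exp(-mean)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def simulate_clusters(K, n, balanced, beta0=1.0, beta1=0.2, seed=0):
    """Return rows (cluster index, x, y) for K clusters.

    balanced=True : every cluster has exactly n observations.
    balanced=False: cluster sizes are Poisson with mean n
                    (a size of 0 is replaced by 1; this is an assumption).
    """
    rng = random.Random(seed)
    rows = []
    for i in range(K):
        n_i = n if balanced else max(1, poisson_draw(rng, n))
        u_i = rng.betavariate(2.0, 3.0)             # u_i ~ Beta(2, 3)
        for _ in range(n_i):
            x = rng.gauss(3.0, math.sqrt(20.0))     # x ~ N(3, variance 20)
            eta = beta0 + beta1 * x + u_i           # linear predictor
            p = 1.0 / (1.0 + math.exp(-eta))        # success probability
            y = 1 if rng.random() < p else 0        # Bernoulli response
            rows.append((i, x, y))
    return rows

balanced_data = simulate_clusters(K=10, n=25, balanced=True)
unbalanced_data = simulate_clusters(K=10, n=25, balanced=False, seed=1)
```

Each call produces one simulated data set; a full study along the lines of the article would repeat the call 1,000 times per combination of K and n and fit the model to every replicate.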
To calculate the power, Type I error rate, and standard error, data were generated according to the model with systematic component $\eta_{ij} = \beta_0 + \beta_1 x_{1ij} + v_i$, with a single treatment effect $\beta_1$. The model was then fitted with the systematic component $\eta_{ij} = \beta_0 + \beta_1 x_{1ij} + \beta_2 x_{2ij} + v_i$, where $\beta_0$ was the intercept, $\beta_1$ was the treatment effect, $x_1$ was generated from the normal distribution, $\beta_2$ was an extra parameter, and $x_2$ was a second treatment variable generated from the Poisson distribution with mean 3, $x_2 \sim \mathrm{Poi}(\lambda = 3)$. Power was estimated as the proportion of correct detections of significance for $\beta_1$, while the Type I error rate was estimated as the proportion of incorrect detections of significance for $\beta_2$.

For the h-likelihood HGLM described above, the systematic component used for generating data was $\eta_{ij} = 1 + 0.2 x_{1ij} + v_i$, and the systematic component for the fitted model was $\eta_{ij} = 1 + 0.2 x_{1ij} + 3.1 x_{2ij} + v_i$, where $v_i \sim \mathrm{Beta}(2, 3)$. For the Binomial Beta h-likelihood, the researcher used the hglm function in the hglm package in R, which returned the estimates of the parameters $\beta$ together with t-statistics and p-values. Through simulation, the average over 1,000 replications was calculated for the estimates of $\beta_1$ and $\beta_2$, the power of the hypothesis test for $\beta_1$, the Type I error rate of the hypothesis test for $\beta_2$, and the standard error of $\beta_1$.
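Power and Type I error rate are both rejection proportions, which can be sketched in a few lines. The sketch below is an assumption-laden illustration: it computes a two-sided p-value from a normal reference distribution, whereas the article takes t-statistics and p-values directly from hglm, and the (estimate, standard error) pairs are made-up placeholders rather than simulation output.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal reference."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def rejection_rate(p_values, alpha=0.05):
    """Proportion of replications whose p-value falls below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

# Hypothetical (estimate, standard error) pairs for beta_1 from four fits
fits_beta1 = [(0.21, 0.05), (0.18, 0.06), (0.05, 0.07), (0.25, 0.04)]
p_beta1 = [two_sided_p(est / se) for est, se in fits_beta1]

# Applied to beta_1 (true value non-zero) this proportion estimates power;
# applied to beta_2 (true value zero) it estimates the Type I error rate.
power_beta1 = rejection_rate(p_beta1)
```

The same `rejection_rate` function serves both quantities; only the parameter being tested, and whether its true value is zero, changes the interpretation.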

RESULTS

Table 1 gives the Binomial Beta h-likelihood parameter estimates.

Table 1: Parameter estimates

                           Balanced clusters              Unbalanced clusters
Clusters   Cluster size    \hat\beta_1   \hat\beta_2      \hat\beta_1   \hat\beta_2
K = 10     10              0.2319765     -0.007228321     0.1958833      0.009286461
           25              0.1939059      0.003553967     0.2017746      0.0108503
           50              0.1970002     -0.002042296     0.188225      -0.0001238602
           100             0.199145       0.002284678     0.2009817     -0.01050844
K = 20     10              0.215392      -0.03054897      0.210038       0.01873527
           25              0.2038395     -0.01017131      0.2013315     -0.001884942
           50              0.2035105      0.004907986     0.2022876      0.0006811804
           100             0.2006388     -0.002680622     0.1983477     -0.000997808
K = 50     10              0.2080814      0.001532905     0.1958833      0.009286461
           25              0.1994717      0.002696468     0.2022252      0.006061514
           50              0.1967751     -0.0005004571    0.2000865      0.002234016
           100             0.2001256      0.0007905866    0.20241        0.000397104
K = 100    10              0.2004939      0.001584383     0.196161       0.003048525
           25              0.2016236     -0.002657747     0.202098       0.002534502
           50              0.1991661      0.0008547018    0.2014994      0.001459892
           100             0.1996344     -0.00128299      0.1980433      0.001697924

For both balanced and unbalanced cluster sizes, the Binomial Beta h-likelihood estimates of $\beta_1$ and $\beta_2$ were very close to the true values, $\beta_1 = 0.2$ and $\beta_2 = 0$. The Binomial Beta h-likelihood was therefore a good estimation method in terms of bias, and its performance was similar regardless of inequality in cluster size.

Table 2 gives the Binomial Beta h-likelihood Type I error rate for $\beta_2$ for balanced and unbalanced cluster sizes. Type I error rates were computed as the proportion of p-values less than 0.05 under the null hypothesis $H_0: \beta_2 = 0$. Ideally, the Type I error rate should be close to 0.05. All estimated rates in Table 2 exceed this nominal level, ranging from 0.065 to 0.165 for balanced clusters and from 0.06 to 0.136 for unbalanced clusters. The rates differed slightly between equal and unequal cluster sizes; for large cluster sizes, balanced clusters tended to give smaller rates than unbalanced clusters.

Table 2: Type I error

Clusters   Cluster size    Balanced   Unbalanced
K = 10     10              0.12       0.085
           25              0.07       0.095
           50              0.12       0.09
           100             0.073      0.104
K = 20     10              0.136      0.109
           25              0.09       0.096
           50              0.165      0.108
           100             0.067      0.087
K = 50     10              0.067      0.085
           25              0.065      0.126
           50              0.087      0.104
           100             0.123      0.089
K = 100    10              0.102      0.06
           25              0.082      0.134
           50              0.095      0.136
           100             0.087      0.121

Table 3 gives the Binomial Beta h-likelihood power of the hypothesis test for $\beta_1$. Statistical power was computed as the proportion of correct rejections of the hypothesis $H_0: \beta_1 = 0$. Through simulation, the test was conducted 1,000 times to see how often it was significant; the power was the proportion of those 1,000 tests that correctly rejected.

Table 3: Power

Clusters   Cluster size    Balanced   Unbalanced
K = 10     10              0.89       0.906
           25              1          0.677
           50              1          0.864
           100             1          0.991
K = 20     10              0.998      0.615
           25              1          0.937
           50              1          0.999
           100             1          1
K = 50     10              1          0.906
           25              1          1
           50              1          1
           100             1          1
K = 100    10              1          0.991
           25              1          1
           50              1          1
           100             1          1

Balanced cluster sizes gave higher power than unbalanced cluster sizes, especially with small sample sizes. This means that the Binomial Beta h-likelihood is a better estimation method for balanced than for unbalanced clustered binary models.
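When reading the rejection rates in Tables 2 and 3, it may help to keep in mind the Monte Carlo error of a proportion estimated from 1,000 replications. The short sketch below is not part of the original analysis; it assumes independent replications, so that the number of rejections is binomial, and uses a normal approximation for the interval. The function names are hypothetical.

```python
import math

def mc_standard_error(rate, n_rep):
    """Binomial standard error of a rejection rate estimated from n_rep runs."""
    return math.sqrt(rate * (1.0 - rate) / n_rep)

def mc_interval(rate, n_rep, z=1.96):
    """Approximate 95% range for the estimated rate when the true rate is `rate`."""
    half_width = z * mc_standard_error(rate, n_rep)
    return rate - half_width, rate + half_width

# Nominal level 0.05 with 1,000 replications, as in the simulation design
se_nominal = mc_standard_error(0.05, 1000)
low, high = mc_interval(0.05, 1000)
```

Under these assumptions, an estimated Type I error rate from 1,000 replications of a test with true level 0.05 would usually fall roughly between 0.036 and 0.064, which gives a yardstick for judging how far the tabulated rates are from the nominal level.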

Table 4 gives the standard error (SE), computed as the average of the 1,000 standard errors of the estimates of $\beta_1$. A smaller SE represents smaller estimated variability, or greater precision, of the parameter estimates (Heo and Leon, 2005). The standard error of $\hat\beta$ indicates whether or not the efficiency improved. Table 4 shows that the Binomial Beta h-likelihood gave small standard errors for balanced clusters.

Table 4: Standard error

Clusters   Cluster size    Balanced      Unbalanced
K = 10     10              0.07152838    0.05695932
           25              0.04197166    0.08128032
           50              0.02903917    0.05683201
           100             0.0202115     0.04005908
K = 20     10              0.04737441    0.09272815
           25              0.02885089    0.05658015
           50              0.02028826    0.04003575
           100             0.0142676     0.02824783
K = 50     10              0.02903625    0.05695932
           25              0.01807183    0.03579394
           50              0.0127137     0.02526456
           100             0.00901145    0.01782909
K = 100    10              0.0202617     0.04016537
           25              0.01277624    0.0252753
           50              0.009003529   0.01786361
           100             0.006371349   0.01261467

Tables 1 to 4 summarize the simulation results of the Binomial Beta h-likelihood method for equal and unequal cluster sizes: parameter estimates, power, Type I error rate, and standard error. The tables show that the Binomial Beta h-likelihood was a good estimation method, because the averages over 1,000 replications were very close to the true values, 0.2 for $\beta_1$ and zero for $\beta_2$. Power was higher for balanced than for unbalanced clusters, and the Type I error rate tended to be somewhat smaller for balanced than for unbalanced clusters. The smaller average SE represents smaller estimated variability, or greater precision, of the parameter estimates (Heo and Leon, 2005). Overall, balanced cluster sizes gave somewhat better values than unbalanced cluster sizes.

CONCLUSIONS

The Binomial Beta h-likelihood was an effective method for mixed effects models for clustered binary data, with slight differences according to cluster size. The averages over 1,000 replications gave estimates that were close to the true values.
The power of the hypothesis test for the regression parameters was better for balanced than for unbalanced clusters, and the Type I error rate of the hypothesis test for the regression parameters was acceptable, with smaller values for balanced than for unbalanced clusters. The standard error of the regression parameters was small. The results of this article indicate that the Binomial Beta h-likelihood is a more acceptable estimation method for balanced than for unbalanced clustered binary responses. The simulation demonstrated the capability of the Binomial Beta h-likelihood estimation method with balanced cluster sizes.

FUTURE WORK

Since the Binomial Beta h-likelihood is a more acceptable estimation method for balanced than for unbalanced clustered binary responses, it would be worthwhile to adjust the Binomial Beta h-likelihood estimation method to deal with unbalanced cluster sizes. This will be the author's next work.

References

El-Saeiti, I. N. (2004). Messy data in heteroscedastic models case study: Mixed nested design. M.Sc. thesis.
El-Saeiti, I. N. (2013). Adjusted variance components for unbalanced clustered binary data models. Ph.D. dissertation.
El-Saeiti, I. N. (2015). Performance of mixed effects for clustered binary data models. AIP Conf. Proc., 1643:80-85.
Geys, H., Molenberghs, G. and Ryan, L. M. (1997). Pseudo-likelihood inference for clustered binary data. Communications in Statistics - Theory and Methods, 26(11):2743-2767.
Gu, Z. (2008). Model diagnostics for generalized linear mixed models. Dissertation.
Heo, M. and Leon, A. (2005). Performance of a mixed effects logistic regression model for binary outcomes with unequal cluster size. Journal of Biopharmaceutical Statistics, 15:513-526.
Lalonde, T. L. (2009). Components of overdispersion in hierarchical generalized linear models. Dissertation.
Lee, Y. and Nelder, J. A. (1996). Hierarchical generalized linear models. Journal of the Royal Statistical Society, Series B (Methodological), 58(4):619-678.
McCulloch, C. E. and Searle, S. R. (2001). Generalized, Linear, and Mixed Models. John Wiley & Sons, Inc., New York.