Generating Half-normal Plot for Zero-inflated Binomial Regression


Paper SP05

Zhao Yang, Xuezheng Sun
Department of Epidemiology & Biostatistics, University of South Carolina, Columbia, SC

SUMMARY

The half-normal plot, a valuable tool in model diagnostics, is a statistical graph based on a simulated envelope. Michael Friendly (1998) contributed the macro %halfnormal for generalized linear models, but it supports only the Normal, Binomial, Poisson and Gamma distributions. Since the zero-inflated binomial (ZIB) model was formalized by D. Hall (2000), it has become popular for modeling zero-inflated binomial data, including pharmaceutical data in which, for example, natural immunity produces excess zeros. The half-normal plot is a good way to evaluate the fit of a ZIB model. The macro %halfnormal_zib developed in this paper generates the half-normal plot for ZIB regression and is easy to apply in practice. We show its application with a simulated data set and with the Whitefly data set used in Hall's paper.

Keywords: half-normal plot, zero-inflated binomial, proc nlmixed, whitefly dataset

INTRODUCTION

There are many analysis methods for binomial data, such as logistic regression and the probit model. In practice, however, we can observe excess zeros in the raw data, i.e. the observed number of zeros is larger than the number predicted by the model. We call this phenomenon zero-inflation. In the biomedical field, natural immunity may be a cause of zero-inflation. In that case, the zero-inflated binomial (ZIB) model can be used to fit the data. The basic idea behind ZIB is that the data-generating mechanism contains a pure zero state, and the remaining part comes from an ordinary binomial distribution.
ZERO-INFLATED BINOMIAL MODEL

The zero-inflated binomial (ZIB) distribution was first introduced by Kemp and Kemp in 1988, but they used it only to highlight some important aspects of empirical probability generating function estimation. Hall (2000) investigated and extended the ZIB model with and without random effects, and gave detailed applications in data analysis. Following Hall (2000), the ZIB regression model is

$$
Y_i \sim \begin{cases}
0 & \text{with probability } p_i \\
\text{Binomial}(n_i, \pi_i) & \text{with probability } 1 - p_i
\end{cases} \tag{1}
$$

This model implies

$$
Y_i = \begin{cases}
0 & \text{with probability } p_i + (1 - p_i)(1 - \pi_i)^{n_i} \\
k & \text{with probability } (1 - p_i)\binom{n_i}{k}\pi_i^{k}(1 - \pi_i)^{n_i - k}, \quad k = 1, 2, \ldots, n_i
\end{cases} \tag{2}
$$

with $E(Y_i) = (1 - p_i) n_i \pi_i$ and $Var(Y_i) = (1 - p_i) n_i \pi_i \left(1 - \pi_i (1 - p_i n_i)\right)$.

The above probability can also be expressed as a generalized Bernoulli distribution, which makes it easier to combine into a full likelihood,

$$
L = \prod_{i=1}^{n} \left[ p_i + (1 - p_i)(1 - \pi_i)^{n_i} \right]^{u_i}
\left[ (1 - p_i)\binom{n_i}{y_i}\pi_i^{y_i}(1 - \pi_i)^{n_i - y_i} \right]^{1 - u_i} \tag{3}
$$

where $u_i = 1$ if $y_i = 0$ and $u_i = 0$ otherwise.
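As a quick numerical check of the ZIB probability function in (2) and the stated mean and variance, the following Python sketch may help. It is not part of the authors' SAS macro; the function names and the values n = 10, p = 0.3, π = 0.4 are illustrative assumptions only.

```python
from math import comb

def zib_pmf(k, n, p, pi):
    """P(Y = k) for a zero-inflated binomial, following equation (2)."""
    binom = comb(n, k) * pi**k * (1 - pi) ** (n - k)
    if k == 0:
        # structural zero, or a binomial draw that happens to be zero
        return p + (1 - p) * binom
    return (1 - p) * binom

def zib_mean(n, p, pi):
    """E(Y) = (1 - p) n pi."""
    return (1 - p) * n * pi

def zib_var(n, p, pi):
    """Var(Y) = (1 - p) n pi (1 - pi (1 - p n))."""
    return (1 - p) * n * pi * (1 - pi * (1 - p * n))

# Illustrative (hypothetical) parameter values, not taken from the paper.
n, p, pi = 10, 0.3, 0.4
probs = [zib_pmf(k, n, p, pi) for k in range(n + 1)]
mean = sum(k * q for k, q in enumerate(probs))
var = sum(k**2 * q for k, q in enumerate(probs)) - mean**2
print(sum(probs), mean, zib_mean(n, p, pi), var, zib_var(n, p, pi))
```

The probabilities sum to one, and the moments computed directly from the probability function agree with the closed-form expressions.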

The parameters $\pi = (\pi_1, \ldots, \pi_n)^T$ and $p = (p_1, \ldots, p_n)^T$ are modeled via logit links, $\text{logit}(p) = G\gamma$ and $\text{logit}(\pi) = B\beta$. Hence the log-likelihood for this ZIB model is

$$
l(\gamma, \beta; y) = \sum_{i=1}^{n} \left\{ u_i \log\left[ e^{G_i\gamma} + (1 + e^{B_i\beta})^{-n_i} \right] - \log(1 + e^{G_i\gamma})
+ (1 - u_i)\left[ y_i B_i\beta - n_i \log(1 + e^{B_i\beta}) + \log\binom{n_i}{y_i} \right] \right\} \tag{4}
$$

for covariate matrices $B_{n \times p}$ and $G_{n \times q}$. Here n is the number of observations, p is the number of covariates in the regular binomial regression part, and q is the number of covariates in the zero-inflation part; $\gamma_{q \times 1}$ and $\beta_{p \times 1}$ are the parameter vectors for the zero-inflation part and the regular binomial part, respectively. The EM algorithm or the Newton-Raphson method can be used to obtain the ML estimates; see Hall (2000) for more information. We define the standardized Pearson residual as

$$
r_i^{p} = \frac{y_i - (1 - p_i) n_i \pi_i}{\sqrt{(1 - p_i) n_i \pi_i \left(1 - \pi_i (1 - p_i n_i)\right)}} \tag{5}
$$

with $p_i$ and $\pi_i$ replaced by their fitted values.

Having fitted a ZIB model to the raw data, the score test, likelihood ratio test, or Wald test can be used to judge whether the ZIB model is better than the ordinary binomial regression model. Even if we are convinced that the raw data are zero-inflated, we still need a tool to evaluate the ZIB model itself. The half-normal plot is a good candidate for this model diagnosis.

HALF-NORMAL PLOT

Since the distribution of the residuals is not known, half-normal plots with simulated envelopes are a helpful diagnostic tool (Atkinson, 1985, Sec. 4.2; Neter et al., 1996, Sec. 14.6; Collett, 2003, Sec. 5.2.2). The main idea is to enhance the usual half-normal plot by adding a simulated envelope, which can be used to decide whether the observed residuals are consistent with the fitted model.
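To make the log-likelihood (4) and the standardized Pearson residual (5) concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' PROC NLMIXED code: the function names, the single covariate, and all numeric values are hypothetical.

```python
from math import comb, exp, log, sqrt

def zib_loglik(gamma, beta, y, n, G, B):
    """Log-likelihood (4). G and B are lists of covariate rows (intercept included)."""
    total = 0.0
    for yi, ni, g_row, b_row in zip(y, n, G, B):
        eta_p = sum(g * c for g, c in zip(g_row, gamma))   # G_i gamma
        eta_pi = sum(b * c for b, c in zip(b_row, beta))   # B_i beta
        total -= log(1 + exp(eta_p))                       # common to both cases
        if yi == 0:   # u_i = 1
            total += log(exp(eta_p) + (1 + exp(eta_pi)) ** (-ni))
        else:         # u_i = 0
            total += yi * eta_pi - ni * log(1 + exp(eta_pi)) + log(comb(ni, yi))
    return total

def pearson_residual(yi, ni, p, pi):
    """Standardized Pearson residual (5), given fitted p and pi."""
    mean = (1 - p) * ni * pi
    var = mean * (1 - pi * (1 - p * ni))
    return (yi - mean) / sqrt(var)

# Hypothetical toy data with one covariate x and an intercept.
x = [0.1, 0.5, 1.0, 1.5]
X = [[1.0, v] for v in x]
y = [0, 3, 0, 5]
n = [10, 10, 8, 12]
print(zib_loglik([-1.0, 0.5], [0.2, -0.3], y, n, X, X))
print(pearson_residual(0, 10, 0.3, 0.4))
```

In practice the log-likelihood would be maximized numerically over γ and β (the paper uses PROC NLMIXED for this step), and the residuals would then be evaluated at the fitted values.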
Half-normal plots with a simulated envelope can be produced as follows:

(i) fit the model and generate a simulated sample of n independent observations using the fitted model as if it were the true model;
(ii) fit the model to the generated sample, and compute the ordered absolute values of the residuals;
(iii) repeat steps (i) and (ii) k times;
(iv) consider the n sets of the k order statistics; for each set compute its average, minimum and maximum values;
(v) plot these values and the ordered residuals of the original sample against the half-normal scores $\Phi^{-1}\left((t + n - 1/8)/(2n + 1/2)\right)$, $t = 1, \ldots, n$.

The minimum and maximum values of the k order statistics yield the envelope. Atkinson (1985, p. 36) suggests using k = 19, so that the probability that a given absolute residual falls beyond the upper band of the envelope is approximately 1/20 = 0.05. Observations whose absolute residuals fall outside the limits of the simulated envelope are worthy of further investigation. Additionally, if a considerable proportion of points falls outside the envelope, one has evidence against the adequacy of the fitted model.

SAS MACRO %halfnormal_zib

Michael Friendly (1998) contributed the macro %halfnormal for some generalized linear models (GLMs), but it supports only the Normal, Binomial, Poisson and Gamma distributions. In statistical modeling, the adequacy of the model for the data is an important issue, and the half-normal plot is a valuable tool for evaluating it. With the growing use of the ZIB model, a program for generating the corresponding half-normal plot is needed. This paper develops the macro %halfnormal_zib for generating the half-normal plot for a ZIB model; the original code is in the appendix. The half-normal plot generated by %halfnormal_zib is based on the standardized Pearson residual.
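Steps (iv) and (v) of the envelope procedure described earlier in this section can be sketched in Python as follows. This is only an illustration, not the code of %halfnormal_zib: the function names are hypothetical, and the residuals in the demonstration are standard normal stand-ins rather than residuals from refitted ZIB models.

```python
import random
from statistics import NormalDist, mean

def half_normal_scores(n):
    """Phi^{-1}((t + n - 1/8) / (2n + 1/2)) for t = 1, ..., n."""
    inv = NormalDist().inv_cdf
    return [inv((t + n - 0.125) / (2 * n + 0.5)) for t in range(1, n + 1)]

def envelope(simulated_residuals):
    """simulated_residuals: k lists of n residuals, one list per simulated fit.
    Returns (lower, average, upper), each of length n."""
    ordered = [sorted(abs(r) for r in sample) for sample in simulated_residuals]
    by_rank = list(zip(*ordered))  # n tuples, each holding k order statistics
    return ([min(v) for v in by_rank],
            [mean(v) for v in by_rank],
            [max(v) for v in by_rank])

def outside_envelope(residuals, lower, upper):
    """0-based ranks of ordered absolute residuals that leave the envelope."""
    obs = sorted(abs(r) for r in residuals)
    return [i for i, (r, lo, hi) in enumerate(zip(obs, lower, upper))
            if r < lo or r > hi]

# Stand-in data (assumption): k = 19 simulated samples of n = 30 residuals.
rng = random.Random(2006)
n_obs, k = 30, 19
sims = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(k)]
observed = [rng.gauss(0, 1) for _ in range(n_obs)]
lower, average, upper = envelope(sims)
scores = half_normal_scores(n_obs)
print(len(scores), outside_envelope(observed, lower, upper))
```

Plotting `lower`, `average`, `upper` and the ordered observed residuals against `scores` would give the half-normal plot with its envelope.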

SIMULATION AND AN EXAMPLE

We first generate a simulated data set to test the macro. The simulation is based on the following schedule:

$$
\text{logit}(p_i) = \log\frac{p_i}{1 - p_i} = g_i\gamma = \beta_0 + \beta_1 x_i \tag{6}
$$

$$
\text{logit}(\pi_i) = \log\frac{\pi_i}{1 - \pi_i} = b_i\beta = \alpha_0 + \alpha_1 x_i \tag{7}
$$

The code for generating data under this schedule is shown in Figure 1. There is only one covariate, x, in the ZIB model; it can be thought of as a dosage in a biomedical setting. The variable m is the number of trials, and yzib is the response variable, i.e. the number of events in m trials.

Figure 1: The program to generate a zero-inflated binomial random sample based on schedule (6) for the zero-inflation part and schedule (7) for the regular binomial model; the data set test has 149 observations.

We can then obtain the parameter estimates for the simulated data via PROC NLMIXED, which is integrated in the macro %halfnormal_zib. The following code shows how to use the macro. Generally, not all parameters need to be specified; for example, seed, nres and out can be omitted.

    %halfnorm_zib(data = test, resp = yzib,
                  coefg = bp_0 bp_1, coefb = bll_0 bll_1,
                  g = x, b = x, gv = 0 0, bv = 0 0,
                  trials = m, out = pp, seed = 2006, nres = 19);

The parameter estimates are shown in Table 1; here the t-value is an asymptotically normal Wald-type statistic defined as the ratio of the estimate to its standard error. From Table 1, we can see that the estimated parameters are very close to the assumed values. The half-normal plot generated by the macro is shown in Figure 2; all the points fall within the boundary of the simulated envelope, indicating that the fitted model is adequate.

Next, we analyze a real data set, the Whitefly data (Hall, 2000), to show the application of the macro. For more information about the Whitefly data set, see Hall's paper.
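As an aside on the simulated example, the data-generating schedule (6) and (7) can be sketched in Python. The original SAS program (Figure 1) and its coefficient values are not reproduced here, so every number below (coefficients, the range of x, the number of trials) is a hypothetical choice made only for illustration; only the sample size of 149 comes from the text.

```python
import random
from math import exp

def inv_logit(eta):
    return 1.0 / (1.0 + exp(-eta))

def simulate_zib(x, m, beta, alpha, rng):
    """Draw one ZIB response per observation.
    beta  = (beta_0, beta_1):   logit(p_i)  = beta_0  + beta_1  * x_i  (zero-inflation)
    alpha = (alpha_0, alpha_1): logit(pi_i) = alpha_0 + alpha_1 * x_i  (binomial part)
    """
    y = []
    for xi, mi in zip(x, m):
        p = inv_logit(beta[0] + beta[1] * xi)
        pi = inv_logit(alpha[0] + alpha[1] * xi)
        if rng.random() < p:
            y.append(0)  # structural zero
        else:
            # binomial(m_i, pi_i) as a sum of Bernoulli draws
            y.append(sum(rng.random() < pi for _ in range(mi)))
    return y

rng = random.Random(2006)
x = [rng.uniform(0, 2) for _ in range(149)]   # hypothetical dosage values
m = [20] * 149                                # hypothetical number of trials
yzib = simulate_zib(x, m, beta=(-1.0, 0.5), alpha=(0.5, -0.8), rng=rng)
print(len(yzib), sum(v == 0 for v in yzib))
```

Passing an explicit seeded generator keeps the simulated sample reproducible, which mirrors the role of the seed parameter in the macro call.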
Since PROC NLMIXED has no CLASS statement, a nominal variable cannot be used in it directly. Figure 3 therefore shows a program that uses PROC GLMMOD to generate a new data set containing the indicator variables needed for further analysis. For simplicity, and to illustrate an inadequate ZIB model, we include only one variable, treatment, from the original data set. There are 6 treatment methods, so PROC GLMMOD generates 6 new variables in the data set. We are interested in whether the treatment method affects the number of surviving whiteflies.
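The indicator coding that PROC GLMMOD provides here can be illustrated with a short Python sketch. It only mimics the idea of turning one nominal variable into one 0/1 column per level; the function name, the column prefix and the toy treatment values are assumptions, and the output is not claimed to match the GLMMOD data set exactly.

```python
def indicator_columns(values, prefix="trt_"):
    """Return column names and one row of 0/1 indicators per observation."""
    levels = sorted(set(values))
    names = [f"{prefix}{i}" for i in range(1, len(levels) + 1)]
    rows = [{name: int(v == level) for name, level in zip(names, levels)}
            for v in values]
    return names, rows

# Hypothetical treatment codes for eight observations (six distinct levels).
treatment = [1, 2, 3, 4, 5, 6, 1, 2]
names, rows = indicator_columns(treatment)
print(names)
print(rows[0])
```

Each observation has exactly one indicator equal to 1, so the resulting columns can be entered into the model formula like ordinary numeric covariates.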

Table 1: The model fitting information for the simulated data, from schedule (6) for the zero-inflation part and schedule (7) for the regular binomial model, based on 149 observations.
Columns: Part of ZIB | Parameter | Parameter estimate | Standard error | df | t-value | P-value | 95% CI Lower | 95% CI Upper
Rows: Zero-inflation: β_0, β_1; Regular Binomial: α_0, α_1 (numeric entries are missing from this copy).

Figure 2: The half-normal plot generated by the macro %halfnormal_zib using the standardized Pearson residual. The graph indicates that the ZIB model fits the data reasonably well.

The application of the macro %halfnormal_zib is also shown in Figure 3. The original treatment variable becomes 6 variables, trt_1, trt_2, trt_3, trt_4, trt_5 and trt_6, where trt_6 is the control group. The parameter estimates are shown in Table 2, from which we can see that the treatments, except the control group, have a significant effect on the number of surviving whiteflies. To compare different treatments, we can use the ESTIMATE statement in PROC NLMIXED; e.g. estimate 'bll_1 = bll_2' bll_1 - bll_2; tests whether the effect of treatment 1 differs from that of treatment 2 in controlling the number of surviving whiteflies. This has to be done in a separate PROC NLMIXED program rather than through the macro.

Although Table 2 provides some interesting information about the data, we still have to check whether the ZIB model fits the data adequately, and Figure 4 gives a negative answer. Many of the points fall outside the boundary of the simulated envelope, indicating that some important variables have not been controlled for in the model; adequate models are considered in Hall's paper.

Figure 3: Program to generate a new data set from the Whitefly data set by using PROC GLMMOD, and the usage of the macro %halfnormal_zib to generate the half-normal plot for diagnosis.

Table 2: The model fitting information for the Whitefly data, which has 640 observations. Only one variable from the original data set, treatment, is included in the model.
Columns: Part of ZIB | Parameter | Parameter estimate | Standard error | df | t-value | P-value | 95% CI Lower | 95% CI Upper
Rows: Zero-inflation: seven bp parameters; Regular Binomial: seven bll parameters (numeric entries are missing from this copy).

DISCUSSION AND SUMMARY

In this article, we have adapted Friendly's macro %halfnormal to zero-inflated binomial regression and developed a new macro, %halfnormal_zib, to generate the half-normal plot for diagnosing a ZIB model. The half-normal plot generated by %halfnormal_zib is based only on the standardized Pearson residual, and the source code is attached in the appendix. You can change it to use another kind of residual and generate the corresponding half-normal plot. Also, the macro gives only the parameter estimates and the half-normal plot; it does not perform comparisons, i.e. it does not use statements such as ESTIMATE or CONTRAST in PROC NLMIXED.

The macro %halfnormal_zib is designed for general application. For the simulated data we set the order option in the AXIS statement. In your own application, you can delete this option and run the macro to find the axis scale for your plot; based on this, you can then reset the order option to produce a good-looking graph.

Figure 4: The half-normal plot generated from the model fitted to the Whitefly data. Many points fall outside the boundary of the envelope, indicating that the ZIB model with only one variable cannot fit the data reasonably well. We can then include more variables in the model or use the mixed-effect ZIB model to fit the data, as shown by D. Hall (2000).

Note also the origin option in the LEGEND statement: on different computers, SAS® places the legend at different locations on the graph, so you may need to make small adjustments for your computer.

References

[1] Atkinson, A.C. (1985). Plots, Transformations and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis. New York: Oxford University Press.
[2] Kemp, C.D. and Kemp, A.W. (1988). Rapid estimation for discrete distributions. The Statistician, 37:
[3] Collett, D. (2003). Modelling Binary Data (2nd ed.). New York: Chapman & Hall/CRC.
[4] Hall, D.B. (2000). Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics, 56:
[5] Friendly, M. (1998),
[6] Neter, J., Kutner, M.H., Nachtsheim, C.J. and Wasserman, W. (1996). Applied Linear Statistical Models (4th ed.). Chicago: Irwin.

ACKNOWLEDGEMENTS

The authors thank Dr. Daniel B. Hall for permission to use the Whitefly data set, and Toby Dunn, whose suggestions were valuable to this paper.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Zhao Yang
Department of Epidemiology & Biostatistics, University of South Carolina
Columbia, SC

Xuezheng Sun
Department of Epidemiology & Biostatistics, University of South Carolina
Columbia, SC

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. This document is generated by LaTeX. Other brand and product names are trademarks of their respective companies.

APPENDIX

This part contains the original code for the macro %halfnormal_zib. The code is shown in several figures, Figure 5 to Figure 13, each with some comments.

Figure 5: Definition of the macro %halfnormal_zib. The parameters of the macro are defined above; we also generate a new data set, ZIBD, for further analysis.

Figure 6: We define three macros: %coefvar, which is used to generate the model formula in PROC NLMIXED; %coefval, which is used to generate the expression in the PARMS statement of PROC NLMIXED; and %nwords, which is used to set up some conditions for using this macro.

Figure 7: Conditions that prevent the user from misusing this macro: the number of coefficients in the model should equal the number of initial values, and should be one more than the number of variables in each part of the ZIB model. The response variable and the trial variable must also be specified.

Figure 8: Using PROC NLMIXED, we fit the ZIB model to the data set. We present the log-likelihood function and the probability function for the model. You can add more options to the PROC NLMIXED statement.

Figure 9: We first generate two data sets from PROC NLMIXED and then calculate the standardized Pearson residual. We then generate a new data set from the fitted ZIB model, which contains 19 replications by default.

Figure 10: Using PROC NLMIXED, we refit the model to each of the 19 generated response variables.

Figure 11: Panel A calculates the residuals from the 19 fitted models; in Panel B, we combine all the generated data sets into a new data set for further processing.

Figure 12: This part also contains Panel A and Panel B; both are extracted from the macro %halfnormal contributed by M. Friendly in

Figure 13: This part generates the half-normal graph for ZIB diagnosis; you can modify the LEGEND, AXIS and SYMBOL statements at your convenience.
