Compare Predicted Counts between Groups of Zero Truncated Poisson Regression Model based on Recycled Predictions Method


Yan Wang 1, Michael Ong 2, Honghu Liu 1,2,3

1 Department of Biostatistics, UCLA School of Public Health, Los Angeles, CA
2 David Geffen School of Medicine at the University of California, Los Angeles, Department of Medicine, Division of General Internal Medicine & Health Services Research, 911 Broxton Ave, 1st Floor, Los Angeles, CA
3 UCLA School of Dentistry, Le Conte Ave, Los Angeles, CA

Abstract

The zero-truncated Poisson (ZTP) regression model is used to model positive count data, where zero is a possible value but is almost impossible to observe because of the nature of the study and its design. ZTP is more accurate than the traditional Poisson regression model for this kind of data. In practice, researchers often need to test the difference in predicted counts between groups under a ZTP regression model. The test result can be misleading if the design is very unbalanced. Combining the ZTP regression model with the recycled predictions method is one way to create an identical covariate structure when comparing the predicted counts between groups. This paper uses a ZTP regression model based on the recycled predictions method to model positive count data and estimates the variance of the difference of the predicted counts by the delta method. Finally, the model and estimation techniques are applied to a real study, Adherence and Efficacy of Protease Inhibitor Therapy (ADEPT).

Keywords: Zero Truncated Poisson (ZTP) regression model, recycled predictions method, variance estimation, delta method

I. Introduction

Regression models for count data are receiving more and more attention [1-3], even though the use of regression models to describe count data is relatively recent [4-6]. Count data are non-negative integer outcomes, such as the number of international conflicts, daily accidents, industrial injuries and so on.
Count data are also common in clinical settings, such as the number of doctor and hospital visits. In these cases, directly applying a standard linear regression model to count outcomes results in inefficient, inconsistent and biased estimates [7]. It is much better to use models specifically designed for count outcomes. The standard Poisson regression model (SPRM) and the standard negative binomial regression model (SNBRM) are the foundation of other modified count models, such as zero-truncated models and zero-inflated models. In this paper we are only interested in zero-truncated Poisson (ZTP) regression models.

When a zero count is a possible value but is missing from the data set, we call the data zero truncated. Zero counts go missing because of the sampling scheme, under which a zero count cannot be observed [4-7]. For example, suppose we study how often people have coffee each month and collect the data in Starbucks. We are then unable to observe zero counts, since every sampled person has had coffee at least once. Shaw (1988) proposed Poisson regression models for the analysis of truncated samples of count data [8].

The recycled predictions method is widely used to balance the data between different groups. It is very common in medical cost-effectiveness analysis [9], where it creates an identical covariate structure. The method first codes everyone as if they were all in the control group and predicts the outcome for each individual. It then codes everyone as if they were all in the treatment group and again predicts the outcome for each individual. The estimated outcomes of the control group and the treatment group are the arithmetic means of these predictions over all individuals in the data set.
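The two recoding passes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_predict` is a made-up stand-in for a fitted model, and the data, the field names `treated` and `age`, and the helper `recycled_predictions` are hypothetical, not taken from the paper.

```python
from statistics import mean

def recycled_predictions(rows, predict, group_key, groups):
    """Average prediction per group, with every row recoded into that group."""
    result = {}
    for g in groups:
        # Recode every individual as if they belonged to group g,
        # leaving all other covariates untouched.
        recoded = [{**row, group_key: g} for row in rows]
        result[g] = mean(predict(r) for r in recoded)
    return result

def toy_predict(row):
    # Hypothetical fitted model: the outcome depends on treatment and age.
    return 1.0 + 0.5 * row["treated"] + 0.02 * row["age"]

# Unbalanced toy data: the single treated subject is much older.
data = [
    {"treated": 0, "age": 30},
    {"treated": 0, "age": 35},
    {"treated": 0, "age": 40},
    {"treated": 1, "age": 60},
]

means = recycled_predictions(data, toy_predict, "treated", [0, 1])
recycled_diff = means[1] - means[0]

# Naive comparison of the observed groups mixes the treatment and age effects.
naive_diff = (mean(toy_predict(r) for r in data if r["treated"] == 1)
              - mean(toy_predict(r) for r in data if r["treated"] == 0))
```

In this toy setting the recycled difference isolates the treatment term of the assumed model, while the naive group difference also absorbs the age imbalance.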
To test whether the values predicted by the above method are significantly different, the essential part is estimating the variance of the difference, which involves not only the variance of the predicted values but also an adjustment for the variability contributed by all the covariates. This paper discusses using the ZTP regression model based on the recycled predictions method to model positive count data and to estimate the variance of the difference of the predicted counts by the delta method.

The idea in this paper comes from a real problem in health services research. We want to test the difference in length of hospital stay among 6 hospitals, adjusting for age, gender and other demographic information. We use ZTP regression to model the positive count of length of stay. The sample sizes are very unbalanced: one hospital has most of the observations and the other 5 hospitals have far fewer. After we use the ZTP regression model based on the recycled predictions method to predict the counts for the different hospitals, we want to test whether the counts differ between any two hospitals. The research question is how to estimate the standard error of the difference between any two predicted counts of the ZTP model based on recycled predictions. The model and the estimator are derived in the methods section. Finally, the estimation method is applied to real data from the Adherence and Efficacy of Protease Inhibitor Therapy (ADEPT) study in the results section.

II. Methods

2.1 Poisson Model and Poisson Regression Model

The density function of the standard Poisson model can be expressed as follows [1],

$$P(Y = y \mid \lambda) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad y = 0, 1, 2, \ldots \tag{1}$$

Here $\lambda$ is the only parameter in model (1); it denotes the rate of occurrence, or the expected number of times an event will occur over a given period of time. The standard Poisson model is fundamental to understanding the Poisson regression models for counts [7]. The count variable of interest is $Y$, a random variable denoting the number of times an event occurs. Then $y = 0, 1, 2, \ldots$ are the possible values of $Y$. The standard Poisson model requires the data to satisfy $E(Y) = \mathrm{Var}(Y) = \lambda$.

In Poisson regression, for the given covariate vector $x_k = (1, x_{k1}, \ldots, x_{kp})^T$ of each subject $k$, where $p$ is the number of covariates, the parameter $\lambda_k$ can be estimated at the subject level as [6] (we use log to denote the natural logarithm),

$$\log(\lambda_k) = x_k^T\beta = \beta_0 + \beta_1 x_{k1} + \cdots + \beta_p x_{kp}. \tag{2}$$

Here $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ is the coefficient vector of the model, which can be estimated by the maximum likelihood method (MLE). The log-likelihood function is

$$\ell(\beta) = \sum_{k}\left[\, y_k\, x_k^T\beta - \exp(x_k^T\beta) - \log(y_k!) \,\right]. \tag{3}$$

In the standard Poisson regression model, the conditional mean of $Y_k$ given $x_k$ is

$$E(Y_k \mid x_k) = \lambda_k = \exp(x_k^T\beta). \tag{4}$$

Here we allow zero counts in the model.

2.2 Zero Truncated Poisson Regression Model

Zero-truncated Poisson (ZTP) regression, introduced by [8], is used to model counts that are always positive. (If zero is an admissible value for the dependent variable, then standard Poisson regression is more appropriate [6].) The sampling scheme is most likely the reason that gives rise to the ZTP model. The density function for ZTP is expressed as (5) (for $Y > 0$), after the zero value of $Y$ has been truncated [2-4]. Here $j$ can be any positive integer that $Y$ takes, and the probability of $Y = j$ is

$$P(Y = j \mid Y > 0) = \frac{e^{-\lambda_i}\lambda_i^{\,j}}{j!\,\left(1 - e^{-\lambda_i}\right)}, \qquad j = 1, 2, \ldots \tag{5}$$
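A quick numerical check can make the truncated density more concrete. The sketch below, which uses only the Python standard library and an arbitrary illustrative rate `lam = 1.3`, rescales the Poisson probabilities by the probability of a positive count, confirms that the truncated probabilities sum to one, and compares the numerically computed mean with the standard closed form $\lambda/(1 - e^{-\lambda})$ for a zero-truncated Poisson variable. The function names are hypothetical.

```python
import math

def poisson_pmf(y, lam):
    """Standard Poisson probability P(Y = y)."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

def ztp_pmf(j, lam):
    """Zero-truncated Poisson probability P(Y = j | Y > 0)."""
    if j < 1:
        return 0.0
    # Divide by P(Y > 0) = 1 - exp(-lam) to renormalise after truncation.
    return poisson_pmf(j, lam) / (1.0 - math.exp(-lam))

lam = 1.3  # illustrative rate, not an estimate from the paper

# The tail beyond j = 60 is negligible for a rate this small.
total = sum(ztp_pmf(j, lam) for j in range(1, 60))
mean_numeric = sum(j * ztp_pmf(j, lam) for j in range(1, 60))
mean_closed = lam / (1.0 - math.exp(-lam))
```

Because the zero class is removed, the truncated mean is always larger than the untruncated rate, which is why fitting an ordinary Poisson model to such data would be misleading.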

Here $\lambda_i$ is the only parameter in the model. We add the index $i$ to $\lambda$ to indicate that different groups, or different design matrices, may have different parameters. We will later extend this to the regression setting. The expected count for each given parameter $\lambda_i$ is

$$E(Y \mid Y > 0) = \frac{\lambda_i}{1 - e^{-\lambda_i}}. \tag{6}$$

The variance based on the model is

$$\mathrm{Var}(Y \mid Y > 0) = \frac{\lambda_i}{1 - e^{-\lambda_i}}\left[1 - \frac{\lambda_i e^{-\lambda_i}}{1 - e^{-\lambda_i}}\right].$$

ZTP therefore differs from standard Poisson regression, as a comparison of (6) and the variance above with (4) shows. Following the Poisson regression model, $\lambda_k$ at the subject level can be estimated by the generalized linear model below [6], where the hat indicates an estimate,

$$\log(\hat\lambda_k) = x_k^T\hat\beta + \mathrm{offset}_k. \tag{7}$$

The offset is taken to be zero when the model includes none. It is a design-based constant, different from the constant (intercept) in the model: it is a variable that is included in the linear predictor without a corresponding coefficient being estimated [4-6], for example the exposure time or the population size used to estimate a rate. The coefficient $\beta$ can again be estimated from the log-likelihood function,

$$\ell(\beta) = \sum_{k}\left[\, y_k\log\lambda_k - \lambda_k - \log(y_k!) - \log\left(1 - e^{-\lambda_k}\right) \right], \qquad \lambda_k = \exp(x_k^T\beta + \mathrm{offset}_k). \tag{8}$$

This log-likelihood function is more complicated than that of the standard Poisson model in (3).

2.3 Recycled Predictions Method

Before we use the recycled predictions method to estimate $\lambda_i$ for each hospital, we first redefine the notation in more detail. We assume that there are $n_i$ subjects from hospital $i$, $i = 1, \ldots, 6$. Therefore, there are $N = \sum_{i=1}^{6} n_i$ observations from the 6 hospitals in total. The parameter $\lambda_{ik}$ can be estimated for each given $x_{ik}$, where $x_{ik}$ is a vector defined as

$$x_{ik} = (1, x_{ik1}, x_{ik2}, \ldots, x_{ikp})^T.$$

We suppose that, in the vector $x_{ik}$, the entries $x_{ik1}, \ldots, x_{ik5}$ are the dummy variables for the 6 hospitals, with hospital 6 as the reference group; i.e. $x_{6k}$ represents the information of subject $k$ in hospital 6, with $x_{6k1}, \ldots, x_{6k5}$ all zero. Then we assume the estimator of the coefficient vector is

$$\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)^T.$$

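Here $\hat\beta_1, \ldots, \hat\beta_5$ are the coefficients for the corresponding set of dummy variables. This will be used to estimate $\lambda_{ik}$ for subject $k$ in hospital $i$ for given $x_{ik}$, where $i$ is from 1 to 6 and $k$ is from 1 to $n_i$, as in (9),

$$\hat\lambda_{ik} = \exp\left(x_{ik}^T\hat\beta + \mathrm{offset}_{ik}\right). \tag{9}$$

Here $\mathrm{offset}_{ik}$ is the design-based constant at the subject level, related to the data collection.

Now we can plug in the recycled predictions method, in which we vary the characteristic of interest across the whole data set to create an identical covariate structure for the different hospitals, and then average the predictions across the whole data set to estimate the parameter at the hospital level. For example, if we assume all the observations are from hospital 1, we set the dummy variables as $x_{ik1} = 1$ and $x_{ik2} = \cdots = x_{ik5} = 0$, where $i$ is from 1 to 6 and $k$ is from 1 to $n_i$, across the whole data set. Then we calculate the average of the estimated $\hat\lambda_{ik}$ at the hospital level over all $N$ observations in the data set as

$$\hat{\bar\lambda}_1 = \frac{1}{N}\sum_{i=1}^{6}\sum_{k=1}^{n_i} \hat\lambda_{ik}\Big|_{x_{ik1} = 1,\; x_{ik2} = \cdots = x_{ik5} = 0}.$$

With the same technique, we can calculate the estimators for the other hospitals. The reference hospital can be estimated by setting all dummy variables to zero. We note that $\hat{\bar\lambda}_i$ is estimated from all the observations in the data set and has nothing to do with which hospital each observation originally belongs to. It is estimated by resetting all the dummy variables.

To simplify the notation, we will use $k$ to denote all the subjects in the data set, where $k = 1, \ldots, N$. At the person level, each $\lambda_k^{(i)}$ is estimated from $x_k^{(i)}$, where $x_k^{(i)}$ comes from the information $x_k$ with only the dummy variables for hospitals changed. This is because the recycled predictions method for hospital $i$ only involves the change of $x_{k1}, \ldots, x_{k5}$. We can simplify the notation of the estimation as below,

$$\hat\lambda_k^{(i)} = \exp\left(x_k^{(i)T}\hat\beta + \mathrm{offset}_k\right). \tag{10}$$

For each hospital $i$, it is only related to the information $x_k$ by setting the corresponding set of dummy variables as above. We use $x_k^{(i)}$ to denote that, for each subject $k$ in the data, we have a different set of dummy variables to estimate $\lambda_k^{(i)}$. Then the mean of $\hat\lambda_k^{(i)}$ can be calculated as

$$\hat{\bar\lambda}_i = \frac{1}{N}\sum_{k=1}^{N}\hat\lambda_k^{(i)}. \tag{11}$$

The variance is estimated by the delta method, considering $\hat{\bar\lambda}_i$ as a function of $\hat\beta$,

$$\widehat{\mathrm{Var}}\left(\hat{\bar\lambda}_i\right) = D_i^T\,\hat V(\hat\beta)\,D_i, \qquad D_i = \frac{\partial\hat{\bar\lambda}_i}{\partial\hat\beta} = \frac{1}{N}\sum_{k=1}^{N}\hat\lambda_k^{(i)}\,x_k^{(i)}. \tag{12}$$

Here $\mathrm{offset}_k$ is a constant. The variance-covariance matrix $\hat V(\hat\beta)$ of the coefficients is given by Stata in e(V) after the zero-truncated Poisson command ztp. $x_k^{(i)}$ is a vector, $\hat\lambda_k^{(i)}$ is a scalar, and the coefficient $\hat\beta$ is the same as before. The bootstrap may be an option for estimating the variance, but the calculation in (12) based on the delta method takes the variance of the other covariates into account.

2.4 Estimate the variance of the difference

We now derive the estimator of the variance of the difference. In the recycled predictions method, which is used to remove the effects of other variables, the estimated parameters $\hat{\bar\lambda}_i$ and $\hat{\bar\lambda}_j$ for two hospitals ($i \neq j$) can be expressed as follows,

$$\hat{\bar\lambda}_i = \frac{1}{N}\sum_{k=1}^{N}\exp\left(x_k^{(i)T}\hat\beta + \mathrm{offset}_k\right), \qquad \hat{\bar\lambda}_j = \frac{1}{N}\sum_{k=1}^{N}\exp\left(x_k^{(j)T}\hat\beta + \mathrm{offset}_k\right). \tag{13}$$

From (11), we use the fact that $x_k^{(i)T}\hat\beta - x_k^{(j)T}\hat\beta = \hat\beta_i - \hat\beta_j$, where $x_k$ is a vector as before, and $\hat\beta_i$ and $\hat\beta_j$ are the estimated coefficients in the model corresponding to the two different sets of dummy variables. For example, if we want to estimate the log-scale difference between hospital 1 and hospital 2, the only difference is the set of dummy variables. For hospital 1 we set $x_{k1} = 1$ and all others to zero, while for hospital 2 we set $x_{k2} = 1$ and all others to zero. Then, considering the above fact, we have

$$\log\hat\lambda_k^{(1)} - \log\hat\lambda_k^{(2)} = \hat\beta_1 - \hat\beta_2.$$

If we want to compare hospital 1 with the reference hospital 6 (by setting all dummy variables to zero), then the difference will be

$$\log\hat\lambda_k^{(1)} - \log\hat\lambda_k^{(6)} = \hat\beta_1.$$

Therefore, the only difference between the two groups is related to the corresponding dummy group coefficient(s). We have the following relationship between $\hat{\bar\lambda}_i$ and $\hat{\bar\lambda}_j$,

$$\hat{\bar\lambda}_i = c_{ij}\,\hat{\bar\lambda}_j, \qquad c_{ij} = \exp\left(\hat\beta_i - \hat\beta_j\right), \tag{14}$$

where $c_{ij}$ is a constant for any given model, and the dummy coefficient of the reference hospital is taken as zero. When we estimate the covariance between $\hat{\bar\lambda}_i$ and $\hat{\bar\lambda}_j$, we will need to take the estimation of $c_{ij}$ into account.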
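As an illustration of the recycled averages in (10) and (11), their first-order (delta-method) variance, and the ratio relationship in (14), here is a minimal Python sketch. Everything numeric in it is assumed for the example: three hospitals instead of six (two dummies, hospital 3 as reference), one extra covariate, no offset, and hypothetical values for `beta_hat` and its covariance matrix `V_hat`. The helper names are not from the paper, and the gradient-times-covariance form is the generic delta-method approximation rather than a confirmed transcription of the authors' formula.

```python
import math

def rate(x, beta, offset=0.0):
    """Subject-level rate exp(x'beta + offset)."""
    return math.exp(sum(xi * bi for xi, bi in zip(x, beta)) + offset)

def recode(x, hospital, n_dummies):
    """Copy of x = [1, d_1..d_m, other covariates] with the dummies reset.

    A hospital number above n_dummies yields all-zero dummies (reference).
    """
    dummies = [1.0 if h == hospital else 0.0 for h in range(1, n_dummies + 1)]
    return [x[0]] + dummies + list(x[1 + n_dummies:])

def recycled_mean_and_gradient(X, beta, hospital, n_dummies):
    """Mean rate with all subjects recoded to `hospital`, and its gradient in beta."""
    n = len(X)
    mean_rate = 0.0
    grad = [0.0] * len(beta)
    for x in X:
        xr = recode(x, hospital, n_dummies)
        r = rate(xr, beta)
        mean_rate += r / n
        for p, xp in enumerate(xr):
            grad[p] += r * xp / n  # d/d(beta_p) of exp(x'beta) is rate * x_p
    return mean_rate, grad

def quad_form(a, V, b):
    """a' V b for plain Python lists."""
    return sum(a[r] * V[r][c] * b[c] for r in range(len(a)) for c in range(len(b)))

# Hypothetical estimates: intercept, two hospital dummies, centred age.
beta_hat = [0.2, 0.3, -0.1, 0.01]
V_hat = [[0.04, 0.0, 0.0, 0.0],
         [0.0, 0.05, 0.01, 0.0],
         [0.0, 0.01, 0.06, 0.0],
         [0.0, 0.0, 0.0, 0.0001]]
# Rows are [1, d1, d2, age]; the original hospital of each row is irrelevant
# once the dummies are reset.
X = [[1, 1, 0, 5.0], [1, 0, 0, -3.0], [1, 0, 0, 1.0], [1, 0, 1, 2.0], [1, 0, 0, -4.0]]

m1, g1 = recycled_mean_and_gradient(X, beta_hat, 1, 2)
m2, g2 = recycled_mean_and_gradient(X, beta_hat, 2, 2)
var_m1 = quad_form(g1, V_hat, g1)   # delta-method variance of the hospital-1 mean
cov_m12 = quad_form(g1, V_hat, g2)  # covariance between the two recycled means
ratio = m1 / m2
```

Because both averages run over the same subjects and differ only in the dummy coding, `ratio` depends only on the two dummy coefficients, which is the point of (14).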

Next, we derive the covariance of $\hat{\bar\lambda}_i$ and $\hat{\bar\lambda}_j$. It would be too complicated to calculate the variance directly. Therefore, we consider the Taylor expansion of (10) around the true coefficient $\beta$,

$$\hat{\bar\lambda}_i = \bar\lambda_i + D_i^T\left(\hat\beta - \beta\right) + \cdots$$

Here $D_i = \partial\bar\lambda_i/\partial\beta = \frac{1}{N}\sum_{k}\lambda_k^{(i)}x_k^{(i)}$, where $x_k^{(i)}$ is the covariate vector of subject $k$ with the hospital dummies set to hospital $i$ and $\lambda_k^{(i)}$ is the corresponding rate from (10). We keep only the linear part of the expansion, and then we have the estimate of the covariance as

$$\widehat{\mathrm{Cov}}\left(\hat{\bar\lambda}_i, \hat{\bar\lambda}_j\right) = D_i^T\,\hat V(\hat\beta)\,D_j,$$

where $\hat V(\hat\beta)$ is the estimated variance-covariance matrix of the coefficients. Finally, the variance-covariance matrix for $\hat{\bar\lambda}_i$ and $\hat{\bar\lambda}_j$ is

$$\hat\Sigma_{ij} = \begin{pmatrix} D_i^T\hat V D_i & D_i^T\hat V D_j \\ D_j^T\hat V D_i & D_j^T\hat V D_j \end{pmatrix}. \tag{15}$$

Again, we calculate the variance of the predicted count $\hat\mu_i$ by the delta method. If $\hat\mu_i = g(\hat{\bar\lambda}_i) = \hat{\bar\lambda}_i / (1 - e^{-\hat{\bar\lambda}_i})$, as in (6), we have the derivative $g'(\lambda) = \dfrac{1 - e^{-\lambda} - \lambda e^{-\lambda}}{(1 - e^{-\lambda})^2}$. Therefore, the variance of $\hat\mu_i$ can be expressed as

$$\widehat{\mathrm{Var}}\left(\hat\mu_i\right) = \left[g'(\hat{\bar\lambda}_i)\right]^2 D_i^T\hat V D_i.$$

This can be used further for testing whether the two predicted values are equal.

To derive the variance of the difference between two predicted values, we can calculate directly from (10) and (13). We define

$$\hat d_{ij} = \hat\mu_i - \hat\mu_j = g(\hat{\bar\lambda}_i) - g(\hat{\bar\lambda}_j); \tag{16}$$

then, by the relationship between $\hat{\bar\lambda}_i$ and $\hat{\bar\lambda}_j$ in (14), and considering (12), the derivative can be expressed as

$$G_{ij} = \left(\frac{\partial \hat d_{ij}}{\partial\hat{\bar\lambda}_i},\; \frac{\partial \hat d_{ij}}{\partial\hat{\bar\lambda}_j}\right)^T = \left(g'(\hat{\bar\lambda}_i),\; -g'(\hat{\bar\lambda}_j)\right)^T. \tag{17}$$

Then, by the delta method, we finally have the variance of the difference between group $i$ and group $j$ as

$$\widehat{\mathrm{Var}}\left(\hat d_{ij}\right) = G_{ij}^T\,\hat\Sigma_{ij}\,G_{ij}. \tag{18}$$

Substituting the values from (11), (15) and (17), we obtain the standard error of the difference of the predicted values, which is the square root of (18). We can further use a test based on this standard error to test whether the two predicted values are equal.

III. Results

We use the Adherence and Efficacy of Protease Inhibitor Therapy (ADEPT) study as a real data example to apply the above theory. It is a prospective, observational investigation of medication adherence among HIV-infected patients starting a new Highly Active Antiretroviral Treatment (HAART) regimen [12-13], conducted from February 1998 through April. At study weeks 0, 8, 24 and 48, each patient was asked the names of their antiretroviral (ARV) medications. Patients visited the study nurses every 4 weeks for measurement of medication adherence. At each visit, any change in their medications was noted. We model the number of combinations of ARV drugs for each patient in the study. Given the design of the study, every patient has at least one drug combination in the ADEPT data. A Medication Event Monitoring System (MEMS) bottle cap recorded the date and time of each opening of the pill bottle.

In the simplest case, we consider 2 covariates, age and gender. Of the 116 patients, 8 had 3 drugs, 26 had 2 drugs and the other 82 had only one drug. There are 23 (around 20.0%) females in the data set. The average age of these patients is 37.2 years (standard deviation 8.1), with a minimum of 20.3.

First we model the data with the classical zero-truncated Poisson model. The Stata output for the number of ARV drugs based on the model, and the predicted mean number of drugs by gender, are shown in Table 1.

Table 1. Zero Truncated Poisson Model without Using Recycled Predictions

(a) Standard ZTP model

Zero-truncated Poisson regression        Number of obs = 116
                                         LR chi2(2)    =
                                         Prob > chi2   =
Log likelihood =                         Pseudo R2     =

drug     Coef.    Std. Err.    z    P>|z|    [95% Conf. Interval]
age
sex
_cons

(b) Predictions based on standard ZTP model

-> sex = 0
Variable    Obs    Mean    Std. Dev.    Min    Max
drug_ztp

-> sex = 1
Variable    Obs    Mean    Std. Dev.    Min    Max
drug_ztp
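Before the recycled predictions are applied to these data, the delta-method computation of Section 2.4 can be sketched numerically. The Python below is a hedged illustration, not the authors' code: it assumes the predicted count is the zero-truncated mean evaluated at the averaged rate, takes the averaged rates, their gradients with respect to the coefficients, and the coefficient covariance matrix as given, and uses a normal approximation for the test. All input values and function names are hypothetical.

```python
import math

def ztp_mean(lam):
    """Expected count of a zero-truncated Poisson with rate lam."""
    return lam / (1.0 - math.exp(-lam))

def ztp_mean_deriv(lam):
    """Derivative of ztp_mean with respect to lam."""
    e = math.exp(-lam)
    return (1.0 - e - lam * e) / (1.0 - e) ** 2

def quad_form(a, V, b):
    """a' V b for plain Python lists."""
    return sum(a[r] * V[r][c] * b[c] for r in range(len(a)) for c in range(len(b)))

def diff_test(lam_i, lam_j, D_i, D_j, V):
    """Difference of predicted counts, its delta-method SE, z and two-sided p."""
    # 2 x 2 covariance matrix of the two averaged rates.
    s_ii = quad_form(D_i, V, D_i)
    s_ij = quad_form(D_i, V, D_j)
    s_jj = quad_form(D_j, V, D_j)
    # Gradient of the difference with respect to the two averaged rates.
    g_i = ztp_mean_deriv(lam_i)
    g_j = -ztp_mean_deriv(lam_j)
    var_diff = g_i * g_i * s_ii + 2.0 * g_i * g_j * s_ij + g_j * g_j * s_jj
    diff = ztp_mean(lam_i) - ztp_mean(lam_j)
    se = math.sqrt(var_diff)
    z = diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return diff, se, z, p_value

# Hypothetical inputs: averaged rates, their gradients in beta, and Var(beta_hat).
V_hat = [[0.04, 0.0, 0.0, 0.0],
         [0.0, 0.05, 0.0, 0.0],
         [0.0, 0.0, 0.06, 0.0],
         [0.0, 0.0, 0.0, 0.0001]]
D_1 = [1.2, 1.2, 0.0, 0.5]
D_2 = [0.9, 0.0, 0.9, 0.4]

diff, se, z, p_value = diff_test(1.2, 0.9, D_1, D_2, V_hat)
```

The covariance term matters here: the two predictions share the same coefficients and the same subjects, so treating them as independent would misstate the standard error.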

Now we apply the recycled predictions method to modify the predictions. We treat all 116 patients in the data as male and as female in turn to predict the mean number of drugs. In this case, we predict the outcome adjusted for age based on the ZTP model in Table 1. The recycled predictions are shown in Table 2.

Table 2. Predictions based on the recycled predictions method

Variable       Obs    Mean    Std. Dev.    Min    Max
nl_drug_gen
nl_drug_gen

Comparing Table 1 (b) and Table 2, the predictions are modified by the recycled predictions. Finally, to test the difference between the above correlated predictions, we use the variance estimation method from the methods section. Based on the estimated standard error of the difference, we do not have strong evidence (p-value = 0.52) to reject the null hypothesis that the predicted numbers of ARV drug combinations for males and females are the same, based on ZTP models with the recycled predictions method adjusted for age.

IV. Discussion

In this paper, the zero truncation of the counts is the key feature of the model. As [11] mentioned, the interpretation of coefficients in the truncated model is always more complicated than for the standard model. When the truncation exists only in the sample (the zero counts cannot be observed), i.e. the population follows the standard Poisson model without zero truncation, the coefficients have the usual interpretation; just consider the following, (19) The simple multiplicative connection exists for the two models. However, if the absence of zero counts is not due to the sampling scheme but arises because a zero count is truly impossible, then the simple connection is no longer correct.

The adverse effects of overdispersion are worse with truncated models [7]. When the sample is not truncated, using SPRM in the presence of overdispersion does not bias the estimated coefficients.
But if we use the ZTP model on overdispersed data, the estimated coefficients will be biased and inconsistent, which in turn leads to biased estimates of the predicted counts. As suggested by [7], before using the ZTP regression model we must check for overdispersion by fitting a zero-truncated negative binomial (ZTNB) model and applying a likelihood-ratio test.
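The likelihood-ratio check can be sketched as follows, assuming the maximized log-likelihoods of the two fitted models are already available. The numbers passed in are hypothetical, and the function name is illustrative. The halving of the p-value reflects the fact that the null value of the ZTNB dispersion parameter (zero) lies on the boundary of its parameter space, so the reference distribution is an equal mixture of a point mass at zero and a chi-square with one degree of freedom.

```python
import math

def lr_test_overdispersion(loglik_ztp, loglik_ztnb):
    """LR test of ZTP (no overdispersion) against ZTNB.

    Returns the LR statistic and the boundary-corrected p-value.
    """
    stat = max(0.0, 2.0 * (loglik_ztnb - loglik_ztp))
    if stat == 0.0:
        return stat, 1.0
    # Upper tail of a chi-square with 1 df: P(X > s) = erfc(sqrt(s / 2)).
    p_chi2 = math.erfc(math.sqrt(stat / 2.0))
    # Dispersion = 0 is on the boundary, so the tail probability is halved.
    return stat, 0.5 * p_chi2

# Hypothetical log-likelihoods from fitting both models to the same data.
stat, p_value = lr_test_overdispersion(-100.0, -98.0)
```

A small p-value would point toward the ZTNB model; a large one gives no reason to abandon the ZTP model on overdispersion grounds.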

References

1. Dobson AJ, Dobson A. An Introduction to Generalized Linear Models. 2nd ed. Chapman & Hall/CRC.
2. Grogger JT, Carson RT. Models for Truncated Counts. Journal of Applied Econometrics. 1991;6(3).
3. Springael J, Van Nieuwenhuyse I. On the sum of independent zero-truncated Poisson random variables. University of Antwerp, Faculty of Applied Economics.
4. Hardin JW, Hilbe JM. Generalized Linear Models and Extensions. 2nd ed. Stata Press.
5. Wedel M, Desarbo WS, Bult JR, Ramaswamy V. A Latent Class Poisson Regression Model for Heterogeneous Count Data. Journal of Applied Econometrics. 1993;8(4).
6. Stata Corporation. Stata Base Reference Manual, Volume 3: Q-Z, Release 10. Stata.
7. Long JS, Freese J. Regression Models for Categorical Dependent Variables Using Stata. 2nd ed. Stata Press.
8. Shaw D. On-site samples' regression: problems of non-negative integers, truncation, and endogenous stratification. Journal of Econometrics. 1988;37.
9. Basu A, Meltzer D. Implications of spillover effects within the family for medical cost-effectiveness analysis. Journal of Health Economics. 2005;24(4).
10. Glick H, Doshi JA, Sonnad SS, Polsky D. Economic Evaluation in Clinical Trials. Oxford University Press; 2007.
11. Simonoff JS. Analyzing Categorical Data. Springer; 2003.
12. Miller LG, Liu H, Hays RD, et al. Knowledge of antiretroviral regimen dosing and adherence: a longitudinal study. Clin. Infect. Dis. 2003;36(4).
13. Golin CE, Liu H, Hays RD, et al. A prospective study of predictors of adherence to combination antiretroviral medication. J Gen Intern Med. 2002;17(10).
