Compare Predicted Counts between Groups of Zero Truncated Poisson Regression Model based on Recycled Predictions Method

Yan Wang 1, Michael Ong 2, Honghu Liu 1,2,3

1 Department of Biostatistics, UCLA School of Public Health, Los Angeles, CA 90095-1772.
2 David Geffen School of Medicine at the University of California, Los Angeles, Department of Medicine, Division of General Internal Medicine & Health Services Research, 911 Broxton Ave, 1st Floor, Los Angeles, CA 90024.
3 UCLA School of Dentistry, 10833 Le Conte Ave, Los Angeles, CA 90095-1668

Abstract

The Zero Truncated Poisson (ZTP) regression model is used to model positive count data, where zero is a possible value but can almost never be observed because of the nature of the study and its design. For this kind of data, ZTP is more accurate than the traditional Poisson regression model. In practice, researchers often need to test the difference in predicted counts between groups under a ZTP regression model. The test result can be misleading if the design is very unbalanced. Combining the ZTP regression model with the recycled predictions method is one way to create an identical covariate structure when comparing the predicted counts between groups. This paper uses the ZTP regression model with the recycled predictions method to model positive count data and estimates the variance of the difference of the predicted counts by the delta method. Finally, the model and estimation techniques are applied to a real study, Adherence and Efficacy of Protease Inhibitor Therapy (ADEPT).

Keywords: Zero Truncated Poisson (ZTP) regression model, recycled predictions method, variance estimation, delta method

I. Introduction

Regression models for count data are receiving more and more attention [1-3], even though the use of regression models to describe count data is relatively recent [4-6].
Count data are non-negative integer outcomes, such as the number of international conflicts, daily accidents, industrial injuries and so on. They are also common in clinical settings, such as the number of doctor and hospital visits. In these cases, directly applying a standard linear regression model to count outcomes results in inefficient, inconsistent and biased estimates [7]. It is much better to use models specifically designed for count outcomes. The standard Poisson regression model (SPRM) and the standard negative binomial regression model (SNBRM) are the foundation of other modified count models, such as zero truncated models and zero inflated models. This paper is concerned only with zero truncated Poisson (ZTP) regression models.

When a zero count is a possible value but is missing from the data set, we call the data zero truncated. Zero counts are missing because of the sampling scheme, under which a zero count cannot be observed [4-7]. For example, suppose we study how often people have coffee each month and collect the data in Starbucks. Then we cannot observe zero counts, since every sampled person has had coffee at least once. Shaw (1988) proposed Poisson regression models for the analysis of truncated samples of count data [8].

The recycled predictions method is widely used to balance the data between groups by creating an identical covariate structure. It is very common in medical cost-effectiveness analysis [9]. The method first codes everyone as if they were all in the control group and predicts the outcome for each individual. It then codes everyone as if they were all in the treatment group and again predicts the outcome for each individual. The estimated outcomes for the control group and the treatment group are the arithmetic means of these predictions over all individuals in the data set.
To test whether the values predicted by the above method are significantly different, the essential step is to estimate the variance of the difference, which involves not only the variance of the predicted values but also an adjustment for the variance of all the covariates. This paper describes how to use the ZTP regression model with the recycled predictions method to model positive count data and how to estimate the variance of the difference of the predicted counts by the delta method.

The idea comes from a real problem in health services research. We want to test the difference in length of hospital stay among 6 hospitals, adjusting for age, gender and other demographic information. We use ZTP regression to model the positive count, length of stay. The sample is very unbalanced: one hospital has most of the observations and the other 5 hospitals have far fewer. After using the ZTP regression model with recycled predictions to predict the counts for the different hospitals, we want to test whether the counts differ between any two hospitals. The research question is how to estimate the standard error of the difference between any two predicted counts from a ZTP model based on recycled predictions. The model and the estimator are derived in the methods section. Finally, the estimation method is applied to real data from the Adherence and Efficacy of Protease Inhibitor Therapy (ADEPT) study in the results section.
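The recycled predictions procedure described in the introduction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log-linear model, its coefficients (`B0`, `B_TREAT`, `B_AGE`) and the six toy observations are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical fitted model on the log scale: intercept, treatment effect, age effect
B0, B_TREAT, B_AGE = 0.1, 0.3, 0.02


def predict(treated: int, age: float) -> float:
    """Predicted outcome for one individual under the hypothetical model."""
    return math.exp(B0 + B_TREAT * treated + B_AGE * age)


# Toy unbalanced data: (observed group, age); treated subjects are older on average
data = [(0, 30.0), (0, 35.0), (0, 32.0), (0, 28.0), (1, 60.0), (1, 65.0)]


def recycled_prediction(treated: int) -> float:
    """Code everyone as `treated`, predict for each individual, then average."""
    return sum(predict(treated, age) for _, age in data) / len(data)


def observed_group_mean(treated: int) -> float:
    """Average prediction using only the individuals actually in that group."""
    preds = [predict(g, age) for g, age in data if g == treated]
    return sum(preds) / len(preds)


recycled_ratio = recycled_prediction(1) / recycled_prediction(0)
observed_ratio = observed_group_mean(1) / observed_group_mean(0)
print(recycled_ratio, observed_ratio)
```

Because both recycled means are computed over the same individuals, their ratio reflects only the group effect, whereas the ratio of the observed group means also picks up the age imbalance between the groups.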

II. Methods

2.1 Poisson Model and Poisson Regression Model

The density function of the standard Poisson model can be expressed as [1]

$$P(Y = y) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad y = 0, 1, 2, \ldots \tag{1}$$

Here $\lambda$ is the only parameter in model (1); it denotes the rate of occurrence, or the expected number of times an event occurs over a given period of time. The standard Poisson model is fundamental to understanding the Poisson regression models for counts [7]. The count variable of interest is $Y$, a random variable denoting the number of times an event occurs, and $y = 0, 1, 2, \ldots$ are the possible values of $Y$. The standard Poisson model requires the data to satisfy

$$E(Y) = \mathrm{Var}(Y) = \lambda$$

In Poisson regression, for the given covariate vector $x_i = (1, x_{i1}, \ldots, x_{ip})$ of each subject $i$, where $p$ is the number of covariates, the parameter $\lambda_i$ can be estimated at the subject level as [6] (we use log to denote the natural logarithm)

$$\log(\lambda_i) = x_i\beta \tag{2}$$

Here $\beta$ is the coefficient vector of the model, which can be estimated by maximum likelihood (MLE). The log likelihood function is

$$\ell(\beta) = \sum_{i}\left[\, y_i x_i\beta - \exp(x_i\beta) - \log(y_i!) \,\right] \tag{3}$$

In the standard Poisson regression model, the conditional mean of $Y_i$ is given by

$$E(Y_i \mid x_i) = \lambda_i = \exp(x_i\beta) \tag{4}$$

Here we allow zero counts in the model.

2.2 Zero Truncated Poisson Regression Model

Zero-truncated Poisson (ZTP) regression, introduced by [8], is used to model counts that are always positive. (If zero is an admissible value for the dependent variable, then standard Poisson regression is more appropriate [6].) The sampling scheme is most often the reason that gives rise to the ZTP model. After the zero value of $Y_i$ is truncated, the density function for ZTP is given by (5) [2-4]; here $j$ can be any positive integer, and the probability that $Y_i = j$ given $Y_i > 0$ is

$$P(Y_i = j \mid Y_i > 0) = \frac{e^{-\lambda_i}\lambda_i^{j}}{j!\,\left(1 - e^{-\lambda_i}\right)}, \qquad j = 1, 2, \ldots \tag{5}$$
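As a quick numerical illustration of how the zero-truncated density (5) relates to the standard Poisson density (1), the following Python sketch rescales the Poisson probabilities by $1 - e^{-\lambda}$. The function names and the value $\lambda = 0.7$ are illustrative assumptions, not part of the original analysis.

```python
import math


def poisson_pmf(y: int, lam: float) -> float:
    """Standard Poisson probability P(Y = y), as in equation (1)."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def ztp_pmf(j: int, lam: float) -> float:
    """Zero-truncated Poisson probability P(Y = j | Y > 0), as in equation (5)."""
    if j < 1:
        return 0.0  # zero (and negative) counts cannot occur after truncation
    return poisson_pmf(j, lam) / (1.0 - math.exp(-lam))


lam = 0.7  # hypothetical rate chosen only for illustration
# The truncated probabilities still sum to one over j = 1, 2, ...
total = sum(ztp_pmf(j, lam) for j in range(1, 60))
# Mean of the truncated distribution, computed directly from the probabilities
mean = sum(j * ztp_pmf(j, lam) for j in range(1, 60))
print(round(total, 6), round(mean, 6))
```

The truncated mean is larger than $\lambda$ because the probability mass that the Poisson model places on zero is redistributed over the positive counts.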

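In (5), $\lambda_i$ is the only parameter in the model. We add the index $i$ to $\lambda$ to indicate that different groups or different design matrices may have different parameters. This will later be extended to the regression setting. The expected count for each given parameter is

$$E(Y_i \mid Y_i > 0) = \frac{\lambda_i}{1 - e^{-\lambda_i}} \tag{6}$$

The variance based on the model is

$$\mathrm{Var}(Y_i \mid Y_i > 0) = \frac{\lambda_i}{1 - e^{-\lambda_i}}\left[1 - \frac{\lambda_i e^{-\lambda_i}}{1 - e^{-\lambda_i}}\right]$$

ZTP differs from standard Poisson regression, as can be seen by comparing (6) and the variance above with (4). Following the Poisson regression model, $\lambda_i$ at the subject level can be estimated by the generalized linear model below [6], where a hat denotes an estimate:

$$\log(\hat\lambda_i) = x_i\hat\beta + \mathrm{offset}_i \tag{7}$$

The offset is a design-based constant, different from the constant (intercept) in the model; when no offset is specified, $\mathrm{offset}_i = 0$. It is a variable that is included in the linear predictor without a corresponding coefficient being estimated [4-6], for example the exposure time or the population size used to estimate a rate. The coefficient $\beta$ is again estimated from the log likelihood function

$$\ell(\beta) = \sum_{i}\left[\, y_i\,(x_i\beta + \mathrm{offset}_i) - \exp(x_i\beta + \mathrm{offset}_i) - \log(y_i!) - \log\!\left(1 - \exp\{-\exp(x_i\beta + \mathrm{offset}_i)\}\right) \right] \tag{8}$$

The log likelihood function (8) is more complicated than that of the standard Poisson model in (3).

2.3 Recycled Predictions Method

Before applying the recycled predictions method to estimate $\lambda$ for each hospital, we define the notation in more detail. We assume that there are $n_i$ subjects from hospital $i$, $i = 1, \ldots, 6$. Therefore, there are $N = \sum_{i=1}^{6} n_i$ observations from the 6 hospitals in total. The parameter $\lambda_{ik}$ can be estimated for each given $x_{ik}$, where $x_{ik}$ is the vector

$$x_{ik} = (1,\; d_{ik1}, \ldots, d_{ik5},\; z_{ik1}, \ldots, z_{ikq})$$

In this vector, $d_{ik1}, \ldots, d_{ik5}$ are the dummy variables for the 6 hospitals, with hospital 6 as the reference group, and $z_{ik1}, \ldots, z_{ikq}$ are the remaining covariates; i.e. $x_{6k}$ represents the information of subject $k$ in hospital 6, for which all of $d_{6k1}, \ldots, d_{6k5}$ are zero. We assume the estimator of the coefficient vector is

$$\hat\beta = (\hat\beta_0,\; \hat\alpha_1, \ldots, \hat\alpha_5,\; \hat\gamma_1, \ldots, \hat\gamma_q)'$$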

Here $\hat\alpha_1, \ldots, \hat\alpha_5$ are the coefficients of the corresponding hospital dummy variables $d_{ik1}, \ldots, d_{ik5}$, while $\hat\beta_0$ and $\hat\gamma$ are the intercept and the coefficients of the remaining covariates $z_{ik}$. They are used to estimate $\lambda_{ik}$ for subject $k$ in hospital $i$ given $x_{ik}$, where $i$ runs from 1 to 6 and $k$ from 1 to $n_i$, as in (9):

$$\hat\lambda_{ik} = \exp(x_{ik}\hat\beta + \mathrm{offset}_{ik}) \tag{9}$$

Here $\mathrm{offset}_{ik}$ is the design-based constant at the subject level, related to the data collection.

Now we can apply the recycled predictions method, in which we vary the characteristic of interest across the whole data set to create an identical covariate structure for the different hospitals, and then average the predictions across the whole data set to estimate the parameter at the hospital level. For example, we assume all the observations are from hospital 1 and set the dummy variables to $d_{ik1} = 1$ and $d_{ik2} = \cdots = d_{ik5} = 0$, for $i$ from 1 to 6 and $k$ from 1 to $n_i$, across the whole data set. We then calculate the average of the estimated $\lambda$ at the hospital level over all $N$ observations in the data set,

$$\bar\lambda_1 = \frac{1}{N}\sum_{i=1}^{6}\sum_{k=1}^{n_i}\exp\!\left(\hat\beta_0 + \hat\alpha_1 + z_{ik}\hat\gamma + \mathrm{offset}_{ik}\right)$$

With the same technique, we can calculate the estimators for the other hospitals. The reference hospital is estimated by setting all dummy variables to zero. Note that $\bar\lambda_i$ is estimated from all the observations in the data set and does not depend on which hospital each observation originally belongs to. It is estimated by resetting all the dummy variables.

To simplify the notation, we use $k$ to index all the subjects in the data set, where $k = 1, \ldots, N$. At the person level, each $\lambda_k^{(i)}$ is estimated from $x_k^{(i)}$, which is the information $x_k$ of subject $k$ with only the hospital dummy variables changed. This is because the recycled predictions method for hospital $i$ only involves changing the dummy variables. We can simplify the notation of the estimate as

$$\hat\lambda_k^{(i)} = \exp\!\left(x_k^{(i)}\hat\beta + \mathrm{offset}_k\right) \tag{10}$$

For each hospital $i$, (10) depends on the information $x_k$ only through the corresponding set of dummy variables. We use $x_k^{(i)}$ to denote that, for each subject $k$ in the data, a different set of dummy variables is used to estimate $\lambda_k^{(i)}$. The mean of $\hat\lambda_k^{(i)}$ can then be calculated as

$$\bar\lambda_i = \frac{1}{N}\sum_{k=1}^{N}\hat\lambda_k^{(i)} \tag{11}$$

The variance of $\bar\lambda_i$ is estimated by the delta method, treating $\bar\lambda_i$ as a function of $\hat\beta$, as given in (12).

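By the delta method,

$$\mathrm{Var}(\bar\lambda_i) \approx g_i'\,\hat V\,g_i, \qquad g_i = \frac{\partial\bar\lambda_i}{\partial\hat\beta} = \frac{1}{N}\sum_{k=1}^{N}\hat\lambda_k^{(i)}\,x_k^{(i)\prime} \tag{12}$$

Here $1/N$ is a constant. The variance-covariance matrix $\hat V$ of the coefficients is returned by Stata in e(V) after the zero-truncated Poisson regression command ztp. $x_k^{(i)}$ is a vector and $\hat\lambda_k^{(i)}$ is a scalar, and the coefficient vector $\hat\beta$ is the same as before. The bootstrap is another option for estimating the variance, but the calculation in (12) based on the delta method takes the variance of the other covariates into account.

2.4 Estimate the variance of the difference

We now derive the estimator of the variance of the difference. In the recycled predictions method, which is used to remove the effects of the other variables, the estimated parameters $\hat\lambda_k^{(i)}$ and $\hat\lambda_k^{(j)}$ for two hospitals ($i \neq j$) can be expressed as follows by (10):

$$\log\hat\lambda_k^{(i)} = x_k^{(i)}\hat\beta + \mathrm{offset}_k, \qquad \log\hat\lambda_k^{(j)} = x_k^{(j)}\hat\beta + \mathrm{offset}_k \tag{13}$$

From (13), we use the fact that $x_k^{(i)}\hat\beta - x_k^{(j)}\hat\beta = \hat\alpha_i - \hat\alpha_j$. Here $x_k$ is a vector as before, and $\hat\alpha_i$ and $\hat\alpha_j$ are the estimated coefficients in the model corresponding to the two sets of dummy variables (with $\hat\alpha_6 = 0$ for the reference hospital). For example, if we want to estimate the log-scale difference between hospital 1 and hospital 2, the only difference is the set of dummy variables. For hospital 1, we set the dummy $d_1 = 1$ and all others to zero, while for hospital 2 we set $d_2 = 1$ and all others to zero. Then from the above fact we have

$$\log\hat\lambda_k^{(1)} - \log\hat\lambda_k^{(2)} = \hat\alpha_1 - \hat\alpha_2$$

If we want to compare hospital 1 with the reference hospital 6 (by setting all dummies to zero), the difference is

$$\log\hat\lambda_k^{(1)} - \log\hat\lambda_k^{(6)} = \hat\alpha_1$$

Therefore, the only difference between the two groups is related to the corresponding dummy coefficient(s). We have the following relationship between $\hat\lambda_k^{(i)}$ and $\hat\lambda_k^{(j)}$:

$$\hat\lambda_k^{(i)} = c_{ij}\,\hat\lambda_k^{(j)}, \qquad c_{ij} = \exp(\hat\alpha_i - \hat\alpha_j) \tag{14}$$

where $c_{ij}$ is a constant for any given model. When we estimate the covariance between $\bar\lambda_i$ and $\bar\lambda_j$, we need to take into account that $c_{ij}$ is itself estimated.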
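To make (12) and (14) concrete, the sketch below computes a recycled mean, its gradient with respect to the coefficients, and the resulting delta-method variance using plain Python lists. Everything numeric here is a labeled assumption: the two-group model with one covariate, the coefficient vector `beta_hat`, the matrix `V_hat` (standing in for Stata's e(V)) and the toy covariate values are hypothetical.

```python
import math

# Hypothetical two-group model: x = (1, d, z) with group dummy d and one covariate z
beta_hat = [0.2, 0.3, -0.02]
V_hat = [[0.040, -0.010, -0.001],     # hypothetical variance-covariance matrix of beta_hat
         [-0.010, 0.030, 0.000],
         [-0.001, 0.000, 0.0001]]
z_values = [25.0, 31.0, 40.0, 47.0, 52.0]   # toy covariate values, no offset


def recycled_mean(beta, d):
    """Mean prediction (11) with every subject's dummy set to d."""
    return sum(math.exp(beta[0] + beta[1] * d + beta[2] * z) for z in z_values) / len(z_values)


def gradient(beta, d):
    """g = (1/N) * sum_k lambda_k * x_k', the derivative used in (12)."""
    g = [0.0, 0.0, 0.0]
    for z in z_values:
        lam = math.exp(beta[0] + beta[1] * d + beta[2] * z)
        for m, x in enumerate((1.0, d, z)):
            g[m] += lam * x / len(z_values)
    return g


def quad_form(a, V, b):
    """a' V b for plain Python lists."""
    return sum(a[r] * V[r][c] * b[c] for r in range(len(a)) for c in range(len(b)))


g1 = gradient(beta_hat, 1.0)
var_mean1 = quad_form(g1, V_hat, g1)   # delta-method variance of the recycled mean
c_10 = recycled_mean(beta_hat, 1.0) / recycled_mean(beta_hat, 0.0)  # constant in (14)
print(var_mean1, c_10, math.exp(beta_hat[1]))
```

The last line prints the ratio of the two recycled means next to the exponentiated dummy coefficient, which agree as (14) implies.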

Next, we derive the covariance of $\bar\lambda_i$ and $\bar\lambda_j$. It is too complicated to calculate the variance directly. Therefore, we consider the Taylor expansion of (10) around the true coefficient $\beta$,

$$\hat\lambda_k^{(i)} = \lambda_k^{(i)} + \lambda_k^{(i)}\,x_k^{(i)}(\hat\beta - \beta) + \cdots$$

Here $\lambda_k^{(i)} = \exp(x_k^{(i)}\beta + \mathrm{offset}_k)$. Keeping only the linear part of the expansion, we have the approximation for $\bar\lambda_i$

$$\bar\lambda_i \approx \frac{1}{N}\sum_{k=1}^{N}\lambda_k^{(i)} + g_i'(\hat\beta - \beta), \qquad g_i = \frac{1}{N}\sum_{k=1}^{N}\lambda_k^{(i)}\,x_k^{(i)\prime}$$

Finally, the variance-covariance matrix of $\bar\lambda_i$ and $\bar\lambda_j$ is

$$\mathrm{Cov}\begin{pmatrix}\bar\lambda_i\\ \bar\lambda_j\end{pmatrix} \approx \begin{pmatrix} g_i'\hat V g_i & g_i'\hat V g_j\\ g_j'\hat V g_i & g_j'\hat V g_j \end{pmatrix} \tag{15}$$

where $\hat V$ is the estimated variance-covariance matrix of $\hat\beta$ and $g_i$, $g_j$ are evaluated at $\hat\beta$.

Again, we can calculate the variance of a function of $\bar\lambda_i$ by the delta method. If $f$ is a differentiable function with derivative $f'$, the variance of $f(\bar\lambda_i)$ can be expressed as

$$\mathrm{Var}\!\left(f(\bar\lambda_i)\right) \approx \left[f'(\bar\lambda_i)\right]^2\,\mathrm{Var}(\bar\lambda_i)$$

This can be used further for testing whether the two predicted values are equal.

To derive the variance of the difference between two predicted values, we can calculate directly from (10) and (13). We define

$$D_{ij} = \bar\lambda_i - \bar\lambda_j \tag{16}$$

Then, by the relationship (14) between $\hat\lambda_k^{(i)}$ and $\hat\lambda_k^{(j)}$ and by (12), the derivative can be expressed as

$$\frac{\partial D_{ij}}{\partial\hat\beta} = g_i - g_j = \frac{1}{N}\sum_{k=1}^{N}\hat\lambda_k^{(j)}\left(c_{ij}\,x_k^{(i)} - x_k^{(j)}\right)' \tag{17}$$

Then, by the delta method, the variance of the difference between group $i$ and group $j$ is

$$\mathrm{Var}(D_{ij}) \approx (g_i - g_j)'\,\hat V\,(g_i - g_j) = \mathrm{Var}(\bar\lambda_i) + \mathrm{Var}(\bar\lambda_j) - 2\,\mathrm{Cov}(\bar\lambda_i, \bar\lambda_j) \tag{18}$$

Substituting the values from (11), (15) and (17) gives the standard error of the difference of the predicted values, which is the square root of (18). We can further use a test statistic based on this standard error to test whether the two predicted values are equal.

III. Results

We use the Adherence and Efficacy of Protease Inhibitor Therapy (ADEPT) study as a real data example to apply the above theory. It is a prospective, observational investigation of medication adherence among HIV-infected patients starting a new Highly Active Antiretroviral Treatment (HAART) regimen [12-13], conducted from February 1998 through April 1999. At each of study weeks 0, 8, 24 and 48, each patient was asked the names of their antiretroviral (ARV) medications. Patients visited the study nurses every 4 weeks for measurement of medication adherence. At each visit, any change in their medications was recorded. We model the number of combinations of ARV drugs for each patient in the study. By the design of the study, every patient has at least one drug combination in the ADEPT data. A Medication Event Monitoring System (MEMS) bottle cap recorded the date and time of each opening of the pill bottle. In the simplest case, we consider 2 covariates, age and gender. Of the 116 patients, 8 had 3 drugs, 26 had 2 drugs and the other 82 had only one drug. There are 23 (around 20.0%) females in the data set. The average age of these patients is 37.2 (standard deviation 8.1), ranging from 20.3 to 67.0.

First we model the data with the classical Zero Truncated Poisson model. The Stata output for the number of ARV drugs based on the model, and the predicted mean number of drugs by gender, are shown in Table 1.

Table 1. Zero Truncated Poisson Model without Using Recycled Predictions

(a) Standard ZTP model

Zero-truncated Poisson regression      Number of obs = 116
                                       LR chi2(2)    = 36.42
                                       Prob > chi2   = 0.0000
Log likelihood = -89.000357            Pseudo R2     = 0.1698

drug   |  Coef.      Std. Err.   z       P>|z|   [95% Conf. Interval]
age    |  -.025068   .0207726    -1.21   0.228   -.0657817   .0156456
sex    |  .294332    .4272278    0.69    0.491   -.543019    1.131683
_cons  |  .2377446   .8718251    0.27    0.785   -1.471001   1.94649

(b) Predictions based on Standard ZTP model

-> sex = 0
Variable  |  Obs   Mean       Std. Dev.   Min        Max
drug_ztp  |  23    .4814835   .1065053    .2886211   .64336

-> sex = 1
Variable  |  Obs   Mean       Std. Dev.   Min        Max
drug_ztp  |  93    .6925427   .1221134    .3171865   1.024349
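As a rough arithmetic check of how predicted values follow from the coefficients in Table 1 (a), the Python sketch below evaluates $\exp(x\hat\beta)$ for each value of sex. Evaluating at the sample mean age of 37.2 is an assumption made purely for illustration: the means in Table 1 (b) are averages of individual predictions within each gender, so they need not match these single-point values.

```python
import math

# Coefficients reported in Table 1 (a)
b_age, b_sex, b_cons = -0.025068, 0.294332, 0.2377446
mean_age = 37.2  # sample mean age reported in the text


def predicted_lambda(age: float, sex: int) -> float:
    """exp(x * beta) for the fitted ZTP model, with no offset."""
    return math.exp(b_cons + b_age * age + b_sex * sex)


lam_sex0 = predicted_lambda(mean_age, 0)
lam_sex1 = predicted_lambda(mean_age, 1)
ratio = lam_sex1 / lam_sex0   # equals exp(b_sex) at any fixed age
print(f"{lam_sex0:.3f} {lam_sex1:.3f} {ratio:.4f}")
```

At any fixed age the two predictions differ only by the factor $\exp(\hat\beta_{sex})$, which is the multiplicative relationship (14) specialized to two groups.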

Now we apply the recycled predictions method to modify the predictions. We treat all 116 patients in the data as male and as female in turn to predict the mean number of drugs. In this case, we predict the outcome adjusted for age based on the ZTP model in Table 1. The recycled predictions are shown in Table 2.

Table 2. Predictions based on the recycled predictions method

Variable      |  Obs   Mean       Std. Dev.   Min        Max
nl_drug_gen1  |  116   .6833662   .1272097    .3171865   1.024349
nl_drug_gen0  |  116   .5091277   .0947749    .2363131   .7631697

Comparing Table 1 (b) and Table 2, the predictions are modified by the recycled predictions. Finally, to test the difference between the above correlated predictions, we use the variance estimator from the methods section. The estimated standard error for the difference is 2.78. We do not have strong evidence (p-value = 0.52) to reject the null hypothesis that the predicted numbers of ARV drug combinations for males and females are the same, based on the ZTP model with the recycled predictions method adjusted for age.

IV. Discussion

In this paper, the zero truncated count is the key feature of the model. As mentioned in [11], the interpretation of coefficients in a truncated model is more complicated than in the standard model. When the truncation exists only in the sample (the zero counts cannot be observed), i.e. the population follows the standard Poisson model without zero truncation, the coefficients have the usual interpretation. To see this, consider

$$E(Y_i \mid Y_i > 0) = \frac{\lambda_i}{1 - e^{-\lambda_i}} = \frac{E(Y_i)}{P(Y_i > 0)} \tag{19}$$

A simple multiplicative connection exists between the two models. However, if zero counts are absent not because of the sampling scheme but because a zero count is truly impossible, the simple connection no longer holds.

The adverse effects of overdispersion are worse with truncated models [7]. When the sample is not truncated, using the SPRM in the presence of overdispersion does not bias the estimated coefficients.
But if we use the ZTP model when the data are overdispersed, the estimated coefficients are biased and inconsistent, which in turn leads to biased estimates of the predicted counts. As suggested by [7], before using the ZTP regression model we must check for overdispersion by fitting a zero truncated negative binomial (ZTNB) model and carrying out a likelihood-ratio test.
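The comparison developed in Section 2.4 reduces to a Wald-type statistic built from the difference of two recycled means and its delta-method standard error. The following Python sketch walks through (16) to (18) for a hypothetical two-group model. The coefficients, the matrix `V_hat` and the toy covariate values are invented for illustration, and the use of a standard normal reference distribution is an assumption, since the paper does not state which test statistic it uses.

```python
import math

beta_hat = [0.2, 0.3, -0.02]    # hypothetical (intercept, group dummy, covariate)
V_hat = [[0.040, -0.010, -0.001],   # hypothetical covariance matrix of beta_hat
         [-0.010, 0.030, 0.000],
         [-0.001, 0.000, 0.0001]]
z_values = [25.0, 31.0, 40.0, 47.0, 52.0]   # toy covariate values, no offset
N = len(z_values)


def mean_and_gradient(d):
    """Recycled mean (11) and its gradient with respect to beta_hat, as in (12)."""
    mean, grad = 0.0, [0.0, 0.0, 0.0]
    for z in z_values:
        lam = math.exp(beta_hat[0] + beta_hat[1] * d + beta_hat[2] * z)
        mean += lam / N
        for m, x in enumerate((1.0, d, z)):
            grad[m] += lam * x / N
    return mean, grad


def quad_form(a, V, b):
    """a' V b for plain Python lists."""
    return sum(a[r] * V[r][c] * b[c] for r in range(3) for c in range(3))


mean1, g1 = mean_and_gradient(1.0)
mean0, g0 = mean_and_gradient(0.0)
diff = mean1 - mean0                                 # D in (16)
g_diff = [a - b for a, b in zip(g1, g0)]             # derivative in (17)
var_diff = quad_form(g_diff, V_hat, g_diff)          # variance in (18)
se_diff = math.sqrt(var_diff)
z_stat = diff / se_diff
p_value = math.erfc(abs(z_stat) / math.sqrt(2.0))    # two-sided normal p-value (assumed)
print(diff, se_diff, z_stat, p_value)
```

Working with the gradient of the difference directly gives the same variance as combining the two variances and the covariance from (15), and it avoids forming the covariance term separately.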

References:

1. Dobson AJ, Dobson A. An Introduction to Generalized Linear Models, Second Edition. 2nd ed. Chapman & Hall/CRC; 2001.
2. Grogger JT, Carson RT. Models for Truncated Counts. Journal of Applied Econometrics. 1991;6(3):225-238.
3. Springael J, Van Nieuwenhuyse I. On the sum of independent zero-truncated Poisson random variables. University of Antwerp, Faculty of Applied Economics; 2006.
4. Hardin JW, Hilbe JM. Generalized Linear Models and Extensions, Second Edition. 2nd ed. Stata Press; 2007.
5. Wedel M, Desarbo WS, Bult JR, Ramaswamy V. A Latent Class Poisson Regression Model for Heterogeneous Count Data. Journal of Applied Econometrics. 1993;8(4):397-411.
6. Stata Corporation. STATA Base Reference Manual, Volume 3: Q-Z, Release 10. Stata; 2007.
7. Long JS, Freese J. Regression Models for Categorical Dependent Variables Using Stata, Second Edition. 2nd ed. Stata Press; 2005.
8. Shaw D. 'On-site samples' regression: problems of non-negative integers, truncation, and endogenous stratification. Journal of Econometrics. 1988;37:211-223.
9. Basu A, Meltzer D. Implications of spillover effects within the family for medical cost-effectiveness analysis. Journal of Health Economics. 2005;24(4):751-773.
10. Glick H, Doshi JA, Sonnad SS, Polsky D. Economic evaluation in clinical trials. Oxford University Press; 2007:244.
11. Simonoff JS. Analyzing categorical data. Springer; 2003:496.
12. Miller LG, Liu H, Hays RD, et al. Knowledge of antiretroviral regimen dosing and adherence: a longitudinal study. Clin. Infect. Dis. 2003;36(4):514-518.
13. Golin CE, Liu H, Hays RD, et al. A prospective study of predictors of adherence to combination antiretroviral medication. J Gen Intern Med. 2002;17(10):756-765.