THE UNIVERSITY OF OKLAHOMA HEALTH SCIENCES CENTER GRADUATE COLLEGE THE USE OF GENERALIZED METHOD OF MOMENTS WITH BINARY OUTCOMES IN THE PRESENCE


THE UNIVERSITY OF OKLAHOMA HEALTH SCIENCES CENTER
GRADUATE COLLEGE

THE USE OF GENERALIZED METHOD OF MOMENTS WITH BINARY OUTCOMES IN THE PRESENCE OF TIME VARYING COVARIATES

A DISSERTATION SUBMITTED TO THE GRADUATE FACULTY
in partial fulfillment of the requirements for the degree of Doctor of Philosophy

BY SUMMER GALE FRANK
Oklahoma City, Oklahoma

THE USE OF GENERALIZED METHOD OF MOMENTS WITH BINARY OUTCOMES IN THE PRESENCE OF TIME VARYING COVARIATES

APPROVED BY:
Sara K. Vesely, Ph.D., Chair
Michael P. Anderson, Ph.D.
Laura A. Beebe, Ph.D.
David M. Thompson, Ph.D.
Eleni L. Tolma, Ph.D.
DISSERTATION COMMITTEE

COPYRIGHT by Summer Gale Frank August 1,

ACKNOWLEDGEMENTS

The path I took to reach the momentous achievement of completing my dissertation has been long, winding, and often steep. I have many people to thank for this achievement, because I know that I would not have reached the end of this journey without their encouragement, support, and guidance.

First and foremost, I would like to thank my advisor, Dr. Sara Vesely, who invested countless hours to assure my success throughout the program and my dissertation. She always provided support and encouragement as well as an extra push when I needed it. It was an honor to be her first doctoral student and to share this journey with her.

I also want to thank my dissertation committee. Dr. David Thompson and Dr. Michael Anderson helped me to refine my analysis/simulation code and research plans and provided assistance when simulation issues seemed insurmountable. Dr. Laura Beebe and Dr. Eleni Tolma kept me grounded, reminding me to keep in mind the real-world applications of my theoretical explorations.

I am thankful for the many professors and staff in the Department of Biostatistics and Epidemiology and in the College of Public Health who provided me guidance or encouragement. I am also grateful to the students with whom I shared this rollercoaster ride. I appreciate every presentation practice session, impromptu coffee break, bit of encouragement (hug, smile, or kind word) and attentive ear during stressed ramblings.

I am grateful to Dr. Trent Lalonde for sharing his simulation/analysis code and for providing advice as I adapted it for use in my dissertation. I also appreciate Brett Zimmerman and the staff at the OU Supercomputing Center for Education & Research for helping me to adapt my code for use on the supercomputer and for helping me with error messages and incessant questions throughout my dissertation.

Finally, I would not have been able to finish this dissertation without the support of my family. To my parents, Ilene and Brantley, who have always provided support and encouraged me to follow my dreams. They listened to me work through issues in my simulations, reminded me that they were confident in my abilities and made sure I was taking care of myself. I would not be the person I am today without your love and guidance. To Matt, who listened to my daily dissertation ramblings, fed my heart and my empty belly when I got home after long nights of dissertation work, and shined a light when I couldn't see the one at the end of the tunnel. To my sister, Lela, and her family, I'm thankful for the reminder of the important things in life and for the chance to focus on being Aunt Summer when my dissertation became overwhelming.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
Chapter
I. INTRODUCTION
II. BACKGROUND AND LITERATURE REVIEW
III. METHODS AND MATERIALS
IV. RESULTS
V. DISCUSSION
BIBLIOGRAPHY
APPENDICES
Appendix A

LIST OF TABLES

Table 1: Interpretation of strength of correlation (Mukaka, 2012)
Table 2: Summary of comparison methods and criteria for estimating equation exclusion
Table 3: Values to explore in single variable scenarios
Table 4: Summary of components in simulation
Table 5: Performance measures for evaluating methods
Table 6: Crude analysis results for Youth Asset Study females (n=481) modeling effect of the Positive Peer Role Model asset on the outcome of tobacco use in the last 30 days (Y/N)
Table 7: Crude analysis results for Youth Asset Study males (n=429) modeling effect of the Positive Peer Role Model asset on the outcome of tobacco use in the last 30 days (Y/N)
Table 8: Part 1 of adjusted analysis results for Youth Asset Study females (n=481) modeling effect of the Positive Peer Role Model asset on the outcome of tobacco use in the last 30 days (Y/N), accounting for effect coded versions of age, race/ethnicity, parental income, and time/wave
Table 9: Part 2 of adjusted analysis results for Youth Asset Study females (n=481) modeling effect of the Positive Peer Role Model asset on the outcome of tobacco use in the last 30 days (Y/N), accounting for effect coded versions of age, race/ethnicity, parental
Table 10: Part 3 of adjusted analysis results for Youth Asset Study females (n=481) modeling effect of the Positive Peer Role Model asset on the outcome of tobacco use in the last 30 days (Y/N), accounting for effect coded versions of age, race/ethnicity, parent
Table 11: Part 1 of analysis results for Youth Asset Study males (n=429) modeling effect of the Positive Peer Role Model asset on the outcome of tobacco use in the last 30 days (Y/N), accounting for effect coded versions of age, race/ethnicity, parental income, and time/wave

Table 12: Part 2 of analysis results for Youth Asset Study males (n=429) modeling effect of the Positive Peer Role Model asset on the outcome of tobacco use in the last 30 days (Y/N), accounting for effect coded versions of age, race/ethnicity, parental income, and time
Table 13: Part 3 of analysis results for Youth Asset Study males (n=429) modeling effect of the Positive Peer Role Model asset on the outcome of tobacco use in the last 30 days (Y/N), accounting for effect coded versions of age, race/ethnicity, parental income, and time
Table 14: Crude analysis results for rehospitalization data (n=1,625) modeling effect of number of diseases (NDX) on the outcome of rehospitalization within 30 days (Y/N)
Table 15: Part 1 of adjusted analysis results for rehospitalization data (n=1,625) modeling effect of number of diseases (NDX) on the outcome of rehospitalization within 30 days (Y/N), accounting for number of procedures (NPR), length of stay (LOS), and effect coded versions of presence of atherosclerosis (DX101) and time (T2 and T3)
Table 16: Part 2 of adjusted analysis results for rehospitalization data (n=1,625) modeling effect of number of diseases (NDX) on the outcome of rehospitalization within 30 days (Y/N), accounting for number of procedures (NPR), length of stay (LOS), and effect coded versions of presence of atherosclerosis (DX101) and time (T2 and T3)
Table 17: Comparison of performance measures in single Type I covariate with 500 subjects/clusters
Table 18: Comparison of performance measures in single Type I covariate with 1000 subjects/clusters
Table 19: Comparison of performance measures in single Type II covariate with 500 subjects/clusters and 3 time points
Table 20: Comparison of performance measures in single Type II covariate with 500 subjects/clusters and 4 time points
Table 21: Comparison of performance measures in single Type II covariate with 500 subjects/clusters and 5 time points
Table 22: Comparison of performance measures in single Type II covariate with 1000 subjects/clusters and 3 time points

Table 23: Comparison of performance measures in single Type II covariate with 1000 subjects/clusters and 4 time points
Table 24: Comparison of performance measures in single Type II covariate with 1000 subjects/clusters and 5 time points
Table 25: Comparison of performance measures in single Type III covariate with 500 subjects/clusters, 3 time points and a time-dependent covariate weight of
Table 26: Comparison of performance measures in single Type III covariate with 500 subjects/clusters, 3 time points and a time-dependent covariate weight of
Table 27: Comparison of performance measures in single Type III covariate with 500 subjects/clusters, 3 time points and a time-dependent covariate weight of
Table 28: Comparison of performance measures in single Type III covariate with 500 subjects/clusters, 4 time points and a time-dependent covariate weight of
Table 29: Comparison of performance measures in single Type III covariate with 500 subjects/clusters, 4 time points and a time-dependent covariate weight of
Table 30: Comparison of performance measures in single Type III covariate with 500 subjects/clusters, 4 time points and a time-dependent covariate weight of
Table 31: Comparison of performance measures in single Type III covariate with 500 subjects/clusters, 5 time points and a time-dependent covariate weight of
Table 32: Comparison of performance measures in single Type III covariate with 500 subjects/clusters, 5 time points and a time-dependent covariate weight of
Table 33: Comparison of performance measures in single Type III covariate with 500 subjects/clusters, 5 time points and a time-dependent covariate weight of
Table 34: Comparison of performance measures in single Type III covariate with 1000 subjects/clusters, 3 time points and a time-dependent covariate weight of
Table 35: Comparison of performance measures in single Type III covariate with 1000 subjects/clusters, 3 time points and a time-dependent covariate weight of

Table 36: Comparison of performance measures in single Type III covariate with 1000 subjects/clusters, 3 time points and a time-dependent covariate weight of
Table 37: Comparison of performance measures in single Type III covariate with 1000 subjects/clusters, 4 time points and a time-dependent covariate weight of
Table 38: Comparison of performance measures in single Type III covariate with 1000 subjects/clusters, 4 time points and a time-dependent covariate weight of
Table 39: Comparison of performance measures in single Type III covariate with 1000 subjects/clusters, 4 time points and a time-dependent covariate weight of
Table 40: Comparison of performance measures in single Type III covariate with 1000 subjects/clusters, 5 time points and a time-dependent covariate weight of
Table 41: Comparison of performance measures in single Type III covariate with 1000 subjects/clusters, 5 time points and a time-dependent covariate weight of
Table 42: Comparison of performance measures in single Type III covariate with 1000 subjects/clusters, 5 time points and a time-dependent covariate weight of

LIST OF FIGURES

Figure 1: Simulation process for Type I covariates in an example with three time points
Figure 2: Simulation process for Type II covariates in an example with three time points
Figure 3: Simulation process for Type III covariates in an example with three time points
Figure 4: Mean number of estimating equations excluded in each method by time point in Type I covariates
Figure 5: Mean number of estimating equations excluded in each method by TDC weight and time point in Type II covariates
Figure 6: Mean number of estimating equations excluded in each method by TDC weight, feedback weight and time point in Type III covariates
Figure 7: Absolute value of the bias in each method by time point in Type I covariates
Figure 8: Absolute value of the bias in each method by TDC weight and time point in Type II covariates
Figure 9: Absolute value of the bias in each method by TDC weight, feedback weight and time point in Type III covariates
Figure 10: Empirical SE in each method by time point in Type I covariates
Figure 11: Empirical SE in each method by TDC weight and time point in Type II covariates
Figure 12: Empirical SE in each method by TDC weight, feedback weight and time point in Type III covariates
Figure 13: Coverage versus relative efficiency for all methods in number of subject/cluster values of 500 and 1000 by time-varying covariate type

Figure 14: Coverage versus relative efficiency for all methods in number of subject/cluster values of 500 and 1000 by time-varying covariate type and number of time points plotted in all methods using full X-axis width
Figure 15: Coverage versus relative efficiency for GMM-Rho2 and comparison methods in number of subject/cluster values of 500 and 1000 by time-varying covariate type and number of time points plotted using a smaller X-axis width
Figure 16: Coverage versus relative efficiency for GMM-Rho2 and comparison methods in number of subject/cluster values of 500 and 1000 by time point for time-varying covariate Type I
Figure 17: Coverage versus relative efficiency for GMM-Rho2 and comparison methods in number of subject/cluster values of 500 and 1000 by number of time points and time-dependent covariate weight for time-varying covariate Type II
Figure 18: Coverage versus relative efficiency for GMM-Rho2 and comparison methods in number of subject/cluster values of 500 and 1000 by number of time points and time-dependent covariate weight for time-varying covariate Type III and feedback weight
Figure 19: Coverage versus relative efficiency for GMM-Rho2 and comparison methods in number of subject/cluster values of 500 and 1000 by number of time points and time-dependent covariate weight for time-varying covariate Type III and feedback weight
Figure 20: Coverage versus relative efficiency for GMM-Rho2 and comparison methods in number of subject/cluster values of 500 and 1000 by number of time points and time-dependent covariate weight for time-varying covariate Type III and feedback weight
Figure 21: Absolute value of the bias versus relative efficiency for all methods in number of subject/cluster values of 500 and 1000 by time-varying covariate type
Figure 22: Absolute value of the bias versus relative efficiency for GMM-Rho2 and comparison methods in number of subject/cluster values of 500 and 1000 by time-varying covariate type and number of time points
Figure 23: Absolute value of the bias versus relative efficiency for GMM-Rho2 and comparison methods in number of subject/cluster values of 500 and 1000 by time point for time-varying covariate Type I

Figure 24: Absolute value of the bias versus relative efficiency for GMM-Rho2 and comparison methods in number of subject/cluster values of 500 and 1000 by number of time points and time-dependent covariate weight for time-varying covariate Type II
Figure 25: Absolute value of the bias versus relative efficiency for GMM-Rho2 and comparison methods in number of subject/cluster values of 500 and 1000 by number of time points and time-dependent covariate weight for time-varying covariate Type III and feedback weight
Figure 26: Absolute value of the bias versus relative efficiency for GMM-Rho2 and comparison methods in number of subject/cluster values of 500 and 1000 by number of time points and time-dependent covariate weight for time-varying covariate Type III and feedback weight
Figure 27: Absolute value of the bias versus relative efficiency for GMM-Rho2 and comparison methods in number of subject/cluster values of 500 and 1000 by number of time points and time-dependent covariate weight for time-varying covariate Type III and feedback weight
Figure 28: Prevalence versus time point in time-dependent covariate weight scenarios by time-dependent covariate type and number of clusters/subjects
Figure 29: Mean versus time point in time-dependent covariate weight scenarios by time-dependent covariate type and number of clusters/subjects
Figure 30: Prevalence versus time point in feedback weight scenarios by time-dependent covariate type and number of clusters/subjects
Figure 31: Mean covariate value versus time point in feedback weight scenarios by time-dependent covariate type and number of clusters/subjects

ABSTRACT

Introduction: Statistical models for longitudinal data that include time-varying covariates can produce biased parameter estimates. The generalized estimating equations (GEE) method employing an independent working covariance structure is considered the safe choice to reduce bias, but it can produce inefficient estimates of the standard error of a parameter estimate. Alternatively, the generalized method of moments (GMM) can be used to combine estimating equations (EE). Some researchers propose excluding invalid EE from GMM. In those approaches, EE are deemed invalid either by their assumed covariate type or by directly testing standardized residuals (at time t) and weighted covariates (at time s) in a logistic regression model, for each combination of s and t (α=0.05).

Methods: This dissertation proposed identifying invalid EE through the correlation between the studentized residuals at time t and the weighted covariate at time s where s ≠ t (correlation threshold ρ=0.2 and α=0.05); EE on the diagonal were never excluded. Higher correlation thresholds (ρ=0.3 and 0.4) were also explored. I compared the performance of the GEE independence method and two previously developed GMM methods to my proposed GMM method in three real-world analyses and various simulated data scenarios (cluster size = 500 and 1000; time points = 3, 4, and 5; covariate type = I, II, III). Simulations explored various strengths of weights representing the feedback from the previous outcome to the current covariate (feedback weight) and the effect of the previous covariate value on the current outcome (time-dependent covariate weight).

Results: The simulation studies demonstrated that higher thresholds for correlation exclude too few EE, producing parameter estimates with larger bias than other methods. When the smallest correlation value (ρ=0.2) was employed, my method had smaller bias and standard error in many simulation scenarios, tending to perform best in extreme combinations (i.e., high feedback weight strength combined with low time-dependent covariate weight strength, or vice versa). My method had the smallest standard deviation in two real-world analyses and shared the second smallest value in the third with another method.

Discussion: The comparative performance of my method depends upon the strengths of the feedback and time-dependent covariate weights and the correlation threshold used to identify invalid EE to exclude from the GMM calculation.

CHAPTER I

INTRODUCTION

The ability to obtain unbiased estimates of the effect of an exposure on an outcome of interest is an important aspect of public health research. In most cases, researchers will need to account for multiple explanatory variables, or covariates, that also affect the outcome, and it is likely that one or more of these variables will vary over time. The presence of a covariate that varies over time can introduce bias in the calculation of parameter estimates in most statistical techniques. Therefore, it is important to address the bias that could be introduced by these variables. However, the methods that assure an unbiased estimate can hinder the researcher's efforts to obtain an efficient estimate and result in an overestimation of the standard error. This overestimation can result in wider confidence intervals surrounding parameter estimates, which can lead to incorrect conclusions regarding the effect of an exposure on the outcome. In public health, research efforts are often directed toward health and behavior outcomes in the general population or in subsets of it such as youth. If we conclude that an exposure does not have an effect on an outcome or that our confidence in its magnitude is not sufficient to warrant pursuit, it is likely to be excluded from future intervention efforts. It is, therefore, essential that research studies utilize analysis methods that can balance this trade-off between bias and efficiency.

Many public health studies are longitudinal in nature and aim to estimate the effect of multiple explanatory variables on a dichotomous, or binary, outcome. For example, a cohort of adolescent participants may be enrolled and then followed annually over time to explore the effect of having a positive peer role model on reported tobacco use (yes or no). The effect of additional explanatory variables such as the sex, family structure and race/ethnicity of the youth may be of interest. To determine if there is a relationship between these variables and the binary outcome of reported tobacco use, the most appropriate analysis method must be chosen. Methods for independent data can be used when one can be fairly certain that the measurement for one individual has no influence on the measurement for another individual. However, in longitudinal studies we generally observe that repeated measurements for a particular individual are correlated; therefore, one can no longer assume independence. This is an important aspect of the data as it guides the analysis methods that a researcher uses to explore and interpret those data.

Generalized Estimating Equations (GEE) can be used to account for this correlation when one is interested in the marginal, or population-averaged, effect of an explanatory variable on the outcome of interest. GEE allows one to model the covariance among measurements based upon an assumed pattern of correlation among outcomes. Liang and Zeger (1986) introduced the GEE method in their seminal paper, which postulated a working covariance structure that could be used to obtain consistent coefficient estimates without having to specify the correct correlation structure. Generally, this method produces unbiased coefficient estimates and standard errors even when the covariance structure is not correctly specified (Diggle, Heagerty, Liang, & Zeger, 2002; Liang & Zeger, 1986). However, GEE may produce biased coefficient estimates in the presence of explanatory variables whose values vary over time, termed time-varying covariates.

Time-varying covariates are variables whose values can change during a study. They are in direct contrast to time-invariant covariates whose values are fixed or stationary throughout a study.
Examples of time-invariant covariates include biological sex and race/ethnicity. Time-varying covariates can change systematically over time, such as the measurement time or the age of a participant for a study in which measurements are taken once per year. They can also change over time by study design, such as when a researcher assigns a treatment to a participant in a randomized crossover study. When time-varying covariates change systematically or by study design, they are referred to as exogenous because the values of the covariate are externally controlled. For exogenous variables, the covariate at a particular time is independent of the preceding outcome measurements because the values of the covariates at the time of measurement are determined in a manner that is unrelated to the outcome. However, when a covariate changes randomly over time in a manner that is not systematic or fixed, it is referred to as endogenous. Examples of endogenous variables include body mass index (BMI) and having a positive peer role model. For endogenous variables, the covariate at a particular time can depend upon the preceding covariate values and outcome measurements.

The categorization of time-varying covariates was refined by Lai and Small into Types I, II, and III (Lai & Small, 2007). They redefine exogenous covariates as Type I and endogenous covariates as either Type II or Type III. The most complex covariate type is the Type III variable. For a Type III covariate, the current outcome value can be affected by the previous covariate value and the current covariate value can be affected by the previous outcome value, such that there is feedback from the previous outcome to the current covariate value. For a Type II covariate, the current outcome value can be affected by the previous covariate value, but there is no feedback loop from the previous outcome value to the current covariate value. Consider the example of a study that examines the effect of having a positive peer role model on reported tobacco use.
If having a positive peer role model was a Type II covariate, then having a positive peer role model at the previous time point could affect the current reported tobacco use, but reported tobacco use at the previous measurement would not affect whether a person had a positive peer role model at the current measurement. Conversely, if having a positive peer role model was a Type III variable, then having a positive peer role model at the previous time point could affect the current reported tobacco use and having a positive peer role model at the current measurement time could be affected by the previous reported tobacco use. This might be observed if a potential positive peer role model chose not to spend time with a person who uses tobacco, but would reconsider after the person quit using tobacco.

Pepe and colleagues have shown that using a non-diagonal working covariance matrix produces biased estimates of the parameters when GEE methodology is used in the presence of time-varying covariates (Diggle et al., 2002; Pepe & Anderson, 1994). Pepe and Anderson (1994) concluded that to obtain unbiased estimates, the safest choice for the working covariance structure is the diagonal, independence structure. However, the diagonal structure specifies that outcomes are uncorrelated and ignores the correlation between an individual's repeated measures. This results in an overestimation of the standard error and, in turn, a loss of efficiency (G. M. Fitzmaurice, 1995). This dissertation addresses this balance between bias and efficiency when estimating the population-averaged effect of explanatory variables on a binary outcome. Discussions of this balancing act have continued over the past 20 years and various methods have been proposed to deal with the issues introduced by time-varying covariates. Some methods address the issues less effectively than others.

One method proposed to account for independent variables whose values change over time is the process of lagging data. A lagged covariate can be used to assess the effect of a time-varying exposure at a prior time on an outcome. However, the length of the lag is at the discretion of the researcher and depends on the situation.
Consider the example where a researcher takes measurements of an exposure and an outcome at the same time annually for five years. If the researcher deems the most recent measurement to be important in assessing the effect of the exposure on the outcome (e.g., the effect of a quickly metabolized drug on patient symptoms), then they may choose to lag the data by one year. Conversely, the researcher may be interested in the effect of a drug that must build up in the system before impacting patient symptoms such that there is a longer interval between the exposure and its anticipated effect on the outcome. In that case, the researcher might consider the effect of the exposure at two or three measurement times before the outcome measurement to be of importance. The appropriate lag time is commonly unknown, so often a researcher must explore a number of possible lags. Though one can use two or more lagged versions of a single covariate when a single lagged variable is insufficient, this can introduce issues with highly correlated predictors, uncertainty regarding the number of lagged covariates to include, and specification of the functional form of the coefficients (Diggle et al., 2002). Even once a researcher decides which lag to apply, use of a non-diagonal working covariance structure can still introduce bias and care must be taken when exploring alternative structures (Schildcrout & Heagerty, 2005). Moreover, most applications would only use the exposure that directly preceded the outcome measurement to model the entire outcome process, which is unhelpful when the entire covariate history is of interest (Diggle et al., 2002). Transition models also make use of multiple lagged variables, but the marginal methods developed for these models have similar issues of bias in the presence of time-varying covariates (Azzalini, 1997; Heagerty, 2002; Heagerty & Zeger, 2000). Marginal structural models, using inverse probability weighting, have been proposed to model time-varying covariates in a longitudinal setting.
However, the use of weights in this method also introduces efficiency issues, which worsen as the weight increases (Robins, Greenland, & Hu, 1999; Robins, Hernán, & Brumback, 2000).
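The lagging approach described earlier in this chapter can be illustrated with a minimal sketch. It assumes equally spaced visits with no missing waves; the record layout and the names `add_lagged_covariate`, `id`, `time`, `x`, `y`, and `x_lag` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def add_lagged_covariate(rows, lag=1):
    """Attach the exposure measured `lag` visits earlier to each record.

    `rows` is a list of dicts with hypothetical keys 'id', 'time', 'x', 'y'.
    """
    by_subject = defaultdict(list)
    for row in rows:
        by_subject[row["id"]].append(row)

    lagged = []
    for visits in by_subject.values():
        visits = sorted(visits, key=lambda r: r["time"])
        for k, visit in enumerate(visits):
            # The first `lag` visits have no earlier exposure to pair with.
            x_lag = visits[k - lag]["x"] if k >= lag else None
            lagged.append({**visit, "x_lag": x_lag})
    return lagged


# Made-up long-format data: two subjects, annual measurements.
records = [
    {"id": 1, "time": 1, "x": 0, "y": 0},
    {"id": 1, "time": 2, "x": 1, "y": 0},
    {"id": 1, "time": 3, "x": 1, "y": 1},
    {"id": 2, "time": 1, "x": 1, "y": 1},
    {"id": 2, "time": 2, "x": 0, "y": 1},
]
lagged_records = add_lagged_covariate(records, lag=1)
```

Records whose `x_lag` is `None` would typically be dropped before modeling, which is one practical cost of longer lags: each additional lag removes one more visit per subject from the analysis.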

Other methods, which do not involve decisions about lagged versions of time-varying covariates, have been demonstrated to be more effective in striking a balance between bias and efficiency when estimating the population-averaged effect of explanatory variables that vary over time. The Generalized Method of Moments (GMM) technique allows one to derive estimators from moment conditions where moments are included in the final calculation of the parameter estimates based on a weight matrix that gives more weight to the most efficient estimators (Hansen, 1982). Lai and Small (2007) and Qu, Lindsay, and Li (2000) have adapted the GMM for use in marginal regression analysis of longitudinal data in the presence of time-varying covariates; however, these studies were directed at specific types of time-varying covariates. Additionally, Lai and Small required the categorization of variables into the specific types of time-varying variables such that the type of variable dictated which moments would be included in the final calculation of the parameter estimates (Lai & Small, 2007). Lalonde, Wilson, and Yin (2014) extended the GMM beyond these specific types of time-varying covariates without dictating the inclusion of sample moments by covariate type. However, the work published by Lalonde and colleagues did not explore how the method worked in various simulated data conditions. Furthermore, their published work explored only one metric for identifying sample moments for inclusion. No study to date has provided a thorough examination of the methods proposed to analyze data with time-varying covariates in a variety of simulated data settings. Thus, in addition to examining other metrics for identifying appropriate sample moments, this dissertation fills an important gap in the literature by exploring the performance of multiple methods in various simulated data conditions.
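As a hedged illustration of the general GMM idea described above, the estimator can be written in its standard textbook form. The symbols below ($K$ for the number of subjects, $g_i$ for the vector of moment functions contributed by subject $i$, $G_K$ for the sample moment vector, and $W_K$ for the weight matrix) are notation assumed for this sketch rather than taken from the text.

```latex
% Sample moments averaged over K independent subjects
G_K(\beta) = \frac{1}{K} \sum_{i=1}^{K} g_i(\beta)

% GMM estimator: minimize a quadratic form in the sample moments
\hat{\beta}_{GMM} = \arg\min_{\beta} \; G_K(\beta)' \, W_K \, G_K(\beta)

% A common efficient choice of weight matrix, evaluated at a
% preliminary estimate \tilde{\beta}
W_K = \left[ \frac{1}{K} \sum_{i=1}^{K} g_i(\tilde{\beta}) \, g_i(\tilde{\beta})' \right]^{-1}
```

Under this formulation, excluding an estimating equation amounts to dropping the corresponding element of $g_i$, so the choice of which moments are treated as valid directly changes both $G_K$ and $W_K$.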

22 Research Goals This study investigates a GMM technique for combining estimating equations in a manner that achieves an estimate with an improved balance between bias and efficiency. The primary aim is to improve the GMM calculation of the standard error of the parameter estimate introduced by Lalonde et al. and explore alternative metrics to identify which valid estimating equations to include in the sample moment estimation (T. L. Lalonde et al., 2014). I compare performance in simulated data conditions using the same mechanisms as those used by Lalonde et al. and in two real-world datasets. I compare my method to 1) the GEE method using the independence matrix as proposed by Pepe and Anderson, 2) the GMM method introduced by Lai and Small which categorizes time-varying covariate by type, and 3) the GMM method proposed by Lalonde et al. (Lai & Small, 2007; T. L. Lalonde et al., 2014; Pepe & Anderson, 1994). Additionally, I will provide a more thorough comparison of these methods using simulated data that explores a wider array of scenarios which are intended to emulate the variability of features that might be seen in real-world datasets. Specific Aims By using data from real-world data sets and simulated data scenarios, this study will address the following specific aims. I. Improve the GMM method proposed by Lalonde et al. by using additional metrics to identify which sample moments to include in parameter estimation a. Studentized residuals instead of standardized residuals b. Correlation (R) value in addition to p-value cut-off i. Values of R to explore include 0.2, 0.3, and 0.4 c. Require inclusion of estimating equations along the diagonal (where s=t) 22

II. Compare the improved GMM method in a variety of simulated data scenarios to three methods (independent GEE, GMM method proposed by Lai and Small, and GMM method proposed by Lalonde et al.)
   a. Specific aspects of simulated data scenarios to explore
      i. Number of clusters
         1. Values to explore include 500 and 1000
      ii. Number of repeated observations per cluster
         1. Values to explore include 3, 4, and 5
      iii. Time-dependent covariate and feedback weights
         1. Values to explore for time-dependent covariate weight include 0.7, 0.8, and
         2. Values to explore for feedback weight include 1.3, 1.5, and 1.7
   b. Real-world datasets
      i. Rehospitalization
      ii. Youth Asset Study

CHAPTER II
BACKGROUND AND LITERATURE REVIEW

Notation

The mathematics in this research will be described with the following notation. Vectors and matrices will be bolded. A transpose will be notated using the prime symbol (i.e., the transpose of the vector $A$ will be notated as $A'$). The total number of observations in a dataset will be indicated with the letter $N$, while the repeated measures within a cluster will be represented by the letter $t$. The number of parameters to estimate will be identified by the letter $p$. As we have repeated measurements, subjects will be indexed using the letter $i$, while the observations within each subject will be indexed with a $j$. The number of repeated measurements on a subject will be identified as $n_i$. This means that $Y_{ij}$ would be the observation at the $j$th measurement time for the $i$th subject. Each subject has the outcome/response values, $Y_i = (Y_{i1}, \ldots, Y_{in_i})$, and time-varying covariate matrix $X_i = (x_{i1}, \ldots, x_{in_i})$. Although the components of $Y_i$ are generally correlated, when $i \neq j$ we assume that $Y_i$ and $Y_j$ are independent. $Z_i$ is a matrix of time-invariant covariates where the number of columns is equal to the number of time-invariant covariates.

Marginal Regression Models

The purpose of this research is to investigate the performance of an improved form of the generalized method of moments (GMM) approach to combining generalized estimating equations (GEE) in marginal regression models introduced by Lalonde and colleagues (Lalonde et al., 2014). Marginal regression models account for the correlation present in longitudinal or clustered data for which independence can no longer be assumed. Because marginal methods model the outcome as a function of the explanatory variables separately from the association among the outcome's repeated measures for each individual, they are appropriate when one intends to make inferences about the population-average parameters.

The generalized linear model (GLM) is a general class of marginal regression models that share a few salient features. It is assumed that the mean of the outcome, $\mu_i = E(Y_i)$, is related to a vector of covariates, $x_i$, through a link function such that $h(\mu_i) = x_i \beta$, and the variance of $Y_i$ is a known function of the mean of the outcome (Fitzmaurice, Laird, & Ware, 2012). Regression coefficients in GLM can be solved using an iteratively reweighted least squares method of maximum likelihood estimation introduced by Nelder and Wedderburn (Nelder & Wedderburn, 1972). In GLM, the form of the likelihood function is a member of the exponential family of distributions. However, this method was extended to other distributions outside of the exponential family in the quasi-likelihood (QL) method first described by Wedderburn (Wedderburn, 1974). This QL method makes assumptions about the link and variance functions without specifying the entire distribution of the outcome, $Y_i$; however, the data are assumed to be independent. The generalized estimating equations (GEE) method extends the QL method to correlated data.

GEE Overview

First introduced by Liang and Zeger, GEE allows one to account for the correlation, or non-independence, of the repeated measurements taken on an individual or in observations that have natural groupings, or clusters, such as families, classrooms, or clinics (Liang & Zeger, 1986). The parameter estimates from GEE describe the changes in the population mean of a particular outcome given changes in the covariates while accounting for the correlation observed in the clusters of repeated measurements. Termed a marginal or population-averaged model, this method requires one to specify the marginal distributions of the outcome at each time point, $y_i$, and the working correlation matrix describing associations among elements of the vector of each subject's repeated observations. However, one does not need to specify the full joint distribution of a subject's observations. The generalized estimating equations are shown below.

$$\sum_{i=1}^{N} D_i' V_i^{-1} (y_i - \mu_i) = 0$$

The vector of predicted mean responses/outcomes for each individual is represented by the symbol $\mu_i$, where the $i$ indicates the mean value is for the $i$th individual, while $y_i$ is the vector of actual outcomes for each individual. $D_i$ is a derivative matrix that contains the derivative of $\mu_i$ with respect to the parameter estimates, $\beta$. The $D$ matrices transform each individual's set of predicted and actual outcomes ($\mu_i$ and $y_i$) from their original units to that of the link function that we use to estimate the expected value of the outcome. $V_i$ is an approximation of the true covariance matrix for the outcome, $Y_i$, and is composed of the variances on the diagonal and the working covariances on the off-diagonal (Fitzmaurice et al., 2012).

$$V_i = A_i^{1/2}\, \mathrm{Corr}(Y_i)\, A_i^{1/2}$$

$A_i$ is a diagonal matrix whose non-zero elements represent the variances of the outcomes $Y_i$ (and when the 1/2 power is applied, the standard deviations) and depends upon the parameter estimates, $\beta$. $\mathrm{Corr}(Y_i)$ represents the assumed correlation matrix and is made up of pairwise association parameters, $\alpha$, which are assumed to represent the pairwise correlations among the outcomes/responses. Since the working covariance matrix in GEE depends upon the assumed correlation matrix in addition to the parameter estimates, the GEE are functions of $\beta$ and $\alpha$ (Diggle et al., 2002).
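The pieces of the estimating equation above can be made concrete with a small numerical sketch. The code below computes one subject's contribution $D_i' V_i^{-1}(y_i - \mu_i)$ for a binary outcome with a logit link. The function names (`expit`, `exchangeable`, `gee_score_contribution`), the exchangeable structure, and the toy data are hypothetical choices made for illustration; this is not the code used in the dissertation.

```python
import numpy as np

def expit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + np.exp(-z))

def exchangeable(n, alpha):
    """Working correlation with a common off-diagonal value alpha."""
    return (1.0 - alpha) * np.eye(n) + alpha * np.ones((n, n))

def gee_score_contribution(X, y, beta, R):
    """Return D_i' V_i^{-1} (y_i - mu_i) for one subject under a logit link."""
    mu = expit(X @ beta)                 # predicted means, length n_i
    var = mu * (1.0 - mu)                # variance function for a binary outcome
    D = var[:, None] * X                 # d mu_i / d beta, shape (n_i, p)
    A_half = np.diag(np.sqrt(var))       # A_i^{1/2}
    V = A_half @ R @ A_half              # working covariance V_i
    return D.T @ np.linalg.solve(V, y - mu)

# Hypothetical subject with three visits, an intercept, and one covariate.
X = np.array([[1.0, 0.2], [1.0, -0.5], [1.0, 1.3]])
y = np.array([1.0, 0.0, 1.0])
beta = np.array([0.1, 0.4])

score_ind = gee_score_contribution(X, y, beta, np.eye(3))
score_exch = gee_score_contribution(X, y, beta, exchangeable(3, 0.3))
print(score_ind, score_exch)
```

With the identity matrix as the working correlation, the contribution reduces to $X_i'(y_i - \mu_i)$, the familiar logistic regression score; a non-diagonal working correlation mixes residuals from different visits into each equation.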

GEE Performance

This study will assess the performance of the new method by comparing various estimates of the bias and efficiency of its estimator to those of other methods. The GEE method is among many longitudinal methods that implicitly assume that the exposure of interest does not affect other covariates modeled to estimate the outcome of interest and that the outcome does not affect the exposure of interest or other covariates. Therefore, these methods do not account for the effect of the outcome on either the exposure or covariates of interest or for the effect of previous outcomes on current or future outcomes. In other words, these methods do not properly account for time-varying covariates. Failure to account for time-varying covariates can introduce bias into the calculation of the parameter estimates (Rothman, Greenland, & Lash, 2008).

Pepe and Anderson recognized that the parameter estimates might be biased when modeling time-varying covariates using GEE with a non-diagonal working covariance structure (Pepe & Anderson, 1994). The potential for bias is due to the violation of an important assumption in GEE models. In GEE it is assumed that the expectation of $Y_i$ (the expected value of the outcome) given the entire exposure history $X_i$ and the baseline covariates depends only on the past exposure history $X_{i-1}$ and baseline covariates.

$$E(Y_i \mid X_{i1}, X_{i2}, \ldots, X_{it}, \ldots, X_{iT}, Z_1) = E(Y_i \mid X_{i1}, X_{i2}, \ldots, X_{it-1}, Z_1)$$

In other words, the expected value of the outcome at a particular time, $t$, depends only on the covariates prior to that time. If the current value of the outcome $Y_i$ given the current value of the covariate $X_i$ predicts the subsequent value of the covariate $X_{i+1}$, then this assumption will be violated (Fitzmaurice et al., 2012). Diggle et al. refer to this as the full covariate conditional mean (FCCM) assumption (Diggle et al., 2002). When a covariate changes over time, this assumption might not hold because the outcome at that time, $t$, would need to take into account the entire covariate process, not just the covariate values prior to time $t$. Therefore, the FCCM assumption is not met.

$$E(Y_i \mid X_{i1}, X_{i2}, \ldots, X_{it}, \ldots, X_{iT}, Z_1) \neq E(Y_i \mid X_{i1}, X_{i2}, \ldots, X_{it-1}, Z_1)$$

In general, one can quantify the bias in an estimator by subtracting the true value of the parameter, $\theta$, from the expected value of the estimate produced by the particular analysis method, $E(\hat{\theta})$. An estimator is unbiased when this difference is equal to zero.

$$\mathrm{bias}(\hat{\theta}) = E(\hat{\theta}) - \theta = 0$$

The GEE estimator for the parameter of interest, $S_\beta(\beta, \alpha)$, is calculated using the following estimating equations.

$$S_\beta(\beta, \alpha) = \sum_{i=1}^{N} D_i' V_i^{-1} (y_i - \mu_i) = 0$$

Therefore, a GEE estimator is unbiased when the following equation is true:

$$\mathrm{bias}[S_\beta(\beta, \alpha)] = E[S_\beta(\beta, \alpha)] - S_\beta(\beta, \alpha) = 0$$

Since the estimated value of the parameter in GEE, $S_\beta(\beta, \alpha)$, is equal to zero, this equation becomes

$$\mathrm{bias}[S_\beta(\beta, \alpha)] = E[S_\beta(\beta, \alpha)] - 0 = 0.$$

Thus, the estimator $S_\beta(\beta, \alpha)$ is unbiased when its expectation is equal to zero. Pepe and Anderson showed that when one can ensure that the assumption of the FCCM holds, then the expected value of the estimator, $S_\beta(\beta, \alpha)$, is equal to zero, so the estimator is unbiased (Pepe & Anderson, 1994).

$$E[S_\beta(\beta, \alpha)] = 0$$

However, when the FCCM assumption does not hold, we are not assured an unbiased estimator; that is, we cannot be assured that the expectation will equal zero.
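In a simulation study, the expectation in the bias definition above is usually approximated by averaging the estimates obtained across simulated datasets. The sketch below shows that calculation; the function name `empirical_bias` and the numbers are hypothetical and are not results from this dissertation.

```python
import numpy as np

def empirical_bias(estimates, true_value):
    """Monte Carlo bias: average estimate minus the true parameter value."""
    return float(np.mean(estimates) - true_value)

# Hypothetical estimates of one coefficient from five simulated datasets,
# generated under a true coefficient value of 0.5.
estimates = np.array([0.58, 0.63, 0.61, 0.57, 0.61])
bias = empirical_bias(estimates, true_value=0.5)
print(round(bias, 3))
```

A value near zero suggests the method is approximately unbiased under the simulated conditions; the sign shows whether the method tends to overestimate or underestimate the parameter.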

In the GEE framework, when an independence working correlation matrix is used, no pairwise correlation coefficients are estimated (i.e., the working correlation matrix is estimated with zero elements off its diagonal) because the multiple measurements on the same sample/cluster are assumed to be uncorrelated. This simplified estimation of the outcome assures that it will meet the implicit assumption of GEE such that the FCCM assumption will hold, but results in an inefficient estimator. Conversely, utilizing a non-diagonal working correlation matrix introduces the pairwise correlation coefficients that help us to efficiently estimate the effect of the covariate process on the outcome. Yet, these pairwise correlations introduce the possibility that our estimation of the outcome process would need to account for the entire exposure history, $X_i$, not just the covariates prior to that time, $X_{i-1}$. Therefore, it is possible that the FCCM assumption would no longer hold.

Time-varying Covariates

Not all time-varying variables violate this assumption of GEE; therefore, it is important to understand which of the various types of time-varying variables introduce bias in the GEE framework. Time-varying covariates can have a process that is considered either exogenous or endogenous with respect to the outcome process. When the time-varying covariate is considered exogenous, the covariate value at a particular time is independent of all previous outcome measurements and is controlled externally rather than being determined by factors within the covariate process. An example of an exogenous variable would be the measurement time, which is not stochastic and is externally controlled. Another example of an exogenous variable would be the treatment assignment in a randomized cross-over study design; the treatment assignment is determined by study design, not by the covariate or outcome process. For exogenous covariates, the assumption holds that the expected value of the outcome at a particular time, $t$, depends only upon the covariates prior to that time, and so coefficients estimated from modeling this covariate using GEE will be unbiased.

Alternatively, a time-varying covariate is considered endogenous when its value is determined by factors within the covariate process, or by factors within the outcome process that introduce a feedback loop. One example of an endogenous, time-varying covariate is found in a study of the association between body mass index (BMI) and child morbidity discussed by Lai and Small (Lai & Small, 2007). A child's BMI or morbidity at a particular time could be determined by the child's BMI at the previous time, i.e., by factors within the covariate process. Another possibility is that the child's BMI could be determined by the child's BMI at the previous time while also being determined by the outcome (whether or not the child was ill) at the previous measurement. For instance, a child who is sick may not eat as much as usual, resulting in a lower weight at the next measurement. A feedback loop exists from the outcome process to the covariate process in the second, but not the first, example. However, in both examples, previous values of the covariate could affect the covariate's current value.

The GMM estimation method introduced by Lai and Small distinguishes between exogenous and endogenous types of time-varying covariates depending on the effect of the variable on the covariate and outcome processes (Lai & Small, 2007). Lai and Small more narrowly divide endogenous covariates into two types, Type II and Type III, whereas they use the term Type I covariate to identify exogenous variables. They identify the variable in the former example (where BMI and its effect on the outcome are determined only by previous values of the covariate) as a Type II covariate and the variable in the latter example (where BMI and its effect on the outcome are altered by a feedback loop from the outcome to the covariate as well as by previous values of the covariate) as a Type III covariate. These variable types (Type I, Type II, and Type III) dictate which estimating equations will be included in the GMM estimation.
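The two BMI scenarios described above differ only in whether the previous outcome feeds back into the covariate. The sketch below generates one subject's data with and without such a feedback term. It is a deliberately simple, hypothetical mechanism (the function name, the parameters `carryover` and `feedback`, and all numeric values are assumptions for illustration) and is not the data-generating mechanism used by Lai and Small or by Lalonde et al.

```python
import numpy as np

def simulate_subject(T, rng, feedback=0.0, carryover=0.5, beta0=-0.5, beta1=1.0):
    """Simulate a covariate x_t and a binary outcome y_t over T visits.

    x_t depends on its own previous value (carryover) and, when
    feedback != 0, on the previous outcome y_{t-1}.
    """
    x = np.empty(T)
    y = np.empty(T, dtype=int)
    x[0] = rng.normal()
    for t in range(T):
        if t > 0:
            x[t] = carryover * x[t - 1] + feedback * y[t - 1] + rng.normal()
        p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x[t])))   # logit link
        y[t] = rng.binomial(1, p)
    return x, y

# Same seed, so the two runs share their random draws up to the second visit.
x_no_fb, y_no_fb = simulate_subject(4, np.random.default_rng(1), feedback=0.0)
x_fb, y_fb = simulate_subject(4, np.random.default_rng(1), feedback=-1.0)
print(x_no_fb, y_no_fb)
print(x_fb, y_fb)
```

A negative `feedback` value loosely mirrors the example of an ill child who eats less and has a lower weight at the next measurement.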

Time-varying Covariates and Model Residuals

Lai and Small further describe how these covariate types are correlated with the residuals of the model in which they are included (Lai & Small, 2007). For Type I covariates, the effect of the current covariate value on the outcome is not dependent upon past covariate values and there is no feedback from the outcome process to the covariate process. This means that both future and past values of the covariate are uncorrelated with current residuals of the marginal regression model. For Type II covariates, the effect of the current covariate value on the outcome is dependent upon past covariate values but there is no feedback from the outcome process to the covariate process. Therefore, future values of the covariate are uncorrelated with current residuals, but past values of the covariate are correlated with current residuals. For Type III covariates, the effect of the current covariate value on the outcome is dependent upon past covariate values and there is feedback from the outcome process to the covariate process. Therefore, both future and past values of the covariate are correlated with current residuals.

Lai and Small postulate that, when the independent working correlation structure is used, valid estimating equations are being excluded from the estimation of the outcome. This can be illustrated through the calculation of the expected value of the outcome in GEE, which is shown below in an example with three correlated measurements within the cluster. In the expected value, the working correlation, $\mathrm{Corr}(Y_i)$, can be represented as a matrix that indicates which estimating equations will be included in the calculation. If a particular estimating equation should be included (i.e., that estimating equation is considered not biased), then a correlation value will be present (represented below as $m$) in that row/column location. However, if the estimating equation is considered biased and should not be included, then the value will be zero. Recall that $D_i$ is a matrix of partial derivatives of $\mu_i$ with respect to the parameter estimates, $\beta$.

$$E\left[D_i' A_i^{-1/2}\, \mathrm{Corr}(Y_i)\, A_i^{-1/2} (y_i - \mu_i)\right] = D_i' \begin{pmatrix} 1/\sigma & 0 & 0 \\ 0 & 1/\sigma & 0 \\ 0 & 0 & 1/\sigma \end{pmatrix} \begin{pmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \\ m_{31} & m_{32} & m_{33} \end{pmatrix} \begin{pmatrix} 1/\sigma & 0 & 0 \\ 0 & 1/\sigma & 0 \\ 0 & 0 & 1/\sigma \end{pmatrix} \begin{pmatrix} y_{11} - \mu_{11} \\ y_{12} - \mu_{12} \\ y_{13} - \mu_{13} \end{pmatrix}$$

When a non-independence structure is employed, all pairwise correlation values are used to calculate the estimator, i.e., the $\mathrm{Corr}(Y_i)$ matrix will have a value in each row/column combination ($m_{11}$, $m_{12}$, etc.).

$$\mathrm{Corr}(Y_i) = \begin{pmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \\ m_{31} & m_{32} & m_{33} \end{pmatrix}$$

For a Type I variable, for which there is no reason to believe that any of the estimating equations would be invalid, Lai and Small advocate using all values in the correlation matrix rather than the independence structure, as shown above. In this case, when the appropriate matrix multiplication is performed, all possible estimating equations contribute to estimation. Where $T$ indicates the number of functions to estimate and is equal to $t$, the number of repeated measures per subject, all $T^2$ functions will be used, i.e., $3 \times 3 = 9$. However, in the presence of a Type II covariate, Lai and Small postulate that only $T(T+1)/2$ estimating equations will be valid, i.e., $3(3+1)/2 = 6$. They recommend using the following correlation structure, as it defines future values of the covariate as uncorrelated with current residuals.

$$\mathrm{Corr}(Y_i) = \begin{pmatrix} m_{11} & 0 & 0 \\ m_{21} & m_{22} & 0 \\ m_{31} & m_{32} & m_{33} \end{pmatrix}$$
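The patterns of zero and non-zero entries in the matrices above can be encoded as boolean masks, which makes the counts of retained estimating equations easy to check. In the sketch below, the function name `moment_mask` and the labels `"type1"`, `"type2"`, and `"independence"` are hypothetical names chosen for illustration; only the full, lower-triangular, and diagonal patterns themselves come from the text.

```python
import numpy as np

def moment_mask(T, kind):
    """Boolean T x T mask of estimating equations retained for estimation."""
    if kind == "type1":          # every row/column combination retained
        return np.ones((T, T), dtype=bool)
    if kind == "type2":          # lower triangle, including the diagonal
        return np.tril(np.ones((T, T), dtype=bool))
    if kind == "independence":   # diagonal only
        return np.eye(T, dtype=bool)
    raise ValueError(f"unknown kind: {kind}")

T = 3
counts = {kind: int(moment_mask(T, kind).sum())
          for kind in ("type1", "type2", "independence")}
print(counts)
```

For $T = 3$ the masks retain $T^2 = 9$, $T(T+1)/2 = 6$, and $T = 3$ entries, respectively, matching the counts given in the text for the Type I and Type II structures.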
