
One-Step Generalized Estimating Equations in complex surveys with large cluster sizes with application to the United States' Nationwide Inpatient Sample

Stuart R. Lipsitz, Brigham and Women's Hospital, Boston, MA, U.S.A.
Garrett M. Fitzmaurice, Harvard Medical School, Boston, MA, U.S.A.
Debajyoti Sinha, Florida State University, Tallahassee, FL, U.S.A.
Nathanael Hevelone, Brigham and Women's Hospital, Boston, MA, U.S.A.
Edward Giovannucci, Harvard School of Public Health, Boston, MA, U.S.A.
Jim C. Hu, University of California, Los Angeles, CA, U.S.A.
Louis L. Nguyen, Brigham and Women's Hospital, Boston, MA, U.S.A.

December 11,

Summary

Complex survey sampling is widely used to study large finite populations. The designs typically involve a weighted stratified sample of independent clusters, where the units within a cluster are correlated. For national surveys, the clusters tend to be very large. In our motivating example using the 2010 Nationwide Inpatient Sample (NIS) with 8,023,590 patients, the outcome variable is 48-hour post-surgical complication, and our a priori regression model, which includes interactions, has 51 parameters. Consistent parameter estimates can be obtained under the naive assumption of independent subjects. These estimates are inefficient when the intra-cluster correlation (ICC) is high. With a relatively large number of regression parameters, efficient estimation is important. Efficient generalized estimating equations (GEE) use all pairs of observations within a cluster. For the 2010 NIS, there are 92.6 billion pairs of observations, making estimation of the ICC using all pairs computationally prohibitive. We propose an efficient, computationally feasible estimator, based on three innovations: 1) a GEE that does not require matrix inversions; 2) a simpler formula to estimate the ICC that avoids summing over all pairs; 3) a one-step GEE estimator that is asymptotically equivalent to the fully-iterated GEE, but is much less computationally intensive.

Key words: Efficient estimation; Intra-cluster correlation; Population survey; Weighted estimating equations.

1 Introduction

Complex survey sampling is widely used to sample a fraction of a large finite population while accounting for its size and characteristics. Complex survey sampling designs are typically stratified cluster samples. Design-based analyses usually incorporate the weighting, stratification, and clustering variables. In complex surveys of health or medical outcomes, the clusters often comprise households, people living in the same city block, patients with the same doctor, or patients admitted to the same hospital. When estimating the regression parameters of a generalized linear model for any type of outcome (continuous or discrete) from complex survey data with large clusters, the most popular approach, for reasons of computational feasibility, is to naively assume the observations within a cluster are independent to obtain consistent regression parameter estimates (Binder, 1983; Liang and Zeger, 1986; Shah et al., 1997); a consistent estimate of the covariance matrix of these estimates can be obtained using a sandwich estimator (Binder, 1983; Liang and Zeger, 1986; Shah et al., 1997). These estimates, obtained under the assumption of independence, can be inefficient when the intra-cluster correlation (ICC) is relatively large. For datasets with a large number of clusters (say > 1000) and/or a large number of observations within clusters (say > 1000), the loss of efficiency may be relatively inconsequential for estimates of covariate effects of clinical importance. However, the loss of efficiency for the naive independence estimates could be important for rare outcomes or regression models with many a priori covariates or interaction terms, resulting in reduced power and wider confidence intervals than an approach that incorporates the ICC in the estimation procedure.

One popular approach to gain efficiency when estimating the regression parameters of a generalized linear model with clustered data is to use the generalized estimating equations (GEE) approach (Liang and Zeger, 1986), incorporating the ICC under an exchangeable (compound symmetry) correlation structure. However, for the large clusters that may arise in complex sampling, this approach can be impractical because it uses all pairs of outcomes within a cluster to estimate the ICC. Further, GEE as typically implemented requires inversion of matrices of the same dimensions as the cluster size, which could be time consuming for very large cluster sizes. In this paper, we propose a computationally feasible estimator when there are a large number of clusters and large cluster sizes; the proposed estimator is based on the following three innovations: 1) construct a GEE in a form that does not require inversion of any matrices; 2) use a simple formula to estimate the ICC that does not require summing over all pairs of outcomes; 3) use a one-step GEE estimator that is asymptotically equivalent to the fully-iterated GEE, but has the advantage that it is much less computationally intensive.

Our motivating example is from the 2010 Nationwide Inpatient Sample (NIS), developed as part of the United States Healthcare Cost and Utilization Project (HCUP) and sponsored by the United States' Agency for Healthcare Research and Quality. The NIS is a 20% stratified, cluster probability sample that encompasses approximately 8 million acute hospital stays per year from approximately 1000 hospitals in 37 states. It is the largest all-payer inpatient care observational dataset in the U.S. and is representative of approximately 90% of all hospital discharges. In the NIS, hospitals in the sampling frame are stratified by five key characteristics. Then, a random sample of hospitals (clusters) is chosen from each of the strata. The NIS includes all discharges from the selected hospitals. Each hospital has a different probability of being selected in the sample depending on the five characteristics that determine the strata. As a result, each hospital, and thus all discharges within the hospital, are given a weight so that any results can be extrapolated to the entire universe of hospitals in the United States. Because approximately 20% of the universe of hospitals are sampled, the weights (the inverse probabilities of being sampled) are usually close to five. In the 2010 NIS, there are 60 strata and 1049 clusters (hospitals), and the average cluster size is about 7,649 patients (8,023,590/1049).

We use the NIS to compare outcomes between three types of surgery: robotic-assisted laparoscopic surgery, non-robotic-assisted laparoscopic surgery, and open surgery, for three different procedures. Robotic-assisted laparoscopic surgery has been rapidly adopted (Barbash and Glied, 2010) despite the dearth of evidence demonstrating superior outcomes compared to traditional surgical approaches (i.e., non-robotic-assisted laparoscopic surgery and open surgery). The Institute of Medicine has prioritized robotic surgery for comparative effectiveness research (versus the other two types of surgery).

The three procedures considered are radical prostatectomy for prostate cancer, nephrectomy for kidney cancer, and partial nephrectomy for kidney cancer. The binary outcome variable of interest is whether a patient had an in-hospital complication within the first 48 hours post-surgery. Table 1 gives the weighted sum and proportion of complications for each type of surgery and procedure. We can see that open surgery has the highest proportion of complications for each of the three procedures. The predictors of complications for each procedure (besides type of surgery) are: hospital bed size (categorized as small, medium, and large), hospital location (urban versus rural), hospital teaching status (teaching versus not teaching), patient race (white versus other), gender (except for prostate cancer), stage (advanced cancer stage versus other), number of patient comorbidities, and private insurance (yes, no).

When estimating a regression model for a subgroup (each of the three procedures), one can restrict the analysis to the particular subgroup and obtain unbiased parameter estimates. However, because of the complex survey design, restricting the analysis to the subgroup of interest does not yield correct standard errors (Graubard and Korn, 1996). Unlike a simple random sample, for most complex survey designs, the variance-covariance matrix for the parameter estimates is not block diagonal across subgroups. As a result, all subjects must be incorporated in the analysis, even when interest is only in a small subgroup. Correct standard errors for regression models can be obtained by analyzing data from all subjects, and including interaction terms between subgroup (procedure for NIS) and all covariates in the model. For example, using the National Health Interview Survey, Graubard and Korn (1996) show that the 1987 death rate from digestive cancer per 100,000 individuals for whites is estimated to be 69.01; the correct standard error (SE) is obtained from the full sample with interactions, whereas incorrectly restricting the analysis to whites gives an estimated SE of 0.89, an almost 4-fold decrease. Thus, for the NIS analysis, all 8,023,590 patients must be included in the analysis, and subgroup-specific estimates are obtained through the inclusion of interactions between subgroup and covariates. For the NIS analysis, in addition to the three procedures (subgroups) of interest, a fourth subgroup with everybody else (about 8 million patients) must also be created. Note further that, even though the full dataset has 8,023,590 patients, the estimates under the naive assumption of independence could be inefficient for some of the NIS subgroups, resulting in reduced power and wider confidence intervals than an approach that incorporates the ICC in the estimation procedure. In the 2010 NIS, there were 16,641 radical prostatectomies, 10,195 nephrectomies, and 4,323 partial nephrectomies. There are 12 parameters of interest (excluding the intercept) for each procedure, and incorporating the ICC in the analysis can potentially improve the efficiency.
In Section 2, we briefly describe the generalized estimating equations under an exchangeable (compound symmetry) correlation structure, and show how they can be expressed in a form that does not require inversion of any matrices. In Section 3, we propose a consistent estimator of the ICC that does not require summing over all pairs of outcomes. In Section 4, we discuss a one-step GEE estimator that is asymptotically equivalent to the fully-iterated GEE but is far less computationally intensive. In Section 5, we apply the proposed approach to regression analyses of the NIS data and compare computation time to standard fully-iterated GEE.

2 Generalized estimating equations

In general, the sampling design can be multi-stage, and each subject in the sample has a different probability of being selected into the sample, so that each sampled subject can have a different weight determined by the inverse of the probability of being selected into the sample. Here, we focus on the most common design in which the finite population consists of $h = 1, \ldots, H$ strata, $l = 1, \ldots, N_h$ population clusters within stratum $h$, and $m = 1, \ldots, M_{hl}$ population subjects within cluster $l$ of stratum $h$. Consider a regression model in which the data collected on the $m$th subject from the $l$th cluster in stratum $h$ are the outcome variable $Y_{hlm}$ and a vector $x_{hlm} = (x_{hlm1}, \ldots, x_{hlmK})'$ of $K$ covariates. Let $\mu_{hlm} = E(Y_{hlm} \mid x_{hlm}, \beta)$ denote the expectation of the outcome $Y_{hlm}$ given the covariates and regression parameters $\beta$. In many applications, it is of interest to estimate the regression coefficients $\beta$ from a generalized linear regression model

$$E(Y_{hlm} \mid x_{hlm}, \beta) = \mu_{hlm} = \mu_{hlm}(\beta) = g(x_{hlm}'\beta), \tag{2.1}$$

where $g(\cdot)$ is a specified function, such as $g(a) = a$ for linear regression and $g(a) = \exp(a)/[1 + \exp(a)]$ for logistic regression. Further, the variance of $Y_{hlm}$ given the covariates is of the general form

$$\mathrm{Var}(Y_{hlm} \mid x_{hlm}, \beta) = v(\mu_{hlm})\,\phi, \tag{2.2}$$

where $v(\mu_{hlm})$ can be any function of $\mu_{hlm}$ that is always positive and $\phi$ is a scale parameter that may be fixed or may need to be estimated. Typical functions include $v(\mu_{hlm}) = \mu_{hlm}(1 - \mu_{hlm})$ for binary data, $v(\mu_{hlm}) = \mu_{hlm}$ for count data with Poisson variability, and $v(\mu_{hlm}) = \mu_{hlm}^2$ for positive data with exponential variability.

In the NIS study, if a cluster (hospital) is included in the complex survey, then all patients within the cluster are sampled. Thus, to indicate which clusters are sampled from the population of clusters, we define the indicator random variable

$$\delta_{hl} = \begin{cases} 1 & \text{if the } l\text{th cluster in stratum } h \text{ is selected into the sample,} \\ 0 & \text{otherwise.} \end{cases}$$

Further, we let $p_{hl}$ denote the probability that the cluster is selected into the survey. In the NIS, $p_{hl}$ is approximately 0.2, although it can vary depending on hospital characteristics.

We let $Y_{hl} = [Y_{hl1}, \ldots, Y_{hlM_{hl}}]'$ be an $M_{hl} \times 1$ vector containing the outcomes for the $M_{hl}$ subjects from the $l$th cluster in stratum $h$; $X_{hl} = [x_{hl1}, \ldots, x_{hlM_{hl}}]'$ represent the $M_{hl} \times K$ matrix containing the covariate vectors for the subjects from the $l$th cluster in stratum $h$; and $\mu_{hl} = [\mu_{hl1}, \ldots, \mu_{hlM_{hl}}]'$ be an $M_{hl} \times 1$ vector containing the means for the $M_{hl}$ subjects from the $l$th cluster in stratum $h$.

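We assume the correlation between any two observations in the same cluster is exchangeable, i.e., $\rho = \mathrm{Corr}(Y_{hlj}, Y_{hlk} \mid X_{hl})$; $\rho$ is often referred to as the intra-class correlation coefficient (ICC). The structure of the correlation matrix of $Y_{hl}$ is often called compound symmetry (or exchangeable):

$$R_{hl} = \mathrm{Corr}(Y_{hl}) = (1 - \rho)\, I_{hl} + \rho\, J_{hl} J_{hl}',$$

where $I_{hl}$ is an $M_{hl} \times M_{hl}$ identity matrix and $J_{hl}$ is an $M_{hl} \times 1$ vector of 1's. In this case, the covariance matrix of $Y_{hl}$ is $V_{hl} = A_{hl}^{1/2} R_{hl} A_{hl}^{1/2}$, where $A_{hl}$ is a diagonal matrix with diagonal elements $\mathrm{Var}(Y_{hlm} \mid x_{hlm}, \beta) = v(\mu_{hlm})\phi$, with $v(\mu_{hlm})$ specified entirely by the marginal distributions, i.e., by $\beta$.

To estimate $\beta$, Liang and Zeger (1986) consider generalized estimating equations of the form

$$u(\hat\beta) = \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, D_{hl}'\, V_{hl}^{-1}\, [Y_{hl} - \mu_{hl}(\hat\beta)] = 0, \tag{2.3}$$

where $D_{hl} = d[\mu_{hl}(\beta)]/d\beta'$, and $V_{hl}$ is the covariance matrix of $Y_{hl}$ described above. Note that $V_{hl}$ is a function of the unknown correlation $\rho$, which must be estimated; however, the scale parameter $\phi$ can be ignored when solving this equation for $\beta$. Here, also, $w_{hl} = \delta_{hl}/p_{hl}$, so that each cluster in the sample is weighted by the inverse probability of being selected.

We first simplify (2.3) by simplifying $V_{hl}^{-1}$. Note that $V_{hl}^{-1} = A_{hl}^{-1/2} R_{hl}^{-1} A_{hl}^{-1/2}$. Using matrix algebra results,

$$R_{hl}^{-1} = \frac{1}{1 - \rho}\, I_{hl} - \frac{\rho}{(1 - \rho)[(1 - \rho) + M_{hl}\rho]}\, J_{hl} J_{hl}',$$

and thus

$$V_{hl}^{-1} = \frac{1}{1 - \rho}\, A_{hl}^{-1} - \frac{\rho}{(1 - \rho)[(1 - \rho) + M_{hl}\rho]}\, (A_{hl}^{-1/2} J_{hl})(A_{hl}^{-1/2} J_{hl})'.$$

Thus, the estimating function in (2.3) becomes

$$u(\beta) = \sum_{h=1}^{H} \sum_{l=1}^{N_h} \frac{w_{hl}}{1 - \rho}\, D_{hl}' A_{hl}^{-1} [Y_{hl} - \mu_{hl}] - \sum_{h=1}^{H} \sum_{l=1}^{N_h} \frac{w_{hl}\,\rho}{(1 - \rho)[(1 - \rho) + M_{hl}\rho]}\, D_{hl}' (A_{hl}^{-1/2} J_{hl})(A_{hl}^{-1/2} J_{hl})' [Y_{hl} - \mu_{hl}].$$

Further, without loss of generality, since we are setting this equal to 0 and solving for $\beta$, we can multiply the estimating function by $(1 - \rho)$ to obtain

$$u(\beta) = \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, D_{hl}' A_{hl}^{-1} [Y_{hl} - \mu_{hl}] - \sum_{h=1}^{H} \sum_{l=1}^{N_h} \frac{w_{hl}\,\rho}{(1 - \rho) + M_{hl}\rho}\, D_{hl}' (A_{hl}^{-1/2} J_{hl})(A_{hl}^{-1/2} J_{hl})' [Y_{hl} - \mu_{hl}]. \tag{2.4}$$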
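As a numerical sanity check on the closed-form inverses used above, the following minimal sketch compares them with a direct matrix inversion for one small cluster. The cluster size, the value of rho, and the binary-type variance function v = mu(1 - mu) with phi = 1 are illustrative assumptions, not values from the paper.

```python
import numpy as np

def exchangeable_inverse(M, rho):
    """Closed-form inverse of R = (1 - rho) I + rho J J' for cluster size M."""
    identity = np.eye(M)
    ones_outer = np.ones((M, M))  # J J'
    return identity / (1.0 - rho) - rho / ((1.0 - rho) * ((1.0 - rho) + M * rho)) * ones_outer

# Illustrative (assumed) cluster size and ICC.
M, rho = 6, 0.3
R = (1.0 - rho) * np.eye(M) + rho * np.ones((M, M))
max_err_R = np.max(np.abs(exchangeable_inverse(M, rho) - np.linalg.inv(R)))

# V = A^{1/2} R A^{1/2} with A = diag(v); here v = mu (1 - mu), phi = 1 (assumed).
rng = np.random.default_rng(0)
mu = rng.uniform(0.1, 0.9, size=M)
v = mu * (1.0 - mu)
V = np.sqrt(v)[:, None] * R * np.sqrt(v)[None, :]

a = 1.0 / np.sqrt(v)  # the vector A^{-1/2} J
V_inv_closed = np.diag(1.0 / v) / (1.0 - rho) \
    - rho / ((1.0 - rho) * ((1.0 - rho) + M * rho)) * np.outer(a, a)
max_err_V = np.max(np.abs(V_inv_closed - np.linalg.inv(V)))

print(max_err_R, max_err_V)  # both should be at rounding-error level
```

The closed form needs only the diagonal of A and one rank-one term, which is what allows the estimating function to avoid inverting an M x M matrix for each cluster.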

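The second sum in the estimating equation can be simplified further by noting that, with $v_{hlm} = v(\mu_{hlm})$,

$$(A_{hl}^{-1/2} J_{hl})' [Y_{hl} - \mu_{hl}] = \sum_{m=1}^{M_{hl}} (Y_{hlm} - \mu_{hlm})/\sqrt{v_{hlm}}$$

and

$$D_{hl}' (A_{hl}^{-1/2} J_{hl}) = \sum_{m=1}^{M_{hl}} d_{hlm}/\sqrt{v_{hlm}},$$

where $d_{hlm} = d[\mu_{hlm}(\beta)]/d\beta$ is a given column of $D_{hl}'$. Then, (2.4) becomes

$$u(\beta) = \sum_{h=1}^{H} \sum_{l=1}^{N_h} \sum_{m=1}^{M_{hl}} w_{hl}\, d_{hlm} (Y_{hlm} - \mu_{hlm})/v_{hlm} - \sum_{h=1}^{H} \sum_{l=1}^{N_h} \frac{w_{hl}\,\rho}{(1 - \rho) + M_{hl}\rho} \left[ \sum_{q=1}^{M_{hl}} d_{hlq}/\sqrt{v_{hlq}} \right] \left[ \sum_{m=1}^{M_{hl}} (Y_{hlm} - \mu_{hlm})/\sqrt{v_{hlm}} \right]. \tag{2.5}$$

Note that this estimating equation does not involve inverting any matrices. The first sum in the estimating equation,

$$u_I(\beta) = \sum_{h=1}^{H} \sum_{l=1}^{N_h} \sum_{m=1}^{M_{hl}} w_{hl}\, d_{hlm} (Y_{hlm} - \mu_{hlm})/v_{hlm}, \tag{2.6}$$

is just an estimating equation under the naive assumption of independence of observations in a cluster, and yields consistent, but possibly inefficient, estimators of $\beta$. Thus, the second sum in the estimating equation is where efficiency is gained when incorporating the ICC ($\rho$). Because $\rho$ is unknown, it must be estimated and plugged into (2.5) to realize potential efficiency gains. Any consistent estimator of $\rho$ will yield the same asymptotic efficiency of the resulting estimator of $\beta$. In the following section, we consider a very simple estimator that is computationally feasible with large clusters.

Regardless of whether $\rho$ is known or consistently estimated, using Taylor series expansions similar to Liang and Zeger (1986) and Prentice (1988), and assuming that the regression model for $\mu_{hl}$ is correctly specified, the solution $\hat\beta$ to $u(\hat\beta) = 0$ is consistent for $\beta$; in addition, $N^{1/2}(\hat\beta - \beta)$ has an asymptotic distribution which is multivariate normal with mean vector 0. The asymptotic covariance matrix of $\hat\beta$ can be consistently estimated by (Binder, 1983)

$$\left[ \sum_{h=1}^{H} \sum_{l=1}^{N_h} \hat W_{hl} \right]^{-1} \left[ \sum_{h=1}^{H} \sum_{l=1}^{N_h} \delta_{hl}\, \{u_{hl}(\hat\beta) - \bar u_{h}(\hat\beta)\}\{u_{hl}(\hat\beta) - \bar u_{h}(\hat\beta)\}' \right] \left[ \sum_{h=1}^{H} \sum_{l=1}^{N_h} \hat W_{hl} \right]^{-1}, \tag{2.7}$$

where

$$u_{hl}(\beta) = \sum_{m=1}^{M_{hl}} w_{hl}\, d_{hlm} (Y_{hlm} - \mu_{hlm})/v_{hlm} - \frac{w_{hl}\,\rho}{(1 - \rho) + M_{hl}\rho} \left[ \sum_{q=1}^{M_{hl}} d_{hlq}/\sqrt{v_{hlq}} \right] \left[ \sum_{m=1}^{M_{hl}} (Y_{hlm} - \mu_{hlm})/\sqrt{v_{hlm}} \right]$$

is the sum of the score vectors from the subjects in cluster $l$ of stratum $h$;

$$\bar u_{h}(\hat\beta) = \frac{1}{n_h} \sum_{l=1}^{N_h} u_{hl}(\hat\beta), \qquad n_h = \sum_{l=1}^{N_h} \delta_{hl},$$

is the sample mean of the $u_{hl}(\hat\beta)$'s over the $n_h$ sampled clusters in stratum $h$; and

$$W_{hl} = W_{hl}(\beta, \rho) = E\left[ -\frac{d[u_{hl}(\beta)]}{d\beta'} \right] = \sum_{m=1}^{M_{hl}} w_{hl}\, d_{hlm} d_{hlm}'/v_{hlm} - \frac{w_{hl}\,\rho}{(1 - \rho) + M_{hl}\rho} \left[ \sum_{q=1}^{M_{hl}} d_{hlq}/\sqrt{v_{hlq}} \right] \left[ \sum_{q=1}^{M_{hl}} d_{hlq}/\sqrt{v_{hlq}} \right]'. \tag{2.8}$$

Also, $\hat W_{hl}$ and $u_{hl}(\hat\beta)$ are evaluated at $\hat\beta$ and a consistent estimate of $\rho$; in the following section we consider estimation of $\rho$.

3 Estimating the ICC

Denoting the true residual for the $m$th subject from the $l$th cluster in stratum $h$ by $e_{hlm} = (Y_{hlm} - \mu_{hlm})/\sqrt{v_{hlm}}$, then by definition, for $m \neq m'$, $E(w_{hl}\, e_{hlm} e_{hlm'}) = \rho$. This suggests that a consistent method of moments estimator of $\rho$ (Liang and Zeger, 1986) can be obtained via

$$\hat\rho = \left[ \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, M_{hl}(M_{hl} - 1)/2 \right]^{-1} \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl} \sum_{m < m'} \hat e_{hlm}\, \hat e_{hlm'}, \tag{3.9}$$

where $\hat e_{hlm} = (Y_{hlm} - \hat\mu_{hlm})/\sqrt{\hat v_{hlm}}$. Note that the estimator given by (3.9) requires the sum of $\sum_{h=1}^{H} \sum_{l=1}^{N_h} M_{hl}(M_{hl} - 1)/2$ pairs. For NIS, there are approximately 92.6 billion pairs, so that this formulation of (3.9) could be impractical.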
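The matrix-free cluster contributions from Section 2, $u_{hl}(\beta)$ in (2.5) and $W_{hl}$ in (2.8), can be checked against their matrix counterparts $(1-\rho) w_{hl} D_{hl}' V_{hl}^{-1}(Y_{hl} - \mu_{hl})$ and $(1-\rho) w_{hl} D_{hl}' V_{hl}^{-1} D_{hl}$. The sketch below does this for a single simulated cluster under a logistic model, where $d_{hlm} = v_{hlm} x_{hlm}$. The function name, data, and parameter values are hypothetical and chosen only for illustration; this is not the authors' SAS macro.

```python
import numpy as np

def cluster_score_and_W(X, y, beta, rho, w=1.0):
    """Matrix-free u_hl (2.5) and W_hl (2.8) for one cluster, logistic model."""
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    v = mu * (1.0 - mu)
    M = len(y)
    # For the logistic model d_m = v_m x_m, so d_m / sqrt(v_m) = sqrt(v_m) x_m.
    sum_d = (X * np.sqrt(v)[:, None]).sum(axis=0)
    sum_e = ((y - mu) / np.sqrt(v)).sum()
    c = rho / ((1.0 - rho) + M * rho)
    u = w * (X.T @ (y - mu) - c * sum_d * sum_e)
    W = w * ((X * v[:, None]).T @ X - c * np.outer(sum_d, sum_d))
    return u, W

# Hypothetical cluster: 8 subjects, intercept plus two covariates.
rng = np.random.default_rng(1)
M, rho, w = 8, 0.2, 5.0
X = np.column_stack([np.ones(M), rng.normal(size=(M, 2))])
beta = np.array([-0.5, 0.4, -0.3])
y = rng.integers(0, 2, size=M).astype(float)

u_free, W_free = cluster_score_and_W(X, y, beta, rho, w)

# Matrix version, using an explicit inverse of V = A^{1/2} R A^{1/2}.
mu = 1.0 / (1.0 + np.exp(-X @ beta))
v = mu * (1.0 - mu)
D = X * v[:, None]                                   # M x K matrix d mu / d beta'
R = (1.0 - rho) * np.eye(M) + rho * np.ones((M, M))
V_inv = np.linalg.inv(np.sqrt(v)[:, None] * R * np.sqrt(v)[None, :])
u_mat = (1.0 - rho) * w * D.T @ V_inv @ (y - mu)
W_mat = (1.0 - rho) * w * D.T @ V_inv @ D

print(np.allclose(u_free, u_mat), np.allclose(W_free, W_mat))
```

Only running sums over the subjects in a cluster are needed in the matrix-free version, so its cost grows linearly in the cluster size rather than with the cost of inverting an $M_{hl} \times M_{hl}$ matrix.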

However, note that

$$\left( \sum_{m=1}^{M_{hl}} \hat e_{hlm} \right)^2 = \sum_{m=1}^{M_{hl}} \hat e_{hlm}^2 + 2 \sum_{m < m'} \hat e_{hlm}\, \hat e_{hlm'}.$$

Thus,

$$2 \sum_{m < m'} \hat e_{hlm}\, \hat e_{hlm'} = \left( \sum_{m=1}^{M_{hl}} \hat e_{hlm} \right)^2 - \sum_{m=1}^{M_{hl}} \hat e_{hlm}^2,$$

which requires the sum of $2M_{hl}$ terms instead of $M_{hl}(M_{hl} - 1)/2$ terms. Thus, we suggest using the following formulation to obtain an estimate identical to that given in (3.9),

$$\hat\rho = \left[ \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, M_{hl}(M_{hl} - 1)/2 \right]^{-1} \frac{1}{2} \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl} \left[ \left( \sum_{m=1}^{M_{hl}} \hat e_{hlm} \right)^2 - \sum_{m=1}^{M_{hl}} \hat e_{hlm}^2 \right]; \tag{3.10}$$

this expression has $2 \sum_{h=1}^{H} \sum_{l=1}^{N_h} M_{hl}$ (2 times the total sample size) terms instead of $\sum_{h=1}^{H} \sum_{l=1}^{N_h} M_{hl}(M_{hl} - 1)/2$ terms. For NIS, this translates into approximately 16 million terms using our proposed approach instead of 92.6 billion terms using all pairs.

4 One-step estimator

Ordinarily, one iterates between estimating $\beta$ and $\rho$ until convergence. However, if the estimate of $\beta$ under independence is initially used to estimate $\rho$, and the resulting estimate of $\rho$ is plugged back into (2.5) and $u(\hat\beta) = 0$ is solved for $\beta$ (treating the estimate of $\rho$ as known), this yields an estimator that is asymptotically equivalent to the fully iterated GEE estimator. The proof of this result follows along the lines of that discussed in Lehmann (1983, p. 422) for creating a one-step asymptotically efficient estimator. The use of a one-step GEE estimator has the advantage that it is much less computationally intensive than the standard fully-iterated GEE for large cluster sizes.

In particular, a one-step estimator of $\beta$ is formed by using one iteration of a Fisher scoring algorithm for obtaining a solution to $u(\hat\beta) = 0$, with the estimate under the naive assumption of independence as the starting value. The one-step estimator is obtained via

$$\hat\beta_{1GEE} = \hat\beta_I + [W(\hat\beta_I, \hat\rho_I)]^{-1}\, u(\hat\beta_I), \tag{4.11}$$

where $\hat\beta_{1GEE}$ is the one-step GEE estimator, $\hat\beta_I$ is the estimator under the naive assumption of independence (the solution to $u_I(\hat\beta_I) = 0$ in (2.6)), $\hat\rho_I$ is calculated by using $\hat\beta_I$ to estimate $\mu_{hlm}$ and $v_{hlm}$ in $\hat e_{hlm}$ in (3.10), and

$$W(\beta, \rho) = \sum_{h=1}^{H} \sum_{l=1}^{N_h} W_{hl}(\beta, \rho).$$

Note further, since $u_I(\hat\beta_I) = 0$, that

$$u(\hat\beta_I) = u_I(\hat\beta_I) - \sum_{h=1}^{H} \sum_{l=1}^{N_h} \frac{w_{hl}\,\hat\rho_I}{(1 - \hat\rho_I) + M_{hl}\hat\rho_I} \left[ \sum_{q=1}^{M_{hl}} d_{hlq}/\sqrt{v_{hlq}} \right] \left[ \sum_{m=1}^{M_{hl}} (Y_{hlm} - \mu_{hlm})/\sqrt{v_{hlm}} \right] = - \sum_{h=1}^{H} \sum_{l=1}^{N_h} \frac{w_{hl}\,\hat\rho_I}{(1 - \hat\rho_I) + M_{hl}\hat\rho_I} \left[ \sum_{q=1}^{M_{hl}} d_{hlq}/\sqrt{v_{hlq}} \right] \left[ \sum_{m=1}^{M_{hl}} (Y_{hlm} - \mu_{hlm})/\sqrt{v_{hlm}} \right],$$

so that (4.11) simplifies to

$$\hat\beta_{1GEE} = \hat\beta_I - [W(\hat\beta_I, \hat\rho_I)]^{-1} \sum_{h=1}^{H} \sum_{l=1}^{N_h} \frac{w_{hl}\,\hat\rho_I}{(1 - \hat\rho_I) + M_{hl}\hat\rho_I} \left[ \sum_{q=1}^{M_{hl}} d_{hlq}/\sqrt{v_{hlq}} \right] \left[ \sum_{m=1}^{M_{hl}} (Y_{hlm} - \mu_{hlm})/\sqrt{v_{hlm}} \right], \tag{4.12}$$

with $\mu_{hlm}$, $v_{hlm}$, and $d_{hlm}$ evaluated at $\hat\beta_I$. The asymptotic covariance matrix of $\hat\beta_{1GEE}$ can be consistently estimated using (2.7) evaluated at $\hat\beta_{1GEE}$.

Thus, in general, our approach to make the estimation feasible is threefold: 1) express the GEE in a form that does not require inversion of any matrices; 2) use a simple formula to estimate the ICC that does not require summing over all pairs of outcomes; 3) use a one-step GEE estimator.

5 Application to 2010 Nationwide Inpatient Sample

In this section, we present results for analyses of data from the NIS study discussed in the Introduction. The binary outcome for a patient is a complication within the first 48 hours post-surgery (1 if complications, 0 otherwise). We fit a logistic regression model for this outcome. The hospital-level covariates of interest are: medium bed size (1 if medium bed size, 0 otherwise), large bed size (1 if large bed size, 0 otherwise), urban (1 if urban, 0 otherwise), and teaching (1 if teaching, 0 otherwise). The patient-level covariates of interest are: Caucasian (1 if Caucasian, 0 otherwise), comorbidities (number of patient comorbidities), age (in years), stage (1 if advanced cancer stage, 0 otherwise), insurance (1 if private insurance, 0 otherwise), robotic (1 if robotic surgery, 0 otherwise), laparoscopic (1 if laparoscopic surgery, 0 otherwise), and female gender (except for prostatectomy).

Table 2 gives the estimates of $\beta$ for the three surgical procedures naively assuming independence ("independence working correlation") with a robust sandwich variance estimate, as well as our one-step estimator with an exchangeable (ICC) correlation structure (also with a robust sandwich variance estimate). In general, an approximate estimate of the asymptotic relative efficiency of the estimates can be obtained by comparing the robust variance estimators of $\hat\beta$ under the two different correlation structures. From Table 2, we see that the estimated standard errors of $\hat\beta$ under exchangeability can be discernibly smaller than those under independence. For example, the estimated relative efficiency of the robotic effect is 71% for prostatectomy, 58% for nephrectomy, and 48% for partial nephrectomy. Similar gains in estimated efficiency with an exchangeable correlation structure are seen for most parameters in the model.

Based on the results from our one-step estimator with an exchangeable (ICC) correlation structure, for all three surgical procedures robotic surgery significantly (P < 0.05) reduces complications when compared to open surgery. In particular, for prostatectomy, nephrectomy, and partial nephrectomy, the estimated odds ratios of complications for robotic versus open surgery are exp(−0.310) ≈ 0.73, exp(−0.269) ≈ 0.76, and exp(−0.664) ≈ 0.51, respectively. Adjusted for the other factors, robotic surgery (versus open surgery) appears to lead to the greatest reduction in the odds of complications for partial nephrectomy. For nephrectomy and partial nephrectomy, laparoscopic surgery significantly (P < 0.05) reduces complications versus open surgery; however, there appears to be no difference in rates of complications for prostatectomy. In particular, for prostatectomy, nephrectomy, and partial nephrectomy, the estimated odds ratios of complications for laparoscopic versus open surgery are exp(−0.002) ≈ 1.0, exp(−0.206) ≈ 0.81, and exp(−0.559) ≈ 0.57, respectively. Adjusted for the other factors, similar to robotic surgery, laparoscopic surgery (versus open surgery) appears to lead to the greatest reduction in the odds of complications for partial nephrectomy.
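The estimation steps of Sections 3 and 4 can be sketched end to end on simulated data: fit the weighted logistic model under independence, estimate the ICC with the pair-free formula (3.10), and take the single scoring step (4.11). Everything in this sketch (function names, the random-intercept simulation, cluster sizes, weights) is a hypothetical illustration under a logistic model, not the authors' SAS macro or the NIS data; the brute-force version of (3.9) is included only to confirm that the two ICC formulas agree.

```python
import numpy as np

def fit_independence(X, y, w, n_iter=25):
    """Weighted logistic regression by Newton-Raphson: solves u_I(beta) = 0 in (2.6)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (w * (y - mu))
        info = (X * (w * mu * (1.0 - mu))[:, None]).T @ X
        beta = beta + np.linalg.solve(info, score)
    return beta

def icc_fast(e, cluster, w_cluster):
    """ICC estimate (3.10): per-cluster sums only, no loop over pairs."""
    M = np.bincount(cluster, minlength=len(w_cluster))
    s1 = np.bincount(cluster, weights=e, minlength=len(w_cluster))
    s2 = np.bincount(cluster, weights=e ** 2, minlength=len(w_cluster))
    return 0.5 * np.sum(w_cluster * (s1 ** 2 - s2)) / np.sum(w_cluster * M * (M - 1) / 2.0)

def icc_pairs(e, cluster, w_cluster):
    """ICC estimate (3.9) by brute force over all within-cluster pairs."""
    num = den = 0.0
    for c, w_c in enumerate(w_cluster):
        ec = e[cluster == c]
        num += w_c * np.sum(np.triu(np.outer(ec, ec), k=1))  # sum over m < m'
        den += w_c * len(ec) * (len(ec) - 1) / 2.0
    return num / den

def one_step_gee(X, y, cluster, w_cluster):
    """One-step estimator (4.11) for a logistic model, where d_m = v_m x_m."""
    w = w_cluster[cluster]
    beta_I = fit_independence(X, y, w)
    mu = 1.0 / (1.0 + np.exp(-X @ beta_I))
    v = mu * (1.0 - mu)
    e = (y - mu) / np.sqrt(v)
    rho = icc_fast(e, cluster, w_cluster)
    M = np.bincount(cluster, minlength=len(w_cluster))
    c = w_cluster * rho / ((1.0 - rho) + M * rho)
    # Per-cluster sums of d_q / sqrt(v_q) = sqrt(v_q) x_q and of the residuals.
    S_d = np.zeros((len(w_cluster), X.shape[1]))
    np.add.at(S_d, cluster, X * np.sqrt(v)[:, None])
    s_e = np.bincount(cluster, weights=e, minlength=len(w_cluster))
    u = X.T @ (w * (y - mu)) - S_d.T @ (c * s_e)               # (2.5) at beta_I
    W = (X * (w * v)[:, None]).T @ X - (S_d * c[:, None]).T @ S_d  # sum of (2.8)
    return beta_I, rho, beta_I + np.linalg.solve(W, u), e

# Simulated clustered binary data with a random intercept (all values assumed).
rng = np.random.default_rng(2010)
n_clusters = 100
sizes = rng.integers(5, 31, size=n_clusters)
cluster = np.repeat(np.arange(n_clusters), sizes)
w_cluster = rng.uniform(4.0, 6.0, size=n_clusters)  # weights near 5, as in a 20% sample
b = rng.normal(0.0, 1.5, size=n_clusters)
x1 = rng.normal(size=cluster.size)
X = np.column_stack([np.ones(cluster.size), x1])
p = 1.0 / (1.0 + np.exp(-(-0.5 + 0.8 * x1 + b[cluster])))
y = (rng.uniform(size=cluster.size) < p).astype(float)

beta_I, rho_hat, beta_1, resid = one_step_gee(X, y, cluster, w_cluster)
rho_pairs = icc_pairs(resid, cluster, w_cluster)
print(beta_I, rho_hat, rho_pairs, beta_1)
```

In this sketch the only pass over pairs is the optional check in `icc_pairs`; the one-step estimate itself uses per-cluster sums and a single K x K linear solve.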
In general, the patient-level factors (race, gender, comorbidities) are more significantly related to complications than the hospital-level predictors.

Table 3 provides a comparison of the computation time (real, not CPU) it takes for various approaches to fitting the model as implemented in different SAS procedures (SAS Institute Inc., 2014). It can be seen that the standard (weighted) logistic regression maximum likelihood estimation under naive independence, and without a robust variance correction (as implemented using PROC LOGISTIC in SAS), is the fastest (2.1 minutes); standard logistic regression maximum likelihood estimation under naive independence but with a sandwich variance estimate (as implemented using PROC SURVEYLOGISTIC in SAS) takes 4.3 minutes. Thus, calculation of the robust sandwich variance estimate for standard logistic regression takes an additional 2.2 minutes (the algorithms for calculating the standard logistic regression estimates in PROC LOGISTIC and PROC SURVEYLOGISTIC are the same). Use of a SAS macro for implementing our proposed 1-step approach takes 11.8 minutes, as opposed to the fully-iterated estimation (as implemented in PROC GENMOD in SAS) with exchangeable correlation, which takes 8.1 hours. We note that use of only a single iteration of PROC GENMOD with exchangeable correlation should be comparable to the 1-step approach because PROC GENMOD in SAS uses the standard logistic regression maximum likelihood estimates under naive independence as starting values. Thus, direct comparison of our 1-step approach (11.8 minutes) to 1-step of PROC GENMOD (1.4 hours) emphasizes the potential advantage of the proposed method; 1-step of PROC GENMOD takes approximately 7 times longer than our proposed approach. We attribute the advantage in computation time to the following: 1) we have expressed the GEE in a form that does not require inversion of any matrices; and 2) we use a simple formula to estimate the ICC that does not require summing over all pairs of outcomes within a cluster. We note that we fit the models on a 64-bit PC workstation with an Intel Xeon CPU, 16.0 GB of RAM, and an SSD hard drive; this PC workstation is much faster than a typical desktop PC. Finally, although the results presented here were obtained using SAS, we also attempted to use the same dataset to obtain the fully-iterated GEE estimates using both gee in R (Carey, 2002) and the xtgee command in Stata (StataCorp, 2013); both packages were unable to produce estimates of the model parameters due to insufficient memory.

6 Conclusion

In this paper we propose a one-step estimating equations approach with an exchangeable correlation (constant ICC) for gaining efficiency in analyses of large complex surveys. When applied to data from the 2010 NIS we found that, even with sample sizes greater than 1000 in each subgroup, there were substantial efficiency gains with the use of our approach, leading to discernibly narrower confidence intervals.
In addition, we found that the one-step estimates were almost identical to the fully-iterated estimates obtained from PROC GENMOD in SAS, but took only a fraction of the time to compute (11.8 minutes versus 8.1 hours for PROC GENMOD in SAS). Furthermore, the one-step estimator is asymptotically equivalent to the fully iterated estimator. We attribute the very discernible computational advantage of our proposed approach to the following three factors: 1) we have constructed the GEE in a form that does not require inversion of any matrices; 2) we use a simple formula to estimate the ICC that does not require summing over all pairs of outcomes within a cluster; 3) we only use a one-step GEE estimator instead of fully iterating. Our results also suggest that if a fully-iterated estimate is preferred, then it is still advantageous to use our approach with equation (4.11) to fully iterate. Finally, we note that although the GEE estimates obtained under exchangeability, as produced by many software packages, are consistent, the robust or sandwich variance estimates do not take the stratification into account. As a result, we cannot recommend that standard implementations of the GEE be used for hypothesis testing or the construction of confidence intervals, even in settings where the number of observations is small enough that computational time is not prohibitive. A SAS macro implementing our proposed one-step estimating equations approach with an exchangeable correlation is available from the first author upon request.

Acknowledgments

We thank Caprice Greenberg for advice on the analysis of the data from the Nationwide Inpatient Sample. We are grateful for the support provided by grants MH , CA and CA from the U.S. National Institutes of Health.

References

Barbash, G.I. and Glied, S.A. (2010). New technology and health care costs: the case of robot-assisted surgery. N Engl J Med 363, 701.

Binder, D. (1983). On the variance of asymptotically normal estimators from complex surveys. International Statistical Review, 51.

Carey, V.J. (2002). gee: Generalized Estimation Equation Solver. R package version ; Ported from S-PLUS to R by Thomas Lumley (versions 3.13 and 4.4) and Brian Ripley (version 4.13).

Lehmann, E.L. (1983). Theory of Point Estimation. New York: Wiley.

Liang, K.Y. and Zeger, S.L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73.

Prentice, R.L. (1988). Correlated binary regression with covariates specific to each binary observation. Biometrics, 44.

SAS Institute Inc. (2014). SAS OnlineDoc, Version 9.3. Cary, NC.

Shah, B.V., Barnwell, B.G. and Bieler, G.S. (1997). SUDAAN User's Manual, Release 7.5. Research Triangle Park: Research Triangle Institute.

StataCorp (1997). Stata Statistical Software: Release 5.0. College Station, TX: Stata Corporation.

StataCorp (2013). Stata 13 Longitudinal/Panel-Data Reference Manual. College Station, TX: Stata Press.

Table 1. Complication status by procedure and surgery type, weighted counts (N) and column percentages (%).

                                     Type of Surgery
Procedure            Complication    Robotic         Laparoscopic    Open
                                     N (%)           N (%)           N (%)
Prostatectomy        No              51694 (93%)     336 (90%)       (89%)
                     Yes             4161 (7%)       39 (10%)        2623 (11%)
Nephrectomy          No              2327 (67%)      5053 (68%)      (61%)
                     Yes             1139 (33%)      2339 (32%)      (39%)
Partial Nephrectomy  No              6527 (81%)      984 (80%)       7822 (68%)
                     Yes             1492 (19%)      246 (20%)       3604 (32%)

Table 2. Comparison of logistic regression parameter estimates for the post-operative surgical complications data, stratified by procedure. Columns: Effect, Working Correlation (Ind = independence, ICC = exchangeable), Estimate, SE, Z-statistic, P-value. Only the P-values reported as <0.001 are legible here; they are listed by effect and working correlation.

Effect               P-value (Ind)   P-value (ICC)

Prostatectomy:
Medium bed
Large bed
Urban
Teaching
Caucasian                            <0.001
Comorbidities        <0.001          <0.001
Age                                  <0.001
Advanced Stage
Private Insurance
Robotic              <0.001          <0.001
Laparoscopic

Nephrectomy:
Medium bed
Large bed
Urban
Teaching
Caucasian
Female               <0.001          <0.001
Comorbidities        <0.001          <0.001
Age
Advanced Stage       <0.001          <0.001
Private Insurance    <0.001          <0.001
Robotic
Laparoscopic

Partial Nephrectomy:
Medium bed
Large bed
Urban
Teaching
Caucasian
Female               <0.001          <0.001
Comorbidities        <0.001          <0.001
Age
Advanced Stage
Private Insurance
Robotic              <0.001          <0.001
Laparoscopic                         <0.001

Table 3. Computation times (real, not CPU) required to obtain the logistic regression parameter estimates.

SAS Procedure                       Time (in minutes)
Logistic                            2.1
Surveylogistic                      4.3
SAS macro for 1-step estimator      11.8
Genmod ICC 1-step (a)               84.0
Genmod ICC fully-iterated           about 486 (8.1 hours)

(a) PROC GENMOD in SAS uses the independence estimate as starting values; as a result, 1-step of PROC GENMOD should be comparable to the proposed 1-step estimator.
