Stuart R. Lipsitz Brigham and Women's Hospital, Boston, MA, U.S.A. Garrett M. Fitzmaurice Harvard Medical School, Boston, MA, U.S.A.


One-Step Generalized Estimating Equations in complex surveys with large cluster sizes with application to the United States' Nationwide Inpatient Sample

Stuart R. Lipsitz, Brigham and Women's Hospital, Boston, MA, U.S.A.
Garrett M. Fitzmaurice, Harvard Medical School, Boston, MA, U.S.A.
Debajyoti Sinha, Florida State University, Tallahassee, FL, U.S.A.
Nathanael Hevelone, Brigham and Women's Hospital, Boston, MA, U.S.A.
Edward Giovannucci, Harvard School of Public Health, Boston, MA, U.S.A.
Jim C. Hu, University of California, Los Angeles, CA, U.S.A.
Louis L. Nguyen, Brigham and Women's Hospital, Boston, MA, U.S.A.

December 11,

Summary

Complex survey sampling is widely used to study large finite populations. The designs typically involve a weighted stratified sample of independent clusters, where the units within a cluster are correlated. For national surveys, the clusters tend to be very large. In our motivating example, the 2010 Nationwide Inpatient Sample (NIS) with 8,023,590 patients, the outcome variable is a complication within 48 hours of surgery, and our a priori regression model, which includes interactions, has 51 parameters. Consistent parameter estimates can be obtained under the naive assumption of independent subjects, but these estimates are inefficient when the intra-cluster correlation (ICC) is high. With a relatively large number of regression parameters, efficient estimation is important. Efficient generalized estimating equations (GEE) use all pairs of observations within a cluster. For the 2010 NIS, there are 92.6 billion pairs of observations, making estimation of the ICC using all pairs computationally prohibitive. We propose an efficient, computationally feasible estimator, based on three innovations: 1) a GEE that does not require matrix inversions; 2) a simpler formula to estimate the ICC that avoids summing over all pairs; 3) a one-step GEE estimator that is asymptotically equivalent to the fully-iterated GEE, but is much less computationally intensive.

Key words: efficient estimation; intra-cluster correlation; population survey; weighted estimating equations.

1 Introduction

Complex survey sampling is widely used to sample a fraction of a large finite population while accounting for its size and characteristics. Complex survey sampling designs are typically stratified cluster samples. Design-based analyses usually incorporate the weighting, stratification, and clustering variables. In complex surveys of health or medical outcomes, the clusters often consist of households, people living in the same city block, patients with the same doctor, or patients admitted to the same hospital. When estimating the regression parameters of a generalized linear model for any type of outcome (continuous or discrete) from complex survey data with large clusters, the most popular approach, for reasons of computational feasibility, is to naively assume that the observations within a cluster are independent in order to obtain consistent regression parameter estimates (Binder, 1983; Liang and Zeger, 1986; Shah et al., 1997); a consistent estimate of the covariance matrix of these estimates can be obtained using a sandwich estimator (Binder, 1983; Liang and Zeger, 1986; Shah et al., 1997). These estimates, obtained under the assumption of independence, can be inefficient when the intra-cluster correlation (ICC) is relatively large. For datasets with a large number of clusters (say > 1000) and/or a large number of observations within clusters (say > 1000), the loss of efficiency may be relatively inconsequential for estimates of covariate effects of clinical importance. However, the loss of efficiency for the naive independence estimates could be important for rare outcomes or for regression models with many a priori covariates or interaction terms, resulting in reduced power and wider confidence intervals than an approach that incorporates the ICC in the estimation procedure.
One popular approach to gain efficiency when estimating the regression parameters of a generalized linear model with clustered data is the generalized estimating equations (GEE) approach (Liang and Zeger, 1986), incorporating the ICC under an exchangeable (compound symmetry) correlation structure. However, for the large clusters that may arise in complex sampling, this approach can be impractical because it uses all pairs of outcomes within a cluster to estimate the ICC. Further, GEE as typically implemented requires inversion of matrices of the same dimension as the cluster size, which can be time-consuming for very large clusters. In this paper, we propose a computationally feasible estimator for settings with a large number of clusters and large cluster sizes; the proposed estimator is based on the following three innovations: 1) construct a GEE in a form that does not require inversion of any matrices; 2) use a simple formula to estimate the ICC that does not require summing over all pairs of outcomes; 3) use a one-step GEE estimator that is asymptotically equivalent to the fully-iterated GEE, but is much less computationally intensive.

Our motivating example is from the 2010 Nationwide Inpatient Sample (NIS), developed as part of the United States Healthcare Cost and Utilization Project (HCUP) and sponsored by the United States' Agency for Healthcare Research and Quality. The NIS is a 20% stratified cluster probability sample that encompasses approximately 8 million acute hospital stays per year from approximately 1000 hospitals in 37 states. It is the largest all-payer inpatient care observational dataset in the U.S. and is representative of approximately 90% of all hospital discharges. In the NIS, hospitals in the sampling frame are stratified by five key characteristics. Then, a random sample of hospitals (clusters) is chosen from each of the strata. The NIS includes all discharges from the selected hospitals. Each hospital has a different probability of being selected into the sample, depending on the five characteristics that determine the strata. As a result, each hospital, and thus every discharge within the hospital, is given a weight so that results can be extrapolated to the entire universe of hospitals in the United States. Because approximately 20% of the universe of hospitals is sampled, the weights (the inverse probabilities of being sampled) are usually close to five. In the 2010 NIS, there are 60 strata, 1049 clusters (hospitals), and the average cluster size is [...]. We use the NIS to compare outcomes between three types of surgery: robotic-assisted laparoscopic surgery, non-robotic-assisted laparoscopic surgery, and open surgery, for three different procedures. Robotic-assisted laparoscopic surgery has been rapidly adopted (Barbash and Glied, 2010) despite the dearth of evidence demonstrating superior outcomes compared to traditional surgical approaches (i.e., non-robotic-assisted laparoscopic surgery and open surgery). The Institute of Medicine has prioritized robotic surgery for comparative effectiveness research (versus the other two types of surgery).
The three procedures considered are radical prostatectomy for prostate cancer, nephrectomy for kidney cancer, and partial nephrectomy for kidney cancer. The binary outcome variable of interest is whether a patient had an in-hospital complication within the first 48 hours post-surgery. Table 1 gives the weighted sum and proportion of complications for each type of surgery and procedure. Open surgery has the highest proportion of complications for each of the three procedures. The predictors of complications for each procedure (besides type of surgery) are: hospital bed size (categorized as small, medium, and large), hospital location (urban versus rural), hospital teaching status (teaching versus not teaching), patient race (white versus other), gender (except for prostate cancer), stage (advanced cancer stage versus other), number of patient comorbidities, and private insurance (yes, no). When estimating a regression model for a subgroup (each of the three procedures), one can restrict the analysis to the particular subgroup and obtain unbiased parameter estimates. However, because of the complex survey design, restricting the analysis to the subgroup of interest does not yield correct standard errors (Graubard and Korn, 1996). Unlike a simple random sample, for most complex survey designs the variance-covariance matrix of the parameter estimates is not block diagonal across subgroups. As a result, all subjects must be incorporated in the analysis, even when interest is only in a small subgroup. Correct standard errors for regression models can be obtained by analyzing data from all subjects and including interaction terms between subgroup (procedure, for the NIS) and all covariates in the model. For example, using the National Health Interview Survey, Graubard and Korn (1996) show that the 1987 death rate from digestive cancer per 100,000 individuals for whites is estimated to be 69.01; using the full sample with interactions, the estimated standard error (SE) is [...]. However, by incorrectly restricting the analysis to whites, the estimated SE is 0.89, an almost 4-fold decrease. Thus, for the NIS analysis, all 8,023,590 patients must be included, and subgroup-specific estimates are obtained through the inclusion of interactions between subgroup and covariates. In addition to the three procedures (subgroups) of interest, a fourth subgroup containing everybody else (about 8 million patients) must also be created. Note further that, even though the full dataset has 8,023,590 patients, the estimates under the naive assumption of independence could be inefficient for some of the NIS subgroups, resulting in reduced power and wider confidence intervals than an approach that incorporates the intra-cluster correlation (ICC) in the estimation procedure. In the 2010 NIS, there were 16,641 radical prostatectomies, 10,195 nephrectomies, and 4,323 partial nephrectomies. There are 12 parameters of interest (excluding the intercept) for each procedure, and incorporating the ICC in the analysis can potentially improve the efficiency.
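The full-sample-with-interactions strategy described above can be illustrated with a small sketch of how such a design matrix might be assembled. This is only an illustration under stated assumptions: the function name `subgroup_interaction_design`, the toy data, and the choice to give every subgroup its own intercept and full set of covariate interactions are hypothetical and are not taken from the paper, whose exact parameterization is not given in this section.

```python
import numpy as np

def subgroup_interaction_design(subgroup, covariates, n_subgroups):
    """Return [subgroup indicators | subgroup-by-covariate interactions].

    subgroup   : integer labels 0..n_subgroups-1, one per subject (all subjects kept)
    covariates : (n, k) array of covariates
    """
    n, k = covariates.shape
    indicators = np.zeros((n, n_subgroups))
    indicators[np.arange(n), subgroup] = 1.0  # subgroup-specific intercepts
    # Block g holds the covariates for subjects in subgroup g and zeros elsewhere,
    # so each subgroup gets its own coefficient vector within one full-sample fit.
    interactions = (indicators[:, :, None] * covariates[:, None, :]).reshape(
        n, n_subgroups * k
    )
    return np.hstack([indicators, interactions])

# Hypothetical toy data: 3 subgroups, 2 covariates, 6 subjects.
subgroup = np.array([0, 1, 2, 0, 1, 2])
covariates = np.arange(12, dtype=float).reshape(6, 2)
design = subgroup_interaction_design(subgroup, covariates, n_subgroups=3)
print(design.shape)
```

Because no subject is dropped, a design-based variance estimator applied to a fit on this matrix sees every cluster and stratum, which is the property the subgroup-restricted analysis loses.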
In Section 2, we briefly describe the generalized estimating equations under an exchangeable (compound symmetry) correlation structure, and show how they can be expressed in a form that does not require inversion of any matrices. In Section 3, we propose a consistent estimator of the ICC that does not require summing over all pairs of outcomes. In Section 4, we discuss a one-step GEE estimator that is asymptotically equivalent to the fully-iterated GEE but is far less computationally intensive. In Section 5, we apply the proposed approach to regression analyses of the NIS data and compare its computation time to that of the standard fully-iterated GEE.

2 Generalized estimating equations

In general, the sampling design can be multi-stage, and each subject in the sample can have a different probability of being selected, so that each sampled subject can have a different weight, given by the inverse of the probability of being selected into the sample. Here, we focus on the most common design, in which the finite population consists of $h = 1, \ldots, H$ strata, $l = 1, \ldots, N_h$ population clusters within stratum $h$, and $m = 1, \ldots, M_{hl}$ population subjects within cluster $l$ of stratum $h$. Consider a regression model in which the data collected on the $m$th subject from the $l$th cluster in stratum $h$ are the outcome variable $Y_{hlm}$ and a vector $x_{hlm} = (x_{hlm1}, \ldots, x_{hlmK})'$ of $K$ covariates. Let $\mu_{hlm} = E(Y_{hlm} \mid x_{hlm}, \beta)$ denote the expectation of the outcome $Y_{hlm}$ given the covariates and regression parameters $\beta$. In many applications, it is of interest to estimate the regression coefficients $\beta$ from a generalized linear regression model

$$E(Y_{hlm} \mid x_{hlm}, \beta) = \mu_{hlm} = \mu_{hlm}(\beta) = g(x_{hlm}'\beta), \tag{2.1}$$

where $g(\cdot)$ is a specified function, such as $g(a) = a$ for linear regression and $g(a) = \exp(a)/[1 + \exp(a)]$ for logistic regression. Further, the variance of $Y_{hlm}$ given the covariates is of the general form

$$Var(Y_{hlm} \mid x_{hlm}, \beta) = v(\mu_{hlm})\,\phi, \tag{2.2}$$

where $v(\mu_{hlm})$ can be any function of $\mu_{hlm}$ that is always positive and $\phi$ is a scale parameter that may be fixed or may need to be estimated. Typical functions include $v(\mu_{hlm}) = \mu_{hlm}(1 - \mu_{hlm})$ for binary data, $v(\mu_{hlm}) = \mu_{hlm}$ for count data with Poisson variability, and $v(\mu_{hlm}) = \mu_{hlm}^2$ for positive data with exponential variability.

In the NIS study, if a cluster (hospital) is included in the complex survey, then all patients within the cluster are sampled. Thus, to indicate which clusters are sampled from the population of clusters, we define the indicator random variable

$$\delta_{hl} = \begin{cases} 1 & \text{if the } l\text{th cluster in stratum } h \text{ is selected into the sample,} \\ 0 & \text{otherwise.} \end{cases}$$

Further, we let $p_{hl}$ denote the probability that the cluster is selected into the survey. In the NIS, $p_{hl}$ is approximately 0.2, although it can vary depending on hospital characteristics.

We let $Y_{hl} = [Y_{hl1}, \ldots, Y_{hlM_{hl}}]'$ be an $M_{hl} \times 1$ vector containing the outcomes for the $M_{hl}$ subjects from the $l$th cluster in stratum $h$; $X_{hl} = [x_{hl1}, \ldots, x_{hlM_{hl}}]'$ represent the $M_{hl} \times K$ matrix containing the covariate vectors for the subjects from the $l$th cluster in stratum $h$; and $\mu_{hl} = [\mu_{hl1}, \ldots, \mu_{hlM_{hl}}]'$ be an $M_{hl} \times 1$ vector containing the means for the $M_{hl}$ subjects from the $l$th cluster in stratum $h$.

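We assume the correlation between any two observations in the same cluster is exchangeable, i.e., $\rho = Corr(Y_{hlj}, Y_{hlk} \mid X_{hl})$ for $j \neq k$; $\rho$ is often referred to as the intra-class correlation coefficient (ICC). The structure of the correlation matrix of $Y_{hl}$ is often called compound symmetry (or exchangeable):

$$R_{hl} = Corr(Y_{hl}) = (1 - \rho)\, I_{hl} + \rho\, J_{hl} J_{hl}',$$

where $I_{hl}$ is an $M_{hl} \times M_{hl}$ identity matrix and $J_{hl}$ is an $M_{hl} \times 1$ vector of 1's. In this case, the covariance matrix of $Y_{hl}$ is $V_{hl} = A_{hl}^{1/2} R_{hl} A_{hl}^{1/2}$, where $A_{hl}$ is a diagonal matrix with diagonal elements $Var(Y_{hlm} \mid x_{hlm}, \beta) = v(\mu_{hlm})\phi$, with $v(\mu_{hlm})$ specified entirely by the marginal distributions, i.e., by $\beta$.

To estimate $\beta$, Liang and Zeger (1986) consider generalized estimating equations of the form

$$u(\hat\beta) = \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, D_{hl}'\, V_{hl}^{-1} [Y_{hl} - \mu_{hl}(\hat\beta)] = 0, \tag{2.3}$$

where $D_{hl} = \partial \mu_{hl}(\beta) / \partial \beta'$ and $V_{hl}$ is the covariance matrix of $Y_{hl}$ described above. Note that $V_{hl}$ is a function of the unknown correlation $\rho$, which must be estimated; however, the scale parameter $\phi$ can be ignored when solving this equation for $\beta$. Here, also, $w_{hl} = \delta_{hl} / p_{hl}$, so that each cluster in the sample is weighted by the inverse of its probability of being selected.

We first simplify (2.3) by simplifying $V_{hl}^{-1}$. Note that $V_{hl}^{-1} = A_{hl}^{-1/2} R_{hl}^{-1} A_{hl}^{-1/2}$. Using matrix algebra results,

$$R_{hl}^{-1} = \frac{1}{(1 - \rho)} \left[ I_{hl} - \frac{\rho}{(1 - \rho) + M_{hl}\rho}\, J_{hl} J_{hl}' \right],$$

and thus

$$V_{hl}^{-1} = \frac{1}{(1 - \rho)}\, A_{hl}^{-1} - \frac{\rho}{(1 - \rho)[(1 - \rho) + M_{hl}\rho]}\, (A_{hl}^{-1/2} J_{hl})(A_{hl}^{-1/2} J_{hl})'.$$

Thus, the estimating function in (2.3) becomes

$$u(\beta) = \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, D_{hl}'\, \frac{A_{hl}^{-1}}{(1 - \rho)}\, [Y_{hl} - \mu_{hl}] - \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, \frac{\rho}{(1 - \rho)[(1 - \rho) + M_{hl}\rho]}\, D_{hl}' (A_{hl}^{-1/2} J_{hl})(A_{hl}^{-1/2} J_{hl})' [Y_{hl} - \mu_{hl}].$$

Further, without loss of generality, since we are setting this equal to 0 and solving for $\beta$, we can multiply the estimating function by $(1 - \rho)$ to obtain

$$u(\beta) = \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, D_{hl}' A_{hl}^{-1} [Y_{hl} - \mu_{hl}] - \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, \frac{\rho}{(1 - \rho) + M_{hl}\rho}\, D_{hl}' (A_{hl}^{-1/2} J_{hl})(A_{hl}^{-1/2} J_{hl})' [Y_{hl} - \mu_{hl}]. \tag{2.4}$$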
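The algebra leading to (2.4) can be checked numerically for a single cluster. The sketch below is illustrative only: it assumes a logistic model with $\phi = 1$, uses made-up data, and the function names are hypothetical. It confirms that the closed-form inverse of the exchangeable correlation matrix agrees with a numerical inverse, and that the inversion-free cluster contribution equals $(1 - \rho)$ times the contribution computed with an explicit $V^{-1}$.

```python
import numpy as np

def exchangeable_inverse(M, rho):
    """Closed-form inverse of R = (1 - rho) * I + rho * J J' for cluster size M."""
    return (np.eye(M) - rho / ((1.0 - rho) + M * rho) * np.ones((M, M))) / (1.0 - rho)

def score_direct(X, y, beta, rho):
    """Cluster contribution D' V^{-1} (y - mu), using an explicit M x M solve."""
    mu = 1.0 / (1.0 + np.exp(-X @ beta))       # logistic mean
    v = mu * (1.0 - mu)                        # binary variance function
    D = v[:, None] * X                         # d mu / d beta' under the logit link
    M = len(y)
    R = (1.0 - rho) * np.eye(M) + rho * np.ones((M, M))
    V = np.sqrt(v)[:, None] * R * np.sqrt(v)[None, :]   # A^{1/2} R A^{1/2}
    return D.T @ np.linalg.solve(V, y - mu)

def score_no_inverse(X, y, beta, rho):
    """Cluster contribution in the form of (2.4): sums only, no matrix inverse."""
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    v = mu * (1.0 - mu)
    D = v[:, None] * X
    M = len(y)
    resid = y - mu
    first = D.T @ (resid / v)
    second = (rho / ((1.0 - rho) + M * rho)) \
        * (D.T @ (1.0 / np.sqrt(v))) * np.sum(resid / np.sqrt(v))
    return first - second

# Made-up cluster of size 6 with an intercept and one covariate.
rng = np.random.default_rng(0)
M, rho = 6, 0.2
X = np.column_stack([np.ones(M), rng.normal(size=M)])
y = rng.integers(0, 2, size=M).astype(float)
beta = np.array([-0.5, 0.3])

R = (1.0 - rho) * np.eye(M) + rho * np.ones((M, M))
direct = score_direct(X, y, beta, rho)
no_inv = score_no_inverse(X, y, beta, rho)
print(np.allclose(exchangeable_inverse(M, rho) @ R, np.eye(M)))
print(np.allclose(no_inv, (1.0 - rho) * direct))
```

The direct version needs an $M \times M$ solve per cluster, while the second version needs only vector sums, which is what makes clusters with thousands of subjects tractable.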

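The second sum in the estimating equation can be simplified further by noting that, with $v_{hlm} = v(\mu_{hlm})$,

$$(A_{hl}^{-1/2} J_{hl})' [Y_{hl} - \mu_{hl}] = \sum_{m=1}^{M_{hl}} (Y_{hlm} - \mu_{hlm}) / \sqrt{v_{hlm}}$$

and

$$D_{hl}' (A_{hl}^{-1/2} J_{hl}) = \sum_{m=1}^{M_{hl}} d_{hlm} / \sqrt{v_{hlm}},$$

where $d_{hlm} = \partial \mu_{hlm}(\beta) / \partial \beta$ is a given column of $D_{hl}'$. Then, (2.4) becomes

$$u(\beta) = \sum_{h=1}^{H} \sum_{l=1}^{N_h} \sum_{m=1}^{M_{hl}} w_{hl}\, d_{hlm} (Y_{hlm} - \mu_{hlm}) / v_{hlm} - \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, \frac{\rho}{(1 - \rho) + M_{hl}\rho} \left[ \sum_{q=1}^{M_{hl}} d_{hlq} / \sqrt{v_{hlq}} \right] \left[ \sum_{m=1}^{M_{hl}} (Y_{hlm} - \mu_{hlm}) / \sqrt{v_{hlm}} \right]. \tag{2.5}$$

Note that this estimating equation does not involve inverting any matrices. The first sum in the estimating equation,

$$u_I(\beta) = \sum_{h=1}^{H} \sum_{l=1}^{N_h} \sum_{m=1}^{M_{hl}} w_{hl}\, d_{hlm} (Y_{hlm} - \mu_{hlm}) / v_{hlm}, \tag{2.6}$$

is just the estimating equation under the naive assumption of independence of observations in a cluster, and yields consistent, but possibly inefficient, estimators of $\beta$. Thus, the second sum in the estimating equation is where efficiency is gained by incorporating the ICC ($\rho$). Because $\rho$ is unknown, it must be estimated and plugged into (2.5) to realize potential efficiency gains. Any consistent estimator of $\rho$ will yield the same asymptotic efficiency of the resulting estimator of $\beta$. In the following section, we consider a very simple estimator that is computationally feasible with large clusters.

Regardless of whether $\rho$ is known or consistently estimated, using Taylor series expansions similar to Liang and Zeger (1986) and Prentice (1988), and assuming that the regression model for $\mu_{hlm}$ is correctly specified, the solution $\hat\beta$ of $u(\hat\beta) = 0$ is consistent for $\beta$; in addition, $N^{1/2}(\hat\beta - \beta)$ has an asymptotic distribution that is multivariate normal with mean vector 0. The asymptotic covariance matrix of $\hat\beta$ can be consistently estimated by (Binder, 1983)

$$\left[ \sum_{h=1}^{H} \sum_{l=1}^{N_h} \hat W_{hl} \right]^{-1} \left[ \sum_{h=1}^{H} \sum_{l=1}^{N_h} \{u_{hl+}(\hat\beta) - \bar u_{h++}(\hat\beta)\} \{u_{hl+}(\hat\beta) - \bar u_{h++}(\hat\beta)\}' \right] \left[ \sum_{h=1}^{H} \sum_{l=1}^{N_h} \hat W_{hl} \right]^{-1}, \tag{2.7}$$

where

$$u_{hl+}(\beta) = w_{hl} \sum_{m=1}^{M_{hl}} d_{hlm} (Y_{hlm} - \mu_{hlm}) / v_{hlm} - w_{hl}\, \frac{\rho}{(1 - \rho) + M_{hl}\rho} \left[ \sum_{q=1}^{M_{hl}} d_{hlq} / \sqrt{v_{hlq}} \right] \left[ \sum_{m=1}^{M_{hl}} (Y_{hlm} - \mu_{hlm}) / \sqrt{v_{hlm}} \right]$$

is the sum of the score vectors from the subjects in cluster $l$ of stratum $h$;

$$\bar u_{h++}(\hat\beta) = \frac{1}{N_h} \sum_{l=1}^{N_h} u_{hl+}(\hat\beta)$$

is the sample mean of the $u_{hl+}(\hat\beta)$'s; and

$$W_{hl} = W_{hl}(\beta, \rho) = E\left[ -\frac{\partial u_{hl+}(\beta)}{\partial \beta'} \right] = \sum_{m=1}^{M_{hl}} w_{hl}\, d_{hlm} d_{hlm}' / v_{hlm} - w_{hl}\, \frac{\rho}{(1 - \rho) + M_{hl}\rho} \left[ \sum_{q=1}^{M_{hl}} d_{hlq} / \sqrt{v_{hlq}} \right] \left[ \sum_{q=1}^{M_{hl}} d_{hlq} / \sqrt{v_{hlq}} \right]'. \tag{2.8}$$

Also, $\hat W_{hl}$ and $u_{hl+}(\beta)$ are evaluated at $\hat\beta$ and a consistent estimate of $\rho$; in the following section we consider estimation of $\rho$.

3 Estimating the ICC

Denote the true residual for the $m$th subject from the $l$th cluster in stratum $h$ by $e_{hlm} = (Y_{hlm} - \mu_{hlm}) / \sqrt{v_{hlm}}$. Then, by definition, for $m \neq m'$, $E(w_{hl}\, e_{hlm} e_{hlm'}) = \rho$. This suggests that a consistent method of moments estimator of $\rho$ (Liang and Zeger, 1986) can be obtained via

$$\hat\rho = \left[ \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, M_{hl}(M_{hl} - 1)/2 \right]^{-1} \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl} \sum_{m < m'} \hat e_{hlm}\, \hat e_{hlm'}, \tag{3.9}$$

where $\hat e_{hlm} = (Y_{hlm} - \hat\mu_{hlm}) / \sqrt{\hat v_{hlm}}$. Note that the estimator given by (3.9) requires the sum of $\sum_{h=1}^{H} \sum_{l=1}^{N_h} M_{hl}(M_{hl} - 1)/2$ pairs. For the NIS, there are approximately 92.6 billion pairs, so that this formulation of (3.9) could be impractical.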
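To make the cost of (3.9) concrete, the following sketch computes the weighted all-pairs moment estimator by brute force on a few small made-up clusters and compares it with an equivalent computation that uses only per-cluster sums of residuals (the identity behind the equivalence is derived in the text that follows). The residuals, weights, and function names here are hypothetical stand-ins, not NIS quantities.

```python
import numpy as np

def icc_all_pairs(resid_by_cluster, weights):
    """Moment estimator (3.9): explicit loop over all within-cluster pairs."""
    num, den = 0.0, 0.0
    for e, w in zip(resid_by_cluster, weights):
        M = len(e)
        pair_sum = 0.0
        for j in range(M):
            for k in range(j + 1, M):
                pair_sum += e[j] * e[k]
        num += w * pair_sum
        den += w * M * (M - 1) / 2.0
    return num / den

def icc_cluster_sums(resid_by_cluster, weights):
    """Same estimate using sum_{j<k} e_j e_k = ((sum e)^2 - sum e^2) / 2."""
    num, den = 0.0, 0.0
    for e, w in zip(resid_by_cluster, weights):
        M = len(e)
        num += w * 0.5 * (e.sum() ** 2 - (e ** 2).sum())
        den += w * M * (M - 1) / 2.0
    return num / den

# Made-up standardized residuals for three clusters of sizes 5, 40 and 12.
rng = np.random.default_rng(0)
sizes = [5, 40, 12]
resid_by_cluster = [rng.normal(size=M) for M in sizes]
weights = [5.0, 4.0, 6.5]

n_pairs = sum(M * (M - 1) // 2 for M in sizes)   # work done by the all-pairs version
n_terms = 2 * sum(sizes)                         # work done by the cluster-sum version
rho_pairs = icc_all_pairs(resid_by_cluster, weights)
rho_sums = icc_cluster_sums(resid_by_cluster, weights)
print(n_pairs, n_terms, np.isclose(rho_pairs, rho_sums))
```

Even in this toy case the pair count (856) dwarfs the number of terms in the cluster-sum version (114); the gap grows quadratically with cluster size.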

However, note that

$$\left( \sum_{m=1}^{M_{hl}} \hat e_{hlm} \right)^2 = \sum_{m=1}^{M_{hl}} \hat e_{hlm}^2 + 2 \sum_{m < m'} \hat e_{hlm}\, \hat e_{hlm'}.$$

Thus,

$$\sum_{m < m'} \hat e_{hlm}\, \hat e_{hlm'} = \frac{1}{2} \left\{ \left( \sum_{m=1}^{M_{hl}} \hat e_{hlm} \right)^2 - \sum_{m=1}^{M_{hl}} \hat e_{hlm}^2 \right\},$$

which requires the sum of $2M_{hl}$ terms instead of $M_{hl}(M_{hl} - 1)/2$ terms. Thus, we suggest using the following formulation to obtain an estimate identical to that given in (3.9),

$$\hat\rho = \left[ \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, M_{hl}(M_{hl} - 1)/2 \right]^{-1} \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, \frac{1}{2} \left\{ \left( \sum_{m=1}^{M_{hl}} \hat e_{hlm} \right)^2 - \sum_{m=1}^{M_{hl}} \hat e_{hlm}^2 \right\}; \tag{3.10}$$

this expression has $2 \sum_{h=1}^{H} \sum_{l=1}^{N_h} M_{hl}$ (2 times the total sample size) terms instead of $\sum_{h=1}^{H} \sum_{l=1}^{N_h} M_{hl}(M_{hl} - 1)/2$ terms. For the NIS, this translates into approximately 16 million terms using our proposed approach instead of 92.6 billion terms using all pairs.

4 One-step estimator

Ordinarily, one iterates between estimating $\beta$ and $\rho$ until convergence. However, if the estimate of $\beta$ under independence is initially used to estimate $\rho$, and the resulting estimate of $\rho$ is plugged back into (2.5) and $u(\hat\beta) = 0$ is solved for $\beta$ (treating the estimate of $\rho$ as known), this yields an estimator that is asymptotically equivalent to the fully-iterated GEE estimator. The proof of this result follows along the lines of that discussed in Lehmann (1983, p. 422) for creating a one-step asymptotically efficient estimator. The one-step GEE estimator has the advantage that it is much less computationally intensive than the standard fully-iterated GEE for large cluster sizes. In particular, a one-step estimator of $\beta$ is formed by using one iteration of a Fisher scoring algorithm for obtaining a solution to $u(\hat\beta) = 0$, with the estimate under the naive assumption of independence as the starting value. The one-step estimator is obtained via

$$\hat\beta_{1GEE} = \hat\beta_I + [W(\hat\beta_I, \hat\rho_I)]^{-1} u(\hat\beta_I), \tag{4.11}$$

where $\hat\beta_{1GEE}$ is the one-step GEE estimator, $\hat\beta_I$ is the estimator under the naive assumption of independence (the solution to $u_I(\hat\beta_I) = 0$ in (2.6)), $\hat\rho_I$ is calculated by using $\hat\beta_I$ to estimate $\mu_{hlm}$ and $v_{hlm}$ in $\hat e_{hlm}$ in (3.10), and

$$W(\beta, \rho) = \sum_{h=1}^{H} \sum_{l=1}^{N_h} W_{hl}(\beta, \rho).$$

Note further, since $u_I(\hat\beta_I) = 0$, that

$$u(\hat\beta_I) = u_I(\hat\beta_I) - \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, \frac{\hat\rho_I}{(1 - \hat\rho_I) + M_{hl}\hat\rho_I} \left[ \sum_{q=1}^{M_{hl}} d_{hlq} / \sqrt{v_{hlq}} \right] \sum_{m=1}^{M_{hl}} (Y_{hlm} - \mu_{hlm}) / \sqrt{v_{hlm}} = - \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, \frac{\hat\rho_I}{(1 - \hat\rho_I) + M_{hl}\hat\rho_I} \left[ \sum_{q=1}^{M_{hl}} d_{hlq} / \sqrt{v_{hlq}} \right] \sum_{m=1}^{M_{hl}} (Y_{hlm} - \mu_{hlm}) / \sqrt{v_{hlm}},$$

with $d_{hlm}$, $\mu_{hlm}$, and $v_{hlm}$ evaluated at $\hat\beta_I$, so that (4.11) simplifies to

$$\hat\beta_{1GEE} = \hat\beta_I - [W(\hat\beta_I, \hat\rho_I)]^{-1} \sum_{h=1}^{H} \sum_{l=1}^{N_h} w_{hl}\, \frac{\hat\rho_I}{(1 - \hat\rho_I) + M_{hl}\hat\rho_I} \left[ \sum_{q=1}^{M_{hl}} d_{hlq} / \sqrt{v_{hlq}} \right] \sum_{m=1}^{M_{hl}} (Y_{hlm} - \mu_{hlm}) / \sqrt{v_{hlm}}. \tag{4.12}$$

The asymptotic covariance matrix of $\hat\beta_{1GEE}$ can be consistently estimated using (2.7) evaluated at $\hat\beta_{1GEE}$. Thus, in general, our approach to make the estimation feasible is threefold: 1) express the GEE in a form that does not require inversion of any matrices; 2) use a simple formula to estimate the ICC that does not require summing over all pairs of outcomes; 3) use a one-step GEE estimator.

5 Application to 2010 Nationwide Inpatient Sample

In this section, we present results for analyses of data from the NIS study discussed in the Introduction. The binary outcome for a patient is a complication within the first 48 hours post-surgery (1 if complications, 0 otherwise). We fit a logistic regression model for this outcome. The hospital-level covariates of interest are: medium bed size (1 if medium bed size, 0 otherwise), large bed size (1 if large bed size, 0 otherwise), urban (1 if urban, 0 otherwise), and teaching (1 if teaching, 0 otherwise). The patient-level covariates of interest are: Caucasian (1 if Caucasian, 0 otherwise), comorbidities (number of patient comorbidities), age (in years), stage (1 if advanced cancer stage, 0 otherwise), insurance (1 if private insurance, 0 otherwise), robotic (1 if robotic surgery, 0 otherwise), laparoscopic (1 if laparoscopic surgery, 0 otherwise), and female gender (except for prostatectomy).

Table 2 gives the estimates of $\beta$ for the three surgical procedures under the naive assumption of independence (independence working correlation) with a robust sandwich variance estimate, as well as our one-step estimator with an exchangeable (ICC) correlation structure (also with a robust sandwich variance estimate). In general, an approximate estimate of the asymptotic relative efficiency of the estimates can be obtained by comparing the robust variance estimators of $\hat\beta$ under the two different correlation structures. From Table 2, we see that the estimated standard errors of $\hat\beta$ under exchangeability can be discernibly smaller than those under independence. For example, the estimated relative efficiency of the robotic effect is 71% for prostatectomy, 58% for nephrectomy, and 48% for partial nephrectomy. Similar gains in estimated efficiency with an exchangeable correlation structure are seen for most parameters in the model.

Based on the results from our one-step estimator with an exchangeable (ICC) correlation structure, for all three surgical procedures robotic surgery significantly (P < 0.05) reduces complications when compared to open surgery. In particular, for prostatectomy, nephrectomy, and partial nephrectomy, robotic surgery multiplies the odds of complications by an estimated exp(−0.310) ≈ 0.73, exp(−0.269) ≈ 0.76, and exp(−0.664) ≈ 0.51, respectively. Adjusted for the other factors, robotic surgery (versus open surgery) appears to lead to the greatest reduction in the odds of complications for partial nephrectomy. For nephrectomy and partial nephrectomy, laparoscopic surgery significantly (P < 0.05) reduces complications versus open surgery; however, there appears to be no difference in rates of complications for prostatectomy. In particular, for prostatectomy, nephrectomy, and partial nephrectomy, laparoscopic surgery multiplies the odds of complications by an estimated exp(0.002) ≈ 1.0, exp(−0.206) ≈ 0.81, and exp(−0.559) ≈ 0.50, respectively. Adjusted for the other factors, similar to robotic surgery, laparoscopic surgery (versus open surgery) appears to lead to the greatest reduction in the odds of complications for partial nephrectomy.
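The one-step procedure of Sections 3 and 4 can be sketched end to end on simulated data. This is a rough illustration under stated assumptions, not the authors' SAS macro: it assumes a logistic model, uses hypothetical simulated clusters and weights, all variable and function names are made up for the example, and it omits the stratified sandwich variance (2.7). The steps are: fit the weighted independence model, estimate $\rho$ from cluster-level residual sums as in (3.10), and apply the single update (4.12).

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in data (hypothetical): 200 clusters of 20 to 60 subjects,
# with a shared cluster effect that induces within-cluster correlation.
n_clusters = 200
sizes = rng.integers(20, 61, size=n_clusters)
cluster = np.repeat(np.arange(n_clusters), sizes)
w_cluster = rng.choice([4.0, 5.0, 6.0], size=n_clusters)   # assumed weights 1/p
x_cluster = rng.normal(size=n_clusters)                    # cluster-level covariate
X = np.column_stack([np.ones(cluster.size), x_cluster[cluster],
                     rng.normal(size=cluster.size)])
b = rng.normal(scale=0.8, size=n_clusters)
eta = X @ np.array([-1.0, 0.5, 0.3]) + b[cluster]
y = (rng.random(cluster.size) < 1.0 / (1.0 + np.exp(-eta))).astype(float)
w = w_cluster[cluster]

def fit_independence(X, y, w, n_iter=25):
    """Weighted logistic regression by Newton-Raphson (solves u_I(beta) = 0)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (w * (y - mu))
        info = X.T @ (X * (w * mu * (1.0 - mu))[:, None])
        beta = beta + np.linalg.solve(info, score)
    return beta

def cluster_sum(values, cluster, n_clusters):
    """Sum the rows of `values` within each cluster."""
    out = np.zeros((n_clusters,) + values.shape[1:])
    np.add.at(out, cluster, values)
    return out

# Step 1: independence estimate and standardized residuals.
beta_I = fit_independence(X, y, w)
mu = 1.0 / (1.0 + np.exp(-X @ beta_I))
v = mu * (1.0 - mu)
e = (y - mu) / np.sqrt(v)

# Step 2: ICC from cluster-level sums, as in (3.10).
S1 = cluster_sum(e, cluster, n_clusters)        # sum of residuals per cluster
S2 = cluster_sum(e ** 2, cluster, n_clusters)   # sum of squared residuals per cluster
rho_I = np.sum(w_cluster * 0.5 * (S1 ** 2 - S2)) / np.sum(
    w_cluster * sizes * (sizes - 1) / 2.0)

# Step 3: one-step update (4.12). For the logit link d = v * x, so d / sqrt(v) = sqrt(v) * x.
c = w_cluster * rho_I / ((1.0 - rho_I) + sizes * rho_I)
G = cluster_sum(np.sqrt(v)[:, None] * X, cluster, n_clusters)
W = X.T @ (X * (w * v)[:, None]) - (G * c[:, None]).T @ G
correction = (G * (c * S1)[:, None]).sum(axis=0)
beta_1gee = beta_I - np.linalg.solve(W, correction)

print("rho_I:", round(float(rho_I), 3))
print("beta_I:", beta_I)
print("beta_1gee:", beta_1gee)
```

Every quantity in steps 2 and 3 is a per-cluster sum over subjects, so the work grows linearly in the total sample size, and the only matrix solve is $K \times K$.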
In general, the patient-level factors (race, gender, comorbidities) are more significantly related to complications than the hospital-level predictors. Table 3 provides a comparison of the computation time (real, not CPU) required by various approaches to fitting the model, as implemented in different SAS procedures (SAS Institute Inc., 2014). Standard (weighted) logistic regression maximum likelihood estimation under naive independence, without a robust variance correction (as implemented in PROC LOGISTIC in SAS), is the fastest (2.1 minutes); standard logistic regression maximum likelihood estimation under naive independence but with a sandwich variance estimate (as implemented in PROC SURVEYLOGISTIC in SAS) takes 4.3 minutes. Thus, calculation of the robust sandwich variance estimate for standard logistic regression takes an additional 2.2 minutes (the algorithms for calculating the standard logistic regression estimates in PROC LOGISTIC and PROC SURVEYLOGISTIC are the same). Use of a SAS macro implementing our proposed 1-step approach takes 11.8 minutes, as opposed to the fully-iterated estimation (as implemented in PROC GENMOD in SAS) with exchangeable correlation, which takes 8.1 hours. We note that a single iteration of PROC GENMOD with exchangeable correlation should be comparable to the 1-step approach, because PROC GENMOD in SAS uses the standard logistic regression maximum likelihood estimates under naive independence as starting values. Thus, direct comparison of our 1-step approach (11.8 minutes) to 1-step of PROC GENMOD (1.4 hours) emphasizes the potential advantage of the proposed method; 1-step of PROC GENMOD takes approximately 7 times longer than our proposed approach. We attribute the advantage in computation time to the following: 1) we have expressed the GEE in a form that does not require inversion of any matrices; and 2) we use a simple formula to estimate the ICC that does not require summing over all pairs of outcomes within a cluster. We fit the models on a 64-bit PC workstation with an Intel Xeon CPU E[...] [...] GHz processor, 16.0 GB of RAM, and an SSD hard drive; this PC workstation is much faster than a typical desktop PC. Finally, although the results presented here were obtained using SAS, we also attempted to use the same dataset to obtain the fully-iterated GEE estimates using both gee in R (Carey, 2002) and the xtgee command in Stata (StataCorp, 2013); both packages were unable to produce estimates of the model parameters due to insufficient memory.

6 Conclusion

In this paper we propose a one-step estimating equations approach with an exchangeable correlation (constant ICC) for gaining efficiency in analyses of large complex surveys. When applied to data from the 2010 NIS we found that, even with sample sizes greater than 1000 in each subgroup, there were substantial efficiency gains with the use of our approach, leading to discernibly narrower confidence intervals.
In addition, we found that the one-step estimates were almost identical to the fully-iterated estimates obtained from PROC GENMOD in SAS, but took only a fraction of the time to compute (11.8 minutes versus 8.1 hours for PROC GENMOD in SAS). Furthermore, the one-step estimator is asymptotically equivalent to the fully-iterated estimator. We attribute the very discernible computational advantage of our proposed approach to the following three factors: 1) we have constructed the GEE in a form that does not require inversion of any matrices; 2) we use a simple formula to estimate the ICC that does not require summing over all pairs of outcomes within a cluster; 3) we use only a one-step GEE estimator instead of fully iterating. Our results also suggest that if a fully-iterated estimate is preferred, then it is still advantageous to use our approach with equation (4.11) to fully iterate. Finally, we note that although the GEE estimates obtained under exchangeability, as produced by many software packages, are consistent, the robust or sandwich variance estimates do not take the stratification into account. As a result, we cannot recommend that standard implementations of the GEE be used for hypothesis testing or the construction of confidence intervals, even in settings where the number of observations is small enough that computational time is not prohibitive. A SAS macro implementing our proposed one-step estimating equations approach with an exchangeable correlation is available from the first author upon request.

Acknowledgments

We thank Caprice Greenberg for advice on the analysis of the data from the Nationwide Inpatient Sample. We are grateful for the support provided by grants MH [...], CA [...] and CA [...] from the U.S. National Institutes of Health.

References

Barbash, G.I. and Glied, S.A. (2010). New technology and health care costs: the case of robot-assisted surgery. N Engl J Med, 363, 701.
Binder, D. (1983). On the variance of asymptotically normal estimators from complex surveys. International Statistical Review, 51.
Carey, V.J. (2002). gee: Generalized Estimation Equation Solver. R package version [...]; ported from S-PLUS to R by Thomas Lumley (versions 3.13 and 4.4) and Brian Ripley (version 4.13).
Lehmann, E.L. (1983). Theory of Point Estimation. New York: Wiley.
Liang, K.Y. and Zeger, S.L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73.
Prentice, R.L. (1988). Correlated binary regression with covariates specific to each binary observation. Biometrics, 44.
SAS Institute Inc. (2014). SAS OnlineDoc, Version 9.3. Cary, NC. URL [...].
Shah, B.V., Barnwell, B.G. and Bieler, G.S. (1997). SUDAAN User's Manual, Release 7.5. Research Triangle Park: Research Triangle Institute.
StataCorp (1997). Stata Statistical Software: Release 5.0. College Station, TX: Stata Corporation.
StataCorp (2013). Stata 13 Longitudinal/Panel-Data Reference Manual. College Station, TX: Stata Press.

Table 1. Complication status by procedure and type of surgery: weighted counts (N) and column percentages (%).

Procedure             Complication   Robotic        Laparoscopic   Open
                                     N (%)          N (%)          N (%)
Prostatectomy         No             51694 (93%)    336 (90%)      [...] (89%)
                      Yes            4161 (7%)      39 (10%)       2623 (11%)
Nephrectomy           No             2327 (67%)     5053 (68%)     [...] (61%)
                      Yes            1139 (33%)     2339 (32%)     [...] (39%)
Partial Nephrectomy   No             6527 (81%)     984 (80%)      7822 (68%)
                      Yes            1492 (19%)     246 (20%)      3604 (32%)

Table 2. Comparison of logistic regression parameter estimates for the post-operative surgical complications data, stratified by procedure.

Columns in the original table: Effect; Working Correlation (Ind = independence, ICC = exchangeable); Estimate; SE; Z-statistic; P-value. Only the P-values reported as <0.001 are legible here; all other entries are shown as [...].

Effect                P-value (Ind)   P-value (ICC)

Prostatectomy:
Medium bed            [...]           [...]
Large bed             [...]           [...]
Urban                 [...]           [...]
Teaching              [...]           [...]
Caucasian             [...]           <0.001
Comorbidities         <0.001          <0.001
Age                   [...]           <0.001
Advanced Stage        [...]           [...]
Private Insurance     [...]           [...]
Robotic               <0.001          <0.001
Laparoscopic          [...]           [...]

Nephrectomy:
Medium bed            [...]           [...]
Large bed             [...]           [...]
Urban                 [...]           [...]
Teaching              [...]           [...]
Caucasian             [...]           [...]
Female                <0.001          <0.001
Comorbidities         <0.001          <0.001
Age                   [...]           [...]
Advanced Stage        <0.001          <0.001
Private Insurance     <0.001          <0.001
Robotic               [...]           [...]
Laparoscopic          [...]           [...]

Partial Nephrectomy:
Medium bed            [...]           [...]
Large bed             [...]           [...]
Urban                 [...]           [...]
Teaching              [...]           [...]
Caucasian             [...]           [...]
Female                <0.001          <0.001
Comorbidities         <0.001          <0.001
Age                   [...]           [...]
Advanced Stage        [...]           [...]
Private Insurance     [...]           [...]
Robotic               <0.001          <0.001
Laparoscopic          [...]           <0.001

Table 3. Computation times (real, not CPU) required to obtain the logistic regression parameter estimates.

SAS Procedure                      Time (in minutes)
Logistic                           2.1
Surveylogistic                     4.3
SAS macro for 1-step estimator     11.8
Genmod ICC 1-step (a)              84.0
Genmod ICC fully-iterated          [...]

(a) PROC GENMOD in SAS uses the independence estimates as starting values; as a result, 1-step of PROC GENMOD should be comparable to the proposed 1-step estimator.


More information

STAT 5500/6500 Conditional Logistic Regression for Matched Pairs

STAT 5500/6500 Conditional Logistic Regression for Matched Pairs STAT 5500/6500 Conditional Logistic Regression for Matched Pairs Motivating Example: The data we will be using comes from a subset of data taken from the Los Angeles Study of the Endometrial Cancer Data

More information

GEE for Longitudinal Data - Chapter 8

GEE for Longitudinal Data - Chapter 8 GEE for Longitudinal Data - Chapter 8 GEE: generalized estimating equations (Liang & Zeger, 1986; Zeger & Liang, 1986) extension of GLM to longitudinal data analysis using quasi-likelihood estimation method

More information

STAT 5500/6500 Conditional Logistic Regression for Matched Pairs

STAT 5500/6500 Conditional Logistic Regression for Matched Pairs STAT 5500/6500 Conditional Logistic Regression for Matched Pairs The data for the tutorial came from support.sas.com, The LOGISTIC Procedure: Conditional Logistic Regression for Matched Pairs Data :: SAS/STAT(R)

More information

Longitudinal Data Analysis Using SAS Paul D. Allison, Ph.D. Upcoming Seminar: October 13-14, 2017, Boston, Massachusetts

Longitudinal Data Analysis Using SAS Paul D. Allison, Ph.D. Upcoming Seminar: October 13-14, 2017, Boston, Massachusetts Longitudinal Data Analysis Using SAS Paul D. Allison, Ph.D. Upcoming Seminar: October 13-14, 217, Boston, Massachusetts Outline 1. Opportunities and challenges of panel data. a. Data requirements b. Control

More information

GOODNESS-OF-FIT FOR GEE: AN EXAMPLE WITH MENTAL HEALTH SERVICE UTILIZATION

GOODNESS-OF-FIT FOR GEE: AN EXAMPLE WITH MENTAL HEALTH SERVICE UTILIZATION STATISTICS IN MEDICINE GOODNESS-OF-FIT FOR GEE: AN EXAMPLE WITH MENTAL HEALTH SERVICE UTILIZATION NICHOLAS J. HORTON*, JUDITH D. BEBCHUK, CHERYL L. JONES, STUART R. LIPSITZ, PAUL J. CATALANO, GWENDOLYN

More information

A measure of partial association for generalized estimating equations

A measure of partial association for generalized estimating equations A measure of partial association for generalized estimating equations Sundar Natarajan, 1 Stuart Lipsitz, 2 Michael Parzen 3 and Stephen Lipshultz 4 1 Department of Medicine, New York University School

More information

Longitudinal Modeling with Logistic Regression

Longitudinal Modeling with Logistic Regression Newsom 1 Longitudinal Modeling with Logistic Regression Longitudinal designs involve repeated measurements of the same individuals over time There are two general classes of analyses that correspond to

More information

Figure 36: Respiratory infection versus time for the first 49 children.

Figure 36: Respiratory infection versus time for the first 49 children. y BINARY DATA MODELS We devote an entire chapter to binary data since such data are challenging, both in terms of modeling the dependence, and parameter interpretation. We again consider mixed effects

More information

Approximate Median Regression via the Box-Cox Transformation

Approximate Median Regression via the Box-Cox Transformation Approximate Median Regression via the Box-Cox Transformation Garrett M. Fitzmaurice,StuartR.Lipsitz, and Michael Parzen Median regression is used increasingly in many different areas of applications. The

More information

Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association for binary responses

Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association for binary responses Outline Marginal model Examples of marginal model GEE1 Augmented GEE GEE1.5 GEE2 Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association

More information

Mantel-Haenszel Test Statistics. for Correlated Binary Data. Department of Statistics, North Carolina State University. Raleigh, NC

Mantel-Haenszel Test Statistics. for Correlated Binary Data. Department of Statistics, North Carolina State University. Raleigh, NC Mantel-Haenszel Test Statistics for Correlated Binary Data by Jie Zhang and Dennis D. Boos Department of Statistics, North Carolina State University Raleigh, NC 27695-8203 tel: (919) 515-1918 fax: (919)

More information

Strati cation in Multivariate Modeling

Strati cation in Multivariate Modeling Strati cation in Multivariate Modeling Tihomir Asparouhov Muthen & Muthen Mplus Web Notes: No. 9 Version 2, December 16, 2004 1 The author is thankful to Bengt Muthen for his guidance, to Linda Muthen

More information

Journal of Statistical Software

Journal of Statistical Software JSS Journal of Statistical Software January 2006, Volume 15, Issue 2. http://www.jstatsoft.org/ The R Package geepack for Generalized Estimating Equations Ulrich Halekoh Danish Institute of Agricultural

More information

Biostatistics Workshop Longitudinal Data Analysis. Session 4 GARRETT FITZMAURICE

Biostatistics Workshop Longitudinal Data Analysis. Session 4 GARRETT FITZMAURICE Biostatistics Workshop 2008 Longitudinal Data Analysis Session 4 GARRETT FITZMAURICE Harvard University 1 LINEAR MIXED EFFECTS MODELS Motivating Example: Influence of Menarche on Changes in Body Fat Prospective

More information

ANALYSIS OF CORRELATED DATA SAMPLING FROM CLUSTERS CLUSTER-RANDOMIZED TRIALS

ANALYSIS OF CORRELATED DATA SAMPLING FROM CLUSTERS CLUSTER-RANDOMIZED TRIALS ANALYSIS OF CORRELATED DATA SAMPLING FROM CLUSTERS CLUSTER-RANDOMIZED TRIALS Background Independent observations: Short review of well-known facts Comparison of two groups continuous response Control group:

More information

Compare Predicted Counts between Groups of Zero Truncated Poisson Regression Model based on Recycled Predictions Method

Compare Predicted Counts between Groups of Zero Truncated Poisson Regression Model based on Recycled Predictions Method Compare Predicted Counts between Groups of Zero Truncated Poisson Regression Model based on Recycled Predictions Method Yan Wang 1, Michael Ong 2, Honghu Liu 1,2,3 1 Department of Biostatistics, UCLA School

More information

Robust Bayesian Variable Selection for Modeling Mean Medical Costs

Robust Bayesian Variable Selection for Modeling Mean Medical Costs Robust Bayesian Variable Selection for Modeling Mean Medical Costs Grace Yoon 1,, Wenxin Jiang 2, Lei Liu 3 and Ya-Chen T. Shih 4 1 Department of Statistics, Texas A&M University 2 Department of Statistics,

More information

Justine Shults 1,,, Wenguang Sun 1,XinTu 2, Hanjoo Kim 1, Jay Amsterdam 3, Joseph M. Hilbe 4,5 and Thomas Ten-Have 1

Justine Shults 1,,, Wenguang Sun 1,XinTu 2, Hanjoo Kim 1, Jay Amsterdam 3, Joseph M. Hilbe 4,5 and Thomas Ten-Have 1 STATISTICS IN MEDICINE Statist. Med. 2009; 28:2338 2355 Published online 26 May 2009 in Wiley InterScience (www.interscience.wiley.com).3622 A comparison of several approaches for choosing between working

More information

Generalized Linear Models for Non-Normal Data

Generalized Linear Models for Non-Normal Data Generalized Linear Models for Non-Normal Data Today s Class: 3 parts of a generalized model Models for binary outcomes Complications for generalized multivariate or multilevel models SPLH 861: Lecture

More information

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011) Ron Heck, Fall 2011 1 EDEP 768E: Seminar in Multilevel Modeling rev. January 3, 2012 (see footnote) Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October

More information

Finite Population Sampling and Inference

Finite Population Sampling and Inference Finite Population Sampling and Inference A Prediction Approach RICHARD VALLIANT ALAN H. DORFMAN RICHARD M. ROYALL A Wiley-Interscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane

More information

GEE Analysis of negatively correlated binary responses: a caution

GEE Analysis of negatively correlated binary responses: a caution STATISTICS IN MEDICINE Statist. Med. 2000; 19:715}722 GEE Analysis of negatively correlated binary responses: a caution James A. Hanley *, Abdissa Negassa and Michael D. deb. Edwardes Department of Epidemiology

More information

Modelling heterogeneous variance-covariance components in two-level multilevel models with application to school effects educational research

Modelling heterogeneous variance-covariance components in two-level multilevel models with application to school effects educational research Modelling heterogeneous variance-covariance components in two-level multilevel models with application to school effects educational research Research Methods Festival Oxford 9 th July 014 George Leckie

More information

In Praise of the Listwise-Deletion Method (Perhaps with Reweighting)

In Praise of the Listwise-Deletion Method (Perhaps with Reweighting) In Praise of the Listwise-Deletion Method (Perhaps with Reweighting) Phillip S. Kott RTI International NISS Worshop on the Analysis of Complex Survey Data With Missing Item Values October 17, 2014 1 RTI

More information

For more information about how to cite these materials visit

For more information about how to cite these materials visit Author(s): Kerby Shedden, Ph.D., 2010 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/

More information

Complex sample design effects and inference for mental health survey data

Complex sample design effects and inference for mental health survey data International Journal of Methods in Psychiatric Research, Volume 7, Number 1 Complex sample design effects and inference for mental health survey data STEVEN G. HEERINGA, Division of Surveys and Technologies,

More information

Introducing Generalized Linear Models: Logistic Regression

Introducing Generalized Linear Models: Logistic Regression Ron Heck, Summer 2012 Seminars 1 Multilevel Regression Models and Their Applications Seminar Introducing Generalized Linear Models: Logistic Regression The generalized linear model (GLM) represents and

More information

Generalized linear models with a coarsened covariate

Generalized linear models with a coarsened covariate Appl. Statist. (2004) 53, Part 2, pp. 279 292 Generalized linear models with a coarsened covariate Stuart Lipsitz, Medical University of South Carolina, Charleston, USA Michael Parzen, University of Chicago,

More information

The equivalence of the Maximum Likelihood and a modified Least Squares for a case of Generalized Linear Model

The equivalence of the Maximum Likelihood and a modified Least Squares for a case of Generalized Linear Model Applied and Computational Mathematics 2014; 3(5): 268-272 Published online November 10, 2014 (http://www.sciencepublishinggroup.com/j/acm) doi: 10.11648/j.acm.20140305.22 ISSN: 2328-5605 (Print); ISSN:

More information

Describing Stratified Multiple Responses for Sparse Data

Describing Stratified Multiple Responses for Sparse Data Describing Stratified Multiple Responses for Sparse Data Ivy Liu School of Mathematical and Computing Sciences Victoria University Wellington, New Zealand June 28, 2004 SUMMARY Surveys often contain qualitative

More information

Package mmm. R topics documented: February 20, Type Package

Package mmm. R topics documented: February 20, Type Package Package mmm February 20, 2015 Type Package Title an R package for analyzing multivariate longitudinal data with multivariate marginal models Version 1.4 Date 2014-01-01 Author Ozgur Asar, Ozlem Ilk Depends

More information

An R # Statistic for Fixed Effects in the Linear Mixed Model and Extension to the GLMM

An R # Statistic for Fixed Effects in the Linear Mixed Model and Extension to the GLMM An R Statistic for Fixed Effects in the Linear Mixed Model and Extension to the GLMM Lloyd J. Edwards, Ph.D. UNC-CH Department of Biostatistics email: Lloyd_Edwards@unc.edu Presented to the Department

More information

Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p )

Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p ) Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p. 376-390) BIO656 2009 Goal: To see if a major health-care reform which took place in 1997 in Germany was

More information

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis Review Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 22 Chapter 1: background Nominal, ordinal, interval data. Distributions: Poisson, binomial,

More information

Multinomial Logistic Regression Models

Multinomial Logistic Regression Models Stat 544, Lecture 19 1 Multinomial Logistic Regression Models Polytomous responses. Logistic regression can be extended to handle responses that are polytomous, i.e. taking r>2 categories. (Note: The word

More information

Introduction to lnmle: An R Package for Marginally Specified Logistic-Normal Models for Longitudinal Binary Data

Introduction to lnmle: An R Package for Marginally Specified Logistic-Normal Models for Longitudinal Binary Data Introduction to lnmle: An R Package for Marginally Specified Logistic-Normal Models for Longitudinal Binary Data Bryan A. Comstock and Patrick J. Heagerty Department of Biostatistics University of Washington

More information

Simulating Longer Vectors of Correlated Binary Random Variables via Multinomial Sampling

Simulating Longer Vectors of Correlated Binary Random Variables via Multinomial Sampling Simulating Longer Vectors of Correlated Binary Random Variables via Multinomial Sampling J. Shults a a Department of Biostatistics, University of Pennsylvania, PA 19104, USA (v4.0 released January 2015)

More information

Survival Analysis for Case-Cohort Studies

Survival Analysis for Case-Cohort Studies Survival Analysis for ase-ohort Studies Petr Klášterecký Dept. of Probability and Mathematical Statistics, Faculty of Mathematics and Physics, harles University, Prague, zech Republic e-mail: petr.klasterecky@matfyz.cz

More information

Introduction to mtm: An R Package for Marginalized Transition Models

Introduction to mtm: An R Package for Marginalized Transition Models Introduction to mtm: An R Package for Marginalized Transition Models Bryan A. Comstock and Patrick J. Heagerty Department of Biostatistics University of Washington 1 Introduction Marginalized transition

More information

The impact of covariance misspecification in multivariate Gaussian mixtures on estimation and inference

The impact of covariance misspecification in multivariate Gaussian mixtures on estimation and inference The impact of covariance misspecification in multivariate Gaussian mixtures on estimation and inference An application to longitudinal modeling Brianna Heggeseth with Nicholas Jewell Department of Statistics

More information

SAS Macro for Generalized Method of Moments Estimation for Longitudinal Data with Time-Dependent Covariates

SAS Macro for Generalized Method of Moments Estimation for Longitudinal Data with Time-Dependent Covariates Paper 10260-2016 SAS Macro for Generalized Method of Moments Estimation for Longitudinal Data with Time-Dependent Covariates Katherine Cai, Jeffrey Wilson, Arizona State University ABSTRACT Longitudinal

More information

Assessing intra, inter and total agreement with replicated readings

Assessing intra, inter and total agreement with replicated readings STATISTICS IN MEDICINE Statist. Med. 2005; 24:1371 1384 Published online 30 November 2004 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/sim.2006 Assessing intra, inter and total agreement

More information

Statistical methods for marginal inference from multivariate ordinal data Nooraee, Nazanin

Statistical methods for marginal inference from multivariate ordinal data Nooraee, Nazanin University of Groningen Statistical methods for marginal inference from multivariate ordinal data Nooraee, Nazanin DOI: 10.1016/j.csda.2014.03.009 IMPORTANT NOTE: You are advised to consult the publisher's

More information

Confidence intervals for the variance component of random-effects linear models

Confidence intervals for the variance component of random-effects linear models The Stata Journal (2004) 4, Number 4, pp. 429 435 Confidence intervals for the variance component of random-effects linear models Matteo Bottai Arnold School of Public Health University of South Carolina

More information

5.3 LINEARIZATION METHOD. Linearization Method for a Nonlinear Estimator

5.3 LINEARIZATION METHOD. Linearization Method for a Nonlinear Estimator Linearization Method 141 properties that cover the most common types of complex sampling designs nonlinear estimators Approximative variance estimators can be used for variance estimation of a nonlinear

More information

BIOS 312: Precision of Statistical Inference

BIOS 312: Precision of Statistical Inference and Power/Sample Size and Standard Errors BIOS 312: of Statistical Inference Chris Slaughter Department of Biostatistics, Vanderbilt University School of Medicine January 3, 2013 Outline Overview and Power/Sample

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2009 Paper 251 Nonparametric population average models: deriving the form of approximate population

More information

Non-maximum likelihood estimation and statistical inference for linear and nonlinear mixed models

Non-maximum likelihood estimation and statistical inference for linear and nonlinear mixed models Optimum Design for Mixed Effects Non-Linear and generalized Linear Models Cambridge, August 9-12, 2011 Non-maximum likelihood estimation and statistical inference for linear and nonlinear mixed models

More information

Pseudo-score confidence intervals for parameters in discrete statistical models

Pseudo-score confidence intervals for parameters in discrete statistical models Biometrika Advance Access published January 14, 2010 Biometrika (2009), pp. 1 8 C 2009 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asp074 Pseudo-score confidence intervals for parameters

More information

LOGISTICS REGRESSION FOR SAMPLE SURVEYS

LOGISTICS REGRESSION FOR SAMPLE SURVEYS 4 LOGISTICS REGRESSION FOR SAMPLE SURVEYS Hukum Chandra Indian Agricultural Statistics Research Institute, New Delhi-002 4. INTRODUCTION Researchers use sample survey methodology to obtain information

More information

On the Use of the Bross Formula for Prioritizing Covariates in the High-Dimensional Propensity Score Algorithm

On the Use of the Bross Formula for Prioritizing Covariates in the High-Dimensional Propensity Score Algorithm On the Use of the Bross Formula for Prioritizing Covariates in the High-Dimensional Propensity Score Algorithm Richard Wyss 1, Bruce Fireman 2, Jeremy A. Rassen 3, Sebastian Schneeweiss 1 Author Affiliations:

More information

Bayesian Multivariate Logistic Regression

Bayesian Multivariate Logistic Regression Bayesian Multivariate Logistic Regression Sean M. O Brien and David B. Dunson Biostatistics Branch National Institute of Environmental Health Sciences Research Triangle Park, NC 1 Goals Brief review of

More information

Gauge Plots. Gauge Plots JAPANESE BEETLE DATA MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA JAPANESE BEETLE DATA

Gauge Plots. Gauge Plots JAPANESE BEETLE DATA MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA JAPANESE BEETLE DATA JAPANESE BEETLE DATA 6 MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA Gauge Plots TuscaroraLisa Central Madsen Fairways, 996 January 9, 7 Grubs Adult Activity Grub Counts 6 8 Organic Matter

More information

A Design-Sensitive Approach to Fitting Regression Models With Complex Survey Data

A Design-Sensitive Approach to Fitting Regression Models With Complex Survey Data A Design-Sensitive Approach to Fitting Regression Models With Complex Survey Data Phillip S. Kott RI International Rockville, MD Introduction: What Does Fitting a Regression Model with Survey Data Mean?

More information

ABSTRACT INTRODUCTION. SESUG Paper

ABSTRACT INTRODUCTION. SESUG Paper SESUG Paper 140-2017 Backward Variable Selection for Logistic Regression Based on Percentage Change in Odds Ratio Evan Kwiatkowski, University of North Carolina at Chapel Hill; Hannah Crooke, PAREXEL International

More information

Standardization methods have been used in epidemiology. Marginal Structural Models as a Tool for Standardization ORIGINAL ARTICLE

Standardization methods have been used in epidemiology. Marginal Structural Models as a Tool for Standardization ORIGINAL ARTICLE ORIGINAL ARTICLE Marginal Structural Models as a Tool for Standardization Tosiya Sato and Yutaka Matsuyama Abstract: In this article, we show the general relation between standardization methods and marginal

More information

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression Logistic Regression Usual linear regression (repetition) y i = b 0 + b 1 x 1i + b 2 x 2i + e i, e i N(0,σ 2 ) or: y i N(b 0 + b 1 x 1i + b 2 x 2i,σ 2 ) Example (DGA, p. 336): E(PEmax) = 47.355 + 1.024

More information

Models for Longitudinal Analysis of Binary Response Data for Identifying the Effects of Different Treatments on Insomnia

Models for Longitudinal Analysis of Binary Response Data for Identifying the Effects of Different Treatments on Insomnia Applied Mathematical Sciences, Vol. 4, 2010, no. 62, 3067-3082 Models for Longitudinal Analysis of Binary Response Data for Identifying the Effects of Different Treatments on Insomnia Z. Rezaei Ghahroodi

More information

A SAS MACRO FOR ESTIMATING SAMPLING ERRORS OF MEANS AND PROPORTIONS IN THE CONTEXT OF STRATIFIED MULTISTAGE CLUSTER SAMPLING

A SAS MACRO FOR ESTIMATING SAMPLING ERRORS OF MEANS AND PROPORTIONS IN THE CONTEXT OF STRATIFIED MULTISTAGE CLUSTER SAMPLING A SAS MACRO FOR ESTIMATING SAMPLING ERRORS OF MEANS AND PROPORTIONS IN THE CONTEXT OF STRATIFIED MULTISTAGE CLUSTER SAMPLING Mamadou Thiam, Alfredo Aliaga, Macro International, Inc. Mamadou Thiam, Macro

More information

Logistic Regression Models for Multinomial and Ordinal Outcomes

Logistic Regression Models for Multinomial and Ordinal Outcomes CHAPTER 8 Logistic Regression Models for Multinomial and Ordinal Outcomes 8.1 THE MULTINOMIAL LOGISTIC REGRESSION MODEL 8.1.1 Introduction to the Model and Estimation of Model Parameters In the previous

More information

Generalized logit models for nominal multinomial responses. Local odds ratios

Generalized logit models for nominal multinomial responses. Local odds ratios Generalized logit models for nominal multinomial responses Categorical Data Analysis, Summer 2015 1/17 Local odds ratios Y 1 2 3 4 1 π 11 π 12 π 13 π 14 π 1+ X 2 π 21 π 22 π 23 π 24 π 2+ 3 π 31 π 32 π

More information

Comparative study on the complex samples design features using SPSS Complex Samples, SAS Complex Samples and WesVarPc.

Comparative study on the complex samples design features using SPSS Complex Samples, SAS Complex Samples and WesVarPc. Comparative study on the complex samples design features using SPSS Complex Samples, SAS Complex Samples and WesVarPc. Nur Faezah Jamal, Nor Mariyah Abdul Ghafar, Isma Liana Ismail and Mohd Zaki Awang

More information

ONE MORE TIME ABOUT R 2 MEASURES OF FIT IN LOGISTIC REGRESSION

ONE MORE TIME ABOUT R 2 MEASURES OF FIT IN LOGISTIC REGRESSION ONE MORE TIME ABOUT R 2 MEASURES OF FIT IN LOGISTIC REGRESSION Ernest S. Shtatland, Ken Kleinman, Emily M. Cain Harvard Medical School, Harvard Pilgrim Health Care, Boston, MA ABSTRACT In logistic regression,

More information

Group comparisons in logit and probit using predicted probabilities 1

Group comparisons in logit and probit using predicted probabilities 1 Group comparisons in logit and probit using predicted probabilities 1 J. Scott Long Indiana University May 27, 2009 Abstract The comparison of groups in regression models for binary outcomes is complicated

More information

Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 2017, Chicago, Illinois

Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 2017, Chicago, Illinois Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 217, Chicago, Illinois Outline 1. Opportunities and challenges of panel data. a. Data requirements b. Control

More information

1 Introduction A common problem in categorical data analysis is to determine the effect of explanatory variables V on a binary outcome D of interest.

1 Introduction A common problem in categorical data analysis is to determine the effect of explanatory variables V on a binary outcome D of interest. Conditional and Unconditional Categorical Regression Models with Missing Covariates Glen A. Satten and Raymond J. Carroll Λ December 4, 1999 Abstract We consider methods for analyzing categorical regression

More information

INTRODUCTION TO MULTILEVEL MODELLING FOR REPEATED MEASURES DATA. Belfast 9 th June to 10 th June, 2011

INTRODUCTION TO MULTILEVEL MODELLING FOR REPEATED MEASURES DATA. Belfast 9 th June to 10 th June, 2011 INTRODUCTION TO MULTILEVEL MODELLING FOR REPEATED MEASURES DATA Belfast 9 th June to 10 th June, 2011 Dr James J Brown Southampton Statistical Sciences Research Institute (UoS) ADMIN Research Centre (IoE

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

ANALYSING BINARY DATA IN A REPEATED MEASUREMENTS SETTING USING SAS

ANALYSING BINARY DATA IN A REPEATED MEASUREMENTS SETTING USING SAS Libraries 1997-9th Annual Conference Proceedings ANALYSING BINARY DATA IN A REPEATED MEASUREMENTS SETTING USING SAS Eleanor F. Allan Follow this and additional works at: http://newprairiepress.org/agstatconference

More information

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study TECHNICAL REPORT # 59 MAY 2013 Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study Sergey Tarima, Peng He, Tao Wang, Aniko Szabo Division of Biostatistics,

More information

Semiparametric Generalized Linear Models
