A measure of partial association for generalized estimating equations


Sundar Natarajan,1 Stuart Lipsitz,2 Michael Parzen3 and Stephen Lipshultz4
1 Department of Medicine, New York University School of Medicine, and the VA New York Harbor Healthcare System, US
2 Division of General Internal Medicine, Brigham and Women's Hospital, US
3 Goizueta Business School, Emory University, US
4 Department of Pediatrics, University of Miami School of Medicine, US

Abstract: In a regression setting, the partial correlation coefficient is often used as a measure of standardized partial association between the outcome $y$ and each of the covariates in $x = [x_1, \ldots, x_K]$. In a linear regression model estimated using ordinary least squares, with $y$ as the response, the estimated partial correlation coefficient between $y$ and $x_k$ can be shown to be a monotone function, denoted $f(z)$, of the Z-statistic for testing if the regression coefficient of $x_k$ is 0. When $y$ is non-normal and the data are clustered so that $y$ and $x$ are obtained from each member of a cluster, generalized estimating equations are often used to estimate the regression parameters of the model for $y$ given $x$. In this paper, when using generalized estimating equations, we propose using the above transformation $f(z)$ of the GEE Z-statistic as a measure of partial association. Further, we also propose a coefficient of determination to measure the strength of association between the outcome variable and all of the covariates. To illustrate the method, we use a longitudinal study of the binary outcome heart toxicity from chemotherapy in children with leukaemia or sarcoma.

Key words: coefficient of determination; longitudinal data; repeated measures

Received November 2004; revised January 2006; accepted March

1 Introduction

In a normal linear regression setting, investigators are often interested in measuring the partial association between the outcome and the kth covariate, controlling for the other covariates.
If one uses ordinary least squares (OLS) to estimate the regression parameters, the usual estimate of the partial correlation coefficient (Magee, 1990) is a monotone function, denoted $f(z)$, of the Z-statistic for testing if the kth regression coefficient is 0. In an increasing number of biomedical studies, the data are clustered so that an outcome and covariate vector are obtained from each member of a cluster, and generalized estimating equations (GEE) (Liang and Zeger, 1986; Prentice, 1988) are used to estimate the regression parameters. For such repeated measures and clustered data settings, investigators are still often interested in obtaining a measure of partial association between the outcome and a covariate for a given member of the cluster. In this paper, we propose using the above transformation $f(z)$ of the GEE Z-statistic as a measure of partial association. Further, in repeated measures and clustered data settings, investigators are often interested in measuring the strength of association between the outcome and all of the covariates, as measured by a coefficient of determination. In normal linear regression, the coefficient of determination $R^2$ can be shown to be the same monotone function discussed earlier, $f(w)$, of the Wald statistic $W$ for testing if all parameters (except the intercept) are 0. In this paper, for the GEE approach, we propose a coefficient of determination that is the same function $f(w)$ of a GEE Wald statistic $W$ for testing that all regression coefficients (except the intercept) are 0.

Address for correspondence: Sundar Natarajan, 423 East 23rd Street, Room North, New York, NY. E-mail: sundar.natarajan@med.nyu.edu
2007 SAGE Publications

To illustrate the method, we use a dataset from a longitudinal study (Lipshultz et al., 1995) to explore the cardiotoxic effects of doxorubicin chemotherapy for the treatment of acute lymphoblastic leukaemia or osteogenic sarcoma in childhood. There are 115 patients in the study. The outcome measured over time is abnormal wall stress of the heart (yes, no); it is measured from the end of chemotherapy until the present time. The maximum follow-up time is 18.5 years from the end of treatment. Children were not measured at pre-specified times, so the observation times are unequally spaced. The covariates of interest are the cumulative dose of doxorubicin, sex, age at end of treatment and time since the end of treatment. Table 1 shows data from 10 of the 115 patients on file. We use GEE to estimate the logistic regression model for the probability of abnormal wall stress as a function of these covariates.
We are interested in a measure of partial association at a given time between wall stress and each of the covariates. Investigators are often interested in standardized measures of partial association in order to directly compare the degree to which each of the covariates helps explain the variance in the outcome variable. In particular, as seen in Table 1, since the covariates are measured on different scales (dose is in milligrams; age and time are in years; and sex is dichotomous), it is difficult to directly compare the magnitudes of the covariate effects; the investigators wanted a standardized measure of partial association with which to compare the covariate effects. For example, in clinical practice, it would be important to quantify the degree to which separate variables explain the outcome; this would allow modification or refinement of therapeutic strategies.

In Section 2, we review the partial correlation coefficient when using OLS, and, in Section 3, we use the form of the partial correlation coefficient in OLS to define a measure of partial association for GEE. Section 4 discusses the coefficient of determination for OLS, and again uses the form of the coefficient of determination in OLS to define a coefficient of determination for GEE. Section 5 analyses the above longitudinal example; Section 6 compares the asymptotic properties of the measures of partial association, and, in Section 7, we perform simulations to explore the finite sample properties of our proposed coefficient of determination.

Table 1 Data from cardiotoxicity study
Columns: Patient; Time since treatment (a); Wall stress (b); Dose (c); Age at end of treatment (a); Sex
(Only the Sex column, 14 rows of F followed by 16 rows of M, survives in this transcription; the other cell values are missing.)
Notes: a years. b 1 = abnormal, 0 = normal. c in mg.

2 Review of ordinary least squares

We have a random sample of $N$ independent individuals, in which the ith individual has response $Y_i$ and a $K \times 1$ covariate vector $x_i = [x_{i1}, \ldots, x_{iK}]'$. The response $Y_i$ is assumed to be normal with mean

$$E(Y_i \mid x_i) = \beta_1 x_{i1} + \cdots + \beta_K x_{iK}$$

and variance $\mathrm{Var}(Y_i \mid x_i) = \sigma^2$; if the model has an intercept, then $x_{i1} = 1$ for all $i$. We let $\hat\beta_k$ be the maximum likelihood estimator of $\beta_k$, $\hat{Y}_i = \hat\beta_1 x_{i1} + \cdots + \hat\beta_K x_{iK}$ be the maximum likelihood estimator of $E(Y_i \mid x_i)$, and

$$\hat\sigma^2 = \frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{N} \tag{2.1}$$

be the maximum likelihood estimator of $\sigma^2$. The variance estimate of $\hat\beta_k$, $\widehat{\mathrm{Var}}(\hat\beta_k)$, is the kth diagonal element of the $(X'X)^{-1}$ matrix times $\hat\sigma^2$, where the ith row of $X$ is $x_i'$.

Next, we discuss the partial correlation coefficient, and show how the partial correlation coefficient can be written as a function of the usual Z-statistic for testing $H_0: \beta_k = 0$. Without loss of generality, suppose we are interested in the partial correlation coefficient between $Y_i$ and $x_{iK}$, say $\rho_K$, controlling for the other covariates. Neter et al. (1996) show that the square of the estimated partial correlation coefficient is

$$\hat\rho_K^2 = \frac{SSE(X_1, \ldots, X_{K-1}) - SSE(X_1, \ldots, X_K)}{SSE(X_1, \ldots, X_{K-1})} = 1 - \frac{SSE(X_1, \ldots, X_K)}{SSE(X_1, \ldots, X_{K-1})}, \tag{2.2}$$

where $SSE(X_1, \ldots, X_K)$ is the error sum of squares with all covariates in the model and $SSE(X_1, \ldots, X_{K-1})$ is the error sum of squares with all covariates except $x_{iK}$ in the model. Further, the F-statistic for testing $H_0: \beta_K = 0$ can be written as

$$F_K = \frac{SSE(X_1, \ldots, X_{K-1}) - SSE(X_1, \ldots, X_K)}{SSE(X_1, \ldots, X_K)/(N-K)},$$

or, equivalently,

$$1 + \frac{F_K}{N-K} = \frac{SSE(X_1, \ldots, X_{K-1})}{SSE(X_1, \ldots, X_K)}. \tag{2.3}$$

After substituting (2.3) into (2.2), we obtain

$$\hat\rho_K^2 = 1 - \frac{1}{1 + F_K/(N-K)} = \frac{F_K/(N-K)}{1 + F_K/(N-K)}. \tag{2.4}$$

Finally, note that we can also write $F_K$ as

$$F_K = \frac{(N-K)\, Z_K^2}{N}, \tag{2.5}$$

where

$$Z_k = \frac{\hat\beta_k}{\sqrt{\widehat{\mathrm{Var}}(\hat\beta_k)}} \tag{2.6}$$

is the usual Wald statistic for testing $H_0: \beta_k = 0$. Substituting (2.5) in (2.4) and taking the square root, we obtain

$$\hat\rho_k = \frac{Z_k/\sqrt{N}}{\sqrt{1 + Z_k^2/N}}. \tag{2.7}$$

In the following section, we use this form of the partial correlation coefficient to define a new measure of partial association for GEE.

3 Repeated measures and generalized estimating equations

Repeated measures studies arise often in biomedical research. In repeated measures studies, the basic sampling unit is a group or cluster of subjects; a measurement is made on each subject within the cluster. The observations within the cluster, one from each subject, constitute the repeated measurements on the cluster. In a developmental toxicology study, the cluster is a litter and the newborn are the subjects within the cluster; in an eye disease study, the cluster is the person and the two eyes are the subjects within the cluster. In the example shown in Table 1, the cluster is the person and the repeated measures are the binary cardiotoxic measurements over time.

Thus, instead of a univariate response, each individual contributes $n_i$ responses. Individual $i$ ($i = 1, \ldots, N$) has an $n_i \times 1$ response vector $Y_i = [Y_{i1}, \ldots, Y_{in_i}]'$, where $Y_{ij}$ is a member of the exponential family, which includes normal, Bernoulli, Poisson and gamma random variables. There is also a $K \times 1$ covariate vector $x_{ij} = [x_{ij1}, \ldots, x_{ijK}]'$ associated with $Y_{ij}$, such that

$$\mu_{ij} = E(Y_{ij} \mid x_{ij}) = g(x_{ij}'\beta)$$

for some function $g(\cdot)$, and

$$\mathrm{Var}(Y_{ij} \mid x_{ij}) = h(\mu_{ij}), \tag{3.1}$$

for some function $h(\mu_{ij})$. For example, when $Y_{ij}$ is binary, and logistic regression is used,

$$\mu_{ij} = \frac{e^{x_{ij}'\beta}}{1 + e^{x_{ij}'\beta}} \quad \text{and} \quad \mathrm{Var}(Y_{ij} \mid x_{ij}) = \mu_{ij}(1 - \mu_{ij}).$$
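As a numerical check of the Section 2 identity, the following minimal sketch computes the squared partial correlation three ways: from the error sums of squares as in (2.2), from the F-statistic as in (2.4), and from the Z-statistic as in (2.7). The sums of squares, N and K are made-up illustrative values (not taken from the cardiotoxicity data), and the function names are hypothetical.

```python
import math

def rho_sq_from_sse(sse_reduced, sse_full):
    """Squared partial correlation from error sums of squares, as in (2.2)."""
    return (sse_reduced - sse_full) / sse_reduced

def f_stat(sse_reduced, sse_full, n, k):
    """F-statistic for H0: beta_K = 0 in a model with k regression parameters."""
    return (sse_reduced - sse_full) / (sse_full / (n - k))

def rho_from_z(z, n):
    """Transformation f(z) of (2.7): maps a Z-statistic to the (-1, 1) scale."""
    return (z / math.sqrt(n)) / math.sqrt(1.0 + z * z / n)

# Illustrative (made-up) inputs.
N, K = 50, 4
SSE_REDUCED, SSE_FULL = 120.0, 100.0

F = f_stat(SSE_REDUCED, SSE_FULL, N, K)
Z = math.sqrt(N * F / (N - K))  # invert (2.5): F = (N - K) Z^2 / N

rho_sq_sse = rho_sq_from_sse(SSE_REDUCED, SSE_FULL)   # route (2.2)
rho_sq_f = (F / (N - K)) / (1.0 + F / (N - K))        # route (2.4)
rho_sq_z = rho_from_z(Z, N) ** 2                      # route (2.7), squared

print(rho_sq_sse, rho_sq_f, rho_sq_z)
```

All three routes should agree up to floating-point error, which is the content of the derivation: the partial correlation depends on the data only through the Z-statistic and N.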

The GEE estimator of $\beta$, $\hat\beta$, is found by solving the estimating equations

$$u(\hat\beta) = \sum_{i=1}^{N} \hat{D}_i' \hat{V}_i^{-1} [Y_i - \mu_i(\hat\beta)] = 0, \tag{3.2}$$

where $\mu_i = [\mu_{i1}, \ldots, \mu_{in_i}]'$, $D_i = \partial\mu_i(\beta)/\partial\beta$, and $V_i$ is the $n_i \times n_i$ working covariance matrix of $Y_i$. This working covariance matrix is specified through a working correlation matrix. In particular, the correlation structure of $Y_i$ is accounted for by $R_i(\alpha)$, an $n_i \times n_i$ working correlation matrix, which is fully specified by an $s \times 1$ vector of unknown parameters $\alpha$. In (3.2), $V_i = A_i^{1/2} R_i(\alpha) A_i^{1/2}$, where $A_i$ is an $n_i \times n_i$ diagonal matrix with $\mathrm{Var}(Y_{ij} \mid x_{ij}) = h(\mu_{ij})$ as the jth diagonal entry.

The estimate of $\beta$ is obtained by plugging a consistent estimator of $\alpha$ into (3.2) and solving for $\hat\beta$ iteratively. At each iteration, the estimate of $\alpha$ from the working correlation matrix as well as the estimator $\hat\beta$ are refined. Liang and Zeger (1986) show that, under mild regularity conditions, $\hat\beta$ is a consistent estimator of $\beta$. A robust, sandwich estimator (White, 1982) can be used to consistently estimate the covariance matrix of $\hat\beta$, and is particularly useful since the working correlation may be misspecified.

To test $H_0: \beta_k = 0$, one can use the GEE Wald Z-statistic

$$Z_k = \frac{\hat\beta_k}{\sqrt{\widehat{\mathrm{Var}}(\hat\beta_k)}}, \tag{3.3}$$

where $\widehat{\mathrm{Var}}(\hat\beta_k)$ is the robust variance estimate. Making the analogy to (2.7), we define a measure of partial association between $Y_{ij}$ and $x_{ijk}$ to be

$$\hat\rho_k = \frac{Z_k/\sqrt{N}}{\sqrt{1 + Z_k^2/N}}. \tag{3.4}$$

Since

$$\hat\rho_k^2 = \frac{Z_k^2/N}{1 + Z_k^2/N}$$

is always between 0 and 1, $\hat\rho_k$ will always be in the interval $[-1, 1]$.

However, we now propose a slight modification to $\hat\rho_k$ in (3.4) because, for ordinary logistic regression, the Wald statistic in (3.3) has been shown to have poor properties as $\hat\beta_k$ gets large (Hauck and Donner, 1977). In particular, Hauck and Donner (1977) showed that the Wald statistic is not a monotone increasing function of the parameter estimate $\hat\beta_k$ as the distance between $\hat\beta_k$ and the null value (0) increases. In fact, as the distance between $\hat\beta_k$ and 0 increases, the Wald statistic increases to a certain point and then decreases toward 0. Basically, the Wald statistic $Z_k^2$ behaves this way because, as $\hat\beta_k \to \infty$, $\widehat{\mathrm{Var}}(\hat\beta_k)$ increases at a faster rate than $\hat\beta_k^2$. Because of this property, Hauck and Donner (1977) recommend that the Wald statistic in (2.6) not be used with logistic regression. Thus, since $Z_k^2$ is not a monotone increasing function of the parameter estimate $\hat\beta_k$, $\hat\rho_k^2$ in (3.4) will not be either.

In many data analysis problems, an alternative to the Wald statistic that has greater power is the likelihood ratio statistic. Unfortunately, there is no likelihood in GEE, so a likelihood ratio statistic cannot be used to form a measure of partial association for GEE. Here, we propose the use of the Wald statistic with the variance of $\hat\beta_k$ estimated under the null $H_0: \beta_k = 0$. To obtain the variance estimate under the null, say $\widetilde{\mathrm{Var}}(\hat\beta_k)$, one replaces $\hat\beta$ in the GEE robust variance $\widehat{\mathrm{Var}}(\hat\beta_k)$ with the estimate $\tilde\beta$ obtained under the null. Thus, our proposed Wald statistic for use in a measure of partial association is

$$\tilde{Z}_k = \frac{\hat\beta_k}{\sqrt{\widetilde{\mathrm{Var}}(\hat\beta_k)}}, \tag{3.5}$$

whose square, under the null, as $N \to \infty$, is approximately chi-square with 1 df. Then, we define the measure of partial association between $Y_{ij}$ and $x_{ijk}$ to be

$$\tilde\rho_k = \frac{\tilde{Z}_k/\sqrt{N}}{\sqrt{1 + \tilde{Z}_k^2/N}}. \tag{3.6}$$

The intuition is that $\tilde\rho_k$ transforms the test statistic $\tilde{Z}_k$ to a more intuitively appealing $(-1, 1)$ scale. In Section 6, we compare the asymptotic properties of (3.4) and (3.6) in a simple logistic regression setting without repeated measures to explore the properties of these two measures of partial association.

4 Extension to the coefficient of determination

Suppose we have a normal linear regression model as in Section 2, with the model $E(Y_i \mid x_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_K x_{iK} = x_i'\beta$, and variance $\mathrm{Var}(Y_i \mid x_i) = \sigma^2$.
We let $\hat\beta$ be the maximum likelihood estimator of $\beta$, and $\widehat{\mathrm{Var}}(\hat\beta) = \hat\sigma^2 (X'X)^{-1}$ be the estimated covariance matrix of $\hat\beta$, where $\hat\sigma^2$ is given by (2.1). To test $H_0: \beta_1 = \cdots = \beta_K = 0$, one can use the Wald statistic

$$Q = [C\hat\beta]' [C\, \widehat{\mathrm{Var}}(\hat\beta)\, C']^{-1} [C\hat\beta], \tag{4.1}$$

where $C$ is a $K \times (K+1)$ matrix with its first column having all 0s, and its last $K$ columns being the $K \times K$ identity matrix. Christensen (1996) shows that the coefficient of determination, $R^2$, equals

$$R^2 = \frac{Q/N}{1 + Q/N}. \tag{4.2}$$

For the repeated measures model $E(Y_{ij} \mid x_{ij}) = g(\beta_0 + \beta_1 x_{ij1} + \cdots + \beta_K x_{ijK})$, one can use a Wald test like (4.1) to test $H_0: \beta_1 = \cdots = \beta_K = 0$, and form an $R^2$ statistic like (4.2). However, because of the problems encountered using the Wald statistic discussed in the previous section, we propose the coefficient of determination

$$\tilde{R}^2 = \frac{\tilde{Q}/N}{1 + \tilde{Q}/N}, \tag{4.3}$$

where

$$\tilde{Q} = [C\hat\beta]' [C\, \widetilde{\mathrm{Var}}(\hat\beta)\, C']^{-1} [C\hat\beta] \tag{4.4}$$

is the Wald statistic with the GEE robust covariance matrix estimated under the null $H_0: \beta_1 = \cdots = \beta_K = 0$, and denoted by $\widetilde{\mathrm{Var}}(\hat\beta)$. Again, the intuition is that $\tilde{R}^2$ transforms the Wald test statistic $\tilde{Q}$ to a more intuitively appealing $(0, 1)$ scale. The properties of our proposed coefficient of determination in (4.3) are explored in Section 7.

Unlike linear regression using OLS, there is no guarantee that a model with additional covariates would have a larger $\tilde{R}^2$, although most of the time it will be true. If an additional covariate adds very little information, it is possible that this $\tilde{R}^2$ could decrease very slightly; this possibility is explored in simulations in Section 7. For likelihood-based methods, if one substituted a likelihood ratio statistic for the Wald statistic $Q$ in $R^2$, as proposed by Bohrnstedt and Knoke (1994), then it would be true that a model with additional covariates would have a larger value of $R^2$. Unfortunately, since there is no likelihood with GEE, this substitution cannot be used.

5 Example

Late cardiotoxic effects of the chemotherapy doxorubicin are increasingly a problem for patients who survive childhood cancer. Cardiotoxicity is often progressive and some patients have disabling symptoms. In our example (Lipshultz et al., 1995), the objective was to identify risk factors for late cardiotoxicity, as measured by abnormal wall stress of the heart over time: $Y_{ij}$ equals 1 if abnormal wall stress or 0 if normal wall stress at time $j$. The abnormal wall stress was determined by examining echocardiograms from $N = 115$ children and adults who had received cumulative doses of 45 to 550 mg of doxorubicin per square metre of body-surface area for the treatment of acute lymphoblastic leukaemia in childhood. The covariates are the cumulative dose (ranging from 45 to 550 mg); age at end of treatment (ranging from 1.5 to 20 years); time since the end of chemotherapy to the given wall stress measurement (ranging from 0 to 15.5 years); and sex (1 = female, 0 = male). Table 1 shows data from 10 of the 115 patients on file.

If we let $\pi_{ij} = \mathrm{pr}(Y_{ij} = 1 \mid x_{ij})$ be the probability of abnormal wall stress at the jth measurement time, then the logistic regression model of interest is

$$\mathrm{logit}(\pi_{ij}) = \beta_0 + \beta_1\, \mathrm{sex}_i + \beta_2\, \mathrm{age}_i + \beta_3\, \mathrm{dose}_i + \beta_4\, \tau_{ij},$$

where $\tau_{ij}$ is a function that maps the jth measurement ($j = 1, \ldots, n_i$) on individual $i$ to the time since the end of chemotherapy. Since the observation times are different for each individual, we modelled the correlation as autoregressive 1 (AR1), i.e., for a subject seen at times $j$ and $l$,

$$\mathrm{Corr}(Y_{ij}, Y_{il} \mid x_{ij}, x_{il}) = \alpha^{|\tau_{ij} - \tau_{il}|},$$

where $0 < \alpha < 1$. When $j = l$, there is perfect correlation and the correlation equals 1. When observations are far apart in time, then as $|\tau_{ij} - \tau_{il}| \to \infty$, $\alpha^{|\tau_{ij} - \tau_{il}|} \to 0$ and the observations are uncorrelated. However, to explore how the proposed measure of partial association is affected by different working correlation models, we also obtained GEE estimates under the naive assumption of independence, as well as under an exchangeable correlation structure, in which

$$\mathrm{Corr}(Y_{ij}, Y_{il} \mid x_{ij}, x_{il}) = \alpha, \quad \text{for all } j \neq l.$$
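The three working correlation structures just described can be written out for a single subject as follows. This is a minimal sketch in plain Python; the observation times and the value of alpha are made-up illustrative inputs, and the function names are hypothetical rather than taken from any GEE software.

```python
def ar1_corr(times, alpha):
    """AR1-type working correlation alpha**|t_j - t_l| for unequally spaced times."""
    return [[alpha ** abs(tj - tl) for tl in times] for tj in times]

def exchangeable_corr(n, alpha):
    """Exchangeable working correlation: 1 on the diagonal, alpha elsewhere."""
    return [[1.0 if j == l else alpha for l in range(n)] for j in range(n)]

def independence_corr(n):
    """Working independence: the identity matrix."""
    return exchangeable_corr(n, 0.0)

# Made-up observation times (years since end of treatment) for one subject.
times = [0.0, 1.0, 3.0]
R_ar1 = ar1_corr(times, 0.5)
R_exc = exchangeable_corr(len(times), 0.5)
R_ind = independence_corr(len(times))

for row in R_ar1:
    print(row)
```

With unequally spaced times the AR1 entries depend on the elapsed time between two measurements rather than on how many visits separate them, which is why this structure suits the study design described above.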
Here, we calculate the measure of partial association and the coefficient of determination using the Wald statistic with robust variance estimated under the null. Table 2 gives GEE estimates for the three working correlation models, as well as the estimated measures of partial association, and $R^2$ formed from sequentially going down the table (adding each covariate to the model as you go down the table). Note that the measures of partial association and coefficients of determination are very similar for all working correlation models; further note that $R^2$ increases as more covariates are added to the model. Using the estimates of $\beta$ from the AR1 working correlation model, dose is the most significant predictor, with a measure of partial association of about 0.43, meaning that an increase in dose increases the probability of an abnormal wall stress measurement.

Table 2 Estimates from the cardiac study for three working correlation models
Columns: Variable; Cov model; Estimate; Standard error; z-value; p-value; Partial corr; $R^2$
Rows: Intercept, Dose, Female, Age at Trt, Time since end, each under the IND, EXC and AR working correlation models
(The numeric entries of the table are missing from this transcription.)

In this example, the clinical investigator was not sure whether to scale dose as milligrams (as it is scaled in the current dataset) or grams; since the Z-statistic does not depend on linear transformations of the covariate, our measure of partial association is the same for either scale, so it does not matter which scale is chosen when we use our measure of partial association. Female gender is the next most significant predictor, with a 0.17 partial association, meaning that females have a higher probability of abnormal wall stress. Age at the end of treatment is the next most significant predictor, with a 0.15 partial association, meaning that an older patient is more likely to have abnormal wall stress. Finally, there appears to be no significant partial association between the time since the end of treatment and wall stress.

The overall measure $R^2 = 0.23$ may indicate that if other covariates were available, we might be able to find a better fitting model. Note that some coefficients of determination are constrained to be less than 1 for binary outcome data (discussed in the following sections), so that a value of 0.23 might actually mean a good fit; however, in the simulations reported below, our proposed coefficient of determination does not appear to be constrained to be less than 1, so that a value of 0.23 may indicate that the model is not a great fit. Although this is just an example, the measures of partial association were very similar for all working correlation models.
However, for some datasets (especially with time-varying covariates), more complex working correlation models (e.g., AR1 versus independence) will often lead to higher efficiency than simpler working correlation models, which in turn will lead to larger Wald statistics, and thus larger values of the partial associations.

6 Asymptotic study of measures of partial association

To compare (3.6) with (3.4) in a simple data setting without repeated measures, we consider logistic regression for a $(2 \times 2)$ table. Suppose that the binary covariate is denoted $X_i$ and the binary outcome is $Y_i$. In the most general case, with both $X_i$ and $Y_i$ random, the joint probability that $(X_i = j)$ and $(Y_i = k)$ is denoted by

$$p_{jk} = \mathrm{pr}[(X_i = j), (Y_i = k)],$$

for $j = 0, 1$, and $k = 0, 1$. For simple logistic regression of the Bernoulli outcome $Y_i$ versus the binary covariate $X_i$, the logistic regression coefficient of $X_i$, denoted by $\beta_1$, is the log odds ratio,

$$\beta_1 = \log(p_{11}) + \log(p_{00}) - \log(p_{10}) - \log(p_{01}).$$

The maximum likelihood estimate (MLE) of $\beta_1$, say $\hat\beta_1$, is calculated by replacing $p_{jk}$ in $\beta_1$ by the MLE $\hat{p}_{jk} = n_{jk}/N$, where $n_{jk}$ is the number of subjects with $(X_i = j)$ and $(Y_i = k)$, and $N$ is the total sample size. The variance of $\hat\beta_1$ under the alternative that the odds ratio does not equal 1 is

$$\mathrm{Var}(\hat\beta_1) = \frac{1}{N}\left(\frac{1}{p_{11}} + \frac{1}{p_{00}} + \frac{1}{p_{10}} + \frac{1}{p_{01}}\right).$$

The variance of the MLE under the null that the odds ratio equals 1 is

$$\mathrm{Var}_0(\hat\beta_1) = \frac{1}{N}\left(\frac{1}{p_{1+}p_{+1}} + \frac{1}{p_{0+}p_{+0}} + \frac{1}{p_{1+}p_{+0}} + \frac{1}{p_{0+}p_{+1}}\right),$$

where $p_{j+} = p_{j0} + p_{j1}$ and $p_{+k} = p_{0k} + p_{1k}$ are the marginal probabilities. Then, it can be easily shown that, as $N \to \infty$, $\hat\rho_1$ in (3.4) converges in probability to

$$\rho_{1a} = \frac{\beta_1 \big/ \sqrt{p_{11}^{-1} + p_{00}^{-1} + p_{10}^{-1} + p_{01}^{-1}}}{\sqrt{1 + \beta_1^2 \big/ \left[p_{11}^{-1} + p_{00}^{-1} + p_{10}^{-1} + p_{01}^{-1}\right]}},$$

and $\tilde\rho_1$ in (3.6) converges in probability to

$$\rho_{1o} = \frac{\beta_1 \big/ \sqrt{(p_{1+}p_{+1})^{-1} + (p_{0+}p_{+0})^{-1} + (p_{1+}p_{+0})^{-1} + (p_{0+}p_{+1})^{-1}}}{\sqrt{1 + \beta_1^2 \big/ \left[(p_{1+}p_{+1})^{-1} + (p_{0+}p_{+0})^{-1} + (p_{1+}p_{+0})^{-1} + (p_{0+}p_{+1})^{-1}\right]}}.$$
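The two limits above are simple to evaluate numerically. The sketch below is a minimal illustration, assuming a symmetric family of tables with $p_{00} = p_{11} = 0.5 - \epsilon$ and $p_{01} = p_{10} = \epsilon$; this family is chosen here for convenience and is not the scenario used for Figure 1. The function name is hypothetical.

```python
import math

def rho_limits(p00, p01, p10, p11):
    """Return (beta1, rho_1a, rho_1o) for a 2x2 table of cell probabilities p_jk."""
    beta1 = math.log(p11) + math.log(p00) - math.log(p10) - math.log(p01)
    # N * Var(beta1_hat) under the alternative.
    v_alt = 1 / p11 + 1 / p00 + 1 / p10 + 1 / p01
    # Marginal probabilities p_{j+} (rows, X) and p_{+k} (columns, Y).
    p0_row, p1_row = p00 + p01, p10 + p11
    p0_col, p1_col = p00 + p10, p01 + p11
    # N * Var(beta1_hat) under the null of independence.
    v_null = (1 / (p1_row * p1_col) + 1 / (p0_row * p0_col)
              + 1 / (p1_row * p0_col) + 1 / (p0_row * p1_col))
    rho_a = (beta1 / math.sqrt(v_alt)) / math.sqrt(1 + beta1 ** 2 / v_alt)
    rho_o = (beta1 / math.sqrt(v_null)) / math.sqrt(1 + beta1 ** 2 / v_null)
    return beta1, rho_a, rho_o

# Smaller eps means a larger log odds ratio in this illustrative family.
results = []
for eps in (0.2, 0.05, 0.01, 0.001):
    results.append(rho_limits(0.5 - eps, eps, eps, 0.5 - eps))

for beta1, rho_a, rho_o in results:
    print(f"beta1={beta1:6.3f}  rho_1a={rho_a:.3f}  rho_1o={rho_o:.3f}")
```

In this family the limit based on the variance under the alternative rises and then falls as the log odds ratio grows, while the limit based on the variance under the null keeps increasing toward 1, which is the Hauck and Donner behaviour discussed in Section 3.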

Figure 1 Asymptotic study for partial associations (vertical axis: Rho; horizontal axis: BETA1; methods: Wald, Null Wald)

Then, to compare $\rho_{1a}$ and $\rho_{1o}$, we consider the scenario in which the cell probabilities $p_{10} = 0.05$ and $p_{11} = 0.45$, and $p_{00}$ varies from 0.01 to 0.49 in increments of 0.05, and correspondingly, $p_{01}$ varies from 0.49 to 0.01, giving log-odds ratios ($\beta_1$) that range from 0 to 7.7, with the corresponding odds ratios starting at 1. A plot comparing $\rho_{1a}$ and $\rho_{1o}$ in this scenario is given in Figure 1. From Figure 1, we see that $\rho_{1o}$ (Wald $\rho$ with variance under the null) gives higher values than $\rho_{1a}$ (Wald $\rho$ with variance under the alternative) throughout the whole curve. The Wald $\rho$ with variance under the null appears to be converging to 1 as $\beta_1$ gets larger, whereas, as discussed in Hauck and Donner (1977), the usual Wald statistic in (3.3) has poor properties; it is not a monotone increasing function of the parameter estimate as the distance between the parameter estimate and the null value increases. This bears out in Figure 1, in which $\rho_{1a}$ increases slowly as $\beta_1$ increases from 0 to 4.5, and then decreases back toward 0 for $\beta_1 > 4.5$. This suggests that the Wald $\rho$ with variance under the alternative, $\hat\rho$, has poor properties, and should not be used. In particular, even though we have only considered a non-repeated measures case here, we expect similar results to hold for GEE, since maximum likelihood for ordinary logistic regression is a special case of GEE.

7 Simulations for coefficients of determination

In this section we study the finite sample performance of our proposed coefficients of determination. In order to compare our proposed coefficients to a well-established coefficient of determination, for example, the pseudo-$R^2$ proposed by Cox and Snell (1989) based on the likelihood-ratio statistic, we perform simulations in which the true model is an ordinary logistic regression for independent univariate observations. In particular, we have performed simulations for ordinary logistic regression with univariate outcomes instead of GEE for repeated measures data so that our proposed coefficient of determination can be compared to the coefficient of determination based on the likelihood (Cox and Snell, 1989). The pseudo-$R^2$ proposed by Cox and Snell (1989) based on the likelihood-ratio statistic is defined as

$$R^2_{LR} = 1 - \left[\frac{L(0)}{L(\hat\beta)}\right]^{2/n},$$

where $L(\hat\beta)$ denotes the likelihood for the fitted model, and $L(0)$ denotes the likelihood for the model in which all regression coefficients (except the intercept) are 0. Below, we compare the size of the three coefficients of determination (Cox and Snell's likelihood ratio, Wald, and Wald with variance estimated under the null) as a logistic regression coefficient gets larger. We also explore the number of times the Wald coefficients of determination decrease as an unimportant covariate (in which the true logistic regression coefficient is 0) is added to the model.

We formulate the true Bernoulli distribution from which we simulate as

$$p(y_i \mid x_{i1}, x_{i2}, x_{i3}, \beta) = p_i^{y_i} (1 - p_i)^{1 - y_i}, \tag{7.1}$$

where the logit of the probability of success equals

$$\mathrm{logit}[\mathrm{pr}(y_i = 1 \mid x_{i1}, x_{i2}, x_{i3}, \beta)] = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}, \tag{7.2}$$

$i = 1, \ldots, n$. Also, in the simulations, $\beta_3$ is varied from 0 to 5.5, and, for simplicity, we fixed the total sample size at $n = 400$. Here, for the covariate distributions, we let $X_{i1}$, $X_{i2}$ and $X_{i3}$ all be independent of each other, with $X_{i1}$ having a Bernoulli distribution with $\mathrm{pr}(X_{i1} = 1) = 0.5$; and $X_{i2}$ and $X_{i3}$ both having $N(0, 1)$ distributions.
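The Cox and Snell pseudo-$R^2$ defined above is most safely computed on the log scale, since the likelihoods themselves underflow for moderate n. The sketch below is a minimal illustration with a made-up outcome vector; the "fitted" probabilities are hypothetical numbers standing in for the output of a logistic regression fit, not actual maximum likelihood estimates, and the function names are assumptions.

```python
import math

def bernoulli_loglik(y, p):
    """Log-likelihood of independent Bernoulli outcomes y with success probabilities p."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

def cox_snell_r2(loglik_null, loglik_full, n):
    """1 - [L(0) / L(beta_hat)]^(2/n), evaluated via log-likelihoods."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_full) / n)

# Made-up data: 8 binary outcomes, half of them equal to 1.
y = [1, 0, 1, 1, 0, 0, 1, 0]
p_null = [0.5] * len(y)                              # intercept-only fit: sample proportion
p_full = [0.8, 0.3, 0.7, 0.9, 0.2, 0.4, 0.6, 0.1]    # hypothetical fitted probabilities

ll_null = bernoulli_loglik(y, p_null)
ll_full = bernoulli_loglik(y, p_full)
r2_lr = cox_snell_r2(ll_null, ll_full, len(y))
print(round(r2_lr, 3))
```

When the fitted model is no better than the intercept-only model the two log-likelihoods coincide and the measure is 0.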
We took one sample of 400 random vectors $(X_{i1}, X_{i2}, X_{i3})$, and fixed this covariate distribution for all simulations. We performed 12 sets of 1000 simulations, corresponding to varying $\beta_3$ from 0 to 5.5 in increments of 0.5.

First, we want to compare the size of the coefficients of determination as $\beta_3$ increases. In particular, we calculated the three $R^2$ measures when fitting the full model (7.2) with covariates $X_{i1}$, $X_{i2}$ and $X_{i3}$. Figure 2 plots the average value of the $R^2$s over the 1000 simulations at each value of $\beta_3$. The results from Figure 2 are similar to those in Figure 1. From Figure 2, we see that the Wald $R^2$ with variance under the null gives higher values than the other two. In fact, the Wald $R^2$ with variance under the null appears to be converging to 1 as $\beta_3$ gets larger. As discussed earlier, Hauck and Donner (1977) showed that the usual Wald statistic in (3.3) has poor properties; it is not a monotone increasing function of the parameter estimate as the distance between the parameter estimate and the null value increases. This bears out in Figure 2, in which the coefficient of determination based on the usual Wald statistic increases slowly as $\beta_3$ increases from 0 to 2.5, and then decreases back toward 0 for $\beta_3 > 2.5$.

Figure 2 Simulation results for coefficients of determination (vertical axis: R-Square; horizontal axis: BETA3; methods: Wald null, Likelihood, Wald)

The likelihood-ratio coefficient of determination (Cox and Snell, 1989) does not give as high values as the Wald $R^2$ with variance under the null; however, as discussed by Nagelkerke (1991), the likelihood-ratio $R^2$ can take on a maximum value of $1 - [L(0)]^{2/n}$, which is the reason it is always less than our proposed method. For the simulation with $\beta_3 = 5.5$, on average, the maximum value that the likelihood-ratio $R^2$ can take on is approximately 0.75; when we divide the average value of the likelihood-ratio $R^2$ at $\beta_3 = 5.5$ by 0.75, we get 0.82, which is much closer to 0.89, the value of the Wald $R^2$ with variance under the null. Thus, unlike the likelihood coefficient of determination proposed by Cox and Snell (1989), our proposed coefficient of determination does not appear to be constrained to be less than 1; our coefficient of determination appears to be more similar to the redefined likelihood coefficient of determination proposed by Nagelkerke (1991), which is not constrained to be less than 1. Note, as discussed in Section 4, in theory, under the weak assumption that $\tilde\beta_0$ is finite under the null $H_0: \beta_1 = \cdots = \beta_K = 0$, $\tilde{R}^2$ will converge to 1 if any of the $\hat\beta_k$s converge to $\pm\infty$; the simulations in this section confirm that result.

In the simulations, we also calculated the three $R^2$ measures when fitting the reduced model with covariates $X_{i1}$ and $X_{i2}$ (that is, dropping $X_{i3}$ out of the model).
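The upper bound on the likelihood-ratio $R^2$ and the rescaling by that bound can be sketched as follows. The example assumes an intercept-only null model fitted to n = 400 binary outcomes of which exactly half equal 1; the outcome prevalence is an assumption made here for illustration and is not stated in the text. The value 0.6 passed to the rescaling function is likewise a made-up input, and the function names are hypothetical.

```python
import math

def max_lr_r2(loglik_null, n):
    """Upper bound 1 - L(0)^(2/n) of the likelihood-ratio R^2, via the log-likelihood."""
    return 1.0 - math.exp(2.0 * loglik_null / n)

def rescaled_lr_r2(r2_lr, loglik_null, n):
    """Likelihood-ratio R^2 divided by its maximum attainable value."""
    return r2_lr / max_lr_r2(loglik_null, n)

n = 400
# Assumed balanced outcome: the intercept-only fit gives probability 0.5 to every subject.
loglik_null = n * math.log(0.5)

r2_max = max_lr_r2(loglik_null, n)
r2_rescaled = rescaled_lr_r2(0.6, loglik_null, n)   # 0.6 is an illustrative input
print(round(r2_max, 4), round(r2_rescaled, 4))
```

Under this balanced-outcome assumption the bound works out to 0.75, in line with the approximate maximum quoted above, so even a perfectly predicting model could not push the unrescaled likelihood-ratio measure beyond that value.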

We did this in order to estimate the percentage of times the $R^2$ measures decrease when adding another covariate ($X_{i3}$) to the model. In particular, we calculated the number of times $R^2$ with $(X_{i1}, X_{i2}, X_{i3})$ in the model is less than $R^2$ with just $(X_{i1}, X_{i2})$ in the model. Of course, for the likelihood ratio, the number of times will be 0 since the likelihood always increases as more covariates are added to the model. When $\beta_3 = 0$, the percentage of times $R^2$ decreased was 14.2% for the usual Wald, and 7.6% for the Wald with variance under the null. However, for practical purposes, if $R^2$ was rounded to the third decimal point, the number of times $R^2$ decreased would be 0. For any simulations with $\beta_3 > 0$, $R^2$ never decreased, even without rounding. Since the likelihood ratio $R^2$ is not available for GEE, and our Wald $R^2$ with variance under the null compares favorably to the likelihood ratio $R^2$, we propose use of our Wald $R^2$ with variance under the null. Because of its poor properties, we suggest never using the coefficient of determination based on the usual Wald statistic.

8 Discussion

In this paper, we have proposed a measure of partial association for GEE that is an extension of what is used for linear regression. The set of partial associations may be especially useful when the covariates are measured on different scales. Most importantly, since the Z-statistic is invariant to linear transformations, and our measure of partial association is a monotone function of the Z-statistic, our measure of partial association is invariant to linear transformation. Thus, even if two investigators disagree about how to scale the covariates, the measure of partial association will be unaffected by the scale change. Although partial associations are useful, they are not without their problems (Greenland et al., 1986).
The partial associations do not provide more insight than is given by the estimates, estimated standard errors, Z-statistics, and p-values; rather, they serve as summary measures. We do not suggest that an investigator use the partial associations alone instead of the usual results that are published, but in conjunction with them.

We have also proposed an overall R² to measure the strength of association between the outcome variable and the predictors. This proposed R² is a monotone function of the Wald statistic with variance under the null for testing if all regression coefficients (except the intercept) are 0. If an additional covariate is very nonsignificant, there is a small possibility that our proposed R² could decrease very slightly. However, in simulations, we found that, for all practical purposes, our proposed R² never decreased. Further, unlike other proposed coefficients of determination, our proposed coefficient can take on values close to 1.

In our example, we found that the measures of partial association and the coefficient of determination based on the Wald statistic under the null were very similar for all working correlations. However, in datasets with time-varying covariates, more complex covariance models (e.g., AR1 versus independence) will often lead to higher efficiency than simpler covariance models, which in turn will lead to larger Wald statistics, and thus larger values of the partial associations and coefficient of determination.

Acknowledgements

We are grateful for the support provided by grants AI 60373, GM 29745, HL 69800, CA 74015, CA 70101, CA 68484, HL from the National Institutes of Health (USA), and RCD from the Department of Veterans Affairs (USA).

References

Bohrnstedt G and Knoke D (1994) Statistics for Social Data Analysis (Third Edition). Itasca, IL: FE Peacock Publishers, Inc.
Christensen R (1996) Plane Answers to Complex Questions. The Theory of Linear Models (Second Edition). New York, NY: Springer-Verlag.
Cox DR and Snell EJ (1989) Analysis of binary data. London: Chapman and Hall.
Greenland S, Schlesselman JJ and Criqui MH (1986) The fallacy of employing standardized regression coefficients and correlations as measures of effect. American Journal of Epidemiology, 123.
Hauck WW and Donner A (1977) Wald's test as applied to hypotheses in logit analysis. Journal of the American Statistical Association, 72.
Liang KY and Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika, 73.
Lipshultz SE, Lipsitz SR, Mone SM, Goorin AM, Sallan SE and Sanders SP (1995) Female sex and higher drug dose as risk factors for late cardiotoxic effects of doxorubicin therapy for childhood cancer. New England Journal of Medicine, 332.
Magee L (1990) R² measures based on Wald and likelihood ratio joint significance tests. The American Statistician, 44.
Nagelkerke NJD (1991) A note on a general definition of the coefficient of determination. Biometrika, 78.
Neter J, Kutner MH, Nachtsheim CJ and Wasserman W (1996) Applied Linear Statistical Models (Fourth Edition). Boston, MA: McGraw-Hill.
Prentice RL (1988) Correlated binary regression with covariates specific to each binary observation. Biometrics, 44.
White H (1982) Maximum likelihood estimation under mis-specified models. Econometrica, 50, 1-26.


UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS Duration - 3 hours Aids Allowed: Calculator LAST NAME: FIRST NAME: STUDENT NUMBER: There are 27 pages

More information

Pseudo-score confidence intervals for parameters in discrete statistical models

Pseudo-score confidence intervals for parameters in discrete statistical models Biometrika Advance Access published January 14, 2010 Biometrika (2009), pp. 1 8 C 2009 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asp074 Pseudo-score confidence intervals for parameters

More information

Chapter 14 Logistic Regression, Poisson Regression, and Generalized Linear Models

Chapter 14 Logistic Regression, Poisson Regression, and Generalized Linear Models Chapter 14 Logistic Regression, Poisson Regression, and Generalized Linear Models 許湘伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 29 14.1 Regression Models

More information

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Glenn Heller and Jing Qin Department of Epidemiology and Biostatistics Memorial

More information

More Statistics tutorial at Logistic Regression and the new:

More Statistics tutorial at  Logistic Regression and the new: Logistic Regression and the new: Residual Logistic Regression 1 Outline 1. Logistic Regression 2. Confounding Variables 3. Controlling for Confounding Variables 4. Residual Linear Regression 5. Residual

More information

Beyond GLM and likelihood

Beyond GLM and likelihood Stat 6620: Applied Linear Models Department of Statistics Western Michigan University Statistics curriculum Core knowledge (modeling and estimation) Math stat 1 (probability, distributions, convergence

More information

Justine Shults 1,,, Wenguang Sun 1,XinTu 2, Hanjoo Kim 1, Jay Amsterdam 3, Joseph M. Hilbe 4,5 and Thomas Ten-Have 1

Justine Shults 1,,, Wenguang Sun 1,XinTu 2, Hanjoo Kim 1, Jay Amsterdam 3, Joseph M. Hilbe 4,5 and Thomas Ten-Have 1 STATISTICS IN MEDICINE Statist. Med. 2009; 28:2338 2355 Published online 26 May 2009 in Wiley InterScience (www.interscience.wiley.com).3622 A comparison of several approaches for choosing between working

More information

Lecture 7 Time-dependent Covariates in Cox Regression

Lecture 7 Time-dependent Covariates in Cox Regression Lecture 7 Time-dependent Covariates in Cox Regression So far, we ve been considering the following Cox PH model: λ(t Z) = λ 0 (t) exp(β Z) = λ 0 (t) exp( β j Z j ) where β j is the parameter for the the

More information

Treatment Variables INTUB duration of endotracheal intubation (hrs) VENTL duration of assisted ventilation (hrs) LOWO2 hours of exposure to 22 49% lev

Treatment Variables INTUB duration of endotracheal intubation (hrs) VENTL duration of assisted ventilation (hrs) LOWO2 hours of exposure to 22 49% lev Variable selection: Suppose for the i-th observational unit (case) you record ( failure Y i = 1 success and explanatory variabales Z 1i Z 2i Z ri Variable (or model) selection: subject matter theory and

More information

An Extended BIC for Model Selection

An Extended BIC for Model Selection An Extended BIC for Model Selection at the JSM meeting 2007 - Salt Lake City Surajit Ray Boston University (Dept of Mathematics and Statistics) Joint work with James Berger, Duke University; Susie Bayarri,

More information