# BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY

Save this PDF as:

Size: px
Start display at page:

Download "BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY"

## Transcription

1 BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY Ingo Langner 1, Ralf Bender 2, Rebecca Lenz-Tönjes 1, Helmut Küchenhoff 2, Maria Blettner 2 1 Department of Epidemiology and Medical Statistics, School of Public health University of Bielefeld 2 Institute of Medical Biometry, Epidemiology & Informatics, University of Mainz 3 Department of Statistics, Ludwig Maximilians Universität München SFB 386 Discussion Paper 362 ABSTRACT Parameter estimates of logistic and Cox regression models are biased for finite samples. In a simulation study we investigated for both models the behaviour of the bias in relation to sample size and further parameters. In the case of a dichotomous explanatory variable x the magnitude of the bias is strongly influenced by the baseline risk defined by the constants of the models and the risk resulting for the high risk group. To conduct a direct comparison of the bias of the two models analyses were based on the same simulated data. Overall, the bias of the two models appear to be similar, however, the Cox model has less bias in situations where the baseline risk is high. 1. INTRODUCTION Logistic regression and Cox regression are frequently used for analyzing study data in medical research. Both methods analyze the relation of time dependent events to other influencing variables. The logistic model describes the event probability in a distinct observation time window and the Cox model the instantaneous event probability at a given time point both in dependence on explanatory variables. For the logistic regression the individual binary information "event occurred: yes or no" is used whereas the Cox regression considers the individual time till the event occurs. In general the method for estimating the parameters is based on the likelihood or partial likelihood, respectively. It is well known that 1

2 the estimates of both Maximum-Likelihood (ML) methods are only asymptotically unbiased which results in a bias for finite samples (1, 2, 3). A more or less relevant bias is present especially for small samples (4, 5), which is the case for many epidemiological and clinical studies, the bias could be relevant for parameter estimation, but a clear rule for a threshold for sample size cannot be found in the literature, as the size and direction of the bias does not rely on this issue alone. The bias of ML-estimates is directed away from zero which means that the expectation of the estimate is always larger in absolute value than the true parameter (2, 3). However, results of a systematic observation of this relations of the single regression methods have not been published. Comparisons between logistic regression and Cox regression models describing the occurrence of a dichotomous event in a distinct observation time interval were mainly related to the similarity of the parameter estimates for the same explanatory variable deriving from the different methods (6, 7, 8, 10). The bias of the parameter estimates was not investigated in these papers. The most consistent finding was that the estimates are nearly identical in the case of a rare event and a short observation time interval. Green and Symons (8) gave a short overview for the theoretical reasons of this. Increasing event probability or/and increasing observation time interval led to increasing difference of the estimates. Peduzzi et al. (7) examined a binary explanatory variable x in the models and suggested that the agreement between the estimates depends on the baseline event probability and to some extend on the probability ratio of x=1 and x=0. Ingram and Kleinman (6) discussed the influence of nonexponential survival times and non-proportional hazards as strengthening effects for the difference. They also found that sample size had no or only little effect on the difference of the estimates independent of the event probability (6). No considerations concerning a possible difference of the bias of the two methods have been made. Annesi et al. (9) found that the asymptotic relative efficiency of logistic regression model and Cox regression model is very close to 1 unless the event probability is increasing. They concluded that the Cox model is superior to the logistic model mainly when analyzing longitudinal data. In this paper, the amount of small sample bias in dependence on the data situation is investigated for both the logistic and the Cox regression model. The influence of different parameters on the bias is investigated for the single regression methods and the two approaches are compared with regard to bias and variance of the parameter estimates by means of simulation. 2

3 2. RELATIONS BETWEEN THE LOGISTIC AND THE COX REGRESSION MODEL The model for the logistic regression describes the dependence of the odds of an event on explanatory variables X with parameters β. Let Y the binary response variable indicating the occurrence of an event in a distinct time interval ranging from 0 to t z and let α the constant defining the risk in the case of all x i =0 than the relation is given by P( Y = 1 X, tz ) ( α + β X = e ) 1 P( Y = 1 X, t ) z (1) whereas the Cox regression models the hazard of an event depending on explanatory variables X: ( θ X h t) = h ( t) e ) (2) ( 0 Here, the response variable is given as the time till the event occurs, which is commonly called survival time. The baseline hazard h 0 (t), which is characterized by the underlying baseline survival time distribution defining the risk in the case of all x i =0, is unspecified in the Cox regression model. Although we were only interested in the bias of the parameter estimate of the explanatory variable X the parameters specifying the survival time distribution are needed to describe the behaviour of this bias as we will show below. The bias of the estimate of the parameters β and θ will be investigated, respectively. Is x a dichotomous covariable and is t z the length of the observation time than β is given by P( Y = 1 x = 1, tz ) P( Y = 0 x = 1, tz ) β = log = log( OR) P( Y 1 x 0, tz ) (3) = = P( Y 0 x 0, tz ) = = in the logistic regression model, which is equal to the log odds ratio (OR) and in the Cox regression model θ is h( t, x = 1) θ = log = log( HR) h( t, x 0) (4), = 3

4 which is equal to the log of the hazard ratio (HR). To conduct a direct comparison of the bias of the two regression methods both analyses were done for the same sample data. Data simulated by survival time simulation models can be transformed into a dataset suitable for logistic regression if as a restriction censoring only at the end of the observation time window is allowed. For this, the binary censoring variable indicating whether a observed survival time ended with an event or not is used as the binary response variable for the logistic regression model. The true parameters of the logistic model can be calculated using the parameters of the survival time simulation model and the equations for risk for a distinct observation time or distinct time point, respectively, of the logistic and the Cox model (see appendix I). Equating the risk formulae and solving for the parameters of the logistic model gives exact relations between the logistic and the Cox parameters. However, this relations hold only true if the observation time length was the same for all individuals. 3. SIMULATION STUDY We wanted to compare the performance of the covariable parameters estimates when the regression methods are applied to sample datasets. For this, the true underlying parameter values determining the covariable values in the dataset have to be known. Therefore we used simulation models to produce the datasets. For each constellation of true parameters and sample size 10,000 datasets were simulated and analyzed. To keep models simple we included only one binary covariable x ~ B(1, 0.5) for which the bias was investigated. Data for the logistic regression were simulated by using the logistic model as a parameter for a Bernoulli distribution resulting in a dichotomous response variable y indicating whether an event occurred or not. exp α + βx y ~ B(1,p) with p = 1+ exp α + βx (5) The intercept α determines the baseline risk P 0 (Y=1 X=0) and β is the parameter for the covariable x. For the Cox regression the survival time data t were simulated by the inverse of the survival time distribution function referring to the Cox model applied in the regression analysis. The 4

5 event probability is replaced by a uniformly distributed random number z with a value range from 0 to 1 (11). The distribution function also determines the baseline risk and baseline hazard of the Cox model, respectively. To evaluate a possible influence of the chosen survival time distribution we used three different of them: the exponential (6), the Weibull (7), and the Gompertz (8) distribution. The simulation models for the survival times are, respectively: log 1 z (Exponential) t = λ exp θ x (6) log 1 z γ (Weibull) t = (7) κ exp( η ) x 1 1 δ log 1 z (Gompertz) t = log 1 δ (8) τ exp ν x Here λ, κ and γ, τ and δ determine the baseline risk and θ, η, ν are the parameters for the covariable, respectively. A distinct observation time window was defined for censoring of the survival time data. No further censoring was simulated to allow also the application of logistic regression. The intercept and the survival time distribution parameters were selected that the baseline risk P 0 (Y=1 X=0) was approximately between 0.2 and 0.8 for the logistic simulation models and 0.07 and 0.88 for the Cox simulation models. Additionally the covariable parameter was varied resulting in a range for risk P 1 (Y=1 X=1) of 0.08 to 0.92 for logistic models and of 0.03 to 0.97 for the Cox models. Further variations of the parameters for the logistic simulation model to extend the range for risk P 1 led to an increase of simulated data sets where the regression analysis did not converge. For the direct comparison of the bias of the logistic and the Cox regression the variation of the parameters was restricted to the risk range that was suitable for the logistic simulation models. To get an impression of the relative effect of these parameters compared to the variation of sample size, this parameter was varied in an 5

6 interval between 100 and 500, for which a relevant sample size dependent bias has been reported (5). The bias is defined as the mean deviation of the estimate from the true parameter. For a specific parameter constellation k=10,000 simulated datasets were analyzed. So, the bias of a parameter β of a covariable x in the logistic model is estimated as bias Log 1 = k k i= 1 ( ˆ β β ) (9) i and correspondingly for θ, η, and ν in the Cox model. Bias estimates of logistic regression and Cox regression, both based on the same 10,000 datasets, were compared by calculating the difference of the percentage values (PBD: percentage bias difference): bias exp Log bias Cox PBD = 100 (10) β θ and correspondingly for the estimates of the Weibull (bias Cox-wei, η) and Gompertz (bias Coxgom, ν) distributed data. To get an impression whether the estimates from logistic regression or those from Cox regression are closer to there true parameter values for a distinct set of true parameters, the mean difference of the absolute deviation found in the analyses of a single dataset was calculated (APED: absolute percentage error difference): k 1 ˆ ˆ βi β θi θ APED = 100 k i= 1 β θ (11) where k denotes the number of datasets (10,000). Correspondingly this was done for the parameter estimates of the Cox-regression for the Weibull (η) and Gompertz (ν) distributed data. Saying it simple, APED measures which of the two estimates on an average lies closer to its true parameter value 6

7 4. RESULTS We plotted the bias in dependence of P 1 (Y=1 X=1) because o plot using the true parameter values on the horizontal axis was not suitable to show the relations and symmetry of the bias behaviour of the different models. Fixing all parameters except of the explanatory variable parameter, the resulting bias of the logistic regression models, when plotted versus the risk P 1 (Y=1 X=1), produces a curve shown in Fig. 1a. The bias is monotone increasing with increasing risk P 1. Around P 1 =0.5 the increase is small but towards values closer to 0 or 1 the bias becomes very large, respectively. The point of intersection of the bias graph with the horizontal axis bias=0 is determined by the intercept parameter, e.g. the baseline risk. A point symmetry regarding the bias estimate for P 1 =0.5 is found for the bias of the logistic regression. Variation of the main model parameters leaves the shape of the bias graph unchanged, but the whole curve is shifted vertically downwards with increasing baseline risk (Fig. 1a). This also results in a horizontal shift of the point of intersection of the bias curve with the axis bias=0. In an interval of moderate P 1 (0.15 to 0.85) the percentage bias of logistic regression varies only little but towards 'extremer' values (<0.15 and >0.85) a strong increase of the bias is present (Fig. 1b). The boundaries of this different behaviour move towards extremer risks with increasing sample size. The point symmetry of the graph of the absolute bias results in a further symmetry when looking at the percentage bias: the combined graphs of complementary baseline risks (Fig. 1b: 0.27 and 0.73) are symmetrical to the vertical axis at P 1 =0.5. Another interesting fact is that if sample size and baseline risk are constant the smallest percentage bias is found near an risk P 1 that is complementary to the true baseline risk. As the bias is decreasing with increasing sample size, the bias curve is shifted closer towards the axis bias=0 for higher sample sizes (Fig. 2). The main facts mentioned so far are also true for the bias of parameter estimates in the Cox regression except of the point symmetry which is not found for the latter (Fig. 3a). The strong decrease of bias for small P 1 is similar to that for logistic regression parameters but the increase at high P 1 starts not till much extremer values. As a consequence, when comparing two complementary baseline risks, the bias estimates of the different risks P 1 for the higher baseline risk are all smaller than for the corresponding baseline risk less than 0.5 (Fig. 3b). 7

8 Different survival time distributions have no effect on the bias: the percentage values are identical if the distribution parameters resulted in a corresponding baseline risk in the different models. The percentage bias of logistic regression is always higher than the corresponding bias of the Cox regression, however, for small P 1 the difference is almost zero (Fig. 4). The difference of the percentage bias s is increasing with increasing risk P 1 and is decreasing with increasing sample size. However, even for a sample size of n=100 the percentage bias s of both regression methods as well as the absolute difference of the percentage bias s do not exceed 6% when looking at moderate risks P 1 only. The increase of the mean difference of the percentage absolute deviations (APED) of the single estimates with increasing risk P 1 depends on the baseline risk: higher baseline risk results in a steeper slope (Fig. 5). For small baseline risks the increase is almost zero. At small P 1 the mean absolute deviations for logistic regression estimates seem to be a little bit smaller than those for Cox regression estimates. Again, these findings are not changed when different survival time distributions are used for the simulation of the data. 5. DISCUSSION We conducted a simulation study to observe the behaviour of the bias of ML parameter estimates in logistic regression and Cox regression in relation to sample size and the true values for baseline risk and the explanatory variable parameter. The baseline risk for the logistic model is characterized by the intercept and for the Cox model by the parameters of the survival time distribution responsible for the baseline hazard. To our knowledge, no results concerning a systematic comparison of the bias of the two regression methods has been presented so far. Callas et al. (10) used a measure named percentage relative bias in point estimates, but they observed the difference between the point estimates of two logistic regression methods and the estimate of the Cox regression, respectively. Results presented here give rise to the presumption that the published differences found for comparisons of the crude estimates of the both methods when analyzing the same explanatory variables (6, 7, 8) are at least partly due to the different bias behaviour of the parameter estimates in the different models. 8

9 From our simulations the following conclusions can be drawn. In the case of a binary explanatory variable x the bias of the parameter estimates obtained from the ML regression methods depends not only on the sample size but also on the baseline risk and the risk P 1 (Y=1 X=1). The intensity of the influence of a single parameter on the bias is not independent from the other parameters. This relation is different for logistic regression and Cox regression. In general a strong bias is present for extreme baseline risks and for extreme risks P 1 (Y=1 X=1). With both regression methods a high bias of the covariable parameter has to be expected if the number of events in the group only affected by the baseline risk is small. For the logistic regression this is also true for the number of non-events which results in the point symmetry of the bias curve. The following limitations of our study should be considered. Of course, much higher values for the percentage bias in logistic regression than presented here would be expected if small sample sizes and extreme baseline risks would be combined as it was possible for the simulated datasets for Cox regression. As we simulated the sample data by chance the case with no events or no non-events for x=0 occurred more often for extremer baseline risks. In this cases the logistic regression analysis did not converge and no estimate could be obtained, e.g. this cases could not be included when estimating the bias. As an exclusion of these cases would shift the bias estimate, we decided here to present no results for parameter constellations where the converge criteria of the regression were not reached for at least 99.9% of the regarding 10,000 simulations. Further research is required to investigate the association between the risk of continuous covariables and the parameter bias as well as the influence of additional covariables on the bias. Despite these limitations the following practical implications can be made. Although the higher power of the Cox regression allows to extend the interval of possible risks that can be analyzed with small sample sizes when compared to logistic regression this extent is combined with a strong increase in bias, especially for low baseline risks and low P 1 (Y=1 X=1). To define situations when bias correction is required three factors have to be considered: baseline risk, risk of the exposed (here risk P 1 (Y=1 X=1)), and sample size. Schaefer (5) presented a bias correction method for logistic regression. He recommended to apply this 9

10 method to regression analyses if sample size is 200 or less and mentioned no further criteria. However, as shown here, for small sample sizes the bias is rather small when the baseline risk and the risk P 1 (Y=1 X=1) are moderate and bias correction methods might not be necessary. However, also if the total number of events is high but the baseline risk is rather small this would lead to strong bias. This means for logistic regression that, if the numbers of events or non-events for the group characterized only by the intercept of the logistic model tend towards zero, the estimates for all covariable parameters would be highly biased. In general an extreme baseline risk and/or an extreme P 1 (Y=1 X=1) leads to a strong bias. For logistic regression the boundaries between moderate and extreme risks are symmetrical shifted towards zero and one, respectively, by increasing the sample size. This shift of the boundaries is not symmetrical for Cox regression parameters and is more pronounced towards zero as even for small sample sizes a strong bias increase occurs only at very high risks. Most critical are the cases where a low baseline risk is present and parameters for protective covariables are estimated. The case vice versa is only relevant for logistic regression. However, the decision, whether a bias correction is suitable or not, should depend on the ability of the considered correction method to give sufficient results especially in the cases of extreme baseline risk or extreme P 1 (Y=1 X=1). If there is no necessity for doing Cox regression because of other than type I censoring then the additional effort for collecting survival times instead of the binary information event: yes or no might be not justified when working on low baseline risks or low risks P 1 for the investigated covariables. Concerning the bias the advantages of Cox regression are most pronounced when analyzing high baseline risks. Another advantage is that at high extreme risks P 1 parameters can be estimated with a rather small bias. In summary, not only sample size is the main factor for the decision, whether a bias correction is suitable in an analysis based on a logistic or Cox regression model or not, but also the baseline risk and the risk for the exposed (here risk P 1 (Y=1 X=1)). Concerning the bias the advantages of Cox regression compared to logistic regression are mainly in the case of high baseline risks and/or extreme risks for the exposed. 10

11 APPENDIX I We used the exponential, the Weibull, and the Gompertz distribution for the simulation of survival times. As a restriction censoring is only allowed at the end of the simulated observation time window. The true parameters of the logistic model were calculated using the parameters of the survival time simulation model and the equations for risk R for a distinct observation time t z or distinct time point t z, respectively, of the logistic and the Cox model. α+ βx e (Logistic) R(tz) = α+ βx (12) 1+ e ( θ x λt e ) (Cox: exponential) e z R(tz) = 1 (13) γ ( η x κ t e ) (Cox: Weibull) R(t ) e z = (14) z 1 τ ν x (Cox: Gompertz) 1 e δtz e δ (15) R(tz ) = 1 e Equating (12) and (13) (or (12) and (14), or (12) and (15), respectively) for x=0 and is λ (or κ, γ, or τ, δ, respectively) the parameter of the survival time model based on an exponential (or Weibull, or Gompertz, respectively) distribution then the intercept of the logistic model, α, is given by: z (Exponential) ( λt α = log 1 e z ) + λt (16) γ κ t γ (Weibull) α = log1 z e + κt z (17) τ ( δt z ) 1 e δ τ δt α = log 1 e ( 1 e z ) (Gompertz) δ (18) 11

12 Analogous the parameter β of the binary covariable of the logistic model is calculated for the case x=1 and substituting α by (16) (or (17) or (18), respectively). Here, the parameter θ (or η or ν, respectively) of the survival time model has to be considered: ( ) θx λt 1 z e ( θx) (Exponential) β log e = + λt z e 1 λt (19) 1 z e ( ) γ ηx κ t 1 z e ( ηx) (Weibull) β log e γ = + κt z e 1 γ ( κ t ) (20) 1 z e τ ( νx ) 1 ( δtz ) e e δ (Gompertz) 1 τ ( ( δt )) ( ( νx β log e = + 1 e z 1 e )) (21). τ ( 1 ( δt e z ) ) δ δ 1 e 12

13 REFERENCES 1. Cordeiro, G.M.; McCullagh, P. Bias Correction in Generalized Linear Models. R Statist Soc B 1991, 3, Firth, D. Bias reduction of maximum likelihood estimates. Biometrika 1993, 89 (1), McCullagh, P.; Nelder, J.A. Generalized Linear Models. Chapman and Hall: London, Whaley, F.S. Comparison of different Maximum Likelihood Estimators in a small sample logistic regression with two independent binary variables. Statistics in Medicine 1991,10, Schaefer, R.L. Bias correction in maximum likelihood logistic regression. Statistics in Medicine 1983, 2, Ingram, D.D.; Kleinman, J.C. Empirical comparison of proportional hazards and logistic regression models. Statistics in Medicine 1989, 8, Peduzzi, P.; Holford, T.; Detre, K.; Chan, Y.-K. Comparison of the logistic and Cox regression models when outcome is determined in all patients after a fixed period of time. Journal of Chronic Diseases 1987, 40(8), Green, M.S.; Symons, M.J. A comparison of the logistic risk function and the proportional hazards model in prospective epidemiologic studies. Journal of Chronic Diseases 1983, 36(10), Annesi, I.; Moreau, T.; Lellouch, J. Efficiency of the logistic regression and Cox proportional hazards models in longitudinal studies. Statistics in Medicine 1989, 8, Callas, P.W.; Pastides, H.; Hosmer, D.W. Empirical comparison of proportional hazards, poisson, and logistic regression modeling of occupationl cohort data. American Journal of Industrial Medicine 1998, 33, Bender, R.; Augustin, T.; Blettner, M. Generating survival times to simulate Cox proportional hazards models. Statistics in Medicine 2003 (submitted for publication) 13

14 Figures 1a) 1b) Figure 1: Bias estimates of logistic regression parameter estimates for different true values of the parameter of a binary explanatory variable x and different sample sizes plotted versus true risk for x=1; graphs shown for two different baseline risks a) as absolute values and b) in percent of the true parameter. 14

15 Figure 2: Bias estimates of logistic regression parameter estimates for different true values of the parameter of a binary explanatory variable x and sample size n=100 plotted versus true risk for x=1; graphs shown are all for the same baseline risk. 15

16 3a) 3b) Figure 3: Bias estimates of Cox regression parameter estimates for different true values of the parameter of a binary explanatory variable x and sample size n=100 plotted versus true risk for x=1; graphs shown for two different baseline risks a) as absolute values and b) in percent of the true parameter. 16

17 Figure 4: Difference between the percentage biases (see equation 10) in logistic regression and Cox regression of a binary explanatory variable x for identical baseline risks and risks for x=1; estimates are shown for two sample sizes. 17

18 Figure 5: Mean differences of the absolute percentage deviations of true parameters and parameter estimates for single samples (see equation 11) between logistic regression and Cox regression; results are shown for n=100 and different baseline risks and risks for x=1 18

### Continuous Time Survival in Latent Variable Models

Continuous Time Survival in Latent Variable Models Tihomir Asparouhov 1, Katherine Masyn 2, Bengt Muthen 3 Muthen & Muthen 1 University of California, Davis 2 University of California, Los Angeles 3 Abstract

### Extensions of Cox Model for Non-Proportional Hazards Purpose

PhUSE 2013 Paper SP07 Extensions of Cox Model for Non-Proportional Hazards Purpose Jadwiga Borucka, PAREXEL, Warsaw, Poland ABSTRACT Cox proportional hazard model is one of the most common methods used

### Analysis of competing risks data and simulation of data following predened subdistribution hazards

Analysis of competing risks data and simulation of data following predened subdistribution hazards Bernhard Haller Institut für Medizinische Statistik und Epidemiologie Technische Universität München 27.05.2013

### 8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

### Optimal Treatment Regimes for Survival Endpoints from a Classification Perspective. Anastasios (Butch) Tsiatis and Xiaofei Bai

Optimal Treatment Regimes for Survival Endpoints from a Classification Perspective Anastasios (Butch) Tsiatis and Xiaofei Bai Department of Statistics North Carolina State University 1/35 Optimal Treatment

### BIOS 312: Precision of Statistical Inference

and Power/Sample Size and Standard Errors BIOS 312: of Statistical Inference Chris Slaughter Department of Biostatistics, Vanderbilt University School of Medicine January 3, 2013 Outline Overview and Power/Sample

### Investigating Models with Two or Three Categories

Ronald H. Heck and Lynn N. Tabata 1 Investigating Models with Two or Three Categories For the past few weeks we have been working with discriminant analysis. Let s now see what the same sort of model might

### Repeated ordinal measurements: a generalised estimating equation approach

Repeated ordinal measurements: a generalised estimating equation approach David Clayton MRC Biostatistics Unit 5, Shaftesbury Road Cambridge CB2 2BW April 7, 1992 Abstract Cumulative logit and related

### PASS Sample Size Software. Poisson Regression

Chapter 870 Introduction Poisson regression is used when the dependent variable is a count. Following the results of Signorini (99), this procedure calculates power and sample size for testing the hypothesis

### CTDL-Positive Stable Frailty Model

CTDL-Positive Stable Frailty Model M. Blagojevic 1, G. MacKenzie 2 1 Department of Mathematics, Keele University, Staffordshire ST5 5BG,UK and 2 Centre of Biostatistics, University of Limerick, Ireland

### SAMPLE SIZE ESTIMATION FOR SURVIVAL OUTCOMES IN CLUSTER-RANDOMIZED STUDIES WITH SMALL CLUSTER SIZES BIOMETRICS (JUNE 2000)

SAMPLE SIZE ESTIMATION FOR SURVIVAL OUTCOMES IN CLUSTER-RANDOMIZED STUDIES WITH SMALL CLUSTER SIZES BIOMETRICS (JUNE 2000) AMITA K. MANATUNGA THE ROLLINS SCHOOL OF PUBLIC HEALTH OF EMORY UNIVERSITY SHANDE

### Stein-Rule Estimation under an Extended Balanced Loss Function

Shalabh & Helge Toutenburg & Christian Heumann Stein-Rule Estimation under an Extended Balanced Loss Function Technical Report Number 7, 7 Department of Statistics University of Munich http://www.stat.uni-muenchen.de

### Frailty Models and Copulas: Similarities and Differences

Frailty Models and Copulas: Similarities and Differences KLARA GOETHALS, PAUL JANSSEN & LUC DUCHATEAU Department of Physiology and Biometrics, Ghent University, Belgium; Center for Statistics, Hasselt

### An Analysis of Record Statistics based on an Exponentiated Gumbel Model

Communications for Statistical Applications and Methods 2013, Vol. 20, No. 5, 405 416 DOI: http://dx.doi.org/10.5351/csam.2013.20.5.405 An Analysis of Record Statistics based on an Exponentiated Gumbel

### Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent

Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary

### Categorical and Zero Inflated Growth Models

Categorical and Zero Inflated Growth Models Alan C. Acock* Summer, 2009 *Alan C. Acock, Department of Human Development and Family Sciences, Oregon State University, Corvallis OR 97331 (alan.acock@oregonstate.edu).

### A Note on Bayesian Inference After Multiple Imputation

A Note on Bayesian Inference After Multiple Imputation Xiang Zhou and Jerome P. Reiter Abstract This article is aimed at practitioners who plan to use Bayesian inference on multiplyimputed datasets in

### Statistical Methods in Clinical Trials Categorical Data

Statistical Methods in Clinical Trials Categorical Data Types of Data quantitative Continuous Blood pressure Time to event Categorical sex qualitative Discrete No of relapses Ordered Categorical Pain level

### Spatial Variation in Infant Mortality with Geographically Weighted Poisson Regression (GWPR) Approach

Spatial Variation in Infant Mortality with Geographically Weighted Poisson Regression (GWPR) Approach Kristina Pestaria Sinaga, Manuntun Hutahaean 2, Petrus Gea 3 1, 2, 3 University of Sumatera Utara,

### Guideline on adjustment for baseline covariates in clinical trials

26 February 2015 EMA/CHMP/295050/2013 Committee for Medicinal Products for Human Use (CHMP) Guideline on adjustment for baseline covariates in clinical trials Draft Agreed by Biostatistics Working Party

### A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky Empirical likelihood with right censored data were studied by Thomas and Grunkmier (1975), Li (1995),

### Harvard University. Harvard University Biostatistics Working Paper Series. Survival Analysis with Change Point Hazard Functions

Harvard University Harvard University Biostatistics Working Paper Series Year 2006 Paper 40 Survival Analysis with Change Point Hazard Functions Melody S. Goodman Yi Li Ram C. Tiwari Harvard University,

### Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models

Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

### Incorporating published univariable associations in diagnostic and prognostic modeling

Incorporating published univariable associations in diagnostic and prognostic modeling Thomas Debray Julius Center for Health Sciences and Primary Care University Medical Center Utrecht The Netherlands

### Imputation of rounded data

11 0 Imputation of rounded data Jan van der Laan and Léander Kuijvenhoven The views expressed in this paper are those of the author(s) and do not necessarily reflect the policies of Statistics Netherlands

### Survival Distributions, Hazard Functions, Cumulative Hazards

BIO 244: Unit 1 Survival Distributions, Hazard Functions, Cumulative Hazards 1.1 Definitions: The goals of this unit are to introduce notation, discuss ways of probabilistically describing the distribution

### M2S2 Statistical Modelling

UNIVERSITY OF LONDON BSc and MSci EXAMINATIONS (MATHEMATICS) May-June 2006 This paper is also taken for the relevant examination for the Associateship. M2S2 Statistical Modelling Date: Wednesday, 17th

### Generalizing the MCPMod methodology beyond normal, independent data

Generalizing the MCPMod methodology beyond normal, independent data José Pinheiro Joint work with Frank Bretz and Björn Bornkamp Novartis AG Trends and Innovations in Clinical Trial Statistics Conference

### Analysis of recurrent event data under the case-crossover design. with applications to elderly falls

STATISTICS IN MEDICINE Statist. Med. 2007; 00:1 22 [Version: 2002/09/18 v1.11] Analysis of recurrent event data under the case-crossover design with applications to elderly falls Xianghua Luo 1,, and Gary

### Chapter 4 Regression Models

23.August 2010 Chapter 4 Regression Models The target variable T denotes failure time We let x = (x (1),..., x (m) ) represent a vector of available covariates. Also called regression variables, regressors,

### Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

### arxiv: v2 [stat.me] 8 Jun 2016

Orthogonality of the Mean and Error Distribution in Generalized Linear Models 1 BY ALAN HUANG 2 and PAUL J. RATHOUZ 3 University of Technology Sydney and University of Wisconsin Madison 4th August, 2013

### Multivariate Survival Data With Censoring.

1 Multivariate Survival Data With Censoring. Shulamith Gross and Catherine Huber-Carol Baruch College of the City University of New York, Dept of Statistics and CIS, Box 11-220, 1 Baruch way, 10010 NY.

### On case-crossover methods for environmental time series data

On case-crossover methods for environmental time series data Heather J. Whitaker 1, Mounia N. Hocine 1,2 and C. Paddy Farrington 1 * 1 Department of Statistics, The Open University, Milton Keynes, UK.

### Applied Linear Statistical Methods

Applied Linear Statistical Methods (short lecturenotes) Prof. Rozenn Dahyot School of Computer Science and Statistics Trinity College Dublin Ireland www.scss.tcd.ie/rozenn.dahyot Hilary Term 2016 1. Introduction

### 3.1 Events, Sample Spaces, and Probability

Chapter 3 Probability Probability is the tool that allows the statistician to use sample information to make inferences about or to describe the population from which the sample was drawn. 3.1 Events,

### STA216: Generalized Linear Models. Lecture 1. Review and Introduction

STA216: Generalized Linear Models Lecture 1. Review and Introduction Let y 1,..., y n denote n independent observations on a response Treat y i as a realization of a random variable Y i In the general

### Semiparametric maximum likelihood estimation in normal transformation models for bivariate survival data

Biometrika (28), 95, 4,pp. 947 96 C 28 Biometrika Trust Printed in Great Britain doi: 1.193/biomet/asn49 Semiparametric maximum likelihood estimation in normal transformation models for bivariate survival

### Predictive Modeling Using Logistic Regression Step-by-Step Instructions

Predictive Modeling Using Logistic Regression Step-by-Step Instructions This document is accompanied by the following Excel Template IntegrityM Predictive Modeling Using Logistic Regression in Excel Template.xlsx

### Analysis of 2 n Factorial Experiments with Exponentially Distributed Response Variable

Applied Mathematical Sciences, Vol. 5, 2011, no. 10, 459-476 Analysis of 2 n Factorial Experiments with Exponentially Distributed Response Variable S. C. Patil (Birajdar) Department of Statistics, Padmashree

### Supplementary materials for:

Supplementary materials for: Tang TS, Funnell MM, Sinco B, Spencer MS, Heisler M. Peer-led, empowerment-based approach to selfmanagement efforts in diabetes (PLEASED: a randomized controlled trial in an

### Generalized Linear Models with Functional Predictors

Generalized Linear Models with Functional Predictors GARETH M. JAMES Marshall School of Business, University of Southern California Abstract In this paper we present a technique for extending generalized

### PENALIZING YOUR MODELS

PENALIZING YOUR MODELS AN OVERVIEW OF THE GENERALIZED REGRESSION PLATFORM Michael Crotty & Clay Barker Research Statisticians JMP Division, SAS Institute Copyr i g ht 2012, SAS Ins titut e Inc. All rights

### Analysis of Cure Rate Survival Data Under Proportional Odds Model

Analysis of Cure Rate Survival Data Under Proportional Odds Model Yu Gu 1,, Debajyoti Sinha 1, and Sudipto Banerjee 2, 1 Department of Statistics, Florida State University, Tallahassee, Florida 32310 5608,

### A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i,

A Course in Applied Econometrics Lecture 18: Missing Data Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. When Can Missing Data be Ignored? 2. Inverse Probability Weighting 3. Imputation 4. Heckman-Type

### Probability. Introduction to Biostatistics

Introduction to Biostatistics Probability Second Semester 2014/2015 Text Book: Basic Concepts and Methodology for the Health Sciences By Wayne W. Daniel, 10 th edition Dr. Sireen Alkhaldi, BDS, MPH, DrPH

### The Genetic Algorithm is Useful to Fitting Input Probability Distributions for Simulation Models

The Genetic Algorithm is Useful to Fitting Input Probability Distributions for Simulation Models Johann Christoph Strelen Rheinische Friedrich Wilhelms Universität Bonn Römerstr. 164, 53117 Bonn, Germany

### Using Estimating Equations for Spatially Correlated A

Using Estimating Equations for Spatially Correlated Areal Data December 8, 2009 Introduction GEEs Spatial Estimating Equations Implementation Simulation Conclusion Typical Problem Assess the relationship

### Unobserved Heterogeneity

Unobserved Heterogeneity Germán Rodríguez grodri@princeton.edu Spring, 21. Revised Spring 25 This unit considers survival models with a random effect representing unobserved heterogeneity of frailty, a

### The OSCAR for Generalized Linear Models

Sebastian Petry & Gerhard Tutz The OSCAR for Generalized Linear Models Technical Report Number 112, 2011 Department of Statistics University of Munich http://www.stat.uni-muenchen.de The OSCAR for Generalized

### Monitoring risk-adjusted medical outcomes allowing for changes over time

Biostatistics (2014), 15,4,pp. 665 676 doi:10.1093/biostatistics/kxt057 Advance Access publication on December 29, 2013 Monitoring risk-adjusted medical outcomes allowing for changes over time STEFAN H.

### Statistical inference on the penetrances of rare genetic mutations based on a case family design

Biostatistics (2010), 11, 3, pp. 519 532 doi:10.1093/biostatistics/kxq009 Advance Access publication on February 23, 2010 Statistical inference on the penetrances of rare genetic mutations based on a case

### Generalized Linear Models (1/29/13)

STA613/CBB540: Statistical methods in computational biology Generalized Linear Models (1/29/13) Lecturer: Barbara Engelhardt Scribe: Yangxiaolu Cao When processing discrete data, two commonly used probability

### Tutorial 2: Power and Sample Size for the Paired Sample t-test

Tutorial 2: Power and Sample Size for the Paired Sample t-test Preface Power is the probability that a study will reject the null hypothesis. The estimated probability is a function of sample size, variability,

### AFT Models and Empirical Likelihood

AFT Models and Empirical Likelihood Mai Zhou Department of Statistics, University of Kentucky Collaborators: Gang Li (UCLA); A. Bathke; M. Kim (Kentucky) Accelerated Failure Time (AFT) models: Y = log(t

### Comparison between conditional and marginal maximum likelihood for a class of item response models

(1/24) Comparison between conditional and marginal maximum likelihood for a class of item response models Francesco Bartolucci, University of Perugia (IT) Silvia Bacci, University of Perugia (IT) Claudia

### How likely is Simpson s paradox in path models?

How likely is Simpson s paradox in path models? Ned Kock Full reference: Kock, N. (2015). How likely is Simpson s paradox in path models? International Journal of e- Collaboration, 11(1), 1-7. Abstract

### ANALYSIS OF COMPETING RISKS DATA WITH MISSING CAUSE OF FAILURE UNDER ADDITIVE HAZARDS MODEL

Statistica Sinica 18(28, 219-234 ANALYSIS OF COMPETING RISKS DATA WITH MISSING CAUSE OF FAILURE UNDER ADDITIVE HAZARDS MODEL Wenbin Lu and Yu Liang North Carolina State University and SAS Institute Inc.

### Estimating Treatment Effects and Identifying Optimal Treatment Regimes to Prolong Patient Survival

Estimating Treatment Effects and Identifying Optimal Treatment Regimes to Prolong Patient Survival by Jincheng Shen A dissertation submitted in partial fulfillment of the requirements for the degree of

### Topics in Current Status Data. Karen Michelle McKeown. A dissertation submitted in partial satisfaction of the. requirements for the degree of

Topics in Current Status Data by Karen Michelle McKeown A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor in Philosophy in Biostatistics in the Graduate Division

### A simple graphical method to explore tail-dependence in stock-return pairs

A simple graphical method to explore tail-dependence in stock-return pairs Klaus Abberger, University of Konstanz, Germany Abstract: For a bivariate data set the dependence structure can not only be measured

### Random Effects Models for Network Data

Random Effects Models for Network Data Peter D. Hoff 1 Working Paper no. 28 Center for Statistics and the Social Sciences University of Washington Seattle, WA 98195-4320 January 14, 2003 1 Department of

### Known unknowns : using multiple imputation to fill in the blanks for missing data

Known unknowns : using multiple imputation to fill in the blanks for missing data James Stanley Department of Public Health University of Otago, Wellington james.stanley@otago.ac.nz Acknowledgments Cancer

### Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

### Comparing IRT with Other Models

Comparing IRT with Other Models Lecture #14 ICPSR Item Response Theory Workshop Lecture #14: 1of 45 Lecture Overview The final set of slides will describe a parallel between IRT and another commonly used

### Regression models for multivariate ordered responses via the Plackett distribution

Journal of Multivariate Analysis 99 (2008) 2472 2478 www.elsevier.com/locate/jmva Regression models for multivariate ordered responses via the Plackett distribution A. Forcina a,, V. Dardanoni b a Dipartimento

### Semiparametric Varying Coefficient Models for Matched Case-Crossover Studies

Semiparametric Varying Coefficient Models for Matched Case-Crossover Studies Ana Maria Ortega-Villa Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial

### Model Estimation Example

Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

### IE 230 Probability & Statistics in Engineering I. Closed book and notes. 120 minutes.

Closed book and notes. 10 minutes. Two summary tables from the concise notes are attached: Discrete distributions and continuous distributions. Eight Pages. Score _ Final Exam, Fall 1999 Cover Sheet, Page

### An EM-Algorithm Based Method to Deal with Rounded Zeros in Compositional Data under Dirichlet Models. Rafiq Hijazi

An EM-Algorithm Based Method to Deal with Rounded Zeros in Compositional Data under Dirichlet Models Rafiq Hijazi Department of Statistics United Arab Emirates University P.O. Box 17555, Al-Ain United

### Mean squared error matrix comparison of least aquares and Stein-rule estimators for regression coefficients under non-normal disturbances

METRON - International Journal of Statistics 2008, vol. LXVI, n. 3, pp. 285-298 SHALABH HELGE TOUTENBURG CHRISTIAN HEUMANN Mean squared error matrix comparison of least aquares and Stein-rule estimators

### Quantitative Methods for Economics, Finance and Management (A86050 F86050)

Quantitative Methods for Economics, Finance and Management (A86050 F86050) Matteo Manera matteo.manera@unimib.it Marzio Galeotti marzio.galeotti@unimi.it 1 This material is taken and adapted from Guy Judge

### Solutions for Examination Categorical Data Analysis, March 21, 2013

STOCKHOLMS UNIVERSITET MATEMATISKA INSTITUTIONEN Avd. Matematisk statistik, Frank Miller MT 5006 LÖSNINGAR 21 mars 2013 Solutions for Examination Categorical Data Analysis, March 21, 2013 Problem 1 a.

### Graph Detection and Estimation Theory

Introduction Detection Estimation Graph Detection and Estimation Theory (and algorithms, and applications) Patrick J. Wolfe Statistics and Information Sciences Laboratory (SISL) School of Engineering and

### Model comparison and selection

BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)

### Compare Predicted Counts between Groups of Zero Truncated Poisson Regression Model based on Recycled Predictions Method

Compare Predicted Counts between Groups of Zero Truncated Poisson Regression Model based on Recycled Predictions Method Yan Wang 1, Michael Ong 2, Honghu Liu 1,2,3 1 Department of Biostatistics, UCLA School

### Hypothesis Testing, Power, Sample Size and Confidence Intervals (Part 2)

Hypothesis Testing, Power, Sample Size and Confidence Intervals (Part 2) B.H. Robbins Scholars Series June 23, 2010 1 / 29 Outline Z-test χ 2 -test Confidence Interval Sample size and power Relative effect

### Analysis of 2 n factorial experiments with exponentially distributed response variable: a comparative study

Analysis of 2 n factorial experiments with exponentially distributed response variable: a comparative study H. V. Kulkarni a and S. C. Patil (Birajdar) b Department of Statistics, Shivaji University, Kolhapur,

### arxiv: v2 [stat.me] 5 May 2016

Palindromic Bernoulli distributions Giovanni M. Marchetti Dipartimento di Statistica, Informatica, Applicazioni G. Parenti, Florence, Italy e-mail: giovanni.marchetti@disia.unifi.it and Nanny Wermuth arxiv:1510.09072v2

### Multi-period credit default prediction with time-varying covariates. Walter Orth University of Cologne, Department of Statistics and Econometrics

with time-varying covariates Walter Orth University of Cologne, Department of Statistics and Econometrics 2 20 Overview Introduction Approaches in the literature The proposed models Empirical analysis

### Problems with Penalised Maximum Likelihood and Jeffrey s Priors to Account For Separation in Large Datasets with Rare Events

Problems with Penalised Maximum Likelihood and Jeffrey s Priors to Account For Separation in Large Datasets with Rare Events Liam F. McGrath September 15, 215 Abstract When separation is a problem in binary

### Asymptotic Relative Efficiency in Estimation

Asymptotic Relative Efficiency in Estimation Robert Serfling University of Texas at Dallas October 2009 Prepared for forthcoming INTERNATIONAL ENCYCLOPEDIA OF STATISTICAL SCIENCES, to be published by Springer

### COMPLEMENTARY EXERCISES WITH DESCRIPTIVE STATISTICS

COMPLEMENTARY EXERCISES WITH DESCRIPTIVE STATISTICS EX 1 Given the following series of data on Gender and Height for 8 patients, fill in two frequency tables one for each Variable, according to the model

### Modeling and Analysis of Recurrent Event Data

Edsel A. Peña (pena@stat.sc.edu) Department of Statistics University of South Carolina Columbia, SC 29208 New Jersey Institute of Technology Conference May 20, 2012 Historical Perspective: Random Censorship

### Unit 3. Discrete Distributions

BIOSTATS 640 - Spring 2016 3. Discrete Distributions Page 1 of 52 Unit 3. Discrete Distributions Chance favors only those who know how to court her - Charles Nicolle In many research settings, the outcome

### Latent Class Analysis

Latent Class Analysis Karen Bandeen-Roche October 27, 2016 Objectives For you to leave here knowing When is latent class analysis (LCA) model useful? What is the LCA model its underlying assumptions? How

### MDPs with Non-Deterministic Policies

MDPs with Non-Deterministic Policies Mahdi Milani Fard School of Computer Science McGill University Montreal, Canada mmilan1@cs.mcgill.ca Joelle Pineau School of Computer Science McGill University Montreal,

### Accounting for Baseline Observations in Randomized Clinical Trials

Accounting for Baseline Observations in Randomized Clinical Trials Scott S Emerson, MD, PhD Department of Biostatistics, University of Washington, Seattle, WA 9895, USA October 6, 0 Abstract In clinical

### From semi- to non-parametric inference in general time scale models

From semi- to non-parametric inference in general time scale models Thierry DUCHESNE duchesne@matulavalca Département de mathématiques et de statistique Université Laval Québec, Québec, Canada Research

### Prof. Dr. Roland Füss Lecture Series in Applied Econometrics Summer Term Introduction to Time Series Analysis

Introduction to Time Series Analysis 1 Contents: I. Basics of Time Series Analysis... 4 I.1 Stationarity... 5 I.2 Autocorrelation Function... 9 I.3 Partial Autocorrelation Function (PACF)... 14 I.4 Transformation

### AP Statistics Unit 6 Note Packet Linear Regression. Scatterplots and Correlation

Scatterplots and Correlation Name Hr A scatterplot shows the relationship between two quantitative variables measured on the same individuals. variable (y) measures an outcome of a study variable (x) may

### Latent Growth Models 1

1 We will use the dataset bp3, which has diastolic blood pressure measurements at four time points for 256 patients undergoing three types of blood pressure medication. These are our observed variables:

### Sample Size and Power I: Binary Outcomes. James Ware, PhD Harvard School of Public Health Boston, MA

Sample Size and Power I: Binary Outcomes James Ware, PhD Harvard School of Public Health Boston, MA Sample Size and Power Principles: Sample size calculations are an essential part of study design Consider

### CURE MODEL WITH CURRENT STATUS DATA

Statistica Sinica 19 (2009), 233-249 CURE MODEL WITH CURRENT STATUS DATA Shuangge Ma Yale University Abstract: Current status data arise when only random censoring time and event status at censoring are

### CS489/698: Intro to ML

CS489/698: Intro to ML Lecture 04: Logistic Regression 1 Outline Announcements Baseline Learning Machine Learning Pyramid Regression or Classification (that s it!) History of Classification History of

### The Dark Corners of the Labor Market

The Dark Corners of the Labor Market Vincent Sterk Conference on Persistent Output Gaps: Causes and Policy Remedies EABCN / University of Cambridge / INET University College London September 2015 Sterk

### ST495: Survival Analysis: Maximum likelihood

ST495: Survival Analysis: Maximum likelihood Eric B. Laber Department of Statistics, North Carolina State University February 11, 2014 Everything is deception: seeking the minimum of illusion, keeping

### Estimating the Hausman test for multilevel Rasch model. Kingsley E. Agho School of Public health Faculty of Medicine The University of Sydney

Estimating the Hausman test for multilevel Rasch model Kingsley E. Agho School of Public health Faculty of Medicine The University of Sydney James A. Athanasou Faculty of Education University of Technology,