Sample size calculations for logistic and Poisson regression models

Similar documents
Sample size determination for logistic regression: A simulation study

On power and sample size calculations for Wald tests in generalized linear models

Discrete Multivariate Statistics

Fisher information for generalised linear mixed models

Figure 36: Respiratory infection versus time for the first 49 children.

,..., θ(2),..., θ(n)

Generalized Linear Models (GLZ)

8 Nominal and Ordinal Logistic Regression

A note on profile likelihood for exponential tilt mixture models

LOGISTIC REGRESSION Joseph M. Hilbe

Logistic Regression. Fitting the Logistic Regression Model BAL040-A.A.-10-MAJ

Nuisance parameter elimination for proportional likelihood ratio models with nonignorable missingness and random truncation

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Investigating Models with Two or Three Categories

H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL

Robust covariance estimator for small-sample adjustment in the generalized estimating equations: A simulation study

LOGISTICS REGRESSION FOR SAMPLE SURVEYS

Stat 5101 Lecture Notes

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study

Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples

A Comparative Study of Power and Sample Size Calculations for Multivariate General Linear Models Gwowen Shieh Published online: 10 Jun 2010.

UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables

Linear Regression Models P8111

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

6 Single Sample Methods for a Location Parameter

An Approximate Test for Homogeneity of Correlated Correlation Coefficients

Logistic regression: Miscellaneous topics

Generalized logit models for nominal multinomial responses. Local odds ratios

DISPLAYING THE POISSON REGRESSION ANALYSIS

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples

Repeated ordinal measurements: a generalised estimating equation approach

STA6938-Logistic Regression Model

PQL Estimation Biases in Generalized Linear Mixed Models

Generalized linear models

Some Approximations of the Logistic Distribution with Application to the Covariance Matrix of Logistic Regression

Statistics in medicine

Longitudinal data analysis using generalized linear models

Multinomial Logistic Regression Models

Lecture 1. Introduction Statistics Statistical Methods II. Presented January 8, 2018

BOOTSTRAPPING WITH MODELS FOR COUNT DATA

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links

ADJUSTED POWER ESTIMATES IN. Ji Zhang. Biostatistics and Research Data Systems. Merck Research Laboratories. Rahway, NJ

Stat 642, Lecture notes for 04/12/05 96

Generalized Linear Models. Kurt Hornik

Lecture 14: Introduction to Poisson Regression

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Marcia Gumpertz and Sastry G. Pantula Department of Statistics North Carolina State University Raleigh, NC

Estimation in Generalized Linear Models with Heterogeneous Random Effects. Woncheol Jang Johan Lim. May 19, 2004

Discrete Response Multilevel Models for Repeated Measures: An Application to Voting Intentions Data

arxiv: v2 [stat.me] 8 Jun 2016

A comparison of inverse transform and composition methods of data simulation from the Lindley distribution

Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association for binary responses

Subject CS1 Actuarial Statistics 1 Core Principles

Generalized, Linear, and Mixed Models

STATISTICS SYLLABUS UNIT I

Multivariate statistical methods and data mining in particle physics

Final Exam. 1. (6 points) True/False. Please read the statements carefully, as no partial credit will be given.

MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES

Generalized Linear Models for Non-Normal Data

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION (SOLUTIONS) ST3241 Categorical Data Analysis. (Semester II: )

Semiparametric Generalized Linear Models

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

HANDBOOK OF APPLICABLE MATHEMATICS

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

A comparison study of the nonparametric tests based on the empirical distributions

Chapter 2. Review of basic Statistical methods 1 Distribution, conditional distribution and moments

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST

Mantel-Haenszel Test Statistics. for Correlated Binary Data. Department of Statistics, North Carolina State University. Raleigh, NC

ECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam

Logistic Regression in R. by Kerry Machemer 12/04/2015

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 16 Introduction

Pseudo-score confidence intervals for parameters in discrete statistical models

Nonlinear multilevel models, with an application to discrete response data

Solutions for Examination Categorical Data Analysis, March 21, 2013

Model Estimation Example

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

ROBUSTNESS OF TWO-PHASE REGRESSION TESTS

PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH

Interpretation of the Fitted Logistic Regression Model

Multivariate Regression (Chapter 10)

General Regression Model

Final Examination Statistics 200C. T. Ferguson June 11, 2009

A Few Special Distributions and Their Properties

Generalized Linear Models 1

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

PASS Sample Size Software. Poisson Regression

Introduction to General and Generalized Linear Models

Confidence Intervals of the Simple Difference between the Proportions of a Primary Infection and a Secondary Infection, Given the Primary Infection

Distribution Theory. Comparison Between Two Quantiles: The Normal and Exponential Cases

* Tuesday 17 January :30-16:30 (2 hours) Recored on ESSE3 General introduction to the course.

Generalized Linear Models. Last time: Background & motivation for moving beyond linear

Introduction to the Logistic Regression Model

Two hours. To be supplied by the Examinations Office: Mathematical Formula Tables THE UNIVERSITY OF MANCHESTER. 21 June :45 11:45

Lecture 4: Newton s method and gradient descent

High-Throughput Sequencing Course

Comparison of Estimators in GLM with Binary Data

Single-level Models for Binary Responses

Summary of Chapters 7-9

Transcription:

Biometrika (2), 88, 4, pp. 93 99 2 Biometrika Trust Printed in Great Britain Sample size calculations for logistic and Poisson regression models BY GWOWEN SHIEH Department of Management Science, National Chiao T ung University, Hsinchu, T aiwan 35, R.O.C. gwshieh@cc.nctu.edu.tw SUMMARY A method is proposed for improving sample size calculations for logistic and Poisson regression models by incorporating the limiting value of the maximum likelihood estimates of nuisance parameters under the composite null hypothesis. The method modifies existing approaches of Whittemore (98) and Signorini (99) and provides explicit formulae for determining the sample size needed to test hypotheses about a single parameter at a specified significance level and power. Simulation studies assess its accuracy for various model configurations and covariate distributions. The results show that the proposed method is more accurate than the previous approaches over the range of conditions considered here. Some key words: Generalised linear model; Information matrix; Logistic regression; Maximum likelihood estimator; Poisson regression; Power; Sample size; Wald statistic.. INTRODUCTION The class of generalised linear models, first introduced by Nelder & Wedderburn (972) and later expanded by McCullagh & Nelder (989, Ch. 2), is specified by assuming independent scalar response variables Y (i=,...,n)to have an exponential-family probability density of the form i exp[{y h b(h)}/a(w) +c(y, w)]. () The expected value E(Y )=m is related to the canonical parameter h by the function m=b (h), where b denotes the first derivative of b. The link function g relates the linear predictors g to the mean response g=g(m). The linear predictors can be written as g=b +XTb, where X is a K vector of covariates, and b and b=(b,...,b )T represent the corresponding K K+ unknown regression coefficients. The scale parameter w is assumed to be known. Assume (y,x), for i=,...,n, is a random sample from the joint distribution of (Y, X) with probability i i density function f(y,x)= f(y X)f (X), where f(y X) has the form defined in () and f(x) is the probability density function for X. The form of f(x)is assumed to depend on none of the unknown parameters b and b. The likelihood function associated with the data is L(b, b)= a N f(y,x)= a N f(y x )f(x ). i i i i i i= i= It follows from the standard asymptotic theory that the maximum likelihood estimator (b@, b@t)t is asymptotically normally distributed with mean (b, bt)t and with variance covariance matrix given by the inverse of the (K+) (K+) Fisher information matrix I(b, b), where the (i, j)th

94 GWOWEN SHIEH element of I is I ij = E A 2 log L b i b j B (i, j=,...,k). We wish to test the composite null hypothesis H : b = against the alternative hypothesis H : b N, while treating (b ) as a nuisance parameter. The Wald-type test of this hypothesis is based on b@ /{var(b@ )}/2, where b@ is the first entry of b@ =(b@, b@ 2,...,b@ K )T and var(b@ ) is the second diagonal term of I (b@, b@). The actual test is performed by referring the statistic to its asymptotic distribution under the null hypothesis, which is the standard normal distribution. In general, there is no simple closed-form expression for Fisher s information matrix except in some special cases. In view of their practical importance and the existence of explicit formulae, we restrict attention to the cases of logistic and Poisson regression models. For the logistic regression model, an approximate expression for Fisher s information matrix was provided in Whittemore (98). The approximation is based on the moment generating function of the distribution of the covariates and is valid when the overall response probability is small. A formula for determining the sample size is based on the resulting asymptotic variance of the maximum likelihood estimator of the parameters. Later, the technique was extended to the Poisson regression model in Signorini (99). However, in this case, the expression of Fisher s information matrix is exact and there is no restriction of use in terms of the overall response level. This paper presents a direct modification of the approaches of Whittemore (98) and Signorini (99) to the question of sample size calculations. The major difference between our approach and theirs is that the value of the nuisance parameter under the null model is different from that under the alternative model. We use the limiting value of the maximum likelihood estimator of (b ) under the constraint b = as specified in the null hypothesis. Self & Mauritsen (988) and Self et al. (992) took a similar approach to the determination of sample sizes for the score statistic and likelihood ratio test, respectively, within the framework of generalised linear models. Although their methods are applicable to logistic and Poisson regression models, they assumed all the covariates in the model to be categorical with a finite number of covariate configurations. However, here we allow the covariate to be continuous or discrete with an infinite number of configurations, as in the cases of normal and Poisson covariates. In 2, the methodology is described and the procedures are illustrated with examples. In 3, simulation studies are performed and comparison made with the methods of Whittemore (98), Signorini (99) and Self et al. (992). 2. THE PROPOSED METHOD It was shown in Whittemore (98) that I j N exp(b )E{X X exp(xtb)} (i, j=,...,k), ij i j with X =, for logistic regression with small response probabilities, while the equation is exact for the case of Poisson regression as described by Signorini (99). Based on this question and the moment generating function of the distribution of covariates, m(t)=e{exp(xtt)}, the asymptotic variance of the maximum likelihood estimator b@ of b can be expressed as V(b, b)/n, where V(b, b)=n(b)/exp(b ); n(b) is the second diagonal element of M (b) with M=(m ij ), m =m =m, m i =m i =m i,m i = m/ t i, and m ij = 2m/ t i t j, for i, j=,...,k. In order to approximate the required sample size, we need to examine the asymptotic mean and variance of b@ under the null model, as in Self & Mauritsen (988). Let (b*, b* 2,...,b* K ) denote the solution of the equation lim N2 N E{S N (b,,b 2 )}=, where S N represents derivatives of the loglikelihood function with respect to (b ) and E{.} denotes expectation taken with respect to the true value of (b, b ). We synthesise the ideas of Whittemore (98) and Self & Mauritsen (988) by incorporating the value (b*, b*) with b*=(, b* 2,...,b* K ) as the asymptotic mean under the null model. Thus, the sample size needed to test the hypothesis

Miscellanea 95 H : b = with specified significance level a and power c against the alternative H : b N is N /2(b* qv, b*)z a/2 +V /2(b, b)z c b r 2, (2) where Z is the ( p)th percentile of the standard normal distribution. In contrast, Whittemore p (98) and Signorini (99) set the values of the nuisance parameter equal to (b, b()) with b()= (, b,...,b ), under both the null and alternative models. In our notation this gives 2 K N exp( b ) WS qn/2(b())z a/2 +n/2(b)z c b r 2. (3) In order to improve the accuracy, some modification was provided in Whittemore (98). For the univariate case, K=, the sample size is more accurately calculated as where N W =N WS {+2 exp(b )d(b )}, (4) d(b )= n/2()+n/2(b )R(b ), n/2()+n/2(b ) R(b )=n(b ){m (2b ) 2m (b )m (b )m (2b )+m 2(b )m(2b )m2(b )}. For the multivariate case, K2, the correction is too complicated, and a simple version was proposed for routine use: N W2 =N WS {+2 exp(b )}. (5) In summary, the actual implementation of the proposed approach is as follows. For a chosen logistic or Poisson regression model with specified parameter value (b, b), distribution of covariates f(x)and required significance a and power c, the minimum sample size N needed is determined from equation (2). The test statistic is b@ {N exp(b@ )/n(b@)}/2, (6) which is referred to the standard normal distribution. The null hypothesis is rejected if the absolute value of the statistic exceeds Z. a/2 To illustrate the general formula we continue the sample size calculations in Whittemore (98) and Signorini (99) for logistic and Poisson regression models, respectively. Whittemore (98) presented the problem of testing whether or not the incidence of coronary heart disease among white males aged 39 59 is related to their serum cholesterol level. During an 8-month follow-up period, the probability of a coronary heart disease event for a subject with the population mean serum cholesterol level is judged to be 7, and the cholesterol levels in this population are well represented by a standard normal distribution. Hence, we consider a simple logistic regression with g=b +Xb, b =log( 7), and X~N(, ). To detect the odds ratio of e, corresponding to b =, and e 5, corresponding to b = 5, for a subject with a cholesterol level of one standard deviation above the mean at a significance level 5 and with power 95, equation (4) yields N =2 47 and N =839, respectively. Here W W d(b )= +(+b2 ) exp(5b2 /4), n(b )=exp( b2/2). +exp( b2/4) To detect the same effects with the proposed method (2), the required sample sizes are N = 8 478 and N =662, respectively. The respective values of b* are 2 6549 and 2 554. Signorini (99) discussed a study of water pollution in terms of the number of illnesses and infections contracted per swimming season for ocean swimmers versus non-ocean or infrequent swimmers. We continue to consider a simple Poisson regression for the number of infections, with

96 GWOWEN SHIEH Table. Calculated sample sizes and estimates of actual power at specified sample size for logistic regression Proposed method Whittemore (98) Self et al. (992) Power Power Power 9 95 9 95 9 95 Bernoulli ( 5) Sample sizea (N,N,N) W S 739 265 245 2923 938 2396 Nominal powerb at N 9 95 7737 8644 8667 9288 Estimated power 882 9392 882 9392 8658 934 Error 98 8 65 748 9 6 Normalised Poisson (5)c Sample sizea (N,N,N) W S 364 44 53 634 484 599 Nominal powerb at N 8997 9499 732 826 82 87 Estimated power 8792 929 8792 929 844 922 Error 25 29 49 3 384 3 Standard normald Sample sizea (N,N,N) W S 4 57 547 666 545 675 Nominal powerb at N 92 95 7933 8755 83 8777 Estimated power 862 9222 862 9222 8346 94 Error 39 278 679 467 36 237 Multinomial ( 76, 9,, 4) Sample sizea (N,N,N) W2 S 5785 6929 635 7545 5783 752 Nominal powerb at N 9 95 8684 929 9 9439 Estimated power 926 9542 926 9542 8834 9262 Error 26 42 532 252 67 77 Multinomial ( 4,,, 4) Sample sizea (N,N,N) W2 S 33 475 3869 47 3245 44 Nominal powerb at N 9 95 8457 952 956 9528 Estimated power 9274 962 9274 962 97 9546 Error 274 2 87 45 4 8 Multinomial ( 25, 25, 25, 25) Sample sizea (N,N,N) W2 S 726 25 235 283 947 248 Nominal powerb at N 8999 95 7875 8756 8627 926 Estimated power 879 9376 879 9376 862 9256 Error 29 24 95 62 5 4 a Sample sizes needed to achieve power 9 and 95, respectively. b Nominal powers at calculated sample sizes of the proposed method in a. c The categorical approximation of Po(5) is (, 2, 3, 4, 5, 6, 7, 8, 9, ) with probabilities ( 44, 842, 44, 755, 755, 462, 44, 653, 363, 38). d The categorical approximation is ( 2 2, 7, 2, 7, 2, 2, 7, 2, 7, 2 2) with probabilities ( 228, 44, 98, 499, 95, 95, 499, 98, 44, 228). a single Bernoulli covariate indicating ocean swimming if X= and with p=pr(x=)= 5. In this case, the estimated infection rate of non-ocean or infrequent swimmers is 85, corresponding to b =log( 85), and the ratio of mean response for X= to mean response for X= is 3, corresponding to b =log( 3). It follows from equation (3) with n(b )={p exp(b )} +( p) that the sample sizes N WS needed for a significance level 5 with power 8, 9 and 95 are 58, 685 and 84, respectively. Our equation (2) gives 469, 629 and 779 at the three corresponding power levels and b* = 228.

Miscellanea 97 Table 2. Calculated sample sizes and estimates of actual power at specified sample size for Poisson regression Proposed method Signorini (99) Self et al. (992) Power Power Power 9 95 9 95 9 95 Bernoulli ( 5) Sample sizea (N,N,N) WS S 835 2285 2354 286 856 2295 Nominal powerb at N 9 95 869 895 8968 9492 Estimated power 888 9494 888 9494 898 954 Error 2 6 8 589 5 22 Normalised Poisson (5)c Sample sizea (N,N,N) WS S 389 472 464 554 455 563 Nominal powerb at N 8999 9498 8299 96 855 94 Estimated power 8966 9283 8966 9382 882 9278 Error 33 6 667 322 297 74 Standard normald Sample sizea (N,N,N) WS S 437 54 58 69 52 633 Nominal powerb at N 8997 95 8485 985 852 954 Estimated power 8762 944 8762 944 8748 939 Error 235 96 277 29 282 23 Multinomial ( 76, 9,, 4) Sample sizea (N,N,N) WS S 665 7383 628 7442 4933 6 Nominal powerb at N 9 95 897 9483 959 9776 Estimated power 9462 976 9462 976 9364 9656 Error 462 26 49 223 55 2 Multinomial ( 4,,, 4) Sample sizea (N,N,N) WS S 3537 4353 3963 4825 343 3886 Nominal powerb at N 9 95 865 9265 935 9682 Estimated power 9342 9692 9342 9692 94 97 Error 342 92 727 427 95 28 Multinomial ( 25, 25, 25, 25) Sample sizea (N,N,N) WS S 835 2285 2354 286 856 2295 Nominal powerb at N 9 95 869 895 8968 9492 Estimated power 894 9484 894 9484 8954 956 Error 97 6 835 579 4 4 a Sample sizes needed to achieve power 9 and 95, respectively. b Nominal powers at calculated sample sizes of the proposed method in a. c The categorical approximation of Po(5) is (, 2, 3, 4, 5, 6, 7, 8, 9, ) with probabilities ( 44, 842, 44, 755, 755, 462, 44, 653, 363, 38). d The categorical approximation is ( 2 2, 7, 2, 7, 2, 2, 7, 2, 7, 2 2) with probabilities ( 228, 44, 98, 499, 95, 95, 499, 98, 44, 228). 3. SIMULATION STUDIES The finite-sample adequacy of our formula was assessed through simulations, in which we also compared the proposed method to those of Whittemore (98), Signorini (99) and Self et al. (992). The results are presented in Tables and 2 for logistic and Poisson regression models, respectively. For both regression models, two linear predictors are examined, namely g=b +X b and g= b +X b +X b. In the case of the simple linear predictor g=b +X b, we consider Ber( 5), 2 2 normalised Po(5) and standard normal distributions for the covariate X. For the second predictor

98 GWOWEN SHIEH g=b +X b +X 2 b 2, the joint distribution of (X,X 2 ) is assumed to be multinomial with probabilities p, p 2, p 3 and p 4, corresponding to (x,x 2 ) values of (, ), (, ), (, ) and (, ), respectively. Three sets of (p, p 2, p 3, p 4 ) are studied to represent different distributional shapes, namely ( 76, 9,, 4), ( 4,,, 4) and ( 25, 25, 25, 25). Both the parameter of interest b and the confounding parameter b 2 are taken to be log 2. The intercept parameter b is chosen to satisfy the overall response m: = 5, where m: =E[exp(g)/{+exp(g)}] and m: =E{exp(g)} for the logistic and Poisson regression models, respectively. First, we calculated from equations (2) (5) the sample sizes (N,N WS,N W,N W2 ) required to achieve the selected significance 5 and power ( 9, 95) within the model specifications. Let N S denote the sample size for the approach of Self et al. (992). Since they assumed all of the covariates to be categorical with a finite number of configurations, discretisation schemes are needed for the cases of Poisson and standard normal covariates; the chosen schemes are listed in the footnotes of Tables and 2. These estimates of sample size allow comparison of relative efficiencies of the approaches. Since the magnitude of the sample size affects the accuracy of the asymptotic distribution and the resulting formulae, we unify the sample sizes in the simulations; the sample size N is chosen as the benchmark and is used to recalculate the nominal powers for all competing approaches. Estimates of the true power associated with given sample size and model configuration are then computed through Monte Carlo simulation based on 5 independent datasets. For each replicate, N covariate values are generated from the selected distribution. These covariate values determine the incidence rates for generating N Bernoulli or Poisson outcomes. Then the test statistic is computed and the estimated power is the proportion of the 5 replicates whose test statistic values exceed the critical value. The adequacy of the sample size formula is determined by the difference between the estimated power and nominal power specified above. All calculations are performed using programs written with SAS/IML (SAS Institute, 989). The results in Tables and 2 suggest that there is a close agreement between the estimated power and the nominal power for the proposed method regardless of the model configuration and covariate distribution. The only exceptions are with the extremely skewed multinomial covariate distribution ( 76, 9,, 4). The approach of Self et al. (992) is also very good at achieving the nominal levels, but the approaches of Whittemore (98) and Signorini (99) incur much larger errors. We conclude that the proposed method maintains the accuracy within a reasonable range of nominal power and is much more accurate than the previous approaches proposed by Whittemore (98) and Signorini (99). Nevertheless, unbalanced allocations appear to degrade the accuracy of sample size calculations. ACKNOWLEDGEMENT I would like to thank the editor and referees for their helpful comments that improved the presentation of this paper. This work was partially supported by the National Science Council of Taiwan. REFERENCES MCCULLAGH, P. & NELDER, J. A. (989). Generalized L inear Models, 2nd ed. London: Chapman and Hall. NELDER, J. A. & WEDDERBURN, R. W. M. (972). Generalized linear models. J. R. Statist. Soc. A 35, 37 84. SAS INSTITUTE (989). SAS/IML Software: Usage and Reference, Version 6. Carey, NC: SAS Institute Inc. SELF, S. G. & MAURITSEN, R. H. (988). Power/sample size calculations for generalized linear models. Biometrics 44, 79 86.

Miscellanea 99 SELF, S. G., MAURITSEN, R. H. & OHARA, J. (992). Power calculations for likelihood ratio tests in generalized linear models. Biometrics 48, 3 9. SIGNORINI, D. F. (99). Sample size for Poisson regression. Biometrika 78, 446 5. WHITTEMORE, A. S. (98). Sample size for logistic regression with small response probability. J. Am. Statist. Assoc. 76, 27 32. [Received March 2. Revised April 2]