Confidence Intervals of the Simple Difference between the Proportions of a Primary Infection and a Secondary Infection, Given the Primary Infection

Biometrical Journal 42 (2000) 1, 59–69

Kung-Jong Lui
Department of Mathematical Sciences, College of Sciences, San Diego State University, USA

Summary

This paper discusses interval estimation of the simple difference (SD) between the proportions of the primary infection and the secondary infection, given the primary infection, by developing three asymptotic interval estimators using Wald's test statistic, the likelihood ratio test, and the basic principle of Fieller's theorem. This paper further evaluates and compares the performance of these interval estimators with respect to the coverage probability and the expected length of the resulting confidence intervals. This paper finds that the asymptotic confidence interval using the likelihood ratio test consistently performs well in all situations considered here. When the underlying SD is within 0.10 and the total number of subjects is not large (say, 50), this paper further finds that the interval estimator using Fieller's theorem is preferable to the estimator using Wald's test statistic if the primary infection probability is moderate (say, 0.30), but the latter is preferable to the former if this probability is large (say, 0.80). When the total number of subjects is large (say, 200), all three interval estimators perform well in almost all situations considered in this paper. In these cases, for simplicity, we may apply either of the two interval estimators using Wald's test statistic or Fieller's theorem without losing much accuracy or efficiency compared with the interval estimator using the asymptotic likelihood ratio test.

Key words: Interval estimation; Coverage probability; Likelihood ratio test; Fieller's theorem.

1. Introduction

To establish the characteristics of a given disease, one of the interesting problems is to assess the effect of the primary infection on the likelihood of developing the secondary infection. For example, consider the data (Agresti, 1990, pages 45–46) about a sample of calves. Calves are first classified by whether they get a primary pneumonia infection. After recovering from the primary infection, calves are then reclassified by whether they develop a secondary infection within a defined time period. In this situation, observations are taken from the same group of calves and hence are likely to be dependent. Therefore, when estimating the simple difference (SD) between the probability of the primary infection and the conditional probability of the secondary infection, given the primary infection, we cannot apply any of the interval estimators of the SD developed for two independent samples (Thomas and Gart, 1977; Anbar, 1983, 1984; Beal, 1987; Mee, 1984; Hauck and Anderson, 1986; Miettinen and Nurminen, 1985; Santner and Snell, 1980; Wallenstein, 1997). Note that a completely randomized trial, in which calves are randomly allocated to the control and experimental groups, is certainly neither ethical nor adequate for use here.

In this paper, we concentrate on interval estimation of the SD between the probability of the primary infection and the conditional probability of the secondary infection, given the primary infection. We develop three asymptotic interval estimators using Wald's test statistic, the likelihood ratio test, and the basic principle of Fieller's theorem. To evaluate and compare the performance of these interval estimators, we calculate the coverage probability and the expected length of the resulting confidence intervals on the basis of the exact distribution in a variety of situations. We find that the interval estimator using the asymptotic likelihood ratio test, which involves a sophisticated numerical procedure, consistently performs well in all the situations considered here. When the underlying SD is within 0.10 and the total number of subjects is not large (say, 50), the interval estimator using Fieller's theorem is preferable to the estimator using Wald's test statistic if the underlying primary infection probability is moderate (say, 0.30). On the other hand, the latter is preferable to the former if the underlying primary infection probability is high (say, 0.80). When the total number of subjects is large (say, 200), all three estimators perform reasonably well in almost all situations considered in this paper.
Therefore, for simplicity, we may apply either of the two asymptotic interval estimators using Wald's test statistic or Fieller's theorem in these situations without losing much accuracy or efficiency compared with the asymptotic confidence interval using the likelihood ratio test.

Note that Agresti (1990) discusses a hypothesis testing procedure for testing whether the primary infection has an effect on the probability of developing the secondary infection, and Lui (1998) discusses interval estimation of the risk ratio between the two successive infections. However, neither of these works considers interval estimation of the SD, which is the focus here.

2. Interval Estimators

Consider a study in which the data can be summarized by the following 2 × 2 table:

| | Secondary infection: Yes | Secondary infection: No | Total |
|---|---|---|---|
| Primary infection: Yes | $p_{11}$ | $p_{12}$ | $p_{1\cdot}$ |
| Primary infection: No | | $p_{22}$ | $p_{22}$ |

where $0 < p_{ij} < 1$ (for $i = 1, 2$ and $j = 1, 2$) denotes the probability of the corresponding cell, $p_{1\cdot} = p_{11} + p_{12}$, and $p_{1\cdot} + p_{22} = 1$. As also noted elsewhere (Agresti, 1990), by definition, no subject can have the secondary infection without first having the primary infection (i.e., $p_{21} = 0$). In this paper, we focus on interval estimation of the SD between the probability of the primary infection and the conditional probability of the secondary infection, given the primary infection. In terms of the $p_{ij}$, the SD, denoted by $\delta$, is defined as $p_{1\cdot} - p_{11}/p_{1\cdot}$. Hence, for given $p_{1\cdot}$ and $\delta$, we have $p_{11} = p_{1\cdot}(p_{1\cdot} - \delta)$, $p_{12} = p_{1\cdot}(1 - p_{1\cdot} + \delta)$, and $p_{22} = 1 - p_{1\cdot}$. Note that the range for $\delta$, by definition, is $-1 < \delta < 1$.

Suppose that we take a random sample of $n$ subjects. Let $n_{ij}$ denote the number of subjects who fall in the cell with probability $p_{ij}$. The log-likelihood for a given $(n_{11}, n_{12}, n_{22})$ is then

$$\log L = C + n_{11}\{\log p_{1\cdot} + \log(p_{1\cdot} - \delta)\} + n_{12}\{\log p_{1\cdot} + \log(1 - p_{1\cdot} + \delta)\} + n_{22}\log(1 - p_{1\cdot}), \tag{1}$$

where $C$ is a constant that does not depend on the parameters $\delta$ and $p_{1\cdot}$. On the basis of (1), we can easily show that the maximum likelihood estimates (MLEs) of $p_{1\cdot}$ and $\delta$ are $\hat p_{1\cdot} = (n_{11} + n_{12})/n$ and $\hat\delta = \hat p_{1\cdot} - \hat p_{11}/\hat p_{1\cdot}$, respectively, where $\hat p_{11} = n_{11}/n$. Furthermore, using the inverse of the observed information matrix, we obtain the estimate of the asymptotic variance of $\hat\delta$ (Appendix):

$$\widehat{\mathrm{Var}}(\hat\delta) = \{\hat p_{11}\hat p_{12}/\hat p_{1\cdot}^{3} + \hat p_{1\cdot}(1 - \hat p_{1\cdot})\}/n.$$

Therefore, the asymptotic $100(1 - \alpha)\%$ confidence interval for $\delta$ is

$$[m_l,\ m_u], \tag{2}$$

where

$$m_l = \max\Big\{-1,\ \hat\delta - Z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat\delta)}\Big\} \quad\text{and}\quad m_u = \min\Big\{1,\ \hat\delta + Z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat\delta)}\Big\},$$

and $Z_{\alpha}$ is the upper $100\alpha$th percentile of the standard normal distribution.
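The Wald-type interval (2) is simple enough to sketch in a few lines of Python. The block below is an illustration only, not the author's SAS program: the function name `wald_interval` and the hard-coded normal quantile 1.959964 (for a 95% interval) are assumptions made for this example. It is applied to the calf data of Agresti (1990) analysed in Section 5 ($n_{11} = 30$, $n_{12} = 63$, $n_{22} = 63$).

```python
import math


def wald_interval(n11, n12, n22, z=1.959964):
    """Sketch of the Wald-type interval (2) for delta = p1. - p11/p1.

    `z` is the upper alpha/2 normal quantile (default: 95% interval).
    Returns (delta_hat, lower, upper).
    """
    n = n11 + n12 + n22
    p1 = (n11 + n12) / n            # MLE of the primary infection probability
    if p1 == 0:
        raise ValueError("delta-hat is undefined when p1.-hat = 0")
    p11, p12 = n11 / n, n12 / n     # MLEs of the cell probabilities
    delta = p1 - p11 / p1           # MLE of the simple difference
    # Estimated asymptotic variance from the observed information matrix
    var = (p11 * p12 / p1**3 + p1 * (1 - p1)) / n
    half = z * math.sqrt(var)
    # Truncate the limits to the parameter range [-1, 1]
    return delta, max(-1.0, delta - half), min(1.0, delta + half)


est, lower, upper = wald_interval(30, 63, 63)
print(round(est, 3), round(lower, 3), round(upper, 3))  # 0.274 0.151 0.396
```

The printed limits agree with the interval [0.151, 0.396] reported for estimator (2) in the example of Section 5.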
For testing $H_0: \delta = \delta_0$ versus $H_a: \delta \neq \delta_0$, it is easy to see that the acceptance region using the asymptotic likelihood ratio test consists of all sample vectors $(n_{11}, n_{12}, n_{22})$ such that

$$2\Big[n_{11}\log\hat p_{11} + n_{12}\log\hat p_{12} + n_{22}\log\hat p_{22} - n_{11}\big[\log\{\hat p_{1\cdot}(\delta_0)\} + \log\{\hat p_{1\cdot}(\delta_0) - \delta_0\}\big] - n_{12}\big[\log\{\hat p_{1\cdot}(\delta_0)\} + \log\{1 - \hat p_{1\cdot}(\delta_0) + \delta_0\}\big] - n_{22}\log\{1 - \hat p_{1\cdot}(\delta_0)\}\Big] \le \chi^2_{\alpha}, \tag{3}$$

where $\hat p_{ij} = n_{ij}/n$ is the MLE of $p_{ij}$; $\hat p_{1\cdot}(\delta_0)$ denotes the conditional MLE of $p_{1\cdot}$ for a given fixed $\delta_0$ (Appendix); and $\chi^2_{\alpha}$ is the upper $100\alpha$th percentile of the central $\chi^2$-distribution with one degree of freedom. Therefore, we can obtain the asymptotic likelihood ratio test based confidence interval by inverting the acceptance region (Casella and Berger, 1990):

$$[r_l,\ r_u], \tag{4}$$

where $-1 < r_l < r_u < 1$ are the smaller and the larger roots of $\delta_0$ such that

$$2\Big[n_{11}\log\hat p_{11} + n_{12}\log\hat p_{12} + n_{22}\log\hat p_{22} - n_{11}\big[\log\{\hat p_{1\cdot}(\delta_0)\} + \log\{\hat p_{1\cdot}(\delta_0) - \delta_0\}\big] - n_{12}\big[\log\{\hat p_{1\cdot}(\delta_0)\} + \log\{1 - \hat p_{1\cdot}(\delta_0) + \delta_0\}\big] - n_{22}\log\{1 - \hat p_{1\cdot}(\delta_0)\}\Big] = \chi^2_{\alpha}.$$

Recall that, by definition, the $\delta$ defined here can be rewritten as a ratio, $(p_{1\cdot}^2 - p_{11})/p_{1\cdot}$. Following Fieller's theorem (Casella and Berger, 1990), we define

$$Z = (n\hat p_{1\cdot}^2 - \hat p_{1\cdot})/(n - 1) - \hat p_{11} - \delta\hat p_{1\cdot}.$$

Note that the expectations are $E[(n\hat p_{1\cdot}^2 - \hat p_{1\cdot})/(n - 1)] = p_{1\cdot}^2$ and $E(\hat p_{11}) = p_{11}$. Thus, $E(Z) = 0$. Using the delta method and the multivariate Central Limit Theorem (Anderson, 1958), we can easily show that $\sqrt{n}\,Z$ asymptotically follows the normal distribution with mean 0 and asymptotic variance

$$\mathrm{Var}_3 = p_{11}(1 - p_{11}) + [(2np_{1\cdot} - 1)/(n - 1) - \delta]^2\, p_{1\cdot}(1 - p_{1\cdot}) - 2[(2np_{1\cdot} - 1)/(n - 1) - \delta]\, p_{11}p_{22}.$$

Thus, $P\{Z^2/(\mathrm{Var}_3/n) \le Z_{\alpha/2}^2\} \approx 1 - \alpha$ when $n$ is large. This leads us to consider the following working quadratic inequality in $\delta$:

$$\hat A\delta^2 - \hat B\delta + \hat C \le 0, \tag{5}$$

where

$$\hat A = \hat p_{1\cdot}^2 - Z_{\alpha/2}^2\,\hat p_{1\cdot}(1 - \hat p_{1\cdot})/n,$$

$$\hat B = 2\Big\{[(n\hat p_{1\cdot}^2 - \hat p_{1\cdot})/(n - 1) - \hat p_{11}]\,\hat p_{1\cdot} - Z_{\alpha/2}^2\big[(2n\hat p_{1\cdot} - 1)\,\hat p_{1\cdot}(1 - \hat p_{1\cdot})/\{(n - 1)n\} - \hat p_{11}\hat p_{22}/n\big]\Big\},$$

$$\hat C = [(n\hat p_{1\cdot}^2 - \hat p_{1\cdot})/(n - 1) - \hat p_{11}]^2 - Z_{\alpha/2}^2\big[\hat p_{11}(1 - \hat p_{11})/n + (2n\hat p_{1\cdot} - 1)^2\,\hat p_{1\cdot}(1 - \hat p_{1\cdot})/\{(n - 1)^2 n\} - 2(2n\hat p_{1\cdot} - 1)\,\hat p_{11}\hat p_{22}/\{(n - 1)n\}\big].$$

If both $\hat A > 0$ and $\hat B^2 - 4\hat A\hat C > 0$, then the asymptotic $100(1 - \alpha)\%$ confidence interval of the SD for large $n$ is given by

$$[q_l,\ q_u], \tag{6}$$

where

$$q_l = \max\Big\{-1,\ \big(\hat B - \sqrt{\hat B^2 - 4\hat A\hat C}\big)/(2\hat A)\Big\} \quad\text{and}\quad q_u = \min\Big\{1,\ \big(\hat B + \sqrt{\hat B^2 - 4\hat A\hat C}\big)/(2\hat A)\Big\}.$$

3. Coverage Probability and Expected Length

To evaluate the finite-sample performance of interval estimators (2), (4), and (6) for the SD, we calculate the coverage probability and the expected length of the resulting 95% confidence interval on the basis of the exact trinomial distribution.
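Before turning to the evaluation, the Fieller-type limits defined by (5) and (6) in Section 2 can be sketched in Python. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `fieller_interval`, the helper names `a`, `c`, `v1`, and `cov`, and the hard-coded 95% normal quantile are all introduced here for the example. The function returns `None` when $\hat A \le 0$ or the discriminant is not positive, the cases in which (6) does not apply.

```python
import math


def fieller_interval(n11, n12, n22, z=1.959964):
    """Sketch of the Fieller-type interval (6); returns None if (6) is inapplicable."""
    n = n11 + n12 + n22
    p1, p11, p22 = (n11 + n12) / n, n11 / n, n22 / n
    z2 = z * z
    a = (n * p1**2 - p1) / (n - 1) - p11   # unbiased estimate of p1.^2 - p11
    c = (2 * n * p1 - 1) / (n - 1)         # derivative of the p1.^2 estimator
    v1 = p1 * (1 - p1) / n                 # estimated variance of p1.-hat
    cov = p11 * p22 / n                    # estimated covariance of p11-hat and p1.-hat
    # Coefficients of the working quadratic A*d^2 - B*d + C <= 0 in (5)
    A = p1**2 - z2 * v1
    B = 2 * (a * p1 - z2 * (c * v1 - cov))
    C = a**2 - z2 * (p11 * (1 - p11) / n + c**2 * v1 - 2 * c * cov)
    disc = B * B - 4 * A * C
    if A <= 0 or disc <= 0:
        return None
    root = math.sqrt(disc)
    return max(-1.0, (B - root) / (2 * A)), min(1.0, (B + root) / (2 * A))


limits = fieller_interval(30, 63, 63)
print(tuple(round(x, 3) for x in limits))  # (0.137, 0.385)
```

For the calf data of Agresti (1990) used in Section 5, the sketch reproduces the interval [0.137, 0.385] reported there for estimator (6).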
By definition, the coverage probability is simply equal to

$$\sum 1(\delta \in [c_l, c_u])\, f(n_{11}, n_{12}, n_{22}),$$

where the sum is over all possible sample vectors, $[c_l, c_u]$ is the confidence interval obtained from (2), (4), or (6) and is a function of $(n_{11}, n_{12}, n_{22})$; $1(\delta \in [c_l, c_u])$ is the indicator function, equal to 1 if $\delta \in [c_l, c_u]$ is true and 0 otherwise; and $f(n_{11}, n_{12}, n_{22})$ is the trinomial distribution with the underlying cell probabilities $p_{11}$, $p_{12}$, and $p_{22}$. Similarly, the expected length of the resulting confidence interval is given by

$$\sum (c_u - c_l)\, f(n_{11}, n_{12}, n_{22}).$$

Note that when $\hat p_{1\cdot} = 0$, $\hat\delta$ is not well defined and interval estimator (2) is inapplicable. Similarly, in this case, the coefficient of the quadratic term $\delta^2$ in (5) is 0, and hence we cannot apply (6) to obtain the confidence interval of $\delta$ either. Furthermore, if either $\hat A < 0$ or $\hat B^2 - 4\hat A\hat C < 0$, then (6) cannot be applied either. Note also that the logarithmic function $\log X$ is defined only for $0 < X < \infty$. Therefore, if any cell frequency $n_{ij}$ in a random vector $(n_{11}, n_{12}, n_{22})$ were 0, we would not be able to apply interval estimator (4). When evaluating the performance of (2), (4), and (6), we calculate the coverage probability and the expected length conditional upon those samples in which the confidence limits of the respective interval estimator exist. For completeness, we also calculate the probability that we fail to produce confidence limits for each of interval estimators (2), (4), and (6).

For given values of $p_{1\cdot}$ and $\delta$, as noted before, all parameter values, $p_{11} = p_{1\cdot}(p_{1\cdot} - \delta)$, $p_{12} = p_{1\cdot}(1 - p_{1\cdot} + \delta)$, and $p_{22} = 1 - p_{1\cdot}$, are uniquely determined. We consider the situations in which $p_{1\cdot} = 0.30$, 0.50, and 0.80; $\delta = -0.30, -0.20, -0.10, \ldots, 0.30$, subject to the restriction that the corresponding cell probabilities $p_{11}$, $p_{12}$, and $p_{22}$ are all $> 0$; and $n = 50$, 100, and 200. We wrote programs in SAS (1990) to enumerate the exact probability $f(n_{11}, n_{12}, n_{22})$ of the desired trinomial distribution.

4. Results

Table 1 summarizes the results on the coverage probability and the expected length of the resulting 95% confidence intervals, conditional upon those samples in which the confidence limits of the respective interval estimator exist, in a variety of situations.
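The exact evaluation described in Section 3 amounts to enumerating every trinomial outcome. The Python sketch below illustrates this for the Wald-type interval (2) only; it is not the author's SAS program. The names `wald_limits` and `exact_performance` and the 95% quantile 1.959964 are assumptions made for this example, and any other function returning `(lower, upper)` or `None` could be passed as `limits`.

```python
import math


def wald_limits(n11, n12, n22, z=1.959964):
    """Wald-type limits (2); None when p1.-hat = 0 and delta-hat is undefined."""
    n = n11 + n12 + n22
    p1 = (n11 + n12) / n
    if p1 == 0:
        return None
    p11, p12 = n11 / n, n12 / n
    delta = p1 - p11 / p1
    half = z * math.sqrt((p11 * p12 / p1**3 + p1 * (1 - p1)) / n)
    return max(-1.0, delta - half), min(1.0, delta + half)


def exact_performance(n, p1, delta, limits=wald_limits):
    """Conditional coverage, conditional expected length, and failure probability.

    Sums over all (n11, n12, n22) with n11 + n12 + n22 = n, weighting each
    outcome by its exact trinomial probability.
    """
    p11, p12, p22 = p1 * (p1 - delta), p1 * (1 - p1 + delta), 1 - p1
    cover = length = usable = 0.0
    for n11 in range(n + 1):
        for n12 in range(n - n11 + 1):
            n22 = n - n11 - n12
            prob = (math.comb(n, n11) * math.comb(n - n11, n12)
                    * p11**n11 * p12**n12 * p22**n22)
            ci = limits(n11, n12, n22)
            if ci is None:          # limits do not exist for this sample
                continue
            usable += prob
            length += prob * (ci[1] - ci[0])
            if ci[0] <= delta <= ci[1]:
                cover += prob
    # Condition on the samples for which the limits exist
    return cover / usable, length / usable, 1 - usable


cov, length, fail = exact_performance(50, 0.5, 0.0)
print(round(cov, 3), round(length, 3))
```

For $n = 50$, $p_{1\cdot} = 0.50$, and $\delta = 0.0$, the output should land close to the Table 1 entry for estimator (2), a coverage of 0.937 with expected length 0.474; small differences could arise from the quantile or rounding conventions assumed here.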
As seen from Table 1, when $n \ge 200$, all estimators perform reasonably well in almost all situations considered here. When both $n$ and $p_{1\cdot}$ are not large (i.e., $n = 50$ and $p_{1\cdot} = 0.30$) and $\delta$ is within 0.10, estimators (4) and (6) outperform estimator (2), whose coverage probability is likely to be less than the desired confidence level. On the other hand, in these cases but with $p_{1\cdot}$ large ($= 0.80$), estimators (2) and (4) are preferable to estimator (6). We also find that the probability of failing to produce a 95% confidence interval using either estimator (2) or (6) is negligible ($< 0.001$) in all situations considered in Table 1, but this probability for (4) can be of practical significance when $n$ is not large ($= 50$).

Table 1. The coverage probability and the expected length (in parentheses) of the resulting 95% confidence interval for the underlying risk difference between the primary infection and the secondary infection given the primary infection, $\delta = -0.30, -0.20, \ldots, 0.30$, subject to the restriction that $p_{11}$, $p_{12}$, and $p_{22}$ are all $> 0$, for estimators (2), (4), and (6) in the situations in which the probability of primary infection $p_{1\cdot} = 0.30$, 0.50, and 0.80 and the total number of subjects $n = 50$, 100, and 200.

| $p_{1\cdot}$ | $\delta$ | n = 50, (2) | n = 50, (4) | n = 50, (6) | n = 100, (2) | n = 100, (4) | n = 100, (6) | n = 200, (2) | n = 200, (4) | n = 200, (6) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.30 | −0.3 | 0.926 (0.548) | 0.941 (0.528) | 0.919 (0.628) | 0.942 (0.391) | 0.946 (0.384) | 0.934 (0.416) | 0.943 (0.278) | 0.950 (0.275) | 0.943 (0.286) |
| | −0.2 | 0.930 (0.557) | 0.944 (0.537) | 0.937 (0.640) | 0.941 (0.398) | 0.948 (0.390) | 0.942 (0.423) | 0.945 (0.282) | 0.949 (0.279) | 0.946 (0.291) |
| | −0.1 | 0.922 (0.548) | 0.942 (0.530) | 0.943 (0.631) | 0.940 (0.391) | 0.948 (0.384) | 0.948 (0.416) | 0.945 (0.278) | 0.949 (0.275) | 0.949 (0.286) |
| | 0.0 | 0.924 (0.518) | 0.949 (0.510) | 0.955 (0.599) | 0.939 (0.371) | 0.949 (0.366) | 0.950 (0.395) | 0.947 (0.263) | 0.947 (0.262) | 0.952 (0.271) |
| | 0.1 | 0.910 (0.465) | 0.955 (0.475) | 0.962 (0.542) | 0.936 (0.334) | 0.948 (0.335) | 0.957 (0.357) | 0.942 (0.238) | 0.948 (0.238) | 0.954 (0.245) |
| | 0.2 | 0.935 (0.380) | 0.958 (0.431) | 0.971 (0.453) | 0.938 (0.275) | 0.959 (0.288) | 0.965 (0.296) | 0.943 (0.196) | 0.948 (0.200) | 0.954 (0.203) |
| 0.50 | −0.3 | 0.934 (0.412) | 0.954 (0.412) | 0.914 (0.443) | 0.944 (0.294) | 0.948 (0.293) | 0.931 (0.304) | 0.946 (0.209) | 0.949 (0.208) | 0.939 (0.212) |
| | −0.2 | 0.938 (0.448) | 0.945 (0.443) | 0.918 (0.479) | 0.943 (0.319) | 0.948 (0.317) | 0.932 (0.329) | 0.947 (0.226) | 0.949 (0.225) | 0.942 (0.230) |
| | −0.1 | 0.930 (0.468) | 0.952 (0.460) | 0.940 (0.499) | 0.947 (0.333) | 0.951 (0.330) | 0.942 (0.343) | 0.948 (0.236) | 0.949 (0.235) | 0.946 (0.240) |
| | 0.0 | 0.937 (0.474) | 0.951 (0.466) | 0.944 (0.506) | 0.947 (0.338) | 0.951 (0.334) | 0.949 (0.348) | 0.948 (0.239) | 0.949 (0.238) | 0.945 (0.243) |
| | 0.1 | 0.941 (0.468) | 0.946 (0.460) | 0.942 (0.499) | 0.945 (0.333) | 0.949 (0.330) | 0.944 (0.343) | 0.949 (0.236) | 0.950 (0.235) | 0.948 (0.240) |
| | 0.2 | 0.941 (0.448) | 0.945 (0.443) | 0.943 (0.479) | 0.945 (0.319) | 0.949 (0.317) | 0.946 (0.329) | 0.948 (0.226) | 0.950 (0.225) | 0.948 (0.230) |
| | 0.3 | 0.935 (0.412) | 0.946 (0.412) | 0.944 (0.443) | 0.946 (0.294) | 0.947 (0.293) | 0.945 (0.304) | 0.948 (0.209) | 0.950 (0.208) | 0.948 (0.212) |
| 0.80 | −0.1 | 0.941 (0.284) | 0.955 (0.293) | 0.919 (0.293) | 0.948 (0.203) | 0.950 (0.206) | 0.939 (0.206) | 0.950 (0.144) | 0.950 (0.145) | 0.946 (0.145) |
| | 0.0 | 0.944 (0.328) | 0.952 (0.332) | 0.928 (0.336) | 0.947 (0.233) | 0.948 (0.235) | 0.941 (0.236) | 0.949 (0.166) | 0.949 (0.166) | 0.944 (0.167) |
| | 0.1 | 0.943 (0.356) | 0.949 (0.356) | 0.932 (0.364) | 0.947 (0.253) | 0.950 (0.253) | 0.942 (0.256) | 0.948 (0.180) | 0.949 (0.180) | 0.946 (0.181) |
| | 0.2 | 0.943 (0.371) | 0.946 (0.369) | 0.934 (0.379) | 0.948 (0.264) | 0.949 (0.264) | 0.944 (0.267) | 0.949 (0.187) | 0.950 (0.187) | 0.945 (0.188) |
| | 0.3 | 0.945 (0.377) | 0.948 (0.373) | 0.941 (0.384) | 0.945 (0.268) | 0.948 (0.267) | 0.944 (0.271) | 0.948 (0.190) | 0.950 (0.190) | 0.947 (0.191) |

5. An Example

To illustrate the practical usefulness of (2), (4), and (6), we consider the example (Agresti, 1990, pages 45–46) about 156 calves born in Florida. Calves are first classified according to whether they are infected with pneumonia within 60 days after birth. They are then classified again by whether they develop a secondary infection within two weeks after the first infection clears up. As shown in Table 3.2 on page 46 of Agresti (1990), we have $n_{11} = 30$, $n_{12} = 63$, and $n_{22} = 63$. Given these data, the estimate $\hat\delta$ is 0.274. Applying interval estimators (2), (4), and (6), we obtain the 95% confidence intervals of $\delta$ to be [0.151, 0.396], [0.148, 0.392], and [0.137, 0.385], respectively. Because the lower limits of these confidence intervals are all larger than 0, applying any of these interval estimators suggests that the primary infection of pneumonia may stimulate a natural immunity that reduces the likelihood of a secondary infection. Although this inference is the same as that reached elsewhere using a hypothesis test procedure (Agresti, 1990, page 47), drawing this conclusion implicitly assumes that the immunity level of calves to pneumonia does not vary much within the first 3 months of birth and that the follow-up period of 14 days is long enough to calculate the proportion of the secondary infection. When applying the study design discussed here to study natural immunity, it is certainly important to choose an appropriate length for the follow-up period. However, this decision depends essentially on subjective knowledge of the characteristics of the underlying disease and is beyond the scope of this paper.

6. Discussion

The coverage probability of interval estimator (4) using the asymptotic likelihood ratio test consistently agrees reasonably well with the desired confidence level of 95% in all situations considered in Table 1, while those of estimators (2) and (6) can be less than 95% when $n$ is not large.
Furthermore, the expected length for (4) is often the shortest among these three estimators when the coverage probability is close to 95% (Table 1). Therefore, in situations in which the probability of failing to produce an interval estimate using (4) is negligible, estimator (4) can generally be recommended if $n$ is not large ($= 50$). On the other hand, (4) requires a sophisticated numerical procedure to calculate the confidence limits, while the other two estimators (2) and (6) are simple to implement. Thus, when $n$ is large ($\ge 200$) and all three estimators are essentially equivalent, we may wish to apply estimators (2) and (6) for simplicity.

In the above example, the MLEs of $p_{1\cdot}$ and $\delta$ are $\hat p_{1\cdot} \approx 0.60$ and $\hat\delta \approx 0.274$, respectively. The total number of subjects $n$ is 156. According to the results presented in Table 1, all three interval estimators (2), (4), and (6) are appropriate for use in this case. This is consistent with the finding that all the resulting 95% confidence intervals are similar to one another.

Table 2. The probability of failing to produce a 95% confidence interval in application of interval estimators (2), (4), and (6) for the underlying risk difference $\delta = -0.30, -0.20, -0.10, \ldots, 0.30$, subject to the restriction that $p_{11}$, $p_{12}$, and $p_{22}$ are all $> 0$, in the situations in which the probability of primary infection $p_{1\cdot} = 0.30$, 0.50, and 0.80 and the total number of subjects $n = 50$, 100, and 200.

| $p_{1\cdot}$ | $\delta$ | n = 50, (2) | n = 50, (4) | n = 50, (6) | n = 100, (2) | n = 100, (4) | n = 100, (6) | n = 200, (2) | n = 200, (4) | n = 200, (6) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.30 | −0.3 | 0.000 | 0.002 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | −0.2 | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | −0.1 | 0.000 | 0.002 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | 0.0 | 0.000 | 0.009 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | 0.1 | 0.000 | 0.045 | 0.000 | 0.000 | 0.002 | 0.000 | 0.000 | 0.000 | 0.000 |
| | 0.2 | 0.000 | 0.218 | 0.000 | 0.000 | 0.048 | 0.000 | 0.000 | 0.002 | 0.000 |
| 0.50 | −0.3 | 0.000 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | −0.2 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | −0.1 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | 0.1 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | 0.2 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | 0.3 | 0.000 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 0.80 | −0.1 | 0.000 | 0.015 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | 0.0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | 0.1 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | 0.2 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | 0.3 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |

Note that the probability of failing to produce confidence limits using (2) and (6), as shown in Table 2, is negligible for all situations considered here. Therefore, the coverage probability and the expected length for these two estimators, calculated conditional upon the samples in which the confidence limits exist, are essentially equivalent to those calculated over all samples. However, the probability of failing to apply (4), which occurs when any cell frequency $n_{11}$, $n_{12}$, or $n_{22}$ equals 0, can be non-negligible.
For example, when $n = 50$, $p_{1\cdot} = 0.30$, and $\delta = 0.20$, this probability is approximately 0.218 (Table 2). To avoid this limitation in application of (4), we can apply the commonly used adjustment for sparse data by adding 0.50 to each cell frequency whenever this occurs. With this ad hoc adjustment in the above case considered in Table 2, we find that the coverage probability and the expected length change from 0.958 and 0.431 to 0.950 and 0.412, respectively. The magnitudes of these changes are of no practical importance. In fact, we have recalculated all the coverage probabilities and expected lengths with this ad hoc adjustment to eliminate the probability of failing to produce confidence limits using (4) in all situations considered in Table 1. Because the differences between the results for (4) presented in Table 1 and those with this adjustment are generally quite small, we do not present them, for brevity.

Finally, note that although the logarithmic transformation has been successfully applied to derive confidence intervals for other epidemiologic indices such as the risk ratio or the odds ratio (Katz et al., 1978; Lui, 1995, 1996, and 1998), we do not recommend this transformation for deriving the confidence interval of the SD considered here. This is not only because the sampling distribution of $\log\hat\delta$ can be even more skewed than that of $\hat\delta$ when the underlying $\delta$ is small, but also because $\log\hat\delta$ is undefined when $\hat\delta < 0$.

In summary, this paper proposes three asymptotic confidence intervals for the SD between successive infections. This paper demonstrates that the interval estimator using the asymptotic likelihood ratio test can consistently perform well in a variety of situations. However, application of this procedure involves iterative numerical calculation. When the probability of the underlying primary infection is moderate ($= 0.30$) and the SD is within 0.10, we may use the interval estimator based on Fieller's theorem. On the other hand, when the probability of the underlying primary infection is high ($= 0.80$), we may apply the interval estimator based on Wald's test statistic.

Acknowledgements

The author wishes to thank the referee for many helpful and valuable comments that improved the clarity of this paper. This work was supported in part by grant #R01-HS07161 from the Agency for Health Care Policy and Research.
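Appendix

For a given sample vector $(n_{11}, n_{12}, n_{22})$, the log-likelihood is

$$\log L = C + n_{11}\{\log p_{1\cdot} + \log(p_{1\cdot} - \delta)\} + n_{12}\{\log p_{1\cdot} + \log(1 - p_{1\cdot} + \delta)\} + n_{22}\log(1 - p_{1\cdot}).$$

Then the MLEs of $p_{1\cdot}$ and $\delta$ are simply the roots for $p_{1\cdot}$ and $\delta$ of the following two equations:

$$\frac{\partial \log L}{\partial p_{1\cdot}} = n_{11}\{1/p_{1\cdot} + 1/(p_{1\cdot} - \delta)\} + n_{12}\{1/p_{1\cdot} - 1/(1 - p_{1\cdot} + \delta)\} - n_{22}/(1 - p_{1\cdot}) = 0 \tag{A.1}$$

and

$$\frac{\partial \log L}{\partial \delta} = -n_{11}/(p_{1\cdot} - \delta) + n_{12}/(1 - p_{1\cdot} + \delta) = 0. \tag{A.2}$$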
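Equation (A.1), with $\delta$ held fixed at $\delta_0$, is also what determines the conditional MLE $\hat p_{1\cdot}(\delta_0)$ needed for the likelihood ratio interval (4). The Python sketch below shows one way the numerical procedure could be organised; it is an illustration rather than the author's implementation. The function names, the use of plain bisection for both the conditional MLE and the two roots, the tolerances, and the hard-coded $\chi^2$ critical value 3.841459 (95%, one degree of freedom) are all assumptions made for this example.

```python
import math


def score_p1(p1, d, n11, n12, n22):
    """Left-hand side of (A.1) with delta fixed at d."""
    return (n11 * (1 / p1 + 1 / (p1 - d))
            + n12 * (1 / p1 - 1 / (1 - p1 + d))
            - n22 / (1 - p1))


def conditional_mle(d, n11, n12, n22, tol=1e-12):
    """Root of (A.1) in p1. by bisection; the score decreases over the range."""
    lo, hi = max(0.0, d), min(1.0, 1.0 + d)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if score_p1(mid, d, n11, n12, n22) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def loglik(p1, d, n11, n12, n22):
    """Log-likelihood (1) without the constant C."""
    return (n11 * (math.log(p1) + math.log(p1 - d))
            + n12 * (math.log(p1) + math.log(1 - p1 + d))
            + n22 * math.log(1 - p1))


def lr_interval(n11, n12, n22, crit=3.841459, tol=1e-9):
    """Sketch of interval (4); None when any cell frequency is 0."""
    if min(n11, n12, n22) == 0:
        return None
    n = n11 + n12 + n22
    p1_hat = (n11 + n12) / n
    d_hat = p1_hat - (n11 / n) / p1_hat
    max_ll = loglik(p1_hat, d_hat, n11, n12, n22)

    def statistic(d0):
        p1_d0 = conditional_mle(d0, n11, n12, n22)
        return 2 * (max_ll - loglik(p1_d0, d0, n11, n12, n22))

    def solve(inside, outside):
        # `inside` stays in the acceptance region, `outside` stays out of it
        while abs(outside - inside) > tol:
            mid = (inside + outside) / 2
            if statistic(mid) <= crit:
                inside = mid
            else:
                outside = mid
        return inside

    return solve(d_hat, -1 + 1e-9), solve(d_hat, 1 - 1e-9)


limits = lr_interval(30, 63, 63)
print(tuple(round(x, 3) for x in limits))
```

Bisection is used here because the log-likelihood is a sum of logarithms of functions that are linear in $(p_{1\cdot}, \delta)$, so the likelihood ratio statistic rises monotonically on either side of $\hat\delta$. For the calf data of Section 5, the output should be close to the reported interval [0.148, 0.392].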

We can easily show that the MLEs are $\hat p_{1\cdot} = (n_{11} + n_{12})/n$ and $\hat\delta = \hat p_{1\cdot} - \hat p_{11}/\hat p_{1\cdot}$. Furthermore,

$$\frac{\partial^2 \log L}{\partial p_{1\cdot}^2} = -n_{11}\{1/p_{1\cdot}^2 + 1/(p_{1\cdot} - \delta)^2\} - n_{12}\{1/p_{1\cdot}^2 + 1/(1 - p_{1\cdot} + \delta)^2\} - n_{22}/(1 - p_{1\cdot})^2, \tag{A.3}$$

$$\frac{\partial^2 \log L}{\partial \delta^2} = -n_{11}/(p_{1\cdot} - \delta)^2 - n_{12}/(1 - p_{1\cdot} + \delta)^2, \tag{A.4}$$

$$\frac{\partial^2 \log L}{\partial p_{1\cdot}\,\partial \delta} = n_{11}/(p_{1\cdot} - \delta)^2 + n_{12}/(1 - p_{1\cdot} + \delta)^2. \tag{A.5}$$

Substituting the MLEs $\hat p_{1\cdot}$ and $\hat\delta$ for the corresponding parameters in (A.3)–(A.5) and inverting the observed information matrix, we obtain the estimate of the asymptotic variance of the MLE $\hat\delta$ to be $\{\hat p_{11}\hat p_{12}/\hat p_{1\cdot}^{3} + \hat p_{1\cdot}(1 - \hat p_{1\cdot})\}/n$.

Note that for a given fixed $\delta_0$ such that $-1 < \delta_0 < 1$, as $p_{1\cdot}$ increases from $\max\{0, \delta_0\}$ to $\min\{1, 1 + \delta_0\}$, the value of $\partial \log L/\partial p_{1\cdot}$ on the left-hand side of (A.1) decreases from $\infty$ to $-\infty$. Furthermore, the left-hand side of (A.1) is a continuous function over $\max\{0, \delta_0\} < p_{1\cdot} < \min\{1, 1 + \delta_0\}$. These facts imply that, for a given fixed $\delta_0$ with $-1 < \delta_0 < 1$, the conditional MLE $\hat p_{1\cdot}(\delta_0)$ of $p_{1\cdot}$ is simply the unique root for $p_{1\cdot}$ of (A.1), with $\delta$ replaced by $\delta_0$, falling in the range $\max\{0, \delta_0\} < p_{1\cdot} < \min\{1, 1 + \delta_0\}$.

References

Agresti, A., 1990: Categorical Data Analysis. Wiley, New York.
Anbar, D., 1983: On estimating the difference between two probabilities, with special reference to clinical trials. Biometrics 39, 257–262.
Anbar, D., 1984: Confidence bounds for the difference between two probabilities (reply to letter). Biometrics 40, 1176.
Anderson, T. W., 1958: An Introduction to Multivariate Statistical Analysis. Wiley, New York.
Beal, S. L., 1987: Asymptotic confidence intervals for the difference between two binomial parameters for use with small samples. Biometrics 43, 941–950.
Casella, G. and Berger, R. L., 1990: Statistical Inference. Duxbury, Belmont, California.
Hauck, W. W. and Anderson, S., 1986: A comparison of large sample confidence interval methods for the difference of two binomial probabilities. The American Statistician 40, 318–322.
Katz, D., Baptista, J., Azen, S. P., and Pike, M. C., 1978: Obtaining confidence intervals for the risk ratio in cohort studies. Biometrics 34, 469–474.
Lui, K.-J., 1995: Confidence intervals for the risk ratio in cohort studies under inverse sampling. Biometrical Journal 37, 965–971.
Lui, K.-J., 1996: Notes on confidence limits for the odds ratio in case-control studies under inverse sampling. Biometrical Journal 38, 221–229.
Lui, K.-J., 1998: Interval estimation of risk ratio between the secondary infection given the primary infection and the primary infection. Biometrics 54, 706–711.

Mee, R. W., 1984: Confidence bounds for the difference between two probabilities. Biometrics 40, 1175–1176.
Miettinen, O. and Nurminen, M., 1985: Comparative analysis of two rates. Statistics in Medicine 4, 213–226.
Santner, T. J. and Snell, M. K., 1980: Small-sample confidence intervals for p1 − p2 and p1/p2 in 2 × 2 contingency tables. Journal of the American Statistical Association 75, 386–394.
Thomas, D. G. and Gart, J. J., 1977: A table of exact confidence limits for differences and ratios of two proportions and their odds ratios. Journal of the American Statistical Association 72, 73–76.
SAS Institute, Inc., 1990: SAS Language, Version 6, 1st edition. Cary, North Carolina.
Wallenstein, S., 1997: A non-iterative accurate asymptotic confidence interval for the difference between two proportions. Statistics in Medicine 16, 1329–1336.

Received, November 1997
Revised, August 1999
Accepted, August 1999

Kung-Jong Lui
Department of Mathematical Sciences
College of Sciences
San Diego State University
5500 Campanile Drive
San Diego, CA 92182-7720
USA
E-mail: kjl@rohan.sdsu.edu