A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky

Empirical likelihood with right censored data was studied by Thomas and Grunkemeier (1975), Li (1995), Murphy (1995), Pan (1997) and many others. In this paper we investigate the behavior of two competing hazard formulations of the empirical likelihood: the Poisson and the binomial empirical likelihood. Simulation results show that the binomial empirical likelihood has a better chi-square approximation under the null hypothesis.

AMS 1991 Subject Classification: Primary 62G10; secondary 62G05.

Key Words and Phrases: Censored data, Weighted hazards, Wilks theorem, Chi-square approximation.

1. Introduction

The empirical likelihood (EL) method was first proposed by Thomas and Grunkemeier (1975). Owen (1988, 1990, 2001) developed it into a general methodology, using parameters that are essentially means of the underlying distribution. Pan (1997), Fang (2000) and Pan and Zhou (2002) advocated parameters that are weighted hazards. They showed, among other things, that with a hazard formulation of the empirical likelihood and a weighted-hazard parameter, the empirical likelihood ratio handles right censored data easily.

However, there are two competing versions of the hazard formulation of the empirical likelihood: the binomial and the Poisson version. Both are maximized by the well-known Nelson-Aalen estimator, but their maximum values and their maxima under constraints differ. In this paper we study how well the chi-square approximation to the empirical likelihood ratio holds for each version under the null hypothesis. Our simulations show that for discrete distributions the binomial EL ratio has a much better chi-square approximation under the null hypothesis. The difference is more pronounced for larger sample sizes, because the approximation for the Poisson EL does not improve with increasing sample size, whereas the approximation for the binomial EL does.
For continuous distributions both ELs have good chi-square approximations under the null hypothesis, with the Poisson version sometimes slightly ahead of the binomial version in small samples. As the sample size increases, both approximations are good.

2. Poisson and Binomial Empirical Likelihood

Suppose $X_1, \ldots, X_n$ are $n$ independent, identically distributed observations. Assume the distribution function of the $X_i$ is $F(t)$ and the cumulative hazard function of the $X_i$ is $\Lambda(t)$. With right censoring, we only observe

$$T_i = \min(X_i, C_i) \quad \text{and} \quad \delta_i = I_{[X_i \le C_i]}, \tag{1}$$

where the $C_i$ are the censoring times, assumed to be independent, identically distributed, and independent of the $X_i$.

As shown in Pan and Zhou (2002) and Fang (2000), computations are much easier with the empirical likelihood formulated in terms of the (cumulative) hazard function. The hazard formulation of the censored data log empirical likelihood, denoted by $\log EL(\Lambda)$, is

$$\log EL(\Lambda) = \sum_i \{ d_i \log v_i + (R_i - d_i) \log(1 - v_i) \}, \tag{2}$$

where the $t_i$ are the ordered, distinct values of the $T_i$, $d_i = \sum_{j=1}^{n} I_{[T_j = t_i]} \delta_j$, and $R_i = \sum_{j=1}^{n} I_{[T_j \ge t_i]}$. See, for example, Thomas and Grunkemeier (1975) and Li (1995) for similar notation and definitions. Here $0 < v_i \le 1$ are the discrete hazards at $t_i$. We shall call this version of the empirical likelihood the binomial likelihood, following Murphy (1995). The maximum of (2) with respect to the $v_i$ is known to be attained at the jumps of the Nelson-Aalen estimator: $v_i = d_i / R_i$.

Fang (2000) considered hypothesis testing and confidence intervals for a parameter $\theta$ defined through the cumulative hazard function,

$$\theta = \int g(t) \log(1 - d\Lambda(t)),$$

where $g(t)$ is a nonnegative function. Note that $\theta$ is a functional of the cumulative hazard function.

The constraint we impose on the hazards $v_i$ is: for a given function $g(\cdot)$ and constant $\mu$,

$$\sum_{i=1}^{N-1} g(t_i) \log(1 - v_i) = \mu, \tag{3}$$

where $N$ is the total number of distinct observed values. We need to exclude the last value because we always have $v_N = 1$ for discrete hazards.
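The quantities entering the binomial likelihood (2) can be illustrated with a short numerical sketch. The function names (`hazard_summaries`, `xlogy`, `binomial_log_el`) and the toy data are hypothetical and chosen only for illustration; the code simply evaluates $d_i$, $R_i$, the Nelson-Aalen jumps $d_i / R_i$, and the value of (2), using the convention $0 \cdot \log 0 = 0$.

```python
import numpy as np

def hazard_summaries(T, delta):
    """Distinct ordered times t_i, death counts d_i and risk-set sizes R_i."""
    T = np.asarray(T, dtype=float)
    delta = np.asarray(delta, dtype=float)
    t = np.unique(T)  # sorted distinct observed times
    d = np.array([delta[T == ti].sum() for ti in t])          # uncensored count at t_i
    R = np.array([(T >= ti).sum() for ti in t], dtype=float)  # number at risk at t_i
    return t, d, R

def xlogy(x, y):
    """x * log(y) elementwise, with the convention 0 * log(0) = 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(x)
    nz = x != 0
    out[nz] = x[nz] * np.log(y[nz])
    return out

def binomial_log_el(v, d, R):
    """Binomial log empirical likelihood (2) at discrete hazards v."""
    v = np.asarray(v, dtype=float)
    return float(np.sum(xlogy(d, v) + xlogy(R - d, 1.0 - v)))

# Toy right censored sample (illustrative only): one censored value tied at time 1.
T = [1, 1, 2, 3, 3, 4]
delta = [1, 0, 1, 1, 1, 1]
t, d, R = hazard_summaries(T, delta)
v_na = d / R  # jumps of the Nelson-Aalen estimator
print(t, d, R, v_na, binomial_log_el(v_na, d, R))
```

In this toy sample the last observation is uncensored, so the last hazard equals 1, which is why the constraint (3) leaves out the last distinct value.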

The EL ratio test statistic in terms of hazards is given by

$$W_2 = -2 \{ \log \max EL(\Lambda)\ (\text{with constraint (3)}) - \log \max EL(\Lambda)\ (\text{without constraint}) \}.$$

The following result is a version of the Wilks theorem for $W_2$ under some regularity conditions. For the proof see Fang (2000).

Theorem 1. Suppose $\mu = \int g(t) \log\{1 - d\Lambda(t)\}$. Then the test statistic $W_2$ has asymptotically a chi-square distribution with 1 degree of freedom.

Remark 1. The integral constraints were originally given as $\theta = \int g(t)\, d\log\{1 - F(t)\}$. The formulation above is obtained by using the suggestive notation $d\log\{1 - F(t)\} = \log\{1 - d\Lambda(t)\}$. The two formulations are identical for both continuous and discrete $F(t)$.

Parallel results with the Poisson likelihood function (defined below) and integral constraints were obtained by Pan and Zhou (2002). The Poisson definition of the empirical likelihood function is

$$\log EL_2 = \sum_{i=1}^{n} \Big\{ \delta_i \log \Delta\Lambda(T_i) - \sum_{j: T_j \le T_i} \Delta\Lambda(T_j) \Big\}, \tag{4}$$

where $\Delta\Lambda(t)$ denotes the jump of $\Lambda$ at $t$. This is called the Poisson extension of the empirical likelihood because it has the form of a likelihood of a sequence of conditional Poisson trials. Johansen (1983) showed that the Poisson extension corresponds to the probability distribution from a dynamical Poisson process. See also Murphy (1995).

It turns out that both empirical likelihood functions defined above are maximized by the jumps of the Nelson-Aalen (NA) estimator,

$$\Delta\hat\Lambda_{NA}(T_i) = \frac{\delta_i}{\sum_{j=1}^{n} I(T_j \ge T_i)}.$$

Pan and Zhou (2002) studied the limit of the Poisson EL ratio with a general parameter of the form

$$\int g(t)\, d\Lambda(t) = \sum_i g(T_i)\, \Delta\Lambda(T_i) = \theta.$$

They also obtained a chi-square limit for the $-2 \log$ likelihood ratio under mild regularity conditions, similar to Theorem 1 above.

3. Comparison of the EL Ratios of Binomial and Poisson Type

We first compare the performance of the chi-square approximation for the two types of EL ratios under a discrete distribution. Tied observations arise naturally in biomedical research and other types of studies. Monte Carlo simulations with discrete distributions can closely mimic this type of data.

To compare how the Poisson and binomial empirical likelihood ratios deal with tied observations, simulations with a discrete exponential distribution were conducted. The discrete exponential random variable was generated by rounding a standard exponential random variable to the second digit after the decimal point; all values greater than 3 were set to 3. An alternative approach, breaking the tied observations by adding a small value, was also applied to the Poisson empirical likelihood ratio. For example, observations $2, 2, 2$ are replaced by $2, 2 + \epsilon, 2 + 2\epsilon$, and so on. The constraints

$$\sum_i I_{[t_i \le t]}\, \Delta\Lambda(t_i) = \theta \tag{5}$$

and

$$\sum_i I_{[t_i \le t]} \log[1 - \Delta\Lambda(t_i)] = \theta \tag{6}$$

were used for the Poisson and binomial empirical likelihood, respectively, where $t$ is a known value.

Setting $t$ equal to 1.2 in both constraints (5) and (6), a Monte Carlo simulation of 1000 runs was conducted for all three approaches (Poisson, Poisson with tie-breaking, and binomial) using the same samples. Figures 1 and 2 are QQ-plots of the Poisson empirical likelihood ratios against the $\chi^2_{(1)}$ percentiles. Figure 2 shows the result after artificially breaking the tied observations in the sample. The approximation is poor for both Poisson empirical likelihood ratio approaches, with tie-breaking slightly better. Figure 3 shows the results using constraint (6) and the binomial likelihood. The approximation to $\chi^2_{(1)}$ is very good compared to the Poisson EL ratio. Thus the binomial empirical likelihood ratio is superior to the Poisson empirical likelihood ratio when the underlying distribution of the survival time is discrete.

Next, we compare the two empirical likelihoods when the underlying distribution of the survival time is continuous.
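As an illustration of the discrete-data experiment, the following sketch generates the rounded and capped exponential sample, breaks ties, and evaluates a Poisson EL ratio for an indicator constraint of the form (5). Several pieces are assumptions rather than details taken from the paper: the tie-breaking size `eps`, the function names, and the constrained maximizer $w_i = d_i / (R_i + \lambda g_i)$, which is obtained here from the stationarity condition of a Lagrangian and solved for $\lambda$ by bisection. The censoring mechanism and the null value of $\theta$ used in the simulations are not specified in the text, so neither is reproduced.

```python
import numpy as np

def discrete_exponential(n, rng):
    """Standard exponential draws rounded to 2 decimals, with values above 3 set to 3."""
    return np.minimum(np.round(rng.exponential(1.0, size=n), 2), 3.0)

def break_ties(x, eps=1e-6):
    """Replace tied values a, a, a by a, a + eps, a + 2*eps (eps is an assumed size)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    xs = x[order]
    idx = np.arange(len(xs))
    new_run = np.r_[True, xs[1:] != xs[:-1]]          # True where a new value starts
    run_start = np.maximum.accumulate(np.where(new_run, idx, 0))
    out = np.empty_like(xs)
    out[order] = xs + (idx - run_start) * eps         # k-th repeat gets + k*eps
    return out

def poisson_el_ratio(T, delta, t0, theta):
    """-2 log Poisson EL ratio under sum_{t_i <= t0} w_i = theta (sketch)."""
    T = np.asarray(T, dtype=float)
    delta = np.asarray(delta, dtype=float)
    t = np.unique(T)
    d = np.array([delta[T == ti].sum() for ti in t])          # deaths at t_i
    R = np.array([(T >= ti).sum() for ti in t], dtype=float)  # at risk at t_i
    g = (t <= t0).astype(float)                               # indicator weight
    act = (g > 0) & (d > 0)
    if theta <= 0 or not act.any():
        raise ValueError("need theta > 0 and at least one death at or before t0")

    def excess(lam):
        # constraint value minus theta; g equals 1 on the active set
        return np.sum(d[act] / (R[act] + lam)) - theta

    lo = -R[act].min()            # pole: excess tends to +infinity just right of it
    step = 1.0
    while excess(lo + step) > 0:  # expand until the root is bracketed
        step *= 2.0
    hi = lo + step
    for _ in range(200):          # bisection; excess is decreasing in lam
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)

    pos = d > 0
    def log_el(w):
        # Poisson log EL on distinct times: sum d_i log w_i - sum R_i w_i
        return np.sum(d[pos] * np.log(w[pos])) - np.sum(R * w)

    w_hat = d / R                                   # Nelson-Aalen jumps
    w_con = np.zeros_like(d)
    w_con[pos] = d[pos] / (R[pos] + lam * g[pos])   # constrained jumps
    return float(-2.0 * (log_el(w_con) - log_el(w_hat)))
```

A call such as `poisson_el_ratio(break_ties(x), np.ones(len(x)), 1.2, theta0)` would give the tie-broken statistic for an uncensored sample `x`, where `theta0` stands for whatever null value of $\theta$ is being tested.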
Using the standard exponential as the survival time distribution, the simulation was conducted with $t$ equal to 0.67 or 2.3 in constraints (5) and (6). The survival function of the standard exponential at these two $t$ values is approximately 0.5 and 0.1, respectively. The results in Figure 4 indicate that the Poisson extension is comparable to the binomial extension when $1 - F(t) = 0.5$, and better than the binomial extension when $1 - F(t) = 0.1$. However, these differences diminish as the sample size $n$ increases. Many more simulations were run; we report only representative ones here.

In conclusion, the EL ratio from the Poisson likelihood is suitable only for survival times from continuous distributions; it cannot handle tied observations when the distribution is discrete. The Poisson likelihood approach gives a very good approximation when the underlying distribution is continuous. For continuous distributions, the approximation from the binomial extension is good when $1 - F(t) = 0.5$. When $1 - F(t) = 0.1$, a larger sample size is needed to reach a good approximation. We think this is due to censoring in the tails. In other words, when there is censoring and the parameter considered is sensitive to the tail behavior, the Poisson likelihood may have a better chi-square approximation.

Figure 1: 1000 simulations with constraint (5) and Poisson empirical likelihood ratios. Panels: n = 30, n = 80, n = 150, n = 500.

Figure 2: 1000 simulations with constraint (5) and Poisson empirical likelihood ratios. Tied observations in the sample were artificially broken by adding a small value. Panels: n = 30, n = 80, n = 150, n = 500.

Figure 3: 1000 simulations with constraint (6) and binomial empirical likelihood ratios. Panels: n = 30, n = 80, n = 150, n = 500.

Figure 4: 3000 simulations with standard exponential distribution. (a) Constraint (5) with $1 - F(t) = 0.5$ and Poisson extension, n = 20. (b) Constraint (6) with $1 - F(t) = 0.5$ and binomial extension, n = 20. (c) Constraint (5) with $1 - F(t) = 0.1$ and Poisson extension, n = 30. (d) Constraint (6) with $1 - F(t) = 0.1$ and binomial extension, n = 30.
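The QQ-plot comparison against the chi-square distribution with 1 degree of freedom can be sketched as follows. This is a minimal illustration, not the code behind the figures: the function names, the plotting positions $(i - 0.5)/n$, and the 5% level are assumptions. It uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its $p$-quantile is the square of the normal $(1 + p)/2$ quantile.

```python
import numpy as np
from statistics import NormalDist

def chi2_1_quantile(p):
    """p-quantile of chi-square(1): the squared standard normal (1 + p)/2 quantile."""
    return NormalDist().inv_cdf((1.0 + p) / 2.0) ** 2

def qq_points(w):
    """Pairs (theoretical chi-square(1) quantile, sorted statistic) for a QQ-plot."""
    w = np.sort(np.asarray(w, dtype=float))
    n = len(w)
    probs = (np.arange(1, n + 1) - 0.5) / n   # plotting positions (assumed choice)
    theo = np.array([chi2_1_quantile(p) for p in probs])
    return theo, w

def rejection_rate(w, level=0.05):
    """Share of simulated statistics above the chi-square(1) critical value."""
    crit = chi2_1_quantile(1.0 - level)
    return float(np.mean(np.asarray(w, dtype=float) > crit))
```

Given an array `w` of simulated EL ratio statistics, points of `qq_points(w)` lying close to the 45-degree line, or a `rejection_rate(w)` close to the nominal level, would indicate a good chi-square approximation.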

4. Concluding Remarks

The two competing empirical likelihood definitions lead to very different behavior of the log likelihood ratio. Practitioners should be aware of this and choose the appropriate definition for the data at hand, so that confidence intervals and p-values in testing are more accurate.

References

Andersen, P. K., Borgan, O., Gill, R. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York.

Efron, B. (1967). The Two Sample Problem With Censored Data. Proc. Fifth Berkeley Symp. Math. Statist. Probab. 4, 831-883.

Fang, H. (2000). Binomial Empirical Likelihood with Discrete Censored Data. Ph.D. Dissertation, Department of Statistics, University of Kentucky.

Gill, R. (1989). Non- and Semi-parametric Maximum Likelihood Estimators and the von Mises Method (I). Scand. J. Statist. 16, 97-128.

Johansen, S. (1983). An Extension of Cox's Regression Model. International Statistical Review 51, 258-262.

Kaplan, E. and Meier, P. (1958). Non-parametric Estimator From Incomplete Observations. J. Amer. Statist. Assoc. 53, 457-481.

Li, G. (1995). On Nonparametric Likelihood Ratio Estimation of Survival Probabilities for Censored Data. Statistics and Probability Letters 25, 95-104.

Murphy, S. A. (1995). Likelihood Ratio Based Confidence Intervals in Survival Analysis. Journal of the American Statistical Association 90, 1399-1405.

Murphy, S. and van der Vaart, A. (1997). Semiparametric Likelihood Ratio Inference. Ann. Statist. 25, 1471-1509.

Owen, A. (1988). Empirical Likelihood Ratio Confidence Intervals for a Single Functional. Biometrika 75, 237-249.

Owen, A. B. (1991). Empirical Likelihood for Linear Models. The Annals of Statistics 19, 1725-1747.

Owen, A. (2001). Empirical Likelihood. Chapman & Hall, London.

Pan, X. (1997). Empirical Likelihood Ratio Method for Censored Data. Ph.D. Dissertation, Department of Statistics, University of Kentucky.

Pan, X. R. and Zhou, M. (2002). Empirical Likelihood in Terms of Cumulative Hazard Function for Censored Data. J. Multivariate Analysis 80 (1), 166-188.

Thomas, D. R. and Grunkemeier, G. L. (1975). Confidence Interval Estimation of Survival Probabilities for Censored Data. Journal of the American Statistical Association 70, 865-871.

Department of Statistics
University of Kentucky
Lexington, KY 40506-0027