MODELING AND INFERENCE FOR AN ORDINAL EFFECT SIZE MEASURE


MODELING AND INFERENCE FOR AN ORDINAL EFFECT SIZE MEASURE

By

EUIJUNG RYU

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2007

© 2007 Euijung Ryu

To my parents and Dr. Alan Agresti

ACKNOWLEDGMENTS

First of all, I express my deepest gratitude to Dr. Alan Agresti for serving as my dissertation advisor and offering endless support and continuous encouragement. I also thank Drs. Ronald Randles, Michael Daniels, Babette Brumback, and James Algina for serving on my committee. I am very grateful to my parents for their constant love and confidence in my success. Without their support, I would not be the person I am now. I also thank my sisters and brother for their endless emotional support.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
   An Ordinal Effect Size Measure
   Properties of the Measure
   Area Under Receiver Operating Characteristic Curve
   Mann-Whitney Statistic and its Variance Expression
   Continuous Case
   Categorical Case
   Existing Methods to Find Confidence Intervals
   Halperin, Hamdy, and Thall (HHT) Confidence Interval
   Newcombe's Score Confidence Interval
   Outline of Dissertation

2 CONFIDENCE INTERVALS UNDER AN UNRESTRICTED MODEL
   Basic Introduction of Four Confidence Intervals
   Wald Confidence Interval
   Likelihood Ratio Test (LRT)-based Confidence Interval
   Score Confidence Interval
   Pseudo Score-type Confidence Interval
   Wald Confidence Interval for θ
   Wald Confidence Interval based on the Logit Transformation
   Comparison with Newcombe's Wald Confidence Interval
   Restricted ML Estimates
   Multinomial-Poisson Homogeneous Models
   Algorithms to Find the Restricted ML Estimates
   Likelihood Ratio Test-based Confidence Interval
   Score Confidence Interval
   Pseudo Score-type Confidence Interval

3 CONFIDENCE INTERVALS UNDER A PARAMETRIC MODEL
   Wald Confidence Intervals
   Restricted ML Estimates
   Confidence Intervals based on Restricted ML Estimates
   Example

4 SIMULATION STUDIES
   Factorial Design of Conditions
   Evaluation Criteria for Confidence Intervals
   Coverage Probability
   Expected Length
   Overall Summaries
   Comparison of Methods

5 CONFIDENCE INTERVALS FOR MATCHED-PAIRS DATA
   Two Ordinal Effect Size Measures
   Wald Confidence Intervals
   Restricted ML Estimation
   LRT-based Confidence Interval
   Score Confidence Interval
   Pseudo Score-type Confidence Interval
   Example: Data Analysis
   Simulation Study

6 CONFIDENCE INTERVALS FOR FULLY-RANKED DATA
   Performance of the Methods
   Connections with an Effect Size Measure for Normal Distributions

7 MODELING THE ORDINAL EFFECT SIZE MEASURE WITH EXPLANATORY VARIABLES
   Fixed-Effects Modelling with Categorical Covariates
   Maximum Likelihood Estimation
   Score Confidence Interval
   Goodness-of-fit Tests
   Example: Data Analysis
   Simpson's Paradox
   Fixed-Effects Modelling with Continuous Covariates
   Cumulative Logit Models with Continuous Covariates
   Example: Data Analysis
   Random-Effects Modelling
   Cumulative Logit Models with Random Effects
   Estimation and Prediction
   Example: Data Analysis

8 SUMMARY AND FUTURE RESEARCH
   Summary
   Future Research

APPENDIX

A PROOFS

B R CODES
   B.1 R Codes to Calculate Score Confidence Interval for a 2 × c Table
   B.2 R Codes to Calculate Logit Wald Confidence Interval for Matched-Pairs Data

C OPINION ABOUT SURROGATE MOTHERHOOD AND THE LIKELIHOOD OF SELLING A KIDNEY

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

1-1 Shoulder tip pain scores after laparoscopic surgery
Two sets of cell probabilities with θ = 0.5
Frequencies of two categorical variables with c categories
Restricted ML estimates under H_0: θ = …
Non-zero counts summarized in a 2 × 3 table
Restricted ML estimates of cell probabilities in Table …
Confidence intervals for θ in Table 1-1 with and without assuming a cumulative logit model
Cell probabilities for different conditions
With sample sizes 100 each, coverage probabilities (CP) and overall summaries from the simulation study for cases in which the cumulative logit model holds (CL) or does not hold (Not CL)
With sample sizes 50 each, coverage probabilities (CP) and overall summaries from the simulation study for cases in which the cumulative logit model holds (CL) or does not hold (Not CL)
With sample sizes (50, 100), coverage probabilities (CP) and overall summaries from the simulation study for cases in which the cumulative logit model holds (CL) or does not hold (Not CL)
With sample sizes (10, 100), coverage probabilities (CP) and overall summaries from the simulation study for cases in which the cumulative logit model holds (CL) or does not hold (Not CL)
With sample sizes (10, 50), coverage probabilities (CP) and overall summaries from the simulation study for cases in which the cumulative logit model holds (CL) or does not hold (Not CL)
With sample sizes (10, 10), coverage probabilities (CP) and overall summaries from the simulation study for cases in which the cumulative logit model holds (CL) or does not hold (Not CL)
Mean coverage probabilities (CP) for twelve methods, averaged over sample sizes (100, 100) or (50, 50), c = 3 and 6, θ = 0.5 and 0.8
Overall performance summaries of coverage probability (CP) from the simulation study for twelve methods, averaged over several sample sizes, c = 3 and 6, θ = 0.5 and 0.8, and whether or not a cumulative logit model holds
5-1 Matched-pairs data with c categories
Joint cell probabilities of the matched-pairs data
2 × c table with marginal row totals and column totals from Table …
Opinion about premarital and extramarital sex
95% confidence intervals for θ_MP
Joint cell probabilities with c = 6, θ = 0.5, and ρ = …
Overall performances for matched-pairs data over all conditions considered
Coverage probabilities for matched-pairs data with c = 6 and sample size …
Coverage probabilities for matched-pairs data with c = 3 and sample size …
Coverage probabilities for matched-pairs data with c = 6 and sample size …
Coverage probabilities for matched-pairs data with c = 3 and sample size …
Coverage probabilities for matched-pairs data with c = 6 and sample size …
Coverage probabilities for matched-pairs data with c = 3 and sample size …
Fully-ranked data with c = …
Coverage probabilities (CP) and overall summaries for fully-ranked data with sample sizes (10, 10) and (20, 30)
Relationship between … and θ
Shoulder tip pain scores stratified by gender and age
ML estimates of θ_k and β_j parameters, with their 95% score confidence intervals
Opinion about surrogate motherhood and the likelihood of selling a kidney at age …
Point estimates and confidence intervals of θ(x) as age (x) varies, under the main-effects model
Clinical trial relating treatment to response for eight centers
ML estimates of parameters and their standard errors
EB estimates of θ's and their standard errors of prediction
C-1 Opinion about surrogate motherhood and the likelihood of selling a kidney, with age between 18 and …
C-2 Opinion about surrogate motherhood and the likelihood of selling a kidney, with age between 36 and …
C-3 Opinion about surrogate motherhood and the likelihood of selling a kidney, with age between 56 and …
C-4 Opinion about surrogate motherhood and the likelihood of selling a kidney, with age between 76 and …

LIST OF FIGURES

1-1 1 − specificity = P(Y1 > k) and sensitivity = P(Y2 > k), k = 1, 2, 3, and 4
Plot of asymptotic efficiency of the Mann-Whitney estimate relative to the parametric estimate

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

MODELING AND INFERENCE FOR AN ORDINAL EFFECT SIZE MEASURE

By Euijung Ryu

August 2007

Chair: Dr. Alan Agresti
Major: Statistics

An ordinal measure of effect size is a simple and useful way to describe the difference between two ordered categorical distributions. This measure summarizes the probability that an outcome from one distribution falls above an outcome from the other, adjusted for ties. The ordinal effect size measure is simple to interpret and has connections with a commonly used effect size measure for normal distributions. We develop and compare confidence interval methods for the measure. Simulation studies show that with independent multinomial samples, confidence intervals based on inverting the score test and a pseudo-score test perform well. This score method also seems to work well with fully-ranked data, but for dependent samples a simple Wald interval on the logit scale can be better with small samples. We also explore how the ordinal effect size measure relates to an effect measure commonly used for normal distributions, and we consider a logit model for describing how it depends on categorical explanatory variables. The methods are illustrated for several studies that compare two groups, including a study comparing two treatments for shoulder tip pain.

CHAPTER 1
INTRODUCTION

When we are interested in comparing two groups, it is useful to know not only whether or not they have a statistically significant difference, but also the effect size, a measure that describes the difference between the two groups. For instance, consider the situation in which a score is sampled from one distribution and a score is sampled from another distribution. If we consider a continuous outcome having a distribution such as the normal, then there are two kinds of measures to describe the effect size: an absolute measure and a relative measure. An example of an absolute measure is the difference in the two group means, and an example of a relative measure is the standardized difference, obtained by dividing the difference of the two means by the pooled standard deviation, which is referred to as Cohen's d (Cohen 1992). If the outcomes are binary, the difference of proportions is used as an absolute measure, and the relative risk and the odds ratio are used as relative measures.

A common interest in many research areas is to compare two groups when a measurement is on an ordered categorical scale. For instance, we use a study (Lumley 1996) that compared an active treatment with a control treatment for patients having shoulder tip pain after laparoscopic surgery (Table 1-1).

Table 1-1. Shoulder tip pain scores after laparoscopic surgery. Rows: Active and Control treatments; columns: frequencies of pain scores 1 (low) through 5 (high).

The two treatments were randomly assigned to 41 patients. The patients rated their pain level on a scale from 1 (low) to 5 (high) on the fifth day after the surgery. The responses are ordered categorical and can be summarized by frequencies in a 2 × 5 contingency table. The absolute distances between the categories are unknown. When a patient's pain level is high, it is clear that the patient has more pain than a patient who rates his/her pain level lower. But it is unclear how to assign a numerical

value for how much more pain that patient has. To describe the two treatment effects, therefore, we need a summary measure that uses the relative size of outcomes instead of their actual magnitudes.

When responses are ordinal, a common method is to assign scores to the categories and compare the means. The scores assign distances between categories and treat the measurement scale as interval. If a modeling approach such as the cumulative logit model with a proportional odds structure is used, then the group effect is described by an odds ratio summary based on cumulative probabilities (Agresti 2002). In this dissertation we use an alternative measure that treats the response as ordinal, is simpler to interpret for an audience not familiar with odds ratios, and has connections with an effect size measure for normal distributions.

1.1 An Ordinal Effect Size Measure

Suppose Y1 and Y2 are independent random variables that have at least an ordinal measurement scale and have the same support. An effect size measure that uses the relative size of outcomes rather than assuming actual magnitudes is

θ = P(Y1 < Y2) + (1/2) P(Y1 = Y2),

which is linearly related to P(Y1 < Y2) − P(Y1 > Y2) and Somers' D (Somers 1962), and monotonically related to P(Y1 < Y2)/P(Y1 > Y2) (see Agresti (1981) and Newcombe (2006a)). If Y1 and Y2 have continuous distributions, then θ is simply P(Y1 < Y2), since P(Y1 = Y2) = 0. On the other hand, if Y1 and Y2 are categorical, then ties occur with positive probability, so P(Y1 = Y2) must be accounted for in comparing the two populations.

From the form of θ, we see that it takes values between 0 and 1, and the larger θ is, the greater the probability that a random variable Y2 will be larger than a random variable Y1. The random variable Y1 tends to take smaller (larger) values than the random variable Y2 if θ > 0.5 (θ < 0.5), and the two random variables are called tendentiously equal if θ = 0.5 (see Brunner and Munzel, 2000).

This measure was introduced by Klotz (1966) to test the hypothesis of equality of the distributions of Y1 and Y2 against alternatives in which one sample is stochastically larger than the other. The hypothesis θ = 0.5, which includes equality of the two distributions, has been called the generalized or nonparametric Behrens-Fisher problem (see Brunner and Munzel, 2000). Vargha and Delaney (1998) called θ a measure of stochastic superiority of variable Y2 over variable Y1. The measure is also termed the common language effect size (McGraw and Wong, 1992).

1.1.1 Properties of the Measure

Suppose that Y1 and Y2 are ordinal random variables having c categories labelled 1, 2, ..., c, from least to greatest in degree. Let π_i = P(Y1 = i) and λ_j = P(Y2 = j). Denoting π = (π_1, π_2, ..., π_c)^T and λ = (λ_1, λ_2, ..., λ_c)^T, we get

θ = P(Y1 < Y2) + (1/2) P(Y1 = Y2)
  = Σ_{a=1}^{c−1} P(Y2 > a) P(Y1 = a) + (1/2) Σ_{a=1}^{c} P(Y2 = a) P(Y1 = a)
  = Σ_{a=1}^{c−1} Σ_{b>a} π_a λ_b + (1/2) Σ_{a=1}^{c} π_a λ_a
  = λ^T A π,                                                        (1-1)

where A is the c × c matrix with entries A_{ba} = 1 if b > a, A_{ba} = 1/2 if b = a, and A_{ba} = 0 if b < a.
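As a small numerical illustration (the probability vectors below are hypothetical, not data from this dissertation), the matrix form λ^T A π can be checked in R against the sum form in (1-1):

    pi.vec  <- c(0.2, 0.3, 0.3, 0.1, 0.1)      # hypothetical P(Y1 = a), a = 1, ..., 5
    lam.vec <- c(0.1, 0.1, 0.3, 0.3, 0.2)      # hypothetical P(Y2 = b), b = 1, ..., 5
    cc <- length(pi.vec)
    A <- matrix(0, cc, cc)
    A[lower.tri(A)] <- 1                       # A[b, a] = 1 when b > a
    diag(A) <- 0.5                             # A[a, a] = 1/2
    theta.matrix <- as.numeric(t(lam.vec) %*% A %*% pi.vec)
    theta.sums   <- sum(pi.vec * (1 - cumsum(lam.vec))) + 0.5 * sum(pi.vec * lam.vec)
    c(theta.matrix, theta.sums)                # both equal 0.685 here

Both computations give P(Y1 < Y2) + (1/2) P(Y1 = Y2), so they agree.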

This measure has the following properties. Properties 1.3 through 1.5 also hold when Y1 and Y2 are continuous. Proofs for the categorical cases are in the appendix, except for Properties 1.1, 1.5 and 1.6, whose proofs are obvious.

Property 1.1. If c = 2, then θ = 0.5(1 + π_1 − λ_1). In other words, θ is simply a linear function of the difference of proportions, π_1 − λ_1.

Property 1.2. If the categories are reversed from (1, 2, ..., c−1, c) to (c, c−1, ..., 2, 1), or if the rows are interchanged, then θ changes to 1 − θ.

Property 1.3. If Y1 and Y2 are identically distributed, or if they have symmetric distributions with support over all c categories, then θ = 0.5.

Property 1.4. If Y2 is stochastically larger than Y1, then θ > 0.5.

Property 1.5. If there is no overlap between Y1 and Y2, in the sense that P(Y1 > Y2) = 1 or P(Y1 < Y2) = 1, then θ is either 0 or 1. The converse is also true.

Property 1.6. If both Y1 and Y2 are degenerate at the same given value, then θ = 0.5.

From Property 1.3, two identically distributed variables give θ = 0.5. However, many distributions that are not identical can also give θ = 0.5 if neither variable tends to take larger values (Troendle 2002); this includes symmetric distributions. For instance, the cell probabilities in Table 1-2 give θ = 0.5.

Table 1-2. Two sets of cell probabilities with θ = 0.5
  Identical distributions:  Y1: (π_1, π_2, π_3, π_4, π_5);  Y2: (π_1, π_2, π_3, π_4, π_5)
  Symmetric distributions:  Y1: (π_1, π_2, π_3, π_2, π_1);  Y2: (λ_1, λ_2, λ_3, λ_2, λ_1)

1.1.2 Area Under Receiver Operating Characteristic Curve

The ordinal effect size measure θ is equivalent to the area under the receiver operating characteristic (ROC) curve (Bamber 1975). The ROC curve is a tool to assess the ability of a diagnostic test to discriminate between two groups, such as a diseased group and a non-diseased group. Let Y1 and Y2 represent observations from the non-diseased group and the diseased group, respectively. If they are discrete with values k, k = 1, 2, ..., c, and observations with values of Y1 or Y2 above category k are classified as diseased, then P(Y1 ≤ k) is called the specificity and P(Y2 > k) the sensitivity. The ROC curve is a graph in which a point is plotted with its horizontal coordinate 1 − specificity and as

its vertical coordinate sensitivity, for each point k, including the point (1, 1). In other words, the ROC curve has points {(P(Y1 > c), P(Y2 > c)), (P(Y1 > c−1), P(Y2 > c−1)), ..., (P(Y1 > 1), P(Y2 > 1)), (1, 1)}, and it has straight-line segments between adjacent points. See Figure 1-1 for an example. There, Y1 corresponds to a distribution of ratings of CT images for a group with normal disease status, and Y2 corresponds to a distribution of ratings of CT images for a group with abnormal disease status. Bamber (1975) showed that the area under the ROC curve, which is the sum of the areas of all trapezoids under the ROC graph, is equal to P(Y1 < Y2) + (1/2) P(Y1 = Y2) = θ.

Figure 1-1. 1 − specificity = P(Y1 > k) and sensitivity = P(Y2 > k), k = 1, 2, 3, and 4.
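Bamber's result can be verified numerically; the following rough R sketch uses hypothetical cell probabilities (not data from this dissertation) and compares the trapezoidal area under the ROC points with θ:

    pi.vec  <- c(0.2, 0.3, 0.3, 0.1, 0.1)             # non-diseased group, P(Y1 = k)
    lam.vec <- c(0.1, 0.1, 0.3, 0.3, 0.2)             # diseased group, P(Y2 = k)
    x <- rev(c(1, 1 - cumsum(pi.vec)))                # P(Y1 > k), ordered from (0, 0) to (1, 1)
    y <- rev(c(1, 1 - cumsum(lam.vec)))               # P(Y2 > k), same ordering
    auc <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)   # trapezoidal area
    theta <- sum(pi.vec * (1 - cumsum(lam.vec))) + 0.5 * sum(pi.vec * lam.vec)
    all.equal(auc, theta)                             # TRUE (both equal 0.685 here)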

1.2 Mann-Whitney Statistic and its Variance Expression

Let Y_{11}, ..., Y_{1n_1} denote a random sample of size n_1 from population 1 with cumulative distribution function F, and let Y_{21}, ..., Y_{2n_2} denote an independent random sample of size n_2 from population 2 with cumulative distribution function G. To find a point estimate of θ and its variance, we distinguish the case in which both F and G are continuous from the case in which they are categorical.

1.2.1 Continuous Case

Suppose the cdfs F and G are continuous. Mann and Whitney (1947) proposed the following statistic to test the null hypothesis F = G against the alternative that F is stochastically smaller than G. If we define

φ(x, y) = 1 if x < y, and 0 otherwise,

then the Mann-Whitney U-statistic is

U = Σ_{i=1}^{n_1} Σ_{j=1}^{n_2} φ(Y_{1i}, Y_{2j}).

A relatively large value of U provides evidence against the null hypothesis. The U-statistic is the sum of correlated, identically distributed Bernoulli random variables, and the related statistic U/(n_1 n_2) is a generalized U-statistic. The expectation of U is equal to n_1 n_2 θ, where θ = P(Y1 < Y2). The variance of U is given by

V_c = n_1 n_2 [θ(1 − θ) + (n_2 − 1)(Q_1 − θ²) + (n_1 − 1)(Q_2 − θ²)],        (1-2)

where

Q_1 = P[Y_{1i} < Y_{2j} and Y_{1i} < Y_{2k}], j ≠ k,
Q_2 = P[Y_{1i} < Y_{2j} and Y_{1k} < Y_{2j}], i ≠ k.

Here, Y_{1i}, Y_{1k}, Y_{2j}, and Y_{2k} are independently distributed. The derivation of V_c can be found in Mann and Whitney (1947) and in Lehmann (1975).
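For the continuous case, a short R sketch (the function name is ours, not code from Appendix B) computes U, θ̂ = U/(n_1 n_2), and plug-in estimates of Q_1, Q_2, and V_c for two samples without ties:

    theta.cont <- function(y1, y2) {
      n1 <- length(y1); n2 <- length(y2)
      less  <- outer(y1, y2, "<")                    # indicator of Y1i < Y2j
      U     <- sum(less)
      theta <- U / (n1 * n2)
      Q1 <- mean(rowMeans(less)^2)                   # plug-in est. of P(Y1i < Y2j, Y1i < Y2k)
      Q2 <- mean(colMeans(less)^2)                   # plug-in est. of P(Y1i < Y2j, Y1k < Y2j)
      Vc <- n1 * n2 * (theta * (1 - theta) + (n2 - 1) * (Q1 - theta^2) +
                       (n1 - 1) * (Q2 - theta^2))
      list(U = U, theta = theta, Q1 = Q1, Q2 = Q2, Vc = Vc)
    }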

1.2.2 Categorical Case

Now suppose that Y1 and Y2 have ordered categories with cell probabilities π and λ. Table 1-3 shows the corresponding contingency table with frequencies {n_{ij}}. As mentioned in Section 1.1, ties occur with positive probability since the variables are categorical. In this case, the Mann-Whitney statistic, which was designed for data without ties, must be modified.

Table 1-3. Frequencies of two categorical variables with c categories
         1       2       ...    c−1        c
  Y1   n_{11}  n_{12}    ...   n_{1(c−1)}  n_{1c}
  Y2   n_{21}  n_{22}    ...   n_{2(c−1)}  n_{2c}

The modified Mann-Whitney U-statistic is

U = Σ_{i=1}^{n_1} Σ_{j=1}^{n_2} φ(Y_{1i}, Y_{2j}),   where   φ(x, y) = 1 if x < y, 1/2 if x = y, and 0 if x > y.

Kruskal (1957) proposed a similar form of the statistic accounting for ties, and Klotz (1966) gave the U-statistic form. The U-statistic can also be expressed in terms of the cell counts:

U = Σ_{a=1}^{c−1} Σ_{b>a} n_{1a} n_{2b} + (1/2) Σ_{a=1}^{c} n_{1a} n_{2a}.
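As a minimal sketch (this is not the dissertation's Appendix B code), the tie-corrected U and θ̂ = U/(n_1 n_2) can be computed from a 2 × c table of counts as follows:

    theta.hat.2xc <- function(tab) {                 # tab: 2 x c matrix of counts
      n1 <- sum(tab[1, ]); n2 <- sum(tab[2, ])
      U <- 0
      for (a in seq_len(ncol(tab))) {
        U <- U + tab[1, a] * (sum(tab[2, -(1:a)]) + 0.5 * tab[2, a])
      }
      c(U = U, theta.hat = U / (n1 * n2))
    }
    # e.g., theta.hat.2xc(rbind(c(5, 3, 2, 1, 0), c(1, 2, 3, 3, 2))) for made-up counts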

The variance Var(U), which is stated in Halperin, Hamdy, and Thall (1989) without derivation, is

V_d = n_1 n_2 [ θ − (n_1 + n_2 − 1)θ² + (n_2 − 1)C + (n_1 − 1)D − (1/4) Σ_{i=1}^{c} π_i λ_i ],        (1-3)

where

C = Σ_{i=1}^{c−1} π_i ( Σ_{j=i+1}^{c} λ_j + λ_i/2 )² + π_c λ_c² / 4,   and
D = Σ_{j=2}^{c} λ_j ( Σ_{i=1}^{j−1} π_i + π_j/2 )² + π_1² λ_1 / 4.

This can be shown as follows. Let Y_1, Y_1', Y_2, and Y_2' be independently distributed, with Y_1' distributed as Y_1 and Y_2' distributed as Y_2. From Noether (1967), the variance of U is

Var(U) = (n_1 n_2 / 4) [ p⁺_12 + p⁻_12 + (n_2 − 1)(p⁺_122 + p⁻_122 − p^±_122) + (n_1 − 1)(p⁺_112 + p⁻_112 − p^±_112) − (n_1 + n_2 − 1)(2θ − 1)² ],

where

p⁺_12 = P(Y_1 < Y_2),   p⁻_12 = P(Y_1 > Y_2),
p⁺_122 = P(Y_1 < Y_2, Y_2'),   p⁻_122 = P(Y_2, Y_2' < Y_1),   p^±_122 = 2 P(Y_2 < Y_1 < Y_2'),
p⁺_112 = P(Y_1, Y_1' < Y_2),   p⁻_112 = P(Y_1, Y_1' > Y_2),   p^±_112 = 2 P(Y_1 < Y_2 < Y_1').

Treating the counts in each row as having a multinomial distribution, these components can be expressed explicitly in terms of the cell probabilities. Denoting the cell probabilities for the first row by (π_1, ..., π_c) and for the second row by (λ_1, ..., λ_c), we have

p⁺_12 + p⁻_12 = 1 − Σ_{i=1}^{c} π_i λ_i,
p⁺_122 = Σ_{i=1}^{c} π_i ( Σ_{j>i} λ_j )²,   p⁻_122 = Σ_{i=1}^{c} π_i ( Σ_{j<i} λ_j )²,   p^±_122 = 2 Σ_{i=1}^{c} π_i ( Σ_{j<i} λ_j )( Σ_{j>i} λ_j ),
p⁺_112 = Σ_{i=1}^{c} λ_i ( Σ_{j<i} π_j )²,   p⁻_112 = Σ_{i=1}^{c} λ_i ( Σ_{j>i} π_j )²,   p^±_112 = 2 Σ_{i=1}^{c} λ_i ( Σ_{j<i} π_j )( Σ_{j>i} π_j ).

Since 1 = Σ_{j<i} λ_j + λ_i + Σ_{j>i} λ_j for each i,

p⁺_122 + p⁻_122 − p^±_122 = Σ_i π_i [ ( Σ_{j>i} λ_j )² + ( Σ_{j<i} λ_j )² − 2( Σ_{j>i} λ_j )( Σ_{j<i} λ_j ) ]
                          = Σ_i π_i [ 2 Σ_{j>i} λ_j + λ_i − 1 ]²
                          = 4 [ Σ_{i=1}^{c} π_i ( Σ_{j=i+1}^{c} λ_j + λ_i/2 )² − θ ] + 1.

Using a similar argument, we have

p⁺_112 + p⁻_112 − p^±_112 = 4 [ Σ_{i=1}^{c} λ_i ( Σ_{j=1}^{i−1} π_j + π_i/2 )² − θ ] + 1.

Substituting these into the variance form, we get the formula in (1-3).

1.3 Existing Methods to Find Confidence Intervals

Our focus in this research is partly to propose and evaluate confidence intervals for θ, primarily for the case in which a 2 × c contingency table is under consideration. If Y1 and Y2 have ordered categories, a 2 × c table is given as in Subsection 1.2.2. On the other hand, if Y1 and Y2 have continuous distributions as in Subsection 1.2.1, with sample sizes n_1 and n_2 respectively, then the data must be ranked to construct the corresponding 2 × c table, where c = n_1 + n_2. All c scores from the combined data are arranged in order of magnitude. A rank of 1 is assigned to the lowest score, a rank of 2 to the second lowest score, and so on, with a rank of c assigned to the highest score. If a score in the group of size n_1 has rank j, j = 1, ..., c, then the 2 × c table has a 1 in cell (1, j), and 0 otherwise. Similarly, if a score in the group of size n_2 has rank k, k = 1, ..., c, then the table has a 1 in cell (2, k), and 0 otherwise. Assuming there are no ties, the sum of the counts in each column is equal to 1. This type of data is referred to as fully-ranked data.

When c = 2, which relates to the difference of proportions, there are many possibilities for confidence intervals, including eleven methods evaluated by Newcombe (1998).

For general c, only a few methods exist. Unlike the case c = 2, the situation is much more complicated, because θ is no longer a linear function of the difference of cell probabilities. Basically, there are two types of confidence intervals when c > 2, depending on whether or not we assume underlying continuous distributions. However, both types use U/(n_1 n_2) as an estimator of θ. As we saw in Section 1.2, the variance of U in the categorical case, V_d, is different from that obtained when an underlying continuous distribution is assumed, V_c. Furthermore, V_c has different forms depending on which distribution is used as the underlying continuous distribution. So, we can get several different confidence intervals depending on this assumption.

1.3.1 Halperin, Hamdy, and Thall (HHT) Confidence Interval

Hochberg (1981) proposed confidence interval methods for P(Y1 < Y2) − P(Y1 > Y2) using U-statistics and the delta method for ordered categorical data. Since P(Y1 < Y2) − P(Y1 > Y2) is equal to 2θ − 1, there is an equivalent relation between P(Y1 < Y2) − P(Y1 > Y2) and θ. As a result, Hochberg's methods can be used to find confidence intervals for θ.

Halperin, Hamdy, and Thall (1989) provided a distribution-free confidence interval for θ, based on the pivotal quantity Z²_HHT = (θ̂ − θ)² / V̂_HHT, where V̂_HHT uses estimates of some parameters but is an explicit function of θ. Based on a simulation study, Halperin, Hamdy, and Thall (1989) reported that their approach is as good as or better than Hochberg's (1981) U-statistic-based method, and is especially better for extreme values of θ, i.e., values relatively far from 0.5, in terms of deviation from the nominal coverage probability. Below, we discuss the Halperin, Hamdy, and Thall (1989) method.

The main part of the Halperin, Hamdy, and Thall (1989) method is finding V̂_HHT. The idea came from Halperin, Gilbert, and Lachin (1987), who used a pivotal quantity to obtain a distribution-free confidence interval for P(Y1 < Y2). They divide the variance into two parts: the first part is an explicit function of θ, and for the second part

they find a lower bound and an upper bound. After suitable estimators are substituted for the remaining parameters, V̂_HHT is an explicit function of θ. Using this variance in the pivotal quantity, instead of substituting θ̂ for θ, they find a confidence interval for θ by solving the corresponding quadratic equation in θ. Details are as follows.

From equation (1-3), the variance of θ̂ is

V_d / (n_1 n_2)² = (1 / (n_1 n_2)) [ θ − (n_1 + n_2 − 1)θ² + (n_2 − 1)C + (n_1 − 1)D − (1/4) Σ_{i=1}^{c} π_i λ_i ].

To use the argument of Halperin et al. (1987), they divide the variance of θ̂ into two parts. Aside from n_1 and n_2, one of them should be a function of θ alone. Let

V_1 = θ − (n_1 + n_2 − 1)θ²   and   V_2 = (n_2 − 1)C + (n_1 − 1)D − (1/4) Σ_{i=1}^{c} π_i λ_i,        (1-4)

so that V_d / (n_1 n_2) = V_1 + V_2, and clearly V_1 is a function of θ alone. Note that V_2 is bounded above and below by explicit functions of θ, since θ² ≤ C, D ≤ θ. That is, L ≤ V_2 ≤ U, where L = (n_1 + n_2 − 2)θ² − 1/4 and U = (n_1 + n_2 − 2)θ. Thus, there exists a ρ between 0 and 1 such that

V_2 = ρL + (1 − ρ)U = (n_1 + n_2 − 2) [ θ − ρθ(1 − θ) − ρ / (4(n_1 + n_2 − 2)) ].        (1-5)

Assuming min{n_1, n_2} → ∞, the term (1/4) Σ_{i=1}^{c} π_i λ_i in equation (1-4) is o(min{n_1, n_2}), and ρ / (4(n_1 + n_2 − 2)) in equation (1-5) is o(1), since ρ is bounded. Since V_1 and V_2 are o(n_1 n_2), both of these terms have lower order than the other terms in equations (1-4) and (1-5). Combining equations (1-4) and (1-5) after ignoring these terms and solving for ρ gives

ρ = [ (n_1 + n_2 − 2)θ − (n_2 − 1)C − (n_1 − 1)D ] / [ (n_1 + n_2 − 2)θ(1 − θ) ].

The next step is to find a consistent estimator, ρ̂, of ρ, so that V̂_2 = (n_1 + n_2 − 2)[θ − ρ̂θ(1 − θ)], which is an explicit function of θ, aside from ρ̂, n_1, and n_2. Letting Ĉ and D̂ be

the ML (plug-in) estimators of C and D, Halperin et al. (1989) used unbiased estimators C̃ and D̃ of C and D and a corresponding unbiased estimator of θ(1 − θ), each obtained from the plug-in quantities by a bias correction of order 1/(n_2 − 1) or 1/(n_1 − 1); the explicit expressions are given in Halperin et al. (1989). Then ρ̂ is obtained by plugging θ̂, C̃, D̃, and the estimator of θ(1 − θ) into the expression for ρ, and ρ̂ is consistent because these estimators are all consistent. If ρ̂ < 0, define ρ̂ = 0, and likewise define ρ̂ = 1 if ρ̂ > 1. It is also possible that θ̂ = C̃ = D̃ = 0, in which case ρ̂ is indeterminate; in that case, define ρ̂ = 0. Hence,

V̂_HHT = (1 / (n_1 n_2)) [ V_1 + V̂_2 ] = (1 / (n_1 n_2)) [ n_1 + n_2 − 1 − (n_1 + n_2 − 2)ρ̂ ] θ(1 − θ).

A confidence interval based on the pivotal quantity (θ̂ − θ)² / V̂_HHT is the set of θ satisfying

n_1 n_2 (θ̂ − θ)² / { [ n_1 + n_2 − 1 − (n_1 + n_2 − 2)ρ̂ ] θ(1 − θ) } ≤ z²_{α/2},

where z_{α/2} denotes the (1 − α/2) quantile of the standard normal distribution. From this inequality, the 100(1 − α)% confidence limits for θ are given explicitly by

{ K + 2θ̂ ± [ K² + 4K θ̂(1 − θ̂) ]^{1/2} } / { 2(K + 1) },   where   K = [ n_1 + n_2 − 1 − (n_1 + n_2 − 2)ρ̂ ] z²_{α/2} / (n_1 n_2).
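Given θ̂ and ρ̂, the closed-form HHT limits are easy to compute; here is a hedged R sketch (the function name is ours; obtaining ρ̂ itself requires the bias-corrected estimators described above):

    hht.ci <- function(theta.hat, rho.hat, n1, n2, conf = 0.95) {
      z <- qnorm(1 - (1 - conf) / 2)
      K <- (n1 + n2 - 1 - (n1 + n2 - 2) * rho.hat) * z^2 / (n1 * n2)
      half <- sqrt(K^2 + 4 * K * theta.hat * (1 - theta.hat))
      c(lower = (K + 2 * theta.hat - half) / (2 * (K + 1)),
        upper = (K + 2 * theta.hat + half) / (2 * (K + 1)))
    }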

This method is robust to distributional assumptions. However, it has the drawback of ignoring any information that might be available about the underlying distribution. Another problem is that some lower-order terms in V_2 were ignored in obtaining the estimated variance V̂_HHT. Although this gives a simple form for V̂_HHT, the method may not perform well when the sample size is small, because those lower-order terms were dropped.

1.3.2 Newcombe's Score Confidence Interval

Recently, Newcombe (2006b) compared eight asymptotic confidence interval methods, including the Hanley and McNeil (1982) distribution-free method and a simplified method assuming exponential distributions. The Halperin, Hamdy, and Thall (1989) method discussed in Subsection 1.3.1 assumed that the data have ordinal categories. Unlike the Halperin et al. (1989) method, Newcombe's (2006b) methods are based on the assumption that Y1 and Y2 have continuous distributions, in which case ties occur with probability 0. So he used equation (1-2) for the variance of U, while Halperin et al. (1989) used equation (1-3). Newcombe (2006b) mentioned that these methods are applicable to both continuous and ordinal categorical cases.

He used the Hanley and McNeil (1982) simplified method to find a score-type confidence interval. Newcombe (2006b) preferred this method because it is simple to implement and he found it to be best on certain evaluation criteria. Therefore, in this section we focus only on this method, without discussing all the methods he considered.

Recall that a nonparametric estimator of θ is θ̂ = U/(n_1 n_2). Assuming both Y1 and Y2 have continuous distributions, equation (1-2) gives the standard error of θ̂,

s.e.(θ̂) = sqrt{ [ θ(1 − θ) + (n_2 − 1)(Q_1 − θ²) + (n_1 − 1)(Q_2 − θ²) ] / (n_1 n_2) }.

Hanley and McNeil (1982) gave two methods to estimate Q_1 and Q_2. In the first method, they did not assume any distributions for Y1 and Y2 and found distribution-free estimators of Q_1 and Q_2. In the second method, they simplified the two parameters by assuming particular distributions for Y1 and Y2 and then estimated them using θ̂. So, the second method is a distribution-based approach.

We now discuss their second method. First note that s.e.(θ̂) involves three parameters: θ, Q_1, and Q_2. Although Q_1 and Q_2 depend on the underlying distribution we assume,

Hanley and McNeil (1982) demonstrated that s.e.(θ̂) changes only slightly as the underlying distribution changes, for any fixed θ. One advantage of using an underlying exponential distribution is that Q_1 and Q_2 can be expressed as simple functions of θ. Suppose Y1 ~ Exponential(λ_1) and Y2 ~ Exponential(λ_2). Direct calculation gives

θ = λ_1 / (λ_1 + λ_2),   Q_1 = λ_1 / (λ_1 + 2λ_2),   and   Q_2 = (λ_1 − λ_2)/(λ_1 + λ_2) + λ_2/(2λ_1 + λ_2).

Thus, we get the following simple forms for Q_1 and Q_2:

Q_1 = θ / (2 − θ)   and   Q_2 = 2θ² / (1 + θ).

To find a score-type confidence interval that applies under the exponential model, Newcombe (2006b) substituted these into the form of the standard error of θ̂ and solved the equation

|θ̂ − θ| = z_{α/2} sqrt{ θ(1 − θ) [ 1 + (n_2 − 1)(1 − θ)/(2 − θ) + (n_1 − 1)θ/(1 + θ) ] / (n_1 n_2) }.        (1-6)

There is no closed-form solution of this equation, but after squaring both sides it becomes a quartic equation in θ, so it can be solved by iterative methods such as the Newton-Raphson algorithm.
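Equation (1-6) can also be inverted numerically; a rough R sketch (the function name is ours, not Newcombe's code, and it assumes 0 < θ̂ < 1) is:

    newcombe.score.ci <- function(theta.hat, n1, n2, conf = 0.95) {
      z <- qnorm(1 - (1 - conf) / 2)
      v <- function(th)                              # variance under the exponential model
        th * (1 - th) * (1 + (n2 - 1) * (1 - th) / (2 - th) +
                             (n1 - 1) * th / (1 + th)) / (n1 * n2)
      f <- function(th) (theta.hat - th)^2 - z^2 * v(th)
      eps <- 1e-8
      lower <- if (f(eps) < 0) 0 else uniroot(f, c(eps, theta.hat))$root
      upper <- if (f(1 - eps) < 0) 1 else uniroot(f, c(theta.hat, 1 - eps))$root
      c(lower = lower, upper = upper)
    }

Newcombe's equivariant modification, described below, replaces n1 and n2 inside v() with (n1 + n2)/2.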

Blyth and Still (1983) noted that an equivariance property is desirable for a binomial confidence interval. The meaning of equivariance in their paper is as follows. Suppose that Y, with a Binomial(n, p) distribution, gives a confidence interval (L, U) for p, and consider the transformations Y → n − Y and p → 1 − p. The confidence interval is said to have the equivariance property if the resulting confidence interval for 1 − p equals (1 − U, 1 − L). For ordinal categorical data, the concept of equivariance is related to reversing the categories from (1, 2, ..., c) to (c, ..., 2, 1). From Property 1.2 in Section 1.1, the measure for the reversed categories, θ* say, equals 1 − θ. Because of the asymmetry of the exponential distributions, the confidence interval from equation (1-6) does not possess the equivariance property. Newcombe (2006b) modified the interval to satisfy equivariance, suggesting the use of (n_1 + n_2)/2 in place of both n_1 and n_2 in the numerator of equation (1-6).

Since θ* = 1 − θ, we have θ̂* = 1 − θ̂, and the above equation gives

|θ̂* − θ*| = z_{α/2} sqrt{ θ*(1 − θ*) [ 1 + (n_1 − 1)(1 − θ*)/(2 − θ*) + (n_2 − 1)θ*/(1 + θ*) ] / (n_1 n_2) }.        (1-7)

To have the equivariance property, the right-hand sides of equations (1-6) and (1-7) should be the same. This holds if n_1 = n_2, but the sample sizes are not equal in general. Even when they are not equal in the data, the property holds if the same value is used in place of n_1 and n_2 in the numerator; for instance, substituting the midpoint (n_1 + n_2)/2 for both. However, it should be noted that this confidence interval is valid only if n_1 and n_2 grow at the same rate. Newcombe (2006b) mentioned that the method may perform adequately under misspecification of the true distributions, but its validity is questionable, since it is based on a strong assumption about the distributions that may not hold.

1.4 Outline of Dissertation

In the previous section, we reviewed existing methods for finding confidence intervals for θ. A basic assumption of Newcombe's methods is that the data come from continuous distributions, although he used cell counts to estimate parameters in the variance formulas; he did not give any justification for why his methods can be used for ordinal categorical data. On the other hand, the confidence interval proposed by Halperin et al. (1989) is designed for ordinal categorical data and is distribution-free.

In my PhD research, I investigate other methods for finding confidence intervals for θ. Using a likelihood function, which depends on the model structure we use, we consider several asymptotic confidence intervals: Wald confidence intervals, the likelihood ratio test (LRT)-based confidence interval, the score confidence interval, and a pseudo score-type confidence interval. These confidence intervals are obtained by inverting the corresponding test statistics for H_0: θ = θ_0. Since θ can be expressed in terms of the parameters in the model, the null hypothesis H_0: θ = θ_0 gives a constraint function of the parameters.

When there are given likelihood and constraint functions, Aitchison and Silvey (1958, 1960) developed a method for finding the ML estimates of the parameters, that is, the values that maximize the likelihood function under the constraint. In this dissertation, we use several different model structures, and so we have several different likelihood functions and different constraint functions. We use the Aitchison and Silvey (1958, 1960) methods to find asymptotic confidence intervals. The following chapters of this dissertation provide the details for developing the five confidence intervals under different model structures.

In Chapters 2 and 3, we consider a single 2 × c table, assuming there are no explanatory variables. When we have a 2 × c contingency table, it is natural to assume that the counts in each row have a multinomial distribution and that the two distributions are independent. In Chapter 2, we develop the confidence intervals under an unrestricted model in which we do not assume any relationship between the probabilities in the first and second rows. Under this assumption, we can use a likelihood function that is a function of the 2(c − 1) nonredundant cell probabilities. Then θ is expressed in terms of those cell probabilities, and so the constraint function corresponding to the null hypothesis is also a function of these probabilities.

Based on the Aitchison and Silvey (1958) methods, Lang (2004) gave a unified theory of ML inference in contingency tables with a constraint that is sufficiently smooth and homogeneous. A definition of homogeneity is given in Lang (2004). Although he expressed the likelihood and the constraint in terms of expected cell counts rather than cell probabilities, we can use his result to find the restricted ML estimates of the cell probabilities after showing that our situation is a special case of Lang's. We find the LRT-based confidence interval, the score confidence interval, and a pseudo score-type confidence interval using these restricted ML estimates. For Wald confidence intervals, we use the actual variance instead of an asymptotic form of the variance.

The common assumption for a 2 × c table in this dissertation is that the two rows have ordinal categories. To utilize the ordinality within a model, in Chapter 3 we use a cumulative logit model (see Agresti, 2002, p. 274), which is the most popular model for ordinal responses. We refer to this as the parametric model, to distinguish it from the unrestricted model used in Chapter 2. This model applies to all 2(c − 1) cumulative probabilities, and it assumes the same group effect for each cumulative probability. Wald confidence intervals can be obtained using standard software, which provides the ML estimates of the parameters; to find the confidence interval, one then substitutes the estimates into the expressions for θ and its variance. For the LRT-based, score, and pseudo score-type confidence intervals, we again use the Aitchison and Silvey (1958, 1960) methods to find restricted ML estimates of the parameters in the cumulative logit model under the null hypothesis θ = θ_0. We can use either a Newton-Raphson algorithm or Lang's algorithm.

Chapter 4 presents a simulation study comparing the proposed confidence intervals for θ in a 2 × c table with existing methods. For this purpose, we generate data under several conditions by varying the sample sizes, the number of columns, the true θ value, and whether the parametric model holds. As evaluation criteria, we use the coverage probability and three overall summaries.

In Chapters 5 and 6, we consider two data structures in which the samples are not independent: matched-pairs data (Chapter 5) and fully-ranked data (Chapter 6). For matched-pairs data, we deal with data that are summarized in the form of a c × c contingency table instead of a 2 × c table, which produces a dependency between the two samples. To compare the two samples, we can use the parameter θ, which is the same as the one used in the previous sections, except that the marginal row totals and column totals are used. For fully-ranked data, a 2 × c table is considered, but each column refers to a single observation. For both chapters, we discuss how to use

the confidence interval methods discussed in the previous chapters, and we evaluate the performance of the methods.

Chapters 2 and 3 focus on finding confidence intervals for θ in a single 2 × c table when there are no explanatory variables. In Chapter 7, we propose modelling θ when explanatory variables exist. If they exist, we must consider at least two related contingency tables simultaneously. We consider only the case in which there are explanatory variables for an unrestricted model. The parameter θ in each 2 × c table is expressed in terms of unrestricted cell probabilities, and the vector of θ's (or of their logits) from the tables is modelled by a linear form in the effect parameters of the explanatory variables. In this case, the likelihood function can be expressed as a function of the cell probabilities, and the corresponding constraint function can be expressed in terms of the cell probabilities as well as the effect parameters. Lang (2005) developed a theory for a so-called homogeneous linear predictor (HLP) model. We use his result after showing that our model is a special case of the HLP model. To estimate the θ's and the effect parameters, we use the score confidence intervals, since that method performs well in the simulation studies discussed in Chapter 4.

CHAPTER 2
CONFIDENCE INTERVALS UNDER AN UNRESTRICTED MODEL

Assume that Y1 and Y2 yield independent random samples of sizes n_1 and n_2 from multinomial distributions with cell probabilities π = (π_1, π_2, ..., π_c)^T and λ = (λ_1, λ_2, ..., λ_c)^T, giving the frequencies {n_{ij}}. Since the samples are independent, the log-likelihood is

l(π, λ) = y_1^T log(π) + y_2^T log(λ),

where y_1 = (n_{11}, ..., n_{1c})^T and y_2 = (n_{21}, ..., n_{2c})^T. We are interested in finding good confidence intervals for θ = λ^T A π. Note that we have 2(c − 1) unknown parameters from the two multinomial distributions, since π_c and λ_c are determined once the first (c − 1) cell probabilities in each row are known.

Intuitively, we can imagine that there might be some relationship between π and λ, so it would be natural to consider a model for it. But we never know the true model from the data. To avoid the risk of a misspecified model, in this chapter we simply use an unrestricted (saturated) model.

2.1 Basic Introduction of Four Confidence Intervals

In this section, we introduce four standard confidence intervals (CIs): the original Wald CI, the likelihood ratio test-based CI, the score CI, and a pseudo score-type CI. These four CIs use a likelihood function and result from inverting the corresponding hypothesis tests. To introduce each hypothesis test in generic form, consider a null hypothesis H_0: θ = θ_0.

2.1.1 Wald Confidence Interval

Let θ̂ be the ML estimate of θ. Under certain regularity conditions, θ̂ has a limiting normal distribution with mean θ_0 and variance var(θ̂). Let v̂ar(θ̂) denote the estimated variance. Then

(θ̂ − θ_0) / sqrt{ v̂ar(θ̂) } →d N(0, 1).

The 100(1 − α)% Wald-type confidence interval for θ is the set of θ_0 for which

|θ̂ − θ_0| / sqrt{ v̂ar(θ̂) } < z_{α/2}.

Thus, θ̂ ± z_{α/2} sqrt{ v̂ar(θ̂) } is the confidence interval.

2.1.2 Likelihood Ratio Test (LRT)-based Confidence Interval

Let l_0 denote the maximized value of the log likelihood over the parameter space corresponding to the null hypothesis H_0: θ = θ_0, and let l_1 denote the maximized value of the log likelihood over the entire parameter space. The 100(1 − α)% likelihood ratio test-based confidence interval is the set of θ_0 for which

G²(θ_0) = −2(l_0 − l_1) < χ²_{(1−α),1},

where χ²_{(1−α),1} = z²_{α/2} is the 100(1 − α) percentile of a χ² distribution with one degree of freedom.

2.1.3 Score Confidence Interval

Let u(θ) be the score function of θ and i(θ) the expected Fisher information; that is, u(θ) = ∂l(θ)/∂θ and i(θ) = −E[∂²l(θ)/∂θ²]. Then the score statistic for testing H_0: θ = θ_0 is

S²(θ_0) = [u(θ_0)]² / i(θ_0),

where u(θ_0) and i(θ_0) are the score function and the expected Fisher information evaluated at θ_0. The 100(1 − α)% score confidence interval is the set of θ_0 satisfying S²(θ_0) < χ²_{(1−α),1}.

2.1.4 Pseudo Score-type Confidence Interval

Let var_0(θ̂) denote the variance of θ̂ evaluated at the parameter values under H_0: θ = θ_0. Using this null variance instead of the non-null variance (which was used in the Wald test statistic), a

pseudo score-type test statistic is

|θ̂ − θ_0| / sqrt{ var_0(θ̂) }.

Then a 100(1 − α)% pseudo score-type confidence interval is the set of θ_0 for which

|θ̂ − θ_0| / sqrt{ var_0(θ̂) } < z_{α/2},

which typically needs to be evaluated using numerical methods.

2.2 Wald Confidence Interval for θ

Recall that the kernel of the log likelihood is

l(π, λ) = Σ_a n_{1a} log(π_a) + Σ_b n_{2b} log(λ_b).

Define Ω = {vec(π, λ) : 1^T π = 1, 1^T λ = 1}, which is a subset of [0, 1]^{2c}. Then we have the following lemmas, whose proofs are in the appendix.

Lemma 2.1. Ω is a convex set.

Lemma 2.2. l(π, λ) is a strictly concave function.

These lemmas imply that there exists a unique global maximum of l(π, λ) on the domain Ω. Taking the partial derivative with respect to each parameter, we get the following maximum likelihood estimators (MLEs) of π and λ:

π̂_a = n_{1a} / n_1   and   λ̂_a = n_{2a} / n_2,   a = 1, ..., c.

Since λ^T A π is a continuous function of (π, λ), by the invariance property of MLEs we have

θ̂ = λ̂^T A π̂ = (1 / (n_1 n_2)) [ Σ_{a=1}^{c−1} Σ_{b>a} n_{1a} n_{2b} + (1/2) Σ_{a=1}^{c} n_{1a} n_{2a} ] = U / (n_1 n_2).

Suppose that min{n_1, n_2} goes to infinity and that n_1/n → ε, 0 < ε < 1, where n = n_1 + n_2. By the weak law of large numbers,

π̂ →p π   and   λ̂ →p λ.

Thus, by the continuous mapping theorem, θ̂ →p θ as n → ∞. The next step is to find the asymptotic distribution of θ̂. This can be done either by using the delta method or by using Hoeffding's (1948) decomposition for a U-statistic. We show the following result in the appendix.

Theorem 2.1. Provided that 0 < θ < 1,

√n (θ̂ − θ) →d N(0, ε⁻¹(C − θ²) + (1 − ε)⁻¹(D − θ²)).

Substituting the ML estimates into the variance of θ̂, V_d / (n_1 n_2)², we have

V̂_θ̂ = (1 / (n_1 n_2)) [ (n_2 − 1)(Ĉ − θ̂²) + (n_1 − 1)(D̂ − θ̂²) + θ̂(1 − θ̂) − (1/4) Σ_i π̂_i λ̂_i ].

Again, since Ĉ →p C and D̂ →p D,

n V̂_θ̂ →p ε⁻¹(C − θ²) + (1 − ε)⁻¹(D − θ²).

Therefore, we have the following theorem.

Theorem 2.2. Provided that 0 < θ < 1,

(θ̂ − θ) / sqrt{ V̂_θ̂ } →d N(0, 1).

Let ŝ.e.(θ̂) denote the square root of V̂_θ̂. Then the 100(1 − α)% Wald confidence interval for θ, obtained by inverting the Wald test statistic for H_0: θ = θ_0, is

θ̂ ± z_{α/2} ŝ.e.(θ̂).
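This Wald interval is straightforward to compute; the following is a brief R sketch (the function name is ours, not the Appendix B code) returning θ̂, the plug-in variance V̂_θ̂, and the interval for a 2 × c table of counts:

    wald.ci.theta <- function(tab, conf = 0.95) {
      n1 <- sum(tab[1, ]); n2 <- sum(tab[2, ])
      pi.hat  <- tab[1, ] / n1
      lam.hat <- tab[2, ] / n2
      surv.lam <- rev(cumsum(rev(lam.hat))) - lam.hat       # estimated P(Y2 > a)
      cum.pi   <- cumsum(pi.hat) - pi.hat                   # estimated P(Y1 < b)
      theta.hat <- sum(pi.hat * (surv.lam + lam.hat / 2))
      C.hat <- sum(pi.hat * (surv.lam + lam.hat / 2)^2)
      D.hat <- sum(lam.hat * (cum.pi + pi.hat / 2)^2)
      V.hat <- ((n2 - 1) * (C.hat - theta.hat^2) + (n1 - 1) * (D.hat - theta.hat^2) +
                theta.hat * (1 - theta.hat) - 0.25 * sum(pi.hat * lam.hat)) / (n1 * n2)
      z <- qnorm(1 - (1 - conf) / 2)
      c(theta.hat = theta.hat, V.hat = V.hat,
        lower = theta.hat - z * sqrt(V.hat), upper = theta.hat + z * sqrt(V.hat))
    }

The logit interval of the next subsection applies qlogis() to θ̂, uses the delta-method standard error sqrt(V̂_θ̂)/(θ̂(1 − θ̂)), and back-transforms the endpoints with plogis().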

This confidence interval has some problems. Recall that θ² ≤ C, D ≤ θ. This implies that if θ is either 0 or 1, then both C and D are either 0 or 1. From Property 1.5, we know that π_i λ_i = 0 for i = 1, ..., c, if θ is either 0 or 1. Hence, a degenerate confidence interval results if θ̂ equals either 0 or 1, or if θ̂ = 0.50 with all observations falling in a single column. In near-extreme cases, the distribution of θ̂ is usually highly skewed, and the lower bound or the upper bound of this interval may fall outside [0, 1]. Along with these problems, Wald intervals generally perform poorly for parameters based on proportions. For example, Brown, Cai and DasGupta (2001) showed that the Wald confidence interval for a binomial proportion has chaotic coverage probabilities.

2.2.1 Wald Confidence Interval based on the Logit Transformation

As mentioned before, the Wald confidence interval for θ can include values that are below 0 or above 1. If this happens, one choice is to truncate the confidence interval so that it is restricted to the parameter space [0, 1]. A more promising Wald approach constructs the interval for a transformation of θ, such as logit(θ), and then inverts it back to the θ scale. From the delta method, the Wald confidence interval for logit(θ) is

logit(θ̂) ± z_{α/2} sqrt{ V̂_θ̂ } / [ θ̂(1 − θ̂) ].

Its bounds (LB, UB) induce the interval (exp(LB)/(1 + exp(LB)), exp(UB)/(1 + exp(UB))) for θ. If θ̂ is either 0 or 1, we take the interval to be [0, 1], which is unappealing compared to the intervals obtained with the following methods.

2.2.2 Comparison with Newcombe's Wald Confidence Interval

Newcombe (2006b) discussed eight confidence interval methods for θ, one of which is a Wald confidence interval. Since we also developed a Wald confidence interval in this section, it is natural to compare the two Wald confidence intervals. Recall that the estimator of θ is

θ̂ = (1 / (n_1 n_2)) Σ_i Σ_j [ I(Y_{1i} < Y_{2j}) + 0.5 I(Y_{1i} = Y_{2j}) ].

Both methods use this as a point estimator of θ. The biggest difference between the two methods is that Newcombe (2006b) assumes continuous distributions, while ordinal


More information

HANDBOOK OF APPLICABLE MATHEMATICS

HANDBOOK OF APPLICABLE MATHEMATICS HANDBOOK OF APPLICABLE MATHEMATICS Chief Editor: Walter Ledermann Volume VI: Statistics PART A Edited by Emlyn Lloyd University of Lancaster A Wiley-Interscience Publication JOHN WILEY & SONS Chichester

More information

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS Duration - 3 hours Aids Allowed: Calculator LAST NAME: FIRST NAME: STUDENT NUMBER: There are 27 pages

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Cohen s s Kappa and Log-linear Models

Cohen s s Kappa and Log-linear Models Cohen s s Kappa and Log-linear Models HRP 261 03/03/03 10-11 11 am 1. Cohen s Kappa Actual agreement = sum of the proportions found on the diagonals. π ii Cohen: Compare the actual agreement with the chance

More information

Testing Independence

Testing Independence Testing Independence Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM 1/50 Testing Independence Previously, we looked at RR = OR = 1

More information

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples ST3241 Categorical Data Analysis I Generalized Linear Models Introduction and Some Examples 1 Introduction We have discussed methods for analyzing associations in two-way and three-way tables. Now we will

More information

Generalized Linear Models 1

Generalized Linear Models 1 Generalized Linear Models 1 STA 2101/442: Fall 2012 1 See last slide for copyright information. 1 / 24 Suggested Reading: Davison s Statistical models Exponential families of distributions Sec. 5.2 Chapter

More information

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links Communications of the Korean Statistical Society 2009, Vol 16, No 4, 697 705 Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links Kwang Mo Jeong a, Hyun Yung Lee 1, a a Department

More information

Greene, Econometric Analysis (6th ed, 2008)

Greene, Econometric Analysis (6th ed, 2008) EC771: Econometrics, Spring 2010 Greene, Econometric Analysis (6th ed, 2008) Chapter 17: Maximum Likelihood Estimation The preferred estimator in a wide variety of econometric settings is that derived

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

An Overview of Methods in the Analysis of Dependent Ordered Categorical Data: Assumptions and Implications

An Overview of Methods in the Analysis of Dependent Ordered Categorical Data: Assumptions and Implications WORKING PAPER SERIES WORKING PAPER NO 7, 2008 Swedish Business School at Örebro An Overview of Methods in the Analysis of Dependent Ordered Categorical Data: Assumptions and Implications By Hans Högberg

More information

3 Joint Distributions 71

3 Joint Distributions 71 2.2.3 The Normal Distribution 54 2.2.4 The Beta Density 58 2.3 Functions of a Random Variable 58 2.4 Concluding Remarks 64 2.5 Problems 64 3 Joint Distributions 71 3.1 Introduction 71 3.2 Discrete Random

More information

Lecture 8: Summary Measures

Lecture 8: Summary Measures Lecture 8: Summary Measures Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture 8:

More information

simple if it completely specifies the density of x

simple if it completely specifies the density of x 3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely

More information

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses ST3241 Categorical Data Analysis I Multicategory Logit Models Logit Models For Nominal Responses 1 Models For Nominal Responses Y is nominal with J categories. Let {π 1,, π J } denote the response probabilities

More information

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 Introduction to Generalized Univariate Models: Models for Binary Outcomes EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 EPSY 905: Intro to Generalized In This Lecture A short review

More information

E509A: Principle of Biostatistics. (Week 11(2): Introduction to non-parametric. methods ) GY Zou.

E509A: Principle of Biostatistics. (Week 11(2): Introduction to non-parametric. methods ) GY Zou. E509A: Principle of Biostatistics (Week 11(2): Introduction to non-parametric methods ) GY Zou gzou@robarts.ca Sign test for two dependent samples Ex 12.1 subj 1 2 3 4 5 6 7 8 9 10 baseline 166 135 189

More information

PROD. TYPE: COM. Simple improved condence intervals for comparing matched proportions. Alan Agresti ; and Yongyi Min UNCORRECTED PROOF

PROD. TYPE: COM. Simple improved condence intervals for comparing matched proportions. Alan Agresti ; and Yongyi Min UNCORRECTED PROOF pp: --2 (col.fig.: Nil) STATISTICS IN MEDICINE Statist. Med. 2004; 2:000 000 (DOI: 0.002/sim.8) PROD. TYPE: COM ED: Chandra PAGN: Vidya -- SCAN: Nil Simple improved condence intervals for comparing matched

More information

ST3241 Categorical Data Analysis I Two-way Contingency Tables. 2 2 Tables, Relative Risks and Odds Ratios

ST3241 Categorical Data Analysis I Two-way Contingency Tables. 2 2 Tables, Relative Risks and Odds Ratios ST3241 Categorical Data Analysis I Two-way Contingency Tables 2 2 Tables, Relative Risks and Odds Ratios 1 What Is A Contingency Table (p.16) Suppose X and Y are two categorical variables X has I categories

More information

Confidence Intervals of the Simple Difference between the Proportions of a Primary Infection and a Secondary Infection, Given the Primary Infection

Confidence Intervals of the Simple Difference between the Proportions of a Primary Infection and a Secondary Infection, Given the Primary Infection Biometrical Journal 42 (2000) 1, 59±69 Confidence Intervals of the Simple Difference between the Proportions of a Primary Infection and a Secondary Infection, Given the Primary Infection Kung-Jong Lui

More information

f(x θ)dx with respect to θ. Assuming certain smoothness conditions concern differentiating under the integral the integral sign, we first obtain

f(x θ)dx with respect to θ. Assuming certain smoothness conditions concern differentiating under the integral the integral sign, we first obtain 0.1. INTRODUCTION 1 0.1 Introduction R. A. Fisher, a pioneer in the development of mathematical statistics, introduced a measure of the amount of information contained in an observaton from f(x θ). Fisher

More information

Model Estimation Example

Model Estimation Example Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

More information

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics DETAILED CONTENTS About the Author Preface to the Instructor To the Student How to Use SPSS With This Book PART I INTRODUCTION AND DESCRIPTIVE STATISTICS 1. Introduction to Statistics 1.1 Descriptive and

More information

BIOS 625 Fall 2015 Homework Set 3 Solutions

BIOS 625 Fall 2015 Homework Set 3 Solutions BIOS 65 Fall 015 Homework Set 3 Solutions 1. Agresti.0 Table.1 is from an early study on the death penalty in Florida. Analyze these data and show that Simpson s Paradox occurs. Death Penalty Victim's

More information

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami Parametric Assumptions The observations must be independent. Dependent variable should be continuous

More information

ECE 275A Homework 7 Solutions

ECE 275A Homework 7 Solutions ECE 275A Homework 7 Solutions Solutions 1. For the same specification as in Homework Problem 6.11 we want to determine an estimator for θ using the Method of Moments (MOM). In general, the MOM estimator

More information

Glossary for the Triola Statistics Series

Glossary for the Triola Statistics Series Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling

More information

Testing Statistical Hypotheses

Testing Statistical Hypotheses E.L. Lehmann Joseph P. Romano Testing Statistical Hypotheses Third Edition 4y Springer Preface vii I Small-Sample Theory 1 1 The General Decision Problem 3 1.1 Statistical Inference and Statistical Decisions

More information

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods Chapter 4 Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods 4.1 Introduction It is now explicable that ridge regression estimator (here we take ordinary ridge estimator (ORE)

More information

Categorical Variables and Contingency Tables: Description and Inference

Categorical Variables and Contingency Tables: Description and Inference Categorical Variables and Contingency Tables: Description and Inference STAT 526 Professor Olga Vitek March 3, 2011 Reading: Agresti Ch. 1, 2 and 3 Faraway Ch. 4 3 Univariate Binomial and Multinomial Measurements

More information

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science. Texts in Statistical Science Generalized Linear Mixed Models Modern Concepts, Methods and Applications Walter W. Stroup CRC Press Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint

More information

Some General Types of Tests

Some General Types of Tests Some General Types of Tests We may not be able to find a UMP or UMPU test in a given situation. In that case, we may use test of some general class of tests that often have good asymptotic properties.

More information

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Glenn Heller and Jing Qin Department of Epidemiology and Biostatistics Memorial

More information

Mathematical statistics

Mathematical statistics October 18 th, 2018 Lecture 16: Midterm review Countdown to mid-term exam: 7 days Week 1 Chapter 1: Probability review Week 2 Week 4 Week 7 Chapter 6: Statistics Chapter 7: Point Estimation Chapter 8:

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Clinical Trials. Olli Saarela. September 18, Dalla Lana School of Public Health University of Toronto.

Clinical Trials. Olli Saarela. September 18, Dalla Lana School of Public Health University of Toronto. Introduction to Dalla Lana School of Public Health University of Toronto olli.saarela@utoronto.ca September 18, 2014 38-1 : a review 38-2 Evidence Ideal: to advance the knowledge-base of clinical medicine,

More information

Introduction. Dipankar Bandyopadhyay, Ph.D. Department of Biostatistics, Virginia Commonwealth University

Introduction. Dipankar Bandyopadhyay, Ph.D. Department of Biostatistics, Virginia Commonwealth University Introduction Dipankar Bandyopadhyay, Ph.D. Department of Biostatistics, Virginia Commonwealth University D. Bandyopadhyay (VCU) BIOS 625: Categorical Data & GLM 1 / 56 Course logistics Let Y be a discrete

More information

9 Generalized Linear Models

9 Generalized Linear Models 9 Generalized Linear Models The Generalized Linear Model (GLM) is a model which has been built to include a wide range of different models you already know, e.g. ANOVA and multiple linear regression models

More information

Answer Key for STAT 200B HW No. 8

Answer Key for STAT 200B HW No. 8 Answer Key for STAT 200B HW No. 8 May 8, 2007 Problem 3.42 p. 708 The values of Ȳ for x 00, 0, 20, 30 are 5/40, 0, 20/50, and, respectively. From Corollary 3.5 it follows that MLE exists i G is identiable

More information

ST495: Survival Analysis: Hypothesis testing and confidence intervals

ST495: Survival Analysis: Hypothesis testing and confidence intervals ST495: Survival Analysis: Hypothesis testing and confidence intervals Eric B. Laber Department of Statistics, North Carolina State University April 3, 2014 I remember that one fateful day when Coach took

More information

Testing Hypothesis. Maura Mezzetti. Department of Economics and Finance Università Tor Vergata

Testing Hypothesis. Maura Mezzetti. Department of Economics and Finance Università Tor Vergata Maura Department of Economics and Finance Università Tor Vergata Hypothesis Testing Outline It is a mistake to confound strangeness with mystery Sherlock Holmes A Study in Scarlet Outline 1 The Power Function

More information

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013 Analysis of Categorical Data Nick Jackson University of Southern California Department of Psychology 10/11/2013 1 Overview Data Types Contingency Tables Logit Models Binomial Ordinal Nominal 2 Things not

More information

Lecture 3 September 1

Lecture 3 September 1 STAT 383C: Statistical Modeling I Fall 2016 Lecture 3 September 1 Lecturer: Purnamrita Sarkar Scribe: Giorgio Paulon, Carlos Zanini Disclaimer: These scribe notes have been slightly proofread and may have

More information

Decomposition of Parsimonious Independence Model Using Pearson, Kendall and Spearman s Correlations for Two-Way Contingency Tables

Decomposition of Parsimonious Independence Model Using Pearson, Kendall and Spearman s Correlations for Two-Way Contingency Tables International Journal of Statistics and Probability; Vol. 7 No. 3; May 208 ISSN 927-7032 E-ISSN 927-7040 Published by Canadian Center of Science and Education Decomposition of Parsimonious Independence

More information

Session 3 The proportional odds model and the Mann-Whitney test

Session 3 The proportional odds model and the Mann-Whitney test Session 3 The proportional odds model and the Mann-Whitney test 3.1 A unified approach to inference 3.2 Analysis via dichotomisation 3.3 Proportional odds 3.4 Relationship with the Mann-Whitney test Session

More information

Central Limit Theorem ( 5.3)

Central Limit Theorem ( 5.3) Central Limit Theorem ( 5.3) Let X 1, X 2,... be a sequence of independent random variables, each having n mean µ and variance σ 2. Then the distribution of the partial sum S n = X i i=1 becomes approximately

More information

Ph.D. Qualifying Exam Friday Saturday, January 3 4, 2014

Ph.D. Qualifying Exam Friday Saturday, January 3 4, 2014 Ph.D. Qualifying Exam Friday Saturday, January 3 4, 2014 Put your solution to each problem on a separate sheet of paper. Problem 1. (5166) Assume that two random samples {x i } and {y i } are independently

More information

STATISTICAL INFERENCE IN ACCELERATED LIFE TESTING WITH GEOMETRIC PROCESS MODEL. A Thesis. Presented to the. Faculty of. San Diego State University

STATISTICAL INFERENCE IN ACCELERATED LIFE TESTING WITH GEOMETRIC PROCESS MODEL. A Thesis. Presented to the. Faculty of. San Diego State University STATISTICAL INFERENCE IN ACCELERATED LIFE TESTING WITH GEOMETRIC PROCESS MODEL A Thesis Presented to the Faculty of San Diego State University In Partial Fulfillment of the Requirements for the Degree

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Exam details. Final Review Session. Things to Review

Exam details. Final Review Session. Things to Review Exam details Final Review Session Short answer, similar to book problems Formulae and tables will be given You CAN use a calculator Date and Time: Dec. 7, 006, 1-1:30 pm Location: Osborne Centre, Unit

More information

Inferential Statistics

Inferential Statistics Inferential Statistics Eva Riccomagno, Maria Piera Rogantin DIMA Università di Genova riccomagno@dima.unige.it rogantin@dima.unige.it Part G Distribution free hypothesis tests 1. Classical and distribution-free

More information

H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL

H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL Intesar N. El-Saeiti Department of Statistics, Faculty of Science, University of Bengahzi-Libya. entesar.el-saeiti@uob.edu.ly

More information

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014 LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R Liang (Sally) Shan Nov. 4, 2014 L Laboratory for Interdisciplinary Statistical Analysis LISA helps VT researchers

More information

Chapter 4. Theory of Tests. 4.1 Introduction

Chapter 4. Theory of Tests. 4.1 Introduction Chapter 4 Theory of Tests 4.1 Introduction Parametric model: (X, B X, P θ ), P θ P = {P θ θ Θ} where Θ = H 0 +H 1 X = K +A : K: critical region = rejection region / A: acceptance region A decision rule

More information

STAT331. Cox s Proportional Hazards Model

STAT331. Cox s Proportional Hazards Model STAT331 Cox s Proportional Hazards Model In this unit we introduce Cox s proportional hazards (Cox s PH) model, give a heuristic development of the partial likelihood function, and discuss adaptations

More information

Comparison of Two Samples

Comparison of Two Samples 2 Comparison of Two Samples 2.1 Introduction Problems of comparing two samples arise frequently in medicine, sociology, agriculture, engineering, and marketing. The data may have been generated by observation

More information

To appear in The American Statistician vol. 61 (2007) pp

To appear in The American Statistician vol. 61 (2007) pp How Can the Score Test Be Inconsistent? David A Freedman ABSTRACT: The score test can be inconsistent because at the MLE under the null hypothesis the observed information matrix generates negative variance

More information