Paradoxical Results in Multidimensional Item Response Theory


Giles Hooker, Matthew Finkelman and Armin Schwartzman

Abstract

In multidimensional item response theory (MIRT), it is possible for the estimate of a subject's ability in some dimension to decrease after they have answered a question correctly. This paper investigates how and when this type of paradoxical result can occur. We demonstrate that many response models and statistical estimates can produce paradoxical results and that in the popular class of linearly compensatory models, maximum likelihood estimates are guaranteed to do so. In light of these findings, the appropriateness of multidimensional item response methods for assigning scores in high-stakes testing is called into question.

1 Introduction

Jane and Jill are fast friends who are nonetheless intensely competitive. At the end of high school they each take an entrance exam for a prestigious university. After the exam, they compare notes and discover that they gave the same answers for every question but the last. On checking their materials, it is clear that Jane answered this question correctly, but Jill answered incorrectly. They are therefore very surprised, when the test results are published, to find that Jill passed but Jane did not! Lawsuits ensue. The university maintains that it followed well-established statistical procedures: the questions on the test were designed to simultaneously examine both language and analytic skills, and a multiple-hurdle rule (Segall 2000) based on maximum likelihood estimates of each student's abilities was used to ensure that admitted students were proficient in both. The university had re-checked its calculations many times and was satisfied the correct decision had been made. Jane's lawyers countered that, whatever the statistical correctness of the agency's procedures, it is unreasonable that an examinee should be penalized for getting more questions correct. Could such a situation occur?
***** (2007) demonstrated empirically that it could. On examining an operational data set, they found that, were a passing threshold for the test set injudiciously, nearly 6% of students could move from fail to pass by changing answers from correct to incorrect. The possibility of obtaining a lower score by getting more questions correct, or vice versa, was labeled a paradoxical result and has clear implications for fairness. This is a property of statistical estimates for multidimensional latent abilities, even when the models for student responses appear reasonable. How does this occur? ***** (2007) provide an intuitive explanation. Suppose that each question on the test given to Jane and Jill required both language and analytical skills to answer correctly, and Jane and Jill got some of these correct and some incorrect. The final question, however, was very difficult in terms of analysis, but did not require strong language skills. That Jane got this question correct suggests that her analytical skills must be very good indeed. This being the case, the only explanation for her previous

incorrect answers is that her language skills must be quite low. By contrast, Jill, in getting the final question incorrect, has demonstrated weaker analytic skills and must have relied on stronger language skills to answer previous questions correctly. The estimate of Jane's language ability therefore dipped below the required threshold, while Jill's was pushed upwards; both obtained satisfactory analysis scores. This provides an intuitive explanation for how paradoxical results occur and makes clear that they may not be unreasonable. The authors nonetheless feel that it is better not to put students in the position of second-guessing when their best answer may be harmful to them. This paper provides a mathematical analysis of paradoxical results that began as an attempt to find conditions under which statistical estimation would always avoid them. Our analysis is not encouraging: paradoxical results can occur across a wide range of two-dimensional item response models using a large class of statistical ability estimates. Moreover, in the popular class of linearly compensatory models, every non-separable test has a response sequence for which maximum likelihood estimates (MLEs) of abilities are paradoxical. Indeed, for any ability dimension, almost every answer sequence for every test is paradoxical in the sense that the estimate of ability can either be made to increase by changing a correctly answered item to incorrect, or to decrease by changing an incorrectly answered item to correct, provided that question is chosen appropriately. The results in this paper establish the existence of paradoxical results under general conditions. They imply, for example, that within a broad class of multidimensional models, it is not enough to restrict the parametric form of an item response function in order to avoid paradoxical results. The exact conditions and assumptions are given in Sec. 2.
They essentially reduce to the following sufficient, but not necessary, conditions: all second derivatives of the log of the item response surface should be strictly negative, and either MLEs, or priors with no correlation between abilities, are used to estimate a subject's ability. These conditions can be unrealistic in practice; the first excludes response functions with guessing parameters and the second assumes an absence of correlation between abilities that may be unlikely. Neither condition is necessary for paradoxical results, however, and we investigate the extent to which they may be violated while still producing such results. We also believe that our analysis sheds light on the general mathematical causes of paradoxical results in a way that makes an analysis in specific cases possible. As an example, our study of linearly compensatory models uses the framework developed in the general case to provide sufficient conditions for an individual item to produce a paradoxical result, including for Bayesian estimates with non-independent priors. We note that tests that exhibit simple structure (every question depending on only one ability dimension) violate the first of the above conditions, and these can be guaranteed not to produce paradoxical results in two ability dimensions or when independence priors are used. The paper is structured as follows. We provide a basic framework for multidimensional item response theory in Sec. 2. Some basic theoretical results and constructions are given in Sec. 3. We apply these results to frequentist estimates in Sec. 4 and to Bayesian estimates in Sec. 5. The specific case of linearly compensatory models is explored in Sec. 6. Sec. 7 considers extensions to models not covered by our assumptions, in particular models that involve guessing parameters, non-independent priors, discrete ability spaces and spaces of three or more dimensions; empirically, we show that paradoxical results remain common under these conditions. Sec. 8 examines a limited solution given in terms of enforcing regularity conditions.

2 Definitions and Assumptions

The assumptions made here and in Sec. 3 are used throughout Sections 4, 5 and 6. We discuss the effect of relaxing these assumptions in Sec. 7. Multidimensional item response theory assumes that questions (hereafter referred to as items) may measure more than one latent trait, with θ ∈ R^d indicating a given subject's proficiency in each of d dimensions. In the scenario above, there are dimensions representing language and analytic skills. We assume that all item parameters have been estimated and are defined with respect to fixed dimensions. We further assume that subjects' abilities are modeled as unrestricted elements of R^d. In order to simplify the analysis below, we restrict to abilities indexed in R^2, but state as much as possible in more general terms. Some comments regarding higher-dimensional models are given in Sec. 7. We assume that a subject's answers to items are marked either as correct or incorrect. The item's response function, P, gives the probability of a student getting a particular test item correct as a function of the student's abilities and parameters specific to the item. We will frequently refer to an item as meaning both the specifics of a test question and its response function. Common choices include

Compensatory: P(θ, a_i) = g(a_{i0} + a_i^T θ), where g is a log-concave cumulative distribution function and each a_i is a d-vector with non-negative entries. Most commonly used are the logistic, g(t) = e^t/(1 + e^t) (e.g., Ackerman 1996; Reckase 1985), and the normal ogive, in which g(t) is the cumulative distribution function of a Gaussian random variable (e.g., Bock et al.). See Reckase 1997 for a review, including a history of multidimensional IRT models. The label compensatory is used since a subject's lack of ability in dimension 1 may be made up for by proficiency in dimension 2 and vice versa. We describe the models above as being linearly compensatory; they admit particularly easy analysis.
Non-compensatory: P(θ, a_i) = ∏_{j=1}^d g_j(a_{ij0} + a_{ij} θ_j), in which each g_j is a log-concave cumulative distribution function and a_{ij} > 0 (Bolt and Lall 2003; Whitely 1980). The term non-compensatory is applied here since no degree of proficiency in dimension 2 will overcome a lack of ability in dimension 1.

Our results are more general than these particular models, however, and we set out very general conditions below.

Definition 2.1. We define a test to be a collection of N items with probability functions P_i(θ_1,...,θ_d), i = 1,...,N, for ability parameters θ = (θ_1,...,θ_d).

When θ is used as an argument to a function, it is taken to be shorthand for the vector of arguments (θ_1,...,θ_d), or a sub-vector of these, and we use both notations interchangeably. It makes sense that increasing a subject's ability should not reduce their probability of getting an item correct, hence the following natural restriction:

Definition 2.2. A response function P(θ_1,...,θ_d) is monotone if P is a monotone increasing function of θ_j for each (θ_1,...,θ_{j-1}, θ_{j+1},...,θ_d).

Monotone response functions are considered in Junker and Sijtsma (2000) and Antal (2007); non-monotone response functions can clearly be seen to produce paradoxical results. This is also a concern for multiple-choice models, for example in Thissen and Steinberg (1997). We make the distinction that our results are

specifically based on monotone response functions. These cannot produce paradoxical results in unidimensional IRT, but will do so in MIRT. In addition, we require a somewhat more restrictive assumption about models for a correct response:

Definition 2.3. A response function P(θ) has log negative second derivatives if its second derivatives exist and

∂² log P(θ)/∂θ_j ∂θ_k ≤ 0 and ∂² log(1 − P(θ))/∂θ_j ∂θ_k ≤ 0    (1)

for each j and k (possibly equal), everywhere in R^d.

Straightforward calculation gives that each of the models listed above satisfies this condition. Importantly, P(θ) and 1 − P(θ) are log-concave as functions of each θ_j with the other ability dimensions fixed. In what follows, we will be interested in properties of classes of items; specifically, are there simple restrictions that ensure a test cannot produce paradoxical results? Our main concern here is whether there are families of response functions that can be guaranteed to avoid paradoxical results. For any class of items, our mathematical analysis consists of finding a test comprised of items drawn from that class that will produce a paradoxical result. If such a test exists, then the combination of items within the test becomes important. In order to make this argument formal, we need to define which items can be selected to form a test.

Definition 2.4. A class of two-dimensional items is said to be parametric if the response functions for each item can be written as P(a_1 θ_1, a_2 θ_2, γ) for a_1 ≥ 0, a_2 ≥ 0 and γ finite-dimensional. A parametric class is complete if it contains items for all a_1 ≥ 0 and a_2 ≥ 0.

In particular, a complete class contains an item for which a_1 = 0; that is, θ_1 does not affect the probability of getting the item correct. If a complete parametric class could be shown to avoid paradoxical results, we could safely design tests using items from this class. Sec. 6 removes the assumption of completeness for linearly compensatory models.
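Both the compensatory logistic response function and the condition of Definition 2.3 can be sketched numerically. The finite-difference check below (item parameters and evaluation points are arbitrary illustrations, not from any calibrated test) verifies that the mixed second derivative of log P is negative:

```python
import numpy as np

def compensatory_logistic(theta1, theta2, a=(1.0, 1.0), a0=0.0):
    """Linearly compensatory item: P(theta) = g(a0 + a1*theta1 + a2*theta2),
    with g the logistic function and non-negative loadings a."""
    return 1.0 / (1.0 + np.exp(-(a0 + a[0] * theta1 + a[1] * theta2)))

def log_P(theta1, theta2):
    return np.log(compensatory_logistic(theta1, theta2))

def cross_second_derivative(f, x, y, h=1e-4):
    """Central finite-difference estimate of d^2 f / (dx dy)."""
    return (f(x + h, y + h) - f(x + h, y - h)
            - f(x - h, y + h) + f(x - h, y - h)) / (4.0 * h * h)

# Consistent with Definition 2.3, the mixed second derivative of log P
# is negative at every point we evaluate.
for t1 in (-2.0, 0.0, 2.0):
    for t2 in (-1.0, 1.0):
        assert cross_second_derivative(log_P, t1, t2) < 0.0
```

For the logistic, the same conclusion follows analytically: d² log g(t)/dt² = −g(t)(1 − g(t)) < 0, and the chain rule multiplies this by the non-negative product of loadings.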
As noted in the introduction, two-dimensional tests exhibiting simple structure are not subject to paradoxical results, and we restrict to:

Definition 2.5. A class of items with log negative second derivatives is non-simple if it contains at least one item for which the inequalities in (1) are strict for at least one pair (j, k) with j ≠ k.

We now turn to the outcome of administering a test to a subject:

Definition 2.6. For a test, each subject produces a response vector: a series of N binary answers y = (y_1,...,y_N), which are coded 0 or 1 according to whether item i is answered incorrectly or correctly.

Definition 2.7. For any given test, a score is a function mapping from the response vectors to real numbers, S : {0,1}^N → R.

An important class of scores are those defined with respect to some estimate of a subject's ability vector, θ̂. Examples include multiple-hurdle scores (Segall 2000), S(y) = ∏_i I(θ̂_i ≥ c_i), and composite scores (van der Linden 1999), S(y) = ∑_i s_i θ̂_i or I(∑_i s_i θ̂_i ≥ C).
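These two score types can be sketched directly; the cutoffs and weights below are illustrative placeholders, not values from any operational test:

```python
import numpy as np

def multiple_hurdle_score(theta_hat, c):
    """Multiple-hurdle rule: pass (1) only if every estimated ability
    clears its cutoff, i.e. S(y) = prod_i I(theta_hat_i >= c_i)."""
    return float(np.all(theta_hat >= c))

def composite_score(theta_hat, s):
    """Weighted composite S(y) = sum_i s_i * theta_hat_i."""
    return float(np.dot(s, theta_hat))

theta_hat = np.array([0.8, -0.3])
assert multiple_hurdle_score(theta_hat, np.array([0.0, 0.0])) == 0.0  # fails dim 2
assert composite_score(theta_hat, np.array([1.0, 0.0])) == 0.8  # isolates theta_1
```

The weight vector (1, 0) used in the last line is the composite that reduces to the single-coordinate score θ̂_1 studied below.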

Either score can represent nuisance dimensions by setting c_i = −∞ or s_i = 0. Our results will be concerned with the estimate θ̂_1, which can be thought of as a composite score with (s_1, s_2,...,s_d) = (1, 0,...,0). However, paradoxical results for this score can clearly have implications for multiple-hurdle or other composite scores. Specific conditions on when this can happen for composite scores and linearly compensatory models are investigated in Sec. 6. As discussed, one of the properties of scores that we would like to retain is regularity.

Definition 2.8. We define a partial ordering ⪯ on responses by stating y ⪯ z if y_i ≤ z_i for all i and this inequality is strict for some i. A score is regular if y ⪯ z implies S(y) ≤ S(z). A paradoxical result is defined to be the existence of response vectors y and z such that y ⪯ z and S(z) < S(y).

As with items, we define a partial ordering of functions and a monotonicity condition for functionals of them.

Definition 2.9. We define a partial ordering ≺ of unidimensional functions by g ≺ f if g(t) < f(t) for all t ∈ R. A functional T mapping functions to the real line is said to be monotone increasing if g ≺ f implies T[g] < T[f]; it is monotone decreasing if g ≺ f implies T[g] > T[f].

Most unidimensional frequentist and Bayesian estimates are monotone increasing functionals of the derivative of the log likelihood.

3 Basic Results and Constructions

Before proceeding to demonstrate that scores based on statistical estimates of an ability parameter are not regular, we provide some basic notation, a few general results and constructions for the results given below.

3.1 Statistical Constructs

Most statistical estimates may be written as functionals of the log likelihood l(θ; y), under the assumption that responses are conditionally independent across items given subjects. In our proofs, we will use l_k(θ; y) to denote the log likelihood after k items have been administered:

l_k(θ; y) = ∑_{i=1}^k [ y_i log P_i(θ) + (1 − y_i) log(1 − P_i(θ)) ].

l_k(θ; y) is a function only of the first k entries in y, and ignores the following entries. Throughout, l(θ; y) will be taken to mean l_N(θ; y): the log likelihood at the end of the test. We note that the log-negative second derivative condition (Def. 2.3) implies that all the second derivatives of l(θ; y) are negative. Frequentist estimates for θ are most easily expressed in terms of the derivatives of l(θ; y). When using Bayesian methods, we also place a restriction on the priors available:

Definition 3.1. A density µ(θ) is independent and log-concave if µ(θ) = ∏_{j=1}^d µ_j(θ_j), where each µ_j is a log-concave density.

We assume that Bayesian estimates are conducted with a prior in which the traits are modeled as being independent. This is a substantial restriction that ensures that the posterior distribution after k items,

f_k(θ | y) = e^{l_k(θ; y)} µ(θ) / ∫ e^{l_k(θ; y)} µ(θ) dθ,

also has log-negative second derivatives. We discuss relaxing this condition in Sec. 7. We want to emphasize that Def. 3.1 is used to define a statistical estimation procedure rather than to make any statement about the distribution of traits in a population. Our results refer to the behavior of individual statistical estimates and do not depend on the population of subjects. We denote the marginal posterior density of θ_i by f_{k,i}(θ_i | y) and its density conditional on all the other ability parameters by f_{k,i}(θ_i | θ_{−i}, y). The corresponding cumulative distribution functions are written F_{k,i}(θ_i | y) and F_{k,i}(θ_i | θ_{−i}, y).

3.2 Basic Results

We begin with two central lemmas. These describe the behavior of the derivatives of the likelihood with respect to θ_i, keeping the other entries in θ fixed. For simplicity, throughout this section, we take i = 1 and let θ_{−1} = (θ_2,...,θ_d). The results clearly do not depend on this choice of i. We use t as an argument in place of θ_1 when examining a likelihood as a function of θ_1. All proofs are given in Appendix A.

Lemma 3.1. Assume that a test consists of N items which are monotone in d abilities and have log negative second derivatives, at least one of which is strict. If θ_{−1} ⪯ θ′_{−1} (in the partial ordering of Def. 2.8), then

∂l/∂θ_1(t, θ′_{−1}; y) ≺ ∂l/∂θ_1(t, θ_{−1}; y).

Lemma 3.2. Define

θ̃_1(θ_{−1}, c; y) = { t : ∂l/∂θ_1(t, θ_{−1}; y) = c }.

Under the conditions of Lemma 3.1, dθ̃_1/dθ_j < 0 for j = 2,...,d.

For c = 0 this is the MLE for θ_1 with all other parameters held fixed. We can define likelihood ratio confidence intervals using other values of c.
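Lemma 3.2's conclusion can be observed numerically. In the sketch below (hypothetical item parameters), the conditional MLE θ̃_1(θ_2, 0; y) is found by root-finding on the score function, and it decreases when θ_2 is raised:

```python
import numpy as np
from scipy.optimize import brentq

# Two hypothetical logistic items, answered (correct, incorrect);
# both load on both ability dimensions.
A = np.array([[1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 0.0])

def dl_dtheta1(t, theta2):
    """Score function d l / d theta_1 at (t, theta2):
    sum_i a_i1 * (y_i - P_i(theta))."""
    P = 1.0 / (1.0 + np.exp(-(A[:, 0] * t + A[:, 1] * theta2)))
    return float(np.sum(A[:, 0] * (y - P)))

# theta_1-tilde(theta2, 0; y): where the score function crosses zero.
t_low  = brentq(dl_dtheta1, -10, 10, args=(0.0,))
t_high = brentq(dl_dtheta1, -10, 10, args=(1.0,))
assert t_high < t_low  # raising theta_2 lowers the conditional MLE of theta_1
```

The score function is strictly decreasing in t (the log likelihood is concave), so each root is unique and `brentq` is guaranteed a sign change on a wide enough bracket.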
We note that since the log likelihood is concave, ∂l/∂θ_1 is monotone decreasing and hence this function is well defined. Lemma 3.2 provides the basic intuition for the mathematical explanation of paradoxical results. As θ_2, say, increases, the value of θ_1 at which ∂l/∂θ_1 crosses zero decreases. This already implies that the MLE for θ_1 conditional on θ_2,...,θ_d decreases as θ_2 increases. Figure 1 provides a visual representation of this intuition using the likelihood for the first subject in the data set analyzed in ***** . In this figure, the contours of the log likelihood form ellipses with a negatively-oriented major axis. This will always be the case when the conditions in Def. 2.3 hold. For Bayesian inference, the analogous result to Lemma 3.1 follows from the same reasoning and is stated without proof:

Figure 1: An illustration of the nature of likelihoods in two-dimensional item response theory, using likelihoods from an example test. Left panel: we plot ∂l(θ_1, θ_2)/∂θ_1 for fixed θ_2* and θ_2+ > θ_2*. The latter is everywhere below the former. We have also given the maximizing values of θ_1 for θ_2* and θ_2+. Right panel: contours of the likelihood and the MLE for θ_1 (solid) plotted as a function of θ_2 (dashed). If θ_2 is increased, the maximizing value for θ_1 must decrease.

Lemma 3.3. Under the conditions of Lemma 3.1, let f(θ | y) be the posterior density based upon an independent and log-concave prior. Let θ_{−1} ⪯ θ′_{−1}. Then

∂/∂θ_1 log f(t, θ′_{−1} | y) ≺ ∂/∂θ_1 log f(t, θ_{−1} | y).

While Bayesian inference can be written in terms of the derivatives of a log posterior density, it is far more convenient to produce an analysis in terms of cumulative distribution functions. The following result demonstrates that the ordering relation is preserved between these two.

Lemma 3.4. Let f_1(t) and f_2(t) be twice-differentiable log-concave densities on the real line. Then

d log f_1/dt ≻ d log f_2/dt implies ∫_{−∞}^t f_1(s) ds ≺ ∫_{−∞}^t f_2(s) ds.

This result demonstrates a monotone relationship between derivatives of log densities and cumulative distribution functions. In particular, letting X and Y correspond to cumulative distribution functions F_1 ≺ F_2, X is said to be stochastically greater than Y, and consequently, for a monotone increasing function g, E g(X) > E g(Y). A discussion of the application of stochastic ordering to IRT is given in van der Linden (1998).

3.3 A Paradoxical Test

In this section, we construct a test that exhibits paradoxical results for a wide range of statistical estimates of θ. We make use of this test throughout the proofs of our results in Sections 4 and 5. Our main message

is that among all non-simple complete parametric classes of items with log-negative second derivatives, it is impossible to use the form of the response functions alone to rule out paradoxical results, and one must therefore consider the particular combination of items in a test. Assume that F is any complete parametric non-simple class of two-dimensional monotone items with log-negative second derivatives (see Definitions 2.2-2.5). Let P_1,...,P_{N−1} be items selected from this family with responses y_{N−1} = (y_1,...,y_{N−1}). We assume that the second derivatives of log P_i are strictly negative for at least one item i and that l_{N−1}(θ; y_{N−1}) has a unique maximum. We take P_N to be such that

P_N(θ_1, θ_2, γ_N) = P_N(0, θ_2, γ_N) = P_N(θ_2, γ_N).

This item is available by the completeness of F. We define response sequences y¹ = (y_{N−1}, 1) and y⁰ = (y_{N−1}, 0), so that y⁰ ⪯ y¹. Each of the results in Sections 4 and 5 shows that a statistical estimate of ability in dimension 1 has θ̂_1(y¹) < θ̂_1(y⁰). There is nothing special about the first dimension or the final item; they have been chosen for notational, and conceptual, convenience. The test constructed here requires the final item to exhibit simple structure. This is not required to produce paradoxical results. The continuity arguments of Sec. 7 demonstrate that the result will continue to hold when the final item is replaced by P_N(εθ_1, θ_2, γ_N) for small ε > 0.

4 Frequentist Estimates in Two-Dimensional MIRT

Many frequentist estimates can be regarded as functionals of the derivative of the log likelihood. Our particular examples are the MLE and the extrema of likelihood ratio confidence intervals. Our purpose in this section is to demonstrate that in any complete parametric class of models there are tests for which these statistics are paradoxical scores. Without loss of generality, we consider statistical estimates for θ_1.

4.1 Maximum Likelihood Estimates

The MLE for θ is given by the vector θ̂_MLE(y) that maximizes l(θ; y), or alternatively that satisfies

∂l/∂θ (θ̂_MLE; y) = 0,

assuming that this exists and is unique. We use θ̂_{1,MLE}(y) to denote the first entry of θ̂_MLE(y).

Theorem 4.1. Let F be a non-simple complete parametric class of monotone two-dimensional response models with log negative second derivatives. There exist a test with response functions P_1,...,P_N selected from F, and response vectors y¹ and y⁰ to this test, that give rise to a paradoxical result.

The technique of this proof is to show that in the test defined in Sec. 3.3, θ̂_{2,MLE} must increase after the final item is answered correctly, so that θ̂_{1,MLE} must decrease by Lemma 3.2. This technique is central to the analysis of the paper. In this proof, we rely on the final item not depending on θ_1. However, in Sec. 6 we demonstrate how the same idea can be applied to analyze more general tests when linearly compensatory models are employed. The formal proof is given in Appendix B.1.

4.2 Profile Likelihood Ratio Confidence Bounds

We can extend paradoxical results for MLEs to confidence bounds on estimates. A general method for constructing confidence intervals is based on the acceptance region of a likelihood ratio test. A profile

likelihood ratio lower confidence bound is the smallest value for θ_1 such that a likelihood ratio test for this value against the MLE would not be rejected:

θ̂_{1,PLB}(y) = min { θ_1 : l(θ_1, θ̂_{2,MLE}(θ_1; y); y) ≥ l(θ̂_MLE; y) − K }.

Here K is a cut-off value. This is usually taken to be a quantile of a χ² random variable.

Theorem 4.2. Under the conditions of Theorem 4.1, there exists a test P_1,...,P_N drawn from F for which θ̂_{1,PLB}(y) is not a regular score.

The proof for this theorem is given in Appendix B.2. The proof for the equivalent unconditional upper profile confidence bound follows the same construction. We note that the use of confidence bounds based on the Fisher information matrix is conspicuously absent from this analysis. This is because the generality of our assumptions does not allow us to exclude local features in the likelihood, which may produce quite different bounds from those obtained by likelihood ratio confidence intervals. In such cases we consider the likelihood ratio confidence bounds to be preferable.

5 Bayesian Estimates in Two-Dimensional MIRT

We begin with the maximum a posteriori (MAP) estimate:

θ̂_MAP = { θ : ∂f(θ | y)/∂θ = 0 }.

As with the MLE, θ̂_{1,MAP}(y) indicates the first element of θ̂_MAP.

Theorem 5.1. Under the conditions of Theorem 4.1, let µ be an independent log-concave prior. There exists a test P_1,...,P_N drawn from F for which θ̂_{1,MAP}(y) is not a regular score.

The proof of this theorem is similar to that of Theorem 4.1 and is omitted. The analogous result to Theorem 4.2 demonstrates that lower or upper bounds based on contours of the log posterior also exhibit paradoxical results.

5.1 Marginal Inference

Many Bayesian estimates for θ_1, including credible intervals, can be expressed in terms of monotone functionals of their marginal cumulative distribution function. The most commonly used of these is the expected a posteriori (EAP) estimate:

θ̂_EAP = ∫_{R^d} θ f(θ | y) dθ.

We can demonstrate paradoxical results for a much more general class of estimates.

Theorem 5.2. Let T be a decreasing monotone functional of cumulative distribution functions and F a non-simple complete parametric class of monotone response models with log-negative second derivatives. There exists a test P_1,...,P_N drawn from F for which T[F_1(θ_1; y)] is not a regular score.

The proof of this theorem is given in Appendix C. It is reasonable to expect any sensible Bayesian estimate to be a monotone decreasing function of the posterior cumulative distribution function; that is, to preserve the stochastic ordering of posterior distributions. The following two corollaries are immediate:

Corollary 5.1. Under the conditions of Theorem 5.2, there exists a test for which the EAP for θ_1 is not a regular score.

Corollary 5.2. Under the conditions of Theorem 5.2, there exists a test for which any upper or lower credible bound for θ_1 is not a regular score.

6 Linearly Compensatory Models

We can refine our analysis somewhat for the class of linearly compensatory models such as the logistic and normal ogive, in which θ enters the model only as a linear combination θᵀa. In this case, we can provide more precise conditions under which paradoxical results may occur without using completeness of the class. In particular, we consider a test using items with linearly compensatory response functions P_1(θᵀa_1),...,P_N(θᵀa_N). We assume that the P_i have log-negative second derivatives with

w_{i1} ≤ d² log P_i(t)/dt² ≤ w_{i2} and w_{i1} ≤ d² log(1 − P_i(t))/dt² ≤ w_{i2}

for all t and some w_{i1} ≤ w_{i2} < 0, with w_{i1} possibly −∞. Let A be the matrix with a_iᵀ on the ith row. We also let W_1 and W_2 be diagonal matrices with ith diagonal elements w_{i1} and w_{i2} respectively. We note that the Hessian of the log likelihood can now be written as AᵀWA for W a diagonal matrix with W_1 ≤ W ≤ W_2.

6.1 Frequentist Inference

Theorem 6.1. Let P_1(θᵀa_1),...,P_{N−1}(θᵀa_{N−1}) be response functions for a test employing linearly compensatory monotone response models for abilities θ ∈ R^d. Let y_{N−1} = (y_1,...,y_{N−1}) be any response sequence such that θ̂_MLE(y_{N−1}) exists and is unique. Let P_N(θᵀb) be a further item in the test and y¹ = (y_{N−1}, 1), y⁰ = (y_{N−1}, 0). A sufficient condition for θ̂_{1,MLE}(y¹) < θ̂_{1,MLE}(y⁰) is

e_1ᵀ (AᵀWA)^{−1} b > 0    (2)

for all diagonal matrices W such that W_1 ≤ W ≤ W_2, where e_1 = (1, 0,...,0)ᵀ ∈ R^d is the first Euclidean basis vector.

The condition in this theorem is not immediately interpretable. It will be simplified below. The mathematical strategy for the proof is as follows: we consider a transformation of ability space θ → ψ so that, without loss of generality, P_N depends only on ψ_d.
From the arguments used to prove Theorem 4.1, we know that the MLE for ψ_d must increase if the final item is answered correctly. We consider the MLE for θ_1 as a function of ψ_d and show that its derivative is negative under condition (2). Since ψ_d increases, the MLE for θ_1 must decrease. We have given the formal details for this proof in Appendix D.1. The requirement that the MLE exist and be unique is a technical necessity. It rules out tests where the item parameters all lie on a subspace of R^d. This can be expected to be unusual in practice, with the notable exception of multidimensional Rasch models. The condition also rules out some answer sequences. Most immediately, all-correct and all-incorrect answer sequences do not have unique MLEs regardless of the item parameters (see, for example, ***** ). Note also that this result excludes tests made up solely of simple-structure items. In this case each row of A has only one non-zero entry and AᵀWA is diagonal and negative definite; (2) is therefore either negative or zero.
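As a concrete illustration, the construction of Sec. 3.3 can be simulated directly. The item parameters below are hypothetical: six logistic items in balanced correct/incorrect pairs (so the MLE exists and is unique), plus a final item loading only on θ_2. Answering that final item correctly lowers both the MLE and the EAP of θ_1:

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[1., 1.], [1., 1.], [1., 2.], [1., 2.], [2., 1.], [2., 1.],
              [0., 2.]])                       # final item ignores theta_1
y1 = np.array([1., 0., 1., 0., 1., 0., 1.])    # final item answered correctly
y0 = np.array([1., 0., 1., 0., 1., 0., 0.])    # final item answered incorrectly

def neg_ll(theta, y):
    """Negative log likelihood -l(theta; y) for logistic items."""
    P = 1.0 / (1.0 + np.exp(-(A @ theta)))
    return -float(np.sum(y * np.log(P) + (1 - y) * np.log(1 - P)))

def mle(y):
    return minimize(neg_ll, np.zeros(2), args=(y,), method="BFGS").x

def eap_theta1(y, grid=np.linspace(-4, 4, 161)):
    """EAP of theta_1 under independent standard normal priors, by brute
    force on a grid (a sketch, not an efficient estimator)."""
    T1, T2 = np.meshgrid(grid, grid, indexing="ij")
    ll = np.zeros_like(T1)
    for a, yi in zip(A, y):
        P = 1.0 / (1.0 + np.exp(-(a[0] * T1 + a[1] * T2)))
        ll += yi * np.log(P) + (1 - yi) * np.log(1 - P)
    post = np.exp(ll - ll.max()) * np.exp(-(T1 ** 2 + T2 ** 2) / 2)
    return float(np.sum(T1 * post) / np.sum(post))

# The paradox: more correct answers, lower theta_1 estimates.
assert mle(y1)[0] < mle(y0)[0]
assert eap_theta1(y1) < eap_theta1(y0)
```

Flipping the final answer from correct to incorrect raises θ̂_1 even though the total number correct drops, which is exactly the behavior the theorems in this section characterize.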

Corollary 6.1. Let P_1(θᵀa_1),...,P_N(θᵀa_N) be response functions for a test comprised of monotone linearly compensatory items with log negative second derivatives. Let the test be ordered such that

a_{N1}/a_{N2} < a_{i1}/a_{i2}, i = 1,...,N−1.    (3)

Then for any response pattern y_{N−1} = (y_1,...,y_{N−1}) such that θ̂_{1,MLE}(y_{N−1}) is defined and unique, setting y¹ = (y_{N−1}, 1) and y⁰ = (y_{N−1}, 0) gives θ̂_{1,MLE}(y⁰) > θ̂_{1,MLE}(y¹).

The proof of this corollary is quite involved and has been given in Appendix D.2. This result demonstrates that for every test using linearly compensatory models, the MLE exhibits paradoxical results. Moreover, almost all answer sequences produce paradoxical results, either increasing the estimate of θ_1 by getting more items incorrect, or decreasing it by getting more correct. It also provides a rule for finding which item to change: the one with least relative weight on dimension 1. Should a student be able to identify this item, they could be guaranteed to obtain a higher score by answering it incorrectly. We want to emphasize here that the above theorem provides sufficient conditions for an item to produce a paradoxical result; these conditions are not necessary. In addition to the item guaranteed by the corollary, there may be numerous other items that produce paradoxical results. Indeed, any item that satisfies (2) is guaranteed to produce a paradoxical result so long as the answers to the other items produce a unique MLE. However, there may be further items that produce paradoxical results for particular values of W. Similar comments can be made for the results below. As a final application, we can extend our analysis above to composite scores of the form S(y) = αᵀθ̂_MLE(y). The use of such scores has been suggested, for example, in van der Linden 1999.

Corollary 6.2. Under the conditions of Theorem 6.1, let S(y) = αᵀθ̂_MLE(y). A sufficient condition for S(y¹) < S(y⁰) is αᵀ(AᵀWA)^{−1}b > 0 for all diagonal W with W_1 ≤ W ≤ W_2.

The proof for this corollary is given in Sec.
D.3.

6.2 Bayesian Inference

Bayesian analysis for linearly compensatory models follows along similar lines. In particular, we observe that for an independent and log-concave prior, a transformation ψ = B⁻¹θ (equivalently θ = Bψ) gives

∂² log µ(ψ)/∂ψ∂ψᵀ = Bᵀ K(θ) B for K(θ) = ∂² log µ(θ)/∂θ∂θᵀ.

Adding this term to the second derivatives throughout the proof of Theorem 6.1 leads to:

Theorem 6.2. Under the conditions of Theorem 6.1, let µ be an independent and log-concave prior with second derivatives bounded below by the diagonal matrix K_1. A sufficient condition for θ̂_{1,MAP}(y¹) < θ̂_{1,MAP}(y⁰) is

e_1ᵀ (AᵀWA + K)^{−1} b > 0    (4)

for all diagonal W such that W_1 ≤ W ≤ W_2 and all diagonal K with K_1 ≤ K ≤ 0.

We note that this condition no longer admits paradoxical results across all answer sequences for which the estimate is defined. In particular, in the case of a two-dimensional θ and a Gaussian prior with variances σ_1² and σ_2², (4) can be reduced to

b_2 > b_1 ( ∑_{i=1}^{N−1} w_i a_{i2}² − σ_2^{−2} ) / ( ∑_{i=1}^{N−1} w_i a_{i1} a_{i2} ).    (5)

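The two-dimensional threshold in (5) is directly computable. A sketch under the conventions of this section (w_i < 0 are curvature values in the admissible range, σ_2² is the prior variance for θ_2; the item loadings below are arbitrary illustrations, and per Theorem 6.2 the inequality must hold across all admissible w_i):

```python
import numpy as np

def b2_threshold(A, w, b1, sigma2):
    """Threshold from condition (5): a final item (b1, b2) with b2 above this
    value satisfies the sufficient condition for a paradoxical MAP of theta_1.
    Both sums are negative (w_i < 0, loadings non-negative), so the ratio is
    positive."""
    num = np.sum(w * A[:, 1] ** 2) - 1.0 / sigma2 ** 2
    den = np.sum(w * A[:, 0] * A[:, 1])
    return float(b1 * num / den)

A = np.ones((4, 2))      # four items with unit loadings on both dimensions
w = np.full(4, -0.25)    # logistic curvature bound: d^2 log P / dt^2 >= -1/4
assert np.isclose(b2_threshold(A, w, b1=1.0, sigma2=1.0), 2.0)
```

Shrinking sigma2 makes the numerator more negative and the threshold larger, matching the observation that the condition becomes more stringent as σ_2² decreases.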
It is possible that there is no ordering of the answers for which this is true. The condition becomes more stringent as σ_2² decreases. Using a smaller value of σ_2² reduces the variation of θ_2 in the posterior distribution, and we can think of this strategy as shrinking the test towards being a unidimensional test on dimension 1. However, we also note that the condition becomes less stringent as N increases, and (5) approaches (3). Alternatively, reversing the inequality in (5) can be shown to guarantee that θ̂_{1,MAP}(y) behaves regularly at the final item. This could be turned into a general criterion for designing tests. However, doing so would restrict to either short tests, low discrimination parameters or strong priors. The condition (5) also provides conditions for paradoxical marginal inference:

Theorem 6.3. Under the conditions in Theorem 6.2, let the dimension of θ be 2, and T[F_1(θ_1; y)] a decreasing monotone functional of the marginal distribution for θ_1. Let the second derivative of the log prior be bounded below by k. A sufficient condition for T[F_1(θ_1; y¹)] < T[F_1(θ_1; y⁰)] is

b_2 > b_1 ( ∑_{i=1}^{N−1} w_i a_{i2}² + k ) / ( ∑_{i=1}^{N−1} w_i a_{i1} a_{i2} )

for all w_{i1} < w_i < w_{i2}. The proof of this theorem is given in Appendix D.4.

7 Model Extensions

We have made several assumptions in order to simplify our analysis. In particular, the use of log-concave response models and prior densities with no correlation between ability traits is not always realistic in practice. The relaxation of these assumptions significantly complicates the analysis. This section outlines a number of specific model extensions that are commonly used in practice and some ways to extend the mathematical results above.

7.1 Extending Existence Results

7.1.1 Embedding in a Larger Space

The following result is a direct consequence of the continuity of the statistical estimates we have examined on the various parameters that define the test and estimation procedure:

Theorem 7.1.
Let $s(\hat\theta(y); \gamma)$ be a monotone function of an MLE, MAP or EAP of $\theta_1$, or of a profile confidence bound or credible bound for $\theta_1$. Let $\gamma$ be the collection of test parameters, including item parameters and parameters of the prior, such that the prior and item response functions vary continuously with $\gamma$. If $y^0 \le y^1$ and $s(\hat\theta(y^0); \gamma) > s(\hat\theta(y^1); \gamma)$, then for some $\epsilon > 0$,

$$\|\gamma' - \gamma\| < \epsilon \implies s(\hat\theta(y^0); \gamma') > s(\hat\theta(y^1); \gamma').$$

This theorem demonstrates that paradoxical results can occur in more general settings than those covered by our assumptions, so long as our assumptions hold in a subspace of the model. In particular, this covers the following models:

No Item Has Simple Structure: The results in Sections 4 and 5 were predicated on a test in which the final item response function did not depend on $\theta_1$, but at least one of the previous items depended on both $\theta_1$ and $\theta_2$. Such a test is a special case of the set of tests in a complete parametric class of items. Theorem 7.1 implies that paradoxical results can occur when all items load onto both dimensions.

Three or More Ability Dimensions: Our results in Sections 4 and 5 may be viewed as a special case of using a $d$-dimensional ability vector in which no item depends on the final $d-2$ abilities. Theorem 7.1 implies that paradoxical results can occur if the loadings on the final $d-2$ dimensions are sufficiently small. More generally, if the multidimensional analogue of $P_N(\theta)$ in Sec. 3.3 places no weight on $\theta_1$ and $\hat\theta_{i,\mathrm{MLE}}^N(y^1) \ge \hat\theta_{i,\mathrm{MLE}}^N(y^0)$ for $i = 2, \dots, d$, the arguments used above give that $\hat\theta_{1,\mathrm{MLE}}^N(y^1) < \hat\theta_{1,\mathrm{MLE}}^N(y^0)$. If we do not have $\hat\theta_{i,\mathrm{MLE}}^N(y^1) \ge \hat\theta_{i,\mathrm{MLE}}^N(y^0)$, a paradoxical result must hold for some ability dimension other than 1. These ideas can also be extended to our results for Bayesian inference.

Guessing Parameters: A common variant of the item response models examined so far is to introduce guessing parameters of the form $\tilde P_i(\theta) = c_i + (1 - c_i) P_i(\theta)$. This ensures that a subject always has probability of at least $c_i$ of answering the item correctly. These models violate our log-negative second derivative condition unless $c_i = 0$ for all $i$. Theorem 7.1 implies that paradoxical results will occur for some $c_i > 0$.

Non-independence Priors: Bayesian estimates that use priors in which abilities are positively correlated do not always produce log posteriors that satisfy our second-derivative conditions. Letting $\mu(\theta, \rho)$ be such that $\mu(\theta, 0)$ factorizes into log-concave densities, Theorem 7.1 implies that Bayesian estimates using $\rho > 0$ can also exhibit paradoxical results. In longer tests, the influence of the prior will be overwhelmed by the data, whatever correlation is used.

In each of these cases, our results demonstrate that paradoxical results can occur in tests whose parameters are near those defined in the sections above. Exactly how far from these parameters it is still possible to produce paradoxical results will depend on the number of items, the shape of the item response functions and the specific parameters in the test.

7.1.2 Local Conditions

Theorems 4.1 and 5.1
only require the log likelihood or log posterior to have negative second derivatives in a local region around $\hat\theta(y^0)$ and $\hat\theta(y^1)$.

Theorem 7.2. Let $l_N(\theta; y)$ be the log likelihood after $N$ items in a test. Define $\hat\theta_1^{N-1}(\theta_2; y)$ to be the conditional MLE for $\theta_1$ as in (10) and assume this is unique. Assume that the response function of the final item does not depend on $\theta_1$ and let $\hat\theta^N(y^0)$ and $\hat\theta^N(y^1)$ be the MLE estimates when the final item is answered incorrectly and correctly, respectively. A sufficient condition for $\hat\theta_1^N(y^0) > \hat\theta_1^N(y^1)$ is that the second derivatives of $l_{N-1}(\theta; y)$ are negative on the line defined by $(\hat\theta_1^{N-1}(\theta_2; y), \theta_2)$ for $\theta_2$ between $\hat\theta_2^N(y^0)$ and $\hat\theta_2^N(y^1)$.

Proof. This is a direct consequence of the implicit function theorem as used in Theorem 4.1, restricted to the line $(\hat\theta_1^{N-1}(\theta_2), \theta_2)$.

The equivalent result can be shown for MAPs by replacing the log likelihood above with the log posterior. Verifying that points of inflection do not occur near the MLEs or MAPs appears to be analytically difficult, but this result provides an intuitive motivation for expecting paradoxical results to occur in more general settings than we examine. A similar analysis of EAPs is also possible, but complex.
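The mechanism in Theorem 7.2, that the conditional MLE $\hat\theta_1(\theta_2; y)$ traces a decreasing curve so that anything raising $\hat\theta_2$ drags $\hat\theta_1$ down, can be illustrated directly. The sketch below (not from the paper; the logistic items and the mixed response pattern are invented, with all loadings positive) computes the conditional MLE by one-dimensional grid search:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical first N-1 items, each loading on both dimensions.
A = np.array([[1.5, 0.5], [1.2, 0.8], [1.0, 0.4], [0.8, 0.6]])
y = np.array([1, 1, 0, 0])

def cond_mle_theta1(theta2, grid=np.linspace(-4, 4, 2001)):
    """Conditional MLE for theta1 given theta2, by 1-D grid search."""
    ll = np.zeros_like(grid)
    for (a1, a2), yi in zip(A, y):
        p = sigmoid(a1 * grid + a2 * theta2)
        ll += yi * np.log(p) + (1 - yi) * np.log(1 - p)
    return grid[np.argmax(ll)]

curve = [cond_mle_theta1(t2) for t2 in np.linspace(-1.5, 1.5, 7)]
print(curve)
# The conditional MLE falls as theta2 rises: an item that pulls the
# estimate of theta2 upward drags the estimate of theta1 downward.
assert all(a > b for a, b in zip(curve, curve[1:]))
```

Because all loadings are positive, raising $\theta_2$ increases every $P_i$, and the score equation for $\theta_1$ can only be rebalanced by lowering $\theta_1$; the decreasing curve is exactly the line along which Theorem 7.2 requires negative second derivatives.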

7.2 Correlated Priors and Linearly Compensatory Models

In order to refine our analysis in a specific case, we examine linearly compensatory models with Gaussian priors. It is clear from the results in Sec. 6 that if $K$ in Theorem 6.2 is constant, it need not be diagonal, and a general sufficient condition for a paradoxical result is simply (14). In two dimensions, if we set the prior variances to $\sigma_1^2$ and $\sigma_2^2$ with prior correlation $\rho$, condition (15) becomes

$$\frac{a_{N2}}{a_{N1}} > \frac{\sum_{i=1}^{N-1} w_i a_{i2}^2 + \dfrac{1}{(1-\rho^2)\sigma_2^2}}{\sum_{i=1}^{N-1} w_i a_{i1} a_{i2} - \dfrac{\rho}{(1-\rho^2)\sigma_1 \sigma_2}} \qquad (16)$$

and we see that making $\rho$ closer to 1 increases the right hand side of (16), resulting in a smaller set of items that will produce paradoxical results.

7.3 An Empirical Study

By way of empirically studying the prevalence of paradoxical results under the model violations discussed above, we investigated the operational data in ***** (2007). In that paper, a linearly compensatory two-dimensional logistic model incorporating guessing parameters and a standard normal prior was used for a 67-item test of English skills, given to 2500 grade 5 students. Among these students, four pairs exhibited the Jill-and-Jane scenario of the introduction: one member answering every question as well as or better than the other, yet obtaining a worse result on dimension 1. Depending on the consequences of the test, this may be regarded as rare, although it may still be unacceptable in high-stakes settings.

Beyond directly comparing subjects, we can also ask the hypothetical "Would this subject have done better by deliberately answering some items incorrectly, or vice versa?" For every one of these subjects, changing the item with largest relative weight on $\theta_2$ caused $\hat\theta_{1,\mathrm{EAP}}(y)$ to produce paradoxical results. To illustrate the concerns that these results might raise, ***** (2007) investigated setting passing thresholds for these data. They found it was possible to choose a threshold on $\theta_1$ so that 6% of subjects could be moved from fail to pass by getting more questions wrong.
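The "done better by answering incorrectly" hypothetical can be reproduced on a toy test. This is not the operational data above; the item parameters are invented for illustration (four mixed items plus a final item loading mainly on dimension 2), and the EAP for $\theta_1$ is computed by brute-force quadrature over a grid under a standard normal prior:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Invented test: four mixed items plus a final item weighted toward theta2.
A = np.array([[1.5, 0.5]] * 4 + [[0.2, 2.0]])

def eap_theta1(y, grid=np.linspace(-5, 5, 201)):
    """EAP for theta1 by brute-force quadrature under a N(0, I) prior."""
    t1, t2 = np.meshgrid(grid, grid, indexing="ij")
    logpost = -0.5 * (t1**2 + t2**2)
    for (a1, a2), yi in zip(A, y):
        p = sigmoid(a1 * t1 + a2 * t2)
        logpost += yi * np.log(p) + (1 - yi) * np.log(1 - p)
    w = np.exp(logpost - logpost.max())             # unnormalized posterior
    return (t1 * w).sum() / w.sum()

# Flipping the final answer from wrong to right lowers the EAP for theta1:
# this examinee would have scored higher on dimension 1 by missing the item.
e_right = eap_theta1([1, 1, 0, 0, 1])
e_wrong = eap_theta1([1, 1, 0, 0, 0])
print(e_right, e_wrong)
assert e_right < e_wrong
```

This shows that the paradox is not an artifact of mode-based estimation: the full posterior mean for $\theta_1$ also drops after the correct answer.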
***** (2007) also investigated varying the prior correlation $\rho$; while larger values reduced the incidence of paradoxical results, setting $\rho$ as high as 0.8 did not eliminate them entirely.

7.4 Discrete Ability Spaces

Some tests or models, such as cognitive diagnosis models (CDMs; see Haertel 1990, Roussos et al. 2007, and von Davier 2005), classify subjects into one of multiple ordered categories along each dimension. In this context, an assignment is made to one of the categories, and thus the use of MLE and MAP estimation is standard. We illustrate the existence of paradoxical results on a two-dimensional ability space partitioned into the categories "master" and "non-master". Here, items are monotone in the sense of Definition 2.2 if the probability of getting an item correct for a master is greater than the probability of getting an item correct for a non-master.

Table 1 gives numerical values for a hypothetical three-item test that produces a paradoxical result, along with the log likelihood values of the answer sequences (1,0,1) and (1,0,0). We observe that the first two items create a ridge giving highest values to the assignments (non-master, master) and (master, non-master), with the latter being highest. The final item only changes with dimension 2; getting it correct moves the MLE for that dimension from non-master to master. The effect of this is to move the assignment for dimension 1 the other way, creating a paradoxical result. We note that in this example the only way to

achieve a paradoxical result is to move from the (non-master, master) category into (master, non-master) or vice versa. Due to the discrete nature of the estimates, this is less common than paradoxical results for continuous ability spaces.

Table 1: A paradoxical test on a two-dimensional discrete ability space. Left: probability of a correct answer in each class, coded n for non-master and m for master. Right: cumulative log likelihood following a (1,0,1) response pattern. The final column gives the log likelihood after a (1,0,0) response pattern. MLEs are starred. Notice that the category maximizing the likelihood moves from (n,m) to (m,n) when the third item is changed from correct to incorrect. (Numerical entries not recovered in this transcription.)

8 Enforcing Regularity

Our analysis indicates that paradoxical results are endemic to multidimensional item response models. In the case of two-dimensional linearly compensatory models, it is not possible to design a test using frequentist estimates that avoids this problem. This fact suggests a need to review the statistical estimates that we use.

One immediate method for modifying a MAP or MLE is to impose constraints on the estimate. That is, we seek ability estimates for all subjects such that no paradoxical results occur for the observed response sequences. For the case of MAP estimates, we want estimated vectors $\theta_1, \dots, \theta_n$ for $n$ subjects with

$$(\hat\theta_{\mathrm{MAP},1}, \dots, \hat\theta_{\mathrm{MAP},n}) = \operatorname*{argmax}_{\theta_1, \dots, \theta_n} \; l(\theta_1; y_1) + \dots + l(\theta_n; y_n) + \log \mu(\theta_1) + \dots + \log \mu(\theta_n)$$

subject to the constraints

$$\hat\theta_{\mathrm{MAP},j} \le \hat\theta_{\mathrm{MAP},k} \quad \text{if } y_j \le y_k. \qquad (17)$$

The number of these constraints may be reduced by taking account of the transitivity relations

$$\hat\theta_{\mathrm{MAP},i} < \hat\theta_{\mathrm{MAP},j} \ \text{ and } \ \hat\theta_{\mathrm{MAP},j} < \hat\theta_{\mathrm{MAP},k} \implies \hat\theta_{\mathrm{MAP},i} < \hat\theta_{\mathrm{MAP},k}. \qquad (18)$$

Here these inequalities could be enforced just for a single dimension of interest, or for all dimensions, depending on the purpose of the test. Similar estimates could be obtained by defining appropriate objective functions for confidence bounds and Bayesian estimates.
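A minimal sketch of the constrained estimation in (17), assuming a two-subject version of the Jane-and-Jill scenario with invented item parameters. The paper's computations used IPOPT; this sketch instead uses SciPy's SLSQP solver and, as suggested above, enforces the ordering on dimension 1 only:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

A = np.array([[1.5, 0.5]] * 4 + [[0.2, 2.0]])   # invented item loadings
y_jane = np.array([1, 1, 0, 0, 1])               # extra item correct
y_jill = np.array([1, 1, 0, 0, 0])               # so y_jill <= y_jane

def neg_logpost(theta, y):
    p = sigmoid(A @ theta)
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return -(loglik - 0.5 * np.sum(theta**2))    # N(0, I) prior

# Unconstrained MAPs exhibit the paradox on dimension 1.
jane = minimize(lambda t: neg_logpost(t, y_jane), np.zeros(2)).x
jill = minimize(lambda t: neg_logpost(t, y_jill), np.zeros(2)).x
assert jane[0] < jill[0]

# Joint estimation subject to (17) on dimension 1: theta1_jane >= theta1_jill.
# x packs both ability vectors as (theta1_jane, theta2_jane, theta1_jill, theta2_jill).
def joint(x):
    return neg_logpost(x[:2], y_jane) + neg_logpost(x[2:], y_jill)

res = minimize(joint, np.zeros(4), method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda x: x[0] - x[2]}])
assert res.x[0] >= res.x[2] - 1e-5               # ordering now respected
```

With only one binding constraint the two estimates are pulled together on dimension 1, which is the same behaviour reported for the operational data below.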
In the operational data investigated in ***** (2007), there were 2500 ability vectors to estimate, with 5,354 inequalities of the form (17). Accounting for the transitivity relations (18) reduced the number of constraints to 3,764. We performed the constrained MAP estimation using the IPOPT constrained optimization routines (Wächter and Biegler 2006), starting from the unconstrained EAP estimates. (Footnote: The EAP is used here rather than the MAP because it may be calculated without requiring a non-linear optimization. We have elected to use a Bayesian analysis in order to avoid the lack of identifiability of MLEs studied in *****.) It took 2.5 seconds of

CPU time when no constraints were imposed; this increased to 10.4 CPU seconds when the constraints were imposed for $\theta_1$ only, and to 35.5 CPU seconds when the estimates of both $\theta_1$ and $\theta_2$ were constrained to satisfy (17). The mean squared difference between the unconstrained EAP and MAP estimates was smaller than the mean squared difference between the unconstrained MAP and the MAP constrained to be regular in both $\theta_1$ and $\theta_2$; thus, the effect of enforcing regularity constraints has a somewhat larger magnitude than the choice of estimate for $\theta$. Those subjects who appeared in the four pairs violating the constraints had, on average, twice the squared displacement of the others. When the constraint was imposed on $\theta_2$ only, the mean squared difference was reduced further. Some of this difference is due to the numerical approximations used for constrained optimization within IPOPT.

It appears, then, that imposing constraints to avoid what might be termed observed paradoxical results is computationally feasible for at least moderately sized cohorts and creates only moderate distortion in the statistical estimates of a subject's ability. By way of understanding this, we give the following theorem:

Theorem 8.1. Let $\{\theta_k\}_{k=1}^n$ be a fixed set of finite ability parameters for $n$ subjects. Let $\{\hat\theta_{kN}^u\}_{k=1}^n$ be a set of $\theta$ estimates for each of these $n$ subjects based on $N$ items without constraints. Let $\hat\theta_{kN}^c$, $k = 1, \dots, n$, be the corresponding estimates under the constraints (17). Assume that $N \to \infty$ with $P_j(\theta_k)$ uniformly bounded away from zero and one for all $k = 1, \dots, n$ and $j = 1, \dots, N$. Then $\|\hat\theta_{kN}^c - \hat\theta_{kN}^u\| \to 0$ almost surely.

The proof of this theorem is given in Appendix E. It relies on the fact that observing two subjects with partially ordered response patterns is rare for long tests. In short tests, using Bayesian methods is likely to reduce the incidence of observed non-monotonicity; in long tests it is unlikely that students can be placed in a definite order at all.
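The fact on which Theorem 8.1 relies, that ordered response patterns become rare as tests lengthen, can be seen even in a crude simulation. The sketch below simplifies matters by drawing independent fifty-fifty responses rather than responses from an IRT model; the qualitative conclusion is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def comparable_fraction(n_subjects, n_items):
    """Fraction of subject pairs whose 0/1 response patterns are ordered."""
    Y = rng.integers(0, 2, size=(n_subjects, n_items))
    count = total = 0
    for j in range(n_subjects):
        for k in range(j + 1, n_subjects):
            total += 1
            d = Y[j] - Y[k]
            if (d >= 0).all() or (d <= 0).all():    # y_j >= y_k or y_j <= y_k
                count += 1
    return count / total

short, long_ = comparable_fraction(200, 5), comparable_fraction(200, 40)
print(short, long_)
assert long_ < short   # ordered pairs become rare as the test lengthens
```

For independent fair-coin responses, the probability that a pair is ordered is $2(3/4)^N - (1/2)^N$, which decays geometrically in $N$; with 40 items almost no pair is comparable, so the constraints in (17) almost never bind.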
Thus, if observed paradoxical results are the only concern, enforcing regularity constraints will remove the few instances of them without unduly distorting the estimated parameters.

The constrained optimization above would avoid the scenario described in the introduction. However, it has the disadvantage of changing the results if more subjects are added. It also does not address the counterfactual "if I had only gotten this question wrong, I would have passed." There is, of course, no reason that we need restrict to only those response patterns seen. One way to avoid these concerns is to estimate parameters for all possible response patterns, rather than just those observed, under all relevant regularity constraints. Estimates for $\theta$ then amount to matching the observed response pattern to our enumeration of all possible responses and assigning the corresponding score. For an $N$-item test with a $d$-dimensional ability space we would need to estimate all $d \, 2^N$ possible ability parameters under $d \sum_{k=0}^{N} \binom{N}{k} (N - k)$ constraints. Doing so is clearly infeasible for all but very short tests.

9 Conclusion

Jane and Jill from our opening story were meant to demonstrate the real-world consequences of paradoxical results. Clearly, any testing procedure in which doing better has adverse consequences has implications for fairness. Theirs is not a purely hypothetical case, nor a case that could occur only under pathological conditions. ***** (2007) found four pairs of subjects that matched their situation in a real-world data set, and a much larger number who could have done better by deliberately answering questions incorrectly; the current paper complements these empirical findings by providing a mathematical explanation for how and when statistical estimation can produce paradoxical results. Our discussion has uncovered sufficient conditions for paradoxical results in multidimensional item response models.
Paradoxical results in statistical estimates for multidimensional item response models are common, and they are ubiquitous in some of the most popular models. This does not make them unreasonable from a statistical standpoint, and it should not preclude the use of multidimensional models in diagnostic testing. However, our results do provide reason to be cautious about their use in high-stakes tests. Our results do not provide an analysis of the extensions cited in Sec. 7 at the level of completeness given in Sec. 6. Further extensions, such as an analysis of item bundles, should also be investigated. Should the user wish both to use statistical methods and to maintain regularity, the constrained estimates discussed in Sec. 8 provide a tractable means of at least removing observable paradoxical results. When the broader counterfactual is of concern, the practitioner may reconsider the use of multidimensional testing.

References

Ackerman, T. (1996). Graphical representation of multidimensional item response theory analyses. Applied Psychological Measurement 20(4).

Antal, T. (2007). On multidimensional item response theory: a coordinate free approach. Electronic Journal of Statistics 1.

Bock, R., R. Gibbons, and E. Muraki (1988). Full-information item factor analysis. Applied Psychological Measurement 12.

Bolt, D. and V. Lall (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement 27(6).

Finkelman, M., G. Hooker, and J. Wang. Unidentifiability and lack of monotonicity in the multidimensional three-parameter logistic model. Under review.

Haertel, E. (1990). Continuous and discrete latent structure models of item response data. Psychometrika 55.

Junker, B. and K. Sijtsma (2000). Latent and manifest monotonicity in item response models. Applied Psychological Measurement 24.

Reckase, M. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement 9.

Reckase, M. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement 21.

Roussos, L., L. DiBello, W. Stout, S. Hartz, R. Henson, and J. Templin (2007). The fusion model skills diagnosis system. Cambridge, UK: Cambridge University Press.

Segall, D. O. (2000). Principles of multidimensional adaptive testing. In W. J. van der Linden and C. A. W. Glas (Eds.), Computerized Adaptive Testing: Theory and Practice. Boston: Kluwer Academic Publishers.

Thissen, D. and L. Steinberg (1997). A response model for multiple choice items. In W. J. van der Linden and R. K. Hambleton (Eds.), Handbook of Modern Item Response Theory. New York: Springer-Verlag.

van der Linden, W. J. (1998). Stochastic order in dichotomous item response models for fixed, adaptive and multidimensional tests. Psychometrika 63(3).

van der Linden, W. J. (1999). Multidimensional adaptive testing with a minimum error-variance criterion. Journal of Educational and Behavioral Statistics 24.

von Davier, M. (2005). A general diagnostic model applied to language testing data. ETS Research Report RR-05-16, Educational Testing Service, Princeton, NJ.


More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY. Yu-Feng Chang

A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY. Yu-Feng Chang A Restricted Bi-factor Model of Subdomain Relative Strengths and Weaknesses A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Yu-Feng Chang IN PARTIAL FULFILLMENT

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Bayesian Nonparametric Rasch Modeling: Methods and Software

Bayesian Nonparametric Rasch Modeling: Methods and Software Bayesian Nonparametric Rasch Modeling: Methods and Software George Karabatsos University of Illinois-Chicago Keynote talk Friday May 2, 2014 (9:15-10am) Ohio River Valley Objective Measurement Seminar

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Hierarchical Cognitive Diagnostic Analysis: Simulation Study

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Hierarchical Cognitive Diagnostic Analysis: Simulation Study Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 38 Hierarchical Cognitive Diagnostic Analysis: Simulation Study Yu-Lan Su, Won-Chan Lee, & Kyong Mi Choi Dec 2013

More information

Statistical Analysis of Q-matrix Based Diagnostic. Classification Models

Statistical Analysis of Q-matrix Based Diagnostic. Classification Models Statistical Analysis of Q-matrix Based Diagnostic Classification Models Yunxiao Chen, Jingchen Liu, Gongjun Xu +, and Zhiliang Ying Columbia University and University of Minnesota + Abstract Diagnostic

More information

Statistics. Lecture 4 August 9, 2000 Frank Porter Caltech. 1. The Fundamentals; Point Estimation. 2. Maximum Likelihood, Least Squares and All That

Statistics. Lecture 4 August 9, 2000 Frank Porter Caltech. 1. The Fundamentals; Point Estimation. 2. Maximum Likelihood, Least Squares and All That Statistics Lecture 4 August 9, 2000 Frank Porter Caltech The plan for these lectures: 1. The Fundamentals; Point Estimation 2. Maximum Likelihood, Least Squares and All That 3. What is a Confidence Interval?

More information

Comparison between conditional and marginal maximum likelihood for a class of item response models

Comparison between conditional and marginal maximum likelihood for a class of item response models (1/24) Comparison between conditional and marginal maximum likelihood for a class of item response models Francesco Bartolucci, University of Perugia (IT) Silvia Bacci, University of Perugia (IT) Claudia

More information

Mathematical Statistics

Mathematical Statistics Mathematical Statistics MAS 713 Chapter 8 Previous lecture: 1 Bayesian Inference 2 Decision theory 3 Bayesian Vs. Frequentist 4 Loss functions 5 Conjugate priors Any questions? Mathematical Statistics

More information

A noninformative Bayesian approach to domain estimation

A noninformative Bayesian approach to domain estimation A noninformative Bayesian approach to domain estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu August 2002 Revised July 2003 To appear in Journal

More information

A Bayesian solution for a statistical auditing problem

A Bayesian solution for a statistical auditing problem A Bayesian solution for a statistical auditing problem Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 July 2002 Research supported in part by NSF Grant DMS 9971331 1 SUMMARY

More information

Group Dependence of Some Reliability

Group Dependence of Some Reliability Group Dependence of Some Reliability Indices for astery Tests D. R. Divgi Syracuse University Reliability indices for mastery tests depend not only on true-score variance but also on mean and cutoff scores.

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Bayesian Analysis for Natural Language Processing Lecture 2

Bayesian Analysis for Natural Language Processing Lecture 2 Bayesian Analysis for Natural Language Processing Lecture 2 Shay Cohen February 4, 2013 Administrativia The class has a mailing list: coms-e6998-11@cs.columbia.edu Need two volunteers for leading a discussion

More information

A RATE FUNCTION APPROACH TO THE COMPUTERIZED ADAPTIVE TESTING FOR COGNITIVE DIAGNOSIS. Jingchen Liu, Zhiliang Ying, and Stephanie Zhang

A RATE FUNCTION APPROACH TO THE COMPUTERIZED ADAPTIVE TESTING FOR COGNITIVE DIAGNOSIS. Jingchen Liu, Zhiliang Ying, and Stephanie Zhang A RATE FUNCTION APPROACH TO THE COMPUTERIZED ADAPTIVE TESTING FOR COGNITIVE DIAGNOSIS Jingchen Liu, Zhiliang Ying, and Stephanie Zhang columbia university June 16, 2013 Correspondence should be sent to

More information

INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION. 1. Introduction

INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION. 1. Introduction INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION VICTOR CHERNOZHUKOV CHRISTIAN HANSEN MICHAEL JANSSON Abstract. We consider asymptotic and finite-sample confidence bounds in instrumental

More information

On the Use of Nonparametric ICC Estimation Techniques For Checking Parametric Model Fit

On the Use of Nonparametric ICC Estimation Techniques For Checking Parametric Model Fit On the Use of Nonparametric ICC Estimation Techniques For Checking Parametric Model Fit March 27, 2004 Young-Sun Lee Teachers College, Columbia University James A.Wollack University of Wisconsin Madison

More information

Nonparametric Online Item Calibration

Nonparametric Online Item Calibration Nonparametric Online Item Calibration Fumiko Samejima University of Tennesee Keynote Address Presented June 7, 2007 Abstract In estimating the operating characteristic (OC) of an item, in contrast to parametric

More information

Item Response Theory (IRT) Analysis of Item Sets

Item Response Theory (IRT) Analysis of Item Sets University of Connecticut DigitalCommons@UConn NERA Conference Proceedings 2011 Northeastern Educational Research Association (NERA) Annual Conference Fall 10-21-2011 Item Response Theory (IRT) Analysis

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

The Robustness of LOGIST and BILOG IRT Estimation Programs to Violations of Local Independence

The Robustness of LOGIST and BILOG IRT Estimation Programs to Violations of Local Independence A C T Research Report Series 87-14 The Robustness of LOGIST and BILOG IRT Estimation Programs to Violations of Local Independence Terry Ackerman September 1987 For additional copies write: ACT Research

More information

Basic IRT Concepts, Models, and Assumptions

Basic IRT Concepts, Models, and Assumptions Basic IRT Concepts, Models, and Assumptions Lecture #2 ICPSR Item Response Theory Workshop Lecture #2: 1of 64 Lecture #2 Overview Background of IRT and how it differs from CFA Creating a scale An introduction

More information

The Difficulty of Test Items That Measure More Than One Ability

The Difficulty of Test Items That Measure More Than One Ability The Difficulty of Test Items That Measure More Than One Ability Mark D. Reckase The American College Testing Program Many test items require more than one ability to obtain a correct response. This article

More information

Diagnostic Classification Models: Psychometric Issues and Statistical Challenges

Diagnostic Classification Models: Psychometric Issues and Statistical Challenges Diagnostic Classification Models: Psychometric Issues and Statistical Challenges Jonathan Templin Department of Educational Psychology The University of Georgia University of South Carolina Talk Talk Overview

More information

1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016

1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016 AM 1: Advanced Optimization Spring 016 Prof. Yaron Singer Lecture 11 March 3rd 1 Overview In this lecture we will introduce the notion of online convex optimization. This is an extremely useful framework

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Whats beyond Concerto: An introduction to the R package catr. Session 4: Overview of polytomous IRT models

Whats beyond Concerto: An introduction to the R package catr. Session 4: Overview of polytomous IRT models Whats beyond Concerto: An introduction to the R package catr Session 4: Overview of polytomous IRT models The Psychometrics Centre, Cambridge, June 10th, 2014 2 Outline: 1. Introduction 2. General notations

More information

Bootstrap Tests: How Many Bootstraps?

Bootstrap Tests: How Many Bootstraps? Bootstrap Tests: How Many Bootstraps? Russell Davidson James G. MacKinnon GREQAM Department of Economics Centre de la Vieille Charité Queen s University 2 rue de la Charité Kingston, Ontario, Canada 13002

More information

Bayesian Networks in Educational Assessment Tutorial

Bayesian Networks in Educational Assessment Tutorial Bayesian Networks in Educational Assessment Tutorial Session V: Refining Bayes Nets with Data Russell Almond, Bob Mislevy, David Williamson and Duanli Yan Unpublished work 2002-2014 ETS 1 Agenda SESSION

More information

Likelihood and p-value functions in the composite likelihood context

Likelihood and p-value functions in the composite likelihood context Likelihood and p-value functions in the composite likelihood context D.A.S. Fraser and N. Reid Department of Statistical Sciences University of Toronto November 19, 2016 Abstract The need for combining

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS 1. THE CLASS OF MODELS y t {y s, s < t} p(y t θ t, {y s, s < t}) θ t = θ(s t ) P[S t = i S t 1 = j] = h ij. 2. WHAT S HANDY ABOUT IT Evaluating the

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

MULTIPLE CHOICE QUESTIONS DECISION SCIENCE

MULTIPLE CHOICE QUESTIONS DECISION SCIENCE MULTIPLE CHOICE QUESTIONS DECISION SCIENCE 1. Decision Science approach is a. Multi-disciplinary b. Scientific c. Intuitive 2. For analyzing a problem, decision-makers should study a. Its qualitative aspects

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Step-Stress Models and Associated Inference

Step-Stress Models and Associated Inference Department of Mathematics & Statistics Indian Institute of Technology Kanpur August 19, 2014 Outline Accelerated Life Test 1 Accelerated Life Test 2 3 4 5 6 7 Outline Accelerated Life Test 1 Accelerated

More information

Bayesian vs frequentist techniques for the analysis of binary outcome data

Bayesian vs frequentist techniques for the analysis of binary outcome data 1 Bayesian vs frequentist techniques for the analysis of binary outcome data By M. Stapleton Abstract We compare Bayesian and frequentist techniques for analysing binary outcome data. Such data are commonly

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Computerized Adaptive Testing for Cognitive Diagnosis. Ying Cheng University of Notre Dame

Computerized Adaptive Testing for Cognitive Diagnosis. Ying Cheng University of Notre Dame Computerized Adaptive Testing for Cognitive Diagnosis Ying Cheng University of Notre Dame Presented at the Diagnostic Testing Paper Session, June 3, 2009 Abstract Computerized adaptive testing (CAT) has

More information

DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling

DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling Due: Tuesday, May 10, 2016, at 6pm (Submit via NYU Classes) Instructions: Your answers to the questions below, including

More information

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida Bayesian Statistical Methods Jeff Gill Department of Political Science, University of Florida 234 Anderson Hall, PO Box 117325, Gainesville, FL 32611-7325 Voice: 352-392-0262x272, Fax: 352-392-8127, Email:

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation Patrick Breheny February 8 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/27 Introduction Basic idea Standardization Large-scale testing is, of course, a big area and we could keep talking

More information

Tutorial on Approximate Bayesian Computation

Tutorial on Approximate Bayesian Computation Tutorial on Approximate Bayesian Computation Michael Gutmann https://sites.google.com/site/michaelgutmann University of Helsinki Aalto University Helsinki Institute for Information Technology 16 May 2016

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 23 Comparison of Three IRT Linking Procedures in the Random Groups Equating Design Won-Chan Lee Jae-Chun Ban February

More information

Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn!

Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn! Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn! Questions?! C. Porciani! Estimation & forecasting! 2! Cosmological parameters! A branch of modern cosmological research focuses

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information