PSYCHOMETRIKA VOL. 75, NO. 3, SEPTEMBER 2010

NESTED LOGIT MODELS FOR MULTIPLE-CHOICE ITEM RESPONSE DATA

YOUNGSUK SUH
UNIVERSITY OF TEXAS AT AUSTIN

DANIEL M. BOLT
UNIVERSITY OF WISCONSIN-MADISON

Nested logit item response models for multiple-choice data are presented. Relative to previous models, the new models are suggested to provide a better approximation to multiple-choice items where the application of a solution strategy precedes consideration of response options. In practice, the models also accommodate collapsibility across all distractor categories, making it easier to allow decisions about including distractor information to occur on an item-by-item or application-by-application basis without altering the statistical form of the correct response curves. Marginal maximum likelihood estimation algorithms for the models are presented along with simulation and real data analyses.

Key words: multiple-choice items, multiple-choice models, nested logit models, nominal response model, marginal maximum likelihood estimation, item information, distractor selection information, distractor category collapsibility.

Multiple-choice items are a common form of test item in standardized testing and have been a focus of item response theory (IRT) modeling for decades. A major challenge in building appropriate models for multiple-choice tests is the variety of strategies that can be used in responding to items, and the potential for such strategies to vary depending on the type of test item or test (Hutchinson, 1991). Perhaps the most common IRT approach to modeling multiple-choice item responses is reflected by Bock's Nominal Response Model (NRM; Bock, 1972) and related models such as Thissen and Steinberg's Multiple Choice Model (MCM; Thissen & Steinberg, 1984) and Samejima's Guessing Model (SGM; Samejima, 1979).
Such models portray the item response in a competing-utility framework where each response category is associated with a selection propensity that is a function of the ability measured by the test. More recent models (e.g., Revuelta, 2004, 2005) are based on the same general framework but are designed to possess attractive statistical properties, such as rising distractor selection ratios (Love, 1997). The NRM modeling approach seems most apt for conditions in which the item response is based on a comparative evaluation of all response categories. Consider the following item from a test of English usage:

Example Item 1. Select the one underlined part of the following sentence that must be changed in order to make the sentence grammatically correct.

The average soda can has a tensile strong capable of supporting a weight
(A) (B) (C) (D)
of one hundred kilograms.

Requests for reprints should be sent to Youngsuk Suh, Department of Educational Psychology, University of Texas at Austin, 1 University Station D5800, Austin, TX 78712, USA. yssuh327@gmail.com

2010 The Psychometric Society

An anticipated strategy for such an item entails evaluating each response option and selecting the option that seems to best satisfy the requirement of the stem. In this case, option (B) should emerge as the appropriate selection for a high-ability examinee. By contrast, many multiple-choice item types invoke strategies that involve problem-solving independent of the response categories. Consider, for example, the following item from a test of elementary mathematics:

Example Item 2. A store sells 168 CDs each week. How many CDs does it sell in 24 weeks?
(A) 2196
(B) 3210
(C) 4032
(D) 6118

An expected strategy for this item would entail multiplying 168 by 24, which leads to 4032, and selecting response option (C), with no more than a surface-level evaluation of the other response options as not being correct. Under such a strategy, the distractors are only considered as potential responses if the examinee is unable to solve the item. The process might be viewed as one in which a problem-solving strategy precedes a guessing strategy (see, e.g., Hutchinson, 1991; San Martin, del Pino, & De Boeck, 2006), and where evaluation of response options only occurs when the problem-solving strategy fails.

The purpose of this paper is to present a modeling framework that may provide a better approximation to this latter process. It is further shown that the new models, unlike previous models for multiple-choice data, possess an attractive property of category collapsibility across all distractor options. This property is argued to be of practical value in settings where distractor information may be needed for some applications of the IRT model but not others.
For example, studies of cheating behavior (Wollack, 1997) or appropriateness measurement (Drasgow, Levine, & Williams, 1985) commonly find value in attending to distractor selection information, but are often conducted in the context of tests where items are intended to be scored correct/incorrect, and where applications such as test equating may be more easily handled using traditional binary models. In still other applications, such as when using an IRT model to estimate latent ability, attending to distractor selection may be useful for some items but not others, depending on the ability of the item writer to design distractors whose attractiveness varies in relation to the trait. It would thus appear that IRT models that are consistent in how the correct response option is modeled, whether including or excluding distractor information, could be of considerable practical value.

A potential limitation of multiple-choice models such as the NRM is their lack of a distractor collapsibility property. The decision to model all possible responses under the NRM, for example, implies a multivariate logit with as many categories as response options. Assuming an item with more than two categories, the correct response characteristic curve under the NRM is incompatible with the corresponding curve under a binary logistic model where the item is scored dichotomously (0/1). Figure 1 provides an illustration of the best-fitting NRM and 2-parameter logistic model (2PLM) correct response curves applied to the same five-category test item using the same response data, but where all distractors are scored incorrect in the 2PLM case. (The item and data for this example come from a real data illustration described shortly.) As the example illustrates, the difference between curves can be fairly substantial. A second purpose of this paper is therefore to illustrate the advantages of a model that possesses collapsibility with respect to all distractor categories.
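The incompatibility can be illustrated numerically. In the sketch below (all parameter values are hypothetical, and the divide-by-total form anticipates the NRM of Section 1), the logit of the collapsed correct-response probability is not linear in θ, whereas a 2PLM logit is linear by construction:

```python
import math

def nrm_probs(theta, slopes, intercepts):
    """Category selection probabilities under a divide-by-total (softmax) model."""
    z = [a * theta + c for a, c in zip(slopes, intercepts)]
    m = max(z)  # subtract the max to stabilize the softmax
    ez = [math.exp(v - m) for v in z]
    s = sum(ez)
    return [v / s for v in ez]

# Hypothetical 5-category item; the last category is the correct response.
# Slopes and intercepts each sum to zero (the usual identification constraint).
slopes = [-0.8, -0.5, -0.2, 0.0, 1.5]
intercepts = [0.4, 0.2, 0.0, -0.1, -0.5]

def correct_logit(theta):
    """Logit of the collapsed (correct vs. all-distractor) probability."""
    p = nrm_probs(theta, slopes, intercepts)[4]
    return math.log(p / (1.0 - p))

# A 2PLM logit is exactly linear in theta, so its second difference is zero;
# the collapsed divide-by-total curve has nonzero curvature.
curvature = correct_logit(1.0) - 2 * correct_logit(0.0) + correct_logit(-1.0)
print(abs(curvature) > 1e-6)  # True
```

The nonzero second difference of the collapsed logit is exactly why a separate binary 2PLM fit to correct/incorrect scores cannot reproduce the NRM's correct-response curve.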
In the current paper, we use the models to examine distractor information on an item-by-item basis, and demonstrate the potential to retain or ignore information provided by distractors in a variable way using just one model.

FIGURE 1. Illustration of best-fitting NRM correct response curve and 2PLM curve.

1. Bock's Nominal Response Model

The NRM uses a multivariate logit to model category selection. Assume v = 1, ..., m_i possible response categories for item i. A propensity function Z_iv(θ_j) = ζ_iv + λ_iv θ_j represents the attractiveness of category v as a function of an examinee ability level θ_j using two item category parameters: a slope parameter λ_iv and an intercept parameter ζ_iv. Z_iv(θ_j) is mapped to a probability metric as

$$P_{iv}(\theta_j) = \frac{\exp Z_{iv}(\theta_j)}{\sum_{k=1}^{m_i} \exp Z_{ik}(\theta_j)}. \quad (1)$$

The probability of selecting category v is thus affected not only by the propensity toward v, but also by the propensities toward all other categories, making the NRM a divide-by-total model (Thissen & Steinberg, 1986). To resolve a statistical indeterminacy, for all θ_j we set

$$\sum_{v=1}^{m_i} Z_{iv}(\theta_j) = 0, \quad (2)$$

which also implies $\sum_{v=1}^{m_i} \lambda_{iv} = 0$ and $\sum_{v=1}^{m_i} \zeta_{iv} = 0$, resulting in 2(m_i − 1) free parameters to be estimated per item. (For detailed NRM estimation procedures, see Baker & Kim, 2004.) Despite extensions of the NRM to address issues related to random guessing (e.g., the MCM of Thissen & Steinberg, 1984; the SGM of Samejima, 1979), the NRM generally provides as good a fit to real data as these more complex models (Drasgow, Levine, Tsien, Williams, & Mead, 1995).

2. Nested Logit Models

Nested logit models (NLMs; McFadden, 1981, 1982) provide an alternative to multivariate logit models, and are appropriate for choice settings where selection possesses a hierarchical structure, as when a final choice decision is made through a sequential process. An NLM represents the final choice among a discrete set of choice options conditional upon choices made at

higher levels in the hierarchy. The resulting probability of each discrete choice is modeled as a product of the conditional and unconditional probabilities across levels of the hierarchy. In this paper, we adapt the NLM approach to incorporate latent traits, such as an ability θ in IRT, to provide a competing approach to the NRM for multiple-choice test items. Using the NLM framework, we assume the correct response probability to be formulated by the 2PLM or the 3-parameter logistic model (3PLM), and model distractor selection conditional upon an incorrect response using Bock's NRM. This results in an NLM having two levels: a higher level (level 1) introducing branches that distinguish a correct versus an incorrect response, and a lower level (level 2) introducing branches that distinguish distractors. The response options are consequently separated into two nests, one nest possessing the correct response only, and the second nest possessing all distractors. Formulated in this way, NLMs provide a different portrayal of how the examinee arrives at a correct response; while the NRM emphasizes a comparative evaluation of response options, the NLMs emphasize a solution strategy that occurs independent of evaluating the options. Although the most accurate representation probably lies somewhere in between (see Section 5), the NLM strategy might be expected to provide a better approximation for many multiple-choice items, such as items represented by example item 2.

2.1. 3PL-Nested Logit Model

While we will consider both 2PLM and 3PLM versions of the NLMs described above (denoted 2PL-NLM and 3PL-NLM, respectively), we consider the 3PL-NLM in greater detail, recognizing the 2PL-NLM as a special case. Suppose a multiple-choice test is composed of n items and each item has one correct answer and m_i distractors, or a total of m_i + 1 response categories.
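The two-level structure just described can be sketched in code; the following is an illustrative implementation with hypothetical parameter values (the formal definitions are given below), where setting gamma = 0 yields the 2PL-NLM:

```python
import math

def nlm_category_probs(theta, alpha, beta, gamma, lam, zeta):
    """Response-category probabilities under a 3PL nested logit model:
    a 3PLM for the correct response at level 1, and a conditional
    divide-by-total model over the distractors given an incorrect
    response at level 2.  Setting gamma = 0 gives the 2PL-NLM."""
    # Level 1: probability of a correct response (3PLM).
    p_correct = gamma + (1.0 - gamma) / (1.0 + math.exp(-(beta + alpha * theta)))
    # Level 2: distractor choice conditional on an incorrect response
    # (softmax over the distractor categories only).
    z = [c + a * theta for a, c in zip(lam, zeta)]
    m = max(z)
    ez = [math.exp(v - m) for v in z]
    s = sum(ez)
    p_distractors = [(1.0 - p_correct) * v / s for v in ez]
    return p_correct, p_distractors

# Hypothetical item: distractor slopes/intercepts sum to zero (identification).
p1, pd = nlm_category_probs(theta=0.5, alpha=1.2, beta=-0.3, gamma=0.2,
                            lam=[0.6, -0.1, -0.5], zeta=[0.3, 0.1, -0.4])
print(round(p1 + sum(pd), 10))  # 1.0: the categories partition the outcome space
```

Note that, unlike the NRM, the correct response never enters the softmax denominator; collapsing the distractors simply removes the level-2 factor and leaves the level-1 curve untouched.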
Let U_ij represent the item i response by examinee j (j = 1, ..., N) once keyed for correctness (i.e., U_ij = 1 if correct, 0 if incorrect). Further, let D_ijv denote the item response in an item × examinee × distractor category array such that D_ijv = 1 when examinee j selects distractor category v (v = 1, ..., m_i) of item i, and 0 otherwise. Under the 3PLM, the probability that an examinee of ability θ_j chooses the correct response category on item i is modeled as

$$P(U_{ij}=1 \mid \theta_j) = \gamma_i + (1-\gamma_i)\,\frac{1}{1+\exp[-(\beta_i+\alpha_i\theta_j)]}, \quad (3)$$

where β_i is an intercept parameter, α_i a slope parameter, and γ_i a lower asymptote parameter, also referred to as a pseudo-guessing parameter. The probability that an examinee selects distractor category v is modeled as the product of the probability of an incorrect response and the probability of selecting distractor category v conditional upon an incorrect response:

$$P(U_{ij}=0, D_{ijv}=1 \mid \theta_j) = P(U_{ij}=0 \mid \theta_j)\,P(D_{ijv}=1 \mid U_{ij}=0, \theta_j) = \left\{1-\left[\gamma_i+(1-\gamma_i)\frac{1}{1+\exp[-(\beta_i+\alpha_i\theta_j)]}\right]\right\}\frac{\exp Z_{iv}(\theta_j)}{\sum_{k=1}^{m_i}\exp Z_{ik}(\theta_j)}. \quad (4)$$

As under the NRM, we use Z_iv(θ_j) = ζ_iv + λ_iv θ_j to define a propensity toward each distractor category v, now conditional upon an incorrect response. Unlike the NRM, the denominator in the conditional probability is obtained by summing exp Z_ik(θ_j) across only the distractor categories. Following Bock (1972), an arbitrary linear restriction as in Equation (2) is imposed for the distractor category parameters.

Figure 2 plots item category characteristic curves (ICCCs) for a simulated multiple-choice item with four response categories, where the fourth category represents the correct response. When item responses for the item are scored as binary and analyzed by the 3PLM, the left-side plot represents the characteristic curve for the correct response. For the same item, use of

the 3PL-NLM results in the ICCCs shown to the right.

FIGURE 2. ICCCs for a simulated item under the 3PLM and 3PL-NLM.

It should be noted that the item parameter estimates for α, β, and γ are identical under both models, as the correct response probability is formulated in both instances under the 3PLM and is not informed by the particular distractors selected. Naturally, the 2PL-NLM is also represented by Equations (3) and (4) above, but where γ_i = 0. An appealing feature of the 2PL-NLM is that it contains the same number of parameters as the NRM. Thus, for a given dataset, the two models can be compared with respect to loglikelihood in terms of which provides the better structural representation of the data.

2.2. Item Parameter Estimation via an MML Approach for the 3PL-NLM

Estimation of the 3PL-NLM is possible using a variant of Bock and Aitkin's (1981) marginal maximum likelihood (MML) procedure. Using the U_ij and D_ijv notation above, let U_j denote the correct response vector for examinee j, let D_j represent the response pattern matrix of distractor categories for examinee j, and let [U_j, D_j] denote the complete n × [max(m_i) + 1] item response matrix for examinee j. To simplify the notation in Equations (3) and (4), let P(U_ij = 1 | θ_j) = P_i(θ_j), P(U_ij = 0, D_ijv = 1 | θ_j) = P_iv(θ_j), and P(D_ijv = 1 | U_ij = 0, θ_j) = P_{iv|u=0}(θ_j). Then, assuming local independence, the conditional probability of a response pattern matrix for examinee j given θ_j is the joint probability

$$P([U_j, D_j] \mid \theta_j, \varpi) = \prod_{i=1}^{n}\left[P_i(\theta_j)^{u_{ij}}\prod_{v=1}^{m_i}P_{iv}(\theta_j)^{d_{ijv}}\right] = \prod_{i=1}^{n}\left[P_i(\theta_j)^{u_{ij}}\prod_{v=1}^{m_i}\big(1-P_i(\theta_j)\big)^{d_{ijv}}P_{iv|u=0}(\theta_j)^{d_{ijv}}\right], \quad (5)$$

where ϖ denotes all item parameters. Under Bock and Aitkin's (1981) approach, the marginal probability of the observed response pattern for examinee j is expressed as

$$P([U_j, D_j]) = \int P([U_j, D_j] \mid \theta, \varpi)\,g(\theta \mid \tau)\,d\theta,$$

where g(θ | τ) is a density function with unknown parameters τ. (The j subscript on θ is dropped because θ_j can be seen as a random subject sampled from a population.) When combined across examinees, we write the likelihood as $L = \prod_{j=1}^{N} P([U_j, D_j])$, and the natural logarithm of the likelihood is

$$\log L = \sum_{j=1}^{N} \log P([U_j, D_j]). \quad (6)$$

The total number of estimable item parameters is $\sum_{i=1}^{n}(2m_i+3)$. However, in deriving the likelihood equations, it proves convenient to substitute, following the restriction of Equation (2), a reparameterization of the NRM probability (i.e., P_{iv|u=0}(θ_j)) that reduces the number of parameters by 2n. Following Bock (1972), instead of estimating ζ_v and λ_v (v = 1, ..., m_i), we use parameters η_v and ξ_v (v = 1, ..., m_i − 1) that are defined by difference contrasts of the parameters ζ_v and λ_v. For example, when m_i = 3, the new parameters are defined as

$$\begin{bmatrix}\eta_1 & \xi_1\\ \eta_2 & \xi_2\end{bmatrix} = T\begin{bmatrix}\zeta_1 & \lambda_1\\ \zeta_2 & \lambda_2\\ \zeta_3 & \lambda_3\end{bmatrix} = \begin{bmatrix}1 & -1 & 0\\ 1 & 0 & -1\end{bmatrix}\begin{bmatrix}\zeta_1 & \lambda_1\\ \zeta_2 & \lambda_2\\ \zeta_3 & \lambda_3\end{bmatrix} = \begin{bmatrix}\zeta_1-\zeta_2 & \lambda_1-\lambda_2\\ \zeta_1-\zeta_3 & \lambda_1-\lambda_3\end{bmatrix}, \quad (7)$$

where T is a transformation matrix. The likelihood equations can be derived with respect to these new parameters η_v and ξ_v for the distractor categories, as well as with respect to the item parameters for the correct response category (β, α, and γ). Suppose ω_ih represents an item parameter to be estimated for item i and category h. The likelihood in Equation (6) can be differentiated with respect to ω_ih as

$$\frac{\partial \log L}{\partial \omega_{ih}} = \sum_{j=1}^{N}\big\{P([U_j, D_j])\big\}^{-1}\int\left[\frac{\partial}{\partial\omega_{ih}}\log P([U_j, D_j] \mid \theta, \varpi)\right]P([U_j, D_j] \mid \theta, \varpi)\,g(\theta \mid \tau)\,d\theta,$$

where

$$\log P([U_j, D_j] \mid \theta, \varpi) = \sum_{i=1}^{n}\left\{u_{ij}\log P_i(\theta) + \sum_{v=1}^{m_i} d_{ijv}\big[\log\big(1-P_i(\theta)\big)+\log P_{iv|u=0}(\theta)\big]\right\}. \quad (8)$$
When derived for the correct response category of item i, the first partial derivative of Equation (8) with respect to ω_ih can be written as

$$\frac{\partial}{\partial\omega_{ih}}\log P([U_j, D_j] \mid \theta, \varpi) = \frac{\partial}{\partial\omega_{ih}}\left[u_{ij}\log P_i(\theta) + \sum_{v=1}^{m_i} d_{ijv}\log\big(1-P_i(\theta)\big)\right] = \frac{\partial}{\partial\omega_{ih}}\left[u_{ij}\log P_i(\theta) + (1-u_{ij})\log\big(1-P_i(\theta)\big)\right]. \quad (9)$$

The summation across items in Equation (8) can be eliminated by assuming that the item parameter estimates are independent across items. As shown in Equation (9), the estimation of the

correct category parameters proceeds independent of the distractor category parameters. The first derivative for a distractor category parameter reduces to

$$\frac{\partial}{\partial\omega_{ih}}\log P([U_j, D_j] \mid \theta, \varpi) = \frac{\partial}{\partial\omega_{ih}}\left[\sum_{v=1}^{m_i} d_{ijv}\log P_{iv|u=0}(\theta)\right], \quad (10)$$

implying the distractor category parameters can be estimated independent of the correct category parameters. Estimates of the new parameters η and ξ can then be used to find the values of the estimates of the original and more conventional parameters ζ and λ for each item. For the case in which m_i = 3, using Equation (7) and the constraints (i.e., $\sum_{v=1}^{m_i}\zeta_{iv}=0$ and $\sum_{v=1}^{m_i}\lambda_{iv}=0$) yields

$$\zeta_1 = \frac{\eta_1+\eta_2}{3}, \qquad \zeta_2 = \frac{\eta_2-2\eta_1}{3}, \qquad \zeta_3 = \frac{\eta_1-2\eta_2}{3}, \quad (11)$$

and

$$\lambda_1 = \frac{\xi_1+\xi_2}{3}, \qquad \lambda_2 = \frac{\xi_2-2\xi_1}{3}, \qquad \lambda_3 = \frac{\xi_1-2\xi_2}{3}. \quad (12)$$

An EM estimation algorithm was programmed using FORTRAN (Digital Equipment Corporation, 1997). The quadrature points and weights, and initial values of the item parameters, were chosen using the same default values as in BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 2003). The convergence criterion for both the Newton-Raphson iterations and the EM algorithm in terms of parameter change was set to , and the maximum number of EM cycles to 200. Additional details on derivations of the likelihood equations and implementation of the EM algorithm (including the procedure for computing the standard errors of the item parameter estimates) and the software can be obtained from the first author.

2.3. Information Functions for the 3PL-NLM

A potential advantage of the NLMs relates to their quantification of item information, and the ease of studying the relative contribution of distractor categories. Due to the distractor collapsibility property, it becomes possible to directly compare the relative amounts of information provided when including versus excluding distractor information using the estimates of just the one model.
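As a rough numerical sketch of the marginal likelihood that the MML/EM machinery of Section 2.2 maximizes, the following uses a crude normal-weighted grid in place of proper quadrature; the items, response patterns, and function names are hypothetical:

```python
import math

def nlm_cat_probs(theta, alpha, beta, gamma, lam, zeta):
    """Correct-response probability and distractor probabilities (NLM)."""
    p1 = gamma + (1 - gamma) / (1 + math.exp(-(beta + alpha * theta)))
    z = [c + a * theta for a, c in zip(lam, zeta)]
    m = max(z)
    ez = [math.exp(v - m) for v in z]
    s = sum(ez)
    return p1, [(1 - p1) * v / s for v in ez]

def marginal_loglik(responses, items, nodes=21):
    """log L = sum_j log  P([U_j, D_j] | theta) g(theta) dtheta, approximated
    on an evenly spaced grid with N(0, 1) weights -- a crude stand-in for the
    quadrature used in BILOG-MG-style MML estimation."""
    thetas = [-4 + 8 * q / (nodes - 1) for q in range(nodes)]
    w = [math.exp(-t * t / 2) for t in thetas]
    wsum = sum(w)
    w = [v / wsum for v in w]  # normalized grid weights
    total = 0.0
    for pattern in responses:           # pattern[i] = 0 (correct) or v = 1..m_i
        marg = 0.0
        for t, wt in zip(thetas, w):
            lik = 1.0
            for resp, (a, b, g, lam, zeta) in zip(pattern, items):
                p1, pd = nlm_cat_probs(t, a, b, g, lam, zeta)
                lik *= p1 if resp == 0 else pd[resp - 1]
            marg += wt * lik
        total += math.log(marg)
    return total

# Two hypothetical 4-category items (gamma = 0, i.e., the 2PL-NLM).
items = [(1.0, 0.0, 0.0, [0.5, 0.0, -0.5], [0.2, 0.0, -0.2]),
         (1.5, -0.5, 0.0, [0.3, 0.0, -0.3], [0.1, 0.0, -0.1])]
print(marginal_loglik([[0, 0], [2, 0], [0, 3]], items) < 0)  # True: a log-probability
```

An EM algorithm alternates between computing posterior weights for each examinee at the grid points (E-step) and maximizing the weighted complete-data likelihood with respect to the item parameters (M-step); the sketch above only evaluates the objective.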
As noted earlier, such a property is not present in the NRM, where the collapsing of distractor categories results in a change to the correct response ICCC that also implies a change in information. Such a feature makes it difficult to compare information under the two different forms of scoring. Information functions are particularly useful for the NLMs as they can be used to quantify the increase in the precision of ability estimates when attending to distractors. Information functions for the 3PL-NLM can be derived as follows. As shown in Equation (5), the conditional probability of a response pattern matrix for examinee j is written as

$$L_j = P([U_j, D_j] \mid \theta_j, \varpi) = \prod_{i=1}^{n}\left[P_i(\theta_j)^{u_{ij}}\prod_{v=1}^{m_i}Q_i(\theta_j)^{d_{ijv}}P_{iv|u=0}(\theta_j)^{d_{ijv}}\right], \quad (13)$$

where

$$P_i(\theta_j) = \gamma_i + (1-\gamma_i)\,\frac{1}{1+\exp[-(\beta_i+\alpha_i\theta_j)]}, \qquad Q_i(\theta_j) = 1 - P_i(\theta_j),$$

and

$$P_{iv|u=0}(\theta_j) = \frac{\exp Z_{iv}(\theta_j)}{\sum_{k=1}^{m_i}\exp Z_{ik}(\theta_j)}.$$

To simplify notation, let P_i(θ_j) = P_ij, Q_i(θ_j) = Q_ij, P_{iv|u=0}(θ_j) = P_{ijv|u=0}, and Z_ik(θ_j) = Z_ijk. Then taking the natural logarithm of the likelihood function for examinee j yields

$$\log L_j = \sum_{i=1}^{n}\left\{u_{ij}\log P_{ij} + \sum_{v=1}^{m_i}d_{ijv}\big[\log Q_{ij} + \log P_{ijv|u=0}\big]\right\}, \quad (14)$$

and the first partial derivative of the log-likelihood with respect to θ_j is

$$\frac{\partial \log L_j}{\partial \theta_j} = \sum_{i=1}^{n}\left\{u_{ij}\,\alpha_i\frac{P^{*}_{ij}Q_{ij}}{P_{ij}} + \sum_{v=1}^{m_i}d_{ijv}\left[-\alpha_i P^{*}_{ij} + \frac{\sum_{k=1}^{m_i}\exp Z_{ijk}\,(\lambda_{iv}-\lambda_{ik})}{S_{ij}}\right]\right\}, \quad (15)$$

where

$$P^{*}_{i}(\theta_j) = \frac{\exp(\beta_i+\alpha_i\theta_j)}{1+\exp(\beta_i+\alpha_i\theta_j)} \quad\text{and}\quad S_{ij} = \sum_{k=1}^{m_i}\exp Z_{ijk}.$$

Then the second partial derivative of the log-likelihood with respect to θ_j is

$$\frac{\partial^2 \log L_j}{\partial \theta_j^2} = \sum_{i=1}^{n}\left\{-u_{ij}\,\alpha_i^2 P^{*}_{ij}Q_{ij}\frac{P_{ij}^2-\gamma_i}{(1-\gamma_i)P_{ij}^2} + \sum_{v=1}^{m_i}d_{ijv}\left[-\alpha_i^2\frac{P^{*}_{ij}Q_{ij}}{1-\gamma_i} + \frac{\partial}{\partial\theta_j}\left(\frac{\sum_{k=1}^{m_i}\exp Z_{ijk}\,(\lambda_{iv}-\lambda_{ik})}{S_{ij}}\right)\right]\right\}, \quad (16)$$

where

$$\frac{\partial}{\partial\theta_j}\left(\frac{\sum_{k=1}^{m_i}\exp Z_{ijk}\,(\lambda_{iv}-\lambda_{ik})}{S_{ij}}\right) = \frac{S_{ij}\sum_{k=1}^{m_i}\exp Z_{ijk}\,\lambda_{ik}(\lambda_{iv}-\lambda_{ik}) - \left[\sum_{k=1}^{m_i}\exp Z_{ijk}\,(\lambda_{iv}-\lambda_{ik})\right]\left[\sum_{k=1}^{m_i}\lambda_{ik}\exp Z_{ijk}\right]}{S_{ij}^2} \quad (17)$$

and Q_i(θ_j) = 1 − P_i(θ_j). This second partial derivative contains observed data values. Following usual practice (Kendall & Stuart, 1967), the u_ij and d_ijv are replaced by their expected values P_ij and (1 − P_ij)P_{ijv|u=0}, respectively, resulting in

$$-E\!\left(\frac{\partial^2 \log L_j}{\partial \theta_j^2}\right) = \sum_{i=1}^{n}\left[P_{ij}I_{iu}(\theta_j) + \sum_{v=1}^{m_i}P_{ijv}I_{iv}(\theta_j)\right], \quad (18)$$

where P_ijv = (1 − P_ij)P_{ijv|u=0}. The test information function is then given by

$$I(\theta_j) = -E\!\left(\frac{\partial^2 \log L_j}{\partial \theta_j^2}\right) = \sum_{i=1}^{n}\left[P_{ij}I_{iu}(\theta_j) + \sum_{v=1}^{m_i}P_{ijv}I_{iv}(\theta_j)\right]. \quad (19)$$
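The decomposition in Equations (18) and (19) can be checked numerically without closed forms, using central differences to approximate the category information I_v = −∂² log P_v/∂θ² for each category (the item below is hypothetical):

```python
import math

def nlm_cat_probs(theta, alpha, beta, gamma, lam, zeta):
    """All category probabilities: index 0 is the correct response."""
    p1 = gamma + (1 - gamma) / (1 + math.exp(-(beta + alpha * theta)))
    z = [c + a * theta for a, c in zip(lam, zeta)]
    m = max(z)
    ez = [math.exp(v - m) for v in z]
    s = sum(ez)
    return [p1] + [(1 - p1) * v / s for v in ez]

def information_shares(theta, item, h=1e-4):
    """Category information shares P_v(theta) * I_v(theta), where
    I_v = -d^2 log P_v / d theta^2 is approximated by central differences
    (a numerical stand-in for the closed forms in Equations (21)-(22))."""
    p0 = nlm_cat_probs(theta, *item)
    pm = nlm_cat_probs(theta - h, *item)
    pp = nlm_cat_probs(theta + h, *item)
    shares = []
    for lo, mid, hi in zip(pm, p0, pp):
        d2 = (math.log(lo) - 2 * math.log(mid) + math.log(hi)) / (h * h)
        shares.append(-mid * d2)
    return shares  # shares[0]: correct category; shares[1:]: distractors

# Hypothetical 3PL-NLM item: alpha, beta, gamma, distractor lambdas, zetas.
item = (1.2, 0.0, 0.2, [0.5, 0.0, -0.5], [0.2, 0.0, -0.2])
s = information_shares(0.0, item)
print(sum(s) > sum(s[1:]) > 0)  # True: distractors add information beyond the correct category
```

The sum of the shares is the item information; comparing sum(s) against the correct-category share alone quantifies the increment obtained by attending to distractors.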

For each item, the item information function is given by

$$I_i(\theta_j) = P_{ij}I_{iu}(\theta_j) + \sum_{v=1}^{m_i}P_{ijv}I_{iv}(\theta_j), \quad (20)$$

where P_ij I_iu(θ_j) is the contribution of the correct response category to item information, and P_ijv I_iv(θ_j) is the contribution of distractor category v. Each of these terms is referred to as the information share of a category (Baker & Kim, 2004; Samejima, 1969, 1972, 1977). Here, the information share of the correct response category is

$$P_{ij}I_{iu}(\theta_j) = \left[\gamma_i+(1-\gamma_i)\frac{1}{1+\exp[-(\beta_i+\alpha_i\theta_j)]}\right]\left[\alpha_i^2\,P^{*}_{ij}Q_{ij}\,\frac{P_{ij}^2-\gamma_i}{(1-\gamma_i)P_{ij}^2}\right], \quad (21)$$

the same as in the traditional 3PLM, while the information share of any distractor category v is

$$P_{ijv}I_{iv}(\theta_j) = \left\{1-\left[\gamma_i+(1-\gamma_i)\frac{1}{1+\exp[-(\beta_i+\alpha_i\theta_j)]}\right]\right\}\frac{\exp Z_{ijv}}{\sum_{k=1}^{m_i}\exp Z_{ijk}}\left[\alpha_i^2\frac{P^{*}_{ij}Q_{ij}}{1-\gamma_i} + \frac{\big[\sum_{k=1}^{m_i}\exp Z_{ijk}(\lambda_{iv}-\lambda_{ik})\big]\big[\sum_{k=1}^{m_i}\lambda_{ik}\exp Z_{ijk}\big] - S_{ij}\sum_{k=1}^{m_i}\exp Z_{ijk}\,\lambda_{ik}(\lambda_{iv}-\lambda_{ik})}{S_{ij}^2}\right], \quad (22)$$

allowing for quantification of the incremental information provided by the distractor categories.

3. Simulation Studies

Simulation studies were conducted to investigate (1) the parameter recovery of the NLMs, (2) the empirical distinguishability of the 2PL-NLM, 3PL-NLM, and NRM, and (3) the statistical performance of the NLMs in testing for and quantifying distractor information.

3.1. Simulation Study Designs

3.1.1. Simulation 1: Parameter Recovery Study. Parameter recovery for the 2PL-NLM and 3PL-NLM was evaluated for varying sample size (1,000 and 5,000 examinees) and test length (10-, 20-, and 50-item tests) conditions. For each combination of conditions, examinee ability parameters were generated as θ ~ Normal(0, 1). Item parameters were generated randomly from the following distributions: α ~ Uniform(0.75, 2) and β ~ Uniform(−2.5, 2.5) for the correct response category, and λ_v ~ Uniform(−2, 2) and ζ_v ~ Uniform(−2, 2) for the distractor categories, followed by the imposition of the constraints $\sum_{v=1}^{m_i}\zeta_{iv}=0$ and $\sum_{v=1}^{m_i}\lambda_{iv}=0$.
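A minimal sketch of this generating process, under the stated parameter distributions, might look as follows (the centering step is one way to impose the sum-to-zero constraints; the function names are our own):

```python
import math
import random

def draw_item_params(m=3, rng=random):
    """Simulation 1 generating distributions: alpha ~ U(0.75, 2),
    beta ~ U(-2.5, 2.5), distractor lambda_v, zeta_v ~ U(-2, 2),
    then centered so each set sums to zero."""
    alpha = rng.uniform(0.75, 2.0)
    beta = rng.uniform(-2.5, 2.5)
    lam = [rng.uniform(-2.0, 2.0) for _ in range(m)]
    zeta = [rng.uniform(-2.0, 2.0) for _ in range(m)]
    lam = [v - sum(lam) / m for v in lam]
    zeta = [v - sum(zeta) / m for v in zeta]
    return alpha, beta, lam, zeta

def simulate_response(theta, alpha, beta, gamma, lam, zeta, rng=random):
    """Draw one NLM response: 0 = correct, v = 1..m selects distractor v."""
    p1 = gamma + (1.0 - gamma) / (1.0 + math.exp(-(beta + alpha * theta)))
    if rng.random() < p1:
        return 0
    z = [c + a * theta for a, c in zip(lam, zeta)]
    mx = max(z)
    ez = [math.exp(v - mx) for v in z]
    u, acc = rng.random() * sum(ez), 0.0
    for v, w in enumerate(ez, start=1):
        acc += w
        if u <= acc:
            return v
    return len(ez)

rng = random.Random(2024)
alpha, beta, lam, zeta = draw_item_params(3, rng)
data = [simulate_response(rng.gauss(0.0, 1.0), alpha, beta, 0.25, lam, zeta, rng)
        for _ in range(1000)]
print(min(data) >= 0 and max(data) <= 3)  # True: responses fall in the 4 categories
```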
Four-category item responses (one correct response and three distractor categories) were generated following either the 2PL-NLM or 3PL-NLM. The same item parameters were applied to generate both the 2PL-NLM and 3PL-NLM data, with γ for the 3PL-NLM set at 0.25 for all items. As has been observed when estimating the 3PLM, the γ parameter is generally difficult to recover without a prior; our intent in using a constant value of 0.25 was based on our desire to match the true parameter with the prior so as to better ascertain the impact of the presence of the guessing parameter on recovery of the other parameters. Thus, the γ parameter was assigned a beta prior with parameters of 5 and 15 during the EM process for item parameter estimation (for detailed procedures, see Baker & Kim, 2004). 100 replications were simulated for each combination of conditions. The accuracy

of item parameter estimates was evaluated with respect to bias (estimated minus true) and root mean squared error (RMSE). For the distractor category parameters, the estimated η_iv and ξ_iv (v = 1, ..., m_i − 1) were converted to ζ_iv and λ_iv (v = 1, ..., m_i). In order to demonstrate ability parameter estimation and the value of attending to distractors under the NLM, Expected a Posteriori (EAP) estimates were obtained under the 2PLM and 2PL-NLM for the 10- and 50-item test length conditions. Response patterns for 5,000 examinees were simulated at each of 13 discrete θ values ranging from −3.4 to 3.4. Bias and RMSEs were then computed at each of the θ levels generated with respect to both the 2PLM and 2PL-NLM. In addition, the test information functions under the two models were evaluated at each of the θ levels.

3.1.2. Simulation 2: Model Comparison Study. To evaluate the empirical distinguishability of the NLMs and NRM, a second simulation study was conducted in which the NLMs and NRM were fit to data generated from each of the three models. The models were compared using several likelihood-based criteria: AIC (Akaike, 1974), BIC (Schwarz, 1978), and CAIC (Bozdogan, 1987). Item response data were simulated using as generating parameters the corresponding item parameter estimates from the 36 items studied in the real data analysis reported in the next section (see Table 5 of Section 4 for the 2PL-NLM estimates[1]). In each case, data were generated for 3,000 examinees as θ ~ Normal(0, 1). 100 datasets were simulated with respect to each of the 2PL-NLM, 3PL-NLM, and NRM. Each of these datasets was then fit using each of the three models, with the AIC, BIC, and CAIC applied to evaluate whether the correct generating model was identified. For the 3PL-NLM, the component of the loglikelihood associated with the prior[2] was removed when calculating the indices.

3.1.3. Simulation 3: Distractor Information Study.
As noted, an important practical benefit of the NLMs is their potential to quantify the contribution of distractor information to the overall information provided by an item. We consider two aspects of this process: (1) testing whether distractors provide incremental information, and (2) quantifying the information provided. In testing for information, we apply a likelihood ratio (LR) test comparing a model in which the distractors are assumed to provide no information (a reduced model) against one in which they do (an augmented model). The reduced model under the 2PL-NLM and 3PL-NLM assumes λ_v = 0 for all distractor categories. The LR test was performed on an item-by-item basis, with the reduced model fit so as to allow distractor information for all items but the studied item. While such a test may be desirable to determine whether distractors provide information, in actual practice greater value may be placed on the quantification of the information provided by distractors. Following Section 2, such a quantification is provided by the estimated information share of the distractor categories, as computed from the item parameter estimates.

In order to evaluate the performance of the LR test and the precision of distractor information estimates, a third simulation was conducted. The first part of the simulation evaluated the Type I error performance of the LR test. Data were generated from each of the 2PL-NLM, 3PL-NLM, and NRM. The generating item parameters were based on those observed as estimates in the real data analysis, but with restrictions imposed to reflect the reduced condition for each of the three models. For the NRM, the reduced condition was simulated by setting λ_1 = ··· = λ_4 = −α/4 across the distractor categories. Data were generated for 3,000 examinees and 36 items.

[1] The 3PL-NLM and NRM item parameter estimates (and their standard errors) are available from the first author upon request.
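The item-level LR test just described can be sketched as follows; the fitted log-likelihood values and the critical value below are hypothetical, with degrees of freedom equal to the number of freed distractor slope parameters:

```python
def lr_test(loglik_full, loglik_reduced, crit):
    """Item-level likelihood ratio test for incremental distractor information.
    The reduced model fixes the studied item's distractor slopes at
    lambda_v = 0; crit is the chi-square critical value at df equal to the
    number of freed slope parameters (m_i - 1 under the sum-to-zero constraint)."""
    G2 = 2.0 * (loglik_full - loglik_reduced)
    return G2, G2 > crit

# Hypothetical fitted log-likelihoods for one 4-category item (df = 2;
# 5.991 is the chi-square(2) critical value at alpha = 0.05).
G2, reject = lr_test(loglik_full=-41210.7, loglik_reduced=-41216.9, crit=5.991)
print(round(G2, 6), reject)
```

Given the Type I error inflation reported below, such a test is best used alongside the estimated distractor information share rather than as the sole basis for including distractors.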
[2] A beta prior with parameters of 4 and 16 was used when estimating the guessing parameters in this case, where each item has five response categories.

TABLE 1. Simulation 1 results: average bias and RMSE for correct category parameters.

TABLE 2. Simulation 1 results: average RMSE for distractor category parameters.

The second part of this simulation evaluated the recovery of the estimated information share of distractor categories. Data were again generated under the three models using the real data item parameter estimates, but now without the restrictions implied by the reduced condition. Recovery was evaluated by comparing the estimated information share against the true information share as calculated from the generating parameters. For the NRM, the information share of categories was calculated using methods described by Baker and Kim (2004).

3.2. Simulation Study Results

3.2.1. Simulation 1: Parameter Recovery Study Results. Bias and RMSEs for the distractor category parameters were collapsed across item categories and items to create an average bias and RMSE for each distractor category parameter type. Similarly, the recovery results for the correct response category parameters were averaged across items. Results are provided in Tables 1 and 2 across all conditions. Bias for the correct response parameters is close to zero under the 2PL-NLM for all conditions, implying no apparent evidence of systematic underestimation or overestimation. (It should be noted that the bias for the distractor parameters is forced to 0 due to the constraint in Equation (2).) Bias is somewhat larger under the 3PL-NLM (i.e., positive for β and negative for α), which may be attributed in part to the greater influence of the priors and to small departures of the distribution of generating parameters from that assumed by the priors. RMSEs for the correct response parameters are smaller than those for the distractor parameters under the 2PL-NLM.
Similar patterns in relation to the effects of conditions are found for the 2PL-NLM and 3PL-NLM. No systematic patterns related to the number of items are apparent. Also, as expected, the

RMSEs for the 3PL-NLM were consistently larger than for the 2PL-NLM. Most importantly, the overall results seen here appear comparable to those previously observed using MML techniques under the NRM (Wollack, Bolt, Cohen, & Lee, 2002), suggesting that the NLMs appear to be at least as good as the NRM in terms of parameter recovery. We further confirmed the consistency of our recovery results in comparison to both the 2PLM and 3PLM by comparing results for our generated data when estimated using BILOG-MG with the same γ prior; results were effectively the same.

TABLE 3. Simulation 1 results: average bias and RMSE for ability parameters.

Bias and RMSEs for θ are provided in Table 3. Based on the test information functions shown in Figure 3, it would appear that the distractors provide their greatest relative increases in information at both low and intermediate levels of θ, which is expected as the distractors are more commonly selected among examinees not of high ability. These results are also supported by the RMSEs of Table 3, where the greatest relative declines in RMSE when moving from the 2PLM to the 2PL-NLM are seen for lower θ levels. Not surprisingly, there is also a reduction in bias under the 2PL-NLM at the extreme low θ levels, again owing to the greater amount of information about θ provided by attending to distractors. The results of Table 3 appear to consistently support the value of attending to distractor information. The weighted average statistics in the final row of Table 3 show that when assuming a Normal(0, 1) distribution for θ, we obtain an approximately 30% decrease in RMSE.

3.2.2. Simulation 2: Model Comparison Study Results. Table 4 shows the number of times out of 100 that each model fit best according to each likelihood criterion for each generating model.
Although there is some tendency for confusion between the 2PL-NLM and 3PL-NLM when the 3PL-NLM is the generating model (particularly under CAIC), the distinction between the NRM and NLMs seems clearer. The sometimes better comparative fit for the 2PL-NLM compared to the 3PL-NLM when the 3PL-NLM is the generating model is perhaps not surprising, as the ultimate value of the pseudo-guessing parameter is often questionable, especially when the majority of items are relatively easy. Overall, it would thus appear that as statistical models, the NLM and NRM approaches not only provide competing structural representations, but also ones that may be statistically distinguishable when applied to actual test data.
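The likelihood-based criteria used in this comparison can be computed directly from each fitted model's maximized log-likelihood; the values below are hypothetical (k = 8 parameters per five-category item under either the 2PL-NLM or the NRM):

```python
import math

def information_criteria(loglik, k, n):
    """AIC, BIC, and CAIC (smaller is better) from a model's maximized
    log-likelihood, number of free parameters k, and sample size n."""
    aic = -2.0 * loglik + 2.0 * k
    bic = -2.0 * loglik + k * math.log(n)
    caic = -2.0 * loglik + k * (math.log(n) + 1.0)
    return aic, bic, caic

# Hypothetical fit: 36 five-category items, 8 parameters each, 3,000 examinees.
aic, bic, caic = information_criteria(loglik=-150000.0, k=36 * 8, n=3000)
print(aic < bic < caic)  # True: BIC and CAIC penalize parameters more heavily
```

Because the 2PL-NLM and NRM carry identical parameter counts, comparing them by any of these criteria reduces to comparing raw log-likelihoods, whereas the 3PL-NLM pays an additional penalty for its guessing parameters.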

FIGURE 3. Test information under the 2PLM and 2PL-NLM.

TABLE 4. Simulation 2 results: frequencies of model selection (generating model × estimated model, NRM / 2PL-NLM / 3PL-NLM, under −2logL, AIC, BIC, and CAIC).

Simulation 3: Distractor Information Study Results. In evaluating the Type I error performance of the LR test, we considered alpha levels of 0.05 and 0.01, as well as a Bonferroni-corrected level of 0.05 (p = 0.05/36 ≈ 0.0014). When using the 2PL-NLM as both the generating and fitted model, we observe clear evidence of Type I error inflation, with rejection rates under the 2PL-NLM of 0.14, 0.04, and 0.01, respectively, averaging across the 36 items. Similarly, when using the 3PL-NLM, the corresponding rejection rates at the first two levels are 0.17 and 0.06. Even greater inflation is observed when using the NRM as the generating model, with rates of 0.25, 0.11, and 0.04, respectively, for the 2PL-NLM, and 0.22, 0.09, and 0.03 for the 3PL-NLM. Overall, there is clearly evidence of Type I error inflation in applying the LR test, and potential for mistaken inferences when relying solely on the LR test as a basis for including distractor information. At the same time, however, we note that in virtually all Type I error cases the estimated distractor information is near 0, even when the NRM is the generating model. As most practitioners can be expected to attend to the amount of distractor information when deciding whether to attend to distractors, we conducted a follow-up study that evaluated the accuracy of the NLMs in recovering the amount of distractor information.
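Recovery accuracy in this follow-up study is summarized by the Pearson correlation and the mean absolute difference (MAD) between true and estimated distractor information. A minimal sketch of these two summaries; the information vectors below are illustrative, not values from the simulation.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mad(x, y):
    """Mean absolute difference between true and estimated values."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Hypothetical per-item distractor information, true vs. estimated:
true_info = [0.42, 0.10, 0.31, 0.05, 0.27]
est_info = [0.40, 0.12, 0.33, 0.04, 0.25]
print(round(pearson(true_info, est_info), 2), round(mad(true_info, est_info), 3))
```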

Regardless of whether the 2PL-NLM, 3PL-NLM, or NRM is the generating model, the relative amount of information provided by distractors appears well-recovered. When the 2PL-NLM was the generating model, the correlation between the estimated and true distractor information was 0.97 when fitting the 2PL-NLM and 0.95 when fitting the 3PL-NLM; the mean absolute differences (MADs) were 0.01 and 0.04, respectively. When the 3PL-NLM was the generating model, the respective correlations were 0.93 and 0.98 and the MADs were 0.05 and 0.02, while when the NRM was the generating model, the correlations were still 0.98 and 0.93 and the MADs were 0.02 and 0.09, suggesting that, even in the presence of some model misspecification, recovery appears quite good.

4. Real Data Illustration

Data from a 36-item college-level mathematics placement test (Center for Placement Testing, 1998) were analyzed. For purposes of reporting model estimates and testing for distractor information, 3,000 examinees were randomly selected from a full dataset of 12,800 examinees. Each item contained five response categories (one correct response and four distractor categories). Inspection of the items suggested a response process more consistent with that discussed in relation to example item 2 shown earlier in the paper. That is, most items would appear to be best solved through use of a problem-solving strategy that does not initially consider the response options. Both the 2PL-NLM and 3PL-NLM were investigated as potential competitors to the NRM. The overall −2 loglikelihood for the 2PL-NLM was lower than that for the NRM (when the NRM was fit using the same algorithm). As both models possess the same number of parameters, it would appear that the 2PL-NLM thus provides a better representation of the data. Table 5 displays the 2PL-NLM estimates.
The average standard errors for the distractor category slopes and intercepts were both 0.06, and those for the correct response slope and intercept were 0.05 and 0.04, respectively. To further examine how the NLMs compare to the NRM in terms of model fit, 10 nonoverlapping random samples were drawn from the full dataset, each consisting of 1,000 examinees. Each of the 2PL-NLM, 3PL-NLM, and NRM was fit to the 10 datasets. Table 6 shows the −2 loglikelihoods under each model, as well as the AIC, BIC, and CAIC indices. On the whole, the 2PL-NLM appears to show comparatively better fit than the 3PL-NLM and NRM across all 10 samples. Figure 4 shows plots of information functions for several example items: the item information share of distractor categories 1–4 (Equation (22)) and of the correct response category 5 (Equation (21)), as well as the total item information function (Equation (20)). Apparent from these graphs is the substantial variability across items in the contribution of distractors to overall item information. For example, items 6 and 7 show large amounts of information both in the correct response category and in most of the distractor categories, while items 12 and 32 show very small amounts of information, especially in the distractor categories. As noted, the item information share of the distractor categories can be used as the basis for a decision about whether to collapse distractor categories and score items simply as correct/incorrect. Using the item parameter estimates from the 2PL-NLM, a comparison of item information functions when including versus excluding distractor information can be performed without any revision of the model, as the item parameter estimates for the correct responses under the 2PL-NLM are also 2PLM item parameter estimates. The item information functions under the 2PLM and 2PL-NLM are plotted in Figure 5. Note that for items 6 and 7, the 2PL-NLM provides a larger amount of information at relatively low levels of ability than the 2PLM.
Items 12 and 32 yield almost the same information under the two models, suggesting virtually no practical advantage to considering distractor information.
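Because of the collapsibility property just noted, the 2PLM curve in such a comparison comes directly from the 2PL-NLM's correct-response parameters. A minimal sketch of the 2PL item information and of the percent-increase quantity that such a comparison yields; the parameter values and the distractor-information term are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """2PL correct-response probability in logistic (slope-intercept) form."""
    return 1.0 / (1.0 + math.exp(-(a * theta + b)))

def info_2pl(theta, a, b):
    """Item information of the correct/incorrect score: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def pct_increase(info_2plm, distractor_info):
    """Percent increase in item information from attending to distractors."""
    return 100.0 * distractor_info / info_2plm

theta, a, b = -1.0, 1.2, 0.3   # hypothetical examinee and item
base = info_2pl(theta, a, b)
print(round(base, 3), round(pct_increase(base, 0.05), 1))
```

An item resembling 12 or 32 would simply have a distractor-information term near zero, leaving the 2PLM and 2PL-NLM curves nearly identical.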

TABLE 5. 2PL-NLM parameter estimates, mathematics placement test (columns: item; slopes λ1–λ4 and α; intercepts ζ1–ζ4 and β).

Beyond quantifying item category information, the statistical significance of the information in the distractor categories was evaluated through an LR test for each item using the 2PL-NLM. The results are presented in Table 7. As noted above, the −2logL of the augmented model is the same for all item-level tests. As χ²(df = 3, α = 0.05) = 7.81, we reject the null hypothesis for all items except items 12 and 32. Recalling the inflated Type I error performance of the LR test in Section 3.2.3, it nevertheless appears that on the whole there is evidence of distractor information in the items on this test, as the LR test rejects in 34 out of 36 cases, well beyond the levels of inflation seen in the simulation. However, the quantification of the percentage increase in information, shown in the rightmost column of Table 7, suggests that the increase is less than 12% for half of the items.
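The per-item LR decision described above can be sketched as follows, using the χ²(df = 3) critical value of 7.81 cited in the text; the −2logL values are hypothetical.

```python
# Chi-square critical value at alpha = .05 with df = 3, as cited in the text.
CHI2_CRIT_DF3_05 = 7.81

def lr_statistic(neg2logl_reduced, neg2logl_augmented):
    """LR = difference in -2 log-likelihood between the reduced model
    (distractor categories collapsed) and the augmented model (full NLM)."""
    return neg2logl_reduced - neg2logl_augmented

def distractors_informative(neg2logl_reduced, neg2logl_augmented,
                            critical=CHI2_CRIT_DF3_05):
    """Reject the null of no distractor information if LR exceeds the cutoff."""
    return lr_statistic(neg2logl_reduced, neg2logl_augmented) > critical

# Hypothetical items resembling 6/7 (informative) vs. 12/32 (not):
print(distractors_informative(90210.4, 90150.1))  # large drop -> True
print(distractors_informative(90155.0, 90150.1))  # small drop -> False
```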

TABLE 6. Model selection comparison across 10 samples (−2logL, AIC, BIC, and CAIC for the NRM, 2PL-NLM, and 3PL-NLM in each sample).

FIGURE 4. Item information and information share of categories under the 2PL-NLM.

FIGURE 5. Item information under the 2PLM and 2PL-NLM.

TABLE 7. LR test results and average item information under the 2PLM and 2PL-NLM (columns: item; −2logL reduced; LR; significant; Bonferroni; 2PLM info; 2PL-NLM info; % increase).

5. Alternative Nested Logit Models

Although the NLMs presented in this paper appear to provide a better representation of items such as example item 2 than models such as the NRM, a limitation of the models as representations of the response process is the absence of the correct response category in the second nest. This limitation can be addressed through overlapping nests, where the same response option is present in more than one nest. Such a framework also provides an appealing one in which to better understand the NLM and NRM approaches in relation to each other, as well as potential hybrid approaches. Under an overlapping-nest approach, the correct response category could be present in two nests: (1) the nest associated with the correct solution strategy; and (2) the nest including the distractor categories. An appealing aspect of this modeling framework is that it emphasizes two general ways by which an examinee arrives at the correct response: (1) correct problem solving apart from response category evaluation; and (2) a comparative evaluation of all response options, as might occur with an educated guess. Such a model would be obtained by generalizing Equations (3) and (4) such that the summation in the rightmost bracketed term in Equation (4) would include the correct response category in addition to the distractor categories. The NRM can be viewed as a special case in which β_i = −∞ and α_i = 0, while the 2PL-NLM is a special case in which ζ_iv = −∞ and λ_iv = 0 for the correct response category. The cost of the more general model relative to the 2PL-NLM or NRM, therefore, is the addition of two parameters per item. It remains to be seen whether there are conditions in which this model can be effectively estimated. The work of San Martin et al. (2006) likely provides some insight as to the potential value of these models, although their approach was applied within the framework of a Rasch model and thus did not include either correct response or distractor slope parameters. One setting that may make estimation possible is the presence of distinguishable traits across levels. For example, for some tests it may be reasonable to assume that a distinct trait (e.g., testwiseness) influences selection among response categories at level 2, as compared to the trait (e.g., math ability) that functions at level 1. An example of this possibility is demonstrated by example item 3.

Example Item 3. A tire measures 24 inches in diameter. What is the circumference of the tire in inches? Round your answer to the nearest tenth. (A) 48 (B) (C) 75 (D) 75.4
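The baseline two-level structure described by Equations (3) and (4), which the overlapping-nest extension generalizes, can be sketched as follows, assuming a 2PL for the correct response at level 1 and a nominal (NRM-type) model over the distractors at level 2; all parameter values are hypothetical.

```python
import math

def nlm_probs(theta, a, b, lambdas, zetas):
    """Return (p_correct, [p_distractor_v, ...]) for one item under a
    2PL-NLM-style nested logit: level 1 is a 2PL correct-response curve,
    level 2 distributes the remaining probability over distractors."""
    p_correct = 1.0 / (1.0 + math.exp(-(a * theta + b)))            # level 1
    terms = [math.exp(l * theta + z) for l, z in zip(lambdas, zetas)]
    total = sum(terms)
    p_distractors = [(1.0 - p_correct) * t / total for t in terms]   # level 2
    return p_correct, p_distractors

pc, pd = nlm_probs(theta=0.5, a=1.0, b=-0.2,
                   lambdas=[-0.5, 0.2, -1.0, 0.1],
                   zetas=[0.3, -0.1, 0.4, -0.6])
assert abs(pc + sum(pd) - 1.0) < 1e-12   # category probabilities sum to one
print(round(pc, 3))
```

The collapsibility property follows directly from this factorization: summing the distractor probabilities recovers 1 − p_correct, leaving the 2PL correct-response curve untouched.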

While math ability may determine whether the examinee arrives at the correct response at level 1, a correlated but potentially distinguishable trait may function within level 2. For example, as only two responses, (B) and (D), are reported to the nearest tenth, and (C) and (D) are essentially the same answer differing only in rounding, a testwise respondent could likely ascertain that (D) is the correct response. Other NLMs adapted for still other item types might include additional levels. One such case might involve items for which a solution strategy can be broken down into steps and where distractors are designed to catch misapplication of a particular step. Following the same example item above, for instance, it might be anticipated that the process by which a respondent arrives at (D) as the correct response involves first performing the circumference calculation, π × 24 = 75.398..., and next determining the correct level at which to round. Correct execution of the first step but incorrect execution of the second would be represented by the choice of (C) as the response. While these models and others likely provide a more accurate representation of the response process than the NLMs considered in this paper, they naturally come at the cost of additional model complexity, as well as the loss of the distractor collapsibility property that motivated the 2PL-NLM and 3PL-NLM considered here.

6. Conclusion

Future work with NLMs can address various issues. More applications and direct comparisons against competing models, including models other than the NRM, such as the Nedelsky model (Bechger, Maris, Verstralen, & Verhelst, 2005), are needed. Work on estimation issues related to more complex NLMs, such as models with overlapping nests, may help clarify the potential value of the NLM strategy in other contexts.
Various alternative strategies might be considered, following approaches taken in the discrete choice literature (see, e.g., Train, 2003). For example, some approaches to handling overlapping nests specify a parameter indicating the degree to which a given outcome is a member of each nest. Other generalizations of the nested approach might consider probit as opposed to logit link functions. Additional practical applications of the specific models investigated in this paper may also be of interest. For example, attempts to study differential item functioning (DIF) in multiple-choice items often find value in determining whether particular distractors are responsible for differential functioning of the correct response. Such applications can be studied in a more explicit fashion using the 2PL-NLM and 3PL-NLM as presented in this paper. Still other applications may focus on the use of the models in comparing test items administered under open-ended versus multiple-choice formats. Here again, the consistency in how the correct response category is modeled should allow for more direct assessment of the consequences of adding multiple-choice response options.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723.
Baker, F.B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York: Marcel Dekker.
Bechger, T.M., Maris, G., Verstralen, H.H.F.M., & Verhelst, N.D. (2005). The Nedelsky model for multiple-choice items. In L.A. van der Ark, M.A. Croon, & K. Sijtsma (Eds.), New developments in categorical data analysis for the social and behavioral sciences. Mahwah: Lawrence Erlbaum Associates.
Bock, R.D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.
Bock, R.D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.


More information

arxiv: v1 [stat.ap] 11 Aug 2014

arxiv: v1 [stat.ap] 11 Aug 2014 Noname manuscript No. (will be inserted by the editor) A multilevel finite mixture item response model to cluster examinees and schools Michela Gnaldi Silvia Bacci Francesco Bartolucci arxiv:1408.2319v1

More information

8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

More information

Monte Carlo Simulations for Rasch Model Tests

Monte Carlo Simulations for Rasch Model Tests Monte Carlo Simulations for Rasch Model Tests Patrick Mair Vienna University of Economics Thomas Ledl University of Vienna Abstract: Sources of deviation from model fit in Rasch models can be lack of unidimensionality,

More information

A Practitioner s Guide to Generalized Linear Models

A Practitioner s Guide to Generalized Linear Models A Practitioners Guide to Generalized Linear Models Background The classical linear models and most of the minimum bias procedures are special cases of generalized linear models (GLMs). GLMs are more technically

More information

IRT linking methods for the bifactor model: a special case of the two-tier item factor analysis model

IRT linking methods for the bifactor model: a special case of the two-tier item factor analysis model University of Iowa Iowa Research Online Theses and Dissertations Summer 2017 IRT linking methods for the bifactor model: a special case of the two-tier item factor analysis model Kyung Yong Kim University

More information

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models Tihomir Asparouhov 1, Bengt Muthen 2 Muthen & Muthen 1 UCLA 2 Abstract Multilevel analysis often leads to modeling

More information

Comparing IRT with Other Models

Comparing IRT with Other Models Comparing IRT with Other Models Lecture #14 ICPSR Item Response Theory Workshop Lecture #14: 1of 45 Lecture Overview The final set of slides will describe a parallel between IRT and another commonly used

More information

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions Journal of Modern Applied Statistical Methods Volume 8 Issue 1 Article 13 5-1-2009 Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 31 Assessing Equating Results Based on First-order and Second-order Equity Eunjung Lee, Won-Chan Lee, Robert L. Brennan

More information

Equating Subscores Using Total Scaled Scores as an Anchor

Equating Subscores Using Total Scaled Scores as an Anchor Research Report ETS RR 11-07 Equating Subscores Using Total Scaled Scores as an Anchor Gautam Puhan Longjuan Liang March 2011 Equating Subscores Using Total Scaled Scores as an Anchor Gautam Puhan and

More information

Goals. PSCI6000 Maximum Likelihood Estimation Multiple Response Model 1. Multinomial Dependent Variable. Random Utility Model

Goals. PSCI6000 Maximum Likelihood Estimation Multiple Response Model 1. Multinomial Dependent Variable. Random Utility Model Goals PSCI6000 Maximum Likelihood Estimation Multiple Response Model 1 Tetsuya Matsubayashi University of North Texas November 2, 2010 Random utility model Multinomial logit model Conditional logit model

More information

Estimating Integer Parameters in IRT Models for Polytomous Items

Estimating Integer Parameters in IRT Models for Polytomous Items Measurement and Research Department Reports 96-1 Estimating Integer Parameters in IRT Models for Polytomous Items H.H.F.M. Verstralen Measurement and Research Department Reports 96-1 Estimating Integer

More information

Anders Skrondal. Norwegian Institute of Public Health London School of Hygiene and Tropical Medicine. Based on joint work with Sophia Rabe-Hesketh

Anders Skrondal. Norwegian Institute of Public Health London School of Hygiene and Tropical Medicine. Based on joint work with Sophia Rabe-Hesketh Constructing Latent Variable Models using Composite Links Anders Skrondal Norwegian Institute of Public Health London School of Hygiene and Tropical Medicine Based on joint work with Sophia Rabe-Hesketh

More information

When enough is enough: early stopping of biometrics error rate testing

When enough is enough: early stopping of biometrics error rate testing When enough is enough: early stopping of biometrics error rate testing Michael E. Schuckers Department of Mathematics, Computer Science and Statistics St. Lawrence University and Center for Identification

More information

Econometrics Spring School 2016 Econometric Modelling. Lecture 6: Model selection theory and evidence Introduction to Monte Carlo Simulation

Econometrics Spring School 2016 Econometric Modelling. Lecture 6: Model selection theory and evidence Introduction to Monte Carlo Simulation Econometrics Spring School 2016 Econometric Modelling Jurgen A Doornik, David F. Hendry, and Felix Pretis George-Washington University March 2016 Lecture 6: Model selection theory and evidence Introduction

More information

Measurement Invariance (MI) in CFA and Differential Item Functioning (DIF) in IRT/IFA

Measurement Invariance (MI) in CFA and Differential Item Functioning (DIF) in IRT/IFA Topics: Measurement Invariance (MI) in CFA and Differential Item Functioning (DIF) in IRT/IFA What are MI and DIF? Testing measurement invariance in CFA Testing differential item functioning in IRT/IFA

More information

Latent Class Analysis for Models with Error of Measurement Using Log-Linear Models and An Application to Women s Liberation Data

Latent Class Analysis for Models with Error of Measurement Using Log-Linear Models and An Application to Women s Liberation Data Journal of Data Science 9(2011), 43-54 Latent Class Analysis for Models with Error of Measurement Using Log-Linear Models and An Application to Women s Liberation Data Haydar Demirhan Hacettepe University

More information

A multivariate multilevel model for the analysis of TIMMS & PIRLS data

A multivariate multilevel model for the analysis of TIMMS & PIRLS data A multivariate multilevel model for the analysis of TIMMS & PIRLS data European Congress of Methodology July 23-25, 2014 - Utrecht Leonardo Grilli 1, Fulvia Pennoni 2, Carla Rampichini 1, Isabella Romeo

More information

MLMED. User Guide. Nicholas J. Rockwood The Ohio State University Beta Version May, 2017

MLMED. User Guide. Nicholas J. Rockwood The Ohio State University Beta Version May, 2017 MLMED User Guide Nicholas J. Rockwood The Ohio State University rockwood.19@osu.edu Beta Version May, 2017 MLmed is a computational macro for SPSS that simplifies the fitting of multilevel mediation and

More information

A Use of the Information Function in Tailored Testing

A Use of the Information Function in Tailored Testing A Use of the Information Function in Tailored Testing Fumiko Samejima University of Tennessee for indi- Several important and useful implications in latent trait theory, with direct implications vidualized

More information

APPENDICES TO Protest Movements and Citizen Discontent. Appendix A: Question Wordings

APPENDICES TO Protest Movements and Citizen Discontent. Appendix A: Question Wordings APPENDICES TO Protest Movements and Citizen Discontent Appendix A: Question Wordings IDEOLOGY: How would you describe your views on most political matters? Generally do you think of yourself as liberal,

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Signal Detection Theory With Finite Mixture Distributions: Theoretical Developments With Applications to Recognition Memory

Signal Detection Theory With Finite Mixture Distributions: Theoretical Developments With Applications to Recognition Memory Psychological Review Copyright 2002 by the American Psychological Association, Inc. 2002, Vol. 109, No. 4, 710 721 0033-295X/02/$5.00 DOI: 10.1037//0033-295X.109.4.710 Signal Detection Theory With Finite

More information

Applied Psychological Measurement 2001; 25; 283

Applied Psychological Measurement 2001; 25; 283 Applied Psychological Measurement http://apm.sagepub.com The Use of Restricted Latent Class Models for Defining and Testing Nonparametric and Parametric Item Response Theory Models Jeroen K. Vermunt Applied

More information

Online Item Calibration for Q-matrix in CD-CAT

Online Item Calibration for Q-matrix in CD-CAT Online Item Calibration for Q-matrix in CD-CAT Yunxiao Chen, Jingchen Liu, and Zhiliang Ying November 8, 2013 Abstract Item replenishment is important to maintaining a large scale item bank. In this paper

More information

A Note on the Equivalence Between Observed and Expected Information Functions With Polytomous IRT Models

A Note on the Equivalence Between Observed and Expected Information Functions With Polytomous IRT Models Journal of Educational and Behavioral Statistics 2015, Vol. 40, No. 1, pp. 96 105 DOI: 10.3102/1076998614558122 # 2014 AERA. http://jebs.aera.net A Note on the Equivalence Between Observed and Expected

More information

Multidimensional Linking for Tests with Mixed Item Types

Multidimensional Linking for Tests with Mixed Item Types Journal of Educational Measurement Summer 2009, Vol. 46, No. 2, pp. 177 197 Multidimensional Linking for Tests with Mixed Item Types Lihua Yao 1 Defense Manpower Data Center Keith Boughton CTB/McGraw-Hill

More information

Chained Versus Post-Stratification Equating in a Linear Context: An Evaluation Using Empirical Data

Chained Versus Post-Stratification Equating in a Linear Context: An Evaluation Using Empirical Data Research Report Chained Versus Post-Stratification Equating in a Linear Context: An Evaluation Using Empirical Data Gautam Puhan February 2 ETS RR--6 Listening. Learning. Leading. Chained Versus Post-Stratification

More information

Chapter 5. Introduction to Path Analysis. Overview. Correlation and causation. Specification of path models. Types of path models

Chapter 5. Introduction to Path Analysis. Overview. Correlation and causation. Specification of path models. Types of path models Chapter 5 Introduction to Path Analysis Put simply, the basic dilemma in all sciences is that of how much to oversimplify reality. Overview H. M. Blalock Correlation and causation Specification of path

More information

Diversity partitioning without statistical independence of alpha and beta

Diversity partitioning without statistical independence of alpha and beta 1964 Ecology, Vol. 91, No. 7 Ecology, 91(7), 2010, pp. 1964 1969 Ó 2010 by the Ecological Society of America Diversity partitioning without statistical independence of alpha and beta JOSEPH A. VEECH 1,3

More information

The Rasch Poisson Counts Model for Incomplete Data: An Application of the EM Algorithm

The Rasch Poisson Counts Model for Incomplete Data: An Application of the EM Algorithm The Rasch Poisson Counts Model for Incomplete Data: An Application of the EM Algorithm Margo G. H. Jansen University of Groningen Rasch s Poisson counts model is a latent trait model for the situation

More information

Ensemble Rasch Models

Ensemble Rasch Models Ensemble Rasch Models Steven M. Lattanzio II Metamatrics Inc., Durham, NC 27713 email: slattanzio@lexile.com Donald S. Burdick Metamatrics Inc., Durham, NC 27713 email: dburdick@lexile.com A. Jackson Stenner

More information

COWLEY COLLEGE & Area Vocational Technical School

COWLEY COLLEGE & Area Vocational Technical School COWLEY COLLEGE & Area Vocational Technical School COURSE PROCEDURE FOR COLLEGE ALGEBRA WITH REVIEW MTH 4421 5 Credit Hours Student Level: This course is open to students on the college level in the freshman

More information

Nonparametric Online Item Calibration

Nonparametric Online Item Calibration Nonparametric Online Item Calibration Fumiko Samejima University of Tennesee Keynote Address Presented June 7, 2007 Abstract In estimating the operating characteristic (OC) of an item, in contrast to parametric

More information

flexmirt R : Flexible Multilevel Multidimensional Item Analysis and Test Scoring

flexmirt R : Flexible Multilevel Multidimensional Item Analysis and Test Scoring flexmirt R : Flexible Multilevel Multidimensional Item Analysis and Test Scoring User s Manual Version 3.0RC Authored by: Carrie R. Houts, PhD Li Cai, PhD This manual accompanies a Release Candidate version

More information

Logistic Regression and Item Response Theory: Estimation Item and Ability Parameters by Using Logistic Regression in IRT.

Logistic Regression and Item Response Theory: Estimation Item and Ability Parameters by Using Logistic Regression in IRT. Louisiana State University LSU Digital Commons LSU Historical Dissertations and Theses Graduate School 1998 Logistic Regression and Item Response Theory: Estimation Item and Ability Parameters by Using

More information

STA 216, GLM, Lecture 16. October 29, 2007

STA 216, GLM, Lecture 16. October 29, 2007 STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural

More information

An Introduction to Mplus and Path Analysis

An Introduction to Mplus and Path Analysis An Introduction to Mplus and Path Analysis PSYC 943: Fundamentals of Multivariate Modeling Lecture 10: October 30, 2013 PSYC 943: Lecture 10 Today s Lecture Path analysis starting with multivariate regression

More information

Modeling differences in itemposition effects in the PISA 2009 reading assessment within and between schools

Modeling differences in itemposition effects in the PISA 2009 reading assessment within and between schools Modeling differences in itemposition effects in the PISA 2009 reading assessment within and between schools Dries Debeer & Rianne Janssen (University of Leuven) Johannes Hartig & Janine Buchholz (DIPF)

More information

Parametric Identification of Multiplicative Exponential Heteroskedasticity

Parametric Identification of Multiplicative Exponential Heteroskedasticity Parametric Identification of Multiplicative Exponential Heteroskedasticity Alyssa Carlson Department of Economics, Michigan State University East Lansing, MI 48824-1038, United States Dated: October 5,

More information

Computerized Adaptive Testing With Equated Number-Correct Scoring

Computerized Adaptive Testing With Equated Number-Correct Scoring Computerized Adaptive Testing With Equated Number-Correct Scoring Wim J. van der Linden University of Twente A constrained computerized adaptive testing (CAT) algorithm is presented that can be used to

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. A Multinomial Error Model for Tests with Polytomous Items

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. A Multinomial Error Model for Tests with Polytomous Items Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 1 for Tests with Polytomous Items Won-Chan Lee January 2 A previous version of this paper was presented at the Annual

More information

Item Response Theory for Scores on Tests Including Polytomous Items with Ordered Responses

Item Response Theory for Scores on Tests Including Polytomous Items with Ordered Responses Item Response Theory for Scores on Tests Including Polytomous Items with Ordered Responses David Thissen, University of North Carolina at Chapel Hill Mary Pommerich, American College Testing Kathleen Billeaud,

More information

Seminar über Statistik FS2008: Model Selection

Seminar über Statistik FS2008: Model Selection Seminar über Statistik FS2008: Model Selection Alessia Fenaroli, Ghazale Jazayeri Monday, April 2, 2008 Introduction Model Choice deals with the comparison of models and the selection of a model. It can

More information

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture!

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture! Hierarchical Generalized Linear Models ERSH 8990 REMS Seminar on HLM Last Lecture! Hierarchical Generalized Linear Models Introduction to generalized models Models for binary outcomes Interpreting parameter

More information

Bayesian Analysis of Latent Variable Models using Mplus

Bayesian Analysis of Latent Variable Models using Mplus Bayesian Analysis of Latent Variable Models using Mplus Tihomir Asparouhov and Bengt Muthén Version 2 June 29, 2010 1 1 Introduction In this paper we describe some of the modeling possibilities that are

More information

The Simplex Method: An Example

The Simplex Method: An Example The Simplex Method: An Example Our first step is to introduce one more new variable, which we denote by z. The variable z is define to be equal to 4x 1 +3x 2. Doing this will allow us to have a unified

More information