NESTED LOGIT MODELS FOR MULTIPLE-CHOICE ITEM RESPONSE DATA
UNIVERSITY OF TEXAS AT AUSTIN / UNIVERSITY OF WISCONSIN-MADISON


PSYCHOMETRIKA VOL. 75, NO. 3, 454-473, SEPTEMBER 2010
DOI: 10.1007/S11336-010-9163-7

NESTED LOGIT MODELS FOR MULTIPLE-CHOICE ITEM RESPONSE DATA

YOUNGSUK SUH, UNIVERSITY OF TEXAS AT AUSTIN
DANIEL M. BOLT, UNIVERSITY OF WISCONSIN-MADISON

Nested logit item response models for multiple-choice data are presented. Relative to previous models, the new models are suggested to provide a better approximation to multiple-choice items where the application of a solution strategy precedes consideration of response options. In practice, the models also accommodate collapsibility across all distractor categories, making it easier to allow decisions about including distractor information to occur on an item-by-item or application-by-application basis without altering the statistical form of the correct response curves. Marginal maximum likelihood estimation algorithms for the models are presented along with simulation and real data analyses.

Key words: multiple-choice items, multiple-choice models, nested logit models, nominal response model, marginal maximum likelihood estimation, item information, distractor selection information, distractor category collapsibility.

Multiple-choice items are a common form of test item in standardized testing and have been a focus of item response theory (IRT) modeling for decades. A major challenge in building appropriate models for multiple-choice tests is the variety of strategies that can be used in responding to items, and the potential for such strategies to vary depending on the type of test item or test (Hutchinson, 1991). Perhaps the most common IRT approach to modeling multiple-choice item responses is reflected by Bock's Nominal Response Model (NRM; Bock, 1972) and related models such as Thissen and Steinberg's Multiple Choice Model (MCM; Thissen & Steinberg, 1984) and Samejima's Guessing Model (SGM; Samejima, 1979).
Such models portray the item response in a competing utility framework where each response category is associated with a selection propensity that is a function of the ability measured by the test. More recent models (e.g., Revuelta, 2004, 2005) are based on the same general framework but are designed to possess attractive statistical properties, such as rising distractor selection ratios (Love, 1997). The NRM modeling approach seems most apt for conditions in which the item response is based on a comparative evaluation of all response categories. Consider the following item from a test of English usage:

Example Item 1. Select the one underlined part of the following sentence that must be changed in order to make the sentence grammatically correct.

The average soda can has a tensile strong capable of supporting a weight of one hundred kilograms.
(A) (B) (C) (D)

Requests for reprints should be sent to Youngsuk Suh, Department of Educational Psychology, University of Texas at Austin, 1 University Station D5800, Austin, TX 78712, USA. E-mail: yssuh327@gmail.com

© 2010 The Psychometric Society

An anticipated strategy for such an item entails evaluating each response option and selecting the option that seems to best satisfy the requirement of the stem. In this case, option (B) should emerge as the appropriate selection for a high-ability examinee. By contrast, many multiple-choice item types invoke strategies that involve problem-solving independent of the response categories. Consider, for example, the following item from a test of elementary mathematics:

Example Item 2. A store sells 168 CDs each week. How many CDs does it sell in 24 weeks?

(A) 2196  (B) 3210  (C) 4032  (D) 6118

An expected strategy for this item would entail multiplying 168 by 24, which leads to 4032, and selecting response option (C), with no more than a surface-level evaluation of the other response options as not being 4032. Under such a strategy, the distractors are only considered as potential responses if the examinee is unable to solve the item. The process might be viewed as one in which a problem-solving strategy precedes a guessing strategy (see, e.g., Hutchinson, 1991; San Martin, del Pino, & De Boeck, 2006), and where evaluation of response options only occurs when the problem-solving strategy fails. The purpose of this paper is to present a modeling framework that may provide a better approximation to this latter process. It is further shown that the new models, unlike previous models for multiple-choice data, possess an attractive property of category collapsibility across all distractor options. This property is argued to be of practical value in settings where distractor information may be needed for some applications of the IRT model but not others.
For example, studies of cheating behavior (Wollack, 1997) or appropriateness measurement (Drasgow, Levine, & Williams, 1985) commonly find value in attending to distractor selection information, but are often conducted in the context of tests whose items are intended to be scored correct/incorrect, and where applications such as test equating may be more easily handled using traditional binary models. In still other applications, such as when using an IRT model to estimate latent ability, attending to distractor selection may be useful for some items but not others, depending on the ability of the item writer to design distractors whose attractiveness varies in relation to the trait. It would thus appear that IRT models that are consistent in how the correct response option is modeled, whether including or excluding distractor information, could be of considerable practical value. A potential limitation of multiple-choice models such as the NRM is their lack of a distractor collapsibility property. The decision to model all possible responses under the NRM, for example, implies a multivariate logit with as many categories as response options. Assuming an item with more than two categories, the correct response characteristic curve under the NRM is incompatible with the corresponding curve under a binary logistic model where the item is scored dichotomously (0/1). Figure 1 provides an illustration of the best-fitting NRM and 2-parameter logistic model (2PLM) correct response curves applied to the same five-category test item using the same response data, but with all distractors scored incorrect in the 2PLM case. (The item and data for this example come from a real data illustration described shortly.) As the example illustrates, the difference between the curves can be fairly substantial. A second purpose of this paper is therefore to illustrate the advantages of a model that possesses collapsibility with respect to all distractor categories.
In the current paper, we use the models to examine distractor information on an item-by-item basis, and demonstrate the potential to retain or ignore information provided by distractors in a variable way using just one model.
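The collapsibility problem can also be seen numerically. Under the NRM, the correct-response probability is a divide-by-total function of all category propensities, so collapsing the distractors into a single incorrect category yields a curve whose logit is not linear in θ and hence cannot coincide with any 2PLM curve. A minimal sketch, with hypothetical NRM parameters for a five-category item:

```python
import numpy as np

# Hypothetical NRM parameters for one five-category item (category 5 keyed
# correct); slopes and intercepts satisfy the sum-to-zero constraint.
lam = np.array([-0.8, -0.5, 0.2, -0.4, 1.5])
zeta = np.array([0.3, -0.2, 0.5, -1.1, 0.5])

def nrm_correct_prob(theta):
    z = zeta + lam * theta              # category propensities Z_v(theta)
    ez = np.exp(z - z.max())            # stabilized divide-by-total form
    return ez[-1] / ez.sum()

theta = np.linspace(-3.0, 3.0, 13)
p = np.array([nrm_correct_prob(t) for t in theta])
logit = np.log(p / (1.0 - p))

# A 2PLM correct-response curve has a logit that is exactly linear in theta;
# the collapsed NRM curve does not, because the distractor slopes differ.
curvature = np.abs(np.diff(logit, n=2))
print(curvature.max() > 1e-3)
```

The nonzero second differences of the logit are exactly the incompatibility that Figure 1 displays graphically.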

FIGURE 1. Illustration of best-fitting NRM correct response curve, 2PLM curve.

1. Bock's Nominal Response Model

The NRM uses a multivariate logit to model category selection. Assume $v = 1, \ldots, m_i$ possible response categories for item $i$. A propensity function $Z_{iv}(\theta_j) = \zeta_{iv} + \lambda_{iv}\theta_j$ represents the attractiveness of category $v$ as a function of an examinee ability level $\theta_j$ using two item category parameters: a slope parameter $\lambda_{iv}$ and an intercept parameter $\zeta_{iv}$. $Z_{iv}(\theta_j)$ is mapped to a probability metric as

$$P_{iv}(\theta_j) = \frac{\exp Z_{iv}(\theta_j)}{\sum_{k=1}^{m_i} \exp Z_{ik}(\theta_j)}. \quad (1)$$

The probability of selecting category $v$ is thus affected not only by the propensity toward $v$, but also by the propensities toward all other categories, making the NRM a divide-by-total model (Thissen & Steinberg, 1986). To resolve a statistical indeterminacy, for all $\theta_j$ we set

$$\sum_{v=1}^{m_i} Z_{iv}(\theta_j) = 0, \quad (2)$$

which also implies $\sum_{v=1}^{m_i} \lambda_{iv} = 0$ and $\sum_{v=1}^{m_i} \zeta_{iv} = 0$, resulting in $2(m_i - 1)$ free parameters to be estimated per item. (For detailed NRM estimation procedures, see Baker & Kim, 2004, pp. 239-241.) Despite extensions of the NRM to address issues related to random guessing (e.g., the MCM of Thissen & Steinberg, 1984; the SGM of Samejima, 1979), the NRM generally provides as good a fit to real data as these more complex models (Drasgow, Levine, Tsien, Williams, & Mead, 1995).

2. Nested Logit Models

Nested logit models (NLMs; McFadden, 1981, 1982) provide an alternative to multivariate logit models, and are appropriate for choice settings where selection possesses a hierarchical structure, as when a final choice decision is made through a sequential process. An NLM represents the final choice among a discrete set of choice options conditional upon choices made at

higher levels in the hierarchy. The resulting probability of each discrete choice is modeled as a product of the conditional and unconditional probabilities across levels of the hierarchy. In this paper, we adapt the NLM approach to incorporate latent traits, such as an ability $\theta$ in IRT, to provide a competing approach to the NRM for multiple-choice test items. Using the NLM framework, we assume the correct response probability to be formulated by the 2PLM or 3-parameter logistic model (3PLM), and model distractor selection conditional upon an incorrect response using Bock's NRM. This results in an NLM having two levels: a higher level (level 1) introducing branches that distinguish a correct versus an incorrect response, and a lower level (level 2) introducing branches that distinguish distractors. The response options are consequently separated into two nests, one nest possessing the correct response only, the second nest possessing all distractors. Formulated in this way, NLMs provide a different portrayal of how the examinee arrives at a correct response; while the NRM emphasizes a comparative evaluation of response options, the NLMs emphasize a solution strategy that occurs independent of evaluating the options. Although the most accurate representation probably lies somewhere in between (see Section 5), the NLM strategy might be expected to provide a better approximation for many multiple-choice items, such as items represented by Example Item 2.

2.1. 3PL-Nested Logit Model

While we will consider both 2PLM and 3PLM versions of the NLMs described above (denoted 2PL-NLM and 3PL-NLM, respectively), we consider the 3PL-NLM in greater detail, recognizing the 2PL-NLM as a special case. Suppose a multiple-choice test is composed of $n$ items and each item has one correct answer and $m_i$ distractors, or a total of $m_i + 1$ response categories.
Let $U_{ij}$ represent the item $i$ response by examinee $j$ ($j = 1, \ldots, N$) once keyed for correctness (i.e., $U_{ij} = 1$ if correct, 0 if incorrect). Further, let $D_{ijv}$ denote the item response in an item × examinee × distractor category array such that $D_{ijv} = 1$ when examinee $j$ selects distractor category $v$ ($v = 1, \ldots, m_i$) of item $i$, and 0 otherwise. Under the 3PLM, the probability that an examinee of ability $\theta_j$ chooses the correct response category on item $i$ is modeled as

$$P(U_{ij} = 1 \mid \theta_j) = \gamma_i + (1 - \gamma_i)\frac{1}{1 + \exp[-(\beta_i + \alpha_i\theta_j)]}, \quad (3)$$

where $\beta_i$ is an intercept parameter, $\alpha_i$ a slope parameter, and $\gamma_i$ a lower asymptote parameter, also referred to as a pseudo-guessing parameter. The probability that an examinee selects distractor category $v$ is modeled as the product of the probability of an incorrect response and the probability of selecting distractor category $v$ conditional upon an incorrect response:

$$P(U_{ij} = 0, D_{ijv} = 1 \mid \theta_j) = P(U_{ij} = 0 \mid \theta_j)\,P(D_{ijv} = 1 \mid U_{ij} = 0, \theta_j) = \left\{1 - \gamma_i - (1 - \gamma_i)\frac{1}{1 + \exp[-(\beta_i + \alpha_i\theta_j)]}\right\}\frac{\exp Z_{iv}(\theta_j)}{\sum_{k=1}^{m_i} \exp Z_{ik}(\theta_j)}. \quad (4)$$

As under the NRM, we use $Z_{iv}(\theta_j) = \zeta_{iv} + \lambda_{iv}\theta_j$ to define a propensity toward each distractor category $v$, now conditional upon an incorrect response. Unlike the NRM, the denominator in the conditional probability is obtained by summing $\exp Z_{ik}(\theta_j)$ across only the distractor categories. Following Bock (1972), the arbitrary linear restriction in Equation (2) is imposed for the distractor category parameters. Figure 2 plots item category characteristic curves (ICCCs) for a simulated multiple-choice item with four response categories, where the fourth category represents the correct response. When item responses for the item are scored as binary and analyzed by the 3PLM, the left-side plot represents the characteristic curve for the correct response. For the same item, use of

the 3PL-NLM results in ICCCs shown to the right.

FIGURE 2. ICCCs for a simulated item under the 3PLM and 3PL-NLM.

It should be noted that the item parameter estimates for $\alpha$, $\beta$, and $\gamma$ are identical under both models, as the correct response probability is formulated in both instances under the 3PLM and is not informed by the particular distractors selected. Naturally, the 2PL-NLM is also represented by Equations (3) and (4) above, but with $\gamma_i = 0$. An appealing feature of the 2PL-NLM is that it contains the same number of parameters as the NRM. Thus, for a given dataset, the two models can be compared with respect to log-likelihood in terms of which provides the better structural representation of the data.

2.2. Item Parameter Estimation via an MML Approach for the 3PL-NLM

Estimation of the 3PL-NLM is possible using a variant of Bock and Aitkin's (1981) marginal maximum likelihood (MML) procedure. Using the $U_{ij}$ and $D_{ijv}$ notation above, let $U_j$ denote the correct response vector for examinee $j$, let $D_j$ represent the response pattern matrix of distractor categories for examinee $j$, and let $[U_j, D_j]$ denote the complete $n \times [\max(m_i) + 1]$ item response matrix for examinee $j$. To simplify the notation in Equations (3) and (4), let $P(U_{ij} = 1 \mid \theta_j) = P_i(\theta_j)$, $P(U_{ij} = 0, D_{ijv} = 1 \mid \theta_j) = P_{iv}(\theta_j)$, and $P(D_{ijv} = 1 \mid U_{ij} = 0, \theta_j) = P_{iv|u=0}(\theta_j)$. Then, assuming local independence, the conditional probability of a response pattern matrix for examinee $j$ given $\theta_j$ is the joint probability

$$P([U_j, D_j] \mid \theta_j, \varpi) = \prod_{i=1}^{n}\left[P_i(\theta_j)^{u_{ij}}\prod_{v=1}^{m_i} P_{iv}(\theta_j)^{d_{ijv}}\right] = \prod_{i=1}^{n}\left[P_i(\theta_j)^{u_{ij}}\prod_{v=1}^{m_i}\left(1 - P_i(\theta_j)\right)^{d_{ijv}} P_{iv|u=0}(\theta_j)^{d_{ijv}}\right], \quad (5)$$

where $\varpi$ denotes all item parameters. Under Bock and Aitkin's (1981) approach, the marginal probability of the observed response pattern for examinee $j$ is expressed as $P([U_j, D_j]) = \int P([U_j, D_j] \mid \theta, \varpi)\,g(\theta \mid \tau)\,d\theta$, where $g(\theta \mid \tau)$ is a density function with unknown parameters $\tau$. (The $j$ subscript on $\theta$ is dropped because $\theta_j$ can be seen as a random subject sampled from a population.) When combined across examinees, we write the likelihood as $L = \prod_{j=1}^{N} P([U_j, D_j])$, and the natural logarithm of the likelihood is

$$\log L = \sum_{j=1}^{N}\log P([U_j, D_j]). \quad (6)$$

The total number of estimable item parameters is $\sum_{i=1}^{n}(2m_i + 3)$. However, in deriving the likelihood equations, it proves convenient to substitute, following the restriction of Equation (2), a reparameterization of the NRM probability (i.e., $P_{iv|u=0}(\theta_j)$) that reduces the number of parameters by $2n$. Following Bock (1972), instead of estimating $\zeta_v$ and $\lambda_v$ ($v = 1, \ldots, m_i$), we use parameters $\eta_v$ and $\xi_v$ ($v = 1, \ldots, m_i - 1$) that are defined by difference contrasts of the parameters $\zeta_v$ and $\lambda_v$. For example, when $m_i = 3$, the new parameters are defined as

$$B_{2\times 3}\,T_{3\times 2} = \begin{bmatrix}\zeta_1 & \zeta_2 & \zeta_3\\ \lambda_1 & \lambda_2 & \lambda_3\end{bmatrix}\begin{bmatrix}1 & 1\\ -1 & 0\\ 0 & -1\end{bmatrix} = \begin{bmatrix}\zeta_1 - \zeta_2 & \zeta_1 - \zeta_3\\ \lambda_1 - \lambda_2 & \lambda_1 - \lambda_3\end{bmatrix} = \begin{bmatrix}\eta_1 & \eta_2\\ \xi_1 & \xi_2\end{bmatrix}, \quad (7)$$

where $T$ is a transformation matrix. The likelihood equations can be derived with respect to these new parameters $\eta_v$ and $\xi_v$ for the distractor categories, as well as with respect to the item parameters for the correct response category ($\beta$, $\alpha$, and $\gamma$). Suppose $\omega_{ih}$ represents an item parameter to be estimated for item $i$ and category $h$. The likelihood in Equation (6) can be differentiated with respect to $\omega_{ih}$ as

$$\frac{\partial \log L}{\partial \omega_{ih}} = \sum_{j=1}^{N}\left\{P([U_j, D_j])\right\}^{-1}\int\left[\frac{\partial \log P([U_j, D_j] \mid \theta, \varpi)}{\partial \omega_{ih}}\right]P([U_j, D_j] \mid \theta, \varpi)\,g(\theta \mid \tau)\,d\theta, \quad (8)$$

where

$$\log P([U_j, D_j] \mid \theta, \varpi) = \sum_{i=1}^{n}\left\{u_{ij}\log P_i(\theta) + \sum_{v=1}^{m_i}\left[d_{ijv}\log\left(1 - P_i(\theta)\right) + d_{ijv}\log P_{iv|u=0}(\theta)\right]\right\}.$$

When derived for the correct response category of item $i$, the first partial derivative of Equation (8) with respect to $\omega_{ih}$ can be written as

$$\frac{\partial}{\partial \omega_{ih}}\log P([U_j, D_j] \mid \theta, \varpi) = \frac{\partial}{\partial \omega_{ih}}\left[u_{ij}\log P_i(\theta) + \sum_{v=1}^{m_i} d_{ijv}\log\left(1 - P_i(\theta)\right)\right] = \frac{\partial}{\partial \omega_{ih}}\left[u_{ij}\log P_i(\theta) + (1 - u_{ij})\log\left(1 - P_i(\theta)\right)\right]. \quad (9)$$

The summation across items in Equation (8) can be eliminated by assuming that the item parameter estimates are independent across items. As shown in Equation (9), the estimation of the

correct category parameters proceeds independently of the distractor category parameters. The first derivative for a distractor category parameter reduces to

$$\frac{\partial}{\partial \omega_{ih}}\log P([U_j, D_j] \mid \theta, \varpi) = \frac{\partial}{\partial \omega_{ih}}\left[\sum_{v=1}^{m_i} d_{ijv}\log P_{iv|u=0}(\theta)\right], \quad (10)$$

implying the distractor category parameters can be estimated independently of the correct category parameters. Estimates of the new parameters $\eta$ and $\xi$ can then be used to find the values of the estimates of the original and more conventional parameters $\zeta$ and $\lambda$ for each item. For the case in which $m_i = 3$, using Equation (7) and the constraints (i.e., $\sum_{v=1}^{m_i}\zeta_{iv} = 0$ and $\sum_{v=1}^{m_i}\lambda_{iv} = 0$) yields

$$\zeta_1 = \frac{\eta_1 + \eta_2}{3}, \qquad \zeta_2 = \frac{\eta_2 - 2\eta_1}{3}, \qquad \zeta_3 = \frac{\eta_1 - 2\eta_2}{3}, \quad (11)$$

and

$$\lambda_1 = \frac{\xi_1 + \xi_2}{3}, \qquad \lambda_2 = \frac{\xi_2 - 2\xi_1}{3}, \qquad \lambda_3 = \frac{\xi_1 - 2\xi_2}{3}. \quad (12)$$

An EM estimation algorithm was programmed in FORTRAN (Digital Equipment Corporation, 1997). The quadrature points and weights, and the initial values of the item parameters, were chosen using the same default values as in BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 2003). The convergence criterion for both the Newton-Raphson iterations and the EM algorithm, in terms of parameter change, was set to 0.0001, and the maximum number of EM cycles to 200. Additional details on the derivations of the likelihood equations and the implementation of the EM algorithm (including the procedure for computing the standard errors of the item parameter estimates), as well as the software, can be obtained from the first author.

2.3. Information Functions for the 3PL-NLM

A potential advantage of the NLMs relates to their quantification of item information, and the ease of studying the relative contribution of distractor categories. Due to the distractor collapsibility property, it becomes possible to directly compare the relative amounts of information provided when including versus excluding distractor information using the estimates of just one model.
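This comparison can be illustrated numerically without any closed-form information expressions. For a polytomous item, the Fisher information at $\theta$ is $\sum_c P'_c(\theta)^2 / P_c(\theta)$; applying the same formula to the collapsed correct/incorrect probabilities gives the information available from binary scoring, and the difference isolates the distractor contribution. A minimal sketch for a single 2PL-NLM item, with hypothetical parameter values and numerical derivatives standing in for the analytic ones:

```python
import numpy as np

def category_probs(theta, alpha, beta, lam, zeta):
    """2PL-NLM probabilities [correct, distractor 1..m_i] at ability theta."""
    p = 1.0 / (1.0 + np.exp(-(beta + alpha * theta)))   # 2PLM correct-response level
    z = zeta + lam * theta                              # conditional NRM propensities
    cond = np.exp(z) / np.exp(z).sum()
    return np.concatenate(([p], (1.0 - p) * cond))

def fisher_info(prob_fn, theta, h=1e-5):
    """Multinomial Fisher information sum_c P_c'(theta)^2 / P_c(theta)."""
    dp = (prob_fn(theta + h) - prob_fn(theta - h)) / (2.0 * h)
    return float((dp ** 2 / prob_fn(theta)).sum())

# Hypothetical item: one correct category plus three distractors
alpha, beta = 1.2, 0.3
lam = np.array([-0.9, 0.2, 0.7])     # sum-to-zero distractor slopes
zeta = np.array([0.5, -0.1, -0.4])   # sum-to-zero distractor intercepts

full = lambda t: category_probs(t, alpha, beta, lam, zeta)
collapsed = lambda t: np.array([full(t)[0], 1.0 - full(t)[0]])

gains = [fisher_info(full, t) - fisher_info(collapsed, t) for t in (-1.0, 0.0, 1.0)]
```

Because the correct-response curve is identical under both scorings, each gain is nonnegative and measures exactly what attending to the distractors adds.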
As noted earlier, such a property is not present in the NRM, where the collapsing of distractor categories results in a change to the correct response ICCC that also implies a change in information. Such a feature makes it difficult to compare information under the two different forms of scoring. Information functions are particularly useful for the NLMs, as they can be used to quantify the increase in the precision of ability estimates when attending to distractors. Information functions for the 3PL-NLM can be derived as follows. As shown in Equation (5), the conditional probability of a response pattern matrix for examinee $j$ is written as

$$L_j = P([U_j, D_j] \mid \theta_j, \varpi) = \prod_{i=1}^{n}\left[P_i(\theta_j)^{u_{ij}}\prod_{v=1}^{m_i} Q_i(\theta_j)^{d_{ijv}} P_{iv|u=0}(\theta_j)^{d_{ijv}}\right], \quad (13)$$

where

$$P_i(\theta_j) = \gamma_i + (1 - \gamma_i)\frac{1}{1 + \exp[-(\beta_i + \alpha_i\theta_j)]}, \qquad Q_i(\theta_j) = 1 - P_i(\theta_j),$$

and

$$P_{iv|u=0}(\theta_j) = \frac{\exp Z_{iv}(\theta_j)}{\sum_{k=1}^{m_i}\exp Z_{ik}(\theta_j)}.$$

To simplify notation, let $P_i(\theta_j) = P_{ij}$, $Q_i(\theta_j) = Q_{ij}$, $P_{iv|u=0}(\theta_j) = P_{ijv|u=0}$, and $Z_{ik}(\theta_j) = Z_{ijk}$. Then, taking the natural logarithm of the likelihood function for examinee $j$ yields

$$\log L_j = \sum_{i=1}^{n}\left\{u_{ij}\log P_{ij} + \sum_{v=1}^{m_i} d_{ijv}\left[\log Q_{ij} + \log P_{ijv|u=0}\right]\right\}, \quad (14)$$

and the first partial derivative of the log-likelihood with respect to $\theta_j$ is

$$\frac{\partial\log L_j}{\partial\theta_j} = \sum_{i=1}^{n}\left\{u_{ij}\,\alpha_i\frac{P_{ij}^{*}Q_{ij}}{P_{ij}} + \sum_{v=1}^{m_i} d_{ijv}\left[-\alpha_i P_{ij}^{*} + \frac{\sum_{k=1}^{m_i} e^{Z_{ijk}}(\lambda_{iv} - \lambda_{ik})}{S_{ij}}\right]\right\}, \quad (15)$$

where

$$P_i^{*}(\theta_j) = \frac{1}{1 + \exp[-(\beta_i + \alpha_i\theta_j)]}$$

and $S_{ij} = \sum_{k=1}^{m_i}\exp Z_{ijk}$. Then, the second partial derivative of the log-likelihood with respect to $\theta_j$ is

$$\frac{\partial^2\log L_j}{\partial\theta_j^2} = \sum_{i=1}^{n}\left\{-u_{ij}\,\alpha_i^2 P_{ij}^{*}Q_{ij}^{*}\frac{P_{ij}^2 - \gamma_i}{P_{ij}^2} + \sum_{v=1}^{m_i} d_{ijv}\left[-\alpha_i^2 P_{ij}^{*}Q_{ij}^{*} + \frac{\partial}{\partial\theta_j}\left(\frac{\sum_{k=1}^{m_i} e^{Z_{ijk}}(\lambda_{iv} - \lambda_{ik})}{S_{ij}}\right)\right]\right\}, \quad (16)$$

where

$$\frac{\partial}{\partial\theta_j}\left(\frac{\sum_{k=1}^{m_i} e^{Z_{ijk}}(\lambda_{iv} - \lambda_{ik})}{S_{ij}}\right) = \frac{S_{ij}\sum_{k=1}^{m_i} e^{Z_{ijk}}\lambda_{ik}(\lambda_{iv} - \lambda_{ik}) - \sum_{k=1}^{m_i} e^{Z_{ijk}}(\lambda_{iv} - \lambda_{ik})\sum_{k=1}^{m_i}\lambda_{ik} e^{Z_{ijk}}}{S_{ij}^2} \quad (17)$$

and $Q_i^{*}(\theta_j) = 1 - P_i^{*}(\theta_j)$. This second partial derivative contains observed data values. Following usual practice (Kendall & Stuart, 1967), $u_{ij}$ and $d_{ijv}$ are replaced by their expected values $P_{ij}$ and $(1 - P_{ij})P_{ijv|u=0}$, respectively, resulting in

$$E\left(-\frac{\partial^2\log L_j}{\partial\theta_j^2}\right) = \sum_{i=1}^{n}\left[P_{ij}\, I_{iu}(\theta_j) + \sum_{v=1}^{m_i} P_{ijv}\, I_{iv}(\theta_j)\right], \quad (18)$$

where $P_{ijv} = (1 - P_{ij})P_{ijv|u=0}$. The test information function is then given by

$$I(\theta_j) = E\left(-\frac{\partial^2\log L_j}{\partial\theta_j^2}\right) = \sum_{i=1}^{n}\left[P_{ij}\, I_{iu}(\theta_j) + \sum_{v=1}^{m_i} P_{ijv}\, I_{iv}(\theta_j)\right]. \quad (19)$$

For each item, the item information function is given by

$$I_i(\theta_j) = P_{ij}\, I_{iu}(\theta_j) + \sum_{v=1}^{m_i} P_{ijv}\, I_{iv}(\theta_j), \quad (20)$$

where $P_{ij} I_{iu}(\theta_j)$ is the contribution of the correct response category to item information, and $P_{ijv} I_{iv}(\theta_j)$ is the contribution of distractor category $v$. Each of these terms is referred to as the information share of a category (Baker & Kim, 2004; Samejima, 1969, 1972, 1977). Here, the information share of the correct response category is

$$P_{ij}\, I_{iu}(\theta_j) = \left[\gamma_i + (1 - \gamma_i)\frac{1}{1 + \exp[-(\beta_i + \alpha_i\theta_j)]}\right]\alpha_i^2 P_{ij}^{*}Q_{ij}^{*}\frac{P_{ij}^2 - \gamma_i}{P_{ij}^2}, \quad (21)$$

the same as in the traditional 3PLM, while the information share of any distractor category $v$ is

$$P_{ijv}\, I_{iv}(\theta_j) = \left\{1 - \left[\gamma_i + (1 - \gamma_i)\frac{1}{1 + \exp[-(\beta_i + \alpha_i\theta_j)]}\right]\right\}\frac{\exp Z_{ijv}}{\sum_{k=1}^{m_i}\exp Z_{ijk}}\left[\alpha_i^2 P_{ij}^{*}Q_{ij}^{*} + \frac{\sum_{k=1}^{m_i} e^{Z_{ijk}}(\lambda_{iv} - \lambda_{ik})\sum_{k=1}^{m_i}\lambda_{ik} e^{Z_{ijk}} - S_{ij}\sum_{k=1}^{m_i} e^{Z_{ijk}}\lambda_{ik}(\lambda_{iv} - \lambda_{ik})}{S_{ij}^2}\right], \quad (22)$$

allowing for quantification of the incremental information provided by the distractor categories.

3. Simulation Studies

Simulation studies were conducted to investigate (1) the parameter recovery of the NLMs, (2) the empirical distinguishability of the 2PL-NLM, 3PL-NLM, and NRM, and (3) the statistical performance of the NLMs in testing for and quantifying distractor information.

3.1. Simulation Study Designs

3.1.1. Simulation 1: Parameter Recovery Study. Parameter recovery for the 2PL-NLM and 3PL-NLM was evaluated for varying sample size (1,000 and 5,000 examinees) and test length (10-, 20-, and 50-item tests) conditions. For each combination of conditions, examinee ability parameters were generated as $\theta \sim \text{Normal}(0, 1)$. Item parameters were generated randomly from the following distributions: $\alpha \sim \text{Uniform}(0.75, 2)$ and $\beta \sim \text{Uniform}(-2.5, 2.5)$ for the correct response category, and $\lambda_v \sim \text{Uniform}(-2, 2)$ and $\zeta_v \sim \text{Uniform}(-2, 2)$ for the distractor categories, followed by the imposition of the constraints $\sum_{v=1}^{m_i}\zeta_{iv} = 0$ and $\sum_{v=1}^{m_i}\lambda_{iv} = 0$.
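The generating scheme just described, together with 2PL-NLM response simulation, can be sketched as follows. Centering the raw uniform draws is an assumed way of imposing the sum-to-zero constraints (the paper does not state the mechanism), and all parameter values produced are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(123)

def draw_item_params(m_i=3):
    """Draw 2PL-NLM item parameters per the Simulation 1 design.

    Centering the Uniform(-2, 2) draws is one way (an assumption here) of
    imposing the sum-to-zero constraints on the distractor parameters.
    """
    alpha = rng.uniform(0.75, 2.0)
    beta = rng.uniform(-2.5, 2.5)
    lam = rng.uniform(-2.0, 2.0, m_i)
    zeta = rng.uniform(-2.0, 2.0, m_i)
    return alpha, beta, lam - lam.mean(), zeta - zeta.mean()

def sim_responses(theta, alpha, beta, lam, zeta):
    """Simulate one item's responses: 0 = correct, 1..m_i = distractor chosen."""
    p_correct = 1.0 / (1.0 + np.exp(-(beta + alpha * theta)))  # Equation (3), gamma = 0
    correct = rng.random(theta.size) < p_correct
    z = zeta[None, :] + np.outer(theta, lam)                   # distractor propensities
    cond = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)    # conditional NRM, Eq. (4)
    picks = np.array([rng.choice(lam.size, p=w) + 1 for w in cond])
    return np.where(correct, 0, picks)

theta = rng.standard_normal(1000)
alpha, beta, lam, zeta = draw_item_params()
resp = sim_responses(theta, alpha, beta, lam, zeta)
```

Repeating the draw across items and stacking the per-item response vectors reproduces the four-category datasets used in the recovery study.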
Four-category item responses (one correct response and three distractor categories) were generated following either the 2PL-NLM or 3PL-NLM. The same item parameters were applied to generate both the 2PL-NLM and 3PL-NLM data, with $\gamma$ for the 3PL-NLM set at 0.25 for all items. As has been observed when estimating the 3PLM, the $\gamma$ parameter is generally difficult to recover without a prior; our intent in using a constant value of 0.25 was to match the true parameter with the prior so as to better ascertain the impact of the presence of the guessing parameter on the recovery of the other parameters. Thus, the $\gamma$ parameter was assigned a beta prior with parameters of 5 and 15 during the EM process for item parameter estimation (for detailed procedures, see Baker & Kim, 2004, pp. 188-191). One hundred replications were simulated for each combination of conditions. The accuracy

of item parameter estimates was evaluated with respect to bias (estimated minus true) and root mean squared error (RMSE). For the distractor category parameters, the estimated $\eta_{iv}$ and $\xi_{iv}$ ($v = 1, \ldots, m_i - 1$) were converted to $\zeta_{iv}$ and $\lambda_{iv}$ ($v = 1, \ldots, m_i$). In order to demonstrate ability parameter estimation and the value of attending to distractors under the NLM, Expected a Posteriori (EAP) estimates were obtained under the 2PLM and 2PL-NLM for the 10- and 50-item test length conditions. Response patterns for 5,000 examinees were simulated at each of 13 discrete $\theta$ values ranging from $-3.4$ to $3.4$. Bias and RMSEs were then computed at each of the generated $\theta$ levels with respect to both the 2PLM and 2PL-NLM. In addition, the test information functions under the two models were evaluated at each of the $\theta$ levels.

3.1.2. Simulation 2: Model Comparison Study. To evaluate the empirical distinguishability of the NLMs and NRM, a second simulation study was conducted in which the NLMs and NRM were fit to data generated from each of the three models. The models were compared using several likelihood-based criteria: AIC (Akaike, 1974), BIC (Schwarz, 1978), and CAIC (Bozdogan, 1987). Item response data were simulated using as generating parameters the corresponding item parameter estimates from the 36 items studied in the real data analysis reported in the next section (see Table 5 of Section 4 for the 2PL-NLM estimates¹). In each case, data were generated for 3,000 examinees as $\theta \sim \text{Normal}(0, 1)$. One hundred datasets were simulated with respect to each of the 2PL-NLM, 3PL-NLM, and NRM. Each of these datasets was then fit using each of the three models, with the AIC, BIC, and CAIC applied to evaluate whether the correct generating model was identified. For the 3PL-NLM, the component of the log-likelihood associated with the prior² was removed when calculating the indices.

3.1.3. Simulation 3: Distractor Information Study.
As noted, an important practical benefit of the NLMs is their potential to quantify the contribution of distractor information to the overall information provided by an item. We consider two aspects of this process: (1) testing whether distractors provide incremental information; and (2) quantifying the information provided. In testing for information, we apply a likelihood ratio (LR) test comparing a model in which the distractors are assumed to provide no information (a reduced model) against one in which they do (an augmented model). The reduced model under the 2PL-NLM and 3PL-NLM assumes λ_v = 0 for all distractor categories. The LR test was performed on an item-by-item basis, with the reduced model constraining only the studied item; all other items were fit so as to allow distractor information. While such a test may be desirable for determining whether distractors provide information, in actual practice greater value may be placed on the quantification of the information provided. Following Section 2, such a quantification is provided by the estimated information share of the distractor categories, as computed from the item parameter estimates. In order to evaluate the performance of the LR test and the precision of distractor information estimates, a third simulation was conducted. The first part of the simulation evaluated the Type I error performance of the LR test. Data were generated from each of the 2PL-NLM, 3PL-NLM, and NRM. The generating item parameters were based on those observed as estimates in the real data analysis, but with restrictions imposed to reflect the reduced condition for each of the three models. For the NRM, the reduced condition was simulated by setting λ_1 = ··· = λ_4 = α/4 across the distractor categories. Data were generated for 3,000 examinees and 36 items.

1 The 3PL-NLM and NRM item parameter estimates (and their standard errors) are available from the first author upon request.
2 A beta prior with parameters of 4 and 16 was used when estimating the guessing parameters in this case, where each item has five response categories.
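The shrinkage induced by these beta priors can be checked with a short sketch (the helper functions below are ours, not from the paper): a Beta(a, b) prior contributes (a − 1) log γ + (b − 1) log(1 − γ) to the M-step objective and has mean a/(a + b), so Beta(5, 15) centers γ at 0.25 and Beta(4, 16) at 0.20, chance level for five response categories.

```python
import math

def beta_log_prior(gamma, a, b):
    """Log-density of Beta(a, b) at gamma, up to the normalizing constant.

    In MML/EM estimation this term is added to the marginal log-likelihood
    so that the M-step shrinks the guessing parameter toward the prior.
    """
    return (a - 1) * math.log(gamma) + (b - 1) * math.log(1 - gamma)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Beta(5, 15): prior for the four-category simulation items (mean 0.25).
# Beta(4, 16): prior for the five-category real-data items (mean 0.20).
print(beta_mean(5, 15))  # 0.25
print(beta_mean(4, 16))  # 0.2
```

Matching the generating value 0.25 to the Beta(5, 15) prior mean, as the simulation does, means any remaining bias in the other parameters is not attributable to prior misspecification.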

464 PSYCHOMETRIKA

TABLE 1.
Simulation 1 results: average bias and RMSE for correct category parameters.

                   2PL-NLM                          3PL-NLM
           N = 1000      N = 5000         N = 1000            N = 5000
   n      α      β      α      β       α      β      γ      α      β      γ
Bias
  10    0.01   0.02   0.00   0.02   −0.07   0.15   0.03   −0.06   0.12   0.02
  20    0.01   0.01   0.00   0.02   −0.05   0.12   0.03   −0.05   0.11   0.02
  50    0.00   0.00   0.00   0.01   −0.06   0.12   0.02   −0.04   0.08   0.02
RMSE
  10    0.12   0.10   0.05   0.05    0.22   0.25   0.23    0.12   0.17   0.09
  20    0.11   0.10   0.05   0.05    0.19   0.20   0.20    0.10   0.14   0.08
  50    0.11   0.09   0.05   0.04    0.18   0.20   0.19    0.10   0.13   0.08

TABLE 2.
Simulation 1 results: average RMSE for distractor category parameters.

               2PL-NLM                  3PL-NLM
         N = 1000   N = 5000      N = 1000   N = 5000
   n     λ     ζ    λ     ζ       λ     ζ    λ     ζ
RMSE
  10   0.19  0.13  0.08  0.06   0.24  0.17  0.10  0.07
  20   0.16  0.13  0.07  0.06   0.20  0.15  0.09  0.07
  50   0.15  0.13  0.07  0.06   0.19  0.16  0.08  0.07

3.2. Simulation Study Results

3.2.1. Simulation 1: Parameter Recovery Study Results. Bias and RMSEs for the distractor category parameters were collapsed across item categories and items to create an average bias and RMSE for each distractor category parameter type. Similarly, the recovery results for the correct response category parameters were averaged across items. Results for all conditions are provided in Tables 1 and 2. Bias for the correct response parameters is close to zero under the 2PL-NLM for all conditions, implying no apparent evidence of systematic underestimation or overestimation.
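The bias and RMSE summaries reported in Tables 1 and 2 follow their usual definitions across replications; a minimal sketch (the helper names are ours):

```python
def bias(estimates, true_value):
    """Mean of (estimate - true) across replications."""
    return sum(e - true_value for e in estimates) / len(estimates)

def rmse(estimates, true_value):
    """Root mean squared error across replications."""
    return (sum((e - true_value) ** 2 for e in estimates) / len(estimates)) ** 0.5

# Toy check: estimates centered on the true value have near-zero bias
# but nonzero RMSE (sampling variability).
est = [0.9, 1.1, 1.0, 0.8, 1.2]
print(round(bias(est, 1.0), 6))  # 0.0
print(round(rmse(est, 1.0), 6))  # ~0.141421
```

In the study these quantities are computed per parameter and replication set, then averaged across items (and, for distractor parameters, across categories).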
(It should be noted that the bias for the distractor parameters is forced to 0 by the constraint in Equation (2).) Bias is somewhat larger under the 3PL-NLM (i.e., positive for β and negative for α), which may be attributed in part to the greater influence of the priors and to small departures of the generating parameter distribution from that assumed by the priors. RMSEs for the correct response parameters are smaller than those for the distractor parameters under the 2PL-NLM. Similar patterns across conditions are found for the 2PL-NLM and 3PL-NLM. No systematic patterns related to the number of items are apparent. Also, as expected, the

RMSEs for the 3PL-NLM were consistently larger than for the 2PL-NLM. Most importantly, the overall results seen here appear comparable to those previously observed using MML techniques under the NRM (Wollack, Bolt, Cohen, & Lee, 2002), suggesting that the NLMs are at least as good as the NRM in terms of parameter recovery. We further confirmed the consistency of our recovery results against both the 2PLM and 3PLM by estimating our generated data with BILOG-MG using the same γ prior; the results were effectively the same. Bias and RMSEs for θ are provided in Table 3. Based on the test information functions shown in Figure 3, the distractors provide their greatest relative increases in information at low and intermediate levels of θ, as expected, since distractors are more commonly selected by examinees not of high ability. These results are also supported by the RMSEs of Table 3, where the greatest relative declines in RMSE when moving from the 2PLM to the 2PL-NLM are seen at lower θ levels.

TABLE 3.
Simulation 1 results: average bias and RMSE for ability parameters.

                     Bias                          RMSE
              2PLM        2PL-NLM          2PLM        2PL-NLM
   θ        10     50     10     50      10     50     10     50
 −3.4     1.97   0.93   1.53   0.53    1.98   0.96   1.54   0.59
 −2.9     1.44   0.54   1.03   0.28    1.46   0.61   1.06   0.39
 −2.3     0.96   0.26   0.61   0.12    1.01   0.40   0.68   0.29
 −1.7     0.57   0.11   0.27   0.04    0.68   0.32   0.42   0.23
 −1.1     0.28   0.04   0.05   0.00    0.52   0.27   0.36   0.17
 −0.6     0.09   0.00   0.02   0.00    0.48   0.24   0.35   0.15
  0.0     0.03   0.01   0.01   0.01    0.48   0.23   0.34   0.15
  0.6    −0.16  −0.02  −0.02  −0.01    0.49   0.23   0.36   0.17
  1.1    −0.30  −0.05  −0.11  −0.02    0.54   0.26   0.37   0.22
  1.7    −0.50  −0.10  −0.34  −0.07    0.66   0.30   0.47   0.27
  2.3    −0.77  −0.23  −0.65  −0.20    0.86   0.38   0.72   0.35
  2.9    −1.08  −0.43  −1.01  −0.40    1.14   0.52   1.05   0.49
  3.4    −1.49  −0.70  −1.45  −0.68    1.52   0.74   1.47   0.72
 Weighted average
          0.02   0.00   0.02   0.00    0.54   0.26   0.38   0.18
Not surprisingly, there is also a reduction in bias under the 2PL-NLM at the extreme low θ levels, again owing to the greater amount of information about θ provided by attending to distractors. The results of Table 3 appear to consistently support the value of attending to distractor information. The weighted average statistics in the final row of Table 3 show that when assuming a Normal(0, 1) distribution for θ, we appear to get an approximately 30% decrease in RMSEs. 3.2.2. Simulation 2: Model Comparison Study Results. Table 4 shows the number of times out of 100 that each model fits best according to each likelihood criterion for each generating model. Although there is some tendency for confusion between the 2PL-NLM and 3PL-NLM when the 3PL-NLM is the generating model (particularly under CAIC), the distinction between the NRM and NLMs seems clearer. The sometimes better comparative fit for the 2PL-NLM compared to the 3PL-NLM when the 3PL-NLM is the generating model is perhaps not surprising, as the ultimate value of the pseudo-guessing parameter is often questionable, especially when the majority of items are relatively easy. Overall, it would thus appear that as statistical models, the NLM and NRM approaches not only provide competing structural representations, but also ones that may be statistically distinguishable when applied to actual test data.
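The likelihood-based criteria used in this comparison follow their standard definitions (Akaike, 1974; Schwarz, 1978; Bozdogan, 1987); a minimal sketch with our own helper names, where k is the number of free parameters and N the number of examinees:

```python
import math

def aic(neg2_loglik, k):
    """AIC = -2logL + 2k (Akaike, 1974)."""
    return neg2_loglik + 2 * k

def bic(neg2_loglik, k, n):
    """BIC = -2logL + k*ln(N) (Schwarz, 1978)."""
    return neg2_loglik + k * math.log(n)

def caic(neg2_loglik, k, n):
    """CAIC = -2logL + k*(ln(N) + 1) (Bozdogan, 1987)."""
    return neg2_loglik + k * (math.log(n) + 1)

# Toy illustration with hypothetical values (not figures from the paper):
# lower values indicate better penalized fit, and CAIC penalizes each
# parameter by exactly one unit more than BIC.
print(aic(100.0, 5))  # 110.0
```

Because BIC and CAIC penalize parameters more heavily than AIC, they are the criteria most prone to preferring the 2PL-NLM over the 3PL-NLM when the extra guessing parameter contributes little, consistent with the confusion pattern in Table 4.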

FIGURE 3. Test information under the 2PLM and 2PL-NLM.

TABLE 4.
Simulation 2 results: frequencies of model selection.

                       True model: NRM         True model: 2PL-NLM      True model: 3PL-NLM
Estimated model:     NRM  2PL-NLM  3PL-NLM    NRM  2PL-NLM  3PL-NLM    NRM  2PL-NLM  3PL-NLM
−2logL                98      2       0         0     100       0        0       0     100
AIC                   98      2       0         0     100       0        0       0     100
BIC                   98      2       0         0     100       0        0       5      95
CAIC                  98      2       0         0     100       0        0      21      79

3.2.3. Simulation 3: Distractor Information Study Results. In evaluating the Type I error performance of the LR test, we considered alpha levels of 0.05 and 0.01, as well as a Bonferroni-corrected level of 0.05 (p = 0.0014). When using the 2PL-NLM as both the generating and fitted model, we observe clear evidence of Type I error inflation, with rejection rates of 0.14, 0.04, and 0.01, respectively, averaging across the 36 items. Similarly, when using the 3PL-NLM, the corresponding rejection rates are 0.17, 0.06, and 0.02. Even greater inflation is observed when using the NRM as the generating model, with rates of 0.25, 0.11, and 0.04, respectively, for the 2PL-NLM, and 0.22, 0.09, and 0.03 for the 3PL-NLM. Overall, there is clear evidence of Type I error inflation in applying the LR test, and potential for mistaken inferences when relying solely on the LR test as a basis for including distractor information. At the same time, however, we note that in virtually all Type I error cases the estimated distractor information is near 0, even when the NRM is the generating model. As most practitioners would be expected to attend to the amount of distractor information when deciding whether to attend to distractors, we conducted a follow-up study evaluating the accuracy of the NLMs in recovering the amount of distractor information.
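Recovery accuracy of this kind is summarized in the study by the correlation and mean absolute difference (MAD) between estimated and true information shares; a minimal sketch with our own helper names and toy values:

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def mad(x, y):
    """Mean absolute difference between estimated and true values."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Hypothetical information shares for four items (illustrative only).
true_share = [0.10, 0.25, 0.40, 0.05]
est_share = [0.12, 0.23, 0.42, 0.06]
print(round(pearson_r(true_share, est_share), 3))
print(round(mad(true_share, est_share), 4))
```

A high correlation with a small MAD, as in the results reported next, indicates that the estimated shares both rank the items correctly and track the true magnitudes closely.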

Regardless of whether the 2PL-NLM, 3PL-NLM, or NRM is the generating model, the relative amount of information provided by distractors appears well recovered. When the 2PL-NLM was the generating model, the correlation between the estimated and true distractor information was 0.97 when fitting the 2PL-NLM and 0.95 when fitting the 3PL-NLM; the mean absolute differences (MADs) were 0.01 and 0.04, respectively. When the 3PL-NLM was the generating model, the respective correlations were 0.93 and 0.98 and the MADs were 0.05 and 0.02, while when the NRM was the generating model, the correlations were still 0.98 and 0.93 and the MADs were 0.02 and 0.09, suggesting that, even in the presence of some model misspecification, recovery remains quite good.

4. Real Data Illustration

Data from a 36-item college-level mathematics placement test (Center for Placement Testing, 1998) were analyzed. For purposes of reporting model estimates and testing for distractor information, 3,000 examinees were randomly selected from a full dataset of 12,800 examinees. Each item contained five response categories (one correct response and four distractor categories). Inspection of the items suggested a response process more consistent with that discussed in relation to example item 2 shown earlier in the paper. That is, most items would appear to be best solved through a problem-solving strategy that initially does not consider the response options. Both the 2PL-NLM and 3PL-NLM were investigated as potential competitors to the NRM. The overall −2 loglikelihood for the 2PL-NLM was 280185.34, as compared to 280264.33 for the NRM (when the NRM was fit using the same algorithm). As both models possess the same number of parameters, the 2PL-NLM would appear to provide a better representation of the data. Table 5 displays the 2PL-NLM estimates.
The average standard errors for the distractor category slopes and intercepts were both 0.06; for the correct response slope and intercept they were 0.05 and 0.04, respectively. To further examine how the NLMs compare to the NRM in terms of model fit, 10 nonoverlapping random samples, each of 1,000 examinees, were drawn from the full dataset. Each of the 2PL-NLM, 3PL-NLM, and NRM was fit to the 10 datasets. Table 6 shows the −2 loglikelihoods under each model, as well as the AIC, BIC, and CAIC indices. On the whole, the 2PL-NLM shows comparatively better fit than the 3PL-NLM and NRM across all 10 samples. Figure 4 shows plots of information functions for several example items: the item information share of distractor categories 1–4 (Equation (22)) and the correct response category 5 (Equation (21)), as well as the total item information function (Equation (20)). Apparent from these graphs is the substantial variability across items in the contribution of distractors to overall item information. For example, items 6 and 7 show large amounts of information in both the correct response category and most of the distractor categories, while items 12 and 32 show very small amounts of information, especially in the distractor categories. As noted, the item information share of distractor categories can be used as the basis for a decision about whether to collapse distractor categories and score items simply as correct/incorrect. Using the item parameter estimates from the 2PL-NLM, a comparison of item information functions when including versus excluding distractor information can be performed without any revision of the model, as the 2PL-NLM item parameter estimates for the correct responses are also the 2PLM item parameter estimates. The item information functions under the 2PLM and 2PL-NLM are plotted in Figure 5. Note that for items 6 and 7, the 2PL-NLM provides a larger amount of information at relatively low ability levels than does the 2PLM.
Items 12 and 32 yield almost the same information under the two models, suggesting virtually no practical advantage to considering distractor information.
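The collapsibility property used here can be illustrated with a sketch of nested logit category probabilities. The exact parameterization is given by Equations (3)-(4) of the paper, which are not reproduced in this excerpt; the version below simply assumes a 2PL correct-response probability and a multinomial logit among the distractors, so that collapsing the distractors leaves the correct-response (2PLM) component unchanged. The illustrative parameter values are taken from the item 6 row of Table 5 (negative signs on those estimates may have been lost in extraction).

```python
import math

def p_correct(theta, alpha, beta):
    """2PL probability of a correct response (level 1 of the nest)."""
    return 1.0 / (1.0 + math.exp(-(alpha * theta + beta)))

def p_distractors(theta, alpha, beta, lambdas, zetas):
    """Probability of each distractor category (level 2 of the nest):
    P(incorrect) times a multinomial logit among the distractors.
    Parameterization assumed for illustration, not quoted from the paper."""
    p_inc = 1.0 - p_correct(theta, alpha, beta)
    terms = [math.exp(l * theta + z) for l, z in zip(lambdas, zetas)]
    total = sum(terms)
    return [p_inc * t / total for t in terms]

theta = 0.5
alpha, beta = 0.99, 0.07            # item 6 correct-response estimates
lambdas = [0.12, 0.44, 0.53, 0.21]  # item 6 distractor slopes
zetas = [0.48, 0.53, 0.82, 0.86]    # item 6 distractor intercepts
probs = p_distractors(theta, alpha, beta, lambdas, zetas)
# The five category probabilities sum to 1, and collapsing the four
# distractors into a single "incorrect" category leaves P(correct) intact.
print(abs(sum(probs) + p_correct(theta, alpha, beta) - 1.0) < 1e-12)  # True
```

This separation is exactly why the correct-response parameters of the 2PL-NLM can be read as 2PLM parameters: the level-2 distractor terms only divide up the incorrect-response probability.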

TABLE 5.
2PL-NLM parameter estimates, mathematics placement test.

             Slopes                            Intercepts
Item     λ1     λ2     λ3     λ4      α      ζ1     ζ2     ζ3     ζ4      β
  1    0.14   0.01   0.21   0.06   0.48   0.69   0.32   0.58   0.21   0.32
  2    0.44   0.11   0.26   0.07   1.55   0.92   0.21   1.74   0.60   0.62
  3    0.13   0.08   0.03   0.24   0.96   0.19   0.15   0.43   0.78   0.88
  4    0.09   0.43   0.49   0.02   0.59   0.14   1.11   0.62   0.35   0.08
  5    0.13   0.07   0.08   0.02   0.41   0.30   0.25   0.28   0.23   0.36
  6    0.12   0.44   0.53   0.21   0.99   0.48   0.53   0.82   0.86   0.07
  7    0.02   0.50   0.85   1.33   1.01   0.78   0.70   1.96   1.87   0.09
  8    0.01   0.01   0.14   0.17   0.82   0.41   0.36   0.29   0.24   0.31
  9    0.25   0.32   0.05   0.02   1.49   0.27   1.04   0.20   0.97   0.65
 10    0.07   0.18   0.14   0.25   0.47   0.43   0.33   0.14   0.62   1.07
 11    0.07   0.34   0.22   0.19   0.81   0.92   0.69   0.15   0.38   1.42
 12    0.10   0.01   0.02   0.07   0.86   0.18   0.54   0.07   0.42   0.40
 13    0.06   0.39   0.22   0.11   0.86   0.20   1.38   0.86   0.33   0.77
 14    0.37   0.18   0.18   0.38   1.23   1.05   0.18   0.21   1.44   0.36
 15    0.20   0.27   0.39   0.46   0.97   1.08   0.84   0.43   0.18   0.18
 16    0.07   0.33   0.31   0.10   0.65   0.47   0.11   0.40   0.19   0.31
 17    0.26   0.32   0.12   0.06   0.51   0.72   0.28   0.46   0.90   0.21
 18    0.16   0.26   0.01   0.40   1.01   0.70   0.31   0.17   0.85   1.43
 19    0.28   0.28   0.05   0.05   0.68   0.77   0.76   0.23   0.24   0.06
 20    0.38   0.61   0.03   0.19   1.45   0.63   2.18   0.61   0.94   0.20
 21    0.10   0.10   0.19   0.19   0.77   0.10   0.36   0.10   0.16   0.69
 22    0.06   0.12   0.03   0.09   0.67   0.04   0.38   0.22   0.55   0.72
 23    0.06   0.09   0.50   0.36   0.85   0.21   0.35   1.74   1.18   1.03
 24    0.42   0.03   0.31   0.14   0.92   0.01   0.39   0.52   0.14   0.31
 25    0.14   0.12   0.01   0.03   0.65   0.40   0.51   0.27   0.38   0.51
 26    0.12   0.19   0.28   0.35   0.86   0.35   0.18   0.40   0.23   0.24
 27    0.15   0.12   0.11   0.16   1.08   1.00   0.21   0.80   0.40   0.76
 28    0.23   0.07   0.08   0.08   1.07   0.63   0.78   0.12   0.02   0.23
 29    0.02   0.04   0.36   0.38   1.04   0.16   0.34   0.85   1.03   1.09
 30    0.12   0.36   0.33   0.15   0.80   1.12   1.27   0.65   0.50   0.30
 31    0.07   0.30   0.17   0.40   1.19   0.02   0.39   0.42   0.01   0.94
 32    0.09   0.06   0.04   0.01   0.76   0.12   0.70   0.58   0.00   1.01
 33    0.13   0.05   0.23   0.05   0.66   0.10   0.12   0.33   0.35   0.52
 34    0.15   0.28   0.01   0.44   1.45   0.05   0.26   0.29   0.60   0.15
 35    0.24   0.29   0.42   0.11   0.77   0.03   0.46   1.05   0.57   0.04
 36    0.05   0.12   0.15   0.08   0.91   0.09   0.33   0.76   0.34   0.94

Beyond quantifying item category information, the statistical significance of the information in the distractor categories was evaluated through an LR test for each item under the 2PL-NLM. The results are presented in Table 7. As noted above, the augmented-model −2logL equals 280185.34 for all items. Since the 0.95 quantile of the χ² distribution with 3 degrees of freedom is 7.81, we reject the null hypothesis for all items except items 12 and 32. Recalling the inflated Type I error performance of the LR test from Section 3.2.3, it nevertheless appears that on the whole there is evidence of distractor information in the items on this test: the LR test rejects for 34 of the 36 items, well beyond the levels of inflation seen in the simulation. However, the percentage increase in information, shown in the rightmost column of Table 7, is less than 12% for half of the items.
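These item-level LR decisions can be sketched directly: the LR statistic is the difference in −2logL between the reduced and augmented models, referred to a χ² distribution with df = 3 as above. The sketch below uses a closed-form χ² survival function that holds for 3 degrees of freedom; the helper names are ours.

```python
import math

def chi2_sf_df3(x):
    """Survival function P(X > x) of a chi-square with df = 3
    (a closed form exists for odd degrees of freedom)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

def lr_test(neg2ll_reduced, neg2ll_augmented, alpha=0.05):
    """Likelihood ratio test for distractor information in one item."""
    lr = neg2ll_reduced - neg2ll_augmented
    p = chi2_sf_df3(lr)
    return lr, p, p < alpha

# Item 12 from Table 7: LR = 280190.33 - 280185.34 = 4.99, below the
# critical value 7.81, so distractor information is not significant.
lr, p, reject = lr_test(280190.33, 280185.34)
print(round(lr, 2), reject)  # 4.99 False
```

The same call with item 1's reduced value (280202.87, LR = 17.53) rejects, matching the text's conclusion that all items except 12 and 32 show significant distractor information.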

TABLE 6.
Model selection comparison across 10 samples.

Method   Model           1      2      3      4      5      6      7      8      9     10
−2logL   NRM         93515  93644  92468  93126  93586  92876  93763  92880  93687  92620
         2PL-NLM     93492  93608  92420  93091  93555  92832  93742  92839  93655  92576
         3PL-NLM     93436  93501  92354  93061  93525  92821  93720  92822  93635  92495
AIC      NRM         94091  94220  93044  93702  94162  93452  94339  93456  94263  93196
         2PL-NLM     94068  94184  92996  93667  94131  93408  94318  93415  94231  93152
         3PL-NLM     94084  94149  93002  93709  94173  93469  94368  93470  94283  93143
BIC      NRM         94379  94508  93332  93990  94450  93740  94627  93744  94551  93484
         2PL-NLM     94356  94472  93284  93955  94419  93696  94606  93703  94519  93440
         3PL-NLM     94408  94473  93326  94033  94497  93793  94692  93794  94607  93467
CAIC     NRM         94667  94796  93620  94278  94738  94028  94915  94032  94839  93772
         2PL-NLM     94644  94760  93572  94243  94707  93984  94894  93991  94807  93728
         3PL-NLM     94732  94797  93650  94357  94821  94117  95016  94118  94931  93791

FIGURE 4. Item information and information share of categories under the 2PL-NLM.

FIGURE 5. Item information under the 2PLM and 2PL-NLM.

5. Alternative Nested Logit Models

Although the NLMs presented in this paper appear to provide a better representation of items such as example item 2 than models such as the NRM, a limitation of the NLMs as representations of the response process is the absence of the correct response category from the second nest. This limitation can be addressed through overlapping nests, in which the same response option is present in more than one nest. Such a framework also provides an appealing way to understand the NLM and NRM approaches in relation to each other, as well as potential hybrid approaches. Under an overlapping nest approach, the correct response category could be present in two nests: (1) the nest associated with the correct solution strategy; and (2) the nest including the distractor categories. An appealing aspect of this framework is that it emphasizes two general routes by which an examinee arrives at the correct response: (1) correct problem solving apart from response category evaluation; and (2) a comparative evaluation of all response options, as might occur with an educated guess. Such a model would be obtained by generalizing Equations (3) and (4) so that the summation in the rightmost bracketed term in Equation (4) includes the correct response category (in addition to the distractor categories). The NRM can be viewed as a special case in which β_i = −∞ and α_i = 0, while the 2PL-NLM is a special case in which ζ_iv = −∞ and λ_iv = 0 for the correct response category. The cost of the more general model in comparison to the 2PL-NLM or NRM, therefore, is the addition of two parameters per item. It remains to be seen whether there are conditions in which this model can be effectively estimated. The work of San Martin et al. (2006) likely provides some insight as to the

potential value of these models, although their approach was applied within the framework of a Rasch model and thus did not include either correct response or distractor slope parameters. One setting that may make estimation possible is the presence of distinguishable traits across levels. For example, for some tests it may be reasonable to assume that a distinct trait (e.g., testwiseness) influences selection among response categories at level 2, as compared to the trait (e.g., math ability) that functions at level 1.

TABLE 7.
LR test results and average item information under the 2PLM and 2PL-NLM.

Item   −2logL (reduced)      LR     2PLM Info   2PL-NLM Info   % Increase
  1       280202.87        17.53      0.0525       0.0577          10
  2       280248.72        63.38      0.4025       0.4241           5
  3       280198.82        13.48      0.1702       0.1741           2
  4       280304.13       118.79      0.0797       0.1202          51
  5       280197.59        12.25      0.0395       0.0439          11
  6       280419.47       234.13      0.2042       0.2873          41
  7       280780.87       595.53      0.2103       0.4290         104
  8       280200.37        15.03      0.1433       0.1485           4
  9       280212.37        27.03      0.3769       0.3853           2
 10       280250.39        65.06      0.0415       0.0664          60
 11       280248.71        63.38      0.1021       0.1264          24
 12       280190.33         4.99      0.1543       0.1561           1
 13       280246.80        61.46      0.1448       0.1648          14
 14       280294.64       109.30      0.2858       0.3229          13
 15       280307.78       122.45      0.1954       0.2371          21
 16       280277.83        92.49      0.0938       0.1262          35
 17       280242.66        57.32      0.0615       0.0803          30
 18       280325.34       140.00      0.1566       0.2094          34
 19       280234.66        49.32      0.1037       0.1214          17
 20       280243.28        57.95      0.3751       0.3943           5
 21       280228.12        42.78      0.1217       0.1376          13
 22       280196.28        10.95      0.0911       0.0953           5
 23       280227.37        42.04      0.1316       0.1437           9
 24       280324.78       139.44      0.1762       0.2229          27
 25       280199.87        14.53      0.0906       0.0957           6
 26       280271.99        86.65      0.1579       0.1890          20
 27       280210.11        24.77      0.2178       0.2268           4
 28       280202.61        17.27      0.2300       0.2358           3
 29       280252.00        66.66      0.1869       0.2052          10
 30       280258.12        72.78      0.1365       0.1632          20
 31       280299.95       114.61      0.2444       0.2906          19
 32       280191.81         6.47      0.1076       0.1100           2
 33       280211.66        26.32      0.0929       0.1013           9
 34       280239.91        54.58      0.3759       0.3968           6
 35       280317.58       132.25      0.1317       0.1791          36
 36       280208.87        23.53      0.1523       0.1607           5
An example of this possibility is demonstrated by example item 3. Example Item 3. A tire measures 24 inches in diameter. What is the circumference of the tire in inches? Round your answer to the nearest tenth. (A) 48 (B) 119.5 (C) 75 (D) 75.4

While math ability may determine whether the examinee arrives at the correct response at level 1, a correlated but potentially distinguishable trait may function within level 2. For example, as only two responses, (B) and (D), are reported to the nearest tenth, and (C) and (D) are essentially the same answer differing only in rounding, a testwise respondent could likely ascertain (D) as the correct response. Other NLMs adapted for still other item types might include additional levels. One such case might involve items for which a solution strategy can be broken down into steps and where distractors are designed to catch misapplication of a particular step. For the example item above, for instance, it might be anticipated that the process by which a respondent arrives at (D) as the correct response involves first performing the circumference calculation, 24 × 3.1416 = 75.3984, and then determining the correct level at which to round. Correct execution of the first step but incorrect execution of the second would be represented by the choice of (C). While these models and others likely provide a more accurate representation of the response process than the NLMs considered in this paper, they naturally come at the cost of additional model complexity, as well as the loss of the distractor collapsibility property that motivated the 2PL-NLM and 3PL-NLM considered here.

6. Conclusion

Future work with NLMs can address various issues. More applications and direct comparisons against competing models, including models other than the NRM, such as the Nedelsky model (Bechger, Maris, Verstralen, & Verhelst, 2005), are needed. Work on estimation issues related to more complex NLMs, such as models with overlapping nests, may help clarify the potential value of the NLM strategy in other contexts.
Various alternative strategies might be considered, following approaches taken in the discrete choice literature (see, e.g., Train, 2003, pp. 93–96). For example, some approaches to handling overlapping nests specify a parameter indicating the degree to which a given outcome is a member of each nest. Other generalizations of the nested approach might consider probit as opposed to logit link functions. Additional practical applications of the specific models investigated in this paper may also be of interest. For example, attempts to study differential item functioning (DIF) in multiple-choice items often find value in determining whether particular distractors are responsible for differential functioning of the correct response. Such applications can be studied in a more explicit fashion using the 2PL-NLM and 3PL-NLM as presented in this paper. Still other applications may focus on the use of the models in comparing test items administered under open-ended versus multiple-choice frameworks. Here again the consistency in how the correct response category is modeled should allow for a more direct assessment of the consequences of adding multiple-choice response options.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.
Baker, F.B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York: Marcel Dekker.
Bechger, T.M., Maris, G., Verstralen, H.H.F.M., & Verhelst, N.D. (2005). The Nedelsky model for multiple-choice items. In L.A. van der Ark, M.A. Croon, & K. Sijtsma (Eds.), New developments in categorical data analysis for the social and behavioral sciences. Mahwah: Lawrence Erlbaum Associates.
Bock, R.D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51.
Bock, R.D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.