Making the Most of What We Have: A Practical Application of Multidimensional Item Response Theory in Test Scoring


Journal of Educational and Behavioral Statistics, Fall 2005, Vol. 30, No. 3

Jimmy de la Torre
Rutgers, The State University of New Jersey

Richard J. Patz
R. J. Patz, Inc.

This article proposes a practical method that capitalizes on the availability of information from multiple tests measuring correlated abilities given in a single test administration. By simultaneously estimating different abilities with the use of a hierarchical Bayesian framework, more precise estimates for each ability dimension are obtained. The efficiency of the proposed method is most pronounced when highly correlated abilities are estimated from multiple short tests. Employing Markov chain Monte Carlo techniques allows for straightforward estimation of model parameters.

Keywords: ability estimation, Bayesian estimation, item response theory, Markov chain Monte Carlo, multidimensionality

It is not unusual for several tests measuring different abilities (i.e., a battery) to be given in one test administration. Although these tests may tap different latent abilities, the abilities are usually not independent of one another. For example, in achievement tests such as the National Assessment of Educational Progress (NAEP) or the California Achievement Tests, these abilities have high positive correlations, typically greater than 0.70 (CTB/McGraw-Hill, 2002; Johnson & Carlson, 1994). However, a common practice in educational measurement is to estimate these abilities independently of each other. This article proposes a more efficient method of estimating these abilities that takes this correlational structure into account. The method uses a hierarchical Bayesian approach to simultaneous estimation of abilities that is based on a simple-structure multidimensional item response theory (IRT) model.
The approach may be applied in a straightforward manner to improve the scoring of test batteries for certain purposes by using the simple structure reflected in the construction of the test battery and the IRT item parameter estimates employed in traditional unidimensional scoring of the component tests.

This research was started during the first author's summer internship at CTB/McGraw-Hill. The authors thank Howard Wainer and two associate editors for their helpful comments and suggestions and CTB/McGraw-Hill for the data used in this study.

One advantage of IRT is its flexibility in incorporating various auxiliary information into the model. Mislevy (1987) has shown that including educational variables (e.g., grade level) in the estimation process can improve the precision of parameter estimates. In this article, we exploit the auxiliary information available from examinees' responses to other tests for the purpose of ability estimation. Exploiting hierarchical structure to improve the accuracy of parameter estimation is a well-known statistical technique (see, for example, Purcell & Kish, 1979). In the context of test scoring, Wainer et al. (2001) derived an empirical Bayes approach by using test reliabilities and intertest correlations to improve the accuracy of test scores. We show that our hierarchical modeling approach provides estimates very similar to those of Wainer et al. in the case of simple structure and that our approach may be extended more easily to accommodate complex structure and a richer set of auxiliary information.

The primary focus of this article is to investigate whether and under what conditions the simultaneous estimation of abilities from different dimensions yields more accurate ability estimates. In particular, the article examines how the number of dimensions, the number of items in each dimension, and the degree of correlation between abilities affect the accuracy of the estimates. Because the resulting problem involves maximization in high-dimensional space, traditional methods of estimation that rely on derivatives may not be carried out in a straightforward manner. Hence, Markov chain Monte Carlo (MCMC) simulation is used in estimating the abilities. Although doing so is not always feasible and has its own drawbacks, the dimensionality and complexity of the problem can be substantially reduced by estimating the correlation between abilities separately and fixing it in the scoring process (see Segall, 1996).
Model

To extend the three-parameter logistic model (Lord, 1980) to the multidimensional context, Reckase (1996) used the following generalization:

$$P(X_{ij} = 1 \mid \boldsymbol{\theta}_i, \boldsymbol{\alpha}_j, \beta_j, \gamma_j) = \gamma_j + (1 - \gamma_j)\,\frac{\exp(\boldsymbol{\alpha}_j'\boldsymbol{\theta}_i + \beta_j)}{1 + \exp(\boldsymbol{\alpha}_j'\boldsymbol{\theta}_i + \beta_j)}, \tag{1}$$

where
$P(X_{ij} = 1 \mid \boldsymbol{\theta}_i, \boldsymbol{\alpha}_j, \beta_j, \gamma_j)$ is the probability of examinee $i$ responding to item $j$ correctly;
$X_{ij}$ is the response of examinee $i$ to item $j$ (0 = incorrect, 1 = correct);
$\boldsymbol{\theta}_i$ is the ability vector of the examinee;
$\boldsymbol{\alpha}_j$ is the vector of item parameters related to the discrimination power of the item;
$\beta_j$ is the parameter related to the difficulty of the item;
$\gamma_j$ is the pseudo-guessing parameter of the item;
$i = 1, \ldots, I$ (the total number of examinees); and
$j = 1, \ldots, J$ (the total number of items).
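As a minimal sketch of Equation 1, the following Python function evaluates the response probability for one examinee and one item. The function name `m3pl_probability` and the numeric values in the usage lines are illustrative assumptions and do not come from the article.

```python
import math

def m3pl_probability(theta, alpha, beta, gamma):
    """Equation 1: P(X_ij = 1 | theta_i, alpha_j, beta_j, gamma_j).

    theta and alpha are equal-length sequences (ability and discrimination
    vectors), beta is the difficulty-related intercept, and gamma is the
    pseudo-guessing parameter.
    """
    if len(theta) != len(alpha):
        raise ValueError("theta and alpha must have the same length")
    z = sum(a * t for a, t in zip(alpha, theta)) + beta
    logistic = 1.0 / (1.0 + math.exp(-z))  # same value as exp(z) / (1 + exp(z))
    return gamma + (1.0 - gamma) * logistic

# Hypothetical two-dimensional examinee and item (all values are made up).
p = m3pl_probability(theta=[0.5, -0.2], alpha=[1.2, 0.0], beta=-0.6, gamma=0.2)
```

With these made-up values the linear predictor is zero, so the probability is the guessing floor plus half of the remaining range.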

For this article, a simple structure is assumed (i.e., each item measures one dimension of ability and thus $\boldsymbol{\alpha}_j$ contains only one nonzero element). Under this assumption, the model in Equation 1 can be reexpressed as

$$P(X_{ij(d)} = 1 \mid \theta_{i(d)}, \alpha_{j(d)}, \beta_{j(d)}, \gamma_{j(d)}) = \gamma_{j(d)} + (1 - \gamma_{j(d)})\,\frac{\exp(\alpha_{j(d)}\theta_{i(d)} + \beta_{j(d)})}{1 + \exp(\alpha_{j(d)}\theta_{i(d)} + \beta_{j(d)})}, \tag{2}$$

where
$X_{ij(d)}$ is the response of examinee $i$ to the $j$th item of dimension $d$;
$\theta_{i(d)}$ is the $d$th component of the vector $\boldsymbol{\theta}_i$ (i.e., $\boldsymbol{\theta}_i = \{\theta_{i(d)}\}$);
$\alpha_{j(d)}$, $\beta_{j(d)}$, and $\gamma_{j(d)}$ are the parameters of the $j$th item of dimension $d$;
$d = 1, \ldots, D$ (the number of dimensions);
$j(d) = 1(d), \ldots, J(d)$; and
$\sum_{d=1}^{D} J(d) = J$.

A graphical representation of the hierarchical structure of the model is given in Figure 1.

FIGURE 1. A directed acyclic graph of the model. Nodes: Inv-Wishart$_{\nu_0}(\boldsymbol{\Lambda}_0^{-1})$, $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, $\theta_{i(d)}$, $\alpha_{j(d)}$, $\beta_{j(d)}$, $\gamma_{j(d)}$, $X_{ij(d)}$.
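To illustrate the simple-structure assumption, the sketch below checks numerically that Equation 1, evaluated with a discrimination vector that has a single nonzero element, gives the same probability as Equation 2. The helper names and all numeric values are hypothetical.

```python
import math

def prob_eq1(theta, alpha, beta, gamma):
    """Equation 1 with vector-valued theta and alpha."""
    z = sum(a * t for a, t in zip(alpha, theta)) + beta
    return gamma + (1.0 - gamma) / (1.0 + math.exp(-z))

def prob_eq2(theta_d, alpha_d, beta, gamma):
    """Equation 2: only the ability on the item's own dimension enters."""
    z = alpha_d * theta_d + beta
    return gamma + (1.0 - gamma) / (1.0 + math.exp(-z))

def loading_vector(d, alpha_d, n_dims):
    """Build alpha_j with one nonzero element at (0-based) dimension d."""
    return [alpha_d if k == d else 0.0 for k in range(n_dims)]

theta_i = [0.8, -1.1, 0.3]  # made-up abilities on D = 3 dimensions
d, alpha_d, beta, gamma = 1, 1.4, 0.5, 0.15

p_full = prob_eq1(theta_i, loading_vector(d, alpha_d, 3), beta, gamma)
p_simple = prob_eq2(theta_i[d], alpha_d, beta, gamma)
```

Because the other elements of the loading vector are zero, the abilities on the remaining dimensions drop out of the linear predictor, which is why item parameters from existing unidimensional calibrations can be reused.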

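The item response $X_{ij(d)}$ has a likelihood $P_{ij(d)}(\theta_{i(d)})$ given by

$$P_{ij(d)}(\theta_{i(d)}) = P(X_{ij(d)} = 1 \mid \theta_{i(d)}, \alpha_{j(d)}, \beta_{j(d)}, \gamma_{j(d)})^{X_{ij(d)}} \left[1 - P(X_{ij(d)} = 1 \mid \theta_{i(d)}, \alpha_{j(d)}, \beta_{j(d)}, \gamma_{j(d)})\right]^{1 - X_{ij(d)}}. \tag{3}$$

Let $\mathbf{X}_i = \{X_{i1(1)}, \ldots, X_{iJ(1)}, \ldots, X_{i1(d)}, \ldots, X_{iJ(d)}, \ldots, X_{i1(D)}, \ldots, X_{iJ(D)}\}$ represent the response vector of examinee $i$. The corresponding likelihood of this vector is

$$P_i(\mathbf{X}_i \mid \boldsymbol{\theta}_i) = \prod_{d=1}^{D} \prod_{j(d)=1}^{J(d)} P_{ij(d)}(\theta_{i(d)}). \tag{4}$$

Finally, the likelihood of the data matrix $\mathbf{X}$ is given by

$$P(\mathbf{X} \mid \boldsymbol{\theta}) = \prod_{i=1}^{I} \prod_{d=1}^{D} \prod_{j(d)=1}^{J(d)} P_{ij(d)}(\theta_{i(d)}). \tag{5}$$

Estimation

Prior, Posterior, and Conditional Distributions

The prior distribution of $\boldsymbol{\theta}_i$ is parametrized as

$$\boldsymbol{\theta}_i \mid \boldsymbol{\Sigma} \sim \text{MVN}(\mathbf{0}, \boldsymbol{\Sigma}), \tag{6}$$

$$\boldsymbol{\Sigma} \sim \text{Inv-Wishart}_{\nu_0}(\boldsymbol{\Lambda}_0^{-1}). \tag{7}$$

Of primary interest is the joint distribution of $\boldsymbol{\theta}_i$ and $\boldsymbol{\Sigma}$. Using the notations $\mathbf{X} = \{X_{ij}\}$, $\boldsymbol{\theta} = \{\boldsymbol{\theta}_i\}$, $\boldsymbol{\alpha} = \{\boldsymbol{\alpha}_j\}$, $\boldsymbol{\beta} = \{\beta_j\}$, and $\boldsymbol{\gamma} = \{\gamma_j\}$, this joint posterior can be expressed as

$$P(\boldsymbol{\theta}, \boldsymbol{\Sigma} \mid \mathbf{X}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}) \propto P(\mathbf{X} \mid \boldsymbol{\theta}, \boldsymbol{\Sigma}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})\, P(\boldsymbol{\theta} \mid \boldsymbol{\Sigma})\, P(\boldsymbol{\Sigma}). \tag{8}$$

The posterior distribution in Equation 8 cannot be evaluated in a straightforward manner (i.e., samples cannot be drawn directly from the joint posterior distribution). MCMC simulation is used to draw samples iteratively from the full conditional distributions $\boldsymbol{\theta} \mid \mathbf{X}, \boldsymbol{\Sigma}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}$ and $\boldsymbol{\Sigma} \mid \mathbf{X}, \boldsymbol{\theta}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}$ (Casella & George, 1992; Gamerman, 1997; Gelman, Carlin, Stern, & Rubin, 1995). For each examinee, the full conditional distribution is

$$P(\boldsymbol{\theta}_i \mid \mathbf{X}_i, \boldsymbol{\Sigma}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}) \propto |\boldsymbol{\Sigma}|^{-1/2} \exp\left(-\tfrac{1}{2}\, \boldsymbol{\theta}_i' \boldsymbol{\Sigma}^{-1} \boldsymbol{\theta}_i\right) P_i(\mathbf{X}_i \mid \boldsymbol{\theta}_i). \tag{9}$$

Although Equation 9 is not a known distribution, samples can be drawn from this distribution indirectly by using the Metropolis-Hastings algorithm (Chib & Greenberg, 1995; Gilks, Richardson, & Spiegelhalter, 1996; Tierney, 1994). The full conditional distribution of $\boldsymbol{\Sigma}$ is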

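$$P(\boldsymbol{\Sigma} \mid \mathbf{X}, \boldsymbol{\theta}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}) = P(\boldsymbol{\Sigma} \mid \boldsymbol{\theta}) \propto P(\boldsymbol{\theta} \mid \boldsymbol{\Sigma})\, P(\boldsymbol{\Sigma}). \tag{10}$$

With the use of the prior distribution and the hyperdistributions given in Equations 6 and 7, the full conditional posterior distribution of $\boldsymbol{\Sigma}$ is an Inv-Wishart$_{\nu_I}(\boldsymbol{\Lambda}_I^{-1})$, where $\nu_I = \nu_0 + I$ and $\boldsymbol{\Lambda}_I = \boldsymbol{\Lambda}_0 + \sum_{i=1}^{I} \boldsymbol{\theta}_i \boldsymbol{\theta}_i'$. This full conditional distribution is a known distribution and can be sampled directly.

MCMC Algorithm

The following prior parameters were used: $\nu_0 = D + 2$, and the diagonal and off-diagonal elements of $\boldsymbol{\Lambda}_0$ were set to 1 and 0.5, respectively. Below is an outline of the MCMC algorithm used in the estimation of the parameters.

At iteration 0:
(1) Let $\boldsymbol{\Sigma}^{(0)} = \mathbf{I}$.
(2) Draw $\boldsymbol{\theta}_i^{(0)} \sim \text{MVN}(\mathbf{0}, \mathbf{I})$.

At iteration $t$:
(1) Draw $\boldsymbol{\Sigma}^{(t)} \mid \boldsymbol{\theta}^{(t-1)}$ from Inv-Wishart$_{\nu_I}(\{\boldsymbol{\Lambda}_I^{(t-1)}\}^{-1})$, where $\nu_I = \nu_0 + I$ and $\boldsymbol{\Lambda}_I^{(t-1)} = \boldsymbol{\Lambda}_0 + \sum_{i=1}^{I} \boldsymbol{\theta}_i^{(t-1)} \boldsymbol{\theta}_i^{(t-1)\prime}$.
(2) For $i = 1, \ldots, I$, draw the candidate value $\boldsymbol{\theta}_i^{(*)}$ from MVN$(\boldsymbol{\theta}_i^{(t-1)}, \boldsymbol{\Sigma}_{\theta})$, and accept $\boldsymbol{\theta}_i^{(*)}$ with probability

$$p(\boldsymbol{\theta}_i^{(t-1)}, \boldsymbol{\theta}_i^{(*)}) = \min\left\{\frac{P_i(\mathbf{X}_i \mid \boldsymbol{\theta}_i^{(*)})\, P(\boldsymbol{\theta}_i^{(*)} \mid \boldsymbol{\Sigma}^{(t)})}{P_i(\mathbf{X}_i \mid \boldsymbol{\theta}_i^{(t-1)})\, P(\boldsymbol{\theta}_i^{(t-1)} \mid \boldsymbol{\Sigma}^{(t)})},\ 1\right\}. \tag{11}$$

If the candidate is accepted, $\boldsymbol{\theta}_i^{(t)} = \boldsymbol{\theta}_i^{(*)}$; otherwise, $\boldsymbol{\theta}_i^{(t)} = \boldsymbol{\theta}_i^{(t-1)}$.

For the present article, each chain is iterated 10,000 times. The first 2,000 iterations are discarded, and inference is based on the remaining 8,000 iterations. Two parameters are of interest: the ability vectors $\boldsymbol{\theta}_i$ and the correlation matrix. Both were estimated using the MCMC output. The ability estimates, referred to as the multidimensional expected a posteriori (EAP-M) estimates, were computed as follows:

$$\tilde{\boldsymbol{\theta}}_i = \frac{1}{8{,}000} \sum_{t=2{,}001}^{10{,}000} \boldsymbol{\theta}_i^{(t)} \approx E(\boldsymbol{\theta}_i \mid \mathbf{X}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}). \tag{12}$$

Although not the primary focus of this article, an estimate of the underlying correlational structure between the abilities can be obtained. The covariance matrix was estimated as

$$\tilde{\boldsymbol{\Sigma}} = \frac{1}{8{,}000} \sum_{t=2{,}001}^{10{,}000} \boldsymbol{\Sigma}^{(t)} \approx E(\boldsymbol{\Sigma} \mid \mathbf{X}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}). \tag{13}$$

The estimated covariance was standardized to obtain the correlation estimate $\tilde{\rho}$. The accuracy of the ability and correlation estimates was gauged by comparing them to the generating parameters. Specifically for the ability estimates, Pearson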

correlation and mean squared error (MSE) were computed to summarize the correspondence between the estimated and the generated abilities, in addition to the precision of the ability estimates as measured by the posterior variance. It should be noted that when the correlation between the abilities is zero or assumed to be zero (i.e., $\rho$ is set to zero in estimating ability), the resulting ability estimates are equivalent to the unidimensional expected a posteriori (EAP-U; Bock & Aitkin, 1981) estimates. The additional precision from simultaneous estimation can be obtained by comparing the MSE of the ability estimates when $\rho > 0$ to the MSE of the ability estimates when $\rho = 0$. In particular, the MSE of EAP-U over the MSE of EAP-M is a measure of the relative efficiency of the proposed method compared with the unidimensional approach.

Factors Affecting Multidimensional Ability Estimation: A Simulation Study

In this section we present results of a simulation study that examines the performance of the hierarchical model ability estimates under a variety of realistic configurations. The factors investigated in this article are: (a) the number of abilities, (b) the number of items, and (c) the degree of correlation between the abilities. The numbers of abilities were 2 and 5; the numbers of items were 10, 30, and 50; and the degrees of correlation were 0.00, 0.40, 0.70, and 0.90. The levels of each factor were crossed completely to yield 24 conditions. The item parameters used in simulating the examinee responses and scoring the examinees were obtained from a pool of 550 nationally standardized mathematics items. Ten items whose mean information function is closest to the mean information function of all the items were selected.
Of the 1 billion randomly constructed 10-item tests, the selected test has the minimum mean absolute deviation

$$\int \left| \frac{\sum_{j=1}^{10} I_j(\theta)}{10} - \frac{\sum_{j=1}^{550} I_j(\theta)}{550} \right| f(\theta)\, d\theta, \tag{14}$$

where $I_j(\theta) = D^2 a_j^2 [1 - P_j(\theta)][P_j(\theta) - c_j]^2 / \{(1 - c_j)^2 P_j(\theta)\}$ is the item information function (Hambleton & Swaminathan, 1985; Samejima, 1977), and $f(\theta)$ is the standard normal density. Multiples of the 10-item test were used according to the requirements of the different conditions. For each combination, 1,000 examinees were drawn from MVN$(\mathbf{0}, \boldsymbol{\Sigma}_D)$, where

$$\boldsymbol{\Sigma}_D = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix},$$

and their responses simulated. The constraint on $\boldsymbol{\Sigma}$ retained the structure of the design but did not in any way affect the estimation process.
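The sampler outlined under MCMC Algorithm can be sketched on data simulated in the spirit of the design above. This is a rough illustration under several assumptions that are not in the article: it is restricted to D = 2 so that the matrix algebra can be written by hand with the standard library, the proposal covariance is taken to be `step**2` times the identity, the item parameters are made up, the chain is far shorter than 10,000 iterations, and every function name is hypothetical. The inverse-Wishart draw uses the sum-of-outer-products construction of a Wishart variate, which is valid here because the degrees of freedom are an integer.

```python
import math
import random

def p_correct(theta, a, b, g):
    """Equation 2 for one item; clamped so that the logs stay finite."""
    p = g + (1.0 - g) / (1.0 + math.exp(-(a * theta + b)))
    return min(max(p, 1e-12), 1.0 - 1e-12)

def log_lik(theta_i, x_i, items):
    """Log of Equation 4; items[d] holds the (a, b, g) triples of test d."""
    total = 0.0
    for d, test in enumerate(items):
        for (a, b, g), x in zip(test, x_i[d]):
            p = p_correct(theta_i[d], a, b, g)
            total += math.log(p) if x else math.log(1.0 - p)
    return total

def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def mvn2(mean, cov, rng):
    """One bivariate normal draw using a hand-computed Cholesky factor."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return [mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2]

def outer_add(acc, v):
    """Add the outer product v v' to the 2 x 2 accumulator in place."""
    for r in range(2):
        for c in range(2):
            acc[r][c] += v[r] * v[c]

def draw_sigma(thetas, lam0, nu0, rng):
    """Step 1: Sigma ~ Inv-Wishart_{nu_I}(Lambda_I^{-1}) with integer nu_I."""
    lam = [row[:] for row in lam0]
    for th in thetas:
        outer_add(lam, th)              # Lambda_I = Lambda_0 + sum theta theta'
    scale = inv2(lam)
    w = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(nu0 + len(thetas)):  # W ~ Wishart_{nu_I}(Lambda_I^{-1})
        outer_add(w, mvn2([0.0, 0.0], scale, rng))
    return inv2(w)                      # Sigma = W^{-1}

def log_post(th, x_i, items, s_inv):
    """Log of Equation 9 up to an additive constant."""
    quad = sum(th[r] * s_inv[r][c] * th[c] for r in range(2) for c in range(2))
    return log_lik(th, x_i, items) - 0.5 * quad

def run_chain(x, items, n_iter, burn_in, step, rng):
    """Metropolis-within-Gibbs; returns EAP-M abilities and the mean Sigma."""
    n = len(x)
    lam0, nu0 = [[1.0, 0.5], [0.5, 1.0]], 2 + 2      # nu_0 = D + 2 with D = 2
    prop = [[step * step, 0.0], [0.0, step * step]]  # assumed proposal covariance
    thetas = [mvn2([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], rng) for _ in range(n)]
    th_sum = [[0.0, 0.0] for _ in range(n)]
    sig_sum = [[0.0, 0.0], [0.0, 0.0]]
    for t in range(1, n_iter + 1):
        sigma = draw_sigma(thetas, lam0, nu0, rng)
        s_inv = inv2(sigma)
        for i in range(n):
            cand = mvn2(thetas[i], prop, rng)
            log_ratio = (log_post(cand, x[i], items, s_inv)
                         - log_post(thetas[i], x[i], items, s_inv))
            if rng.random() < math.exp(min(0.0, log_ratio)):  # Equation 11
                thetas[i] = cand
        if t > burn_in:
            for i in range(n):
                th_sum[i][0] += thetas[i][0]
                th_sum[i][1] += thetas[i][1]
            for r in range(2):
                for c in range(2):
                    sig_sum[r][c] += sigma[r][c]
    keep = n_iter - burn_in
    eap = [[s / keep for s in row] for row in th_sum]
    sigma_hat = [[s / keep for s in row] for row in sig_sum]
    return eap, sigma_hat

# Toy data: 2 abilities with rho = 0.9, 10 made-up items per test, 100 examinees.
rng = random.Random(2005)
items = [[(1.0 + 0.1 * j, -1.0 + 0.2 * j, 0.2) for j in range(10)] for _ in range(2)]
true_theta = [mvn2([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], rng) for _ in range(100)]
x = [[[int(rng.random() < p_correct(th[d], a, b, g)) for (a, b, g) in items[d]]
      for d in range(2)] for th in true_theta]
eap_m, sigma_hat = run_chain(x, items, n_iter=300, burn_in=100, step=0.8, rng=rng)
rho_hat = sigma_hat[0][1] / math.sqrt(sigma_hat[0][0] * sigma_hat[1][1])
```

Working on the log scale avoids underflow in the product of item likelihoods, and the determinant term of Equation 9 cancels in the acceptance ratio, so it is omitted from `log_post`. A production implementation would handle general D with a linear algebra library and run much longer chains.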

Estimates of Correlation

Table 1 gives the MCMC estimates of the correlations. Results demonstrate that the correlations were accurately estimated by using MCMC. In general, additional precision can be expected for the estimates as more items and abilities are considered. However, because the large number of examinees allowed for the accurate estimation of the correlations even when only two abilities and 10 items were considered, the additional information afforded by adding more abilities and items became negligible in the process.

TABLE 1. Estimate of Correlation Between Abilities (by number of abilities, D, and number of items, J).

Ability Estimates

Correlation With True Ability and Posterior Variance

Table 2 lists the correlations between the true ability and the estimated ability. It can be observed that the correlation between the true ability and the estimated ability increases as the correlation between the underlying abilities, the number of items, and the number of abilities increase. As expected, when the underlying abilities are not correlated, the correlation between the true and estimated abilities increases only with the number of items and not with the number of dimensions. With at least 30 items, the abilities can be well estimated even when the underlying abilities are not correlated and, hence, increasing the underlying correlation has marginal impact.

TABLE 2. Correlation Between True and Estimated Abilities (by number of abilities, D, and number of items, J).

The largest improvement was observed when five abilities were simultaneously estimated and the correlation between the abilities is 0.90 (i.e., correlation increases from 0.83 to 0.92). Table 3 shows that the number of abilities, number of items, and the degree of correlation affect the posterior variance of the estimates in the same way that these factors affect the correlations between the true and estimated abilities. Specifically, an increase in the number of abilities, the degree of correlation between the abilities, and the number of items resulted in more precise estimates.

TABLE 3. Posterior Variance of the Ability Estimates (by number of abilities, D, and number of items, J).

Relative Efficiency

To quantify the amount of improvement attributable to simultaneous estimation, relative efficiency was computed, and the results are presented in Table 4. Relative efficiency was defined here as the MSE of the EAP-U estimates (i.e., $\rho = 0$) over the MSE of the EAP-M estimates. Thus, a ratio greater than 1.00 can be interpreted as the EAP-M having higher efficiency compared to the EAP-U. In addition, the ratio also indicates the factor by which the test length needs to be increased for the EAP-U estimates to have the same precision as the EAP-M estimates obtained with the original test length. When only two abilities are concurrently considered, the efficiency of the EAP-M method was not evident unless the abilities are very highly correlated (i.e., $\rho = 0.90$).

TABLE 4
Mean Squared Error and Relative Efficiency
Note. Relative efficiency is given in parentheses.

Number of
Abilities    J     ρ = 0.00    ρ = 0.40       ρ = 0.70       ρ = 0.90
D = 2        10    ( )         0.34 (1.00)    0.29 (1.16)    0.23 (1.44)
             30    ( )         0.13 (1.10)    0.13 (1.15)    0.11 (1.40)
             50    ( )         0.09 (1.03)    0.09 (1.08)    0.07 (1.24)
D = 5        10    ( )         0.29 (1.08)    0.23 (1.36)    0.16 (1.95)
             30    ( )         0.13 (1.08)    0.12 (1.24)    0.08 (1.69)
             50    ( )         0.09 (1.02)    0.08 (1.16)    0.06 (1.52)
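The relative-efficiency ratio can be turned into an equivalent number of added items with a few lines of Python. This sketch assumes that "the factor by which the test length needs to be increased" means the longer test has J times the ratio items, so the added items are J × (ratio − 1). The function names are hypothetical, and the ratios are the Table 4 values for D = 2 and ρ = 0.90.

```python
def relative_efficiency(mse_eap_u, mse_eap_m):
    """MSE of EAP-U over MSE of EAP-M; values above 1 favor EAP-M."""
    return mse_eap_u / mse_eap_m

def equivalent_extra_items(n_items, efficiency):
    """Items EAP-U would need beyond n_items to match EAP-M's precision."""
    return n_items * (efficiency - 1.0)

# Relative efficiencies reported in Table 4 for D = 2 and rho = 0.90.
table4_d2_rho90 = {10: 1.44, 30: 1.40, 50: 1.24}
extra = {j: equivalent_extra_items(j, re) for j, re in table4_d2_rho90.items()}
```

Under this reading, a modest ratio on a long test can correspond to more added items than a larger ratio on a short test, because the ratio is multiplied by the original test length.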

Efficiency at this level ranged from 1.24 to 1.44. Depending on the length of the test, this was equivalent to adding 4 to 12 items to the test. When more dimensions were simultaneously used, the efficiency of the EAP-M method was evident for abilities that were reasonably highly correlated (i.e., $\rho \geq 0.70$). Efficiency ranged from 1.16 to 1.95. For 10-item tests where five abilities with $\rho = 0.90$ were simultaneously estimated, the precision of the EAP-M estimates was equivalent to the precision of the EAP-U estimates obtained from tests twice as long. For other conditions, depending on the original test length, the additional precision was equivalent to adding 4 to 26 items to the test. The increase in precision from increasing the number of abilities was less evident when long tests were used. This is consistent with the results discussed earlier, which indicate that improvement is marginal when abilities are already well estimated. Although efficiency may not be as high for long tests, the corresponding number of additional items turned out to be larger.

The simulation study also shows that the average posterior variances given in Table 3 are very close to the MSEs of the estimates in Table 4. In real data analysis, where MSE cannot be computed, approximate relative efficiency can be obtained by comparing the average posterior variances of the estimates.

The effects of the underlying correlations between abilities on the estimates for five dimensions and 10 items are shown in Figures 2 through 4. Figure 2 shows that higher underlying correlation resulted in more compact scatter plots along the identity line. Figure 3 shows that the estimation bias for extreme abilities can be substantially reduced by simultaneous estimation when the underlying correlation is high. Finally, Figure 4 shows the dramatic decrease in posterior variance for all values of $\theta$ as the abilities become more correlated.

FIGURE 2. Five dimensions and 10 items: scatter plots of true and estimated abilities. Panels: $\rho$ = 0.00, 0.40, 0.70, 0.90.

FIGURE 3. Five dimensions and 10 items: smoothed deviations between true and estimated abilities. Axis label: $\theta$.

FIGURE 4. Five dimensions and 10 items: smoothed posterior variances of the ability estimates. Axis label: $\theta$.

Augmenting Ability Estimates

Wainer et al. (2001) presented different methods of improving ability estimation by computing the empirical Bayes estimates of abilities on the basis of responses to multiple tests. The methods they described can be used for both number-correct and IRT scores and utilize the multivariate analog of test reliability in regressing the examinee scores. Because their procedure for augmenting IRT scores involved the modal a posteriori (MAP) estimates, the formulas they presented are slightly modified to allow comparison between their method and the method proposed in this article, which gives EAP estimates. For test $d$, the unregressed IRT score for the test is given by

$$\theta_d^* = \frac{\hat{\theta}_d}{\rho_d}, \tag{15}$$

where $\hat{\theta}_d$ is the EAP-U ability estimate and $\rho_d$ is the test reliability. The reliability of test $d$ is computed as

$$\rho_d = \frac{\text{Var}(\hat{\theta}_d)}{\text{Var}(\hat{\theta}_d) + \overline{\text{PVar}}(\hat{\theta}_d)}, \tag{16}$$

where $\overline{\text{PVar}}(\hat{\theta}_d)$ is the average posterior variance of ability estimates in test $d$. Let $\mathbf{S}^u$ be the covariance matrix of unregressed ability estimates $\boldsymbol{\theta}^*$, and define $\mathbf{S}^c = \mathbf{S}^u - \mathbf{D}$ to be the covariance matrix corrected for reliability, where $\mathbf{D}$ is a diagonal matrix whose $d$th nonzero entry is $(1 - \rho_d)\, s^u_{dd}$. Also define $\bar{\boldsymbol{\theta}}^*$ to be the mean vector of unregressed ability estimates. Then the empirical Bayes ability estimate for examinee $i$ is given by

$$\tilde{\boldsymbol{\theta}}_i^{(1)} = \bar{\boldsymbol{\theta}}^* + \mathbf{S}^c (\mathbf{S}^u)^{-1} (\boldsymbol{\theta}_i^* - \bar{\boldsymbol{\theta}}^*). \tag{17}$$

To compare the two methods of augmenting ability estimates, unidimensional ability estimates were computed for all the conditions in the previous section.
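Equations 15 through 17 can be sketched for the special case of two tests, where the matrix inverse has a closed form. This is an illustration only: the function names, the n − 1 denominator in the covariance, and the example scores are all assumptions, and Wainer et al.'s procedure is more general.

```python
def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance with an n - 1 denominator (an assumption)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def augment_two_tests(eap_u, avg_post_var):
    """Empirical Bayes scores for two tests.

    eap_u[d] lists the EAP-U estimates on test d; avg_post_var[d] is the
    average posterior variance on test d.
    """
    # Equation 16: reliability of each test.
    rel = [cov(e, e) / (cov(e, e) + pv) for e, pv in zip(eap_u, avg_post_var)]
    # Equation 15: unregressed scores.
    star = [[t / r for t in e] for e, r in zip(eap_u, rel)]
    # S^u and the reliability-corrected S^c = S^u - D.
    s_u = [[cov(star[r], star[c]) for c in range(2)] for r in range(2)]
    s_c = [row[:] for row in s_u]
    for d in range(2):
        s_c[d][d] -= (1.0 - rel[d]) * s_u[d][d]
    # B = S^c (S^u)^{-1}, using the closed-form 2 x 2 inverse.
    det = s_u[0][0] * s_u[1][1] - s_u[0][1] * s_u[1][0]
    s_u_inv = [[s_u[1][1] / det, -s_u[0][1] / det],
               [-s_u[1][0] / det, s_u[0][0] / det]]
    b = [[sum(s_c[r][k] * s_u_inv[k][c] for k in range(2)) for c in range(2)]
         for r in range(2)]
    # Equation 17 applied to every examinee.
    m = [mean(star[0]), mean(star[1])]
    n = len(eap_u[0])
    return [[m[r] + sum(b[r][c] * (star[c][i] - m[c]) for c in range(2))
             for r in range(2)] for i in range(n)]

# Made-up EAP-U scores for five examinees on two positively related tests.
eap_u = [[-1.2, -0.4, 0.1, 0.6, 1.1], [-0.9, -0.6, 0.3, 0.2, 1.0]]
augmented = augment_two_tests(eap_u, avg_post_var=[0.30, 0.35])
```

When the two sets of scores are uncorrelated and centered at zero, the weight matrix reduces to a diagonal matrix of reliabilities and the function returns the EAP-U scores unchanged, which is a convenient sanity check.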
These unidimensional estimates were transformed to the empirical Bayes estimates, $\tilde{\theta}^{(1)}$, using Equation 17 and compared with the ability estimates obtained using simultaneous estimation, denoted by $\tilde{\theta}^{(0)}$. Comparison was based on the correlation between the two estimates and on the correlation and MSE between the true and estimated abilities. Listed in Table 5 are the summary statistics for the measures across the 24 conditions. These statistics indicate that the two methods provide almost identical estimates. In the worst-case scenario, the correlation between the two estimates is still almost perfect (0.9953), and the differences between the correlation and MSE are not evident until the third decimal place.

TABLE 5. Comparison of Methods of Computing Augmented Ability Estimates. Columns: Cor($\tilde{\theta}^{(0)}$, $\tilde{\theta}^{(1)}$) and the absolute differences in Cor($\theta$, $\tilde{\theta}$) and in MSE. Rows: Min, Q1, Q2, Q3, Max.

Although the results appear almost identical, the summary statistics suggest that $\tilde{\theta}^{(0)}$ may be slightly better than $\tilde{\theta}^{(1)}$. The mean correlation with $\theta$ across the 24 conditions is marginally higher for $\tilde{\theta}^{(0)}$ than for $\tilde{\theta}^{(1)}$, and the average MSE is marginally lower for the simultaneous estimation than for Wainer et al.'s method. The difference between the two estimates can be observed in an example displayed in Figure 5. In this example, five 10-item tests with reliability 0.68 are scored. Compared with the $\tilde{\theta}^{(0)}$ estimates, a greater magnitude of shrinkage can be observed in the $\tilde{\theta}^{(1)}$ estimates for extreme values of $\theta$. The degree of tail shrinkage is seen to depend on the magnitude of the correlation between the abilities. In this case, we see that using the test reliabilities to improve ability estimates causes greater regression to the mean compared to using the covariance matrix as a prior distribution. It should be noted, however, that this difference in shrinkage occurs in areas where only a small proportion of examinees can be found and that, for most practical purposes, Wainer et al.'s method, which is easier to implement, can be used without negative implications.

FIGURE 5. Five dimensions and 10 items: scatter plots of estimated abilities. Panels: $\rho$ = 0.00, 0.40, 0.70, 0.90. Axis label: Empirical Bayes.

Analysis of a Grade 9 Test Battery

The responses of 2,255 Grade 9 examinees on four content areas, Math (MA; 25 items), Math-Computation (MC; 20 items), Spelling (SP; 20 items), and Social Studies (SS; 25 items), were analyzed. The abilities of each examinee on the four content areas were estimated using the EAP-M and EAP-U. The prior distributions discussed in the simulation section were used. Each chain ran for 25,000 iterations, with the first 5,000 iterations discarded as burn-in.

Correlation Estimates

The estimates of the correlation between the four abilities are given in Table 6. The highest correlation was between Math and Math-Computation (0.89), whereas the lowest correlation was between Math and Spelling (0.66). The average correlation between the four abilities was 0.75. These results indicated that the data analyzed were close to the simulated condition where $\rho = 0.70$.

Ability Estimates

The posterior variances of the EAP-U and EAP-M estimates and the approximate relative efficiency of the EAP-M method for each content area are given in Table 7. The highest relative efficiency was obtained in estimating the Math ability. The EAP-M estimates were on average 28% more precise than the EAP-U estimates.
This translates to an additional 7 items for the EAP-U method to arrive at the same precision. The lowest efficiency was in estimating the Social Studies ability. Nonetheless, the higher precision using the EAP-M method was equivalent to an additional 5 items. The mean efficiency of 1.22 for four abilities that have an average correlation of 0.75 was reasonable and consistent with the simulation results when one takes into account the additional noise involved in analyzing real data.

TABLE 6. Correlation Estimates for the Grade 9 Test Battery. Rows: MA, MC, SP. Columns: MC, SP, SS.

TABLE 7. Posterior Variance and Approximate Relative Efficiency of Ability Estimates for the Grade 9 Test Battery. Rows: EAP-U, EAP-M, Relative efficiency. Columns: MA, MC, SP, SS.

Discussion

The multidimensional approach to simultaneous ability estimation can be viewed as a more general framework for obtaining expected a posteriori estimates of ability. The method gives the same results as the unidimensional approach when abilities are uncorrelated. However, when abilities are correlated, taking the correlation into account can lead to noticeable improvements in ability estimates, especially when there are multiple short tests and the underlying correlation is high. Among several methods of ability estimation, EAP-U has been preferred for its small bias and standard error (Kim & Nicewander, 1993; Thissen & Orlando, 2001). But, as the results of this article have shown, employing simultaneous estimation can further reduce the bias and standard error of the estimates.

In addition to improvements to ability estimates, the hierarchical formulation used in this paper provides a framework that allows for the direct estimation of the correlation between the abilities. This obviates the need for a two-step approach (i.e., estimation of abilities in the first step and estimation of the correlation matrix using the ability estimates in the second step), which leads to biased estimates (Little & Rubin, 1983; Mislevy, 1984; Segall, 1996).

The multidimensional approach should be beneficial in many testing situations. The administration of multiple tests during one sitting is not uncommon, and as Johnson and Carlson (1994) reported, the different abilities measured by these tests are usually highly correlated. Although some of the improvements from using this approach are relatively modest, they can be achieved without much additional cost (i.e., only the estimation process was changed in scoring the same data sets).
In a practical sense, use of this method means that, given a fixed number of items, the score can be made more reliable, or given a desired level of reliability, the number of items can be reduced without loss of accuracy.

The valid use of scores obtained using auxiliary information of any type must be considered carefully. NAEP, for example, uses auxiliary information regarding student and school characteristics to obtain more accurate scores for subpopulations of students. The student-level scores (i.e., "plausible values") computed in the analysis may not be used to characterize individual student proficiency because they depend on the auxiliary demographic variables, which should not inform characterizations of proficiency. In the multidimensional approach proposed in this article, all auxiliary information is directly obtained from test performance, but validity concerns remain. Although the multidimensional scores have favorable statistical properties, they also have a more complex interpretation. These scores may not be desirable when straightforward interpretation of test scores as a summary of test responses within a domain is favored. In any type of competition between students within a domain (e.g., identifying the top 10% of scores on a science achievement test), it would not be appropriate to allow scores on other domains (e.g., mathematics) to affect the scores and rankings. When the consequences for examinees do not depend on comparisons with other examinees, then the greater accuracy of the multidimensional scores may be useful. For example, if more accurate score profiles lead to more efficient diagnosis or more precise targeting of instructional resources, then their use could be supported.

Finally, we observe that multidimensional scores may serve to complement rather than to replace traditional scoring of test batteries. Traditional unidimensional scores may be reported at the domain level (e.g., scale scores and their associated norm-referenced and/or criterion-referenced derived scores), and multidimensional scores could be used to inform finer-grained reporting such as skills profiles and objective-level scores. Traditional approaches to this type of fine-grained reporting suffer from insufficient reliability because of the small numbers of items associated with each fine-grained reporting category. This explains the continued reliance on a single, more global composite score (Wainer et al., 2001). Our analysis suggests that a multidimensional approach to scoring could be a promising application given the nature of the test batteries (i.e., multiple short sections that are highly correlated).

It should be noted that the assumption of simple structure does not limit the usefulness of the proposed method.
On the contrary, the assumption makes the application of the method more straightforward in that it can be applied to existing tests that have already been calibrated without any changes in the item response models.

The simultaneous estimation procedure discussed in this article is akin to the method employed by Wainer et al. (2001) in that both augment scores on one test by using information from other tests, and, as our analyses show, under a variety of testing conditions the two approaches yield very similar results. However, the two methods differ in some important respects. One important distinction is that although both methods can be classified as Bayesian procedures, the method discussed by Wainer et al. uses an empirical Bayesian approach whereas the current method uses a fully Bayesian approach. Another important distinction is that the method described by Wainer et al. involves elements of classical test theory whereas the method described here is solely based on IRT and its extensions. For example, the former method uses the multivariate analog of reliability in regressing the examinee scores.

Future research might take a variety of directions. First, the flexibility of IRT formulation should allow for other information to be readily incorporated in the model. Aside from responses to other tests, one can also consider other information such as academic or demographic variables routinely collected in most testing situations. Second, the present article uses item parameters with known values. The approach can be broadened to include item parameter estimation such as was done by Patz and

Junker (1999a, 1999b) in the unidimensional IRT case. Finally, the proposed method can be tried with other item response models such as the generalized graded unfolding model of Roberts, Donoghue, and Laughlin (2000) and other testing contexts.

References

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46.
Casella, G., & George, E. I. (1992). Explaining the Gibbs sampler. The American Statistician, 46.
Chib, S., & Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49.
CTB/McGraw-Hill. (2002). Technical Bulletin 1 of California Achievement Tests Forms C and D. Monterey, CA: Author.
Gamerman, D. (1997). Markov chain Monte Carlo: Stochastic simulation for Bayesian inference. London: Chapman & Hall.
Gelman, A., Carlin, J. B., Stern, H., & Rubin, D. B. (1995). Bayesian data analysis. London: Chapman & Hall.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Introducing Markov chain Monte Carlo. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 1-17). London: Chapman & Hall.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.
Johnson, E. G., & Carlson, J. (1994). The NAEP 1992 Technical Report (Report No. 23-TR-20). Washington, DC: National Center for Education Statistics.
Kim, J. K., & Nicewander, W. A. (1993). Ability estimation for conventional tests. Psychometrika, 58.
Little, R. J. A., & Rubin, D. B. (1983). On jointly estimating parameters and missing data by maximizing the complete-data likelihood. The American Statistician, 37.
Lord, F. M. (1980). Application of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Mislevy, R. J. (1984). Estimating latent distributions. Psychometrika, 49.
Mislevy, R. J. (1987). Exploiting auxiliary information about examinees in the estimation of item parameters. Applied Psychological Measurement, 11.
Patz, R. J., & Junker, B. W. (1999a). A straightforward approach to Markov chain Monte Carlo methods for item response theory. Journal of Educational and Behavioral Statistics, 24.
Patz, R. J., & Junker, B. W. (1999b). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24.
Purcell, N. J., & Kish, L. (1979). Estimation for small domains. Biometrics, 35.
Reckase, M. D. (1996). A linear logistic multidimensional model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory. New York: Springer-Verlag.
Roberts, J. S., Donoghue, J. R., & Laughlin, J. E. (2000). A general model for unfolding unidimensional polytomous responses using item response theory. Applied Psychological Measurement, 24.
Samejima, F. (1977). The use of information function in tailored testing. Applied Psychological Measurement, 1.
Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61.
Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test scoring. Mahwah, NJ: Erlbaum.
Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion). Annals of Statistics, 22.
Wainer, H., Vevea, J. L., Camacho, F., Reeve III, B. B., Rosa, K., Nelson, L., Swygert, K. A., & Thissen, D. (2001). Augmented scores: "Borrowing strength" to compute scores based on small numbers of items. In D. Thissen & H. Wainer (Eds.), Test scoring. Mahwah, NJ: Erlbaum.

Authors

JIMMY DE LA TORRE is Assistant Professor, Department of Educational Psychology at Rutgers Graduate School of Education, 10 Seminary Place, New Brunswick, NJ 08901. His research interests include psychometrics, item response models, and cognitive diagnosis.

RICHARD J. PATZ is President, R. J. Patz, Inc., 1414 Soquel Avenue, Suite 212, Santa Cruz, CA 95062. His research and consulting interests include large-scale assessment design and implementation, and statistical models for item response data.

Manuscript received September 9, 2002
Revision received February 9, 2004
Accepted June 15
