Item Parameter Calibration of LSAT Items Using MCMC Approximation of Bayes Posterior Distributions

Similar documents
Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods

Making the Most of What We Have: A Practical Application of Multidimensional Item Response Theory in Test Scoring

Monte Carlo in Bayesian Statistics

Bayesian Inference. Chapter 1. Introduction and basic concepts

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA

MCMC algorithms for fitting Bayesian models

Principles of Bayesian Inference

Some Issues In Markov Chain Monte Carlo Estimation For Item Response Theory

The Bayesian Approach to Multi-equation Econometric Model Estimation

eqr094: Hierarchical MCMC for Bayesian System Reliability

Computational statistics

2 Bayesian Hierarchical Response Modeling

Parameter Estimation. William H. Jefferys, University of Texas at Austin

Bayesian Methods for Machine Learning

PIRLS 2016 Achievement Scaling Methodology 1

Bayesian Inference. Chapter 9. Linear models and regression

Metropolis-Hastings Algorithm

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

An Introduction to the DA-T Gibbs Sampler for the Two-Parameter Logistic (2PL) Model and Beyond

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

IRT Model Selection Methods for Polytomous Items

ST 740: Markov Chain Monte Carlo

Model comparison. Christopher A. Sims Princeton University October 18, 2016

Journal of Statistical Software

Learning the hyper-parameters. Luca Martino

Markov Chain Monte Carlo

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements

Bridge estimation of the probability density at a point. July 2000, revised September 2003

A Markov chain Monte Carlo approach to confirmatory item factor analysis. Michael C. Edwards The Ohio State University

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Theory of Stochastic Processes 8. Markov chain Monte Carlo

Reconstruction of individual patient data for meta analysis via Bayesian approach

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

BAYESIAN MODEL CHECKING STRATEGIES FOR DICHOTOMOUS ITEM RESPONSE THEORY MODELS. Sherwin G. Toribio. A Dissertation

LECTURE 15 Markov chain Monte Carlo

Bayesian Inference and MCMC

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bayesian Phylogenetics:

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model

Reminder of some Markov Chain properties:

MODEL COMPARISON CHRISTOPHER A. SIMS PRINCETON UNIVERSITY

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Tools for Parameter Estimation and Propagation of Uncertainty

Principles of Bayesian Inference

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method

Likelihood-free MCMC

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

BRIDGE ESTIMATION OF THE PROBABILITY DENSITY AT A POINT

Down by the Bayes, where the Watermelons Grow

Spatial Statistics Chapter 4 Basics of Bayesian Inference and Computation

Computational Tools for Comparing Asymmetric GARCH Models via Bayes Factors. Ricardo S. Ehlers

David Giles Bayesian Econometrics

Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Approximate Bayesian Computation: a simulation based approach to inference

MH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution

Markov Chain Monte Carlo in Practice

Bayes: All uncertainty is described using probability.

STA 4273H: Statistical Machine Learning

Markov chain Monte Carlo

Bayesian Linear Regression

Bayesian Networks in Educational Assessment

On Markov chain Monte Carlo methods for tall data

Bayesian Analysis of RR Lyrae Distances and Kinematics


Direct Simulation Methods #2

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture February Arnaud Doucet

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

Statistical Methods in Particle Physics Lecture 1: Bayesian methods

MARKOV CHAIN MONTE CARLO

Bayesian modelling. Hans-Peter Helfrich. University of Bonn. Theodor-Brinkmann-Graduate School

Bayesian data analysis in practice: Three simple examples

Riemann Manifold Methods in Bayesian Statistics

Bayesian Model Comparison:

MONTE CARLO METHODS. Hedibert Freitas Lopes

STA 4273H: Statistical Machine Learning

36-720: The Rasch Model

PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation.

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC

MCMC for big data. Geir Storvik. BigInsight lunch - May

Bayesian Estimation with Sparse Grids

Bayesian inference for multivariate extreme value distributions

Lecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013

Markov Chain Monte Carlo

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

SUPPLEMENT TO MARKET ENTRY COSTS, PRODUCER HETEROGENEITY, AND EXPORT DYNAMICS (Econometrica, Vol. 75, No. 3, May 2007, )

Quantile POD for Hit-Miss Data

Tutorial on Probabilistic Programming with PyMC3

Brief introduction to Markov Chain Monte Carlo

Development and Calibration of an Item Response Model. that Incorporates Response Time

Principles of Bayesian Inference

Lecture 7 and 8: Markov Chain Monte Carlo

Monte Carlo Inference Methods

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version)

Precision Engineering


RUTCOR RESEARCH REPORT

Item Parameter Calibration of LSAT Items Using MCMC Approximation of Bayes Posterior Distributions

Douglas H. Jones (a), Mikhail Nediak (b)

RRR 7-2, February, 2

(a) Rutgers-Newark; Department of Management Science and Information Systems; 2e Ackerson Hall; 180 University Avenue; Newark, NJ 07102; e-mail: dhjones@rci.rutgers.edu
(b) RUTCOR, Rutgers University, 640 Bartholomew Road, Piscataway, NJ 08854; e-mail: msnediak@rutcor.rutgers.edu

RUTCOR RESEARCH REPORT, RRR 7-2, February, 2

Item Parameter Calibration of LSAT Items Using MCMC Approximation of Bayes Posterior Distributions

Douglas H. Jones, Mikhail Nediak

Abstract. This research improves MCMC sampling techniques for Bayesian item calibration. A major benefit of MCMC calibration is that it produces entire posterior distributions, not just point estimates. This feature enables an assessment of the quality of the calibration using posterior confidence intervals. This study shows that MCMC calibration replicates BILOG results. Two sets of calibrations are performed: one involving 2,6 simulated test takers and another involving 28,84 real test takers on LSAT items. Following a suggestion in Patz and Junker (1999), this paper develops new item and ability proposal densities for the Metropolis-Hastings algorithm. It is shown that the new proposal densities greatly increase the efficiency of the algorithm.

Keywords: MCMC algorithm, item response models, BILOG, LSAT, item calibration

Acknowledgements: We thank Dr. Xiang-Bo Wang and Ms. Susan Dalessandro for performing the BILOG calibrations, and Dr. Lori McLeod for reviews and helpful suggestions. This research was supported by a contract to Rutgers University from Operations, Testing, and Research, Law School Admission Council.

Introduction

A popular Bayesian item calibration methodology is based on the deterministic approximation of marginal maximum likelihood estimates using the EM algorithm (Bock and Aitkin, 1981). BILOG is a software package that uses this approach to calibrate items (Bock and Mislevy, 1985). This paper explores a method for item calibration using stochastic approximation of the posterior density. Stochastic approximation consists of Monte Carlo sampling directly from the posterior distribution. The sampled process is obtained by the Metropolis-Hastings algorithm, which generates a Markov chain whose stationary distribution is the desired posterior distribution. In this case, stochastic approximation is called Markov chain Monte Carlo (MCMC) simulation. Estimation of the posterior density and its moments is based on these sampled data. The reader is referred to Patz (1996) and Patz and Junker (1997), who outlined general MCMC sampling methods for obtaining estimates of parameters in the three-parameter logistic IRT model. General references on stochastic approximation of Bayes posterior distributions include Gamerman (1997), Gelman et al. (1995), and Gilks et al. (1996). The main advantage of MCMC over BILOG is that MCMC offers an approximation to the entire posterior, while BILOG yields only point estimates of parameters. The reader is referred to Geyer (1996) for a general discussion comparing MCMC with the EM algorithm.

Patz and Junker (1999) presented a prototype program for MCMC item calibration written in S-PLUS. For purposes such as large-scale item calibration and online item calibration, one of the main goals of this research was production-level item calibration software written in the C++ language; see Jones and Nediak (1999b). In addition, we improved Patz and Junker's algorithm by implementing their suggestions for refining the proposal density using the Fisher information matrix. Other MCMC capabilities and algorithmic efficiencies are discussed in the paper.

The next section describes the improved MCMC scheme for approximating the posterior distribution of item parameters from the three-parameter logistic model. The following section presents comparisons between MCMC estimates and BILOG estimates for LSAT items using simulated and real data. Confidence bands of the item response function and the item information function were generated for each of the LSAT items. Trace curves of both MCMC and BILOG calibrated items are superimposed on the confidence bands, allowing comparisons and identifying discrepancies between the two calibration methods.

Stochastic Approximation of Posterior Distributions

IRT Model

Let u denote a response to a single item from an individual with univariate ability level θ. Let φ^T = (a, b, c) be a vector of unknown parameters associated with the item. Assume that all responses are scored either correct, u = 1, or incorrect, u = 0. An item response function (IRF), P(θ; φ), is a function of θ that describes the probability of a correct response by an individual with ability level θ. We shall focus on the family of three-parameter logistic (3PL) response functions:

$$P(\theta;\phi)=c+(1-c)\,R(\theta;\phi),\qquad(1)$$

where

$$R(\theta;\phi)=\frac{1}{1+e^{-1.7\,a(\theta-b)}}.\qquad(2)$$

The three IRT characteristics of a three-parameter item are discrimination power, a; difficulty, b; and guessing, c.

Now denote by I = {1, …, n} the set of examinees and by J = {1, …, m} the set of items. Let I(j) be the set of examinees who responded to item j and J(i) the set of items that were administered to examinee i. Denote by Θ = {θ_i : i ∈ I} the vector of abilities of the examinees in I and by Φ = {φ_j : j ∈ J} the matrix of parameters of the items in J.
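To make equations (1)-(2) concrete, here is a minimal Python sketch of the 3PL response function. It is our illustration rather than the report's C++ implementation, and the name irf_3pl is ours.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    """3PL item response function, equations (1)-(2):
    P(theta; phi) = c + (1 - c) * R(theta; phi), with the
    conventional 1.7 scaling constant in the logistic deviate."""
    r = 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))  # R(theta; phi)
    return c + (1.0 - c) * r                          # P(theta; phi)
```

For instance, irf_3pl(0.0, 1.0, 0.0, 0.2) returns 0.6 at θ = b: halfway between the guessing floor c = 0.2 and 1.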

We can introduce the submatrices Φ_{J(i)} of the matrix Φ corresponding to the subset of items J(i), and similar notation Θ_{I(j)} for the subset of examinees I(j). We will also consider the vector of all responses U = {u_{ij} : i ∈ I, j ∈ J(i)}, indexed by all valid examinee-item pairs. The vectors of responses corresponding to examinee i and item j will, for brevity, be denoted U_{J(i)} and U_{I(j)}, respectively. Under the assumption of independence of the responses corresponding to different items and different examinees, the joint likelihood of the ability vector Θ and the parameter matrix Φ given the vector of all responses U is

$$L(\Theta,\Phi;U)=\prod_{j\in J}L(\phi_j;\Theta_{I(j)},U_{I(j)})=\prod_{i\in I}L(\theta_i;\Phi_{J(i)},U_{J(i)}),\qquad(3)$$

where the likelihoods for a single item and a single examinee are

$$L(\phi_j;\Theta_{I(j)},U_{I(j)})=\prod_{i\in I(j)}P(\theta_i;\phi_j)^{u_{ij}}\,[1-P(\theta_i;\phi_j)]^{1-u_{ij}}\qquad(4)$$

and

$$L(\theta_i;\Phi_{J(i)},U_{J(i)})=\prod_{j\in J(i)}P(\theta_i;\phi_j)^{u_{ij}}\,[1-P(\theta_i;\phi_j)]^{1-u_{ij}}.\qquad(5)$$

Denoting the prior distribution for the vector of abilities and the matrix of item parameters by π(Θ, Φ), the joint posterior distribution is p(Θ, Φ; U) ∝ L(Θ, Φ; U) π(Θ, Φ).
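In implementation one evaluates the logarithm of (4)-(5), a sum of Bernoulli log-terms over the relevant examinee-item pairs. A sketch in the same spirit, reusing the hypothetical irf_3pl above:

```python
import numpy as np

def log_lik_item(phi, thetas, responses):
    """log L(phi_j; Theta_I(j), U_I(j)) from equation (4).
    thetas, responses: aligned arrays over the examinees in I(j),
    with responses coded u = 1 (correct) or u = 0 (incorrect)."""
    a, b, c = phi
    p = irf_3pl(np.asarray(thetas, dtype=float), a, b, c)
    u = np.asarray(responses, dtype=float)
    return float(np.sum(u * np.log(p) + (1.0 - u) * np.log1p(-p)))
```

The examinee likelihood (5) is the same sum taken over the items in J(i) instead of the examinees in I(j).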

Metropolis Algorithm with Blocking and Gibbs Sampling

Following Patz and Junker, we employ blocking in the transition mechanism of the Metropolis-Hastings algorithm. Let the proposal density for a single examinee be denoted q_θ(θ, θ′), where θ is the value in the current state of the Markov chain and θ′ is a proposed value. Similar notation q_φ(φ, φ′) is introduced for the proposal density for an item. We assume that the prior distribution is a product of the priors for the individual items and examinees:

$$\pi(\Theta,\Phi)=\prod_{i\in I}\pi(\theta_i)\prod_{j\in J}\pi(\phi_j).\qquad(6)$$

Under this assumption, the joint posterior distribution factors into a product of marginal posterior distributions of individual items and examinees. This means that we can sample new ability and item parameters individually for each examinee and item very efficiently, without computing a joint posterior at each step. Let φ_j^k and θ_i^k be the parameters of the j-th item and the ability of the i-th examinee at the k-th stage of the algorithm. The algorithm is then Gibbs sampling with two steps, each using the Metropolis-Hastings algorithm, as follows:

1. Sample new ability parameters for each examinee i: Draw θ_i^* from q_θ(θ_i^{k-1}, ·). Accept θ_i^k = θ_i^* with probability

$$\alpha(\theta_i^{k-1},\theta_i^{*})=\min\left\{1,\;\frac{L(\Phi_{J(i)}^{k-1};\theta_i^{*},U_{J(i)})\,\pi(\theta_i^{*})\,q_\theta(\theta_i^{*},\theta_i^{k-1})}{L(\Phi_{J(i)}^{k-1};\theta_i^{k-1},U_{J(i)})\,\pi(\theta_i^{k-1})\,q_\theta(\theta_i^{k-1},\theta_i^{*})}\right\}.$$

Proceed to step 2.

2. Sample new parameter vectors for each item j: Draw φ_j^* from q_φ(φ_j^{k-1}, ·). Accept φ_j^k = φ_j^* with probability

$$\alpha(\phi_j^{k-1},\phi_j^{*})=\min\left\{1,\;\frac{L(\Theta_{I(j)}^{k};\phi_j^{*},U_{I(j)})\,\pi(\phi_j^{*})\,q_\phi(\phi_j^{*},\phi_j^{k-1})}{L(\Theta_{I(j)}^{k};\phi_j^{k-1},U_{I(j)})\,\pi(\phi_j^{k-1})\,q_\phi(\phi_j^{k-1},\phi_j^{*})}\right\}.$$

Proceed to step 1.

Ideally, all sampled values of the Markov chain would be draws from the desired posterior density. However, this is not the case: the Metropolis algorithm ensures only that the stationary distribution of the sampled Markov chain is the desired posterior density. Therefore, one must use only the last portion of the Markov chain, where one is fairly certain that the stationary distribution is in effect. This latter portion of the chain, which is used for estimation, is called the active set. The beginning portion of the chain, which is set aside, is called the burn-in set.
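The two blocked steps can be sketched as follows. This is our schematic, not the report's code: log_post_theta and log_post_phi stand for the log of the likelihood-times-prior factors in the acceptance ratios, and for brevity the sketch assumes symmetric proposals, so the q-ratio cancels; the information-matrix kernel introduced below is not symmetric, and its q-ratio must then be retained.

```python
import math, random

def mh_update(current, propose, log_post):
    """One Metropolis-Hastings update with a symmetric proposal:
    accept with probability min{1, post(candidate)/post(current)}."""
    candidate = propose(current)
    if math.log(random.random()) < log_post(candidate) - log_post(current):
        return candidate   # accepted
    return current         # rejected: the chain stays put

def gibbs_sweep(thetas, phis, log_post_theta, log_post_phi,
                propose_theta, propose_phi):
    # Step 1: new ability for each examinee i, item parameters held fixed.
    for i in range(len(thetas)):
        thetas[i] = mh_update(thetas[i], propose_theta,
                              lambda t, i=i: log_post_theta(i, t))
    # Step 2: new parameter vector for each item j, abilities held fixed.
    for j in range(len(phis)):
        phis[j] = mh_update(phis[j], propose_phi,
                            lambda p, j=j: log_post_phi(j, p))
    return thetas, phis
```

Repeating gibbs_sweep and discarding the burn-in portion leaves the active set used for estimation.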

Also, the rate at which new observations are accepted determines the efficiency of the sampling. As long as the ability of the chain to explore the support set of the posterior distribution is not jeopardized, a higher rate means the sampling is completed in a shorter time. This rate is called the acceptance rate.

As for the particular choices of priors and proposal densities, different options were examined. One choice is the standard normal prior for ability and a uniform prior (over a sufficiently large rectangular region) for the item parameters. We decided not to choose proposal densities from the normal family, since it is recommended that the tails of the proposal density dominate the tails of the posterior distribution. Thus our proposal densities come from the Student t_ν family. In particular, the proposal density for examinee ability was

$$q_\theta(\theta,\theta')=C\left(1+\frac{(\theta'-\theta)^2}{\nu\,\sigma_\theta^2}\right)^{-(\nu+1)/2},$$

where θ is the present value, θ′ is the proposed value, C is a constant coefficient, and σ_θ² is a constant scale factor.
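Because this kernel is a t-distributed random walk centered at the current value with constant scale, it is symmetric in (θ, θ′), and a draw is a single scaled t_ν variate. A sketch, with ν and σ_θ² as illustrative placeholders since the report's exact values do not survive in this copy:

```python
import numpy as np

rng = np.random.default_rng()

def propose_theta(theta, nu=10.0, sigma2=1.0):
    """Draw theta' from q_theta(theta, .): a Student t_nu random-walk
    step of scale sigma around the current theta. nu and sigma2 are
    illustrative placeholders, not the report's settings."""
    return theta + np.sqrt(sigma2) * rng.standard_t(nu)
```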

Moreover, proposal densities for item parameters that do not adapt to the shape of the posterior distribution result in a very low acceptance rate as the number of examinees grows. It was suggested in the paper by Patz and Junker that proposal densities using the information matrix for the item parameters may be employed. This approach proved to be very fruitful. Consider the information matrix for an item with parameters φ when it is administered to an examinee with ability level θ:

$$m(\phi;\theta)=\frac{1.7^2(1-R)^2}{P(1-P)}\begin{pmatrix}(b-\theta)^2S^2 & a(b-\theta)S^2 & -(b-\theta)S/1.7\\ a(b-\theta)S^2 & a^2S^2 & -aS/1.7\\ -(b-\theta)S/1.7 & -aS/1.7 & 1/1.7^2\end{pmatrix},$$

where S = (1 − c) R(θ; φ), R = R(θ; φ), and P = P(θ; φ).

By the additivity property of information, the information matrix for an item when it is administered to the examinees in the set I(j) is

$$M(\phi_j)=\sum_{i\in I(j)}m(\phi_j;\theta_i).$$

The expected information matrix with respect to a prior distribution on ability, π(θ), can also be considered:

$$\bar M(\phi)=\int_{-\infty}^{+\infty}m(\phi;\theta)\,\pi(\theta)\,d\theta.$$

In particular, we considered the standard normal prior on ability. The proposal density used for the item parameters was a multivariate Student t_ν centered at the previous value of the parameters, with covariance matrix proportional to the inverse of the corresponding information matrix or of the appropriately renormalized (multiplied by |I(j)|) expected information matrix, the coefficient of proportionality being ν/(ν − 2) for ν degrees of freedom:

$$q_\phi(\phi,\phi')=C\,\{\det[M(\phi)]\}^{1/2}\,\{\nu+(\phi'-\phi)^{\mathsf T}M(\phi)(\phi'-\phi)\}^{-(\nu+3)/2},$$

where C is a constant depending only on ν.
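A sketch of both pieces, under our reading of the formulas above: the item information matrix M(φ_j) accumulated as the Bernoulli gradient outer product (which reproduces m(φ; θ) entrywise), and a multivariate t_ν proposal built from a normal draw with covariance M(φ)⁻¹ divided by the square root of an independent scaled chi-square.

```python
import numpy as np

rng = np.random.default_rng()

def item_information(phi, thetas):
    """M(phi_j) = sum over i in I(j) of m(phi_j; theta_i), where
    m = grad P * grad P^T / (P (1 - P)) for the 3PL model."""
    a, b, c = phi
    M = np.zeros((3, 3))
    for theta in np.atleast_1d(thetas):
        r = 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))
        p = c + (1.0 - c) * r
        grad = np.array([1.7 * (theta - b) * (1.0 - c) * r * (1.0 - r),
                         -1.7 * a * (1.0 - c) * r * (1.0 - r),
                         1.0 - r])
        M += np.outer(grad, grad) / (p * (1.0 - p))
    return M

def propose_phi(phi, M, nu=10.0):
    """Draw phi' from the multivariate Student t_nu kernel q_phi
    centered at phi with scale matrix M(phi)^{-1}."""
    z = rng.multivariate_normal(np.zeros(3), np.linalg.inv(M))
    w = rng.chisquare(nu) / nu
    return np.asarray(phi) + z / np.sqrt(w)
```

Dividing a mean-zero normal vector by the square root of an independent chi-square over its degrees of freedom is the standard construction of a multivariate t variate, so the proposal inherits the heavy tails sought above.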

Data and Estimation Schedule

The data consist of two sets coming from 28,84 live responses and 2,36 simulated responses to LSAT items. BILOG calibrations were performed at LSAC using these data. This paper presents the results of MCMC calibration using the same data. The burn-in set for MCMC was 2 samples, followed by an active set of 5 samples. The prior distribution for the item parameters was uniform on the rectangular region [−6.0, 6.0] × [0.5, 2.5] × [0.0, 0.5]; for ability, the prior distribution was the standard normal truncated to the interval [−4.0, 4.0]. Both transition kernels used Student t_ν densities. The scaling factor σ_θ² in the transition kernel for θ was chosen as recommended in Patz and Junker. The transition kernel for φ was based on the expected information.

We note that in the real dataset there were examinees who omitted or did not reach some of the questions. It is not immediately obvious how to handle this situation in the MCMC algorithm, and we are currently studying various approaches. Here we report results that use a likelihood ignoring questions omitted or not reached by a particular examinee.

MCMC and BILOG Estimates for LSAT Items

Table 1 contains the BILOG and MCMC estimates of item parameters using real data. A cursory review of the values reveals that the BILOG and MCMC estimates are close. In addition, Table 1 displays the MCMC acceptance rates for individual items. The average acceptance rate, 37.5%, is still in the optimal range of 20-50% recommended in the literature and compares favorably with the 25% experienced by Patz and Junker, representing an improvement in the efficiency of our MCMC algorithm for item calibration.

Table 2 contains the BILOG and MCMC estimates of item parameters using simulated data. Both sets of estimates are unscaled. A cursory review of the values reveals that the BILOG estimates are not as close to the true values as the MCMC estimates. The average acceptance rate, 38.5%, replicates the value obtained with real data.

A different view of the performance of the MCMC estimator is based on the IRFs. In particular, we calculate the distance between estimated IRFs for the same item using the weighted-Hellinger deviance (see the Appendix). We have chosen the weighting densities to be the probability density functions of the N(µ, 1) distributions, µ = −2, −1, 0, 1, 2. Figure 1 displays histograms of the deviance between the MCMC and BILOG IRFs based on real data (MCMCR), the deviance between the MCMC and TRUE IRFs (MCMC), and the deviance between the BILOG and TRUE IRFs (BILOG).
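The weighted-Hellinger deviance defined in the Appendix can be approximated on a grid of ability values. A sketch, reusing the hypothetical irf_3pl from above; the grid bounds and resolution are our choices:

```python
import numpy as np

def hellinger_deviance(phi1, phi2, mu=0.0):
    """Weighted-integral Hellinger deviance between the IRFs of two
    parameter estimates phi1, phi2, with N(mu, 1) weighting density
    (see the Appendix), approximated by a Riemann sum over theta."""
    theta = np.linspace(-6.0, 6.0, 1201)
    p1, p2 = irf_3pl(theta, *phi1), irf_3pl(theta, *phi2)
    # Hellinger deviance between the two Bernoulli IRFs at each theta.
    h = 2.0 - 2.0 * np.sqrt(p1 * p2) - 2.0 * np.sqrt((1 - p1) * (1 - p2))
    w = np.exp(-0.5 * (theta - mu) ** 2) / np.sqrt(2.0 * np.pi)
    return float(np.sum(h * w) * (theta[1] - theta[0]))
```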

The latter two estimates are based on the simulated data. The weighting density for each deviance type is indicated in the legend of each histogram. The histogram intervals begin at zero, with width .5 for all histograms. The results based on the real data indicate that there is virtually no difference between the MCMC and BILOG estimates of the IRF. Table 2 and Figure 1 indicate that MCMC is superior to BILOG on the simulated data.

Figure 1 also displays the estimated trace curves of the items corresponding to the 75th percentile of the deviancies for each data set. The item IDs are indicated in the legends of the plots. These plots serve to guide our intuition about the relation between the deviance and the proximity of two trace curves. They also indicate how much MCMC is superior to BILOG on the simulated data. In addition, the MCMC and BILOG estimates are indistinguishable for the real data. This is remarkable, since the BILOG estimates have undergone proprietary scaling at LSAC while the MCMC estimates have not.

Finally, Figure 2 presents the trace curves of selected IRFs and item information functions (IIFs) corresponding to 5 samples from the posterior distribution of the item parameters obtained by the MCMC procedure. For all but item 58, the IRFs and IIFs corresponding to the BILOG estimates (indicated by an "x") are practically always absorbed by the envelopes.

Discussion

Looking at Table 1, we observe that the parameters of item 58 (and of some other items, like 4, 6, 2-7, 2, and 64) obtained by MCMC do not agree with those obtained by BILOG based on real data. To explain these differences we should take several effects into account. First, we used uniform priors for the item parameters, in contrast to the normal, lognormal, and beta priors used by BILOG for the difficulty, discrimination, and guessing parameters, respectively. This means that the BILOG estimates will be shrunk towards certain values in comparison to ours. Note also that the items above have relatively low acceptance rates.

Fine-tuning of the algorithm parameters allowed us to achieve acceptance rates higher than 30% for most items (see Tables 1 and 2). This is due to our choice of proposal densities for the item parameters using the information matrix. We can see, however, that the acceptance rate can drop below 20% for items with a low guessing parameter. This can be explained by the fact that when the mode of the posterior distribution is close to the boundary of its support set S, a significant fraction of all proposed parameter vectors falls outside S and is rejected right away. As noted earlier, item 58 has boundary parameters and one of the lowest acceptance rates in Table 1. The use of a truncated kernel q can solve this problem. Note that we would have to truncate it only with respect to the guessing parameter.

Conclusion

This study shows that MCMC calibration replicates or surpasses BILOG results. A major benefit of MCMC calibration is the production of entire posterior distributions, not just point estimates as with the EM algorithm. This feature enables a true assessment of the quality of the calibration. This is particularly important for calibration in an on-line environment, since it provides information for continuing the sampling of examinees for individual items. Our implementation and modification of Patz and Junker's suggestion for a proposal density shows that the acceptance rate rises to about 38% from 25%. This sizable increase in efficiency warrants the use of our proposal densities in MCMC sampling for IRT.

APPENDIX

The Hellinger deviance between two discrete probability distributions f_1(x) and f_2(x) is

$$H=\sum_i\bigl[f_1^{1/2}(x_i)-f_2^{1/2}(x_i)\bigr]^2.$$

The maximum value of the Hellinger deviance is 2, and the minimum is 0. If there are two parameter estimates φ and φ′, the Hellinger deviance between the corresponding IRFs at a particular ability level θ is (after a simple transformation):

$$H(\phi,\phi';\theta)=2-2\sqrt{P(\theta;\phi)\,P(\theta;\phi')}-2\sqrt{\bigl(1-P(\theta;\phi)\bigr)\bigl(1-P(\theta;\phi')\bigr)}.$$

The weighted-integral Hellinger deviance between IRFs is obtained by integrating out θ with respect to a suitable density function w(θ):

$$\bar H(\phi,\phi')=\int_{-\infty}^{+\infty}H(\phi,\phi';\theta)\,w(\theta)\,d\theta.$$

References

Baker, F.B. (1992). Item Response Theory: Parameter Estimation Techniques. New York: Marcel Dekker, Inc.

Bock, R.D., and Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of the EM algorithm. Psychometrika 46, 443-459.

Bock, R.D., and Mislevy, R.J. (1985). BILOG [Computer program]. Scientific Software.

Gamerman, D. (1997). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. London: Chapman & Hall.

Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995). Bayesian Data Analysis. London: Chapman & Hall.

Geyer, C.J. (1996). Estimation and optimization of functions. In Markov Chain Monte Carlo in Practice (Eds. W.R. Gilks, S. Richardson, and D.J. Spiegelhalter), 241-258. London: Chapman & Hall.

Gilks, W.R., Richardson, S., and Spiegelhalter, D.J. (eds.) (1996). Markov Chain Monte Carlo in Practice. London: Chapman & Hall.

Patz, R.J. (1996). Markov chain Monte Carlo methods for item response theory models with applications for the National Assessment of Educational Progress. Doctoral dissertation, Department of Statistics, Carnegie Mellon University.

Patz, R.J., and Junker, B.W. (1999). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics 24, 146-178.


[Figure 1: five panels, one per weighting density N(2,1), N(1,1), N(0,1), N(-1,1), N(-2,1); each panel shows histograms of the MCMCR, MCMC, and BILOG deviancies together with the pairs of trace curves at the 75th percentile of each deviance.]

FIGURE 1. Histograms of Hellinger deviancies and pairs of IRFs corresponding to the 75th percentile of the Hellinger deviance.

[Figure 2: four panels, IRF4, IRF58, IIF4, IIF58, plotted over θ from −3 to 3.]

FIGURE 2. Bayes posterior envelopes for IRFs and IIFs for items 4 and 58.