BAYESIAN MODEL CHECKING STRATEGIES FOR DICHOTOMOUS ITEM RESPONSE THEORY MODELS

Sherwin G. Toribio

A Dissertation Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

August 2006

Committee: James H. Albert, Advisor; William H. Redmond, Graduate Faculty Representative; John T. Chen; Craig L. Zirbel
ABSTRACT

James H. Albert, Advisor

Item Response Theory (IRT) models are commonly used in educational and psychological testing. These models are mainly used to assess the latent abilities of examinees and the effectiveness of the test items in measuring this underlying trait. However, model checking in Item Response Theory is still an underdeveloped area. In this dissertation, various model checking strategies for different Item Response models are presented from a Bayesian perspective. In particular, three methods are employed to assess the goodness-of-fit of different IRT models. First, Bayesian residuals and different residual plots are introduced to serve as graphical procedures to check for model fit and to detect outlying items and examinees. Second, the idea of predictive distributions is used to construct reference distributions for different test quantities and discrepancy measures, including the standard deviation of point-biserial correlations, Bock's Pearson-type chi-square index, Yen's Q1 index, the Hosmer-Lemeshow statistic, McKinley and Mills' G2 index, Orlando and Thissen's S-G2 and S-X2 indices, Wright and Stone's W-statistic, and the log-likelihood statistic. The prior, posterior, and partial posterior predictive distributions are discussed and employed. Finally, Bayes factors are used to compare different IRT models in model selection and in the detection of outlying discrimination parameters. Here, different numerical procedures to estimate the Bayes factors for these models are discussed. All of the proposed methods are illustrated using simulated data and Mathematics placement exam data from BGSU.
ACKNOWLEDGMENTS

First of all, I would like to thank Dr. Jim Albert, my advisor, for his constant support and many suggestions throughout this research. I also wish to thank him for the friendship and all the advice that he shared about life in general. I also want to extend my gratitude to the other members of my committee, Dr. John Chen, Dr. Craig Zirbel, and Dr. William Redmond, for their time and advice. I am grateful to the Department of Mathematics and Statistics for all the support and for providing a wonderful research environment. I especially wish to thank Marcia Seubert, Cyndi Patterson, and Mary Busdeker for all their help. The dissertation fellowship was crucial to the completion of this work. I wish to thank my colleagues and friends from BG, Joel, Vhie, Merly, Florence, Dhanuja, Kevin, Mike, Khairul and Shapla, and all the other Pinoys for all the fun and interesting discussions. Finally, I thank my beloved wife, Alie, for all her support, love, and patience, and Simone for bringing all the joy and happiness in our lives during our stay in Bowling Green. Without them this work could never have come to existence.

Sherwin G. Toribio
Bowling Green, Ohio
August 2006
TABLE OF CONTENTS

CHAPTER 1: ITEM RESPONSE THEORY MODELS
  Introduction
  Item Response Curve
  Common IRT Models
    One-Parameter Model
    Two-Parameter Model
    Three-Parameter Model
    Exchangeable IRT Model
  Parameter Estimation
    Likelihood Function
    Joint Maximum Likelihood Estimation
    Bayesian Estimation
    Albert's Gibbs Sampler
  An Example - BGSU Mathematics Placement Exam
  Advantages of the Bayesian Approach

CHAPTER 2: MODEL CHECKING METHODS FOR BINARY AND IRT MODELS
  Introduction
  Residuals
    Classical Residuals
    Bayesian Residuals
  Chi-squared Tests for Goodness-of-fit of IRT Models
    Wright and Panchapakesan Index (WP)
    Bock's Index (B)
    Yen's Index (Q1)
    Hosmer and Lemeshow Index (HL)
    McKinley and Mills Index (G2)
    Orlando and Thissen Indices (S-χ2 and S-G2)
  Discrepancy Measures and Test Quantities
  Predictive Distributions
    Prior Predictive Distribution
    Posterior Predictive Distribution
    Conditional Predictive Distribution
    Partial Posterior Predictive Distribution
  Bayes Factor

CHAPTER 3: OUTLIER DETECTION IN IRT MODELS USING BAYESIAN RESIDUALS
  Introduction
  Detecting Misfitted Items Using IRC Interval Band
  Detecting Guessers
    Examinee Bayesian Residual Plots
    Examinee Bayesian Latent Residual Plots
  Detecting Misfitted Examinees
  Application to Real Data Set

CHAPTER 4: ASSESSING THE GOODNESS-OF-FIT OF IRT MODELS USING PREDICTIVE DISTRIBUTIONS
  Introduction
  Checking the Appropriateness of the One-parameter Probit IRT Model
  Point Biserial Correlation
    Using Prior Predictive
    Using Posterior Predictive
  Item Fit Analysis
    Using Prior Predictive
    Using Posterior Predictive
    Using Partial Posterior Predictive
  Examinee Fit Analysis
    Discrepancy Measures for Person Fit
    Detecting Guessers Using Posterior Predictive
  Application to Real Data Set

CHAPTER 5: BAYESIAN METHODS FOR IRT MODEL SELECTION
  Introduction
  Checking the Beta-Binomial Model Using Bayes Factors
    Beta-Binomial Model
    Bayes Factor
    Laplace Method for Integration
    Estimating the Bayes Factor
    Application to Real Data
  Approximating the Denominator of the Bayes Factor Using Importance Sampling
    Exchangeable IRT Model
    Approximating the One-parameter Model
    Approximating the Two-parameter Model
  IRT Model Comparisons and Model Selection
    Computing the Bayes Factor for IRT Models
    IRT Model Comparison
  Finding Outlying Discrimination Parameters
    Using Bayes Factor
    Using Mixture Prior Density
  Application to Real Data Set

CHAPTER 6: SUMMARY AND CONCLUSIONS

Appendix A: NUMERICAL METHODS
  A.1 Newton-Raphson for IRT Models
  A.2 Markov Chain Monte Carlo (MCMC)
    A.2.1 Metropolis-Hastings
    A.2.2 Gibbs Sampling
    A.2.3 Importance Sampling

Appendix B: MATLAB PROGRAMS
  B.1 Chapter 1 codes
  B.2 Chapter 3 codes
  B.3 Chapter 4 codes
  B.4 Chapter 5 codes

REFERENCES
LIST OF FIGURES

1.1 A typical item response curve.
1.2 Item response curves for 3 different difficulty values.
1.3 Item response curves for 3 different discrimination values.
1.4 Items with high discrimination power have higher chances of distinguishing two examinees with different ability scores than items with low discrimination power.
1.5 Scatterplots of 35 actual item parameters versus their corresponding estimates.
1.6 Scatterplots of 1000 actual ability scores versus their corresponding estimates.
1.7 Scatterplots of 35 actual item parameters versus their corresponding Bayesian estimates.
1.8 Scatterplot of 1000 actual ability scores versus their corresponding Bayesian estimates.
1.9 Summary plot of the JML estimates of the parameters of the 35 items in the BGSU Math placement exam.
1.10 Scatterplot of the JML estimates of the ability scores versus their corresponding exam raw scores.
1.11 Summary plot of the Bayesian estimates of the parameters of the 35 items in the BGSU Math placement exam.
1.12 Scatterplot of the Bayesian estimates of the ability scores versus their corresponding exam raw scores.
1.13 Scatterplots that compare the Bayesian estimates with the JMLE estimates of the item parameters.
1.14 A scatterplot that depicts a strong correlation between the Bayesian and JMLE estimates of the ability scores.
2.1 Classical residual plot.
3.1 A 90% interval band for the fitted item response curves of items 15 and 3 using the Two-parameter IRT model.
3.2 A 90% interval band for the item response curves of items 1 (above) and 26 (below) fitted with the (left) One-parameter IRT model and (right) Two-parameter IRT model.
3.3 Posterior residual plots of items 1 (above) and 26 (below) fitted with the (left) One-parameter IRT model and (right) Two-parameter IRT model.
3.4 Examinee residual plot of someone with ability score θ =
3.5 Examinee residual plots of examinees with ability scores of θ = 1.15 (left) and θ = 2.19 (right).
3.6 Examinee residual plots of examinees with ability scores of θ = 1.22 (left) and θ = 2.2 (right).
3.7 Examinee residual plots of two guessers.
3.8 Examinee latent residual plot of an examinee with ability score θ =
3.9 Examinee latent residual plots of examinees with ability scores of θ = 1.15 (left) and θ = 2.19 (right).
3.10 Examinee latent residual plots of examinees with ability scores of θ = 1.22 (left) and θ = 2.2 (right).
3.11 Examinee latent residual plots of two guessers.
3.12 Histograms of the number of examinees (out of 1000) who scored (left) much too high and (right) much too low.
3.13 Examinee residual and latent residual plots of examinee no.
3.14 Residual and latent residual plots of examinee no. 82 (above) and 854 (below).
3.15 Examinee residual and latent residual plots of examinee no.
3.16 IRC band and posterior residual plot of item
3.17 Item response curves of item 21 (above) and item 3 (below).
4.1 Histogram of 5 simulated values of std(r-pbis) using the prior predictive distribution.
4.2 This histogram of the 1 simulated prior predictive p-values illustrates that the distribution of the prior p-value of std(r-pbis) is close to uniform[0,1].
4.3 Histograms of 1 observed std(r-pbis) when data sets were generated using the (left) two-parameter and (right) one-parameter model.
4.4 Histogram of 5 simulated values of std(r-pbis).
4.5 Histogram of 1 posterior predictive p-values.
4.6 Residual plots of the two guessers, examinee 236 and
4.7 Residual plots of examinee
4.8 Histogram of the 995 non-guessers.
4.9 Histogram of 5 simulated values of std(r-pbis).
4.10 The 90% interval bands for the item response curves of items 11 (upper left), 3 (upper right), 33 (lower left), and 34 (lower right) fitted with the one-parameter IRT model.
4.11 The 90% interval bands for the item response curves of items 14 (left) and 15 (right) fitted with the one-parameter IRT model.
4.12 Latent residual plots of six students marked as potential guessers by the W and L statistics using the posterior predictive distribution.
5.1 Scatterplots of the exact values versus the approximate values of the log-denominator of the Bayes factor.
5.2 Parameter estimates obtained using the exchangeable model compared with the actual values: (left) item difficulty, and (right) ability scores.
5.3 Item parameter and ability score estimates obtained using the exchangeable model compared with the observed data: (left) item difficulty vs. number of correct students, and (right) ability scores vs. students' raw scores.
5.4 Scatterplot of the discrimination estimates obtained using the Exchangeable model and the Two-parameter model.
5.5 Estimates obtained using the two exchangeable models (one with random s_a and one with fixed s_a = .25) compared: (left) item difficulty; (right) ability scores.
5.6 Estimates obtained using the One-parameter model and the exchangeable model with fixed s_a = .1 compared: (left) item difficulty; (right) item discrimination.
5.7 Estimates obtained using the One-parameter model and the exchangeable model with fixed s_a = 1 compared: (left) item difficulty; (right) ability scores.
5.8 Scatterplot of estimates of ability scores obtained using the One-parameter model and the exchangeable model with fixed s_a =
5.9 Histogram of 1 log10 BF of the Exchangeable model (s_a = .25) vs. the (left) Two-parameter model and (right) One-parameter model.
5.10 Values of log10 BF of exchangeable models with varying standard deviations compared to the approximate Two-parameter model. The right plot is a closer look at the peak of the graph.
5.11 Values of log10 BF of exchangeable models with varying standard deviations compared to the approximate One-parameter model. The right plot is a closer look at the peak of the graph.
5.12 (left) Scatterplot of the actual vs. estimated item discrimination parameters. (right) Estimated probability of each item having an outlying discrimination parameter. Note that items 1, 2, and 3 have much bigger probabilities than the rest.
5.13 Values of log10 BF of exchangeable models with varying standard deviations compared to the two-parameter model using the BGSU Math placement data set. The right plot is a closer look at the peak of the graph.
5.14 Histogram of the 1 posterior sample values of µ_a for the BGSU Math placement data using the exchangeable model with s_a =
5.15 Histogram of the 1 posterior sample values of µ_a for the BGSU Math placement data using the exchangeable model with s_a =
LIST OF TABLES

1.1 First and Second Derivatives of Item and Ability Parameters for the Two-Parameter Logistic Model.
1.2 Two extreme questions in the exam.
2.1 Levels of evidence by log10 BF.
4.1 Orlando and Thissen (2000) simulation results: proportion of significant p-values (< .05).
4.2 Percentage of p-values < .05 out of
4.3 Percentage of p-values < .05 out of
4.4 Percentage of significant p-values when the one-parameter probit model is used on items with no guessing parameter (c = 0).
4.5 Percentage of significant p-values when the one-parameter probit model is used on items with guessing parameter value of c =
4.6 Percentage of significant p-values when the two-parameter probit model is used on items with no guessing parameter (c = 0).
4.7 Percentage of significant p-values when the two-parameter probit model is used on items with guessing parameter value of c =
4.8 Percentage of p-values < .05 out of 1 using G2 (pp1 and pp2 represent the one-parameter and two-parameter probit model).
4.9 The 17 misfitted examinees with P_W < .05 (* signifies a guesser).
4.10 The 16 misfitted examinees with P_L < .05 (* signifies a guesser).
4.11 The percentage of P_L and P_W < .05 (* signifies a guesser).
5.1 Twenty simulated observations from the Beta-binomial model.
5.2 Twenty generated binomial observations.
5.3 Range of values of log10 BF.
5.4 Levels of evidence by log10 BF.
5.5 Barry Bonds hitting data from 1986 to
5.6 The log10 BF(M_l,out/M) for each item in the artificial data. Note that the values of log10 BF(M_l,out/M) for items 1, 2, and 3 are all bigger than 3, marking them as items with outlying discrimination parameters.
5.7 The γ̂ for each item represents the likelihood that its discrimination parameter is outlying. Note that the values of γ̂ for items 1, 2, and 3 are all much bigger than the rest, marking them as items with outlying discrimination parameters.
5.8 Bayesian estimates of â_j, log10 BF, and γ for the BGSU Math placement exam.
OVERVIEW

The focus of this dissertation is to discuss the available model diagnostic procedures for Item Response Theory (IRT) models and to propose new methodologies to assess the goodness-of-fit of these models. The first two chapters cover the material needed to understand the different IRT models and some Bayesian ideas which will be utilized later. Chapters 3, 4, and 5 cover the proposed Bayesian methodologies to assess the goodness-of-fit of the IRT models.

In Chapter 1, the different IRT models used in this work are introduced. Classical and Bayesian methods to estimate the parameters of the IRT models are also discussed, including some numerical methods like Newton-Raphson and Gibbs sampling. These methods are illustrated using a Mathematics placement data set from BGSU. The chapter ends with a discussion of the advantages of the Bayesian estimation method over the classical estimation method.

Chapter 2 covers the ideas of classical and Bayesian residuals. The concept of residuals is used to construct different chi-squared indices which are currently used to check the model fit of IRT models. These indices are used later within a Bayesian framework as discrepancy measures. The ideas of predictive distributions and measures of surprise are also discussed in this chapter. These standard Bayesian ideas are useful for constructing reference distributions for different test quantities and discrepancy measures. Another important Bayesian concept that will be employed later is the Bayes factor, which is introduced in the last section of this chapter.
Chapter 3 deals mostly with graphical procedures that can be used to assess the fit of IRT models. These visual diagnostic plots are constructed from Bayesian residuals. The item response curve probability interval band proposed by Albert (1999) is a simple but very useful plot for checking item fit. This plot is described in the first section. Two other diagnostic plots, the examinee Bayesian residual and latent residual plots, are proposed in the second section. These two plots are utilized to check how a particular examinee performed on the test. They may also help detect examinees who were simply guessing their responses. In the third section, a Bayesian procedure to detect examinees who scored much too low or much too high on the exam is proposed based on another Bayesian residual. These Bayesian methods and plots are applied to a real data set in the last section.

In Chapter 4, new quantitative methods are proposed to give objective assessments of the fit of IRT models. In particular, the prior, posterior, and partial posterior predictive distributions are used to construct reference distributions for the standard deviation of the item point-biserial correlations and for eight different discrepancy measures: the six χ2-indices described in Chapter 2 and two more discrepancy measures for person fit. A simulation study is performed to illustrate and compare the effectiveness of these different discrepancy measures and predictive distributions in detecting misfitted items and examinees, as well as overall model misfit. The chapter ends with the application of these predictive methods to a real data set.

In Chapter 5, the Bayes factor is used to illustrate a quantitative method for comparing the goodness-of-fit of different IRT models and for model selection. The first section of this chapter covers different numerical methods that could be used to calculate the Bayes factor.
These methods are then modified and applied to estimate the Bayes factor for IRT models
in later sections. The Bayes factor is used to choose between competing IRT models and to detect outlying discrimination parameters. The effectiveness of this method is illustrated using simulated data. Again, these methods are applied to a real data set in the last section. Finally, the last chapter gives a summary of all the proposed Bayesian methods, along with discussions of their performance in assessing the goodness-of-fit of different IRT models.
CHAPTER 1
ITEM RESPONSE THEORY MODELS

1.1 Introduction

Item Response Theory (IRT) models are commonly used in educational and psychological testing. In these fields of study, researchers are usually interested in measuring an underlying ability of examinees, such as intelligence, mathematical ability, or scholastic aptitude. However, these kinds of quantities cannot be measured directly the way one measures physical attributes like weight or height. In this sense, these underlying abilities are latent traits. One of the main objectives of IRT is to measure the amount of (latent) ability that an examinee possesses. This is usually done using a questionnaire or an examination. It is important that the items used in the questionnaire or test are appropriate to accurately and effectively measure the underlying trait. Consequently, the second main objective of IRT is to study the effectiveness of different test items in measuring a particular underlying trait.

Although the idea of IRT has been around for almost a century, it only became popular in the last two decades. This is mainly due to the extensive computational requirements of IRT methods. Up until the 1980s, Classical Test Theory (CTT) had been the mainstay of psychological and educational test development and test score analysis. The classic book of Gulliksen (1950) is often cited as the defining volume for CTT. Today there are countless achievement, aptitude, and personality tests that were constructed using CTT models and procedures.

However, there are many well-documented shortcomings of the ways in which educational and psychological tests are usually constructed, evaluated, and used within CTT (Hambleton & van der Linden, 1982). For one, the values of commonly used item statistics in test development, such as item difficulty, depend on the particular sample of examinees from which they were obtained. That is, one particular item can be labeled as easy when given to a group of well-prepared students and as difficult when given to a group of unprepared students. For more information about the shortcomings of CTT, see the book by Hambleton and Swaminathan (1985).

By the late 1980s, the power of computers had developed to a point where it allowed people working in measurement theory to employ the more computationally intensive methods of IRT. This new theory is conceptually more powerful than CTT [11]. Based upon items rather than test scores, IRT addresses most of the shortcomings of CTT. In other words, IRT can do all the things that CTT can do and more. An extensive comparison between these two theories is given in the book by Embretson and Reise (2000).

1.2 Item Response Curve

In this dissertation, the latent ability of examinees (usually denoted by θ) is assumed to be continuous and one-dimensional. That means that the performance of an examinee on a particular item of an exam depends only on this one characteristic. Theoretically, the range of this latent variable is from negative infinity to positive infinity, but for most practical purposes it is sufficient to limit this range to between -3 and 3. An examinee with a higher ability score is expected to perform better in answering a particular item in the test than an examinee with a lower ability score.

In the case where items in the test can only be answered either correctly or incorrectly,
let y denote the examinee's response to a particular item, and take y = 1 if the response is correct and y = 0 if incorrect. This is a Bernoulli random variable with success probability p that depends on the latent ability of the examinee. That is, p = Pr(y = 1) = F(θ), where F represents a known function, called the link function. Because p should be an increasing function of θ and should take on values between 0 and 1, a natural class for the function F is provided by the class of cumulative distribution functions, or cdfs. The two most commonly used link functions in IRT models are:

1. Probit link (standard normal cdf):

F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \, dt, \quad x \in R.

2. Logistic link (standard logistic distribution function):

F(x) = \frac{e^x}{1 + e^x}, \quad x \in R.

Inferences obtained using either link function are essentially the same. Previously, people working with IRT models preferred the logistic link because of its nice properties that simplify the mathematical calculations in parameter estimation. However, with the advancement of computing power and the introduction of Bayesian methods in parameter estimation, the probit link has gained more popularity as it is more natural and easier to implement numerically.

In the IRT model, an examinee with a certain ability level will have a certain probability of answering a particular item correctly. Plotting these probabilities against the corresponding ability scores yields a plot like the one shown in Figure 1.1. This curve is called an Item Response Curve (IRC).

Figure 1.1: A typical item response curve.

1.3 Common IRT Models

One-Parameter Model

The probability that an examinee will answer a particular item in a test correctly should also depend on the characteristics of the item. For example, if item 2 is more difficult than item 1, then the probability that a particular examinee will get item 2 correct should be lower than the probability that he/she gets item 1 correct. Under the assumption that each item in the test can be described using this single difficulty parameter, one could model the probability of correctly answering a particular item in the test by

Pr(y = 1) = F(\theta - b), \quad (1.3.1)
where b represents the difficulty parameter of the item. To see the effect of b on the Item Response Curve (IRC), consider the three different plots given in Figure 1.2 with varying difficulty values.

Figure 1.2: Item response curves for 3 different difficulty values.

Note that b serves as a location parameter. When b takes negative values, the IRC is shifted to the left and the probability that a particular examinee, with a certain ability score θ, correctly answers the item increases. Hence, lower b values correspond to easier items and higher b values correspond to more difficult items.

When the link function F is taken to be the cumulative distribution function of the standard normal distribution (denoted by Φ), this model is known as the One-parameter probit model. But when the link function F is taken as the logistic cumulative distribution function, this model becomes the famous Rasch model (Rasch, 1966):

Pr(Y = 1) = \frac{e^{\theta - b}}{1 + e^{\theta - b}}. \quad (1.3.2)

Two-Parameter Model

Suppose that each item in the exam can be described by two parameters: a discrimination parameter a_j and a difficulty parameter b_j. Then the probability that a particular examinee with latent ability score θ_i correctly answers item j is modeled as

Pr(Y_{ij} = 1 | \theta_i) = F(a_j \theta_i - b_j). \quad (1.3.3)

Again, when the link function F is taken to be the cumulative distribution function of the standard normal distribution, this model is known as the Two-parameter probit model; when F is taken as the logistic cumulative distribution function, the model is called the Two-parameter logit model.

To see the effect of the discrimination parameter a on the item response curve, consider the three different plots shown in Figure 1.3 with varying discrimination values. Note that a serves as a scale parameter that represents the slope of the item response curve. It indicates how well a particular item discriminates between students with different abilities. Take for example two examinees, one with ability score 0 and another with ability score 1. If an item has a discrimination parameter value of 0.5, then the difference in the probabilities of getting the correct answer to this item by these two examinees will be about 0.19 (see Figure 1.4). On the other hand, if an item has a discrimination parameter value of 2, then this difference in probabilities will be about 0.48.
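These two probability differences are easy to verify numerically. The short sketch below is mine (the dissertation's own programs are in Matlab); it evaluates a two-parameter probit IRC, Pr(y = 1) = Φ(aθ − b), at ability scores 0 and 1 for a low- and a high-discrimination item with b = 0:

```python
from math import erf, sqrt

def probit(x):
    # Standard normal cdf, written via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def irc(theta, a, b):
    # Two-parameter probit item response curve: Pr(y = 1) = Phi(a*theta - b).
    return probit(a * theta - b)

for a in (0.5, 2.0):
    diff = irc(1.0, a, 0.0) - irc(0.0, a, 0.0)
    print(f"a = {a}: difference = {diff:.2f}")  # about 0.19 for a = 0.5, 0.48 for a = 2
```

The steeper curve (larger a) separates the two examinees more sharply, which is exactly what "discrimination" means here.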
Figure 1.3: Item response curves for 3 different discrimination values.

Figure 1.4: Items with high discrimination power have higher chances of distinguishing two examinees with different ability scores than items with low discrimination power.
Hence, the item with the higher discrimination parameter value has a better chance of distinguishing the examinee with the higher ability score.

Three-Parameter Model

Sometimes, especially on multiple-choice items, examinees can get the correct answer purely by guessing. To include this guessing behavior in the model, one could model the success probability as

Pr(y_{ij} = 1 | \theta_i) = c_j + (1 - c_j) F(a_j \theta_i - b_j), \quad (1.3.4)

where c_j represents the probability that any examinee will get item j correct by pure guessing. This model is known as the Three-parameter probit model when the standard normal cumulative distribution is used as the link function; when the logistic link function is used, it is called the Three-parameter logit model. The latter model was introduced by Birnbaum.

Exchangeable IRT Model

The one-parameter IRT model assumes that all items in the exam have the same discrimination parameter (usually all equal to one), while the two-parameter IRT model assumes that each item can have a different discrimination parameter value. Some people think that the one-parameter model is too restrictive, while others think that the two-parameter model is over-parameterized. In the Bayesian framework, there is a way to compromise between these two models. This is achieved by considering an exchangeable IRT model in which the item discrimination parameter values are shrunk toward a common value. More details about this model will be discussed in Chapter 5, where it will be used extensively.

1.4 Parameter Estimation

There are two main methods of obtaining estimates for the parameters in the above-mentioned models: classical Joint Maximum Likelihood Estimation (JMLE) and Bayesian estimation. In either case, one has to work with the likelihood function. To facilitate the discussion, these two estimation methods are presented using only the two-parameter IRT model; both procedures can be easily modified to work for the other IRT models.

Likelihood Function

Let y_{i1}, y_{i2}, ..., y_{ik} denote the binary responses of the ith individual to k test items, and let a = (a_1, ..., a_k) and b = (b_1, ..., b_k) be the vectors of item discrimination and difficulty parameters, respectively. Assuming that an individual taking the test answers each item independently (the local independence assumption), the probability of observing the entire sequence of responses of the ith individual is given by

Pr(Y_{i1} = y_{i1}, ..., Y_{ik} = y_{ik} | \theta_i, a, b) = \prod_{j=1}^{k} Pr(Y_{ij} = y_{ij} | \theta_i, a, b)
  = \prod_{j=1}^{k} F(a_j \theta_i - b_j)^{y_{ij}} [1 - F(a_j \theta_i - b_j)]^{1 - y_{ij}}.
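Under local independence, the probability of an examinee's whole response vector is just the product of the per-item terms above. A minimal sketch of this computation (mine, not from the dissertation; the logistic link is used here):

```python
from math import exp

def logistic(x):
    # Standard logistic cdf: F(x) = e^x / (1 + e^x).
    return 1.0 / (1.0 + exp(-x))

def response_prob(y, theta, a, b):
    # Pr(Y_i1 = y_1, ..., Y_ik = y_k | theta, a, b)
    #   = prod_j F(a_j*theta - b_j)^y_j * [1 - F(a_j*theta - b_j)]^(1 - y_j)
    prob = 1.0
    for y_j, a_j, b_j in zip(y, a, b):
        p_j = logistic(a_j * theta - b_j)
        prob *= p_j if y_j == 1 else (1.0 - p_j)
    return prob

# Three items with a = 1, b = 0 and an examinee with theta = 0: each item
# has p = 0.5, so any particular response pattern has probability 0.5^3 = 0.125.
print(response_prob([1, 0, 1], 0.0, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```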
Finally, if the responses of each of the n individuals to the test items are assumed to be independent, then the likelihood function for all responses of all individuals is

L(\theta, a, b) = \prod_{i=1}^{n} \prod_{j=1}^{k} F(a_j \theta_i - b_j)^{y_{ij}} [1 - F(a_j \theta_i - b_j)]^{1 - y_{ij}}. \quad (1.4.1)

This function represents the likelihood of obtaining the observed data as a function of the model parameters. Therefore, it is logical to estimate these model parameters by the values that maximize this likelihood function. This is what Maximum Likelihood Estimation (MLE), or Joint Maximum Likelihood Estimation (JMLE), is all about.

Joint Maximum Likelihood Estimation

One of the most common ways of maximizing a likelihood function is to take its partial derivatives with respect to each parameter in the model and set them to zero. Because likelihood functions are most often expressed as products of several density functions, it is usually more convenient to maximize the natural logarithm of the likelihood, ln(L). Since logarithmic functions are increasing on R, the maximum of the likelihood function occurs at the same point as the maximum of the log-likelihood. In the case of the two-parameter IRT model, the log-likelihood is

\ln L = \sum_{i=1}^{n} \sum_{j=1}^{k} \{ y_{ij} \ln(p_{ij}) + (1 - y_{ij}) \ln(1 - p_{ij}) \}, \quad (1.4.2)

where p_{ij} = F(a_j \theta_i - b_j). Taking the partial derivatives with respect to each parameter and setting them to zero yields a system of n + 2k equations in the same number of unknowns. The solutions of this system of equations are the potential maximum likelihood estimates of the model parameters. For this reason, people working with IRT models preferred to use the logistic link, because it simplifies the derivative expressions nicely and greatly facilitates the required calculations.

For the two-parameter logistic IRT model, where p_{ij} = \frac{e^{a_j \theta_i - b_j}}{1 + e^{a_j \theta_i - b_j}}, the first partial derivatives are given by

\partial p_{ij} / \partial a_j = p_{ij} q_{ij} \theta_i, \quad \partial p_{ij} / \partial b_j = -p_{ij} q_{ij}, \quad \partial p_{ij} / \partial \theta_i = p_{ij} q_{ij} a_j,

where q_{ij} = 1 - p_{ij}. Using these partial derivatives, the first and second partial derivatives of the log-likelihood (1.4.2) under the logistic link can be obtained easily; they are summarized in Table 1.1.

  \partial \ln L / \partial a_j            = \sum_{i=1}^{n} \theta_i (y_{ij} - p_{ij})
  \partial \ln L / \partial b_j            = -\sum_{i=1}^{n} (y_{ij} - p_{ij})
  \partial \ln L / \partial \theta_i       = \sum_{j=1}^{k} a_j (y_{ij} - p_{ij})
  \partial^2 \ln L / \partial a_j^2        = -\sum_{i=1}^{n} p_{ij} q_{ij} \theta_i^2
  \partial^2 \ln L / \partial b_j \partial a_j = \sum_{i=1}^{n} p_{ij} q_{ij} \theta_i
  \partial^2 \ln L / \partial b_j^2        = -\sum_{i=1}^{n} p_{ij} q_{ij}

Table 1.1: First and Second Derivatives of Item and Ability Parameters for the Two-Parameter Logistic Model.

However, even with the logistic link, the resulting equations are not linear. Thus, to get the maximum likelihood estimates, one needs to solve these equations numerically. Two
popular numerical methods used for this purpose are the Newton-Raphson algorithm and Fisher's Method of Scoring (see Appendix). Using the mathematical software Matlab, the author has written programs implementing the Newton-Raphson algorithm to estimate the parameters of the two-parameter logistic model. This program is described in full in the Appendix under the name pl2 mle.

To show how close the JMLE estimates are to the actual parameters, a simple simulation was performed in which a data set of 0s and 1s was generated using 1000 simulated ability scores and 35 test items, each with 2 parameters. The 1000 simulated ability scores and the 35 item difficulty parameter values were generated from N(0, 1), while the 35 item discrimination parameters were randomly selected from the possible values {0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6}. Once the parameter values were specified, the probability that a given simulated student answers a particular item correctly was computed using the logistic link, producing a 1000 × 35 matrix of probabilities. Finally, this matrix was converted into a matrix of 0s and 1s to simulate a particular exam result.

Using the 1000 × 35 data matrix of simulated responses, the JMLE estimates were obtained using the program pl2 mle. The two scatterplots shown in Figure 1.5 display the relationship between the actual item parameters and their JMLE estimates. The left plot of Figure 1.5 shows a linear pattern of dots lying very close to the line y = x, which illustrates the accuracy of the estimates of the 35 difficulty parameters. The right plot of Figure 1.5 also shows a linear trend, but the dots in this plot are more scattered, revealing lower precision for the estimates of the discrimination parameters of the 35 items. Also, notice that the linear pattern lies slightly above the line y = x, suggesting a positive bias in the estimation of the discrimination parameters by the JMLE. This positive bias was previously noted by Lord
(1983).

Figure 1.5: Scatterplots of 35 actual item parameters versus their corresponding estimates (left: item difficulty, classical approach, r = 0.9965; right: item discrimination, classical approach, r = 0.981).

Figure 1.6: Scatterplot of 1000 actual ability scores versus their corresponding estimates (classical approach, r = 0.8964).

The scatterplot of the actual ability scores versus their estimated values is given in Figure 1.6. Again, notice the linear pattern of dots clustering around the line y = x. The variability of the estimates around this line depends on the number of items in the exam: if there were more items in the exam, these dots would lie closer to the line y = x.
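The data-generating step of this simulation can be sketched as follows. This is an illustrative Python sketch, not the author's Matlab program pl2 mle; the sizes (1000 examinees, 35 items) and parameter distributions are taken from the description above.

```python
# Sketch of the simulation design described above: abilities and
# difficulties drawn from N(0,1), discriminations from a fixed grid,
# and 0/1 responses generated through the logistic link.
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 35

theta = rng.standard_normal(n)      # simulated ability scores ~ N(0,1)
b = rng.standard_normal(k)          # item difficulty parameters ~ N(0,1)
a = rng.choice([0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6], size=k)

eta = np.outer(theta, a) - b        # n x k matrix of a_j*theta_i - b_j
p = 1.0 / (1.0 + np.exp(-eta))      # logistic link probabilities
y = (rng.random((n, k)) < p).astype(int)   # simulated 0/1 response matrix
```

Running a joint Newton-Raphson routine on y and plotting the estimates against theta, a, and b yields scatterplots of the kind shown in Figures 1.5 and 1.6.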
This plot also reveals the increased variability of the estimates at extreme ability scores.

Bayesian Estimation

In the classical (or frequentist) framework, the parameters in a model are considered fixed quantities. In the Bayesian framework, these parameters, ξ = (ξ_1, ..., ξ_N), are instead considered random variables that follow a certain distribution, π(ξ). Bayesian methodology requires the specification of a prior distribution, π_0(ξ), for each parameter ξ in the model; this represents the prior belief regarding the parameters. After observing the data through the likelihood function, L(data; ξ), the belief about the parameters is modified (or updated) by computing their posterior distributions, π(ξ | data). This is done with the Bayes Rule formula:

\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)} \propto P(B \mid A)\,P(A),
\]

or, in our terms,

\[
\pi(\xi \mid \text{data}) \propto L(\text{data}; \xi)\,\pi_0(\xi).
\]

Once the posterior distributions of the parameters are obtained, all inferences pertaining to these parameters are based on their respective posterior distributions.

For the Bayesian method of estimation, the probit link is more natural and easier to implement, as will be seen later. For this reason, the Bayesian estimation method is discussed using the two-parameter probit model. Using Bayes rule, the joint posterior distribution is proportional to the product of the likelihood function obtained earlier and the joint prior density of the parameters. That is,

\[
\pi(\theta, a, b \mid \text{data}) \propto \prod_{i=1}^{n}\prod_{j=1}^{k} \Phi(a_j\theta_i - b_j)^{y_{ij}}\,\bigl[1 - \Phi(a_j\theta_i - b_j)\bigr]^{1-y_{ij}} \; \pi_0(\theta, a, b), \qquad (1.4.3)
\]
where Φ is the standard normal cumulative distribution function and π_0(θ, a, b) is the joint prior density of the parameters in the model. It is standard practice to use values of θ_i mostly between −3 and 3; for this reason, a N(0, 1) prior is assigned to θ_i, i = 1, ..., n. This also resolves the nonidentifiability of the parameters in the model. To avoid the problem of unbounded estimates of the item difficulty parameters, b_j is assigned a N(0, s_b) prior, j = 1, ..., k, where s_b < 5. Finally, a N(0, s_a) prior is assigned to a_j, j = 1, ..., k, where s_a is fixed. For simplicity, s_b and s_a were both set to 1 in the actual computations. Combining these prior densities with the likelihood function, the posterior density of the two-parameter IRT model is proportional to

\[
\pi(\theta, a, b \mid \text{data}) \propto L(\theta, a, b) \prod_{i=1}^{n} \phi(\theta_i; 0, 1) \prod_{j=1}^{k} \phi(b_j; 0, s_b)\,\phi(a_j; 0, s_a). \qquad (1.4.4)
\]

As mentioned before, all Bayesian inferences about a parameter are based on its posterior distribution. Consequently, Bayesian analysis requires studying the important parameters through this joint posterior distribution or their corresponding marginal posterior distributions. However, it is quite difficult to study this complicated posterior distribution, or to derive the marginal posterior distributions, analytically. An alternative is to simulate values of the parameters from the joint posterior distribution; inferences about a parameter can then be made using this sample. For example, one could take the average of the sample to serve as an estimate of the posterior mean of the parameter, or construct an approximate 95% probability interval for the parameter. However, drawing a sample from a high-dimensional posterior distribution is not an easy task. Fortunately, there is Gibbs Sampling (Geman and Geman, 1984).
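Although the posterior (1.4.4) cannot be studied analytically, it can at least be evaluated pointwise. The sketch below, an illustration in Python rather than anything from the author's Matlab code, evaluates the log of the unnormalized posterior under the probit link with s_a = s_b = 1 as in the text.

```python
# Log of the unnormalized posterior (1.4.4): probit likelihood plus
# N(0,1) priors on the thetas and N(0, s) priors on the a's and b's.
import numpy as np
from scipy.stats import norm

def log_posterior(theta, a, b, y, s_a=1.0, s_b=1.0):
    eta = np.outer(theta, a) - b                   # a_j*theta_i - b_j
    p = np.clip(norm.cdf(eta), 1e-12, 1 - 1e-12)   # Phi(eta), guarded from 0/1
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    logprior = (norm.logpdf(theta).sum()
                + norm.logpdf(b, scale=s_b).sum()
                + norm.logpdf(a, scale=s_a).sum())
    return loglik + logprior
```

Such a function is also the building block for more general MCMC schemes (e.g. Metropolis-Hastings), which only require the unnormalized log-posterior.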
Gibbs sampling, as discussed in Gelfand and Smith (1990), is a special type of Markov chain Monte Carlo (MCMC) that makes use of the full conditional distributions of sets of parameters. The idea is that, to simulate from f(x, y, z) (the joint distribution of X, Y, and Z), one iteratively draws from the full conditional distributions. That is, from initial values x_0, y_0, and z_0, one draws x_1 from g(x | y_0, z_0), then y_1 from g(y | x_1, z_0), and then z_1 from g(z | x_1, y_1). This constitutes a single iteration of the Gibbs sampler. To simulate m points from f(x, y, z), simply repeat this cycle m + l times, where l is the number of cycles it takes to converge to the desired distribution (also called the burn-in period). The points from the last m cycles can be considered a (dependent) sample drawn from the joint distribution f(x, y, z). Some authors instead repeat the cycle km + l times and keep every kth point among the last km points, in order to reduce the dependence among the sample points.

However, Gibbs sampling assumes that it is possible to simulate from the full conditional distributions. If each full conditional turns out to be a common standard distribution that can be simulated from directly or easily, there is no problem. But if some of these distributions are nonstandard density functions, one may need to employ a more general MCMC algorithm, like the Metropolis-Hastings (MH) algorithm, to obtain a sample from them (see the Appendix for details on the MH algorithm). Sometimes even finding the full conditional distributions from a joint distribution is a problem; for complicated distributions, like our joint posterior given in equation (1.4.4), it can be very challenging.
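The cycle just described can be illustrated on a toy target where the full conditionals are known in closed form: for a standard bivariate normal with correlation ρ, x | y ~ N(ρy, 1 − ρ²) and y | x ~ N(ρx, 1 − ρ²). The Python sketch below implements the burn-in and thinning scheme described above on this toy example; it is purely illustrative and unrelated to the IRT posterior.

```python
# Toy Gibbs sampler with burn-in and thinning, for a bivariate normal
# with correlation rho; both full conditionals are univariate normals.
import numpy as np

def gibbs_bvn(rho, m, burn_in, thin, seed=0):
    rng = np.random.default_rng(seed)
    s = np.sqrt(1.0 - rho**2)
    x = y = 0.0                        # initial values x_0, y_0
    draws = []
    for t in range(burn_in + m * thin):
        x = rng.normal(rho * y, s)     # x | y ~ N(rho*y, 1 - rho^2)
        y = rng.normal(rho * x, s)     # y | x ~ N(rho*x, 1 - rho^2)
        if t >= burn_in and (t - burn_in) % thin == 0:
            draws.append((x, y))       # keep every thin-th post-burn-in point
    return np.array(draws)

sample = gibbs_bvn(rho=0.8, m=2000, burn_in=500, thin=5)
```

With enough retained draws, sample means, variances, and the sample correlation all approach the true values of the target distribution, which is how one checks a Gibbs implementation before trusting it on a model like (1.4.4).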
Albert's Gibbs Sampler

To facilitate the computation of these full conditional distributions, Albert (1992) introduced a latent variable Z_ij that has a normal distribution with mean m_ij = a_jθ_i − b_j and variance 1. This continuous variable serves as the underlying mechanism that generates the responses: the response is positive (y_ij = 1) when Z_ij > 0 and negative (y_ij = 0) when Z_ij < 0. This ingenious idea greatly simplifies the simulation of samples from the conditional posterior distributions, as they turn out to be just variations of the normal distribution. With the introduction of these continuous latent data Z = (Z_11, ..., Z_nk), the joint posterior density of all model parameters is given by

\[
\pi(\theta, Z, a, b \mid \text{data}) \propto \prod_{i=1}^{n}\prod_{j=1}^{k} \bigl[\phi(Z_{ij}; m_{ij}, 1)\, I(Z_{ij}, y_{ij})\bigr] \prod_{i=1}^{n} \phi(\theta_i; 0, 1) \prod_{j=1}^{k} \phi(b_j; 0, s_b)\,\phi(a_j; 0, s_a), \qquad (1.4.5)
\]

where I(z, y) is equal to 1 when {z > 0, y = 1} or {z < 0, y = 0}, and equal to 0 otherwise. To simulate from the joint posterior (1.4.5), the Gibbs sampling procedure can iteratively draw from three sets of conditional probability distributions: g(Z | θ, (a, b), data), g(θ | Z, a, b, data), and g((a, b) | Z, θ, data).

The conditional posterior distribution of Z_ij given (θ_i, a_j, b_j, data) is simply a truncated normal distribution with mean m_ij = a_jθ_i − b_j and variance 1. The truncation is from the left of 0 if the corresponding response is correct (y_ij = 1), and from the right of 0 if it is incorrect (y_ij = 0). The conditional posterior distribution of θ_i given (Z_ij, a_j, b_j, data) is a normal distribution with mean and variance

\[
m_{\theta_i} = \frac{\sum_{j=1}^{k} a_j (Z_{ij} + b_j)}{\sum_{j=1}^{k} a_j^2 + 1}
\qquad\text{and}\qquad
\nu_{\theta_i} = \frac{1}{\sum_{j=1}^{k} a_j^2 + 1}.
\]
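A minimal Python sketch of the first two conditional draws in one Gibbs cycle follows, assuming the truncated-normal and normal forms stated above; the (a_j, b_j) update is omitted, and this is not the author's Matlab implementation.

```python
# One Z-draw and one theta-draw of Albert's sampler. Z_ij is truncated
# normal around m_ij = a_j*theta_i - b_j (left-truncated at 0 if y_ij = 1,
# right-truncated at 0 if y_ij = 0); theta_i | Z is normal with the mean
# and variance given in the text.
import numpy as np
from scipy.stats import truncnorm

def draw_Z(theta, a, b, y, rng):
    m = np.outer(theta, a) - b
    # truncnorm uses bounds on the standard scale, so shift them by -m
    lo = np.where(y == 1, 0.0, -np.inf) - m
    hi = np.where(y == 1, np.inf, 0.0) - m
    return m + truncnorm.rvs(lo, hi, random_state=rng)

def draw_theta(Z, a, b, rng):
    prec = np.sum(a**2) + 1.0                # posterior precision sum a_j^2 + 1
    mean = (Z + b) @ a / prec                # sum_j a_j (Z_ij + b_j) / prec
    return rng.normal(mean, np.sqrt(1.0 / prec))
```

Cycling these draws together with the (a_j, b_j) draw described next yields the full Gibbs sampler for the two-parameter probit model.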
Finally, the conditional posterior distribution of (a_j, b_j) given (Z_ij, θ_i, data) is the multivariate normal distribution with mean

\[
M_j = (X'X + \Sigma^{-1})^{-1}\,(X'Z_j + \Sigma^{-1}\mu)
\]

and covariance matrix

\[
\nu_j = (X'X + \Sigma^{-1})^{-1},
\]

where μ is the vector of prior means of (a_j, b_j) (here 0), Σ = diag(S_a², S_b²) is the prior covariance matrix, Z_j = (Z_1j, ..., Z_nj)', and X is the known covariate matrix with ith row (θ_i, −1). For more details on these conditional posterior distributions, see Albert and Johnson (1999).

To implement Albert's Gibbs sampler on the two-parameter probit IRT model with a burn-in period, the author modified Albert's Matlab program to obtain the program pp2 bay (see Appendix). To see how close the estimates are to the actual parameters, the same simulated parameter values that were used in the previous section were used to generate a data set of 0s and 1s using the probit link. Using the generated 1000 × 35 data matrix of responses, the Bayesian estimates were obtained using the program pp2 bay.

The two scatterplots shown in Figure 1.7 display the relationship between the actual item parameters and their Bayesian estimates. The left plot in Figure 1.7 shows a linear pattern of dots that resembles very closely the corresponding plot obtained earlier using the classical approach, shown in Figure 1.5. The correlation coefficient between the actual difficulty values and their Bayesian estimates was 0.9948, which indicates the accuracy of the Bayesian estimates. The plot on the right of Figure 1.7 also shows a linear pattern of dots that looks similar to the one obtained using the JMLE method, except that the Bayesian item discrimination estimates are better, since they are centered around values close to
Figure 1.7: Scatterplots of 35 actual item parameters versus their corresponding Bayesian estimates (left: item difficulty, r = 0.9948; right: item discrimination, r = 0.9794).

Figure 1.8: Scatterplot of 1000 actual ability scores versus their corresponding Bayesian estimates (r = 0.9557).
the actual discrimination values. Figure 1.8 shows a very nice linear pattern around the line y = x, which illustrates the accuracy of the Bayesian estimates of the ability scores. In addition, the problem of higher variability at extreme values that was observed earlier, when the JMLE method was used, no longer exists.

1.5 An Example - BGSU Mathematics Placement Exam

Every year, the Mathematics and Statistics Department of BGSU administers a placement exam to determine the proficiency of the incoming freshman students. In 2004, there were three different exams (A, B, and C) given to a total of 557 students. Exam A was composed of 35 questions and was given to a total of 1286 students. Data set A contains the results of these 1286 students on each of the 35 items. It is a table of 0s and 1s with a_ij = 1 when the ith student answered the jth item correctly and a_ij = 0 otherwise. To illustrate the kind of results that one gets from the estimation procedures discussed in the previous two sections, those methods were applied to the responses of the 1286 students who took Exam A.

A. Using Joint Maximum Likelihood Estimation

Before the JMLE procedure could be applied to the data set for Exam A, students who got either a score of zero or a perfect score, as well as items that were answered correctly or incorrectly by all students, had to be removed to avoid getting unreasonable results (this issue is discussed in the last section of this chapter). After checking, 11 students (2 zero scores and 9 perfect scores) were removed from the data set. The JMLE procedure was then applied to this slightly smaller data set and the resulting item parameter estimates
More informationSummer School in Applied Psychometric Principles. Peterhouse College 13 th to 17 th September 2010
Summer School in Applied Psychometric Principles Peterhouse College 13 th to 17 th September 2010 1 Two- and three-parameter IRT models. Introducing models for polytomous data. Test information in IRT
More informationOn the Use of Nonparametric ICC Estimation Techniques For Checking Parametric Model Fit
On the Use of Nonparametric ICC Estimation Techniques For Checking Parametric Model Fit March 27, 2004 Young-Sun Lee Teachers College, Columbia University James A.Wollack University of Wisconsin Madison
More informationBayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference
1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE
More informationProbability and Estimation. Alan Moses
Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.
More informationComparison between conditional and marginal maximum likelihood for a class of item response models
(1/24) Comparison between conditional and marginal maximum likelihood for a class of item response models Francesco Bartolucci, University of Perugia (IT) Silvia Bacci, University of Perugia (IT) Claudia
More informationMarkov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017
Markov Chain Monte Carlo (MCMC) and Model Evaluation August 15, 2017 Frequentist Linking Frequentist and Bayesian Statistics How can we estimate model parameters and what does it imply? Want to find the
More information27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling
10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel
More informationBayesian Multivariate Logistic Regression
Bayesian Multivariate Logistic Regression Sean M. O Brien and David B. Dunson Biostatistics Branch National Institute of Environmental Health Sciences Research Triangle Park, NC 1 Goals Brief review of
More informationBayesian Linear Regression
Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective
More informationUCLA Department of Statistics Papers
UCLA Department of Statistics Papers Title Can Interval-level Scores be Obtained from Binary Responses? Permalink https://escholarship.org/uc/item/6vg0z0m0 Author Peter M. Bentler Publication Date 2011-10-25
More informationExploring Monte Carlo Methods
Exploring Monte Carlo Methods William L Dunn J. Kenneth Shultis AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO SINGAPORE SYDNEY TOKYO ELSEVIER Academic Press Is an imprint
More informationA Marginal Maximum Likelihood Procedure for an IRT Model with Single-Peaked Response Functions
A Marginal Maximum Likelihood Procedure for an IRT Model with Single-Peaked Response Functions Cees A.W. Glas Oksana B. Korobko University of Twente, the Netherlands OMD Progress Report 07-01. Cees A.W.
More informationStudy Notes on the Latent Dirichlet Allocation
Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationLatent Trait Reliability
Latent Trait Reliability Lecture #7 ICPSR Item Response Theory Workshop Lecture #7: 1of 66 Lecture Overview Classical Notions of Reliability Reliability with IRT Item and Test Information Functions Concepts
More informationGibbs Sampling in Latent Variable Models #1
Gibbs Sampling in Latent Variable Models #1 Econ 690 Purdue University Outline 1 Data augmentation 2 Probit Model Probit Application A Panel Probit Panel Probit 3 The Tobit Model Example: Female Labor
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationMonte Carlo Simulations for Rasch Model Tests
Monte Carlo Simulations for Rasch Model Tests Patrick Mair Vienna University of Economics Thomas Ledl University of Vienna Abstract: Sources of deviation from model fit in Rasch models can be lack of unidimensionality,
More informationMotivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University
Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined
More informationBayesian Inference for DSGE Models. Lawrence J. Christiano
Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Preliminaries. Probabilities. Maximum Likelihood. Bayesian
More informationBAYESIAN IRT MODELS INCORPORATING GENERAL AND SPECIFIC ABILITIES
Behaviormetrika Vol.36, No., 2009, 27 48 BAYESIAN IRT MODELS INCORPORATING GENERAL AND SPECIFIC ABILITIES Yanyan Sheng and Christopher K. Wikle IRT-based models with a general ability and several specific
More informationDefault Priors and Effcient Posterior Computation in Bayesian
Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature
More informationUSING BAYESIAN TECHNIQUES WITH ITEM RESPONSE THEORY TO ANALYZE MATHEMATICS TESTS. by MARY MAXWELL
USING BAYESIAN TECHNIQUES WITH ITEM RESPONSE THEORY TO ANALYZE MATHEMATICS TESTS by MARY MAXWELL JIM GLEASON, COMMITTEE CHAIR STAVROS BELBAS ROBERT MOORE SARA TOMEK ZHIJIAN WU A DISSERTATION Submitted
More informationComputer Vision Group Prof. Daniel Cremers. 14. Sampling Methods
Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric
More informationDAG models and Markov Chain Monte Carlo methods a short overview
DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex
More informationInferences about Parameters of Trivariate Normal Distribution with Missing Data
Florida International University FIU Digital Commons FIU Electronic Theses and Dissertations University Graduate School 7-5-3 Inferences about Parameters of Trivariate Normal Distribution with Missing
More informationDoctor of Philosophy
MAINTAINING A COMMON ARBITRARY UNIT IN SOCIAL MEASUREMENT STEPHEN HUMPHRY 2005 Submitted in fulfillment of the requirements of the degree of Doctor of Philosophy School of Education, Murdoch University,
More informationLecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis
Lecture 3 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin
More information